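A GoldenGate Troubleshooting Case
Problem description:
Our production GG (GoldenGate) went live last Wednesday evening, April 19, configured on node 3 of the RAC. When node 3 was restarted at that time, an oversight meant that only 24G of its original 32G of memory was recognized after boot, and nobody noticed. After a few days of running we suddenly found that once or twice a day the concurrency on node 3 spiked: the OS load average jumped from single digits to 50 or 60, and the database appeared to hang briefly. At exactly those moments the GG extract process was the top entry in top, and the OS showed only a few dozen KB of free memory. We first suspected the NFS mounts, but testing turned up nothing wrong there, so we decided to fix node 3's memory as an emergency change. The details are as follows:
After the workday ended at 6 pm, I did nothing between 6 and 9 pm, because the website and BOSS are still fairly busy in that window. At 9 pm I asked the operations staff to stop all Tomcat instances on node 3. I then stopped GG, unmounted NFS, shut down all database processes on node 3, and finally powered the machine off. The operations were as follows: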
GGSCI (rac3) 21> stop mgr
GGSCI (rac3) 21> stop extract xxxx
GGSCI (rac3) 21> stop dpump xxxx
While the processes were stopping, the errlog showed the following:
2012-04-26 20:57:39 INFO OGG-00497 Oracle GoldenGate Capture for Oracle, extksr1.prm: Writing DDL operation to extract trail file.
2012-04-26 21:01:36 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop extksr1.
2012-04-26 21:01:38 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, extksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:39 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 stopped normally.
2012-04-26 21:01:41 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop dpksr1.
2012-04-26 21:01:43 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:43 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, dpksr1.prm: EXTRACT DPKSR1 stopped normally.
2012-04-26 21:01:47 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop mgr.
2012-04-26 21:01:49 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from GGSCI on host 10.1.8.49 (STOP).
2012-04-26 21:01:49 WARNING OGG-00938 Oracle GoldenGate Manager for Oracle, mgr.prm: Manager is stopping at user request.
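Before unmounting anything, it helps to confirm that every Extract group really logged a normal stop. The following is only a minimal sketch of that check: it assumes the errlog lines have already been read into a list, and the function names `groups_stopped_normally` and `safe_to_unmount` are my own, not part of GoldenGate. It relies solely on the OGG-00991 message format shown in the excerpt above.

```python
import re

# Matches the OGG-00991 "stopped normally" message seen in the log excerpt above.
STOPPED_RE = re.compile(r"OGG-00991 .*?EXTRACT (\S+) stopped normally\.")


def groups_stopped_normally(log_lines):
    """Return the set of Extract group names that logged a normal stop."""
    stopped = set()
    for line in log_lines:
        match = STOPPED_RE.search(line)
        if match:
            stopped.add(match.group(1))
    return stopped


def safe_to_unmount(log_lines, expected_groups):
    """True only if every expected group has logged OGG-00991."""
    return set(expected_groups) <= groups_stopped_normally(log_lines)
```

Pass in only the lines written after the stop commands were issued, for example `safe_to_unmount(lines, ["EXTKSR1", "DPKSR1"])`; an older "stopped normally" entry from a previous restart would otherwise give a false positive.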
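Once the related processes had stopped, I unmounted NFS: the mounts of node 1, node 2, and the shared storage. I'll skip the exact commands since they are simple. One thing worth mentioning is that unmounting the shared storage can fail with a "resource busy" error, and adding the -l option to umount (lazy unmount) gets around it. Also, after the GG processes on the main site have all stopped, the GG target-side process still shows a RUNNING status, but its errlog reports that the extract has stopped: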
2012-04-26 20:54:38 INFO OGG-00484 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Executing DDL operation.
2012-04-26 20:54:38 INFO OGG-00483 Oracle GoldenGate Delivery for Oracle, repksr1.prm: DDL operation successful.
2012-04-26 20:54:38 INFO OGG-01408 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Restoring current schema for DDL operation to [OGG].
2012-04-26 20:58:41 INFO OGG-01735 Oracle GoldenGate Collector: Synchronizing /home/oracle/ggs/trails/t1000239 to disk.
2012-04-26 20:58:41 INFO OGG-01670 Oracle GoldenGate Collector: Closing /home/oracle/ggs/trails/t1000239.
2012-04-26 20:58:41 INFO OGG-01675 Oracle GoldenGate Collector: Terminating because extract is stopped.
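After the steps above, I stopped the database-related processes and services on node 3 (details skipped), shut the machine down, and notified the colleague standing by in the machine room, who then dealt with the memory problem. About 30 minutes later the memory problem was fixed and the server was back up, and I began the follow-up work.
The first step was to start the portmap and nfs services on node 3 (details skipped).
Next I mounted node 1, node 2, and the shared storage. Starting the mgr process after that failed with the following errors: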
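Since the whole incident started with memory that was silently not recognized after a reboot, this is a good point to verify it. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` is available. The function names and the 5% tolerance are my own assumptions (MemTotal is normally a little below the installed size because the kernel reserves some memory).

```python
def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memory_fully_recognized(meminfo_text, expected_gib, tolerance=0.05):
    """True if MemTotal is within `tolerance` of the expected installed size."""
    expected_kb = expected_gib * 1024 * 1024
    return mem_total_kb(meminfo_text) >= expected_kb * (1 - tolerance)
```

On the node itself this could be called as `memory_fully_recognized(open("/proc/meminfo").read(), 32)`. A machine that only sees 24G of an expected 32G fails the check by a wide margin.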
2012-04-26 21:50:18 ERROR OGG-01117 Oracle GoldenGate Command Interpreter for Oracle: Received signal: Program interrupt (2).
2012-04-26 21:50:18 ERROR OGG-01668 Oracle GoldenGate Command Interpreter for Oracle: PROCESS ABENDING.
2012-04-26 21:51:43 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): start mgr.
2012-04-26 21:52:13 ERROR OGG-01454 Oracle GoldenGate Manager for Oracle, mgr.prm: Unable to lock file "/share_disk/ggs/dirpcs/MGR.pcm" (error 37, No locks available).
2012-04-26 21:52:13 ERROR OGG-01668 Oracle GoldenGate Manager for Oracle, mgr.prm: PROCESS ABENDING.
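The OGG-01454 line above (highlighted in red in the original post) means roughly that mgr could not obtain a lock on a file on the shared storage, which blocks every subsequent operation. The fix is simple: start the nfslock service on node 3 and then start mgr again. Once mgr was up, I found that the extract process had abended, and the errlog showed the following extract errors: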
2012-04-26 21:54:34 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Rolling over remote file /home/oracle/ggs/trails/t1000240.
2012-04-26 21:54:34 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for target file /home/oracle/ggs/trails/t1000240, at RBA 1022.
2012-04-26 21:54:34 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for all targets.
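2012-04-26 21:54:35 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, extksr1.prm: Could not find archived log for sequence 16857 thread 3 under alternative destinations. SQL
2012-04-26 21:54:35 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, extksr1.prm: PROCESS ABENDING.
The cause is simple: when node 3 was shut down, its VIP failed over to another node, so the archived logs that originally belonged on node 3 were written on other nodes. When GG went to read node 3's archived logs, it could not find the required ones in the expected directory and abended. With the cause understood, the fix was easy: copy node 3's archived logs over from the other nodes and start the extract process again: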
2012-04-26 21:57:22 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 started.
2012-04-26 21:57:22 INFO OGG-01055 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery initialization completed for target file /share_disk/ggs/trails/s1000239, at RBA 24518902.
2012-04-26 21:57:22 INFO OGG-01478 Oracle GoldenGate Capture for Oracle, extksr1.prm: Output file /share_disk/ggs/trails/s1 is using format RELEASE 10.4/11.1.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 1, Sequence 29645, RBA 18568720, SCN 18.122009990, Apr 26, 2012 9:01:24 PM.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 2, Sequence 28161, RBA 12794496, SCN 18.122010368, Apr 26, 2012 9:01:32 PM.
2012-04-26 21:57:24 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, extksr1.prm: Rolling over remote file /share_disk/ggs/trails/s1000239.
2012-04-26 21:57:24 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for target file /share_disk/ggs/trails/s1000240, at RBA 1019.
2012-04-26 21:57:24 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for all targets.
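The two failures hit during the restart (OGG-01454 and OGG-00446) can be picked out of the errlog mechanically. The following is a rough sketch rather than an official mapping: the `REMEDIES` table only restates the fixes described in this post, and the `triage` function name and its output fields are my own. For OGG-00446 it also pulls out the sequence and thread numbers, which tell you which archived log to copy back.

```python
import re

# Timestamp, severity ERROR, message code, then the rest of the message.
ERROR_RE = re.compile(r"^(\S+ \S+) ERROR (OGG-\d+) (.*)$")
ARCHIVE_RE = re.compile(r"archived log for sequence (\d+) thread (\d+)")

# Remedies taken from this write-up only; not an exhaustive or official list.
REMEDIES = {
    "OGG-01454": "start the nfslock service on the node, then start mgr again",
    "OGG-00446": "copy the missing archived log(s) back, then restart the extract",
}


def triage(log_lines):
    """Return one finding per ERROR line whose code has a known remedy."""
    findings = []
    for line in log_lines:
        match = ERROR_RE.match(line)
        if not match:
            continue
        timestamp, code, message = match.groups()
        if code not in REMEDIES:
            continue
        finding = {"time": timestamp, "code": code, "remedy": REMEDIES[code]}
        archive = ARCHIVE_RE.search(message)
        if archive:
            finding["sequence"] = int(archive.group(1))
            finding["thread"] = int(archive.group(2))
        findings.append(finding)
    return findings
```

Generic follow-on errors such as OGG-01668 (PROCESS ABENDING) are deliberately skipped, since they only report that the process died, not why.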
GG source (primary database):
GGSCI (rac3) 20> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPKSR1      00:00:00      00:00:00
EXTRACT     RUNNING     EXTKSR1     00:00:00      00:00:04
GG target (standby database):
GGSCI (rptdb) 7> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     REPKSR1     00:00:00      00:00:00
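For the week of monitoring that follows, the `info all` output can be checked by a script instead of by eye. The following is a minimal sketch that assumes the output has been captured as text in the layout shown above. The function names and the 60-second lag threshold are my own assumptions, not GoldenGate defaults.

```python
def parse_info_all(output):
    """Parse GGSCI `info all` text into a list of dicts, one per process."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the prompt, the header row, and blank lines.
        if len(fields) < 2 or fields[0] not in ("MANAGER", "EXTRACT", "REPLICAT"):
            continue
        row = {"program": fields[0], "status": fields[1]}
        if len(fields) >= 5:
            row["group"], row["lag"], row["since_chkpt"] = fields[2:5]
        rows.append(row)
    return rows


def lag_seconds(hhmmss):
    """Convert an HH:MM:SS lag value to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def unhealthy(rows, max_lag_s=60):
    """Return rows that are not RUNNING or whose lag exceeds max_lag_s."""
    bad = []
    for row in rows:
        if row["status"] != "RUNNING":
            bad.append(row)
        elif "lag" in row and lag_seconds(row["lag"]) > max_lag_s:
            bad.append(row)
    return bad
```

An empty list from `unhealthy(parse_info_all(text))` corresponds to the healthy state shown in both listings above.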
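I then watched things for a while and found no further problems with either the main site or GG. The whole procedure took about an hour. I will keep monitoring for the following week.
Noting this down for the record.
From the ITPUB blog, link: http://blog.itpub.net/25618347/viewspace-722335/. Please credit the source when reposting; otherwise legal liability will be pursued.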