A GoldenGate troubleshooting case

Posted by skuary on 2012-04-27

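Problem description:

Our production gg went live last Wednesday evening, April 19, configured on node 3 of the RAC. When node 3 was rebooted, through an oversight only 24G of its 32G of memory was recognized after it came back up, and nobody noticed at the time. After a few days of running we found that once or twice a day the concurrency on node 3 shot up and the OS load average jumped from single digits to 50 or 60, leaving the database briefly hung. At exactly those moments the gg extract process was at the top of top, and the OS had only a few dozen KB of free memory left.

At first I suspected the NFS mounts, but testing turned up nothing wrong, so in the end we decided to fix the memory on node 3 as an emergency change. The details follow: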

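The working day ends at 6 pm, but between 6 and 9 pm the website and BOSS are still fairly busy, so nothing was touched during that window. At 9 pm I asked the ops staff to stop every tomcat on node 3; then I stopped gg, unmounted NFS, shut down all database processes on node 3, and finally powered the machine off. The gg commands: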

GGSCI (rac3) 21> stop mgr
GGSCI (rac3) 21> stop extract xxxx
GGSCI (rac3) 21> stop dpump xxxx

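While the processes were stopping, the errlog showed the following. Note that the order actually logged was stop extksr1, then stop dpksr1, then stop mgr: the extract first, the data pump (itself an extract group) second, and the manager last.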

2012-04-26 20:57:39  INFO    OGG-00497  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Writing DDL operation to extract trail file.
2012-04-26 21:01:36  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): stop extksr1.
2012-04-26 21:01:38  INFO    OGG-01021  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Command received from GGSCI: STOP.
2012-04-26 21:01:39  INFO    OGG-00991  Oracle GoldenGate Capture for Oracle, extksr1.prm:  EXTRACT EXTKSR1 stopped normally.
2012-04-26 21:01:41  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): stop dpksr1.
2012-04-26 21:01:43  INFO    OGG-01021  Oracle GoldenGate Capture for Oracle, dpksr1.prm:  Command received from GGSCI: STOP.
2012-04-26 21:01:43  INFO    OGG-00991  Oracle GoldenGate Capture for Oracle, dpksr1.prm:  EXTRACT DPKSR1 stopped normally.
2012-04-26 21:01:47  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): stop mgr.
2012-04-26 21:01:49  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host 10.1.8.49 (STOP).
2012-04-26 21:01:49  WARNING OGG-00938  Oracle GoldenGate Manager for Oracle, mgr.prm:  Manager is stopping at user request.
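
The stop order can also be read back out of the errlog rather than by eye. Below is a minimal sketch, assuming errlog lines keep the "timestamp, severity, OGG code, text" layout shown above; the function names are hypothetical helpers of my own, not part of GoldenGate.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: "YYYY-MM-DD HH:MM:SS  SEVERITY  OGG-nnnnn  text"
ERRLOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(OGG-\d+)\s+(.*)$"
)
# OGG-00987 entries carry the command typed at the GGSCI prompt.
GGSCI_CMD = re.compile(r"GGSCI command \([^)]*\):\s*(.*?)\.?$")


def parse_errlog_line(line):
    """Split one errlog line into (timestamp, severity, code, text), or None."""
    m = ERRLOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts, m.group(2), m.group(3), m.group(4)


def ggsci_commands(lines):
    """Return (timestamp, command) for every logged GGSCI command, in file order."""
    commands = []
    for line in lines:
        parsed = parse_errlog_line(line)
        if parsed is None or parsed[2] != "OGG-00987":
            continue
        m = GGSCI_CMD.search(parsed[3])
        if m:
            commands.append((parsed[0], m.group(1)))
    return commands
```

Fed the excerpt above, ggsci_commands yields the three stop commands with their timestamps, which makes it easy to confirm that the manager was stopped last.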

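Once those processes were stopped I unmounted NFS: the mounts from nodes 1 and 2, and the shared storage. The commands are simple and omitted here. One thing worth mentioning: unmounting the shared storage fails because the resource is busy, and adding the -l option to umount (a lazy unmount) gets around it. Also, once all gg processes on the source side are stopped, the gg processes on the target side still show as RUNNING, but the target's errlog reports that the extract has stopped: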

2012-04-26 20:54:38  INFO    OGG-00484  Oracle GoldenGate Delivery for Oracle, repksr1.prm:  Executing DDL operation.
2012-04-26 20:54:38  INFO    OGG-00483  Oracle GoldenGate Delivery for Oracle, repksr1.prm:  DDL operation successful.
2012-04-26 20:54:38  INFO    OGG-01408  Oracle GoldenGate Delivery for Oracle, repksr1.prm:  Restoring current schema for DDL operation to [OGG].
2012-04-26 20:58:41  INFO    OGG-01735  Oracle GoldenGate Collector:  Synchronizing /home/oracle/ggs/trails/t1000239 to disk.
2012-04-26 20:58:41  INFO    OGG-01670  Oracle GoldenGate Collector:  Closing /home/oracle/ggs/trails/t1000239.
2012-04-26 20:58:41  INFO    OGG-01675  Oracle GoldenGate Collector:  Terminating because extract is stopped.

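After that I stopped the database processes and services on node 3 (omitted here), powered the machine off, and notified the colleague standing by in the machine room, who then worked on the memory problem. About 30 minutes later the memory was fixed and the server was booted again, and I started on the follow-up work: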

First, start the portmap and nfs services on node 3 (omitted here).
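
Since the whole incident started with a reboot that silently lost memory, a check right after boot is cheap insurance. Here is a minimal sketch, assuming a Linux host where /proc/meminfo reports MemTotal in kB. The function names and the 10% tolerance are my own assumptions; MemTotal is always somewhat below the installed RAM because the kernel reserves part of it.

```python
import re


def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 * 1024)


def memory_looks_short(meminfo_text, expected_gib, tolerance=0.10):
    """True if recognized memory is more than `tolerance` below `expected_gib`.

    The tolerance is an assumed allowance for memory reserved by the kernel.
    """
    return mem_total_gib(meminfo_text) < expected_gib * (1 - tolerance)
```

On node 3 this would be called as memory_looks_short(open("/proc/meminfo").read(), expected_gib=32); with only 24G recognized it returns True.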

Then mount nodes 1 and 2 and the shared storage. Starting the mgr process after that fails with the following errors:

2012-04-26 21:50:18  ERROR   OGG-01117  Oracle GoldenGate Command Interpreter for Oracle:  Received signal: Program interrupt (2).
2012-04-26 21:50:18  ERROR   OGG-01668  Oracle GoldenGate Command Interpreter for Oracle:  PROCESS ABENDING.
2012-04-26 21:51:43  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): start mgr.
2012-04-26 21:52:13  ERROR   OGG-01454  Oracle GoldenGate Manager for Oracle, mgr.prm:  Unable to lock file "/share_disk/ggs/dirpcs/MGR.pcm" (error 37, No locks available).
2012-04-26 21:52:13  ERROR   OGG-01668  Oracle GoldenGate Manager for Oracle, mgr.prm:  PROCESS ABENDING.
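
Error 37 on Linux is ENOLCK, the errno behind "No locks available". Whether a given mount can grant locks at all can be probed before starting mgr. This is a minimal sketch, assuming that Manager takes a POSIX advisory (fcntl-style) lock on its .pcm file, which the error text suggests but which I have not verified; the function name and probe file name are hypothetical.

```python
import errno
import fcntl
import os


def can_lock(directory, name=".lock_probe"):
    """Try to take and release a POSIX advisory lock on a scratch file.

    Returns True if the lock was granted and False if the filesystem
    answered ENOLCK ("No locks available"). Other errors are re-raised.
    """
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOLCK:
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the scratch file, whatever the outcome.
        os.close(fd)
        os.unlink(path)
```

Pointing it at the directory from the log, can_lock("/share_disk/ggs/dirpcs"), should return False while the NFS lock service is down and True once locking works.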

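The OGG-01454 error above means, roughly, that the mgr process cannot obtain its lock on the shared storage, which blocks everything that follows. The fix is simple: start the nfslock service on node 3, then start mgr again. Once mgr was up, I found that the extract process had abended, and the errlog showed the following extract errors: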

2012-04-26 21:54:34  INFO    OGG-01026  Oracle GoldenGate Capture for Oracle, dpksr1.prm:  Rolling over remote file /home/oracle/ggs/trails/t1000240.
2012-04-26 21:54:34  INFO    OGG-01053  Oracle GoldenGate Capture for Oracle, dpksr1.prm:  Recovery completed for target file /home/oracle/ggs/trails/t1000240, at RBA 1022.
2012-04-26 21:54:34  INFO    OGG-01057  Oracle GoldenGate Capture for Oracle, dpksr1.prm:  Recovery completed for all targets.
2012-04-26 21:54:35  ERROR   OGG-00446  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Could not find archived log for sequence 16857 thread 3 under alternative destinations. SQL
2012-04-26 21:54:35  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, extksr1.prm:  PROCESS ABENDING.

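The cause is simple. When node 3 was shut down, its VIP failed over to another node, so the archived logs that belong to node 3 were written on the other nodes. When gg then tried to read node 3's archived logs, the ones it needed were not in the expected directory, and the extract abended. With the cause clear, the fix is easy: copy node 3's archived logs over from the other nodes, then start the extract process again: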

2012-04-26 21:57:22  INFO    OGG-00993  Oracle GoldenGate Capture for Oracle, extksr1.prm:  EXTRACT EXTKSR1 started.
2012-04-26 21:57:22  INFO    OGG-01055  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Recovery initialization completed for target file /share_disk/ggs/trails/s1000239, at RBA 24518902.
2012-04-26 21:57:22  INFO    OGG-01478  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Output file /share_disk/ggs/trails/s1 is using format RELEASE 10.4/11.1.
2012-04-26 21:57:23  INFO    OGG-01517  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Position of first record processed for Thread 1, Sequence 29645, RBA 18568720, SCN 18.122009990, Apr 26, 2012 9:01:24 PM.
2012-04-26 21:57:23  INFO    OGG-01517  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Position of first record processed for Thread 2, Sequence 28161, RBA 12794496, SCN 18.122010368, Apr 26, 2012 9:01:32 PM.
2012-04-26 21:57:24  INFO    OGG-01026  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Rolling over remote file /share_disk/ggs/trails/s1000239.
2012-04-26 21:57:24  INFO    OGG-01053  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Recovery completed for target file /share_disk/ggs/trails/s1000240, at RBA 1019.
2012-04-26 21:57:24  INFO    OGG-01057  Oracle GoldenGate Capture for Oracle, extksr1.prm:  Recovery completed for all targets.
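
The OGG-00446 abend shown earlier names exactly which archived log was missing (sequence 16857, thread 3), so the list of logs to copy can be pulled from the errlog instead of read by eye. This is a minimal sketch, assuming the message wording stays as in the excerpt; the function name is a hypothetical helper. It returns only thread and sequence numbers, because the archived log file names depend on the database's archive format setting, which this post does not show.

```python
import re

# Wording assumed from the OGG-00446 line in the excerpt above.
MISSING_LOG = re.compile(
    r"OGG-00446\s.*Could not find archived log for sequence (\d+) thread (\d+)"
)


def missing_archived_logs(errlog_lines):
    """Return sorted, de-duplicated (thread, sequence) pairs from OGG-00446 errors."""
    found = set()
    for line in errlog_lines:
        m = MISSING_LOG.search(line)
        if m:
            sequence, thread = int(m.group(1)), int(m.group(2))
            found.add((thread, sequence))
    return sorted(found)
```

Repeated abends on the same log collapse into a single entry, so restarting the extract several times before copying the files does not inflate the list.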

gg on the primary (source) database:

GGSCI (rac3) 20> info all

Program     Status      Group       Lag           Time Since Chkpt

MANAGER     RUNNING                                          
EXTRACT     RUNNING     DPKSR1      00:00:00      00:00:00   
EXTRACT     RUNNING     EXTKSR1     00:00:00      00:00:04   

gg on the standby (target) database:

GGSCI (rptdb) 7> info all

Program     Status      Group       Lag           Time Since Chkpt

MANAGER     RUNNING                                          
REPLICAT    RUNNING     REPKSR1     00:00:00      00:00:00

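I watched things for a while afterwards, and neither the main site nor gg showed any further problems. The whole operation took about an hour. We will keep monitoring for the next week.

Noting this down for the record.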

From "ITPUB Blog", link: http://blog.itpub.net/25618347/viewspace-722335/. Please credit the source when reposting; otherwise legal liability will be pursued.
