ADG Instance Abnormal Termination: Fault Analysis Report

Posted by jason_yehua on 2023-12-25

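Problem Handling

Problem Description

On 2017-05-25 a fault was reported on the xcrmdb1 ADG. SSC tracked and analyzed it in earlier emails and in the SR below; this report summarizes the incident and gives the root cause and recommendations.

Fault Analysis

1. The lmsc process on Instance2 received a UDP packet with a bad sequence number. Because lmsc is a critical background process, its exit forced the instance to terminate.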

>>>alert_crmdb12.log

Thu May 25 05:52:38 2017

Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:

Thu May 25  05:58:06  2017

Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc  (incident=384241):

ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []

<<< lmsc received a bad message packet; the instance must now be restarted to protect database integrity

Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc

Thu May 25 05:58:10 2017

Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Thu May 25 05:58:12 2017

Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:

ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []

Thu May 25 05:58:12  2017

USER (ospid: 4967):  terminating the instance  due to error 484

>>> The lms process failed, so the instance has to be terminated

Thu May 25 05:58:14 2017

License high water mark = 866

Thu May 25 05:58:17 2017

Instance terminated by USER, pid = 4967

Thu May 25 05:58:18 2017

USER (ospid: 17292): terminating the instance

Thu May 25 05:58:18 2017

Instance terminated by USER, pid = 17292

Thu May 25 05:59:11 2017

Starting ORACLE instance (normal) (OS id: 17791)

>>>alert_crmdb11.log

Thu May 25 05:58:14  2017

Reconfiguration started (old inc 6, new inc 8)

List of instances (total 1) :

1

Dead instances (total 1) :

2

My inst 1  

>>> Instance1 detects that Instance2 has exited and starts reconfiguration

 

05:58:06 Node2:[kjctr_pbmsg:badseq]

05:58:12 Node2:[kjctr_pbmsg:badseq]

05:58:12 Node2:terminating the instance

05:58:14 Node1:Reconfiguration started

05:58:18 Node2:terminating the instance

05:59:11 Node2:Starting ORACLE instance
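
A timeline like this one can be rebuilt mechanically from the two alert logs. The following is a minimal Python sketch, assuming the alert log layout seen in the excerpts (a timestamp line such as "Thu May 25 05:58:12 2017" followed by message lines) and an English locale for parsing day and month names. The function name build_timeline and the keyword list are illustrative choices, not part of any Oracle tool; the labels crmdb12 and crmdb11 are simply the instance names from the log file names.

```python
import re
from datetime import datetime

# Timestamp lines look like "Thu May 25 05:58:12 2017"; spacing may vary.
TS_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}$"
)

# Substrings worth keeping in the timeline (an assumption; adjust as needed).
KEYWORDS = (
    "kjctr_pbmsg:badseq",
    "terminating the instance",
    "Reconfiguration started",
    "Starting ORACLE instance",
)


def build_timeline(logs):
    """logs maps an instance label to alert log text.

    Returns a sorted list of (time, label, keyword) tuples.
    """
    events = []
    for label, text in logs.items():
        current = None
        for raw in text.splitlines():
            line = raw.strip()
            if not line:
                continue
            if TS_RE.match(line):
                # Collapse repeated spaces before parsing the timestamp.
                current = datetime.strptime(
                    " ".join(line.split()), "%a %b %d %H:%M:%S %Y"
                )
                continue
            if current is None:
                continue
            for kw in KEYWORDS:
                if kw in line:
                    events.append((current, label, kw))
                    break
    return sorted(events)


# Shortened excerpts of the two alert logs quoted above.
sample = {
    "crmdb12": """Thu May 25  05:58:06  2017
ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216]
Thu May 25 05:58:12  2017
USER (ospid: 4967):  terminating the instance  due to error 484
""",
    "crmdb11": """Thu May 25 05:58:14  2017
Reconfiguration started (old inc 6, new inc 8)
""",
}

for when, label, what in build_timeline(sample):
    print(when.strftime("%H:%M:%S"), label, what)
```

Sorting on the parsed timestamp rather than on log order is what lets events from both instances interleave correctly.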

 

2. lmsc aborted at kjctr_pbmsg:badseq after receiving a bad packet

>>>crmdb12_lmsc_4967.trc

*** 2017-05-25 05:58:06.823

ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []

 

kjmsm: caught non-fatal error 600

lms abort after exception 600

<<< lmsc hit exception 600 and terminated (error 484)

KJC Communication Dump:

-------------------8<-------------------

 

>>>crmdb12_lmsc_4967_i384241.trc

Error Stack: ORA-600[kjctr_pbmsg:badseq]

Main Stack: 

   kjctr_pbmsg  <- kjctr_watq <- kjctr_rksxp <- kjctrcv <- kjcsrmg <- kjmsm <- ksbrdp <- opirip

   <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- main_opd_entry

 

----- Incident Context Dump -----

Address: 0x9fffffffffff59f0

Incident ID: 384241

Problem Key: ORA 600 [kjctr_pbmsg:badseq]

Error: ORA-600 [kjctr_pbmsg:badseq] [32] [0] [16777216] [] [] [] [] [] [] [] []

[00]: dbgexProcessError [diag_dde]

[01]: dbgeExecuteForError [diag_dde]

[02]: dbgePostErrorKGE [diag_dde]

[03]: dbkePostKGE_kgsf [rdbms_dde]

[04]: kgeadse []

[05]: kgerinv_internal []

[06]: kgerinv []

[07]: kgeasnmierr []

[08]: kjctr_pbmsg []<-- Signaling

[09]: kjctr_watq []

[10]: kjctr_rksxp []

[11]: kjctrcv []

[12]: kjcsrmg []

[13]: kjmsm [RAC_MLMDS]

[14]: ksbrdp [background_proc]

[15]: opirip [OPI]

[16]: opidrv [OPI]

[17]: sou2o []

[18]: opimai_real [OPI]

[19]: ssthrdmain []

[20]: main []

[21]: main_opd_entry []

>>> The failing function is kjctr_pbmsg
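
When several incidents need to be compared, it can help to pull the ORA-00600 arguments out of the error line programmatically. The sketch below assumes the bracketed argument format shown in the trace above; parse_ora600 is a hypothetical helper name, and the numeric arguments are returned as plain strings because the source does not document their meaning.

```python
import re

ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: (.*)")


def parse_ora600(line):
    """Return the non-empty ORA-00600 arguments found in a log line, or None."""
    m = ORA600_RE.search(line)
    if m is None:
        return None
    # Each argument is wrapped in [...]; empty brackets are only padding.
    args = [a.strip() for a in re.findall(r"\[([^\]]*)\]", m.group(1))]
    return [a for a in args if a]


line = (
    "ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], "
    "[32], [0], [16777216], [], [], [], [], [], [], [], []"
)
print(parse_ora600(line))
```

The first returned element is the internal error name (here kjctr_pbmsg:badseq), which matches the signaling function in the stack and the Problem Key in the incident dump.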

 

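3. Three windows (05:16:43~05:21:45, 05:42:51~05:45:52, and 05:57:56~05:58:26) saw bursts of fragmented traffic, which caused fragments to be dropped after timeout, the fragment queue to overflow, and UDP checksum errors.

The last burst was the smallest and the shortest, but by chance it was the one in which lmsc received a bad packet and aborted, which ultimately terminated Instance2.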
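
Bursts of this kind show up as jumps between consecutive netstat snapshots, because the kernel counters are cumulative. Below is a minimal Python sketch, assuming the snapshots have already been reduced to (timestamp, cumulative counter) pairs. The function name find_bursts, the sample values, and the threshold are all made up for illustration and are not taken from the incident data.

```python
def find_bursts(samples, threshold):
    """Return (start, end, delta) for each pair of consecutive snapshots
    whose cumulative counter grew by more than `threshold`."""
    bursts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta > threshold:
            bursts.append((t0, t1, delta))
    return bursts


# Hypothetical readings of a cumulative drop counter (not incident data).
samples = [
    ("10:00:00", 100),
    ("10:00:30", 101),
    ("10:01:00", 212),
    ("10:01:30", 212),
]

for start, end, delta in find_bursts(samples, threshold=50):
    print(start, end, delta)
```

Working on deltas rather than raw values matters here: the absolute counter only ever grows, so a short burst is visible only as a steep step between two neighbouring snapshots.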

4. Zooming in on the details (in the second chart, overflow is red, checksum errors are green, and timeout drops are blue):

5. The relevant statistics are as follows:

udp:

20067 incomplete headers

9124 bad checksums    <<< UDP checksum errors

ip:

872460826 fragments received

16823 fragments dropped (dup or out of space)    <<< fragment queue overflow

3712 fragments dropped after timeout    <<< reassembly timeout drops
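
To put these counters in proportion, the netstat output can be parsed and the dropped fragments compared with the total received. This is a minimal sketch, assuming the "count description" line layout shown above with the annotations stripped; parse_netstat is a hypothetical helper, not an OSWatcher function.

```python
import re


def parse_netstat(text):
    """Parse 'netstat -s' style text into {protocol: {description: count}}."""
    stats, proto = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line such as "udp:" starts a new protocol section.
            proto = line[:-1]
            stats[proto] = {}
            continue
        m = re.match(r"(\d+)\s+(.+)", line)
        if m and proto is not None:
            stats[proto][m.group(2).strip()] = int(m.group(1))
    return stats


text = """
udp:
20067 incomplete headers
9124 bad checksums
ip:
872460826 fragments received
16823 fragments dropped (dup or out of space)
3712 fragments dropped after timeout
"""

ip = parse_netstat(text)["ip"]
dropped = (
    ip["fragments dropped (dup or out of space)"]
    + ip["fragments dropped after timeout"]
)
print(dropped, f"{dropped / ip['fragments received']:.6%}")
```

The two drop counters add up to 20535 fragments out of 872460826 received, a very small fraction overall, which fits the report's point that a bad packet reaches lmsc only with some probability during a burst.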

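6. A burst of fragmented traffic overflowed the fragment reassembly queue, which in turn led to fragments being dropped after timeout and to UDP checksum errors. Under these conditions there is some probability that a database process in the upper layer receives a bad packet and triggers an instance failure. We recommend adjusting the following OS kernel parameters:

ip_fragment_timeout: set to 1s

ip_reass_mem_limit: set to 10M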

 

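Problem Summary

Problem description:

An lms instance process aborted with "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []", which terminated the instance.

Scope:

All multi-node RAC configurations. The fault mainly occurs in lms processes and is not limited to a particular DB version.

Symptoms:

An lms process receives a bad message packet and the instance has to be restarted.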

>>>alert_crmdb12.log

Thu May 25 05:52:38 2017

Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:

Thu May 25  05:58:06  2017

Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc  (incident=384241):

ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []

<<< lmsc received a bad message packet; the instance must now be restarted to protect database integrity

Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc

Thu May 25 05:58:10 2017

Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Thu May 25 05:58:12 2017

Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:

ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []

Thu May 25 05:58:12  2017

USER (ospid: 4967):  terminating the instance  due to error 484

>>> The lms process failed, so the instance has to be terminated

Thu May 25 05:58:14 2017

License high water mark = 866

Thu May 25 05:58:17 2017

Instance terminated by USER, pid = 4967

Thu May 25 05:58:18 2017

USER (ospid: 17292): terminating the instance

Thu May 25 05:58:18 2017

Instance terminated by USER, pid = 17292

Thu May 25 05:59:11 2017

Starting ORACLE instance (normal) (OS id: 17791)

OSW netstat statistics

udp:

20067 incomplete headers

9124 bad checksums    <<< UDP checksum errors

ip:

872460826 fragments received

16823 fragments dropped (dup or out of space)    <<< fragment queue overflow

3712 fragments dropped after timeout    <<< reassembly timeout drops

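Root cause:

A burst of fragmented traffic overflowed the fragment reassembly queue, which in turn led to fragments being dropped after timeout and to UDP checksum errors. Under these conditions there is some probability that a database process in the upper layer receives a bad packet and triggers an instance failure.

Solution:

On HP OS, tune the kernel parameters that control the fragment reassembly queue. These are usually the following two:

ip_fragment_timeout: set to 1s

ip_reass_mem_limit: set to 10M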

 


Source: ITPUB Blog, link: https://blog.itpub.net/31547506/viewspace-3001401/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
