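Failure Analysis Report: Abnormal Termination of an ADG Instance
Problem Handling
Problem Description
On 2017-05-25 a fault was reported on the xcrmdb1 ADG. SSC tracked and analyzed it in earlier emails and in the SR below. This report consolidates the findings on the incident and gives the root cause and recommendations.
Fault Analysis
1. The lmsc process of Instance2 received a UDP packet with a bad sequence number. This critical process exited, which forced the instance to terminate.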
>>> alert_crmdb12.log
Thu May 25 05:52:38 2017
Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:
Thu May 25 05:58:06 2017
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc (incident=384241):
ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []
    <<< lmsc received a bad message packet; the instance must now be restarted to protect DB integrity
Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc
Thu May 25 05:58:10 2017
Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu May 25 05:58:12 2017
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:
ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []
Thu May 25 05:58:12 2017
USER (ospid: 4967): terminating the instance due to error 484
    <<< the lms process failed, so the instance has to be terminated
Thu May 25 05:58:14 2017
License high water mark = 866
Thu May 25 05:58:17 2017
Instance terminated by USER, pid = 4967
Thu May 25 05:58:18 2017
USER (ospid: 17292): terminating the instance
Thu May 25 05:58:18 2017
Instance terminated by USER, pid = 17292
Thu May 25 05:59:11 2017
Starting ORACLE instance (normal) (OS id: 17791)
>>> alert_crmdb11.log
Thu May 25 05:58:14 2017
Reconfiguration started (old inc 6, new inc 8)
List of instances (total 1) :
 1
Dead instances (total 1) :
 2
My inst 1
    <<< Instance1 detects that Instance2 has exited and starts reconfiguration
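Timeline (instance labels follow the two alert logs above: the errors are in alert_crmdb12.log, the reconfiguration is in alert_crmdb11.log):
05:58:06 Instance2: [kjctr_pbmsg:badseq]
05:58:12 Instance2: [kjctr_pbmsg:badseq]
05:58:12 Instance2: terminating the instance
05:58:14 Instance1: Reconfiguration started
05:58:18 Instance2: terminating the instance
05:59:11 Instance2: Starting ORACLE instance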
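A timeline like the one above can be assembled by hand, but for longer logs it may help to script it. The following is a minimal sketch, assuming the alert log puts each timestamp on its own line (as in the excerpts when laid out line by line) followed by the messages that belong to it. The function names and the `KEYWORDS` filter are hypothetical choices made for illustration, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Assumption: a timestamp such as "Thu May 25 05:58:06 2017" sits on its own
# line, and the message lines that follow belong to that timestamp.
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
TS_FMT = "%a %b %d %H:%M:%S %Y"

# Hypothetical filter: substrings that mark the events of interest here.
KEYWORDS = (
    "ORA-00600",
    "terminating the instance",
    "Reconfiguration started",
    "Starting ORACLE instance",
)


def extract_events(label, text):
    """Return (timestamp, label, message) tuples for lines matching KEYWORDS."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            current = datetime.strptime(line, TS_FMT)
        elif current is not None and any(k in line for k in KEYWORDS):
            events.append((current, label, line))
    return events


def merge_timeline(*event_lists):
    """Merge per-instance event lists into one list ordered by timestamp."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event[0])


def format_timeline(timeline):
    """Render the merged events as 'HH:MM:SS label: message' lines."""
    return [f"{ts:%H:%M:%S} {label}: {msg}" for ts, label, msg in timeline]
```

Calling `merge_timeline(extract_events("crmdb12", log2), extract_events("crmdb11", log1))` on the text of the two alert logs would interleave the events of both instances by time, which makes the ordering of the abort and the reconfiguration easy to check.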
2. lmsc received a bad packet and aborted at kjctr_pbmsg:badseq
>>> crmdb12_lmsc_4967.trc
*** 2017-05-25 05:58:06.823
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
kjmsm: caught non-fatal error 600
lms abort after exception 600
    <<< lmsc hit exception 600 and aborted (error 484)
KJC Communication Dump:
-------------------8<-------------------
>>> crmdb12_lmsc_4967_i384241.trc
Error Stack: ORA-600[kjctr_pbmsg:badseq]
Main Stack:
  kjctr_pbmsg <- kjctr_watq <- kjctr_rksxp <- kjctrcv <- kjcsrmg <- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- main_opd_entry

----- Incident Context Dump -----
Address: 0x9fffffffffff59f0
Incident ID: 384241
Problem Key: ORA 600 [kjctr_pbmsg:badseq]
Error: ORA-600 [kjctr_pbmsg:badseq] [32] [0] [16777216] [] [] [] [] [] [] [] []
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kjctr_pbmsg []<-- Signaling
[09]: kjctr_watq []
[10]: kjctr_rksxp []
[11]: kjctrcv []
[12]: kjcsrmg []
[13]: kjmsm [RAC_MLMDS]
[14]: ksbrdp [background_proc]
[15]: opirip [OPI]
[16]: opidrv [OPI]
[17]: sou2o []
[18]: opimai_real [OPI]
[19]: ssthrdmain []
[20]: main []
[21]: main_opd_entry []
    <<< the failing function is kjctr_pbmsg
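When several trace files have to be compared, it can be convenient to pull the ORA-00600 arguments out of each line and group by the first argument, which is the one the incident dump uses as its Problem Key. The sketch below is only an illustration: `parse_ora600` is a hypothetical helper, and it does not try to interpret the numeric arguments.

```python
import re

# Matches the text after "arguments:" on an ORA-00600 line as shown in the
# alert log and trace excerpts.
ORA600_RE = re.compile(r"ORA-0*600: internal error code, arguments: (.*)")


def parse_ora600(line):
    """Return the non-empty leading ORA-00600 arguments, or None if absent."""
    match = ORA600_RE.search(line)
    if match is None:
        return None
    args = [arg.strip() for arg in re.findall(r"\[([^\]]*)\]", match.group(1))]
    # Trailing empty slots ("[], [], ...") carry no information; drop them.
    while args and args[-1] == "":
        args.pop()
    return args
```

For the line in crmdb12_lmsc_4967.trc this yields a list whose first element is `kjctr_pbmsg:badseq`, followed by the three numeric arguments as strings.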
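3. Bursts of fragmented traffic occurred in three windows, 05:16:43~05:21:45, 05:42:51~05:45:52 and 05:57:56~05:58:26. They caused fragments to be dropped after timeout, fragment queue overflow, and UDP checksum errors.
The last burst was the smallest and the shortest, but it happened to be the one in which lmsc received a bad packet and aborted, which finally brought down Instance2.
4. A closer look at the details: (in the second chart, overflow is red, checksum errors are green, and timeout drops are blue)
5. The relevant statistics are as follows: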
udp:
    20067 incomplete headers
    9124 bad checksums    <<< UDP checksum errors
ip:
    872460826 fragments received
    16823 fragments dropped (dup or out of space)    <<< fragment queue overflow
    3712 fragments dropped after timeout    <<< fragments dropped after timeout
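The counters above are cumulative, so the burst windows only become visible when consecutive snapshots are compared. The following is a minimal sketch of that comparison. It assumes the snapshots have already been split into `(label, text)` pairs in time order and that each one has `udp:` and `ip:` section headers with counter lines worded as in the excerpt. The function and counter names are hypothetical.

```python
import re

# (section, counter name) -> pattern for the counter line inside that section.
COUNTERS = {
    ("udp", "bad_checksums"): re.compile(r"(\d+) bad checksums$"),
    ("ip", "frag_received"): re.compile(r"(\d+) fragments received$"),
    ("ip", "frag_dropped_overflow"): re.compile(
        r"(\d+) fragments dropped \(dup or out of space\)$"
    ),
    ("ip", "frag_dropped_timeout"): re.compile(
        r"(\d+) fragments dropped after timeout$"
    ),
}

# Counters whose growth signals trouble (fragments received grows normally).
ERROR_COUNTERS = ("bad_checksums", "frag_dropped_overflow", "frag_dropped_timeout")


def parse_snapshot(text):
    """Extract the cumulative counters of interest from one netstat snapshot."""
    values = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(\w+):", line)
        if header:
            section = header.group(1).lower()
            continue
        for (sec, name), pattern in COUNTERS.items():
            match = pattern.match(line) if sec == section else None
            if match:
                values[name] = int(match.group(1))
    return values


def deltas(snapshots):
    """Return (label, {counter: increase since previous snapshot}) pairs."""
    result = []
    previous = None
    for label, text in snapshots:
        current = parse_snapshot(text)
        if previous is not None:
            diff = {k: current[k] - previous[k] for k in current if k in previous}
            result.append((label, diff))
        previous = current
    return result


def burst_intervals(snapshots):
    """Keep only the intervals in which any error counter increased."""
    return [
        (label, diff)
        for label, diff in deltas(snapshots)
        if any(diff.get(name, 0) > 0 for name in ERROR_COUNTERS)
    ]
```

Matching counters per section is a deliberate choice: similar wording such as "bad checksums" can appear under more than one protocol, and only the `udp:` and `ip:` counters are relevant here.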
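6. A burst of fragmented traffic overflowed the fragment queue, which in turn led to fragments being dropped after timeout and to UDP checksum errors. Under these conditions there is a certain probability that the DB process on top receives a bad packet, which triggers the instance failure. We recommend adjusting the following OS kernel parameters:
ip_fragment_timeout    set to 1s
ip_reass_mem_limit     set to 10M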
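Summary
Problem description:
The instance's lms process hit "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []" and aborted, which caused the instance to terminate.
Scope:
All multi-node RAC configurations. The fault mainly occurs in the lms processes and is not limited to a particular DB version.
Symptoms:
An lms process receives a bad message packet, and the instance has to be restarted.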
>>> alert_crmdb12.log
Thu May 25 05:52:38 2017
Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:
Thu May 25 05:58:06 2017
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc (incident=384241):
ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []
    <<< lmsc received a bad message packet; the instance must now be restarted to protect DB integrity
Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc
Thu May 25 05:58:10 2017
Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu May 25 05:58:12 2017
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:
ORA-00600: internal error code, arguments: [ kjctr_pbmsg:badseq ], [32], [0], [16777216], [], [], [], [], [], [], [], []
Thu May 25 05:58:12 2017
USER (ospid: 4967): terminating the instance due to error 484
    <<< the lms process failed, so the instance has to be terminated
Thu May 25 05:58:14 2017
License high water mark = 866
Thu May 25 05:58:17 2017
Instance terminated by USER, pid = 4967
Thu May 25 05:58:18 2017
USER (ospid: 17292): terminating the instance
Thu May 25 05:58:18 2017
Instance terminated by USER, pid = 17292
Thu May 25 05:59:11 2017
Starting ORACLE instance (normal) (OS id: 17791)
OSW netstat statistics
udp:
    20067 incomplete headers
    9124 bad checksums    <<< UDP checksum errors
ip:
    872460826 fragments received
    16823 fragments dropped (dup or out of space)    <<< fragment queue overflow
    3712 fragments dropped after timeout    <<< fragments dropped after timeout
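Root cause:
A burst of fragmented traffic overflowed the fragment queue, which in turn led to fragments being dropped after timeout and to UDP checksum errors. Under these conditions there is a certain probability that the DB process on top receives a bad packet, which triggers the instance failure.
Solution:
On HP OS, tune the kernel parameters that govern the fragment queue. These are usually the following two:
ip_fragment_timeout    set to 1s
ip_reass_mem_limit     set to 10M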
Source: "ITPUB Blog", link: https://blog.itpub.net/31547506/viewspace-3001401/. Please credit the source when reposting; otherwise legal liability will be pursued.