資料庫監聽夯故障分析
背景情況簡述
2020年 11 月 05 日 16:24:11 , N CRM2 2 C rash
故障分析
full picture
2020-11-05 15:05:00 節點 1 開始出現 IP fragments dropped
2020-11-05 16:17:09 udp overflow 29741 to 231111
2020-11-05 16:18:08 LCK1 (ospid: 11324) waits for event
~16:20:29 'libcache interrupt action by LCK' for 216 secs <<<<< 16:20:29 減去 216 秒為 16:16:53 約為網路 udp overflow 的時間點。
2020-11-05 16:22:39 節點 1 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 。
2020-11-05 16:22:57 節點 1 IPC Send timeout detected. Sender: ospid 16620 。
2020-11-05 16:24:10 節點 1 : Reconfiguration started (old inc 4, new inc 8) 。
2020-11-05 16:24:11 節點 2 : Instance terminated by LMSh, pid = 11314 。
2020-11-05 16:23:14 Initiating reconfig, reason=3 。
2020-11-05 16:24:39 Reconfiguration complete (total time 28.5 secs) .
2 、 osw 分析
N etstat 中在故障時間點存在嚴重的 UDP socket overflows :
詳細資料
……
6:14:08 udp: 29741 socket overflows
16:14:38 udp: 29741 socket overflows
16:15:08 udp: 29741 socket overflows
16:15:38 udp: 29741 socket overflows
16:16:08 udp: 29741 socket overflows
16:16:38 udp: 29741 socket overflows
16:17:09 udp: 231111 socket overflows
16:17:39 udp: 231114 socket overflows
16:18:09 udp: 231121 socket overflows
16:18:39 udp: 231121 socket overflows
16:19:09 udp: 231121 socket overflows
16:19:39 udp: 231121 socket overflows
16:20:10 udp: 231121 socket overflows
……
3 、 A lert log
節點2的LCK1 hung了216秒,之後crash ; 節點1檢測到之後,發起reconfiguration
節點 1:
……
Thu Nov 05 16:22:39 2020
LMD0 (ospid: 16620) has detected no messaging activity from instance 2
Thu Nov 05 16:22:57 2020
IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]
Receiver: inst 2 binc 1349400 ospid 165544
Thu Nov 05 16:23:14 2020
Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:
ORA-27508: IPC error sending a message
Thu Nov 05 16:23:14 2020
IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14
Thu Nov 05 16:23:14 2020
Communications reconfiguration: instance_number 2 by ospid 16620
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Thu Nov 05 16:24:10 2020
Reconfiguration started (old inc 4, new inc 8)
List of instances (total 1) :
1
Dead instances (total 1) :
2
My inst 1
……
節點 2 :
……
Thu Nov 05 16:18:08 2020
LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 75 secs.
Thu Nov 05 16:18:08 2020
LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).
Thu Nov 05 16:20:29 2020
LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 216 secs.
Thu Nov 05 16:20:29 2020
LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Thu Nov 05 16:24:04 2020
Received an instance abort message from instance 1
Thu Nov 05 16:21:02 2020
opiodr aborting process unknown ospid (20468) as a result of ORA-609
Thu Nov 05 16:22:39 2020
LMD0 (ospid: 16620) has detected no messaging activity from instance 2
Thu Nov 05 16:22:57 2020
IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]
Receiver: inst 2 binc 1349400 ospid 165544
Thu Nov 05 16:23:14 2020
Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:
ORA-27508: IPC error sending a message
Thu Nov 05 16:23:14 2020
IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14
Thu Nov 05 16:23:14 2020
Communications reconfiguration: instance_number 2 by ospid 16620
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
……
Thu Nov 05 16:24:04 2020
LMSh (ospid: 11314): terminating the instance due to error 481
……
ashdump 分析
( 1 )故障時間點的 ashdump 存在缺失,資料庫整體 hang :
5. Lmon trace
lmon trace中顯示reson為3(reconflguration reason is 3 - which suggested a communication issue)
……
*** 2020-11-05 16:23:14.943
2020-11-05 16:23:14.943607 : kjxgrrcfg: done (device ok) - ret = 3 hist 0x1679a (initial rsn: 3, cachd rsn: 0)
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2020-11-05 16:23:14.943885 : kjxgrnetchk: start 0x8c02716f, end 0x8c0327b9
2020-11-05 16:23:14.943901 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x149939
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
……
官方解釋:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
Reason 3: Communications Failure. This can happen if:
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.
分析結果
本次節點 2的crash是因為 IPC Send timeout 導致,而IPC Send timeout又和主機上的UDP socket overflows激增的時間點匹配,判斷為網路上的UDP socket overflows導致節點驅逐。
來自 “ ITPUB部落格 ” ,連結:https://blog.itpub.net/31547506/viewspace-3001396/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- 3節點RAC資料庫夯故障分析資料庫
- Oracle資料庫監聽Oracle資料庫
- 資料庫監聽不定期出現異常故障處理資料庫
- oracle資料庫監聽啟動不了的原因分析Oracle資料庫
- 啟動資料庫監聽資料庫
- 資料庫開啟監聽資料庫
- Oracle 修改預設監聽埠故障分析Oracle
- 配置資料庫監聽白名單資料庫
- Mysql資料庫監聽binlogMySql資料庫
- Oracle資料庫修改LISTENER的監聽埠Oracle資料庫
- 11.2資料庫監聽的改變資料庫
- 清理資料庫監聽日誌最好方法資料庫
- 監聽狀態正常,但是資料庫不能連線,提示沒有監聽....資料庫
- ORACLE的歸檔空間滿導致的監聽故障資料庫無法啟動Oracle資料庫
- MySQL資料庫故障分析-鎖等待(一)MySql資料庫
- Canal-監聽資料庫表的變化資料庫
- 故障分析 | DROP 大表造成資料庫假死資料庫
- rac的vip和監聽莫名故障
- PG 資料庫庫監聽佇列的長度問題資料庫佇列
- Oracle資料庫的監聽器掛起情況Oracle資料庫
- 故障分析 | MySQL 資料庫升級後,資料庫怎麼卡住了MySql資料庫
- 【監聽】兩庫互配靜態監聽
- 設定USB資料監聽
- Oracle RAC Database 11.1.0.6監聽故障案例OracleDatabase
- 【LISTENER】一個資料庫配置兩個監聽埠號資料庫
- dataguard主庫停監聽後還會把資料同步到備庫
- 某物流系統資料庫故障診斷案例分析資料庫
- 處理rac資料庫一個節點監聽異常資料庫
- ORACLE資料庫安全漏洞之監聽密碼設定Oracle資料庫密碼
- 【RAC】刪除RAC資料庫節點(三)——刪除監聽資料庫
- 啟動資料庫,啟動監聽,檢視是否歸檔資料庫
- 10.2.0.1監聽子程式導致資料庫不能響應資料庫
- grid監聽故障,本地地址不能連線
- oracle 11.2.0.3.6升級故障---監聽報錯Oracle
- 關於資料庫故障資料庫
- oracle 多庫配監聽Oracle
- Oracle 10g 兩個監聽程式的故障Oracle 10g
- 資料庫伺服器故障資料庫伺服器