資料庫監聽夯故障分析
背景情況簡述
2020年 11 月 05 日 16:24:11 , N CRM2 2 C rash
故障分析
full picture
2020-11-05 15:05:00 節點 1 開始出現 IP fragments dropped
2020-11-05 16:17:09 udp overflow 29741 to 231111
2020-11-05 16:18:08 LCK1 (ospid: 11324) waits for event
~16:20:29 'libcache interrupt action by LCK' for 216 secs <<<<< 16:20:29 減去 216 秒為 16:16:53 約為網路 udp overflow 的時間點。
2020-11-05 16:22:39 節點 1 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 。
2020-11-05 16:22:57 節點 1 IPC Send timeout detected. Sender: ospid 16620 。
2020-11-05 16:24:10 節點 1 : Reconfiguration started (old inc 4, new inc 8) 。
2020-11-05 16:24:11 節點 2 : Instance terminated by LMSh, pid = 11314 。
2020-11-05 16:23:14 Initiating reconfig, reason=3 。
2020-11-05 16:24:39 Reconfiguration complete (total time 28.5 secs) .
2 、 osw 分析
N etstat 中在故障時間點存在嚴重的 UDP socket overflows :
詳細資料
……
6:14:08 udp: 29741 socket overflows
16:14:38 udp: 29741 socket overflows
16:15:08 udp: 29741 socket overflows
16:15:38 udp: 29741 socket overflows
16:16:08 udp: 29741 socket overflows
16:16:38 udp: 29741 socket overflows
16:17:09 udp: 231111 socket overflows
16:17:39 udp: 231114 socket overflows
16:18:09 udp: 231121 socket overflows
16:18:39 udp: 231121 socket overflows
16:19:09 udp: 231121 socket overflows
16:19:39 udp: 231121 socket overflows
16:20:10 udp: 231121 socket overflows
……
3 、 A lert log
節點2的LCK1 hung了216秒,之後crash ; 節點1檢測到之後,發起reconfiguration
節點 1:
……
Thu Nov 05 16:22:39 2020
LMD0 (ospid: 16620) has detected no messaging activity from instance 2
Thu Nov 05 16:22:57 2020
IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]
Receiver: inst 2 binc 1349400 ospid 165544
Thu Nov 05 16:23:14 2020
Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:
ORA-27508: IPC error sending a message
Thu Nov 05 16:23:14 2020
IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14
Thu Nov 05 16:23:14 2020
Communications reconfiguration: instance_number 2 by ospid 16620
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Thu Nov 05 16:24:10 2020
Reconfiguration started (old inc 4, new inc 8)
List of instances (total 1) :
1
Dead instances (total 1) :
2
My inst 1
……
節點 2 :
……
Thu Nov 05 16:18:08 2020
LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 75 secs.
Thu Nov 05 16:18:08 2020
LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).
Thu Nov 05 16:20:29 2020
LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 216 secs.
Thu Nov 05 16:20:29 2020
LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Thu Nov 05 16:24:04 2020
Received an instance abort message from instance 1
Thu Nov 05 16:21:02 2020
opiodr aborting process unknown ospid (20468) as a result of ORA-609
Thu Nov 05 16:22:39 2020
LMD0 (ospid: 16620) has detected no messaging activity from instance 2
Thu Nov 05 16:22:57 2020
IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]
Receiver: inst 2 binc 1349400 ospid 165544
Thu Nov 05 16:23:14 2020
Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:
ORA-27508: IPC error sending a message
Thu Nov 05 16:23:14 2020
IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14
Thu Nov 05 16:23:14 2020
Communications reconfiguration: instance_number 2 by ospid 16620
Thu Nov 05 16:24:03 2020
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
……
Thu Nov 05 16:24:04 2020
LMSh (ospid: 11314): terminating the instance due to error 481
……
ashdump 分析
( 1 )故障時間點的 ashdump 存在缺失,資料庫整體 hang :
5. Lmon trace
lmon trace中顯示reson為3(reconflguration reason is 3 - which suggested a communication issue)
……
*** 2020-11-05 16:23:14.943
2020-11-05 16:23:14.943607 : kjxgrrcfg: done (device ok) - ret = 3 hist 0x1679a (initial rsn: 3, cachd rsn: 0)
kjxgrrcfgchk: Initiating reconfig, reason=3
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2020-11-05 16:23:14.943885 : kjxgrnetchk: start 0x8c02716f, end 0x8c0327b9
2020-11-05 16:23:14.943901 : kjxgrnetchk: Network Validation wait: 46 sec
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x149939
kjxgrrcfgchk: prev pstate 6 mapsz 512
kjxgrrcfgchk: new bmp: 1 2
kjxgrrcfgchk: work bmp: 1 2
kjxgrrcfgchk: rr bmp: 1 2
……
官方解釋:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
Reason 3: Communications Failure. This can happen if:
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.
分析結果
本次節點 2的crash是因為 IPC Send timeout 導致,而IPC Send timeout又和主機上的UDP socket overflows激增的時間點匹配,判斷為網路上的UDP socket overflows導致節點驅逐。
來自 “ ITPUB部落格 ” ,連結:https://blog.itpub.net/31547506/viewspace-3001396/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- 3節點RAC資料庫夯故障分析資料庫
- Oracle 修改預設監聽埠故障分析Oracle
- 啟動資料庫監聽資料庫
- Mysql資料庫監聽binlogMySql資料庫
- 配置資料庫監聽白名單資料庫
- MySQL資料庫故障分析-鎖等待(一)MySql資料庫
- Canal-監聽資料庫表的變化資料庫
- 故障分析 | DROP 大表造成資料庫假死資料庫
- 故障分析 | MySQL 資料庫升級後,資料庫怎麼卡住了MySql資料庫
- grid監聽故障,本地地址不能連線
- 自定義元件-資料監聽器元件
- MySQL資料庫故障恢復MySql資料庫
- 故障分析 | MySQL 5.7 使用臨時表導致資料庫 CrashMySql資料庫
- [Kogel.Subscribe.Mssql]SQL Server增量訂閱,資料庫變更監聽SQLServer資料庫
- 成為MySQL DBA後,再看ORACLE資料庫(二、監聽與連線)MySqlOracle資料庫
- 資料庫無法註冊至監聽服務解決辦法資料庫
- oracle 資料庫lsnrctl監聽的日誌路徑和trace檔案Oracle資料庫
- 效能測試之資料庫監控分析工具PMM資料庫
- 資料庫資料恢復-ORACLE資料庫的常見故障&各種故障下的資料恢復可能性資料庫資料恢復Oracle
- 精讀《如何利用 Nodejs 監聽資料夾》NodeJS
- 分析型資料庫:分散式分析型資料庫資料庫分散式
- SpringBoot事件監聽器原始碼分析Spring Boot事件原始碼
- nginx建立和監聽套接字分析Nginx
- Android 監聽生命週期工具庫Android
- 由OGG引發的資料庫故障資料庫
- 雲原生資料庫夯實企業數字新基建資料庫
- 夯實基礎系列三:資料庫知識總結資料庫
- 監聽 watch props物件屬性監聽 或深度監聽物件
- 【故障公告】資料庫伺服器 CPU 100% 造成全站故障資料庫伺服器
- 【故障公告】資料庫伺服器 CPU 100% 引發全站故障資料庫伺服器
- WPF 實現檔案/資料夾監聽工具
- 3天學寫mvvm框架[一]:資料監聽MVVM框架
- 效能測試之資料庫監控分析工具Grafana+Prometheus資料庫GrafanaPrometheus
- Spring 事件監聽機制及原理分析Spring事件
- EMC儲存Raid故障資料分析報告AI
- 資料庫連線異常故障報告資料庫
- 監控資料庫活動資料庫
- 資料庫監控---PIGOSS BSM資料庫Go