背景情況簡述

2020年 11 月 05 日 16:24:11 ， N CRM2 2 C rash

故障分析

full picture

2020-11-05 15:05:00 節點 1 開始出現 IP fragments dropped

2020-11-05 16:17:09 udp overflow 29741 to 231111

2020-11-05 16:18:08 LCK1 (ospid: 11324) waits for event

~16:20:29 'libcache interrupt action by LCK' for 216 secs <<<<< 16:20:29 減去 216 秒為 16:16:53 約為網路 udp overflow 的時間點。

2020-11-05 16:22:39 節點 1 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 。

2020-11-05 16:22:57 節點 1 IPC Send timeout detected. Sender: ospid 16620 。

2020-11-05 16:24:10 節點 1 ： Reconfiguration started (old inc 4, new inc 8) 。

2020-11-05 16:24:11 節點 2 ： Instance terminated by LMSh, pid = 11314 。

2020-11-05 16:23:14 Initiating reconfig, reason=3 。

2020-11-05 16:24:39 Reconfiguration complete (total time 28.5 secs) .

2 、 osw 分析

N etstat 中在故障時間點存在嚴重的 UDP socket overflows ：

詳細資料

……

6:14:08 udp: 29741 socket overflows
16:14:38 udp: 29741 socket overflows
16:15:08 udp: 29741 socket overflows
16:15:38 udp: 29741 socket overflows
16:16:08 udp: 29741 socket overflows
16:16:38 udp: 29741 socket overflows
16:17:09 udp: 231111 socket overflows
16:17:39 udp: 231114 socket overflows
16:18:09 udp: 231121 socket overflows
16:18:39 udp: 231121 socket overflows
16:19:09 udp: 231121 socket overflows
16:19:39 udp: 231121 socket overflows
16:20:10 udp: 231121 socket overflows

……

3 、 A lert log

節點2的LCK1 hung了216秒，之後crash ；節點1檢測到之後，發起reconfiguration

節點 1：

……

Thu Nov 05 16:22:39 2020

LMD0 (ospid: 16620) has detected no messaging activity from instance 2

Thu Nov 05 16:22:57 2020

IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]

Receiver: inst 2 binc 1349400 ospid 165544

Thu Nov 05 16:23:14 2020

Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:

ORA-27508: IPC error sending a message

Thu Nov 05 16:23:14 2020

IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14

Thu Nov 05 16:23:14 2020

Communications reconfiguration: instance_number 2 by ospid 16620

Thu Nov 05 16:24:03 2020

Detected an inconsistent instance membership by instance 1

Evicting instance 2 from cluster

Waiting for instances to leave: 2

Thu Nov 05 16:24:10 2020

Reconfiguration started (old inc 4, new inc 8)

List of instances (total 1) :

1

Dead instances (total 1) :

2

My inst 1

……

節點 2 ：

……

Thu Nov 05 16:18:08 2020

LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 75 secs.

Thu Nov 05 16:18:08 2020

LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).

Thu Nov 05 16:20:29 2020

LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 216 secs.

Thu Nov 05 16:20:29 2020

LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02).

Thu Nov 05 16:24:03 2020

Detected an inconsistent instance membership by instance 1

Thu Nov 05 16:24:04 2020

Received an instance abort message from instance 1

Thu Nov 05 16:21:02 2020

opiodr aborting process unknown ospid (20468) as a result of ORA-609

Thu Nov 05 16:22:39 2020

LMD0 (ospid: 16620) has detected no messaging activity from instance 2

Thu Nov 05 16:22:57 2020

IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)]

Receiver: inst 2 binc 1349400 ospid 165544

Thu Nov 05 16:23:14 2020

Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc:

ORA-27508: IPC error sending a message

Thu Nov 05 16:23:14 2020

IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14

Thu Nov 05 16:23:14 2020

Communications reconfiguration: instance_number 2 by ospid 16620

Thu Nov 05 16:24:03 2020

Detected an inconsistent instance membership by instance 1

Evicting instance 2 from cluster

Waiting for instances to leave: 2

……

Thu Nov 05 16:24:04 2020

LMSh (ospid: 11314): terminating the instance due to error 481

……

ashdump 分析

（ 1 ）故障時間點的 ashdump 存在缺失，資料庫整體 hang ：

5. Lmon trace

lmon trace中顯示reson為3(reconflguration reason is 3 - which suggested a communication issue)

……

*** 2020-11-05 16:23:14.943

2020-11-05 16:23:14.943607 : kjxgrrcfg: done (device ok) - ret = 3 hist 0x1679a (initial rsn: 3, cachd rsn: 0)

kjxgrrcfgchk: Initiating reconfig, reason=3

kjxgrrcfgchk: COMM rcfg - Disk Vote Required

kjfmReceiverHealthCB_CheckAll: Recievers are healthy.

2020-11-05 16:23:14.943885 : kjxgrnetchk: start 0x8c02716f, end 0x8c0327b9

2020-11-05 16:23:14.943901 : kjxgrnetchk: Network Validation wait: 46 sec

kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x149939

kjxgrrcfgchk: prev pstate 6 mapsz 512

kjxgrrcfgchk: new bmp: 1 2

kjxgrrcfgchk: work bmp: 1 2

kjxgrrcfgchk: rr bmp: 1 2

……

官方解釋：

Reason 0 = No reconfiguration

Reason 1 = The Node Monitor generated the reconfiguration.

Reason 2 = An instance death was detected.

Reason 3 = Communications Failure

Reason 4 = Reconfiguration after suspend

Reason 3: Communications Failure. This can happen if:

a) The LMON processes lose communication between one another.

b) One instance loses communications with the LMS, LMD, process of another instance.

c) The LCK processes lose communication between one another.

d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.

分析結果

本次節點 2的crash是因為 IPC Send timeout 導致，而IPC Send timeout又和主機上的UDP socket overflows激增的時間點匹配，判斷為網路上的UDP socket overflows導致節點驅逐。

資料庫監聽夯故障分析

背景情況簡述

故障分析

full picture

2 、 osw 分析

3 、 A lert log

ashdump 分析

5. Lmon trace

相關文章