Diagnosing the eviction of instance 2 in a 9i RAC cluster on the HP platform

Posted by liglewang on 2011-10-24

A few mornings ago, completely out of the blue, my phone received an alert SMS from the monitoring system: a business-hall system could not connect to its database. That is about as serious as alerts get, all the more so for the core system of the business halls, since it directly affects the customer experience. I logged in at once and found that the instance on node 2 of the two-node RAC was down. Fortunately it is a two-node cluster: performance might take a hit, but the business halls could still serve customers (this is where the value of RAC really shows).

Check the processes:

■■db2:[/oracle]$ps -ef | grep ora_
  oracle  914114 3548012   0 19:00:23  pts/3  0:00 grep ora_ 
■■db2:[/oracle]$ps -ef | grep lsnr
  oracle 22257 22179  0 11:06:48 pts/1     0:00 grep lsnr
  oracle 15157     1  0 07:09:39 ?         0:00 /orarac/app/oracle/product/9.2.0/bin/tnslsnr LISTENER -inherit 

There are no instance background processes left on the node, but the listener process is still there; the listener is LISTENER.

No point waiting around; let's bring the database back up:

SQL> startup
ORACLE instance started.

Total System Global Area 3.2212E+10 bytes
Fixed Size                   782472 bytes
Variable Size            4815060992 bytes
Database Buffers         2.7380E+10 bytes
Redo Buffers               16003072 bytes
Database mounted.
Database opened.

Good news: the database opened normally.
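To double-check that both instances had rejoined the cluster, something like the following can be run from either node. This is only a minimal sketch, assuming a local SYSDBA connection; the ORACLE_SID value is an assumption taken from the trace-file naming above, so substitute your own:

# confirm the instance background processes are back on this node
ps -ef | grep ora_pmon | grep -v grep

# ask the whole cluster which instances are open (GV$ views span all nodes in 9i RAC)
export ORACLE_SID=■■db2      # assumed instance name, change to yours
sqlplus -s "/ as sysdba" <<'EOF'
set linesize 120
column host_name format a20
select inst_id, instance_name, host_name, status, startup_time from gv$instance;
EOF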

With the symptom taken care of, the next step is a deeper analysis of what actually brought the instance down.

Node 1 alert log:

Tue Oct 18 10:56:34 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster

By node 1's system clock:

10:56:34  cluster split-brain detected (waiting for clusterware split-brain resolution)
11:03:00  timed out trying to start process P001
11:06:34  instance 2 on node 2 evicted from the cluster

Node 2 alert log:

Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving  log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
…………………..
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lmon_15081.trc:
ORA-29740: evicted by member 1, group incarnation 19
Tue Oct 18 11:03:45 2011
System state dump is made for local instance

By node 2's system clock:

10:51:57  IPC send timeout detected
10:51:59  Communications reconfiguration: instance 0 (the "instance 0" here indicates an instance shutdown)
10:53:45  cluster split-brain detected (waiting for clusterware split-brain resolution)
11:02:14  timed out trying to start process P001
11:03:45  instance terminated due to error 481, with ORA-29740 reported; this is also the moment my phone received the alert SMS

According to eygle, IPC timeouts mainly show up when resources are strained or the network is congested; the cluster then evicts the problem node, which triggers a new round of reconfiguration.
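Before blaming load, it is worth ruling the interconnect in or out with a quick look from each node. A rough sketch only (the peer address below is a placeholder, and the exact output varies by platform; on HP-UX, netstat -i at least shows per-interface error counters):

# per-interface packet and error counters; look at the private (interconnect) NIC
netstat -i

# protocol-level statistics; look for drops and buffer overflows around the incident
netstat -s | more

# basic reachability and packet loss on the private network
# (placeholder address: use the other node's interconnect IP;
#  interrupt with Ctrl-C after a few seconds and read the loss summary)
ping 192.168.0.2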

Now let's look at the trace file that was produced, ■■db2_lmon_15081.trc:

*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10  Senders 10
GES IPC: Buffers  Receive 1000  Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size  Regular 440  Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-10-18 07:09:01.649
     Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
     Name Service normal
     Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec
List of nodes:
 0 1
 Global Resource Directory frozen
node 0
release 9 2 0 7
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
 Total master weight = 6244032
 Dead  inst
 Join  inst 0 1
 Exist inst
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
*** 2011-10-18 07:09:11.515
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS cache requests
 0 write requests issued in 1518892 GCS resources
 10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
     Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information
  Member 1, thread 2, state 4, flags 0x00a1
  RR seq 19, propstate 3, pending propstate 0
  rcfg rsn 3, rcfg time 21546700, mem ct 2
  master 1, master rcfg time 21546700
 
  Member information:
    Member 0, incarn 18, version 1840361946
      thrd 1, prev thrd 65535, status 0x0007, err 0x0002
      valct 0
    Member 1, incarn 18, version 816352
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2
 
Group name:
■■DB
Member id: 1
Cached SKGXN event: 0
Group State:
  State: 18 1
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2  Rcvcnt 0
  Substate Proposal: false
Inc Proposal:
  incarn 0  memcnt 0  master 0
  proposal false  matched false
  map:
Master Inc State:
  incarn 0  memcnt 0  agrees 0  flag 0x1
  wmap:
  nmap:
  ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19

The trace file yields the following information:

*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.

By node 2's system clock:

10:51:59  Initiating reconfig, reason 3 / Reconfiguration started, reason 3 (the "reason 3" here is the key piece of information)

*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)

By node 2's system clock:

10:53:45  Waiting for split-brain resolution (the cluster is waiting for the split-brain to be resolved)
11:03:45  Evicted by 1 (at this point the instance on node 2 is evicted)

Looking up MetaLink note ID 219316.1:

For troubleshooting ORA-29740 errors, the 'reason' is very important. In the above example, the first section indicates the reason for the initiated reconfiguration. The reasons are as follows:


Reason 0 = No reconfiguration 
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
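Since the reason code is the key datum, it is easy to pull straight out of the LMON traces in bdump. A small sketch (the bdump path is the one that appears in the alert log above; the grep patterns follow the strings seen in ■■db2_lmon_15081.trc):

# background-dump destination as seen in the alert log
BDUMP=/orarac/app/oracle/admin/■■db/bdump

# which reconfigurations were initiated, and with what reason code
grep -n "Initiating reconfig, reason" $BDUMP/*lmon*.trc
grep -n "Reconfiguration started, reason" $BDUMP/*lmon*.trc

# who evicted whom, and in which group incarnation
grep -n "Evicted by" $BDUMP/*lmon*.trc
grep -n "ORA-29740"  $BDUMP/*.trc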

Now the possible causes of reason 3, again from the note (a quick liveness check for the processes involved is sketched right after the list):

a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD process of another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.
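All four of these come down to whether LMON/LMD/LMS/LCK are alive and making progress, so a crude spot check on each node is to list them twice, a minute apart, and compare the CPU (TIME) column: a blocked process will not move, while a spinning one keeps burning CPU. A sketch, purely illustrative:

# list the global enqueue / cache service background processes
ps -ef | egrep 'ora_(lmon|lmd|lms|lck)' | grep -v grep

# wait a minute, take a second sample, then compare the TIME column
sleep 60
ps -ef | egrep 'ora_(lmon|lmd|lms|lck)' | grep -v grep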

In this case the reason flag is 3, which points to a communications failure. My best guess is that the interconnect (heartbeat) had a problem: at 2 a.m. that same day, while hardware was being replaced on the host, a network issue had already caused a heartbeat abnormality. The other candidate causes, such as resource shortage or network congestion, can largely be ruled out; the hardware here is fairly high-end, and routine monitoring shows node 2's load is perfectly reasonable.

The timestamps of the important events above are a bit scattered, so here is a simple consolidation of the key events recorded in the two alert logs:

Time        Node 1                                            Node 2
10:51:57    -                                                 IPC Send timeout detected
10:51:59    -                                                 Communications reconfiguration: instance 0
10:53:45    -                                                 Waiting for clusterware split-brain resolution
10:56:34    Waiting for clusterware split-brain resolution    -
11:02:14    -                                                 Timed out trying to start process P001.
11:03:00    Timed out trying to start process P001.           -
11:03:45    -                                                 USER: terminating instance due to error 481
                                                              ORA-29740: evicted by member 1, group incarnation 19
11:06:34    Evicting instance 2 from cluster                  -

For reasons unknown, after the host reboot at 2 a.m. that day the NTP service on node 2 was abnormal, and it stayed that way until the instance on node 2 went down. While working the incident I found that node 2 was about 3 minutes behind node 1 (more precisely, around 3 minutes 11 seconds). I stepped the lagging clock forward 30 seconds at a time until the two hosts' system times were essentially in line; after the host engineer fixed NTP, the NTP service on that host came back and both nodes were once again synchronized with the time server.
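Clock drift between RAC nodes is easy to check and worth adding to routine monitoring. A minimal sketch, assuming xntpd/ntpq is in use on the HP-UX hosts and that remsh between the nodes is permitted (both are assumptions; substitute ssh and your own peer hostname as needed):

# offset, delay and reachability against the configured time servers
ntpq -p

# rough wall-clock comparison between the two RAC nodes
PEER=node1_hostname            # placeholder: the other RAC node
date '+%Y-%m-%d %H:%M:%S'
remsh $PEER date '+%Y-%m-%d %H:%M:%S'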

So the timestamps in the table above are not entirely accurate: add the roughly 3 minute 11 second offset to node 2's times, and the timestamps of the key events in the two nodes' logs essentially coincide.

With that adjustment, the split-brain actually occurred on both nodes at the same time, the exact moment being 10:56:34; likewise, the instance 2 eviction recorded in node 1's alert log and the ORA-29740 error reported on node 2 were in fact simultaneous.

That makes the sequence on node 2 clear:

10:51:57  (really about 3 minutes 11 seconds later, i.e. roughly 10:55:08)  IPC send timeout detected
10:56:34  cluster split-brain
11:06:34  instance 2 on node 2 evicted

 

As for why the heartbeat timed out, the explanation that makes the most sense to me is the roughly 3 minute 11 second gap between the two nodes' clocks (node 2 behind node 1): that clock skew upset the heartbeat, which led to the split-brain and ultimately to the eviction of instance 2.

If anyone has run into the same problem, you are welcome to get in touch and compare notes. Thanks!

