Diagnosing a Single-Instance Crash in a RAC Environment

Posted by yangtingkun on 2011-07-06

A few days ago a single node went down in a customer's 10.2.0.4 RAC environment, and I helped the customer track down the cause.

First, check the database alert log:

Fri Jun 17 18:32:17 2011
ospid 14177: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmon_14175.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms0_14179.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms3_14191.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms4_14195.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms6_14203.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms5_14199.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms2_14187.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms7_14207.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms1_14183.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmd0_14177.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_asmb_14255.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_diag_14171.trc

As the alert log shows, a large number of ORA-29702 errors were reported; this error clearly stems from a problem at the cluster layer. The ORA-15064 and ORA-3113 errors at the end indicate that the ASM instance also ran into trouble, which brought the database instance down.
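To pull the same error sequence out of a large alert log without scrolling through it by hand, here is a minimal Python sketch; the alert log file name for this instance is an assumption based on the standard bdump naming, so adjust the path for your environment:

import re

# Rough helper: print every ORA- error in a 10g alert log together with the
# most recent timestamp line seen before it. The file name below is assumed.
ALERT_LOG = "/ora10g/oracle/admin/tgpsndb/bdump/alert_tgpsndb2.log"

ts_line = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
ora_err = re.compile(r"ORA-\d{4,5}")

last_ts = None
with open(ALERT_LOG) as f:
    for raw in f:
        line = raw.rstrip("\n")
        if ts_line.match(line):
            last_ts = line
        elif ora_err.search(line):
            print(f"{last_ts}: {line}")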

In addition, right at the start of the errors there is a message that the network IP address is no longer running. I have occasionally seen similar messages in the alert logs of other RAC environments; usually, when the address is checked again a minute later, it is reported as back to normal, and it is rare for the message to appear on its own like this. Whether the problem is related to this address warning is hard to say at this point, but since the current errors point to ASM and the cluster layer, the next step is to check the ASM alert log:

Fri Jun 17 18:32:08 2011
ospid 13967: network interface with IP address 192.169.1.2 no longer running (check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/+ASM/bdump/+asm2_lmon_13964.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file /ora10g/oracle/admin/+ASM/bdump/+asm2_diag_13955.trc
Fri Jun 17 18:32:24 2011
Trace dumping is performing id=[cdmp_20110617183224]

The ASM alert log holds no surprises: its errors mirror those of the database instance, including the message about the unavailable network address and the ORA-29702 error. The ASM instance's failure was itself caused by the error at the cluster layer.
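In 10.2 the clusterware alert log normally sits at <home>/log/<hostname>/alert<hostname>.log, next to the cssd directory that holds ocssd.log; the directory here can be inferred from the ocssd.log path in the excerpt below, while the file name itself is an assumption. A minimal sketch for listing the CRS- messages from the incident window:

import re
from datetime import datetime

# List CRS- messages from the clusterware alert log within a time window.
# Both the path and the window are assumptions for this particular incident.
CRS_ALERT = "/ora10g/oracle/ora10g/db/log/tgpsndb2/alerttgpsndb2.log"
START = datetime(2011, 6, 17, 18, 30, 0)
END = datetime(2011, 6, 17, 18, 40, 0)

last_ts = None
with open(CRS_ALERT) as f:
    for line in f:
        m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
        if m:
            # Timestamps are written on their own line ahead of each message.
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        elif "CRS-" in line and last_ts and START <= last_ts <= END:
            print(f"{last_ts}  {line.rstrip()}")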

Continue with the clusterware alert log:

2011-06-17 18:31:52.706
[cssd(12435)]CRS-1612:node tgpsndb1 (1) at 50% heartbeat fatal, eviction in 29.094 seconds
2011-06-17 18:32:07.706
[cssd(12435)]CRS-1611:node tgpsndb1 (1) at 75% heartbeat fatal, eviction in 14.094 seconds
2011-06-17 18:32:16.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 5.084 seconds
2011-06-17 18:32:17.707
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 4.094 seconds
2011-06-17 18:32:18.709
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 3.094 seconds
2011-06-17 18:32:19.711
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 2.094 seconds
2011-06-17 18:32:20.713
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 1.094 seconds
2011-06-17 18:32:21.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in 0.094 seconds
[cssd(12435)]Invalid msg id [1609]
2011-06-17 18:36:21.314
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in /ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.

Clearly, the current node was evicted from the cluster because the network heartbeat timed out.
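The countdown itself is informative: roughly 29 seconds remaining at the 50% mark implies a total network heartbeat timeout of about 60 seconds, which matches the default CSS misscount for 10.2 on Linux. The configured value can be read back with crsctl; a minimal sketch, assuming the crsctl binary from the CRS home is on the PATH:

import subprocess

# Read the configured CSS misscount (network heartbeat timeout, in seconds).
# Assumes crsctl from the CRS home is on the PATH of the executing user.
out = subprocess.run(["crsctl", "get", "css", "misscount"],
                     capture_output=True, text=True)
print(out.stdout.strip() or out.stderr.strip())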

Is the network timeout related to the unavailable IP address reported earlier in the database and ASM alert logs? Continue with the operating system log:

Jun 17 18:31:23 tgpsndb2 kernel: igb: eth1 NIC Link is Down
Jun 17 18:31:23 tgpsndb2 kernel: igb: eth0 NIC Link is Down
Jun 17 18:32:22 tgpsndb2 logger: Oracle CSSD failure 134.
Jun 17 18:32:22 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:22 tgpsndb2 kernel: SysRq : Resetting
Jun 17 18:32:22 tgpsndb2 logger: Oracle clsomon failed with fatal status 12.
Jun 17 18:32:23 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 17 18:32:23 tgpsndb2 logger: Oracle init script. ceding reboot to sibling 11929.
Jun 17 18:32:24 tgpsndb2 logger: Cluster Ready Services completed waiting on dependencies.

At the moment the errors occurred, both network interfaces on the server went down, followed by the clusterware failure and the messages about rebooting for cluster integrity. Clearly, this is the real cause of the automatic instance restart.
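When chasing this kind of failure, the link state of the interconnect and public interfaces can also be spot-checked directly from Linux sysfs rather than waiting for the next syslog entry. A minimal sketch, assuming the eth0/eth1 names seen in the log above:

import glob

# Spot-check NIC link state via sysfs: /sys/class/net/<if>/carrier reads "1"
# when the link is up. Reading it on an administratively down interface
# raises OSError, which is reported as down here.
for path in sorted(glob.glob("/sys/class/net/eth*/carrier")):
    iface = path.split("/")[4]
    try:
        with open(path) as f:
            state = "up" if f.read().strip() == "1" else "down"
    except OSError:
        state = "down (interface not up)"
    print(f"{iface}: link {state}")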

From the ITPUB blog, link: http://blog.itpub.net/4227/viewspace-701570/. Please credit the source when reposting; otherwise legal liability will be pursued.
