一次RAC單例項DOWN機的診斷
客戶的10.2.0.4 RAC環境前幾天出現了單節點DOWN機,幫客戶診斷問題所在。
首先檢查資料庫的告警日誌:
Fri Jun 17 18:32:17 2011
ospid 14177: network interface with IP address 192.169.1.2 no longer running
(check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmon_14175.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms0_14179.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms3_14191.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms4_14195.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms6_14203.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms5_14199.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms2_14187.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms7_14207.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lms1_14183.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_lmd0_14177.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_asmb_14255.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file
/ora10g/oracle/admin/tgpsndb/bdump/tgpsndb2_diag_14171.trc
可以看到,告警日誌中出現了很多ORA-29702錯誤,導致這個錯誤的原因顯然是CLUSTER層出現了問題,再看最後出現的ORA-15064和ORA-3113錯誤,說明ASM例項也出現了問題,導致資料庫例項崩潰。
此外,錯誤一開始的時候出現了網路地址不可用的問題,以前偶爾在其他RAC環境的告警日誌中也看到過類似的錯誤,一般是出現了這個錯誤後,1分鐘後再次檢測的時候,網路地址就恢復了正常,而類似這種單獨出現的很少見。問題是否會與這個地址告警有關還不好說,不過既然現有錯誤提示和ASM以及CLUSTER有關係,那麼首先檢查ASM上的錯誤日誌:
Fri Jun 17 18:32:08 2011
ospid 13967: network interface with IP address 192.169.1.2 no longer running
(check cable)
Fri Jun 17 18:32:22 2011
Error: KGXGN aborts the instance (6)
Fri Jun 17 18:32:22 2011
Errors in file /ora10g/oracle/admin/+ASM/bdump/+asm2_lmon_13964.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Fri Jun 17 18:32:24 2011
System state dump is made for local instance
System State dumped to trace file
/ora10g/oracle/admin/+ASM/bdump/+asm2_diag_13955.trc
Fri Jun 17 18:32:24 2011
Trace dumping is performing id=[cdmp_20110617183224]
ASM的錯誤日誌並無新奇之處,錯誤和資料庫例項相仿,包含了網路地址不可用的資訊,以及ORA-29702錯誤,ASM的問題是CLUSTER上的錯誤導致的。
繼續檢查CLUSTER日誌:
2011-06-17 18:31:52.706
[cssd(12435)]CRS-1612:node tgpsndb1 (1) at 50% heartbeat fatal, eviction in
29.094 seconds
2011-06-17 18:32:07.706
[cssd(12435)]CRS-1611:node tgpsndb1 (1) at 75% heartbeat fatal, eviction in
14.094 seconds
2011-06-17 18:32:16.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
5.084 seconds
2011-06-17 18:32:17.707
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
4.094 seconds
2011-06-17 18:32:18.709
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
3.094 seconds
2011-06-17 18:32:19.711
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
2.094 seconds
2011-06-17 18:32:20.713
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
1.094 seconds
2011-06-17 18:32:21.715
[cssd(12435)]CRS-1610:node tgpsndb1 (1) at 90% heartbeat fatal, eviction in
0.094 seconds
[cssd(12435)]Invalid msg id [1609]
2011-06-17 18:36:21.314
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw5. Details in
/ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw3. Details in
/ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
2011-06-17 18:36:21.320
[cssd(12459)]CRS-1605:CSSD voting file is online: /dev/raw/raw4. Details in
/ora10g/oracle/ora10g/db/log/tgpsndb2/cssd/ocssd.log.
很顯然,這時由於網路心跳超時導致的當前節點被踢出CLUSTER。
那麼造成網路超時的原因是否和前面例項以及ASM錯誤日誌中的IP地址不可用有關呢,繼續檢查作業系統上的日誌:
Jun 17 18:31:23 tgpsndb2 kernel:
igb: eth1 NIC Link is Down
Jun 17 18:31:23 tgpsndb2 kernel: igb: eth0 NIC Link is Down
Jun 17 18:32:22 tgpsndb2 logger: Oracle CSSD failure 134.
Jun 17 18:32:22 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster
integrity.
Jun 17 18:32:22 tgpsndb2 kernel: SysRq : Resetting
Jun 17 18:32:22 tgpsndb2 logger: Oracle clsomon failed with fatal status 12.
Jun 17 18:32:23 tgpsndb2 logger: Oracle CRS failure. Rebooting for cluster
integrity.
Jun 17 18:32:23 tgpsndb2 logger: Oracle init script. ceding reboot to sibling
11929.
Jun 17 18:32:24 tgpsndb2 logger: Cluster Ready Services completed waiting on
dependencies.
在錯誤發生時刻,作業系統上兩個網路卡都DOWN掉了,而此後就是CLUSTER的錯誤以及準備重啟的資訊,顯然這次例項自動重啟的真正原因就在於此。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-701570/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- Oracle診斷事件例項(一)Oracle事件
- RMAN異機恢復:RAC到單例項單例
- 一次RAC VIP漂移的結果診斷及修復
- 判斷oracle是否是rac例項Oracle
- RAC起單例項make單例
- oracle單例項轉RACOracle單例
- 利用10015診斷事件解決oracle crush down 一例事件Oracle
- RAC 例項 遷移到 單例項 -- 使用RMAN Duplicate單例
- oracle效能診斷例項-row migration and row chainOracleAI
- RAC故障診斷指令碼指令碼
- RAC+DG(asm單例項)ASM單例
- rac恢復到單例項單例
- 搭建RAC到單例項DG單例
- 單例項恢復至RAC單例
- 如何診斷RAC系統中的
- RAC+單例項DG的切換單例
- rac到單例項的rman恢復單例
- 單例項opatch出現RAC資訊單例
- OGG搭建(rac到-->單例項)單例
- RAC asm恢復到單例項ASM單例
- 單例項和RAC打造的ORACLE STREAM(完)單例Oracle
- 單例項和RAC打造的ORACLE STREAM(四)單例Oracle
- 單例項和RAC打造的ORACLE STREAM(三)單例Oracle
- 單例項和RAC打造的ORACLE STREAM(二)單例Oracle
- 單例項和RAC打造的ORACLE STREAM(一)單例Oracle
- 記錄一次Oracle 11.2.0.4 RAC異地恢復到單例項Oracle單例
- 11G RAC 異機恢復至單例項測試單例
- 【RAC】Oracle Clusterware 診斷收集指令碼Oracle指令碼
- 單例項備份恢復成RAC單例
- Oracle DataBase單例項遷移到Oracle RACOracleDatabase單例
- rac asm 恢復到 單例項 1ASM單例
- rac asm 恢復到 單例項 2ASM單例
- RAC從帶庫到單例項的恢復單例
- Oracle效能診斷一例Oracle
- Oracle 11.2.0.4 從單例項,使用RMAN 異機恢復到RACOracle單例
- 一次ORACLE IO效能診斷案例Oracle
- 一次Oracle診斷案例-Spfile案例Oracle
- 一次gc buffer busy問題的診斷GC