11.2 RAC時間同步異常(一)

yangtingkun發表於2011-05-11

一個客戶的11.2RAC環境,出現了一個節點DOWN掉的問題。

 

 

這個客戶手工調整了系統的時間,使得兩個節點的系統時間只相差幾秒,而且還設定了11.2的自動時間同步功能。但是到了下午發現一個節點上的例項DOWN掉了,而且這時兩個節點的時間相差了一個小時左右。

檢查掛掉的節點上的資料庫alert檔案:

Mon May 10 15:22:03 2010
NOTE: ASMB terminating
Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 53
Errors in file /oracle/app/oracle/diag/rdbms/posp/posp2/trace/posp2_asmb_8257792.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 53
ASMB (ospid: 8257792): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 8257792

很顯然導致問題的原因是連線不到ASM例項,這比如導致資料庫例項的崩潰,再檢查ASM例項的錯誤日誌:

Mon May 10 15:00:13 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:03:35 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:06:56 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:10:17 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:13:39 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:17:00 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:20:21 2010
WARNING: Elapsed time did not advance (delta = 0),
WARNING: skipping disk repair timer updates for
WARNING: offline disks
Mon May 10 15:22:03 2010
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon May 10 15:22:03 2010
NOTE: client exited [4718636]
NOTE: force a map free for map id 2
Mon May 10 15:22:04 2010
PMON (ospid: 6816236): terminating the instance due to error 481
Instance terminated by PMON, pid = 6816236

ASM例項的被中止的原因是錯誤481,而Oracle對於這個錯誤的描述為:

ORA-00481: LMON process terminated with error
Cause: The global enqueue service monitor process died
Action: Warm start instance

看來問題和CLUSTER環境有關,而且ASM的告警日誌中包含了大量的警告資訊。

進一步檢查CLUSTER上的相關資訊:

2011-05-10 14:33:46.819
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2011-05-10 14:35:01.116
[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459465
2011-05-10 14:35:01.144
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .
2011-05-10 14:35:03.192
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2011-05-10 14:36:18.384
[cssd(4391224)]CRS-1612:Network communication with node jzdbiufo (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.896 seconds
2011-05-10 14:36:26.422
[cssd(4391224)]CRS-1611:Network communication with node jzdbiufo (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.859 seconds
2011-05-10 14:36:30.452
[cssd(4391224)]CRS-1610:Network communication with node jzdbiufo (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.829 seconds
2011-05-10 14:36:33.285
[cssd(4391224)]CRS-1632:Node jzdbiufo is being removed from the cluster in cluster incarnation 200459467
2011-05-10 14:36:33.312
[cssd(4391224)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc .
2011-05-10 14:54:16.181
[crsd(6422586)]CRS-2765:Resource 'ora.net1.network' has failed on server 'jzdbnc'.
2011-05-10 14:54:18.539
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-05-10 14:54:18.574
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-05-10 14:56:13.440
[crsd(6422586)]CRS-2765:Resource 'ora.net1.network' has failed on server 'jzdbnc'.
2011-05-10 14:56:15.055
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-05-10 14:56:16.179
[/oracle/app/11.2.0/grid/bin/oraagent.bin(3014938)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbnc/agent/crsd/oraagent_grid/oraagent_grid.log"

單從節點1CLUSTERtrace資訊看,似乎是網路的問題導致了另一個節點被關閉,檢查節點2上的資訊:

2010-05-10 14:38:20.599
[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.
2010-05-10 14:38:20.599
[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2010-05-10 15:11:25.818
[ctssd(5243264)]CRS-2404:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.
2010-05-10 15:11:25.818
[ctssd(5243264)]CRS-2408:The clock on host jzdbiufo has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2010-05-10 15:21:46.319
[cssd(4259964)]CRS-1612:Network communication with node jzdbnc (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.926 seconds
2010-05-10 15:21:54.322
[cssd(4259964)]CRS-1611:Network communication with node jzdbnc (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.923 seconds
2010-05-10 15:21:58.348
[cssd(4259964)]CRS-1610:Network communication with node jzdbnc (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.897 seconds
2010-05-10 15:22:01.253
[cssd(4259964)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:22:01.254
[cssd(4259964)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:22:01.339
[cssd(4259964)]CRS-1652:Starting clean up of CRSD resources.
2010-05-10 15:22:01.927
[cssd(4259964)]CRS-1608:This node was evicted by node 1, jzdbnc; details at (:CSSNM00005:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:22:03.111
[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5016:Process "/oracle/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:03.611
[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5016:Process "/oracle/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/oracle/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:03.618
[cssd(4259964)]CRS-1654:Clean up of CRSD resources finished successfully.
2010-05-10 15:22:03.631
[cssd(4259964)]CRS-1655:CSSD on node jzdbiufo detected a problem and started to shutdown.
2010-05-10 15:22:03.722
[/oracle/app/11.2.0/grid/bin/oraagent.bin(7340044)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:3:4} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2010-05-10 15:22:03.722
[/oracle/app/11.2.0/grid/bin/orarootagent.bin(6488506)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:681} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/orarootagent_root/orarootagent_root.log.
2010-05-10 15:22:03.720
[/oracle/app/11.2.0/grid/bin/oraagent.bin(6946850)]CRS-5822:Agent '/oracle/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:9} in /oracle/app/11.2.0/grid/log/jzdbiufo/agent/crsd/oraagent_grid/oraagent_grid.log.
2010-05-10 15:22:03.927
[ohasd(4456806)]CRS-2765:Resource 'ora.crsd' has failed on server 'jzdbiufo'.
2010-05-10 15:22:05.042
[crsd(8978838)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /oracle/app/11.2.0/grid/log/jzdbiufo/crsd/crsd.log.
2010-05-10 15:22:05.130
[ohasd(4456806)]CRS-2765:Resource 'ora.evmd' has failed on server 'jzdbiufo'.
2010-05-10 15:22:05.256
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:05.431
[ohasd(4456806)]CRS-2765:Resource 'ora.ctssd' has failed on server 'jzdbiufo'.
2010-05-10 15:22:05.447
[ohasd(4456806)]CRS-2765:Resource 'ora.asm' has failed on server 'jzdbiufo'.
2010-05-10 15:22:05.512
[ohasd(4456806)]CRS-2765:Resource 'ora.crsd' has failed on server 'jzdbiufo'.
2010-05-10 15:22:06.490
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:22:06.513
[ctssd(9765310)]CRS-2402:The Cluster Time Synchronization Service aborted on host jzdbiufo. Details at (:ctss_css_init1:) in /oracle/app/11.2.0/grid/log/jzdbiufo/ctssd/octssd.log.
2010-05-10 15:22:06.520
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:06.550
[ohasd(4456806)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'jzdbiufo'.
2010-05-10 15:22:06.697
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:06.868
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:07.038
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:07.056
[ohasd(4456806)]CRS-2765:Resource 'ora.cssd' has failed on server 'jzdbiufo'.
2010-05-10 15:22:07.210
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:07.381
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:07.551
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:22:16.783
[cssd(6029520)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:22:23.271
[cssd(6029520)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:22:24.721
[cssd(6029520)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:22:31.678
[cssd(6029520)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:22:58.291
[cssd(6029520)]CRS-1612:Network communication with node jzdbnc (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.741 seconds
2010-05-10 15:23:06.292
[cssd(6029520)]CRS-1611:Network communication with node jzdbnc (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.740 seconds
2010-05-10 15:23:31.691
[cssd(6029520)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:23:32.432
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:23:32.904
[/oracle/app/11.2.0/grid/bin/oraagent.bin(5243064)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/oracle/app/11.2.0/grid/log/jzdbiufo/agent/ohasd/oraagent_grid/oraagent_grid.log"
2010-05-10 15:23:36.240
[cssd(7143634)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:23:38.862
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:23:42.191
[cssd(7143634)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:23:43.612
[cssd(7143634)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:23:47.928
[cssd(7143634)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:24:47.942
[cssd(7143634)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:24:52.471
[cssd(7143640)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:24:56.013
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:24:58.524
[cssd(7143640)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:24:59.942
[cssd(7143640)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:25:04.024
[cssd(7143640)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:26:04.037
[cssd(7143640)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:26:08.590
[cssd(9765338)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:26:10.161
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:26:14.713
[cssd(9765338)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:26:16.131
[cssd(9765338)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:26:21.233
[cssd(9765338)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:27:21.247
[cssd(9765338)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:27:27.281
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:27:31.316
[cssd(11403498)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:27:37.276
[cssd(11403498)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:27:38.714
[cssd(11403498)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:27:42.826
[cssd(11403498)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:28:42.840
[cssd(11403498)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:28:47.578
[cssd(6488556)]CRS-1713:CSSD daemon is started in clustered mode
2010-05-10 15:28:50.447
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:28:53.652
[cssd(6488556)]CRS-1707:Lease acquisition for node jzdbiufo number 2 completed
2010-05-10 15:28:55.086
[cssd(6488556)]CRS-1605:CSSD voting file is online: /dev/rhdisk2; details in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log.
2010-05-10 15:28:59.200
[cssd(6488556)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jzdbnc jzdbiufo .
2010-05-10 15:29:59.214
[cssd(6488556)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00011:) in /oracle/app/11.2.0/grid/log/jzdbiufo/cssd/ocssd.log
2010-05-10 15:30:07.582
[ohasd(4456806)]CRS-2765:Resource 'ora.diskmon' has failed on server 'jzdbiufo'.
2010-05-10 15:30:09.668
[ohasd(4456806)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.

錯誤資訊比較多,但是仔細觀察不難發現,Oracle一直在進行時間同步的相關操作。

隨後出現了網路問題,導致節點2CLUSTER中被踢出,當節點2上的CRS等程式嘗試重啟時,又由於CLUSTER的同步服務錯誤而異常中止。

看來導致問題的原因和11.2的時間同步有關。

 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-695061/,如需轉載,請註明出處,否則將追究法律責任。

相關文章