Oracle: has a disk HB, but no network HB, DHB has rcfg, wrtcnt

Posted by xulongxc on 2015-06-23
Oracle: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg, wrtcnt

The database suddenly stopped running.
alert.log
Archived Log entry 52886 added for thread 2 sequence 27344 ID 0x980341ff dest 1:
Mon Jun 08 22:00:00 2015
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Mon Jun 08 22:00:00 2015
Starting background process VKRM
Mon Jun 08 22:00:00 2015
VKRM started with pid=291, OS id=21332 
Tue Jun 09 00:11:51 2015
Thread 2 advanced to log sequence 27346 (LGWR switch)
  Current log# 7 seq# 27346 mem# 0: +DATA/oradb/onlinelog/group_7.453.868440067
  Current log# 7 seq# 27346 mem# 1: +DATA/oradb/onlinelog/group_7.457.868440069
Tue Jun 09 00:11:54 2015
Archived Log entry 52888 added for thread 2 sequence 27345 ID 0x980341ff dest 1:
Tue Jun 09 02:00:00 2015
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Tue Jun 09 12:08:08 2015
WARNING: db_recovery_file_dest is same as db_create_file_dest
Tue Jun 09 14:17:38 2015
NOTE: ASMB terminating
Errors in file /data/oracle/diag/rdbms/oradb/oradb2/trace/oradb2_asmb_25897.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 386 Serial number: 5
Errors in file /data/oracle/diag/rdbms/oradb/oradb2/trace/oradb2_asmb_25897.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 386 Serial number: 5
ASMB (ospid: 25897): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 25897
Tue Jun 23 14:52:20 2015

alert_+ASM2.log

SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:8:9} */ 
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent *//* {0:8:9} */
Mon Jun 01 12:22:18 2015
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
Mon Jun 01 12:22:18 2015
NOTE: [crsd.bin@rac0102.tempus.cn (TNS V1-V3) 25061] opening OCR file
Starting background process ASMB
Mon Jun 01 12:22:18 2015
ASMB started with pid=25, OS id=25075 
Mon Jun 01 12:22:18 2015
NOTE: client +ASM2:+ASM registered, osid 25077, mbr 0x0
Mon Jun 01 12:22:19 2015
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Voting file relocation is required in diskgroup DATA
NOTE: Attempting voting file relocation on diskgroup DATA
Mon Jun 01 12:25:35 2015
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.2.43)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Mon Jun 01 12:26:05 2015
NOTE: client oradb2:oradb registered, osid 25903, mbr 0x1
Tue Jun 09 14:17:38 2015
Tue Jun 09 14:17:38 2015
NOTE: client exited [25061]
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Tue Jun 09 14:17:39 2015
Received an instance abort message from instance 1
Tue Jun 09 14:17:39 2015
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
Please check instance 1 alert and LMON trace files for detail.
opidcl aborting process unknown ospid (25073) as a result of ORA-29709
LMD0 (ospid: 25013): terminating the instance due to error 481
Instance terminated by LMD0, pid = 25013


1. Running crsctl start crs reported that the service was already started.
2. Running crsctl check crs returned:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
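3. After a forced stop with crsctl stop crs -f and a restart, the problem was unchanged: the cssd service could not start normally.
4. After a normal reboot of the operating system, crsctl start crs still failed. The errors were as follows: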
ocssd.log
2015-06-23 13:52:10.599: [    CSSD][1087736128]clssscSelect: cookie accept request 0x2aaab009e7c0
2015-06-23 13:52:10.599: [    CSSD][1087736128]clssscevtypSHRCON: getting client with cmproc 0x2aaab009e7c0
2015-06-23 13:52:10.599: [    CSSD][1087736128]clssgmRegisterClient: proc(4/0x2aaab009e7c0), client(1/0x2aaab009e930)
2015-06-23 13:52:10.599: [    CSSD][1087736128]clssgmJoinGrock: global grock CRF- new client 0x2aaab009e930 with con 0x37e8, requested num -1, flags 0x4000e00
2015-06-23 13:52:10.599: [    CSSD][1087736128]clssgmJoinGrock: ignoring grock join for client not requiring fencing until group information has been received from the master; group name CRF-, member number -1, flags 0x4000e00
2015-06-23 13:52:10.600: [    CSSD][1087736128]clssgmDiscEndpcl: gipcDestroy 0x37e8
2015-06-23 13:52:10.600: [    CSSD][1087736128]clssgmDeadProc: proc 0x2aaab009e7c0
2015-06-23 13:52:10.600: [    CSSD][1087736128]clssgmDestroyProc: cleaning up proc(0x2aaab009e7c0) con(0x37b9) skgpid  ospid 6002 with 0 clients, refcount 0
2015-06-23 13:52:10.600: [    CSSD][1087736128]clssgmDiscEndpcl: gipcDestroy 0x37b9
2015-06-23 13:52:10.732: [    CSSD][1113405760]clssnmSendingThread: sending join msg to all nodes
2015-06-23 13:52:10.732: [    CSSD][1113405760]clssnmSendingThread: sent 4 join msgs to all nodes
2015-06-23 13:52:11.231: [    CSSD][1104025920]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2015-06-23 13:52:11.299: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705272, LATS 3151914, lastSeqNo 85705271, uniqueness 1433132495, timestamp 1435038730/2671193568
2015-06-23 13:52:12.233: [    CSSD][1104025920]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2015-06-23 13:52:12.301: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705273, LATS 3152914, lastSeqNo 85705272, uniqueness 1433132495, timestamp 1435038731/2671194568
2015-06-23 13:52:13.235: [    CSSD][1104025920]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2015-06-23 13:52:13.303: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705274, LATS 3153914, lastSeqNo 85705273, uniqueness 1433132495, timestamp 1435038732/2671195568
2015-06-23 13:52:14.237: [    CSSD][1104025920]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2015-06-23 13:52:14.306: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705275, LATS 3154914, lastSeqNo 85705274, uniqueness 1433132495, timestamp 1435038733/2671196578
2015-06-23 13:52:14.741: [    CSSD][1113405760]clssnmSendingThread: sending join msg to all nodes
2015-06-23 13:52:14.741: [    CSSD][1113405760]clssnmSendingThread: sent 4 join msgs to all nodes
2015-06-23 13:52:15.240: [    CSSD][1104025920]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2015-06-23 13:52:15.309: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705276, LATS 3155924, lastSeqNo 85705275, uniqueness 1433132495, timestamp 1435038734/2671197578
2015-06-23 13:52:15.609: [    CSSD][1087736128]clssscSelect: cookie accept request 0x2aaaac02a570
2015-06-23 13:52:15.610: [    CSSD][1087736128]clssgmAllocProc: (0x2aaab009e7c0) allocated
2015-06-23 13:52:15.610: [    CSSD][1087736128]clssgmClientConnectMsg: properties of cmProc 0x2aaab009e7c0 - 1,2,3,4,5
2015-06-23 13:52:15.610: [    CSSD][1087736128]clssgmClientConnectMsg: Connect from con(0x3884) proc(0x2aaab009e7c0) pid(6002) version 11:2:1:4, properties: 1,2,3,4,5
2015-06-23 13:52:15.610: [    CSSD][1087736128]clssgmClientConnectMsg: msg flags 0x0000
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssscSelect: cookie accept request 0x2aaab009e7c0
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssscevtypSHRCON: getting client with cmproc 0x2aaab009e7c0
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmRegisterClient: proc(4/0x2aaab009e7c0), client(1/0x2aaab00e3250)
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmJoinGrock: global grock CRF- new client 0x2aaab00e3250 with con 0x38b3, requested num -1, flags 0x4000e00
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmJoinGrock: ignoring grock join for client not requiring fencing until group information has been received from the master; group name CRF-, member number -1, flags 0x4000e00
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmDiscEndpcl: gipcDestroy 0x38b3
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmDeadProc: proc 0x2aaab009e7c0
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmDestroyProc: cleaning up proc(0x2aaab009e7c0) con(0x3884) skgpid  ospid 6002 with 0 clients, refcount 0
2015-06-23 13:52:15.611: [    CSSD][1087736128]clssgmDiscEndpcl: gipcDestroy 0x3884

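The log indicates a communication problem between the two nodes, although the network appeared to be normal when checked.
I ran crsctl stop has -f to force-stop all services, then tried unplugging the heartbeat (interconnect) cable, waiting about five minutes, and plugging it back in.
I then started the stack with crsctl start has 
and the system returned to normal.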
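
The repeated clssnmvDHBValidateNCopy messages in the ocssd.log excerpt can also be read mechanically. The following is a minimal sketch, not an Oracle tool: it assumes each log record sits on a single line, and the names DHB_PATTERN, disk_only_heartbeats and disk_hb_is_advancing are made up for illustration. It extracts the wrtcnt counter from every "has a disk HB, but no network HB" record. A counter that keeps increasing suggests the remote node is alive and still writing its disk heartbeat, which points toward the private interconnect rather than the node itself.

```python
import re

# Pattern for the CSSD message quoted in the ocssd.log excerpt.
# Assumption: each log record is on a single line (no hard wrapping).
DHB_PATTERN = re.compile(
    r"clssnmvDHBValidateNCopy: node (?P<node>\d+), (?P<name>[^,]+), "
    r"has a disk HB, but no network HB, DHB has rcfg (?P<rcfg>\d+), "
    r"wrtcnt, (?P<wrtcnt>\d+)"
)

# Two records copied from the log in this post, with the line wrap removed.
SAMPLE = """\
2015-06-23 13:52:11.299: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705272, LATS 3151914, lastSeqNo 85705271, uniqueness 1433132495, timestamp 1435038730/2671193568
2015-06-23 13:52:12.301: [    CSSD][1097464128]clssnmvDHBValidateNCopy: node 1, rac0101, has a disk HB, but no network HB, DHB has rcfg 244238741, wrtcnt, 85705273, LATS 3152914, lastSeqNo 85705272, uniqueness 1433132495, timestamp 1435038731/2671194568
"""


def disk_only_heartbeats(log_text):
    """Return (node name, wrtcnt) for every 'disk HB, but no network HB' record."""
    return [
        (m.group("name"), int(m.group("wrtcnt")))
        for m in DHB_PATTERN.finditer(log_text)
    ]


def disk_hb_is_advancing(log_text, name):
    """True if the named node's wrtcnt strictly increases across at least two records."""
    counts = [w for n, w in disk_only_heartbeats(log_text) if n == name]
    return len(counts) >= 2 and all(b > a for a, b in zip(counts, counts[1:]))


if __name__ == "__main__":
    for name, wrtcnt in disk_only_heartbeats(SAMPLE):
        print(name, wrtcnt)
    print("disk heartbeat advancing:", disk_hb_is_advancing(SAMPLE, "rac0101"))
```

If the counter is advancing while no network heartbeat arrives, checking the interconnect first is consistent with what resolved the problem here.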


Reference material follows:

Issue 2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon; ocssd.bin is not running

Symptoms:

1. The command "$GRID_HOME/bin/crsctl check crs" returns the errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
2. The command "ps -ef | grep d.bin" does not show a line similar to:
    oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running, but aborts after the message "CLSGPNP_CALL_AGAIN" appears in ocssd.log
4. ocssd.log shows the following:

   2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065

5. With 3 or more nodes, a cluster formed by 2 nodes works fine, but it fails when the 3rd node joins, and ocssd.log shows the following:

   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than cohort of 2 nodes led by node 1, racnode1, based on map type 2
   2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
   2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

6. ocssd.bin startup times out after 10 minutes

   2012-04-08 12:04:33.153: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
   ......
   2012-04-08 12:14:31.994: [    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
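
The first symptom in the list above can be recognised from the crsctl check crs output alone. The sketch below is a hedged illustration that only parses text already captured from the command; it does not call crsctl itself. The function names crs_codes and css_unreachable are hypothetical, and the rule "CRS-4638 present together with CRS-4530" is simply the combination quoted in this post, not an official classification.

```python
import re

# CRS message codes look like "CRS-4530" in the output quoted in this post.
CRS_CODE = re.compile(r"\bCRS-\d{4}\b")

# Output copied from the crsctl check crs result shown in this post.
SAMPLE_OUTPUT = """\
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
"""


def crs_codes(output):
    """Return the CRS codes in the order they appear in the captured output."""
    return CRS_CODE.findall(output)


def css_unreachable(output):
    """True when OHAS reports online (CRS-4638) but CSS cannot be contacted (CRS-4530)."""
    codes = set(crs_codes(output))
    return "CRS-4638" in codes and "CRS-4530" in codes


if __name__ == "__main__":
    print(crs_codes(SAMPLE_OUTPUT))
    print("CSS unreachable:", css_unreachable(SAMPLE_OUTPUT))
```

When this check is true, ocssd.log is the natural next place to look, as the remaining symptoms describe.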

Possible causes:

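1. The voting disk is missing or inaccessible
2. Multicast is not working (for 11.2.0.2 and later)
3. The private network is not working; ping or traceroute shows the destination is unreachable. Or ping/traceroute work, but a firewall is enabled on the private network
4. A normal ping works on the private network, but with jumbo frames enabled (MTU: 9000+), a ping with a jumbo-frame-sized payload (for example: ping -s 8900 ) fails. Or some cluster nodes have jumbo frames set (MTU: 9000) while the problem node does not (MTU: 1500)
5. gpnpd does not come up and is stuck in the dispatch thread
6. Too many disks are discovered through asm_diskstring, or the scan is too slow because of Bug 13454354 (seen only on Solaris 11.2.0.3)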
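
The jumbo-frame cause in the list above comes down to simple arithmetic. As a minimal sketch, assuming IPv4 without header options: the largest ICMP echo payload that fits in one unfragmented frame is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The per-node MTU values below are hypothetical and only illustrate the mismatch case; on Linux, the iputils ping option -M do can be used to forbid fragmentation when testing such a size.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest `ping -s` size that fits in a single unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def group_by_mtu(node_mtus):
    """Group node names by MTU; more than one group means the nodes disagree."""
    groups = {}
    for node, mtu in node_mtus.items():
        groups.setdefault(mtu, []).append(node)
    return {mtu: sorted(nodes) for mtu, nodes in groups.items()}


if __name__ == "__main__":
    # Hypothetical MTU settings for the two private interfaces.
    mtus = {"rac0101": 9000, "rac0102": 1500}
    for node, mtu in mtus.items():
        print(node, "mtu", mtu, "max payload", max_ping_payload(mtu))
    print("consistent:", len(group_by_mtu(mtus)) == 1)
```

The ping -s 8900 example from the list fits within an MTU of 9000 but not within 1500, which is why it separates the two cases.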


Solutions:

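1. Restore access to the voting disk by checking storage accessibility, disk permissions, and so on.
   If the voting disk in the OCR ASM disk group has been lost, start CRS in exclusive mode and recreate the voting disk: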
   # crsctl start crs -excl
   # crsctl replace votedisk
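2. Refer to Document 1212703.1 for how to test and fix multicast
3. Consult the network administrator to restore private network access or disable the firewall on the private network (on Linux, check service iptables status and service ip6tables status)
4. If jumbo frames are enabled on the network interface, ask the network administrator to enable them at the switch layer as well.
5. Kill the gpnpd.bin process on the node that is running normally; refer to Document 10105195.8
   Once the issues above are resolved, restart Grid Infrastructure.
   If ping/traceroute both work on the private network but the problem occurred during an upgrade from 11.2.0.1 to 11.2.0.2, check 
    for the workaround.
6. Limit the number of disks ASM scans by providing a more specific asm_diskstring; refer to 
   For Solaris 11.2.0.3, apply patch 13250497; see Document 1451367.1.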
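
The effect of a more specific asm_diskstring (item 6 above) can be pictured with ordinary wildcard matching. This is only an approximation under stated assumptions: Python's fnmatch stands in for ASM's own discovery logic, which may treat patterns differently, and every device path and pattern below is hypothetical. The point is just that a narrower pattern leaves fewer candidate devices to scan.

```python
import fnmatch

# Hypothetical device paths on a node; not taken from the system in this post.
DEVICES = [
    "/dev/sda1",
    "/dev/sdb1",
    "/dev/sdc1",
    "/dev/mapper/asm_data01",
    "/dev/mapper/asm_data02",
    "/dev/mapper/vg_root",
]


def candidates(pattern, devices=DEVICES):
    """Return the device paths matched by a wildcard pattern (fnmatch semantics)."""
    return [d for d in devices if fnmatch.fnmatchcase(d, pattern)]


if __name__ == "__main__":
    broad = candidates("/dev/*")             # fnmatch's * also matches "/"
    narrow = candidates("/dev/mapper/asm_*")
    print(len(broad), "devices match the broad pattern")
    print(len(narrow), "devices match the narrow pattern:", narrow)
```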





Source: "ITPUB Blog", link: http://blog.itpub.net/29440247/viewspace-1709227/. If you repost this article, please credit the source; otherwise legal action may be pursued.
