11.2 RAC時間同步異常(二)

yangtingkun發表於2011-05-12

一個客戶的11.2RAC環境,出現了一個節點DOWN掉的問題。

這篇繼續描述定位問題、解決問題的過程。

11.2 RAC時間同步異常(一): http://yangtingkun.itpub.net/post/468/517712

 

 

在檢查了ASM和資料庫的告警日誌後,發現錯誤是由於CLUSTER問題導致,而分析了CLUSTER上的告警資訊,進一步發現問題似乎和時間同步服務有關。

檢查otcssd日誌資訊發現:

2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm: The magnitude [31532764620334] of the offset [-31532764620334 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764620334] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm17: LT [1273476085sec 785752usec], MT [1305008850sec 406086usec], Delta [1412usec]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:25.786: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:25.786: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:26.886: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:26.886: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:26.886: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:26.887: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:26.887: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm: The magnitude [31532764520368] of the offset [-31532764520368 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764520368] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:26.888: [ CTSS][2571]ctssslave_swm17: LT [1273476086sec 887774usec], MT [1305008851sec 408142usec], Delta [1430usec]
2010-05-10 15:21:26.888: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:26.889: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:26.889: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:26.889: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:27.989: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:27.989: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:27.989: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:27.990: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:27.990: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm: The magnitude [31532764419342] of the offset [-31532764419342 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764419342] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm17: LT [1273476087sec 990725usec], MT [1305008852sec 410067usec], Delta [1347usec]
2010-05-10 15:21:27.991: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:27.991: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:27.991: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:27.991: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:29.062: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xc4], offset[-1467993347 ms]}, length=[8].
2010-05-10 15:21:29.091: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:29.091: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:29.091: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:29.092: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:29.092: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm: The magnitude [31532764319434] of the offset [-31532764319434 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764319434] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm17: LT [1273476089sec 92763usec], MT [1305008853sec 412197usec], Delta [1454usec]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:29.093: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:29.093: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:30.193: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:30.193: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:30.193: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:30.194: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:30.194: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm: The magnitude [31532764219409] of the offset [-31532764219409 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764219409] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm17: LT [1273476090sec 194781usec], MT [1305008854sec 414190usec], Delta [1382usec]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:30.195: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:30.195: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:31.295: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:31.295: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:31.295: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:31.296: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:31.296: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764119367] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm17: LT [1273476091sec 296807usec], MT [1305008855sec 416174usec], Delta [1431usec]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:31.297: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:31.297: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:32.397: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:32.397: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:32.397: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:59.396: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xc4], offset[-1467993047 ms]}, length=[8].
2010-05-10 15:22:02.684: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:22:02.684: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: sendtime{sec[1273476092], usec[397374]}, receivetime{sec[1273476122], usec[684693]}.
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: The RTT of sync msg [30287319] is too large for time sync to be accurate. Recommends retry. Returns [17].
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:22:02.684: [ CTSS][2571]ctsselect_msm: Failed in sync_with_master [17]
2010-05-10 15:22:02.684: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:22:02.685: [ CTSS][2571]ctsselect_msm: Sync interval returned in [1]
2010-05-10 15:22:04.592: [GIPCHAUP][1029] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 111411950 [000000000000025a] { gipchaEndpoint : port 'CTSSGROUP_2/811b-0c78-d761-0a0d', peer 'jzdbnc:9ead-4b66-b4c7-458c', srcCid 00000000-0000025a, dstCid 00000000-0000028a, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
2010-05-10 15:22:04.848: [ CSSCLNT][1543]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2010-05-10 15:22:04.848: [ CSSCLNT][1543]clssgsGroupGetStatus: communications failed (0/3/-1)

2010-05-10 15:22:04.848: [ CSSCLNT][1543]clssgsGroupGetStatus: returning 8

2010-05-10 15:22:04.848: [ CRSCCL][1543]Daemon exiting due to error in clssgsgrpstat retval = 8

2010-05-10 15:22:05.636: [ CTSS][1]Oracle Database CTSS Release 11.2.0.2.0 Production Copyright 2006, 2007 Oracle. All rights reserved.
2010-05-10 15:22:05.637: [ CTSS][1]ctss_scls_init: SCLs Context is 11098c370
[ clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzdbiufoDBG_CTSSD))
2010-05-10 15:22:05.664: [ clsdmt][515]PID for the Process [9765310], connkey 11
2010-05-10 15:22:05.664: [ clsdmt][515]Creating PID [9765310] file for home /oracle/app/11.2.0/grid host jzdbiufo bin ctss to /oracle/app/11.2.0/grid/ctss/init/
2010-05-10 15:22:05.664: [ clsdmt][515]Writing PID [9765310] to the file [/oracle/app/11.2.0/grid/ctss/init/jzdbiufo.pid]
2010-05-10 15:22:06.505: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:06.512: [ CTSS][1]ctss_css_init: clsssinit() failed with [3].
2010-05-10 15:22:06.512: [ CTSS][1](:ctss_css_init1:): CTSS daemon exiting. CSS unresponsive [3]
[ CTSS][1]SLOS : SLOS: cat=1311719766, pn=skgxnqtsz, dep=0, loc=SKGXN not av

2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CSS init failed [7]
2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CTSS init failed [7]
2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CTSS daemon aborting [7].
2010-05-10 15:22:06.512: [ CTSS][1]CTSS daemon aborting
2010-05-10 15:22:07.510: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:08.511: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:09.513: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].

在錯誤資訊中,出現了多次告警資訊:

[CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.

顯然Oracle的時間同步後臺一直在調整兩個伺服器的時間,不過令人意外的是,本來只相差幾秒的時間,為什麼調整後變成了1個小時左右,而且這裡的數值如此巨大。

簡單分析一下不難發現,Oracle的時間調整限制是1天:

SQL> set numw 15
SQL> select 24 * 60 * 60 * 1000000 from dual;

24*60*60*1000000
----------------
86400000000

那麼那個大的出奇的時間範圍是什麼呢:

SQL> select 31532764119367 / 1000000 / 60 / 60 / 24 from dual;

31532764119367/1000000/60/60/24
-------------------------------
364.96254767786

居然是365天,也就是1年的時間,這是巧合嗎,檢查兩個節點上的時間發現:

$ hostname
jzdbiufo
$ date
Mon May 10 16:31:33 GMT+08:00 2010

另一個節點:

# hostname
jzdbnc
# date
Tue May 10 15:38:04 GMT+08:00 2011

二者相差不是1個小時,而是差1個小時1年的時間差。

顯然是操作人員在手工調整時間的時候將一個節點上的日期錯誤的深入為2010年,導致兩個節點的時間差達到1年之久,這麼大的時間差,遠遠超過Oracle時間同步服務允許的最大限制,因此Oracle的時間同步只能在後臺選擇一個可接受的時間間隔,大概是0.13秒,來嘗試不斷的修改系統時間,以期達到不同兩個節點時間的目的。

CLUSTER出於不斷的同步時間的過程中,客戶又在兩個節點上進行了大資料量的匯入工作,由此導致了短時間內兩個節點間的通訊出現了超時,導致一個節點被踢出了CLUSTER,而由於時間同步的問題,導致了節點重啟後試圖再次加入到CLUSTER中報錯。

找到問題,解決就很容易了,關閉CLUSTER後,將錯誤的節點上的時間調整到當前時間,再次啟動CLUSTERDB,問題解決。

 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-695164/,如需轉載,請註明出處,否則將追究法律責任。

相關文章