Notes from a RAC Split-Brain (Brain Split) Analysis
DB version: 11.2.0.4, 64-bit
OS version: AIX 6.1
2. Error Symptoms
Node 1 (node1) ASM instance alert log:
Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:08:48 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:08:48 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=10.4585 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DG)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DG) thread=2 ckpt=10.4585
NOTE: SMON did instance recovery for group DG domain 1
NOTE: SMON starting instance recovery for group VOTE domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=15.26 group=2 (VOTE)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (VOTE)
NOTE: SMON successfully validated lock domain 2
NOTE: advancing ckpt for group 2 (VOTE) thread=2 ckpt=15.26
NOTE: SMON did instance recovery for group VOTE domain 2
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:21:06 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:25:08 2015
Reconfiguration started (old inc 12, new inc 14)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:25:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:51:08 2015
Reconfiguration started (old inc 14, new inc 18)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:51:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 10:15:52 2015
Reconfiguration started (old inc 18, new inc 20)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 10:15:53 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:06 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Mon Jan 12 12:20:26 2015
Reconfiguration started (old inc 20, new inc 22)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 12:20:26 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:26 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=11.4610 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
Mon Jan 12 12:32:50 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 13:44:50 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Private Interface 'en0' configured from GPnP for use as a private interconnect.
[name='en0', type=1, ip=169.254.189.151, mac=00-90-fa-52-21-be, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'en2' configured from GPnP for use as a public interface.
[name='en2', type=1, ip=192.168.0.12, mac=00-90-fa-52-22-3c, net=192.168.0.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/grid/asm/dbs/arch
Autotune of undo retention is turned on.
Note: reconfigurations ran again and again on Mon Jan 12 2015, at 09:08:48, 09:21:06, 09:25:08, 09:51:08, 10:15:52 and 12:20:26; in other words the cluster membership kept changing as the peer ASM instance dropped out and rejoined, which is the classic split-brain symptom. The startup log also shows en0 as the private interconnect NIC (HAIP address 169.254.189.151) and en2 as the public NIC (IP 192.168.0.12).
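To see the flapping at a glance, the "Reconfiguration started" entries and the alert-log timestamp line that precedes each of them can be pulled out programmatically. A minimal sketch, assuming a local copy of the node-1 ASM alert log saved as alert_+ASM1.log (the file name is an assumption, not a path from this system):

import re

ALERT_LOG = "alert_+ASM1.log"   # assumed: local copy of the node-1 ASM alert log

ts_pat   = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
rcfg_pat = re.compile(r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)")

last_ts = None
with open(ALERT_LOG, errors="replace") as f:
    for raw in f:
        line = raw.rstrip("\n")
        if ts_pat.match(line):
            last_ts = line                 # remember the timestamp line above the event
            continue
        m = rcfg_pat.search(line)
        if m:
            print(f"{last_ts}: incarnation {m.group(1)} -> {m.group(2)}")

Run against the excerpt above it prints one line per membership change (incarnations 8->10, 10->12, 12->14, 14->18, 18->20, 20->22), which makes the repeated joins and evictions obvious.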
Node 2 (node2) ASM instance alert log:
Mon Jan 12 09:09:29 2015
NOTE: client exited [4719298]
Mon Jan 12 09:09:29 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 09:09:29 2015
PMON (ospid: 3867214): terminating the instance due to error 471
Mon Jan 12 09:51:49 2015
Shutting down instance (abort)
License high water mark = 2
USER (ospid: 2425590): terminating the instance
Instance terminated by USER, pid = 2425590
Mon Jan 12 09:51:50 2015
Instance shutdown complete
Mon Jan 12 10:16:27 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jan 12 12:21:07 2015
NOTE: client exited [2098012]
Mon Jan 12 12:21:07 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 12:21:09 2015
PMON (ospid: 2818726): terminating the instance due to error 481
Instance terminated by PMON, pid = 2818726
Mon Jan 12 12:33:30 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Mon Jan 12 12:38:19 2015
NOTE: client jzhprd2:jzhprd registered, osid 3146138, mbr 0x1
Mon Jan 12 12:38:51 2015
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.13)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Mon Jan 12 13:01:34 2015
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Mon Jan 12 13:01:34 2015
NOTE: client exited [4915336]
NOTE: force a map free for map id 2
Mon Jan 12 13:01:36 2015
PMON (ospid: 3212222): terminating the instance due to error 481
Instance terminated by PMON, pid = 3212222
Mon Jan 12 13:46:02 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Note: the entries above show that the ASMB process on node jzh2 started failing at 09:09:29. When ASM misbehaves like this, the CRS and CSS logs have to be checked next.
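Before moving on to the clusterware logs, it helps to lay node 2's crash/restart history out as a timeline. A minimal sketch along the same lines as the one above, again assuming a local copy of the alert log (here named alert_+ASM2.log):

import re

ALERT_LOG = "alert_+ASM2.log"   # assumed: local copy of the node-2 ASM alert log

ts_pat = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
events = ("terminating the instance",   # PMON/USER killing the instance
          "Shutting down instance",     # manual shutdown abort
          "Starting ORACLE instance")   # subsequent restart

last_ts = None
with open(ALERT_LOG, errors="replace") as f:
    for raw in f:
        line = raw.rstrip("\n")
        if ts_pat.match(line):
            last_ts = line
        elif any(e in line for e in events):
            print(f"{last_ts}: {line.strip()}")

Against the excerpt above this yields terminations at roughly 09:09, 09:51, 12:21 and 13:01, each followed by a fresh startup, which lines up with the reconfigurations seen on node 1.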
2015-01-12 09:09:30.742: [ CSSCLNT][1]clssscConnect: gipc request failed with 13 (1a)
CSS has failed here. (CSS, Cluster Synchronization Services, relies on two mechanisms: the network heartbeat and the disk heartbeat.)
2015-01-12 09:09:30.742: [ CSSCLNT][1]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_)) failed, rc 13
2015-01-12 09:09:30.745: [ CRSRTI][1] CSS is not ready. Received status 3
CSS not ready
2015-01-12 09:09:30.745: [ CRSMAIN][1] First attempt: init CSS context failed. Error = 3
[ clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzh2DBG_CRSD))
The first attempt failed.
2015-01-12 09:09:30.812: [ clsdmt][515]PID for the Process [4522624], connkey 1
2015-01-12 09:09:30.812: [ clsdmt][515]Creating PID [4522624] file for home /u01/app/oracle/grid/asm host jzh2 bin crs to /u01/app/oracle/grid/asm/crs/init/
2015-01-12 09:09:30.812: [ clsdmt][515]Writing PID [4522624] to the file [/u01/app/oracle/grid/asm/crs/init/jzh2.pid]
2015-01-12 09:09:31.863: [ CRSMAIN][1] CRS Daemon Starting --> CRS is starting (the crsd daemon itself is fine; this is verified shortly)
Note: the above shows that the CRS daemon itself is fine, but at 09:09:30 the CSS cluster synchronization service failed. CSS depends on the disk heartbeat and the network heartbeat, so one or both of these must be in trouble. Next, look at jzh1's ocssd.log to see what node 1 (jzh1) was doing around 2015-01-12 09:09:30.
2015-01-12 09:09:00.934: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:00.934: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:04.946: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:08.980: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:12.994: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:16.758: [ CSSD][2577]clssnmSetupReadLease: status 1
2015-01-12 09:09:16.762: [ CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
2015-01-12 09:09:16.762: [ CSSD][2577]clssgmDoneQEle: re-queueing req 110617b30 status 1
2015-01-12 09:09:16.763: [ CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (111b38850), operation status 1, client status 0
2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:17.009: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:21.028: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:25.349: [ CSSD][3862]clssnmSendingThread: sent 3 status msgs to all nodes
2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:29.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:33.362: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:37.366: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:41.377: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:09:45.394: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 0, events 0x8, state 0x0
2015-01-12 09:09:49.371: [ CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 1, events 0x8, state 0x0
Note: node1's CSSD process is sending status messages to all nodes. Why? After the RAC stack starts (node1's ASM log shows a reconfiguration at 09:08:48), each node writes its information to the OCR and the voting disk; the master then gathers this information and broadcasts to every node who the master is and how many nodes exist, node information is recorded in the voting disk, and voting takes place. Here the cluster has two members, member 0 (jzh1) and member 1 (jzh2), which also indicates the CRSD side is fine (as verified above). It further means the other node can still write its information to the voting disk, i.e. the disk heartbeat appears healthy (verified again below).
Continuing down the log:
2015-01-12 09:17:02.539: [ CSSD][4376]clssscSelect: cookie accept request 110991628
2015-01-12 09:17:02.539: [ CSSD][4376]clssnmeventhndlr: gipcAssociate endp 1d2198b in container 73 type of conn gipcha
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnSetNames: hostname jzh2 privname 10.10.0.20 con 1d2198b --> connection to jzh2; its private name/IP is 10.10.0.20 (recall that jzh1's ASM log showed the interconnect using the HAIP address 169.254.189.151)
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSetNodeProperties: properties node 2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17 --> jzh2's node properties
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmConnComplete: node node2 softver 11.2.0.4.0 --> node2's software version
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: Incoming connect from node 2 (node2) ninf endp 0, probendp 0, endp 1d2198b
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmSendConnAck: connected to node 2, jzh2, con (1d2198b), state 0
2015-01-12 09:17:02.540: [ CSSD][4376]clssnmCompleteConnProtocol: node jzh2, 2, uniqueness 1421024974, msg uniqueness 1421024974, endp 1d2198b probendp 0 endp 1d2198b
2015-01-12 09:17:03.044: [ CSSD][4376]clssnmHandleJoin: node 2 JOINING, state 0->1 ninfendp 1d2198b
2015-01-12 09:17:03.354: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-12 09:17:03.355: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/jzh2), LATSvalid(0), nodeInfoDHB uniqueness(1420692326) --> reading the disk heartbeat (DHB)
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(jzh2) number(2), nodeInfoDHB(1420692326), readInfo(1421024974)
2015-01-12 09:17:03.360: [ CSSD][2577]clssnmvDHBValidateNcopy: Saving DHB uniqueness for node jzh2, number 2 latestInfo(1421024974), readInfo(1421024974), nodeInfoDHB(1420692326) --> saving jzh2's disk heartbeat information
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Initiating sync 315891915
2015-01-12 09:17:03.754: [ CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 1, from -1, changes 20
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> disk heartbeat timeout set to 200 s (the 11g default)
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed --> the new local and remote disk timeouts take effect once the sync completes
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 315891915
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE --> node 1 is alive
2015-01-12 09:17:03.754: [ CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE --> node 2 is alive
Note: node 1 (jzh1) sees both jzh1 and jzh2 as alive. Keep in mind this information still comes from the disk heartbeat, which once again confirms the disk heartbeat is working.
Continuing down the log:
2015-01-12 09:24:20.392: [ CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds --> node1's CSSD is now polling jzh2: the missed heartbeats have reached the 50% warning level and jzh2 will be removed in 14.647 seconds (remember this number). The disk heartbeat was shown to be fine above, so why this warning?
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353 --> misstime is 15.353 s (remember this number); 15.353 s + 14.647 s = 30 s
2015-01-12 09:24:23.578: [ CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1) --> the disk timeout has been lowered to 27 s; wasn't it 200 s? (this is picked up in the timeout sketch after this log excerpt)
2015-01-12 09:24:23.578: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771770, LATS 706581505, lastSeqNo 19771331, uniqueness 1421024974, timestamp 1421025905/706212683 --> node2 (jzh2)'s disk heartbeat is still being seen, so the long 200 s timeout is no longer needed; "DHB has rcfg" once more shows the disk heartbeat is fine, but "no network HB" points at the network heartbeat as the real problem
2015-01-12 09:24:23.618: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025863/706581544
2015-01-12 09:24:24.082: [ CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771771, LATS 706582008, lastSeqNo 19771770, uniqueness 1421024974, timestamp 1421025905/706213190
2015-01-12 09:24:24.119: [ CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025864/706582045 --> the local node keeps writing its disk heartbeat (disk ping) to the voting disk
2015-01-12 09:24:24.398: [ CSSD][3862]clssnmSendingThread: sending status msg to all nodes --> what is node1 about to tell everyone?
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmNeedConfReq: No configuration to change
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5) --> node2 (jzh2) is being terminated; misstime is 31.566 s. Recall 15.353 + 14.647 = 30 s above, which is Oracle's default network-heartbeat threshold (misscount = 30 s), and it has now been exceeded
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> the voting disk will be updated; voting begins
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) more than disk timeout of 27000 after the last NHB (1421025890, 706197520) --> yet again the disk heartbeat is fine; it is the network heartbeat (NHB) that has been missing for longer than the timeout
2015-01-12 09:24:39.805: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 1
2015-01-12 09:24:39.805: [ CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, jzh2, from the cluster in incarnation 315891916, node birth incarnation 315891915, death incarnation 315891916, stateflags 0x234000 uniqueness value 1421024974 --> node1 is evicting node2 (jzh2) from the cluster
2015-01-12 09:24:39.806: [ default][4119]kgzf_gen_node_reid2: generated reid cid=41daa0e19d0a6f84ff29b9f37a2f1a38,icin=315891908,nmn=2,lnid=315891915,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmrFenceSage: Fenced node jzh2, number 2, with EXADATA, handle 0
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: req to node 2, kill time 706597731 --> node1 sends a shutdown/kill request to node2
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmsendmsg: not connected to node 2 --> cannot connect to node2
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmSendShutdown: Send to node 2 failed --> to protect data consistency node2 must be shut down, but the shutdown request cannot be delivered
2015-01-12 09:24:39.806: [ CSSD][4119]clssnmWaitOnEvictions: Start --> the eviction begins
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh1, number 1, current state 3, proposed state 3, current unique 1420396557, proposed unique 1420396557, prevConuni 0, birth 315891909
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmUpdateNodeState: node jzh2, number 2, current state 5, proposed state 0, current unique 1421024974, proposed unique 1421024974, prevConuni 1421024974, birth 315891915
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2, state 5
2015-01-12 09:25:07.095: [ CSSD][4376]clssnmDeactivateNode: node 2 (jzh2) left cluster --> node2 (jzh2) has left the cluster
2015-01-12 10:11:27.825: [ CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(2)
2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: Node jzh1, number 1, is EXADATA fence capable
2015-01-12 10:11:27.825: [ CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress val 1, changes 33
2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000 --> local and remote disk timeouts restored to 200 s
2015-01-12 10:11:27.825: [ CSSD][4376]clssnmHandleSync: initleader 1 newleader 1 --> node1 is now the leader, i.e. the master node
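The 200 s vs 27 s question raised above comes down to CSS keeping two disk timeouts: the long disk timeout (the disktimeout parameter, 200 s by default in 11g) applies in steady state, while once network heartbeats are being missed and a reconfiguration is pending, CSS switches to the short disk timeout, commonly given as misscount minus reboottime, i.e. 30 - 3 = 27 s with default settings. A small sketch of the arithmetic, assuming the 11gR2 defaults rather than values read from this cluster (they can be confirmed with crsctl get css misscount / disktimeout / reboottime):

# CSS timeout arithmetic, assuming 11gR2 defaults (not read from this cluster).
misscount_s   = 30    # network heartbeat threshold, in seconds
reboottime_s  = 3     # time allowed for an evicted node to reboot
disktimeout_s = 200   # long disk I/O timeout used in steady state

short_disk_timeout_s = misscount_s - reboottime_s   # 27 s -> matches "set to 27000 ms"

# Numbers from the clssnmPollingThread warning above:
misstime_s   = 15.353   # heartbeats already missed when the 50% warning fired
removal_in_s = 14.647   # removal countdown printed next to the warning

print(short_disk_timeout_s)               # 27
print(round(misstime_s + removal_in_s))   # 30 -> exactly misscount, hence the eviction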
Note: from the analysis above, the disk heartbeat is healthy; the problem lies with the network heartbeat.
Node 2 (node2) CSSD log (ocssd.log):
2015-01-12 09:25:19.224: [ CSSD][1]clssgmSuspendAllGrocks: done
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State val 2, from 5, changes 13
2015-01-12 09:25:19.224: [ CSSD][1]clssgmUpdateEventValue: ConnectedNodes val 315891915, changes 5
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(315891915)
2015-01-12 09:25:19.224: [ CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(315891915)
2015-01-12 09:25:19.224: [ CSSD][1]clssgmStartNMMon: completed node cleanup
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSendSync: syncSeqNo(315891916), indicating EXADATA fence initialization complete
2015-01-12 09:25:19.224: [ CSSD][4119]List of nodes that have ACKed my sync: 2
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmWaitForAcks: done, syncseq(315891916), msg type(11)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion:node2 product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmNeedConfReq: No configuration to change
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, jzh1, misstime(30000) state(5) --> on node2 (jzh2) the same sync update runs: node1's misstime has hit 30 s (the network heartbeat threshold), so node2 wants to terminate node1
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s) --> waiting for vote acknowledgements
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckSplit: Node 1, jzh1, is alive, DHB (1421025877, 706595081) more than disk timeout of 27000 after the last NHB (1421025847, 706565177) --> node1 (jzh1)'s disk heartbeat is fine
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: My cohort: 2 --> the local cohort (node number 2)
2015-01-12 09:25:19.224: [ CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1 --> the surviving cohort is node1 (jzh1)
2015-01-12 09:25:19.224: [ CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, jzh2, is smaller than cohort of 1 nodes led by node 1, jzh1, based on map type 2 --> the local node (node 2, jzh2) aborts itself to avoid a split-brain; node1 (jzh1) leads the surviving cohort (a simplified sketch of this rule follows the log excerpt below)
2015-01-12 09:25:19.224: [ CSSD][4119]###################################
2015-01-12 09:25:19.224: [ CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread --> CSSD terminates itself from the clssnmRcfgMgrThread thread
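The CSSNM00008 message above reflects the survival rule CSS applies when the cluster partitions: the larger cohort survives, and when the cohorts are the same size, as here with one node on each side, the cohort containing the lowest-numbered node wins, so node 2 aborts itself while node 1 carries on. A simplified illustration of that rule, not Oracle's actual implementation (which weighs additional factors):

def surviving_cohort(cohorts):
    """cohorts: sets of node numbers that can still reach each other over the interconnect.
    Simplified CSS rule: the largest cohort wins; on a tie, the cohort holding the
    lowest node number wins. Illustrative only."""
    return max(cohorts, key=lambda c: (len(c), -min(c)))

# The partition in this incident: jzh1 (node 1) alone versus jzh2 (node 2) alone.
print(surviving_cohort([{1}, {2}]))   # {1} -> node 1's cohort survives, node 2 must abort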
At this point it is clear that the RAC split-brain was caused by a failure of the network heartbeat communication over the private interconnect between the nodes.
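Given that conclusion, the natural follow-up is to verify plain reachability between the nodes' private interconnect addresses. The sketch below takes jzh2's private IP 10.10.0.20 from the CSSD log; the jzh1 address is a placeholder to be filled in (oifcfg getif on either node lists the configured interfaces):

import subprocess

# jzh2's private IP comes from the CSSD log above; the jzh1 entry is a PLACEHOLDER,
# substitute the real private address of node 1 before running.
PRIVATE_IPS = {"jzh2": "10.10.0.20", "jzh1": "<jzh1-private-ip>"}

for node, ip in PRIVATE_IPS.items():
    # Three ICMP probes; a non-zero return code means the peer did not answer.
    rc = subprocess.run(["ping", "-c", "3", ip],
                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL).returncode
    print(f"{node} ({ip}): {'reachable' if rc == 0 else 'NOT reachable'}")

Beyond a basic ping, the usual next steps are to check NIC and switch error counters (netstat -v on AIX) and to run cluvfy comp nodecon -n all so that the clusterware verifies the interconnect end to end.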
Source: ITPUB blog, http://blog.itpub.net/27767315/viewspace-2105679/. Please credit the source when reprinting.