Analyzing a RAC Brain Split (Split-Brain) Incident

Posted by 秋去天無邊 on 2016-05-23
1. Environment
DB version: 11.2.0.4, 64-bit
OS version: AIX 6.1
2. Symptoms
Node 1 ASM instance alert log:
Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE 
* dead instance detected - domain 2 invalid = TRUE 
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:08:48 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:08:48 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=10.4585 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 1 (DG)
NOTE: SMON successfully validated lock domain 1
NOTE: advancing ckpt for group 1 (DG) thread=2 ckpt=10.4585
NOTE: SMON did instance recovery for group DG domain 1
NOTE: SMON starting instance recovery for group VOTE domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: starting recovery of thread=2 ckpt=15.26 group=2 (VOTE)
NOTE: SMON waiting for thread 2 recovery enqueue
NOTE: SMON about to begin recovery lock claims for diskgroup 2 (VOTE)
NOTE: SMON successfully validated lock domain 2
NOTE: advancing ckpt for group 2 (VOTE) thread=2 ckpt=15.26
NOTE: SMON did instance recovery for group VOTE domain 2
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)
List of instances:
1 2 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:21:06 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:25:08 2015
Reconfiguration started (old inc 12, new inc 14)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:25:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 09:51:08 2015
Reconfiguration started (old inc 14, new inc 18)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 09:51:08 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 10:15:52 2015
Reconfiguration started (old inc 18, new inc 20)
List of instances:
1 2 (myinst: 1) 
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 10:15:53 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:06 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more
details.
Mon Jan 12 12:20:26 2015
Reconfiguration started (old inc 20, new inc 22)
List of instances:
1 (myinst: 1) 
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE 
* dead instance detected - domain 2 invalid = TRUE 
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jan 12 12:20:26 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 12:20:26 2015
NOTE: SMON starting instance recovery for group DG domain 1 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.332
NOTE: starting recovery of thread=2 ckpt=11.4610 group=1 (DG)
NOTE: SMON waiting for thread 2 recovery enqueue
Mon Jan 12 12:32:50 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info 
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Jan 12 13:44:50 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 1 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Private Interface 'en0' configured from GPnP for use as a private interconnect.
[name='en0', type=1, ip=169.254.189.151, mac=00-90-fa-52-21-be, net=169.254.0.0/16,
mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'en2' configured from GPnP for use as a public interface.
[name='en2', type=1, ip=192.168.0.12, mac=00-90-fa-52-22-3c, net=192.168.0.0/24,
mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0: 
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/grid/asm/dbs/arch
Autotune of undo retention is turned on.
Comment: the ASM alert log on master node node1 shows the ASM instance going through reconfiguration at 09:08:48, 09:21:06, 09:25:08, 09:51:08, 10:15:52 and 12:20:26 on Mon Jan 12 2015 — repeated cluster-membership changes as node 2 kept leaving and rejoining, i.e. a split-brain pattern. The log also shows en0 as the private NIC with IP 169.254.189.151 (the HAIP address) and en2 as the public NIC with IP 192.168.0.12.
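The membership churn above is easy to summarize mechanically. The sketch below is my own helper, not part of the original post: it scans alert-log text for the `Reconfiguration started` lines and pairs each with the timestamp line that precedes it.

```python
import re

# Hypothetical helper: summarize cluster reconfigurations from an ASM alert log.
RECONF = re.compile(r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)")
TS = re.compile(r"^\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")

def reconfigurations(alert_log_text):
    """Return a list of (timestamp, old_inc, new_inc) tuples."""
    events, last_ts = [], None
    for line in alert_log_text.splitlines():
        line = line.strip()
        if TS.match(line):           # remember the most recent timestamp line
            last_ts = line
            continue
        m = RECONF.search(line)
        if m:
            events.append((last_ts, int(m.group(1)), int(m.group(2))))
    return events

sample = """Mon Jan 12 09:08:48 2015
Reconfiguration started (old inc 8, new inc 10)
Mon Jan 12 09:21:06 2015
Reconfiguration started (old inc 10, new inc 12)"""
print(reconfigurations(sample))
# [('Mon Jan 12 09:08:48 2015', 8, 10), ('Mon Jan 12 09:21:06 2015', 10, 12)]
```

Six such events within a few hours, alternating between one- and two-member instance lists, is what signals the churn here.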
3. Analysis
Node 2 ASM instance alert log:
Mon Jan 12 09:09:29 2015
NOTE: client exited [4719298]
Mon Jan 12 09:09:29 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 09:09:29 2015
PMON (ospid: 3867214): terminating the instance due to error 471
Mon Jan 12 09:51:49 2015
Shutting down instance (abort)
License high water mark = 2
USER (ospid: 2425590): terminating the instance
Instance terminated by USER, pid = 2425590
Mon Jan 12 09:51:50 2015
Instance shutdown complete
Mon Jan 12 10:16:27 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jan 12 12:21:07 2015
NOTE: client exited [2098012]
Mon Jan 12 12:21:07 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 12:21:09 2015
PMON (ospid: 2818726): terminating the instance due to error 481
Instance terminated by PMON, pid = 2818726
Mon Jan 12 12:33:30 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Mon Jan 12 12:38:19 2015
NOTE: client jzhprd2:jzhprd registered, osid 3146138, mbr 0x1
Mon Jan 12 12:38:51 2015
ALTER SYSTEM SET local_listener=' (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.13)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Mon Jan 12 13:01:34 2015
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Mon Jan 12 13:01:34 2015
NOTE: client exited [4915336]
NOTE: force a map free for map id 2
Mon Jan 12 13:01:36 2015
PMON (ospid: 3212222): terminating the instance due to error 481
Instance terminated by PMON, pid = 3212222
Mon Jan 12 13:46:02 2015
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0... 
* node 0 does not exist. instance_number = 2 
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 8
Comment: the entries above show node jzh2's ASMB process starting to fail at 09:09:29. An ASM anomaly like this calls for checking the related CRS and CSS logs.
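The instance terminations can likewise be pulled out of the alert log. This is a hypothetical helper of mine, not a tool from the post; the ORA- error number after `terminating the instance due to error` (here 471 and 481) tells you which background-process failure PMON reacted to.

```python
import re

# Hypothetical helper: list (process, ORA-error) pairs for instance terminations.
TERM = re.compile(r"(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)")

def terminations(text):
    """Return [(terminating_process, ora_error), ...] found in alert-log text."""
    return [(m.group(1), int(m.group(3))) for m in TERM.finditer(text)]

log = """PMON (ospid: 3867214): terminating the instance due to error 471
PMON (ospid: 2818726): terminating the instance due to error 481"""
print(terminations(log))  # [('PMON', 471), ('PMON', 481)]
```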

Node 2 CRSD log:

2015-01-12 09:09:30.742: [ CSSCLNT][1]clssscConnect: gipc request failed with 13 (1a)

CSS has failed. (CSS, Cluster Synchronization Services, implements both the network heartbeat and the disk heartbeat.)

2015-01-12 09:09:30.742: [ CSSCLNT][1]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_)) failed, rc 13

2015-01-12 09:09:30.745: [  CRSRTI][1] CSS is not ready. Received status 3
 CSS not ready


2015-01-12 09:09:30.745: [ CRSMAIN][1] First attempt: init CSS context failed. Error = 3

[  clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzh2DBG_CRSD))

The first attempt to initialize the CSS context failed.

2015-01-12 09:09:30.812: [  clsdmt][515]PID for the Process [4522624], connkey 1

2015-01-12 09:09:30.812: [  clsdmt][515]Creating PID [4522624] file for home /u01/app/oracle/grid/asm host jzh2 bin crs to /u01/app/oracle/grid/asm/crs/init/

2015-01-12 09:09:30.812: [  clsdmt][515]Writing PID [4522624] to the file [/u01/app/oracle/grid/asm/crs/init/jzh2.pid]

2015-01-12 09:09:31.863: [ CRSMAIN][1] CRS Daemon Starting--> CRS starting (the crsd daemon itself is fine; verified below)

Comment: the above shows the CRS daemon itself is healthy, but at 09:09:30 the CSS cluster-synchronization service failed. CSS relies on two mechanisms, the disk heartbeat and the network heartbeat, so one of them must be in trouble. Next, check ocssd.log on node1 (jzh1) to see what it was doing at 2015-01-12 09:09:30.
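To make the two-heartbeat picture concrete, here is a toy model — my own simplification, not Oracle code — using the 11g defaults of a 30 s network-heartbeat threshold (misscount) and a 200 s disk-heartbeat threshold (disktimeout). The "disk alive, no network HB" state is exactly what this incident shows later.

```python
# Toy model of the two CSS heartbeats (illustrative only, not Oracle code).
MISSCOUNT_S = 30      # network-heartbeat threshold (CSS misscount, 11g default)
DISKTIMEOUT_S = 200   # disk-heartbeat threshold (CSS disktimeout, 11g default)

def peer_state(now, last_network_hb, last_disk_hb):
    """Classify a peer from the age (seconds) of its two heartbeats."""
    net_ok = (now - last_network_hb) < MISSCOUNT_S
    disk_ok = (now - last_disk_hb) < DISKTIMEOUT_S
    if net_ok and disk_ok:
        return "healthy"
    if disk_ok:
        # The state seen in this incident: jzh2 keeps writing the voting disk
        # but its network heartbeats stop arriving, so CSS must evict a node
        # to avoid split-brain.
        return "disk alive, no network HB"
    return "dead"

print(peer_state(100, 60, 99))  # disk alive, no network HB
```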

Node 1 CSSD log (ocssd.log):
2015-01-12 09:09:00.934: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes


2015-01-12 09:09:00.934: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:04.946: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:04.946: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:08.980: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:08.980: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:12.994: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:12.994: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:16.758: [    CSSD][2577]clssnmSetupReadLease: status 1

2015-01-12 09:09:16.762: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1

2015-01-12 09:09:16.762: [    CSSD][2577]clssgmDoneQEle: re-queueing req 110617b30 status 1

2015-01-12 09:09:16.763: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (111b38850), operation status 1, client status 0

2015-01-12 09:09:17.009: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:17.009: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:21.028: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:21.028: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:25.349: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:25.349: [    CSSD][3862]clssnmSendingThread: sent 3 status msgs to all nodes

2015-01-12 09:09:29.355: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:29.355: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:33.362: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:33.362: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:37.366: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:37.366: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:41.377: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:41.377: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:45.394: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:09:45.394: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:09:49.371: [    CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 0, events 0x8, state 0x0

2015-01-12 09:09:49.371: [    CSSD][1029]clssgmQueueGrockEvent: groupName(DAALL_DB_jzh-cluster) count(2) master(0) event(6), incarn 6, mbrc 1, to member 1, events 0x8, state 0x0

Comment: node1's CSSD is sending status messages to all nodes. Why? After the RAC starts (node1's ASM log shows the 09:08:48 reconfiguration), each node writes its own information into the OCR/voting disk; the master then collects that information and broadcasts it to every node — who the master is and how many nodes there are — records the node information in the voting disk, and voting proceeds. At this point the cluster has two members, member 0 (jzh1) and member 1 (jzh2), which again says the CRSD layer is fine (verified above). It also means each node can write its own information to the voting disk, i.e. the disk heartbeat is working (verified further below).
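As a side note, the clssnmSendingThread entries above fire roughly every 4 seconds. A small helper of my own (purely illustrative) computes the cadence from the log timestamps:

```python
from datetime import datetime

# Hypothetical helper: seconds between consecutive "sending status msg" events.
def send_intervals(timestamps):
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    ts = [datetime.strptime(t, fmt) for t in timestamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(ts, ts[1:])]

# Timestamps taken from the node 1 CSSD log above.
sends = ["2015-01-12 09:09:00.934", "2015-01-12 09:09:04.946",
         "2015-01-12 09:09:08.980", "2015-01-12 09:09:12.994"]
print(send_intervals(sends))  # [4.012, 4.034, 4.014]
```

A steady ~4 s cadence here means node1's own sending side is healthy; the problem lies in what comes back from node2.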

Continuing:
2015-01-12 09:17:02.539: [    CSSD][4376]clssscSelect: cookie accept request 110991628

2015-01-12 09:17:02.539: [    CSSD][4376]clssnmeventhndlr: gipcAssociate endp 1d2198b in container 73 type of conn gipcha

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmConnSetNames: hostname jzh2 privname 10.10.0.20 con 1d2198b--> connecting to jzh2; the private IP is 10.10.0.20 (recall that jzh1's ASM log showed the private interconnect as 169.254.189.151 — the HAIP address, not the physical private IP)

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmSetNodeProperties: properties node 2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17--> jzh2's properties

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmConnComplete: node node2 softver 11.2.0.4.0--> node2's software version

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmCompleteConnProtocol: Incoming connect from node 2 (node2) ninf endp 0, probendp 0, endp 1d2198b

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmSendConnAck: connected to node 2, jzh2, con (1d2198b), state 0

2015-01-12 09:17:02.540: [    CSSD][4376]clssnmCompleteConnProtocol: node jzh2, 2, uniqueness 1421024974, msg uniqueness 1421024974, endp 1d2198b probendp 0 endp 1d2198b

2015-01-12 09:17:03.044: [    CSSD][4376]clssnmHandleJoin: node 2 JOINING, state 0->1 ninfendp 1d2198b

2015-01-12 09:17:03.354: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes

2015-01-12 09:17:03.355: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvReadDskHeartbeat: Reading DHBs to get the latest info for node(2/jzh2), LATSvalid(0), nodeInfoDHB uniqueness(1420692326)--> reading the disk heartbeat

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvDHBValidateNcopy: Setting LATS valid due to uniqueness change for node(jzh2) number(2), nodeInfoDHB(1420692326), readInfo(1421024974)

2015-01-12 09:17:03.360: [    CSSD][2577]clssnmvDHBValidateNcopy: Saving DHB uniqueness for node jzh2, number 2 latestInfo(1421024974), readInfo(1421024974), nodeInfoDHB(1420692326)--> saving jzh2's disk-heartbeat info

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 315891915

2015-01-12 09:17:03.754: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 1, from -1, changes 20

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000--> disk-heartbeat timeout set to 200 s (the 11g default)

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed--> the local and remote disk timeouts take effect once the sync completes

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 315891915

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE--> node 1 is alive

2015-01-12 09:17:03.754: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE--> node 2 is alive

Comment: node1 (jzh1) considers both jzh1 and jzh2 alive. Note that this is still the disk heartbeat speaking — one more confirmation that the disk heartbeat is fine.


Continuing:

2015-01-12 09:24:20.392: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: node jzh2 (2) at 50% heartbeat fatal, removal in 14.647 seconds--> node1's CSSD is now checking on jzh2: the heartbeat is 50% of the way to fatal, and jzh2 will be removed in 14.647 seconds (remember this number). The disk heartbeat was shown to be fine above, so why this warning?

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: node jzh2 (2) is impending reconfig, flag 2294796, misstime 15353--> the misstime is 15.353 s (remember this number too): 15.353 s + 14.647 s = 30 s

2015-01-12 09:24:23.578: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)--> the disk timeout has been cut to 27 s — wasn't it 200 s?

2015-01-12 09:24:23.578: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771770, LATS 706581505, lastSeqNo 19771331, uniqueness 1421024974, timestamp 1421025905/706212683--> node2 (jzh2)'s disk heartbeat is still being read, so the full 200 s is no longer needed ("DHB has rcfg") — confirming yet again that the disk heartbeat is fine. But "no network HB": could the network heartbeat be the problem?

2015-01-12 09:24:23.618: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025863/706581544

2015-01-12 09:24:24.082: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, jzh2, has a disk HB, but no network HB, DHB has rcfg 315891916, wrtcnt, 19771771, LATS 706582008, lastSeqNo 19771770, uniqueness 1421024974, timestamp 1421025905/706213190

2015-01-12 09:24:24.119: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 1421025864/706582045--> node1 keeps writing its own voting-disk heartbeat (status 0x3)

2015-01-12 09:24:24.398: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes--> what is node1 about to tell everyone?

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, jzh2, misstime(31566) state(5)--> node2 (jzh2) is about to be terminated: the misstime is 31.566 s. Recall 15.353 s + 14.647 s = 30 s above; 30 s is Oracle's default network-heartbeat threshold (CSS misscount).

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)--> updating the voting disk; voting begins

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckSplit: Node 2, jzh2, is alive, DHB (1421025918, 706226248) more than disk timeout of 27000 after the last

NHB (1421025890, 706197520)--> node2's disk heartbeat is current while its last network heartbeat is more than the 27 s disk timeout old — once more, the disk heartbeat is fine

2015-01-12 09:24:39.805: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 1

2015-01-12 09:24:39.805: [    CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, jzh2, from the cluster in incarnation 315891916, node birth incarnation 315891915, death incarnation 315891916, stateflags 0x234000 uniqueness value 1421024974--> node1 is evicting node2 (jzh2)

2015-01-12 09:24:39.806: [ default][4119]kgzf_gen_node_reid2: generated reid cid=41daa0e19d0a6f84ff29b9f37a2f1a38,icin=315891908,nmn=2,lnid=315891915,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmrFenceSage: Fenced node jzh2, number 2, with EXADATA, handle 0

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmSendShutdown: req to node 2, kill time 706597731--> node1 sends node2 a shutdown (kill) request

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmsendmsg: not connected to node 2--> node1 cannot reach node2

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmSendShutdown: Send to node 2 failed--> to protect data consistency node2 must be shut down, but the shutdown request fails

2015-01-12 09:24:39.806: [    CSSD][4119]clssnmWaitOnEvictions: Start--> eviction begins

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmUpdateNodeState: node jzh1, number 1, current state 3, proposed state 3, current unique 1420396557, proposed u

nique 1420396557, prevConuni 0, birth 315891909

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmUpdateNodeState: node jzh2, number 2, current state 5, proposed state 0, current unique 1421024974, proposed u

nique 1421024974, prevConuni 1421024974, birth 315891915

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmDeactivateNode: node 2, state 5

2015-01-12 09:25:07.095: [    CSSD][4376]clssnmDeactivateNode: node 2 (jzh2) left cluster--> node2 (jzh2) has left the cluster

2015-01-12 10:11:27.825: [    CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(2)

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: Node jzh1, number 1, is EXADATA fence capable

2015-01-12 10:11:27.825: [    CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 33

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000--> local and remote disk timeouts restored to 200 s

2015-01-12 10:11:27.825: [    CSSD][4376]clssnmHandleSync: initleader 1 newleader 1--> node1 is the leader, i.e. the master node

Comment: from the analysis above, the disk heartbeat is fine; the problem lies with the network heartbeat.
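The timeout arithmetic in the node 1 log fits together neatly: at the "50% heartbeat fatal" warning, jzh2 had already been silent for misstime = 15.353 s and removal was scheduled 14.647 s later, which sums to the 30 s misscount. A one-line sketch of that arithmetic (mine, for illustration):

```python
MISSCOUNT_S = 30.0  # default 11g network-heartbeat threshold (CSS misscount)

def removal_in(misstime_s):
    """Seconds left before eviction once a node stops answering network HBs."""
    return round(MISSCOUNT_S - misstime_s, 3)

print(removal_in(15.353))  # 14.647
```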
Node 2 CSSD log:

2015-01-12 09:25:19.224: [    CSSD][1]clssgmSuspendAllGrocks: done

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State  val 2, from 5, changes 13

2015-01-12 09:25:19.224: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val 315891915, changes 5

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(315891915)

2015-01-12 09:25:19.224: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(315891915)

2015-01-12 09:25:19.224: [    CSSD][1]clssgmStartNMMon:  completed node cleanup

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSendSync: syncSeqNo(315891916), indicating EXADATA fence initialization complete

2015-01-12 09:25:19.224: [    CSSD][4119]List of nodes that have ACKed my sync: 2

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmWaitForAcks: done, syncseq(315891916), msg type(11)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion:node2  product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmNeedConfReq: No configuration to change

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, jzh1, misstime(30000) state(5)--> node2 (jzh2), syncing, sees node1's misstime hit 30 s (the network-heartbeat threshold) and moves to terminate node1

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)--> waiting on the vote

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckSplit: Node 1, jzh1, is alive, DHB (1421025877, 706595081) more than disk timeout of 27000 after the last

 NHB (1421025847, 706565177)--> node1 (jzh1)'s disk heartbeat is fine

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2--> the local cohort (node 2)

2015-01-12 09:25:19.224: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1--> node1 (jzh1) is alive

2015-01-12 09:25:19.224: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, jzh2, is smaller than cohort of 1 nodes led by node 1, jzh1, based on map type 2--> the local node (node2, jzh2) aborts itself to avoid split-brain; node1 (jzh1) is the surviving leader

2015-01-12 09:25:19.224: [    CSSD][4119]###################################

2015-01-12 09:25:19.224: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread--> CSSD aborts from the clssnmRcfgMgrThread thread

Comment: node2 hit the misstime timeout while syncing with node1, so it terminated the local node.
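The arbitration that made node 2 kill itself can be sketched as a simple rule — my own simplification of what the CSSNM00008 message describes: the larger cohort survives; with equal-sized cohorts, the cohort containing the lowest node number wins. That is why the single-node cohort {2} yielded to the single-node cohort {1}.

```python
# Illustrative split-brain arbitration (simplified, not Oracle's actual code).
def surviving_cohort(cohort_a, cohort_b):
    """Pick the surviving cohort: bigger wins; on a tie, lowest node number wins."""
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

print(surviving_cohort({2}, {1}))  # {1}
```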

 
Conclusion: the RAC split-brain was caused by a failure of the private-interconnect (network heartbeat) communication between the nodes.



From the ITPUB blog; link: http://blog.itpub.net/27767315/viewspace-2105679/. Please credit the source when reprinting; unauthorized reproduction will be pursued legally.
