Date	2008-12-06 09:42:43
Component	CRS
Title	What can cause a node eviction?
Version	10.1.0 - 11.1.0.7
Problem	Node evictions can occur in a cluster environment. The main question is why the eviction occurred. This note tries to make that question easier to answer.
Solution	There are 4 possible causes of a node eviction:

1. Kernel hang or extreme load on the system (OPROCD and/or hangcheck timer)
2. Heartbeat lost on the interconnect
3. Heartbeat lost to the voting disk
4. OCLSMON detects a CSSD hang

The title asks for a cause, but a node eviction is a symptom of another problem, not the cause itself. Always keep this in mind when investigating why a node was evicted.

How a kernel hang is detected depends on the operating system used. On Windows or Linux this is based on the hangcheck timer; on other Unix environments OPROCD is started. From Oracle 10.2.0.4 onward OPROCD is also active on Linux (still install the hangcheck timer). To determine whether the hangcheck timer or OPROCD caused the node eviction, check the OS log files for hangcheck timer messages and, for OPROCD, check the OPROCD log file.

Starting with the 10.2.0.3 patchset, a node eviction can also be triggered by OCLSMON. This Clusterware process checks whether there is an issue with CSSD. If there is, it kills the CSSD daemon, which leads to the eviction. When this occurs, check the oclsmon log file and contact Oracle Support.

This note does not focus on those causes, but on lost heartbeats. Below are two examples of a lost-heartbeat symptom. The OCSSD background process takes care of the heartbeats, and the cssd.log file contains detailed information about the node eviction. After an eviction, check the cssd.log files on all the nodes in your cluster environment, starting with the evicted node. The information that is logged can change between patchsets and Oracle releases.

Node eviction due to Interconnect lost symptom.
Oracle 11g

[ CSSD]2008-11-20 10:59:36.510 [1220598112] >TRACE: clssnmCheckDskSleepTime: Node 3, dbq0223, dead, last DHB (1227175136, 73583764) after NHB (1227175121, 73568724), but LATS - current (39090) > DTO (27000)
[ CSSD]2008-11-20 10:59:36.512 [1147169120] >TRACE: clssnmReadDskHeartbeat: node 1, dbq0123, has a disk HB, but no network HB, DHB has rcfg 122475875, wrtcnt, 164452, LATS 58728604, lastSeqNo 164452, timestamp 1227175122/73251784
[ CSSD]2008-11-20 10:59:37.513 [1199618400] >WARNING: clssnmPollingThread: node dbq0227 (5) at 90% heartbeat fatal, eviction in 1.660 seconds
[ CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE: clssnmSendSync: syncSeqNo(122475875)
[ CSSD]2008-11-20 10:59:37.513 [1220598112] >TRACE: clssnm_print_syncacklist: syncacklist (4)

Oracle 10g

[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2006-10-18 23:49:06.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(172) state_network(0) state_disk(3) missCount(30)
[ CSSD]2006-10-18 23:49:06.226 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2006-10-18 23:49:07.028 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634353) LATS(2345204583) Disk lastSeqNo(634353)
[ CSSD]2006-10-18 23:49:07.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(31)
[ CSSD]2006-10-18 23:49:08.032 [1030] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(23) wrtcnt(634354) LATS(2345205587) Disk lastSeqNo(634354)
[ CSSD]2006-10-18 23:49:08.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) disk HB found, network state 0, disk state(3) missCount(32)
[ CSSD]2006-10-18 23:49:09.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(1167) state_network(0) state_disk(3) missCount(33)
[ CSSD]2006-10-18 23:49:10.199 [3600] >TRACE: clssnmCheckDskInfo: node(2) timeout(2167) state_network(0) state_disk(3) missCount(33)
…….
[ CSSD]2006-10-18 23:49:18.571 [3086] >WARNING: clssnmPollingThread: state(0) clusterState(2) exit
[ CSSD]2006-10-18 23:49:18.572 [1287] >ERROR: clssnmvDiskKillCheck: Evicted by node 1, sync 23, stamp -1949751541,
[ CSSD]2006-10-18 23:49:18.698 [3600] >TRACE: 0x110013a80 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Here clssnmvDiskKillCheck reports that this node was evicted by node 1. Because the interconnect is lost, the kill check is done using a poison packet sent through the voting disk.

Possible action: check the availability of the network adapters, look for heavy network load or port scans, and check the OS log files for errors related to the interconnect.

Node eviction due to Voting disk lost symptom.

Below is an example where we lose the heartbeat to the voting disk.

[ CSSD]2006-10-11 00:35:33.658 [1801] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[alligator] seq[9] sync[15]
[ CSSD]2006-10-11 00:35:36.956 [1801] >TRACE: clssnmHandleSync: diskTimeout set to (27000)ms
[ CSSD]2006-10-11 00:35:36.957 [1801] >WARNING: CLSSNMCTX_NODEDB_UNLOCK: lock held for 3300 ms
[ CSSD]2006-10-11 00:35:36.956 [1544] >TRACE: clssnmDiskPMT: stale disk (32490 ms) (0//dev/rora_vote_raw)
[ CSSD]2006-10-11 00:35:36.966 [1544] >ERROR: clssnmDiskPMT: 1 of 1 voting disks unavailable (0/0/1)
[ CSSD]2006-10-11 00:35:37.043 [2058] >TRACE: clssgmClientConnectMsg: Connect from con(112a8a9f0) proc(112a8f9d0) pid(480150) proto(10:2:1:1)
[ CSSD]2006-10-11 00:35:37.960 [3343] >TRACE: clscsendx: (11145a3f0) Physical connection (111459b30) not active
[ CSSD]2006-10-11 00:35:37.051 [1] >USER: NMEVENT_SUSPEND [00][00][00]06]

Possible action: check the availability of the disk subsystem and check the OS log files for errors related to the voting disk.

Trace the heartbeat:

If needed, you can enable a higher level of tracing to debug the heartbeat. The first command below enables level 5 tracing; the second sets the level back to 0, which disables the extra tracing again. Keep in mind that level 5 makes your cssd.log grow quickly (4 lines are added every second).

crsctl debug log css CSSD:5
crsctl debug log css CSSD:0

NOTICE: A node eviction is a symptom of another problem!
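
When several nodes each have a large cssd.log, it can help to pre-filter the files for the messages shown in the examples above. The following is only a minimal sketch: the patterns are copied from the excerpts in this note, the labels and the function names classify_line and summarize are made up for illustration, and because the logged text can change between patchsets and releases the patterns may need adjusting for your version.

```python
import re

# Message fragments taken from the excerpts in this note.
# Assumption: your release logs the same wording; adjust if it does not.
SYMPTOMS = [
    ("interconnect", re.compile(r"has a disk HB, but no network HB")),
    ("interconnect", re.compile(r"clssnmPollingThread: node .* heartbeat fatal")),
    ("interconnect", re.compile(r"disk HB found, network state 0")),
    ("voting_disk", re.compile(r"clssnmDiskPMT: stale disk")),
    ("voting_disk", re.compile(r"clssnmDiskPMT: \d+ of \d+ voting disks unavailable")),
    ("evicted", re.compile(r"clssnmvDiskKillCheck: Evicted by node \d+")),
]

# Timestamp as it appears right after the "[ CSSD]" prefix in the excerpts.
TIMESTAMP = re.compile(r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def classify_line(line):
    """Return (label, timestamp) for a known symptom line, else None."""
    for label, pattern in SYMPTOMS:
        if pattern.search(line):
            stamp = TIMESTAMP.search(line)
            return label, (stamp.group(1) if stamp else None)
    return None


def summarize(lines):
    """Count how many lines matched each symptom label."""
    counts = {}
    for line in lines:
        hit = classify_line(line)
        if hit is not None:
            counts[hit[0]] = counts.get(hit[0], 0) + 1
    return counts
```

To use it, open the cssd.log of the evicted node first and pass the file object to summarize, then repeat for the other nodes. Many "interconnect" hits followed by an "evicted" hit point to the first example above; "voting_disk" hits point to the second. The result is only a hint about where to start reading, not a diagnosis.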

Reposted from: Oracle RAC 10.1.0 - 11.1.0.7, causes of a node being evicted
