CRS Cannot Start After Node Reboot (Doc ID 733260.1)

Applies to:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.1.0.7 [Release 10.1 to 11.1]
Information in this document applies to any platform.
***Checked for relevance on 23-Apr-2013***

Symptoms

On a 2-node RAC cluster, CRS may be running on one node but fail to come up on the other node after that node is rebooted. Rebooting again, even several times, does not resolve the problem. The same can happen on clusters with more than two nodes.

ocssd.log for 10g (located in $CRS_HOME/log//cssd) shows repeated messages like:

[ CSSD]2008-07-28 16:30:42.369 [1126189408] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969549) LATS(576133094) Disk lastSeqNo(2969549)
[ CSSD]2008-07-28 16:30:42.909 [1136679264] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969549) LATS(576133634) Disk lastSeqNo(2969549)
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[ CSSD]2008-07-28 16:30:43.371 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969550) LATS(576134094) Disk lastSeqNo(2969550)
.... << repeated messages

ocssd.log for 11gR1 (located in $CRS_HOME/log//cssd) shows repeated messages like:

[ CSSD]2008-10-01 01:03:36.658 [62843792] >TRACE: clssnmReadDskHeartbeat: node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697, wrtcnt, 6238187, LATS 4479404, lastSeqNo 6238187, timestamp 1222823015/258149934
[ CSSD]2008-10-01 01:03:37.661 [62843792] >TRACE: clssnmReadDskHeartbeat: node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697, wrtcnt, 6238188, LATS 4480404, lastSeqNo 6238188, timestamp 1222823016/258150944
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
OR

There are no messages in ocssd.log at all.

ps -ef | grep init shows that "/etc/init.d/init.cssd startcheck" is running constantly.

cat /tmp/crsctl.xxxx shows:

OCR initialization failed with invalid format: PROC-22: The OCR backend has an invalid format
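
The two symptom patterns above can be told apart with a quick scan of the files involved. The following is only a minimal sketch: the function name classify_crs_symptom is hypothetical (it is not part of any Oracle tooling), and the caller is assumed to supply the actual paths of ocssd.log and of the /tmp/crsctl.xxxx file.

```shell
#!/bin/sh
# Hypothetical helper, not an Oracle utility.
# $1 = path to ocssd.log
# $2 = path to a /tmp/crsctl.* output file (may not exist)
classify_crs_symptom() {
    ocssd_log=$1
    crsctl_out=$2
    if [ -f "$ocssd_log" ] &&
       grep -q 'takeover aborted due to ALIVE node on Disk' "$ocssd_log"; then
        echo "network-heartbeat"      # join refused while a disk heartbeat is seen
    elif [ -f "$crsctl_out" ] && grep -q 'PROC-22' "$crsctl_out"; then
        echo "ocr-invalid-format"     # OCR backend could not be read
    else
        echo "unknown"
    fi
}
```

The first result points towards the private-network causes listed below, the second towards the OCR-related causes.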

Changes

This can happen in an environment where a node is shut down for any reason and then restarted.

Cause

These messages usually indicate a communication problem on the private network to the other node(s), or a problem with the OCR device.
There can be several possible causes:

1. During reboot, CRS is started automatically before the network interface is ready.

This can be confirmed by:

ping
AND
traceroute

Both commands work fine on both nodes.

AND
The OS messages log shows that CRS started before the network layer was ready:

Jul 28 15:16:30 bt33 syslogd 1.4.1: restart.
...
Jul 28 15:16:38 bt33 logger: Cluster Ready Services completed waiting on dependencies.
Jul 28 15:16:24 bt33 network: Bringing up interface eth0: succeeded
Jul 28 15:16:39 bt33 logger: Oracle CSS Family monitor starting.
Jul 28 15:16:39 bt33 logger: Running CRSD with TZ =
Jul 28 15:16:39 bt33 kernel: NET: Registered protocol family 16
Jul 28 15:16:26 bt33 network: Bringing up interface eth1: succeeded
Jul 28 15:16:39 bt33 kernel: PCI: Using configuration type 1
Jul 28 15:16:26 bt33 ifup: cannot change name of eth3 to eth5: File exists
Jul 28 15:16:30 bt33 network: Bringing up interface eth5: succeeded
....
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: 10/100 speed: disabling TSO
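
The ordering seen in the excerpt above (CSS starting at 15:16:39, the NIC link coming up at 15:16:47) could be checked with a small scan of the messages log. This is a rough sketch under several assumptions: the function name is hypothetical, both events fall on the same day so that HH:MM:SS strings compare correctly, and the NIC driver logs a "NIC Link is Up" line as e1000 does here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the first CSS start message
# carries an earlier timestamp than the last "NIC Link is Up" line in the file.
# $1 = path to the OS messages log
css_started_before_link_up() {
    awk '
        /Oracle CSS Family monitor starting/ && css == "" { css = $3 }
        /NIC Link is Up/ { link = $3 }   # remember the last link-up line
        END { exit !(css != "" && link != "" && css < link) }
    ' "$1"
}
```

A possible use is `css_started_before_link_up /var/log/messages && echo "CRS started before the link was up"`; the log path is the usual Linux default and may differ on other platforms.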

2. /etc/hosts mismatch, wrong definition for the problem node

For example:
on node 1:

192.168.0.1 prvnode1
192.168.0.2 prvnode2

on node 2:

192.168.0.1 prvnode1
192.168.1.2 prvnode2 <<<
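
A mismatch like the one above is easy to miss by eye. The sketch below compares two copies of /etc/hosts (for example, one copied over from the other node) and prints the hostnames whose addresses differ. The function names are hypothetical, and mktemp and comm are assumed to be available.

```shell
#!/bin/sh
# Print "name ip" pairs from a hosts file, with comments and blank lines removed.
hosts_pairs() {
    sed 's/#.*//' "$1" |
        awk 'NF >= 2 { for (i = 2; i <= NF; i++) print $i, $1 }' | sort
}

# Print hostnames that map to different addresses in the two files,
# or that exist in only one of them.
# $1, $2 = the two hosts files to compare
hosts_mismatch() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    hosts_pairs "$1" > "$tmp_a"
    hosts_pairs "$2" > "$tmp_b"
    comm -3 "$tmp_a" "$tmp_b" | awk '{ print $1 }' | sort -u
    rm -f "$tmp_a" "$tmp_b"
}
```

With the two example files from this case, the only name reported would be prvnode2.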

3. The private network IP has been changed, but /etc/hosts reflects the change incorrectly. For example:

/etc/hosts
#new private IP
192.168.216.1 prvnode1_new
192.168.216.2 prvnode2_new

# old Private IP
192.168.217.4 prvnode1
192.168.217.5 prvnode2

The network for the old IP range 192.168.217.x has been disconnected at the cluster level and replaced by 192.168.216.x. New private hostnames (prvnode1_new, prvnode2_new) were added to /etc/hosts, while OCR still uses the previous private hostnames (prvnode1, prvnode2).
Although ping/traceroute work fine for 192.168.216.x, they do not work for 192.168.217.x. When CRS starts, it resolves the private IP from the private hostname stored in OCR, and that hostname cannot be modified. When the private network changes, change only the IP address in /etc/hosts; do not modify the private hostname.

4. The private network is not pingable, ping responses are slow, or ping shows packet loss
ping and traceroute can confirm this. Sometimes ping works but traceroute still shows a problem, for example:

traceroute to node01-priv (192.168.25.1) 0.110ms 0.024ms 0.021ms
traceroute to node02-priv (192.168.25.2) 0.473ms *

If jumbo frames are in use, confirm their status:

$ traceroute -F 8972
traceroute to prvnode1 (192.168.216.1), 30 hops max, 8972 byte packets
send: Message too long
- this indicates that jumbo frames are not set up properly
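
As a complementary check, ping can be told not to fragment. For ping over IPv4, the largest ICMP payload that fits a given MTU is the MTU minus 20 bytes of IP header and 8 bytes of ICMP header, which gives 8972 for a 9000-byte MTU. The sketch below only does that arithmetic; the function name is hypothetical, and the commented ping line assumes the Linux iputils ping and the hostname prvnode1 from the example above.

```shell
#!/bin/sh
# Hypothetical helper: largest unfragmented ICMP payload for an IPv4 MTU.
# $1 = interface MTU in bytes
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))    # 20-byte IPv4 header, 8-byte ICMP header
}

# Example use on Linux (iputils ping), 3 packets with "do not fragment" set:
# ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" prvnode1
```

If the ping fails with a "Message too long" style error while a default-size ping succeeds, some device on the private path is probably not passing 9000-byte frames.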

5. Different clusterware used for different nodes

For example: Node 1 is using Oracle CRS, while node 2 is using Veritas clusterware:

node1 ocssd.log:
--------
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxninit: Compatible vendor clusterware not in use
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxnmon: skgxn init failed
[ CSSD]2008-07-22 02:58:35.677 [1] >TRACE: clssnmNMInitialize: misscount set to (30)

node 2 ocssd.log:
-------
[ CSSD]2008-07-17 12:30:09.273 [5] >TRACE: clssnm_skgxninit: initialized skgxn version(2/0/Veritas Cluster Server MM

This will cause the two nodes to be unable to communicate with each other properly.
Or node 1 (HP UX) uses HP ServiceGuard and node 2 uses Oracle CRS:

node 1:
$ORA_CRS_HOME/lib/libskgxn2.so links to /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so

node 2:
$ORA_CRS_HOME/lib/libskgxn2.so links to $ORA_CRS_HOME/lib/libskgxns.so

6. If /etc/init.d/init.cssd startcheck does not complete, the /tmp/crsctl.xxx file usually gives a clue as to why. If no /tmp/crsctl.xxx file is generated, run the following debug command as root:

# sh -x /etc/init.d/init.cssd startcheck
....
/tmp/crsctl.12887: cannot create[Permission denied]

# ls -l /
drwsr-xr-t root root /tmp

The problem is caused by missing (or removed) write permission for others on /tmp before the node was rebooted.
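
Whether others can write to /tmp can be read from the mode string that ls prints. This is a minimal sketch with a hypothetical function name; it relies only on the ninth character of the ls -ld mode field being the write bit for others.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when "others" have write permission on $1.
# In a mode string such as drwxrwxrwt, character 9 is the others-write bit.
others_can_write() {
    perms=$(ls -ld "$1" | awk '{ print $1 }')
    [ "$(printf '%s' "$perms" | cut -c9)" = "w" ]
}
```

For the drwsr-xr-t listing shown above, `others_can_write /tmp` would fail, which matches the "cannot create" error from startcheck.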

7. OCR is pointing to a wrong device
For example, on the problem node:

# more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE

# raw -qa
/dev/raw/raw1: bound to major 120, minor 32
/dev/raw/raw2: bound to major 120, minor 32

# ls -l /dev|grep 120 |grep 32
brw------- 1 root root 120, 32 2007-10-10 07:41 emcpowerc

On the working node:

# more /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE

# ls -l /dev/ |grep 120 |grep 257
brw------- 1 root root 120, 257 2007-11-28 09:55 emcpowerq1

# ls -l /dev/ |grep 120 |grep 258
brw------- 1 root root 120, 258 2007-11-28 09:55 emcpowerq2
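
In the problem-node output above, raw1 and raw2 are bound to the same major/minor pair. A sketch for spotting that pattern in raw -qa output follows. The function name is hypothetical, and treating a shared binding as suspicious is an assumption drawn from this example rather than a general rule.

```shell
#!/bin/sh
# Hypothetical helper: read "raw -qa" output on stdin and print pairs of
# raw devices that are bound to the same major/minor numbers.
duplicate_raw_bindings() {
    awk '/bound to major/ {
        key = $5 $7                       # for example "120,32"
        if (key in seen) print seen[key], $1
        else seen[key] = $1
    }' | tr -d ':'
}
```

It would typically be run as root with `raw -qa | duplicate_raw_bindings`; no output means no two raw devices share a binding.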

8. localconfig has been run on a cluster node accidentally
To confirm, check ocr.loc; it will have content similar to:

$ more /etc/oracle/ocr.loc
ocrconfig_loc=/opt/oracle/product/crs/cdata/localhost/local.ocr
local_only=TRUE
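
The local_only setting can be extracted directly from ocr.loc. A minimal sketch, with a hypothetical function name and the file path passed in by the caller:

```shell
#!/bin/sh
# Hypothetical helper: print the value of local_only from an ocr.loc file,
# folded to upper case. $1 = path to ocr.loc
ocr_local_only() {
    sed -n 's/^local_only=//p' "$1" | tr 'a-z' 'A-Z'
}
```

On a cluster node, `ocr_local_only /etc/oracle/ocr.loc` printing TRUE would match the situation described in this case.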

9. If CRS does not start automatically after node reboot, check whether auto start is disabled:

cat /etc/oracle/scls_scr//root/crsstart

If it has the value "disable" then auto start is disabled.
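
The same check can be scripted. This is a sketch only: the function name is hypothetical, and the caller passes the path of the crsstart file as given above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the crsstart file contains "disable".
# $1 = path to the crsstart file
crs_autostart_disabled() {
    [ "$(tr -d ' \t\r\n' < "$1")" = "disable" ]
}
```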

Solution

1. For case #1, check the sequence number of init.crs and make sure it is executed AFTER network initialization. To check, run the following commands (Linux):

find /etc -name 'S*init.crs' -exec ls -l {} \;
find /etc -name 'S*network' -exec ls -l {} \;

The startup of CRS should always be one of the last steps, for example: /etc/rc.d/rc3.d/S96init.crs.
If /etc/rc.d/rc3.d/S02init.crs exists, adjust the number so that CRS starts last.
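
The comparison of sequence numbers can also be scripted. The sketch below uses hypothetical function names and assumes the two-digit SNN naming of SysV runlevel links shown above.

```shell
#!/bin/sh
# Print the start sequence number(s) of a script in a runlevel directory.
# $1 = runlevel directory (for example /etc/rc.d/rc3.d), $2 = script name
start_seq() {
    for f in "$1"/S[0-9][0-9]"$2"; do
        [ -e "$f" ] || continue
        # Strip the leading "S" and one leading zero, keep the number.
        basename "$f" | sed 's/^S0\{0,1\}\([0-9]\{1,2\}\).*/\1/'
    done
}

# Succeeds when init.crs has a higher sequence number than network.
crs_starts_after_network() {
    crs=$(start_seq "$1" init.crs | head -n 1)
    net=$(start_seq "$1" network | head -n 1)
    [ -n "$crs" ] && [ -n "$net" ] && [ "$crs" -gt "$net" ]
}
```

For example, `crs_starts_after_network /etc/rc.d/rc3.d || echo "init.crs starts too early"`.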

2. For case #2, correct the /etc/hosts file to reflect the correct definition. Make sure all nodes have the same definition.

3. For case #3, replace the old private IP with the new private IP; do not change the private hostname.
For example:

Modify /etc/hosts to be:

# New RAC addresses
192.168.216.1 prvnode1
192.168.216.2 prvnode2

Then stop CRS on both nodes and start it again, as root:

crsctl stop crs     (on node 1)
crsctl stop crs     (on node 2)
crsctl start crs    (on each node)

Finally, confirm that CRS is UP on both nodes and that CRS managed resources are ONLINE.

4. For case #4, please consult with your network administrator to restore the network connectivity.

5. For case #5, if vendor clusterware should be used, check to make sure that $ORA_CRS_HOME/lib/libskgxn2.so is linked properly to the vendor clusterware library on all nodes.

For Veritas clusterware:

$ORA_CRS_HOME/lib/libskgxn2.so -> /opt/ORCLcluster/lib/libskgxn2.so

For HP MC ServiceGuard (on HP UX):

$ORA_CRS_HOME/lib/libskgxn2.so -> /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so

For SUN Solaris clusterware:

$ORA_CRS_HOME/lib/libskgxn2.so -> /opt/ORCLcluster/lib/libskgxn2.so

For AIX HACMP:

$ORA_CRS_HOME/lib/libskgxn2.so -> /opt/ORCLcluster/lib/libskgxn2.so
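
To compare the link across nodes, the symlink target can be printed on each node and the results checked side by side. A minimal sketch, assuming readlink is available; the function name is hypothetical, and the CRS home is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: print the target of libskgxn2.so under a CRS home.
# $1 = CRS home directory (the value of $ORA_CRS_HOME)
skgxn_target() {
    readlink "$1/lib/libskgxn2.so"
}
```

Running `skgxn_target "$ORA_CRS_HOME"` on every node should print the same vendor library path everywhere when vendor clusterware is in use.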

6. For case #6, restore write access to the /tmp directory for all users, keeping the sticky bit:

chmod 1777 /tmp

7. For case #7, engage the system administrator to map OCR/Voting to the correct device, the same as on the working node.

8. For case #8, follow Document 747415.1 "How to Restore CRS after accidentally run localconfig on RAC system" to fix the problem.

9. For case #9, enable auto start as the root user:

# crsctl enable crs
# crsctl start crs
