CRS can not Start After Node Reboot (Doc ID 733260.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.1.0.7 [Release 10.1 to 11.1]
Information in this document applies to any platform.
***Checked for relevance on 23-Apr-2013***
Symptoms
On a 2-node RAC cluster, CRS may be running on one node but fail to come up on the other node after that node is rebooted. Rebooting several more times does not resolve the problem. The same can happen on a cluster with more than two nodes.
ocssd.log for 10g (located in $CRS_HOME/log/
[ CSSD]2008-07-28 16:30:42.909 [1136679264] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969549) LATS(576133634) Disk lastSeqNo(2969549)
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[ CSSD]2008-07-28 16:30:43.371 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969550) LATS(576134094) Disk lastSeqNo(2969550)
.... << repeated messages
ocssd.log for 11gR1 (located in $CRS_HOME/log/
node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697,
wrtcnt, 6238187, LATS 4479404, lastSeqNo 6238187, timestamp 1222823015/258149934
[ CSSD]2008-10-01 01:03:37.661 [62843792] >TRACE: clssnmReadDskHeartbeat:
node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697,
wrtcnt, 6238188, LATS 4480404, lastSeqNo 6238188, timestamp 1222823016/258150944
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >WARNING: clssnmLocalJoinEvent:
takeover aborted due to ALIVE node on Disk
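To see how often the warning above recurs, the log can be searched for it. The following is a minimal sketch; the function name count_takeover_aborts is hypothetical, and the log path must be supplied by the caller.

```shell
#!/bin/sh
# count_takeover_aborts LOGFILE
# Prints the number of lines in LOGFILE that contain the
# "takeover aborted due to ALIVE node on Disk" warning text.
# In the 11gR1 format the message is wrapped, so the matching text
# may sit on the line after the clssnmLocalJoinEvent prefix.
count_takeover_aborts() {
    grep -c 'takeover aborted due to ALIVE node on Disk' "$1"
}
```

Call it with the path of the ocssd.log on the problem node. A count that keeps growing between runs suggests that the node is still retrying the join.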
OR
There are no messages in the ocssd.log at all,
ps -ef |grep init shows "/etc/init.d/init.cssd startcheck" is running constantly.
cat /tmp/crsctl.xxxx shows:
Changes
This can happen in an environment where a node is shut down for any reason and then restarted.
Cause
These messages usually indicate a communication problem on the private network to the other node(s), or a problem with the OCR device.
There can be several possible causes:
1. During reboot, CRS is started automatically before the network interface is ready.
This can be confirmed when:
ping and traceroute over the private network work fine from both nodes,
AND
the OS messages log shows that CRS started before the network layer was ready:
...
Jul 28 15:16:38 bt33 logger: Cluster Ready Services completed waiting on dependencies.
Jul 28 15:16:24 bt33 network: Bringing up interface eth0: succeeded
Jul 28 15:16:39 bt33 logger: Oracle CSS Family monitor starting.
Jul 28 15:16:39 bt33 logger: Running CRSD with TZ =
Jul 28 15:16:39 bt33 kernel: NET: Registered protocol family 16
Jul 28 15:16:26 bt33 network: Bringing up interface eth1: succeeded
Jul 28 15:16:39 bt33 kernel: PCI: Using configuration type 1
Jul 28 15:16:26 bt33 ifup: cannot change name of eth3 to eth5: File exists
Jul 28 15:16:30 bt33 network: Bringing up interface eth5: succeeded
....
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: 10/100 speed: disabling TSO
2. /etc/hosts mismatch, wrong definition for the problem node
For example:
on node 1:
192.168.0.2 prvnode2
on node 2:
192.168.1.2 prvnode2 <<<
3. The private network IP has been changed, but /etc/hosts reflects the changes in a wrong way, for example:
#new private IP
192.168.216.1 prvnode1_new
192.168.216.2 prvnode2_new
# old Private IP
192.168.217.4 prvnode1
192.168.217.5 prvnode2
Network for old IP 192.168.217.x has been disconnected at cluster level and replaced by 192.168.216.x. A new private hostname prvnode
Although ping/traceroute works fine for 192.168.216.x, it does not work for 192.168.217.x. When CRS starts, it retrieves private IP via OCR private hostname. Private hostname in OCR cannot be modified. In case of private network change, just change the IP address in /etc/hosts, do not modify the private hostname.
4. The private network is not pingable, the ping response is slow, or the ping command shows packet loss
ping
traceroute to node02-priv (192.168.25.2) 0.473ms *
If jumbo frames are in use, confirm their status:
traceroute to prvnode1 (192.168.216.1), 30 hops max, 8972 byte packets
send: Message too long
- this indicates that jumbo frames are not set up properly
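A similar check can be made with ping by sending a packet that just fits the jumbo MTU with fragmentation prohibited. The sketch below only computes the payload size. It assumes IPv4 without header options (20-byte IP header plus 8-byte ICMP header), and the function name icmp_payload_for_mtu is hypothetical.

```shell
#!/bin/sh
# icmp_payload_for_mtu MTU
# Prints the largest ICMP echo payload that fits in a single
# unfragmented IPv4 packet: MTU - 20 (IPv4 header, no options)
# - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Example, not executed here (assumes Linux iputils ping, where
# "-M do" prohibits fragmentation and "-s" sets the payload size):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" prvnode1
```

For a 9000-byte MTU this gives 8972, the packet size shown in the traceroute output above. If a ping of that size fails while a default-size ping succeeds, some device on the private path is probably not configured for jumbo frames.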
5. Different clusterware used for different nodes
For example: Node 1 is using Oracle CRS, while node 2 is using Veritas clusterware:
--------
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxninit: Compatible vendor clusterware not in use
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxnmon: skgxn init failed
[ CSSD]2008-07-22 02:58:35.677 [1] >TRACE: clssnmNMInitialize: misscount set to (30)
node 2 ocssd.log:
-------
[ CSSD]2008-07-17 12:30:09.273 [5] >TRACE: clssnm_skgxninit: initialized skgxn version(2/0/Veritas Cluster Server MM
This will cause the two nodes to be unable to communicate with each other properly.
Or node 1 (HP-UX) uses HP Serviceguard and node 2 uses Oracle CRS:
$ORA_CRS_HOME/lib/libskgxn2.so links to /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so
node 2:
$ORA_CRS_HOME/lib/libskgxn2.so links to $ORA_CRS_HOME/lib/libskgxns.so
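The link target can be printed on each node and compared. This is a minimal sketch that assumes the readlink utility is available (it may be missing on some older UNIX platforms, where ls -l shows the same information); the function name skgxn_target is hypothetical.

```shell
#!/bin/sh
# skgxn_target LIBPATH
# Prints the target of the symbolic link at LIBPATH, or the string
# "not-a-symlink" if LIBPATH is not a symbolic link.
skgxn_target() {
    if [ -L "$1" ]; then
        readlink "$1"
    else
        echo "not-a-symlink"
    fi
}

# Example, not executed here:
#   skgxn_target "$ORA_CRS_HOME/lib/libskgxn2.so"
```

Run it on every node. If the targets differ in kind (a vendor library on one node, the Oracle-supplied library on another), the nodes are using different clusterware.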
6. If /etc/init.d/init.cssd startcheck does not complete, usually /tmp/crsctl.xxx file should give the clue as to why it does not complete. In case there is no /tmp/crsctl.xxx file generated, run the following debug command as root:
....
/tmp/crsctl.12887: cannot create[Permission denied]
# ls -l /
drwsr-xr-t root root /tmp
The problem is caused by a lack (or removal) of write permission for "other" on /tmp before the node was rebooted.
7. OCR is pointing to a wrong device
For example, on the problem node:
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
# raw -qa
/dev/raw/raw1: bound to major 120, minor 32
/dev/raw/raw2: bound to major 120, minor 32
# ls -l /dev|grep 120 |grep 32
brw------- 1 root root 120, 32 2007-10-10 07:41 emcpowerc
On the working node:
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
# ls -l /dev/ |grep 120 |grep 257
brw------- 1 root root 120, 257 2007-11-28 09:55 emcpowerq1
# ls -l /dev/ |grep 120 |grep 258
brw------- 1 root root 120, 258 2007-11-28 09:55 emcpowerq2
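In the example above, raw1 and raw2 on the problem node are both bound to major 120, minor 32, while the working node uses two distinct devices. The sketch below flags such duplicates in raw -qa output. It assumes the output format shown above, and the function name dup_raw_bindings is hypothetical.

```shell
#!/bin/sh
# dup_raw_bindings
# Reads `raw -qa` output on stdin and prints each major,minor pair
# that is bound to more than one raw device, followed by the devices.
dup_raw_bindings() {
    awk '
        /bound to major/ {
            dev = $1; sub(/:$/, "", dev)       # "/dev/raw/raw1:" -> "/dev/raw/raw1"
            major = $5; sub(/,$/, "", major)   # "120," -> "120"
            key = major "," $7
            if (key in devs) devs[key] = devs[key] " " dev
            else             devs[key] = dev
            count[key]++
        }
        END {
            for (k in count)
                if (count[k] > 1) print k ": " devs[k]
        }
    '
}

# Example, not executed here (run as root):
#   raw -qa | dup_raw_bindings
```

No output means every raw device is bound to a different block device. Any line printed names devices that share one underlying device and should be reviewed with the system administrator.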
8. localconfig has been run on cluster node accidentally
To confirm, check ocr.loc, it will have content similar to:
ocrconfig_loc=/opt/oracle/product/crs/cdata/localhost/local.ocr
local_only=TRUE
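The check can be scripted by reading the local_only key from ocr.loc. This is a sketch under the assumption that the file uses the plain key=value layout shown above; the function names are hypothetical, and the file location in the example (/etc/oracle/ocr.loc, as commonly found on Linux) is an assumption that differs by platform.

```shell
#!/bin/sh
# ocr_loc_value FILE KEY
# Prints the value of the first KEY=value line in FILE.
ocr_loc_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# is_local_only FILE
# Succeeds when local_only is TRUE (any letter case) in FILE.
is_local_only() {
    [ "$(ocr_loc_value "$1" local_only | tr '[:lower:]' '[:upper:]')" = "TRUE" ]
}

# Example, not executed here (path is an assumption for Linux):
#   is_local_only /etc/oracle/ocr.loc && echo "ocr.loc is in local-only mode"
```

On a cluster node the expected value is FALSE, with ocrconfig_loc pointing at the shared OCR device rather than a local.ocr file.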
9. If CRS does not start automatically after node reboot, check whether auto start is disabled by:
If it has the value "disable", then auto start is disabled.
Solution
1. For case #1, check the sequence# of the init.crs and make sure it is executed AFTER the network initialization. To check, run the following commands (Linux):
find /etc -name 'S*network' -exec ls -l {} \;
The startup of CRS should always be one of the last steps, for example: /etc/rc.d/rc3.d/S96init.crs.
If /etc/rc.d/rc3.d/S02init.crs exists, adjust the number so that CRS starts last.
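The ordering check can be sketched as a comparison of the two sequence numbers in one runlevel directory. It assumes SysV-style start links named S<NN>network and S<NN>init.crs with two-digit numbers; the function names are hypothetical.

```shell
#!/bin/sh
# start_seq DIR SUFFIX
# Prints the two-digit sequence of the S<NN><SUFFIX> start link in DIR,
# e.g. 96 for DIR/S96init.crs. Fails if no such link exists.
start_seq() {
    for f in "$1"/S[0-9][0-9]"$2"; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        name=${f##*/}          # strip the directory part
        seq=${name#S}          # strip the leading "S"
        echo "${seq%"$2"}"     # strip the service name
        return 0
    done
    return 1
}

# crs_starts_after_network DIR
# Returns 0 when init.crs has a higher sequence number than network,
# 1 when it does not, and 2 when either link is missing.
crs_starts_after_network() {
    net=$(start_seq "$1" network) || return 2
    crs=$(start_seq "$1" init.crs) || return 2
    [ "${crs#0}" -gt "${net#0}" ]
}

# Example, not executed here:
#   crs_starts_after_network /etc/rc.d/rc3.d || echo "check init.crs ordering"
```

The same check would need to be repeated for each runlevel the node actually boots into.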
2. For case #2, correct the /etc/hosts file to reflect the correct definition. Make sure all nodes have the same definition.
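One way to spot such a mismatch is to look up the same private hostname in each node's hosts file and compare the results. A minimal sketch follows; the function name hosts_ip and the copied file /tmp/hosts.node2 in the example are hypothetical.

```shell
#!/bin/sh
# hosts_ip FILE NAME
# Prints the first IP address that FILE (in /etc/hosts format) maps
# NAME to. Comments are ignored; prints nothing if NAME is not found.
hosts_ip() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments before matching
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# Example, not executed here (assumes node 2's /etc/hosts was copied
# to /tmp/hosts.node2 beforehand):
#   [ "$(hosts_ip /etc/hosts prvnode2)" = "$(hosts_ip /tmp/hosts.node2 prvnode2)" ] \
#       || echo "prvnode2 is defined differently on the two nodes"
```

With the example from case #2, node 1 would return 192.168.0.2 and node 2 would return 192.168.1.2 for prvnode2, which exposes the mismatch.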
3. For case #3, replace the old private IP with the new private IP, do not change the private hostname.
For example:
modify /etc/hosts to be:
192.168.216.1 prvnode1
192.168.216.2 prvnode2
Then stop and restart CRS on both nodes:
crsctl stop crs on node 2
crsctl start crs
Finally, confirm that CRS is UP on both nodes and that CRS managed resources are ONLINE.
4. For case #4, please consult with your network administrator to restore the network connectivity.
5. For case #5, if vendor clusterware should be used, check to make sure that $ORA_CRS_HOME/lib/libskgxn2.so is linked properly to the vendor clusterware library on all nodes.
For Veritas clusterware:
For HP MC ServiceGuard (on HP UX):
For SUN Solaris clusterware:
For AIX HACMP:
6. For case #6, allow other groups access to /tmp directory:
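Before and after changing the mode, the permission can be verified with a small check like the following. It is a sketch that reads the permission string printed by ls -ld; the function name other_can_write is hypothetical.

```shell
#!/bin/sh
# other_can_write DIR
# Returns 0 when the "other" class has write permission on DIR,
# 1 when it does not, and 2 when DIR cannot be listed.
# In the ls -ld permission string, character 9 is the write bit for "other".
other_can_write() {
    mode=$(ls -ld "$1") || return 2
    [ "$(printf '%s' "$mode" | cut -c9)" = "w" ]
}

# Example, not executed here:
#   other_can_write /tmp || echo "/tmp is not writable by other users"
```

The conventional mode for /tmp is 1777 (world-writable with the sticky bit), which as root can be set with chmod 1777 /tmp.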
7. For case #7, engage system admin to map OCR/Voting to the correct device, same as the working node.
8. For case #8, follow Document 747415.1 How to Restore CRS after accidentally run localconfig on RAC system to fix the problem.
9. Enable the auto start, as root user:
# crsctl start crs
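The auto start state from case #9 can be read with a small helper such as the one below. This is a sketch under stated assumptions: the function name crs_autostart_state is hypothetical, and the file path in the example is an assumed location for 10g/11gR1 on Linux, not one given in this note.

```shell
#!/bin/sh
# crs_autostart_state FILE
# Prints "enabled", "disabled" or "unknown" depending on whether the
# first line of FILE is "enable", "disable" or anything else.
crs_autostart_state() {
    case "$(head -n 1 "$1" 2>/dev/null | tr -d '[:space:]')" in
        enable)  echo enabled ;;
        disable) echo disabled ;;
        *)       echo unknown ;;
    esac
}

# Example, not executed here. The path is an assumption and the
# directory name may be the short hostname rather than the full one:
#   crs_autostart_state "/etc/oracle/scls_scr/$(hostname)/root/crsstart"
# If the state is "disabled", auto start can be turned back on as root:
#   crsctl enable crs
```

Note that crsctl start crs only starts the stack once, while crsctl enable crs is the command that makes CRS start automatically at the next reboot.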
References
NOTE:747415.1 - How to Restore CRS after accidentally run localconfig on RAC system
NOTE:803661.1 - How To Determine if Vendor Clusterware is Running
From the ITPUB blog, link: http://blog.itpub.net/17252115/viewspace-768225/. If you repost this article, please cite the source; otherwise legal action may be taken.