CRS can not Start After Node Reboot (Doc ID 733260.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.1.0.7 [Release 10.1 to 11.1]
Information in this document applies to any platform.
***Checked for relevance on 23-Apr-2013***
Symptoms
On a 2-node RAC cluster, CRS may be running on one node while on the other node CRS does not come up after a node reboot. Rebooting several times does not resolve the problem. The same can happen on a cluster with more than two nodes.
ocssd.log for 10g(located in $CRS_HOME/log/
[ CSSD]2008-07-28 16:30:42.909 [1136679264] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969549) LATS(576133634) Disk lastSeqNo(2969549)
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-07-28 16:30:43.172 [1262557536] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk
[ CSSD]2008-07-28 16:30:43.371 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(4) wrtcnt(2969550) LATS(576134094) Disk lastSeqNo(2969550)
.... << repeated messages
ocssd.log for 11gR1 (located in $CRS_HOME/log/
node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697,
wrtcnt, 6238187, LATS 4479404, lastSeqNo 6238187, timestamp 1222823015/258149934
[ CSSD]2008-10-01 01:03:37.661 [62843792] >TRACE: clssnmReadDskHeartbeat:
node 1, ndb-01, has a disk HB, but no network HB, DHB has rcfg 111839697,
wrtcnt, 6238188, LATS 4480404, lastSeqNo 6238188, timestamp 1222823016/258150944
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >TRACE: clssnmRcfgMgrThread: Local Join
[ CSSD]2008-10-01 01:03:38.504 [3007802256] >WARNING: clssnmLocalJoinEvent:
takeover aborted due to ALIVE node on Disk
<
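As a quick check for the symptom above, the following minimal sketch counts how often the takeover warning appears in an ocssd.log. The helper name count_takeover_aborts is hypothetical, and the log path must be supplied by the caller because it differs per installation.

```shell
# Count "takeover aborted" warnings in a CSSD log.
# count_takeover_aborts is an illustrative helper, not an Oracle tool.
count_takeover_aborts() {
    logfile="$1"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'takeover aborted due to ALIVE node on Disk' "$logfile"
}
```

A count that keeps growing while CRS fails to join suggests the node is stuck in the state described here.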
OR
There are no messages in the ocssd.log at all,
ps -ef |grep init shows "/etc/init.d/init.cssd startcheck" is running constantly.
cat /tmp/crsctl.xxxx shows:
Changes
This can happen in an environment where a node is shut down for any reason and then restarted.
Cause
These messages usually indicate a communication problem over the private network with the other node(s), or a problem with the OCR device.
There can be several possible causes:
1. During reboot, CRS is started automatically before the network interface is ready.
This can be confirmed by:
AND
traceroute
Command are working fine for both nodes.
AND
In the OS messages log, it shows CRS started before network layer is ready:
...
Jul 28 15:16:38 bt33 logger: Cluster Ready Services completed waiting on dependencies.
Jul 28 15:16:24 bt33 network: Bringing up interface eth0: succeeded
Jul 28 15:16:39 bt33 logger: Oracle CSS Family monitor starting.
Jul 28 15:16:39 bt33 logger: Running CRSD with TZ =
Jul 28 15:16:39 bt33 kernel: NET: Registered protocol family 16
Jul 28 15:16:26 bt33 network: Bringing up interface eth1: succeeded
Jul 28 15:16:39 bt33 kernel: PCI: Using configuration type 1
Jul 28 15:16:26 bt33 ifup: cannot change name of eth3 to eth5: File exists
Jul 28 15:16:30 bt33 network: Bringing up interface eth5: succeeded
....
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex
Jul 28 15:16:47 bt33 kernel: e1000: eth5: e1000_watchdog_task: 10/100 speed: disabling TSO
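To compare the order of events, one can filter the OS messages log down to the CRS startup and interface bring-up lines. This is a minimal sketch, assuming a Linux syslog file; the default path /var/log/messages and the helper name crs_vs_network_events are assumptions, not part of the original note.

```shell
# List CRS startup and network bring-up lines, with line numbers,
# so their relative order in the log can be compared by eye.
# crs_vs_network_events is an illustrative helper name.
crs_vs_network_events() {
    logfile="${1:-/var/log/messages}"   # assumed default location
    grep -nE 'Cluster Ready Services|Oracle CSS Family monitor|Bringing up interface|NIC Link is Up' "$logfile"
}
```

If the "NIC Link is Up" line for the private interface appears after the CSS monitor has started, case 1 is a likely candidate.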
2. /etc/hosts mismatch, wrong definition for the problem node
For example:
on node 1:
192.168.0.2 prvnode2
on node 2:
192.168.1.2 prvnode2 <<<
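A mismatch like this can be spotted by printing the address each node assigns to the same private hostname. The sketch below is one possible way to do that; host_ip is a hypothetical helper that reads any file in /etc/hosts format.

```shell
# Print the IP address that a hosts-format file assigns to a name.
# host_ip is an illustrative helper, not a standard command.
host_ip() {
    hostsfile="$1"
    name="$2"
    awk -v h="$name" '
        /^[[:space:]]*#/ { next }           # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # stop at an inline comment
                if ($i == h) { print $1; exit }
            }
        }' "$hostsfile"
}
```

Running `host_ip /etc/hosts prvnode2` on every node and comparing the results should show the same address everywhere.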
3. The private network IP has been changed, but /etc/hosts reflects the changes in a wrong way, for example:
#new private IP
192.168.216.1 prvnode1_new
192.168.216.2 prvnode2_new
# old Private IP
192.168.217.4 prvnode1
192.168.217.5 prvnode2
Network for old IP 192.168.217.x has been disconnected at cluster level and replaced by 192.168.216.x. A new private hostname prvnode
Although ping/traceroute works fine for 192.168.216.x, it does not work for 192.168.217.x. When CRS starts, it resolves the private IP through the private hostname stored in OCR. The private hostname in OCR cannot be modified. When the private network changes, change only the IP address in /etc/hosts; do not modify the private hostname.
4. The private network is not pingable, the ping response is slow, or the ping command shows packet loss
ping
traceroute to node02-priv (192.168.25.2) 0.473ms *
If jumbo frames are in use, confirm their status:
traceroute to prvnode1 (192.168.216.1), 30 hops max, 8972 byte packets
send: Message too long
- this indicates that jumbo frames are not set up properly
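The 8972 figure above is consistent with a 9000-byte MTU minus 28 bytes of IPv4 and ICMP headers, which is the usual payload size for a ping that must not be fragmented. The sketch below shows that arithmetic and one way to run the test; it assumes the Linux iputils ping (the `-M do` option), and both helper names are hypothetical.

```shell
# Payload size for a don't-fragment ping at a given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
jumbo_payload() {
    mtu="$1"
    echo $(( mtu - 28 ))
}

# Send three unfragmented pings of that size (assumes Linux iputils ping).
# check_jumbo is an illustrative helper; host and MTU come from the caller.
check_jumbo() {
    host="$1"
    mtu="${2:-9000}"
    ping -M do -c 3 -s "$(jumbo_payload "$mtu")" "$host"
}
```

For example, `check_jumbo prvnode1 9000` run from each node exercises the full path; a "Message too long" error points at an interface or switch port that is not configured for the larger MTU.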
5. Different clusterware used for different nodes
For example, node 1 is using Oracle CRS, while node 2 is using Veritas clusterware:
node 1 ocssd.log:
--------
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxninit: Compatible vendor clusterware not in use
[ CSSD]2008-07-22 02:58:35.676 [5] >TRACE: clssnm_skgxnmon: skgxn init failed
[ CSSD]2008-07-22 02:58:35.677 [1] >TRACE: clssnmNMInitialize: misscount set to (30)
node 2 ocssd.log:
-------
[ CSSD]2008-07-17 12:30:09.273 [5] >TRACE: clssnm_skgxninit: initialized skgxn version(2/0/Veritas Cluster Server MM
This prevents the two nodes from communicating with each other properly.
Or node 1 (HP-UX) uses HP Serviceguard and node 2 uses Oracle CRS:
node 1:
$ORA_CRS_HOME/lib/libskgxn2.so links to /opt/nmapi/nmapi2/lib/hpux64/libnmapi2.so
node 2:
$ORA_CRS_HOME/lib/libskgxn2.so links to $ORA_CRS_HOME/lib/libskgxns.so
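The link target can be printed on each node and compared. This is a minimal sketch; skgxn_target is a hypothetical helper, and it assumes the CRS home is passed as an argument or available in ORA_CRS_HOME.

```shell
# Print the target of the libskgxn2.so symlink under a CRS home.
# skgxn_target is an illustrative helper name.
skgxn_target() {
    crs_home="${1:-$ORA_CRS_HOME}"
    readlink "$crs_home/lib/libskgxn2.so"
}
```

Per the example above, a target of libskgxns.so corresponds to Oracle CRS without vendor clusterware, while a path into a vendor directory indicates vendor clusterware; all nodes should report the same kind of target.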
6. If /etc/init.d/init.cssd startcheck does not complete, the /tmp/crsctl.xxx file usually gives a clue as to why. If no /tmp/crsctl.xxx file is generated, run the following debug command as root:
....
/tmp/crsctl.12887: cannot create[Permission denied]
# ls -l /
drwsr-xr-t root root /tmp
The problem is caused by missing (or removed) write permission for others on /tmp before the node was rebooted.
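The permission in question can be checked from the mode string that `ls -ld` prints. The following sketch tests only the write bit for others; others_can_write is a hypothetical helper name.

```shell
# Succeed if "others" have write permission on the given directory.
# others_can_write is an illustrative helper name.
others_can_write() {
    dir="$1"
    # Character 9 of the ls -ld mode string is the write bit for others
    [ "$(ls -ld "$dir" | cut -c9)" = "w" ]
}
```

For the listing shown above (drwsr-xr-t), the ninth character is "-", so the check fails. For comparison, /tmp conventionally has mode drwxrwxrwt.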
7. OCR is pointing to a wrong device
For example, on the problem node:
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
# raw -qa
/dev/raw/raw1: bound to major 120, minor 32
/dev/raw/raw2: bound to major 120, minor 32
# ls -l /dev|grep 120 |grep 32
brw------- 1 root root 120, 32 2007-10-10 07:41 emcpowerc
On the working node:
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE
# ls -l /dev/ |grep 120 |grep 257
brw------- 1 root root 120, 257 2007-11-28 09:55 emcpowerq1
# ls -l /dev/ |grep 120 |grep 258
brw------- 1 root root 120, 258 2007-11-28 09:55 emcpowerq2
8. localconfig has been run on cluster node accidentally
To confirm, check ocr.loc, it will have content similar to:
ocrconfig_loc=/opt/oracle/product/crs/cdata/localhost/local.ocr
local_only=TRUE
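Cases 7 and 8 both come down to reading two keys from ocr.loc, so a small reader makes it easy to compare nodes. This is a sketch only: ocr_loc_value is a hypothetical helper, and the file path is left to the caller (on Linux the file is commonly /etc/oracle/ocr.loc, which is stated here as an assumption).

```shell
# Print the value of a key (e.g. ocrconfig_loc, local_only) from an ocr.loc file.
# ocr_loc_value is an illustrative helper name.
ocr_loc_value() {
    file="$1"
    key="$2"
    # Strip the "key=" prefix and print only lines that had it
    sed -n "s/^${key}=//p" "$file"
}
```

On a cluster node, `ocr_loc_value <path to ocr.loc> local_only` would be expected to print FALSE, and the ocrconfig_loc value should resolve to the same device on every node.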
9. If CRS does not start automatically after node reboot, check whether auto start is disabled by:
If it has the value "disable", auto start is disabled.
Solution
1. For case #1, check the sequence number of init.crs and make sure it is executed AFTER network initialization. To check, run the following command (Linux):
find /etc -name 'S*network' -exec ls -l {} \;
The startup of CRS should always be one of the last steps, for example: /etc/rc.d/rc3.d/S96init.crs.
If /etc/rc.d/rc3.d/S02init.crs exists, adjust the number so that it starts last.
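The ordering check can also be scripted by comparing the numeric prefixes of the start links. The sketch below assumes SysV-style rc directories with links named like S10network and S96init.crs; both helper names are hypothetical.

```shell
# Extract the start sequence number from a link name such as S96init.crs.
rc_seq() {
    basename "$1" | sed -n 's/^S\([0-9][0-9]*\).*/\1/p'
}

# Return 0 if init.crs starts after network in the given rc directory,
# 1 if it starts earlier (or at the same number), 2 if either link is missing.
crs_starts_after_network() {
    rcdir="$1"
    net=$(ls "$rcdir" | grep -E '^S[0-9]+network$' | head -n 1)
    crs=$(ls "$rcdir" | grep -E '^S[0-9]+init\.crs$' | head -n 1)
    [ -n "$net" ] && [ -n "$crs" ] || return 2
    [ "$(rc_seq "$crs")" -gt "$(rc_seq "$net")" ]
}
```

For example, `crs_starts_after_network /etc/rc.d/rc3.d` would fail for a directory that contains S02init.crs alongside a network link with a higher number.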
2. For case #2, correct the /etc/hosts file to reflect the correct definition. Make sure all nodes have the same definition.
3. For case #3, replace the old private IP with the new private IP; do not change the private hostname.
For example:
modify /etc/hosts to be:
192.168.216.1 prvnode1
192.168.216.2 prvnode2
Then stop and restart CRS on both nodes:
crsctl stop crs on node 2
crsctl start crs
Finally, confirm that CRS is UP on both nodes and that CRS managed resources are ONLINE.
4. For case #4, please consult with your network administrator to restore the network connectivity.
5. For case #5, if vendor clusterware should be used, check to make sure that $ORA_CRS_HOME/lib/libskgxn2.so is linked properly to the vendor clusterware library on all nodes.
For Veritas clusterware:
For HP MC ServiceGuard (on HP UX):
For SUN Solaris clusterware:
For AIX HACMP:
6. For case #6, allow other groups access to /tmp directory:
7. For case #7, engage system admin to map OCR/Voting to the correct device, same as the working node.
8. For case #8, follow Document 747415.1 (How to Restore CRS after accidentally run localconfig on RAC system) to fix the problem.
9. For case #9, enable auto start and then start CRS, as the root user:
# crsctl enable crs
# crsctl start crs
References
NOTE:747415.1 - How to Restore CRS after accidentally run localconfig on RAC system
NOTE:803661.1 - How To Determine if Vendor Clusterware is Running
From "ITPUB Blog", link: http://blog.itpub.net/17252115/viewspace-768225/. If you repost this article, please credit the source; otherwise legal liability will be pursued.