【MOS】Top 5 Grid Infrastructure Startup Issues (Doc ID 1368382.1)
In this Document:
Purpose
Scope
Details
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Issue #5: ASM instance does not start, ora.asm is OFFLINE
References
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Information in this document applies to any platform.
PURPOSE
The purpose of this note is to provide a summary of the top 5 issues that may prevent the successful startup of the Grid Infrastructure (GI) stack.
SCOPE
This note applies to 11gR2 Grid Infrastructure only.
To determine the status of GI, please run the following commands:
1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'
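On a healthy node the first command reports every daemon online; output along these lines (a generic 11.2 example, shown only for comparison) indicates the stack came up successfully:
$ $GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online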
DETAILS
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Symptoms:
1. Command '$GRID_HOME/bin/crsctl check crs' returns error:
CRS-4639: Could not contact Oracle High Availability Services
2. Command 'ps -ef | grep init' does not show a line similar to:
root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run
3. Command 'ps -ef | grep d.bin' does not show a line similar to:
root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
Or it may only show "ohasd.bin reboot" process without any other processes
4. ohasd.log reports:
2013-11-04 09:09:15.541: [ default][2609911536] Created alert : (:OHAS00117:) : TIMED OUT WAITING FOR OHASD MONITOR
5. ohasOUT.log reports:
2013-11-04 08:59:14
Changing directory to /u01/app/11.2.0/grid/log/lc1n1/ohasd
OHASD starting
Timed out waiting for init.ohasd script to start; posting an alert
6. ohasd.bin keeps restarting; ohasd.log reports:
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ INIT][733177600]{0:0:2} Dump State Starting ...
7. Only ohasd.bin is running, but nothing is written to ohasd.log. OS /var/log/messages shows:
2015-07-12 racnode1 logger: autorun file for ohasd is missing
Possible Causes:
1. The file /etc/inittab no longer contains the entry that starts ohasd, i.e.: h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1. For OL6/RHEL6+, upstart is not configured properly
2. Runlevel 3 has not been reached; some rc3 script is hanging
3. The init process (pid 1) did not spawn the process defined in /etc/inittab (h1), or a bad entry before init.ohasd (such as an xx:wait: entry) blocked the start of init.ohasd
4. CRS autostart is disabled
5. The Oracle Local Registry ($GRID_HOME/cdata/<node>.olr) is missing or corrupted (check as the root user via "ocrdump -local /tmp/olr.log"; /tmp/olr.log should contain information for all GI daemon processes; compare with a working cluster to verify)
6. The root user previously belonged to group "spec", but that group has since been removed; the old group for the root user is still recorded in the OLR, which can be verified in the OLR dump
7. HOSTNAME was null when init.ohasd started, especially after a node reboot
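To narrow down causes 2, 4 and 7 above, a few direct checks can be run first (a minimal sketch for a Linux host; all commands are standard OS or GI utilities):
# who -r                               # confirm that runlevel 3 (or 5) has actually been reached
# $GRID_HOME/bin/crsctl config crs     # reports whether CRS autostart is enabled or disabled
# hostname                             # should return a non-empty name; a null HOSTNAME matches cause 7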
Solutions:
1. Add the line back to /etc/inittab: h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 and then run "init q" as the root user.
For Linux OL6/RHEL6+, please refer to Note 1607600.1
2. Run command 'ps -ef | grep rc' and kill any remaining rc3 scripts that appear to be stuck.
3. Remove the bad entry before init.ohasd. Consult the OS vendor if "init q" does not spawn the "init.ohasd run" process. As a workaround, start init.ohasd manually, e.g. as the root user run: /etc/init.d/init.ohasd run >/dev/null 2>&1 &
4. Enable CRS autostart:
# crsctl enable crs
# crsctl start crs
5. Restore the OLR from backup, as the root user (refer to Note 1193643.1):
# crsctl stop crs -f
# touch $GRID_HOME/cdata/<node>.olr
# chown root:oinstall $GRID_HOME/cdata/<node>.olr
# ocrconfig -local -restore $GRID_HOME/cdata/<node>/backup_<date>_<time>.olr
# crsctl start crs
If an OLR backup does not exist for any reason, a deconfig followed by rerunning root.sh is required to recreate the OLR, as the root user:
# $GRID_HOME/crs/install/rootcrs.pl -deconfig -force
# $GRID_HOME/root.sh
6. Reinitializing/recreating the OLR is required, using the same commands as for recreating the OLR above
7. Restart the init.ohasd process, or add "sleep 30" in init.ohasd so that the hostname is populated correctly before Clusterware starts; refer to Note 1427234.1
8. If the above does not help, check the OS messages file for the ohasd.bin logger message and manually execute the crswrapexece.pl command mentioned in the OS message, with LD_LIBRARY_PATH set to $GRID_HOME/lib, to continue debugging.
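As a quick sanity check for solutions 1 and 5 above, the following sequence (a sketch assuming a SysV-init Linux host and default paths) verifies that the init.ohasd entry is in place and that the OLR is readable:
# grep init.ohasd /etc/inittab         # the respawn entry should be present on OL5/RHEL5-style hosts
# init q                               # ask init to re-read /etc/inittab
# ps -ef | grep init.ohasd | grep -v grep
# ocrcheck -local                      # verify the OLR is readable and reports no corruption
# ocrdump -local /tmp/olr.dmp          # dump the OLR for comparison with a working node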
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Symptoms:
1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running but aborts with message "CLSGPNP_CALL_AGAIN" in ocssd.log
4. ocssd.log shows:
2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209,
lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065
5. In clusters of 3 or more nodes, 2 nodes form the cluster fine but the 3rd node fails while joining; ocssd.log shows:
2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than
cohort of 2 nodes led by node 1, racnode1, based on map type 2
2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
6. ocssd.bin startup times out after 10 minutes:
2012-04-08 12:04:33.153: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
......
2012-04-08 12:14:31.994: [ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
7. alert.log shows:
2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.
Possible Causes:
1. The voting disk is missing or inaccessible
2. Multicast is not working for the private network for 11.2.0.2.x (expected behavior), or for 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 (due to a known bug; see the patch referenced in the solution below)
3. The private network is not working: ping or traceroute shows the destination unreachable; or a firewall is enabled for the private network while ping/traceroute work fine
4. gpnpd does not come up, stuck in its dispatch thread (known bug)
5. Too many disks are discovered via asm_diskstring, or disk scanning is slow (a known bug affecting 11.2.0.3 on Solaris only)
6. In some cases, a known bug can prevent the 2nd node's ocssd.bin from joining the cluster even after the private network issue is fixed; refer to Note 1479380.1
Solutions:
1. Restore voting disk access. If the disk is not accessible at the OS level, please engage the system administrator to restore the disk access.
If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and recreate the voting disk:
# crsctl start crs -excl
# crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for multicast test and fix. For 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1, either enable multicast for private network or apply patch 16547309 or latest PSU. Refer to Document 1564555.1
3. Consult with the network administrator to restore private network access or disable firewall for private network (for Linux, check service iptables status and service ip6tables status)
4. Kill the gpnpd.bin process on the surviving node; refer to Document 10105195.8.
Once the above issues are resolved, restart the Grid Infrastructure stack.
If ping/traceroute work fine for the private network but a failed 11.2.0.1 to 11.2.0.2 upgrade has occurred, check the corresponding bug note for a workaround.
5. Limit the number of ASM disks scanned by supplying a more specific asm_diskstring.
For Solaris 11.2.0.3 only, please apply patch 13250497, see Note 1451367.1.
6. Refer to the solution and workaround in Note 1479380.1
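Before restarting the stack for solutions 1 and 3 above, the following checks (a sketch for a Linux host; the remote private IP is illustrative) help confirm voting disk visibility and private network health:
# $GRID_HOME/bin/crsctl query css votedisk     # list the voting disks CSS can see (works once CRS is up, e.g. in exclusive mode)
# $GRID_HOME/bin/oifcfg getif                  # confirm which interface is registered as cluster_interconnect
# ping -c 3 <private-ip-of-remote-node>        # basic reachability over the private network
# traceroute <private-ip-of-remote-node>
# service iptables status                      # the firewall should be disabled or opened for the interconnect
# service ip6tables status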
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:
1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even if the crsd.bin process exists, command 'crsctl stat res -t -init' shows:
ora.crsd
1 ONLINE INTERMEDIATE
Possible Causes:
2. The +ASM instance cannot start up, for various reasons
3. OCR is inaccessible
4. The network configuration has been changed, causing a gpnp profile.xml mismatch
5. The $GRID_HOME/crs/init/<node>.pid file for crsd has been removed or renamed manually; crsd.log shows: 'Error3 -2 writing PID to the file'
6. The ocr.loc content does not match that of the other cluster nodes; crsd.log shows: 'Shutdown CacheLocal. my hash ids don't match'
7. The private network is pingable with a normal ping but not with a jumbo frame size (e.g. ping -s 8900 <private-ip>) when jumbo frames are enabled (MTU 9000+); or some cluster nodes have jumbo frames set (MTU 9000) while the problem node does not (MTU 1500). A quick end-to-end check is sketched after this list.
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, multicast packets are truncated (known issue)
9. udp_sendspace is set to its default of 9216 on the AIX platform
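For cause 7, jumbo frame delivery can be tested end to end as sketched below (Linux; the interface name and remote IP are illustrative, and 8972 bytes of payload leaves room for the IP/ICMP headers within a 9000-byte MTU):
# ip link show eth1 | grep mtu                           # confirm the configured MTU on the private interface
# ping -c 3 -M do -s 8972 <private-ip-of-remote-node>    # -M do sets "don't fragment"; this fails if any hop drops jumbo frames
# ping -c 3 -M do -s 1472 <private-ip-of-remote-node>    # baseline payload that fits within a 1500-byte MTU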
Solutions:
2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE; refer to Document 1383737.1 for ASM startup issues related to HAIP.
Also check whether the $GRID_HOME/bin/oracle binary is linked with the RAC option; see Document 284785.1.
3. Ensure the OCR disk is available and accessible. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.
4. Restore the network configuration to match the interface defined in $GRID_HOME/gpnp/<node>/profiles/peer/profile.xml; refer to Document 283684.1 for private network modification.
5. Touch the <node>.pid file under $GRID_HOME/crs/init with the correct ownership.
For 11.2.0.1, the file is owned by the grid user.
For 11.2.0.2, the file is owned by the root user.
6. Using ocrconfig -repair command to fix the ocr.loc content:
for example, as root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running in order for above command to run.
Once above issues are resolved, either restart GI stack or start crsd.bin via:
# crsctl start res ora.crsd -init
7. Engage network admin to enable jumbo frame from switch layer if it is enabled at the network interface. If jumbo frame is not required, change MTU to 1500 for the private network on all nodes, then restart GI stack on all nodes.
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, apply the AIX patch per Document 1528452.1 (AIX 6.1 TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in INTERMEDIATE State).
9. Increase udp_sendspace to recommended value, refer to Document 1280234.1
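For solutions 3 and 6 above, OCR accessibility and ocr.loc consistency can be confirmed as follows (a sketch assuming a Linux host, where ocr.loc lives under /etc/oracle):
# ocrcheck                                           # verify the OCR is accessible and reports no corruption
# cat /etc/oracle/ocr.loc                            # compare this file's content on every cluster node
# $GRID_HOME/bin/crsctl stat res ora.crsd -init -t   # confirm the crsd state after the fix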
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Symptoms:
1. orarootagent not running. ohasd.log shows:
2012-12-21 02:14:05.071: [    AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin or gipcd.bin not running; here is a sample from the mdnsd log file:
2012-12-31 21:37:27.601: [ clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Failed to record pid for MDNSD
or
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Failed to record pid for MDNSD
3. oraagent or appagent not running, crsd.log shows:
2012-12-01 00:06:24.462: [    AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) : Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle
Possible Causes:
2. The .pid file associated with the process is missing, or the file has wrong ownership or permission
3. Wrong permission/ownership within GRID_HOME
4. GRID_HOME disk space is 100% full
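For causes 3 and 4 above, two quick checks (a sketch; the orarootagent path follows the layout shown in symptom 1):
# ls -l $GRID_HOME/bin/orarootagent*      # agent binaries should carry execute permission (symptom 1 shows 'no exe permission')
# df -h $GRID_HOME                        # confirm the filesystem holding GRID_HOME is not 100% full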
Solutions:
1. As the root user, reset the required ownership/permission within GRID_HOME:
# cd $GRID_HOME/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch
This will stop the clusterware stack, set permission/ownership to root for the required files, and restart the clusterware stack.
2. If the corresponding .pid does not exist, touch the file with correct ownership and permission, otherwise correct the .pid ownership/permission as required, then restart the clusterware stack.
Here is the list of .pid files under $GRID_HOME, owned by root:root, permission 644:
./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
Owned by <grid>:oinstall, permission 644:
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid
3. For cause 3, please refer to solution 1.
4. Please clean up disk space in GRID_HOME, particularly old files under $GRID_HOME/log/<node>/client/ and $GRID_HOME/tnslsnr/<node>/<listener>/alert/
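For solutions 2 and 4 above, the following commands (a sketch; the <node> and <grid> placeholders follow the list above, and mdns is just one example) recreate and verify a .pid file and locate space to reclaim:
# ls -l $GRID_HOME/*/init/*.pid                      # check ownership and 644 permission on the daemon .pid files
# touch $GRID_HOME/mdns/init/<node>.pid              # recreate a missing .pid file
# chown <grid>:oinstall $GRID_HOME/mdns/init/<node>.pid
# chmod 644 $GRID_HOME/mdns/init/<node>.pid
# du -sh $GRID_HOME/log/*                            # find the largest log directories to clean up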
Issue #5: ASM instance does not start, ora.asm is OFFLINE
Symptoms:
1. Command 'ps -ef | grep asm' shows no ASM processes
2. Command 'crsctl stat res -t -init' shows:
ora.asm
1 ONLINE OFFLINE
Possible Causes:
2. ASM discovery string is incorrect and therefore voting disk/OCR cannot be discovered
3. ASMlib configuration problem
4. ASM instances are using different cluster_interconnects; HAIP being OFFLINE on one node prevents the 2nd ASM instance from starting
Solutions:
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix ASMlib configuration.
4. Refer to Document 1383737.1 for solution. For more information about HAIP, please refer to Document 1210883.1
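For causes 2 to 4 above, the following checks (a sketch; the discovery string and paths are illustrative) show what ASM and the stack can currently see:
# $GRID_HOME/bin/crsctl stat res ora.asm -init -t                         # current state of the ASM resource
# $GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init -t   # HAIP must be ONLINE on 11.2.0.2+
# $GRID_HOME/bin/kfod disks=all asm_diskstring='/dev/mapper/*'            # list the disks a given discovery string finds
# /etc/init.d/oracleasm listdisks                                         # ASMLib-labelled disks, if ASMLib is in use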
For further debugging of GI startup issues, please refer to Document 1050908.1 Troubleshoot Grid Infrastructure Startup Issues.
Source: ITPUB blog, http://blog.itpub.net/26736162/viewspace-2130321/. Please credit the source when reprinting.