How to Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1]

wmsok發表於2014-03-23

Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts ohasd agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd asm etc), and crsd starts crsd agents that start user resources (database, SCAN, listener etc).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1

 

Cluster status


To find out cluster and daemon status:

 

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, there will be two more processes:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:

ora.diskmon
      1        OFFLINE  OFFLINE       rac1

For 12c onward, ora.storage is introduced:

ora.storage
1 ONLINE ONLINE racnode1 STABLE


In this Document
  Goal
  Solution
     Start up sequence:
     Cluster status
     Case 1: OHASD.BIN does not start
     Case 2: OHASD Agents does not start
     Case 3: CSSD.BIN does not start
     Case 4: CRSD.BIN does not start
     Case 5: GPNPD.BIN does not start
     Case 6: Various other daemons does not start
     Case 7: CRSD Agents does not start
     Network and Naming Resolution Verification
     Log File Location, Ownership and Permission
     Network Socket File Location, Ownership and Permission
     Diagnostic file collection
  References


--------------------------------------------------------------------------------

 

 

Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later   [Release: 11.2 and later ]
Information in this document applies to any platform.

Goal
This goal of the note is to provide reference to troubleshoot 11gR2 Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments.  To look specifically at root.sh issues, see Note: 1053970.1 for more information. 

Solution

Start up sequence:
In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd asm etc), and crsd starts agents that start user resources (database, SCAN, listener etc).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1


Cluster status

To find out cluster and daemon status:


$GRID_HOME/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

 

Case 1: OHASD.BIN does not start

As ohasd.bin is responsible to start up all other cluserware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up.

Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

OS need to be at specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:


cat /etc/inittab|grep init.ohasd
h1:35 :respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1


Above example shows CRS suppose to run at run level 3 and 5; please note depend on platform, CRS comes up at different run level.

To find out current run level:


who -r


2. "init.ohasd run" is up

On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:


ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run


3. Cluserware auto start is enabled - its enabled by default

By default CRS is enabled for auto start upon node reboot, to enable:


$GRID_HOME/bin/crsctl enable crs

To verify whether its currently enabled or not:


cat $SCRBASE/$HOSTNAME/root/ohasdstr
enable

SCRBASE is /etc/oracle/scls_scr on Linux and AIX, /var/opt/oracle/scls_scr on hp-ux and Solaris

Note: NEVER EDIT THE FILE MANUALLY, use "crsctl enable/disable crs" command instead.

4. File System thats GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd is executed, following message should be in OS messages file:


Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"

 

If you see the first line, but not the last line, likely the filesystem containing the GRID_HOME was not online while S96ohasd is executed.

5. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible


ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr


If the OLR is inaccessible or corrupted, likely ohasd.log will have similar messages like following:


..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR


OR


..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

6. ohasd.bin is able to access network socket files, refer to "Network Socket File Location, Ownership and Permission " section for example output.


Case 2: OHASD Agents does not start

OHASD.BIN will spawn four agents/monitors to start level resource:

  oraagent : responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent : responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor : responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin can not start any of above agents properly, clusterware will not come to healthy state; common causes of agent failure are that the log file or log directory for the agents don't have proper ownership or permission.

Refer to below section "Log File Location, Ownership and Permission " for general reference.
 

Case 3: CSSD.BIN does not start

Successful cssd.bin startup depends on the following:

1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, likely ocssd.log will have similar messages like following:


2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""

Otherwise messages like following will show in ocssd.log


2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed , rc(13)


2. Voting Disk is accessible

In 11gR2, ocssd.bin discover voting disk with setting from GPnP profile, if not enough voting disks can be identified, ocssd.bin will abort itself.


2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread

If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1

3. Network is functional and name resolution is working:

If ocssd.bin can't bind to any network, likely the ocssd.log will have messages like following:


2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp


To validate network, please refer to note 1054902.1


4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provide full clusterware functionality and doesn't need Vendor clusterware to be installed; but if you happened to have Grid Infrastructure on top of Vendor clusterware in your environment, then Vendor clusterware need to come up fully before CRS can be started, to verify:


$GRID_HOME/bin/lsnodes -n

Before the cluserware is installed, execute the command below:


$INSTALL_SOURCE/install/lsnodes -v


Case 4: CRSD.BIN does not start

Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like following:


2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..

 

2. OCR is accessible

If the OCR is located on ASM and it's unavailable, likely the crsd.log will show messages like:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26

Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.


If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr


If OCR is located on non-ASM device and its unavailable, likely crsd.log will show similar message like following:


2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If the OCR is corrupted, likely crsd.log will show messages like the following:


2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If owner or group of grid user got changed, even ASM is available, likely crsd.log will show following:


2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]


3. Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:


2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44

Or:


2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!

Or:


2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)

To validate the network, please refer to note 1054902.1
 
Case 5: GPNPD.BIN does not start
1. Name Resolution is not working

gpnpd.bin fails with following error in gpnpd.log:


2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"

In above example, please make sure current node is able to ping "node2", and no firewall between them.


Case 6: Various other daemons does not start
Two common causes:

1. Log file or directory for the daemon doesn't have appropriate ownership or permission

If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.

Refer to below section "Log File Location, Ownership and Permission " for general reference.

2. Network socket file doesn't have appropriate ownership or permission

In this case, the daemon log will show messages like:


2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

 

 

Case 7: CRSD Agents does not start

CRSD.BIN will spawn two agents to start up user resource -the two agent share same name and binary as ohasd.bin agents:

  orarootagent : responsible for ora.netn .network, ora.nodename .vip, ora.scann .vip and  ora.gns
  oraagent : responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc

To find out the user resource status:


$GRID_HOME/crsctl stat res -t


If crsd.bin can not start any of the above agents properly, user resources may not come up.  A common cause of agent failure is that the log file or log directory for the agents don't have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission " for general reference.


Network and Naming Resolution Verification

CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1


Log File Location, Ownership and Permission

Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owner rdbmsap and rdbmsar, here's what it looks like under $GRID_HOME/log:

 

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        


Please note most log files in sub-directory inherit ownership of parent directory; and above are just for general reference to tell whether there's unexpected recursive ownership and permission changes inside the CRS home . If you have a working node with the same version, the working node should be used as a reference.


Network Socket File Location, Ownership and Permission

Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs, below is an example output from the network socket directory:


drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth

 

Diagnostic file collection

If the issue can't be identified with the note, as root, please run $GRID_HOME/bin/diagcollection.sh on all nodes, and upload all .gz files it generated in current directory.

 

References
NOTE:1053970.1 - Troubleshooting 11.2 Grid Infastructure Installation Root.sh Issues
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
NOTE:1068835.1 - What to Do if 11gR2 Clusterware is Unhealthy
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure (CRS)

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/25960404/viewspace-1127524/,如需轉載,請註明出處,否則將追究法律責任。

相關文章