Cluster failed to start due to problem with socket pipe npohasd (Doc ID 1612325.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

SYMPTOMS

The CRS stack does not come up on one node.

After a server reboot, a socket permission issue in Grid Infrastructure prevents the CRS stack from coming up when crsctl start crs is run.

The init.ohasd process is running normally after the reboot:

   test-133(root)/>ps -ef|grep init
   root 28717 28382  2 19:03:20 pts/9     0:00 grep d.bin
   root     28756     1  0 10:01 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run   
   root 28676 27170  0 19:02:57 pts/5     0:00

Deleting the socket files under /tmp, /usr/tmp and /var/tmp did not help:

   rm -rf /tmp/.oracle/* /usr/tmp/.oracle/* /var/tmp/.oracle/*

The CRS stack still does not start, as shown below:

   test-133(root)/>/oracle/app/11.2.0/grid/bin/crsctl start crs

   test-133(root)/>ps -ef|grep d.bin
   root 28717 28382  2 19:03:20 pts/9     0:00 grep d.bin
   root 28680 0         19:02:57           0:00 /oracle/app/11.2.0/grid/bin/ohasd.bin reboot

For a fraction of a second ohasd.bin comes up, and one named pipe, npohasd, is created (note the leading "p" in the listing):

   ls -lrt /tmp/.oracle

   prw-r--r-- 1 root root 0 Jan 6 09:50 npohasd
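
The listing above shows a named pipe (FIFO) rather than a socket file. As a minimal sketch, assuming a POSIX shell, the entries in a socket directory can be told apart with the standard `-p` and `-S` file tests. The function name `classify_entry` is hypothetical and not part of Grid Infrastructure.

```shell
#!/bin/sh
# classify_entry PATH
# Hypothetical helper: prints what kind of file PATH is.
# -p is true for a named pipe (FIFO), -S for a socket file.
classify_entry() {
    if [ -p "$1" ]; then
        echo "named pipe"
    elif [ -S "$1" ]; then
        echo "socket"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

For example, `classify_entry /tmp/.oracle/npohasd` could be run right after crsctl start crs to confirm that the pipe exists.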

Tracing the ohasd.bin process with strace/truss (tusc in this case) shows that it is stuck sleeping in an open() call:

   test-133:(root)/>/hpk/tusc -faep -T %H:%M:%S -p 28680

  ( Attached to process 28680 ("/oracle/app/11.2.0/grid/bin/ohasd.bin reboot") [64-bit] )
  19:03:42 [28680] open(0x40000000007789b0, O_WRONLY|0x800, 023240) [sleeping]
 
  tusc:
  ttrace(TT_PROC_STOP, 0, 0, 0, 0, 0): Permission denied

CHANGES

The problem started after a patch failed and the server was rebooted.

CAUSE

Permission issue.
The binaries were relinked and the server was restarted again. init.ohasd came up fine, but ohasd and the other daemons would not start and no sockets were created.

When the OS starts S96ohasd, it waits for init.ohasd to write to the pipe.

In this case init.ohasd was started first, and then all the socket files were removed manually. When ohasd is started again, it waits indefinitely because those socket files were removed manually.
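
The hang seen in the trace (ohasd.bin sleeping in open() with O_WRONLY) appears consistent with ordinary FIFO behavior: opening a named pipe for writing blocks until some process opens it for reading. The sketch below reproduces that behavior in a scratch directory, assuming a POSIX shell with mkfifo and dd available. The function name `demo_fifo_block` and the pipe name `demo_pipe` are hypothetical, and nothing here touches the real /tmp/.oracle.

```shell
#!/bin/sh
# demo_fifo_block
# Hypothetical demo: a writer blocks on a FIFO until a reader opens it.
demo_fifo_block() {
    dir=$(mktemp -d) || return 1
    fifo="$dir/demo_pipe"
    mkfifo "$fifo" || return 1

    # Writer: the redirection opens the FIFO for writing and blocks,
    # because no reader has opened it yet.
    ( echo "hello" > "$fifo" ) &
    writer=$!

    sleep 1
    # The writer should still be alive, stuck in open().
    if kill -0 "$writer" 2>/dev/null; then
        echo "writer blocked"
    fi

    # Reader: opening the FIFO for reading releases the writer.
    dd if="$fifo" of=/dev/null bs=1024 count=1 2>/dev/null
    wait "$writer"
    echo "writer released"

    rm -rf "$dir"
}
```

In this sketch the dd command plays the role of the missing reader, which is the same role dd plays in the workaround in the SOLUTION section.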

SOLUTION

WORKAROUND:
-----------
Clear all sockets under /var/tmp/.oracle or /tmp/.oracle, if any, and then open two terminals on the node where the stack is not coming up.

1) On terminal 1, issue as the root user:

crsctl start crs

2) At the same time, on terminal 2, issue the command below as the root user once the npohasd pipe has been created:

/bin/dd if=/tmp/.oracle/npohasd of=/dev/null bs=1024 count=1

3) Now, if you check on terminal 1, the CRS stack should start coming up.

ps -ef |grep d.bin

4) Once the entire CRS stack is up, press CTRL+C to exit the dd command running on terminal 2.

Check and validate that all resources are online using:

crsctl stat res -t

crsctl stat res -t -init
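
Step 2 of the workaround depends on timing, because dd should only run once npohasd exists. As a minimal sketch, assuming a POSIX shell, the second terminal could poll for the pipe before reading it. The function names `wait_for_pipe` and `drain_npohasd` and the 60-second default timeout are hypothetical choices for illustration. The dd command line itself is the one given in step 2.

```shell
#!/bin/sh
# wait_for_pipe PATH [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until PATH is a named pipe.
# Returns 0 when the pipe appears, 1 if the timeout expires first.
wait_for_pipe() {
    pipe=$1
    timeout=${2:-60}
    waited=0
    while [ ! -p "$pipe" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# drain_npohasd [PATH] [TIMEOUT_SECONDS]
# Hypothetical wrapper for step 2: waits for the pipe, then reads it
# with the dd command from the workaround. Run as root on terminal 2.
drain_npohasd() {
    pipe=${1:-/tmp/.oracle/npohasd}
    if ! wait_for_pipe "$pipe" "${2:-60}"; then
        echo "pipe $pipe did not appear" >&2
        return 1
    fi
    /bin/dd if="$pipe" of=/dev/null bs=1024 count=1
}
```

With these definitions loaded, running `drain_npohasd` on terminal 2 immediately after `crsctl start crs` on terminal 1 would stand in for the manual timing in step 2. Adjust the path if the pipe lives under /var/tmp/.oracle on your platform.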
