OHASD fails to start on one RAC node: waiting for init.ohasd to be started

Posted by wl365365 on 2016-08-11
Cluster failed to start due to problem with socket pipe npohasd (Doc ID 1612325.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

SYMPTOMS

CRS stack not coming up on one node.

A socket permission issue with Grid Infrastructure causes the CRS stack to fail to come up with crsctl start crs after a server reboot.

The init.ohasd process is running fine after the reboot:
   test-133(root)/>ps -ef|grep init
   root 28717 28382  2 19:03:20 pts/9     0:00 grep d.bin
   root     28756     1  0 10:01 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run   
   root 28676 27170  0 19:02:57 pts/5     0:00
 
Deleting the socket files under /tmp did not help:
   rm -rf /tmp/.oracle/* /usr/tmp/.oracle/* /var/tmp/.oracle/* 
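Before clearing these directories, it can be useful to record what is actually in them so that ownership and permissions can be compared afterwards; a small check along those lines (same paths as the cleanup command above, shown only as an illustrative sketch):

   # List the Oracle socket/pipe directories before clearing them, so owners
   # and permissions can be reviewed later (paths as used in the cleanup above).
   ls -la /tmp/.oracle /usr/tmp/.oracle /var/tmp/.oracle 2>/dev/null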
 
The CRS stack still will not start, as shown below:
   test-133(root)/>/oracle/app/11.2.0/grid/bin/crsctl start crs
   test-133(root)/>ps -ef|grep d.bin
   root 28717 28382  2 19:03:20 pts/9     0:00 grep d.bin
   root 28680 0         19:02:57           0:00 /oracle/app/11.2.0/grid/bin/ohasd.bin reboot
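To see where ohasd.bin is getting stuck, its own log can also be checked. The path below assumes the standard 11.2 Grid Infrastructure log layout with the grid home and node name from this environment, so adjust both if they differ:

   # ohasd log under the standard 11.2 GI layout (node-name directory assumed
   # to match the short hostname, test-133 in this note).
   tail -50 /oracle/app/11.2.0/grid/log/test-133/ohasd/ohasd.log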

For a fraction of a second ohasd.bin comes up, and we can see that one socket (the npohasd named pipe) got created:
   ls -lrt /tmp/.oracle

    prw-r--r-- 1 root root 0 Jan 6 09:50 npohasd


Taking an strace/truss (tusc on HP-UX) of the ohasd.bin process, we find it stuck in a sleeping open() call:
   test-133:(root)/>/hpk/tusc -faep -T %H:%M:%S -p 28680
  ( Attached to process 28680 ("/oracle/app/11.2.0/grid/bin/ohasd.bin reboot") [64-bit] )
  19:03:42 [28680] open(0x40000000007789b0, O_WRONLY|0x800, 023240) [sleeping]
 
  tusc:
  ttrace(TT_PROC_STOP, 0, 0, 0, 0, 0): Permission denied
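An open() call sleeping on a named pipe matches ordinary FIFO semantics: opening a FIFO for writing blocks until some process opens it for reading, and vice versa. A minimal, self-contained illustration using a throwaway pipe (/tmp/demo_pipe is purely hypothetical, not an Oracle file):

   # Create a throwaway FIFO and write to it with no reader attached;
   # the writer blocks, just like ohasd.bin's open() above.
   mkfifo /tmp/demo_pipe
   echo hello > /tmp/demo_pipe &    # blocks ("sleeping") until a reader opens the pipe
   cat /tmp/demo_pipe               # attaching a reader releases the blocked writer
   rm /tmp/demo_pipe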
  

CHANGES

 The problem started after a failed patch attempt, followed by a server reboot.


CAUSE

Permission issue.
After relinking the binaries and rebooting the server again, init.ohasd came up fine, but ohasd and the other daemons would not start and no sockets were created.

When the OS starts S96ohasd, it waits for init.ohasd to write to the pipe.

What happened here is that init.ohasd was started first, and then all of the socket files were removed manually; when ohasd is started again after that, it waits indefinitely because the socket files it relies on are gone.
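A quick way to confirm this state on the affected node is to check whether the npohasd pipe exists and whether any process (init.ohasd in particular) is holding it open; if nothing holds the read end, ohasd's write-side open() will never return. The check below is a sketch under that assumption:

   # Does the npohasd pipe exist, and does any process have it open?
   ls -l /var/tmp/.oracle/npohasd /tmp/.oracle/npohasd 2>/dev/null
   fuser /var/tmp/.oracle/npohasd /tmp/.oracle/npohasd 2>/dev/null
   # lsof /var/tmp/.oracle/npohasd can be used instead of fuser where available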

SOLUTION


WORKAROUND:
-----------
Clear all sockets under /var/tmp/.oracle or /tmp/.oracle, if any, and then open two terminals on the node where the stack is not coming up.

1) On Terminal 1, issue the following as the root user:
crsctl start crs
2) Simultaneously, on Terminal 2 of the same node, issue the command below as the root user once the npohasd socket has been created (a consolidated script sketch follows the validation commands below).
/bin/dd if=/tmp/.oracle/npohasd of=/dev/null bs=1024 count=1


3) Now, if you check Terminal 1, the CRS stack will start coming up.

ps -ef |grep d.bin

4) Once the entire CRS stack is up, press CTRL+C to exit the dd command running in the second terminal.

  Check and validate that all resources are online using:

crsctl stat res -t
crsctl stat res -t -init
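For reference, the two-terminal workaround can also be expressed as a single root script on the affected node. The grid home and pipe locations are taken from this note's environment, and the polling loop is an assumption about where init.ohasd recreates the pipe, so treat this as a sketch rather than a drop-in tool:

   #!/bin/sh
   # Sketch of the workaround: start the stack, then attach a reader to the
   # npohasd pipe so ohasd.bin's blocking open() can complete.
   GRID_HOME=/oracle/app/11.2.0/grid              # grid home from this note

   $GRID_HOME/bin/crsctl start crs &              # "Terminal 1": start the CRS stack

   # "Terminal 2": wait for the npohasd pipe to appear, then read from it.
   PIPE=""
   while [ -z "$PIPE" ]; do
       [ -p /var/tmp/.oracle/npohasd ] && PIPE=/var/tmp/.oracle/npohasd
       [ -p /tmp/.oracle/npohasd ]     && PIPE=/tmp/.oracle/npohasd
       sleep 1
   done
   /bin/dd if="$PIPE" of=/dev/null bs=1024 count=1 &   # stop it once the stack is up

   # Verify the daemons and resources afterwards.
   ps -ef | grep d.bin
   $GRID_HOME/bin/crsctl stat res -t -init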



From the ITPUB blog: http://blog.itpub.net/22039464/viewspace-2123312/. Please credit the source when reposting.
