Diagnosing Grid Infrastructure Startup Issues (Doc ID 1623340.1)

Posted by 531968912 on 2017-06-22



Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

Purpose

This document provides a method for diagnosing 11gR2 and 12c Grid Infrastructure startup issues. It applies both to newly installed environments (while root.sh or rootupgrade.sh is running) and to previously working environments that have failed. For root.sh issues, refer to note 1053970.1 for more information.

Scope

This document is intended for Cluster/RAC administrators and support engineers.

Details

Startup sequence:

In a nutshell, the operating system starts the ohasd process, ohasd starts agents that bring up the clusterware daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm, etc.), and crsd starts agents that bring up user resources (database, SCAN, listener, etc.).

For a more detailed description of the Grid Infrastructure cluster startup sequence, refer to note 1053147.1.

Cluster Status


To check the status of the cluster and its daemons:

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, the following two additional resources are present:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1

From 11.2.0.3 onwards, on non-Exadata systems, ora.diskmon is expected to be offline:

ora.diskmon
      1        OFFLINE  OFFLINE       rac1

From 12c onwards, the ora.storage resource appears:

ora.storage
      1        ONLINE  ONLINE       racnode1              STABLE



If a daemon is offline, it can be started with:

$GRID_HOME/bin/crsctl start res ora.crsd -init


Issue 1: OHASD does not start


Since ohasd.bin is responsible, directly or indirectly, for starting all other cluster processes, the rest of the stack can only come up after it starts. If the ohasd.bin process is not up, checking resource status reports CRS-4639 (Could not contact Oracle High Availability Services); if ohasd.bin is already up and a start is attempted again, CRS-4640 is reported; and if the start fails, the following errors appear:

CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.



Automatic startup of ohasd.bin depends on the following:

1. The OS is at the correct run level:

The OS must be at the specified run level before CRS can come up.

To find out which run level CRS requires:

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null



The example above shows that CRS requires the OS to run at run level 3 or 5. Note that the run level required for CRS startup differs between operating systems.

To find the run level the OS is currently running at:

who -r
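
Illustrative output, showing an OS at run level 3 (the exact format varies by platform):

   .       run-level 3  Feb  2 17:59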



2. "init.ohasd run" 啟動

在 /Unix 平臺上,由於"init.ohasd run" 是配置在 /etc/inittab中,程式 init(程式id 1,,Solars和HP-UX上為/sbin/init ,Aix上為/usr/sbin/init)會啟動並且產生"init.ohasd run"程式,如果這個過程失敗了,就不會有"init.ohasd run"的啟動和執行,ohasd.bin 也是無法啟動的:

ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run

Note: Oracle Linux 6 (OL6) and Red Hat Enterprise Linux 6 (RHEL6) no longer support inittab, so init.ohasd is configured in /etc/init and started by Upstart; regardless, the process "/etc/init.d/init.ohasd run" should still be seen running.
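
On OL6/RHEL6 the Upstart job can be checked directly; a minimal sketch, assuming the default job file name oracle-ohasd.conf (verify the name on your system):

cat /etc/init/oracle-ohasd.conf
start on runlevel [35]
stop  on runlevel [!35]
respawn
exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null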

If any rc Snn script (under rcn.d, e.g. S98gcstartup) hangs during boot, the init process may never get to start "/etc/init.d/init.ohasd run"; engage the OS vendor to find out why the Snn script hangs or fails to finish.

錯誤"[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." 可能會在 init.ohasd 無法在指定時間內啟動後出現
 
If the system administrator cannot find out in a short time why init.ohasd does not start, the following can be used as a temporary workaround:

 cd <location-of-init.ohasd>
 nohup ./init.ohasd run &




3. Clusterware auto-start is enabled (it is enabled by default)

By default CRS auto-start is enabled; it can be enabled with:

$GRID_HOME/bin/crsctl enable crs


To check whether it is enabled:

$GRID_HOME/bin/crsctl config crs
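
When auto-start is enabled, the command returns:

CRS-4622: Oracle High Availability Services autostart is enabled.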


If the following appears in the OS log:

Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr


The cause is that the file is missing or inaccessible; this is usually the result of a manual change, or of patching GI with the wrong opatch (e.g. using the Solaris opatch to apply a patch on Linux).
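
A quick sanity check of the flag file (the scls_scr path varies by platform, e.g. /etc/oracle/scls_scr on Linux, /var/opt/oracle/scls_scr on Solaris); the file should contain the single word "enable":

cat /var/opt/oracle/scls_scr/racnode1/root/ohasdstr
enable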


4. syslogd is up and the OS is able to execute the init script S96ohasd

After a node reboot, the OS may hang on some other Snn script and never get a chance to execute S96ohasd; in that case, the following message will not appear in the OS log:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.


If the message above is not in the OS log, another possibility is that syslogd (/usr/sbin/syslogd) is not fully up. GRID cannot start normally in that case either. This case does not apply to the AIX platform.

To find out whether the OS is able to execute S96ohasd after boot, modify the script as follows:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."


To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."


After the node is rebooted, if the file /tmp/ohasd.start.timestamp is not created, the OS got stuck on some other Snn script. If /tmp/ohasd.start.timestamp is created but "Oracle HA daemon is enabled for autostart" is not written to the messages file, syslogd was not fully up. In either case, engage the system administrator to find the root cause at the OS level. For the latter case, a temporary workaround is to "sleep" for about 2 minutes by modifying the ohasd script as follows:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."


To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."


5. The filesystem that GRID_HOME is on is online when the init script S96ohasd executes; normally, once S96ohasd finishes, the following appears in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"


If you only see the first line but not the last one, the filesystem that GRID_HOME is on was most likely not yet mounted when S96ohasd ran.
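
To confirm, check right after boot whether the filesystem is mounted (a sketch; /ocw/grid is the GRID_HOME used in the examples above):

df -h /ocw/grid
mount | grep ocw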


6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is valid and accessible

ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr



If the OLR is inaccessible or corrupted, ohasd.log shows messages like the following:

..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR


Or

..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR


Or

..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user


Or

ohasd.bin comes up, but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails


Or

..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.


The solution is to restore a good backup with "ocrconfig -local -restore <ocr_backup_name>".

By default, the OLR is automatically backed up after installation to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr.
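
A restore might look like the following (the backup file name is hypothetical; use one that actually exists under $GRID_HOME/cdata/$HOST, and verify afterwards with ocrcheck):

# as root:
ls $GRID_HOME/cdata/rac1/
backup_20100202_182020.olr
$GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/rac1/backup_20100202_182020.olr
$GRID_HOME/bin/ocrcheck -local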

7. ohasd.bin is able to access the network socket files:

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state


In a Grid Infrastructure environment, the socket files related to ohasd should be owned by root, whereas in an Oracle Restart environment they are owned by the grid user; for more on network socket file ownership and permissions, see the examples in the section "Network Socket File Location, Ownership and Permissions".

8. ohasd.bin is able to access the log file location:

The OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.


Refer to the examples in the section "Log File Location, Ownership and Permissions" and verify that none of the required directories are missing, and that they were created with the correct ownership and permissions.
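
If, for example, the ohasd log directory is missing, a sketch of recreating it with the ownership shown in that section (run as root; adjust GRID_HOME and node name to your environment):

mkdir -p /ocw/grid/log/racnode1/ohasd
chown root:oinstall /ocw/grid/log/racnode1/ohasd
chmod 750 /ocw/grid/log/racnode1/ohasd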

9. On SUSE Linux, ohasd may fail to start after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES.

10. OHASD fails to start; "ps -ef | grep ohasd.bin" shows the ohasd.bin process is up, but $GRID_HOME/log/<node>/ohasd/ohasd.log shows no updates for several minutes. The OS truss utility shows the process looping, closing file handles that were never opened:

..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..


The ohasd.bin call stack shows the following:

_close  sclssutl_closefiledescriptors  main ..


This is caused by a known bug, fixed in 11.2.0.3 and above. Other symptoms of the bug include cluster processes failing to start, with the same pattern in their call stacks and truss output (looping on the OS "close" call). If the bug is hit while starting other resources, the error "CRS-5802: Unable to start the agent process" may be reported.

11. For other potential causes and solutions, see note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device.

12. ohasd.bin starts normally, but "crsctl check crs" shows only the following line:

CRS-4638: Oracle High Availability Services is online

and the command "crsctl stat res -p -init" produces no output.

This is caused by OLR corruption; refer to note 1193643.1 to restore it.


13. If ohasd still fails to start, check the ohasd logs <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log for more information.
 

Issue 2: OHASD Agents do not start


OHASD.BIN spawns four agents/monitors to start the other resources:

  oraagent: starts ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd, etc.
  orarootagent: starts ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs, etc.
  cssdagent / cssdmonitor: starts ora.cssd (corresponding to ocssd.bin) and ora.cssdmonitor (corresponding to cssdmonitor itself)

If ohasd.bin fails to start any of the agents above, the cluster cannot reach a healthy state.

1. Usually, agents fail to start because the agent log file, or the directory it lives in, does not have the correct ownership or permissions.

See the section "Log File Location, Ownership and Permissions" for the expected log file and directory ownership and permissions.

2. If an agent binary (oraagent.bin, orarootagent.bin, etc.) is corrupted, the agent will not start, and consequently neither will the resources it manages:

2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.


2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed

..

2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]


Solution: compare the agent binaries with those on a healthy node and restore a good copy.
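
For example, checksums of the agent binaries can be compared across nodes (cksum is just one option; any checksum utility works):

# run on both the good node and the bad node, then compare:
cksum $GRID_HOME/bin/oraagent.bin $GRID_HOME/bin/orarootagent.bin
ls -l $GRID_HOME/bin/oraagent* $GRID_HOME/bin/orarootagent*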

Issue 3: OCSSD.BIN does not start


Successful ocssd.bin startup depends on the following:

1. The GPnP profile is accessible - gpnpd must be fully up to serve the profile.

If ocssd.bin can retrieve the profile successfully, ocssd.log normally shows messages like:

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""


Otherwise, ocssd.log shows messages like:

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)



2. Voting disks are accessible

In 11gR2, ocssd.bin locates the voting disks from entries in the GPnP profile. If not enough voting disks are accessible, ocssd.bin aborts itself:
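
To see which voting files ocssd.bin will look for, the GPnP profile can be dumped with gpnptool (it prints the profile XML, including the CSS DiscoveryString), and on a node where CSS is up the configured voting disks can be listed:

$GRID_HOME/bin/gpnptool get
$GRID_HOME/bin/crsctl query css votedisk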

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread



If ocssd.bin fails to start on all nodes with the following error, a voting file change was in progress:

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0


The solution is to start ocssd.bin in exclusive mode, following the steps in note 1364971.1.
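
The exclusive-mode start described in that note is along the following lines (run as root; the -nocrs flag exists from 11.2.0.2 onwards, so follow the note for the exact steps for your version):

# $GRID_HOME/bin/crsctl start crs -excl -nocrs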


If the voting disks are on non-ASM devices, their ownership and permissions should look like:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
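
If they differ, a sketch of the fix as root (the owner and group must match your grid user and install group):

# chown ogrid:oinstall votedisk1
# chmod 640 votedisk1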

3. The network is functional and name resolution is working:

If ocssd.bin cannot bind to any network, ocssd.log shows messages like:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp



If there is a connectivity failure on the private network (including multicast being turned off), ocssd.log shows messages like:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################


To verify the network, see note 1054902.1.

$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode1    0


4. Third-party (vendor) clusterware, if used, is fully up; if it is not, ocssd.log shows messages like:

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon



Prior to clusterware installation, vendor clusterware can be validated by running the following as the grid user:

$INSTALL_SOURCE/install/lsnodes -v

5. "crsctl" executed from the wrong GRID_HOME

The "crsctl" command must be executed from the correct GRID_HOME in order to start the stack; otherwise, errors like the following are reported:

2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#

 

 

Issue 4: CRSD.BIN does not start


Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log shows messages like:

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..



2. OCR is accessible

If the OCR is stored in ASM, the ora.asm resource (the ASM instance) must be up and the diskgroup containing the OCR must be mounted; otherwise crsd.log shows messages like:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26


Note: in 11.2, ASM starts before crsd.bin and automatically mounts the diskgroup that contains the OCR.
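
Whether the diskgroup holding the OCR is mounted can be checked as the grid user; a sketch (the diskgroup name GI comes from the log above and is only an example):

$ asmcmd lsdg
$ sqlplus / as sysasm
SQL> alter diskgroup GI mount;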


If your OCR is on non-ASM storage, its ownership and permissions should be:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr


If the OCR is on non-ASM storage and cannot be accessed, crsd.log shows messages like:

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26



If the OCR is corrupted, crsd.log shows messages like:

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26



If the grid user's permissions or group membership changed, crsd.log shows messages like the following, even though ASM is still accessible:

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]



If the oracle binary under GRID_HOME has the wrong ownership or permissions, crsd.log shows messages like the following, even though ASM is up and running:

2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge

2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact

2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26


In a healthy GRID_HOME, the ownership and permissions of the file look like:

-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle
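
If the ownership or permissions differ, a common fix is to restore them as root and restart GI (a sketch; mode 6751 reproduces the -rwsr-s--x shown above):

# chown grid:oinstall /ocw/grid/bin/oracle
# chmod 6751 /ocw/grid/bin/oracle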



If the OCR or its mirror is inaccessible (possibly because ASM is up, but the diskgroup containing the OCR/mirror is not mounted), crsd.log shows messages like:

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...



3. The crsd.bin pid file (<GRID_HOME>/crs/init/<nodename>.pid) exists and points to the running crsd.bin process

If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log shows:

2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai


The solution is to create the pid file manually with the "touch" command as the grid user and then restart the ora.crsd resource (a sketch follows).
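
A sketch of that fix (the node name is an example):

# as the grid user:
touch $GRID_HOME/crs/init/racnode1.pid

# then as root:
$GRID_HOME/bin/crsctl start res ora.crsd -init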

If the pid file exists but the PID it records points to some other process rather than crsd.bin, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log shows:

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535


To check this at the OS level:

ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> note: process 1535 is not crsd.bin


The solution is to create an empty pid file as root and restart the ora.crsd resource:

# > $GRID_HOME/crs/init/<racnode1>.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init



4. The network is functional and name resolution is working:

If the network is not functional, ocssd.bin may still come up, but crsd.bin may fail, with crsd.log showing:

2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44


Or:

2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!


Or:

2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)


To verify the network and name resolution, see note 1054902.1.

5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have the correct ownership and permissions and have not been modified manually. A simple check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" between a good node and the bad node.


6. For further in-depth diagnosis of CRSD startup issues, see note 1323698.1 - Troubleshooting CRSD Start up Issue.
 

Issue 5: GPNPD.BIN does not start

1. Name resolution is not working

gpnpd.bin fails to start, with the following in gpnpd.log:

2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"


In the example above, make sure "node2" can be pinged from the current node, and that there is no firewall between the nodes.
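
A quick connectivity check between the nodes might look like this (port 9393 comes from the log above; nc may not be available on every platform):

ping -c 3 node2
nc -zv node2 9393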

2. Due to bug 10105195, the gpnpd dispatch thread can be blocked, for example by a network scan. The bug is fixed in 11.2.0.2 GI PSU2 and in 11.2.0.3 and above; for details, see note 10105195.8.

Issue 6: Other daemons do not start

Common causes:

1. The daemon's log file, or the directory it lives in, has incorrect ownership or permissions.

If the log file or its directory has ownership or permission problems, typically the process attempts to start but nothing ever gets written to the log.

See the section "Log File Location, Ownership and Permissions" for more information.


2. Network socket files have incorrect ownership or permissions

In this case, the daemon log shows messages like:

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))



3. The OLR is corrupted

In this case, the daemon log shows the following (this example is ora.ctssd failing to start):

2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting


 
The solution is to restore a good copy of the OLR; for the steps, see note 1193643.1.
 

Issue 7: CRSD Agents do not start


CRSD.BIN spawns two agents to start user resources; the agents share names with those of ohasd.bin:

  orarootagent: starts ora.netn.network, ora.nodename.vip, ora.scann.vip and ora.gns
  oraagent: starts ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resources, etc.

User resource status can be checked with:

$GRID_HOME/bin/crsctl stat res -t



If crsd.bin cannot start any of the above agents, user resources will not come up.

1. Usually, these agents fail to start because the agent log, or the directory it lives in, does not have proper ownership or permissions.

See the section "Log File Location, Ownership and Permissions" below for the expected settings.

2. An agent may also fail to start because of a known bug, in which case the error "CRS-5802: Unable to start the agent process" is reported; see item #10 under "Issue 1: OHASD does not start" for more information.

Issue 8: HAIP does not start

HAIP can fail to start for many reasons, for example:

[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.

See note 1210883.1 for more information on HAIP.

Network and Name Resolution Verification


CRS depends on a fully functional network and working name resolution in order to start; if either is broken, CRS will not come up.

To verify the network and name resolution, see note 1054902.1.

Log File Location, Ownership and Permissions


Correct settings on $GRID_HOME/log, its subdirectories and files are vital for CRS components to start.

In a Grid Infrastructure environment:

Assuming a Grid Infrastructure environment with node name rac1, CRS owned by grid, and two separate RDBMS owners rdbmsap and rdbmsar, the expected layout under $GRID_HOME/log is:

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root     
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd 
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd 
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon 
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd      
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd       
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd     
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd      
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg        
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        

Note that most subdirectories inherit ownership and permissions from their parent directory. The listing above is only a reference for judging whether a recursive ownership or permission change occurred in the CRS home; if you have a working node at the same version, use it as the reference.
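
One way to compare with a healthy node, assuming ssh user equivalence and that racnode2 is the healthy node (a sketch; expect some noise from differing timestamps and sizes):

ls -lR $GRID_HOME/log > /tmp/log_perms_local.txt
ssh racnode2 "ls -lR $GRID_HOME/log" > /tmp/log_perms_good.txt
diff /tmp/log_perms_local.txt /tmp/log_perms_good.txt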

In an Oracle Restart environment:

The expected ownership and permissions under $GRID_HOME/log in an Oracle Restart environment are:

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm

Network Socket File Location, Ownership and Permissions


Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle.

When the socket files have incorrect ownership or permissions, the daemon logs usually show messages like:

2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))

2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lena042DBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai


The following errors may also be reported:

CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..


Solution: as root, stop GI, remove the socket files, and restart GI.
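
A sketch of that procedure (the socket directory location varies by platform, as noted above):

# as root:
$GRID_HOME/bin/crsctl stop crs -f
rm -rf /var/tmp/.oracle      # or /tmp/.oracle, /usr/tmp/.oracle
$GRID_HOME/bin/crsctl start crs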


Assuming a Grid Infrastructure environment with node name rac1 and CRS owned by grid, the expected contents of the socket directory (../.oracle) are shown below.

In a Grid Infrastructure cluster environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth

 

In an Oracle Restart environment:

Sample output from an Oracle Restart environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL


Diagnostic File Collection


If the cause cannot be identified with this document, run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all of the .gz files generated in the current directory for further diagnosis.

References

 - PROC-32 ACCESSING OCR; CRS DOES NOT COME UP ON NODE


NOTE:1323698.1 - Troubleshooting CRSD Start up Issue
NOTE:1325718.1 - OHASD not Starting After Reboot on SLES
NOTE:1077094.1 - How to fix the "DiscoveryString in profile.xml" or "asm_diskstring in ASM" if set wrongly
NOTE:1068835.1 - What to Do if 11gR2 Grid Infrastructure is Unhealthy
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix
NOTE:10105195.8 - Bug 10105195 - Clusterware fails to start after reboot due to gpnpd fails to start
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:1053970.1 - Troubleshooting 11.2 Grid Infrastructure root.sh Issues
NOTE:1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
 - OHASD FAILED TO START TIMELY

NOTE:1564555.1 - 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network
NOTE:1427234.1 - autorun file for ohasd is missing

Source: ITPUB blog, http://blog.itpub.net/25462274/viewspace-2141150/
