A CRS-5818 Case: gpnpd and mdnsd Daemons Fail to Start

Posted by startay on 2016-05-12

After starting the cluster with crsctl start crs, the resources simply would not come up, and rebooting the machine made no difference. The alert log showed the following:

  2016-05-12 20:22:18.341:
  [gpnpd(4653074)]CRS-2329:GPNPD on node pos5gpp1 shutdown.
  2016-05-12 20:24:15.080:
  [/grid/app/11.2/bin/oraagent.bin(5832974)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /grid/app/11.2/log/pos5gpp1/agent/ohasd/oraagent_grid/oraagent_grid.log.
  2016-05-12 20:24:19.084:
  [ohasd(2359742)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /grid/app/11.2/log/pos5gpp1/ohasd/ohasd.log.
  2016-05-12 20:24:20.560:
  [mdnsd(6619232)]CRS-5602:mDNS service stopping by request.
  2016-05-12 20:26:20.471:
  [/grid/app/11.2/bin/oraagent.bin(5832976)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /grid/app/11.2/log/pos5gpp1/agent/ohasd/oraagent_grid/oraagent_grid.log.
  2016-05-12 20:26:24.476:
  [ohasd(2359742)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /grid/app/11.2/log/pos5gpp1/ohasd/ohasd.log.
  2016-05-12 20:26:29.033:
  [gpnpd(3539086)]CRS-2329:GPNPD on node pos5gpp1 shutdown
Checking oraagent_grid.log showed the following errors:
  [ clsdmc][1800]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=pos5gpp1DBG_MDNSD)) with status 9
  2016-05-12 20:46:05.720: [ora.mdnsd][1800]{0:0:2} [start] Error = error 9 encountered when connecting to MDNSD
  2016-05-12 20:46:06.720: [ora.mdnsd][1800]{0:0:2} [start] without returnbuf
  2016-05-12 20:46:06.884: [ COMMCRS][1034]clsc_connect: (111ed9190) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=pos5gpp1DBG_MDNSD))
ohasd.log contained the following errors:
  ================================================================================
  2016-05-12 20:47:19.870: [ default][1]gpnpd START pid=6357104 Oracle Grid Plug-and-Play Daemon
  2016-05-12 20:47:19.870: [ GPNP][1]clsgpnp_Init: [at clsgpnp0.c:586 clsgpnp_Init] '/grid/app/11.2' in effect as GPnP home base.
  2016-05-12 20:47:19.870: [ GPNP][1]clsgpnp_Init: [at clsgpnp0.c:632 clsgpnp_Init] GPnP pid=6357104, GPNP comp tracelevel=1, depcomp tracelevel=0, tlsrc:ORA_DAEMON_LOGGING_LEVELS, apitl:0, complog:1, tstenv:0, devenv:0, envopt:0, flags=3
  2016-05-12 20:47:22.896: [ GPNP][1]clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:399 clsgpnpkwf_initwfloc] Using FS Wallet Location : /grid/app/11.2/gpnp/pos5gpp1/wallets/peer/
  [ CLWAL][1]clsw_Initialize: OLR initlevel [70000]
  2016-05-12 20:47:22.988: [ COMMCRS][1029]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=pos5gpp1DBG_GPNPD))
  2016-05-12 20:47:22.988: [ clsdmt][772]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=pos5gpp1DBG_GPNPD))
  2016-05-12 20:47:22.988: [ clsdmt][772]Terminating process
  2016-05-12 20:47:22.988: [ GPNP][772]CLSDM requested exit
  2016-05-12 20:47:22.988: [ default][772]GPNPD on node pos5gpp1 shutdown.

A solution was found on My Oracle Support (Metalink): "Troubleshooting Grid Infrastructure Startup Issues (Doc ID 1623340.1)".

When the network socket files have incorrect permissions or ownership, errors like the ones above typically appear.

The network socket files may reside in one of these directories: /tmp/.oracle, /var/tmp/.oracle, or /usr/tmp/.oracle.
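Which of these directories is in use varies by platform, so a quick probe can locate it before inspecting anything. A minimal sketch (it only checks the three documented locations; the function name is my own):

```shell
# Print the first of the documented socket-directory locations that exists.
find_oracle_sock_dir() {
  for d in /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle; do
    [ -d "$d" ] && { echo "$d"; return 0; }
  done
  return 1
}
```

Once located, `ls -l "$(find_oracle_sock_dir)"` produces the ownership listing examined below.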

Check the ownership of our socket files (almost all of them are owned by oracle, but ours should be owned by grid, the CRS owner)  <-- the problem is here
  grid@aaggpp1:/grid/app/11.2/bin/>ls -l /tmp/.oracle
  total 8
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 mdnsd
  -rw-r--r-- 1 oracle oinstall 8 Sep 1 2015 mdnsd.pid
  prw-r--r-- 1 oracle oinstall 0 Jun 10 2015 npohasd
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 ora_gipc_GPNPD_pos5gpp1
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 ora_gipc_GPNPD_pos5gpp1_lock
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 ora_gipc_gipcd_pos5gpp1
  -rw-r--r-- 1 oracle oinstall 0 Sep 1 2015 ora_gipc_gipcd_pos5gpp1_lock
  srwxrwxrwx 1 root system 0 Jan 4 02:17 ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_CLIIPC
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_CLIIPC_lock
  srwxrwxrwx 1 root system 0 Jan 4 02:17 ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_SIPC
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_SIPC_lock
  srwxrwxrwx 1 oracle oinstall 0 Sep 21 2015 s#48758948.1
  srwxrwxrwx 1 oracle oinstall 0 Jun 8 2015 s#66978252.1
  srwxrwxrwx 1 oracle oinstall 0 Sep 23 2015 s#8455124.1
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sAevm
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sCRSD_IPC_SOCKET_11
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sCRSD_IPC_SOCKET_11_lock
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sCRSD_UI_SOCKET
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sCevm
  srwxrwxrwx 1 oracle oinstall 0 Sep 23 2015 sEXTPROC1521
  srwxrwxrwx 1 oracle oinstall 0 Sep 21 2015 sLISTENER
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sOCSSD_LL_pos5gpp1_
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sOCSSD_LL_pos5gpp1__lock
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sOCSSD_LL_pos5gpp1_pos5gpp-cluster
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sOCSSD_LL_pos5gpp1_pos5gpp-cluster_lock
  srwxrwxrwx 1 root system 0 May 12 20:13 sOHASD_IPC_SOCKET_11
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sOHASD_IPC_SOCKET_11_lock
  srwxrwxrwx 1 root system 0 May 12 20:13 sOHASD_UI_SOCKET
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sOracle_CSS_LclLstnr_pos5gpp-cluster_1
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sOracle_CSS_LclLstnr_pos5gpp-cluster_1_lock
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sSYSTEM.evm.acceptor.auth
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sora_crsqs
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_CRSD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_CSSD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_CTSSD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_EVMD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_GIPCD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_GPNPD
  srwxrwxrwx 1 root system 0 Dec 17 15:17 spos5gpp1DBG_LOGD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 spos5gpp1DBG_MDNSD
  srwxrwxrwx 1 root system 0 Jan 4 02:17 spos5gpp1DBG_MOND
  srwxrwxrwx 1 root system 0 May 12 20:13 spos5gpp1DBG_OHASD
  srwxrwxrwx 1 oracle oinstall 0 Sep 1 2015 sprocr_local_conn_0_PROC
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sprocr_local_conn_0_PROC_lock
  srwxrwxrwx 1 root system 0 May 12 20:13 sprocr_local_conn_0_PROL
  -rw-r--r-- 1 oracle oinstall 0 Jun 10 2015 sprocr_local_conn_0_PROL_lock
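Rather than scanning a listing like this by eye, the mismatched entries can be filtered straight out of the `ls -l` output. A minimal sketch, assuming the GI owner is `grid` as on this system (the function name is my own; lock and pid files are flagged too, since they also need fixing):

```shell
# filter_wrong_owner OWNER: read `ls -l` output on stdin and print the owner
# and name of every entry owned by neither OWNER nor root.
filter_wrong_owner() {
  awk -v g="$1" '$1 ~ /^[sp-]/ && $3 != "" && $3 != g && $3 != "root" { print $3, $NF }'
}
# usage: ls -l /tmp/.oracle | filter_wrong_owner grid
```

On this node it would print the oracle-owned sockets (mdnsd, spos5gpp1DBG_GPNPD, and so on), which is exactly the set that blocks gpnpd and mdnsd from listening.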

We have found the cause. There are two ways to fix it:
First, fix the ownership in batch.
Second, delete the socket files; CRS recreates them after a restart.

Method 1: fix the ownership in batch
  chown grid:oinstall mdnsd
  chown grid:oinstall mdnsd.pid
  chown root:system npohasd
  chown grid:oinstall ora_gipc_GPNPD_pos5gpp1
  chown grid:oinstall ora_gipc_GPNPD_pos5gpp1_lock
  chown grid:oinstall ora_gipc_gipcd_pos5gpp1
  chown grid:oinstall ora_gipc_gipcd_pos5gpp1_lock
  chown root:system ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_CLIIPC
  chown root:system ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_CLIIPC_lock
  chown root:system ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_SIPC
  chown root:system ora_gipc_spos5gpp1gridpos5gpp-clusterCRFM_SIPC_lock
  chown oracle:oinstall s#14287242.1
  chown grid:oinstall s#48955460.1
  chown grid:oinstall s#5046366.1
  chown grid:oinstall sAevm
  chown root:system sCRSD_IPC_SOCKET_11
  chown root:system sCRSD_IPC_SOCKET_11_lock
  chown root:system sCRSD_UI_SOCKET
  chown grid:oinstall sCevm
  chown grid:oinstall sLISTENER
  chown grid:oinstall sLISTENER_SCAN1
  chown grid:oinstall sOCSSD_LL_pos5gpp1_
  chown grid:oinstall sOCSSD_LL_pos5gpp1__lock
  chown grid:oinstall sOCSSD_LL_pos5gpp1_pos5gpp-cluster
  chown grid:oinstall sOCSSD_LL_pos5gpp1_pos5gpp-cluster_lock
  chown root:system sOHASD_IPC_SOCKET_11
  chown root:system sOHASD_IPC_SOCKET_11_lock
  chown root:system sOHASD_UI_SOCKET
  chown grid:oinstall sOracle_CSS_LclLstnr_pos5gpp-cluster_2
  chown grid:oinstall sOracle_CSS_LclLstnr_pos5gpp-cluster_2_lock
  chown grid:oinstall sSYSTEM.evm.acceptor.auth
  chown root:system sora_crsqs
  chown root:system spos5gpp1DBG_CRSD
  chown grid:oinstall spos5gpp1DBG_CSSD
  chown root:system spos5gpp1DBG_CTSSD
  chown grid:oinstall spos5gpp1DBG_EVMD
  chown grid:oinstall spos5gpp1DBG_GIPCD
  chown grid:oinstall spos5gpp1DBG_GPNPD
  chown root:system spos5gpp1DBG_LOGD
  chown grid:oinstall spos5gpp1DBG_MDNSD
  chown root:system spos5gpp1DBG_MOND
  chown root:system spos5gpp1DBG_OHASD
  chown root:system sprocr_local_conn_0_PROC
  chown root:system sprocr_local_conn_0_PROC_lock
  chown root:system sprocr_local_conn_0_PROL
  chown root:system sprocr_local_conn_0_PROL_lock

Method 2: as the root user, stop GI, delete these socket files, and restart GI.
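Method 2 can be sketched as commands (the crsctl invocations are the standard ones; the default directory is an assumption, so verify the socket location for your platform, and only remove it while the stack is down):

```shell
# rebuild_sockets [DIR]: stop the GI stack, remove the socket directory,
# then restart GI so the socket files are recreated with correct ownership.
# Run as root.
rebuild_sockets() {
  sock_dir=${1:-/tmp/.oracle}   # may be /var/tmp/.oracle on some platforms
  crsctl stop crs -f            # force-stop the whole stack on this node
  rm -rf "$sock_dir"            # GI recreates these files on the next start
  crsctl start crs
}
```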

After applying the solution above, CRS returned to normal:
  grid@aadgpp1:/home/grid/>crsctl check crs
  CRS-4638: Oracle High Availability Services is online
  CRS-4537: Cluster Ready Services is online
  CRS-4529: Cluster Synchronization Services is online
  CRS-4533: Event Manager is online
  grid@aadgpp1:/home/grid/>ps -ef|grep d.bin
      grid 4063236 1 0 21:37:23 - 0:00 /grid/app/11.2/bin/mdnsd.bin
      root 5308482 1 0 21:38:05 - 0:01 /grid/app/11.2/bin/crsd.bin reboot
      root 6029390 1 0 21:37:26 - 0:01 /grid/app/11.2/bin/osysmond.bin
      root 6291706 1 0 21:37:07 - 0:01 /grid/app/11.2/bin/ohasd.bin reboot
      grid 6422532 6815890 0 21:37:28 - 0:01 /grid/app/11.2/bin/ocssd.bin
      grid 3342780 1 1 21:38:05 - 0:00 /grid/app/11.2/bin/evmd.bin
      grid 5046692 1 0 21:37:24 - 0:00 /grid/app/11.2/bin/gpnpd.bin
      grid 6095300 1 0 21:37:26 - 0:00 /grid/app/11.2/bin/gipcd.bin
      root 6488494 1 0 21:37:56 - 0:00 /grid/app/11.2/bin/octssd.bin reboot


Note:

  Caution:
  After installation is complete, do not remove manually or run cron jobs that remove /tmp/.oracle or /var/tmp/.oracle directories or their files while Oracle software is running on the server. If you remove these files, then the Oracle software can encounter intermittent hangs. Oracle Clusterware installations can fail with the error:
  CRS-0184: Cannot communicate with the CRS daemon.
In other words: after installation, while the RAC is running normally, never manually move or delete /tmp/.oracle or /var/tmp/.oracle. Removing them causes intermittent hangs, and the Oracle Clusterware software also reports errors.

Remember:
If the RAC is running, absolutely do not delete them!
If only the files were deleted, CRS recreates them automatically on restart!
If the directory itself was deleted, it must be rebuilt with commands like the following:
Create the /var/tmp and /var/tmp/.oracle directory:
/bin/mkdir -p /var/tmp/.oracle
/bin/chmod 01777 /var/tmp/
/bin/chown root /var/tmp/ 
/bin/chmod 01777 /var/tmp/.oracle
/bin/chown root /var/tmp/.oracle


An additional note (on the hidden socket files):
  The hidden directory '/var/tmp/.oracle' (or /tmp/.oracle on some platforms) or its content was removed while instances & the CRS stack were up and running. Typically this directory contains a number of "special" socket files that are used by local clients to connect via the IPC protocol (sqlnet) to various Oracle processes including the TNS listener, the CSS, CRS & EVM daemons or even database or ASM instances. These files are created when the "listening" process starts.
These socket files are used by local clients to communicate, over the inter-process communication (IPC) protocol, with various Oracle processes: the TNS listener and the CSS, CRS, and EVM daemons, and even database and ASM instances. The socket files are created by the "listening" process. Here, the socket files created by the Oracle TNS listener are mainly used for communication between pmon and tnslsnr, which is evident from the error messages.

From the "ITPUB blog"; link: http://blog.itpub.net/17086096/viewspace-2098909/. Please credit the source when republishing; otherwise legal liability may be pursued.
