CRS-0184 Cannot communicate with the CRS daemon

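Posted by 賀子_DBA時代 on 2017-12-01
An Oracle RAC cluster ran into a problem and reported these errors:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4534: Cannot communicate with Event Manager
Problem analysis: Because the website was moving to the cloud, one Oracle RAC cluster was moved from the IDC data center back to the company's own premises. The database was shut down according to the procedure, but my manager did it and only ran su - oracle followed by shu immediate. That stopped the Oracle instances; the ASM instances were never shut down. After the move, the network cables were plugged back into their original ports and a startup was attempted. I only brought up the instance on ora101 and then left it alone. Later, when this database was needed, I looked at the state of ora102 and realized that neither the database instance nor the ASM instance had started there. I tried to start them and got the errors shown further down.
First, here is the procedure for shutting down the Oracle resources when the Oracle RAC servers need to be rebooted: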
Method 1:
1) Stop the Oracle instances
[ ~]$ srvctl  stop database  -d ORCL
2) Stop the ASM instances
[ ~]$ srvctl  stop asm -n ora102
[ ~]$ srvctl  stop asm -n ora101
If this reports an error, force the stop, as follows:
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option is enough:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
3) Finally, CRS must be stopped as well
[root@ora101 bin]# ./crsctl stop cluster -all
Method 2:
1) Stop the Oracle instances; run this on both nodes
su - oracle
sqlplus / as sysdba
shu immediate
2) Stop the ASM instances; run this on both nodes
su - grid
sqlplus / as sysasm
shu immediate
Or force the shutdown with shutdown abort in sqlplus:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
3) Finally, CRS must be stopped as well
[root@ora101 bin]# ./crsctl stop cluster -all
Check the status of the database and ASM instances, and of CRS:
[grid@ora101 ~]$ srvctl status asm
ASM is running on ora101,ora102
[grid@ora101 ~]$ srvctl status database -d ORCL
Instance orcl1 is not running on node ora101
Instance orcl2 is not running on node ora102
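Both methods above follow the same order: database instances first, then ASM, then the Clusterware stack. As a minimal sketch of that order (the function name rac_shutdown_plan is hypothetical, and it only prints the Method 1 commands instead of running them), it could be written down like this:

```shell
# Print, in order, the Method 1 commands for a full RAC shutdown.
# Nothing is executed: review the plan and run each line by hand.
# Usage: rac_shutdown_plan DB_NAME NODE [NODE...]
rac_shutdown_plan() {
    db=$1
    shift
    # 1) database instances
    echo "srvctl stop database -d $db"
    # 2) ASM on each node (add -f by hand if the plain stop is refused)
    for node in "$@"; do
        echo "srvctl stop asm -n $node"
    done
    # 3) Clusterware on all nodes, run as root
    echo "crsctl stop cluster -all"
}
```

For the cluster in this post the call would be rac_shutdown_plan ORCL ora101 ora102. Printing rather than executing keeps the sketch safe to try on any machine.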
Now, back to the problem at hand.
[root@ora102 ~]# su - grid
[grid@ora102 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.4.0 Production on Wed Nov 29 22:28:20 2017
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> startup
The startup fails with an error.
Checking the cluster service status on node ora102 also fails:
[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
From the error above we can tell that CRS has a problem.
Trying to start it fails as well (note that this must be run as root):

[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
The normal output would be:
[root@ora102 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Checking the CRS services confirms the problem:
[grid@ora102 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
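To spot the failing components without reading the output by eye, a small filter can help. This is a sketch only: the function name crs_not_online is hypothetical, and it assumes the healthy output always consists of the four "is online" lines shown elsewhere in this post.

```shell
# Read `crsctl check crs` output on stdin and print every CRS-nnnn line
# that does not report a component as online.
# Exit status 0 means exactly four lines were seen and all were online.
crs_not_online() {
    awk '
        /^CRS-[0-9]+:/ {
            seen++
            if ($0 !~ /is online/) { bad++; print }
        }
        END { exit (seen == 4 && bad == 0) ? 0 : 1 }
    '
}
```

On a real node it would be used as crsctl check crs | crs_not_online. With the output above it prints the CRS-4535, CRS-4530 and CRS-4534 lines and returns a non-zero status.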

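Next, looking at the IP addresses on node ora102 shows that the VIP and the SCAN IP are no longer there. The VIP is now on node ora101, so we can conclude that ora102 has dropped out of the cluster.
Check the IP configuration: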
[root@ora102 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.44 ora101
192.168.0.45 ora102
192.168.0.46 ora101-vip
192.168.0.47 ora102-vip
192.168.0.48 ora-cluster-scan
172.168.56.101 ora101-priv
172.168.56.102 ora102-priv
Looking at the addresses actually configured on the node, only the physical IP (192.168.0.45) is left on the public interface.
[root@ora102 ~]# ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp11s0f0: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.45/24 brd 192.168.0.255 scope global enp11s0f0
valid_lft forever preferred_lft forever
inet6 fe80::f451:31ab:4b4a:b224/64 scope link
valid_lft forever preferred_lft forever
3: enp11s0f1: mtu 1500 qdisc mq state UP qlen 1000
link/ether 5c:f3:fc:e6:63:42 brd ff:ff:ff:ff:ff:ff
inet 172.168.56.102/24 brd 172.168.56.255 scope global enp11s0f1
valid_lft forever preferred_lft forever
inet 169.254.20.215/16 brd 169.254.255.255 scope global enp11s0f1:1
valid_lft forever preferred_lft forever
inet6 fe80::7ee2:d8da:d7fa:12d5/64 scope link
valid_lft forever preferred_lft forever
4: enp0s29f0u2: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
link/ether 5e:f3:fc:de:63:43 brd ff:ff:ff:ff:ff:ff
5: virbr0: mtu 1500 qdisc noqueue state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
6: virbr0-nic: mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
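The comparison between /etc/hosts and the ip a output can also be done mechanically. The following is a minimal sketch; the function name missing_addrs is hypothetical, and the addresses to look for are passed in by hand.

```shell
# Print each expected address that does not appear as an "inet" entry
# in `ip a` output read from stdin.
# Usage: ip a | missing_addrs 192.168.0.45 192.168.0.47
missing_addrs() {
    # Collect the IPv4 addresses, stripping the /prefix length.
    present=$(awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2 }')
    for addr in "$@"; do
        printf '%s\n' "$present" | grep -qxF "$addr" || echo "$addr"
    done
}
```

On ora102 in the state shown above, ip a | missing_addrs 192.168.0.45 192.168.0.47 would print only 192.168.0.47, the node's own VIP. The SCAN address is left out of the example because, with a single SCAN IP, it is only ever present on one node at a time.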
Resolving the problem:
First, try restarting CRS on node 2.
Stop CRS:
[root@ora102 bin]# ./crsctl stop crs
or
[root@ora102 bin]# ./crsctl stop cluster
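Then start the cluster:
Difference between Method 1 and Method 2: crsctl start/stop crs manages only the Clusterware stack on the local node and cannot manage remote nodes, whereas crsctl start/stop cluster can manage the local Clusterware stack or the whole cluster.
Specifying -all starts Clusterware on every node in the cluster, that is, the whole cluster; -n starts Clusterware on the named node.
Method 1: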
[root@ora102 bin]# ./crsctl start crs
or
Method 2:
[root@ora102 bin]# ./crsctl start cluster
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'ora102'
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'ora102' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'ora102'
CRS-2681: Clean of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'ora102'
CRS-2676: Start of 'ora.asm' on 'ora102' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'ora102'
CRS-2676: Start of 'ora.crsd' on 'ora102' succeeded
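In the output above every "Attempting to start" line (CRS-2672) is followed by a matching "Start of ... succeeded" line (CRS-2676). A rough way to check that pairing on longer output is sketched below; the function name unconfirmed_starts is hypothetical.

```shell
# Read crsctl start output on stdin. For every CRS-2672 "Attempting to
# start" line, expect a CRS-2676 "Start of ... succeeded" line for the
# same resource and node. Prints resource@node for each start that was
# never confirmed; exit status 1 if there is at least one.
unconfirmed_starts() {
    awk -F"'" '
        /^CRS-2672:/ { pending[$2 "@" $4] = 1 }
        /^CRS-2676:/ { delete pending[$2 "@" $4] }
        END {
            for (k in pending) { print k; n++ }
            exit (n > 0)
        }
    '
}
```

Usage would be ./crsctl start cluster | unconfirmed_starts. For the transcript above it prints nothing, because all four starts were confirmed.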
If the problem persists, clean up the configuration of node 2 and then run root.sh again:
[root@ora102 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
[root@ora102 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
[root@ora102 bin]# /u01/app/11.2.0/grid/root.sh
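Then check whether the status is normal. If it is not, restart CRS once more, which should fix it.
Checking the status now shows everything is normal: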
[root@ora102 bin]# ./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    ora101
ora.FRA.dg     ora....up.type ONLINE    ONLINE    ora101
ora....ER.lsnr ora....er.type ONLINE    ONLINE    ora101
ora....N1.lsnr ora....er.type ONLINE    ONLINE    ora101
ora.OCR.dg     ora....up.type ONLINE    ONLINE    ora101
ora.asm        ora.asm.type   ONLINE    ONLINE    ora101
ora.cvu        ora.cvu.type   ONLINE    ONLINE    ora101
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    ora101
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    ora101
ora.ons        ora.ons.type   ONLINE    ONLINE    ora101
ora....SM1.asm application    ONLINE    ONLINE    ora101
ora....01.lsnr application    ONLINE    ONLINE    ora101
ora.ora101.gsd application    OFFLINE   OFFLINE
ora.ora101.ons application    ONLINE    ONLINE    ora101
ora.ora101.vip ora....t1.type ONLINE    ONLINE    ora101
ora....SM2.asm application    ONLINE    ONLINE    ora102
ora....02.lsnr application    ONLINE    ONLINE    ora102
ora.ora102.gsd application    OFFLINE   OFFLINE
ora.ora102.ons application    ONLINE    ONLINE    ora102
ora.ora102.vip ora....t1.type ONLINE    ONLINE    ora102
ora.orcl.db    ora....se.type ONLINE    ONLINE    ora101
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    ora101
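In a listing this long, the rows worth attention are those whose Target is ONLINE but whose State is not. A minimal filter for that, assuming the five-column layout shown above (the function name crs_stat_mismatch is hypothetical):

```shell
# Read `crs_stat -t` output on stdin and print the name and state of each
# resource whose Target is ONLINE while its State is anything else.
crs_stat_mismatch() {
    awk '
        $1 == "Name" || $1 ~ /^-+$/ { next }   # skip header and ruler
        $3 == "ONLINE" && $4 != "ONLINE" { print $1, $4 }
    '
}
```

Usage would be ./crs_stat -t | crs_stat_mismatch. The gsd rows are ignored because their Target is OFFLINE as well, so for the listing above the filter prints nothing.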
Check the OCR status:
[grid@ora101 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2948
Available space (kbytes) : 259172
ID : 87127720
Device/File Name : +OCR
Device/File integrity check succeeded
Device/File not configured
Device/File not configured
Device/File not configured
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check bypassed due to non-privileged user
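The two things usually worth reading from this output are how full the OCR is and whether the registry integrity check passed. As a sketch (the function name ocr_summary is hypothetical, and it relies on the line labels exactly as they appear above):

```shell
# Read `ocrcheck` output on stdin. Prints the used share of OCR space and
# returns 0 only if the "Cluster registry integrity check succeeded"
# line is present.
ocr_summary() {
    awk -F: '
        /Total space/ { total = $2 + 0 }
        /Used space/  { used = $2 + 0 }
        /Cluster registry integrity check succeeded/ { ok = 1 }
        END {
            if (total > 0) printf "used %.1f%%\n", used * 100 / total
            exit (ok ? 0 : 1)
        }
    '
}
```

Usage would be ocrcheck | ocr_summary. With the figures above (2948 of 262120 kbytes) it reports used 1.1%.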
Check the CRS status; it is normal:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Side notes
1: Errors when stopping the ASM instance
[root@ora101 bin]# ./srvctl stop asm
PRCR-1065 : Failed to stop resource ora.asm
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
Adding the force option is enough:
[grid@ora101 ~]$ srvctl stop asm -f
[grid@ora101 ~]$ srvctl status asm
ASM is not running.
Or force the shutdown with shutdown abort in sqlplus:
[grid@ora101 ~]$ sqlplus / as sysasm
SQL> shu abort
ASM instance shutdown
Now check CRS:
[grid@ora101 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Stopping CRS with crsctl stop crs also stops ASM and its disk groups.
The VIP failover is visible in the stop output:
[root@ora101 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.crsd' on 'ora101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'ora101'
CRS-2673: Attempting to stop 'ora.OCR.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.FRA.dg' on 'ora101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'ora101'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ora101.vip' on 'ora101'
CRS-2677: Stop of 'ora.FRA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.DATA.dg' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ora101.vip' on 'ora101' succeeded
CRS-2672: Attempting to start 'ora.ora101.vip' on 'ora102'
CRS-2676: Start of 'ora.ora101.vip' on 'ora102' succeeded -----VIP failover happens here
CRS-2677: Stop of 'ora.OCR.dg' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'ora101'
CRS-2677: Stop of 'ora.ons' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'ora101'
CRS-2677: Stop of 'ora.net1.network' on 'ora101' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'ora101' has completed
CRS-2677: Stop of 'ora.crsd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'ora101'
CRS-2673: Attempting to stop 'ora.evmd' on 'ora101'
CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ora101'
CRS-2677: Stop of 'ora.evmd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ora101' succeeded
CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ora101'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ora101'
CRS-2677: Stop of 'ora.cssd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'ora101'
CRS-2677: Stop of 'ora.crf' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ora101'
CRS-2677: Stop of 'ora.gipcd' on 'ora101' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ora101'
CRS-2677: Stop of 'ora.gpnpd' on 'ora101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
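The failover can be picked out of such a transcript by looking for a resource that is stopped on one node and then started on another. A minimal sketch of that idea follows; the function name relocated_resources is hypothetical.

```shell
# Read crsctl stop output on stdin and print "resource: from -> to" for
# each resource that was stopped on one node (CRS-2677) and afterwards
# started on a different node (CRS-2676).
relocated_resources() {
    awk -F"'" '
        /^CRS-2677:/ { stopped_on[$2] = $4 }
        /^CRS-2676:/ && ($2 in stopped_on) && stopped_on[$2] != $4 {
            print $2 ": " stopped_on[$2] " -> " $4
        }
    '
}
```

Fed the transcript above, it reports a single move: ora.ora101.vip: ora101 -> ora102.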
To start ASM, start the CRS services first:
[root@ora101 bin]# ./crsctl start crs
[root@ora101 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
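The stack may need some time after crsctl start crs before all four services report online, so a single check right after the start can be misleading. The following is a sketch of a polling loop under that assumption; the function name wait_for_crs and its argument order are hypothetical, and the status command is passed in so the loop can be tried without a cluster.

```shell
# Poll a status command until it prints four "is online" lines, or give up.
# Usage: wait_for_crs TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 once the stack looks fully online, 1 if TRIES runs out.
wait_for_crs() {
    tries=$1
    interval=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        # grep -c exits non-zero when it counts 0 matches, hence "|| true"
        online=$("$@" 2>/dev/null | grep -c 'is online' || true)
        [ "$online" -eq 4 ] && return 0
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

As root this could be called as wait_for_crs 30 10 /u01/app/11.2.0/grid/bin/crsctl check crs, which checks every 10 seconds for up to 5 minutes.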
Start the RAC instances and database:
[ ~]$ srvctl start asm 
PRCC-1014 : asm was already running
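2: Brief overview of the CRS architecture:
1) Cluster Synchronization Services (CSS): manages cluster membership, that is, who is a member, who joins and who leaves, and notifies the members.
2) Cluster Ready Services (CRS): the main process managing high-availability operations in the cluster. Everything CRS manages is treated as a resource, including databases, instances, services, listeners, VIP addresses and applications. The CRS process manages cluster resources according to the configuration in the OCR, which covers start, stop, monitoring and failover operations. When the state of a resource changes, the CRS process generates an event. Once RAC is installed, the CRS process monitors the resources and automatically restarts a resource when it fails; generally it retries 5 times and gives up if those attempts fail.
3) Event Management (EVM): a background process that publishes the events generated by CRS.
4) Oracle Notification Service (ONS): a publish-and-subscribe service for communicating FAN (Fast Application Notification) messages.
5) RACG: extends Clusterware to support Oracle-specific requirements and complex resources.
6) Process Monitor Daemon (OPROCD): locked in memory, it monitors the cluster and performs I/O fencing. It works as a hang checker: check, sleep, check again, sleep again. If it wakes up at the wrong time, it reboots the node.
Note:
By default the CRS stack starts automatically when the operating system boots. For maintenance it is sometimes necessary to turn this off, which can be done as root with the commands below; the first disables autostart and the second enables it again.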
[root@rac1 bin]# ./crsctl disable crs
[root@rac1 bin]# ./crsctl enable crs
These commands actually modify the contents of the file /etc/oracle/scls_scr/raw/root/crsstart
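The current setting can therefore be read back from that file. The sketch below rests on an assumption that is not confirmed by this post: that the file holds the single word enable or disable. The function name crs_autostart_state is hypothetical, and the path is passed in rather than hard-coded.

```shell
# Report the autostart setting stored in a crsstart-style file.
# ASSUMPTION: the file holds the single word "enable" or "disable".
# Prints the word found, or "unknown" (with status 1) otherwise.
crs_autostart_state() {
    file=$1
    [ -r "$file" ] || { echo "unknown"; return 1; }
    state=$(tr -d '[:space:]' < "$file")
    case $state in
        enable|disable) echo "$state" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

Usage would be crs_autostart_state /etc/oracle/scls_scr/raw/root/crsstart, run as a user who can read the file. If the file looks different on your version, treat the result as unknown and rely on crsctl instead.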
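CRS consists of three services: CRS, CSS and EVM. Each service is made up of a series of modules, and crsctl allows each module to be traced, with the trace written to the logs.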
[root@rac1 bin]# ./crsctl lsmodules css
[root@rac1 bin]# ./crsctl lsmodules evm
-- Trace the CSSD module; this must be run as root:
[root@rac1 bin]# ./crsctl debug log css "CSSD:1"
Configuration parameter trace is now set to 1.
Set CRSD Debug Module: CSSD Level: 1
-- View the trace log
[root@rac1 cssd]# pwd
/u01/app/oracle/product/crs/log/rac1/cssd
[root@rac1 cssd]# more ocssd.log
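4: Oracle Cluster Registry (OCR):
The OCR manages the configuration information for Oracle Clusterware and the Oracle RAC database, much like the Windows registry. This also includes the Oracle Local Registry (OLR), which exists on every node of the cluster and manages that node's cluster configuration. Oracle Clusterware keeps the configuration of the whole cluster on shared storage; this storage is the OCR disk. Within the cluster only one node can read and write the OCR disk, and this node is called the master node. Every node keeps a copy of the OCR in memory, and an OCR process reads from that in-memory copy. When the OCR content changes, the OCR process on the master node synchronizes it to the OCR processes on the other nodes.
ocrcheck:
The ocrcheck command checks the consistency of the OCR content. While it runs, it writes a log file named ocrcheck_pid.log under $CRS_HOME/log/nodename/client. The command requires no arguments.
[root@rac1 bin]# ./ocrcheck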
5: Finally, check the database status:
1) Check the status of the database instances:
[root@ora102 bin]# ./srvctl status database -d ORCL
Instance orcl1 is running on node ora101
Instance orcl2 is running on node ora102
2) Check the status of the ASM instances:
[root@ora102 bin]# ./srvctl status asm
ASM is running on ora101,ora102
3) Check the CRS status; the following output is normal:
[root@ora102 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
-- Check the individual daemons
[root@rac1 bin]# ./crsctl check cssd
CSS appears healthy
[root@rac1 bin]# ./crsctl check crsd
CRS appears healthy
[root@rac1 bin]# ./crsctl check evmd
EVM appears healthy
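Summary: An Oracle RAC cluster is a single unit, and its nodes need to be started and shut down together. If you start only one node, the other node's VIP fails over to it and the voting disk vote evicts the other node from the cluster, which is the split-brain scenario. The basic approach to resolving it is: first restart CRS on the evicted node (crsctl stop crs, then crsctl start crs). If that does not work, clean up the configuration of node 2, run root.sh again, and then start CRS with crsctl start crs.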



From the ITPUB blog, link: http://blog.itpub.net/29654823/viewspace-2148123/. If you repost this, please credit the source; otherwise legal action may be taken.
