oracle兩節點RAC,由於gipc導致某節點crs無法啟動問題分析

巡完南山巡南山發表於2019-01-14

兩節點RAC,其中1 節點叢集CRS無法啟動。經過分析原因為2節點gipcd程式異常,導致節點之間無法正常通訊,重啟2節點gipcd.bin後問題得以恢復。 從現象來看,是 ora.crsd ora.evmd 無法啟動,其他元件正常。

叢集 alert 日誌

節點1叢集日誌13:08分時手動重啟內容如下, 關於olsnodes.log無法刪除的資訊本環境中一直存在,此處資訊可忽略。

 

2018-11-26   13:08:29.521:

[client(892)]CRS-0009:log   file   "/home/u01/app/grid/11.2.0/product/log/sxmms1/client/olsnodes.log"   reopened

2018-11-26   13:08:29.521:

[client(892)]CRS-0019:file   rotation terminated. log file: "/home/u01/app/grid/11.2.0/product/log/sxmms1/client/olsnodes.log"

2018-11-26   13:08:42.421:

[ohasd(903)]CRS-2112:The   OLR service started on node sxmms1.

2018-11-26   13:08:42.433:

[ohasd(903)]CRS-1301:Oracle   High Availability Service started on node sxmms1.

2018-11-26   13:08:42.433:

[ohasd(903)]CRS-8017:location:   /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0   errors occurred

2018-11-26   13:08:45.864:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(948)]CRS-2302:Cannot   get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).

2018-11-26   13:08:51.238:

[gpnpd(1118)]CRS-2328:GPNPD   started on node sxmms1.

2018-11-26   13:08:53.710:

[cssd(1184)]CRS-1713:CSSD   daemon is started in clustered mode

2018-11-26   13:08:55.508:

[ohasd(903)]CRS-2767:Resource   state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2018-11-26   13:08:55.509:

[ohasd(903)]CRS-2769:Unable   to failover resource 'ora.diskmon'.

2018-11-26   13:09:03.406:

[cssd(1184)]CRS-1707:Lease   acquisition for node sxmms1 number 1 completed

2018-11-26   13:09:04.658:

[cssd(1184)]CRS-1605:CSSD   voting file is online: ORCL:OCR2; details in   /home/u01/app/grid/11.2.0/product/log/sxmms1/cssd/ocssd.log.

2018-11-26   13:09:07.670:

[cssd(1184)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms1 sxmms2 .

2018-11-26   13:09:09.989:

[ctssd(1269)]CRS-2407:The   new Cluster Time Synchronization Service reference node is host sxmms2.

2018-11-26   13:09:09.990:

[ctssd(1269)]CRS-2401:The   Cluster Time Synchronization Service started on host sxmms1.

2018-11-26   13:09:11.701:

[ohasd(903)]CRS-2767:Resource   state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2018-11-26   13:09:11.701:

[ohasd(903)]CRS-2769:Unable   to failover resource 'ora.diskmon'.

2018-11-26   13:10:08.710:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(1129)]CRS-5818:Aborted   command 'start' for resource 'ora.ctssd'. Details at (:CRSAGF00113:) {0:0:2}   in /home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log.

2018-11-26   13:10:12.714:

[ohasd(903)]CRS-2757:Command   'Start' timed out waiting for response from the resource 'ora.ctssd'. Details   at (:CRSPE00111:) {0:0:2} in /home/u01/app/grid/11.2.0/product/log/sxmms1/ohasd/ohasd.log.

[client(1584)]CRS-10001:26-Nov-18   13:10 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(1589)]CRS-10001:26-Nov-18   13:10 ACFS-9392: Validating ADVM/ACFS installation files for operating   system.

[client(1591)]CRS-10001:26-Nov-18   13:10 ACFS-9393: Verifying ASM Administrator setup.

[client(1594)]CRS-10001:26-Nov-18   13:10 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(1597)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleoks.ko' driver.

[client(1625)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleadvm.ko' driver.

[client(1653)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleacfs.ko' driver.

[client(1764)]CRS-10001:26-Nov-18   13:10 ACFS-9327: Verifying ADVM/ACFS devices.

[client(1773)]CRS-10001:26-Nov-18   13:10 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.

[client(1777)]CRS-10001:26-Nov-18   13:10 ACFS-9156: Detecting control device '/dev/ofsctl'.

[client(1782)]CRS-10001:26-Nov-18   13:10 ACFS-9322: completed

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.asm' failed to start automatically.

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.crsd' failed to start automatically.

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.evmd' failed to start automatically.

2018-11-26   13:11:42.738:

[ohasd(903)]CRS-2765:Resource   'ora.ctssd' has failed on server 'sxmms1'.

2018-11-26   13:11:45.381:

[ctssd(2151)]CRS-2407:The   new Cluster Time Synchronization Service reference node is host sxmms2.

2018-11-26   13:11:45.382:

[ctssd(2151)]CRS-2401:The   Cluster Time Synchronization Service started on host sxmms1.

 

分析

日誌只擷取了部分內容,從日誌來看,幾乎很多元件在啟動時都出現了超時

/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f51780ce0c0], buflen=8

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] translateReturnCodes, return = 0,   state detail = Checkcb data [0x7f51780ce0c0]: mode[0xc0] offset[0 ms].

[    clsdmc][2525660928]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=7805FCE0

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f517805fce0], buflen=8

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] translateReturnCodes, return = 0,   state detail = Checkcb data [0x7f517805fce0]: mode[0xc0] offset[0 ms].

2018-11-26 13:10:08.710: [   AGENT][2527762176]{0:0:2} {0:0:2} Created   alert : (:CRSAGF00113:) :  Aborting the   command: start for resource: ora.ctssd 1 1

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort {

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] abort {

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] Agent::abort last call info:  "Agent::Agent refreshAttr"

2018-11-26 13:10:08.710: [ora.ctssd][2527762176]{0:0:2}   [start] abort command: start

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] tryActionLock {

[    clsdmc][2525660928]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=780CE0C0

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f51780ce0c0], buflen=8

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:08.795: [ora.ctssd][2525660928]{0:0:2}   [start] translateReturnCodes, return = 0, state detail = Checkcb data   [0x7f51780ce0c0]: mode[0xc0] offset[0 ms].

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] Start action aborted

2018-11-26 13:10:08.796: [ora.ctssd][2525660928]{0:0:2}   [start] clsnUtils::error Exception type=2 string=

CRS-5017: The resource action "ora.ctssd   start" encountered the following error:

Start action for octssd aborted. For details   refer to "(:CLSN00107:)" in   "/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log".

 

2018-11-26 13:10:08.796: [    AGFW][2525660928]{0:0:2} sending status   msg [CRS-5017: The resource action "ora.ctssd start" encountered   the following error:

Start action for octssd aborted. For details   refer to "(:CLSN00107:)" in   "/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log".

] for start for resource: ora.ctssd 1 1

2018-11-26 13:10:08.796:   [ora.ctssd][2525660928]{0:0:2} [start] (:CLSN00107:) clsn_agent::start }

2018-11-26 13:10:08.797: [    AGFW][2523559680]{0:0:2} Agent sending   reply for: RESOURCE_START[ora.ctssd 1 1] ID 4098:361

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] got lock

2018-11-26 13:10:12.711: [ora.ctssd][2527762176]{0:0:2}   [start] tryActionLock }

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] abort    }

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort }

2018-11-26 13:10:12.711: [    AGFW][2527762176]{0:0:2} Command: start   for resource: ora.ctssd 1 1 completed with status: TIMEDOUT

2018-11-26 13:10:12.712: [    AGFW][2523559680]{0:0:2} Agent sending   reply for: RESOURCE_START[ora.ctssd 1 1] ID 4098:361

[    clsdmc][2527762176]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=84006B40

 

 

日誌分析

CRSD 日誌看到異常資訊,但並未有特別的指向具體是哪裡出現問題導致無法啟動。

2018-11-26 13:03:59.877: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:04:29.881: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:04:59.886: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:05:29.890: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:05:59.895: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:06:29.897: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:11:55.357: [ CRSMAIN][3963266848]   First attempt: init CSS context succeeded.

[    clsdmt][3956815616]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=sxmms1DBG_CRSD))

2018-11-26 13:11:55.361: [  clsdmt][3956815616]PID for the Process   [2324], connkey 1

2018-11-26 13:11:55.361: [  clsdmt][3956815616]Creating PID [2324] file   for home /home/u01/app/grid/11.2.0/product host sxmms1 bin crs to   /home/u01/app/grid/11.2.0/product/crs/init/

2018-11-26 13:11:55.362: [  clsdmt][3956815616]Writing PID [2324] to   the file [/home/u01/app/grid/11.2.0/product/crs/init/sxmms1.pid]

2018-11-26 13:11:56.304: [ CRSMAIN][3956815616]   Policy Engine is not initialized yet!

2018-11-26 13:11:56.304: [ CRSMAIN][3963266848]   CRS Daemon Starting

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: allcomp  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: default  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: COMMCRS  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: COMMNS  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: CSSCLNT  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCLIB  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCXBAD  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCLXPT  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCUNDE  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPC  0

日誌分析

evmd 日誌為早上無法啟動時的資訊,但後續內容一致,所以時間09:51的錯誤資訊也可用來分析;日誌顯示CRS無法啟動,gipc連線超時,由於另外節點都正常,所以排除OCR出現問題。

2018-11-26 09:51:56.086: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:56.086: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]prom_waitconnect: CONN   NOT ESTABLISHED (0,29,1,2)

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]GIPC error [29] msg   [gipcretConnectionRefused]

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:57.088: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]prom_waitconnect: CONN   NOT ESTABLISHED (0,29,1,2)

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]GIPC error [29] msg   [gipcretConnectionRefused]

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:58.089: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

 

日誌分析

分析gipc日誌發現,在檢測與2節點之間連線時發生超時,最終導致連線無法建立,gipc是RAC叢集中負責私網的連線建立,當建立連線出現問題時就需要進一步分析。

2018-11-26 13:05:27.553: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000001a67

2018-11-26 13:05:28.954: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:29.108: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:29.109: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:30.624: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:30.624: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:31.960: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:32.478: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:32.559: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:33.550: [GIPCDMON][2818569984]   gipcdMonitorSaveInfMetrics: inf[ 0]    eth2                 - rank   99, avgms 1.111111 [ 151 / 117 / 117 ]

2018-11-26 13:05:33.964: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:34.114: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:34.115: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:35.646: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:35.646: [ CLSINET][2818569984]   # 0 Interface 'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:36.971: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:37.480: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:37.565: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:38.975: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:39.120: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000000723

2018-11-26 13:05:39.121: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:40.626: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:40.626: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:41.982: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:42.482: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:42.572: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:43.986: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:44.127: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:45.632: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:45.632: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:46.992: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:47.485: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:47.578: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics),   endp(0000000000000348), len(1032), buf(0x7f7ca0254448), inf(ip:   192.168.0.1:14459, mask: 255.255.255.0, subnet: 192.168.0.0, mac: , ifname: )   time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: enqueue local interface metrics (1) to worklist

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000000723

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics),   endp(0000000000000723), len(1032), buf(0x7f7ca0254448), inf(ip:   192.168.0.1:29489, mask: 255.255.255.0, subnet: 192.168.0.0, mac: , ifname: )   time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: enqueue local interface metrics (1) to worklist

2018-11-26 13:05:49.134: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:49.134: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:49.135: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

透過以上分析,一些資源無法啟動皆是gipc建立建立超時導致,而私網出現問題可分為幾種情況

1 、os層面網路卡、網路配置、路由、網路交換機等出現問題;

2 、iptable、selinux等出現異常;

2 、兩個節點gipc至少有一個程式存在異常;

透過1節點系統日誌massage、netstat、iptables、selinux等方面的檢查,確認系統層面網路無異常,私網也可以ping通,其他透過叢集狀態也可間接判斷,因為1節點cssd服務正常,基本可以確認私網的連通性是正常的,而1節點的gipc日誌中eht2網路卡的rank值為99,也表名網路卡狀態正常,排除1節點的網路卡和gipc後,需要檢視2節點來進一步判斷。

 

叢集 ALERT 日誌

節點2叢集ALERT日誌分析,13:08分節點1重啟期間,節點2的gipc、evmd和ctss服務也出現了異常

2018-11-26   13:00:40.747:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:06:57.926:

[cssd(15973)]CRS-1625:Node   sxmms1, number 1, was manually shut down

2018-11-26   13:06:57.944:

[cssd(15973)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms2 .

2018-11-26   13:09:08.318:

[cssd(15973)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms1 sxmms2 .

2018-11-26   13:10:46.812:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:20:52.862:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:31:02.879:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:41:12.924:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:51:20.976:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:01:33.018:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:11:45.027:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:20:10.218:

[ohasd(5572)]CRS-2765:Resource   'ora.gipcd' has failed on server 'sxmms2'.

2018-11-26   14:20:10.287:

[ohasd(5572)]CRS-2765:Resource   'ora.ctssd' has failed on server 'sxmms2'.

2018-11-26   14:20:10.870:

[ohasd(5572)]CRS-2765:Resource   'ora.evmd' has failed on server 'sxmms2'.

2018-11-26   14:20:11.060:

[/home/u01/app/grid/11.2.0/product/bin/scriptagent.bin(52396)]CRS-5822:Agent   '/home/u01/app/grid/11.2.0/product/bin/scriptagent_grid' disconnected from   server. Details at (:CRSAGF00117:) {0:11:97} in   /home/u01/app/grid/11.2.0/product/log/sxmms2/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2018-11-26   14:20:11.060:

日誌分析

分析日誌發現,rank值出現過0的情況,並且drop資訊出現超時,同時2節點的系統日誌、網路卡以及netstat等資訊無異常,至此已基本可以確認2節點的gipc程式異常導致的兩節點間無法建立連線。

2018-11-26   13:07:41.526: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:41.526: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737498576 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:41.537: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:42.226: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:42.384: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:42.384: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:43.840: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:44.077: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:44.888: [GIPCDMON][4093638400] gipcdMonitorSaveInfMetrics: inf[ 0]  eth2                 - rank    0, avgms 30000000000.000000 [ 16 / 0 / 0   ]

2018-11-26   13:07:44.888: [GIPCDMON][4093638400] gipcdMonitorSaveInfMetrics: saving:  eth2:0

2018-11-26   13:07:45.542: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737502596 ms, node 0x7f00d022b420 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid   50a77767-85458162 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [11888 : 11888], createTime 725590486, sentRegister 1, localMonitor 1, flags   0x4 }

2018-11-26   13:07:45.812: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:46.529: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:46.531: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:46.531: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737503586 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:46.544: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientInterfaceMetrics: Received   type(gipcdmsgtypeInterfaceMetrics), endp(0000000011befdfb), len(1032),   buf(0x7f00e82872e8), inf(ip: 192.168.0.2:39280, mask: 255.255.255.0, subnet:   192.168.0.0, mac: , ifname: ) time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientInterfaceMetrics: enqueue   local interface metrics (1) to worklist

2018-11-26 13:07:47.380:   [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:47.380: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:07:48.846:   [GIPCDCLT][4166940416] gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:49.083: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:50.813: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:51.534: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:51.536: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:51.536: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737508586 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:51.550: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737508606 ms, node 0x7f00d022b420 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid   50a77767-85458162 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [11894 : 11894], createTime 725590486, sentRegister 1, localMonitor 1, flags   0x4 }

2018-11-26   13:07:51.550: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:52.237: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:52.386: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:52.386: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:53.852: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:54.089: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:55.815: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:56.540: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:56.541: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:56.541: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737513596 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:56.556: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:57.243: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:57.381: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:57.381: [ CLSINET][4093638400] # 0 Interface 'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:57.556: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737514606 ms, node 0x7f00d022b420 { host 'sxmms1',   haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid 50a77767-85458162   numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [11900 : 11900],   createTime 725590486, sentRegister 1, localMonitor 1, flags 0x4 }

2018-11-26   13:07:58.856: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:59.095: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:08:00.816: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:08:01.544: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:08:01.546: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:08:01.546: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737518596 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:08:01.561: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:08:02.248: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:08:02.401: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms1

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: updating timeout   node sxmms1

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 0 ms, endp 0000000000000000, 00000000470e5359

2018-11-26   13:09:41.705: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30050, msg 0x7f00dc0a8478 { len 1160, seq 2, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 1, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 1 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [30 : 30], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.350: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:09:42.427: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:09:42.427: [ CLSINET][4093638400] # 0 Interface 'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:09:42.707: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc0c71b8 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc038cc8 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc04f288 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737619746 ms, node 0x7f00d02332c0 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-fe7418b0, dstLuid   bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [31 : 31], createTime 737588686, sentRegister 1, localMonitor 1, flags 0x4 }

2018-11-26   13:09:42.708: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

   透過以上過程分析,1節點CRS無法啟動的原因為2節點gipc程式異常,進而導致兩個節點間無法正常建立連線,而gipc為RAC的重要組成部分,從11gR2(11.2.0.2)開始,oracle決定由叢集自己來管理私網網路卡,叢集新特性gipc(Grid IPC)被介紹,這個新特性以守護程式gipcd.bin的形式存在於叢集中,主要的功能如下:

1. 當叢集啟動時,發現叢集的私網網路卡,叢集私網的資訊是從gpnp profile中獲得的。並對發現的私網介面進行檢查;

2. 利用之前發現的私網網路卡,發現叢集中的其他節點,並和其他節點的私網網路卡建立聯絡;

3. 如果叢集配置了多塊私網網路卡,當某個節點的某一個/幾個私網網路卡出現問題時,離線有問題的私網,並通知其他節點;

確認gipcd.bin的作用後,其實1節點CRS無法啟動的原因就已經找到,叢集私網的連線是透過該程式實現,但2節點gipc程式處於異常狀態,所以1節點經過多次重啟始終無法加入叢集。

從以上分析確定為2節點gipc程式異常導致的1節點CRS無法正常啟動,雖然gipc是私網連線使用,但它本身的重啟不會導致叢集異常,所以透過手動kill -9 gipcd.bin程式,隨後gipc程式會自動啟動,而1節點也隨之啟動成功。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26964624/viewspace-2557324/,如需轉載,請註明出處,否則將追究法律責任。

相關文章