Resolving a CRS failure on an Oracle 11.2.0.4 RAC cluster running Oracle Linux 6.7
Within the past month, the CRS stack of an Oracle RAC cluster at my workplace failed, leaving the cluster management commands unusable. Running crsctl stat res -t produced the following error on both nodes:
[grid@db1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

[grid@db2 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
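When CRS-4535 appears, it helps to establish which layers of the stack are still alive before changing anything. The sketch below is a minimal check sequence, assuming the Grid home path seen in this cluster's logs; the commands are guarded so the script is a no-op on a host without Grid Infrastructure:

```shell
#!/bin/sh
# Sketch: see which clusterware layers still respond when CRS-4535 shows up.
# GRID_HOME matches the path that appears in this cluster's logs; adjust it
# for other environments.
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
for c in "crsctl check crs" "crsctl check cluster" "crsctl stat res -t -init"; do
    bin="$GRID_HOME/bin/${c%% *}"
    if [ -x "$bin" ]; then
        $GRID_HOME/bin/$c              # run the check on a real GI node
    else
        echo "skip (no GI here): $c"   # harmless on a non-cluster host
    fi
done
```

In an incident like this one, `crsctl check crs` would be expected to show CSS and EVM online but CRS offline, since only crsd lost access to the OCR while the rest of the stack kept running.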
The databases themselves were still accessible, however, and the business applications were running normally:
[root@db1 ~]# ps -ef | grep pmon
root      8024 204594  0 21:11 pts/0    00:00:00 grep pmon
grid     77120      1  0 Dec21 ?        00:04:21 asm_pmon_+ASM1
oracle   77790      1  0 Dec21 ?        00:05:18 ora_pmon_CAIWU1
oracle   77794      1  0 Dec21 ?        00:05:08 ora_pmon_dadb1
oracle   77848      1  0 Dec21 ?        00:05:39 ora_pmon_chdyl1
oracle   77910      1  0 Dec21 ?        00:07:47 ora_pmon_RLZY1

[root@db2 ~]# ps -ef | grep pmon
grid     27745      1  0 Dec21 ?        00:04:21 asm_pmon_+ASM2
oracle   28393      1  0 Dec21 ?        00:05:21 ora_pmon_dadb2
oracle   28569      1  0 Dec21 ?        00:04:58 ora_pmon_CAIWU2
oracle   28573      1  0 Dec21 ?        00:05:36 ora_pmon_chdyl2
oracle   28583      1  0 Dec21 ?        00:07:49 ora_pmon_RLZY2
Checking the ASM disk group state confirmed that the OCR disk group was indeed offline: it is missing entirely from the asmcmd lsdg output on both nodes.
[grid@db1 ~]$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512    4096  1048576   3072000  2679522                0         2679522              0             N  ARCH/
MOUNTED  EXTERN  N         512    4096  1048576    204800   151138                0          151138              0             N  CWDATA/
MOUNTED  EXTERN  N         512    4096  1048576    512000   472546                0          472546              0             N  DADATA/
MOUNTED  EXTERN  N         512    4096  1048576   3072000   595334                0          595334              0             N  DATA/
MOUNTED  EXTERN  N         512    4096  1048576   1843200   609953                0          609953              0             N  SBDATA/

[grid@db2 ~]$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512    4096  1048576   3072000  2679522                0         2679522              0             N  ARCH/
MOUNTED  EXTERN  N         512    4096  1048576    204800   151138                0          151138              0             N  CWDATA/
MOUNTED  EXTERN  N         512    4096  1048576    512000   472546                0          472546              0             N  DADATA/
MOUNTED  EXTERN  N         512    4096  1048576   3072000   595334                0          595334              0             N  DATA/
MOUNTED  EXTERN  N         512    4096  1048576   1843200   609953                0          609953              0             N  SBDATA/
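The same conclusion can be drawn from inside the ASM instance: a force-dismounted group simply returns no row from v$asm_diskgroup_stat, which is exactly what the agent log later complains about ("no data found in v$asm_diskgroup_stat"). A hedged sketch of the query, guarded so it only attempts the connection where sqlplus exists:

```shell
#!/bin/sh
# Sketch: confirm disk group states from the ASM instance (run as grid on
# an ASM node). A dismounted group is absent from the result set.
if command -v sqlplus >/dev/null 2>&1; then
    sqlplus -S / as sysasm <<'EOF'
set pagesize 100 linesize 120
col name format a12
select group_number, name, state from v$asm_diskgroup_stat order by name;
EOF
else
    echo "sqlplus not found; run this on an ASM node"
fi
```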
Manually mounting the OCR disk group succeeded on both nodes, but crsctl stat res -t still failed with the same error:
[grid@db1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.4.0 Production on Mon Dec 30 21:15:33 2019
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> alter diskgroup ocr mount;

Diskgroup altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

[grid@db1 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

[grid@db2 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.4.0 Production on Mon Dec 30 21:15:05 2019
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> alter diskgroup ocr mount;

Diskgroup altered.

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

[grid@db2 ~]$ crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
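Mounting the disk group brings the storage back, but it does not restart crsd: as the logs further down show (CRS-2771), ohasd had already exhausted its restart attempts for ora.crsd and given up. The usual next step on 11.2 is to start the crsd resource by hand as root on each node; the sketch below assumes the Grid home path from this cluster's logs and should be verified against your own environment before running:

```shell
#!/bin/sh
# Sketch: after the OCR disk group is mounted again, restart the crsd
# resource that ohasd gave up on (CRS-2771). Run as root on each node.
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
CRSCTL="$GRID_HOME/bin/crsctl"
if [ -x "$CRSCTL" ]; then
    "$CRSCTL" start res ora.crsd -init   # restart just the crsd daemon
    "$CRSCTL" stat res -t                # should now list cluster resources
else
    echo "crsctl not found under $GRID_HOME; not a GI node"
fi
```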
The alert_+ASM1.log on node db1 contains the following errors: ASM force-dismounted the OCR disk group because its disks could not be accessed, even though a dd read test against the same disks succeeded.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_77212.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
WARNING: requested mirror side 1 of virtual extent 0 logical extent 0 offset 102400 is not allocated; I/O request failed
WARNING: requested mirror side 2 of virtual extent 0 logical extent 1 offset 102400 is not allocated; I/O request failed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_77212.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Sat Dec 28 05:30:48 2019
SQL> alter diskgroup OCR check /* proxy */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "OCR" does not exist or is not mounted
ERROR: alter diskgroup OCR check /* proxy */
NOTE: client exited [77184]
Sat Dec 28 05:30:49 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35285] opening OCR file
Sat Dec 28 05:30:51 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35305] opening OCR file
Sat Dec 28 05:30:53 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35322] opening OCR file
Sat Dec 28 05:30:55 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35346] opening OCR file
Sat Dec 28 05:30:57 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35363] opening OCR file
Sat Dec 28 05:31:00 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35459] opening OCR file
Sat Dec 28 05:31:02 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35481] opening OCR file
Sat Dec 28 05:31:04 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35520] opening OCR file
Sat Dec 28 05:31:06 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35539] opening OCR file
Sat Dec 28 05:31:08 2019
NOTE: [crsd.bin@db1 (TNS V1-V3) 35557] opening OCR file
Sat Dec 28 21:00:10 2019
Warning: VKTM detected a time drift. Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Examining the trace file referenced in the alert log:
[root@db1 ~]# more /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_77212.trc Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_77212.trc Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Real Application Clusters and Automatic Storage Management options ORACLE_HOME = /u01/app/11.2.0/grid System name: Linux Node name: db1 Release: 3.8.13-68.3.4.el6uek.x86_64 Version: #2 SMP Tue Jul 14 15:03:36 PDT 2015 Machine: x86_64 Instance name: +ASM1 Redo thread mounted by this instance: 0 Oracle process number: 24 Unix process pid: 77212, image: oracle@db1 (TNS V1-V3) *** 2019-12-28 05:30:44.894 *** SESSION ID:(2929.3) 2019-12-28 05:30:44.894 *** CLIENT ID:() 2019-12-28 05:30:44.894 *** SERVICE NAME:() 2019-12-28 05:30:44.894 *** MODULE NAME:(crsd.bin@db1 (TNS V1-V3)) 2019-12-28 05:30:44.894 *** ACTION NAME:() 2019-12-28 05:30:44.894 Received ORADEBUG command (#1) 'CLEANUP_KFK_FD' from process 'Unix process pid: 35253, image: ' *** 2019-12-28 05:30:44.895 Finished processing ORADEBUG command (#1) 'CLEANUP_KFK_FD' *** 2019-12-28 05:30:48.235 WARNING:failed xlate 1 ORA-15078: ASM diskgroup was forcibly dismounted ksfdrfms:Mirror Read file=+OCR.255.4294967295 fob=0x9b00e5d8 bufp=0x7f5dd012ba00 blkno=25 nbytes=4096 WARNING:failed xlate 1 WARNING: requested mirror side 1 of virtual extent 0 logical extent 0 offset 102400 is not allocated; I/O request failed ksfdrfms:Read failed from mirror side=1 logical extent number=0 dskno=65535 WARNING:failed xlate 1 WARNING: requested mirror side 2 of virtual extent 0 logical extent 1 offset 102400 is not allocated; I/O request failed ksfdrfms:Read failed from mirror side=2 logical extent number=1 dskno=65535 ORA-15078: ASM diskgroup was forcibly dismounted ORA-15078: ASM diskgroup was forcibly dismounted
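The "dd test" mentioned above can be sketched as follows. The device path is a placeholder (the article does not name the actual OCR disks); substitute the real multipath device, and only ever read from an ASM disk, never write:

```shell
#!/bin/sh
# Sketch of a read-only dd probe against an ASM disk. DEV is hypothetical;
# point it at the actual OCR disk (e.g. a /dev/mapper multipath device).
DEV=${DEV:-/dev/mapper/ocr1}
if [ -r "$DEV" ]; then
    # Read the first 10 MB and discard it; success means the path is readable.
    dd if="$DEV" of=/dev/null bs=1M count=10 && echo "read OK: $DEV"
else
    echo "cannot read $DEV on this host"
fi
```

A clean dd read only proves the block path works at that moment; it does not rule out the transient I/O stalls that the PST heartbeat warnings below point to.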
The clusterware alert log alertdb1.log on node db1 reports the same problem: the OCR location in the ASM disk group was inaccessible, so crsd kept aborting and being restarted.
2019-12-28 05:30:48.468: [/u01/app/11.2.0/grid/bin/oraagent.bin(77466)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /u01/app/11.2.0/grid/log/db1/agent/crsd/oraagent_grid/oraagent_grid.lo g. 2019-12-28 05:30:48.468: [/u01/app/11.2.0/grid/bin/oraagent.bin(77684)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:332} in /u01/app/11.2.0/grid/log/db1/agent/crsd/oraagent_oracle/oraagent_o racle.log. 2019-12-28 05:30:48.471: [/u01/app/11.2.0/grid/bin/orarootagent.bin(77482)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:11497} in /u01/app/11.2.0/grid/log/db1/agent/crsd/orarootagent_root/ orarootagent_root.log. 2019-12-28 05:30:48.480: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:30:50.003: [crsd(35285)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:50.021: [crsd(35285)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:50.520: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:30:51.918: [crsd(35305)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:51.929: [crsd(35305)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:52.557: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 
2019-12-28 05:30:53.945: [crsd(35322)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:53.956: [crsd(35322)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:54.595: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:30:55.976: [crsd(35346)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:55.988: [crsd(35346)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:56.633: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:30:58.010: [crsd(35363)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:58.020: [crsd(35363)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:30:58.669: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:31:00.043: [crsd(35459)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:00.054: [crsd(35459)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:00.706: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 
2019-12-28 05:31:02.093: [crsd(35481)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:02.103: [crsd(35481)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:02.742: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:31:04.109: [crsd(35520)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:04.119: [crsd(35520)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:04.777: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:31:06.141: [crsd(35539)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:06.151: [crsd(35539)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:06.810: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 2019-12-28 05:31:08.181: [crsd(35557)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:08.191: [crsd(35557)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db1/crsd/crsd.log. 2019-12-28 05:31:08.846: [ohasd(33022)]CRS-2765:Resource 'ora.crsd' has failed on server 'db1'. 
2019-12-28 05:31:08.847: [ohasd(33022)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart. 2019-12-28 05:31:08.848: [ohasd(33022)]CRS-2769:Unable to failover resource 'ora.crsd'.
The oraagent_grid.log on node db1 shows the ora.OCR.dg resource being stopped and its state changing to OFFLINE:
2019-12-28 05:30:16.531: [ AGFW][511039232]{1:30746:2} Agent received the message: AGENT_HB[Engine] ID 12293:113720 2019-12-28 05:30:37.808: [ AGFW][511039232]{1:30746:9373} Agent received the message: RESOURCE_STOP[ora.OCR.dg db1 1] ID 4099:113730 2019-12-28 05:30:37.808: [ AGFW][511039232]{1:30746:9373} Preparing STOP command for: ora.OCR.dg db1 1 2019-12-28 05:30:37.808: [ AGFW][511039232]{1:30746:9373} ora.OCR.dg db1 1 state changed from: ONLINE to: STOPPING 2019-12-28 05:30:37.809: [ora.OCR.dg][513140480]{1:30746:9373} [stop] (:CLSN00108:) clsn_agent::stop { 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] DgpAgent::stop: enter { 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] getResAttrib: attrib name USR_ORA_OPI value true len 4 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] Agent::flagUsrOraOpiIsSet(true) reason not dependency 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] DgpAgent::stop: tha exit } 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] DgpAgent::stopSingle status:2 } 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [stop] (:CLSN00108:) clsn_agent::stop } 2019-12-28 05:30:37.810: [ AGFW][513140480]{1:30746:9373} Command: stop for resource: ora.OCR.dg db1 1 completed with status: SUCCESS 2019-12-28 05:30:37.810: [ora.OCR.dg][513140480]{1:30746:9373} [check] CrsCmd::ClscrsCmdData::stat entity 1 statflag 33 useFilter 0 2019-12-28 05:30:37.811: [ AGFW][511039232]{1:30746:9373} Agent sending reply for: RESOURCE_STOP[ora.OCR.dg db1 1] ID 4099:113730 2019-12-28 05:30:37.838: [ora.OCR.dg][513140480]{1:30746:9373} [check] DgpAgent::runCheck: asm stat asmRet 0 2019-12-28 05:30:37.839: [ora.OCR.dg][513140480]{1:30746:9373} [check] DgpAgent::getConnxn connected 2019-12-28 05:30:37.844: [ora.OCR.dg][513140480]{1:30746:9373} [check] DgpAgent::queryDgStatus excp no data found 2019-12-28 05:30:37.844: [ora.OCR.dg][513140480]{1:30746:9373} 
[check] DgpAgent::queryDgStatus no data found in v$asm_diskgroup_stat 2019-12-28 05:30:37.844: [ora.OCR.dg][513140480]{1:30746:9373} [check] DgpAgent::queryDgStatus dgName OCR ret 1 2019-12-28 05:30:37.845: [ AGFW][511039232]{1:30746:9373} ora.OCR.dg db1 1 state changed from: STOPPING to: OFFLINE 2019-12-28 05:30:37.845: [ AGFW][511039232]{1:30746:9373} Agent sending last reply for: RESOURCE_STOP[ora.OCR.dg db1 1] ID 4099:113730 2019-12-28 05:30:43.889: [ora.asm][503641856]{1:30746:2} [check] CrsCmd::ClscrsCmdData::stat entity 1 statflag 33 useFilter 0 2019-12-28 05:30:43.920: [ora.asm][503641856]{1:30746:2} [check] AsmProxyAgent::check clsagfw_res_status 0 2019-12-28 05:30:48.465: [ CRSCOMM][521545472] IpcC: IPC client connection 6c to member 0 has been removed 2019-12-28 05:30:48.465: [CLSFRAME][521545472] Removing IPC Member:{Relative|Node:0|Process:0|Type:1} 2019-12-28 05:30:48.465: [CLSFRAME][521545472] Disconnected from CRSD:db1 process: {Relative|Node:0|Process:0|Type:1} 2019-12-28 05:30:48.474: [ AGENT][511039232]{0:1:4} {0:1:4} Created alert : (:CRSAGF00117:) : Disconnected from server, Agent is shutting down. 2019-12-28 05:30:48.474: [ AGFW][511039232]{0:1:4} Agent is exiting with exit code: 1
The alert_+ASM2.log on node db2 contains repeated "Waited 15 secs for write IO to PST disk 0 in group 1" warnings, indicating that write I/O to the PST disks of several disk groups, including OCR (group 5), had been stalled for 15 seconds:
Sat Dec 28 03:02:51 2019 WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. WARNING: Waited 15 secs for write IO to PST disk 0 in group 5. WARNING: Waited 15 secs for write IO to PST disk 1 in group 5. WARNING: Waited 15 secs for write IO to PST disk 2 in group 5. WARNING: Waited 15 secs for write IO to PST disk 0 in group 5. WARNING: Waited 15 secs for write IO to PST disk 1 in group 5. WARNING: Waited 15 secs for write IO to PST disk 2 in group 5. WARNING: Waited 15 secs for write IO to PST disk 0 in group 6. WARNING: Waited 15 secs for write IO to PST disk 0 in group 6. Sat Dec 28 03:02:51 2019 NOTE: process _b000_+asm2 (45488) initiating offline of disk 0.3916160907 (OCR1) with mask 0x7e in group 5 NOTE: process _b000_+asm2 (45488) initiating offline of disk 1.3916160906 (OCR2) with mask 0x7e in group 5 NOTE: process _b000_+asm2 (45488) initiating offline of disk 2.3916160905 (OCR3) with mask 0x7e in group 5 NOTE: checking PST: grp = 5 GMON checking disk modes for group 5 at 19 for pid 27, osid 45488 ERROR: no read quorum in group: required 2, found 0 disks NOTE: checking PST for grp 5 done. 
NOTE: initiating PST update: grp = 5, dsk = 0/0xe96bdf8b, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 5, dsk = 1/0xe96bdf8a, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 5, dsk = 2/0xe96bdf89, mask = 0x6a, op = clear GMON updating disk modes for group 5 at 20 for pid 27, osid 45488 ERROR: no read quorum in group: required 2, found 0 disks Sat Dec 28 03:02:51 2019 NOTE: cache dismounting (not clean) group 5/0x8F5B2F9F (OCR) NOTE: messaging CKPT to quiesce pins Unix process pid: 45490, image: oracle@db2 (B001) Sat Dec 28 03:02:51 2019 NOTE: halting all I/Os to diskgroup 5 (OCR) Sat Dec 28 03:02:52 2019 NOTE: LGWR doing non-clean dismount of group 5 (OCR) NOTE: LGWR sync ABA=23.100 last written ABA 23.100 WARNING: Offline for disk OCR1 in mode 0x7f failed. WARNING: Offline for disk OCR2 in mode 0x7f failed. WARNING: Offline for disk OCR3 in mode 0x7f failed. Sat Dec 28 03:02:52 2019 kjbdomdet send to inst 1 detach from dom 5, sending detach message to inst 1 Sat Dec 28 03:02:52 2019 List of instances: 1 2 Dirty detach reconfiguration started (new ddet inc 1, cluster inc 36) Global Resource Directory partially frozen for dirty detach * dirty detach - domain 5 invalid = TRUE 0 GCS resources traversed, 0 cancelled Dirty Detach Reconfiguration complete Sat Dec 28 03:02:52 2019 WARNING: dirty detached from domain 5 NOTE: cache dismounted group 5/0x8F5B2F9F (OCR) SQL> alter diskgroup OCR dismount force /* ASM SERVER:2405117855 */ Sat Dec 28 03:02:52 2019 NOTE: cache deleting context for group OCR 5/0x8f5b2f9f GMON dismounting group 5 at 21 for pid 28, osid 45490 NOTE: Disk OCR1 in mode 0x7f marked for de-assignment NOTE: Disk OCR2 in mode 0x7f marked for de-assignment NOTE: Disk OCR3 in mode 0x7f marked for de-assignment NOTE:Waiting for all pending writes to complete before de-registering: grpnum 5 Sat Dec 28 03:03:03 2019 WARNING: Waited 27 secs for write IO to PST disk 0 in group 1. 
WARNING: Waited 27 secs for write IO to PST disk 0 in group 1. WARNING: Waited 24 secs for write IO to PST disk 0 in group 2. WARNING: Waited 24 secs for write IO to PST disk 0 in group 2. WARNING: Waited 27 secs for write IO to PST disk 0 in group 3. WARNING: Waited 27 secs for write IO to PST disk 0 in group 3. WARNING: Waited 21 secs for write IO to PST disk 0 in group 4. WARNING: Waited 21 secs for write IO to PST disk 0 in group 4. WARNING: Waited 27 secs for write IO to PST disk 0 in group 6. WARNING: Waited 27 secs for write IO to PST disk 0 in group 6. Sat Dec 28 03:03:03 2019 ASM Health Checker found 1 new failures Sat Dec 28 03:03:22 2019 SUCCESS: diskgroup OCR was dismounted SUCCESS: alter diskgroup OCR dismount force /* ASM SERVER:2405117855 */ SUCCESS: ASM-initiated MANDATORY DISMOUNT of group OCR Sat Dec 28 03:03:22 2019 NOTE: diskgroup resource ora.OCR.dg is offline Sat Dec 28 03:03:22 2019 Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: 
ORA-15078: ASM diskgroup was forcibly dismounted Sat Dec 28 05:30:34 2019 WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 6. WARNING: Waited 15 secs for write IO to PST disk 0 in group 6. Sat Dec 28 05:30:37 2019 Received dirty detach msg from inst 1 for dom 5 Sat Dec 28 05:30:37 2019 List of instances: 1 2 Dirty detach reconfiguration started (new ddet inc 2, cluster inc 36) Global Resource Directory partially frozen for dirty detach * dirty detach - domain 5 invalid = TRUE 0 GCS resources traversed, 0 cancelled freeing rdom 5 Dirty Detach Reconfiguration complete Sat Dec 28 05:30:37 2019 Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted WARNING: requested mirror side 1 of virtual extent 5 logical extent 0 offset 704512 is not allocated; I/O request failed WARNING: requested mirror side 2 of virtual extent 5 logical extent 1 offset 704512 is not allocated; I/O request failed Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_27831.trc: ORA-15078: ASM diskgroup was forcibly dismounted ORA-15078: ASM diskgroup was forcibly dismounted Sat Dec 28 05:30:37 2019 SQL> alter diskgroup OCR check /* proxy */ ORA-15032: not all alterations performed ORA-15001: diskgroup "OCR" does not exist or is not mounted ERROR: alter diskgroup OCR check /* proxy */ Sat Dec 28 05:30:44 2019 WARNING: Waited 20 secs for write IO to PST disk 0 in group 2. WARNING: Waited 20 secs for write IO to PST disk 0 in group 2. 
Sat Dec 28 05:30:48 2019 NOTE: client exited [27819] Sat Dec 28 05:30:49 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142641] opening OCR file Sat Dec 28 05:30:51 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142660] opening OCR file Sat Dec 28 05:30:53 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142678] opening OCR file Sat Dec 28 05:30:55 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142696] opening OCR file Sat Dec 28 05:30:57 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142723] opening OCR file Sat Dec 28 05:30:59 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142744] opening OCR file Sat Dec 28 05:31:01 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142773] opening OCR file Sat Dec 28 05:31:03 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142792] opening OCR file Sat Dec 28 05:31:05 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142806] opening OCR file Sat Dec 28 05:31:07 2019 NOTE: [crsd.bin@db2 (TNS V1-V3) 142821] opening OCR file Sat Dec 28 06:18:42 2019 WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 1. WARNING: Waited 15 secs for write IO to PST disk 0 in group 2. WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
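These PST warnings fire when a heartbeat write to a Partnership Status Table disk exceeds the ASM heartbeat I/O timeout, which defaults to 15 seconds in 11.2 (hence the repeated "Waited 15 secs" lines); once exceeded, ASM offlines the disk and, with no quorum left, force-dismounts the group, which is exactly the sequence logged above. A commonly cited workaround for storage that is healthy but occasionally slow is to raise the hidden `_asm_hbeatiowait` parameter in the ASM instance. Because this is an underscore parameter, treat the sketch below as an assumption to be confirmed with Oracle Support for your exact version and storage before use:

```shell
#!/bin/sh
# Sketch (confirm with Oracle Support first): raise the ASM PST heartbeat
# I/O timeout from its 11.2 default of 15s; takes effect after ASM restart.
if command -v sqlplus >/dev/null 2>&1; then
    sqlplus -S / as sysasm <<'EOF'
alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';
EOF
else
    echo "sqlplus not found; run on an ASM node"
fi
```

Raising the timeout only masks slow paths; the underlying storage or fabric latency still needs to be diagnosed, which is what the /var/log/messages check below starts on.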
The alertdb2.log on node db2 contains the same errors, again reporting that the OCR disk group's disks could not be accessed, even though a dd read test succeeded:
2019-12-28 05:30:48.019: [/u01/app/11.2.0/grid/bin/oraagent.bin(28268)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:73} in /u01/app/11.2.0/grid/log/db2/agent/crsd/oraagent_oracle/oraagent_oracle.log. 2019-12-28 05:30:48.019: [/u01/app/11.2.0/grid/bin/scriptagent.bin(37953)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:9:8} in /u01/app/11.2.0/grid/log/db2/agent/crsd/scriptagent_grid/scriptagent_grid.log. 2019-12-28 05:30:48.020: [/u01/app/11.2.0/grid/bin/oraagent.bin(28009)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:10} in /u01/app/11.2.0/grid/log/db2/agent/crsd/oraagent_grid/oraagent_grid.log. 2019-12-28 05:30:48.021: [/u01/app/11.2.0/grid/bin/orarootagent.bin(28025)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:373} in /u01/app/11.2.0/grid/log/db2/agent/crsd/orarootagent_root/orarootagent_root.log. 2019-12-28 05:30:48.024: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:49.410: [crsd(142641)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:49.420: [crsd(142641)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:50.063: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:51.442: [crsd(142660)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 
2019-12-28 05:30:51.451: [crsd(142660)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:52.100: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:53.471: [crsd(142678)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:53.480: [crsd(142678)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:54.138: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:55.507: [crsd(142696)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:55.517: [crsd(142696)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:56.176: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:57.551: [crsd(142723)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:57.560: [crsd(142723)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:30:58.216: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:30:59.592: [crsd(142744)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 
2019-12-28 05:30:59.602: [crsd(142744)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:00.253: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:31:01.627: [crsd(142773)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:01.636: [crsd(142773)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:02.290: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:31:03.658: [crsd(142792)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:03.668: [crsd(142792)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:04.327: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:31:05.701: [crsd(142806)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:05.711: [crsd(142806)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:06.365: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:31:07.726: [crsd(142821)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 
2019-12-28 05:31:07.735: [crsd(142821)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage ]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/db2/crsd/crsd.log. 2019-12-28 05:31:08.402: [ohasd(13034)]CRS-2765:Resource 'ora.crsd' has failed on server 'db2'. 2019-12-28 05:31:08.402: [ohasd(13034)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart. 2019-12-28 05:31:08.403: [ohasd(13034)]CRS-2769:Unable to failover resource 'ora.crsd'.
The oraagent_grid.log on node db2 contains the following errors:
2019-12-28 05:29:59.329: [ AGFW][3601811200]{2:6928:2} Agent received the message: AGENT_HB[Engine] ID 12293:273304
2019-12-28 05:30:17.162: [ora.LISTENER_SCAN2.lsnr][3592312576]{1:34166:403} [check] Utils:execCmd action = 3 flags = 38 ohome = (null) cmdname = lsnrctl.
2019-12-28 05:30:17.267: [ora.LISTENER_SCAN2.lsnr][3592312576]{1:34166:403} [check] execCmd ret = 0
2019-12-28 05:30:17.267: [ora.LISTENER_SCAN2.lsnr][3592312576]{1:34166:403} [check] CrsCmd::ClscrsCmdData::stat entity 5 statflag 32 useFilter 1
2019-12-28 05:30:17.298: [ora.LISTENER_SCAN2.lsnr][3592312576]{1:34166:403} [check] ScanLsnrAgent::checkDependentVipResource: scanVipResource = ora.scan2.vip, statRet = 0
2019-12-28 05:30:17.881: [ora.LISTENER_SCAN3.lsnr][2950686464]{1:34166:403} [check] Utils:execCmd action = 3 flags = 38 ohome = (null) cmdname = lsnrctl.
2019-12-28 05:30:17.986: [ora.LISTENER_SCAN3.lsnr][2950686464]{1:34166:403} [check] execCmd ret = 0
2019-12-28 05:30:17.987: [ora.LISTENER_SCAN3.lsnr][2950686464]{1:34166:403} [check] CrsCmd::ClscrsCmdData::stat entity 5 statflag 32 useFilter 1
2019-12-28 05:30:18.019: [ora.LISTENER_SCAN3.lsnr][2950686464]{1:34166:403} [check] ScanLsnrAgent::checkDependentVipResource: scanVipResource = ora.scan3.vip, statRet = 0
2019-12-28 05:30:27.292: [ora.asm][2950686464]{2:6928:2} [check] CrsCmd::ClscrsCmdData::stat entity 1 statflag 33 useFilter 0
2019-12-28 05:30:27.319: [ora.asm][2950686464]{2:6928:2} [check] AsmProxyAgent::check clsagfw_res_status 0
2019-12-28 05:30:34.522: [ora.ons][2950686464]{2:6928:2} [check] getOracleHomeAttrib: oracle_home = /u01/app/11.2.0/grid
2019-12-28 05:30:34.522: [ora.ons][2950686464]{2:6928:2} [check] Utils:execCmd action = 3 flags = 6 ohome = /u01/app/11.2.0/grid/opmn/ cmdname = onsctli.
2019-12-28 05:30:34.627: [ora.ons][2950686464]{2:6928:2} [check] (:CLSN00010:)ons is running ...
2019-12-28 05:30:34.627: [ora.ons][2950686464]{2:6928:2} [check] (:CLSN00010:)
2019-12-28 05:30:34.628: [ora.ons][2950686464]{2:6928:2} [check] execCmd ret = 0
2019-12-28 05:30:37.858: [ USRTHRD][3575973632]{1:30748:9373} Processing the event CRS_RESOURCE_STATE_CHANGE
2019-12-28 05:30:38.652: [ora.LISTENER.lsnr][3594413824]{2:6928:2} [check] Utils:execCmd action = 3 flags = 38 ohome = (null) cmdname = lsnrctl.
2019-12-28 05:30:38.757: [ora.LISTENER.lsnr][3594413824]{2:6928:2} [check] execCmd ret = 0
2019-12-28 05:30:48.017: [ CRSCOMM][3612317440] IpcC: IPC client connection 6c to member 0 has been removed
2019-12-28 05:30:48.017: [CLSFRAME][3612317440] Removing IPC Member:{Relative|Node:0|Process:0|Type:1}
2019-12-28 05:30:48.017: [CLSFRAME][3612317440] Disconnected from CRSD:db2 process: {Relative|Node:0|Process:0|Type:1}
2019-12-28 05:30:48.020: [ AGENT][3601811200]{0:1:10} {0:1:10} Created alert : (:CRSAGF00117:) : Disconnected from server, Agent is shutting down.
2019-12-28 05:30:48.020: [ AGFW][3601811200]{0:1:10} Agent is exiting with exit code: 1
Checking /var/log/messages on both nodes shows multipath-related errors on each. Note, however, that the disks involved are used for backup, not by the production databases:
Dec 30 05:25:31 db1 multipathd: backup2: sdcr - emc_clariion_checker: query command indicates error
Dec 30 05:25:31 db1 multipathd: checker failed path 69:240 in map backup2
Dec 30 05:25:31 db1 kernel: device-mapper: multipath: Failing path 69:240.
Dec 30 05:25:31 db1 multipathd: backup2: remaining active paths: 3
Dec 30 05:25:37 db1 multipathd: backup2: sdcr - emc_clariion_checker: Active path is healthy.
Dec 30 05:25:37 db1 multipathd: 69:240: reinstated
Dec 30 05:25:37 db1 multipathd: backup2: remaining active paths: 4
Dec 30 05:25:37 db1 kernel: sd 5:0:3:2: emc: ALUA failover mode detected
Dec 30 05:25:37 db1 kernel: sd 5:0:3:2: emc: at SP A Port 5 (owned, default SP A)
Dec 30 05:26:03 db1 kernel: qla2xxx [0000:05:00.1]-801c:5: Abort command issued nexus=5:3:4 -- 1 2002.
Dec 30 06:03:35 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 06:15:23 db1 multipathd: backup3: sdcq - emc_clariion_checker: Read error for WWN 600601608b203300d563752524c1e611. Sense data are 0x0/0x0/0x0.
Dec 30 06:15:23 db1 kernel: qla2xxx [0000:05:00.1]-801c:5: Abort command issued nexus=5:3:1 -- 1 2002.
Dec 30 06:15:23 db1 kernel: device-mapper: multipath: Failing path 69:224.
Dec 30 06:15:23 db1 multipathd: checker failed path 69:224 in map backup3
Dec 30 06:15:23 db1 multipathd: backup3: remaining active paths: 3
Dec 30 06:15:28 db1 multipathd: backup3: sdcq - emc_clariion_checker: Active path is healthy.
Dec 30 06:15:28 db1 multipathd: 69:224: reinstated
Dec 30 06:15:28 db1 multipathd: backup3: remaining active paths: 4
Dec 30 06:15:28 db1 kernel: sd 5:0:3:1: emc: ALUA failover mode detected
Dec 30 06:15:28 db1 kernel: sd 5:0:3:1: emc: at SP A Port 5 (owned, default SP A)
Dec 30 06:59:29 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 07:53:22 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 07:55:11 db1 multipathd: sdct: couldn't get asymmetric access state
Dec 30 07:55:11 db1 multipathd: backup4: load table [0 2147483648 multipath 2 queue_if_no_path retain_attached_hw_handler 1 emc 2 1 round-robin 0 2 1 70:16 1 66:240 1 round-robin 0 2 1 8:64 1 67:80 1]
Dec 30 07:55:11 db1 kernel: sd 5:0:3:4: emc: ALUA failover mode detected
Dec 30 07:55:11 db1 kernel: sd 5:0:3:4: emc: at SP A Port 5 (owned, default SP A)
Dec 30 07:55:11 db1 kernel: sd 4:0:3:4: emc: ALUA failover mode detected
Dec 30 07:55:11 db1 kernel: sd 4:0:3:4: emc: at SP A Port 4 (owned, default SP A)
Dec 30 07:55:35 db1 multipathd: backup2: sdcr - emc_clariion_checker: Read error for WWN 600601608b203300d663752524c1e611. Sense data are 0x0/0x0/0x0.
Dec 30 07:55:35 db1 multipathd: checker failed path 69:240 in map backup2
Dec 30 07:55:35 db1 multipathd: backup2: remaining active paths: 3
Dec 30 07:55:35 db1 kernel: device-mapper: multipath: Failing path 69:240.
Dec 30 07:55:40 db1 multipathd: backup2: sdcr - emc_clariion_checker: Active path is healthy.
Dec 30 07:55:40 db1 multipathd: 69:240: reinstated
Dec 30 07:55:40 db1 multipathd: backup2: remaining active paths: 4
Dec 30 07:55:40 db1 kernel: sd 5:0:3:2: emc: ALUA failover mode detected
Dec 30 07:55:40 db1 kernel: sd 5:0:3:2: emc: at SP A Port 5 (owned, default SP A)
Dec 30 08:39:47 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 08:43:36 db1 multipathd: mpathb: load table [0 20971520 multipath 2 queue_if_no_path retain_attached_hw_handler 1 emc 2 1 round-robin 0 2 1 69:208 1 66:176 1 round-robin 0 2 1 8:0 1 67:16 1]
Dec 30 08:43:36 db1 kernel: sd 5:0:3:0: emc: ALUA failover mode detected
Dec 30 08:43:36 db1 kernel: sd 5:0:3:0: emc: at SP A Port 5 (owned, default SP A)
Dec 30 08:43:36 db1 kernel: sd 4:0:3:0: emc: ALUA failover mode detected
Dec 30 08:43:36 db1 kernel: sd 4:0:3:0: emc: at SP A Port 4 (owned, default SP A)
Dec 30 09:24:04 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 10:13:09 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 11:06:07 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 12:07:36 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 13:08:58 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 14:00:19 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 14:52:20 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 15:40:45 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 16:34:38 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 17:09:56 db1 auditd[15975]: Audit daemon rotating log files
Dec 30 17:38:16 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 18:59:38 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Dec 30 19:54:43 db1 CLSD: The clock on host db1 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
From the cluster logs we can conclude that an I/O problem on the shared storage (intermittent fiber-link drops, or simply I/O latency) caused the CRS stack to crash. What is curious is that although CRS went down, the ASM and database instances stayed up and remained fully usable. A search on Oracle Support turned up note 1581864.1, which explains that the ASM quorum-disk access timeout is governed by the hidden parameter _asm_hbeatiowait, and that this parameter interacts with the OS multipath setting polling_interval. The root cause here is that the OS-level disk access timeout is far larger than ASM's own quorum-disk access timeout, so during a path failure Oracle RAC decides the quorum disk is unreachable and forcibly takes it offline. The fix, then, is to determine the OS polling_interval/disk timeout and the current value of _asm_hbeatiowait, and raise _asm_hbeatiowait above the OS value.
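The relationship between the two timeouts can be sanity-checked with a small script. This is only a sketch: the values are fed in by hand (on a real host they would come from /sys/block/&lt;dev&gt;/device/timeout and the SQL query below), and the "+15 seconds" margin is an assumption, not an Oracle recommendation.

```shell
#!/bin/sh
# Compare the OS SCSI disk timeout with ASM's heartbeat wait and report
# whether _asm_hbeatiowait needs to be raised.
check_hbeatiowait() {
    os_timeout=$1      # OS disk I/O timeout, in seconds
    hbeatiowait=$2     # current _asm_hbeatiowait, in seconds
    if [ "$hbeatiowait" -le "$os_timeout" ]; then
        # ASM gives up on the heartbeat I/O before the OS gives up on the
        # path, so a transient path flap can force the disk offline.
        echo "UNSAFE: raise _asm_hbeatiowait above $os_timeout (e.g. $((os_timeout + 15)))"
    else
        echo "OK: _asm_hbeatiowait=$hbeatiowait > os_timeout=$os_timeout"
    fi
}

check_hbeatiowait 30 15   # the defaults seen in this case
check_hbeatiowait 30 45   # after the fix below
```

With the defaults (30s OS timeout, 15s _asm_hbeatiowait) the first call flags the configuration as unsafe, which matches the failure observed here.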
The concrete steps were as follows:
1. Check the value of _asm_hbeatiowait in the RAC ASM instances (the default is 15 seconds):
SQL> col ksppinm for a20
SQL> col ksppstvl for a40
SQL> col ksppdesc for a80
SQL> SELECT ksppinm, ksppstvl, ksppdesc
  2  FROM x$ksppi x, x$ksppcv y
  3  WHERE x.indx = y.indx AND ksppinm = '_asm_hbeatiowait';

KSPPINM              KSPPSTVL                                 KSPPDESC
-------------------- ---------------------------------------- --------------------------------------------------------------------------------
_asm_hbeatiowait     15                                       number of secs to wait for PST Async Hbeat IO return
2. Check the OS disk access timeout (on Oracle Linux 6.7 the default is 30 seconds):
[root@db1 ~]# cat /sys/block/sdb/device/timeout
30
[root@db2 ~]# cat /sys/block/sdb/device/timeout
30
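To check the timeout for every SCSI disk rather than a single device, a small loop over sysfs can be used. This is a sketch assuming the standard Linux sysfs layout; the sysfs root is passed as a parameter so the function is easy to exercise outside a real host.

```shell
#!/bin/sh
# Print the I/O timeout (in seconds) of every block device that exposes one.
# On a real host, call it as: list_disk_timeouts /sys/block
list_disk_timeouts() {
    root=$1
    for dev in "$root"/*; do
        t="$dev/device/timeout"
        if [ -r "$t" ]; then
            printf '%s: %s\n' "$(basename "$dev")" "$(cat "$t")"
        fi
    done
}
```

Run it as root on both nodes; every disk used by ASM should report the same timeout before you pick the new _asm_hbeatiowait value.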
3. Raise _asm_hbeatiowait to 45 seconds (this is a static parameter, so the cluster must be restarted for it to take effect):
SQL> alter system set "_asm_hbeatiowait"=45 scope=spfile sid='*';

System altered.
4. Restart the cluster.
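The article does not show the restart commands; for an 11.2 stack this is typically done as root, node by node, roughly as sketched below (the GRID_HOME path is the one seen in this environment; verify against your own setup before running):

```shell
# As root, on each node in turn:
/u01/app/11.2.0/grid/bin/crsctl stop crs     # stop the full Clusterware stack on this node
/u01/app/11.2.0/grid/bin/crsctl start crs    # start it again
# Confirm the stack is healthy before moving on to the next node:
/u01/app/11.2.0/grid/bin/crsctl check crs
```

Restarting one node at a time keeps the databases available throughout, since each instance only goes down while its own node's stack is cycled.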
5. Check the cluster status:
[grid@db1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.CWDATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.DADATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.DATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.LISTENER.lsnr
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.OCR.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.SBKDATA.dg
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.asm
               ONLINE  ONLINE       db1                      Started
               ONLINE  ONLINE       db2                      Started
ora.gsd
               OFFLINE OFFLINE      db1
               OFFLINE OFFLINE      db2
ora.net1.network
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
ora.ons
               ONLINE  ONLINE       db1
               ONLINE  ONLINE       db2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       db2
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       db1
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       db1
ora.caiwu.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.chdyl.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.cvu
      1        ONLINE  ONLINE       db1
ora.dadb.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.db1.vip
      1        ONLINE  ONLINE       db1
ora.db2.vip
      1        ONLINE  ONLINE       db2
ora.oc4j
      1        ONLINE  ONLINE       db1
ora.rlzy.db
      1        ONLINE  ONLINE       db1                      Open
      2        ONLINE  ONLINE       db2                      Open
ora.scan1.vip
      1        ONLINE  ONLINE       db2
ora.scan2.vip
      1        ONLINE  ONLINE       db1
ora.scan3.vip
      1        ONLINE  ONLINE       db1
This completes the recovery: CRS is back up and all resources are ONLINE on both nodes.
Source: ITPUB blog, link: http://blog.itpub.net/26015009/viewspace-2671893/. Please credit the source when reposting; unauthorized reproduction may incur legal liability.