Case Analysis: RAC Shared-Disk Physical Path Failure Makes the ASM Disk Group Holding OCR and Votedisk Inaccessible

Posted by 尛樣兒 on 2016-03-21

    The customer's environment is two IBM x3850 servers running Oracle Linux 6.x x86_64, hosting an Oracle 11.2.0.4.0 RAC database. The shared storage is EMC, with EMC VPLEX storage virtualization providing mirror protection, and EMC's native multipathing software installed in the OS. The symptom: when a failover occurred inside VPLEX, the disk group holding OCR and Votedisk became inaccessible on one RAC node, the ora.crsd resource went offline, and the Grid Infrastructure cluster stack on that node went down. The database instance on that node was not affected but stopped accepting new connections; throughout the incident the other node was completely unaffected. The relevant log entries follow:

1. Operating system log:
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 4 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 2 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 3 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 1 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 0 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 11 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 12 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 10 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 9 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 8 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 7 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 5 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 6 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957   port CL2-00 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 1 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 12 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 11 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 10 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 7 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 4 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 8 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 9 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 5 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 3 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 6 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 2 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 0 to CKM00142000957   is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM00142000957   port CL2-04 is dead.
    From the OS log we can see that at Mar 18 08:25:48 the paths through port CL2-00 and port CL2-04 both went dead.
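    When this happens, the surviving and dead paths can also be inspected from the multipathing layer. A quick check, assuming EMC PowerPath (the kernel "Mpx" messages above come from it):

powermt display dev=all    <<<< per-device path list, each path flagged alive or dead
powermt display ports    <<<< per-port summary for each array/VPLEX port (e.g. CL2-00, CL2-04)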

2. ASM log:
Fri Mar 18 08:25:59 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.     <<<< At almost the same time as the OS errors, ASM starts checking the PST (Partnership Status Table) of every disk; the ASM wait time is 15 seconds.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
Fri Mar 18 08:25:59 2016
NOTE: process _b000_+asm1 (66994) initiating offline of disk 0.3190888900 (OCRVDISK_0000) with mask 0x7e in group 3    <<<< group 3 is the disk group holding OCR and Votedisk.
NOTE: process _b000_+asm1 (66994) initiating offline of disk 1.3190888899 (OCRVDISK_0001) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (66994) initiating offline of disk 2.3190888898 (OCRVDISK_0002) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 10 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks    <<<< The OCR/Votedisk disk group uses Normal redundancy with 3 ASM disks: 2 must be accessible, but 0 actually are.
NOTE: checking PST for grp 3 done.
NOTE: initiating PST update: grp = 3, dsk = 0/0xbe3119c4, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 1/0xbe3119c3, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 2/0xbe3119c2, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 11 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks    <<<< Still 0 disks accessible.
Fri Mar 18 08:25:59 2016
NOTE: cache dismounting (not clean) group 3/0x3D81E95D (OCRVDISK) 
WARNING: Offline for disk OCRVDISK_0000 in mode 0x7f failed.    <<<< All disks backing the OCR/Votedisk disk group are offline.
WARNING: Offline for disk OCRVDISK_0001 in mode 0x7f failed.
WARNING: Offline for disk OCRVDISK_0002 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 66996, image: oracle@dzqddb01 (B001)
Fri Mar 18 08:25:59 2016
NOTE: halting all I/Os to diskgroup 3 (OCRVDISK)    <<<< All I/O to the OCRVDISK disk group is halted.
Fri Mar 18 08:25:59 2016
NOTE: LGWR doing non-clean dismount of group 3 (OCRVDISK)
NOTE: LGWR sync ABA=11.69 last written ABA 11.69
Fri Mar 18 08:25:59 2016
kjbdomdet send to inst 2
detach from dom 3, sending detach message to inst 2
Fri Mar 18 08:25:59 2016
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 96)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 3 invalid = TRUE 
Fri Mar 18 08:25:59 2016
NOTE: No asm libraries found in the system
 2 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Mar 18 08:25:59 2016
WARNING: dirty detached from domain 3
NOTE: cache dismounted group 3/0x3D81E95D (OCRVDISK) 
SQL> alter diskgroup OCRVDISK dismount force /* ASM SERVER:1031924061 */     <<<< The OCRVDISK disk group is dismounted.
Fri Mar 18 08:25:59 2016
NOTE: cache deleting context for group OCRVDISK 3/0x3d81e95d
GMON dismounting group 3 at 12 for pid 51, osid 66996
NOTE: Disk OCRVDISK_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 3
ASM Health Checker found 1 new failures
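    For diagnosis, the per-disk status as seen by ASM can be queried once the instance is reachable again, for example:

SQL> SELECT group_number, disk_number, name, mount_status, header_status, state
       FROM v$asm_disk;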

3. Clusterware alert log:
2016-03-18 11:53:19.394: 
[crsd(47973)]CRS-1006:The OCR location +OCRVDISK is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.    <<<< Note this timestamp is much later than the time OCRVDISK was dismounted.
2016-03-18 11:53:38.437: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(48283)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:121} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2016-03-18 11:53:38.437: 
[/u01/app/11.2.0/grid/bin/scriptagent.bin(80385)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:9:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.
2016-03-18 11:53:38.437: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(48177)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:3303} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/orarootagent_root/orarootagent_root.log.
2016-03-18 11:53:38.437: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(48168)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_grid/oraagent_grid.log.
2016-03-18 11:53:38.442: 
[ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.    <<<< ora.crsd is now OFFLINE.
2016-03-18 11:53:39.773: 
[crsd(45323)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.
2016-03-18 11:53:39.779: 
[crsd(45323)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.    <<<< The physical storage is inaccessible.
2016-03-18 11:53:40.470: 
[ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.

    Here a question arises: why did ora.crsd fail while ora.cssd did not go OFFLINE (crsctl stat res -t -init confirms ora.cssd was still up), the database instance kept running, and the node was not evicted? The reason is that the disks behind OCRVDISK were only briefly inaccessible. The cssd process accesses the three ASM disks of OCRVDISK directly and does not depend on the OCRVDISK disk group being in the MOUNTED state, and Clusterware's default disk heartbeat timeout is 200 seconds, so cssd ran into no problem.
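    For reference, both points can be confirmed with standard clusterware commands on the affected node (run from the Grid Infrastructure home):

crsctl stat res -t -init    <<<< ora.cssd should still show ONLINE even with ora.crsd down
crsctl get css disktimeout    <<<< shows the CSS disk heartbeat timeout, 200 seconds by default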

    This raises more questions: why did the other RAC node see no failure at all? And why was only the OCRVDISK disk group dismounted while all the other disk groups stayed normal?

    After the incident, restarting the HAS stack brought the node back to normal. Together with the facts that the other disk groups and the other node were unaffected, we can reasonably conclude that the shared storage itself was basically healthy and was only briefly inaccessible after the paths went down. The key to the problem is this message in the ASM instance log: WARNING: Waited 15 secs for write IO to PST disk. Was the 15-second wait too short, causing OCRVDISK to be taken offline? Below is the explanation from MOS:

Generally this kind of message comes in the ASM alert log file in the below situations,

Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,   <<<< The ASM PST heartbeat to disks in a normal- or high-redundancy disk group is delayed.
thus the ASM instance dismount the diskgroup. By default, it is 15 seconds.    <<<< When the check fails, the ASM instance dismounts the disk group; the default timeout is 15 seconds.

By the way the heart beat delays are sort of ignored for external redundancy diskgroup.    <<<< PST heartbeat delays are ignored for external-redundancy disk groups.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.    <<<< Even when the PST heartbeat delay exceeds 15 seconds, an external-redundancy disk group is not dismounted.

The ASM disk could go into unresponsiveness, normally in the following scenarios:    <<<< An ASM disk typically becomes unresponsive for the following reasons:

+    Some of the paths of the physical paths of the multipath device are offline or lost    <<<< 1. Some physical paths under the multipath device are offline or lost.
+    During path 'failover' in a multipath set up    <<<< 2. A path failover occurs within the multipath setup.
+    Server load, or any sort of storage/multipath/OS maintenance    <<<< 3. Server load, or storage/multipath/OS maintenance work.

   The description above largely explains the failure. Two storage paths went dead (likely triggering a path failover), making the aggregated multipath devices briefly inaccessible. OCRVDISK is a Normal-redundancy disk group, so ASM runs the PST heartbeat check against it; because the disks behind OCRVDISK remained inaccessible for more than 15 seconds, ASM force-dismounted OCRVDISK, which made the OCR file inaccessible and took the crs resource OFFLINE. Because the cssd disk heartbeat timeout is 200 seconds, and cssd accesses the ASM disks directly rather than through the mounted disk group, the css service was unaffected, the ohasd high-availability stack kept running, the cluster node was not evicted, and the database instance kept working.
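   This also suggests why the other disk groups survived: per the MOS explanation, delayed PST heartbeats do not dismount external-redundancy disk groups, so presumably only OCRVDISK uses Normal redundancy here. The redundancy type of each disk group can be confirmed on the ASM instance, for example:

SQL> SELECT group_number, name, type, state FROM v$asm_diskgroup;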

   Oracle offers the following way to address the problem at the database layer:

If you can not keep the disk unresponsiveness to below 15 seconds, then the below parameter can be set in the ASM instance ( on all the Nodes of RAC ):

    _asm_hbeatiowait    <<<< This parameter specifies the PST heartbeat timeout.
    
As per internal bug 17274537, based on internal testing the value should be increased to 120 secs, which is fixed in 12.1.0.2    <<<< Starting with 12.1.0.2, the default value of this parameter is raised to 120 seconds.

Run below in asm instance to set desired value for _asm_hbeatiowait

alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';    <<<< Run this command on the ASM instance to change the parameter (120 is the value recommended above), then restart the ASM instance / CRS.

And then restart the ASM instance / CRS, for the new parameter value to take effect.
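    After the restart, the effective value can be verified on each ASM instance. A common sketch using the X$ parameter views (run as SYSASM; these views require SYS-level access):

SQL> SELECT x.ksppinm AS name, y.ksppstvl AS value
       FROM x$ksppi x, x$ksppcv y
      WHERE x.indx = y.indx
        AND x.ksppinm = '_asm_hbeatiowait';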


To avoid similar problems, the OCR can also be mirrored into a different ASM disk group, which further improves the availability of the ora.crsd resource, as sketched below.
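    A minimal sketch, assuming a second mounted disk group named +OCRMIRROR (a placeholder name) is available; run as root on one node:

# ocrconfig -add +OCRMIRROR
# ocrcheck    <<<< verify that both OCR locations are now listed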

For more details, refer to the MOS note: ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1).

--end--

From the "ITPUB Blog", link: http://blog.itpub.net/23135684/viewspace-2061397/. Please credit the source when reprinting; otherwise legal liability may be pursued.
