Secrets to Keeping RAC Running Safely and Stably 24x7, Part 1 (Resolving Transient I/O Interruptions)

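Over many years in operations, transient interruptions have without doubt been the problem that gives me the most headaches. Most of the time they can only be analyzed after the fact.
How to approach a transient interruption:
1. Use suitable tools to collect useful information, then analyze the interruption from it. For anything involving an Oracle database, OSWatcher is recommended.
2. If you cannot use suitable tools to collect useful error information (at a bank, for example, installing a collection tool is close to impossible), shift your focus to the observed symptoms and the underlying technical principles. Present management with an analysis built on the few symptoms you have and on solid technical principles; the logic must hold together. Problems can be solved this way too.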

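Today's case is a transient I/O interruption caused by a fibre optic cable, which led to the OCR disk group being dismounted on several RAC clusters at the same time.
The figure below briefly introduces the background of the case and also shows the link path from the storage array to the Fibre Channel switch and on to the servers.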

[Figure: case background and the link path from storage to Fibre Channel switch to server]
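The fault in this case was on the fibre cable running from storage A0 to Fibre Channel switch AO. Interruptions on this one cable affected every LUN that uses DEFAULT SPA, and as it happened most of the LUNs designed with DEFAULT SPA belonged to the OCR disk group. That is why the OCR disk group was dismounted repeatedly. The host, storage, and database teams worked together and finally confirmed that the fibre cable was at fault, and it was replaced.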
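Summary of the fix for transient I/O interruptions:
Storage: following this case, replace the fibre cables that do not meet the standard.
Database: analyze how PST works and, based on that, tune the value of "_asm_hbeatiowait" to make the system more robust and stable.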

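PST heartbeat: typically comes into play during transient I/O interruptions, heavy I/O load, or heavy CPU load. When the PST heartbeat delay exceeds the "_asm_hbeatiowait" value, the Oracle ASM instance is told to dismount the disk group, and the GMON process carries out the dismount.
cssd voting heartbeat: typically comes into play when the local node cannot access the OCR at all (I/O is completely cut off), which then leads to a split brain.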
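
To make the PST heartbeat rule concrete, here is a minimal Python sketch of a timeout check. It is a toy model, not Oracle's implementation: the class and function names are hypothetical, the 15-second value is the default quoted in this article, and the 120-second value is only an illustrative larger setting, not a recommendation.

```python
from dataclasses import dataclass

# 15 seconds is the default the article quotes for "_asm_hbeatiowait".
DEFAULT_HBEAT_IO_WAIT_S = 15.0


@dataclass
class PstHeartbeatModel:
    """Toy model only; the class and method names are hypothetical."""

    timeout_s: float = DEFAULT_HBEAT_IO_WAIT_S
    last_heartbeat_s: float = 0.0

    def record_heartbeat(self, now_s: float) -> None:
        # A PST heartbeat I/O completed at time now_s.
        self.last_heartbeat_s = now_s

    def must_dismount(self, now_s: float) -> bool:
        # Dismount only once the delay strictly exceeds the timeout.
        return (now_s - self.last_heartbeat_s) > self.timeout_s


def survives_stall(stall_s: float, timeout_s: float) -> bool:
    """Return True if the disk group would stay mounted through an I/O stall."""
    model = PstHeartbeatModel(timeout_s=timeout_s)
    model.record_heartbeat(0.0)
    return not model.must_dismount(stall_s)


if __name__ == "__main__":
    # Compare a 20-second stall against the default and a larger timeout.
    for timeout in (15.0, 120.0):
        print(timeout, survives_stall(stall_s=20.0, timeout_s=timeout))
```

In this model a stall longer than the timeout forces a dismount, which is why raising the timeout lets a disk group ride out a short interruption, at the cost of reacting more slowly to a real failure.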

1. What is PST?
PST is the Partner Status Table.
1.1 When ASM PST heartbeats on the disks of a normal or high redundancy disk group are delayed beyond the timeout, the ASM instance dismounts the disk group. By default, the timeout is 15 seconds.
1.2 Oracle ASM consults the PST to decide whether a disk can be taken offline, based on how many of its partners are already offline.
1.3 If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
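
The rule in 1.2 and 1.3 can be sketched as a small decision function. This is an illustration under stated assumptions: the article does not say how many offline partners count as "too many", so the `tolerated_offline_partners` parameter is hypothetical, as are all the names below.

```python
from enum import Enum


class Action(Enum):
    OFFLINE_DISK = "take the disk offline"
    DISMOUNT_GROUP = "force dismount of the disk group"


def decide_on_disk_failure(offline_partners: int,
                           tolerated_offline_partners: int) -> Action:
    """Pick the action for a failing disk from the state of its partners.

    offline_partners: partners of the failing disk that are already offline.
    tolerated_offline_partners: assumed threshold; the real limit depends
    on the redundancy level and is not given in the article.
    """
    if offline_partners < 0 or tolerated_offline_partners < 0:
        raise ValueError("counts must be non-negative")
    if offline_partners > tolerated_offline_partners:
        # Too many partners are gone, so the group cannot stay mounted.
        return Action.DISMOUNT_GROUP
    # Enough partners remain, so only the failing disk goes offline.
    return Action.OFFLINE_DISK


if __name__ == "__main__":
    print(decide_on_disk_failure(0, 0).value)
    print(decide_on_disk_failure(1, 0).value)
```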

2. How does PST work?

This problem has several distinctive symptoms, the most severe being a node crash.
Other symptoms:

Disk group outage
Very slow I/O performance
Possibly very high CPU usage
I/O timeouts
Failed communication with ASM, CRS, or CSS
Further symptoms:

Only one node is active; the other hangs while starting ASM.
After an outage the node restarts, but I/O waits are very high.
Overall very slow performance on one node, with no load or other evidence to explain the high I/O statistics.

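How it works (quoted):

External Redundancy: normally one PST copy
Normal Redundancy: at most 3 PST copies
High Redundancy: at most 5 PST copies

The PST may be relocated in the following situations:
An ASM disk holding a PST copy is unavailable (when ASM starts)
An ASM disk goes offline
An I/O error occurs while reading or writing the PST
A disk is dropped normally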

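The PST is checked before any other ASM metadata is read.
When an ASM instance is asked to mount a disk group, the GMON process reads every disk in the disk group to find and verify the PST copies.
If it finds enough PST copies, it mounts the disk group.
After that, the PST is cached in the ASM cache and in GMON's PGA, protected by an exclusive PT.n.0 lock.
The other ASM instances in the same cluster also cache the PST in their GMON PGA, protected by a shared PT.n.0 lock.
Only the GMON that holds the exclusive lock can update the PST on disk.
AUN=1 on every ASM disk is reserved for the PST, but only some of the disks actually hold PST data.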
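
The mount-time check described above can be sketched as follows. The quoted text only says that "enough" PST copies must be found; treating "enough" as a strict majority of the copies is an assumption of this sketch, and the `Disk` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Disk:
    name: str
    holds_pst: bool  # AUN=1 is reserved on every disk, but only some hold PST data
    readable: bool   # False models a disk lost to an I/O interruption


def count_pst_copies(disks: List[Disk]) -> Tuple[int, int]:
    """Scan all disks and return (readable PST copies, total PST copies)."""
    total = sum(1 for d in disks if d.holds_pst)
    found = sum(1 for d in disks if d.holds_pst and d.readable)
    return found, total


def can_mount(disks: List[Disk]) -> bool:
    """Assumed rule: mount only if a strict majority of PST copies is readable."""
    found, total = count_pst_copies(disks)
    return total > 0 and found > total // 2


if __name__ == "__main__":
    # A group with three PST copies (the Normal Redundancy maximum) and one plain disk.
    group = [
        Disk("d1", holds_pst=True, readable=True),
        Disk("d2", holds_pst=True, readable=True),
        Disk("d3", holds_pst=True, readable=False),
        Disk("d4", holds_pst=False, readable=True),
    ]
    print(can_mount(group))
```

Under this assumed rule, losing one of three PST copies still allows a mount, while losing two does not.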

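3. Why is PST used?
To keep data safe through redundancy.
It ensures that the disks within a Normal Redundancy or High Redundancy disk group stay consistent with each other within a bounded time, so that those redundancy policies can actually protect the data.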

Reference:
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)

Oracle Automatic Storage Management (ASM) - Background

Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good copy of the extent and copies it to the disk that had the read error.

If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block reallocation.
If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the write fails, the disk is taken offline.
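
The bad block remapping steps quoted above form a short decision chain, sketched here in Python. The two write operations are passed in as callbacks so the flow can be followed without any real I/O; the callbacks, the enum, and the function name are all hypothetical and do not correspond to an Oracle API.

```python
from enum import Enum
from typing import Callable


class RemapOutcome(Enum):
    AU_HEALTHY = "original allocation unit deemed healthy"
    AU_RELOCATED = "extent moved; original allocation unit marked unusable"
    DISK_OFFLINE = "disk taken offline"


def remap_after_read_error(write_same_location: Callable[[], bool],
                           write_new_au: Callable[[], bool]) -> RemapOutcome:
    """Follow the quoted flow after a good copy of the extent has been read.

    write_same_location: rewrites the good copy over the failed location.
    write_new_au: writes the good copy to a new allocation unit on the same disk.
    Each callback returns True on success.
    """
    if write_same_location():
        # The disk may have done its own bad block reallocation.
        return RemapOutcome.AU_HEALTHY
    if write_new_au():
        # The original location stays bad, so it is marked unusable.
        return RemapOutcome.AU_RELOCATED
    # Both writes failed, so the disk is taken offline.
    return RemapOutcome.DISK_OFFLINE


if __name__ == "__main__":
    print(remap_after_read_error(lambda: False, lambda: True).value)
```

The second write is attempted only when the first one fails, which mirrors the order of the quoted text.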
Another benefit with Oracle ASM based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions such as a bad checksum, the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered the read can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.

When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.

If database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered successful.
If the write to all mirror side fails, database takes the appropriate actions in response to a write error such as taking the tablespace offline.
When the Oracle ASM instance receives a write error message from a database instance or when an Oracle ASM instance encounters a write error itself, the Oracle ASM instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.

The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O.

For information about the remap command, see "remap".

When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or application if ASM can correct the corruption automatically.

Starting Oracle 12c, Oracle ASM disk scrubbing checks logical data corruptions and repairs the corruptions automatically in normal and high redundancy disks groups. The feature is designed so that it does not have any impact to the regular input and output (I/O) operations in production systems. The scrubbing process repairs logical corruptions using the Oracle ASM mirror disks. Disk scrubbing uses Oracle ASM rebalancing to minimize I/O overhead.

The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to Oracle Automatic Storage Management Administrator's Guide 12c Release 1 (12.1).

These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all these benefits are always available for Exadata customers.

########################################################################################
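All rights reserved. This article may be reposted, but the source address must be credited with a link; otherwise legal action will be pursued. [QQ discussion group: 53993419]
QQ: 14040928 E-mail: dbadoudou@163.com
Link to this article: http://blog.itpub.net/26442936/viewspace-2096971/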
########################################################################################
