Secrets to Keeping RAC Running Safely and Stably 24*7, Part 1 (Resolving Transient I/O Disconnects)

Posted by lovehewenyu on 2016-05-10

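In my years in operations, transient disconnects have without doubt been the problem that gives me the most headaches. In most cases they can only be analysed after the fact.
How to approach a transient disconnect:
1. Use an appropriate tool to collect useful information, then analyse the problem from that data. For anything involving an Oracle database, OSWatcher is recommended.
2. If you cannot use an appropriate tool to collect useful error information (at a bank, for example, installing a data-collection tool is basically impossible), shift your focus to the symptoms you did observe and to the underlying technical principles. Present management with an analysis built on those few symptoms and on solid technical principles; the logic must hold together. Problems can be solved this way too.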

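Today's example is a transient I/O disconnect caused by a fibre cable, which led to the OCR disk group being dismounted on several RAC clusters at the same time.
The figure below briefly introduces the background of the case, and also shows the link path from the storage array to the fibre switches and on to the servers.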

[Figure: background of the case, showing the link path from storage to the fibre switches to the servers]
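The case occurred on the fibre cable between storage A0 and fibre switch AO. Flapping on this one cable affected every LUN that uses DEFAULT SPA, and as designed, most of the DEFAULT SPA LUNs happened to be in the OCR disk group. As a result, the OCR disk group was dismounted repeatedly. Through joint work by the host, storage, and database teams, the problem was finally traced to the fibre cable, and the cable was replaced.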
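Summary of the fixes for the transient I/O disconnect:
Storage: following this case, fibre cables that did not meet the standard were replaced.
Database: we analysed how the PST works and then tuned the "_asm_hbeatiowait" value, making the system more robust and stable.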

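PST heartbeat: this usually comes into play during transient I/O disconnects, heavy I/O, or heavy CPU load. When the PST heartbeat delay is detected to exceed the "_asm_hbeatiowait" value, the Oracle ASM instance is notified to dismount the disk group, and the GMON process carries out the dismount.
CSSD voting heartbeat: this usually comes into play when the local node cannot access the OCR at all (I/O is completely interrupted), which then leads to split-brain.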
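The effect of tuning "_asm_hbeatiowait" can be illustrated with a small model. This is only a sketch of the timeout comparison described above, not Oracle's implementation: the function names, the 20-second stall, and the raised value of 120 are assumptions chosen for illustration, while the 15-second default is the one stated in this article.

```python
# Simplified model of the PST heartbeat timeout check.
# The real logic lives inside Oracle ASM; all names here are illustrative.

DEFAULT_HBEAT_IO_WAIT_S = 15  # default value stated in this article


def pst_heartbeat_timed_out(delay_s, hbeat_io_wait_s=DEFAULT_HBEAT_IO_WAIT_S):
    """Return True when the heartbeat delay exceeds the allowed wait."""
    return delay_s > hbeat_io_wait_s


def diskgroup_action(delay_s, hbeat_io_wait_s=DEFAULT_HBEAT_IO_WAIT_S):
    """Map a heartbeat delay to the outcome described in the text."""
    if pst_heartbeat_timed_out(delay_s, hbeat_io_wait_s):
        return "dismount disk group"
    return "keep disk group mounted"


if __name__ == "__main__":
    stall_s = 20  # hypothetical I/O stall caused by a flapping fibre link
    print(diskgroup_action(stall_s))       # evaluated against the 15 s default
    print(diskgroup_action(stall_s, 120))  # evaluated against a hypothetical larger wait
```

In this model a stall that is longer than the default wait but shorter than the raised wait no longer triggers a dismount, which is the kind of tolerance the tuning is meant to buy.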

1. What is the PST?
PST stands for Partner Status Table.
1.1 If ASM PST heartbeats on the ASM disks of a normal or high redundancy disk group are delayed, the ASM instance dismounts the disk group. By default, the allowed delay is 15 seconds.
1.2 Oracle ASM consults the PST to decide whether a disk is taken OFFLINE, based on how many of the disk's partners are offline.
1.3 A disk group goes offline when Oracle ASM forces it: if too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
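The partner check in 1.2 and 1.3 amounts to a simple decision. The sketch below is a minimal illustration of it; the function name and the `max_offline_partners` threshold are hypothetical, since the text only says "too many" and gives no number.

```python
def on_disk_failure(offline_partners, max_offline_partners):
    """Decide what ASM does with a failing disk, per the rule in the text.

    offline_partners: how many of the disk's partners are already offline.
    max_offline_partners: hypothetical tolerance; the source gives no value.
    """
    if offline_partners < 0 or max_offline_partners < 0:
        raise ValueError("counts must be non-negative")
    if offline_partners > max_offline_partners:
        # Too many partners are gone: the disk group is force-dismounted.
        return "force dismount of disk group"
    # Enough partners remain: only the failing disk goes offline.
    return "take disk offline"


if __name__ == "__main__":
    print(on_disk_failure(offline_partners=0, max_offline_partners=1))
    print(on_disk_failure(offline_partners=2, max_offline_partners=1))
```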

2. How does the PST work?

This problem has a few distinctive symptoms, the most severe of which is a node crash.
Also:

Diskgroup outage
Very slow I/O performance
Possibly very high CPU usage
I/O timeouts
Communication failures to ASM, CRS, or CSS
More:

Only one node is active; the other one hangs while starting ASM.
After an outage the node restarts, but I/O waits are very high.
Overall very slow performance on one node, with no load or other evidence to explain why the I/O stats are so high.

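How it works, quoted from http://www.askmaclean.com/archives/pst-partnership-status-table.html

An External Redundancy disk group usually has 1 PST copy.
A Normal Redundancy disk group has at most 3 PST copies.
A High Redundancy disk group has at most 5 PST copies.

The PST may be relocated in the following scenarios:
An ASM disk holding a PST copy is unavailable (when ASM starts)
An ASM disk goes OFFLINE
An I/O error occurs while reading or writing the PST
A disk is dropped normally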

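The PST is checked before any other ASM metadata is read.
When an ASM instance is asked to mount a disk group, the GMON process reads all disks in the disk group to find and verify the PST copies.
If it finds enough PST copies, the disk group is mounted.
After that, the PST is cached in the ASM cache and in GMON's PGA, protected by an exclusive PT.n.0 lock.
The other ASM instances in the same cluster also cache the PST in their GMON's PGA, protected by a shared PT.n.0 lock.
Only the GMON that holds the exclusive lock can update the PST information on disk.
AUN=1 on every ASM disk is reserved for the PST, but only some of the disks actually hold PST data.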
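The PST copy limits and the mount-time check can be put together in a short sketch. The copy limits come from the list above; the source only says that "enough" PST copies must be found, so treating "enough" as a strict majority of the existing copies is an assumption made here for illustration, and the function name is hypothetical.

```python
# Maximum number of PST copies per redundancy level, as listed above.
MAX_PST_COPIES = {"external": 1, "normal": 3, "high": 5}


def can_mount(redundancy, total_pst_copies, readable_pst_copies):
    """Model GMON's mount-time PST check.

    Assumption (not stated in the source): "enough" PST copies means a
    strict majority of the copies that exist for the disk group.
    """
    limit = MAX_PST_COPIES[redundancy]
    if not 1 <= total_pst_copies <= limit:
        raise ValueError(f"{redundancy} redundancy allows 1..{limit} PST copies")
    if not 0 <= readable_pst_copies <= total_pst_copies:
        raise ValueError("readable copies must be between 0 and the total")
    return readable_pst_copies > total_pst_copies // 2


if __name__ == "__main__":
    print(can_mount("normal", 3, 2))  # two of three copies readable
    print(can_mount("normal", 3, 1))  # only one of three copies readable
```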

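3. Why use the PST?
To protect data through redundancy.
The PST ensures that, within a bounded time, the data on the disks of a Normal Redundancy or High Redundancy disk group stays consistent, so that the Normal Redundancy and High Redundancy policies can actually protect the data.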

Reference:
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)

Oracle Automatic Storage Management (ASM) - Background

Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good copy of the extent and copies it to the disk that had the read error.

If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block reallocation.
If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the write fails, the disk is taken offline.
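The bad block remapping sequence just described is a two-step fallback. Here is a minimal sketch of that control flow; the two callables stand in for the actual writes and, like the function name, are hypothetical placeholders rather than any Oracle interface.

```python
from typing import Callable


def remap_after_read_error(write_same_location: Callable[[], bool],
                           write_new_allocation_unit: Callable[[], bool]) -> str:
    """Follow the remapping steps from the quoted text.

    Each callable performs one write attempt and returns True on success.
    """
    # Step 1: rewrite the good copy to the location that failed to read.
    if write_same_location():
        return "allocation unit deemed healthy"
    # Step 2: try a new allocation unit on the same disk.
    if write_new_allocation_unit():
        return "original allocation unit marked unusable"
    # Both writes failed.
    return "disk taken offline"


if __name__ == "__main__":
    # Hypothetical case: the original location is bad, a new one works.
    print(remap_after_read_error(lambda: False, lambda: True))
```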
Another benefit with Oracle ASM based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions such as a bad checksum, the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered the read can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.

When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.

If database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered successful.
If the write to all mirror side fails, database takes the appropriate actions in response to a write error such as taking the tablespace offline.
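The write-error rules above can be sketched as a small classifier. It is only an illustration: the function name and return strings are hypothetical, and the last branch (some copies written but no acknowledgment yet) is not covered by the quoted text, so it is modelled here as an unresolved state.

```python
from typing import Sequence


def database_write_outcome(copy_ok: Sequence[bool], asm_acked_offline: bool) -> str:
    """Classify one mirrored write, following the quoted description.

    copy_ok: one flag per extent copy (mirror side), True if that write succeeded.
    asm_acked_offline: True once ASM has acknowledged the disk offline message.
    """
    if not copy_ok:
        raise ValueError("at least one extent copy is required")
    if all(copy_ok):
        return "write successful"
    if not any(copy_ok):
        return "write failed on all mirror sides: database handles the error"
    if asm_acked_offline:
        return "write successful, failed disk offlined by ASM"
    # Not covered by the quoted text; treated as an unresolved state here.
    return "waiting for ASM to acknowledge the offline disk"


if __name__ == "__main__":
    # Hypothetical two-way mirror where one side fails and ASM acknowledges.
    print(database_write_outcome([True, False], asm_acked_offline=True))
```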
When the Oracle ASM instance receives a write error message from a database instance or when an Oracle ASM instance encounters a write error itself, the Oracle ASM instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.

The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O. For information about the remap command, see "remap".

When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or application if ASM can correct the corruption automatically.

Starting Oracle 12c, Oracle ASM disk scrubbing checks logical data corruptions and repairs the corruptions automatically in normal and high redundancy disks groups. The feature is designed so that it does not have any impact to the regular input and output (I/O) operations in production systems. The scrubbing process repairs logical corruptions using the Oracle ASM mirror disks. Disk scrubbing uses Oracle ASM rebalancing to minimize I/O overhead.

The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to Oracle® Automatic Storage Management Administrator's Guide 12c Release 1 (12.1).

These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all these benefits are always available for Exadata customers.

########################################################################################
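All rights reserved. This article may be reposted, but the source address must be credited with a link; otherwise legal action will be pursued! [QQ discussion group: 53993419]
QQ: 14040928 E-mail: dbadoudou@163.com
Link to this article: http://blog.itpub.net/26442936/viewspace-2096971/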
########################################################################################

