Keeping RAC Running Safely and Stably 24x7, Tip 1: Resolving Transient I/O Interruptions
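In my many years in operations, transient interruptions have been by far my biggest headache. Most of the time they can only be analyzed after the fact.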
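How to approach a transient-interruption problem:
1. Use appropriate tools to collect useful data, then analyze the interruption from that data. Where an Oracle database is involved, OSWatcher is recommended.
2. If you cannot use suitable tools to collect useful error data (at a bank, for example, installing a data-collection tool is all but impossible), shift your focus to the symptoms you did observe and to the underlying technical principles. Present management with an analysis built on those few symptoms and on solid technical reasoning; the logic must hold together. Problems can be solved this way too.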
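When a collector can be installed, the first approach largely comes down to scanning timestamped I/O samples for short stalls. The following is a minimal Python sketch of that idea. It assumes the samples have already been parsed into (timestamp, await_ms) pairs; the pair layout, the function name `find_io_stalls` and the 1000 ms threshold are hypothetical choices for illustration and do not reflect OSWatcher's actual output format.

```python
from datetime import datetime, timedelta


def find_io_stalls(samples, await_threshold_ms=1000.0):
    """Group consecutive slow samples into (first_slow_ts, last_slow_ts) windows.

    samples: iterable of (timestamp, await_ms) pairs sorted by timestamp.
    This layout is a hypothetical, already-parsed form, not a tool's raw output.
    """
    stalls = []
    start = last_slow = None
    for ts, await_ms in samples:
        if await_ms >= await_threshold_ms:
            if start is None:
                start = ts  # a new stall window opens here
            last_slow = ts
        elif start is not None:
            stalls.append((start, last_slow))  # latency recovered, close the window
            start = None
    if start is not None:  # the stall was still open when the data ended
        stalls.append((start, last_slow))
    return stalls


if __name__ == "__main__":
    t0 = datetime(2016, 1, 1, 3, 0, 0)  # made-up timestamps
    awaits = [2.0, 3.0, 4500.0, 9000.0, 2.5, 1.8]
    samples = [(t0 + timedelta(seconds=5 * i), a) for i, a in enumerate(awaits)]
    for begin, end in find_io_stalls(samples):
        print(f"I/O stall from {begin:%H:%M:%S} to {end:%H:%M:%S}")
```

Windows found this way can then be lined up against the timestamps of the dismount messages in the ASM alert log.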
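Today's case is a transient I/O interruption caused by a fibre cable, which led to the OCR disk group being dismounted on several RAC clusters at the same time.
The figure for this case outlines the background and also illustrates the link path from the storage array through the Fibre Channel switch to the servers.
The fault was on the fibre cable between storage A0 and Fibre Channel switch A0. Every time this cable dropped, it affected all LUNs whose default path is SPA (DEFAULT SPA), and it so happened that most of those LUNs belonged to the OCR disk group. That is why the OCR disk group was dismounted repeatedly. The host, storage and database teams worked together and finally confirmed that the fibre cable was at fault, and the cable was replaced.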
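Summary of the fixes for the transient I/O interruption:
Storage: as a result of this case, the fibre cables that did not meet the standard were replaced.
Database: we analyzed how PST works and tuned the "_asm_hbeatiowait" value accordingly, making the system more robust and stable.
PST heartbeat: usually comes into play during a transient I/O interruption, heavy I/O load or heavy CPU load. When the PST heartbeat is delayed beyond the "_asm_hbeatiowait" value, the Oracle ASM instance is notified to dismount the disk group, and the GMON process carries out the dismount.
cssd voting heartbeat: usually comes into play when the local node cannot access the OCR at all (I/O is completely cut off), which then results in a split-brain.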
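The effect of tuning "_asm_hbeatiowait" on the PST heartbeat can be pictured as a simple timeout comparison. The sketch below is only a model of that comparison, not ASM's implementation: the function names are hypothetical, the 15-second default is the value this article gives for the timeout, and 120 seconds is just an example of a larger tuned value, not a recommendation taken from the case.

```python
DEFAULT_HBEAT_IOWAIT_S = 15  # default timeout in seconds as quoted in this article


def pst_heartbeat_timed_out(io_wait_s, hbeat_iowait_s=DEFAULT_HBEAT_IOWAIT_S):
    """True when a PST heartbeat I/O has waited longer than the allowed time."""
    return io_wait_s > hbeat_iowait_s


def diskgroup_survives(interruption_s, hbeat_iowait_s=DEFAULT_HBEAT_IOWAIT_S):
    """Model: the disk group stays mounted if the heartbeat delay stays within the timeout."""
    return not pst_heartbeat_timed_out(interruption_s, hbeat_iowait_s)


if __name__ == "__main__":
    interruption_s = 20  # hypothetical length of one cable drop, in seconds
    for timeout_s in (DEFAULT_HBEAT_IOWAIT_S, 120):  # 120 is an example tuned value
        verdict = "stays mounted" if diskgroup_survives(interruption_s, timeout_s) else "is dismounted"
        print(f"timeout={timeout_s}s: disk group {verdict}")
```

The trade-off in this model is that a larger timeout rides out short interruptions but also delays the reaction to a genuine storage failure.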
1. What is PST?
PST is the Partner Status Table.
1.1 If ASM PST heartbeats to the ASM disks of a normal or high redundancy disk group are delayed beyond the timeout, the ASM instance dismounts the disk group. By default the timeout is 15 seconds.
1.2 ASM uses the PST to decide whether a disk can be taken OFFLINE, based on the status of the disk's partners.
1.3 Whether the whole disk group goes offline is decided by Oracle ASM: if too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
2. How does PST work?
This problem has a few distinctive symptoms, the most severe being a node crash.
Also:
Disk group outage
Very slow I/O performance
Possibly very high CPU usage
I/O timeouts
Failures in communication with ASM, CRS or CSS
More:
Only one node is active; the other hangs while starting ASM.
After an outage the node restarts, but I/O waits are very high.
Overall very slow performance on one node, with no load or other evidence to explain why the I/O statistics are so high.
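How it works, quoted from http://www.askmaclean.com/archives/pst-partnership-status-table.html
An External Redundancy disk group generally has one PST.
A Normal Redundancy disk group has at most 3 PSTs.
A High Redundancy disk group has at most 5 PSTs.
The PST may be relocated in the following situations:
The ASM disk holding a PST is unavailable (at ASM startup)
The ASM disk goes OFFLINE
An I/O error occurs while reading or writing the PST
The disk is dropped normally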
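The PST is checked before any other ASM metadata is read.
When an ASM instance is asked to mount a disk group, the GMON process reads all disks in the disk group to find and verify the PST copies.
If it finds enough PST copies, it mounts the disk group.
After that, the PST is cached in the ASM cache and in GMON's PGA, protected by an exclusive PT.n.0 lock.
The other ASM instances in the same cluster also cache the PST in their GMON's PGA, protected by a shared PT.n.0 lock.
Only the GMON that holds the exclusive lock can update the PST on disk.
AUN=1 on every ASM disk is reserved for the PST, but only a few disks actually hold PST data.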
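The mount-time check described above can be sketched in a few lines of Python. The copy limits per redundancy level come from the text; everything else is an assumption for illustration: the text only says "enough" PST copies must be found, and this sketch models "enough" as a strict majority of the expected copies. The names `MAX_PST_COPIES` and `enough_pst_copies` are hypothetical.

```python
# Upper bounds on PST copies per redundancy level, as listed in the text.
MAX_PST_COPIES = {"external": 1, "normal": 3, "high": 5}


def enough_pst_copies(redundancy, copies_found, copies_expected=None):
    """Model of the mount check: are enough PST copies readable?

    ASSUMPTION: "enough" is modelled as a strict majority of the expected
    copies; the quoted text does not state the exact rule.
    """
    limit = MAX_PST_COPIES[redundancy]
    if copies_expected is None:
        copies_expected = limit
    if not 1 <= copies_expected <= limit:
        raise ValueError(f"{redundancy} redundancy holds between 1 and {limit} PST copies")
    return copies_found > copies_expected // 2


if __name__ == "__main__":
    for redundancy, found in [("external", 1), ("normal", 1), ("normal", 2), ("high", 2), ("high", 3)]:
        print(redundancy, found, "->", "mount" if enough_pst_copies(redundancy, found) else "refuse")
```

Under this assumption a Normal Redundancy disk group with three PST copies still mounts after losing one copy, but not after losing two.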
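3. Why use PST?
To keep data safe through redundancy.
It ensures that the disks within a Normal Redundancy or High Redundancy disk group stay consistent with each other within a bounded time, so that the Normal Redundancy and High Redundancy policies can actually protect the data.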
Reference:
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)
Oracle Automatic Storage Management (ASM) - Background
Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on
corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good
copy of the extent and copies it to the disk that had the read error.
If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad
block reallocation.
If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked
as unusable. If the write fails, the disk is taken offline.
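The remapping sequence in the three paragraphs above is a small decision chain. Below is a minimal Python sketch of that chain; the two callables stand in for the actual write attempts, and all names (`RemapOutcome`, `remap_after_read_error`) are hypothetical, chosen for illustration rather than taken from any Oracle interface.

```python
from enum import Enum


class RemapOutcome(Enum):
    SAME_AU_HEALTHY = "original allocation unit deemed healthy"
    RELOCATED = "extent moved to a new allocation unit, original marked unusable"
    DISK_OFFLINE = "disk taken offline"


def remap_after_read_error(write_same_location, write_new_au):
    """Follow the quoted sequence after a good mirror copy has been read.

    write_same_location, write_new_au: zero-argument callables returning True
    on a successful write (hypothetical stand-ins for the real I/O).
    """
    if write_same_location():
        return RemapOutcome.SAME_AU_HEALTHY
    if write_new_au():
        return RemapOutcome.RELOCATED
    return RemapOutcome.DISK_OFFLINE


if __name__ == "__main__":
    # Rewrite in place fails, relocation on the same disk succeeds.
    outcome = remap_after_read_error(lambda: False, lambda: True)
    print(outcome.name, "-", outcome.value)
```

The second write is only attempted when the first one fails, which mirrors the order in which the text describes the steps.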
Another benefit with Oracle ASM based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions such as a bad
checksum, the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered
the read can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.
When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.
If database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered
successful.
If the write to all mirror side fails, database takes the appropriate actions in response to a write error such as taking the tablespace offline.
When the Oracle ASM instance receives a write error message from a database instance or when an Oracle ASM instance encounters a write error itself, the Oracle ASM
instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many
partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
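The choice between taking one disk offline and dismounting the whole disk group can be modelled with a toy PST. This is a sketch under stated assumptions: the dictionary layout, the function name `handle_write_error` and the `max_offline_partners` threshold are all hypothetical, since the text only says "too many" partners without giving the exact limit.

```python
def handle_write_error(pst, disk, max_offline_partners):
    """Decide what happens to `disk` after a write error.

    pst: dict mapping disk name -> {"status": "online" | "offline", "partners": [names]}
         (a toy stand-in for the Partner Status Table).
    max_offline_partners: ASSUMED tolerance; the quoted text only says "too many".
    """
    offline = [p for p in pst[disk]["partners"] if pst[p]["status"] == "offline"]
    if len(offline) > max_offline_partners:
        return "force_dismount"  # losing this disk too would leave too few copies
    pst[disk]["status"] = "offline"
    return "disk_offline"


if __name__ == "__main__":
    pst = {
        "d1": {"status": "online", "partners": ["d2"]},
        "d2": {"status": "online", "partners": ["d1"]},
    }
    print(handle_write_error(pst, "d1", max_offline_partners=0))  # partner d2 is still online
    print(handle_write_error(pst, "d2", max_offline_partners=0))  # partner d1 is now offline
```

In this toy model the first failure only costs one disk, while a second failure on its partner takes the disk group down, which matches the pattern seen when a shared path drops.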
The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O.
For information about the remap command, see "remap".
When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or
application if ASM can correct the corruption automatically.
Starting Oracle 12c, Oracle ASM disk scrubbing checks logical data corruptions and repairs the corruptions automatically in normal and high redundancy disks groups.
The feature is designed so that it does not have any impact to the regular input and output (I/O) operations in production systems. The scrubbing process repairs
logical corruptions using the Oracle ASM mirror disks. Disk scrubbing uses Oracle ASM rebalancing to minimize I/O overhead.
The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to Oracle® Automatic Storage Management Administrator's Guide 12c Release 1 (12.1).
These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all these benefits are always available for Exadata
customers.
########################################################################################
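All rights reserved. This article may be reposted, but the source must be credited with a link; otherwise legal action will be pursued. [QQ group: 53993419]
QQ: 14040928 E-mail: dbadoudou@163.com
Article link: http://blog.itpub.net/26442936/viewspace-2096971/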
########################################################################################
From the ITPUB blog, link: http://blog.itpub.net/26442936/viewspace-2096971/. If you repost this article, please credit the source; otherwise legal action will be pursued.