Keeping RAC Running Safely and Stably 24x7, Tip 1: Resolving I/O Flash Disconnects
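In my many years in operations, flash disconnects (brief, intermittent outages) have without doubt given me the most headaches. Most of the time they can only be diagnosed after the fact.
How to approach a flash-disconnect problem:
1. Use suitable tools to collect useful information, then analyze the outage from that data. For anything involving an Oracle database, OSWatcher is recommended.
2. If you cannot use suitable tools to collect useful error information, shift your focus to the symptoms you did observe and to the underlying technical principles. (At a bank, for example, installing an information-collection tool is essentially impossible.) Present management with an analysis built on those few symptoms and on solid technical principles, and make sure the logic holds together. Problems can be solved this way too.
Today's case is an I/O flash disconnect caused by a fibre cable, which led to the OCR disk group being dismounted on several RAC clusters at the same time.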
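The diagram below outlines the background of this case and also shows how the link runs from the storage array through the fibre switch to the servers.
The fault in this case was on the fibre cable between storage port A0 and fibre switch AO. Flash disconnects on this one cable affected every LUN that uses the DEFAULT SPA, and by design most of the LUNs on the DEFAULT SPA happened to belong to the OCR disk group. As a result the OCR disk group was dismounted repeatedly. The host, storage, and database teams worked together and finally confirmed that the fibre cable was the cause, and the cable was replaced.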
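Summary of the fixes for the I/O flash disconnects:
Storage: as a result of this case, the fibre cables that did not meet the standard were replaced.
Database: after analyzing how the PST works, we tuned the "_asm_hbeatiowait" value to make the system more robust and stable.
PST heartbeat: usually becomes a problem during I/O flash disconnects, heavy I/O, or heavy CPU load. When the PST heartbeat delay exceeds the "_asm_hbeatiowait" value, the Oracle ASM instance is told to dismount the disk group, and the GMON process carries out the dismount.
cssd voting heartbeat: usually fails when the local node cannot access the OCR at all (I/O is completely interrupted), which then leads to split-brain.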
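The threshold check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ASM's actual implementation: the function name, the list of heartbeat timestamps, and the sample values are all hypothetical, and the 15-second default is the one quoted in this article.

```python
from typing import List, Tuple

# Default heartbeat wait quoted in this article (seconds).
DEFAULT_HBEAT_IOWAIT_S = 15.0


def find_heartbeat_timeouts(
    heartbeat_times: List[float],
    iowait_s: float = DEFAULT_HBEAT_IOWAIT_S,
) -> List[Tuple[float, float]]:
    """Return (last_good_time, gap) for every gap between consecutive
    successful heartbeats that is longer than iowait_s.

    Hypothetical model: each timestamp is the completion time, in seconds,
    of one successful PST heartbeat write.
    """
    timeouts = []
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        gap = cur - prev
        if gap > iowait_s:
            timeouts.append((prev, gap))
    return timeouts


# A 20-second stall between t=6 and t=26 (made-up sample data).
sample = [0, 3, 6, 26, 29]
print(find_heartbeat_timeouts(sample))              # stall exceeds the 15 s default
print(find_heartbeat_timeouts(sample, iowait_s=120))  # a larger, hypothetical threshold
```

In this model, the same 20-second stall counts as a timeout with the 15-second default but not with a larger threshold, which is the effect that tuning "_asm_hbeatiowait" is meant to achieve.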
1. What is the PST?
PST stands for Partner Status Table.
1.1 When ASM PST heartbeats to the ASM disks of a normal or high redundancy disk group are delayed, the ASM instance dismounts the disk group. By default, the heartbeat wait is 15 seconds.
1.2 Oracle ASM consults the PST to decide whether a disk can be taken OFFLINE, based on how many of its partners are already offline.
1.3 If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
2. How does the PST work?
This problem has a few distinctive symptoms, the most severe being a node crash.
Also:
Disk group outage
Very slow I/O performance
Possibly very high CPU usage
I/O timeouts
Communication failures with ASM, CRS, or CSS
More:
Only one node is active; the other one hangs while starting ASM.
After an outage the node restarts, but I/O waits are very high.
Overall very slow performance on one node, with no load and no evidence of why the I/O statistics are so high.
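How it works (quoted):
External Redundancy generally has one PST copy
Normal Redundancy has at most 3 PST copies
High Redundancy has at most 5 PST copies
The PST may be relocated in the following situations:
An ASM disk holding a PST copy is unavailable (when ASM starts)
An ASM disk goes OFFLINE
An I/O error occurs while reading or writing the PST
A disk is dropped normally
The PST is checked before any other ASM metadata is read
When an ASM instance is asked to mount a disk group, the GMON process reads all disks in the disk group to find and verify the PST copies
If it finds enough PST copies, it mounts the disk group
After that, the PST is cached in the ASM cache and in GMON's PGA, protected by an exclusive PT.n.0 lock
The other ASM instances in the same cluster also cache the PST in their GMON PGA, protected by a shared PT.n.0 lock
Only the GMON that holds the exclusive lock can update the PST on disk
AUN=1 on every ASM disk is reserved for the PST, but only a few disks actually hold PST data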
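The mount-time check can be pictured with a small Python sketch. The copy limits per redundancy level come from the list above. Everything else is an assumption made for illustration: the text only says "enough" PST copies, and this sketch interprets that as a strict majority of the expected copies. The function and parameter names are hypothetical.

```python
# Maximum number of PST copies per redundancy level, as listed above.
MAX_PST_COPIES = {"external": 1, "normal": 3, "high": 5}


def has_pst_quorum(redundancy: str, copies_expected: int, copies_readable: int) -> bool:
    """Return True if enough PST copies were found to mount the disk group.

    Assumption (not stated in the article): "enough" means a strict
    majority of the copies the disk group is expected to have.
    """
    limit = MAX_PST_COPIES[redundancy]
    if not 1 <= copies_expected <= limit:
        raise ValueError(f"{redundancy} redundancy allows 1..{limit} PST copies")
    if not 0 <= copies_readable <= copies_expected:
        raise ValueError("readable copies must be between 0 and the expected count")
    return copies_readable > copies_expected // 2


# Normal redundancy with 3 copies: losing one copy still leaves a majority.
print(has_pst_quorum("normal", 3, 2))
# Losing two of three copies does not.
print(has_pst_quorum("normal", 3, 1))
```

Under this assumption a normal redundancy disk group with three PST copies tolerates the loss of one copy, and a high redundancy disk group with five copies tolerates the loss of two.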
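3. Why use the PST?
To keep data safe through redundancy.
The PST ensures that the disks within a Normal Redundancy or High Redundancy disk group stay consistent with each other within a bounded time, so that the Normal Redundancy and High Redundancy policies can actually protect the data.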
Reference:
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)
Oracle Automatic Storage Management (ASM) - Background
Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on
corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good
copy of the extent and copies it to the disk that had the read error.
If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad
block reallocation.
If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked
as unusable. If the write fails, the disk is taken offline.
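The remapping sequence in the paragraphs above can be summarized as a small decision function. This Python sketch only mirrors the order of the steps in the quoted text. The two callables stand in for the actual writes and are hypothetical, as are all names.

```python
from enum import Enum
from typing import Callable


class RemapOutcome(Enum):
    SAME_LOCATION_HEALTHY = "allocation unit deemed healthy"
    MOVED_TO_NEW_AU = "extent moved, original allocation unit marked unusable"
    DISK_OFFLINE = "disk taken offline"


def remap_after_read_error(
    write_same_location: Callable[[], bool],
    write_new_au: Callable[[], bool],
) -> RemapOutcome:
    """Follow the order described in the text after a good copy has been read.

    Each callable is a hypothetical stand-in that attempts one write and
    returns True on success.
    """
    # Step 1: rewrite the good copy to the location that had the read error.
    if write_same_location():
        return RemapOutcome.SAME_LOCATION_HEALTHY
    # Step 2: try a new allocation unit on the same disk.
    if write_new_au():
        return RemapOutcome.MOVED_TO_NEW_AU
    # Step 3: both writes failed, so the disk goes offline.
    return RemapOutcome.DISK_OFFLINE


# Example: the first write fails, the second succeeds.
print(remap_after_read_error(lambda: False, lambda: True))
```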
Another benefit with Oracle ASM based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions such as a bad
checksum, the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered
the read can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.
When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.
If the database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered
successful.
If the write to all mirror sides fails, the database takes the appropriate actions in response to a write error, such as taking the tablespace offline.
When the Oracle ASM instance receives a write error message from a database instance or when an Oracle ASM instance encounters a write error itself, the Oracle ASM
instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many
partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
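The offline-or-dismount decision above can be sketched as follows. The text does not say how many offline partners count as "too many", so this sketch takes that limit as a parameter. The dictionaries that stand in for the PST, the status strings, and the function name are all hypothetical.

```python
from typing import Dict, List


def handle_write_error(
    disk: str,
    partners: Dict[str, List[str]],
    status: Dict[str, str],
    max_offline_partners: int,
) -> str:
    """Decide what happens to `disk` after a write error.

    `partners` and `status` are a toy stand-in for the PST: which disks
    are partners of which, and whether each is "ONLINE" or "OFFLINE".
    `max_offline_partners` is a hypothetical limit for "too many".
    """
    offline = [p for p in partners[disk] if status[p] == "OFFLINE"]
    if len(offline) > max_offline_partners:
        # Too many partners are already offline: force a dismount.
        return "DISMOUNT_DISKGROUP"
    # Otherwise only this disk is taken offline.
    status[disk] = "OFFLINE"
    return "DISK_OFFLINE"


# Toy disk group: D1 is partnered with D2 and D3 (made-up names).
partners = {"D1": ["D2", "D3"], "D2": ["D1", "D3"], "D3": ["D1", "D2"]}
status = {"D1": "ONLINE", "D2": "ONLINE", "D3": "ONLINE"}
print(handle_write_error("D1", partners, status, max_offline_partners=0))
print(handle_write_error("D2", partners, status, max_offline_partners=0))
```

In the example, the first write error only takes D1 offline. The second one hits D2 while its partner D1 is already offline, so with a limit of zero the model forces a dismount.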
The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O.
For information about the remap command, see "remap".
When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or
application if ASM can correct the corruption automatically.
Starting with Oracle 12c, Oracle ASM disk scrubbing checks for logical data corruptions and repairs them automatically in normal and high redundancy disk groups.
The feature is designed so that it does not have any impact to the regular input and output (I/O) operations in production systems. The scrubbing process repairs
logical corruptions using the Oracle ASM mirror disks. Disk scrubbing uses Oracle ASM rebalancing to minimize I/O overhead.
The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to the Oracle Automatic Storage Management Administrator's Guide 12c Release 1 (12.1).
These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all these benefits are always available for Exadata
customers.
########################################################################################
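All rights reserved. This article may be reposted, but the source address must be credited with a link; otherwise legal liability will be pursued. [QQ discussion group: 53993419]
QQ:14040928 E-mail:dbadoudou@163.com
Article link: http://blog.itpub.net/26442936/viewspace-2096971/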
########################################################################################
From the "ITPUB Blog", link: http://blog.itpub.net/25462274/viewspace-2156404/. If you repost this article, please credit the source; otherwise legal liability will be pursued.