ASM instance automatically dismounts a diskgroup, taking down one RAC node

Posted by 531968912 on 2017-12-26
ASM alert log, under /u01/app/grid/diag/asm/+asm/+ASM1/trace:

Thu Jul 30 02:10:46 2015
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Thu Jul 30 02:10:47 2015
NOTE: process _b000_+asm1 (38695) initiating offline of disk 0.3915941304 (DATA2_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (38695) initiating offline of disk 1.3915941302 (DATA2_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (38695) initiating offline of disk 2.3915941303 (DATA2_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 12 for pid 28, osid 38695
ERROR: no read quorum in group: required 2, found 0 disks
.............
Dirty Detach Reconfiguration complete
Thu Jul 30 02:10:47 2015
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xB368755B (DATA2)   <-- dismounted on its own
SQL> alter diskgroup DATA2 dismount force /* ASM SERVER:3009967451 */
.............
Thu Jul 30 02:11:24 2015
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA2 was mounted    <--- mounted again on its own
SUCCESS: ALTER DISKGROUP DATA2 MOUNT  /* asm agent *//* {0:31:15779} */

Reference document:
ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)

The alert log shows the ASM diskgroup being dismounted, and the cause is the "Waited 15 secs for write IO to PST" warning. This is ASM's own heartbeat timeout check: the ASM instance periodically checks whether each ASM disk responds normally.

The note explains:

Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup. By default, it is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.

The description above comes down to the following points:
1. The ASM instance periodically checks the state of the disks in every diskgroup to confirm that communication is normal.
2. The dismount applies only to normal and high redundancy diskgroups. For external redundancy, heartbeat delays do not directly dismount the diskgroup.
3. The default timeout is 15 seconds: if the disks have still not responded to the ASM instance after 15 seconds, the diskgroup is dismounted.

Problems in the storage network can trigger this error. In other words, when ASM issues its periodic check, a disk that does not respond within 15 seconds is treated as inaccessible.

In this case, 02:10 in the morning was exactly when the full database backup was running, so the large volume of writes probably slowed the I/O response.

This parameter only exists from 11.2.0.3.0 onward, which means the ASM instance's disk timeout detection was introduced in 11.2.0.3.

set pages 9999;

SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
FROM SYS.x$ksppi x, SYS.x$ksppcv y
WHERE x.inst_id = USERENV ('Instance')
AND y.inst_id = USERENV ('Instance')
AND x.indx = y.indx
AND upper(x.ksppinm) like '%ASM_H%';

The output is:

NAME                    VALUE   DESCRIB
_asm_hbeatiowait        15      number of secs to wait for PST Async Hbeat IO return
_asm_hbeatwaitquantum   2       quantum used to compute time-to-wait for a PST Hbeat check

When the storage network is not very reliable, the check timeout can be set longer. In fact, the default in 12.1.0.2 is already 120 seconds.

alter system set "_asm_hbeatiowait"=120 scope=spfile;

Restart ASM and keep observing.
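To make the follow-up observation easier, the alert log can be scanned for the PST heartbeat warnings so it is clear when they cluster (for example around the backup window). The following is only a minimal sketch in Python, not part of the original procedure. It assumes the log keeps the timestamp and warning formats shown in the excerpt above. The function names `parse_timestamp` and `count_pst_waits` are made up for illustration, and the log file path is supplied by the caller.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Warning text as it appears in the alert log excerpt in this post.
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)\."
)
# Assumed timestamp layout, e.g. "Thu Jul 30 02:10:46 2015".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_timestamp(line):
    """Return a datetime if the line is an alert-log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return None


def count_pst_waits(lines):
    """Count PST write-wait warnings per (timestamp, diskgroup number)."""
    counts = Counter()
    current = None  # most recent timestamp line seen so far
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is not None:
            current = stamp
            continue
        match = PST_WAIT.search(line)
        if match:
            counts[(current, int(match.group(3)))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): python pst_waits.py <path to the ASM alert log>
    with open(sys.argv[1], errors="replace") as log:
        for (stamp, group), n in count_pst_waits(log).items():
            print(f"{stamp}  group {group}: {n} PST write-wait warnings")
```

If warnings keep appearing after `_asm_hbeatiowait` has been raised, that would suggest the I/O stalls are longer than the new timeout and that the storage path itself needs attention.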

From the ITPUB blog, link: http://blog.itpub.net/25462274/viewspace-2149305/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
