
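Introduction: When an ASM disk group in a production system is about to run out of space, adding disks is one of the most common ways to expand it, and nearly every professional DBA has done it. But what happens if a disk is added to an ASM disk group without being wiped first? This article shares a recent customer case in which adding an unwiped disk to a disk group triggered corrupt-block reports (Read datafile mirror), as a reminder to take care.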

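Problem Description

The system maintenance staff reported that the Oracle database software directory had suddenly filled up and needed to be cleaned up promptly. After logging in, we found that the alert log was being written to continuously, and the new entries all reported detected corrupt blocks.

Part of the alert log is shown below: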

Reading datafile '+xx01/xxx85' for corruption at rdba: 0x1c4b3afc (file x3, block 474692348)
Read datafile mirror 'xxx02' (file x3, block 47xx48) found same corrupt data (no logical check)
Read datafile mirror ' xxx 53' (file x3, block 47xx48) found valid data
Hex dump of (file x3, block 47xx48) in trace file /xxx130931.trc
Repaired corruption at (file x3, block 47 xxx 48)
Hex dump of (file x3, block 47xxx24) in trace file /xx931.trc
Corrupt block relative dba: 0x1c308c08 (file x3, block 47xx4)
Bad header found during buffer read
Data in bad block:
 type: 0 format: 6 rdba: 0x34363835
 last change scn: 0x3833.35313431 seq: 0x30 flg: 0x37
 spare1: 0x31 spare2: 0x36 spare3: 0xf00
 consistency value in tail: 0x30520300
 check value in block header: 0x36
 computed block checksum: 0x6060
Reading datafile '+xxxx8685' for corruption at rdba: 0x1c308c08 (file x3, block 47xx24)
Read datafile mirror 'xxxx2' (file x3, block 47xx24) found same corrupt data (no logical check)
Read datafile mirror 'xxx3' (file x3, block 47xx24) found valid data
Hex dump of (file x3, block 47xxx24) in trace file /xxx0931.trc
Sat Nov 09 12:48:17 2019
Hex dump of (file x3, block 14xxx7) in trace file /xxx22.trc
Corrupt block relative dba: 0x1ed647db (file x3, block 14xxx7)
Bad header found during buffer read
Data in bad block:
 type: 73 format: 6 rdba: 0x5454415f
 last change scn: 0x0e00.00440052 seq: 0x0 flg: 0x00
 spare1: 0x53 spare2: 0x54 spare3: 0x0
 consistency value in tail: 0x01006541
 check value in block header: 0xa00
 block checksum disabled
Reading datafile '+xxxx17527' for corruption at rdba: 0x1ed647db (file x3, block 14xxx7)
Read datafile mirror 'xx002' (file x3, block 14xxx7) found same corrupt data (no logical check)
Read datafile mirror 'xxx0' (file x3, block 14xxx7) found valid data
Hex dump of (file x3, block 14xx7) in trace file /xxx2.trc
Repaired corruption at (file x3, block 14xxx7)

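Problem Analysis

Using the information in the alert log, we looked up the affected data blocks and found that the segments involved include tables and indexes, among other types.

select relative_fno, owner, segment_name, segment_type from dba_extents where file_id = x3 and 35xxxx9 between block_id and block_id + blocks - 1;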

RELATIVE_FNO  OWNER  SEGMENT_NAME  SEGMENT_TYPE
------------  -----  ------------  ------------
        1024  IxxxL  PxxxT         INDEX

RELATIVE_FNO  OWNER  SEGMENT_NAME  SEGMENT_TYPE
------------  -----  ------------  ------------
         124  IxxxM  OxxxT         TABLE

Verification with DBV:



……

Page 278199 is marked corrupt
Corrupt block relative dba: 0x21843eb7 (file x4, block 2xx9)
Bad header found during dbv:
Data in bad block:
 type: 0 format: 4 rdba: 0x0000ffff
 last change scn: 0x0000.00000000 seq: 0x0 flg: 0x1d
 spare1: 0x0 spare2: 0xa spare3: 0x0
 consistency value in tail: 0x31040000
 check value in block header: 0x1500
 computed block checksum: 0xe403

Page 278200 is marked corrupt
Corrupt block relative dba: 0x21843eb8 (file x4, block 2xx0)
Bad header found during dbv:
Data in bad block:
 type: 48 format: 0 rdba: 0x000a0018
 last change scn: 0x3031.31060000 seq: 0x30 flg: 0x30
 spare1: 0x30 spare2: 0x0 spare3: 0x19
 consistency value in tail: 0x000b0000
 check value in block header: 0x31
 block checksum disabled

... (n lines omitted here)

Records from the related trace file:



Corrupt block relative dba: 0x2180ba80 (file x4, block 4xx4)
Bad header found during user buffer read
Data in bad block:
 type: 82 format: 0 rdba: 0x534e4901
 last change scn: 0x4546.464f2e54 seq: 0x52 flg: 0x5f
 spare1: 0x0 spare2: 0x0 spare3: 0x5453
 consistency value in tail: 0x0908bdf2
 check value in block header: 0x4e49
 computed block checksum: 0x66c6
Reading datafile '+xxx05' for corruption at rdba: 0x2180ba80 (file x4, block 4xx4)
ksfdrfms:Mirror Read file=+xxx905 fob=0x246076cb80 bufp=0x7f9a07619c00 blkno=47744 nbytes=8192
ksfdrfms: Read success from mirror side=1 logical extent number=0 disk=xxx2 path=/dev/axxx1
Mirror I/O done from ASM disk /dev/axxx1
Read datafile mirror 'xxx02' (file x4, block 4xx4) found same corrupt data (no logical check)
ksfdrnms:Mirror Read file=+xxx7905 fob=0x246076cb80 bufp=0x7f9a07619c00 nbytes=8192
ksfdrnms: Read success from mirror side=2 logical extent number=1 disk=xxx3 path=/dev/axxx4
Mirror I/O done from ASM disk /dev/axxx4
Read datafile mirror 'xxx3' (file x4, block 4xx4) found valid data
Hex dump of (file x4, block 4xx4)

On closer inspection, every corrupt-block error looks very similar, for example:

Read datafile mirror 'xxx2'(file x3, block 47xxx48) found same corrupt data (no logical check)

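Looking further into the logs, we found a common pattern: in nearly every case the same data block was found both on disk xxx2 and on another disk. The valid copy was always on the other disk, while the corrupt copy was always on /dev/axxx1 (disk name xxx2). We therefore suspected that some operation on this disk was the cause.

Further inquiry showed that this disk had originally been a member of disk group xxx1, had left the disk group for some reason, and had been added back the day before the errors began. The truth then became clear: the old data on /dev/axxx1 had not been wiped, so after the disk was added back, old blocks conflicted with new ones, the database reported errors continuously, and the resulting log and trace output filled up the software directory.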
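The pattern described above (corrupt copies always on one disk, valid copies always elsewhere) can be confirmed by tallying the "Read datafile mirror" lines per disk. The following is a minimal sketch, assuming the alert-log line format shown in the excerpts above; the function name `tally_mirror_reads` is hypothetical and not part of any Oracle tooling.

```python
import re
from collections import Counter

# Matches alert-log lines of the form seen in the excerpts, e.g.
#   Read datafile mirror '<disk>' (file ..., block ...) found same corrupt data (no logical check)
#   Read datafile mirror '<disk>' (file ..., block ...) found valid data
MIRROR_RE = re.compile(
    r"Read\s+datafile\s+mirror\s+'\s*(?P<disk>[^']*?)\s*'"
    r".*?found\s+(?P<result>same\s+corrupt|valid)\s+data"
)


def tally_mirror_reads(lines):
    """Count, per ASM disk name, how many mirror reads were corrupt or valid.

    Returns a (corrupt, valid) pair of Counters keyed by disk name.
    Lines that are not mirror-read messages are ignored.
    """
    corrupt, valid = Counter(), Counter()
    for line in lines:
        match = MIRROR_RE.search(line)
        if match is None:
            continue
        disk = match.group("disk")
        if match.group("result").startswith("same"):
            corrupt[disk] += 1
        else:
            valid[disk] += 1
    return corrupt, valid
```

Feeding it the lines of an alert log (for example `tally_mirror_reads(open(path))`, where `path` is wherever the alert log lives on your system) gives two counters. If one disk dominates the corrupt counter and never appears in the valid one, that disk is the natural suspect, which is the reasoning followed in this case.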

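Disk group xxx1 uses NORMAL redundancy. As brief background, Oracle classifies disk group redundancy into the following three levels according to the number of mirror copies:

1) External redundancy: ASM does not mirror the data. This suits systems where the underlying storage already mirrors the data.

2) Normal redundancy: one mirror copy (two-way mirroring). This level suits most systems.

3) High redundancy: two mirror copies (three-way mirroring). This level suits a system's critical data, at the cost of more space.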
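To make the space cost of each level concrete, here is a rough back-of-the-envelope sketch. It assumes every extent is stored in one, two, or three copies for external, normal, and high redundancy respectively, and it ignores ASM metadata and the free space a disk group needs in order to restore redundancy after a disk failure. The helper name `approx_usable_gb` is hypothetical.

```python
# Total copies of each extent kept per redundancy level
# (the primary copy plus the mirror copies described in the list above).
COPIES = {"external": 1, "normal": 2, "high": 3}


def approx_usable_gb(raw_gb, redundancy):
    """Rough usable capacity: raw capacity divided by the number of copies.

    This is only an estimate; it ignores ASM metadata and reserved free space.
    """
    return raw_gb / COPIES[redundancy.lower()]
```

For instance, under these assumptions 1200 GB of raw disk yields about 600 GB of usable space with normal redundancy and about 400 GB with high redundancy.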

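Oracle implements mirroring through failure groups. Because disk group xxx1 uses normal redundancy, Oracle keeps one mirror copy of each extent and guarantees that an extent and its mirror are never stored in the same failure group. This ensures that no data is lost when one or more disks in a failure group, or even an entire failure group, is lost.

When /dev/axxx1 was added back to the disk group, ASM rebalancing redistributed the file extents evenly across all disks in the disk group; this process is coordinated by the background process RBAL. When the redistributed mirror copies conflict with the old data on /dev/axxx1, errors are reported.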

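Resolution

We dropped the problem disk from the disk group, wiped its old data with dd, and then added it back. This resolved the problem and the system recovered.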



alter diskgroup xxx1 drop disk 'Oxxxx2';

dd if=/dev/zero of=/dev/asxxx1 bs=1M count=256
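Before adding a disk back, it can be worth checking that the wipe actually took effect. The sketch below is one possible pre-check, not part of the original procedure. It assumes that a disk previously provisioned for ASM carries the marker string "ORCLDISK" near the start of the device, and that the first 256 MiB were zeroed as in the dd command above. The function names are hypothetical, and reading a raw device usually requires elevated privileges.

```python
ASM_MARKER = b"ORCLDISK"  # assumption: provisioning string found in an ASM disk header


def looks_like_asm_disk(path, probe_bytes=4096):
    """Return True if the marker appears in the first probe_bytes of the device."""
    with open(path, "rb") as dev:
        head = dev.read(probe_bytes)
    return ASM_MARKER in head


def is_zeroed(path, length=256 * 1024 * 1024, chunk=1024 * 1024):
    """Return True if the first `length` bytes of the device are all zero.

    If the device or file is shorter than `length`, only the bytes that
    exist are checked.
    """
    remaining = length
    with open(path, "rb") as dev:
        while remaining > 0:
            buf = dev.read(min(chunk, remaining))
            if not buf:
                break  # reached end of device before `length` bytes
            if buf.count(0) != len(buf):
                return False  # found a non-zero byte
            remaining -= len(buf)
    return True
```

A disk for which `looks_like_asm_disk` returns True or `is_zeroed` returns False has probably not been wiped and deserves a second look before it is added to a disk group. Note that this only inspects the start of the device, matching the range the dd command clears.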
