Oracle Grid: a case study in handling ASM diskgroup metadata corruption on one node

Posted by paulyibinyi on 2011-10-08

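[Problem description]
While checking the ASM diskgroups on node 2 today, I found that the asmcmd lsdg command produced no output. I then checked the relevant logs on node 2: the database log and the cluster log showed no errors, but at 11:26:01 on 2011-09-20 the ASM log reported the following errors: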

Sep 20 11:26:01 2011
WARNING: cache read a corrupted block group=1(CRS) fn=1 blk=7 from disk 0(CRS_0000)
Errors in file /software/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_ora_46793154.trc:
ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
NOTE: a corrupted block from group CRS was dumped to /software/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_ora_46793154.trc
WARNING: cache read(retry) a corrupted block group=1(CRS) fn=1 blk=7 from disk 0(CRS_0000)
Errors in file /software/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_ora_46793154.trc:
ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
ERROR: cache failed to read group=1(CRS) fn=1 blk=7 from disk(s): 0(CRS_0000)
ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
System State dumped to trace file /software/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_ora_46793154.trc
NOTE: AMDU dump of disk group CRS created at /software/oracle/diag/asm/+asm/+ASM2/trace
NOTE: cache initiating offline of disk 0 group CRS
NOTE: process 46793154 initiating offline of disk 0.2209752627 (CRS_0000) with mask 0x7e in group 1

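The log above shows that a logical block in the CRS diskgroup (group=1, fn=1, blk=7) is corrupted. ASM then initiated an offline of disk 0 (CRS_0000), which took the CRS diskgroup offline on this node.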
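
When the same ORA-15196 line repeats many times, it helps to pull out which blocks are actually affected. The following is a minimal sketch in Python, not part of the original procedure. It assumes the bracketed fields follow the layout seen in the log above, where the third and fourth brackets match the fn= and blk= values of the WARNING line; the names parse_ora_15196, corrupted_blocks, lhs and rhs are hypothetical.

```python
import re

# Matches lines such as:
# ORA-15196: invalid ASM block header [kfc.c:25165] [obj_kfbl] [1] [7] [2147483649 != 1]
# Field meanings are an assumption inferred from the log excerpt above.
ORA_15196 = re.compile(
    r"ORA-15196: invalid ASM block header "
    r"\[(?P<source>[^\]]+)\] \[(?P<check>[^\]]+)\] "
    r"\[(?P<file>\d+)\] \[(?P<block>\d+)\] "
    r"\[(?P<lhs>\d+) != (?P<rhs>\d+)\]"
)


def parse_ora_15196(line):
    """Return the bracketed fields of one ORA-15196 line, or None if it does not match."""
    m = ORA_15196.search(line)
    if m is None:
        return None
    return {
        "source": m.group("source"),
        "check": m.group("check"),
        "file": int(m.group("file")),
        "block": int(m.group("block")),
        "lhs": int(m.group("lhs")),   # value on the left of "!="
        "rhs": int(m.group("rhs")),   # value on the right of "!="
    }


def corrupted_blocks(log_lines):
    """Return the distinct (file, block) pairs reported, in order of first appearance."""
    seen = []
    for line in log_lines:
        hit = parse_ora_15196(line)
        if hit is not None and (hit["file"], hit["block"]) not in seen:
            seen.append((hit["file"], hit["block"]))
    return seen
```

Fed the excerpt above, corrupted_blocks collapses the five repeated ORA-15196 lines into the single pair (1, 7), which makes it easier to see that only one metadata block is involved.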

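The CRS diskgroup is an important part of RAC: it stores the OCR file and the voting disk file, which maintain information about the high-availability components in the cluster, such as the list of cluster nodes.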

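At present only node 2 shows these errors. On node 2 the CRS diskgroup is offline, but the other diskgroups are normal, so the node 2 instance is still healthy and can serve connections.
Node 1, the primary node, shows no such errors, and its database and application connections are normal.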

[Solution]
1. Check whether each diskgroup is in a normal state. On node 1 all diskgroups are normal; on node 2 the CRS diskgroup is not.
SQL> select name,state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
CRS                            MOUNTED
DGSYSTEM                       MOUNTED
DG01                           MOUNTED
RAID5_DG01                     MOUNTED
RECOVER_DG                     MOUNTED
SSD_DG01                       MOUNTED

6 rows selected.

SQL> alter diskgroup SSD_DG01 check;

Diskgroup altered.

SQL> alter diskgroup RECOVER_DG check;

Diskgroup altered.

SQL> alter diskgroup RAID5_DG01 check;

Diskgroup altered.

SQL> alter diskgroup DG01 check;

Diskgroup altered.

SQL> alter diskgroup DGSYSTEM check;

Diskgroup altered.

SQL> alter diskgroup CRS check;

Diskgroup altered.

On node 2, checking the CRS diskgroup fails with the following error:
SQL> alter diskgroup CRS check;
alter diskgroup CRS check
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup "CRS" is being dismounted
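
The state check in step 1 is easy to automate once the SQL*Plus output has been captured as text. The sketch below is one possible way to do it in Python, assuming the two-column NAME/STATE layout shown above; the function names are hypothetical, and the DISMOUNTED row in the usage example is made up for illustration rather than taken from this incident.

```python
def parse_diskgroup_states(sqlplus_output):
    """Map diskgroup name -> state from 'select name,state from v$asm_diskgroup' output."""
    states = {}
    for raw in sqlplus_output.splitlines():
        parts = raw.split()
        # Data rows have exactly two whitespace-separated fields.
        if len(parts) != 2:
            continue
        name, state = parts
        # Skip the column header and the dashed separator line.
        if name == "NAME" or set(name) == {"-"}:
            continue
        states[name] = state
    return states


def unhealthy_diskgroups(states):
    """Return the names of diskgroups whose state is anything other than MOUNTED."""
    return sorted(name for name, state in states.items() if state != "MOUNTED")


if __name__ == "__main__":
    sample = """
NAME                           STATE
------------------------------ -----------
CRS                            DISMOUNTED
DGSYSTEM                       MOUNTED

2 rows selected.
"""
    print(unhealthy_diskgroups(parse_diskgroup_states(sample)))
```

Running the same check on every node and comparing the results would surface a case like this one, where a diskgroup is unhealthy on one node only.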

2. Back up the CRS diskgroup metadata from node 1:
[nhdb18:grid]asmcmd md_backup crsdgbk -G CRS
Disk group metadata to be backed up: CRS
Current alias directory path: cluster
Current alias directory path: cluster/ASMPARAMETERFILE
Current alias directory path: cluster/OCRFILE

3. Check whether any applications are running on node 1; if so, ask the application team to stop them.
4. Stop GoldenGate.

5. Stop the instance on node 2.
6. Restore the CRS diskgroup metadata:
   asmcmd md_restore crsdgbk -G CRS

7. Mount the CRS diskgroup and check that it is healthy:
  alter diskgroup crs mount;
  alter diskgroup CRS check;

8. Start the instance on node 2, then ask the application team to start the related applications and GoldenGate.

The whole procedure takes roughly one hour.

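[Follow-up]
These can be done online:
1. Add a mirror disk to the CRS diskgroup to protect against physical damage to the CRS disk.
2. Back up the ASM diskgroup metadata regularly.
3. Strengthen monitoring and checking of the ASM and CRS logs.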
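
For the regular metadata backup in item 2, a scheduled job can reuse the same asmcmd md_backup form as step 2 of the solution. The sketch below only builds the command line with a date-stamped file name; it is an illustration under assumptions, not a tested backup tool. The function name, the backup directory and the file naming scheme are all hypothetical.

```python
import datetime


def md_backup_command(diskgroup, backup_dir, today=None):
    """Build an 'asmcmd md_backup <file> -G <diskgroup>' argument list.

    backup_dir and the '<dg>_md_<YYYYMMDD>.bak' naming are assumptions
    made for this sketch, not conventions from the original post.
    """
    today = today or datetime.date.today()
    backup_file = f"{backup_dir}/{diskgroup.lower()}_md_{today:%Y%m%d}.bak"
    return ["asmcmd", "md_backup", backup_file, "-G", diskgroup]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(md_backup_command("CRS", "/tmp/asm_md")))
```

The returned list can be passed to subprocess.run from a job running as the grid user. Stamping the file name with the date keeps each day's backup separate, so an older copy is still available if the latest one was taken after the corruption occurred.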

Source: ITPUB blog, link: http://blog.itpub.net/7199859/viewspace-708751/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
