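Background:

This is a small system. On September 14 the storage controller and disks failed; both AIX hosts lost hdisk2, hdisk3 and hdisk4, and the database environment went down.

On the evening of the 21st the storage recovery work finished, and the AIX hosts recognized the three disks again.

The clusterware could start, but the database could not, failing with an ORA-600 error: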

ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [346397], [346390], [0x000000000], [], [], [], []

Reference material on kccpb_sanity_check_2:

On MetaLink, Oracle explains kccpb_sanity_check_2 as follows:

The ORA-600 [kccpb_sanity_check_2] is reported when the SCN in the controlfile header is lower than the SCN in a controlfile block. Or:

The ORA-600 [kccpb_sanity_check_2] indicates that the seq# of the last read block is higher than the seq# of the control file header block.

It can be resolved in one of the following three ways:

1) restore a backup of a controlfile and recover

OR

2) recreate the controlfile

OR

3) restore the database from last good backup and recover

See MetaLink notes 435436.1 and 833115.1.


NOTE:

If you do not have any special backup of control file to restore and you are using Multiple Control File copies in your pfile/init.ora/spfile you can attempt to mount the database using each control file one by one. If you are able to mount the database with any of these control file copies you can then issue 'alter database backup controlfile to trace' to recreate controlfile.
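The note above could be sketched in SQL*Plus roughly as follows. This is only an illustration: it assumes the instance uses an spfile, and `<path_to_control_file_copy>` is a placeholder for each real copy listed in the control_files parameter, tried one at a time.

```sql
-- Sketch only: try to mount the database with a single control file copy.
-- <path_to_control_file_copy> is a placeholder, not a path from this system.
STARTUP NOMOUNT
ALTER SYSTEM SET control_files = '<path_to_control_file_copy>' SCOPE = SPFILE SID = '*';
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- If the mount succeeds, write a CREATE CONTROLFILE script to a trace file.
ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
```

If the mount fails with the same ORA-600, point control_files at the next copy and repeat.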

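The process

First, the on-site staff tried to restore the control file from backup; the full backup of September 10 included the control file. The control file restore succeeded, but open failed, recover failed, and recover using backup controlfile until cancel also failed, so the database could not be opened.

I took over at noon on the 22nd. After backing up the original environment, I restored the database and ran recover using backup controlfile until cancel again. When the last log was applied, the session failed: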

ORA-03113: end-of-file on communication channel

ERROR:

ORA-03114: not connected to ORACLE

The instance crashed, and the alert log reported ORA-07445:

Thu Sep 22 22:43:33 2011

ALTER DATABASE RECOVER LOGFILE '+DATA1/aysyrk/onlinelog/group_6.264.745102907'

Thu Sep 22 22:43:33 2011

Media Recovery Log +DATA1/aysyrk/onlinelog/group_6.264.745102907

Thu Sep 22 22:44:03 2011

Errors in file /oracle/admin/aysyrk/bdump/aysyrk1_p000_7274518.trc:

ORA-07445: exception encountered: core dump [kcbzfc+00dc] [SIGSEGV] [Address not mapped to object] [0x70000040190] [] []

Some details:

The datafiles need to be cleared out before the restore. If they are not, the following appears after the recover:

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below

ORA-01194: file 1 needs more recovery to be consistent

ORA-01110: data file 1: '+DATA1/aysyrk/datafile/system.256.745102829';

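When recovering file 1, Oracle again reported that the recover must be run with using backup controlfile.

Datafiles restored into ASM get renamed.

After the restore finished, I made a copy of all the datafiles in the mybak directory on node 2.

The first recovery attempt could not complete because log 312# was missing. I looked it up in the backup log and restored 312# and 313# back into ASM; the folder name was no longer that of the 9th but of the current day.

What matters is keeping track of the datafile numbers, datafile names, archived log file names, archive dates, seq# and SCN.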
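A minimal sketch of how that information might be collected from a mounted database. v$datafile and v$archived_log are the views I relied on; v$datafile_header is an addition here for comparing header SCNs, not something taken from the original notes.

```sql
-- Datafile numbers, names and checkpoint SCNs as recorded in the control file.
SELECT file#, name, status, checkpoint_change#
  FROM v$datafile
 ORDER BY file#;

-- Checkpoint SCNs as recorded in the datafile headers themselves.
SELECT file#, status, fuzzy, checkpoint_change#
  FROM v$datafile_header
 ORDER BY file#;

-- Archived logs known to the control file: thread, seq#, name and SCN range.
SELECT thread#, sequence#, name, first_change#, next_change#, first_time
  FROM v$archived_log
 ORDER BY thread#, sequence#;
```

Comparing checkpoint_change# from the first two queries against the first_change#/next_change# ranges of the logs shows which seq# recovery will ask for next.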

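Be careful with asmcmd; a sudden error will leave you in a cold sweat.

Manual recovery differs from recovery through NBU: the same commands, different results.

ASM disk write speed averaged 80 MB/s, while the file system could reach 100 MB/s.

What to do about the ORA-07445? Would the data really have to be recovered one entry at a time with LogMiner?

The only option was to skip the problematic log and cancel early.

After many repeated recovery attempts, the database finally opened at 1 a.m.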
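The final attempt amounted to something like the following SQL*Plus session. It is a sketch rather than a transcript: the comments are annotations, and the point at which to cancel depends on which log triggers the crash.

```sql
-- Sketch: cancel-based incomplete recovery with a backup control file
-- (database mounted). At each "Specify log" prompt, press Enter to accept
-- the suggested log or type a log file name; type CANCEL once the prompt
-- reaches the log to be skipped (here, the online log whose application
-- raised ORA-07445).
RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL

-- Incomplete recovery with a backup control file must be followed by RESETLOGS.
ALTER DATABASE OPEN RESETLOGS;
```

Any changes recorded in the skipped log are lost, so this is a last resort.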

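Some further details

1 Bigger online log files are not always better. During recovery, the larger the log file, the more data may be lost.

This used to be theory; now it is painful reality. One log switch per day is far too infrequent. The textbook figure of one switch every 20 minutes suits this system rather well.

2 Even if they sit in the same location, there should still be multiple copies of the control file. In RMAN, setting controlfile autobackup on is necessary. Regular manual backups of the parameter file and the control file are also useful.

3 For critical business data, other than the images, taking a DMP export is still worthwhile.

4 For copying files between ASM and the file system, besides RMAN, when the database cannot be opened you can also create a new instance and copy with dbms_file_transfer.copy_file, which is easy to do.

5 Backups and data sit on the same array, and there is still no dedicated backup server running backup software; that is a bit alarming.

6 Useful sources of information: v$archived_log, v$datafile, the RMAN backup log and the alert log.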
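For item 4, a copy out of ASM with dbms_file_transfer.copy_file might look roughly like this. It assumes an open auxiliary database on a host that can reach the disk group; the directory object names, the target path '/u01/mybak' and the destination file name are hypothetical, and the ASM file name should be checked first because restored files are renamed.

```sql
-- Sketch: copy one datafile from ASM to a file system directory.
-- SRC_ASM, DST_FS, '/u01/mybak' and 'system_256.dbf' are hypothetical names.
CREATE OR REPLACE DIRECTORY src_asm AS '+DATA1/aysyrk/datafile';
CREATE OR REPLACE DIRECTORY dst_fs  AS '/u01/mybak';

BEGIN
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'SRC_ASM',
    source_file_name             => 'system.256.745102829',
    destination_directory_object => 'DST_FS',
    destination_file_name        => 'system_256.dbf');
END;
/
```

The same call works in the other direction by swapping the two directory objects.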

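Follow-up improvement plan

Adjust the size of the online log files;

Adjust the backup strategy, including where backup sets are stored and the retention policy;

Adjust the control files so that there are three copies;

Add DMP exports as a supplementary backup;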
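Before resizing the online logs, it may help to measure how often they actually switch. The queries below are a sketch; the archive_lag_target setting at the end is an optional alternative I am suggesting, not something that was applied on this system.

```sql
-- Log switches per thread per day, from the control file's log history.
SELECT thread#, TRUNC(first_time) AS switch_day, COUNT(*) AS switches
  FROM v$log_history
 GROUP BY thread#, TRUNC(first_time)
 ORDER BY switch_day, thread#;

-- Current online log groups and their sizes.
SELECT group#, thread#, bytes/1024/1024 AS size_mb, status
  FROM v$log
 ORDER BY thread#, group#;

-- Optional: force a log switch at least every 20 minutes (1200 seconds),
-- independent of the log file size.
ALTER SYSTEM SET archive_lag_target = 1200 SCOPE = BOTH SID = '*';
```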
