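Lately I have occasionally been getting an SMS alert saying that one of our standby databases hit an ORA-00600 error. This is not the kind of problem you can hope will go away on its own, so I went to take a look.
The setup is one primary with two standbys. The primary and one of the standbys show no ORA-00600 errors at all; only this one standby throws ORA-00600 from time to time.
Thought through to its conclusion, this is quite serious: if the primary fails and we switch over to this standby, the ORA-00600 errors come along with it, and at that point the standby is of little real use. One option is to rebuild the standby; the other is to repair it by hand. I preferred to try a manual repair first and see whether that would resolve the problem.
The alert log on the failing standby looks like this: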
Media Recovery Waiting for thread 1 sequence 654 (in transit)
Recovery of Online Redo Log: Thread 1 Group 7 Seq 654 Reading mem 0
Mem# 0: /U01/app/oracle/fast_recovery_area/TESTSOB0/onlinelog/o1_mf_7_9lo8zrpc_.log
Mon Sep 21 08:36:02 2015
Archived Log entry 650 added for thread 1 sequence 653 ID 0xfe9af939 dest 1:
Mon Sep 21 12:18:09 2015
Errors in file /U01/app/oracle/diag/rdbms/testb0/testob0/trace/ordermob0_ora_26830.trc (incident=403849):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /U01/app/oracle/diag/rdbms/testob0/testob0/incident/incdir_403849/testob0_ora_26830_i403849.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Sep 21 12:18:11 2015
Sweep [inc][403849]: completed
Sweep [inc2][403849]: completed
Mon Sep 21 12:18:11 2015
Dumping diagnostic data in directory=[cdmp_20150921121811], requested by (instance=1, osid=26830), summary=[incident=403849].
Mon Sep 21 12:37:06 2015
Errors in file /U01/app/oracle/diag/rdbms/testob0/testob0/trace/testob0_ora_26766.trc (incident=403918):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /U01/app/oracle/diag/rdbms/testob0/ordermob0/incident/incdir_403918/testob0_ora_26766_i403918.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Sep 21 12:37:08 2015
Sweep [inc][403918]: completed
Sweep [inc2][403918]: completed
Mon Sep 21 12:37:08 2015
Dumping diagnostic data

The trace file named in the log contains the following:
/U01/app/oracle/diag/rdbms/testob0/testob0/trace/testob0_ora_26766.trc
*** 2015-09-21 12:37:04.930
*** SESSION ID:(1898.8357) 2015-09-21 12:37:04.930
*** CLIENT ID:() 2015-09-21 12:37:04.930
*** SERVICE NAME:(SYS$USERS) 2015-09-21 12:37:04.930
*** MODULE NAME:(JDBC Thin Client) 2015-09-21 12:37:04.930
*** ACTION NAME:() 2015-09-21 12:37:04.930

* kdsgrp1-1: *************************************************
row 0x01c55b6f.1f continuation at
0x01c55b6f.1f file# 7 block# 351087 slot 31 not found
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 31 ..... nrows: 32
kdsgrp - dump CR block dba=0x01c55b6f
Block header dump: 0x01c55b6f
Object id on Block? Y
seg/obj: 0x12739 csc: 0x00.10bfb352 itc: 2 flg: E typ: 1 - DATA
brn: 0 bdba: 0x1c55481 ver: 0x01 opc: 0
inc: 0 exflg: 0

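The first ORA-00600 of this kind can be traced back to 2014, so this is a long-standing leftover problem.
What the error means is that a ROWID taken from an index could not be found in the table, so the index and the table data are inconsistent with each other.
The trace file also shows that the error was raised while executing a SQL statement: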
SELECT nvl(count(distinct USER_ID),0) as userCount,nvl(SUM(goods_price),0) as total FROM TEST_ORDER WHERE (IS_SANDBOX = 0 or IS_SANDBOX is null) and (order_status = 2 or order_status = 4) and UPDATE_DATE >=to_date(:1,'yyyy-mm-dd') and UPDATE_DATE <to_date(:2,'yyyy-mm-dd')+1 and (substr(MEDIA_CHANNEL_ID,0,2)='10' or substr(MEDIA_CHANNEL_ID,0,2)='20') AND app_id = :3
As for how to pin the problem down, the contents of the trace deserve a close read.
* kdsgrp1-1: *************************************************
row 0x01c55b6f.1f continuation at
0x01c55b6f.1f file# 7 block# 351087 slot 31 not found
This tells us the error is in data file 7, block 351087; the row the session was looking for is slot 31 (0x1f) of that block.
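The file and block numbers do not have to be taken from the trace on faith; they can be recomputed from the hexadecimal DBA 0x01c55b6f. A minimal sketch, assuming the usual layout where the top 10 bits of the DBA are the relative file number and the low 22 bits are the block number, using the standard DBMS_UTILITY functions:

```sql
-- Decode the data block address printed in the trace (0x01c55b6f).
-- TO_NUMBER with an 'x' format mask converts the hex string to a number first.
SELECT dbms_utility.data_block_address_file(
           TO_NUMBER('01c55b6f', 'xxxxxxxx'))  AS rel_file_no,
       dbms_utility.data_block_address_block(
           TO_NUMBER('01c55b6f', 'xxxxxxxx')) AS block_no
FROM   dual;
```

For this DBA the result should be file 7 and block 351087, matching the line in the trace. Note that the value returned is the relative file number, which in most databases is the same as file_id but is not guaranteed to be.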
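The following statement confirms whether the bad block belongs to the TEST_ORDER table. An extent covers blocks block_id through block_id+blocks-1, so the upper bound has to be written accordingly:
SQL> select owner,segment_name,segment_type from dba_extents where file_id=7 and 351087 between block_id and block_id+blocks-1;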
OWNER                SEGMENT_NAME         SEGMENT_TYPE
-------------------- -------------------- ------------------
TESTOB               TEST_ORDER           TABLE
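The block header dump in the trace offers a second way to cross-check the owning segment without scanning dba_extents, which can be slow on a large database. A sketch, assuming that the seg/obj value in the block header (0x12739 here) is the segment's data object ID:

```sql
-- seg/obj: 0x12739 in the block header dump; 0x12739 = 75577 in decimal.
-- data_object_id identifies the physical segment, so filter on it
-- rather than on object_id.
SELECT owner, object_name, subobject_name, object_type
FROM   dba_objects
WHERE  data_object_id = TO_NUMBER('12739', 'xxxxx');
```

If this agrees with the dba_extents result, the block really does belong to the table named in the failing SQL statement.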

MOS already provides a complete solution for this problem:
ORA-600 [kdsgrp1] During Table/Index Full Scans (Doc ID 468883.1)

One approach is to dump the data, set the corresponding event, and then rebuild the object to repair it:
alter system dump datafile 7 block 351087;
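The other is to simply declare that the bad blocks should be skipped, which can be done directly with dbms_repair. Strictly speaking this does not rebuild anything; it flags the table so that scans skip blocks marked as corrupt.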
execute dbms_repair.skip_corrupt_blocks('TESTOB','TEST_ORDER');
But because the problem is on the standby, which is open read-only, the package cannot be executed there:
SQL> execute dbms_repair.skip_corrupt_blocks('TESTOB','TEST_ORDER');
BEGIN dbms_repair.skip_corrupt_blocks('TESTOB','TEST_ORDER'); END;
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-16000: database open for read-only access
ORA-00604: error occurred at recursive SQL level 1
ORA-16000: database open for read-only access
ORA-06512: at "SYS.DBMS_REPAIR", line 419
ORA-06512: at line 1
So it has to be run on the primary, even though the primary itself does not show the error:
SQL> execute dbms_repair.skip_corrupt_blocks('TESTOB','TEST_ORDER');
PL/SQL procedure successfully completed.
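It ran very quickly. After the fix I kept watching for further ORA-00600 alerts. As things stand the problem appears to be resolved: nearly two weeks have passed without an alert, whereas before there was an error every day or two.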
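Besides waiting to see whether the alerts stop, the setting itself can be checked. A small sketch, using the SKIP_CORRUPT column of dba_tables; the assumption is that the dictionary change made on the primary reaches the standby through normal redo apply, so the same query can be run on both sides:

```sql
-- ENABLED means scans of this table skip blocks marked corrupt.
SELECT owner, table_name, skip_corrupt
FROM   dba_tables
WHERE  owner = 'TESTOB'
AND    table_name = 'TEST_ORDER';

-- To undo the setting later (on the primary), pass the NOSKIP flag:
-- BEGIN
--   dbms_repair.skip_corrupt_blocks(
--     schema_name => 'TESTOB',
--     object_name => 'TEST_ORDER',
--     flags       => dbms_repair.noskip_flag);
-- END;
-- /
```

Keep in mind that skipping blocks hides rows rather than restoring them, so it is worth comparing row counts between the primary and this standby before treating the standby as a trustworthy switchover target.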
