上週在客戶那裡遇到了一例由Oracle Bug引發的表資料塊邏輯訛誤觸發ORA-00600:[13013], [5001]的問題,這裡為了更好地說明該問題,於是萌發了手工模擬該資料塊邏輯訛誤的想法。
基礎知識
Oracle中表的資料塊由塊頭、事務槽、行字典、行資料等多種結構組成。 行資料(rowdata)實際是由許多row piece 行片組成的,每一條row piece的頭部都有flag、locks、cols(cc)三個標誌位。
其中flag標記了該row piece的型別,該flag位佔用一個位元組,其不同的bit位代表不同的含義,見下表:
ROW_CLUSTER_KEY = 0x80; KDRHFK
ROW_CTABLE_NUMBER = 0x40; KDRHFC
ROW_HEAD_PIECE = 0x20; KDRHFH
ROW_DELETED_ROW = 0x10; KDRHFD
ROW_FIRST_PIECE = 0x08; KDRHFF
ROW_LAST_PIECE = 0x04; KDRHFL
ROW_FROM_PREVIOUS = 0x02; KDRHFP
ROW_CONTINUE_NEXT = 0x01; KDRHFN
一般來說最普通的一條row piece是普通堆表(heap table)的未被刪除的且無行遷移/連結的,其flag位應為
普通row的flag一般為
Single Row =
ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE= 0x20 + 0x08 + 0x04= 0x2c
===================================================================================
cluster key的flag一般為
Cluster Key =
ROW_CLUSTER_KEY + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE=
KDRHFL, KDRHFF, KDRHFH, KDRHFK =0x80 + 0x2c = 0xac
BBED> x /rn
rowdata[68] @8166
-----------
flag@8166: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
lock@8167: 0x00
cols@8168: 1
kref@8169: 1
mref@8171: 1
hrid@8173:0x01800014.0
nrid@8179:0x01800014.0
col 0[2] @8185: 10
===================================================================================
Cluster Row =
ROW_CTABLE_NUMBER + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE =
(KDRHFL, KDRHFF, KDRHFH, KDRHFC) = 0x6c
BBED> x /rncc
rowdata[0] @8098
----------
flag@8098: 0x6c (KDRHFL, KDRHFF, KDRHFH, KDRHFC)
lock@8099: 0x00
cols@8100: 10
col 0[2] @8102: 200
col 1[8] @8105: Jennifer
col 2[6] @8114: Whalen
col 3[7] @8121: JWHALEN
col 4[12] @8129: 515.123.4444
col 5[7] @8142: w....
col 6[7] @8150: AD_ASST
col 7[2] @8158:
col 8[0] @8161: *NULL*
col 9[3] @8162: .
出現ORA-00600:[13013], [5001]且Arg [f] Code =3 代表這一row piece的flag >0xc0,
也就是該行片同時被標記為key和clustered(row is marked as both a Key and Clustered), 其檢驗程式碼為check code 6251。
當flag >= 0xc0 時 會出現kdrchk: row is marked as both a Key and Clustered Block 12 failed with check code 6251
當 0xac >flag >= 0xa0 時 會 kdrchk: row is Key and Not only piece of key Block 12 failed with check code 6255
當 flag = 0x43 是 會出現 kdrchk: C and neither of H or F Block 12 failed with check code 6263
當 flag = 0x83 時 會出現 kdrchk: row is marked both as a Key and being continued Block 12 failed with check code 6254
當Oracle程式訪問資料塊時首先會校驗block的sum值並與block中的CHECKSUM值進行對比,若一致則說明該block沒有物理訛誤。但是光這一項檢查是不夠的,不足以保證block無誤。所以Oracle引入了一些列的邏輯檢驗,每一種邏輯檢驗對應一個檢測程式碼(check code),這些檢測包括row piece的flag、cols(cc)狀態是否正確等。
實際負責這類邏輯檢驗的函式包括:kdbchk、kddummy_blkchk、kco_blkchk、kdBlkCheckError、kdrchk等等。
這裡當服務程式訪問到問題資料塊,檢測程式碼發現其flag為0xff(KCHDFLPN),該flag從邏輯上講是衝突的,所以檢測程式碼認為該row piece存在異常,進而會引發update的ORA-00600:[13013], [5001]或查詢的ORA-600 [qertbFetchByRowID]內部錯誤。
這裡需要說明一下的是,很多人認為dbv工具時無法檢測出邏輯訛誤的,實際上dbv、rman、validate structure和bbed-verify均可以檢測出一定程度的邏輯訛誤,但是最可靠的還是db_block_checksum=true情況下的validate structure [online]驗證命令。從另一個角度來說,普通的dbv只能做單一的檢測,而無法做到交叉地檢驗,從而瞭解表和索引上的不一致問題,但是validate structure online卻可以做到。
正式模擬
以上我們瞭解了ORA-00600:[13013], [5001]內部錯誤是如何被引發的,那麼下面手工模擬該錯誤也就不困難了,當然這裡需要用到bbed工具。
以下我們會建立實驗用的tablespace,table,index:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select * from global_name;
GLOBAL_NAME
--------------------------------------------------------------------------------
www.oracledatabase12g.com
/* 建立實驗用的表空間 */
SQL> create tablespace maclean datafile '/home/oracle/maclean.dbf' size 20M;
Tablespace created.
SQL> create table tv tablespace maclean as select rownum t1,'find me' t2 from
dba_tables where rownumcreate index ind_tv on tv(t1) tablespace users;
Index created.
SQL> update tv set t2='corrption here' where t1=200;
update tv set t2='corrption here' where t1=200
*
ERROR at line 1:
ORA-12899: value too large for column "SYS"."TV"."T2" (actual: 14, maximum: 7)
SQL> alter table tv modify t2 varchar2(200);
Table altered.
SQL> update tv set t2='corruption here' where t1=200;
1 row updated.
SQL> commit;
Commit complete.
/* 以上建立了示例用表,其中t1=200的記錄是之後將會
手動修改為存在訛誤的行 */
SQL> select dump(200,16) from dual;
DUMP(200,16)
-----------------
Typ=2 Len=2: c2,3
/* 通過16進位制碼可以方便找出該t1=200的記錄行 */
SQL> alter system checkpoint;
System altered.
SQL> alter tablespace maclean read only;
Tablespace altered.
SQL> select dbms_rowid.rowid_block_number(rowid) bno ,dbms_rowid.rowid_relative_fno(rowid) fno from tv;
BNO FNO
---------- ----------
12 6
[oracle@rh2 ~]$ cp maclean.dbf maclean.dbf.bak
接著使用BBED工具找到目標行並實施手工修改:
[oracle@rh2 ~]$ bbed filename=maclean.dbf mode=edit
Password:
BBED: Release 2.0.0.0.0 - Limited Production on Sun Sep 18 22:14:59 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
BBED> set blocksize 8192
BLOCKSIZE 8192
BBED> set block 13
BLOCK# 13
BBED> map /v
File: maclean.dbf (0)
Block: 13 Dba:0x00000000
------------------------------------------------------------
KTB Data Block (Table/Cluster)
struct kcbh, 20 bytes @0
ub1 type_kcbh @0
ub1 frmt_kcbh @1
ub1 spare1_kcbh @2
ub1 spare2_kcbh @3
ub4 rdba_kcbh @4
ub4 bas_kcbh @8
ub2 wrp_kcbh @12
ub1 seq_kcbh @14
ub1 flg_kcbh @15
ub2 chkval_kcbh @16
ub2 spare3_kcbh @18
struct ktbbh, 96 bytes @20
ub1 ktbbhtyp @20
union ktbbhsid, 4 bytes @24
struct ktbbhcsc, 8 bytes @28
b2 ktbbhict @36
ub1 ktbbhflg @38
ub1 ktbbhfsl @39
ub4 ktbbhfnx @40
struct ktbbhitl[3], 72 bytes @44
struct kdbh, 14 bytes @124
ub1 kdbhflag @124
b1 kdbhntab @125
b2 kdbhnrow @126
sb2 kdbhfrre @128
sb2 kdbhfsbo @130
sb2 kdbhfseo @132
b2 kdbhavsp @134
b2 kdbhtosp @136
struct kdbt[1], 4 bytes @138
b2 kdbtoffs @138
b2 kdbtnrow @140
sb2 kdbr[200] @142
ub1 freespace[4725] @542
ub1 rowdata[2921] @5267
ub4 tailchk @8188
BBED> find /x c203
File: maclean.dbf (0)
Block: 13 Offsets: 5271 to 5782 Dba:0x00000000
------------------------------------------------------------------------
c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e 64206d65
2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420 6d652c00
0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65 2c000203
c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00 0203c202
5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203 c2025c07
66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202 5a076669
6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807 66696e64
206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669 6e64206d
652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64 206d652c
000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d 652c0002
03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c 000203c2
024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002 03c2024d
0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2 024b0766
696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249 0766696e
64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766 696e6420
6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e 64206d65
找到了t1=200的偏移值為5271
則其fb的偏移值為5271 -4 = 5267
BBED> set offset 5267
OFFSET 5267
BBED> d
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
/* 找到指定行的地址為5267,其當前flag為正常的0x2c */
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@5268: 0x02
cols@5269: 2
col 0[2] @5270: 200
col 1[15] @5273: corruption here
修改該flag 為 0xff BBED> modify /x 0xff
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
ff020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0xff (KDRHFN, KDRHFP, KDRHFL, KDRHFF, KDRHFD, KDRHFH, KDRHFC, KDRHFK)
lock@5268: 0x02
cols@5269: 0
ckix@5270: 2
BBED> sum apply
Check value for File 0, Block 13:
current = 0x0000, required = 0x0000
我們使用bbed的verify命令驗證資料塊會發現問題flag
BBED> verify
DBVERIFY - Verification starting
FILE = maclean.dbf
BLOCK = 12
kdrchk: row is marked as both a Key and Clustered
prow=0x7f5335f05693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f5335f0427c
kdbchk: bad row tab 0, slot 199
Block 12 failed with check code 6251
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
使用dbv工具是也可以驗證這種邏輯訛誤的
[oracle@rh2 ~]$ dbv file=maclean.dbf
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Sep 18 22:27:49 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = maclean.dbf
kdrchk: row is marked as both a Key and Clustered
prow=0x7f9ef25f7693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f9ef25f627c
kdbchk: bad row tab 0, slot 199
Page 12 failed with check code 6251
DBVERIFY - Verification complete
Total Pages Examined : 2560
Total Pages Processed (Data) : 1
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 11
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2548
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 691691 (0.691691)
回到sqlplus中訪問之前修改的資料行,觸發ORA-600[13013] [5001]錯誤:
SQL> alter system flush buffer_cache;
System altered.
SQL> update tv set t2='correct here' where t1=200;
update tv set t2='correct here' where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [13013], [5001], [52937],
[25165836], [199], [25165836], [3], []
PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 568795662
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | 115 | 2 (0)| 00:00:01 |
| 1 | UPDATE | TV | | | | |
|* 2 | INDEX RANGE SCAN| IND_TV | 1 | 115 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)
SQL> select * from tv where t1=200;
select * from tv where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [],
[], [], []
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1015724781
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 115 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TV | 1 | 115 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IND_TV | 1 | | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)
可以看到當正好update到問題行記錄時如預料出現了ORA-00600:[13013], [5001]錯誤,而ACCESS BY INDEX ROWID時出現了ORA-00600:[qertbFetchByRowID]。
解決方案
1.在有備份的情況下可以通過blockrecovery線上修復該問題資料塊:
RMAN> blockrecover datafile 6 block 12;
Starting blockrecover at 18-SEP-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=144 devtype=DISK
starting media recovery
media recovery complete, elapsed time: 00:00:01
Finished blockrecover at 18-SEP-11
但是請注意如果該邏輯訛誤確實是由Oracle Bug引起的話,那麼很有可能blockrecover也無能為力,那麼可以借鑑第二種方法。
2. 第二種方法針對沒有備份可用的資料庫或者recover資料塊不管用的場景,可以設定10231事件並ctas複製該表,但是這種方法可能會丟失有問題的行記錄:
SQL> alter session set events '10231 trace name context forever, level 10'
SQL> Create table.TABLE_COPY as select * from TABLE;
瞭解更多關於kdrchk函式的資訊:
Add check for continued row piece pointing to itself with
corruption description:
"kdrchk: Row piece pointing to itself"
DB_BLOCK_CHECKING = MEDIUM will check for row pieces where the
next rowid (nrid) points to itself (chained row points to itself).
It produces error ORA-600 [kddummy_blkchk] or ORA-600 [kdBlkCheckError]
with check code [6266] (3rd ORA-600 argument).
DBVERIFY reports the same corruption description if the block is corrupt on disk.
RMAN when run with the CHECK LOGICAL option reports it as
corruption_type=CORRUPT/LOGICAL in v$database_block_corruption.
"ANALYZE TABLE VALIDATE STRUCTURE" produces error ORA-1498 and trace file
shows the same corruption description.
With this fix in place DBMS_REPAIR can be used to identify and mark the affected
block as Soft Corrupt producing error ORA-1578 and it can be skipped it for DML's
using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS.
[CM][SG][event 1][domain Q423][mem 0] Joining shared group
kdrchk: column length 0 but not null
prow=0x2a97f4d9d6 flag=0x2c column=57
Block Checking: DBA = 29635651, Block Type = KTB-managed data block
data header at 0x2a97f4be7c
kdbchk: bad row tab 0, slot 2
data_block_dump,data header at 0x2a97d113d8
data_block_dump,data header at 0x2a97d113d8
kdrchk: found invalid symbol reference 48
reference to delete symbol
valid symbol range [0,78)
Block Checking: DBA = 411055291, Block Type = KTB-managed data block
data header at 0x68a3f4
kdbchk: bad row tab 0, slot 4
Page 13499 failed with check code 6265
kdrchk: C and neither of H or F
prow=0x4282803ae flag=0x41
Block Checking: DBA = 322963095, Block Type = KTB-managed data block
data header at 0x42828007c
kdrchk: column length 0 but not null
prow=0x10021035e flag=0x2c column=40
Block Checking: DBA = 25189259, Block Type = KTB-managed data block
data header at 0x10020fe7c
kdbchk: bad row tab 0, slot 0
Page 23435 failed with check code 6264
kdrchk: column length 0 but not null
prow=0x1002122e5 flag=0x2c column=40
Block Checking: DBA = 25189260, Block Type = KTB-managed data block
kdrchk: row is marked as both a Key and Clustered
prow=0xd2bfa981 flag=0xff
File#67, Block#74754
kdbchk: bad row tab 0, slot 0
kdrchk: no columns, but has one of P or N
prow=0x934fbffa flag=0x31
DIAGNOSTIC ANALYSIS:
====================
A look at the block dump in the analyze trace file revealed two very
suspicious looking rows:
tab 0, row 0, @0x1ede
tl: 2 fb: --HD---N lb: 0x0
tab 0, row 1, @0x1edc
tl: 2 fb: --HD---N lb: 0x0
The flag bytes in these rows look incorrect.
待修訂!