ORA-07445錯誤導致叢集CI鎖的問題分析

charsi發表於2014-09-26

前兩天,客戶資料庫叢集突然無法訪問。一個節點的alert日誌中出現大量ORA-07445: exception encountered: core dump [kghlkremf+0024] [SIGSEGV] [Address not mapped to object]和ORA-00476: RECO process terminated with error的錯誤。同時該節點例項使用sqlplus無法登陸
從另一個節點中可以看到有很多DFS lock handle等待事件,P1的值為1128857605,可以判斷為CI(Cross-Instance)鎖。對於這種鎖,最簡單粗暴的解決方法是重啟兩個節點。
我們看一下這個問題.

環境:
Oracle 10.2.0.4 RAC
AIX 5.3

故障現象:
1.節點2的資料庫alert日誌中首先數個伺服器程式出現ORA-07445報錯,之後後臺程式RECO也出現ORA-07445錯誤,然後pmon程式不斷的去恢復RECO程式
Errors in file /oracle/admin/orcl/udump/orcl2_ora_446622.trc:
ORA-07445: 出現異常錯誤: 核心轉儲 [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
Tue Sep 23 09:49:24 2014
Trace dumping is performing id=[cdmp_20140923094924]
Tue Sep 23 09:49:33 2014
Errors in file /oracle/admin/orcl/udump/orcl2_ora_299228.trc:
ORA-07445: 出現異常錯誤: 核心轉儲 [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
Tue Sep 23 09:49:43 2014
Errors in file /oracle/admin/orcl/udump/orcl2_ora_127730.trc:
ORA-07445: 出現異常錯誤: 核心轉儲 [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
Tue Sep 23 09:49:51 2014
PMON failed to acquire latch, see PMON dump
Tue Sep 23 09:49:53 2014
Errors in file /oracle/admin/orcl/udump/orcl2_ora_484038.trc:
ORA-07445: 出現異常錯誤: 核心轉儲 [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
Tue Sep 23 09:50:03 2014
Errors in file /oracle/admin/orcl/bdump/orcl2_reco_115088.trc:
ORA-07445: exception encountered: core dump [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
Tue Sep 23 09:50:13 2014
Errors in file /oracle/admin/orcl/bdump/orcl2_pmon_115700.trc:
ORA-00476: RECO process terminated with error
Tue Sep 23 09:50:23 2014
Errors in file /oracle/admin/orcl/bdump/orcl2_pmon_115700.trc:
ORA-00476: RECO process terminated with error
Tue Sep 23 09:50:33 2014
Errors in file /oracle/admin/orcl/bdump/orcl2_pmon_115700.trc:
ORA-00476: RECO process terminated with error
Tue Sep 23 09:50:43 2014
Errors in file /oracle/admin/orcl/bdump/orcl2_pmon_115700.trc:
ORA-00476: RECO process terminated with error
Tue Sep 23 09:50:53 2014

2.檢視udump目錄中生成的trc檔案,以及reco程式的trace檔案。在call stack中都出現kghadd_reserved_ext,kghget_reserved_ext資訊
*** 2014-09-23 09:50:03.110
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [kghlkremf+0024] [SIGSEGV] [Address not mapped to object] [0xB4B000000023208] [] []
No current SQL statement being executed.
----- Call Stack Trace -----
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              000000000 ? 104A2CD0D ?
ksedmp+0290          bl       ksedst               104A2C690 ?
ssexhd+03e0          bl       ksedmp               300001D15 ?
000044C0             ?        00000000            
kghget_reserved_ext  bl       kghadd_reserved_ext  000000000 ? 000000000 ?
ent+0214                      ent                  000000000 ? 000000000 ?
                                                   FFFFFFFFFFF5F40 ? 104AC36D4 ?
kghgex+0534          bl       kghget_reserved_ext  000000000 ? 70000038F799CF8 ?
                              ent                  FFFFFFFFFFF5FB0 ?
                                                   2488C54000000000 ?
                                                   100203234 ?
kghfnd+0584          bl       kghgex               700000389E9AAC0 ? 0000009A4 ?
                                                   000000001 ? 110000FF8 ?
                                                   000000001 ? 7000000100EE190 ?
                                                   FFFFFFFFFFF6110 ?
kghalo+0a24          bl       kghfnd               102DCCC40 ? 000000000 ?
                                                   000000000 ? 44442D4D4F4E ?
問題分析:
這個問題是由Oracle的bug導致,參考Oracle文件 ID 468456.1
其原因是由於使用到bloom filter演算法的SQL查詢在執行的時候出現異常中斷。這是一個記憶體錯誤導致,當記憶體重新分配給其他一些客戶端時,原有的連結仍舊往這個記憶體中寫入資料,所以導致了ORA-07445錯誤

關於bloom filter演算法可以在網上搜尋查詢
只是有個問題,關於[kghlkremf+0024] 這個錯誤中的0024這個訊號是從哪裡查詢的?

在Oracle的468456.1文件中,描述該問題出現的版本是Version 10.2.0.1 to 10.2.0.3。對於10.2.0.4版本,原話是這樣的It can be provided only as part of a complete patchset so the first version that will include this fix is 10.2.0.4.
就是說這個問題將作為一個完整patch的一部分給使用者提供,因此修復這個問題的第一個版本會是10.2.0.4

從這個現象來看,也許這個bug在10.2.0.4中可能並沒有被修復.

解決該問題的方法是禁用bloom filter演算法
alter system set "_bloom_filter_enabled"=false scope=both;
動態修改即可生效

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/23850820/viewspace-1281786/,如需轉載,請註明出處,否則將追究法律責任。

相關文章