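1. Problem Analysis
At 8:20 a.m. on December 1, the customer reported that the database could not be reached and that connecting with sqlplus failed as well. I logged in with a remote access tool and found that the database processes appeared to be up, so I checked the cluster status: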
[oracle@myrac1 ~]$ crs_stat -t

The command hung for a long time without returning.

[oracle@myrac1 cssd]$ ps -ef | grep pmon
oracle    3308 19190 0 11:41 pts/2    00:00:00 grep pmon
oracle   18254     1 0 09:35 ?        00:00:00 asm_pmon_+ASM1
oracle   18474     1 0 09:36 ?        00:00:00 ora_pmon_testdb1

However, after logging in to the database, I found that it was effectively unusable:

[oracle@myrac1 ~]$ sqlplus "/as sysdba"

SQL*Plus: Release 10.2.0.5.0 - Production on Fri Mar 7 10:25:15 2014

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP,
Data Mining Scoring Engine and Real Application Testing options

SQL> select status from v$instance;

The query never returned. In this state the database is unusable.

Checking the alert log showed that errors began at 6:24:
Attempt to get Control File Enqueue by LGWR pid=17130 (mode=X, type=0, timeout=900) is being blocked by inst=2, pid=11498
Please check inst 2's alert log for more information on the blocker including a possible ORA-00494 and related incident logs
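
Messages like these can be pulled out of a long alert log with a simple search. The following is a minimal sketch, not an Oracle tool: the function name find_cf_blockers is hypothetical, and the alert log path is left to the caller because it depends on the installation.

```shell
#!/bin/sh
# find_cf_blockers: hypothetical helper (not part of Oracle).
# Prints, with line numbers, every alert log line that mentions a blocked
# control file enqueue request or ORA-00494.
find_cf_blockers() {
    # $1 = path to the alert log (supplied by the caller; location varies)
    grep -n -i -E 'Control File Enqueue|ORA-00494' "$1"
}

# Example call (the path is a placeholder, not taken from this system):
#   find_cf_blockers /path/to/alert_testdb1.log
```

Running it against the alert log of every node makes it easier to see which instance is named as the blocker and when the first message appeared.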

The database had raised ORA-00494:

[oracle@myrac1 cssd]$ oerr ora 494
00494, 00000, "enqueue%s held for too long%s by '%s'"
// *Cause: The specified process did not release the enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the incident information.

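Judging from these errors, the second node of the cluster had locked the control file and blocked other requests for more than 900 seconds. The first node could not obtain the resource and killed its own background process. On the second node, nothing at all was logged during this period. When the fault was reported, node 2 still answered ping but could not be logged in to; after a monitor and keyboard were attached in the machine room, the screen showed nothing at all, which means the operating system had hung.
On the first node, the log then showed it waiting for the other instances to leave: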
Waiting for instances to leave:

Waiting for instances to leave:

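The conclusion, then, is that node 2 was most likely already holding the CF enqueue when its operating system hung; node 1 could not acquire it, and that led to the outage.

In the end we had no alternative but to reboot the second node and force the database on the first node to shut down: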

SQL> shutdown abort;

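2. Solution

    To keep a problem on a single node from bringing down the whole database cluster, Oracle's official guidance is as follows.
    When a CF enqueue is held for more than 900 seconds, ORA-00494 is raised. Release 10.2.0.4 introduced a new way of handling ORA-00494: when the error occurs, the blocking process is killed, whether or not it is a background process. Because the CF enqueue is never released, LGWR, which needs the CF enqueue, waits indefinitely, and more and more active sessions pile up on log file sync. Once the CF enqueue wait passed 900 seconds and ORA-00494 was raised, CKPT was killed first, then LGWR at 11:57:35, and the instance crashed at that same moment. Setting _kill_controlfile_enqueue_blocker=false stops any process from being killed (and nothing is done about a CF enqueue held for more than 900 seconds either).
    Setting _kill_enqueue_blocker=1 in init.ora prevents background processes from being killed, while non-background processes are still killed. Either way, the root cause still has to be tracked down: why was the CF enqueue held for more than 900 seconds?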
SQL> select ksppinm,ksppstvl,ksppstdf from x$ksppi a,x$ksppcv b where a.indx = b.indx and ksppinm='_kill_controlfile_enqueue_blocker';

KSPPINM                                   KSPPSTVL
----------------------------------------- ------
_kill_controlfile_enqueue_blocker         TRUE

It is recommended to change this value to FALSE.
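
One way to make that change is sketched below. It rests on assumptions not stated above: that the instances start from an spfile, and that a restart can be scheduled for the new value to take effect. Hidden (underscore) parameters have to be double-quoted in ALTER SYSTEM, and they are normally changed only on the advice of Oracle Support. The helper build_alter_system is hypothetical; it only prints the statement so that it can be reviewed before being passed to sqlplus.

```shell
#!/bin/sh
# build_alter_system: hypothetical helper (not part of Oracle).
# Prints an ALTER SYSTEM statement that records a parameter in the spfile
# for all RAC instances (SID='*'). It does not connect to the database.
build_alter_system() {
    # $1 = parameter name, $2 = new value
    printf "ALTER SYSTEM SET \"%s\"=%s SCOPE=SPFILE SID='*';\n" "$1" "$2"
}

# Review the statement first, then pipe it to sqlplus if it looks right:
#   build_alter_system _kill_controlfile_enqueue_blocker FALSE
#   build_alter_system _kill_controlfile_enqueue_blocker FALSE | sqlplus -S "/ as sysdba"
```

After the instances have been restarted, the x$ksppi query above can be run again to confirm the value.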
