ORA-00494: enqueue [CF] held for too long (more than 900 seconds) -RAC

客戶的一臺rac一個月hang死一次，上次重啟機器後一切正常，這次必須一次性解決，生產庫不可能老是重啟賽。

##############################################
系統軟體日誌：syslog.log
##############
Jan 6 04:57:51 db2 cimserver[1971]: PGS13606: A failure was detected in provider module HPUXStorageProviderModule. The generation of indications by pro
viders in this module may be affected. To ensure these providers are serving active subscriptions, disable and then re-enable this module using the cimproviu
der command.
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:24:17 db2 automountd[1485]: match_addr: getaddrinfo failed: host nor service provided, or not known
Jan 6 06:41:47 db2 above message repeats 47 times
##############
uamdb2 automountd[1485]和oracle的ORA-00494: enqueue [CF] held 同時報錯，在11月28日出問題時也是同時報錯。
諮詢HP工程師，回覆說uamdb2 automountd[1485]錯誤是由網路問題，或者是DNS服務解析出問題引起的。
停止了系統的autofs服務後，系統報錯消失，但oracle仍然報錯ORA-00494: enqueue [CF] held

##############################################
oracle日誌：alert_db2.log
同時發現db2_diag_26520.trc未生成
##############
Wed Jan 6 06:24:15 2010
Errors in file /oracle/admin/db/bdump/db2_diag_26520.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 2, osid 22109'
Wed Jan 6 06:24:18 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:24:37 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:25:15 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:25:34 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:27:24 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:28:24 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
Wed Jan 6 06:28:31 2010
System State dumped to trace file /oracle/admin/db/bdump/db2_diag_26520.trc
Wed Jan 6 06:29:31 2010
Killing enqueue blocker (pid=22109) on resource CF-00000000-00000000
by killing session 1061.7968
##############################################
oracle 監聽器日誌：/oracle/product/102/network/log/listener_db2.log
##############
06-JAN-2010 06:08:41 * service_died * uamdb1 * 12537
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman@db1)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57242)) * establish * db * 0
06-JAN-2010 06:09:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=uamdb)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57244)) * establish * db * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57246)) * establish * uamdb * 0
06-JAN-2010 06:09:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=db)(CID=(PROGRAM=rman)(HOST=db1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=136.5.74.132)(PORT=57247)) * establish * db * 0
06-JAN-2010 06:09:04 * service_update * db2 * 0
06-JAN-2010 06:09:22 * service_update * db2 * 0
06-JAN-2010 06:09:44 * service_register * db1 * 1184
##############################################
rac crs資源檢查
##############
node1
##############
uamdb1#[/]crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.uamdb.db   application    ONLINE    ONLINE    uamdb2
ora....b1.inst application    ONLINE    ONLINE    uamdb1
ora....b2.inst application    ONLINE    ONLINE    uamdb2
ora....SM1.asm application    ONLINE    ONLINE    uamdb1
ora....B1.lsnr application    ONLINE    ONLINE    uamdb1
ora.uamdb1.gsd application    ONLINE    ONLINE    uamdb1
ora.uamdb1.ons application    ONLINE    ONLINE    uamdb1
ora.uamdb1.vip application    ONLINE    ONLINE    uamdb1
ora....SM2.asm application    ONLINE    ONLINE    uamdb2
ora....B2.lsnr application    ONLINE    ONLINE    uamdb2
ora.uamdb2.gsd application    ONLINE    ONLINE    uamdb2
ora.uamdb2.ons application    ONLINE    ONLINE    uamdb2
ora.uamdb2.vip application    ONLINE    ONLINE    uamdb2
uamdb1#[/]crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
##############
通過以上檢查，發現node1資源狀態完全正常
##############
node2
##############
uamdb2#[/]crs_stat -t
#掛死，不出結果
uamdb2#[/]crsctl check crs
#掛死，不出結果
uamdb2#[/]crsctl check crsd
CRS appears healthy
uamdb2#[/]crsctl check evmd
EVM appears healthy
uamdb2#[/]crsctl check cssd
#掛死，不出結果
##############
通過以上檢查，發現node2資源狀態不正常，css掛死。

##############
檢查oracle系統問題
##############
node2
##############
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
PX Deq: reap credit                                                       1
name-service call wait                                                    1
os thread startup                                                        14
control file sequential read                                             19
ksdxexeotherwait                                                         56
enq: WF - contention                                                    859
control file parallel write                                            2782
enq: CF - contention                                                   5493
                                                                      24914
##############
enq: CF - contention ：5493   CFqueue很嚴重
##############
鎖等待：
sys@DB_SQL>   SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;

USERNAME                        OBJECT_ID SESSION_ID    SERIAL# ORACLE_USERNAME                OS_USER_NAME                   PROCESS
------------------------------ ---------- ---------- ---------- ------------------- ------------------------------
                                     9167       1052      40224                                oracle                         24193
                                     9238       1052      40224                                oracle                         24193

2 rows selected.
sys@DB_SQL> select sid,serial#,username,program,sql_hash_value,
2           to_char(logon_time,'yyyy/mm/dd hh24:mi:ss') as login_time
3    from v$session
4   where paddr in ( select addr from v$process where spid=&spid);
Enter value for spid: 24193

   SID SERIAL# USERNAME   PROGRAM                          hash_value LOGIN_TIME
------ ------- ---------- -------------------------------- ---------- -------------------
1052   40224            oracle@db2 (m000)                      0 2010/01/07 09:33:18
1 row selected.
sys@UAMDB_SQL> SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode, l.request, o.object_name, o.object_type
2 FROM v$lock l, dba_objects o, v$session s
3 WHERE l.id1 = o.object_id (+)
4 AND   l.sid= s.sid
5 and l.sid=&sid
6 ORDER BY sid, type;
Enter value for sid: 1052

   SID    BLOCKER EVENT                TYPE            LMODE    REQUEST OBJECT_NAME OBJECT_TYPE
------ ---------- -------------------- ---------- ---------- ----------
1052            gc current request   SH                  6          0
1052            gc current request   TM                  6          0 WRH$_ACTIVE_SESSION_HISTORY TABLE
1052            gc current request   TM                  3          0 WRM$_DATABASE_INSTANCE TABLE
1052            gc current request   TX                  6          0
1052            gc current request   WF                  4          0
sys@DB_SQL> SELECT s.sid , s.serial#, s.status, s.server, s.username, e.event, e.time_waited
2 FROM v$session_event e, v$session s
3 WHERE e.sid=s.sid
4 and e.sid=&sid;
Enter value for sid: 1052

   SID SERIAL# STATUS   SERVER    USERNAME   EVENT                TIME_WAITED
------ ------- -------- --------- ---------- -------------------- -----------
1052   40224 ACTIVE   DEDICATED            class slave wait               0
1052   40224 ACTIVE   DEDICATED            control file sequential read           0
1052   40224 ACTIVE   DEDICATED            gc cr block 2-way              0
1052   40224 ACTIVE   DEDICATED            gc current block 2-way           0
1052   40224 ACTIVE   DEDICATED            gc current block busy           0
1052   40224 ACTIVE   DEDICATED            gc current grant busy           0
1052   40224 ACTIVE   DEDICATED            enq: TM - contention           4
1052   40224 ACTIVE   DEDICATED            row cache lock                 0
1052   40224 ACTIVE   DEDICATED            library cache lock             0
1052   40224 ACTIVE   DEDICATED            events in waitclass Other            0

該session 有4個TM等待，TM等待一般是由於外來鍵沒有索引引起的
################
外來鍵沒有索引的列
################
sys@DB_SQL> SELECT * FROM (
2     SELECT c.table_name, cc.column_name, cc.position column_position
3     FROM   user_constraints c, user_cons_columns cc
4     WHERE c.constraint_name = cc.constraint_name
5     AND    c.constraint_type = 'R'
6     MINUS
7     SELECT i.table_name, ic.column_name, ic.column_position
8     FROM   user_indexes i, user_ind_columns ic
9     WHERE i.index_name = ic.index_name
10     )
11 ORDER BY table_name, column_position;

TABLE_NAME                     COLUMN_NAME                              COLUMN_POSITION
------------------------------ ---------------------------------------- ---------------
HS$_CLASS_CAPS                 CAP_NUMBER                                             1
HS$_CLASS_DD                   DD_TABLE_ID                                            1
HS$_FDS_INST                   FDS_CLASS_ID                                           1
HS$_INST_CAPS                  CAP_NUMBER                                             1
HS$_INST_DD                    DD_TABLE_ID                                            1
OLAP_OLEDB_MDPROPS             VALUE                                                  1
REGISTRY$                      PID                                                    2
WRM$_SNAPSHOT                  INSTANCE_NUMBER                                        2
WRM$_SNAPSHOT                  STARTUP_TIME                                           3

9 rows selected.
#######################
enqueue的狀態
#######################
sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;

   INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
         2 KM          2           1          2           0             1
         2 RR          2           1          2           0             1
         2 PE     397651           3     397651           0             1
         2 TL          2           1          2           0             1
         2 SW         29           1         29           0             1
         2 XR        781           1        781           0             1
         2 CN          7           1          7           0             1
         2 PW         14           2          2          12             2
         2 RT          6           4          5           1             4
         2 PG         27          10         27           0             5
         2 DL         82          16         82           0             6
         2 BR        102          41        102           0            11
         2 KO         81          27         81           0            12
         2 UL       1395         149       1395           0            32
         2 WP         38          32         38           0            36
         2 MR      62790          68      62790           0            38
         2 SK        263         120        263           0            40
         2 IR       7173         160       7173           0            46
         2 TS       2904         381       2904           0            84
         2 PI        600         211        600           0            94
         2 MD        487         244        487           0           100
         2 TO       1139         543       1139           0           142
         2 RS       1332         599       1332           0           178
         2 TC       1095         776       1095           0           308
         2 TX     403790         502     403790           0           422
         2 FB       9488        2532       9488           0           682
         2 TT     362946        3822     362946           0           882
         2 AF      17905        1913      17905           0          1075
         2 MW       5111        3746       5111           0          1527
         2 HW     610709        5172     610709           0          1632
         2 PR      28957          36      28957           0          1763
         2 DR       5112        5014       5112           0          2646
         2 TD      21603       15045      21603           0          3441
         2 RO      21953        7763      21953           0          4177
         2 US      40633       32406      40633           0          6173
         2 TA      35746       26669      35746           0          9295
         2 JS 104933748        5264 104933573         175         16486
         2 WL        580         260        358         222        178461
         2 PS    1550184      690899    1537349       12835        250683
         2 WF      63022       40630      44837       18185      47065384
         2 FU        264         261          8         256     230306409
         2 CF   20987280       11795   20970328       16933     407841619
         2 TM     664700       13494     664021         679     926373131

CF和TM都是相當的高。
##############
node1
##############
sys@DB_SQL> SELECT s.username,l.OBJECT_ID,l.SESSION_ID,s.SERIAL#,l.ORACLE_USERNAME,l.OS_USER_NAME,l.PROCESS FROM V$LOCKED_OBJECT l,V$SESSION S WHERE l.SESSION_ID=S.SID;

no rows selected
##############
沒有鎖等待
sys@DB_SQL> select event,count(*) from v$active_session_history group by event order by 2;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
rdbms ipc reply                                                           1
gc current block busy                                                     1
null event                                                                1
gc cr grant 2-way                                                         1
enq: PS - contention                                                      1
gc current block 2-way                                                    2
library cache lock                                                        2
latch free                                                                2
gc cr block 2-way                                                         3
gc current grant busy                                                     4
log file sequential read                                                  4
Log archive I/O                                                           4
DFS lock handle                                                           5
gc cr disk read                                                           5
Backup: sbtclose2                                                         6
ges inquiry response                                                      7
row cache lock                                                           10
wait for scn ack                                                         18
db file parallel write                                                   19
log file parallel write                                                  21
control file parallel write                                              22
Backup: sbtinit2                                                         36
log file sync                                                            41
control file sequential read                                             50
os thread startup                                                        55
Backup: sbtbackup                                                        83
db file sequential read                                                 111
ksdxexeotherwait                                                        186
RMAN backup & recovery I/O                                              328
                                                                        693
Backup: sbtwrite2                                                      1591
enq: WF - contention                                                   2709
reliable message                                                       3714
enq: TM - contention                                                  10651
enq: CF - contention                                                  15625

35 rows selected.

sys@DB_SQL> select * from v$enqueue_stat where cum_wait_time>0 order by inst_id,cum_wait_time;

   INST_ID EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ# CUM_WAIT_TIME
---------- -- ---------- ----------- ---------- ----------- -------------
         1 TL          1           1          1           0             1
         1 FU         57           2         57           0             3
         1 DL         62          11         62           0             5
         1 SK         76          22         76           0            14
         1 PI        160          53        160           0            15
         1 PG         32          25         32           0            16
         1 MR      23713          34      23713           0            24
         1 WP         21          21         21           0            25
         1 MD        134          68        134           0            32
         1 TO        312         153        312           0            45
         1 RS        364         160        364           0            63
         1 BR        459         176        459           0            68
         1 HW     622463         747     622463           0           230
         1 TX    1785623          84    1785623           0           284
         1 PR       9439           6       9439           0           317
         1 AF       5256         573       5256           0           385
         1 FB       7088        1358       7088           0           597
         1 MW       1364        1364       1364           0           608
         1 TC        685         203        685           0           645
         1 DR       1364        1364       1364           0           902
         1 TT     100900        4119     100900           0           975
         1 RO       7356        2134       7356           0          1037
         1 TD       5352        5352       5352           0          1511
         1 US      10831        6203      10831           0          1523
         1 JS   15730382        1599   15730336          46          2122
         1 TA       8234        8107       8234           0          3271
         1 PS     405099      156349     401698        3401         63259
         1 WL        159          63         92          67         64756
         1 TM    4522892        4094    4522890           0        317188
         1 WF      19798        7608      13427        6371       3337224
         1 CF    5556633        3787    5554145        2482      66750958
###################################################################################
問題現狀跟Bug 6857917很相像。

Description

An Oracle foreground process can cause a cluster wide hang if it is holding a critical enqueue (such as CF) and it hangs when registering with CSS. With this fix the process will attempt to register BEFORE acquiring such an enqueue. Note: This fix does not attempt to fix any process hang when registering will CSS, but instead tries to ensure that if such a hang does occur then the impact is reduced.

#####################

最終處理方式：升級到10.2.0.4.2後，資料庫正常，再也沒有發生hang死的情況。

#####################

故障在2個月後又重現，這次oracle的原廠工程師分析後，指出系統的記憶體洩漏的bug問題，同時我也處理了個oracle的記憶體洩漏的bug，期待問題能解決。

hp：建議補丁：建議安裝補丁：PHKL_40208；我最終是打上了最新的golden patch

oracle：

------------------------------------

下面是METALINK給出的解釋：

I have checked the information provided. crsd is reporting timeouts when running the checking
actions. The racgmain process on the other hand, executed by racgwrap, remain
running so the number is increasing.

This issue has been reported in Bug 7009245 ""RACGMAIN CHECK" PROCESS NOT TERMINATING", and drill down and
determine the root cause this require to get a stack trace from racgmain
process(es) as well as you should run OSWatcher with a specific parameters,
finally the issue of Bug 7009245 has been fixed by the fix of Bug 6196746
(unpublished bug -- summary is "racgwrap spawns racgmain child process. crsd
timeout kills racgwrap but bot racgmain")

So please perform. the following action plan that intend to solve the issue.
.
ACTION PLAN
============
1. Stop CRS on this node.
2. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin
3. Edit the file racgwrap and modify the last 3 lines from:
.
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
.
to:
.
# Line added to test fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
.
4. Restart CRS and make sure that all the resources are startes

This should solve the issue. also if you are not able to stop CRS right now, you can try the
solution without crs stop/start.

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/11088128/viewspace-675791/，如需轉載，請註明出處，否則將追究法律責任。

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) -RAC

相關文章