US contention: Lock held to perform. DDL on the undo segment

看到一篇fuyuncat寫的關於US-contention的troubleshooting 案例，以前從沒有接觸過類似的enqueue，仔細閱讀幾遍並記下筆記；然後又多搜了幾篇類似的案例，以備後用。

該案例大致總結
"Active Sessions Waiting: Other" 統計了RAC中除了IO和Idle waits之外的所有wait事件
AWR中top 5佔了兩席
DFS lock handle—會話等待獲取一個全域性鎖控制程式碼，全域性鎖由DLM(分散式鎖管理器)管理和分配的，這一等待事件說明全域性鎖控制程式碼資源不夠用，決定引數_lm_locks，9i後預設為12000；說明大量事務獲取鎖，但沒有commit/rollback;
enq: US – contention 說明事務在佇列中等待UNDO segment，通常由於UNDO空間不足導致
首先查詢這兩個等待事件的wait物件
Select * from Dba_objects in( select current_obj#,count(*) from dba_hist_active_sess_history where event=’DFS lock handle’ and snap_id between *** and ***)
發現兩個等待事件的物件基本相同
Undo資源不足，可能是undo_retention時間過長且設定為guarantee，select retention,tablespace_name from dba_tablespaces where tablespace_name like ‘UNDO%’檢視發現沒有設定guarantee
接著檢視哪些事務消耗了UNDO，但是隻有一個transaction；
那就檢視UNDO的實際使用情況 select * from dba_segments where segment_type like ‘%UNDO%’ 結果顯示一個回滾段_SYSSMU69$佔了將近20g，檢視該回滾段的extents的狀態
Select status, count(*) from dba_undo_extents where segment_name ='_SYSSMU69$' group by status; 全部為active，則說明有事務正在使用所有的擴充套件段，但又找不到這樣的事務，原因是一個使用該回滾段的事務被異常終止了，先是透過kill session殺死，但是仍會回滾還未提交的事務，於是直接在OS刪除

由於該回滾段狀態仍未online，且所有extents都是active，所以無法drop或shrink，解決方案：
1、重啟例項，重置回滾段；
2、新增一個undo表空間，使其他事務正常執行；殺掉由於等待而徹底hung的會話，恢復正常

總結：作者在AWR top 5 events裡發現兩個等待事件，然後判斷US contention為元兇，先檢查undo_retention/guarantee的引數設定，接下來檢視UNDO表空間的使用狀況，dba_segments裡發現一個非常大的undo segment，進而查詢dba_undo_extents以確認其extents的狀態，最後聯絡開發人員找到原因；

另外一個案例來源http://www.itpub.net/redirect.php?tid=1269096&goto=lastpost

RAC出現大量的row cache lock + us contention
前者是由於一個sequence設定為nocache，修改後變好；後一個猜測undo出問題，直接查active+unexpired的總和，接近undo表空間的大小，臨時增大undo表空間並kill掉消耗量最大的impdp程式

第三個案例來自eagle fan http://www.dbafan.com/blog/?p=170
10203版本
AUM管理方式是系統不忙的時候offline一些undo segment，不夠用時再online；而當系統特別繁忙時online或者resize或出現問題--10511事件解決；
但此刻並不繁忙
作者留意到v$undostat中的unxpblkreucnt: Number of unexpired undo blocks reused by transactions
該列值不為0，一般只有當undo不夠存放undo_retention時間段內的資料時，才會發生unexpired undo extents stealing;
但是目前不是高峰期，作者留意到v$undostat中的tuned_undoretention欄位，10.2之後，oracle預設採用自動調整undo retention，會根據undo大小以及系統繁忙程度欄位調整undo_retention引數；
出問題前一天資料庫重啟過，因為起來很空閒，所以tuned_undoretention很大，undo被撐滿，雖然該欄位值一直在降，但還是沒有趕上系統warm up的速度，導致資料庫出現問題；
透過設定_undo_autotune為disable，不再自動更新。
總結：出現US contention，並且是在系統不繁忙的時候，作者留意到v$undostat中的unxpblkreucnt引數，由此推斷出該enqueue是由於oracle 10.2以後的新特性引起的，透過設定隱含引數遮蔽此特性來最終解決

How to correct performance issues with enq: US - contention related to undo segments [ID 1332738.1]

Purpose

Assist in correcting performance issues related to "enq: US Contention" on undo segments.

You have many offline undo segments and the workload starts to online many undo segments over a short period of time. This can lead to high 'latch: row cache objects' contention may be seen on dc_rollback_segments together with high 'enq: US - contention' waits when using system managed undo with an auto tuned undo retention period.

Sessions attempting to online undo segments should show ktusmous_online_undoseg() in their call stack.

Another aspect of the problem can be due to long running queries which can raise tuned_undoretention to very high values and exhausts the undo tablespace resulting in ORA-1628.

A real world case:
A query is being executed and some rows are fetched from the cursor and then the user stops working on that query (e.g. does not press the "next" button on the application screen) and works on something else (e.g. in a different window). After some time the user continues working on the query ... auto-tune starts tracking the query from this point and the maxquerylen is quite large now, hence also the tuned_undoretention (that depends directly on the maxquerylen).

NOTE: The Seibel application can allow for this problem to happen.

Last Review Date

June 24, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

The wait event "enq: US Contention" is associated with contention on the latch in the row cache (dc_rollback_seg). Enqueue US - Contention can become a bottle-neck for performance if workload dictates that a lot of offlined undo segments must be onlined over a short period of time. The latch on the row cache can be unable to keep up with the workload.

This can happen for a number of reasons and some scenarios are legitimate workload demands.

Solution:

Ensure that peaks in onlined undo segments do not happen (see workaround #2). That is not always feasible.

Workarounds:

1. Bounce the instance.

2. Setting _rollback_segment_count to a high number to keep undo segments online.

alter system set "_rollback_segment_count"=;

3. Set _undo_autotune to false

alter system set "_undo_autotune" = false;

NOTE: Simply using _smu_debug_mode=33554432 may not be enough to stop the problem, but valid fix for bug 5387030.

4. A fix to bug 7291739 is to set a new hidden parameter, _highthreshold_undoretention to set a high threshold for undo retention completely distinct from maxquerylen.

alter system set "_highthreshold_undoretention"=;

If problems persist, please file a Service Request with Oracle Support.

@ Diagnosis
@
@ Should the workarounds and/or configuration changes not help to alleviate the problems,
@ development would need the following diagnostics data:
@
@ a. Provide alert.log which shows the last instance startup parameters through the time of the
@ latest isssues.
@
@ b. AWR and/or ASH report of 30 or 60 minutes interval.
@
@ b. Following query output:
@
@ alter session set nls_date_format='mm/dd/yy hh24:mi:ss';
@ select begin_time, MAXQUERYID, MAXQUERYLEN from v$undostat;
@
@ c. While the error is ongoing:
@
@ On single instance:
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug hanganalyze 3
@ oradebug dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug dump systemstate 266
@
@ wait for 2 minutes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug hanganalyze 3
@ oradebug dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug dump systemstate 266
@
@ On RAC get tracing on all nodes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug -g all hanganalyze 3
@ oradebug -g all dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug -g all dump systemstate 266
@
@ wait for 2 minutes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug -g all hanganalyze 3
@ oradebug -g all dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug -g all dump systemstate 266

關於enq: US – contention

相關文章