一次awr報告分析(密碼錯誤引發sql執行時間過長)

花花蘑菇發表於2016-09-07
預備知識:
1)row cache lock事件  
--記憶體share pool分為library cache,dictionary cache;而row cache lock物件分佈在dictionary cache中,是由於對於字典緩衝的訪問造成的;  
--這類屬於latch類的資源競爭,相當耗CPU,如果併發量大的話,容易down機
如果發現這個等待十分高,一般來說可能由於2種原因,一是共享池太小了,需要增加共享池,另外一種情況是SQL分析過於頻繁,對於共享池的併發訪問量過大。對於任何一種情況,絕大多數情況下加大共享池會有助於降低該等待,不過加大共享池的時候也要注意,並不一定所有的情況下增加共享池都會有明顯的效果。特別是對於第二種情況,精確的分析十分重要。另外進一步分析,弄清楚哪些ROW CACHE的等待最為嚴重,有助於解決問題。
row cache lock事件的調優基於每一個佇列鎖型別的行為,常見的佇列鎖型別有: 
-------DC_SEQUENCES:在使用序列的時候將發生該行緩衝佇列鎖。調優方式是檢查序列是否指定了緩衝選項並確定這個緩衝值可以承受預期的併發insert操作。
Check for appropriate caching of sequences for the application requirements.
-------DC_USED_EXTENTS和DC_FREE_EXTENTS:該行緩衝佇列鎖可能在空間管理碰到表空間分裂或者沒有足夠區大小時發生。調優方法是檢查表空間是否分裂了、區大小是否太小或者表空間是人工管理。 
-------DC_TABLESPACES:該行緩衝佇列鎖會在分配新區是發生。如果區大小設定得過小,程式將經常申請新區,這將導致衝突。調優方法是快速地增加區的數量。
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents. 
-------DC_OBJECTS:該行緩衝佇列鎖會在重編譯物件的時候發生。當物件編譯時將申請一個排他鎖阻塞其他行為。透過檢查非法物件和依賴關係來調優。 
-------DC_SEGMENTS:該行緩衝佇列鎖會在段分配的時候發生,觀察持有這個佇列鎖的會話在做什麼。
This is likely to be down to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose. 
-------DC_USERS:Deadlock and resulting “WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!” can occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.
-------DB_ROLLBACK_SEGMENTS:This is due to rollback segment allocation. Just like dc_segments,identify what is holding the enqueue and also generate errorstacks. Remember that on a multi-node system (RAC) the holder may be on another node and so multiple systemstates from each node will be required.
-------DC_AWR_CONTROL:This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these. 
2)相關檢視欄位說明:
ROW CACHE LOCK基礎說明
    P1 – Cache Id
    P2 – Mode Held
    P3 – Mode Requested
mode 和REQUEST的取值:
    KQRMNULL 0 null mode – not locked
    KQRMS 3 share mode
    KQRMX 5 exclusive mode
    KQRMFAIL 10 fail to acquire instance lock
3)SQL查詢
-----查詢row cache lock等待
    select *  from v$session_wait where wait_class = 'row cache lock'; 
    查出p1的值
-----查詢rowcache 名稱
    根據p1的值來查詢
    select * from v$rowcache where cache# = &p1;
其他:
select event,p1 from v$session a where a.username is not null and a.status='ACTIVE';
4)dba_hist_active_sess_history檢視
dba_hist_active_sess_history檢視記錄了記憶體中活動會話的歷史資訊,而動態效能檢視V$ACTIVE_SESSION_HISTORY則記錄了當前活動會話資訊。 透過dba_hist_active_sess_history檢視關聯v$sqlarea和DBA_HIST_SNAPSHOT可以跟蹤到某個時間段內的sql。當然能跟蹤到的sql多少取決於v$sqlarea,畢竟只有還保留在v$sqlarea中的sql才能跟蹤到。

AWR報告分析
資料庫出現狀況是8月30號下午3點到4點,於是拉取了兩份AWR報告,一份是8月30號下午3點到4點,另一份是31號下午3點到4點,將兩份報告比對著看,不同的地方應該就是有問題的地方。
1)發現top 5的前臺事件中,排第一的不是DB CPU。預設正常情況下,DB CPU會排第一。

2)繼續看Wait Events Statistics ,發現排第一也不是DB CPU,而是connection management call elapsed time,看名字感覺和資料庫連線有關。

3)繼續比對,發現dc_users的Pct Miss比較大,超過了20%

4)檢視可疑的sql語句:

標紅的sql執行速度很慢,有些異常。
SQL> select to_char(sample_time, 'YY-MM-DD HH24:MI:SS') sample_time,
  2         instance_number,
  3         sql_id,
  4         P1,
  5         event,
  6         wait_class
  7    from dba_hist_active_sess_history
  8   where sample_time between
  9         to_date('16-08-29 13:00:00', 'YY-MM-DD HH24:MI:SS') and
10         to_date('16-08-29 13:30:00', 'YY-MM-DD HH24:MI:SS')
11     and sql_id in ('bvtu633rnwrwv',
12                    '4a6uhr508t0p6',
13                    'fpm2zazkfqhy6',
14                    'bhrcaykh5tzsw',
15                    '9fxv1px768bd5')
16   order by 1
17  ;
SAMPLE_TIME       INSTANCE_NUMBER SQL_ID                P1 EVENT                 WAIT_CLASS           
----------------- --------------- ------------- ---------- --------------------- ---------------------
16-08-29 13:00:59               1 4a6uhr508t0p6         10 row cache lock        Concurrency
16-08-29 13:01:09               1 4a6uhr508t0p6         10 row cache lock        Concurrency
16-08-29 13:01:19               1 4a6uhr508t0p6         10 row cache lock        Concurrency
16-08-29 13:01:29               1 4a6uhr508t0p6          7 row cache lock        Concurrency
16-08-29 13:01:39               1 bvtu633rnwrwv         10 row cache lock        Concurrency
16-08-29 13:01:49               1 bvtu633rnwrwv         10 row cache lock        Concurrency
16-08-29 13:01:59               1 bvtu633rnwrwv         10 row cache lock        Concurrency
16-08-29 13:02:10               1 bvtu633rnwrwv          7 row cache lock        Concurrency
16-08-29 13:02:20               1 fpm2zazkfqhy6         10 row cache lock        Concurrency
16-08-29 13:02:30               1 fpm2zazkfqhy6         10 row cache lock        Concurrency
16-08-29 13:02:40               1 fpm2zazkfqhy6         10 row cache lock        Concurrency
16-08-29 13:02:41               2 bvtu633rnwrwv          7 row cache lock        Concurrency
16-08-29 13:02:50               1 fpm2zazkfqhy6          7 row cache lock        Concurrency
16-08-29 13:04:31               1 bhrcaykh5tzsw         10 row cache lock        Concurrency
16-08-29 13:04:41               1 bhrcaykh5tzsw         10 row cache lock        Concurrency
16-08-29 13:04:51               1 bhrcaykh5tzsw          7 row cache lock        Concurrency
16-08-29 13:06:32               1 9fxv1px768bd5         10 row cache lock        Concurrency
16-08-29 13:06:42               1 9fxv1px768bd5         10 row cache lock        Concurrency
18 rows selected
根據上面的sql,發現sql導致的等待事件是row cache lock。再根據P1=7或者10,查出等待事件是發生在具體哪個類別上
SQL> select cache#,parameter from v$rowcache where cache# in ('7','10');
    CACHE# PARAMETER
---------- --------------------------------
        10 dc_users
         7 dc_users
         7 dc_users
         7 dc_users
根據上面的sql,發現引數類別是dc_users,網上說dc_users是和使用者用錯誤密碼登陸有關:
In 11g there is an intentional delay between allowing failed logon attempts to retry. For some specific application types this can cause a problem as the row cache entry is locked for the duration of the delay . This can lead to excessive row cache lock waits for DC_USERS for specific users / schemas.

After 3 successive failures a sleep delay is introduced starting at 3 seconds and extending to 10 seconds max. During each delay the user X row cache lock is held in exclusive mode preventing any concurrent logon attempt as user X (and preventing any other operation which would need the row cache lock for user X).
驗證一下,看是否在發生row cache lock期間,有使用者登陸失敗。
SQL> select username,
  2         userhost,
  3         to_char(timestamp, 'YY-MM-DD HH24:MI:SS') timestamp,
  4         action_name
  5    from dba_audit_trail
  6   where action_name = 'LOGON'
  7     and priv_used is null
  8     and timestamp between
  9         to_date('16-08-29 13:00:00', 'YY-MM-DD HH24:MI:SS') and
10         to_date('16-08-29 13:30:00', 'YY-MM-DD HH24:MI:SS'); 
USERNAME                       USERHOST              TIMESTAMP         ACTION_NAME
------------------------------ --------------------- ----------------- ----------------------------
MAPP_PLATFORM                  IDC-APP-02            16-08-29 13:03:51 LOGON
MAPP_PLATFORM                  IDC-APP-01            16-08-29 13:03:41 LOGON
MAPP_PLATFORM                  IDC-APP-02            16-08-29 13:03:31 LOGON
MAPP_PLATFORM                  IDC-APP-01            16-08-29 13:03:21 LOGON
MAPP_PLATFORM                  IDC-APP-02            16-08-29 13:03:11 LOGON
MAPP_PLATFORM                  IDC-APP-01            16-08-29 13:03:01 LOGON
MAPP_PLATFORM                  IDC-APP-02            16-08-29 13:02:51 LOGON
MAPP_PLATFORM                  IDC-APP-01            16-08-29 13:02:41 LOGON
MAPP_PLATFORM                  IDC-APP-02            16-08-29 13:02:31 LOGON
。。。。
發現使用者MAPP_PLATFORM從8.14到8.31,一直以錯誤密碼嘗試登陸資料庫,驗證了以錯誤密碼登陸會導致row cache lock同時sql執行慢的現象發生。
 進一步查詢登入審計資訊:
select * from
(
select os_username,userhost,terminal,username,count(*) failures
    from dba_audit_trail
   where returncode = 1017 and timestamp between to_date ('2016-8-29 11:30:00','yyyy-mm-dd hh24:mi:ss') and to_date ('2016-8-29 13:30:00','yyyy-mm-dd hh24:mi:ss')
   group by os_username,userhost,username,terminal
   order by 5 desc );
returncode = 1017 這裡的1017就是oracle內部定義的錯誤返回程式碼值。

實際上不管使用者提供的密碼是否正確,Oracle都會為新的connection分配一個shadow process,該服務程式為了進一步驗證使用者資訊都不得不獲取少量資源;如果以此為契機,即便在無法正常登入的情況下依然能在短期內造成例項僵死。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/28497416/viewspace-2124662/,如需轉載,請註明出處,否則將追究法律責任。

相關文章