客戶資料庫出現大量cache buffer chains latch
客戶產品資料庫上午出現了嚴重的效能問題,簡單記錄一下問題的診斷和解決過程。
可以看到,資料庫的DB TIME已經漲到了非常高的地步,這說明系統正經受著非常嚴重的效能問題:
Snap Id |
Snap Time |
Sessions |
Cursors/Session |
|
Begin Snap: |
7170 |
31-3月 -12 08:00:50 |
395 |
10.5 |
End Snap: |
7172 |
31-3月 -12 09:00:27 |
689 |
11.7 |
Elapsed: |
59.61 (mins) |
|||
DB Time: |
17,270.93 (mins) |
從TOP 5看,等待事件中最明顯的是latch: cache buffers chains,從當前系統的狀態也可以看到這一點:
SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;
EVENT
COUNT(*)
---------------------------------------------------------------- ----------
latch: cache buffers chains
1043
SQL*Net message from client
500
SQL*Net message to client
35
rdbms ipc message
29
gc cr request
25
latch free
13
gcs remote message
10
gc buffer busy
7
8 rows selected.
系統中部分SQL的執行時間已經長達幾個小時:
SQL ordered by Elapsed Time
Elapsed Time (s) |
CPU Time (s) |
Executions |
Elap per Exec (s) |
% Total DB Time |
SQL Id |
SQL Module |
SQL Text |
341,821 |
11,917 |
112 |
3051.97 |
32.99 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
244,752 |
3,380 |
147 |
1664.98 |
23.62 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
52,419 |
11,331 |
3 |
17472.91 |
5.06 |
|
JDBC Thin Client |
select cf.cardid, (CASE WHEN c... |
38,767 |
532 |
11 |
3524.24 |
3.74 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
37,146 |
2,524 |
2 |
18573.15 |
3.58 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
30,796 |
2,331 |
5 |
6159.20 |
2.97 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
29,991 |
2,506 |
1 |
29991.41 |
2.89 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s... |
24,762 |
875 |
3 |
8254.06 |
2.39 |
|
JDBC Thin Client |
SELECT count(*) total FROM (SE... |
17,845 |
699 |
2 |
8922.50 |
1.72 |
|
JDBC Thin Client |
SELECT count(*) total FROM (SE... |
17,470 |
1,295 |
0 |
1.69 |
|
JDBC Thin Client |
select * from (select aa.*, ro... |
其實根據這些現象,基本上可以判斷問題了。肯定是SQL執行計劃的改變導致了當前的效能問題。而類似繫結變數窺探之類的問題只會影響個別的SQL,而這麼大面積的執行計劃的改變,幾乎可以斷定是統計資訊造成的。
詢問了一下客戶最近的操作,原來昨天夜裡透過資料庫鏈的方式匯入了一部分資料。而今天一早問題就出現了。
其實問題已經很明顯了,在透過資料庫鏈載入資料後,並沒有重新收集統計資訊,且由於載入時間是在半夜,這批資料也沒有趕上每天22:00的統計資訊自動收集的工作。這就使得Oracle在生成執行計劃時,會依賴資料載入前的統計資訊,從而造成了錯誤的執行計劃。
首先解決問題的根源問題,對載入過資料的表重新收集統計:
SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE WNER = 'USER1' AND LAST_ANALYZED < TRUNC(SYSDATE);
'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
-----------------------------------------------------------------------------------------
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
.
.
.
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
12 rows selected.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_1', CASCADE => TRUE)
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_2', CASCADE => TRUE)
PL/SQL procedure successfully completed.
.
.
.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_3', CASCADE => TRUE)
PL/SQL procedure successfully completed.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_4', CASCADE => TRUE)
PL/SQL procedure successfully completed.
雖然問題的根源已經被解決,但是當前執行的SQL並不會改變執行計劃,因此還需要找到問題的SQL,從後臺將其結束:
SQL> SELECT 'kill -9 ' || spid from v$session s, v$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains';
'KILL-9'||SPID
--------------------
kill -9 28321
kill -9 25384
kill -9 23697
kill -9 7239
.
.
.
kill -9 9331
kill -9 13759
740 rows selected.
SQL> SELECT 'kill -9 ' || spid from gv$session s, gv$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains' and s.inst_id = p.inst_id and s.inst_id = 2;
'KILL-9'||SPID
--------------------
kill -9 23992
kill -9 5289
kill -9 21067
kill -9 16816
kill -9 16820
kill -9 26767
.
.
.
kill -9 14981
kill -9 26678
kill -9 26682
258 rows selected.
分別在兩個節點殺掉這些執行計劃存在問題的會話,釋放被大量佔用的系統資源。
由於Oracle的執行計劃並非在收集統計資訊後馬上生效,因此還有個別的SQL仍然沿用錯誤的執行計劃:
SQL> select distinct inst_id, sql_id from gv$session where event = 'latch: cache buffers chains';
INST_ID SQL_ID
---------- -------------
1 39gvg7vbcm8jx
1 a6aqkm30u7p90
SQL> select address, hash_value from v$sqlarea where sql_id = 'a6aqkm30u7p90';
ADDRESS HASH_VALUE
---------------- ----------
C000000EB7ED3420 3248739616
SQL> exec dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')
PL/SQL procedure successfully completed.
SQL> select address, hash_value from v$sqlarea where sql_id ='39gvg7vbcm8jx';
ADDRESS HASH_VALUE
---------------- ----------
C000001037B8E308 3603538493
SQL> exec dbms_shared_pool.purge('C000001037B8E308’, ‘3603538493', 'C')
PL/SQL procedure successfully completed.
由於當前的資料庫版本是10.2.0.5,因此可以很方便的使用dbms_shared_pool將執行計劃錯誤的SQL清除出共享池,強制其重新生成執行計劃。
SQL> select event, count(*) from gv$session where username like != user GROUP BY EVENT order by 2 desc;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message from client 370
SQL*Net message to client 15
gc cr request 10
latch free 4
Streams AQ: waiting for messages in the queue 1
5 rows selected.
資料庫中SQL執行計劃錯誤除了導致大量的latch: cache buffers chains等待以外,還存在gc cr request和latch free等這些等待事件,經分析同樣是由於錯誤的執行計劃所致。將這些會話採用相同的方法清除後,系統負載恢復到正常範圍:
Snap Id |
Snap Time |
Sessions |
Cursors/Session |
|
Begin Snap: |
7188 |
31-3月 -12 14:55:17 |
257 |
20.3 |
End Snap: |
7189 |
31-3月 -12 15:03:07 |
256 |
19.9 |
Elapsed: |
7.84 (mins) |
|||
DB Time: |
70.46 (mins) |
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-720352/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- Latch: cache buffer chains (%)AI
- cbc latch或cache buffer chains latch系列一AI
- buffer cache實驗6-latch:cache buffers lru chainsAI
- CACHE BUFFER CHAINSAI
- latch: cache buffers chainsAI
- Oracle Cache Buffer ChainsOracleAI
- latch:cache buffers chains案例AI
- buffer busy waits, latch cache buffers chains, read by other session區別AISession
- 深入理解latch: cache buffers chainsAI
- Trouble shooting latch: cache buffers chainsAI
- ORACLE等待事件latch: cache buffers chainsOracle事件AI
- 等待事件_cache_buffers_chains_latch事件AI
- Cache Buffer Chain Latch等待事件AI事件
- latch:cache buffers chains解決步驟AI
- latch:cache buffers chains的優化思路AI優化
- cr塊和latch buffer cache chainAI
- cache buffer chain latch只讀共享?AI
- cache buffer lru chain latch等待事件AI事件
- 【BUFFER】Oracle buffer cache之 latch 學習記錄Oracle
- buffer cache實驗5-latch:cache buffers chainAI
- latch:cache buffers chains的最佳化思路AI
- latch: cache buffers chains---AWR實戰分析AI
- latch free 中 cache buffer chain 的整理AI
- buffer cache與相關的latch等待事件事件
- latch: cache buffers chains故障處理總結(轉載)AI
- 解決一例latch:cache buffers chains小記AI
- cache buffer chain latch可以以只讀模式共享AI模式
- 一次latch cache buffers chains問題的處理AI
- 處理 latch_cache_buffers_chains等待事件一例AI事件
- cache buffers chains and cache buffers lru chainsAI
- 關閉資料庫出現PMON failed to acquire latch資訊資料庫AIUI
- 核心資料庫由於大量的latch free導致CPU資源耗盡的現場資料庫
- latch: cache buffers chains-熱塊的簡單模擬實驗AI
- oracle實驗記錄 (buffer_cache分析(2)cbc latch)Oracle
- 【原創】cache buffer chains的一次解決過程AI
- 客戶資料庫升級後出現ORA-30004錯誤資料庫
- Buffer Cache Size(資料緩衝區)
- Bug 3797171 cache buffers chains latch contention increased in 10g-3797171.8AI