Massive latch: cache buffers chains contention on a customer database

Posted by yangtingkun on 2012-04-03

A customer's production database hit a serious performance problem this morning. Here is a brief record of how the problem was diagnosed and resolved.


As the AWR snapshot shows, the database's DB Time had climbed to an extremely high level: 17,270.93 minutes of DB Time accumulated in a 59.61-minute window, roughly 290 average active sessions, which means the system was suffering a very serious performance problem:

             Snap Id  Snap Time          Sessions  Cursors/Session
Begin Snap:     7170  31-3-12 08:00:50        395             10.5
End Snap:       7172  31-3-12 09:00:27        689             11.7
Elapsed:              59.61 (mins)
DB Time:              17,270.93 (mins)


Looking at the Top 5 Timed Events, the most prominent wait is latch: cache buffers chains, and the current state of the system shows the same thing:

SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
latch: cache buffers chains                                            1043
SQL*Net message from client                                             500
SQL*Net message to client                                                35
rdbms ipc message                                                        29
gc cr request                                                            25
latch free                                                               13
gcs remote message                                                       10
gc buffer busy                                                            7

8 rows selected.
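Before going further, it can help to confirm which objects sit behind the contended latches. The following is a minimal sketch of the standard drill-down rather than a step from the original diagnosis: it assumes SYS access (x$bh is a fixed table), and the top-10 cutoff is arbitrary:

-- Find the hottest cache buffers chains child latches by sleeps,
-- then the objects whose buffers hash to those chains
-- (x$bh.hladdr is the address of the protecting child latch).
SELECT o.object_name, COUNT(*) buffers
  FROM x$bh b,
       (SELECT addr
          FROM (SELECT addr
                  FROM v$latch_children
                 WHERE name = 'cache buffers chains'
                 ORDER BY sleeps DESC)
         WHERE ROWNUM <= 10) l,
       dba_objects o
 WHERE b.hladdr = l.addr
   AND b.obj = o.data_object_id
 GROUP BY o.object_name
 ORDER BY buffers DESC;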

Some SQL statements in the system had been running for several hours:

SQL ordered by Elapsed Time

Elapsed Time (s)  CPU Time (s)  Executions  Elap per Exec (s)  % Total DB Time  SQL Module        SQL Text
         341,821        11,917         112            3051.97            32.99  JDBC Thin Client  SELECT count(*) total FROM ( s...
         244,752         3,380         147            1664.98            23.62  JDBC Thin Client  SELECT count(*) total FROM ( s...
          52,419        11,331           3           17472.91             5.06  JDBC Thin Client  select cf.cardid, (CASE WHEN c...
          38,767           532          11            3524.24             3.74  JDBC Thin Client  SELECT count(*) total FROM ( s...
          37,146         2,524           2           18573.15             3.58  JDBC Thin Client  SELECT count(*) total FROM ( s...
          30,796         2,331           5            6159.20             2.97  JDBC Thin Client  SELECT count(*) total FROM ( s...
          29,991         2,506           1           29991.41             2.89  JDBC Thin Client  SELECT count(*) total FROM ( s...
          24,762           875           3            8254.06             2.39  JDBC Thin Client  SELECT count(*) total FROM (SE...
          17,845           699           2            8922.50             1.72  JDBC Thin Client  SELECT count(*) total FROM (SE...
          17,470         1,295           0                                1.69  JDBC Thin Client  select * from (select aa.*, ro...
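The same ranking can also be pulled live from the shared pool rather than waiting for an AWR report. A minimal sketch against v$sqlarea (figures are cumulative since each cursor was loaded, and the time columns are in microseconds, hence the division):

-- Top 10 statements by cumulative elapsed time currently in the
-- shared pool; elapsed_time and cpu_time are in microseconds.
SELECT *
  FROM (SELECT sql_id,
               ROUND(elapsed_time / 1e6) elapsed_s,
               ROUND(cpu_time / 1e6)     cpu_s,
               executions,
               module
          FROM v$sqlarea
         ORDER BY elapsed_time DESC)
 WHERE ROWNUM <= 10;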


These symptoms are really enough to call the problem: a change in SQL execution plans is driving the slowdown. Something like bind variable peeking would only affect individual statements, so a plan change this widespread almost certainly comes from the statistics.

Asking the customer about recent activity revealed that a batch of data had been loaded over a database link the previous night, and the problem appeared first thing this morning.

At this point the cause was clear: after the data was loaded over the database link, statistics were never regathered, and because the load ran in the middle of the night, the new data also missed the daily 22:00 automatic statistics-gathering job. When generating execution plans, Oracle was therefore relying on pre-load statistics, which produced the wrong plans.
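One quick way to confirm this picture is dba_tab_modifications, which records DML volume since the last statistics gather. A minimal sketch, reusing the USER1 schema name from below; the in-memory monitoring data has to be flushed first:

-- Flush in-memory DML monitoring data, then list USER1 tables with
-- changes recorded since their statistics were last gathered.
EXEC DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

SELECT table_name, inserts, updates, deletes, timestamp
  FROM dba_tab_modifications
 WHERE table_owner = 'USER1'
 ORDER BY inserts DESC;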

First, deal with the root cause by regathering statistics on the tables that received the load:

SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED < TRUNC(SYSDATE);

'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
-----------------------------------------------------------------------------------------
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
.
.
.
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)

12 rows selected.
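Instead of pasting the generated lines back by hand, they can be spooled into a script and executed directly. A minimal SQL*Plus sketch (the script name is arbitrary):

SET HEADING OFF PAGESIZE 0 FEEDBACK OFF TRIMSPOOL ON
SPOOL gather_stats.sql
SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || table_name ||
       ''', CASCADE => TRUE)'
  FROM dba_tables
 WHERE owner = 'USER1'
   AND last_analyzed < TRUNC(SYSDATE);
SPOOL OFF
@gather_stats.sql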

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)

PL/SQL procedure successfully completed.

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)

PL/SQL procedure successfully completed.

.
.
.

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)

PL/SQL procedure successfully completed.

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)

PL/SQL procedure successfully completed.
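When many tables are involved, an alternative is to let DBMS_STATS pick out the stale objects itself. A minimal sketch, assuming table monitoring is active (the 10g default); no_invalidate => FALSE is added so dependent cursors are invalidated immediately rather than over Oracle's default rolling window, which is exactly the lag dealt with below:

BEGIN
  -- Regather statistics only for USER1 tables flagged stale by DML
  -- monitoring, cascading to their indexes; invalidate dependent
  -- cursors right away instead of over the auto-invalidation window.
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname       => 'USER1',
    options       => 'GATHER STALE',
    cascade       => TRUE,
    no_invalidate => FALSE);
END;
/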

Although the root cause was now fixed, SQL already executing does not switch plans mid-flight, so the problem sessions still had to be identified and terminated from the server side:

SQL> SELECT 'kill -9 ' || spid from v$session s, v$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains';

'KILL-9'||SPID
--------------------
kill -9 28321
kill -9 25384
kill -9 23697
kill -9 7239
.
.
.
kill -9 9331
kill -9 13759

740 rows selected.

SQL> SELECT 'kill -9 ' || spid from gv$session s, gv$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains' and s.inst_id = p.inst_id and s.inst_id = 2;

'KILL-9'||SPID
--------------------
kill -9 23992
kill -9 5289
kill -9 21067
kill -9 16816
kill -9 16820
kill -9 26767
.
.
.
kill -9 14981
kill -9 26678
kill -9 26682

258 rows selected.

These sessions running on bad plans were killed on each of the two nodes, releasing the system resources they had been monopolizing.
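kill -9 on the server processes works here because these are dedicated server connections; a database-side alternative is to generate ALTER SYSTEM KILL SESSION commands instead. A minimal sketch with the same predicates (on 10g the command only acts on the local instance, hence the instance filter):

-- Generate kill commands for the stuck sessions on this instance;
-- IMMEDIATE returns control without waiting for rollback to finish.
SELECT 'ALTER SYSTEM KILL SESSION ''' || sid || ',' || serial# ||
       ''' IMMEDIATE;'
  FROM gv$session
 WHERE username = 'USER2'
   AND event = 'latch: cache buffers chains'
   AND inst_id = USERENV('INSTANCE');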

Because freshly gathered statistics do not take effect for existing cursors right away, a few individual statements were still running on the bad plans:

SQL> select distinct inst_id, sql_id from gv$session where event = 'latch: cache buffers chains';

INST_ID SQL_ID
---------- -------------
         1 39gvg7vbcm8jx
         1 a6aqkm30u7p90

SQL> select address, hash_value from v$sqlarea where sql_id = 'a6aqkm30u7p90';

ADDRESS HASH_VALUE
---------------- ----------
C000000EB7ED3420 3248739616

SQL> exec dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')

PL/SQL procedure successfully completed.

SQL> select address, hash_value from v$sqlarea where sql_id = '39gvg7vbcm8jx';

ADDRESS HASH_VALUE
---------------- ----------
C000001037B8E308 3603538493

SQL> exec dbms_shared_pool.purge('C000001037B8E308,3603538493', 'C')

PL/SQL procedure successfully completed.

Since the database is on 10.2.0.5, dbms_shared_pool.purge can be used to conveniently evict the badly-planned SQL from the shared pool, forcing a fresh execution plan on the next parse.
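The address/hash_value lookup and the purge can also be folded into one PL/SQL block. A minimal sketch for a single sql_id (the RAW address concatenates into the hex 'address,hash_value' name dbms_shared_pool.purge expects for a cursor):

DECLARE
  v_name VARCHAR2(64);
BEGIN
  -- Build the 'address,hash_value' name for the cursor and purge it,
  -- forcing a hard parse on the next execution.
  SELECT address || ',' || hash_value
    INTO v_name
    FROM v$sqlarea
   WHERE sql_id = '39gvg7vbcm8jx';

  DBMS_SHARED_POOL.PURGE(v_name, 'C');
END;
/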

SQL> select event, count(*) from gv$session where username != user group by event order by 2 desc;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message from client                                             370
SQL*Net message to client                                                15
gc cr request                                                            10
latch free                                                                4
Streams AQ: waiting for messages in the queue                             1

5 rows selected.

Besides the massive latch: cache buffers chains waits, the bad execution plans also produced wait events such as gc cr request and latch free, which analysis traced back to the same wrong plans. After clearing those sessions in the same way, the system load returned to the normal range:

             Snap Id  Snap Time          Sessions  Cursors/Session
Begin Snap:     7188  31-3-12 14:55:17        257             20.3
End Snap:       7189  31-3-12 15:03:07        256             19.9
Elapsed:              7.84 (mins)
DB Time:              70.46 (mins)
