在生產環境中，我們可能會經常遇到CPU過度使用而影響系統效能或正常執行的問題。大多數情況下，系統的效能問題都是由不良SQL程式碼引起的，那麼作為DBA，怎樣發現和解決這些SQL問題就顯得尤為重要。

本文透過一個生產系統中的案例分析和處理過程分享，希望能夠給大家帶來一點借鑑作用。

問題描述:
業務及開發人員報告系統執行緩慢，已經影響業務系統正常使用。請求協助診斷。

1．使用vmstat檢查系統當前情況
首先登陸資料庫主機，檢查當前系統狀況。
使用vmstat檢查，發現CPU資源已經耗盡，大量任務位於執行佇列：

bash-2.03$ vmstat 3
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s6 s9 s1 sd in sy cs us sy id
0 0 0 5504232 1464112 0 0 0 0 0 0 0 0 1 1 0 4294967196 0 0 -84 -5 -145
131 0 0 5368072 1518360 56 691 0 2 2 0 0 0 1 0 0 3011 7918 2795 97 3 0
131 0 0 5377328 1522464 81 719 0 2 2 0 0 0 1 0 0 2766 8019 2577 96 4 0
130 0 0 5382400 1524776 67 682 0 0 0 0 0 0 0 0 0 3570 8534 3316 97 3 0
134 0 0 5373616 1520512 127 1078 0 2 2 0 0 0 1 0 0 3838 9584 3623 96 4 0
136 0 0 5369392 1518496 107 924 0 5 5 0 0 0 0 0 0 2920 8573 2639 97 3 0
132 0 0 5364912 1516224 63 578 0 0 0 0 0 0 0 0 0 3358 7944 3119 97 3 0
129 0 0 5358648 1511712 189 1236 0 0 0 0 0 0 0 0 0 3366 10365 3135 95 5 0
129 0 0 5354528 1511304 120 1194 0 0 0 0 0 0 0 4 0 3235 8864 2911 96 4 0

本案例當時，系統CPU資源已經耗盡，Idle為0，並且執行佇列大量程式排隊等待。

2．使用Top工具輔助診斷
透過Top工具，我們可以檢視程式CPU耗用情況，如果存在程式異常，可以透過Top定位，為進一步診斷提供依據。
對於本案例，觀察程式CPU耗用，發現沒有明顯過高CPU使用的程式。

$ top

last pid: 28313; load averages: 99.90, 117.54, 125.71 23:28:38
296 processes: 186 sleeping, 99 running, 2 zombie, 9 on cpu
CPU states: 0.0% idle, 96.5% user, 3.5% kernel, 0.0% iowait, 0.0% swap
Memory: 4096M real, 1404M free, 2185M swap in use, 5114M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
27082 oracle8i 1 33 0 1328M 1309M run 0:17 1.29% oracle
26719 oracle8i 1 55 0 1327M 1306M sleep 0:29 1.11% oracle
28103 oracle8i 1 35 0 1327M 1304M run 0:06 1.10% oracle
28161 oracle8i 1 25 0 1327M 1305M run 0:04 1.10% oracle
26199 oracle8i 1 45 0 1328M 1309M run 0:42 1.10% oracle
26892 oracle8i 1 33 0 1328M 1310M run 0:24 1.09% oracle
27805 oracle8i 1 45 0 1327M 1306M cpu/1 0:10 1.04% oracle
23800 oracle8i 1 23 0 1327M 1306M run 1:28 1.03% oracle
25197 oracle8i 1 34 0 1328M 1309M run 0:57 1.03% oracle
21593 oracle8i 1 33 0 1327M 1306M run 2:12 1.01% oracle

從Top的輸出中我們發現有大量程式處於running的執行狀態，CPU消耗很平均，單程式消耗大約在1%左右，我們基本可以排除個別程式異常導致CPU問題的可能。

3、檢查程式數量
對於一個生產資料庫系統，穩定執行的程式數量通常是可知的。
提示:對於穩定執行的生產系統，資料庫的執行狀況通常是穩定的，如果你繪製出效能曲線，你會發現每個星期的曲線幾乎是可以重合的，對資料庫系統的執行狀況及效能指標具有充分認識和了解是必須的。

我們看一下當前系統的程式數量，從而進行比較判斷：

發現此時系統存在大量Oracle程式,大約在300左右，大量程式消耗了幾乎所有CPU資源,而正常情況下Oracle連線數應該在100左右。

由此，我們可以基本判斷，是資料庫或應用出現問題，導致程式任務無法完成，不斷累積，從而出現大量佇列等待。

這些等待在資料庫中應該有具體的體現，接下來需要登陸資料庫進行檢查了。
4、登陸資料庫
我們判斷資料庫可能經歷了等待，那麼Oracle資料庫提供了相關檢視供我們查詢和發現問題，v$session_wait是首先值得我們關注的。

查詢v$session_wait獲取各程式等待事件：

SQL> select sid,event,p1,p1text from v$session_wait;

SID EVENT P1 P1TEXT
---------- ------------------------------ ---------- ----------------------------------------------------------------
124 latch free 1.6144E+10 address
1 pmon timer 300 duration
2 rdbms ipc message 300 timeout
。。。。。。
140 buffer busy waits 17 file#
66 buffer busy waits 17 file#
10 db file sequential read 17 file#
18 db file sequential read 17 file#
54 db file sequential read 17 file#
49 db file sequential read 17 file#
48 db file sequential read 17 file#
46 db file sequential read 17 file#
45 db file sequential read 17 file#
。。。。。。
234 db file sequential read 17 file#
233 db file sequential read 17 file#
230 db file sequential read 17 file#
333 db file sequential read 17 file#
330 db file scattered read 17 file#
310 db file scattered read 17 file#
。。。。。。。

244 rows selected.

對於本案例，我們發現存在大量db file scattered read及db file sequential read等待。顯然全表掃描等操作成為系統最嚴重的效能影響因素。

5、捕獲相關SQL

確定這些程式因為資料訪問產生了等待，我們考慮捕獲這些SQL以發現問題。
這裡用到了我的以下指令碼getsqlbysid.sql，該指令碼透過已知session的sid，聯合v$session、v$sqltext檢視，獲得相關session正在執行的完整的SQL語句。

SELECT sql_text
FROM v$sqltext a
WHERE a.hash_value = (SELECT sql_hash_value
FROM v$session b
WHERE b.SID = '&sid')
ORDER BY piece ASC
/

使用該指令碼,透過從v$session_wait中獲得的等待全表或索引掃描的程式SID，可以捕獲可能存在問題的sql語句:
SQL> @getsqlbysid
Enter value for sid: 18
old 5: where b.sid='&sid'
new 5: where b.sid='18'

SQL_TEXT
----------------------------------------------------------------
select i.vc2title,i.numinfoguid from hs_info i where i.intenab
ledflag = 1 and i.intpublishstate = 1 and i.datpublishdate <=
sysdate and i.numcatalogguid = 2047 order by i.datpublishdate d
esc, i.numorder desc

SQL> /
Enter value for sid: 54
old 5: where b.sid='&sid'
new 5: where b.sid='54'

SQL_TEXT
----------------------------------------------------------------
select i.vc2title,i.numinfoguid from hs_info i where i.intenab
ledflag = 1 and i.intpublishstate = 1 and i.datpublishdate <=
sysdate and i.numcatalogguid = 33 order by i.datpublishdate des
c, i.numorder desc
。。。。。。。。
對幾個程式進行跟蹤,分別得到以上SQL語句，這些SQL可能就是問題產生的根源。

使用該應用使用者連線，透過autotrace功能檢查以上SQL的執行計劃:
SQL> set autotrace trace explain
SQL> select i.vc2title,i.numinfoguid
2 from hs_info i where i.intenabledflag = 1
3 and i.intpublishstate = 1 and i.datpublishdate <=sysdate
4 and i.numcatalogguid = 3475
5 order by i.datpublishdate desc, i.numorder desc ;

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=228 Card=1 Bytes=106)
1 0 SORT (ORDER BY) (Cost=228 Card=1 Bytes=106)
2 1 TABLE ACCESS (FULL) OF 'HS_INFO' (Cost=218 Card=1 Bytes=106)

SQL> select count(*) from hs_info;

COUNT(*)
----------
227404

透過執行計劃，我們看到以上查詢使用了全表掃描，而該表這裡有22萬記錄，全表掃描已經不再適合。

我們檢查全表掃描的資料表，發現存在以下索引:
SQL> select index_name,index_type from user_indexes where table_name='HS_INFO';

INDEX_NAME INDEX_TYPE
------------------------------ ---------------------------
HSIDX_INFO1 FUNCTION-BASED NORMAL
HSIDX_INFO_SEARCHKEY DOMAIN
PK_HS_INFO NORMAL

檢查索引鍵值：

SQL> select index_name,column_name
2 from user_ind_columns where table_name ='HS_INFO';

INDEX_NAME COLUMN_NAME
------------------------------ --------------------
HSIDX_INFO1 NUMORDER
HSIDX_INFO1 SYS_NC00024$
HSIDX_INFO_SEARCHKEY VC2INDEXWORDS
PK_HS_INFO NUMINFOGUID

SQL> desc hs_info
Name Null? Type
--------------------------------- -------- --------------------------------------------
NUMINFOGUID NOT NULL NUMBER(15)
NUMCATALOGGUID NOT NULL NUMBER(15)
INTTEXTTYPE NOT NULL NUMBER(38)
VC2TITLE NOT NULL VARCHAR2(60)
VC2AUTHOR VARCHAR2(100)
NUMPREVINFOGUID NUMBER(15)
NUMNEXTINFOGUID NUMBER(15)
NUMORDER NOT NULL NUMBER(15)
。。。。。

6、建立新的索引以消除全表掃描

檢查發現在numcatalogguid欄位上並沒有索引,該欄位具有很好的區分度，考慮在該欄位建立索引以消除全表掃描。

SQL> create index hs_info_NUMCATALOGGUID on hs_info(NUMCATALOGGUID);

Index created.

SQL> set autotrace trace explain
SQL> select i.vc2title,i.numinfoguid
2 from hs_info i where i.intenabledflag = 1
3 and i.intpublishstate = 1 and i.datpublishdate <=sysdate
4 and i.numcatalogguid = 3475
5 order by i.datpublishdate desc, i.numorder desc ;

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=1 Bytes=106)
1 0 SORT (ORDER BY) (Cost=12 Card=1 Bytes=106)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'HS_INFO' (Cost=2 Card=1 Bytes=106)
3 2 INDEX (RANGE SCAN) OF 'HS_INFO_NUMCATALOGGUID' (NON-UNIQUE) (Cost=1 Card=1)

7、觀察系統狀況
原大量等待消失

SQL> select sid,event,p1,p1text from v$session_wait where event not like 'SQL%';

SID EVENT P1 P1TEXT
---------- ------------------------------ ---------- ----------------------------------------------------------------
1 pmon timer 300 duration
2 rdbms ipc message 300 timeout
3 rdbms ipc message 300 timeout
6 rdbms ipc message 180000 timeout
59 rdbms ipc message 6000 timeout
118 rdbms ipc message 6000 timeout
275 rdbms ipc message 30000 timeout
147 rdbms ipc message 6000 timeout
62 rdbms ipc message 6000 timeout
11 rdbms ipc message 30000 timeout
4 rdbms ipc message 300 timeout
305 db file sequential read 17 file#
356 db file sequential read 17 file#
19 db file scattered read 17 file#
5 smon timer 300 sleep time

15 rows selected.

在另外的session裡，持續觀察的CPU使用情況：

bash-2.03$ vmstat 3
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s6 s9 s1 sd in sy cs us sy id
34 0 0 5343016 1465416 44 386 77 0 0 0 0 0 0 0 0 3197 8486 2902 92 8 0
31 0 0 5331568 1459696 178 1491 122 0 0 0 0 0 0 3 0 3237 9461 3005 89 11 0
31 0 0 5317792 1453008 76 719 80 0 0 0 0 0 0 0 0 3292 8736 3025 93 7 0
31 2 0 5311144 1449552 235 1263 69 2 2 0 0 0 1 0 0 3473 9535 3357 88 12 0
25 0 0 5300240 1443920 108 757 18 2 2 0 0 0 1 1 0 2377 7876 2274 95 5 0
19 0 0 5295904 1441840 50 377 0 0 0 0 0 0 0 1 0 1915 6598 1599 98 1 0
----以上為建立索引之前部分
----以下為建立索引之後部分，CPU使用率恢復正常
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s6 s9 s1 sd in sy cs us sy id
0 0 0 4955872 1287136 737 6258 16 0 0 0 0 0 0 3 0 2890 11777 4432 44 12 44
1 0 0 4887888 1256464 809 6234 8 2 2 0 0 0 0 2 0 2809 12066 4247 45 12 43
0 0 0 4828912 1228200 312 2364 13 5 5 0 0 0 2 1 0 2410 6816 3492 38 6 57
0 0 0 4856816 1240168 8 138 0 0 0 0 0 0 1 0 0 2314 4026 3232 34 4 62
0 0 0 4874176 1247712 0 86 0 0 0 0 0 0 0 0 0 2298 3930 3324 35 2 63
2 0 0 4926088 1270824 34 560 0 0 0 0 0 0 0 0 0 2192 4694 2612 29 16 55
0 0 0 5427320 1512952 53 694 0 0 0 0 0 0 3 2 0 2443 5085 3340 33 12 55
0 0 0 5509120 1553136 0 37 0 0 0 0 0 0 0 0 0 2309 3908 3321 35 1 64

至此，此問題得以解決。

8、效能何以提高
回答這個問題似乎是多餘的,我只想重申一點:

有效的降低SQL的邏輯讀是SQL最佳化的基本原則之一。

我們來比較一下前後兩種執行方式的邏輯讀取及效能差異。

a. 全表掃描的效能
SQL> select i.vc2title,i.numinfoguid
2 from hs_info i where i.intenabledflag = 1
3 and i.intpublishstate = 1 and i.datpublishdate <=sysdate
4 and i.numcatalogguid = 3475
5 order by i.datpublishdate desc, i.numorder desc ;

352 rows selected.

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=541 Card=1 Bytes=106)
1 0 SORT (ORDER BY) (Cost=541 Card=1 Bytes=106)
2 1 TABLE ACCESS (FULL) OF 'HS_INFO' (Cost=531 Card=1 Bytes=106)

Statistics
----------------------------------------------------------
0 recursive calls
25 db block gets
3499 consistent gets
258 physical reads
0 redo size
…..
2 sorts (memory)
0 sorts (disk)
352 rows processed

b. 使用索引的效能
SQL> select i.vc2title,i.numinfoguid
2 from hs_info i where i.intenabledflag = 1
3 and i.intpublishstate = 1 and i.datpublishdate <=sysdate
4 and i.numcatalogguid = 3475
5 order by i.datpublishdate desc, i.numorder desc;

352 rows selected.

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=1 Bytes=106)
1 0 SORT (ORDER BY) (Cost=12 Card=1 Bytes=106)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'HS_INFO' (Cost=2 Card=1 Bytes=106)
3 2 INDEX (RANGE SCAN) OF 'HS_INFO_NUMCATALOGGUID' (NON-UNIQUE) (Cost=1 Card=1)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
89 consistent gets
0 physical reads
0 redo size
….
1 sorts (memory)
0 sorts (disk)
352 rows processed

consistent gets從3499 到89,我們看到效能得到了巨大的提高.

結語

通常,開發人員很少注意SQL程式碼的效率,他們更著眼於功能的實現。
至於效能問題通常被認為是次要的,而且在應用系統開發初期,由於資料庫資料量較少,對於查詢SQL語句等,不容易體會出各種SQL句法的效能差異。

但是一旦這些應用作為生產系統上線執行,隨著資料庫中資料量的增加,大量併發訪問,系統的響應速度可能就會成為系統需要解決的最主要的問題之一。

在少量使用者下效能可以接受的SQL，可能在大量使用者併發的條件下就會成為效能瓶頸。

在我這個案例中，開發人員很難相信僅只一條SQL語句就導致了整個資料庫的效能下降。然而事實就是如此，一條低效的SQL語句就可能毀掉你的資料庫，所以在系統設計及開發過程中，你必須考慮到諸多細節，嚴格的測試也是提早發現問題的有效方法。
如果不幸以上環節都被忽略，那麼，DBA(也許就是你)就是最後的一環，你必須能夠快速的診斷並解決各種複雜問題。

[@more@]

捕獲問題SQL解決過度CPU消耗問題-- 轉載

相關文章