awr中DB CPU過低的原因分析

dbhelper發表於2015-01-31
前幾天在做巡檢的時候發現有個庫的負載在某一個時間段內極高,高達100倍。一個10分鐘的awr報告,得到的db time 卻有1000分鐘。
Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 27745 13-Jan-15 06:00:02 3500 6.4
End Snap: 27746 13-Jan-15 06:10:03 3512 6.4
Elapsed:   10.01 (mins)    
DB Time:   1,010.21 (mins)    
發生的時間段是早上6點左右。想必不大可能是人為操作的原因。
但是檢視time model的時候卻讓人感覺很奇怪。
Time Model Statistics
Statistic Name Time (s) % of DB Time
sql execute elapsed time 7,303.18 12.05
DB CPU 1,233.88 2.04
PL/SQL execution elapsed time 111.37 0.18
parse time elapsed 14.47 0.02
inbound PL/SQL rpc elapsed time 11.08 0.02
RMAN cpu time (backup/restore) 5.57 0.01
hard parse elapsed time 4.08 0.01
connection management call elapsed time 2.41 0.00
sequence load elapsed time 2.03 0.00
repeated bind elapsed time 0.03 0.00
failed parse elapsed time 0.02 0.00
hard parse (sharing criteria) elapsed time 0.01 0.00
hard parse (bind mismatch) elapsed time 0.01 0.00
PL/SQL compilation elapsed time 0.00 0.00
DB time 60,612.63  
background elapsed time 1,770.76  
background cpu time 40.97  

OS級別的統計資訊如下。
Statistic Value End Value
BUSY_TIME 197,613  
IDLE_TIME 4,611,782  
IOWAIT_TIME 10,540  
NICE_TIME 0  
SYS_TIME 34,788  
USER_TIME 150,433  
LOAD 10 1

Host CPU (CPUs: 80 Cores: 40 Sockets: 4)

Load Average Begin Load Average End %User %System %WIO %Idle
9.92 10.06 3.1 0.7 0.2 95.9

Instance CPU

%Total CPU %Busy CPU %DB time waiting for CPU (Resource Manager)
2.7 64.5 0.0

OS Busy% = 197613/(197613+4611782) = 25%
Inst CPU% (using DB CPU) = 1234*100/ (197613+4611782)=2.57%

檢視等待事件的時候發現
Top 5 Timed Foreground Events
Event Waits Time(s) Avg wait (ms) % DB time Wait Class
db file sequential read 1,286,918 4,395 3 7.25 User I/O
DB CPU   1,234   2.04  
direct path read 68,781 870 13 1.43 User I/O
SQL*Net more data from dblink 2,860 438 153 0.72 Network
db file parallel read 15,649 315 20 0.52 User I/O

Inst CPU% (using db file sequential read) = 4395*1000/ (197613+4611782) = 9% 

可以很明顯的看到top 5的等待事件只佔用了將近10%左右的DB time,剩下的90%左右的到哪去了呢?

DB CPU+Foreground events=1234+4395+870+438+315=7252 而DB time=1010.21*60=60612, 差別實在是太大了。
對於這個問題,需要首先關注的就是sql部分,DB time是會把程式等待的時間計算進去的,但是DB cpu不會。在短短的這十分鐘裡看來還是有不少的文章的。
檢視sql的統計情況時,可以明顯看到executions為0的sql語句。但是佔用的總體比例看起來很低,是因為這部分的時間沒有算到DB cpu裡面去。
Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
602.35 0   0.99 2.39 98.75 JDBC Thin Client select /*+ index(s SERVICE_AGR...
528.06 0   0.87 2.48 98.51 SQL Developer select bc.charge_seq_no, bc.se...
460.43 1 460.43 0.76 1.78 5.04   DECLARE job BINARY_INTEGER := ...
460.36 1 460.36 0.76 1.78 5.03   create table RATED_EVENT NOLOG...
458.74 1 458.74 0.76 13.08 86.54 JDBC Thin Client SELECT CD.TARGET_PCN, SE.AGREE...
446.97 1 446.97 0.74 11.19 90.35 PL/SQL Developer declare cnt number; begin cnt ...
446.97 1 446.97 0.74 12.18 89.21 PL/SQL Developer declare cnt number; begin cnt ...
446.97 1 446.97 0.74 11.28 90.19 PL/SQL Developer declare cnt number; begin cnt ...
446.94 1 446.94 0.74 11.54 89.88 PL/SQL Developer declare cnt number; begin cnt ...
378.29 8 47.29 0.62 40.37 57.10 JDBC Thin Client /* AR9_AR9GetPaymentSMS_select...

繼續檢視其它的部分,就會發現executions為0的sql遠不止那兩個。
Buffer Gets Executions Gets per Exec %Total Elapsed Time (s) %CPU %IO SQL Id SQL Module SQL Text
20,877,607 304 68,676.34 15.86 146.80 99.85 0.00 JDBC Thin Client SELECT COUNT(1) FROM (/* null_...
15,363,930 5 3,072,786.00 11.67 10.84 102.02 0.00 1hg2wcuapy3y3 JDBC Thin Client select bl1_run_request.run_mod...
7,445,354 8 930,669.25 5.66 378.29 40.37 57.10 JDBC Thin Client /* AR9_AR9GetPaymentSMS_select...
6,700,190 53 126,418.68 5.09 48.73 99.88 0.00 JDBC Thin Client /* null_CL9CommActivity_update...
6,491,924 1 6,491,924.00 4.93 446.97 12.18 89.21 PL/SQL Developer declare cnt number; begin cnt ...
5,813,177 1 5,813,177.00 4.42 446.94 11.54 89.88 PL/SQL Developer declare cnt number; begin cnt ...
5,350,250 1 5,350,250.00 4.06 446.97 11.19 90.35 PL/SQL Developer declare cnt number; begin cnt ...
5,305,308 1 5,305,308.00 4.03 446.97 11.28 90.19 PL/SQL Developer declare cnt number; begin cnt ...
5,029,500 1 5,029,500.00 3.82 458.74 13.08 86.54 JDBC Thin Client SELECT CD.TARGET_PCN, SE.AGREE...
4,440,670 1 4,440,670.00 3.37 302.84 12.34 86.66 JDBC Thin Client SELECT CD.TARGET_PCN, SE.AGREE...
4,424,502 35 126,414.34 3.36 31.27 99.87 0.00 JDBC Thin Client /* null_CL9CommActivity_select...
4,179,264 5,709 732.05 3.17 17.78 95.09 3.50 JDBC Thin Client /* AR9_SearchInvoices_searchIn...
4,162,849 5,631 739.27 3.16 46.95 40.39 59.88 JDBC Thin Client SELECT COUNT(1) FROM (/* AR9_S...
4,101,017 280,418 14.62 3.12 358.90 9.67 91.72 PL/SQL Developer UPDATE TABLE_BPM_STEP_INST SET...
3,916,711 272,996 14.35 2.98 347.54 9.69 91.71 PL/SQL Developer UPDATE TABLE_BPM_STEP_INST SET...
3,806,950 268,600 14.17 2.89 334.82 9.89 91.30 PL/SQL Developer UPDATE TABLE_BPM_STEP_INST SET...
3,806,852 268,649 14.17 2.89 334.17 9.79 91.54 PL/SQL Developer UPDATE TABLE_BPM_STEP_INST SET...
3,792,292 31 122,332.00 2.88 177.30 13.10 85.91 JDBC Thin Client /* null_CL9CrdMntrExtract_sele...
2,563,257 33 77,674.45 1.95 65.68 8.82 91.91 JDBC Thin Client SELECT /*+ ordered use_nl (bl1...
2,552,705 33 77,354.70 1.94 4.19 99.46 0.17 JDBC Thin Client SELECT COUNT(*) FROM (SELECT /...
2,388,043 0   1.81 81.54 16.81 85.32 PL/SQL Developer SELECT /*+ PARALELL(cl1_coll_e...
2,159,680 389 5,551.88 1.64 7.30 100.02 0.00 JDBC Thin Client /* MSD_SubLogSelect_selectNonQ...
1,967,921 32,264 60.99 1.49 7.24 97.49 0.00 JDBC Thin Client /* BL_LogicalDate_selectOnLine...
1,893,601 0   1.44 93.11 13.01 89.11 PL/SQL Developer SELECT /*+ PARALELL(cl1_coll_e...
1,777,723 0   1.35 528.06 2.48 98.51 SQL Developer select bc.charge_seq_no, bc.se...
1,540,694 0   1.17 106.64 10.81 91.84 PL/SQL Developer SELECT /*+ PARALELL(cl1_coll_e...
1,495,550 0   1.14 105.95 10.76 91.94 PL/SQL Developer SELECT /*+ PARALELL(cl1_coll_e...
1,319,463 30 43,982.10 1.00 5.39 99.97 0.00 Ac1FtcManager@ccbstbpr1 (TNS V1-V3) /* */ UPDATE /*+ INDEX (AC1_CO...
同時這部分的資訊都指向了一個program,都是透過客戶端pl/sql Deverloper或者SQL Developer連線的。
檢視相應的sql都使用了並行,並行度都在4以上,這樣問題似乎就清晰很多了。這部分executions為0的語句透過客戶端程式連線,手工啟用了並行,這些時間都沒有計算進DB CPU中。
稍後和現場,開發部門的同事進行了溝通和確認,他們說在昨天有一個嚴重的問題,最後由開發提供了指令碼進行了修復。
因為持續的時間較短,只有10分鐘,所以對系統的負載沒有造成嚴重的影響,最後就在內部做一個總結和分析。

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/8494287/viewspace-1422005/,如需轉載,請註明出處,否則將追究法律責任。

相關文章