充值平臺資料庫最佳化一例

comebackdog發表於2011-11-04
環境:
os:CentOS release 5.4 (Final)
db:Orcle 10g 10.2.0.4.0
 
巡檢發現從週三開始充值平臺資料庫的cpu使用率增長了以前的1到2倍,系統負載也是相同情況的增加,雖然伺服器負載一直不大但是我覺得有必要檢查下資料庫是否存在一些效能問題。
該資料庫網路限制無法連線EM透過web平臺檢視,sqlplus登入手動生成awr和addm報告。
1、生成報告
a、登入資料庫
[oracle@chongzhi admin]$ sqlplus / as sysdba
b、執行awrrpt.sql指令碼生產awr報告
SQL> @/u01/app/oracle/product/10.2.0/rdbms/admin/awrrpt.sql
c、執行addmrpt.sql指令碼生成addm報告
SQL> @/u01/app/oracle/product/10.2.0/rdbms/admin/addmrpt.sql
ps:/app/oracle/product/10.2.0/rdbms/admin/  目錄存放著oracle自帶的很多有用的sql指令碼
2、檢視報告
a、檢視awr報告發現db file scattered read和read by other session等待時間比較嚴重
Top 5 Timed Events
Event                   Waits           Time(s) Avg Wait(ms) % Total Call Time Wait Class
CPU time                               5,896                            94.3  
db file scattered read  37,034,761     1,229      0                     19.7    User I/O
read by other session   10,619,767     407        0                      6.5    User I/O
db file scattered read是表示太多全表掃描
read by other session 是第一次遇到,如下是網上查到解釋:
read by other session Definition: When information is requested from the database, Oracle will first read the data from disk into the database buffer cache. If two or more sessions request the same information, the first session will read the data into the buffer cache while other sessions wait.
In previous versions this wait was classified under the “buffer busy waits” event.
However, in Oracle 10.1 and higher this wait time is now broken out into the “read by other session” wait event. Excessive waits for this event are typically due to several processes repeatedly reading the same blocks, e.g. many sessions scanning the same index or performing full-table scans on the same table. Tuning this issue is a matter of finding and eliminating this contention.
Confio concludes with a summary that “read by other session waits” are very similar to buffer busy waits
When a session waits on the “read by other session” event, it indicates a wait for another session to read the data from disk into the Oracle buffer cache.
If this happens too often the performance of the query or the entire database can suffer. Typically this is caused by contention for “hot” blocks or objects so it is imperative to find out which data is being contended for. Once that is known this document lists several alternative methods for solving the issue.
簡單點說就是存在熱快(讀)的爭用,一定有一個或者數個資料庫物件被爭用嚴重。
檢視awr報告發現物理和邏輯讀取的段最多的段是表CZ_INSTANCE_MESSAGE
Segments by Logical Reads
Total Logical Reads: 847,901,296
Captured Segments account for 99.8% of Total
Owner         Tablespace Name Object Name Subobject Name Obj. Type Logical Reads %Total
ZJHC_CHONGZHI  CHONGZHI_DATA    CZ_INSTANCE_MESSAGE           TABLE 428,135,472 50.49
ZJHC_CHONGZHI  CHONGZHI_DATA    CZ_INSTANCE                   TABLE 397,794,832 46.92
Segments by Physical Reads
Total Physical Reads: 337,490,135
Captured Segments account for 100.0% of Total
Owner Tablespace Name Object Name Subobject Name Obj. Type Physical Reads %Total
ZJHC_CHONGZHI   CHONGZHI_DATA    CZ_INSTANCE_MESSAGE   TABLE 333,501,208 98.82
b、檢視addm報告發現id為ayhmtabsvpdgc的sql語句佔用大量的資料庫、cpu和I/O時間,平均執行時間1.5秒
RECOMMENDATION 3: SQL Tuning, 25% benefit (1541 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "ayhmtabsvpdgc".
         RELEVANT OBJECT: SQL statement with SQL_ID ayhmtabsvpdgc and
         PLAN_HASH 2559540153
         select * from (
         select id,error_message,error_stacktrace,ch_req_message,ch_resp_messa
         ge,co_req_message,co_resp_message,async_req_message,async_resp_messag
         e,create_date,modify_date,sts,instance_id from cz_instance_message
         WHERE
         instance_id  = :1
         order by id desc
         )
         WHERE
         rownum  <= :2
      RATIONALE: SQL statement with SQL_ID "ayhmtabsvpdgc" was executed 796
         times and had an average elapsed time of 1.5 seconds.
        
RECOMMENDATION 3: SQL Tuning, 25% benefit (1541 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "ayhmtabsvpdgc".
         RELEVANT OBJECT: SQL statement with SQL_ID ayhmtabsvpdgc and
         PLAN_HASH 2559540153
         select * from (
         select id,error_message,error_stacktrace,ch_req_message,ch_resp_messa
         ge,co_req_message,co_resp_message,async_req_message,async_resp_messag
         e,create_date,modify_date,sts,instance_id from cz_instance_message
         WHERE
         instance_id  = :1
         order by id desc
         )
         WHERE
         rownum  <= :2
      RATIONALE: SQL statement with SQL_ID "ayhmtabsvpdgc" was executed 796
         times and had an average elapsed time of 1.5 seconds.
      RATIONALE: Average CPU used per execution was 1.3 seconds.
     
RECOMMENDATION 3: SQL Tuning, 25% benefit (1541 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "ayhmtabsvpdgc".
         RELEVANT OBJECT: SQL statement with SQL_ID ayhmtabsvpdgc and
         PLAN_HASH 2559540153
         select * from (
         select id,error_message,error_stacktrace,ch_req_message,ch_resp_messa
         ge,co_req_message,co_resp_message,async_req_message,async_resp_messag
         e,create_date,modify_date,sts,instance_id from cz_instance_message
         WHERE
         instance_id  = :1
         order by id desc
         )
         WHERE
         rownum  <= :2
      RATIONALE: SQL statement with SQL_ID "ayhmtabsvpdgc" was executed 796
         times and had an average elapsed time of 1.5 seconds.
      RATIONALE: Average time spent in User I/O wait events per execution was
         0.62 seconds.
該sql語句是對錶cz_instance_message的一個查詢操作,初步定位問題到對該表的查詢,下一步檢查該sql執行計劃
3、檢查sql的執行計劃
SQL> explain plan for
         select * from (
         select id,error_message,error_stacktrace,ch_req_message,ch_resp_message,
         co_req_message,co_resp_message,async_req_message,async_resp_message,
         create_date,modify_date,sts,instance_id from zjhc_chongzhi.cz_instance_message
         WHERE
         instance_id  = :1
         order by id desc
         )
         WHERE
            rownum  <= :2;
Explained.
SQL> @/u01/app/oracle/product/10.2.0/rdbms/admin/utlxpls.sql
ps:utlxpls.sql可以讓你以一個完美的閱讀格式檢視當前最近的sql語句的執行計劃
或者執行一下操作亦有相同效果
SQL> set linesize 130;
SQL> select * from table(dbms_xplan.display);
Plan hash value: 2559540153
-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                       |     1 | 13062 | 30494    (1)| 00:06:06 |
|*  1 |  COUNT STOPKEY               |                       |       |       |    |                |
|   2 |   VIEW                        |                       |     1 | 13062 | 30494    (1)| 00:06:06 |
|*  3 |    SORT ORDER BY STOPKEY|                |     1 |  1742 | 30494   (1)| 00:06:06 |
|*  4 |     TABLE ACCESS FULL  | CZ_INSTANCE_MESSAGE |     1 |  1742 | 30493   (1)| 00:06:06 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=TO_NUMBER(:2))
   3 - filter(ROWNUM<=TO_NUMBER(:2))
   4 - filter("INSTANCE_ID"=TO_NUMBER(:1))
結果表明sql對錶CZ_INSTANCE_MESSAGE執行了全表掃描。
4、分析和處理:
問題sql
         select * from (
         select id,error_message,error_stacktrace,ch_req_message,ch_resp_message,
         co_req_message,co_resp_message,async_req_message,async_resp_message,
         create_date,modify_date,sts,instance_id from zjhc_chongzhi.cz_instance_message
         WHERE
         instance_id  = :1
         order by id desc
         )
         WHERE
            rownum  <= :2;
該sql是以CZ_INSTANCE_MESSAGE表的instance_id欄位作為查詢條件的,但instance_id並未做索引造成sql執行是進行了全表掃描,全表掃描增加了cpu和IO的佔用,提請開發人員對instance_id做了索引。
再次檢查cpu和系統負載,已經下降到以前的正常水平。
第二天檢查相同業務時段的awr報告db file scattered read和read by other session已經很低了不在前五個等待事件的報告內了。
Segments by Logical Reads和Segments by Physical Reads報告內容上CZ_INSTANCE_MESSAGE表也沒有了。
檢查addm報告已經沒有對這條sql進行指導和建議的內容了。
5、總結:
本次最佳化調整操作根據awr報告發現等待事件和引起等待事件的段(表)根據addm報告發現問題sql,檢查該sql的執行計劃發現了該sql對錶執行了全表掃描,透過對查詢條件欄位加索引解決了全表掃描引起的效能問題。
整個解決未查詢資料字典相關的檢視,只是依靠oracle提供的的診斷和建議報告結合sql執行計劃的檢視這些基本的手段完成。
雖然只是最簡單的一例因索引而引起的資料庫效能問題,但重點是在於向新人(包括我在內)使用這些診斷工具來排查和解決問題的方法。
 
 
 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/20575781/viewspace-710235/,如需轉載,請註明出處,否則將追究法律責任。

相關文章