Tuning CPU 100% in Oracle 11g rac-20220215
Tuning CPU 100% in Oracle 11g rac-20220215
導讀:一個檢驗DBA透過AWR,ASH,SQL,執行計劃,hang sql,10046事件等等工具串聯起來解決生產故障的case。隨著分析深入瞭解fixed table原理,繞過BUG解決問題,並解決了偶爾可執行sql語句的最佳化思路。隆重推薦細讀。
場景:FIXED TABLE(X$KSOLSFTS)資料量大,觸發的Oracle Bug,引起的cpu高負載(100%)。因業務系統不能立即應用Patch,故採取重啟instance方式來重新填充FIXED TABLE(X$KSOLSFTS)資料,進而減少FIXED TABLE(X$KSOLSFTS)資料最佳化案例。
1.問題描述
2月14日夜裡00點左右。接到業務同事的最佳化請求。數倉系統中發生cpu負載100%情況,導致正常的業務跑批無法完成。 發生100%cpu高負載是00:00:00開始,00:15:00處理,透過kill session(超過24小時的session的 foglight監控processes)減緩了cpu高負載的狀況,勉強把跑批任務完成。具體原因上班繼續分析。 2月14日上班期間繼續分析這個cpu高負載問題,這個問題還沒有根本解決。其實,2月9日已經開始關注這個問題了, 但是誤以為是偶發現象,沒有立即做最佳化處理,導致2月14的故障。深刻教訓,dba發現問題立即解決,誤等,切記誤等!!
2.問題分析
2.1 環境
Oracle 11.2.0.4.0 rac on Aix6.1
2.2 事後分析:由面(os,awr)到點(ash,sql,rows)
2.2.1 從面分析(OS,AWR)
OS層面:從2月13晚10點左右開始高cpu負載,2月14下午14左右徹底解決(2月14日12點左右重啟instance)。 歷史資料來源foglight監控。 # OS層面分析的思路是:提供一個大方向上的影響範圍(時間,裝置)。同時諮詢業務運維人員此期間未做任何變更。 從cpu高負載的時間段分析AWR。 AWR層面:從AWR分析,發現CPU高負載和OS層看到的CPU高負載相吻合。並且可以找出可疑的SQL。 # 2月14日0點至1點這個1個小時內節點2,從AWR中分析出DB負載為189%,且OS級CPU負載接近98.9% Host Name Platform CPUs Cores Sockets Memory(GB) ---------------- -------------------------------- ---- ----- ------- ---------- bmcdb2 AIX-Based Systems (64-bit) 32 8 124.50 Snap Id Snap Time Sessions Curs/Sess Instances --------- ------------------- -------- --------- --------- Begin Snap: 68748 14-Feb-22 00:00:07 162 .9 2 End Snap: 68749 14-Feb-22 01:00:26 300 .9 2 Elapsed: 60.33 (mins) DB Time: 3,662.68 (mins) # 從awr檢視db負載情況。負載=(DB Time/Elapsed*CPUs)*100%=189%。注:不包含DB後臺程式等待事件 Load Profile Per Second Per Transaction Per Exec Per Call ~~~~~~~~~~~~~~~ --------------- --------------- --------- --------- DB Time(s): 60.7 380.2 9.75 7.01 DB CPU(s): 7.8 49.0 1.26 0.90 Redo size (bytes): 572,447.6 3,585,022.2 Logical read (blocks): 28,460.0 178,234.1 Block changes: 787.2 4,929.8 Physical read (blocks): 5,351.1 33,511.6 Physical write (blocks): 548.2 3,433.3 Read IO requests: 56.5 353.6 Write IO requests: 40.6 254.4 Read IO (MB): 41.8 261.8 Write IO (MB): 4.3 26.8 RAC GC blocks received: 17.7 110.7 RAC GC blocks served: 5.9 37.0 User calls: 8.7 54.2 Parses (SQL): 3.5 22.2 Hard parses (SQL): 0.1 0.3 SQL Work Area (MB): 0.9 5.5 Logons: 0.9 5.3 Executes (SQL): 6.2 39.0 Rollbacks: 0.1 0.4 Transactions: 0.2 Host CPU ~~~~~~~~ Load Average CPUs Cores Sockets Begin End %User %System %WIO %Idle ----- ----- ------- --------- --------- --------- --------- --------- --------- 32 8 52.47 52.44 98.9 1.1 0.0 0.0 Instance CPU ~~~~~~~~~~~~ % of total CPU for Instance: 24.6 % of busy CPU for Instance: 24.6 %DB time waiting for CPU - Resource Mgr: 28.7 <=cpu使用與RM競爭嚴重 # 理論上:%Idle=0較低,% of busy CPU=24.6較高情況下,存在cpu瓶頸。這個cpu瓶頸也與OS層監控展示的cpu瓶頸相吻合。 # 理論上:%DB time waiting for CPU - Resource Mgr 產生等待值時, 表示resource manager限制某個使用者和會話使用CPU,而產生的等待。 一般會產生resmgr:cpu quantum等待事件,如果產生該等待事件需要和RSRC_MGR的值結合起來判斷。 # %DB time waiting for CPU - Resource Mgr 一般解決辦法是調整RM的限制,但有時候RM限制是正確的, 只是session使用cpu不合理。我們這個案例就是這樣的。 # resmgr:cpu quantum 等待事件並不是很多,資料庫也未出現掛起的情況。所以這裡分析RM受限是正確的, 那產生%DB time waiting for CPU - Resource Mgr等待,只能推測是cpu使用不合理了。 接下來我們看看cpu資源嚴重的sql排名第一,如下: # 分析awr找到一個sql_id為70ddwj7363cd7的語句比較可疑。這個語句每執行1次大約5566秒,收集1個小時的awr才3600秒。 可以說明這個session連續執行了1個多小時都未執行成功,一直使用cpu,所以才有這個session1個小時內竟然執行5566秒。 CPU CPU per Elapsed Time (s) Executions Exec (s) %Total Time (s) %CPU %IO SQL Id ---------- ------------ ---------- ------ ---------- ------ ------ ------------- 22,264.3 4 5,566.09 78.5 159,729.5 13.9 .0 70ddwj7363cd7 Module: Oracle SQL Performance Investigator SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic _name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block chan ges', 'physical writes') ) PIVOT (SUM (value) FOR statistic_ 結論: 從os,awr的分析結果sql_id=70ddwj7363cd7消耗cpu嚴重,並在一個小時內長時間執行不結束,一直使用cpu。 接下來我們細化分析(ASH,SQL,ROWS)一下SQL是否需要最佳化。
2.2.2 從點分析(ASH,SQL,ROWS)
故障發生時間是00:00:00--00:15:00。awr和addm預設1小時快照1次,15分鐘內肯定沒有快照。這種情況使用ash進行精準分析 # 透過ash分析同樣發現了問題sql_id為70ddwj7363cd7的語句,等待所佔比例為77.2%。這個結果和AWR分析等結果再次一次吻合。 Top SQL with Top Events DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) Sampled # SQL ID Planhash of Executions % Activity ----------------------- -------------------- -------------------- -------------- Event % Event Top Row Source % RwSrc ------------------------------ ------- --------------------------------- ------- 70ddwj7363cd7 1450034333 52 97.41 CPU + Wait for CPU 77.20 FIXED TABLE - FULL 63.45 SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic _name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block chan ges', 'physical writes') ) PIVOT (SUM (value) FOR statistic_ # cpu等待的事件session,竟然是system,值得深思。 Top Sessions DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) -> '# Samples Active' shows the number of ASH samples in which the session was found waiting for that particular event. The percentage shown in this column is calculated with respect to wall clock time and not total database activity. -> 'XIDs' shows the number of distinct transaction IDs sampled in ASH when the session was waiting for that particular event -> For sessions running Parallel Queries, this section will NOT aggregate the PQ slave activity into the session issuing the PQ. Refer to the 'Top Sessions running PQs' section for such statistics. Sid, Serial# % Activity Event % Event --------------- ---------- ------------------------------ ---------- User Program # Samples Active XIDs -------------------- ------------------------------ ------------------ -------- 17,21977 1.89 CPU + Wait for CPU 1.41 SYSTEM JDBC Thin Client 67/90 [ 74%] 0 29,10017 1.89 CPU + Wait for CPU 1.47 SYSTEM JDBC Thin Client 70/90 [ 78%] 0 56,46405 1.89 CPU + Wait for CPU 1.52 SYSTEM JDBC Thin Client 72/90 [ 80%] 0 121, 8429 1.89 CPU + Wait for CPU 1.52 SYSTEM JDBC Thin Client 72/90 [ 80%] 0 130,20913 1.89 CPU + Wait for CPU 1.56 SYSTEM JDBC Thin Client 74/90 [ 82%] 0 # 最後,發現foglight監控使用的是system使用者 Activity Over Time DB/Inst: bmcDB/bmcdb2 (Feb 14 00:00 to 00:15) -> Analysis period is divided into smaller time slots -> Top 3 events are reported in each of those slots -> 'Slot Count' shows the number of ASH samples in that slot -> 'Event Count' shows the number of ASH samples waiting for that event in that slot -> '% Event' is 'Event Count' over all ASH samples in the analysis period Slot Event Slot Time (Duration) Count Event Count % Event -------------------- -------- ------------------------------ -------- ------- 00:00:00 (5.0 min) 1,575 CPU + Wait for CPU 1,258 26.48 resmgr:cpu quantum 313 6.59 Streams AQ: qmn coordinator wa 1 0.02 00:05:00 (5.0 min) 1,574 CPU + Wait for CPU 1,247 26.25 resmgr:cpu quantum 324 6.82 control file sequential read 3 0.06 00:10:00 (5.0 min) 1,601 CPU + Wait for CPU 1,240 26.11 resmgr:cpu quantum 356 7.49 control file sequential read 4 0.08 -------------------------------------------------------------
2.2.3 問題原因:SQL語句引起的CPU高負載
問題SQL_ID='70ddwj7363cd7'的sql語句,由foglight程式發起,system執行,導致cpu高負載問題。接下來我們來看看罪惡元兇這個sql語句長什麼樣子。 # 從AWR分析得知SQL_ID='70ddwj7363cd7',cpu使用率極高 # 從ASH分析得知SQL_ID='70ddwj7363cd7',Event比例極高 select SQL_TEXT from DBA_HIST_SQLTEXT where SQL_ID='70ddwj7363cd7'; # 使用DBA_HIST_SQLTEXT檢視透過sql_id可以查詢到詳細的sql,sql語句如下: SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= :1 # 這個語句顯然不是一個業務語句,所以處理它,可以大膽一些。
2.2.4 最佳化問題SQL,檢視真實執行計劃及解決辦法
首先,檢視真實執行計劃,可使用一次sql語句後使用dbms_xplan.display_cursor檢視執行計劃, 可惜QL(SQL_ID='70ddwj7363cd7)無法執行成功,一直hang住。 其次,未執行完成或者一直hang的SQL,可以使用10046事件來分析執行計劃。 未執行完成或一直hang的SQL使用sql trace工具在session級別分析其執行計劃 # 具體方法步驟請參考:附表中《未執行完成的SQL語句如何檢視執行計劃呢?》 # 分析得出trc檢視,並tkprof工具格式化後,如下 ========================================== SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 1 200.03 328.22 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 200.03 328.23 0 0 0 0 Misses in library cache during parse: 1 Optimizer mode: ALL_ROWS Parsing user id: SYS Number of plan statistics captured: 1 Rows (1st) Rows (avg) Rows (max) Row Source Operation ---------- ---------- ---------- --------------------------------------------------- 0 0 0 COUNT STOPKEY (cr=0 pr=0 pw=0 time=62 us) 0 0 0 VIEW (cr=0 pr=0 pw=0 time=61 us cost=313 size=485355 card=7467) 0 0 0 SORT ORDER BY STOPKEY (cr=0 pr=0 pw=0 time=60 us cost=313 size=246411 card=7467) 0 0 0 HASH GROUP BY PIVOT (cr=0 pr=0 pw=0 time=43 us cost=313 size=246411 card=7467) 81563632 81563632 81563632 FIXED TABLE FULL X$KSOLSFTS (cr=0 pr=0 pw=0 time=762017301 us cost=293 size=3310527 card=100319) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ SQL*Net message to client 1 0.00 0.00 SQL*Net break/reset to client 1 0.00 0.00 SQL*Net message from client 1 12.02 12.02 # 透過10046事件分析得知,這條語句執行全部消耗在FIXED TABLE FULL X$KSOLSFTS , 並且嚴重消耗CPU=200/328=60% ========================================== # X$KSOLSFTS是一個fixed table,作用是動態分析instance資源的fixed table, 資料均儲存記憶體中,理應很快返回結果,為什麼這個case這麼慢呢,還消耗cpu這麼高。 從這個不正常的角度分析,查詢了一下mos,發現是一個bug。 # Slow Performance When Querying V$SEGSTAT or GV$SEGSTAT Views (Doc ID 1532624.1) 解決辦法: 第一:Apply the patch for Bug 21050285 or 第二:As a workaround, restarting the problem node alleviates this issue since fixed tables are populated from the time when the instance is started. # 因為我短時間無法升級patch 21050285,所以我選擇了重啟instance。 總結:到此其實這個問題已經解決,我們按照mos指定方法,找個時間視窗重啟instance嘗試解決它就可以了。 但是,抱著身為高階dba裝B的態度,想了解以下原理內容,故就有了下面《3.發散思維》
3.發散思維
作為一個想進步的DBA,不能簡單解決問題就結束了。拿著這個case鍛鍊一下自己其他方面的知識。接下來我們分析一下執行計劃。
3.1 簡易執行計劃
從awr分析到需要深入分析的sql語句後,最檢視執行計劃的方式就是透過dbms_xplan.display_awr來檢視。 # 這種方法優缺點:執行計劃不一定是最佳的,但是最快速瞭解問題sql大概資訊的方法。 3.2 會介紹更準確分析執行計劃的方法。 select * from table(dbms_xplan.display_awr('&sql_id')); select * from table(dbms_xplan.display_awr('70ddwj7363cd7')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 70ddwj7363cd7 -------------------- SELECT obj#, WHEREV$SEGSTAT 'physicalreads', 'physical writes'), ('logical_name IN 'db block changes', WHERE rownum <= :1 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 2 (100)| | | 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 65 | 2 (100)| 00:00:01 | | 3 | SORT ORDER BY STOPKEY| | 1 | 86 | 2 (100)| 00:00:01 | | 4 | HASH GROUP BY PIVOT | | 1 | 86 | 2 (100)| 00:00:01 | | 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 86 | 0 (0)| | -------------------------------------------------------------------------------------- 23 rows selected. # 這個執行計劃,估計不是最佳的。如果是最佳的,也不至於1個小時也不出結果 # 執行計劃 Plan hash value: 1450034333;很多時候不同的sql_id是可以有同一個執行計劃的。 # 這裡的SQL與DBA_HIST_SQLTEXT裡查詢的SQL並不相同。 # 既然認為執行計劃不是最佳,那我們如何判斷它不是最佳的執行計劃呢? # X$KSOLSFTS表實際行數是632764,可統計資訊是283998。顯然是不準確的。統計資訊不準確的情況下, 很難得到最佳的執行計劃。所以,認為執行計劃不是最佳的,有最佳化的空間 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name ='X$KSOLSFTS'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 283998 2017-09-05 14:03:06 實際行數 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 632764 # 往往透過dbms_xplan.display_awr檢視執行計劃不是最準確的, 接下來我們使用dbms_xplan.display_cursor檢視真實的執行計劃。
3.2 真實執行計劃
3.2.1 如何檢視SQL的真實執行計劃?
========================================== step1:設定session分析級別 alter session set statistics_level=ALL; step2:執行業務sql 用10來代替變數 step3:為了樣式,設定linesize set linesize 200 pagesize 300; step4:查詢真實執行計劃 select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); ========================================== SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:01.24 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:01.24 | | 2 | VIEW | | 1 | 4583 | 10 |00:00:01.24 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 4583 | 10 |00:00:01.24 | | 4 | HASH GROUP BY PIVOT | | 1 | 4583 | 7467 |00:00:01.23 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 44024 | 115K|00:00:00.97 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # E-Rows 44024 與 A-Rows 115K 差距較大,執行計劃顯然不是最佳的。之前我們也分析X$KSOLSFTS統計資訊不準確,我們來收集一下X$KSOLSFTS統計的資訊。
3.2.2 如何收集fixed tables統計資訊?
========================================== # 備註:業務低峰期進行收集。 收集fixed tables統計資訊 execute dbms_stats.gather_fixed_objects_stats; 刪除fixed tables統計資訊 execute dbms_stats.delete_fixed_objects_stats; ========================================== 收集fixed tables統計資訊後,X$KSOLSFTS統計的資訊也得到了更新。 想著這回執行計劃可能會準確把?但現實狠狠再次抽我一下。 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name like '%KSOLSFTS%'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 632280 2022-02-10 11:26:24 再次檢視執行計劃,還是不準確的 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.51 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.51 | | 2 | VIEW | | 1 | 4583 | 10 |00:00:00.51 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 4583 | 10 |00:00:00.51 | | 4 | HASH GROUP BY PIVOT | | 1 | 4583 | 7459 |00:00:00.50 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 44024 | 114K|00:00:00.39 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 雖然X$KSOLSFTS統計資訊更新了,但是執行計劃還是不準確,E-Rows是44024行,我們實際卻要訪問114K行。 # 但是實際使用的時間是00:00:00.51,已經很快了。為什麼會出現cpu等待呢? 問題:其實這條語句在一個小時內執行內有時可以完成,有時無法完成;導致我們無法系統的分析他。 答案:我們必須全域性的考慮他,比如檢視每個小時cpu執行情況,cpu使用情況等等,這樣就招撥出來dba_hist_sqlstat檢視來幫助我們分析。
3.3.3 透過dba_hist_sqlstat檢視來分析sql執行計劃消耗資源的變化,選擇資源消耗更少的執行計劃
SQL_ID='70ddwj7363cd7'偶爾能執行成功,偶爾不行,這種情況我們只能看歷史情況綜合分析它了。 透過dba_hist_sqlstat檢視來看看這個sql語句的執行計劃在每個小時內消耗cpu的資源分析 # 7天內,查詢SQL_ID的sql每個執行計劃消耗的資源情況 # 收集fixed tables統計資訊後,顯然是有效果的,但是沒想到後續又出現了cpu高等待 set lines 150 pages 999; col p_user for 99999 ; col execs for 9999999 ; select a.INSTANCE_NUMBER inst_id,to_char(b.END_INTERVAL_TIME,'yyyymmdd hh24:mi:ss') time,plan_hash_value,buffer_gets_delta/executions_delta get_exec, disk_reads_delta/executions_delta read_exec,cpu_time_delta/executions_delta/1000 cpu_exec_ms ,elapsed_time_delta/executions_delta/1000 elaps_exec_ms ,parsing_schema_id p_user,ROWS_PROCESSED_delta/executions_delta rows_exec,EXECUTIONS_DELTA execs from dba_hist_sqlstat a, dba_hist_snapshot b where a.sql_id='&1' and a.snap_id = b.snap_id and a.instance_number = b.instance_number and b.END_INTERVAL_TIME between sysdate - 7 and sysdate and executions_delta>0 and a.INSTANCE_NUMBER ='2' order by 2,1; INST_ID TIME PLAN_HASH_VALUE GET_EXEC READ_EXEC CPU_EXEC_MS ELAPS_EXEC_MS P_USER ROWS_EXEC EXECS ---------- ----------------- --------------- ---------- ---------- ----------- ------------- ------ ---------- -------- 2 20220207 18:00:13 1450034333 0 0 3245284.24 22792974.5 5 0 6 =>省略一些 2 20220209 12:00:42 0 0 18276.144 30567.4557 5 0 60 2 20220209 13:00:23 0 0 18589.8178 30220.1374 5 0 60 2 20220209 14:00:32 0 0 18809.3314 30660.602 5 0 60 2 20220209 15:00:16 0 0 719547.728 1170843.71 5 0 6 =>計劃工作是1個小時60次,其實只能完成6次,這個時候其實問題已經存在了 2 20220209 16:00:42 0 0 2599794.88 4483118.75 5 0 4 2 20220209 17:00:02 0 0 3711172.17 7159658.05 5 0 4 2 20220209 18:00:15 0 0 329147.826 662295.523 5 0 41 2 20220209 19:00:24 0 0 1653292.58 2709949.97 5 0 4 =>省略一些,這個時候偶爾可以,偶爾不可以 2 20220210 10:00:04 0 0 314395.289 1281711.87 5 7459.88333 60 2 20220210 11:00:15 0 0 317873.686 1263796.6 5 7465.53333 60 2 20220210 12:00:21 0 0 322069.2 1253976.16 5 7461.5 20 =>11:30左右開始收集fixed tables統計資訊,12:50左右收集完成。14:00開啟監控sql(V$SEGSTAT) 2 20220210 14:00:25 0 0 289.353 590.408256 5 7456.16279 43 2 20220210 15:00:22 1.43333333 .083333333 287.013433 465.832983 5 7453.6 60 2 20220210 16:00:15 0 0 285.755183 463.890183 5 7450.21667 60 2 20220210 17:00:22 1450034333 0 0 288.7832 589.889667 5 7450 60 2 20220210 18:00:17 1450034333 0 0 286.847317 464.948483 5 7450.91667 60 2 20220210 19:00:18 1450034333 0 0 286.966 464.918083 5 7450.01667 60 2 20220210 20:00:11 1450034333 0 0 286.781417 465.000033 5 7449.96667 60 2 20220210 21:00:06 1450034333 0 0 287.152033 465.723383 5 7450 60 2 20220210 22:00:07 1450034333 0 0 287.14685 465.364817 5 7450 60 2 20220210 23:00:04 1450034333 0 0 309.7571 503.2724 5 7450 60 2 20220211 00:00:49 1450034333 0 0 309.959393 503.812279 5 7449.18033 61 2 20220211 05:00:41 1450034333 0 0 225638.216 372509.22 5 0 18 =>再次出現cpu高等待,後面就是偶爾高,偶爾低不穩定 2 20220211 06:00:24 1450034333 0 0 102763.108 166902.224 5 0 60 2 20220211 07:00:18 1450034333 0 0 102958.522 167239.497 5 0 60 2 20220211 08:00:57 1450034333 0 0 103486.041 168101.32 5 0 61 2 20220211 09:00:11 1450034333 0 0 96940.2201 158655.225 5 0 59 2 20220211 10:00:04 1450034333 0 0 100547.878 164694.348 5 0 60 2 20220211 11:00:02 1450034333 0 0 104021.177 169745.013 5 0 60 2 20220211 12:00:11 1450034333 0 0 104604.996 170262.021 5 0 60 # 分析到此,幾乎黔驢技窮了。fixed tables統計資訊準確後,我們的執行計劃還是不準的。看樣這個BUG只有重啟rac的instance來減少X$KSOLSFTS的資料量來解決這個問題了。
3.3.4 rac重啟instance後,fixed tables資料減少,查詢速度變快,執行計劃成本未改變(未收集統計資訊)
# 重啟instance後,X$KSOLSFTS在重啟instance時被記憶體中的C程式結構重新填充了資料, 進而X$KSOLSFTS資料量減少了。 節點1 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 36102 節點2 SQL> select count(*) from X$KSOLSFTS; COUNT(*) ---------- 22352 # instance重啟後,X$KSOLSFTS實際行數從632764下降至3萬多 # 重啟之後,sql執行計劃的改變 節點2 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID 6u1rq6czpwhgy, child number 1 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.02 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.02 | | 2 | VIEW | | 1 | 7467 | 10 |00:00:00.02 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 7467 | 10 |00:00:00.02 | | 4 | HASH GROUP BY PIVOT | | 1 | 7467 | 920 |00:00:00.02 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 100K| 4064 |00:00:00.01 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 這裡可以看出來,這個執行計劃還是有問題的。e-rows與a-rows差距太大100k與4064。 為什麼e-rows是100k呢,因為我們的統計資訊不準確
3.3.5 收集統計資訊後得到最佳執行計劃
我們繼續收集一下fixed tables的統計資訊,看看執行計劃是否有改變 節點2收集統計資訊後 SQL> execute dbms_stats.gather_fixed_objects_stats; PL/SQL procedure successfully completed. # 剛重啟instance後,fixed table資料量少,收集速度也很快。 節點2統計資訊分析結果檢視 SQL> select t.TABLE_NAME,t.NUM_ROWS,t.BLOCKS,t.LAST_ANALYZED from dba_tab_statistics t where table_name like '%KSOLSFTS%'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANALYZED ------------------------------ ---------- ---------- ------------------- X$KSOLSFTS 37246 2022-02-14 13:17:57 節點2收集統計資訊後,sql真實執行計劃如下 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'iostats last')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL_ID g5ktn014rjsyw, child number 0 ------------------------------------- SELECT * FROM ( SELECT * FROM ( SELECT obj#, statistic_name, value FROM V$SEGSTAT WHERE statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) PIVOT (SUM (value) FOR statistic_name IN ('logical reads', 'physical reads', 'db block changes', 'physical writes') ) ORDER BY 2 DESC ) WHERE rownum <= 10 Plan hash value: 1450034333 -------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 10 |00:00:00.02 | |* 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.02 | | 2 | VIEW | | 1 | 1507 | 10 |00:00:00.02 | |* 3 | SORT ORDER BY STOPKEY| | 1 | 1507 | 10 |00:00:00.02 | | 4 | HASH GROUP BY PIVOT | | 1 | 1507 | 936 |00:00:00.02 | |* 5 | FIXED TABLE FULL | X$KSOLSFTS | 1 | 5011 | 3096 |00:00:00.01 | -------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(ROWNUM<=10) 3 - filter(ROWNUM<=10) 5 - filter((INTERNAL_FUNCTION("FTS_STATNAM") AND "FTS_INTE"=0 AND "INST_ID"=USERENV('INSTANCE'))) 31 rows selected. # 當前執行計劃終於準確了,E-Rows | A-Rows差距相近
3.3.6 使用dba_hist_sqlstat統計一下我們最佳化後的效果
rac重啟instance之後,持續觀察一下cpu資源每個小時內消耗cpu的資源分析 #7天內,查詢SQL_ID的sql每個執行計劃消耗的資源情況 set lines 150 pages 999; col p_user for 99999 ; col execs for 9999999 ; select a.INSTANCE_NUMBER inst_id,to_char(b.END_INTERVAL_TIME,'yyyymmdd hh24:mi:ss') time,plan_hash_value,buffer_gets_delta/executions_delta get_exec, disk_reads_delta/executions_delta read_exec,cpu_time_delta/executions_delta/1000 cpu_exec_ms ,elapsed_time_delta/executions_delta/1000 elaps_exec_ms ,parsing_schema_id p_user,ROWS_PROCESSED_delta/executions_delta rows_exec,EXECUTIONS_DELTA execs from dba_hist_sqlstat a, dba_hist_snapshot b where a.sql_id='&1' and a.snap_id = b.snap_id and a.instance_number = b.instance_number and b.END_INTERVAL_TIME between sysdate - 7 and sysdate and executions_delta>0 and a.INSTANCE_NUMBER ='2' order by 2,1; Enter value for 1: 70ddwj7363cd7 old 3: where a.sql_id='&1' new 3: where a.sql_id='70ddwj7363cd7' INST_ID TIME PLAN_HASH_VALUE GET_EXEC READ_EXEC CPU_EXEC_MS ELAPS_EXEC_MS P_USER ROWS_EXEC EXECS ---------- ----------------- --------------- ---------- ---------- ----------- ------------- ------ ---------- -------- ............ =》省略一些 2 20220213 19:00:27 1450034333 0 0 5734704.21 22100839.5 5 0 4 2 20220213 20:00:51 1450034333 0 0 6204963.99 26358999.8 5 0 4 2 20220213 21:00:28 1450034333 0 0 6270982.43 29882859 5 0 4 2 20220213 22:00:05 1450034333 0 0 6402684.53 33756085.6 5 0 4 2 20220213 23:00:13 1450034333 0 0 5777364.85 34102730.6 5 0 4 2 20220214 00:00:07 1450034333 0 0 5759844.62 36741886.9 5 0 4 2 20220214 01:00:26 1450034333 0 0 5566086.21 39932383.6 5 0 4 2 20220214 02:00:17 1450034333 0 0 663313.873 4898792.2 5 0 33 2 20220214 03:00:38 1450034333 0 0 1218136.33 9638080.7 5 0 18 2 20220214 04:00:12 1450034333 0 0 21725865.7 182932616 5 0 1 =>20220214 12:30:00 例項重啟之後,X$KSOLSFTS錶行數減少。問題得到真正解決 2 20220214 14:00:14 1450034333 1.75 .055555556 13.5259444 21.8915833 5 1002.05556 36 2 20220214 15:00:10 1450034333 0 0 13.9752 22.5881333 5 1032.55 60 2 20220214 16:00:09 1450034333 0 0 14.4937667 23.4365333 5 1089.48333 60 2 20220214 17:00:36 1450034333 0 0 14.9452 24.17055 5 1126.1 60 2 20220214 18:00:35 1450034333 0 0 15.6892167 25.32605 5 1196.4 60 2 20220214 19:00:45 1450034333 0 0 15.7695082 25.4163115 5 1199.45902 61 2 20220214 20:00:24 1450034333 0 0 16.0380508 25.8679492 5 1224.94915 59 2 20220214 21:00:22 1450034333 0 0 16.1035167 25.9960667 5 1226 60 2 20220214 22:00:13 1450034333 0 0 16.10595 26.0308333 5 1226 60 2 20220215 00:00:00 1450034333 0 0 25.5067333 41.3598667 5 2001.21667 60 2 20220215 05:00:18 1450034333 0 0 36.4353 59.0761833 5 2941.05 60 2 20220215 06:00:12 1450034333 0 0 36.5330167 59.0673833 5 2948.21667 60 2 20220215 07:00:16 1450034333 0 0 36.44085 58.8934333 5 2948 60 2 20220215 08:00:24 1450034333 0 0 36.6509833 59.2170667 5 2949.53333 60 2 20220215 09:00:17 1450034333 0 0 37.3499 60.3852333 5 3035.36667 60 2 20220215 10:00:41 1450034333 0 0 38.42875 62.13975 5 3115.76667 60 2 20220215 11:00:03 1450034333 0 0 38.7289833 62.6527667 5 3124.53333 60 2 20220215 12:00:17 1450034333 0 0 38.4324167 62.15165 5 3124.08333 60 2 20220215 13:00:27 1450034333 0 0 38.6223833 62.4524333 5 3125 60 2 20220215 14:00:47 1450034333 0 0 38.5679672 62.5711311 5 3125.59016 61 2 20220215 15:00:07 1450034333 0 0 38.5895085 62.5896271 5 3126.22034 59 到此問題得到根本解決。 第一,執行計劃準確,E-Rows | A-Rows差距接近 第二,使用dba_hist_sqlstat檢視sql使用cpu資源也趨近穩定合理。
4.附表:
4.1 如何檢視統計資訊維護視窗?
col WINDOW_NAME for a15 col REPEAT_INTERVAL for a60 col DURATION for a30 set linesize 120 SELECT t1.window_name, t1.repeat_interval, t1.duration FROM dba_scheduler_windows t1, dba_scheduler_wingroup_members t2 WHERE t1.window_name = t2.window_name AND t2.window_group_name IN ('MAINTENANCE_WINDOW_GROUP', 'BSLN_MAINTAIN_STATS_SCHED'); ***************************************************************** WINDOW_NAME REPEAT_INTERVAL DURATION --------------- ------------------------------------------------------------ ------------------------------ MONDAY_WINDOW freq=daily;byday=MON;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TUESDAY_WINDOW freq=daily;byday=TUE;byhour=22;byminute=0; bysecond=0 +000 04:00:00 WEDNESDAY_WINDOW freq=daily;byday=WED;byhour=22;byminute=0; bysecond=0 +000 04:00:00 THURSDAY_WINDOW freq=daily;byday=THU;byhour=22;byminute=0; bysecond=0 +000 04:00:00 FRIDAY_WINDOW freq=daily;byday=FRI;byhour=22;byminute=0; bysecond=0 +000 04:00:00 SATURDAY_WINDOW freq=daily;byday=SAT;byhour=22;byminute=0; bysecond=0 +000 04:00:00 SUNDAY_WINDOW freq=daily;byday=SUN;byhour=22;byminute=0; bysecond=0 +000 04:00:00 7 rows selected.
4.2 未執行完成的SQL語句如何檢視執行計劃呢?
使用sql trace工具在session級別對未執行完的SQL,進行執行計劃分析 # 使用sql trace 檢視未執行完的SQL執行計劃的方法: ========================================== connect xxxx/yyyy alter session set tracefile_identifier='EVENT10046'; alter session set max_dump_file_size = UNLIMITED; alter session set timed_statistics=true; alter session set events '10046 trace name context forever, level 12'; # level 12 可以對變數進行追蹤 -- 執行物件 SQL -- sql_id=70ddwj7363cd7是一個繫結變數的SQL,所以暫定把變數=10帶入SQL來最佳化 alter session set events '10046 trace name context off'; select * from v$diag_info where name like 'Default Trace File%'; exit ==========================================
5.總結:
5.1 靈活應用手中的工具: 若分析一個快照週期內資料庫狀況,可以使用AWR;若要分析未完成快照的資料庫狀況,可使用ASH。 5.2 instance永遠不重啟,也許不是一個正確的選擇? 長時間未重啟的instance要考慮fixed table的增長對資料庫影響, 本案例instance就是1年多未重啟。X$KSOLSFTS由3w增加到63多w觸發了BUG。 5.3 未執行完成的SQL如何檢視執行計劃? 10046事件來分析。 5.4 怎麼分析SQL執行偶爾成功,偶爾失敗的資源狀況呢? dba_hist_sqlstat和dba_hist_snapshot組合是一個利器。衡定一個時間內分析這個SQL資源狀況最為標準。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26442936/viewspace-2856709/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- oracle sql tuning 14 --10046OracleSQL
- Automatic SQL Tuning in Oracle Database 11gSQLOracleDatabase
- oracle tuningOracle
- oracle 11g AWR CPU 顯示Bug 一則Oracle
- Oracle程式導致CPU 100%解決步驟(轉)Oracle
- Oracle 11g新SQL Trace 10046方法OracleSQL
- Oracle Tuning總結Oracle
- Oracle SQL Perfomance TuningOracleSQL
- oracle 11g vktm程式佔用CPU高 【轉載】Oracle
- Oracle效能優化-資料庫CPU使用率100%Oracle優化資料庫
- Oracle 11G OCP 1Z0-052 100Oracle
- Oracle 11G OCP 1Z0-053 100Oracle
- oracle addm segment tuningOracle
- Oracle Wait Event - TuningOracleAI
- CPU100%排查總結
- oracle 11g 開啟 em後,High CPU after DB Console start in 11gOracle
- 10g,11g sql auto tuning 測試SQL
- oracle.Performance.Tuning筆記OracleORM筆記
- Oracle Advanced Performance Tuning ScriptsOracleORM
- Oracle -- Common Performance Tuning IssuesOracleORM
- Tuning Oracle Full-table ScansOracle
- zt_Limiting I/O and CPU resources using 11g Oracle Resource ManagerMITOracle
- mysql cpu 100% 滿 優化方案MySql優化
- Oracle Advanced Performance Tuning Scripts(轉)OracleORM
- ORACLE SQL Tuning Health-Check(SQLHC)OracleSQL
- 11g rac 等待事件resmgr:cpu quantum事件
- oracle 12.2.0.1 crs call completion cpu使用率100% library cache lockOracle
- CPU持續100%分析並解決
- mysql例項cpu超過100%分析MySql
- top 的%CPU總數大於100
- Oracle CPU CostingOracle
- Oracle Performance Tuning 11g2 (2)OracleORM
- Oracle Performance Tuning 11g2 (6)OracleORM
- Oracle Performance Tuning 11g2 (5)OracleORM
- Oracle Performance Tuning 11g2 (4)OracleORM
- Oracle Performance Tuning 11g2 (3)OracleORM
- Oracle Performance Tuning 11g2 (1)OracleORM
- Oracle Doc list involved with performance tuningOracleORM