GC Buffer Busy Waits in RAC: Finding Hot Blocks
Well I don’t have a lot of time to write anything up… sheesh - it’s like 10pm and I’m still messing with this. I should be in bed. But before I quit for the night I thought I’d just do a quick post with some queries that might be useful for anyone working on a RAC system who sees a lot of the event “gc buffer busy”.
Now you’ll recall that this event simply means that a session is waiting for a buffer in the local cache that is busy because a global cache operation on that block is already in progress. Generally, if you see lots of these it’s an indication of contention across the cluster. So here’s how I got to the bottom of a problem on a pretty active 6-node cluster here in NYC.
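As a quick aside before digging into the history: if the contention is happening right now, a look at gv$session shows who is waiting on the event at this moment. A minimal sketch (in 10g the event is named simply "gc buffer busy"; later releases split it into acquire/release variants):

-- Sketch: current waiters on "gc buffer busy" across all instances.
-- For this event p1/p2 are the file# and block# of the contended buffer.
select inst_id, sql_id, p1 file#, p2 block#, count(*) sessions
from gv$session
where event = 'gc buffer busy'
  and state = 'WAITING'
group by inst_id, sql_id, p1, p2
order by count(*);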
Using the ASH
I’ll show two different ways here to arrive at the same conclusion. First, we’ll look at the ASH to see what the sampled sessions were waiting on today. Second, we’ll look at the segment statistics captured by the AWR.
First of all some setup. I already knew what the wait events looked like from looking at dbconsole but here’s a quick snapshot using the ASH data from today:
select min(begin_interval_time) min, max(end_interval_time) max
from dba_hist_snapshot
where snap_id between 12831 and 12838;

MIN                            MAX
------------------------------ ------------------------------
12-SEP-07 09.00.17.451 AM      12-SEP-07 05.00.03.683 PM
This is the window I’m going to use; 9am to 5pm today.
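If you don’t already have the snap IDs handy, a query like this will find them for a given wall-clock window. A minimal sketch, reusing today’s 9am-5pm window (on RAC you’ll get one row per instance per snapshot):

select snap_id, instance_number, begin_interval_time, end_interval_time
from dba_hist_snapshot
where begin_interval_time >= to_timestamp('12-SEP-07 09:00', 'DD-MON-RR HH24:MI')
  and end_interval_time   <= to_timestamp('12-SEP-07 17:30', 'DD-MON-RR HH24:MI')
order by snap_id, instance_number;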
select wait_class_id, wait_class, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
group by wait_class_id, wait_class
order by 3;

WAIT_CLASS_ID WAIT_CLASS                            CNT
------------- ------------------------------ ----------
   3290255840 Configuration                         169
   2000153315 Network                               934
   4108307767 System I/O                           7199
   3386400367 Commit                               7809
   4217450380 Application                         12248
   3875070507 Concurrency                          14754
   1893977003 Other                                35499
                                                   97762
   3871361733 Cluster                             104810
   1740759767 User I/O                            121999
You can see that there were a very large number of cluster events recorded in the ASH. Let’s look a little closer.
select event_id, event, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
  and wait_class_id=3871361733
group by event_id, event
order by 3;

  EVENT_ID EVENT                                           CNT
---------- ---------------------------------------- ----------
3905407295 gc current request                                4
3785617759 gc current block congested                       10
2705335821 gc cr block congested                            15
 512320954 gc cr request                                    16
3794703642 gc cr grant congested                            17
3897775868 gc current multi block request                   17
1742950045 gc current retry                                 18
1445598276 gc cr disk read                                 148
1457266432 gc current split                                229
2685450749 gc current grant 2-way                          290
 957917679 gc current block lost                           579
 737661873 gc cr block 2-way                               699
2277737081 gc current grant busy                           991
3570184881 gc current block 3-way                         1190
3151901526 gc cr block lost                               1951
 111015833 gc current block 2-way                         2078
3046984244 gc cr block 3-way                              2107
 661121159 gc cr multi block request                      4092
3201690383 gc cr grant 2-way                              4129
1520064534 gc cr block busy                               4576
2701629120 gc current block busy                         14379
1478861578 gc buffer busy                                67275
Notice the huge gap between the number of buffer busy waits and everything else. Other statistics I checked also confirmed that this wait event was the most significant on the cluster. So now we’ve got an event, and we know that 67,275 sessions were waiting on it during ASH snapshots between 9am and 5pm today. Let’s see what SQL these sessions were executing when they got snapped. In fact, let’s even include the “gc current block busy” event since there was a bit of a gap for it too.
select sql_id, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
  and event_id in (2701629120, 1478861578)
group by sql_id
having count(*)>1000
order by 2;

SQL_ID               CNT
------------- ----------
6kk6ydpp3u8xw       1011
2hvs3mpab5j0w       1022
292jxfuggtsqh       1168
3mcxaqffnzgfw       1226
a36pf34c87x7s       1328
4vs8wgvpfm87w       1390
22ggtj4z9ak3a       1574
gsqhbt5a6d4uv       1744
cyt90uk11a22c       2240
39dtqqpr7ygcw       4251
8v3b2m405atgy      42292
Wow - another big leap - 4,000 to 42,000! Clearly there’s one SQL statement which is the primary culprit. What’s the statement?
select sql_text from dba_hist_sqltext where sql_id='8v3b2m405atgy';

SQL_TEXT
---------------------------------------------------------------------------
insert into bigtable(id, version, client, cl_business_id, cl_order_id, desc
I’ve changed the table and field names so you can’t guess who my client might be. :) But it gets the idea across - an insert statement. Hmmm. Any guesses yet about what the problem might be? Well an insert statement could access a whole host of objects (partitions and indexes)… and even more in this case since there are a good number of triggers on this table. Conveniently, the ASH in 10g records what object is being waited on so we can drill down even to that level.
select count(distinct(current_obj#))
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
  and event_id=1478861578
  and sql_id='8v3b2m405atgy';

COUNT(DISTINCT(CURRENT_OBJ#))
-----------------------------
                           14

select current_obj#, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
  and event_id=1478861578
  and sql_id='8v3b2m405atgy'
group by current_obj#
order by 2;

CURRENT_OBJ#        CNT
------------ ----------
     3122841          1
     3122868          3
     3173166          4
     3324924          5
     3325122          8
     3064307          8
          -1         10
     3064369        331
           0        511
     3122795        617
     3064433        880
     3208619       3913
     3208620       5411
     3208618      22215
Well, a trend is emerging. Another very clear outlier: fewer than a thousand sessions were waiting on most of these objects, but the last one has over twenty-two thousand. Let’s have a look at the three biggest ones.
select object_id, owner, object_name, subobject_name, object_type
from dba_objects
where object_id in (3208618, 3208619, 3208620);

 OBJECT_ID OWNER      OBJECT_NAME                    SUBOBJECT_NAME                 OBJECT_TYPE
---------- ---------- ------------------------------ ------------------------------ -------------------
   3208618 JSCHDER    BIGTABLE_LOG                   P_2007_09                      TABLE PARTITION
   3208619 JSCHDER    BIGTABL_LG_X_ID                P_2007_09                      INDEX PARTITION
   3208620 JSCHDER    BIGTABL_LG_X_CHANGE_DATE       P_2007_09                      INDEX PARTITION
Now wait just a moment… this isn’t even the object we’re inserting into!! Well, I’ll spare you the details, but one of the triggers logs every change to BIGTABLE with about 7 inserts into this one. It’s all PL/SQL so we get bind variables and everything - it’s just the sheer number of accesses that is causing all the contention.
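If you want to see for yourself which triggers fire on a table, the data dictionary has that too. A minimal sketch (owner and table names are the anonymized ones from above):

select trigger_name, trigger_type, triggering_event, status
from dba_triggers
where table_owner = 'JSCHDER'
  and table_name  = 'BIGTABLE';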
One further thing we can do is actually see which blocks are getting most contended for - the ASH records this too. (Isn’t the ASH great?)
select current_file#, current_block#, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 12831 and 12838
  and event_id=1478861578
  and sql_id='8v3b2m405atgy'
  and current_obj# in (3208618, 3208619, 3208620)
group by current_file#, current_block#
having count(*)>50
order by 3;

CURRENT_FILE# CURRENT_BLOCK#        CNT
------------- -------------- ----------
         1330         238073         51
         1542          22645         55
         1487         237914         56
         1330         238724         61
         1330         244129         76
         1487         233206        120
One thing that I immediately noticed is that there does not seem to be a single hot block!!! (What?) Out of 40,000 sessions accessing these three objects no more than 120 ever tried to hit the same block. Let’s quickly check if any of these are header blocks on the segments.
select segment_name, header_file, header_block
from dba_segments
where owner='JSCHDER'
  and partition_name='P_2007_09'
  and segment_name in ('BIGTABLE_LOG', 'BIGTABL_LG_X_ID', 'BIGTABL_LG_X_CHANGE_DATE');

SEGMENT_NAME                   HEADER_FILE HEADER_BLOCK
------------------------------ ----------- ------------
BIGTABL_LG_X_CHANGE_DATE              1207       204809
BIGTABL_LG_X_ID                       1207       196617
BIGTABLE_LOG                          1209        16393
No - all seem to be data blocks. Why so much contention? Maybe the RAC and OPS experts out there already have some guesses… but first let’s explore one alternative method to check the same thing and see if the numbers line up.
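One more aside before moving on: if you ever want to confirm which segment a particular (file#, block#) pair from the ASH belongs to, dba_extents can map it back. A minimal sketch using one of the blocks above (this view can be slow to query on very large databases):

select owner, segment_name, partition_name, segment_type
from dba_extents
where file_id = 1487
  and 233206 between block_id and block_id + blocks - 1;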
AWR Segment Statistics
Here’s a handy little query I made up the other day to quickly digest any of the segment statistics from the AWR and grab the top objects for the cluster, reporting per instance. I’m not going to explain the whole thing; I’ll just copy it verbatim. Feel free to use it, but you’ll have to figure it out yourself. :)
col object format a60
col i format 99

select * from (
  select o.owner||'.'||o.object_name||
         decode(o.subobject_name,NULL,'','.')||o.subobject_name||
         ' ['||o.object_type||']' object,
         instance_number i, stat
  from (
    select obj#||'.'||dataobj# obj#, instance_number,
           sum( GC_BUFFER_BUSY_DELTA ) stat
    from dba_hist_seg_stat
    where (snap_id between 12831 and 12838)
      and (instance_number between 1 and 6)
    group by rollup(obj#||'.'||dataobj#, instance_number)
    having obj#||'.'||dataobj# is not null
  ) s, dba_hist_seg_stat_obj o
  where o.obj#||'.'||o.dataobj#=s.obj#
  order by max(stat) over (partition by s.obj#) desc,
           o.owner||o.object_name||o.subobject_name,
           nvl(instance_number,0)
) where rownum<=40;

OBJECT                                                         I       STAT
------------------------------------------------------------ --- ----------
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]                     2529540
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               1      228292
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               2      309684
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               3      289147
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               4      224155
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               5     1136822
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION]               6      341440
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]         2270221
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   1      220094
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   2      313038
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   3      299509
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   4      217489
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   5      940827
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTITION]   6      279264
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                       1793931
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 1      427482
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 2      352305
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 3      398699
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 4      268045
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 5      269230
JSCHDER.BIGTABLE.P_WAREHOUSE [TABLE PARTITION]                 6       78170
JSCHDER.BIGTABL_LG_X_ID.P_2007_09 [INDEX PARTITION]                   771060
JSCHDER.BIGTABL_LG_X_ID.P_2007_09 [INDEX PARTITION]            1      162296
JSCHDER.BIGTABL_LG_X_ID.P_2007_09 [INDEX PARTITION]            2      231141
JSCHDER.BIGTABL_LG_X_ID.P_2007_09 [INDEX PARTITION]            3      220573
JSCHDER.BIGTABL_LG_X_ID.P_2007_09 [INDEX PARTITION]            4      157050
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                         393663
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  1       66277
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  2       10364
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  3        6930
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  4        3484
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  5      266722
JSCHDER.BIGTABLE.P_DEACTIVE [TABLE PARTITION]                  6       39886
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]                  276637
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           1       13750
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           2       12207
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           3       23522
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           4       28336
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           5       99704
JSCHDER.BIGTABLE.P_ACTIVE_APPROVED [TABLE PARTITION]           6       99118

40 rows selected.
As an aside, there is a line in the middle of the query that reads “GC_BUFFER_BUSY_DELTA”. You can replace that line with any of the column names below to see the top objects for the corresponding statistic during the reporting period (a small example follows the list):
LOGICAL_READS_DELTA
BUFFER_BUSY_WAITS_DELTA
DB_BLOCK_CHANGES_DELTA
PHYSICAL_READS_DELTA
PHYSICAL_WRITES_DELTA
PHYSICAL_READS_DIRECT_DELTA
PHYSICAL_WRITES_DIRECT_DELTA
ITL_WAITS_DELTA
ROW_LOCK_WAITS_DELTA
GC_CR_BLOCKS_SERVED_DELTA
GC_CU_BLOCKS_SERVED_DELTA
GC_BUFFER_BUSY_DELTA
GC_CR_BLOCKS_RECEIVED_DELTA
GC_CU_BLOCKS_RECEIVED_DELTA
SPACE_USED_DELTA
SPACE_ALLOCATED_DELTA
TABLE_SCANS_DELTA
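For example, to rank segments by ITL waits instead, only the inner aggregate changes; everything else in the query above stays the same (a sketch):

    select obj#||'.'||dataobj# obj#, instance_number,
           sum( ITL_WAITS_DELTA ) stat
    from dba_hist_seg_stat
    where (snap_id between 12831 and 12838)
      and (instance_number between 1 and 6)
    group by rollup(obj#||'.'||dataobj#, instance_number)
    having obj#||'.'||dataobj# is not null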
Now as you can see, these statistics confirm what we observed from the ASH: the top waits in the system are for the BIGTABLE_LOG table. However this also reveals something the ASH didn’t - that the date-based index on the same table is a close second.
The Real Culprit
Any time you see heavy concurrency problems during inserts on table data blocks, there is one place to look first: space management. Since the ancient days of OPS it has been well known that a shortage of freelists and freelist groups is the enemy of concurrency. In this case, that was exactly the culprit.
select distinct tablespace_name from dba_tab_partitions
where table_name='BIGTABLE_LOG';

TABLESPACE_NAME
------------------------------
BIGTABLE_LOG_DATA

select extent_management, allocation_type, segment_space_management
from dba_tablespaces
where tablespace_name='BIGTABLE_LOG_DATA';

EXTENT_MAN ALLOCATIO SEGMEN
---------- --------- ------
LOCAL      USER      MANUAL

select distinct freelists, freelist_groups from dba_tab_partitions
where table_name='BIGTABLE_LOG';

 FREELISTS FREELIST_GROUPS
---------- ---------------
         1               1
And there you have it. The busiest table on their 6-node OLTP RAC system is using manual segment space management with a single freelist and a single freelist group. I’m pretty sure this could cause contention problems! But in this case it wasn’t quite what I expected. It looks to me like the single freelist itself wasn’t the point of contention - rather, it was pointing all of the nodes to the same small number of blocks for inserts, and those data blocks were getting fought over. They were probably filling up quickly, so no single block accumulated a large number of waits in the ASH. I haven’t proven that, but it’s my current theory. :) If anyone has another idea then leave a comment and let me know!
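For reference, the usual directions from here would be ASSM (or at least more freelists and freelist groups) so that the instances stop piling into the same handful of insert targets. A minimal sketch of the ASSM route - all names, paths and sizes below are hypothetical, and this is not what was actually done here:

-- Hypothetical sketch: create an ASSM tablespace and move the hot
-- partition into it, then rebuild its local index partitions.
create tablespace bigtable_log_assm
  datafile '/u01/oradata/bigtable_log_assm01.dbf' size 10g
  extent management local
  segment space management auto;

alter table bigtable_log move partition p_2007_09
  tablespace bigtable_log_assm;

-- Moving a partition invalidates its local index partitions:
alter index bigtabl_lg_x_id rebuild partition p_2007_09
  tablespace bigtable_log_assm;
alter index bigtabl_lg_x_change_date rebuild partition p_2007_09
  tablespace bigtable_log_assm;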