? 系統環境

兩節點的RAC：AIX6.1+Oracle 11.2.0.3.3

? AWR裡展示出來的各種症狀（資料來自例項2）

雖然應用沒有報障，但AWR報告裡的各種跡象已經很明顯了

(1) gc buffer busy acquire排進了Top 5 Timed Foreground Events

圖-1

(2) 除去DB CPU在gc buffer busy acquire之後的就是gc cr block busy了

圖-2

(3) 2h21bq1mnc5kd這條sql的耗時裡85%的時間都在等待叢集有關的事件，且領先第二條太多了

圖-3

2h21bq1mnc5kd對應的完整SQL是：select t.* from SD.SYS_PARAMETER t where t.PARAM_CLASS in (:1 , :2 )

且從語句的執行頻次來看1小時內(AWR的取樣間隔是1小時)執行了260790次，約合72次/秒，併發度還是非常高的

(4) % of Capture=89.46表示在系統裡每發生100次的gc buffer busy等待就有89次來自於物件SD.SYS_PARAMETER，該值同樣遙遙領先第二條

圖-4

(5) CR Blocks Received的比例達到了43%，為實現一致性讀，SD.SYS_PARAMETER表上存在大量的跨節點傳輸

圖-5

(6) 最後看下從v$active_session_history獲取的2h21bq1mnc5kd這條sql的gc事件，每秒鐘都有等待產生，爭用的焦點在93號資料檔案的65696號block上

select to_char(sample_time,'yyyymmdd hh24:mi:ss'),event,p1,p2,count(1) from gv$active_session_history where sql_id='2h21bq1mnc5kd' and event in ('gc cr block busy','gc buffer busy acquire') and inst_id=2 group by to_char(sample_time,'yyyymmdd hh24:mi:ss'),event,p1,p2;

圖-5-1

雖然應用沒有變慢，但上述種種跡象已經引起了我的關注，做運維不就是要防患於未然麼。根據AWR取樣到的結果，初步結論如下：

gc buffer busy acquire、gc cr block busy這兩個gc wait event與select t.* from SD.SYS_PARAMETER t where t.PARAM_CLASS in (:1 , :2 )這條SQL有較大關係，該條SQL執行時存在大量的跨節點讀，以實現一致性讀。

（注：上述AWR資料都來自例項2，例項1的AWR報告未現任何異常情況）

? 瞭解gc buffer busy acquire和gc cr block busy

以最簡單的雙節點RAC為例，當例項1發起一條select查詢某個block的時候，如果這個block不在本地的Buffer cache，但是能在例項2的buffer cache裡找到，那麼例項1的LMS程式會透過私網將這個block從例項2的cache獲取到例項1的cache，以避免physical reads，獲取過程中會出現gc cr block 2-way相關的等待事件，這就是cache fusion的基本功能，之後如果這個block沒有被任何一個例項更改，那麼例項1與例項2訪問這個block就只需從本地的buffer cache讀取(假設buffer cache足夠大，block未被flush出buffer cache)，本地讀取的情況下不會發生與gc相關的等待事件，對於同一個block只需跨節點傳輸一次，這是一種比較理想的情況。跨節點讀取除了這種以讀為主的情況外，還常見於以下場景：

例項1和例項2的buffer cache都含有這個block，T1時刻例項1修改了這個block，T2時刻例項2的會話讀取這個block時就會從例項1的buffer cache裡讀過來，過程中例項2會話統計裡就會觀察到gc cr block busy相關的等待事件。

—gc buffer busy acquire

沿用上面例子：例項1和例項2的buffer cache都含有某個block，T1時刻例項1修改了這個block，T2時刻例項2上的會話1讀取這個block，當這個讀取還沒有完成，例項2上的會話2也發起了讀取相同block的操作，這時會話2就會等在gc buffer busy acquire上。例項2同時發起的讀取相同block的會話數越多，我們就越容易觀察到gc buffer busy acquire等待。

—gc cr block busy

仍沿用上面例子：例項1和例項2的buffer cache都含有某個block，T1時刻例項1修改了這個block；T2時刻例項2上的會話1讀取這個block修改前的CR copy，因為CR copy需要透過應用undo record才能構造出來，且構造的過程中必須將改變記入到例項1的online redo，因此例項2會話1讀取的時候在可能會因為如下原因而發生gc cr block busy等待：

CR copy在例項1上構造過慢或者記入online redo過慢

? gc buffer busy acquire與gc cr block busy有何聯絡

從v$active_session_history裡取出某一時刻select t.* from SD.SYS_PARAMETER t where t.PARAM_CLASS in (:1 , :2 )語句【sql_id=2h21bq1mnc5kd】的等待，

col sample_time format a30

col sql_opname format a10

col event format a30

set linesize 150

select t.session_id,t.sample_time,event,p1,p2,blocking_session,xid,sql_opname,in_parse,in_sql_execution from gv$active_session_history t where sample_id=111712902 and sql_id='2h21bq1mnc5kd' and inst_id=2;

圖6

可以看出在10:03:34.682這個時刻存在以下兩條session等待鏈：

17203(gc cr block busy)—>15808(gc buffer busy acquire)—>20203(gc buffer busy acquire)

17203(gc cr block busy)—>12608(gc buffer busy acquire)

之後又看了其它時刻的取樣結果(篇幅關係，不一一列出)，基本都是這個情況，因此判斷是gc cr block busy進一步導致了gc buffer busy acquire，某

一時刻對於同一個data block而言只會有一個會話等在gc cr block busy事件上，其它會話都在等gc buffer busy acquire。這也可以解釋AWR報告裡為

何gc buffer busy acquire等待次數要遠多於gc cr block busy。至此關注重點集中到了gc cr block busy上。

? 定位修改源頭，模擬gc cr block busy等待

從gc cr block busy事件的定義上來看，例項2上的會話在執行select t.* from SD.SYS_PARAMETER t where t.PARAM_CLASS in (:1 , :2 )語句時之所以會遇到

gc cr block busy等待，肯定是例項1上的會話對SD.SYS_PARAMETER表進行了DML操作，且這個修改操作涉及的記錄主要集中在93號資料檔案的65696

號block上，但從開發人員處瞭解到SD.SYS_PARAMETER是一張配置表，平時不會有修改操作，口說無憑，我們還是使用dbms_fga 包捕獲到了針對

SD.SYS_PARAMETER表的修改操作，執行如下語句配置FGA捕獲策略：

dbms_fga.add_policy(object_schema=>'SD',object_name=>'SYS_PARAMETER',policy_name=>'SYS_PARAMETER',enable=>TRUE,statement_types=>'INSERT,

UPDATE, DELETE');

不出五分鐘就有結果了

col timestamp format a30

col sql_text format a70

col userhost format a15

set linesize 180

select timestamp,sql_text,userhost,instance_number from dba_fga_audit_trail where object_name='SYS_PARAMETER' and timestamp > to_date('20160505

00:00:00','yyyymmdd hh24:mi:ss');

圖7

很有規律的每30秒update一次，來自於例項1。既然update的頻率是30秒一次，select語句遇到的gc cr block busy、gc buffer busy acquire等待事件

也應該是每隔30秒出現一波，但現實卻是分分秒秒都有這兩種等待，以下是從v$active_session_history統計出來的資訊：

set linesize 170

col sample_time format a40

col event format a40

select sql_id,sample_time,event,count(1) from v$active_session_history where sample_time between to_date('20160505 9:54:09','yyyymmdd hh24:mi:ss')

and to_date('20160505 9:58:38','yyyymmdd hh24:mi:ss') and sql_id='2h21bq1mnc5kd' and event in ('gc cr block busy','gc buffer busy acquire') group by

sql_id,sample_time,event;

圖8

為此在同版本的一套測試RAC資料庫上進行了一組簡單的測試，過程中用到了三個session，

testnode 1上的session 1：對測試表進行200W次不間斷的select

testnode 1上的session 2：與session 1執行相同的操作，與session 1同時發起

testnode 2上的session 1：在testnode 1的兩個session執行過程中以每30秒一次的頻度update測試表中的一條記錄

testnode 1上的session 3：統計testnode 1 上兩會話的等待事件

先將要用到的指令碼貼出來：

/*+ circle_sel.sql 模擬連續的select行為 *****/

declare

cnt number;

type typ_t0505 is record (username VARCHAR2(30),user_id NUMBER,created DATE);

type lst_t0505 is table of typ_t0505;

v_rec_t0505 lst_t0505;

begin

cnt:=2000000;

while ( cnt > 0 ) loop

select * bulk collect into v_rec_t0505 from poweruser.t0505_1;

cnt:=cnt-1;

end loop;

end;

/

/*+ sess_1st.sql 記錄會話初始event waits *****/

define v_inst1=&1

define v_inst2=&3

define v_sid1=&2

define v_sid2=&4

drop table sess_evt_1st;

create table sess_evt_1st as select inst_id,sid,event,total_waits from gv$session_event where (inst_id=&v_inst1 and sid in (&v_sid1)) or (inst_id=&v_inst2 and

sid in (&v_sid2)) order by inst_id,sid,event;

/*+ sess_2nd.sql 統計會話在兩次執行間的event waits差值 *****/

define v_inst1=&1

define v_inst2=&3

define v_sid1=&2

define v_sid2=&4

drop table sess_evt_2nd;

create table sess_evt_2nd as select inst_id,sid,event,total_waits from gv$session_event where (inst_id=&v_inst1 and sid in (&v_sid1)) or (inst_id=&v_inst2 and

sid in (&v_sid2)) order by inst_id,sid,event;

col event format a30

set linesize 120 pagesize 60

select nd.sid,nd.event,nd.total_waits-nvl(st.total_waits,0) diff_waits from sess_evt_1st st,sess_evt_2nd nd where st.inst_id(+)=nd.inst_id and st.sid(+)=nd.sid

and st.event(+)=nd.event order by event,sid;

測試開始：

---session 1 on testnode 1: 用於執行select

create table poweruser.t0505_1 tablespace TS_AJX_DATATMP as select * from all_users;

select distinct dbms_rowid.rowid_relative_fno(rowid),dbms_rowid.rowid_block_number(rowid) from poweruser.t0505_1; <---確保測試資料都在一個block裡

DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)

------------------------------------ ------------------------------------

103 19515

select s.sid,s.serial# from v$session s,v$process p where s.paddr=p.addr and s.sid=sys_context('userenv','sid');

SID SERIAL#

---------- ----------

3411 58063

---session 2 on testnode 1: 用於執行select

select s.sid,s.serial# from v$session s,v$process p where s.paddr=p.addr and s.sid=sys_context('userenv','sid');

SID SERIAL#

---------- ----------

13559 3597

------session 3 on testnode 1: 執行前的使用者會話統計

@sess_1st.sql 1 3411 1 13559

---session 1 testnode 1 & session 2 testnode 1 : 同時執行

@circle_sel.sql

--- session 1 testnode 2：在testnode1上的兩session執行期間每隔30秒執行一次update

Begin

While ( true ) loop

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

commit;

dbms_lock.sleep(30);

end loop;

end;

/

------session 3 on testnode 1: 執行後的會話統計

@sess_2nd.sql 1 3411 1 13559

圖9

測試結果1：

testnode 1上的兩個session 分別執行耗時分別為118s、120s，統計結果顯示與它們同時執行的update語句並沒有帶來多大影響，gc cr block busy等

待事件都在個位數，當然這也可能是因為測試環境裡模擬的併發量只有2遠小於生產環境裡的實際併發數，再仔細想一下，每30秒update一次，120s

的時間內update最多也就執行4次，也就是說迴圈select期間的至多遇到4次update，從testnode1上的會話統計資訊看，gc cr block busy分別為2、

3，非常接近於4，所以推測這2次、3次的gr cr block busy等待應該正好發生在每次update過後的第一次select裡，從這以後到下一次update執行

前的這段空檔期，不會再有gc cr block busy事件發生。而實際在生產環境表現出的症狀卻是每秒鐘都有gc cr block busy等待。

這個顯著的差異讓我對測試方法稍加調整，把 session 1 testnode 2執行語句改成update後過30秒再commit，調整後的測試過程如下：

---session 1 testnode 1 & session 2 testnode 1 :同時執行

@circle_sel.sql

---session 1 node 2：

Begin

While ( true ) loop

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

dbms_lock.sleep(30);

commit;

end loop;

end;

/

---session 3 node 1: 實時捕捉gc事件的增長情況

select inst_id,sid,event,total_waits from gv$session_event where inst_id=1 and sid in (3411, 13559) and event in ('gc buffer busy acquire','gc cr block busy');

圖10

測試結果2：

上述相鄰查詢間隔大概在1秒鐘，可以看出gc buffer busy acquire，gc cr block busy等待次數快速上升，增長的頻率與update的頻率(30秒一次)並沒

有直接關係。至此測試結果與生產環境中的情況有點接近了，難道在生產環境也存在延時commit的情況？

? 找到生產環境裡update後沒有及時commit的證據

dba_fga_audit_trail檢視有一列SQL_BIND會記錄語句執行時的繫結變數資訊：

col timestamp format a30

col sql_text format a70

col sql_bind format a50

col userhost format a15

set linesize 180

select timestamp,sql_text,sql_bind from dba_fga_audit_trail where object_name='SYS_PARAMETER' and timestamp >

to_date('20160505 00:00:00','yyyymmdd hh24:mi:ss');

圖11

拿上圖的第一行來講在20160505 09:54:08時刻，將繫結變數代入得到完整的update語句是：

update SD.SYS_PARAMETER set PARAM_VALUE ='2016-05-06 09:54:03' where PARAM_CODE='CA_VCSMS_TIME';

所以最簡單的方法就是實時監控dba_fga_audit_trail，每當觀察到dba_fga_audit_trail檢視新進來一條被審計到的SQL及其對應的繫結變數後，立刻執

行select param_value from sd.sys_parameter where param_code='CA_VCSMS_TIME'，來觀察param_value欄位何時變成繫結變數的值，如果一直沒有變

化，那就證明沒有提交，透過這個方法，我們發現commit操作果然是在update之後的30秒才執行的，準確的說應該是在下一條update開始之前commit

前一條update的修改結果。

還有一種能證明延時commit的方法是記錄下dba_fga_audit_trail. TRANSACTIONID，然後到v$transaction根據XID查詢是否存在對應記錄，若能找到就

表明還未提交。

順便也驗證一下：parameter_code=’CA_VCSMS_TIME’所在的記錄就位於爭用最嚴重的file#:93/block#:65696裡

select dbms_rowid.rowid_relative_fno(rowid),dbms_rowid.rowid_block_number(rowid),param_code from SD.SYS_PARAMETER where

PARAM_CODE='CA_VCSMS_TIME';

圖11-1

? 深入gc cr block busy

文章開頭處曾經介紹過其成因與block的CR copy在遠端例項上的構造過程及寫online redo過程有關。展開講，當例項1和例項2的buffer cache都含有某個block，T1時刻例項1修改了這個block，但沒有commit；T2時刻例項2上的會話1讀取這個block，讀取的大致過程分解可以分解為以下幾個步驟：

1）例項2的LMS向例項1的LMS發起block 讀請求；

2）例項1的buffer cache裡已經存在了這個block修改後的最新副本B1’，例項1的 LMS在本地的buffer cache里根據B1’再複製出一個新的副本

B1’’，此時B1’和B1’’的內容是完全相同的；

3）例項1的LMS從undo segment裡找到undo record用來applied到B1’’上，把B1’’回滾到修改前的狀態，記為B1；

4）這時例項1的buffer cache 裡包含了B1’，B1，其中B1是B1’修改前的內容。

5）利用undo record將B1’’回滾到B1這一過程是會產生redo的，例項1 的LMS程式通知Lgwr程式把redo寫入online redolog，寫成功後才能

進入下一步

6）例項1上的Lgwr通知例項1上的LMS redo寫入完成，例項1上的LMS將B1傳輸給例項2的LMS

7）例項2的LMS將結果返回給例項2上的會話1的server process

我們透過如下的簡單測試可以大致觀察到這個過程，還是在同版本的測試RAC環境下進行：

---session 1 testnode 2：執行update但不提交

select s.sid,s.serial# from v$session s,v$process p where s.paddr=p.addr and s.sid=sys_context('userenv','sid');

SID SERIAL#

---------- ----------

7338 3253

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

1 row updated.

select t.xid,t.status,t.USED_UREC,s.sql_id,s.sid,s.serial# from v$transaction t,v$session s where t.ses_addr=s.saddr and s.sid=sys_context('userenv','sid');

圖12

---session 2 testnode 2: 記錄testnode 2上LMS程式的初始統計值

set linesize 150 pagesize 100

select s.sid,n.name,s.value from v$statname n,v$sesstat s where n.statistic#=s.statistic# and s.sid in (select sid from v$session where program like '%LMS%')

and n.name in ('data blocks consistent reads - undo records applied','redo size','redo entries','CR blocks created') order by sid,name;

圖13

---session 1 testnode 1；select測試表

select * from poweruser.t0505_1;

---session 2 testnode 2: 檢查testnode 2上LMS程式的最新統計值

select s.sid,n.name,s.value from v$statname n,v$sesstat s where n.statistic#=s.statistic# and s.sid in (select sid from v$session where program like '%LMS%')

and n.name in ('data blocks consistent reads - undo records applied','redo size','redo entries','CR blocks created') order by sid,name;

圖14

計算差值: 一個RAC節點上有多個LMS程式以負載均衡的方式被使用，本測試中正好是7897這個sid對應的LMS程式被使用到

與前一次查詢結果相比：

CR blocks created：1098103->1098104，增加了1 <--- 因為poweruser.t0505_1表裡只有一個塊是有資料的

data blocks consistent reads - undo records applied：11662578->11662580，增加了2 <--- 與v$transaction.used_urec值對應

redo entries：1058951->1058955，增加了4

redo size：111107920->111108232，增加了312

---session 1 testnode 1；再次select測試表

select * from poweruser.t0505_1;

---session 2 testnode 2: 檢查testnode 2上LMS程式的最新統計值

select s.sid,n.name,s.value from v$statname n,v$sesstat s where n.statistic#=s.statistic# and s.sid in (select sid from v$session where program like '%LMS%')

and n.name in ('data blocks consistent reads - undo records applied','redo size','redo entries','CR blocks created') order by sid,name;

圖15

計算差值：

CR blocks created：增加了1 <--- 因為poweruser.t0505_1表裡只有一個塊是有資料的

data blocks consistent reads - undo records applied：增加了2 <--- 與v$transaction.used_urec值對應

redo entries：增加了1

redo size：增加了64

由此可以看出若遠端節點修改某個block後未提交，那麼本地節點去select遠端節點上的這個block時每次都會引發CR塊構造(CR blocks created)、應

用undo record(data blocks consistent reads - undo records applied)、生成redo(redo entries & redo size )等一系列動作，查詢未提交的block開銷還是比較

大的，主要體現在遠端節點LMS、LGWR程式的cpu使用率上升(之後會有詳細說明)，讀取undo block與寫入online redo時的IO量增大。尤其是在本

地節點高併發select 的時候這一開銷會無限放大。

? 本地節點select遠端節點cache裡未提交的資料塊開銷到底有多大

還是由實驗資料來說話：

(1) 遠端節點修改後及時提交，本地節點兩個session併發select同一張表

---session 1 on testnode2：執行對於測試表的修改並提交

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

commit;

---session 3 on testnode1：記錄會話統計初始值

@sess_1st.sql 1 14678 1 15251

---session 1 on testnode1 & session 2 on testnode1: 同時執行

Set timing on

@circle_sel1.sql <---circle_sel1.sql和之前circle_sel.sql的區別在於執行次數從200W下調至30W，僅為了節省等待時間，更快得出結論

session 1 on testnode1的執行耗時：17.99s

session 2 on testnode1的執行耗時：17.72s

---session 3 on testnode1：輸出會話統計值(僅列出與gc相關的等待)

@sess_2nd.sql 1 14678 1 15251

圖16

(2) 遠端節點修改後未提交，本地節點兩個session併發select同一張表

---session 1 on testnode2：修改測試表但不提交

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

---session 3 on testnode1：記錄會話統計初始值

@sess_1st.sql 1 14678 1 15251

---session 1 on testnode1 & session 2 on testnode1: 同時執行

Set timing on

@circle_sel1.sql

session 1 on testnode1的執行耗時：485.89s

session 2 on testnode1的執行耗時：485.91s

---session 3 on testnode1：輸出會話統計值(僅列出與gc相關的等待)

@sess_2nd.sql 1 14678 1 15251

圖17

附testnode1節點select執行期間testnode2上的LMS、LGWR程式的cpu使用率

圖18

其中8585392是LGWR、5571040是LMS

ps -ef|grep -E "lms|lgwr" | grep -vE "asm|grep"

oracle 2949308 1 1 Apr 19 - 56:08 ora_lms0_shajx2

oracle 6094878 1 0 Apr 19 - 53:54 ora_lms2_shajx2

oracle 8585392 1 0 Apr 19 - 15:11 ora_lgwr_shajx2

oracle 5571040 1 0 Apr 19 - 46:06 ora_lms1_shajx2

(3) 擴充套件一下，本地節點兩個session併發select不同的兩張表

---session 1 on testnode2：分別修改兩張測試表，都不提交

update poweruser.t0505_1 set username=dbms_random.string('u',5) where user_id=0;

update poweruser.t0506 set username=dbms_random.string('u',5) where user_id=0;

---session 3 on testnode1：記錄會話統計初始值

@sess_1st.sql 1 14678 1 15251

---session 1 on testnode1：30W次select表t0505_1

@circle_sel1.sql

---session 2 on testnode1：30W次select表t0506

@circle_sel2.sql

session 1 on testnode1的執行耗時：501.94s

session 2 on testnode1的執行耗時：482.88s

---session 3 on testnode1：輸出會話統計值(僅列出與gc相關的等待)

@sess_2nd.sql 1 14678 1 15251

圖19

測試結論：

對於(1)、(2)兩個場景可以看出遠端節點修改後未提交的場景下測出的select時長為485s遠遠大於提交情況下的17s，且後一種場景下遠端節點的LMS、

LGWR程式分別佔用了1.9%和2.5%的cpu資源，對於32 cores的主機來說這兩個程式佔去了約1.5 cores，況且現在僅僅是兩個併發的情況。

對於(3)場景而言，Select兩張不同的表，仍然出現了各30W次的gc cr block busy等待，且耗時與(2)場景差不多。可見是否select相同的表不重要，

關鍵在於遠端節點的構造cr block->flush redo這個過程是相當耗時的

? 最佳化舉措

短期考慮：

1）修改後及時提交，且修改的操作儘量與select操作放在同一個節點

2）鑑於sd.sys_parameter是一張配置表，配置表裡的記錄應該保持相對穩定，所以將update的動作挪至新建的一張表進行

長遠考慮：

3）把sd.sys_parameter放到App快取裡，避免過於頻繁從資料庫裡進行select

當前已經完成上述第1項最佳化，效果明顯

? 這個案例告訴我們

Cache Fusion是一把雙刃劍，雖然能以節點間的通訊來避免更多的物理讀，但我們也要避免節點間某些不合理的通訊行為，比如對於同一個block的讀和寫分別在兩個節點進行，本文的案例和測試使我們認識到block跨節點傳輸所產生的開銷是如此之大。對於應用設計者而言應該認識到修改後不及時的提交在單節點環境下可能最多會引起row lock，但在多節點RAC環境下，在節點間存在大量通訊的場景下，會對效能產生不可估量的影響。

分析解決11gR2 雙節點RAC環境下的gc cr block busy/gc buffer busy acquire等待

相關文章