gc current request 引起長期鎖表的故障

故障描述：一個數倉系統長期跑批時出現鎖表，鎖表偶爾嚴重時影響整體跑批流程，進而影響第2天業務執行。根據業務請求需要進行最佳化。最佳化目標儘量減少鎖表時間（業務人員最低要求不影響整體跑批流程）。
業務人員訴求分析：找出嚴重鎖表原因，解決鎖表嚴重問題不影響整體跑批即可。
DBA解決思路分析：現象=》AWR/ASH/ADDM報告和呼叫（監控foglight/bmc）等=》結合經驗分析結果=》結合現象+技術原理分析=》最終確定問題原因+解決方案=》長期觀察是否解決問題

1.系統環境
SQL> select * from gv$version;

INST_ID BANNER
---------- --------------------------------------------------------------------------------
2 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
2 PL/SQL Release 11.2.0.4.0 - Production
2 CORE 11.2.0.4.0 Production
2 TNS for IBM/AIX RISC System/6000: Version 11.2.0.4.0 - Production
2 NLSRTL Version 11.2.0.4.0 - Production
1 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
1 PL/SQL Release 11.2.0.4.0 - Production
1 CORE 11.2.0.4.0 Production
1 TNS for IBM/AIX RISC System/6000: Version 11.2.0.4.0 - Production
1 NLSRTL Version 11.2.0.4.0 - Production

10 rows selected.

2.故障時間點確認
Troubleshooting：資料說話。
我們這裡用foglight(DB監控軟體)來抓取資料(CPU/IO/MEMORY波峰)，確定故障時間點。

2.1 以等待事件為基礎資料
7天統計如下：
截圖省略...

9月13號，業務人員報故障的時間段
截圖省略...

==經DBA分析，決定選擇00:00:00--01:00:00和03:00:00--04:00:00進行分析==

3.根據效能分析工具AWR的資料進行分析
00:00:00--01:00:00 AWR報告
3.1 因為是鎖表也是一個種等待事件，所以直接快速關注events

==node A==
Top 10 Foreground Events by Total Wait Time

Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
enq: CF - contention 3,651 13.7K 3746 57.9 Other
enq: TX - row lock contention 51 3610.8 70800 15.3 Application
direct path read 10,652 1705.3 160 7.2 User I/O
db file scattered read 35,561 1056.8 30 4.5 User I/O
direct path write 22,929 989.5 43 4.2 User I/O
DB CPU 530.4 2.2
direct path read temp 18,124 270.6 15 1.1 User I/O
db file sequential read 46,447 264.3 6 1.1 User I/O
enq: HW - contention 3,241 199.1 61 .8 Configuration
gc current block busy 3,794 164.5 43 .7 Cluster

==node B==
Top 10 Foreground Events by Total Wait Time

Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
enq: CF - contention 7,114 6801.1 956 30.9 Other
direct path write 102,606 4393.3 43 19.9 User I/O
direct path read 28,670 4047.9 141 18.4 User I/O
db file scattered read 91,564 2344.5 26 10.6 User I/O
DB CPU 1296 5.9
gc current request 1 383.1 383095 1.7 Cluster
db file parallel read 3,286 284.7 87 1.3 User I/O
db file sequential read 86,524 247 3 1.1 User I/O
CSS initialization 298 222.5 747 1.0 Other
control file parallel write 31,876 148.7 5 .7 System I/O

Performance Degradation as a Result of 'enq: CF - contention' (文件 ID 1072417.1)
Waits for 'Enq: TX - ...' Type Events - Transaction (TX) Lock Example Scenarios (文件 ID 62354.1)

主要問題：
1.enq: CF - contention 推斷分析：redo log 產生過大且頻繁
2.enq: TX - row lock contention 推斷分析：應用設計有問題：長事務，未批次提交

4.驗證推斷分析的結果
檢視等待事件
00:00:00--01:00:00 EVENT TOP 10
INSTANCE_NUMBER EVENT CNT
--------------- ---------------------------------------- ----------
1 enq: CF - contention 1338
2 enq: CF - contention 690
2 direct path write 382
1 enq: TX - row lock contention 360
2 direct path read 321
2 db file scattered read 251
1 db file scattered read 94
1 direct path read 65
2 PX Deq: Table Q Get Keys 64
==此時間段，還有事務進行的，只是 redo log 等待嚴重。

00:00:00--08:00:00 EVENT TOP 10
(08:00:00為KILL LOCK TABLE後正常跑批流程的時間點)
INSTANCE_NUMBER EVENT CNT
--------------- ---------------------------------------- ----------
1 enq: TX - row lock contention 2878
2 gc current request 2171
2 direct path read 2005
1 direct path read 1827
1 enq: CF - contention 1338
2 Disk file operations I/O 900
1 Disk file operations I/O 888
2 enq: CF - contention 690
1 db file sequential read 472

9 rows selected.
==此時間段，基本上已經沒有事務進行了，完成被TX和gc current request等待所佔據。

00:00:00--01:00:00
==node A==
Load Profile

Per Second Per Transaction Per Exec Per Call
DB Time(s): 6.6 3.2 0.40 0.17
DB CPU(s): 0.2 0.1 0.01 0.00
Redo size (bytes): 1,669,685.0 814,982.5

==node B==
Load Profile

Per Second Per Transaction Per Exec Per Call
DB Time(s): 6.1 3.1 0.24 0.17
DB CPU(s): 0.4 0.2 0.01 0.01
Redo size (bytes): 2,170,422.3 1,082,389.6

03:00:00--04:00:00
==node A==
Load Profile

Per Second Per Transaction Per Exec Per Call
DB Time(s): 1.1 0.6 0.06 0.02
DB CPU(s): 0.0 0.0 0.00 0.00
Redo size (bytes): 329,441.6 183,969.1

==node B==
Load Profile

Per Second Per Transaction Per Exec Per Call
DB Time(s): 1.0 0.6 0.10 0.03
DB CPU(s): 0.0 0.0 0.00 0.00
Redo size (bytes): 1,489.5 840.2

00:00:00--01:00:00與03:00:00--04:00:00時間段，每秒產生redo size比值
node A :814,982/183,969 ==4.4
node B :1,082,389/840 ==1288.5
==根據redo size每秒產生的量，可以推斷出3-4這個時間段與0-1時間段的事務量劇減，可以說沒有事務完成，資料庫完全處於一個HANG的狀態下。

下一步看看是誰HANG了整個資料庫
00:00:00--01:00:00
enq: CF - contention =>
SQL_ID COUNT(1)
------------- ----------
b9zh67r40t24q 461
9tw8kwwj97zub 353
c404bhf4rjaum 204
66pb0fp8jt1c4 197
5dj472vrff0k6 127
SQL_TEXT
--------------------------------------------------------------------------------
INSERT /*+append parallel 32*/ INTO TRANSACTION_DETAIL(AGREEMENT_SEQ,CHARGE_CDE,
INSERT /*+append parallel 32*/ INTO SAP_INTERMEDIATE(CONTRACT_TYP,F_ACCOUNT_CD,F
INSERT /*+append parallel 32*/ INTO DC_FILE_DETAIL_RECORD(ACCOUNT_NBR_TRACE,ACCO
INSERT /*+append parallel 32*/ INTO RECEIVABLE_PAID(ACCOUNT_NBR,ACTUAL_REALIZATI
INSERT /*+append parallel 32*/ INTO INCOME_RECOGNITION(AGREEMENT_SEQ,CONTRACT_ID
enq: TX - row lock contention =>
SQL_TEXT
--------------------------------------------------------------------------------
DELETE FROM use_resource_state WHERE resource_addr=:1 AND resource_type=:2 AND r

00:00:00--08:00:00
gc current request =>
SQL_TEXT
--------------------------------------------------------------------------------
insert /*+append*/ into F_NS_PD_BMC_DETAIL(ADDRESS_ONE,ADDRESS_THREE,ADDRE
==據業務人員所述，當晚kill lock table F_NS_PD_BMC_DETAIL,資料庫就正常執行了
==DBA工具分析也關注到了這張問題表F_NS_PD_BMC_DETAIL，這物件伴隨著gc current request等待事件。
所以這個 gc current request 等待事件值得分析。

接下來分析gc current request產生的原因：
分析思路如下：
A.什麼時候開始產生的gc current request等待事件？
B.什麼時候結束了gc current request等待事件？
C.這個gc current request等待事件與BUG是否有關係？

A.什麼時候開始產生的gc current request等待事件？
gc current request 開始
SQL> select * from (
2 select INSTANCE_NUMBER,event,sql_id,to_char(SAMPLE_TIME,'yyyy-mm-dd hh24:mi:ss')
3 from dba_hist_active_sess_history
4 where to_char(SAMPLE_TIME,'yyyy-mm-dd')='2017-09-13'
5 and event like 'gc current request' order by SAMPLE_TIME) where rownum<20;

INSTANCE_NUMBER EVENT SQL_ID TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
2 gc current request fz486aazdcx0h 2017-09-13 00:54:26
2 gc current request fz486aazdcx0h 2017-09-13 00:54:36
2 gc current request fz486aazdcx0h 2017-09-13 00:54:46
2 gc current request fz486aazdcx0h 2017-09-13 00:54:56
2 gc current request fz486aazdcx0h 2017-09-13 00:55:06
2 gc current request fz486aazdcx0h 2017-09-13 00:55:16
2 gc current request fz486aazdcx0h 2017-09-13 00:55:26
2 gc current request fz486aazdcx0h 2017-09-13 00:55:36
2 gc current request fz486aazdcx0h 2017-09-13 00:55:46
2 gc current request fz486aazdcx0h 2017-09-13 00:55:56
2 gc current request fz486aazdcx0h 2017-09-13 00:56:06

INSTANCE_NUMBER EVENT SQL_ID TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
2 gc current request fz486aazdcx0h 2017-09-13 00:56:16
2 gc current request fz486aazdcx0h 2017-09-13 00:56:26
2 gc current request fz486aazdcx0h 2017-09-13 00:56:36
2 gc current request fz486aazdcx0h 2017-09-13 00:56:46
2 gc current request fz486aazdcx0h 2017-09-13 00:56:56
2 gc current request fz486aazdcx0h 2017-09-13 00:57:06
2 gc current request fz486aazdcx0h 2017-09-13 00:57:16
2 gc current request fz486aazdcx0h 2017-09-13 00:57:26

B.什麼時候結束了gc current request等待事件？
gc current request 結束
SQL> select * from (
2 select INSTANCE_NUMBER,event,sql_id,to_char(SAMPLE_TIME,'yyyy-mm-dd hh24:mi:ss')
3 from dba_hist_active_sess_history
4 where to_char(SAMPLE_TIME,'yyyy-mm-dd')='2017-09-13'
5 and event like 'gc current request' order by SAMPLE_TIME desc) where rownum<20;

INSTANCE_NUMBER EVENT SQL_ID TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
2 gc current request fz486aazdcx0h 2017-09-13 06:56:22
2 gc current request fz486aazdcx0h 2017-09-13 06:56:12
2 gc current request fz486aazdcx0h 2017-09-13 06:56:02
2 gc current request fz486aazdcx0h 2017-09-13 06:55:52
2 gc current request fz486aazdcx0h 2017-09-13 06:55:42
2 gc current request fz486aazdcx0h 2017-09-13 06:55:32
2 gc current request fz486aazdcx0h 2017-09-13 06:55:22
2 gc current request fz486aazdcx0h 2017-09-13 06:55:12
2 gc current request fz486aazdcx0h 2017-09-13 06:55:02
2 gc current request fz486aazdcx0h 2017-09-13 06:54:52
2 gc current request fz486aazdcx0h 2017-09-13 06:54:42

INSTANCE_NUMBER EVENT SQL_ID TO_CHAR(SAMPLE_TIME
--------------- ---------------------------------------- ------------- -------------------
2 gc current request fz486aazdcx0h 2017-09-13 06:54:32
2 gc current request fz486aazdcx0h 2017-09-13 06:54:22
2 gc current request fz486aazdcx0h 2017-09-13 06:54:12
2 gc current request fz486aazdcx0h 2017-09-13 06:54:02
2 gc current request fz486aazdcx0h 2017-09-13 06:53:52
2 gc current request fz486aazdcx0h 2017-09-13 06:53:42
2 gc current request fz486aazdcx0h 2017-09-13 06:53:32
2 gc current request fz486aazdcx0h 2017-09-13 06:53:22

==SQL_ID='fz486aazdcx0h' 從 2017-09-13 00:54:26 開始，每10秒鐘出現一次gc current request等待。直到2017-09-13 06:56:22，kill lock table後結束gc current request等待事件。
顯然 insert /*+append*/ into F_NS_PD_BMC_DETAIL（跑批中主要步驟）沒有完成，是與gc current request 有著密不可分的關係。

解決gc current request問題就可以解決跑批的問題了。

C.這個gc current request等待事件與BUG是否有關係？
已經定位根本問題了，gc current request問題。
ETL process hangs when using an Oracle 11g RAC database (文件 ID 1996100.1)

解決辦法：
Option 1

Set the "_gc_read_mostly_locking" parameter to FALSE using the following steps:

a. Stop STARETL Process. (If running on linux, you can check the process by running "ps -ef | grep star". If any process is found it can be terminated using the "kill" command)

b. Connect Oracle Database Hosting The STARUSER schema as the "SYS" User.

c. Backup the content of spfile to pfile and set _gc_read_mostly_locking to FALSE using the below commands:

SQL> create pfile='/oracle/product/base/oradb/admin/P6_test/adump/bakspfile.ora' from spfile;
SQL> alter system set "_gc_read_mostly_locking"=FALSE scope=spfile;
d. Restart Oracle Database hosting the STARUSER schema.

e. Rerun STARETL.

enq: TX - row lock contention =>
SQL_TEXT
--------------------------------------------------------------------------------
DELETE FROM use_resource_state WHERE resource_addr=:1 AND resource_type=:2 AND r

經7天觀察，跑批流程一切正常再沒有出現進行不去的情況。已經解決了業務人員的基本要求。

DBA分析問題，不能已解決問題為標準，其實構架上應用上也是存在問題的。所以，繼續研究一下。

在分析過程中，還發現不少問題。
1.鎖爭用嚴重。RAC並行使用有問題。分庫分表應用構架有問題。同時的業務模組（insert into），同時在2個節點進行插入操作。
2.logfile 檔案爭用嚴重；批次insert，沒有分批造成過多的redo log和archive log 。
3.長時間的enq: TX - row lock contention 竟然是一個程式模組引起的。寫死了，沒有提交動作程式也是奇葩。
DELETE FROM use_resource_state WHERE resource_addr=:1 AND resource_type=:2 AND ... 不 commit,也是一個奇葩的程式了。

########################################################################################
版權所有，文章允許轉載，但必須以連結方式註明源地址，否則追究法律責任!【QQ交流群：53993419】
QQ：14040928 E-mail：dbadoudou@163.com
本文連結： http://blog.itpub.net/26442936/viewspace-2145624/
########################################################################################

gc current request 引起長期鎖表的故障

相關文章