【故障處理】佇列等待之TX - allocate ITL entry引起的死鎖處理
【故障處理】佇列等待之TX - allocate ITL entry引起的死鎖處理
1 BLOG文件結構圖
2 前言部分
2.1 導讀和注意事項
各位技術愛好者,看完本文後,你可以掌握如下的技能,也可以學到一些其它你所不知道的知識,~O(∩_∩)O~:
① enq: TX - allocate ITL entry等待事件的解決
② 一般等待事件的解決辦法
③ 佇列等待的基本知識
④ ITL死鎖解決
⑤ ITL死鎖模擬
⑥ Merge語句的非關聯形式的查詢最佳化
Tips:
① 本文在itpub(http://blog.itpub.net/26736162)、部落格園(http://www.cnblogs.com/lhrbest)和微信公眾號(xiaomaimiaolhr)有同步更新。
② 文章中用到的所有程式碼,相關軟體,相關資料請前往小麥苗的雲盤下載(http://blog.itpub.net/26736162/viewspace-1624453/)。
③ 若網頁文章程式碼格式有錯亂,推薦使用360瀏覽器,也可以下載pdf格式的文件來檢視,pdf文件下載地址:http://blog.itpub.net/26736162/viewspace-1624453/,另外itpub格式顯示有問題,也可以去部落格園地址閱讀。
④ 本篇BLOG中命令的輸出部分需要特別關注的地方我都用灰色背景和粉紅色字型來表示,比如下邊的例子中,thread 1的最大歸檔日誌號為33,thread 2的最大歸檔日誌號為43是需要特別關注的地方;而命令一般使用黃色背景和紅色字型標注;對程式碼或程式碼輸出部分的注釋一般採用藍色字型表示。
List of Archived Logs in backup set 11 Thrd Seq Low SCN Low Time Next SCN Next Time ---- ------- ---------- ------------------- ---------- --------- 1 32 1621589 2015-05-29 11:09:52 1625242 2015-05-29 11:15:48 1 33 1625242 2015-05-29 11:15:48 1625293 2015-05-29 11:15:58 2 42 1613951 2015-05-29 10:41:18 1625245 2015-05-29 11:15:49 2 43 1625245 2015-05-29 11:15:49 1625253 2015-05-29 11:15:53
[ZHLHRDB1:root]:/>lsvg -o T_XLHRD_APP1_vg rootvg [ZHLHRDB1:root]:/> 00:27:22 SQL> alter tablespace idxtbs read write;
====》2097152*512/1024/1024/1024=1G |
本文如有錯誤或不完善的地方請大家多多指正,ITPUB留言或QQ皆可,您的批評指正是我寫作的最大動力。
3 故障分析及解決過程
3.1 故障環境介紹
專案 |
source db |
db 型別 |
RAC |
db version |
11.2.0.3.0 |
db 儲存 |
ASM |
OS版本及kernel版本 |
AIX 64位 7.1.0.0 |
3.2 故障發生現象及故障分析解決
早上剛來上班,同事就發了一個SQL過來,說是有鎖,然後我就查了查系統裡的鎖,結果一個鎖都沒得。好吧,還是得乾點事的,先看看SQL語句:
MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM TLHR.TLHRBOKBAL_TMP A,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND A.BOOKACCOUNT LIKE '13500000%') T
ON (S.BOOKACCOUNT = T.BOOKACCOUNT)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00;
看起來是一個MERGE語句,按照小麥苗以前的經驗,這一類的SQL最好是修改為MERGE的非關聯形式比較好,我們先看看執行計劃有沒有問題:
先找到SQL_ID為53qv858pwwwwb:
SELECT a.ELAPSED_TIME,a.EXECUTIONS,a.* FROM v$sql a WHERE a.SQL_TEXT LIKE '%MERGE INTO TLHRBOKBAL S%' AND A.SQL_TEXT LIKE '%13500000%' ;
查詢歷史執行計劃:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(SQL_ID => '53qv858pwwwwb' )) ;
Plan hash value: 2695089823
------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------------ | 0 | MERGE STATEMENT | | | | | 155K(100)| | | 1 | MERGE | TLHRBOKBAL | | | | | | | 2 | VIEW | | | | | | | | 3 | HASH JOIN RIGHT OUTER | | 153K| 15M| | 155K (2)| 00:31:04 | | 4 | VIEW | | 1 | 31 | | 6 (0)| 00:00:01 | | 5 | SORT GROUP BY | | 1 | 41 | | 6 (0)| 00:00:01 | | 6 | TABLE ACCESS BY INDEX ROWID| TLHRBOKBALJN | 1 | 41 | | 6 (0)| 00:00:01 | | 7 | INDEX RANGE SCAN | PK_TLHRBOKBALJN | 2 | | | 4 (0)| 00:00:01 | | 8 | HASH JOIN | | 153K| 10M| 5848K| 155K (2)| 00:31:04 | | 9 | TABLE ACCESS FULL | TLHRBOKBAL_TMP | 153K| 4048K| | 85415 (2)| 00:17:05 | | 10 | TABLE ACCESS FULL | TLHRBOKBAL | 305K| 13M| | 68755 (3)| 00:13:46 | ------------------------------------------------------------------------------------------------------------ |
可以看到,該執行計劃的順序為【7-->6-->5-->4-->9-->10-->8-->3-->2-->1-->0】,而耗費效能的地方在9、10、8這3個步驟上,走的是全表掃描,我們先看看2個大表的資料量:
SELECT COUNT(*) FROM TLHR.TLHRBOKBAL_TMP A WHERE A.BOOKACCOUNT LIKE '13500000%'; --306043/38998765
SELECT COUNT(*) FROM TLHR.TLHRBOKBAL A WHERE A.BOOKACCOUNT LIKE '13500000%'; --306043/38826275
從3000萬的資料裡邊取出30萬的資料,還是比較少的,所以應該去走索引的,看了一下統計資訊,也是最新收集的,好吧,算了,先修改一下SQL讓其走索引掃描看看,:
MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT S.ROWID ROWIDS,
A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM (SELECT /*+index(NB,PK_TLHRBOKBAL_TMP)*/NB.CURRBALANCE,NB.BOOKACCOUNT
FROM TLHR.TLHRBOKBAL_TMP NB
WHERE NB.BOOKACCOUNT LIKE '13500000%') A,
TLHR.TLHRBOKBAL S,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND S.BOOKACCOUNT = A.BOOKACCOUNT
AND S.BOOKACCOUNT LIKE '13500000%') T
ON (T.ROWIDS = S.ROWID)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00
Plan Hash Value : 273017430 ---------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 152885 | 4280780 | 283362 | 00:56:41 | | 1 | MERGE | TLHRBOKBAL | | | | | | 2 | VIEW | | | | | | | 3 | NESTED LOOPS | | 152885 | 20945245 | 283362 | 00:56:41 | | * 4 | HASH JOIN RIGHT OUTER | | 152885 | 14065420 | 130342 | 00:26:05 | | 5 | VIEW | | 124 | 3844 | 15668 | 00:03:09 | | 6 | SORT GROUP BY | | 124 | 5084 | 15668 | 00:03:09 | | * 7 | TABLE ACCESS BY INDEX ROWID | TLHRBOKBALJN | 124 | 5084 | 15668 | 00:03:09 | | * 8 | INDEX RANGE SCAN | PK_TLHRBOKBALJN | 165 | | 15501 | 00:03:07 | | * 9 | HASH JOIN | | 152885 | 9325985 | 114671 | 00:22:57 | | 10 | TABLE ACCESS BY INDEX ROWID | TLHRBOKBAL_TMP | 153563 | 4146201 | 112930 | 00:22:36 | | * 11 | INDEX RANGE SCAN | PK_TLHRBOKBAL_TMP | 153563 | | 1159 | 00:00:14 | | * 12 | INDEX RANGE SCAN | PK_TLHRBOKBAL | 152884 | 5198056 | 1117 | 00:00:14 | | 13 | TABLE ACCESS BY USER ROWID | TLHRBOKBAL | 1 | 45 | 1 | 00:00:01 | ----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id): ------------------------------------------ * 4 - access("NB"."BOOKACCOUNT"="B"."BOOKACCOUNT"(+)) * 7 - filter("T1"."REASON"='0' AND "T1"."ONLINEFLAG"='1' OR "T1"."REASON"='1' OR "T1"."REASON"='2') * 8 - access("T1"."BOOKACCOUNT" LIKE '13500000%' AND "T1"."TRANDATE"='20150901') * 8 - filter("T1"."BOOKACCOUNT" LIKE '13500000%' AND "T1"."TRANDATE"='20150901') * 9 - access("S"."BOOKACCOUNT"="NB"."BOOKACCOUNT") * 11 - access("NB"."BOOKACCOUNT" LIKE '13500000%') * 11 - filter("NB"."BOOKACCOUNT" LIKE '13500000%') * 12 - access("S"."BOOKACCOUNT" LIKE '13500000%') * 12 - filter("S"."BOOKACCOUNT" LIKE '13500000%') |
執行計劃中,基本都走了索引了,跑了一下,大約1分種多,但是裡邊有個HINTS,分析了一下表TLHRBOKBAL_TMP上的索引情況,發現是個主鍵索引,且有2個列(BOOKACCOUNT,CURRENCY),但是不包含列CURRBALANCE,可能是Oracle覺得回表讀的耗費比較大吧,那這裡可以使用虛擬索引測試一下索引的效能:
CREATE INDEX IX_VI01_ID ON TLHR.TLHRBOKBAL_TMP(CURRBALANCE, CURRENCY,BOOKACCOUNT) NOSEGMENT;
ALTER SESSION SET "_USE_NOSEGMENT_INDEXES"=TRUE;
EXPLAIN PLAN FOR MERGE INTO TLHR.TLHRBOKBAL S
USING (SELECT S.ROWID ROWIDS,
A.BOOKACCOUNT AS BOOKACCOUNT,
(A.CURRBALANCE + NVL(B.BAL, 0.00)) AS BANKAMT
FROM (SELECT NB.CURRBALANCE,NB.BOOKACCOUNT
FROM TLHR.TLHRBOKBAL_TMP NB
WHERE NB.BOOKACCOUNT LIKE '13500000%') A,
TLHR.TLHRBOKBAL S,
(SELECT T1.BOOKACCOUNT AS BOOKACCOUNT,
SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL
FROM TLHR.TLHRBOKBALJN T1
WHERE T1.BOOKACCOUNT LIKE '13500000%'
AND T1.TRANDATE = '20150901'
AND (T1.REASON IN ('1', '2') OR
(T1.REASON = '0' AND T1.ONLINEFLAG = '1'))
GROUP BY T1.BOOKACCOUNT) B
WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+)
AND S.BOOKACCOUNT = A.BOOKACCOUNT
AND S.BOOKACCOUNT LIKE '13500000%') T
ON (T.ROWIDS = S.ROWID)
WHEN MATCHED THEN
UPDATE
SET S.LASTBALANCE = T.BANKAMT,
S.CURRBALANCE = T.BANKAMT,
S.DEBITAMT = 0.00,
S.CREDITAMT = 0.00;
SELECT * FROM TABLE(DBMS_XPLAN.display);
Plan hash value: 983878991
------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 152K| 4180K| | 170K (1)| 00:34:06 | | 1 | MERGE | TLHRBOKBAL | | | | | | | 2 | VIEW | | | | | | | | 3 | NESTED LOOPS | | 152K| 19M| | 170K (1)| 00:34:06 | |* 4 | HASH JOIN RIGHT OUTER | | 152K| 13M| | 17421 (1)| 00:03:30 | | 5 | VIEW | | 124 | 3844 | | 15668 (1)| 00:03:09 | | 6 | SORT GROUP BY | | 124 | 5084 | | 15668 (1)| 00:03:09 | |* 7 | TABLE ACCESS BY INDEX ROWID| TLHRBOKBALJN | 124 | 5084 | | 15668 (1)| 00:03:09 | |* 8 | INDEX RANGE SCAN | PK_TLHRBOKBALJN | 165 | | | 15501 (1)| 00:03:07 | |* 9 | HASH JOIN | | 152K| 9107K| 5856K| 1750 (1)| 00:00:22 | |* 10 | INDEX FAST FULL SCAN | IX_VI01_ID | 153K| 4049K| | 9 (0)| 00:00:01 | |* 11 | INDEX RANGE SCAN | PK_TLHRBOKBAL | 152K| 5076K| | 1117 (1)| 00:00:14 | | 12 | TABLE ACCESS BY USER ROWID | TLHRBOKBAL | 1 | 45 | | 1 (0)| 00:00:01 | -------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id): ---------------------------------------------------
4 - access("NB"."BOOKACCOUNT"="B"."BOOKACCOUNT"(+)) 7 - filter("T1"."REASON"='0' AND "T1"."ONLINEFLAG"='1' OR "T1"."REASON"='1' OR "T1"."REASON"='2') 8 - access("T1"."BOOKACCOUNT" LIKE '13500000%' AND "T1"."TRANDATE"='20150901') filter("T1"."BOOKACCOUNT" LIKE '13500000%' AND "T1"."TRANDATE"='20150901') 9 - access("S"."BOOKACCOUNT"="NB"."BOOKACCOUNT") 10 - filter("NB"."BOOKACCOUNT" LIKE '13500000%') 11 - access("S"."BOOKACCOUNT" LIKE '13500000%') filter("S"."BOOKACCOUNT" LIKE '13500000%') |
說明建立3個列的索引是可以的。我們先將該虛擬索引刪除DROP INDEX IX_VI01_ID;
3.2.1 ITL死鎖問題解決
另外一個問題,是開發說上邊的SQL語句產生了死鎖,起初我還半信半疑,先去告警日誌中用命令(more alert* | grep Deadlock)搜了一下:
結果發現很多的死鎖,拿到相關的檔案,看到如下一段:
user session for deadlock lock 0x7000008094d14e0 sid: 332 ser: 47221 audsid: 991000 user: 84/TLHR flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC pid: 101 O/S info: user: grid, term: UNKNOWN, ospid: 6489034 image: oracle@ZFTLHRDB1 client details: O/S info: user: TLHR, term: , ospid: 34406578 machine: ZFTLHRAP1 program: bat_CheckBookBal@ZFTLHRAP1 (TNS V1-V3) application name: bat_CheckBookBal@ZFTLHRAP1 (TNS V1-V3), hash value=446537749 current SQL: MERGE INTO TLHRBOKBAL S USING (SELECT A.BOOKACCOUNT AS BOOKACCOUNT, (A.CURRBALANCE + nvl(B.BAL, 0.00)) AS BANKAMT FROM TLHRBOKBAL_TMP A, (SELECT T1.BOOKACCOUNT AS BOOKACCOUNT, SUM(DECODE(T1.DCFLAG, 'D', -T1.AMT, 'C', T1.AMT, 0)) AS BAL FROM TLHRBOKBALJN T1 WHERE T1.BOOKACCOUNT LIKE '13450000'||'%' AND T1.TRANDATE='20160901' AND (T1.REASON = '2' OR T1.REASON = '1' OR (T1.REASON = '0' AND T1.ONLINEFLAG = '1')) GROUP BY T1.BOOKACCOUNT) B WHERE A.BOOKACCOUNT = B.BOOKACCOUNT(+) AND A.BOOKACCOUNT LIKE '13450000'||'%') T ON (S.BOOKACCOUNT = T.BOOKACCOUNT) WHEN MATCHED THEN UPDATE SET S.LASTBALANCE=T.BANKAMT,S.CURRBALANCE=T.BANKAMT,S.DEBITAMT=0.00,S.CREDITAMT=0.00 DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK possible owner[101.6489034] on resource TX-00EE0009-00005EA6
*** 2016-09-01 18:30:38.014 Submitting asynchronized dump request [28]. summary=[ges process stack dump (kjdglblkrdm1)]. Global blockers dump end:----------------------------------- Global Wait-For-Graph(WFG) at ddTS[0.2fe0] : BLOCKED 0x7000008e9c8bc28 3 wq 2 cvtops x1 TX 0x159001e.0x2379(ext 0x5,0x0)[1006-0065-0019365C] inst 1 BLOCKER 0x700000809ab4b28 3 wq 1 cvtops x28 TX 0x159001e.0x2379(ext 0x5,0x0)[2005-005E-00185E15] inst 2 BLOCKED 0x700000891b48708 3 wq 2 cvtops x1 TX 0x1c2001d.0x4b82(ext 0x2,0x0)[2005-005E-00185E15] inst 2 BLOCKER 0x7000008e9c8b148 3 wq 1 cvtops x28 TX 0x1c2001d.0x4b82(ext 0x2,0x0)[1004-004D-0000C03E] inst 1 BLOCKED 0x70000089a636970 3 wq 2 cvtops x1 TX 0x1c0000b.0x18f6(ext 0x2,0x0)[1004-004D-0000C03E] inst 1 BLOCKER 0x7000008e9c8b4e8 3 wq 1 cvtops x28 TX 0x1c0000b.0x18f6(ext 0x2,0x0)[1005-0058-000DD3D9] inst 1 BLOCKED 0x700000891d5fc50 3 wq 2 cvtops x1 TX 0xee0009.0x5ea6(ext 0x2,0x0)[1005-0058-000DD3D9] inst 1 BLOCKER 0x7000008094d14e0 3 wq 1 cvtops x28 TX 0xee0009.0x5ea6(ext 0x2,0x0)[1006-0065-0019365C] inst 1 * Cancel deadlock victim lockp 0x7000008e9c8bc28
*** 2016-09-01 18:30:43.001 kjddt2vb: valblk [0.2fe1] > local ts [0.2fe0]
*** 2016-09-01 18:30:47.000 kjddt2vb: valblk [0.2fe5] > local ts [0.2fe4]
*** 2016-09-01 18:40:38.062 kjddt2vb: valblk [0.2ff1] > local ts [0.2ff0]
*** 2016-09-01 18:42:01.084 kjddt2vb: valblk [0.2ff4] > local ts [0.2ff3] 2016-09-01 22:33:52.213848 : Setting 3-way CR grants to 0 global-lru off? 0
*** 2016-09-01 22:34:23.163 2016-09-01 22:34:23.163681 : Setting 3-way CR grants to 1 global-lru off? 0 2016-09-01 22:50:00.603305 : Setting 3-way CR grants to 0 global-lru off? 0
*** 2016-09-01 22:51:33.104 2016-09-01 22:51:33.104615 : Setting 3-way CR grants to 1 global-lru off? 0 2016-09-02 05:30:18.751891 : Setting 3-way CR grants to 0 global-lru off? 0 2016-09-02 05:49:01.360730 : Setting 3-way CR grants to 1 global-lru off? 0 2016-09-02 10:28:55.429293 : Setting 3-way CR grants to 0 global-lru off? 0 |
果然,產生死鎖的SQL還是上邊分析最佳化的SQL,其中會話資訊為:(332,47221),我們去DBA_HIST_ACTIVE_SESS_HISTORY檢視裡查詢:
SELECT D.SQL_ID, D.CURRENT_OBJ#, D.EVENT, COUNT(1)
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND D.SESSION_ID = 332
AND D.SESSION_SERIAL# = 47221
GROUP BY D.SQL_ID, D.CURRENT_OBJ#, D.EVENT;
可以看到該會話的等待事件是enq: TX - allocate ITL entry。可以猜測是由於ITL事務槽引起的問題。
SELECT DISTINCT D.BLOCKING_SESSION, D.BLOCKING_SESSION_SERIAL#, D.SQL_ID
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.EVENT = 'enq: TX - allocate ITL entry'
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND D.SESSION_ID = 332
AND D.SESSION_SERIAL# = 47221;
可以看出會話(332,47221)共阻塞了3個會話,由於有死鎖,那麼我們看看上邊查詢出來的3個會話阻塞了哪些會話:
SELECT DISTINCT D.INSTANCE_NUMBER,
D.SESSION_ID,
D.SESSION_SERIAL#,
D.BLOCKING_INST_ID,
D.BLOCKING_SESSION,
D.BLOCKING_SESSION_SERIAL#,
D.SQL_ID
FROM DBA_HIST_ACTIVE_SESS_HISTORY D
WHERE D.SAMPLE_TIME BETWEEN
TO_DATE('2016-09-01 18:25:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2016-09-01 18:45:00', 'YYYY-MM-DD HH24:MI:SS')
AND D.EVENT = 'enq: TX - allocate ITL entry'
AND D.BLOCKING_SESSION_STATUS = 'VALID'
AND ((D.SESSION_ID = 332 AND D.SESSION_SERIAL# = 47221) OR
(D.SESSION_ID = 2602 AND D.SESSION_SERIAL# = 4343) OR
(D.SESSION_ID = 2995 AND D.SESSION_SERIAL# = 46891) OR
(D.SESSION_ID = 1894 AND D.SESSION_SERIAL# = 30761));
可以看到,1894和2602相互阻塞(綠色表示),332和2602相互阻塞(紅色表示),2995和332相互阻塞(粉色表示),這麼多的相互阻塞就產生了死鎖,這裡由於SQL_ID不同,而且產生的等待事件是enq: TX - allocate ITL entry,所以推測出生成的是ITL死鎖。
解決這類問題就是增大ini_trans和PCT_FREE的值。
SELECT * FROM DBA_TABLES D WHERE D.TABLE_NAME = 'TLHRBOKBAL';
可以看到,ini_trans和PCT_FREE的值都是預設的,太小了,根據MOS(Troubleshooting waits for 'enq: TX - allocate ITL entry' (Doc ID 1472175.1) 地址:http://blog.itpub.net/26736162/viewspace-2124531/)我們需要修改該引數,SQL如下:
ALTER TABLE TLHR.TLHRBOKBAL PCTFREE 20 INITRANS 16 ;
ALTER TABLE TLHR.TLHRBOKBAL MOVE NOLOGGING PARALLEL 12;
ALTER TABLE TLHR.TLHRBOKBAL LOGGING NOPARALLEL;
ALTER INDEX TLHR.PK_TLHRBOKBAL REBUILD PCTFREE 20 INITRANS 16 NOLOGGING PARALLEL 12;
ALTER INDEX TLHR.PK_TLHRBOKBAL LOGGING NOPARALLEL;
由於表裡有3000W的資料量,開了並行,本來我預估的是5分鐘,結果move表的時候10秒都不到還是比較快的。
調整之後的值:
SELECT * FROM DBA_TABLES D WHERE D.TABLE_NAME = 'TLHRBOKBAL';
SELECT * FROM dba_indexes d WHERE d.index_name='PK_TLHRBOKBAL';
修改已經生效,接下來就看開發那邊是否還報死鎖的錯誤,這個等待需要明天看了。
終於等到第2天了,看來沒有報錯了:
4 這裡我們模擬一個ITL死鎖
有人的地方就有江湖,有資源阻塞的地方就可能有死鎖。所謂死鎖: 是指兩個或兩個以上的程式在執行過程中,因爭奪資源而造成的一種互相等待的現象,若無外力作用,它們都將無法推進下去。此時稱系統處於死鎖狀態或系統產生了死鎖,這些永遠在互相等待的程式稱為死鎖程式。其最常見的死鎖的型別分為:行級鎖(row-level locks)和塊級鎖(block-level locks),這裡的行級鎖其實就是指的ITL死鎖。有關死鎖的問題,有許多需要介紹的,這篇blog主要是故障處理,所以這裡我們模擬一個ITL死鎖產生的過程即可,後邊我會系統的發一次有關死鎖的內容,還有ITL的內容,希望大家持續關注小麥苗的微信公眾號(xiaomaimiaolhr)。
實驗部分:
實驗的設計過程來源於網路!
我們首先建立一張表T_ITL_LHR,這裡指定PCTFREE為0,INITRANS為1,就是為了觀察到ITL的真實等待情況,然後我們給這些塊內插入資料,把塊填滿,讓它不能有空間分配。
SYS@lhrdb21> SELECT * FROM V$VERSION; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production PL/SQL Release 11.2.0.4.0 - Production CORE 11.2.0.4.0 Production TNS for IBM/AIX RISC System/6000: Version 11.2.0.4.0 - Production NLSRTL Version 11.2.0.4.0 - Production SYS@lhrdb21> SHOW PARAMETER CLUSTER
NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ cluster_database boolean TRUE cluster_database_instances integer 2 cluster_interconnects string
SYS@lhrdb21> CREATE TABLE T_ITL_LHR(A INT) PCTFREE 0 INITRANS 1; Table created. SYS@lhrdb21> BEGIN 2 FOR I IN 1 .. 2000 LOOP 3 INSERT INTO T_ITL_LHR VALUES (I); 4 END LOOP; 5 END; 6 /
PL/SQL procedure successfully completed.
SYS@lhrdb21> COMMIT;
Commit complete.
|
我們檢查資料填充的情況:
SYS@lhrdb21> SELECT F, B, COUNT(*) 2 FROM (SELECT DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) F, 3 DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) B 4 FROM T_ITL_LHR) 5 GROUP BY F, B 6 ORDER BY F,B;
F B COUNT(*) ---------- ---------- ---------- 1 94953 734 1 94954 734 1 94955 532 |
可以發現,這2000條資料分佈在3個塊內部,其中有2個塊(94953和94954)填滿了,一個塊(94955)是半滿的。因為有2個ITL槽位,我們需要拿2個滿的資料塊,4個程式來模擬ITL死鎖:
實驗步驟 |
會話 |
SID |
要更新的塊號 |
要更新的行號 |
是否有阻塞 |
步驟一 |
1 |
19 |
94953 94953 |
1 |
N |
2 |
79 |
2 |
N |
||
3 |
78 |
94954 94954 |
1 |
N |
|
4 |
139 |
2 |
N |
會話1:
SYS@lhrdb21> SELECT USERENV('SID') FROM DUAL; USERENV('SID') -------------- 19
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94953 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=1;
1 row updated. |
會話2:
SYS@lhrdb21> SELECT USERENV('SID') FROM DUAL; USERENV('SID') -------------- 79
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94953 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=2;
1 row updated. |
會話3:
SYS@lhrdb21> SELECT USERENV('SID') FROM DUAL; USERENV('SID') -------------- 78
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94954 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=1;
1 row updated. |
會話4:
SYS@lhrdb21> SELECT USERENV('SID') FROM DUAL;
USERENV('SID') -------------- 139
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94954 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=2;
1 row updated. |
這個時候系統不存在阻塞,
SELECT NVL(A.SQL_ID, A.PREV_SQL_ID) SQL_ID,
A.BLOCKING_SESSION,
A.SID,
A.SERIAL#,
A.LOGON_TIME,
A.EVENT
FROM GV$SESSION A
WHERE A.SID IN (19, 79,78,139)
ORDER BY A.LOGON_TIME;
以上4個程式把2個不同塊的4個ITL槽位給消耗光了,現在的情況,就是讓他們互相鎖住,達成死鎖條件,回到會話1,更新塊94954,注意,以上4個操作,包括以下的操作,更新的根本不是同一行資料,主要是為了防止出現的是行鎖等待。
實驗步驟 |
會話 |
SID |
要更新的塊號 |
要更新的行號 |
是否有阻塞 |
步驟一 |
1 |
19 |
94953 94953 |
1 |
N |
2 |
79 |
2 |
N |
||
3 |
78 |
94954 94954 |
1 |
N |
|
4 |
139 |
2 |
N |
||
步驟二 |
1 |
19 |
94954 |
3 |
Y |
3 |
78 |
94953 |
3 |
Y |
會話1:
UPDATE T_ITL_LHR SET A=A
WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94954
AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=3;
會話1出現了等待。
會話3:
UPDATE T_ITL_LHR SET A=A
WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94953
AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=3;
會話3發現出現了等待。
我們查詢阻塞的具體情況:
SELECT NVL(A.SQL_ID, A.PREV_SQL_ID) SQL_ID,
A.BLOCKING_SESSION,
A.SID,
A.SERIAL#,
A.LOGON_TIME,
A.EVENT
FROM GV$SESSION A
WHERE A.SID IN (19, 79,78,139)
ORDER BY A.LOGON_TIME;
可以看到,會話1被會話4阻塞了,會話3被會話2阻塞了。
注意,如果是9i,在這裡就報死鎖了,但是在10g裡面,這個時候,死鎖是不會發生的,因為這裡的會話1還可以等待會話4釋放資源,會話3還可以等待會話2釋放資源,只要會話2與會話4釋放了資源,整個環境又活了,那麼我們需要把這兩個程式也塞住。
實驗步驟 |
會話 |
SID |
要更新的塊號 |
要更新的行號 |
是否有阻塞 |
步驟一 |
1 |
19 |
94953 94953 |
1 |
N |
2 |
79 |
2 |
N |
||
3 |
78 |
94954 94954 |
1 |
N |
|
4 |
139 |
2 |
N |
||
步驟二 |
1 |
19 |
94954 |
3 |
Y |
3 |
78 |
94953 |
3 |
Y |
|
步驟三 |
2 |
79 |
94954 |
4 |
Y |
4 |
139 |
94953 |
4 |
Y |
會話2,注意,我們也不是更新的同一行資料:
UPDATE T_ITL_LHR SET A=A
WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94954
AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=4;
會話2出現了等待,具體阻塞情況:
我做了幾次實驗,會話2執行完SQL後,會話3到這裡就報出了死鎖,但有的時候並沒有產生死鎖,應該跟系統的阻塞順序有關,若沒有產生死鎖,我們可以繼續會話4的操作。
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94953 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=3;
UPDATE T_ITL_LHR SET A=A * ERROR at line 1: ORA-00060: deadlock detected while waiting for resource |
會話4,注意,我們也不是更新的同一行資料:
UPDATE T_ITL_LHR SET A=A
WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94953
AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=4;
會話4發現出現了等待。
雖然,以上的每個更新語句,更新的都不是同一個資料行,但是,的確,所有的程式都被阻塞住了,那麼,死鎖的條件也達到了,等待一會(這個時間有個隱含引數來控制的:_lm_dd_interval),我們可以看到,會話2出現提示,死鎖:
SYS@lhrdb21> UPDATE T_ITL_LHR SET A=A 2 WHERE DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)=94954 3 AND DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)=4;
UPDATE T_ITL_LHR SET A=A * ERROR at line 1: ORA-00060: deadlock detected while waiting for resource |
報出死鎖之後的阻塞情況:
我們可以在會話2上繼續執行步驟三中的SQL,依然會產生死鎖。生成死鎖後,在告警日誌中有下邊的語句:
Fri Sep 09 17:56:55 2016 Global Enqueue Services Deadlock detected. More info in file /oracle/app/oracle/diag/rdbms/lhrdb2/lhrdb21/trace/lhrdb21_lmd0_17039368.trc. |
其中的內容有非常經典的一段Global Wait-For-Graph(WFG):
*** 2016-09-09 17:48:22.216 Submitting asynchronized dump request [1c]. summary=[ges process stack dump (kjdglblkrdm1)]. Global blockers dump end:----------------------------------- Global Wait-For-Graph(WFG) at ddTS[0.395] : BLOCKED 0x700010063d59b90 3 wq 2 cvtops x1001 TX 0x7000b.0xa67(ext 0x2,0x0)[1002-0029-00008387] inst 1 BLOCKER 0x700010063c6d268 3 wq 1 cvtops x28 TX 0x7000b.0xa67(ext 0x2,0x0)[1002-002D-00003742] inst 1 BLOCKED 0x700010063d5adc8 3 wq 2 cvtops x1 TX 0x30021.0x848(ext 0x2,0x0)[1002-002D-00003742] inst 1 BLOCKER 0x700010063d5a4b8 3 wq 1 cvtops x28 TX 0x30021.0x848(ext 0x2,0x0)[1002-0029-00008387] inst 1 |
至於每個引數到底是什麼意思,目前還沒有去研究,等待大神可以無償解釋一下。至於如何解決ITL產生的死鎖,無非就是增大表和索引的initrans和PCT_FREE的值,可以參考本BLOG中的ITL死鎖問題解決。
該實驗過程可能有點複雜,小麥苗畫了個圖來說明整個實驗過程:
5 與文章有關的相關連線
|
http://blog.itpub.net/26736162/viewspace-1244055/ |
||
|
http://blog.itpub.net/26736162/viewspace-1222423/ |
||
|
http://blog.itpub.net/26736162/viewspace-1222417/ |
||
|
http://blog.itpub.net/26736162/viewspace-1218671/ |
||
|
http://blog.itpub.net/26736162/viewspace-2080712/ |
||
|
http://blog.itpub.net/26736162/viewspace-1744719/ |
||
|
http://blog.itpub.net/26736162/viewspace-1744705/ |
||
|
About Me
...............................................................................................................................
● 本文作者:小麥苗,只專注於資料庫的技術,更注重技術的運用
● 本文在itpub(http://blog.itpub.net/26736162)、部落格園(http://www.cnblogs.com/lhrbest)和個人微信公眾號(xiaomaimiaolhr)上有同步更新
● 本文itpub地址:http://blog.itpub.net/26736162/viewspace-2124771/
● 本文部落格園地址:http://www.cnblogs.com/lhrbest/p/5859095.html
● 本文pdf版及小麥苗雲盤地址:http://blog.itpub.net/26736162/viewspace-1624453/
● 資料庫筆試面試題庫及解答:http://blog.itpub.net/26736162/viewspace-2134706/
● QQ群:230161599 微信群:私聊
● 聯絡我請加QQ好友(646634621),註明新增緣由
● 於 2016-08-01 15:00~ 2016-08-01 19:00 在魔都完成
● 文章內容來源於小麥苗的學習筆記,部分整理自網路,若有侵權或不當之處還請諒解
● 版權所有,歡迎分享本文,轉載請保留出處
...............................................................................................................................
拿起手機使用微信客戶端掃描下邊的左邊圖片來關注小麥苗的微信公眾號:xiaomaimiaolhr,掃描右邊的二維碼加入小麥苗的QQ群,學習最實用的資料庫技術。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26736162/viewspace-2124771/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- 【故障處理】佇列等待之TX - allocate ITL entry案例佇列
- enq: TX - allocate ITL entryENQ
- enq: TX - allocate ITL entry等待過多導致全域性死鎖ENQ
- 故障處理】佇列等待之enq: US - contention案例佇列ENQ
- 【故障處理】佇列等待之enq: US - contention案例佇列ENQ
- 【故障處理】佇列等待之enq IV - contention案例佇列ENQ
- enq: TX - allocate ITL entry等待事件分析ENQ事件
- 關於enq: TX - allocate ITL entry等待事件ENQ事件
- Oracle TX鎖的處理Oracle
- 關於enq: TX - allocate ITL entry的問題分析ENQ
- [20150721]enq TX - allocate ITL entryENQ
- Oracle 死鎖處理Oracle
- Oracle死鎖處理Oracle
- 【MOS】 Troubleshooting waits for enq: TX - allocate ITL entry(1472175.1)AIENQ
- Mysql如何處理死鎖MySql
- 日常運維之TX鎖處理(一)運維
- 日常運維之TX鎖處理(二)運維
- 對於死鎖的處理流程:
- ORACLE死鎖及處理方式Oracle
- oracle 死鎖查詢處理Oracle
- [20140130]關於enq TX-allocate ITL entryENQ
- 【Oracle】死鎖的產生與處理Oracle
- 如何處理執行緒死鎖執行緒
- 關於Oracle死鎖處理方法Oracle
- Oracle死鎖查詢及處理Oracle
- 處理線上RabbitMQ佇列阻塞MQ佇列
- ITL與事務處理
- 【故障處理】一次RAC故障處理過程
- MongoDB故障處理MongoDB
- 表死鎖查詢及處理辦法
- [20231026]enq TX - allocate ITL entry的測試4.txtENQ
- 常見佇列等待事件處理思路佇列事件
- 通過佇列實現批量處理佇列
- APM RUEI processor處理程式hang死處理方法
- 關於處理死鎖的一點小知識
- 微服務的故障處理微服務
- GPON網路故障如何處理?GPON網路故障處理流程
- 【故障處理】ORA-600:[13013],[5001]故障處理