alter index rebuild online引發的血案

‘早上起來沒有一個人……‘，伸手抓起手機，‘喂，……應用hang住了……rac的一節點在手工shutdown……需要到現場……’。趕緊穿上衣服，拿起電腦往客戶辦公室趕。路上還接到客戶電話具體描述了下故障現象。
原來是客戶在rac的其中一個節點進行index rebuild online操作，結果命令發出去不久應用就hang住了，沒有辦法的客戶就直接用了大絕招：把操作的節點資料庫重啟了，現在的問題是shutdown immediate也hang住了。聽到這裡，馬上喊他看日誌，回覆日誌沒報錯，鬱悶的shutdown，如果有大事務在執行，豈不是要rollback死，由於此應用級別很高，等不起，只能死馬當活馬醫了。不管了，先讓它關了在說，先直接作業系統殺客戶端程式，使資源能更快的釋放。於是，指導輸入命令ps –ef|grep oracle|grep LOCAL=NO|grep –v grep|awk ‘{awk print $2}’|xargs kill -9。系統終於關閉了，直接startup……
等我到現場時，系統居然還沒啟動起來，難道真的有大事務需要恢復，難道就是重建索引？
登入伺服器檢視，日誌顯示資料庫正在recover，想起經常的recover幾個小時的情況，我再次對貿然的shutdown很是鬱悶。不過好在是rac，可客戶反應即使連線到好的節點還是不能使用，登入檢視，檢查鎖，檢查等待。發現存在大量的鎖，而且有一個鎖已經2個月了，恰恰就是重建索引的表，明顯這個鎖有問題，直接kill。等一會，另一個節點啟動好了，問應用的情況，也恢復了，就這麼簡單？就是那個鎖的問題？我暈，守一會業務後我就撤退了。
在路上我就在想，rebuild index online怎麼會把整個應用hang死，難道它鎖了整個表，怎麼會呢，在印象中online的同時是可以進行dml操作的啊，奇了怪了，先睡一覺，空了再來重現故障分析……
N天過後……
先上一篇文章，關於rebuild和rebuild online的區別metalink Note:272762.1
大家耐心看完喲，原理是很重要的……
========
- Online Index rebuild takes a long time.
- ONLINE INDEX REBUILD SCANS THE BASE TABLE AND NOT THE INDEX
Symptoms:
=========
Performance issues while rebuilding very large indexes.
- The offline rebuilds of their index is relatively quick -finishes in 15 minutes.
- Issuing index rebuild ONLINE statement => finishes in about an hour.
- This behavior. of ONLINE index rebuilds makes it a non-option for large tables
as it just takes too long to scan the table to rebuild the index. The offline may not be feasible due to due to the 24/7 nature of the database.
- This may be a loss of functionality for such situations.
- If we attempt to simultaneously ONLINE rebuild the same indexes we may encounter hanging behavior. indefinitely (or more than 6 hours).
DIAGNOSTIC ANALYSIS:
--------------------
We can trace the sessions rebuilding the indexes with 10046 level 12.
Comparing the IO reads for the index-rebuild and the index-rebuild-online reveals the following:
-ONLINE index rebuilds
It scans the base table and it doesn't scan the blocks of the index.
-OFFLINE index rebuilds
It scans the index for the build operation.
- This behaviour is across all versions.
Cause
Cause/Explanation
=============
When you rebuild index online,
- it will do a full tablescan on the base table.
- At the same time it will maintain a journal table for DML data, which has
changed during this index rebuilding operation.
So it should take longer time, specially if you do lots of DML on the same table,while rebuilding index online.
On the other hand, while rebuilding the index without online option, Oracle will grab the index in X-mode and rebuild a new index segment by selecting the data from the old index. So here we are
- not allowing any DML on the table hence there is no journal table involved
- and it is doing an index scan
Hence it will be pretty fast.
Fix
Solution/Conclusion:
===========
- The ONLINE index rebuild reads the base table, and this is by design.
- Rebuilding index ONLINE is pretty slow.
- Rebuilding index offline is very fast, but it prevents any DML on the base table.
好，我們現在明白了，rebuild online是可以同時進行dml操作的，但是online會維護一個操作日誌，會使rebuild時間大幅加長。巴拉巴拉……這一切的一切，跟我遇到的應用掛起沒有任何關係啊，難道是應用騙了我，其實應用可以用？還是模擬一下當時的場景再說。
Action：
Conn oracle/oracle
--建表
create table test(a int,b varchar2(64));
--插入資料
begin
for i in 1..1000000 loop
insert into test values(i,'ok');
if i=100 then
commit;
end if;
end loop;
end;
/
--建立索引
create index idx_a1 on test(a);

準備工作做好了……
--更新其中一條記錄，製造一個排他鎖，不要提交
SQL> update test set a=1000000 where a=103;
--開啟另一個視窗，線上重建索引
SQL> alter index idx_a1 rebuild online;
……
發現此命令長時間不能執行完，沒關係，大表的索引重建是需要大量時間的，查下等待和鎖吧。

sys@ORCL_SQL> select sid,event,P1TEXT,state from v$session_wait where event not in ('SQL*Net message from client');

SID EVENT                         P1TEXT                      STATE
------- ------------------------------ ------------------------------ --------------------------------------
139 enq: TM - contention          name|mode                   WAITING
149 Streams AQ: waiting for time m                               WAITING
      anagement or cleanup tasks

151 Streams AQ: qmn coordinator id                               WAITING
      le wait

155 rdbms ipc message             timeout                      WAITING

sys@ORCL_SQL> select * from v$lock where block>0;

ADDR    KADDR       SID TYPE    ID1    ID2  Lock Held  Lock Req.    CTIME    BLOCK
-------- -------- ------- ---- -------- -------- ---------- ---------- ---------- ----------
315C4134 315C414C    154 TM    52543       0 ########## ##########       24       1
看到木有，重建根本就沒有進行，而是在等待sid154，這個sid就是我們開始的update的命令，看來rebuild online是不會阻礙dml操作，但是在它之前的dml操作它會去等待，知道資源釋放，如果這時有個大事務一直不十分資源，那就恭喜了，你的重建就會hang在這裡。慢著，這裡hang住了對應用不會有影響吧，想當然是，馬上測試
又開啟個新連線
SQL> update test set a=1000000 where a=102;
……
1分鐘過去了，此命令沒有成功，真的hang住了，整個表被鎖住了？
再次檢查鎖

WAITING_SESSION LOCK_TYPE       MODE_REQUESTED             MODE_HELD                   LOCK_ID1       LOCK_ID2
----------------- ----------------- ---------------------------- ---------------------------- ----------------- -----------------
154             None
139          DML             Share                      Row-X (SX)                52543          0
142          DML             Row-X (SX)                Row-X (SX)                52543          0
這裡解釋一下，154是我們最開始update的sid，139是rebuild online的sid，而142是我們剛才update的sid，看來它們都在等sid154，也就是說，這個時候我不能對此表做dml操作了，也就是對於此操作非常頻繁的表來說，應用已經hang住了！！
看來rebuild online之前一定要檢查下系統，其實是做所有操作之前都要這樣做，確認系統沒問題後再做操作。
在貼下鎖的具體情況，大家看rebuild操作的139有幾個行排他鎖，我懷疑是它需要先把所有的資料都用排他鎖鎖住，然後其它dml的操作它會記入輔助日誌，而我這裡的情況是有個行鎖一直存在，就造成了它把其他行都鎖住了，但一直等待最開始的那個行鎖釋放，這個時候如果我開始那個鎖不釋放，系統也就掛在這裡了！！
SID SERIAL# USERNAME Term Resource Type       TAB       OWNER       ID1    ID2 Lock Held    Lock Req.          CTIME
------- ------- ---------- ------ -------------------- ---------- -------- -------- -------- -------------- -------------- ----------
139    11 ORACLE    pts/1  DLSystem          TEST    ORACLE    52543       0 Row Exclusive                      2958
139    11 ORACLE    pts/1  DLSystem          TEST    ORACLE    52543       0 Row Exclusive                      2958
154    9 ORACLE    pts/2  TM - DML Enqueue    TEST    ORACLE    52543       0 Row Exclusive                      2967
139    11 ORACLE    pts/1  TM - DML Enqueue    TEST    ORACLE    52543       0 Row Share    Share             2958
139    11 ORACLE    pts/1  TM - DML Enqueue    SYS_JOURNA ORACLE    52558       0 Share                            2957
                                                   L_52544

142    33 ORACLE    pts/4  TM - DML Enqueue    TEST    ORACLE    52543       0             Row Excl          2844

sys@ORCL_SQL> select sql_text from v$sqltext where hash_value=&hash_value order by piece;
Enter value for hash_value: 1016442092

SQL_TEXT
alter index idx_a1 rebuild online

alter index rebuild online引發的血案

相關文章