記一次Oracle資料庫無響應(hang住)故障的處理

hd_system發表於2017-05-02
先說說這個資料庫的環境:

Oracle 9.2.0.4 RAC,只不過這個RAC只執行了一個節點,另一節點沒有開啟。
AIX 5.3 TL04
主機為p550,4CPU,16G記憶體
應用為部署在Weblogic下的WEB應用。

故障現象:
首先是客戶端的操作沒有響應,從weblogic上看連線數非常高,其日誌裡面不停報超出連線池的最大連線數。在主機上用sqlplus "/ as sysdba",在顯示sqlplus的banner後,停止響應。

從故障現象來看,是資料庫hang住了。

由於sqlplus不能操作,那麼這個時候沒辦法透過oracle來dump system state。先看看作業系統裡面,用topas命令觀察,發現一個oracle程式佔用了26%左右的CPU資源,IO等待幾乎為0,可用的實體記憶體還比較多。根據那個佔用CPU的程式號用ps命令檢視,是一個普通的Server Process。

看來起這個程式陷入死迴圈了,26%的CPU資源正好是1個CPU(因為系統共4個CPU)。如果一個oracle程式拿到比較重要的資源,比如shared pool latch、library cache latch等,然後陷入了死迴圈(SPIN)後,其他程式沒法解析SQL等,也就只有掛起了。

用kill命令殺掉那個程式,系統恢復正常,看來前面對故障的推斷是正確的,不過沒過幾分鐘,又出現了此故障現象。

只有找到oracle當時正在幹什麼,才能進行處理。用dbx來dump system state:

# dbx -a 446910
Waiting to attach to process 446910 ...
Successfully attached to oracle.
Type 'help' for help.
reading symbolic information ...
stopped in iosl.select at 0x9000000000c94d8 ($t2)
0x9000000000c94d8 (select+0xfffffffffff06318) e8410028 ld r2,0x28(r1)
(dbx) print ksudss(10)

Segmentation fault in slrac at 0x100083aa0 ($t2)
0x100083aa0 (slrac+0xe4) 88030000 lbz r0,0x0(r3)
(dbx) detach

分析trace檔案:

Starting Systemstate 1
..............................................................................
......................
Ass.Awk Version 1.0.9 - Processing oa2_ora_446910.trc

System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer' seq=16329
3:
4: waiting for 'rdbms ipc message' seq=33158
5: waiting for 'ges remote message' seq=30917
6: waiting for 'gcs remote message' seq=52339
7: waiting for 'gcs remote message' seq=52407
8: waiting for 'rdbms ipc message' seq=32661
9: waiting for 'rdbms ipc message' seq=32504
10: waiting for 'rdbms ipc message' seq=19573
11: waiting for 'rdbms ipc message' seq=52125
12: waiting for 'latch free' [Latch 7000001491d2a40] seq=16415
13: waiting for 'rdbms ipc message' seq=270
14: waiting for 'rdbms ipc message' seq=3494
15: waiting for 'rdbms ipc message' seq=29593
16: waiting for 'latch free' [Latch 7000001491d2a40] seq=54430
17: waiting for 'rdbms ipc message' seq=52224
18: waiting for 'latch free' [Latch 7000001491d2a40] seq=23413
19: waiting for 'latch free' [Latch 7000001491d2a40] seq=49796
20: waiting for 'latch free' [Latch 7000001491d2a40] seq=20730
21: waiting for 'latch free' [Latch 7000001491d2a40] seq=12464
Cmd: Select
22: waiting for 'latch free' [Latch 7000001491d2a40] seq=34653
Cmd: Select
23: waiting for 'latch free' [Latch 7000001491d2850] seq=58618
24: waiting for 'latch free' [Latch 7000001491d2a40] seq=36619
25: waiting for 'latch free' [Latch 7000001491d2a40] seq=56192
26: waiting for 'latch free' [Latch 7000001491d2a40] seq=60361
27: waiting for 'latch free' [Latch 7000001491d2a40] seq=19963
28: waiting for 'latch free' [Latch 7000001491d2a40] seq=10464
29: waiting for 'latch free' [Latch 7000001491d2a40] seq=29548
30: waiting for 'latch free' [Latch 7000001491d2a40] seq=15077
31: waiting for 'latch free' [Latch 7000001491d2a40] seq=32472
32: waiting for 'latch free' [Latch 7000001491d2a40] seq=17466
33: waiting for 'latch free' [Latch 7000001491d2a40] seq=34712
34: waiting for 'latch free' [Latch 7000001491d2a40] seq=27146
35: waiting for 'latch free' [Latch 7000001491d2a40] seq=2911
36: waiting for 'latch free' [Latch 7000001491d2a40] seq=16634
Cmd: Select
37: waiting for 'latch free' [Latch 7000001491d2a40] seq=43400
38: waiting for 'latch free' [Latch 7000001491d2a40] seq=8435
39: waiting for 'latch free' [Latch 7000001491d2a40] seq=39939
40: waiting for 'latch free' [Latch 7000001491d2a40] seq=59619
41: waiting for 'latch free' [Latch 7000001491d2a40] seq=16259
42: waiting for 'latch free' [Latch 7000001491d2a40] seq=21150

43: waiting for 'latch free' [Latch 7000001491d2a40] seq=35843
44: waiting for 'latch free' [Latch 7000001491d2a40] seq=27141
45: waiting for 'latch free' [Latch 7000001491d2a40] seq=29400
46: waiting for 'latch free' [Latch 7000001491d2a40] seq=51444
47: waiting for 'latch free' [Latch 7000001491d2a40] seq=19142
48: waiting for 'latch free' [Latch 7000001491d2a40] seq=1031
49: waiting for 'latch free' [Latch 7000001491d2a40] seq=31852
50: waiting for 'latch free' [Latch 7000001491d2a40] seq=46957
51: waiting for 'latch free' [Latch 7000001491d2a40] seq=41609
52: waiting for 'latch free' [Latch 7000001491d2850] seq=57255
53: waiting for 'latch free' [Latch 7000001491d2a40] seq=34982
54: waiting for 'latch free' [Latch 7000001491d2a40] seq=26060
55: waiting for 'latch free' [Latch 7000001491d2a40] seq=16070
56: waiting for 'latch free' [Latch 7000001491d2a40] seq=63913
57: waiting for 'latch free' [Latch 7000001491d2a40] seq=2724
58: waiting for 'latch free' [Latch 7000001491d2a40] seq=42177
59: waiting for 'latch free' [Latch 7000001491d2a40] seq=19973
60: waiting for 'latch free' [Latch 7000001491d2a40] seq=47446
61: waiting for 'latch free' [Latch 7000001491d2a40] seq=42112
Cmd: Select
62: waiting for 'latch free' [Latch 7000001491d2a40] seq=927
63: waiting for 'latch free' [Latch 7000001491d2850] seq=3182
64: waiting for 'latch free' [Latch 7000001491d2a40] seq=2347
65: waiting for 'latch free' [Latch 7000001491d2a40] seq=18928
66: waiting for 'latch free' [Latch 7000001491d2a40] seq=1512
67: waiting for 'latch free' [Latch 7000001491d2a40] seq=53568
68: waiting for 'latch free' [Latch 7000001491d2a40] seq=20420
69: waiting for 'latch free' [Latch 7000001491d2a40] seq=30353
70: waiting for 'latch free' [Latch 7000001491d2a40] seq=36936
71: waiting for 'latch free' [Latch 7000001491d2a40] seq=50852
72: waiting for 'latch free' [Latch 7000001491d2a40] seq=8533
73: waiting for 'latch free' [Latch 7000001491d2a40] seq=31009
74: waiting for 'latch free' [Latch 7000001491d2a40] seq=15406
75: waiting for 'latch free' [Latch 7000001491d2a40] seq=4375
Cmd: Select
76: waiting for 'library cache load lock' seq=207
78: waiting for 'latch free' [Latch 7000001491d2a40] seq=6165
79: waiting for 'latch free' [Latch 7000001491d2a40] seq=22977
80: last wait for 'SQL*Net message from client'
81: waiting for 'latch free' [Latch 7000001491d2a40] seq=20226
82: waiting for 'latch free' [Latch 7000001491d2a40] seq=17323
Cmd: Select
83: waiting for 'latch free' [Latch 7000001491d2a40] seq=19680
84: waiting for 'latch free' [Latch 7000001491d2a40] seq=5997
85: waiting for 'latch free' [Latch 7000001491d2a40] seq=54166
Cmd: Select
86: waiting for 'latch free' [Latch 7000001491d2a40] seq=61874
87: waiting for 'latch free' [Latch 7000001491d2a40] seq=33285
88: waiting for 'latch free' [Latch 7000001491d2a40] seq=16784
89: last wait for 'latch free'
Cmd: Select
90: waiting for 'latch free' [Latch 7000001491d2a40] seq=63834
91: waiting for 'latch free' [Latch 7000001491d2a40] seq=42945
92: waiting for 'latch free' [Latch 7000001491d2a40] seq=3676
93: waiting for 'latch free' [Latch 7000001491d2a40] seq=59730
94: waiting for 'latch free' [Latch 7000001491d2a40] seq=42691
Cmd: Update
95: waiting for 'latch free' [Latch 7000001491d2a40] seq=41076
96: waiting for 'latch free' [Latch 7000001491d2a40] seq=23230
97: waiting for 'latch free' [Latch 7000001491d2a40] seq=64835
98: waiting for 'latch free' [Latch 7000001491d2a40] seq=45314
99: waiting for 'latch free' [Latch 7000001491d2a40] seq=65301
100:waiting for 'latch free' [Latch 7000001491d2a40] seq=51423
101:waiting for 'latch free' [Latch 7000001491d2a40] seq=25171
Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.

Resource Holder State
Latch 7000001491d2a40 80: last wait for 'SQL*Net message from client'
Latch 7000001491d2a40 89: last wait for 'latch free'
Latch 7000001491d2850 26: 26: is waiting for 80: 89:

Object Names
~~~~~~~~~~~~
Latch 7000001491d2a40 Child library cache
Latch 7000001491d2850 Child library cache

很多程式都在等待library cache latch,而持有library cache latch的程式出現了異常,因而導致了資料庫被hang住。對出現異常的程式進行分析,發現異常程式執行的SQL都是應用中的普通的SQL(並且後來出現異常的多個程式執行的SQL都不相同)。由於急需恢復應用,所以不得已先重啟了資料庫,再繼續查詢原因。

然而資料庫重啟只能使用1天,就會再次出現此類問題。
針對此故障,首先懷疑可能是BUG引起。但在確認BUG之前,需要收集足夠多的證據。隨後發現,這個資料庫存在下面幾個問題:

1、這個庫的shared pool 設定很大,達到了4G。
2、應用沒有使用繫結變數,全部都是拼接的SQL,並且沒有使用任何儲存過程,也就是應用的全部SQL都是硬解析。4G的shared pool可以在1天之內全部填滿。
3、從alert日誌檔案裡面看到,這個庫經常因為ORA-04031錯誤崩潰。

看來這個庫的問題比較嚴重。資料庫hang住和ORA-04031應該都與硬解析過多有關。但目前改變應用是十分困難的事情。只好透過將cursor_sharing設定為similar,強制使用繫結變數。

將cursor_sharing設定為similiar之後,經過一段時間的觀察,資料庫執行正常,不再出現hang住的現象,hard parse從以前每秒數十次減少為每秒不到1次。shared pool的記憶體使用非常少,並且減少了shared_pool_size的設定。

不過cursor_sharing設定為similar有比較多的BUG,最常見的就是statspack的BUG和會有大量高version count的SQL。

對於OLTP系統來說,使用繫結變數,怎麼強調都不過分。

sqlplus -prelim / as sysdba
then do a hanganalyze

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29209863/viewspace-2138300/,如需轉載,請註明出處,否則將追究法律責任。

相關文章