一個客戶的資料庫突然出現了HANG死的情況，連SQLPLUS都無法正常連線到資料庫。

資料庫版本為11.2.0.2 for Linux x86-64。客戶由於要迅速解決問題，於是採用了重啟資料庫的方式，重啟後問題解決。

事後透過檢查資料庫的trace資訊以及詢問客戶dba的操作，基本上定位了導致問題的原因。

其實事後診斷是一個很頭痛的問題，因為一旦重啟，很多資訊就丟失了。所有V$檢視中的資訊都被重置，想要使用AWR來獲取資訊，也會發現由於資料庫重啟，實際上沒有辦法對重啟前時刻的資料庫狀態進行分析。

好在所有的TRACE資訊和告警日誌都可以檢查到，首先檢查告警日誌資訊：

Wed Jul 06 14:34:53 2011
Errors in file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_cjq0_19642.trc (incident=36249):
ORA-00445: background process "J002" did not start after 120 seconds
Incident details in: /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/incident/incdir_36249/mobiledb_cjq0_19642_i36249.trc
Wed Jul 06 14:35:21 2011
Dumping diagnostic data in directory=[cdmp_20110706143521], requested by (instance=1, sid=19642 (CJQ0)), summary=[incident=36249].
Wed Jul 06 14:35:30 2011
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_cjq0_19642.trc:
Wed Jul 06 14:35:53 2011
PMON failed to acquire latch, see PMON dump
Wed Jul 06 14:36:57 2011
ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Process m000 submission failed with error = 20
.
.
.
***********************************************************************

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.2.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.2.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.2.0 - Production
Time: 06-JUL-2011 14:38:31
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.90.102)(PORT=56704))

***********************************************************************

.
.
.
Wed Jul 06 14:38:15 2011
WARNING: inbound connection timed out (ORA-3136)
Wed Jul 06 14:38:15 2011
WARNING: inbound connection timed out (ORA-3136)
Wed Jul 06 14:38:18 2011
WARNING: inbound connection timed out (ORA-3136)
.
.
.

首先報錯的是JOB相關的程式，J002程式在指定時間內沒有啟動：

ORA-00445: background process "J002" did not start after 120 seconds

隨後CJQ0程式在嘗試啟動JOB程式時報錯：

kkjcre1p: unable to spawn jobq slave process

接著PMON程式報錯，無法獲取LATCH：

PMON failed to acquire latch, see PMON dump

之後出現了ORA-20相關的錯誤：

ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Process m000 submission failed with error = 20

最後基本上是滿屏的12170、TNS-12535和ORA-3136錯誤。

最後的網路連線超時相關的12170、ORA-3136等，都是問題產生後的現象，而並非問題的原因，同樣前面的J000和CJQ0的報錯，同樣也是由於其他問題導致了後臺程式啟動異常。

那麼這裡最明顯的錯誤就是ORA-20錯誤，也就是說系統中的程式數超過了初始化引數PROCESSES的設定。當前這個引數設定為1500，而正常情況下，這個資料庫中的連線會話數絕不會超過500。顯然，雖然隨後的網路連線超時錯誤，以及後臺JOB啟動的錯誤可能都是由於ORA-20錯誤導致的，但是當前資料庫中程式超過引數設定這絕對不是導致問題的根本原因，而同樣也是問題的表相，一定是其他的問題導致了資料庫中程式數的暴漲。

剩下的只有一個PMON無法獲取LATCH的資訊了，這並不是一個ORA開頭的錯誤資訊，而實際上應該算是一個告警資訊，但是這個告警所傳遞的資訊卻很重要。根據這個告警資訊查詢METALINK，沒有發現和當前現象相似的案例。

由於錯誤資訊與PMON程式有關，因此繼續檢查PMON程式：

Trace file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_pmon_19539.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/oracle/product/11.2.0/db
System name: Linux
Node name: Oracle112
Release: 2.6.32-71.el6.x86_64
Version: #1 SMP Wed Dec 15 12:36:54 EST 2010
Machine: x86_64
Instance name: mobiledb
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 19539, image: oracle@Oracle112 (PMON)

*** 2011-07-06 14:35:52.888
*** SESSION ID:(285.1) 2011-07-06 14:35:52.888
*** CLIENT ID:() 2011-07-06 14:35:52.888
*** SERVICE NAME:(SYS$BACKGROUND) 2011-07-06 14:35:52.888
*** MODULE NAME:() 2011-07-06 14:35:52.888
*** ACTION NAME:() 2011-07-06 14:35:52.888

PMON unable to acquire latch 60108140 Child shared pool level=7 child#=3
Location from where latch is held: kgh.h LINE:6383 ID:kghfsh:
Context saved from call: 0
state=busy [holder rapid=63] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
67 (146, 1309934152, 124)
80 (146, 1309934152, 122)
94 (146, 1309934152, 121)
91 (146, 1309934152, 121)
84 (146, 1309934152, 121)
98 (146, 1309934152, 121)
99 (146, 1309934152, 121)
46 (146, 1309934152, 146)
95 (146, 1309934152, 146)
111 (146, 1309934152, 121)
112 (146, 1309934152, 121)
115 (146, 1309934152, 120)
118 (146, 1309934152, 120)
119 (146, 1309934152, 119)
120 (146, 1309934152, 119)
121 (146, 1309934152, 119)
124 (146, 1309934152, 118)
123 (146, 1309934152, 118)
128 (146, 1309934152, 118)
127 (146, 1309934152, 118)
129 (146, 1309934152, 117)
131 (146, 1309934152, 117)
130 (146, 1309934152, 117)
134 (146, 1309934152, 116)
135 (146, 1309934152, 116)
136 (146, 1309934152, 116)
139 (146, 1309934152, 115)
138 (146, 1309934152, 115)
47 (146, 1309934152, 146)
141 (146, 1309934152, 112)
142 (146, 1309934152, 104)
140 (146, 1309934152, 113)
144 (146, 1309934152, 100)
147 (146, 1309934152, 94)
146 (146, 1309934152, 94)
148 (146, 1309934152, 92)
150 (146, 1309934152, 89)
152 (146, 1309934152, 88)
153 (146, 1309934152, 88)
155 (146, 1309934152, 88)
156 (146, 1309934152, 88)
154 (146, 1309934152, 88)
157 (146, 1309934152, 88)
.
.
.
1485 (88, 1309934152, 28)
1496 (88, 1309934152, 28)
1487 (88, 1309934152, 28)
1494 (88, 1309934152, 28)
1499 (88, 1309934152, 28)
1490 (87, 1309934152, 28)
1498 (87, 1309934152, 28)
1495 (87, 1309934152, 28)
1497 (87, 1309934152, 28)
65 (87, 1309934152, 87)
83 (47, 1309934152, 47)
waiter count=1408
gotten 92338703 times wait, failed first 49620 sleeps 6005
gotten 0 times nowait, failed: 0
Short stack dump:
ksedsts()+461 ----------------------------------------
SO: 0x1588acc90, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x1588acc90, name=process, file=ksu.h LINE:12451, pg=0
(process) Oracle pid:63, ser:85, calls cur/top: 0x147d554d8/0x147d554d8
flags : (0x0) -
flags2: (0x0), flags3: (0x0)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 136 0 2
last post received-location: ksl2.h LINE:2293 ID:kslpsr
last process to post me: 15f820b38 1 6
last post sent: 0 0 26
last post sent-location: ksa2.h LINE:282 ID:ksasnd
last process posted by me: 15f820b38 1 6
(latch info) wait_event=0 bits=80
holding (efd=15) 60108140 Child shared pool level=7 child#=3
Location from where latch is held: kgh.h LINE:6383 ID:kghfsh:
Context saved from call: 0
state=busy [holder rapid=63] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
67 (146, 1309934152, 124)
80 (146, 1309934152, 122)
94 (146, 1309934152, 121)
91 (146, 1309934152, 121)
84 (146, 1309934152, 121)
98 (146, 1309934152, 121)
99 (146, 1309934152, 121)
46 (146, 1309934152, 146)
95 (146, 1309934152, 146)
111 (146, 1309934152, 121)
112 (146, 1309934152, 121)
115 (146, 1309934152, 120)
118 (146, 1309934152, 120)
119 (146, 1309934152, 119)
120 (146, 1309934152, 119)
121 (146, 1309934152, 119)
124 (146, 1309934152, 118)
123 (146, 1309934152, 118)
128 (146, 1309934152, 118)
127 (146, 1309934152, 118)
129 (146, 1309934152, 117)
131 (146, 1309934152, 117)
130 (146, 1309934152, 117)
134 (146, 1309934152, 116)
.
.
.
1487 (88, 1309934152, 28)
1494 (88, 1309934152, 28)
1499 (88, 1309934152, 28)
1490 (87, 1309934152, 28)
1498 (87, 1309934152, 28)
1495 (87, 1309934152, 28)
1497 (87, 1309934152, 28)
65 (87, 1309934152, 87)
83 (47, 1309934152, 47)
waiter count=1408
Process Group: DEFAULT, pseudo proc: 0x15f9a0710
O/S info: user: oracle, term: UNKNOWN, ospid: 7698
OSD pid info: Unix process pid: 7698, image: oracle@Oracle112

*** 2011-07-06 14:35:53.548
Short stack dump:
ksedsts()+461 PSO child state object changes :
Dump of memory from 0x000000015886F9D0 to 0x000000015886FBD8
15886F9D0 00000000 00000000 00000000 00000000 [................]
Repeat 31 times
15886FBD0 00000000 00000000 [........]

*** 2011-07-06 14:35:53.557

根據PMON程式的資訊，SPID為7698的一個資料庫程式阻塞1408個程式，不用說也知道問題就出在這個7698的程式上，繼續查詢這個程式的trace資訊：

Trace file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_ora_7698.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/oracle/product/11.2.0/db
System name: Linux
Node name: Oracle112
Release: 2.6.32-71.el6.x86_64
Version: #1 SMP Wed Dec 15 12:36:54 EST 2010
Machine: x86_64
Instance name: mobiledb
Redo thread mounted by this instance: 1
Oracle process number: 63
Unix process pid: 7698, image: oracle@Oracle112

*** 2011-07-06 14:44:20.926
*** SESSION ID:(2229.4669) 2011-07-06 14:44:20.926
*** CLIENT ID:() 2011-07-06 14:44:20.926
*** SERVICE NAME:(MOBILEDB) 2011-07-06 14:44:20.926
*** MODULE NAME:(PL/SQL Developer) 2011-07-06 14:44:20.926
*** ACTION NAME:(Command Window - New) 2011-07-06 14:44:20.926

KQRCMT: Write failed with error=604 po=0x13350d1f0 cid=8
diagnostics : cid=8 hash=7188274a flag=2031

顯然這是客戶透過PL/SQL Developer工具執行的操作，而且看ACTION NAME是Command Window，多半客戶是在進行DDL等操作。

而ORA-604錯誤的描述是：

ORA-00604: error occurred at recursive SQL level string
Cause: An error occurred while processing a recursive SQL statement (a statement applying to internal dictionary tables).
Action: If the situation described in the next error on the stack can be corrected, do so; otherwise contact Oracle Support.'

顯然是由於客戶執行的某個特殊命令，導致了資料字典在修改的過程中出現了錯誤，從而導致了資料庫中其他所有的會話被鎖。由於連線池中現有的連線全部被鎖，因此連線池會嘗試不斷的建立新的連線來滿足使用者的請求，而這些新的連線同樣會被鎖定，並最終導致資料庫中的程式數超過了作業系統上的設定1500，從而引發了其他一系列的錯誤。

導致問題的原理搞清楚了，那麼客戶DBA到底執行了什麼操作，導致了這個現象呢，這個透過TRACE是無法獲取到的，即使是透過LOGMINER，很可能也無法找到，由於是對DDL語句而言。

這時最簡單的辦法莫過於直接詢問客戶，實際上當時客戶在執行的操作理論上是對資料庫影響最小的線上重定義的操作，客戶在執行DBMS_REDEFINITION.FINISH_REDEF_TABLE這個操作後，很快發現程式異常，而且後臺出現了ORA-20錯誤，試圖透過CTRL + C來中止任務，雖然這個操作被中止了，但是這個程式後臺對於資料庫其他程式的阻塞並沒有釋放，從而導致所有進行被掛起，最終不得不重啟資料庫。

導致問題的原因可能是這個：Bug 11834459: DBMS_REDEFINITION.FINISH_REDEF_TABLE MAY FLUSH THE SHARED POOL。在11.2中，執行FINISH_REDEF_TABLE過程很可能會重新整理共享池，而這個操作就有可能導致上面所有會話被鎖的問題。這個問題說明即使這個操作是Oracle提供的線上操作，即使這個操作看起來沒有什麼危害，在一個新的版本上執行的時候仍然要小心，最好在相同版本的測試環境中測試，而且儘量不要在高峰期執行。

線上重定義引起的資料庫掛起

相關文章