一次資料庫HANG處理

還是幫別人處理的，不知道我為什麼運氣好，資料庫就是不宕

他的資料庫HANG住了，SQLPLUS無法登陸，一直HANG在那裡

一般情況下，10g處理方法應該為sqlplus -prelim / as sysdba 連線資料庫，注意這種方式連線資料庫，是無法關閉資料庫的，只能進行SYSTEMSTATE
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name

獲得TRC檔案後，可以用ass109.awk來格式化下看看

System State 1
~~~~~~~~~~~~~~~~
1:
2: last wait for 'cursor: mutex X'
3: waiting for 'rdbms ipc message'     wait
4: waiting for 'rdbms ipc message'     wait
5: waiting for 'rdbms ipc message'     wait
6: waiting for 'rdbms ipc message'     wait
7: waiting for 'rdbms ipc message'     wait
8: waiting for 'latch: library cache' [Latch 7000001302e0170] wait
9: last wait for 'rdbms ipc message'
10: last wait for 'latch: library cache'[Latch 7000001302e0030]
11: waiting for 'latch: library cache' [Latch 7000001302e0170] wait
12: waiting for 'latch: library cache' [Latch 7000001302e0170] wait
13:
14:
15: waiting for 'Streams AQ: qmn slave idle wait' wait
16: waiting for 'rdbms ipc message'     wait
17: waiting for 'rdbms ipc message'     wait
18: last wait for 'latch: library cache'[Latch 7000001302e0350]
19: for 'Streams AQ: waiting for time management or cleanup tasks' wait
20: waiting for 'Streams AQ: qmn coordinator idle wait' wait
21: waiting for 'SQL*Net message from client' wait
22:                                    [Latch 7000001302e0030]
23: waiting for 'SQL*Net message from client' wait
24:                                    [Latch 7000001302e0030]
25: waiting for 'SQL*Net message from client' wait
26: waiting for 'SQL*Net message from client' wait
27:                                    [Latch 7000000100e5020]
28: waiting for 'latch: library cache' [Latch 7000001302e0030] wait
29: waiting for 'latch: library cache' [Latch 7000001302e0350] wait
30: waiting for 'latch: library cache' [Latch 7000001302e0170] wait
31: last wait for 'latch: library cache'[Latch 7000001302e03f0]
32: waiting for 'SQL*Net message from client' wait
33: last wait for 'latch: library cache'[Latch 7000001302e0030]
34: last wait for 'latch: library cache'[Latch 7000001302e0030]
35: waiting for 'latch: library cache' [Latch 7000001302e0170] wait
36: waiting for 'SQL*Net message from client' wait
................................
735:                                   [Latch 70000001000c240]
736:                                   [Latch 70000001000c240]
737:                                   [Latch 70000001000c240]
738:                                   [Latch 70000001000c240]
739:                                   [Latch 70000001000c240]
740:                                   [Latch 70000001000c240]
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
       Latch 7000001302e0170    ??? Blocker
       Latch 7000001302e0030    ??? Blocker
       Latch 7000001302e0350    ??? Blocker
       Latch 7000000100e5020    ??? Blocker
       Latch 7000001302e03f0    ??? Blocker
       Latch 7000000100e50c0   302: Self-Deadlock
       Latch 700000010020f20    ??? Blocker
       Latch 70000001000c240   660: 660: is waiting for 660: 699:
       Latch 70000001000c240   699: 699: is waiting for 660: 699:

Object Names
~~~~~~~~~~~~
Latch 7000001302e0170   Child library cache
Latch 7000001302e0030   Child library cache
Latch 7000001302e0350   Child library cache
Latch 7000000100e5020   Child shared pool
Latch 7000001302e03f0   Child library cache
Latch 7000000100e50c0   Child shared pool
Latch 700000010020f20   user lock
Latch 70000001000c240         holding    (efd=3) 70000001000c240 OS proces

首先302的Self-Deadlock肯定是有問題的，其持有的LATCH無法釋放，那麼在SYSTEMSTATE中找 PROCESS 302 看看
PROCESS 302:
----------------------------------------
SO: 700000138541e28, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=302, calls cur/top: 0/7000000746b9fd8, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0
        Location from where call was made: kghalo:
      waiting for 7000000100e50c0 Child shared pool level=7 child#=2
        Location from where latch is held: kgh: add extent to reserved list:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           97 (936, 1258509061, 2)
           46 (936, 1258509061, 2)
           82 (933, 1258509061, 2)
           114 (930, 1258509061, 2)
           108 (930, 1258509061, 2)
           127 (930, 1258509061, 2)
      ....(100的個，這裡省略掉)
           636 (495, 1258509061, 2)
           633 (495, 1258509061, 2)
           628 (495, 1258509061, 2)
           634 (495, 1258509061, 2)
           632 (495, 1258509061, 2)
           640 (486, 1258509061, 2)
           641 (486, 1258509061, 2)
           642 (486, 1258509061, 2)
           waiter count=262
          gotten 728126853 times wait, failed first 151722 sleeps 66474
          gotten 0 times nowait, failed: 0
        possible holder pid = 92 spid=929882
      on wait list for 7000000100e50c0
      acquiring 7000000100e50c0

這裡可以看到，其在等待7000000100e50c0，但是其卻持有7000000100e50c0，自己把自己搞死掉了

而且，還有大量的SESSION在7000000100e50c0這個LATCH上等待，有100多個

找了個SQLPLUS進去的程式看，確認其也在等待該LATCH

找到了SQLPLUS HANG住的PROCESS 738，確定其在等待70000001000c240
PROCESS 738:
----------------------------------------
SO: 7000001385ad2e8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=738, calls cur/top: 0/0, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0
        Location from where call was made: kso_new_process:
      waiting for 70000001000c240 OS process: request allocation level=6
        Location from where latch is held: kso_new_process:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           643 (523, 1258509101, 3)
           644 (523, 1258509101, 0)
           645 (523, 1258509101, 3)

該LATCH的持有者為PROCESS 642
PROCESS 642:
----------------------------------------
SO: 7000001385958e8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=642, calls cur/top: 0/0, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=40
        Location from where call was made: kghalo:
      waiting for 7000000100e50c0 Child shared pool level=7 child#=2
        Location from where latch is held: kgh: add extent to reserved list:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           97 (970, 1258509095, 3)
           46 (970, 1258509095, 0)
           82 (967, 1258509095, 3)
           114 (964, 1258509095, 3)
           108 (964, 1258509095, 3)
           ......
          gotten 728126853 times wait, failed first 151722 sleeps 66474
          gotten 0 times nowait, failed: 0
        possible holder pid = 92 spid=929882
      on wait list for 7000000100e50c0
      holding    (efd=3) 70000001000c240 OS process: request allocation level=6
        Location from where latch is held: kso_new_process:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           643 (517, 1258509095, 0)
           644 (517, 1258509095, 0)
           645 (517, 1258509095, 0)

該程式也在等待 7000000100e50c0

所以處理方法也很明顯了，殺掉PROCESS 302，並附帶殺掉660，那個程式也死鎖了。還有那些Holder為???的程式

不過，做HANGANALYZE竟然沒有發現這個LATCH上的死鎖，真奇怪

來自 “ ITPUB部落格 ” ，連結：http://blog.itpub.net/8242091/viewspace-619991/，如需轉載，請註明出處，否則將追究法律責任。

一次資料庫HANG處理

相關文章