kksfbc child completion與ksdxexeotherwait引發CPU使用異常

eric0435發表於2015-12-29

某客戶操作人員反應很慢不能操作,管理人員登入小機系統後發現CPU使用到了96%。而且這種情況持續了幾個月。以下是登入後小機後載取的topas圖,而且是週末,並沒有人使用系統。小機是IBM的550,配置是2顆6核的CPU,記憶體是48G。
1

如是登入資料庫執行以下指令碼來檢視當前資料庫消耗CPU最多的程式在執行什麼

Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 
Connected as gtp2
 


SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
 
SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
 

從上面的資訊可以看到這些程式的等待事件為kksfbc child completion,ksdxexeotherwait。當看到這種情況時第一反應是不是遇到的BUG,以KKSFBC CHILD COMPLETION為關鍵字到MOS查詢可以找到,該Bug的症狀為程式不斷spin且hang住、出現’KKSFBC CHILD COMPLETION’等待事件、還可能伴有’Waits for “cursor: pin S”‘等待事件,直接影響的版本有11.1.0.6、10.2.0.3和10.2.0.4。而我這裡的版本是10.2.0.1。
2

對於該Bug的描述是在發生’kksfbc child completion’等待事件後會話陷入無休止的自旋(spins)中,這種自旋(spins)發生在由堆疊呼叫(stack call)kksSearchChildList->kkshgnc陷入對kksSearchChildList函式的無限迴圈中。需要更詳細的stack call,如是對系統程式90116進行跟蹤。

SQL> oradebug setospid 90116
Oracle pid: 40, Unix process pid: 90116, image: oracleorcl@dbserv
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug short_stack;
ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0068<-00004750<-kksfbc+0bb0<-kkspsc0+0f3c<-kksParseCursor+00d4<-opiosq0+0b10<-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1004<-opitsk+1000<-opiino+0990<-opiodr+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070 SQL> oradebug dump processstate 10;
Statement processed.
SQL>  oradebug dump systemstate 266;
Statement processed.
SQL> oradebug tracefile_name
/oracle/admin/orcl/udump/orcl_ora_90116.trc

檢視生成的跟蹤檔案orcl_ora_90116.trc有如下內容:

SO: 7000001486ab188, type: 4, owner: 70000014346c5a8, flag: INIT/-/-/0x00
    (session) sid: 1020 trans: 0, creator: 70000014346c5a8, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 7000001473dcf10, psql: 7000001225ac0c8, user: 82/GTP2
    O/S info: user: gtp-default, term: WIN-AUQ43P0UU9L, ospid: 6708:12196, machine: WORKGROUP\WIN-AUQ43P0UU9L
              program: w3wp.exe
    last wait for 'kksfbc child completion' blocking sess=0x0 seq=2831 wait_time=48850 seconds since wait started=572057
                =0, =0, =0
    Dumping Session Wait History
     for 'kksfbc child completion' count=1 wait_time=48850
                =0, =0, =0

可以從以上trace中看到會話確實曾長時間處於’kksfbc child completion’等待中,之後陷入無限自旋(spins)中消耗了大量CPU時間。但這裡實際的表現又存有差異,引發無限迴圈的函式是kksfbc而不是kksSearchChildList(常規的呼叫序列是:kksParseCursor->kkspsc0->kksfbc ->kksSearchChildList->kkshgnc)。kksfbc意為K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]該函式用以在軟解析時找尋合適的子游標,在10.2.0.2以後引入了mutex互斥體來取代原有的Cursor Pin機制,Mutex較Latch更為輕量級。雖然mutex的引入改變了眾多cursor pin的內部機制,但kksfbc仍需要持有library cache latches才能掃描library cache hash chains。另一方面當kksfbc函式針對某個parent cursor找到合適child cursor後,可能使用KKSCHLPINx方法將該child cursor pin住,這個時候就需要exclusive地持有該child cursor相應的mutex。Oracle在10.2.0.4上提供了該Bug的one-off Patch
8575528,其在10.2.0.4 psu4以後的等價補丁為(Equivalent patch)為merge patch 9696904:8557428 9696904 7527908 Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts。但merge patch 9696904目前僅有Linux x86/64平臺上的版本,而問題資料庫所在平臺為IBM AIX on POWER Systems (64-bit),而且版本是10.2.0.1。那麼要解決這個問題是不是沒有辦法了,其實不然,我們可以將資料庫從10.2.0.1升級到10.2.0.5來解決這個BUG,在升級到10.2.0.5之後確實解決這個問題。

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26015009/viewspace-1966363/,如需轉載,請註明出處,否則將追究法律責任。

相關文章