kksfbc child completion與ksdxexeotherwait引發CPU使用異常
某客戶操作人員反應很慢不能操作,管理人員登入小機系統後發現CPU使用到了96%。而且這種情況持續了幾個月。以下是登入後小機後載取的topas圖,而且是週末,並沒有人使用系統。小機是IBM的550,配置是2顆6核的CPU,記憶體是48G。
如是登入資料庫執行以下指令碼來檢視當前資料庫消耗CPU最多的程式在執行什麼
Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 Connected as gtp2 SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE, 2 (select c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext, 3 (select c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID 4 from v$session s,v$process p 5 where p.SPID in(491720,90116,127336,529102,987524,331990) 6 and s.event not like'%SQL*Net%' and s.USERNAME='GTP2' 7 order by s.wait_time desc 8 ; SID SPID USERNAME EVENT WAIT_TIME STATE SECONDS_IN_WAIT PROGRAM MACHINE SQL_FULLTEXT BIND_DATA SQL_ID ---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ------------- 1020 90116 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 987524 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 331990 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 491720 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 4 rows selected
SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE, 2 (select c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext, 3 (select c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID 4 from v$session s,v$process p 5 where p.SPID in(491720,90116,127336,529102,987524,331990) 6 and s.event not like'%SQL*Net%' and s.USERNAME='GTP2' 7 order by s.wait_time desc 8 ; SID SPID USERNAME EVENT WAIT_TIME STATE SECONDS_IN_WAIT PROGRAM MACHINE SQL_FULLTEXT BIND_DATA SQL_ID ---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ------------- 1020 90116 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 987524 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 331990 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 491720 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 4 rows selected
從上面的資訊可以看到這些程式的等待事件為kksfbc child completion,ksdxexeotherwait。當看到這種情況時第一反應是不是遇到的BUG,以KKSFBC CHILD COMPLETION為關鍵字到MOS查詢可以找到,該Bug的症狀為程式不斷spin且hang住、出現’KKSFBC CHILD COMPLETION’等待事件、還可能伴有’Waits for “cursor: pin S”‘等待事件,直接影響的版本有11.1.0.6、10.2.0.3和10.2.0.4。而我這裡的版本是10.2.0.1。
對於該Bug的描述是在發生’kksfbc child completion’等待事件後會話陷入無休止的自旋(spins)中,這種自旋(spins)發生在由堆疊呼叫(stack call)kksSearchChildList->kkshgnc陷入對kksSearchChildList函式的無限迴圈中。需要更詳細的stack call,如是對系統程式90116進行跟蹤。
SQL> oradebug setospid 90116 Oracle pid: 40, Unix process pid: 90116, image: oracleorcl@dbserv SQL> oradebug unlimit; Statement processed. SQL> oradebug short_stack; ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0068<-00004750<-kksfbc+0bb0<-kkspsc0+0f3c<-kksParseCursor+00d4<-opiosq0+0b10<-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1004<-opitsk+1000<-opiino+0990<-opiodr+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070 SQL> oradebug dump processstate 10; Statement processed. SQL> oradebug dump systemstate 266; Statement processed. SQL> oradebug tracefile_name /oracle/admin/orcl/udump/orcl_ora_90116.trc
檢視生成的跟蹤檔案orcl_ora_90116.trc有如下內容:
SO: 7000001486ab188, type: 4, owner: 70000014346c5a8, flag: INIT/-/-/0x00 (session) sid: 1020 trans: 0, creator: 70000014346c5a8, flag: (41) USR/- BSY/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 7000001473dcf10, psql: 7000001225ac0c8, user: 82/GTP2 O/S info: user: gtp-default, term: WIN-AUQ43P0UU9L, ospid: 6708:12196, machine: WORKGROUP\WIN-AUQ43P0UU9L program: w3wp.exe last wait for 'kksfbc child completion' blocking sess=0x0 seq=2831 wait_time=48850 seconds since wait started=572057 =0, =0, =0 Dumping Session Wait History for 'kksfbc child completion' count=1 wait_time=48850 =0, =0, =0
可以從以上trace中看到會話確實曾長時間處於’kksfbc child completion’等待中,之後陷入無限自旋(spins)中消耗了大量CPU時間。但這裡實際的表現又存有差異,引發無限迴圈的函式是kksfbc而不是kksSearchChildList(常規的呼叫序列是:kksParseCursor->kkspsc0->kksfbc ->kksSearchChildList->kkshgnc)。kksfbc意為K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]該函式用以在軟解析時找尋合適的子游標,在10.2.0.2以後引入了mutex互斥體來取代原有的Cursor Pin機制,Mutex較Latch更為輕量級。雖然mutex的引入改變了眾多cursor pin的內部機制,但kksfbc仍需要持有library cache latches才能掃描library cache hash chains。另一方面當kksfbc函式針對某個parent cursor找到合適child cursor後,可能使用KKSCHLPINx方法將該child cursor pin住,這個時候就需要exclusive地持有該child cursor相應的mutex。Oracle在10.2.0.4上提供了該Bug的one-off Patch
8575528,其在10.2.0.4 psu4以後的等價補丁為(Equivalent patch)為merge patch 9696904:8557428 9696904 7527908 Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts。但merge patch 9696904目前僅有Linux x86/64平臺上的版本,而問題資料庫所在平臺為IBM AIX on POWER Systems (64-bit),而且版本是10.2.0.1。那麼要解決這個問題是不是沒有辦法了,其實不然,我們可以將資料庫從10.2.0.1升級到10.2.0.5來解決這個BUG,在升級到10.2.0.5之後確實解決這個問題。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26015009/viewspace-1966363/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- kksfbc child completion
- python自定義異常,使用raise引發異常PythonAI
- 兩種異常(CPU異常、使用者模擬異常)的收集
- MongoDB 異常當機與引數cacheSizeGBMongoDB
- Android開發:系統程式中使用Webview引發異常的處理AndroidWebView
- 【高併發】由InterruptedException異常引發的思考Exception
- 一次 Jedis 引數異常引發服務雪崩
- 一場 Kafka CRC 異常引發的血案Kafka
- python生成器呼叫方法引發異常Python
- java程式佔用cpu異常升高Java
- 深度覆盤-重啟 etcd 引發的異常
- Bulk 異常引發的 Elasticsearch 記憶體洩漏Elasticsearch記憶體
- 異常與IO
- Java 異常表與異常處理原理Java
- DNS伺服器故障引發流量異常問題-VeCloudDNS伺服器Cloud
- 一次訊號量引發的tomcat異常退出Tomcat
- JVM實戰調優(空格引發的服務異常)JVM
- Java - 異常與FileJava
- MVC使用異常過濾器處理異常MVC過濾器
- C#規範整理·異常與自定義異常C#
- PostgreSQL DBA(64) - checkpoint_completion_target引數解析SQL
- oracle 使用異常exceptionOracleException
- python錯誤與異常Python
- 異常篇—— VEH 與 SEH
- Kotlin藝術探索之引數和異常Kotlin
- Java異常使用原則Java
- 記Linux使用異常2Linux
- 如何看開發板原理圖及外設引腳與CPU引腳的連線
- 一次線上Redis類轉換異常排查引發的思考Redis
- ConcurrentModificationException併發修改異常Exception
- 異常處理機制(二)之異常處理與捕獲
- 前端效能與異常上報前端
- Flutter 使用 bugly 進行異常上報與熱修復Flutter
- Android Studio: Kotlin使用DataBinding異常AndroidKotlin
- 由VIP漂移引發的演算法異常問題調查和解決演算法
- 異常和異常呼叫鏈
- 異常篇——異常記錄
- 異常篇——異常處理
- dubbo~全域性異常攔截器的使用與設計缺陷