Streams capture state PAUSED FOR FLOW CONTROL: cause analysis and diagnosis
Customer environment: IBM p595, two machines
Oracle: 10.2.0.4 RAC, exchanging data with a separate query database via Streams
Problem: when transactions are numerous and large, the capture process enters PAUSED FOR FLOW CONTROL and replication stops
Diagnosing the Streams replication interruption
1. After the interruption, the capture process state is PAUSED FOR FLOW CONTROL:
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME     STATE
---------------- ------------------------
STREAM_CAPTURE   PAUSED FOR FLOW CONTROL
In addition, select publisher_state from v$buffered_publishers returned
PUBLISHER_STATE: IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES.
This also shows the problem really lies in propagation or apply. However, when we checked the apply server state via select apply_name from v$streams_apply_server, every server was Idle. So the apply servers themselves were not the problem; the apply processes simply had nothing to do. Whether propagation was at fault was harder to judge, so we changed the two parameters that could affect it: we increased job_queue_processes and lowered _job_queue_interval to 1 (its original value was 5), hoping the propagation jobs would be scheduled more frequently. After restarting the database and Streams, the problem persisted.
We then examined the key propagation view dba_queue_schedules: failures was 0 and schedule_disabled was 'N' for every schedule, so the propagation jobs were healthy and propagation could essentially be ruled out. Next we monitored the wait events on nodes 203, 204 and 205 and found an anomaly on 205: very heavy latch free waits. The checks above are gathered in the sketch below.
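For reference, the propagation and apply health checks described above can be collected in one place. This is a minimal sketch using the standard Streams views; the columns selected are illustrative rather than the exact queries we ran at the time.

-- Backlog of messages in the buffered queue (the source of
-- "TOO MANY UNBROWSED MESSAGES").
SELECT queue_schema, queue_name, num_msgs, spill_msgs
  FROM v$buffered_queues;

-- Publisher (capture) flow-control state.
SELECT queue_name, sender_name, publisher_state
  FROM v$buffered_publishers;

-- Health of the propagation schedules: failures should stay at 0
-- and schedule_disabled should be 'N'.
SELECT schema, qname, destination, failures, schedule_disabled, last_error_msg
  FROM dba_queue_schedules;

-- Apply servers sitting Idle while capture is flow-controlled suggest
-- that messages are not reaching the destination queue at all.
SELECT apply_name, state, total_messages_applied
  FROM v$streams_apply_server;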
How to check the wait events:
select * from v$session_wait where wait_class<>'Idle';
The latch free event returned by the query above had P2 = 37; the specific latch can be identified from that (a one-step variant is sketched after the output below):
SQL> select * from v$latchname where latch#=37;
LATCH# NAME HASH
---------- -------------------------------------------------- -------
37 3794363110
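As a convenience, the two-step lookup can be collapsed into a single query by joining the waiting sessions to v$latchname on P2; this is a sketch, not the query used at the time.

-- For sessions waiting on 'latch free', P2 is the latch#, so the latch
-- name can be resolved directly.
SELECT w.sid, w.event, w.p2 AS latch#, n.name AS latch_name
  FROM v$session_wait w
  JOIN v$latchname   n ON n.latch# = w.p2
 WHERE w.event = 'latch free';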
Further diagnosis showed that an Oracle background process, LCK0, was blocking almost everything running on 205. Oracle's explanation of this behaviour is:
When SMON needs to recover a large transaction and the recovery cannot be done
in parallel, it can take longer than usual for it to read the messages from the
'obj stat del channel' broadcast channel.
LCK0 is also a subscriber of this channel, and if the messages are not read
by all subscribers then they are not purged and LCK0 has to traverse the whole
list of messages to find a new one, causing it to consume CPU and hold the
"channel operations parent latch" while doing it.
In other words, when a very large transaction run in parallel is interrupted unexpectedly or killed, SMON has to recover and undo it. Because this takes a long time, the LCK0 process holds the channel operations parent latch and blocks the operations of other sessions. The detailed explanation of the problem follows (a query for watching the recovery progress is sketched after it):
Since SMON was busy recovering a large dead transaction, it
could not process any of the broadcast messages being sent
to it. This in turn caused, a backlog of messages to build
up on the "obj stat del channel" broadcast channel. This
channel is used to inform SMON (on each node) that an
object has been truncated/dropped and there is no longer a
requirement to maintain any in-memory object statistics
structure (ksolselm) for that object. The LCK0 process is
also a subscriber to this channel (LCK0 acts as a gateway
for the inter-instance broadcast mechanism; LCK0 reads the
messages broadcast to local channels and then sends these
messages to other instances). Now if the messages start to
pile up on the "obj stat del channel" broadcast channel then
LCK0 has to continually traverse all the messages (linearly)
until reaching the latest/new unread messages at the end.
The traversal by LCK0 is done holding the 'channel operations
parent latch'. There is competition for this latch from other
processes which are dropping/truncating temporary objects,
plus there is competition from PMON, which is trying to clean up
dead processes that owned temporary objects. LCK0 is slowed
and it is slow responding to global enqueue requests which
in turn slows down SMON, and PMON takes a long time cleaning
just one dead process due to having to check for a possible
dead latchholder, and the whole situation gets worse.
(Note: as soon as SMON completes the transaction recovery
task, the performance problem should immediately clear up.)
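Under the assumption that SMON is indeed rolling back a large dead transaction, its progress can be watched from the fast-start recovery views; a minimal sketch:

-- Progress of dead-transaction recovery: undo blocks done vs total.
SELECT usn, state, undoblocksdone, undoblockstotal,
       ROUND(100 * undoblocksdone / NULLIF(undoblockstotal, 0), 1) AS pct_done
  FROM v$fast_start_transactions;

-- Parallel recovery slaves working on behalf of SMON (empty when serial).
SELECT state, undoblocksdone, pid
  FROM v$fast_start_servers;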
During this episode we could not shut down 205 normally; the messages pointed at a problem with PMON. The PMON-related trace below matches the behaviour described above exactly.
--====================================
*** 2011-09-21 20:59:43.256
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-21 20:59:43.249
*** SESSION ID:(1325.1) 2011-09-21 20:59:43.249
PMON unable to acquire latch 7000008054b69c0 Child channel operations parent latch level=5 child#=59
Location from where latch is held: ksrmfree():
Context saved from call: 504403192714062200
state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293
waiters [orapid (seconds since: put on list, posted, alive check)]:
343 (45, 1316609983, 3)
350 (42, 1316609983, 3)
351 (33, 1316609983, 3)
359 (15, 1316609983, 3)
357 (6, 1316609983, 3)
353 (3, 1316609983, 3)
waiter count=6
gotten 18094214 times wait, failed first 4390982 sleeps 2171306
gotten 0 times nowait, failed: 0
possible holder pid = 293 ospid=1966160
----------------------------------------
SO: 7000008082f5848, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=293, calls cur/top: 7000007e916fd70/7000007f9cc79a0, flag: (0) -
int error: 1089, call error: 0, sess error: 0, txn error 0
(post info) last post received: 1089 8 12
last post received-location: ksusig
last process to post me: 70000080f2f5468 2 0
last post sent: 504403192714062272 37 2
last post sent-location: kslges
last process posted by me: 70000080631a728 1 14
(latch info) wait_event=0 bits=20
holding (efd=15) 7000008054b69c0 Child channel operations parent latch level=5 child#=59
Location from where latch is held: ksrmfree():
Context saved from call: 504403192714062200
state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293
waiters [orapid (seconds since: put on list, posted, alive check)]:
343 (45, 1316609983, 3)
350 (42, 1316609983, 3)
351 (33, 1316609983, 3)
359 (15, 1316609983, 3)
357 (6, 1316609983, 3)
353 (3, 1316609983, 3)
waiter count=6
Process Group: DEFAULT, pseudo proc: 70000080738d260
O/S info: user: oracle, term: UNKNOWN, ospid: 1966160
OSD pid info: Unix process pid: 1966160, image: oracle@cxdb2
Short stack dump:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-ksrmfree+028c<-ksrchreset+03bc<-kwqidlisten+0688<-kwqilintl+02f0<-kwqilisten+0344<-spefcmpa+01c8<-spefmccallstd+00d4<-pextproc+006c<-peftrusted+00f8<-psdexsp+01f0<-rpiswu2+034c<-psdextp+038c<-pefccal+03c8<-pefcal+01dc<-pevm_FCAL+00e4<-pfrinstr_FCAL+0070<-pfrrun_no_tool+005c<-pfrrun+1014<-plsql_run+06b4<-peicnt+0224<-kkxexe+0250<-opiexe+2ef8<-kpoal8+0edc<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Dump of memory from 0x07000008082CA988 to 0x07000008082CAB90
7000008082CA980 0000000B 00000000 [........]
7000008082CA990 07000007 E8148FB0 00000010 000313A7 [................]
7000008082CA9A0 07000007 F9CC79A0 00000003 000313A7 [......y.........]
7000008082CA9B0 07000008 075D2088 0000000B 000313A7 [.....] .........]
7000008082CA9C0 07000008 043FC758 00000004 0003129B [.....?.X........]
7000008082CA9D0 07000007 FE5867B8 00000007 000313A7 [.....Xg.........]
7000008082CA9E0 07000007 FE586898 00000007 000313A7 [.....Xh.........]
7000008082CA9F0 07000007 FE586978 00000007 000313A7 [.....Xix........]
7000008082CAA00 07000007 FE586B38 00000007 000313A7 [.....Xk8........]
7000008082CAA10 07000007 FE586C18 00000007 000313A7 [.....Xl.........]
7000008082CAA20 07000008 075D23B8 0000000B 000313A7 [.....]#.........]
7000008082CAA30 07000008 075D24C8 0000000B 000313A7 [.....]$.........]
7000008082CAA40 00000000 00000000 00000000 00000000 [................]
Repeat 20 times
*** 2011-09-21 20:59:43.314
--====================================
Conclusion:
From the analysis above, the Streams replication pause was caused by a large transaction that had been killed: SMON's internal recovery of that transaction and PMON's cleanup of the dead process locked/blocked each other, leaving 205 almost hung. The propagated messages therefore could not be enqueued at all, which is why every apply server we saw on 205 was Idle. The large number of Streams capture and apply processes we had deployed also contributed.
Workaround:
If this happens again, first query the wait events on 205:
select * from v$session_wait where wait_class<>'Idle';
Then verify whether the wait event is latch free and, if so, whether the latch is the channel operations parent latch. If it is, stop all the Streams processes first; restart 205 if that is allowed, otherwise start only a few Streams processes, and bring all of them back up once the large transaction has been recovered (a sketch of stopping and restarting them follows).
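A minimal sketch of stopping and later restarting the Streams components with DBMS_CAPTURE_ADM, DBMS_PROPAGATION_ADM and DBMS_APPLY_ADM. STREAM_CAPTURE is the capture name from the output above; the propagation and apply names are placeholders for the real ones listed in dba_propagation and dba_apply.

BEGIN
  -- Stop capture first, then propagation, then apply.
  dbms_capture_adm.stop_capture(capture_name => 'STREAM_CAPTURE');
  dbms_propagation_adm.stop_propagation(propagation_name => 'STREAM_PROP');
  dbms_apply_adm.stop_apply(apply_name => 'STREAM_APPLY');
END;
/

-- After the large transaction has been recovered, bring everything back up.
BEGIN
  dbms_apply_adm.start_apply(apply_name => 'STREAM_APPLY');
  dbms_propagation_adm.start_propagation(propagation_name => 'STREAM_PROP');
  dbms_capture_adm.start_capture(capture_name => 'STREAM_CAPTURE');
END;
/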
Recommendations:
First, never kill sessions arbitrarily; second, split large transactions into smaller ones, for example by committing in batches as sketched below.
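As an illustration of splitting a large transaction, a batched delete that commits every N rows keeps each unit of recovery small; the table, predicate and batch size below are hypothetical.

DECLARE
  rows_done PLS_INTEGER;
BEGIN
  LOOP
    -- big_table and the date predicate are placeholders; the point is
    -- the per-batch COMMIT, which keeps each transaction (and any
    -- eventual SMON recovery of it) small.
    DELETE FROM big_table
     WHERE created < ADD_MONTHS(SYSDATE, -12)
       AND ROWNUM <= 10000;
    rows_done := SQL%ROWCOUNT;
    COMMIT;
    EXIT WHEN rows_done = 0;
  END LOOP;
END;
/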