Streams capture state PAUSED FOR FLOW CONTROL: cause analysis and diagnosis
Customer environment: IBM p595, two machines
Oracle: 10.2.0.4 RAC, exchanging data with a separate query database via Streams
Problem: when transactions are numerous and large, the capture process enters PAUSED FOR FLOW CONTROL and replication stops
Diagnosing the Streams replication interruption
1. After the interruption, the capture process state is PAUSED FOR FLOW CONTROL:
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME     STATE
---------------- ------------------------
STREAM_CAPTURE   PAUSED FOR FLOW CONTROL
In addition, select publisher_state from v$buffered_publishers returned
PUBLISHER_STATE: IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES.
This also shows the problem really lies in propagation or apply. However, when we checked the apply server state via select apply_name from v$streams_apply_server, every server was Idle. So the apply servers themselves were not the problem; the apply processes simply had nothing to do. Whether propagation was at fault was harder to judge, so we changed the two parameters that could affect it: we increased job_queue_processes and lowered _job_queue_interval to 1 (its original value was 5), hoping the propagation jobs would be scheduled more frequently. After restarting the database and Streams, the problem persisted.
We then examined the key propagation view dba_queue_schedules: failures was 0 and schedule_disabled was 'N' for every schedule, so the propagation jobs were healthy and propagation could essentially be ruled out. Next we monitored the wait events on nodes 203, 204 and 205 and found an anomaly on 205: very heavy latch free waits. The checks above are gathered in the sketch below.
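For reference, the propagation and apply health checks described above can be collected in one place. This is a minimal sketch using the standard Streams views; the columns selected are illustrative rather than the exact queries we ran at the time.

-- Backlog of messages in the buffered queue (the source of
-- "TOO MANY UNBROWSED MESSAGES").
SELECT queue_schema, queue_name, num_msgs, spill_msgs
  FROM v$buffered_queues;

-- Publisher (capture) flow-control state.
SELECT queue_name, sender_name, publisher_state
  FROM v$buffered_publishers;

-- Health of the propagation schedules: failures should stay at 0
-- and schedule_disabled should be 'N'.
SELECT schema, qname, destination, failures, schedule_disabled, last_error_msg
  FROM dba_queue_schedules;

-- Apply servers sitting Idle while capture is flow-controlled suggest
-- that messages are not reaching the destination queue at all.
SELECT apply_name, state, total_messages_applied
  FROM v$streams_apply_server;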
How to check the wait events:
select * from v$session_wait where wait_class<>'Idle';
The latch free event returned by the query above had P2 = 37; the specific latch can be identified from that (a one-step variant is sketched after the output below):
SQL> select * from v$latchname where latch#=37;
LATCH# NAME HASH
---------- -------------------------------------------------- -------
37 3794363110
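As a convenience, the two-step lookup can be collapsed into a single query by joining the waiting sessions to v$latchname on P2; this is a sketch, not the query used at the time.

-- For sessions waiting on 'latch free', P2 is the latch#, so the latch
-- name can be resolved directly.
SELECT w.sid, w.event, w.p2 AS latch#, n.name AS latch_name
  FROM v$session_wait w
  JOIN v$latchname   n ON n.latch# = w.p2
 WHERE w.event = 'latch free';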
Further diagnosis showed that an Oracle background process, LCK0, was blocking almost everything running on 205. Oracle's explanation of this behaviour is:
When SMON needs to recover a large transaction and the recovery cannot be done
in parallel, it can take longer than usual for it to read the messages from the
'obj stat del channel' broadcast channel.
LCK0 is also a subscriber of this channel, and if the messages are not read
by all subscribers then they are not purged and LCK0 has to traverse the whole
list of messages to find a new one, causing it to consume CPU and hold the
"channel operations parent latch" while doing it.
In other words, when a very large transaction run in parallel is interrupted unexpectedly or killed, SMON has to recover and undo it. Because this takes a long time, the LCK0 process holds the channel operations parent latch and blocks the operations of other sessions. The detailed explanation of the problem follows (a query for watching the recovery progress is sketched after it):
Since SMON was busy recovering a large dead transaction, it
could not process any of the broadcast messages being sent
to it. This in turn caused, a backlog of messages to build
up on the "obj stat del channel" broadcast channel. This
channel is used to inform SMON (on each node) that an
object has been truncated/dropped and there is no longer a
requirement to maintain any in-memory object statistics
structure (ksolselm) for that object. The LCK0 process is
also a subscriber to this channel (LCK0 acts as a gateway
for the inter-instance broadcast mechanism; LCK0 reads the
messages broadcast to local channels and then sends these
messages to other instances). Now if the messages start to
pile up on the "obj stat del channel" broadcast channel then
LCK0 has to continually traverse all the messages (linearly)
until reaching the latest/new unread messages at the end.
The traversal by LCK0 is done holding the 'channel operations
parent latch'. There is competition for this latch from other
processes which are dropping/truncating temporary objects,
plus there is competition from PMON, which is trying to clean up
dead processes that owned temporary objects. LCK0 is slowed
and it is slow responding to global enqueue requests which
in turn slows down SMON, and PMON takes a long time cleaning
just one dead process due to having to check for a possible
dead latchholder, and the whole situation gets worse.
(Note: as soon as SMON completes the transaction recovery
task, the performance problem should immediately clear up.)
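Under the assumption that SMON is indeed rolling back a large dead transaction, its progress can be watched from the fast-start recovery views; a minimal sketch:

-- Progress of dead-transaction recovery: undo blocks done vs total.
SELECT usn, state, undoblocksdone, undoblockstotal,
       ROUND(100 * undoblocksdone / NULLIF(undoblockstotal, 0), 1) AS pct_done
  FROM v$fast_start_transactions;

-- Parallel recovery slaves working on behalf of SMON (empty when serial).
SELECT state, undoblocksdone, pid
  FROM v$fast_start_servers;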
During this episode we could not shut down 205 normally; the messages pointed at a problem with PMON. The PMON-related trace below matches the behaviour described above exactly.
--====================================
*** 2011-09-21 20:59:43.256
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-21 20:59:43.249
*** SESSION ID:(1325.1) 2011-09-21 20:59:43.249
PMON unable to acquire latch 7000008054b69c0 Child channel operations parent latch level=5 child#=59
Location from where latch is held: ksrmfree():
Context saved from call: 504403192714062200
state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293
waiters [orapid (seconds since: put on list, posted, alive check)]:
343 (45, 1316609983, 3)
350 (42, 1316609983, 3)
351 (33, 1316609983, 3)
359 (15, 1316609983, 3)
357 (6, 1316609983, 3)
353 (3, 1316609983, 3)
waiter count=6
gotten 18094214 times wait, failed first 4390982 sleeps 2171306
gotten 0 times nowait, failed: 0
possible holder pid = 293 ospid=1966160
----------------------------------------
SO: 7000008082f5848, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=293, calls cur/top: 7000007e916fd70/7000007f9cc79a0, flag: (0) -
int error: 1089, call error: 0, sess error: 0, txn error 0
(post info) last post received: 1089 8 12
last post received-location: ksusig
last process to post me: 70000080f2f5468 2 0
last post sent: 504403192714062272 37 2
last post sent-location: kslges
last process posted by me: 70000080631a728 1 14
(latch info) wait_event=0 bits=20
holding (efd=15) 7000008054b69c0 Child channel operations parent latch level=5 child#=59
Location from where latch is held: ksrmfree():
Context saved from call: 504403192714062200
state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293
waiters [orapid (seconds since: put on list, posted, alive check)]:
343 (45, 1316609983, 3)
350 (42, 1316609983, 3)
351 (33, 1316609983, 3)
359 (15, 1316609983, 3)
357 (6, 1316609983, 3)
353 (3, 1316609983, 3)
waiter count=6
Process Group: DEFAULT, pseudo proc: 70000080738d260
O/S info: user: oracle, term: UNKNOWN, ospid: 1966160
OSD pid info: Unix process pid: 1966160, image: oracle@cxdb2
Short stack dump:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-ksrmfree+028c<-ksrchreset+03bc<-kwqidlisten+0688<-kwqilintl+02f0<-kwqilisten+0344<-spefcmpa+01c8<-spefmccallstd+00d4<-pextproc+006c<-peftrusted+00f8<-psdexsp+01f0<-rpiswu2+034c<-psdextp+038c<-pefccal+03c8<-pefcal+01dc<-pevm_FCAL+00e4<-pfrinstr_FCAL+0070<-pfrrun_no_tool+005c<-pfrrun+1014<-plsql_run+06b4<-peicnt+0224<-kkxexe+0250<-opiexe+2ef8<-kpoal8+0edc<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098
Dump of memory from 0x07000008082CA988 to 0x07000008082CAB90
7000008082CA980 0000000B 00000000 [........]
7000008082CA990 07000007 E8148FB0 00000010 000313A7 [................]
7000008082CA9A0 07000007 F9CC79A0 00000003 000313A7 [......y.........]
7000008082CA9B0 07000008 075D2088 0000000B 000313A7 [.....] .........]
7000008082CA9C0 07000008 043FC758 00000004 0003129B [.....?.X........]
7000008082CA9D0 07000007 FE5867B8 00000007 000313A7 [.....Xg.........]
7000008082CA9E0 07000007 FE586898 00000007 000313A7 [.....Xh.........]
7000008082CA9F0 07000007 FE586978 00000007 000313A7 [.....Xix........]
7000008082CAA00 07000007 FE586B38 00000007 000313A7 [.....Xk8........]
7000008082CAA10 07000007 FE586C18 00000007 000313A7 [.....Xl.........]
7000008082CAA20 07000008 075D23B8 0000000B 000313A7 [.....]#.........]
7000008082CAA30 07000008 075D24C8 0000000B 000313A7 [.....]$.........]
7000008082CAA40 00000000 00000000 00000000 00000000 [................]
Repeat 20 times
*** 2011-09-21 20:59:43.314
--====================================
Conclusion:
From the analysis above, the Streams replication pause was caused by a large transaction that had been killed: SMON's internal recovery of that transaction and PMON's cleanup of the dead process locked/blocked each other, leaving 205 almost hung. The propagated messages therefore could not be enqueued at all, which is why every apply server we saw on 205 was Idle. The large number of Streams capture and apply processes we had deployed also contributed.
Workaround:
If this happens again, first query the wait events on 205:
select * from v$session_wait where wait_class<>'Idle';
Then verify whether the wait event is latch free and, if so, whether the latch is the channel operations parent latch. If it is, stop all the Streams processes first; restart 205 if that is allowed, otherwise start only a few Streams processes, and bring all of them back up once the large transaction has been recovered (a sketch of stopping and restarting them follows).
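A minimal sketch of stopping and later restarting the Streams components with DBMS_CAPTURE_ADM, DBMS_PROPAGATION_ADM and DBMS_APPLY_ADM. STREAM_CAPTURE is the capture name from the output above; the propagation and apply names are placeholders for the real ones listed in dba_propagation and dba_apply.

BEGIN
  -- Stop capture first, then propagation, then apply.
  dbms_capture_adm.stop_capture(capture_name => 'STREAM_CAPTURE');
  dbms_propagation_adm.stop_propagation(propagation_name => 'STREAM_PROP');
  dbms_apply_adm.stop_apply(apply_name => 'STREAM_APPLY');
END;
/

-- After the large transaction has been recovered, bring everything back up.
BEGIN
  dbms_apply_adm.start_apply(apply_name => 'STREAM_APPLY');
  dbms_propagation_adm.start_propagation(propagation_name => 'STREAM_PROP');
  dbms_capture_adm.start_capture(capture_name => 'STREAM_CAPTURE');
END;
/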
Recommendations:
First, never kill sessions arbitrarily; second, split large transactions into smaller ones, for example by committing in batches as sketched below.
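As an illustration of splitting a large transaction, a batched delete that commits every N rows keeps each unit of recovery small; the table, predicate and batch size below are hypothetical.

DECLARE
  rows_done PLS_INTEGER;
BEGIN
  LOOP
    -- big_table and the date predicate are placeholders; the point is
    -- the per-batch COMMIT, which keeps each transaction (and any
    -- eventual SMON recovery of it) small.
    DELETE FROM big_table
     WHERE created < ADD_MONTHS(SYSDATE, -12)
       AND ROWNUM <= 10000;
    rows_done := SQL%ROWCOUNT;
    COMMIT;
    EXIT WHEN rows_done = 0;
  END LOOP;
END;
/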