在檢視資料庫負載的時候，發現早上10點開始到12點的這兩個鐘頭，系統負載異常的高。於是抓取了一個awr報告。

	Snap Id	Snap Time	Sessions	Cursors/Session
Begin Snap:	24027	18-Dec-14 10:00:43	3065	6.6
End Snap:	24033	18-Dec-14 11:00:48	3441	6.5
Elapsed:		60.08 (mins)
DB Time:		3,650.67 (mins)

可以從profile裡看到，系統的IO負載還是很高的。

Load Profile

	Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	60.8	1.2	0.02	0.01
DB CPU(s):	10.2	0.2	0.00	0.00
Redo size:	1,835,598.3	35,780.1
Logical reads:	1,910,437.0	37,238.9
Block changes:	4,923.1	96.0
Physical reads:	182,370.1	3,554.8
Physical writes:	1,670.7	32.6
User calls:	7,331.1	142.9
Parses:	2,128.1	41.5
Hard parses:	8.4	0.2
W/A MB processed:	48.0	0.9
Logons:	3.5	0.1
Executes:	3,911.2	76.2
Rollbacks:	1.5	0.0
Transactions:	51.3

等待事件的情況如下：

Top 5 Timed Foreground Events

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
db file sequential read	24,806,147	91,539	4	41.79	User I/O
read by other session	20,260,140	37,030	2	16.91	User I/O
DB CPU		36,766		16.79
direct path read	3,090,582	30,292	10	13.83	User I/O
log file sync	191,401	7,738	40	3.53	Commit

看完awr報告，大概能夠定位是sql的問題了，直接生成了一個addm報告，作為參考。最後確認發現的幾個問題sql已經在修復了，還沒有部署到生產中。
今天想仔細的分析一下，畢竟很多東西都是在實踐中總結出來的。
對於read by other session自己還是比較陌生的。
從metalink的描述來看。這個等待事件是在10g之後對於buffer busy waits 的一個細分。(MOS ID "Read By Other Session" Wait Event (Doc ID 732891.1))
This wait event occurs when we are trying to access a buffer in the buffer cache but we find that the buffer is currently being read from disk by another user so we need to wait for that to complete before we can access it. In previous versions, this wait was classified under the "buffer busy waits" event. However, in Oracle 10.1 and higher, the wait time is now broken out into the "read by other session" wait event.

Excessive waits for this event are typically due to several processes repeatedly reading the same blocks, e.g. many sessions scanning the same index or performing full table scans on the same table. Tuning this issue is a matter of finding and eliminating this contention.

對於這個等待事件可以結合awr和ash來進行排查。
首先我們在awr報告中已經得知這個等待事件，一般都會存在sequential read。但是透過awr不能清晰的定位出哪些sql語句和這個等待事件相關。
使用ash可以把時間段有針對性的調整在一個合理的範圍。如果問題已經發生了一段時間，我們可以透過歷史資料能夠大體定位。
Top SQL with Top Events

Planhash	Sampled # of Executions	% Activity	Event	% Event	Top Row Source	% RwSrc	SQL Text
3622987757	73	8.72	direct path read	6.48	TABLE ACCESS - FULL	6.48	SELECT /*+leading(CYC_CUST) p...
			db file sequential read	1.30	INDEX - RANGE SCAN	0.80
2808023112	4	8.54	CPU + Wait for CPU	4.42	INDEX - UNIQUE SCAN	1.41	SELECT COUNT(1) FROM (/*
			db file sequential read	2.28	INDEX - UNIQUE SCAN	1.74
			read by other session	1.56	INDEX - UNIQUE SCAN	1.56
2134445868	6	7.24	db file sequential read	3.58	TABLE ACCESS - BY INDEX ROWID	2.61	select /*+ leading (bpm_ai bpm...
			read by other session	2.14	TABLE ACCESS - BY INDEX ROWID	1.88
			db file parallel read	1.12	TABLE ACCESS - BY INDEX ROWID	1.12
1565267664	35	5.03	direct path read	4.71	TABLE ACCESS - FULL	4.71	SELECT /*+parallel(CYC_PAY...
561414009	71	4.27	direct path read	3.87	TABLE ACCESS - FULL	3.87	SELECT SUB.CUSTOMER_ID FROM SU...

對於read by other session,和熱點塊也是相關的。
我們可以使用下面的sql語句來檢視一下指定時間段內的等待事件的細節。
透過ash中的Top Event P1/P2/P3 Values能夠直接找到。

Event	% Event	P1 Value, P2 Value, P3 Value	% Activity	Parameter 1	Parameter 2	Parameter 3
db file sequential read	43.11	"1","3386","1"	0.04	file#	block#	blocks
direct path read	25.62	"5","756618","127"	0.07	file number	first dba	block cnt
read by other session	3.98	"46","1609148","1"	0.07	file#	block#	class#
db file scattered read	2.53	"3","116107","126"	0.04	file#	block#	blocks
db file parallel read	2.50	"2","2","2"	0.51	files	blocks	requests

或者透過如下的sql語句來看看對應的熱點塊。
SQL> SELECT relative_fno, owner, segment_name, segment_type FROM dba_extents
WHERE file_id = 46
AND 1609148 BETWEEN block_id AND block_id + blocks - 1;
RELATIVE_FNO OWNER SEGMENT_NAME SEGMENT_TYPE
------------ ------------------------------ --------------------------------------------------------------------------------- ------------------
20 APPO PAYMENT_DETAILS_PK INDEX PARTITION

這條sql語句在文章(巧用rowid簡化sql查詢 http://blog.itpub.net/23718752/viewspace-1240404/)已經做了討論。
但是不知道什麼原因被遺漏了,和客戶做了確認。
這條sql語句是在產品線中統一規劃的，已經使用Hint穩定了執行計劃，但是涉及的幾個表都是千萬的大表，查詢中使用了並行，同時需要在多臺伺服器中同時執行這條語句。這樣的話，就會存在多個session執行
多個並行查詢。從這個角度來說發生read by other session也是意料之中的。因為這個查詢是一個分頁查詢，從操作上也可以做一些最佳化，目前是每隔10分鐘做一次查詢，然後查取1000條相關的資料。和業務部分協調的結果是，每隔半個小時執行一次查詢，每次查取3000條資料。

可能一個等待事件中能夠體現出一個表象。更多的細節還需要排查確認，很多最佳化工作不止能從資料庫層面做，還可以從業務層面調整。

關於IO診斷的一些問題，可以參考 Troubleshooting I/O Related Waits (Doc ID 223117.1)

關於等待事件"read by other session"

相關文章