Systematic Latch Contention Troubleshooting in Oracle

lfree發表於2020-12-16

--// 昨天一邊看一邊使用金山詞霸,翻譯這篇文章,可能存在很多錯誤。希望對大家有用。




Systematic Latch Contention Troubleshooting in Oracle

By Tanel Poder

This article was originally published at IOUG Select journal Q1 2010. Thanks to John Kanagaraj for editing the IOUG
Select article and pushing me for deadlines :-)

--//這篇文章最初發表在IOUG選擇雜誌2010年第一季度。 感謝約翰·卡納加拉傑編輯了"IOUG選擇"文章,並催促我完成截止日期:-)

Thanks to IOUG team for giving me the IOUG Select Journal Editor's Choice award (2011) for this article!!!

--//感謝IOUG團隊為這篇文章授予我IOUG選擇期刊編輯選擇獎(2011年)! !

Contents

    1 Systematic Latch Contention Troubleshooting in Oracle
        1.1 What is a latch?
        1.2 Systematic troubleshooting of latch contention
        1.3 Question 1 - Who is trying to get the latch and why?
        1.4 Question 2 - Who is holding the latch and why?
        1.5 Using LatchProf to find who are holding latches the most
        1.6 Using LatchProfX for finding the hot block causing cache buffers chains latch contention
        1.7 Limitations of V$LATCHHOLDER and LatchProf(X)
        1.8 Download LatchProf and LatchProfX

/*+
1 Oracle系統故障排除
   1.1什麼是閂鎖?
   1.2閂鎖爭用系統故障排除
   1.3問題1-誰試圖獲得閂鎖,為什麼?
   1.4問題2-誰拿著閂鎖,為什麼?
   1.5利用Latchprof 找出拿著鎖的人最多
   1.6使用Latch ProfX查詢導致快取緩衝鏈鎖存爭用的熱塊
   1.7 V$LATCHOLDER和LatchProf(X)的限制
   1.8下載LatchProf and LatchProfX
*/

Introduction

As an Oracle DBA, developer or performance analyst, you may have run into what is termed "latch contention" at various
points. So what exactly is a "latch" and why do we have contention on this "latch". In this article, we will take an
in-depth look at latches and how we determine and resolve such contention. Whether you are a newbie or an experienced
old-timer, we hope this article will cast a little more light on this ill-understood subject.

--//作為一名OracleDBA、開發人員或效能分析師,您可能在不同的時候遇到了所謂的"鎖存爭用"。 那麼,什麼是"鎖存器",為什麼我們
--//對這個"鎖存器"有爭議。 在本文中,我們將深入研究鎖存器以及我們如何確定和解決這種爭論。 無論你是新手還是經驗豐富的老手
--//,我們希望這篇文章能對這個不被理解的話題多一點啟發。

What is a latch?

I will put it very simply: Latch is a lock. In fact, a latch is just another special type of lock in Oracle. Oracle uses
many different types of locks internally, for example enqueue locks, library cache locks, library cache pins, buffer
cache pins, mutexes and latches and these different lock types are useful for different purposes. Enqueue locks are
sophisticated, provide ordered queuing capabilities and are used for cases where the locks are held for relatively long
durations. Latches on the other hand are much less sophisticated, more lightweight and are used for protecting very
short operations on memory structures such various internal linked list modifications, shared pool memory allocation,
library cache object lookups and so on. Latches are very low-level locks managed deep inside Oracle kernel code and
users or applications can not directly acquire nor release them.

--//我會簡單地說:Latch是一把鎖。 事實上,栓鎖只是Oracle中另一種特殊型別的鎖。 Oracle內部使用許多不同型別的鎖,例如
--//enqueue locks、library cache locks, library cache pins, buffer cache pins、互斥鎖和鎖存,這些不同的鎖型別對於不同的
--//目的是有用的。 Enqueue locks是複雜的,提供有序的排隊功能,並用於鎖持有相對較長時間的情況。 另一方面,鎖存器不那麼復
--//雜,更輕,並且用於保護記憶體結構上非常短的操作,例如各種內部連結串列修改、共享池記憶體分配、庫快取物件查詢等。 鎖存器是在
--//Oracle核心程式碼的深層管理的非常低階的鎖,使用者或應用程式不能直接獲取或釋放它們。

So, how is a latch physically implemented? Every latch is just a memory structure in SGA, usually 100-200 bytes in size,
depending on your Oracle version, hardware platform and whether you are running 32 or 64-bit Oracle.

--//那麼,鎖存器是如何物理實現的呢? 每個鎖存器只是SGA中的記憶體結構,通常大小為100-200位元組,這取決於您的Oracle版本、硬體
--//平臺以及執行32位還是64位Oracle。

Contrary to a common misconception, latches are in no way related to OS synchronization mechanisms such as mutexes or
semaphores! Latches are just plain memory structures in SGA and the OS doesn't even have an idea that latches exist.
Taking a latch does not require any system calls or OS interaction. If that were the case, latch operations would slow
down considerably as such system calls are expensive. If a process wants to take a latch, it just reads latch memory to
check whether the latch value is zero (i.e. latch is not taken) and if it is zero, then it changes the value to indicate
that the latch is taken. For "exclusive-get-only" latches, the value will be hardcoded, such as "0xFF" on Intel
platforms, but for "shared-get-capable" latches, the acquiring process will put its own Oracle PID in there if it takes
that latch in exclusive mode. If a shared latch is held by many processes simultaneously then the latch value will just
show the reference count (the number of shared mode holders) and tracking individual processes is done via state objects
memory structures.

--//與常見的誤解相反,鎖存器與OS同步機制(如互斥或訊號量)沒有任何關係!鎖存只是SGA中的普通記憶體結構,作業系統甚至沒有存在
--//鎖存的想法。 使用鎖存器不需要任何系統呼叫或OS互動。 如果是這樣的話,鎖存操作將大大減慢,因為這樣的系統呼叫是昂貴的。
--//如果一個程式想要接受一個鎖存器,它只是讀取鎖存記憶體來檢查鎖存值是否為零(即。 鎖存器不被取),如果它是零,那麼它改變值
--//以指示鎖存器被取。 對於"獨家獲取唯一"鎖存器,該值將被硬編碼,例如英特爾平臺上的"0x FF",但對於"共享獲取能力"鎖存器,
--//如果它以獨佔模式接受該鎖存器,則獲取過程將將將其自己的Oracle PID放在其中。如果一個共享鎖存器同時被許多程式保持,那麼
--//鎖存值將只顯示參考計數(共享模式保持器數),並且跟蹤單個程式是透過狀態物件記憶體結構完成的。

So, if a latch is just some memory content in SGA, modified by processes at will, who then guarantees that two processes
(running on different CPUs) don't accidentally take the same latch simultaneously, without knowing about each other?
This would be a race condition, and if allowed, the Oracle instance could quickly end up in a crash or a hang. The
solution lies in atomic "test-and-set" and "compare-and-swap" (CAS) instructions which all modern hardware supports.
This allows the Oracle kernel code to perform latch value checking and changing in a single atomic instruction. On some
platforms (such as Intel), this needs to be combined with a special flag which locks the memory line where the latch
lives for a single CPU access only for the duration of the latch get. This is done with LOCK CMPXCHG instruction on
Intel platform. CMPXCHG means Compare and Exchange and the LOCK prefix guarantees that only one CPU at a time can access
that latch  (remember, latches are usually held for very short periods, measured as low as just microseconds).

--//那麼,如果一個鎖存器只是SGA中的一些記憶體內容,由程式隨意修改,那麼誰又保證兩個程式(在不同的CPU上執行)不會意外地同時接
--//受同一個鎖存器,而不會互相瞭解呢? 這將是一個競賽條件,如果允許,Oracle例項可能會很快以崩潰或掛起告終。 解決辦法在於
--//所有現代硬體都支援的原子"測試和設定"和"比較和交換"(CAS)指令。 這允許Oracle核心程式碼在單個原子指令中執行鎖存值檢查和更
--//改。在一些平臺(如Intel)上,這需要與一個特殊的標誌相結合,該標誌鎖定了鎖存器僅在鎖存器獲取的持續時間內為單個CPU訪問而
--//生活的記憶體線。這是用英特爾平臺上的LOCKCMP X CHG指令完成的。 CMPX CHG的意思是比較和交換,LOCK字首保證每次只有一個CPU
--//可以訪問該鎖存器(記住,鎖存器通常保持很短的時間,測量低到只有微秒)。

Since the Oracle SGA holds lots of information in various types of memory structures,  it is not enough to have just one
latch for each type of memory structure to be protected. For example, if you have an 8 GB buffer cache for 8 KB blocks,
then you have one million buffers in the buffer cache. Reasonable concurrent access to all these million buffers can't
be provided by a single latch (in this case a "cache buffers chains" latch). As well, the library cache holds thousands
if not hundreds of thousands of cursors and objects in it, reasonable access to all of them would not be possible with a
single "library cache" latch, especially if you have lots of CPUs in your database server box. Therefore, Oracle has
split the memory structures that need protection into logical sub-pieces and each piece is protected by a separate
sub-latch known as a child latch. Some structures such redo log buffer are physically split into ranges and each range
is protected by (and accessed using) a separate latch (redo allocation latch for log buffer). In some cases,an algorithm
such as a hash or modulus function is used for determining which exact child latch should be used when accessing an
object. For example, if you run a database with large buffer cache (100GB+), you may have millions of cache buffers
chains child latches in use. When accessing a data buffer block, Oracle will run a modulus function on the data block
address (DBA) of the block to determine which cache buffers chains latch to get to satisfy that access. The right
library cache latch is also determined by running modulus function on the hash value of library    cache object name (the
hash_value column in V$SQL for example).

--//由於OracleSGA在各種型別的記憶體結構中儲存了大量資訊,因此只有一個鎖存器來保護每種型別的記憶體結構是不夠的。例如,如果您
--//有8KB塊的8GB緩衝區快取,那麼緩衝區快取中有100萬個緩衝區。 對所有這些百萬緩衝器的合理併發訪問不能由一個鎖存器提供(在
--//這種情況下是一個"快取緩衝鏈"鎖存器)。 此外,庫快取如果不包含幾十萬個遊標和物件,那麼使用一個"庫快取"鎖存器就不可能合
--//理地訪問所有遊標和物件,特別是如果資料庫伺服器框中有大量CPU。 因此,Oracle將需要保護的記憶體結構拆分為邏輯子塊,每個塊
--//都由一個單獨的子鎖存器保護,稱為子鎖存器。一些結構,如重做日誌緩衝區,物理上被分割成範圍,每個範圍由(並使用)一個單獨
--//的鎖存器(日誌緩衝區的重做分配鎖存器)保護。 在某些情況下,使用雜湊或模數函式等演算法來確定在訪問物件時應該使用哪個精確
--//的子鎖存器。 例如,如果執行具有大緩衝快取(100GB)的資料庫,則可能有數百萬個快取緩衝區鏈子鎖存器正在使用。 在訪問資料
--//緩衝塊時,Oracle將在塊的資料塊地址(D BA)上執行模數函式,以確定要獲得哪些快取緩衝鏈鎖存以滿足該訪問。 正確的庫快取鎖
--//存器也是透過在庫快取物件名稱的雜湊值上執行模數函式來確定的(例如V$SQL中的hash_value列)..

Since latches are held for very short periods, if a latch is busy due some other process, then the acquiring process
doesn't give up immediately when it cannot get the latch on the first try. If a latch is busy, the acquiring process
tries again few thousand times instead of going to sleep (and wait) immediately. The reasoning here is that if a latch
is currently busy, it normally will be freed very quickly, so instead of taking the trouble of going to sleep (requires
a syscall and context switch) we "spin" and burn some more CPU and test with the hope of getting the latch some
(hundreds) of microseconds later. In other words, some CPU time is sacrificed but ultimately this will save time as it
may avoid the needing for the process to "sleep" or suspend itself. This is called spinning (busy-waiting). Oracle
registers a "latch free" wait event only after the process has failed to acquire the latch after the initial spinning
and goes to sleep.

--//由於鎖存器被保持很短的時間,如果一個鎖存器由於其他過程而繁忙,那麼當它無法在第一次嘗試中獲得鎖存時,獲取過程不會立即
--//放棄。 如果鎖存器繁忙,獲取過程將再次嘗試幾千次,而不是立即睡覺(並等待)。 這裡的推理是,如果一個鎖存器目前很忙,它通
--//常會很快被釋放,所以我們不用費心去睡覺(需要一個系統和上下文開關),而是"旋轉",並燃燒更多的CPU和測試,希望得到一些(數
--//百)微秒後的鎖存器。 換句話說,犧牲了一些CPU時間,但最終這將節省時間,因為它可能避免程式需要"睡眠"或暫停自己。 這就是
--//所謂的旋轉(忙-等待)。 Oracle只在程式在初始旋轉後未能獲得鎖存並進入睡眠後才會註冊一個"無鎖存器""等待事件"。

Note that some latch get operations explicitly don't perform a "spin-get" and hence the latch get will fail immediately
after the first attempt to get it didn't succeed. This technique is used for some latches where Oracle doesn't care
which exact child latch to take and is called immediate latch gets or "not willing to wait" latch gets. The "redo copy"
latch is one such example as this helps the LogWriter (LGWR) process to know whether anyone is currently copying redo
data to log buffer, but not who exactly is copying and where, as this does not matter to LGWR.

--//請注意,一些鎖存器get操作顯式地不執行"spin-get",因此在第一次嘗試獲取失敗後,鎖存器get將立即失敗。 這種技術用於一些
--//鎖存器,其中Oracle不關心要採取哪一個精確的子鎖存器,稱為即時鎖存get或"不願意等待"鎖存get。 "重做複製"鎖存器就是這樣
--//的一個例子,因為這有助於日誌寫手(LGWR)程式知道是否有人正在將重做資料複製到日誌緩衝區,但不知道到底是誰在複製以及在哪
--//裡,因為這對LGWR來說並不重要。

Systematic troubleshooting of latch contention

--//鎖存爭用的系統故障排除.

So after this brief introduction to latching in Oracle, let's see why latch contention occurs and how to troubleshoot
it. Remember – essentially, a latch is a type of lock. It allows controlling concurrent access to some SGA memory
structures so we wouldn't end up with race conditions, SGA corruptions and crashes. So, take note of this simple, but
important concept in latch (or any lock) contention: Such latch contention occurs only when there is some other process
already holding the latch which our process wants to acquire!

--//因此,在簡要介紹Oracle中的鎖存之後,讓我們看看為什麼會發生鎖存爭用以及如何排除它。 記住-本質上,鎖存器是一種鎖。 它
--//允許控制對某些SGA記憶體結構的併發訪問,這樣我們就不會最終遇到競爭條件、SGA損壞和崩潰。 因此,請注意這個簡單但重要的概
--//念,在鎖存(或任何鎖)爭用:這樣的鎖存爭用只有當有其他程式已經持有鎖存,我們的程式想要獲得!

In other words. latch contention can only happen when these two conditions are satisfied:

--//換句話說。 鎖存爭用只有在滿足這兩個條件時才能發生:

1) Someone must try to get the latch
2) Someone else must be already holding the latch

--//1)一定有人要把閂鎖
--//2)一定是有人拿著閂鎖

Latch contention means that some process is forced to sleep as it cannot acquire the latch it wants due someone else
already holding it. There will be no latch contention if both of the above conditions aren't met! Therefore, whenever we
see latch contention, we need to find out two things:

--//鎖存爭用意味著某些程式被迫休眠,因為它無法獲得它想要的鎖存器,因為其他人已經持有它。 如果上述兩個條件都不滿足,就不
--//會有鎖存爭用! 因此,當我們看到鎖存爭用時,我們需要找出兩件事:

1) Who is trying to get the latch (and why)?
2) Who is already holding the latch (and why)?

--//1)誰想得到閂鎖(為什麼)?
--//2)誰已經拿著閂鎖了(為什麼)?

These two questions need to be asked for any lock contention and is applicable here as well. The following sections
explain how to answer these questions in Oracle.

--//這兩個問題需要詢問任何鎖爭用,也適用於這裡。 以下各節解釋如何在Oracle中回答這些問題。

Question 1 - Who is trying to get the latch and why?

The first question can be answered with the help of any session-level tool based on the Oracle Wait Interface, such as
the V$SESSION_WAIT view, Active Session History or SQL_TRACE. All these tools would let you know for which exact (child)
latch address the wait occurred.

--//第一個問題可以在基於OracleWait介面的任何會話級別工具的幫助下回答,例如V$SESSION_WAIT檢視、活動會話歷史記錄或
--//SQL_TRACE。 所有這些工具都會讓您知道等待發生的確切(子)鎖存器地址。

Note that I did not mention instance-level performance data collection tools such as Statspack or AWR reports in the
above list. The problem with these tools is that they only gather and store system-wide performance data such V$SYSSTAT
or V$SQL snapshots, and thus they don't know anything about which exact sessions are experiencing the latch contention
and waits. These tools would show that someone is experiencing latch contention problems in the database instance, but
finding out who and exactly why would involve a significant amount of guesswork, previous experience and luck. Thanks to
this information gap, there currently are no systematic approaches to latch contention troubleshooting published and
hence such troubleshooting is still perceived as a very complex task. This article aims to change this as there are ways
to fill this information gap, which allows us to become systematic and not depend on luck.

--//請注意,我沒有在上面的列表中提到例項級效能資料收集工具,如Statspack或AWR報告。 這些工具的問題是,它們只收集和儲存全
--//系統效能資料,如V$SYSSTAT或V$SQL快照,因此它們不知道哪些確切的會話正在經歷鎖存爭用和等待。 這些工具將顯示某人在資料
--//庫例項中遇到鎖存爭用問題,但找出誰和為什麼會涉及大量的猜測、以前的經驗和運氣。 由於這種資訊差距,目前還沒有系統的方
--//法來鎖定爭用故障排除釋出,因此這種故障排除仍然被認為是一項非常複雜的任務。 本文旨在改變這一點,因為有辦法填補這一信
--//息空白,這使我們能夠成為系統的,而不是依賴於運氣。

There are two common trigger points for latch contention troubleshooting. One is that a DBA sees (from Statspack or AWR
report or other monitoring tools) that a significant amount of instance-wide response time is spent waiting for some
latch (from the "Top 5 Timed events" section for example). The other trigger point would start from the end user (who's
complaining) or some measured business task which takes too long. In this case the DBA would identify the session(s)
used by that user or task and use some session-level tool (as mentioned above) for quantifying where the response time
is spent.

--//鎖存爭用故障排除有兩個常見的觸發點。 一個是DBA看到(從Statspack或AWR報告或其他監控工具)大量的例項響應時間被花費在等待
--//一些鎖存(例如,從"前5名定時事件"部分)。 另一個觸發點將從終端使用者(誰在抱怨)或一些測量的業務任務開始,這需要太長時間。
--//在這種情況下,DBA將確定該使用者或任務使用的會話,並使用某些會話級別的工具(如上文所述)量化響應時間的使用地點。

Question 2 - Who is holding the latch and why?

The second question can be easily answered (starting in Oracle 8.0) by querying the V$LATCHHOLDER view which lists any
process/session currently holding a latch. A description is shown below:

--//第二個問題可以很容易地回答(從Oracle8.0開始),方法是查詢V$LATCHOLDER檢視,其中列出了當前持有鎖存的任何程式/會話。 說
--//明如下:

SQL> describe v$latchholder
Name   Null? Type
------ ----- -------------
PID          NUMBER
SID          NUMBER
LADDR        RAW(4)
NAME         VARCHAR2(64)
GETS         NUMBER

Although little known, this is an extremely useful view for systematic latch contention troubleshooting as it provides
one important bit of information which no other Oracle view or trace provides - the session ID of the latch holder. Once
you know the holder's SID, you can focus on that session and see exactly what it is doing that it needs to hold the
latch for so long.

--//雖然鮮為人知,但這是一個非常有用的檢視,用於系統鎖存爭用故障排除,因為它提供了一個重要的資訊,沒有其他Oracle檢視或跟
--//蹤提供-鎖存器的會話ID。 一旦你知道了持有人的SID,你就可以專注於那個會話,看看它到底在做什麼,它需要保持鎖存這麼長時
--//間。

Columns of interest are SID (Session ID), NAME (Latch name), LADDR (Latch address in memory). LADDR is the latch's
unique identifier in an instance and helps to determine which exact child latch was held in case the latch has multiple
child latches.

--//感興趣的列是SID(會話ID)、名稱(鎖存名稱)、LADDR(記憶體中的鎖存地址)。 在例項中,LADDR是鎖存器的唯一識別符號,並有助於確定
--//在鎖存器有多個子鎖存器的情況下持有哪個精確的子鎖存器。

As I said earlier, latches are usually held for very short durations, so manually querying this view may not return
anything at all even after querying it multiple times. For this reason, I have written a tool for high-frequency
profiling of latch holders which I will introduce now. It's called LatchProf and is a plain SQL script which doesn't
require any modifications of the database or schema. You can download it for free from my website at
http://blog.tanelpoder.com – just search for "latchprof" using the search button on the right side of the page.

--//正如我前面所說,鎖存器通常保持非常短的持續時間,因此手動查詢此檢視可能根本不會返回任何東西,即使在多次查詢它之後。
--//出於這個原因,我已經編寫了一個工具的高頻分析鎖存器,我現在將介紹。 它被稱為LatchProf,是一個普通的SQL指令碼,不需要對
--//資料庫或模式進行任何修改。 您可以從我的網站免費下載它在http://blog.tanelpoder.com-只需搜尋"latchprof"使用搜尋按鈕在
--//頁面的右側。

Using LatchProf to find who are holding latches the most

--//利用Latchprof找出拿著鎖的人最多

Let's assume that we have noticed shared pool and library cache latch contention, reported by a Statspack/AWR report or
some monitoring tool. Remember what I mentioned earlier - latch and any other lock contention occurs only if someone is
already holding the latch while I'm trying to get the same. Instead of guessing who are the troublemakers, I can just
run LatchProf which samples V$LATCHHOLDER at high frequency and reports the top latch holders.

--//假設我們已經注意到共享池和庫快取鎖存爭用,由Statspack/AWR報告或一些監控工具報告。 記住我前面提到的-鎖存器和任何其他
--//鎖爭用只有當有人已經拿著鎖,而我試圖得到同樣的。 我不是猜測誰是麻煩製造者,而是可以執行LatchProf,哪些樣本
--//V$LATCHOLDER在高頻,並報告頂部鎖存器持有人。

I could start by looking which latch types are the busiest in the system. Note that as there are many latch types in
Oracle and they also have lots of child latches then the aggregate "busyness" of a latch does not necessarily reflect
the (child) latch with the highest contention. While the Oracle Wait Interface is the only correct source for detecting
contention, LatchProf should be used for drilldown once the latch experiencing contention has been detected.

--//首先,我可以看看系統中哪些鎖存型別是最繁忙的。 請注意,由於Oracle中有許多鎖存型別,並且它們也有許多子鎖存器,那麼鎖
--//存器的聚合"繁忙"並不一定反映具有最高爭用的(子)鎖存器。 雖然Oracle等待介面是檢測爭用的唯一正確來源,但一旦檢測到鎖存
--//體驗爭用,則應使用LatchProf進行鑽取。

To illustrate the capabilities of LatchProf, I start from a simple example which shows an overview of how busy different
latch types are. By "busy" I mean how much of the time these latches are held by somebody.

--//為了說明LatchProf的功能,我從一個簡單的例子開始,它概述了不同鎖存型別有多忙。我指的是這些閂鎖有多少時間是由某人持有的。

LatchProf takes four parameters:

Parameter 1 specifies which columns from V$LATCHHOLDER to report and group by. In the case below I just want to report
latch holds by latch name (and not even break it down by SID for starters).

--//引數1指定從V$LATCHOLDER報告和分組的列。 在下面的情況下,我只想報告鎖存器持有的鎖存器名稱(甚至沒有打破它的SID為啟動)。

Parameter 2 specifies which SIDs to monitor. In the case below, I am interested in any SID which holds a latch (%).

--//引數2指定要監視哪些SID。 在下面的情況下,我對任何持有鎖存(%)的SID感興趣。

Parameter 3 specifies which latches to monitor. This can be set either to latch name or latch address in memory. In the
case below, I monitor all latches (%).

--//引數3指定要監視的鎖存器。 這可以設定為記憶體中的鎖存名稱或鎖存地址。 在下面的情況下,我監視所有鎖存器(%)。

Parameter 4 specifies how many times to sample V$LATCHHOLDER. I use 100000 samples below, which completed in a couple of
seconds on my test database. The sampling speed depends on your server CPU/memory bus speed and the value of processes
parameter. You should start from lower number like 1000 and adjust it so that LatchProf would complete its sampling in a
couple of seconds, and that is usually enough for diagnosing ongoing latch contention problems. You shouldn't keep
sampling for long periods since LatchProf runs constantly on the CPU.

--//引數4指定取樣V$LATCHOLDER的次數。 我使用下面的10萬個樣本,在我的測試資料庫上幾秒鐘就完成了。 取樣速度取決於伺服器
--//CPU/記憶體匯流排速度和程式引數的值。 您應該從較低的數字開始,如1000,並調整它,以便LatchProf將在幾秒鐘內完成其取樣,這通
--//常足以診斷正在進行的鎖存爭用問題。 您不應該長期保持取樣,因為LatchProf在CPU上不斷執行。

So, let's run LatchProf now. After few seconds of sampling, LatchProf returns the above profile:

--//所以,讓我們現在執行Latch教授。 經過幾秒鐘的取樣,LatchProf返回上述配置檔案:

SQL> @latchprof name % % 100000
-- LatchProf 1.20 by Tanel Poder ( )
NAME                                      Held       Gets  Held %     Held ms Avg hold ms
----------------------------------- ---------- ---------- ------- ----------- -----------
library cache                             3319       2815    3.32      91.273        .032
shared pool                               1694       1131    1.69      46.585        .041
library cache lock                         580        580     .58      15.950        .028
shared pool simulator                      482        100     .48      13.255        .133
kks stats                                  269        258     .27       7.398        .029
enqueues                                   158        158     .16       4.345        .028
library cache pin                           67         67     .07       1.843        .028
enqueue hash chains                         62         62     .06       1.705        .028
row cache objects                            4          4     .00        .110        .028
shared pool sim alloc                        2          2     .00        .055        .028
10 rows selected.

The output column meanings are as follows:
Name            - Latch name
Held            - During how many samples out of total samples (100000) the particular latch was held by somebody
Gets            - How many latch gets against that latch were detected during LatchProf sampling
Held %          - How much % of time was the latch held by somebody during the sampling. This is the main column you
                  want to be looking at in order to see who/what holds the latch the most (the latchprof output is
                  reverse-ordered by that column)
Held ms         - How many milliseconds in total was the latch held during the sampling
Avg hold ms     - Average latch hold time in milliseconds (normally latches are held from a few to few hundred
                  microseconds)

Note that LatchProf works on Oracle versions 9.2 and higher, but the Gets and Avg hold ms columns are shown only on
Oracle 10g and higher as the required data isn't externalized in the V$LATCHHOLDER until 10g.

--//請注意,LatchProf工作在Oracle版本9.2及更高版本上,但獲取和Avg保持ms列僅顯示在Oracle10g和更高版本上,因為所需資料在
--//V$LATCHHOLDER中不會外化到10g。

So, from the above output we can see that the library cache latches were the busiest during LatchProf run (3.32% of the
sampling time there was some library cache latch held by some session).

--//因此,從上面的輸出我們可以看到,庫快取鎖f執行期間,庫快取鎖存器是最忙的(3.32%的取樣時間有一些庫快取鎖存器由某個會話
--//持有)。

When a latch is busy, one obvious question would be: "which sessions are holding the latch the most?" Sometimes all the
sessions in an instance are both victims and troublemakers (each contributing to the latch contention problem just a
little), but sometimes there are one (or a few) heavy troublemakers and most other sessions end up as victims.

--//當一個鎖存器繁忙時,一個明顯的問題是:"哪個會話最能握住鎖存器?"有時,一個例項中的所有會話都是受害者和麻煩製造者(每
--//個會話對鎖存爭用問題只有一點),但有時有一個(或幾個)嚴重的麻煩製造者,大多數其他會話最終都是受害者。

Luckily LatchProf allows us to break down the latch holder statistics by SID and since I am interested only in the
library cache latches for now, I can monitor only this latch, as shown in bold below:

--//幸運的是,LatchProf允許我們用SID分解鎖存器的統計資料,並且由於我現在只對庫快取鎖存器感興趣,所以我只能監視這個鎖存器
--//,如下粗體所示:

SQL> @latchprof sid,name % "library cache" 100000
-- LatchProf 1.20 by Tanel Poder ( )
SID NAME               Held Gets Held % Avg hold ms
--- ------------------ ---- ---- ------ -----------
120 library cache      5059 4618   5.06        .037
120 library cache lock  189  188    .19        .034
120 library cache pin   135  135    .14        .034
139 library cache lock  131  131    .13        .034
139 library cache       109  109    .11        .034
113 library cache lock   60   60    .06        .034
113 library cache        48   48    .05        .034
7 rows selected.

In the listing above, we have just found out the SID of the main "offender", i.e. SID 120, which apparently holds some
library cache latch(es) over an order of magnitude more (5.06% of time) than the next most offending holder sessions
(139,113).

--//在上面的列表中,我們剛剛發現了主犯的SID。 SID120,它顯然持有一些庫快取鎖存(ES)超過數量級(5.06%的時間)比下一個最違規
--//的持有人會話(139,113)。

Since we know the SID of the troublemaker, we can now see what this session is doing, determine which SQL its running by
querying V$SESSION, ASH or by enabling SQL trace. In fact, the easiest way for getting an idea what kind of work a
session is doing is running my Snapper script. This script is also available on my site (and doesn't make any changes to
the database, by the way!) Snapper takes 4 parameters (out|trace - where to show the output, either DBMS_OUTPUT or
tracefile, 5 - how many seconds to sleep between session performance snapshots, 1 - how many samples to take, 120 - SID
to monitor):

--//由於我們知道麻煩製造者的SID,我們現在可以看到這個會話正在做什麼,透過查詢V$Session、ASH或啟用SQL跟蹤來確定它執行的
--//SQL。 事實上,最簡單的方法就是執行我的Snapper指令碼。 這個指令碼在我的網站上也是可用的(順便說一下,不會對資料庫做任何更
--//改)! 快照獲取4個引數(輸出|跟蹤-在哪裡顯示輸出,DBMS_OUTPUT或跟蹤檔案,5-在會話效能快照之間休眠多少秒,1-要採集多少
--//個樣本,120-SID監視):

SQL> @snapper stats 5 1 120
-- Session Snapper v2.02 by Tanel Poder ( )
--------------------------------------------------------------------------------------
 SID, USERNAME, TYPE, STATISTIC                         ,   DELTA, HDELTA/SEC,  %TIME
--------------------------------------------------------------------------------------
 120, SYS     , STAT, recursive calls                   ,    8904,      1.78k,
 120, SYS     , STAT, recursive cpu usage               ,     241,       48.2,
 120, SYS     , STAT, CPU used by this session          ,     249,       49.8,
 120, SYS     , STAT, enqueue requests                  ,    4452,      890.4,
 120, SYS     , STAT, enqueue releases                  ,    4452,      890.4,
 120, SYS     , STAT, calls to get snapshot scn: kcmgss ,    4452,      890.4,
 120, SYS     , STAT, parse time cpu                    ,     214,       42.8,
 120, SYS     , STAT, parse time elapsed                ,     452,       90.4,
 120, SYS     , STAT, parse count (total)               ,    4452,      890.4,
 120, SYS     , STAT, parse count (hard)                ,    4452,      890.4,
 120, SYS     , STAT, execute count                     ,    4452,      890.4,
 120, SYS     , TIME, hard parse elapsed time           , 3522257,   704.45ms,  70.4%
 120, SYS     , TIME, parse time elapsed                , 4425428,   885.09ms,  88.5%
 120, SYS     , TIME, PL/SQL execution elapsed time     ,   91410,    18.28ms,   1.8%
 120, SYS     , TIME, DB CPU                            , 2790000,      558ms,  55.8%
 120, SYS     , TIME, sql execute elapsed time          , 5035731,      1.01s, 100.7%
 120, SYS     , TIME, DB time                           , 5035731,      1.01s, 100.7%
--  End of snap 1, end=2009-11-10 03:37:30, seconds=5
PL/SQL procedure successfully completed.

The first thing that immediately jumps out from SID 120's execution profile is that it does heavy hard parsing, 890 hard
parses per second! (shown in bold). That's a pretty evident reason for the frequent library cache latch holding as
reported by LatchProf. We can be almost sure that the application code is not using bind variables where it should use
them. The next step would be to identify couple of SQL statements from this session, either by briefly enabling
SQL_TRACE or just by querying V$SESSION.SQL_ID and mapping this to V$SQL, and then call the developer(s) who wrote that
code and get them to fix it!

--//從SID120的執行配置檔案中立即跳出來的第一件事是,它做重硬解析,每秒890硬解析! (以粗體顯示) 這是LatchProf報告的頻繁庫
--//快取鎖存保持的一個非常明顯的原因。 我們幾乎可以確定應用程式程式碼不是在應該使用繫結變數的地方使用它們。 下一步將是透過
--//簡單啟用SQL_TRACE或僅僅透過查詢V$會話來識別本會話中的幾個SQL語句。 SQL_ID並將其對映到V$SQL,然後呼叫編寫該程式碼的開發
--//人員並讓他們修復它!

Let's assume that we could get this code fixed, but our monitoring system still reports some ongoing library cache latch
contention and we don't see an obvious troublemaker anymore (all sessions reported are holding the monitored latches
roughly equal amount of time) as seen below:

SQL> @latchprof sid,name % "library cache" 100000
-- LatchProf 1.20 by Tanel Poder ( )
       SID NAME                   Held   Gets  Held %     Held ms Avg hold ms
---------- --------------------- ----- ------ ------- ----------- -----------
       116 library cache lock      657    657     .66      12.549        .019
       116 library cache           594    594     .59      11.345        .019
       134 library cache lock      211    211     .21       4.030        .019
       134 library cache           122    122     .12       2.330        .019
       139 library cache            82     82     .08       1.566        .019
       139 library cache lock       64     64     .06       1.222        .019
6 rows selected.

So, we now have two ways to move forward:

--//因此,我們現在有兩種前進的方法:

1) Take one of the reported sessions and see what it is doing, by enabling SQL trace or similar. However this approach
   may not tell you which of the SQL statements is causing the latches to be held
2) Run LatchProf with SQLID option, which also reports the SQLIDs (or hash values in 9i) of the statements causing the
   latches to be held

--//1)透過啟用SQL跟蹤或類似的操作,獲取所報告的會話之一併檢視它正在做什麼。 但是,這種方法可能不會告訴您哪個SQL語句導致
--//  鎖存器被持有
--//2)使用SQLID選項執行LatchProf,該選項還報告導致鎖存保持的語句的SQLID(或9i中的雜湊值)   

I will proceed with option 2, and will just specify SQLID (or SQLHASH in 9i) in the first parameter to LatchProf:

--//我將繼續執行選項2,只需在第一個引數中指定SQLID(或9i中的SQLHASH)給LatchProf:

SQL> @latchprof name,sqlid % "library cache" 100000

-- LatchProf 1.20 by Tanel Poder ( )
NAME                 SQLID          Held   Gets  Held % Avg hold ms
-------------------- ------------- ----- ------ ------- -----------
library cache        5csdgx2jquqvj   858    858     .86        .028
library cache lock   8vdny41vqbq7t   377    377     .38        .028
library cache lock   5csdgx2jquqvj   369    369     .37        .028

This shows me that the main statement causing library cache latches to be held has SQL_ID '5csdgx2jquqvj'.

--//這表明導致庫快取鎖存的主語句有SQL_ID'5csdgx2j quqvj'。

Using the SQL ID I can find the text of the query as shown below:

--//用SQLID,我可以找到查詢的文字,如下所示:

SQL> select executions, sql_text from v$sql where sql_id = '5csdgx2jquqvj';

EXECUTIONS SQL_TEXT
---------- -------------------------------------------------
  40522096 select count(*) from dual where rownum = 1

Apparently someone is running the above query really frequently! (Of course I should see how much the executions
increases right now instead of just looking into the total number of executions since that cursor was loaded into the
library cache).

--//顯然有人正在頻繁地執行上述查詢! (當然,我現在應該看到執行增加了多少,而不是僅僅檢視執行的總數,因為游標被載入到庫緩
--//存中)。

The common reasons for such frequently re-executed, but useless SQL can be either someone's really bad piece of code or
bad connection pool design feature, which constantly polls DUAL table to verify if the connection is still alive.

--//這種頻繁重新執行但無用的SQL的常見原因可能是某人的真正糟糕的程式碼或糟糕的連線池設計特性,它不斷輪詢DUAL表以驗證連線是
--//否仍然存在。

Note that it's possible to break the latch holders down by both SID and SQLID to get a more detailed view of latch
holders:

--//請注意,可以透過SID和SQLID將鎖存器保持器斷開,以獲得更詳細的鎖存器保持器檢視:

SQL> @latchprof sid,name,sqlid % "library cache" 100000

-- LatchProf 1.20 by Tanel Poder ( )

 SID NAME                SQLID          Held  Gets  Held % Avg hold ms
---- ------------------- ------------- ----- ----- ------- -----------
 134 library cache       5csdgx2jquqvj   462   462     .46        .020
 134 library cache lock  5csdgx2jquqvj   395   395     .40        .020
 134 library cache lock  8vdny41vqbq7t   305   305     .31        .020
 116 library cache lock  5csdgx2jquqvj   100   100     .10        .020
 139 library cache       5csdgx2jquqvj    98    98     .10        .020
 116 library cache       5csdgx2jquqvj    93    93     .09        .020
 139 library cache lock  5csdgx2jquqvj    80    80     .08        .020
 134 library cache       8vdny41vqbq7t    73    73     .07        .020
 113 library cache       5csdgx2jquqvj    62    62     .06        .020
 139 library cache lock  8vdny41vqbq7t    60    60     .06        .020
 116 library cache lock  8vdny41vqbq7t    60    60     .06        .020
 113 library cache lock  8vdny41vqbq7t    35    35     .04        .020
 113 library cache lock  5csdgx2jquqvj    31    31     .03        .020
 113 library cache                         1     1     .00        .020
14 rows selected.

The above example was meant as an introduction to the capabilities of LatchProf. I used library cache latches as an
example, but LatchProf can show the same information for any other type of latch (there are hundreds of different types
of latches in Oracle!)

--//上面的例子是為了介紹Latchprof的能力。 我以庫快取鎖存器為例,但LatchProf可以為任何其他型別的鎖存器顯示相同的資訊
--//(Oracle中有數百種不同型別的鎖存器)!

The key difference between LatchProf and other latch contention troubleshooting approaches are:

--//Latch Prof與其他鎖存爭用故障排除方法的關鍵區別在於:

1) LatchProf shows you who is holding the latches (who is causing the contention) instead of just showing who's waiting
   for them
2) LatchProf shows you which actual sessions are holding the latches (not some system-wide ambiguous metrics), thus you
   can simply focus on the troublemaking sessions and see what are they doing
3) LatchProf shows you which SQL statement caused taking the taking of the latch, thus narrowing down the root cause
   even more

--//1)Latchprof告訴你是誰拿著門閂(誰引起了爭論),而不是僅僅顯示誰在等他們
--//2)LatchProf向您展示了哪些實際會話持有鎖存器(而不是一些系統範圍內的模糊度量),因此您可以簡單地關注故障處理會話,看看
--//  它們在做什麼
--//3)LatchProf向您展示了哪個SQL語句導致了鎖存器的佔用,從而使根本原因更加縮小   

Having established the basics, let's now dig a bit deeper with another example.

--//已經建立了基礎,現在讓我們用另一個例子更深入地挖掘。

Using LatchProfX for finding the hot block causing cache buffers chains latch contention

--//使用LatchProfX查詢導致快取緩衝區鏈鎖存爭用的熱塊

Let's say your monitoring system reported significant cache buffers chains latch contention and you want to diagnose it.
Cache buffers chains (CBC) latch contention can occur due a number of reasons. One of the most common reasons is that
you have an extremely hot block in the buffer cache. Of course we don't know the root cause in advance, so let's
systematically troubleshoot this problem.

--//假設您的監控系統報告了重要的快取緩衝區鏈鎖存爭用,您希望診斷它。 快取緩衝鏈(CBC)鎖存爭用可能發生的原因有很多. 最常見
--//的原因之一是緩衝區快取中有一個非常熱的塊。 我們當然不知道根本原因,所以讓我們系統地排除這個問題。

First, there are many CBC latches in an Oracle instance. In my test server with a small buffer cache, I have 4096 CBC
child latches. A big production database with a large buffer cache can easily have hundreds of thousands of child
latches! You can check that using the following command:

--//首先,在Oracle例項中有許多CBC鎖存器。 在帶有小緩衝區快取的測試伺服器中,我有4096個CBC子鎖存器。 一個大的生產資料庫,
--//有一個大的緩衝快取,可以很容易地擁有幾十萬個子鎖存器!! 您可以使用以下命令檢查:

SQL> select count(*) from v$latch_children where name = 'cache buffers chains';
COUNT(*)
--------
4096

So, as a first step we should identify which CBC latches are experiencing the contention – i.e. is the contention
lightly spread across many CBC latch children or is there a single child latch experiencing much heavier contention
compared to others?

--//因此,作為第一步,我們應該確定哪些CBC鎖存器正在經歷爭用-即這一爭用是否在許多CBC子鎖存器中輕輕傳播,或者是否有一個單
--//個子鎖存器與其他子鎖存器相比經歷了更重的爭用?

We will need to use the Oracle Wait Interface since it can report the contention (waits) for latches. This will require
us to use either SQL_TRACE or V$SESSION_WAIT/ASH or calculate deltas of V$LATCH_CHILDREN.WAIT_TIME to see which child
latch experiences the most sleeps. I do like the first two approaches more as they allow to tie the latch waits together
with a session and with whatever it was doing. For example, SQL_TRACE reports the following lines when a traced session
is experiencing CBC latch contention:

--//我們需要使用OracleWait介面,因為它可以報告鎖存器的爭用(等待)。 這將要求我們使用SQL_TRACE或V$SESSION_WAIT/ASH或計算
--//V$LATCH_CHILDREN.wait_time的Deltas看看哪個孩子閂鎖經歷的睡眠最多。 我確實更喜歡前兩種方法,因為它們允許將鎖存器與會
--//話以及它正在做的任何事情連線在一起。 例如,當跟蹤會話遇到CBC鎖存器爭用時,SQL_TRACE報告以下行:

WAIT #1: nam='latch: cache buffers chains' ela= 204 address=15742234376 ...
WAIT #1: nam='latch: cache buffers chains' ela= 8 address=15742234376   ...
WAIT #1: nam='latch: cache buffers chains' ela= 2 address=15742234376   ...
WAIT #1: nam='latch: cache buffers chains' ela= 7 address=15742234376   ...

The "address" part is important - it shows you exactly which child latch is experiencing the contention (every child
latch in the instance has an unique memory address). I took the most prevalent address (in decimal) from the tracefile,
converted it to hex and queried V$LATCH_CHILDREN with it as seen below:

--//"地址"部分很重要,它向您準確地顯示哪個子鎖存器正在經歷爭用(例項中的每個子鎖存器都有一個唯一的記憶體地址)。 我從trace
--//file中獲取了最流行的地址(十進位制),將其轉換為十六進位制,並用它查詢V$LATCH_CHILDREN如下:

SQL> select name, gets, misses, sleeps
  2  from v$latch_children
  3  where addr = hextoraw(trim(to_char(15742234376, '0XXXXXXXXXXXXXXX')));
NAME                     GETS MISSES SLEEPS
-------------------- -------- ------ ------
cache buffers chains 62178598 125541    805

Alternatively you can sample V$SESSION or query ASH (if you have the Diagnostics Pack licenses!) as I've shown below:

--//或者,您可以示例V$會話或查詢ASH(如果您有診斷包許可證)!如下所示:

SQL> SELECT * FROM (
  2      SELECT
  3          event
  4        , TRIM(TO_CHAR(p1, 'XXXXXXXXXXXXXXXX')) latch_addr
  5        , TRIM(ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1))||'%' PCT
  6        , COUNT(*)
  7      FROM
  8          v$active_session_history
  9      WHERE
 10          event = 'latch: cache buffers chains'
 11      AND session_state = 'WAITING'
 12      GROUP BY
 13          event
 14        , p1
 15      ORDER BY
 16          COUNT(*) DESC
 17  )
 18  WHERE ROWNUM <= 10
 19  /
EVENT                       LATCH_ADDR PCT  COUNT(*)
--------------------------- ---------- ---- --------
latch: cache buffers chains 3AA4F6F08  4.1%       50
latch: cache buffers chains 3AA4D4908  3.7%       45
latch: cache buffers chains 3AA4E2558  3.4%       42
latch: cache buffers chains 3AA4F00E0  3.2%       39
latch: cache buffers chains 3AA519440  2.9%       36
latch: cache buffers chains 3AA527090  2.8%       34
latch: cache buffers chains 3AA4FDD30  2.8%       34
latch: cache buffers chains 3AA50B8B8  2.7%       33
latch: cache buffers chains 3AA4DB730  2.6%       32
latch: cache buffers chains 3AA534C18  2.6%       32
10 rows selected.

From above query I get the same child latch address (in hex) as reported by SQL_TRACE. Now that I know the actual CBC
child latch address, I can run LatchProf with filtering by individual child latch address! That way, I can monitor the
holders of this particular child latch experiencing worst contention.

--//從上面的查詢中,我得到了與SQL_TRACE報告的相同的子閂鎖地址(十六進位制)。 現在我知道了實際的CBC子鎖存地址,我可以透過單
--//個子鎖存器地址來執行latchprof! 這樣,我就可以監視這個特定子鎖存器的持有者,體驗到最壞的爭用。

SQL> @latchprof sid,name,sqlid % 3AA4F6F08 100000
-- LatchProf 1.20 by Tanel Poder ( )
 SID NAME                SQLID         Held Gets Held % Avg hold ms
--- -------------------- ------------- ---- ---- ------ -----------
136 cache buffers chains f0cxkf0q803f8   58   58    .06        .011
120 cache buffers chains f0cxkf0q803f8   28   28    .03        .011
119 cache buffers chains f0cxkf0q803f8   16   16    .02        .011
148 cache buffers chains f0cxkf0q803f8    4    4    .00        .011
118 cache buffers chains 75621g9y3xmvd    1    1    .00        .011

The top SQLID causing holding of the CBC child latch at address 3AA4F6F08 is "f0cxkf0q803f8".

--//導致CBC子鎖存器保持在地址3AA4F6F08的頂部SQLID是"f0cxkf0q803f8"。

SQL> select executions, sql_text from v$sql where sql_id = 'f0cxkf0q803f8';
EXECUTIONS SQL_TEXT
---------- --------------------------------------------------
   2115217 SELECT /*+  first_rows index(customers,
           customers_pk) index(orders, order_status_ix) */
           O.ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE,
           QUANTITY, ORDER_MODE, ORDER_STATUS, ORDER_TOTAL,
           SALES_REP_ID, PROMOTION_ID, C.CUSTOMER_ID,
           CUST_FIRST_NAME, CUST_LAST_NAME, CREDIT_LIMIT,
           CUST_EMAIL, ORDER_DATE FROM ORDERS O , ORDER_ITEMS
           OI, CUSTOMERS C WHERE O.ORDER_ID = OI.ORDER_ID AND
           O.CUSTOMER_ID = C.CUSTOMER_ID AND O.ORDER_STATUS
           <= 4

From this, I know both the top sessions causing the contention and also the top SQL's causing this latch contention!
Since we now know the SQLID, the next step could be to check the execution plan of the given SQL and see why is it
accessing the datablock protected by our latch so frequently. Sometimes, the contention comes from many frequent
revisits of the block(s) due a bad execution plan, and may involve nested loop joins in improper locations, looping over
many rows. However, in other situations, the problem may just occur due a hot block, such an index root block, visited
by large number of concurrent sessions.

--//由此,我知道導致爭用的頂級會話和導致這個鎖存爭用的頂級SQL! 既然我們現在知道SQLID,下一步可以是檢查給定SQL的執行計劃
--//,看看為什麼它會如此頻繁地訪問我們的鎖存器保護的datablock。 有時,爭用來自由於執行計劃不好而頻繁地重新訪問塊,並且可
--//能涉及巢狀迴圈在不適當的位置連線,迴圈在許多行上。 然而,在其他情況下,問題可能只是由於一個熱塊,這樣的索引根塊,訪
--//問了大量併發會話。

Still, it's useful to know what kind of hot blocks are causing the latch contention.

--//儘管如此,知道什麼樣的熱塊導致鎖存爭用是很有用的。

The good news is that we can exactly identify the exact data block which is so hot that it causes the contention. There
are few approaches for identifying the hot block out there, but the most accurate of them is the LatchProfX script. The
LatchProfX is available on my website, again for free. However, note the X in the end of script name - it means
LatchProf eXtended or LatchProf which requires X$ tables, which means extended privileges.

--//好訊息是,我們可以準確地識別精確的資料塊,它是如此的熱,以至於它引起了爭論。 識別熱塊的方法很少,但其中最準確的是
--//LatchProfX指令碼。 Latch Prof X可在我的網站上免費獲得。 然而,請注意指令碼名稱末尾的X-它意味著LatchProfeXtended或
--//LatchProf需要X$表,這意味著擴充套件特權。

LatchProfX shows an additional column, called "object". This column shows information about the object protected by a
given latch. For cache buffers chains latches, this object shows the Data Block Address (DBA) of the block that we
accessed, causing the latch get! See the example below:

--//Latch Prof X顯示了一個額外的列,稱為"物件". 此列顯示有關由給定鎖存器保護的物件的資訊。 對於快取緩衝鏈鎖存,此物件顯
--//示我們訪問的塊的資料塊地址(DBA),導致鎖存器獲取! 見下面的例子:

SQL> @latchprofx sid,name,sqlid,object % 3AA4F6F08 100000
-- LatchProfX 1.20 by Tanel Poder ( )
SID NAME                 SQLID          OBJECT  Held  Gets  Held % Avg hold ms
--- -------------------- ------------- ------- ----- ----- ------- -----------
148 cache buffers chains f0cxkf0q803f8 1C0005A    40    40     .04        .010
136 cache buffers chains f0cxkf0q803f8 1C0005A    39    37     .04        .011
120 cache buffers chains f0cxkf0q803f8 1C0005A     4     4     .00        .010
118 cache buffers chains 75621g9y3xmvd 1C0005A     1     1     .00        .010

It looks like the block at data block address 1C0005A (in hex) is the troublemaker!

--//看起來資料塊地址1C0005A的塊(十六進位制)是麻煩製造者!

Lets check into which segment does this block belong:

--//讓我們檢查一下這個塊屬於哪段:

SQL> @dba 1C0005A
    RFILE#     BLOCK#
---------- ----------
         7         90

--//1C0005A = set dba 7,90 = alter system dump datefile 7 block 90 = 29360218

Press enter to find the segment using V$BH (this may take CPU time), CTRL+C to cancel:

--//按Enter鍵查詢段使用V$BH(這可能需要CPU時間),CT RLC取消:

STATE      BLOCK_CLASS         DATA_OBJ# OBJECT_TYPE         object
---------- ------------------ ---------- ------------------- -----------------
xcur       data block              62691 INDEX PARTITION     SOE.CUSTOMERS_PK

As seen above, this block belongs into an index partition of an index SOE.CUSTOMERS_PK. The block itself lives in
relative file number 7 and block number 90 in that file. It is possible to dump the contents of the block using ALTER
SYSTEM DUMP DATAFILE 7 BLOCK 90 command (note that this ALTER SYSTEM command takes absolute file ID as parameter, but
the above number is a relative file ID and should be converted to absolute figure using DBA_DATA_FILES).

--//如上所述,此塊屬於索引SOE的索引分割槽。 CUSTOMERS_PK. 塊本身位於該檔案中的相對檔案號7和塊號90中。 可以使用ALTER系統
--//DUMP資料檔案7BLOCK90命令轉儲塊的內容(注意,該ALTER系統命令以絕對檔案ID作為引數,但上述數字是相對檔案ID,應使用
--//DBA_DATA_FILES轉換為絕對數字)。

However, the data_obj# column shows the data_object_id of the index partition segment and using this information, I can
query and find out where this index partition segment's header block resides. Keep in mind that Index root blocks are
always stored immediately next to index segment header so that Oracle would know where to physically start traversing
the index when accessing it.

--//但是,data_obj#列顯示了索引分割槽段的data_object_id,並且使用這些資訊,我可以查詢並找到這個索引分割槽段的頭塊所在的位置
--//。 請記住,索引根塊總是立即儲存在索引段頭旁邊,以便Oracle在訪問索引時知道從哪裡開始物理遍歷索引。

SQL> SELECT
  2      header_file
  3    , header_block
  4  FROM
  5      dba_segments
  6  WHERE
  7      (owner, segment_name, partition_name) =
  8          (SELECT owner, object_name, subobject_name
  9           FROM dba_objects WHERE data_object_id = 62691)
 10  /
HEADER_FILE HEADER_BLOCK
----------- ------------
          7           89
1 row selected.

So from the above, the segment header block (7,89) resides immediately before our hot block (7,90) so we know that the
hot block is an index root block. It is normal to have some contention on very commonly accessed blocks such index root
and first level branch blocks (especially of primary key indexes and so on). Oracle has come up with many optimizations
for relieving such contention - for example, it allows shared CBC latch access for examination of the index root and
branch blocks on most platforms.

--//因此,從上面可以看出,段頭塊(7,89)位於我們的熱塊(7,90)之前,因此我們知道熱塊是一個索引根塊。 對於非常常見的訪問塊
--//,如索引根和一級分支塊(特別是主鍵索引等),通常有一些爭用。 甲骨文提出了許多最佳化來緩解這種爭用-例如,它允許共享CBC鎖
--//存訪問,以檢查大多數平臺上的索引根和分支塊。

Note that sometimes the latch contention shows inefficient physical design for a given concurrency requirement such as
not spreading the contention points across multiple blocks/latches by partitioning the object to spread the "hot" data
to multiple blocks in various partitions.

--//請注意,有時鎖存爭用顯示給定併發需求的低效率物理設計,例如透過分割槽物件將"hot"資料傳播到不同分割槽中的多個塊,而不將爭
--//用點分散到多個塊/塊上。

Often though, the CBC latch contention is merely a symptom of an incorrect execution plan, such as a nested loop join
operation that revisits the same data blocks again and again many times in a tight loop. For example the plan below
would work well when the number of matching orders returned from ORDERS table is low, so we need to iterate through the
CUSTOMERS table (and the CUSTOMERS_PK index which had contention) relatively small number of times.

--//然而,通常情況下,CBC鎖存爭用只是不正確的執行計劃的一個症狀,例如巢狀迴圈連線操作,它在一個緊密的迴圈中反覆多次訪問
--//相同的資料塊。 例如,當從Orders表返回的匹配訂單數量較低時,下面的計劃將很好地工作,因此我們需要迭代Customers表(以及
--//具有爭用的CUSTOMERS_PK索引),次數相對較少。

---------------------------------------------------------
|Id|Operation                             |Name         |
---------------------------------------------------------
| 0|SELECT STATEMENT                      |             |
| 1| TABLE ACCESS BY GLOBAL INDEX ROWID   |ORDER_ITEMS  |
| 2|  NESTED LOOPS                        |             |
| 3|   NESTED LOOPS                       |             |
| 4|    TABLE ACCESS BY GLOBAL INDEX ROWID|ORDERS       |
|*5|     INDEX RANGE SCAN                 |ORD_STATUS_IX|
| 6|    PARTITION HASH ITERATOR           |             |
| 7|     TABLE ACCESS BY LOCAL INDEX ROWID|CUSTOMERS    |
|*8|      INDEX UNIQUE SCAN               |CUSTOMERS_PK |
|*9|   INDEX RANGE SCAN                   |ITEM_ORDER_IX|
---------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("O"."ORDER_STATUS"<=4)
   8 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   9 - access("O"."ORDER_ID"="OI"."ORDER_ID")

Limitations of V$LATCHHOLDER and LatchProf(X)

--//V$LAT CHOLDER和LatchProf(X)的侷限性

There's a limitation in the V$LATCHHOLDER view. While it is enough for troubleshooting 99.9% of latch contention
problems, there are couple of cases where V$LATCHHOLDER view does not show the latch holder even if it's actually held.

--//在V$LATCHOLDER檢視中有一個限制。 雖然它足以排除99.9%的鎖存爭用問題,但有幾種情況下,V$LATCHOLDER檢視不顯示鎖存保持架
--//,即使它實際上持有。

    If your STATISTICS_LEVEL = BASIC then V$LATCHHOLDER does not maintain the latch holder information for ultrafast
    latches. Ultrafast latches are a small subset of latches (like cache buffers chains latches) which can be "even
    faster" by not maintaining some instrumentation, stats and counters. For all other latches (in practice everything
    else than CBC latches) will still maintain the latch holder info even if the statistics_level = basic (the
    undocumented parameter for setting this independently from statistics level is named _ultrafast_latch_statistics).
    Note that even when the ultrafast latch statistics info is not gathered, it's still possible to extract the latch
    holder info directly from SGA memory, thanks to state objects which Oracle has to maintain for process recovery.

--//如果您的STATISTICS_LEVEL=BASIC,那麼V$LATCHOLDER不維護超快鎖存器的鎖存器資訊。 超快鎖存器是一小部分鎖存器(如快取緩
--//衝區鏈鎖存器),它可以透過不維護一些儀表、狀態和計數器來"更快"。 對於所有其他鎖存器(在實踐中,除了CBC鎖存器之外的一切
--//),即使statistics_level=basic(獨立於統計級別設定這一點的無證引數被命名為_ultrafast_latch_statistics),仍然會維護鎖存
--//器資訊。 請注意,即使沒有收集超快鎖存統計資訊,仍然可以直接從SGA記憶體中提取鎖存器資訊,這要歸功於Oracle必須維護的用於
--//程式恢復的狀態物件。    

    V$LATCHHOLDER scans through the process state object array (V$PROCESS/X$KSUPR) and looks into a field there which
    points to the latch held by a process. However, sometimes a process can hold multiple latches at the same time
    (library cache latch + shared pool latch for example). When a process holds multiple latches, then unfortunately
    V$LATCHHOLDER only reports the first one taken by that process. This may lead to cases where you see sessions
    waiting significantly for (let say) shared pool latch, but nobody seems to be holding it too much. However, as said
    in the beginning of this article, you should only start troubleshooting latch contention whether you see sessions
    actually wait for that latch! So if you see sessions waiting (the contention exists!), but nobody apparently holding
    the latch much in V$LATCHHOLDER, then you may be hitting this shortcoming of that V$ view. This is the case when you
    may need to fall back to old latch contention troubleshooting techniques (which do require lots of previous
    experience with this and involve quite an amount of guesswork ). Or alternatively you can hire me as a consultant, I
    have tools for reading the truth out of the state objects in SGA memory ;-)

--//V$LATCHOLDER透過程式狀態物件陣列(V$PROCESS/X$KSUPR)掃描,並檢視指向程式持有的鎖存器的欄位。 然而,有時一個程式可以同
--//時容納多個鎖存器(例如,庫快取鎖存共享池鎖存器)。 當一個程式持有多個鎖存器時,不幸的是V$LAT CHOLDER只報告該程式採取的
--//第一個鎖存器。 這可能導致您看到會話等待(讓我們說)共享池閂鎖的情況,但似乎沒有人持有太多。 然而,正如本文開頭所說的,
--//無論您是否看到會話實際上等待該鎖存器,您都應該只開始排除鎖存爭用! 因此,如果您看到會話等待(爭用存在! ),但顯然沒有
--//人在V$LATCHOLDER中持有大量的鎖存器,那麼您可能正在觸及V$檢視的這個缺點。 這種情況下,您可能需要回到舊的鎖存爭用故障
--//排除技術(這確實需要大量以前的經驗,並涉及相當多的猜測)。 或者你可以聘請我作為顧問,我有工具從SGA記憶體中的狀態物件中讀
--//取真相;-)    

Conclusion

There's a huge variety of reasons for latch contention in Oracle (and there are hundreds of different types of latches
in Oracle too) and I hope to discuss some of the most common reasons in a future article. However it helps to keep in
mind two main facts about latch contention:

--//甲骨文的鎖存爭用有各種各樣的原因(甲骨文也有數百種不同型別的鎖存器),我希望在未來的文章中討論一些最常見的原因。 然而
--//,它有助於記住關於鎖存爭用的兩個主要事實:

1) Latch contention and waits happen only if someone tries to get a latch which is already held by someone else!
   Therefore, after finding out which sessions are waiting and for which latches, it's good to find out which sessions
   are holding these latches and why (and that's why I wrote LatchProf and LatchProfX!)

--//1)鎖存爭用和等待只有當有人試圖得到一個鎖存器,已經被其他人持有! 因此,在找出哪些會話正在等待,哪些鎖存器之後,找出
--//哪些會話正在舉行這些鎖存器,以及為什麼(這就是為什麼我寫了Latchprof和LatchprofX)是很好的!   

2) Latch contention is usually a symptom of some other problem, such bad connection / cursor management, inefficient
   execution plan, physical design issues or even CPU starvation.

--//2)Latch爭用通常是其他一些問題的症狀,例如糟糕的連線/游標管理、低效的執行計劃、物理設計問題甚至CPU飢餓。   

P.S. If you want to take on an interesting SQL challenge, then look into LatchProf or LatchProfX source code and
understand how it manages to sample the V$LATCHHOLDER view up to a hundred thousand times per second while being just a
single SQL statement!

--//P.S.如果您想接受一個有趣的SQL挑戰,那麼請檢視LatchProf或LatchProf X原始碼,並瞭解它如何能夠在僅僅是一個SQL語句的同時
--//,每秒取樣多達十萬次的V$LATCHOLDER檢視?

Download LatchProf and LatchProfX

http://blog.tanelpoder.com/files/scripts/latchprof.sql
http://blog.tanelpoder.com/files/scripts/latchprofx.sql

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/267265/viewspace-2742570/,如需轉載,請註明出處,否則將追究法律責任。

相關文章