Oracle Performance Tuning 11g2 (3)

yuntui發表於2016-11-03

此篇文章作為效能調優概要介紹的最後一篇,是一定要讀的。oracle寫的非常的棒!

可能有人會說,在網上完全可能搜尋到相關的翻譯文章的,為什麼還要自己再去翻譯一下呢?我的理解是這樣子的:別人已經學習過oracle了,為什麼自己還要學?別人已經教了如何吃螃蟹的步驟,為什麼自己還要去鑽研?因此透過自己的深入學習,在這個過程中,充滿著學習的興趣、無法瞭解的痛苦、無法翻譯的急躁等等,因為這個時間拉的比較長,因此掌握的程度也會比只看別人寫的東西要穩固的從。就像以前的老師在黑板上教學效果要比現在的多媒體好的多的多!(如果是復讀生,可能多媒體更好些)。在翻譯的過程中,有些英文真是感覺無法翻譯,因此參考著網上一些朋友的寫法大概的再按自己的思路寫,所以如果發現有不對的地方歡迎指正!

 

3 Performance Improvement Methods   效能提升的方法

This chapter discusses Oracle Database improvement methods and contains the following sections:

  • The Oracle Performance Improvement Method                       oracle效能提升的方法

  • Emergency Performance Methods                                         緊急處理時如何提升效能

3.1 The Oracle Performance Improvement Method

Oracle performance methodology helps you to identify performance problems in an Oracle database. This involves identifying bottlenecks and fixing them. It is recommended that changes be made to a system only after you have confirmed that there is a bottleneck.

Performance improvement, by its nature, is iterative. For this reason, removing the first bottleneck might not lead to performance improvement immediately, because another bottleneck might be revealed. Also, in some cases, if serialization points move to a more inefficient sharing mechanism, then performance could degrade. With experience, and by following a rigorous method of bottleneck elimination, applications can be debugged and made scalable.

Performance problems generally result from either a lack of throughput, unacceptable user/job response time, or both. The problem might be localized between application modules, or it might be for the entire system.

oracle效能方法論幫助你確定問題所在。包含查詢瓶頸以及解決瓶頸。強烈建議你只有你真的確認那是一個瓶頸的時候再對系統做修改(最好是對你所做的修改都做一個文件記錄,需要修改的時候先寫到文件中,這樣在寫的時候就給自己腦子就留出一點時間去思考了,很多時候當寫完語句後就會發現原來可以不需要這樣做的)。

效能提升的過程是一個迭代的過程。基於這個原因,解決一個瓶頸可能並不能立刻提升效能,因為還有其他一些相關的效能瓶頸存在。另外,在有些時候,修改序列化方式到更新低效的共享機制後,效能還要降低。根據經驗,以及遵循一系列的效能瓶頸消除策略,應用程式就可以被除錯和伸縮自如。

效能問題通常導致低吞吐量,無法接受的響應時間。問題可能存在於應用程式模組中,或者存在於整個系統中。

 

Before looking at any database or operating system statistics, it is crucial to get feedback from the most important components of the system: the users of the system and the people ultimately paying for the application. Typical user feedback includes statements like the following:

  • "The online performance is so bad that it prevents my staff from doing their jobs."

  • "The billing run takes too long."

  • "When I experience high amounts of Web traffic, the response time becomes unacceptable, and I am losing customers."

  • "I am currently performing 5000 trades a day, and the system is maxed out. Next month, we roll out to all our users, and the number of trades is expected to quadruple."

  • 在你檢視資料庫或者OS統計資料之前,一個非常重要的事情就是從系統的使用者和最後跟蹤此應用程式人那裡瞭解一些相關模組的最重要的資訊。最典型的使用者反饋資訊如下:“這個系統聯機效能太差了,我的員工都沒法工作了”,“這個賬單程式執行礦長時間了”,“當網路擁堵時,響應時間太長了,客戶都跑了”,“我們系統目前一天可以處理5000筆交易,已經是極限了。下個月,當我們升級到所有使用者後,交易量將翻2翻,也就是要增加4倍”

From candid feedback, it is easy to set critical success factors for any performance work. Determining the performance targets and the performance engineer's exit criteria make managing the performance process much simpler and more successful at all levels. These critical success factors are better defined in terms of real business goals rather than system statistics.

Some real business goals for these typical user statements might be:

  • "The billing run must process 1,000,000 accounts in a three-hour window."

  • "At a peak period on a Web site, the response time must not exceed five seconds for a page refresh."

  • "The system must be able to process 25,000 trades in an eight-hour window."

  • 從這些直白的反饋中,將可以看到一些關鍵的效能提升因素。判斷了這些效能的最低要求和效能工程師的最低標準,將使得效能最佳化變得簡單一些。這些關鍵的效能提升因素最好是根據商業的目標而定,而不是系統的統計資料。(也就是說,透過了解客戶的需求,就知道他們要什麼,是要每秒鐘1000筆交易還是批次程式快點? 任何時間先聽聽客戶想幹什麼?要求是什麼?多聽少動! ) 下面是一些真實的商業目標: “我們必須要在3小時內處理完100萬的賬單”,“在web處理的高峰期,頁面的重新整理也不能超過5秒鐘”,“系統必須要能在8小時內處理完25000個訂單”。

The ultimate measure of success is the user's perception of system performance. The performance engineer's role is to eliminate any bottlenecks that degrade performance. These bottlenecks could be caused by inefficient use of limited shared resources or by abuse of shared resources, causing serialization. Because all shared resources are limited, the goal of a performance engineer is to maximize the number of business operations with efficient use of shared resources. At a very high level, the entire database server can be seen as a shared resource. Conversely, at a low level, a single CPU or disk can be seen as shared resources.

You can apply the Oracle performance improvement method until performance goals are met or deemed impossible. This process is highly iterative. Inevitably, some investigations may have little or no impact on database performance. Time and experience are necessary to develop the skills to accurately and quickly pinpoint critical bottlenecks. However, prior experience can sometimes work against the experienced engineer who neglects to use the data and statistics available. This type of behavior encourages database tuning by myth and folklore. This is a very risky, expensive, and unlikely to succeed method of database tuning.

The Automatic Database Diagnostic Monitor (ADDM) implements parts of the performance improvement method and analyzes statistics to provide automatic diagnosis of major performance issues. Using ADDM can significantly shorten the time required to improve the performance of a system. See Chapter 6, "Automatic Performance Diagnostics" for a description of ADDM.

Systems are so different and complex that hard and fast rules for performance analysis are impossible. In essence, the Oracle performance improvement method defines a way of working, but not a definitive set of rules. With bottleneck detection, the only rule is that there are no rules! The best performance engineers use the data provided and think laterally to determine performance problems.

效能最終效能的成功與否的標準在於客戶的感覺。 效能工程師的角色是消耗降低效能的瓶頸所在。 這些瓶頸或許是因為共享資源使用不足,或者是是資源的亂用導致的序列化處理。因為資源是有限的,效能最佳化師的目標就是最大化的利用共享資源。 站在高處看,整個資料庫伺服器就是一個共享資源。 站在低層看,一個單一的CPU或者硬碟可以看作一個共享資源。

你可以使用oracle效能調優的方式,直到效能達到要求或者的確無法實現。 這是個高度迭代的過程。很多時候,一些研究發現可能對於效能幾乎沒什麼影響。如果想要達到快速的定位問題所在的技能是需要大量的時間經驗的(經驗一定是透過大量的時間磨出來的)然而有時候經驗也可能導致自己疏忽統計資料的研究。這種行為鼓勵使用神話與民間傳說式的調優(個人英雄主義),將會是非常危險的,昂貴的,不容易成功的調優方法。

ADDM實現了一部分效能調優的工作,它透過分析統計資料去提供一些診斷和解決問題的方法。使用ADDM可以縮短調優的時間。

因為系統千變萬化,以及相對複雜,所以要提供一個統一快速的方法去調優是不太可能的。本質上說,oracle的效能調優是定義了一種工作的方法(方法論),而不是定義了調優的步驟規則。 在偵測效能瓶頸的過程中,唯一的規則就是沒有規則!  最好的效能工程師使用資料庫提供的如AWR資料,端著咖啡站在一邊,然後仔細想這個系統的問題所在!(在系統出現問題的時候,別立即就毛手毛腳的,一會動動這,一會動動那;一定要鎮定,要自信;不能因為效能差或者出了事故就亂了手腳;做到---急事,慢慢說;大事,清楚說;沒有把握的事,謹慎說; 沒發生的事,莫胡說; 做不到的事,別亂說;總之在動手之前先動腦!)

舉個以前忘記從哪裡看的一個故事: 話說很久很久以前,機房還要使用磁帶機的時候,客戶說IBM的裝置多數時間正常,偶爾就出問題。於是IBM派人檢視問題,找不出來;接下來檢查程式程式碼,邏輯也沒問題;但問題是他真的就是有問題啊。最後IBM派了一個經驗豐富的專家去現場檢視,這個哥們一去就端著咖啡坐到一旁,什麼也不幹,就是跟其他人聊聊天,觀察一下系統的執行。最終他發現一旦有人走過來,系統就不穩定了;沒人走動的時候就沒事了。 最後發現是機房的一個地面磁磚鬆動了,導致人一走過去裝置就振動的非常厲害。結局不講了大家懂的。從這個例子上我們看到,很多看起來很不思議的問題的解決都是讓你想撞牆那種方法,但是能找到問題所在的人,往往都是那些專家級人員。在銀行業,我想其他行業都差不多,當出現問題的時候,一定是小弟先去解決,解決不了時再讓中級工程師解決,還不行就讓高階工程師解決,如果還高階工程師解決不了就說明問題不是出現在一般的技術上,專家出面時一般的技術就不用再去考慮了,多想想可能鬧鬼的地方!

 


3.1.1 Steps in The Oracle Performance Improvement Method        效能設計的方法
  1. Perform the following initial standard checks:                        先執行下面標準的初始檢查:

    • Get candid feedback from users. Determine the performance project's scope and subsequent performance goals, and performance goals for the future. This process is key in future capacity planning.   從使用者那裡得到直白的系統問題反饋。瞭解系統範圍和最佳化目標。這步很關鍵。

    • Get a full set of operating system, database, and application statistics from the system when the performance is both good and bad. If these are not available, then get whatever is available. Missing statistics are analogous to missing evidence at a crime scene: They make detectives work harder and it is more time-consuming.                         獲取一份客戶系統在好的時候或者差的時候的作業系統,資料庫,和應用程式統計資料。假如無法拿到的話,任何時候的都先拿一份再說。 如果沒有統計資料,就類似於丟了犯罪現場的證據一樣,將導致偵破工作非常艱難和曠日持久

    • Sanity-check the operating systems of all computers involved with user performance. By sanity-checking the operating system, you look for hardware or operating system resources that are fully utilized. List any over-used resources as symptoms for analysis later. In addition, check that all hardware shows no errors or diagnostics.          小心仔細的檢查與效能有關的所有機器的作業系統。透過檢查作業系統,你可以發現硬體或者作業系統的資源佔用情況。 將所有可能過多使用的地方都列出來供以後分析。 另外要檢查一下硬體是否發生了故障!

  2. Check for the top ten most common mistakes with Oracle Database, and determine if any of these are likely to be the problem. List these as symptoms for later analysis. These are included because they represent the most likely problems. ADDM automatically detects and reports nine of these top ten issues. See Chapter 6, "Automatic Performance Diagnostics" and "Top Ten Mistakes Found in Oracle Systems".        檢查一下資料庫的top10問題,看看是否問題就在其中。 列出所有的問題供以後分析。 列出這top10的事項的主要是他們代表了最有可能出現的問題。 ADDM自動的偵測和報告10箇中的9個問題

  3. Build a conceptual model of what is happening on the system using the symptoms as clues to understand what caused the performance problems. See "A Sample Decision Process for Performance Conceptual Modeling".    建立一個效能最佳化的概念模型,透過一個症狀做為線索,找出系統目前正在發生什麼,以及什麼導致了問題所在。見下面的3.1.2中。

  4. Propose a series of remedy actions and the anticipated behavior to the system, then apply them in the order that can benefit the application the most. ADDM produces recommendations each with an expected benefit. A golden rule in performance work is that you only change one thing at a time and then measure the differences. Unfortunately, system downtime requirements might prohibit such a rigorous investigation method. If multiple changes are applied at the same time, then try to ensure that they are isolated so that the effects of each change can be independently validated.    對系統進行修正工作,按照自己的修正順序,一個一個的來,看看常用的效果如何。 ADDM會產生一系列的建議以及能提升好大的效能。 一個黃金法則是: 在做效能最佳化時同一時間只修改一個地方,然後比較到底效能差異。  但是呢,如果系統一旦當機的話,根本沒有時間能讓我們這麼做。 假如一口氣改了N多東西的話,那麼確保每個效果都是不一樣的,他們可以被單獨的驗證。(我一般是按我自己的方式去除錯,如果發現和我的不一樣先全部應用上去,很少這樣一步步來的)

  5. Validate that the changes made have had the desired effect, and see if the user's perception of performance has improved. Otherwise, look for more bottlenecks, and continue refining the conceptual model until your understanding of the application becomes more accurate.   驗證一下修正之後,是否達到自己想像的那種立竿見影的效果,以及客戶的直觀感覺是否好了。 否則的話,就再檢查這些瓶頸,重新定義模型,直接到系統有了更準確的理解。

  6. Repeat the last three steps until performance goals are met or become impossible due to other constraints.  重複3,4,5步驟,直到最佳化成功或者因為一些其他條件根本無法實現。

This method identifies the biggest bottleneck and uses an objective approach to performance improvement. The focus is on making large performance improvements by increasing application efficiency and eliminating resource shortages and bottlenecks. In this process, it is anticipated that minimal (less than 10%) performance gains are made from instance tuning, and large gains (100% +) are made from isolating application inefficiencies.

這些方式確定了最大的瓶頸,以及使用了一個直觀的途徑去提升效能。 主要集中在透過增加系統的有效系統和消耗資源短缺和瓶頸來最大的提升效能。 在這個過程中,少則提升低於10%的效能,高的話有可能會超過100%。

 


3.1.2 A Sample Decision Process for Performance Conceptual Modeling        在效能概念模型中的一個簡單的決策過程

Conceptual modeling is almost deterministic. However, as you gain experience in performance tuning, you begin to appreciate that no real rules exist. A flexible heads-up approach is required to interpret statistics and make good decisions.

For a quick and easy approach to performance tuning, use ADDM. ADDM automatically monitors your Oracle system and provides recommendations for solving performance problems should problems occur. For example, suppose a DBA receives a call from a user complaining that the system is slow. The DBA simply examines the latest ADDM report to see which of the recommendations should be implemented to solve the problem. See Chapter 6, "Automatic Performance Diagnostics" for information about the features that help monitor and diagnose Oracle databases.

The following steps illustrate how a performance engineer might look for bottlenecks without using automatic diagnostic features. These steps are only intended as a guideline for the manual process. With experience, performance engineers add to the steps involved. This analysis assumes that statistics for both the operating system and the database have been gathered.

概念模型通常是起決定性意義的。 然而當你的效能調優經驗越來越多的時候,你就能領悟到根本沒有真正的規則存在。在解釋統計資料和做出決策的過程中,需要的是足智多謀,思維敏捷的途徑。

如果要使用快速和相對容易的方式去調優,那就使用ADDM吧! ADDM自動的監控你的系統,並且對於效能問題提供了一些幫助建議。 例如,假設一個DBA接收客戶說系統比較慢的抱怨電話。DBA簡單的檢查最近的ADDM報告去看看有哪個建議可以使用。

下面的步驟闡述了在沒有自動診斷特性時效能工程師如何確定問題。 這些步驟僅僅作為一個手動操作的參考。 效能工程師將連同自己的經驗一起參與到自己的解決步驟中(隨著經驗的積累,效能工程師會增加相應的步驟)。這些分析的前提是作業系統和資料庫的統計資料都已經被收集過了。

  1. Is the response time/batch run time acceptable for a single user on an empty or lightly loaded computer?

    If it is not acceptable, then the application is probably not coded or designed optimally, and it will never be acceptable in a multiple user situation when system resources are shared. In this case, get application internal statistics, and get SQL Trace and SQL plan information. Work with developers to investigate problems in data, index, transaction SQL design, and potential deferral of work to batch and background processing. 

    當只有一個使用者時,或者是系統負載比較輕的時候這個響應時間和批次處理時間都是正常的? 假如這個時候都是不可接受的,那麼說明應用程式可能真的設計的有問題,同時也說明在多使用者條件中效能更不可接受了。 在這種情況下,獲取應用內部統計資料,獲取SQL TRACE和SQL PLAN資訊。 與開發人員一起調查資料,索引,事務設計,以及可能延遲的批次任務問題,還有他們的後臺程式。(多數時候,不需要了解太多他們的業務需求,只要瞭解到哪一支程式慢,慢在哪裡?這隻程式處理了什麼邏輯等等就可以了。我有個壞毛病,總是先聽他們的處理邏輯,聽完後通常都和自己的設計理念不符,通常下他們的設計也絕對不怎麼好(好的話怎麼可能會出問題,是吧), 在解決問題之前就先鄙視一下,很多人臉上就掛不住了。個人情商太低!)

  2. Is all the CPU being utilized?

    If the kernel utilization is over 40%, then investigate the operating system for network transfers, paging, swapping, or process thrashing. Continue to check CPU utilization in user space to verify if there are any non-database jobs consuming CPU on the system limiting the amount of shared CPU resources, such as backups, file transforms, print queues, and so on. After determining that the database is using most of the CPU, investigate the top SQL by CPU utilization. These statements form the basis of all future analysis. Check the SQL and the transactions submitting the SQL for optimal execution. Oracle Database provides CPU statistics in V$SQL and V$SQLSTATS.

    See Also:

    Oracle Database Reference for more information on V$SQL and V$SQLSTATS

    If the application is optimal and no inefficiencies exist in the SQL execution, then consider rescheduling some work to off-peak hours or using a bigger computer.

    CPU的使用率如何?

    假如作業系統佔了40%多的CPU時,檢視一下網路傳輸,分頁,交換或者程式顛簸情況。 仍然在使用者空間下檢視CPU使用率問題,檢查下是否有什麼非資料庫的工作在消耗CPU的工作,例如備份,檔案傳輸,列印佇列等等問題(通常夜間都會進行日誌等備份工作,比如一個批次的檔案在處理前要解壓,處理完後要壓縮,gzip是非常消耗CPU資源的,明顯會看到CPU迅速衝到80~90%的)。如果確認是資料庫佔據了多數的CPU,檢查一下在消耗CPU中,使用率最高的SQL語句。 這些語句就是將來需要我們去分析的。 檢查SQL和事務是否正常執行了。 查詢這些CPU的統計可以透過V$SQL, V$SQLSTATS去查詢! 假如說應用程式是最優的,同時SQL語句也沒有明顯的低效之處,那麼就工作放到非高峰期處理看或者直接換個效能更好的機器試試。(銀行業資料庫伺服器好的一塌糊塗,這個不需要再看了,絕對資源過剩的。出現問題一定是SQL語句或者事務有問題,比如亂加鎖的原因)

  3. At this point, the system performance is unsatisfactory, yet the CPU resources are not fully utilized.

    In this case, you have serialization and unscalable behavior within the server. Get the WAIT_EVENTS statistics from the server, and determine the biggest serialization point. If there are no serialization points, then the problem is most likely outside the database, and this should be the focus of investigation. Elimination of WAIT_EVENTS involves modifying application SQL and tuning database parameters. This process is very iterative and requires the ability to drill down on the WAIT_EVENTS systematically to eliminate serialization points.

    單使用者也除錯了,系統資源佔用也考慮了,系統效能還不行,同時CPU也沒有充分的利用到。

    在這種情況下,通常情況下是序列化了。 獲取一下wait_events統計資料,確定最大的序列化點。 假如沒有發現什麼序列化情況,那麼問題就極有可能出在資料庫之外。 消除wait_events的事件,涉及到修改應用SQL和調優系統引數。 這個過程是一個迭代化的過程,需要深入鑽研的精神,有步驟地,系統地檢視wait_events,最終消除問題所在。

 


3.1.3 Top Ten Mistakes Found in Oracle Systems                        在oracle系統中最高的10種錯誤

This section lists the most common mistakes found in Oracle databases. By following the Oracle performance improvement methodology, you should be able to avoid these mistakes altogether. If you find these mistakes in your system, then re-engineer the application where the performance effort is worthwhile. See "Automatic Performance Tuning Features" for information about the features that help diagnose and tune Oracle databases. See Chapter 10, "Instance Tuning Using Performance Views" for a discussion on how wait event data reveals symptoms of problems that can be impacting performance.

本節列出了在Oracle系統中最常見的錯誤。使用Oracle的效能方法論,你應該能總體上避免這些錯誤。如果你的系統中存在這些錯誤,重新設計你應用程式的中關於效能部分努力將是值得的。

  1. Bad connection management                             糟糕的連線管理

    The application connects and disconnects for each database interaction. This problem is common with stateless middleware in application servers. It has over two orders of magnitude impact on performance, and is totally unscalable. 

    應用程式不斷的的連線和斷開與資料庫。此問題常見於無狀態的中介軟體應用中。它對效能的影響超過兩個數量級,並且完全無法擴充套件及不穩定。

  2. Bad use of cursors and the shared pool             遊標的錯誤使用和共享池

    Not using cursors results in repeated parses. If bind variables are not used, then there is hard parsing of all SQL statements. This has an order of magnitude impact in performance, and it is totally unscalable. Use cursors with bind variables that open the cursor and execute it many times. Be suspicious of applications generating dynamic SQL.  

    未使用遊標導致重複解析。假如不使用繫結變數,將導致所有SQL語句出現硬解析。這將導致一個資料量級的影響,完全不能擴充套件。使用繫結變數的遊標,開啟遊標,然後多次去執行。一定要注意那些動態產生的SQL。(現在有很多工具可以自動生成SQL語句,特別是java開發人員有時會使用它們,而根本不關心生成的語句效果如何,最後全部推給最佳化師,日他奶奶的這幫人!關鍵是因為他們一直在動態的生成,乾的活看起來非常多,最後獎金還很多,所以掙的錢多少跟水平不是成正比的!)

  3. Bad SQL                                                         糟糕的SQL

    Bad SQL is SQL that uses more resources than appropriate for the application requirement. This can be a decision support systems (DSS) query that runs for more than 24 hours, or a query from an online application that takes more than a minute. You should investigate SQL that consumes significant system resources for potential improvement. ADDM identifies high load SQL. SQL Tuning Advisor can provide recommendations for improvement. See Chapter 6, "Automatic Performance Diagnostics" and Chapter 17, "Automatic SQL Tuning". 

    糟糕的SQL是指較應用需求使用更多資源的SQL。這可能是一個執行超過24小時的DSS查詢或需要一分多鐘的聯機應用程式查詢。要研究消耗大量的系統資源的sql。ADDM確定高負荷SQL. STA可以提供改進的建議。

  4. Use of nonstandard initialization parameters      使用非標準初始化引數

    These might have been implemented based on poor advice or incorrect assumptions. Most databases provide acceptable performance using only the set of basic parameters. In particular, parameters associated with SPIN_COUNT on latches and undocumented optimizer features can cause a great deal of problems that can require considerable investigation.

    Likewise, optimizer parameters set in the initialization parameter file can override proven optimal execution plans. For these reasons, schemas, schema statistics, and optimizer settings should be managed as a group to ensure consistency of performance.

    這些可能基於糟糕的同行建議或不正確的假設。大多數系統僅使用基本引數設定就會得到不錯的效能。特別是與latch相關的 SPIN_COUNT 引數和未書面記錄的最佳化特性可能會導致很大的問題,修改前需要嚴格的考慮。 同樣,在初始化引數檔案中設定的最佳化引數可以覆蓋已經證明是最佳的執行計劃。由於這些原因,相關使用者,及使用者統計資料,最佳化設定應作為一組共同管理,以確保效能的一致性。

    See Also:

    • Oracle Database Administrator's Guide for information about initialization parameters and database creation

    • Oracle Database Reference for details on initialization parameters

    • "Performance Considerations for Initial Instance Configuration" for information about parameters and settings in an initial instance configuration

  5. Getting database I/O wrong                              獲取資料庫I/O錯誤

    Many sites lay out their databases poorly over the available disks. Other sites specify the number of disks incorrectly, because they configure disks by disk space and not I/O bandwidth. See Chapter 8, "I/O Configuration and Design".  

    許多站點效能不佳是因為資料庫放在比較差的磁碟上。還有一些站點磁碟數量是不對的,因為他們依據磁碟空間而不是I/O頻寬去置磁碟。(比較調整硬碟一定要作為redo使用等等)

  6. Online redo log setup problems                        重做日誌設定問題

    Many sites run with too few online redo log files and files that are too small. Small redo log files cause system checkpoints to continuously put a high load on the buffer cache and I/O system. If too few redo log files exist, then the archive cannot keep up, and the database must wait for the archiver to catch up. See Chapter 4, "Configuring a Database for Performance" for information about sizing redo log files for performance.     

    許多站點redo太小,太少。小的redo會導致系統檢查點給高緩衝區和I/O系統持續的壓力。如果redo太少,然後歸檔不能跟上,則資料庫將等待歸檔程式。

  7. Serialization of data blocks in the buffer cache due to lack of free lists, free list groups, transaction slots (INITRANS), or shortage of rollback segments.

    This is particularly common on INSERT-heavy applications, in applications that have raised the block size above 8K, or in applications with large numbers of active users and few rollback segments. Use automatic segment-space management (ASSM) and automatic undo management to solve this problem. 

    快取記憶體中的資料塊由於缺乏free lists、free list groups(這些都常出現在9i或以前的系統中,新系統中很少有DBA這樣使用表空間了,聽聽就算了),事務插槽,或回滾段導致序列化操作。這在INSERT比較多的應用中,同時塊大小設定大於8K,或在應用程式中有大量的活躍使用者數及少量回滾段的情況下常見些。使用ASSM,自動UNDO管理已經不存在這樣的問題。

  8. Long full table scans                                      長時間全表掃描

    Long full table scans for high-volume or interactive online operations could indicate poor transaction design, missing indexes, or poor SQL optimization. Long table scans, by nature, are I/O intensive and unscalable. 

    在大表或聯機互動操作中的進行長時間全表掃描,可能顯示事務設計不好、缺少索引或糟糕的SQL最佳化計劃。長時間表掃描,本質上是I/O密集??型的,不可擴充套件。

  9. High amounts of recursive (SYS) SQL              大量的遞迴(SYSSQL

    Large amounts of recursive SQL executed by SYS could indicate space management activities, such as extent allocations, taking place. This is unscalable and impacts user response time. Use locally managed tablespaces to reduce recursive SQL due to extent allocation. Recursive SQL executed under another user ID is probably SQL and PL/SQL, and this is not a problem. 

    大量的SYS使用者呼叫的遞迴SQL的執行,可能表示分配空間的管理行為,如需要更多的extent空間。這是非擴充套件的並影響使用者的響應時間。使用本地管理表空間,以減少由於分配空間造成的的遞迴SQL。其他使用者ID下執行的遞迴SQL問題可能是SQL語句或PL/SQL問題,這是沒有問題的。

  10. Deployment and migration errors                   部署和遷移錯誤

    In many cases, an application uses too many resources because the schema owning the tables has not been successfully migrated from the development environment or from an older implementation. Examples of this are missing indexes or incorrect statistics. These errors can lead to sub-optimal execution plans and poor interactive user performance. When migrating applications of known performance, export the schema statistics to maintain plan stability using the DBMS_STATS package.

    Although these errors are not directly detected by ADDM, ADDM highlights the resulting high load SQL. 

    在許多情況下,應用程式使用了太多的資源,因為使用者的表沒有成功地從開發環境遷移過來,或新設計的表沒有應用上來。比如缺少索引或不正確的統計資料。這些錯誤可能會導致不佳的執行計劃和糟糕的使用者體驗。當原有的舊系統是相對穩定的,那麼使用DBMS_STATS匯出舊系統的統計資料。

    雖然這些錯誤不能直接被ADDM檢測到,但ADDM會突出顯示出高負荷的SQL。

 

 


3.2 Emergency Performance Methods            緊急效能處理

This section provides techniques for dealing with performance emergencies. You presumably have a methodology for establishing and improving application performance. However, in an emergency situation, a component of the system has changed to transform it from a reliable, predictable system to one that is unpredictable and not satisfying user requests.

In this case, the performance engineer must rapidly determine what has changed and take appropriate actions to resume normal service as quickly as possible. In many cases, it is necessary to take immediate action, and a rigorous performance improvement project is unrealistic.

After addressing the immediate performance problem, the performance engineer must collect sufficient debugging information either to get better clarity on the performance problem or to at least ensure that it does not happen again.

The method for debugging emergency performance problems is the same as the method described in the performance improvement method earlier in this book. However, shortcuts are taken in various stages because of the timely nature of the problem. Keeping detailed notes and records of facts found as the debugging process progresses is essential for later analysis and justification of any remedial actions. This is analogous to a doctor keeping good patient notes for future reference.

本節提供了緊急情況的效能調優技術。你可能已經建立了自己的應用程式調優方面的技巧。然而,在緊急情況下,系統的一部分從一個穩定的狀態改變成一個不可預知時,不能滿足使用者請求的狀態。

在這種情況下,效能工程師的必須迅速確定什麼變化了,並採取適當的行動,以儘可能快地恢復正常的服務。在許多情況下,必須立即採取行動,嚴格的效能提升是不現實的(前段時間我看了一些關於oracle12c的東西,好像裡面有了處理這方面的東西,也就是在出現問題時先別急著去重啟資料庫,先看看到底發生了什麼,可以把東西備份出來供以後分析)。

在定位了的當前的緊急效能問題後,效能工程師必須收集足夠的除錯資訊,或者是為了對效能問題有更加清晰的瞭解,或者至少確保它真的不會再次發生了。

用於除錯緊急效能問題的方法和在本書前面章節介紹的效能改進方法是相似的。然而,因為問題的緊迫,可以採取快速方便的捷徑處理方式。保持詳細的筆記,記錄除錯處理過程中的發現的一切現象是至關重要的,方便以後的分析及做出補救措施。這類似於醫生詳細記錄病人病歷,以備將來參考。


3.2.1 Steps in the Emergency Performance Method   緊急的效能方法的步驟如下:

The Emergency Performance Method is as follows:

  1. Survey the performance problem and collect the symptoms of the performance problem. This process should include the following:

    • User feedback on how the system is underperforming. Is the problem throughput or response time?

    • Ask the question, "What has changed since we last had good performance?" This answer can give clues to the problem. However, getting unbiased answers in an escalated situation can be difficult. Try to locate some reference points, such as collected statistics or log files, that were taken before and after the problem.

    • Use automatic tuning features to diagnose and monitor the problem. See "Automatic Performance Tuning Features" for information about the features that help diagnose and tune Oracle systems. In addition, you can use Oracle Enterprise Manager performance features to identify top SQL and sessions.   

      調查效能問題並收集效能相關的問題症狀。這個過程應該包括以下內容:

      從使用者反饋中瞭解系統狀況。是吞吐量還是響應時間問題?

      問問他們,“以前正常的時候大概在幾號,之後改了什麼東西?”這個答案可以提供一些線索。然而,在緊急情況下可以得到準確的答案是相對困難的。嘗試找到一些參考點,如收集的統計資料或檢視日誌檔案,比較問題發生之前和之後的記錄。(注:透過和客戶聊天,可以得到一些相關的修改記錄,有的時候,有的“犯罪份子”可能會故意的不說實話,比如他不小心把配置檔案刪除了,然後隨便找了一個替過來等等,這時要從日誌中檢視資訊)
      使用自動調優功能,診斷和監測的問題。此外,您可以使用OEM,識別資源佔用最高的SQL和會話資訊。(一般情況下,生產上不太喜歡安裝OEM,所以還是準備好SQL語句去檢視吧!)

  2. Sanity-check the hardware utilization of all components of the application system. Check where the highest CPU utilization is, and check the disk, memory usage, and network performance on all the system components. This quick process identifies which tier is causing the problem. If the problem is in the application, then shift analysis to application debugging. Otherwise, move on to database server analysis.  

    全面檢查伺服器各個硬體部件使用率。檢查CPU使用率最高的什麼程式,並檢查磁碟,記憶體使用情況和與資料庫與關聯的所有系統的網路效能。快速的確定造成問題的是哪一層。如果問題是在應用程式中,就轉向分析應用程式的除錯。否則,將轉向資料庫伺服器的分析。(像在銀行業的話,一般說是伺服器出現問題那基本上就是資料庫的問題了,因為在找DBA之前他們已經做了大量的研究,基本上不會出現在應用程式中了,許多時候都是由於沒有打PATCH的原因)

  3. Determine if the database server is constrained on CPU or if it is spending time waiting on wait events. If the database server is CPU-constrained, then investigate the following:

    • Sessions that are consuming large amounts of CPU at the operating system level and database; check V$SESS_TIME_MODEL for database CPU usage

    • Sessions or statements that perform many buffer gets at the database level; check V$SESSTAT and V$SQLSTATS

    • Execution plan changes causing sub-optimal SQL execution; these can be difficult to locate

    • Incorrect setting of initialization parameters

    • Algorithmic issues caused by code changes or upgrades of all components

    If the database sessions are waiting on events, then follow the wait events listed in V$SESSION_WAIT to determine what is causing serialization. The V$ACTIVE_SESSION_HISTORY view contains a sampled history of session activity which you can use to perform diagnosis even after an incident has ended and the system has returned to normal operation. In cases of massive contention for the library cache, it might not be possible to logon or submit SQL to the database. In this case, use historical data to determine why there is suddenly contention on this latch. If most waits are for I/O, then examine V$ACTIVE_SESSION_HISTORY to determine the SQL being run by the sessions that are performing all of the inputs and outputs. See Chapter 10, "Instance Tuning Using Performance Views" for a discussion on wait events.

    確定是資料庫伺服器的CPU不夠還是時間花在等待事件上。如果資料庫伺服器是CPU不足的,然後考察以下內容:

    在作業系統和資料庫級別消耗大量的CPU的會話,檢查 V$SESS_TIME_MODEL、V$SYS_TIME_MODEL 確定資料庫CPU的使用情況(V$SESS_TIME_MODEL displays the session-accumulated time for various operations. The time reported is the total elapsed or CPU time (in microseconds). Any timed operation will buffer at most 5 seconds of time data.)

    在資料庫級別上執行很多緩衝區獲讀取的會話或語句(或者說正在大量消耗資源的語句),檢查 V$SESSTATV$SQLSTATS

    執行計劃的改變導致SQL效率不佳,這個比較難難定位了

    初始化引數設定不正確

    整個系統升級或部分程式碼升級時其中的演算法問題

    如果資料庫會話發生了等待事件,檢視 V$SESSION_WAIT 上的等待事件,以確定是什麼原因造成的序列化。V$ACTIVE_SESSION_HISTORY 記錄了取樣的活動會話歷史記錄,即使這些會話發生已結束,當系統恢復正常後,仍可用於診斷分析。在library cache中發生大量爭用的情況下,可能無法登入或commit。在這種情況下,使用歷史資料,以確定為什麼突然間會發生此閂鎖爭用。如果大多數在等待I/O,透過V$ACTIVE_SESSION_HISTORY以確定正在活動的會話執行的中包含大量IO的SQL語句。

  4. Apply emergency action to stabilize the system. This could involve actions that take parts of the application off-line or restrict the workload that can be applied to the system. It could also involve a system restart or the termination of job in process. These naturally have service level implications.  

    採取緊急行動來穩定系統。可能涉及停掉部分應用程式或限制系統的壓力。包括系統重啟或終止執行中的JOB。這會對服務產生一定的影響(都重啟了當然有影響了,不過現在RAC應用的挺多,重啟一臺對另一臺沒影響)。

  5. Validate that the system is stable. Having made changes and restrictions to the system, validate that the system is now stable, and collect a reference set of statistics for the database. Now follow the rigorous performance method described earlier in this book to bring back all functionality and users to the system. This process may require significant application re-engineering before it is complete.

        驗證該系統是穩定的。對系統進行了修改和限制後,驗證系統現在是穩定的,併為資料庫收集參考的統計資料。現在,按照本書前面介紹的嚴格的效能分析方法重新啟用系統的所有功能和使用者連線。這個過程完成之前,可能需要大量的重新設計應用程式的工作。

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/30633755/viewspace-2127723/,如需轉載,請註明出處,否則將追究法律責任。

相關文章