Troubleshooting Database Hang Issues (Doc ID 1378583.1)

rongshiyuan發表於2014-03-31

In this Document

Purpose
Troubleshooting Steps
  Priority: Immediate Resolution
  Diagnostic Data Collection
  Guidance for General Scenarios
  Guidance for Specific Scenarios
  Intermittent Hangs
  Troubleshooting Other Issues
References

Applies to:

Oracle Database - Personal Edition - Version 10.1.0.2 and later
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Oracle Database - Standard Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.
***Checked for relevance on 14-Nov-2013***

Purpose

The diagnosis of the cause of database hangs is a complex task and can involve sifting through large volumes of trace which then needs correlating to determine a cause. Due to this, it may not be feasible to produce a resolution to these issues through a document such as this. However, this document can advise on the collection of first stage diagnostic information and other information.

This document supercedes :

Document 61552.1 Troubleshooting Oracle Database Hanging Issues for versions from 7 to 9--Exhaustive

Troubleshooting Steps

Priority: Immediate Resolution

A database 'hang' situation is a rare occurrence and can have many diverse causes, but a true hang can be devastating to the operation of a system.  The priority is usually to get the system working as quickly as possible.

Often this means killing of processes or restating of the system which has the unfortunate side effect of removing all the evidence necessary to diagnose the cause of the issue.

Note that a hang and a spin situation can often be indistinguishable in as much as both can display the same symptoms (locking up the system making it inoperable). In terms of resolution, hangs and spins tend to have different directions, but in terms of initial diagnostics, these are similar. To determine between the two types see:

Document 68738.1 No Response from the Server, Does it Hang or Spin?

Diagnostic Data Collection


If possible, collect the following during the hang situation prior to any emergency resolution action:

Document 452358.1 How to Collect Diagnostics for Database Hanging Issues


The basic information to diagnose a hang is :

  • Starting from 11g release 1, the dia0 background processes starts collecting hanganalyze information and stores this in memory in the "hang analysis cache". It does this every 3 seconds for local hanganalyze information and every 10 seconds for global (RAC) hanganalyze information. This information can provide a quick view of hang chains occurring at the time of a hang being experienced.

    Document 1428210.1  Troubleshooting Database Contention With V$Wait_Chains
    • Hanganalyze and Systemstate dumps:
      Collected automatically, manually or possibly via HANGFG -

      Document 362094.1 HANGFG User Guide

 

    • Snapshots of General database performance:
      Typically with AWR reports -

      Document 1363422.1 AWR Reports - Information Center

 

  • Some background system information:
    Typically collected via RDA -

    Document 314422.1 - Remote Diagnostic Agent (RDA) 4 - Getting Started

Guidance for General Scenarios

In general, diagnosing the cause of a hang is establishing which process or processes are holding a resource that is blocking the others. This can be established in many ways but as an example, might be a process that is completely stuck waiting for some other activity.  When this activity is complete or is terminated, this frees up the waiters. In a simple example, a blocker waits for a holder and freeing the holder resolves the hang, but it is possible for a chain of holders and waiters to exist making identification of the real holder a more complex process.

In terms of interpreting the diagnostics:

  • Hanganalyze

    Hanganalyze is probably the easiest hang diagnostic to interpret since it is designed to be a summary as opposed to exhaustive. Hanganalyze trace provides a quick pointer to the waiting and potential holder processes with 'cycles' (i.e. identification of processes that are blocking each other).

    Document 215858.1 - Interpreting HANGANALYZE trace files to diagnose hanging and performance problems
  • Systemstate

    Systemstate trace gives an exhaustive trace of all processes that can be explored to determine some more detail about the holder itself. Its interpretation is complex and is beyond the scope of this article, although some of the specific articles below do touch on this. Generally these should be sent to support for interpretation.

  • General Systemwide Reports

    AWR reports provides system wide information that can be useful in determining the general area that processes are waiting on and can also give useful comparison information from the build up, and the aftermath of an issue.

    Document 1359094.1 FAQ: How to read an AWR report


  • RDA

    RDA provides background structure information that can be useful for elimination of potential solutions once found.

Guidance for Specific Scenarios


The following articles cover diagnostics that can be used to approach specific situations that may lead to hangs:

Document 976714.1 How To: Diagnose HANG with PMON failed to acquire latch, see PMON dump
Document 1342917.1 Troubleshooting 'latch: cache buffers chains' Wait Contention
Document 278316.1 Potential reasons for "WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! "
Document 406570.1 How to Avoid Contention Based Hangs

Intermittent Hangs

Performance problems that are intermittent happen without warning, last for a short duration, and are extremely difficult to diagnose. In order to assist with these issues, Support have developed the following tools to assist collect information in such circumstances:

Document 370363.1CASE STUDY: Using Real-Time Diagnostic Tools to Diagnose Intermittent Database Hangs
Document 352363.1 LTOM - The On-Board Monitor User Guide
Document 301137.1 OS Watcher User Guide


Troubleshooting Other Issues

For guidance troubleshooting other performance issues see:

Document:1377446.1 Troubleshooting Performance Issues


References

NOTE:1342917.1 - Troubleshooting 'latch: cache buffers chains' Wait Contention
NOTE:1359094.1 - How to Use AWR Reports to Diagnose Database Performance Issues
NOTE:1363422.1 - Automatic Workload Repository (AWR) Reports - Start Point
NOTE:1377446.1 - * Troubleshooting Performance Issues
NOTE:1428210.1 - Troubleshooting Database Contention With V$Wait_Chains
NOTE:215858.1 - Interpreting HANGANALYZE trace files to diagnose hanging and performance problems for 9i and 10g.
NOTE:301137.1 - OSWatcher Black Box (Includes: [Video])
NOTE:314422.1 - Remote Diagnostic Agent (RDA) - Getting Started
NOTE:352363.1 - LTOM - The On-Board Monitor User Guide
NOTE:362094.1 - HANGFG User Guide
NOTE:370363.1 - CASE STUDY: Using Real-Time Diagnostic Tools to Diagnose Intermittent Database Hangs

NOTE:406570.1 - How to Avoid Contention Based Hangs
NOTE:452358.1 - How to Collect Diagnostics for Database Hanging Issues
NOTE:61552.1 - Troubleshooting Oracle Database Hanging Issues for versions from 7 to 9--Exhaustive.
NOTE:68738.1 - No Response from the Server, Does it Hang or Spin?

NOTE:976714.1 - 'PMON failed to acquire latch, see PMON dump' in Alert Log - How To Diagnose

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/17252115/viewspace-1133584/,如需轉載,請註明出處,否則將追究法律責任。

相關文章