Linux: Hangcheck-Timer Module Requirements for Oracle 9i,10g,11gR1 RAC_726833.1

rongshiyuan發表於2014-03-20
Linux: Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11gR1 RAC (Doc ID 726833.1)

In this Document

Purpose
Scope
Details
References


Applies to:

Oracle Database - Enterprise Edition - Version 9.2.0.8 to 11.1.0.7 [Release 9.2 to 11.1]
Linux x86
Linux x86-64

Purpose

Hangcheck_timer module is required to run a supported configuration in Oracle Real Application Clusters environments on Linux, with Oracle releases 9i, 10g, or 11gR1 RAC.  This note identifies and outlines the requirements needed to configure hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.

Note : Hangheck timer is not required starting with Oracle Clusterware 11gR2

Scope

This article is provided for product management, system architects, and system administrators involved in deploying and configuring Oracle RAC 9i, 10g, or 11gR1 in a Linux environment. This document will also be useful to field engineers and consulting organizations to facilitate installations and configuration requirements of Oracle in a Linux RAC environment.

Details

Starting in release 9.2.0.2 and later, Oracle RAC environments required using a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.

Hangcheck-timer should be loaded at boot time, and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node.  It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs.  This is done by setting a timer, then checking when the timer fires as to whether it was delayed by more than the allowed margin of error.  If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin seconds), the machine is restarted.  Hangcheck-timer will not cause reboots to occur due to CPU starvation.

 Hangcheck-timer requires three configuration parameters:

  • hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
  • hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
  • hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected.   The default value varies by kernel version.  In the 2.4 kernel, the default is 1.  In 2.6 kernels, the default is 0.
All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows:
  • 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds:
    hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
  • 10g/11gR1: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
    hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

You must always ensure that the Cluster misscount setting is greater than the sum of the setting for hangcheck_tick + hangcheck_margin.

@  Unpublished information for Oracle Support Internal Use:

When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node, as the functionality of this module is required to provide I/O Fencing to ensure no stray writes will occur from an evicted node in a RAC cluster.  To verify if the hangcheck-timer module is running on a node execute as the root or oracle user:

# /sbin/lsmod | grep hangcheck

hangcheck-timer         2672   0


If the hangcheck-timer module is loaded (running) you will see output similar to above. When hangcheck-timer is not loaded no output is generated, and the command prompt is returned to the user.

In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:

# modprobe hangcheck-timer  hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1


In order to ensure the module is loaded at boot time, you should also place the same command in the appropriate local command execution directory (e.g. /etc/rc.d/rc.local, or /etc/init.d/boot.local).  In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release specific documentation to determine which initialization method is required.

Hangcheck-timer will provide message logging to the system messages log when a failure is detected, and a node restart is initiated by the module:

  • When Hangcheck-timer reboots it may leave "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages
  • If you see the following message in /var/log/messages:  "Hangcheck: hangcheck value past margin!" this means a reboot was required but was not performed, because hangcheck_reboot was not set to 1.  If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.


Known Issues

  • Bug:6125546 which can prevent hangcheck-timer from rebooting in RHEL4 (fixed in 2.6.9.56 or RHEL4.6)


 

Database - RAC/Scalability Community
To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Database - RAC/Scalability Community

References

NOTE:232355.1 - Hangcheck Timer FAQ
NOTE:559365.1 - Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:567730.1 - Changes in Oracle Clusterware on Linux with the 10.2.0.4 Patchset

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/17252115/viewspace-1126109/,如需轉載,請註明出處,否則將追究法律責任。

相關文章