Top 11 Things to do NOW to Stabilize your RAC Cluster Environment_1344678.1

Posted by rongshiyuan on 2014-11-09

Top 11 Things to do NOW to Stabilize your RAC Cluster Environment (Doc ID 1344678.1)


In this Document

Purpose
Scope
Details
  1.  Apply the latest Patchset Update (PSU) to your environment
  2.  Ensure that UDP buffers are sized appropriately
  3.  Set DIAGWAIT to a value of 13 on all 10.2 and 11.1 Clusters
  4.  Implement HugePages on Linux Environments
  5.  Implement OS Watcher and/or Cluster Health Monitor
  6.  Follow Best Practices for OS settings
  7.  Ensure appropriate APARS are in place on AIX platforms to avoid Excessive paging/swapping issues
  8.  Apply NUMA Patch
  9.  Increase the Windows noninteractive Desktop Heap
  10.  Run ORAchk utility
  11. Implement NTP with slewing option
References

Applies to:

Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.3 [Release 10.2 to 11.2]
Information in this document applies to any platform.

Purpose


Many RAC instability issues are attributed to a rather short list of commonly missed Best Practices and/or Configuration issues.  The goal of this document is to provide an easy-to-find listing of these commonly missed Best Practices and/or Configuration issues, in the hope of preventing the instability they cause.

Scope

This article applies to ALL RAC implementations.

Details

1.  Apply the latest Patchset Update (PSU) to your environment

Applicable to Platforms:  ALL PLATFORMS

Why?: Patchset Updates (PSUs) were introduced with 10.2.0.4 as an improvement on the Critical Patch Update (CPU) patching strategy. PSUs are released quarterly and include the latest CPU fixes plus other fixes deemed critical to the stability of your environment. For a new installation, always apply the most current PSU as your baseline. For existing installations, a strategy of maintaining the environment with the latest PSU on a regular and ongoing basis is a must. Many issues that come into Oracle Support and turn out to be bugs are known bugs, and many of these are already fixed in the latest PSU. Note that on Windows, cumulative bundle patches come out more frequently, but the latest PSU fixes are included in the Windows bundle patch that is released during the quarterly PSU release.

More Information: For more information on PSUs refer to the following documents:
Document 854428.1 Intro to Patch Set Updates (PSU)
Document 1082394.1 11.2.0.X Grid Infrastructure PSU Known Issues
Document 756671.1 Oracle Recommended Patches -- Oracle Database
Document 161549.1 Oracle Database, Networking and Grid Agent Patches for Microsoft Platforms

2.  Ensure that UDP buffers are sized appropriately

Applicable to Platforms: ALL PLATFORMS Except Windows

Why?: The interconnect is the lifeblood of a RAC database. However, without proper buffer space allocated for UDP send and receive buffers, the performance of the interconnect will suffer substantially. This will lead to stability issues with your cluster.

More Information: For more information on properly sizing UDP buffers refer to the following documents:
Document 181489.1 Tuning Inter-Instance Performance in RAC and OPS
Document 563566.1 gc lost blocks diagnostics

Note: Windows clusters use TCP for cache fusion traffic, so UDP buffer settings are not applicable to Windows.
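As a rough illustration of the check described above, the following sketch compares socket buffer settings against minimum values. The parameter names are the Linux sysctl names, and the minimums shown are assumptions used only for illustration; take the values for your platform from Document 181489.1 or your installation guide. The helper name check_min is hypothetical.

```shell
#!/bin/sh
# Compare a kernel socket-buffer setting against a minimum (both in bytes).
# The minimums passed below are illustrative assumptions, not official values.
check_min() {
    name=$1; current=$2; minimum=$3
    if [ "$current" -ge "$minimum" ]; then
        echo "OK   $name=$current (minimum $minimum)"
    else
        echo "LOW  $name=$current (minimum $minimum)"
    fi
}

# Sample values; on a live Linux node the current value could be read
# with, for example: "$(sysctl -n net.core.rmem_max)".
check_min net.core.rmem_default 262144  262144
check_min net.core.rmem_max     212992  4194304
check_min net.core.wmem_default 262144  262144
check_min net.core.wmem_max     1048576 1048576
```

Any line reported as LOW indicates a buffer that may be too small for interconnect traffic and is worth reviewing against the referenced documents.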

 

3.  Set DIAGWAIT to a value of 13 on all 10.2 and 11.1 Clusters

Applicable to Platforms:  ALL PLATFORMS Except Windows

Why?:  The default margin for the OPROCD daemon in 10gR2 (10.2.x) and 11gR1 (11.1.x) is only 500 milliseconds (0.5 seconds). This margin may be too small for very busy systems, so false reboots may occur on a heavily loaded system. Changing the diagwait setting to 13 raises the OPROCD margin to 10,000 milliseconds (10 seconds), which gives busy systems more room to avoid false reboots. In addition, the diagwait setting allows more time to flush information to trace files for further debugging if a reboot does occur. This change cannot be included in a patchset, because it requires a clusterwide outage to implement. However, it is strongly recommended that ALL 10gR2 and 11gR1 clusters have this value set to 13.  For new implementations, this change should be made shortly after installation. For existing installations, downtime should be scheduled to make this change as soon as feasible. The current setting can be determined with the following command:

# $CLUSTERWARE_HOME/bin/crsctl get css diagwait

 

Note: This setting is not applicable to Windows environments, nor is this setting applicable to 11gR2 (11.2.0.1 and later) releases.  For 10gR2, patchset 10.2.0.4 (or later) is required.
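A minimal sketch of how the result of that command might be interpreted in a script follows. It assumes the command prints the bare value (for example "13") when the parameter is set; the exact output text can differ by version, and the second sample string and the helper name diagwait_status are hypothetical. The procedure for changing the value is described in Document 559365.1.

```shell
#!/bin/sh
# Decide whether DIAGWAIT needs changing, given the text printed by
# "crsctl get css diagwait". Assumption: the command prints the bare
# value (for example "13") when the parameter has been set.
diagwait_status() {
    value=$(printf '%s' "$1" | tr -d '[:space:]')
    if [ "$value" = "13" ]; then
        echo "diagwait is 13: no change needed"
    else
        echo "diagwait is not 13: schedule a clusterwide outage to set it"
    fi
}

# Sample inputs; on a cluster node one could pass
# "$($CLUSTERWARE_HOME/bin/crsctl get css diagwait)" instead.
diagwait_status "13"
diagwait_status "Configuration parameter diagwait is not defined."
```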


More Information: For more information on DIAGWAIT, please refer to the following notes:
Document 559365.1  Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
Document 567730.1  Changes in Oracle Clusterware on Linux with the 10.2.0.4 Patchset

4.  Implement HugePages on Linux Environments

Applicable to Platforms:  ALL LINUX 64-Bit PLATFORMS

Why?:  Implementing HugePages greatly improves the performance of the kernel on Linux environments. This is especially true for systems with more memory. Generally speaking, any system with more than 12GB of RAM is a good candidate for HugePages. The more RAM there is in the system, the more the system will benefit from having HugePages enabled, because the work the kernel must do to map and maintain the page tables grows with the amount of memory. Enabling HugePages greatly reduces the number of pages the kernel must manage and makes the system much more efficient. If HugePages are NOT enabled, experience has shown that it is very common for the kernel to preempt the critical Oracle Clusterware or Real Application Clusters daemons, leading to instance evictions or node evictions.

Note:  11g Automatic Memory Management (AMM) is NOT compatible with Huge Pages on the Linux platform.  Best practice is to disable AMM in favor of HugePages.  See Document 749851.1 for more information regarding AMM and HugePages on Linux.



More information:
Document 361323.1  HugePages on Linux: What It Is... and What It Is Not...
Document 401749.1  Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration
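To give a feel for the arithmetic involved, the sketch below estimates how many huge pages are needed to cover an SGA of a given size. It is a simplification under stated assumptions: the SGA size (16384 MB) and the sample /proc/meminfo text are made up for illustration, the helper names are hypothetical, and the script in Document 401749.1 remains the recommended way to calculate the value because it inspects the shared memory segments actually in use.

```shell
#!/bin/sh
# Read the huge page size (in kB) from /proc/meminfo-style text on stdin.
hugepage_kb() {
    awk '/^Hugepagesize:/ { print $2 }'
}

# Number of huge pages needed to cover an SGA of the given size (MB),
# rounded up to a whole page.
hugepages_needed() {
    sga_mb=$1; page_kb=$2
    echo $(( (sga_mb * 1024 + page_kb - 1) / page_kb ))
}

# Sample meminfo excerpt; on a live system use: hugepage_kb < /proc/meminfo
page_kb=$(printf 'MemTotal:       65842164 kB\nHugepagesize:       2048 kB\n' | hugepage_kb)
echo "vm.nr_hugepages >= $(hugepages_needed 16384 "$page_kb")"
```

When several instances run on the same node, the page count has to cover the combined size of all their SGAs.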

5.  Implement OS Watcher and/or Cluster Health Monitor

Applicable to Platforms:  ALL PLATFORMS

Why?: Though not directly related to stability, OS Watcher and Cluster Health Monitor are invaluable tools for determining the state of the OS and the potential root cause of many problems leading to node or instance evictions. Having the proper data available after the first occurrence of a problem shortens the time needed to determine the cause, and therefore helps prevent future outages. Most third-party data gathering tools of this type have collection intervals that are too long (i.e. 5 minutes or longer), are difficult to interpret, or do not collect the proper data. OS Watcher is a very simple and lightweight tool that gathers basic OS information every 30 seconds (by default). Cluster Health Monitor, though not available on all platforms, complements OS Watcher by collecting data in real time at a more granular level. It is crucial that one or both of these utilities be running on all cluster nodes at all times, to facilitate more rapid diagnosis and debugging of issues.

More Information:
Document 301137.1 OS Watcher User Guide
Document 1328466.1 Cluster Health Monitor (CHM) FAQ
Document 580513.1 How To Start OSWatcher Every System Boot (Linux specific)
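The point about collection intervals can be made concrete with a little arithmetic. The sketch below computes the minimum number of samples guaranteed to fall inside an event of a given length; the 120-second stall is a made-up example and the helper name samples_in_event is hypothetical.

```shell
#!/bin/sh
# Minimum number of samples guaranteed to fall inside an event of the given
# duration (seconds) for a given collection interval (seconds).
# Integer division: a result of 0 means the event can pass entirely
# between two samples and leave no trace in the collected data.
samples_in_event() {
    echo $(( $1 / $2 ))
}

echo "30-second interval, 120-second stall: at least $(samples_in_event 120 30) samples"
echo "300-second interval, 120-second stall: at least $(samples_in_event 120 300) samples"
```

With a 5-minute interval, a two-minute stall that ends in a node eviction may not be captured at all, which is why the shorter default interval of OS Watcher matters.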


6.  Follow Best Practices for OS settings

(in the Joint Oracle / IBM White Paper on memory tuning for System Stability)

Applicable to Platforms:  All AIX Versions

Why?: The white paper "Oracle Real Application Clusters on IBM AIX Best practices in memory tuning and configuring for system stability" is the culmination of joint testing and combined best practices from both vendors, based on mutual experience. Experience has shown that the majority of stability problems in RAC/AIX clusters can be resolved by following the recommendations in this paper. AIX version 6.1 has incorporated many of these recommendations as default values; however, these settings should be confirmed on all AIX RAC clusters regardless of OS version or Oracle version.

More Information:
White Paper available at: http://www.oracle.com/technetwork/database/clusterware/overview/rac-aix-system-stability-131022.pdf
Document 811293.1  RAC Assurance Support Team: RAC Starter Kit and Best Practices (AIX)

7.  Ensure appropriate APARS are in place on AIX platforms to avoid Excessive paging/swapping issues


Applicable to platforms: All AIX Versions

Why?: Experience has shown that this is a very common issue affecting AIX environments. Because of the nature of this issue, anyone susceptible to this problem can experience a complete system hang. In a non-RAC environment, this will lead to a hang of the system until manual intervention takes place. However, in a RAC environment, this will lead to a Node eviction due to lack of responsiveness of the node.

More Information: For more information on this problem, refer to the following document:
Document 1088076.1 Paging Space Growth May Occur Unexpectedly on AIX Systems With 64K (medium) Pages Enabled

Note: The version and number of the APAR listed in the note are specific to a given Technology Level (TL). The actual APAR or fix number that you need to apply depends on your particular TL, and different APARs are available for different Technology Levels. Check with IBM to confirm whether you have this fix in place and, if not, which TL or APAR is required to get it.
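Because the required APAR depends on the Technology Level, a first step is to identify the TL of each node. The sketch below assumes the level string has the form BASE-TL-SP-BUILD, as printed by the AIX command "oslevel -s"; the sample string and the helper name aix_tl are hypothetical, and no APAR numbers are given here because they must come from IBM or Document 1088076.1.

```shell
#!/bin/sh
# Extract the base level and Technology Level from an "oslevel -s" style string.
# Assumption: the string has the form BASE-TL-SP-BUILD, e.g. 6100-06-05-1115.
aix_tl() {
    printf '%s\n' "$1" | cut -d- -f1,2
}

# Sample value; on an AIX node one could pass "$(oslevel -s)" instead.
echo "Technology Level: $(aix_tl "6100-06-05-1115")"

# Once the APAR for that TL is known, its presence can be checked on AIX
# with the instfix command, for example: instfix -ik <APAR number>
```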

 

8.  Apply NUMA Patch

Applicable to platforms:  ALL PLATFORMS

Why?:  With the 10.2.0.4 and 11.1.0.7 RDBMS patchsets, NUMA optimization was enabled on platforms that support NUMA (OS and hardware dependent). Enabling NUMA within the RDBMS code on systems that support it has resulted in bugs causing performance degradation and instability of the database. A full listing of symptoms and issues related to NUMA optimization in 10.2.0.4 and 11.1.0.7 can be found in Document 759565.1. If you are running the 10.2.0.4 or 11.1.0.7 patchsets, Oracle highly recommends that Patch 8199533 be applied to your system to proactively address these NUMA related issues.
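One way to confirm whether the patch is already in place is to search the OPatch inventory listing for its number. The sketch below assumes that applied patches appear on lines of the form "Patch  <number> : applied on ..."; the sample inventory text and the helper name patch_listed are hypothetical.

```shell
#!/bin/sh
# Report whether a patch number appears in "opatch lsinventory"-style text
# read from stdin. Assumption: applied patches are listed on lines of the
# form "Patch  <number>  : applied on ...".
patch_listed() {
    if grep -Eq "^Patch[[:space:]]+$1[[:space:]]"; then
        echo "patch $1 is listed"
    else
        echo "patch $1 is NOT listed"
    fi
}

# Sample inventory text (hypothetical). On a live system:
#   $ORACLE_HOME/OPatch/opatch lsinventory | patch_listed 8199533
printf 'Interim patches (1) :\n\nPatch  8199533      : applied on Mon Jan 05 10:00:00 GMT 2009\n' \
    | patch_listed 8199533
```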

9.  Increase the Windows noninteractive Desktop Heap

Applicable to platforms:  Windows Platforms

Why?: It has been found that on Windows clusters the default size of the noninteractive desktop heap is not sufficient. This results in application connectivity issues and general instability of the cluster (hangs and/or crashes). To address this proactively, it is recommended to increase the noninteractive desktop heap to 1MB. Increases beyond the recommended 1MB should not be made without the involvement of Microsoft.

More Information: Instructions on how to make this adjustment to the noninteractive desktop heap can be found in Document 744125.1.
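For orientation only, the sketch below shows how the current heap size could be read from the relevant setting. It assumes the noninteractive desktop heap is the third SharedSection value (in KB) of the Windows SubSystems registry string; the sample string is abbreviated and hypothetical, and the helper name is made up. The code is shell only to stay consistent with the other examples in this note; the actual change is made in the Windows registry as described in Document 744125.1.

```shell
#!/bin/sh
# Extract the third SharedSection value (KB) from a SubSystems-style string.
# Assumption: the third value is the noninteractive desktop heap size.
noninteractive_heap_kb() {
    printf '%s\n' "$1" | sed -n 's/.*SharedSection=[0-9]*,[0-9]*,\([0-9]*\).*/\1/p'
}

# Abbreviated, hypothetical sample of the registry value.
sample='ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On'
heap_kb=$(noninteractive_heap_kb "$sample")

if [ "$heap_kb" -ge 1024 ]; then
    echo "noninteractive desktop heap is $heap_kb KB: at least 1MB"
else
    echo "noninteractive desktop heap is $heap_kb KB: below the recommended 1MB"
fi
```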

10.  Run ORAchk utility

Applicable to platforms:  Linux (x86 and x86_64), Solaris SPARC and AIX (with the bash shell)

Why?: ORAchk is a RAC Configuration Audit tool designed to audit various important configuration settings within Real Application Clusters (RAC), Oracle Clusterware (CRS), Automatic Storage Management (ASM) and Grid Infrastructure (GI) environments. This utility is to be used to validate the Best Practices and Success Factors defined in the series of RAC and Oracle Clusterware Best Practices and Starter Kit Notes (see Document 810394.1) which are maintained by the RAC Assurance development and support teams.  Those customers running RAC on the ORAchk supported platforms are strongly encouraged to utilize this tool to identify potential configuration issues that could impact the stability of the cluster.

More Information: More information on ORAchk as well as the link to download the utility is found in Document 1268927.1.

11. Implement NTP with slewing option

Applicable to platforms:  All Linux and Unix Platforms.

Why?: Without the slew option, NTP will step the system clock forwards or backwards when the time discrepancy exceeds a specific (platform dependent) threshold. Large backward time shifts can cause the Clusterware to conclude that check-ins have been missed, resulting in node evictions. For this reason it is highly recommended that NTP be configured to slew the clock (speed it up or slow it down) to synchronize the time and prevent such evictions.  For more information on how to implement NTP time slewing on your platform, please refer to the Platform Specific RAC and Oracle Clusterware Best Practice and Starter Kit Notes (see below).

More Information:
Document 811306.1 RAC and Oracle Clusterware Best Practices and Starter Kit (Linux)
Document 811280.1 RAC and Oracle Clusterware Best Practices and Starter Kit (Solaris)
Document 811271.1 RAC and Oracle Clusterware Best Practices and Starter Kit (Windows)
Document 811293.1 RAC and Oracle Clusterware Best Practices and Starter Kit (AIX)
Document 811303.1 RAC and Oracle Clusterware Best Practices and Starter Kit (HP-UX)
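As a small illustration of the slewing check, the sketch below tests whether an ntpd option string contains the -x flag, which is the ntpd option that requests slewing. The sample option strings and the helper name has_slew_flag are hypothetical; where the options are stored is platform specific (on Red Hat style Linux systems it is typically the OPTIONS line in /etc/sysconfig/ntpd), so follow the Starter Kit note for your platform.

```shell
#!/bin/sh
# Report whether an ntpd option string contains the -x (slew) flag
# as a standalone option.
has_slew_flag() {
    case " $1 " in
        *" -x "*) echo "slewing enabled" ;;
        *)        echo "slewing NOT enabled" ;;
    esac
}

# Sample option strings (hypothetical).
has_slew_flag "-x -u ntp:ntp -p /var/run/ntpd.pid"
has_slew_flag "-u ntp:ntp -p /var/run/ntpd.pid"
```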

 


References

NOTE:301137.1 - OSWatcher (Includes: [Video])
NOTE:361323.1 - HugePages on Linux: What It Is... and What It Is Not...
NOTE:811293.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (AIX)
NOTE:811280.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (Solaris)
NOTE:220970.1 - RAC: Frequently Asked Questions
NOTE:756671.1 - Oracle Recommended Patches -- Oracle Database
NOTE:759565.1 - Oracle NUMA Usage Recommendation
NOTE:559365.1 - Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:811306.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (Linux)
NOTE:854428.1 - Patch Set Updates for Oracle Products
NOTE:744125.1 - Windows: Connections Fail with ORA-12640 or ORA-21561
NOTE:401749.1 - Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
BUG:13623902 - NODE EVICTIONS ON RAC CLUSTER AFTER EXCESSIVE PAGING

NOTE:1268927.1 - ORAchk - Health Checks for the Oracle Stack
NOTE:1328466.1 - Cluster Health Monitor (CHM) FAQ
NOTE:161549.1 - Oracle Database, CRS, ASM, Networking and EM Agent Patches for Microsoft Platforms
NOTE:1082394.1 - 11.2.0.1.X Grid Infrastructure PSU Known Issues
NOTE:1088076.1 - AIX: Paging Space Growth May Occur Unexpectedly With 64K (medium) Pages Enabled
NOTE:181489.1 - Tuning Inter-Instance Performance in RAC and OPS
NOTE:563566.1 - Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
NOTE:749851.1 - HugePages and Oracle Database 11g Automatic Memory Management (AMM) on Linux
NOTE:1427855.1 - AIX: Top Things to DO NOW to Stabilize 11gR2 GI/RAC Cluster
NOTE:810394.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent)
NOTE:567730.1 - Changes in Oracle Clusterware on Linux with the 10.2.0.4 Patchset
NOTE:811271.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (Windows)

 

Document Details

Type: BULLETIN
Status: PUBLISHED
Last Major Update: 2014-3-9
Last Update: 2014-9-16
Language: English


Source: "ITPUB Blog", link: http://blog.itpub.net/17252115/viewspace-1326304/. If you repost this article, please cite the source; otherwise legal liability will be pursued.
