Oprocd & Hangcheck-timer

Posted by BTxigua on 2011-12-24


Process Monitor Daemon (OPROCD)

1.Process Monitor Daemon (OPROCD): This process is locked in memory to monitor the cluster and provide I/O fencing. OPROCD performs its check and then stops running; if it wakes up later than the expected time, OPROCD resets the processor and reboots the node. An OPROCD failure results in Oracle Clusterware restarting the node. OPROCD uses the hangcheck timer on Linux platforms.
OPROCD was introduced in 10.2.0.4 on Linux and other Unix platforms.
Note that oprocd runs only when no vendor clusterware is running, or on Linux from 10.2.0.4 onward.
 
OPROCD takes two parameters:
-t - Timeout value: length of time between executions, in milliseconds. Normally defaults to 1000.
-m - Margin: acceptable margin before rebooting, in milliseconds. Normally defaults to 500.
By default, -t is 1 second and -m is 0.5 seconds, so the OPROCD reboot delay is 1.5 seconds:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 500 -f
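The reboot delay is simply the sum of the -t and -m arguments. As a minimal sketch, the delay can be read straight off a process listing line like the one above. The function name oprocd_delay_ms is hypothetical, and the sketch assumes the command line carries -t and -m in the form shown in the ps output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oracle Clusterware): read one oprocd
# command line on stdin and print the reboot delay in milliseconds,
# i.e. the sum of the -t (timeout) and -m (margin) values.
oprocd_delay_ms() {
    awk '{
        t = ""; m = ""
        # Scan the fields for the -t and -m flags and take the value after each.
        for (i = 1; i < NF; i++) {
            if ($i == "-t") t = $(i + 1)
            if ($i == "-m") m = $(i + 1)
        }
        if (t == "" || m == "") exit 1   # flags not found on this line
        print t + m
    }'
}
```

It could be fed from a live listing, for example: ps -efl | grep '[o]procd run' | oprocd_delay_ms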
Before 11gR2, you can use diagwait to extend the time before oprocd initiates a reboot. The steps are as follows:
1.Log in as root, and run the following command on all nodes, where CRS_home is the home directory of the Oracle Clusterware installation:
  # CRS_home/bin/crsctl stop crs
2.Enter the following command, where CRS_home is the Oracle Clusterware home:
  # CRS_home/bin/oprocd stop
  Repeat this command on all nodes.
3.Ensure that the Clusterware stack is down on all nodes by executing the following command:
  # ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
4.From one node of the cluster, change the value of the diagwait parameter to 13 seconds by issuing the following command as root:
  # CRS_home/bin/crsctl set css diagwait 13 -force
5.Check that diagwait was set successfully by executing the following command. The command should return 13. If diagwait is not set, the message "Configuration parameter diagwait is not defined" is returned.
  # crsctl get css diagwait
6.Restart the Oracle Clusterware by running the following command on all nodes:
  # CRS_home/bin/crsctl start crs
7.Run the following command to ensure that Oracle Clusterware is functioning properly:
  # CRS_home/bin/crsctl check crs
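The check in step 3 can match its own egrep process, which makes the result harder to read. The following is a minimal sketch of a stricter check; the function name stack_is_down is hypothetical, and it assumes the same four process names used in step 3.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and succeed (exit 0)
# only if none of the Clusterware daemons from step 3 is listed.
# Lines belonging to grep/egrep itself are dropped so that the check
# does not match its own pipeline.
stack_is_down() {
    ! grep -E 'crsd\.bin|ocssd\.bin|evmd\.bin|oprocd' \
        | grep -Ev 'grep' \
        | grep -q .
}
```

A possible use on each node before step 4: ps -ef | stack_is_down && echo "Clusterware stack is down"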
Unsetting/Removing diagwait
  # crsctl unset css diagwait -force
(Note: the -force option must be used when unsetting diagwait, since CRS will be down when doing so.)
Starting with 11.2.0.1, customers do not need to set diagwait, as the architecture has been changed.
Diagwait can be set on Windows, but it does not change the behaviour as it does on Unix and Linux platforms.
Parameter values after setting diagwait to 13; the OPROCD reboot delay becomes 11 seconds:
$ ps -efl | grep oprocd
0 S root 6444 3080 0 78 0 - 636 - Apr15 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
4 S root 7255 6444 0 -40 - - 516 - Apr15 ? 00:00:00 /u01/app/crs11g/bin/oprocd run -t 1000 -m 10000 -f

Hangcheck-Timer Module
2.Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
Starting with release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, the hangcheck-timer module. This module replaced the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the delay exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.
 Hangcheck-timer requires three configuration parameters:
    hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
    hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
    hangcheck_reboot - determines whether the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If hangcheck_reboot is equal to or greater than 1, the hangcheck-timer module restarts the system. If hangcheck_reboot is set to zero, the module will not reboot the node, even if a hang is detected. The default value varies by kernel version: 1 in the 2.4 kernel and 0 in 2.6 kernels.
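To make the relationship between the three parameters concrete, here is a minimal sketch. Both function names are hypothetical. The first computes the hang threshold as hangcheck_tick + hangcheck_margin. The second prints an options line in modprobe configuration syntax; which file that line belongs in depends on the distribution and is left as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the hang threshold in seconds, i.e.
# hangcheck_tick ($1) + hangcheck_margin ($2).
hangcheck_threshold() {
    echo $(( $1 + $2 ))
}

# Hypothetical helper: build a modprobe "options" line from the tick ($1),
# margin ($2) and reboot ($3) values. The values are the caller's choice;
# nothing here is a recommendation.
hangcheck_options() {
    printf 'options hangcheck-timer hangcheck_tick=%s hangcheck_margin=%s hangcheck_reboot=%s\n' \
        "$1" "$2" "$3"
}
```

With the default values listed above, hangcheck_threshold 60 180 prints 240, meaning a hang of more than 240 seconds is needed before the module acts.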
Hangcheck-timer writes to the system messages log when it detects a failure and when it initiates a node restart:
    When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
    If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message is seen, you must reload the hangcheck-timer module with hangcheck_reboot set to 1.
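The two messages can be told apart mechanically. The following is a minimal sketch that counts each one in a log read from stdin; the function name hangcheck_log_status and the output format are hypothetical, and the message texts are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read a system log on stdin and count the two
# hangcheck messages: restarts that were initiated, and hangs that were
# detected but not acted on because hangcheck_reboot was not set to 1.
hangcheck_log_status() {
    awk '
        /Hangcheck: hangcheck is restarting the machine/ { restarted++ }
        /Hangcheck: hangcheck value past margin!/        { skipped++ }
        END { printf "restarts=%d skipped=%d\n", restarted, skipped }
    '
}
```

A possible use as root: hangcheck_log_status < /var/log/messages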
Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.

From "ITPUB Blog", link: http://blog.itpub.net/10867315/viewspace-713900/. If you repost this article, please credit the source; otherwise legal liability will be pursued.