Troubleshooting 11.2 Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)
In this Document
Purpose
Scope
Details
NODE EVICTION OVERVIEW
1.0 - PROCESS ROLES FOR REBOOTS
2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
3.0 - TROUBLESHOOTING OCSSD EVICTIONS
3.1 - COMMON CAUSES OF OCSSD EVICTIONS
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
References
Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 to 11.2.0.2 [Release 11.2]
Information in this document applies to any platform.
Purpose
This document provides a reference for troubleshooting 11.2 Clusterware node evictions. For Clusterware node evictions prior to 11.2, see Note 265769.1.
Scope
This document is intended for DBAs and support analysts experiencing clusterware node evictions (reboots).
Details
NODE EVICTION OVERVIEW
The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.
Starting in 11.2.0.2 RAC (or if you are on Exadata), a node eviction may not actually reboot the machine. This is called a rebootless restart. In this case we restart most of the clusterware stack to see if that fixes the unhealthy node.
1.0 - PROCESS ROLES FOR REBOOTS
OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments. OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files). OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin
CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent
CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
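The three startup sequences above can be read as a small process tree. The following is a minimal Python sketch, not an Oracle tool, that parses the sequences exactly as written in this section and answers "which process spawned which". The function names build_parent_map and ancestry are hypothetical and exist only for illustration.

```python
# Startup sequences copied verbatim from section 1.0.
STARTUP_SEQUENCES = [
    "INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin",
    "INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent",
    "INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor",
]


def build_parent_map(sequences):
    """Map each process name to the process listed immediately before it."""
    parents = {}
    for seq in sequences:
        steps = [step.strip() for step in seq.split("-->")]
        for parent, child in zip(steps, steps[1:]):
            parents[child] = parent
    return parents


def ancestry(process, parents):
    """Return the chain from the top-level process down to `process`."""
    chain = [process]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return list(reversed(chain))


parents = build_parent_map(STARTUP_SEQUENCES)
print(parents["ocssd"])                    # parent of the ocssd wrapper
print(" --> ".join(ancestry("cssdmonitor", parents)))
```

Read this way, cssdagent and cssdmonitor are siblings under ohasd.bin, while ocssd.bin sits below cssdagent, which matches the role descriptions above.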
2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
Important files to review:
- Clusterware alert log in /log/
- The cssdagent log(s) in /log/ /agent/ohasd/oracssdagent_root
- The cssdmonitor log(s) in /log/ /agent/ohasd/oracssdmonitor_root
- The ocssd log(s) in /log/ /cssd
- The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
- IPD/OS or OS Watcher data
- 'opatch lsinventory -detail' output for the GRID home
- OS messages files, at the following locations:
  - Linux: /var/log/messages
  - Sun: /var/adm/messages
  - HP-UX: /var/adm/syslog/syslog.log
  - IBM: output of /bin/errpt -a > messages.out
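When gathering from mixed platforms, the per-platform messages locations listed above can be kept in one lookup table. This is a minimal Python sketch under stated assumptions: the platform keys ("linux", "sun", "hp-ux", "ibm") are hypothetical labels chosen for this example, and the function only reports where to look without reading any file or running any command.

```python
# Locations copied from the list above. The dictionary keys are hypothetical
# labels for this sketch, not values returned by any Oracle or OS API.
MESSAGES_FILES = {
    "linux": "/var/log/messages",
    "sun": "/var/adm/messages",
    "hp-ux": "/var/adm/syslog/syslog.log",
}

# On IBM the text gives a command whose output is saved, not a file path.
IBM_ERRPT_COMMAND = ["/bin/errpt", "-a"]


def messages_source(platform):
    """Return ("file", path) or ("command", argv) for a platform label."""
    key = platform.strip().lower()
    if key == "ibm":
        return ("command", IBM_ERRPT_COMMAND)
    if key in MESSAGES_FILES:
        return ("file", MESSAGES_FILES[key])
    raise ValueError("no messages location listed for platform: " + platform)


kind, target = messages_source("Linux")
print(kind, target)
```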
Document 1513912.1 - TFA Collector - Tool for Enhanced Diagnostic Gathering
11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log. This can be used to determine which process is responsible for the reboot. Example message from a clusterware alert log:
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
This particular eviction happened when we had hit the network timeout. CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made from CSSD.
If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.
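To triage many alert logs quickly, the CRS-8013 advisory shown above can be picked apart with a regular expression. The sketch below is an illustration in Python, not an Oracle utility. Its pattern is modelled only on the single example message in this section, so other releases may word the advisory differently. It also assumes that the timeout whose value equals the exceeded limit is the one that triggered the reboot, which is how the example is interpreted above. The function name parse_reboot_advisory is hypothetical.

```python
import re

# Pattern based solely on the example CRS-8013 line in this section (assumption).
ADVISORY_RE = re.compile(
    r"\[(?P<proc>\w+)\((?P<pid>\d+)\)\]CRS-8013:.*?"
    r"limit (?P<limit>\d+) exceeded; "
    r"disk timeout (?P<disk>\d+), "
    r"network timeout (?P<network>\d+)"
)


def parse_reboot_advisory(line):
    """Extract reporter and timeout values from a CRS-8013 line, or None."""
    m = ADVISORY_RE.search(line)
    if m is None:
        return None
    limit = int(m.group("limit"))
    disk = int(m.group("disk"))
    network = int(m.group("network"))
    # Assumption: the timeout equal to the exceeded limit is the trigger.
    if limit == network:
        trigger = "network"
    elif limit == disk:
        trigger = "disk"
    else:
        trigger = "unknown"
    return {
        "reporter": m.group("proc"),
        "pid": int(m.group("pid")),
        "limit": limit,
        "disk_timeout": disk,
        "network_timeout": network,
        "trigger": trigger,
    }


sample = (
    "[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after "
    "limit 28500 exceeded; disk timeout 27630, network timeout 28500, last "
    "heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 "
    "milliseconds ago based on invariant clock value of 93235653"
)
info = parse_reboot_advisory(sample)
print(info["trigger"], info["limit"])
```

For the example message the limit (28500) equals the network timeout, consistent with the statement that the network timeout was hit.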
3.0 - TROUBLESHOOTING OCSSD EVICTIONS
If you have encountered an OCSSD eviction, review the common causes in section 3.1 below.
3.1 - COMMON CAUSES OF OCSSD EVICTIONS
- Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction.
- Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
- A member kill escalation. For example, a database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it could escalate to a node kill.
- An unexpected failure or hang of the OCSSD process. This can be caused by any of the above issues or by something else.
- An Oracle bug.
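The first two causes above are threshold rules, and a small model makes them concrete. The following Python sketch is a simplified illustration and not how OCSSD is implemented. It assumes check-ins are represented as a list of booleans, uses the default of 30 quoted above (the real value is determined by the CSS misscount), and treats "majority" as a strict majority of voting files. All function names are hypothetical.

```python
DEFAULT_MISSCOUNT = 30  # default quoted in the text; actual value comes from CSS


def longest_miss_streak(checkins):
    """Longest run of consecutive missed check-ins (False entries)."""
    longest = current = 0
    for ok in checkins:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def network_eviction_expected(checkins, misscount=DEFAULT_MISSCOUNT):
    """True when consecutive misses reach the misscount threshold."""
    return longest_miss_streak(checkins) >= misscount


def has_voting_majority(reachable, total):
    """True when the node can heartbeat to a strict majority of voting files."""
    return reachable * 2 > total


# 29 consecutive misses followed by a successful check-in: streak is reset.
print(network_eviction_expected([True] * 5 + [False] * 29 + [True]))
# 30 consecutive misses: threshold reached.
print(network_eviction_expected([True] * 5 + [False] * 30))
# One of three voting files reachable: no majority.
print(has_voting_majority(1, 3))
```

The key point of the model is that only consecutive misses count: a single successful check-in resets the streak.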
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
Example of an eviction due to loss of voting disk:
CSS log:
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
OS messages:
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...
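In the CSS log excerpt above, the thread named in the clssscExit line (clssnmvDiskPingMonitorThread) is what points to the voting disk. A minimal Python sketch for pulling that thread name out of ocssd log lines follows. The pattern is based only on the two log lines quoted in this section, so treat it as an assumption about the format. The function name find_cssd_aborts is hypothetical.

```python
import re

# Pattern based on the clssscExit line quoted in this section (assumption).
ABORT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*CSSD\]\[\d+\]clssscExit: CSSD aborting from thread (?P<thread>\w+)"
)


def find_cssd_aborts(lines):
    """Return (timestamp, thread name) for each CSSD abort line found."""
    results = []
    for line in lines:
        m = ABORT_RE.match(line.strip())
        if m:
            results.append((m.group("ts"), m.group("thread")))
    return results


log_lines = [
    "2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################",
    "2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread",
]
for ts, thread in find_cssd_aborts(log_lines):
    print(ts, thread)
```

The timestamp returned can then be compared against the OS messages file, as in the example above, where the storage errors precede the CSSD abort by about two minutes.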
4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in section 4.1 below.
4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
- An OS scheduler problem. For example, the OS is locked up in a driver or hardware, or there is excessive load on the machine (at or near 100% CPU utilization), preventing the scheduler from behaving reasonably.
- One or more threads within the CSS daemon are hung.
- An Oracle bug.
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
References
NOTE:265769.1 - Troubleshooting 10g and 11.1 Clusterware Reboots
NOTE:301137.1 - OSWatcher Black Box (Includes: [Video])
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:736752.1 - Introducing Cluster Health Monitor (IPD/OS)
NOTE:1513912.1 - TFA Collector - Tool for Enhanced Diagnostic Gathering