Waiting for clusterware split-brain resolution

Posted by xychong123 on 2016-12-28

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 12.1.0.2 [Release 11.2 to 12.1]
Information in this document applies to any platform.

PURPOSE

The purpose of this document is to explain when a "Waiting for clusterware split-brain resolution" alert log message precedes an instance crash or eviction.

TROUBLESHOOTING STEPS

Background

Before one or more instances crash, the alert.log shows "Waiting for clusterware split-brain resolution".  This is often followed by "Evicting instance n from cluster", where n is the number of the instance being evicted.  The lmon process sends a network ping to the remote instances; if the lmon processes on the remote instances do not respond, a split brain at the instance level has occurred.  Finding out why the lmon processes cannot communicate with each other is therefore the key to resolving this issue.
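As a first step, confirm the exact message sequence in the alert log. A minimal sketch follows; the sample excerpt is illustrative (only the two messages quoted above come from the note, and the instance number and file path are made up):

```shell
# Write an illustrative alert.log excerpt to a temp file (the instance
# number and path are made up for this example).
cat > /tmp/alert_sample.log <<'EOF'
Waiting for clusterware split-brain resolution
Evicting instance 2 from cluster
EOF

# Locate the split-brain message together with the eviction line that
# typically follows it. On a real system, point grep at your instance's
# alert log instead of the sample file.
grep -n -A1 "Waiting for clusterware split-brain resolution" /tmp/alert_sample.log
```

The `-A1` context option shows the line after each match, so the message and the eviction that follows it appear together with their line numbers.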

The common causes are:
1) An instance-level split brain is frequently caused by a network problem, so checking the network settings and connectivity is important.  However, since the clusterware (CRS) itself would have failed if the network were down, the network is likely not completely down as long as both CRS and the database use the same network.
2) The server is very busy and/or the amount of free memory is low -- heavy swapping and scanning of memory will prevent the lmon processes from being scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.

Troubleshooting Instructions

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failure errors.
2) Check the network configuration to make sure that all network settings are correct on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.  CHM output is available only on certain platforms and versions, so please check the CHM FAQ, Document 1328466.1.
6) Set up OSWatcher by following the instructions in Document 301137.1 if it is not set up already.
   Having OSWatcher output is helpful when CHM output is not available.
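A few of the OS-level checks above can be sketched as quick commands. This assumes a Linux node; the utilities named here may differ or be absent on other platforms, so the commands are guarded:

```shell
#!/bin/sh
# Quick OS-level spot checks corresponding to steps 1) to 3) above
# (Linux commands; adjust for your platform).

# Step 1: UDP errors / dropped datagrams on the node.
if command -v netstat >/dev/null 2>&1; then
    netstat -su | grep -i -E 'error|buffer'
fi

# Step 2: interface MTU sizes (must match across all nodes).
if command -v ip >/dev/null 2>&1; then
    ip -o link show | grep -o 'mtu [0-9]*'
fi

# Step 3: CPU load, free memory, and swap activity.
uptime
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3    # watch the si/so columns for heavy swapping
fi
```

Run the same checks on every node and compare the results; a mismatch (for example, differing MTU values) is itself a finding for step 2.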

 

Diagnostic Collection

If TFA is installed, simply run the following command:

$GI_HOME/tfa/bin/tfactl diagcollect -from "MMM/dd/yyyy hh:mm:ss" -to "MMM/dd/yyyy hh:mm:ss"

Format example: "Jul/1/2014 21:00:00"
Specify the "from time" as 4 hours before and the "to time" as 4 hours after the time of the error.
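The 8-hour window can be computed rather than worked out by hand. A sketch using GNU `date` (an assumption; BSD/macOS `date` uses different flags), with the note's example error time:

```shell
#!/bin/sh
# Time of the error, taken from the format example above.
ERR="2014-07-01 21:00:00"

# 4 hours before and after, formatted as MMM/dd/yyyy hh:mm:ss for tfactl.
# LC_ALL=C keeps the month abbreviation in English; the "N hours ago" /
# "N hours" relative syntax is specific to GNU date.
FROM=$(LC_ALL=C date -d "$ERR 4 hours ago" +"%b/%-d/%Y %H:%M:%S")
TO=$(LC_ALL=C date -d "$ERR 4 hours" +"%b/%-d/%Y %H:%M:%S")

echo "$FROM"   # Jul/1/2014 17:00:00
echo "$TO"     # Jul/2/2014 01:00:00
```

The two strings can then be passed directly as the `-from` and `-to` arguments of the tfactl command above.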

 

If TFA is not installed, collect the following manually:

Database logs & trace files:

cd $(orabase)/diag/rdbms
tar cf - $(find . -name '*.trc' -exec grep -l '<date_time_search_string>' {} \; | grep -v bucket) | gzip > /tmp/database_trace_files.tar.gz

ASM logs & trace files:

cd $(orabase)/diag/asm/+asm/
tar cf - $(find . -name '*.trc' -exec grep -l '<date_time_search_string>' {} \; | grep -v bucket) | gzip > /tmp/asm_trace_files.tar.gz

Clusterware logs:

<GI home>/bin/diagcollection.sh --collect --crs --crshome <GI home>

OS logs:

/var/adm/messages* or /var/log/messages* or 'errpt -a' or Windows System Event Viewer log (saved as .TXT file)

 

From the "ITPUB Blog". Link: http://blog.itpub.net/20747382/viewspace-2131514/. Please cite the source when reposting; otherwise legal liability may be pursued.
