9i RAC Eviction Case Analysis - ORA-29740

Posted by hurp_oracle on 2014-12-23
9i RAC eviction case:

Reference document:
Troubleshooting Instance Evictions (Instance terminates with ORA-29740, Instance Abort, Instance Kill) (文件 ID 1549135.1)
Symptoms:
Provisioning system: the skt2b database went down

Cause:
The customer-service-3 system's service IP was taken over by SKF3A because the cage (I/O drawer) housing the service IP's NIC failed; skf3a took over normally.
Analysis of the skt2b instance crash on the provisioning system: Normally, when the primary heartbeat NIC goes down, traffic switches to the standby NIC immediately and the database is unaffected. This time, however, the cage housing the NIC failed while the NIC itself remained UP. In that state the active/standby NIC mechanism could not detect the fault in time, failover was delayed, the Oracle RAC heartbeat check timed out, and the instance was evicted.
Troubleshooting steps:
1. Check the alert log for the reason the database restarted
  The evicted instance's alert log shows ORA-29740, which means its LMON evicted and terminated the instance
Tue Dec 23 04:30:57 2014 
Communications reconfiguration: instance 0    ======> this shows that a communication problem between the instances led to a split brain
Waiting for clusterware split-brain resolution 
Tue Dec 23 04:31:27 2014 
Trace dumping is performing id=[41223043057] 
Tue Dec 23 04:37:25 2014 
Errors in file /oracle/app/admin/orakt2/udump/okt2b_ora_688330.trc: 
ORA-01013: user requested cancel of current operation 
Tue Dec 23 04:37:25 2014 
Errors in file /oracle/app/admin/orakt2/udump/okt2b_ora_2666844.trc: 
ORA-01013: user requested cancel of current operation 
Tue Dec 23 04:41:02 2014 
Errors in file /oracle/app/admin/orakt2/bdump/okt2b_lmon_1163676.trc: 
ORA-29740: evicted by member 0, group incarnation 24    ======> If ORA-29740 is found, this means that LMON of the evicted instance terminated it.
Tue Dec 23 04:41:02 2014 
LMON: terminating instance due to error 29740 
Instance terminated by LMON, pid = 1163676
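The alert-log check in step 1 can be sketched with a few grep commands. This is a minimal sketch: the heredoc simply reproduces the messages from this case so the commands are runnable; in practice point ALERT_LOG at the real alert log under bdump (path varies per environment).

```shell
# Minimal sketch of step 1: scan the alert log for eviction evidence.
ALERT_LOG=$(mktemp)
cat > "$ALERT_LOG" <<'EOF'
Communications reconfiguration: instance 0
Waiting for clusterware split-brain resolution
ORA-29740: evicted by member 0, group incarnation 24
LMON: terminating instance due to error 29740
EOF

# An ORA-29740 here means LMON of this instance terminated it.
ORA29740_COUNT=$(grep -c "ORA-29740" "$ALERT_LOG")
echo "ORA-29740 occurrences: $ORA29740_COUNT"

# Surrounding messages confirm a communications reconfiguration.
grep -E "Communications reconfiguration|split-brain" "$ALERT_LOG"
rm -f "$ALERT_LOG"
```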
2. For ORA-29740, check the LMON trace to find the eviction reason
Search the trace for lines containing "kjxgrrcfgchk: Initiating reconfig"; these give the reason the instance was evicted

The reconfiguration reasons are:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend


*** 2014-12-23 04:30:57.804 
kjxgrgetresults: Detect reconfig from 0, seq 23, reason 3 
kjxgrrcfgchk: Initiating reconfig, reason 3  ====> from this we can tell the eviction was caused by a communications failure
*** 2014-12-23 04:31:02.762 
kjxgmrcfg: Reconfiguration started, reason 3 
kjxgmcs: Setting state to 23 0. 
*** 2014-12-23 04:31:02.792 
Name Service frozen 
kjxgmcs: Setting state to 23 1. 
*** 2014-12-23 04:31:02.816 
kjxgrrecp2: Waiting for split-brain resolution, upd 0, seq 24 
*** 2014-12-23 04:41:02.911 
Voting results, upd 0, seq 24, bitmap: 0 
*** 2014-12-23 04:41:02.911 
kjxgrdtrt: Evicted by 0, seq (24, 24) 
IMR state information 
Member 1, thread 2, state 4, flags 0x00a1 
RR seq 24, propstate 3, pending propstate 0 
rcfg rsn 3, rcfg time 1941055106, mem ct 2 
master 1, master rcfg time 1941055106 

Member information: 
Member 0, incarn 23, version -1136027477 
thrd 1, prev thrd 65535, status 0x0047, err 0x0002 
valct 0 
Member 1, incarn 23, version -24384270 
thrd 2, prev thrd 65535, status 0x0107, err 0x0000 
valct 2 

Group name: ORAKT2 
Member id: 1 
Cached SKGXN event: 0 
Group State: 
State: 23 1 
Commited Map: 0 1 
New Map: 0 1 
SKGXN Map: 0 1 
Master node: 0 
Memcnt 2 Rcvcnt 0 
Substate Proposal: false 
Inc Proposal: 
incarn 0 memcnt 0 master 0 
proposal false matched false 
map: 
Master Inc State: 
incarn 0 memcnt 0 agrees 0 flag 0x1 
wmap: 
nmap: 
ubmap: 
Submitting asynchronized dump request [1] 
error 29740 detected in background process 
ORA-29740: evicted by member 0, group incarnation 24 
ksuitm: waiting for [5] seconds before killing DIAG
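The reason-code lookup in step 2 can be scripted. A minimal sketch, assuming the trace lines shown above (the heredoc is a canned sample; in practice point LMON_TRC at the real LMON trace file in bdump):

```shell
# Minimal sketch of step 2: extract the reconfiguration reason code
# from an LMON trace. Canned sample lines reproduce this case.
LMON_TRC=$(mktemp)
cat > "$LMON_TRC" <<'EOF'
kjxgrgetresults: Detect reconfig from 0, seq 23, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3
kjxgmrcfg: Reconfiguration started, reason 3
EOF

# The "Initiating reconfig" line carries the reason code
# (reason 3 = communications failure).
REASON=$(sed -n 's/.*kjxgrrcfgchk: Initiating reconfig, reason \([0-9][0-9]*\).*/\1/p' "$LMON_TRC")
echo "reconfig reason: $REASON"
rm -f "$LMON_TRC"
```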
3. Check the operating system logs
The log entries below show no network errors. However, the IBM engineer later confirmed that a cage had failed at the time: the primary RAC heartbeat NIC did not go down, but the cage housing it was broken. Since NIC active/standby failover is triggered by the interface going down, the heartbeat NICs did not fail over in time.
LABEL: SCAN_ERROR_CHRP 
IDENTIFIER: BFE4C025 
Date/Time: Tue Dec 23 04:26:35 BEIST 2014 
Sequence Number: 37687 
Machine Id: 00CFA74C4C00 
Node Id: skt2b 
Class: H 
Type: PERM 
Resource Name: sysplanar0 
Resource Class: planar 
Resource Type: sysplanar_rspc 
Location: 
Description 
UNDETERMINED ERROR 
Failure Causes 
UNDETERMINED 
Recommended Actions 
RUN SYSTEM DIAGNOSTICS. 
Detail Data 
PROBLEM DATA
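On AIX the report above comes from the system error log (errpt). As a rough sketch, permanent hardware errors can be counted mechanically; the canned sample below reproduces the key fields of the entry above:

```shell
# Sketch: count PERM (permanent) hardware errors in an errpt-style
# report. The sample mirrors the entry above; on AIX the raw report
# would come from `errpt -a`.
ERRPT_OUT='LABEL: SCAN_ERROR_CHRP
Type: PERM
Resource Name: sysplanar0
Description
UNDETERMINED ERROR'

PERM_HITS=$(printf '%s\n' "$ERRPT_OUT" | grep -c '^Type: PERM')
echo "permanent hardware errors: $PERM_HITS"
```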
4. Summary:
The okt2b instance of the provisioning system crashed at 04:30 on December 23, 2014 because the cage housing the RAC heartbeat NIC failed and the primary NIC did not fail over to the standby in time (the primary NIC interface was still UP). The RAC heartbeat check timed out, a split brain occurred, and the LMON process terminated the instance.


In this Document

Purpose
Troubleshooting Steps
  Background
  What is an instance eviction?
  Why do instances get evicted?
  How can I tell that I have had an instance eviction?
  What is the most common cause of instance evictions?
  Key files for troubleshooting instance evictions
  Steps to diagnose instance evictions
  Step 1. Look in the alert logs from all instances for eviction message.
  Step 2. For ora-29740, check the lmon traces for eviction reason.
  1. Find the reason for the reconfiguration.
  2. Understand the reconfiguration reason
  Step 3. Review alert logs for additional information.
  1. "IPC Send Timeout"
  2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"
  3. " detected no messaging activity from instance "
  4. None of the above
  Step 4. Checks to carry out based on the findings of steps 1, 2, 3.
  4(a) - Network checks.
  4(b) - Check for OS hangs or severe resource contention at the OS level.
  4(c) - Check for database or process hang.
  Known Issues
References

Applies to:

Oracle Database - Enterprise Edition - Version 9.2.0.1 and later
Information in this document applies to any platform.

Purpose

Purpose: Understanding and Troubleshooting Instance Evictions.

Symptoms of an instance eviction: Instance terminates with ORA-29740, Instance Abort, Instance Kill

Troubleshooting Steps

Background

What is an instance eviction?

A RAC database has several instances.
In an instance eviction, one or more instances are suddenly aborted ("evicted").
The decision to evict these instance(s) is by mutual consensus of all the instances.

Why do instances get evicted?

Prevent problems from occurring that would affect the entire clustered database.
Evict an unresponsive instance instead of allowing cluster-wide hang to occur
Evict an instance which can't communicate with other instances to avoid a "split brain" situation - in other words, to preserve cluster consistency.

How can I tell that I have had an instance eviction?

The instance will be shut down abruptly.

In most cases, the alert log will contain this message:

ORA-29740: evicted by instance number , group incarnation

In a few cases, the ORA-29740 message will not be present, and this message will show instead:

"Received an instance abort message from instance 1"

What is the most common cause of instance evictions?

The most common reason is a communications failure.
The Oracle background processes communicate with each other across the private interconnect.
If the other instances cannot communicate with one instance, that instance is evicted.
This is known as a Communications reconfiguration.
Chief causes of the communications failure: Network issues, OS Load issues

Key files for troubleshooting instance evictions

1. Alert log from each instance.
2. LMON trace file from each instance.

Steps to diagnose instance evictions

1. Look in the alert logs from all instances for eviction message.
2. For ora-29740, check the lmon traces for eviction reason.
3. Review alert logs for additional information.
4. Checks to carry out based on the findings of steps 1, 2, 3.

Step 1. Look in the alert logs from all instances for eviction message.

Look for the following messages in the alert log:

a) Look for ora-29740 in the alert log of the instance that got restarted.

Example:

ORA-29740: evicted by instance number 2, group incarnation 24

 

If ora-29740 is found, this means that LMON of the evicted instance terminated it.

 

b) If no ora-29740, look for the following messages:

In the evicted instance:

"Received an instance abort message from instance 1"

In one of the surviving instances:

"Remote instance kill is issued" in the killing instance

This means that the instance was evicted by another instance, but its LMON did not terminate it with ora-29740.

This is usually an indication that LMON on the evicted instance was busy or not progressing.  If you see this symptom (Received an instance abort / Remote instance kill is issued), carry out the checks in Step 4(b) and 4(c).

 

Step 2. For ora-29740, check the lmon traces for eviction reason.

1. Find the reason for the reconfiguration.

Check the lmon traces for all instances for a line with "kjxgrrcfgchk: Initiating reconfig".
This will give a reason code such as "kjxgrrcfgchk: Initiating reconfig, reason 3".

Note: make sure that the timestamp of this line is shortly before the time of the ORA-29740.
There is a reconfiguration every time an instance joins or leaves the cluster; reason 1 or 2.
So make sure that you have found the right reconfiguration in the LMON trace.

2. Understand the reconfiguration reason

The reconfiguration reasons are:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

For ora-29740, by far the most common reconfiguration reason is Reason 3 = Communications Failure
This means that the background processes of one or more instances have registered a communication problem with each other.

Note: All the instances of a RAC database need to be in constant communication with each other over the interconnect in order to preserve database integrity.

In 11.2, the alert log may also print the following for a Reason 3 reconfiguration:

Communications reconfiguration: instance_number

 

If you see this symptom (Reason 3 or Communications reconfiguration), carry out the checks in Step 4(a) - Network checks.

 

If you find a different reconfiguration reason, double check to make sure that you have got the right reconfiguration, ie. the last "kjxgrrcfgchk" message before the ora-29740 occurred. See Document 219361.1 for more information on the other reconfiguration reasons.

Step 3. Review alert logs for additional information.

Look for any of the following messages in the alert log of any instance, shortly before the eviction:

1. "IPC Send Timeout"

Example: Instance 1's alert log shows:
IPC Send timeout detected. Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

This means that Instance 1's process with OS pid 1519 was trying to send a message to Instance 8's process with OS pid 23309. Instance 1 ospid 1519 timed out while waiting for acknowledgement from Instance 8 ospid 23309.

To find out which background process corresponds to each ospid, look BACKWARDS in the corresponding alert log to the PRIOR startup. The ospid's of all background processes are listed at instance startup.

Example:
Thu Apr 25 16:35:41 2013
LMON started with pid=11, OS id=15510
Thu Apr 25 16:35:41 2013
LMD0 started with pid=12, OS id=15512
Thu Apr 25 16:35:41 2013
LMS0 started with pid=13, OS id=15514 at elevated priority

Broadly speaking, there are 2 kinds of reason to see IPC send timeout messages in the alert log:
(1) Network problem with communication over the interconnect, so the IPC message does not get through.
(2) The sender or receiver process is not progressing. This could be caused by OS load or scheduling problem, or by database/process hang or blocked at DB wait level.

If you see this symptom, carry out all of the checks in Section 4.
* At the OS level and/or hanganalyze level, focus particularly on the PIDs printed in the IPC send timeout.
* Also, check the trace files for the processes whose PIDs are printed in the IPC send timeout.
* In 11.1 and above, also check the LMHB trace with a focus on these processes.
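The ospid-to-process mapping described above can be automated. A minimal sketch using the startup lines from the example (the alert log here is a canned sample):

```shell
# Sketch: map an ospid from an "IPC Send timeout" message back to
# its background process via the startup lines in the alert log.
ALERT_LOG=$(mktemp)
cat > "$ALERT_LOG" <<'EOF'
LMON started with pid=11, OS id=15510
LMD0 started with pid=12, OS id=15512
LMS0 started with pid=13, OS id=15514 at elevated priority
EOF

OSPID=15512
# The process name is the first word of the matching startup line.
# (For a real log, anchor the match more tightly if ospids overlap.)
PROC=$(awk -v pat="OS id=$OSPID" '$0 ~ pat { print $1 }' "$ALERT_LOG")
echo "ospid $OSPID is $PROC"
rm -f "$ALERT_LOG"
```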

 

2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"

These messages are sometimes seen in a communications reconfiguration.

Example 1:

Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution

Example 2:

Thu Mar 07 17:08:03 2013
Detected an inconsistent instance membership by instance 2

Either of these messages indicates a split-brain situation. This indicates a sustained and severe problem with communication between instances over the interconnect.

See the following note to understand split-brain further:
Document 1425586.1 - What is Split Brain in Oracle Clusterware and Real Application Cluster

If you see this symptom, carry out the checks in step 4(a) - Network.

 

3. " detected no messaging activity from instance "

Example:

LMS0 (ospid: 2431) has detected no messaging activity from instance 1
LMS0 (ospid: 2431) issues an IMR to resolve the situation

 This means that the background process (LMS0 in the above example) has not received any messages from the other instance for a sustained period of time. It is a strong indication that either there are network problems on the interconnect, or the other instance is hung.

If you see this symptom, carry out the checks in step 4(a) first, if no issues found, check 4(b) and 4(c).

 

4. None of the above

If none of the above messages are seen in the alert log, but you have seen ora-29740 in the alert log, then carry out all the checks in section 4, starting with 4(a) - Network checks.

Step 4. Checks to carry out based on the findings of steps 1, 2, 3.

Note: In the following, OSW refers to OS Watcher (Document 301137.1), and CHM refers to Cluster Health Monitor (Document 1328466.1).

If you are experiencing repeated instance evictions, you will need to be able to retrospectively examine the OS statistics from the time of the eviction. If CHM is available on your platform and version, you can use CHM; make sure to review the results before they expire out of the archive. Otherwise, Oracle Support recommends that you install and run OS Watcher to facilitate diagnosis.

 

4(a) - Network checks.

* Check network and make sure there is no network error such as UDP error or IP packet loss or failure errors.

* Check network configuration to make sure that all network configurations are set up correctly on all nodes.
   For example, MTU size must be same on all nodes and the switch can support MTU size of 9000 if jumbo frame is used.
   
* Check archived "netstat" results in OSW or CHM. By default, the database communicates over the interconnect using UDP. Look for any increase in IP or UDP errors, drops, fragments not reassembled, etc.

* If OSW is in use, check archived "oswprvtnet" for any interruption in the traceroutes over private interconnect. See Document 301137.1 for more information.
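The netstat check above can be sketched as follows. `netstat -s` output varies by platform, so the counter names (Linux-style) and the canned sample below are illustrative only:

```shell
# Sketch: scan `netstat -s` output for the UDP/IP error counters the
# note calls out. NETSTAT_OUT is a canned Linux-style sample; in
# practice use: NETSTAT_OUT=$(netstat -s)
NETSTAT_OUT='Udp:
    1234567 packets received
    42 packet receive errors
Ip:
    3 fragments dropped after timeout'

SUSPECT=$(printf '%s\n' "$NETSTAT_OUT" | grep -c -i -E 'receive errors|fragments dropped|reassembl')
echo "suspect counters: $SUSPECT"
```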

4(b) - Check for OS hangs or severe resource contention at the OS level.

* Check archived vmstat and top results in OSW or CHM to see if the server had a CPU or memory load problem, network problem, or spinning lmd or lms processes.
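One rough way to spot the load problem described above in archived OSW vmstat output. The column positions assume the common Linux vmstat layout (column 1 = run queue, column 15 = idle CPU %); the sample line is fabricated:

```shell
# Sketch: flag CPU starvation in an archived vmstat line.
VMSTAT_LINE='25  0      0 102400  2048  51200    0    0     0     0  500  900 95  4  1  0  0'

RUNQ=$(printf '%s\n' "$VMSTAT_LINE" | awk '{print $1}')
IDLE=$(printf '%s\n' "$VMSTAT_LINE" | awk '{print $15}')
echo "runq=$RUNQ idle=$IDLE"

# A long run queue with almost no idle CPU around the eviction time
# points at OS-level resource contention (check 4b).
if [ "$IDLE" -lt 5 ]; then
  echo "possible CPU starvation around eviction time"
fi
```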

4(c) - Check for database or process hang.

* Check in the alert log to see if any hanganalyze dump was taken prior to the ora-29740, as instance or process hangs can trigger automatic hanganalyze dump. If hanganalyze dump was taken, see Document 390374.1 for more information on interpreting the dump.

* Check in the alert log or with dba to see if a systemstate dump was taken prior to the ora-29740. If so, Oracle Support can assist in analysing the systemstate dump.

* Check archived OS statistics in OSW or CHM to see if any LM* background process was spinning.
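The dump check in 4(c) is a grep away. A minimal sketch with a canned alert-log sample (the trace path is hypothetical):

```shell
# Sketch: check whether a hanganalyze or systemstate dump was taken
# shortly before the eviction; both leave pointers in the alert log.
ALERT_LOG=$(mktemp)
cat > "$ALERT_LOG" <<'EOF'
System State dumped to trace file /oracle/app/admin/orakt2/bdump/okt2b_diag_1234.trc
ORA-29740: evicted by member 0, group incarnation 24
EOF

DUMPS=$(grep -c -i -E 'hanganalyze|System State dumped' "$ALERT_LOG")
echo "pre-eviction dumps found: $DUMPS"
rm -f "$ALERT_LOG"
```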

 

Known Issues



From the ITPUB blog. Link: http://blog.itpub.net/29446986/viewspace-1377354/. Please cite the source when reposting; otherwise legal liability may be pursued.
