9i RAC Eviction Case Analysis - ORA-29740
Reference document:
Troubleshooting Instance Evictions (Instance terminates with ORA-29740, Instance Abort, Instance Kill) (Doc ID 1549135.1)
Communications reconfiguration: instance 0 ======> This indicates that a communication problem between the instances caused a split-brain in the system
Waiting for clusterware split-brain resolution
Tue Dec 23 04:31:27 2014
Trace dumping is performing id=[cdmp_20141223043057]
Tue Dec 23 04:37:25 2014
Errors in file /oracle/app/admin/orakt2/udump/okt2b_ora_688330.trc:
ORA-01013: user requested cancel of current operation
Tue Dec 23 04:37:25 2014
Errors in file /oracle/app/admin/orakt2/udump/okt2b_ora_2666844.trc:
ORA-01013: user requested cancel of current operation
Tue Dec 23 04:41:02 2014
Errors in file /oracle/app/admin/orakt2/bdump/okt2b_lmon_1163676.trc:
ORA-29740: evicted by member 0, group incarnation 24 ==========> If ORA-29740 is found, this means that LMON of the evicted instance terminated it.
Tue Dec 23 04:41:02 2014
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 1163676
The reconfiguration reasons are:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
kjxgrgetresults: Detect reconfig from 0, seq 23, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3 ====> From this we can determine that the instance eviction was caused by a communications failure
*** 2014-12-23 04:31:02.762
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 23 0.
*** 2014-12-23 04:31:02.792
Name Service frozen
kjxgmcs: Setting state to 23 1.
*** 2014-12-23 04:31:02.816
kjxgrrecp2: Waiting for split-brain resolution, upd 0, seq 24
*** 2014-12-23 04:41:02.911
Voting results, upd 0, seq 24, bitmap: 0
*** 2014-12-23 04:41:02.911
kjxgrdtrt: Evicted by 0, seq (24, 24)
IMR state information
Member 1, thread 2, state 4, flags 0x00a1
RR seq 24, propstate 3, pending propstate 0
rcfg rsn 3, rcfg time 1941055106, mem ct 2
master 1, master rcfg time 1941055106
Member information:
Member 0, incarn 23, version -1136027477
thrd 1, prev thrd 65535, status 0x0047, err 0x0002
valct 0
Member 1, incarn 23, version -24384270
thrd 2, prev thrd 65535, status 0x0107, err 0x0000
valct 2
Group name: ORAKT2
Member id: 1
Cached SKGXN event: 0
Group State:
State: 23 1
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0 master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0 agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 0, group incarnation 24
ksuitm: waiting for [5] seconds before killing DIAG
Shortly before the eviction, the AIX error report (errpt) on node skt2b had recorded a permanent hardware error against sysplanar0:

IDENTIFIER: BFE4C025
Date/Time: Tue Dec 23 04:26:35 BEIST 2014
Sequence Number: 37687
Machine Id: 00CFA74C4C00
Node Id: skt2b
Class: H
Type: PERM
Resource Name: sysplanar0
Resource Class: planar
Resource Type: sysplanar_rspc
Location:
Description
UNDETERMINED ERROR
Failure Causes
UNDETERMINED
Recommended Actions
RUN SYSTEM DIAGNOSTICS.
Detail Data
PROBLEM DATA
The following is the body of the referenced note, Document 1549135.1:

In this Document
Purpose
Troubleshooting Steps
Background
What is an instance eviction?
Why do instances get evicted?
How can I tell that I have had an instance eviction?
What is the most common cause of instance evictions?
Key files for troubleshooting instance evictions
Steps to diagnose instance evictions
Step 1. Look in the alert logs from all instances for eviction message.
Step 2. For ora-29740, check the lmon traces for eviction reason.
1. Find the reason for the reconfiguration.
2. Understand the reconfiguration reason
Step 3. Review alert logs for additional information.
1. "IPC Send Timeout"
2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"
3. "has detected no messaging activity" / "issues an IMR to resolve the situation"
4. None of the above
Step 4. Checks to carry out based on the findings of steps 1, 2, 3.
4(a) - Network checks.
4(b) - Check for OS hangs or severe resource contention at the OS level.
4(c) - Check for database or process hang.
Known Issues
References
Applies to:
Oracle Database - Enterprise Edition - Version 9.2.0.1 and later
Information in this document applies to any platform.
Purpose
Purpose: Understanding and Troubleshooting Instance Evictions.
Symptoms of an instance eviction: Instance terminates with ORA-29740, Instance Abort, Instance Kill
Troubleshooting Steps
Background
What is an instance eviction?
A RAC database has several instances. In an instance eviction, one or more instances are suddenly aborted ("evicted"). The decision to evict these instances is made by mutual consensus of all the instances.
Why do instances get evicted?
Instances are evicted to prevent problems from occurring that would affect the entire clustered database, for example:
* To evict an unresponsive instance instead of allowing a cluster-wide hang to occur.
* To evict an instance which can't communicate with other instances, to avoid a "split brain" situation - in other words, to preserve cluster consistency.
How can I tell that I have had an instance eviction?
The instance will be shut down abruptly. In most cases, the alert log will contain an ORA-29740 message, as in the case above:

ORA-29740: evicted by member 0, group incarnation 24

In a few cases, the ORA-29740 message will not be present, and a different eviction message will show instead (see Step 1 below).
What is the most common cause of instance evictions?
The most common reason is a communications failure. The Oracle background processes communicate with each other across the private interconnect. If the other instances cannot communicate with one instance, that instance is evicted. This is known as a communications reconfiguration. The chief causes of the communications failure are network issues and OS load issues.
Key files for troubleshooting instance evictions
1. Alert log from each instance.
2. LMON trace file from each instance.
Steps to diagnose instance evictions
1. Look in the alert logs from all instances for the eviction message.
2. For ORA-29740, check the LMON traces for the eviction reason.
3. Review alert logs for additional information.
4. Checks to carry out based on the findings of steps 1, 2, 3.
Step 1. Look in the alert logs from all instances for eviction message.
Look for the following messages in the alert log:
a) Look for ORA-29740 in the alert log of the instance that got restarted.
Example (from the case above):

ORA-29740: evicted by member 0, group incarnation 24
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 1163676
b) If no ora-29740, look for the following messages:
In the evicted instance:
In one of the surviving instances:
This means that the instance was evicted by another instance, but its LMON did not terminate it with ora-29740.
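The checks in (a) and (b) can be scripted. Below is a minimal Python sketch (my own illustration, not part of the note) that scans a set of alert logs for the eviction messages seen in this case; the log paths are hypothetical and must be adapted to your environment.

# Scan all instances' alert logs for eviction messages (Step 1).
import re
from pathlib import Path

# Hypothetical alert log locations; substitute your own bdump paths.
ALERT_LOGS = [
    Path("/oracle/app/admin/orakt2/bdump/alert_okt2a.log"),
    Path("/oracle/app/admin/orakt2/bdump/alert_okt2b.log"),
]

# Messages described in Step 1 and seen in the case above.
PATTERNS = [
    re.compile(r"ORA-29740", re.IGNORECASE),
    re.compile(r"terminating instance due to error 29740"),
    re.compile(r"Waiting for clusterware split-brain resolution"),
]

for log in ALERT_LOGS:
    if not log.exists():
        continue                       # skip instances not on this node
    with log.open(errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if any(p.search(line) for p in PATTERNS):
                print(f"{log.name}:{lineno}: {line.rstrip()}")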
Step 2. For ora-29740, check the lmon traces for eviction reason.
1. Find the reason for the reconfiguration.
Check the LMON traces of all instances for a line with "kjxgrrcfgchk: Initiating reconfig". This will give a reason code, such as "kjxgrrcfgchk: Initiating reconfig, reason 3".
Note: make sure that the timestamp of this line is shortly before the time of the ORA-29740. There is a reconfiguration every time an instance joins or leaves the cluster (reason 1 or 2), so make sure that you have found the right reconfiguration in the LMON trace.
The reconfiguration reasons are:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
For ORA-29740, by far the most common reconfiguration reason is Reason 3 = Communications Failure. This means that the background processes of one or more instances have registered a communication problem with each other.
In 11.2, the alert log may also print the following for a Reason 3 reconfiguration:
If you find a different reconfiguration reason, double-check to make sure that you have the right reconfiguration, i.e. the last "kjxgrrcfgchk" message before the ORA-29740 occurred. See Document 219361.1 for more information on the other reconfiguration reasons.
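To locate the right reconfiguration quickly, the LMON trace can be scanned programmatically. The sketch below is my own illustration, not from the note: it prints every "Initiating reconfig" line together with the trace timestamp that precedes it, decoded against the reason table above. The trace path is the one from this case and is otherwise a placeholder.

# Extract reconfiguration reasons from an LMON trace (Step 2).
import re
from pathlib import Path

LMON_TRACE = Path("/oracle/app/admin/orakt2/bdump/okt2b_lmon_1163676.trc")

REASONS = {  # reason codes as listed above
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications Failure",
    4: "Reconfiguration after suspend",
}

ts_re = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
rcfg_re = re.compile(r"kjxgrrcfgchk: Initiating reconfig, reason (\d+)")

last_ts = "unknown"
for line in LMON_TRACE.open(errors="replace"):
    ts = ts_re.match(line)
    if ts:
        last_ts = ts.group(1)          # remember the latest "*** ..." timestamp
        continue
    m = rcfg_re.search(line)
    if m:
        reason = int(m.group(1))
        print(f"{last_ts}: reason {reason} ({REASONS.get(reason, 'unknown')})")

Compare each printed timestamp against the time of the ORA-29740 and keep only the last reconfiguration before it.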
Step 3. Review alert logs for additional information.
Look for any of the following messages in the alert log of any instance, shortly before the eviction:
1. "IPC Send Timeout"
Example: Instance 1's alert log shows:
IPC Send timeout detected.Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309
This means that Instance 1's process with OS pid 1519 was trying to send a message to Instance 8's process with OS pid 23309. Instance 1 ospid 1519 timed out while waiting for an acknowledgement from Instance 8 ospid 23309.
To find out which background process corresponds to each ospid, look BACKWARDS in the corresponding alert log to the PRIOR startup. The ospid's of all background processes are listed at instance startup.
Example:
Thu Apr 25 16:35:41 2013
LMON started with pid=11, OS id=15510
Thu Apr 25 16:35:41 2013
LMD0 started with pid=12, OS id=15512
Thu Apr 25 16:35:41 2013
LMS0 started with pid=13, OS id=15514 at elevated priority
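This backwards lookup can also be scripted: one pass over the alert log that remembers the most recent "started with pid=..., OS id=..." line per ospid yields a map from ospid to background process name. A minimal sketch (my own, with a hypothetical alert log path; the example ospids are taken from the startup lines above):

# Map ospids from an "IPC Send timeout" message to background process names.
import re
from pathlib import Path

ALERT_LOG = Path("/oracle/app/admin/orakt2/bdump/alert_okt2b.log")  # hypothetical

start_re = re.compile(r"^(\w+) started with pid=\d+, OS id=(\d+)")

ospid_to_name = {}
for line in ALERT_LOG.open(errors="replace"):
    m = start_re.match(line)
    if m:
        # Later startups overwrite earlier ones, so the map ends up
        # reflecting the PRIOR startup relative to the end of the log.
        ospid_to_name[m.group(2)] = m.group(1)

for ospid in ("15510", "15512", "15514"):   # ospids from the example above
    print(ospid, "->", ospid_to_name.get(ospid, "not found"))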
Broadly speaking, there are two kinds of reasons to see IPC send timeout messages in the alert log:
(1) A network problem with communication over the interconnect, so the IPC message does not get through.
(2) The sender or receiver process is not progressing. This could be caused by an OS load or scheduling problem, or by a database/process hang or a block at the DB wait level.
* At the OS level and/or hanganalyze level, focus particularly on the PIDs printed in the IPC send timeout.
* Also, check the trace files for the processes whose PIDs are printed in the IPC send timeout.
* In 11.1 and above, also check the LMHB trace with a focus on these processes.
2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"
These messages are sometimes seen in a communications reconfiguration.
Example 1:
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Example 2:
Thu Mar 07 17:08:03 2013
Detected an inconsistent instance membership by instance 2
Either of these messages indicates a split-brain situation. This indicates a sustained and severe problem with communication between instances over the interconnect.
See the following note to understand split-brain further:
Document 1425586.1 - What is Split Brain in Oracle Clusterware and Real Application Cluster
3. "
Example:
LMS0 (ospid: 2431) issues an IMR to resolve the situation
This means that the background process (LMS0 in the above example) has not received any messages from the other instance for a sustained period of time. It is a strong indication that either there are network problems on the interconnect, or the other instance is hung.
4. None of the above
If none of the above messages are seen in the alert log, but you have seen ora-29740 in the alert log, then carry out all the checks in section 4, starting with 4(a) - Network checks.
Step 4. Checks to carry out based on the findings of steps 1, 2, 3.
Note: In the following, OSW refers to OS Watcher (Document 301137.1), and CHM refers to Cluster Health Monitor (Document 1328466.1).
If you are experiencing repeated instance evictions, you will need to be able to retrospectively examine the OS statistics from the time of the eviction. If CHM is available on your platform and version, you can use CHM; make sure to review the results before they expire out of the archive. Otherwise, Oracle Support recommends that you install and run OS Watcher to facilitate diagnosis.
4(a) - Network checks.
* Check the network and make sure there are no network errors such as UDP errors, IP packet loss, or failure errors.
* Check the network configuration to make sure that all network configurations are set up correctly on all nodes. For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
* Check archived "netstat" results in OSW or CHM. By default,
the database communicates over the interconnect using UDP. Look for any increase
in IP or UDP errors, drops, fragments not reassembled, etc.
* If OSW is in use, check archived "oswprvtnet" for any interruption in the traceroutes over private interconnect. See Document 301137.1 for more information.
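One way to spot such an increase is to diff the counters between two archived "netstat -s" snapshots taken before and around the eviction. The sketch below is a rough illustration under my own assumptions: the snapshot file names are hypothetical, and the exact counter wording varies by platform, so the pattern will need adjusting.

# Diff error/drop counters between two "netstat -s" snapshots.
import re
from pathlib import Path

BEFORE = Path("osw_archive/netstat.before.dat")   # hypothetical snapshot files
AFTER = Path("osw_archive/netstat.after.dat")

# Matches lines like "  1234 packet receive errors".
counter_re = re.compile(r"^\s*(\d+)\s+(.*(?:error|drop|fragment).*)", re.IGNORECASE)

def read_counters(path):
    counters = {}
    for line in path.open(errors="replace"):
        m = counter_re.match(line)
        if m:
            counters[m.group(2).strip()] = int(m.group(1))
    return counters

before, after = read_counters(BEFORE), read_counters(AFTER)
for name in sorted(set(before) | set(after)):
    delta = after.get(name, 0) - before.get(name, 0)
    if delta > 0:
        print(f"{name}: +{delta}")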
4(b) - Check for OS hangs or severe resource contention at the OS level.
* Check archived vmstat and top results in OSW or CHM to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes. A first-pass scan of an archived vmstat file can be scripted, as in the sketch below.
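The following is only a sketch under stated assumptions: the file name is hypothetical, the thresholds are arbitrary starting points, and the column layout is learned from the vmstat header line, which differs between platforms.

# Flag vmstat intervals suggesting severe CPU starvation.
from pathlib import Path

VMSTAT_FILE = Path("osw_archive/vmstat.dat")   # hypothetical OSW archive file
RUNQ_LIMIT = 20    # flag intervals with a run queue longer than this
IDLE_LIMIT = 5     # or with idle CPU below this percentage

runq_idx = idle_idx = None
for line in VMSTAT_FILE.open(errors="replace"):
    fields = line.split()
    if "r" in fields and "id" in fields:
        # Learn column positions from the vmstat header line.
        runq_idx, idle_idx = fields.index("r"), fields.index("id")
        continue
    if runq_idx is None or len(fields) <= max(runq_idx, idle_idx):
        continue
    try:
        runq, idle = int(fields[runq_idx]), int(fields[idle_idx])
    except ValueError:
        continue                       # timestamp or other non-data line
    if runq > RUNQ_LIMIT or idle < IDLE_LIMIT:
        print(f"suspect interval: r={runq} id={idle}%: {line.rstrip()}")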
4(c) - Check for database or process hang.
* Check the alert log to see if any hanganalyze dump was taken prior to the ORA-29740, as instance or process hangs can trigger an automatic hanganalyze dump. If a hanganalyze dump was taken, see Document 390374.1 for more information on interpreting the dump.
* Check the alert log, or check with the DBA, to see if a systemstate dump was taken prior to the ORA-29740. If so, Oracle Support can assist in analysing the systemstate dump.
* Check archived OS statistics in OSW or CHM to see if any LM* background process was spinning.
Known Issues
Source: ITPUB blog, http://blog.itpub.net/29446986/viewspace-1377354/. Please credit the source when reprinting.