HP platform: diagnosing the eviction of instance 2 in a 9i RAC
The other morning, completely out of the blue, my phone received an alert SMS: a business system could not connect to its database. Yikes. "Cannot connect to the database" is about as serious as an alert gets, and this was the core system of our business halls, directly affecting customers. I logged in immediately and found that the instance on node 2 of the two-node RAC had gone down. Fortunately, with two nodes, performance might suffer a bit but the business halls could still serve customers (this is where the benefits of RAC really shine).
Checking the processes:
■■db2:[/oracle]$ps -ef | grep ora_
oracle 914114 3548012 0 19:00:23 pts/3 0:00 grep ora_
■■db2:[/oracle]$ps -ef | grep lsnr
oracle 22257 22179 0 11:06:48 pts/1 0:00 grep lsnr
oracle 15157 1 0 07:09:39 ? 0:00 /orarac/app/oracle/product/9.2.0/bin/tnslsnr LISTENER -inherit
No instance background processes were left on the system, but the listener process was still running, listening as LISTENER.
No time to waste; let's bring the database back up:
SQL> startup
ORACLE instance started.
Total System Global Area 3.2212E+10 bytes
Fixed Size 782472 bytes
Variable Size 4815060992 bytes
Database Buffers 2.7380E+10 bytes
Redo Buffers 16003072 bytes
Database mounted.
Database opened.
Fortunately, the database opened normally.
With the symptom resolved, the next step was the deeper analysis: what caused the instance to go down?
Node 1's alert log:
Tue Oct 18 10:56:34 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
By node 1's system clock:
10:56:34 cluster split-brain detected (clusterware split-brain)
11:03:00 timed out trying to start process P001
11:06:34 instance 2 on node 2 was evicted from the cluster
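When working through an incident like this, it helps to pull all the eviction-related events out of the alert log in one pass. A minimal sketch, using a heredoc in place of the real `bdump/alert_<SID>.log` (the sample lines are the ones quoted from node 1's log above):

```shell
#!/bin/sh
# Sketch: extract eviction-related events from an alert log in one grep.
# The heredoc stands in for the real alert log file.
grep -E 'split-brain|Evicting instance|IPC Send timeout|ORA-29740' <<'EOF'
Tue Oct 18 10:56:34 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
EOF
```

Only the split-brain wait and the eviction line match here; on node 2's log the same pattern would also catch the IPC send timeout and the ORA-29740 error.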
Node 2's alert log:
Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving log 6 thread 2
sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
…………………..
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/■■db/bdump/■■db2_lmon_15081.trc:
ORA-29740: evicted by member 1, group incarnation 19
Tue Oct 18 11:03:45 2011
System state dump is made for local instance
By node 2's system clock:
10:51:57 IPC send timeout detected
10:51:59 Communications reconfiguration: instance 0 (instance 0 indicates the instance is down)
10:53:45 cluster split-brain detected (clusterware split-brain)
11:02:14 timed out trying to start process P001
11:03:45 error 481 raised and the instance went down, with an ORA-29740 error; this is also the moment my phone received the alert SMS
According to eygle, IPC timeouts are most likely under resource pressure, network congestion, and similar conditions; the cluster then evicts the problem node, triggering a new round of reconfiguration.
Now let's look at the trace file that was generated, ■■db2_lmon_15081.trc:
*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10 Senders 10
GES IPC: Buffers Receive 1000 Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size Regular 440 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
*** 2011-10-18 07:09:01.649
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
Name Service normal
Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec
List of nodes:
0 1
Global Resource Directory frozen
node 0
release 9 2 0 7
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
Total master weight = 6244032
Dead inst
Join inst 0 1
Exist inst
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2011-10-18 07:09:11.515
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 1518892 GCS resources
10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information
Member 1, thread 2, state 4, flags 0x00a1
RR seq 19, propstate 3, pending propstate 0
rcfg rsn 3, rcfg time 21546700, mem ct 2
master 1, master rcfg time 21546700
Member information:
Member 0, incarn 18, version 1840361946
thrd 1, prev thrd 65535, status 0x0007, err 0x0002
valct 0
Member 1, incarn 18, version 816352
thrd 2, prev thrd 65535, status 0x0107, err 0x0000
valct 2
Group name: ■■DB
Member id: 1
Cached SKGXN event: 0
Group State:
State: 18 1
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0
master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0
agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19
The trace file yields the following information:
*** 2011-10-18 10:51:59.470
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
By node 2's system clock:
10:51:59 Initiating reconfig, reason 3 / Reconfiguration started, reason 3; the reason code 3 here is the key piece of information
*** 2011-10-18 10:53:45.311
jxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
By node 2's system clock:
10:53:45 Waiting for split-brain resolution: the cluster is waiting for the split-brain to be resolved
10:03:45 Evicted by 1: the instance on node 2 is evicted
Looking up MetaLink note ID 219316.1:
For troubleshooting ORA-29740 errors, the 'reason' will be very important. In the above example, the first section indicates the reason for the initiated reconfiguration. The reasons are as follows:
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
Now consider the possible causes of reason 3:
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.
In this case the reason code is 3, so it should be a Communications Failure. The most likely cause, I think, is a heartbeat problem (at 2 a.m. that day, while hardware was being replaced on the host, a network issue had already caused a heartbeat anomaly). The other causes, such as resource pressure or network congestion, can basically be ruled out: the hardware environment here is fairly high-end, and routine monitoring shows node 2's load was acceptable.
The timestamps of the key events above are a bit scattered, so here is a simple consolidated timeline of the important events recorded in the logs:
| Time | Node 1 | Node 2 |
| --- | --- | --- |
| 10:51:57 | | IPC Send timeout detected |
| 10:51:59 | | Communications reconfiguration: instance 0 |
| 10:53:45 | | Waiting for clusterware split-brain resolution |
| 10:56:34 | Waiting for clusterware split-brain resolution | |
| 11:02:14 | | Timed out trying to start process P001. |
| 11:03:00 | Timed out trying to start process P001. | |
| 11:03:45 | | USER: terminating instance due to error 481 |
| 11:06:34 | Evicting instance 2 from cluster | |
For reasons unknown, after the host was rebooted at 2 a.m. that day the NTP service on node 2 was abnormal, and it stayed that way until the instance on node 2 failed. While handling the fault I found that node 2 was running about 3 minutes behind node 1 (more precisely, about 3 minutes 11 seconds). At the time I adjusted node 2's system clock step by step, 30 seconds at a time, until it roughly matched node 1's; after the host engineer later fixed NTP, the NTP service on that host returned to normal and both nodes' system clocks were again synchronized with the time server.
So the timeline above is not quite accurate: if 3 minutes 14 seconds is added to node 2's clock, the timestamps of several key events in the logs line up between the two nodes. As follows:
| Time | Node 1 | Node 2 |
| --- | --- | --- |
| 10:51:57 | | IPC Send timeout detected |
| 10:51:59 | | Communications reconfiguration: instance 0 |
| 10:53:45 | | Waiting for clusterware split-brain resolution |
| 10:56:34 | Waiting for clusterware split-brain resolution | |
| 11:03:45 | | USER: terminating instance due to error 481 |
| 11:06:34 | Evicting instance 2 from cluster | |
In the table above, the split-brain actually occurred on both nodes at the same moment, precisely at 10:56:34; likewise, the eviction of instance 2 recorded in node 1's alert log and the ORA-29740 error reported on node 2 actually happened simultaneously.
This makes the sequence on node 2 clear:
10:51:57 (really 10:55:08 once the 3-minute-11-second skew is added) IPC send timeout occurs
10:56:34 cluster split-brain occurs
11:06:34 instance 2 on node 2 is evicted
As for the cause of the heartbeat timeout, my best explanation is that the 3-minute-11-second difference between the two nodes' clocks (node 2 lagging node 1 by 3m11s) disrupted the heartbeat, which led to the split-brain and ultimately to the eviction of instance 2.
If anyone has run into the same problem, you are welcome to get in touch and compare notes. Thanks!
From the ITPUB blog, link: http://blog.itpub.net/25834554/viewspace-709574/. Please credit the source when republishing.