rac當機分析
首先,在47分的時候例項4發起驅逐例項2的事件,例項2首先crash。
這是例項4的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
這是例項4的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
這是例項3的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
這是例項1的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 2
* domain 0 not valid according to instance 3
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
這是例項1的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
0 2 3
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 2
* domain 0 not valid according to instance 3
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
這是例項2的log
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012LMD0: terminating instance due to error 481
Instance terminated by LMD0, pid = 4325382
Wed Jan 11 09:51:33 GMT+08:00 2012Starting ORACLE instance (normal)
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012LMD0: terminating instance due to error 481
Instance terminated by LMD0, pid = 4325382
Wed Jan 11 09:51:33 GMT+08:00 2012Starting ORACLE instance (normal)
然後例項2在51分開始啟動,52分完成。
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 0
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
例項1在51分的時候因為自己的asm異常而導致crash,具體log如下
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
在例項4上可以看到當時2沒有完全起來,而1已經crash的狀態
Wed Jan 11 09:48:46 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
2 3
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
從上面還可以看出1的crash在crs中沒有資訊,說明它的crash和crs無關,而2的crash是crs發起的,日誌是不一樣的。
Wed Jan 11 09:48:46 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
2 3
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
從上面還可以看出1的crash在crs中沒有資訊,說明它的crash和crs無關,而2的crash是crs發起的,日誌是不一樣的。
例項3的日誌
Wed Jan 11 09:49:12 GMT+08:00 2012Completed redo scan
196588 redo blocks read, 29797 data blocks need recovery
Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
2 3
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
* domain 0 not valid according to instance 3
Wed Jan 11 09:51:27 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
例項2的日誌沒有同樣的資訊是因為它當時正在open中。
下面是例項4在2和1先後好了以後記錄的log
Wed Jan 11 09:51:29 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 0 according to instance 2
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Jan 11 09:51:45 GMT+08:00 2012 Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
下面是例項3在2和1先後好了以後記錄的log
Wed Jan 11 09:51:32 GMT+08:00 2012Completed redo scan
390680 redo blocks read, 58663 data blocks need recovery
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 0 according to instance 1
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Jan 11 09:52:02 GMT+08:00 2012Completed redo application
Wed Jan 11 09:52:11 GMT+08:00 2012Completed instance recovery at
Thread 1: logseq 10747, block 1136082, scn 12622806928332
Thread 2: logseq 8082, block 1925780, scn 12622806926550
58633 data blocks read, 58725 data blocks written, 390680 redo blocks read
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
0 1 2 3
Global Resource Directory frozen
Communication channels reestablished
* domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Jan 11 09:52:40 GMT+08:00 2012 LMS 3: 0 GCS shadows cancelled, 0 closed
例項2crash的原因
在例項2上,有一條日誌
*** 2012-01-11 09:45:21.918
kjxggpoll: change poll time to 50 ms
然後各個節點開始
例項1crash的原因
database上的alert裡面提示看asm的log
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
database的zhpt1_asmb_6947490.trc裡面提示
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name: AIX
Node name: jtzhptdb1
Release: 1
Version: 6
Machine: 00F6EEBB4C00
Instance name: zhpt1
Redo thread mounted by this instance: 1
Oracle process number: 43
Unix process pid: 6947490, image: (ASMB)
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name: AIX
Node name: jtzhptdb1
Release: 1
Version: 6
Machine: 00F6EEBB4C00
Instance name: zhpt1
Redo thread mounted by this instance: 1
Oracle process number: 43
Unix process pid: 6947490, image: (ASMB)
*** 2012-01-11 09:51:13.858
*** SERVICE NAME:(SYS$BACKGROUND) 2012-01-11 09:51:10.117
*** SESSION ID:(1598.1) 2012-01-11 09:51:10.117
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG(17957082)
*** SERVICE NAME:(SYS$BACKGROUND) 2012-01-11 09:51:10.117
*** SESSION ID:(1598.1) 2012-01-11 09:51:10.117
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG(17957082)
asm的log裡面提示
Wed Jan 11 09:50:55 GMT+08:00 2012LMD0 (ospid: 8192988) is not heartbeating for 216 seconds.
LMON detects unhealthy receivers.
Please check LMON and DIAG trace files for detail.
Wed Jan 11 09:51:00 GMT+08:00 2012LMON (ospid: 8520498) is terminating the instance.
LMON: terminating instance due to error 481
Termination issued to instance processes. Waiting for the processes to exit
Wed Jan 11 09:51:14 GMT+08:00 2012Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 8520498
Wed Jan 11 09:52:10 GMT+08:00 2012Starting ORACLE instance (normal)
LMON detects unhealthy receivers.
Please check LMON and DIAG trace files for detail.
Wed Jan 11 09:51:00 GMT+08:00 2012LMON (ospid: 8520498) is terminating the instance.
LMON: terminating instance due to error 481
Termination issued to instance processes. Waiting for the processes to exit
Wed Jan 11 09:51:14 GMT+08:00 2012Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 8520498
Wed Jan 11 09:52:10 GMT+08:00 2012Starting ORACLE instance (normal)
lmon的log
Repeat 3 times
*** 2012-01-11 09:50:55.139
kjfmrcvrchk: receiver LMD[0] has no heartbeat for 216 sec (1326246437.1326246653.0).
kjfmrcvrchk: Dumping callstack of lmd0
Submitting asynchronized dump request [20]
kjfmrcvrchk: receivers are not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(8258490)
*** 2012-01-11 09:51:14.825
Instance termination failed to kill one or more processes
ksuitm_check: OS PID=10616892 is still alive
*** 2012-01-11 09:51:14.825
Process diagnostic dump for (LCK0), OS id=10616892,
pid: 15, proc_ser: 1, sid: 432, sess_ser: 1
-------------------------------------------------------------------------------
loadavg : 2.12 2.51 3.42
swap info: free_mem = 6406.91M rsv = 192.00M
alloc = 15371.82M avail = 49152.00M swap_free = 33780.18M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
Short stack dump: ORA-00072: process "Unix process pid: 10616892, image: (LCK0)" is not active
Repeat 3 times
*** 2012-01-11 09:50:55.139
kjfmrcvrchk: receiver LMD[0] has no heartbeat for 216 sec (1326246437.1326246653.0).
kjfmrcvrchk: Dumping callstack of lmd0
Submitting asynchronized dump request [20]
kjfmrcvrchk: receivers are not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(8258490)
*** 2012-01-11 09:51:14.825
Instance termination failed to kill one or more processes
ksuitm_check: OS PID=10616892 is still alive
*** 2012-01-11 09:51:14.825
Process diagnostic dump for (LCK0), OS id=10616892,
pid: 15, proc_ser: 1, sid: 432, sess_ser: 1
-------------------------------------------------------------------------------
loadavg : 2.12 2.51 3.42
swap info: free_mem = 6406.91M rsv = 192.00M
alloc = 15371.82M avail = 49152.00M swap_free = 33780.18M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
Short stack dump: ORA-00072: process "Unix process pid: 10616892, image: (LCK0)" is not active
lmod的log
*** 2012-01-11 09:49:01.111
KJM_HISTORY: RCVR STALL OP(12) context 8 elapsed 99907602 us
KJM HIST LMD0:
12:8:99907602 7:0 6:1 10:28 11:1 15:0 12:78144 7:0 6:1 10:31
11:0 15:0 12:78149 7:1 6:1 10:22 2:33 1:0 11:1 15:0
12:78147 7:1 6:0 10:37 11:1 15:1 12:78148 7:2 6:0 10:41
11:1 15:1 12:78154 7:1 6:0 10:38 11:0 15:1 12:78148 7:1
6:0 10:34 11:0 15:1 12:78150 7:2 6:0 10:62 11:2 15:1
12:78161 7:0 6:0 10:30 11:0 15:0 12:78149 7:1 6:0 10:76
11:1 15:2 12:78164 7:0
----------------------------------------
SO: 70000001709cab0, type: 4, owner: 70000001754ca80, flag: INIT/-/-/0x00
(session) sid: 441 trans: 0, creator: 70000001754ca80, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
last wait for 'ges remote message' wait_time=1 min 17 sec, seconds since wait started=2616
waittime=40, loop=0, p3=0
blocking sess=0x0 seq=11868
Dumping Session Wait History
for 'ges remote message' count=1 wait_time=1 min 17 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078142 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078146 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078143 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078145 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078149 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078145 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078147 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078152 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078147 sec
waittime=40, loop=0, p3=0
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/13337857/viewspace-715519/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- ora-00494引起rac當機的分析處理
- JVM當機分析JVM
- DRM特性引起的RAC節點當機
- 一次ODA當機分析
- AIX RAC9I 節點當機測試AI
- 當機導致slave異常分析
- Oracle RAC/Clusterware 多種心跳heartbeat機制介紹 RAC超時機制分析Oracle
- RedHat7.2的RemoveIPC設定主yes引起rac當機RedhatREM
- 一次RAC節點當機的解決過程
- 電腦當機原因分析:軟體原因
- 技術培訓 | RAC 當機罪犯案情探析之子游標
- 電腦當機的原因分析:硬體故障
- oracle 11.2.0.4 rac節點異常當機之ORA-07445Oracle
- asm例項自動dismount導致rac一個節點當機ASM
- RAC一個節點記憶體故障當機,無法訪問記憶體
- MySQL Bug導致異常當機的分析流程MySql
- 資料庫突然當機的問題及分析資料庫
- 核心引數導致的備庫當機分析
- 技術培訓|RAC 當機罪犯案情探析之子游標預告
- ORA-04031錯誤導致當機案例分析
- 一次資料庫當機問題的分析資料庫
- 【RAC】RAC 效能分析 - 'log file sync' 等待事件事件
- 電腦當機原因分析:個別多媒體應用
- 電腦崩潰和當機原因分析總結大全(轉)
- dolphinscheduler 實現master當機故障轉移能力原始碼分析AST原始碼
- ORACLE 11.2.0.4 rac for linux 鏈路宕導致的單節點異常當機OracleLinux
- RAC全域性等待事件分析事件
- 暈死,待機模式死掉了(不是當機,比當機還麻煩-_-)模式
- Redis當機恢復Redis
- mongoDB當機修復MongoDB
- 人為操作不當引起電腦當機
- 【RAC】使用VMware虛擬機器搭建RAC環境虛擬機
- Windows 2000中藍色畫面當機之停止資訊分析(轉)Windows
- RAC系統當中,job在哪個節點執行?
- RAC DATABASE關機步驟Database
- 11G RAC 一節點當機後修改監聽相關配置使通過當機節點VIP連線資料庫的客戶端可以連線資料庫客戶端
- RAC 腦裂 處理機制 Oracle RAC Brain SplitOracleAI
- Redis當機 快速恢復Redis