在RAC環境的其中一個節點上的alert檔案中，出現了這個資訊：Waiting for instances to leave。

這一篇描述問題的發現過程。

首先描述一下問題的發現過程。程式設計師報告RAC環境影響比較緩慢，而這時同事正好在這個RAC環境的一個節點上檢查RMAN的日誌，發現RMAN連線資料庫很慢，半天都沒有響應。而這時，另一個節點幾乎無法訪問了，一個普通的ssh登陸都需要1分鐘以上，更不要說執行命令了。

檢查當前還可以訪問的節點的alert檔案，發現了上面的Waiting for instances to leave資訊：

Mon Jul 27 15:52:14 2009
Waiting for instances to leave:
1

由於這個問題產生的原因比較複雜，為了將問題描述清楚，還需要重頭說起。

首先是硬體維護任意發現RAC伺服器的其中一個節點DOWN機，這個伺服器無法透過遠端登陸。

在資料庫中檢查，節點1關閉，節點2正常執行。從節點2上的alert檔案中，可以看到，節點2對資料庫進行了一些恢復的操作，併成功接管了全部資料庫的管理操作：

Fri Jul 24 16:25:09 2009
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Jul 24 16:25:09 2009
LMS 0: 6 GCS shadows cancelled, 0 closed
Fri Jul 24 16:25:09 2009
LMS 1: 14 GCS shadows cancelled, 1 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Jul 24 16:25:11 2009
Instance recovery: looking for dead threads
Fri Jul 24 16:25:11 2009
Beginning instance recovery of 1 threads
Fri Jul 24 16:25:12 2009
LMS 0: 378032 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
LMS 1: 381728 GCS shadows traversed, 0 replayed
Fri Jul 24 16:25:12 2009
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Jul 24 16:25:14 2009
parallel recovery started with 7 processes
Fri Jul 24 16:25:15 2009
Started redo scan
Fri Jul 24 16:25:15 2009
Completed redo scan
3990 redo blocks read, 986 data blocks need recovery
Fri Jul 24 16:25:15 2009
Started redo application at
Thread 1: logseq 1678, block 572689
Fri Jul 24 16:25:15 2009
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1678 Reading mem 0
Mem# 0: /dev/vx/rdsk/datavg/newtrade_redo1_1_1_1g
Mem# 1: /dev/vx/rdsk/datavg/newtrade_redo1_1_2_1g
Fri Jul 24 16:25:15 2009
Completed redo application
Fri Jul 24 16:25:16 2009
Completed instance recovery at
Thread 1: logseq 1678, block 576679, scn 1112404710247
862 data blocks read, 1014 data blocks written, 3990 redo blocks read
Switch log for thread 1 to sequence 1679

隨後，硬體維護人員在現場啟動了節點1的伺服器，在啟動的時候似乎報了一個和記憶體有關的錯誤（我不在現場，聽說的），隨後檢查沒有發現系統的錯誤資訊。

檢查Oracle的alert資訊，沒有發現任何的錯誤。懷疑節點突然崩潰和硬體異常有關。

檢查系統的dmest資訊，也沒有找到系統異常DOWN機的原因。

由於沒有找到錯誤，在節點1啟動後，將節點1上資料庫的例項1也啟動了。

但是沒過多長時間，就碰到了前面描述的錯誤。節點2的alert檔案中詳細的錯誤資訊如下：

Mon Jul 27 15:20:42 2009
MMNL absent for 1202 secs; Foregrounds taking over
Mon Jul 27 15:37:40 2009
MMNL absent for 1599 secs; Foregrounds taking over
Mon Jul 27 15:48:19 2009
IPC Send timeout detected.Sender: ospid 2284
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:24 2009
IPC Send timeout detected.Sender: ospid 5165
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:31 2009
IPC Send timeout detected.Sender: ospid 5179
Receiver: inst 1 binc 27438 ospid 3739
.
.
.
Mon Jul 27 15:49:58 2009
IPC Send timeout detected.Sender: ospid 23757
Receiver: inst 1 binc 27438 ospid 3739
Mon Jul 27 15:49:59 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 46
Mon Jul 27 15:49:59 2009
Communications reconfiguration: instance_number 1
Mon Jul 27 15:50:15 2009
IPC Send timeout to 0.2 inc 12 for msg type 32 from opid 43
Mon Jul 27 15:50:32 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:51:11 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:51:49 2009
Evicting instance 1 from cluster
Mon Jul 27 15:52:14 2009
Waiting for instances to leave:
1
Mon Jul 27 15:52:34 2009
Waiting for instances to leave:
1
Mon Jul 27 15:52:54 2009
Waiting for instances to leave:
1
.
.
.
Mon Jul 27 16:07:35 2009
Waiting for instances to leave:
1
Mon Jul 27 16:07:55 2009
Waiting for instances to leave:
1
Mon Jul 27 16:08:15 2009
Waiting for instances to leave:
1

開始是IPC Send timeout detected錯誤，隨後是Waiting for instances to leave資訊。根據錯誤資訊，以及節點1的響應時間判斷，問題多半就出在節點1上：多半是由於節點1太忙，導致響應時間太長，而節點2要與節點1直接進行互動，在互動時出現了超時錯誤。

為了儘快的解決這個問題，在節點2上執行svrctl嘗試關閉節點1上的例項：

bash-3.00$ srvctl stop inst -d newtrade -i newtrade1

等待了一段時間，節點1上的例項1終於被關閉。而這時節點1上的響應終於恢復正常了。檢查節點1上的alert檔案，發現了大量的錯誤資訊：

Mon Jul 27 14:42:45 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:43:21 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:43:21 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 14:51:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:17 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:35 2009
Process PZ99 died, see its trace file
Mon Jul 27 14:52:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:52:35 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:53:29 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:54:18 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:54:42 2009
ksvcreate: Process(q002) creation failed
Mon Jul 27 14:55:27 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 14:55:44 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:02:27 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:02:48 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:05:00 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:05:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:05:53 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:05 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:11 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:16 2009
Process PZ98 died, see its trace file
Mon Jul 27 15:08:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:08:41 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:10:36 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:10:55 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:11:50 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:12:26 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:14:05 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:14:59 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:14:59 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:

Mon Jul 27 15:15:33 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:16:12 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:24:04 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:24:44 2009
GES: Potential blocker (pid=3815) on resource CI-0000000A-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:25:10 2009
GES: Potential blocker (pid=3815) on resource CI-00000001-00000002;
enqueue info in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc and DIAG trace file
Mon Jul 27 15:25:13 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:07 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:13 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:27:40 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:27:55 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:28:16 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:28:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:33:07 2009
Process startup failed, error stack:
Mon Jul 27 15:33:07 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
Mon Jul 27 15:33:08 2009
Process q001 died, see its trace file
Mon Jul 27 15:33:46 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:37:54 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:39:41 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:39:50 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:43:30 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:45:30 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 15:45:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:45:40 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:46:28 2009
IPC Send timeout detected. Receiver ospid 3739
Mon Jul 27 15:46:48 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
Mon Jul 27 15:47:32 2009
IPC Send timeout detected. Receiver ospid 3739
Mon Jul 27 15:47:39 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
Mon Jul 27 15:48:16 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:49:27 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:50:00 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:50:34 2009
Trace dumping is performing id=[cdmp_20090727154959]
Mon Jul 27 15:52:29 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:52:30 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 15:57:04 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:57:32 2009
kkjcre1p: unable to spawn jobq slave process
Mon Jul 27 15:57:32 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:

Mon Jul 27 15:57:45 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:58:33 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 15:59:08 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:00:00 2009
ksvcreate: Process(q001) creation failed
Mon Jul 27 16:01:23 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:01:23 2009
WARNING: inbound connection timed out (ORA-3136)
.
.
.
Mon Jul 27 16:05:55 2009
WARNING: inbound connection timed out (ORA-3136)
Mon Jul 27 16:06:22 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:24 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc1_4267.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:26 2009
Trace dumping is performing id=[cdmp_20090727160625]
Mon Jul 27 16:06:30 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
.
.
.
Mon Jul 27 16:06:30 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_arc1_4267.trc:
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
Mon Jul 27 16:06:54 2009
USER: terminating instance due to error 481
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms1_3739.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lms0_3714.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_ckpt_3815.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16645.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16926.trc:
ORA-00481: LMON 程式因錯誤而終止
Mon Jul 27 16:06:54 2009
System state dump is made for local instance
Mon Jul 27 16:06:54 2009
Errors in file /data/oracle/admin/newtrade/udump/newtrade1_ora_16834.trc:
ORA-00481: LMON 程式因錯誤而終止
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_dbw0_3779.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lmd0_3704.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_pmon_3633.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_smon_3842.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lck0_4026.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lgwr_3805.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:55 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_lmon_3682.trc:
ORA-29740: evicted by member 1, group incarnation 14
Mon Jul 27 16:06:56 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_smon_3842.trc:
ORA-00481: LMON process terminated with error
Mon Jul 27 16:06:56 2009
Errors in file /data/oracle/admin/newtrade/bdump/newtrade1_mman_3757.trc:
ORA-00481: LMON process terminated with error
System State dumped to trace file /data/oracle/admin/newtrade/bdump/newtrade1_diag_3644.trc
Mon Jul 27 16:07:05 2009
Dump system state for local instance only
Mon Jul 27 16:07:05 2009
Trace dumping is performing id=[cdmp_20090727160705]
Mon Jul 27 16:07:08 2009
Shutting down instance (abort)
License high water mark = 104
Mon Jul 27 16:07:09 2009
Instance terminated by USER, pid = 26184
Mon Jul 27 16:07:13 2009
Instance terminated by USER, pid = 17258

不但有常規的錯誤，還有ORA-600錯誤，看來要解決的問題還真是不少。

RAC環境出現Waiting for instances to leave資訊（一）

相關文章