在RAC環境的其中一個節點上的alert檔案中，出現了這個資訊：Waiting for instances to leave。

這一篇描述問題的解決過程。

RAC環境出現Waiting for instances to leave資訊（一）：http://yangtingkun.itpub.net/post/468/493501

在節點1的alert檔案中發現了大量的錯誤資訊，不過雖然錯誤資訊很多，但是大部分都是重複的，主要的錯誤資訊只有幾種：

首先出現了ORA-3136錯誤，本來10.2.0.3上就存在這個bug，會出現ORA-3136錯誤，但是當前出現這個錯誤的數量，已經超過了這個錯誤幾年來累計出現了次數；

隨後是一些後臺輕量級的程式出錯並在這些程式的啟動時報錯：包括Q001程式、PZ98程式、JOB程式和cjq程式等；

PSP程式啟動時出現了ORA-27300錯誤；

LSM程式碰到了IPC Send timeout detected；

歸檔程式碰到了ORA-600（2103的錯誤）；

最後出現了ORA-00481錯誤，LMON程式中止了所有的後臺程式。

最開始出現的ORA-3136錯誤不難理解，這個錯誤本來就和連線超時有關，由於存在異常導致系統響應非常緩慢，因此出現大量的超時錯誤不足為奇。

下面再來看看cjq程式對應的錯誤資訊：

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
/data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 3879, image: oracle@newtrade1 (CJQ0)

*** 2009-07-27 15:12:03.901
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 15:12:03.517
*** SESSION ID:(540.1) 2009-07-27 15:12:03.412
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:12:14.013
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:12:24.155
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:12:35.340
Waited for process J000 to be spawned for 91 seconds
*** 2009-07-27 15:12:45.430
Waited for process J000 to be spawned for 101 seconds
*** 2009-07-27 15:12:55.540
Waited for process J000 to be spawned for 111 seconds
*** 2009-07-27 15:14:59.846
*** 2009-07-27 15:53:28.354
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:53:38.461
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:53:48.556
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:53:58.650
Waited for process J000 to be spawned for 90 seconds
*** 2009-07-27 15:54:08.829
Waited for process J000 to be spawned for 100 seconds
*** 2009-07-27 15:54:18.910
Waited for process J000 to initialize for 110 seconds
*** 2009-07-27 15:54:18.914
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.32 2.43 2.14
swap info: free_mem = 162.41M rsv = 21278.20M
           alloc = 17453.59M avail = 22820.93 swap_free = 26645.53M
F S      UID   PID PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
0 S   oracle 6781     1   0 40 20        ? 15905        ? 15:53:17 ?           0:00 ora_j000_newtrade1
6781:   ora_j000_newtrade1
000000010405b53c skgmidrealm (ffffffff7fffe4a0, 106527230, 0, ffffffff7fffe4c8, 3, 0) + fc
000000010405af38 skgmlocate (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 0, 0) + 158
0000000104056db4 skgmattach (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 1065278c0, ffffffff7fffe4c8) + 114
00000001011012f8 ksmlsge (0, 106400, 0, 106527000, 106537000, 106400) + b8
00000001025153cc opirip (ffffffffffffffff, 1056db000, 105400, ffffffff7ffff018, 10652a000, 10652a720) + 10c
000000010033f774 opidrv (106537460, 0, 32, 106534198, 32, 0) + 4b4
0000000100339c10 sou2o (ffffffff7ffff4a8, 32, 4, ffffffff7ffff4d0, 105e0f000, 105e0f) + 50
00000001002fbfcc opimai_real (3, ffffffff7ffff5a8, 0, 0, 247d6fc, 14800) + 10c
00000001002fbdf8 main (1, ffffffff7ffff6b8, 0, ffffffff7ffff5b0, ffffffff7ffff6c0, ffffffff79d00140) + 98
00000001002fbd1c _start (0, 0, 0, 0, 0, 0) + 17c
*** 2009-07-27 15:56:44.622
*** 2009-07-27 15:57:18.714
Killing process (ospid 6781): (reason=x4 error=0)
... and the process is still alive after kill!
*** 2009-07-27 15:57:32.677
*** 2009-07-27 16:06:55.038
KCL: caught error 481 during cr lock op
*** 2009-07-27 16:06:55.162
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error

根據錯誤資訊分析，錯誤原因是嘗試spawned新的程式，出現了長時間的等待，這應該也是系統響應變慢造成的，而不是導致問題的原因。

隨後檢查對應的ORA-600(2103)錯誤：

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
/data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 22
Unix process pid: 4240, image: oracle@newtrade1 (ARC0)

*** 2009-07-27 11:03:13.336
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 11:03:13.336
*** SESSION ID:(533.1) 2009-07-27 11:03:13.336
Redo shipping client performing standby login
*** 2009-07-27 11:03:13.468 65194 kcrr.c
Logged on to standby successfully
Client logon and security negotiation successful!
*** 2009-07-27 16:06:22.450
TIMEOUT ON CONTROL FILE ENQUEUE
mode=S, type=0, wait=1, eqt=900
*** 2009-07-27 16:06:22.456
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744         CALL     ksedst()             000000840 ?
                                                   FFFFFFFF7FFFB44C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF7F40 ?
                                                   FFFFFFFF7FFF6CA8 ?
                                                   FFFFFFFF7FFF76A8 ?
kgeriv()+220         PTR_CALL 0000000000000000     000106400 ? 10652D264 ?
                                                   10652D000 ? 00010652D ?
                                                   000106400 ? 10652D264 ?
kgesiv()+112         CALL     kgeriv()             106527C18 ? 000000000 ?
                                                   000000837 ? 000000004 ?
                                                   FFFFFFFF7FFFB9A8 ?
                                                   000001430 ?
ksesic4()+96         CALL     kgesiv()             106527C18 ? 106762320 ?
                                                   000000837 ? 000000004 ?
                                                   FFFFFFFF7FFFB9A8 ?
                                                   000001420 ?
kccocx()+4528        CALL     ksesic4()            000000837 ? 000106527 ?
                                                   10652D258 ? 000106400 ?
                                                   10652D000 ? 00010652D ?
kcrribcx()+1732      CALL     kccbcx()             00000016B ? 38000F000 ?
                                                   000000001 ? 000000384 ?
                                                   000106400 ? 00038000F ?

這個ORA-600錯誤顯然是TIMEOUT ON CONTROL FILE ENQUEUE導致的，歸檔程式要訪問控制檔案，而由於其他程式已經在訪問控制檔案，因此歸檔程式進入佇列中，歸檔程式在佇列中等待超過了900秒後，引發了這個ORA-600(2103)的錯誤。顯然這個錯誤也不是導致問題的主要原因，而是系統繁忙所引發的一個錯誤。

接著再看看psp程式對應的資訊：

bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
/data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name:    SunOS
Node name:      newtrade1
Release:        5.10
Version:        Generic_127111-06
Machine:        sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 4
Unix process pid: 3671, image: oracle@newtrade1 (PSP0)

*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 14:50:50.097
*** SESSION ID:(553.1) 2009-07-27 14:50:50.097
*** 2009-07-27 14:50:50.097
Dump diagnostics for process PZ99 pid 11183 which did not start after 120 seconds:
(spawn_time:xD42371 now:xD5FF06 diff:x1DB95)
*** 2009-07-27 14:50:50.121
Dumping diagnostic information for PZ99:
OS pid = 11183
loadavg : 0.98 1.52 2.10
swap info: free_mem = 250.75M rsv = 21332.33M
           alloc = 17360.75M avail = 22774.20 swap_free = 26745.77M
F S      UID   PID PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
0 S   oracle 11183     1   0 39 20        ? 15905        ? 14:48:23 ?           0:00 ora_pz99_newtrade1
11183: ora_pz99_newtrade1
0000000000000000 ???????? (0, 0, 0, 0, 0, 0)
*** 2009-07-27 14:51:46.295
Killing process (ospid 11183): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 11183 is still alive after 62 seconds!
*** 2009-07-27 14:54:48.727
Killing process (ospid 17027): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 14:57:57.161
Dump diagnostics for process PZ98 pid 20623 which did not start after 120 seconds:
(spawn_time:xDAA953 now:xDC83AF diff:x1DA5C)
*** 2009-07-27 14:57:57.205
Dumping diagnostic information for PZ98:
OS pid = 20623
loadavg : 1.32 1.07 1.64
swap info: free_mem = 197.38M rsv = 21395.36M
           alloc = 17448.21M avail = 22713.61 swap_free = 26660.76M
F S      UID   PID PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
0 S   oracle 20623     1   0 40 20        ? 15849        ? 14:55:34 ?           0:00 ora_pz98_newtrade1
pstack: cannot examine 20623: process is traced
*** 2009-07-27 15:00:42.098
Killing process (ospid 20623): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 20623 is still alive after 168 seconds!
*** 2009-07-27 15:01:45.471
Killed process 20623 is still alive after 228 seconds!
*** 2009-07-27 15:02:45.974
Killed process 20623 is still alive after 288 seconds!
*** 2009-07-27 15:03:46.306
Killed process 20623 is still alive after 349 seconds!
*** 2009-07-27 15:04:46.853
Killed process 20623 is still alive after 409 seconds!
*** 2009-07-27 15:05:47.298
Killed process 20623 is still alive after 470 seconds!
*** 2009-07-27 15:06:47.791
Killed process 20623 is still alive after 530 seconds!
*** 2009-07-27 15:07:48.776
Killed process 20623 is still alive after 591 seconds!
*** 2009-07-27 15:14:39.249
Killing process (ospid 12827): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:19:46.554
Killing process (ospid 17820): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:20:47.222
Killed process 17820 is still alive after 60 seconds!
*** 2009-07-27 15:26:49.396
Killing process (ospid 24054): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:32:20.246
Killing process (ospid 797): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:33:07.352
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
*** 2009-07-27 15:34:01.940
Killed process 797 is still alive after 101 seconds!
*** 2009-07-27 15:37:17.894
Killing process (ospid 4398): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:38:51.234
Killed process 4398 is still alive after 93 seconds!
*** 2009-07-27 15:45:29.769
Killing process (ospid 6013): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:52:00.619
Killing process (ospid 6192): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:56:14.520
Dump diagnostics for process J000 pid 6781 which did not start after 120 seconds:
(spawn_time:x11009AD now:x111E13E diff:x1D791)
*** 2009-07-27 15:56:14.538
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.02 1.95 2.00
swap info: free_mem = 154.73M rsv = 21285.11M
           alloc = 17466.89M avail = 22809.92 swap_free = 26628.14M
skgpgpstack: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 6781' | /bin/grep -v grep timed out after 60 seconds
pstack: cannot examine 6781: no such process or core file
*** 2009-07-27 15:59:35.070
*** 2009-07-27 16:05:24.760
Killing process (ospid 14395): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 16:06:55.478
error 481 detected in background process
ORA-00481: LMON process terminated with error

從錯誤資訊看，似乎是作業系統資源不足，導致fork新的程式時，出現了錯誤。從系統記憶體資訊看，當前記憶體資訊僅有154M的空閒。看來果然是記憶體出現了問題。

檢查節點1的記憶體配置：

root@newtrade1 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 16384 Megabytes

怎麼只有16G了，而正常應該是32G才對，檢查節點2：

root@newtrade2 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes

這就難怪了，要知道這個資料庫的SGA大小還在20G左右：

bash-3.00$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 7月 27 19:19:49 2009

連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> show sga

Total System Global Area 2.0972E+10 bytes
Fixed Size                  2165832 bytes
Variable Size            7157380024 bytes
Database Buffers         1.3808E+10 bytes
Redo Buffers                4325376 bytes

問題已經很明顯了，由於節點1上記憶體的故障，導致節點1出現了DOWN機。隨後啟動的時候出現了記憶體的錯誤，導致16G的記憶體沒有被載入。當前系統僅僅載入了16G的記憶體。而資料庫啟動的SGA就佔用了20G的內容，Oracle雖然可以啟動，但是要藉助SWAP區，也就是硬碟上的空間進行中轉。等待一段時間後，使用者負載增大，由於系統的記憶體頻繁的分頁，導致系統的響應時間迅速的增大，最終導致了節點1完全的癱瘓，而節點2由於要與節點1進行互動，因此也處於等待狀態。

其實早在記憶體減少一半後，資料庫例項的第一次啟動時，Oracle就給出了告警資訊：

WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0x00000004e200c000 [flag=0x4000]

如果一早發現這個資訊，可能會更迅速的確定問題。

找到問題的原因後，解決問題就顯得易如反掌了：解決硬體的問題，就可以從根本上解決資料庫的問題；如果系統的負載不是很大，可以讓資料庫處於單例項工作狀態中；如果硬體問題需要很長時間才能解決，還可以在節點1上設定例項級的SGA，根據節點1當前可用的記憶體，降低SGA和PGA的設定，這樣也可以避免問題的產生。

在例項2上設定例項1的專用SGA_TARGET引數：

SQL> alter system set sga_target = 9663676416 scope = spfile sid = 'newtrade1';

系統已更改。

下面重啟例項1，問題解決：

$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期二 7月 28 17:55:41 2009

已連線到空閒例程。

SQL> startup
ORACLE 例程已經啟動。

Total System Global Area 9663676416 bytes
Fixed Size                  2045872 bytes
Variable Size            1442842704 bytes
Database Buffers         8204058624 bytes
Redo Buffers               14729216 bytes
資料庫裝載完畢。
資料庫已經開啟。

等到解決硬體問題後，只需要RESET例項1的SGA_TARGET引數就可以恢復正常了。

RAC環境出現Waiting for instances to leave資訊（二）

相關文章