RAC環境出現Waiting for instances to leave資訊(二)
在RAC環境的其中一個節點上的alert檔案中,出現了這個資訊:Waiting for instances to leave。
這一篇描述問題的解決過程。
RAC環境出現Waiting for instances to leave資訊(一):http://yangtingkun.itpub.net/post/468/493501
在節點1的alert檔案中發現了大量的錯誤資訊,不過雖然錯誤資訊很多,但是大部分都是重複的,主要的錯誤資訊只有幾種:
首先出現了ORA-3136錯誤,本來10.2.0.3上就存在這個bug,會出現ORA-3136錯誤,但是當前出現這個錯誤的數量,已經超過了這個錯誤幾年來累計出現了次數;
隨後是一些後臺輕量級的程式出錯並在這些程式的啟動時報錯:包括Q001程式、PZ98程式、JOB程式和cjq程式等;
PSP程式啟動時出現了ORA-27300錯誤;
LSM程式碰到了IPC Send timeout detected;
歸檔程式碰到了ORA-600(2103的錯誤);
最後出現了ORA-00481錯誤,LMON程式中止了所有的後臺程式。
最開始出現的ORA-3136錯誤不難理解,這個錯誤本來就和連線超時有關,由於存在異常導致系統響應非常緩慢,因此出現大量的超時錯誤不足為奇。
下面再來看看cjq程式對應的錯誤資訊:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
/data/oracle/admin/newtrade/bdump/newtrade1_cjq0_3879.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name: SunOS
Node name: newtrade1
Release: 5.10
Version: Generic_127111-06
Machine: sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 3879, image: oracle@newtrade1 (CJQ0)
*** 2009-07-27 15:12:03.901
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 15:12:03.517
*** SESSION ID:(540.1) 2009-07-27 15:12:03.412
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:12:14.013
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:12:24.155
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:12:35.340
Waited for process J000 to be spawned for 91 seconds
*** 2009-07-27 15:12:45.430
Waited for process J000 to be spawned for 101 seconds
*** 2009-07-27 15:12:55.540
Waited for process J000 to be spawned for 111 seconds
*** 2009-07-27 15:14:59.846
*** 2009-07-27 15:53:28.354
Waited for process J000 to be spawned for 60 seconds
*** 2009-07-27 15:53:38.461
Waited for process J000 to be spawned for 70 seconds
*** 2009-07-27 15:53:48.556
Waited for process J000 to be spawned for 80 seconds
*** 2009-07-27 15:53:58.650
Waited for process J000 to be spawned for 90 seconds
*** 2009-07-27 15:54:08.829
Waited for process J000 to be spawned for 100 seconds
*** 2009-07-27 15:54:18.910
Waited for process J000 to initialize for 110 seconds
*** 2009-07-27 15:54:18.914
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.32 2.43 2.14
swap info: free_mem = 162.41M rsv = 21278.20M
alloc = 17453.59M avail = 22820.93 swap_free = 26645.53M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 6781 1 0 40 20 ? 15905 ? 15:53:17 ? 0:00 ora_j000_newtrade1
6781: ora_j000_newtrade1
000000010405b53c skgmidrealm (ffffffff7fffe4a0, 106527230, 0, ffffffff7fffe4c8, 3, 0) + fc
000000010405af38 skgmlocate (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 0, 0) + 158
0000000104056db4 skgmattach (ffffffff7fffe4a0, 106527230, ffffffff7fffe4c8, 29815c08, 1065278c0, ffffffff7fffe4c8) + 114
00000001011012f8 ksmlsge (0, 106400, 0, 106527000, 106537000, 106400) + b8
00000001025153cc opirip (ffffffffffffffff, 1056db000, 105400, ffffffff7ffff018, 10652a000, 10652a720) + 10c
000000010033f774 opidrv (106537460, 0, 32, 106534198, 32, 0) + 4b4
0000000100339c10 sou2o (ffffffff7ffff4a8, 32, 4, ffffffff7ffff4d0, 105e0f000, 105e0f) + 50
00000001002fbfcc opimai_real (3, ffffffff7ffff5a8, 0, 0, 247d6fc, 14800) + 10c
00000001002fbdf8 main (1, ffffffff7ffff6b8, 0, ffffffff7ffff5b0, ffffffff7ffff6c0, ffffffff79d00140) + 98
00000001002fbd1c _start (0, 0, 0, 0, 0, 0) + 17c
*** 2009-07-27 15:56:44.622
*** 2009-07-27 15:57:18.714
Killing process (ospid 6781): (reason=x4 error=0)
... and the process is still alive after kill!
*** 2009-07-27 15:57:32.677
*** 2009-07-27 16:06:55.038
KCL: caught error 481 during cr lock op
*** 2009-07-27 16:06:55.162
ORA-00604: error occurred at recursive SQL level 1
ORA-00481: LMON process terminated with error
根據錯誤資訊分析,錯誤原因是嘗試spawned新的程式,出現了長時間的等待,這應該也是系統響應變慢造成的,而不是導致問題的原因。
隨後檢查對應的ORA-600(2103)錯誤:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
/data/oracle/admin/newtrade/bdump/newtrade1_arc0_4240.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name: SunOS
Node name: newtrade1
Release: 5.10
Version: Generic_127111-06
Machine: sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 22
Unix process pid: 4240, image: oracle@newtrade1 (ARC0)
*** 2009-07-27 11:03:13.336
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 11:03:13.336
*** SESSION ID:(533.1) 2009-07-27 11:03:13.336
Redo shipping client performing standby login
*** 2009-07-27 11:03:13.468 65194 kcrr.c
Logged on to standby successfully
Client logon and security negotiation successful!
*** 2009-07-27 16:06:22.450
TIMEOUT ON CONTROL FILE ENQUEUE
mode=S, type=0, wait=1, eqt=900
*** 2009-07-27 16:06:22.456
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900], [], [], []
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744 CALL ksedst() 000000840 ?
FFFFFFFF7FFFB44C ?
000000000 ?
FFFFFFFF7FFF7F40 ?
FFFFFFFF7FFF6CA8 ?
FFFFFFFF7FFF76A8 ?
kgeriv()+220 PTR_CALL 0000000000000000 000106400 ? 10652D264 ?
10652D000 ? 00010652D ?
000106400 ? 10652D264 ?
kgesiv()+112 CALL kgeriv() 106527C18 ? 000000000 ?
000000837 ? 000000004 ?
FFFFFFFF7FFFB9A8 ?
000001430 ?
ksesic4()+96 CALL kgesiv() 106527C18 ? 106762320 ?
000000837 ? 000000004 ?
FFFFFFFF7FFFB9A8 ?
000001420 ?
kccocx()+4528 CALL ksesic4() 000000837 ? 000106527 ?
10652D258 ? 000106400 ?
10652D000 ? 00010652D ?
kcrribcx()+1732 CALL kccbcx() 00000016B ? 38000F000 ?
000000001 ? 000000384 ?
000106400 ? 00038000F ?
這個ORA-600錯誤顯然是TIMEOUT ON CONTROL FILE ENQUEUE導致的,歸檔程式要訪問控制檔案,而由於其他程式已經在訪問控制檔案,因此歸檔程式進入佇列中,歸檔程式在佇列中等待超過了900秒後,引發了這個ORA-600(2103)的錯誤。顯然這個錯誤也不是導致問題的主要原因,而是系統繁忙所引發的一個錯誤。
接著再看看psp程式對應的資訊:
bash-3.00$ more /data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
/data/oracle/admin/newtrade/bdump/newtrade1_psp0_3671.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /data/oracle/product/10.2/database
System name: SunOS
Node name: newtrade1
Release: 5.10
Version: Generic_127111-06
Machine: sun4u
Instance name: newtrade1
Redo thread mounted by this instance: 1
Oracle process number: 4
Unix process pid: 3671, image: oracle@newtrade1 (PSP0)
*** SERVICE NAME:(SYS$BACKGROUND) 2009-07-27 14:50:50.097
*** SESSION ID:(553.1) 2009-07-27 14:50:50.097
*** 2009-07-27 14:50:50.097
Dump diagnostics for process PZ99 pid 11183 which did not start after 120 seconds:
(spawn_time:xD42371 now:xD5FF06 diff:x1DB95)
*** 2009-07-27 14:50:50.121
Dumping diagnostic information for PZ99:
OS pid = 11183
loadavg : 0.98 1.52 2.10
swap info: free_mem = 250.75M rsv = 21332.33M
alloc = 17360.75M avail = 22774.20 swap_free = 26745.77M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 11183 1 0 39 20 ? 15905 ? 14:48:23 ? 0:00 ora_pz99_newtrade1
11183: ora_pz99_newtrade1
0000000000000000 ???????? (0, 0, 0, 0, 0, 0)
*** 2009-07-27 14:51:46.295
Killing process (ospid 11183): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 11183 is still alive after 62 seconds!
*** 2009-07-27 14:54:48.727
Killing process (ospid 17027): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 14:57:57.161
Dump diagnostics for process PZ98 pid 20623 which did not start after 120 seconds:
(spawn_time:xDAA953 now:xDC83AF diff:x1DA5C)
*** 2009-07-27 14:57:57.205
Dumping diagnostic information for PZ98:
OS pid = 20623
loadavg : 1.32 1.07 1.64
swap info: free_mem = 197.38M rsv = 21395.36M
alloc = 17448.21M avail = 22713.61 swap_free = 26660.76M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 20623 1 0 40 20 ? 15849 ? 14:55:34 ? 0:00 ora_pz98_newtrade1
pstack: cannot examine 20623: process is traced
*** 2009-07-27 15:00:42.098
Killing process (ospid 20623): did not start after 120 seconds
... and the process is still alive after kill!
Killed process 20623 is still alive after 168 seconds!
*** 2009-07-27 15:01:45.471
Killed process 20623 is still alive after 228 seconds!
*** 2009-07-27 15:02:45.974
Killed process 20623 is still alive after 288 seconds!
*** 2009-07-27 15:03:46.306
Killed process 20623 is still alive after 349 seconds!
*** 2009-07-27 15:04:46.853
Killed process 20623 is still alive after 409 seconds!
*** 2009-07-27 15:05:47.298
Killed process 20623 is still alive after 470 seconds!
*** 2009-07-27 15:06:47.791
Killed process 20623 is still alive after 530 seconds!
*** 2009-07-27 15:07:48.776
Killed process 20623 is still alive after 591 seconds!
*** 2009-07-27 15:14:39.249
Killing process (ospid 12827): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:19:46.554
Killing process (ospid 17820): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:20:47.222
Killed process 17820 is still alive after 60 seconds!
*** 2009-07-27 15:26:49.396
Killing process (ospid 24054): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:32:20.246
Killing process (ospid 797): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:33:07.352
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
*** 2009-07-27 15:34:01.940
Killed process 797 is still alive after 101 seconds!
*** 2009-07-27 15:37:17.894
Killing process (ospid 4398): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:38:51.234
Killed process 4398 is still alive after 93 seconds!
*** 2009-07-27 15:45:29.769
Killing process (ospid 6013): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:52:00.619
Killing process (ospid 6192): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 15:56:14.520
Dump diagnostics for process J000 pid 6781 which did not start after 120 seconds:
(spawn_time:x11009AD now:x111E13E diff:x1D791)
*** 2009-07-27 15:56:14.538
Dumping diagnostic information for J000:
OS pid = 6781
loadavg : 1.02 1.95 2.00
swap info: free_mem = 154.73M rsv = 21285.11M
alloc = 17466.89M avail = 22809.92 swap_free = 26628.14M
skgpgpstack: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 6781' | /bin/grep -v grep timed out after 60 seconds
pstack: cannot examine 6781: no such process or core file
*** 2009-07-27 15:59:35.070
*** 2009-07-27 16:05:24.760
Killing process (ospid 14395): requester cancelled request
... and the process is still alive after kill!
*** 2009-07-27 16:06:55.478
error 481 detected in background process
ORA-00481: LMON process terminated with error
從錯誤資訊看,似乎是作業系統資源不足,導致fork新的程式時,出現了錯誤。從系統記憶體資訊看,當前記憶體資訊僅有154M的空閒。看來果然是記憶體出現了問題。
檢查節點1的記憶體配置:
root@newtrade1 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 16384 Megabytes
怎麼只有16G了,而正常應該是32G才對,檢查節點2:
root@newtrade2 # /usr/sbin/prtconf | grep "Memory size"
Memory size: 32768 Megabytes
這就難怪了,要知道這個資料庫的SGA大小還在20G左右:
bash-3.00$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 7月 27 19:19:49 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> show sga
Total System Global Area 2.0972E+10 bytes
Fixed Size 2165832 bytes
Variable Size 7157380024 bytes
Database Buffers 1.3808E+10 bytes
Redo Buffers 4325376 bytes
問題已經很明顯了,由於節點1上記憶體的故障,導致節點1出現了DOWN機。隨後啟動的時候出現了記憶體的錯誤,導致16G的記憶體沒有被載入。當前系統僅僅載入了16G的記憶體。而資料庫啟動的SGA就佔用了20G的內容,Oracle雖然可以啟動,但是要藉助SWAP區,也就是硬碟上的空間進行中轉。等待一段時間後,使用者負載增大,由於系統的記憶體頻繁的分頁,導致系統的響應時間迅速的增大,最終導致了節點1完全的癱瘓,而節點2由於要與節點1進行互動,因此也處於等待狀態。
其實早在記憶體減少一半後,資料庫例項的第一次啟動時,Oracle就給出了告警資訊:
WARNING: Not enough physical memory for SHM_SHARE_MMU segment of size 0x00000004e200c000 [flag=0x4000]
如果一早發現這個資訊,可能會更迅速的確定問題。
找到問題的原因後,解決問題就顯得易如反掌了:解決硬體的問題,就可以從根本上解決資料庫的問題;如果系統的負載不是很大,可以讓資料庫處於單例項工作狀態中;如果硬體問題需要很長時間才能解決,還可以在節點1上設定例項級的SGA,根據節點1當前可用的記憶體,降低SGA和PGA的設定,這樣也可以避免問題的產生。
在例項2上設定例項1的專用SGA_TARGET引數:
SQL> alter system set sga_target = 9663676416 scope = spfile sid = 'newtrade1';
系統已更改。
下面重啟例項1,問題解決:
$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on 星期二 7月 28 17:55:41 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
已連線到空閒例程。
SQL> startup
ORACLE 例程已經啟動。
Total System Global Area 9663676416 bytes
Fixed Size 2045872 bytes
Variable Size 1442842704 bytes
Database Buffers 8204058624 bytes
Redo Buffers 14729216 bytes
資料庫裝載完畢。
資料庫已經開啟。
等到解決硬體問題後,只需要RESET例項1的SGA_TARGET引數就可以恢復正常了。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-618381/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- RAC環境出現Waiting for instances to leave資訊(一)AI
- 【Oracle】 RAC 環境刪除oracle 之二Oracle
- RAC環境利用備份恢復RAC資料庫(二)資料庫
- rac環境vip在linux下的連線資訊Linux
- 11gr2RAC環境解除安裝(二)
- 單例項opatch出現RAC資訊單例
- Oracle RAC 10g for Solaris環境解除安裝(二)Oracle
- rac環境下standby資料庫的實現資料庫
- rac環境下使用impdp匯入資料出錯
- DB遷移RAC環境
- oracle rac 環境檢測Oracle
- 提取java環境資訊Java
- 【RAC】使用一條“ps”命令獲取Linux環境下全部RAC叢集程式資訊Linux
- Oracle10203RAC環境刪除節點(二)Oracle
- Oracle 11g R2(11.2.0.3.0) RAC環境搭建(二)Oracle
- Oracle10203RAC環境新增新節點(二)Oracle
- RAC環境重建控制檔案
- ORACLE RAC 環境下修改IPOracle
- RAC環境的恢復策略
- Oracle RAC + Data Guard 環境搭建Oracle
- RAC環境下dataguard的搭建
- 【RAC】 RAC For W2K8R2 安裝--作業系統環境配置 (二)作業系統
- 【RAC】使用VMware虛擬機器搭建RAC環境虛擬機
- 存檔-環境資訊aptAPT
- 單例項環境利用備份恢復RAC資料庫(二)單例資料庫
- RAC和ASM環境下打patchASM
- 一鍵RAC環境配置指令碼指令碼
- RAC 環境下修改歸檔模式模式
- RAC環境下配置OGG同步
- RAC環境中的TNSNAMES檔案
- RAC環境網路故障測試
- RAC環境STANDBY的SWITCHOVER切換
- 【RAC】RAC環境下歸檔日誌格式約定
- RAC環境中的儲存部分管理——RAC管理
- RAC環境在ASM上建立表空間出錯ORA-569ASM
- Solaris10下安裝Oracle1106RAC環境(二)Oracle
- Solaris10下安裝Oracle10203RAC環境(二)Oracle
- Solaris8上安裝RAC10202環境(二)