[Summary] 9i RAC LMON: terminating instance due to error 29702
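Basic configuration:
Linux AS3.0, kernel version: 2.4.21-37.ELsmp
Oracle 9.2.0.4 upgraded to Oracle 9.2.0.7, RAC, two nodes. The clusterware was installed from the 9.2.0.4 software.
Later investigation showed that when an Oracle 9.2.0.4 RAC system is upgraded to Oracle 9.2.0.7, the Oracle RDBMS software can be upgraded to 9.2.0.7, but the 9.2.0.7 patch set does not include an upgraded version of the ORACM cluster manager software. This is an Oracle bug (9.2.0.7.0 Bug 4163445). The ORACM clusterware installed with Oracle 9.2.0.4 can only be upgraded from the Oracle 9.2.0.6 patch set. (The Oracle CM log shows that the ORACM version installed with Oracle 9.2.0.4 is oracm 9.2.0.2.0; the 9.2.0.7 patch does not upgrade ORACM, and the 9.2.0.6 patch upgrades it to oracm 9.2.0.6.0.52.) One point to note: every upgrade step must strictly follow the README, although Oracle's README does not necessarily cover everything, and this problem is one example.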
http://www.itpub.net/viewthread.php?tid=922265&extra=&highlight=%2Btolywang&page=3 (9.2.0.7.0 Bug 4163445)
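To confirm which ORACM version is actually running after a patch set, the version can be read out of the CM log instead of being inferred from the RDBMS banner. The following is a minimal sketch, not an Oracle-supplied tool: the function names `oracm_versions` and `version_at_least` are hypothetical, the log location in the usage comment is an assumption about a typical 9i Linux install, and the script only assumes that the version appears as a dotted number on a line that mentions `oracm`.

```shell
#!/bin/sh
# Print each distinct dotted version number found on lines mentioning "oracm".
# Assumption: the CM log writes the version as a dotted number (e.g. 9.2.0.2.0);
# the exact wording of that log line is not relied on.
oracm_versions() {
    grep -i 'oracm' "$1" | grep -Eo '[0-9]+(\.[0-9]+){3,}' | sort -u
}

# Succeed if dotted version $1 is greater than or equal to dotted version $2.
# Fields are compared numerically, and missing fields count as 0.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, "."); m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Example use (the path is an assumption, adjust to the local install):
#   for v in $(oracm_versions "$ORACLE_HOME/oracm/log/cm.log"); do
#       version_at_least "$v" 9.2.0.6 || echo "ORACM $v predates the 9.2.0.6 patch set"
#   done
```

Run on both nodes, this makes a 9.2.0.7 RDBMS sitting on top of a 9.2.0.2 ORACM visible immediately.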
Problem description:
The symptoms were as follows (the instances on node 1 and node 2 crashed alternately, roughly once every 5 to 8 days):
alert_orcl1.log
-------------------------------------------------------------------------------------------------
Sat Jan 5 18:44:19 2008
ARC1: Evaluating archive log 1 thread 1 sequence 122
ARC1: Beginning to archive log 1 thread 1 sequence 122
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_122.dbf'
ARC1: Completed archiving log 1 thread 1 sequence 122
Sat Jan 5 19:36:06 2008
Thread 1 advanced to log sequence 124
Current log# 4 seq# 124 mem# 0: /ocfs_ctrl_redo/orcl/redo04.log
Current log# 4 seq# 124 mem# 1: /ocfs_data/orcl/redo04b.log
Sat Jan 5 19:36:06 2008
ARC1: Evaluating archive log 3 thread 1 sequence 123
ARC1: Beginning to archive log 3 thread 1 sequence 123
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_123.dbf'
ARC1: Completed archiving log 3 thread 1 sequence 123
Sat Jan 5 19:45:15 2008
Errors in file /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jan 5 19:45:15 2008
LMON: terminating instance due to error 29702
Sat Jan 5 19:45:16 2008
System state dump is made for local instance
Sat Jan 5 19:45:20 2008
Instance terminated by LMON, pid = 14214
Sat Jan 5 19:54:53 2008
Starting ORACLE instance (normal)
Sat Jan 5 19:54:53 2008
Global Enqueue Service Resources = 26694, pool = 4
Sat Jan 5 19:54:53 2008
Global Enqueue Service Enqueues = 39350
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.7.0.
System parameters with non-default values:
processes = 1000
timed_statistics = FALSE
resource_limit = TRUE
shared_pool_size = 419430400
large_pool_size = 33554432
java_pool_size = 33554432
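Because the alert log puts the timestamp on the line before each message, the crash history is easier to read once every termination is paired with its timestamp. The sketch below is one way to do that with awk. The function name `instance_terminations` is hypothetical, and the timestamp pattern is an assumption based on the excerpt above (day name, month, day, time, year on a line of their own).

```shell
#!/bin/sh
# Print "timestamp | message" for every instance termination in an alert log.
# Assumption: timestamps look like "Sat Jan 5 19:45:15 2008" on their own line.
instance_terminations() {
    awk '
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] +[0-9]+ [0-9:]+ [0-9][0-9][0-9][0-9]$/ {
            stamp = $0      # remember the most recent timestamp line
            next
        }
        /terminating instance due to error/ {
            print stamp " | " $0
        }
    ' "$1"
}

# Example use (path taken from the trace header above, adjust as needed):
#   instance_terminations /u01/product/admin/orcl/bdump/alert_orcl1.log
```

Running it against the alert logs of both nodes shows at a glance whether the crashes alternate and how many days separate them.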
$ vi /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
=============
/u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC01
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl1
Redo thread mounted by this instance: 0
Oracle process number: 4
Unix process pid: 14214, image: oracle@DELL-RAC01 (LMON)
*** SESSION ID
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:2230 b:2230) Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xb6d93f8)
*** 2007-12-31 12:07:49.396
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2007-12-31 12:07:49.396
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 122787
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 1 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 1 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 1 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 1 5.
Name Service normal
Name Service recovery done
*** 2007-12-31 12:07:49.611
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 1 6.
*** 2007-12-31 12:07:49.832
*** 2007-12-31 12:07:49.832
Reconfiguration started (old inc 0, new inc 1)
Synchronization timeout interval: 600 sec
List of nodes:
0
Global Resource Directory frozen
node 0
release 9 2 0 7
* kjshashcfg: I'm the only node in the cluster (node 0)
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2007-12-31 12:07:50.121
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs
ORACM log entries at that time: ERROR: WriteEventPort: write failed with error 32
------------------------------------------------------------
Debug Hang :ClientProcListener (PID=14257) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:688145 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
>ERROR: WriteEventPort: write failed with error 32., tid = ClientProcListener:688145 file = unixinc.c, line = 915 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14261) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:622615 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14255) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:557077 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
Diag trace log :
/u01/product/admin/orcl/bdump/orcl2_diag_14211.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC02
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl2
Redo thread mounted by this instance: 0
Oracle process number: 3
Unix process pid: 14211, image: oracle@DELL-RAC02 (DIAG)
*** SESSION ID:(2.1) 2008-01-16 12:16:14.524
CMCLI WARNING: CMInitContext: init ctx(0xb9115f4)
kjzcprt:rcv port created
Node id: 1
List of nodes: 0, 1,
*** 2008-01-16 12:16:14.526
Reconfiguration starts [incarn=0]
I'm the voting node
Send my bitmap to master 0
Rcfg confirmation is received from master 0
I agree with the rcfg confirmation
*** 2008-01-16 12:16:25.233
Reconfiguration completes [incarn=2]
*** 2008-01-19 04:50:21.933
Instance is terminating by process 14215 [ospid=oracle@DELL-RAC02 (LMON)]
Performing diagnostic data dump for this instance
CMCLI WARNING: CommonContextCleanup: closing comm port
DIAG detachs from CM
error 29723 detected in background process
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-29723: Failed to attach to the global enqueue service (status=32)
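Judging from the error description on MetaLink, the problem appears to be caused by libskgxn9.so being inconsistent between the two instances of the RAC environment.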
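A quick way to test that suspicion is to fingerprint the library on each node and compare the results. This is only a sketch under stated assumptions: the function name `lib_fingerprint` is hypothetical, and the remote command in the usage comment assumes `ssh` access between the nodes, which the original setup does not mention.

```shell
#!/bin/sh
# Print a "CRC-size" fingerprint of a file, using the standard cksum utility.
# Reading from stdin keeps the file name out of the output, so fingerprints
# taken on different nodes can be compared as plain strings.
lib_fingerprint() {
    cksum < "$1" | awk '{ print $1 "-" $2 }'
}

# Example use (ssh between nodes is an assumption, not part of the original setup):
#   local_fp=$(lib_fingerprint "$ORACLE_HOME/lib/libskgxn9.so")
#   remote_fp=$(ssh DELL-RAC02 "cksum < $ORACLE_HOME/lib/libskgxn9.so" | awk '{ print $1 "-" $2 }')
#   [ "$local_fp" = "$remote_fp" ] || echo "libskgxn9.so differs between the nodes"
```

A matching fingerprint does not prove the library is the right one, only that both nodes are running the same copy.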
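Resolution:
1. Because this was an upgrade from Oracle 9.2.0.4 to Oracle 9.2.0.7, and 9.2.0.7 ships only the RDBMS software with no upgraded ORACM, the 9.2.0.6 patch set must also be used to upgrade oracm 9.2.0.2 to oracm 9.2.0.6.0.52. Be sure to follow the README strictly.
2. After upgrading the Oracle RDBMS and ORACM 9.2.0.6, scripts such as catproc.sql still have to be run to update the data dictionary; these are all listed in the README.
3. Some bugs are not publicly documented and cannot be found through Google or Baidu; they are visible only on MetaLink. Also, the alert log alone is not enough: it has to be read together with the trc log, diag log, CM log, and other files.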
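Reading the alert log, trace files, and CM log side by side mostly means finding what each of them recorded around the same minute. The helper below is a rough sketch of that. The name `events_at` is hypothetical, and it simply prints every line containing the given time fragment plus the line that follows it, since the alert log keeps the message on the line after the timestamp.

```shell
#!/bin/sh
# Usage: events_at "TIME FRAGMENT" file...
# Prints each line containing the fragment, and the line right after it,
# prefixed with the name of the file it came from.
events_at() {
    pattern=$1
    shift
    awk -v pat="$pattern" '
        FNR == 1       { follow = 0 }                      # reset at each new file
        index($0, pat) { print FILENAME ": " $0; follow = 1; next }
        follow         { print FILENAME ": " $0; follow = 0 }
    ' "$@"
}

# Example use (the cm.log path is an assumption; the fragment must match the
# spacing used in the real logs, which may pad single-digit days):
#   events_at "Jan 5 19:45" alert_orcl1.log orcl1_lmon_14214.trc "$ORACLE_HOME/oracm/log/cm.log"
```

In this incident that view would put the ORA-29702 from the alert log next to the `WriteEventPort: write failed with error 32` entries from the CM log, both stamped 19:45.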
Bug 4390716 Linux: "CMCLI WARNING" messages after applying 9.2.0.6 / 7
This note gives a brief overview of bug 4390716.
Affects:
Product (Component): Oracle Server (Rdbms)
Range of versions believed to be affected: Versions >= 9.2.0.6
Versions confirmed as being affected:
Platforms affected:
It is believed to be a regression in default behaviour thus:
Regression introduced in 9.2.0.6
Fixed:
This issue is fixed in:
Symptoms: / Related To:
Description
After applying 9.2.0.6 or 9.2.0.7 Patch Set on Linux
platforms then RAC installations may start reporting
numerous errors to trace files of the form:
CMCLI WARNING: CMInitContext: init ctx(0xae5c9a4)
CMCLI WARNING: CommonContextCleanup: closing comm port
This can lead to disk full and instance crash scenarios.
Workaround:
After installation of the Patch Set ensure that the
following steps are executed on ALL nodes of the RAC
cluster:
cd $ORACLE_HOME/rdbms/lib
Shut down all the instances in the OH
make -f ins_rdbms.mk rac_on ioracle
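Since the note warns that these warnings can fill the disk, it helps to see which trace files are accumulating them. The following sketch counts `CMCLI WARNING` lines per trace file in a dump directory. The function name `cmcli_warning_counts` is hypothetical, and the directory in the usage comment is taken from the trace paths shown earlier.

```shell
#!/bin/sh
# Usage: cmcli_warning_counts DIR
# Prints "count file" for every *.trc file in DIR, highest count first.
cmcli_warning_counts() {
    for f in "$1"/*.trc; do
        [ -f "$f" ] || continue          # skip the literal pattern if nothing matches
        printf '%s %s\n' "$(grep -c 'CMCLI WARNING' "$f")" "$f"
    done | sort -rn
}

# Example use:
#   cmcli_warning_counts /u01/product/admin/orcl/bdump | head
#   df -k /u01/product/admin/orcl/bdump
```

Comparing the counts before and after the relink gives a simple check that the workaround took effect.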
Applying the workaround commands from this bug note:
Actual execution:
DELL-RAC01$
DELL-RAC01$make -f ins_rdbms.mk rac_on ioracle
rm -f /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so
/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o
- Linking Oracle
rm -f /u01/product/oracle/rdbms/lib/oracle
gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`
mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO
mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle
chmod 6751 /u01/product/oracle/bin/oracle
DELL-RAC02$
DELL-RAC02$cd $ORACLE_HOME/rdbms/lib
DELL-RAC02$make -f ins_rdbms.mk rac_on ioracle
rm -f /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so
/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o
- Linking Oracle
rm -f /u01/product/oracle/rdbms/lib/oracle
gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`
mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO
mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle
chmod 6751 /u01/product/oracle/bin/oracle
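The make output above shows `libcmdll.so` being copied over `libskgxn9.so` and a new `oracle` binary being moved into place. A small post-relink check on each node can confirm that both steps really happened. This is a sketch only: the function name `relink_looks_consistent` is hypothetical, and it checks nothing beyond the two conditions visible in the output above.

```shell
#!/bin/sh
# Usage: relink_looks_consistent ORACLE_HOME
# Verifies that lib/libskgxn9.so is byte-identical to lib/libcmdll.so and that
# bin/oracle exists and is executable. Prints one line describing the outcome.
relink_looks_consistent() {
    home=$1
    cmp -s "$home/lib/libcmdll.so" "$home/lib/libskgxn9.so" || {
        echo "libskgxn9.so differs from libcmdll.so in $home/lib"
        return 1
    }
    [ -x "$home/bin/oracle" ] || {
        echo "no executable oracle binary in $home/bin"
        return 1
    }
    echo "relink looks consistent in $home"
}

# Example use, run on every node after the make step:
#   relink_looks_consistent "$ORACLE_HOME"
```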
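After two weeks of observation, no similar problem occurred. The instance crashes that used to happen every 6 to 7 days disappeared, the logs returned to normal, and no similar errors appeared in the CM log. The bug was resolved.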
For the whole process, see: http://www.itpub.net/viewthread.php?tid=922265&highlight=%2Btolywang
Source: "ITPUB Blog", link: http://blog.itpub.net/35489/viewspace-999637/. If you repost this article, please credit the source; otherwise legal action may be taken.