[Summary] 9i RAC LMON: terminating instance due to error 29702

Posted by tolywang on 2008-02-18

Basic configuration:

Linux AS 3.0, kernel version 2.4.21-37.ELsmp
Oracle 9.2.0.4, upgraded to Oracle 9.2.0.7; two-node RAC.
The clusterware was installed from the 9.2.0.4 software.

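Later investigation showed that when an Oracle 9.2.0.4 RAC system is upgraded to Oracle 9.2.0.7, the Oracle RDBMS software can be upgraded to 9.2.0.7, but the 9.2.0.7 patchset does not contain an upgraded version of the ORACM cluster manager software. This is an Oracle bug (9.2.0.7.0 Bug 4163445). The 9.2.0.4 clusterware software ORACM can only be upgraded from the 9.2.0.6 patchset. The Oracle CM log shows that the ORACM version installed with 9.2.0.4 is oracm 9.2.0.2.0; the 9.2.0.7 patch does not upgrade ORACM, while the 9.2.0.6 patch upgrades it to oracm 9.2.0.6.0.52. One point to note is that every upgrade step must strictly follow the README, although the Oracle README does not necessarily cover everything, and this problem is one example.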


Reference thread (9.2.0.7.0 Bug 4163445): http://www.itpub.net/viewthread.php?tid=922265&extra=&highlight=%2Btolywang&page=3

Problem description:

The symptom was as follows (the instances on node 1 and node 2 took turns crashing, roughly once every 58 days):

alert_orcl1.log
-------------------------------------------------------------------------------------------------


Sat Jan 5 18:44:19 2008
ARC1: Evaluating archive log 1 thread 1 sequence 122
ARC1: Beginning to archive log 1 thread 1 sequence 122
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_122.dbf'
ARC1: Completed archiving log 1 thread 1 sequence 122
Sat Jan 5 19:36:06 2008
Thread 1 advanced to log sequence 124
Current log# 4 seq# 124 mem# 0: /ocfs_ctrl_redo/orcl/redo04.log
Current log# 4 seq# 124 mem# 1: /ocfs_data/orcl/redo04b.log
Sat Jan 5 19:36:06 2008
ARC1: Evaluating archive log 3 thread 1 sequence 123
ARC1: Beginning to archive log 3 thread 1 sequence 123
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_123.dbf'
ARC1: Completed archiving log 3 thread 1 sequence 123
Sat Jan 5 19:45:15 2008
Errors in file /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jan 5 19:45:15 2008
LMON: terminating instance due to error 29702
Sat Jan 5 19:45:16 2008
System state dump is made for local instance
Sat Jan 5 19:45:20 2008
Instance terminated by LMON, pid = 14214
Sat Jan 5 19:54:53 2008
Starting ORACLE instance (normal)
Sat Jan 5 19:54:53 2008
Global Enqueue Service Resources = 26694, pool = 4
Sat Jan 5 19:54:53 2008
Global Enqueue Service Enqueues = 39350
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.7.0.
System parameters with non-default values:
processes = 1000
timed_statistics = FALSE
resource_limit = TRUE
shared_pool_size = 419430400
large_pool_size = 33554432
java_pool_size = 33554432
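
To see how often the instance is being killed, it helps to pull just the LMON termination events, with their timestamps, out of the alert log. The following is a minimal sketch, assuming the alert log format shown above (a timestamp line followed by the message); the function name list_lmon_crashes is made up for illustration.

```shell
#!/bin/sh
# list_lmon_crashes FILE
# Print each "LMON: terminating instance" message together with the
# timestamp line that most recently preceded it in the alert log.
list_lmon_crashes() {
    awk '
        # Alert log timestamp lines start with a day-of-week abbreviation.
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) / { ts = $0; next }
        /LMON: terminating instance due to error/ { print ts " | " $0 }
    ' "$1"
}
```

For example, run `list_lmon_crashes alert_orcl1.log` on each node and compare the dates to estimate the interval between crashes.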







$ vi /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
=============



/u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC01
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl1
Redo thread mounted by this instance: 0
Oracle process number: 4
Unix process pid: 14214, image: oracle@DELL-RAC01 (LMON)

*** SESSION ID:(3.1) 2007-12-31 12:07:45.591
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:2230 b:2230) Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xb6d93f8)
*** 2007-12-31 12:07:49.396
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2007-12-31 12:07:49.396
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 122787
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 1 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 1 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 1 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 1 5.
Name Service normal
Name Service recovery done
*** 2007-12-31 12:07:49.611
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 1 6.
*** 2007-12-31 12:07:49.832
*** 2007-12-31 12:07:49.832
Reconfiguration started (old inc 0, new inc 1)
Synchronization timeout interval: 600 sec
List of nodes:
0
Global Resource Directory frozen
node 0
release 9 2 0 7
* kjshashcfg: I'm the only node in the cluster (node 0)
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2007-12-31 12:07:50.121
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs

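ORACM log at the time of the crash: ERROR: WriteEventPort: write failed with error 32 (on Linux, errno 32 is EPIPE, "broken pipe", which is consistent with the "socket closed by peer" warnings below)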

------------------------------------------------------------

Debug Hang :ClientProcListener (PID=14257) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:688145 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
>ERROR: WriteEventPort: write failed with error 32., tid = ClientProcListener:688145 file = unixinc.c, line = 915 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14261) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:622615 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14255) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:557077 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }

DIAG trace log:

/u01/product/admin/orcl/bdump/orcl2_diag_14211.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC02
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl2
Redo thread mounted by this instance: 0
Oracle process number: 3
Unix process pid: 14211, image: oracle@DELL-RAC02 (DIAG)

*** SESSION ID:(2.1) 2008-01-16 12:16:14.524
CMCLI WARNING: CMInitContext: init ctx(0xb9115f4)
kjzcprt:rcv port created
Node id: 1
List of nodes: 0, 1,
*** 2008-01-16 12:16:14.526
Reconfiguration starts [incarn=0]
I'm the voting node
Send my bitmap to master 0
Rcfg confirmation is received from master 0
I agree with the rcfg confirmation
*** 2008-01-16 12:16:25.233
Reconfiguration completes [incarn=2]
*** 2008-01-19 04:50:21.933
Instance is terminating by process 14215 [ospid=oracle@DELL-RAC02 (LMON)]
Performing diagnostic data dump for this instance
CMCLI WARNING: CommonContextCleanup: closing comm port
DIAG detachs from CM
error 29723 detected in background process
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-29723: Failed to attach to the global enqueue service (status=32)

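Judging from the error description on MetaLink, the problem appears to be caused by libskgxn9.so being inconsistent between the two instances of the RAC environment.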
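
One way to check for this kind of inconsistency is sketched below. It relies on the make output shown later in this post, where the `rac_on` target copies libcmdll.so over libskgxn9.so; treating "the two files are identical" as the healthy state is my own assumption, not something stated in the MetaLink note. The function name check_skgxn_lib is hypothetical.

```shell
#!/bin/sh
# check_skgxn_lib LIB_DIR
# Report whether libskgxn9.so is byte-identical to libcmdll.so in LIB_DIR.
# Prints "match" and returns 0, or prints "MISMATCH" and returns 1.
check_skgxn_lib() {
    lib_dir="$1"
    if cmp -s "$lib_dir/libcmdll.so" "$lib_dir/libskgxn9.so"; then
        echo "match"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

On each node this could be run as `check_skgxn_lib "$ORACLE_HOME/lib"`. To compare the two nodes against each other, run `cksum` on libskgxn9.so on both and compare the output by eye.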

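Resolution:

1. Because the system was upgraded from Oracle 9.2.0.4 to Oracle 9.2.0.7, and 9.2.0.7 ships only the RDBMS software with no ORACM upgrade, ORACM must additionally be upgraded from oracm 9.2.0.2 to oracm 9.2.0.6.0.52 using the 9.2.0.6 patchset. Be sure to follow the README strictly.

2. After upgrading the Oracle RDBMS and ORACM 9.2.0.6, scripts such as catproc.sql still have to be run to update the data dictionary. These are all listed in the README.

3. Some bugs are not published publicly and cannot be found on Google or Baidu; they can only be seen on MetaLink. The alert log alone is also not enough: it has to be read together with the trace (trc) log, the DIAG log, the CM log, and so on.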
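
As a rough illustration of reading the logs together, the sketch below searches several log files at once for the error strings that appear in this post and prints each hit with its file name and line number. The function name find_cluster_errors is made up, and the pattern list is only what showed up in this incident, not a complete list of RAC errors.

```shell
#!/bin/sh
# find_cluster_errors FILE...
# Search the given log files for the error strings seen in this incident.
# /dev/null is appended so grep prints the file name even for a single file.
find_cluster_errors() {
    grep -n \
        -e 'ORA-29702' \
        -e 'ORA-29723' \
        -e 'terminating instance' \
        -e 'write failed with error' \
        "$@" /dev/null
}
```

It would be called with the alert log, the LMON and DIAG trace files, and the CM log as arguments, so that matching timestamps can be lined up across the files.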

Bug 4390716 Linux: "CMCLI WARNING" messages after applying 9.2.0.6 / 7

This note gives a brief overview of bug 4390716.

Affects:

Product (Component)

Oracle Server (Rdbms)

Range of versions believed to be affected

Versions >= 9.2.0.6

Versions confirmed as being affected

  • 9.2.0.6
  • 9.2.0.7

Platforms affected

  • Linux 32bit

It is believed to be a regression in default behaviour thus:
Regression introduced in 9.2.0.6

Fixed:

This issue is fixed in

  • (None Specified)

Symptoms:

Related To:

  • RAC (Real Application Clusters) / OPS

Description

After applying 9.2.0.6 or 9.2.0.7 Patch Set on Linux
platforms then RAC installations may start reporting
numerous errors to trace files of the form:
  CMCLI WARNING: CMInitContext:  init ctx(0xae5c9a4)
  CMCLI WARNING: CommonContextCleanup:  closing comm port
This can lead to disk full and instance crash scenarios.
Workaround:
  After installation of the Patch Set ensure that the
  following steps are executed on ALL nodes of the RAC
  cluster:
   cd $ORACLE_HOME/rdbms/lib
   Shut down all the instances in the OH
   make -f ins_rdbms.mk rac_on ioracle
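
Since the note warns that these messages can fill the disk, it can be useful to see which trace files are accumulating them. The following is a minimal sketch that counts "CMCLI WARNING" lines per trace file in a directory; the function name count_cmcli_warnings is hypothetical and the directory is whatever bdump location the instance uses.

```shell
#!/bin/sh
# count_cmcli_warnings DIR
# For every *.trc file in DIR that contains "CMCLI WARNING" lines,
# print "<count> <file>", with the largest count first.
count_cmcli_warnings() {
    dir="$1"
    for f in "$dir"/*.trc; do
        [ -f "$f" ] || continue
        # grep -c exits non-zero when the count is 0, hence the "|| true".
        n=$(grep -c 'CMCLI WARNING' "$f" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$f"
        fi
    done | sort -rn
}
```

On this system it would be run as `count_cmcli_warnings /u01/product/admin/orcl/bdump` before and after the relink to confirm that the warnings have stopped.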

The workaround commands from the bug note were then run on both nodes. The actual session:

DELL-RAC01$

DELL-RAC01$make -f ins_rdbms.mk rac_on ioracle

rm -f /u01/product/oracle/lib/libskgxp9.so

cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so

cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so

/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o

- Linking Oracle

rm -f /u01/product/oracle/rdbms/lib/oracle

gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`

mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO

mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle

chmod 6751 /u01/product/oracle/bin/oracle

DELL-RAC01$

DELL-RAC02$

DELL-RAC02$cd $ORACLE_HOME/rdbms/lib

DELL-RAC02$make -f ins_rdbms.mk rac_on ioracle

rm -f /u01/product/oracle/lib/libskgxp9.so

cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so

cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so

/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o

- Linking Oracle

rm -f /u01/product/oracle/rdbms/lib/oracle

gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`

mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO

mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle

chmod 6751 /u01/product/oracle/bin/oracle

DELL-RAC02$

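After two weeks of observation, the problem had not recurred. The instance crash that used to occur once every 67 days disappeared, the logs returned to normal, and no similar errors appeared in the CM log. The bug is resolved.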

For the whole process, see: http://www.itpub.net/viewthread.php?tid=922265&highlight=%2Btolywang

From the ITPUB blog, link: http://blog.itpub.net/35489/viewspace-999637/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
