LMON TERMINIATING WITH ERROR ORA-29702

tolywang發表於2008-01-05
Bug No.6028803
Filed02-MAY-2007Updated12-JUN-2007
ProductOracle Server - Enterprise EditionProduct Version 9.2.0.8
PlatformLinux x86Platform VersionRHAS 3
Database Version9.2.0.8Affects Platforms Port-Specific
Severity Minimal Loss of ServiceStatusNot a Bug. To Filer
Base BugN/AFixed in Product VersionNo Data
[@more@]
Problem statement:

LMON TERMINIATING WITH ERROR ORA-29702

 

*** 05/02/07 05:40 am ***

TAR:

----

.

PROBLEM:

--------

On 9.2.0.8.0 in Production:

Every now and then LMON is terminating the instance,

the following error occurs:

.

ERROR

-----------------------

ORA-29702 error occurred in Cluster Group Service operation

.

-- Steps To Reproduce:

The issue can be reproduced at will with the following steps:

1. Suddenly the error ORA-29702 appeared.

2. LMON terminated the instance.

.

DIAGNOSTIC ANALYSIS:

--------------------

Alert Log: node#2

-------------

.

Sun Apr 15 00:21:22 2007

Errors in file

/home/oracle/app/product/9.2.0.5.0/rdbms/log/somprd2_lmon_21839.trc:

ORA-29702 : error occurred in Cluster Group Service operation

Sun Apr 15 00:21:22 2007

LMON: terminating instance due to error 29702

Sun Apr 15 00:21:22 2007

System state dump is made for local instance

Sun Apr 15 00:21:27 2007

Instance terminated by LMON, pid = 21839

.

WORKAROUND:

-----------

N/A

.

RELATED BUGS:

-------------

.

REPRODUCIBILITY:

----------------

occurs every now and then.

.

TEST CASE:

----------

N/A

.

STACK TRACE:

------------

.

SUPPORTING INFORMATION:

-----------------------

.

24 HOUR CONTACT INFORMATION FOR P1 BUGS:

----------------------------------------

.

DIAL-IN INFORMATION:

--------------------

.

IMPACT DATE:

------------

.

*** 05/02/07 05:41 am ***

Log File: cm.log on node#2

-----------

.

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850771996

file = unixinc.c, line = 833 {Sun Apr 15 00:19:16 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850788369

file = unixinc.c, line = 833 {Sun Apr 15 00:19:52 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850804764

file = unixinc.c, line = 833 {Sun Apr 15 00:19:53 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850821137

file = unixinc.c, line = 833 {Sun Apr 15 00:20:41 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850837532

file = unixinc.c, line = 833 {Sun Apr 15 00:20:41 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850853905

file = unixinc.c, line = 833 {Sun Apr 15 00:20:51 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850870300

file = unixinc.c, line = 833 {Sun Apr 15 00:20:51 2007 }

>WARNING: ReadCommPort: received error=104 on recv()., tid = 756793359 file

= unixinc.c, line = 841 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 848773149

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850116629

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757022747

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756989978

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 784973857

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757055504

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 847069206

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 816824340

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756973593

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757317642

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756957208

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757006345

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757039122

file = unixinc.c, line = 833 {Sun Apr 15 00:21:2 2 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757088275

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>ERROR: WriteEventPort: write failed with error 32., tid = 757088275 file

= unixinc.c, line = 981 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756940823

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756760590

file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850886665

file = unixinc.c, line = 833 {Sun Apr 15 05:54:16 2007 }

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850903050

file = unixinc.c, line = 833 {Sun Apr 15 05:54:16 2007 }

*** 05/02/07 05:42 am ***

.

.

libskgxn9: node#1:

-rwxr-xr-x 1 oracle oinstall 65236 Apr 8 06:34

/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so

.

libskgxn9: node#2:

-rwxr-xr-x 1 oracle oinstall 64274 Mar 11 2004

/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so

.

ls -al $ORACLE_HOME/bin/ora* : node#1

-rwxr-xr-x 1 oracle oinstall 46 Nov 20 2001

/home/oracle/app/product/9.2.0.5.0/bin/oracg

-rwsr-s--x 1 oracle oinstall 53734840 Apr 8 07:23

/home/oracle/app/product/9.2.0.5.0/bin/oracle

-rwsr-s--x 1 oracle oinstall 53734840 Apr 8 07:20

/home/oracle/app/product/9.2.0.5.0/bin/oracleO

-rwxr-xr-x 1 oracle oinstall 2548 Jul 26 2003

/home/oracle/app/product/9.2.0.5.0/bin/oraenv

.

.

ls -al $ORACLE_HOME/bin/ora* : node#2

-rwxr-xr-x 1 oracle oinstall 46 Nov 20 2001

/home/oracle/app/product/9.2.0.5.0/bin/oracg

-rwsr-s--x 1 oracle oinstall 53734840 Apr 8 07:23

/home/oracle/app/product/9.2.0.5.0/bin/oracle

-rwsr-s--x 1 oracle oinstall 53735479 Apr 8 06:35

/home/oracle/app/product/9.2.0.5.0/bin/oracleO

-rwsr-s--x 1 oracle oinstall 52999216 May 15 2005

/home/oracle/app/product/9.2.0.5.0/bin/oracleO.bak

-rwxr-xr-x 1 oracle oinstall 2548 Jul 26 2003

/home/oracle/app/product/9.2.0.5.0/bin/oraenv

*** 05/09/07 08:26 am *** (CHG: Sta->16)

*** 05/09/07 08:26 am ***

uploaded the following files:

RDA.RDA_somprd1.zip

RDA.RDA_somprd2.zip

alert_traces.zip

.

Thanks in advance

*** 05/09/07 09:13 pm *** (CHG: Sta->10)

*** 05/09/07 09:14 pm ***

*** 05/14/07 05:09 am ***

This is what happenened in a previous time. The same ORA-29702 happened, it

can be seen in LMON trace log:

*** SESSION ID:(3.1) 2007-04-13 10:45:27.324

GES IPC: Receivers 3 Senders 3

GES IPC: Buffers Receive 1000 Send (i:660 b:660) Reserve 430

GES IPC: Msg Size Regular 396 Batch 2048

Batch msg size = 2048

Batching factor: enqueue replay 48, ack 53

Batching factor: cache replay 34 size per lock 56

kjxggin: receive buffer size = 32768

kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)

CMCLI WARNING: CMInitContext: init ctx(0xccfb358)

*** 2007-04-13 10:45:31.439

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 0 0.

*** 2007-04-13 10:45:31.441

Name Service frozen

kjxgmcs: Setting state to 0 1.

kjfcpiora: publish my weight 62874

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 6 2.

Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 6 3.

Name Service recovery started

Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 6 4.

Multicasted all local name entries for publish

Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 6 5.

Name Service normal

Name Service recovery done

*** 2007-04-13 10:45:32.808

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 6 6.

*** 2007-04-13 10:45:32.920

*** 2007-04-13 10:45:32.920

Reconfiguration started (old inc 0, new inc 6)

Synchronization timeout interval: 660 sec

List of nodes:

0 1

Global Resource Directory frozen

node 0

release 9 2 0 8

node 1

release 9 2 0 8

res_master_weight for node 0 is 62874

res_master_weight for node 1 is 62874

Total master weight = 125748

Dead inst

Join inst 0 1

Exist inst

Active Sendback Threshold = 50 %

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Resources and enqueues cleaned out

Resources remastered 0

0 GCS shadows traversed, 0 cancelled, 0 closed

0 GCS resources traversed, 0 cancelled

set master node info

Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 0 = 0

Update rdomain variables

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

*** 2007-04-13 10:45:34.508

0 GCS shadows traversed, 0 replayed, 0 unopened

Submitted all GCS cache requests

0 write requests issued in 31086 GCS resources

7 PIs marked suspect, 0 flush PI msgs

*** 2007-04-13 10:45:34.775

Reconfiguration complete

*** 2007-04-13 10:45:39.636

kjxgrtmc2: Member 1 thread 2 mounted

CMCLI WARNING: ReadCommPort: poll() failed

kjxggpoll: received an error event from DBALL_DB

Return code from kjxggpoll: 10

error 29702 detected in background process

ORA-29702: error occurred in Cluster Group Service operation

ksuitm: waiting for [5] seconds before killing DIAG

.

.

--

The concern here that when checking the libskgxn9.so library on both nodes,

they are not the same. So may be when the reconfiguration happened the issue

appeared.

.

The same time the ORA-29702 appeared in alert log these errors appeared in

cm.log:

>WARNING: ReadCommPort: received error=104 on recv()., tid = 756793359 file

= unixinc.c, line = 841 {Sun Apr 15 00:21:22 2007 }

>ERROR: WriteEventPort: write failed with error 32., tid = 757088275 file

= unixinc.c, line = 981 {Sun Apr 15 00:21:22 2007 }

.

This happened in the node which has a wrong libskgxn9.so library.

.

Thanks

*** 05/14/07 05:10 am *** (CHG: Sta->16)

*** 05/23/07 07:24 am ***

Hi,

Could you please provide an update?

.

Thanks

Amr

*** 05/30/07 10:45 am *** (CHG: G/P->P Asg->NEW OWNER)

*** 05/30/07 10:45 am ***

*** 06/04/07 06:42 am *** (CHG: Asg->NEW OWNER)

*** 06/04/07 06:42 am ***

*** 06/04/07 08:34 am ***

*** 06/04/07 08:34 am *** (CHG: Sta->10)

*** 06/11/07 08:43 am *** (CHG: Sta->16)

*** 06/11/07 08:43 am ***

*** 06/12/07 07:51 am ***

*** 06/12/07 07:52 am *** (CHG: Sta->32)

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/35489/viewspace-996629/,如需轉載,請註明出處,否則將追究法律責任。

相關文章