[RAC] PMON: terminating the instance due to error 481

Posted by 楊奇龍 on 2011-11-22
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.2.0 and later [Release: 11.2 and later]
Information in this document applies to any platform.
Symptoms
On an 11.2.0.2+ cluster, while an instance is running on one node, starting the instance on the other node(s) fails with:
PMON (ospid: 487580): terminating the instance due to error 481
If ASM is used, the +ASMn alert log shows:
Sat Oct 01 19:19:38 2011
MMNL started with pid=21, OS id=6488362
lmon registered with NM - instance number 2 (internal mem no 1)
Sat Oct 01 19:21:37 2011
PMON (ospid: 4915562): terminating the instance due to error 481
Sat Oct 01 19:21:37 2011
System state dump requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_4915388.trc
Dumping diagnostic data in directory=[cdmp_20111001192138], requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
Sat Oct 01 19:21:38 2011
License high water mark = 1
Instance terminated by PMON, pid = 4915562
+ASMn_diag_xxx.trc trace shows:
*** 2011-10-01 19:19:37.526
Reconfiguration starts [incarn=0]

*** 2011-10-01 19:19:37.526
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
...... << repeated messages
If ASM is not used, then the DB instance can fail with the same error:
Mon Jul 04 16:22:50 2011
Starting ORACLE instance (normal)
...
Mon Jul 04 16:22:54 2011
MMNL started with pid=24, OS id=667660
starting up 1 shared server(s) ...
lmon registered with NM - instance number 2 (internal mem no 1)
Mon Jul 04 16:26:15 2011
PMON (ospid: 487580): terminating the instance due to error 481


lmon trace shows:
*** 2011-07-04 16:22:59.852
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117863 rcfgtm 5 sec
...
*** 2011-07-04 16:26:14.248
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117926 rcfgtm 200 sec


dia0 trace shows:
*** 2011-07-04 16:22:53.414
Reconfiguration starts [incarn=0]
*** 2011-07-04 16:22:53.414
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

...<< repeated message

Changes
This could happen during patching or after a node reboot.
Cause
The problem is caused by HAIP not being ONLINE on either the running node or the problem node(s).
Basically, the ASM or DB instance(s) cannot start up if they use a different cluster_interconnect than the running instance.
With HAIP ONLINE, all instances (DB and ASM) should use the HAIP IP address: 169.254.x.x.
If HAIP is OFFLINE on any node, the ASM and DB instances on that node will use the native private network address, which causes a communication problem with the instances using HAIP.
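As a quick cross-check of which network that native private address belongs to, oifcfg lists the networks registered with the cluster; the subnet flagged as cluster_interconnect is the one an instance falls back to when HAIP is OFFLINE (a minimal sketch, assuming the grid user's environment has $GRID_HOME/bin in the PATH):

as grid user:
$ oifcfg getif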

Use the following command to verify HAIP status, as the grid user:
$ crsctl stat res -t -init

Check the status of the ora.cluster_interconnect.haip resource.
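The HAIP resource can also be queried directly instead of scanning the full -init resource list (same crsctl utility, just naming the resource explicitly):

as grid user:
$ crsctl stat res ora.cluster_interconnect.haip -init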
In this example, HAIP is OFFLINE on the running node 1, so +ASM1 is using 10.1.1.1 as its cluster_interconnect, while on node 2 HAIP is ONLINE and +ASM2 is using the HAIP address 169.254.239.144 as its cluster_interconnect. This causes a communication problem between the two instances, and +ASM2 cannot start up.
alert_+ASM1.log shows:

Cluster communication is configured to use the following interface(s) for this instance
10.1.1.1

alert_+ASM2.log shows:
Cluster communication is configured to use the following interface(s) for this instance
169.254.239.144
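To pull just these lines out of each ASM alert log, a grep along the following lines can be run on each node (a sketch; the ADR path follows the trace location shown earlier in this note and may differ on your system):

on node 1:
$ grep -A 1 "Cluster communication is configured" /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
on node 2:
$ grep -A 1 "Cluster communication is configured" /u01/app/oracle/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log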
Solution
The solution is to start HAIP on all nodes before starting the ASM or DB instances, either by restarting the HAIP resource or by restarting the GI stack.
For this example, +ASM1 was started first with HAIP OFFLINE:
1. Try to start HAIP manually on node 1
as grid user:
$ crsctl start res ora.cluster_interconnect.haip -init
To verify:
$ crsctl stat res -t -init
2. If this succeeds, then restart the ora.asm resource (note: this will bring down all dependent diskgroup and database resources):
as root user:
# crsctl stop res ora.crsd -init
# crsctl stop res ora.asm -init -f
# crsctl start res ora.asm -init
# crsctl start res ora.crsd -init
Start up any dependent resources as necessary (see the example below).
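For example, the diskgroup and database resources can be brought back with srvctl (a sketch; the diskgroup name DATA, database name orcl and node name node1 are placeholders for your own resource names):

as oracle user:
$ srvctl start diskgroup -g DATA -n node1
$ srvctl start database -d orcl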
3. If the above does not help, try restarting the GI stack on node 1 and check whether HAIP comes ONLINE after that.
As root user:
# crsctl stop crs
# crsctl start crs

Check $GRID_HOME/log//agent/ohasd/orarootagent_root/orarootagent_root.log for any HAIP error.
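To scan that log quickly for HAIP-related messages, something like the following can be used (a sketch; the node-name directory under $GRID_HOME/log varies per host and is matched with a wildcard here):

$ grep -i haip $GRID_HOME/log/*/agent/ohasd/orarootagent_root/orarootagent_root.log | tail -50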
4. Once HAIP is ONLINE on node 1, proceed to start ASM on the rest of the cluster nodes and ensure HAIP is ONLINE on all nodes.
$ crsctl start res ora.asm -init
ASM and DB instances should be able to start on all nodes after the above steps.
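As a final check, the HAIP link-local addresses (169.254.x.x) should be visible on the private interface of every node (a sketch, assuming a platform where ifconfig is available):

on each node:
$ ifconfig -a | grep 169.254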

Source: ITPUB blog, http://blog.itpub.net/22664653/viewspace-711805/. Please credit the source when reposting; legal responsibility may otherwise be pursued.
