[RAC] PMON: terminating the instance due to error 481

Posted by 楊奇龍 on 2011-11-22
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.2.0 and later [Release: 11.2 and later]
Information in this document applies to any platform.
Symptoms
On an 11.2.0.2+ cluster, while an instance is running on one node, starting the instance on the other node(s) fails with:
PMON (ospid: 487580): terminating the instance due to error 481
If ASM is used, the +ASMn alert log shows:
Sat Oct 01 19:19:38 2011
MMNL started with pid=21, OS id=6488362
lmon registered with NM - instance number 2 (internal mem no 1)
Sat Oct 01 19:21:37 2011
PMON (ospid: 4915562): terminating the instance due to error 481
Sat Oct 01 19:21:37 2011
System state dump requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_4915388.trc
Dumping diagnostic data in directory=[cdmp_20111001192138], requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
Sat Oct 01 19:21:38 2011
License high water mark = 1
Instance terminated by PMON, pid = 4915562
+ASMn_diag_xxx.trc trace shows:
*** 2011-10-01 19:19:37.526
Reconfiguration starts [incarn=0]

*** 2011-10-01 19:19:37.526
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
...... << repeated messages
If ASM is not used, then the DB instance can fail with the same error:
Mon Jul 04 16:22:50 2011
Starting ORACLE instance (normal)
...
Mon Jul 04 16:22:54 2011
MMNL started with pid=24, OS id=667660
starting up 1 shared server(s) ...
lmon registered with NM - instance number 2 (internal mem no 1)
Mon Jul 04 16:26:15 2011
PMON (ospid: 487580): terminating the instance due to error 481


lmon trace shows:
*** 2011-07-04 16:22:59.852
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117863 rcfgtm 5 sec
...
*** 2011-07-04 16:26:14.248
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117926 rcfgtm 200 sec


dia0 trace shows:
*** 2011-07-04 16:22:53.414
Reconfiguration starts [incarn=0]
*** 2011-07-04 16:22:53.414
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

...<< repeated message

Changes
This could happen during patching or after a node reboot.
Cause
The problem is caused by HAIP not being ONLINE on either the running node or the problem node(s).
Basically, the ASM or DB instance(s) cannot start up if they use a different cluster_interconnect than the running instance.
With HAIP ONLINE, all instances (DB and ASM) should use the HAIP IP address: 169.254.x.x.
If HAIP is OFFLINE on any node, the ASM and DB instances on that node will use the native private network address, which causes a communication problem with the instances using HAIP.
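As a quick cross-check of which network that native private address belongs to, oifcfg lists the networks registered with the cluster; the subnet flagged as cluster_interconnect is the one an instance falls back to when HAIP is OFFLINE (a minimal sketch, assuming the grid user's environment has $GRID_HOME/bin in the PATH):

as grid user:
$ oifcfg getif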

Use the following command to verify HAIP status, as the grid user:
$ crsctl stat res -t -init

Check the status of the ora.cluster_interconnect.haip resource.
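The HAIP resource can also be queried directly instead of scanning the full -init resource list (same crsctl utility, just naming the resource explicitly):

as grid user:
$ crsctl stat res ora.cluster_interconnect.haip -init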
In this example, HAIP is OFFLINE on the running node 1, so +ASM1 is using 10.1.1.1 as its cluster_interconnect, while on node 2 HAIP is ONLINE and +ASM2 is using the HAIP address 169.254.239.144 as its cluster_interconnect. This causes a communication problem between the two instances, and +ASM2 cannot start up.
alert_+ASM1.log shows:

Cluster communication is configured to use the following interface(s) for this instance
10.1.1.1

alert_+ASM2.log shows:
Cluster communication is configured to use the following interface(s) for this instance
169.254.239.144
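To pull just these lines out of each ASM alert log, a grep along the following lines can be run on each node (a sketch; the ADR path follows the trace location shown earlier in this note and may differ on your system):

on node 1:
$ grep -A 1 "Cluster communication is configured" /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
on node 2:
$ grep -A 1 "Cluster communication is configured" /u01/app/oracle/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log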
Solution
The solution is to start HAIP on all nodes before starting the ASM or DB instances, either by restarting the HAIP resource or by restarting the GI stack.
For this example, +ASM1 was started first with HAIP OFFLINE:
1. Try to start HAIP manually on node 1
as grid user:
$ crsctl start res ora.cluster_interconnect.haip -init
To verify:
$ crsctl stat res -t -init
2. If this succeeds, then restart the ora.asm resource (note: this will bring down all dependent diskgroup and database resources):
as root user:
# crsctl stop res ora.crsd -init
# crsctl stop res ora.asm -init -f
# crsctl start res ora.asm -init
# crsctl start res ora.crsd -init
Start up any dependent resources as necessary (see the example below).
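For example, the diskgroup and database resources can be brought back with srvctl (a sketch; the diskgroup name DATA, database name orcl and node name node1 are placeholders for your own resource names):

as oracle user:
$ srvctl start diskgroup -g DATA -n node1
$ srvctl start database -d orcl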
3. If the above does not help, try restarting the GI stack on node 1 and check whether HAIP comes ONLINE after that.
As root user:
# crsctl stop crs
# crsctl start crs

Check $GRID_HOME/log//agent/ohasd/orarootagent_root/orarootagent_root.log for any HAIP error.
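To scan that log quickly for HAIP-related messages, something like the following can be used (a sketch; the node-name directory under $GRID_HOME/log varies per host and is matched with a wildcard here):

$ grep -i haip $GRID_HOME/log/*/agent/ohasd/orarootagent_root/orarootagent_root.log | tail -50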
4. Once HAIP is ONLINE on node 1, proceed to start ASM on the rest of the cluster nodes and ensure HAIP is ONLINE on all nodes.
$ crsctl start res ora.asm -init
ASM and DB instances should be able to start on all nodes after the above steps.
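As a final check, the HAIP link-local addresses (169.254.x.x) should be visible on the private interface of every node (a sketch, assuming a platform where ifconfig is available):

on each node:
$ ifconfig -a | grep 169.254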

Source: ITPUB blog, http://blog.itpub.net/22664653/viewspace-711805/. Please credit the source when reposting; legal responsibility may otherwise be pursued.
