Diagnosing and Resolving Failures Related to the Oracle 11gR2 RAC HAIP Feature
ASM on Non First Node (Second or Other Node) Fails to Come up With: PMON (ospid: nnnn): terminating the instance due to error 481 [ID 1383737.1]
Modified: 2012-11-9  Type: REFERENCE  Status: PUBLISHED  Priority: 2
In this Document
Purpose
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Case2: firewall exists between nodes on private network (iptables etc)
Case3: HAIP is up on some nodes but not on all
Case4: HAIP is up on all nodes but some do not have route info
References
Applies to:
Oracle Server - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
Purpose
This note lists common causes of ASM start up failure with the following error on non-first node (second or others):
- alert_<ASMn>.log from non-first node
lmon registered with NM - instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, sid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process
Note: ASM instance terminates shortly after "lmon registered with NM"
If ASM on the non-first node was running previously, the alert log will likely contain the following from the time it originally failed:
..
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10
..
- diag trace from non-first ASM (+ASMn_diag_<pid>.trc)
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])
- alert_<ASMn>.log from first node
LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)
If the issue happens while running the root script (root.sh or rootupgrade.sh) as part of the Grid Infrastructure installation/upgrade process, the following symptoms will be present:
- root script screen output
Start of resource "ora.asm" failed
CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /ispiris-qa/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272
- $GRID_HOME/cfgtoollogs/crsconfig/rootcrs_<nodename>.log
2011-11-29 15:56:48: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
> CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
> CRS-5017: The resource action "ora.asm start" encountered the following error:
> ORA-03113: end-of-file on communication channel
> Process ID: 0
> Session ID: 0 Serial number: 0
> . For details refer to "(:CLSN00107:)" in "/ispiris-qa/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
> CRS-2674: Start of 'ora.asm' on 'racnode1' failed
> CRS-2679: Attempting to clean 'ora.asm' on 'racnode1'
> CRS-2681: Clean of 'ora.asm' on 'racnode1' succeeded
..
> CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /ispiris-qa/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource "ora.asm" failed
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Symptoms:
- $GRID_HOME/log/<nodename>/alert<nodename>.log
[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
- OS messages (optional)
Dec 6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec 6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.95.118
- ifconfig -a
..
usb0 Link encap:Ethernet HWaddr E6:1F:13:AD:EE:D3
inet addr:169.254.95.120 Bcast:169.254.95.255 Mask:255.255.255.0
..
Note: it's usb0 in this case, but it can be any other adapter which uses link local
Solution:
Link-local IP addresses must not be used by any other network on the cluster nodes. In this case, a USB network device obtained a link-local address via DHCP (usb0 shows 169.254.95.120 above, acknowledged by the DHCP server at 169.254.95.118), which disrupted HAIP routing. The solution is to blacklist the device in udev so that it is not activated automatically.
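To spot this situation quickly, the adapters that currently hold a 169.254.x.x address can be listed. The following is a minimal sketch, not part of the original note: it assumes a Linux node where `ip -o -4 addr show` is available, and `link_local_ifaces` is a hypothetical helper name. HAIP addresses themselves are expected to appear on the private interconnect adapter; any other adapter in the output (usb0 in this case) is a candidate for the conflict.

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# "<adapter> <address/prefix>" for every IPv4 link-local (169.254.x.x) address.
link_local_ifaces() {
  awk '$3 == "inet" && $4 ~ /^169\.254\./ { print $2, $4 }'
}

# Usage on a node (not executed here):
#   ip -o -4 addr show | link_local_ifaces
```

Adapters reported here that are not part of the private interconnect should be reviewed against the solution above.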
Case2: firewall exists between nodes on private network (iptables etc)
No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables, ipmon, etc.
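On Linux, one way to check for an active iptables rule set is sketched below. This is an illustration rather than a procedure from the note: it assumes an iptables version that supports `-S` (list rules), it inspects only the default filter table, and `firewall_rules` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `iptables -S` output on stdin and prints every
# line that is not a default "-P <chain> ACCEPT" policy.
# Exit status is 0 if anything was printed, 1 if the rule set is empty.
firewall_rules() {
  grep -Ev '^-P [A-Za-z_-]+ ACCEPT$' | grep -v '^$'
}

# Usage as root on each node (not executed here):
#   iptables -S | firewall_rules && echo "firewall rules present"
```

Any rule printed should be checked to confirm it does not affect traffic on the private interconnect adapter.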
Case3: HAIP is up on some nodes but not on all
Symptoms:
- alert_<+ASMn>.log for some instances
Cluster communication is configured to use the following interface(s) for this instance
10.1.0.1
- alert_<+ASMn>.log for other instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.201.65
Note: some instances are using HAIP while others are not, so they cannot talk to each other
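The comparison above can be done by pulling the interconnect address out of each ASM alert log. A minimal sketch, assuming the alert log prints the address on the line(s) immediately after the "Cluster communication is configured" message as shown in the excerpts; `interconnect_addrs` is a hypothetical helper name.

```shell
# Hypothetical helper: reads an ASM alert log on stdin and prints each
# interconnect address with a label saying whether it is a HAIP address
# (169.254.x.x) or a regular private IP.
interconnect_addrs() {
  awk '
    /Cluster communication is configured to use the following interface/ { grab = 1; next }
    grab && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
      kind = ($1 ~ /^169\.254\./) ? "HAIP" : "non-HAIP"
      print $1, kind
      next
    }
    { grab = 0 }
  '
}

# Usage (not executed here; the file name follows the alert_<+ASMn>.log pattern above):
#   interconnect_addrs < alert_+ASM2.log
```

If some nodes report HAIP and others report non-HAIP, the cluster is in the state described in this case.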
Solution:
The solution is to bring up HAIP on all nodes.
To find out HAIP status, execute the following on all nodes:
If it's offline, try to bring it up as root:
If HAIP fails to start, refer to note 1210883.1 for known issues.
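The commands for the status check and the start step are not shown in this copy of the note. The sketch below is an assumption about what they look like: it reuses the `crsctl status resource <name> -init` and `crsctl start resource <name> -init` forms that appear in the rootcrs log earlier, applied to the ora.cluster_interconnect.haip resource. The function names are hypothetical, and the parser assumes the usual `STATE=ONLINE on <node>` style of output.

```shell
# Assumed commands (not confirmed by this copy of the note); GRID_HOME must be set.
haip_status() {
  "${GRID_HOME:?set GRID_HOME first}/bin/crsctl" status resource ora.cluster_interconnect.haip -init
}

haip_start() {  # run as root
  "${GRID_HOME:?set GRID_HOME first}/bin/crsctl" start resource ora.cluster_interconnect.haip -init
}

# Hypothetical helper: extracts the first word of the STATE= line
# (for example ONLINE or OFFLINE) from crsctl status output on stdin.
haip_state() {
  awk -F= '$1 == "STATE" { split($2, w, " "); print w[1] }'
}

# Usage on each node (not executed here):
#   haip_status | haip_state
```

Running the status check on every node makes it easy to see which nodes need the start step.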
If the "up node" is not using HAIP and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node so that ASM/DB can come up on the "down node". Once a maintenance window is available, the parameter must be removed to allow HAIP to work.
Case4: HAIP is up on all nodes but some do not have route info
Symptoms:
- alert_<+ASMn>.log for all instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.xxx.xxx
- "netstat -rn" for some nodes (surviving nodes) missing HAIP route
netstat -rn
Destination Gateway Genmask Flags MSS Window irtt Iface
161.130.90.0 0.0.0.0 255.255.248.0 U 0 0 0 bond0
160.131.11.0 0.0.0.0 255.255.255.0 U 0 0 0 bond2
0.0.0.0 160.11.80.1 0.0.0.0 UG 0 0 0 bond0
The line for HAIP is missing, i.e:
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2
Note: Because the HAIP route info is missing on some nodes, HAIP is not pingable; a newly restarted node will usually have the HAIP route info
Solution:
The solution is to manually add the HAIP route info on the nodes that are missing it:
4.1. Execute "netstat -rn" on any node that has HAIP route info and locate the following:
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 bond2
Note: the first field is HAIP subnet ID and will start with 169.254.xxx.xxx, the third field is HAIP subnet netmask and the last field is private network adapter name
4.2. As root on the node that is missing the HAIP route, add the route using the subnet ID, netmask and adapter name found in step 4.1.
For example:
# route add -net 169.254.0.0 netmask 255.255.0.0 dev bond2
4.3. Start ora.crsd as root on the node that is only partially up.
The other workaround is to restart GI on the node that's missing HAIP route with "crsctl stop crs -f" and "crsctl start crs" command as root.
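Steps 4.1 and 4.2 can be sketched as a small script. This is an illustration under stated assumptions: it expects Linux-style `netstat -rn` output as shown above, and `has_haip_route` and `haip_route_cmds` are hypothetical helper names. The generated command has the same form as the `route add` example in step 4.2.

```shell
# Hypothetical helper: exit status 0 if `netstat -rn` output on stdin
# contains a 169.254.x.x route, 1 otherwise.
has_haip_route() {
  awk '$1 ~ /^169\.254\./ { found = 1 } END { exit !found }'
}

# Hypothetical helper: reads `netstat -rn` output from a healthy node and
# prints the matching "route add" command (destination, netmask, adapter).
haip_route_cmds() {
  awk '$1 ~ /^169\.254\./ { print "route add -net " $1 " netmask " $3 " dev " $NF }'
}

# Usage (not executed here):
#   netstat -rn | has_haip_route || echo "HAIP route missing on this node"
#   netstat -rn | haip_route_cmds      # on a node that has the route
# Step 4.3 is assumed to follow the -init form seen in the rootcrs log:
#   "$GRID_HOME"/bin/crsctl start resource ora.crsd -init
```

The printed command should be reviewed before it is run as root on the node that is missing the route.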
References
NOTE:1210883.1 - 11gR2 Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip
NOTE:1386709.1 - The Basics of IPv4 Subnet and Oracle Clusterware
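Summary:
2. Make sure the address starts with 169.254.
3. Make sure there is no firewall between the private networks of any nodes.
4. Make sure the ora.cluster_interconnect.haip resource has started successfully on all nodes.
5. After the ora.cluster_interconnect.haip resource has started on all nodes, make sure the 169.254.x.x addresses bound on each node can all ping each other.
Note: Before the ora.cluster_interconnect.haip resource starts, the cssd process checks the health of the private network to decide whether cssd can start; at this point the private network IP is the address configured at the operating system level. After the ora.cluster_interconnect.haip resource has started, LMON and the other processes in ora.asm check the health of private network communication to decide whether the clustered ora.asm can start; at this point the private network IP address is 169.254.x.x. If one or more 169.254.x.x addresses are unreachable between nodes, the result is effectively a split brain: the ASM instance can run on only some of the nodes, and on a node where the ASM instance cannot start, neither Clusterware nor the database instances can start.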
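The ping check in item 5 can be sketched as a loop over the HAIP addresses of the other nodes. This is a minimal illustration, not a tool from the note: `check_haip_peers` is a hypothetical helper, the addresses are supplied by the caller, and the default `ping -c 2 -W 2` options assume Linux iputils. The `PING_CMD` variable exists only so the probe command can be substituted.

```shell
# Hypothetical helper: pings each 169.254.x.x address given as an argument.
# Prints OK/FAIL per address, skips non-link-local addresses, and returns
# non-zero if any HAIP address is unreachable.
check_haip_peers() {
  failed=0
  for addr in "$@"; do
    case "$addr" in
      169.254.*) ;;
      *) echo "SKIP $addr (not link-local)"; continue ;;
    esac
    if ${PING_CMD:-ping -c 2 -W 2} "$addr" >/dev/null 2>&1; then
      echo "OK $addr"
    else
      echo "FAIL $addr"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on each node (not executed here; the address is the example from Case3):
#   check_haip_peers 169.254.201.65
```

A FAIL on any node pair points back to Case1, Case2 or Case4 above.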
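When HAIP built from multiple network adapters is used on GI 11.2.0.2 or later, the adapters should be on different subnets. If all adapters are on the same subnet, unplugging one of them may cause the node to be evicted. For details, see the best practices: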
--end--
From the ITPUB blog, link: http://blog.itpub.net/29618264/viewspace-2146701/. If you repost this article, please credit the source; otherwise legal liability will be pursued.