A Classic RAC Failure Caused by DNS
Author: 吳偉龍 (PrudentWoo)
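1. Environment:
This RAC system was deployed four years ago. It had run well ever since without any incident and was essentially unattended day to day.
OS: Red Hat Enterprise Linux 5.8 x86_64
DB: Oracle Database Enterprise Edition 11.2.0.4
GI: Oracle Grid Infrastructure 11.2.0.4
2. Problem description:
Yesterday, shortly before the end of the workday, the on-site staff reported a failure: the database could not be connected to, and clients received ORA-12547: TNS:lost contact. My first thought was a network or listener fault, so I asked them to run tnsping and ping; both came back normal.
3. Symptoms:
After arriving on site, I first checked the state of the database. The database instances were down, and the log showed no obvious error:
Database log: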
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/11.2.0.4/product/db_1
System name: Linux
Node name: db01
Release: 3.8.13-44.1.1.el6uek.x86_64
Version: #2 SMP Wed Sep 10 06:10:25 PDT 2014
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side pfile /u01/app/oracle/11.2.0.4/product/db_1/dbs/initwoo1.ora
System parameters with non-default values:
processes = 600
sessions = 922
spfile = "+DATA/woo/spfilewoo.ora"
nls_language = "SIMPLIFIED CHINESE"
nls_territory = "CHINA"
memory_target = 1584M
control_files = "+DATA/woo/controlfile/current.260.930748953"
control_files = "+FRA01/woo/controlfile/current.256.930748953"
db_block_size = 8192
compatible = "11.2.0.4.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA01"
db_recovery_file_dest_size= 4407M
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=wooXDB)"
remote_listener = "scan.prudentwoo.com:1521"
audit_file_dest = "/u01/app/oracle/admin/woo/adump"
audit_trail = "DB"
db_name = "woo"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.51.38
169.254.243.157
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Dec 16 15:24:55 2016
USER (ospid: 4044): terminating the instance due to error 119
Instance terminated by USER, pid = 4044
Database status:
[oracle@db01 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BAK01.dg
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.DATA.dg
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.FRA01.dg
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.LISTENER.lsnr
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.OCR_VOT.dg
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.asm
               ONLINE  ONLINE       db01                     Started
               ONLINE  ONLINE       db02                     Started
ora.gsd
               OFFLINE OFFLINE      db01
               OFFLINE OFFLINE      db02
ora.net1.network
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
ora.ons
               ONLINE  ONLINE       db01
               ONLINE  ONLINE       db02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       db02
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       db01
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       db01
ora.cvu
      1        ONLINE  ONLINE       db01
ora.db01.vip
      1        ONLINE  ONLINE       db01
ora.db02.vip
      1        ONLINE  ONLINE       db02
ora.oc4j
      1        ONLINE  ONLINE       db01
ora.scan1.vip
      1        ONLINE  ONLINE       db02
ora.scan2.vip
      1        ONLINE  ONLINE       db01
ora.scan3.vip
      1        ONLINE  ONLINE       db01
ora.woo.db
      1        ONLINE  OFFLINE                               Instance Shutdown
      2        ONLINE  OFFLINE                               Instance Shutdown
[oracle@db01 ~]$ srvctl status database -d woo
Instance woo1 is not running on node db01
Instance woo2 is not running on node db02
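With this many resources, the rows that matter are the ones whose target is ONLINE but whose state is OFFLINE. The following is a minimal sketch of filtering the `crsctl status res -t` text for such resources. It assumes the layout shown above (resource names start at column one with `ora.`, and the TARGET field is immediately followed by STATE); the function name `offline_resources` is my own illustration, not an Oracle tool.

```shell
# Hypothetical helper: read `crsctl status res -t` output on stdin and print
# each resource that has at least one row with TARGET=ONLINE, STATE=OFFLINE.
offline_resources() {
    awk '
        # A line starting with "ora." names the resource for the rows below it.
        /^ora\./ { res = $1; next }
        {
            # The first ONLINE/OFFLINE field on a row is TARGET, the next is STATE.
            for (i = 1; i < NF; i++) {
                if ($i == "ONLINE" || $i == "OFFLINE") {
                    if ($i == "ONLINE" && $(i + 1) == "OFFLINE" && !seen[res]++)
                        print res
                    break
                }
            }
        }
    '
}
```

Run as `crsctl status res -t | offline_resources`. Resources such as ora.gsd, whose target is itself OFFLINE, are deliberately skipped because they are not expected to be running.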
4. Starting the database manually:
[oracle@db01 trace]$ srvctl start database -d woo
PRCR-1079 : Failed to start resource ora.woo.db
CRS-5017: The resource action "ora.woo.db start" encountered the following error:
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db02/agent/crsd/oraagent_oracle/oraagent_oracle.log".

CRS-5017: The resource action "ora.woo.db start" encountered the following error:
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db01/agent/crsd/oraagent_oracle/oraagent_oracle.log".

CRS-2674: Start of 'ora.woo.db' on 'db02' failed
CRS-2674: Start of 'ora.woo.db' on 'db01' failed
CRS-2632: There are no more servers to try to place resource 'ora.woo.db' on that would satisfy its placement policy
Log information:
alert.log:
[oracle@db01 trace]$ tail -0f alert_woo1.log
Fri Dec 16 15:37:08 2016
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 2
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
[name='eth1:1', type=1, ip=169.254.51.38, mac=00-0c-29-7c-44-ca, net=169.254.0.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]
Private Interface 'eth2:1' configured from GPnP for use as a private interconnect.
[name='eth2:1', type=1, ip=169.254.243.157, mac=00-0c-29-7c-44-d4, net=169.254.128.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
[name='eth0', type=1, ip=192.168.84.11, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:1' configured from GPnP for use as a public interface.
[name='eth0:1', type=1, ip=192.168.84.22, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:3' configured from GPnP for use as a public interface.
[name='eth0:3', type=1, ip=192.168.84.20, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:5' configured from GPnP for use as a public interface.
[name='eth0:5', type=1, ip=192.168.84.13, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/11.2.0.4/product/db_1
System name: Linux
Node name: db01
Release: 3.8.13-44.1.1.el6uek.x86_64
Version: #2 SMP Wed Sep 10 06:10:25 PDT 2014
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side pfile /u01/app/oracle/11.2.0.4/product/db_1/dbs/initwoo1.ora
System parameters with non-default values:
processes = 600
sessions = 922
spfile = "+DATA/woo/spfilewoo.ora"
nls_language = "SIMPLIFIED CHINESE"
nls_territory = "CHINA"
memory_target = 1584M
control_files = "+DATA/woo/controlfile/current.260.930748953"
control_files = "+FRA01/woo/controlfile/current.256.930748953"
db_block_size = 8192
compatible = "11.2.0.4.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA01"
db_recovery_file_dest_size= 4407M
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=wooXDB)"
remote_listener = "scan.prudentwoo.com:1521"
audit_file_dest = "/u01/app/oracle/admin/woo/adump"
audit_trail = "DB"
db_name = "woo"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.51.38
169.254.243.157
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Dec 16 15:37:49 2016
USER (ospid: 6043): terminating the instance due to error 119
Instance terminated by USER, pid = 6043
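5. Problem analysis:
The manual start shows that the database cannot come up: srvctl reports ORA-00132, and the alert log records the instance terminating due to error 119 (ORA-00119).
Both errors concern the REMOTE_LISTENER parameter and the name 'scan.prudentwoo.com:1521', so the problem can be narrowed down to the SCAN: a fault on the SCAN side is preventing the database from starting.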
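When the srvctl output is long, the name that failed to resolve can be pulled straight out of the ORA-00132 lines. The following is a minimal sketch, assuming the error text has exactly the form shown above; the function names `unresolved_names` and `listener_host` are hypothetical helpers of mine.

```shell
# Hypothetical helper: read srvctl or alert-log text on stdin and print each
# distinct network name quoted in an ORA-00132 message.
unresolved_names() {
    sed -n "s/.*ORA-00132:.*unresolved network name '\([^']*\)'.*/\1/p" | sort -u
}

# Hypothetical helper: strip the :port suffix from a host:port value,
# leaving only the host name that the resolver has to look up.
listener_host() {
    printf '%s\n' "${1%%:*}"
}
```

For example, `srvctl start database -d woo 2>&1 | unresolved_names` would print `scan.prudentwoo.com:1521` for the output above, and the host part returned by `listener_host` can then be handed to a resolver lookup such as `getent hosts`.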
6. Checking the SCAN configuration:
#check scan info:
[oracle@db01 ~]$ srvctl config scan
SCAN name: scan.prudentwoo.com, Network: 1/192.168.84.0/255.255.255.0/eth0
SCAN VIP name: scan1, IP: /scan.prudentwoo.com/192.168.84.21
SCAN VIP name: scan2, IP: /scan.prudentwoo.com/192.168.84.22
SCAN VIP name: scan3, IP: /scan.prudentwoo.com/192.168.84.20

[oracle@db01 ~]$ ping 192.168.84.20 -c 2
PING 192.168.84.20 (192.168.84.20) 56(84) bytes of data.
64 bytes from 192.168.84.20: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 192.168.84.20: icmp_seq=2 ttl=64 time=0.039 ms

--- 192.168.84.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.032/0.035/0.039/0.006 ms
[oracle@db01 ~]$ ping 192.168.84.21 -c 2
PING 192.168.84.21 (192.168.84.21) 56(84) bytes of data.
64 bytes from 192.168.84.21: icmp_seq=1 ttl=64 time=0.231 ms
64 bytes from 192.168.84.21: icmp_seq=2 ttl=64 time=0.292 ms

--- 192.168.84.21 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.231/0.261/0.292/0.034 ms
[oracle@db01 ~]$ ping 192.168.84.22 -c 2
PING 192.168.84.22 (192.168.84.22) 56(84) bytes of data.
64 bytes from 192.168.84.22: icmp_seq=1 ttl=64 time=0.024 ms
64 bytes from 192.168.84.22: icmp_seq=2 ttl=64 time=0.034 ms

--- 192.168.84.22 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.024/0.029/0.034/0.005 ms
[oracle@db01 ~]$ ping scan.prudentwoo.com -c 2
ping: unknown host scan.prudentwoo.com
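As we can see, all three addresses behind the SCAN respond to ping, which shows that the SCAN VIPs themselves are up. Pinging the SCAN domain name, however, fails with "unknown host": the name cannot be resolved. The next step is therefore to examine the name service, which is the likely culprit.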
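The distinction drawn here, between an address that does not answer and a name that cannot be resolved, can be read mechanically from the ping text. The following is a rough sketch, assuming ping messages worded like the ones in this post (plus "Name or service not known", which newer ping builds print for the same condition); `classify_ping` is a hypothetical name.

```shell
# Hypothetical helper: classify the text printed by `ping -c N target`.
# Reads ping output on stdin and prints one of:
#   dns-failure   the name could not be resolved
#   ok            the summary line reports 0% packet loss
#   unreachable   the target was known but nothing came back
#   unknown       none of the patterns matched (e.g. partial loss)
classify_ping() {
    ping_text="$(cat)"
    case "$ping_text" in
        *"unknown host"*|*"Name or service not known"*) echo dns-failure ;;
        # The leading ", " keeps "100% packet loss" from matching here.
        *", 0% packet loss"*) echo ok ;;
        *"100% packet loss"*|*"Host Unreachable"*) echo unreachable ;;
        *) echo unknown ;;
    esac
}
```

Run as `ping -c 2 scan.prudentwoo.com 2>&1 | classify_ping`. A result of `dns-failure` for the name combined with `ok` for every SCAN IP is the pattern seen in this incident.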
7. Checking the DNS service on the two database servers; it turns out the DNS server does not run on either of them:
#check dns client and server:
[oracle@db01 ~]$ /sbin/chkconfig --list|grep named
[oracle@db01 ~]$ ssh db02 '/sbin/chkconfig --list|grep named'
[oracle@db01 ~]$

#check dns client:
[oracle@db01 ~]$ cat /etc/resolv.conf
search prudentwoo.com
nameserver 192.168.84.15
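The nameserver entries in resolv.conf tell us which hosts to check next. A small sketch for listing them, assuming the standard `nameserver <address>` line format; the function name `list_nameservers` is hypothetical.

```shell
# Hypothetical helper: print the nameserver addresses configured in a
# resolv.conf-style file (default: /etc/resolv.conf), one per line.
# Comment lines do not match because their first field is not "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Each address printed by `list_nameservers` can then be pinged, or queried directly with `nslookup scan.prudentwoo.com <address>`, which tests the DNS service itself rather than only the host's reachability.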
8. Following the resolv.conf setting to the actual DNS server, I found that the DNS server had hung:
[oracle@db01 ~]$ ping 192.168.84.15 -c 2
PING 192.168.84.15 (192.168.84.15) 56(84) bytes of data.
From 192.168.84.11 icmp_seq=1 Destination Host Unreachable
From 192.168.84.11 icmp_seq=2 Destination Host Unreachable

--- 192.168.84.15 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3007ms
pipe 2
9. After the DNS server was repaired, the name resolves normally again:
[oracle@db01 ~]$ ping scan.prudentwoo.com -c 2
PING scan.prudentwoo.com (192.168.84.21) 56(84) bytes of data.
64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=1 ttl=64 time=0.494 ms
64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=2 ttl=64 time=0.289 ms

--- scan.prudentwoo.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.289/0.391/0.494/0.104 ms
10. Starting the database again:
[oracle@db01 ~]$ srvctl start database -d woo
[oracle@db01 ~]$ srvctl status database -d woo
Instance woo1 is running on node db01
Instance woo2 is running on node db02

[oracle@db01 ~]$ srvctl config database -d woo
Database unique name: woo
Database name: woo
Oracle home: /u01/app/oracle/11.2.0.4/product/db_1
Oracle user: oracle
Spfile: +DATA/woo/spfilewoo.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: woo
Database instances: woo1,woo2
Disk Groups: DATA,FRA01
Mount point paths:
Services:
Type: RAC
Database is administrator managed
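The database starts normally, and the failure is resolved.
Looking back at how the problem was handled, this failure tests not only the ability to troubleshoot a database but also knowledge of the installation and of how the system basically works. Above all, it tests a solid understanding of the DNS service.
I was also fortunate: thanks to professional instinct, I immediately picked the root-cause clue, ORA-00132, out of a pile of errors instead of starting from the other error messages.
Source: "ITPUB Blog", link: http://blog.itpub.net/26736162/viewspace-2130925/. If you repost this article, please credit the source; otherwise legal liability will be pursued.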