A Classic RAC Failure Caused by DNS

Author: 吳偉龍 (PrudentWoo)

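1. Environment:

This RAC system was deployed four years ago. It had run well ever since without a single incident, and day to day it was essentially unattended.

OS: Red Hat Enterprise Linux 5.8 x86_64

DB: Oracle Database Enterprise Edition 11.2.0.4

GI: Oracle Grid Infrastructure 11.2.0.4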

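2. Problem description:

Yesterday, shortly before the end of the working day, the on-site staff reported a fault: the database could not be connected to, and clients were getting ORA-12547: TNS:lost contact. My first thought was a network or listener failure, but when I asked the on-site staff to run tnsping and ping, both came back normal.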

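3. Symptoms:

When I arrived on site, I first checked the database status. The database instances were down, and at first glance the log showed no obvious error:

Database log: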

Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/11.2.0.4/product/db_1
System name: Linux
Node name: db01
Release: 3.8.13-44.1.1.el6uek.x86_64
Version: #2 SMP Wed Sep 10 06:10:25 PDT 2014
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side pfile /u01/app/oracle/11.2.0.4/product/db_1/dbs/initwoo1.ora
System parameters with non-default values:
processes = 600
sessions = 922
spfile = "+DATA/woo/spfilewoo.ora"
nls_language = "SIMPLIFIED CHINESE"
nls_territory = "CHINA"
memory_target = 1584M
control_files = "+DATA/woo/controlfile/current.260.930748953"
control_files = "+FRA01/woo/controlfile/current.256.930748953"
db_block_size = 8192
compatible = "11.2.0.4.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA01"
db_recovery_file_dest_size= 4407M
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=wooXDB)"
remote_listener = "scan.prudentwoo.com:1521"
audit_file_dest = "/u01/app/oracle/admin/woo/adump"
audit_trail = "DB"
db_name = "woo"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.51.38
169.254.243.157
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Dec 16 15:24:55 2016
USER (ospid: 4044): terminating the instance due to error 119
Instance terminated by USER, pid = 4044

Database status:

[oracle@db01 ~]$ crsctl status res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BAK01.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.DATA.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.FRA01.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.LISTENER.lsnr
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.OCR_VOT.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.asm
ONLINE ONLINE db01 Started
ONLINE ONLINE db02 Started
ora.gsd
OFFLINE OFFLINE db01
OFFLINE OFFLINE db02
ora.net1.network
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.ons
ONLINE ONLINE db01
ONLINE ONLINE db02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE db02
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE db01
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE db01
ora.cvu
1 ONLINE ONLINE db01
ora.db01.vip
1 ONLINE ONLINE db01
ora.db02.vip
1 ONLINE ONLINE db02
ora.oc4j
1 ONLINE ONLINE db01
ora.scan1.vip
1 ONLINE ONLINE db02
ora.scan2.vip
1 ONLINE ONLINE db01
ora.scan3.vip
1 ONLINE ONLINE db01
ora.woo.db
1 ONLINE OFFLINE Instance Shutdown
2 ONLINE OFFLINE Instance Shutdown

[oracle@db01 ~]$ srvctl status database -d woo
Instance woo1 is not running on node db01
Instance woo2 is not running on node db02

4. Starting the database manually:

[oracle@db01 trace]$ srvctl start database -d woo
PRCR-1079 : Failed to start resource ora.woo.db
CRS-5017: The resource action "ora.woo.db start" encountered the following error:
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db02/agent/crsd/oraagent_oracle/oraagent_oracle.log".
CRS-5017: The resource action "ora.woo.db start" encountered the following error:
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db01/agent/crsd/oraagent_oracle/oraagent_oracle.log".
CRS-2674: Start of 'ora.woo.db' on 'db02' failed
CRS-2674: Start of 'ora.woo.db' on 'db01' failed
CRS-2632: There are no more servers to try to place resource 'ora.woo.db' on that would satisfy its placement policy

Log output:

alert.log:
[oracle@db01 trace]$ tail -0f alert_woo1.log
Fri Dec 16 15:37:08 2016
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 2
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
[name='eth1:1', type=1, ip=169.254.51.38, mac=00-0c-29-7c-44-ca, net=169.254.0.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]
Private Interface 'eth2:1' configured from GPnP for use as a private interconnect.
[name='eth2:1', type=1, ip=169.254.243.157, mac=00-0c-29-7c-44-d4, net=169.254.128.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
[name='eth0', type=1, ip=192.168.84.11, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:1' configured from GPnP for use as a public interface.
[name='eth0:1', type=1, ip=192.168.84.22, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:3' configured from GPnP for use as a public interface.
[name='eth0:3', type=1, ip=192.168.84.20, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:5' configured from GPnP for use as a public interface.
[name='eth0:5', type=1, ip=192.168.84.13, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
NUMA status: non-NUMA system
cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
Grp 0:
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/11.2.0.4/product/db_1
System name: Linux
Node name: db01
Release: 3.8.13-44.1.1.el6uek.x86_64
Version: #2 SMP Wed Sep 10 06:10:25 PDT 2014
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side pfile /u01/app/oracle/11.2.0.4/product/db_1/dbs/initwoo1.ora
System parameters with non-default values:
processes = 600
sessions = 922
spfile = "+DATA/woo/spfilewoo.ora"
nls_language = "SIMPLIFIED CHINESE"
nls_territory = "CHINA"
memory_target = 1584M
control_files = "+DATA/woo/controlfile/current.260.930748953"
control_files = "+FRA01/woo/controlfile/current.256.930748953"
db_block_size = 8192
compatible = "11.2.0.4.0"
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = "+FRA01"
db_recovery_file_dest_size= 4407M
thread = 1
undo_tablespace = "UNDOTBS1"
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=wooXDB)"
remote_listener = "scan.prudentwoo.com:1521"
audit_file_dest = "/u01/app/oracle/admin/woo/adump"
audit_trail = "DB"
db_name = "woo"
open_cursors = 300
diagnostic_dest = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.51.38
169.254.243.157
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Dec 16 15:37:49 2016
USER (ospid: 6043): terminating the instance due to error 119
Instance terminated by USER, pid = 6043

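5. Problem analysis:

The manual start shows that the database cannot start normally: srvctl reports ORA-00119 together with ORA-00132, and the alert log records the instance being terminated due to error 119.

The startup messages narrow the problem down to the SCAN. ORA-00119 names the REMOTE_LISTENER parameter, whose value is the SCAN name scan.prudentwoo.com:1521, and ORA-00132 says that this network name cannot be resolved. A SCAN fault is therefore what keeps the database from starting.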
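The ORA-00132 line carries the exact name that the instance failed to resolve, so it can be pulled out of saved srvctl output or an alert log mechanically. The following is only a minimal sketch: `unresolved_names` is a hypothetical helper name, not an Oracle tool, and it assumes the message text has the same form as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read text on stdin and print each distinct network
# name that appears in an ORA-00132 line.
unresolved_names() {
    sed -n "s/.*ORA-00132:.*unresolved network name '\([^']*\)'.*/\1/p" | sort -u
}

# Example input copied from the srvctl output shown earlier.
sample="ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'"

printf '%s\n' "$sample" | unresolved_names
# prints: scan.prudentwoo.com:1521
```

The printed value is the host:port pair from REMOTE_LISTENER; the part before the colon is the name to test against the resolver.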

6. Checking the SCAN configuration:

#check scan info:
[oracle@db01 ~]$ srvctl config scan
SCAN name: scan.prudentwoo.com, Network: 1/192.168.84.0/255.255.255.0/eth0
SCAN VIP name: scan1, IP: /scan.prudentwoo.com/192.168.84.21
SCAN VIP name: scan2, IP: /scan.prudentwoo.com/192.168.84.22
SCAN VIP name: scan3, IP: /scan.prudentwoo.com/192.168.84.20
[oracle@db01 ~]$ ping 192.168.84.20 -c 2
PING 192.168.84.20 (192.168.84.20) 56(84) bytes of data.
64 bytes from 192.168.84.20: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 192.168.84.20: icmp_seq=2 ttl=64 time=0.039 ms
--- 192.168.84.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.032/0.035/0.039/0.006 ms
[oracle@db01 ~]$ ping 192.168.84.21 -c 2
PING 192.168.84.21 (192.168.84.21) 56(84) bytes of data.
64 bytes from 192.168.84.21: icmp_seq=1 ttl=64 time=0.231 ms
64 bytes from 192.168.84.21: icmp_seq=2 ttl=64 time=0.292 ms
--- 192.168.84.21 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.231/0.261/0.292/0.034 ms
[oracle@db01 ~]$ ping 192.168.84.22 -c 2
PING 192.168.84.22 (192.168.84.22) 56(84) bytes of data.
64 bytes from 192.168.84.22: icmp_seq=1 ttl=64 time=0.024 ms
64 bytes from 192.168.84.22: icmp_seq=2 ttl=64 time=0.034 ms
--- 192.168.84.22 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.024/0.029/0.034/0.005 ms
[oracle@db01 ~]$ ping scan.prudentwoo.com -c 2
ping: unknown host scan.prudentwoo.com

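As we can see, all three addresses behind the SCAN respond to ping, so the SCAN VIPs themselves are up. Pinging the SCAN domain name, however, fails with "unknown host": the name cannot be resolved. The next step is therefore to look at the name service.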
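To separate "the host is down" from "the name does not resolve", it helps to test resolution on its own, without sending any packets to the target. The sketch below assumes glibc's `getent` is available on the node; `resolves` and `report_resolution` are hypothetical helper names introduced only for illustration.

```shell
#!/usr/bin/env bash
# Succeed if the system resolver (the /etc/hosts and DNS lookup order
# configured in nsswitch.conf) can resolve the given name.
resolves() {
    getent hosts "$1" >/dev/null
}

# Hypothetical wrapper: print one line per name and return non-zero if
# any of the names fails to resolve.
report_resolution() {
    local name status=0
    for name in "$@"; do
        if resolves "$name"; then
            echo "OK $name"
        else
            echo "FAILED $name"
            status=1
        fi
    done
    return "$status"
}

# Usage, with the names from this incident:
#   report_resolution scan.prudentwoo.com db01 db02
```

Because `getent hosts` goes through the same resolver path as ping, a FAILED line here corresponds to the "unknown host" message above.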

7. Checking the DNS service on both database servers; it turns out that the DNS server does not run on either of them:

#check dns client and server:
[oracle@db01 ~]$ /sbin/chkconfig --list|grep named
[oracle@db01 ~]$ ssh db02 '/sbin/chkconfig --list|grep named'
[oracle@db01 ~]$
#check dns client:
[oracle@db01 ~]$ cat /etc/resolv.conf
search prudentwoo.com
nameserver 192.168.84.15
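The nameserver entries in resolv.conf are the machines the resolver actually queries, so they are the next thing to probe. A minimal sketch, assuming the Linux iputils `ping` (for the `-c` and `-W` options); `list_nameservers` and `check_nameservers` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the nameserver addresses listed in a resolv.conf-style file
# (defaults to /etc/resolv.conf). Commented-out entries are skipped
# because their first field is not exactly "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical check: ping each nameserver once and report the result.
# ICMP reachability only shows that the host is up; it does not prove
# that the DNS service on it is answering queries.
check_nameservers() {
    local ns status=0
    for ns in $(list_nameservers "$1"); do
        if ping -c 1 -W 2 "$ns" >/dev/null 2>&1; then
            echo "reachable $ns"
        else
            echo "unreachable $ns"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_nameservers              # reads /etc/resolv.conf
#   check_nameservers /tmp/resolv  # reads another file
```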

8. Locating the real DNS server from resolv.conf; the DNS server turns out to be hung:

[oracle@db01 ~]$ ping 192.168.84.15 -c 2
PING 192.168.84.15 (192.168.84.15) 56(84) bytes of data.
From 192.168.84.11 icmp_seq=1 Destination Host Unreachable
From 192.168.84.11 icmp_seq=2 Destination Host Unreachable
--- 192.168.84.15 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3007ms
pipe 2

9. After the DNS server is repaired, the name resolves normally:

[oracle@db01 ~]$ ping scan.prudentwoo.com -c 2
PING scan.prudentwoo.com (192.168.84.21) 56(84) bytes of data.
64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=1 ttl=64 time=0.494 ms
64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=2 ttl=64 time=0.289 ms
--- scan.prudentwoo.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.289/0.391/0.494/0.104 ms
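ping shows only the one address it happened to pick. To confirm that the SCAN name again resolves to all three addresses reported by `srvctl config scan`, the resolved set can be compared with the expected set. This is a sketch under assumptions: it relies on a glibc whose `getent` supports the `ahostsv4` database, and `resolved_ipv4` and `scan_matches` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the distinct IPv4 addresses the system resolver returns for a name.
resolved_ipv4() {
    getent ahostsv4 "$1" | awk '{ print $1 }' | sort -u
}

# Hypothetical check: succeed only if NAME resolves to exactly the
# expected set of addresses (given as the remaining arguments).
scan_matches() {
    local name=$1
    shift
    local expected actual
    expected=$(printf '%s\n' "$@" | sort -u)
    actual=$(resolved_ipv4 "$name")
    if [ "$expected" = "$actual" ]; then
        echo "match: $name"
    else
        echo "mismatch: $name"
        # Unquoted on purpose: joins the addresses onto one line.
        echo "expected: $(echo $expected)"
        echo "resolved: $(echo $actual)"
        return 1
    fi
}

# Usage with the addresses reported by `srvctl config scan`:
#   scan_matches scan.prudentwoo.com 192.168.84.20 192.168.84.21 192.168.84.22
```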

10. Starting the database again:

[oracle@db01 ~]$ srvctl start database -d woo
[oracle@db01 ~]$ srvctl status database -d woo
Instance woo1 is running on node db01
Instance woo2 is running on node db02
[oracle@db01 ~]$ srvctl config database -d woo
Database unique name: woo
Database name: woo
Oracle home: /u01/app/oracle/11.2.0.4/product/db_1
Oracle user: oracle
Spfile: +DATA/woo/spfilewoo.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: woo
Database instances: woo1,woo2
Disk Groups: DATA,FRA01
Mount point paths:
Services:
Type: RAC
Database is administrator managed

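The database starts normally, and the fault is fixed.

Looking back at how the problem was handled, this fault tests more than database troubleshooting skills: it also draws on knowledge of installation and of how the stack basically works, and above all on a solid understanding of the DNS service.

I also count myself lucky. Thanks to professional instinct, I spotted the root cause, ORA-00132, in a pile of error messages right away instead of starting from the other errors.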
