Tracing a Database Failure from a Failed Backup Found During Routine Inspection

Posted by 531968912 on 2016-05-10

Reposted from: http://www.cnblogs.com/jasoname/p/5474159.html
A backup job for one of our business systems recently failed with the following error:
Starting Control File and SPFILE Autobackup at 09-MAY-16
piece handle=c-335040995-20160509-00 comment=API Version 2.0,MMS Version 5.0.0.0
Finished Control File and SPFILE Autobackup at 09-MAY-16

sql statement: alter system archive log current

released channel: ch00

released channel: ch01

allocated channel: ch00
channel ch00: sid=491 instance=dlsc1 devtype=SBT_TAPE
channel ch00: Veritas NetBackup for Oracle - Release 7.0 (2010010501)

released channel: ch00
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-12001: could not open channel ch01
RMAN-10008: could not create channel context
RMAN-10003: unable to connect to target database
ORA-12170: TNS:Connect timeout occurred
RMAN> RMAN> 
Recovery Manager complete.
Script /oracle/nbu_scripts/hot_database_backup.sh
==== ended in error on Mon May 9 09:26:39 BEIST 2016 ====
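The backup log shows that the failure occurred after the Control File and SPFILE autobackup and the archive log switch, while the archived logs were being backed up.
RMAN-10003: unable to connect to target database means RMAN could not connect to the target database.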
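When a backup log is long, it can help to pull out just the error stack before reading further. The following is a minimal Python sketch, not part of the original procedure: the function name `extract_errors` and the sample constant are hypothetical, and the log text is copied from the output above.

```python
import re

# Matches Oracle-style message lines such as "RMAN-10003: ..." or
# "ORA-12170: ...": a prefix, a five-digit number, ": ", then the text.
ERROR_LINE = re.compile(r"^(RMAN|ORA)-(\d{5}): (.*)$")

# RMAN-00569 and RMAN-00571 only draw the banner around the error stack.
BANNER_CODES = {"RMAN-00569", "RMAN-00571"}


def extract_errors(log_text):
    """Return (code, message) pairs from an RMAN log, skipping banner lines."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue
        code = f"{match.group(1)}-{match.group(2)}"
        if code in BANNER_CODES:
            continue
        errors.append((code, match.group(3).strip()))
    return errors


# Error stack copied from the backup log shown above.
SAMPLE_LOG = """\
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-12001: could not open channel ch01
RMAN-10008: could not create channel context
RMAN-10003: unable to connect to target database
ORA-12170: TNS:Connect timeout occurred
"""

if __name__ == "__main__":
    for code, message in extract_errors(SAMPLE_LOG):
        print(code, message)
```

The last entry in the stack (here ORA-12170) is usually the most specific one, so it is a reasonable place to start the investigation.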
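Node 1 checked out fine. On node 2, however, neither sqlplus nor rman could get a working session: both connected to an instance that was not started, and startup failed.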
oracle@xxxxdb2:/oracle$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on Mon May 9 14:46:01 2016

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '/dev/rspfile'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
SQL> 
SQL> exit
Disconnected
oracle@xxxxdb2:/oracle$ rman target /

Recovery Manager: Release 10.2.0.3.0 - Production on Mon May 9 14:46:22 2016

Copyright (c) 1982, 2005, Oracle. All rights reserved.

connected to target database (not started)

RMAN> exit


Recovery Manager complete.
oracle@xxxxdb2:/oracle$ 
Check the CRS status:
oracle@xxxxdb2:/oracle$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

oracle@xxxxdb2:/oracle$ 
Hmm, could it be that CRS is not running?
Try starting CRS:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl start crs
Attempting to start CRS stack 
The CRS stack will be started shortly
I waited a while, expecting CRS to come up cleanly, but it still failed to start:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM 
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
What is going on? Run ocrcheck to look at the state of the OCR:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
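At this point the problem is clear: the storage is evidently not visible to this node.
Check the volume groups that hold the data: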
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg
rootvg
datavg
archvg
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg datavg
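0516-010 : Volume group must be varied on; use varyonvg command.
Sure enough, datavg is not varied on. Do not rush to vary it on by hand at this point,
because an inactive datavg most likely means HACMP is not running. Checking the HACMP status confirms that it is indeed down on this node: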
root@xxxxdb2:/# /usr/es/sbin/cluster/utilities/clshowsrv -v
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status 
topsvcs topsvcs inoperative
grpsvcs grpsvcs inoperative
grpglsm grpsvcs inoperative
emsvcs emsvcs inoperative
emaixos emsvcs inoperative
ctrmc rsct 262394 active
Status of the HACMP subsystems:
Subsystem Group PID Status 
clcomdES clcomdES 200944 active
clstrmgrES cluster 311646 active
Status of the optional HACMP subsystems:
Subsystem Group PID Status 
clinfoES cluster inoperative


Obtaining information via SNMP from Node: xxxxdb1...

_____________________________________________________________________________
Cluster Name: xxxxdb
Cluster State: UP
Cluster Substate: STABLE
_____________________________________________________________________________


Node Name: xxxxdb1 State: UP

Network Name: net_ether_02 State: UP

Address: 192.168.77.194 Label: xxxxdb1_priv State: UP


Node Name: xxxxdb2 State: DOWN

Network Name: net_ether_02 State:
Address: xxxx Label: xxxxdb2 State: DOWN
Address: 192.168.77.195 Label: xxxxdb2_priv State: DOWN

 

Cluster Name: xxxxdb

Resource Group Name: orarg
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Priority Override Information:
Primary Instance POL:
[MORE...5]
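The same reading of the clshowsrv output can be sketched in Python. This is only an illustration: the function names `inoperative_subsystems` and `node_states` are hypothetical, and the parsing assumes whitespace-separated columns exactly as they appear in the transcript above.

```python
import re

# "Node Name: xxxxdb2 State: DOWN" style lines from the SNMP section.
NODE_LINE = re.compile(r"Node Name:\s*(\S+)\s+State:\s*(\S+)")


def inoperative_subsystems(clshowsrv_output):
    """List subsystem names whose status column reads 'inoperative'."""
    down = []
    for line in clshowsrv_output.splitlines():
        fields = line.split()
        # Data rows look like "<subsystem> <group> [<pid>] <status>".
        if len(fields) >= 3 and fields[-1] == "inoperative":
            down.append(fields[0])
    return down


def node_states(clshowsrv_output):
    """Map each node name to the state reported on its 'Node Name' line."""
    return dict(NODE_LINE.findall(clshowsrv_output))


# Excerpt copied from the output above.
SAMPLE_OUTPUT = """\
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs inoperative
grpsvcs grpsvcs inoperative
grpglsm grpsvcs inoperative
emsvcs emsvcs inoperative
emaixos emsvcs inoperative
ctrmc rsct 262394 active
Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 200944 active
clstrmgrES cluster 311646 active
Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster inoperative
Node Name: xxxxdb1 State: UP
Node Name: xxxxdb2 State: DOWN
"""

if __name__ == "__main__":
    print(inoperative_subsystems(SAMPLE_OUTPUT))
    print(node_states(SAMPLE_OUTPUT))
```

Note that clstrmgrES shows as active here even though the node is DOWN, so in this transcript the inoperative RSCT subsystems and the node state are the more telling signals.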
That is the root cause. Start HACMP:
smitty clstart
Start Cluster Services on these nodes
After startup completes:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# lsvg datavg
VOLUME GROUP: datavg VG IDENTIFIER: 00c63cf200004c000000011d0937bc9f
VG STATE: active PP SIZE: 64 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 7990 (511360 megabytes)
MAX LVs: 256 FREE PPs: 661 (42304 megabytes)
LVs: 44 USED PPs: 7329 (469056 megabytes)
OPEN LVs: 5 QUORUM: 3
TOTAL PVs: 5 VG DESCRIPTORS: 5
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 5 AUTO ON: no
Concurrent: Enhanced-Capable Auto-Concurrent: Disabled
VG Mode: Concurrent 
Node ID: 2 Active Nodes: 1 
MAX PPs per VG: 32768 MAX PVs: 1024
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable 
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 130852
Used space (kbytes) : 3300
Available space (kbytes) : 127552
ID : 222055846
Device/File Name : /dev/rocr
Device/File integrity check succeeded

Device/File not configured

Cluster registry integrity check succeeded

datavg now appears to be varied on.
Next, start CRS:
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crsctl start crs
Attempting to start CRS stack 
The CRS stack will be started shortly
root@xxxxdb2:/oracle/product/10.2.0/crs/bin# ./crs_stat -t
Name Type Target State Host 
------------------------------------------------------------
ora.xxxx.db application ONLINE ONLINE xxxxdb1 
ora....c1.inst application ONLINE ONLINE xxxxdb1 
ora....c2.inst application ONLINE ONLINE xxxxdb2 
ora....B1.lsnr application ONLINE ONLINE xxxxdb1 
ora....db1.gsd application ONLINE ONLINE xxxxdb1 
ora....db1.ons application ONLINE ONLINE xxxxdb1 
ora....db1.vip application ONLINE ONLINE xxxxdb1 
ora....B2.lsnr application ONLINE ONLINE xxxxdb2 
ora....db2.gsd application ONLINE ONLINE xxxxdb2 
ora....db2.ons application ONLINE ONLINE xxxxdb2 
ora....db2.vip application ONLINE ONLINE xxxxdb2
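To confirm at a glance that every resource came back, the table above can be scanned for anything that is not ONLINE. A minimal Python sketch follows; the function name `resources_not_online` is hypothetical, and the parsing assumes the column order Name, Type, Target, State, Host shown above.

```python
def resources_not_online(crs_stat_output):
    """Return (name, state) for every resource row whose State is not ONLINE."""
    problems = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Resource rows start with "ora."; the header and the dashed rule
        # do not. The Host column may be missing, so only the first four
        # fields are required.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        name, state = fields[0], fields[3]
        if state != "ONLINE":
            problems.append((name, state))
    return problems


# Rows copied from the crs_stat -t output above.
SAMPLE_TABLE = """\
Name Type Target State Host
------------------------------------------------------------
ora.xxxx.db application ONLINE ONLINE xxxxdb1
ora....c1.inst application ONLINE ONLINE xxxxdb1
ora....c2.inst application ONLINE ONLINE xxxxdb2
ora....B2.lsnr application ONLINE ONLINE xxxxdb2
ora....db2.vip application ONLINE ONLINE xxxxdb2
"""

if __name__ == "__main__":
    print(resources_not_online(SAMPLE_TABLE))
```

An empty result means every listed resource reports ONLINE, which matches the state reached above.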

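OK, the problem is solved. To recap the troubleshooting path:
The backup log showed that NBU could not connect to the target database on node 2 --> sqlplus and rman both failed --> CRS failed to start --> ocrcheck failed --> datavg was not varied on --> HACMP was not started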
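The chain above runs from the top of the stack down to the bottom. One way to express the same idea is to check the captured command output from the lowest layer upward and report the first layer that shows a known failure message. The following is a minimal Python sketch under that assumption: the names `FAILURE_SIGNATURES` and `lowest_failing_layer` and the dictionary keys are hypothetical, while the messages themselves are taken from the transcripts above.

```python
import re

# (layer, key into the captured outputs, regex marking that layer as failed),
# ordered from the lowest layer of the stack to the highest.
FAILURE_SIGNATURES = [
    ("HACMP cluster services", "clshowsrv", r"^topsvcs\s+topsvcs\s+inoperative"),
    ("datavg volume group", "lsvg", r"^0516-010"),
    ("Oracle Cluster Registry", "ocrcheck", r"^PROT-602"),
    ("CRS stack", "crs_stat", r"^CRS-0184"),
    ("database instance", "sqlplus", r"^ORA-01078"),
]


def lowest_failing_layer(outputs):
    """Return the lowest layer whose captured output shows a failure, or None."""
    for layer, key, pattern in FAILURE_SIGNATURES:
        text = outputs.get(key, "")
        if re.search(pattern, text, flags=re.MULTILINE):
            return layer
    return None


# Output fragments as they appeared on node 2 before the fix.
OBSERVED = {
    "clshowsrv": "topsvcs topsvcs inoperative\ngrpsvcs grpsvcs inoperative",
    "lsvg": "0516-010 : Volume group must be varied on; use varyonvg command.",
    "ocrcheck": "PROT-602: Failed to retrieve data from the cluster registry",
    "crs_stat": "CRS-0184: Cannot communicate with the CRS daemon.",
    "sqlplus": "ORA-01078: failure in processing system parameters",
}

if __name__ == "__main__":
    print(lowest_failing_layer(OBSERVED))
```

Checking from the bottom up reflects the lesson of this case: every higher layer was failing only because HACMP was down, so fixing the lowest failing layer first avoids chasing symptoms.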

From "ITPUB Blog", link: http://blog.itpub.net/25462274/viewspace-2097381/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
