RMAN備份到帶庫被阻塞

yangtingkun發表於2010-02-16

檢查備份的輸出指令碼的時候發現,備份沒有按時完成。

 

 

備份指令碼輸出如下:

bash-3.00$ more /data/backup/backup_tradedb_100102.out

Script. /data/backup/backup_tradedb.sh
==== started on Sat Jan 2 23:00:00 CST 2010 ====


RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> 10.2.0.3.0 - Production on Sat Jan 2 23:00:01 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog

RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=96 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

Starting backup at 02-JAN-10
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 02-JAN-10
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 02-JAN-10
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1

顯然備份從12日開始,就沒有結束,登陸資料庫,檢查對應的程式:

bash-3.00$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期五 1 8 14:16:53 2010

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.


連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> set pages 100 lines 120
SQL> col event format a30                    
SQL> col client_info format a20
SQL> select sid, client_info, logon_time, event 
  2  from v$session
  3  where program like 'rman%';

       SID CLIENT_INFO          LOGON_TIME          EVENT
---------- -------------------- ------------------- ------------------------------
        96 rman channel=C1      2010-01-02 23:00:05 Backup: sbtbackup
        98                      2010-01-02 23:00:02 SQL*Net message from client
       288 rman channel=C2      2010-01-02 23:00:06 Backup: sbtbackup
       295                      2010-01-02 23:00:02 SQL*Net message from client

資料庫中有4rman相關的會話,其中98295是父會話,而96288分別是CHANNEL C1C2對應的會話。可以看到,問題出現在sbtbackup過程中。

從上面的備份輸出來看,錯誤出現在帶庫的第一個檔案上,也就是說第一個檔案的備份就出現了長時間的等待。

這個備份的指令碼已經執行了一兩年了,而2010年第一次向帶庫中進行備份就報錯了,莫非帶庫備份軟體也存在類似千年蟲的問題。

先中止備份程式,測試一下問題是否可以重現,並檢查問題是與備份有關,還是與帶庫或備份軟體有關:

SQL> select 'kill -9 ' || spid 
  2  from v$process p, v$session s
  3  where p.addr = s.paddr
  4  and sid in (98, 295, 96, 288);

'KILL-9'||SPID
--------------------
kill -9 12879
kill -9 12998
kill -9 12885
kill -9 12968

已選擇4行。

SQL> host
$ kill -9 12879
$ kill -9 12998
$ kill -9 12885
$ kill -9 12968
$ exit

bash-3.00$ rman target /

恢復管理器: Release 10.2.0.3.0 - Production on 星期五 1 8 14:40:07 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

連線到目標資料庫: TRADEDB (DBID=4181457554)

RMAN> run
2> {
3> allocate channel c1 device type disk format '/data/backup/tradedb/%U';
4> backup tablespace users;
5> }

使用目標資料庫控制檔案替代恢復目錄
分配的通道: c1
通道 c1: sid=120 例項=tradedb1 devtype=DISK

啟動 backup 08-1 -10
通道 c1: 啟動全部資料檔案備份集
通道 c1: 正在指定備份集中的資料檔案
輸入資料檔案 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在啟動段 1 08-1 -10
通道 c1: 已完成段 1 08-1 -10
段控制程式碼=/data/backup/tradedb/gjl2uv84_1_1 標記=TAG20100108T144108 註釋=NONE
通道 c1: 備份集已完成, 經過時間:00:00:01
完成 backup 08-1 -10
釋放的通道: c1

RMAN> run
2> {
3> allocate channel c1 device type sbt;  
4> backup tablespace users;
5> }

分配的通道: c1
通道 c1: sid=120 例項=tradedb1 devtype=SBT_TAPE
通道c1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

啟動 backup 08-1 -10
通道 c1: 啟動全部資料檔案備份集
通道 c1: 正在指定備份集中的資料檔案
輸入資料檔案 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在啟動段 1 08-1 -10

不出所料,備份到磁碟沒有問題,而如果向磁帶備份,則備份程式被掛起。

看來問題多半出現在帶庫或備份軟體NETBACKUP的設定上。

檢查NETBACKUP的日誌,果然發現了錯誤資訊:

    TIME            SERVER/CLIENT                      TEXT
01/07/2010 14:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 14:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
.
.
.

檢查發現機械手處於DISABLE狀態,嘗試啟動時報錯,錯誤資訊位:

robot inventory failed: unable to sense robotic device 202

看來可能是硬體的問題了。嘗試重啟NETBACKUP服務,問題沒有解決。

仔細檢查了以往的日誌,結果意外的發現1226日的備份也出現了錯誤:

bash-3.00$ more /data/backup/backup_tradedb_091226.out

Script. /data/backup/backup_tradedb.sh
==== started on Sat Dec 26 23:00:00 CST 2009 ====


RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog

RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=118 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

Starting backup at 26-DEC-09
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 26-DEC-09
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 26-DEC-09
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
RMAN-03009: failure of backup command on C1 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="fvl1b6rb_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece  (/data/backup/tradedb/fvl1b6rb_1_1)
channel C1 disabled, job failed on it will be run on another channel
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on C2 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="g1l1b71q_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece  (/data/backup/tradedb/g1l1b71q_1_1)

看來這個問題存在時間不短了,檢查系統日誌,結果發現大量的硬體錯誤資訊:

# dmesg

Fri Jan  8 15:50:23 CST 2010
Jan  8 15:06:50 ahrac1 tldcd[877]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Jan  8 15:06:50 ahrac1 last message repeated 1 time
Jan  8 15:06:50 ahrac1 tldcd[877]: [ID 406877 daemon.error] TLD(0) mode_sense ioctl() failed: I/O error
Jan  8 15:06:50 ahrac1 tldd[857]: [ID 320639 daemon.error] TLD(0) unavailable: initialization failed: Unable to sense robotic device
Jan  8 15:06:51 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Jan  8 15:06:51 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.

檢查過往的日誌,發現在1220日的時候出現的錯誤:

Dec 20 22:24:51 ahrac1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Link OFFLINE
Dec 20 22:24:51 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:24:51 ahrac1  SCSI transport failed: reason 'tran_err': retrying command
Dec 20 22:26:01 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:01 ahrac1  SCSI transport failed: reason 'timeout': retrying command
Dec 20 22:26:21 ahrac1 fctl: [ID 517869 kern.warning] WARNING: fp(0)::OFFLINE timeout
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion   CK200070800845 port SP A3.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=2 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice]    Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=1 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice]    Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:40 ahrac1  transport rejected fatal error
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion   CK200070800845 port SP B3.
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10700 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=2 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=1 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10600 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10300 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1):
Dec 20 22:26:40 ahrac1  transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1) offline
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0):
Dec 20 22:26:40 ahrac1  transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0) offline
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/sgen@w500e09e00b3b2000,0 (sgen0) offline
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.

看來是出現了硬體問題,僅靠調整軟體已經不起作用了。

最終由廠家人員現場診斷發現帶庫的控制器、機械手等多處都出現了硬體故障,最終透過更換了新的帶庫解決了錯誤。

本以為是軟體配置的問題,沒想到最終檢測到硬體的錯誤。

 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-627279/,如需轉載,請註明出處,否則將追究法律責任。

相關文章