RMAN備份到帶庫被阻塞
檢查備份的輸出指令碼的時候發現,備份沒有按時完成。
備份指令碼輸出如下:
bash-3.00$ more /data/backup/backup_tradedb_100102.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat Jan 2 23:00:00 CST 2010 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> 10.2.0.3.0 - Production on Sat Jan 2 23:00:01 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=96 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 02-JAN-10
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 02-JAN-10
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 02-JAN-10
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
顯然備份從1月2日開始,就沒有結束,登陸資料庫,檢查對應的程式:
bash-3.00$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on 星期五 1月 8 14:16:53 2010
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
連線到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> set pages 100 lines 120
SQL> col event format a30
SQL> col client_info format a20
SQL> select sid, client_info, logon_time, event
2 from v$session
3 where program like 'rman%';
SID CLIENT_INFO LOGON_TIME EVENT
---------- -------------------- ------------------- ------------------------------
96 rman channel=C1 2010-01-02 23:00:05 Backup: sbtbackup
98 2010-01-02 23:00:02 SQL*Net message from client
288 rman channel=C2 2010-01-02 23:00:06 Backup: sbtbackup
295 2010-01-02 23:00:02 SQL*Net message from client
資料庫中有4個rman相關的會話,其中98和295是父會話,而96和288分別是CHANNEL C1和C2對應的會話。可以看到,問題出現在sbtbackup過程中。
從上面的備份輸出來看,錯誤出現在帶庫的第一個檔案上,也就是說第一個檔案的備份就出現了長時間的等待。
這個備份的指令碼已經執行了一兩年了,而2010年第一次向帶庫中進行備份就報錯了,莫非帶庫備份軟體也存在類似千年蟲的問題。
先中止備份程式,測試一下問題是否可以重現,並檢查問題是與備份有關,還是與帶庫或備份軟體有關:
SQL> select 'kill -9 ' || spid
2 from v$process p, v$session s
3 where p.addr = s.paddr
4 and sid in (98, 295, 96, 288);
'KILL-9'||SPID
--------------------
kill -9 12879
kill -9 12998
kill -9 12885
kill -9 12968
已選擇4行。
SQL> host
$ kill -9 12879
$ kill -9 12998
$ kill -9 12885
$ kill -9 12968
$ exit
bash-3.00$ rman target /
恢復管理器: Release 10.2.0.3.0 - Production on 星期五 1月 8 14:40:07 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
連線到目標資料庫: TRADEDB (DBID=4181457554)
RMAN> run
2> {
3> allocate channel c1 device type disk format '/data/backup/tradedb/%U';
4> backup tablespace users;
5> }
使用目標資料庫控制檔案替代恢復目錄
分配的通道: c1
通道 c1: sid=120 例項=tradedb1 devtype=DISK
啟動 backup 於 08-1月 -10
通道 c1: 啟動全部資料檔案備份集
通道 c1: 正在指定備份集中的資料檔案
輸入資料檔案 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在啟動段 1 於 08-1月 -10
通道 c1: 已完成段 1 於 08-1月 -10
段控制程式碼=/data/backup/tradedb/gjl2uv84_1_1 標記=TAG20100108T144108 註釋=NONE
通道 c1: 備份集已完成, 經過時間:00:00:01
完成 backup 於 08-1月 -10
釋放的通道: c1
RMAN> run
2> {
3> allocate channel c1 device type sbt;
4> backup tablespace users;
5> }
分配的通道: c1
通道 c1: sid=120 例項=tradedb1 devtype=SBT_TAPE
通道c1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
啟動 backup 於 08-1月 -10
通道 c1: 啟動全部資料檔案備份集
通道 c1: 正在指定備份集中的資料檔案
輸入資料檔案 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在啟動段 1 於 08-1月 -10
不出所料,備份到磁碟沒有問題,而如果向磁帶備份,則備份程式被掛起。
看來問題多半出現在帶庫或備份軟體NETBACKUP的設定上。
檢查NETBACKUP的日誌,果然發現了錯誤資訊:
TIME SERVER/CLIENT TEXT
01/07/2010 14:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 14:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:25 ahrac1 - abnormal exit, but EXIT_Status == 0
.
.
.
檢查發現機械手處於DISABLE狀態,嘗試啟動時報錯,錯誤資訊位:
robot inventory failed: unable to sense robotic device 202
看來可能是硬體的問題了。嘗試重啟NETBACKUP服務,問題沒有解決。
仔細檢查了以往的日誌,結果意外的發現12月26日的備份也出現了錯誤:
bash-3.00$ more /data/backup/backup_tradedb_091226.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat Dec 26 23:00:00 CST 2009 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=118 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 26-DEC-09
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 26-DEC-09
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 26-DEC-09
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
RMAN-03009: failure of backup command on C1 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="fvl1b6rb_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece (/data/backup/tradedb/fvl1b6rb_1_1)
channel C1 disabled, job failed on it will be run on another channel
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on C2 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="g1l1b71q_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece (/data/backup/tradedb/g1l1b71q_1_1)
看來這個問題存在時間不短了,檢查系統日誌,結果發現大量的硬體錯誤資訊:
# dmesg
Fri Jan 8 15:50:23 CST 2010
Jan 8 15:06:50 ahrac1 tldcd[877]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Jan 8 15:06:50 ahrac1 last message repeated 1 time
Jan 8 15:06:50 ahrac1 tldcd[877]: [ID 406877 daemon.error] TLD(0) mode_sense ioctl() failed: I/O error
Jan 8 15:06:50 ahrac1 tldd[857]: [ID 320639 daemon.error] TLD(0) unavailable: initialization failed: Unable to sense robotic device
Jan 8 15:06:51 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Jan 8 15:06:51 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.
檢查過往的日誌,發現在12月20日的時候出現的錯誤:
Dec 20 22:24:51 ahrac1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Link OFFLINE
Dec 20 22:24:51 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:24:51 ahrac1 SCSI transport failed: reason 'tran_err': retrying command
Dec 20 22:26:01 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:01 ahrac1 SCSI transport failed: reason 'timeout': retrying command
Dec 20 22:26:21 ahrac1 fctl: [ID 517869 kern.warning] WARNING: fp(0)::OFFLINE timeout
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion CK200070800845 port SP A3.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=2 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice] Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=1 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice] Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:40 ahrac1 transport rejected fatal error
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion CK200070800845 port SP B3.
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10700 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=2 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=1 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10600 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10300 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1):
Dec 20 22:26:40 ahrac1 transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1) offline
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0):
Dec 20 22:26:40 ahrac1 transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0) offline
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/sgen@w500e09e00b3b2000,0 (sgen0) offline
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.
看來是出現了硬體問題,僅靠調整軟體已經不起作用了。
最終由廠家人員現場診斷發現帶庫的控制器、機械手等多處都出現了硬體故障,最終透過更換了新的帶庫解決了錯誤。
本以為是軟體配置的問題,沒想到最終檢測到硬體的錯誤。
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/4227/viewspace-627279/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- RMAN備份恢復——備份到帶庫的效能
- RMAN備份之備份多個備份集到帶庫(三)
- RMAN備份之備份多個備份集到帶庫(二)
- RMAN備份之備份多個備份集到帶庫(一)
- RMAN備份恢復——備份到帶庫的效能(二)
- RMAN備份多個備份集到帶庫的小bug
- rman 備份 全庫
- RMAN說,我能備份(3)--RMAN全庫備份和表空間備份
- solaris下透過rman工具將備份到帶庫上的db異機恢復
- solaris下通過rman工具將備份到帶庫上的db異機恢復
- 使用RMAN備份資料庫資料庫
- RMAN全庫備份指令碼指令碼
- 如何刪除控制檔案中過去rman備份到磁帶的歸檔備份集
- rman給歸檔指定備份位置;庫先備份盤同時到帶庫;更改控制檔案自備路徑;時間點的恢復
- standby全庫rman備份檔案恢復到異機
- 【RMAN】使用增量備份更新資料庫備份映象資料庫
- SAP備份資料庫到磁帶 brtools+netwoker+ADIC帶庫資料庫
- Backup And Recovery User's Guide-備份資料庫-RMAN備份概覽-RMAN備份的目的GUIIDE資料庫
- rman全庫備份備份歸檔日誌檔案
- RMAN 全庫備份和 0級備份的區別
- 【原】Oracle學習系列—資料庫備份—RMAN備份Oracle資料庫
- 利用RMAN備份重建資料庫資料庫
- RMAN備份恢復整個庫
- Backup And Recovery User's Guide-備份資料庫-RMAN備份概覽-為RMAN備份指定TAGGUIIDE資料庫
- Backup And Recovery User's Guide-備份RMAN備份-使用RMAN備份備份集GUIIDE
- HPUX系統備份到磁帶UX
- 資料庫恢復到備份之前測試_RMAN-06556資料庫
- RMAN說,我能備份(9)--RMAN增量備份與備份保留策略
- 【RMAN】使用RMAN備份將資料庫不完全恢復到指定時間點資料庫
- [RMAN]使用RMAN備份將資料庫不完全恢復到指定時間點資料庫
- Backup And Recovery User's Guide-備份資料庫-RMAN備份概覽-為RMAN備份指定格式GUIIDE資料庫
- 【RMAN】RMAN備份至ASMASM
- Mysqldump 在備庫進行備份時會阻塞備庫的sql_threadMySqlthread
- RMAN說,我能備份(14)--實戰RMAN備份
- ORACLE DG從庫 Rman備份恢復Oracle
- RMAN備份 建立catalog資料庫資料庫
- 非歸檔資料庫RMAN備份資料庫
- rman資料庫全庫備份與恢復資料庫