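A few days ago I ran into a backup failure: RMAN hit a corrupt archived log during the backup. It was the first time I had seen this kind of problem, so I am recording the details here.
The backup job failed. The RMAN output log showed that one archived log file was corrupt, as shown below: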
RMAN-08137: warning: archived log not deleted, needed for standby or upstream capture process
RMAN-08515: archived log file name=/eapdblog/eap_1_666_1155313416.arc thread=1 sequence=666
RMAN-08137: warning: archived log not deleted, needed for standby or upstream capture process
RMAN-08515: archived log file name=/eapdblog/eap_1_667_1155313416.arc thread=1 sequence=667
RMAN-03009: failure of backup command on dev_0 channel at 04/09/2024 09:44:50
ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: non RMAN, but media manager or vendor specific failure, error text:
Vendor specific error: OB2_EndObjectBackup() failed ERR(-2)
ORA-19599: block number 316064 is corrupt in archived log /eapdblog/eap_1_660_1155313416.arc
Validating the archived logs confirmed that eap_1_660_1155313416.arc was indeed corrupt, as shown below:
RMAN> validate archivelog all;
Starting validate at 09-APR-24
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=261 device type=DISK
channel ORA_DISK_1: starting validation of archived log
channel ORA_DISK_1: specifying archived log(s) for validation
input archived log thread=1 sequence=660 RECID=645 STAMP=1165788069
input archived log thread=1 sequence=663 RECID=648 STAMP=1165824445
input archived log thread=1 sequence=664 RECID=649 STAMP=1165828881
input archived log thread=1 sequence=665 RECID=650 STAMP=1165829178
input archived log thread=1 sequence=666 RECID=651 STAMP=1165829976
input archived log thread=1 sequence=667 RECID=652 STAMP=1165830268
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Archived Logs
=====================
Thrd Seq Status Blocks Failing Blocks Examined Name
---- ------- ------ -------------- --------------- ---------------
1 660 FAILED 8 346599 /eapdblog/eap_1_660_1155313416.arc
1 663 OK 0 382900 /eapdblog/eap_1_663_1155313416.arc
1 664 OK 0 94593 /eapdblog/eap_1_664_1155313416.arc
1 665 OK 0 1748 /eapdblog/eap_1_665_1155313416.arc
1 666 OK 0 17557 /eapdblog/eap_1_666_1155313416.arc
1 667 OK 0 4226 /eapdblog/eap_1_667_1155313416.arc
validate found one or more corrupt blocks
See trace file /eapdb/diag/rdbms/eap/eap/trace/eap_ora_917867.trc for details
Finished validate at 09-APR-24
RMAN> exit
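When many archived logs are validated at once, the FAILED rows are easy to miss in the output. The following is a minimal Python sketch that pulls them out of saved `validate archivelog all` output. It assumes the table layout shown above (thread, sequence, status, blocks failing, blocks examined, name); the names `ValidatedLog`, `parse_validate_rows` and `failed_logs` are made up for this illustration and are not part of any Oracle tool.

```python
import re
from typing import List, NamedTuple


class ValidatedLog(NamedTuple):
    thread: int
    sequence: int
    status: str
    blocks_failing: int
    blocks_examined: int
    name: str


# One row of the "List of Archived Logs" table, e.g.
# "1 660 FAILED 8 346599 /eapdblog/eap_1_660_1155313416.arc"
# Assumption: the column order matches the output shown in this post.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(OK|FAILED)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def parse_validate_rows(text: str) -> List[ValidatedLog]:
    """Return every table row found in the RMAN validate output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            thrd, seq, status, failing, examined, name = m.groups()
            rows.append(
                ValidatedLog(int(thrd), int(seq), status,
                             int(failing), int(examined), name)
            )
    return rows


def failed_logs(text: str) -> List[ValidatedLog]:
    """Keep only the archived logs that RMAN marked as FAILED."""
    return [row for row in parse_validate_rows(text) if row.status == "FAILED"]


# Two rows copied from the validate output above.
SAMPLE = """\
Thrd Seq Status Blocks Failing Blocks Examined Name
---- ------- ------ -------------- --------------- ---------------
1 660 FAILED 8 346599 /eapdblog/eap_1_660_1155313416.arc
1 663 OK 0 382900 /eapdblog/eap_1_663_1155313416.arc
"""

if __name__ == "__main__":
    for log in failed_logs(SAMPLE):
        print(log.sequence, log.blocks_failing, log.name)
```

The header and separator lines are skipped because they do not match the row pattern, so the whole RMAN log can be passed in unchanged.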
The alert log showed the following messages as well.
2024-04-08T23:15:05.730996+08:00
***
Corrupt block seq: 660 blocknum=316064.
Bad header found during backing up archived log
Data in bad block - flag:1. format:34. bno:93696. seq:649
beg:16 cks:21324
calculated check value: 21324
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
2024-04-08T23:15:21.671470+08:00
***
Corrupt block seq: 660 blocknum=316064.
Bad header found during backing up archived log
Data in bad block - flag:1. format:34. bno:93696. seq:649
beg:16 cks:21324
calculated check value: 21324
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
Reread of seq=660, blocknum=316064, file=/eapdblog/eap_1_660_1155313416.arc, found same corrupt data
2024-04-08T23:15:36.695623+08:00
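Each alert log entry above pairs the position being read (seq 660, blocknum 316064) with the values found in the bad block (bno 93696, seq 649). The Python sketch below extracts those pairs so the mismatch can be seen side by side. It is based only on the excerpt shown here: the regular expressions may need adjusting for other message variants, reading `bno` and `seq` as the block's own header values is an assumption, and `summarize_corruption` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Optional

# "Corrupt block seq: 660 blocknum=316064."
EXPECTED = re.compile(r"Corrupt block seq:\s*(\d+)\s+blocknum=(\d+)")
# "Data in bad block - flag:1. format:34. bno:93696. seq:649"
FOUND = re.compile(r"Data in bad block - .*?bno:(\d+)\.\s*seq:(\d+)")


def summarize_corruption(alert_text: str) -> List[Dict[str, Optional[int]]]:
    """Pair each 'Corrupt block' report with the values seen in the bad block."""
    reports: List[Dict[str, Optional[int]]] = []
    current = None
    for line in alert_text.splitlines():
        m = EXPECTED.search(line)
        if m:
            # A new corruption report starts here.
            current = {
                "expected_seq": int(m.group(1)),
                "expected_block": int(m.group(2)),
                "found_seq": None,
                "found_block": None,
            }
            reports.append(current)
            continue
        m = FOUND.search(line)
        if m and current is not None:
            # Values reported for the bad block itself.
            current["found_block"] = int(m.group(1))
            current["found_seq"] = int(m.group(2))
    return reports


# Lines copied from the alert log excerpt above.
SAMPLE = """\
2024-04-08T23:15:05.730996+08:00
***
Corrupt block seq: 660 blocknum=316064.
Bad header found during backing up archived log
Data in bad block - flag:1. format:34. bno:93696. seq:649
beg:16 cks:21324
calculated check value: 21324
"""

if __name__ == "__main__":
    for report in summarize_corruption(SAMPLE):
        print(report)
```

Output like this makes it easier to tell whether repeated alerts refer to the same block (as here) or to several different blocks, which is useful information to attach to a Service Request.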
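So the archived log was corrupt, but the cause was unclear. I had previously seen a case shared by others, ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1), but the current environment, listed below, does not match the one described in Doc ID 1487957.1 (the file system here is xfs, not ext4):
Operating system: Red Hat Enterprise Linux release 8.8 (Ootpa)
Database version: Oracle 19c Enterprise Edition 19.20.0.0.0
File system: xfs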
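I opened a Service Request and uploaded the various logs along with a dump of the corrupt archived log. Oracle Support eventually replied that the symptoms closely resemble two unpublished bugs (screenshot below). The problem occurs very rarely, though, and this is the first time I have run into it. Support's advice was: if it happens only occasionally, keep monitoring; if it happens frequently, change filesystemio_options to directio to work around it. The dump of the archived log was generated as follows: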
sqlplus / as sysdba
oradebug setmypid
oradebug tracefile_name
alter system dump logfile '/eapdblog/eap_1_660_1155313416.arc' VALIDATE;
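I handled it with the steps below and then took a new full RMAN backup (needed because older backups can no longer be rolled forward past sequence 660). After several more days of observation, the error has not recurred so far.
Manually delete the corrupt archived log file at the OS level, then run:
RMAN> crosscheck archivelog all;
RMAN> DELETE EXPIRED ARCHIVELOG sequence 660;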