This morning I received an alert SMS indicating a problem with log receipt on DG (Data Guard). The latest records from v$dataguard_status were:
2015-09-18 07:13:36.0  Fetch Archive Log  Error  FAL[server, ARC1]: Error 270 creating remote archivelog file 'stest'
2015-09-18 07:13:36.0  Fetch Archive Log  Error  FAL[server, ARC3]: Error 270 creating remote archivelog file 'stest'
2015-09-18 07:13:36.0  Fetch Archive Log  Error  FAL[server, ARC0]: Error 270 creating remote archivelog file 'stest'
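A check with dg broker showed that it was already reporting an error.
My first guess was that the standby's file system was full, but after connecting to the standby I found nothing wrong with the file system.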
[root@stest ~]# df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda6   6.0G  733M  4.9G   13%   /
tmpfs       32G   0     32G    0%    /dev/shm
/dev/sda1   124M  57M   62M    48%   /boot
/dev/sda2   6.0G  1.7G  4.0G   31%   /usr
/dev/sda3   4.0G  318M  3.5G   9%    /var
/dev/sda7   5.4T  2.2T  3.0T   42%   /U01
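With the file system ruled out, the next suspect was the flash recovery area that holds the archived logs: it might have hit its size limit. This is an 11g database, which should be able to keep flash recovery area usage under some control.
With that question in mind I checked the flash recovery area settings. The configured size was fairly large, so at first glance there should have been no problem.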
SQL> show parameter recover
NAME TYPE VALUE
------------------------------------ ------------- ------------------------------
db_recovery_file_dest string /U01/app/oracle/oradata/bidb/archive
db_recovery_file_dest_size big integer 400G
db_unrecoverable_scn_tracking boolean TRUE
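Note that `show parameter` only reports the configured limit, not how much of it is in use. A minimal sketch of a usage check follows, assuming `sqlplus` is on the PATH and OS authentication as sysdba works on the standby host; the function name `fra_usage` is made up for illustration.

```shell
# Hypothetical helper: report flash recovery area limit vs. usage
# for whichever instance the current ORACLE_SID points at.
# Assumes sqlplus is on PATH and "/ as sysdba" OS authentication works.
fra_usage() {
    # The quoted 'EOF' keeps the shell from expanding the $ in v$ view names.
    sqlplus -s "/ as sysdba" <<'EOF'
set linesize 200 pagesize 100
select name,
       round(space_limit / 1024 / 1024 / 1024)       as limit_gb,
       round(space_used / 1024 / 1024 / 1024)        as used_gb,
       round(space_reclaimable / 1024 / 1024 / 1024) as reclaimable_gb,
       number_of_files
  from v$recovery_file_dest;
exit
EOF
}
```

Calling `fra_usage` after exporting the intended ORACLE_SID would show the limit and the used space side by side, which is the comparison that `df` cannot give.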
Then I checked the archive destination:
SQL> show parameter log_archive_dest_1
log_archive_dest_1 string location=USE_DB_RECOVERY_FILE_DEST, valid_for=(ALL_LOGFILES,ALL_ROLES)
At this point the database alert log is a very useful tool and worth a close look.
It already contained the following messages.
************************************************************************
Creating archive destination file : /U01/app/oracle/oradata/test/archive/STEST/archivelog/2015_09_18/o1_mf_1_81698_%u_.arc (544928 blocks)
Fri Sep 18 07:19:36 2015
Errors in file /U01/app/oracle/diag/rdbms/sbidb/test/trace/test_rfs_16263.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 429496729600 bytes is 100.00% used, and has 0 remaining bytes available.
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
system command was used to delete files, then use RMAN CROSSCHECK and
DELETE EXPIRED commands.
....
************************************************************************
Creating archive destination file : /U01/app/oracle/oradata/test/archive/STEST/archivelog/2015_09_18/o1_mf_1_81699_%u_.arc (546048 blocks)
Fri Sep 18 07:13:36 2015
Errors in file /U01/app/oracle/diag/rdbms/stest/bidb/trace/test_rfs_15685.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 429496729600 bytes is 100.00% used, and has 0 remaining bytes available.
At this point the problem looks obvious.
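As a quick cross-check, the byte count in the ORA-19815 message is exactly the 400G configured in db_recovery_file_dest_size. So it was the quota that ran out, not the disk, which is why `df` still showed plenty of free space on /U01:

```shell
# Convert the byte count from the ORA-19815 message to GiB.
bytes=429496729600
echo "$((bytes / 1024 / 1024 / 1024))G"   # prints 400G
```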
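Archived logs had filled the flash recovery area. Yet we do have an archived log deletion policy in place, and the same script is widely used in other environments without any trouble.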
$ crontab -l
40 * * * * (. $HOME/.bash_profile;$HOME/xxxx/scripts/rman_trun_arch.sh)

The script's main job is to periodically check for and delete archived logs older than one day.
rman target / <<EOF
crosscheck archivelog all;
delete noprompt expired archivelog all;
delete noprompt archivelog until time "sysdate-1";
exit
EOF
exec 3>&1 4>&2 1>>${LOGFILE} 2>&1
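Nothing looks wrong in the script itself, so why were the archived logs not being deleted even though a deletion policy existed?
After some digging, the cause turned out to be the following.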
$ ps -ef|grep smon
oracle 2019 1 0 Jul28 ? 00:01:14 ora_smon_test
oracle 29478 1 0 Jul24 ? 00:01:20 ora_smon_mtest
oracle 30508 27347 0 22:50 pts/0 00:00:00 grep smon
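This server runs two standby databases, and the default ORACLE_SID is mtest, which is the other standby. The cron job sources the profile, so it only ever cleaned up mtest. In effect the test standby had no archived log deletion policy at all, and its flash recovery area usage was never released.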
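A small sketch of how this mismatch could be spotted: list the instances that are actually running and compare them with the SID the environment points at. The helper name `extract_sids` is hypothetical, and it assumes each running instance shows an `ora_smon_<SID>` process, as in the `ps` output above.

```shell
# Hypothetical helper: read `ps -ef` output on stdin, print one SID per line.
# Relies on every running instance having an ora_smon_<SID> background process.
extract_sids() {
    sed -n 's/.*[[:space:]]ora_smon_\([^[:space:]]*\)[[:space:]]*$/\1/p'
}

# Usage: compare the running instances with the SID the environment points at.
# ps -ef | extract_sids
# echo "default ORACLE_SID: ${ORACLE_SID:-<unset>}"
```

Any SID that is printed by the first command but never appears in a cleanup job is a candidate for the same problem.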
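A look at the archived logs showed that nothing had been cleaned up for nearly half a month. The problem had been building up bit by bit until it finally surfaced at this particular moment.
To free flash recovery area space as quickly as possible, I first ran the script by hand, and then set ORACLE_SID explicitly in the crontab script. That closed out the issue.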
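One way to make the SID explicit is sketched below. It reuses the RMAN commands from the original script; the function name `purge_archivelogs` and the loop over both SIDs are assumptions for illustration, not the author's actual fix.

```shell
# Hypothetical wrapper around the same RMAN commands as the original script.
# The SID is a required argument, so nothing depends on the profile default.
purge_archivelogs() {
    if [ -z "$1" ]; then
        echo "usage: purge_archivelogs <ORACLE_SID>" >&2
        return 1
    fi
    # ORACLE_SID is set for this one rman invocation only.
    ORACLE_SID="$1" rman target / <<EOF
crosscheck archivelog all;
delete noprompt expired archivelog all;
delete noprompt archivelog until time "sysdate-1";
exit
EOF
}

# Usage: one call per standby instance on the host.
# for sid in test mtest; do purge_archivelogs "$sid"; done
```

Requiring the SID as an argument means a new instance added to the host fails loudly with a usage error instead of silently inheriting whatever the profile exports.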
Checking dg broker again, the configuration now reports success.
DGMGRL> show configuration;
Configuration - dg_mbionline
  Protection Mode: MaxPerformance
  Databases:
    test  - Primary database
    stest - Physical standby database
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS
Flash recovery area usage dropped right away.
FILE_TYPE            PER_USED PER_RECLAIMABLE FILES
-------------------- -------- --------------- -----
CONTROL FILE                0               0     0
REDO LOG                  2.1               0     7
ARCHIVED LOG              1.6               0     7
BACKUP PIECE                0               0     0
IMAGE COPY                  0               0     0
FLASHBACK LOG               0               0     0
FOREIGN ARCHIVED LOG        0               0     0

This example shows that a general-purpose script can hide problems in specific scenarios, and we need to stay alert to them.

