對於orabbix報出的警報，自己都是心懷敬畏，因為這些表面的現象如果深入分析沒準就能有所收穫，當然目的還是解決問題，不是一味追求指標。
今天收到的報警郵件如下，然後在兩分鐘後問題自動恢復。
###
ZABBIX-監控系統:
------------------------------------
報警內容: DG_issue
------------------------------------
報警級別: PROBLEM
------------------------------------
監控專案: dg_issue：2015-10-08 03:59:38.0Log Transport ServicesErrorPING[ARC1]: Error 1089 when pinging standby (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=adb1.xxxx.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=sadb1_XPT)(INSTANCE_NAME=adb1)(SERVER=dedicated))).
------------------------------------
報警時間:2015.10.08-04:00:22
這是個一主兩備的環境，因為這個問題是在主庫的v$dataguard_status報出，而且對於1089的錯誤，oracle的解釋如下：
$ oerr ora 1089
01089, 00000, "immediate shutdown in progress - no operations are permitted"
// *Cause: The SHUTDOWN IMMEDIATE command was used to shut down
//         a running ORACLE instance, so your operations have been
//         terminated.
// *Action: Wait for the instance to be restarted, or contact your DBA.
所以還是分析了一下備庫在問題時間段內的負載情況，沒有發現明顯的抖動。
sar的結果如下：
03:50:01 AM       all      0.06      0.00      0.03      0.00      0.00     99.91
04:00:01 AM       all      0.06      0.00      0.04      0.00      0.00     99.90
04:10:01 AM       all      0.11      0.00      0.05      0.23      0.00     99.61
04:20:01 AM       all      0.07      0.00      0.04      0.00      0.00     99.90
04:30:01 AM       all      0.06      0.00      0.03      0.00      0.00     99.90
04:40:01 AM       all      0.06      0.00      0.03      0.00      0.00     99.91
04:50:01 AM       all      0.06      0.00      0.03      0.00      0.00     99.91

那麼錯誤是從主庫丟擲的，來看看主庫中的錯誤日誌。就會發現在問題時間段嘗試連線出錯了，可能備庫正在做shutdown immediate的操作。
Thu Oct 08 02:19:38 CST 2015
Thread 1 advanced to log sequence 7659 (LGWR switch)
Current log# 1 seq# 7659 mem# 0: /U01/app/oracle/oradata/adb1/redo01.log
Thu Oct 08 03:59:38 CST 2015
ARC1: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (1089)
ARC1: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
PING[ARC1]: Error 1089 when pinging standby (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=xxxx.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=sadb1_XP
T)(INSTANCE_NAME=adb1)(SERVER=dedicated))).
好了，問題的背景就是這樣，那麼是否有備庫中的crontab來觸發這類操作呢。
直白的看，好像沒有和shutdown immediate相關的操作。
$ crontab -l
2-9,12-29,31-59 * * * * . $HOME/.adb1profile;$HOME/dbadmin/scripts/lsnr_check.sh
0 1 * * 1 . $HOME/.adb1profile;$HOME/dbadmin/scripts/lsnr_log.sh
43 1 * * * . $HOME/.adb1profile;$HOME/dbadmin/scripts/expdpfull.sh
#20 1 * * * . $HOME/.adb1profile;$HOME/dbadmin/scripts/rman_full_backup.sh
因為這是一個10g的備庫，所以嘗試做expdp還是需要把備庫置為read-only狀態才可以做的。開啟日誌應用還是會有shutdown的操作，所以這個還是一個可能的疑點。

另外一個備庫，db uniq name指定錯誤，應該是s2accdb1
檢視邏輯備庫的指令碼，看到確實有dgmgrl 做的read-only的操作，這樣問題就對上了。
備庫1中的備份指令碼如下：
$ less $HOME/dbadmin/scripts/expdpfull.sh
#!/bin/bash
。。。
export expfile
dgmgrl / "edit database sadb1 set state='READ-ONLY'"
if [ $? -ne 0 ]; then
echo "set read-only failed!"
exit 1
fi
...
備庫2中的備份指令碼如下：
$ less $HOME/dbadmin/scripts/expdpfull.sh
#!/bin/bash
。。。
export expfile
dgmgrl / "edit database sadb1 set state='READ-ONLY'"
if [ $? -ne 0 ]; then
echo "set read-only failed!"
exit 1
fi
...
如果仔細檢視就會發現問題，兩個指令碼的內容是一致的，裡面所指的database是同一個備庫，也就意味著在同樣的時間裡，兩個客戶端同時發出請求，對同一個備庫做read-only的操作，這種操作還是有很大的風險的。
那麼備庫2執行指令碼的時候，把備庫1置為read-only,然後嘗試在備庫2做邏輯備份，會不會成功呢，結果肯定是不行。
可以在備庫2中看到備份指令碼執行了，但是沒有生成備份，備份日誌中只有一條錯誤警告。
-rw-r--r-- 1 oracle oinstall 159 Oct 8 01:44 accdb1_20151008.log
$ less accdb1_20151008.log

EXP-00008: ORACLE error 1219 encountered
ORA-01219: database not open: queries allowed on fixed tables/views only
EXP-00000: Export terminated unsuccessfully

問題到此確實發現了備庫中的一個大的問題，透過一個小小的報警就能有如此的收穫，其實還有更多的隱患，既然檢視了備庫的日誌，那麼其中的ora錯誤是格外需要關注的。
結果發現備庫中竟然每天都會有大量的ora錯誤，是和redo相關的錯誤。
Errors in file /U01/app/oracle/admin/adb1/bdump/adb1_mrp0_3644.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/U01/app/oracle/oradata/adb1/redo01.log'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Thu Oct 08 03:24:43 CST 2015
Errors in file /U01/app/oracle/admin/adb1/bdump/adb1_mrp0_3644.trc:
ORA-19527: physical standby redo log must be renamed
ORA-00312: online log 1 thread 1: '/U01/app/oracle/oradata/adb1/redo01.log'
Clearing online redo logfile 1 complete
Media Recovery Log /U01/app/oracle/admin/adb1/arch/1_7658_782788423.dbf
Thu Oct 08 03:24:44 CST 2015
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT NODELAY
Thu Oct 08 03:24:45 CST 2015
Media Recovery Waiting for thread 1 sequence 7659
Thu Oct 08 03:25:29 CST 2015
Redo Shipping Client Connected as PUBLIC
-- Connected User is Valid
RFS[1015]: Assigned to RFS process 3928
RFS[1015]: Identified database type as 'physical standby'
RFS[1015]: Archived Log: '/U01/app/oracle/admin/adb1/arch/1_7659_782788423.dbf'
Thu Oct 08 03:25:35 CST 2015
Media Recovery Log /U01/app/oracle/admin/adb1/arch/1_7659_782788423.dbf
Media Recovery Waiting for thread 1 sequence 7660
儘管有大量的ora錯誤，但是依然可以正常接收日誌，使用dg broker來檢視
DGMGRL> show configuration;
Configuration
Name:                adb1
Enabled:             YES
Protection Mode:     MaxPerformance
Fast-Start Failover: DISABLED
Databases:
    adb1   - Primary database
    sadb1 - Physical standby database
    s2adb1 - Physical standby database
Current status for "accdb1":
SUCCESS
可以看到驗證是正常的，但是還是有潛在的問題，那麼備庫中的錯誤怎麼和redo有關係呢。
如果使用show database verbose檢視，是能發現一個潛在的問題的。
DGMGRL> show database verbose sadb1
Database
Name:            sadb1
Role:            PHYSICAL STANDBY
Enabled:         YES
Intended State: ONLINE
Instance(s):
    adb1
Properties:
    InitialConnectIdentifier        = 'sadb1'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ARCH'
    Dependency                      = ''
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '180'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'auto'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '2'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = ''
    FastStartFailoverTarget         = ''
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'xxxx.com'
    SidName                         = 'adb1'
    LocalListenerAddress            = '(ADDRESS=(PROTOCOL=TCP)(HOST=xxxx.com)(PORT=1521))'
    StandbyArchiveLocation          = '/U01/app/oracle/admin/adb1/arch'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = '%t_%s_%r.dbf'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'
Current status for "sadb1":
SUCCESS
這個時候可以嘗試使用edit語句修改這些配置
edit database sadb1 set property LogFileNameConvert='/U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1';
edit database sadb1 set property DbFileNameConvert='/U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1';
在spfile層級修改還是需要啟動備庫的，現在spfile定義一下。
alter system set db_file_name_convert='/U01/app/oracle/oradata/adb1','/U01/app/oracle/oradata/adb1' scope=spfile;
alter system set log_file_name_convert='/U01/app/oracle/oradata/adb1','/U01/app/oracle/oradata/adb1' scope=spfile;

這個時候檢查dg broker就會丟擲下面的錯誤。
DGMGRL> show configuration;
Configuration
Name:                adb1
Enabled:             YES
Protection Mode:     MaxPerformance
Fast-Start Failover: DISABLED
Databases:
    adb1   - Primary database
    sadb1 - Physical standby database
    s2adb1 - Physical standby database
Current status for "adb1":
Warning: ORA-16608: one or more databases have warnings

對於這一類問題，如果不是很清楚，可以按照下面的思路來查。
DGMGRL> show database verbose sadb1;
Database
Name:            sadb1
Role:            PHYSICAL STANDBY
Enabled:         YES
Intended State: ONLINE
Instance(s):
    adb1

Properties:
    InitialConnectIdentifier        = 'sadb1'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ARCH'
    Dependency                      = ''
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '180'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'auto'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '2'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '/U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1'
    LogFileNameConvert              = '/U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1'
    FastStartFailoverTarget         = ''
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'xxxx.com'
    SidName                         = 'adb1'
    LocalListenerAddress            = '(ADDRESS=(PROTOCOL=TCP)(HOST=xxxx.com)(PORT=1521))'
    StandbyArchiveLocation          = '/U01/app/oracle/admin/adb1/arch'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = '%t_%s_%r.dbf'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "sadb1":
Warning: ORA-16792: configuration property value is inconsistent with database setting
然後使用下面的命令來檢視不匹配的地方。
DGMGRL> show database sadb1 'InconsistentProperties';
INCONSISTENT PROPERTIES
   INSTANCE_NAME        PROPERTY_NAME         MEMORY_VALUE         SPFILE_VALUE         BROKER_VALUE
          adb1    DbFileNameConvert                      /U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1 /U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1
          adb1   LogFileNameConvert                      /U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1 /U01/app/oracle/oradata/adb1,/U01/app/oracle/oradata/adb1
可以看到memory_value,spfile_value,broker_value還是不匹配的。
所以還是需要啟動備庫的。啟動之後問題就會自然修復，對於這類ora錯誤可以參考mos文件。
    ORA-19527 reported in Standby Database when starting Managed Recovery (文件 ID 352879.1)
所以經過以上處理，兩個備庫的處理思路就清晰了，備份也存在，日誌中的ora錯誤也能夠基本消除。

一封備庫報警郵件的分析

相關文章