昨天到公司之後，收到兩份封報警郵件，可以看到在早晨6:30左右主庫的v$dataguard_status檢查時發現了一個錯誤。然後再2分鐘後就自動恢復了。
一般這種問題很自然想到可能是網路出現了一些問題。因為自動恢復了，所以也不同太著急處理，於是細細看了下。
報警郵件如下：
ZABBIX-監控系統:
------------------------------------
報警內容: DG_issue
------------------------------------
報警級別: PROBLEM
------------------------------------
監控專案: dg_issue：2015-11-06 06:30:06.0Log Transport ServicesErrorPING[ARC0]: Error 1089 when pinging standby (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=test.cyou.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=stest2_XPT)(INSTANCE_NAME=test)(SERVER=dedicated))).
------------------------------------
報警時間:2015.11.06-06:31:05

報警恢復郵件如下:
ZABBIX-監控系統:
-------------------------------
監控專案: dg_issue_none
------------------------------------
主機名稱：ora_test_p_yangjr@10.127.2.132
------------------------------------
恢復時間:2015.11.06-06:33:07

這是套一主兩備的10gR2的環境，初步感覺是不是備庫的連線數超出了。檢視了資源使用情況，發現session數也都少得很。
RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes                                       37              40       1000                 1000
sessions                                        43              46       3000                 3000
enqueue_locks                                   61              98      32860                32860
enqueue_resources                               46              46      13420            UNLIMITED
檢視process確實是1000個，還遠沒有達到。
idle> show parameter process
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes                      integer     0
db_writer_processes                  integer     2
gcs_server_processes                 integer     0
job_queue_processes                  integer     10
log_archive_max_processes            integer     2
processes                            integer     1000
這個時候直接到日誌目錄下，看到產生了大量的日誌檔案。日誌檔案似乎有並行的跡象。
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p014_10824.trc
-rw-r----- 1 oracle oinstall       912 Nov 6 02:50 test_p013_10822.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p012_10820.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p011_10818.trc
-rw-r----- 1 oracle oinstall       912 Nov 6 02:50 test_p010_10816.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p009_10814.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p008_10812.trc
-rw-r----- 1 oracle oinstall       912 Nov 6 02:50 test_p007_10810.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p006_10808.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p005_10806.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p004_10804.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p003_10802.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p002_10800.trc
-rw-r----- 1 oracle oinstall       912 Nov 6 02:50 test_p001_10798.trc
-rw-r----- 1 oracle oinstall       913 Nov 6 02:50 test_p000_10796.trc
-rw-r----- 1 oracle oinstall      3005 Nov 6 02:50 test_mrp0_10794.trc
-rw-r----- 1 oracle oinstall   1013705 Nov 6 06:29 test_mmon_26762.trc
-rw-r----- 1 oracle oinstall     30361 Nov 6 06:30 test_rsm0_28626.trc
-rw-r----- 1 oracle oinstall     25111 Nov 6 06:30 test_dmon_26766.trc
-rw-r----- 1 oracle oinstall 565972219 Nov 6 06:30 drctest.log
-rw-r----- 1 oracle oinstall      1876 Nov 6 06:31 test_mrp0_11131.trc
-rw-r----- 1 oracle oinstall    106067 Nov 6 06:31 alert_test.log
這個時候有幾個問題，一個是為什麼會產生如此多的並行，但是透過資源的監控檢視中沒有看到。
另外一個就是為什麼偏偏是6:30開始報錯。
最後一個是一主兩備，一個備庫有報錯，另外一個卻沒有。
首先排除資源的使用情況，這個庫是出現了大量的看似並行的日誌，我們來看看並行日誌怎麼說。隨便抓取一個。發現應該沒有關係，裡面還有slave的字樣。
Redo thread mounted by this instance: 1
Oracle process number: 29
Unix process pid: 19714, image: oracle@test.test.com (P007)

*** 2015-11-06 15:29:13.896
*** SERVICE NAME:() 2015-11-06 15:29:13.896
*** SESSION ID:(2972.1) 2015-11-06 15:29:13.896
Slave exiting with ORA-10388 exception
KCBR: Number of read descriptors = 1024
KCBR: Media recovery blocks read (ASYNC) = 513
KCBR: Redo cache copies/changes = 4284/4284
KCBR: Influx buffers flushed = 8 times
KCBR: Reads = 31 reap (29 no-op, 0 one), 30 all
那麼為什麼是6:30起開始報錯呢，一種可能就是剛好在哪個時間點觸發了。另外一種可能就是在特定的時間裡有了觸發條件。
這個一種很自然的想法就是crontab了。
檢視crontab的內容，會發現兩行內容
50 2 * * * . $HOME/.profile;$HOME/dbadmin/scripts/open_standby.sh
30 6 * * * . $HOME/.profile;$HOME/dbadmin/scripts/close_standby.sh
指令碼內容如下：
$ less $HOME/dbadmin/scripts/open_standby.sh
#!/bin/bash
cd ~
. ~oracle/.bash_profile
home=~oracle
dgmgrl / "edit database stest2 set state='READ-ONLY'"
if [ $? -ne 0 ]; then
echo "set read-only failed!"
exit 1
fi

$ less $HOME/dbadmin/scripts/close_standby.sh
#!/bin/bash
cd ~
. ~oracle/.bash_profile
home=~oracle
dgmgrl / "edit database stest2 set state='ONLINE'"
if [ $? -ne 0 ]; then
echo "set read-only failed!"
exit 1
fi
這個時候檢視資料庫日誌發現裡面其實已經有不少的錯誤資訊。擷取一段。
RFS[112]: Archived Log: '/U01/app/oracle/flash_recovery_area/Stest2/archivelog/2015_11_06/o1_mf_1_7608_c3q98op1_.arc'
Fri Nov 6 02:51:16 2015
ksvcreate: Process(m000) creation failed
Fri Nov 6 02:52:16 2015
ksvcreate: Process(m000) creation failed
Fri Nov 6 02:53:16 2015
ksvcreate: Process(m000) creation failed
...
ksvcreate: Process(m000) creation failed
Fri Nov 6 06:26:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov 6 06:27:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov 6 06:28:20 2015
ksvcreate: Process(m000) creation failed
Fri Nov 6 06:29:20 2015
ksvcreate: Process(m000) creation failed
對於這個問題感覺還是有些奇怪，看起來是m000程式啟動不了，那麼是核心配置過低了。檢視了一下，沒有發現問題。
這個時候在metalink上搜尋，發現一個相符的文章。
'Ksvcreate: Process(m000) creation failed' after Standby Database Open Read Only Multiple Times (Doc ID 418553.1)
對於這個問題，oracle認為是基於一個bug。
Cause
The Cause of this Problem has been identified in Bug 5583049.
至於解決方案，也提了兩個建議。
Solution
There are two Workarounds available:
    Restart the Instance..
or
     Disable ADDM - Should be re-enabled if Standby takes up the Primary Role
          * Set SGA_TARGET=0 and set shared_pool_size, db_cache_size, etc if using
            Automatic SGA Memory Management (ASMM)
          * Set STATISTICS_LEVEL=BASIC to disable statistics gathering
對於這個問題，先不說是個bug,疑問就是為什麼在一個備庫有，另外一個備庫沒有報錯呢?
檢視了一下，發現另外一個備庫中是沒有crontab的設定，一直是處於apply狀態的。
這樣問題似乎就清楚了，看來確實是一個bug導致的。那麼這個問題既然是一個bug，為什麼之前就從來沒有發現過呢。
說來慚愧，自己在bdump目錄下搜尋了一下，這個問題可以追溯到2013年了。看來是一個遺留問題了。
bdump]$ grep "ksvcreate: Process(m000) creation failed" alert*|uniq|less
alert_test.log:ksvcreate: Process(m000) creation failed
alert_test.log.20130701:ksvcreate: Process(m000) creation failed
alert_test.log.20131001:ksvcreate: Process(m000) creation failed
alert_test.log.20131101:ksvcreate: Process(m000) creation failed
按照這個報錯的頻率，基本是每天都有，那麼報錯的次數有多少，差不多30多萬次。
$ grep -B1 "ksvcreate: Process(m000) creation failed" alert*|uniq|wc -l
316539
那麼根據第一個建議，重啟資料庫即可解決，到底好不好呢，再來一個grep來驗證一下。發現這些年來，這個備庫因為各種操作，重啟了也不少次
$ grep "Starting ORACLE instance (normal)" alert*
alert_test.log:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20111201:Starting ORACLE instance (normal)
alert_test.log.20120601:Starting ORACLE instance (normal)
alert_test.log.20130901:Starting ORACLE instance (normal)
alert_test.log.20130901:Starting ORACLE instance (normal)
。。。。
可以看出，其實已經重啟了很多次，但是報警還是依舊，所以說重啟還是不能解決問題。
那麼來看看第二個建議，就是設定sga_target=0來取消自動管理，還有addm的設定，這些還需要再評估，但我認為sga_target的影響可能更大。
所以簡單來說，修復就是設定sga_target=0
如果要設定sga_target為0，則要手工指定其它記憶體元件的大小，那麼先來看看目前的設定，作為參考。
$ sh showbuffer.sh
COMPONENT                       CURRENT_M      MIN_M      MAX_M SPECCIFIED_M LAST_OPER LAST_OPER_TYP GRANULE_M
------------------------------ ---------- ---------- ---------- ------------ --------- ------------- ----------
shared pool                          2080       2064          0            0 IMMEDIATE GROW                  16
large pool                             16         16          0            0           STATIC                16
java pool                              16         16          0            0           STATIC                16
streams pool                           32         32          0            0           STATIC                16
DEFAULT buffer cache                11152      11152          0            0 IMMEDIATE SHRINK                16
可以看見buffer_cache大概在11g，加上shared_pool 的2G，大概修是13G左右
但是show sga的時候發現，database buffers有差不多34G左右這個就有點差別太懸殊了吧。
idle> show sga
Total System Global Area 4.8318E+10 bytes
Fixed Size                  2173768 bytes
Variable Size            1.3304E+10 bytes
Database Buffers         3.4997E+10 bytes
Redo Buffers               14602240 bytes
檢視引數sga的設定大小
idle> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 45G
sga_target                           big integer 36G
發現sga_target還確實設定是36G左右，但是似乎備庫上只啟用了13G左右。
idle> alter system set shared_pool_size=4G scope=spfile;
System altered.
這個時候設定buffer cache為30g，改動寫進spfile就報錯了。
idle> alter system set db_cache_size=30G scope=spfile;
alter system set db_cache_size=30G scope=spfile
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-00384: Insufficient memory to grow cache
那麼設定略低一些，改成25G。竟然就可以了。
idle> alter system set db_cache_size=25G scope=spfile;
System altered.
重啟之後，發現buffer_cache基本在設定的範圍之內。
idle> shutdown immediate
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
idle> startup nomount
ORACLE instance started.
Total System Global Area 4.8318E+10 bytes
Fixed Size                  2173768 bytes
Variable Size            2.0368E+10 bytes
Database Buffers         2.7934E+10 bytes
Redo Buffers               14602240 bytes
這個時候想改大一些還是不行。
idle> alter system set db_cache_size=30G scope=spfile;
alter system set db_cache_size=30G scope=spfile
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-00384: Insufficient memory to grow cache
關鍵是檢視記憶體元件的大小使用v$dynamic_components發現buffer_cache還是11G
這種問題就有些奇怪了。檢視metalink發現還有一個bug

至於這個問題為什麼只啟用了13G的sga，我檢視主庫的配置發現似乎是有關聯的。
sys@test> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 13G
sga_target                           big integer 13G
這一點在metalink還沒有找到相關的說明，不過要想避免這個bug，還是需要設定sga_target=0
看來在10g中還是有不少的小問題，還是需要考慮考慮升級到11g了。

由報警郵件分析發現的備庫oracle bug

相關文章