[新炬網路 Master Lecture Series] A Fairly Serious Bug in 11gR2 (11.2.0.3) RAC

Posted by shsnchyw on 2014-12-10

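These days almost every newly installed database, single-instance or RAC, runs Oracle 11.2.0.3. I have been through two incidents in which a RAC bug in this release crashed a node, and I want to share them here. In early March this year, one node of a telecom operator's RAC database suddenly went down. The database was Oracle 11.2.0.3.0 on 64-bit Linux. To restore service first, the colleague responsible for the system tried to start CRS and the database, but CRS would not start no matter what, as shown below (the logs are long, so only the key parts are excerpted):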

:~> sudo /Oracle/app/crs/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Checking the CRS state with crsctl stat res -t -init showed that errors kept appearing after the GPNP process started:
2013-03-04 23:48:33.756: [  OCRMSG][2657056528]GIPC error [29] msg [gipcretConnectionRefused]
2013-03-04 23:48:47.758: [  OCRMSG][2657056528]GIPC error [29] msg [gipcretConnectionRefused]
cssd stayed in the STARTING state the whole time, and its background log showed:
2013-03-03 22:40:25.319: [    CSSD][2946463504]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2013-03-03 22:40:26.095: [    CSSD][2951210768]clssnmvDHBValidateNCopy: node 1, zxdb01, has a disk HB, but no network HB, DHB has rcfg 246562024, wrtcnt, 11222804, LATS 1783876, lastSeqNo 11222803, uniqueness 1361347261, timestamp 1362321629/975388116
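
The symptom that matters in this excerpt is a node that still writes its disk heartbeat while its network heartbeat is missing. As a minimal sketch (not an Oracle tool), the following Python scans cssd log lines for that message and counts occurrences per node. The regular expression is inferred only from the single log line quoted here, so other releases may need a different pattern, and the function name `find_missing_network_hb` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one cssd log line quoted in this article;
# other releases may format the message differently (assumption).
HB_PATTERN = re.compile(
    r"clssnmvDHBValidateNCopy: node (?P<num>\d+), (?P<name>[^,\s]+), "
    r"has a disk HB, but no network HB"
)


def find_missing_network_hb(lines):
    """Count 'disk HB but no network HB' messages per (node number, node name)."""
    counts = Counter()
    for line in lines:
        match = HB_PATTERN.search(line)
        if match:
            counts[(int(match.group("num")), match.group("name"))] += 1
    return counts


# Two lines shortened from the excerpt in this article.
sample = [
    "2013-03-03 22:40:25.319: [    CSSD][2946463504]clssgmWaitOnEventValue: "
    "after CmInfo State  val 3, eval 1 waited 0",
    "2013-03-03 22:40:26.095: [    CSSD][2951210768]clssnmvDHBValidateNCopy: "
    "node 1, zxdb01, has a disk HB, but no network HB, DHB has rcfg 246562024",
]

for (num, name), hits in find_missing_network_hb(sample).items():
    print(f"node {num} ({name}): {hits} message(s)")
```

In practice the `lines` argument could be an open file handle on the cssd log, since the function only iterates over it.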

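The message "has a disk HB, but no network HB" clearly points to a heartbeat problem, that is, a problem with the private interconnect. Yet manual pings all succeeded, and a full test of the network interfaces with the mcasttest.pl tool downloaded from Metalink found nothing wrong either. That was odd: why report a heartbeat error when the network was fine? Unwilling to give up, we fell back on the last resort and rebooted the failed machine. After it came up we waited a long time, but the node still did not recover, and a second reboot made no difference. We began to suspect the CRS software itself and spent a day searching Metalink. We finally found the symptom in ID 1479380.1: bug 13653178 and bug 13899736 cause "Node cannot join the cluster after reboot or interconnect restored", with exactly the same heartbeat error. Searching further, we found that ID 1456977.1 describes a similar heartbeat error caused by bug 13334158 and bug 13811209. After comparing the other log errors in detail, we pinned the problem down more precisely to bug 13334158 and bug 13811209. The PSU notes showed that PSU5 fixes all four bugs, so we filed a fault report recommending PSU5, patch 14038787, which includes both the GRID and RDBMS patches.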

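Our initial recommendation, following the Metalink advice, was to schedule a maintenance window and reboot both machines at the same time (not just node 2), which would probably restore the cluster. But the Two Sessions (兩會) were in progress and the operator had a network change freeze, so patching was postponed and the surviving node carried the load alone. In the meantime we tested the PSU5 patch on virtual machines.

Two days later, a RAC database at an airline ran into trouble. The airline's DBA said one node had gone down, and despite several reboots the database service on the failed node would not come up. He was getting desperate. I went through the logs he sent. That database was also Oracle 11.2.0.3.0 on 64-bit Linux, and the symptoms were identical to the operator's failure a few days earlier. I told him it was a bug and that one of our systems had hit it too. He asked how to fix it, and I said to apply PSU5. He asked whether we had applied it yet, and I said we had to wait a few days because of the Two Sessions freeze. He wanted to reboot both machines but was afraid that neither node would come back. The database handled meteorological data and normally carried a light load. If both nodes went down, planes in the air would of course not fall out of the sky, but the dispatch of weather information would be affected. He reported the fault to his manager, and they decided to let us patch the operator's system first and see how it went, in effect using the operator's database as their test machine :).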

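A few days later we applied PSU5 to the operator's failed database. With the application stopped, we first rebooted both machines at the same time. As expected, the failed CRS came back to normal and the database started. We then applied PSU5, which went very smoothly, and the fault was resolved.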

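A few days after that, I went to the customer's site to apply PSU5. The application had not been stopped. We work on one principle: apply a PSU only after the existing database fault has been cleared, so that faults do not overlap. If the database already has a problem and it is only discovered after patching, people tend to blame the patch, and things get very messy. So, with approval, we first rebooted both servers. As expected, CRS on the failed node returned to normal. Before we started patching, the customer's DBA happened to test the application by connecting to the database with Toad, and found that he could not connect at all. This was strange: the CRS services were fine and the database instances were fine, yet connections failed. The VIP resources were all present, but ping and tnsping reached only one node. The configuration had not changed, and the customer's DBA said everything had worked that afternoon. We looked for a cause and found nothing abnormal, so we rebooted both machines again. This time it was even stranger: CRS and the database instances still came up on both nodes and every status looked normal, but now client tools could not connect to either node.

The customer's DBA asked whether this had happened at the operator's site. I said that there we had stopped the application before rebooting, so there was no way to test connectivity after the reboot, and we had gone straight on to PSU5. It was past three in the morning and we were both uneasy. We spent about half an hour discussing whether to go ahead, and decided to apply PSU5 anyway: we had a rollback plan, so even if the patch did not help we could back it out cleanly. Perhaps this problem was caused by the bug too, even though Oracle's official notes do not mention the symptom. That is understandable, since Oracle cannot test every situation in every environment. We applied PSU5, which took about 40 minutes, then tried ping and Toad again. This time both nodes accepted connections and the database was immediately back to normal. Remarkable. It seems the bug really was the cause.

I have also recently seen a workaround online for recovering the failed node without patching: go to the machine room, unplug the heartbeat (interconnect) cable, plug it back in, and restart the failed node. This is said to fix the problem temporarily, but I have not verified it myself.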

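So my advice is: after installing a relatively new database release like this one, always apply the PSU to avoid unnecessary trouble.
The PSU5 patching procedure is recorded below for reference.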

[root@QiXiang-DB1 backup]# su - grid
[grid@QiXiang-DB1 ~]$ pwd
/home/grid
[grid@QiXiang-DB1 ~]$ mkdir psu5
[grid@QiXiang-DB1 ~]$ chmod -R 777 psu5/
[root@QiXiang-DB1 backup]# mv p14727347_112030_Linux-x86-64.zip /home/grid
[root@QiXiang-DB1 backup]# cd /home/grid
[root@QiXiang-DB1 ~]# chown -R grid.oinstall p14727347_112030_Linux-x86-64.zip
[root@QiXiang-DB1 ~]# su - grid
[grid@QiXiang-DB1 ~]$ unzip p14727347_112030_Linux-x86-64.zip -d /home/grid/psu5
[grid@QiXiang-DB1 ~]$ /oracle/product/11.2.3/crs11g/OPatch/opatch version
Invoking OPatch 11.2.0.1.7

OPatch Version: 11.2.0.1.7

OPatch succeeded.
[root@QiXiang-DB1 psu5]# cd /oracle/product/11.2.3/crs11g/
[root@QiXiang-DB1 crs11g]# mv OPatch/ OPatch_bak
[root@QiXiang-DB1 backup]# mv p6880880_112000_Linux-x86-64.zip /home/grid/
[root@QiXiang-DB1 ~]# chown -R grid.oinstall /home/grid/p6880880_112000_Linux-x86-64.zip
[root@QiXiang-DB1 ~]# su - grid
[grid@QiXiang-DB1 ~]$ echo $ORACLE_HOME
/oracle/product/11.2.3/crs11g
[root@QiXiang-DB1 ~]# cd /home/grid/
[root@QiXiang-DB1 grid]# unzip p6880880_112000_Linux-x86-64.zip -d /oracle/product/11.2.3/crs11g
[root@QiXiang-DB1 grid]# cd /oracle/product/11.2.3/crs11g
[root@QiXiang-DB1 crs11g]# chown -R grid.oinstall OPatch
[root@QiXiang-DB1 crs11g]# su - grid
[grid@QiXiang-DB1 ~]$ /oracle/product/11.2.3/crs11g/OPatch/opatch version
OPatch Version: 11.2.0.3.3

OPatch succeeded.
[root@QiXiang-DB1 ~]# su - oracle
[oracle@QiXiang1 /home/oracle]$ echo $ORACLE_HOME
/oracle/product/11.2.3/db11g
[oracle@QiXiang1 /home/oracle]$ cd /oracle/product/11.2.3/db11g
[oracle@QiXiang1 /oracle/product/11.2.3/db11g]$ OPatch/opatch version
Invoking OPatch 11.2.0.1.7

OPatch Version: 11.2.0.1.7

OPatch succeeded.
[oracle@QiXiang1 /oracle/product/11.2.3/db11g]$ mv OPatch/ OPatch_bak
[root@QiXiang-DB1 ~]# cd /home/
[root@QiXiang-DB1 home]# ls -lrt
total 8
drwx------ 5 oracle oinstall 4096 Mar 12 14:59 oracle
drwx------ 7 grid   oinstall 4096 Mar 20 22:15 grid
[root@QiXiang-DB1 home]# chmod -R 774 *
[root@QiXiang-DB1 home]# ls -l
total 8
drwxrwxr-- 7 grid   oinstall 4096 Mar 20 22:15 grid
drwxrwxr-- 5 oracle oinstall 4096 Mar 12 14:59 oracle
[root@QiXiang-DB1 ~]# cd /home/grid/
[root@QiXiang-DB1 grid]# unzip p6880880_112000_Linux-x86-64.zip -d /oracle/product/11.2.3/db11g
[root@QiXiang-DB1 grid]# cd /oracle/product/11.2.3/db11g
[root@QiXiang-DB1 db11g]# chown -R oracle.oinstall OPatch
[root@QiXiang-DB1 db11g]# su - oracle
[oracle@QiXiang1 /home/oracle]$ /oracle/product/11.2.3/db11g/OPatch/opatch version
OPatch Version: 11.2.0.3.3

OPatch succeeded.
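
Both homes shipped with OPatch 11.2.0.1.7 and were upgraded to 11.2.0.3.3 before patching. A check like this can be scripted. The sketch below compares dotted version strings numerically rather than as text. The minimum version used here is a placeholder taken from this transcript, not a value quoted from the PSU README, and `opatch_is_new_enough` is a hypothetical helper.

```python
def parse_version(text):
    """Turn a dotted version such as '11.2.0.3.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def opatch_is_new_enough(installed, minimum):
    """Return True when the installed OPatch version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '11.2.0.3' compares equal to '11.2.0.3.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Placeholder threshold: the version this article upgraded to (assumption).
MINIMUM = "11.2.0.3.3"

for version in ("11.2.0.1.7", "11.2.0.3.3"):
    print(version, opatch_is_new_enough(version, MINIMUM))
```

Comparing integer tuples avoids the trap of string comparison, where "11.2.0.10" would sort before "11.2.0.3".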
[root@QiXiang-DB1 ~]# su - grid
[grid@QiXiang-DB1 ~]$ cd /oracle/product/11.2.3/crs11g/OPatch/ocm/bin/
[grid@QiXiang-DB1 bin]$ ./emocmrsp
OCM Installation Response Generator 10.3.4.0.0 - Production
Copyright (c) 2005, 2010, Oracle and/or its affiliates.  All rights reserved.

Provide your email address to be informed of security issues, install and
initiate Oracle Configuration Manager. Easier for you if you use your My
Oracle Support Email address/User Name.
Visit  for details.
Email address/User Name:

You have not provided an email address for notification of security issues.
Do you wish to remain uninformed of security issues ([Y]es, [N]o) [N]:  Y
The OCM configuration response file (ocm.rsp) was successfully created.

[root@QiXiang-DB1 ~]# su - grid
[grid@QiXiang-DB1 ~]$ echo $ORACLE_HOME
/oracle/product/11.2.3/crs11g
[grid@QiXiang-DB1 ~]$ /oracle/product/11.2.3/crs11g/OPatch/opatch lsinventory -detail -oh /oracle/product/11.2.3/crs11g
……
There are no Interim patches installed in this Oracle Home.
Rac system comprising of multiple nodes
  Local node = QiXiang-DB1
  Remote node = QiXiang-DB2

--------------------------------------------------------------------------------

OPatch succeeded.
[root@QiXiang-DB1 ~]# su - oracle
[oracle@QiXiang1 /home/oracle]$ /oracle/product/11.2.3/db11g/OPatch/opatch lsinventory -detail -oh /oracle/product/11.2.3/db11g
……
There are no Interim patches installed in this Oracle Home.
Rac system comprising of multiple nodes
  Local node = QiXiang-db1
  Remote node = QiXiang-db2

--------------------------------------------------------------------------------

OPatch succeeded.

Back up the home directories:
su - oracle
tar cvf db11g.tar /oracle/product/11.2.3/db11g
su - grid
tar cvf crs11g.tar /oracle/product/11.2.3/crs11g

Repeat all of the steps above on the second node.

[oracle@QiXiang1 /home/oracle]$ emctl stop dbconsole
Start applying the PSU.
Run on node 1:
[root@QiXiang-DB1 ~]# /oracle/product/11.2.3/crs11g/OPatch/opatch auto /home/grid/psu5 -ocmrf /oracle/product/11.2.3/crs11g/OPatch/ocm/bin/ocm.rsp
Executing /oracle/product/11.2.3/crs11g/perl/bin/perl /oracle/product/11.2.3/crs11g/OPatch/crs/patch11203.pl -patchdir /home/grid -patchn psu5 -ocmrf /oracle/product/11.2.3/crs11g/OPatch/ocm/bin/ocm.rsp -paramfile /oracle/product/11.2.3/crs11g/crs/install/crsconfig_params
/oracle/product/11.2.3/crs11g/crs/install/crsconfig_params
/oracle/product/11.2.3/crs11g/crs/install/s_crsconfig_defs

This is the main log file: /oracle/product/11.2.3/crs11g/cfgtoollogs/opatchauto2013-03-21_05-25-34.log
This file will show your detected configuration and all the steps that opatchauto attempted to do on your system: /oracle/product/11.2.3/crs11g/cfgtoollogs/opatchauto2013-03-21_05-25-34.report.log

2013-03-21 05:25:34: Starting Clusterware Patch Setup
Using configuration parameter file: /oracle/product/11.2.3/crs11g/crs/install/crsconfig_params

Unable to determine if /oracle/product/11.2.3/db11g is shared oracle home
Enter 'yes' if this is not a shared home or if the prerequiste actions are performed to patch this shared home (yes/no):yes

Unable to determine if /oracle/product/11.2.3/crs11g is shared oracle home
Enter 'yes' if this is not a shared home or if the prerequiste actions are performed to patch this shared home (yes/no):yes
patch /home/grid/psu5/15876003/custom/server/15876003  apply successful for home  /oracle/product/11.2.3/db11g
 patch /home/grid/psu5/14727310  apply successful for home  /oracle/product/11.2.3/db11g
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.crsd' on 'QiXiang-db1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.GRID_DG.dg' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.DATA_DG.dg' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.FLASH_DG.dg' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.cvu' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.QiXiang-db1.vip' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.QiXiang-db1.vip' on 'QiXiang-db1' succeeded
CRS-2672: Attempting to start 'ora.QiXiang-db1.vip' on 'QiXiang-db2'
CRS-2677: Stop of 'ora.scan1.vip' on 'QiXiang-db1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'QiXiang-db2'
CRS-2677: Stop of 'ora.FLASH_DG.dg' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.DATA_DG.dg' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'QiXiang-db1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'QiXiang-db2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'QiXiang-db2'
CRS-2676: Start of 'ora.QiXiang-db1.vip' on 'QiXiang-db2' succeeded
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'QiXiang-db2' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'QiXiang-db1' succeeded
CRS-2672: Attempting to start 'ora.oc4j' on 'QiXiang-db2'
CRS-2677: Stop of 'ora.cvu' on 'QiXiang-db1' succeeded
CRS-2672: Attempting to start 'ora.cvu' on 'QiXiang-db2'
CRS-2676: Start of 'ora.cvu' on 'QiXiang-db2' succeeded
CRS-2676: Start of 'ora.oc4j' on 'QiXiang-db2' succeeded
CRS-2677: Stop of 'ora.GRID_DG.dg' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.asm' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.ons' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.net1.network' on 'QiXiang-db1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'QiXiang-db1' has completed
CRS-2677: Stop of 'ora.crsd' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.evmd' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.asm' on 'QiXiang-db1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.evmd' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.asm' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'QiXiang-db1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.cssd' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.crf' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.gipcd' on 'QiXiang-db1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'QiXiang-db1'
CRS-2677: Stop of 'ora.gpnpd' on 'QiXiang-db1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'QiXiang-db1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully unlock /oracle/product/11.2.3/crs11g
 patch /home/grid/psu5/15876003  apply successful for home  /oracle/product/11.2.3/crs11g
patch /home/grid/psu5/14727310  apply successful for home  /oracle/product/11.2.3/crs11g
CRS-4123: Oracle High Availability Services has been started.

Run on node 2:
[root@QiXiang-DB2 ~]# /oracle/product/11.2.3/crs11g/OPatch/opatch auto /home/grid/psu5 -ocmrf /oracle/product/11.2.3/crs11g/OPatch/ocm/bin/ocm.rsp
(log omitted)

Run the following on one node only:
cd $ORACLE_HOME/rdbms/admin
sqlplus /nolog
SQL> CONNECT / AS SYSDBA
SQL> STARTUP
SQL> @catbundle.sql psu apply
SQL> QUIT

SQL> select action,comments from registry$history;

ACTION                         COMMENTS
------------------------------ ----------------------------------------
APPLY                          Patchset 11.2.0.2.0
APPLY                          Patchset 11.2.0.2.0
APPLY                          PSU 11.2.0.3.5
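
The final query confirms that the PSU was recorded in registry$history. As a rough sketch, the same check can be automated on rows that have already been fetched from that view. How the rows are fetched (which driver, which connection) is deliberately left out, and `psu_recorded` is a hypothetical helper, not part of any Oracle tooling.

```python
def psu_recorded(rows, psu_version):
    """Return True if an APPLY row carries the comment 'PSU <psu_version>'.

    rows is an iterable of (action, comments) tuples, in the shape returned by
    'select action, comments from registry$history'.
    """
    wanted = f"PSU {psu_version}"
    return any(
        (action or "").strip().upper() == "APPLY"
        and (comments or "").strip() == wanted
        for action, comments in rows
    )


# Rows copied from the query output in this article.
rows = [
    ("APPLY", "Patchset 11.2.0.2.0"),
    ("APPLY", "Patchset 11.2.0.2.0"),
    ("APPLY", "PSU 11.2.0.3.5"),
]

print(psu_recorded(rows, "11.2.0.3.5"))
```

The comparison is exact rather than a substring match, so a comment for a different PSU level that merely starts with the same digits is not counted.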

Source: ITPUB blog (ITPUB部落格), link: http://blog.itpub.net/29960155/viewspace-1363557/. Please credit the source when reposting; otherwise legal action may be pursued.
