Environment:

OS: Red Hat 7.5
DB: Oracle 11.2.0.4.0 RAC
Node 1: 10.1.1.103
Node 2: 10.1.1.101

/etc/hosts configuration:

cat /etc/hosts
......
10.1.1.103    chen-cjc-01
10.1.1.101    chen-cjc-02
10.1.1.102    chen-cjc-01-vip
10.1.1.104    chen-cjc-02-vip
25.5.255.13   chen-cjc-01-pri
25.5.255.14   chen-cjc-02-pri
10.1.1.105    chen-scan

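Problem:

Memory utilization on nodes 1 and 2 is high, so both servers need more memory.

Solution:

Add the memory during a planned outage.
1. Stop the application services.
2. Stop the database.
3. Stop the cluster.
4. Shut down the servers and add memory.
5. Start the cluster.
6. Start the database.
7. Change the application's database connection address to the SCAN IP.
8. Start the application services.
9. Verify.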

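Issues encountered during the change:

1. For historical reasons the application did not connect through the SCAN IP. It connected only to the node 1 VIP, 10.1.1.102, and so had no high availability. The application therefore had to be stopped before the memory expansion.
2. The Oracle ASM shared storage sits on a NAS. A check showed that the NAS is not mounted automatically at boot and has to be mounted by hand. Oracle CRS autostart therefore has to be disabled beforehand. After the server boots, mount the NAS manually first, then start CRS manually.
3. Oracle 11.2.0.4.0 RAC installed on Red Hat 7 or later cannot start the ohas service directly. It needs a patch or a manually added ohas service.
4. When CRS was started, the background log reported that no voting files could be found, so the cluster could not start. The NAS mount options were corrected based on the ocssd log.

Details of each problem: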
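The ordering described above (stop the database, stop CRS, disable CRS autostart, then after boot mount the NAS before starting CRS) can be sketched as a dry-run script. This is a minimal sketch, not a record of the commands run during this change: the database name CJC passed to srvctl is assumed from the service name in the connect strings, and the run helper, the DRY_RUN variable and the function names are illustrative. srvctl is normally run as the oracle user and crsctl as root.

```shell
#!/bin/bash
# Sketch of the per-node ordering; DRY_RUN=1 (the default) only prints commands.
# DB_NAME and the helper/function names are assumptions for illustration.
DRY_RUN="${DRY_RUN:-1}"
DB_NAME="${DB_NAME:-CJC}"                     # assumed database unique name
NFS_SRC="25.17.1.2:/CJC_db_oradata_01_nfs"    # NAS export used in this article
NFS_OPTS="rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0"

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

before_shutdown() {
    run srvctl stop database -d "$DB_NAME" -o immediate   # as oracle, once per cluster
    run crsctl stop crs                                   # as root, on each node
    run crsctl disable crs                                # no CRS autostart at next boot
}

after_boot() {
    run mount -o "$NFS_OPTS" "$NFS_SRC" /oradata          # mount the NAS first
    run crsctl start crs                                  # then start the stack by hand
    run crsctl check crs                                  # confirm the stack is online
}
```

Calling before_shutdown and after_boot with the default DRY_RUN=1 prints the commands prefixed with "+ " so the order can be reviewed before anything is executed.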

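1. Stopping the database instance on node 2 was slow, taking about 10 minutes (18:42:11 to 18:52:13 in the alert log).

Main cause: the server performs poorly. The alert log also shows the shutdown waiting for active calls in eight server processes before the instance was finally terminated. Next time, consider terminating the sessions manually first and then stopping the instance.

Alert log from the instance shutdown: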

Thu Feb 10 18:42:11 2022
Shutting down instance (immediate)
Stopping background process SMCO
Shutting down instance: further logons disabled
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
License high water mark = 23
Thu Feb 10 18:42:38 2022
Reconfiguration started (old inc 4, new inc 6)
List of instances:
 2 (myinst: 2) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Feb 10 18:42:38 2022
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Feb 10 18:42:38 2022
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Feb 10 18:42:38 2022
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Thu Feb 10 18:43:38 2022
Decreasing number of real time LMS from 2 to 0
Thu Feb 10 18:47:14 2022
Active call for process 5123 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 30720 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1908 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 27349 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 1753 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 6241 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 22201 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
Active call for process 21133 user 'oracle' program 'oracle@chen-cjc-02 (TNS V1-V3)'
SHUTDOWN: waiting for active calls to complete.
Thu Feb 10 18:52:13 2022
License high water mark = 23
USER (ospid: 23318): terminating the instance
Instance terminated by USER, pid = 23318
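
The "Active call for process" lines identify the server processes that held up the shutdown. As a small sketch for reviewing such a log afterwards, the function below pulls those OS pids out of alert-log text. The function name is illustrative, and it only parses lines in the format shown above.

```shell
#!/bin/bash
# List the OS pids that the alert log reports as "Active call for process".
# Reads alert-log text on stdin and prints one pid per line, de-duplicated
# and sorted numerically.
active_call_pids() {
    awk '/^Active call for process [0-9]+ / { print $5 }' | sort -un
}
```

For example, active_call_pids < alert.log prints the pids, which can then be matched against v$process.spid while the instance is still up, or inspected with ps -fp on the server.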

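2. CRS fails to start.

After the servers booted, the system engineer mounted the NAS. When the DBA then started CRS, the command hung in the foreground and did not respond for a long time.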

###crsctl start crs;

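The cluster alert log, crs log, ohas log and ocssd log showed no output at all.

After CRS was stopped and started repeatedly, the command eventually failed in the foreground with CRS-4124: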

root@chen-cjc-02:/oracle/db/grid/product/11.2.0/bin#./crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

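Cause:

Oracle 11.2.0.4.0 RAC installed on Red Hat 7 or later cannot start the ohas service directly. It needs a patch or a manually added ohas service.

Fix:

Create the ohas.service unit file and set its permissions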

su - root
touch /usr/lib/systemd/system/ohas.service
chmod 777 /usr/lib/systemd/system/ohas.service

Add the ohasd startup settings to the ohas.service unit file

vi /usr/lib/systemd/system/ohas.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1
Type=simple
Restart=always
[Install]
WantedBy=multi-user.target

Reload the systemd configuration

systemctl daemon-reload

Enable the service to start at boot

###systemctl enable ohas.service

Start the ohas service manually

systemctl start ohas.service
systemctl status ohas.service

3.CRS-1714:Unable to discover any voting files

After the ohas service was added and started manually, starting CRS returned quickly with no error in the foreground.

###crsctl start crs

The cluster alert log, however, shows that the cluster failed to start.

2022-02-10 19:37:03.771: 
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log
2022-02-10 19:37:18.783: 
[cssd(12250)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /oracle/db/grid/product/11.2.0/log/chen-cjc-02/cssd/ocssd.log

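The ocssd.log goes on to report that the NFS mount options are wrong: /oradata is mounted with rsize=16384, below the expected minimum of 32768 (wsize=1048576 already meets the minimum).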

2022-02-10 19:43:19.113: [   SKGFD][3605935872]running stat on disk:/oradata/vote3
2022-02-10 19:43:19.113: [   SKGFD][3605935872]WARNING:NFS file system /oradata mounted with incorrect options(rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)
2022-02-10 19:43:19.113: [   SKGFD][3605935872]WARNING:Expected NFS mount options: rsize>=32768,wsize>=32768,hard,
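
The comparison that ocssd makes here can be reproduced by hand. The sketch below checks an NFS option string against the three conditions visible in the warning (rsize>=32768, wsize>=32768, hard). The expected-options line in the log is cut off after "hard,", so any further options Oracle checks are not covered. The function name is illustrative.

```shell
#!/bin/bash
# Check a comma-separated NFS option string against the minimums shown in the
# ocssd.log warning: rsize>=32768, wsize>=32768 and the "hard" option.
# Prints one line per problem and returns 1 if any is found, 0 otherwise.
check_nfs_opts() {
    local opts="$1" min=32768 rc=0 rsize wsize
    rsize=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^rsize=//p')
    wsize=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^wsize=//p')
    if [ -z "$rsize" ] || [ "$rsize" -lt "$min" ]; then
        echo "rsize=${rsize:-unset} is below $min"; rc=1
    fi
    if [ -z "$wsize" ] || [ "$wsize" -lt "$min" ]; then
        echo "wsize=${wsize:-unset} is below $min"; rc=1
    fi
    case ",$opts," in
        *,hard,*) ;;                           # hard mount present
        *) echo "hard is missing"; rc=1 ;;
    esac
    return $rc
}
```

Fed the option list from the warning above, the function reports only the rsize value. The live options for a mount point can be taken from the fourth field of the matching line in /proc/mounts.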

Next, check the NFS mount options.

NAS mount options recorded in /etc/fstab:

cat /etc/fstab
#mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

Current mount options:

mount |grep oradata
25.17.1.2:/CJC_db_oradata_01_nfs on /oradata type nfs (rw,relatime,vers=3,rsize=16384,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=25.17.1.2,mountvers=3,mountport=2050,mountproto=tcp,local_lock=none,addr=25.17.1.2)

Mount commands in the shell history:

root@chen-cjc-01:/oracle/db/grid/product/11.2.0/log/chen-cjc-01#history |grep mount
......
 1194  2020-09-16-10:05:37 [root]umount /oradata
 1195  2020-09-16-10:05:44 [root]mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

Remount manually

Stop CRS on each of the two nodes

crsctl stop crs
ps -ef|grep d.bin

Remount on each of the two nodes

###umount /oradata
###mount -o rw,bg,hard,nointr,nolock,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 25.17.1.2:/CJC_db_oradata_01_nfs  /oradata

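After the remount, CRS and the database instances start normally.

Verification:

Check that the database can be reached through each of its five IP addresses.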

###sqlplus CJC/***@10.1.1.102:15212/CJC
###sqlplus CJC/***@10.1.1.103:15212/CJC
###sqlplus CJC/***@10.1.1.105:15212/CJC
###sqlplus CJC/***@10.1.1.101:15212/CJC
###sqlplus CJC/***@10.1.1.104:15212/CJC
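
Before logging in with sqlplus, the listener port on all five addresses can be probed in one pass. This is a minimal sketch: port 15212 and service CJC come from the connect strings above, the function names are illustrative, and the probe relies on bash's /dev/tcp redirection and the coreutils timeout command. An open port only shows that something is listening there, so it does not replace the sqlplus logins.

```shell
#!/bin/bash
# Probe the listener port on each of the five addresses (two public IPs,
# two VIPs and the SCAN IP) listed in the verification step.
PORT=15212
SERVICE=CJC
ADDRS="10.1.1.101 10.1.1.102 10.1.1.103 10.1.1.104 10.1.1.105"

connect_strings() {
    # Print one easy-connect target per address, e.g. 10.1.1.105:15212/CJC
    for ip in $ADDRS; do echo "$ip:$PORT/$SERVICE"; done
}

port_open() {
    # Succeeds if a TCP connection to $1:$2 opens within 3 seconds.
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

probe_all() {
    # Report the state of the listener port on every address.
    for ip in $ADDRS; do
        if port_open "$ip" "$PORT"; then echo "$ip open"; else echo "$ip CLOSED"; fi
    done
}
```

Running probe_all on a client host prints one line per address. connect_strings prints the targets to pass to sqlplus after the user name and password.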

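Once the application has switched to the new IP address, check the database connections and confirm that both node 1 and node 2 carry business sessions.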

SQL> select inst_id,username,count(*) from gv$session group by inst_id,username order by 1,2;
   INST_ID USERNAME                         COUNT(*)
---------- ------------------------------ ----------
         1 CJC                                   42
         1 MONITOR                                 3
         1 SYS                                     5
         1                                        41
         2 CJC                                   42
         2 SYS                                     3
         2                                        39
7 rows selected.
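
The same check can be scripted against saved query output. The sketch below reads the sqlplus output of the query above on stdin and prints the instance ids on which a given user has sessions. The function name is illustrative, and it assumes the three-column layout shown here.

```shell
#!/bin/bash
# Print the instance ids on which user $1 has at least one session.
# Expects rows of the form "<inst_id> <username> <count>" on stdin; header,
# separator, NULL-username and "rows selected" lines do not match.
instances_for_user() {
    awk -v u="$1" '$1 ~ /^[0-9]+$/ && NF == 3 && $2 == u && $3 > 0 { print $1 }' | sort -un
}
```

For the output above, instances_for_user CJC prints 1 and 2, which is the expected result once the application connects through the SCAN IP.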

#####chenjuchao 20220212#####
