一、環境描述
系統：HP-UX B.11.31 U ia64
資料庫：Oracle 11.2.0.4.5 RAC 檔案系統

二、故障描述
節點1正常，節點啟停庫很慢，大概需要11分鐘，而且connect很慢，大概需要7秒左右。

三、分析處理步驟
1、檢視db alert日誌如下：
Tue Dec 02 11:28:39 2014 <<=====資料庫開始啟動
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 32
Number of processor cores in the system is 32
Number of processor sockets in the system is 16
Private Interface 'lan101:101' configured from GPnP for use as a private interconnect.
[name='lan101:101', type=1, ip=192.1.1.29, mac=00-23-7d-3e-22-d5, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'lan100:101' configured from GPnP for use as a public interface.
[name='lan100:101', type=1, ip=15.133.55.16, mac=00-23-7d-3e-22-29, net=15.133.55.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'lan100' configured from GPnP for use as a public interface.
[name='lan100', type=1, ip=15.133.55.12, mac=00-23-7d-3e-22-29, net=15.133.55.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /oracle/app/oracle/product/11.2.0/db/dbs/arch
Tue Dec 02 11:28:49 2014
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
。。。。
Tue Dec 02 11:29:27 2014
PMON started with pid=2, OS id=2671
Tue Dec 02 11:29:42 2014
PSP0 started with pid=3, OS id=4085
Tue Dec 02 11:29:58 2014
VKTM started with pid=4, OS id=4139 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Tue Dec 02 11:30:12 2014
GEN0 started with pid=5, OS id=4208
Tue Dec 02 11:30:27 2014
DIAG started with pid=6, OS id=4251
Tue Dec 02 11:30:42 2014
PING started with pid=7, OS id=5285
Tue Dec 02 11:30:57 2014
ACMS started with pid=8, OS id=5341
Tue Dec 02 11:31:12 2014
DIA0 started with pid=9, OS id=5419
Tue Dec 02 11:31:27 2014
LMON started with pid=10, OS id=5461
Tue Dec 02 11:31:42 2014
LMD0 started with pid=11, OS id=6448
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [30720 - 40960]
Tue Dec 02 11:31:56 2014
LMS0 started with pid=12, OS id=6516 at elevated priority
Tue Dec 02 11:32:11 2014
LMS1 started with pid=13, OS id=6564 at elevated priority
Tue Dec 02 11:32:26 2014
LMS2 started with pid=14, OS id=6626 at elevated priority
Tue Dec 02 11:32:40 2014
RMS0 started with pid=15, OS id=7619
Tue Dec 02 11:32:55 2014
LMHB started with pid=16, OS id=7686
Tue Dec 02 11:33:10 2014
MMAN started with pid=17, OS id=7733
Tue Dec 02 11:33:32 2014
DBW0 started with pid=18, OS id=7789
Tue Dec 02 11:33:54 2014
DBW1 started with pid=19, OS id=8781
Tue Dec 02 11:34:17 2014 <<=====啟動一個程式幾十秒，正常情況下秒啟的！
DBW2 started with pid=20, OS id=8896
Tue Dec 02 11:34:38 2014
DBW3 started with pid=21, OS id=8993
Tue Dec 02 11:34:53 2014
LGWR started with pid=22, OS id=10241
Tue Dec 02 11:35:08 2014
CKPT started with pid=23, OS id=10320
Tue Dec 02 11:35:23 2014
SMON started with pid=24, OS id=10434
Tue Dec 02 11:35:38 2014
RECO started with pid=25, OS id=10593
Tue Dec 02 11:35:53 2014
MMON started with pid=26, OS id=11488
Tue Dec 02 11:36:08 2014
MMNL started with pid=27, OS id=11592
Tue Dec 02 11:36:08 2014
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
Tue Dec 02 11:36:38 2014
starting up 1 shared server(s) ...
Tue Dec 02 11:36:40 2014
Restarting dead background process DIA0 <<=====從29分到37分之間都是在啟動後臺程式，而這裡DIA0啟動之後又dead了、重啟。
Tue Dec 02 11:37:08 2014
lmon registered with NM - instance number 2 (internal mem no 1)
Reconfiguration started (old inc 0, new inc 8)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Tue Dec 02 11:37:09 2014
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Dec 02 11:37:09 2014
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

二、透過分析alert日誌，我們發現建立程式很慢，由於建立一個連線會話ORACLE也會建立一個程式來處理，所以sqlplus也很慢。由於庫是新裝的還沒上線，很可能是系統引數設定有問題造成的。思路已經有了，下一步就是對比好的節點1跟有問題的節點2的核心引數了。

透過對比發現節點2的shmmax才1073741824 B(1G)，初步懷疑故障由於該引數導致。
(引數shmmax釋義：每個共享記憶體段的最大位元組值，單位B，官方建議：為了避免效能下降，該值需大於或等於可用記憶體大小。)

三、修改引數並重啟主機後資料庫啟動關閉恢復正常，連線也很快了。
修改引數命令：
kctune shmmax=274598985728

如果以後遇到在生產庫前面是正常的，後面啟動關閉都很慢，這個時候就需考慮其他因素了，詳細請見：
Troubleshooting Database Startup/Shutdown Problems (文件 ID 851057.1)

某新裝業務1庫節點2啟動關閉資料庫很慢且sqlplus連線也很慢故障分析處理

相關文章