Server reboots caused by the OPROCD process on Linux

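Because the default values of the oprocd process's timeout-related parameters are too small, the machine may reboot during installation, upgrade, or normal operation. The problem is especially pronounced in RAC test environments built on virtual machines.
Solution:
Edit /etc/init.d/init.cssd
and change the following two values: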
OPROCD_DEFAULT_TIMEOUT=10000
OPROCD_DEFAULT_MARGIN=5000
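
The edit can also be scripted. The following is a minimal sketch, not an Oracle-supplied procedure: the function name set_oprocd_defaults is hypothetical, and it assumes the two variables appear in the file as plain NAME=value assignments at the start of a line (optionally indented). It keeps a copy of the original file with a .bak suffix before rewriting it.

```sh
# Hypothetical helper: rewrite the two OPROCD defaults in an init.cssd-style file.
# Usage: set_oprocd_defaults FILE TIMEOUT_MS MARGIN_MS
# Assumption: the variables are written as NAME=value on their own lines.
set_oprocd_defaults() {
    oprocd_file=$1
    oprocd_timeout_ms=$2
    oprocd_margin_ms=$3

    # Keep the original so the change can be rolled back.
    cp -p "$oprocd_file" "$oprocd_file.bak" || return 1

    # Replace only the value part of each assignment; all other lines pass through.
    sed -e "s/^\([[:space:]]*OPROCD_DEFAULT_TIMEOUT=\).*/\1${oprocd_timeout_ms}/" \
        -e "s/^\([[:space:]]*OPROCD_DEFAULT_MARGIN=\).*/\1${oprocd_margin_ms}/" \
        "$oprocd_file.bak" > "$oprocd_file"
}

# Example (run as root, with the values from this post):
#   set_oprocd_defaults /etc/init.d/init.cssd 10000 5000
#   grep '^[[:space:]]*OPROCD_DEFAULT_' /etc/init.d/init.cssd
```

Since oprocd is started by init.cssd, the new values presumably only apply once the clusterware stack on that node has been restarted.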

[root@rhel1 bin]#ps -ef | grep oprocd
root 3842 2936 0 13:05 ? 00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root 4191 3842 0 13:05 ? 00:00:00 /export/home/oracle/product/10.2.0/crs/bin/oprocd.bin run -t 1000 -m 8053063650

OPROCD takes two parameters at startup:
-t : timeout, in milliseconds; default 1000 (OPROCD_DEFAULT_TIMEOUT=1000)
-m : acceptable delay (margin) before a reboot, in milliseconds; default 500 (OPROCD_DEFAULT_MARGIN=500)
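
To confirm which values a running oprocd actually received, the -t and -m arguments can be pulled out of the process command line. This is a small sketch: parse_oprocd_args is a hypothetical helper name, and it assumes each flag is followed by its value as a separate word, as in the ps output shown in this post.

```sh
# Hypothetical helper: read process command lines on stdin and print the
# values that follow the -t and -m flags.
parse_oprocd_args() {
    awk '{
        t = ""; m = "";
        for (i = 1; i < NF; i++) {
            if ($i == "-t") t = $(i + 1);   # timeout in ms
            if ($i == "-m") m = $(i + 1);   # margin in ms
        }
        if (t != "" || m != "")
            printf "timeout_ms=%s margin_ms=%s\n", t, m;
    }'
}

# Example: the [o] in the pattern keeps grep from matching itself.
#   ps -eo args | grep '[o]procd.bin' | parse_oprocd_args
```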

About the oprocd process:
PROCD is a process monitor that runs on hardware platforms supporting
other third-party cluster managers and is present only on hardware platforms
other than Linux. Its function is to create threads for the various processors
on the system and to check if the processors are hanging. Every
second, the PROCD thread wakes up and checks the processors on the system,
and then goes to sleep for about 500 ms and tries again. If it does not
receive any response after n seconds, it reboots the node. On Linux environments,
the hangcheck timer module performs the same work that PROCD
does on other hardware platforms.
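Oracle Clusterware 10.2.0.4 and later on Linux introduce a new process, the Oracle Clusterware Process Monitor Daemon (OPROCD), to monitor system state and the health of each node in the cluster, just as is already provided on UNIX systems that do not use third-party cluster software.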

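On Linux in 10.2.0.4, OPROCD runs alongside hangcheck-timer, but it has no connection to or dependency on the hangcheck-timer module. It is spawned by init.cssd and runs as root. The OPROCD process is locked in memory and monitors the cluster node it runs on, detecting hardware or driver freezes on the machine, and it provides I/O fencing (which is different from the interrupt-style fencing provided by SCSI). If a machine stays frozen for long enough, it is evicted from the cluster, and it has to force a reboot of itself so that, after the cluster has reorganized the lock resources of the failed node, the failed node cannot still issue questionable I/O against the shared data files. To provide this, OPROCD performs its check and then stops running (sleeps); if it is not woken within the expected time, OPROCD reboots the local node.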
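
The sleep-and-check decision can be illustrated with simple arithmetic. The sketch below is not OPROCD's actual code. It assumes, based on the description of the -t and -m parameters, that a wake-up counts as too late when the measured sleep time exceeds timeout plus margin. The function names are hypothetical, and nothing here reboots anything.

```sh
# Hypothetical model of the wake-up check (assumption: late means
# elapsed > timeout + margin). All values are in milliseconds.
# Args: TIMEOUT_MS MARGIN_MS ELAPSED_MS
wakeup_too_late() {
    [ "$3" -gt $(( $1 + $2 )) ]
}

# Print what the model would decide for one measured sleep.
describe_wakeup() {
    if wakeup_too_late "$1" "$2" "$3"; then
        echo "timeout=$1 margin=$2 elapsed=$3: too late, node would reboot"
    else
        echo "timeout=$1 margin=$2 elapsed=$3: within limits"
    fi
}

# A 1.6 s stall (plausible on a busy virtual machine) against the default
# values and against the larger values suggested in this post.
describe_wakeup 1000 500 1600
describe_wakeup 10000 5000 1600
```

Under this model the 1.6 s stall exceeds the default limit of 1000 + 500 ms but stays well inside 10000 + 5000 ms, which is why raising the two values avoids spurious reboots in slow test environments.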

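Note: OPROCD is not present in cluster environments implemented with third-party cluster software. Because there is no certified third-party cluster solution on Linux, OPROCD is always present in 10.2.0.4 on Linux.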
