Troubleshooting a CRS problem on a customer's Oracle 10.2.0.4 cluster

Posted by 羽化殘虹 on 2014-10-07

After the customer replaced an HBA card and a fibre-channel switch port, the database did not come back up. Below is how it was handled.

Customer environment: two IBM p570 servers, AIX oslevel 6100-04-01-0944, Oracle 10.2.0.4.

 

Connecting remotely, I found that the filesystem holding the Oracle software on node 2 was 100% full. Ugh, some process must have been writing frantically until it filled the LV.
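On AIX, a quick way to confirm which filesystem is full and what filled it looks roughly like this (a sketch only; the CRS home path is taken from the log paths later in this post, and the runaway file in a case like this is typically crsd.log):

```shell
# Which filesystem is full? (AIX df, sizes in GB)
df -g /oracle
# Largest files under the CRS log tree; crsd.log is the usual suspect
du -am /oracle/product/10.2.0/crs/log | sort -rn | head -10
# Reclaim space without removing the inode the daemon holds open:
# cp /dev/null /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log
```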

Looking at the CRS log, nearly every entry was this:

 

2014-10-02 21:54:15.523: [  OCRRAW][1]proprdc_propr_fcl: proprhandle_fcl->propr_fcl_page[3980]=0x0

2014-10-02 21:54:15.523: [  OCRRAW][1]proprdc_propr_fcl: proprhandle_fcl->propr_fcl_page[3981]=0x0

2014-10-02 21:54:15.523: [  OCRRAW][1]proprdc_propr_fcl: proprhandle_fcl->propr_fcl_page[3982]=0x0

Google turned up nothing at all for this error. Fine, off to MOS then. There was not much there either, but a few similar issues suggested a 10.2.0.4 bug.

 

First, the CRS alert log, which turned up the important clue:

 

crsd(201070)]CRS-1006:The OCR location /dev/rhdisk2 is inaccessible. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

2014-10-02 22:32:28.215

[crsd(164818)]CRS-1006:The OCR location /dev/rhdisk2 is inaccessible. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

 

So the disk has a problem...
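When CRS reports the OCR location as inaccessible, a direct read test against the raw devices, plus an OCR integrity check, quickly shows whether the problem is at the device level (a sketch; run as root, with the CRS home's bin directory on PATH):

```shell
# Can the OCR devices be read at all? A short raw read is enough.
dd if=/dev/rhdisk2 of=/dev/null bs=8192 count=128
dd if=/dev/rhdisk6 of=/dev/null bs=8192 count=128
# Logical check of the OCR contents and mirror locations
ocrcheck
```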

On node 2, the attributes, owner, and permissions of hdisk2, hdisk6, and the rest of the disk group all looked normal:

 

crw-rw----    1 oracle   oinstall     24, 10 Oct 03 09:55 /dev/rhdisk10

crw-rw----    1 oracle   oinstall     24, 11 Oct 03 09:55 /dev/rhdisk11

crw-rw----    1 oracle   oinstall     24, 12 Oct 03 09:48 /dev/rhdisk12

crw-rw----    1 oracle   oinstall     24, 13 Oct 03 09:48 /dev/rhdisk13

crw-rw----    1 oracle   oinstall     24, 14 Oct 03 09:47 /dev/rhdisk14

crw-rw----    1 oracle   oinstall     24, 15 Oct 03 09:45 /dev/rhdisk15

crw-rw----    1 root     oinstall     24,  2 Oct 03 09:55 /dev/rhdisk2

crw-rw----    1 oracle   oinstall     24,  3 Oct 03 09:55 /dev/rhdisk3

crw-rw----    1 oracle   oinstall     24,  4 Oct 03 09:55 /dev/rhdisk4

crw-rw----    1 oracle   oinstall     24,  5 Oct 03 09:55 /dev/rhdisk5

crw-rw----    1 root     oinstall     24,  6 Oct 03 09:55 /dev/rhdisk6

crw-rw----    1 oracle   oinstall     24,  7 Oct 03 09:55 /dev/rhdisk7

crw-rw----    1 oracle   oinstall     24,  8 Oct 03 09:30 /dev/rhdisk8

crw-rw----    1 oracle   oinstall     24,  9 Oct 03 09:09 /dev/rhdisk9

 

Then node 1:

crw-rw----    1 oracle   system     24, 10 Oct 03 09:55 /dev/rhdisk10

crw-rw----    1 oracle   system     24, 11 Oct 03 09:55 /dev/rhdisk11

crw-rw----    1 root   system     24, 12 Oct 03 09:48 /dev/rhdisk12

crw-rw----    1 root   system     24, 13 Oct 03 09:48 /dev/rhdisk13

crw-rw----    1 root   system     24, 14 Oct 03 09:47 /dev/rhdisk14

crw-rw----    1 root   system     24, 15 Oct 03 09:45 /dev/rhdisk15

crw-rw----    1 root     system     24,  2 Oct 03 09:55 /dev/rhdisk2

crw-rw----    1 root   system     24,  3 Oct 03 09:55 /dev/rhdisk3

crw-rw----    1 root   system     24,  4 Oct 03 09:55 /dev/rhdisk4

crw-rw----    1 root   system     24,  5 Oct 03 09:55 /dev/rhdisk5

crw-rw----    1 root     system     24,  6 Oct 03 09:55 /dev/rhdisk6

crw-rw----    1 root   system     24,  7 Oct 03 09:55 /dev/rhdisk7

crw-rw----    1 root   system     24,  8 Oct 03 09:30 /dev/rhdisk8

crw-rw----    1 root   system     24,  9 Oct 03 09:09 /dev/rhdisk9
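Eyeballing two long listings like these is error-prone; diffing just the mode, owner, and group columns makes the mismatch jump out. A generic sketch, where node1.ls and node2.ls stand in for `ls -l /dev/rhdisk*` captured on each host (one sample line inlined here for illustration):

```shell
# Stand-ins for `ls -l /dev/rhdisk*` output captured from each node.
cat > node1.ls <<'EOF'
crw-rw----    1 root   system     24,  3 Oct 03 09:55 /dev/rhdisk3
EOF
cat > node2.ls <<'EOF'
crw-rw----    1 oracle   oinstall     24,  3 Oct 03 09:55 /dev/rhdisk3
EOF
# Keep only mode, owner, group, and device name, then compare.
awk '{print $1, $3, $4, $NF}' node1.ls > node1.cols
awk '{print $1, $3, $4, $NF}' node2.ls > node2.cols
if ! diff node1.cols node2.cols; then
    echo "ownership/permissions differ between nodes"
fi
```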

 

I changed node 1's disk permissions and group ownership to match node 2:

crw-rw----    1 oracle   oinstall     24, 10 Oct 03 09:55 /dev/rhdisk10

crw-rw----    1 oracle   oinstall     24, 11 Oct 03 09:55 /dev/rhdisk11

crw-rw----    1 oracle   oinstall     24, 12 Oct 03 09:48 /dev/rhdisk12

crw-rw----    1 oracle   oinstall     24, 13 Oct 03 09:48 /dev/rhdisk13

crw-rw----    1 oracle   oinstall     24, 14 Oct 03 09:47 /dev/rhdisk14

crw-rw----    1 oracle   oinstall     24, 15 Oct 03 09:45 /dev/rhdisk15

crw-rw----    1 root     oinstall     24,  2 Oct 03 09:55 /dev/rhdisk2

crw-rw----    1 oracle   oinstall     24,  3 Oct 03 09:55 /dev/rhdisk3

crw-rw----    1 oracle   oinstall     24,  4 Oct 03 09:55 /dev/rhdisk4

crw-rw----    1 oracle   oinstall     24,  5 Oct 03 09:55 /dev/rhdisk5

crw-rw----    1 root     oinstall     24,  6 Oct 03 09:55 /dev/rhdisk6

crw-rw----    1 oracle   oinstall     24,  7 Oct 03 09:55 /dev/rhdisk7

crw-rw----    1 oracle   oinstall     24,  8 Oct 03 09:30 /dev/rhdisk8

crw-rw----    1 oracle   oinstall     24,  9 Oct 03 09:09 /dev/rhdisk9
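The fix itself is a handful of chown/chmod calls; a sketch of what would have been run as root on node 1, with disk numbers taken from the listings above (the OCR devices hdisk2 and hdisk6 stay root-owned, as on node 2):

```shell
# ASM/voting devices go to oracle:oinstall
for i in 3 4 5 7 8 9 10 11 12 13 14 15
do
    chown oracle:oinstall /dev/rhdisk$i
done
# OCR devices stay owned by root, group oinstall
chown root:oinstall /dev/rhdisk2 /dev/rhdisk6
chmod 660 /dev/rhdisk[2-9] /dev/rhdisk1[0-5]
```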

 

But node 2 still would not come up, failing with the same error.

Next I compared the hdisk2/hdisk6 attributes on the two nodes. Node 2 first:

PCM             PCM/friend/otherapdisk                                         Path Control Module              False

PR_key_value    none                                                           Persistant Reserve Key Value     True

algorithm       fail_over                                                      Algorithm                        True

autorecovery    no                                                             Path/Ownership Autorecovery      True

clr_q           no                                                             Device CLEARS its Queue on error True

cntl_delay_time 0                                                              Controller Delay Time            True

cntl_hcheck_int 0                                                              Controller Health Check Interval True

dist_err_pcnt   0                                                              Distributed Error Percentage     True

dist_tw_width   50                                                             Distributed Error Sample Time    True

hcheck_cmd      inquiry                                                        Health Check Command             True

hcheck_interval 60                                                             Health Check Interval            True

hcheck_mode     nonactive                                                      Health Check Mode                True

location                                                                       Location Label                   True

lun_id          0x0                                                            Logical Unit Number ID           False

lun_reset_spt   yes                                                            LUN Reset Supported              True

max_retry_delay 60                                                             Maximum Quiesce Time             True

max_transfer    0x40000                                                        Maximum TRANSFER Size            True

node_name       0x200400a0b811758c                                             FC Node Name                     False

pvid            none                                                           Physical volume identifier       False

q_err           yes                                                            Use QERR bit                     True

q_type          simple                                                         Queuing TYPE                     True

queue_depth     10                                                             Queue DEPTH                      True

reassign_to     120                                                            REASSIGN time out value          True

reserve_policy  no_reserve                                                     Reserve Policy                   True

rw_timeout      30                                                             READ/WRITE time out value        True

scsi_id         0x10300                                                        SCSI ID                          False

start_timeout   60                                                             START unit time out value        True

unique_id       3E213600A0B800011758C0000C04C4BBE8D500F1815      FAStT03IBMfcp Unique device identifier         False

ww_name         0x201500a0b811758c                                             FC World Wide Name               False

Now node 1:

PCM             PCM/friend/otherapdisk                                         Path Control Module              False

PR_key_value    none                                                           Persistant Reserve Key Value     True

algorithm       fail_over                                                      Algorithm                        True

autorecovery    no                                                             Path/Ownership Autorecovery      True

clr_q           no                                                             Device CLEARS its Queue on error True

cntl_delay_time 0                                                              Controller Delay Time            True

cntl_hcheck_int 0                                                              Controller Health Check Interval True

dist_err_pcnt   0                                                              Distributed Error Percentage     True

dist_tw_width   50                                                             Distributed Error Sample Time    True

hcheck_cmd      inquiry                                                        Health Check Command             True

hcheck_interval 60                                                             Health Check Interval            True

hcheck_mode     nonactive                                                      Health Check Mode                True

location                                                                       Location Label                   True

lun_id          0x0                                                            Logical Unit Number ID           False

lun_reset_spt   yes                                                            LUN Reset Supported              True

max_retry_delay 60                                                             Maximum Quiesce Time             True

max_transfer    0x40000                                                        Maximum TRANSFER Size            True

node_name       0x200400a0b811758c                                             FC Node Name                     False

pvid            none                                                           Physical volume identifier       False

q_err           yes                                                            Use QERR bit                     True

q_type          simple                                                         Queuing TYPE                     True

queue_depth     10                                                             Queue DEPTH                      True

reassign_to     120                                                            REASSIGN time out value          True

reserve_policy  single_path                                                     Reserve Policy                   True

rw_timeout      30                                                             READ/WRITE time out value        True

scsi_id         0x10300                                                        SCSI ID                          False

start_timeout   60                                                             START unit time out value        True

unique_id       3E213600A0B800011758C0000C04C4BBE8D500F1815      FAStT03IBMfcp Unique device identifier         False

ww_name         0x201500a0b811758c                                             FC World Wide Name               False

 

On node 1, hdisk2 and hdisk6 (the OCR disks) had reserve_policy set to single_path. How? These are supposed to be shared. It then turned out all of node 1's RAC disks were like this.
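Rather than paging through the full `lsattr -El` output for every disk, the single attribute can be pulled per disk and compared across the two hosts (a sketch; a single_path value on either node is the red flag):

```shell
# reserve_policy per shared disk; run on both nodes and compare
for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
    printf "hdisk%-3s " $i
    lsattr -El hdisk$i -a reserve_policy -F value
done
```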

Fix it immediately.

As root:

for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
    chdev -l hdisk$i -a reserve_policy=no_reserve
done

But hdisk2 and hdisk6 could not be changed; the devices were busy:

0514-062 Cannot perform the requested function because the

                 specified device is busy.

Removing the device definition did not work either:

#  rmdev -dl hdisk6

Method error (/usr/lib/methods/ucfgdevice):

        0514-062 Cannot perform the requested function because the

                 specified device is busy.

My guess was that node 2 was holding the OCR disks, which was why nothing I tried was allowed.

Checking the CRS processes:

oracle 196786 155908   0 09:05:18      -  0:00 /oracle/product/10.2.0/crs/bin/oclsomon.bin

root 103266 102694   1 09:05:17      -  0:47 /oracle/product/10.2.0/crs/bin/crsd.bin reboot

  oracle 107362 192550   0 09:05:19      -  0:05 /oracle/product/10.2.0/crs/bin/ocssd.bin

I stopped CRS on node 1, but the CRS processes were still there. A side note: ever since the HBA was replaced a few days earlier, `crsctl stop crs` on node 1 had not seemed to work properly.

Even after rebooting node 1, the disks still could not be changed and CRS could not be stopped, so as root I disabled CRS auto-start and rebooted both machines.

As root on both nodes:

cd /etc/

# ./init.crs disable

 

After that reboot there were no CRS processes at all, and this time changing the disk attributes on node 1 worked. Ha!

# ps -ef |grep crs

    root 102694      1   0 08:54:41      -  0:00 /bin/sh /etc/init.crsd run

root 151958 180262   0 08:59:54  pts/0  0:00 grep crs

# chdev -l hdisk2 -a reserve_policy=no_reserve

hdisk2 changed

# chdev -l hdisk6 -a reserve_policy=no_reserve

hdisk6 changed

#

Now start CRS on node 2:

# ./crsctl start crs

Check the CRS alert log:

 [crsd(201070)]CRS-1006:The OCR location /dev/rhdisk2 is inaccessible. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

2014-10-02 22:32:28.215

[crsd(164818)]CRS-1006:The OCR location /dev/rhdisk2 is inaccessible. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

2014-10-02 22:32:28.476

[crsd(164818)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

2014-10-02 22:32:28.477

[crsd(164818)]CRS-1012:The OCR service started on node jxsmdb2.

2014-10-02 22:32:28.751

[crsd(164818)]CRS-1201:CRSD started on node jxsmdb2.

[cssd(70408)]CRS-1603:CSSD on node jxsmdb2 shutdown by user.

2014-10-03 09:05:23.615

[cssd(107362)]CRS-1605:CSSD voting file is online: /dev/rhdisk4. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/cssd/ocssd.log.

2014-10-03 09:05:23.815

[cssd(107362)]CRS-1605:CSSD voting file is online: /dev/rhdisk3. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/cssd/ocssd.log.

2014-10-03 09:05:23.815

[cssd(107362)]CRS-1605:CSSD voting file is online: /dev/rhdisk5. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/cssd/ocssd.log.

[cssd(107362)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jxsmdb2 .

2014-10-03 09:08:44.541

[evmd(99266)]CRS-1401:EVMD started on node jxsmdb2.

2014-10-03 09:08:44.585

[crsd(103266)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /oracle/product/10.2.0/crs/log/jxsmdb2/crsd/crsd.log.

2014-10-03 09:08:44.586

[crsd(103266)]CRS-1012:The OCR service started on node jxsmdb2.

2014-10-03 09:08:46.874

[crsd(103266)]CRS-1201:CRSD started on node jxsmdb2.

2014-10-03 09:08:47.163

[crsd(103266)]CRS-1205:Auto-start failed for the CRS resource . Details in jxsmdb2.

2014-10-03 09:08:47.183

[crsd(103266)]CRS-1205:Auto-start failed for the CRS resource . Details in jxsmdb2.

2014-10-03 09:09:43.287

[crsd(103266)]CRS-1205:Auto-start failed for the CRS resource . Details in jxsmdb2.

2014-10-03 09:09:43.297

[crsd(103266)]CRS-1205:Auto-start failed for the CRS resource . Details in jxsmdb2.

2014-10-03 09:09:45.746

[crsd(103266)]CRS-1205:Auto-start failed for the CRS resource . Details in jxsmdb2.

Check crsd.log:

 

2014-10-03 09:05:19.356: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

 

2014-10-03 09:05:19.357: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

 

2014-10-03 09:05:20.702: [ COMMCRS][261]clsc_connect: (1106704d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_jxsmdb2_crs))

 

2014-10-03 09:05:20.702: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

 

2014-10-03 09:05:20.702: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

 

2014-10-03 09:05:22.041: [ COMMCRS][263]clsc_connect: (1106704d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_jxsmdb2_crs))

 

2014-10-03 09:05:22.041: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

 

2014-10-03 09:05:22.041: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

 

2014-10-03 09:05:23.380: [ COMMCRS][265]clsc_connect: (1106704d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_jxsmdb2_crs))

 

2014-10-03 09:05:23.380: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

 

2014-10-03 09:05:23.380: [  CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

 

2014-10-03 09:08:44.482: [  CLSVER][1]32Active Version from OCR:10.2.0.4.0

2014-10-03 09:08:44.482: [  CLSVER][1]32Active Version and Software Version are same

2014-10-03 09:08:44.485: [ CRSMAIN][1]32Initializing OCR

2014-10-03 09:08:44.491: [  OCRRAW][1]proprioo: for disk 0 (/dev/rhdisk2), id match (1), my id set (1551842756,1866535888) total id sets (1), 1st set (1551842756,1866535888), 2nd set (0,0) my votes (1), total votes (2)

2014-10-03 09:08:44.491: [  OCRRAW][1]proprioo: for disk 1 (/dev/rhdisk6), id match (1), my id set (1551842756,1866535888) total id sets (1), 1st set (1551842756,1866535888), 2nd set (0,0) my votes (1), total votes (2)

2014-10-03 09:08:44.574: [  OCRMAS][3352]th_master:12: I AM THE NEW OCR MASTER at incar 1. Node Number 2

2014-10-03 09:08:44.575: [  OCRRAW][3352]proprioo: for disk 0 (/dev/rhdisk2), id match (1), my id set (1551842756,1866535888) total id sets (1), 1st set (1551842756,1866535888), 2nd set (0,0) my votes (1), total votes (2)

2014-10-03 09:08:44.575: [  OCRRAW][3352]proprioo: for disk 1 (/dev/rhdisk6), id match (1), my id set (1551842756,1866535888) total id sets (1), 1st set (1551842756,1866535888), 2nd set (0,0) my votes (1), total votes (2)

2014-10-03 09:08:44.596: [  OCRMAS][3352]th_master: Deleted ver keys from cache (master)

2014-10-03 09:08:44.596: [    CRSD][1]32ENV Logging level for Module: allcomp  0

2014-10-03 09:08:44.597: [    CRSD][1]32ENV Logging level for Module: default  0

2014-10-03 09:08:44.598: [    CRSD][1]32ENV Logging level for Module: COMMCRS  0

2014-10-03 09:08:44.598: [    CRSD][1]32ENV Logging level for Module: COMMNS  0

2014-10-03 09:08:44.599: [    CRSD][1]32ENV Logging level for Module: CRSUI  0

2014-10-03 09:08:44.600: [    CRSD][1]32ENV Logging level for Module: CRSCOMM  0

2014-10-03 09:08:44.600: [    CRSD][1]32ENV Logging level for Module: CRSRTI  0

2014-10-03 09:08:44.601: [    CRSD][1]32ENV Logging level for Module: CRSMAIN  0

2014-10-03 09:08:44.602: [    CRSD][1]32ENV Logging level for Module: CRSPLACE  0

2014-10-03 09:08:44.603: [    CRSD][1]32ENV Logging level for Module: CRSAPP  0

2014-10-03 09:08:44.603: [    CRSD][1]32ENV Logging level for Module: CRSRES  0

2014-10-03 09:08:44.604: [    CRSD][1]32ENV Logging level for Module: CRSOCR  0

2014-10-03 09:08:44.605: [    CRSD][1]32ENV Logging level for Module: CRSTIMER  0

2014-10-03 09:08:44.605: [    CRSD][1]32ENV Logging level for Module: CRSEVT  0

2014-10-03 09:08:44.606: [    CRSD][1]32ENV Logging level for Module: CRSD  0

2014-10-03 09:08:44.607: [    CRSD][1]32ENV Logging level for Module: CLUCLS  0

2014-10-03 09:08:44.607: [    CRSD][1]32ENV Logging level for Module: CLSVER  0

2014-10-03 09:08:44.608: [    CRSD][1]32ENV Logging level for Module: OCRRAW  0

2014-10-03 09:08:44.609: [    CRSD][1]32ENV Logging level for Module: OCROSD  0

2014-10-03 09:08:44.609: [    CRSD][1]32ENV Logging level for Module: CSSCLNT  0

2014-10-03 09:08:44.610: [    CRSD][1]32ENV Logging level for Module: OCRAPI  0

2014-10-03 09:08:44.611: [    CRSD][1]32ENV Logging level for Module: OCRUTL  0

2014-10-03 09:08:44.612: [    CRSD][1]32ENV Logging level for Module: OCRMSG  0

2014-10-03 09:08:44.612: [    CRSD][1]32ENV Logging level for Module: OCRCLI  0

2014-10-03 09:08:44.613: [    CRSD][1]32ENV Logging level for Module: OCRCAC  0

2014-10-03 09:08:44.614: [    CRSD][1]32ENV Logging level for Module: OCRSRV  0

2014-10-03 09:08:44.614: [    CRSD][1]32ENV Logging level for Module: OCRMAS  0

2014-10-03 09:08:44.615: [ CRSMAIN][1]32Filename is /oracle/product/10.2.0/crs/crs/init/jxsmdb2.pid

2014-10-03 09:08:44.651: [ CRSMAIN][1]32Using Authorizer location: /oracle/product/10.2.0/crs/crs/auth/

[  clsdmt][8235]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jxsmdb2DBG_CRSD))

2014-10-03 09:08:44.667: [ CRSMAIN][1]32Initializing RTI

2014-10-03 09:08:44.719: [CRSTIMER][8749]32Timer Thread Starting.

2014-10-03 09:08:44.740: [  CRSRES][1]32Parameter SECURITY = 1, running in USER Mode

2014-10-03 09:08:44.743: [ CRSMAIN][1]32Initializing EVMMgr

2014-10-03 09:08:44.942: [ COMMCRS][9006]clsc_connect: (1139c41d0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

 

2014-10-03 09:08:46.745: [ CRSMAIN][1]32CRSD locked during state recovery, please wait.

2014-10-03 09:08:46.824: [ CRSMAIN][1]32CRSD recovered, unlocked.

2014-10-03 09:08:46.847: [ CRSMAIN][1]32QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))

2014-10-03 09:08:46.847: [ CRSMAIN][1]32QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))

2014-10-03 09:08:46.855: [ CRSMAIN][1]32CRSD UI socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET))

2014-10-03 09:08:46.873: [ CRSMAIN][1]32E2E socket on: (ADDRESS=(PROTOCOL=tcp)(HOST=jxsmdb2_priv)(PORT=49896))

2014-10-03 09:08:46.873: [ CRSMAIN][1]32Starting Threads

2014-10-03 09:08:46.874: [ CRSMAIN][10292]32Starting runCommandServer for (UI = 1, E2E = 0). 0

2014-10-03 09:08:46.874: [ CRSMAIN][10549]32Starting runCommandServer for (UI = 1, E2E = 0). 1

2014-10-03 09:08:46.874: [ CRSMAIN][1]32CRS Daemon Started.

2014-10-03 09:08:46.888: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.901: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.911: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.925: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.934: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.942: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.950: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.958: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.966: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.974: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.983: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.991: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:46.999: [  CRSRES][1]32 startup = 1

2014-10-03 09:08:47.173: [  CRSRES][11834]32startRunnable: setting CLI values

2014-10-03 09:08:47.188: [  CRSRES][11834]32Attempting to start `ora.jxsmdb2.vip` on member `jxsmdb2`

2014-10-03 09:08:47.189: [  CRSRES][11577]32startRunnable: setting CLI values

2014-10-03 09:08:47.199: [  CRSRES][11577]32Attempting to start `ora.jxsmdb2.ASM2.asm` on member `jxsmdb2`

2014-10-03 09:08:49.742: [  CRSRES][11834]32Start of `ora.jxsmdb2.vip` on member `jxsmdb2` succeeded.

2014-10-03 09:08:49.775: [  CRSRES][11834]32startRunnable: setting CLI values

2014-10-03 09:08:49.783: [  CRSRES][11834]32Attempting to start `ora.jxsmdb2.LISTENER_JXSMDB2.lsnr` on member `jxsmdb2`

2014-10-03 09:08:53.948: [  CRSRES][11834]32Start of `ora.jxsmdb2.LISTENER_JXSMDB2.lsnr` on member `jxsmdb2` succeeded.

2014-10-03 09:08:54.410: [  CRSRES][12619]32CRS-1002: Resource 'ora.jxsmdb2.LISTENER_JXSMDB2.lsnr' is already running on member 'jxsmdb2'

 

2014-10-03 09:09:08.992: [  CRSRES][12625]32startRunnable: setting CLI values

2014-10-03 09:09:08.999: [  CRSRES][12625]32Attempting to start `ora.jxsmdb2.ons` on member `jxsmdb2`

2014-10-03 09:09:11.139: [  CRSRES][12625]32Start of `ora.jxsmdb2.ons` on member `jxsmdb2` succeeded.

2014-10-03 09:09:11.216: [  CRSRES][11577]32Start of `ora.jxsmdb2.ASM2.asm` on member `jxsmdb2` succeeded.

2014-10-03 09:09:11.239: [  CRSRES][11577]32startRunnable: setting CLI values

2014-10-03 09:09:11.244: [  CRSRES][11577]32Attempting to start `ora.jxsmk.jxsmk2.inst` on member `jxsmdb2`

2014-10-03 09:09:43.269: [  CRSRES][11577]32Start of `ora.jxsmk.jxsmk2.inst` on member `jxsmdb2` succeeded.

2014-10-03 09:09:43.277: [  CRSRES][12894]32Skip online resource: ora.jxsmdb2.ons

2014-10-03 09:09:43.319: [  CRSRES][13151]32startRunnable: setting CLI values

2014-10-03 09:09:43.345: [  CRSRES][12637]32startRunnable: setting CLI values

2014-10-03 09:09:43.349: [  CRSRES][13151]32Attempting to start `ora.jxsmk.db` on member `jxsmdb2`

2014-10-03 09:09:43.358: [  CRSRES][11610]32startRunnable: setting CLI values

2014-10-03 09:09:43.365: [  CRSRES][12637]32Attempting to start `ora.jxsmdb2.gsd` on member `jxsmdb2`

2014-10-03 09:09:43.371: [  CRSRES][11610]32Attempting to start `ora.jxsmdb1.vip` on member `jxsmdb2`

2014-10-03 09:09:43.916: [  CRSRES][13151]32Start of `ora.jxsmk.db` on member `jxsmdb2` succeeded.

2014-10-03 09:09:44.378: [  CRSRES][12637]32Start of `ora.jxsmdb2.gsd` on member `jxsmdb2` succeeded.

2014-10-03 09:09:44.416: [  CRSRES][13668]32CRS-1002: Resource 'ora.jxsmk.db' is already running on member 'jxsmdb2'

 

2014-10-03 09:09:45.730: [  CRSRES][11610]32Start of `ora.jxsmdb1.vip` on member `jxsmdb2` succeeded.

Check ocssd.log:

jxsmdb2->cd cssd

jxsmdb2->tail -f ocssd.log

[    CSSD]2014-10-03 09:05:23.603 [1] >TRACE:   clssnmFatalInit: fatal mode enabled

[    CSSD]2014-10-03 09:05:23.692 [2829] >TRACE:   clssnmClusterListener: Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=jxsmdb2_priv)(PORT=49895))

 

[    CSSD]2014-10-03 09:05:23.699 [2829] >TRACE:   clssnmconnect: connecting to node(1), con(1112d8b10), flags 0x0003

[    CSSD]2014-10-03 09:05:23.700 [2829] >TRACE:   clssnmDiscHelper: jxsmdb1, node(1) connection failed, con (1112d8b10), probe(0)

[    CSSD]2014-10-03 09:05:23.741 [3086] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_2))

[    CSSD]2014-10-03 09:05:23.741 [3086] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_jxsmdb2_crs))

[    CSSD]2014-10-03 09:05:23.752 [3857] >TRACE:   clssgmPeerListener: Listening on (ADDRESS=(PROTOCOL=tcp)(DEV=25)(HOST=191.191.191.101)(PORT=32823))

[    CSSD]2014-10-03 09:05:23.804 [1544] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(78639) LATS(190241056) Disk lastSeqNo(78639)

[    CSSD]2014-10-03 09:05:30.781 [4628] >TRACE:   clssnmRcfgMgrThread: Local Join

[    CSSD]2014-10-03 09:08:44.082 [4628] >WARNING: clssnmLocalJoinEvent: takeover succ

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmDoSyncUpdate: Initiating sync 1

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (27000)ms

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: Ack message type (11)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: node(2) is ALIVE

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSendSync: syncSeqNo(1)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)

[    CSSD]2014-10-03 09:08:44.082 [2829] >TRACE:   clssnmHandleSync: diskTimeout set to (27000)ms

[    CSSD]2014-10-03 09:08:44.082 [2829] >TRACE:   clssnmHandleSync: Acknowledging sync: src[2] srcName[jxsmdb2] seq[1] sync[1]

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmWaitForAcks: done, msg type(11)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmDoSyncUpdate: node(2) is transitioning from joining state to active state

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: Ack message type (13)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)

[    CSSD]2014-10-03 09:08:44.082 [2829] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(1)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmWaitForAcks: done, msg type(13)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmCheckDskInfo: Checking disk info...

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmCheckDskInfo: diskTimeout set to (200000)ms

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmEvict: Start

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmWaitOnEvictions: Start

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: Ack message type (15)

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSetupAckWait: node(2) is ACTIVE

[    CSSD]2014-10-03 09:08:44.082 [4628] >TRACE:   clssnmSendUpdate: syncSeqNo(1)

[    CSSD]2014-10-03 09:08:44.083 [4628] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)

[    CSSD]2014-10-03 09:08:44.083 [2829] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)

[    CSSD]2014-10-03 09:08:44.083 [2829] >TRACE:   clssnmUpdateNodeState: node 1, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)

[    CSSD]2014-10-03 09:08:44.083 [2829] >TRACE:   clssnmUpdateNodeState: node 2, state (2/3) unique (1412298321/1412298321) prevConuni(0) birth (1/1) (old/new)

[    CSSD]2014-10-03 09:08:44.083 [2829] >USER:    clssnmHandleUpdate: SYNC(1) from node(2) completed

[    CSSD]2014-10-03 09:08:44.083 [2829] >USER:    clssnmHandleUpdate: NODE 2 (jxsmdb2) IS ACTIVE MEMBER OF CLUSTER

[    CSSD]2014-10-03 09:08:44.083 [2829] >TRACE:   clssnmHandleUpdate: diskTimeout set to (200000)ms

[    CSSD]2014-10-03 09:08:44.083 [4628] >TRACE:   clssnmWaitForAcks: done, msg type(15)

[    CSSD]2014-10-03 09:08:44.083 [4628] >TRACE:   clssnmDoSyncUpdate: Sync 1 complete!

[    CSSD]2014-10-03 09:08:44.101 [1] >USER:    NMEVENT_SUSPEND [00][00][00][00]

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmReconfigThread:  started for reconfig (1)

[    CSSD]2014-10-03 09:08:44.105 [4885] >USER:    NMEVENT_RECONFIG [00][00][00][04]

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 1

[    CSSD]2014-10-03 09:08:44.105 [3857] >TRACE:   clssgmPeerListener: connects done (1/1)

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmEstablishMasterNode: MASTER for 1 is node(2) birth(1)

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmChangeMasterNode: requeued 0 RPCs

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete

[    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 1 nodes

 

[    CSSD]CLSS-3001: local node number 2, master node number 2

 

[    CSSD]2014-10-03 09:08:44.105 [4885] >TRACE:   clssgmReconfigThread:  completed for reconfig(1), with status(1)

[    CSSD]2014-10-03 09:08:44.266 [3086] >TRACE:   clssgmCommonAddMember: clsomon joined (2/0x1000000/#CSS_CLSSOMON

 

 

Check the CRS resources:

crs_stat -t

Name           Type           Target    State     Host       

------------------------------------------------------------

ora....SM1.asm application    ONLINE    OFFLINE               

ora....B1.lsnr application    ONLINE    OFFLINE              

ora....db1.gsd application    ONLINE    OFFLINE              

ora....db1.ons application    ONLINE    OFFLINE              

ora....db1.vip application    ONLINE    ONLINE    jxsmdb2    

ora....SM2.asm application    ONLINE    ONLINE    jxsmdb2    

ora....B2.lsnr application    ONLINE    ONLINE    jxsmdb2    

ora....db2.gsd application    ONLINE    ONLINE    jxsmdb2    

ora....db2.ons application    ONLINE    ONLINE    jxsmdb2    

ora....db2.vip application    ONLINE    ONLINE    jxsmdb2    

ora.jxsmk.db   application    ONLINE    ONLINE    jxsmdb2    

ora....k1.inst application    ONLINE    OFFLINE              

ora....k2.inst application    ONLINE    ONLINE    jxsmdb2 

The database is finally up on node 2.

Thinking back to the bug those MOS notes mentioned at the start: the real cause here was most likely just that the OCR was inaccessible. The patch MOS recommends presumably applies only when the disks, hardware, and OS are all healthy.

Start CRS on node 1:
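Since auto-start was disabled earlier with `init.crs disable`, node 1 also needs it re-enabled; a sketch (assuming the CRS home is /oracle/product/10.2.0/crs, as in the log paths above):

```shell
# As root on node 1: undo the earlier "disable", then start CRS manually
/etc/init.crs enable
/oracle/product/10.2.0/crs/bin/crsctl start crs
```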

 

->tail -f al*

[cssd(143604)]CRS-1605:CSSD voting file is online: /dev/rhdisk5. Details in /oracle/product/10.2.0/crs/log/jxsmdb1/cssd/ocssd.log.

[cssd(143604)]CRS-1601:CSSD Reconfiguration complete. Active nodes are jxsmdb1 jxsmdb2 .

2014-09-30 17:39:11.803

[crsd(139286)]CRS-1012:The OCR service started on node jxsmdb1.

2014-09-30 17:39:12.848

[evmd(151694)]CRS-1401:EVMD started on node jxsmdb1.

2014-09-30 17:39:15.807

[crsd(139286)]CRS-1201:CRSD started on node jxsmdb1.

2014-10-02 00:05:49.042

[crsd(159746)]CRS-1011:OCR cannot determine that the OCR content contains the latest updates. Details in /oracle/product/10.2.0/crs/log/jxsmdb1/crsd/crsd.log.

Terminated
You can see that CRS came up.


From the ITPUB blog, link: http://blog.itpub.net/26175573/viewspace-1290649/. If reposting, please cite the source; otherwise legal liability will be pursued.
