CRS-2765: Resource 'ora.crsd' has failed on server
The crsd process could not start: the voting disks were inaccessible, and after the 15-second timeout the diskgroup was dismounted.
ocssd.log
>>>>>>>>>>>>>>>>>>>>>
2016-01-15 10:33:11.620: [ CSSD][1042200320]clssnmSendingThread: sent 5 status msgs to all nodes
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15540 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
ASM alert.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
Fri Jan 15 10:33:21 2016
NOTE: process _b000_+asm1 (11776) initiating offline of disk 0.3915952267 (OCR_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (11776) initiating offline of disk 1.3915952269 (OCR_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 9 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968b08b, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968b08d, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 10 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
Fri Jan 15 10:33:21 2016
NOTE: cache dismounting (not clean) group 2/0x22584077 (OCR)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11785, image: oracle@cntl119 (B001)
Fri Jan 15 10:33:21 2016
NOTE: halting all I/Os to diskgroup 2 (OCR)
Fri Jan 15 10:33:21 2016
NOTE: LGWR doing non-clean dismount of group 2 (OCR)
NOTE: LGWR sync ABA=4.55 last written ABA 4.55
WARNING: Offline for disk OCR_0000 in mode 0x7f failed.
WARNING: Offline for disk OCR_0001 in mode 0x7f failed.
Fri Jan 15 10:33:21 2016
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Fri Jan 15 10:33:21 2016
ERROR: ORA-15130 in COD recovery for diskgroup 2/0x22584077 (OCR)
ERROR: ORA-15130 thrown in RBAL for group number 2
Errors in file /grid/base/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_4292.trc:
ORA-15130: diskgroup "OCR" is being dismounted
Fri Jan 15 10:33:21 2016
NOTE: No asm libraries found in the system
ASM Health Checker found 1 new failures
Fri Jan 15 10:33:21 2016
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
 130 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Jan 15 10:33:21 2016
WARNING: dirty detached from domain 2
NOTE: cache dismounted group 2/0x22584077 (OCR)
NOTE: cache deleting context for group OCR 2/0x22584077
SQL> alter diskgroup OCR dismount force /* ASM SERVER:576209015 */ >>>>>>>>>>>>>>>>>>>>>>>> CRS force-dismounts the diskgroup
GMON dismounting group 2 at 11 for pid 36, osid 11785
NOTE: Disk OCR_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0002 in mode 0x7f marked for de-assignment
APPLIES TO:
Oracle Database - Standard Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Managed Cloud Services Problem Resolution - Version N/A to N/A
Information in this document applies to any platform.
SYMPTOMS
The CRSD process (crsd.bin) aborted due to the OCR location being inaccessible (here +OCR_VOTE).
- <GI_HOME>/log/<node>/alert<node>.log shows the following messages:
[crsd(14307)]CRS-1006:The OCR location +OCR_VOTE is inaccessible. Details in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.
2013-11-07 06:51:08.465:
[ohasd(12810)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac1'.
2013-11-07 06:51:09.907:
[crsd(31845)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage]. Details at (:CRSD00111:) in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.
- The log of the grid user agent (<GI_HOME>/log/<node>/agent/crsd/oraagent_grid/oraagent_grid.log) shows that it received a message to stop the ora.OCR_VOTE.dg resource:
2013-11-07 06:37:28.744: [ AGFW][1112471872]{1:58982:16178} Preparing STOP command for: ora.OCR_VOTE.dg rac1 1
2013-11-07 06:37:28.745: [ AGFW][1112471872]{1:58982:16178} ora.OCR_VOTE.dg rac1 1 state changed from: ONLINE to: STOPPING
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop {
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: enter {
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] getResAttrib: attrib name USR_ORA_OPI value true len 4
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] Agent::flagUsrOraOpiIsSet(true) reason not dependency
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: tha exit }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stopSingle status:2 }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop }
2013-11-07 06:37:28.747: [ AGFW][1101465920]{1:58982:16178} Command: stop for resource: ora.OCR_VOTE.dg rac1 1 completed with status: SUCCESS
- The CSS daemon log (<GI_HOME>/log/<node>/cssd/ocssd.log) shows that several CSS threads had not been scheduled for several seconds:
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7760 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 8040 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 8040 msecs
..
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60030 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 59620 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60240 msecs
- The ASM alert log shows that the OCR diskgroup was dismounted after the I/O issue:
Thu Nov 07 06:36:28 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
Thu Nov 07 06:36:28 2013
NOTE: process _b000_+asm1 (21285) initiating offline of disk 0.3915951409 (OCR_VOTE01) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 1.3915951410 (OCR_VOTE02) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 2.3915951411 (OCR_VOTE03) with mask 0x7e in group 1
Thu Nov 07 06:36:28 2013
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xDE385DDC (OCR_VOTE)
SQL> alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */
Thu Nov 07 06:37:28 2013
SUCCESS: diskgroup OCR_VOTE was dismounted
SUCCESS: alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group OCR_VOTE
- Running crsctl shows that the CRSD process (crsd.bin) is not running:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
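For reference, the note does not show the exact command that was run; a basic status check such as the following returns these errors while crsd.bin is down:
# crsctl check crs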
CAUSE
The issue was caused by a storage problem: the disks stopped responding for approximately one minute, so the PST disk update exceeded the 15-second limit, the OCR diskgroup was dismounted, and consequently the CRS daemon (crsd.bin) aborted.
- The OS messages log shows that there was a disk issue (the exact message varies depending on the storage used).
- OSW vmstat shows 10 processes waiting on I/O (column b = 10):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 10 0 50944832 776328 10809588 0 0 1 3 1 1 0 0 100 0 0
0 10 0 50936532 776328 10809628 0 0 0 244 1044 9054 1 1 97 1 0
0 10 0 50939424 776328 10809872 0 0 0 0 1024 6426 0 0 98 2 0
- In this particular case the ASM disks for the +OCR_VOTE diskgroup were mapped to 3 partitions (/dev/sda1 - /dev/sda3) on a single disk device:
# ls -ltr /dev/oracleasm/disks/*
brw-rw---- 1 grid asmadmin 8, 1 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE01 --- major is 8, minor is 1
brw-rw---- 1 grid asmadmin 8, 2 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE02 --- major is 8, minor is 2
brw-rw---- 1 grid asmadmin 8, 3 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE03 --- major is 8, minor is 3
brw-rw---- 1 grid asmadmin 8, 5 Nov 18 05:04 /dev/oracleasm/disks/ASM_DATA
brw-rw---- 1 grid asmadmin 8, 6 Nov 18 05:04 /dev/oracleasm/disks/ASM_BACKUP
# ls -ltr /dev/sda*
brw-r----- 1 root disk 8, 0 Nov 18 05:03 /dev/sda
brw-r----- 1 root disk 8, 1 Nov 18 05:04 /dev/sda1 --- major is 8, minor is 1, used for OCR_VOTE01
brw-r----- 1 root disk 8, 2 Nov 18 05:04 /dev/sda2 --- major is 8, minor is 2, used for OCR_VOTE02
brw-r----- 1 root disk 8, 3 Nov 18 05:04 /dev/sda3 --- major is 8, minor is 3, used for OCR_VOTE03
brw-r----- 1 root disk 8, 4 Nov 18 05:04 /dev/sda4
brw-r----- 1 root disk 8, 5 Nov 18 05:04 /dev/sda5
brw-r----- 1 root disk 8, 6 Nov 18 05:04 /dev/sda6
brw-r----- 1 root disk 8, 7 Nov 18 05:04 /dev/sda7
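To confirm such a mapping, the major:minor numbers of the ASM device files can be compared with those of the raw partitions; a minimal sketch, assuming ASMLib devices under /dev/oracleasm/disks and GNU stat:
# Print name and major:minor (in hex) for each ASM disk and each sda partition;
# matching numbers mean the ASM disks live on partitions of one physical device.
for d in /dev/oracleasm/disks/* /dev/sda*; do
    stat -c '%n major:minor=%t:%T' "$d"
done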
OSW iostat shows that devices /dev/sda1, /dev/sda2 and /dev/sda3 were abnormal for approximately one minute.
Starting at 06:36:18 many I/O requests were stuck in the request queue (avgqu-sz = 16) and %util was 100%,
yet disk throughput was zero (rkB/s = 0 and wkB/s = 0); this indicates that device saturation occurred.
Disk I/O recovered at 06:37:19.
zzz ***Thu Nov 7 06:35:17 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 9.00 10.00 39.00 53.50 9.74 0.03 1.32 1.32 2.50
sda1 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.01 2.00 2.00 0.80
sda3 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 2.00 2.00 32.00 20.00 26.00 0.02 4.00 4.00 1.60
sda6 0.00 0.00 1.00 2.00 4.00 20.00 16.00 0.00 0.33 0.33 0.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zzz ***Thu Nov 7 06:36:18 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.02 0.00 0.00 100.10
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 100.10
sda6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zzz ***Thu Nov 7 06:37:19 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 7.00 7.00 19.00 41.50 8.64 0.03 1.86 1.86 2.60
sda1 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda3 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 1.00 2.00 16.00 20.00 24.00 0.01 2.67 2.67 0.80
sda6 0.00 0.00 0.00 2.00 0.00 20.00 20.00 0.00 0.00 0.00 0.00
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
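To locate such a window quickly in the OSW archive, a small filter over the iostat files can be used; a sketch, assuming GNU awk and an archive path of /u01/oswatcher/archive/oswiostat (adjust both to your installation):
# Print the timestamp of every sample where sda was 100% busy but moved no data
awk '/^zzz/ { ts = $0 }
     $1 == "sda" && $12 >= 100 && $6 == 0 && $7 == 0 { print ts }' \
    /u01/oswatcher/archive/oswiostat/*.dat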
SOLUTION
Contact the system/storage administrator, and possibly the storage vendor, to determine why the disk (shared storage) was not responding for an extended amount of time.
In addition:
- Avoid using multiple partitions of the same physical disk as ASM disks for the voting disks. Ideally each ASM disk should be on a different physical device, or even a different SAN if possible. This may not be achievable with LUNs, as they typically do not map to a single physical disk.
- Make sure the OCR is mirrored onto a second diskgroup, so that the CRSD process can survive the dismount of one of the diskgroups (see the example below).
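A minimal sketch of adding a second OCR location, assuming a spare diskgroup named +OCR2 already exists (run as root):
# ocrconfig -add +OCR2 --- register +OCR2 as an additional OCR location
# ocrcheck --- verify that both OCR locations are reported as intact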
Other ISSUE
Generally, messages of this kind appear in the ASM alert log in the following situation: the ASM PST heartbeat to disks in a normal or high redundancy diskgroup is delayed beyond the limit (15 seconds by default), and the ASM instance therefore dismounts the diskgroup.
Heartbeat delays are essentially ignored for an external redundancy diskgroup: the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, but the delays do not directly dismount an external redundancy diskgroup.
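Since only normal and high redundancy diskgroups are dismounted this way, it can help to confirm the redundancy type of each group from the ASM instance (TYPE shows EXTERN, NORMAL or HIGH):
SQL> select group_number, name, type from v$asm_diskgroup;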
An ASM disk can become unresponsive, typically in the following scenarios:
+ during path failover in a multipath setup
+ server load, or any sort of storage/multipath/OS maintenance
Document 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter). The underlying problem is the lack of any OS/storage tunable timeout mechanism in the case of a hung NFS server/filer; _asm_hbeatiowait provides a way to set the timeout.
SOLUTION
1] Check with the OS and storage administrators whether there is disk unresponsiveness.
2] If possible, keep the disk unresponsiveness below 15 seconds.
This will depend on various factors such as:
+ the operating system
+ the presence of multipathing (and the multipath type)
+ kernel parameters
So you need to find out the 'maximum' possible disk unresponsiveness for your setup.
For example, on AIX the rw_timeout setting affects this; it defaults to 30 seconds.
>>>>>>>>>>>>>>>>>>>>>>>> AIX rw_timeout defaults to 30 seconds; my environment is Linux.
Another example is Linux with native multipathing: there, the number of physical paths and the polling_interval value in the multipath.conf file dictate the maximum disk unresponsiveness.
So for your setup (the combination of OS / multipath / storage), you need to determine this value.
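As an illustration for Linux native multipathing, the worst case is roughly no_path_retry * polling_interval. With the example values in the /etc/multipath.conf excerpt below (illustrative assumptions, not recommendations), a failure of all paths could stall I/O for about 12 * 5 = 60 seconds, well beyond the 15-second PST limit:
defaults {
    polling_interval 5     # path checker interval, in seconds
    no_path_retry 12       # retries before queued I/O is failed
}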
3] If you cannot keep the disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the RAC):
_asm_hbeatiowait
Set it to 200.
>>>>>>>>>>>>>>>>>>>>>>>>>>> cf. polling_interval * no_path_retry
Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=200 scope=spfile sid='*';
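Because scope=spfile is used, the new value takes effect only after the ASM instances are restarted. One standard way to verify it afterwards is the usual hidden-parameter query (connected as SYSASM):
SQL> select x.ksppinm, y.ksppstvl
     from x$ksppi x, x$ksppcv y
     where x.indx = y.indx
     and x.ksppinm = '_asm_hbeatiowait';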
Since there is no way to fix the storage problem for the time being, we increased the I/O wait timeout as a temporary workaround. The root cause remains the storage link.
From the ITPUB blog: http://blog.itpub.net/17086096/viewspace-1978542/. Please credit the source when reposting.