CRS-1606 Diagnostic Analysis

Posted by renjixinchina on 2013-06-06
CRS-1606 "CSSD Insufficient voting files available [%s of %s]. Details in %s." means that fewer voting disks are available than the clusterware requires. It is usually caused by a storage or SAN problem that makes the voting disks inaccessible. MOS has a note on diagnosing this for 11.2 and later, Oracle Grid Infrastructure: How to Troubleshoot Voting Disk Evictions [ID 1549428.1], and its diagnostic approach is still useful for 10g.

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

Purpose

The clusterware must be able to access a minimum number of the voting files, otherwise it will abort.

When a node aborts for this reason, the node alert log in $GRID_HOME/log/<nodename>/alert<nodename>.log will show a CRS-1606 error. For example:

2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log

The two most common causes for this issue are: (a) interruption of the storage connection to voting disk; (b) if only one voting disk is in use and version is less than 11.2.0.3.4, hitting known bug 13869978.

The purpose of this document is to provide steps to take after a voting disk eviction.

Troubleshooting Steps

Do the following steps in order.

1. Check whether all of the voting files are currently accessible.

  1. Use the command "crsctl query css votedisk" on a node where clusterware is up to get a list of all the voting files.
  2. Check that each node can access the devices underlying each voting file.
  • Check that the permissions to each voting file/disk have not been changed
  • If the voting files are on clustered file system, check that each device is accessible and each file readable
  • If the voting files are on ASM, use "asmcmd lsdsk -k -G diskgroup_name" to list the devices used by that ASM diskgroup, then check accessibility and permissions of each of those devices on each node.
  • If asmlib is in use, use command "oracleasm querydisk /dev/* 2>/dev/null" as root to show which raw device corresponds to each asmlib label, or see Document 811457.1.
  • To check readability of a raw device, use a "dd" read command, e.g. "dd if=/dev/raw/raw1 of=/dev/null count=100 bs=1024". Be very careful not to overwrite the disk using dd.

If any voting files or underlying devices are not currently accessible from any node, work with storage administrator and/or system administrator to resolve it at storage and/or OS level.
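The accessibility checks in step 1 can be sketched as a small script. This is a minimal sketch: the device list must be supplied by hand (from "crsctl query css votedisk", or "asmcmd lsdsk -k -G <diskgroup_name>" when the voting files are on ASM), and the dd call is strictly read-only.

```shell
#!/bin/sh
# Minimal sketch of step 1: verify each voting device is readable on this node.
# Populate the device list yourself, e.g. from "crsctl query css votedisk"
# or, for ASM, "asmcmd lsdsk -k -G <diskgroup_name>".

check_device() {
  dev="$1"
  # Show ownership/permissions so changed permissions are easy to spot.
  ls -l "$dev" 2>/dev/null
  # Read-only test with dd; never swap if= and of= here, or you will
  # overwrite the disk.
  if dd if="$dev" of=/dev/null count=100 bs=1024 2>/dev/null; then
    echo "$dev: readable"
  else
    echo "$dev: NOT readable"
  fi
}

for dev in "$@"; do
  check_device "$dev"
done
```

Run it on every node, e.g. ./check_votedisks.sh /dev/raw/raw1 /dev/raw/raw2, and follow up with the storage administrator on any device reported NOT readable.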

2. Apply fix for 13869978 if only one voting disk is in use.

See Document 13869978.8. This issue is fixed in 11.2.0.3.4 patch set and above, and 11.2.0.4 and above.

3. Check OS, SAN, and storage logs for any errors from the time of the incident.

  1. Check the OS messages log from each node:
    • AIX: Use command "errpt -a" to show messages.
    • Linux: /var/log/messages
    • Solaris: /var/adm/messages
    • HP-UX: /var/adm/syslog/syslog.log
    • Windows: Check the System Event log
  2. Check SAN and Storage logs.
  3. If voting disks are on NFS:
    • Check NFS filer log.
    • Check network connection between evicted node and NFS for any problems such as MTU mismatch.
    • Use the "nfsiostat" command to check latency to the voting disk(s) on NFS.
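The OS-log scan in 3.1 can be partly automated. The sketch below greps a messages file for an illustrative (not exhaustive) set of storage and NFS error signatures; adjust the log path per the platform list above and extend the patterns for your environment.

```shell
# Sketch: grep an OS messages log for common storage/NFS error signatures.
# The pattern list is illustrative only; extend it for your environment.
scan_storage_errors() {
  grep -Ei 'scsi error|i/o error|link (down|failure)|multipath|not responding|timed out' "$1"
}

# Example (Linux):  scan_storage_errors /var/log/messages
# Example (HP-UX):  scan_storage_errors /var/adm/syslog/syslog.log
```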

4. Check archived IO statistics from the time of the incident.

If the physical disk on which the voting file is located was very busy, this can result in no response from the storage for long enough that the clusterware marks the voting disk unavailable.

a) Collecting the archived IO statistics

Cluster Health Monitor (CHM)

If Cluster Health Monitor (CHM) is available on your platform and version, it will have automatically collected IO, CPU and other statistics. However, the statistics age out very quickly from the CHM repository, so they must be gathered and saved as soon as possible after the eviction. This command will collect all the data into a file named chmos*.tar.gz :

$GRID_HOME/bin/diagcollection.sh --collect --chmos

To limit the CHM data to the time of the incident, use the --incidenttime and --incidentduration arguments:

$GRID_HOME/bin/diagcollection.sh --chmos --incidenttime 02/18/201205:00:00 --incidentduration 05:00

For more information on CHM, including availability, see Document 1328466.1 - Cluster Health Monitor (CHM) FAQ .

OS Watcher (OSW)

If CHM is not available, the DBA can install OS Watcher to automatically collect and archive OS statistics. OS Watcher (OSW) is a lightweight shell script which gathers iostat, vmstat, etc. regularly and saves the data. See Document 301137.1 for more information.

IMPORTANT NOTE: OSW must be configured to collect data more frequently than every 30s or it will not capture a rapidly escalating problem leading to eviction.
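For example, to honor the 30-second rule, OSW could be started with a 20-second snapshot interval. The invocation below is a sketch; the argument order is an assumption, so verify it against the startOSWbb.sh shipped with your OSW release (Document 301137.1).

```shell
# Sketch: start OS Watcher with a snapshot interval under 30 seconds.
# The argument order (interval in seconds, hours of archive, compressor)
# is an assumption; verify against Document 301137.1 for your OSW release.
OSW_INTERVAL=20   # seconds between snapshots; must be below 30
OSW_RETENTION=48  # hours of archived data to keep
# nohup ./startOSWbb.sh "$OSW_INTERVAL" "$OSW_RETENTION" gzip &
echo "OSW would sample every ${OSW_INTERVAL}s, keeping ${OSW_RETENTION}h of history"
```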

b) What to look for in the archived IO statistics

  1. iostat: Check for the following on the physical devices on which the voting file is located (See Step 1 above to identify the correct devices):
    • high service time: 20ms or higher
    • high busy percentage: 50% busy or higher
    • high avg wait time: 20ms or higher
    Note: See Document 1531223.1 for how to find these measures in oswiostat data.
  2. ps: Check archived ps data near the time of the problem for processes in state "D"; "D" means waiting on IO.
  3. netstat: If voting disk(s) on NFS, check the following:
    • Check netstat for the interface which the node uses to communicate with the NFS. Look for any errors, dropped packets, etc.
    • Also check NFS server's historical netstat and iostat data, if available.
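The iostat thresholds in b.1 can be applied mechanically. The awk sketch below assumes a classic "iostat -x" layout where await is column 10 and %util is the last column; column positions differ between iostat versions and platforms, so verify them against your own output (Document 1531223.1 covers the oswiostat format).

```shell
# Sketch: print devices whose await >= 20 ms or %util >= 50 in iostat -x
# output. Assumes sysstat-style columns ending: ... avgqu-sz await svctm %util
# ($10 = await, $NF = %util); adjust for your iostat version.
flag_busy_devices() {
  awk 'NR > 1 && ($10 >= 20 || $NF >= 50) { print $1 }' "$1"
}
```

Any device it prints (cross-check against the voting-file devices identified in step 1) is a candidate for the IO stall that triggered the eviction.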




From the ITPUB blog. Link: http://blog.itpub.net/15747463/viewspace-763167/. Please cite the source when reposting; unauthorized reposts may incur legal liability.
