CRS-1606 Diagnostic Analysis
Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
Purpose
The clusterware must be able to access a minimum number of the voting files, otherwise it will abort.
When the node aborts for this reason, the node alert log under $GRID_HOME/log/ shows a CRS-1606 error such as:
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
The two most common causes for this issue are: (a) interruption of the storage connection to the voting disk; (b) if only one voting disk is in use and the version is earlier than 11.2.0.3.4, hitting known bug 13869978.
The purpose of this document is to provide steps to take after a voting disk eviction.
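The CRS-1606 message shown above carries both the available and the required voting file counts. As a minimal sketch, assuming the message wording matches that example exactly, the following filter prints the two numbers for every CRS-1606 line it reads on standard input. The function name parse_crs1606 is made up for illustration.

```shell
# Print "<available> <required>" for each CRS-1606 message read from stdin.
# Assumes the message text matches the example in this document.
parse_crs1606() {
    sed -n 's/.*CRS-1606:The number of voting files available, \([0-9][0-9]*\), is less than the minimum number of voting files required, \([0-9][0-9]*\),.*/\1 \2/p'
}
```

Feed it the node alert log on standard input; lines that are not CRS-1606 messages are ignored.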
Troubleshooting Steps
Do the following steps in order.
1. Check whether all of the voting files are currently accessible.
- Use the command "crsctl query css votedisk" on a node where clusterware is up to get a list of all the voting files.
- Check that each node can access the devices underlying each voting file.
- Check that the permissions on each voting file/disk have not been changed.
- If the voting files are on a clustered file system, check that each device is accessible and each file is readable.
- If the voting files are on ASM, use "asmcmd lsdsk -k -G diskgroup_name" to list the devices used by that ASM diskgroup, then check accessibility and permissions of each of those devices on each node.
- If asmlib is in use, use command "oracleasm querydisk /dev/* 2>/dev/null" as root to show which raw device corresponds to each asmlib label, or see Document 811457.1.
- To check readability of a raw device, use the "dd" read command, e.g. "dd if=/dev/raw/raw1 of=/dev/null count=100 bs=1024". Be very careful not to overwrite the disk using dd.
If any voting files or underlying devices are currently inaccessible from one or more nodes, work with the storage administrator and/or system administrator to resolve it at the storage and/or OS level.
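The dd read test from step 1 can be wrapped so that it runs over several devices in one go. This is only a sketch: the function name check_readable is made up, and the device paths are whatever step 1 identified on your system. The function only ever reads from the paths it is given and always writes to /dev/null.

```shell
# Read-only check: try to read the first 100 KiB of each path given.
# Prints "OK <path>" or "FAIL <path>" and returns non-zero if any read failed.
# The output side of dd is always /dev/null, so nothing is written to the device.
check_readable() {
    rc=0
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=1024 count=100 2>/dev/null; then
            echo "OK $dev"
        else
            echo "FAIL $dev"
            rc=1
        fi
    done
    return "$rc"
}
```

Run it on every node as a user that is expected to have access to the voting files, for example "check_readable /dev/raw/raw1". A device that has stopped responding may make dd hang rather than fail, which is itself a sign of a storage problem.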
2. Apply fix for 13869978 if only one voting disk is in use.
See Document 13869978.8. This issue is fixed in 11.2.0.3.4 patch set and above, and 11.2.0.4 and above.
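As a rough sketch of the decision in step 2, the following functions compare a dotted version string against 11.2.0.3.4 and combine the result with the number of voting disks. The names version_lt and needs_fix_13869978 are made up for illustration. The caller is assumed to supply the full version including the PSU level, and the voting disk count taken from the "crsctl query css votedisk" listing.

```shell
# Succeed (exit 0) if dotted version $1 is strictly lower than $2.
# Components are compared numerically; missing components count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# Succeed if exactly one voting disk is in use and the version predates the fix.
# $1 = version including PSU level, $2 = number of voting disks.
needs_fix_13869978() {
    [ "$2" -eq 1 ] && version_lt "$1" 11.2.0.3.4
}
```

For example, "needs_fix_13869978 11.2.0.3.2 1" succeeds, while the same call with version 11.2.0.4.0 or with three voting disks does not.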
3. Check OS, SAN, and storage logs for any errors from the time of the incident.
- Check the OS messages log from each node:
  - AIX: Use command "errpt -a" to show messages.
  - Linux: /var/log/messages
  - Solaris: /var/adm/messages
  - HP-UX: /var/adm/syslog/syslog.log
  - Windows: Check the System Event log
- Check SAN and Storage logs.
- If voting disks are on NFS:
  - Check NFS filer log.
  - Check network connection between evicted node and NFS for any problems such as MTU mismatch.
  - Use "nfsiostat" command to check latency to the voting disk(s) on nfs.
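For the Linux case in step 3, the messages log can be narrowed to the time of the incident before reading it by eye. The sketch below assumes the traditional syslog timestamp at the start of each line (for example "Feb 18 05:"). The function name scan_os_log and the keyword list are illustrative assumptions, not a complete set of storage error signatures.

```shell
# Print lines from a syslog-style file that start with the given timestamp
# prefix and mention a storage-related keyword (case-insensitive).
# $1 = log file, $2 = timestamp prefix such as "Feb 18 05:".
# The keyword list is an assumption for illustration only.
scan_os_log() {
    grep "^$2" "$1" | grep -i -E 'i/o error|scsi|multipath|not responding|timed out'
}
```

For example, "scan_os_log /var/log/messages 'Feb 18 05:'" shows candidate lines from that hour. An empty result does not prove the storage was healthy, so the SAN and storage logs still need to be checked.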
4. Check archived IO statistics from the time of the incident.
If the physical disk on which the voting file is located was very busy, this can result in no response from the storage for long enough that the clusterware marks the voting disk unavailable.
a) Collecting the archived IO statistics
Cluster Health Monitor (CHM)
If Cluster Health Monitor (CHM) is available on your platform and version, it will have automatically collected IO, CPU and other statistics. However, the statistics age out very quickly from the CHM repository, so they must be gathered and saved as soon as possible after the eviction. This command collects all the data into a file named chmos*.tar.gz:
$GRID_HOME/bin/diagcollection.sh --collect --chmos
To limit the CHM data to the time of the incident, use the --incidenttime and --incidentduration arguments:
$GRID_HOME/bin/diagcollection.sh --chmos --incidenttime 02/18/201205:00:00 --incidentduration 05:00
For more information on CHM, including availability, see Document 1328466.1 - Cluster Health Monitor (CHM) FAQ.
OS Watcher (OSW)
If CHM is not available, the DBA can install OS Watcher to automatically collect and archive OS statistics. OS Watcher (OSW) is a lightweight shell script that regularly gathers iostat, vmstat, etc. and saves the data. See Document 301137.1 for more information.
IMPORTANT NOTE: OSW must be configured to collect data more frequently than every 30s or it will not capture a rapidly escalating problem leading to eviction.
b) What to look for in the archived IO statistics
- iostat: Check for the following on the physical devices on which the voting file is located (see Step 1 above to identify the correct devices):
  - high service time: 20ms or higher
  - high busy percentage: 50% busy or higher
  - high avg wait time: 20ms or higher
- ps: Check archived ps data near the time of the problem for processes in state "D". "D" means uninterruptible sleep, usually waiting on IO.
- netstat: If the voting disk(s) are on NFS, check the following:
  - Check netstat for the interface which the node uses to communicate with the NFS server. Look for any errors, dropped packets, etc.
  - Also check the NFS server's historical netstat and iostat data, if available.
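The iostat thresholds in step 4 b) can be applied mechanically to archived "iostat -x" output. The following is a sketch only: it assumes Linux sysstat-style output whose header row starts with "Device" and uses the column names svctm, await and %util. Other platforms and newer sysstat releases label or omit these columns, and a column that is not present is simply skipped. The function name flag_busy_devices is made up for illustration.

```shell
# Flag devices in "iostat -x" output that reach the thresholds from step 4 b):
# service time >= 20 ms, average wait >= 20 ms, busy >= 50%.
# Columns are located by header name because positions vary between versions.
flag_busy_devices() {
    awk '
    NF == 0 { seen = 0; next }          # a blank line ends a device section
    $1 ~ /^Device/ {
        # Header row: remember where each metric of interest sits.
        svc = awt = utl = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "svctm") svc = i
            if ($i == "await") awt = i
            if ($i == "%util") utl = i
        }
        seen = 1
        next
    }
    seen {
        why = ""
        if (svc && $svc + 0 >= 20) why = why " svctm=" $svc
        if (awt && $awt + 0 >= 20) why = why " await=" $awt
        if (utl && $utl + 0 >= 50) why = why " %util=" $utl
        if (why != "") print $1 ":" why
    }
    '
}
```

Pipe an archived iostat file into the function and compare the flagged device names with the voting file devices identified in step 1. A flagged device is a lead to follow up with the storage administrator, not a diagnosis on its own.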
Source: ITPUB blog, link: http://blog.itpub.net/15747463/viewspace-763167/. If you repost this, please credit the source; otherwise legal liability will be pursued.