Troubleshooting ServeRAID
To begin troubleshooting, check the following top issues. If your issue is listed, go to the matching section below.
What are Bad Stripe Table entries?
What are Bad Stripe Table (BST) limitations?
What are the conditions conducive to the appearance of bad stripes?
How to minimize the risk of bad stripes
How to maintain ServeRAID
How does an operating system react when it tries to read or write to a bad stripe?
How can Bad Stripe Table entries be removed from an Array or Logical Drive?
How to mitigate the existence of bad stripes on a logical drive
Frequently asked questions (FAQs)
What are Bad Stripe Table entries?
The Bad Stripe Table (BST) tracks the stripes of a logical drive that contain invalid or incomplete data. The table is stored in the area that ServeRAID reserves for configuration information on each physical disk in the array that hosts the logical drive. There is a separate table for each logical drive.
The same concepts that apply to RAID logical drives can be brought down to the stripe level. The ServeRAID controller is designed to handle and correct a single stripe unit failure on a read or a write-verify. If two or more stripe units within the same horizontal stripe across the array fail at the same time for any reason, all stripe units within that stripe become blocked, creating a bad stripe table entry in the logical drive's configuration. The error message "Multiple stripe unit failures within a single horizontal stripe" is a clear definition of a bad stripe at its most basic level. See ServeRAID stripes and stripe units for more information on stripes.
The most common cause of a bad stripe table entry is extended operation of a logical drive while it is in a critical state (one disk within the array is marked defunct). In this situation, one stripe unit in every stripe is already unavailable, so if the controller encounters another unrecoverable error, the second stripe unit failure increments the bad stripe count. A bad stripe is essentially a stripe-level RAID failure, but instead of taking the entire logical drive offline, only the data within the stripe becomes unavailable. A bad stripe entry is then added to the Bad Stripe Table and becomes part of the array and logical drive configuration.
Note: Bad stripe table entries on a logical drive are a symptom of a physical, procedural, or environmental problem within the ServeRAID subsystem. They can occur while the ServeRAID controller tries to keep data available under less than optimal circumstances. Clearing the bad stripes may not prevent them from recurring later if the condition that caused them is not corrected first.
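The difference between one and two failed stripe units in a stripe can be illustrated with a minimal sketch. It assumes a RAID-5 style layout with a single XOR parity unit per stripe; the source does not describe ServeRAID's internal implementation, and the names `xor_units` and `read_stripe` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_units(units: List[bytes]) -> bytes:
    """XOR equal-length stripe units together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def read_stripe(units: List[Optional[bytes]]) -> Optional[List[bytes]]:
    """Return the complete stripe, or None when the stripe is a bad stripe.

    `units` holds every stripe unit of one horizontal stripe, parity included.
    None marks a stripe unit that could not be read.
    """
    missing = [i for i, unit in enumerate(units) if unit is None]
    if not missing:
        return list(units)
    if len(missing) == 1:
        # A single lost unit equals the XOR of all surviving units.
        survivors = [unit for unit in units if unit is not None]
        repaired = list(units)
        repaired[missing[0]] = xor_units(survivors)
        return repaired
    # Two or more failures in the same stripe cannot be reconstructed.
    return None


data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
stripe = data + [xor_units(data)]  # three data units plus one parity unit

one_lost = [stripe[0], None, stripe[2], stripe[3]]
two_lost = [None, None, stripe[2], stripe[3]]
print(read_stripe(one_lost) == stripe)  # True
print(read_stripe(two_lost))            # None
```

A logical drive in a critical state corresponds to the `one_lost` case for every stripe, which is why one further unrecoverable error is enough to produce a bad stripe.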
What are Bad Stripe Table (BST) limitations?
ServeRAID firmware allows a maximum of 128 entries in the BST for a logical drive before blocking that logical drive. If 128 entries already exist in the BST for a logical drive in the Rebuild state and another uncorrectable read error occurs, such that the firmware would normally add the stripe to the BST, the rebuild is halted and the logical drive becomes blocked. The drive that had been rebuilding continues to show a state of rebuilding, but no rebuild activity occurs.
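The limit can be pictured with a small model. This is only a sketch of the behavior described above, not firmware code; the class `LogicalDriveModel` and its method names are hypothetical, and only the limit of 128 comes from the text.

```python
MAX_BST_ENTRIES = 128  # limit stated in the text above


class LogicalDriveModel:
    """Hypothetical model of one logical drive and its Bad Stripe Table."""

    def __init__(self):
        self.bad_stripes = []
        self.blocked = False

    def record_bad_stripe(self, stripe_number):
        """Add a BST entry; return False if the drive is blocked instead."""
        if self.blocked:
            return False
        if len(self.bad_stripes) >= MAX_BST_ENTRIES:
            # The table is full: the drive is blocked rather than extended.
            self.blocked = True
            return False
        self.bad_stripes.append(stripe_number)
        return True


drive = LogicalDriveModel()
for stripe_number in range(128):
    drive.record_bad_stripe(stripe_number)
print(len(drive.bad_stripes), drive.blocked)  # 128 False

drive.record_bad_stripe(128)                  # one error too many
print(len(drive.bad_stripes), drive.blocked)  # 128 True
```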
What are the conditions conducive to the appearance of bad stripes?
Physical Conditions:
- Simultaneous disk drive problems with two or more drives
- Hard system hang conditions
- Failing components in the SCSI data path, including cables, backplanes, termination, and tray interposers
- UPS failures, power fluctuations or unexpected power off situations
- System NMI conditions
- Poor seating of SCSI bus components, including the ServeRAID controller, cables, terminators, backplanes, drive or cable connections, hot swap drive tray connections, and repeater options
Procedural issues:
- Powering a server off improperly
- Operating a server for an extended period of time with unmatched versions of ServeRAID software including BIOS, Firmware, Drivers, or Utilities
- Operating a server for an extended period of time while a logical drive is in a critical state
- Improper installation of replacement drives (can cause a poor seat against the backplane)
- Failure to follow the recommended guidelines for UPS installation and redundant power connections, which are designed to prevent exposure to an unexpected power off condition
Environmental issues:
- Operating the server or disk drives outside of environmental specifications
How to minimize the risk of bad stripes
How to maintain ServeRAID
Synchronize each logical drive on a regular basis. The time period between synchronizations depends on how dynamic the data is. When the data changes constantly, synchronize weekly. If the data is static with minimal changes, synchronize monthly.
Foreground synchronizations can be initiated in two ways: with the ServeRAID Manager GUI or with the IPSSEND command. The IPSSEND command can be used in a BAT or CMD file and then automated with almost any scheduling utility.
- What are Data Scrubbing and Synchronization?
- Understanding Hard Disk Drive error recovery and ServeRAID synchronization and when to use them
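The weekly and monthly guidance above can be expressed as a small scheduling helper. This is a sketch only: the function names are hypothetical, "monthly" is assumed to mean 30 days, and the actual synchronization would still be started through ServeRAID Manager or IPSSEND, whose command syntax is not given here and is therefore left out.

```python
from datetime import date, timedelta

# Intervals from the guidance above; 30 days is an assumed reading of "monthly".
WEEKLY = timedelta(days=7)
MONTHLY = timedelta(days=30)


def next_sync_due(last_sync: date, data_is_dynamic: bool) -> date:
    """Return the date on which the next synchronization should run."""
    interval = WEEKLY if data_is_dynamic else MONTHLY
    return last_sync + interval


def sync_is_due(last_sync: date, today: date, data_is_dynamic: bool) -> bool:
    """Return True once the recommended interval has elapsed."""
    return today >= next_sync_due(last_sync, data_is_dynamic)


last = date(2024, 1, 1)
print(next_sync_due(last, data_is_dynamic=True))                    # 2024-01-08
print(sync_is_due(last, date(2024, 1, 20), data_is_dynamic=False))  # False
```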
How does an operating system react when it tries to read or write to a bad stripe?
When the operating system attempts to write to a portion (a stripe) of a logical drive that has an entry in the BST, the write fails and an error code is returned to the operating system. Some operating systems can handle the error by using a write/reassign command that will write the data to another area of the logical drive. The new data will be stored at another location of the drive, but the BST is not changed.
When the operating system attempts to read from a portion (a stripe) of a logical drive that has an entry in the BST, the read fails and an error code is returned to the operating system. There is no operating system recovery, since the data is lost.
Each operating system reacts differently when bad stripe entries occur. ServeRAID cannot identify which files are located within a blocked stripe, which can lead to varying operating system behaviors. If the blocked file is a data file, the operating system will likely report that it cannot find or cannot read the file. If the blocked file is an operating system or application file, the operating system will likely fail the application or may crash the system, depending on how important the file is.
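The read and write behavior described in this section can be modeled in a few lines. Everything in this sketch is hypothetical (`DriveModel`, `BlockedStripeError`, `write_with_reassign`, and the idea of addressing whole stripes directly); it only mirrors the rules stated above: both operations fail on a stripe in the BST, a write can be redirected elsewhere, and the BST itself does not change.

```python
class BlockedStripeError(OSError):
    """Stands in for the error code returned to the operating system."""


class DriveModel:
    """Hypothetical logical drive addressed by stripe number."""

    def __init__(self, stripe_count, bad_stripes):
        self.stripes = {n: b"" for n in range(stripe_count)}
        self.bst = set(bad_stripes)

    def write(self, stripe, data):
        if stripe in self.bst:
            raise BlockedStripeError(f"write failed: stripe {stripe} is in the BST")
        self.stripes[stripe] = data

    def read(self, stripe):
        if stripe in self.bst:
            raise BlockedStripeError(f"read failed: stripe {stripe} is in the BST")
        return self.stripes[stripe]


def write_with_reassign(drive, stripe, data, spare_stripe):
    """Mimic an operating system that retries a failed write elsewhere."""
    try:
        drive.write(stripe, data)
        return stripe
    except BlockedStripeError:
        drive.write(spare_stripe, data)
        return spare_stripe


drive = DriveModel(stripe_count=8, bad_stripes={3})
print(write_with_reassign(drive, 3, b"new data", spare_stripe=7))  # 7
print(drive.bst)                                                   # {3}
try:
    drive.read(3)
except BlockedStripeError as error:
    print(error)  # read failed: stripe 3 is in the BST
```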
How can Bad Stripe Table entries be removed from an Array or Logical Drive?
By design, bad stripe table entries only increment upward from zero. There are no tools or commands that can remove an entry from the table. The table is part of the ServeRAID configuration information stored in the reserved area of each physical disk in the array that hosts the affected logical drive.
There are only two suggested methods for clearing the reserved areas of the drives and both are destructive to the data stored on the physical disks. Backing up the good data on the drives is recommended before any changes to the configuration are made.
The first method is to remove or delete the existing array configuration from the physical disks in the array that hosts the affected logical drive, then create an identical new configuration, which overwrites any previously existing configuration data. The BST is rewritten and starts with zero entries.
The second method has one additional step. After the existing configuration is removed from the physical disks, do a low-level format on each physical disk using the IPSSEND Format command, and then create an identical new configuration. This provides the additional benefit of verifying that each drive is error-free.
Any other methods have a high probability of recreating the same bad stripe table entry, or exposing the operating system to invalid data that may result in other unexpected problems.
How to mitigate the existence of bad stripes on a logical drive
Every situation is different, and the circumstances greatly affect how to resolve bad stripes on a logical drive. The first step is to identify the most likely cause of the bad stripe, which could be a physical problem, a procedural problem, or an environmental issue, and then take corrective action.
- After the condition has been corrected, assess the damage done to the data.
- What data is corrupt or missing?
- Was this data critical or non-critical?
In a Windows Environment:
- Use the CHKDSK command and the COPY filename NUL /B command to test the data in question. If the problem has existed for a while, you can also examine logs from a recent backup to see which files failed to back up properly. A file sitting on a bad stripe will usually fail to back up to tape. Most backup software logs the names of files that cannot be backed up.
- CHKDSK assesses the overall health of the logical drive. If CHKDSK errors out, the corruption is likely extensive.
- COPY filename NUL /B will force a binary read of the file and should result in a "1 File copied" message or an error. The file is not actually copied, as the output is NUL and the /B forces a file length binary read. If the file is sitting on a bad stripe or is otherwise damaged, the command will error out. The filename can use wildcards like *.doc. Exclusive access to the files is required to run this command.
Neither of these commands can fully assess the scope of corruption. The COPY command only determines whether the file can be read, not whether the data stored in the file is valid. You will likely need the data integrity tools native to the applications that access the files to fully assess the scope of data loss.
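The binary-read check performed by COPY filename NUL /B can also be scripted. The following is a minimal sketch, not a replacement for the commands above; `read_fully` and `check_files` are hypothetical names, and the demonstration runs against a temporary directory rather than a real logical drive. Like COPY, it only shows whether every byte of a file can be read, not whether the contents are valid.

```python
import glob
import os
import tempfile
from typing import Dict, List


def read_fully(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read every byte of a file, discard it, and return the byte count."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return total
            total += len(chunk)


def check_files(pattern: str) -> Dict[str, List[str]]:
    """Sort files matching a wildcard pattern into readable and unreadable."""
    result: Dict[str, List[str]] = {"readable": [], "unreadable": []}
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            read_fully(path)
            result["readable"].append(path)
        except OSError:
            # A read error here may mean the file sits on a blocked stripe.
            result["unreadable"].append(path)
    return result


# Demonstration on a throwaway directory with one small file.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "report.doc")
    with open(sample, "wb") as handle:
        handle.write(b"x" * 4096)
    report = check_files(os.path.join(folder, "*.doc"))
    print(len(report["readable"]), len(report["unreadable"]))  # 1 0
```

Files listed as unreadable are candidates for restoring from backup.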
Based on all the evidence on hand, including the total number of bad stripes, the confidence level in the problem determination steps taken, and the corrective action plan, decide how to proceed with the recovery. When the problems are minor, one option is to restore the lost files from a recent backup. When the corruption is catastrophic, a second option is to remove and recreate the array and logical drive, then restore from backup. A third, more moderate option is to restore the lost data from backup now and schedule a planned outage to rebuild the logical drive and data later. This gets the data back online for users during production hours.
A system can operate normally with bad stripes on the logical drive; however, it is very important to monitor the system to ensure the corrective action actually fixed the condition.
Frequently asked questions (FAQs)
Q: Will the existence of a bad stripe cause a Rebuild to fail?
A: No, a rebuild will complete normally, except as noted above under Bad Stripe Table (BST) Limitations. However, if the rebuild does fail, it is likely that the condition conducive to the appearance of bad stripes was not corrected or the corrective actions were not completely successful.
Q: Does the existence of one or more bad stripes cause additional bad stripes?
A: No. Bad stripes are symptoms of another problem, most commonly related to the SCSI bus, for example, cables, backplanes, termination, trays, or improper seating of components. The controllers very rarely contribute to new bad stripes. RETAIN tip H09680, RAID-5 Potential for Data Loss with ServeRAID Under Stress, describes all known issues.
Q: If the condition conducive to the appearance of bad stripes is eliminated, will the system operate properly from then on?
A: Yes, although there may be some residual effects in the operating system. In a Windows environment, Event IDs 26, 50, or 51 can occur if the operating system cannot account for some missing data, for example, when a temporary file created to track the progress of another process goes missing. The software may continue to look for the missing file, resulting in Event IDs 26, 50, or 51. In Windows, very small files are often saved in the Master File Table (MFT), and if a bad stripe crosses the MFT, the same events may occur. These events occur infrequently, but sometimes they fill the System Event log. Running CHKDSK /F should correct these problems. Continued problems with new Lost Delayed Writes (Event IDs 26 and 50) indicate that the corrective actions were not fully successful. Check the Bad Stripe Table entry count regularly until you are sure it does not increment. Event ID 51 can still occur after CHKDSK /F successfully completes and corrects file system integrity.
Q: Is there a way to clear the bad stripe entries without removing and rebuilding the array/logical drives?
A: There are no tools or commands that can remove an entry from the bad stripe table.
Source: "ITPUB blog", link: http://blog.itpub.net/743764/viewspace-1003898/. If you reproduce this article, credit the source; otherwise legal liability will be pursued.