ServeRAID disk drive error recovery

Posted by etzhang on 2008-05-14
The demand for very large drive capacities has created the need to group many physical disks together in the form of arrays and logical drives. To understand what synchronization does, it helps to first understand certain aspects of hard disk technology.

While a drive is being manufactured, a set of low-level tests is run against it to establish two internal sector lists: the "known good sectors" list and the "known bad sectors" list. The installed firmware then locks the drive to its rated capacity and defines which sectors are actively available. For example, a 36GB drive may actually hold 38GB of usable space. The extra space is recorded in the drive's NVRAM in another list, the "known good reserved sectors" list.

Sector sparing
While a drive is in operation, the head may come across a sector with a weakened magnetic reading. The data is still readable, but the reading may fall below the preferred threshold for a qualified good sector. The drive treats this as a failing sector and "sector spares" the data to a new location taken from the "known good reserved sectors" list. Once the data is moved, the old sector address is added to the "Grown Defects" list, never to be used again. This process is a "recoverable" media error. The drive raises a Predictive Failure Analysis (PFA) alert once it has "sector spared" the majority of its "known good reserved sectors". Hard drives do this routinely, and PFAs are part of the Mean-Time-Between-Failures (MTBF) calculations for the drives.

A drive only knows to sector spare when it does a read, a read-modify-write, or a write-verify to a sector. This is important because if the drive never reads or writes a failing sector, it never learns that there is a problem to correct. The sector keeps degrading, and a future read or write ends in an "unrecoverable media error" before the drive has had the chance to save the data. When an "unrecoverable media error" occurs, sector sparing still takes place, but no data can be moved.
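The sparing behavior described above can be sketched as a toy model. This is only an illustration, not how any drive firmware is actually written: the `ToyDrive` class, the two threshold values, and the sector addresses are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds on a 0..1 signal-strength scale (assumptions).
PREFERRED = 0.6    # below this, a sector no longer counts as a good reading
UNREADABLE = 0.2   # below this, the data can no longer be read at all


@dataclass
class ToyDrive:
    signal: dict                                        # sector address -> signal strength
    reserve: list                                       # "known good reserved sectors"
    grown_defects: list = field(default_factory=list)   # retired addresses
    remap: dict = field(default_factory=dict)           # old address -> spare

    def read(self, addr):
        """Read one sector; the drive only notices a weak sector when it is accessed."""
        strength = self.signal[addr]
        if strength >= PREFERRED:
            return "ok"
        if not self.reserve:
            return "no spares left"
        # Sparing takes place for recoverable and unrecoverable errors alike.
        self.remap[addr] = self.reserve.pop(0)
        self.grown_defects.append(addr)
        self.signal[addr] = 1.0  # the address now points at a healthy spare
        # Data is moved only if the old sector was still readable.
        return "recoverable" if strength >= UNREADABLE else "unrecoverable"


drive = ToyDrive(signal={0: 0.9, 1: 0.5, 2: 0.1}, reserve=[100, 101])
results = [drive.read(addr) for addr in (0, 1, 2)]
```

In this sketch sector 1 is caught while still readable, while sector 2 was left alone for too long and is spared without its data.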

Knowing this, you can use simple math to see that the risk of a media error roughly doubles when you go from one drive to two drives in an array. With ten to sixteen drives in an array, media errors become correspondingly more common.
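The "simple math" can be made concrete with a small sketch. It assumes each drive independently has the same probability of a media error over some period; the 1% figure is an arbitrary illustration, not a measured rate, and `p_any_error` is a hypothetical helper name.

```python
def p_any_error(p_single, n_drives):
    """Probability that at least one of n independent drives has a media error."""
    return 1.0 - (1.0 - p_single) ** n_drives


P_SINGLE = 0.01  # assumed per-drive probability, for illustration only

for n in (1, 2, 10, 16):
    print(f"{n:2d} drives: {p_any_error(P_SINGLE, n):.4f}")
```

For small per-drive probabilities the result is close to `n_drives * p_single`, which is why two drives carry almost exactly twice the risk of one.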

Synchronization
Relating this to IBM's ServeRAID technology, a synchronization is designed to force every physical hard disk in a logical drive or array to read each of its sectors. This causes the drives to sector spare "recoverable" media errors, ideally before they become unrecoverable errors. If an "unrecoverable" media error does occur, the ServeRAID controller's synchronization operation corrects it on redundant logical drives (RAID-1, RAID-1E, RAID-5, RAID-5E, RAID-5EE, RAID-10, RAID-1E0, and RAID-50) by rewriting the missing data.
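As a rough illustration of how missing data can be rewritten on a redundant logical drive, here is a minimal sketch of the general XOR parity idea used by RAID-5. It is not the ServeRAID firmware's implementation; the function names and the use of `None` to mark an unreadable block are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def sync_stripe(stripe):
    """Read every block in one stripe and rewrite a single unreadable block.

    `stripe` holds one bytes object per drive (data blocks plus parity).
    None stands for a block that returned an unrecoverable media error.
    Returns the indexes of the blocks that were rebuilt.
    """
    bad = [i for i, block in enumerate(stripe) if block is None]
    if len(bad) > 1:
        raise ValueError("more than one unreadable block; stripe cannot be rebuilt")
    if bad:
        survivors = [block for block in stripe if block is not None]
        stripe[bad[0]] = xor_blocks(survivors)  # rewrite the missing data
    return bad


d0, d1 = b"\x01\x02", b"\x0f\xf0"
parity = xor_blocks([d0, d1])
stripe = [d0, None, parity]      # the second drive lost its block
rebuilt = sync_stripe(stripe)
```

Because the parity block is the XOR of the data blocks, XOR-ing the surviving blocks reproduces the lost one. A mirrored level such as RAID-1 would instead copy the block from the other drive.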

Foreground synchronizations can be started manually in two ways: with the ServeRAID Manager GUI or with the IPSSEND command. IPSSEND can be called from a BAT or CMD file and then automated with almost any scheduling utility.

Source: "ITPUB Blog", link: http://blog.itpub.net/743764/viewspace-1004010/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
