- Disk failure ways and their mitigation - Priya Gangaraju(Class Id-203)
Ways in which disks can fail: intermittent failure, media decay, write failure, disk crash.
Intermittent Failures. A read or write operation on a sector fails on the first try but succeeds after repeated tries. This is the most common form of failure. Parity checks can be used to detect it.
Media Decay. A serious form of failure in which one or more bits are permanently corrupted, so the sector cannot be read correctly no matter how many times we try. The stable storage technique for organizing a disk is used to cope with this failure.
Write Failure. An attempt to write a sector does not succeed, and the previously written contents of the sector cannot be retrieved either. A possible cause is a power outage while the sector is being written. The stable storage technique can be used to cope with this.
Disk Crash. The most serious form of disk failure. The entire disk becomes unreadable, suddenly and permanently. RAID techniques can be used to cope with disk crashes.
More on Intermittent Failures… An intermittent read failure occurs when we try to read a sector but the correct content of that sector is not delivered to the disk controller. If the controller has a way to tell whether the sector is good or bad (checksums), it can reissue the read request whenever bad data is read.
More on Intermittent Failures… The controller can attempt to write a sector, but the contents actually written may not be what was intended. The only way to check this is to let the disk go around again and read the sector back. One way to perform the check is to read the sector and compare it with the data we intended to write.
Contd.. Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector back and see whether a good sector was read. If it is a good sector, the write is taken to be correct; otherwise the write was unsuccessful and must be repeated.
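The retry behaviour described on these slides can be sketched in a few lines. This is a minimal sketch, not a real controller: the `FlakySector` class, its error rates, and the retry limit of 10 are all hypothetical, and "good" is judged with a single even-parity bit.

```python
import random


class FlakySector:
    """Toy sector holding data bits plus one even-parity bit (hypothetical model)."""

    def __init__(self, read_error_rate=0.0, write_error_rate=0.0, seed=0):
        self.stored = []
        self.read_error_rate = read_error_rate
        self.write_error_rate = write_error_rate
        self.rng = random.Random(seed)

    def _maybe_flip(self, bits, rate):
        # With probability `rate`, corrupt exactly one bit.
        if bits and self.rng.random() < rate:
            bits[self.rng.randrange(len(bits))] ^= 1
        return bits

    def write(self, data_bits):
        intended = list(data_bits) + [sum(data_bits) % 2]
        self.stored = self._maybe_flip(intended, self.write_error_rate)

    def read(self):
        return self._maybe_flip(list(self.stored), self.read_error_rate)


def is_good(raw):
    """Good status: data bits plus parity bit contain an even number of 1's."""
    return len(raw) > 0 and sum(raw) % 2 == 0


def read_with_retry(sector, max_tries=10):
    """Reissue the read until a good sector comes back, or give up with None."""
    for _ in range(max_tries):
        raw = sector.read()
        if is_good(raw):
            return raw[:-1]  # strip the parity bit
    return None


def write_with_verify(sector, data_bits, max_tries=10):
    """Write, read back, and check only the good/bad status of what was read."""
    for _ in range(max_tries):
        sector.write(data_bits)
        if is_good(sector.read()):
            return True
    return False
```

Checking only the status, rather than comparing every bit, keeps the verification cheap; the cost is the small chance that a corrupted sector still looks good.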
Checksums. A technique used to determine the good/bad status of a sector. Each sector has some additional bits, called the checksum, that are set depending on the values of the data bits in that sector. If the checksum does not match the data bits on reading, then there was an error in reading.
Checksums(contd..) There is a small chance that the block was not read correctly even if the checksum is proper. The probability that a sector judged good really is correct can be increased by using more checksum bits.
Checksum calculation. A simple checksum is based on the parity of all bits in the sector. If there is an odd number of 1’s among a collection of bits, the bits are said to have odd parity, and a parity bit ‘1’ is added. If there is an even number of 1’s, the collection of bits is said to have even parity, and a parity bit ‘0’ is added.
Checksum calculation(contd..) The number of 1’s among a collection of bits and their parity bit is always even. During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector. Every sector will therefore have even parity.
Examples… A sequence of bits with an odd number of 1’s has odd parity, so its parity bit is 1, and the sequence together with the parity bit has an even number of 1’s. A sequence of bits with an even number of 1’s has even parity, so its parity bit is 0, and the sequence together with the parity bit still has an even number of 1’s.
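The parity rule can be tried out with a short sketch. The two bit strings below are illustrative values chosen here, not the sequences from the original slide, and `parity_bit` and `with_parity` are hypothetical helper names.

```python
def parity_bit(bits: str) -> str:
    """Return '1' if the bit string has an odd number of 1's, else '0'."""
    return str(bits.count("1") % 2)


def with_parity(bits: str) -> str:
    """Append the parity bit so the result always has an even number of 1's."""
    return bits + parity_bit(bits)


print(with_parity("01101000"))  # three 1's, odd parity -> 011010001
print(with_parity("11101110"))  # six 1's, even parity  -> 111011100
```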
Checksum calculation(contd..) Any one-bit error in reading or writing the bits results in a sequence of bits that has odd parity. The disk controller can count the number of 1’s and, if the count is odd, determine that the sector contains an error.
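As a rough check of this claim, the sketch below flips each bit of a stored sequence in turn and confirms that the parity test catches every single-bit error. The stored value and the helper names `has_even_parity` and `flip` are made up for illustration.

```python
def has_even_parity(stored: str) -> bool:
    """The controller's check: count the 1's, including the parity bit."""
    return stored.count("1") % 2 == 0


def flip(stored: str, i: int) -> str:
    """Return a copy of the bit string with bit i inverted."""
    flipped = "1" if stored[i] == "0" else "0"
    return stored[:i] + flipped + stored[i + 1:]


stored = "011010001"  # hypothetical sector: 8 data bits followed by the parity bit

# Every possible one-bit error leaves the sequence with odd parity.
all_detected = all(not has_even_parity(flip(stored, i)) for i in range(len(stored)))
print(has_even_parity(stored), all_detected)  # True True
```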
Odds. More than one bit may be corrupted, and the error can then go unnoticed: with a single parity bit, any even number of flipped bits leaves the parity even. Increasing the number of parity bits increases the chance of detecting errors. In general, if there are n independent bits as checksum, the chance of missing an error is one in 2^n.
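One way to get several independent parity bits is to keep one parity bit per bit position of the bytes in the sector, which gives eight of them. That layout is an assumption made here for illustration, as are the sample bytes and function names. The sketch shows a two-bit error that a single parity bit misses but the eight-bit checksum catches.

```python
from functools import reduce


def parity_byte(data: bytes) -> int:
    """Eight parity bits: bit i is the parity of bit i across all bytes."""
    return reduce(lambda a, b: a ^ b, data, 0)


def single_parity(data: bytes) -> int:
    """One parity bit over every bit in the sector."""
    return bin(parity_byte(data)).count("1") % 2


def miss_probability(n: int) -> float:
    """Chance that a random error goes unnoticed with n independent parity bits."""
    return 1 / 2 ** n


original = bytes([0b01101000, 0b11101110])
corrupted = bytes([0b01101001, 0b11100110])  # two bits flipped, in different positions

print(single_parity(original) == single_parity(corrupted))  # True: error missed
print(parity_byte(original) == parity_byte(corrupted))      # False: error detected
print(miss_probability(8))                                   # 0.00390625
```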
Stable Storage. Checksums can detect an error but cannot correct it. Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly. To deal with these problems, a stable storage policy can be implemented on the disks.
Stable-Storage(contd..) Sectors are paired, and each pair represents one sector’s contents X. The left copy of the sector is denoted X_L and the right copy X_R.
Assumptions. We assume that the copies are written with enough parity bits that the chance of a bad sector looking good under the parity checks can be ignored. Also, if the read function returns a good value w for either X_L or X_R, then w is assumed to be the true value of X.
Stable-Storage Writing Policy: 1. Write the value of X into X_L. Check that the value has status “good”; i.e., the parity-check bits are correct in the written copy. If not, repeat the write. If after a set number of write attempts we have not successfully written X into X_L, assume that there is a media failure in this sector. A fix-up such as substituting a spare sector for X_L must be adopted. 2. Repeat (1) for X_R.
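The writing policy can be sketched as follows. The `Sector` class, the `failing_writes` knob, the retry limit of 3, and the list of spare sectors are all hypothetical stand-ins; a real controller would decide "good" from the parity-check bits.

```python
class Sector:
    """Toy sector whose first `failing_writes` writes leave it with bad status."""

    def __init__(self, failing_writes=0):
        self.value = None
        self.good = False
        self.failing_writes = failing_writes  # hypothetical knob for the demo

    def write(self, value):
        if self.failing_writes > 0:
            self.failing_writes -= 1
            self.value, self.good = None, False  # parity check would fail
        else:
            self.value, self.good = value, True


def write_copy(sector, value, max_tries, spares):
    """Step 1 for one copy; returns the sector that finally holds the value."""
    for _ in range(max_tries):
        sector.write(value)
        if sector.good:
            return sector
    # Assume a media failure in this sector and substitute a spare.
    if not spares:
        raise RuntimeError("no spare sector left")
    return write_copy(spares.pop(), value, max_tries, spares)


def stable_write(pair, value, max_tries=3, spares=None):
    """Write X_L first, and only then X_R, so one copy is always intact."""
    spares = [] if spares is None else spares
    pair["L"] = write_copy(pair["L"], value, max_tries, spares)
    pair["R"] = write_copy(pair["R"], value, max_tries, spares)


pair = {"L": Sector(failing_writes=1), "R": Sector()}
stable_write(pair, "new X")
print(pair["L"].value, pair["R"].value)  # new X new X
```

Writing the two copies strictly one after the other is what makes recovery from an interrupted write possible: at most one copy is mid-write at any moment.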
Stable-Storage Reading Policy: The policy is to alternate trying to read X_L and X_R until a good value is returned. If a good value is not returned after a prechosen number of tries, then X is assumed to be truly unreadable.
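A minimal sketch of the reading policy, assuming each copy is exposed as a function that returns a `(value, is_good)` pair. That interface and the default of 6 tries are choices made here for illustration.

```python
def stable_read(read_left, read_right, max_tries=6):
    """Alternate between X_L and X_R until one read returns a good value."""
    readers = [read_left, read_right]
    for attempt in range(max_tries):
        value, good = readers[attempt % 2]()  # even attempts: X_L, odd: X_R
        if good:
            return value
    return None  # X is treated as truly unreadable


def decayed():
    """A copy that has suffered media decay: never reads as good."""
    return (None, False)


def healthy():
    """A copy that reads correctly."""
    return ("X", True)


print(stable_read(decayed, healthy))  # X
print(stable_read(decayed, decayed))  # None
```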
Error-Handling Capabilities: Media failures: If, after storing X in sectors X_L and X_R, one of them undergoes a media failure and becomes permanently unreadable, we can still read X from the other one. If both sectors have failed, then X cannot be read, but the probability of both failing is extremely small.
Error-Handling Capabilities(contd..) Write failure: If there is a system failure (such as a power outage) while X is being written, the X in main memory is lost and the copy of X being written will be erroneous. Half of the sector may be written with part of the new value of X, while the other half remains as it was.
Error-Handling Capabilities(contd..) The possible cases when the system becomes available again: 1. The failure occurred while writing to X_L. Then X_L is considered bad. Since X_R was never changed, its status is good and it holds the old value of X. We can copy X_R into X_L, which restores the old value of X. 2. The failure occurred after X_L was written. Then X_L has good status and holds the new value of X, while X_R, whose write was interrupted, has bad status. We can copy the new value of X from X_L to X_R.
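The two recovery cases can be sketched as a single function. Each copy is modelled as a hypothetical `(value, is_good)` pair. The sketch also copies the left copy over the right one when both are good, which is an assumption added here to cover a failure that falls between the two writes.

```python
def recover(x_l, x_r):
    """Repair a pair of copies after a crash; each copy is (value, is_good)."""
    l_value, l_good = x_l
    r_value, r_good = x_r
    if l_good:
        # Case 2: the left copy was written completely, so it holds the
        # new value of X. Copy it over the right copy.
        return (l_value, True), (l_value, True)
    if r_good:
        # Case 1: the failure hit while writing the left copy. The right
        # copy was never touched, so restore the old value of X from it.
        return (r_value, True), (r_value, True)
    raise RuntimeError("both copies are bad; X cannot be recovered")


# Case 1: crash while writing the left copy
print(recover(("half-written", False), ("old X", True)))
# Case 2: crash while writing the right copy
print(recover(("new X", True), ("half-written", False)))
```

In either case the pair ends up holding one consistent value: the old X if the write had barely begun, the new X if the left copy was already complete.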
Recovery from Disk Crashes. To reduce the data loss caused by disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicate sectors, can be applied. The term used for these strategies is RAID, or Redundant Arrays of Independent Disks. In general, if the mean time to failure of disks is n years, then in any given year 1/n of the surviving disks fail.
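The failure-rate statement can be turned into a small calculation. This sketch assumes the yearly failure fraction 1/n stays constant over time; the function name and the 10-year figure are illustrative only.

```python
def surviving_fraction(mttf_years: float, years: int) -> float:
    """Fraction of disks still working after `years`, if 1/n fail each year."""
    return (1 - 1 / mttf_years) ** years


# With a mean time to failure of 10 years, 1/10 of the survivors fail each year.
print(round(surviving_fraction(10, 1), 4))  # 0.9
print(round(surviving_fraction(10, 2), 4))  # 0.81
```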
Recovery from Disk Crashes(contd..) Each of the RAID schemes has data disks and redundant disks. Data disks are one or more disks that hold the data. Redundant disks are one or more disks that hold information that is completely determined by the contents of the data disks. When a disk crashes, whether it is a data disk or a redundant disk, the other disks can be used to restore the failed disk and avoid a permanent loss of information.
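As one possible illustration of redundant disks, the sketch below uses a single redundant disk that stores the bytewise parity (XOR) of the data disks. The slides do not commit to a particular scheme, so this choice, the disk contents, and the function names are assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR (parity) of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def restore(disks, failed_index):
    """Rebuild the crashed disk from all the surviving disks."""
    survivors = [d for i, d in enumerate(disks) if i != failed_index]
    return xor_blocks(survivors)


data_disks = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
redundant_disk = xor_blocks(data_disks)  # fully determined by the data disks
all_disks = data_disks + [redundant_disk]

print(restore(all_disks, 1) == data_disks[1])    # True: a data disk is rebuilt
print(restore(all_disks, 3) == redundant_disk)   # True: so is the redundant disk
```

With this layout any single crashed disk can be rebuilt, but two simultaneous crashes cannot be repaired.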