Disk Failures: Types of disk failure and their mitigation. By Priya Gangaraju and Xiaqing He
Ways in which disks can fail: intermittent failure, media decay, write failure, disk crash.
Intermittent Failures: A read or write operation on a sector fails on the first try but succeeds after repeated tries. The most common form of failure. Parity checks can be used to detect this kind of failure.
Media Decay: A more serious form of failure. One or more bits are permanently corrupted, so the sector cannot be read correctly even after many tries. The stable storage technique for organizing a disk is used to cope with this failure.
Write Failure: An attempt to write a sector does not succeed, and the previously written sector cannot be retrieved either. Possible cause: a power outage while the sector is being written. The stable storage technique can be used to cope with this.
Disk Crash Most serious form of disk failure. Entire disk becomes unreadable, suddenly and permanently. RAID techniques can be used for coping with disk crashes.
More on Intermittent Failures... A read is attempted, but the correct content of the sector is not delivered to the disk controller. If the controller can tell whether a sector is good or bad (using checksums), it can reissue the read request when bad data is read.
More on Intermittent Failures... The controller may attempt to write a sector, but the contents actually written are not what was intended. The only way to check is to let the disk go around again and read the sector back. One way to perform the check is to read the sector and compare it with the data we intended to write.
Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector back and see whether a good sector was read. If the sector is good, the write was correct; otherwise the write was unsuccessful and must be repeated.
Checksums: A technique used to determine the good/bad status of a sector. Each sector has some additional bits, called the checksum, that are set depending on the values of the data bits in that sector. If the checksum does not match the data bits on reading, there was an error in reading.
More on Checksums... There is a small chance that the block was not read correctly even though the checksum matches. The probability of detecting an error can be increased by using many checksum bits.
Checksum calculation: A simple checksum is based on the parity of all bits in the sector. If there is an odd number of 1’s among a collection of bits, the bits have odd parity and a parity bit of 1 is added. If there is an even number of 1’s, the bits have even parity and a parity bit of 0 is added.
The number of 1’s among a collection of bits together with their parity bit is therefore always even. During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector. Every sector thus has even parity.
Examples... The sequence of bits 01101000 has an odd number of 1’s, so the parity bit is 1 and the sequence with the parity bit is 011010001. The sequence of bits 11101110 has an even number of 1’s, so the parity bit is 0 and the sequence with the parity bit is 111011100.
Any one-bit error in reading or writing the bits results in a sequence of bits that has odd parity. The disk controller can count the number of 1’s and report an error if the sector has odd parity.
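The parity rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and bit sequences are represented as strings of '0' and '1' rather than as real sector data.

```python
def parity_bit(bits: str) -> str:
    """Return '1' if bits has an odd number of 1's, otherwise '0'."""
    return str(bits.count("1") % 2)


def with_parity(bits: str) -> str:
    """Append the parity bit so the total number of 1's is even."""
    return bits + parity_bit(bits)


def looks_good(stored: str) -> bool:
    """A stored sequence (data bits plus parity bit) is good if it has even parity."""
    return stored.count("1") % 2 == 0


print(with_parity("01101000"))  # 011010001
print(with_parity("11101110"))  # 111011100
print(looks_good("011010001"))  # True
print(looks_good("011010101"))  # False (one bit flipped)
```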
Odds... More than one bit may be corrupted, and an even number of flipped bits goes unnoticed by a single parity bit. Increasing the number of parity bits increases the chance of detecting errors. In general, with n independent bits as checksum, the chance of missing an error is only one in 2^n.
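One way to get several independent parity bits, shown here as a sketch and not as the scheme any particular controller uses, is to keep one parity bit per bit position of a byte (n = 8). The helper names and the two-byte "sector" are assumptions for this example. A two-bit error that a single parity bit misses is caught by the 8-bit checksum:

```python
from functools import reduce


def single_parity(data: bytes) -> int:
    """One parity bit over all bits of the sector."""
    return sum(bin(b).count("1") for b in data) % 2


def byte_parity_checksum(data: bytes) -> int:
    """Eight parity bits: bit i is the parity of bit i across all bytes (XOR of the bytes)."""
    return reduce(lambda a, b: a ^ b, data, 0)


sector = bytes([0b01101000, 0b11101110])
corrupted = bytes([0b01101011, 0b11101110])  # two bits flipped in the first byte

print(single_parity(sector) == single_parity(corrupted))                # True: error missed
print(byte_parity_checksum(sector) == byte_parity_checksum(corrupted))  # False: error detected
```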
Stable Storage: Checksums can detect an error but cannot correct it. Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly. To deal with these problems, a stable storage policy can be implemented on the disks.
Sectors are paired, and each pair represents one sector-contents X. The left copy is denoted XL and the right copy XR.
Assumptions: We assume that the copies are written with enough parity bits that the chance of a bad sector looking good under the parity checks is negligible. Also, if the read function returns a good value w for either XL or XR, then w is assumed to be the true value of X.
Stable-Storage Writing Policy: 1) Write the value of X into XL. Check that the value has status “good”, i.e., the parity-check bits are correct in the written copy. If not, repeat the write. If after a set number of write attempts we have not successfully written X into XL, assume there is a media failure in this sector; a fix-up such as substituting a spare sector for XL must be adopted. 2) Repeat (1) for XR.
Stable-Storage Reading Policy: Alternate between trying to read XL and XR until a good value is returned. If no good value is returned after a prechosen number of tries, X is assumed to be truly unreadable.
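The writing and reading policies can be sketched as follows. This is a toy model under stated assumptions: the Sector class, its good flag (standing in for the parity check), and the MAX_TRIES limit are all hypothetical and exist only for this illustration.

```python
MAX_TRIES = 3  # assumed retry limit; the policy only says "a set number"


class Sector:
    """Toy sector: a value plus a good/bad status that stands in for the parity check."""

    def __init__(self):
        self.value = None
        self.good = False

    def write(self, value):
        self.value, self.good = value, True

    def read(self):
        """Return (status_is_good, value)."""
        return self.good, self.value


def stable_write(xl, xr, value):
    """Write XL until it reads back good, then do the same for XR."""
    for sector in (xl, xr):
        for _ in range(MAX_TRIES):
            sector.write(value)
            good, _value = sector.read()
            if good:
                break
        else:
            raise RuntimeError("media failure: substitute a spare sector")


def stable_read(xl, xr):
    """Alternate between XL and XR until one of them returns a good value."""
    for _ in range(MAX_TRIES):
        for sector in (xl, xr):
            good, value = sector.read()
            if good:
                return value
    raise RuntimeError("X is unreadable")


xl, xr = Sector(), Sector()
stable_write(xl, xr, "w")
xl.good = False             # simulate a media failure in XL
print(stable_read(xl, xr))  # w
```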
Error-Handling Capabilities: Media failures: If, after X is stored in sectors XL and XR, one of them undergoes a media failure and becomes permanently unreadable, we can read X from the other. Only if both sectors have failed can X not be read, and the probability of both failing is extremely small.
Write Failure: If there is a system failure (such as a power outage) while X is being written, the X in main memory is lost and the copy of X being written will be erroneous. Half of the sector may hold part of the new value of X, while the other half remains as it was.
The possible cases when the system becomes available again: 1) The failure occurred while writing XL. Then XL is considered bad. Since XR was never changed, its status is good, and we can copy XR into XL, restoring the old value of X. 2) The failure occurred after XL was written. Then XL has status good, while XR either still holds the old value of X or is bad because the failure interrupted its write. We can copy the new value of X from XL into XR.
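The two recovery cases can be sketched as a small function. The dictionary representation of a sector (keys 'value' and 'good') is an assumption made for this example only.

```python
def recover(xl: dict, xr: dict) -> None:
    """Make XL and XR agree after a write failure (toy model)."""
    if not xl["good"] and xr["good"]:
        # Case 1: failure while writing XL; restore the old value from XR.
        xl.update(value=xr["value"], good=True)
    elif xl["good"]:
        # Case 2: failure after XL was written; propagate the new value to XR.
        xr.update(value=xl["value"], good=True)


# Case 1: XL was half written, XR still holds the old value.
xl, xr = {"value": "garbage", "good": False}, {"value": "old", "good": True}
recover(xl, xr)
print(xl["value"], xr["value"])  # old old

# Case 2: XL holds the new value, XR was interrupted.
xl, xr = {"value": "new", "good": True}, {"value": "garbage", "good": False}
recover(xl, xr)
print(xl["value"], xr["value"])  # new new
```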
Recovery from Disk Crashes: To reduce the data loss caused by disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicated sectors, can be applied. The term used for these strategies is RAID, or Redundant Arrays of Independent Disks. In general, if the mean time to failure of disks is n years, then in any given year, 1/nth of the surviving disks fail.
Each of the RAID schemes has data disks and redundant disks. Data disks are one or more disks that hold the data. Redundant disks are one or more disks that hold information that is completely determined by the contents of the data disks. When any one disk crashes, the other disks can be used to restore it, avoiding permanent information loss.
Contents: 1) Focus: “How to recover from disk crashes”; the common term is RAID, “Redundant Arrays of Independent Disks”. 2) Several schemes to recover from disk crashes: Mirroring -- RAID level 1; Parity blocks -- RAID level 4; An improvement -- RAID level 5; Multiple disk crashes -- RAID level 6.
1) Mirroring (RAID level 1): The simplest scheme for recovering from disk crashes. How does mirroring work? -- make two or more copies of the data on different disks. Benefits: -- the data survives if one disk fails; -- reads can be divided among the disks, so several blocks can be accessed at once.
1) Mirroring (con’t): When can data be lost under mirroring? -- only if the second (mirror) disk crashes while the first (data) disk is being repaired. Probability: Suppose each disk has a mean time to failure of 10 years, so one of the two disks fails every 5 years on average. Replacing a failed disk takes 3 hours = 1/2,920 year. Probability that the mirror disk fails during the repair = 1/10 * 1/2,920 = 1/29,200. Rate of data loss with mirroring = 1/5 * 1/29,200 = 1/146,000 per year, i.e., a mean time to data loss of 146,000 years.
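The arithmetic on this slide can be checked with exact fractions. The variable names are chosen for this example; the inputs (10-year mean time to failure, 3-hour repair, 365-day year) are the slide's own assumptions.

```python
from fractions import Fraction

mttf_years = 10
repair_years = Fraction(3, 24 * 365)  # 3 hours expressed in years

# Chance that the surviving disk fails during the repair window.
p_mirror_fails = Fraction(1, mttf_years) * repair_years

# One of the two disks fails every 5 years on average.
first_failure_rate = Fraction(2, mttf_years)

data_loss_rate = first_failure_rate * p_mirror_fails

print(repair_years)    # 1/2920
print(p_mirror_fails)  # 1/29200
print(data_loss_rate)  # 1/146000
```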
2) Parity Blocks: Why change? -- disadvantage of mirroring: it uses as many redundant disks as data disks. What’s new? -- RAID level 4 uses only one redundant disk, no matter how many data disks there are. How does the one redundant disk work? -- modulo-2 sum: the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
2) Parity Blocks (con’t): Example
Data disks:
Disk 1: 11110000
Disk 2: 10101010
Disk 3: 00111000
Redundant disk:
Disk 4: 01100010
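The modulo-2 sum is a bitwise XOR, so the redundant block in this example can be reproduced with a short sketch. The function name is made up for this example, and blocks are modeled as 8-bit integers.

```python
def parity_block(blocks):
    """Modulo-2 sum (bitwise XOR) of the corresponding bits of all blocks."""
    result = 0
    for block in blocks:
        result ^= block
    return result


data_disks = [0b11110000, 0b10101010, 0b00111000]
print(format(parity_block(data_disks), "08b"))  # 01100010
```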
2) RAID 4 (con’t): Reading -- the same as reading a block from any disk. Writing -- 1) change the data disk; 2) change the corresponding block of the redundant disk. Why? -- the redundant disk must continue to hold the parity checks for the corresponding blocks of all the data disks.
2) RAID 4 (con’t): Writing. For a total of N data disks: 1) Naive way: read the corresponding blocks of the other N-1 data disks, compute the modulo-2 sum with the new block, and rewrite the redundant block with the result. 2) Better way: take the modulo-2 sum of the old and new versions of the data block being rewritten; flip the bits of the redundant block in the positions where that sum has 1’s.
2) RAID 4 (con’t): Writing example
Data disks:
Disk 1: 11110000
Disk 2: 10101010 -> 01100110 (old -> new)
Disk 3: 00111000
Modulo-2 sum of the old and new versions of disk 2: 11001100
So we need to change positions 1, 2, 5, 6 of the redundant disk.
Redundant disk:
Disk 4: 01100010 -> 10101110
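A sketch of the "better way" using the numbers from this example. The function name is hypothetical, and blocks are again modeled as 8-bit integers.

```python
def update_parity(old_parity: int, old_block: int, new_block: int) -> int:
    """Flip the parity bits wherever the old and new data blocks differ."""
    delta = old_block ^ new_block  # modulo-2 sum of old and new versions
    return old_parity ^ delta


old_disk2, new_disk2 = 0b10101010, 0b01100110
old_parity = 0b01100010

print(format(old_disk2 ^ new_disk2, "08b"))                             # 11001100
print(format(update_parity(old_parity, old_disk2, new_disk2), "08b"))   # 10101110

# The result matches recomputing the parity from all three data disks.
print(update_parity(old_parity, old_disk2, new_disk2)
      == 0b11110000 ^ new_disk2 ^ 0b00111000)                           # True
```

Only the changed data block and the redundant block are read, so the cost does not grow with the number of data disks.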
2) RAID 4 (con’t): Failure recovery. Redundant disk crash: -- swap in a new disk and recompute its contents from all the data disks. Data disk crash: -- swap in a new disk; -- recompute its contents from the other disks, both data and redundant. How to recompute? (The rule is the same in both cases, which is what makes the RAID 5 improvement possible.) -- take the modulo-2 sum of the corresponding bits of all the other disks.
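The recovery rule can be tried on the earlier parity-block example (disks 11110000, 10101010, 00111000 with redundant disk 01100010). In this sketch the disk numbering in the dictionary and the function name are assumptions for illustration.

```python
def rebuild(surviving_blocks):
    """Recompute a lost block as the modulo-2 sum (XOR) of all surviving blocks."""
    block = 0
    for b in surviving_blocks:
        block ^= b
    return block


disks = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01100010}

# Data disk 2 crashes: rebuild it from disks 1, 3 and the redundant disk 4.
lost = 2
recovered = rebuild(b for n, b in disks.items() if n != lost)
print(format(recovered, "08b"))  # 10101010

# Redundant disk 4 crashes: the same rule applies.
print(format(rebuild(b for n, b in disks.items() if n != 4), "08b"))  # 01100010
```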
3) An Improvement: RAID 5. Why is an improvement needed? -- Shortcoming of RAID level 4: the redundant disk is a bottleneck (every update to a data disk must also read and write the redundant disk). Principle of RAID level 5: -- treat each disk as the redundant disk for some of the blocks. Why is it feasible? The failure-recovery rule is the same for the redundant disk and the data disks: “take the modulo-2 sum of all the corresponding bits of all the other disks”. So there is no need to treat one disk as the redundant disk and the others as data disks.
3) RAID 5 (con’t): For which blocks does each disk act as the redundant disk? -- if there are n+1 disks numbered 0 to n, treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1.
3) RAID 5 (con’t): Example with n = 3 (4 disks). The first disk, numbered 0, is redundant for cylinders 4, 8, 12, ...; the second disk, numbered 1, for cylinders 1, 5, 9, ...; the third disk, numbered 2, for cylinders 2, 6, 10, ...; and so on. Suppose all 4 disks are equally likely to be written. Each disk is then involved in a write with probability 1/4 (it holds the data block) + 3/4 * 1/3 (it holds the redundant block for a write to another disk) = 1/2. With m disks in general: 1/m + (m-1)/m * 1/(m-1) = 2/m.
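Both the cylinder-to-disk rule and the write-load estimate can be checked with a short sketch. The function names are invented for this example; the load formula assumes, as the slide does, that writes are spread evenly over the disks.

```python
from fractions import Fraction


def redundant_disk(cylinder: int, num_disks: int) -> int:
    """Disk j is redundant for cylinder i when j is the remainder of i / num_disks."""
    return cylinder % num_disks


def write_involvement(num_disks: int) -> Fraction:
    """Chance that a given disk takes part in a write: as the data disk or as the redundant disk."""
    m = num_disks
    return Fraction(1, m) + Fraction(m - 1, m) * Fraction(1, m - 1)


print([c for c in range(1, 13) if redundant_disk(c, 4) == 0])  # [4, 8, 12]
print([c for c in range(1, 13) if redundant_disk(c, 4) == 1])  # [1, 5, 9]
print(write_involvement(4))                                    # 1/2
```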
4) Coping with multiple disk crashes: RAID 6 -- can deal with any number of disk crashes if enough redundant disks are used. Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7). How is the 3*7 matrix set up? (Why 3 rows? -- there are 3 redundant disks.) 1) The columns are all possible combinations of three 0’s and 1’s except all three 0’s; 2) the column for each redundant disk has a single 1; 3) the column for each data disk has at least two 1’s.
4) Coping with multiple disk crashes (con’t): Reading: read from the data disks and ignore the redundant disks. Writing: change the data disk, then change the corresponding bits of every redundant disk that has a 1 in a row where the written disk also has a 1.
4) Coping with multiple disk crashes (con’t): In a system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected? Suppose disks a and b fail: find some row r of the 3*7 matrix in which the columns for a and b differ (say a has 0 and b has 1). Compute the correct b by taking the modulo-2 sum of the corresponding bits from all the disks other than b that have 1 in row r. After recovering b, compute the correct a from the other disks, all of which are now available.
4) Coping with multiple disk crashes (con’t): Example, the 3*7 matrix
              data disks   redundant disks
disk number   1 2 3 4      5 6 7
row 1         1 1 1 0      1 0 0
row 2         1 1 0 1      0 1 0
row 3         1 0 1 1      0 0 1
4) Coping with multiple disk crashes (con’t): Example, first block of all the disks
disk contents
1) 11110000
2) 10101010
3) 00111000
4) 01000001
5) 01100010
6) 00011011
7) 10001001
4) Coping with multiple disk crashes (con’t): Example, two disks crash
disk contents
1) 11110000
2) ????????
3) 00111000
4) 01000001
5) ????????
6) 00011011
7) 10001001
4) Coping with multiple disk crashes (con’t): Example, recovery
In row 2 of the 3*7 matrix, disks 2 and 5 have different values: disk 2 has 1 and disk 5 has 0. So first compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, 6; then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, 3.
disk contents
1) 11110000
2) ???????? => 10101010
3) 00111000
4) 01000001
5) ???????? => 01100010
6) 00011011
7) 10001001
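The two-step recovery can be sketched in Python. The matrix below is an assumption: it is the 3*7 matrix that satisfies the three set-up rules and is consistent with the disk contents in this example (disk 5 = disks 1+2+3, disk 6 = disks 1+2+4, disk 7 = disks 1+3+4, all modulo 2). The function names are made up for this illustration.

```python
# Rows of the assumed 3*7 matrix; column k (1-based) belongs to disk k.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def recover_one(disks, missing, row):
    """XOR the blocks of every other disk that has a 1 in this row."""
    block = 0
    for disk, bit in enumerate(row, start=1):
        if bit and disk != missing:
            block ^= disks[disk]
    return block


def recover_two(disks, a, b):
    """Recover failed disks a and b; disks maps disk number -> block for the survivors."""
    disks = dict(disks)
    # Find a row where the columns for a and b differ; the disk with the 1 goes first.
    for row in MATRIX:
        if row[a - 1] != row[b - 1]:
            first, second = (a, b) if row[a - 1] else (b, a)
            disks[first] = recover_one(disks, first, row)
            break
    # With the first disk restored, any row with a 1 for the second disk works.
    row = next(r for r in MATRIX if r[second - 1])
    disks[second] = recover_one(disks, second, row)
    return disks


survivors = {1: 0b11110000, 3: 0b00111000, 4: 0b01000001,
             6: 0b00011011, 7: 0b10001001}
restored = recover_two(survivors, 2, 5)
print(format(restored[2], "08b"))  # 10101010
print(format(restored[5], "08b"))  # 01100010
```

The recovered blocks match the original contents of disks 2 and 5 shown on the "first block of all the disks" slide.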
Summary: Disk failures and their mitigation
Intermittent failure -- checksums
Media decay -- stable storage technique
Write failure -- stable storage technique
Disk crashes -- RAID techniques
Material taken from Disk Failures (Chapter 13.4.1 to 13.4.9) Database Systems – The Complete Book Second Edition.