Disk Failures Xiaqing He ID: 204 Dr. Lin
Content
Focus: how to recover from disk crashes
-- common term: RAID = Redundant Array of Independent Disks
-- Mirroring: RAID level 1
-- Parity checks: RAID level 4
-- Improvement: RAID level 5
-- Coping with multiple disk crashes: RAID level 6
1) Mirroring
The simplest scheme for recovering from disk crashes.
Mirror: keep two or more copies of the data on different disks
-- preserves the data if one disk fails;
-- spreads data over several disks, so several blocks can be accessed at once
For mirroring, data can be lost only if the second (mirror/redundant) disk crashes while the first (data) disk is being replaced.
Probability, supposing:
-- mean time to failure of one disk: 10 years;
-- with two disks, one of them fails on average every 5 years;
-- replacing the failed disk takes 3 hours = 1/2920 year.
So:
-- probability that the mirror disk fails during the replacement: 1/10 * 1/2920 = 1/29,200;
-- probability of data loss per year with mirroring: 1/5 * 1/29,200 = 1/146,000, i.e. a mean time to data loss of about 146,000 years
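The arithmetic for the mirrored pair can be checked with exact fractions. This is a minimal sketch under the slide's assumptions (10-year mean time to failure, 3-hour replacement); the function name `mirrored_pair_loss` is hypothetical.

```python
from fractions import Fraction

def mirrored_pair_loss(mttf_years, repair_hours):
    """Return (P[mirror fails during repair], P[data loss per year])."""
    hours_per_year = 365 * 24
    repair_years = Fraction(repair_hours, hours_per_year)
    # Chance that the surviving disk fails inside the repair window.
    p_second_failure = Fraction(1, mttf_years) * repair_years
    # With two disks, some disk fails twice as often as a single disk would.
    first_failure_rate = Fraction(2, mttf_years)
    return p_second_failure, first_failure_rate * p_second_failure

p_second, p_loss = mirrored_pair_loss(10, 3)
print(p_second)  # 1/29200
print(p_loss)    # 1/146000
```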
2) Parity Blocks
Disadvantage of mirroring: it needs as many redundant disks as data disks.
RAID level 4: uses only one redundant disk.
How does this one redundant disk work?
-- modulo-2 sum;
-- the j-th bit of the redundant disk is the modulo-2 sum of the j-th bits of all the data disks.
Example
Data disks:
Disk 1: 11110000
Disk 2: 10101010
Disk 3: 00111000
Redundant disk:
Disk 4: 01100010
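The modulo-2 sum is a bitwise XOR, so the redundant block of this example can be reproduced in a few lines. A minimal sketch, assuming blocks are given as equal-length bit strings; the helper name `parity_block` is hypothetical.

```python
from functools import reduce
from operator import xor

def parity_block(blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    total = reduce(xor, (int(block, 2) for block in blocks))
    # Pad with leading zeros so the result has the same width as the inputs.
    return format(total, f"0{width}b")

data_disks = ["11110000", "10101010", "00111000"]
print(parity_block(data_disks))  # 01100010
```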
Cont. RAID 4
Reading: the same as reading a block from any disk.
Writing:
1) change the data disk;
2) change the corresponding block of the redundant disk.
Why?
-- so that the redundant disk continues to hold the parity checks for the corresponding blocks of all the data disks
Cont. RAID 4: writing
For a total of N data disks:
1) naïve way: read the corresponding blocks of all N data disks (the new block included), compute their modulo-2 sum, and rewrite the redundant disk with the result;
2) better way: take the modulo-2 sum of the old and new versions of the data block that was rewritten; flip the bits of the redundant block at the positions where this sum has 1's.
Example
Data disks:
Disk 1: 11110000
Disk 2: 10101010 -> 01100110
Disk 3: 00111000 (unchanged)
Modulo-2 sum of the old and new versions of disk 2: 11001100
So we need to flip positions 1, 2, 5, 6 of the redundant disk.
Redundant disk:
Disk 4: 01100010 -> 10101110
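The "better way" of writing can be sketched as follows: only the old block, the new block, and the old redundant block are needed. Function names here are hypothetical, and blocks are again assumed to be bit strings.

```python
def xor_bits(a, b):
    """Modulo-2 sum of two equal-length bit strings."""
    return format(int(a, 2) ^ int(b, 2), f"0{len(a)}b")

def update_parity(old_parity, old_block, new_block):
    """Return (delta, new_parity) without reading the other data disks."""
    delta = xor_bits(old_block, new_block)
    return delta, xor_bits(old_parity, delta)

delta, new_parity = update_parity("01100010", "10101010", "01100110")
print(delta)       # 11001100
print(new_parity)  # 10101110
# Positions (counted from 1 on the left) that must be flipped.
print([i + 1 for i, bit in enumerate(delta) if bit == "1"])  # [1, 2, 5, 6]
```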
Cont. RAID 4: failure recovery
Redundant disk crash:
-- swap in a new disk and recompute its contents from all the data disks.
Data disk crash:
-- swap in a new disk;
-- recompute its contents from the other disks, i.e. the remaining data disks and the redundant disk.
How to recompute?
-- take the modulo-2 sum of the corresponding bits of all the other disks
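Because the recovery rule is the same for every disk, one function covers both cases. A minimal sketch using the block values from the earlier parity example; `rebuild_block` is a hypothetical name.

```python
from functools import reduce
from operator import xor

def rebuild_block(surviving_blocks):
    """Rebuild the failed disk's block as the XOR of all surviving blocks."""
    width = len(surviving_blocks[0])
    value = reduce(xor, (int(block, 2) for block in surviving_blocks))
    return format(value, f"0{width}b")

# Data disk 2 has failed; disks 1, 3 and the redundant disk 4 survive.
print(rebuild_block(["11110000", "00111000", "01100010"]))  # 10101010
# The redundant disk has failed; rebuild it from the three data disks.
print(rebuild_block(["11110000", "10101010", "00111000"]))  # 01100010
```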
3) An Improvement: RAID 5
Why is an improvement needed?
-- Shortcoming of RAID level 4: the redundant disk is a bottleneck, because every update of a data disk also requires reading and writing the redundant disk.
Principle of RAID level 5 (RAID 5):
-- treat each disk as the redundant disk for some of the blocks.
Why is this feasible?
-- The failure recovery rule is the same for the redundant disk and for a data disk: take the modulo-2 sum of the corresponding bits of all the other disks.
-- So there is no need to treat one disk as the redundant disk and the others as data disks.
Cont. RAID 5
How do we decide for which blocks a given disk acts as the redundant disk?
-- if there are n+1 disks, numbered 0 to n, we can treat the i-th cylinder of disk j as redundant if j is the remainder when i is divided by n+1.
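The remainder rule can be sketched as a one-line mapping. The function names are hypothetical, and numbering cylinders from 1 is an assumption made for this illustration.

```python
def redundant_disk(cylinder, num_disks):
    """Disk that holds the redundant block for the given cylinder."""
    return cylinder % num_disks

def redundant_cylinders(disk, num_disks, max_cylinder):
    """Cylinders (numbered from 1, an assumption) for which `disk` is redundant."""
    return [c for c in range(1, max_cylinder + 1)
            if redundant_disk(c, num_disks) == disk]

# Four disks (n = 3), first twelve cylinders.
for disk in range(4):
    print(disk, redundant_cylinders(disk, 4, 12))
# 0 [4, 8, 12]
# 1 [1, 5, 9]
# 2 [2, 6, 10]
# 3 [3, 7, 11]
```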
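Cont. RAID 5: example
n = 3, so there are 4 disks:
-- the first disk, numbered 0, is redundant for cylinders 4, 8, 12, …;
-- the second disk, numbered 1, is redundant for cylinders 1, 5, 9, …;
-- the third disk, numbered 2, is redundant for cylinders 2, 6, 10, …;
-- ……….
Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability that a given write involves it is:
1/4 (it holds the data block being written) + 3/4 * 1/3 (another disk is written and this disk holds the corresponding redundant block) = 1/2
With m disks in total: 1/m + (m-1)/m * 1/(m-1) = 2/m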
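The write-load formula can be checked with exact fractions. A minimal sketch, assuming writes are spread uniformly over the disks as the slide supposes; `write_load` is a hypothetical name and `total_disks` counts all disks.

```python
from fractions import Fraction

def write_load(total_disks):
    """Probability that a given write touches one particular disk."""
    # The disk holds the data block being written.
    direct = Fraction(1, total_disks)
    # Another disk is written, and this disk holds its redundant block.
    as_redundant = (Fraction(total_disks - 1, total_disks)
                    * Fraction(1, total_disks - 1))
    return direct + as_redundant

print(write_load(4))  # 1/2
print(all(write_load(m) == Fraction(2, m) for m in range(2, 20)))  # True
```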
4) Coping with multiple disk crashes
RAID 6: can deal with any number of disk crashes, provided enough redundant disks are used.
Focus on: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
How to set up the 3*7 matrix (one column per disk)?
1) the columns are all the possible combinations of three 0's and 1's, except for all three 0's;
2) the column of each redundant disk has a single 1;
3) the column of each data disk has at least two 1's.
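One matrix that satisfies the three rules can be generated mechanically. This is a sketch only: the order in which columns are assigned to disks 1-7 is an assumption, and the function name `build_matrix` is hypothetical.

```python
from itertools import product

def build_matrix(rows=3):
    """Return the matrix as a list of rows; column k belongs to disk k+1."""
    columns = [c for c in product((0, 1), repeat=rows) if any(c)]
    # Data disks get the columns with at least two 1's.
    data = sorted((c for c in columns if sum(c) >= 2), reverse=True)
    # Redundant disks get the columns with a single 1.
    redundant = sorted((c for c in columns if sum(c) == 1), reverse=True)
    ordered = data + redundant
    return [[col[r] for col in ordered] for r in range(rows)]

for row in build_matrix():
    print(*row)
# 1 1 1 0 1 0 0
# 1 1 0 1 0 1 0
# 1 0 1 1 0 0 1
```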
Cont.) Coping with multiple disk crashes
Reading: read from the data disks and ignore the redundant disks.
Writing:
-- change the data disk;
-- change the corresponding bits of every redundant disk that has a 1 in a row where the written data disk also has a 1.
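The write rule can be sketched as below. `MATRIX` is one assumed matrix that satisfies the set-up rules for the seven-disk system (column k is disk k, row r is the parity row kept on redundant disk 5 + r), and the block values, in particular the one for disk 4, are assumed sample data; `write_block` is a hypothetical name.

```python
# Assumed matrix: columns are disks 1-7, rows belong to redundant disks 5-7.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def write_block(disks, disk_no, new_value):
    """Write new_value to data disk disk_no (1-4) and patch redundant disks."""
    delta = disks[disk_no] ^ new_value  # modulo-2 sum of old and new block
    disks[disk_no] = new_value
    for row_index, row in enumerate(MATRIX):
        if row[disk_no - 1] == 1:
            # This row's parity is stored on redundant disk 5 + row_index.
            disks[5 + row_index] ^= delta

# Assumed sample contents; disks 5-7 hold the row parities of disks 1-4.
disks = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001,
         5: 0b01100010, 6: 0b00011011, 7: 0b10001001}
write_block(disks, 2, 0b00001111)
print({n: format(v, "08b") for n, v in disks.items()})
```

Disk 2 has 1's in the first two rows only, so the write changes disks 5 and 6 and leaves disk 7 untouched.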
Cont.) Coping with multiple disk crashes
In the system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
Suppose disks a and b failed:
-- find some row r in which the columns for a and b differ; say a has 0 and b has 1 in row r;
-- compute the correct b by taking the modulo-2 sum of the corresponding bits of all the disks other than b that have 1's in row r;
-- after recovering b, compute the correct a from the other disks, all of which are now available.
Example
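The two-step recovery can be sketched as follows. `MATRIX` is again one assumed matrix satisfying the set-up rules (columns are disks 1-7), the disk contents are assumed sample data, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor

# Assumed matrix: columns are disks 1-7, rows belong to redundant disks 5-7.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def row_sum_without(disks, row, missing):
    """XOR of all disks with a 1 in `row`, leaving out disk `missing`."""
    return reduce(xor, (disks[d] for d in range(1, 8)
                        if row[d - 1] == 1 and d != missing), 0)

def recover_two(surviving, a, b):
    """Return a full disk map given the survivors and failed disks a and b."""
    disks = dict(surviving)
    # A row where the two columns differ always exists: columns are distinct.
    row = next(r for r in MATRIX if r[a - 1] != r[b - 1])
    first, second = (a, b) if row[a - 1] == 1 else (b, a)
    # `second` has a 0 in this row, so every disk needed here has survived.
    disks[first] = row_sum_without(disks, row, first)
    # Any row with a 1 for `second` works now that `first` is back.
    row2 = next(r for r in MATRIX if r[second - 1] == 1)
    disks[second] = row_sum_without(disks, row2, second)
    return disks

full = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001,
        5: 0b01100010, 6: 0b00011011, 7: 0b10001001}
survivors = {n: v for n, v in full.items() if n not in (2, 5)}
print(recover_two(survivors, 2, 5) == full)  # True
```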