Disk failure ways and their mitigation - Priya Gangaraju (Class Id-203)

Ways in which disks can fail: intermittent failure, media decay, write failure, and disk crash.

Intermittent Failures. A read or write operation on a sector does not succeed on the first try, but does succeed after repeated tries. This is the most common form of failure. Parity checks can be used to detect this kind of failure.

Media Decay. A serious form of failure in which one or more bits are permanently corrupted. It becomes impossible to read the sector correctly, even after many tries. The stable storage technique for organizing a disk is used to guard against this failure.

Write Failure. An attempt to write a sector does not succeed, and the previously written contents of the sector cannot be retrieved either. A possible cause is a power outage while the sector is being written. The stable storage technique can be used to guard against this.

Disk Crash. The most serious form of disk failure. The entire disk becomes unreadable, suddenly and permanently. RAID techniques can be used to cope with disk crashes.

More on Intermittent Failures. An intermittent read failure occurs when we try to read a sector but the correct content of that sector is not delivered to the disk controller. If the controller has a way to tell whether the sector is good or bad (checksums), it can reissue the read request when bad data is read.

More on Intermittent Failures (contd.). The controller can attempt to write a sector, but the contents of the sector may not be what was intended. The only way to check this is to let the disk go around again and read the sector. One way to perform the check is to read the sector back and compare it with the data we intended to write.

Contd. Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector and see whether a good sector was read. If the sector is good, the write was correct; otherwise the write was unsuccessful and must be repeated.
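The write-then-reread check can be sketched as follows. This is only an illustration under stated assumptions: `FlakyDisk`, `write_with_verify`, and `max_tries` are hypothetical names, the disk is simulated in memory, and CRC-32 stands in for whatever checksum a real controller uses.

```python
import zlib


class FlakyDisk:
    """Hypothetical in-memory disk whose first few writes are corrupted."""

    def __init__(self, bad_writes):
        self.sectors = {}
        self.bad_writes = bad_writes  # how many upcoming writes will go wrong

    def write(self, sector, data):
        checksum = zlib.crc32(data)  # checksum of the intended contents
        if self.bad_writes > 0:
            self.bad_writes -= 1
            # Simulate an intermittent failure: flip one bit of the first byte.
            data = bytes([data[0] ^ 0x01]) + data[1:]
        self.sectors[sector] = (data, checksum)

    def read_is_good(self, sector):
        """Re-read the sector and report whether its checksum matches."""
        data, checksum = self.sectors[sector]
        return zlib.crc32(data) == checksum


def write_with_verify(disk, sector, data, max_tries=5):
    """Write, re-read, and repeat until the sector reads back as good."""
    for attempt in range(1, max_tries + 1):
        disk.write(sector, data)
        if disk.read_is_good(sector):
            return attempt  # number of writes that were needed
    raise IOError("sector could not be written correctly")
```

For example, with `FlakyDisk(bad_writes=2)` the third write is the first one that reads back as good.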

Checksums. A technique used to determine the good/bad status of a sector. Each sector has some additional bits, called the checksum, that are set depending on the values of the data bits in that sector. If the checksum does not match the data bits on reading, then there was an error in reading.

Checksums (contd.). There is a small chance that the block was not read correctly even if the checksum is proper. The probability of detecting an error can be increased by using more checksum bits.

Checksum calculation. A simple checksum is based on the parity of all bits in the sector. If there is an odd number of 1’s among a collection of bits, the bits are said to have odd parity, and a parity bit ‘1’ is added. If there is an even number of 1’s, the collection of bits is said to have even parity, and a parity bit ‘0’ is added.

Checksum calculation (contd.). The number of 1’s among a collection of bits and their parity bit is always even. During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector. Every sector will therefore have even parity.

Examples. If a sequence of bits has an odd number of 1’s, the parity bit will be 1, so the stored sequence is the original bits followed by 1. If a sequence of bits has an even number of 1’s, it already has even parity, so the stored sequence is the original bits followed by the parity bit 0.
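The parity-bit rule can be illustrated with a short sketch. The helper names `parity_bit` and `with_parity` and the two bit strings are hypothetical and chosen only for illustration; they are not the sequences from the original slide.

```python
def parity_bit(bits: str) -> str:
    """Return '1' if bits contains an odd number of 1's, else '0'."""
    return str(bits.count("1") % 2)


def with_parity(bits: str) -> str:
    """Append the parity bit so the whole sequence has even parity."""
    return bits + parity_bit(bits)


print(with_parity("0110100"))  # three 1's, so the parity bit is 1
print(with_parity("1100110"))  # four 1's, so the parity bit is 0
```

In both cases the resulting sequence contains an even number of 1's.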

Checksum calculation (contd.). Any one-bit error in reading or writing the bits results in a sequence of bits that has odd parity. The disk controller can count the number of 1’s and, if the sector has odd parity, conclude that an error is present.
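A small sketch of this check, assuming a hypothetical 8-bit stored sequence (7 data bits plus a parity bit) and the illustrative helpers `has_even_parity` and `flip`:

```python
def has_even_parity(bits: str) -> bool:
    """A sector is considered good when its count of 1's is even."""
    return bits.count("1") % 2 == 0


def flip(bits: str, i: int) -> str:
    """Return bits with the bit at position i inverted."""
    return bits[:i] + ("0" if bits[i] == "1" else "1") + bits[i + 1:]


stored = "01101001"  # hypothetical sector contents, written with even parity

# Every possible one-bit error turns the sequence into an odd-parity one.
detected = [not has_even_parity(flip(stored, i)) for i in range(len(stored))]
print(all(detected))
```

Flipping any single position changes the count of 1's by exactly one, which is why the check catches every one-bit error.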

Odds. More than one bit can be corrupted, and then the error can go unnoticed. Increasing the number of parity bits increases the chance of detecting errors. In general, if there are n independent bits as checksum, the chance of missing an error is only one in 2^n.
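One way to obtain several independent parity bits, sketched here as an assumption rather than as the scheme the slides prescribe, is to keep eight parity bits per sector, where bit i is the parity of bit i across all bytes. The names `parity_byte`, `single_parity`, and `miss_probability` are hypothetical.

```python
from functools import reduce


def parity_byte(data: bytes) -> int:
    """Eight parity bits: bit i is the parity of bit i across all bytes."""
    return reduce(lambda a, b: a ^ b, data, 0)


def single_parity(data: bytes) -> int:
    """One parity bit over every bit in the sector."""
    return sum(bin(byte).count("1") for byte in data) % 2


def miss_probability(n: int) -> float:
    """Chance that random corruption still matches n independent parity bits."""
    return 1 / 2 ** n


good = bytes([0b10100000, 0b00001111])
bad = bytes([0b10100011, 0b00001111])  # two bits flipped in the first byte

print(single_parity(good) == single_parity(bad))  # the single bit misses it
print(parity_byte(good) == parity_byte(bad))      # the parity byte catches it
print(miss_probability(8))
```

The two flipped bits cancel out in a single parity bit, but they land in different bit positions, so two of the eight parity bits change.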

Stable Storage. Checksums can detect the error but cannot correct it. Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly. To deal with these problems, Stable Storage policy can be implemented on the disks.

Stable Storage (contd.). Sectors are paired, and each pair represents one sector-contents X. The left copy of the sector is denoted X_L and the right copy X_R.

Assumptions. We assume that copies are written with a sufficient number of parity bits that the chance of a bad sector looking good under the parity checks is negligible. Also, if the read function returns a good value w for either X_L or X_R, then w is assumed to be the true value of X.

Stable-Storage Writing Policy: 1. Write the value of X into X_L. Check that the value has status “good”; i.e., the parity-check bits are correct in the written copy. If not, repeat the write. If, after a set number of write attempts, we have not successfully written X into X_L, assume that there is a media failure in this sector. A fix-up such as substituting a spare sector for X_L must be adopted. 2. Repeat (1) for X_R.
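The writing policy can be sketched against a simulated sector. Everything here is a hypothetical model: `Sector`, `write_copy`, `stable_write`, and `max_tries` are illustrative names, the good/bad status is a simple flag instead of real parity-check bits, and the spare sector is assumed to accept the write on the first attempt.

```python
class Sector:
    """Hypothetical sector holding a value and a good/bad status flag."""

    def __init__(self, failing_writes=0):
        self.value = None
        self.good = False
        self.failing_writes = failing_writes  # writes that end with status bad

    def write(self, value):
        self.value = value
        if self.failing_writes > 0:
            self.failing_writes -= 1
            self.good = False
        else:
            self.good = True


def write_copy(sector, value, max_tries):
    """Step 1 of the policy for one copy: write, check status, retry."""
    for _ in range(max_tries):
        sector.write(value)
        if sector.good:
            return sector
    # Assume a media failure and substitute a spare sector (simplified).
    spare = Sector()
    spare.write(value)
    return spare


def stable_write(x_l, x_r, value, max_tries=3):
    """Write X_L first, and only then X_R; return the sectors now in use."""
    x_l = write_copy(x_l, value, max_tries)
    x_r = write_copy(x_r, value, max_tries)
    return x_l, x_r
```

Writing the two copies strictly one after the other is what guarantees that at most one copy is in a doubtful state if the system fails part-way through.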

Stable-Storage Reading Policy: The policy is to alternate between trying to read X_L and X_R until a good value is returned. If a good value is not returned after a pre-chosen number of tries, then X is assumed to be truly unreadable.
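A minimal sketch of the reading policy, again on a simulated sector. `Sector`, `stable_read`, and `max_tries` are hypothetical names, and a read is modeled as returning a status flag together with the value.

```python
class Sector:
    """Hypothetical sector whose first few reads come back with status bad."""

    def __init__(self, value, bad_reads=0):
        self.value = value
        self.bad_reads = bad_reads

    def read(self):
        """Return (status_good, value)."""
        if self.bad_reads > 0:
            self.bad_reads -= 1
            return False, None
        return True, self.value


def stable_read(x_l, x_r, max_tries=4):
    """Alternate between X_L and X_R until one read has good status."""
    copies = (x_l, x_r)
    for attempt in range(max_tries):
        good, value = copies[attempt % 2].read()
        if good:
            return value
    raise IOError("X is unreadable")
```

With a permanently bad left copy and a right copy that fails once, the fourth attempt (the second read of the right copy) returns the value.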

Error-Handling Capabilities. Media failures: If, after storing X in sectors X_L and X_R, one of them undergoes media failure and becomes permanently unreadable, we can read X from the other one. If both sectors have failed, then X cannot be read. The probability of both failing is extremely small.

Error-Handling Capabilities (contd.). Write failure: When writing X, if there is a system failure (such as a power outage), the X in main memory is lost and the copy of X being written will be erroneous. Half of the sector may be written with part of the new value of X, while the other half remains as it was.

Error-Handling Capabilities (contd.). The possible cases when the system becomes available again: 1. The failure occurred while writing to X_L. Then X_L is considered bad. Since X_R was never changed, its status is good. We can copy X_R into X_L, which restores the old value of X. 2. The failure occurred after X_L was written. Then X_L has status good and holds the new value, while X_R is either bad (partially written) or still holds the old value of X. We can copy the new value of X from X_L to X_R.
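The two recovery cases can be sketched as a single function. The name `recover` and the representation of each copy as a `(status_good, value)` pair are assumptions made for illustration.

```python
def recover(x_l, x_r):
    """Repair the pair after a restart.

    Each argument is a (status_good, value) pair as read from disk.
    Returns the repaired (x_l, x_r) pair.
    """
    l_good, l_val = x_l
    r_good, r_val = x_r
    if not l_good and r_good:
        # Case 1: failure while writing X_L; restore the old value from X_R.
        return (True, r_val), (True, r_val)
    if l_good:
        # Case 2: X_L was fully written; copy the new value into X_R.
        return (True, l_val), (True, l_val)
    raise IOError("both copies are bad; X is lost")


print(recover((False, None), (True, "old")))
print(recover((True, "new"), (False, None)))
```

Copying X_L over X_R whenever X_L is good also covers the situation where the failure happened before the write to X_R began, so that X_R is good but stale.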

Recovery from Disk Crashes. To reduce the data loss caused by disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicate sectors, can be applied. The term used for these strategies is RAID, or Redundant Arrays of Independent Disks. In general, if the mean time to failure of disks is n years, then in any given year, 1/nth of the surviving disks fail.
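As a worked illustration of the 1/n rule, the sketch below computes the fraction of disks still working after a number of years. The function name `surviving_fraction` and the value n = 10 are hypothetical, and the model simply applies the same failure fraction every year.

```python
def surviving_fraction(mttf_years: float, years: int) -> float:
    """Fraction of disks still working if 1/n of the survivors fail each year."""
    return (1 - 1 / mttf_years) ** years


# With a hypothetical mean time to failure of 10 years, a tenth of the
# surviving disks fail in each year.
for year in (1, 2, 5):
    print(year, round(surviving_fraction(10, year), 3))
```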

Recovery from Disk Crashes (contd.). Each of the RAID schemes has data disks and redundant disks. Data disks are one or more disks that hold the data. Redundant disks are one or more disks that hold information that is completely determined by the contents of the data disks. When any one disk crashes, the other disks can be used to restore the failed disk and avoid a permanent loss of information.
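One common way for a redundant disk to be "completely determined" by the data disks is to store the bitwise XOR (parity) of their contents. The sketch below assumes that single-parity-disk scheme, which the slides do not spell out; `xor_blocks` and the disk contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one two-byte block.
data_disks = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
redundant = xor_blocks(data_disks)  # contents of the redundant disk

# Suppose data disk 1 crashes: XOR the survivors with the redundant disk.
survivors = [data_disks[0], data_disks[2], redundant]
restored = xor_blocks(survivors)
print(restored == data_disks[1])
```

The same computation restores the redundant disk itself if that is the one that crashes, since it is just the XOR of all the data disks.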