Latent Sector Errors In Disk Drives Ahmet Salih BÜYÜKKAYHAN 2007706435 - 2009 Spring.



Outline
  ◦ Motivation
  ◦ Introduction
  ◦ Disk Errors
  ◦ Error Handling
  ◦ Evaluation
  ◦ Conclusion

Motivation
90% of all new information produced in the world is stored on magnetic media, mostly hard disk drives.
This study analyzes data collected from production storage systems over 32 months, covering 1.53 million disks.
  ◦ The storage systems have a built-in, low-overhead mechanism that logs important system events to a central repository.
The study can inform research on disk fault prevention, fault tolerance, and fault forecasting.

Introduction - Disk Drives
Disk drives combine mechanical and electronic components.
Disk controller
  ◦ Electronic component
  ◦ Converts the serial bit stream into blocks of bytes
  ◦ Performs error correction as necessary

Introduction - Disk Drives
Sectors: the smallest addressable unit of data access, usually 512 bytes in size
  ◦ Protected by error-correcting codes
The disk is presented as a linear array of equal-sized blocks, each identified by a logical block number (LBN).

Introduction
Factors other than complete disk failures also influence data reliability, which is expressed as mean time to data loss (MTTDL).
A disk drive does not report a latent sector error until the affected sector is accessed.

Disk Drive Failure Patterns
Bad sector errors: manufacturing defects
Seek errors: the head cannot be positioned on the right track
  ◦ The disk head needs to be recalibrated
Data corruption
  ◦ Lost writes: the data is not written, but completion is reported
  ◦ Misdirected writes: the data is written to the wrong disk block
  ◦ Torn writes: the data is only partially written, but completion is reported

Disk Errors
Latent sector errors: a disk sector cannot be read or written, or an uncorrectable ECC error occurs.
  ◦ Any data previously stored in the sector is lost.
  ◦ Recovery requires higher-level mechanisms such as RAID reconstruction.
Not-ready-condition errors: the disk drive is not ready to handle a command from the host.
  ◦ Handled by waiting and retrying.
Recovered errors: access to a sector required a disk-level retry or error correction.

Error Handling
Some disks are able to re-map bad sectors automatically.
The OS can handle bad sectors with re-mapping tables.
  ◦ Constructs a list of bad sectors
  ◦ Both allocated and free blocks are tested

Proactive Error Detection
Media scrubs use a SCSI Verify command to validate a disk sector’s integrity.
  ◦ An ECC check of the sector’s content is performed within the disk.
Data scrubs detect data corruption.
  ◦ Reads each disk sector and computes a checksum over its data
  ◦ Compares the checksum to the on-disk 8-byte checksum
  ◦ Reconstructs the sector from other disks in the RAID group if the checksum fails
  ◦ Latent sector errors discovered by data scrubs appear as read errors.
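The data scrub loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: CRC32 stands in for the 8-byte on-disk checksum, a sector value of None models a read error, and the names checksum, data_scrub and reconstruct are hypothetical, not taken from the storage system in the study.

```python
import zlib


def checksum(data: bytes) -> int:
    # Assumption: CRC32 stands in for the on-disk 8-byte checksum.
    return zlib.crc32(data)


def data_scrub(sectors, stored_checksums, reconstruct):
    """Scrub every sector and return the indices that had to be repaired.

    sectors          -- list of bytes objects; None models an unreadable sector
    stored_checksums -- checksum recorded when each sector was written
    reconstruct      -- hypothetical callback that rebuilds sector i from the
                        other disks in the RAID group
    """
    repaired = []
    for i, data in enumerate(sectors):
        unreadable = data is None  # a latent sector error shows up as a read error
        if unreadable or checksum(data) != stored_checksums[i]:
            sectors[i] = reconstruct(i)
            repaired.append(i)
    return repaired
```

A caller would pass a reconstruct function backed by RAID parity; here it is left abstract so that the scrub logic stays separate from the redundancy scheme.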

System Architecture
[Figure: layered storage stack. Components shown: client interface (NFS), WAFL® file system, RAID layer, storage layer, disk drives, Autosupport.]
Annotations in the figure:
Store, verify block ID (inode X, offset Y)
  ◦ Detect identity discrepancy
  ◦ Lost or misdirected writes
Parity generation, reconstruction on failure, data scrubbing
  ◦ Read blocks, verify parity
  ◦ Detect parity inconsistency
  ◦ Lost or misdirected writes, parity miscalculations
Store, verify checksum
  ◦ Detect checksum mismatch
  ◦ Bit corruptions, torn writes

RAID – I/O Parallelism
RAID is a set of disks behind a single RAID controller.
  ◦ Improves fault tolerance and performance
  ◦ Reduces costs
The disks in a RAID appear as a single disk to the OS.
There are six different RAID organizations (levels 0 to 5).
RAID level 0: strips of k sectors each are distributed across the individual disks in round-robin fashion.
  ◦ There is no redundant data storage in this approach.
  ◦ There is no performance gain if requests are for one sector at a time.
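The round-robin placement of RAID level 0 can be illustrated with a small address calculation. This is a sketch only; the function name locate_sector and its parameters are assumptions made for the example, and real controllers may lay strips out differently.

```python
def locate_sector(logical_sector, num_disks, strip_sectors):
    """Map a logical sector to (disk index, sector offset on that disk).

    Strips of strip_sectors consecutive sectors are dealt out to the disks
    in round-robin order, as in RAID level 0.
    """
    strip_index, offset_in_strip = divmod(logical_sector, strip_sectors)
    disk = strip_index % num_disks            # which disk holds this strip
    strip_on_disk = strip_index // num_disks  # how many strips precede it there
    return disk, strip_on_disk * strip_sectors + offset_in_strip
```

With 4 disks and strips of 2 sectors, logical sectors 0 and 1 land on disk 0, sectors 2 and 3 on disk 1, and sector 8 wraps around to disk 0 again. A request for a single sector touches only one disk, which is why such requests gain nothing from striping.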

RAID
RAID level 1: duplicates all the disks.
  ◦ Every strip is written twice.
  ◦ Either of the two copies can be read.
Write performance is the same as for a single disk.
Read performance can be up to twice as good.
Fault tolerance is excellent.
  ◦ Recovery is easy: buy a new drive and use it to replace the one that crashed.

RAID
RAID level 2: fine-granularity (bit-level) striping with a Hamming code for error detection and correction.
  ◦ Disk drives must be synchronized.
RAID level 3: a simplified version of level 2, where only parity is stored.
  ◦ Single disk crash?

RAID
RAID level 4: with strips of k bytes, an extra disk drive stores k-byte parity strips computed by XORing the corresponding strips on the data disks.
RAID level 5: like RAID 4, but the parity is distributed across all disks in the array to reduce the risk posed by a parity disk crash.
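XOR parity as used by RAID levels 4 and 5 can be shown in a few lines. The sketch below assumes equal-length strips; xor_strips and parity_disk_for_stripe are hypothetical names, and the rotation shown for RAID 5 is just one possible layout, not one prescribed by the slides.

```python
from functools import reduce


def xor_strips(strips):
    """Byte-wise XOR of equal-length strips.

    Applied to the data strips it yields the parity strip; applied to the
    surviving strips plus the parity strip it rebuilds the missing strip.
    """
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def parity_disk_for_stripe(stripe, num_disks):
    # Assumption: a simple rotating placement for RAID 5. RAID 4 would
    # always return the same dedicated parity disk instead.
    return (num_disks - 1 - stripe) % num_disks
```

For example, with data strips d0, d1 and d2 and parity p = xor_strips([d0, d1, d2]), a lost d1 is recovered as xor_strips([d0, d2, p]).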

Stable Storage
RAID deals with correct reads and fault tolerance against crashes. What about writes?
Desired property:
  ◦ When a write is issued, the disk either writes the data correctly or does nothing at all.

Stable Storage
Stable storage uses a pair of identical disks whose corresponding blocks together form one error-free block.
Stable write:
  ◦ Write the block on drive 1.
  ◦ Read it back and verify it; if it is not correct, repeat the operation.
  ◦ After n consecutive failures, the block is remapped to a spare one and the operation continues.
  ◦ After the write to drive 1 succeeds, the corresponding block on drive 2 is written and re-read until it succeeds.
  ◦ After the stable write completes, the block has been successfully written to both drives.

Stable Storage
Stable read:
  ◦ First read from drive 1.
  ◦ If the ECC indicates an error, reread.
  ◦ If the error persists after n attempts, the corresponding block is read from drive 2.
Crash recovery:
  ◦ Scan both disks and compare the corresponding blocks.
  ◦ If one of them has an ECC error, the good block is written over the bad one.
  ◦ If both pass the ECC check but differ, the block from drive 1 is written over the block on drive 2.
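The stable write, stable read and crash recovery steps can be put together in a small in-memory model. This is a sketch under assumptions: the Drive class, the CRC32 used in place of the drive's ECC, and the function names are all hypothetical, and where the slides remap a block to a spare after n failures the sketch simply raises an error.

```python
import zlib


class Drive:
    """Toy drive: block number -> (data, stored CRC). CRC32 stands in for ECC."""

    def __init__(self):
        self.blocks = {}

    def write(self, n, data):
        self.blocks[n] = (data, zlib.crc32(data))

    def read(self, n):
        """Return the data, or None if the block is missing or fails its check."""
        entry = self.blocks.get(n)
        if entry is None:
            return None
        data, crc = entry
        return data if zlib.crc32(data) == crc else None


def write_verified(drive, n, data, max_tries):
    # Write, read back, and retry until the block verifies.
    for _ in range(max_tries):
        drive.write(n, data)
        if drive.read(n) == data:
            return
    # Simplification: the slides remap to a spare block at this point.
    raise OSError(f"block {n} could not be written")


def stable_write(d1, d2, n, data, max_tries=3):
    write_verified(d1, n, data, max_tries)  # drive 1 first
    write_verified(d2, n, data, max_tries)  # then drive 2


def stable_read(d1, d2, n, max_tries=3):
    for _ in range(max_tries):
        data = d1.read(n)
        if data is not None:
            return data
    return d2.read(n)  # fall back to the copy on drive 2


def recover(d1, d2, block_numbers):
    for n in block_numbers:
        a, b = d1.read(n), d2.read(n)
        if a is None and b is not None:
            d1.write(n, b)  # drive 1 copy is bad: repair it from drive 2
        elif a is not None and a != b:
            d2.write(n, a)  # drive 2 copy is bad or stale: drive 1 wins
```

Drive 1 wins when both copies are readable but differ because a stable write always completes on drive 1 before it starts on drive 2, so drive 1 holds the newer data after a crash between the two writes.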

Evaluation
Disk class: enterprise-class or nearline disk drives, with Fibre Channel and ATA interfaces respectively.
Disk model: the combination of a disk family and a particular size
  ◦ Example of a family and size: Quantum Fireball EX, 6.4 GB
  ◦ A model is denoted as, for example, ‘E-1’.
Disk age: the amount of time in the field since the ship date.
Error disk: a disk drive that has at least one latent sector error.

Evaluation
Sample selection
  ◦ The model has at least 1000 disks in the field for the time period being considered.
  ◦ The model has at least 1000 disks in the field and at least 50 error disks for the time period being considered.
  ◦ The very few “outlier” disks (0.2% of error disks) with more than 1000 errors are disregarded to avoid the skew they would cause.
Families A to E are nearline disks; families F to N are enterprise disks.
E-2 disks have double the capacity of E-1 disks.

Impact of Disk Age
[Figure panels: enterprise disks; nearline disks]
The impact of disk age varies across disk models.
Latent sector errors (LSEs) on nearline disks grow far more rapidly with age.

Impact of Disk Age
Annualized failure rates (AFRs) vary from 1.7% for drives in their first year of operation to over 8.6% for 3-year-old drives.

Impact of Disk Size
The amount of probable data loss per gigabyte due to latent sector errors does not increase or decrease consistently as disk size increases.

Errors per Error Disk
Enterprise (ES) and nearline (NL) disks are equally likely to develop more than one error once they develop their first error.

Spatial Locality
There is significant locality in the occurrence of latent sector errors across logical sector addresses.

Spatial Locality
A locality radius is used to measure locality.
[Figure: logical block number space from the beginning to the end of the disk, with a 100-block and a 1000-block locality radius marked]
  ◦ 100-block radius: 2 of 5 errors have a neighbor within the radius
  ◦ 1000-block radius: 4 of 5 errors have a neighbor within the radius
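The locality radius measurement can be sketched as the fraction of errors that have at least one other error within a given number of logical blocks. The function name fraction_with_neighbor and the sample block numbers are made up for illustration; the sample was chosen only to mirror the 2 of 5 and 4 of 5 pattern above.

```python
def fraction_with_neighbor(error_lbns, radius):
    """Fraction of errors with another error within `radius` logical blocks."""
    lbns = sorted(error_lbns)
    if not lbns:
        return 0.0
    count = 0
    for i, lbn in enumerate(lbns):
        # After sorting, only the adjacent entries can be the nearest neighbors.
        near_left = i > 0 and lbn - lbns[i - 1] <= radius
        near_right = i + 1 < len(lbns) and lbns[i + 1] - lbn <= radius
        if near_left or near_right:
            count += 1
    return count / len(lbns)


# Hypothetical error locations on one disk.
sample = [100, 150, 2000, 2900, 50000]
print(fraction_with_neighbor(sample, 100))
print(fraction_with_neighbor(sample, 1000))
```

Sorting first keeps the check linear after the sort, since an error has a neighbor within the radius exactly when one of its two adjacent entries in sorted order is close enough.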

Temporal Locality
Disks that develop errors beyond the first error see most of the additional errors within one month after the first error.

Detection Methods
Media scrubbing detects a large percentage of the observed latent sector errors (LSEs).
86.6% of all LSEs on nearline (NL) disks and 61.5% of LSEs on enterprise (ES) disks are discovered by verify operations.

Conclusion
The fraction of disks affected by latent sector errors (LSEs) increases linearly with time for enterprise-class disks and superlinearly for nearline disks.
The percentage of affected disks depends on many factors, such as the disk drive model, the age of the disk drive, and the storage capacity of the drive.
A disk with a latent sector error is more likely to develop another latent sector error than a disk without an error.

Conclusion
The fraction of disks affected by latent sector errors increases as disk capacity increases.
Latent sector errors show high spatial and temporal locality.
Latent sector errors correlate with not-ready-condition errors, especially for nearline disks.
Latent sector errors also correlate with recovered-error warnings, especially for enterprise disks.

References
[1] L. Bairavasundaram, G. Goodson, S. Pasupathy, and J. Schindler. An Analysis of Latent Sector Errors in Disk Drives. In SIGMETRICS ’07, pages 289–300, San Diego, CA, June.
[2] E. Pinheiro, W. D. Weber, and L. A. Barroso. Failure Trends in a Large Disk Drive Population. USENIX Conference on File and Storage Technologies, Feb. 13–16.
[3] Bianca Schroeder and Garth Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST ’07, pages 1–16, San Jose, CA, February.
[4] Jimmy Yang and Feng-Bin Sun. A comprehensive review of hard-disk drive reliability. In 1999 Proceedings Annual Reliability and Maintainability Symposium, 1999.

Thanks