Distributed Storage Systems
Vinodh Venkatesan, IBM Zurich Research Lab / EPFL

Distributed Storage
Storage needs are increasing almost exponentially – widespread use of e-mail, photos, videos, logs, …
We can't store everything on one large disk: if that disk fails, we lose everything!
Solution: store the user's data, along with some redundant information, across many disks. If a disk fails, the surviving disks still hold enough information to recover; bring in a new disk and restore the lost information as soon as possible.
Simple? No. Today's large data centers have so many disks that multiple concurrent disk failures are common, and permanent data loss becomes likely. This presentation is about these issues.

Distributed Storage: What We Care About
Performance metrics:
- Storage efficiency: how much redundant information do you store?
- Saturation throughput: how many I/O requests can the system handle before it collapses (i.e., delay grows without bound)?
- Rebuild time: how fast can you restore the information lost when a disk fails?
- Mean time to data loss: under assumptions on the failure and usage models of the system, how long do you expect to run without any permanent loss of data?
- Encoding/decoding/update/rebuild complexity: the computational power needed for these operations; also, how many bytes on how many disks must be updated to change just 1 byte of user data?
- Sequential read/write bandwidth: the bandwidth the system can provide for streaming data.
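To see how these metrics interact, here is a minimal sketch (the function name is illustrative) of the classical mean-time-to-data-loss approximation for an N-disk single-parity array, assuming independent, exponentially distributed failures: data is lost only if a second disk fails during the rebuild window.

    def mttdl_single_parity(mttf_hours, mttr_hours, n_disks):
        """Classical approximation: MTTF^2 / (N * (N - 1) * MTTR).
        Data loss requires a second failure among the N - 1 survivors
        while the first failed disk is still being rebuilt."""
        return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

    # 8 disks, 1,000,000-hour MTTF, 24-hour rebuild:
    print(mttdl_single_parity(1_000_000, 24, 8))  # ~7.4e8 hours
    # Halving the rebuild time doubles the MTTDL, which is why rebuild
    # time sits alongside reliability in the metrics above.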

Distributed Storage: Some Challenges
- Scale-out (as opposed to scale-up): add low-cost commodity hardware to grow capacity incrementally, rather than making a large up-front investment in expensive, complex hardware.
- High availability: the cost of downtime for businesses is large; systems cannot be taken down for backups; rebuild time should be small.
- Reliability vs. cost: replication-based schemes for reliability are expensive.

RAID (Redundant Array of Independent Disks)
[Diagram: block/bit layout across drives for RAID-0 through RAID-3]
- RAID-0: no redundancy; data striped across different drives.
- RAID-1: mirroring; each block is duplicated on a second drive.
- RAID-2: bit-interleaved ECC (Hamming code); not used in practice.
- RAID-3: bit-interleaved parity with a dedicated parity disk.

RAID (Redundant Array of Independent Disks)
[Diagram: block layout across drives for RAID-4 through RAID-6]
- RAID-4: block-interleaved parity with a dedicated parity disk.
- RAID-5: block-interleaved parity, with parity blocks distributed (rotated) across all drives.
- RAID-6: two parity blocks per stripe, generated using Reed-Solomon coding or other schemes.
RAID-5 (or RAID-6) is used in storage systems today to protect against a single (or double) disk failure.
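To make the parity levels concrete, here is a minimal sketch (in Python, with illustrative names) of RAID-4/5-style parity: the parity block is the XOR of the data blocks in a stripe, and any single lost block is rebuilt by XORing the survivors with the parity.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # One stripe: three data blocks, one per data disk.
    stripe = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(stripe)               # stored on the parity disk

    # The disk holding block 1 fails: rebuild it from survivors + parity.
    rebuilt = xor_blocks([stripe[0], stripe[2], parity])
    assert rebuilt == stripe[1]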

Systems Using RAID
- RAID has been used in many systems to improve read/write bandwidth and to tolerate up to two disk failures.
- Usually shipped as a box of 8 disks (tolerating 1 failure) or 16 disks (tolerating 2 failures).
- Simple to implement.

Erasure Codes
[Diagram: data on n disks encoded onto (n + m) disks]
Encode the data on n disks onto (n + m) disks such that the whole system can tolerate up to m disk failures.
- Reliability: specified by the average number of disk failures tolerated; if this equals m (i.e., any m failures can be tolerated), the code is Maximum Distance Separable (MDS).
- Performance: encoding/decoding/update complexity.
- Space usage: rate = n / (n + m).
- Flexibility: can you arbitrarily add nodes? Change the rate? How does this affect failure coverage?
Examples: Reed-Solomon codes, parity array codes (EVENODD, X-Code, WEAVER), LDPC codes.
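As a runnable illustration of the MDS property, here is a sketch using the third-party reedsolo Python package (an assumption; any Reed-Solomon implementation would do): with m parity symbols, any m erasures at known positions can be recovered.

    from reedsolo import RSCodec

    m = 4                          # parity symbols; tolerates any m erasures
    rsc = RSCodec(m)

    data = b"distributed storage"
    codeword = rsc.encode(data)    # length = len(data) + m

    # Wipe m symbols at known positions (the failed "disks").
    damaged = bytearray(codeword)
    erased = [0, 5, 10, 15]
    for pos in erased:
        damaged[pos] = 0

    # Recent reedsolo versions return (message, full codeword, errata positions).
    decoded, _, _ = rsc.decode(bytes(damaged), erase_pos=erased)
    assert decoded == data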

Systems Using Erasure Codes
RAID is itself a type of erasure code.
- RobuStore (UCSD, 2007): uses Luby Transform (LT) codes; speculative access mechanisms; designed for large data objects, low latency, and high transfer rates; centralized.
- CERN (2008): experimented with LDPC codes for distributed file storage by splitting each file into chunks that are erasure-coded and spread across all nodes; this scheme is more decentralized.

Other Distributed Storage Systems
- Amazon Dynamo (2007): key-value store for small data objects; decentralized; highly available, with low latency (measured at the 99.9th percentile); repetition coding.
- Google File System (2003): for large objects; central server; high bandwidth more important than low latency; repetition coding.
- OceanStore (Berkeley, 2000): storage in an untrusted environment; both repetition and erasure coding (Reed-Solomon/Tornado codes).

Network Coding for Distributed Storage Systems*
Problem: how do we rebuild the data lost by failed nodes while transferring as little data as possible over the network (so as to reduce rebuild time)?
The new node downloads functions (combinations) of the data from surviving nodes => this reduces the amount of data transferred to the new node.
The paper shows a tradeoff between storage per node and repair bandwidth.
* Dimakis et al., Berkeley (2007)
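As a worked example of this tradeoff, the sketch below (with hypothetical helper names) evaluates the two extreme points derived in the paper: the minimum-storage regenerating (MSR) and minimum-bandwidth regenerating (MBR) points, for a file of size M stored across n nodes so that any k of them can reconstruct it, with a failed node repaired by contacting d survivors.

    def msr_point(M, k, d):
        """Minimum-storage point: each node stores the MDS minimum M/k;
        repair then needs total bandwidth (M/k) * d / (d - k + 1)."""
        alpha = M / k                       # storage per node
        gamma = (M / k) * d / (d - k + 1)   # total repair bandwidth
        return alpha, gamma

    def mbr_point(M, k, d):
        """Minimum-bandwidth point: repair bandwidth is the smallest
        possible and equals the (slightly larger) storage per node."""
        gamma = 2 * M * d / (2 * k * d - k * k + k)
        return gamma, gamma

    # File of size M = 1, n = 10 nodes, any k = 5 reconstruct, d = 9 helpers.
    print(msr_point(1.0, 5, 9))  # (0.2, 0.36): ~3x less than naively transferring the whole file
    print(mbr_point(1.0, 5, 9))  # (~0.257, ~0.257): ~4x less repair traffic, at a storage cost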

Read As Much As You Transfer
- To transfer the required amount of data from each surviving node to the new node, potentially all of the data on that node may need to be read (as in the example shown before).
- Read bandwidth can therefore become a bottleneck in the rebuild process.
- We would prefer to read only as much as we transfer (unlike the earlier example).
- Do there exist codes for which it is enough to read only as much as you transfer during rebuild? Our answer: yes.

Schemes for Faster Rebuild
Lower rebuild time => less chance that more disks fail during the rebuild => a more reliable system!

Other Issues: Failure Models
What is a disk failure?
- Completely unreadable? Mechanical or chip failures, for instance.
- Latent sector errors?
- How do you test a drive? Read all sectors and declare the drive faulty if any operation takes longer than a threshold; the result depends on the threshold.
- The user's point of view: a disk has failed if the user feels it no longer satisfies his/her needs.
Measures of disk reliability:
- From manufacturers:
  - MTTF: mean time to failure (based on accelerated tests on a large number of disks).
  - AFR: annualized failure rate (percentage of disks expected to fail in a year).
- More recent measures (from the user's point of view):
  - ARR: annualized replacement rate (percentage of disks replaced by users in a year).

Other Issues: Failure Models (contd.)
Traditional assumptions:
- Disk failures are independent and form a Poisson process (that is, the time between failures is exponentially distributed with parameter λ; then MTTF = 1/λ).
- Bathtub curve (a plot of failure rate vs. time looks like a bathtub):
  - Infant mortality: failure rates are high in the first few months to a year or so.
  - Useful life: the failure rate is at its minimum and stays roughly constant from about one year to about 4 or 5 years.
  - Wearout: the failure rate rises again after 4 or 5 years.
New findings (Schroeder and Gibson, CMU, and Google, 2007), based on disk replacement rates rather than disk failure rates:
- Disk replacements are not independent and do not follow a Poisson process: the longer since the last disk failed, the longer until the next disk fails! A possible explanation is that environmental and other shared factors (temperature, etc.) matter more than component-specific factors.
- Disk replacement rates do not settle into a steady state after one year (as the bathtub curve suggests); instead, they increase steadily over time.
A different failure model => different values for the performance metrics => a different design of codes to achieve that performance!
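To make the traditional model concrete, here is a minimal sketch relating a datasheet MTTF to an annualized failure rate under the exponential assumption above (the function name is illustrative):

    import math

    HOURS_PER_YEAR = 8760

    def afr_from_mttf(mttf_hours):
        """Probability that a disk fails within one year, assuming the time
        to failure is exponential with rate 1/MTTF (the Poisson model above)."""
        return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

    print(afr_from_mttf(1_000_000))  # ~0.0087: about 0.9% of disks per year
    # Across 100,000 disks that is roughly 870 failures a year -- more than
    # two per day, so concurrent failures and rebuilds are routine.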

Other Issues: Usage Models
Usage models are as important as failure models in estimating performance metrics and designing good codes:
- Write once, (almost) never read: e.g., archival storage.
- Write once, read many times: e.g., streaming applications like YouTube.
- Random short reads and writes: e.g., systems handling lots of short transactions, such as shopping sites, or high-performance computing.
A storage system with a given coding scheme and failure model can perform significantly differently under each of these usage models; it is unlikely that one coding scheme suits all failure and usage models.
Code design:
- Model failures as accurately as possible.
- Pick a usage model that resembles the application.
- Identify the performance metrics the user cares about (it is impossible to optimize all metrics; tradeoffs must be identified).
- Design a code that gives satisfactory results on the metrics of interest under the assumed failure and usage models.

Vision
- Machine-generated data far exceeds human-generated data, and data is growing exponentially: we need storage systems on the order of exabytes (a million terabytes).
- The distributed storage system could be as small as the premises of a company connected by a LAN, or it could span large geographic areas connected via the Web.
- Disk failure is the normal case, because we want to use inexpensive commodity disks.
- The system should be highly available (=> decentralized, reliable, etc.).
- Ideas from coding theory and network coding could prove useful in designing such a system.