Erasure Codes for Systems

Presentation transcript:

Erasure Codes for Systems
COS 518: Advanced Computer Systems, Lecture 14
Michael Freedman (slides originally by Wyatt Lloyd)

Things Fail, Let's Not Lose Data
- Replication
  - Store multiple copies of the data
  - Simple and very commonly used!
  - But requires a lot of extra storage
- Erasure coding
  - Store extra information we can use to recover the data
  - Fault tolerance with less storage overhead

Erasure Codes vs. Error Correcting Codes
- Error correcting code (ECC): protects against errors in data, i.e., silent corruptions
  - Bit flips can happen in memory -> use ECC memory
  - Bits can flip in network transmissions -> use ECCs
- Erasure code: data is erased, i.e., we know it's not there
  - Cheaper/easier than ECC
  - A special case of ECC
  - What we'll discuss today and what is used in practice
  - Protect against silent errors with checksums, which turn detected corruptions into known erasures

Erasure Codes: a Simple Example with XOR
- Store data blocks A and B plus a coding block A⊕B
- If A is lost, recover it from the other two blocks: A = B ⊕ (A⊕B)
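To make the XOR example concrete, here is a minimal sketch in Python (the helper names `make_parity` and `recover` are illustrative, not from the lecture): it builds the parity block for two equal-length data blocks and rebuilds a lost block from the surviving block and the parity.

```python
def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    assert len(x) == len(y)
    return bytes(a ^ b for a, b in zip(x, y))

def make_parity(a: bytes, b: bytes) -> bytes:
    """Coding block for the simple XOR code: parity = A XOR B."""
    return xor_blocks(a, b)

def recover(surviving: bytes, parity: bytes) -> bytes:
    """Recover the lost block: A = B XOR (A XOR B), and vice versa."""
    return xor_blocks(surviving, parity)

if __name__ == "__main__":
    A, B = b"hello wo", b"rld!!!!!"
    P = make_parity(A, B)          # stored alongside A and B
    assert recover(B, P) == A      # A was erased; rebuild it from B and the parity
    assert recover(A, P) == B      # symmetric: B can be rebuilt the same way
```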

Reed-Solomon Codes (1960)
- N data blocks, K coding blocks, M = N + K total blocks
- Recover any block from any N other blocks!
- Tolerates up to K simultaneous failures
- Works for any N and K (within reason)

Reed-Solomon Code Notation
- N data blocks, K coding blocks, M = N + K total blocks
- RS(N, K); e.g., (10, 4): 10 data blocks, 4 coding blocks
  - f4 uses this; Facebook's HDFS deployment for its data warehouse does too
- You will also see (M, N) notation sometimes
  - (14, 10): 14 total blocks, 10 data blocks (4 coding blocks)

Reed-Solomon Codes: How They Work
- Galois field arithmetic is the secret sauce
- The details aren't important for us
- See J. S. Plank, "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems," Software—Practice & Experience 27(9):995–1012, September 1997.
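For the curious, here is a minimal sketch (my own illustration, not from the lecture or from Plank's tutorial) of the two ingredients: multiplication in the Galois field GF(2^8), and a coding byte computed as a linear combination of data bytes over that field. Real Reed-Solomon libraries choose the coefficient matrix carefully (e.g., Cauchy or systematic Vandermonde constructions) so that any N blocks are enough to recover the rest.

```python
# Multiplication in GF(2^8) with the reducing polynomial x^8+x^4+x^3+x^2+1 (0x11d),
# computed bit by bit ("Russian peasant" style).
def gf_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a          # "addition" in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d           # reduce back into 8 bits
    return result

# One coding byte is a linear combination of data bytes over GF(2^8):
# c = coeff[0]*d[0] ^ coeff[1]*d[1] ^ ..., where * is gf_mul.
def encode_byte(coeffs, data_bytes) -> int:
    acc = 0
    for c, d in zip(coeffs, data_bytes):
        acc ^= gf_mul(c, d)
    return acc

if __name__ == "__main__":
    data = [0x12, 0x34, 0x56, 0x78]             # one byte from each of N = 4 data blocks
    parity1 = encode_byte([1, 1, 1, 1], data)   # all-ones coefficients: plain XOR parity
    parity2 = encode_byte([1, 2, 4, 8], data)   # a second combination with different coefficients
    print(hex(parity1), hex(parity2))
```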

Reed-Solomon (4, 2) Example
- Four data blocks (A, B, C, D) and two coding blocks (1, 2)
- Any block can be rebuilt from any 4 of the others, e.g.:
  - A can be recovered from B, C, D, and coding block 1
  - A can also be recovered from C, D, and coding blocks 1 and 2
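A toy version of this recovery, using ordinary integer arithmetic instead of GF(2^8) (so it only illustrates the algebra; real Reed-Solomon works over a finite field so blocks stay fixed-size bytes). Two parity values P1 = A+B+C+D and P2 = A+2B+3C+4D let us rebuild A either from {B, C, D, P1} or, if B is also gone, from {C, D, P1, P2}.

```python
def encode(a, b, c, d):
    p1 = a + b + c + d            # first "coding block"
    p2 = a + 2*b + 3*c + 4*d      # second "coding block", different coefficients
    return p1, p2

def recover_a_from_bcd_p1(b, c, d, p1):
    return p1 - b - c - d

def recover_a_from_cd_p1_p2(c, d, p1, p2):
    # Two unknowns (A, B), two equations:
    #   s1 = A + B      (from P1 after removing C and D)
    #   s2 = A + 2B     (from P2 after removing 3C and 4D)
    s1 = p1 - c - d
    s2 = p2 - 3*c - 4*d
    b = s2 - s1
    return s1 - b                 # A

if __name__ == "__main__":
    A, B, C, D = 7, 11, 13, 17
    P1, P2 = encode(A, B, C, D)
    assert recover_a_from_bcd_p1(B, C, D, P1) == A
    assert recover_a_from_cd_p1_p2(C, D, P1, P2) == A
```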

Erasure Codes Save Storage
- Tolerating 2 failures:
  - 3x replication = 3x storage overhead
  - RS(4, 2) = (4+2)/4 = 1.5x storage overhead
- Tolerating 4 failures:
  - 5x replication = 5x storage overhead
  - RS(10, 4) = (10+4)/10 = 1.4x storage overhead
  - RS(100, 4) = (100+4)/100 = 1.04x storage overhead
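The pattern is simply overhead = (N + K) / N; a throwaway helper (illustrative, not part of the lecture) makes the comparison explicit:

```python
def rs_overhead(n_data: int, k_coding: int) -> float:
    """Storage overhead of an RS(N, K) code: total blocks / data blocks."""
    return (n_data + k_coding) / n_data

def replication_overhead(failures_tolerated: int) -> float:
    """Replication needs failures_tolerated + 1 full copies."""
    return failures_tolerated + 1.0

for n, k in [(4, 2), (10, 4), (100, 4)]:
    print(f"RS({n},{k}): {rs_overhead(n, k):.2f}x  vs  "
          f"replication: {replication_overhead(k):.1f}x")
```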

What’s the Catch?

Catch 1: Encoding Overhead
- Replication: just copy the data
- Erasure coding: compute a code over N data blocks for each of the K coding blocks

Catch 2: Decoding Overhead
- Replication: just read the data
- Erasure coding:
  - The normal case is no failures -> just read the data!
  - If there are failures: read N blocks from disks and over the network, then compute the code over those N blocks to reconstruct the failed block
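A minimal sketch of that read path, using a toy in-memory "stripe" and the simple XOR code from earlier rather than a real storage client or Reed-Solomon library: the common case touches one block, and only a failure triggers the multi-block fetch plus reconstruction.

```python
class BlockUnavailable(Exception):
    pass

# Toy "cluster": one stripe of blocks keyed by id; failed ids raise on read.
STRIPE = {"A": b"aaaa", "B": b"bbbb",
          "P": bytes(a ^ b for a, b in zip(b"aaaa", b"bbbb"))}
FAILED = {"A"}

def read_block(block_id):
    if block_id in FAILED:
        raise BlockUnavailable(block_id)
    return STRIPE[block_id]

def reconstruct(missing_id):
    # For the simple XOR code, the missing block is the XOR of the survivors.
    survivors = [STRIPE[b] for b in STRIPE if b != missing_id]  # N reads + network traffic
    out = bytes(len(survivors[0]))
    for blk in survivors:
        out = bytes(x ^ y for x, y in zip(out, blk))            # CPU cost of decoding
    return out

def read_with_degraded_fallback(block_id):
    try:
        return read_block(block_id)        # normal case: a single read
    except BlockUnavailable:
        return reconstruct(block_id)       # degraded read: fetch survivors + decode

assert read_with_degraded_fallback("A") == b"aaaa"
assert read_with_degraded_fallback("B") == b"bbbb"
```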

Catch 3: Updating Overhead
- Replication: update the data in each copy
- Erasure coding: update the data in the data block and all of the coding blocks

Catch 3’: Deleting Overhead Replication: Delete the data in each copy Erasure coding Delete the data in the data block Update all of the coding blocks

Catch 4: Update Consistency
- Replication: consensus protocol (Paxos!)
- Erasure coding:
  - Need to consistently update all coding blocks with a data block
  - Need to consistently apply updates in a total order across all blocks
  - Need to ensure reads, including decoding, are consistent

Catch 5: Fewer Copies for Reading
- Replication: read from any of the copies
- Erasure coding: read from the data block, or reconstruct the data on the fly if there is a failure

Catch 6: Larger Minimum System Size
- Replication: need K+1 disjoint places to store data (e.g., 3 disks for 3x replication)
- Erasure coding: need M = N+K disjoint places to store data (e.g., 14 disks for RS(10, 4))

What's the Catch?
- Encoding overhead
- Decoding overhead
- Updating overhead
- Deleting overhead
- Update consistency
- Fewer copies for serving reads
- Larger minimum system size

Different Codes Make Different Tradeoffs
- Encoding, decoding, and updating overheads
- Storage overheads
  - Best are "Maximum Distance Separable" (MDS) codes, where K extra blocks allow you to tolerate any K failures
- Configuration options
  - Some allow any (N, K); some restrict the choices of N and K
- See James S. Plank, "Erasure Codes for Storage Systems: A Brief Primer," USENIX ;login:, December 2013, for a good jumping-off point
  - Also a good, accessible resource generally

Erasure Coding Big Picture
- Huge positive: fault tolerance with less storage overhead!
- Many drawbacks:
  - Encoding overhead
  - Decoding overhead
  - Updating overhead
  - Deleting overhead
  - Update consistency
  - Fewer copies for serving reads
  - Larger minimum system size

Let’s Improve Our New Hammer!

Erasure Coding Big Picture (revisited)
- Huge positive: fault tolerance with less storage overhead!
- Many drawbacks: encoding, decoding, updating, and deleting overheads; update consistency; fewer copies for serving reads; larger minimum system size
- The drawbacks matter much less when:
  - Data is immutable
  - You are storing lots of data (so the storage overhead actually matters)
  - Data is stored for a long time after being written
  - The read rate is low

f4: Facebook's Warm BLOB Storage System [OSDI '14]
Subramanian Muralidhar*, Wyatt Lloyd*ᵠ, Sabyasachi Roy*, Cory Hill*, Ernest Lin*, Weiwen Liu*, Satadru Pan*, Shiva Shankar*, Viswanath Sivakumar*, Linpeng Tang*⁺, Sanjeev Kumar*
*Facebook Inc., ᵠUniversity of Southern California, ⁺Princeton University

BLOBs@FB
- Immutable & unstructured
- Diverse: cover photos, profile photos, feed photos, feed videos
- A LOT of them!!

Hot data vs. warm data: data cools off rapidly. Relative read rates fall from roughly 590x for BLOBs less than a day old to about 1x for BLOBs a year old (bucketed by age: < 1 day, 1 day, 1 week, 1 month, 3 months, 1 year).

Handling failures: Haystack replicates each BLOB across three datacenters (each with its own racks and hosts), so losing data requires 3 datacenter failures, 3 rack failures, 3 host failures, or 9 disk failures. Effective replication: 1.2 (RAID-6 within a datacenter) × 3 (datacenters) = 3.6x.

Handling load: reads are spread across many hosts. The goal for f4: reduce space usage AND not compromise reliability.

Background: Data serving. User requests (reads and writes) reach BLOB storage through three layers: the CDN protects storage, the web tier adds business logic, and the router abstracts storage.

Background: Haystack [OSDI '10]. A volume is a series of BLOBs, each stored with a header and footer; an in-memory index maps each BLOB ID to its offset within the volume.
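A minimal sketch of the idea (illustrative only; real Haystack stores richer metadata, headers/footers, checksums, and delete flags): appends go to a log-structured volume file, and an in-memory dict maps each BLOB ID to (offset, size) so a read is a single seek.

```python
import os

class ToyVolume:
    """Log-structured volume: BLOBs are appended; an in-memory index maps id -> (offset, size)."""

    def __init__(self, path: str):
        self.f = open(path, "a+b")
        self.index = {}                      # blob_id -> (offset, size)

    def put(self, blob_id: str, data: bytes) -> None:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)                   # real Haystack also writes a header/footer
        self.f.flush()
        self.index[blob_id] = (offset, len(data))

    def get(self, blob_id: str) -> bytes:
        offset, size = self.index[blob_id]   # one in-memory lookup...
        self.f.seek(offset)
        return self.f.read(size)             # ...then a single disk read

if __name__ == "__main__":
    vol = ToyVolume("/tmp/toy_volume.dat")
    vol.put("bid1", b"cat photo bytes")
    assert vol.get("bid1") == b"cat photo bytes"
```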

Introducing f4: Haystack on cells. A cell spans multiple racks and holds data plus index, along with compute nodes (used for encoding and decoding).

Data splitting: a 10 GB volume is divided into stripes of data blocks; Reed-Solomon encoding over each stripe produces the parity blocks (4 GB of parity for the volume), and BLOBs are laid out across the stripes' data blocks.

Data placement: the data and parity blocks of each stripe are spread across a cell with 7 racks. Reed-Solomon (10, 4) is used in practice (1.4x); it tolerates 4 rack (and 4 disk/host) failures.

Reads are 2-phase: within a cell, the router first issues an index read to the storage nodes, which returns the exact physical location of the BLOB, and then issues a data read to that location.

Reads under cell-local failures: failures of disks, hosts, or racks are handled locally within the cell. The data read is redirected to the compute (decoder) nodes, which perform a decode read, fetching surviving blocks of the stripe and reconstructing the missing data.

Reads under datacenter failures (2.8x): each cell has a mirror cell in a second datacenter; if the primary cell's datacenter fails, requests are proxied to the mirror cell (with its own decoders). 2 × 1.4x = 2.8x.

Cross-datacenter XOR (1.5 × 1.4 = 2.1x): instead of a full mirror, blocks from cells in Datacenter1 and Datacenter2 are XORed together and the result is stored in Datacenter3; each datacenter also keeps a cross-DC copy of the index.
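The effective replication factors quoted on these slides come from simple multiplication; a quick check of the arithmetic (my own, matching the numbers in the slides):

```python
RS_10_4 = (10 + 4) / 10          # 1.4x within one cell

haystack = 3 * 1.2               # 3 datacenters, ~1.2x (RAID-6) within each
mirrored_f4 = 2 * RS_10_4        # full copy of the cell in a second datacenter
xor_f4 = 1.5 * RS_10_4           # cross-datacenter XOR adds half a copy

print(f"Haystack: {haystack:.1f}x, f4 mirrored: {mirrored_f4:.1f}x, f4 XOR: {xor_f4:.1f}x")
# -> Haystack: 3.6x, f4 mirrored: 2.8x, f4 XOR: 2.1x
```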

Reads with datacenter failures (2.1x): the router reads the corresponding block from the surviving datacenter and the XOR block from the third datacenter (using the cross-DC index copies), then XORs them to reconstruct the requested data.

Haystack vs. f4 fault tolerance and load:

  Metric                              Haystack (3-way)   f4 2.8 RS(10,4)   f4 2.1 RS(10,4)
  Effective replication               3.6x               2.8x              2.1x
  Irrecoverable disk failures         9                  10                10
  Irrecoverable host failures         3                  10                10
  Irrecoverable rack failures         3                  10                10
  Irrecoverable datacenter failures   3                  2                 2
  Load split                          3x                 2x                1x

Evaluation
- What and how much data is "warm"?
- Can f4 satisfy throughput and latency requirements?
- How much space does f4 save?
- f4 failure resilience

Methodology
- CDN data: 1 day, 0.5% sampling
- BLOB store data: 2 weeks, 0.1% sampling
- Random distribution of BLOBs assumed
- Worst-case rates reported

Hot and warm divide: data < 3 months old stays in Haystack (hot data); data > 3 months old moves to f4 (warm data). Around the divide, read rates have fallen to roughly 80 reads/sec.

It is warm, not cold: roughly half of the data is hot (served by Haystack) and half is warm (served by f4).

f4 performance: the most loaded disk in the cluster sees a peak load of 35 reads/sec.

f4 performance, latency: P80 = 30 ms, P99 = 80 ms.

Summary
- Facebook's BLOB storage is big and growing
- BLOBs cool down with age: ~100x drop in read requests within 60 days
- Haystack's 3.6x replication is over-provisioning for old, warm data
- f4 erasure-codes warm data to lower the effective replication factor to 2.1x