
Coding for Distributed Storage
Alex Dimakis (UT Austin)
Joint work with Mahesh Sathiamoorthy (USC), Megasthenis Asteris, Dimitris Papailiopoulos (UT Austin), Ramkumar Vadali, Scott Chen, Dhruba Borthakur (Facebook)

current hadoop architecture
[Figure: the NameNode tracks files as lists of blocks (file 1: blocks 1-4, file 2: blocks 1-6, file 3: blocks 1-5, ...); each block is stored three times, with the replicas spread across DataNodes 1-4.]

Real systems that use distributed storage codes
- Windows Azure (Cheng et al., USENIX 2012): LRC code, used in production in Microsoft
- CORE (Li et al., MSST 2013): regenerating EMSR code
- NCCloud (Hu et al., USENIX FAST 2012): regenerating functional MSR code
- ClusterDFS (Pamies-Juarez et al.): self-repairing codes
- StorageCore (Esmaili et al.): over Hadoop HDFS
- HDFS Xorbas (Sathiamoorthy et al., VLDB 2013): LRC code over Hadoop HDFS, testing in Facebook clusters

Coding theory for Distributed Storage
Storage efficiency is the main driver of cost in distributed storage systems ("disks are cheap"). Today all big storage systems use some form of erasure coding: classical Reed-Solomon codes or modern codes with local repair (LRC). Still, the theory is not fully understood.

How to store using erasure codes
[Figure: file 1 (blocks 1-4) stored either as two full replicas or as the original four blocks plus two parity blocks P1, P2.]

how to store using erasure codes (k=2)
[Figure: a file or data object is split into two packets A and B. Replication stores A, A, B, B. A (3,2) MDS code (single parity, as used in RAID 5) stores A, B, A+B. Adding A+2B gives a (4,2) MDS code, which tolerates any 2 failures and is used in RAID 6.]
Example: A = 0 1 0 1 1 1 0 1 ..., B = 1 1 0 0 0 1 1 1 ..., A+B = 1 0 0 1 1 0 1 0 ..., A+2B = 0 0 0 1 1 0 1 0 ...
Speaker note: This is an example of a code with n=4, k=2, where any 2 packets suffice to recover the data. Since 2 is the minimum number of packets needed to have any hope of recovery, good erasure codes introduce redundancy in an optimal way. Compare with replication: if each packet is stored on a storage node and each node fails independently with probability 1/2, we can compute the probability that the data cannot be recovered, and erasure codes clearly outperform repetition. Reed-Solomon, LT and Raptor codes are all good erasure codes with numerous applications.

erasure codes are reliable
[Figure: replication (A, A, B, B) vs. a (4,2) MDS erasure code (A, B, A+B, A+2B), where any 2 stored packets suffice to recover the file or data object.]

storing with an (n,k) code
An (n,k) erasure code provides a way to take k packets and generate n packets of the same size, such that any k out of the n suffice to reconstruct the original k. This gives optimal reliability for that amount of redundancy. Such codes are well known and used frequently, e.g. Reed-Solomon codes, array codes, LDPC and Turbo codes. Each packet is stored on a different node, distributed across a network.
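To make the A/B picture above concrete, here is a minimal sketch (Python, my own illustration rather than anything from the deck) of the (3,2) single-parity idea: split a file into two packets, store their XOR as a third packet, and recover from any two of the three.

```python
def encode_single_parity(data: bytes):
    """Split the data into k=2 packets A, B and add the parity P = A XOR B."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\x00")      # pad B so both packets have equal size
    p = bytes(x ^ y for x, y in zip(a, b))    # single parity packet
    return [a, b, p]                          # n=3 packets: any 2 suffice

def recover(packets):
    """Rebuild (A, B) from any 2 of the 3 packets; a lost packet is marked None."""
    a, b, p = packets
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))    # A = B XOR P
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))    # B = A XOR P
    return a + b

original = b"hello world!"
stored = encode_single_parity(original)
stored[0] = None                                  # lose the node holding packet A
assert recover(stored).rstrip(b"\x00") == original
```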

current default hadoop architecture
[Figure: a 640 MB file => 10 blocks; each block is stored on 3 DataNodes.]
3x replication is the current HDFS default. Very large storage overhead. Very costly for big data.

facebook introduced Reed-Solomon (HDFS RAID)
[Figure: a 640 MB file => 10 blocks, plus 4 parity blocks P1-P4.]
Older files are switched from 3x replication to a (14,10) Reed-Solomon code.

Tolerates 2 missing blocks at storage cost 3x vs. tolerates 4 missing blocks at storage cost 1.4x
[Figure: 3x replication stores blocks 1-10 three times; HDFS RAID stores blocks 1-10 plus Reed-Solomon parities P1-P4.]
HDFS RAID uses Reed-Solomon erasure codes. DiskReduce (B. Fan, W. Tantisiriroj, L. Xiao, G. Gibson).
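A quick sanity check on the numbers above (an illustrative calculation, not code from the deck): per block, 3x replication behaves like an (n,k) = (3,1) scheme, while HDFS RAID stripes 10 blocks and adds 4 parities.

```python
def summarize(scheme, n, k):
    """Storage overhead (stored size / useful size) and tolerated block losses."""
    print(f"{scheme}: overhead {n / k:.1f}x, tolerates {n - k} missing blocks")

summarize("3x replication (per block)", n=3, k=1)     # 3.0x, tolerates 2
summarize("(14,10) Reed-Solomon stripe", n=14, k=10)  # 1.4x, tolerates 4
```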

erasure codes save space

Limitations of classical codes
Currently only 8% of Facebook's data is Reed-Solomon encoded. Goal: move to 50-60% encoded data (saving petabytes). Bottleneck: the repair problem.

The repair problem
[Figure: 14 nodes storing blocks 1-10 and parities P1-P4; block 3 is lost and must be rebuilt on a new node 3'.]
Great, we can tolerate n-k = 4 node failures, but most of the time we start with a single failure. To repair it, we read from any 10 nodes and send all their data to node 3', which can then rebuild the lost block. This means high network traffic and high disk reads: 10x more than the lost information.

The repair problem
[Figure: blocks 1-10 and parities P1-P4, with a new node 3' replacing a failed one.]
If we have 1 failure, how do we rebuild the redundancy on a new disk? Naive repair: send k blocks; for file size B, that is B/k per block. Do I need to reconstruct the whole data object to repair one failure?

Three repair metrics of interest
1. Repair bandwidth: the number of bits communicated over the network during a single node repair. Capacity is known for two points only; my 3-year-old conjecture for the intermediate points was just disproved. [ISIT13]
2. Disk I/O: the number of bits read from disks during a single node repair. Capacity unknown; the only known technique is bounding it by the repair bandwidth.
3. Locality: the number of nodes accessed to repair a single node failure. Capacity known for some cases, and practical LRC codes are known for some cases [ISIT12, Usenix12, VLDB13]; general constructions remain open.

The repair problem (recap)
To repair a single failed node with a (14,10) Reed-Solomon code, we read from any 10 nodes and send all their data to the replacement node: high network traffic and high disk reads, 10x more than the lost information.

The repair problem
OK, we can tolerate n-k disk failures without losing data. But if we have 1 failure, how do we rebuild the redundancy on a new disk? Naive repair: send k blocks; for file size B, that is B/k per block. Do I need to reconstruct the whole data object to repair one failure?
Functional repair: the rebuilt block 3' can be different from 3, as long as the code maintains the "any k out of n" reliability property. Exact repair: 3' is exactly equal to 3.

The repair problem (Regenerating Codes, IT Transactions 2010)
Theorem: it is possible to functionally repair a code by communicating only (B/k) * d/(d-k+1) bits from d helper nodes (at the minimum-storage point), as opposed to the naive repair cost of B bits. (Regenerating Codes, IT Transactions 2010)
For n=14, k=10, d=13, B=640 MB: naive repair costs 640 MB of repair bandwidth; the optimal repair bandwidth (beta*d at the MSR point) is 208 MB.
Can we match this with exact repair? Theoretically yes, but no practical codes are known!
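The sketch below (my own, using the standard minimum-storage regenerating formula and assuming d = 13 helpers, i.e. all surviving nodes) reproduces the 640 MB vs. 208 MB comparison:

```python
def naive_repair_bw(B, k):
    """Naive repair: download k blocks of size B/k each, i.e. the whole file."""
    return k * (B / k)

def msr_repair_bw(B, k, d):
    """Minimum-storage regenerating (MSR) point: each of d helpers sends
    beta = B / (k * (d - k + 1)), so the total repair bandwidth is d * beta."""
    return B * d / (k * (d - k + 1))

B, n, k = 640.0, 14, 10          # file size in MB, (14,10) code
d = n - 1                        # repair contacts all 13 surviving nodes
print(naive_repair_bw(B, k))     # 640.0 MB
print(msr_repair_bw(B, k, d))    # 208.0 MB
```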

Storage-Bandwidth tradeoff
For any (n,k) code we can find the minimum functional repair bandwidth. There is a tradeoff between the storage per node α and the repair bandwidth β, and this defines a tradeoff region for functional repair.
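For reference, the functional-repair region from the Regenerating Codes paper cited above can be written as follows (a sketch of the standard statement; d denotes the number of helper nodes contacted during a repair):

```latex
% Cut-set bound on the (alpha, beta) tradeoff for functional repair of a file of
% size B with an (n,k) code, repairing a node from d helpers:
B \;\le\; \sum_{i=0}^{k-1} \min\!\bigl(\alpha,\,(d-i)\beta\bigr)
% Extreme points of the region:
%   minimum storage (MSR):   \alpha = B/k,           \gamma = d\beta = \frac{Bd}{k(d-k+1)}
%   minimum bandwidth (MBR): \alpha = d\beta = \frac{2Bd}{k(2d-k+1)}
```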

Repair Bandwidth Tradeoff Region
[Figure: the storage-bandwidth tradeoff curve, with the minimum-bandwidth point (MBR) and the minimum-storage point (MSR) marked.]

Exact repair region?
[Figure: the same tradeoff curve with the MBR and MSR points; which part of it is achievable under exact repair?]

Status in 2011

Status in 2012
Code constructions by: Rashmi, Shah, Kumar; Suh, Ramchandran; El Rouayheb, Oggier, Datta; Silberstein, Viswanath et al.; Cadambe, Maleki, Jafar; Papailiopoulos, Wu, Dimakis; Wang, Tamo, Bruck.

Status in 2013
Provable gap from the cut-set region (Chao Tian, ISIT 2013). The exact-repair region remains open.

Taking a step back
Finding exact regenerating codes is still an open problem in coding theory. What can we do to make practical progress? Change the problem.

Locally Repairable Codes

Three repair metrics of interest (recap)
1. Repair bandwidth: bits communicated over the network during a single node repair. Capacity known for two points only; my 3-year-old conjecture for the intermediate points was just disproved. [ISIT13]
2. Disk I/O: bits read from disks during a single node repair. Capacity unknown; the only known technique is bounding it by the repair bandwidth.
3. Locality: nodes accessed to repair a single node failure. Capacity known for some cases; practical LRC codes known for some cases [ISIT12, Usenix12, VLDB13]; general constructions open.

Minimum Distance
The distance d of a code is the minimum number of erasures after which data is lost. For Reed-Solomon (14,10) (n=14, k=10), d = 5. R. Singleton (1964) showed a bound on the best distance possible: d ≤ n - k + 1. Reed-Solomon codes achieve the Singleton bound (hence they are called MDS).

Locality of a code
A code symbol has locality r if it is a function of r other codeword symbols. A systematic code has message locality r if all its systematic symbols have locality r. A code has all-symbol locality r if all its symbols have locality r. In an MDS code, all symbols have locality at most r ≤ k. Easy lemma: any MDS code must have trivial locality r = k for every symbol (if a symbol were a function of r < k others, those r + 1 ≤ k symbols would be dependent, contradicting the MDS property that any k symbols carry independent information).

Example: code with message locality 5
[Figure: blocks 1-10 with RS parities p1-p4, plus two local XOR parities x1 (over blocks 1-5) and x2 (over blocks 6-10).]
All k=10 message blocks can be recovered by reading r=5 other blocks, but a single parity block failure still requires 10 reads. What is the best distance possible for a code with locality r?

Locality-distance tradeoff
Codes with all-symbol locality r can have distance at most d ≤ n - k - ⌈k/r⌉ + 2. This was shown by Gopalan et al. for scalar linear codes (Allerton 2012) and by Papailiopoulos et al. information-theoretically (ISIT 2012). Setting r = k (trivial locality) recovers the Singleton bound, so any non-trivial locality will hurt the fault tolerance of the storage system. Pyramid codes (Huang et al.) achieve this bound for message locality; few explicit code constructions are known for all-symbol locality.
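As a worked illustration (my numbers, not from the slides): keeping n = 14, k = 10 but demanding all-symbol locality r = 5 lowers the best achievable distance from the Singleton value of 5 down to 4:

```latex
d \;\le\; n - k - \left\lceil \tfrac{k}{r} \right\rceil + 2
  \;=\; 14 - 10 - \left\lceil \tfrac{10}{5} \right\rceil + 2
  \;=\; 4
  \qquad\text{vs.}\qquad
  d \;\le\; n - k + 1 \;=\; 5 \ \text{(no locality constraint).}
```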

All-symbol locality
[Figure: blocks 1-10 with RS parities p1-p4 and local parities x1, x2, x3.]
The coefficients need to put the local forks in general position with respect to the global parities. Random coefficients work with high probability over an exponentially large field, but checking a given choice requires exponential time. General explicit LRCs for this case are an open problem.

Recap: locality and distance
If each symbol has optimal size α = M/k, the best distance possible is d ≤ n - k + 1 (bound by R. Singleton, 1964), achieved by Reed-Solomon codes (1960); no locality is possible. If we need locality r, the best distance possible is d ≤ n - k - ⌈k/r⌉ + 2 (Gopalan et al., Papailiopoulos et al.). If each symbol has slightly larger size α = (1+ε)M/k, the best distance possible is d ≤ n - ⌈k/(1+ε)⌉ - ⌈k/(r(1+ε))⌉ + 2.

Recent work on LRCs
- Silberstein, Kumar et al., Pamies-Juarez et al.: local repair even when multiple failures happen within each group.
- Tamo et al.: explicit LRC constructions (for some values of n, k, r).
- Mazumdar et al.: locality bounds for a given field size.

LRC code design
[Figure: blocks 1-10, RS parities p1-p4, and local XOR parities x1 (blocks 1-5), x2 (blocks 6-10), x3 (parities p1-p4).]
Local XORs allow single-block recovery by transferring only 5 blocks (320 MB) instead of 10 blocks (640 MB in HDFS RAID). 17 total blocks are stored, so the storage overhead increases to 1.7x from 1.4x.

LRC code design
[Figure: same layout, with x3 computed from the RS parities.]
The local XORs can be any local linear combinations (just invert them during repair). Choose the coefficients so that x1 + x2 = x3 (interference alignment); then x3 does not need to be stored!
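In formulas, the alignment trick looks roughly like this (the symbols c_i for data blocks, p_j for the RS parities, and the coefficients a_i, b_j are my notation, not the deck's):

```latex
x_1 = \sum_{i=1}^{5} a_i c_i, \qquad
x_2 = \sum_{i=6}^{10} a_i c_i, \qquad
x_3 = \sum_{j=1}^{4} b_j p_j,
\qquad \text{with coefficients chosen so that } x_1 + x_2 = x_3.
% x_3 is therefore implied and need not be stored: a lost RS parity p_j can be
% repaired from the three surviving parities plus x_3 = x_1 + x_2, i.e. 5 reads.
```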

code we implemented in HDFS
[Figure: blocks 1-10, RS parities p1-p4, and local XOR parities x1 and x2 (x3 implied).]
Single block failures can be repaired by accessing 5 blocks (vs. 10). The code stores 16 blocks: 1.6x storage overhead vs. 1.4x in HDFS RAID. We implemented this in Hadoop (system available on github/madiator).
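A minimal sketch of the single-block repair path (Python, my own illustration; it uses plain XOR local parities over groups of five, whereas the actual code uses general linear combinations and also keeps the four RS parities, omitted here):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

# 10 toy data blocks; group 1 = blocks 0..4, group 2 = blocks 5..9
data = [bytes([i] * 8) for i in range(10)]
x1 = xor_blocks(data[0:5])    # local parity for group 1
x2 = xor_blocks(data[5:10])   # local parity for group 2

# Repair block 2 by reading only its 4 group-mates plus x1 (5 reads instead of 10):
repaired = xor_blocks(data[0:2] + data[3:5] + [x1])
assert repaired == data[2]
```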

Practical Storage Systems and Open Problems

Real systems that use distributed storage codes
- Windows Azure (Cheng et al., USENIX 2012): LRC code, used in production in Microsoft
- CORE (Li et al., MSST 2013): regenerating EMSR code
- NCCloud (Hu et al., USENIX FAST 2012): regenerating functional MSR code
- ClusterDFS (Pamies-Juarez et al.): self-repairing codes
- StorageCore (Esmaili et al.): over Hadoop HDFS
- HDFS Xorbas (Sathiamoorthy et al., VLDB 2013): LRC code over Hadoop HDFS, testing in Facebook clusters

HDFS Xorbas (HDFS with LRC codes)
A practical open-source Hadoop file system with a built-in locally repairable code, tested in a Facebook analytics cluster and on an Amazon cluster. It uses a (16,10) LRC with all-symbol locality 5. Microsoft's LRC has only message-symbol locality: it saves storage and gives good degraded reads, but bad repair.

HDFS Xorbas

code we implemented in HDFS
[Figure: blocks 1-10, RS parities p1-p4, and local XOR parities x1 and x2 (x3 implied).]
Single block failures can be repaired by accessing 5 blocks (vs. 10). The code stores 16 blocks: 1.6x storage overhead vs. 1.4x in HDFS RAID. We implemented this in Hadoop (system available on github/madiator).

Java implementation

Some experiments

Some experiments
100 machines on Amazon EC2: 50 machines running HDFS RAID (the Facebook version, with a (14,10) Reed-Solomon code) and 50 running our LRC code. 50 files were uploaded to the system, 640 MB per file. We killed nodes and measured network traffic, disk I/O, CPU, etc. during node repairs.

Repair network traffic
[Figure: Facebook HDFS RAID (RS code) vs. Xorbas HDFS (LRC code).]

CPU
[Figure: Facebook HDFS RAID (RS code) vs. Xorbas HDFS (LRC code).]

Disk I/O
[Figure: Facebook HDFS RAID (RS code) vs. Xorbas HDFS (LRC code).]

what we observe
- The new storage code reduces bytes read by roughly 2.6x.
- Network bandwidth is reduced by approximately 2x.
- We use 14% more storage. CPU usage is similar.
- In several cases repairs are 30-40% faster.
- It provides four more zeros of data availability compared to replication.
- Gains can be much more significant if larger codes are used (i.e. for archival storage systems).
- In some cases it might be better to save storage and reduce repair bandwidth but lose in locality (PiggyBack codes: Rashmi et al., USENIX HotStorage 2013).

Conclusions and Open Problems

Which repair metric to optimize? Repair bandwidth, all-symbol locality, message locality, disk I/O, fault tolerance, or a combination of all? It depends on the type of storage cluster (cloud, analytics, photo storage, archival; hot vs. cold data).

Open problems
Repair bandwidth: What is the exact-repair bandwidth region? Are there practical E-MSR codes for high rates? Can we repair codes over a small finite field? Better repair for existing codes (EVENODD, RDP, Reed-Solomon)?
Disk I/O: Bounds for disk I/O during repair?
Locality: Explicit LRCs for all values of n, k, r? Simple and practical constructions?
Further: Network-topology awareness (same rack / data center)? Solid-state drives and hot data (photo clusters)? Practical system designs for different applications?

Coding for Storage wiki