Alex Dimakis based on collaborations with Mahesh Sathiamoorthy Megas Asteris Dimitris Papailiopoulos Kannan Ramchandran Scott Chen Ramkumar Vadali Dhruba.

Presentation transcript:

Alex Dimakis based on collaborations with Mahesh Sathiamoorthy Megas Asteris Dimitris Papailiopoulos Kannan Ramchandran Scott Chen Ramkumar Vadali Dhruba Borthakur USC Network Coding for Cloud Storage facebook

2 Distributed storage systems Numerous disk failures per day. Failures are the norm rather than the exception Must introduce redundancy for reliability Replication or erasure coding?

3 How to store using erasure codes. File or data object: k=2 blocks A, B. (3,2) MDS code (single parity), used in RAID 5: n=3 nodes store A, B, A+B. (4,2) MDS code, tolerates any 2 failures, used in RAID 6: n=4 nodes store A, B, A+B, A+2B.

4 Erasure codes are reliable. File or data object: A, B. Replication stores A, A, B, B. The (4,2) MDS erasure code stores A, B, A+B, A+2B (any 2 suffice to recover).

storing with an (n,k) code An (n,k) erasure code provides a way to: Take k packets and generate n packets of the same size such that Any k out of n suffice to reconstruct the original k Optimal reliability for that given redundancy. Well-known and used frequently, e.g. Reed-Solomon codes, Array codes, LDPC and Turbo codes. Each packet is stored at a different node, distributed in a network. 5
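The "any k out of n suffice" property can be checked mechanically for the small (4,2) example above. This is a minimal sketch, assuming the four packets A, B, A+B, A+2B are linear combinations over a prime field GF(p); the names `GENERATOR`, `det2_mod` and `any_two_suffice` are illustrative and not from the slides.

```python
from itertools import combinations

# Coefficient vectors over (A, B) for the stored packets A, B, A+B, A+2B.
GENERATOR = [(1, 0), (0, 1), (1, 1), (1, 2)]


def det2_mod(r1, r2, p):
    """Determinant of the 2x2 matrix with rows r1, r2, reduced mod p."""
    return (r1[0] * r2[1] - r1[1] * r2[0]) % p


def any_two_suffice(rows, p):
    """True if every pair of packets determines (A, B) over GF(p)."""
    return all(det2_mod(r1, r2, p) != 0 for r1, r2 in combinations(rows, 2))


print(any_two_suffice(GENERATOR, 3))  # field with 3 elements
print(any_two_suffice(GENERATOR, 2))  # binary field: A+2B collapses to A
```

Over GF(2) the packet A+2B is just A, so the pair {A, A+2B} no longer recovers B. The coefficient 2 therefore only makes sense over a field with more than two elements.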

current hadoop architecture 640 MB file => 10 blocks 3x replication is HDFS current default. Very large storage overhead. Very costly for BIG data

Facebook introduced Reed-Solomon (HDFS RAID). 640 MB file => 10 blocks, plus parity blocks P1 P2 P3 P4. Older files are switched from 3-replication to (14,10) Reed-Solomon.

3x replication: tolerates 2 missing blocks, storage cost 3x. HDFS RAID (source file plus parity file P1 P2 P3 P4): tolerates 4 missing blocks, storage cost 1.4x. Uses Reed-Solomon erasure codes. DiskReduce (B. Fan, W. Tantisiriroj, L. Xiao, G. Gibson)
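The two storage costs quoted above follow from n/k. The sketch below treats 3x replication as a (3,1) code and HDFS RAID as a (14,10) code; the helper names are made up for illustration.

```python
from fractions import Fraction


def storage_overhead(n, k):
    """Stored size divided by original size for an (n, k) MDS code."""
    return Fraction(n, k)


def tolerated_failures(n, k):
    """An (n, k) MDS code survives any n - k missing blocks."""
    return n - k


FILE_MB = 640
for name, (n, k) in {"3x replication": (3, 1), "RS (14,10)": (14, 10)}.items():
    stored = FILE_MB * storage_overhead(n, k)
    print(name, float(storage_overhead(n, k)), tolerated_failures(n, k), float(stored))
```

For the 640 MB file this gives 1920 MB stored under replication and 896 MB under the (14,10) code.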

RS codes save 5PB

Limitations of Reed Solomon Currently only 8% of facebook’s data warehouse is RS encoded. (still significant saving). Our Goal: move to 40-50% of coded data. Save Petabytes. 10

11 Coding+Storage Networks = New open problems Issues: Communication Update complexity Repair communication Repair bits Read No of nodes accessed A B ? Network traffic

overview 12 Storing information using codes. The repair problem Exact Repair. The state of the art. Interference Alignment Different repair metrics The road to practice Storage Allocation problems

P1 P2 P3 P4 The repair problem Great, we can tolerate n-k=4 node failures.

P1 P2 P3 P4 The repair problem Great, we can tolerate 4 node failures. Most of the time we start with a single failure.

P1 P2 P3 P4 The repair problem 3’ ??? Great, we can tolerate 4 node failures. Most of the time we start with a single failure.

P1 P2 P3 P4 The repair problem 3’ Great, we can tolerate 4 node failures. Most of the time we start with a single failure. Read from any 10 nodes, send all data to 3’ who can repair the lost block.

P1 P2 P3 P4 The repair problem 3’ Great, we can tolerate 4 node failures. Most of the time we start with a single failure. Read from any 10 nodes, send all data to 3’ who can repair the lost block. High network bandwidth, High disk IO at 10 nodes.

18 If we have 1 failure, how do we rebuild the redundancy in a new disk? Naïve repair: send k blocks. Filesize B, B/k per block P1 P2 P3 P4 3’ The repair problem Do I need to reconstruct the Whole data object to repair one failure?

is repair frequent? node failures * 15TB = 300TB if 8% RS coded, 588TB network traffic/day. (average total network: 2PB/day) ~30% of network traffic is repair in a normal day.

20 Ok, great, we can tolerate n-k disk failures without losing data. If we have 1 failure however, how do we rebuild the redundancy in a new disk? Naïve repair: send k blocks. Filesize B, B/k per block P1 P2 P3 P4 3’ The repair problem Do I need to reconstruct the Whole data object to repair one failure? Functional repair : 3’ can be different from 3. Maintains the any k out of n reliability property. Exact repair : 3’ is exactly equal to 3.

21 Ok, great, we can tolerate n-k disk failures without losing data. If we have 1 failure however, how do we rebuild the redundancy in a new disk? Naïve repair: send k blocks. Filesize B, B/k per block P1 P2 P3 P4 3’ The repair problem Do I need to reconstruct the Whole data object to repair one failure? Functional repair : 3’ can be different from 3. Maintains the any k out of n reliability property. Exact repair : 3’ is exactly equal to 3. Theorem: It is possible to functionally repair a code by communicating only As opposed to naïve repair cost of B bits. (Regenerating Codes)

Exact repair with 3GB. Four nodes store (a, b), (c, d), (a+c, b+d), (b+c, a+b+d), 1GB per block. First node lost: a? b? Repair: a = (b+d) + (a+b+d), b = d + (b+d).

Systematic repair with 1.5GB. Same code, first node lost: a = (b+d) + (a+b+d), b = d + (b+d). Reconstructing all the data: 4GB. Repairing a single node: 3GB. 3 equations were aligned, solvable for a,b.

Repairing the last node. Same code, last node lost: b+c = (c+d) + (b+d), a+b+d = a + (b+d).
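The two repairs above can be replayed on real bytes. This is a sketch assuming "+" is bytewise XOR and that the four nodes hold (a, b), (c, d), (a+c, b+d), (b+c, a+b+d); in both repairs each helper node sends a single block, so three blocks cross the network instead of four.

```python
def xor(*blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(a, b, c, d):
    """Contents of the four storage nodes, two blocks per node."""
    return [(a, b), (c, d), (xor(a, c), xor(b, d)), (xor(b, c), xor(a, b, d))]


def repair_first_node(nodes):
    """Rebuild (a, b) from d, b+d and a+b+d: one block per helper."""
    d = nodes[1][1]
    b_plus_d = nodes[2][1]
    a_plus_b_plus_d = nodes[3][1]
    return (xor(b_plus_d, a_plus_b_plus_d), xor(d, b_plus_d))


def repair_last_node(nodes):
    """Rebuild (b+c, a+b+d) from a, c+d and b+d: one block per helper."""
    a = nodes[0][0]
    c_plus_d = xor(nodes[1][0], nodes[1][1])  # node 2 mixes locally, sends one block
    b_plus_d = nodes[2][1]
    return (xor(c_plus_d, b_plus_d), xor(a, b_plus_d))


nodes = encode(b"AAAA", b"BBBB", b"CCCC", b"DDDD")
print(repair_first_node(nodes) == nodes[0], repair_last_node(nodes) == nodes[3])
```

Note that node 2 sends the combination c+d rather than a stored block. Letting helpers combine their own blocks before sending is what brings the download down to three blocks.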

network coding: multicasting 25 data collector 2 S data collector 1 Each link carries one packet. You can use it once. Max number of packets I can send to dc2?

network coding: multicasting 26 data collector 2 S data collector 1 Each link carries one packet. You can use it once. Max number of packets I can send to dc2? use the red and the green path: 2 packets.

network coding: multicasting 27 min cut (s, dc2)? data collector 2 data collector 1 S

network coding: multicasting 28 Max flow= min-cut (Ford Fulkerson+ Elias Feinstein Shannon 1956) data collector 2 data collector 1 S

network coding: multicasting 29 Max flow from S to dc(1)? Multicasting: maximum number of packets we can simultaneously send to many users. Cannot exceed min_i mincut(S, dc(i)). data collector 2 data collector 1 S

network coding: multicasting 30 data collector 2 S data collector 1 Sending one packet to both is easy. Routing. Can we always achieve min_i mincut(S, dc(i))?

the butterfly 31 data collector 2 data collector 1 S mincut(s,dc(1))=2 =mincut(s,dc(2)). Can we send (the same) two packets to both dc1,dc2 simultaneously?

the butterfly 32 data collector 2 data collector 1 S How does dc2 get the green packet?

the butterfly 33 data collector 2 data collector 1 S We need algebraic mixing of packets (network coding). A B A B B A A+B
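The butterfly argument can be checked in a few lines. This sketch assumes the usual butterfly topology with unit-capacity links; the intermediate node names `u`, `v`, `w`, `x` are hypothetical labels, since the diagram itself is not in the transcript. It first confirms that each data collector has min-cut 2, then shows that forwarding A+B (XOR) on the shared link lets both collectors recover both packets.

```python
from collections import deque

# Unit-capacity links of the butterfly; w -> x is the shared bottleneck link.
EDGES = [("S", "u"), ("S", "v"), ("u", "dc1"), ("v", "dc2"),
         ("u", "w"), ("v", "w"), ("w", "x"), ("x", "dc1"), ("x", "dc2")]


def max_flow(edges, source, sink):
    """Edmonds-Karp on a unit-capacity directed graph."""
    cap, adj = {}, {}
    for a, b in edges:
        cap[(a, b)] = cap.get((a, b), 0) + 1
        cap.setdefault((b, a), 0)  # residual (reverse) edge
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in parent and cap[(node, nxt)] > 0:
                    parent[nxt] = node
                    queue.append(nxt)
        if sink not in parent:
            return flow
        node = sink
        while parent[node] is not None:  # push one unit along the path
            prev = parent[node]
            cap[(prev, node)] -= 1
            cap[(node, prev)] += 1
            node = prev
        flow += 1


def butterfly_multicast(a, b):
    """What each data collector decodes when w forwards a XOR b."""
    mixed = a ^ b
    dc1 = (a, a ^ mixed)  # dc1 hears a directly and the mix via x
    dc2 = (b ^ mixed, b)  # dc2 hears b directly and the mix via x
    return dc1, dc2


print(max_flow(EDGES, "S", "dc1"), max_flow(EDGES, "S", "dc2"))
print(butterfly_multicast(0b1010, 0b0110))
```

With routing alone the link w -> x can carry only one of the two packets, so one collector would miss a packet. The XOR on that link is what achieves the min-cut for both at once.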

4,2 MDS code is a multicasting NC 34 a b a b a+b a+2b S data collector data collector data collector data collector if all dc’s get both a,b, they can reconstruct the data object ∞ ∞ ∞ ∞

wait a minute 35 a b a b a+b a+2b S data collector can I just recover from a+b only? ∞

adding storage links 36 a b a b a+b a+2b S data collector ∞ b a+b a+2b a capacity= storage of node α

Let’s go back to this example. a b c d a+c b+d b+c a+b+d a? b? 1GB M=? α=? β=? d=?

Let’s go back to this example. a b c d a+c b+d b+c a+b+d a? b? 1GB M=4GB α=2GB β=? d=3

adding storage links 39 a b a b a+b a+2b S b a+b a+2b a capacity= storage of node α α =2 GB bb β β β data collector ∞ ∞

40 Proof idea: Information flow graph. [Diagram: source S; storage nodes a, b, c, d with storage α = 2 GB each; newcomer e downloading β from each of three surviving nodes; data collectors attached by links of capacity ∞.] Cut: 2 + 2β ≥ 4 GB ⇒ β ≥ 1 GB. Total repair comm. ≥ 3 GB.

41 Proof sketch: reduction to multicasting. [Diagram: information flow graph with source S, storage nodes a, b, c, d, newcomer e and several data collectors.] Functional repair = multicasting on the information flow graph. Sufficient iff the minimum of the min cuts is at least the file size M. (Ahlswede et al., Koetter & Médard, Ho et al.)

42 Quiz: (5,3) MDS code, M = 1GB (storing a 1GB total file), k=3, n=5 (any 3 out of 5 must recover). 1 node lost. Newcomer can connect to d=4 nodes. Each node stores α = α(MSR), the minimum storage that still gives the any 3 out of 5 guarantee. What is the minimum repair bandwidth β?

quiz: Repairing a (5,3) MDS code 43 a b S b a capacity= storage of node α α =1/3 GB bb β β β data collector cc dd ee β

quiz 44 a b S b a capacity= storage of node α α =1/3 GB bb β β β data collector ∞ ∞ cc dd ee β ∞

quiz 45 a b S b a capacity= storage of node α α =1/3 GB bb β β β data collector ∞ ∞ cc dd ee β ∞ cut=1+β≥M

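quiz 46 [Diagram: information flow graph for the (5,3) code; storage links have capacity α = 1/3 GB, the newcomer downloads β from each of d=4 nodes, data collector links have capacity ∞.] cut = 2/3 + 2β ≥ M ⇒ 2/3 + 2β ≥ 1 ⇒ β ≥ 1/6 GB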

47 Ok, great, we can tolerate n-k disk failures without losing data. If we have 1 failure however, how do we rebuild the redundancy in a new disk? Naïve repair: send k blocks. Filesize B, B/k per block P1 P2 P3 P4 3’ The repair problem Do I need to reconstruct the Whole data object to repair one failure? Functional repair : 3’ can be different from 3. Maintains the any k out of n reliability property. Exact repair : 3’ is exactly equal to 3. Theorem: It is possible to functionally repair a code by communicating only As opposed to naïve repair cost of B bits. (Regenerating Codes)

Increasing storage reduces β 48 [Diagram: same information flow graph, with storage links of capacity α = 1/3+ε GB.] cut = 2/3 + 2ε + 2β ≥ M ⇒ 2/3 + 2ε + 2β ≥ 1 ⇒ β ≥ 1/6 − ε GB
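The cuts worked out on the quiz slides are instances of one expression. The sketch below assumes the standard cut-set bound from the regenerating-codes literature, sum over i = 0..k-1 of min(α, (d − i)β) ≥ M, which is not spelled out in the transcript; the function names are illustrative. Exact fractions are used so that the boundary cases compare cleanly.

```python
from fractions import Fraction


def min_cut(alpha, beta, k, d):
    """Worst-case cut seen by a data collector in the information flow graph."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def repair_feasible(M, alpha, beta, k, d):
    """Functional repair is possible iff the min cut is at least the file size."""
    return min_cut(alpha, beta, k, d) >= M


third, sixth = Fraction(1, 3), Fraction(1, 6)
print(repair_feasible(1, third, sixth, k=3, d=4))           # quiz answer
print(repair_feasible(1, third, Fraction(1, 7), k=3, d=4))  # beta too small
eps = Fraction(1, 30)
print(repair_feasible(1, third + eps, sixth - eps, k=3, d=4))
print(repair_feasible(4, 2, 1, k=2, d=3))                   # 4 GB example
```

For the quiz parameters the first two terms of the sum are capped by α and only the last term depends on β, which reproduces the slide's 2/3 + 2β ≥ 1.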

49 The infinite graph for Repair x1x1 α α α α α β d α β d α β d α β d data collector k data collector x2x2 … xnxn

50 Theorem 3 : for any (n,k) code, where each node stores α bits, repairs from d existing nodes and downloads d β=γ bits, the feasible region is piecewise linear function described as follows: Storage-Communication tradeoff

51 Storage-Communication tradeoff Min-Storage Regenerating code Min-Bandwidth Regenerating code α (D, Godfrey, Wu, Wainwright, Ramchandran, IT Transactions (2010) ) γ=βd
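For reference, the two extreme points named on this slide can be written out explicitly. The expressions below are the standard ones from the cited IT Transactions paper as best recalled here, not text recovered from the slide, with M the file size, α the per-node storage and γ = dβ the total repair download.

```latex
% Cut-set condition for an (n,k) code repaired from d helpers
\[
  \sum_{i=0}^{k-1} \min\{\alpha,\ (d-i)\beta\} \;\ge\; M
\]
% Minimum-storage (MSR) and minimum-bandwidth (MBR) points
\[
  (\alpha_{\mathrm{MSR}},\ \gamma_{\mathrm{MSR}})
    = \left(\frac{M}{k},\ \frac{M d}{k\,(d-k+1)}\right),
  \qquad
  (\alpha_{\mathrm{MBR}},\ \gamma_{\mathrm{MBR}})
    = \left(\frac{2Md}{2kd-k^{2}+k},\ \frac{2Md}{2kd-k^{2}+k}\right)
\]
```

As a consistency check against the earlier slides: for the quiz (M = 1, k = 3, d = 4) the MSR point gives γ = 4/(3·2) = 2/3 GB, i.e. β = 1/6 GB per helper, and for the 4 GB example (k = 2, d = 3) it gives γ = 12/(2·2) = 3 GB.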

52 The information flow graph is a generic tool to analyze distributed storage problems. Example: Cooperative repair [Shum & Hu] Cooperative repair

overview 53 Storing information using codes. The repair problem Exact Repair. The state of the art. Interference Alignment Different repair metrics Storage Allocation problems

54 Key problem: Exact repair a b c d e =a From Theorem 1, an (n,k) MDS code can be repaired by communicating What if we require perfect reconstruction? ? ? ?

55 Exact Repair-(4,2) example x1 x3 x2 x4 x1+x3 x2+x4 x1+2x3 2x2+3x4 x1? x2? x1+x2+x3+x x x2+x3+x x3+x4 (Wu and D., ISIT 2009) Exact repair of the first node Trivial by communicating 4 blocks Can be done with 3?

56 Repair vs Exact Repair. [Diagram: the infinite information flow graph for repair, with a newcomer requesting x1.] Functional repair = multicasting. Exact repair = multicasting with intermediate nodes having (overlapping) requests. Cut-set region might not be achievable. Linear codes might not suffice (Dougherty et al.)

57 Exact Storage-Communication tradeoff? α Exact repair feasible? γ=βd

58 For (n,k=2) E-MSR repair can match cutset bound. [WD ISIT’09] (n=5,k=3) E-MSR systematic code exists (Cullina,D,Ho, Allerton’09) For k/n <=1/2 E-MSR repair can match cutset bound [Rashmi, Shah, Kumar, Ramchandran (2010)] E-MBR for all n,k, for d=n-1 matches cut-set bound. [Suh, Ramchandran (2010) ] What is known about exact repair

59 What is known about exact repair. What can be done for high rates? Recently the symbol extension (S.E.) interference alignment technique (Cadambe, Jafar, Maleki) and independently (Suh, Ramchandran) was shown to approach the cut-set bound for E-MSR, for all (k,n,d). (However, it requires enormous field size and sub-packetization.) This shows that linear codes suffice to approach the cut-set region for exact repair, for the whole range of parameters. Tamo et al., Papailiopoulos et al. and Cadambe et al. presented the first constructions of high-rate exact regenerating codes at ISIT.

60 Min-Storage Regenerating code (no known practical codes for high rates) Min-Bandwidth Regenerating code (practical) α γ=βd E-MSR Point E-MBR Point Exact Storage-Communication tradeoff?

61 Min-Storage Regenerating code (no known practical codes for high rates) Min-Bandwidth Regenerating code (practical) α γ=βd E-MSR Point E-MBR Point Exact Storage-Communication tradeoff?

62 Min-Storage Regenerating code (no known practical codes for high rates) Min-Bandwidth Regenerating code (practical) α γ=βd E-MSR Point E-MBR Point Exact Storage-Communication tradeoff? The ouzo problem: Characterize exact repair tradeoff region

overview 63 Storing information using codes. The repair problem Exact Repair. The state of the art. Interference Alignment Different repair metrics Storage Allocation problems

64 Interference alignment. Imagine getting three linear equations in four variables. In general none of the variables is recoverable (only a subspace). A1 + 2A2 + B1 + B2 = y1; 2A1 + A2 + B1 + B2 = y2; B1 + B2 = y3. The coefficients of some variables lie in a lower dimensional subspace and can be canceled out. How to form codes that have multiple alignments at the same time?
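The three equations on this slide make the point concrete: B1 and B2 only ever appear as B1+B2, so the third equation cancels them and leaves two equations in A1, A2. The sketch below solves that reduced system over the rationals, which is an assumption since the slide does not name the field; `recover_a` is an illustrative name.

```python
from fractions import Fraction


def recover_a(y1, y2, y3):
    """Recover (A1, A2) from the three aligned equations on the slide."""
    r1 = Fraction(y1 - y3)  # A1 + 2*A2 after cancelling B1 + B2
    r2 = Fraction(y2 - y3)  # 2*A1 + A2 after cancelling B1 + B2
    a1 = (2 * r2 - r1) / 3
    a2 = (2 * r1 - r2) / 3
    return a1, a2


def observe(a1, a2, b1, b2):
    """The three received equations y1, y2, y3."""
    return (a1 + 2 * a2 + b1 + b2, 2 * a1 + a2 + b1 + b2, b1 + b2)


print(recover_a(*observe(3, 5, 7, 11)))
print(recover_a(*observe(3, 5, 10, 8)))  # different B1, B2 with the same sum
```

B1 and B2 themselves stay unknown: any pair with the same sum produces identical observations, which is exactly the "only a subspace" remark.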

65 Exact Repair-(4,2) example x1 x3 x2 x4 x1+x3 x2+x4 x1+2x3 2x2+3x4 x1? x2? x1+x2+x3+x x x2+x3+x x3+x4 (Wu and D., ISIT 2009) Exact repair of the first node Trivial by communicating 4 blocks Can be done with 3?

v2v2 v3v3 v4v4 = = = Exact Repair-interference alignment

Exact Repair-interference alignment = = = [Cadambe-Jafar 2008, Cadambe-Jafar-Maleki-2010]

We want this full rank Exact Repair-interference alignment = = = Choose same V’ and V Make all A diagonal iid Want this in the span of V’

69 Exact Repair-interference alignment We have to choose V, V’ so that all the rows in Are contained in the rowspan of The T i matrices assumed iid diagonal, no assumption other than that they commute

70 Exact Repair-interference alignment We have to choose V, V’ so that all the rows in Are contained in the rowspan of Ok. Lets start by choosing V to be one vector w

Exact Repair-interference alignment And fold it back in… by repeating this ‘folding’ V and V’ overlap more and more.

A combinatorial view Look at the exponents of T 1,T 2 as lattice points [Papailiopoulos, D, Cadambe Allerton 2011]

A combinatorial view Look at the exponents of T 1,T 2 as lattice points

A combinatorial view Look at the exponents of T 1,T 2 as lattice points Overla p

A combinatorial view Cadambe-Jafar set V to be the interior of a hypercube

Open problem which set of dots overlaps with its two shifts maximally? (easy in 2D)

Connecting storage and wireless. Storage: given an error-correcting code, find the repair coefficients that reduce communication (over a field). Wireless: given some channel matrices, find the beamforming matrices that maximize the DoF (Cadambe and Jafar, Suh and Tse). Both problems reduce to rank minimization subject to full rank constraints. Polynomial reduction from one to the other. (Papailiopoulos & D., Asilomar 2010)

78 Storage codes through alignment techniques The symbol extension alignment technique of [Cadambe and Jafar] leads to exact regenerating codes Exact repair is a non-multicast problem where cut-set region is achievable but needs alignment. (unfortunately not practical) ergodic alignment should have a storage code equivalent? does real alignment have a finite-field equivalent?

overview 79 Storing information using codes. The repair problem Exact Repair. The state of the art. Interference Alignment Different repair metrics The road to practice Storage Allocation problems

80 Different metrics of interest. Many companies are investigating the use of erasure codes (Google, Microsoft, NetApp, Wuala, Cleversafe) since large amounts of data require higher reliability, especially for archival storage. Several metrics of interest: 1. Bits communicated for repair (network traffic generated). 2. Bits read for repairs (open). 3. Number of nodes used during a repair. [Locally decodable codes; recent work by Gopalan, Yekhanin et al.; Oggier et al.]

Locality of a code Example: (6,4)-MDS Code M = 4Mb file, M/k = 1 Mb per node Well… any k nodes can reconstruct everything data1 parity 1 parity 2 data2 data3 data4 data1 Lemma: An (n,k) MDS code has locality no less than k. 81 MDS Codes = worst locality

If we allow more storage, can we have i) high-rate, ii) the erasure property, iii) local and simple repairs? What is the Cost of Locality? 82 data1 parity 1 parity 2 data2data3data4 data1

Theorem 1 : (Locality, Storage) Locally Repairable Codes Theorem 2 : This is the optimal tradeoff between repair locality and storage 83

Simple example 84

(4,2) example n=4 nodes, each node stores 2 data packets and one fork (f=2). Any k=2 nodes can recover (even without using the forks)

(4,2) example n=4 nodes, each node stores 2 data packets and one fork (f=2). Any k=2 nodes can recover the file (f1,f2) (even without using the forks)

(4,2) example- exact repair ? ? ?

forks are used for local repairs Outer MDS codes used to provide the (n,k) safety Must ensure that a fork and its parents are stored in different nodes (nontrivial combinatorial placement problems).

File is Separated in m blocks A code (possibly MDS code) produces T blocks. Each coded block is stored in r=1.5 nodes. m Each storage node Stores d coded blocks. n Adjacency matrix of an expander graph. Every k right nodes are adjacent to m left nodes. + + General construction

File is Separated in m blocks An MDS code produces T blocks. Each coded block is stored in r nodes. m Each storage node Stores d coded blocks. n Adjacency matrix of an expander graph. Every k right nodes are adjacent to m left nodes. Claim: I can still do easy lookup repair. d packets lost + + General construction

General construction. File is separated in m blocks. An MDS code produces T blocks. Each coded block is stored in r nodes. Each of the n storage nodes stores d coded blocks. Adjacency matrix of an expander graph: every k right nodes are adjacent to m left nodes. Claim: I can still do easy lookup repair. d packets lost: 2d disk IO and communication. [Papailiopoulos et al., to be submitted]

overview 92 Storing information using codes. The repair problem Exact Repair. The state of the art. Interference Alignment Different repair metrics The road to practice Storage Allocation problems

the road to practical use 93 [Hu, Yu, Li, Lee, Lui, CUHK Network Coding File System (NetCod 2011)]

the road to practical use 94 [Hu, Yu, Li, Lee, Lui, CUHK Network Coding File System (NetCod 2011)]

95 Hadoop Mapreduce Yahoo created an open-source version of GFS (called Hadoop Distributed File System HDFS) Plus an analytics infrastructure. Together the software is called Apache Hadoop Mapreduce Hundreds of companies are using Hadoop, tens of startups are developing tools for Hadoop. ( BigData ) It is open source, free, changing the world.

code design RS p1 p2 p3 p4 10 x1 + + x2 + + x3 + + Local XORs allow single block recovery by transferring only 5 blocks (320MB) instead of 10 blocks (640 MB in HDFS RAID). 17 total blocks stored Storage overhead increased to 1.7x from 1.4x

code design RS p1 p2 p3 p4 10 x1 + + x2 + + x3 + + Local XORs can be any local linear combinations (just invert in repair) Choose coefficients so that x1+x2=x3 (interference alignment) Do not store x3!

code design RS p1 p2 p3 p4 10 x1 + + x2 + + x p2

code we implemented in HDFS RS p1 p2 p3 p4 10 x1 + + x2 + + x3 + + Single block failures can be repaired by accessing 5 blocks. (vs 10) Stores 16 blocks 1.6x Storage overhead vs 1.4x in HDFS RAID. Implemented this in Hadoop (system available on github/madiator)
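The local repair idea can be sketched with plain XOR parities. The assumptions here are that the 10 data blocks are split into two groups of 5 and that the local coefficients are all 1; the slides note the real code may use other local linear combinations, and repair of the RS parity blocks through the implied parity x3 is not modelled. All names are illustrative.

```python
from functools import reduce

GROUP_SIZE = 5


def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def local_parities(data):
    """One XOR parity per group of GROUP_SIZE consecutive data blocks."""
    return [xor_blocks(data[s:s + GROUP_SIZE])
            for s in range(0, len(data), GROUP_SIZE)]


def repair_data_block(lost, surviving, parities):
    """Rebuild one data block from its group; returns (block, blocks_read)."""
    group = lost // GROUP_SIZE
    start = group * GROUP_SIZE
    helpers = [surviving[i] for i in range(start, start + GROUP_SIZE) if i != lost]
    helpers.append(parities[group])
    return xor_blocks(helpers), len(helpers)


def overhead(stored_blocks, data_blocks=10):
    """Storage overhead: stored blocks per data block."""
    return stored_blocks / data_blocks


data = [bytes([i + 1] * 8) for i in range(10)]
parities = local_parities(data)
surviving = {i: block for i, block in enumerate(data) if i != 7}
block, reads = repair_data_block(7, surviving, parities)
print(block == data[7], reads)
print(overhead(10 + 4 + 3), overhead(10 + 4 + 2), overhead(10 + 4))
```

The last line reproduces the overheads quoted on the slides: 17 stored blocks when x3 is stored, 16 when it is implied by x1+x2, and 14 for plain HDFS RAID.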

Java implementation 100

Some experiments 101

102 Some experiments 100 machines on Amazon ec2 50 machines running HDFS RAID (facebook version, (14,10) Reed Solomon code ) 50 running our version USC3XOR HDFS 3XOR Regenerating code 50 files uploaded on system, 640MB per file Killing nodes and measuring network traffic, disk IO, CPU, etc during node repairs.

103

104

105

what we observe 106 New storage code reduces bytes read by roughly 2.6x. Network bandwidth reduced by approximately 2x. We use 14% more storage. Similar CPU. In several cases 30-40% faster repairs. A larger-scale study is ongoing. Provides four more zeros of data availability compared to replication. Gains can be much more significant if larger codes are used (e.g. for archival storage systems).

107 Conclusions and open problems. There are several theoretical open problems in coding for distributed storage. Exact repair region? (12 bottles of ouzo) Repairing codes with a small finite field limit? Dealing with bit-errors (security) and privacy? Network topology awareness (same rack/data center)? Disk read bounds, locally repairable codes? Also there seems to be significant potential for use in real systems (especially for large archival storage or across data centers).

108 Coding for Storage wiki

109 fin

A Storage allocation problem

Allocations for one object0.1

Problem Description Can be generalized to other node failure models Nonconvex problem. Harder than it looks.

Allocations for one objectA B C

Distributed storage allocations. Symmetric allocations can be suboptimal: given n = 5 storage nodes, budget T = 12/5, and p = 0.9, the nonsymmetric allocation performs better than the optimal symmetric allocation. Finding the optimal symmetric allocation is also nontrivial.

Distributed storage allocations. Results can be obtained for different access models. For the iid model: Theorem: maximal spreading x_i = T/n, for all i in [1,n], has asymptotically zero gap from optimality if Tp>1. Conjecture: there is a phase transition from minimal spreading to maximal spreading being optimal, as n grows. (Leong, D., Ho, NetCod 2009, ICC, Globecom 2010)
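The n = 5, T = 12/5, p = 0.9 claim can be explored by brute force. The model assumed here is not spelled out in the transcript: the object has unit size, each node is accessed independently with probability p, and recovery succeeds when the accessed nodes together hold at least 1 unit. The nonsymmetric allocation (3/5, 3/5, 2/5, 2/5, 2/5) is one candidate found by hand and is not necessarily the one shown on the slide.

```python
from fractions import Fraction
from itertools import product


def recovery_probability(allocation, p):
    """P(total storage on the accessed nodes >= 1) under iid access."""
    total = Fraction(0)
    for accessed in product([0, 1], repeat=len(allocation)):
        stored = sum(x for x, hit in zip(allocation, accessed) if hit)
        if stored >= 1:
            prob = Fraction(1)
            for hit in accessed:
                prob *= p if hit else 1 - p
            total += prob
    return total


def symmetric(n, T, m):
    """Spread the budget T evenly over m of the n nodes."""
    return [T / m] * m + [Fraction(0)] * (n - m)


n, T, p = 5, Fraction(12, 5), Fraction(9, 10)
best_symmetric = max(recovery_probability(symmetric(n, T, m), p)
                     for m in range(1, n + 1))
candidate = [Fraction(3, 5)] * 2 + [Fraction(2, 5)] * 3
print(float(best_symmetric), float(recovery_probability(candidate, p)))
```

Under these assumptions the best symmetric choice spreads over 4 nodes, and the hand-picked nonsymmetric allocation does slightly better, in line with the slide's statement.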

Storage allocations and combinatorics 116 The storage allocation problem was recently shown to be equivalent to an old conjecture by Erdos on uniform hypergraphs. A storage counterexample from Leong,D,Ho turns out to be a counterexample to the strong fractional Erdos Conjecture on uniform hypergraphs (Alon et al. 2012).

117 On-going implementations. Network Coding File System (CUHK): new file system over FUSE; uses exact MBR codes by Rashmi et al.; open source, available. Our own implementation over Hadoop (HDFS RAID): implementing locally repairable codes; Java open source implementation, easy to add new codes and experiment.

118 Open Problems in distributed storage. Cut-set region matches exact repair region? Repairing codes with a small finite field limit? Dealing with bit-errors (security) and privacy? (Dikaliotis, D., Ho, ISIT’10) What is the role of (non-trivial) network topologies? Cooperative repair (Shum et al.). Lookup repair region? Disk IO region? What are the limits of interference alignment techniques? Repairing existing codes used in storage (e.g. EvenOdd, B-Code, Reed-Solomon etc.)? Real world implementation, benefits over HDFS for Mapreduce? Archival storage, storage in Flash SSDs, cloud storage?

overview 119 Storing information using codes. The repair problem Exact Repair. The state of the art. The role of Interference Alignment Future directions: security through coding

120 coding allows secret sharing a b c d Four coded blocks are stored in four different cloud storage providers Any two can be used to recover the data Any cloud storage provider knows nothing about the data. [Shamir, Blakley 1979] Distributed coding theory problems?
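A two-out-of-four scheme of the kind cited on this slide can be sketched with a degree-1 polynomial, in the style of Shamir. The prime, the share format and the function names below are choices made for illustration; the slide does not specify a construction. `pow(x, -1, P)` needs Python 3.8 or later.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime; all arithmetic is in the field GF(P)


def make_shares(secret, n=4):
    """Shares are points (x, f(x)) on the random line f(x) = secret + r*x mod P."""
    r = secrets.randbelow(P)
    return [(x, (secret + r * x) % P) for x in range(1, n + 1)]


def recover(share1, share2):
    """Interpolate the line through two shares and return f(0)."""
    (x1, y1), (x2, y2) = share1, share2
    slope = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    return (y1 - slope * x1) % P


shares = make_shares(123456789)
print(recover(shares[0], shares[3]), recover(shares[2], shares[1]))
```

A single share is the secret plus a uniformly random multiple of a nonzero x, so on its own it is uniformly distributed and reveals nothing about the secret, while any two shares pin down the line.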

121 Security during Repair ? a b c e Incorrect linear equations d Repair bandwidth in the presence of byzantine adversaries?

122 Exact Repair-(4,2) example x1 x3 x2 x4 x1+x3 x2+x4 x1+2x3 2x2+3x4 x1? x2? x1+x2+x3+x x x2+x3+x x3+x4 (Wu and D., ISIT 2009)

The ring code. n=5, k=3. Any 3 nodes must suffice to recover the data. Set x5 = x1+x2+x3+x4. Not an MDS code: its rate is 1/2 × 4/5 = 2/5, lower than k/n = 3/5.

The ring code 124 n=5 k=3 Any 3 nodes know m=4 packets. An MDS code produces T=5 blocks. Each coded block is stored in r=2 nodes.

The ring code 125 An MDS code produces T blocks. m=4 n=5

The ring code: lookup repair. n=5, k=3. Node 1 fails: just read from d=2 other nodes. Total disk IO is proportional to d, so we want d to be small.
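The ring code slides can be replayed end to end. The diagram is missing from the transcript, so the placement below is an assumption: node i stores coded blocks i and i+1 around a ring of 5, which matches r=2 copies per block. "+" is taken to be XOR on integers, and all names are illustrative.

```python
from itertools import combinations

N = 5  # storage nodes, and also T = 5 coded blocks


def encode(x1, x2, x3, x4):
    """(5,4) single-parity outer code: x5 = x1 + x2 + x3 + x4."""
    return [x1, x2, x3, x4, x1 ^ x2 ^ x3 ^ x4]


def place(blocks):
    """Node i keeps coded blocks i and i+1 (indices mod 5)."""
    return {node: {node: blocks[node], (node + 1) % N: blocks[(node + 1) % N]}
            for node in range(N)}


def decode(nodes, storage):
    """Recover x1..x4 from the blocks held by the given nodes."""
    seen = {}
    for node in nodes:
        seen.update(storage[node])
    missing = [j for j in range(N) if j not in seen]
    if len(missing) > 1:
        raise ValueError("fewer than 4 distinct coded blocks")
    if missing:  # the five blocks XOR to zero, so one gap can be filled
        value = 0
        for v in seen.values():
            value ^= v
        seen[missing[0]] = value
    return [seen[j] for j in range(4)]


def lookup_repair(failed, storage):
    """Copy the two lost blocks from the two ring neighbours: no decoding."""
    left, right = (failed - 1) % N, (failed + 1) % N
    nxt = (failed + 1) % N
    rebuilt = {failed: storage[left][failed], nxt: storage[right][nxt]}
    return rebuilt, [left, right]


storage = place(encode(11, 22, 33, 44))
print(all(decode(trio, storage) == [11, 22, 33, 44]
          for trio in combinations(range(N), 3)))
print(lookup_repair(0, storage))
```

Any 3 nodes on the ring cover at least 4 distinct coded blocks, which is why the (5,4) outer code is enough, and repair is a plain copy from d=2 neighbours.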