Double Regenerating Codes for Hierarchical Data Centers

Slides:



Advertisements
Similar presentations
Analysis and Construction of Functional Regenerating Codes with Uncoded Repair for Distributed Storage Systems Yuchong Hu, Patrick P. C. Lee, Kenneth.
Advertisements

current hadoop architecture
Alex Dimakis based on collaborations with Dimitris Papailiopoulos Arash Saber Tehrani USC Network Coding for Distributed Storage.
SDN + Storage.
Henry C. H. Chen and Patrick P. C. Lee
1 NCFS: On the Practicality and Extensibility of a Network-Coding-Based Distributed File System Yuchong Hu 1, Chiu-Man Yu 2, Yan-Kit Li 2 Patrick P. C.
BASIC Regenerating Codes for Distributed Storage Systems Kenneth Shum (Joint work with Minghua Chen, Hanxu Hou and Hui Li)
Self-repairing Homomorphic Codes for Distributed Storage Systems [1] Tao He Software Engineering Laboratory Department of Computer Science,
Coding and Algorithms for Memories Lecture 12 1.
Yuchong Hu1, Henry C. H. Chen1, Patrick P. C. Lee1, Yang Tang2
Multicut Lower Bounds via Network Coding Anna Blasiak Cornell University.
1 Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters Runhui Li, Patrick P. C. Lee, Yuchong Hu The Chinese University of Hong Kong.
Beyond the MDS Bound in Distributed Cloud Storage
A Server-less Architecture for Building Scalable, Reliable, and Cost-Effective Video-on-demand Systems Jack Lee Yiu-bun, Raymond Leung Wai Tak Department.
Amplicon-Based Quasipecies Assembly Using Next Generation Sequencing Nick Mancuso Bassam Tork Computer Science Department Georgia State University.
1 Network Coding: Theory and Practice Apirath Limmanee Jacobs University.
Distributed Algorithms for Secure Multipath Routing
1 Data Persistence in Large-scale Sensor Networks with Decentralized Fountain Codes Yunfeng Lin, Ben Liang, Baochun Li INFOCOM 2007.
A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation Yinlong Xu University of Science and Technology of.
Duality Lecture 10: Feb 9. Min-Max theorems In bipartite graph, Maximum matching = Minimum Vertex Cover In every graph, Maximum Flow = Minimum Cut Both.
Network Coding Project presentation Communication Theory 16:332:545 Amith Vikram Atin Kumar Jasvinder Singh Vinoo Ganesan.
Redundant Data Update in Server-less Video-on-Demand Systems Presented by Ho Tsz Kin.
A Server-less Architecture for Building Scalable, Reliable, and Cost-Effective Video-on-demand Systems Presented by: Raymond Leung Wai Tak Supervisor:
Cooperative regenerating codes for distributed storage systems Kenneth Shum (Joint work with Yuchong Hu) 22nd July 2011.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds
Network Coding for Distributed Storage Systems IEEE TRANSACTIONS ON INFORMATION THEORY, SEPTEMBER 2010 Alexandros G. Dimakis Brighten Godfrey Yunnan Wu.
Network Coding Distributed Storage Patrick P. C. Lee Department of Computer Science and Engineering The Chinese University of Hong Kong 1.
Repairable Fountain Codes Megasthenis Asteris, Alexandros G. Dimakis IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 32, NO. 5, MAY /5/221.
22/07/ The MDS Scaling Problem for Cloud Storage Yu-chong Hu Institute of Network Coding.
Min Xu1, Yunfeng Zhu2, Patrick P. C. Lee1, Yinlong Xu2
Network Coding and Information Security Raymond W. Yeung The Chinese University of Hong Kong Joint work with Ning Cai, Xidian University.
1 Network Coding and its Applications in Communication Networks Alex Sprintson Computer Engineering Group Department of Electrical and Computer Engineering.
A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes Yunfeng Zhu 1, Patrick P. C. Lee 2, Liping Xiang 1, Yinlong.
Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding Yuchong Hu, Yinlong Xu, Xiaozhao Wang, Cheng Zhan and Pei.
Erasure Coding for Real-Time Streaming Derek Leong and Tracey Ho California Institute of Technology Pasadena, California, USA ISIT
Analyzing the Vulnerability of Superpeer Networks Against Attack Niloy Ganguly Department of Computer Science & Engineering Indian Institute of Technology,
1 Enabling Efficient and Reliable Transitions from Replication to Erasure Coding for Clustered File Systems Runhui Li, Yuchong Hu, Patrick P. C. Lee The.
Maximum Flow Problem (Thanks to Jim Orlin & MIT OCW)
15.082J and 6.855J March 4, 2003 Introduction to Maximum Flows.
Exact Regenerating Codes on Hierarchical Codes Ernst Biersack Eurecom France Joint work and Zhen Huang.
20/10/ Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding Yuchong Hu Institute of Network Coding Please.
A Fast Repair Code Based on Regular Graphs for Distributed Storage Systems Yan Wang, East China Jiao Tong University Xin Wang, Fudan University 1 12/11/2013.
Coding and Algorithms for Memories Lecture 13 1.
Dynamic Background Learning through Deep Auto-encoder Networks Pei Xu 1, Mao Ye 1, Xue Li 2, Qihe Liu 1, Yi Yang 2 and Jian Ding 3 1.University of Electronic.
Compression for Fixed-Width Memories Ori Rottenstriech, Amit Berman, Yuval Cassuto and Isaac Keslassy Technion, Israel.
Secret Sharing in Distributed Storage Systems Illinois Institute of Technology Nexus of Information and Computation Theories Paris, Feb 2016 Salim El Rouayheb.
1 Chapter 6 Reformulation-Linearization Technique and Applications.
1 Chapter 5 Branch-and-bound Framework and Its Applications.
Network Topology Single-level Diversity Coding System (DCS) An information source is encoded by a number of encoders. There are a number of decoders, each.
A Tale of Two Erasure Codes in HDFS
IERG6120 Lecture 22 Kenneth Shum Dec 2016.
Steve Ko Computer Sciences and Engineering University at Buffalo
Steve Ko Computer Sciences and Engineering University at Buffalo
A Simulation Analysis of Reliability in Erasure-coded Data Centers
Partial-Parallel-Repair (PPR):
Repair Pipelining for Erasure-Coded Storage
Presented by Haoran Wang
Injong Rhee ICMCS’98 Presented by Wenyu Ren
A Server-less Architecture for Building Scalable, Reliable, and Cost-Effective Video-on-demand Systems Raymond Leung and Jack Y.B. Lee Department of Information.
Symmetric Allocations for Distributed Storage
Chap 9. General LP problems: Duality and Infeasibility
Xiaoyang Zhang1, Yuchong Hu1, Patrick P. C. Lee2, Pan Zhou1
A Fusion-based Approach for Tolerating Faults in Finite State Machines
3.5 Minimum Cuts in Undirected Graphs
On Sequential Locally Repairable Codes
CMPE 252A : Computer Networks
Image Coding and Compression
Flow Feasibility Problems
Xiaolu Li†, Runhui Li†, Patrick P. C. Lee†, and Yuchong Hu‡
Presentation transcript:

Double Regenerating Codes for Hierarchical Data Centers Yuchong Hu1, Patrick P. C. Lee2, and Xiaoyang Zhang1 1Huazhong University of Science and Technology 2The Chinese University of Hong Kong yuchonghu@hust.edu.cn You know, in many data centers, cross-rack links are often congested? and most coding schemes are not specially designed to reduce the cross-rack traffic. So we will show how to design a new code to help reduce the cross-rack traffic significantly. This is our work. Double Regenerating Codes for Hierarchical Data Centers.

Outline Background Motivation Lower Bound Code Construction Evaluation

Hierarchy in Data Center So What is the Hierarchy in Data Center? Let’s see what a datacenter is composed of.

Hierarchy in Data Center First, we have thousands of nodes.

Hierarchy in Data Center Second, the nodes are grouped into racks, and two nodes in one rack are connected to the same switch Rack 1 Rack 2 Rack 3

Hierarchy in Data Center Network core Rack 1 Rack 2 Rack 3 Third, switches are connected to the network core.

Hierarchy in Data Center Network core Rack 1 Rack 2 Rack 3 Here, the hierarchy in Data Center refers to the link hierarchy. Link hierarchy

Hierarchy in Data Center Network core Rack 1 Rack 2 Rack 3 Which means, compared to the sufficient intra-rack bandwidth, Link hierarchy: Sufficient intra-rack link

Hierarchy in Data Center Network core Rack 1 Rack 2 Rack 3 In some bad cases, the cross-rack bandwidth per node is a small fraction of the intra-rack bandwidth. Link hierarchy: Sufficient intra-rack link Scarce cross-rack link the cross-rack capacity available per node in the worst case is only 1/5 to 1/20 of the intra-rack capacity [1,3,20].

cross-rack bandwidth Many studies aim to reduce it: Chowdhury et. al., Sigcomm 11 Broadcast transfer Chowdhury et. al., Sigcomm 13 Replica placement Li et.al., DSN 15 Transitions from Replication to Erasure Coding Jalaparti et.al., Sigcomm 15 Scheduling for Data-Parallel Jobs Many studies aim to reduce to cross-rack bandwidth. But they focus on different problems

cross-rack bandwidth Many studies aim to reduce it: Failure repairing Chowdhury et. al., Sigcomm 11 Broadcast transfer Chowdhury et. al., Sigcomm 13 Replica placement Li et.al., DSN 15 Transitions from Replication to Erasure Coding Jalaparti et.al., Sigcomm 15 Scheduling for Data-Parallel Jobs In this paper, we also aim to reduce the cross-rack bandwidth, but focus on the failure repairing problem. Failure repairing

Failures are common As we know, failures are common in data centers.

Redundancy Redundancy replication Erasure coding And one solution is to adopt redundancy. There are two kinds of common techniques for data redundancy: replication and erasure coding.

Redundancy Redundancy replication Erasure coding Erasure coding has much less storage overhead relative to replication.

Inefficient Repair for EC Erasure codes Repair bandwidth= M File of size M A Node 1 A B Node 2 B B Node 3 A+B A A+B Node 4 A+2B However, erasure coding may cause an inefficient repair. For example, to repair the lost fragment in Node 1, the repair bandwidth, which means the total amount of transferred traffic for repair, has to be two fragments.

Regenerating Codes [Dimakis et al., TIT’10] Erasure codes Repair bandwidth= M File of size M A Node 1 A B Node 2 B B Node 3 A+B A A+B Node 4 A+2B Regenerating codes [5] Repair bandwidth= 0.75M A File of size M Node 1 A B To reduce the repair bandwidth, Dimakis and others proposed RC Regenerating Codes remains challenging in terms of cross-rack repair bandwidth. Next we will give use an example to motivate our work. B C D Node 2 C D C Node 3 A+C B+D A A+C B Node 4 A+D A+B+C B+C+D

Outline Background Motivation Lower Bound Code Construction Evaluation

Observation: n = 6, k = 3, file size = M [Rashmi et al., Sigcomm 15] Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M We consider the state of art of minimum storage regenerating codes. We can see that each node is located in a different rack, in order to maximize the tolerance to node failures and rack failures. observe that with MSR codes, that are minimum storage regenerating codes, Cross-rack Repair Bandwidth is five ninth of M. Note that, MSR codes instead of placing one fragment per rack, we can place multiple fragments per rack (while we still keep one fragment per node), and exploit inner-rack regeneration to reduce the cross-rack repair bandwidth. MSR codes [15]

Failures are common

Failures are common Then we can find that rack failures are much less frequent than node failures in practice

Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M So what if we just tolerate only a single rack failure instead of tolerating multiple rack failures, MSR codes

Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M It is kind of like this. we can place multiple nodes in each rack. MSR codes

Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M Cross-rack Repair Bandwidth = 1/3 M In this example we can design our codes which reduces the cross-rack repair bandwidth to one third of M. MSR codes

Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M Cross-rack Repair Bandwidth = 1/3 M 40% repair bandwidth reduced! MSR codes

Observation: n = 6, k = 3, file size = M Cross-rack Repair Bandwidth = 5/9 M Cross-rack Repair Bandwidth = 1/3 M The intuition is that we can use intra-rack bandwidth to reduce the cross-rack bandwidth. The main idea is to perform regeneration twice: first within one rack and then across multiple racks. We call this double regeneration. 40% repair bandwidth reduced! MSR codes Double regeneration

Challenges How to design new double regenerating codes to minimize cross-rack repair bandwidth : Can we obtain the lower bound of cross-rack repair bandwidth in theory based on double regeneration? Can we design regenerating codes to match the lower bound? It is natural to ask two questions:

Outline Background Motivation Lower Bound Code Construction Evaluation

System Model X1,1 X1,2 β X2,1 Parameters: n=6, k=3, r=3 Rack 1 Parameters: n=6, k=3, r=3 Constraint: (6,3) MDS codes Goal: To minimize β (rack-to-rack repair bandwidth) Rack 2 To solve the above questions, we first give the system model. Three parameters: n ,k are familiar to us, I think. R is the number of racks. X1,1 refers to the first node of rack 1. When X11 fails, X underline 1,1 refers to a new node to repair the lost data of X11. Note that our goal is to minimize the cross-rack bandwidth \beta. Rack 3

Information Flow Graph X1,1 X1,2 X2,1 X2,2 X3,1 X3,2 β In order to analyze the system model, We construct an information flow graph to describe a data center network for a single-node repair.

Information Flow Graph X1,1 X1,2 X2,1 X2,2 X3,1 X3,2 β It is kind of like this. You can see the paper to see the details.

Information Flow Graph To get the lower bound of beta, we need to find all the cuts. Like this.

Lower Bound of β Based on network coding theory in distributed storage, we have lemma 2 that specifies the lower bound of beta

Lower Bound of β Main idea of Proof: Each min-cut needs to be at least M, which responds to one inequality. Reduce all inequalities to the result.

Lower Bound of β Main idea of Proof: Each min-cut needs to be at least M, which responds to one inequality. Reduce all inequalities to the result. And this is the lower bound of beta Necessary lower bound of β < Some min-cuts M Some data collectors can not reconstruct the file <

Outline Background Motivation Lower Bound Code Construction Evaluation

Code Construction Double Regenerating Codes (DRC): We propose DRC such that its optimal single-node repair satisfies and maintains the MDS property after a single-node repair. Based on Lemma 2, we propose DRC such beta is equal to the lower bound, and we will prove the existence of DRC.

Code Construction Double Regenerating Codes (DRC): We propose DRC such that its optimal single-node repair satisfies and maintains the MDS property after a single-node repair. Repair-Optimal Note that if our code construction matches the necessary lower bound. Code construction matching the necessary lower bound is correct the bound is tight; the code construction is repair-optimal

Code Construction Initialization

Code Construction Initialization …… …… …… X1,1 q blocks node q blocks X1,n/r …… rack The original File: M divide encode q blocks Xr,1 q×k blocks …… q blocks Xr,n/r Block size = M/qk Here, q is equal to r minus the floor of kr over n, so the block size is equal to M/qk. Those qk blocks are encoded to n nodes, each node has q blocks.

Code Construction Initialization …… …… …… X1,1 q blocks node q blocks X1,n/r …… rack The original File: M divide encode q blocks Xr,1 q×k blocks …… q blocks Xr,n/r Block size = M/qk X1,1 P1,1 (n, k, r) = (6, 3, 3), q=2 R1 X1,2 P1,2 This an example. We will use this example to specify the single node failure repairing. File A1 X2,1 A2 P2,1 divide encode A3 R2 A4 X2,2 P2,2 A5 A6 X3,1 P3,1 R3 X3,2 P3,2

Code Construction Single-node Repair: X1,1 X1,1 P1,1 P’1,1 Suppose that X1,1 fails, and we reconstruct a new fragment P′1,1 in a new node X1,1.

Code Construction Single-node Repair: X1,1 X1,1 P1,1 P’1,1 Stage 1: Xh,1 (where 2 ≤ h ≤ r) compute a new block, p′h, from all stored blocks in rack Rh , where ch denotes a coefficient vector: 1st regenerating In stage 1, this is the first regenerating of data from one rack

Code Construction Single-node Repair: X1,1 X1,1 P1,1 P’1,1 Stage 1: Xh,1 (where 2 ≤ h ≤ r) compute a new block, p′h, from all stored blocks in rack Rh , where ch denotes a coefficient vector: 1st regenerating X1,1 P1,1 R1 X1,2 P1,2 X2,1 P2,1 P’2 and p’3 are the result of the first regeneration p’2 =[P2,1;P2,2]c2 R2 X2,2 P2,2 X3,1 P3,1 p’3 =[P3,1;P3,2]c3 R3 X3,2 P3,2

Code Construction Single-node Repair: X1,1 X1,1 P1,1 P’1,1 Stage 2: X1,1 computes a new fragment P′1,1, from all the surviving blocks in rack R1 as well as all the p′h (where 2 ≤ h ≤ r), D is a coefficient matrix: 2nd regenerating In stage 2, this is the second regenerating of data from multiple racks

Code Construction Single-node Repair: X1,1 X1,1 P1,1 P’1,1 Stage 2: X1,1 computes a new fragment P′1,1, from all the surviving blocks in rack R1 as well as all the p′h (where 2 ≤ h ≤ r), D is a coefficient matrix: 2nd regenerating X1,1 X1,1 P1,1 P’1,1 R1 P’1,1 =[P1,1; p’2 ; p’3] D X1,2 P1,2 X2,1 p’2 P2,1 P’1,1 are the result of the second regeneration p’2 =[P2,1;P2,2]c2 R2 X2,2 P2,2 X3,1 p’3 P3,1 p’3 =[P3,1;P3,2]c3 R3 X3,2 P3,2

Code Construction Single-node Repair: β = block size= M/qk = X1,1 X1,1 Stage 2: X1,1 computes a new fragment P′1,1, from all the surviving blocks in rack R1 as well as all the p′h (where 2 ≤ h ≤ r), D is a coefficient matrix: 2nd regenerating X1,1 X1,1 P1,1 P’1,1 R1 P’1,1 =[P1,1; p’2 ; p’3] D X1,2 P1,2 β = block size= M/qk = X2,1 p’2 P2,1 We can find that the rack2rack cross-rack bandwidth beta is equal to the lower bound. So if this code construction does exist, then the code is repair-optimal and bound is tight. p’2 =[P2,1;P2,2]c2 R2 X2,2 P2,2 X3,1 p’3 P3,1 p’3 =[P3,1;P3,2]c3 R3 X3,2 P3,2

Existence Theorem 1. There exists a linear coding construction for DRC defined in the finite field F, such that the MDS property is still maintained after a single-node repair with a probability arbitrarily driven to 1 by increasing the field size of F. We now prove the existence of this code construction by Theorem 1,

Existence Theorem 1. There exists a linear coding construction for DRC defined in the finite field F, such that the MDS property is still maintained after a single-node repair with a probability arbitrarily driven to 1 by increasing the field size of F. Main idea of Proof: We can always tune encoding coefficients ch and D of the DRC construction, such that the MDS property can be maintained after no matter how many single-node repairs. Lemma 3: how to tune encoding coefficients ch and D. We now prove the existence of DRC by showing that it maintains the MDS property after a single-node repair.

Outline Background Motivation Lower Bound Code Construction Evaluation

Quantitative Comparisons RS: Reed-Solomon (RS) codes use the conventional repair, whose cross-rack repair bandwidth equals the original data size M. MSR[5]: The repair bandwidth of MSR codes is DRC: The minimum cross-rack repair bandwidth is (r−1)min{β} = IEC: Assume that the existence of ideal erasure codes (IEC), whose cross-rack repair bandwidth equals the lost data size of the failed node, i.e., M/k.

Quantitative Comparisons Observations: The cross-rack repair bandwidth of DRC is always less than that of MSR codes, by up to 45.5% (e.g., n = 12, k = 8, r = 4). DRC has the same cross-rack repair bandwidth as IEC in some cases. Figure 3 depicts the cross-rack repair bandwidths of different erasure codes versus the number of nodes n in terms of the percentage of the original data size M. The reason is that the “floor” function may keep ⌊ kr n ⌋ the same even if r becomes smaller.

Future Work Deterministic and Systematic Code Construction Further Reduce the intra-rack bandwidth Practical Performance More hierarchy Geographically Distributed Data Centers Multiple Clouds

Thank You!

Code Construction Initialization The original File: M divide q × k blocks encode Xh,i 1 ≤ h ≤ r 1 ≤ I ≤ n/r q blocks …… Initialization q blocks …… q blocks Block size = M/qk ph,i,j : column vector specifying the coefficients for the jth block of the ith fragment of the hth rack

Code Construction Initialization The original File: M divide q × k blocks encode Xh,i 1 ≤ h ≤ r 1 ≤ I ≤ n/r q blocks …… Initialization q blocks …… q blocks Block size = M/qk ph,i,j : column vector specifying the coefficients for the jth block of the ith fragment of the hth rack Ph,i : matrix comprising the column vectors {ph,i,j}1 ≤ j ≤ q specifying the coefficients for the ith fragment of the hth rack

Hierarchy in Data Center Network core Rack 1 Rack 2 Rack 3 Cross-rack bandwidth is considered to be a scarce resource Link hierarchy: Sufficient intra-rack link Scarce cross-rack link