Cooperative regenerating codes for distributed storage systems Kenneth Shum (Joint work with Yuchong Hu) 22nd July 2011
Multiple node failures
Large-scale storage system
– Google data center, example from Kannan’s talk.
– [count missing] servers, failure rate = 4% per year
– Repair in 2 days
– Mean number of failed servers in 2 days = 175.
The lazy-repair policy in TotalRecall
– A repair process is triggered only after the number of failed nodes has reached a certain threshold.
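The mean on this slide is a rate-times-window calculation. A minimal sketch of that arithmetic follows, assuming independent failures at a constant annual rate. The function name and the fleet size in the usage line are hypothetical; the slide's own server count did not survive extraction and is not reconstructed here.

```python
def expected_failures(n_servers: int, annual_failure_rate: float, window_days: float) -> float:
    """Mean number of failed servers in a repair window.

    Assumes independent failures at a constant rate over a 365-day year.
    """
    return n_servers * annual_failure_rate * window_days / 365.0


# Hypothetical fleet size, chosen only to exercise the function.
print(expected_failures(36_500, 0.04, 2))
```

With a 4% annual failure rate and a 2-day window, the mean grows linearly in the number of servers, which is why multiple simultaneous failures are the normal case at data-center scale.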
Jointly repair multiple failures
Hu et al. (JSAC, Feb 2010)
[Figure: storage nodes and newcomers, with data exchange among the newcomers.]
Can we further reduce the repair-bandwidth?
Distributed storage (erasure coding)
Wu, Dimakis, ISIT09
[Figure: four storage nodes holding ($A_1$, $A_2$), ($B_1$, $B_2$), ($A_1+B_1$, $2A_2+B_2$) and ($2A_1+B_1$, $A_2+B_2$); a data collector recovers $A_1$, $A_2$, $B_1$, $B_2$.]
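The example can be checked mechanically. The sketch below assumes the usual reading of this figure, namely that a data collector contacts any two of the four nodes, and it assumes arithmetic over the prime field GF(5); the slide does not state the field. Each stored packet is written as a coefficient vector over ($A_1$, $A_2$, $B_1$, $B_2$), and a pair of nodes is decodable when its four vectors have rank 4.

```python
from itertools import combinations

P = 5  # assumed prime field size; not stated on the slide

# Coefficient vectors over (A1, A2, B1, B2) for the two packets in each node.
NODES = {
    1: [(1, 0, 0, 0), (0, 1, 0, 0)],   # A1, A2
    2: [(0, 0, 1, 0), (0, 0, 0, 1)],   # B1, B2
    3: [(1, 0, 1, 0), (0, 2, 0, 1)],   # A1+B1, 2A2+B2
    4: [(2, 0, 1, 0), (0, 1, 0, 1)],   # 2A1+B1, A2+B2
}


def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p) by Gauss-Jordan elimination."""
    rows = [[v % p for v in r] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


# A pair of nodes suffices if its four packets are linearly independent.
recoverable = {
    pair: rank_mod_p(NODES[pair[0]] + NODES[pair[1]]) == 4
    for pair in combinations(NODES, 2)
}
print(recoverable)
```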
Naive repair
[Figure: the node storing ($A_1$, $A_2$) fails; the newcomer downloads $B_1$, $B_2$, $A_1+B_1$ and $2A_2+B_2$ and decodes $A_1$, $A_2$.]
4 packets required.
Repair with “code alignment”
[Figure: the node storing ($A_1$, $A_2$) fails; each of the three surviving nodes sends one packet: $B_1+B_2$, $A_1+2A_2+B_1+B_2$ and $2A_1+A_2+B_1+B_2$.]
Subtracting $B_1+B_2$ from the other two packets leaves $P_1 = A_1 + 2A_2$ and $P_2 = 2A_1 + A_2$.
Solve for $A_1$, $A_2$.
3 packets required.
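The three-packet repair can be simulated directly. This is a minimal sketch, again assuming GF(5) as the field (an assumption, not taken from the slide). The system $P_1 = A_1 + 2A_2$, $P_2 = 2A_1 + A_2$ has determinant $-3$, so the field must not have characteristic 3 for the newcomer to solve it.

```python
from itertools import product

P = 5  # assumed prime field; any prime other than 3 keeps the 2x2 system invertible


def repair_node1(a1, a2, b1, b2, p=P):
    """Regenerate (A1, A2) from one packet per surviving node."""
    # Each helper adds its two stored packets and sends the sum.
    from_node2 = (b1 + b2) % p                      # B1 + B2
    from_node3 = ((a1 + b1) + (2 * a2 + b2)) % p    # A1 + 2A2 + B1 + B2
    from_node4 = ((2 * a1 + b1) + (a2 + b2)) % p    # 2A1 + A2 + B1 + B2

    # The B-terms are aligned, so one subtraction removes them from both packets.
    p1 = (from_node3 - from_node2) % p              # A1 + 2A2
    p2 = (from_node4 - from_node2) % p              # 2A1 + A2

    # Cramer's rule for [[1, 2], [2, 1]]; determinant is 1 - 4 = -3.
    det_inv = pow(-3 % p, -1, p)
    a1_hat = (det_inv * (p1 - 2 * p2)) % p
    a2_hat = (det_inv * (p2 - 2 * p1)) % p
    return a1_hat, a2_hat


# Exhaustive check over all file contents in GF(5).
ok = all(
    repair_node1(a1, a2, b1, b2) == (a1, a2)
    for a1, a2, b1, b2 in product(range(P), repeat=4)
)
print(ok)
```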
Multiple failures, separate repair
[Figure: two nodes fail; each newcomer downloads 2 packets from the node storing ($B_1$, $B_2$) and 2 packets from the node storing ($2A_1+B_1$, $A_2+B_2$).]
4 packets per newcomer, 8 packets in total.
Multiple failures, cooperative repair (I)
[Figure: two newcomers are repaired jointly from the surviving nodes ($B_1$, $B_2$) and ($2A_1+B_1$, $A_2+B_2$); packet labels in the figure: $A_1, A_2$; $A_1+B_1$; $2A_2+B_2$; $B_1, B_2$.]
3 packets per newcomer, 6 packets in total.
Multiple failures, cooperative repair (II)
[Figure: two newcomers are repaired jointly from the surviving nodes ($B_1$, $B_2$) and ($2A_1+B_1$, $A_2+B_2$); packet labels in the figure: $B_1$, $B_2$, $2A_1+B_1$, $A_2+B_2$, $A_1$, $A_2$, $A_1+B_1$, $2A_2+B_2$.]
3 packets per newcomer, 6 packets in total.
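A six-packet cooperative repair can be traced step by step. The schedule below is one assignment consistent with the packet labels that survive on this slide; the exact arrows of the original figure are not recoverable, so treat it as an illustrative sketch. It assumes the failed nodes are the ones storing ($A_1$, $A_2$) and ($A_1+B_1$, $2A_2+B_2$), and it assumes arithmetic over GF(5).

```python
from itertools import product

P = 5  # assumed prime field; not stated on the slide


def cooperative_repair(a1, a2, b1, b2, p=P):
    """Jointly regenerate (A1, A2) and (A1+B1, 2A2+B2) with 6 transmitted packets."""
    node2 = (b1 % p, b2 % p)                           # surviving node: B1, B2
    node4 = ((2 * a1 + b1) % p, (a2 + b2) % p)         # surviving node: 2A1+B1, A2+B2
    sent = 0

    # Download phase: each newcomer takes one packet from each surviving node.
    x_b1, x_mix = node2[0], node4[0]                   # newcomer X: B1 and 2A1+B1
    y_b2, y_mix = node2[1], node4[1]                   # newcomer Y: B2 and A2+B2
    sent += 4

    x_a1 = ((x_mix - x_b1) * pow(2, -1, p)) % p        # X decodes A1
    y_a2 = (y_mix - y_b2) % p                          # Y decodes A2

    # Exchange phase: one packet in each direction between the newcomers.
    x_to_y = (x_a1 + x_b1) % p                         # X sends A1 + B1
    y_to_x = y_a2                                      # Y sends A2
    sent += 2

    new_first = (x_a1, y_to_x)                         # (A1, A2)
    new_third = (x_to_y, (2 * y_a2 + y_b2) % p)        # (A1+B1, 2A2+B2)
    return new_first, new_third, sent


all_ok = all(
    cooperative_repair(a1, a2, b1, b2)
    == ((a1, a2), ((a1 + b1) % P, (2 * a2 + b2) % P), 6)
    for a1, a2, b1, b2 in product(range(P), repeat=4)
)
print(all_ok)
```

Each newcomer receives 2 downloaded packets plus 1 exchanged packet, matching the slide's 3 packets per newcomer.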
Outline of the talk
Is it optimal in terms of repair-bandwidth?
What is the tradeoff between storage and repair-bandwidth for cooperative repair?
Can we achieve the Pareto-optimal operating points on the tradeoff curve by linear network coding?
– Exact repair
– Functional repair
Information flow graph
[Figure: source S; storage nodes In 1–In 5 and Out 1–Out 5; newcomers In 6, In 7 with Mid 6, Mid 7 and Out 6, Out 7; edge labels $\beta_1$ and $\beta_2$; data collector.]
Is this regenerating code optimal?
[Figure: the cooperative repair (II) example again.]
3 packets per newcomer, 6 packets in total.
First cut
[Figure: information flow graph with In 1–In 4, Out 1–Out 4, In 6, In 7, Mid 6, Mid 7, Out 6, Out 7 and the data collector; first cut marked.]
$B \le 4\beta_1$
Second cut
[Figure: information flow graph with two repair stages (In, Mid, Out 1–4) and the data collector; second cut marked.]
$B \le 2 + \beta_1 + \beta_2$
A linear programming problem
Minimize $2\beta_1 + \beta_2$ (repair bandwidth)
Subject to $4 \le 4\beta_1$, $4 \le 2 + \beta_1 + \beta_2$, $\beta_1, \beta_2 \ge 0$
[Figure: plot in the ($\beta_1$, $\beta_2$) plane.]
At least 3 packets.
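The bound of 3 packets can be reproduced numerically. The sketch below reads the slide's program as: minimize $2\beta_1 + \beta_2$ subject to $4\beta_1 \ge 4$, $2 + \beta_1 + \beta_2 \ge 4$ and $\beta_1, \beta_2 \ge 0$ (a reconstruction from the two cut bounds with $B = 4$). The solver is a small hypothetical helper, not a general LP routine: it enumerates intersections of constraint boundaries and assumes the minimum is attained at a vertex, which holds here because both cost coefficients are positive.

```python
from fractions import Fraction
from itertools import combinations


def solve_lp_2d(cost, constraints):
    """Minimize cost . x over x in R^2 subject to a . x >= b for each (a, b).

    Enumerates pairwise intersections of constraint boundaries and keeps the
    best feasible one. Assumes the optimum is attained at a vertex.
    """
    best = None
    for (a, b), (c, d) in combinations(constraints, 2):
        det = a[0] * c[1] - a[1] * c[0]
        if det == 0:
            continue  # parallel boundaries do not intersect in a point
        x = Fraction(b * c[1] - a[1] * d, det)
        y = Fraction(a[0] * d - b * c[0], det)
        if all(u[0] * x + u[1] * y >= v for u, v in constraints):
            value = cost[0] * x + cost[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best


constraints = [
    ((4, 0), 4),   # first cut:  4*beta1 >= B = 4
    ((1, 1), 2),   # second cut: 2 + beta1 + beta2 >= 4
    ((1, 0), 0),   # beta1 >= 0
    ((0, 1), 0),   # beta2 >= 0
]
gamma, beta1, beta2 = solve_lp_2d((2, 1), constraints)
print(gamma, beta1, beta2)
```

The optimum sits at $\beta_1 = \beta_2 = 1$ with objective 3, so the cooperative repair example with 3 packets per newcomer meets the bound.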
Non-homogeneous download traffic
[Figure: information flow graph with In 1–In 4, Out 1–Out 4, In 6, In 7, Mid 6, Mid 7, Out 6, Out 7 and the data collector; download edges labeled a, b, c, d.]
$B \le a + b + c + d$
Non-homogeneous traffic
[Figure: information flow graph with two repair stages (In, Mid, Out 1–4) and the data collector; edges labeled e, f, g, h, i, j.]
$B \le 2 + f + j$
$B \le 2 + h + i$
$B \le 2 + e + j$
$B \le 2 + g + i$
The same LP problem
[Objective and constraints not recoverable from the slide.]
At least 3 packets.
TRADEOFF BETWEEN STORAGE AND REPAIR-BANDWIDTH
Storage vs Repair-bandwidth
[Figure: storage versus repair-bandwidth curves for one-by-one repair and for repairing 3 newcomers jointly; file size = 420, d = 8, k = 4.]
(S., ICC 2011; Kermarrec, Le Scouarnec and Straub, Netcod 2011.)
Fair comparison?
[Figure: surviving nodes and newcomers under one-by-one repair and under cooperative repair.]
One-by-one repair: repair degree = 8; number of connections per newcomer = 8
Cooperative repair: number of connections per newcomer = 8 + 2
MBCR and MSCR
[Figure: tradeoff curves for one-by-one repair and cooperative repair.]
Minimum bandwidth cooperative repair (MBCR)
Minimum storage cooperative repair (MSCR)
How much can we improve?
[Figure: tradeoff curves for one-by-one repair and for repairing 10 newcomers jointly; file size = 2275, d = 30, k = 5.]
When d is large, joint repair does not have a significant advantage over one-by-one repair.
How much can we improve?
[Figure: tradeoff curves for one-by-one repair and for repairing 10 newcomers jointly; file size = 616, d = 8, k = 4.]
Repair-bandwidth reduction is more prominent when d is not so large.
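The two endpoints of the tradeoff curve can be evaluated for the parameter sets on these slides. The closed forms below are not printed on the slides; they are the MSCR and MBCR expressions as commonly stated in the cooperative-repair literature cited in this talk, so treat them as an assumption. They do reproduce the numbers the slides give for the two explicit constructions (5 packets per node for B = 8, k = d = 2, r = 2; storage 2 and repair bandwidth 4 for B = 6, k = d = 3, r = 2). Using r = 1 as the one-by-one baseline is also an assumption, since the slides raise the question of what a fair comparison is.

```python
from fractions import Fraction


def mscr_point(B, k, d, r):
    """(storage per node, repair bandwidth per newcomer) at minimum storage."""
    beta = Fraction(B, k * (d - k + r))      # beta1 = beta2 at this point
    return Fraction(B, k), (d + r - 1) * beta


def mbcr_point(B, k, d, r):
    """(storage per node, repair bandwidth per newcomer) at minimum bandwidth."""
    gamma = Fraction(B * (2 * d + r - 1), k * (2 * d - k + r))
    return gamma, gamma                      # storage equals repair bandwidth here


# Compare cooperative repair of r = 10 newcomers with the r = 1 baseline.
for B, k, d in [(2275, 5, 30), (616, 4, 8)]:
    for name, point in [("min-storage", mscr_point), ("min-bandwidth", mbcr_point)]:
        single = point(B, k, d, 1)[1]
        joint = point(B, k, d, 10)[1]
        saving = 1 - joint / single
        print(f"B={B} k={k} d={d} {name}: {float(single):.1f} -> {float(joint):.1f} "
              f"({float(saving):.1%} less)")
```

Under these assumptions the min-storage repair bandwidth drops from 525 to 507 for d = 30, and from 246.4 to 187 for d = 8, in line with the slide's remark that the reduction is more prominent when d is not so large.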
AN EXPLICIT CONSTRUCTION FOR MINIMUM-BANDWIDTH COOPERATIVE REPAIR
An explicit construction for MBCR
(S., Hu, ISIT 2011.)
[Figure: storage per node versus repair-bandwidth, with the minimum repair-bandwidth point marked.]
B = 8 information packets
n = 4 nodes
Each node stores 5 packets.
Repair r = 2 failures simultaneously
No. of connections for each DC = k = 2
No. of helpers for each failed node = d = 2
Require d = k, r = n – d
Min-Bandwidth point
[Figure: tradeoff curves for one-by-one repair and for repairing 2 new nodes cooperatively.]
Data Distribution
8 data packets: A, B, C, D, E, F, G, H
Node 1: A, B, C, D, F+G
Node 2: C, D, E, F, H+A
Node 3: E, F, G, H, B+C
Node 4: G, H, A, B, D+E
“+” denotes XOR.
5 packets per node: 4 systematic, 1 parity-check
Data collection
[Figure: the data collector connects to the nodes storing (A, B, C, D, F+G) and (E, F, G, H, B+C), reads A, B, C, D and E, F, G, H, and recovers A, B, C, D, E, F, G, H.]
Data collection
[Figure: the data collector connects to the nodes storing (A, B, C, D, F+G) and (C, D, E, F, H+A) and downloads A, B, C, F+G and D, E, F, H+A.]
The resulting linear system in A, B, C, D, E, F, G, H is triangular and full-rank.
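The full-rank claim can be verified for every pair of nodes, not only the two pairs drawn on the slides. In this sketch each stored packet is a bitmask over the 8 data packets and decodability is a rank computation over GF(2). Nodes are indexed 0 to 3 in the order they are listed on the data-distribution slide; the helper names are illustrative.

```python
from itertools import combinations

NAMES = "ABCDEFGH"

# Stored packets per node, in the order listed on the data-distribution slide.
LAYOUT = [
    ["A", "B", "C", "D", "F+G"],
    ["C", "D", "E", "F", "H+A"],
    ["E", "F", "G", "H", "B+C"],
    ["G", "H", "A", "B", "D+E"],
]


def mask(label: str) -> int:
    """Bitmask of the data packets XORed together in a stored packet."""
    m = 0
    for name in label.split("+"):
        m ^= 1 << NAMES.index(name)
    return m


def rank_gf2(vectors):
    """Rank over GF(2) of bitmask vectors, using an XOR basis keyed by leading bit."""
    basis = {}
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]
    return len(basis)


def can_decode(node_a: int, node_b: int) -> bool:
    """True if the 10 packets of two nodes span all 8 data packets."""
    packets = [mask(p) for p in LAYOUT[node_a] + LAYOUT[node_b]]
    return rank_gf2(packets) == len(NAMES)


decodable = {pair: can_decode(*pair) for pair in combinations(range(4), 2)}
print(decodable)
```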
Exact Repair
How to repair?
[Figures: two examples in which two failed nodes are regenerated jointly from the two surviving nodes, with packets exchanged between the newcomers; packet labels in the figures include A, B, C, D, E, F, G, H, F+G, B+C, D+E, H+A.]
Total repair-bandwidth = 10
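Two repair schedules with total bandwidth 10 can be simulated on random data. These schedules are one possible reading of the figures, whose arrows did not survive extraction, so they are a sketch rather than the slides' exact assignment. Nodes are numbered 1 to 4 in the order of the data-distribution slide; each newcomer downloads 2 packets from each of the 2 surviving nodes and receives 1 packet from the other newcomer.

```python
import random

LAYOUT = [
    ["A", "B", "C", "D", "F+G"],
    ["C", "D", "E", "F", "H+A"],
    ["E", "F", "G", "H", "B+C"],
    ["G", "H", "A", "B", "D+E"],
]


def build_nodes(data):
    """Node contents as dicts label -> value; 'X+Y' is the XOR of X and Y."""
    def value(label):
        parts = label.split("+")
        return data[parts[0]] ^ data[parts[1]] if len(parts) == 2 else data[label]
    return [{label: value(label) for label in row} for row in LAYOUT]


def repair_nodes_1_and_3(nodes):
    """Nodes 1 and 3 fail; nodes 2 and 4 help."""
    n2, n4 = nodes[1], nodes[3]
    new1 = {"C": n2["C"], "D": n2["D"], "A": n4["A"], "B": n4["B"]}   # 4 downloads
    new3 = {"E": n2["E"], "F": n2["F"], "G": n4["G"], "H": n4["H"]}   # 4 downloads
    new1["F+G"] = new3["F"] ^ new3["G"]    # exchange: newcomer 3 sends F+G
    new3["B+C"] = new1["B"] ^ new1["C"]    # exchange: newcomer 1 sends B+C
    return new1, new3, 4 + 4 + 2


def repair_nodes_1_and_2(nodes):
    """Nodes 1 and 2 fail; nodes 3 and 4 help, sending some computed sums."""
    n3, n4 = nodes[2], nodes[3]
    # Newcomer 1: B+C and F+G from node 3, A and B from node 4.
    new1 = {"A": n4["A"], "B": n4["B"], "F+G": n3["F"] ^ n3["G"]}
    new1["C"] = n3["B+C"] ^ new1["B"]
    # Newcomer 2: E and F from node 3, D+E and H+A from node 4.
    new2 = {"E": n3["E"], "F": n3["F"], "H+A": n4["H"] ^ n4["A"]}
    new2["D"] = n4["D+E"] ^ new2["E"]
    # Exchange: newcomer 1 sends C, newcomer 2 sends D.
    new1["D"], new2["C"] = new2["D"], new1["C"]
    return new1, new2, 4 + 4 + 2


rng = random.Random(0)
data = {name: rng.randrange(256) for name in "ABCDEFGH"}
nodes = build_nodes(data)
print(repair_nodes_1_and_3(nodes)[2], repair_nodes_1_and_2(nodes)[2])
```

Both schedules restore exactly the lost contents, which is what distinguishes exact repair from functional repair.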
Min-Bandwidth point
[Figure: tradeoff curves for one-by-one repair and for repairing 2 new nodes cooperatively.]
AN EXPLICIT CONSTRUCTION FOR MINIMUM-STORAGE COOPERATIVE REPAIR
An explicit construction for MSCR
(S., ICC 2011.)
[Figure: storage per node versus repair-bandwidth.]
B = 6 information packets
n nodes
Each node stores 2 packets.
Repair r = 2 failures simultaneously
No. of connections for each DC = k = 3
No. of helpers for each failed node = d = 3
Require d = k
The min-storage point
[Figure: tradeoff curves for non-cooperative and cooperative repair; k = 3, d = 3, r = 2, B = 6.]
Storage cost per node = 2
Repair bandwidth per node = 4
Data retrieval
[Figure: source data → encode with an MDS code with dimension k = 3 → codeword → storage nodes → data collector → decode.]
Repair: phase 1
[Figure: source data → encode → codeword → storage nodes, some of them lost; the newcomers download from the surviving nodes and decode.]
Repair: phase 2
[Figure: the newcomers re-encode and exchange packets.]
Repair bandwidth per node = 8/2 = 4
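The two repair phases can be sketched end to end. The slides show only block diagrams, so several things here are assumptions: the file is split into r = 2 messages of k = 3 symbols, each message is encoded with the same MDS code, and node i stores the i-th symbol of both codewords. A Reed-Solomon code over GF(7) with n = 5 nodes stands in for the unspecified MDS code and the unspecified n.

```python
P = 7   # assumed prime field; needs at least N nonzero evaluation points
N = 5   # assumed number of storage nodes (the slide leaves n unspecified)
K = 3   # MDS code dimension from the slide


def evaluate(coeffs, x, p=P):
    """Evaluate a message polynomial (coefficients, lowest degree first) at x."""
    return sum(c * pow(x, e, p) for e, c in enumerate(coeffs)) % p


def interpolate_at(points, x, p=P):
    """Value at x of the unique polynomial of degree < len(points) through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


def encode(msg1, msg2):
    """Node t (0-based) stores symbol t of both codewords, at evaluation point t + 1."""
    return [(evaluate(msg1, x), evaluate(msg2, x)) for x in range(1, N + 1)]


def cooperative_repair(nodes, failed, helpers):
    i, j = failed
    # Phase 1: newcomer i downloads codeword-1 symbols, newcomer j codeword-2 symbols.
    pts1 = [(h + 1, nodes[h][0]) for h in helpers]
    pts2 = [(h + 1, nodes[h][1]) for h in helpers]
    sent = 2 * len(helpers)
    # Each newcomer decodes its codeword and re-encodes the two lost symbols.
    i_keeps, i_sends = interpolate_at(pts1, i + 1), interpolate_at(pts1, j + 1)
    j_keeps, j_sends = interpolate_at(pts2, j + 1), interpolate_at(pts2, i + 1)
    # Phase 2: the newcomers exchange one re-encoded symbol each.
    sent += 2
    return (i_keeps, j_sends), (i_sends, j_keeps), sent


nodes = encode([1, 2, 3], [4, 5, 6])
new_i, new_j, sent = cooperative_repair(nodes, failed=(0, 1), helpers=(2, 3, 4))
print(new_i == nodes[0], new_j == nodes[1], sent)
```

The count is 3 + 3 downloads plus 2 exchanged symbols, 8 in total, which matches the slide's repair bandwidth of 8/2 = 4 per node.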
The construction is optimal
[Figure: tradeoff curves for non-cooperative and cooperative repair; k = 3, d = 3, r = 2, B = 6.]
Storage cost per node = 2
Repair bandwidth per node = 4
EXISTENCE OF COOPERATIVE REGENERATING CODES UNDER FUNCTIONAL REPAIR
Existence of optimal linear regenerating codes in general
(S., Hu, Netcod 2011.)
Sustainable storage system
– Will it work after arbitrarily many repairs?
Technical difficulty: the information flow graph is unbounded.
Can we work over a fixed finite field, for an unlimited number of regenerations?
– Yes, if we can construct an exact regenerating code.
– The answer is also “yes” for cooperative functional repair in general.
Trellis structure
[Figure: trellis with stages 0, 1, 2, …]
$m$: message vector (row vector)
$T_0$ is the “transfer matrix” in stage 0; after stage 0 the nodes hold $mT_0$.
$T_1$ is the “transfer matrix” in stage 1; after stage 1 the nodes hold $mT_0T_1$.
$T_2$ is the “transfer matrix” in stage 2; after stage 2 the nodes hold $mT_0T_1T_2$.
Flow in information flow graph
[Figure: information flow graph with source S, two repair stages (In, Mid, Out 1–4) and the data collector.]
The cut-set bound says that the cut capacity is at least 8.
Can we construct a flow with value 8?
Cross-sectional flow pattern
[Figure: information flow graph with source S, two repair stages (In, Mid, Out) and the data collector, with a cross-section marked.]
A recursive construction of flow
[Figure: one segment of the graph between stage s and stage s+1, with flow values $g_1, g_2, g_3, g_4$ entering at Out 1–Out 4 and $h_1, h_2, h_3, h_4$ leaving.]
1. Identify a set of cross-section flow patterns, say H.
2. For any cross-section flow pattern $(h_1, h_2, h_3, h_4)$ in H at stage s+1, we can find a flow in this segment of the graph such that $(g_1, g_2, g_3, g_4)$ is also in H.
3. Each pattern corresponds to a submatrix of the transfer matrix.
4. By the Schwartz-Zippel lemma, we can find the local encoding vectors so that all such determinants are non-zero, if the finite field is sufficiently large.
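Step 4 relies on the Schwartz-Zippel lemma: a nonzero polynomial of total degree D over a finite field F vanishes at a uniformly random point with probability at most D/|F|. The toy check below is only an illustration of that bound, not the talk's actual transfer matrices. It takes the determinant of a 2x2 matrix with four free entries, a polynomial of degree 2, and counts exactly how often it is zero over GF(p).

```python
from itertools import product


def singular_fraction_2x2(p: int) -> float:
    """Exact fraction of 2x2 matrices over GF(p), p prime, with zero determinant."""
    singular = sum(
        1
        for a, b, c, d in product(range(p), repeat=4)
        if (a * d - b * c) % p == 0
    )
    return singular / p**4


for p in (5, 7, 11, 13):
    # Schwartz-Zippel bound for a degree-2 polynomial: at most 2 / p.
    print(p, singular_fraction_2x2(p), 2 / p)
```

The fraction of bad coefficient choices shrinks as the field grows, which is the sense in which a sufficiently large field always leaves a choice of local encoding vectors that keeps every required determinant non-zero.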
Summary
Multiple node failures in medium-scale to large-scale storage systems
Formulation as a linear program
Functional repair: a linear regenerating code over a fixed finite field that matches the cut-set bound on repair-bandwidth exists.
Exact repair: two families of explicit code constructions
– Minimum-bandwidth point: d = k, r = n – d
– Minimum-storage point: d = k, r arbitrary
References
Y. Wu and A. G. Dimakis, Reducing repair traffic for erasure coding-based storage via interference alignment, ISIT, Jul. 2009.
Y. Hu, Y. Xu, X. Wang, C. Zhan and P. Li, Cooperative recovery of distributed storage systems from multiple losses with network coding, J. Sel. Area Comm., vol. 28, no. 2, Feb. 2010.
K. W. Shum, Cooperative Regenerating Codes for Distributed Storage Systems, ICC, Jun. 2011.
A.-M. Kermarrec, N. Le Scouarnec and G. Straub, Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes, Netcod, Jul. 2011.
K. W. Shum and Y. Hu, Existence of Minimum-Repair-Bandwidth Cooperative Regenerating Codes, Netcod, Jul. 2011.
K. W. Shum and Y. Hu, Exact Minimum-Repair-Bandwidth Cooperative Regenerating Codes for Distributed Storage Systems, ISIT, Aug. 2011.