Space Bounds for Reliable Storage: Fundamental Limits of Coding Alexander Spiegelman Yuval Cassuto Gregory Chockler Idit Keidar 1
2
3
4
5
Storage Blow-Up Demands are growing exponentially Replication is costly Erasure coding can help [Goodson et al. DSN 2004] [Aguilera et al. DSN 2005] [Cachin and Tessaro DSN 2006] [Hendricks et al. SOSP 2007] [Dutta et al. DISC 2008] [Cadambe et al. NCA 2014] But … within limits 6
Model 7 n servers f can fail Asynchronous RMWs
Distributed Storage: Space Bounds 8 Replication Coding O(D c ) with c concurrent writes O(D f ) bits Lower bound (D min(f,c)) Best-of-both algorithm O(D min(f,c))
k-of-n Erasure Codes 9 encode
k-of-n Erasure Codes decode 10 encode
Register Emulation 11 decode
12 Write S1 S2 S3 S4
Write 13 Generate timestamp S1 S2 S3 S4
Write 14 Generate timestamp encode S1 S2 S3 S4
Write 15 Generate timestamp encode S1 S2 S3 S4
Write 16 Generate timestamp encode S1 S2 S3 S4
Write 17 S1 S2 S3 S4
Write 18 Wait for n-f replies S1 S2 S3 S4
Read 19 S1 S2 S3 S4
20 Read S1 S2 S3 S4
21 Read Wait for n-f replies S1 S2 S3 S4
Read decode 22 S1 S2 S3 S4
What about concurrent read and write? 23
24 Write S1 S2 S3 S4
25 Write S1 S2 S3 S4
26 Write Overwrite? S1 S2 S3 S4
27 Write Overwrite? Suppose yes, if TS is bigger S1 S2 S3 S4
28 Write S1 S2 S3 S4
29 Write Read S1 S2 S3 S4
30 Write Read S1 S2 S3 S4
31 Write Read No written value can be restored! S1 S2 S3 S4
What About Replication? 32
33 Write Read No problem!
34 Write Read No written value can be restored! S1 S2 S3 S4
Enough for safe register only 35 Write Read S1 S2 S3 S4
Let’s Try to Make It Regular Return a written value 36
37 S1 S2 S3 S4
38 S1 S2 S3 S4
39 Overwrite? S1 S2 S3 S4
40 Overwrite? No! S1 S2 S3 S4
41 S1 S2 S3 S4
42 S1 S2 S3 S4
43 What can be overwritten? S1 S2 S3 S4
44 Can green be overwritten? S1 S2 S3 S4
45 Can green be overwritten? No! It’s the only value that can be returned if S3 fails S1 S2 S3 S4
46 Yellow Can Yellow be overwritten? S1 S2 S3 S4
47 S1 S2 S3 S4
48 S1 S2 S3 S4
49 S1 S2 S3 S4
50 S1 S2 S3 S4
51 S1 S2 S3 S4
52 S1 S2 S3 S4
53 Read cannot be restored! S1 S2 S3 S4
54 Read cannot be restored! Regularity violation S1 S2 S3 S4
55 What can be overwritten? Nothing! S1 S2 S3 S4
56 S1 S2 S3 S4
57 S1 S2 S3 S4
58 S1 S2 S3 S4
Ok, so the standard algorithm can use loads of storage But… 59
Inherent 60
Inherent 61 For asynchronous algorithms that store coded data
Distributed Storage: Space Bounds 62 Replication Coding O(D c ) with c concurrent writes O(D f ) bits Lower bound (D min(f,c)) Best-of-both algorithm O(D min(f,c))
But First: More About The Model Black-box encoding Arbitrary encoding scheme Storage holds: – Coded blocks – Unbounded data-independent meta-data 63 encode independent of other values
Observation Every data bit in the storage can be associated with a unique written value – Given our storage model (black-box encoding) 64
Storage Complexity Storage measured in bits Data only (neglect meta-data) Clients write values from some domain V – D=log 2 |V| Count bits associated with value v V stored in servers and other clients 65
Need D=log 2 |V| Bits To Read A Value For most values in V Pigeonhole Argument 66
Theorem Every regular lock-free MWMR register emulation – Tolerating f failures – Allowing c concurrent writes – Storing values from domain of size 2 D Needs to store (D min(f,c)) bits – At some point 67
Adversary Ad We’ll define a particular adversary structure – Controls scheduling – Prevents progress – Blows up the storage Defined with a parameter 0 < l < D 68
69
70 C + (t) clients with pending writes that store at least D- l bits at time t
71 C + (t) C - (t) Other clients with pending writes
C + and C - C + have at least D- l bits in the storage C - have less than D- l bits in the storage – Cannot be read – l bits are missing 72
73 C + (t) C - (t)
74 C + (t) C - (t) D- l
75 C + (t) C - (t) D- l
76 C + (t) C - (t) D- l
77 C + (t) C - (t) D- l
78 C + (t) C - (t) D- l D
79 C + (t) C - (t) D D- l
80 C + (t) C - (t) l D- l D
F(t) Servers that store at least l bits at time t 81 C + (t) C - (t) l D- l D
82 Storage Size F(t) C + (t)C - (t) At least (D- l ) |C + (t)| bits l D- l D
83 Storage Size At least (D- l ) |C + (t)| bits F(t) C + (t)C - (t) l D- l D
Ad’s Scheduling Let clients trigger RMWs on servers Allow pending RMWs to take effect and return 84 l D- l
Ad’s Scheduling 85 C + (t) C - (t) F(t) freeze F: no RMWs take effect l D- l
Ad’s Scheduling 86 C + (t) C - (t) F(t) delay C + : no RMWs take effect l D- l
Ad’s Scheduling 87 C + (t) C - (t) F(t) l D- l
Ad’s Scheduling 88 Rule 1: choose the longest pending RMW on a server not in F(t) by a client in C - (t) F(t) C + (t) C - (t) l D- l
Ad’s Scheduling 89 C - (t+1) C + (t+1) F(t+1) l D- l
Ad’s Scheduling 90 When no such RMWs C - (t+1) C + (t+1) F(t+1) l D- l
Ad’s Scheduling 91 Rule 2: choose in round robin order a client that wants to trigger an RMW C - (t+1) C + (t+1) F(t+1) l D- l
Ad’s Scheduling 92 C - (t+2) C + (t+2) F(t+2) l D- l
Ad’s Scheduling 93 C - (t+3) C + (t+3) F(t+3) l D- l
Ad’s Scheduling 94 C - (t+4) C + (t+4) F(t+4) l D- l
Ad’s Scheduling 95 C - (t+5) C + (t+5) F(t+5) l D- l
Ad’s Scheduling 96 C - (t+6) C + (t+6) F(t+6) l D- l
Implications of Ad F only grows – Servers in F are “frozen” Clients can move out of C + – If their blocks are overwritten Values written by clients in C - cannot be read – Each such client stores less than D- l bits We will next show that no client can complete a write operation 97
Every set of n-f servers must store D bits of some pending write for a write to return Observation 98
Observation 99 Otherwise
Observation 100 Read No value can be restored!
Observation 101 Read No value can be restored! Regularity violation
Lemma |F(t)|=1=f 102 l
Lemma Proof 103 assume v can be read write(v) RMW to server S responds time if RMW writes l bits to S, S is added to F and v cannot be restored without servers in F write(v) is in C - l bits are missing otherwise, at least one bit is still missing Contradiction!
Corollary : Lemma + Observation 104 |F(t)|=1=f l
Corollary : Lemma + Observation Lock-freedom violation 105 |F(t)|=1=f l
Are We Done? No! The adversary is not fair We need to guarantee: every correct client gets infinitely many opportunities to trigger RMWs every RMW triggered by a correct client on a correct server eventually takes effect 106
Not Fair! RMWs on servers in F(t) never take effect from time t onward RMWs by clients that remain in C + from some point onward never take effect – Call these frozen clients 107
Theorem Proof 108
109 Storage Size Recalled At least (D- l ) |C + (t)| bits F(t) C + (t)C - (t) l D- l D
Theorem Proof (Continued) 110
Constructing a Fair Run (Sketch) Run r with Ad where some client is not frozen – No write completes in r Let r’ be an identical run to r, but in which we kill all the frozen clients and servers in F r’ is fair with at least one correct process – By lock-freedom, some write completes in r’ r and r’ are indistinguishable to all correct clients – Therefore, some write completes in r 111
Constructing a Fair Run (Sketch) Let r be an infinite run with Ad – No write completes in r – Therefore, some write completes in r 112 Contradiction
Constructing a Fair Run (Sketch) Let r be an infinite run with Ad – No write completes in r – Therefore, some write completes in r 113 Details in the paper
Theorem (Restated) Every regular lock-free MWMR register emulation – Tolerating f failures – Allowing C concurrent writes – Storing values from domain of size 2 D Needs to store (D min(f,c)) bits – At some point 114
Distributed Storage: Space Bounds 115 Replication Coding O(D c ) with c concurrent writes O(D f ) bits Lower bound (D min(f,c)) Best-of-both algorithm O(D min(f,c))
Wrap Up We illustrated erasure coded storage – Storage-efficient safe wait-free register emulation – Regular with unbounded storage We proved a fundamental limit of coding – (D min(f,c)) bits for regular lock-free register – Replication is the best solution in worst case Why not enjoy both worlds? 116
Adaptive Algorithm Replication storage cost: nD Coding with k,n = O(f) storage cost: (c+1)(D/k)n We combine both approaches Storage cost: min(2nD, (c+1)(D/k)n) = O(D min(f,c)) 117
Adaptive Algorithm Replication storage cost: nD Coding with k,n = O(f) storage cost: (c+1)(D/k)n We combine both approaches Storage cost: min(2nD, (c+1)(D/k)n) = O(D min(f,c)) 118 Details in the paper
Wrap Up We illustrated erasure coded storage – Storage-efficient safe wait-free register emulation – Regular with unbounded storage We proved a fundamental limit of coding – (D min(f,c)) bits for regular lock-free register – Replication is the best solution in worst case Enjoy both worlds 119