Space Bounds for Reliable Storage: Fundamental Limits of Coding
Alexander Spiegelman, Yuval Cassuto, Gregory Chockler, Idit Keidar



Storage Blow-Up: storage demands are growing exponentially, and replication is costly. Erasure coding can help [Goodson et al. DSN 2004], [Aguilera et al. DSN 2005], [Cachin and Tessaro DSN 2006], [Hendricks et al. SOSP 2007], [Dutta et al. DISC 2008], [Cadambe et al. NCA 2014]. But... within limits.

Model: n servers, f of which can fail; the system is asynchronous; servers are accessed through RMW (read-modify-write) operations.

Distributed Storage: Space Bounds
- Replication: O(D·f) bits
- Coding: O(D·c) bits with c concurrent writes
- Lower bound: Ω(D·min(f,c)) bits
- Best-of-both algorithm: O(D·min(f,c)) bits

k-of-n Erasure Codes: a value is encoded into n coded blocks such that any k of them suffice to decode it.
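As a concrete toy instance (a sketch only; the talk assumes an arbitrary black-box encoding scheme), here is a 2-of-3 code in Python: the value is split into halves A and B, the third block is their XOR, and any two of the three blocks recover the value.

```python
def encode_2_of_3(value: bytes) -> list[bytes]:
    """Split `value` into blocks A, B and a parity block A XOR B (k=2, n=3)."""
    if len(value) % 2:
        value += b"\x00"                      # pad to an even length
    half = len(value) // 2
    a, b = value[:half], value[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode_2_of_3(blocks: dict[int, bytes]) -> bytes:
    """Recover the value from any k=2 of the 3 blocks (indexed 0, 1, 2)."""
    assert len(blocks) >= 2, "need at least k=2 blocks"
    if 0 in blocks and 1 in blocks:
        a, b = blocks[0], blocks[1]
    elif 0 in blocks:                         # reconstruct B = A XOR parity
        a = blocks[0]
        b = bytes(x ^ y for x, y in zip(a, blocks[2]))
    else:                                     # reconstruct A = B XOR parity
        b = blocks[1]
        a = bytes(x ^ y for x, y in zip(b, blocks[2]))
    return a + b

blocks = encode_2_of_3(b"abcdef")
print(decode_2_of_3({0: blocks[0], 2: blocks[2]}))   # b'abcdef'
```

Each block is roughly D/k bits, so one value costs about n·D/k bits across the servers instead of n·D with full replication; this is the saving coding offers.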

Register Emulation: clients emulate a shared read/write register over the n servers, storing coded blocks at the servers.

Write (illustrated on servers S1-S4): the writer generates a timestamp, encodes the value into n coded blocks, sends one block to each server, and waits for n-f replies.
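A minimal sketch of this write phase, not the paper's exact algorithm; `channel`, `encode`, and `clock` are assumed helpers supplied by the caller, and k = n-2f is one standard choice so that a completed write remains decodable from any n-f servers.

```python
# Sketch of the coded write: timestamp, encode, send one block per server,
# wait for n-f acknowledgements. Helpers are assumptions, not from the slides.
def write(value, servers, f, encode, channel, clock, writer_id):
    n = len(servers)
    k = n - 2 * f                               # any two sets of n-f servers share >= k
    ts = (clock.next(), writer_id)              # totally ordered timestamp
    blocks = encode(value, k, n)                # n coded blocks, any k decode
    for server, block in zip(servers, blocks):
        channel.send(server, ("STORE", ts, block))
    acks = 0
    while acks < n - f:                         # up to f servers may never reply
        sender, msg = channel.receive()
        if msg == ("ACK", ts):
            acks += 1
    return ts
```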

Read: the reader queries the servers, waits for n-f replies, and decodes the value from the returned blocks.
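A matching sketch of the read phase (same assumed helpers): group the returned blocks by timestamp and decode the highest timestamp that has at least k blocks.

```python
# Sketch of the read: query all servers, collect n-f replies, group blocks
# by timestamp, decode the newest timestamp with >= k blocks.
def read(servers, f, k, channel, decode):
    n = len(servers)
    for server in servers:
        channel.send(server, ("READ",))
    replies = []
    while len(replies) < n - f:                 # cannot wait for more than n-f
        sender, (ts, block) = channel.receive()
        replies.append((ts, sender, block))
    by_ts = {}
    for ts, sender, block in replies:
        by_ts.setdefault(ts, {})[sender] = block
    decodable = [t for t, blks in by_ts.items() if len(blks) >= k]
    return decode(by_ts[max(decodable)], k) if decodable else None
```

Note the last step: if no timestamp has k blocks among the replies, nothing is decodable; the next slides show exactly how concurrent writes cause this.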

What about concurrent reads and writes?

Now suppose a write is in progress and a second write, with a bigger timestamp, reaches a server that already holds a block of the first. Should the server overwrite the old block? Suppose yes, whenever the new timestamp is bigger. Then, with concurrent writes overwriting each other's blocks, a read that collects n-f replies may see fewer than k blocks of any single value: no written value can be restored!
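To make the failure concrete, here is a tiny simulation of the slide's scenario with assumed parameters n = 4, f = 1, k = 2, and the overwrite-on-bigger-timestamp policy: two concurrent pending writes overwrite each other, and a read of n-f = 3 servers finds fewer than k blocks of any value.

```python
# n = 4 servers, f = 1, k = 2 (illustrative values, not fixed by the slides).
servers = {s: None for s in ("S1", "S2", "S3", "S4")}

def store(server, ts, value, block):
    # Overwrite whenever the new timestamp is bigger (the "suppose yes" policy).
    if servers[server] is None or ts > servers[server][0]:
        servers[server] = (ts, value, block)

store("S1", 1, "blue", "blue-block-1")     # write(blue), ts=1, still pending
store("S2", 1, "blue", "blue-block-2")
store("S2", 2, "red", "red-block-2")       # concurrent write(red), ts=2,
store("S3", 2, "red", "red-block-3")       # overwrites blue's block on S2

# A read returns after n-f = 3 replies; S2 happens to be slow, S4 stores nothing.
replies = [servers[s] for s in ("S1", "S3", "S4") if servers[s] is not None]
blocks_per_value = {}
for ts, value, block in replies:
    blocks_per_value.setdefault(value, []).append(block)
print(blocks_per_value)                               # one block of blue, one of red
print(any(len(b) >= 2 for b in blocks_per_value.values()))   # False: nothing decodable
```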

What about replication? With full replicas there is no problem: any single reply carries a complete value the read can return. With coding, however, the read may be unable to restore any written value, so overwriting blocks by timestamp is enough for a safe register only.

Let's try to make it regular: a read must return a written value (the latest completed write, or one concurrent with the read).

So when may a server overwrite a block? Not whenever the timestamp is bigger. A block cannot be overwritten if its value might be the only one a reader can return; for example, the green value cannot be overwritten because it is the only value that can be returned if S3 fails. Can the yellow value be overwritten? If it is, a later read may again be unable to restore any value, a regularity violation. So what can be overwritten? Nothing! Servers must keep blocks of all ongoing writes, and the storage grows with the number of concurrent writes.

OK, so the standard algorithm can use loads of storage. But...

...the blow-up is inherent: it holds for every asynchronous algorithm that stores coded data.

Distributed Storage: Space Bounds
- Replication: O(D·f) bits
- Coding: O(D·c) bits with c concurrent writes
- Lower bound: Ω(D·min(f,c)) bits
- Best-of-both algorithm: O(D·min(f,c)) bits

But first, more about the model: black-box encoding. The encoding scheme is arbitrary, but each value is encoded independently of other values. The storage holds coded blocks plus unbounded data-independent meta-data.

Observation: given our storage model (black-box encoding), every data bit in the storage can be associated with a unique written value.

Storage Complexity: storage is measured in bits, counting data only (meta-data is neglected). Clients write values from some domain V, with D = log2|V|; we count the bits associated with a value v ∈ V stored at servers and at other clients.

Reading a value needs D = log2|V| bits, for most values in V: a pigeonhole argument.
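One way to see the counting behind the pigeonhole argument (a sketch under the black-box model, where the only information a read has about v is the bits associated with v):

```latex
% Fewer than D bits cannot distinguish all 2^D values:
\[
\#\{\text{bit strings of length} < D\}
  \;=\; \sum_{b=0}^{D-1} 2^{b}
  \;=\; 2^{D}-1
  \;<\; 2^{D} \;=\; |V| .
\]
% Hence a read that sees fewer than D bits associated with the written value
% cannot return the right value for every v in V; refining the count shows it
% fails for most values in V.
```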

Theorem: every regular lock-free MWMR register emulation that tolerates f failures, allows c concurrent writes, and stores values from a domain of size 2^D needs, at some point, to store Ω(D·min(f,c)) bits.

Adversary Ad: we define a particular adversary that controls the scheduling, prevents progress, and blows up the storage. It is defined with a parameter 0 < l < D.

C+(t): clients with pending writes that store at least D-l bits at time t. C-(t): the other clients with pending writes.

C+ and C-: clients in C+ have at least D-l bits in the storage; clients in C- have less than D-l bits, so their values cannot be read, since at least l bits are missing.

F(t): servers that store at least l bits at time t.

Storage Size: at time t the storage holds at least (D-l)·|C+(t)| bits, since each client in C+(t) accounts for at least D-l bits.

Ad's Scheduling: let clients trigger RMWs on servers, and allow pending RMWs to take effect and return.

Freeze F(t): no RMW on a server in F(t) takes effect. Delay C+(t): no RMW by a client in C+(t) takes effect.
Rule 1: choose the longest-pending RMW on a server not in F(t) by a client in C-(t), and let it take effect.
Rule 2: when no such RMW exists, choose in round-robin order a client that wants to trigger an RMW.
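A sketch of the two scheduling rules as a Python helper (the RMW record and the set arguments are illustrative stand-ins introduced here, not defined in the slides); RMWs on servers in F and RMWs by clients in C+ are simply never selected, which is how they stay frozen and delayed.

```python
from dataclasses import dataclass

@dataclass
class RMW:
    client: int
    server: int
    pending_since: int      # trigger time, used to find the longest-pending RMW

def ad_step(pending, clients_wanting_rmw, F, C_minus, round_robin):
    """One scheduling decision of the adversary Ad.

    pending: RMWs that were triggered but have not yet taken effect.
    F: frozen servers (store >= l bits); C_minus: writers storing < D-l bits.
    round_robin: mutable list of clients, consulted for Rule 2.
    """
    # Rule 1: the longest-pending RMW on a server not in F by a client in C-.
    candidates = [r for r in pending if r.server not in F and r.client in C_minus]
    if candidates:
        return ("take_effect", min(candidates, key=lambda r: r.pending_since))
    # Rule 2: otherwise, pick in round-robin order a client that wants to
    # trigger an RMW.
    for _ in range(len(round_robin)):
        client = round_robin.pop(0)
        round_robin.append(client)
        if client in clients_wanting_rmw:
            return ("trigger", client)
    return ("idle", None)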

Implications of Ad: F only grows, since servers in F are "frozen". Clients can move out of C+ if their blocks are overwritten. Values written by clients in C- cannot be read, because each such client stores less than D-l bits. We next show that no client can complete a write operation.

Observation: for a write to return, every set of n-f servers must store D bits of some pending write.

Otherwise, a read that reaches only those n-f servers (the remaining servers may have failed) cannot restore any value: a regularity violation.

Lemma: as long as |F(t)| ≤ f, no pending written value can be restored from the servers outside F(t).

Lemma proof (sketch): assume, for contradiction, that some pending value v can be read without the servers in F. Consider an RMW of write(v) on a server S that responds. If the RMW brings S to l or more bits, then S is added to F, so v cannot be restored without servers in F; write(v) remains in C- and l bits of v are missing. Otherwise, at least one bit of v is still missing. Either way v cannot be read. Contradiction!

Corollary (Lemma + Observation): while |F(t)| ≤ f, no write operation can complete, which is a lock-freedom violation.

Are we done? No! The adversary is not fair. Fairness requires that (1) every correct client gets infinitely many opportunities to trigger RMWs, and (2) every RMW triggered by a correct client on a correct server eventually takes effect.

Not fair! RMWs on servers in F(t) never take effect from time t onward, and RMWs by clients that remain in C+ from some point onward never take effect; call these frozen clients.

Theorem proof. Storage size recalled: at time t the storage holds at least (D-l)·|C+(t)| bits.

Constructing a fair run (sketch): take a run r with Ad in which some client is not frozen; no write completes in r. Let r' be identical to r, except that all frozen clients and the servers in F are killed. Then r' is fair and has at least one correct process, so by lock-freedom some write completes in r'. But r and r' are indistinguishable to all correct clients, so some write completes in r as well.

Thus, letting r be an infinite run with Ad, no write completes in r, and yet (by the construction above) some write completes in r: a contradiction. Details in the paper.

Theorem (restated): every regular lock-free MWMR register emulation that tolerates f failures, allows c concurrent writes, and stores values from a domain of size 2^D needs, at some point, to store Ω(D·min(f,c)) bits.

Distributed Storage: Space Bounds
- Replication: O(D·f) bits
- Coding: O(D·c) bits with c concurrent writes
- Lower bound: Ω(D·min(f,c)) bits
- Best-of-both algorithm: O(D·min(f,c)) bits

Wrap up: we illustrated erasure-coded storage, giving a storage-efficient safe wait-free register emulation, and a regular one with unbounded storage. We proved a fundamental limit of coding: Ω(D·min(f,c)) bits for a regular lock-free register, so replication is the best solution in the worst case. Why not enjoy both worlds?

Adaptive Algorithm: replication has storage cost nD; coding with k, n = O(f) has storage cost (c+1)(D/k)n. We combine both approaches, for a storage cost of min(2nD, (c+1)(D/k)n) = O(D·min(f,c)). Details in the paper.
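A small sketch of the cost comparison driving the adaptive algorithm (illustrative numbers only; the combined algorithm itself is in the paper):

```python
def storage_cost_bits(D, n, k, c):
    replication = n * D                      # n full replicas of D bits
    coding = (c + 1) * (D // k) * n          # up to c+1 versions of n blocks, D/k bits each
    combined = min(2 * n * D, coding)        # the adaptive algorithm pays the smaller cost
    return replication, coding, combined

# Illustrative numbers: D = 1000 bits, n = 6 servers, k = 2, f = 2.
for c in (1, 2, 8):
    print(c, storage_cost_bits(1000, 6, 2, c))
# c=1: coding (6000) beats 2nD = 12000; c=8: coding blows up to 27000, so the
# combined cost is capped at 2nD = 12000 = O(D*f); overall O(D*min(f,c)).
```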

Wrap up (recap): erasure-coded storage gives a storage-efficient safe wait-free register emulation, but regularity requires unbounded storage; the Ω(D·min(f,c)) lower bound for regular lock-free registers shows replication is the best solution in the worst case; with the adaptive algorithm, we enjoy both worlds.