1
Codes for Distributed Computing
ISIT 2017 Tutorial Viveck R. Cadambe (Pennsylvania State University) Pulkit Grover (Carnegie Mellon University)
4
Motivation. Worldwide data storage: 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020. Moore's law is saturating; improvements in energy/speed are not "easy". Massive parallelization, distributed computing: Apache Hadoop, Apache Spark, GraphLab, MapReduce, …
7
Challenges
Computing components can straggle, i.e., they can be slow ["Map-reduce: simplified data processing on large clusters", Dean-Ghemawat 08; "The tail at scale", Dean-Barroso 13].
Components can be erroneous or fail ["10 challenges towards exascale computing": DOE ASCAC report 2014; Fault-tolerance techniques for high-performance computing, Herault-Robert].
"... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15 ..." [Brewer et al., Google white paper, '16]
Data transfer cost/time is an important component of computing performance.
Redundancy in data storage and computation can significantly improve performance!
9
Theme of tutorial
Role of coding and information theory in contemporary distributed computing systems. Models inspired by practical distributed computing systems; fundamental limits on the trade-off between redundancy and performance; new coding abstractions and constructions.
Complements a rich literature on codes for network function computing. An admittedly incomplete list: [Korner-Marton 79, Orlitsky-Roche 95, Giridhar-Kumar 05, Doshi-Shah-Medard-Jaggi 07, Dimakis-Kar-Moura-Rabbat-Scaglione 10, Duchi-Agarwal-Wainwright 11, Ma-Ishwar 11, Nazer-Gastpar 11, Appuswamy-Franceschetti-Karamchandani-Zeger 13, Ramamoorthy-Langberg 13], as well as the communication and information complexity literature.
10
Outline Information theory and codes for shared memory emulation
Codes for distributed linear data processing
13
Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithms Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to distributed algorithms Directions of Future Research
15
Distributed Algorithms
Algorithms for distributed networks and multi-processors; ~50 years of research. Applications: cloud computing, networking, multiprocessor programming, the Internet of Things¹. Information theory: data is big and changes very fast. Theme of this part of the tutorial: a marriage of ideas between information theory and distributed algorithms. ¹Typical publication venues: PODC, DISC, OPODIS, SODA, FOCS, USENIX FAST, USENIX ATC.
18
Distributed Algorithms: Central assumptions and requirements
Modeling assumptions: unreliability, asynchrony, decentralized nature.
Requirements: fault tolerance; consistency, i.e., the service provided by the system "looks as if centralized" despite the unreliability/asynchrony/decentralized nature of the system.
Consequences: simple-looking tasks can be tricky or sometimes even impossible; careful algorithm design and non-trivial correctness proofs are needed. Example: binary consensus (see appendix).
22
Shared memory emulation
Distributed system; read-write memory. A classical problem in distributed computing.
Goal: implement a read-write memory over a distributed system.
It supports two operations: Read(variablename) % also called a "get" operation; Write(variablename, value) % also called a "put" operation.
For simplicity, we focus on a single variable and omit variablename. The cloud is distributed; there is one variable, and clients write it or read it.
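To fix ideas, here is a minimal sketch of the get/put interface being emulated, written as if the variable lived in one place; the class and method names are illustrative and not part of any particular system. The rest of this part of the tutorial is about providing this same interface when the value is stored redundantly on many unreliable, asynchronous servers.

```python
class SharedRegister:
    """A single shared variable supporting Read ("get") and Write ("put")."""

    def __init__(self, initial=None):
        self._value = initial

    def write(self, value):
        # Write(variablename, value); variablename omitted since there is one variable
        self._value = value

    def read(self):
        # Read(variablename); variablename omitted
        return self._value

r = SharedRegister()
r.write("v1")
print(r.read())  # -> v1
```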
23
Shared memory emulation: Application to cloud computing
Distributed system. Theoretical underpinnings of commercial and open-source key-value stores: Amazon Dynamo DB, Couch DB, Apache Cassandra DB, Voldemort DB. Applications: transactions, reservation systems, multi-player gaming, social networks, news feeds, distributed computing tasks, etc. Design requires sophisticated distributed computing theory combined with great engineering. An active research field in the systems and theory communities. A key-value store is like a database, and is widely used.
25
Engineering Challenges in key-value store implementations
Distributed system. Failure tolerance, fast reads, fast writes.
Asynchrony: weak (no) timing assumptions. Distributed nature: nodes are unaware of each other's state.
How do you distinguish a failed node from a very slow node? How do you ensure that all copies of the data have received a write/update?
Requirements: present a consistent view of the data; allow concurrent access to clients (no locks).
Solutions exist to the above challenges. However…
29
Goal of this part of the tutorial
Distributed system. Analytical understanding of performance (memory overhead, latency) is limited. Replication is used for fault tolerance and availability in practice today. Minimizing memory overhead is important for several reasons: data volume is increasing exponentially, and storing more data in high-speed memory reduces latency. This tutorial discusses the following questions: considerations for the use of erasure codes in such systems, and an information-theoretic framework.
32
Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
33
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
37
Read-Write Memory. (Timeline figure: a Write(v) and a Read() shown as instantaneous events on a time axis.) Reality check: operations over distributed asynchronous systems cannot be modeled as instantaneous. Solution: the concept of atomicity!
39
Thought experiment: Motivation for atomicity
Two concurrent processes:
Process 1: { x=2.1; tic; x=3.2; % write to shared variable x; toc; }   % time of write operation: 120 ms
Process 2: { Read(x) }
Question: When is the new value (3.2) of the shared variable x available to a possibly concurrent read operation? At 10 ms? 20 ms? 120 ms? 121 ms?
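A minimal runnable sketch of this thought experiment, assuming (purely for illustration) that the write takes about 120 ms to take effect and that three readers probe the variable at different times; what a real distributed store may return to each concurrent reader is exactly what the consistency definition below has to pin down.

```python
import threading
import time

x = 2.1  # shared variable

def writer():
    global x
    time.sleep(0.120)   # the write operation takes ~120 ms to take effect
    x = 3.2

def reader(delay_ms):
    time.sleep(delay_ms / 1000)
    print(f"read at {delay_ms} ms sees x = {x}")

threads = [threading.Thread(target=writer)]
threads += [threading.Thread(target=reader, args=(d,)) for d in (10, 20, 121)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Typically the reads at 10 and 20 ms see 2.1 and the read at 121 ms sees 3.2;
# for a distributed implementation, which concurrent reads may return 3.2 is
# precisely what atomic consistency specifies.
```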
40
Atomic consistency for shared memory emulation
[Lamport 86]; also known as linearizability [Herlihy, Wing 90]. (Timeline figure: a Write(v) and a concurrent Read() shown on a time axis.)
48
Examples of non-atomic executions
(Timeline figures: two example executions, each with a Write(v) and a Read() on a time axis, that violate atomicity.)
52
Importance of Consistency
Modular algorithm design: design an application (e.g., bank accounts, reservation systems) over an "instantaneous" memory, then use an atomic distributed memory in its place; the program executions are indistinguishable. Weaker consistency models are also useful: social networks and news feeds use weaker consistency definitions, trading off errors for performance. In this talk, we will focus on atomic consistency.
53
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
54
Distributed System Model
Read clients, write clients, servers. Client-server architecture; nodes can fail, i.e., stop responding (the number of server failures is limited). Point-to-point reliable links (arbitrary delay). Nodes are unaware of the current state of any other node.
57
The shared memory emulation problem
Read() { design read protocol }   write(v) { design write protocol }   Read clients, write clients; servers: { design server protocol }. Design the write, read, and server protocols to ensure: atomicity; liveness: concurrent operations, no blocking.
58
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
59
The ABD algorithm (sketch)
Read clients, write clients, servers. Idea: write to and read from a majority of the server nodes; any pair of write and read operations then intersects in at least one server. The algorithm works as long as only a minority of the server nodes fail.
63
The ABD algorithm (sketch)
Read clients, write clients, servers.
Write protocol: send the time-stamped value to every server; return after receiving acks from a majority.
Read protocol: send a read query; wait for a majority to respond; return the latest value.
Server protocol: store the latest received value and send an ack; respond to a read request with the stored value.
(This is not the ABD algorithm in its full generality. For instance, the reads have an additional write step which is omitted here. Further, the paper actually proves that this algorithm is atomic, which is not completely trivial.)
Point: every server uses replication, so you send big packets over the network and you store entire values.
65
The ABD algorithm (sketch)
Write protocol: acquire the latest tag via a query; send the tagged value to every server; return after receiving sufficient acks.
Read protocol: send a read query; wait for acks from a majority; send the latest value back to the servers; return the latest value after receiving acks from a majority.
Server protocol: respond to a query with the stored tag; store the latest received value and send an ack; respond to a read request with the stored value.
Point: every server uses replication.
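A hedged sketch of the quorum logic just described, with the network abstracted away: the servers here are plain in-memory objects, a real deployment would contact them over message channels and proceed as soon as any majority responds, and the names (ABDServer, abd_write, abd_read) are illustrative rather than the paper's notation.

```python
class ABDServer:
    def __init__(self):
        self.tag, self.value = (0, 0), None   # tag = (counter, writer_id), ordered lexicographically

    def query(self):
        return self.tag                        # phase 1: report the locally stored tag

    def store(self, tag, value):
        if tag > self.tag:                     # keep only the latest tagged value (full replication)
            self.tag, self.value = tag, value
        return "ack"

    def read(self):
        return self.tag, self.value

def majority(n):
    return n // 2 + 1

def abd_write(servers, writer_id, value):
    q = majority(len(servers))
    tags = [s.query() for s in servers][:q]               # tags from (at least) a majority
    new_tag = (max(t[0] for t in tags) + 1, writer_id)
    [s.store(new_tag, value) for s in servers][:q]        # return once a majority has acked
    return new_tag

def abd_read(servers):
    q = majority(len(servers))
    responses = [s.read() for s in servers][:q]           # (tag, value) pairs from a majority
    tag, value = max(responses)                           # latest tagged value
    for s in servers[:q]:                                  # write back before returning
        s.store(tag, value)
    return value

servers = [ABDServer() for _ in range(5)]
abd_write(servers, writer_id=1, value="v1")
print(abd_read(servers))   # -> v1
```

The write-back step in abd_read mirrors the read protocol above: it ensures that later operations also see a tag at least as large, which is exactly what the atomicity lemma on the next slide requires.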
66
The ABD algorithm is atomic – proof idea
If, after an operation P terminates, (i) every future operation acquires a tag at least as large as the tag of P, (ii) every future write operation acquires a tag strictly larger than the tag of P, and (iii) a read with tag t returns the value of the corresponding write with tag t, then the algorithm is atomic. [Paraphrasing a lemma from Lynch 96]
(Figure: after P, a later Write and a later Read each acquire a tag at least as large as the tag that P propagated.)
Why should read operations write back the value?
67
The ABD algorithm - summary
An atomic read-write memory can be implemented over a distributed asynchronous system. All operations terminate so long as the number of servers that fail is a minority. Design principles of several modern key-value stores mirror shared memory emulation algorithms; see the description of Amazon's Dynamo key-value store [DeCandia et al. 07]. Replication is used for fault tolerance. Point: every server uses replication.
68
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
70
Erasure Coding: smaller packets, smaller overheads.
Example: a Reed-Solomon code with 2 parity symbols. The value is recoverable from any 4 codeword symbols, and the size of each codeword symbol is 1/4 the size of the value.
New constraint: a reader needs 4 symbols with the same time-stamp.
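A minimal sketch of such a code, in the spirit of the 2-parity Reed-Solomon example above: 4 data symbols are encoded into 6 codeword symbols with a Vandermonde generator matrix, and the value is recovered from any 4 of the 6. Exact rational arithmetic stands in for the finite-field arithmetic a real Reed-Solomon implementation would use, and all names here are illustrative.

```python
from fractions import Fraction
from itertools import combinations

K, N = 4, 6   # 4 data symbols encoded into 6 codeword symbols (2 parity)

# N x K Vandermonde generator matrix: any K of its rows are invertible,
# which is exactly the MDS property (recover the value from any K symbols).
G = [[Fraction(r) ** c for c in range(K)] for r in range(1, N + 1)]

def encode(data):
    return [sum(G[i][j] * data[j] for j in range(K)) for i in range(N)]

def solve(A, b):
    """Gaussian elimination over the rationals: solve A x = b for square A."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    n = len(M)
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [x - M[r][col] * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def decode(available):
    """available: at least K pairs of (server index, codeword symbol)."""
    idx, syms = zip(*available[:K])
    return solve([G[i] for i in idx], list(syms))

value = [Fraction(v) for v in (7, 1, 4, 2)]
codeword = encode(value)
for subset in combinations(range(N), K):           # every choice of 4 of the 6 symbols
    assert decode([(i, codeword[i]) for i in subset]) == value
print("value recovered from every set of 4 codeword symbols")
```

The "new constraint" in the slide is visible here: decoding mixes symbols of a single codeword, so the 4 symbols handed to decode must all correspond to the same version (time-stamp) of the value.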
71
Set up for hypothetical erasure coding based algorithm
Read clients, write clients, servers. Write to/read from any five nodes; any two such sets of five intersect in at least 4 nodes. Operations complete as long as at most one node has failed. More generally, in a system with N nodes and a code of dimension k, write to/read from any ⌈(N+k)/2⌉ nodes.
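A quick sanity check of that quorum-size rule (a sketch, with illustrative function names): if every operation contacts ⌈(N+k)/2⌉ of the N servers, then any two such sets overlap in at least k servers, which is enough to decode a dimension-k MDS code.

```python
from math import ceil

def worst_case_overlap(N, quorum_size):
    # two sets of this size overlap least when they avoid each other as much as possible
    return 2 * quorum_size - N

for N, k in [(6, 4), (7, 5), (10, 4)]:
    q = ceil((N + k) / 2)
    overlap = worst_case_overlap(N, q)
    print(f"N={N}, k={k}: contact {q} nodes, any two such sets share >= {overlap} nodes (need {k})")
```

For the example above (N=6, k=4) this gives sets of 5 nodes overlapping in at least 4 nodes, matching the slide.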
73
Hypothetical erasure coding based algorithm - challenges
(Figure: read clients send queries to the servers; servers store multiple versions and must discard old versions to save storage.)
First challenge: reveal symbols to readers only when enough symbols have been propagated.
Second challenge: discard old versions safely.
75
Crude, one sentence summary
First challenge: reveal symbols to readers only when enough symbols have been propagated. Second challenge: discard old versions safely. Crude one-sentence summary: the challenges can be solved through careful algorithm design, with storage cost savings if the extent of concurrency is small. A sample algorithm is given in the appendix.
78
Different approaches to solving challenges
[Ganger et al. 04] the HGR algorithm; [Dutta et al. 08] the ORCAS and ORCAS-B algorithms; [Dobre et al. 14] the M-PoWerStore algorithm; [Androulaki et al. 14] the AWE algorithm; [Cadambe et al. 14] the CASGC algorithm; [Konwar et al. 16] the SODA algorithm. In all of these, the storage cost grows with the number of concurrent write operations.
Noteworthy recent developments: coding-based consistent store implementations [Zhang et al. FAST 16], [Chen et al. USENIX ATC 17]; erasure coding based algorithms for consistency issues in edge computing [Konwar et al. PODC 2017].
Can clever coding-theoretic ideas improve the storage cost? Does the storage cost necessarily grow with concurrency? What is the right information-theoretic abstraction of this system?
81
Break
82
Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
83
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes [Wang Cadambe, Accepted to Trans. IT, 2017]
87
Toy Model for packet arrivals, links
Read clients (decoders), write clients, N servers; f failures possible.
Arrival at the write client: a new version in every time slot, sent immediately to the servers.
Channel from the write client to a server: delay is an integer in [0, T-1].
Channel from a server to a read client: instantaneous (no delay).
Goal: a decoder invoked at time t gets the latest common version among c servers.
T is the degree of asynchrony; every time stamp will be called a version.
93
Toy Model – Decoding requirement
N servers; f failures possible. A version is complete at time t if it has arrived at N-f servers.
Decoding requirement for a decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. This mirrors erasure coding based shared memory emulation protocols.
We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c servers. The two decoding requirements have the same worst-case storage cost when c = N-2f.
Goal: a decoder invoked at time t gets the latest common version among c servers.
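A small illustration of this decoding requirement, assuming versions are numbered 1, 2, 3, ... and that, for the purposes of the example, each server's state is just the set of version numbers it has received; the helper names are hypothetical.

```python
from itertools import combinations

def latest_common_version(server_states, chosen):
    """Largest version present at every server in `chosen` (0 if none)."""
    common = set.intersection(*(server_states[i] for i in chosen))
    return max(common, default=0)

# Snapshot of N = 5 servers at some time t; asynchrony means different subsets arrived.
states = [{1, 2, 3}, {1, 2}, {1, 2, 3}, {1}, {1, 2}]
c = 3
for chosen in combinations(range(len(states)), c):
    v = latest_common_version(states, chosen)
    print(f"servers {chosen}: must decode version {v} or later")
```

A multi-version code has to guarantee this for every such set of c servers, while each server stores much less than all the versions it has received.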
95
Asynchrony Toy Model – Decoding requirement
c servers; snapshot at time t.
Consistency: decode the latest possible version (a globally consistent view).
Asynchrony: every server receives a different subset of versions, and no server has information about the others.
The decoder connects to any c servers and decodes the latest common version among them, or a later one.
Goal: construct a storage method that minimizes the storage cost.
96
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes
97
The multi-version coding (MVC) problem
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions, and no server has information about the others.
100
The MVC problem: N servers, T versions, connectivity c.
Goal: decode the latest common version, or a later version, from every set of c servers. Minimize the storage cost, in the worst case over all "states" of the servers.
101
Solution 1: Replication
Storage size = size-of-one-version N=4, T=2, c=2
103
Solution 2: (N,c) Maximum Distance Separable code
Question: Can we store a codeword symbol corresponding to the latest version?
104
Solution 2: (N,c) Maximum Distance Separable code
Storage size = (T/c) × size-of-one-version. (Figure: N=4, T=2, c=2; each server stores a 1/2-size coded symbol per version.) Separate coding across versions; each server stores symbols for all the versions it has received.
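A back-of-the-envelope comparison of the two baseline schemes, per server and normalized by the size of one version, as a small sketch: replication stores one full version (cost 1), while naïve per-version MDS coding stores a 1/c-size symbol for each of up to T versions (cost T/c), so naïve MDS only wins while T < c.

```python
def replication_cost(T, c):
    return 1.0      # one full copy of a version per server

def naive_mds_cost(T, c):
    return T / c    # a 1/c-size coded symbol for each of T versions

c = 4
for T in range(1, 7):
    print(f"T={T}, c={c}: replication {replication_cost(T, c):.2f}, naive MDS {naive_mds_cost(T, c):.2f}")
```

The summary that follows compares these baselines with the tutorial's achievable scheme and converse.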
109
Summary of results. (Plot: storage cost versus the number of versions T, comparing replication, naïve MDS codes, our achievable scheme, our converse, and the Singleton bound.) Storage cost inevitably increases as the degree of asynchrony grows!
111
Normalized by size-of-value
Summary of results: storage cost normalized by size-of-value. Replication: 1. Naïve MDS codes, constructions*, and the lower bound: expressions given in the original table. *Achievability can be improved (see [Wang, Cadambe, Accepted to Trans. IT, 17]).
112
Main insights and techniques
Redundancy is required to ensure consistency in an asynchronous environment; the amount of redundancy grows with the degree of asynchrony T. Connected to pliable index coding [Brahma-Fragouli 12]. Exercises in network information theory can be converted into exercises in combinatorics.
Achievability: a separate linear code for each version; carefully choose the "budget" for each version based on the set of received versions.
Converse: genie-based, discover "worst-case" arrival patterns.
118
Achievability Therefore, the version corresponding to at least one partition is decodable
120
Converse argument: start with c servers.
121
State vector s1 = (1, 1, …, 1): version 1 is decodable.
State vector s3 = (2, 2, …, 2, 1, 1, …, 1): the minimal state vector such that version 2 is decodable.
State vector s4 = (2, 2, …, 1, 1, 1, …, 1): the maximal state vector such that version 1 is decodable.
Versions 1 and 2 are decodable from c+1 symbols: the c symbols in s3 plus the one changed symbol in s4.
129
Versions 1 and 2 decodable from c+1 symbols
Start with c servers; propagate version 2 to a minimal set of servers such that it becomes decodable. Versions 1 and 2 are then decodable from c+1 symbols.
130
Main insights and techniques
Genie-based converse: discover "worst-case" arrival patterns. The combinatorial puzzle is more challenging for T > 2.
132
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes
135
Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at clients: arbitrary Channel from clients to servers: arbitrary (unbounded) delay, reliable Clients, servers modeled as I/O automata, protocols can be designed.
138
Shared Memory Emulation model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata, and the protocols can be designed.
Per-server storage cost: a generalization of the Singleton bound, which is non-trivial due to the interactive nature of the protocols. [C-Lynch-Wang, ACM PODC 2016]
140
Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost Generalization of the MVC converse for T=2. Open question: Generalization of MVC bounds for T > 2 to the full-fledged distributed systems theoretic model. [C-Lynch-Wang, ACM PODC 2016]
141
Shared Memory Emulation model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata, and the protocols can be designed.
Per-server storage cost: a generalization of the MVC converse for T=2. The generalization of the MVC converse for T > 2 works for non-interactive protocols, with T = the number of concurrent write operations. [C-Lynch-Wang, ACM PODC 2016]
143
Storage Cost bounds for Shared Memory Emulation
(Plot: storage cost versus the number of concurrent writes, comparing the ABD algorithm and erasure coding based algorithms against the baseline, first, and second lower bounds.)
144
Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
147
Directions of Future Research
Multi-version codes: a pessimistic/conservative model. The system is totally asynchronous, every version arrival pattern is possible, and nodes do not even have stale or partial information about the system state.
Classical codes for distributed storage: an optimistic model. The system is synchronous, and nodes have instantaneous, global system-state information.
Practice lies between these two extremes.
148
Directions of future research
Beyond the worst-case model Correlated versions
149
Codes with Correlated Versions
[Ali-C, ITW 2016] (Model figure: successive versions are correlated; figure labels: "Markov Chain", "Uniform".)
150
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. (Figure: each server stores a 1/2-size symbol of the first version; the storage needed for the second version is marked "?".)
151
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. "Closeness" in the messages translates into "closeness" in the codewords: delta coding.
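A minimal delta-coding sketch under the assumption (made here only for illustration) that the new version differs from the old one in a small number of positions: a node that already holds version 1 can store a compact encoding of the difference rather than a full-size representation of version 2.

```python
def delta(old, new):
    """Positions and new values where `new` differs from `old`."""
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_delta(old, d):
    out = list(old)
    for i, b in d:
        out[i] = b
    return out

v1 = [4, 8, 15, 16, 23, 42, 7, 0]
v2 = [4, 8, 15, 99, 23, 42, 7, 1]   # "close" to v1: differs in only 2 positions

d = delta(v1, v2)
print(f"delta has {len(d)} entries versus {len(v2)} symbols for the full version")
assert apply_delta(v1, d) == v2      # version 2 is recoverable from version 1 plus the delta
```

The storage saving is exactly the "closeness in message implies closeness in codeword" idea; the Slepian-Wolf ideas mentioned on the following slides exploit the same correlation without the encoder needing to see the previous version explicitly.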
152
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. The resulting storage cost is smaller than that of coding the two versions independently.
155
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. Apply Slepian-Wolf ideas; the storage cost is again smaller than that of coding the versions independently.
156
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. Open: information-theoretically optimal schemes for all parameter regimes. Open: practical code constructions.
158
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: explore the relations between timing assumptions and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
160
Beyond the toy model: the toy model serves as a bridge between shared memory emulation and our network information theoretic formulation.
Future work: a more realistic model that exposes the connections between channel delay uncertainty, staleness of reads, and storage cost.
Open: can an interactive write protocol improve the storage cost?
162
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: the relation between system response, the decoding requirement, and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
163
Beyond read-write memory
Several systems: more complicated data structures over distributed asynchronous systems Transactions: Multiple read-write objects, more complicated consistency requirements. Graph based data structures Question: How do you “erasure code” more complicated data structures and state machines? Initial clues provided in [Balasubramanian-Garg 14]
165
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: the relation between system response, the decoding requirement, and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
Beyond read-write data structures.
166
Thanks
167
References
[Lynch 96] N. A. Lynch, Distributed Algorithms. USA: Morgan Kaufmann Publishers Inc., 1996.
[Lamport 86] L. Lamport, "On interprocess communication. Part I: Basic formalism," Distributed Computing, 2(1), pp. 77–85, 1986.
[Vogels 09] W. Vogels, "Eventually consistent," ACM Queue, vol. 6, no. 6, pp. 14–19, 2008.
[Attiya-Bar-Noy-Dolev 95] H. Attiya, A. Bar-Noy, and D. Dolev, "Sharing memory robustly in message-passing systems," J. ACM, vol. 42, no. 1, pp. 124–142, Jan. 1995.
[DeCandia et al. 07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, vol. 7, 2007, pp. 205–220.
[Hewitt 10] E. Hewitt, Cassandra: The Definitive Guide. O'Reilly Media, Inc., 2010.
[Hendricks et al. 07] J. Hendricks, G. R. Ganger, and M. K. Reiter, "Low-overhead Byzantine fault-tolerant storage," ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 73–86, 2007.
[Dutta et al. 08] P. Dutta, R. Guerraoui, and R. R. Levy, "Optimistic erasure-coded distributed storage," in Distributed Computing. Springer, 2008, pp. 182–196.
[Cadambe et al. 14] V. R. Cadambe, N. Lynch, M. Medard, and P. Musial, "A coded shared atomic memory algorithm for message passing architectures," in 2014 IEEE 13th International Symposium on Network Computing and Applications (NCA). IEEE, 2014, pp. 253–260.
[Dobre et al. 13] D. Dobre, G. Karame, W. Li, M. Majuntke, N. Suri, and M. Vukolić, "PoWerStore: Proofs of writing for efficient and robust storage," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 285–298.
[Konwar et al. PODC 2017] K. M. Konwar, N. Prakash, N. A. Lynch, and M. Médard, "A layered architecture for erasure-coded consistent distributed storage," CoRR, 2017; accepted to the 2017 ACM Symposium on Principles of Distributed Computing (PODC).
168
References
[Zhang et al. FAST 16] H. Zhang, M. Dong, and H. Chen, "Efficient and available in-memory KV-store with hybrid erasure coding and replication," in FAST, 2016, pp. 167–180.
[Chen et al. 2017] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, "Giza: Erasure coding objects across global data centers," in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017.
[Herlihy Wing 90] M. P. Herlihy and J. M. Wing, "Linearizability: A correctness condition for concurrent objects," ACM Trans. Program. Lang. Syst., vol. 12, pp. 463–492, July 1990.
[Cadambe-Lynch-Wang ACM PODC 16] V. R. Cadambe, Z. Wang, and N. Lynch, "Information-theoretic lower bounds on the storage cost of shared memory emulation," in Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing (PODC '16). ACM, 2016, pp. 305–314.
[Wang Cadambe Trans. IT 17] Z. Wang and V. Cadambe, "Multi-version coding – an information-theoretic perspective of consistent distributed storage," to appear in IEEE Transactions on Information Theory.
[Brahma Fragouli 2012] S. Brahma and C. Fragouli, "Pliable index coding," in 2012 IEEE International Symposium on Information Theory (ISIT). IEEE, 2012, pp. 2251–2255.
[Ali-Cadambe ITW 16] R. E. Ali and V. R. Cadambe, "Consistent distributed storage of correlated data updates via multi-version coding," in 2016 IEEE Information Theory Workshop (ITW). IEEE, 2016, pp. 176–180.
[Balasubramanian-Garg 14] B. Balasubramanian and V. K. Garg, "Fault tolerance in distributed systems using fused state machines," Distributed Computing, vol. 27, no. 4, pp. 287–311, 2014.
171
Appendix: Binary consensus - A simple looking task that is impossible to achieve in a decentralized asynchronous system
172
Fischer-Lynch-Paterson (FLP) impossibility result (informal)
A famous impossibility result. Two processors P1, P2; each processor begins with an initial value in {0,1}. They can communicate messages over a reliable link, but with arbitrary (unbounded) delay.
Goal: design a protocol such that (a) both processors agree on the same value, which is an initial value of some processor, and (b) each non-failed processor eventually decides.
FLP (informally): no such protocol exists in an asynchronous system if even one processor may crash.
173
Appendix: Why does a read operation need to write back in the ABD algorithm?
174
What happens if a read does not write back?
go back The following execution is possible if a read does not write back. [Timeline figure: writes W1, W2 and reads R1, R2, R3 on a time axis.] An example of a violating execution with 6 servers: Write W1's value v(1) reached all 6 servers before W2 started. Write W2 sent its value v(2) only to server 1 before read R2 started. Read R2 got responses from servers 1, 2, 3, 4 and therefore returned v(2). Server 1 failed after R2 completed, but before R3 started. Read R3 then started and returned v(1) (it cannot see v(2)!). Finally, after R3 completes, v(2) reaches the remaining non-failed servers. Since R3 begins after R2 ends yet returns an older value, atomicity is violated; the read's write-back phase prevents exactly this.
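A minimal sketch of the scenario above (my own illustration, not part of the original deck; the helper names and the 0-indexed servers are invented):

```python
# Minimal sketch (not from the tutorial) of the execution above.
# Servers store a (tag, value) pair; a read queries a majority, picks the
# highest tag, and -- in the full ABD protocol -- writes that pair back to a
# majority before returning. Skipping the write-back lets R3 return the
# older v1 even though R2 already returned v2.

N = 6
MAJORITY = N // 2 + 1                 # any two majorities of 6 servers intersect

servers = {i: (1, "v1") for i in range(N)}    # W1's value reached all servers
alive = set(range(N))

def server_write(i, tag, val):
    if i in alive and tag > servers[i][0]:
        servers[i] = (tag, val)

def read(responding, write_back=True):
    """Read using responses from the given set of at least MAJORITY servers."""
    assert len(responding) >= MAJORITY
    tag, val = max(servers[i] for i in responding)   # latest version seen
    if write_back:
        for i in sorted(alive)[:MAJORITY]:           # propagate before returning
            server_write(i, tag, val)
    return val

server_write(0, 2, "v2")                       # W2 reaches only one server
print(read({0, 1, 2, 3}, write_back=False))    # R2 -> 'v2'
alive.discard(0)                               # the server holding v2 then fails
print(read({1, 2, 3, 4}, write_back=False))    # R3 -> 'v1' (new-old inversion)
```

Re-running the first read with write_back=True propagates (2, "v2") to a majority before it returns, so any later read that contacts a majority necessarily sees it; this is exactly the role of the read's second phase.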
175
Appendix: An Erasure Coding Based Algorithm
Algorithm from [Cadambe-Lynch-Medard-Musial, IEEE NCA 2014], extended version in Distributed Computing (Springer) 2016.
176
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
177
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
178
Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. N servers, at most f failures. Use an MDS code of dimension k, where f is no bigger than (N-k)/2 (equivalently, k ≤ N - 2f). Every set of at least (N+k)/2 server nodes is referred to as a "quorum set"; note that any two quorum sets intersect in at least k nodes. An additional "fin" label at servers indicates that the version's coded elements have been propagated to sufficiently many servers. An additional write phase tells the servers that the coded elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).
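The quorum arithmetic above can be checked mechanically. The short script below is my own sketch (check_quorum_properties is an invented helper, not from the CAS paper or this deck): it verifies for a few (N, k) pairs that quorums of size ⌈(N+k)/2⌉ survive f = ⌊(N-k)/2⌋ failures and that any two quorums share at least k servers, which is what lets a reader decode one MDS codeword from the intersection.

```python
# My own sketch of the CAS quorum arithmetic.
from math import ceil

def check_quorum_properties(N, k):
    f = (N - k) // 2                 # tolerated failures: f <= (N - k)/2
    q = ceil((N + k) / 2)            # quorum size
    assert N - f >= q, "some quorum of non-failed servers must remain"
    assert 2 * q - N >= k, "any two quorums share >= k coded elements"
    return f, q

for N, k in [(5, 1), (7, 3), (9, 3), (10, 4)]:
    f, q = check_quorum_properties(N, k)
    print(f"N={N}, k={k}: tolerates f={f} failures, quorum size {q}")
```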
179
[Diagram: read clients, write clients, and the servers; callout at Server 1: "Has been propagated to a quorum".]
180
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded element; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
181
CAS – Protocol overview
Write: Acquire latest tag; send (incremented) tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded element; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
182
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
183
[Diagram: read clients, write clients, and the servers (Server 1 highlighted).]
184
[Diagram: read clients, write clients, and the servers (Server 1 highlighted).]
185
[Diagram: read clients, write clients, and the servers; the servers respond with ACKs.]
186
[Diagram: read clients, write clients, and the servers; the servers mark the version with the "fin" label.]
187
[Diagram: read clients, write clients, and the servers; the servers respond with ACKs.]
188
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Finalize the requested tag; respond to read request with codeword symbol.
189
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Label the requested tag as fin; respond to read request with coded element if available.
190
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Label the requested tag as fin; respond to read request with coded element if available.
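To make the message flow summarized on this slide concrete, here is a schematic single-process sketch (my own illustration, not the authors' implementation; encode/decode are trivial replication stand-ins for a real (N, k) MDS code such as Reed-Solomon, and all names are invented):

```python
# Schematic, single-process sketch of the CAS message flow summarized above
# (my own illustration, not the authors' code). Tags are (number, writer_id)
# pairs ordered lexicographically; 'fin' marks tags whose coded elements have
# reached a quorum.

def encode(value, n, k):              # stand-in for MDS encoding (replication)
    return [value] * n

def decode(elems, k):                 # stand-in for MDS decoding
    return elems[0]

class Server:
    def __init__(self):
        self.store = {}               # tag -> coded element (CAS keeps full history)
        self.fin = set()              # tags labeled 'fin'

    def query(self):                  # respond with the latest finalized tag
        return max(self.fin, default=(0, "init"))

    def pre_write(self, tag, elem):   # store the coded element ("ack" implied)
        self.store[tag] = elem

    def finalize(self, tag):          # set 'fin'; return coded element if available
        self.fin.add(tag)
        return self.store.get(tag)

def cas_write(servers, value, writer_id, k=1):
    tag = (max(s.query() for s in servers)[0] + 1, writer_id)   # incremented tag
    for s, e in zip(servers, encode(value, len(servers), k)):
        s.pre_write(tag, e)           # in CAS: proceed after acks from a quorum
    for s in servers:
        s.finalize(tag)               # in CAS: return after acks from a quorum

def cas_read(servers, k=1):
    tag = max(s.query() for s in servers)                       # latest 'fin' tag
    elems = [e for e in (s.finalize(tag) for s in servers) if e is not None]
    return decode(elems[:k], k)       # decode from >= k coded elements

servers = [Server() for _ in range(5)]
cas_write(servers, "v1", writer_id="w1")
print(cas_read(servers))              # -> 'v1'
```

Note how a server only advertises a tag to readers (via query) after it has been finalized, i.e., after the writer has confirmed that a quorum holds the coded elements; this is the mechanism by which only completed writes are revealed.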
191
Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. An additional "fin" label at servers indicates that the version's coded elements have been propagated to sufficiently many servers. An additional write phase tells the servers that the coded elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).
192
CAS – Protocol overview
Theorem: (1) CAS satisfies atomicity. (2) Liveness: all operations return, provided the number of server failures is at most the pre-specified threshold f.
193
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
194
Keep at most d+1 elements
Possible solution: store at most d+1 coded elements per server and delete older ones. [Diagram: read clients, write clients, and the servers; callout at Server 1: "Keep at most d+1 elements".]
195
Modification of CAS: the good and the bad
Possible solution: store at most d+1 coded elements and delete older ones.
The good: finite storage cost; all operations terminate if the number of writes that overlap with a read is smaller than d; atomicity (via a simulation relation with CAS).
The bad: failed write clients lead to a weak liveness condition, that is, d failed writes can render all future reads incomplete. [Diagram: a read that does not end because it is concurrent with all future writes.]
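A tiny sketch of the failure mode just described (my own illustration; the names and the single-server setup are invented): with the naive keep-the-latest-d+1 rule, a read that learns a tag but is overtaken by more than d later writes can no longer fetch its coded element.

```python
# Sketch (my own, not the paper's code) of the naive garbage-collection rule:
# each server keeps only the d + 1 most recent coded elements. If d + 1 writes
# land between a reader learning a tag and fetching its coded element, that
# element is already gone and the read cannot complete.

d = 2

class Server:
    def __init__(self):
        self.store = {}                        # tag -> coded element

    def write(self, tag, elem):
        self.store[tag] = elem
        for old in sorted(self.store)[:-(d + 1)]:
            del self.store[old]                # keep at most d + 1 elements

    def get(self, tag):
        return self.store.get(tag)             # None if garbage-collected

s = Server()
s.write(1, "c1")
reader_tag = 1                                 # a slow reader learned tag 1 ...
for t in range(2, 2 + d + 1):                  # ... then d + 1 = 3 writes arrive
    s.write(t, f"c{t}")
print(s.get(reader_tag))                       # -> None: the read is starved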
196
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
197
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones.
198
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones. Use server gossip to propagate fin labels and to "complete" failed operations. End-point of a write: the point at which the operation is "completed" through gossip, or the point of failure if the operation cannot be completed through gossip.
199
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones. Use server gossip to propagate fin labels and to "complete" failed operations. End-point of a write: the point at which the operation is "completed" through gossip, or the point of failure if the operation cannot be completed through gossip. This definition of end-point suffices for defining concurrency, and yields a satisfactory liveness theorem.
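A rough sketch of the server-side behaviour summarized above (my own illustration, not the paper's code; the gossip here is deliberately simplistic and all names are invented): fin labels are gossiped between servers, and garbage collection keeps the d+1 most recently finalized tags plus any intervening ones.

```python
# Sketch (my own) of the CASGC server-side rule: keep the coded elements of
# the d + 1 most recent 'fin' tags plus any intervening tags, delete older
# ones, and gossip 'fin' labels so finalize messages lost to client failures
# still reach every server.

d = 1

class Server:
    def __init__(self, peers=None):
        self.store = {}                  # tag -> coded element
        self.fin = set()                 # tags labeled 'fin'
        self.peers = peers or []

    def pre_write(self, tag, elem):
        self.store[tag] = elem

    def finalize(self, tag, gossip=True):
        self.fin.add(tag)
        if gossip:                       # server-to-server gossip of 'fin'
            for p in self.peers:
                p.finalize(tag, gossip=False)
        self._garbage_collect()

    def _garbage_collect(self):
        if len(self.fin) <= d + 1:
            return
        cutoff = sorted(self.fin)[-(d + 1)]          # (d+1)-th latest fin tag
        for tag in [t for t in self.store if t < cutoff]:
            del self.store[tag]          # keep fin'd and intervening tags only

s1, s2 = Server(), Server()
s1.peers, s2.peers = [s2], [s1]
for t in (1, 2, 3):
    s1.pre_write(t, f"c{t}")
    s1.finalize(t)                       # gossip also delivers 'fin' to s2

print(sorted(s1.store))                  # -> [2, 3]: older elements collected
print(sorted(s2.fin))                    # -> [1, 2, 3]: learned via gossip
```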
200
Main Theorems: All operations complete if the number of writes concurrent with a read is smaller than d. The paper also gives a bound on the storage cost.
201
Main Theorems Main Insights
All operations complete if the number of writes concurrent with a read is smaller than d. The paper also gives a bound on the storage cost. Main Insights: Significant savings in network traffic overheads are possible with clever design. Server gossip is a powerful tool for good liveness in storage systems. Storage overheads depend on many factors, including the extent of concurrent client activity.
202
Summary Go back
CASGC: Liveness: conditional, tuneable. Communication cost: small. Storage cost: tuneable, quantifiable.
Viveck R. Cadambe, Nancy Lynch, Muriel Médard, and Peter Musial. A Coded Shared Atomic Memory Algorithm for Message Passing Architectures. Distributed Computing (Springer), 2017.
203
Appendix: Converse for multi-version codes, T = 3, description of worst-case state.
204
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
205
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
206
Converse: T=3
207
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
208
Converse: T=3
209
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
210
Converse: T=3 go back [Diagram: the worst-case state; all three versions are decodable from these c+2 servers, implying the storage cost bound.]