Codes for Distributed Computing


1 Codes for Distributed Computing
ISIT 2017 Tutorial Viveck R. Cadambe (Pennsylvania State University) Pulkit Grover (Carnegie Mellon University)

2 Motivation Worldwide Data Storage, 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020

3 Motivation Worldwide Data Storage, 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020 Moore’s law is saturating, Improvements in energy/speed are not “easy”

4 Massive parallelization, Distributed Computing
Motivation Worldwide Data Storage, 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020 Moore’s law is saturating, Improvements in energy/speed are not “easy” Massive parallelization, Distributed Computing Apache Hadoop, Apache Spark, GraphLab, MapReduce

5 Challenges Computing components can straggle, i.e., they can be slow
“MapReduce: simplified data processing on large clusters”, Dean-Ghemawat 08 “The tail at scale” Dean-Barroso 13 Components can be erroneous or fail “10 challenges towards exascale computing”: DOE ASCAC report 2014; Fault tolerance techniques for high performance computing – Herault/Robert “... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15 ...” [Brewer et al., Google white paper, ‘16] Data transfer cost/time is an important component of computing performance Redundancy in data storage and computation can significantly improve performance!

6 Challenges Computing components can straggle, i.e., they can be slow
“MapReduce: simplified data processing on large clusters”, Dean-Ghemawat 08 “The tail at scale” Dean-Barroso 13 Components can be erroneous or fail “10 challenges towards exascale computing”: DOE ASCAC report 2014; Fault tolerance techniques for high performance computing – Herault/Robert “... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15 ...” [Brewer et al., Google white paper, ‘16] Data transfer cost/time is an important component of computing performance Redundancy in data storage and computation can significantly improve performance!

7 Challenges Computing components can straggle, i.e., they can be slow “MapReduce: simplified data processing on large clusters”, Dean-Ghemawat 08 “The tail at scale” Dean-Barroso 13 Components can be erroneous or fail “10 challenges towards exascale computing”: DOE ASCAC report 2014; Fault tolerance techniques for high performance computing – Herault/Robert “... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15 ...” [Brewer et al., Google white paper, ‘16] Data transfer cost/time is an important component of computing performance Redundancy in data storage and computation can significantly improve performance!

8 Theme of tutorial Role of coding and information theory in contemporary distributed computing systems Models inspired by practical distributed computing systems Fundamental limits on trade-off between redundancy and performance. New coding abstractions and constructions Complements rich literature in codes for network function computing An admittedly incomplete list: [Korner-Marton 79, Doshi-Shah-Medard-Jaggi 07, Orlitsky-Roche 95, Giridhar-Kumar 05, Ramamoorthy-Langberg 13, Ma-Ishwar 11, Nazer-Gastpar 11] Communication and information complexity literature

9 Theme of tutorial Role of coding and information theory in contemporary distributed computing systems Models inspired by practical distributed computing systems Fundamental limits on trade-off between redundancy and performance. New coding abstractions and constructions Complements rich literature in codes for network function computing [Korner-Marton 79, Orlitsky-Roche 95, Giridhar-Kumar 05, Doshi-Shah-Medard-Jaggi 07, Dimakis-Kar-Moura-Rabbat-Scaglione 10, Duchi-Agarwal-Wainwright 11 Ma-Ishwar 11, Nazer-Gastpar 11, Appuswamy-Fraceschetti-Karamchandani-Zeger 13, Ramamoorthy-Langberg 13] Communication and information complexity literature …..

10 Outline Information theory and codes for shared memory emulation
Codes for distributed linear data processing

11 Outline Information theory and codes for shared memory emulation
Codes for distributed linear data processing

12

13 Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithms Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to distributed algorithms Directions of Future Research

14 Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithms Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to distributed algorithms Directions of Future Research

15 Distributed Algorithms
Algorithms for distributed networks and multi-processors. ~50 years of research Applications: Cloud Computing, networking, multiprocessor programming, Internet of Things* Information theory: data is big and changes very fast Theme of this part of the tutorial: A marriage of ideas between information theory and distributed algorithms *Typical publication venues: PODC, DISC, OPODIS, SODA, FOCS, USENIX FAST, USENIX ATC

16 Distributed Algorithms: Central assumptions and requirements
Modeling assumptions Unreliability Asynchrony Decentralized nature Requirements Fault tolerance Consistency: Service provided by system “looks as if centralized” despite unreliability/asynchrony/decentralized nature of system Consequences Simple-looking tasks difficult to achieve or sometimes impossible. Careful algorithm design, non-trivial correctness proofs Example,

17 Distributed Algorithms: Central assumptions and requirements
Modeling assumptions Unreliability Asynchrony Decentralized nature Requirements Fault tolerance Consistency: Service provided by system “looks as if centralized” despite unreliability/asynchrony/decentralized nature of system Consequences Simple-looking tasks difficult to achieve or sometimes impossible. Careful algorithm design, non-trivial correctness proofs Example,

18 Distributed Algorithms: Central assumptions and requirements
Modeling assumptions Unreliability Asynchrony Decentralized nature Requirements Fault tolerance Consistency: Service provided by system “looks as if centralized” despite unreliability/asynchrony/decentralized nature of system Consequences Simple-looking tasks can be tricky or sometimes even impossible. Careful algorithm design, non-trivial correctness proofs Example,

19 Shared memory emulation
Distributed system Classical problem in distributed computing Goal: Implement a read-write memory over a distributed system Supports two operations: Read ( variablename ) % also called “get” operation Write ( variablename, value ) % also called a “put” operation For simplicity, we focus on a single variable and omit variablename. Cloud is distributed One variable, write it or read it.

20 Shared memory emulation
Distributed system Read-write memory Classical problem in distributed computing Goal: Implement a read-write memory over a distributed system Supports two operations: Read ( variablename ) % also called “get” operation Write ( variablename, value ) % also called a “put” operation For simplicity, we focus on a single variable and omit variablename. Cloud is distributed One variable, write it or read it.

21 Shared memory emulation
Distributed system Read-write memory Classical problem in distributed computing Goal: Implement a read-write memory over a distributed system Supports two operations: Read ( variablename ) % also called “get” operation Write ( variablename, value ) % also called a “put” operation For simplicity, we focus on a single variable and omit variablename. Cloud is distributed One variable, write it or read it.

22 Shared memory emulation
Distributed system Read-write memory Classical problem in distributed computing Goal: Implement a read-write memory over a distributed system Supports two operations: Read ( variablename ) % also called “get” operation Write ( variablename, value ) % also called a “put” operation For simplicity, we focus on a single variable and omit variablename. Cloud is distributed One variable, write it or read it.
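A minimal local sketch of the interface being emulated is shown below. Because we focus on a single variable, the variablename argument is dropped; the class and its in-memory attribute are illustrative stand-ins for the distributed system, not part of the tutorial.

```python
# A minimal sketch of the read-write register interface (single variable, so the
# variablename argument is dropped; a local attribute stands in for the distributed system).
class SharedRegister:
    def __init__(self):
        self._value = None

    def write(self, value):      # the "put" operation
        self._value = value

    def read(self):              # the "get" operation
        return self._value

reg = SharedRegister()
reg.write(42)
print(reg.read())   # -> 42; the question is how to provide this same interface over many
                    #    unreliable, asynchronous servers while keeping it consistent
```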

23 Shared memory emulation: Application to cloud computing
Distributed system Theoretical underpinnings of commercial and open source key-value stores: Amazon DynamoDB, CouchDB, Apache Cassandra, Voldemort. Applications: transactions, reservation systems, multi-player gaming, social networks, news feeds, distributed computing tasks, etc. Design requires sophisticated distributed computing theory combined with great engineering. Active research field in systems and theory communities. Key-value store is like a database, widely used

24 Shared memory emulation: Application to cloud computing
Distributed system Theoretical underpinnings of commercial and open source key-value stores: Amazon DynamoDB, CouchDB, Apache Cassandra, Voldemort. Applications: transactions, reservation systems, multi-player gaming, social networks, news feeds, distributed computing tasks, etc. Design requires sophisticated distributed computing theory combined with great engineering. Active research field in systems and theory communities. Key-value store is like a database, widely used

25 Engineering Challenges in key-value store implementations
Distributed system Failure tolerance, Fast reads, Fast writes Asynchrony: Weak (no) timing assumptions. Distributed Nature: Nodes unaware of the state How do you distinguish a failed node from a very slow node? How do you ensure that all copies of data have received a write/update? Requirements: Present a consistent view of data Allow concurrent access to clients (no locks) Solutions exist to above challenges. However…. Use some other words on guarantees

26 Engineering Challenges in key-value store implementations
Distributed system Failure tolerance, Fast reads, Fast writes Asynchrony: Weak (no) timing assumptions. Distributed Nature: Nodes unaware of the state How do you distinguish a failed node from a very slow node? How do you ensure that all copies of data have received a write/update? Requirements: Present a consistent view of data Allow concurrent access to clients (no locks) Solutions exist to above challenges. However…. Use some other words on guarantees

27 Engineering Challenges in key-value store implementations
Distributed system Failure tolerance, Fast reads, Fast writes Asynchrony: Weak (no) timing assumptions. Distributed Nature: Nodes unaware of the state How do you distinguish a failed node from a very slow node? How do you ensure that all copies of data have received a write/update? Requirements: Present a consistent view of data Allow concurrent access to clients (no locks) Solutions exist to above challenges. However…. Use some other words on guarantees

28 Engineering Challenges in key-value store implementations
Distributed system Failure tolerance, Fast reads, Fast writes Asynchrony: Weak (no) timing assumptions. Distributed Nature: Nodes unaware of the state How do you distinguish a failed node from a very slow node? How do you ensure that all copies of data have received a write/update? Requirements: Present a consistent view of data Allow concurrent access to clients (no locks) Solutions exist to above challenges. However…. Use some other words on guarantees

29 Goal of this part of the tutorial
Distributed system Analytical understanding of performance (memory overhead, latency) limited. Replication used for fault tolerance and availability in practice today* Minimizing memory overhead important for several reasons Data volume is increasing exponentially Storing more data in high speed memory reduces latency. This tutorial discusses the following questions: Considerations for the use of erasure codes in such systems Information-theoretic framework Important to reduce memory overhead

30 Goal of this part of the tutorial
Distributed system Analytical understanding of performance (memory overhead, latency) limited. Replication used for fault tolerance and availability in practice today* Minimizing memory overhead important for several reasons Data volume is increasing exponentially Storing more data in high speed memory reduces latency. This tutorial discusses the following questions: Considerations for the use of erasure codes in such systems Information-theoretic framework Important to reduce memory overhead

31 Goal of this part of the tutorial
Distributed system Analytical understanding of performance (memory overhead, latency) limited. Replication used for fault tolerance and availability in practice today* Minimizing memory overhead important for several reasons Data volume is increasing exponentially Storing more data in high speed memory reduces latency. This tutorial discusses the following questions: Considerations for the use of erasure codes in such systems Information-theoretic framework Important to reduce memory overhead

32 Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research

33 Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges

34 Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges

35 Read-Write Memory Write(v) time Read()

36 Read-Write Memory Write(v) time Read()

37 Read-Write Memory Write(v) time Read() Reality check: Operations over distributed asynchronous systems cannot be modeled as instantaneous. Solution: Concept of atomicity!

38 Thought experiment: Motivation for atomicity
{ x=2.1 tic; x=3.2 %write to shared variable x toc;} { Read(x) } Two concurrent processes Time of write operation: 120 ms

39 Thought experiment: Motivation for atomicity
{ x=2.1 tic; x=3.2 %write to shared variable x toc;} { Read(x) } Two concurrent processes Time of write operation: 120 ms Question: When is the new value (3.2) of shared variable x available to a possibly concurrent read operation? 10 ms? 20 ms? 120 ms? 121 ms?
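A runnable rendering of this thought experiment is sketched below, with illustrative assumptions: the shared variable is a plain Python attribute, and the 120 ms write window is simulated with a short sleep.

```python
# Two concurrent processes sharing a variable x; when does the reader see 3.2?
import threading
import time

class Shared:
    x = 2.1

def writer():
    time.sleep(0.06)     # somewhere inside the (simulated) 120 ms write window...
    Shared.x = 3.2       # ...the new value becomes visible, but at which instant?

def reader(results):
    for _ in range(6):
        results.append(Shared.x)
        time.sleep(0.03)

results = []
threads = [threading.Thread(target=writer),
           threading.Thread(target=reader, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)           # some prefix of 2.1's followed by 3.2's; atomicity pins down
                         # which interleavings a correct implementation may expose
```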

40 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) Read() time

41 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) Read() time

42 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) Read() time

43 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) time Read()

44 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) time Read()

45 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) time Read()

46 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) time Read() Write(v) Read()

47 Atomic consistency for shared memory emulation
[Lamport 86] aka linearizability. [Herlihy, Wing 90] Write(v) time Read() Write(v) Read()

48 Examples of non-atomic executions
Write(v) Read() time Write(v) time Read()

49 Importance of Consistency
Modular algorithm design Design an application (e.g., bank accounts, reservation systems) over an “instantaneous” memory Then use an atomic distributed memory in its place Program executions are indistinguishable Weaker consistency models also useful Social networks, news feeds use weaker consistency measures for performance. In this talk, we will focus on atomic consistency.

50 Importance of Consistency
Modular algorithm design Design an application (e.g., bank accounts, reservation systems) over an “instantaneous” memory Then use an atomic distributed memory in its place Program executions are indistinguishable Weaker consistency models also useful Social networks, news feeds use weaker consistency measures for performance. In this talk, we will focus on atomic consistency.

51 Importance of Consistency
Modular algorithm design Design an application (e.g., bank accounts, reservation systems) over an “instantaneous” memory Then use an atomic distributed memory in its place Program executions are indistinguishable Weaker consistency models also useful Social networks, news feeds use weaker consistency measures for performance. In this talk, we will focus on atomic consistency.

52 Importance of Consistency
Modular algorithm design Design an application (e.g., bank accounts, reservation systems) over an “instantaneous” memory Then use an atomic distributed memory in its place Program executions are indistinguishable Weaker consistency models also useful Social networks, news feeds use weaker consistency definitions Trade-off errors for performance In this talk, we will focus on atomic consistency.

53 Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges

54 Distributed System Model
Read Clients Write Clients Servers Client-server architecture, nodes can fail, i.e., stop responding (the number of server failures is limited) Point-to-point reliable links (arbitrary delay). Nodes unaware of the current state of any other node.

55 The shared memory emulation problem
Read Clients Write Clients Servers Design write, read and server protocols Atomicity Liveness: Concurrent operations, no waiting.

56 The shared memory emulation problem
Read() { Design read protocol } write(v) { Design write protocol } Read Clients Write Clients { Design server protocol } Servers Design write, read and server protocols to ensure Atomicity Liveness: Concurrent operations, no waiting.

57 The shared memory emulation problem
Read() { Design read protocol } write(v) { Design write protocol } Read Clients Write Clients { Design server protocol } Servers Design write, read and server protocols to ensure Atomicity Liveness: Concurrent operations, (no blocking).

58 Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges

59 The ABD algorithm (sketch)
Read Clients Write Clients Servers Idea: Write and read from a majority of server nodes. Any pair of write and read operations intersect in at least one server. Algorithm works as long as only a minority of server nodes fail.

60 The ABD algorithm (sketch)
Read Clients Write Clients Servers Write protocol: Send time-stamped value to every server; return after receiving acks from a majority. Read protocol: Send read query; wait for responses from a majority; return with the latest value. Server protocol: Store the latest value; send ack Respond to read request with value

61 The ABD algorithm (sketch)
ACK ACK ACK Read Clients Write Clients ACK ACK ACK Servers Write protocol: Send time-stamped value to every server; return after receiving acks from a majority. Read protocol: Send read query; wait for sufficient responses and return with the latest value. Server protocol: Store the latest value; send ack Respond to read request with value

62 The ABD algorithm (sketch)
Query Query Query Query Query Query Read Clients Write Clients Query Servers Write protocol: Send time-stamped value to every server; return after receiving acks from a majority. Read protocol: Send read query; wait for sufficient responses and return with the latest value. Server protocol: Store the latest value; send ack Respond to read request with value Point: every server uses replication.

63 The ABD algorithm (sketch)
Read Clients Write Clients Servers Write protocol: Send time-stamped value to every server; return after receiving acks from a majority. Read protocol: Send read query; wait for a majority to respond; return with the latest value. Server protocol: Store the latest value; send ack Respond to read request with value I am not really presenting the ABD algorithm in its full generality. For instance, the reads have an additional write-back step which I am omitting. Further, the paper actually proves that this algorithm is atomic, which is not completely trivial. Point: every server uses replication. So you send big packets over the network, and you store entire values.

64 The ABD algorithm (sketch)
Read Clients Write Clients Servers Write protocol: Acquire latest tag via query; Send tagged value to every server; return after sufficient acks. Read protocol: Send read query; wait for acks from a majority; send latest value to servers; return latest value after receiving acks from a quorum. Server protocol: Respond to query with tag. Store latest value at server; send ack Respond to read request with value Pause and motivate write back

65 The ABD algorithm (sketch)
ACK ACK ACK Read Clients Write Clients ACK ACK ACK Servers Write protocol: Acquire latest tag via query; Send tagged value to every server; return after sufficient acks. Read protocol: Send read query; wait for acks from a majority; send latest value to servers; return latest value after receiving acks from a majority. Server protocol: Respond to query with tag. Store the latest value; send ack Respond to read request with value Point: every server uses replication.
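The quorum logic sketched above can be condensed into a few dozen lines. The following is a minimal single-process sketch, not the ABD algorithm in full generality: there is no real networking and no failures, "waiting for a majority" is modeled by contacting the first majority of servers, and all class and method names are illustrative.

```python
class Server:
    def __init__(self):
        self.tag, self.value = 0, None          # store only the latest (tag, value) pair

    def get_tag(self):
        return self.tag

    def store(self, tag, value):                # keep the pair with the larger tag; ack
        if tag > self.tag:
            self.tag, self.value = tag, value
        return "ack"

    def read(self):
        return self.tag, self.value


class Client:
    def __init__(self, servers):
        self.servers = servers
        self.majority = len(servers) // 2 + 1

    def write(self, value):
        # Phase 1: query a majority for the latest tag; pick a strictly larger tag.
        tag = 1 + max(s.get_tag() for s in self.servers[:self.majority])
        # Phase 2: send the tagged value to every server (the real protocol returns
        # as soon as acks from a majority have arrived).
        for s in self.servers:
            s.store(tag, value)

    def read(self):
        # Phase 1: collect (tag, value) pairs from a majority; keep the latest.
        tag, value = max((s.read() for s in self.servers[:self.majority]),
                         key=lambda tv: tv[0])
        # Phase 2 (write-back): propagate the latest pair before returning it, so that
        # no later read can return an older value.
        for s in self.servers:
            s.store(tag, value)
        return value


servers = [Server() for _ in range(5)]
Client(servers).write("v1")
print(Client(servers).read())                   # -> v1
```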

66 The ABD algorithm is atomic – proof idea
If, after an operation P terminates, (i) every future operation acquires a tag at least as large as the tag of P, (ii) every future write operation acquires a tag strictly larger than the tag of P, and (iii) every read with tag t returns the value of the corresponding write with tag t, then the algorithm is atomic. [Paraphrasing of a lemma from Lynch 96] P Write Read Acquires a tag at least as large as the tag that P propagated Why should read operations write back the value?

67 The ABD algorithm - summary
An atomic read-write memory can be implemented over a distributed asynchronous system All operations terminate so long as the number of servers that fail is a minority Design principles of several modern key-value stores mirror shared memory emulation algorithms. See description of Amazon’s Dynamo key-value store [Decandia et. al. 2008] Replication is used for fault tolerance Point: every server uses replication.

68 Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges

69 Value recoverable from any 4 codeword symbols
Erasure Coding Smaller packets, smaller overheads Example: 2-parity Reed-Solomon code Parity symbols Value recoverable from any 4 codeword symbols Size of a codeword symbol is ¼ the size of the value

70 Value recoverable from any 4 codeword symbols
Erasure Coding Smaller packets, smaller overheads Example: 2-parity Reed-Solomon code Parity symbols Value recoverable from any 4 codeword symbols Size of a codeword symbol is ¼ the size of the value New constraint: need 4 symbols with the same time-stamp
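A minimal sketch of this (6, 4) idea over the small prime field GF(257) follows. It is purely illustrative: real systems use larger binary fields and optimized libraries, and the systematic evaluation-point layout is an assumption of the sketch, not the tutorial's construction.

```python
P = 257  # small prime field, enough for byte-sized chunks; illustrative only

def lagrange_eval(points, x):
    """Evaluate the unique polynomial of degree < len(points) through `points` at x (mod P)."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(message):
    """message: 4 field elements (the value split into k = 4 chunks) -> 6 codeword symbols."""
    pts = list(enumerate(message, start=1))              # systematic symbols at points 1..4
    return [lagrange_eval(pts, x) for x in range(1, 7)]  # symbols 5 and 6 are the parities

def decode(symbols):
    """symbols: dict {evaluation point: symbol}; any 4 entries recover the value."""
    pts = list(symbols.items())[:4]
    return [lagrange_eval(pts, x) for x in range(1, 5)]

value = [10, 20, 30, 40]
code = encode(value)                                     # 6 symbols, each 1/4 the value size
assert decode({1: code[0], 3: code[2], 5: code[4], 6: code[5]}) == value
```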

71 Set up for hypothetical erasure coding based algorithm
Read Clients Write Clients Servers Write/read from any five nodes; any two such sets intersect in at least 4 nodes Operations complete if at most one node has failed. More generally, in a system with N nodes and a dimension-k code, write/read from any ⌈(N+k)/2⌉ nodes.
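A quick sanity check of this quorum arithmetic for the running example (N = 6 servers, code dimension k = 4, so quorum size ⌈(N + k)/2⌉ = 5); the brute-force enumeration below is only illustrative.

```python
from itertools import combinations
from math import ceil

N, k = 6, 4
q = ceil((N + k) / 2)                    # quorum size used by the hypothetical algorithm
min_overlap = min(len(set(a) & set(b))
                  for a in combinations(range(N), q)
                  for b in combinations(range(N), q))
print(q, min_overlap)                    # -> 5 4: any write quorum and any read quorum share
                                         #    at least k = 4 coded symbols, enough to decode,
                                         #    and operations still complete with one failed node
```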

72 Hypothetical erasure coding based algorithm - challenges
Read Clients Write Clients Servers

73 Hypothetical erasure coding based algorithm - challenges
Query Query Write Clients Query Read Clients Discard old versions to save storage Servers Query Servers store multiple versions First Challenge: reveal symbols to readers only when enough symbols are propagated Second Challenge: discard old versions safely

74 First Challenge: reveal symbols to readers only when enough symbols are propagated
Second Challenge: discard old versions safely

75 Crude, one-sentence summary
First Challenge: reveal symbols to readers only when enough symbols are propagated Second Challenge: discard old versions safely Crude, one-sentence summary: the challenges can be solved through careful algorithm design, with storage cost savings if the extent of concurrency is small. Sample algorithm in appendix

76 Different approaches to solving challenges
[Ganger et. al. 04] The HGR algorithm, [Dutta et. al 08] The ORCAS and ORCAS-B algorithm, [Dobre et. al. 14] The M-PoWerStore algorithm [Androulaki et. al. 14] AWE algorithm [Cadambe et. al. 16], The CASGC algorithm [Konwar et. al. 16] SODA algorithm Storage cost grows as the number of concurrent write operations. Noteworthy recent developments: Coding-based consistent store implementations [Zhang et. al. FAST 16], [Yu Li Chen et. al. Usenix ATC 17] Erasure coding based algorithms for consistency issues in edge computing [Konwar et. al. PODC 2017], Close the loop, clever coding, motivate concurrency What is the right information-theoretic abstraction of this system? Does the storage cost necessarily grow with concurrency? Can clever coding theoretic ideas improve storage cost?

77 Different approaches to solving challenges
[Ganger et. al. 04] The HGR algorithm, [Dutta et. al 08] The ORCAS and ORCAS-B algorithm, [Dobre et. al. 14] The M-PoWerStore algorithm [Androulaki et. al. 14] AWE algorithm [Cadambe et. al. 15], The CASGC algorithm [Konwar et. al. 16] SODA algorithm Storage cost grows as the number of concurrent write operations. Noteworthy recent developments: Coding-based consistent store implementations [Zhang et. al. FAST 16], [Chen et. al. Usenix ATC 17] Erasure coding based algorithms for consistency issues in edge computing [Konwar et. al. PODC 2017], Close the loop, clever coding, motivate concurrency What is the right information-theoretic abstraction of this system? Does the storage cost necessarily grow with concurrency? Can clever coding theoretic ideas improve storage cost?

78 Different approaches to solving challenges
[Ganger et. al. 04] The HGR algorithm, [Dutta et. al 08] The ORCAS and ORCAS-B algorithm, [Dobre et. al. 14] The M-PoWerStore algorithm [Androulaki et. al. 14] AWE algorithm [Cadambe et. al. 14], The CASGC algorithm [Konwar et. al. 16] SODA algorithm Storage cost grows as the number of concurrent write operations. Noteworthy recent developments: Coding-based consistent store implementations [Zhang et. al. FAST 16], [Chen et. al. Usenix ATC 17] Erasure coding based algorithms for consistency issues in edge computing [Konwar et. al. PODC 2017], Close the loop, clever coding, motivate concurrency Can clever coding theoretic ideas improve storage cost? Does the storage cost necessarily grow with concurrency? What is the right information-theoretic abstraction of this system?

79 Different approaches to solving challenges
[Ganger et. al. 04] The HGR algorithm, [Dutta et. al 08] The ORCAS and ORCAS-B algorithm, [Dobre et. al. 14] The M-PoWerStore algorithm [Androulaki et. al. 14] AWE algorithm [Cadambe et. al. 14], The CASGC algorithm [Konwar et. al. 16] SODA algorithm Storage cost grows as the number of concurrent write operations. Noteworthy recent developments: Coding-based consistent store implementations [Zhang et. al. FAST 16], [Chen et. al. Usenix ATC 17] Erasure coding based algorithms for consistency issues in edge computing [Konwar et. al. PODC 2017], Close the loop, clever coding, motivate concurrency Can clever coding theoretic ideas improve storage cost? Does the storage cost necessarily grow with concurrency? What is the right information-theoretic abstraction of this system?

80 Different approaches to solving challenges
[Ganger et. al. 04] The HGR algorithm, [Dutta et. al 08] The ORCAS and ORCAS-B algorithm, [Dobre et. al. 14] The M-PoWerStore algorithm [Androulaki et. al. 14] AWE algorithm [Cadambe et. al. 14], The CASGC algorithm [Konwar et. al. 16] SODA algorithm Storage cost grows as the number of concurrent write operations. Noteworthy recent developments: Coding-based consistent store implementations [Zhang et. al. FAST 16], [Chen et. al. Usenix ATC 17] Erasure coding based algorithms for consistency issues in edge computing [Konwar et. al. PODC 2017], Close the loop, clever coding, motivate concurrency Can clever coding theoretic ideas improve storage cost? Does the storage cost necessarily grow with concurrency? What is the right information-theoretic abstraction of this system?

81 Break

82 Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research

83 Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes [Wang Cadambe, Accepted to Trans. IT, 2017]

84 Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes [Wang Cadambe, Accepted to Trans. IT, 2017]

85 Toy Model for packet arrivals, links
Read Clients (Decoders) Write Clients N Servers f failures possible

86 Toy Model for packet arrivals, links
Read Clients (Decoders) Write Clients N Servers f failures possible

87 Toy Model for packet arrivals, links
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at client: New version in every time slot. Sent immediately to the servers. Channel from the write client to the server: Delay is an integer in [0,T-1]. Channel from server to read client: instantaneous (no delay). Goal: decoder invoked at time t, gets the latest common version among c servers T – degree of asynchrony; every time stamp will be called a version

88 Toy Model for packet arrivals, links
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at client: New version in every time slot. Sent immediately to the servers. Channel from the write client to the server: Delay is an integer in [0,T-1]. Channel from server to read client: instantaneous (no delay). Goal: decoder invoked at time t, gets the latest common version among c servers

89 Toy Model for packet arrivals, links
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at client: New version in every time slot. Sent immediately to the servers. Channel from the write client to the server: Delay is an integer in [0,T-1]. Channel from server to read client: instantaneous (no delay). Goal: decoder invoked at time t, gets the latest common version among c servers
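A small simulation sketch of this arrival model is given below. The parameters N = 5, T = 3, c = N − 2f = 3 and the horizon are illustrative assumptions, and the random delays only stand in for the adversarial delays of the actual model.

```python
import random
from itertools import combinations

N, T, c, horizon = 5, 3, 3, 10
random.seed(0)

# received[s] = versions that have reached server s by time `horizon`
received = [set() for _ in range(N)]
for version in range(1, horizon + 1):        # a new version arrives in every time slot...
    for s in range(N):
        delay = random.randrange(T)          # ...and reaches each server with delay in [0, T-1]
        if version + delay <= horizon:
            received[s].add(version)

# The decoder, invoked at time `horizon`, must return the latest common version
# (or a later one) for every possible set of c servers it may connect to.
for group in combinations(range(N), c):
    common = set.intersection(*(received[s] for s in group))
    print(group, max(common))
```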

90 Toy Model – Decoding requirement
N Servers f failures possible A version is complete at time t if it has arrived at N-f servers. Decoding requirement for decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. Mirrors erasure coding based shared memory emulation protocols. We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c=N-2f servers Goal: decoder invoked at time t, gets the latest common version among c servers

91 Toy Model – Decoding requirement
N Servers f failures possible A version is complete at time t if it has arrived at N-f servers. Decoding requirement for decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. Mirrors erasure coding based shared memory emulation protocols. We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c=N-2f servers Goal: decoder invoked at time t, gets the latest common version among c servers

92 Toy Model – Decoding requirement
N Servers f failures possible A version is complete at time t if it has arrived at N-f servers. Decoding requirement for decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. Mirrors erasure coding based shared memory emulation protocols. We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c=N-2f servers Goal: decoder invoked at time t, gets the latest common version among c servers

93 Toy Model – Decoding requirement
N Servers f failures possible A version is complete at time t if it has arrived at N-f servers. Decoding requirement for decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. Mirrors erasure coding based shared memory emulation protocols. We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c servers. The two decoding requirements have the same worst-case storage costs if c=N-2f Goal: decoder invoked at time t, gets the latest common version among c servers

94 Asynchrony Toy Model – Decoding requirement
c servers Snapshot at time Consistency: Decode the latest possible version (globally consistent view) Asynchrony Every server receives different subsets of versions Every server has no information about others Decoder connects to any c servers, decodes the latest common version or a later one

95 Asynchrony Toy Model – Decoding requirement
c servers Snapshot at time Consistency: Decode the latest possible version (globally consistent view) Asynchrony Every server receives different subsets of versions Every server has no information about others Decoder connects to any c servers, decodes the latest common version or a later one Goal: construct a storage method that minimizes storage cost

96 Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes

97 The multi-version coding (MVC) problem
Consistency: Decode the latest possible version (globally consistent view) Asynchrony Every server receives different subsets of versions Every server has no information about others

98 The multi-version coding (MVC) problem
Consistency: Decode the latest possible version (globally consistent view) Asynchrony Every server receives different subsets of versions Every server has no information about others

99 The multi-version coding (MVC) problem
Consistency: Decode the latest possible version (globally consistent view) Asynchrony Every server receives different subsets of versions Every server has no information about others

100 The MVC problem n servers T versions c connectivity
Goal: decode the latest common version or later version among every set of c servers Minimize the storage cost Worst case, across all “states” across all servers

101 Solution 1: Replication
Storage size = size-of-one-version N=4, T=2, c=2

102 Solution 1: Replication
Storage size = size-of-one-version N=4, T=2, c=2

103 Solution 2: (N,c) Maximum Distance Separable code
Question: Can we store a codeword symbol corresponding to the latest version?

104 Solution 2: (N,c) Maximum Distance Separable code
Storage size = T/c*size-of-one-version N=4, T=2, c=2 1/2 1/2 1/2 1/2 1/2 1/2 Separate coding across versions. Each server stores all the versions received.
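For a back-of-the-envelope comparison of the two baselines, the sketch below computes per-server storage normalized by the size of one version: replication keeps one full copy of the latest version, while the naive MDS scheme keeps a 1/c-size symbol for each of up to T versions received. The parameter pairs are illustrative.

```python
def replication_cost(T, c):
    return 1.0               # one full copy of (the latest) version per server

def naive_mds_cost(T, c):
    return T / c             # a 1/c-size MDS symbol for each of up to T versions

for T, c in [(2, 2), (2, 4), (4, 4), (8, 4)]:
    print(f"T={T}, c={c}: replication={replication_cost(T, c):.2f}, "
          f"naive MDS={naive_mds_cost(T, c):.2f}")
# Once T grows past c (high concurrency/asynchrony), the naive MDS baseline stores more
# than plain replication, which is the gap the multi-version codes below aim to close.
```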

105 Summary of results Naïve MDS codes Replication Storage cost
Number of versions T

106 Summary of results Naïve MDS codes Replication Singleton bound
Storage cost Singleton bound Number of versions T

107 Summary of results Naïve MDS codes Replication Our achievable scheme
Storage cost Our achievable scheme Singleton bound Number of versions T

108 Summary of results Naïve MDS codes Replication Our achievable scheme
Storage cost Our achievable scheme Our converse Singleton bound Number of versions T

109 Summary of results Naïve MDS codes Replication Storage cost Our achievable scheme Our converse Singleton bound Number of versions T Storage cost inevitably increases as degree of asynchrony grows!

110 Normalized by size-of-value
Summary of results Storage Cost Normalized by size-of-value Replication 1 Naïve MDS codes Constructions* Lower bound Typo!

111 Normalized by size-of-value
Summary of results Storage Cost Normalized by size-of-value Replication 1 Naïve MDS codes Constructions* Lower bound Typo! *Achievability can be improved (see [Wang, Cadambe, Accepted to Trans. IT, 17])

112 Main insights and techniques
Redundancy required to ensure consistency in an asynchronous environment Amount of redundancy grows with the degree of asynchrony T Connected to pliable index coding [Brahma-Fragouli 12] Exercises in network information theory, can be converted to exercises in combinatorics Achievability: Separate linear code for each version, Carefully choose the “budget” for each version based on the set of received versions. Genie based converse, discover “worst-case” arrival patterns

113 Main insights and techniques
Redundancy required to ensure consistency in an asynchronous environment Amount of redundancy grows with the degree of asynchrony T Connected to pliable index coding [Brahma-Fragouli 12] Exercises in network information theory, can be converted to exercises in combinatorics Achievability: Separate linear code for each version, Carefully choose the “budget” for each version based on the set of received versions. Genie based converse, discover “worst-case” arrival patterns

114 Main insights and techniques
Redundancy required to ensure consistency in an asynchronous environment Amount of redundancy grows with the degree of asynchrony T Connected to pliable index coding [Brahma-Fragouli 12] Exercises in network information theory, can be converted to exercises in combinatorics Achievability: Separate linear code for each version, Carefully choose the “budget” for each version based on the set of received versions. Genie based converse, discover “worst-case” arrival patterns

115 Achievability

116 Achievability

117 Achievability

118 Achievability Therefore, the version corresponding to at least one partition is decodable

119 Main insights and techniques
Redundancy required to ensure consistency in an asynchronous environment Amount of redundancy grows with the degree of asynchrony T Connected to pliable index coding [Brahma-Fragouli 12] Exercises in network information theory, can be converted to exercises in combinatorics Achievability: Separate linear code for each version, Carefully choose the “budget” for each version based on the set of received versions. Genie based converse, discover “worst-case” arrival patterns

120 Start with c servers

121 State vector s1= (1,1,……1), Version 1 is decodable
State vector s3 = (2,2, …, 2, 1, 1, …, 1): Minimal state vector s.t. version 2 is decodable State vector s4 = (2,2, …, 1, 1, 1, …, 1): Maximal state vector s.t. version 1 is decodable Versions 1 and 2 decodable from c+1 symbols c symbols in s3 and one changed symbol in s4

122 Start with c servers

123 State vector s1= (1,1,……1), Version 1 is decodable
State vector s3 = (2,2, …, 2, 1, 1, …, 1): Minimal state vector s.t. version 2 is decodable State vector s4 = (2,2, …, 1, 1, 1, …, 1): Maximal state vector s.t. version 1 is decodable Versions 1 and 2 decodable from c+1 symbols c symbols in s3 and one changed symbol in s4

124 Start with c servers Explain more Propagate version 2 to a minimal set of servers such that it is decodable

125 State vector s1= (1,1,……1), Version 1 is decodable
State vector s3 = (2,2, …, 2, 1, 1, …, 1): Minimal state vector s.t. version 2 is decodable State vector s4 = (2,2, …, 1, 1, 1, …, 1): Maximal state vector s.t. version 1 is decodable Versions 1 and 2 decodable from c+1 symbols c symbols in s3 and one changed symbol in s4

126 Start with c servers Explain more Propagate version 2 to a minimal set of servers such that it is decodable

127 State vector s1= (1,1,……1), Version 1 is decodable
State vector s3 = (2,2, …, 2, 1, 1, …, 1): Minimal state vector s.t. version 2 is decodable State vector s4 = (2,2, …, 1, 1, 1, …, 1): Maximal state vector s.t. version 1 is decodable Versions 1 and 2 decodable from c+1 symbols c symbols in s3 and one changed symbol in s4

128 State vector s1= (1,1,……1), Version 1 is decodable
State vector s3 = (2,2, …, 2, 1, 1, …, 1): Minimal state vector s.t. version 2 is decodable State vector s4 = (2,2, …, 1, 1, 1, …, 1): Maximal state vector s.t. version 1 is decodable Versions 1 and 2 decodable from c+1 symbols c symbols in s3 and one changed symbol in s4

129 Versions 1 and 2 decodable from c+1 symbols
Start with c servers Explain more Propagate version 2 to a minimal set of servers such that it is decodable Versions 1 and 2 decodable from c+1 symbols
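The counting step behind this genie argument can be written out for T = 2. The display below is a sketch under the assumption that the two versions $W_1, W_2$ are independent and uniform over a set $\mathcal{V}$; see [Wang Cadambe Trans. IT 17] for the precise statement.

```latex
% Versions 1 and 2 are both decodable from c + 1 stored symbols S_1,\dots,S_{c+1}
% (the c symbols in state s3 together with the one changed symbol of state s4), so
\[
  (c+1)\max_i H(S_i) \;\ge\; \sum_{i=1}^{c+1} H(S_i) \;\ge\; H(W_1, W_2) \;=\; 2\log|\mathcal{V}| .
\]
% Hence some server stores at least \tfrac{2}{c+1}\log|\mathcal{V}| bits: the normalized
% per-server storage cost is at least 2/(c+1), which exceeds the 1/c cost of storing a
% single version with an (N, c) MDS code whenever c > 1.
```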

130 Main insights and techniques
Redundancy required to ensure consistency in an asynchronous environment Amount of redundancy grows with the degree of asynchrony T Connected to pliable index coding [Brahma-Fragouli 12] Exercises in network information theory, can be converted to exercises in combinatorics Achievability: Separate linear code for each version, Carefully choose the “budget” for each version based on the set of received versions. Genie based converse, discover “worst-case” arrival patterns More challenging combinatorial puzzle for T > 2.

131 Summary of results Naïve MDS codes Replication Storage cost Our achievable scheme Our converse Singleton bound Number of versions T Storage cost inevitably increases as degree of asynchrony grows!

132 Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes

133 Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible

134 Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at clients: arbitrary Channel from clients to servers: arbitrary delay, reliable Clients, servers modeled as I/O automata, protocols can be designed.

135 Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at clients: arbitrary Channel from clients to servers: arbitrary (unbounded) delay, reliable Clients, servers modeled as I/O automata, protocols can be designed.

136 Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at clients: arbitrary Channel from clients to servers: arbitrary (unbounded) delay, reliable Clients, servers modeled as I/O automata, protocols can be designed.

137 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Storage cost Generalization of the Singleton bound. [C-Lynch-Wang, ACM PODC 2016]

138 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost: Generalization of the Singleton bound. Non-trivial due to the interactive nature of the protocols. [C-Lynch-Wang, ACM PODC 2016]

139 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost: Generalization of the Singleton bound. Non-trivial due to the interactive nature of the protocols. [C-Lynch-Wang, ACM PODC 2016]

140 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost Generalization of the MVC converse for T=2. Open question: Generalization of MVC bounds for T > 2 to the full-fledged distributed systems theoretic model. [C-Lynch-Wang, ACM PODC 2016]

141 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost: Generalization of the MVC converse for T=2. The generalization of the MVC converse for T > 2 works for non-interactive protocols: T = number of concurrent write operations. [C-Lynch-Wang, ACM PODC 2016]

142 Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost: Generalization of the MVC converse for T=2. The generalization of the MVC converse for T > 2 works for non-interactive protocols: T = number of concurrent write operations. [C-Lynch-Wang, ACM PODC 2016]

143 Storage Cost bounds for Shared Memory Emulation
ABD algorithm Erasure coding based algorithms Storage Cost Second lower bound* First lower bound Baseline lower bound Number of concurrent writes

144 Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research

145 Directions of Future Research
Classical Codes for Distributed Storage Multi-version codes System is totally asynchronous, every version arrival pattern is possible Nodes do not even have stale information of the system state Nodes do not even have partial information of the system state. System is synchronous Nodes have instantaneous system state information Nodes have global system state information.

146 Directions of Future Research
Classical Codes for Distributed Storage Multi-version codes Pessimistic/conservative model System is totally asynchronous, every version arrival pattern is possible Nodes do not even have stale information of the system state Nodes do not even have partial information of the system state. Optimistic model System is synchronous Nodes have instantaneous system state information Nodes have global system state information.

147 Directions of Future Research
Classical Codes for Distributed Storage Multi-version codes Pessimistic/conservative model System is totally asynchronous, every version arrival pattern is possible Nodes do not even have stale information of the system state Nodes do not even have partial information of the system state. Optimistic model System is synchronous Nodes have instantaneous system state information Nodes have global system state information. Practice

148 Directions of future research
Beyond the worst-case model Correlated versions

149 Codes with Correlated Versions
[Ali-C, ITW 2016] Markov Chain . Uniform

150 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ?

151 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ? “Closeness” in message “Closeness” in codeword Delta coding
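A toy illustration of the delta-coding idea follows. It is purely illustrative: integer chunks and a plain difference stand in for the field arithmetic and the Markov update model of [Ali-C, ITW 2016], and the stored quantities are hypothetical.

```python
v1 = [10, 20, 30, 40]
v2 = [10, 21, 30, 40]                        # a "close" update: only one chunk changed

delta = [b - a for a, b in zip(v1, v2)]      # sparse, hence cheap to describe/compress
stored = {
    "chunk_of_v1": v1[0],                    # stands in for one codeword symbol of version 1
    "delta": delta,                          # plus a short description of the update
}
print(stored)
# A reader that can decode version 1 applies `delta` to obtain version 2, so each server
# pays roughly one coded symbol plus the (small) cost of the update, rather than two
# independent full-size symbols.
```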

152 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ? Storage cost as opposed to

153 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ?

154 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ? Apply Slepian-Wolf ideas

155 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ? Apply Slepian-Wolf ideas Storage cost is as opposed to

156 Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2 1/2 1/2 1/2 1/2 ? ? Open: Information-theoretically optimal schemes for all regimes of Open: Practical code constructions

157 Directions of future research
Beyond the worst-case model Correlated versions (Limited) server co-operation, exchange possibly partial state information Average-case storage cost: good storage cost in typical states, possibly with larger worst-case storage cost. Beyond the toy model Explore relations between timing assumptions and storage cost Open question: Can interactive protocols help, or does the storage cost necessarily grow with T?

158 Directions of future research
Beyond the worst-case model Correlated versions (Limited) server co-operation, exchange possibly partial state information Average-case storage cost: good storage cost in typical states, possibly with larger worst-case storage cost. Beyond the toy model Explore relations between timing assumptions and storage cost Open question: Can interactive protocols help, or does the storage cost necessarily grow with T?

Beyond the toy model Toy model serves as a bridge between shared memory emulation and our network information theoretic formulation Open: Can an interactive write protocol improve storage cost? Future: realistic model, expose connections between channel delay uncertainty, staleness of reads, and storage cost

Beyond the toy model Toy model serves as a bridge between shared memory emulation and our network information theoretic formulation Future work: realistic model, expose connections between channel delay uncertainty, staleness of reads, and storage cost Open: Can an interactive write protocol improve storage cost?

Beyond the toy model Toy model serves as a bridge between shared memory emulation and our network information theoretic formulation Future work: realistic model, expose connections between channel delay uncertainty, staleness of reads, and storage cost Open: Can an interactive write protocol improve storage cost?

162 Directions of future research
Beyond the worst-case model Correlated versions (Limited) server co-operation, exchange possibly partial state information Average-case storage cost: good storage cost in typical states, possibly with larger worst-case storage cost. Beyond the toy model Open question: Can interactive protocols help, or does the storage cost necessarily grow with T? Relation between system response, decoding requirement and storage cost

163 Beyond read-write memory
Several systems: more complicated data structures over distributed asynchronous systems Transactions: Multiple read-write objects, more complicated consistency requirements. Graph based data structures Question: How do you “erasure code” more complicated data structures and state machines? Initial clues provided in [Balasubramanian-Garg 14]

164 Beyond read-write memory
Several systems: more complicated data structures over distributed asynchronous systems Transactions: Multiple read-write objects, more complicated consistency requirements. Graph based data structures Question: How do you “erasure code” more complicated data structures and state machines? Initial clues provided in [Balasubramanian-Garg 14]

165 Directions of future research
Beyond the worst-case model Correlated versions (Limited) server co-operation, exchange possibly partial state information Average-case storage cost: good storage cost in typical states, possibly with larger worst-case storage cost. Beyond the toy model Open question: Can interactive protocols help, or does the storage cost necessarily grow with T? Relation between system response, decoding requirement and storage cost Beyond read-write data structures

166 Thanks

167 References [Lynch 96] N. A. Lynch, Distributed Algorithms. USA: Morgan Kaufmann Publishers Inc., 1996. [Lamport 86] L. Lamport, “On interprocess communication. Part I: Basic formalism,” Distributed Computing, vol. 2, no. 1, pp. 77–85, 1986. [Vogels 09] W. Vogels, “Eventually consistent,” Queue, vol. 6, no. 6, pp. 14–19, 2008. [Attiya-Bar-Noy-Dolev 95] H. Attiya, A. Bar-Noy, and D. Dolev, “Sharing memory robustly in message-passing systems,” J. ACM, vol. 42, no. 1, pp. 124–142, Jan. 1995. [Decandia et. al. 07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” in SOSP, vol. 7, 2007, pp. 205–220. [Hewitt 10] E. Hewitt, Cassandra: The Definitive Guide. O’Reilly Media, Inc., 2010. [Hendricks et. al. 07] J. Hendricks, G. R. Ganger, and M. K. Reiter, “Low-overhead Byzantine fault-tolerant storage,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 73–86, 2007. [Dutta et. al. 08] P. Dutta, R. Guerraoui, and R. R. Levy, “Optimistic erasure-coded distributed storage,” in Distributed Computing. Springer, 2008, pp. 182–196. [Cadambe et. al. 14] V. R. Cadambe, N. Lynch, M. Medard, and P. Musial, “A coded shared atomic memory algorithm for message passing architectures,” in 2014 IEEE 13th International Symposium on Network Computing and Applications (NCA). IEEE, 2014, pp. 253–260. [Dobre et. al. 13] D. Dobre, G. Karame, W. Li, M. Majuntke, N. Suri, and M. Vukolić, “PoWerStore: proofs of writing for efficient and robust storage,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 285–298. [Konwar et. al. PODC 2017] K. M. Konwar, N. Prakash, N. A. Lynch, and M. Médard, “A layered architecture for erasure-coded consistent distributed storage,” CoRR, 2017, accepted to 2017 ACM Principles of Distributed Computing.

168 References [Zhang et. al. FAST 16] H. Zhang, M. Dong, and H. Chen, “Efficient and available in-memory kv-store with hybrid erasure coding and replication,” in FAST, 2016, pp. 167–180. [Chen et. al. 2017] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, “Giza: Erasure coding objects across global data centers,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017. [Herlihy Wing 90] M. P. Herlihy and J. M. Wing, “Linearizability: a correctness condition for concurrent objects,” ACM Trans. Program. Lang. Syst., vol. 12, pp. 463–492, July 1990. [Cadambe-Lynch-Wang ACM PODC 16] V. R. Cadambe, Z. Wang, and N. Lynch, “Information-theoretic lower bounds on the storage cost of shared memory emulation,” in Proceedings of the ACM Symposium on Principles of Distributed Computing, ser. PODC ’16. ACM, 2016, pp. 305–314. [Wang Cadambe Trans. IT 17] Z. Wang and V. Cadambe, “Multi-version coding – An Information-Theoretic Perspective of Consistent Distributed Storage,” to appear in IEEE Transactions on Information Theory. [Brahma Fragouli 2012] S. Brahma and C. Fragouli, “Pliable index coding,” in 2012 IEEE International Symposium on Information Theory Proceedings (ISIT). IEEE, 2012, pp. 2251–2255. [Ali-Cadambe ITW 16] R. E. Ali and V. R. Cadambe, “Consistent distributed storage of correlated data updates via multi-version coding,” in Information Theory Workshop (ITW), 2016 IEEE. IEEE, 2016, pp. 176–180. [Balasubramanian Garg 14] B. Balasubramanian and V. K. Garg, “Fault tolerance in distributed systems using fused state machines,” Distributed Computing, vol. 27, no. 4, pp. 287–311, 2014.

171 Appendix: Binary consensus - a simple-looking task that is impossible to achieve in a decentralized asynchronous system

172 Fischer-Lynch-Paterson (FLP) impossibility result (informal)
A famous impossibility result. Two processors P1, P2; each processor begins with an initial value in {0,1}. They can exchange messages over a reliable link, but with arbitrary (unbounded) delay. Goal: design a protocol such that (a) both processors agree on the same value, which is an initial value of some processor, and (b) each non-failed processor eventually decides. The FLP result, informally: no deterministic protocol can guarantee both (a) and (b) in such an asynchronous system if even one processor may crash.

173 Appendix: Why does a read operation need to write back in the ABD algorithm?

174 What happens if a read does not write back?
The following execution is possible if a read does not write back. (Timeline figure: writes W1, W2 and reads R1, R2, R3 over 6 servers.) An example of a wrong execution: suppose there are 6 servers. Write W1's value v(1) reached all 6 servers before W2 started. Write W2 had sent its value v(2) only to server 1 by the time read R2 started. Read R2 got responses from servers 1, 2, 3, 4, and therefore returned v(2). Server 1 failed after R2 completed but before the next read, R3, started. R3 then received responses only from servers holding v(1), so it returned v(1) (it cannot see v(2)!). Finally, after R3 completed, v(2) reached the remaining non-failed servers. R3 follows R2 yet returns an older value, violating atomicity; having R2 write v(2) back to a quorum before returning prevents this.
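To make the fix concrete, here is a minimal sketch of an ABD-style (replication-based) read that writes back before returning. It is illustrative only: the helpers `send_to_all` and `wait_for_quorum` are hypothetical stand-ins for the message layer, not part of the algorithm as published.

```python
# Minimal sketch of an ABD-style read with write-back (replication, not erasure coding).
# Assumptions (not from the slides): N servers, majority quorums, and hypothetical
# helpers send_to_all / wait_for_quorum standing in for the message layer.

N = 6                    # number of servers in the example above
QUORUM = N // 2 + 1      # any two majorities intersect in at least one server

def abd_read(send_to_all, wait_for_quorum):
    # Phase 1: query the servers and wait for (tag, value) pairs from a quorum.
    responses = wait_for_quorum(send_to_all(("QUERY",)), QUORUM)
    tag, value = max(responses)   # tags are totally ordered, e.g. (counter, client_id)

    # Phase 2 (the write-back): propagate the chosen (tag, value) to a quorum
    # BEFORE returning, so any read that starts later sees a tag at least this
    # large -- this is exactly what rules out the execution described above.
    wait_for_quorum(send_to_all(("WRITE_BACK", tag, value)), QUORUM)
    return value
```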

175 Appendix: An Erasure Coding Based Algorithm
Algorithm from [Cadambe-Lynch-Medard-Musial, IEEE NCA 2014], extended version in Distributed Computing (Springer) 2016.

176 Coded Atomic Storage (CAS)
CAS: solves the first challenge of revealing correct elements to readers; good communication cost, but infinite storage cost.
Failed attempt at garbage collection: attempts to solve the challenge of discarding old versions; good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): solves both challenges; uses server gossip to propagate metadata; good storage and communication cost; good handling of client failures.


178 Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. N servers, at most f failures. Use an MDS code of dimension k, where the number of failures f is no bigger than (N-k)/2 (equivalently, k ≤ N - 2f). Every set of at least (N+k)/2 server nodes is referred to as a "quorum set"; note that any two quorum sets intersect in at least k nodes. An additional "fin" label at servers indicates that a sufficient number of coded elements of that version have been propagated. An additional write phase tells the servers that elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).
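As a quick sanity check of the quorum properties stated above (a sketch of the arithmetic, using the same N, k, and f):

```latex
% Quorum size |Q| >= (N+k)/2: any two quorum sets intersect in at least k servers.
\[
|Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2|
\;\ge\; \tfrac{N+k}{2} + \tfrac{N+k}{2} - N \;=\; k .
\]
% Moreover, if the number of failures satisfies f <= (N-k)/2 (i.e., k <= N - 2f),
% then a quorum of non-failed servers always exists:
\[
f \le \tfrac{N-k}{2} \quad\Longleftrightarrow\quad \tfrac{N+k}{2} \le N - f .
\]
```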

179 System diagram (read clients, write clients, servers): the callout at a server reads "has been propagated to a quorum", illustrating the meaning of the "fin" label.

180 CAS – Protocol overview
Write: Acquire the latest tag; send the (incremented) tag and a coded element to every server; send a finalize message after getting acks from a quorum; return after receiving acks from a quorum.
Read: Send a read query; wait for tags from a quorum; send a request with the latest tag to the servers; decode the value after receiving coded elements from a quorum.
Servers: Store the coded element and send an ack. Set the "fin" flag for the tag on receiving a finalize message and send an ack. Respond to a read query with the latest finalized tag. Finalize the requested tag and respond to the read request with the stored coded element.
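As an illustration of the write path above, here is a minimal Python-style sketch. The parameters and the helpers (`query_tags`, `mds_encode`, `send_to_all`, `wait_for_quorum`) are assumptions made for the example, not the paper's API.

```python
# Sketch of a CAS write client (illustrative only). Assumptions made for the example:
# N servers, an (N, K) MDS code with K <= N - 2F, quorum size ceil((N+K)/2), and
# hypothetical helpers query_tags / mds_encode / send_to_all / wait_for_quorum;
# send_to_all broadcasts one message, or delivers per-server messages when given a list.
import math

N, K, F = 6, 2, 2                      # example parameters satisfying K <= N - 2F
QUORUM = math.ceil((N + K) / 2)        # here: ceil(8/2) = 4

def cas_write(value, client_id, query_tags, mds_encode, send_to_all, wait_for_quorum):
    # Phase 1 (query): learn the highest tag from a quorum and increment it.
    tags = wait_for_quorum(query_tags(), QUORUM)
    z, _ = max(tags)
    new_tag = (z + 1, client_id)       # tag = (integer, client id), totally ordered

    # Phase 2 (pre-write): send one coded element per server, wait for acks from a quorum.
    coded = mds_encode(value, N, K)    # assumed to return a list of N coded elements
    acks = send_to_all([("WRITE", new_tag, coded[i]) for i in range(N)])
    wait_for_quorum(acks, QUORUM)

    # Phase 3 (finalize): tell the servers the tag may be revealed to readers,
    # and return once a quorum has acknowledged the 'fin' label.
    wait_for_quorum(send_to_all(("FINALIZE", new_tag)), QUORUM)
```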

183 Message flow of a CAS write (animation across slides 183–187): the write client sends coded elements to the servers, collects ACKs from a quorum, sends "fin" (finalize) messages, and collects a second round of ACKs.

188 CAS – Protocol overview
Write: Acquire the latest tag; send the tag and a coded element to every server; send a finalize message after getting acks from a quorum; return after receiving acks from a quorum.
Read: Send a read query; wait for tags from a quorum; send a request with the latest tag to the servers; decode the value after receiving coded elements from a quorum.
Servers: Store the coded element and send an ack. Set the "fin" flag for the tag on receiving a finalize message and send an ack. Respond to a read query with the latest tag labeled "fin". Label the requested tag as "fin" and respond to the read request with the coded element, if available.
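The server-side bullets above can be read as a small state machine. The sketch below is an illustrative reconstruction; the handler names and the storage layout are assumptions, not the algorithm's actual code.

```python
# Sketch of a CAS server's handlers (illustrative reconstruction of the bullets above;
# handler names and the storage layout are assumptions, not the paper's code).
class CASServer:
    def __init__(self):
        # tag -> {"element": coded element or None, "fin": bool}.
        # Plain CAS never deletes entries, which is why its storage cost is unbounded.
        self.store = {}

    def on_write(self, tag, coded_element):
        # Store the coded element for this tag and acknowledge.
        rec = self.store.setdefault(tag, {"element": None, "fin": False})
        rec["element"] = coded_element
        return "ACK"

    def on_finalize(self, tag):
        # Label the tag 'fin' (even if its coded element has not arrived yet) and ack.
        rec = self.store.setdefault(tag, {"element": None, "fin": False})
        rec["fin"] = True
        return "ACK"

    def on_read_query(self):
        # Respond with the highest tag carrying the 'fin' label (None if there is none).
        fin_tags = [t for t, rec in self.store.items() if rec["fin"]]
        return max(fin_tags) if fin_tags else None

    def on_read_request(self, tag):
        # Label the requested tag 'fin'; respond with its coded element if available.
        rec = self.store.setdefault(tag, {"element": None, "fin": False})
        rec["fin"] = True
        return rec["element"]
```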

191 Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. The additional "fin" label at servers indicates that the corresponding version has been propagated to sufficiently many servers. The additional write phase tells the servers that elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).

192 CAS – Protocol overview
Theorem: 1) CAS satisfies atomicity. 2) Liveness: every operation returns, provided the number of server failures stays within the pre-specified threshold f.


194 Keep at most d+1 elements
Possible solution: each server stores at most d+1 coded elements and deletes older ones. (Diagram: read clients, write clients, and servers, each server keeping at most d+1 elements.)

195 Modification of CAS
Possible solution: store at most d+1 coded elements and delete older ones (a minimal sketch of this rule follows below).
The good: finite storage cost; all operations terminate if the number of writes that overlap a read is smaller than d; atomicity (shown via a simulation relation with CAS).
The bad: failed write clients result in a weak liveness condition, that is, d failed writes can render all future reads incomplete. (Timeline figure: an operation that does not end is concurrent with all future writes.)
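For concreteness, a minimal sketch of this naive truncation rule (illustrative only; the `store` layout is an assumption):

```python
# Sketch of the naive rule (illustrative): keep only the d+1 coded elements with the
# highest tags and delete everything older, regardless of 'fin' labels.
def naive_truncate(store, d):
    """store: dict mapping tag -> coded element; returns a store with at most d+1 entries."""
    newest = sorted(store, reverse=True)[: d + 1]
    return {t: store[t] for t in newest}
```

The weakness is visible in the sketch: deletion depends only on how many newer tags exist, irrespective of whether those writes ever completed, which is what leads to the weak liveness condition above.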


197 The CASGC algorithm: The main novelties
Client protocol same as CAS; only the differences in the server protocol are summarized here.
Keep the d+1 coded elements with the "fin" label and all intervening elements; delete older ones (see the sketch after this list).
Use server gossip to propagate "fin" labels and to "complete" failed write operations.
End-point of an operation: the point at which it is "completed" through gossip, or the point of failure if it cannot be completed through gossip. This definition of end-point suffices for defining concurrency and for a satisfactory liveness theorem.
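A hedged sketch of the garbage-collection rule just described (the `store` layout is an assumption and gossip handling is omitted for brevity):

```python
# Sketch of the CASGC garbage-collection rule at a server (illustrative only).
# Rule from the slide: keep the d+1 highest tags carrying a 'fin' label, plus every
# intervening tag; delete everything strictly older.
def garbage_collect(store, d):
    """store: dict mapping tag -> {"element": ..., "fin": bool}."""
    fin_tags = sorted((t for t, rec in store.items() if rec["fin"]), reverse=True)
    if len(fin_tags) <= d:
        return store                   # fewer than d+1 finalized versions: keep everything
    cutoff = fin_tags[d]               # the (d+1)-th highest tag carrying a 'fin' label
    # Keep the d+1 highest finalized tags and every intervening tag; drop older ones.
    return {t: rec for t, rec in store.items() if t >= cutoff}
```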

200 Main Theorems
All operations complete if the number of writes concurrent with a read is smaller than d. The paper also gives a bound on the storage cost.
Main Insights
Significant savings in network traffic overheads are possible with clever design. Server gossip is a powerful tool for good liveness in storage systems. Storage overheads depend on many factors, including the extent of concurrent client activity.

202 Summary
CASGC: Liveness – conditional, tuneable; Communication cost – small; Storage cost – tuneable, quantifiable.
Viveck R. Cadambe, Nancy Lynch, Muriel Médard, and Peter Musial, "A Coded Shared Atomic Memory Algorithm for Message Passing Architectures," Distributed Computing (Springer), 2017.

203 Appendix: Converse for multi-version codes, T = 3, description of worst-case state.

204 Converse: T=3 (worst-case state construction, animation across slides 204–210)
The slides build up the worst-case server state step by step, distributing versions 1, 2, and 3 across the servers; in the final configuration all three versions are decodable from a set of c+2 servers, implying the storage cost bound.

