1
Codes for Distributed Computing
ISIT 2017 Tutorial Viveck R. Cadambe (Pennsylvania State University) Pulkit Grover (Carnegie Mellon University)
4
Motivation. Worldwide data storage: 0.8 ZB in 2009, 4.4 ZB in 2013, 44 ZB in 2020. Moore's law is saturating; improvements in energy/speed are not "easy". Massive parallelization, distributed computing: Apache Hadoop, Apache Spark, GraphLab, MapReduce, …
7
Challenges
Computing components can straggle, i.e., they can be slow ["Map-reduce: simplified data processing on large clusters", Dean-Ghemawat 08; "The tail at scale", Dean-Barroso 13].
Components can be erroneous or fail ["10 challenges towards exascale computing": DOE ASCAC report 2014; Fault-tolerance techniques for high-performance computing, Herault-Robert].
"... since data of value is never just on one disk, the BER for a single disk could actually be orders of magnitude higher than the current target of 10^-15 ..." [Brewer et al., Google white paper, '16]
Data transfer cost/time is an important component of computing performance.
Redundancy in data storage and computation can significantly improve performance!
9
Theme of tutorial
Role of coding and information theory in contemporary distributed computing systems. Models inspired by practical distributed computing systems; fundamental limits on the trade-off between redundancy and performance; new coding abstractions and constructions.
Complements a rich literature on codes for network function computing. An admittedly incomplete list: [Korner-Marton 79, Orlitsky-Roche 95, Giridhar-Kumar 05, Doshi-Shah-Medard-Jaggi 07, Dimakis-Kar-Moura-Rabbat-Scaglione 10, Duchi-Agarwal-Wainwright 11, Ma-Ishwar 11, Nazer-Gastpar 11, Appuswamy-Franceschetti-Karamchandani-Zeger 13, Ramamoorthy-Langberg 13], as well as the communication and information complexity literature.
10
Outline Information theory and codes for shared memory emulation
Codes for distributed linear data processing
13
Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithms Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to distributed algorithms Directions of Future Research
15
Distributed Algorithms
Algorithms for distributed networks and multi-processors; ~50 years of research. Applications: cloud computing, networking, multiprocessor programming, the Internet of Things¹. Information theory: data is big and changes very fast. Theme of this part of the tutorial: a marriage of ideas between information theory and distributed algorithms. ¹Typical publication venues: PODC, DISC, OPODIS, SODA, FOCS, USENIX FAST, USENIX ATC.
18
Distributed Algorithms: Central assumptions and requirements
Modeling assumptions: unreliability, asynchrony, decentralized nature.
Requirements: fault tolerance; consistency, i.e., the service provided by the system "looks as if centralized" despite the unreliability/asynchrony/decentralized nature of the system.
Consequences: simple-looking tasks can be tricky or sometimes even impossible; careful algorithm design and non-trivial correctness proofs are needed. Example: binary consensus (see appendix).
22
Shared memory emulation
Distributed system; read-write memory. A classical problem in distributed computing.
Goal: implement a read-write memory over a distributed system.
It supports two operations: Read(variablename) % also called a "get" operation; Write(variablename, value) % also called a "put" operation.
For simplicity, we focus on a single variable and omit variablename. The cloud is distributed; there is one variable, and clients write it or read it.
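To fix ideas, here is a minimal sketch of the get/put interface being emulated, written as if the variable lived in one place; the class and method names are illustrative and not part of any particular system. The rest of this part of the tutorial is about providing this same interface when the value is stored redundantly on many unreliable, asynchronous servers.

```python
class SharedRegister:
    """A single shared variable supporting Read ("get") and Write ("put")."""

    def __init__(self, initial=None):
        self._value = initial

    def write(self, value):
        # Write(variablename, value); variablename omitted since there is one variable
        self._value = value

    def read(self):
        # Read(variablename); variablename omitted
        return self._value

r = SharedRegister()
r.write("v1")
print(r.read())  # -> v1
```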
23
Shared memory emulation: Application to cloud computing
Distributed system. Theoretical underpinnings of commercial and open-source key-value stores: Amazon Dynamo DB, Couch DB, Apache Cassandra DB, Voldemort DB. Applications: transactions, reservation systems, multi-player gaming, social networks, news feeds, distributed computing tasks, etc. Design requires sophisticated distributed computing theory combined with great engineering. An active research field in the systems and theory communities. A key-value store is like a database, and is widely used.
25
Engineering Challenges in key-value store implementations
Distributed system. Failure tolerance, fast reads, fast writes.
Asynchrony: weak (no) timing assumptions. Distributed nature: nodes are unaware of each other's state.
How do you distinguish a failed node from a very slow node? How do you ensure that all copies of the data have received a write/update?
Requirements: present a consistent view of the data; allow concurrent access to clients (no locks).
Solutions exist to the above challenges. However…
29
Goal of this part of the tutorial
Distributed system. Analytical understanding of performance (memory overhead, latency) is limited. Replication is used for fault tolerance and availability in practice today. Minimizing memory overhead is important for several reasons: data volume is increasing exponentially, and storing more data in high-speed memory reduces latency. This tutorial discusses the following questions: considerations for the use of erasure codes in such systems, and an information-theoretic framework.
32
Table of Contents Main theme and goal
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
33
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
37
Read-Write Memory. (Timeline figure: a Write(v) and a Read() shown as instantaneous events on a time axis.) Reality check: operations over distributed asynchronous systems cannot be modeled as instantaneous. Solution: the concept of atomicity!
39
Thought experiment: Motivation for atomicity
Two concurrent processes:
Process 1: { x=2.1; tic; x=3.2; % write to shared variable x; toc; }   % time of write operation: 120 ms
Process 2: { Read(x) }
Question: When is the new value (3.2) of the shared variable x available to a possibly concurrent read operation? At 10 ms? 20 ms? 120 ms? 121 ms?
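A minimal runnable sketch of this thought experiment, assuming (purely for illustration) that the write takes about 120 ms to take effect and that three readers probe the variable at different times; what a real distributed store may return to each concurrent reader is exactly what the consistency definition below has to pin down.

```python
import threading
import time

x = 2.1  # shared variable

def writer():
    global x
    time.sleep(0.120)   # the write operation takes ~120 ms to take effect
    x = 3.2

def reader(delay_ms):
    time.sleep(delay_ms / 1000)
    print(f"read at {delay_ms} ms sees x = {x}")

threads = [threading.Thread(target=writer)]
threads += [threading.Thread(target=reader, args=(d,)) for d in (10, 20, 121)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Typically the reads at 10 and 20 ms see 2.1 and the read at 121 ms sees 3.2;
# for a distributed implementation, which concurrent reads may return 3.2 is
# precisely what atomic consistency specifies.
```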
40
Atomic consistency for shared memory emulation
[Lamport 86]; also known as linearizability [Herlihy, Wing 90]. (Timeline figure: a Write(v) and a concurrent Read() shown on a time axis.)
48
Examples of non-atomic executions
(Timeline figures: two example executions, each with a Write(v) and a Read() on a time axis, that violate atomicity.)
52
Importance of Consistency
Modular algorithm design: design an application (e.g., bank accounts, reservation systems) over an "instantaneous" memory, then use an atomic distributed memory in its place; the program executions are indistinguishable. Weaker consistency models are also useful: social networks and news feeds use weaker consistency definitions, trading off errors for performance. In this talk, we will focus on atomic consistency.
53
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
54
Distributed System Model
Read clients, write clients, servers. Client-server architecture; nodes can fail, i.e., stop responding (the number of server failures is limited). Point-to-point reliable links (arbitrary delay). Nodes are unaware of the current state of any other node.
57
The shared memory emulation problem
Read() { design read protocol }   write(v) { design write protocol }   Read clients, write clients; servers: { design server protocol }. Design the write, read, and server protocols to ensure: atomicity; liveness: concurrent operations, no blocking.
58
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
59
The ABD algorithm (sketch)
Read clients, write clients, servers. Idea: write to and read from a majority of the server nodes; any pair of write and read operations then intersects in at least one server. The algorithm works as long as only a minority of the server nodes fail.
63
The ABD algorithm (sketch)
Read clients, write clients, servers.
Write protocol: send the time-stamped value to every server; return after receiving acks from a majority.
Read protocol: send a read query; wait for a majority to respond; return the latest value.
Server protocol: store the latest received value and send an ack; respond to a read request with the stored value.
(This is not the ABD algorithm in its full generality. For instance, the reads have an additional write step which is omitted here. Further, the paper actually proves that this algorithm is atomic, which is not completely trivial.)
Point: every server uses replication, so you send big packets over the network and you store entire values.
65
The ABD algorithm (sketch)
Write protocol: acquire the latest tag via a query; send the tagged value to every server; return after receiving sufficient acks.
Read protocol: send a read query; wait for acks from a majority; send the latest value back to the servers; return the latest value after receiving acks from a majority.
Server protocol: respond to a query with the stored tag; store the latest received value and send an ack; respond to a read request with the stored value.
Point: every server uses replication.
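A hedged sketch of the quorum logic just described, with the network abstracted away: the servers here are plain in-memory objects, a real deployment would contact them over message channels and proceed as soon as any majority responds, and the names (ABDServer, abd_write, abd_read) are illustrative rather than the paper's notation.

```python
class ABDServer:
    def __init__(self):
        self.tag, self.value = (0, 0), None   # tag = (counter, writer_id), ordered lexicographically

    def query(self):
        return self.tag                        # phase 1: report the locally stored tag

    def store(self, tag, value):
        if tag > self.tag:                     # keep only the latest tagged value (full replication)
            self.tag, self.value = tag, value
        return "ack"

    def read(self):
        return self.tag, self.value

def majority(n):
    return n // 2 + 1

def abd_write(servers, writer_id, value):
    q = majority(len(servers))
    tags = [s.query() for s in servers][:q]               # tags from (at least) a majority
    new_tag = (max(t[0] for t in tags) + 1, writer_id)
    [s.store(new_tag, value) for s in servers][:q]        # return once a majority has acked
    return new_tag

def abd_read(servers):
    q = majority(len(servers))
    responses = [s.read() for s in servers][:q]           # (tag, value) pairs from a majority
    tag, value = max(responses)                           # latest tagged value
    for s in servers[:q]:                                  # write back before returning
        s.store(tag, value)
    return value

servers = [ABDServer() for _ in range(5)]
abd_write(servers, writer_id=1, value="v1")
print(abd_read(servers))   # -> v1
```

The write-back step in abd_read mirrors the read protocol above: it ensures that later operations also see a tag at least as large, which is exactly what the atomicity lemma on the next slide requires.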
66
The ABD algorithm is atomic – proof idea
If, after an operation P terminates, (i) every future operation acquires a tag at least as large as the tag of P, (ii) every future write operation acquires a tag strictly larger than the tag of P, and (iii) a read with tag t returns the value of the corresponding write with tag t, then the algorithm is atomic. [Paraphrasing a lemma from Lynch 96]
(Figure: after P, a later Write and a later Read each acquire a tag at least as large as the tag that P propagated.)
Why should read operations write back the value?
67
The ABD algorithm - summary
An atomic read-write memory can be implemented over a distributed asynchronous system. All operations terminate so long as the number of servers that fail is a minority. Design principles of several modern key-value stores mirror shared memory emulation algorithms; see the description of Amazon's Dynamo key-value store [DeCandia et al. 07]. Replication is used for fault tolerance. Point: every server uses replication.
68
Shared Memory Emulation in Distributed Computing
The concept of consistency Atomic consistency (or simply, atomicity) Other notions of consistency Atomic Shared Memory Emulation Problem formulation (Informal) Replication-based algorithm of [Attiya Bar-Noy Dolev 95] Erasure coding based algorithms – main challenges
70
Erasure Coding: smaller packets, smaller overheads.
Example: a Reed-Solomon code with 2 parity symbols. The value is recoverable from any 4 codeword symbols, and the size of each codeword symbol is 1/4 the size of the value.
New constraint: a reader needs 4 symbols with the same time-stamp.
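A minimal sketch of such a code, in the spirit of the 2-parity Reed-Solomon example above: 4 data symbols are encoded into 6 codeword symbols with a Vandermonde generator matrix, and the value is recovered from any 4 of the 6. Exact rational arithmetic stands in for the finite-field arithmetic a real Reed-Solomon implementation would use, and all names here are illustrative.

```python
from fractions import Fraction
from itertools import combinations

K, N = 4, 6   # 4 data symbols encoded into 6 codeword symbols (2 parity)

# N x K Vandermonde generator matrix: any K of its rows are invertible,
# which is exactly the MDS property (recover the value from any K symbols).
G = [[Fraction(r) ** c for c in range(K)] for r in range(1, N + 1)]

def encode(data):
    return [sum(G[i][j] * data[j] for j in range(K)) for i in range(N)]

def solve(A, b):
    """Gaussian elimination over the rationals: solve A x = b for square A."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    n = len(M)
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [x - M[r][col] * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def decode(available):
    """available: at least K pairs of (server index, codeword symbol)."""
    idx, syms = zip(*available[:K])
    return solve([G[i] for i in idx], list(syms))

value = [Fraction(v) for v in (7, 1, 4, 2)]
codeword = encode(value)
for subset in combinations(range(N), K):           # every choice of 4 of the 6 symbols
    assert decode([(i, codeword[i]) for i in subset]) == value
print("value recovered from every set of 4 codeword symbols")
```

The "new constraint" in the slide is visible here: decoding mixes symbols of a single codeword, so the 4 symbols handed to decode must all correspond to the same version (time-stamp) of the value.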
71
Set up for hypothetical erasure coding based algorithm
Read clients, write clients, servers. Write to/read from any five nodes; any two such sets of five intersect in at least 4 nodes. Operations complete as long as at most one node has failed. More generally, in a system with N nodes and a code of dimension k, write to/read from any ⌈(N+k)/2⌉ nodes.
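A quick sanity check of that quorum-size rule (a sketch, with illustrative function names): if every operation contacts ⌈(N+k)/2⌉ of the N servers, then any two such sets overlap in at least k servers, which is enough to decode a dimension-k MDS code.

```python
from math import ceil

def worst_case_overlap(N, quorum_size):
    # two sets of this size overlap least when they avoid each other as much as possible
    return 2 * quorum_size - N

for N, k in [(6, 4), (7, 5), (10, 4)]:
    q = ceil((N + k) / 2)
    overlap = worst_case_overlap(N, q)
    print(f"N={N}, k={k}: contact {q} nodes, any two such sets share >= {overlap} nodes (need {k})")
```

For the example above (N=6, k=4) this gives sets of 5 nodes overlapping in at least 4 nodes, matching the slide.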
73
Hypothetical erasure coding based algorithm - challenges
(Figure: read clients send queries to the servers; servers store multiple versions and must discard old versions to save storage.)
First challenge: reveal symbols to readers only when enough symbols have been propagated.
Second challenge: discard old versions safely.
75
Crude, one sentence summary
First challenge: reveal symbols to readers only when enough symbols have been propagated. Second challenge: discard old versions safely. Crude one-sentence summary: the challenges can be solved through careful algorithm design, with storage cost savings if the extent of concurrency is small. A sample algorithm is given in the appendix.
78
Different approaches to solving challenges
[Ganger et al. 04] the HGR algorithm; [Dutta et al. 08] the ORCAS and ORCAS-B algorithms; [Dobre et al. 14] the M-PoWerStore algorithm; [Androulaki et al. 14] the AWE algorithm; [Cadambe et al. 14] the CASGC algorithm; [Konwar et al. 16] the SODA algorithm. In all of these, the storage cost grows with the number of concurrent write operations.
Noteworthy recent developments: coding-based consistent store implementations [Zhang et al. FAST 16], [Chen et al. USENIX ATC 17]; erasure coding based algorithms for consistency issues in edge computing [Konwar et al. PODC 2017].
Can clever coding-theoretic ideas improve the storage cost? Does the storage cost necessarily grow with concurrency? What is the right information-theoretic abstraction of this system?
81
Break
82
Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
83
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes [Wang Cadambe, Accepted to Trans. IT, 2017]
87
Toy Model for packet arrivals, links
Read clients (decoders), write clients, N servers; f failures possible.
Arrival at the write client: a new version in every time slot, sent immediately to the servers.
Channel from the write client to a server: delay is an integer in [0, T-1].
Channel from a server to a read client: instantaneous (no delay).
Goal: a decoder invoked at time t gets the latest common version among c servers.
T is the degree of asynchrony; every time stamp will be called a version.
93
Toy Model – Decoding requirement
N servers; f failures possible. A version is complete at time t if it has arrived at N-f servers.
Decoding requirement for a decoder at time t: from every set of N-f servers, the latest complete version or a later version must be decodable. This mirrors erasure coding based shared memory emulation protocols.
We will instead study an equivalent decoding requirement: the decoder must be able to recover the latest common version, or a later version, at time t from every set of c servers. The two decoding requirements have the same worst-case storage cost when c = N-2f.
Goal: a decoder invoked at time t gets the latest common version among c servers.
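A small illustration of this decoding requirement, assuming versions are numbered 1, 2, 3, ... and that, for the purposes of the example, each server's state is just the set of version numbers it has received; the helper names are hypothetical.

```python
from itertools import combinations

def latest_common_version(server_states, chosen):
    """Largest version present at every server in `chosen` (0 if none)."""
    common = set.intersection(*(server_states[i] for i in chosen))
    return max(common, default=0)

# Snapshot of N = 5 servers at some time t; asynchrony means different subsets arrived.
states = [{1, 2, 3}, {1, 2}, {1, 2, 3}, {1}, {1, 2}]
c = 3
for chosen in combinations(range(len(states)), c):
    v = latest_common_version(states, chosen)
    print(f"servers {chosen}: must decode version {v} or later")
```

A multi-version code has to guarantee this for every such set of c servers, while each server stores much less than all the versions it has received.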
95
Asynchrony Toy Model – Decoding requirement
c servers; snapshot at time t.
Consistency: decode the latest possible version (a globally consistent view).
Asynchrony: every server receives a different subset of versions, and no server has information about the others.
The decoder connects to any c servers and decodes the latest common version among them, or a later one.
Goal: construct a storage method that minimizes the storage cost.
96
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes
97
The multi-version coding (MVC) problem
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions, and no server has information about the others.
100
The MVC problem: N servers, T versions, connectivity c.
Goal: decode the latest common version, or a later version, from every set of c servers. Minimize the storage cost, in the worst case over all "states" of the servers.
101
Solution 1: Replication
Storage size = size-of-one-version N=4, T=2, c=2
103
Solution 2: (N,c) Maximum Distance Separable code
Question: Can we store a codeword symbol corresponding to the latest version?
104
Solution 2: (N,c) Maximum Distance Separable code
Storage size = (T/c) × size-of-one-version. (Figure: N=4, T=2, c=2; each server stores a 1/2-size coded symbol per version.) Separate coding across versions; each server stores symbols for all the versions it has received.
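A back-of-the-envelope comparison of the two baseline schemes, per server and normalized by the size of one version, as a small sketch: replication stores one full version (cost 1), while naïve per-version MDS coding stores a 1/c-size symbol for each of up to T versions (cost T/c), so naïve MDS only wins while T < c.

```python
def replication_cost(T, c):
    return 1.0      # one full copy of a version per server

def naive_mds_cost(T, c):
    return T / c    # a 1/c-size coded symbol for each of T versions

c = 4
for T in range(1, 7):
    print(f"T={T}, c={c}: replication {replication_cost(T, c):.2f}, naive MDS {naive_mds_cost(T, c):.2f}")
```

The summary that follows compares these baselines with the tutorial's achievable scheme and converse.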
109
Summary of results. (Plot: storage cost versus the number of versions T, comparing replication, naïve MDS codes, our achievable scheme, our converse, and the Singleton bound.) Storage cost inevitably increases as the degree of asynchrony grows!
111
Normalized by size-of-value
Summary of results: storage cost normalized by size-of-value. Replication: 1. Naïve MDS codes, constructions*, and the lower bound: expressions given in the original table. *Achievability can be improved (see [Wang, Cadambe, Accepted to Trans. IT, 17]).
112
Main insights and techniques
Redundancy is required to ensure consistency in an asynchronous environment; the amount of redundancy grows with the degree of asynchrony T. Connected to pliable index coding [Brahma-Fragouli 12]. Exercises in network information theory can be converted into exercises in combinatorics.
Achievability: a separate linear code for each version; carefully choose the "budget" for each version based on the set of received versions.
Converse: genie-based, discover "worst-case" arrival patterns.
118
Achievability Therefore, the version corresponding to at least one partition is decodable
120
Converse argument: start with c servers.
121
State vector s1 = (1, 1, …, 1): version 1 is decodable.
State vector s3 = (2, 2, …, 2, 1, 1, …, 1): the minimal state vector such that version 2 is decodable.
State vector s4 = (2, 2, …, 1, 1, 1, …, 1): the maximal state vector such that version 1 is decodable.
Versions 1 and 2 are decodable from c+1 symbols: the c symbols in s3 plus the one changed symbol in s4.
129
Versions 1 and 2 decodable from c+1 symbols
Start with c servers; propagate version 2 to a minimal set of servers such that it becomes decodable. Versions 1 and 2 are then decodable from c+1 symbols.
130
Main insights and techniques
Genie-based converse: discover "worst-case" arrival patterns. The combinatorial puzzle is more challenging for T > 2.
132
Information-Theoretic abstraction of consistent distributed storage
Shared Memory Emulation Toy model Multi-version (MVC) codes
135
Recall the shared memory emulation model
Read Clients (Decoders) Write Clients N Servers f failures possible Arrival at clients: arbitrary Channel from clients to servers: arbitrary (unbounded) delay, reliable Clients, servers modeled as I/O automata, protocols can be designed.
138
Shared Memory Emulation model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata, and the protocols can be designed.
Per-server storage cost: a generalization of the Singleton bound, which is non-trivial due to the interactive nature of the protocols. [C-Lynch-Wang, ACM PODC 2016]
140
Shared Memory Emulation model
Arbitrary arrival times, arbitrary delays between encoders, servers and decoders Clients, servers modeled as I/O automata, protocols can be designed. Per-server storage cost Generalization of the MVC converse for T=2. Open question: Generalization of MVC bounds for T > 2 to the full-fledged distributed systems theoretic model. [C-Lynch-Wang, ACM PODC 2016]
141
Shared Memory Emulation model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata, and the protocols can be designed.
Per-server storage cost: a generalization of the MVC converse for T=2. The generalization of the MVC converse for T > 2 works for non-interactive protocols, with T = the number of concurrent write operations. [C-Lynch-Wang, ACM PODC 2016]
143
Storage Cost bounds for Shared Memory Emulation
(Plot: storage cost versus the number of concurrent writes, comparing the ABD algorithm and erasure coding based algorithms against the baseline, first, and second lower bounds.)
144
Table of Contents Main theme of this tutorial
Distributed algorithms Shared memory emulation problem. Applications to key-value stores Shared Memory Emulation in Distributed Computing The concept of consistency Overview of replication-based shared memory emulation algorithm Challenges of erasure coding based shared memory emulation Information-Theoretic Framework Toy Model for distributed algorithms Multi-version Coding Closing the loop: connection to shared memory emulation Directions of Future Research
147
Directions of Future Research
Multi-version codes: a pessimistic/conservative model. The system is totally asynchronous, every version arrival pattern is possible, and nodes do not even have stale or partial information about the system state.
Classical codes for distributed storage: an optimistic model. The system is synchronous, and nodes have instantaneous, global system-state information.
Practice lies between these two extremes.
148
Directions of future research
Beyond the worst-case model Correlated versions
149
Codes with Correlated Versions
[Ali-C, ITW 2016] (Model figure: successive versions are correlated; figure labels: "Markov Chain", "Uniform".)
150
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. (Figure: each server stores a 1/2-size symbol of the first version; the storage needed for the second version is marked "?".)
151
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. "Closeness" in the messages translates into "closeness" in the codewords: delta coding.
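A minimal delta-coding sketch under the assumption (made here only for illustration) that the new version differs from the old one in a small number of positions: a node that already holds version 1 can store a compact encoding of the difference rather than a full-size representation of version 2.

```python
def delta(old, new):
    """Positions and new values where `new` differs from `old`."""
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_delta(old, d):
    out = list(old)
    for i, b in d:
        out[i] = b
    return out

v1 = [4, 8, 15, 16, 23, 42, 7, 0]
v2 = [4, 8, 15, 99, 23, 42, 7, 1]   # "close" to v1: differs in only 2 positions

d = delta(v1, v2)
print(f"delta has {len(d)} entries versus {len(v2)} symbols for the full version")
assert apply_delta(v1, d) == v2      # version 2 is recoverable from version 1 plus the delta
```

The storage saving is exactly the "closeness in message implies closeness in codeword" idea; the Slepian-Wolf ideas mentioned on the following slides exploit the same correlation without the encoder needing to see the previous version explicitly.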
152
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. The resulting storage cost is smaller than that of coding the two versions independently.
155
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. Apply Slepian-Wolf ideas; the storage cost is again smaller than that of coding the versions independently.
156
Codes with Correlated Versions
[Ali-C, ITW 2016] N=4, T=2, c=2. Open: information-theoretically optimal schemes for all parameter regimes. Open: practical code constructions.
158
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: explore the relations between timing assumptions and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
160
Beyond the toy model: the toy model serves as a bridge between shared memory emulation and our network information theoretic formulation.
Future work: a more realistic model that exposes the connections between channel delay uncertainty, staleness of reads, and storage cost.
Open: can an interactive write protocol improve the storage cost?
162
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: the relation between system response, the decoding requirement, and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
163
Beyond read-write memory
Several systems: more complicated data structures over distributed asynchronous systems Transactions: Multiple read-write objects, more complicated consistency requirements. Graph based data structures Question: How do you “erasure code” more complicated data structures and state machines? Initial clues provided in [Balasubramanian-Garg 14]
165
Directions of future research
Beyond the worst-case model: correlated versions; (limited) server co-operation, exchanging possibly partial state information; average-case storage cost, i.e., good storage cost in typical states, possibly with a larger worst-case storage cost.
Beyond the toy model: the relation between system response, the decoding requirement, and storage cost. Open question: can interactive protocols help, or does the storage cost necessarily grow with the number of concurrent writes?
Beyond read-write data structures.
166
Thanks
167
References
[Lynch 96] N. A. Lynch, Distributed Algorithms. USA: Morgan Kaufmann Publishers Inc., 1996.
[Lamport 86] L. Lamport, "On interprocess communication. Part I: Basic formalism," Distributed Computing, 2(1), pp. 77–85, 1986.
[Vogels 09] W. Vogels, "Eventually consistent," ACM Queue, vol. 6, no. 6, pp. 14–19, 2008.
[Attiya-Bar-Noy-Dolev 95] H. Attiya, A. Bar-Noy, and D. Dolev, "Sharing memory robustly in message-passing systems," J. ACM, vol. 42, no. 1, pp. 124–142, Jan. 1995.
[DeCandia et al. 07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, vol. 7, 2007, pp. 205–220.
[Hewitt 10] E. Hewitt, Cassandra: The Definitive Guide. O'Reilly Media, Inc., 2010.
[Hendricks et al. 07] J. Hendricks, G. R. Ganger, and M. K. Reiter, "Low-overhead Byzantine fault-tolerant storage," ACM SIGOPS Operating Systems Review, vol. 41, no. 6, pp. 73–86, 2007.
[Dutta et al. 08] P. Dutta, R. Guerraoui, and R. R. Levy, "Optimistic erasure-coded distributed storage," in Distributed Computing. Springer, 2008, pp. 182–196.
[Cadambe et al. 14] V. R. Cadambe, N. Lynch, M. Medard, and P. Musial, "A coded shared atomic memory algorithm for message passing architectures," in 2014 IEEE 13th International Symposium on Network Computing and Applications (NCA). IEEE, 2014, pp. 253–260.
[Dobre et al. 13] D. Dobre, G. Karame, W. Li, M. Majuntke, N. Suri, and M. Vukolić, "PoWerStore: Proofs of writing for efficient and robust storage," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. ACM, 2013, pp. 285–298.
[Konwar et al. PODC 2017] K. M. Konwar, N. Prakash, N. A. Lynch, and M. Médard, "A layered architecture for erasure-coded consistent distributed storage," CoRR, 2017; accepted to the 2017 ACM Symposium on Principles of Distributed Computing (PODC).
168
References
[Zhang et al. FAST 16] H. Zhang, M. Dong, and H. Chen, "Efficient and available in-memory KV-store with hybrid erasure coding and replication," in FAST, 2016, pp. 167–180.
[Chen et al. 2017] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, "Giza: Erasure coding objects across global data centers," in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017.
[Herlihy Wing 90] M. P. Herlihy and J. M. Wing, "Linearizability: A correctness condition for concurrent objects," ACM Trans. Program. Lang. Syst., vol. 12, pp. 463–492, July 1990.
[Cadambe-Lynch-Wang ACM PODC 16] V. R. Cadambe, Z. Wang, and N. Lynch, "Information-theoretic lower bounds on the storage cost of shared memory emulation," in Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing (PODC '16). ACM, 2016, pp. 305–314.
[Wang Cadambe Trans. IT 17] Z. Wang and V. Cadambe, "Multi-version coding – an information-theoretic perspective of consistent distributed storage," to appear in IEEE Transactions on Information Theory.
[Brahma Fragouli 2012] S. Brahma and C. Fragouli, "Pliable index coding," in 2012 IEEE International Symposium on Information Theory (ISIT). IEEE, 2012, pp. 2251–2255.
[Ali-Cadambe ITW 16] R. E. Ali and V. R. Cadambe, "Consistent distributed storage of correlated data updates via multi-version coding," in 2016 IEEE Information Theory Workshop (ITW). IEEE, 2016, pp. 176–180.
[Balasubramanian-Garg 14] B. Balasubramanian and V. K. Garg, "Fault tolerance in distributed systems using fused state machines," Distributed Computing, vol. 27, no. 4, pp. 287–311, 2014.
171
Appendix: Binary consensus - A simple looking task that is impossible to achieve in a decentralized asynchronous system
172
Fischer-Lynch-Paterson (FLP) impossibility result (informal)
A famous impossibility result. Two processors P1, P2; each processor begins with an initial value in {0,1}. They can communicate messages over a reliable link, but with arbitrary (unbounded) delay.
Goal: design a protocol such that (a) both processors agree on the same value, which is an initial value of some processor, and (b) each non-failed processor eventually decides.
FLP (informally): no such protocol exists in an asynchronous system if even one processor may crash.
173
Appendix: Why does a read operation need to write back in the ABD algorithm?
174
What happens if a read does not write back?
go back The following execution is possible if a read does not write back. [Timeline figure: writes W1, W2 and reads R1, R2, R3 on a time axis.] An example of a violating execution with 6 servers: Write W1's value v(1) reached all 6 servers before W2 started. Write W2 sent its value v(2) only to server 1 before read R2 started. Read R2 got responses from servers 1, 2, 3, 4 and therefore returned v(2). Server 1 failed after R2 completed, but before R3 started. Read R3 then started and returned v(1) (it cannot see v(2)!). Finally, after R3 completes, v(2) reaches the remaining non-failed servers. Since R3 begins after R2 ends yet returns an older value, atomicity is violated; the read's write-back phase prevents exactly this.
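A minimal sketch of the scenario above (my own illustration, not part of the original deck; the helper names and the 0-indexed servers are invented):

```python
# Minimal sketch (not from the tutorial) of the execution above.
# Servers store a (tag, value) pair; a read queries a majority, picks the
# highest tag, and -- in the full ABD protocol -- writes that pair back to a
# majority before returning. Skipping the write-back lets R3 return the
# older v1 even though R2 already returned v2.

N = 6
MAJORITY = N // 2 + 1                 # any two majorities of 6 servers intersect

servers = {i: (1, "v1") for i in range(N)}    # W1's value reached all servers
alive = set(range(N))

def server_write(i, tag, val):
    if i in alive and tag > servers[i][0]:
        servers[i] = (tag, val)

def read(responding, write_back=True):
    """Read using responses from the given set of at least MAJORITY servers."""
    assert len(responding) >= MAJORITY
    tag, val = max(servers[i] for i in responding)   # latest version seen
    if write_back:
        for i in sorted(alive)[:MAJORITY]:           # propagate before returning
            server_write(i, tag, val)
    return val

server_write(0, 2, "v2")                       # W2 reaches only one server
print(read({0, 1, 2, 3}, write_back=False))    # R2 -> 'v2'
alive.discard(0)                               # the server holding v2 then fails
print(read({1, 2, 3, 4}, write_back=False))    # R3 -> 'v1' (new-old inversion)
```

Re-running the first read with write_back=True propagates (2, "v2") to a majority before it returns, so any later read that contacts a majority necessarily sees it; this is exactly the role of the read's second phase.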
175
Appendix: An Erasure Coding Based Algorithm
Algorithm from [Cadambe-Lynch-Medard-Musial, IEEE NCA 2014], extended version in Distributed Computing (Springer) 2016.
176
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
177
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
178
Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. N servers, at most f failures. Use an MDS code of dimension k, where f is no bigger than (N-k)/2 (equivalently, k ≤ N - 2f). Every set of at least (N+k)/2 server nodes is referred to as a "quorum set"; note that any two quorum sets intersect in at least k nodes. An additional "fin" label at servers indicates that the version's coded elements have been propagated to sufficiently many servers. An additional write phase tells the servers that the coded elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).
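The quorum arithmetic above can be checked mechanically. The short script below is my own sketch (check_quorum_properties is an invented helper, not from the CAS paper or this deck): it verifies for a few (N, k) pairs that quorums of size ⌈(N+k)/2⌉ survive f = ⌊(N-k)/2⌋ failures and that any two quorums share at least k servers, which is what lets a reader decode one MDS codeword from the intersection.

```python
# My own sketch of the CAS quorum arithmetic.
from math import ceil

def check_quorum_properties(N, k):
    f = (N - k) // 2                 # tolerated failures: f <= (N - k)/2
    q = ceil((N + k) / 2)            # quorum size
    assert N - f >= q, "some quorum of non-failed servers must remain"
    assert 2 * q - N >= k, "any two quorums share >= k coded elements"
    return f, q

for N, k in [(5, 1), (7, 3), (9, 3), (10, 4)]:
    f, q = check_quorum_properties(N, k)
    print(f"N={N}, k={k}: tolerates f={f} failures, quorum size {q}")
```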
179
[Diagram: read clients, write clients, and the servers; callout at Server 1: "Has been propagated to a quorum".]
180
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded element; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
181
CAS – Protocol overview
Write: Acquire latest tag; send (incremented) tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded element; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
182
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest finalized tag. Finalize the requested tag; respond to read request with codeword symbol.
183
[Diagram: read clients, write clients, and the servers (Server 1 highlighted).]
184
[Diagram: read clients, write clients, and the servers (Server 1 highlighted).]
185
[Diagram: read clients, write clients, and the servers; the servers respond with ACKs.]
186
[Diagram: read clients, write clients, and the servers; the servers mark the version with the "fin" label.]
187
[Diagram: read clients, write clients, and the servers; the servers respond with ACKs.]
188
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Finalize the requested tag; respond to read request with codeword symbol.
189
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Label the requested tag as fin; respond to read request with coded element if available.
190
CAS – Protocol overview
Write: Acquire latest tag; send tag and coded element to every server; Send finalize message after getting acks from quorum; Return after receiving acks from quorum. Read: Send read query; wait for tags from a quorum; Send request with latest tag to servers; Decode value after receiving coded elements from quorum. Servers: Store the coded symbol; send ack. Set fin flag for time-stamp on receiving finalize message. Send ack. Respond to read query with latest tag labeled fin. Label the requested tag as fin; respond to read request with coded element if available.
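To make the message flow summarized on this slide concrete, here is a schematic single-process sketch (my own illustration, not the authors' implementation; encode/decode are trivial replication stand-ins for a real (N, k) MDS code such as Reed-Solomon, and all names are invented):

```python
# Schematic, single-process sketch of the CAS message flow summarized above
# (my own illustration, not the authors' code). Tags are (number, writer_id)
# pairs ordered lexicographically; 'fin' marks tags whose coded elements have
# reached a quorum.

def encode(value, n, k):              # stand-in for MDS encoding (replication)
    return [value] * n

def decode(elems, k):                 # stand-in for MDS decoding
    return elems[0]

class Server:
    def __init__(self):
        self.store = {}               # tag -> coded element (CAS keeps full history)
        self.fin = set()              # tags labeled 'fin'

    def query(self):                  # respond with the latest finalized tag
        return max(self.fin, default=(0, "init"))

    def pre_write(self, tag, elem):   # store the coded element ("ack" implied)
        self.store[tag] = elem

    def finalize(self, tag):          # set 'fin'; return coded element if available
        self.fin.add(tag)
        return self.store.get(tag)

def cas_write(servers, value, writer_id, k=1):
    tag = (max(s.query() for s in servers)[0] + 1, writer_id)   # incremented tag
    for s, e in zip(servers, encode(value, len(servers), k)):
        s.pre_write(tag, e)           # in CAS: proceed after acks from a quorum
    for s in servers:
        s.finalize(tag)               # in CAS: return after acks from a quorum

def cas_read(servers, k=1):
    tag = max(s.query() for s in servers)                       # latest 'fin' tag
    elems = [e for e in (s.finalize(tag) for s in servers) if e is not None]
    return decode(elems[:k], k)       # decode from >= k coded elements

servers = [Server() for _ in range(5)]
cas_write(servers, "v1", writer_id="w1")
print(cas_read(servers))              # -> 'v1'
```

Note how a server only advertises a tag to readers (via query) after it has been finalized, i.e., after the writer has confirmed that a quorum holds the coded elements; this is the mechanism by which only completed writes are revealed.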
191
Coded Atomic Storage (CAS)
Solves the challenge of revealing only completed writes to readers. An additional "fin" label at servers indicates that the version's coded elements have been propagated to sufficiently many servers. An additional write phase tells the servers that the coded elements have been propagated to a quorum. Servers store all the history, so the storage cost is unbounded (solved in CASGC).
192
CAS – Protocol overview
Theorem: (1) CAS satisfies atomicity. (2) Liveness: all operations return, provided the number of server failures is at most the pre-specified threshold f.
193
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
194
Keep at most d+1 elements
Possible solution: store at most d+1 coded elements per server and delete older ones. [Diagram: read clients, write clients, and the servers; callout at Server 1: "Keep at most d+1 elements".]
195
Modification of CAS: the good and the bad
Possible solution: store at most d+1 coded elements and delete older ones.
The good: finite storage cost; all operations terminate if the number of writes that overlap with a read is smaller than d; atomicity (via a simulation relation with CAS).
The bad: failed write clients lead to a weak liveness condition, that is, d failed writes can render all future reads incomplete. [Diagram: a read that does not end because it is concurrent with all future writes.]
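A tiny sketch of the failure mode just described (my own illustration; the names and the single-server setup are invented): with the naive keep-the-latest-d+1 rule, a read that learns a tag but is overtaken by more than d later writes can no longer fetch its coded element.

```python
# Sketch (my own, not the paper's code) of the naive garbage-collection rule:
# each server keeps only the d + 1 most recent coded elements. If d + 1 writes
# land between a reader learning a tag and fetching its coded element, that
# element is already gone and the read cannot complete.

d = 2

class Server:
    def __init__(self):
        self.store = {}                        # tag -> coded element

    def write(self, tag, elem):
        self.store[tag] = elem
        for old in sorted(self.store)[:-(d + 1)]:
            del self.store[old]                # keep at most d + 1 elements

    def get(self, tag):
        return self.store.get(tag)             # None if garbage-collected

s = Server()
s.write(1, "c1")
reader_tag = 1                                 # a slow reader learned tag 1 ...
for t in range(2, 2 + d + 1):                  # ... then d + 1 = 3 writes arrive
    s.write(t, f"c{t}")
print(s.get(reader_tag))                       # -> None: the read is starved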
196
Coded Atomic Storage (CAS)
CAS: Solves the first challenge of revealing correct elements to readers. Good communication cost, but infinite storage cost.
Failed attempt at garbage collection: Attempts to solve the challenge of discarding old versions. Good storage cost, but poor liveness conditions if too many clients fail.
CASGC (GC = garbage collection): Solves both challenges. Uses server gossip to propagate metadata. Good storage and communication cost. Good handling of client failures.
197
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones.
198
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones. Use server gossip to propagate fin labels and to "complete" failed operations. End-point of a write: the point at which the operation is "completed" through gossip, or the point of failure if the operation cannot be completed through gossip.
199
The CASGC algorithm: The main novelties
Client protocol is the same as in CAS; we only summarize the differences in the server protocol here. Keep the d+1 latest coded elements with the fin label, and all intervening elements; delete older ones. Use server gossip to propagate fin labels and to "complete" failed operations. End-point of a write: the point at which the operation is "completed" through gossip, or the point of failure if the operation cannot be completed through gossip. This definition of end-point suffices for defining concurrency, and yields a satisfactory liveness theorem.
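A rough sketch of the server-side behaviour summarized above (my own illustration, not the paper's code; the gossip here is deliberately simplistic and all names are invented): fin labels are gossiped between servers, and garbage collection keeps the d+1 most recently finalized tags plus any intervening ones.

```python
# Sketch (my own) of the CASGC server-side rule: keep the coded elements of
# the d + 1 most recent 'fin' tags plus any intervening tags, delete older
# ones, and gossip 'fin' labels so finalize messages lost to client failures
# still reach every server.

d = 1

class Server:
    def __init__(self, peers=None):
        self.store = {}                  # tag -> coded element
        self.fin = set()                 # tags labeled 'fin'
        self.peers = peers or []

    def pre_write(self, tag, elem):
        self.store[tag] = elem

    def finalize(self, tag, gossip=True):
        self.fin.add(tag)
        if gossip:                       # server-to-server gossip of 'fin'
            for p in self.peers:
                p.finalize(tag, gossip=False)
        self._garbage_collect()

    def _garbage_collect(self):
        if len(self.fin) <= d + 1:
            return
        cutoff = sorted(self.fin)[-(d + 1)]          # (d+1)-th latest fin tag
        for tag in [t for t in self.store if t < cutoff]:
            del self.store[tag]          # keep fin'd and intervening tags only

s1, s2 = Server(), Server()
s1.peers, s2.peers = [s2], [s1]
for t in (1, 2, 3):
    s1.pre_write(t, f"c{t}")
    s1.finalize(t)                       # gossip also delivers 'fin' to s2

print(sorted(s1.store))                  # -> [2, 3]: older elements collected
print(sorted(s2.fin))                    # -> [1, 2, 3]: learned via gossip
```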
200
Main Theorems: All operations complete if the number of writes concurrent with a read is smaller than d. The paper also gives a bound on the storage cost.
201
Main Theorems Main Insights
All operations complete if the number of writes concurrent with a read is smaller than d. The paper also gives a bound on the storage cost. Main Insights: Significant savings in network traffic overheads are possible with clever design. Server gossip is a powerful tool for good liveness in storage systems. Storage overheads depend on many factors, including the extent of concurrent client activity.
202
Summary Go back
CASGC: Liveness: conditional, tuneable. Communication cost: small. Storage cost: tuneable, quantifiable.
Viveck R. Cadambe, Nancy Lynch, Muriel Médard, and Peter Musial. A Coded Shared Atomic Memory Algorithm for Message Passing Architectures. Distributed Computing (Springer), 2017.
203
Appendix: Converse for multi-version codes, T = 3, description of worst-case state.
204
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
205
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
206
Converse: T=3
207
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
208
Converse: T=3
209
Converse: T=3 [Diagram: servers and the versions (1, 2, 3) they store in the worst-case state construction.]
210
Converse: T=3 go back [Diagram: the worst-case state; all three versions are decodable from these c+2 servers, implying the storage cost bound.]