1
Asynchrony-Storage Trade-offs in Consistent Distributed Storage Systems
Viveck R. Cadambe, Pennsylvania State University. Joint work with Prof. Zhiying Wang (UC Irvine) and Prof. Nancy Lynch (MIT). Acknowledgements: Muriel Medard (MIT), Peter Musial (EMC Corporation)
2
Replicated State Machine Problem
[Lamport 78] The canonical problem in distributed systems theory and practice. Goal: given a state machine, implement it on failure-prone distributed servers. Clients and servers communicate using point-to-point links; replication of the state machine is commonly used for fault tolerance. Concurrent access to the system. The basis of numerous modern cloud computing services, including [Amazon DynamoDB, Google Colossus and Spanner, GraphLab, Apache Cassandra and ZooKeeper, Voldemort, etc.] [Figure: users/clients accessing the distributed system.]
3
Replicated State Machine Problem
[Lamport 78] The canonical problem in distributed systems theory and practice. Goal: given a state machine, implement it on failure-prone distributed servers. Clients and servers communicate using point-to-point, reliable, asynchronous links; replication of the state machine is commonly used for fault tolerance. Concurrent access to the system: the main challenge is to present a consistent global state. The basis of numerous modern cloud computing services [Amazon DynamoDB, Google Colossus and Spanner, GraphLab, Apache Cassandra, etc.] [Applications: reservation systems, transactions, multi-player gaming, multi-processor programming] Let me begin with the canonical problem in distributed systems theory and practice. The problem is the following: given a state machine, I want to implement it over a distributed set of failure-prone servers. You have users or clients which want to access a distributed set of servers, send inputs, and obtain outputs of the state machine. These clients jointly use the state machine to perform some computation. Typically, the state machine is replicated to ensure fault tolerance, i.e., if a few machines fail, you still want to get the output. Sometimes replication is also done to improve performance: if you repeat the task, at least one of the copies will complete faster. Clients and servers talk to each other through links, which are commonly thought of as reliable but delay-prone. An important property that the problem must allow is concurrent access, that is, no single client should lock the system to send its input. The specific challenge in implementing a replicated state machine is to maintain some kind of consistency. For example, if the state machine stores an airline reservation system, then it must expose the same number of empty seats even if many people are accessing the system. Similarly, suppose the state machine stores the presence of a gun in some multiplayer game. If I pick up the gun, and then you try to pick up the gun, the system must expose a consistent global understanding of the position or the possessor of the gun. People have studied several intriguing and intricate algorithms for this problem, and these underlie several commercial and open-source cloud-based products. [Figure: users/clients accessing the distributed system.]
4
Replicated State Machine Problem
[Lamport 78] The canonical problem in distributed systems theory and practice. Goal: given a state machine, implement it on failure-prone distributed servers. Clients and servers communicate using point-to-point, reliable, asynchronous links; replication of the state machine is commonly used for fault tolerance. Concurrent access (no locking): the main challenge is to present a consistent global state. The basis of numerous modern cloud computing services [Amazon DynamoDB, Google Colossus and Spanner, GraphLab, Apache Cassandra, etc.] [Applications: reservation systems, transactions, multi-player gaming, multi-processor programming] [Figure: users/clients accessing the distributed system.]
6
Shared Memory Emulation Problem
Special case of the replicated state machine: a read-write register. [Figure: users/clients accessing the distributed system.]
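For concreteness, the object being emulated is simply a read-write register. Below is a minimal sketch of its sequential interface (the class and method names are illustrative, not from the talk); the rest of the talk is about providing this interface when the state lives on asynchronous, failure-prone servers accessed concurrently.

```python
# Minimal sketch of the sequential object being emulated: a single
# read-write register. Names are illustrative only.
class ReadWriteRegister:
    def __init__(self, initial=None):
        self._value = initial

    def write(self, value):
        """Replace the stored value (the 'ack' is the return of the call)."""
        self._value = value

    def read(self):
        """Return the most recently written value."""
        return self._value

# Sequential behaviour is trivial; the challenge is emulating this
# interface over many asynchronous, failure-prone servers while reads
# and writes from different clients overlap in time.
reg = ReadWriteRegister(initial="v0")
reg.write("v1")
assert reg.read() == "v1"
```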
7
Shared Memory Emulation Problem
Special case of the replicated state machine: a read-write register. [Figure: write and read operations on a timeline; users/clients accessing the distributed system.]
10
Atomic consistency for shared memory emulation
[Lamport 86], also known as linearizability [Herlihy, Wing 90]. [Figure: write and read operations on a timeline.]
15
Atomic consistency for shared memory emulation
[Lamport 86], also known as linearizability [Herlihy, Wing 90]. [Figure: timelines of write and read operations illustrating an execution that is not atomic.]
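As a concrete illustration of the execution marked "not atomic", here is a small sketch (in Python; the operation intervals and version numbers are invented for illustration) of the classical new/old inversion: a slow write overlaps two non-overlapping reads, the first read returns the new value, and the later read still returns the old one. Such a history can satisfy regularity but violates atomicity.

```python
# Illustrative sketch: detect a "new/old inversion" between two reads
# that do not overlap in time. Versions are integers; larger = newer.
from dataclasses import dataclass

@dataclass
class Op:
    kind: str      # "write" or "read"
    start: float   # invocation time
    end: float     # response time
    version: int   # version written, or version returned by the read

def has_new_old_inversion(history):
    """True if some read that starts after another read has finished
    returns a strictly older version."""
    reads = [op for op in history if op.kind == "read"]
    return any(r1.end < r2.start and r2.version < r1.version
               for r1 in reads for r2 in reads)

history = [
    Op("write", 0.0, 10.0, 1),  # write of version 1, slow to complete
    Op("read",  1.0,  2.0, 1),  # first read returns the new version
    Op("read",  3.0,  4.0, 0),  # later read returns the old version
]
print(has_new_old_inversion(history))  # True: this execution is not atomic
```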
17
Shared Memory Emulation Problem
Special case of the replicated state machine: a read-write register. Weaker requirement (regularity): a read that begins after the completion of a write must return the value of that write, or the value of a write that completes after that write begins. [Figure: write and read operations on a timeline; users/clients accessing the distributed system.]
18
Shared Memory Emulation - Model
[Figure: write clients, servers, and read clients (decoders).] Asynchrony – packets don't arrive at all the servers simultaneously. Distributed nature – nodes do not know which packets have been received by other nodes, or whether other nodes have failed. Concurrency: parallel read and write operations. Consistency (atomicity or regularity). In the distributed systems literature, shared memory emulation is a mathematical model: a model in discrete mathematics in which each component is described by its own I/O automaton. I won't describe the model in great detail, but will describe its salient features. These are the servers, and these are the clients, say, playing a game; my goal is to design the server and client protocols so that they ensure consistency. There are a few salient aspects of this model that we don't often see in information or coding theory.
20
Shared Memory Emulation - Model
[Figure: write clients, servers, and read clients (decoders).] Asynchrony, distributed nature, concurrent access, consistency. An active area of study in distributed systems theory; inspires modern key-value store design. [e.g., Attiya-Bar-Noy-Dolev 1995, DeCandia et al. 2008]
21
Shared Memory Emulation - Model
[Figure: write clients, servers, and read clients (decoders).] Asynchrony, distributed nature, concurrent access, consistency. An active area of study in distributed systems theory; inspires modern key-value store design. [e.g., Attiya-Bar-Noy-Dolev 1995, DeCandia et al. 2008] Analytical understanding of storage costs is very limited; replication is used in every commercial solution to provide fault tolerance.
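To make the replication-based baseline concrete, here is a heavily simplified sketch in the spirit of the quorum-based emulation of Attiya-Bar-Noy-Dolev (1995), cited above. This is our own toy illustration: it is written as ordinary sequential Python rather than as communicating I/O automata, and a randomly chosen majority stands in for the set of servers that happen to respond.

```python
# Toy sketch of an ABD-style replicated register: values carry logical
# timestamps ("tags"), and both writes and reads talk to a majority.
import random

class Server:
    def __init__(self):
        self.tag, self.value = 0, None      # (logical timestamp, value)

    def store(self, tag, value):
        if tag > self.tag:                  # keep only the newest version
            self.tag, self.value = tag, value

servers = [Server() for _ in range(5)]
majority = len(servers) // 2 + 1

def quorum():
    """Pretend an arbitrary majority of servers responded."""
    return random.sample(servers, majority)

def write(value):
    tag = max(s.tag for s in quorum()) + 1  # phase 1: learn the latest tag
    for s in quorum():                      # phase 2: store at a majority
        s.store(tag, value)

def read():
    # Phase 1: fetch (tag, value) from a majority and pick the newest.
    tag, value = max(((s.tag, s.value) for s in quorum()),
                     key=lambda tv: tv[0])
    # Phase 2 ("write-back"): propagate it to a majority so a later read
    # cannot return an older version (this rules out new/old inversions).
    for s in quorum():
        s.store(tag, value)
    return value

write("v1")
print(read())  # "v1"
```

Because every pair of majorities intersects, a read always sees the latest completed write; the price is that every server stores a full copy of the value, which is the storage cost the rest of the talk tries to reduce.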
22
Challenge of erasure coding in shared memory emulation
[Figure: erasure-coded storage example, 2 parities.]
24
Challenge of erasure coding in shared memory emulation
[Figure: erasure-coded storage example, 2 parities.] The servers' contents become incoherent due to asynchrony; the older version cannot be deleted!
25
Challenge of erasure coding in shared memory emulation
[Figure: erasure-coded storage example, 2 parities.] The servers' contents become incoherent due to asynchrony; the older version cannot be deleted! The storage cost gain of erasure coding is offset by the need to store history.
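The following toy sketch (our own illustration; it models only the bookkeeping of an (N, k) MDS code, not its algebra) shows why a server cannot simply discard its coded symbol of the older version when a symbol of a newer version arrives.

```python
# With an (N, k) MDS code, a version is decodable only if a reader obtains
# coded symbols of that version from at least k distinct servers.
N, k = 4, 2          # 4 servers; any 2 symbols of the same version suffice
                     # (a reader contacts 2 servers in this example)

# Which versions each server currently holds a coded symbol of:
# version 1 reached everyone; version 2 has reached only servers 0 and 1.
holds = [{1, 2}, {1, 2}, {1}, {1}]

def decodable(contacted_ids):
    """Versions recoverable from the contacted servers' symbols."""
    all_versions = set().union(*(holds[i] for i in contacted_ids))
    return {v for v in all_versions
            if sum(v in holds[i] for i in contacted_ids) >= k}

print(decodable({0, 2}))   # {1}: only the old version is decodable

# If servers 0 and 1 had deleted version 1 upon receiving version 2:
holds = [{2}, {2}, {1}, {1}]
print(decodable({0, 2}))   # set(): no version is decodable at all
```

Keeping the history restores decodability, but it also erodes the storage savings that motivated coding in the first place.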
26
Shared Memory Emulation Toy model
Multi-version (MVC) codes: an information-theoretic understanding of storage costs. [C-Wang ISIT 2014, C-Wang arXiv 2015, C-Wang-Lynch ACM PODC 2016, Ali-C ITW 2016]
27
Toy Model for packet arrivals, links
[Figure: write clients, servers, and read clients (decoders).]
28
Toy Model for packet arrivals, links
[Figure: write clients, N servers (up to f failures), and read clients (decoders).]
29
Toy Model for packet arrivals, links
[Figure: write clients, N servers (up to f failures), and read clients (decoders).] Arrival at the write client: one packet (version) in every time slot, sent immediately to the servers. Channel from the write client to a server: delay is an integer in [0, T-1]. Channel from a server to a read client: instantaneous (no delay). Goal: a decoder invoked at time t gets the latest common version among c servers.
31
Toy Model for packet arrivals, links
[Figure: write clients, N servers (up to f failures), and read clients (decoders).] Arrival at the write client: one packet (version) in every time slot, sent immediately to the servers. Channel from the write client to a server: delay is an integer in [0, T-1]. Channel from a server to a read client: instantaneous (no delay). Any version that arrives at cW < N-f+1 servers is complete. The decoder connects to cR < N-f+1 servers and decodes the latest complete version or a later version.
33
Toy Model for packet arrivals, links
[Figure: write clients, N servers (up to f failures), and read clients (decoders).] Arrival at the write client: one packet (version) in every time slot, sent immediately to the servers. Channel from the write client to a server: delay is an integer in [0, T-1]. Channel from a server to a read client: instantaneous (no delay). Goal: a decoder invoked at time t receives c < N-f+1 symbols and decodes the latest common version, or a later version.
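A minimal simulation of this toy arrival model may help fix ideas. The parameter values, the random delay matrix, and the function names below are arbitrary illustrative choices, not part of the formal model.

```python
# Toy simulation: one new version per time slot, per-server delays drawn
# from {0, ..., T-1}, and a reader that wants the latest version common
# to the c servers it contacts (or something later).
import random

N, T, c = 4, 3, 2
horizon = 10
random.seed(0)

# delay[u][j]: delay of version u (issued at slot u) on the way to server j.
delay = [[random.randrange(T) for _ in range(N)] for _ in range(horizon)]

def versions_at(server, t):
    """Versions that have reached `server` by time t."""
    return {u for u in range(min(t + 1, horizon)) if u + delay[u][server] <= t}

def latest_common_version(contacted, t):
    common = set.intersection(*(versions_at(j, t) for j in contacted))
    return max(common) if common else None

t = 6
for contacted in [(0, 1), (1, 3)]:
    print(contacted, latest_common_version(contacted, t))
# Different c-subsets of servers may agree only on an older version;
# the decoder must return their latest common version, or a later one.
```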
34
Asynchrony Toy Model – Snapshot at time t
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions; no server has information about the others.
35
Asynchrony Toy Model – Snapshot at time t
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions; no server has information about the others. The decoder connects to any c servers and decodes one of …
36
Asynchrony Toy Model – Snapshot at time t
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions; no server has information about the others. The decoder connects to any c servers and decodes one of … Goal: construct a storage method that minimizes the storage cost.
37
Shared Memory Emulation Toy model
Multi-version (MVC) codes [C-Wang ISIT 2014, C-Wang arXiv 2015, C-Wang-Lynch ACM PODC 2016, Ali-C ITW 2016]
38
The multi-version coding (MVC) problem
Consistency: decode the latest possible version (a globally consistent view). Asynchrony: every server receives a different subset of versions; no server has information about the others.
41
The MVC problem: n servers, T versions, connectivity c
Goal: decode the latest common version among the c contacted servers. Minimize the storage cost: worst case across all "states" of all servers.
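One way to write the objective down precisely is the following; the notation is ours, chosen for illustration, and may differ in details from the papers'.

```latex
% Server i stores a symbol S_i from an alphabet \mathcal{S}_i, computed from
% the versions it has received; \mathcal{V} is the alphabet of one version.
\[
  \alpha \;=\; \max_{i \in \{1,\dots,n\}}
        \frac{\log_2 |\mathcal{S}_i|}{\log_2 |\mathcal{V}|}
  \qquad \text{(worst case over all reachable server states),}
\]
% subject to the decoding requirement: for every reachable state and every
% set A of c servers, the symbols \{S_i : i \in A\} determine the latest
% version common to all servers in A, or a later version.
```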
42
Solution 1: Replication
Storage size = size of one version. [Example: N=4, T=2, c=2.]
44
Solution 2: (N,c) MDS code
Storage size = (T/c) × size of one version. [Example: N=4, T=2, c=2; each server stores a coded symbol of size 1/2 per version.] Separate coding across versions; each server stores all the versions it has received.
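As a quick sanity check of the two baselines, the per-server storage costs stated on these slides, normalized by the size of one version, can be tabulated as follows (a trivial script, included only to make the comparison explicit).

```python
# Per-server storage cost, normalized by the size of one version, for the
# two baseline schemes described on the slides.
def replication_cost(T, c):
    # Replication: each server keeps one full copy of a version.
    return 1.0

def naive_mds_cost(T, c):
    # Naive MDS: each version is coded separately with an (N, c) MDS code
    # (1/c of a version per server) and every received version is kept.
    return T / c

for (T, c) in [(2, 2), (3, 3), (4, 3)]:
    print(f"T={T}, c={c}: replication={replication_cost(T, c)}, "
          f"naive MDS={naive_mds_cost(T, c):.2f}")
# For the slide's example (T=2, c=2) the two baselines coincide at cost 1;
# once the number of concurrent versions T exceeds c, naive MDS coding
# becomes more expensive than replication.
```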
45
Summary of results
[Plot: storage cost vs. number of versions T, comparing replication, naïve MDS codes, the Singleton bound, our achievable scheme, and our converse.] Storage cost inevitably increases as the degree of asynchrony grows!
50
Summary of results
Storage cost normalized by size-of-value: [Table: replication = 1; the entries for naïve MDS codes, our constructions, and the lower bound were given as formulas on the slide.] You can do up to a factor of two better than the best of replication and coding.
51
Main insights and techniques
Redundancy is required to ensure consistency in an asynchronous environment, and the amount of redundancy grows with the degree of asynchrony. Connected to pliable index coding [Brahma-Fragouli 12]. Exercises in network information theory can be converted into exercises in combinatorics. Achievability: a separate linear code for each version; carefully choose the "budget" for each version based on the set of received versions. Genie-based converse: discover "worst-case" arrival patterns.
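To give a flavor of the achievability structure described above, here is a loose sketch. The budget rule in it is a placeholder of our own invention, included only to show the shape of the scheme; it is not the carefully chosen rule from the papers.

```python
# Loose sketch: one separate MDS code per version, and a per-version
# "budget" of coded coordinates stored at a server, chosen as a function
# of the set of versions that server has received.
# NOTE: the rule below is a made-up placeholder, not the papers' rule.
def budgets(received_versions, c):
    """Hypothetical rule: split a total budget of c coordinates across the
    received versions, giving newer versions a larger share."""
    ordered = sorted(received_versions, reverse=True)   # newest first
    share = [max(c - i, 1) for i in range(len(ordered))]
    total = sum(share)
    return {v: c * s / total for v, s in zip(ordered, share)}

# A server that has received versions {1, 2, 3}, with connectivity c = 3:
print(budgets({1, 2, 3}, c=3))   # {3: 1.5, 2: 1.0, 1: 0.5}
# Each coordinate of a dimension-c MDS code is 1/c of a version, so this
# server stores sum(budgets)/c = 1 version's worth of data under this toy
# rule; the papers tune the budgets so that any c servers jointly hold
# enough coordinates of their latest common version.
```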
54
Achievability
57
Start with c servers
58
Start with c servers. Propagate version 2 to a minimal set of servers such that it is decodable.
59
Start with c servers. [Figure: version 2 propagated to additional servers; a virtual server.] Propagate version 2 to a minimal set of servers such that it is decodable. Versions 1 and 2 are decodable from c+1 symbols.
61
Summary of results
[Plot: storage cost vs. number of versions T, comparing replication, naïve MDS codes, the Singleton bound, our achievable scheme, and our converse.] Storage cost inevitably increases as the degree of asynchrony grows!
62
Shared Memory Emulation Toy model
Multi-version (MVC) codes [C-Wang ISIT 2014, C-Wang arXiv 2015, C-Wang-Lynch ACM PODC 2016, Ali-C ITW 2016]
63
Shared Memory Emulation Model – Key features
[Figure: write clients, servers, and read clients (decoders).] Arrival at clients: arbitrary. Channel from clients to servers: arbitrary delay, reliable. Clients and servers are modeled as I/O automata; the protocols can be designed.
64
Beyond the toy model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Storage cost is governed by the maximum number of new versions that overlap with a read. [Figure: a read interval with new versions arriving during it.]
65
Beyond the toy model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Storage cost is governed by the maximum number of new versions that overlap with a read; achieved using MVC codes with parameters matched to this number. [Figure: a read interval with new versions arriving during it.]
67
Shared Memory Emulation Model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata; the protocols can be designed. Storage cost: a generalization of the Singleton bound.
68
Shared Memory Emulation Model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata; the protocols can be designed. Storage cost: a generalization of the Singleton bound. [C-Lynch-Wang, ACM PODC 2016]
69
Shared Memory Emulation Model
Arbitrary arrival times; arbitrary delays between encoders, servers, and decoders. Clients and servers are modeled as I/O automata; the protocols can be designed. Storage cost: a generalization of the MVC bound for T=2; the generalization of the MVC bounds for T > 2 works for non-interactive protocols. [C-Lynch-Wang, ACM PODC 2016]
70
Shared Memory Emulation Toy model
Multi-version (MVC) codes
71
Open Questions
1) The impact of interaction/feedback. Example: two-round client protocols; (causal) knowledge of the global state.
2) Beyond "worst-case" measures. Ongoing: correlated versions (Ali-C, ITW 2016). Open: beyond worst-case states, consider "typical" arrival patterns at the servers; what is the best storage cost?
3) Beyond read-write memory: implementing arbitrary state machines.