Coding for Atomic Shared Memory Emulation
Viveck R. Cadambe (MIT)
Joint with Prof. Nancy Lynch (MIT), Prof. Muriel Médard (MIT) and Dr. Peter Musial (EMC)


Erasure Coding for Distributed Storage
Locality, repair bandwidth, caching and content distribution [Gopalan et al. 2011, Dimakis-Godfrey-Wu-Wainwright 10, Wu-Dimakis 09, Niesen-Ali 12]
Queuing theory [Ferner-Medard-Soljanin 12, Joshi-Liu-Soljanin 12, Shah-Lee-Ramchandran 12]
This talk: theory of distributed computing; considerations for storing data that changes

Consistency: the value changes; reads should return the "latest" version.
Also: failure tolerance, low storage costs, fast reads and writes.

Shared Memory Emulation - History
Atomic (consistent) shared memory [Lamport 1986]: cornerstone of distributed computing and multi-processor programming.
Emulation over distributed storage systems: the "ABD" algorithm [Attiya-Bar-Noy-Dolev 95], 2011 Dijkstra Prize; replication-based; Amazon Dynamo key-value store [DeCandia et al. 2008].
Costs of emulation (this talk): a low-cost coding-based algorithm; communication and storage costs [C-Lynch-Medard-Musial 2014], preprint available.

Atomicity [Lamport 86], aka linearizability [Herlihy, Wing 90]: every operation appears to take effect instantaneously at some point between its invocation and its response.
[Figures: timelines of concurrent write and read operations, illustrating executions that are atomic and executions that are not atomic.]

Distributed Storage Model
Client-server architecture; nodes can fail (the number of server failures is limited).
Point-to-point reliable links (arbitrary delay); nodes do not know if other nodes fail.
An operation should not have to wait for others to complete.
[Figure: servers, write clients, read clients]

Requirements and cost measure
Design write, read and server protocols such that:
Atomicity; concurrent operations, no waiting.
Communication overheads: number of bits sent over links.
Storage overheads: (worst-case) server storage costs.

The ABD algorithm (sketch)
Quorum set: every majority of server nodes. Any two quorum sets intersect in at least one node.
The algorithm works if at least one quorum set is available.
[Figure: servers, write clients, read clients]

The ABD algorithm (sketch)
Write: Send the time-stamped value to every server; return after receiving acks from a quorum.
Read: Send a read query; wait for responses from a quorum; send the latest value back to the servers; return the latest value after receiving acks from a quorum.
Servers: Store the latest value received; send an ack. Respond to read queries with the stored value.

The ABD algorithm (summary)
The ABD algorithm ensures atomic operations. Termination of operations is ensured as long as a majority of nodes do not fail.
Implication: a networked distributed storage system can be used as shared memory. Replication ensures failure tolerance.
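
The two-phase read (query, then write-back) is what makes the emulation atomic. Below is a minimal single-process sketch of these steps, assuming in-memory servers, synchronous calls in place of message passing, an explicitly supplied time-stamp, and illustrative names (Server, abd_write, abd_read); it is a sketch of the protocol above, not the original pseudocode.

# Single-process sketch of the ABD read/write phases. Servers are in-memory
# objects and every "message" is a synchronous call, so quorum waiting only
# appears as majority assertions; nothing fails in this toy run.

class Server:
    def __init__(self):
        self.ts, self.value = 0, None      # latest time-stamp and value

    def store(self, ts, value):
        if ts > self.ts:                   # keep only the latest version
            self.ts, self.value = ts, value
        return "ack"

    def query(self):
        return self.ts, self.value

def abd_write(servers, ts, value):
    # Send the time-stamped value to every server; return once a majority
    # has acked (here all servers answer immediately).
    acks = [s.store(ts, value) for s in servers]
    assert acks.count("ack") > len(servers) // 2

def abd_read(servers):
    # Phase 1: query a majority and pick the value with the latest time-stamp.
    responses = [s.query() for s in servers]
    ts, value = max(responses, key=lambda r: r[0])
    # Phase 2 (write-back): propagate that value to a majority before
    # returning, so that no later read can return an older value.
    abd_write(servers, ts, value)
    return value

servers = [Server() for _ in range(5)]
abd_write(servers, ts=1, value="v1")
assert abd_read(servers) == "v1"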

Performance Analysis
[Table: storage, read communication and write communication costs of ABD; f denotes the number of failures.]
A lower communication cost algorithm is given in [Fan-Lynch 03].

Shared Memory Emulation – Erasure Coding
Prior erasure-coded emulation algorithms: [Hendricks-Ganger-Reiter 07, Dutta-Guerraoui-Levy 08, Dobre et al. 13, Androulaki et al. 14].
This work: a new algorithm and a formal analysis of costs; it outperforms previous algorithms in certain aspects: previous algorithms incur infinite worst-case storage costs and large communication costs.

Erasure Coded Shared Memory
Example: a (6,4) MDS code.
The value is recoverable from any 4 coded packets; each coded packet is ¼ the size of the value. Smaller packets, smaller overheads.
New constraint: a reader needs 4 packets with the same time-stamp.
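
To make the (6,4) example concrete, here is a toy Reed-Solomon-style MDS code sketch over the prime field GF(257). The field, the evaluation points and the byte-sized symbols are illustrative assumptions, not the construction used in the talk; the point is only that any 4 of the 6 coded packets recover the value.

# Toy (6, 4) MDS code: encode 4 data symbols into 6 packets by polynomial
# evaluation over GF(257); decode from any 4 packets by Lagrange interpolation.

P = 257  # prime field size; each data symbol is assumed to be a byte (0..255)

def encode(data, n=6):
    """Evaluate the polynomial with coefficients `data` at x = 1..n."""
    return [sum(d * pow(x, i, P) for i, d in enumerate(data)) % P
            for x in range(1, n + 1)]

def decode(shares, k=4):
    """Recover the k data symbols from any k shares (x, y) pairs."""
    assert len(shares) == k
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(shares):
        # Build the Lagrange basis polynomial prod_{j != i} (X - xj)
        # and its scaling denominator prod_{j != i} (xi - xj).
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(shares):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for t, b in enumerate(basis):
                new[t] = (new[t] - xj * b) % P      # multiply by -xj
                new[t + 1] = (new[t + 1] + b) % P   # multiply by X
            basis = new
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P       # modular inverse (Fermat)
        for t in range(k):
            coeffs[t] = (coeffs[t] + scale * basis[t]) % P
    return coeffs

value = [11, 22, 33, 44]                   # 4 data symbols
packets = encode(value)                    # 6 coded packets, one per server
shares = list(zip(range(1, 7), packets))   # (evaluation point, packet) pairs
assert decode(shares[2:6]) == value        # any 4 packets suffice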

Coded Shared Memory – Quorum set up
Quorum set: every subset of 5 server nodes. Any two quorum sets intersect in at least 4 nodes.
The algorithm works if at least one quorum set is available.
[Figure: servers, write clients, read clients]
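
A quick brute-force check of this quorum property (illustrative, using Python's itertools): any two distinct size-5 quorums over 6 servers share at least 4 servers, i.e. at least k = 4 coded packets carrying the same time-stamp.

# Every pair of distinct 5-out-of-6 quorums intersects in exactly 4 servers.
from itertools import combinations

quorums = list(combinations(range(6), 5))
assert min(len(set(a) & set(b))
           for a in quorums for b in quorums if a != b) == 4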

Coded Shared Memory – Why is it challenging?
Challenges: reveal elements to readers only when enough elements have been propagated; discard old versions safely.
Solutions: write in multiple phases; servers store multiple versions (all the write-versions concurrent with a read).
[Figure: servers, write clients, read clients]

Coded Shared Memory – Protocol overview
Write: Send a time-stamped coded symbol to every server; send a finalize message after receiving acks from a quorum; return after receiving acks from a quorum.
Read: Send a read query; wait for time-stamps from a quorum; send a request with the latest time-stamp to the servers; decode and return the value after receiving acks/symbols from a quorum.
Servers: Store the coded symbol; keep the latest δ codeword symbols and delete older ones; send an ack. Set a finalize flag for a time-stamp on receiving its finalize message; send an ack. Respond to a read query with the latest finalized time-stamp. On a read request, finalize the requested time-stamp and respond with the codeword symbol if it exists, else send an ack.

Coded Shared Memory – Protocol overview
Use an (N,k) MDS code, where N is the number of servers. The protocol ensures atomic operations.
Termination of operations is ensured as long as:
o the number of failed nodes is smaller than (N-k)/2, and
o the number of writes concurrent with a read is smaller than δ.
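
Below is a minimal single-process sketch of the server-side bookkeeping described above; message passing, quorum counting and the client side are omitted, and the class and method names (CodedServer, store, finalize, read_query, read_request) are illustrative, with delta standing for the concurrency parameter δ.

class CodedServer:
    def __init__(self, delta):
        self.delta = delta
        self.symbols = {}        # time-stamp -> codeword symbol (latest delta kept)
        self.finalized = set()   # time-stamps whose finalize message has arrived

    def store(self, ts, symbol):
        # Write phase 1: store the coded symbol, keep only the latest delta
        # codeword symbols, delete older ones, and acknowledge.
        self.symbols[ts] = symbol
        for old in sorted(self.symbols)[:-self.delta]:
            del self.symbols[old]
        return "ack"

    def finalize(self, ts):
        # Write phase 2: set the finalize flag for this time-stamp and ack.
        self.finalized.add(ts)
        return "ack"

    def read_query(self):
        # Respond to a read query with the latest finalized time-stamp.
        return max(self.finalized, default=None)

    def read_request(self, ts):
        # Finalize the requested time-stamp; return the codeword symbol if it
        # is still stored, otherwise just an ack.
        self.finalized.add(ts)
        return self.symbols.get(ts, "ack")

s = CodedServer(delta=2)
s.store(1, "c1"); s.finalize(1)
assert s.read_query() == 1 and s.read_request(1) == "c1"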

Performance comparisons
[Table: storage, read communication and write communication costs of ABD vs. our algorithm.]
N denotes the number of nodes, f the number of failures, and δ the maximum number of writes concurrent with a read.

Proof Steps
After every operation terminates:
- there is a quorum of servers with the codeword symbol
- there is a quorum of servers with the finalize label
- because every pair of quorums intersects in at least k servers, readers can decode the value
When a codeword symbol is deleted at a server:
- every operation that wants that time-stamp has terminated
- (or the concurrency bound is violated)
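
A one-line check of the intersection step, assuming quorums of size ⌈(N+k)/2⌉ (which gives the size-5 quorums of the (6,4) example): for any two quorums Q1 and Q2 over N servers,

\[
|Q_1 \cap Q_2| \ge |Q_1| + |Q_2| - N = 2\left\lceil \frac{N+k}{2} \right\rceil - N \ge k,
\]

so a reader that hears from any quorum collects at least k codeword symbols carrying the time-stamp of a completed write, and can decode the value.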

Main Insights
Significant savings on network traffic overheads: reflects the classical gain of erasure coding over replication.
(New insight) Storage overheads depend on client activity: the storage overhead is proportional to the number of writes concurrent with a read, which is better than classical techniques for moderate client activity.

Future Work – Many open questions
Refinements of our algorithm: (ongoing) more robustness to client node failures.
Information theoretic bounds on costs; new coding schemes.
Finer network models: erasure channels, different topologies, wireless channels.
Finer source models: correlations across versions.
Dynamic networks.

Storage costs
[Plot: storage overhead of ABD and of our algorithm vs. the number of writes concurrent with a read.]
What is the fundamental cost curve?

Future Work – Dynamic networks
An interesting replication-based algorithm in [Gilbert-Lynch-Shvartsman 03].
Study of costs in terms of network dynamics.