Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms Eli Sadovnik and Steven Homberg Second Annual MIT PRIMES Conference, May 19-20, 2012

Introduction
– Shared memory supports concurrent access through a read & write interface
– Memory models: single-writer, multiple-reader (SWMR) and multiple-writer, multiple-reader (MWMR)
– Consistency is important: strong consistency provides useful semantics
– Shared memory is an abstraction for message-passing networks: it can be emulated; this is difficult to do, but solutions exist
– Example applications for the Internet: services such as Dropbox

Our Research Project
The RAMBO project: a framework for emulating shared memory
– Introduced by Lynch and Shvartsman, extended by Gilbert
– Implements the MWMR model with strong consistency
– Designed for dynamic, distributed message-passing settings
Our goal
– RAMBO is elegant but not always efficient
– Extend RAMBO with intelligent data management

Consistency & Atomicity
There are many consistency models; we are interested in atomicity: every read and write appears to take effect instantaneously at some point between its invocation and its response.
[Three timelines of read/write(8) operations contrast a safety violation, a regularity violation, and an atomic execution.]

Emulating Shared Memory
[Diagram: a single working server holds the data (5); a reader, a writer, and a second reader all access it.]

Weakness of the Centralized Approach
[Diagram: the single server has failed, and every user's access returns an error.]

Replication in a Distributed Setting
[Diagram: the data (5) is replicated; although one server has failed, two working servers still hold the data, so users can proceed.]

The ABD Algorithm (Hagit Attiya, Amotz Bar-Noy, Danny Dolev)
– A SWMR algorithm
– Operation-level wait-freedom: termination is unaffected by concurrency
– Designed for a message-passing setting: limited failures are allowed, communication is reliable, but messages can be delayed

Quorum Systems and ABD
– ABD is a quorum-based algorithm: a quorum system is a collection of pairwise intersecting sets, for example a voting-majority quorum system (see the sketch below)
– Data is replicated across a quorum system whose members are networked servers
– Atomicity is guaranteed by quorum intersection and the read/write protocols
– Reads must write! (… sometimes, as we will see later): a reader must write back the latest data, because the writer cannot be trusted to complete
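
To make the intersection property concrete, here is a minimal Python sketch (ours, not the authors' code) of a voting-majority quorum system; the server names are hypothetical.

# Minimal sketch of a majority quorum system: every two quorums
# intersect, which is the property ABD's correctness relies on.
from itertools import combinations

def majority_quorums(servers):
    """All subsets of size floor(n/2) + 1 form a majority quorum system."""
    k = len(servers) // 2 + 1
    return [set(q) for q in combinations(servers, k)]

servers = ["s1", "s2", "s3", "s4", "s5"]
quorums = majority_quorums(servers)

# Any two majority quorums share at least one server.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)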

Phased Read/Write Protocols
User 2 writes its data, a 5, to quorum Q1.
[Diagram: two intersecting quorums of working servers, Q1 and Q2; the servers of Q1 now hold the 5.]

Phased Read/Write Protocols
User 1 queries quorum Q2, sees that the latest data is a 5, and writes that value back to the server that does not yet have it.
[Diagram: the stale server in Q2 receives the 5 during the read's write-back.]

Data Versions & Timestamps
Timestamps allow us to distinguish among different versions of the data.
[Diagram: a writer installs 7 with timestamp t=2 on the servers of Q1, while a reader still finds 5 with t=1 on another server.]
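
The following sketch (our illustration, not the original algorithm's code) shows how timestamps order versions and why a reader writes back: it queries a quorum, adopts the highest-timestamped value, and propagates it before returning. Messaging is modeled as direct dictionary access for brevity, and all names are ours.

# Sketch of an ABD-style versioned read. Each replica stores a
# (timestamp, value) pair; a higher timestamp means a newer version.
# In SWMR ABD, equal timestamps imply equal values, so comparing
# the pairs lexicographically with max() is safe here.

replicas = {
    "s1": (1, 5),   # stale: still holds 5 with t=1
    "s2": (2, 7),   # fresh: holds 7 with t=2
    "s3": (2, 7),
}

def read(quorum):
    # Phase 1 (query): collect versions from a quorum, keep the newest.
    t_max, v_max = max(replicas[s] for s in quorum)
    # Phase 2 (propagate): write the newest version back so any later
    # reader is guaranteed to see it, even if the writer crashed mid-write.
    for s in quorum:
        if replicas[s][0] < t_max:
            replicas[s] = (t_max, v_max)
    return v_max

print(read({"s1", "s2"}))  # returns 7 and repairs s1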

Data Versions & Timestamps
[Diagram: after propagation, every server in Q1 and Q2 holds the latest version, 7 with t=2.]

Quorum Viability
A weakness of the ABD algorithm is that it depends on some quorum of servers always remaining viable. When no quorum is available, operations block.
[Diagram: enough servers have failed that neither Q1 nor Q2 is fully working, so the users' operations cannot complete.]

The RAMBO Framework (Reconfigurable Atomic Memory for Basic Objects)
Seth Gilbert, Nancy Lynch, Alexander Shvartsman

Quorum Reconfiguration
RAMBO uses quorum reconfiguration to ensure service longevity: a new quorum system (a new set of servers) is installed to replace the old one, allowing progress in spite of failures.
[Diagram: the partly failed old configuration is replaced by a fresh configuration of working, but still empty, servers.]

Replica Transfer
After a new set of servers is installed, these servers do not yet hold any information. The replica information (copies of data) must be transferred to the new configuration.
[Diagram: the value 7 with t=2 is copied from the old configuration's surviving servers to the new configuration.]

Garbage Collection
After the information is transferred to the new servers, the old servers are phased out of use. This process is called `garbage collection'. The garbage-collection mechanism has two phases and is analogous to the read/write operations introduced in the next slides; a sketch follows below.
[Diagram: once the new configuration holds 7 with t=2, the old configuration is retired.]
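
A hedged sketch of the two-phase pattern just described, under our own simplifications: configurations and quorums are reduced to sets of server names, whereas RAMBO's actual garbage collection is specified as an I/O automaton.

# Two-phase garbage collection: query a quorum of the old
# configuration, then propagate to a quorum of the new one.

store = {"a1": (2, 7), "a2": (2, 7), "b1": (0, 0), "b2": (0, 0)}

old_quorum = {"a1", "a2"}   # a quorum of the old configuration
new_quorum = {"b1", "b2"}   # a quorum of the new configuration

def garbage_collect():
    # Phase 1 (query): read the old quorum, so we hold a version at least
    # as recent as any write that completed in the old configuration.
    tag, value = max(store[s] for s in old_quorum)
    # Phase 2 (propagate): install that version in the new quorum before
    # the old configuration is retired.
    for s in new_quorum:
        if store[s][0] < tag:
            store[s] = (tag, value)

garbage_collect()
print(store)  # b1 and b2 now hold (2, 7); the old servers can be ignored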

Read/Write Operations: Multi-Configuration Access
What if reads and writes occur during reconfiguration? Concurrent operations contact all existing configurations to ensure that the latest information is accessed.
[Diagram: a reader holding 7, t=2 contacts both the old and the new configuration.]

Read/Write Operations: Garbage Collection
Old configurations need to be removed from use. Ongoing read/write operations use their existing configuration knowledge; new operations ignore the old configuration.
[Diagram: after garbage collection, the reader interacts only with the new configuration.]

Research Questions
Q1: Can a reader (respectively, writer) avoid contacting configurations that it has learned were garbage collected?
Q2: When can a reader avoid its second phase, and can a reader propagate selectively?
Q3: Can we propagate to the most recent configuration only?

Concurrent Garbage Collection (Q1)
We believe the garbage-collected configuration can in fact be ignored, because the reader learns that configuration's information regardless.
[Diagram: a read overlaps a garbage collection; a server reply carrying 7, t=2 already reflects everything the retired configuration knew, and the read returns 7.]

Improved Configuration Management (Q1)
The RAMBO authors conjecture that operations must contact all configurations discovered during the query (respectively, propagate) phase. In fact, communicating with configurations learned to be garbage collected mid-operation is unnecessary:
– garbage-collected configurations are discovered mid-operation through a reply from some other server, and
– that server knows a tag at least as recent as any known in the old configurations (see the sketch below)
IMPACT: improves operation liveness.
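
A minimal sketch of this optimization, assuming (as we do, not as RAMBO specifies) that each server reply can carry the set of configurations it knows were garbage collected; all names here are hypothetical.

# Configurations an operation learns are garbage collected mid-operation
# are dropped from its contact list rather than visited.

def configs_still_to_contact(active_configs, contacted, gc_learned):
    """Configurations the operation must still visit.

    active_configs: configurations known to the operation
    contacted:      configurations already visited in this phase
    gc_learned:     configurations some reply reported as garbage
                    collected; the replier's tag is at least as recent
                    as theirs, so contacting them cannot change the result.
    """
    return active_configs - contacted - gc_learned

remaining = configs_still_to_contact({"c1", "c2", "c3"}, {"c1"}, {"c2"})
print(remaining)  # {'c3'}: c2 is skipped even though it was never visited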

Improved Bookkeeping (Q2)
After querying, the reader learns that a majority of the servers already holds the up-to-date information, making propagation needless.
[Diagram: both queried servers report 7 with t=2, so the reader has nothing to write back.]

Semi-Fast Read Operations (Q2)
Read operations always propagate, regardless of how widely the replica has actually been disseminated, causing redundant messages and slow operations. The proposed solution (sketched below):
– during the query phase, the reader records the latest timestamp of each server with which it communicated
– the reader then contacts only the servers that are not up to date
– sometimes this allows omitting the propagation phase entirely (`semi-fast' read operations)
IMPACT: improves operation latency and reduces communication costs.
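
A hedged reconstruction of the idea in Python, under our simplifying assumptions: the reader knows the quorum system's membership, and replicas are reached by direct dictionary access rather than messages.

# Semi-fast read: record each server's timestamp during the query; if a
# full quorum already holds the maximum tag, skip propagation entirely,
# otherwise propagate only to the stale servers.

replicas = {"s1": (2, 7), "s2": (2, 7), "s3": (1, 5)}
quorums = [{"s1", "s2"}, {"s2", "s3"}, {"s1", "s3"}]

def semi_fast_read(queried):
    tags = {s: replicas[s] for s in queried}      # per-server bookkeeping
    t_max, v_max = max(tags.values())
    up_to_date = {s for s, (t, _) in tags.items() if t == t_max}
    if any(q <= up_to_date for q in quorums):
        return v_max                              # fast path: no phase 2
    for s in queried:                             # slow path: propagate,
        if replicas[s][0] < t_max:                # but only to stale servers
            replicas[s] = (t_max, v_max)
    return v_max

print(semi_fast_read({"s1", "s2"}))  # quorum {s1, s2} is current: no write-back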

Overly Extensive Propagation (Q3)
Currently, RAMBO both queries and propagates to all active configurations. In fact, having just the query phase cover all active configurations is sufficient for atomicity.
[Diagram: a writer propagates 7, t=2 to both the old and the new configuration, even though only the latest needs it.]

Propagate to the Latest Configuration (Q3)
We believe it is not necessary to propagate to any configuration but the last active one (see the sketch below). This relies on properties of configuration information:
– all configurations are totally ordered
– configurations have a forward link
– discovery is faster than reconfiguration
– operations query all active configurations
IMPACT: reduces communication cost.
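
A minimal sketch of the proposed change, under our assumptions: configurations are modeled as a list ordered by installation (matching the total order above), and the query phase, not shown, has already covered every active configuration.

# Propagate only to the latest active configuration; older active
# configurations are queried but receive no write-back.

stores = [
    {"a1": (2, 7), "a2": (2, 7)},   # older active configuration
    {"b1": (0, 0), "b2": (0, 0)},   # latest active configuration
]

def write_back(tag, value):
    # Atomicity is preserved because the query phase covered every active
    # configuration; propagation touches only the newest one.
    latest = stores[-1]
    for s in latest:
        if latest[s][0] < tag:
            latest[s] = (tag, value)

write_back(2, 7)
print(stores[1])  # only the latest configuration received the propagation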

Summary
Algorithmic optimizations with opportunistic benefits
– A clear advantage when servers gossip and configurations have members in common
Changes are minimally intrusive
– Modest increase in bookkeeping and in the size of messages

Future Work
Formal reasoning
– Use the Input/Output Automata framework to demonstrate that the new changes preserve the consistency guarantees of RAMBO
Simulation
– Use the TEMPO toolkit to simulate RAMBO executions and build confidence in our proofs
Empirical experiments
– Augment the existing implementations of RAMBO and collect behavioral data on PlanetLab

Special Thanks to:
– The MIT PRIMES Program
– Supervisor: Prof. Nancy Lynch
– Mentor: Dr. Peter Musial