Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms Eli Sadovnik and Steven Homberg Second Annual MIT PRIMES Conference, May 19-20, 2012
Introduction
Shared memory supports concurrent access
– Read & write interface
Memory models: single writer, multiple reader (SWMR) and multiple writer, multiple reader (MWMR)
– Consistency is important
Strong consistency provides useful semantics
Abstraction for message-passing networks
– Shared memory can be emulated
– Difficult to do, but solutions exist
– Examples include Internet applications such as Dropbox
Our Research Project
THE RAMBO PROJECT
Framework for emulating shared memory
– Introduced by Lynch and Shvartsman, extended by Gilbert
– Implements the MWMR model with strong consistency
– Designed for dynamic distributed message-passing settings
OUR GOAL
RAMBO is elegant but not always efficient
Extend RAMBO with intelligent data management
Consistency & Atomicity
There are many consistency models
We are interested in atomicity
[Figure: three timelines, each starting from the value 0 and containing a write(8). Safety violation: read(3), read(0), read(8). Regularity violation: read(8), read(0), read(8). Atomicity: read(8).]
Emulating Shared Memory
[Figure: a single server (Data: 5, Status: WORKING) serving User 1 (Reader), User 2 (Writer), and User 3 (Reader); every user sees Data: 5.]
Weakness of the Centralized Approach
[Figure: the single server has Status: FAILED and holds no data; User 1 (Reader) and User 2 (Writer) have no data, and User 3 (Reader) receives an error.]
Replication in Distributed Setting
[Figure: three servers, one with Status: FAILED and two with Status: WORKING, one of which holds Data: 5; User 1 (Reader), User 2 (Writer), and User 3 (Reader) are connected to them.]
The ABD Algorithm
Hagit Attiya, Amotz Bar-Noy, Danny Dolev
A SWMR algorithm
Operation-level wait-freedom
– Termination unaffected by concurrency
Designed for a message-passing setting
– Allows limited failures
– Communication is reliable
– Messages can be delayed
Quorum Systems and ABD
ABD is a quorum-based algorithm
– A quorum system is a collection of intersecting sets
For example, a voting-majority quorum system
Data is replicated in a quorum system
– Quorum system members are networked servers
Guarantee of atomicity
– Quorum intersection and read/write protocols
Reads must write! (… sometimes, as we will see later)
– A reader must write the latest data
– Writer cannot be trusted to complete
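To make the majority example concrete, here is a minimal sketch that enumerates the majority quorums of a small server set and checks that every pair of quorums intersects. The function names and the server labels `s1` to `s3` are illustrative assumptions, not part of ABD or RAMBO.

```python
from itertools import combinations

def majority_quorums(servers):
    """Return every subset of `servers` that contains a strict majority."""
    servers = sorted(servers)
    smallest = len(servers) // 2 + 1
    return [
        frozenset(group)
        for size in range(smallest, len(servers) + 1)
        for group in combinations(servers, size)
    ]

def pairwise_intersecting(quorums):
    """True if every two quorums share at least one member."""
    return all(a & b for a, b in combinations(quorums, 2))

# Hypothetical three-server system.
quorums = majority_quorums({"s1", "s2", "s3"})
print(len(quorums))                    # 4: three pairs plus the full set
print(pairwise_intersecting(quorums))  # True
```

Because any two majorities overlap, a reader that contacts one quorum is guaranteed to reach at least one server that took part in the latest completed write.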
Phased Read/Write Protocols
[Figure: three servers with Status: WORKING and no data yet, grouped into quorums Q1 and Q2; User 1 (Reader), User 2 (Writer), and User 3 (Reader) are connected to them.]
User 2 writes its data, a 5, to quorum Q1.
Phased Read/Write Protocols
[Figure: the same three servers and quorums Q1 and Q2; one server is shown holding Data: 5, and a 5 is in transit.]
User 1 queries quorum Q2, sees the latest data is a 5, and writes that back to the computer that does not have the latest data.
Data Versions & Timestamps
[Figure: quorums Q1 and Q2 over three servers with Status: WORKING; one holds 5,t=1 and two hold 7,t=2; messages carrying 5,t=1 and 7,t=2 are in transit.]
Timestamps allow us to distinguish among different versions of the data.
Data Versions & Timestamps
[Figure: all three servers in quorums Q1 and Q2 now hold 7,t=2 with Status: WORKING, and a user holds 7,t=2.]
Quorum Viability
[Figure: quorums Q1 and Q2; the servers with Status: WORKING hold 7,t=2, other servers have Status: FAILED, and User 3 (Reader) receives an error.]
A weakness of the ABD algorithm is that it depends on a quorum of servers always being viable. When no quorum is available, operations are blocked.
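The two-phase, timestamped read walked through on the preceding slides can be sketched as follows. This is a simplified, synchronous, in-memory stand-in for the message-passing protocol: the `Server` class and the `write` and `read` helpers are assumptions made for illustration, and a real implementation would wait for replies from a quorum over an asynchronous network.

```python
class Server:
    """A replica holding one value and the timestamp of the write that produced it."""

    def __init__(self):
        self.value, self.ts = None, 0

    def query(self):
        return self.value, self.ts

    def store(self, value, ts):
        # Keep only the newest version.
        if ts > self.ts:
            self.value, self.ts = value, ts

def write(quorum, value, ts):
    """Single writer: push (value, ts) to every server in one quorum."""
    for server in quorum:
        server.store(value, ts)

def read(quorum):
    """Phase 1: query a quorum and pick the newest version.
    Phase 2: write that version back to the same quorum before returning."""
    value, ts = max((s.query() for s in quorum), key=lambda pair: pair[1])
    for server in quorum:
        server.store(value, ts)
    return value

a, b, c = Server(), Server(), Server()
q1, q2 = [a, b], [b, c]      # two majority quorums sharing server b
write(q1, 5, ts=1)
write(q1, 7, ts=2)
print(read(q2))              # 7, learned from b
print(c.query())             # (7, 2), written back by the reader
```

The write-back in phase 2 is what "reads must write" refers to: after the read returns, a full quorum holds the value it returned, so no later read can return an older one.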
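The blocking condition can be illustrated with a small check, assuming quorums are given as explicit sets of server names. The helper `live_quorum` and the server labels are hypothetical.

```python
def live_quorum(quorums, failed):
    """Return a quorum with no failed members, or None if every quorum is hit."""
    for quorum in quorums:
        if not (quorum & failed):
            return quorum
    return None

# Majority quorums over three hypothetical servers.
quorums = [
    frozenset({"s1", "s2"}),
    frozenset({"s2", "s3"}),
    frozenset({"s1", "s3"}),
]
one_failure = live_quorum(quorums, failed={"s3"})         # {s1, s2} survives
two_failures = live_quorum(quorums, failed={"s2", "s3"})  # None: operations block
print(one_failure is not None, two_failures is None)
```

With a fixed set of servers, nothing can be done once the second case is reached, which is the motivation for reconfiguration.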
The RAMBO Framework (Reconfigurable Atomic Memory for Basic Objects) Seth Gilbert Nancy Lynch Alexander Shvartsman
Quorum Reconfiguration
[Figure: an old configuration (quorums Q1, Q2) with one server in Status: FAILED and two in Status: WORKING holding 7,t=2, next to a new configuration (quorums Q1, Q2) of three servers with Status: WORKING and no data.]
RAMBO uses quorum reconfiguration to ensure service longevity. A new quorum system (a new set of servers) is installed to replace the old one, allowing progress in spite of failures.
Replica Transfer
[Figure: the same old and new configurations, with 7,t=2 in transit from the old servers to the new ones.]
After a new set of servers is installed, these servers do not have any information. The replica information (copies of data) must be transferred to the new configuration.
Garbage Collection
[Figure: the old configuration (one FAILED server, two WORKING servers holding 7,t=2) and the new configuration, in which two servers now hold 7,t=2 and one is still empty.]
After information is transferred to the new servers, the old servers are phased out of use. This process is called 'garbage collection'. The mechanism for garbage collection has two phases and is analogous to read/write operations (introduced in the next slides).
Read/Write Operations: Multi-Configuration Access
[Figure: the old and new configurations as on the previous slide, with User 1 (Reader) holding 7,t=2.]
What if reads and writes occur during reconfiguration? Concurrent operations contact all existing configurations to ensure the latest information is accessed.
Read/Write Operations: Garbage Collection
[Figure: the old and new configurations as on the previous slide, with User 1 (Reader) holding 7,t=2.]
Old configurations need to be removed from use. Ongoing read/write operations use their existing configuration knowledge. New operations ignore the old configuration.
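Replica transfer followed by garbage collection can be sketched as a query step over the old configurations and a propagate step to the new one. This is a deliberately simplified model, not RAMBO's actual protocol: configurations are plain dictionaries mapping a hypothetical server name to a `(value, timestamp)` pair, every server is contacted rather than a quorum, and a failed server is modeled as holding `(None, 0)`.

```python
def reconfigure(active, new_config):
    """Install `new_config` and retire the configurations in `active`.

    Each configuration maps server name -> (value, timestamp).
    """
    # Phase 1 (query): find the newest version held by any old configuration.
    newest = max(
        (version for config in active for version in config.values()),
        key=lambda version: version[1],
    )
    # Phase 2 (propagate): transfer that replica to the new servers.
    for server in new_config:
        new_config[server] = newest
    # Garbage collection: only the new configuration stays in use.
    return [new_config]

old = {"s1": (None, 0), "s2": (7, 2), "s3": (7, 2)}
new = {"s4": (None, 0), "s5": (None, 0), "s6": (None, 0)}
active = reconfigure([old], new)
print(active[0]["s4"])   # (7, 2)
```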
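A minimal sketch of why an operation must query every active configuration while a reconfiguration is in flight. As an assumption for brevity, configurations are dictionaries from a hypothetical server name to `(value, timestamp)`, and the query contacts every server instead of a quorum.

```python
def query_all(configurations):
    """Return the newest (value, timestamp) seen in any of the configurations."""
    newest = (None, 0)
    for config in configurations:
        for version in config.values():
            if version[1] > newest[1]:
                newest = version
    return newest

old = {"s1": (None, 0), "s2": (7, 2), "s3": (7, 2)}
new = {"s4": (None, 0), "s5": (None, 0), "s6": (None, 0)}  # transfer not done yet

print(query_all([new]))       # (None, 0): the new configuration alone is stale
print(query_all([old, new]))  # (7, 2): querying both finds the latest version
```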
Research Questions
Q1: Can a reader (respectively writer) avoid contacting configurations that it learned have been marked as garbage collected?
Q2: When can a reader avoid its second phase, and can a reader propagate selectively?
Q3: Can we propagate to the most recent configuration only?
Concurrent Garbage Collection (Q1)
[Figure: an old configuration (quorums Q1, Q2) whose servers hold 5,t=1, 7,t=2, and 7,t=2, and a new configuration (quorums Q1, Q2) whose servers are empty or hold 0,t=0; User 1 (Reader) receives 7,t=2 and returns 7.]
We believe that the garbage collected configuration can in fact be ignored because the reader learns of the configuration's information regardless.
Improved Configuration Management (Q1)
Authors of RAMBO conjecture that operations must contact all configurations that are discovered during the query (respectively propagate) phase.
Communicating with configurations learned to be garbage collected mid-operation is unnecessary
– The garbage-collected status is discovered mid-operation from another server
– That server knows a tag at least as recent as any known in the old configurations
IMPACT: improves operation liveness
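The proposed Q1 optimization can be sketched as a query phase that drops a configuration from its worklist as soon as some reply reports it as garbage collected. The reply shape (a `tag` plus a `garbage_collected` set), the configuration names, and the sequential contact order are all assumptions for illustration; in RAMBO the contacts are concurrent, and the safety of skipping is a belief stated in these slides rather than a proven result.

```python
def query_phase(replies, contact_order):
    """Simulate one query phase.

    replies[c] is what configuration c answers: its newest tag and the set of
    configurations its servers know to be garbage collected.
    """
    newest_tag = 0
    garbage_collected = set()
    contacted = []
    for config_id in contact_order:
        if config_id in garbage_collected:
            continue  # learned mid-operation that this configuration is retired
        reply = replies[config_id]
        contacted.append(config_id)
        newest_tag = max(newest_tag, reply["tag"])
        garbage_collected |= reply["garbage_collected"]
    return newest_tag, contacted

replies = {
    "old": {"tag": 2, "garbage_collected": set()},
    "new": {"tag": 2, "garbage_collected": {"old"}},
}
print(query_phase(replies, ["new", "old"]))  # (2, ['new'])
print(query_phase(replies, ["old", "new"]))  # (2, ['old', 'new'])
```

In the first call the reader hears from the new configuration first, learns that the old one is retired, and still obtains tag 2, which matches the argument that the informing server already knows a tag at least as recent.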
Improved Bookkeeping (Q2)
[Figure: three servers in quorums Q1 and Q2, all holding 7,t=2 with Status: WORKING; User 1 (Reader) receives replies carrying 7,t=2.]
After querying, the reader learns that a majority of nodes has the up-to-date information, which makes propagation unnecessary.
Semi-Fast Read Operations (Q2)
Read operations always propagate
– Regardless of the actual replica dissemination
– Redundant messages and slow operation
The proposed solution
– During the query phase, the reader records the latest timestamp of each server with which it communicated
– The reader then contacts only the servers that are not up-to-date
– Sometimes this allows omitting the propagation phase entirely ('semi-fast' read operations)
IMPACT: improves operation latency and reduces communication costs
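The bookkeeping described above can be sketched as follows, assuming majority quorums so that "a quorum is up to date" reduces to a count. The function name, the reply format, and the `quorum_size` parameter are illustrative assumptions rather than RAMBO's interface.

```python
def semi_fast_read(replies, quorum_size):
    """Decide what a reader must propagate after its query phase.

    replies maps server name -> (value, timestamp) as recorded during the query.
    Returns (value to return, servers that still need the newest version).
    """
    value, newest_ts = max(replies.values(), key=lambda version: version[1])
    up_to_date = [s for s, (_, ts) in replies.items() if ts == newest_ts]
    if len(up_to_date) >= quorum_size:
        return value, []  # a quorum already holds it: skip propagation
    stale = [s for s in replies if s not in up_to_date]
    return value, stale   # propagate selectively to these servers only

print(semi_fast_read({"s1": (7, 2), "s2": (7, 2), "s3": (5, 1)}, quorum_size=2))
# (7, [])
print(semi_fast_read({"s2": (7, 2), "s3": (5, 1)}, quorum_size=2))
# (7, ['s3'])
```

The first call is the semi-fast case: two of three servers already hold 7,t=2, so the read returns after a single phase. In the second call only the stale server is contacted instead of the whole quorum.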
Overly Extensive Propagation (Q3)
[Figure: an old configuration (one FAILED server, two WORKING servers holding 7,t=2) and a new configuration (two servers holding 7,t=2, one empty), with User 1 (Writer) holding 7,t=2.]
Currently, RAMBO both queries and propagates to all active configurations. In fact, a query phase that covers all active configurations is sufficient for atomicity.
Propagate to the Latest Configuration (Q3)
We believe it is not necessary to propagate to any configuration but the last active configuration.
Properties of configuration information
– All configurations are totally ordered
– Configurations have a forward link
– Discovery is faster than reconfiguration
– Operations query all active configurations
IMPACT: reduces communication cost
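A minimal sketch of a write under the proposed Q3 optimization: the query phase still covers every active configuration, but the propagation phase updates only the most recent one. As assumptions for brevity, an integer timestamp stands in for RAMBO's tag, configurations are dictionaries from a hypothetical server name to `(value, timestamp)`, and every server is contacted rather than a quorum. The slides state this optimization as a belief, so the sketch shows the intended behavior, not a verified protocol.

```python
def write(configs, value):
    """Write `value` given the active configurations, ordered oldest first."""
    # Query phase: still covers every active configuration.
    newest_ts = max(ts for config in configs for _, ts in config.values())
    version = (value, newest_ts + 1)
    # Propagation phase: only the most recent configuration is updated.
    latest = configs[-1]
    for server in latest:
        latest[server] = version
    return version

old = {"s1": (7, 2), "s2": (7, 2)}
new = {"s3": (7, 2), "s4": (None, 0)}
print(write([old, new], 9))  # (9, 3)
print(old)                   # unchanged: no messages sent to the old servers
print(new)                   # both new servers now hold (9, 3)
```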
Summary
Algorithmic optimizations
Opportunistic benefits
– A clear advantage when servers gossip and configurations have members in common
Changes are minimally intrusive
– Modest increase in bookkeeping and the size of messages
Future Work
Formal reasoning
– Use the Input/Output Automata framework to demonstrate that the new changes preserve the consistency guarantees of RAMBO
Simulation
– Use the TEMPO toolkit to simulate RAMBO executions and build confidence in our proofs
Empirical experiments
– Augment the existing implementations of RAMBO and collect behavior data on PlanetLab
Special Thanks to: The MIT PRIMES Program Supervisor Prof. Nancy Lynch Mentor Dr. Peter Musial