RAMBO: Reconfigurable Atomic Memory for Dynamic Networks
Edward Bortnikov – Topics in Reliable Distributed Computing
Slides partially borrowed from Nancy Lynch (DISC ’02), Seth Gilbert (DSN ’03), and Idit Keidar (multiple talks)

Outline
Definitions and Goals
Static Quorum Systems
Dynamic Quorum Systems – before RAMBO
RAMBO
Dynamic Quorum Systems – beyond RAMBO

Distributed Shared Memory (diagram: clients issue Read, Write(7), and Write(0) operations against a replicated object)

Atomic Consistency
AKA linearizability. Definition: each operation appears to occur at some point between its invocation and response. Sufficient condition: for each object x, all the read and write operations for x can be partially ordered by a relation ≺, so that:
≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1.
All write operations are ordered with respect to each other and with respect to all the reads.
Every read returns the value of the last write preceding it in ≺.

Atomic Consistency (diagram: with concurrent Write(7) and Write(0), a Read returns 7 – a value consistent with some ordering of the overlapping operations)

Quorums (diagram: Write(7) contacts one quorum of the replicas and Read contacts another; the two quorums intersect)

Dynamic Atomic Memory

Outline
Definitions and Goals
Static Quorum Systems
Dynamic Quorum Systems – before RAMBO
RAMBO
Dynamic Quorum Systems – beyond RAMBO

Prior Work on Quorums (Static)
Gifford (79) and Thomas (79)
Upfal and Wigderson (85): majority sets of readers and writers
Vitanyi and Awerbuch (86): matrices of single-writer/single-reader registers
Attiya, Bar-Noy and Dolev (90/95): majorities of processors to implement single-writer/multi-reader objects in message-passing systems

ABD: A(ttiya) B(ar-Noy) D(olev)
Single-writer, multiple readers. Assumes a majority of non-faulty processors (nodes); a majority is the primitive quorum.
Communication pattern: send a request to all n processors, await acks from a majority.
Tags are used for distributed ordering of operations: WRITE operations increment the tag, READ operations use the tag, and both propagate it.
Properties: a READ returns the value of either the last completed WRITE or a concurrent WRITE; tags induce a ≤ ordering between READs.

Reads and Writes (diagram: each replica holds a Value and a Tag)
Write: increment the tag; send the tag/value.
Read: Phase 1 – find the highest tag/value; Phase 2 – send the tag/value back.
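
To make the protocol concrete, here is a minimal Python sketch of the ABD phases over majority quorums. The `Replica`/`AbdClient` classes and the synchronous stubbed messaging are illustrative assumptions; the real protocol runs over asynchronous message passing and proceeds as soon as the first majority has replied.

```python
# A minimal sketch of ABD over majority quorums (assumptions noted above).
from dataclasses import dataclass

@dataclass(order=True)
class Tag:
    seq: int         # incremented by the writer on every WRITE
    writer: int      # writer id (breaks ties in multi-writer variants)

class Replica:
    def __init__(self):
        self.tag, self.value = Tag(0, 0), None
    def query(self):
        return (self.tag, self.value)
    def store(self, tag, value):
        if tag > self.tag:                        # keep only the newest pair
            self.tag, self.value = tag, value
        return "ack"

class AbdClient:
    def __init__(self, replicas):
        self.replicas = replicas

    def _majority(self, responses):
        # Stub: in a real system, wait for the FIRST majority of replies.
        assert len(responses) > len(self.replicas) // 2
        return responses

    def write(self, my_id, seq, value):
        tag = Tag(seq, my_id)                     # WRITE increments the tag
        self._majority([r.store(tag, value) for r in self.replicas])

    def read(self):
        # Phase 1 (query): find the highest tag/value at a majority.
        replies = self._majority([r.query() for r in self.replicas])
        tag, value = max(replies, key=lambda tv: tv[0])
        # Phase 2 (propagate): write the pair back to a majority, so any
        # later read is guaranteed to see a tag at least this large.
        self._majority([r.store(tag, value) for r in self.replicas])
        return value

replicas = [Replica() for _ in range(5)]
client = AbdClient(replicas)
client.write(my_id=1, seq=1, value=7)
assert client.read() == 7
```

The Phase 2 write-back in read() is what makes reads atomic: any later read intersects the majority that stored the tag.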

Outline
Definitions and Goals
Static Quorum Systems
Dynamic Quorum Systems – before RAMBO
RAMBO
Dynamic Quorum Systems – beyond RAMBO

Dynamic Approaches (1)
Consensus to agree on each operation [Lamport]: consensus for each R/W ⇒ bad performance!
Virtual synchrony [Birman 85]: group communication; R/W simulated through atomic broadcast; consensus only for a special case (view change). Issue: determining the primary partition (quorum) – [Yeger-Lotem, Keidar, Dolev ’97] dynamic voting.
But still performance issues: one join or failure may trigger view formation ⇒ delays R/W; in the presence of failures, R/W ops can be delayed indefinitely.

Group Communication Abstraction (diagram: application over a GC layer)
Downcalls: Send(Grp, Msg), Join/Leave(Grp)
Upcalls: Deliver(Msg), View(Grp, Members, Id)
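
The abstraction can be written down directly as an interface. The sketch below renders the slide's actions as a Python ABC; the split into downcalls and `on_*` upcalls is an assumed design choice, not prescribed by any particular GC system.

```python
# The slide's group-communication actions as a Python interface (a sketch).
from abc import ABC, abstractmethod

class GroupCommunication(ABC):
    # Downcalls from the application:
    @abstractmethod
    def send(self, grp: str, msg: bytes) -> None: ...    # Send(Grp, Msg)

    @abstractmethod
    def join(self, grp: str) -> None: ...                # Join(Grp)

    @abstractmethod
    def leave(self, grp: str) -> None: ...               # Leave(Grp)

    # Upcalls from the GC layer to the application:
    @abstractmethod
    def on_deliver(self, msg: bytes) -> None: ...        # Deliver(Msg)

    @abstractmethod
    def on_view(self, grp: str, members: frozenset, view_id: int) -> None:
        ...                                              # View(Grp, Members, Id)
```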

Group Communication Systems (1): Group Membership
Processes are organized into groups; particular memberships are stamped as views. Views provide a form of Concurrent Common Knowledge about the system. In a partitionable system, views can be concurrent.
(Diagram: timeline of p1, p2, p3 moving through views V1 = {p1, p2, p3}, V2 = {p1, p2}, V3 = {p3}, V5 = {p1, p2, p3}.)

Virtual Synchrony [Birman, Joseph 87]
Integration of multicast and membership; synchronization of messages and views. Includes many different properties. One key property: processes that go together through the same views deliver the same sets of messages ⇒ a powerful abstraction for state-machine replication.

Group Communication Systems (2)
Reliable Multicast: messages sent to the group, with Total/Causal/FIFO ordering.
Virtual Synchrony: the same set of multicast messages is delivered to group members between view changes.
Guaranteed Self Delivery: a process will eventually deliver a self-message or crash.
(Usually) Sending View Delivery: a message is delivered in the same view in which it is sent.
(Diagram: timeline of p1, p2, p3 through views V1 = {p1, p2, p3} and V2 = {p1, p2}.)

Example: a GC-based VOD server
(Diagram: movie groups for Chocolat, Gladiator, and Spy Kids; a control Service Group and per-movie Session Groups; clients issue start, update, and “Movies?” requests.)

Virtual Synchrony – Membership
Issue: accurate estimation of group membership. The natural implementation is consensus – but distributed consensus is impossible under failures in an asynchronous system [FLP ’85]: how do you distinguish between a failed and a slow processor?
Solution: failure detectors to deliver views; they may use mechanisms other than asynchronous message arrivals to suspect failed processes.
Failure detector ◊S: initially the output is arbitrary, but eventually every process that crashes is suspected (completeness), and some process that does not crash is not suspected (accuracy). ◊S is the weakest failure detector that solves consensus (rotating-coordinator algorithm). A sketch of such a detector follows.
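
As an illustration of the “mechanisms other than asynchronous message arrivals” point, here is a heartbeat detector with adaptive timeouts, in the spirit of ◊S. The timeout-doubling policy and all names are assumptions for illustration, not part of the lecture.

```python
# Heartbeat failure detector sketch: early suspicions may be wrong, but the
# timeout grows on every false suspicion, so after the network stabilizes,
# crashed processes stay suspected and correct ones stop being suspected.
import time

class EventuallyStrongFD:
    def __init__(self, peers, initial_timeout=1.0):
        now = time.monotonic()
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heartbeat = {p: now for p in peers}
        self.suspected = set()

    def on_heartbeat(self, p):
        self.last_heartbeat[p] = time.monotonic()
        if p in self.suspected:        # false suspicion: learn and back off
            self.suspected.discard(p)
            self.timeout[p] *= 2       # adapt to the real (unknown) delay

    def tick(self):
        now = time.monotonic()
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)  # suspect silent processes
        return self.suspected
```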

Virtual Synchrony – Multicast
Assumption: point-to-point reliable FIFO channels.
All-or-none message delivery – but only to the view (alive processes): dead men tell no tales (E.W. Hornung, 1899). STABLE messages and delivery between views.
What if the sender crashes in the middle of a multicast? The ISIS algorithm uses FLUSH markers; messages can be delayed indefinitely during view formation!
Total message ordering: the TOTEM (token-ring) algorithm, or the symmetric (Lamport timestamps) algorithm.

Implementing Virtual Synchrony
1. Process 4 notices that process 7 has crashed and sends a view change.
2. Process 6 sends out all its unstable messages, followed by a flush message.
3. Process 6 installs the new view when it has received a flush message from everyone else.

Virtual Synchrony – More Issues
Failure-detector performance: slow convergence under long delays (e.g., WAN).
Implementation of Same View Delivery: dropping messages? Delivering them in the wrong view?
Network partitioning: multiple partitions, splits and merges of groups. Transitional views (Amir et al.): the set of processes seeing the same messages (not necessarily from the current view members).

Dynamic Voting on Top of GC
The R/W service runs as a replicated state machine (total order), with data replicas managed by the primary partition (quorum). This is problematic in a dynamic, unreliable network.
Adaptive quorums: a majority of the previous quorum, e.g. {a,b,c,d,e} → {a,b,c} → {a,b}.
Dynamic linear voting: pids break ties between equal-sized partitions.
Is this enough?

Failures in the Course of the Protocol
{a, b, c} attempt to form a quorum; a and b succeed, but c detaches, unaware of the attempt.
{a, b} form a quorum – a majority of {a, b, c}.
Concurrently, {c, d, e} form a quorum – a majority of {a, b, c, d, e}.
Inconsistency!

Handling Ambiguous Configurations
Idea: make c aware of whether a and b succeeded in forming {a, b, c}. {a, b, c} is ambiguous for c: it may or may not have been formed. Processes record ambiguous attempts; c records both {a, b, c, d, e} and {a, b, c}. Forming a new quorum requires a majority of both ⇒ c will refuse to form {c, d, e}.

Dynamic Voting – Ambiguity Resolution
Upon membership changes:
1. Exchange information. [Have a sub-quorum of the last primary and of all ambiguous attempts] ⇒ ATTEMPT: record the attempt as ambiguous.
2. [All members attempted] ⇒ FORM: become primary and delete all ambiguous attempts.
Caveat – garbage collection: there is a potentially exponential number of ambiguous attempts; constrain the protocol to store a linear number. A sketch of the rule follows.
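
The ambiguity rule as code: a new primary needs a majority of the last primary AND of every recorded ambiguous attempt. Plain Python sets stand in for the real data structures (an illustrative assumption).

```python
# Dynamic-voting ambiguity check (a sketch; data structures are illustrative).
def is_majority(candidate, of):
    return len(set(candidate) & set(of)) > len(of) / 2

class DynamicVoting:
    def __init__(self, last_primary):
        self.last_primary = set(last_primary)
        self.ambiguous = []     # attempts joined but never seen to complete

    def attempt(self, candidate):
        """ATTEMPT: record as ambiguous, but only if it could ever form."""
        ok = is_majority(candidate, self.last_primary) and \
             all(is_majority(candidate, a) for a in self.ambiguous)
        if ok:
            self.ambiguous.append(set(candidate))
        return ok

    def form(self, candidate):
        """FORM: all members attempted -> become primary, drop ambiguity."""
        self.last_primary = set(candidate)
        self.ambiguous.clear()

# c's state after the story above: last primary {a,b,c,d,e}, and the
# ambiguous attempt {a,b,c} recorded.
dv = DynamicVoting({"a", "b", "c", "d", "e"})
dv.ambiguous.append({"a", "b", "c"})
assert not dv.attempt({"c", "d", "e"})   # no majority of {a,b,c}: refused
```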

Dynamic Approaches (1)
Consensus to agree on each operation [Lamport]: consensus for each R/W (not guaranteed to terminate) ⇒ bad performance!
Virtual synchrony [Birman 85]: group communication; R/W simulated through atomic broadcast; consensus only for a special case (view change). Issue: determining the primary partition (quorum) – [Yeger-Lotem, Keidar, Dolev ’97] dynamic voting.
But still performance issues: one join or failure may trigger view formation ⇒ delays R/W; in the presence of failures, R/W ops can be delayed indefinitely.

Dynamic Approaches (2)
Quorum-based reads/writes over GC [De Prisco et al. 99]: a new view must satisfy space requirements – intersection between the old and new quorums. (RAMBO instead has time requirements: some quorums of the old and the new system are involved in reconfiguration.)
Single reconfigurer [Lynch, Shvartsman 97], [Englert, Shvartsman 00]: terminology change – view → configuration; allows multiple concurrent configurations; but the single reconfigurer is a SPOF!

Outline
Definitions and Goals
Static Quorum Systems
Dynamic Quorum Systems – before RAMBO
RAMBO
Dynamic Quorum Systems – beyond RAMBO

RAMBO – Key Ideas
Separate the handling of R/W operations from view (configuration) changes: R/W ops must complete fast; configuration changes can propagate in the background.
Two levels of accommodating change: small and transient changes through multiple quorums; large and permanent changes through reconfiguration.
Managing configurations: multiple configurations may co-exist; old configurations can be garbage-collected; the nodes agree on the order of configurations (Paxos).

RAMBO Architecture (diagram: per-node Reader-Writer and Recon components over the network; the Reader-Writer handles read/write/upgrade and emits read-ack/write-ack)

RAMBO API
Domains: I = set of nodes (locations); V = set of values; C = set of configurations, where each configuration c has Members(c), Read-quorums(c), and Write-quorums(c).
Input (asynchronous, per node/object): Join, Read, Write(v), Recon(c, c’), Fail.
Output (asynchronous, per node/object): Join-ack, Read-ack(v), Write-ack, Recon-ack(b) (True/False), Report(c) (new configuration).
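
Rendered as code, the per-node interface might look as follows. The callbacks object and the `Configuration` shape are illustrative assumptions matching the domains above; bodies are elided.

```python
# RAMBO per-node interface as a Python skeleton (a sketch, not RAMBO's code).
class Configuration:
    def __init__(self, members, read_quorums, write_quorums):
        self.members = members                  # Members(c)
        self.read_quorums = read_quorums        # Read-quorums(c)
        self.write_quorums = write_quorums      # Write-quorums(c)

class RamboNode:
    def __init__(self, node_id, callbacks):
        self.node_id = node_id
        # callbacks: join_ack(), read_ack(v), write_ack(),
        #            recon_ack(b), report(c)
        self.cb = callbacks

    # Input actions (asynchronous, per node/object):
    def join(self): ...                 # answered by cb.join_ack()
    def read(self): ...                 # answered by cb.read_ack(v)
    def write(self, v): ...             # answered by cb.write_ack()
    def recon(self, c, c_new): ...      # answered by cb.recon_ack(b)
    def fail(self): ...                 # crash (stop) this node
```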

Recon Service Specification
Recon chooses configurations, tells the members of the previous and the new configuration, and informs the Reader-Writer components (new-config).
Behavior (assuming well-formedness):
Agreement: two configurations are never assigned to the same index k.
Validity: any announced new-config was previously requested by someone.
No duplication: no configuration is assigned to more than one k.

Reads and Writes (diagram: each replica holds a Value and a Tag)
Write: Phase 1 – choose tag; Phase 2 – send tag/value.
Read: Phase 1 – find tag/value; Phase 2 – send tag/value.

Multiple Configurations (1)
Every node can install a new configuration, garbage-collect an old configuration, and learn about both through gossiping. The Recon service guarantees the global order.
Configuration map (cmap): the node’s snapshot of the picture of the world. Special entries: ⊥ (undefined) and ± (garbage-collected).

Multiple Configurations (2)
Some algebra on cmap entries:
Update: ⊥ → c, c → ± (the configuration lifecycle).
Extend: ⊥ → c (new configurations).
Truncate: (c1, c2, ⊥, c4) → (c1, c2) (removing holes); a truncated configuration map has no holes.
(Diagram: a cmap as a sequence – a GC’d prefix of ±, then defined entries c, then an undefined ⊥ tail, with mixed states in between.)
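
A sketch of this algebra in Python, representing ⊥ as `None` and ± as the string `"±"`; the pointwise-lattice formulation is an illustrative rendering of the slide's rules.

```python
# cmap algebra sketch: a cmap is a list indexed by configuration number k.
GC = "±"

def _rank(e):
    """Pointwise lifecycle lattice: ⊥ < c < ±."""
    return 0 if e is None else (2 if e == GC else 1)

def update(cmap, other):
    """Merge two cmaps entry-by-entry, moving each entry up the lattice."""
    n = max(len(cmap), len(other))
    get = lambda m, i: m[i] if i < len(m) else None
    return [max(get(cmap, i), get(other, i), key=_rank) for i in range(n)]

def extend(cmap, k, c):
    """⊥ -> c only: record a newly discovered configuration at index k."""
    cmap = cmap + [None] * (k + 1 - len(cmap))
    if cmap[k] is None:
        cmap[k] = c
    return cmap

def truncate(cmap):
    """Cut at the first hole: (c1, c2, ⊥, c4) -> (c1, c2)."""
    return cmap[:cmap.index(None)] if None in cmap else cmap
```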

CMAP Evolution (diagram: the map grows as configurations are installed – ⟨c0⟩, ⟨c0, c1⟩, ⟨c0, c1, c2, …, ck⟩ – while garbage collection turns a growing prefix into ±: ⟨±, c1, c2, …, ck⟩, ⟨±, ±, c2, …, ck⟩, …)

R/W Automaton Implementation
The node keeps gossiping with the “world” all the time. Tags are used for distributed ordering of operations: WRITE operations increment the tag, READ operations use it, and every READ returns the value of the WRITE with the same tag.
Agreeing on tags: every operation consists of a query phase and a propagation phase. Query: acquire the tag from “enough” members – a read-quorum of every active configuration. Propagation: push the value/tag to “enough” members – a write-quorum of every active configuration. Fixed point: a predicate asserting that the respective phase has completed.

R/W with Multiple Configurations
Key to asynchronous execution of R/W operations: an R/W is not aborted when a new configuration is reported – there is just extra work to access the additional processes needed for the new quorums.
Reaching a quorum for every c in CMAP synchronizes with every process that might hold c: some read-quorum at the QUERY stage (the query-fixed-point precondition) and some write-quorum at the PROP stage (the prop-fixed-point precondition). A sketch of the fixed points follows.
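
The two fixed points can be expressed directly as predicates over the set of nodes heard from in the current phase. A sketch, reusing the `Configuration` shape assumed in the API skeleton above (quorums as frozensets of node ids):

```python
# Query/prop fixed points as predicates (a sketch; plumbing is assumed).
def quorum_reached(acked, quorums):
    """Some quorum has answered in full."""
    return any(q <= acked for q in quorums)

def active(cmap):
    return [c for c in cmap if c not in (None, "±")]

def query_fixed_point(cmap, acked):
    # A read-quorum of EVERY active configuration has answered.
    return all(quorum_reached(acked, c.read_quorums) for c in active(cmap))

def prop_fixed_point(cmap, acked):
    # A write-quorum of EVERY active configuration has answered.
    return all(quorum_reached(acked, c.write_quorums) for c in active(cmap))
```

If gossip reveals a new configuration mid-phase, the predicate simply acquires more conjuncts to satisfy; the operation is never aborted.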

RAMBO - R/W Transitions

RAMBO - GC Transitions

R/W Automata State
world
value, tag
cmap
pnum1 – counts phases of locally-initiated operations
pnum2[] – records the latest known phase numbers for all locations (recall causal ordering and vector clocks!)
op-record – keeps track of the status of the current locally-initiated read/write operation; includes op.cmap, consisting of consecutive configurations
gc-record – keeps track of the status of the current locally-initiated GC operation

R/W Automaton: recv() Code
The CMAP may evolve during an R/W, so accept only “recent” messages: local message numbering (PNUM) ensures causal order – “I have heard from you since you started the op!” Pitfall: a hole in the new CMAP means I am using stale data ⇒ restart the phase with the truncated CMAP.
On recv(W, v, t, cm, ns) from j:
world := world ∪ W
if t > tag then (value, tag) := (v, t)
cmap := update(cmap, cm)
pnum2(j) := max(pnum2(j), ns)
gc-record: if the message is “recent”, record the sender.
op-record: if the message is “recent”, record the sender and extend op.cmap with newly discovered configurations.
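
A Python rendering of this handler, continuing the cmap sketch above (reusing its `update`, `extend`, and `truncate` helpers). The state and message field names follow the slide; the op bookkeeping (`phase_start`, `acked`) is an illustrative assumption.

```python
# recv() handler sketch; helpers come from the cmap sketch above.
def has_hole(cmap):
    """A hole is a ⊥ entry followed by a defined one."""
    if None not in cmap:
        return False
    first = cmap.index(None)
    return any(e is not None for e in cmap[first:])

def recv(state, msg):
    state.world |= msg.W                          # world := world ∪ W
    if msg.t > state.tag:                         # adopt the newer value/tag
        state.value, state.tag = msg.v, msg.t
    state.cmap = update(state.cmap, msg.cm)       # cmap := update(cmap, cm)
    state.pnum2[msg.sender] = max(state.pnum2[msg.sender], msg.ns)
    # (the gc-record is updated analogously to the op-record below)

    op = state.op_record
    if op is not None and msg.ns >= op.phase_start:   # "recent" message
        op.acked.add(msg.sender)                  # record the sender
        for k, c in enumerate(state.cmap):        # extend op.cmap with news
            if c not in (None, "±"):
                op.cmap = extend(op.cmap, k, c)
        if has_hole(op.cmap):                     # stale data: restart phase
            op.cmap = truncate(op.cmap)
            op.acked = set()
            state.pnum1 += 1                      # new phase number
            op.phase_start = state.pnum1
```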

Putting it all together… (diagram: write(x, 7) with cmap ⟨±, ±, c3, c4, c5, ⊥, c6⟩; the largest tag found is 100, so the new tag is 101)

Garbage Collection
A process can initiate a configuration’s garbage collection, provided that all previous configurations are ±, one at a time (may be improved!). Multiple processes can start GC of the same configuration, concurrently with R/W; a GC can stop if an idempotent GC has completed.
The same two-phase protocol:
Query: reach a read- and a write-quorum of CMAP[k] – inform a write-quorum of the old configuration about the new configuration, and collect object values from a read-quorum of the old configuration.
Prop: reach a write-quorum of CMAP[k+1] – propagate the latest value to a write-quorum of the new configuration.
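
A sketch of the two-phase GC, with the messaging call `node.ask(...)` assumed to return once the required quorums of the addressed configuration have answered:

```python
# Two-phase GC of configuration k (a sketch; node.ask is assumed plumbing).
def garbage_collect(node, k):
    old, new = node.cmap[k], node.cmap[k + 1]   # requires cmap[k+1] defined

    # Phase 1 (query): inform a write-quorum of the OLD configuration about
    # the new one, and collect the latest value/tag from a read-quorum of it.
    replies = node.ask(old.members, ("gc-query", k, new))
    node.value, node.tag = max(replies, key=lambda vt: vt[1])

    # Phase 2 (prop): propagate that value/tag to a write-quorum of the NEW
    # configuration; only then is k marked as removed.
    node.ask(new.members, ("gc-prop", k + 1, node.value, node.tag))
    node.cmap[k] = "±"
```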

Proof Sketch
≤ ordering of tags between sequential GC operations: intersection between the read-quorum of CMAP[k] and the write-quorum of CMAP[k+1].
Ordering between a sequential GC and an R/W: ≤ ordering of tags between the GC and READ operations; < ordering of tags between the GC and WRITE operations.
Ordering between a sequential R and W (≤ for */R, < for */W): either there is a common configuration c – the tag is conveyed through the quorum intersection property – or the tag information is conveyed through the GC of some configuration in between.

Recon Implementation
Consensus is implemented using the Paxos Synod algorithm: members of the old configuration propose a new configuration, and proposals are reconciled using consensus.
recon(c, c’): request a reconfiguration from c to c’. [If c is the (k−1)st configuration] send an init(Cons(k, c’)) message to c.members.
recv(init): participate in consensus.
decide(c’): tell R/W the new configuration; send a new-config message to the members of c’.
(Diagram: the Recon component between the network, the Consensus service, and the recon/recon-ack interface.)
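
A sketch of Recon over a consensus black box, one instance per configuration index. The `Consensus.propose` interface and the node's send/report plumbing are assumptions standing in for the Paxos Synod implementation:

```python
# Recon over per-index consensus instances (a sketch; plumbing is assumed).
class Recon:
    def __init__(self, node, consensus_for):
        self.node = node
        self.consensus_for = consensus_for    # k -> Consensus instance

    def recon(self, k, c, c_new):
        """Request reconfiguration from c (the (k-1)st config) to c_new."""
        self.node.send(c.members, ("init", k, c_new))

    def on_init(self, k, proposal):
        # Members of the old configuration run consensus instance k; Paxos
        # reconciles concurrent proposals, so everyone decides the same config.
        decided = self.consensus_for(k).propose(proposal)
        self.node.send(decided.members, ("new-config", k, decided))
        self.node.report(decided)             # tell the Reader-Writer component
```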

Conditional Performance Analysis
Safety is always guaranteed, but there are no absolute performance guarantees. Under “good” network conditions – bounded message delay d, sufficient spacing between configurations (e), configuration and quorum viability (e) – the bounds (under quiescence conditions) are:
Join – 2d
Reconfiguration – 13d
Read/Write – 4d (two phases)
GC – 4d (two phases)
The bounds deteriorate under weaker stability conditions. (Diagram: once the network stabilizes, Rambo stabilizes.)

Outline
Definitions and Goals
Static Quorum Systems
Dynamic Quorum Systems – before RAMBO
RAMBO
Dynamic Quorum Systems – beyond RAMBO

RAMBO II
Goal: overcome the bottleneck of one GC at a time. Upgrade instead of GC: collect multiple configurations with index < k at once; any configuration can be upgraded even if lower-indexed ones have not been.
Problem: a nice RAMBO property is lost – in RAMBO, every configuration is upgraded before removal – so we must overcome a race condition between two concurrent upgrades, which could lead to data loss. Solution: do not remove a configuration until your upgrade is complete, even if somebody else is removing it in parallel with you.
Proof intuition: the order between R/W operation tags follows from the transitive closure of multiple Upgrade operation tags (instead of a single GC). A sketch follows.
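
For contrast with the per-index GC above, here is a sketch of cfg-upgrade(k), which retires every defined configuration below k in a single two-phase operation (same assumed plumbing as the GC sketch):

```python
# cfg-upgrade(k) sketch: retire ALL defined configurations below k at once.
def upgrade(node, k):
    old = [c for c in node.cmap[:k] if c not in (None, "±")]
    new = node.cmap[k]

    # Phase 1: reach read- and write-quorums of every old configuration:
    # announce the new configuration and learn the latest value/tag.
    replies = [node.ask(c.members, ("upg-query", k)) for c in old]
    node.value, node.tag = max((vt for r in replies for vt in r),
                               key=lambda vt: vt[1])

    # Phase 2: propagate to a write-quorum of configuration k. Only after
    # the upgrade completes are the old entries marked removed - never
    # before, even if another node is upgrading in parallel.
    node.ask(new.members, ("upg-prop", k, node.value, node.tag))
    for i in range(k):
        if node.cmap[i] not in (None, "±"):
            node.cmap[i] = "±"
```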

Configuration Upgrade in RAMBO II (diagram: upgrade(5) with cmap ⟨±, ±, c3, c4, c5⟩ retires c3 and c4 together; the largest tag found is 101)

Performance (plot: operation latency vs. frequency of reconfiguration – Rambo’s latency grows much faster than Rambo II’s; think of the size of the CMAP you need to drag along!)

GeoQuorums
Problem: atomic R/W shared-memory objects in a mobile setting. Constraints: mobile hosts are constantly moving, turning off, etc., and are thus too unreliable to serve as the “backbone” of the algorithm.
Idea: separate the world into regions that are usually populated; clusters of nodes simulate focal points. A region (node) fails when no mobile host in that region is active.

Rosebud
Problem: atomic R/W shared-memory objects in a Byzantine environment. Environment: multiple configurations (as in RAMBO) plus up to f Byzantine replicas. Protocols: the same as RAMBO, with cryptographic augmentation; sets of 3f+1 replicas with quorums of 2f+1.

Backup Slides

ABD – code (the original slide reproduced the ABD pseudocode; see the sketch after the “Reads and Writes” slide in the Static Quorum Systems part above)

Virtual Synchrony Implementation
ISIS algorithm – markers. When p receives a view change from G_i to G_{i+1}:
1. Forward all unstable messages from G_i to all other processes in G_{i+1}, and mark them stable.
2. Multicast a flush message for G_{i+1}.
3. When p has received the flush message for G_{i+1} from all processes, install the new view G_{i+1}.
SAFE messages: network-level vs. application-level delivery guarantees. A sketch of the marker protocol follows.
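
A compact sketch of the marker protocol. The transport and view objects are assumed, and self-flushes are counted like any other flush:

```python
# ISIS flush-marker sketch (transport and view objects are assumed).
class IsisMember:
    def __init__(self, node):
        self.node = node
        self.unstable = []      # messages delivered but not yet known stable
        self.flushed = set()    # who has flushed the pending view

    def on_view_change(self, old_view, new_view):
        for m in self.unstable:                  # 1. forward unstable messages
            self.node.send(new_view.members, ("msg", old_view.id, m))
        self.unstable.clear()                    # ... and mark them stable
        self.node.send(new_view.members,         # 2. multicast the flush marker
                       ("flush", new_view.id, self.node.id))

    def on_flush(self, new_view, sender):
        self.flushed.add(sender)
        if self.flushed >= set(new_view.members):   # 3. heard from everyone
            self.node.install(new_view)             # install the new view
            self.flushed.clear()
```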

Symmetric Atomic Broadcast
Timestamp = (counter, pid).
Send: increment the counter.
Receive: record the neighbor’s counter; adopt the counter on the message if it is greater than mine.
Deliver: accept the message stamped with a counter ≤ every node’s counter, using pid to break ties.
(Diagram: messages stamped (p0,0), (p1,0), (p1,1), (p0,2) delivered in timestamp order.)
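
A sketch of the symmetric algorithm in code. It assumes the transport loops a node's own broadcasts back to it, so `latest` eventually reflects every node including the sender; all names are illustrative.

```python
# Symmetric (Lamport-timestamp) total-order broadcast sketch.
import heapq

class SymmetricBroadcast:
    def __init__(self, pid, peers):
        self.pid, self.counter = pid, 0
        self.latest = {p: 0 for p in peers}   # last counter heard per node
        self.pending = []                     # heap ordered by (counter, pid)

    def send(self, payload, transport):
        self.counter += 1                     # Send: increment the counter
        transport.broadcast((self.counter, self.pid, payload))

    def receive(self, counter, sender, payload):
        self.latest[sender] = max(self.latest[sender], counter)
        self.counter = max(self.counter, counter)   # adopt a larger counter
        heapq.heappush(self.pending, (counter, sender, payload))

    def deliverable(self):
        """Deliver messages stamped <= every node's known counter; the pid
        in the heap key breaks ties, so all nodes deliver in one order."""
        out = []
        while self.pending and self.pending[0][0] <= min(self.latest.values()):
            out.append(heapq.heappop(self.pending))
        return out
```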

Causally Ordered Broadcast
Every node i maintains a vector timestamp vt.
Increase my entry vt[i] upon send; self-messages are delivered immediately.
A message from neighbor j stamped with v is delivered when v[j] = vt[j] + 1 (it is the next message expected from j) and v[k] ≤ vt[k] for every k ≠ j.
(Diagram: timestamps (0,1,0), (0,2,0), (0,3,0), (1,3,0) delivered respecting causality.)
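
A sketch of the delivery rule in code, with the companion condition v[j] = vt[j] + 1 made explicit (the slide states only the k ≠ j part); names are illustrative.

```python
# Vector-timestamp causal delivery sketch.
class CausalBroadcast:
    def __init__(self, my_id, n):
        self.i = my_id
        self.vt = [0] * n            # deliveries seen per originator
        self.pending = []

    def stamp_for_send(self):
        self.vt[self.i] += 1         # increase my entry upon send;
        return list(self.vt)         # self-messages count as delivered

    def on_receive(self, sender, v, payload, deliver):
        self.pending.append((sender, v, payload))
        progress = True
        while progress:              # drain everything that became enabled
            progress = False
            for item in list(self.pending):
                j, w, p = item
                ok = w[j] == self.vt[j] + 1 and all(
                    w[k] <= self.vt[k] for k in range(len(w)) if k != j)
                if ok:
                    self.pending.remove(item)
                    self.vt[j] = w[j]
                    deliver(p)
                    progress = True
```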