
1 Edward Bortnikov 048961 – Topics in Reliable Distributed Computing Slides partially borrowed from Nancy Lynch (DISC ’02), Seth Gilbert (DSN ’03), and Idit Keidar (multiple talks) RAMBO: Reconfigurable Atomic Memory for Dynamic Networks

2 Outline Definitions and Goals Static Quorum Systems Dynamic Quorum Systems – before RAMBO RAMBO Dynamic Quorum Systems – beyond RAMBO

3 Distributed Shared Memory [figure: clients issuing Read, Write(7), and Write(0) against a replicated shared object]

4 Atomic Consistency AKA linearizability Definition: Each operation appears to occur at some point between its invocation and response. Sufficient condition: For each object x, all the read and write operations for x can be partially ordered by ≺, so that: ≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1. All write operations are ordered with respect to each other and with respect to all the reads. Every read returns the value of the last write preceding it in ≺.

5 Atomic Consistency [figure: timeline where a Read returns 7, concurrent with Write(7) and Write(0)]

6 Quorums [figure: Write(7) and Read each contacting a quorum of replicas]

7 Dynamic Atomic Memory

8 Outline Definitions and Goals Static Quorum Systems Dynamic Quorum Systems – before RAMBO RAMBO Dynamic Quorum Systems – beyond RAMBO

9 Prior Work on Quorums (static) Gifford (79) and Thomas (79) Upfal and Wigderson (85): majority sets of readers and writers Vitanyi and Awerbuch (86): matrices of single-writer/single-reader registers Attiya, Bar-Noy and Dolev (90/95): majorities of processors to implement single-writer/multi-reader objects in message passing systems

10 ABD (Attiya, Bar-Noy, Dolev) Single-writer multiple-readers Assuming non-faulty processors (nodes) Majority is a primitive quorum Communicate: send a request to n processors, await acks from a majority Tags are used for distributed ordering of operations WRITE operations increment the tag READ operations use the tag Both propagate the tag Properties: R returns either the last completed W or a concurrent W; ≤ tag ordering between Rs

11 Reads and Writes Write: increment tag, send tag/value. Read: Phase 1: find tag/value; Phase 2: send tag. [figure: four replicas holding values 32, 5, 24, 72 with tags 100, 101, 102, 103]

12 Outline Definitions and Goals Static Quorum Systems Dynamic Quorum Systems – before RAMBO RAMBO Dynamic Quorum Systems – beyond RAMBO

13 Dynamic Approaches (1) Consensus to agree on each operation [Lamport] Consensus for each R/W ⇒ bad performance! Virtual synchrony [Birman 85] group communication R/W simulated through atomic broadcast Consensus only for a special case (view change) Issue with determining the primary partition (quorum) [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting But still – performance issues One join or failure may trigger view formation ⇒ delays R/W In the presence of failures, R/W ops may be delayed indefinitely

14 Group Communication Abstraction Send (Grp, Msg) Deliver (Msg) Join / Leave (Grp) View (Grp, Members, Id)

15 Group Communication Systems (1) Group Membership Processes organized into groups Particular memberships stamped as views Views provide a form of Concurrent Common Knowledge about the system In a partitionable system, views can be concurrent [figure: timelines of p1, p2, p3 with views V1 {p1, p2, p3}, V2 {p1, p2}, V3 {p3}, V5 {p1, p2, p3}]

16 Virtual Synchrony [Birman, Joseph 87] Integration of Multicast and Membership Synchronization of Messages and Views Includes many different properties One key property: processes that go through the same views together deliver the same sets of messages A powerful abstraction for state-machine replication

17 Group Communication Systems (2) Reliable Multicast Messages sent to group Total/Causal/FIFO ordering Virtual Synchrony The same set of multicast messages delivered to group members between view changes Guaranteed Self Delivery A process will eventually deliver a self-message or crash (Usually) Sending View Delivery The message is delivered in the same view in which it is sent [figure: p1, p2, p3 timelines across views V1 {p1, p2, p3} and V2 {p1, p2}]

18 Example: a GC-based VOD server [figure: a Service Group and a Session Group coordinating movie groups (Chocolat, Gladiator, Spy Kids) via start, update, Movies?, and control messages]

19 Virtual Synchrony – Membership Issue – accurate estimation of group membership Natural implementation – consensus But distributed consensus is impossible under failures in an asynchronous system [FLP ’85]! How to distinguish between a failed and a slow processor? Solution – failure detectors to deliver views May use mechanisms other than asynchronous message arrivals to suspect failed processes Failure detector ◊S Initially, the output is arbitrary, but eventually … every process that crashes is suspected (completeness) some process that does not crash is not suspected (accuracy) ◊S is the weakest FD that solves consensus (Rotating Coordinator algorithm)
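
To make the ◊S discussion concrete, here is a minimal heartbeat-based sketch in Python of the usual implementation idea (all names are hypothetical; the slides do not prescribe an implementation): a peer is suspected when its heartbeat is late (towards completeness), and each false suspicion doubles that peer's timeout so that correct-but-slow peers eventually stop being suspected (towards accuracy).

```python
import time

class HeartbeatFailureDetector:
    """Illustrative sketch of an eventually-strong failure detector."""

    def __init__(self, peers, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in peers}      # per-peer timeout
        self.last_heartbeat = {p: time.monotonic() for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer):
        if peer in self.suspected:
            # False suspicion: the peer was only slow. Back off, so that
            # eventually some correct process is never suspected (accuracy).
            self.suspected.discard(peer)
            self.timeout[peer] *= 2
        self.last_heartbeat[peer] = time.monotonic()

    def check(self):
        now = time.monotonic()
        for peer, last in self.last_heartbeat.items():
            if now - last > self.timeout[peer]:
                self.suspected.add(peer)   # crashed peers stay here forever
        return self.suspected              # (completeness)
```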

20 Virtual Synchrony – Multicast Assumption: point-to-point reliable FIFO All-or-none message delivery Only for the view (alive processes) Dead men tell no tales (E.W. Hornung, 1899) STABLE messages and delivery between views What if the sender crashes in the middle of a multicast? ISIS algorithm – FLUSH markers Messages can be delayed indefinitely during view formation! Total message ordering TOTEM (token-ring) algorithm Symmetric (Lamport timestamps) algorithm

21 Implementing Virtual Synchrony Process 4 notices that process 7 has crashed, sends a view change Process 6 sends out all its unstable messages, followed by a flush message Process 6 installs the new view when it has received a flush message from everyone else

22 Virtual Synchrony – More Issues Failure detector performance Slow convergence under long delays (e.g. WAN) Implementation of Same View Delivery Dropping messages? Delivering them in the wrong view? Network partitioning Multiple partitions, split and merge of groups Transitional views (Amir et al) The set of processes seeing the same messages (not necessarily from the current view members)

23 Dynamic Voting on Top of GC R/W service as a replicated state machine (total order) Data replicas managed by the primary partition (quorum) Problematic in a dynamic unreliable network Adaptive quorums – majority of the previous quorum {a,b,c,d,e} → {a,b,c} → {a,b} Dynamic linear voting Pid to break ties between equal-sized partitions Is this enough?

24 Failures in the Course of the Protocol {a, b, c} attempt to form a quorum a and b succeed c detaches, unaware of the attempt {a, b} form a quorum – a majority of {a, b, c} Concurrently {c, d, e} form a quorum – a majority of {a, b, c, d, e} Inconsistency!

25 Handling Ambiguous Configurations Idea: make c aware if a and b succeed in forming {a, b, c} {a, b, c} is ambiguous for c: may or may not have been formed Processes record ambiguous attempts c records both {a, b, c, d, e} and {a, b, c} Requiring a majority of both ⇒ c will refuse to form {c, d, e}

26 Dynamic Voting – Ambiguity Resolution Upon membership changes: Exchange information [sub-quorum of the last primary and of all ambiguous attempts] ⇒ ATTEMPT: record the attempt as ambiguous [all attempted] ⇒ FORM: become primary + delete all ambiguous attempts Caveat: garbage collection Potentially an exponential # of ambiguous attempts Constrain to store a linear #
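
A minimal Python sketch of this ambiguity bookkeeping, with illustrative names (not the paper's pseudocode): forming a new primary requires a majority of the last primary and of every recorded ambiguous attempt, which is exactly what blocks {c, d, e} in the slide-24 scenario.

```python
def is_majority_of(candidate, group):
    """True if `candidate` contains a majority of `group`."""
    return len(candidate & group) * 2 > len(group)

class DynamicVotingNode:
    """Sketch of one node's state in dynamic voting."""

    def __init__(self, initial_primary):
        self.last_primary = frozenset(initial_primary)
        self.ambiguous = set()       # attempts whose outcome is unknown

    def may_form(self, members):
        """A new primary needs a majority of the last known primary AND of
        every ambiguous attempt, else a detached sub-quorum could also form."""
        members = frozenset(members)
        return (is_majority_of(members, self.last_primary) and
                all(is_majority_of(members, a) for a in self.ambiguous))

    def attempt(self, members):
        members = frozenset(members)
        if not self.may_form(members):
            return False
        self.ambiguous.add(members)  # record BEFORE trying to form
        return True

    def formed(self, members):
        """All members acknowledged: the attempt became the new primary."""
        self.last_primary = frozenset(members)
        self.ambiguous.clear()

# The slide-24 scenario: c recorded the ambiguous attempt {a, b, c},
# so it refuses to form {c, d, e} even though that is a majority of
# the old primary {a, b, c, d, e}.
c = DynamicVotingNode({"a", "b", "c", "d", "e"})
c.attempt({"a", "b", "c"})                # recorded; outcome unknown to c
assert not c.may_form({"c", "d", "e"})    # no majority of {a, b, c}
```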

27 Dynamic Approaches (1) Consensus to agree on each operation [Lamport] Consensus for each R/W (not guaranteed to terminate) Bad performance! Virtual synchrony [Birman 85] group communication R/W simulated through atomic broadcast Consensus only for a special case (view change) Issue with determining the primary partition (quorum) [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting But still – performance issues One join or failure may trigger view formation ⇒ delays R/W In the presence of failures, R/W ops may be delayed indefinitely

28 Dynamic Approaches (2) Quorum-based reads/writes over GC [De Prisco et al. 99] New view must satisfy space requirements Intersection between the old and new quorums RAMBO has time requirements instead Some quorums of the old and new system are involved in reconfiguration Single reconfigurer [Lynch, Shvartsman 97], [Englert, Shvartsman 00]: Terminology change: view → configuration Allows multiple concurrent configurations SPOF!

29 Outline Definitions and Goals Static Quorum Systems Dynamic Quorum Systems – before RAMBO RAMBO Dynamic Quorum Systems – beyond RAMBO

30 RAMBO – key ideas Separate the handling of R/W operations from view (configuration) changes R/W ops must complete fast Configuration changes can propagate in the background Two levels of accommodating changes Small and transient changes – through multiple quorums Large and permanent changes – through reconfiguration Managing configurations Multiple configurations may co-exist Old configurations can be garbage-collected The nodes agree on the order of configurations (Paxos)

31 RAMBO Architecture [figure: per-node Reader-Writer and Recon components over the network (Net); external events read, read-ack, write, write-ack, upgrade]

32 RAMBO API Domains I = set of Nodes (Locations) V = set of Values C = set of Configurations Members (C) Read-quorums (C) Write-quorums (C) Input // asynchronous – per node/object Join Read Write (v) Recon (c, c’) Fail Output // asynchronous – per node/object Join-ack Read-ack (v) Write-ack Recon-ack (b) // True/False Report (c) // new configuration
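
The same interface transcribed as a Python skeleton, for reference (a sketch only; types and names follow the slide, method bodies are elided):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

NodeId = str

@dataclass(frozen=True)
class Configuration:
    """One RAMBO configuration: members plus its read/write quorums."""
    members: FrozenSet[NodeId]
    read_quorums: FrozenSet[FrozenSet[NodeId]]
    write_quorums: FrozenSet[FrozenSet[NodeId]]

class RamboNode:
    """Per-node, per-object RAMBO interface (shape only)."""

    def join(self) -> None: ...                      # answered by join-ack
    def read(self) -> None: ...                      # answered by read-ack(v)
    def write(self, v) -> None: ...                  # answered by write-ack
    def recon(self, c: Configuration,
              c_new: Configuration) -> None: ...     # answered by recon-ack(b)
    def fail(self) -> None: ...

    # Asynchronous output callback: a new configuration is reported.
    on_report: Callable[[Configuration], None]
```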

33 Recon Service Specification Recon Chooses configurations Tells members of the previous and new configuration. Informs Reader-Writer components (new-config). Behavior (assuming well-formedness): Agreement: Two configs never assigned to same k. Validity: Any announced new-config was previously requested by someone. No duplication: No configuration is assigned to more than one k.

34 Reads and Writes Write: Phase 1: choose tag; Phase 2: send tag/value. Read: Phase 1: find tag/value; Phase 2: send tag/value. [figure: four replicas holding values 32, 5, 24, 72 with tags 100, 101, 102, 103]

35 Multiple Configurations (1) Every node can Install a new configuration Garbage-collect an old configuration Learn about both through gossiping The Recon service guarantees the global order Configuration map The node’s snapshot of the picture of the world Special configurations: ⊥ (undefined) and ± (GC’ed)

36 Multiple Configurations (2) Some algebra: Update: ⊥ → c, c → ± // Configuration lifecycle Extend: ⊥ → c // New configurations Truncate: (c1, c2, ⊥, c4) → (c1, c2) // Removing holes A configuration map w/o holes is TRUNCATED [figure: a cmap as a sequence ±, ±, c, c, c, ⊥, … – a GC’d prefix, then defined configs, then undefined entries]
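
A Python sketch of this algebra under the lifecycle order ⊥ < defined < ± (the class and names are illustrative, not RAMBO's IOA code):

```python
UNDEF = None     # stands in for the slide's "bottom" (undefined) entry
GCED = "±"       # stands in for a garbage-collected configuration

class CMap:
    """Sketch of the configuration-map algebra."""

    def __init__(self):
        self.entries = {}                       # index k -> config or GCED

    def get(self, k):
        return self.entries.get(k, UNDEF)

    def update(self, k, value):
        """Lifecycle: undefined -> c, or c -> GC'd. Never overwrite a
        defined config with another, never resurrect a GC'd slot."""
        if self.get(k) is UNDEF or value == GCED:
            self.entries[k] = value

    def merge(self, other):
        """Pointwise merge with a gossiped cmap, under undefined < c < GC'd."""
        for k, v in other.entries.items():
            self.update(k, v)

    def extend(self, k, config):
        """New configurations: only fill undefined slots."""
        if self.get(k) is UNDEF:
            self.entries[k] = config

    def truncate(self):
        """Drop everything after the first hole (undefined entry), leaving
        a hole-free cmap: GC'd prefix, then defined configurations."""
        k = 0
        while self.get(k) is not UNDEF:
            k += 1
        self.entries = {i: v for i, v in self.entries.items() if i < k}
```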

37 CMAP Evolution [figure: the cmap evolving from (c0, ⊥, …) to (c0, c1, ⊥, …) to (c0, c1, c2, …, ck, ⊥), and then, as prefixes are GC’ed, to (±, c1, c2, …, ck, ⊥), (±, ±, c2, …, ck, ⊥), (±, ±, ±, c3, …, ck, ⊥), …]

38 R/W Automaton Implementation The node keeps gossiping with the “world” all the time Tags are used for distributed ordering of operations WRITE operations increment the tag READ operations use the tag Every READ returns the value of the WRITE with the same tag Agreeing on tags Every op consists of a query phase and a propagation phase Query – acquire the tag from “enough” members A read-quorum of every active configuration Propagation – push the value/tag to “enough” members A write-quorum of every active configuration Fixed point: a predicate asserting that the respective phase has completed

39 R/W with Multiple Configurations Key to asynchronous execution of R/W operations R/W is not aborted when a new configuration is reported Extra work to access the additional processes needed for new quorums Reaching a quorum for every C in CMAP To synchronize with every process that might hold C Some read-quorum at the QUERY stage Query-fixed-point precondition Some write-quorum at the PROP stage Prop-fixed-point precondition
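
A Python sketch of the query/prop fixed point (the helpers on node – receive, absorb, active_configs, quorums – are hypothetical stand-ins for the automaton's state and gossip):

```python
def phase_fixed_point(active_configs, quorums_of, responders):
    """True once the responder set covers some quorum of EVERY
    configuration currently in the cmap: the query phase passes
    read-quorums as quorums_of, the propagation phase write-quorums."""
    return all(any(q <= responders for q in quorums_of(c))
               for c in active_configs)

def run_phase(node):
    """One phase: gossip until the fixed point holds. Configurations
    discovered mid-phase simply extend the test on the next check."""
    responders = set()
    while not phase_fixed_point(node.active_configs(),
                                node.quorums, responders):
        sender, tag_value = node.receive()   # hypothetical gossip step
        node.absorb(tag_value)               # keep the max tag / its value
        responders.add(sender)
    return responders
```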

40 RAMBO - R/W Transitions

41 RAMBO - GC Transitions

42 R/W Automata State world value, tag cmap pnum1 – counts phases of locally-initiated operations pnum2[] – records latest known phase numbers for all locations Recall causal ordering and vector clocks! op-record – keeps track of the status of a current locally-initiated read/write operation Includes op.cmap, consisting of consecutive configs gc-record – keeps track of the status of a current locally-initiated GC operation

43 R/W Automaton: Recv() code CMAP may evolve during the R/W Accept only “recent” messages Local message numbering (PNUM) to ensure causal order: “I have heard from you since you started the op!” Pitfall: a hole in the new CMAP ⇒ I am using stale data! ⇒ Restart the phase with the truncated CMAP On receive: world := world ∪ W if t > tag then (value, tag) := (v, t) cmap := update(cmap, cm) pnum2(j) := max(pnum2(j), ns) gc-record: if the message is “recent”, record the sender op-record: if the message is “recent”: record the sender; extend op.cmap with newly discovered configurations

44 Putting it all together … [figure: write(x, 7) over a cmap (±, ±, c3, c4, c5, …) with c6 appearing during the operation; largest known tag 100, new tag 101]

45 Garbage Collection A process can initiate a configuration’s garbage collection Provided that all previous configurations are ± One at a time (may be improved!) Multiple processes can start GC of the same configuration Concurrently with R/W A GC can stop if an idempotent GC has completed The same two-phase protocol Query: reach a read-quorum and a write-quorum of CMAP[k] Inform a W-quorum of the old configuration about the new configuration Collect object values from an R-quorum of the old configuration Prop: reach a write-quorum of CMAP[k+1] Propagate the latest value to a W-quorum of the new configuration
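
A sketch of the two-phase GC in Python, reusing the CMap sketch above; node.contact is a hypothetical helper that sends a payload to the given members and returns the set of acknowledging nodes (quorum checks are shown as assertions for brevity – the real automaton waits at the fixed point instead):

```python
def garbage_collect(node, k):
    """Retire configuration k once its state reaches configuration k+1."""
    old, new = node.cmap.get(k), node.cmap.get(k + 1)

    # Phase 1 (query): tell a write-quorum of the old config about the
    # new one, and learn the latest tag/value from a read-quorum of it
    # (replies carry tag/value pairs, absorbed into node.tag/node.value).
    acks = node.contact(old.members, ("new-config", new))
    assert any(q <= acks for q in old.read_quorums)
    assert any(q <= acks for q in old.write_quorums)

    # Phase 2 (prop): the latest value survives into the new config.
    acks = node.contact(new.members, ("tag-value", node.tag, node.value))
    assert any(q <= acks for q in new.write_quorums)

    node.cmap.update(k, GCED)   # only now is it safe to retire config k
```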

46 Proof Sketch ≤ ordering of tags between sequential GC operations ∩ between the R-quorum of CMAP[k] and the W-quorum of CMAP[k+1] Ordering between sequential GC and R/W ≤ ordering of tags between the GC and READ operations < ordering of tags between the GC and WRITE operations Ordering between sequential R and W ≤ ordering between */R < ordering between */W Either there is a common configuration C The tag is conveyed through the quorum ∩ property … or the tag info is conveyed through the GC of some configuration in between

47 Recon Implementation Consensus implemented using the Paxos Synod algorithm Members of the old configuration propose a new configuration Proposals reconciled using consensus recon(c, c’): request for reconfiguration from c to c’ [If c is the (k−1)st configuration] send an init(Cons(k, c’)) message to c.members Recv(init): participate in consensus decide(c’): tell R/W the new configuration; send a new-config message to members of c’ [figure: Recon component between Net and Consensus, emitting Recon-ack]
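
A Python sketch of this flow; consensus.propose stands in for one Paxos (Synod) instance per index k, and the helpers on node are hypothetical:

```python
def recon(node, k, c_old, c_new, consensus):
    """Request reconfiguration from c_old (config k-1) to c_new."""
    if node.cmap.get(k - 1) != c_old:
        return False                            # c_old must be config k-1
    node.send(c_old.members, ("init", k, c_new))    # start Cons(k, c_new)
    decided = consensus.propose(k, c_new)   # may adopt another's proposal
    node.send(decided.members, ("new-config", k, decided))  # inform R/W
    return decided == c_new                 # reported as recon-ack(b)
```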

48 Conditional Performance Analysis Safety is guaranteed … but there are no absolute performance guarantees! Under “good” network conditions Bounded message delay d Sufficient spacing between configurations (e) Configuration and quorum viability (e) Bounds (under quiescence conditions) Join – 2d Reconfiguration – 13d Read/Write – 4d (two phases) GC – 4d (two phases) Bounds deteriorate under weaker stability conditions! [figure: once the network stabilizes, Rambo stabilizes]

49 Outline Definitions and Goals Static Quorum Systems Dynamic Quorum Systems – before RAMBO RAMBO Dynamic Quorum Systems – beyond RAMBO

50 RAMBO-2 Goal: overcome the bottleneck of one GC at a time Upgrade instead of GC: collect multiple configurations < k Any configuration can be upgraded, even if lower-indexed ones are not Problem – we lose a nice RAMBO property: in RAMBO, every configuration is upgraded before removal Need to overcome the race condition between two upgrades … which could lead to data loss! Solution: don’t remove a configuration until the upgrade is complete … even if somebody is removing it in parallel with you! Proof intuition: order between R/W op tags is established through the transitive closure of multiple Upgrade op tags (instead of a single GC)

51 Configuration Upgrade in RAMBO-2 [figure: upgrade(5) collects c3 and c4 at once, marking them ±, with c5 remaining; largest tag: 101]

52 Performance [plot: operation latency vs. frequency of reconfiguration, comparing Rambo and Rambo II] Think of the size of the CMAP you need to drag along!

53 GeoQuorums Problem: atomic R/W shared memory objects for a mobile setting Constraints: mobile hosts are constantly moving, turning off, etc., and are thus too unreliable to serve as the “backbone” of the algorithm Idea: separate the world into regions that are usually populated Clusters of nodes simulate focal points A region (or node) fails when no mobile hosts in that region are active

54 Rosebud Problem: atomic R/W shared memory objects in a Byzantine environment Environment: multiple configurations (RAMBO) + up to f Byzantine replicas Protocols: the same as RAMBO + cryptographic augmentation Sets of 3f+1 replicas, quorums of 2f+1

55 Backup Slides

56 ABD - code
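
The code on this slide did not survive transcription; below is a minimal Python sketch of ABD's two phases under the talk's simplifying assumptions (non-faulty replicas, majority quorums). The majority helper just picks any majority-sized subset, standing in for "send to all, await acks from a majority"; all names are illustrative.

```python
class ABDReplica:
    """One replica's state in the ABD emulation."""
    def __init__(self):
        self.tag, self.value = 0, None

    def on_query(self):                 # read phase 1 / write phase 1
        return self.tag, self.value

    def on_store(self, tag, value):     # write phase 2 / read write-back
        if tag > self.tag:
            self.tag, self.value = tag, value
        return "ack"

def majority(replicas):
    """Stand-in for contacting a majority quorum."""
    return list(replicas)[: len(replicas) // 2 + 1]

def abd_write(replicas, value):
    # Phase 1: learn the largest tag from a majority.
    t = max(r.on_query()[0] for r in majority(replicas))
    # Phase 2: store the incremented tag/value at a majority.
    for r in majority(replicas):
        r.on_store(t + 1, value)

def abd_read(replicas):
    # Phase 1: find the tag/value pair with the largest tag.
    t, v = max((r.on_query() for r in majority(replicas)),
               key=lambda pair: pair[0])
    # Phase 2: propagate it to a majority, so any later read
    # intersects a quorum that already holds this value.
    for r in majority(replicas):
        r.on_store(t, v)
    return v

replicas = [ABDReplica() for _ in range(5)]
abd_write(replicas, 7)
assert abd_read(replicas) == 7
```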

57 Virtual Synchrony Implementation ISIS Algorithm – markers When P receives a view change from Gi to Gi+1: Forward all unstable messages from Gi to all other processes in Gi+1, then mark them stable Multicast a flush message for Gi+1 When P receives a flush message for Gi+1 from all processes: Install the new view Gi+1 SAFE messages Network-level vs. application-level delivery guarantees
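
A Python sketch of the flush step (illustrative classes; net.multicast is a hypothetical primitive over the reliable FIFO channels the slides assume):

```python
class IsisProcess:
    """Sketch of one process's ISIS flush logic at a view change."""

    def __init__(self, pid, net):
        self.pid, self.net = pid, net
        self.view, self.pending_view = None, None
        self.unstable = []            # delivered but not yet known stable
        self.flushed = set()

    def on_view_change(self, new_view):
        self.pending_view = new_view
        # Forward all unstable messages to the new membership first,
        # then consider them stable and announce we are done (flush).
        for m in self.unstable:
            self.net.multicast(new_view.members, ("msg", m))
        self.unstable.clear()
        self.net.multicast(new_view.members, ("flush", self.pid))
        self.flushed = {self.pid}

    def on_flush(self, sender):
        self.flushed.add(sender)
        if self.flushed >= set(self.pending_view.members):
            self.view = self.pending_view   # everyone flushed: install
```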

58 Symmetric Atomic Broadcast Timestamp = counter + pid Send: increment counter Receive: record the neighbor’s counter; adopt the counter on the message if greater than mine Deliver: accept the message stamped with a counter ≤ every node’s known counter Use pid to break ties [figure: messages stamped (p0,0), (p1,0), (p1,1), (p0,2)]
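
A Python sketch of this delivery rule (illustrative names; peers includes the node itself, and each node's counter is learned through the messages, as the slide implies):

```python
import heapq

class SymmetricBroadcastNode:
    """Sketch of symmetric (Lamport-timestamp) total-order broadcast."""

    def __init__(self, pid, peers):
        self.pid, self.counter = pid, 0
        self.seen = {p: 0 for p in peers}   # latest counter heard per node
        self.pending = []                   # min-heap of (counter, pid, msg)

    def send(self, msg):
        self.counter += 1                   # stamp = (counter, pid)
        self.seen[self.pid] = self.counter
        return (self.counter, self.pid, msg)

    def receive(self, stamped):
        counter, sender, _ = stamped
        self.seen[sender] = max(self.seen[sender], counter)
        self.counter = max(self.counter, counter)   # adopt larger counter
        heapq.heappush(self.pending, stamped)

    def deliverable(self):
        """Deliver messages whose counter is <= every node's known counter;
        heap order on (counter, pid) breaks ties deterministically."""
        out = []
        while self.pending and self.pending[0][0] <= min(self.seen.values()):
            out.append(heapq.heappop(self.pending)[2])
        return out
```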

59 Causally Ordered Broadcast Every node maintains a vector timestamp vt Increase my own entry upon send Self-messages delivered immediately Deliver a message from neighbor j stamped with v once v[j] = vt[j] + 1 and v[k] ≤ vt[k] for every k ≠ j [figure: vector timestamps (0,1,0), (0,2,0), (0,3,0), (1,3,0)]
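
A Python sketch of vector-timestamp causal delivery for node i of n (illustrative; buffered messages are re-checked whenever a delivery unblocks others):

```python
class CausalBroadcastNode:
    """Sketch of causally ordered broadcast with vector timestamps."""

    def __init__(self, i, n):
        self.i, self.vt = i, [0] * n
        self.pending = []               # (sender, v, msg) awaiting delivery

    def send(self, msg):
        self.vt[self.i] += 1            # self-messages count as delivered
        return (self.i, list(self.vt), msg)

    def can_deliver(self, sender, v):
        # Next expected message from the sender, and no causal gaps:
        # everything the sender had seen, we have seen too.
        return (v[sender] == self.vt[sender] + 1 and
                all(v[k] <= self.vt[k]
                    for k in range(len(v)) if k != sender))

    def receive(self, packet):
        self.pending.append(packet)
        delivered, progress = [], True
        while progress:                 # drain everything now deliverable
            progress = False
            for p in list(self.pending):
                sender, v, msg = p
                if self.can_deliver(sender, v):
                    self.vt[sender] += 1
                    delivered.append(msg)
                    self.pending.remove(p)
                    progress = True
        return delivered
```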

