1
RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks Nancy Lynch, MIT Alex Shvartsman, U. Conn. DISC 2002 October 29, 2002
2
Goal An algorithm to implement atomic read/write shared memory in a dynamic network setting. Participants may join, leave, fail during computation. Mobile networks, peer-to-peer networks. High availability, low latency. Atomicity for all patterns of asynchrony and change. Good performance under reasonable limits on asynchrony and change. Applications: Battle data for teams of soldiers in military operation. Game data for players in multiplayer game.
3
Approach: Dynamic Quorums
Objects are replicated at several network locations. To accommodate small, transient changes: Uses quorum configurations: members, read-quorums, write-quorums. Maintains atomicity during stable situations. Allows concurrency. To handle larger, more permanent changes: Reconfigure. Maintains atomicity across configuration changes. Any configuration can be installed at any time. Reconfigures concurrently with reads/writes; no heavyweight view change, as is needed in group communication solutions.
4
RAMBO RAMBO: Reconfigurable Atomic Memory for Basic Objects (dynamic atomic read/write shared memory). Global service specification: [Figure: RAMBO service.] Algorithm: Reads and writes objects. Chooses new configurations, notifies members. Identifies, garbage-collects obsolete configurations. All concurrently.
5
RAMBO algorithm structure
Main algorithm + reconfiguration service, loosely coupled. Recon service: Provides the main algorithm with a consistent sequence of configurations. Main algorithm: Handles reading, writing. Receives, disseminates new configuration information; no formal installation. Garbage-collects old configurations. Reads/writes may use several configurations. [Figure: RAMBO, Recon, and Net components.]
6
Main algorithm: Reads/writes
Uses two-phase strategy [Attiya, Bar-Noy, Dolev 96]: Phase 1: Collect object values from read-quorums of active configurations. Phase 2: Propagate latest value to write-quorums of active configurations. Operations may execute concurrently. Quorum intersection properties guarantee atomicity. Our communication mechanism: Background gossiping Terminate by fixed-point condition, involving a quorum from each active configuration.
7
Removing old configurations
Main algorithm removes old configurations by garbage-collecting them in the background. Two-phase garbage-collection procedure: First phase: Inform write-quorum of old configuration about the new configuration. Collect object values from read-quorum of the old configuration. Second phase: Propagate the latest value to a write-quorum of the new configuration. Garbage-collection concurrent with reads/writes. Implemented using gossiping and fixed points.
8
Implementation of Recon
Uses distributed consensus to determine successive configurations 1, 2, 3, … Members of old configuration propose new configuration. Proposals reconciled using consensus. Consensus is a heavyweight mechanism, but: Used only for reconfigurations, which are infrequent. Does not delay Read/Write operations. [Figure: Recon built on Consensus and Net.]
9
Implementation of consensus
Use a version of the Paxos algorithm [Lamport 89, 98, 02]. [Figure: Consensus component with actions init(v) and decide(v).] Agreement, validity guaranteed absolutely. Termination guaranteed if/when underlying system stabilizes. FLP implies, of course, that termination can’t be guaranteed absolutely; it depends on good behavior on the part of the underlying network.
10
Models and analysis I/O automaton models.
Prove atomicity for arbitrary patterns of asynchrony and change. Analyze performance conditionally, based on failure and timing assumptions. Reads and writes take time at most 8d, under reasonable “steady-state” assumptions. The picture, with circles and arrows, is a pictorial representation of a completely rigorous presentation. The circles represent interacting state machines (automata). Such state machines are used to represent all system components, including the high-level system specification, the low-level network, the algorithm components running at all the network nodes, models for applications running on top of services, etc. Making this all formal enables correctness proofs and performance analysis. The supporting theory is a foundation based on interacting state machine models: basic asynchronous models plus other features (timing, hybrid continuous/discrete behavior, probabilities, …), together with composition and abstraction. Applying the theory has mostly meant using the modeling/analysis methods on real systems. Some of the prototype building blocks we have developed have been used as starting points for modeling/analyzing similar building blocks that appear in real systems.
11
Other approaches Use consensus to agree on total ordering of operations: [Lamport 89…] Not resilient to transient failures. Termination of r/w depends on termination of consensus. Totally-ordered broadcast over group communication: [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96] View formation takes a long time, delays reads/writes. One change may trigger view formation. Dynamic quorums over GC: [De Prisco, et al, 99] New view must satisfy intersection requirements. Single reconfigurer: [Lynch, Shvartsman 97], [Englert, Shvartsman 00]
12
Outline of talk 1. Introduction
2. Reconfigurable Atomic Memory (RAMBO) specification 3. Reconfiguration service (Recon) specification 4. Implementation of RAMBO using Recon 5. Proof of atomicity 6. Implementation of Recon 7. Conditional performance results 8. Conclusions
13
2. RAMBO Service Specification
I, infinite set of participants’ locations. X, set of objects. C, configuration identifiers. External actions for each i and x: Inputs: join_{x,i}, read_{x,i}, write(v)_{x,i}, recon(c,c’)_{x,i}. Outputs: join-ack_{x,i}, read-ack(v)_{x,i}, …, report(c)_{x,i}. Ignore joins in this talk. Behavior: Assuming basic well-formedness conditions, RAMBO guarantees atomicity. Liveness replaced by latency bounds. [Figure: RAMBO service with its external actions.] The J is a set of locations presumed to already be in the system; it is used to help in the join. The recon(c,c’) request means that the current config is assumed to be c and c’ is being proposed. The proposing participant has to be a member of c. Well-formedness assumptions just say things like: the requests alternate with responses; no read, write, or recon is invoked until/unless the participant has joined (with a join-ack); only request recon(c,c’) if you’ve received report(c); requested config identifiers are unique.
14
Atomicity AKA linearizability
Definition: Each operation appears to occur at some point between its invocation and response. Sufficient condition: For each object x, all the read and write operations for x can be partially ordered by $\prec$, so that: $\prec$ is consistent with the order of invocations and responses: there are no operations $\pi_1$, $\pi_2$ such that $\pi_1$ completes before $\pi_2$ starts, yet $\pi_2 \prec \pi_1$. All write operations are ordered with respect to each other and with respect to all the reads. Every read returns the value of the last write preceding it in $\prec$.
15
Implementing RAMBO Composition of separate service for each x.
RAMBO (for x) uses separate Recon service (for x). [Figure: RAMBO and Recon components over Net; labels: read, write, recon, new-config.]
16
3. Recon Service Specification
External actions for each i: Inputs: recon(c,c’)_i. Outputs: recon-ack_i, report(c)_i, new-config(c,k)_i. And some joining actions (ignore). Behavior: Assuming well-formedness, Recon produces consistent configuration identifiers at participating locations: Agreement: Two configs are never assigned to the same k. Validity: Any announced new-config was previously requested by someone. No duplication: No configuration is assigned to more than one k. A version of atomic broadcast.
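The three Recon properties can be read as checks on a trace of new-config announcements. The following is a minimal sketch under stated assumptions: the function name `check_recon_trace`, the trace encoding, and the returned strings are hypothetical and not part of the talk, and the initial configuration (which nobody requests) is assumed to be left out of the trace.

```python
def check_recon_trace(requested, announced):
    """Check Agreement, Validity, and No duplication on a finite trace.

    requested: set of configuration ids proposed via recon(c, c').
    announced: list of (config, k) pairs, one per new-config(c, k) output
               at any location (the same pair may appear many times).
    """
    by_index = {}    # k -> the config announced for index k
    by_config = {}   # config -> the index it was announced for
    for c, k in announced:
        if by_index.setdefault(k, c) != c:
            return "agreement violated"       # two configs for the same k
        if c not in requested:
            return "validity violated"        # announced but never requested
        if by_config.setdefault(c, k) != k:
            return "no-duplication violated"  # same config at two indices
    return "ok"
```

Repeating the same (config, k) pair is fine, since several locations may announce it; only conflicting pairs are flagged.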
17
4. Implementing RAMBO using Recon
Recon: Chooses configurations. Tells members of the previous and new configuration. Informs Reader-Writer components (new-config). Reader-Writer: Conducts read and write operations. Two-phased quorum-based algorithm. Uses all current configurations. Garbage-collects obsolete configurations. [Figure: component diagram.] Join external connections are join and join-ack. Read/write external connections are r/w and their acks. Recon external connections are recon and its ack, plus a report output. New-config arrows go from Recon to R/W. Joiner-recon arrows are for the joining of the Recon service, and likewise for join-rw. All components use the network.
18
Static Reader-Writer protocol
Quorum configuration for I: read-quorums, write-quorums, two collections of subsets of I. For any R in read-quorums and W in write-quorums, $R \cap W \neq \emptyset$. Replicate the object x at all locations in I. At each i in I, keep: a value, and a tag consisting of (sequence number, location). Read, Write use two phases: Phase 1: Read (value, tag) from a read-quorum. Phase 2: Write (value, tag) to a write-quorum. The starting point for the algorithm is a standard static atomic memory algorithm of Attiya, Bar-Noy, Dolev. This is a 2-phase algorithm for reading and writing, using quorums. The read also has a write phase: after the value to be returned by the read is determined, it continues by propagating this value to a write-quorum. This ensures that no later read can miss this value and return an earlier one, which would violate atomicity. A read could return an unconfirmed value after phase 1; that gives good information, but the user has to understand that it isn’t yet guaranteed to persist, that is, others might not see it. This can be done highly concurrently, e.g., writes that arrive out-of-order simply don’t get done. No heavyweight mechanisms like transactions are used. Intersection of quorums guarantees every read gets the latest value.
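The intersection requirement on a quorum configuration is easy to check mechanically. Here is a small sketch, assuming quorums are represented as Python sets of location names; the function name `is_valid_configuration` and the three-member majority example are illustrative choices, not taken from the talk.

```python
from itertools import combinations

def is_valid_configuration(read_quorums, write_quorums):
    """True if every read-quorum intersects every write-quorum."""
    return all(r & w for r in read_quorums for w in write_quorums)

# Hypothetical example: with three members, any two of them form a quorum.
members = ["a", "b", "c"]
majorities = [frozenset(q) for q in combinations(members, 2)]
```

Majorities always pass the check, because two majorities of the same member set must share a member. Disjoint singleton quorums such as {a} and {b} fail it.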
19
Static Reader-Writer protocol
Write at location i: Phase 1: Read (value, tag) from a read-quorum. Determine largest seq-number among the tags that are read. Choose new-tag := (larger sequence-number, i). Phase 2: Propagate (new-value, new-tag) to a write-quorum. Read at location i: Phase 1: Read (value, tag) from a read-quorum. Determine largest (value, tag) among those read. Phase 2: Propagate this (value, tag) to a write-quorum. Return value. Highly concurrent. Quorum intersection implies atomicity. Readers need to propagate in order to guarantee atomicity; otherwise sequential reads may return values out-of-order.
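The two-phase read and write can be sketched as a single-process simulation. This is only an illustration under simplifying assumptions: replicas live in one dictionary, quorums are passed in explicitly, and there is no messaging, concurrency, or failure. The names `Replica`, `query`, and `propagate` are hypothetical.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.tag = (0, "")   # (sequence number, location)

def query(replicas, read_quorum):
    """Phase 1: return the (tag, value) pair with the largest tag in the quorum."""
    pairs = [(replicas[j].tag, replicas[j].value) for j in read_quorum]
    return max(pairs, key=lambda p: p[0])

def propagate(replicas, write_quorum, tag, value):
    """Phase 2: install (value, tag) at quorum members that hold an older tag."""
    for j in write_quorum:
        if tag > replicas[j].tag:
            replicas[j].tag, replicas[j].value = tag, value

def write(replicas, i, value, read_quorum, write_quorum):
    (seq, _), _ = query(replicas, read_quorum)
    propagate(replicas, write_quorum, (seq + 1, i), value)

def read(replicas, i, read_quorum, write_quorum):
    tag, value = query(replicas, read_quorum)
    propagate(replicas, write_quorum, tag, value)  # readers propagate too
    return value

replicas = {name: Replica() for name in "abc"}
write(replicas, "a", "x1", read_quorum={"a", "b"}, write_quorum={"b", "c"})
result = read(replicas, "c", read_quorum={"a", "c"}, write_quorum={"a", "b"})
```

In this run the read's quorum {a, c} overlaps the write's quorum {b, c} at c, so the read finds tag (1, "a") and then copies the value to a as part of its own propagation phase.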
20
Extend to dynamic setting
Any member of current configuration can propose a new configuration. Recon produces consistent configurations. Reader-Writer processes run two-phase static quorum-based algorithm, using all current configurations. Uses gossip and fixed-point tests. When Recon provides new configuration, Reader-Writer doesn’t abort reads/writes in progress, but does extra work to access additional processes needed for new quorums.
21
Configurations and Config Maps
Configuration c: members(c), the “owners” of the data in configuration c; read-quorums(c); write-quorums(c). Configuration map cm: Sequence of configurations. cm(k) can be defined, undefined (⊥), or garbage-collected (±). To describe the information maintained by reader/writer processes about configurations, we need some new data types. Configuration maps will keep track of information known somewhere, about all configurations. They will always follow the pattern given here: a finite number (possibly 0) known to be gc’d already, then a solid block of known configurations (at least one), then a finite mixture of known and unknown, then an infinite tail of unknowns. [Figure: a configuration map divided into four regions: GC’d, Defined, Mixed, Undefined.]
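The pattern that configuration maps follow can be stated as a check on a finite prefix of the map. This is a sketch under assumptions: the map is a Python list indexed by k, with the string "±" for garbage-collected entries and `None` for undefined ones, and everything past the end of the list is taken to be undefined. The name `is_well_formed` is hypothetical.

```python
GC = "±"       # garbage-collected marker
UNDEF = None   # undefined marker

def is_well_formed(cmap):
    """Check: GC'd block, then at least one defined entry, then no more GC'd."""
    k = 0
    while k < len(cmap) and cmap[k] == GC:
        k += 1                       # skip the leading GC'd block
    if k == len(cmap) or cmap[k] is UNDEF:
        return False                 # need at least one defined entry next
    # After the defined block starts, entries may be defined or undefined,
    # but never garbage-collected.
    return all(entry != GC for entry in cmap[k:])
```

For instance, `[GC, GC, "c2", "c3", None, "c5"]` fits the pattern, while `["c0", GC]` does not, because a garbage-collected entry follows a defined one.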
22
Configuration maps
[Figure: a sequence of example configuration maps, showing configurations c0, c1, c2, …, ck being added over time and earlier entries being replaced by ± as they are garbage-collected.] In the step where c2 is gc’d: this can happen in one step, since a process can learn about the new c3 and about the gc of c2 both in one message-receive step.
23
Reader-Writer state
world; value, tag; cmap. pnum1, counts phases of locally-initiated operations. pnum2, records latest known phase numbers for all locations. op-record, keeps track of the status of a current locally-initiated read/write operation; includes op.cmap, consisting of consecutive configs. gc-record, keeps track of the status of a current locally-initiated garbage-collection operation. op.cmap gives the set of configurations being used for the operation. (Actually, it’s for each phase of the operation.)
24
Reader-Writer protocol
One kind of message, gossiped nondeterministically. Message <W, v, t, cm, ns, nr> from i to j, where: W is i’s world; v, t are i’s value and tag; cm is i’s cmap; ns is i’s phase number, pnum1; nr is the latest phase number i knows for j, pnum2(j). (ns, nr) is used to identify “fresh” messages. Key actions are taken when “enough” information has been gathered (fixed point).
25
When <W,v,t,cm,ns,nr> arrives from j:
world := world $\cup$ W. If t > tag then (value, tag) := (v, t). cmap := update(cmap, cm), which updates cmap with newer information in cm. pnum2(j) := max(pnum2(j), ns). gc-record: If message is “fresh”, record the sender. op-record: If message is “fresh”: Record the sender. Extend op.cmap with newly-discovered configurations. [Figure: the update operation for cmap.]
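The state changes on message receipt can be sketched as follows. Several things here are assumptions rather than content from the slide: `update` is modeled as a pointwise merge in which a garbage-collected entry overrides a defined one and a defined entry overrides an undefined one; the `State` class and the list encoding of cmap are hypothetical; and the op-record and gc-record bookkeeping for "fresh" messages is left out.

```python
from dataclasses import dataclass, field

GC = "±"       # garbage-collected marker
UNDEF = None   # undefined marker

def update(cmap, cm):
    """Pointwise merge of two cmaps (assumed order: undefined < defined < GC'd)."""
    n = max(len(cmap), len(cm))
    a = list(cmap) + [UNDEF] * (n - len(cmap))
    b = list(cm) + [UNDEF] * (n - len(cm))
    merged = []
    for x, y in zip(a, b):
        if x == GC or y == GC:
            merged.append(GC)
        else:
            merged.append(x if x is not UNDEF else y)
    return merged

@dataclass
class State:
    world: set = field(default_factory=set)
    value: object = None
    tag: tuple = (0, "")
    cmap: list = field(default_factory=list)
    pnum2: dict = field(default_factory=dict)

def receive(state, j, W, v, t, cm, ns):
    """Apply a message <W, v, t, cm, ns, nr> from j (nr is unused in this sketch)."""
    state.world |= W
    if t > state.tag:
        state.value, state.tag = v, t
    state.cmap = update(state.cmap, cm)
    state.pnum2[j] = max(state.pnum2.get(j, 0), ns)
```

Because every field only moves forward (sets grow, tags and phase numbers increase, cmap entries advance), receiving the same message twice leaves the state unchanged, which suits nondeterministic gossip.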
26
Processing reads and writes
Reads and Writes perform Query and Propagation phases using known configurations, stored in op.cmap. Query phase: Obtains fresh value, tag, cmap information from read-quorums. Propagation phase: Propagates up-to-date (value,tag) to write-quorums; obtains fresh cmap information from write-quorums. Both phases: Extend op.cmap with newly-discovered configurations; new configurations are also used in the phase. Each phase ends with a fixed point, after hearing from quorums of all the configurations currently in op.cmap. Technicality: This can only add new configs, never remove any. Must use truncated version. And if this isn’t possible, the phase can restart.
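The fixed-point condition that ends a phase can be sketched as a predicate over the set of processes heard from so far. The representation is an assumption: `quorums_of` maps each configuration id to its list of read-quorums (for the query phase) or write-quorums (for the propagation phase), and the name `phase_done` is hypothetical.

```python
def phase_done(op_cmap, responders, quorums_of):
    """True once `responders` contains some quorum of every config in op_cmap."""
    return all(
        any(q <= responders for q in quorums_of[c])
        for c in op_cmap
    )

# Hypothetical read-quorums for two configurations in use by one operation.
read_quorums = {
    "c1": [{"a", "b"}, {"b", "c"}],
    "c2": [{"c", "d"}],
}
```

With both c1 and c2 in op.cmap, hearing from {a, b} alone is not enough; the phase ends only after a quorum of c2 has also responded. If a new configuration is discovered mid-phase, it is added to `op_cmap` and the predicate becomes harder to satisfy, matching the remark that new configurations are also used in the phase.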
27
Garbage collection
A process can try to GC config k when its cmap looks like: ± … ± c_k c_{k+1} … (all entries before k are ±, and c_k and c_{k+1} are defined). Phase 1: Informs a write-quorum of c_k about c_{k+1}. Collects latest (value, tag) from a read-quorum of c_k. Phase 2: Propagates (value, tag) to a write-quorum of c_{k+1}. Set cmap(k) to ±. GC operates concurrently with reads and writes.
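The condition under which a process may start garbage-collecting configuration k can be written as a small predicate. This sketch assumes the cmap is a Python list with "±" for garbage-collected entries and `None` for undefined ones, and reads the slide's picture as requiring every entry before k to be garbage-collected; the name `can_gc` is hypothetical.

```python
GC = "±"       # garbage-collected marker
UNDEF = None   # undefined marker

def is_defined(entry):
    return entry is not UNDEF and entry != GC

def can_gc(cmap, k):
    """True if configs before k are GC'd and both c_k and c_{k+1} are defined."""
    return (
        k + 1 < len(cmap)
        and all(entry == GC for entry in cmap[:k])
        and is_defined(cmap[k])
        and is_defined(cmap[k + 1])
    )

def finish_gc(cmap, k):
    """After both GC phases complete, mark configuration k as garbage-collected."""
    return cmap[:k] + [GC] + cmap[k + 1:]
```

Only the bookkeeping is shown; the two phases themselves (informing a write-quorum of c_k, collecting from a read-quorum of c_k, propagating to a write-quorum of c_{k+1}) would run between the two calls.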
28
5. Proof of Atomicity Atomicity holds for:
arbitrary patterns of asynchrony, arbitrary crash-failures and message loss, arbitrary joins. Proof: Construct a partial order $\prec$ of read and write operations satisfying: $\prec$ is consistent with the order of invocations and responses. All write operations are ordered with respect to each other and with respect to all the reads. Every read returns the value of the last write preceding it in $\prec$. Let $\prec$ be the lexicographic order on the operations’ tags, and order a write with tag t before all reads with tag t. Tag of a write is the one chosen at the midpoint. Tag of a read is the one it finds at the midpoint. In both cases, it’s the tag that the operation propagates.
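The ordering used in the proof can be illustrated by sorting operations on their tags. This is a sketch with an assumed encoding: each operation is a (kind, tag) pair, and tags are (sequence number, location) tuples compared lexicographically. Note that sorting yields one linear extension of the partial order, since reads with the same tag are not ordered relative to each other.

```python
def order_key(op):
    """Sort by tag; for equal tags, put the write before the reads."""
    kind, tag = op
    return (tag, 0 if kind == "write" else 1)

ops = [
    ("read", (1, "a")),
    ("write", (2, "b")),
    ("write", (1, "a")),
    ("read", (2, "b")),
]
ordered = sorted(ops, key=order_key)
```

In the sorted sequence each read immediately follows the write whose tag it carries, which is the "every read returns the value of the last preceding write" condition.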
29
Showing consistency Lemma 1: Tags of GC operations are nondecreasing with respect to the configuration index. Proof: GC is done sequentially. Lemma 2: If the first GC of config k completes before a read/write operation $\pi$ begins, then the tag of the GC is less than or equal to the tag of $\pi$ (< if $\pi$ is a write). Lemma 3: If $\pi_1$ and $\pi_2$ are two read/write operations and $\pi_1$ completes before $\pi_2$ begins, then the tag of $\pi_1$ is less than or equal to the tag of $\pi_2$ (< if $\pi_2$ is a write).
30
Proof of Lemma 3 Assume $\pi_1$ and $\pi_2$ are two read/write operations and $\pi_1$ completes before $\pi_2$ begins. Each phase uses consecutive configurations. Case 1: prop-cmap($\pi_1$) and query-cmap($\pi_2$) share a configuration c. Quorum intersection for c yields the tag inequality. Case 2: All configs in prop-cmap($\pi_1$) are less than all those in query-cmap($\pi_2$). The tag inequality follows from a chain of tag inequalities, following a chain of GC operations for the intervening configurations. Uses Lemmas 1 and 2. Case 3: All configs in prop-cmap($\pi_1$) are greater than all those in query-cmap($\pi_2$). Impossible. The fact that each phase uses consecutive configs is a technical detail I neglected to mention in describing the algorithm. When we include new information, we want to assemble a version that has no “gaps”. If we can’t, we restart the phase. Restarting is reasonable in this case, because it means that we were executing a phase with very old configs and have now heard about much newer ones, so much newer that configurations in between have been gc’d. Case 2: An interesting part of the argument is the propagation of the tag from the first r/w to the first gc in the chain. There is some process in common between the write-quorum used in the r/w second phase and the read-quorum used in the first phase of the gc. We have to consider cases based on which operation that process participates in first. If it participates in the r/w phase before the gc phase, the tag will propagate. If it participates in the gc first, it will learn about the next config and will tell this to the process conducting the second phase of the r/w, which means that the new config will get used in the r/w. That’s a contradiction.
31
6. Implementing Recon Recon algorithm uses (static) consensus services to determine configurations 1, 2, 3, … Cons(k,c): Used to determine config k, if config k-1 is c. Consensus is used only for reconfigurations, does not delay read and write operations. [Figure: Recon (actions recon, recon-ack) built on Consensus and Net.] The Recon service is, in turn, implemented using a set of consensus services. A different consensus service is used to agree on each successive configuration.
32
Implementing Recon Simple: no atomicity issues.
Members of old configuration may propose a new configuration; proposals reconciled using consensus. recon(c,c’): Request for reconfiguration from c to c’. If c is the k-1st configuration (and is current), then send init message to members; invoke Cons(k,c) with initial value c’ Receive an init message: Participate in consensus. decide(c’): Tell Reader-Writer the new configuration; send config message to members of c’. Receipt of config message: Tell Reader-Writer the new configuration. Consensus implemented using Paxos Synod algorithm.
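The way Recon chains consensus instances can be sketched in a single process. This is a toy illustration under heavy assumptions: the `Cons` class is a stand-in in which the first proposal wins, not the Paxos Synod algorithm; the init and config messages between members are omitted; one consensus instance is kept per index k rather than per (k, c) pair; and the class and method names are hypothetical.

```python
class Cons:
    """Stand-in for one consensus instance: the first proposed value is decided."""
    def __init__(self):
        self.decision = None

    def init(self, v):
        if self.decision is None:
            self.decision = v
        return self.decision

class Recon:
    def __init__(self, c0):
        self.configs = [c0]   # configs[k] is the k-th configuration
        self.cons = {}        # k -> consensus instance for index k

    def recon(self, c, c_new):
        """Request reconfiguration from c to c_new; return (decided, k) or None."""
        k = len(self.configs)
        if self.configs[k - 1] != c:
            return None       # c is not the current configuration
        decided = self.cons.setdefault(k, Cons()).init(c_new)
        self.configs.append(decided)
        return (decided, k)   # corresponds to new-config(decided, k)

r = Recon("c0")
first = r.recon("c0", "c1")
stale = r.recon("c0", "c2")   # proposer still believes c0 is current
second = r.recon("c1", "c2")
```

The stale request has no effect because c0 is no longer current. In the distributed algorithm, competing proposals for the same index would instead be reconciled by the consensus service.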
33
7. Latency Analysis Consider a subset of timed executions:
Gossip occurs: Periodically, and At certain key times: At beginning of operation phase. Just after receiving a message from someone with a new phase number. Just after certain join and reconfiguration events. Perform local steps immediately. Reliable message delivery, bounded delay. Normal timing for consensus services.
34
Additional assumptions
e-Configuration-viability, for time parameter e: A read-quorum and a write-quorum of configuration k remain alive until at least time e after configuration k+1 is “installed” (decided upon by all non-failed members of configuration k). e-Reconfiguration-spacing: recon(c,*)_i occurs at least time e after report(c)_i. e-Join-connectivity: If i and j join by time t, then they learn about each other by time t+e.
35
Latency results Reconfiguration:
13d, if recon(c,c’)_i occurs and no members of c subsequently fail. Garbage-collection of c_k by process i: 4d, if process i, a read-quorum and a write-quorum of c_k, and a write-quorum of c_{k+1} do not fail. Read or write operation by process i in a “stable” system: 4d, if no reconfigurations occur, and process i’s cmap is “up-to-date”. Learning about configurations: If i and j are “old enough” and don’t fail, then information from i is conveyed to j within time 2d. These bounds do not depend on periodic gossip.
36
Latency results Garbage-collection, in executions with 6d-reconfiguration-spacing and 5d-configuration-viability: If report(c) occurs at i and i does not fail then any non-failed process that is old enough learns about c and garbage-collects any older configuration within time 6d. Read and write operations, in executions with 12d-reconfiguration-spacing and 11d-configuration-viability: 8d, for an operation managed by a process that is old enough and does not fail. These bounds do depend on periodic gossip. The gc bound can be used to show that, in this normal case, no process ever has more than two configurations in its local cmap.
37
8. Conclusions RAMBO algorithm
Composed of R/W algorithm, Recon service, Consensus Atomicity in all executions. Good latency bounds: For reading, writing, garbage-collection. Under assumptions about timing, joins, failures, and rate of reconfiguration.
38
Algorithmic innovations
Dynamic configurations: Members can be changed dynamically. Any current member may request reconfiguration. Arbitrary configurations can be installed; no intersection requirements. Loosely-coupled reconfiguration: Concurrent reading, writing, reconfiguration. Reads/writes can use several configurations; can complete during reconfiguration. Efficient “steady-state”: Assuming bounded delays, infrequent reconfiguration, and periodic gossip, read and write operations complete in time 8d.
39
Comparison with other approaches
Using consensus to agree on a total ordering of operations: We use consensus only for the configurations. Consensus termination impacts only reconfiguration latency, not read and write latency. Group communication: Our reads/writes work during “new view” establishment. Dynamic quorum configurations over GC: We allow arbitrary new configurations - no intersection requirements. Single reconfigurer approaches: We allow multiple reconfigurers. We uncouple introduction of new configurations and garbage-collection of old configurations.
40
Current and future work
LAN implementation [Musial, Shvartsman] More analysis: “Normal behavior” starting from some point Tradeoff between configuration-viability and gc rate. Algorithmic improvements and additions: Concurrent garbage-collection [Gilbert] Reducing communication. Better join protocol, explicit “leave” protocol. Early return of read values. Backup strategies for when configuration-viability fails. Choosing good configurations. Extensions to other data types?