1 1 Compositional Design and Analysis of Timing-Based Distributed Algorithms Nancy Lynch Theory of Distributed Systems MIT Third MURI Workshop Arlington-Ballston, Virginia December 10, 2002

2 2 MIT Participants Leader –Nancy Lynch Postdoctoral associates –Idit Keidar, Dilsun Kirli Graduate students –Roger Khazan, Carl Livadas, Ziv Bar-Joseph, Rui Fan, Seth Gilbert, Sayan Mitra Collaborators –Alex Shvartsman and students, Frits Vaandrager, Roberto Segala

3 3 Project Overview Design and analyze distributed algorithms that implement global services with strong guarantees, e.g.: –Reliable communication –Strongly coherent data services Dynamic environment, where processes join, leave, and fail. Algorithms composed of sub-algorithms. Analyze performance conditionally, under various assumptions about timing and failures. Develop underlying mathematical modeling framework, based on interacting state machines (IOA, TIOA), capable of: –Describing precisely all the algorithms we study. –Supporting compositional and conditional analysis.

4 4 Algorithms Studied Scalable group communication [Khazan, Keidar] Early-delivery dynamic atomic broadcast [Bar-Joseph, Keidar, Lynch] Reconfigurable atomic memory [Lynch, Shvartsman] Scalable reliable multicast [Livadas, Keidar, Lynch] In progress: –Reconfigurable atomic memory –Peer-to-peer: Fault-tolerant location services, data services –Mobile: Topology control, clock synchronization, tracking

5 5 This Talk I. Completed work: Scalable group communication; Early-delivery dynamic atomic broadcast II. Reconfigurable atomic memory III. Reliable multicast IV. Modeling framework V. Plans for the next two years

6 6 I. Completed Work: Scalable Group Communication [Keidar, Khazan 00, 02], [Khazan 02], [Keidar, Khazan, Lynch, Shvartsman 02], [Taraschanskiy 00] [Figure: GCS.]

7 7 Group Communication Services Cope with changing participants using abstract groups of client processes with changing membership sets. Processes communicate with group members indirectly, by sending messages to the group as a whole. GC services support management of groups: –Maintain membership information. Form new views in response to changes. –Manage communication. Communication respects views. Provide guarantees about ordering and reliability of message delivery. Virtual synchrony Applications: Managing replicated data; distributed multiplayer games; collaborative work

8 8 Scalable GC Algorithm Specification, including virtual synchrony. New algorithm: –Uses a scalable membership service, implemented on a small set of membership servers. –Multicast implemented on all the nodes. –View change uses only one round for state exchange, in parallel with membership service’s agreement on views. –Participants can join during view formation. [Figure: components GCS, Net, Memb.]

9 9 Analysis Safety proofs, using incremental proof methods. Liveness proofs. Performance analysis: –Time from when network stabilizes until GCS announces final view. –Message latency. –Conditional analysis, based on input, failure, and timing assumptions. –Compositional analysis, based on performance of membership service and Net. Modeled and analyzed data-management application running on top of the new GCS. Distributed implementation [Taraschanskiy 00]. [Figure: S, S’, A, A’.]

10 10 Completed Work: Early-Delivery Dynamic Atomic Broadcast [Bar-Joseph, Keidar, Lynch 02] [Figure: DAB.]

11 11 Dynamic Atomic Broadcast Atomic broadcast, where processes may join, leave, or fail. Safety: Sending, receiving orders are consistent with a single global message ordering (no gaps). Liveness: Eventual completion of joins, leaves. Eventual delivery, including the process’ own messages. Fast delivery, even with joins, leaves. Application: Distributed multiplayer interactive games. [Figure: DAB with actions join, leave, mcast(m), join-ack, leave-ack, rcv(m), …]

12 12 Implementing DAB Processes: –Timing-dependent, have approximately-synchronized clocks. Net: –Pairwise FIFO delivery. –Low latency. –But does not guarantee a single total order, nor that all processes see the same messages from a failing process. [Figure: DAB, Net; actions join, net-join.]

13 13 Dynamic Atomic Bcast Algorithm Processes coordinate message delivery: –Divide time into slots using local clock, assign messages to slots. –Deliver messages in order of (slot, sender id). –Determine members of each slot, deliver only messages from members. Processes must agree on slot membership: –Joining process selects join-slot, informs others. –Similarly for leaving process. –Failed process results in consensus on failure slot. Requires a new kind of consensus service: Consensus with Uncertain Participants (CUP). –Participants not known a priori. –Each participant submits its perceived “World”. –Processes may abstain.
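
The delivery rule on this slide (assign messages to slots by local clock, keep only messages from that slot's members, deliver in (slot, sender id) order) can be sketched as follows. This is a minimal single-machine illustration, not the authors' algorithm: the names `Message`, `slot_of`, `delivery_order`, the fixed `slot_length`, and the precomputed `members_by_slot` map are all assumptions introduced here, and the agreement on slot membership (which the slide delegates to CUP) is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int        # sender id, the tie-breaker within a slot
    send_time: float   # sender's local clock reading at send time
    payload: str


def slot_of(send_time: float, slot_length: float) -> int:
    """Map a local clock reading to a slot index (hypothetical fixed-length slots)."""
    return int(send_time // slot_length)


def delivery_order(messages, members_by_slot, slot_length):
    """Return payloads in (slot, sender id) order, keeping only members' messages.

    members_by_slot maps a slot index to the set of sender ids agreed to be
    members of that slot; how that agreement is reached is outside this sketch.
    """
    tagged = []
    for m in messages:
        slot = slot_of(m.send_time, slot_length)
        if m.sender in members_by_slot.get(slot, set()):
            tagged.append((slot, m.sender, m.payload))
    # The sort is stable, so two messages from one sender in one slot keep input order.
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [payload for _, _, payload in tagged]


msgs = [
    Message(2, 0.4, "b"),
    Message(1, 0.7, "a"),
    Message(3, 1.2, "c"),   # sender 3 is not a member of slot 1
    Message(1, 1.5, "d"),
]
members = {0: {1, 2}, 1: {1}}
order = delivery_order(msgs, members, slot_length=1.0)
```

Because every process applies the same deterministic rule to the same agreed membership, all processes that see the same messages compute the same order.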

14 14 The DAB Algorithm Using CUP [Figure: labels DAB i1, DAB i2, CUP(j), DAB, Net, fail.]

15 15 Consensus with Uncertain Participants CUP Problem: –Guarantees agreement, validity, termination. –Assumes submitted worlds are “close”: a process that initiates is in other processes’ worlds, and a process in anyone’s world initiates, abstains, leaves, or fails. CUP Algorithm: –A new early-stopping consensus algorithm. –Similar to [Dolev, Reischuk, Strong 90], but tolerates uncertainty about participants and tolerates processes leaving. –Terminates in two rounds when failures stop, even if leaves continue. –Latency linear in number of actual failures.

16 16 Analysis Compositional analysis: Properties of CUP used to prove properties of DAB: –Safety: CUP agreement and validity imply DAB atomic broadcast consistency guarantees. –Liveness: CUP safety and liveness properties (e.g., termination) imply DAB liveness properties (e.g., eventual delivery). –Latency: CUP decision bounds imply DAB message delay bounds. Message latency: –No failures: Constant, even when participants join and leave. –With failures: Linear in the number of failures. –Improves upon algorithms using group communication.

17 17 II. Reconfigurable Atomic Memory for Dynamic Distributed Environments [Lynch, Shvartsman 02] [Figure: RAMBO.]

18 18 Reconfigurable Atomic Memory Implement atomic read/write shared memory in a dynamic network setting. –Participants may join, leave, fail. –Mobile networks, peer-to-peer networks. High availability, low latency. Atomicity for all patterns of asynchrony and change. Good performance under reasonable limits on asynchrony and change. Applications: –Battle data for teams of soldiers in military operation. –Game data for players in multiplayer game.

19 19 Approach: Dynamic Quorums Objects are replicated at several network locations. To accommodate small, transient changes: –Uses quorum configurations: members, read-quorums, write-quorums. –Maintains atomicity during stable situations. –Allows concurrency. To handle larger, more permanent changes: –Reconfigure –Maintains atomicity across configuration changes. –Any configuration can be installed at any time. –Reconfigure concurrently with reads/writes; no heavyweight view change.

20 20 RAMBO: Reconfigurable Atomic Memory for Basic Objects (dynamic atomic read/write shared memory). Global service specification: [Figure: RAMBO.] Algorithm: –Reads and writes objects. –Chooses new configurations, notifies members. –Identifies, garbage-collects obsolete configurations. –All concurrently.

21 21 RAMBO Algorithm Structure Main algorithm + reconfiguration service, loosely coupled. Recon service: –Provides a consistent sequence of configurations. Main algorithm: –Handles reading, writing. –Receives, disseminates new configuration information; no formal installation. –Garbage-collects old configurations. –Reads/writes may use several configurations. [Figure: RAMBO, Recon, Net.]

22 22 Main algorithm: Reading and Writing Run a version of the standard static two-phase quorum-based read/write algorithm [Vitanyi, Awerbuch], [Attiya, Bar-Noy, Dolev]. Use all current configurations. [Figure: labels read, write, Net, Recon, new-config.]

23 23 Static Read/write Protocol Quorum configuration: –read-quorums, write-quorums. –For any R in read-quorums, W in write-quorums, R ∩ W ≠ ∅. Replicate object at all locations. At each location, keep: –value –tag = (sequence number, location) Read, write use two phases: –Phase 1: Read (value, tag) from a read-quorum –Phase 2: Write (value, tag) to a write-quorum Highly concurrent. Quorum intersection implies atomicity.
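
The quorum condition on this slide (every read-quorum intersects every write-quorum) is easy to check mechanically. The sketch below is an illustration only; the function name `is_valid_configuration` and the representation of quorums as Python sets of location ids are assumptions made here.

```python
from itertools import product


def is_valid_configuration(read_quorums, write_quorums):
    """True if every read-quorum shares at least one location with every write-quorum."""
    return all(set(r) & set(w) for r, w in product(read_quorums, write_quorums))


# Majority quorums over three locations: any two majorities overlap.
majorities = [{1, 2}, {1, 3}, {2, 3}]
ok = is_valid_configuration(majorities, majorities)

# A read-quorum {1} and a write-quorum {2, 3} are disjoint, so this is rejected.
bad = is_valid_configuration([{1}], [{2, 3}])
```

Note that the check is vacuously true when either list is empty; a real configuration would also require both lists to be non-empty.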

24 24 Static Read/write Protocol Details Write at location i: –Phase 1: Read (value, tag) from a read-quorum. Determine largest seq-number among the tags that are read. Choose new-tag := (larger sequence-number, i). –Phase 2: Propagate (new-value, new-tag) to a write-quorum. Read at location i: –Phase 1: Read (value, tag) from a read-quorum. Determine largest (value,tag) among those read. –Phase 2: Propagate this (value,tag) to a write-quorum. Return value.
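
The two phases described above can be sketched as a sequential, in-memory simulation. This is a simplified illustration under stated assumptions, not the distributed protocol itself: there is no message passing, concurrency, or failure, and the names `Replicas`, `query`, `propagate`, `write`, and `read` are hypothetical. Tags are `(sequence number, location)` pairs compared lexicographically, as on the slide.

```python
class Replicas:
    """In-memory stand-in for the replicated object: location -> (value, tag)."""

    def __init__(self, locations, initial=None):
        self.store = {loc: (initial, (0, 0)) for loc in locations}

    def query(self, quorum):
        """Phase 1: return the (value, tag) pair with the largest tag in the quorum."""
        return max((self.store[loc] for loc in quorum), key=lambda vt: vt[1])

    def propagate(self, quorum, value, tag):
        """Phase 2: install (value, tag) at each quorum member holding an older tag."""
        for loc in quorum:
            if tag > self.store[loc][1]:
                self.store[loc] = (value, tag)


def write(replicas, i, new_value, read_quorum, write_quorum):
    """Write at location i: pick a tag above every tag seen, then propagate."""
    _, (seq, _) = replicas.query(read_quorum)
    new_tag = (seq + 1, i)
    replicas.propagate(write_quorum, new_value, new_tag)
    return new_tag


def read(replicas, read_quorum, write_quorum):
    """Read: find the latest (value, tag), write it back, then return the value."""
    value, tag = replicas.query(read_quorum)
    replicas.propagate(write_quorum, value, tag)
    return value


r = Replicas([1, 2, 3])
tag = write(r, i=1, new_value="x", read_quorum={1, 2}, write_quorum={2, 3})
result = read(r, read_quorum={1, 3}, write_quorum={1, 2})
```

The read succeeds even though its read-quorum `{1, 3}` differs from the writer's write-quorum `{2, 3}`: the two intersect at location 3, which is where the reader finds the new tag. The write-back in the read's second phase is what lets later reads see a value at least as recent.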

25 25 Dynamic Read/write Protocol Perform two-phase static protocol, using all current configurations. –Phase 1: Collect object values from read-quorums of current configurations. –Phase 2: Propagate latest value to write-quorums of current configurations. When new configuration is provided by Recon: –Start using it too. –Do not abort reads/writes in progress, but do extra work to access additional processes needed for new quorums. Our communication mechanism: –Background gossiping –Terminate by fixed-point condition, involving a quorum from each active configuration.
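
The termination rule above (a phase ends once a quorum from each active configuration has been heard from) can be sketched as a predicate over the set of responders. This is a deliberate simplification: the fixed-point condition in the actual algorithm also accounts for configurations learned during the phase, which is not modeled here. The name `phase_complete` and the dictionary layout of a configuration are assumptions introduced for illustration.

```python
def phase_complete(responders, active_configs, kind):
    """True if, for every active configuration, some quorum of the given kind
    ('read' or 'write') consists entirely of processes that have responded."""
    return all(
        any(set(q) <= set(responders) for q in cfg[kind])
        for cfg in active_configs
    )


old_cfg = {"read": [{1, 2}, {2, 3}], "write": [{1, 2, 3}]}
new_cfg = {"read": [{4, 5}], "write": [{4, 5}]}

# With only the old configuration active, responses from {1, 2} finish phase 1.
done_before = phase_complete({1, 2}, [old_cfg], "read")

# Once Recon provides new_cfg, the same operation must also reach one of its quorums.
done_after = phase_complete({1, 2}, [old_cfg, new_cfg], "read")
done_extra = phase_complete({1, 2, 4, 5}, [old_cfg, new_cfg], "read")
```

This mirrors the slide's point that an in-progress operation is not aborted when a new configuration arrives; it just has more processes to contact before the predicate holds.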

26 26 Removing Old Configurations Garbage-collect them in the background. Two-phase garbage-collection procedure: –Phase 1: Inform write-quorum of old configuration about the new configuration. Collect object values from read-quorum of the old configuration. –Phase 2: Propagate the latest value to a write-quorum of the new configuration. Garbage-collection concurrent with reads/writes. Implemented using gossiping and fixed points.

27 27 Implementation of Recon Uses distributed consensus to determine successive configurations 1, 2, 3,… Members of old configuration propose new configuration. Proposals reconciled using consensus. Consensus is a heavyweight mechanism, but: –Only used for reconfigurations, infrequent. –Does not delay read/write operations. [Figure: Consensus, Recon, Net.]

28 28 Consensus Implementation Use a variant of the timing-based Paxos algorithm [Lamport]. Agreement, validity guaranteed absolutely (independent of timing). Termination guaranteed when underlying system stabilizes. Leader chosen using failure detectors; conducts two-phase algorithm with retries. [Figure: Consensus with actions init(v), decide(v).]

29 29 Analysis We prove atomicity for arbitrary patterns of asynchrony and change, using partial order methods. Analyze performance conditionally, based on failure and timing assumptions. E.g., under reasonable “steady-state” assumptions: –Removing old configurations takes time at most 6d. –Reads and writes take time at most 8d. LAN implementation [Musial 02].

30 30 Other Approaches Use consensus to agree on total order of operations: [Lamport 89] –Not resilient to transient failures. –Termination of reads/writes depends on termination of consensus. Totally-ordered broadcast over group communication: [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96] –View formation takes a long time, delays reads/writes. –One change may trigger view formation.

31 31 III. Reliable Multicast Protocols [Livadas, Keidar, Lynch 01], [Livadas, Keidar 02], [Livadas, Lynch 02]

32 32 Physical System Model Infinite number of symmetric hosts (i.e., same resources, processes). Network of interconnected routers. Failures: fail-stop host crashes and packet drops. [Figure: routers r1 through r6 and hosts h1 through h6.]

33 33 Reliable Multicast Service (RMS) Overview: –Single reliable multicast group & single client process/host. –RM(Δ) encompasses behavior of all other processes on hosts and functionality of underlying network. –Parameter Δ bounds the reliable delivery delay. Membership: –A host becomes a member of the group upon the acknowledgment of its join request. –A host ceases to be a member upon issuing a leave request. [Figure: RM-Client 1 and RM-Client 2 interacting with RM(Δ) through actions rm-join 1, rm-join-ack 1, rm-send 1 (p), rm-recv 2 (p), rm-leave 1.]

34 34 Multicast Reliability: Properties Let h, s be hosts and p, p’ be packets from s such that p < p’. Eventual Delivery: If p’ remains active forever after its transmission, h delivers p, and h remains a member thereafter, then h delivers p’. Time-Bounded Delivery: Let T denote the time interval ranging from the transmission time t of p’ to the point in time Δ time units past t. If p’ remains active throughout T, h delivers p prior to the expiration of T, and h remains a member thereafter within T, then h delivers p’ within T.

35 35 Reliable Multicast Implementation (RMI) Scalable Reliable Multicast (SRM) [Floyd et al. 97] –Retransmission-based protocol using NACKs –Uses best-effort IP multicast as communication primitive Augment SRM so as to precisely specify: –when a host becomes a member of the group –which packets each member should attempt to recover

36 36 SRM’s Recovery Scheme Each host schedules a request for each missing packet. Any capable host schedules a reply to each such request. Duplicate requests/replies limited using deterministic and probabilistic suppression schemes. [Figure: hosts h, h’, s; packets rqst, repl.]
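
The slide does not spell out how requests and replies are scheduled. The sketch below assumes the commonly described SRM timer rule: a request timer drawn uniformly from [C1·d, (C1+C2)·d], where d is the host's estimated distance to the source, and a reply timer drawn uniformly from [D1·d, (D1+D2)·d], where d is the estimated distance to the requestor. The function names and the `first_requestor` helper are hypothetical and only illustrate how randomized timers let one host's request suppress the others.

```python
import random


def request_delay(d_to_source, c1, c2, rng=random):
    """Assumed SRM request timer: uniform on [c1*d, (c1+c2)*d]."""
    return rng.uniform(c1 * d_to_source, (c1 + c2) * d_to_source)


def reply_delay(d_to_requestor, d1, d2, rng=random):
    """Assumed SRM reply timer: uniform on [d1*d, (d1+d2)*d]."""
    return rng.uniform(d1 * d_to_requestor, (d1 + d2) * d_to_requestor)


def first_requestor(distances, c1, c2, rng=random):
    """Among hosts sharing a loss, return the host whose request timer fires
    first, plus all drawn delays. Hosts that hear that request before their own
    timer fires would suppress (back off) instead of sending a duplicate."""
    delays = {h: request_delay(d, c1, c2, rng) for h, d in distances.items()}
    return min(delays, key=delays.get), delays


rng = random.Random(0)  # seeded only to make the sketch repeatable
host, delays = first_requestor({"h": 1.0, "h2": 3.0}, c1=2.0, c2=2.0, rng=rng)
```

Scaling the interval by distance is the deterministic part of suppression (closer hosts tend to fire first); the uniform draw is the probabilistic part, separating hosts at similar distances.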

37 37 RMI Timed I/O Automaton Model [Figure: IP-mcast; RM-Client, RM-mem, RM-rep, RM-IPbuff, and RM-rec for hosts 1 and 2.]

38 38 Analysis of RMI Correctness Analysis: RMI implements RMS; i.e., RMI delivers appropriate packets to appropriate members of the reliable multicast group as dictated by RMS. Conditional Timeliness Analysis: Presuming no leaves, no crashes, bounded transmission latencies and latency estimates, bounded loss detection delays, and a fixed number k of packet drops per packet transmission/recovery, packets are guaranteed delivery within a particular delivery delay upper bound Δ(k).

39 39 Byproduct of RMI Timeliness Analysis Constraints on SRM scheduling parameters: –$C_3 < C_1$: back-off abstinence does not affect next round requests. –$D_1 + D_2 + 2 < 2C_1$: replies received prior to transmission of next round requests. –$D_1 + D_2 + D_3 < 2C_1$: requests not discarded due to prior round reply abstinence. Violating these guidelines may lead to superfluous traffic and unwarranted recovery round failure.

40 40 Caching-Enhanced SRM (CESRM) Enhance SRM with caching scheme –determines and caches optimal requestor/replier pair for each loss –expedites recovery of losses based on requestor/replier pair cache [Figure: hosts h, h’, s; packets exp-rqst, exp-repl.]

41 41 CESRM Timed I/O Automaton Model [Figure: IP-mcast and IP-ucast; RM-Client, RM-mem, RM-rep, RM-IPbuff, and RM-rec for hosts 1 and 2.]

42 42 CESRM: Conditional Timeliness Analysis Definition: A cache hit corresponds to a recovery scenario in which: –hosts that share the loss also share optimal requestor-replier pair, –the optimal requestor shares the loss, and –the optimal replier does not share the loss. Claim: For any execution in which no recovery packets are dropped, cache hits lead to packet recovery within at most $\text{DET-BOUND} + d_{\text{reorder-delay}} + 2d^{+}$, as opposed to $\text{DET-BOUND} + (C_1 + C_2)d^{+} + d^{+} + (D_1 + D_2)d^{+} + d^{+}$. For $C_1 = C_2 = D_1 = D_2 = 1$, the worst-case recovery delay following detection is reduced from ~3 RTT to ~1 RTT.
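
The "~3 RTT to ~1 RTT" figure can be reproduced with a few lines of arithmetic. The sketch below makes three assumptions that are not stated on the slide: the delay symbol in the bounds is read as a one-way latency bound `d`, one RTT is taken to be `2 * d`, and the reorder delay is treated as negligible next to an RTT. DET-BOUND is left out because the slide's comparison is of delay following detection. The function names are hypothetical.

```python
def srm_recovery_after_detection(d, c1, c2, d1, d2):
    """Worst case without a cache hit: request timer (c1+c2)*d, request transit d,
    reply timer (d1+d2)*d, reply transit d."""
    return (c1 + c2) * d + d + (d1 + d2) * d + d


def cesrm_recovery_after_detection(d, reorder_delay=0.0):
    """Cache hit: expedited request and reply, one transit each, after the reorder delay."""
    return reorder_delay + 2 * d


d = 1.0          # one-way latency bound, arbitrary unit
rtt = 2 * d      # assumption: round-trip time is twice the one-way bound

srm_rtts = srm_recovery_after_detection(d, 1, 1, 1, 1) / rtt
cesrm_rtts = cesrm_recovery_after_detection(d) / rtt
```

With all four scheduling parameters equal to 1, the SRM bound is 6d and the cache-hit bound is 2d, which is where the factor of three comes from.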

43 43 Estimating the Frequency of Cache Hits Analyzed 14 multicast transmission traces [Yajnik et al. 95/96]. On average, ~1/3 of losses recoverable by expedited recoveries. More precise identification of loss locations may lead to the recovery of ~1/2 of losses by expedited recoveries. [Figure: two panels, “Abstract loss location representation” and “Actual loss location representation”.]

44 44 IV. Modeling Framework To support all this analysis, we need a well-designed mathematical foundation, capable of: –Describing all the algorithms we want to consider. –Supporting compositional and conditional analysis. We use a framework based on interacting state machines. –Basic asynchronous model (I/O automata) –Augmented models: Timed, hybrid (continuous/discrete), probabilistic.

45 45 I/O Automata [Lynch, Tuttle 87] Nondeterministic, infinite-state automata –States, start states –Actions: Input, output, internal –Transitions (s,a,s’) –Executions, traces –A implements B if traces(A) ⊆ traces(B) Describing system modularity: –Parallel composition –Levels of abstraction Reasoning methods: –Invariant assertions –Simulation relations –Compositional methods Used to describe asynchronous distributed algorithms.
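
The implementation relation (trace inclusion) can be illustrated on small finite automata by enumerating traces up to a step bound. This is only a bounded check on toy examples, not a proof technique from the talk: real I/O automata may be infinite-state and are input-enabled, neither of which is modeled here. The tuple layout `(start states, transitions, external actions)` and the names `traces` and `implements_up_to` are assumptions. The bounded comparison is exact only when the specification has no internal actions, as in the example.

```python
def traces(automaton, max_steps):
    """External-action sequences of all executions with at most max_steps steps."""
    start, transitions, external = automaton
    seen = {()}
    frontier = {(s, ()) for s in start}
    for _ in range(max_steps):
        next_frontier = set()
        for state, trace in frontier:
            for src, action, dst in transitions:
                if src == state:
                    # Internal actions advance the state but leave the trace unchanged.
                    new_trace = trace + (action,) if action in external else trace
                    next_frontier.add((dst, new_trace))
                    seen.add(new_trace)
        frontier = next_frontier
    return seen


def implements_up_to(impl, spec, max_steps):
    """Bounded version of 'impl implements spec': trace inclusion up to max_steps."""
    return traces(impl, max_steps) <= traces(spec, max_steps)


EXTERNAL = {"send", "recv"}
# Specification: strictly alternating send / recv.
spec = ({0}, {(0, "send", 1), (1, "recv", 0)}, EXTERNAL)
# Implementation: same behavior with an internal "tick" between send and recv.
impl = ({0}, {(0, "send", 1), (1, "tick", 2), (2, "recv", 0)}, EXTERNAL)
# Faulty implementation: can deliver before anything was sent.
faulty = ({0}, {(0, "recv", 0)}, EXTERNAL)

good = implements_up_to(impl, spec, 4)
bad = implements_up_to(faulty, spec, 4)
```

The internal `tick` step is invisible in traces, which is why `impl` still implements `spec`; this is the same abstraction that lets a detailed algorithm be compared against a simple service specification.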

46 46 Timed I/O Automata (TIOA) [Merritt, Modugno, Tuttle], [Lynch, Vaandrager] Add time-passage actions. Used to describe: –Timeout-based algorithms. –Local clocks, clock synchronization. –Timing/performance characteristics.

47 47 Hybrid I/O Automata (HIOA) [Lynch, Segala, Vaandrager 01, 02] Automata with continuous and discrete transitions –States: Input, output, internal variables; start states –Actions: Input, output, internal –Discrete transitions (s,a,s’) –Trajectories τ, mapping time intervals to states –Execution τ₀ a₁ τ₁ a₂ τ₂ … –Trace: Project on external variables, external actions. –A implements B if traces(A) ⊆ traces(B). Composition, levels of abstraction. Invariants, simulation relations, compositional reasoning. Used to describe: –Controlled systems –Automated transportation systems –Embedded systems

48 48 Timed I/O Automata (TIOA), Revisited [Lynch, Segala, Vaandrager, Kirli] Have reformulated TIOA as a special case of HIOA: –No external variables: states consist of internal variables only. Use trajectories to describe time-passage, instead of time-passage actions. Monograph on modeling timed systems: –Theory –Analysis methods –Examples –Relationships with other timed models [Alur, Dill], [Merritt, Modugno, Tuttle], [Maler, Manna, Pnueli]

49 49 Probabilistic Automata (PIOA, PTIOA) [Segala 95], [Segala, Vaandrager, Lynch 02] Add probabilistic transitions (s,a,μ). Work in progress [Segala, Vaandrager, Lynch], [de Alfaro, Henzinger]: –External behavior notion. –Composition theorems. –Implementation relationships. Used to describe: –Probabilistic and nondeterministic behavior. –Randomized distributed algorithms –Systems with probabilistic assumptions

50 50 V. Plans for the Next Two Years

51 51 Plans: Distributed Algorithms Reconfigurable atomic memory –LAN implementation [Musial, Shvartsman] –More analysis: “Normal behavior” starting from some point –Algorithmic improvements: Concurrent garbage-collection [Gilbert] Reduced communication Better join protocol Faster reads –Extensions: “Leave” protocol Backup strategies for when configurations fail Support for choosing configurations

52 52 Plans: Distributed Algorithms Reliable multicast protocols [Livadas]: –Extend SRM analysis to handle nodes leaving and failing. –Finish CESRM analysis. –Analyze LMS protocol [Papadopoulos, Varghese 98]. Mobile systems: –Topology control [Hajiaghayi, Mirrokni] –Time synchronization –Tracking –Resource allocation –Data management Peer-to-peer systems [Lynch, Stoica]: –Location services that are provably fault-tolerant under reasonable steady-state assumptions. –Data management over location services

53 53 Plans: Semantic Framework Timed models: –Composition theorems for timing properties. –Structured TIOAs to support conditional performance analysis. –Relate TIOA to other models, e.g., reactive modules [Alur, Henzinger]. Probabilistic models: –Composition theorems [de Alfaro, Henzinger] Integrate timed and probabilistic models into one semantic framework.

