Distributed Systems 2006 Group Communication II * *With material adapted from Ken Birman.

Slides:



Advertisements
Similar presentations
1 Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create (
Advertisements

Global States.
Reliable Communication in the Presence of Failures Kenneth Birman, Thomas Joseph Cornell University, 1987 Julia Campbell 19 November 2003.
CS 542: Topics in Distributed Systems Diganta Goswami.
CS425 /CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.
CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2 Lecture 6 Instructor: Haifeng YU.
1 CS 194: Elections, Exclusion and Transactions Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer.
Time and Global States Part 3 ECEN5053 Software Engineering of Distributed Systems University of Colorado, Boulder.
Synchronization Chapter clock synchronization * 5.2 logical clocks * 5.3 global state * 5.4 election algorithm * 5.5 mutual exclusion * 5.6 distributed.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.
Distributed Systems 2006 Group Communication I * *With material adapted from Ken Birman.
Virtual Synchrony Jared Cantwell. Review Multicast Causal and total ordering Consistent Cuts Synchronized clocks Impossibility of consensus Distributed.
Distributed Systems Spring 2009
LEADER ELECTION CS Election Algorithms Many distributed algorithms need one process to act as coordinator – Doesn’t matter which process does the.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA.
CS 582 / CMPE 481 Distributed Systems
Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Reliable Communication for Highly Mobile Agents ECE 7995: Term Paper.
Synchronization Clock Synchronization Logical Clocks Global State Election Algorithms Mutual Exclusion.
Distributed Systems 2006 Retrofitting Reliability* *With material adapted from Ken Birman.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 13 Wenbing Zhao Department of Electrical and Computer Engineering.
Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
Distributed Systems 2006 Group Membership * *With material adapted from Ken Birman.
Distributed Systems 2006 Virtual Synchrony* *With material adapted from Ken Birman.
EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.
Clock Synchronization and algorithm
Ken Birman Cornell University. CS5410 Fall
Lecture 12 Synchronization. EECE 411: Design of Distributed Software Applications Summary so far … A distributed system is: a collection of independent.
Computer Science Lecture 12, page 1 CS677: Distributed OS Last Class Vector timestamps Global state –Distributed Snapshot Election algorithms.
Reliable Distributed Systems More on Group Communication.
Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport (1978) Presented by: Yoav Kantor.
CIS 720 Distributed algorithms. “Paint on the forehead” problem Each of you can see other’s forehead but not your own. I announce “some of you have paint.
CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Distributed Shared Memory Steve Ko Computer Sciences and Engineering University at Buffalo.
TOTEM: A FAULT-TOLERANT MULTICAST GROUP COMMUNICATION SYSTEM L. E. Moser, P. M. Melliar Smith, D. A. Agarwal, B. K. Budhia C. A. Lingley-Papadopoulos University.
Chapter 6-2 the TCP/IP Layers. The four layers of the TCP/IP model are listed in Table 6-2. The layers are The four layers of the TCP/IP model are listed.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Lamport’s Logical Clocks & Totally Ordered Multicasting.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Krzys Ostrowski: TA.
CS514: Intermediate Course in Operating Systems Professor Ken Birman Krzys Ostrowski: TA.
Replication (1). Topics r Why Replication? r System Model r Consistency Models – How do we reason about the consistency of the “global state”? m Data-centric.
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.
Lecture 11-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2010 Indranil Gupta (Indy) September 28, 2010 Lecture 11 Leader Election.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
CIS825 Lecture 2. Model Processors Communication medium.
D u k e S y s t e m s Asynchronous Replicated State Machines (Causal Multicast and All That) Jeff Chase Duke University.
Building Dependable Distributed Systems, Copyright Wenbing Zhao
CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Replication Steve Ko Computer Sciences and Engineering University at Buffalo.
Logical Clocks. Topics r Logical clocks r Totally-Ordered Multicasting.
Reliable Communication in the Presence of Failures Kenneth P. Birman and Thomas A. Joseph Presented by Gloria Chang.
CSE 486/586 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.
COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 6: Synchronyzation 3/5/20161 Distributed Systems - COMP 655.
Fault Tolerance (2). Topics r Reliable Group Communication.
Data Link Layer. Data link layer The communication between two machines that can directly communicate with each other. Basic property – If bit A is sent.
Mutual Exclusion Algorithms. Topics r Defining mutual exclusion r A centralized approach r A distributed approach r An approach assuming an organization.
CSE 486/586 CSE 486/586 Distributed Systems Leader Election Steve Ko Computer Sciences and Engineering University at Buffalo.
Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.
Group Communication A group is a collection of users sharing some common interest.Group-based activities are steadily increasing. There are many types.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
CSE 486/586 Distributed Systems Reliable Multicast --- 1
Replication & Fault Tolerance CONARD JAMES B. FARAON
CSE 486/586 Distributed Systems Leader Election
EEC 688/788 Secure and Dependable Computing
CSE 486/586 Distributed Systems Leader Election
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
CS514: Intermediate Course in Operating Systems
CSE 486/586 Distributed Systems Reliable Multicast --- 2
CSE 486/586 Distributed Systems Reliable Multicast --- 1
CSE 486/586 Distributed Systems Leader Election
Presentation transcript:

Distributed Systems 2006 Group Communication II * *With material adapted from Ken Birman

Distributed Systems Plan Tracking group membership: We’ll base it on 2PC and 3PC Fault-tolerant multicast: We’ll use membership Ordered multicast: We’ll base it on fault-tolerant multicast Tools for solving practical replication and availability problems: we’ll base them on ordered multicast Robust Web Services: We’ll build them with these tools 2PC and 3PC: Our first “tools” (lowest layer)

Distributed Systems Ordering: The missing element Our fault-tolerant protocol was –FIFO ordered Messages from a single sender are delivered in the order they were sent, even in the event of failure –View synchronous Everyone receives a given message in the same group view This is the protocol we called fbcast –We will look at this and others now

Distributed Systems Ordering properties: FIFO pqrspqrs ae

Distributed Systems Ordering properties: FIFO pqrspqrs a bcd e delivery of c to p is delayed until after b is delivered

Distributed Systems Implementing FIFO order Basic reliable multicast algorithm has this property –Without failures all we need is to run it on FIFO channels Like TCP, except “wired” to our GMS Or number multicasts –With failures need to be careful about the order in which things are done But only need to take care of this per sender Multithreaded applications –Must carefully use locking or order can be lost as soon as delivery occurs

Distributed Systems We identified other options cbcast –If cbcast(a)  cbcast(b) then deliver(a)  deliver(b) at common destinations abcast –Even if a and b are concurrent, deliver in some agreed order at common destinations gbcast –Deliver this message like a new group view: agreed order wrt. multicasts of all other flavors

Distributed Systems JGroups Let’s look at JGroups again –We saw GMS in action last time –Now let’s look at multicast with different orderings JGroup’s flexible protocol stack has a number of ordering possibilities –none –org.jgroups.protocols.CAUSAL –org.jgroups.protocols.TOTAL

Distributed Systems JGroups: Alphabet Example Stable process group Send parts of alphabet in sequence –Initiator multicasts {”A”, } –Receivers print ”A”, stores ”A” –Receiver with address multicasts {”B”, } –And so on... Which ordering do we want?

Distributed Systems JGroups: Alphabet Example

Distributed Systems JGroups: Alphabet Example

Distributed Systems JGroups: Alphabet Example

Distributed Systems How can we implement the orderings? First look at cbcast –Recall that this property was “like” fbcast –The issue concerns the meaning of a “single sender” With fbcast, a single sender is a single process With cbcast, we think about a single causal thread of events that can span many processes –For example: p asks q to send a, then asks r to send b –So a  b but a happens at q and b happens at r!

Distributed Systems Causally ordered updates Events occur on a “causal thread” but multicasts have different senders p r s t Perhaps p invoked a remote operation implemented by some other object here… The process corresponding to that object is “t” and, while doing the operation, it sent a multicast Now we’re back in process p. The remote operation has returned and p resumes computing T finishes whatever the operation involved and sends a response to the invoker. Now t waits for other requests T gets another request. This one came from p “indirectly” via s… but the idea is exactly the same. P is really running a single causal thread that weaves through the system, visiting various objects (and hence the processes that own them)

Distributed Systems How to implement it? Within a single group, the easiest option is to include a vector timestamp in the header of the message –Only increment the VT when sending –Send these “labeled” messages with fbcast Delay a received message if a causally prior message hasn’t been seen yet

Distributed Systems Causally ordered updates Example: messages from p and s arrive out of order at t p r s t VT(a) = [0,0,0,1] VT(b)=[1,0,0,1] VT(c) = [1,0,1,1] c is early: VT(c) = [1,0,1,1] but VT(t)=[0,0,0,1]: clearly we are missing one message from s When b arrives, we can deliver both it and message c, in order

Distributed Systems Causally ordered updates This works even with multiple causal threads. Concurrent messages might be delivered to different receivers in different orders –Example: green 4 and red 1 are concurrent p r s t

Distributed Systems Causally ordered updates Sorting based on vector timestamp In this run, everything can be delivered immediately on arrival p r s t [0,0,0,1] [1,0,0,1] [1,0,1,1] [1,1,1,1] [1,0,1,2][1,1,1,3] [2,1,1,3]

Distributed Systems Causally ordered updates Suppose p’s message [1,0,0,1] is “delayed” When t receives message [1,0,1,1], t can “see” that one message from p is late and can delay deliver of s’s message until p’s prior message arrives! p r s t [0,0,0,1] [1,0,0,1] [1,0,1,1] [1,1,1,1] [1,0,1,2][1,1,1,3] [2,1,1,3]

Distributed Systems Other uses for cbcast? The protocol is very helpful in systems that use locking for synchronization –Gaining a lock gives some process mutual exclusion –Then it can send updates to the locked variable or replicated data cbcast will maintain the update order –Since updates are causally related

Distributed Systems Cost of cbcast? This protocol is very cheap! It requires one phase to get the data from the sender to the receiver Receiver can deliver instantly –Same cost as an IP multicast or a set of UDP sends Imposes a small header and a small garbage collection overhead –Nobody is likely to notice! And we can often omit or compress the header

Distributed Systems Better and better Suppose some process sends a bunch of small updates using fbcast or cbcast –Pack them into a single bigger message –Benefit: message costs are dominated by the system call and almost unrelated to size, at least until we get big enough to require fragmentation! Can send hundreds of thousands of asynchronous updates per second in this mode!

Distributed Systems Causally ordered updates A bursty application p r s t Can pack into one large message and amortize overheads

Distributed Systems Snapshots with cbcast Send two rounds of cbcast –Round 1: “Start a snapshot” Receivers make a checkpoint And they start recording incoming messages Then say “OK” –Round 2: “Done” They send back their checkpoints and logs Thought question –Why does this give a consistent snapshot? –I.e., deliver(m) in snapshot => send(m) in snapshot? Assume Pk sends m to Pl outside snapshot –=> deliver(“Done”) at Pk is before send(m) at Pk –=> send(“Done”) is before send(m) at Pk (transitivity) –=> deliver(“Done”) at Pl is before deliver(m) at Pl (causal order) –Thus deliver(m) at PI is also outside snapshot

Distributed Systems What about abcast? abcast puts messages into a single agreed upon order even if two multicasts are sent concurrently –Contrast fbcast and cbcast that can deliver messages in different orders at different receivers –Notice that this disordered delivery wouldn’t matter in many cases Does this imply FIFO and/or causal delivery?

Distributed Systems Many options… Literature has at least a dozen abcast protocols, and some are causal too Easiest just uses a token –To send an abcast, either pass it to the token holder, or request the token –Token holder can increment a counter and put it in header of message Only need the counter if token can move… Delay a message until it can be delivered in order

Distributed Systems What about gbcast? This is a very costly protocol –Must be ordered wrt. all other event types, including fbcast, cbcast, abcast, view changes, other gbcasts –Used to change a security key or even modify the protocol stack at runtime Like changing the engines on a jet while it is flying! Not a common event Implement with a fusion of flush protocol and abcast –Requires at least two phases

Distributed Systems Life of a multicast The sender sends it… The protocol moves it to the right machines, deals with failures, puts it in order, finally delivers it –All of this is hidden from the real user Now the application “gets” the multicast and could send replies point-to-point

Distributed Systems Programming with multicasts Should we ask for replies? Synchronous versus asynchronous –A “synchronous” operation is RPC-like We need one or more replies from the processes that we invoke “When is the next operation of Patient X?” –An “asynchronous” operation is a multicast with no replies or feedback to the caller “Schedule a new operation for Patient X!”

Distributed Systems Should we ask for replies? Synchronous cases (one or more replies) won’t batch messages –Exception: sender could be multithreaded –But this is sort of rare since it is easier to work without concurrent threads unless you really have to Waiting for all replies is worst since slowest receiver limits the whole system –So speed is greatly reduced…

Distributed Systems Life of a multicast Asynchronous: sender doesn’t wait for replies Synchronous: sender does wait for replies Sender doesn’t pause Sender is waiting

Distributed Systems Asynchronous multicast: Pros and cons Asynchronous multicast allows higher speeds –The system can batch up multiple messages into one big message, as we saw earlier –And the sender won’t be limited by the speed of the network and the receivers This makes asynchronous multicast very popular in real systems But the sender can get “way ahead” and this can cause confusion if it then fails –Multicasts still in the channels can be lost

Distributed Systems Asynchronous confusion… From the outside a viewer might assume these were all delivered If a crash occurs, messages are delivered to all or none of the destinations My order is gone! OK, my order has been placed

Distributed Systems Remedies for confusion Insight is that these red multicasts were unstable –If we flush the channels and wait until they have been delivered (become stable), the issue is eliminated –Users find this easy to understand because file systems work the same way File I/O is asynchronous through the buffer pool… must use fsync to force writes to disk E.g., org.jgroups.protocols.FLUSH –Coordinator broadcasts flush message containing array of highest sequence number from members –Members answer with highest seq no + possibly missed messages –Coordinator re-broadcasts messages that may not have been seen by all

Distributed Systems Asynchronous confusion… Application invokes flush, but only when it is about to talk to the outside world Flush protocol runs here, pushes data through the channels

Distributed Systems Limits to asynchrony At any rate, most systems limit the number of asynchronous multicasts that are running simultaneously –Issue is that otherwise, sender can get arbitrarily far ahead of receivers –A few messages is one thing… millions is another –So most systems allow a few asynchronous messages at a time, but then force new multicasts to wait for some old ones to finish Very similar to TCP window idea –Congestion control –Limits the amount of data a sender can send before acknowledgments from receiver

Distributed Systems Picking between synchronous and asynchronous multicast With synchronous multicast we can “ask” the receivers to do something –Please search the telephone book –With k members at the time of reception, the group member i searches the i’th part of the book (dividing it into k parts) –Each reply has 1/k th of the answer! But we need to wait for the answers –This is a shame if we didn’t actually need answers

Distributed Systems A range of synchrony levels A platform usually offers multiple options –Wait for k replies, for some specified k  0 Waiting for no replies: asynchronous –Wait for “all” to reply When we say “all”: –This means “one reply from each member in the view at the time of delivery” –If someone gets the message but then fails, obviously, we should stop waiting for a reply….

Distributed Systems JGroups: More building blocks PullPushAdapter –Saw this last time –Converts pull interface of Channels to push interface –Can register MembershipListeners and MessageListeners with it MessageDispatcher –Send or cast a request to members public RspList MessageDispatcher.castMessage ( Vector dests, Message msg, int mode, int timeout ) –Respond public Object RequestHandler.handle( Message msg ) –Synchronous dispatch GroupRequest.GET_FIRST GroupRequest.GET_ALL GroupRequest.GET_MAJORITY... –Asynchronous dispatch GroupRequest.GET_NONE

Distributed Systems JGroups: More building blocks RpcDispatcher –Builds on MessageDispatcher –Invoke methods on members, possibly waiting for results –RspList RpcDispatcher.callRemoteMethods( java.util.Vector dests, java.lang.String method_name, java.lang.Object[] args, java.lang.Class[] types, int mode, long timeout ) –Installed at client and server –Uses reflection to invoke appropriate method on server-side

Distributed Systems JGroups: More building blocks DistributedHashtable –Extends java.util.Hashtable, uses RpcDispatcher –Overrides put, get, clear,... to replicate hashtable state among process group members –”Build a distributed naming and registration service in a couple of lines” And –DistributedTree, DistributedQueue,... –TwoPhaseVotingAdapter, VotingAdapter,...

Distributed Systems Summary We’ve got a range of ordered multicast primitives –Two (fbcast, cbcast) have low cost –Two (abcast, gbcast) are more ordered but more costly And we can use them asynchronously or synchronously Tracking group membership: We’ll base it on 2PC and 3PC Fault-tolerant multicast: We’ll use membership Ordered multicast: We’ll base it on fault-tolerant multicast Tools for solving practical replication and availability problems: we’ll base them on ordered multicast Robust Web Services: We’ll build them with these tools 2PC and 3PC: Our first “tools” (lowest layer)