Ken Birman, Cornell University. CS5410, Fall 2008.

Monday: Designed an Oracle
We used a state machine protocol to maintain consensus on events.
We structured the resulting system as a tree in which each node is a group of replicas.
The result is a very general management service, one role of which is to manage membership when an application needs replicated data.
Today: we continue to flesh out this idea of a group communication abstraction in support of replication.

Turning the GMS into the Oracle
[Figure: processes p, q and r run the GMS protocol. Starting from view V0 = {p,q,r}, p proposes view V1 = {p,r}, the others reply OK, and V1 is committed.]
Here, three replicas cooperate to implement the GMS as a fault-tolerant state machine. Each client platform binds to some representative, then rebinds to a different replica if that one later crashes.

Tree of state machines
Each node "owns" a subset of the logs.
(1) Clients send events to the Oracle. (2) Each event is appended to the appropriate log. (3) The event is then reported to processes subscribed to that log.

Use scenario: consistent TCP
Application A wants to connect with B via "consistent TCP."
A and B register with the Oracle; each has an event channel of its own, like /status/biscuit.cs.cornell.edu/pid=12421.
Each subscribes to the channel of the other. (If the connection to the Oracle breaks, just reconnect to some other Oracle member and ask it to resume where the old one left off.)
They break the TCP connection if (and only if) the Oracle tells them to do so.

Use scenario: locking
A lock is "named" by a path /x/y/z…
Processes send "lock request" and "unlock" messages.
Everyone sees the same sequence of lock and unlock messages… so everyone knows who has the lock (see the sketch below).
Garbage collection? Truncate the log prefix once the lock has been granted.
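
A minimal sketch of this idea in Python, assuming the Oracle delivers the same event sequence to every subscriber; the class, method and event names here are hypothetical, not part of any real Oracle API:

    from collections import deque

    class LockReplica:
        """Applies the agreed-upon event sequence for one lock path, e.g. /x/y/z."""
        def __init__(self):
            self.holder = None        # process currently holding the lock
            self.waiting = deque()    # FIFO queue of pending requesters

        def apply(self, event, sender):
            # Every replica sees the same (event, sender) sequence from the Oracle,
            # so every replica computes the same holder and waiting queue.
            if event == "lock_request":
                if self.holder is None:
                    self.holder = sender
                else:
                    self.waiting.append(sender)
            elif event == "unlock" and sender == self.holder:
                self.holder = self.waiting.popleft() if self.waiting else None

Because each replica applies the identical sequence, all replicas agree on the holder; once a grant is known to everyone, the earlier log prefix can be truncated.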

Use scenario: tracking group membership
A group is "named" by a path /x/y/z…
Processes send "join request" and "leave" messages; failures are reported as a "forced leave."
Everyone sees the same sequence of join and leave messages… so everyone knows the group view.
Garbage collection? Truncate old view-related events.

A primitive "pub/sub" system
The Oracle is very simple but quite powerful.
Everyone sees what appears to be a single, highly available source of reliable "events."
XML strings can encode all sorts of event data, and library interfaces can be customized to offer various abstractions.
It is too slow for high-rate events (although the Spread system works that way).
But think of the Oracle as a bootstrapping tool that helps groups implement their own direct, peer-to-peer protocols in a nicer world than if they didn't have it.

Building group multicast
Any group can use the Oracle to track membership, enabling reliable multicast!
Protocol: send an unreliable multicast to the current members, then use ACK/NAK to ensure that all of them receive it (see the sketch below).
[Figure: a multicast from one sender reaching processes p, q, r and s.]
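
A minimal sketch of this protocol, assuming an unreliable point-to-point send primitive, a way to poll for ACK/NAK replies, and the membership list obtained from the Oracle; all of these names are hypothetical:

    def reliable_multicast(members, msg, send, recv_acks, max_rounds=5):
        """Best-effort multicast to the current view, then ACK/NAK-driven retransmission."""
        for p in members:
            send(p, ("DATA", msg))               # unreliable multicast to current members
        pending = set(members)                   # members not yet known to have the message
        for _ in range(max_rounds):
            for sender, kind in recv_acks(timeout=0.1):
                if kind == "ACK":
                    pending.discard(sender)      # this member has the message
                elif kind == "NAK":
                    send(sender, ("DATA", msg))  # retransmit on a negative acknowledgment
            if not pending:
                return True                      # everyone in the view has received it
            for p in pending:
                send(p, ("DATA", msg))           # re-send to members that stayed silent
        return False                             # give up; a view change will handle crashes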

Concerns if the sender crashes
Perhaps it sent some message and only one process has seen it.
We would prefer to ensure that all receivers in the "current view" receive any message that any receiver receives (unless the sender and all receivers crash, erasing the evidence…).

An interrupted multicast
A message from q to r was "dropped." Since q has crashed, it won't be resent.
[Figure: timelines for processes p, q, r and s.]

Flush protocol
We say that a message is unstable if some receiver has it but (perhaps) others don't. For example, q's message is unstable at process r.
If q fails we want to "flush" unstable messages out of the system.

How to do this? Easy solution: all-to-all echo.
When a new view is reported, all processes echo any unstable messages on all channels on which they haven't received a copy of those messages: a flurry of O(n²) messages (see the sketch below).
Note: we must do this for all unstable messages, not just those from the failed process, because more failures could happen in the future.
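
A minimal sketch of that echo step, assuming each process tracks which peers are known to have each message; the data structures and the send primitive are hypothetical:

    def flush_on_view_change(my_msgs, acked_by, new_view, send, my_id):
        """All-to-all echo of unstable messages when a new view is reported.

        my_msgs:  {msg_id: msg}   messages this process has received or sent
        acked_by: {msg_id: set}   peers known to already have each message
        """
        for msg_id, msg in my_msgs.items():
            holders = acked_by.get(msg_id, set())
            if holders >= set(new_view):
                continue                              # stable: every member already has it
            for p in new_view:
                if p != my_id and p not in holders:
                    send(p, ("ECHO", msg_id, msg))    # echo on channels that may lack it

Once every member has finished echoing (and acknowledging), the unstable messages have become stable and the new view can be reported.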

An interrupted multicast
p had an unstable message, so it echoed it when it saw the new view.
[Figure: timelines for processes p, q, r and s.]

Event ordering
We should first deliver the multicasts to the application layer and then report the new view.
This way all replicas see the same messages delivered "in" the same view. Some call this "view synchrony."
[Figure: timelines for processes p, q, r and s.]

State transfer
At the instant the new view is reported, a process already in the group makes a checkpoint and sends it point-to-point to the new member(s).
The new member(s) initialize from the checkpoint.

State transfer and reliable multicast
After re-ordering, it looks as if each multicast is reliably delivered in the same view at each receiver.
Note: if the sender and all receivers fail, an unstable message can be "erased" even after delivery to an application. This is a price we pay to gain higher speed.
[Figure: timelines for processes p, q, r and s.]

State transfer
A new view is initiated; it adds a process.
We run the flush protocol, but as it ends, some existing process creates a checkpoint of the group: only state specific to the group, not ALL of its state. Keep in mind that one application might be in many groups at the same time, each with its own state.
Transfer this checkpoint to the joining member. It loads it to initialize the state of its instance of the group – that object. One state transfer per group (see the sketch below).
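
A minimal sketch of this hand-off, where checkpoint() and install() are hypothetical callbacks supplied by the application object for one group, and "the oldest surviving member does the transfer" is just one possible rule:

    def on_view_change(old_view, new_view, my_id, group_state, send, recv):
        """Run after the flush protocol completes for this group."""
        joiners = [p for p in new_view if p not in old_view]
        if joiners and my_id == min(old_view):      # one existing member, e.g. the oldest
            snapshot = group_state.checkpoint()     # state specific to this group only
            for p in joiners:
                send(p, ("STATE", snapshot))        # point-to-point transfer to each joiner
        elif my_id in joiners:
            _, snapshot = recv()                    # wait for the checkpoint
            group_state.install(snapshot)           # initialize this instance of the group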

Ordering: the missing element
Our fault-tolerant protocol was FIFO ordered: messages from a single sender are delivered in the order they were sent, even if someone crashes.
It was also view synchronous: everyone receives a given message in the same group view.
This is the protocol we called fbcast.

Other options
cbcast: if cbcast(a) → cbcast(b) (a causally precedes b), deliver a before b at common destinations.
abcast: even if a and b are concurrent, deliver them in some agreed order at common destinations.
gbcast: deliver this message like a new group view, in an agreed order with respect to multicasts of all other flavors.

Single updater
If p is the only update source, the need is a bit like TCP's "FIFO" ordering; fbcast is a good choice for this case.
[Figure: p sends updates 1, 2, 3, 4 to r, s and t.]

Causally ordered updates
Events occur on a "causal thread," but the multicasts have different senders.
[Figure: timelines for processes p, r, s and t.]

Causally ordered updates
Events occur on a "causal thread," but the multicasts have different senders.
[Figure: timelines for processes p, r, s and t.]
Perhaps p invoked a remote operation implemented by some other object here… The process corresponding to that object is "t" and, while doing the operation, it sent a multicast.
t finishes whatever the operation involved and sends a response to the invoker; now t waits for other requests.
Now we're back in process p: the remote operation has returned and p resumes computing.
Later t gets another request. This one came from p "indirectly" via s… but the idea is exactly the same.
p is really running a single causal thread that weaves through the system, visiting various objects (and hence the processes that own them).

How to implement it?
Within a single group, the easiest option is to include a vector timestamp in the header of the message: an array of counters, one per group member.
Increment your personal counter when sending, and send these "labeled" messages with fbcast.
Delay a received message if a causally prior message hasn't been seen yet (see the sketch below).
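
A minimal sketch of this vector-timestamp rule, assuming fixed member indices and an underlying fbcast for transport; the class and parameter names are illustrative, not from any particular library:

    class CausalEndpoint:
        def __init__(self, members, my_index):
            self.members = members
            self.me = my_index
            self.vt = [0] * len(members)       # one counter per group member
            self.delayed = []                  # messages waiting for causal predecessors

        def send(self, payload, fbcast):
            self.vt[self.me] += 1              # increment own counter when sending
            fbcast((list(self.vt), self.me, payload))

        def _deliverable(self, vt, sender):
            # Next expected message from this sender, and nothing causally prior missing.
            return (vt[sender] == self.vt[sender] + 1 and
                    all(vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender))

        def receive(self, msg, deliver):
            self.delayed.append(msg)
            progress = True
            while progress:                    # deliver everything that has become enabled
                progress = False
                for m in list(self.delayed):
                    vt, sender, payload = m
                    if self._deliverable(vt, sender):
                        self.delayed.remove(m)
                        deliver(payload)
                        self.vt[sender] = vt[sender]
                        progress = True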

Causally ordered updates
Example: messages from p and s arrive out of order at t.
[Figure: VT(a) = [0,0,0,1], VT(b) = [1,0,0,1], VT(c) = [1,0,1,1].]
c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1]; clearly we are missing one message from p.
When b arrives, we can deliver both it and message c, in order.

Causally ordered updates
This works even with multiple causal threads. Concurrent messages might be delivered to different receivers in different orders.
Example (in the figure): the green message 4 and the red message 1 are concurrent.
[Figure: timelines for processes p, r, s and t.]

Causally ordered updates
Sorting is based on the vector timestamps. In this run, everything can be delivered immediately on arrival.
[Figure: timelines for p, r, s and t with message timestamps [0,0,0,1], [1,0,0,1], [1,0,1,1], [1,1,1,1], [1,0,1,2], [1,1,1,3], [2,1,1,3].]

Causally ordered updates
Suppose p's message [1,0,0,1] is "delayed."
When t receives message [1,0,1,1], t can "see" that one message from p is late and can delay delivery of s's message until p's prior message arrives!
[Figure: same run as above, with message [1,0,0,1] arriving late at t.]
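
Using the CausalEndpoint sketch above, and assuming the member order [p, r, s, t], this scenario plays out as follows:

    # t has index 3 and has already sent message a, so its clock is [0,0,0,1].
    t = CausalEndpoint(members=["p", "r", "s", "t"], my_index=3)
    t.vt = [0, 0, 0, 1]

    delivered = []
    # s's message c = [1,0,1,1] arrives first: the check fails on p's entry (1 > 0),
    # so c is buffered rather than delivered.
    t.receive(([1, 0, 1, 1], 2, "c"), delivered.append)
    assert delivered == []

    # When p's message b = [1,0,0,1] finally arrives, b and then c are delivered in order.
    t.receive(([1, 0, 0, 1], 0, "b"), delivered.append)
    assert delivered == ["b", "c"]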

Other uses for cbcast?
The protocol is very helpful in systems that use locking for synchronization.
Gaining a lock gives some process mutual exclusion; it can then send updates to the locked variable or replicated data, and cbcast will maintain the update order.

Causally ordered updates
A bursty application can pack many updates into one large message and amortize the overheads.
[Figure: timelines for processes p, r, s and t.]

Other forms of ordering
Abcast (total or "atomic" ordering): basically, our locking protocol solved this problem. It can also be done with fbcast by having a token holder send out the ordering to use (see the sketch below).
Gbcast: provides applications with access to the same protocol used when extending the group view; basically identical to "Paxos" with a leader.
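
A minimal sketch of the token-holder idea: senders fbcast their messages, the token holder fbcasts the order in which to deliver them, and each receiver delivers accordingly. The class and message names are illustrative assumptions:

    class TokenOrderedReceiver:
        """Total order via a sequencer: hold messages until the token holder names them."""
        def __init__(self):
            self.pending = {}        # msg_id -> payload, received but not yet ordered
            self.order = []          # msg_ids announced by the token holder, not yet delivered

        def on_data(self, msg_id, payload, deliver):
            self.pending[msg_id] = payload
            self._drain(deliver)

        def on_order(self, msg_ids, deliver):
            # The token holder fbcasts the agreed delivery order for a batch of messages.
            self.order.extend(msg_ids)
            self._drain(deliver)

        def _drain(self, deliver):
            # Deliver in the announced order, pausing if a named message hasn't arrived yet.
            while self.order and self.order[0] in self.pending:
                deliver(self.pending.pop(self.order.pop(0)))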

Algorithms that use multicast
Locked access to shared data: multicast updates, read any local copy. This is very efficient… 100k updates/second is not unusual.
Parallel search
Fault-tolerance (primary/backup, coordinator-cohort)
Publish/subscribe
Shared "work to do" tuples
Secure replicated keys
Coordinated actions that require a leader

Modern high-visibility examples
Google's Chubby service (uses Paxos == gbcast)
Yahoo! ZooKeeper
Microsoft cluster management technology for Windows Enterprise clusters
IBM DCS platform and WebSphere
Basically: technology like this is all around us, although often hidden inside some other kind of system.

Summary
Last week we looked at two notions of time; logical time is more relevant here.
Notice the similarity between delivery of an ordered multicast and computing something on a consistent cut.
We're starting to think of "consistency" and "replication" in terms of events that occur along time-ordered event histories.
The GMS (Oracle) tracks "management" events; the group communication system supports much higher data-rate replication.