Reliable Distributed Systems Membership

Agreement on Membership. Recall our approach: detecting failure is a lost cause; too many things can mimic failure, and to be accurate we would end up waiting for a process to recover. Instead, substitute agreement on membership: now we can drop a process simply because it isn't fast enough. This can seem "arbitrary", e.g. A kills B... The GMS implements this service for everyone else.

Architecture. Membership agreement handles "join/leave" requests and reports such as "P seems to be unresponsive". 2PC-like protocols use membership changes instead of failure notifications. Applications use replicated data for high availability.
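
To make the view idea concrete, here is a minimal sketch (not from the original slides) of a view as a numbered, immutable snapshot of the member set, with application processes registering as listeners of the GMS; the names View, GMS and install are illustrative, not a real API.

    # Sketch only: a membership view and a listener-style GMS interface.
    from dataclasses import dataclass
    from typing import Callable, FrozenSet, List

    @dataclass(frozen=True)
    class View:
        view_id: int                  # views are totally ordered: 0, 1, 2, ...
        members: FrozenSet[str]       # process identifiers in this view

    class GMS:
        def __init__(self, initial_members):
            self.current = View(0, frozenset(initial_members))
            self.listeners: List[Callable[[View], None]] = []

        def register(self, listener):
            """Applications subscribe to the single system-wide view sequence."""
            self.listeners.append(listener)

        def install(self, members):
            """Install the next view and report it to every listener."""
            self.current = View(self.current.view_id + 1, frozenset(members))
            for notify in self.listeners:
                notify(self.current)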

Architecture. [Figure: the GMS processes maintain the sequence of membership views {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C} as processes join, leave, or are dropped after "A seems to have failed", and report each view to the application processes.]

Contrast dynamic with static model. Static model: a fixed set of processes "tied" to resources; processes may be unreachable (while failed or partitioned away) but later recover. Think "cluster of PCs". Dynamic model: a changing set of processes launched while the system runs, some of which fail or terminate. Failed processes never recover (a partitioned process may reconnect, but uses a new pid), yet a process can still own a physical resource, allowing us to emulate a static model.

Consistency options. We could require that the system always be consistent with actions taken at a process, even if that process fails immediately after taking the action. This property is needed in systems that take external actions, like advising an air traffic controller, but may not be needed in high-availability systems. The alternative is to require that the operational part of the system remain continuously self-consistent.

Obstacles to progress. The Fischer, Lynch and Paterson (FLP) result proves that agreement protocols cannot be both externally consistent and live in asynchronous environments. This suggests that the choice between internal consistency and external consistency is a fundamental one! One can show that this result also applies to dynamic membership problems.

Usual response to FLP: Chandra/Toueg. Consider the system as having a failure detector that provides input to the basic system itself. Agreement protocols within the system are considered safe and live if they satisfy their safety properties and are live whenever the failure detector is live. Babaoglu expresses a similar result in terms of reachability of processes: protocols are live during periods of reachability.

Towards an Alternative. In this lecture we focus on systems with self-defined membership. The idea is that if p can't talk to q, it will initiate a membership change that removes q from p's "membership view" of the system. Illustrated on the next slide.

Commit protocol from when we discussed transactions. [Figure: a commit protocol run in which the coordinator asks "ok to commit?", some participants vote "ok", and a failure leaves some processes with "decision unknown!" and others with "vote unknown!".]

Suppose this is a partitioning failure. [Figure: the same run, with the blocked "decision unknown!" / "vote unknown!" processes partitioned away.] Do these processes actually need to be consistent with the others?

Primary partition concept. The idea is to identify the notion of "the system" with a unique component of the partitioned system. Call this distinguished component the "primary" partition of the system as a whole. The primary partition can speak with authority for the system as a whole; non-primary partitions have weaker consistency guarantees and limited ability to initiate new actions.

Ricciardi: Group Membership Protocol. For use in a group membership service (usually just a few processes that run on behalf of the whole system). The service tracks its own membership; its members use this to maintain the membership list for the whole system. All users of the service see subsequences of a single system-wide group membership history. The GMS also tracks the primary partition.

GMP protocol itself. Used only to track membership of the "core" GMS. Designates one GMS member as the coordinator and switches between 2PC and 3PC: 2PC if the coordinator didn't fail and other members failed or are joining; 3PC if the coordinator failed and some other member is taking over as the new coordinator. Question: how do we avoid "logical partitioning"?

GMS majority requirement. To move from system "view" i to view i+1, the GMS requires explicit acknowledgement by a majority of the processes in view i. Failing to obtain a majority causes the GMS to lose its primaryness. Dahlia Malkhi has extended the GMP to support partitioning and remerging; a similar idea is used by Yair Amir and others in the Totem system.
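
A minimal sketch of the majority rule, assuming the only test is that a majority of view i acknowledges the proposed view i+1; the helper names are made up for illustration.

    # Sketch of the primary-partition majority rule: view i+1 may only be
    # installed once a majority of the members of view i have acknowledged it.

    def is_majority(acks, old_view_members):
        return len(set(acks) & set(old_view_members)) > len(old_view_members) // 2

    def try_install(old_view_members, proposed_members, acks):
        """Return the new membership, or None if primaryness is lost."""
        if is_majority(acks, old_view_members):
            return set(proposed_members)
        return None   # cannot prove we are still the primary partition

    # Example: view i = {A,B,C,D,E}; only A and B acknowledge -> no new view.
    print(try_install({"A", "B", "C", "D", "E"}, {"A", "B"}, ["A", "B"]))            # None
    print(try_install({"A", "B", "C", "D", "E"}, {"A", "B", "C"}, ["A", "B", "C"]))  # {'A','B','C'}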

GMS in Action. [Figure: processes p0 ... p5.] p0 is the initial coordinator. p1 and p2 join, then p3 ... p5 join. But p0 fails during the join protocol, and later so does p3. Notice the use of majority consent to avoid partitioning!

GMS in Action. [Figure: the same run, protocol by protocol: p0 is coordinator and runs 2-phase commits; after p0 fails, p1 takes over with a 3-phase commit and, as the new coordinator, continues with 2-phase commits.]

What if the system has thousands of processes? The idea is to build a GMS subsystem that runs on just a few nodes. The GMS members track themselves; other processes ask to be admitted to the system or for faulty processes to be excluded. The GMS treats overall system membership as a form of replicated data that it manages and reports to its "listeners".

Uses of membership? If we rewire TCP and RPC to use membership changes as the trigger for breaking connections, we can eliminate split-brain problems! But nobody really does this; the problem is that networks lack standard GMS subsystems today. We can still use the idea ourselves.

Replicated data within groups. A very general requirement: data actually managed by the group; inputs and outputs in a server replicated for fault-tolerance; coordination and synchronization data. We will see how to solve this, and then use the solution to implement "process groups", which are subgroups of the overall system membership.

Replicated data. Assume that we have a (dynamically defined) group of processes G and that its members manage a replicated data item. Goal: update by sending a multicast to G, while being able to safely read any copy "locally". Consider the situation where members of G may fail or recover.
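
As a toy sketch of the intended interface (assuming some group multicast primitive of the kind developed later in the lecture), a replicated variable might look as follows; Group, register_handler and multicast are placeholders, not a real library.

    # Toy replicated variable: writes go out as a multicast to the group,
    # reads are served from the local copy.

    class ReplicatedValue:
        def __init__(self, group, initial=None):
            self.group = group
            self.value = initial
            group.register_handler(self._apply)        # deliver updates to every replica

        def update(self, new_value):
            self.group.multicast(("set", new_value))   # one multicast to G

        def read(self):
            return self.value                          # safe local read

        def _apply(self, msg):
            op, new_value = msg
            if op == "set":
                self.value = new_value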

Some Initial Assumptions. For now, assume that we work directly on the real network, not using Ricciardi's GMS. Later we will need to add the GMS to solve a problem this raises, but for now the model is the very simple one: processes that communicate using messages, an asynchronous network, crash failures. We'll also need our own implementation of TCP-style reliable point-to-point channels that takes the GMS as input.

Process group model. Initially, we'll assume we are simply given the model; later we will see that we can use reliable multicast to implement it. First approximation: a process group is defined by a series of "views" of its membership. All members see the same sequence of view changes, and failures and joins are reported by changing the membership.

Process groups with joins, failures. [Figure: processes p, q, r, s, t and the view sequence G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}: r and s request to join and are added with a state transfer; p crashes; t requests to join and is added with a state transfer.]

State transfer. A method for passing information about the state of a group to a joining member. It looks instantaneous, taking effect at the time the member is added to the view.
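
One possible sketch, under the assumption that a single existing member checkpoints its state exactly at the view change and ships it to the joiner; all names are illustrative.

    # Sketch: when the view that adds a joiner is installed, one existing member
    # checkpoints the replicated state at that point in the delivery order and
    # sends it to the joiner, which initializes from it before newer messages.

    def on_new_view(new_view, old_view, my_id, state, send):
        joiners = new_view.members - old_view.members
        if joiners and my_id == min(old_view.members):   # pick one member, e.g. lowest id
            checkpoint = dict(state)                     # state "as of" the view change
            for j in joiners:
                send(j, ("state_transfer", new_view.view_id, checkpoint))

    def on_state_transfer(msg, state):
        _, view_id, checkpoint = msg
        state.clear()
        state.update(checkpoint)   # joiner now holds the group state for this view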

Outline of treatment. First, look at reliability and failure atomicity. Next, look at options for "ordering" in group multicast. Then discuss the implementation of the group view mechanisms themselves. Finally, return to state transfer. Outcome: process groups, group communication, state transfer, and fault-tolerance properties.

Atomic delivery. Atomic, or failure-atomic, delivery: if any process receives the message and remains operational, all operational destinations receive it. [Figure: processes p, q, r, s; every process that receives multicast a subsequently fails, while all processes receive multicast b.]

Additional properties. A multicast is dynamically uniform if: whenever any process delivers the multicast, all group members that don't fail will deliver it (even if the initial recipient fails immediately after delivery). Otherwise we say that the multicast is "not uniform".

Uniform and non-uniform delivery. [Figure: two runs over processes p, q, r, s contrasting uniform delivery of a and b despite a failure with non-uniform delivery of a, where only a process that subsequently fails delivers a.]

Stronger properties cost more. Weaker ordering guarantees are cheaper than stronger ones. Non-uniform delivery is cheap; dynamic uniformity is costly. Dynamic membership is cheap; static membership is more costly.

Conceptual cost graph. [Figure: a cost spectrum running from less ordered through local total order to global total order, and from non-uniform dynamic groups to uniform static groups. At the cheap end, asynchronous, non-uniform "cbcast" to a dynamically defined group: in Horus, about 85,000 multicasts/second with 85 µs sender-to-destination latency. At the expensive end, uniform, globally totally ordered ("safe") "abcast" in a static group: in Totem or Transis, about 600/second with 750 ms sender-to-destination latency.]

Implementing multicast primitives. Initially assume a static process group. Crash failures are permanent: a process fails by crashing, undetectably. No GMS (at first). Unreliable communication: messages can be lost in the channels... this looks like the asynchronous model of FLP.

Failures? Message loss: overcome with retransmission. Process failures: assume processes "crash" silently. Network failures: also called "partitioning". We can't distinguish between these cases! [Figure: when the network partitions, p and q each time out on the other ("q failed!", "p failed!").]

Multicast by "flooding". All recipients echo the message to all other recipients, so O(n²) messages are exchanged. Duplicates are rejected on the basis of the message id. When can we garbage collect the id? [Figure, shown over several slides: processes p, q, r, s flooding multicast a, with various processes failing along the way.]
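
A compact sketch of flooding with duplicate suppression by message id (assumed unique per multicast); the send transport and all names are placeholders.

    # Sketch of multicast by flooding: every recipient re-echoes the message to
    # all other members and suppresses duplicates by message id. O(n^2) messages
    # for a group of n.

    class FloodingMulticast:
        def __init__(self, my_id, members, send):
            self.my_id, self.members, self.send = my_id, members, send
            self.seen = set()            # ids we must remember to reject duplicates

        def multicast(self, msg_id, payload):
            self._flood(msg_id, payload)

        def on_receive(self, msg_id, payload, deliver):
            if msg_id in self.seen:
                return                   # duplicate: drop it
            self.seen.add(msg_id)        # when can this entry be garbage collected?
            deliver(payload)
            self._flood(msg_id, payload) # echo so others get it even if the sender crashed

        def _flood(self, msg_id, payload):
            self.seen.add(msg_id)
            for p in self.members:
                if p != self.my_id:
                    self.send(p, (msg_id, payload))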

Garbage collection issue. We must remember the id as long as we might still see a duplicate copy. If no process fails, we can garbage collect after the message has been echoed by all destinations. This is very similar to the 3PC protocol... the correctness of this protocol depends upon having an accurate way to detect failure! We return to this point in a few minutes.

"Lazy" flooding and garbage collection. The idea is to delay "non-urgent" messages: recipients delay the echo in the hope that the sender will confirm successful delivery, giving O(n) messages. Notice that garbage collection occurs in the third phase. [Figure, shown over several slides: the sender multicasts a, collects acks, announces "all got it...", and finally "garbage collect".]
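
An illustrative sketch of the lazy sender, assuming reliable point-to-point sends; the three phases (message, "all got it", "garbage collect") follow the slides, but the method and message names are made up.

    # Sketch of "lazy" flooding from the sender's side: recipients ack instead of
    # echoing; the sender announces "all got it", and only after that round is
    # acknowledged does it tell everyone to garbage collect the id. O(n) per phase
    # when nothing fails. If the sender is suspected, recipients fall back to
    # flooding (not shown).

    class LazyMulticastSender:
        def __init__(self, members, send):
            self.members, self.send = members, send
            self.phase1_acks = {}   # msg_id -> who acked the message itself
            self.phase2_acks = {}   # msg_id -> who acked "all got it"

        def multicast(self, msg_id, payload):          # phase 1: send to all
            self.phase1_acks[msg_id] = set()
            self.phase2_acks[msg_id] = set()
            for p in self.members:
                self.send(p, ("msg", msg_id, payload))

        def on_ack(self, msg_id, from_p):              # acks of phase 1
            self.phase1_acks[msg_id].add(from_p)
            if self.phase1_acks[msg_id] == set(self.members):
                for p in self.members:                 # phase 2: "all got it"
                    self.send(p, ("all_got_it", msg_id))

        def on_all_got_it_ack(self, msg_id, from_p):   # (background) acks of phase 2
            self.phase2_acks[msg_id].add(from_p)
            if self.phase2_acks[msg_id] == set(self.members):
                for p in self.members:                 # phase 3: safe to forget msg_id
                    self.send(p, ("garbage_collect", msg_id))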

"Lazy" flooding, delayed phases. "Background" acknowledgements (not shown). Piggyback the 2nd and 3rd phases on other multicasts; reliable multicasts now look cheap! [Figure, shown over several slides: m1 goes out; m2 carries "all got m1"; m3 carries "gc m1"; m4 carries "gc m2"; a failure along the way does no harm.]
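
A small sketch of the piggybacking idea: the deferred phase-2/phase-3 bookkeeping rides along on the next data multicast; the field names are invented for illustration.

    # Sketch of piggybacking: instead of separate "all got it" / "garbage collect"
    # rounds, the sender attaches that bookkeeping to its next data multicast.

    def next_multicast(msg_id, payload, fully_acked, safe_to_forget):
        """Build one message carrying data plus deferred phase-2/3 information."""
        return {
            "id": msg_id,
            "payload": payload,
            "all_got": list(fully_acked),    # e.g. ["m1"]: everyone received m1
            "gc": list(safe_to_forget),      # e.g. ["m1"]: forget m1's duplicate state
        }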

Lazy scheme continued. If the sender fails, recipients switch to the flood-style algorithm... but now we have the same garbage collection problem: if the sender fails we may never be able to garbage collect the id! The problem is caused by the lack of a failure detector.

Garbage collection with inaccurate failure detections... we lack an accurate way to detect failure. If a process does seem to fail, but is really still operational and merely partitioned away, the connection might later be fixed; that process might "wake up" and send a duplicate. Hence, if we are not sure a process has failed, we can't garbage collect our duplicate-suppression data yet!

Exploiting a failure detector. Suppose that we had a failstop environment: process group membership is managed by an oracle, perhaps the GMS we saw earlier. Failures are reported as "new group views", and all processes see the same sequence of views: G = {p,q,r,s}, then {p,r,s}, then {r,s}. Now we can assume failures are accurately detected.

Now our lazy scheme works! Garbage collect when all non-faulty processes are known to have received the message. Use process ranking to pick a new "coordinator" if the initial one fails. The cost only reaches O(n²) if many processes fail during the protocol, and the 2nd and 3rd rounds can still be delayed if desired. Also link the GMS to the point-to-point channel implementation.
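
A sketch of the two rules this slide relies on, assuming accurate views from the GMS: garbage collection waits until every member of the current view has acknowledged, and the lowest-ranked surviving member takes over a failed sender's protocol. Names are illustrative.

    # Sketch: with GMS views, "everyone" means "every member of the current view",
    # and a ranked member list (fixed when the view was installed) decides who
    # finishes a failed sender's protocol.

    def can_garbage_collect(acked_by, current_view_members):
        return set(current_view_members) <= set(acked_by)

    def coordinator_for(msg_sender, current_view_members, ranked_members):
        if msg_sender in current_view_members:
            return msg_sender                   # original sender is still present
        for p in ranked_members:
            if p in current_view_members:
                return p                        # lowest-ranked survivor takes over
        return None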

Failure Detectors. Needed as "input" to the GMS. For now, just assume we have one, perhaps Vogels' investigator. In practice many systems use "timeout", but timeout is not safe for our purposes. Feeding detections through the group membership service converts inaccurate failure detections into what look like failstop failures for processes within the system.
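
As an illustration of "detector as input, GMS as authority", a crude timeout-based detector might only report suspicions to the GMS rather than act on them itself; the gms.report_suspect call and the threshold are hypothetical.

    # Sketch of a timeout-based detector whose output is only a *suspicion*
    # handed to the GMS, which turns it into an authoritative view change.
    import time

    class TimeoutDetector:
        def __init__(self, gms, suspect_after=5.0):
            self.gms = gms
            self.suspect_after = suspect_after
            self.last_heard = {}                 # process id -> last heartbeat time

        def heartbeat(self, p):
            self.last_heard[p] = time.monotonic()

        def check(self):
            now = time.monotonic()
            for p, t in self.last_heard.items():
                if now - t > self.suspect_after:
                    self.gms.report_suspect(p)   # the GMS decides; we never act alone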

Cutting Channels to Failed Processes. When a process is dropped from the membership, break the connection to it. This effectively eliminates the risk of "late" delivery of duplicate messages, etc., and makes a partitioning failure look like a failstop failure.
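
A minimal sketch of channel cutting on a view change, assuming a map from process id to an open connection object; both are placeholders for real transport state.

    # Sketch: on each new view, close connections to processes that were dropped,
    # so a partitioned process cannot later deliver "late" duplicates.

    def on_view_change(new_view_members, old_view_members, connections):
        for p in set(old_view_members) - set(new_view_members):
            conn = connections.pop(p, None)
            if conn is not None:
                conn.close()     # failed-or-partitioned now looks failstop to us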

Dynamic uniformity. This property requires an extra phase of communication. Phase 1: distribute the message. Phase 2: deliver once all non-faulty processes have received it in phase 1. Insight: no process delivers a message until all have received it.
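
A sketch of the extra phase on the receiver side, assuming the sender (or its successor) announces when every member of the current view holds a copy; names are illustrative.

    # Sketch of dynamically uniform delivery: a received message is buffered and
    # delivered only after learning that all (non-faulty) members have a copy.

    class UniformReceiver:
        def __init__(self, deliver):
            self.deliver = deliver
            self.buffered = {}                    # msg_id -> payload, not yet delivered

        def on_receive(self, msg_id, payload):    # phase 1: hold the message
            self.buffered[msg_id] = payload

        def on_all_received(self, msg_id):        # phase 2: now safe to deliver
            payload = self.buffered.pop(msg_id, None)
            if payload is not None:
                self.deliver(payload)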

Summary. We know how to build a GMS that tracks its own membership, and how to build an unordered reliable multicast (actually "sender-ordered": messages from different senders can be delivered in arbitrary orders). We also know how to support various forms of uniformity. Next: multicast ordering.