1 CS 525 Advanced Distributed Systems Spring 2011 Indranil Gupta (Indy) Membership Protocols (and Failure Detectors) March 31, 2011 All Slides © IG

2 Target Settings
– Process ‘group’-based systems
  – Clouds/datacenters
  – Replicated servers
  – Distributed databases
– Crash-stop/fail-stop process failures

3 Group Membership Service
[Diagram] At each application process pi, the application (e.g., gossip, overlays, DHTs) queries a local Membership List. The Membership Protocol maintains this Group Membership List — tracking joins, leaves, and failures of members — over unreliable communication.

4 Two sub-protocols
[Diagram] The membership protocol at each process pi has two components running over unreliable communication:
– Failure Detector
– Dissemination
The membership list may be almost-complete (focus of this talk: gossip-style, SWIM, virtual synchrony, …) or partial-random (other papers: SCAMP, T-MAN, Cyclon, …).

5 Large Group: Scalability — A Goal
[Diagram] A process group of 1000’s of processes (“members”), including us (pi), connected by an unreliable communication network.

6 Group Membership Protocol
[Diagram] Crash-stop failures only:
I. pj crashes.
II. Some process pi finds out quickly (Failure Detector).
III. The information is spread through the group (Dissemination).

7 I. pj crashes
– Nothing we can do about it!
– A frequent occurrence — the common case rather than the exception
– Frequency goes up at least linearly with the size of the datacenter

8 II. Distributed Failure Detectors: Desirable Properties
– Completeness = each failure is detected
– Accuracy = there is no mistaken detection
– Speed
  – Time to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load

9 Distributed Failure Detectors: Properties
– Completeness and accuracy: impossible to guarantee together in lossy networks [Chandra and Toueg] — if it were possible, consensus could be solved!
– Speed
  – Time to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load

10 What Real Failure Detectors Prefer
– Completeness: guaranteed
– Accuracy: partial/probabilistic guarantee
– Speed
  – Time to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load

11 Failure Detector Properties
– Completeness: guaranteed
– Accuracy: partial/probabilistic guarantee
– Speed
  – Time to first detection of a failure, i.e., time until some process detects the failure
– Scale
  – Equal load on each member
  – Network message load

12 Failure Detector Properties
– Completeness: guaranteed
– Accuracy: partial/probabilistic guarantee
– Speed
  – Time until some process detects the failure
– Scale
  – Equal load on each member — no bottlenecks/single failure point
  – Network message load

13 Failure Detector Properties
– Completeness
– Accuracy
– Speed
  – Time to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load
These properties should hold in spite of arbitrary simultaneous process failures.

14 Centralized Heartbeating
[Diagram] Every process pi periodically sends a (pi, heartbeat seq++) message to a single central process pj. If pj does not receive a heartbeat from pi within a timeout, it marks pi as failed. Downside: pj becomes a hotspot.
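The heartbeating variants on this and the next few slides boil down to a small amount of bookkeeping. Below is a minimal, hypothetical sketch of centralized heartbeating in Python; the `send` callback, the period, and the timeout are assumptions, not values from the slides:

```python
import time

HEARTBEAT_PERIOD = 1.0   # seconds between heartbeats (assumed value)
TIMEOUT = 3.0            # mark a member failed after this much silence (assumed)

def heartbeat_sender(my_id, send):
    """Each member pi periodically sends (pi, seq) to the central monitor."""
    seq = 0
    while True:
        send(("heartbeat", my_id, seq))
        seq += 1
        time.sleep(HEARTBEAT_PERIOD)

class CentralMonitor:
    """The single monitoring process pj: marks pi failed if no heartbeat
    from pi has arrived within TIMEOUT. Note the hotspot: pj receives
    heartbeats from every member."""
    def __init__(self):
        self.last_seen = {}                      # member id -> local receive time

    def on_heartbeat(self, member_id, seq):
        self.last_seen[member_id] = time.time()

    def failed_members(self):
        now = time.time()
        return [m for m, t in self.last_seen.items() if now - t > TIMEOUT]
```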

15 Ring Heartbeating
[Diagram] Each process pi sends (pi, heartbeat seq++) to its neighbor(s) on a ring. Downside: unpredictable behavior under simultaneous multiple failures.

16 All-to-All Heartbeating
[Diagram] Each process pi sends (pi, heartbeat seq++) to every other process pj. Benefit: equal load per member.

17 Gossip-style Heartbeating
[Diagram] Each process pi maintains an array of heartbeat sequence numbers for a member subset and gossips it periodically. Benefit: good accuracy properties.

18 Gossip-Style Failure Detection
Protocol: nodes periodically gossip their membership list; on receipt, the local membership list is updated. Each entry holds (Address, Heartbeat Counter, local Time). In the example, the current time is 70 at node 2 (clocks are asynchronous). [Fig and animation by: Dongyun Jin and Thuy Ngyuen]
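A minimal sketch of the merge rule stated above, assuming each table entry is (heartbeat counter, local time of last increase); all names are illustrative:

```python
import time

# membership list: address -> (heartbeat_counter, local_time_of_last_increase)
my_list = {}
MY_ADDR = "node2"   # hypothetical local address

def local_tick():
    """Periodically bump our own heartbeat counter."""
    hb, _ = my_list.get(MY_ADDR, (0, 0.0))
    my_list[MY_ADDR] = (hb + 1, time.time())

def merge(received_list):
    """On receiving a gossiped membership list, keep the higher heartbeat
    counter for every address, stamping it with our own (asynchronous)
    local clock whenever the counter advances."""
    now = time.time()
    for addr, (hb, _their_time) in received_list.items():
        local_hb, _local_time = my_list.get(addr, (-1, 0.0))
        if hb > local_hb:
            my_list[addr] = (hb, now)   # heartbeat advanced: refresh local time
```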

19 Gossip-Style Failure Detection
– If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
– And after a further Tcleanup seconds, the member is deleted from the list
– Why two different timeouts?
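Continuing the sketch, the two-timeout rule (mark after Tfail, delete only after a further Tcleanup) might look like this; the concrete timeout values are assumptions:

```python
import time

T_FAIL = 10.0      # assumed value, seconds without heartbeat increase
T_CLEANUP = 10.0   # assumed value, additional wait before deletion

failed = set()     # addresses currently marked failed

def sweep(my_list):
    """Mark entries failed after T_FAIL of no heartbeat increase,
    and delete them only after a further T_CLEANUP."""
    now = time.time()
    for addr, (hb, last_update) in list(my_list.items()):
        if now - last_update > T_FAIL:
            failed.add(addr)                       # stop trusting / gossiping it
        if now - last_update > T_FAIL + T_CLEANUP:
            del my_list[addr]                      # now safe to forget entirely
            failed.discard(addr)
```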

20 Gossip-Style Failure Detection
What if an entry pointing to a failed node is deleted right after Tfail seconds? A gossip received from a node that has not yet detected the failure would re-insert the deleted entry as if it were a fresh, alive member. Fix: remember the entry for another Tfail. (In the example, the current time is 75 at node 2.)

21 Multi-level Gossiping
– Network topology is hierarchical; random gossip target selection => core routers face O(N) load (Why?)
– Fix: select the gossip target in subnet i, which contains ni nodes, with probability 1/ni
– Router load = O(1); dissemination time = O(log(N)) — Why? What about latency for multi-level topologies?
– [Gupta et al, TPDS 06]
[Diagram: a router with N/2 nodes in each subnet]
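A sketch of one plausible reading of the topology-aware target selection above: a node crosses the router with probability 1/ni (ni = size of its own subnet), so the expected number of inter-subnet gossips per period stays O(1). Names and structure are illustrative:

```python
import random

def pick_gossip_target(my_subnet, other_subnets):
    """With probability 1/ni (ni = size of my own subnet), gossip across the
    router to a node in another subnet; otherwise gossip inside my subnet.
    Since ni nodes each cross with probability 1/ni, the expected number of
    gossip messages crossing the router per period is O(1)."""
    n_i = len(my_subnet)
    if other_subnets and random.random() < 1.0 / n_i:
        subnet = random.choice(other_subnets)      # cross the router
        return random.choice(subnet)
    return random.choice(my_subnet)                # stay within the subnet
```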

22 Analysis/Discussion
– What happens if the gossip period Tgossip is decreased?
– A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
  – O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
  – O(N·log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
  – What about O(k) bandwidth?
– What happens to Pmistake (false positive rate) as Tfail, Tcleanup are increased?
– Tradeoff: false positive rate vs. detection time

23 Simulations
– As the number of members increases, the detection time increases
– As the requirement is loosened, the detection time decreases
– As the number of failed members increases, the detection time increases significantly
– The algorithm is resilient to message loss

24 Failure Detector Properties …
– Completeness
– Accuracy
– Speed
  – Time to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load

25 …Are application-defined Requirements
– Completeness: guarantee always
– Accuracy: probability of a mistake, PM(T)
– Speed: T time units to first detection of a failure
– Scale
  – Equal load on each member
  – Network message load

26 …Are application-defined Requirements
– Completeness: guarantee always
– Accuracy: probability of a mistake, PM(T)
– Speed: T time units to first detection of a failure
– Scale
  – Equal load L on each member
  – Network message load N·L — compare this across protocols

27 All-to-All Heartbeating
[Diagram] Each pi sends (pi, heartbeat seq++) to every other member every T units. Per-member load: L = N/T.

28 Gossip-style Heartbeating
[Diagram] Each pi maintains an array of heartbeat sequence numbers for a member subset. Every tg units (= gossip period), it sends an O(N)-size gossip message. Detection time T = log N · tg, so per-member load L = N/tg = N·log N / T.
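Putting slides 27 and 28 side by side, the per-member bandwidth loads work out as follows (same notation as the slides):

```latex
\begin{align*}
\text{All-to-all:} \quad & L = \frac{N}{T}
  && \text{(one heartbeat to each of $N$ members every $T$ units)}\\[4pt]
\text{Gossip-style:} \quad & T = t_g \log N, \qquad
  L = \frac{N}{t_g} = \frac{N\log N}{T}
  && \text{(an $O(N)$-size gossip message every $t_g$ units)}
\end{align*}
```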

29 What’s the Best/Optimal we can do?
Worst-case load L*
– as a function of T, PM(T), N
– and independent message loss probability pml
(proof in the PODC 01 paper)

30 Heartbeating
– Optimal L is independent of N (!)
– All-to-all and gossip-based heartbeating are sub-optimal: L = O(N/T)
  – They try to achieve simultaneous detection at all processes
  – They fail to distinguish the Failure Detection and Dissemination components
– Key: separate the two components; use a non-heartbeat-based Failure Detection component

31 SWIM Failure Detector Protocol
[Diagram] Protocol period = T’ time units. In each period, pi pings a randomly chosen member pj and waits for an ack. If no ack arrives, pi sends ping-req to K randomly chosen processes, each of which pings pj and relays any ack back to pi.
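A compact sketch of one SWIM protocol period as drawn above. The messaging helpers (`send_ping`, `send_ping_req`) and the value of K are placeholders, not the paper's API:

```python
import random

K = 3   # number of indirect probers (tunable; assumed value)

def protocol_period(members, me, send_ping, send_ping_req):
    """One SWIM period of T' time units at process `me`:
    1. ping one randomly chosen member pj directly;
    2. if no ack, ask K other random members to ping pj on our behalf;
    3. if no direct or indirect ack arrives by the end of the period,
       hand pj to the dissemination component (or suspect it, slide 40)."""
    candidates = [m for m in members if m != me]
    pj = random.choice(candidates)

    if send_ping(pj):                       # True iff an ack arrives in time
        return None                         # pj responded: nothing to report

    helpers = random.sample([m for m in candidates if m != pj],
                            min(K, len(candidates) - 1))
    for h in helpers:
        if send_ping_req(h, pj):            # True iff h relays back pj's ack
            return None

    return pj                               # no ack at all: report pj
```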

32 SWIM versus Heartbeating
[Chart] For a fixed false positive rate and message loss rate: SWIM achieves constant process load and constant first detection time, whereas heartbeating protocols pay O(N) in process load and/or first detection time.

33 SWIM Failure Detector
– First detection time: expected e/(e−1) protocol periods — constant (independent of group size)
– Process load: constant per period; < 8 L* for 15% loss
– False positive rate: tunable (via K); falls exponentially as load is scaled
– Completeness: deterministic, time-bounded; within O(log(N)) periods w.h.p.

34 Accuracy, Load
– PM(T) is exponential in −K; it also depends on pml (and pf)
– See the paper for results with up to 15% loss rates

35 Detection Time
– Prob. of being pinged in T’ = 1 − (1 − 1/N)^(N−1) → 1 − e^(−1)
– E[T] = T’ · e/(e−1)
– Completeness: any alive member detects the failure
  – Eventually
  – By using a trick: within worst case O(N) protocol periods
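For completeness, here is the standard geometric-distribution argument behind the expectation above, assuming each member picks one uniformly random ping target per period (the slide states only the result):

```latex
\begin{align*}
q &= \Pr[\text{$p_j$ is pinged by at least one member in a period}]
   = 1 - \left(1 - \tfrac{1}{N}\right)^{N-1}
   \;\xrightarrow{\;N\to\infty\;}\; 1 - e^{-1},\\[4pt]
E[T] &= T' \sum_{k\ge 1} k\, q\,(1-q)^{k-1} = \frac{T'}{q}
   \;\longrightarrow\; T'\cdot\frac{e}{e-1}.
\end{align*}
```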

36 III. Dissemination
[Diagram] pj crashes; some process pi finds out quickly via the Failure Detector. HOW is this information disseminated to the rest of the group over the unreliable communication network?

37 Dissemination Options
– Multicast (hardware / IP)
  – unreliable
  – multiple simultaneous multicasts
– Point-to-point (TCP / UDP)
  – expensive
– Zero extra messages: piggyback on Failure Detector messages
  – Infection-style dissemination

38 Infection-style Dissemination
[Diagram] The same ping / ack / ping-req exchanges as the failure detector (protocol period T, K random processes), but with membership information piggybacked on every message.

39 Infection-style Dissemination
– Epidemic-style dissemination
  – After O(log(N)) protocol periods, only a small fraction of processes would not have heard about an update
– Maintain a buffer of recently joined/evicted processes
  – Piggyback updates from this buffer
  – Prefer recent updates
– Buffer elements are garbage collected after a while
  – After O(log(N)) protocol periods; this defines weak consistency
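A sketch of the piggyback buffer described above. Sizes are illustrative, and the garbage-collection threshold follows the implementation setting on slide 44 (3·⌈log(N+1)⌉ periods, log base assumed):

```python
import math

class PiggybackBuffer:
    """Recently joined/left/failed/suspected members, piggybacked on
    ping/ack/ping-req messages and dropped after ~O(log N) periods."""
    def __init__(self, n_members, max_piggyback=6):
        self.updates = {}            # member id -> (update, times_gossiped)
        self.gc_after = 3 * math.ceil(math.log2(n_members + 1))  # base assumed
        self.max_piggyback = max_piggyback

    def add(self, member_id, update):
        self.updates[member_id] = (update, 0)

    def select_for_message(self):
        """Pick updates to piggyback, preferring the least-gossiped
        (i.e., most recent) ones; garbage-collect old entries."""
        chosen = sorted(self.updates.items(), key=lambda kv: kv[1][1])
        chosen = chosen[: self.max_piggyback]
        for mid, (upd, count) in chosen:
            if count + 1 >= self.gc_after:
                del self.updates[mid]            # garbage collect old update
            else:
                self.updates[mid] = (upd, count + 1)
        return [(mid, upd) for mid, (upd, _) in chosen]
```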

40 Suspicion Mechanism
– False detections arise due to
  – Perturbed processes
  – Packet losses, e.g., from congestion
– Indirect pinging may not solve the problem
  – e.g., correlated message losses near the pinged host
– Key: suspect a process before declaring it failed in the group

41 Suspicion Mechanism
[State machine at pi for its view element on pj, driven by pi’s FD and Dissmn components]
– Alive → Suspected: FD: pi’s ping to pj fails, or Dissmn: (Suspect pj) received
– Suspected → Alive: FD: pi’s ping to pj succeeds, or Dissmn: (Alive pj) received
– Suspected → Failed: suspicion timeout, or Dissmn: (Failed pj) received

42 Suspicion Mechanism
– Distinguish multiple suspicions of a process
  – Per-process incarnation number
  – The incarnation number for pi can be incremented only by pi, e.g., when it receives a (Suspect, pi) message about itself
  – Somewhat similar to DSDV
– Higher incarnation-number notifications override lower ones
– Within the same incarnation number: (Suspect, inc#) overrides (Alive, inc#)
– Nothing overrides a (Failed, inc#) — see paper
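The override rules on this slide can be captured in a small comparison function; the message representation is an assumption:

```python
# Message kinds about a member pj, ordered by the rules on this slide:
#  - a Failed message overrides everything;
#  - otherwise, a higher incarnation number wins;
#  - within the same incarnation, Suspect overrides Alive.
PRECEDENCE = {"Alive": 0, "Suspect": 1, "Failed": 2}

def overrides(new, old):
    """Return True if update `new` = (kind, inc#) should replace `old`."""
    new_kind, new_inc = new
    old_kind, old_inc = old
    if old_kind == "Failed":
        return False                      # nothing overrides a Failed
    if new_kind == "Failed":
        return True
    if new_inc != old_inc:
        return new_inc > old_inc          # higher incarnation number wins
    return PRECEDENCE[new_kind] > PRECEDENCE[old_kind]   # Suspect > Alive
```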

43 Time-bounded Completeness
– Key: select each membership element once as a ping target in a traversal
  – Round-robin pinging
  – Random permutation of the list after each traversal
– Each failure is detected in worst case 2N−1 (local) protocol periods
– Preserves FD properties
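A sketch of the round-robin target selection with per-traversal random permutation; helper names are illustrative:

```python
import random

class PingTargetSelector:
    """Traverse the membership list round-robin and reshuffle after each
    full traversal, so every member is selected as a ping target within a
    bounded number of local protocol periods (worst case 2N-1 per the slide)."""
    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.idx = 0

    def next_target(self):
        target = self.order[self.idx]
        self.idx += 1
        if self.idx == len(self.order):      # finished one traversal
            random.shuffle(self.order)       # new random permutation
            self.idx = 0
        return target

    def on_membership_change(self, members):
        """Sketch only: rebuild and reshuffle when members join or leave."""
        self.order = list(members)
        random.shuffle(self.order)
        self.idx = 0
```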

44 Results from an Implementation
– Current implementation
  – Win2K, uses Winsock 2
  – Uses only UDP messaging
  – 900 semicolons of code (including testing)
– Experimental platform
  – Galaxy cluster: diverse collection of commodity PCs
  – 100 Mbps Ethernet
– Default protocol settings
  – Protocol period = 2 s; K = 1; G.C. and suspicion timeouts = 3·⌈log(N+1)⌉
– No partial membership lists observed in experiments

45 [Plot] Per-process send and receive loads are independent of group size.

46 [Plot] Time to first detection of a process failure (T1), shown alongside the total T1+T2+T3.

47 [Plot] Time to first detection of a process failure (T1) is apparently uncorrelated to group size.

48 [Plot] Membership update dissemination time (T2) is low at high group sizes.

49 [Plot] Excess time taken by the suspicion mechanism (T3).

50 [Plot] Benefit of the suspicion mechanism, per process, under 10% synthetic packet loss.

51 More discussion points
– It turns out that with a partial membership list that is uniformly random, gossiping retains the same properties as with complete membership lists
  – Why? (Think of the equation)
  – Partial membership protocols: SCAMP, Cyclon, T-MAN, …
– Gossip-style failure detection underlies
  – Astrolabe
  – Amazon EC2/S3 (rumored!)
– SWIM used in
  – CoralCDN/Oasis anycast service:
  – Mike Freedman used the suspicion mechanism to blackmark frequently-failing nodes

52 Reminder – Due this Sunday April 3rd at PM
– Project Midterm Report due, pm [12pt font, single-sided, page Business Plan max]
– Wiki Term Paper – Second Draft due (individual)
– Reviews – you only have to submit reviews for 15 sessions (any 15 sessions) from 2/10 to 4/28. Keep track of your count!
– Take a breather!

53 Questions