Lecture 6: Failure Detection and Membership, Grids


CS 425 / ECE 428 Distributed Systems, Fall 2018
Indranil Gupta (Indy)
All slides © IG

A Challenge
You've been put in charge of a datacenter, and your manager has told you: "Oh no! We don't have any failures in our datacenter!"
- Do you believe him/her?
- What would be your first responsibility? Build a failure detector.
- What are some things that could go wrong if you didn't do this?

Failures are the Norm
... not the exception, in datacenters.
- Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
- When you have 120 servers in the DC, the mean time to the next failure (MTTF) in the DC is 1 month.
- When you have 12,000 servers in the DC, the mean time to the next failure is about 7.2 hours!
- Soft crashes and failures are even more frequent!
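The arithmetic above scales linearly with the number of machines; a quick back-of-the-envelope sketch to check it (the 10-year per-machine MTTF is the slide's own assumption):

```python
# Mean time to the *next* failure anywhere in the DC, assuming independent
# machine failures and a per-machine MTTF of 10 years (the slide's figure).
PER_MACHINE_MTTF_MONTHS = 10 * 12  # 120 months

for num_servers in (120, 12_000):
    dc_mttf_months = PER_MACHINE_MTTF_MONTHS / num_servers
    dc_mttf_hours = dc_mttf_months * 30 * 24  # ~720 hours per month
    print(f"{num_servers:>6} servers -> next failure every "
          f"{dc_mttf_months:.4f} months (~{dc_mttf_hours:.1f} hours)")
# 120 servers -> ~1 month; 12,000 servers -> ~7.2 hours
```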

To Build a Failure Detector
You have a few options:
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a (distributed) failure detector program that automatically detects failures and reports to your workstation.
Which is preferable, and why?

Target Settings
- Process "group"-based systems: clouds/datacenters, replicated servers, distributed databases.
- Fail-stop (crash) process failures.

Group Membership Service
- Each application process pi runs a membership protocol that maintains a membership list: the current set of group members.
- The application queries this list; e.g., gossip, overlays, DHTs, etc. all build on it.
- The membership protocol must track joins, leaves, and failures of members, over unreliable (but non-Byzantine) communication.

Two Sub-protocols
- At each application process pi, the membership protocol has two components: a Failure Detector and Dissemination (spreading membership updates, e.g., about pj, through the group).
- Design choices for the membership list:
  - Complete list all the time (strongly consistent): virtual synchrony.
  - Almost-complete list (weakly consistent): gossip-style, SWIM, ... (the focus of this series of lectures).
  - Partial-random list (other systems): SCAMP, T-MAN, Cyclon, ...
- All over unreliable (non-Byzantine) communication.

Large Group: Scalability, a Goal
- A process group of "members", 1000s of processes ("this is us": pi), connected by an unreliable (non-Byzantine) communication network.

Group Membership Protocol
- I: process pj crashes (fail-stop failures only).
- II: some process pi finds out quickly (the Failure Detector component).
- III: the news is spread to the rest of the group (the Dissemination component).
- All over an unreliable (non-Byzantine) communication network.

Next
How do you design a group membership protocol?

I. pj Crashes
- Nothing we can do about it!
- A frequent occurrence; the common case rather than the exception.
- Frequency goes up linearly with the size of the datacenter.

II. Distributed Failure Detectors: Desirable Properties
- Completeness: each failure is detected.
- Accuracy: there is no mistaken detection.
- Speed: time to first detection of a failure.
- Scale: equal load on each member; low network message load.

Distributed Failure Detectors: Properties
- Completeness and accuracy are impossible to achieve together in lossy networks [Chandra and Toueg].
- If it were possible, we could solve consensus (but consensus is known to be unsolvable in asynchronous systems).
- The remaining properties: speed (time to first detection of a failure) and scale (equal load on each member, network message load).

What Real Failure Detectors Prefer
- Completeness: guaranteed.
- Accuracy: partial/probabilistic guarantee.
- Speed: time until some process detects the failure.
- Scale: no bottlenecks/single point of failure.

Failure Detector Properties
- Completeness, accuracy, speed (time to first detection of a failure), and scale (equal load on each member, network message load) must all hold in spite of arbitrary simultaneous process failures.

Centralized Heartbeating
- Every process pi periodically sends a heartbeat message (pi, heartbeat seq. #++) to a single central process pj.
- If a heartbeat is not received from pi within a timeout, pj marks pi as failed.
- Downside: pj becomes a hotspot.

Ring Heartbeating
- Processes are arranged in a virtual ring; each process pi sends its heartbeats (pi, heartbeat seq. #++) to its ring neighbor(s).
- Downside: unpredictable on simultaneous multiple failures.

All-to-All Heartbeating
- Every process pi sends its heartbeats (pi, heartbeat seq. #++) to all other processes pj.
- Upside: equal load per member.
- Downside: a single heartbeat loss causes a false detection.
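A minimal sketch of the receiver side of all-to-all heartbeating; the timeout value and function names are illustrative assumptions, not from the slides:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; illustrative choice

# last_heard[p] = (highest heartbeat seq# seen from p, local time it arrived)
last_heard = {}

def on_heartbeat(sender, seq):
    """Called for every (sender, seq#) heartbeat we receive."""
    prev_seq, _ = last_heard.get(sender, (-1, 0.0))
    if seq > prev_seq:                      # ignore stale/duplicate heartbeats
        last_heard[sender] = (seq, time.time())

def detect_failures():
    """Periodic scan: anyone silent past the timeout is marked failed."""
    now = time.time()
    return [p for p, (_, t) in last_heard.items()
            if now - t > HEARTBEAT_TIMEOUT]
```

Note how a single heartbeat lost near the timeout boundary flips a process into the failed list: the false-positive weakness the slide points out.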

Next
How do we increase the robustness of all-to-all heartbeating?

Gossip-Style Heartbeating
- Each process pi maintains an array of heartbeat seq. #s for a member subset, and gossips it around.
- Upside: good accuracy properties.

Gossip-Style Failure Detection
- Each entry in the membership list holds: member address, heartbeat counter, and the local time the entry was last updated (clocks are asynchronous across nodes).
- Protocol: nodes periodically gossip their membership list: pick random nodes and send them the list.
- On receipt, the received list is merged with the local membership list (keeping the higher heartbeat counter per member, and refreshing the local time for entries that advanced).
- When an entry times out, the member is marked as failed.
- (The slide's example shows four nodes and the list held at each, e.g., current time 70 at node 2.)

Gossip-Style Failure Detection
- If the heartbeat has not increased for more than Tfail seconds, the member is considered failed.
- After a further Tcleanup seconds, the member is deleted from the list.
- Why an additional timeout? Why not delete right away? (See the next slide, and the sketch below it.)

Gossip-Style Failure Detection
- What if an entry pointing to a failed node is deleted right after Tfail (=24) seconds? Another node whose copy of the entry has not yet timed out may gossip the stale entry back, mistakenly resurrecting the failed member in our list.
- Fix: remember the deleted member for another Tfail seconds (the Tcleanup period), and ignore gossiped entries for it during that time.
- (The slide's example: current time 75 at node 2; node 3's heartbeat has stopped increasing, so its entry times out and is dropped from node 2's list.)
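A minimal sketch of this gossip-style detector, assuming illustrative values for Tgossip, Tfail, and Tcleanup and an assumed send(target, list) transport callback:

```python
import random, time

T_GOSSIP, T_FAIL, T_CLEANUP = 1.0, 24.0, 24.0   # illustrative values
GOSSIP_FANOUT = 2

# membership[addr] = {"hb": heartbeat counter, "t": local time last advanced}
membership = {}
failed = {}       # addr -> local time it was declared failed

def gossip_once(my_addr, send):
    """Every T_GOSSIP: bump own heartbeat, send the list to random members."""
    me = membership.setdefault(my_addr, {"hb": 0, "t": time.time()})
    me["hb"] += 1
    me["t"] = time.time()
    others = [a for a in membership if a != my_addr]
    for target in random.sample(others, k=min(GOSSIP_FANOUT, len(others))):
        send(target, {a: e["hb"] for a, e in membership.items()})

def on_gossip(received):
    """Merge a received list: keep the higher heartbeat counter per member."""
    now = time.time()
    for addr, hb in received.items():
        if addr in failed:
            continue  # still in Tcleanup: ignore, so the entry isn't resurrected
        entry = membership.get(addr)
        if entry is None or hb > entry["hb"]:
            membership[addr] = {"hb": hb, "t": now}

def sweep():
    """Tfail: mark silent members failed; Tcleanup: finally forget them."""
    now = time.time()
    for addr, e in list(membership.items()):
        if now - e["t"] > T_FAIL:
            failed[addr] = now
            del membership[addr]
    for addr, t_failed in list(failed.items()):
        if now - t_failed > T_CLEANUP:
            del failed[addr]
```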

Analysis/Discussion
- Well-known result: a gossip takes O(log(N)) time to propagate. So, given sufficient bandwidth, a single heartbeat takes O(log(N)) time to propagate.
- N heartbeats therefore take:
  - O(log(N)) time to propagate, if each node is allowed O(N) bandwidth.
  - O(N*log(N)) time to propagate, if each node is allowed only O(1) bandwidth.
  - What about O(k) bandwidth?
- What happens if the gossip period Tgossip is decreased?
- What happens to Pmistake (the false positive rate) as Tfail and Tcleanup are increased?
- Tradeoff: false positive rate vs. detection time vs. bandwidth.
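A tiny simulation of the O(log(N)) spreading claim; the push-only model and fanout of 1 per round are illustrative simplifications:

```python
import random

def rounds_to_spread(n, fanout=1):
    """Rounds until a rumor started at one node reaches all n nodes,
    with every informed node telling `fanout` random nodes per round."""
    informed = {0}
    rounds = 0
    while len(informed) < n:
        for _ in range(len(informed) * fanout):
            informed.add(random.randrange(n))
        rounds += 1
    return rounds

print([rounds_to_spread(2 ** k) for k in range(4, 11)])
# grows roughly linearly in k = log2(N), i.e., logarithmically in N
```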

Next
So, is this the best we can do? What is the best we can do?

Failure Detector Properties ...
- Completeness.
- Accuracy.
- Speed: time to first detection of a failure.
- Scale: equal load on each member; network message load.

... Are Application-Defined Requirements
- Completeness: guaranteed, always.
- Accuracy: a probability of mistake PM(T).
- Speed: detection within T time units.
- Scale: equal load L on each member; total network message load N*L: compare this across protocols.

All-to-All Heartbeating
- Every T time units, each process pi sends its heartbeat (pi, heartbeat seq. #++) to all N members.
- Load per member: L = N/T.

Gossip-Style Heartbeating
- Every tg time units (the gossip period), each process sends an O(N)-size gossip message carrying its array of heartbeat seq. #s.
- Detection time: T = log(N) * tg.
- Load per member: L = N/tg = N*log(N)/T.
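A quick numeric check of the two loads just derived; N, T, and the log base (which depends on the gossip fanout, base 2 used here) are illustrative assumptions:

```python
import math

N, T = 1000, 10.0            # illustrative: 1000 members, 10 s detection time

L_all_to_all = N / T                   # heartbeats/sec per member
L_gossip = N * math.log2(N) / T        # gossip pays an extra log(N) factor

print(f"all-to-all: {L_all_to_all:.0f} msgs/s per member")
print(f"gossip:     {L_gossip:.0f} msgs/s per member "
      f"(~{L_gossip / L_all_to_all:.1f}x higher, for the same T)")
```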

What's the Best/Optimal We Can Do?
- What is the worst-case load L* per member in the group (in messages per second), as a function of T, PM(T), and N?
- Assume an independent message loss probability pml.

Heartbeating
- The optimal load L* is independent of N (!).
- All-to-all and gossip-based heartbeating are sub-optimal, with L = O(N/T): they try to achieve simultaneous detection at all processes, and they fail to distinguish the Failure Detection and Dissemination components.
- Can we reach this bound? Key: separate the two components, and use a non-heartbeat-based Failure Detection component.

Next
Is there a better failure detector?

SWIM Failure Detector Protocol
- Protocol period = T' time units.
- In each period, pi picks a random member pj and sends it a ping; pj replies with an ack.
- If no ack arrives within a timeout, pi sends ping-req(pj) to K randomly selected processes; each of them pings pj and relays any ack back to pi.
- If pi still has no ack for pj by the end of the period, it marks pj as failed (X).
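A minimal sketch of one SWIM protocol period at pi; ping and ping_req are assumed to be synchronous helpers that return True on ack (illustrative names, not the paper's API):

```python
import random

K = 3  # number of ping-req proxies; illustrative

def swim_round(members, me, ping, ping_req):
    """One protocol period at process `me`.

    ping(target)          -> True iff target acks our direct ping in time
    ping_req(proxy, tgt)  -> True iff proxy pinged tgt and relayed an ack
    Returns the member to declare failed this period, or None.
    """
    others = [m for m in members if m != me]
    if not others:
        return None
    pj = random.choice(others)        # random ping target this period
    if ping(pj):
        return None                   # direct ack: pj is alive
    # Direct ping failed: ask K random proxies to ping pj on our behalf,
    # so a lossy pi<->pj link alone doesn't cause a false detection.
    proxies = random.sample([m for m in others if m != pj],
                            k=min(K, len(others) - 1))
    if any(ping_req(w, pj) for w in proxies):
        return None                   # indirect ack: pj is alive
    return pj                         # no ack at all: declare pj failed
```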

Detection Time
- Probability of a given member being pinged in a protocol period T': 1 - (1 - 1/N)^(N-1) ≈ 1 - e^(-1).
- Expected detection time: E[T] = T' * e/(e-1), a constant independent of group size N.
- Completeness: any alive member detects a failure, eventually; and, using a trick (round-robin pinging, two slides ahead), within a worst case of O(N) protocol periods.

Accuracy, Load
- PM(T) is exponential in -K (it falls exponentially as K grows). It also depends on pml (and on pf, the process failure probability).
- See the paper for results at up to 15% message loss rates.

SWIM Failure Detector
- First detection time: expected e/(e-1) protocol periods; constant (independent of group size).
- Process load: constant per period; < 8 L* for 15% message loss.
- False positive rate: tunable (via K); falls exponentially as load is scaled.
- Completeness: deterministic, time-bounded; within O(log(N)) periods w.h.p.

Time-Bounded Completeness
- Key: select each membership list element exactly once as a ping target in a traversal: round-robin pinging, with a random permutation of the list after each traversal.
- Each failure is then detected in at worst 2N-1 (local) protocol periods.
- Preserves the other FD properties.
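A minimal sketch of this traversal discipline (class and method names are illustrative):

```python
import random

class PingTargetSelector:
    """Round-robin traversal of the membership list, reshuffled after each
    pass, so every member is selected once per traversal (bounding detection
    time) while targets still look random from the outside."""

    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.idx = 0

    def next_target(self):
        target = self.order[self.idx]
        self.idx += 1
        if self.idx == len(self.order):      # traversal complete:
            random.shuffle(self.order)       # new random permutation
            self.idx = 0
        return target

    def on_join(self, member):
        # New members slot into a random position in the current pass.
        self.order.insert(random.randrange(len(self.order) + 1), member)
```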

SWIM versus Heartbeating
For a fixed false positive rate and message loss rate:
- SWIM: constant first detection time and constant process load.
- Heartbeating: constant first detection time only at O(N) process load, or constant process load only at O(N) first detection time.

Next
How do failure detectors fit into the big picture of a group membership protocol? What are the missing blocks?

Group Membership Protocol (revisited)
- The same picture as before: I, pj crashes; II, some process pi finds out quickly (Failure Detector); III, the news is spread (Dissemination); all over an unreliable (non-Byzantine) network, with fail-stop failures only.
- But HOW, exactly, is each step done? Failure detection we have now seen; dissemination is next.

Dissemination Options
- Multicast (hardware/IP): unreliable, especially with multiple simultaneous multicasts.
- Point-to-point (TCP/UDP): expensive.
- Zero extra messages: piggyback updates on the Failure Detector's messages: infection-style dissemination.

Infection-Style Dissemination
- Same protocol period (T time units) and message pattern as the SWIM failure detector: ping, ack, and ping-req through K random processes.
- Membership information is piggybacked on all of these messages; no extra messages are sent.

Infection-Style Dissemination
- Epidemic/gossip-style dissemination: after λ*log(N) protocol periods, only N^(-(2λ-2)) processes would not have heard about an update.
- Maintain a buffer of recently joined/evicted processes; piggyback entries from this buffer, preferring recent updates.
- Buffer elements are garbage-collected after a while: after λ*log(N) protocol periods, i.e., once they have propagated through the system. This defines the weak consistency.
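A minimal sketch of the piggyback buffer just described; MAX_PIGGYBACK and the exact garbage-collection rule are illustrative assumptions:

```python
import math

MAX_PIGGYBACK = 6   # max updates attached per FD message; illustrative
LAMBDA = 3          # the slide's λ: retransmission multiplier

class UpdateBuffer:
    """Recently heard membership updates, piggybacked on FD messages.
    Each update is resent ~λ*log(N) times, then garbage-collected."""

    def __init__(self):
        self.updates = {}  # member -> {"event": "joined"/"failed", "sent": n}

    def add(self, member, event):
        self.updates[member] = {"event": event, "sent": 0}

    def piggyback(self, group_size):
        """Pick updates to attach to the next outgoing FD message."""
        limit = LAMBDA * math.ceil(math.log2(max(group_size, 2)))
        # Prefer recent updates: those that have been sent the fewest times.
        fresh = sorted(self.updates.items(), key=lambda kv: kv[1]["sent"])
        chosen = fresh[:MAX_PIGGYBACK]
        for member, u in chosen:
            u["sent"] += 1
            if u["sent"] >= limit:        # has propagated enough: GC it
                del self.updates[member]
        return [(m, u["event"]) for m, u in chosen]
```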

Suspicion Mechanism
- False detections happen, due to perturbed processes and packet losses (e.g., from congestion).
- Indirect pinging may not solve the problem.
- Key: suspect a process before declaring it as failed in the group.

Suspicion Mechanism
- At each process pi, a member pj is in one of three states: Alive, Suspected, or Failed.
- Alive -> Suspected: pi's ping to pj fails (FD), or pi receives a disseminated (Suspect pj) message.
- Suspected -> Alive: pi's ping to pj succeeds (FD), or pi receives a disseminated (Alive pj) message.
- Suspected -> Failed: a timeout expires, or pi receives a disseminated (Failed pj) message; (Failed pj) is then itself disseminated.

Suspicion Mechanism
- To distinguish multiple suspicions of a process: a per-process incarnation number.
- The incarnation # for pi can be incremented only by pi itself, e.g., when it receives a (Suspect, pi) message about itself. (Somewhat similar to DSDV, a routing protocol for ad-hoc networks.)
- Higher-incarnation# notifications override lower ones.
- Within an incarnation #: (Suspect, inc#) overrides (Alive, inc#).
- (Failed, inc#) overrides everything else.
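A minimal sketch of these override rules (a sketch of the precedence logic only, not SWIM's message format):

```python
# Precedence of states *within the same incarnation number*.
PRECEDENCE = {"alive": 0, "suspect": 1, "failed": 2}

# local_view[member] = ("alive" | "suspect" | "failed", incarnation)
local_view = {}

def on_notification(member, state, inc):
    """Apply an incoming (state, incarnation#) notification about `member`.
    Returns True iff the notification was fresh and taken."""
    cur_state, cur_inc = local_view.get(member, ("alive", -1))
    if cur_state == "failed":
        return False                       # Failed overrides everything else
    if state == "failed":
        local_view[member] = (state, inc)  # ... including lower incarnations
        return True
    # Otherwise: higher incarnation wins; within one incarnation,
    # Suspect > Alive (tuple comparison encodes both rules).
    if (inc, PRECEDENCE[state]) > (cur_inc, PRECEDENCE[cur_state]):
        local_view[member] = (state, inc)
        return True
    return False                           # stale notification: drop it
```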

SWIM in Industry
- First used in Oasis/CoralCDN.
- Implemented open-source by HashiCorp Inc.: called "Serf", later "Consul".
- Today: Uber implemented it and uses it for failure detection in their infrastructure; see the "ringpop" system.

Wrap Up
- Failures are the norm, not the exception, in datacenters.
- Every distributed system uses a failure detector; many distributed systems use a membership service.
- Ring failure detection underlies IBM SP2 and many other similar clusters/machines.
- Gossip-style failure detection underlies Amazon EC2/S3 (rumored!).

Grid Computing

“A Cloudy History of Time”
- A timeline from 1940 to 2012: the first datacenters (1940s-50s); the timesharing companies & data processing industry; clusters; grids; PCs (not distributed!); peer-to-peer systems; and, by 2012, clouds and datacenters.

“A Cloudy History of Time” (in more detail)
- First large datacenters: ENIAC, ORDVAC, ILLIAC; many used vacuum tubes and mechanical relays.
- Data processing industry: $70 M in 1968; $3.15 Billion in 1978.
- Timesharing industry (1975) market share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%; machines included the Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108.
- Berkeley NOW project; supercomputers; server farms (e.g., Oceano).
- P2P systems (90s-00s): many millions of users; many GB per day.
- Grids (1980s-2000s): GriPhyN (1970s-80s), Globus & other standards (1990s-2000s), Open Science Grid and Lambda Rail (2000s).
- 2012: clouds.

Example: Rapid Atmospheric Modeling System, ColoState U
- Hurricane Georges, 17 days in September 1998: “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data.”
- Used 5 km grid spacing instead of the usual 10 km; ran on 256+ processors.
- Computation-intensive computing (or HPC = high performance computing).
- Can one run such a program without access to a supercomputer?

Distributed Computing Resources
- Three example sites: Wisconsin, NCSA, MIT.

An Application Coded by a Physicist
- Four jobs: the output files of Job 0 are input to Job 2 (Jobs 1 and 2 can be concurrent); the output files of Job 2 are input to Job 3.
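A minimal sketch of this dependency structure as a job DAG; the job names come from the slide, but the runner itself is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# deps[j] = jobs whose output files j needs as input (from the slide's DAG)
deps = {"job0": [], "job1": ["job0"], "job2": ["job0"], "job3": ["job2"]}

def run_dag(run_job):
    """Run each job once all of its inputs exist; independent jobs
    (here job1 and job2) run concurrently."""
    done = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            ready = [j for j in deps
                     if j not in done and all(d in done for d in deps[j])]
            list(pool.map(run_job, ready))   # wait for this wave to finish
            done.update(ready)

run_dag(print)   # prints job0; then job1 and job2 (in parallel); then job3
```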

An Application Coded by a Physicist (continued)
- The output files passed between jobs (Job 0 to Job 2, Job 2 to Job 3) can be several GBs; a job may take several hours/days.
- Stages of a job: Init, Stage in, Execute, Stage out, Publish.
- Computation intensive, so massively parallel.

The Scheduling Problem
- Given the jobs (Job 0 through Job 3) and the sites (Wisconsin, NCSA, MIT): allocation? scheduling? That is, which job runs at which site, and when?

2-Level Scheduling Infrastructure
- Top level, inter-site: the Globus protocol allocates jobs across the sites (Wisconsin, NCSA, MIT).
- Bottom level, intra-site: each site schedules its own jobs with its own protocol, e.g., the HTCondor protocol at Wisconsin; other sites may run some other intra-site protocol.

Intra-Site Protocol
- Example: the HTCondor protocol at Wisconsin, running (say) Jobs 0 and 3.
- Responsibilities: internal allocation & scheduling, monitoring, and distribution and publishing of files.

Condor (now HTCondor)
- A high-throughput computing system from U. Wisconsin Madison.
- Belongs to the class of "cycle-scavenging" systems; SETI@Home and Folding@Home are other systems in this category.
- Such systems run on a lot of workstations: when a workstation is free, it asks the site's central server (or Globus) for tasks; if the user hits a keystroke or mouse click, the task is stopped (either killed, or the server is asked to reschedule it).
- Can also run on dedicated machines.
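A minimal sketch of the cycle-scavenging loop just described; the server interface (fetch_task, report_done, reschedule), the idle check, and the threshold are illustrative assumptions, not HTCondor's actual API:

```python
import time

IDLE_THRESHOLD = 60.0   # seconds without keyboard/mouse input; illustrative

def scavenge(server, idle_seconds, run_step):
    """Run borrowed tasks only while the workstation's owner is away.
    run_step(task) performs a slice of work; returns False when finished."""
    task = None
    while True:
        if idle_seconds() >= IDLE_THRESHOLD:
            if task is None:
                task = server.fetch_task()        # ask the central server
            if task is not None and not run_step(task):
                server.report_done(task)          # task finished
                task = None
        elif task is not None:
            server.reschedule(task)               # owner is back: give it up
            task = None
        time.sleep(1.0)
```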

Inter-Site Protocol
- Example: the Globus protocol spanning Wisconsin (Jobs 0, 3), NCSA (Job 1), and MIT (Job 2).
- Responsibilities: external allocation & scheduling, and stage in & stage out of files.
- The internal structure of the different sites is invisible to Globus.

Globus
- The Globus Alliance involves universities, national US research labs, and some companies.
- It standardized several things, especially software tools. (Separately, but related: the Open Grid Forum.)
- The Globus Alliance has developed the Globus Toolkit: http://toolkit.globus.org/toolkit/

Globus Toolkit
Open-source; consists of several components:
- GridFTP: wide-area transfer of bulk data.
- GRAM5 (Grid Resource Allocation Manager): submit, locate, cancel, and manage jobs. Not a scheduler: Globus communicates with the schedulers via intra-site protocols like HTCondor or Portable Batch System (PBS).
- RLS (Replica Location Service): a naming service that translates from a file/dir name to a target location (or another file/dir name).
- Libraries like XIO, to provide a standard API for all Grid I/O functionalities.
- Grid Security Infrastructure (GSI).

Security Issues
Important in grids because they are federated, i.e., no single entity controls the entire infrastructure:
- Single sign-on: a collective job set should require once-only user authentication.
- Mapping to local security mechanisms: some sites use Kerberos, others use Unix.
- Delegation: credentials to access resources are inherited by subcomputations, e.g., from job 0 to job 1.
- Community authorization: e.g., third-party authentication.
These are also important in clouds, but less so, because clouds are typically run under central control; in clouds the focus is on failures, scale, and the on-demand nature.

Summary
- Grid computing focuses on computation-intensive computing (HPC).
- Though often federated, its architecture and key concepts have a lot in common with those of clouds.
- Are Grids/HPC converging towards clouds? E.g., compare OpenStack and Globus.

Announcements
- MP1: due this Sunday; demos Monday.
- VMs distributed: see Piazza.
- Demo signup sheet: soon on Piazza. Demo details: see Piazza.
- Make sure you print individual and total linecounts.
- Check Piazza often! It's where all the announcements are at!