Lecture 4-1
Computer Science 425: Distributed Systems
CS 425 / ECE 428, Fall 2013
Indranil Gupta (Indy)
September 5, 2013
Lecture 4: Failure Detection
Reading: Section 15.1 and parts of …
© 2013, I. Gupta

Lecture 4-2: Your New Datacenter
You've been put in charge of a datacenter (think of the Prineville Facebook DC), and your manager has told you: "Oh no! We don't have any failures in our datacenter!"
- Do you believe him/her?
- What would be your first responsibility? Build a failure detector.
- What are some things that could go wrong if you didn't do this?

Lecture 4-3: Failures Are the Norm
… not the exception, in datacenters.
- Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
- When you have 120 servers in the DC, the mean time until the next machine failure (MTTF) is 1 month.
- When you have 12,000 servers in the DC, the mean time until the next failure is about 7.2 hours!
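The arithmetic behind these numbers, as a quick check: assuming independent failures, the expected time until some machine in the DC fails is the per-machine MTTF divided by the number of machines N.

```latex
\text{MTTF}_{\text{DC}} \approx \frac{\text{MTTF}_{\text{machine}}}{N}:
\quad \frac{120~\text{months}}{120} = 1~\text{month},
\quad \frac{120~\text{months}}{12{,}000} = 0.01~\text{month} \approx 7.2~\text{hours}
```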

Lecture 4-4: To Build a Failure Detector
You have a few options:
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a (distributed) failure detector program that automatically detects failures and reports to your workstation.
Which is preferable, and why?

Lecture 4-5: Two Different System Models
Whenever someone gives you a distributed computing problem, the first question to ask is: "What is the system model under which I need to solve the problem?"
- Synchronous distributed system
  - Each message is received within bounded time
  - Each step in a process takes lb < time < ub
  - (Each local clock's drift has a known bound)
  - Examples: multiprocessor systems
- Asynchronous distributed system
  - No bounds on message transmission delays
  - No bounds on process execution
  - (The drift of a clock is arbitrary)
  - Examples: the Internet, wireless networks, datacenters, most real systems

Lecture 4-6: Failure Model
- Process omission failures
  - Crash-stop (fail-stop): a process halts and does not execute any further operations
  - Crash-recovery: a process halts, but then recovers (reboots) after a while
    - A special case of the crash-stop model (the process uses a new identifier on recovery)
- We will focus on crash-stop failures
  - They are easy to detect in synchronous systems
  - Not so easy in asynchronous systems

Lecture 4-7: What's a Failure Detector?
[Figure: two processes, pi and pj, connected by a network]

Lecture 4-8: What's a Failure Detector?
[Figure: pj crashes, marked with an X]
Crash-stop failure (pj is a failed process)

Lecture 4-9: What's a Failure Detector?
Crash-stop failure (pj is a failed process)
pi needs to know about pj's failure (pi is a non-faulty, or alive, process)
There are two main flavors of failure detectors…

Lecture 4-10: I. Ping-Ack Protocol
pi needs to know about pj's failure.
- pi queries pj with a ping once every T time units
- pj replies with an ack
- If pj does not respond within another T time units of being sent the ping, pi detects pj as failed
Worst-case detection time = 2T: if pj fails, then within T time units pi will send it a ping, and pi will time out within another T time units. The waiting time T is a parameter.
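A minimal sketch of pi's side of this loop; only the timing logic comes from the slide, while send_ping, wait_for_ack, and on_failure are assumed (hypothetical) transport callbacks:

```python
import time

T = 2.0  # waiting period in seconds; a tunable parameter

def ping_ack_detector(send_ping, wait_for_ack, on_failure):
    """pi's side of ping-ack; the three callbacks are assumptions.

    send_ping()           -- send one ping to pj
    wait_for_ack(timeout) -- True iff an ack arrives within `timeout` seconds
    on_failure()          -- invoked when pj is detected as failed
    """
    while True:
        start = time.monotonic()
        send_ping()                      # query pj once every T time units
        if not wait_for_ack(timeout=T):  # no ack within another T time units
            on_failure()                 # worst-case detection time = 2T
            return
        # schedule the next ping T seconds after this one went out
        time.sleep(max(0.0, T - (time.monotonic() - start)))
```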

Lecture 4-11: II. Heartbeating Protocol
pi needs to know about pj's failure.
- pj maintains a sequence number
- pj sends pi a heartbeat with an incremented sequence number every T time units
- If pi has not received a new heartbeat for the past, say, 3*T time units since it received the last heartbeat, then pi detects pj as failed
If T >> the round-trip time of messages, then the worst-case detection time ≈ 3*T (why?). The '3' can be changed to any positive number, since it is a parameter.
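Both sides of the protocol, sketched; the transport (send_heartbeat, last_heartbeat_time) is again an assumption, and only the sequence numbering and the 3*T timeout rule come from the slide:

```python
import time

T = 1.0          # heartbeat period in seconds
TIMEOUT = 3 * T  # the '3' is a tunable parameter

def heartbeat_sender(send_heartbeat):
    """pj's side: send an incremented sequence number every T seconds."""
    seq = 0
    while True:
        seq += 1
        send_heartbeat(seq)
        time.sleep(T)

def heartbeat_monitor(last_heartbeat_time, on_failure):
    """pi's side: detect pj as failed after 3*T seconds of silence.

    last_heartbeat_time() -- monotonic time the latest heartbeat arrived
    """
    while True:
        if time.monotonic() - last_heartbeat_time() > TIMEOUT:
            on_failure()
            return
        time.sleep(T)  # re-check once per period
```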

Lecture 4-12: In a Synchronous System
The ping-ack and heartbeat failure detectors are always "correct":
- If a process pj fails, then pi will detect its failure, as long as pi itself is alive
Why?
- Ping-ack: set the waiting time T to be > the round-trip time upper bound
  (pi->pj latency + pj processing + pj->pi latency + pi processing time)
- Heartbeat: set the waiting time 3*T to be > the round-trip time upper bound

Lecture 4-13: Failure Detector Properties
- Completeness: every process failure is eventually detected (no misses)
- Accuracy: every detected failure corresponds to a crashed process (no mistakes)
What is a protocol that is 100% complete? What is a protocol that is 100% accurate?
Completeness and accuracy:
- Can both be guaranteed 100% in a synchronous distributed system
- Can never be guaranteed simultaneously in an asynchronous distributed system
Why?

Lecture 4-14: Satisfying Both Completeness and Accuracy in Asynchronous Systems
Impossible, because of arbitrary message delays and message losses:
- If a heartbeat/ack from pj is dropped (or several are dropped), then pj will be mistakenly detected as failed => inaccurate detection
- How large would the T waiting period in ping-ack, or the 3*T waiting period in heartbeating, need to be to obtain 100% accuracy?
- In asynchronous systems, delays/losses on a network link are impossible to distinguish from a faulty process
So:
- Heartbeating satisfies completeness but not accuracy (why?)
- Ping-ack satisfies completeness but not accuracy (why?)

Lecture 4-15: Completeness or Accuracy? (in an asynchronous system)
Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness.
- Plenty of distributed apps are designed assuming 100% completeness, e.g., p2p systems
- "Err on the side of caution": processes are not stuck waiting for other processes
- It's OK to mistakenly detect a failure once in a while, since the victim process need only rejoin as a new process and catch up
Both heartbeating and ping-ack provide:
- Probabilistic accuracy: for a process detected as failed, with some probability close to (but not equal to) 1.0, it has actually crashed

Lecture 4-16: Failure Detection in a Distributed System
That was one process pj being detected and one process pi detecting failures. Let's extend this to an entire distributed system.
Difference from the original failure detection problem:
- We want failure detection of not merely one process (pj), but all processes in the system

Lecture 4-17: Centralized Heartbeating
[Figure: every process pj sends heartbeats (pj, incremented heartbeat seq. #) to a single central process pi]
Downside?

Lecture 4-18: Ring Heartbeating
[Figure: processes arranged in a ring; each pj sends heartbeats (pj, incremented heartbeat seq. #) to its ring neighbor(s)]
No SPOF (single point of failure). Downside?

Lecture 4-19: All-to-All Heartbeating
[Figure: each process pj sends heartbeats (pj, incremented heartbeat seq. #) to every other process, including pi]
Advantage: everyone is able to keep track of everyone. Downside?
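A sketch of the per-process bookkeeping that all-to-all heartbeating requires; the member list and the 3*T-style TIMEOUT constant are illustrative assumptions:

```python
import time

TIMEOUT = 3.0  # seconds without a fresh heartbeat before declaring failure

class AllToAllDetector:
    """Each process runs one of these and hears heartbeats from everyone."""

    def __init__(self, members):
        now = time.monotonic()
        # last time we heard a heartbeat from each member
        self.last_heard = {m: now for m in members}

    def on_heartbeat(self, sender):
        """Record the arrival of any member's heartbeat."""
        self.last_heard[sender] = time.monotonic()

    def failed_members(self):
        """All members we haven't heard from within TIMEOUT."""
        now = time.monotonic()
        return [m for m, t in self.last_heard.items() if now - t > TIMEOUT]
```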

Lecture 4-20: Efficiency of Failure Detectors: Metrics
- Bandwidth: the number of messages sent in the system during steady state (no failures)
  - Small is good
- Detection time: the time between a process crash and its detection
  - Small is good
- Scalability: how do bandwidth and detection properties scale with N, the number of processes?
- Accuracy
  - Large is good (low inaccuracy is good)

Lecture 4-21: Accuracy Metrics
False detection rate / false positive rate (inaccuracy). Multiple possible metrics:
1. The average number of failures detected per second, when there are in fact no failures
2. The fraction of failure detections that are false
Tradeoffs: if you increase the T waiting period in ping-ack, or the 3*T waiting period in heartbeating, what happens to:
- Detection time?
- False positive rate?
- Where would you set these waiting periods?

Lecture 4-22: Membership Protocols
Maintain a list of the other alive (non-faulty) processes at each process in the system.
The failure detector is a component of the membership protocol:
- Failure of pj detected -> delete pj from the membership list
- New machine pj joins -> pj sends a message to everyone -> add pj to the membership list
Flavors:
- Strongly consistent: all membership lists identical at all times (hard, may not scale)
- Weakly consistent: membership lists not identical at all times
- Eventually consistent: membership lists always moving towards becoming identical (scales well)

Lecture 4-23: Gossip-Style Membership
[Figure: each process pi maintains an array of heartbeat seq. numbers for a member subset]
Good accuracy properties.

Lecture 4-24: Gossip-Style Failure Detection Protocol
- Each process maintains a membership list
- Each process periodically increments its own heartbeat counter
- Each process periodically gossips its membership list to a few other processes
- On receipt, the heartbeat counters are merged (the higher counter wins per entry), and local times are updated
[Figure: membership table with columns Address / Heartbeat Counter / Time (local); current time is 70 at node 2 (asynchronous clocks). Fig. and animation by Dongyun Jin and Thuy Ngyuen]
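A sketch of that merge step; the entry layout follows the figure's columns (heartbeat counter plus local time), while the member addresses, gossip fan-out, and period are assumptions:

```python
import random
import time

class GossipMembership:
    def __init__(self, my_addr, members):
        self.my_addr = my_addr
        now = time.monotonic()
        # membership list: address -> [heartbeat counter, local time last updated]
        self.table = {m: [0, now] for m in members}

    def tick(self):
        """Periodically increment our own heartbeat counter."""
        entry = self.table[self.my_addr]
        entry[0] += 1
        entry[1] = time.monotonic()

    def merge(self, received_table):
        """On receiving a gossiped list, keep the higher heartbeat counter
        per entry and refresh our local time for entries that advanced."""
        now = time.monotonic()
        for addr, (hb, _their_time) in received_table.items():
            if addr not in self.table or hb > self.table[addr][0]:
                self.table[addr] = [hb, now]

    def gossip_targets(self, k=2):
        """Pick k random members (not ourselves) to gossip our list to."""
        others = [m for m in self.table if m != self.my_addr]
        return random.sample(others, min(k, len(others)))
```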

Lecture 4-25: Gossip-Style Failure Detection
Well-known result (you'll see it in a few months):
- In a group of N processes, it takes O(log(N)) time for a heartbeat update to propagate to everyone, with high probability
- Very robust against failures: even if a large number of processes crash, most/all of the remaining processes still receive all heartbeats
Failure detection: if a member's heartbeat counter has not increased for more than T_fail seconds, the member is considered failed.
- T_fail is usually set to O(log(N))
- But the entry is not deleted immediately: wait another T_cleanup seconds (usually = T_fail)
- Why not delete it immediately after the T_fail timeout?

Lecture 4-26: Gossip-Style Failure Detection
What if an entry pointing to a failed node is deleted right after T_fail (= 24) seconds? A stale copy of the entry, still circulating via gossip from a process that hasn't timed it out yet, could be re-inserted as if it were fresh.
Fix: remember the deleted entry for another T_fail seconds.
[Figure: membership table; current time is 75 at node 2]
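A sketch of this two-phase timeout (mark failed after T_fail, delete only after a further cleanup period), applied to the table layout used above; the constants are the slide's illustrative values:

```python
import time

T_FAIL = 24.0     # seconds without heartbeat growth before marking failed
T_CLEANUP = 24.0  # extra remembering period before deletion (usually = T_FAIL)

def sweep(table):
    """table: address -> [heartbeat counter, local time last updated].
    Returns the addresses currently considered failed; entries past the
    cleanup period are removed from the table."""
    now = time.monotonic()
    failed = []
    for addr in list(table):
        age = now - table[addr][1]
        if age > T_FAIL + T_CLEANUP:
            # safe to forget: stale gossiped copies have timed out elsewhere too
            del table[addr]
        elif age > T_FAIL:
            # mark failed, but keep the entry so late gossip can't resurrect it
            failed.append(addr)
    return failed
```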

Lecture 4-27: Other Types of Failures
Let's discuss the other types of failures. Failure detectors exist for them too (but we won't discuss those).

Lecture 4-28: Processes and Channels
[Figure: a sending process writes into an outgoing message buffer; messages cross a communication channel into the receiving process's incoming message buffer]

Lecture 4-29: Other Failure Types
- Communication omission failures
  - Send-omission: loss of messages between the sending process and the outgoing message buffer (both inclusive). What might cause this?
  - Channel omission: loss of messages in the communication channel. What might cause this?
  - Receive-omission: loss of messages between the incoming message buffer and the receiving process (both inclusive). What might cause this?

Lecture 4-30: Other Failure Types
- Arbitrary failures
  - Arbitrary process failure: the process arbitrarily omits intended processing steps or takes unintended processing steps
  - Arbitrary channel failure: messages may be corrupted, duplicated, delivered out of order, or incur extremely large delays; non-existent messages may be delivered
- The above two are Byzantine failures, e.g., due to hackers, man-in-the-middle attacks, viruses, worms, etc., and even bugs in the code
- A variety of Byzantine fault-tolerant protocols have been designed in the literature!

Lecture 4-31: Omission and Arbitrary Failures

Class of failure | Affects | Description
Fail-stop or crash-stop | Process | Process halts and remains halted. Other processes may detect this state.
Omission | Channel | A message inserted in an outgoing message buffer never arrives at the other end's incoming message buffer.
Send-omission | Process | A process completes a send, but the message is not put in its outgoing message buffer.
Receive-omission | Process | A message is put in a process's incoming message buffer, but that process does not receive it.
Arbitrary (Byzantine) | Process or channel | Process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times, or commit omissions; a process may stop or take an incorrect step.

Lecture 4-32: Summary
- Failure detectors
- Completeness and accuracy
- Ping-ack and heartbeating
- Gossip-style failure detection
- Suspicion, membership

Lecture 4-33: Next Week
Reading for the next two lectures: Sections …
- Time and Synchronization
- Global States and Snapshots
HW1 already out, due Sep 19th.
MP1 already out, due 9/15. By now you should:
- Be in a group (send to us … TODAY, subject line: "425 MP group"); use Piazza to find partners
- Have a basic design
- If you've already started coding, you're doing well

Lecture 4-34: Optional: Suspicion
Augment failure detection with a suspicion count.
Example: in all-to-all heartbeating, suspicion count = the number of machines that have timed out waiting for heartbeats from a particular machine M.
- When the suspicion count crosses a threshold, declare M failed
- Issues: who maintains this count? If distributed, the count needs to be circulated
- Lowers mistaken detections (e.g., due to a dropped message or a bad Internet path); used, e.g., in the Cassandra key-value store
- Can also keep much longer-term failure counts, and use these to blacklist and greylist machines, e.g., in the OpenCorral CDN
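A sketch of the suspicion-count idea with a centralized tally; the threshold value and who maintains the tally are exactly the design questions the slide raises, so both are illustrative:

```python
THRESHOLD = 3  # observers that must time out before declaring failure (illustrative)

class SuspicionTally:
    def __init__(self):
        # machine -> set of observers whose heartbeat wait for it timed out
        self.suspecting = {}

    def report_timeout(self, observer, machine):
        """An observer reports that `machine` missed its heartbeat deadline.
        Returns True once enough observers agree, i.e., declare it failed."""
        self.suspecting.setdefault(machine, set()).add(observer)
        return len(self.suspecting[machine]) >= THRESHOLD

    def report_heartbeat(self, machine):
        """A fresh heartbeat from `machine` clears all suspicion of it."""
        self.suspecting.pop(machine, None)
```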