CSE 486/586 Distributed Systems Failure Detectors

CSE 486/586 Distributed Systems Failure Detectors Steve Ko Computer Science and Engineering University at Buffalo

Recap Best Practices

Today’s Question How do we handle failures? We cannot answer this fully (yet!), but you’ll learn new terminology, definitions, etc. Failure handling is one of the fundamental challenges in distributed systems, alongside ordering (with concurrency) and others. Let’s start with some new definitions.

Two Different System Models Synchronous distributed system: each message is received within bounded time; each step in a process takes time bounded below by lb and above by ub; (each local clock’s drift has a known bound). Example: multiprocessor systems. Asynchronous distributed system: no bounds on message transmission delays; no bounds on process execution time; (the drift of a clock is arbitrary). Examples: the Internet, wireless networks, datacenters, most real systems. These models are used to reason about how protocols would behave, e.g., in formal proofs.

Failure Model What is a failure? We’ll consider the process omission failure: a process disappears. Permanently: crash-stop (fail-stop), where a process halts and does not execute any further operations. Temporarily: crash-recovery, where a process halts but then recovers (reboots) after a while. We will focus on crash-stop failures, meaning we assume there are no other failures (e.g., network errors); more failure types come at the end of this lecture. Crash-stop failures are easy to detect in synchronous systems, but not so easy in asynchronous systems.

Why, What, and How Why design a failure detector? It is the first step toward failure handling. What do we want from a failure detector? No misses (completeness) and no mistakes (accuracy). How do we design one?

What is a Failure Detector? Consider two processes, pi and pj. A crash-stop failure occurs: pj is now a failed process, while pi is a non-faulty (alive) process. pi needs to learn about pj’s failure. There are two styles of failure detectors.

I. Ping-Ack Protocol pi sends pj a ping once every T time units, and pj replies with an ack. If pj does not respond within another T time units of being sent the ping, pi detects/declares pj as failed. If pj fails, then within T time units pi will send it a ping message, and pi will time out within another T time units, so the worst-case detection time is 2T. The waiting time T is a parameter.
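As a minimal sketch (not from the slides), pi’s side of ping-ack can be kept as a table of outstanding pings; the names PingAckDetector and send_ping are illustrative assumptions, and a real deployment would use sockets and timers:

```python
class PingAckDetector:
    """p_i's side of the ping-ack protocol (illustrative sketch).

    p_i pings p_j every T time units; if no ack arrives within T of a
    ping being sent, p_j is declared failed (worst-case detection 2T).
    """
    def __init__(self, T):
        self.T = T            # waiting period, a parameter
        self.pending = {}     # process id -> time its last ping was sent

    def ping(self, pid, now, send_ping):
        send_ping(pid)        # transport hook (assumed), e.g., a UDP send
        self.pending[pid] = now

    def on_ack(self, pid):
        self.pending.pop(pid, None)   # p_j answered in time

    def failed(self, now):
        # Processes whose ping has gone unanswered for more than T
        return [pid for pid, sent in self.pending.items()
                if now - sent > self.T]
```

For example, with T = 2, a ping sent at time 0 that is still unacknowledged at time 3 causes pj to be declared failed, while an ack at any point before then clears the pending entry.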

II. Heartbeating Protocol pj maintains a sequence number and sends pi a heartbeat with an incremented sequence number every T time units. If pi has not received a new heartbeat for the past, say, 3T time units since it received the last heartbeat, then pi detects pj as failed. If T ≫ the round-trip time of messages, then the worst-case detection time is about 3T (why?). The multiplier 3 can be changed to any positive number, since it is a parameter.
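A corresponding sketch of pi’s side of heartbeating (again illustrative; the class name and the 3T window multiplier are assumptions, not from the slides) tracks the arrival time of the freshest heartbeat per process:

```python
class HeartbeatDetector:
    """p_i's side of heartbeating (illustrative sketch).

    p_j sends heartbeats with increasing sequence numbers every T time
    units; p_i suspects p_j after (multiplier * T) units of silence.
    """
    def __init__(self, T, multiplier=3):
        self.window = multiplier * T   # the '3T' timeout, a parameter
        self.last = {}                 # pid -> (seq, arrival time)

    def on_heartbeat(self, pid, seq, now):
        prev = self.last.get(pid)
        if prev is None or seq > prev[0]:   # ignore stale/reordered ones
            self.last[pid] = (seq, now)

    def suspected(self, now):
        # Processes silent for longer than the timeout window
        return [pid for pid, (_, t) in self.last.items()
                if now - t > self.window]
```

The sequence number is what lets pi discard a delayed, out-of-order heartbeat instead of mistaking it for fresh evidence that pj is alive.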

In a Synchronous System The ping-ack and heartbeat failure detectors are always correct. For example (there are other ways): for ping-ack, set the waiting time T to be greater than the upper bound on the round-trip time; for heartbeating, set the waiting time 3T to be greater than that upper bound. The following property is then guaranteed: if a process pj fails, then pi will detect its failure as long as pi itself is alive, because pj’s next ack/heartbeat will not be received within the timeout, and thus pi will detect pj as having failed.

Failure Detector Properties What does it mean for a failure detector to be “correct”? Completeness: every process failure is eventually detected (no misses). Accuracy: every detected failure corresponds to a crashed process (no mistakes). Completeness and accuracy can both be guaranteed 100% in a synchronous distributed system (with reliable message delivery in bounded time), but can never be guaranteed simultaneously in an asynchronous distributed system. Why?

Completeness and Accuracy in Asynchronous Systems Guaranteeing both is impossible because of arbitrary message delays: if a heartbeat/ack from pj is dropped (or several are), then pj will be mistakenly detected as failed, i.e., inaccurate detection. How large would the T waiting period in ping-ack, or the 3T waiting period in heartbeating, need to be to obtain 100% accuracy? In asynchronous systems, delays on a network link are impossible to distinguish from a faulty process. Heartbeating satisfies completeness but not accuracy (why?); ping-ack likewise satisfies completeness but not accuracy (why?). The point: you can’t design a perfect failure detector! You need to think about which metrics are important.

Completeness or Accuracy? (in an Asynchronous System) Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness. Plenty of distributed apps are designed assuming 100% completeness, e.g., p2p systems: “err on the side of caution,” so processes are never stuck waiting for other processes. It is OK to mistakenly detect a failure once in a while, since the victim process need only rejoin as a new process (more later). Both heartbeating and ping-ack provide probabilistic accuracy: for a process detected as failed, with some probability close to (but not equal to) 1.0, it is true that it has actually crashed.

CSE 486/586 Administrivia PA2A due in roughly two weeks (Fri, 2/22) Please use Piazza; all announcements will go there. If you want an invite, let me know. Please come during the office hours! Give feedback about the class, ask questions, etc.

Failure Detection in a Distributed System That was one process pj being detected and one process pi detecting failures. Let’s extend it to an entire distributed system. The difference from the original failure detection problem is that we want failure detection not merely of one process (pj), but of all processes in the system. Any ideas? (Again: why, what, and how.)

Efficiency of Failure Detector: Metrics Bandwidth: the number of messages sent in the system during steady state (no failures); small is good. Detection time: the time between a process crash and its detection; small is good. Scalability: given the bandwidth and the detection properties, can you scale to a thousand or a million nodes? Large is good. Accuracy: large is good (lower inaccuracy is good).

Accuracy Metrics False detection rate: the average number of failures detected per second when there are in fact no failures; alternatively, the fraction of failure detections that are false. Tradeoffs: if you increase the T waiting period in ping-ack or the 3T waiting period in heartbeating, what happens to the detection time? To the false positive rate? Where would you set these waiting periods?
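A tiny Monte Carlo experiment makes the tradeoff visible. Assuming, purely for illustration, that a correct process’s ack is delayed by an exponentially distributed round-trip time with mean 1.0 (this delay model is an assumption, not from the slides), the fraction of delays exceeding the timeout is the false positive rate:

```python
import random

def false_positive_rate(timeout, trials=10_000, seed=42):
    """Fraction of acks from a *correct* process that arrive after the
    timeout, under an assumed exponential delay model (mean 1.0)."""
    rng = random.Random(seed)
    late = sum(1 for _ in range(trials)
               if rng.expovariate(1.0) > timeout)
    return late / trials
```

Raising the timeout drives the false positive rate toward zero, but it also lengthens the detection time, since a real crash is only noticed after the timeout expires; that is exactly the tradeoff the question above is getting at.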

Centralized Heartbeat Every process pj sends its heartbeats (with incremented sequence numbers) to a single central process pi. Downside?

Ring Heartbeat Processes are arranged in a ring, and each process pj sends its heartbeats (with incremented sequence numbers) to its neighbor(s) on the ring. Downside?

All-to-All Heartbeat Every process sends its heartbeats (with incremented sequence numbers) to every other process. Advantage: everyone is able to keep track of everyone. Downside?
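The steady-state bandwidth of the three layouts above can be compared directly; the function below is a sketch (its name and interface are assumptions, not from the slides) counting heartbeat messages sent per period T among n processes:

```python
def heartbeats_per_period(n, topology):
    """Heartbeat messages sent per period T, steady state, n processes."""
    if topology == "centralized":
        return n - 1           # each non-leader heartbeats one leader
    if topology == "ring":
        return n               # each process heartbeats its successor
    if topology == "all-to-all":
        return n * (n - 1)     # everyone heartbeats everyone else
    raise ValueError(f"unknown topology: {topology}")
```

The counts mirror each slide’s “Downside?”: centralized is cheap but the leader is a single point of failure; the ring stays cheap but one failure can break the detection chain; all-to-all gives everyone full knowledge at the cost of O(n^2) messages per period.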

Other Types of Failures Let’s discuss the other types of failures. Failure detectors exist for them too (but we won’t discuss those).

Processes and Channels Process p performs send(m); the message passes through p’s outgoing message buffer, across the communication channel, and into q’s incoming message buffer; process q then performs receive.

Other Failure Types Communication omission failures. Send-omission: loss of a message between the sending process and the outgoing message buffer (both inclusive). What might cause this? Channel omission: loss of a message in the communication channel. Receive-omission: loss of a message between the incoming message buffer and the receiving process (both inclusive).

Other Failure Types Arbitrary failures. Arbitrary process failure: the process arbitrarily omits intended processing steps or takes unintended processing steps. Arbitrary channel failure: messages may be corrupted, duplicated, delivered out of order, or incur extremely large delays; non-existent messages may be delivered. These two are Byzantine failures, e.g., due to hackers, man-in-the-middle attacks, viruses, worms, etc. A variety of Byzantine fault-tolerant protocols have been designed in the literature!

Omission and Arbitrary Failures (class of failure / what it affects / description):
Fail-stop (process): the process halts and remains halted; other processes may detect this state.
Omission (channel): a message inserted in an outgoing message buffer never arrives at the other end’s incoming message buffer.
Send-omission (process): a process completes a send, but the message is not put in its outgoing message buffer.
Receive-omission (process): a message is put in a process’s incoming message buffer, but that process does not receive it.
Arbitrary, i.e., Byzantine (process or channel): the process/channel exhibits arbitrary behavior: it may send/transmit arbitrary messages at arbitrary times or commit omissions; a process may stop or take an incorrect step.

Summary Failure detectors are required in distributed systems to keep the system running in spite of process crashes. Properties: completeness and accuracy; together they are unachievable in asynchronous systems but achievable in synchronous systems. Most apps require 100% completeness but can tolerate some inaccuracy. Two failure detector algorithms: heartbeating and ping-ack. Distributed failure detection through heartbeating: centralized, ring, all-to-all. Metrics: bandwidth, detection time, scalability, accuracy. Other types of failures. Next: the notion of time in distributed systems.

Acknowledgements These slides contain material developed and copyrighted by Indranil Gupta at UIUC.