Distributed Systems Lecture 4: Failure Detection

Previous lecture
Big Data and MapReduce

Motivation
Large datacenters need to be available on demand and to process data reliably. However, hardware fails, so we need to ensure hardware is available with small downtime:
– Amazon promises 99.95% uptime per month
– Google promises >99.9% uptime per month
Is that enough? What do we do when hardware fails?
Butler Lampson: “A distributed system is a system in which I can’t get my work done because a computer that I’ve never heard of has failed.”

Example
Suppose the rate of disk failure is once every 10 years per server. For 120 servers in the datacenter this translates to one failure every month; for 12,000 servers it goes down to roughly every 7.3 hours. What are our options?
1. Hire a team to monitor machines in the datacenter and report to you when they fail
2. Write a (distributed) failure-detector program that automatically detects failures and reports to your workstation
In 2002, when the ASCI Q supercomputer (2nd fastest in the world at the time) was installed at Los Alamos National Laboratory in New Mexico, the computer could not run for more than an hour without crashing.
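These numbers are simple scaling: with n servers failing independently, the expected time between failures anywhere in the datacenter is the per-server figure divided by n. A quick sanity check (a minimal sketch, assuming the slide’s 10-year per-server rate):

```python
HOURS_PER_YEAR = 365 * 24
mtbf_server_hours = 10 * HOURS_PER_YEAR  # one failure per server every 10 years

for n_servers in (120, 12_000):
    # n independent servers shrink the datacenter-wide time between
    # failures by a factor of n.
    hours = mtbf_server_hours / n_servers
    print(f"{n_servers:>6} servers: one failure every {hours:7.1f} hours "
          f"(~{hours / 24:.1f} days)")
```

For 120 servers this prints about 730 hours (roughly a month); for 12,000 servers, about 7.3 hours.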

Two types of DSs
Synchronous distributed system:
– Each message is received within bounded time
– Each step in a process takes lb < time < ub
– Each local clock’s drift has a known bound
– Example: multiprocessor systems
Asynchronous distributed system:
– No bounds on message transmission delays
– No bounds on process execution time
– The drift of a clock is arbitrary
– Example: the Internet, wireless networks, datacenters, most real systems

Problems with Distributed Systems
Failures are more frequent:
– Many places to fail
– More complexity: more pieces, and new challenges such as asynchrony and communication
Potential consequences of failures:
– Single system → everything stops
– Distributed system → some parts may continue

First step: Goals
Availability: can I use it now?
Reliability: will it be up as long as I need it?
Safety: if it fails, what are the consequences?
Maintainability: how easy is it to fix if it breaks?

Next step: Failure models
Failure: the system does not behave as expected
– Component-level failure (can be compensated for)
– System-level failure (incorrect result)
Fault: the cause of a failure (component-level)
– Transient: not repeatable
– Intermittent: repeats, but (apparently) independently of system operations
– Permanent: exists until the component is repaired
Failure model: how the system behaves when it doesn’t behave properly
Failure semantics: describe and classify the errors that a distributed system can experience

Failure classification
Correct: in response to inputs, behaves in a manner consistent with the service specification
Omission failure: does not respond to input
– Crash: after the first omission failure, subsequent requests result in omission failures
Timing failure (early, late): correct response, but outside the required time window
Response failure:
– Value: wrong output for the inputs
– State transition: server ends in the wrong state

Crash failure types (based on recovery behavior)
Crash-stop (fail-stop): the process halts and does not execute any further operations
– Halting: never restarts
Crash-recovery: the process halts, but then recovers (reboots) after a while
– Special case of the crash-stop model (the process uses a new identifier on recovery)
– Classification by recovered state:
– Amnesia: server recovers to a predefined state, independent of the operations before the crash
– Partial amnesia: some part of the state is as before the crash, the rest is reset to a predefined state
– Pause: recovers to the state it had before the omission failure

Hierarchical failure masking
– Dependency: a higher level gets (at best) the failure semantics of the lower level
– A level can compensate for lower-level failures to improve this
– Example: TCP fixes communication errors, so some failure semantics are not propagated to the higher level

Group failure masking
Redundant servers: a failed server is masked by the others in its group, which allows the failure semantics of the group to be stronger than those of the individual members.
k-fault tolerant: the group can mask k concurrent group-member failures from the client.
Masking may “upgrade” failure semantics. Example: the group detects a non-responsive server and another member picks up the slack, so an omission failure becomes a performance failure.

Detecting failures
[Figure: two processes, pi and pj; pj suffers a crash-stop failure (pj is a failed process) and pi, a non-faulty (alive) process, needs to know about pj’s failure.]
There are two main flavors of failure detectors:
1. Ping-ack (proactive)
2. Heartbeat (reactive)

Ping-ack protocol
pi needs to know about pj’s failure:
– pi queries pj with a ping once every T time units
– pj replies with an ack
– if pj does not respond within another T time units of being sent the ping, pi detects pj as failed
Worst-case detection time = 2T: if pj fails, then within T time units pi will send it a ping, and pi will time out within another T time units. The waiting time T can be parameterized.
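A minimal sketch of the ping-ack loop; send_ping, wait_for_ack, and on_failure are assumed transport and callback hooks, not part of any particular library:

```python
import time

def ping_ack_monitor(peer, send_ping, wait_for_ack, on_failure, T):
    """Detect a crash of `peer`; worst-case detection time is 2*T.

    send_ping(peer)             -- fire a ping message (assumed hook)
    wait_for_ack(peer, timeout) -- True iff an ack arrives within `timeout`
    on_failure(peer)            -- called once `peer` is declared failed
    """
    while True:
        send_ping(peer)
        if not wait_for_ack(peer, timeout=T):
            on_failure(peer)  # no ack within T: declare peer failed
            return
        time.sleep(T)         # send the next ping one period later
```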

Heartbeat protocol
pi needs to know about pj’s failure:
– pj maintains a sequence number
– pj sends pi a heartbeat with an incremented sequence number every T time units
– if pi has not received a new heartbeat for the past, say, 3T time units since it received the last heartbeat, then pi detects pj as failed
If T >> the round-trip time of messages, then the worst-case detection time is about 3T (why?). The 3 can be changed to any positive number, since it is a parameter.
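The receiving side, sketched with the same kind of assumed hooks (recv_heartbeat stands in for the real message delivery); k is the multiplier the slide sets to 3:

```python
import time

def heartbeat_monitor(peer, recv_heartbeat, on_failure, T, k=3):
    """Declare `peer` failed after k*T seconds without a fresh heartbeat.

    recv_heartbeat(peer, timeout) -- latest sequence number received
                                     within `timeout` seconds, or None
    """
    last_seq = -1
    deadline = time.monotonic() + k * T
    while time.monotonic() < deadline:
        seq = recv_heartbeat(peer, timeout=T)
        if seq is not None and seq > last_seq:
            last_seq = seq                        # fresh heartbeat
            deadline = time.monotonic() + k * T   # reset the k*T window
    on_failure(peer)  # silence for k*T: declare peer failed
```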

Synchronous DS case
The ping-ack and heartbeat failure detectors are always “correct” in a synchronous DS: if a process pj fails, then pi will detect its failure, as long as pi itself is alive. Why?
– Ping-ack: set the waiting time T to be greater than the round-trip upper bound: pi → pj latency + pj processing + pj → pi latency + pi processing time
– Heartbeat: set the waiting time 3T to be greater than the round-trip upper bound

Failure detector properties
Completeness: every process failure is eventually detected (no misses)
Accuracy: every detected failure corresponds to a crashed process (no mistakes)
Completeness and accuracy:
– can both be guaranteed 100% in a synchronous distributed system
– can never be guaranteed simultaneously in an asynchronous distributed system

Satisfying completeness and accuracy in asynchronous DS
Impossible, because of arbitrary message delays and message losses:
– If a heartbeat/ack from pj is dropped (or several are dropped), then pj will be mistakenly detected as failed → inaccurate detection
– How large would the T waiting period in ping-ack (or the 3T heartbeat waiting period) need to be to obtain 100% accuracy?
– In asynchronous systems, delays/losses on a network link are impossible to distinguish from a faulty process
Heartbeat satisfies completeness but not accuracy; ping-ack likewise satisfies completeness but not accuracy.

Completeness or accuracy in asynchronous DS
Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness. Many distributed apps are designed assuming 100% completeness, e.g., P2P systems:
– “Err on the side of caution”: processes are not left stuck waiting for other processes
– If an identification error is made, the victim process rejoins as a new process and catches up
Heartbeating and ping-ack provide probabilistic accuracy: for a process detected as failed, it is true with some probability close to (but not equal to) 1.0 that it has actually crashed.

Failure detection across the DS
We want failure detection of not merely one process (pj), but of all processes in the DS. Approaches:
– Centralized heartbeat
– Ring heartbeat
– All-to-all heartbeat
Who guards the failure detectors?

Centralized heartbeat
[Figure: every process pj sends (pj, heartbeat seq++) to a single central process pi.]

Ring heartbeat
[Figure: processes arranged in a ring; each pj sends (pj, heartbeat seq++) to its neighbor(s) on the ring.]
No SPOF (single point of failure).

All-to-all heartbeat
[Figure: each pj sends (pj, heartbeat seq++) to every other process.]
Advantage: everyone is able to keep track of everyone.
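The three schemes differ only in whom each process sends its heartbeat to. A sketch of that choice (hypothetical helper; assumes every process knows the ordered member list, and that the ring order is the list order):

```python
def heartbeat_targets(topology, members, me, central=None):
    """Return the processes `me` should send its heartbeat to."""
    if topology == "centralized":
        # everyone reports to one monitor -- simple, but a SPOF
        return [] if me == central else [central]
    if topology == "ring":
        i = members.index(me)
        return [members[(i + 1) % len(members)]]  # successor on the ring
    if topology == "all-to-all":
        return [p for p in members if p != me]    # everyone tracks everyone
    raise ValueError(f"unknown topology: {topology}")
```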

Detection efficiency metrics
Bandwidth: the number of messages sent in the system during steady state (no failures); small is good
Detection time: the time between a process crash and its detection; small is good
Scalability: how do bandwidth and detection time scale with N, the number of processes?
Accuracy: large is good

Accuracy metrics
False detection rate / false positive rate (inaccuracy); multiple metrics are possible:
1. Average number of failures detected per second, when there are in fact no failures
2. Fraction of failure detections that are false
Trade-offs: if you increase the T waiting period in ping-ack, or the 3T waiting period in heartbeating, what happens to the detection time? To the false positive rate? Where would you set these waiting periods?

Membership protocols
Maintain a list of the other alive (non-faulty) processes at each process in the system. The failure detector is a component of the membership protocol:
– Failure of pj detected → delete pj from the membership list
– New machine joins → pj sends a message to everyone → add pj to the membership list
Flavors:
– Strongly consistent: all membership lists are identical at all times (hard, may not scale)
– Weakly consistent: membership lists are not necessarily identical at all times
– Eventually consistent: membership lists are always moving towards becoming identical eventually (scales well)
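The local bookkeeping itself is small; a minimal (hypothetical) sketch of a membership list driven by those two events:

```python
class Membership:
    """Local view of which processes are alive (a sketch)."""

    def __init__(self):
        self.alive = set()

    def on_join(self, pj):
        self.alive.add(pj)       # pj announced itself to everyone

    def on_failure_detected(self, pj):
        self.alive.discard(pj)   # failure detector reported pj crashed
```

The hard part, which the flavors above classify, is keeping these local views consistent across processes.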

Gossip protocols
Gossip protocols mimic gossip in a social network; they are efficient to use at the large scale of a DS and have good accuracy properties.
[Figure: process pi gossips an array of heartbeat sequence numbers for a subset of the members.]
In a random search, the access time to any VM is at most n³ for a regular graph, and a third-degree polynomial for any graph.

Gossip-based failure detection
Protocol:
– Each process maintains a membership list: (address, heartbeat counter, local time of last update)
– Each process periodically increments its own heartbeat counter
– Each process periodically gossips its membership list
– On receipt, the heartbeat counters are merged (keeping the larger value per member), and the local times are updated
[Table: membership list with columns Address, Heartbeat Counter, Time (local); current time 70 at node 2 (clocks are asynchronous).]
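A sketch of the merge step (a minimal sketch; entries are assumed to map address → (heartbeat counter, local time of last update), as in the table above):

```python
import time

def merge_membership(local, received):
    """Merge a gossiped membership list into `local`.

    Both dicts map address -> (heartbeat_counter, last_update_local_time).
    Keep the larger heartbeat counter per member; whenever a counter
    advances, restamp the entry with *our own* local time (clocks are
    asynchronous, so the sender's timestamps are never used directly).
    """
    now = time.monotonic()
    for addr, (hb, _sender_time) in received.items():
        local_hb = local[addr][0] if addr in local else -1
        if hb > local_hb:
            local[addr] = (hb, now)  # fresher info about this member
    return local
```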

Gossip-based failure detection
– O(log(N)) time for a heartbeat update to propagate to everyone, with high probability
– Very robust against failures: even if a large number of processes crash, most or all of the remaining processes still receive all heartbeats
– Failure detection: if a member’s heartbeat has not increased for more than Tfail seconds, the member is considered failed; Tfail is usually set to O(log(N))
– But the entry is not deleted immediately: wait another Tcleanup seconds (usually = Tfail). Why?

Gossip-based failure detection
What if an entry pointing to a failed node is deleted right after Tfail (= 24) seconds? A gossip message from another node that has not yet detected the failure may still carry the old entry, and the deleted member would be re-added as if it were alive. Solution: remember the deletion for another Tcleanup seconds.
[Figure: membership lists; current time 75 at node 2.]
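A sketch of the periodic sweep implementing the two timeouts, continuing the data layout from the merge sketch above (suspects is an assumed side table recording when each member was marked failed):

```python
def sweep(local, suspects, now, t_fail, t_cleanup):
    """Mark members failed after t_fail of silence; delete t_cleanup later.

    local:    address -> (heartbeat_counter, last_update_local_time)
    suspects: address -> local time at which the member was marked failed
    (While suspected, gossip about the member should be ignored, so a
    stale entry cannot resurrect it.)
    """
    for addr, (_hb, last_update) in list(local.items()):
        if addr in suspects:
            if now - suspects[addr] >= t_cleanup:
                del local[addr]        # safe to forget the member now
                del suspects[addr]
        elif now - last_update >= t_fail:
            suspects[addr] = now       # silent too long: mark as failed
```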

Other types of failures
Communication omission failures:
– Send-omission: loss of messages between the sending process and the outgoing message buffer (both inclusive)
– Channel omission: loss of messages in the communication channel
– Receive-omission: loss of messages between the incoming message buffer and the receiving process (both inclusive)
Arbitrary failures:
– Arbitrary process failure: the process arbitrarily omits intended processing steps or takes unintended processing steps
– Arbitrary channel failure: messages may be corrupted, duplicated, delivered out of order, or incur extremely large delays; non-existent messages may be delivered
– These are known as Byzantine failures, caused e.g. by hackers, man-in-the-middle attacks, viruses, worms, and even bugs in the code
– A variety of Byzantine fault-tolerant protocols have been designed in the literature

Omission and arbitrary failures
Class of failure (what it affects): description
– Fail-stop or crash-stop (process): the process halts and remains halted; other processes may detect this state
– Omission (channel): a message inserted in an outgoing message buffer never arrives at the other end’s incoming message buffer
– Send-omission (process): a process completes a send, but the message is not put in its outgoing message buffer
– Receive-omission (process): a message is put in a process’s incoming message buffer, but that process does not receive it
– Arbitrary, i.e. Byzantine (process or channel): the process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times or commit omissions; a process may stop or take an incorrect step

Next lecture
Time and synchronization