The Fault Detection Problem. Andreas Haeberlen, MPI-SWS; Petr Kuznetsov, TU Berlin / Deutsche Telekom. BFTW 3 workshop (Sep 22, 2009). © 2009 Andreas Haeberlen

Presentation transcript:

The Fault Detection Problem
Andreas Haeberlen (MPI-SWS), Petr Kuznetsov (TU Berlin / Deutsche Telekom Labs)
BFTW 3 workshop (Sep 22, 2009), © 2009 Andreas Haeberlen

Fault tolerance vs. fault detection
- Distributed systems need to be robust against faults.
- Approaches: masking and detection.
- The two are complementary: in many systems, we want both!
- Masking is well understood; detection is not.
[Figure: a network in which a node reports "Machine XYZ is faulty"]

What we know about fault detection
- Rich literature on detecting crash faults, e.g. the failure detector abstraction [Chandra96].
- But very little on general (Byzantine) faults:
  - Byzantine failure detector for consensus [Kihlstrom03]
  - Several specific algorithms: SUNDR, PeerReview, ...
- What we do not know about general faults (the focus of this talk):
  - Which faults are detectable in a given system?
  - What is the complexity of detection?
  - How does detection depend on synchrony?
  - How much do cryptographic primitives help?
  - ...

Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the "cost" of detection

System model
- Set $N$ of nodes, with one terminal per node.
- Reliable unicast network; messages can be signed.
- The system is asynchronous.
- An execution is a sequence of events $e_k = (i_k, I_k, O_k)$.
- Node $i$ has an algorithm $A_i = (M_i, TI_i, TO_i, \Sigma_i, \sigma_{0,i}, \alpha_i)$.
- The algorithm is deterministic.
- We say that node $i$ is correct if it follows its algorithm $A_i$.
- Distributed algorithm $A := (A_1, A_2, \ldots, A_n)$.
[Figure: nodes A, B, and C, each with a terminal, connected by the network]
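
The notion of a correct node can be illustrated with a small replay check. The following is a minimal sketch, assuming a node is modeled only by an initial state and a deterministic transition function, and assuming the checker sees all of the node's inputs and outputs (something no single node can do in the asynchronous model of the slide). The names `NodeAlgorithm`, `follows_algorithm`, and `ack_transition` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Inputs = FrozenSet[str]
Outputs = FrozenSet[str]


@dataclass(frozen=True)
class NodeAlgorithm:
    """Deterministic state machine: an initial state plus a transition function."""
    initial_state: int
    # Maps (state, inputs) to (next state, outputs).
    transition: Callable[[int, Inputs], Tuple[int, Outputs]]


def follows_algorithm(alg: NodeAlgorithm,
                      events: List[Tuple[Inputs, Outputs]]) -> bool:
    """Replay one node's (inputs, outputs) events and compare with the algorithm."""
    state = alg.initial_state
    for inputs, outputs in events:
        state, expected = alg.transition(state, inputs)
        if outputs != expected:
            return False  # the node deviated from its algorithm
    return True


def ack_transition(state: int, inputs: Inputs) -> Tuple[int, Outputs]:
    # Hypothetical toy algorithm: acknowledge every input and count inputs in the state.
    return state + len(inputs), frozenset(f"ack:{m}" for m in inputs)


ACK = NodeAlgorithm(initial_state=0, transition=ack_transition)
honest = [(frozenset({"a"}), frozenset({"ack:a"}))]
deviating = [(frozenset({"a"}), frozenset({"ack:b"}))]
print(follows_algorithm(ACK, honest), follows_algorithm(ACK, deviating))  # True False
```

Because the algorithm is deterministic, a single mismatch between the replayed and the observed outputs is enough to show that the node did not follow it.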

Intuition: What is fault detection?
- Given: an algorithm $A$ and a set of faults $F$ to detect.
- Goal: when a fault $f \in F$ occurs, correct nodes output a list of faulty nodes to their terminals.
- "Fault detection problem" for $F$: find a function $\tau_F$ that maps any algorithm $A$ to another algorithm $\tau_F(A)$ that solves the same problem but can additionally detect all the faults in $F$.
[Figure: example with nodes A, B, C, and D; labels include 0<x<10, x+y, and the terminal output FAULTY(C)]

Intuition: Faults and extensions
- How should a general "fault" be defined? It needs to be specific to an algorithm!
- First approximation: a tuple $(A, e)$.
- Problem: we need to change the algorithm to do fault detection, which yields a new algorithm $\bar{A} = \tau_F(A)$! What does it mean to detect a fault $(A, e)$ in an execution of $\bar{A}$?
- We need to restrict the power of $\tau$!
- Idea: produce an extension of $A$ that "works exactly like $A$", except that it does some additional work to detect faults.

Definition: Extensions
$\bar{A}$ is an extension of $A$ if:
1. their terminal inputs and outputs are compatible: $\overline{TI} = TI$, $\overline{TO} \supseteq TO$;
2. there are surjective mappings $\mu_1$ and $\mu_2$ such that, when $\bar{A}$ has a transition $(I, s_1) \to (O, s_2)$, then $A$ has a transition $(\mu_1(I), \mu_2(s_1)) \to (\mu_1(O), \mu_2(s_2))$.
What does this mean?
- We can map each execution $\bar{e}$ of $\bar{A}$ to an execution $\mu_e(\bar{e})$ of $A$.
- If a node is correct in $\bar{e}$, then it is also correct in $\mu_e(\bar{e})$!
[Figure: the components of $\bar{A}$ are mapped onto $A = (M, TI, TO, \Sigma, \sigma_0, \alpha)$ by $\mu_1$ and $\mu_2$]
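
The transition condition in this definition can be checked mechanically for small, finite state machines. The sketch below is an illustration under simplifying assumptions: each transition carries a single input and a single output rather than sets, and the two mappings are passed in as plain functions called `mu1` (on inputs and outputs) and `mu2` (on states). These names, and the toy ping/pong machines, are hypothetical.

```python
from typing import Callable, Hashable, Set, Tuple

# A transition (I, s1) -> (O, s2) is stored as the tuple (I, s1, O, s2).
Transition = Tuple[Hashable, Hashable, Hashable, Hashable]


def simulates(ext: Set[Transition], base: Set[Transition],
              mu1: Callable, mu2: Callable) -> bool:
    """Every transition of the extension must map to a transition of the base."""
    return all((mu1(i), mu2(s1), mu1(o), mu2(s2)) in base
               for (i, s1, o, s2) in ext)


def covers(ext: Set[Transition], base: Set[Transition],
           mu1: Callable, mu2: Callable) -> bool:
    """Surjectivity, restricted to the values that occur in the two tables."""
    ext_io = {mu1(x) for (i, _, o, _) in ext for x in (i, o)}
    ext_states = {mu2(s) for (_, s1, _, s2) in ext for s in (s1, s2)}
    base_io = {x for (i, _, o, _) in base for x in (i, o)}
    base_states = {s for (_, s1, _, s2) in base for s in (s1, s2)}
    return ext_io >= base_io and ext_states >= base_states


# Base algorithm: answer "ping" with "pong" and stay in state "idle".
BASE = {("ping", "idle", "pong", "idle")}

# Candidate extension: same behaviour, but messages carry a tag and the
# state carries an extra one-bit counter used for bookkeeping.
EXT = {
    ("ping|sig", ("idle", 0), "pong|sig", ("idle", 1)),
    ("ping|sig", ("idle", 1), "pong|sig", ("idle", 0)),
}
# A candidate that changes the visible behaviour and is therefore no extension.
BAD = {("ping|sig", ("idle", 0), "nope|sig", ("idle", 1))}


def strip_tag(message: str) -> str:
    return message.split("|")[0]


def drop_counter(state: Tuple[str, int]) -> str:
    return state[0]


print(simulates(EXT, BASE, strip_tag, drop_counter))  # True
print(simulates(BAD, BASE, strip_tag, drop_counter))  # False
```

The point of the restriction is visible in `BAD`: once the extra detection data is stripped away, the extension must behave exactly like the original algorithm.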

Definition: Fault instances
A fault instance is a four-tuple $(A, C, S, e)$:
- $A$ is an algorithm.
- $e$ is an execution.
- $C$ is a set of correct nodes. This is needed because detectability depends on who is correct.
- $S$ is a set of suspects; $S$ must contain at least one faulty node. This is needed to quantify how precisely we can say who is faulty.
Example algorithm:
1. A sends $0 \le x \le 10$ to C
2. B sends $0 \le y \le 10$ to C
3. C sends $x+y$ to D
[Figure: example network with the sets C and S marked; edge labels "?", "?", and 23]
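
The four components of a fault instance can be written down as a small record. This is a sketch only: the execution is simplified to a tuple of `(sender, receiver, value)` messages instead of the event triples of the system model, the class and function names are hypothetical, and the range check relies solely on the bounds stated in the three-step example (x and y each between 0 and 10, so a correct sum lies between 0 and 20).

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Message = Tuple[str, str, int]  # (sender, receiver, value); simplified execution format


@dataclass(frozen=True)
class FaultInstance:
    algorithm: str                 # A, identified by name in this sketch
    correct: FrozenSet[str]        # C: nodes assumed to be correct
    suspects: FrozenSet[str]       # S: must contain at least one faulty node
    execution: Tuple[Message, ...]  # e

    def is_well_formed(self, faulty: FrozenSet[str]) -> bool:
        """S contains a faulty node, and no node in C is faulty."""
        return bool(self.suspects & faulty) and not (self.correct & faulty)


def sum_out_of_range(value: int) -> bool:
    """With 0 <= x <= 10 and 0 <= y <= 10, a correct x+y lies in [0, 20]."""
    return not 0 <= value <= 20


instance = FaultInstance(
    algorithm="sum",
    correct=frozenset({"D"}),
    suspects=frozenset({"A", "B", "C"}),
    execution=(("C", "D", 23),),
)
print(instance.is_well_formed(faulty=frozenset({"C"})))  # True
print(sum_out_of_range(instance.execution[0][2]))        # True
```

In this toy instance D can tell that 23 cannot be a correct sum, but from that single message alone it cannot narrow the suspects further, which is what the suspect set is meant to capture.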

The Fault Detection Problem
Given a fault class $F$, find a $\tau$ that maps any algorithm $A$ to an extension $\tau(A)$ such that:
- Nontriviality: correct nodes regularly output lists of faulty nodes.
- Completeness: if a fault $(A, C, S, e) \in F$ occurs, at least one correct node $c \in C$ will permanently output at least one faulty suspect $s \in S$.
- Accuracy: correct nodes do not permanently output each other (occasional mistakes are ok).
- Agreement: eventually all correct nodes will permanently output the same set of nodes.
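
The completeness, accuracy, and agreement properties talk about what nodes output permanently, which cannot be decided from a finite trace. The sketch below therefore uses a loudly stated approximation: a node is treated as permanently outputting a suspect if the suspect appears in each of its last `tail` outputs. The trace format and all function names are hypothetical.

```python
from typing import Dict, List, Set

Trace = Dict[str, List[Set[str]]]  # node -> successive suspect lists it has output


def stable(outputs: List[Set[str]], tail: int = 3) -> Set[str]:
    """Finite proxy for 'permanently output': present in each of the last `tail` outputs."""
    recent = outputs[-tail:]
    return set.intersection(*recent) if recent else set()


def completeness(trace: Trace, correct: Set[str],
                 suspects: Set[str], faulty: Set[str]) -> bool:
    """Some correct node stably outputs at least one faulty suspect."""
    return any(stable(trace[c]) & suspects & faulty for c in correct)


def accuracy(trace: Trace, correct: Set[str]) -> bool:
    """No correct node stably outputs another correct node."""
    return all(not (stable(trace[c]) & correct) for c in correct)


def agreement(trace: Trace, correct: Set[str]) -> bool:
    """All correct nodes stably output the same set."""
    return len({frozenset(stable(trace[c])) for c in correct}) <= 1


correct = {"A", "D"}
suspects = {"B", "C"}
faulty = {"C"}
trace = {
    # A briefly (and wrongly) suspects D, then settles on C.
    "A": [set(), {"D"}, {"C"}, {"C"}, {"C"}],
    "D": [set(), {"C"}, {"C"}, {"C"}],
}
print(completeness(trace, correct, suspects, faulty))  # True
print(accuracy(trace, correct))                        # True
print(agreement(trace, correct))                       # True
```

A's early suspicion of D does not violate accuracy in this sketch, which matches the remark that occasional mistakes are acceptable as long as they are not permanent.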

Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Some initial results
  - How to prove impossibilities
  - Fault classification; commission and omission faults
  - Message complexity of detection

Preview: Fault classification
Characterize the set of fault instances for which the Fault Detection Problem can be solved.
[Figure: the set of all (!) general fault instances, divided into:
- Solution exists: commission faults, omission faults
- Impossible to solve: non-observable fault instances, ambiguous fault instances]

Preview: Message complexity
- How many additional messages are needed?
- $\tau$ has message complexity $c$ iff, for each execution $e$, the number of messages sent by correct nodes in any $\bar{e}$ with $\mu_e(\bar{e}) = e$ is at most $c$ times the number of messages in $e$.
- Assumption: at most $f < |N|-2$ nodes can be faulty.
Table columns: Fault detection problem; Fault detection problem with agreement.
Table entries, in transcript order: Commission faults: $f+2$. Omission faults: $3f+4$, $(|N|-1)^2$.
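
To get a feel for the definition, the following sketch reads the f+2 and 3f+4 entries as the factor c and applies it to a hypothetical execution. The function names, the node count, and the message count are illustrative assumptions, and the entry involving |N|-1 is left out because its place in the table is not clear from the transcript.

```python
def within_complexity(msgs_extended: int, msgs_original: int, c: int) -> bool:
    """True if the extended execution sends at most c times as many messages."""
    return msgs_extended <= c * msgs_original


def commission_factor(f: int) -> int:
    return f + 2  # entry listed for commission faults


def omission_factor(f: int) -> int:
    return 3 * f + 4  # entry listed for omission faults


def assumption_holds(f: int, num_nodes: int) -> bool:
    """The slide assumes at most f < |N| - 2 faulty nodes."""
    return f < num_nodes - 2


# Hypothetical setting: 10 nodes, up to 3 faulty, 100 messages in the original execution.
num_nodes, f, original = 10, 3, 100
print(assumption_holds(f, num_nodes))                          # True
print(commission_factor(f), omission_factor(f))                # 5 13
print(within_complexity(500, original, commission_factor(f)))  # True
print(within_complexity(501, original, commission_factor(f)))  # False
```

Under these assumed numbers, detecting commission faults would allow up to 500 messages and detecting omission faults up to 1300 messages for an original execution of 100.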

Future work
This work provides a "language" for reasoning about fault detection with general faults.
Possible next steps:
- Probabilistic guarantees: lower cost? [SOSP'07]
- More synchrony: more faults detectable? Stronger accuracy? Bound the time to detection?
- Bounds on message sizes and/or state space: impact on the set of detectable faults?
- Broadcast channel
- Different cryptographic primitives

Summary
- Framework for reasoning about fault detection with general (Byzantine) faults
  - Precise definition of a general fault instance
  - Formal statement of the fault detection problem
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the message complexity of detection
Questions?