BFTW 3 workshop (Sep 22, 2009)
© 2009 Andreas Haeberlen

The Fault Detection Problem
Andreas Haeberlen, MPI-SWS
Petr Kuznetsov, TU Berlin / Deutsche Telekom Labs

Fault tolerance vs. fault detection
- Distributed systems need to be robust against faults
- Approaches: masking and detection
- Complementary: in many systems, we want both!
- Masking is well understood; detection is not
[Figure: a network, with the output "Machine XYZ is faulty"]

What we know about fault detection
- Rich literature on detecting crash faults
  - e.g., the failure detector abstraction [Chandra96]
- But very little on general (Byzantine) faults
  - Byzantine failure detector for consensus [Kihlstrom03]
  - Several specific algorithms: SUNDR, PeerReview, ...
- What we do not know about general faults (this talk):
  - Which faults are detectable in a given system?
  - What is the complexity of detection?
  - How does detection depend on synchrony?
  - How much do cryptographic primitives help?
  - ...

Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the 'cost' of detection

System model
- Set $N$ of nodes
- One terminal per node
- Reliable unicast network
- Messages can be signed
- System is asynchronous
- An execution is a sequence of events $e_k = (i_k, I_k, O_k)$
- Node $i$ has an algorithm $A_i = (M_i, TI_i, TO_i, \Sigma_i, \sigma_{0,i}, \alpha_i)$
  - The algorithm is deterministic
  - We say that node $i$ is correct if it follows its algorithm $A_i$
- Distributed algorithm $A := (A_1, A_2, \ldots, A_n)$
[Figure: nodes A, B, and C, each with a terminal, connected by the network]
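
To make "correct if it follows its algorithm" concrete, here is a minimal sketch that replays a finite execution against a deterministic transition function. The names `Algorithm`, `Event`, `is_correct`, and the running-sum example are hypothetical and not part of the talk. The sketch also assumes a global view of the execution, which real nodes do not have.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Tuple

State = Hashable
Inputs = Tuple    # what a node consumes in one event (assumed encoding)
Outputs = Tuple   # what a node produces in one event (assumed encoding)


@dataclass(frozen=True)
class Algorithm:
    """Deterministic node algorithm: initial state plus transition function."""
    initial_state: State
    # (inputs, state) -> (outputs, next_state)
    step: Callable[[Inputs, State], Tuple[Outputs, State]]


@dataclass(frozen=True)
class Event:
    """One event of an execution: which node acted, what it read and wrote."""
    node: str
    inputs: Inputs
    outputs: Outputs


def is_correct(node: str, algorithm: Algorithm, execution: List[Event]) -> bool:
    """Replay the node's events and compare them with what the algorithm dictates."""
    state = algorithm.initial_state
    for event in execution:
        if event.node != node:
            continue
        expected_outputs, state = algorithm.step(event.inputs, state)
        if expected_outputs != event.outputs:
            return False  # the node deviated from its algorithm
    return True


# Hypothetical algorithm: output the running sum of all inputs seen so far.
def sum_step(inputs: Inputs, state: int) -> Tuple[Outputs, int]:
    total = state + sum(inputs)
    return (total,), total


summer = Algorithm(initial_state=0, step=sum_step)
good_run = [Event("C", (3,), (3,)), Event("C", (4,), (7,))]
bad_run = [Event("C", (3,), (3,)), Event("C", (4,), (23,))]

print(is_correct("C", summer, good_run))
print(is_correct("C", summer, bad_run))
```

Because the algorithm is deterministic, a single replay is enough: any mismatch between the recorded and the recomputed outputs shows a deviation.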

Intuition: What is fault detection?
- Given: algorithm $A$, set of faults $F$ to detect
- Goal: when a fault $f \in F$ occurs, correct nodes output a list of faulty nodes to their terminals
- "Fault detection problem" for $F$: find a function $\tau_F$ that maps any algorithm $A$ to another algorithm $\tau_F(A)$ that solves the same problem but can additionally detect all the faults in $F$
[Figure: A sends x (0<x<10) and B sends y to C; C sends x+y to D; terminal output FAULTY(C)]

Intuition: Faults and extensions
- How should a general 'fault' be defined?
  - It needs to be specific to an algorithm!
  - First approximation: a tuple $(A, e)$
- Problem: we need to change the algorithm to do fault detection, which gives a new algorithm $\bar{A} = \tau_F(A)$!
  - What does it mean to detect a fault $(A, e)$ in an execution of $\bar{A}$?
  - Need to restrict the power of $\tau$!
- Idea: produce an extension of $A$ that 'works exactly like $A$', except that it does some additional work to detect faults

Definition: Extensions
- $\bar{A}$ is an extension of $A$ if:
  1. their terminal inputs and outputs are compatible: $\overline{TI} = TI$, $TO \subseteq \overline{TO}$
  2. there are surjective mappings $\mu_1$ and $\mu_2$ such that, when $\bar{A}$ has a transition $(I, s_1) \to (O, s_2)$, then $A$ has a transition $(\mu_1(I), \mu_2(s_1)) \to (\mu_1(O), \mu_2(s_2))$
- What does this mean?
  - We can map each execution $\bar{e}$ of $\bar{A}$ to an execution $\mu_e(\bar{e})$ of $A$
  - If a node is correct in $\bar{e}$, then it is also correct in $\mu_e(\bar{e})$!
[Figure: the components of the extension mapped onto $A = (M, TI, TO, \Sigma, \sigma_0, \alpha)$ by $\mu_1$ and $\mu_2$]
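
The transition condition can be checked mechanically when both algorithms have small, finite transition tables. The sketch below is an illustration only: the table encoding, the names `is_extension`, `map_io`, and `map_state`, and the ping/pong example are assumptions, and it does not check surjectivity or terminal compatibility.

```python
from typing import Callable, Dict, Hashable, Tuple

Transition = Dict[Tuple[Tuple, Hashable], Tuple[Tuple, Hashable]]


def is_extension(
    ext: Transition,
    base: Transition,
    map_io: Callable[[Tuple], Tuple],
    map_state: Callable[[Hashable], Hashable],
) -> bool:
    """Check that every transition of the extension maps onto a transition of the base.

    Tables map (inputs, state) -> (outputs, next_state).
    """
    for (inputs, s1), (outputs, s2) in ext.items():
        base_key = (map_io(inputs), map_state(s1))
        expected = (map_io(outputs), map_state(s2))
        if base.get(base_key) != expected:
            return False
    return True


# Hypothetical base algorithm: answer every "ping" with "pong".
base = {(("ping",), "idle"): (("pong",), "idle")}

# Hypothetical extension: same behaviour, plus a counter and an extra log message.
ext = {
    (("ping",), ("idle", 0)): (("pong", "log:1"), ("idle", 1)),
    (("ping",), ("idle", 1)): (("pong", "log:2"), ("idle", 2)),
}

# A broken "extension" that changes the visible behaviour.
ext_bad = {(("ping",), ("idle", 0)): (("pang", "log:1"), ("idle", 1))}


def map_io(msgs: Tuple) -> Tuple:
    """Drop the extra detection messages (here: anything starting with 'log:')."""
    return tuple(m for m in msgs if not m.startswith("log:"))


def map_state(state: Tuple) -> str:
    """Forget the extra counter kept by the extension."""
    return state[0]


print(is_extension(ext, base, map_io, map_state))
print(is_extension(ext_bad, base, map_io, map_state))
```

The two mapping functions play the role of "forgetting" the additional detection work, which is what lets an execution of the extension be read as an execution of the original algorithm.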

Definition: Fault instances
- A fault instance is a four-tuple $(A, C, S, e)$:
  - $A$ is an algorithm
  - $e$ is an execution
  - $C$ is a set of correct nodes (needed because detectability depends on who is correct)
  - $S$ is a set of suspects; $S$ must contain at least one faulty node (needed to quantify how precisely we can say who is faulty)
- Example algorithm:
  1. A sends 0≤x≤10 to C
  2. B sends 0≤y≤10 to C
  3. C sends x+y to D
[Figure: network of nodes A to Q with the sets C and S marked; in the example, x and y are shown as "?" and the value 23 is shown]
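
One way to see why a set of suspects is needed is to enumerate which sets of faulty nodes could explain what D observes in the example algorithm. The sketch below is a brute-force illustration under stated assumptions: C forwards the sum without validating its inputs, a faulty A or B may send any value in a small hypothetical range, and the function name `explanations` is invented here.

```python
from itertools import product
from typing import FrozenSet, Iterable, Set

VALID = range(0, 11)  # 0 <= x <= 10 and 0 <= y <= 10, as in the example


def explanations(observed_sum: int, candidates: Iterable[int] = range(0, 31)) -> Set[FrozenSet[str]]:
    """Return every set of faulty nodes consistent with the value D received from C."""
    values = list(candidates)
    result: Set[FrozenSet[str]] = set()
    for x, y in product(values, repeat=2):
        # A or B is faulty if it sent a value outside the allowed range.
        faulty_senders = frozenset(
            name for name, value in (("A", x), ("B", y)) if value not in VALID
        )
        if x + y == observed_sum:
            # C forwarded the true sum, so C followed its algorithm.
            result.add(faulty_senders)
        else:
            # C sent something other than x + y, so C is faulty as well.
            result.add(faulty_senders | {"C"})
    return result


print(frozenset() in explanations(15))   # a fault-free run can explain 15
print(frozenset() in explanations(23))   # no fault-free run can explain 23
print(sorted(sorted(s) for s in explanations(23)))
```

Under these assumptions, an observed value of 23 proves that some node is faulty, but D alone cannot tell whether A, B, or C is to blame, so the suspect set would have to contain all three.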

The Fault Detection Problem
- Given a fault class $F$, find a $\tau$ that maps any algorithm $A$ to an extension $\tau(A)$ such that:
  - Nontriviality: correct nodes regularly output lists of faulty nodes
  - Completeness: if a fault $(A, C, S, e) \in F$ occurs, at least one correct node $c \in C$ will permanently output at least one faulty suspect $s \in S$
  - Accuracy: correct nodes do not permanently output each other (occasional mistakes are ok)
  - Agreement: eventually, all correct nodes will permanently output the same set of nodes
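
The properties refer to what nodes output "permanently", which is a statement about infinite executions. As a rough illustration only, the sketch below approximates "permanently" by "in each of the last `tail` outputs" of a finite trace. The function names, the trace encoding, and the `tail` parameter are assumptions, and nontriviality is not checked.

```python
from typing import Dict, List, Set


def permanently_output(outputs: List[Set[str]], tail: int) -> Set[str]:
    """Nodes that appear in each of the last `tail` outputs (finite approximation)."""
    if tail < 1 or len(outputs) < tail:
        return set()
    window = outputs[-tail:]
    return set.intersection(*(set(o) for o in window))


def check_properties(
    outputs_by_node: Dict[str, List[Set[str]]],
    correct: Set[str],
    suspects: Set[str],
    faulty: Set[str],
    tail: int = 3,
) -> Dict[str, bool]:
    """Evaluate completeness, accuracy, and agreement on a finite output trace."""
    permanent = {c: permanently_output(outputs_by_node[c], tail) for c in correct}
    return {
        # Some correct node permanently outputs at least one faulty suspect.
        "completeness": any(permanent[c] & suspects & faulty for c in correct),
        # No correct node permanently outputs another correct node.
        "accuracy": all(not (permanent[c] & correct) for c in correct),
        # All correct nodes permanently output the same set.
        "agreement": len({frozenset(p) for p in permanent.values()}) == 1,
    }


# Hypothetical trace: A briefly suspects B (an occasional mistake), then all settle on C.
trace = {
    "A": [set(), {"B"}, {"C"}, {"C"}, {"C"}],
    "B": [set(), set(), {"C"}, {"C"}, {"C"}],
    "D": [set(), {"C"}, {"C"}, {"C"}, {"C"}],
}
result = check_properties(trace, correct={"A", "B", "D"}, suspects={"C"}, faulty={"C"})
print(result)
```

The early output `{"B"}` by node A does not violate accuracy in this sketch, which mirrors the remark that occasional mistakes are acceptable as long as they are not permanent.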

Outline
- A "language" for reasoning about faults
  - System model
  - Extensions and fault instances
  - Precise problem statement
- Some initial results
  - How to prove impossibilities
  - Fault classification; commission and omission faults
  - Message complexity of detection

Preview: Fault classification
- Characterize the set of fault instances for which the Fault Detection Problem can be solved
[Figure: the set of all (!) general fault instances, divided into two regions. Impossible to solve: non-observable fault instances, ambiguous fault instances. Solution exists: commission faults, omission faults]

Preview: Message complexity
- How many additional messages are needed?
- $\tau$ has message complexity $c$ iff, for each execution $e$, the number of messages sent by correct nodes in any $\bar{e}$ with $\mu_e(\bar{e}) = e$ is at most $c$ times the number of messages in $e$
- Assumption: at most $f < |N|-2$ nodes can be faulty
[Table, cell layout lost in extraction. Columns: "Fault detection problem", "Fault detection problem with agreement". Rows: "Commission faults", "Omission faults". Recoverable entries, in extraction order: Commission faults: f+2; Omission faults: 3f+4, (|N|-1) 2]
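
The definition of message complexity compares message counts between an execution of the extended algorithm and the original execution it maps to. A minimal sketch of that ratio on finite, recorded executions follows. The encoding of an execution as a list of `(sender, message)` pairs and the names `messages_sent` and `observed_complexity` are assumptions. The result is only the worst ratio among the executions supplied, not a bound over all executions.

```python
from fractions import Fraction
from typing import Iterable, List, Optional, Set, Tuple

Execution = List[Tuple[str, str]]  # (sender, message) pairs, an assumed encoding


def messages_sent(execution: Execution, senders: Optional[Set[str]] = None) -> int:
    """Count messages, optionally restricted to a set of senders."""
    return sum(1 for sender, _ in execution if senders is None or sender in senders)


def observed_complexity(
    pairs: Iterable[Tuple[Execution, Execution]], correct: Set[str]
) -> Fraction:
    """Worst observed ratio of (messages by correct nodes in the extended run)
    to (messages in the original run) over the supplied pairs."""
    worst = Fraction(0)
    for extended, original in pairs:
        original_count = messages_sent(original)
        if original_count == 0:
            continue  # the ratio is undefined for runs without messages; skipped here
        ratio = Fraction(messages_sent(extended, correct), original_count)
        worst = max(worst, ratio)
    return worst


# Hypothetical runs of the x+y example: C is faulty, the others add acknowledgements.
original_run = [("A", "x"), ("B", "y"), ("C", "sum")]
extended_run = [
    ("A", "x"), ("A", "ack"),
    ("B", "y"), ("B", "ack"),
    ("C", "sum"), ("C", "bogus"),
    ("D", "ack"),
]
ratio = observed_complexity([(extended_run, original_run)], correct={"A", "B", "D"})
print(ratio)
```

Only messages sent by correct nodes count on the extended side, since faulty nodes can send arbitrarily many messages.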

Future work
- This work provides a "language" for reasoning about fault detection with general faults
- Possible next steps:
  - Probabilistic guarantees
    - Lower cost? [SOSP'07]
  - More synchrony
    - More faults detectable? Stronger accuracy? Bound the time to detection?
  - Bounds on message sizes and/or state space
    - Impact on the set of detectable faults?
  - Broadcast channel
  - Different cryptographic primitives

Summary
- Framework for reasoning about fault detection with general (Byzantine) faults
  - Precise definition of a general fault instance
  - Formal statement of the fault detection problem
- Two initial results
  - Characterization of the set of detectable faults
  - Tight lower bounds on the message complexity of detection

Questions?