Fault-Tolerant Computing
Acknowledgements
  These lectures are based on materials from the following sources:
    – S. Kulkarni
    – J. Rushby
    – J. Knight
Objectives
  Exposure to the area of critical systems
  What it means to have a fault-tolerant system
  Specification techniques for representing critical properties
  How to design fault tolerance into a system
Reliability and Recovery
  Reliability: probability that a system has not failed by time t, given that it was operating properly at time 0
  Recovery: process of restoring consistency after a failure
Dependability
  Dependability: how much one may rely on the quality of the services delivered
  Quality of service depends on:
    – Correctness
    – Continuity of service
Terms
  Failure: malfunction
  Fault: condition that might lead to a failure
  Error: an incorrect response that indicates a fault is present
  Faults may be:
    – permanent
    – intermittent
    – transient
Terms (cont’d)
  Graceful degradation: the system is operational, but degraded, after faults
  Fail-safe: system execution is safe after the fault
  Stabilizing: the system recovers to a consistent state after the fault
  Masking: the user of the system does not see any unintended behavior due to faults
Terms (cont’d)
  Mean Time to Failure (MTTF): expected value of system failure time
  Mean Time to Repair (MTTR): expected value of system repair time
  Mean Time Between Failures (MTBF): expected time between successive failures
    – MTBF = MTTF + MTTR
  Fault tolerance: ability to continue operation after the occurrence of faults
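As a small worked example of MTBF = MTTF + MTTR, the sketch below estimates the three quantities from an observation log. The uptime and repair figures are invented for illustration, and the variable names are assumptions, not part of the lecture material.

```python
from statistics import mean

# Hypothetical observations (hours); not measured data.
uptimes_h = [900.0, 1100.0, 1000.0]   # failure-free periods before each failure
repairs_h = [8.0, 12.0, 10.0]         # time taken to repair each failure

mttf = mean(uptimes_h)   # estimate of Mean Time to Failure
mttr = mean(repairs_h)   # estimate of Mean Time to Repair
mtbf = mttf + mttr       # MTBF = MTTF + MTTR

print(mttf, mttr, mtbf)
```

Each failure cycle consists of one failure-free period followed by one repair, which is why the two means add up to the time between successive failures.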
Design Decisions
  Fault detection
  Fault confinement
  Fault diagnosis
  Repair and/or reconfigure
  Redundancy
    – Hardware: extra hardware
    – Information: redundancy bits
    – Software: diagnosis software, extra software
    – Temporal: re-execute software to recover from intermittent faults
Safety vs Reliability
  Reliability: concerns the occurrence of failures
    – System failures are defined in terms of system services
  Safety: concerns the occurrence of accidents
    – Unplanned events that result in death, injury, illness, damage, loss of property, or environmental harm
    – Defined in terms of external consequences
Types of Faults
  Omission failure
    – server omits to respond to an input (fail-silent failure)
  Timing failure
    – response is functionally correct but untimely
    – can be an early timing failure or a late timing failure (performance failure)
  Response failure
    – incorrect response
    – value failure: the output value is incorrect
    – state transition failure: the state transition is incorrect
Types of Faults (cont’d)
  Crash failure
    – after a first omission, a server omits to produce output until it restarts
  Amnesia crash
    – server restarts in a predefined initial state that does not depend on the inputs seen before the crash
  Partial amnesia crash
    – some part of the state is the same as before the crash; the rest is in a predefined initial state
  Pause crash
    – server restarts in the state it had before the crash
  Halting crash
    – crashed server never restarts
Examples
  OS crash followed by a reboot in the initial state
  Database server crash followed by recovery of a database state that reflects all transactions before the crash
  Communication server occasionally loses messages but does not delay messages (omission failure)
  Excessive message transmission or message processing delay (communication performance failure)
  Alteration of a message due to random noise during transmission (response failure)
Hierarchical Failure Masking
  A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.
  Example: a value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer.
Group Failure Masking
  To ensure that a service remains available to clients despite server failure, one can implement a group of redundant, physically independent servers.
  The group masks the failure of a member.
  Hierarchical masking requires:
    – users to implement resource failure-masking attempts as exception-handling code
  In group masking:
    – failures of individual members are entirely hidden from users by group management mechanisms
Group Failure Masking (cont’d)
  Group output is a function of the outputs of individual group members:
    – fastest member
    – distinguished member
    – result of majority vote
  A server group able to mask any k concurrent member failures is termed k-fault tolerant
    – e.g., a primary/standby group of k servers with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures
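Two of the group output functions above can be sketched in a few lines. This is a minimal illustration, assuming member outputs arrive as a Python list and that a crashed member is represented by None; the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of members, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def primary_standby(outputs_by_rank: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the output of the highest-ranked member that has not crashed.

    Outputs are ordered primary, 1st backup, 2nd backup, ...; None marks a crash.
    """
    for output in outputs_by_rank:
        if output is not None:
            return output
    return None

# Three replicas, one returning a wrong value: the vote masks it.
print(majority_vote([42, 42, 7]))
# k = 3 ranked servers, k - 1 = 2 crashed: the last backup still answers.
print(primary_standby([None, None, "reply"]))
```

Note the difference in what is masked: the vote tolerates a minority of incorrect values, while the primary/standby scheme as sketched here only tolerates members that stop responding.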
Some Formalism: Programs
  A program consists of:
    – a finite set of variables
    – a finite set of actions, each of the form guard -> statement, where guard is a boolean expression over program variables and statement updates program variables
  Modifications
    – guards may contain receives from channels
    – statements may contain sends/receives
Computation
  A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed.
    – In one step, multiple guards may be true.
    – If the guard of some action is true continuously, then that action is eventually chosen for execution.
  Note: a program computation is a sequence of states.
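The execution model can be sketched as a small interpreter for guarded actions. This is an illustration only: it assumes the state is a dictionary of integer variables, and it uses round-robin selection as one simple way to obtain fairness (the slides do not prescribe a scheduler). All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Action = Tuple[Callable[[State], bool], Callable[[State], State]]  # (guard, statement)

def run(actions: List[Action], state: State, max_steps: int = 100) -> List[State]:
    """Execute enabled actions round-robin; return the computation as a list of states."""
    computation = [dict(state)]
    start = 0
    for _ in range(max_steps):
        # Scan the actions starting just after the last one executed, wrapping around,
        # so a continuously enabled action is reached within len(actions) steps.
        for offset in range(len(actions)):
            idx = (start + offset) % len(actions)
            guard, statement = actions[idx]
            if guard(state):
                state = statement(state)
                computation.append(dict(state))
                start = idx + 1
                break
        else:
            break  # no guard is true: nothing more can execute
    return computation

# Two actions: x < 2 -> x := x + 1   and   y < 1 -> y := y + 1
actions = [
    (lambda s: s["x"] < 2, lambda s: {**s, "x": s["x"] + 1}),
    (lambda s: s["y"] < 1, lambda s: {**s, "y": s["y"] + 1}),
]
trace = run(actions, {"x": 0, "y": 0})
print(trace)
```

The returned list is the computation viewed as a sequence of states, which is the form that specifications are stated over.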
Specification
  A specification is a set of sequences of states.
  What does it mean for a program p to satisfy a specification sp from a set of states S?
    – Every computation of p that starts from a state in S is in sp.
Examples of specifications
  Let S be a predicate.
  Invariant
    – invariant(S) = {seq: S is true in each state of seq}
    – A sequence seq is in invariant(S) iff S is true in each state of seq.
  Closure
    – closed(S) = {seq: (Ai: i >= 0: ‘S is true in the ith state of seq’ => ‘S is true in the (i+1)th state of seq’)}
    – If S ever becomes true, it continues to be true.
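The two definitions translate directly into checks over a recorded sequence of states. This is a sketch under one stated assumption: specifications are sets of possibly infinite sequences, whereas the code can only examine a finite prefix. Function names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def satisfies_invariant(seq: Sequence[T], S: Callable[[T], bool]) -> bool:
    """True iff S holds in every state of the (finite) sequence."""
    return all(S(state) for state in seq)

def satisfies_closed(seq: Sequence[T], S: Callable[[T], bool]) -> bool:
    """True iff whenever S holds in state i, it also holds in state i + 1."""
    return all(S(seq[i + 1]) for i in range(len(seq) - 1) if S(seq[i]))

def non_negative(x: int) -> bool:
    return x >= 0

print(satisfies_invariant([0, 1, 2], non_negative))
print(satisfies_closed([-1, 0, 1], non_negative))
print(satisfies_invariant([-1, 0, 1], non_negative))
print(satisfies_closed([0, -1, 1], non_negative))
```

The sequence [-1, 0, 1] shows the difference between the two: S is closed in it (once true, it stays true) but is not an invariant, because the first state violates S.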
Examples of specifications (cont’d)
  Let R and S be predicates.
  Leads-to
    – R leads-to S = {seq: (Ai: i >= 0: ‘R is true in the ith state of seq’ => (Ek: k >= i: ‘S is true in the kth state of seq’))}
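The leads-to definition can be checked on a recorded trace as follows. This is a sketch that assumes states are dictionaries and that only a finite prefix is available; the function name and the req/cs state layout are chosen for illustration.

```python
from typing import Callable, Dict, Sequence

State = Dict[str, bool]

def satisfies_leads_to(
    seq: Sequence[State],
    R: Callable[[State], bool],
    S: Callable[[State], bool],
) -> bool:
    """True iff every state i where R holds is followed by some state k >= i where S holds."""
    return all(
        any(S(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if R(seq[i])
    )

def requesting(s: State) -> bool:
    return s["req"]

def in_critical_section(s: State) -> bool:
    return s["cs"]

granted = [
    {"req": True, "cs": False},
    {"req": True, "cs": False},
    {"req": False, "cs": True},
]
pending = [
    {"req": False, "cs": False},
    {"req": True, "cs": False},
]
print(satisfies_leads_to(granted, requesting, in_critical_section))
print(satisfies_leads_to(pending, requesting, in_critical_section))
```

A negative result on a finite prefix only says the obligation has not been met yet: a longer continuation of the second trace could still reach a state where S holds.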
Examples of specifications (cont’d)
  Mutual exclusion
    – invariant( (j <> k) => ~(cs.j /\ cs.k) )
    – (Aj:: (req.j leads-to cs.j))
  Leader election
    – invariant( (j <> k) => ~(leader.j /\ leader.k) )
    – true leads-to (Ej:: leader.j)
  Load balancing
    – true leads-to (Aj,k:: |load.j - load.k| <= bound)
Safety and Liveness
  Safety specification
    – A sequence "does nothing bad"
    – No sequence in the specification has a bad prefix
  Let sp be a specification.
    – sp is a safety specification iff
      (As:: s ~element_of sp => (Ea: a is a prefix of s: (Ab:: ab ~element_of sp)))
Liveness Specification
  Liveness specification
    – A sequence "does something good"
    – Every finite prefix has a good extension
  Let sp be a specification.
    – sp is a liveness specification iff
      (Aa:: (Eb:: ab element_of sp))
Faults
  A fault is an action that can change the program state.
  All faults (be they crash, fail-stop, omission, corruption, timing, Byzantine, intruders, or ...) can thus be viewed as perturbations on the system.
Faults (cont’d)
  A program computation in the presence of faults is a sequence of steps where:
    – in each step either a program action executes or a fault action executes
    – the program actions are fairly executed
    – the fault occurrences are finite
Representation of Faults
  Communication faults
    – Let c denote the sequence of messages on a channel.
    – Let m1 and m2 be messages, and let seq be a sequence of messages.
  Message loss: c = <m1, seq> => c = <seq>
  Message duplication: c = <m1, seq> => c = <m1, m1, seq>
  Message reorder: c = <m1, m2, seq> => c = <m2, m1, seq>
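The three communication faults can be read as actions that rewrite the channel contents. This is a minimal sketch, assuming the channel is a Python list with its head at index 0 and that the fault hits the message at a given position; the function names and the position parameter are hypothetical.

```python
from typing import List

def lose(c: List[str], i: int) -> List[str]:
    """Message loss: the message at position i disappears from the channel."""
    return c[:i] + c[i + 1:]

def duplicate(c: List[str], i: int) -> List[str]:
    """Message duplication: the message at position i appears twice."""
    return c[:i + 1] + [c[i]] + c[i + 1:]

def reorder(c: List[str], i: int) -> List[str]:
    """Message reorder: the messages at positions i and i + 1 swap places."""
    return c[:i] + [c[i + 1], c[i]] + c[i + 2:]

channel = ["m1", "m2", "m3"]
print(lose(channel, 0))
print(duplicate(channel, 0))
print(reorder(channel, 0))
```

Each function returns a new list rather than mutating its argument, so the state before and after the fault can be compared directly.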
Representation of Faults (cont’d)
  Amnesia/transient faults
    – Let c denote all the variables of a process.
    – true => c = ??
Representation of Permanent Faults
  Fail-stop fault
    – Upon fail-stop, a process does nothing:
    – it does not execute any action, and
    – it does not send any messages.
  Introduce an auxiliary variable up.j at process j.
  Add up.j to the guard of each action of j.
    – If processes can detect the failure of other processes, then they can do so using the variable up.
Representation of Permanent Faults
  Byzantine faults
    – Introduce an auxiliary variable b.j at process j.
    – Add these actions as faults:
        ~b.j => b.j = true
        b.j => state.j = ??
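The auxiliary variables up.j and b.j can be sketched on a toy process. Several assumptions are made here: the process state is a dictionary with a single program variable x, the fail-stop fault action is taken to be "up.j becomes false" (implied by the slides rather than stated), and the unknown value ?? is modeled by a random integer. All names are hypothetical.

```python
import random
from typing import Dict, Union

Process = Dict[str, Union[bool, int]]

def program_action(p: Process) -> bool:
    """Program action with up.j added to its guard: up and x < 3 -> x := x + 1."""
    if p["up"] and p["x"] < 3:
        p["x"] += 1
        return True
    return False  # guard is false, so nothing executes

def fail_stop_fault(p: Process) -> None:
    """Assumed fail-stop fault action: up becomes false, disabling every program action."""
    p["up"] = False

def byzantine_fault(p: Process, rng: random.Random) -> None:
    """Byzantine fault actions: first mark the process (b := true), later corrupt its state."""
    if not p["b"]:
        p["b"] = True
    else:
        p["x"] = rng.randint(-100, 100)  # stands in for state.j = ??

p = {"up": True, "b": False, "x": 0}
program_action(p)        # guard holds, x becomes 1
fail_stop_fault(p)
print(program_action(p)) # guard no longer holds once the process has failed
print(p)
```

Because every program action is guarded by up, a single fault action is enough to model a process that never executes or sends anything again.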
Goal of Fault-tolerance Design
  Starting from some initial states S:
    – If the program executes alone, then the original specification, sp, is satisfied.
    – If the program executes in the presence of faults, then the fault-tolerance specification, sp', is satisfied.
  The fault-tolerance specification depends upon the type of fault tolerance desired, e.g.:
    – for masking, sp' = sp
    – for fail-safe, sp' = the safety specification of sp
Representation of Permanent Faults
  Fault-tolerant systems are rarely designed from scratch!
  One needs to modify a fault-intolerant system to add fault tolerance.
    – Need to reuse the fault-intolerant program
  Fault-tolerant systems need to be modified to deal with new faults.
    – Need for incremental design
  Several activities must be performed while developing fault-tolerant systems.
    – manual or automated design, testing, verification, synthesis, ...
    – desirable to have a unified framework in which these activities can be performed
Overall Design
  [Figure: design activities: Manual Design, Testing, Automated Synthesis, Theorem Proving, Model Checking, Refinement]
Overall Design (cont’d)
  Should separate the concerns of functionality and fault tolerance.
    – Should use components that are responsible for fault tolerance alone.
  Should provide structural continuity while performing these tasks.
    – Should be able to use the same components while performing the above tasks.