Software Fault Tolerance – The Big Picture. RTS, April 2008. Anders P. Ravn, Aalborg University
Fault Tolerance Means to isolate component faults Prevents system failures May increase system dependability
Dependability - attributes Availability Reliability Safety Confidentiality Integrity Maintainability BW p. 129
Dependability - impairments Faults Errors Failures BW p. 103,...,130 Fault → Error → Failure → ... → Fault
System and Component
Dependability - means Fault prevention Fault tolerance Error Removal Failure Forecasting BW p. 106,..., 130
Fault classification. Origin: physical (internal/external), logical (design/interaction). Kind: omission, value, timing, byzantine. Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent).
Error Classification (Fault → Error). Effect: latent, effective. Extent: local, distributed.
Failure Classification (Fault → Error → Failure). Consequence: benign, malign (a mishap). BW (Failure modes) p. 105
Fault Avoidance. Careful design: process (activities), notations, tools. Conservative design: robust functionality, testability, traceability.
Error Removal Verification (analysis of design) Test (analysis of implementation)
Failure Forecasting Calculation – analysis of design Simulation – measurement on design Test – measurement on implementation
Fault Tolerance Means to isolate component faults ... and mask them Prevents system failures May increase system dependability
Fault Tolerance
FT - levels Full tolerance Graceful Degradation Fail safe BW p. 107
FT basis: Redundancy. Time (Try, Retry, ...) Space (Try, ...) BW p. 109
N-version programming V1 V2 V3 Driver (comparator) Comparison vectors (votes) Comparison status indicators BW p. 109 Comparison points
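The driver/comparator idea on this slide can be sketched as follows. This is a minimal illustration, not the scheme from BW: the function names `n_version_execute`, `v1`, `v2`, `v3` and the strict-majority rule are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoMajority(Exception):
    """Raised when the versions do not produce a majority result."""


def n_version_execute(versions: Sequence[Callable[[int], Hashable]], x: int):
    """Driver: run every version on the same input and vote on the results."""
    votes = []
    for version in versions:
        try:
            votes.append(version(x))
        except Exception:
            # A crashed version casts no vote (omission fault).
            continue
    if not votes:
        raise NoMajority("no version produced a result")
    value, count = Counter(votes).most_common(1)[0]
    # Require a strict majority of all versions, not just of the survivors.
    if count * 2 <= len(versions):
        raise NoMajority(f"votes disagree: {votes}")
    return value


# Three hypothetical, independently written versions of "square x".
def v1(x):
    return x * x


def v2(x):
    return x ** 2


def v3(x):
    return x * x + 1   # deliberately faulty version (value fault)


result = n_version_execute([v1, v2, v3], 7)
```

Here the faulty `v3` is outvoted by `v1` and `v2`. Exact comparison of votes only works for discrete results; inexact (for example floating-point) results would need a tolerance-based comparison.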
Fault classification (scope of N-VP). Origin: physical (internal/external), logical (design/interaction). Kind: omission, value, timing, byzantine. Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent). Coverage marks from the slide (alignment with the entries lost in extraction): + (+) ++ (+) + / (+) + / +
Dynamic Redundancy 1. Error detection 2. Damage confinement and assessment 3. Error recovery 4. Fault treatment and continued service BW p. 114
Error Detection. f: State × Input → State × Output. Detection by: environment (exception), application. BW p. 115 Assertion: precondition(input), postcondition(input, output), invariant(state, state’). Timing: WCET(f, input), Deadline(f, input) D
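Application-level detection with assertions and a timing check might look like the sketch below. The account example, the name `checked_step` and the 0.5 s deadline are assumptions chosen for illustration. The timing check only detects a missed deadline after the fact; it does not establish a WCET bound.

```python
import time


class DetectedError(Exception):
    """Signals an error detected by the application-level checks."""


def checked_step(state: int, amount: int, deadline_s: float = 0.5):
    """f: State x Input -> State x Output, wrapped in detection checks.

    Hypothetical example: state is an account balance, input a withdrawal.
    """
    # Precondition on the input.
    if amount <= 0:
        raise DetectedError("precondition violated: amount must be positive")

    start = time.monotonic()
    new_state = state - amount      # the actual computation f
    output = amount
    elapsed = time.monotonic() - start

    # Postcondition relating input and output.
    if output != amount:
        raise DetectedError("postcondition violated")
    # Invariant relating old and new state.
    if new_state < 0 or new_state > state:
        raise DetectedError("invariant violated: balance out of range")
    # Timing check: the step must have finished within its deadline.
    if elapsed > deadline_s:
        raise DetectedError("deadline missed")
    return new_state, output
```

A call such as `checked_step(100, 30)` passes all checks, while `checked_step(10, 30)` is rejected by the invariant before the damaged state can leave the function.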
Damage Confinement. Static structure. Dynamic structure. BW p. 117 (Figure labels: object, I, I)
Error Recovery. Forward: repair the state – if you can! Backward: define recovery points, checkpoint state at the recovery point, roll back, retry. Domino effect. BW p. 118
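Backward recovery for a single process can be sketched as below. The class `Checkpointed`, the helper `run_with_retry` and the flaky operation are hypothetical names for this example. The domino effect arises only with several communicating processes and is not modelled here.

```python
import copy


class Checkpointed:
    """Holds state with a single recovery point for backward error recovery."""

    def __init__(self, state):
        self.state = state
        self._saved = None

    def checkpoint(self):
        # Establish a recovery point by saving a deep copy of the state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state saved at the recovery point.
        self.state = copy.deepcopy(self._saved)


def run_with_retry(store, operation, attempts=3):
    """Checkpoint, try the operation, roll back and retry on failure."""
    store.checkpoint()
    for attempt in range(attempts):
        try:
            return operation(store.state, attempt)
        except Exception:
            store.rollback()
    raise RuntimeError("all retries failed; state restored to recovery point")


def flaky_append(state, attempt):
    state.append("partial")      # damages the state before failing
    if attempt == 0:
        raise IOError("transient fault")
    state[-1] = "done"
    return len(state)


store = Checkpointed(["init"])
n = run_with_retry(store, flaky_append)
```

The first attempt leaves a half-written entry behind; the rollback discards it, so the retry starts from the clean recovery point.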
Recovery blocks BW p. 120
ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR
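The recovery block construct above could be emulated in a general-purpose language roughly as follows. This is a sketch under assumptions: `recovery_block`, the sorting modules and the acceptance test are invented for the example, and state is restored by deep copy rather than by a dedicated recovery cache.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternative fails the acceptance test (ELSE ERROR)."""


def recovery_block(state, acceptance_test, modules):
    """ENSURE acceptance_test BY modules[0] ELSE BY modules[1] ... ELSE ERROR."""
    recovery_point = copy.deepcopy(state)       # checkpoint on entry
    for module in modules:
        # Each alternative starts from the recovery point.
        candidate = copy.deepcopy(recovery_point)
        try:
            module(candidate)
        except Exception:
            continue                             # a crash counts as a failed alternative
        if acceptance_test(candidate):
            return candidate
    raise RecoveryBlockError("all alternatives failed the acceptance test")


def fast_sort(data):
    data["items"] = data["items"][::-1]          # faulty primary module


def safe_sort(data):
    data["items"] = sorted(data["items"])        # simpler alternative module


def is_sorted(data):
    items = data["items"]
    return all(a <= b for a, b in zip(items, items[1:]))


result = recovery_block({"items": [3, 1, 2]}, is_sorted, [fast_sort, safe_sort])
```

The acceptance test here only checks ordering; a stronger test would also check that the result is a permutation of the input.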
The ideal FT-component. Normal mode and exception handler. Interfaces: request/response, interface exception, failure exception (each towards the caller and towards sub-components). BW p. 126
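A component in this style separates illegal requests (interface exception) from its own inability to deliver the service (failure exception). The sketch below assumes a hypothetical integer square-root service; `SqrtComponent` and its primary/backup algorithms are invented for illustration.

```python
import math


class InterfaceException(Exception):
    """The request violated the component's interface."""


class FailureException(Exception):
    """The component could not deliver the service despite its own recovery."""


class SqrtComponent:
    """Hypothetical component offering an integer square-root service."""

    def __init__(self, primary, backup):
        self._primary = primary
        self._backup = backup

    def request(self, n):
        # Illegal requests are answered with an interface exception.
        if not isinstance(n, int) or n < 0:
            raise InterfaceException(f"illegal request: {n!r}")
        try:
            return self._checked(self._primary, n)      # normal mode
        except Exception:
            # Exception handler: try to mask the error with the backup.
            try:
                return self._checked(self._backup, n)
            except Exception as exc:
                raise FailureException("service not available") from exc

    @staticmethod
    def _checked(algorithm, n):
        r = algorithm(n)
        # Internal error detection on the computed result.
        if not (r * r <= n < (r + 1) * (r + 1)):
            raise ValueError("result failed internal check")
        return r


component = SqrtComponent(primary=lambda n: n // 2, backup=math.isqrt)
answer = component.request(10)
```

The faulty primary is masked by the backup, so the caller sees a normal response. Only when both fail does the failure exception propagate to the caller.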
Safety Assessment Find faults that may lead to mishaps, analyze their relations, and estimate their consequences. May involve probabilistic reasoning (Reliability Engineering)
Fault Tree - Events. Primary events: Basic event – fault in atomic component; Undeveloped event – fault in composite component (may be analyzed later); External event – expected event from environment. Intermediate event: node inside a fault tree.
Fault Tree - Gates ... Inhibit gate (with condition)
Example – "Wake too late". Top event: Wake too late. Sub-events: Alarm clock fails; Phone fails; "Inner clock" fails.
Example "Alarm clock fails". Events: Alarm clock fails; Beeper fails; Button fails; Electronics fail; SW fails; Power fails; Button read fails; Beeper not set.
Cut Set. A cut set is a set of events that together cause the top-level event. A singleton cut set is a single point of failure.
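Cut sets can be computed mechanically from the gate structure. The sketch below reuses event names from the "Wake too late" example, but the AND/OR gates and the simplified alarm clock subtree are assumptions, since the slide text does not preserve the gates.

```python
from itertools import product


def cut_sets(node):
    """Return the cut sets of a fault tree as a set of frozensets of basic events.

    A node is either a string (basic event) or a tuple (gate, children)
    with gate "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child triggers the gate.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal(sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


# Assumed gate structure, for illustration only.
alarm_clock_fails = ("OR", ["Beeper fails", "Button fails", "Power fails"])
wake_too_late = ("AND", [alarm_clock_fails, "Phone fails", "Inner clock fails"])

mcs = minimal(cut_sets(wake_too_late))
single_points = [s for s in mcs if len(s) == 1]
```

Under these assumed gates the top event has three minimal cut sets of three events each and no single point of failure, whereas the alarm clock subtree on its own has three singleton cut sets.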
Extensions etc. Probabilities on edges. Event tree (forward analysis from initiating event). Combinations (cause-consequence diagrams). Many tools. Kirsten M. Hansen, Anders P. Ravn and Victoria Stavridou, From Safety Analysis to Formal Specification, IEEE Trans. Softw. Eng. 24, July 1998
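Adding probabilities turns the fault tree into a simple quantitative model. The sketch below attaches probabilities to basic events rather than edges, assumes the events are independent and that each appears only once in the tree, and uses made-up numbers purely for illustration.

```python
def top_probability(node, p):
    """Probability of a fault-tree node, assuming independent basic events.

    A node is a string (basic event) or a tuple (gate, children);
    `p` maps each basic event name to its probability.
    """
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probs = [top_probability(child, p) for child in children]
    if gate == "AND":
        # All children must occur: multiply the probabilities.
        result = 1.0
        for q in probs:
            result *= q
        return result
    if gate == "OR":
        # At least one child occurs: 1 minus "none of them occurs".
        none = 1.0
        for q in probs:
            none *= 1.0 - q
        return 1.0 - none
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical tree and illustrative (not measured) probabilities.
tree = ("AND", [("OR", ["Beeper fails", "Power fails"]), "Phone fails"])
p = {"Beeper fails": 0.1, "Power fails": 0.1, "Phone fails": 0.5}
p_top = top_probability(tree, p)
```

With these numbers the OR gate evaluates to 1 - 0.9 * 0.9 = 0.19 and the top event to 0.19 * 0.5 = 0.095. Trees with shared basic events need a cut-set based calculation instead.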
Example
Fault Hypotheses
Fault-Tolerant System
Impulse Generator
CU
Voter and Arbiter
Parameters
Properties
Procedure 1. Model the correct component and check that it has the desired properties. 2. Model relevant faults and introduce them as internal transitions to error states. Check that this fault-affected ... 3. Introduce into the model the mechanisms for fault detection, error recovery and masking, and check that the desired properties are valid for this design.
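The three steps of the procedure can be mimicked with a tiny exhaustive state exploration. This is only a sketch: the abstraction of units as "ok"/"err" modes, the triplication with a majority voter, and the fault hypothesis "at most `max_faults` units fail" are assumptions for the example, not the model used in the slides.

```python
from collections import Counter

CORRECT, WRONG = 1, 0   # abstract output values


def successors(state, max_faults):
    """Fault transitions: any 'ok' unit may move internally to 'err'."""
    if state.count("err") >= max_faults:
        return
    for i, mode in enumerate(state):
        if mode == "ok":
            yield state[:i] + ("err",) + state[i + 1:]


def reachable(initial, max_faults):
    """Explore all states reachable under the fault hypothesis."""
    seen, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        for nxt in successors(state, max_faults):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def system_output(state):
    """Majority vote over unit outputs; an 'err' unit outputs a wrong value."""
    outputs = [CORRECT if mode == "ok" else WRONG for mode in state]
    return Counter(outputs).most_common(1)[0][0]


def property_holds(n_units, max_faults):
    """Property: the system output is correct in every reachable state."""
    initial = ("ok",) * n_units
    return all(system_output(s) == CORRECT for s in reachable(initial, max_faults))


step1 = property_holds(n_units=1, max_faults=0)   # step 1: correct component
step2 = property_holds(n_units=1, max_faults=1)   # step 2: fault-affected component
step3 = property_holds(n_units=3, max_faults=1)   # step 3: triplication with voter
```

In this toy model the property holds for the correct component, fails once the fault transition is introduced, and holds again for the triplicated design under the single-fault hypothesis. A real analysis would use a model checker on a richer model.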