1
Software Fault Tolerance – The Big Picture
RTS, April 2008
Anders P. Ravn, Aalborg University
2
Fault Tolerance
Means to isolate component faults
Prevents system failures
May increase system dependability
3
Dependability - attributes (BW p. 129)
Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
4
Dependability - impairments (BW p. 103,...,130)
Faults, Errors, Failures
Fault → Error → Failure → ... → Fault
5
System and Component
6
Dependability - means (BW p. 106,..., 130)
Fault prevention, Fault tolerance, Error Removal, Failure Forecasting
7
Fault classification
Origin: physical (internal/external), logical (design/interaction)
Kind: omission, value, timing, byzantine
Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)
8
Error Classification (Fault → Error)
Effect: latent, effective
Extent: local, distributed
9
Failure Classification (Fault → Error → Failure)
Consequence: benign, malign (a mishap)
BW (Failure modes) p. 105
10
Fault Avoidance
Careful Design: process (activities), notations, tools
Conservative Design: robust functionality, testability, traceability
11
Error Removal
Verification (analysis of design)
Test (analysis of implementation)
12
Failure Forecasting
Calculation – analysis of design
Simulation – measurement on design
Test – measurement on implementation
13
Fault Tolerance
Means to isolate component faults
Prevents system failures
May increase system dependability
... And mask them
14
Dependability - means (BW p. 106,...)
Fault prevention, Fault tolerance, Error Removal, Failure Forecasting
15
Fault Tolerance
16
FT - levels (BW p. 107)
Full tolerance, Graceful Degradation, Fail safe
17
FT basis: Redundancy (BW p. 109)
Time: Try, Retry, ...
Space: Try, ..., Try, ...
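The two forms of redundancy named on this slide can be sketched in a few lines. This is an illustrative sketch only: retry, first_success and make_flaky are hypothetical names, not taken from the slides or from BW. Note that redundancy in time only helps against transient faults; a permanent fault fails on every retry.

```python
def retry(operation, attempts=3):
    """Time redundancy: repeat the same operation until one attempt succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the last error propagate


def first_success(replicas):
    """Space redundancy: ask independent replicas in turn, return the first answer."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} replicas failed")


def make_flaky(failures):
    """Hypothetical operation that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise IOError("transient fault")
        return "ok"

    return operation
```

For example, retry(make_flaky(2), attempts=3) succeeds on the third try, while first_success([make_flaky(1), make_flaky(0)]) gets its answer from the second replica.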
18
N-version programming (BW p. 109)
V1, V2, V3
Driver (comparator)
Comparison vectors (votes)
Comparison status indicators
Comparison points
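A minimal sketch of the driver (comparator) may make the roles concrete. It assumes simple majority voting at a single comparison point and hashable results; the function name n_version_driver and the voting rule are assumptions for illustration, not the scheme defined in BW.

```python
from collections import Counter

_NO_VOTE = object()  # marks a version that crashed and cast no vote


def n_version_driver(versions, x):
    """Run every version on the same input and vote on the results.

    Returns (agreed_result, status), where status[i] says whether
    version i agreed with the majority (a comparison status indicator).
    """
    votes = []  # the comparison vector
    for version in versions:
        try:
            votes.append(version(x))
        except Exception:
            votes.append(_NO_VOTE)

    tally = Counter(v for v in votes if v is not _NO_VOTE)
    if not tally:
        raise RuntimeError("no version produced a result")
    result, count = tally.most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among the versions")

    status = [v is not _NO_VOTE and v == result for v in votes]
    return result, status
```

With three versions of a squaring routine, one of them faulty, the driver masks the fault: n_version_driver([lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1], 4) returns 16 together with the status list [True, True, False].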
19
Fault classification (scope of N-VP)
Origin: physical (internal/external), logical (design/interaction)
Kind: omission, value, timing, byzantine
Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)
Scope marks: + (+) ++ (+) + / (+) + / +
20
Dynamic Redundancy (BW p. 114)
1. Error detection
2. Damage confinement and assessment
3. Error recovery
4. Fault treatment and continued service
21
Error Detection (BW p. 115)
f: State x Input → State x Output
Environment (exception)
Application
Assertion: precondition(input), postcondition(input, output), invariant(state, state’)
Timing: WCET(f, input), Deadline(f, input), D
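Application-level detection with assertions and a timing check can be sketched as a wrapper around f. The names checked and DetectedError are hypothetical, and the deadline check here only notices a miss after f has returned; it does not bound or measure the WCET.

```python
import time


class DetectedError(Exception):
    """Raised when a check on f finds an error."""


def checked(f, pre, post, invariant, deadline_s):
    """Wrap f: State x Input -> State x Output with assertion and timing checks."""

    def wrapper(state, inp):
        if not pre(inp):
            raise DetectedError("precondition violated")
        start = time.monotonic()
        new_state, out = f(state, inp)
        elapsed = time.monotonic() - start
        if elapsed > deadline_s:
            raise DetectedError("deadline missed")
        if not post(inp, out):
            raise DetectedError("postcondition violated")
        if not invariant(state, new_state):
            raise DetectedError("invariant violated")
        return new_state, out

    return wrapper


def accumulate(total, amount):
    """Example f: add a non-negative amount to a running total."""
    new_total = total + amount
    return new_total, new_total


safe_accumulate = checked(
    accumulate,
    pre=lambda amount: amount >= 0,
    post=lambda amount, out: out >= amount,
    invariant=lambda old, new: new >= old,
    deadline_s=1.0,
)
```

Here safe_accumulate(10, 5) passes all checks, while safe_accumulate(10, -1) is rejected by the precondition before f runs.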
22
Damage Confinement (BW p. 117)
Static structure
Dynamic structure
23
Error Recovery (BW p. 118)
Forward: Repair the state – if you can!
Backward: define recovery points, checkpoint state at r. p., roll back, retry
Domino effect
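Backward recovery for a single process can be sketched as checkpoint, roll back and retry. The function run_with_rollback and the dictionary-based state are assumptions for illustration; the sketch does not address the domino effect, which only arises when several communicating processes roll back.

```python
import copy


def run_with_rollback(state, step, acceptable, retries=2):
    """Run step(state) in place; on error, restore the checkpoint and retry."""
    checkpoint = copy.deepcopy(state)  # recovery point
    for _ in range(retries + 1):
        try:
            step(state)
            if acceptable(state):
                return state
        except Exception:
            pass  # treat an exception like a failed acceptance check
        # Roll back: restore the state saved at the recovery point.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("backward recovery failed after all retries")
```

If a step corrupts the state on its first run because of a transient fault, the second run starts again from the checkpointed state rather than from the corrupted one.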
24
Recovery blocks (BW p. 120)
ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR
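The ENSURE ... BY ... ELSE BY notation is not an executable language, but its control flow can be approximated in Python. In this sketch the names ensure and RecoveryBlockError are hypothetical, and the recovery point is approximated by handing each alternate its own deep copy of the input.

```python
import copy


class RecoveryBlockError(Exception):
    """Corresponds to ELSE ERROR: every alternate was rejected."""


def ensure(acceptance_test, alternates, data):
    """Try each alternate module in order until one passes the acceptance test."""
    for module in alternates:
        snapshot = copy.deepcopy(data)  # each alternate starts from the recovery point
        try:
            result = module(snapshot)
        except Exception:
            continue  # a crashing alternate counts as a failed test
        if acceptance_test(data, result):
            return result
    raise RecoveryBlockError("all alternates failed the acceptance test")


def is_sorted_same_length(original, result):
    """Acceptance test: same number of items, in non-decreasing order."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(original) and in_order


def buggy_sort(items):
    """Primary module with a deliberate design fault: it drops the last item."""
    return sorted(items)[:-1]
```

Calling ensure(is_sorted_same_length, [buggy_sort, sorted], [3, 1, 2]) rejects the primary module and returns [1, 2, 3] from the second alternate. The acceptance test is deliberately weaker than a full correctness check, as is usual for recovery blocks.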
25
The ideal FT-component (BW p. 126)
Normal mode
Exception Handler
Request/response
Interface exception
Failure exception
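One way to read this diagram is as a component with two exits besides its normal response: an interface exception for an invalid request, and a failure exception when its own handler cannot recover. The sketch below follows that reading; AveragingComponent, its last-good-value fallback and the source callable are all hypothetical.

```python
class InterfaceException(Exception):
    """The request itself was invalid; the caller is at fault."""


class FailureException(Exception):
    """The component could not deliver its service despite its handler."""


class AveragingComponent:
    def __init__(self, source):
        self._source = source  # lower-level component (hypothetical callable)
        self._last_good = None

    def request(self, samples):
        # Check the request at the interface before doing any work.
        if not isinstance(samples, int) or samples < 1:
            raise InterfaceException("samples must be a positive integer")
        try:
            return self._normal_mode(samples)
        except Exception as exc:
            return self._exception_handler(exc)

    def _normal_mode(self, samples):
        value = sum(self._source() for _ in range(samples)) / samples
        self._last_good = value
        return value

    def _exception_handler(self, exc):
        # Mask the fault with the last good value if there is one;
        # otherwise signal a failure exception to the caller.
        if self._last_good is None:
            raise FailureException("no reading available") from exc
        return self._last_good
```

A caller one level up would treat FailureException from this component the same way this component treats exceptions from its source, which is what lets the pattern nest.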
26
Safety Assessment
Find faults that may lead to mishaps, analyze their relations, and estimate their consequences.
May involve probabilistic reasoning (Reliability Engineering).
27
Fault Tree - Events
Primary events:
Basic event – fault in atomic component
Undeveloped event – fault in composite component (may be analyzed later)
External event – expected event from environment
Intermediate event: nodes inside a fault tree
28
Fault Tree - Gates... condition Inhibit gate
29
Example – ”Wake too late”
Wake too late
Alarm clock fails, Phone fails, ”Inner clock” fails
30
Example ”Alarm clock fails”
Alarm clock fails
Beeper fails, Button fails, electronics fail, SW fails, Power fails, Button read fails, Beeper not set
31
Cut Set
A cut set is a set of events that causes a top level event.
A singleton cut set is a single point of failure.
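Minimal cut sets can be computed mechanically from a tree of AND and OR gates. The sketch below is one simple way to do so. The tree encoding is hypothetical, and the gate types chosen for the example trees are assumptions, since the gates of the original diagrams are not recoverable from the slide text.

```python
from itertools import product


def minimal(sets):
    """Drop duplicates and any cut set that strictly contains another one."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node):
    """Return the minimal cut sets of a fault tree.

    A node is either a basic event (a string) or a pair (gate, children)
    with gate "AND" or "OR".
    """
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's cut set is enough to cause the event.
        combined = [cs for sets in child_sets for cs in sets]
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimal(combined)


# Assumed gates: the alarm clock fails if any one part fails (OR);
# waking too late needs all three wake-up mechanisms to fail (AND).
ALARM = ("OR", ["Beeper fails", "Button fails", "Power fails"])
WAKE_TOO_LATE = ("AND", [ALARM, "Phone fails", "Inner clock fails"])
```

Under these assumed gates, cut_sets(ALARM) yields three singleton cut sets, so each part is a single point of failure for the alarm clock, whereas every cut set of WAKE_TOO_LATE contains three events.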
32
Example – ”Wake too late”
Wake too late
Alarm clock fails, Phone fails, ”Inner clock” fails
33
Example ”Alarm clock fails”
Alarm clock fails
Beeper fails, Button fails, electronics fail, SW fails, Power fails, Button read fails, Beeper not set
34
Extensions etc.
Probabilities on edges
Event tree (forward analysis from initiating event)
Combinations (cause-consequence diagrams)
Many tools
Kirsten M. Hansen, Anders P. Ravn and Victoria Stavridou, From Safety Analysis to Formal Specification, IEEE Trans. Softw. Eng. 24, pp. 573-584, July 1998
35
Example
36
Fault Hypotheses
37
Fault-Tolerant System
38
Impulse Generator
39
CU
40
Voter and Arbiter
41
Parameters
42
Properties
43
Procedure
1. Model the correct component and check that it has the desired properties.
2. Model relevant faults and introduce them as internal transitions to error states. Check that this fault-affected.
3. Introduce into the model the mechanisms for fault detection, error recovery and masking and check that the desired properties are valid for this design.