Fault Tolerance Mechanisms
ITV Model-based Analysis and Design of Embedded Software
Techniques and methods for Critical Software
Anders P. Ravn
Aalborg University
August 2011
Fault Tolerance
  Means to isolate component faults ... and mask them
  Prevents system failures
  May increase system dependability
FT - levels
  Full tolerance
  Graceful degradation
  Fail safe
  BW p. 107
FT basis: Redundancy
  Time
  Space
  [Figure labels: Try, Retry, ...; Try, ...]
Basic Strategies
Dynamic Redundancy
  1. Error detection
  2. Damage confinement and assessment
  3. Error recovery
  4. Fault treatment and continued service
  BW p. 114
Error Detection
  f: State x Input -> State x Output
  Environment (exception)
  Application
    Assertion:
      precondition (input)
      postcondition (input, output)
      invariant (state, state’)
    Timing:
      WCET (f, input)
      Deadline (f, input)
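The application-level checks listed above can be sketched in Java roughly as follows. The class `CheckedAccumulator`, its `step` method, the `DetectedError` exception and the deadline parameter are all hypothetical names chosen for illustration. The timing check only measures elapsed time after the computation, so it detects a missed deadline but does not enforce one.

```java
/** Hypothetical component f: State x Input -> State x Output with run-time checks. */
final class CheckedAccumulator {

    /** Raised when an assertion or timing check detects an error. */
    static final class DetectedError extends RuntimeException {
        DetectedError(String message) { super(message); }
    }

    private long state = 0;              // State: a running total that never decreases
    private final long deadlineNanos;    // assumed deadline for one call of step

    CheckedAccumulator(long deadlineNanos) { this.deadlineNanos = deadlineNanos; }

    long state() { return state; }

    /** f: adds the input to the state and returns the new total as output. */
    long step(int input) {
        // Precondition (input): only non-negative inputs are legal.
        if (input < 0) throw new DetectedError("precondition violated: input < 0");

        long start = System.nanoTime();
        long newState = state + input;   // the computation being checked
        long output = newState;
        long elapsed = System.nanoTime() - start;

        // Postcondition (input, output): the total is at least the input.
        if (output < input) throw new DetectedError("postcondition violated");
        // Invariant (state, state'): the total never decreases.
        if (newState < state) throw new DetectedError("invariant violated");
        // Timing: the call must finish within the deadline.
        if (elapsed > deadlineNanos) throw new DetectedError("deadline missed");

        state = newState;                // commit only after all checks have passed
        return output;
    }
}
```

The state is committed only after every check has passed, so a detected error leaves the component in its previous state.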
Damage Confinement
  Static structure
  Dynamic structure (transaction)
  [Figure labels: object, I, I]
Error Recovery
  Forward: repair the state, if you can!
  Backward:
    define recovery points
    checkpoint state at r. p.
    roll back
    retry
  Domino effect
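Backward recovery can be illustrated with a small Java sketch: define a recovery point, checkpoint the state, and roll back and retry when an error is detected. The class `CheckpointedBuffer` and its methods are hypothetical, and an unchecked exception stands in for error detection. Since only one component is involved, the domino effect cannot occur here; it arises when several interacting processes roll each other back.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical component whose state can be checkpointed and rolled back. */
final class CheckpointedBuffer {
    private int[] data;
    private final Deque<int[]> recoveryPoints = new ArrayDeque<>();

    CheckpointedBuffer(int[] initial) { data = initial.clone(); }

    /** Define a recovery point: save a copy of the current state. */
    void checkpoint() { recoveryPoints.push(data.clone()); }

    /** Roll back to the most recent recovery point and discard it. */
    void rollBack() { data = recoveryPoints.pop(); }

    /** Discard the most recent recovery point after a successful operation. */
    void commit() { recoveryPoints.pop(); }

    int[] snapshot() { return data.clone(); }

    /**
     * Runs the operation on the state. If it throws, the state is rolled back
     * and the operation is retried, at most maxTries times in total.
     * Returns true if some attempt succeeded.
     */
    boolean runWithRetry(Consumer<int[]> operation, int maxTries) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            checkpoint();
            try {
                operation.accept(data);
                commit();
                return true;
            } catch (RuntimeException e) {
                rollBack();              // undo any partial update, then retry
            }
        }
        return false;
    }
}
```

Retrying the same operation only helps against transient faults; a permanent fault exhausts the retries and leaves the state as it was at the recovery point.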
Recovery blocks

ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR

BW p. 120
Implementation of Recovery Blocks
Abstract class RecoveryBlock

public abstract class RecoveryBlock {

    /** Acceptance test; it must be implemented by the application. */
    abstract boolean acceptanceTest();

    /**
     * Method to produce the result; it must be implemented by the application.
     * @param module index of the module to run: 0, ..., MaxBlocks-1
     */
    abstract void block(int module);

    /* MaxBlocks must be set by the application to the number of blocks */
    protected int MaxBlocks;
RecoveryBlock execution

    /**
     * Method to execute recovery modules 0, 1, ..., MaxBlocks-1 until one succeeds.
     * @throws NoAccept if no module passes acceptanceTest.
     */
    public final void do_it() throws NoAccept, CloneNotSupportedException {
        save();
        int i = 0;
        do {
            try {
                block(i++);
                if (acceptanceTest()) return;
            } catch (Exception e) {
                /* if the block fails, we continue - not acceptance */
            }
            restore(copy);
        } while (i < MaxBlocks);
        throw new NoAccept();
    }
RecoveryBlock cache

public abstract class RecoveryBlock {

    /** The recovery cache is implemented by a clone of the original object. */
    RecoveryBlock copy;

    /**
     * Save the object to the recovery cache; uses Java clone, which must be a
     * deep clone. The concrete class must implement Cloneable, otherwise
     * clone() throws CloneNotSupportedException.
     */
    private final void save() throws CloneNotSupportedException {
        copy = (RecoveryBlock) this.clone();
    }

    /**
     * Method to restore data from the recovery cache; it must be implemented
     * by the application.
     * @param copy value of the object to be restored
     */
    abstract void restore(RecoveryBlock copy);
Application

/**
 * Extends the basic abstract RecoveryBlock with faulty sorting
 * algorithms and logs calls, returns etc. to a TextArea.
 */
public class RecoveringSort extends RecoveryBlock {

    /** checksum for acceptance test */
    private int checksum;

    /** data to be saved in recovery cache */
    private int[] argument;

    public RecoveringSort(TextArea t) {
        MaxBlocks = 3;
        log = t;
    }
Acceptance criteria

    /* Acceptance test for sorting; it shall verify:
     * 1) the return value is an ordered list,
     * 2) the return value is a permutation of the initial values
     */
    boolean acceptanceTest() {
        boolean result = true;

        // check ordering
        int i = argument.length - 1;
        while (i > 0)
            if (argument[i] < argument[--i]) { result = false; break; }

        // check permutation, this is a partial check through a checksum
        // A full check is as expensive computationally as sorting,
        // thus, we use a partial check.
        i = argument.length;
        int sum = 0;
        while (i > 0) sum += argument[--i];

        return result && (sum == checksum);
    }
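To see why the checksum is only a partial check, the following sketch compares it with a full permutation check. The class `PermutationCheck` and both method names are hypothetical and not part of the slides' code; the full check sorts copies of both arrays, which is the cost the comment above refers to.

```java
import java.util.Arrays;

/** Hypothetical helper contrasting the partial and the full permutation check. */
final class PermutationCheck {

    /** Partial check: both arrays have the same sum of elements. */
    static boolean sameChecksum(int[] original, int[] result) {
        int a = 0;
        int b = 0;
        for (int v : original) a += v;
        for (int v : result) b += v;
        return a == b;
    }

    /** Full check: the sorted copies of both arrays are equal. */
    static boolean isPermutation(int[] original, int[] result) {
        int[] x = original.clone();
        int[] y = result.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }
}
```

A faulty sort that turns {3, 1, 2} into {2, 2, 2} produces an ordered result with the same checksum, so it passes the partial check but fails the full one.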
Application - modules

    /**
     * Starts sorting using the recovery block mechanisms.
     * @param data integer array containing elements to be sorted.
     */
    public int[] sort(int[] data) {
        argument = (int[]) data.clone(); // copy needed for recovery to work
        checksum = 0;
        int i = argument.length;
        while (i > 0) checksum += argument[--i];
        try {
            do_it();
        } catch (NoAccept e) {
            log.append("All blocks failed\n");
        } catch (CloneNotSupportedException e) {
            // do_it() also declares this checked exception, so it must be handled
            log.append("Recovery cache could not be saved\n");
        }
        return argument;
    }

    void block(int i) {
        switch (i) {
            case 0: BucketSort(argument); break;
            case 1: BadSort(argument); break;
            case 2: AlmostGoodSort(argument); break;
            default:
        }
    }
Fault classes (scope of R-B)
  Origin:   physical (internal/external); logical (design/interaction)
  Kind:     omission; value; timing; byzantine
  Property: duration (permanent, transient); consistency (determinate, nondeterminate); autonomy (spontaneous, event-dependent)
  Marks:    + (+) ++ (-) + / (+) + / +
The ideal FT-component
  [Figure: component with Normal mode and Exception Handler parts; arrows labelled Request/response, Interface exception, Failure exception]
N-version programming
  V1, V2, V3
  Driver (comparator)
  Comparison vectors (votes)
  Comparison status indicators
  Comparison points
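The driver's role can be sketched in Java as a majority voter over the results of the versions. This is a minimal sketch under several assumptions: the class `NVersionDriver` and its `vote` method are hypothetical, the versions run sequentially rather than concurrently, there is a single comparison point at the end, and votes are compared for exact equality (real-valued results would typically need inexact voting).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntUnaryOperator;

/** Hypothetical driver (comparator) for N independently written versions. */
final class NVersionDriver {

    /**
     * Runs every version on the same input, collects the votes and returns
     * the value backed by a strict majority of all versions, if there is one.
     */
    static Optional<Integer> vote(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            try {
                int result = version.applyAsInt(input);   // this version's vote
                counts.merge(result, 1, Integer::sum);
            } catch (RuntimeException e) {
                // a version that fails outright casts no vote
            }
        }
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (2 * entry.getValue() > versions.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();                          // no majority: report failure
    }
}
```

With three versions, one faulty version is outvoted by the other two; if all three disagree, the driver has no majority and must signal a failure.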
Fault classes (scope of N-VP)
  Origin:   physical (internal/external); logical (design/interaction)
  Kind:     omission; value; timing; byzantine
  Property: duration (permanent, transient); consistency (determinate, nondeterminate); autonomy (spontaneous, event-dependent)
  Marks:    + (+) / (+) + / +