Failure Mode Assumptions and Assumption Coverage David Powell
Fault-Tolerance: Key Questions
–How may components fail? → Prevention strategies
–At what rate may they fail? → The amount of redundancy needed
–What are the important types of faults? → Types of redundancy needed
–What is the relation between dependability, redundancy and faults? → General FT design guidelines
An F-T Paradox/Dilemma
More faults → more redundancy → more possibility of faults???
Solution: Some Key Steps
Classify, quantify and verify the assumptions
Types of Failures
Overview Single-user service –Service Model –Potential Errors Multiple-user service –Service Model –Potential Errors
Single-user Service Model
–Service items: $s_i$, $i = 1, 2, \ldots$
–Value of $s_i$: $vs_i$
–Observation time of $s_i$: $ts_i$
–Service model: $s_i = \langle vs_i, ts_i \rangle$
–Viewpoint: an omniscient observer
Correctness Model
Service item $s_i$ is correct iff $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$
$SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$
Potential Errors
–Arbitrary value error: $s_i$ : $vs_i \notin SV_i$
–Noncode error: $s_i$ : $vs_i \notin CV$ ($CV$ defines a code)
–Arbitrary timing error: $s_i$ : $ts_i \notin ST_i$
–Early timing error: $s_i$ : $ts_i < \min(ST_i)$
–Late timing error: $s_i$ : $ts_i > \max(ST_i)$
–Omission error: $s_i$ : $ts_i = \infty$
–Impromptu error: $s_i$ : ($vs_i$ = ) ($ts_i$ = )
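The error definitions above can be sketched as a small classifier for a single service item. This is only an illustration under stated assumptions: the names `ServiceItem` and `classify` are hypothetical, $ST_i$ is approximated by a closed interval `[st_min, st_max]`, and an omitted item is modelled with `math.inf` as its observation time. The impromptu error is left out because its formal condition is incomplete in the slide.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceItem:
    value: object   # vs_i (ignored when the item is omitted)
    time: float     # ts_i; math.inf models an item that is never delivered


def classify(item, sv, st_min, st_max, cv):
    """Return the set of error labels that apply to one service item.

    sv     : specified set of values SV_i
    st_min : earliest specified time, min(ST_i)  (interval is an assumption)
    st_max : latest specified time, max(ST_i)
    cv     : set of valid code words CV
    """
    errors = set()
    omitted = item.time == math.inf
    if not omitted:
        # Value errors only make sense for items that were delivered.
        if item.value not in sv:
            errors.add("arbitrary value")
        if item.value not in cv:
            errors.add("noncode value")
    if not (st_min <= item.time <= st_max):
        errors.add("arbitrary timing")
    if item.time < st_min:
        errors.add("early timing")
    if item.time > st_max:
        errors.add("late timing")
    if omitted:
        errors.add("omission")
    return errors


# Hypothetical specification: value 42 expected between t=0.5 and t=1.5.
SV, CV = {42}, {41, 42, 43}
correct = classify(ServiceItem(42, 1.0), SV, 0.5, 1.5, CV)
late_wrong = classify(ServiceItem(41, 2.0), SV, 0.5, 1.5, CV)
omitted = classify(ServiceItem(None, math.inf), SV, 0.5, 1.5, CV)
```

Note that in this sketch an omission is also reported as a late and an arbitrary timing error, since $\infty > \max(ST_i)$ and $\infty \notin ST_i$.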
Multi-user Service Model
–Service item: $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
–Service model: $s_i(u) = \langle vs_i(u), ts_i(u) \rangle$, for all $i$, $u$
–New issue: “consistency”
Correctness Model
–$vs_i(u)$: the value of service item $i$ on process $u$
–$vs_i$: the value of service item $i$
–$SV_i$: the specified set of values for service item $i$
–$ts_i(u)$: the observation time of service item $i$ on process $u$
–$ST_i(u)$: the specified set of observation times of service item $i$ on process $u$
–uv: the time bound of related occurrences
Examples of Potential Errors Consistent value error Consistent timing error Semi-consistent value error
Failure Mode Assumptions
An attempt to formalize the concept of an assumed failure mode by assertions on the sequences of service items delivered by a component
Examples of Value Error Assertions
–No value errors occur ($V_{none}$): $\forall i,\ vs_i \in SV_i$
–The only value errors that occur are noncode value errors ($V_n$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin CV)$
–Arbitrary value errors can occur ($V_{arb}$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
Examples of Timing Error Assertions
–No timing error occurs ($T_{none}$)
–The only timing errors are omission errors ($T_O$)
–The only timing errors are late timing errors ($T_L$)
–The only timing errors are early timing errors ($T_E$)
–Arbitrary timing errors can occur ($T_{arb}$)
–Permanent omission/crash ($T_P$)
–Bounded omission degree ($T_{Bk}$)
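A timing error assertion can be read as a predicate over a whole sequence of service items. The slide gives only the names of the timing assertions, so the sketch below assumes they follow the same pattern as the value error assertions (each item is either correct or shows only the permitted kind of error). The function names are hypothetical, $ST_i$ is approximated by a `(lo, hi)` interval, and `math.inf` models an omitted item. $T_P$ and $T_{Bk}$ are not covered because their formal definitions are not given here.

```python
import math


def in_spec(ts, st):
    """True if observation time ts lies in the specified interval st = (lo, hi)."""
    lo, hi = st
    return lo <= ts <= hi


def t_none(times, specs):
    # No timing error: every item is observed within its specified interval.
    return all(in_spec(ts, st) for ts, st in zip(times, specs))


def t_omission(times, specs):
    # Only omission errors: each item is on time or never delivered.
    return all(in_spec(ts, st) or ts == math.inf for ts, st in zip(times, specs))


def t_late(times, specs):
    # Only late timing errors: each item is on time or after max(ST_i).
    return all(in_spec(ts, st) or ts > st[1] for ts, st in zip(times, specs))


def t_early(times, specs):
    # Only early timing errors: each item is on time or before min(ST_i).
    return all(in_spec(ts, st) or ts < st[0] for ts, st in zip(times, specs))


def t_arb(times, specs):
    # Arbitrary timing errors: nothing is excluded, so this always holds.
    return True


# Hypothetical trace: the second item is omitted.
specs = [(0.5, 1.5), (1.5, 2.5), (2.5, 3.5)]
with_omission = [1.0, math.inf, 3.0]
with_late_item = [1.0, 2.7, 3.0]
```

In this sketch a trace that satisfies `t_omission` also satisfies `t_late`, because an omitted item has an infinitely late observation time.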
Timing Error Implications
Failure Mode Assertions (FMA)
–A complete FMA entails an assertion on errors occurring in both the value and time domains
–By taking the Cartesian product of the two sets of assertions, we get a family of FMAs
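The family of FMAs can be enumerated directly as a Cartesian product. The sketch below uses only the assertion names listed on the earlier slides; the string labels and variable names are illustrative.

```python
from itertools import product

# Assertion names taken from the value and timing error slides.
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_P", "T_Bk"]

# Each complete FMA pairs one value assertion with one timing assertion.
fma_family = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))

for value_assertion, timing_assertion in fma_family:
    print(f"{value_assertion} and {timing_assertion}")
```

With three value assertions and seven timing assertions this yields 21 combined assertions, from the most restrictive (`V_none`, `T_none`) to the least restrictive (`V_arb`, `T_arb`).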
FMA Implication Graph
So what?
The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!
Assumption Coverage
Establishing a link between the assumed component failure mode and system dependability
–The design of an FT system relies on the failure mode assumptions it makes
–The dependability of an FT system is related to the failure modes it assumes
Motivation
–Components may fail
–They may fail in ways that violate the system's failure mode assumptions
–The system, in turn, can then fail
–Question: to what degree does a component FMA prove to be true in the real system?
The Coverage of the Assumption
Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$
$P(V_{arb} \wedge T_{arb}) = 1$
$P(V_{none} \wedge T_{none}) = 0$
Coverage of an FT system
$P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
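The product form of the system coverage can be illustrated with a few lines of code. The function name `system_coverage` and all numeric values below are hypothetical and chosen only to show the trade-off: a restrictive assumption may allow very effective error processing but is less likely to hold, while the arbitrary failure assumption always holds (assumption coverage 1) but may be harder to process correctly.

```python
def system_coverage(p_processing: float, p_assumption: float) -> float:
    """Coverage of an FT system under failure mode assertion X.

    p_processing : Pr{correct error processing | X = true}
    p_assumption : Pr{X = true | component failed}  (assumption coverage)
    """
    for p in (p_processing, p_assumption):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing * p_assumption


# Hypothetical numbers, for illustration only.
restrictive = system_coverage(p_processing=0.999, p_assumption=0.95)
arbitrary = system_coverage(p_processing=0.98, p_assumption=1.0)

print(f"restrictive assumption: {restrictive:.5f}")
print(f"arbitrary assumption:   {arbitrary:.5f}")
```

With these made-up values the weaker assumption gives the higher overall coverage, which is the kind of effect the case study on the following slides examines.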
Influence of Assumption Coverage on System Dependability A Case Study
The System
–A system of n processors
–Connected via unidirectional message-passing buses
–Each processor carries out the same computation steps
–The result of each processing step is communicated to all other processors
–Each processor has a decision function (DF)
–The DF is applied to the results received from other processors …
–Each processor and its associated bus is viewed as a single component
Fail-Silent Processor-bus
A fail-silent processor
–Only has semi-consistent value errors
–Always produces messages on time
–Or ceases to produce messages forever
–If a message is delivered to one processor, it is delivered to all processors with a consistent fixed delay
Fail-Consistent Processor Bus Only semi-consistent value errors may occur Faulty processors may send erroneous values Consistent timing error may occur
Fail-uncontrolled Processor Bus Arbitrary timing error Arbitrary value error
Implications of Assumption Coverage Failure mode relations Coverage relations
Dependability Expressions From Markov Models
$r = e^{-\lambda t}$, where $\lambda$ is the failure rate
A Life-critical Application
–System reliability objective: R > over 10 hours
–Single processor reliability: $r = e^{-\lambda t}$, with $1/\lambda$ = 5 years
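The single processor reliability over the 10-hour mission can be evaluated directly from $r = e^{-\lambda t}$. The sketch below assumes a 365-day year when converting $1/\lambda$ = 5 years to hours; the function and constant names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


def reliability(mttf_hours: float, t_hours: float) -> float:
    """r = exp(-lambda * t) with constant failure rate lambda = 1 / MTTF."""
    failure_rate = 1.0 / mttf_hours
    return math.exp(-failure_rate * t_hours)


mttf_hours = 5 * HOURS_PER_YEAR      # 1/lambda = 5 years
r = reliability(mttf_hours, 10.0)    # mission time of 10 hours

print(f"r = {r:.7f}, unreliability = {1.0 - r:.3e}")
```

Under these assumptions a single processor has a reliability of about 0.99977 over 10 hours, that is, an unreliability of roughly 2.3e-4.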
A Money-Critical Application It is about availability of the system rather than reliability of the system Please look at the paper for more details
Unavailability vs. Coverage
Conclusion A formalism for describing component failure modes Multiplicity of value and timing errors The notion of assumption coverage The relation between dependability, availability and assumption coverage
Thank you