Published by Edgar Alexander. Modified over 9 years ago.
1
Failure Mode Assumptions and Assumption Coverage David Powell
2
Fault-Tolerance: Key Questions
– How may components fail? (Prevention strategies)
– At what rate may they fail? (The amount of redundancy needed)
– What are the important types of faults? (Types of redundancy needed)
– What is the relation between dependability, redundancy and faults? (General FT design guidelines)
3
An F-T Paradox/Dilemma
More faults → more redundancy
More redundancy → more possibility of faults
More possibility of faults → ???
4
Solution: Some Key Steps
Classify, quantify and verify the assumptions
5
Types of Failures
6
Overview
Single-user service
– Service model
– Potential errors
Multiple-user service
– Service model
– Potential errors
7
Single-user Service Model
Service items: $s_i$, $i = 1, 2, \ldots$
Value of $s_i$: $vs_i$
Observation time of $s_i$: $ts_i$
Service model: $s_i = \langle vs_i, ts_i \rangle$
An omniscient observer
8
Correctness Model
Service item $s_i$ is correct iff $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$
$SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$
9
Potential Errors
Arbitrary value error: $s_i : vs_i \notin SV_i$
Noncode error: $s_i : vs_i \notin CV$ ($CV$ defines a code)
Arbitrary timing error: $s_i : ts_i \notin ST_i$
Early timing error: $s_i : ts_i < \min(ST_i)$
Late timing error: $s_i : ts_i > \max(ST_i)$
Omission error: $s_i : ts_i = \infty$
Impromptu error: $s_i$ : ($vs_i$ = ) ($ts_i$ = )
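The error definitions on this slide can be sketched as a small classifier. This is only an illustration and not part of the original slides: the names `ServiceItem` and `classify` are hypothetical, `math.inf` is assumed to stand for an item that is never delivered, $ST_i$ is assumed to be an interval given by its minimum and maximum, and impromptu errors are left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceItem:
    value: object  # vs_i, the delivered value
    time: float    # ts_i, the observation time (math.inf if never delivered)


def classify(item, sv, st_min, st_max, cv=None):
    """Return the set of error labels that apply to one service item.

    sv:      specified set of values (SV_i)
    st_min:  min(ST_i), assuming ST_i is an interval
    st_max:  max(ST_i)
    cv:      optional set of code words (CV)
    Labels overlap on purpose: an omission is also a late timing error.
    """
    labels = set()
    delivered = item.time != math.inf

    # Value-domain checks only make sense for an item that was delivered.
    if delivered:
        if item.value not in sv:
            labels.add("arbitrary value")
        if cv is not None and item.value not in cv:
            labels.add("noncode value")

    # Time-domain checks.
    if item.time < st_min:
        labels.add("early timing")
    if item.time > st_max:
        labels.add("late timing")
    if not delivered:
        labels.add("omission")
    if not (st_min <= item.time <= st_max):
        labels.add("arbitrary timing")
    return labels
```

A correct item yields an empty set, while `ServiceItem(None, math.inf)` is reported as an omission that is also a late and an arbitrary timing error.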
10
Multi-user Service Model
Service item: $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
Service model: $s_i(u) = \langle vs_i(u), ts_i(u) \rangle$, for all $i, u$
New issue: “consistency”
11
Correctness Model
$vs_i(u)$: the value of service item $i$ on process $u$
$vs_i$: the value of service item $i$
$SV_i$: the set of specified values for service item $i$
$ts_i(u)$: the observation time of service item $i$ on process $u$
$ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
uv: the time bound of related occurrences
12
Examples of Potential Errors Consistent value error Consistent timing error Semi-consistent value error
13
Failure Mode Assumptions
An attempt to formalize the concept of an assumed failure mode, by means of assertions on the sequences of service items delivered by a component
14
Examples of Value Error Assertions
No value errors occur ($V_{none}$): $\forall i,\ vs_i \in SV_i$
The only value errors that occur are noncode value errors ($V_n$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin CV)$
Arbitrary value errors can occur ($V_{arb}$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
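As a rough sketch, the three value error assertions can be checked against an observed sequence of values. The function names and the idea of reporting the "strongest" assertion that holds are illustrative assumptions, not something the slides define.

```python
def v_none(values, specified):
    """V_none: every value lies in its specified set SV_i."""
    return all(v in sv for v, sv in zip(values, specified))


def v_n(values, specified, code):
    """V_n: every value is either correct or outside the code CV."""
    return all(v in sv or v not in code for v, sv in zip(values, specified))


def v_arb(values, specified):
    """V_arb: a tautology, so it holds for any sequence."""
    return all(v in sv or v not in sv for v, sv in zip(values, specified))


def strongest_value_assertion(values, specified, code):
    """Return the most restrictive assertion satisfied by the sequence."""
    if v_none(values, specified):
        return "V_none"
    if v_n(values, specified, code):
        return "V_n"
    return "V_arb"
```

With `code = {1, 2, 3}` and specified sets `[{1}, {2}]`, the sequence `[1, 9]` satisfies only `V_n` (the bad value is outside the code), whereas `[1, 3]` satisfies only `V_arb` (the bad value is a valid code word).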
15
Examples of Timing Error Assertions
No timing errors occur ($T_{none}$)
The only timing errors are omission errors ($T_O$)
The only timing errors are late timing errors ($T_L$)
The only timing errors are early timing errors ($T_E$)
Arbitrary timing errors can occur ($T_{arb}$)
Permanent omission/crash ($T_p$)
Bounded omission degree ($T_{Bk}$)
16
Timing Error Implications
17
Failure Mode Assertions (FMA)
A complete FMA entails an assertion on errors occurring in both the value and time domains
By taking the Cartesian product of the two domains, we get a family of FMAs
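The Cartesian product can be enumerated directly. The sketch below uses only the example assertions listed on the two preceding slides; the label strings and the function name `fma_family` are illustrative, and the full family in the paper may differ.

```python
from itertools import product

# Example assertions taken from the value and timing slides.
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_p", "T_Bk"]


def fma_family():
    """Pair every value assertion with every timing assertion."""
    return list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))


for value_assertion, timing_assertion in fma_family():
    print(f"{value_assertion} x {timing_assertion}")
```

Three value assertions and seven timing assertions give 21 combined FMAs, ranging from `("V_none", "T_none")` to `("V_arb", "T_arb")`.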
18
FMA Implication Graph
19
So what?
The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!
20
Assumption Coverage
Establishing a link between the assumed component failure mode and system dependability
(The design of an FT system relies on the assumptions it makes)
(The dependability of an FT system is related to the failure mode it assumes)
21
Motivation
Components may fail
They may fail in a bad way that leads to a violation of the system's assumptions
The system, in turn, can fail
Question: to what degree can a component FMA prove to be true in the real system?
22
The Coverage of the Assumption
Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$
$P(V_{arb} \wedge T_{arb}) = 1$
$P(V_{none} \wedge T_{none}) = 0$
23
Coverage of an FT system
$P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
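The product on this slide can be evaluated with a few lines of code. The function name `system_coverage` and every number in the usage note are invented for illustration; they are not taken from the presentation or the paper.

```python
def system_coverage(p_processing_given_x, p_x_given_failure):
    """Coverage of the FT system under failure mode assertion X.

    p_processing_given_x: Pr{correct error processing | X = true}
    p_x_given_failure:    Pr{X = true | component failed}, the assumption coverage
    """
    for p in (p_processing_given_x, p_x_given_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing_given_x * p_x_given_failure
```

For example, with made-up figures, a restrictive assumption that holds in only 90% of failures but allows 99.9% correct error processing gives `system_coverage(0.999, 0.9)`, which is lower than `system_coverage(0.95, 1.0)` for an arbitrary-failure assumption with weaker error processing. The trade-off between the two factors is the point of the slide.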
24
Influence of Assumption Coverage on System Dependability A Case Study
25
The System
A system of n processors
Connected via unidirectional message-passing bus
Each processor carries out the same computation steps
The result of each processing step is communicated to all other processors
Each processor has a decision function (DF)
The DF is applied to the results received from other processors
…
Each processor and its associated bus is viewed as a single component
26
Fail-Silent Processor-bus
A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to one processor, it is delivered to all processors with a consistent fixed delay
27
Fail-Consistent Processor Bus
Only semi-consistent value errors may occur
Faulty processors may send erroneous values
Consistent timing errors may occur
28
Fail-Uncontrolled Processor Bus
Arbitrary timing errors
Arbitrary value errors
29
Implications of Assumption Coverage Failure mode relations Coverage relations
30
Dependability Expressions From Markov Models
$r = e^{-\lambda t}$
$\lambda$ = failure rate
31
A Life-critical Application
System reliability objective: $R > 1 - 10^{-9}$ over 10 hours
Single processor reliability:
– $r = e^{-\lambda t}$
– $1/\lambda = 5$ years
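To see why redundancy is needed here, the single-processor figures can be plugged into $r = e^{-\lambda t}$. This is a back-of-the-envelope sketch: the function name `unreliability` is hypothetical, and a year is assumed to be 365 days of 24 hours.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def unreliability(mttf_hours, mission_hours):
    """Return 1 - r, where r = exp(-lambda * t) and lambda = 1 / MTTF.

    math.expm1 keeps precision when lambda * t is very small.
    """
    failure_rate = 1.0 / mttf_hours
    return -math.expm1(-failure_rate * mission_hours)


single = unreliability(5 * HOURS_PER_YEAR, 10)
print(f"single-processor unreliability over 10 h: {single:.3e}")
print(f"meets the 1e-9 objective: {single < 1e-9}")
```

A single processor has an unreliability of roughly $2.3 \times 10^{-4}$ over the 10-hour mission, several orders of magnitude above the $10^{-9}$ objective.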
33
A Money-Critical Application It is about availability of the system rather than reliability of the system Please look at the paper for more details
34
Unavailability vs. Coverage
35
Conclusion A formalism for describing component failure modes Multiplicity of value and timing errors The notion of assumption coverage The relation between dependability, availability and assumption coverage
36
Thank you