Published by June Griffith. Modified over 8 years ago.
1
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance
2
Duplex Systems
Both processors execute the same task.
If the outputs agree, the result is assumed to be correct.
If the outputs differ, the comparison alone cannot identify the failed processor.
Higher-level software has to decide how the failure is to be handled.
This can be done using one of several methods.
3
Duplex Reliability
Two active, identical processors, each with reliability $R(t)$.
Lifetime of the duplex: the time until both processors fail.
$C$ (coverage factor): the probability that a faulty processor will be correctly diagnosed, identified and disconnected.
$R_{comp}(t)$: the reliability of the comparator.
$R_{duplex}(t)$: the reliability of the duplex system:
$$R_{duplex}(t) = R_{comp}(t)\left[R^2(t) + 2C\,R(t)\,(1 - R(t))\right]$$
4
Duplex - Constant Failure Rates
Each processor has a constant failure rate $\lambda$, so $R(t) = e^{-\lambda t}$.
Ideal comparator: $R_{comp}(t) = 1$.
Duplex reliability:
$$R_{duplex}(t) = e^{-2\lambda t} + 2C\,e^{-\lambda t}\,(1 - e^{-\lambda t})$$
$$MTTF_{duplex} = \frac{1}{2\lambda} + \frac{C}{\lambda}$$
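The MTTF expression can be checked numerically by integrating the duplex reliability over time. The following is a minimal sketch, assuming an ideal comparator and exponential processor reliability $R(t) = e^{-\lambda t}$; the function names and the values of `lam` and `coverage` are hypothetical and chosen only for illustration.

```python
import math


def duplex_reliability(t, lam, coverage):
    """R_duplex(t) for two active processors with constant failure rate lam
    and an ideal comparator (R_comp = 1)."""
    r = math.exp(-lam * t)  # single-processor reliability R(t)
    return r * r + 2.0 * coverage * r * (1.0 - r)


def duplex_mttf(lam, coverage):
    """Closed-form MTTF: 1/(2*lam) + C/lam."""
    return 1.0 / (2.0 * lam) + coverage / lam


def mttf_numeric(lam, coverage, horizon=200.0, steps=200_000):
    """Trapezoidal integral of R_duplex(t) over [0, horizon].
    The horizon must be long enough for R_duplex to have decayed to ~0."""
    h = horizon / steps
    total = 0.5 * (duplex_reliability(0.0, lam, coverage)
                   + duplex_reliability(horizon, lam, coverage))
    for i in range(1, steps):
        total += duplex_reliability(i * h, lam, coverage)
    return total * h


if __name__ == "__main__":
    lam, c = 0.1, 0.9  # hypothetical failure rate and coverage factor
    print(duplex_mttf(lam, c), mttf_numeric(lam, c))
```

Both numbers should agree closely; lowering `coverage` toward 0 moves the MTTF toward $1/(2\lambda)$, the time to the first failure of either processor.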
5
Fault Detection: First Method - Acceptance Tests
Acceptance test: a range check of each processor's output.
Example: the pressure in a gas container must lie in some known range.
We use semantic information about the task to predict which output values indicate an error.
How should the acceptance range be picked?
6
Acceptance Test - Sensitivity vs. Specificity
Narrow acceptance range: high probability of identifying an incorrect output, but also a high probability that a correct output will be misidentified as erroneous (false positive).
Wide acceptance range: low probability of both.
Sensitivity: the (conditional) probability that the test will recognize an erroneous output as such.
Specificity: the (conditional) probability that the output is erroneous if the test identified it as such.
Narrow range: high sensitivity but low specificity.
Wide range: low sensitivity but high specificity.
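The trade-off between a narrow and a wide acceptance range can be illustrated with a small range check. This is only a sketch: the pressure readings in `SAMPLES`, their error labels and the two ranges are made-up values, and the function uses the definitions of sensitivity and specificity given on this slide.

```python
def acceptance_stats(samples, low, high):
    """samples: list of (value, is_erroneous) pairs.
    An output is flagged as erroneous when it falls outside [low, high].
    Returns (sensitivity, specificity) with
      sensitivity = P(flagged | erroneous)
      specificity = P(erroneous | flagged)
    """
    flagged = [bad for value, bad in samples if not (low <= value <= high)]
    erroneous_count = sum(1 for _, bad in samples if bad)
    caught = sum(1 for bad in flagged if bad)
    sensitivity = caught / erroneous_count if erroneous_count else float("nan")
    specificity = caught / len(flagged) if flagged else float("nan")
    return sensitivity, specificity


# Hypothetical pressure readings: (value, is_erroneous)
SAMPLES = [(95, False), (100, False), (105, False), (112, False), (88, False),
           (70, True), (130, True), (109, True), (93, True)]

if __name__ == "__main__":
    print("narrow:", acceptance_stats(SAMPLES, 98, 102))
    print("wide:  ", acceptance_stats(SAMPLES, 80, 120))
```

On this data the narrow range catches every erroneous reading but also flags several correct ones, while the wide range flags only erroneous readings but misses half of them.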
7
Second Method - Hardware Testing
Both processors are subjected to diagnostic tests.
The processor that fails the test is identified as faulty.
Real-life tests are never perfect.
Test coverage (the same as test sensitivity): the probability that the diagnostic test identifies a faulty processor as such.
Test transparency (the complement of test coverage): the probability that the test passes a faulty processor as good.
8
Third Method - Forward Recovery
Use a third processor to repeat the computation carried out by the duplex.
If only one of the three processors is faulty, the one that disagrees with the other two is the faulty one.
These methods can also be combined.
The acceptance test is the quickest to run but often the least sensitive.
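The forward-recovery decision can be sketched as a comparison of three outputs. The function name and the labels 'A' and 'B' for the two duplex processors are illustrative assumptions, and the logic relies on the single-fault assumption stated above.

```python
def identify_faulty(out_a, out_b, out_third):
    """Given the duplex outputs out_a and out_b and the output of a third
    processor that repeated the computation, return the duplex processor
    presumed faulty ('A' or 'B'), or None if it cannot be identified."""
    if out_a == out_b:
        return None  # no mismatch, nothing to diagnose
    if out_third == out_a:
        return "B"   # B disagrees with the other two
    if out_third == out_b:
        return "A"   # A disagrees with the other two
    return None      # all three differ: the single-fault assumption is violated


if __name__ == "__main__":
    print(identify_faulty(42, 41, 42))
```

When all three outputs differ, the sketch declines to name a processor, and one of the other methods (acceptance test or hardware test) would be needed.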
9
Pair & Spare System
Goal: avoid disruption of operation upon a mismatch between the two modules in a duplex.
On a mismatch, disconnect the duplex and transfer the task to the spare pair.
Test the disconnected duplex offline; if the fault is transient, mark the duplex as a good spare.
10
Triplex-Duplex Architecture
Form a triplex out of duplexes.
When the processors in a duplex disagree, both are switched out.
This allows simple identification of faulty processors.
The triplex can function even if only one duplex is left, because a duplex still allows fault detection.
11
Other Reliability Evaluation Techniques
Poisson Processes
Markov Models
12
The Poisson Process - Assumptions
Non-deterministic events of some kind occur over time with the following probabilistic behavior.
For some constant $\lambda$ and a very short interval of length $\Delta t$:
1. The probability of one event occurring during $\Delta t$ is $\lambda \Delta t$ plus a negligible term.
2. The probability of more than one event occurring during $\Delta t$ is negligible (of order $\Delta t^2$).
3. The probability of no events occurring during $\Delta t$ is $1 - \lambda \Delta t$ plus a negligible term (of order $\Delta t^2$).
13
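Poisson Process - Derivation
$N(t)$: the number of events occurring during $[0, t]$. For a given $t$, $N(t)$ is a random variable.
$P_k(t) = Prob\{N(t) = k\}$: the probability of $k$ events occurring during a time period of length $t$ ($k = 0, 1, 2, \ldots$).
Based on the previous assumptions (1)-(3), with event rate $\lambda$ and a short interval $\Delta t$:
$$P_k(t + \Delta t) \approx P_k(t)\,(1 - \lambda \Delta t) + P_{k-1}(t)\,\lambda \Delta t \quad (k = 1, 2, \ldots)$$
and
$$P_0(t + \Delta t) \approx P_0(t)\,(1 - \lambda \Delta t)$$
These approximations become more accurate as $\Delta t \to 0$.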
14
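Poisson Process - Differential Equations
This results in the differential equations
$$\frac{dP_k(t)}{dt} = -\lambda P_k(t) + \lambda P_{k-1}(t) \quad (k = 1, 2, \ldots)$$
and
$$\frac{dP_0(t)}{dt} = -\lambda P_0(t)$$
with the initial conditions $P_0(0) = 1$ and $P_k(0) = 0$ for $k \geq 1$.
The solution (for $k = 0, 1, 2, \ldots$) is
$$P_k(t) = Prob\{N(t) = k\} = e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}$$
For a given $t$, $N(t)$ has a Poisson distribution with parameter $\lambda t$.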
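The Poisson solution can be sanity-checked numerically: the probabilities should sum to 1 and the expected number of events should equal the parameter of the distribution. A minimal sketch, with `lam` standing for the event rate and hypothetical values for the rate and the time:

```python
import math


def poisson_pmf(k, lam, t):
    """Prob{N(t) = k} for a Poisson process with rate lam."""
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)


if __name__ == "__main__":
    lam, t = 2.0, 3.0  # hypothetical rate and interval length
    total = sum(poisson_pmf(k, lam, t) for k in range(80))
    mean = sum(k * poisson_pmf(k, lam, t) for k in range(80))
    print(total, mean)  # total is close to 1, mean is close to lam * t
```

The sum is truncated at 80 terms, which is far into the negligible tail for these values.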
15
Poisson Process - Properties
For all values of $t$, a Poisson process $N(t)$ with rate $\lambda$ has the following properties:
The expected number of events in an interval of length $t$ is $\lambda t$.
The length of time between consecutive events has an exponential distribution with parameter $\lambda$ and mean $1/\lambda$.
The numbers of events in disjoint intervals are statistically independent.
The sum of two independent Poisson processes with parameters $\lambda_1$ and $\lambda_2$ is a Poisson process with parameter $\lambda_1 + \lambda_2$.
16
Example of a Poisson Process - Duplex with Redundancy
Two active processors plus an unlimited number of inactive spares.
The induction process is instantaneous, and spares are always functional.
Each processor has a constant failure rate $\lambda$.
Lifetime of a processor: exponentially distributed with parameter $\lambda$.
Time between two consecutive failures of the same logical processor: exponentially distributed with parameter $\lambda$.
$N(t)$: the number of failures in one logical processor during $[0, t]$.
$M(t)$: the number of failures in the duplex system during $[0, t]$.
17
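Duplex with Redundancy - Reliability Calculation
The duplex has two processors, each with failure rate $\lambda$, so its failure rate is $2\lambda$.
The comparator failure rate is negligible.
Probability of $k$ failures in the duplex in $[0, t]$:
$$Prob\{M(t) = k\} = e^{-2\lambda t}\,\frac{(2\lambda t)^k}{k!} \quad (k = 0, 1, 2, \ldots)$$
For the duplex not to fail, each of these failures must be detected and successfully replaced, which happens with probability $C$.
For $k$ failures this probability is $C^k$, so
$$R_{duplex}(t) = \sum_{k=0}^{\infty} e^{-2\lambda t}\,\frac{(2\lambda t)^k}{k!}\,C^k = e^{-2\lambda t}\,e^{2\lambda C t} = e^{-2\lambda(1 - C)t}$$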
18
Duplex with Redundancy Reliability - Alternative Derivation (like Hybrid redundancy)
1. Individual processors fail at rate $\lambda$, so the rate of failures in the duplex is $2\lambda$.
2. Each failure is successfully dealt with with probability $C$ and causes duplex failure with probability $1 - C$.
3. Failures that crash the duplex therefore occur at rate $2\lambda(1 - C)$.
4. The reliability of the system is
$$R_{duplex}(t) = e^{-2\lambda(1 - C)t}$$
19
NMR systems in which failing processors are identified and replaced from an infinite pool of spares: the calculation is similar to that for the duplex.
Finite set of spares: the summation in the reliability derivation is capped at the number of spares rather than going to infinity.
Other variations of duplex systems:
- One processor is active while the second is a standby spare.
- Processors can be repaired when they become faulty.
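The summation over the number of failures, and its truncation for a finite set of spares, can be sketched as follows. The sketch assumes failures arrive in the duplex at rate `2 * lam` and each is handled successfully with probability `coverage`; the function names, the `terms` cut-off for the unlimited case and the numeric values are illustrative assumptions.

```python
import math


def duplex_with_spares(t, lam, coverage, spares=None, terms=200):
    """Sum over k of Prob{k failures in [0, t]} * coverage**k.
    spares=None approximates the unlimited-spares sum with `terms` terms;
    an integer caps the summation at k = spares, as described above."""
    last = terms if spares is None else spares
    x = 2.0 * lam * t
    total = 0.0
    term = math.exp(-x)  # k = 0 term: e^{-x} (x C)^0 / 0!
    for k in range(last + 1):
        total += term
        term *= x * coverage / (k + 1)  # advance to the k + 1 term
    return total


def duplex_closed_form(t, lam, coverage):
    """Closed form for unlimited spares: exp(-2 * lam * (1 - C) * t)."""
    return math.exp(-2.0 * lam * (1.0 - coverage) * t)


if __name__ == "__main__":
    t, lam, c = 100.0, 0.01, 0.95  # hypothetical values
    print(duplex_with_spares(t, lam, c), duplex_closed_form(t, lam, c))
    print(duplex_with_spares(t, lam, c, spares=2))
```

With `spares=None` the series and the closed form should agree to within rounding error; a finite `spares` value gives a smaller reliability because fewer failures can be absorbed.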
20
Markov Models for More Complex Systems
Combinatorial arguments may be insufficient for reliability calculation in more complex systems.
If failure rates are constant, we can use Markov models for reliability calculations.
Markov models provide a structured approach to deriving the reliability of complex systems that may include coverage factors and repair processes.
21
Markov Models
Markov models are more general than combinatorial models; they can handle repairs and much more complex situations.
Assumptions:
Any component is in one of two states: working or failed.
The probability of a state transition depends only on the current state.
Failure rates and repair rates are constant.
Over a short interval, the probability of a transition is proportional to the length of the interval.
Times to failure and repair are therefore exponentially distributed.
22
Steps of Applying Markov Models
A system consists of multiple components.
(1) Construct the state transition diagram.
(2) Construct the differential equations.
(3) Solve the equations to obtain the probability of each state.
(4) The reliability or availability is the sum of the probabilities of the working states.
23
Step 1: Construct state transition diagram
Example 1: Simplex system with repair
[Figure: state transition diagram of the system with two states, 0 and 1.]
$A(t) = p_0(t)$
24
Step 1: Construct state transition diagram
Example 2: Reliability of TMR system with repair
[Figure: TMR block diagram; the input feeds Module A, Module B and Module C, whose outputs go to a voter that produces the output.]
25
Step 1: Construct state transition diagram
Example 3: A ring system with different node and link failure rates. Assume that the system fails if any two or more components have failed.
[Figure: state transition diagram; states are labeled (failed nodes, failed links): 00, 10, 01, 11; the diagram marks the failed state as "Failed".]
26
Step 2: Construct differential equations
$A(t) = p_0(t)$. The question is how to obtain the probability of each state.
With failure rate $\lambda$ and repair rate $\mu$:
$$p_0(t + \Delta t) = (1 - \lambda \Delta t)\,p_0(t) + \mu \Delta t\,p_1(t)$$
$$p_1(t + \Delta t) = \lambda \Delta t\,p_0(t) + (1 - \mu \Delta t)\,p_1(t)$$
Rearranging and letting $\Delta t \to 0$:
$$\frac{dp_0(t)}{dt} = -\lambda\,p_0(t) + \mu\,p_1(t)$$
$$\frac{dp_1(t)}{dt} = \lambda\,p_0(t) - \mu\,p_1(t)$$
Solve the differential equations to obtain $p_0(t)$ and $p_1(t)$.
27
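Step 2: Construct differential equations
With failure rate $\lambda$ and repair rate $\mu$:
$$\frac{dp_0(t)}{dt} = -\lambda\,p_0(t) + \mu\,p_1(t)$$
$$\frac{dp_1(t)}{dt} = \lambda\,p_0(t) - \mu\,p_1(t)$$
In matrix form:
$$\frac{d}{dt}\begin{bmatrix} p_0(t) \\ p_1(t) \end{bmatrix} = \begin{bmatrix} -\lambda & \mu \\ \lambda & -\mu \end{bmatrix} \begin{bmatrix} p_0(t) \\ p_1(t) \end{bmatrix}$$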
28
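Step 3: Solve differential equations
There are many ways to solve differential equations, for example:
- the Laplace transform
- tools such as MATLAB or Mathematica
With failure rate $\lambda$, repair rate $\mu$ and initial conditions $p_0(0) = 1$, $p_1(0) = 0$, the solution is
$$p_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\,e^{-(\lambda + \mu)t}$$
$$p_1(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu}\,e^{-(\lambda + \mu)t}$$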
29
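Step 4: Find the Probabilities of Working States
With failure rate $\lambda$ and repair rate $\mu$:
$$p_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\,e^{-(\lambda + \mu)t}$$
$$p_1(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu}\,e^{-(\lambda + \mu)t}$$
$$A(t) = p_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\,e^{-(\lambda + \mu)t}$$
If $\mu = 0$, the probability $p_0(t)$ represents the reliability:
$$R(t) = p_0(t) = e^{-\lambda t}$$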
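The four steps can be traced end to end for the simplex system with repair. The sketch below assumes state 0 is the working state, `lam` is the failure rate, `mu` is the repair rate and the system starts in state 0; it integrates the two state equations with a hand-written fourth-order Runge-Kutta loop (a substitute for the Laplace-transform or tool-based solution) and compares the result with the closed-form availability. All names and values are illustrative.

```python
import math


def availability_closed_form(t, lam, mu):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t), with p0(0) = 1."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def availability_numeric(t, lam, mu, steps=10_000):
    """Integrate dp0/dt = -lam*p0 + mu*p1 and dp1/dt = lam*p0 - mu*p1
    with classical RK4, starting from p0 = 1, p1 = 0. Returns p0(t)."""
    def deriv(p0, p1):
        return (-lam * p0 + mu * p1, lam * p0 - mu * p1)

    h = t / steps
    p0, p1 = 1.0, 0.0
    for _ in range(steps):
        k1 = deriv(p0, p1)
        k2 = deriv(p0 + 0.5 * h * k1[0], p1 + 0.5 * h * k1[1])
        k3 = deriv(p0 + 0.5 * h * k2[0], p1 + 0.5 * h * k2[1])
        k4 = deriv(p0 + h * k3[0], p1 + h * k3[1])
        p0 += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        p1 += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return p0


if __name__ == "__main__":
    lam, mu, t = 0.2, 1.0, 5.0  # hypothetical rates and time
    print(availability_closed_form(t, lam, mu), availability_numeric(t, lam, mu))
```

Setting `mu = 0.0` turns the availability into the reliability of a single unit, $e^{-\lambda t}$.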
30
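Step 2: Construct differential equations (Find the pattern)
[Figure: state transition diagram with five states (1 to 5) and transition rates $\lambda_{ij}$ on the edges: $\lambda_{12}$, $\lambda_{21}$, $\lambda_{13}$, $\lambda_{31}$, $\lambda_{14}$, $\lambda_{41}$, $\lambda_{23}$, $\lambda_{32}$, $\lambda_{25}$, $\lambda_{52}$, $\lambda_{34}$, $\lambda_{43}$, $\lambda_{35}$, $\lambda_{53}$, $\lambda_{45}$, $\lambda_{54}$. The slide sets up the vector of derivatives $dp_1(t)/dt, \ldots, dp_5(t)/dt$ as a matrix of rates times the vector $p_1, \ldots, p_5$.]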
31
Step 2: Construct differential equations (Find the pattern)
In general, assume a state transition diagram (STD) has $n$ states and is fully connected, so every state has $n - 1$ incoming (+) and $n - 1$ outgoing (-) transitions.
$\lambda_{ij} \geq 0$ is the transition rate from state $i$ to state $j$, for $i, j = 1, 2, \ldots, n$ and $i \neq j$.
32
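Step 2: Construct differential equations (Find the pattern)
The probability of being in state $j$ at $t + \Delta t$ = the probability of being in state $j$ at $t$ + incoming probability - outgoing probability:
$$p_j(t + \Delta t) = p_j(t) + \sum_{i \neq j} \lambda_{ij}\,\Delta t\,p_i(t) - \sum_{i \neq j} \lambda_{ji}\,\Delta t\,p_j(t)$$
where $\lambda_{ij}$ is the transition rate from state $i$ to state $j$.
Math manipulation: move $p_j(t)$ to the left, divide both sides by $\Delta t$ and let $\Delta t \to 0$:
$$\frac{dp_j(t)}{dt} = \sum_{i \neq j} \lambda_{ij}\,p_i(t) - p_j(t) \sum_{i \neq j} \lambda_{ji}$$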
33
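Step 2: Construct differential equations (found the pattern)
$$\frac{d}{dt}\begin{bmatrix} p_1(t) \\ p_2(t) \\ p_3(t) \\ \vdots \\ p_n(t) \end{bmatrix} = \begin{bmatrix} -\lambda_{11} & \lambda_{21} & \lambda_{31} & \lambda_{41} & \cdots & \lambda_{n1} \\ \lambda_{12} & -\lambda_{22} & \lambda_{32} & \lambda_{42} & \cdots & \lambda_{n2} \\ \lambda_{13} & \lambda_{23} & -\lambda_{33} & \lambda_{43} & \cdots & \lambda_{n3} \\ \vdots & & & & \ddots & \vdots \\ \lambda_{1n} & \lambda_{2n} & \lambda_{3n} & \lambda_{4n} & \cdots & -\lambda_{nn} \end{bmatrix} \begin{bmatrix} p_1(t) \\ p_2(t) \\ p_3(t) \\ \vdots \\ p_n(t) \end{bmatrix}$$
where $\lambda_{ij}$ ($i \neq j$) is the transition rate from state $i$ to state $j$ and $\lambda_{jj} = \sum_{i \neq j} \lambda_{ji}$.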
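The pattern can be turned into a small routine that builds the coefficient matrix from a list of transition rates. This is a sketch under stated assumptions: `rates` is a hypothetical dictionary mapping a pair `(i, j)` of 1-based state numbers to the transition rate from state `i` to state `j`, and the returned matrix `a` satisfies dp/dt = a p.

```python
def build_rate_matrix(n, rates):
    """Build the n x n matrix a with dp/dt = a p.
    Off-diagonal a[j][i] holds the rate from state i to state j (inflow to j);
    diagonal a[i][i] holds minus the total rate out of state i."""
    a = [[0.0] * n for _ in range(n)]
    for (i, j), rate in rates.items():
        if i == j or rate < 0:
            raise ValueError("rates need i != j and a non-negative value")
        a[j - 1][i - 1] += rate  # incoming probability flow into state j
        a[i - 1][i - 1] -= rate  # outgoing probability flow from state i
    return a


def derivative(a, p):
    """Right-hand side a p of the state equations for a probability vector p."""
    return [sum(a_rc * p_c for a_rc, p_c in zip(row, p)) for row in a]


if __name__ == "__main__":
    # Two-state example: state 1 working, state 2 failed (hypothetical rates).
    a = build_rate_matrix(2, {(1, 2): 0.2, (2, 1): 1.0})
    print(a)
    print(derivative(a, [1.0, 0.0]))
```

Every column of the matrix sums to zero, which keeps the total probability constant over time.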
34
Example 1: Apply the Pattern
$R(t) = p_1(t) + p_2(t)$
35
Example 2
[Figure: state transition diagram with states 00, 10, 01 and 11.]
$R(t) = p_1(t) + p_2(t) + p_3(t)$
36
Markov Chains - Introduction
A Markov chain is a stochastic process $X(t)$: an infinite sequence of random variables indexed by time $t$, with a special probabilistic structure:
$$Prob\{X(t_n) = j \mid X(t_0) = i_0, X(t_1) = i_1, \ldots, X(t_{n-1}) = i_{n-1}\} = Prob\{X(t_n) = j \mid X(t_{n-1}) = i_{n-1}\}$$
for every $t_0 < t_1 < \cdots < t_n$.
For a stochastic process to be a Markov chain, its future behavior must depend only on its present state, and not on any past state.
$X(t+s)$ depends on $X(t)$, but given $X(t)$, $X(t+s)$ does not depend on any $X(\tau)$ for $\tau < t$.
If $X(t) = i$, the chain is in state $i$ at time $t$.
We deal only with Markov chains with continuous time ($0 \leq t < \infty$) and discrete state ($X(t) = 0, 1, 2, \ldots$).
37
Markov Chain - Probabilistic Interpretation
Once the chain moves into state $i$, it stays there for a length of time that has an exponential distribution with parameter $\lambda_i$; that is, it leaves state $i$ at a constant rate $\lambda_i$.
The probability that, when leaving state $i$, the chain moves to state $j$ (with $j \neq i$) is $P_{ij}$ ($\sum_{j \neq i} P_{ij} = 1$).
The transition rate from state $i$ to state $j$ is $\lambda_{ij} = P_{ij}\,\lambda_i$ ($\sum_{j \neq i} \lambda_{ij} = \lambda_i$).
38
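State Probabilities
$P_i(t)$: the probability that the process is in state $i$ at time $t$, given that it started in state $i_0$ at time 0.
For a given time instant $t$, state $i$ and a very small interval of time $\Delta t$, the chain can be in state $i$ at time $t + \Delta t$ in one of the following cases:
It was in state $i$ at time $t$ and has not moved during the interval $\Delta t$; this event has probability $P_i(t)\,(1 - \lambda_i \Delta t)$, where $\lambda_i$ is the rate of leaving state $i$.
It was in some state $j$ at time $t$ ($j \neq i$) and moved from $j$ to $i$ during $\Delta t$; this event has probability $P_j(t)\,\lambda_{ji}\,\Delta t$, where $\lambda_{ji}$ is the transition rate from $j$ to $i$.
The probability of more than one transition is negligible if $\Delta t$ is small enough.
These assumptions result in
$$P_i(t + \Delta t) \approx P_i(t)\,(1 - \lambda_i \Delta t) + \sum_{j \neq i} P_j(t)\,\lambda_{ji}\,\Delta t$$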
39
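Differential Equations for State Probabilities $P_i(t)$
For $\Delta t \to 0$:
$$\frac{dP_i(t)}{dt} = -\lambda_i P_i(t) + \sum_{j \neq i} \lambda_{ji} P_j(t)$$
Since $\lambda_i = \sum_{j \neq i} \lambda_{ij}$:
$$\frac{dP_i(t)}{dt} = -\sum_{j \neq i} \lambda_{ij} P_i(t) + \sum_{j \neq i} \lambda_{ji} P_j(t)$$
This set of differential equations (for $i = 0, 1, 2, \ldots$) can now be solved using the initial conditions $P_{i_0}(0) = 1$ and $P_j(0) = 0$ for $j \neq i_0$.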
40
Markov Chain for a Duplex with a Standby
Example: one active processor and one standby spare, which is connected when the active unit fails.
$\lambda$: the constant failure rate of an active processor.
$C$ (coverage factor): the probability that a failure of the active processor is correctly detected and the spare processor is successfully connected.
The Markov chain is shown in Fig. 2-15.
41
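Differential Equations for Duplex with Standby
With failure rate $\lambda$ of the active processor and coverage factor $C$:
$$\frac{dP_2(t)}{dt} = -\lambda P_2(t)$$
$$\frac{dP_1(t)}{dt} = \lambda C\,P_2(t) - \lambda P_1(t)$$
$$\frac{dP_0(t)}{dt} = \lambda (1 - C)\,P_2(t) + \lambda P_1(t)$$
Initial conditions: $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$.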
42
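Reliability of Duplex with Standby
Solution of the differential equations, with failure rate $\lambda$ and coverage factor $C$:
$$P_2(t) = e^{-\lambda t}, \quad P_1(t) = C \lambda t\,e^{-\lambda t}, \quad P_0(t) = 1 - P_1(t) - P_2(t)$$
$$R_{system}(t) = 1 - P_0(t) = P_2(t) + P_1(t) = e^{-\lambda t} + C \lambda t\,e^{-\lambda t}$$
Exercise: derive this expression using combinatorial arguments.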
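One way to check a candidate solution is to substitute it back into the state equations. The sketch below assumes the standby chain has transitions 2 to 1 at rate `lam * c`, 2 to 0 at rate `lam * (1 - c)` and 1 to 0 at rate `lam`, with the system starting in state 2; it evaluates the closed-form state probabilities and measures how far a central-difference derivative is from each right-hand side. Names and values are illustrative.

```python
import math


def standby_probs(t, lam, c):
    """Closed-form (P2, P1, P0) for the duplex with a standby spare."""
    p2 = math.exp(-lam * t)
    p1 = c * lam * t * math.exp(-lam * t)
    return p2, p1, 1.0 - p1 - p2


def standby_reliability(t, lam, c):
    """R(t) = 1 - P0(t) = P2(t) + P1(t)."""
    p2, p1, _ = standby_probs(t, lam, c)
    return p2 + p1


def ode_residuals(t, lam, c, h=1e-5):
    """Central-difference derivative of each state probability minus the
    right-hand side of its state equation; values near 0 mean a good fit."""
    ahead = standby_probs(t + h, lam, c)
    behind = standby_probs(t - h, lam, c)
    numeric = [(a - b) / (2.0 * h) for a, b in zip(ahead, behind)]
    p2, p1, _ = standby_probs(t, lam, c)
    rhs = [-lam * p2,
           lam * c * p2 - lam * p1,
           lam * (1.0 - c) * p2 + lam * p1]
    return [n - r for n, r in zip(numeric, rhs)]


if __name__ == "__main__":
    print(standby_reliability(2.0, 0.5, 0.9))
    print(ode_residuals(2.0, 0.5, 0.9))
```

The residuals are limited by the finite-difference step, so they are small but not exactly zero.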
43
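Markov Chain for a Duplex with Repair
Two active processors, each with failure rate $\lambda$ and repair rate $\mu$ (repair time is exponential with parameter $\mu$).
The Markov model is shown in Fig. 2-16.
The differential equations:
$$\frac{dP_2(t)}{dt} = -2\lambda P_2(t) + \mu P_1(t)$$
$$\frac{dP_1(t)}{dt} = 2\lambda P_2(t) + 2\mu P_0(t) - (\lambda + \mu) P_1(t)$$
$$\frac{dP_0(t)}{dt} = \lambda P_1(t) - 2\mu P_0(t)$$
Initial conditions: $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$.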
44
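Duplex with Repair - State Probabilities
The solution to the differential equations, with failure rate $\lambda$ and repair rate $\mu$ per processor:
$$P_2(t) = \frac{\mu^2}{(\lambda + \mu)^2} + \frac{2\lambda\mu}{(\lambda + \mu)^2}\,e^{-(\lambda + \mu)t} + \frac{\lambda^2}{(\lambda + \mu)^2}\,e^{-2(\lambda + \mu)t}$$
$$P_1(t) = \frac{2\lambda\mu}{(\lambda + \mu)^2} + \frac{2\lambda(\lambda - \mu)}{(\lambda + \mu)^2}\,e^{-(\lambda + \mu)t} - \frac{2\lambda^2}{(\lambda + \mu)^2}\,e^{-2(\lambda + \mu)t}$$
$$P_0(t) = 1 - P_1(t) - P_2(t)$$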
45
Availability vs. Reliability
In systems without repair, reliability is the main measure of interest.
With repair, availability is more meaningful than reliability.
Point availability: $A_p(t) = Prob\{\text{the system is operational at time } t\} = 1 - P_0(t)$.
Reliability: $R(t) = Prob\{\text{the system is operational during } [0, t]\}$. It can be calculated by removing the transition from state 0 to state 1 and solving the resulting new differential equations; then $R(t) = 1 - P_0(t)$.
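The difference between point availability and reliability can be seen by solving the same chain twice, once with and once without the transition out of state 0. The sketch below assumes the duplex-with-repair chain has transitions 2 to 1 at rate `2 * lam`, 1 to 0 at rate `lam`, 1 to 2 at rate `mu` and 0 to 1 at rate `2 * mu`; these rates, the function name and the numeric values are assumptions for illustration, and the integration uses a hand-written RK4 loop.

```python
import math


def duplex_repair_operational(t, lam, mu, absorbing_failure, steps=20_000):
    """Return 1 - P0(t) for the duplex with repair, starting in state 2.
    absorbing_failure=True removes the 0 -> 1 transition, giving R(t);
    absorbing_failure=False keeps it, giving the point availability Ap(t)."""
    back = 0.0 if absorbing_failure else 2.0 * mu  # rate of leaving state 0

    def deriv(p):
        p2, p1, p0 = p
        return (-2.0 * lam * p2 + mu * p1,
                2.0 * lam * p2 - (lam + mu) * p1 + back * p0,
                lam * p1 - back * p0)

    h = t / steps
    p = (1.0, 0.0, 0.0)  # (P2, P1, P0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k2)))
        k4 = deriv(tuple(x + h * k for x, k in zip(p, k3)))
        p = tuple(x + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return 1.0 - p[2]


if __name__ == "__main__":
    lam, mu, t = 0.1, 1.0, 10.0  # hypothetical rates and time
    print("Ap(t):", duplex_repair_operational(t, lam, mu, False))
    print("R(t): ", duplex_repair_operational(t, lam, mu, True))
```

The reliability never exceeds the point availability, because being operational throughout the interval implies being operational at its end.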
46
Long-Run Availability
We calculate $A$, the long-run availability: the proportion of time, in the long run, that the system is operational.
We first calculate the steady-state probabilities $P_2(\infty)$, $P_1(\infty)$ and $P_0(\infty)$ (or $P_2$, $P_1$, $P_0$).
These steady-state probabilities can be calculated by one of two methods:
- letting $t$ approach $\infty$ in $P_i(t)$
- setting $dP_i(t)/dt = 0$ ($i = 0, 1, 2$) and solving the linear equations for $P_i$, using the relationship $P_2 + P_1 + P_0 = 1$
$A = 1 - P_0$
47
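Duplex with Repair - Long-Run Availability
Steady-state probabilities, with failure rate $\lambda$ and repair rate $\mu$ per processor:
$$P_2 = \frac{\mu^2}{(\lambda + \mu)^2}, \quad P_1 = \frac{2\lambda\mu}{(\lambda + \mu)^2}, \quad P_0 = \frac{\lambda^2}{(\lambda + \mu)^2}$$
Long-run availability:
$$A = P_2(\infty) + P_1(\infty) = 1 - P_0 = 1 - \left(\frac{\lambda}{\lambda + \mu}\right)^2$$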
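The second method for the steady state (set the derivatives to zero and add the normalization condition) amounts to solving a small linear system. The sketch below assumes the duplex-with-repair chain has transitions 2 to 1 at rate `2 * lam`, 1 to 0 at rate `lam`, 1 to 2 at rate `mu` and 0 to 1 at rate `2 * mu`; it uses exact fractions and a hand-written Gauss-Jordan elimination, and all names and values are illustrative.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Gauss-Jordan elimination on exact Fractions; a is n x n, b has length n."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def duplex_repair_steady_state(lam, mu):
    """Steady-state (P2, P1, P0) from dP2/dt = 0, dP0/dt = 0 and P2 + P1 + P0 = 1."""
    lam, mu = Fraction(lam), Fraction(mu)
    a = [[-2 * lam, mu, 0],   # 0 = -2*lam*P2 + mu*P1
         [0, lam, -2 * mu],   # 0 = lam*P1 - 2*mu*P0
         [1, 1, 1]]           # normalization
    a = [[Fraction(x) for x in row] for row in a]
    b = [Fraction(0), Fraction(0), Fraction(1)]
    p2, p1, p0 = solve_linear(a, b)
    return p2, p1, p0


if __name__ == "__main__":
    p2, p1, p0 = duplex_repair_steady_state(1, 9)  # hypothetical rates
    print(p2, p1, p0, "A =", 1 - p0)
```

The third state equation is redundant once the normalization condition is included, which is why only two of the three are used.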
49
Homework
Exercises: 7, 8, 9, 10, 11, 12, 14
50
Success Probabilities for a Standby System
53
Reliability of Duplex with Standby
56
Reliability of a Two Element System with Repair
59
The Effect of Coverage on System Reliability
60
[Equation residue: $P_{s0}(s) + P_{s1}(s) = \ldots$, with a state label $S_0$]
63
[Equation residue: $P_{s0} + P_{s1} = \ldots$]
64
[Equation residue: $S_0 = \ldots$]