Presentation is loading. Please wait.

Presentation is loading. Please wait.

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance.

Similar presentations


Presentation on theme: "Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance."— Presentation transcript:

1 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance

2 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.2 Duplex Systems  Both processors execute the same task  If outputs are in agreement - result is assumed to be correct  If results are different - we can not identify the failed processor  A higher-level software has to decide how failure is to be handled  This can be done using one of several methods

3 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.3 Duplex Reliability  Two active identical processors with reliability R(t)  Lifetime of duplex - time until both processors fail  C - Coverage Factor - probability that a faulty processor will be correctly diagnosed, identified and disconnected  R duplex (t) - the reliability of duplex system:  R duplex (t) = R comp (t) [ R² (t)+2C R(t)(1-R(t) ] R comp(t) - reliability of comparator

4 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.4 Duplex - Constant Failure Rates  Each processor has a constant failure rate  Ideal comparator - R comp (t)=1,  Duplex reliability -   MTTF duplex = 1/(2 ) + C/

5 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.5 Fault Detection: First Method - Acceptance Tests  Acceptance Test - a range check of each processor's output  Example - the pressure in a gas container must be in some known range  We use semantic information of the task to predict which values of output indicate an error  How should the acceptance range be picked?

6 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.6 Acceptance Test - Sensitivity Vs. Specificity  Narrow acceptance range: high probability of identifying an incorrect output, but also a high probability that a correct output will be misidentified as erroneous (false positive)  Wide acceptance range: low probability of both  Sensitivity - the (conditional) probability that the test will recognize an erroneous output as such  Specificity – the (conditional) probability that the output is erroneous if the test identified it as such  Narrow range - high sensitivity but low specificity  Wide range - low sensitivity but high specificity

7 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.7 Second Method – Hardware Testing  Both processors are subjected to diagnostic tests  The processor which fails the test is identified as faulty  Real-life tests are never perfect  Test Coverage - same as test sensitivity - the probability that the diagnostic test can identify a faulty processor as such  Test Transparency - the complement of the test coverage - the probability that the test passes a faulty processor as good

8 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.8 Third Method - Forward Recovery  Use a third processor to repeat the computation carried out by the duplex  If only one of the three processors is faulty, the one that disagrees is the faulty one  It is possible to use a combination of these methods  Acceptance test - quickest to run but often the least sensitive

9 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.9 Pair & Spare System  Avoid disruption of operation upon a mismatch between the two modules in a duplex  Disconnect duplex and transfer task to spare pair  Test offline, and if fault is transient - mark duplex as a good spare

10 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.10 Triplex-Duplex Architecture  Form a triplex out of duplexes  When processors in a duplex disagree, both are switched out  Allows simple identification of faulty processors  Triplex can function even if only one duplex is left - duplex allows fault detection

11 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.11 Other Reliability Evaluation Techniques  Poisson Processes  Markov Models

12 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.12 The Poisson Process - Assumptions  Non-deterministic events of some kind occurring over time with the following probabilistic behavior  For some constant and a very short interval of length  t: 1. Probability of one event occurring during  t is t plus a negligible term 2. Probability of more than one event occurring during  t is negligible ( of order of  t 2 ) 3. Probability of no events occurring during  t is 1- t plus a negligible term (  t 2 )

13 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.13 Poisson Process - Derivation  N(t) – number of events occurring during [0,t]  For a given t, N(t) is a random variable  P k (t)=Prob{N(t)=k} – probability of k events occurring during a time period of length t, (k=0,1,2,…)  Based on the previous assumptions (1) –(3) (for k=1,2,…) And These approximations become more accurate as  t 0, and

14 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.14 Poisson Process – Differential Equations  This results in the differential equations: and  With the initial conditions  The solution (for k=0,1,2,…) is or  For a given t, N(t) has a Poisson distribution with the parameter t

15 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.15 Poisson Process - Properties  For all values of t, N(t) a Poisson process with rate has the following properties:  Expected number of events in an interval of length t is t  Length of time between consecutive events has an exponential distribution with parameter and mean 1/  Numbers of events in disjoint intervals are statistically independent * Sum of two Poisson processes with parameters 1 and 2 is a Poisson process with parameter 1 + 2

16 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.16 Example of a Poisson Process - Duplex with Redundancy  Two active processors + unlimited number of inactive spares  Induction process instantaneous, spares are always functional  Each processor has a constant failure rate  Lifetime of a processor - Exponential distribution with parameter  Time between two consecutive failures of same logical processor - Exponentially distributed with a parameter  N(t) - number of failures in one logical processor during [0,t]  M(t) - number of failures in the duplex system during [0,t]

17 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.17 Duplex with redundancy - Reliability Calculation  Duplex has two processors - failure rate is 2  Comparator failure rate - negligible  Probability of k failures in duplex in [0,t] - ( for k=0,1,2,…)  For the duplex not to fail, each of these failures must be detected and successfully replaced - probability C  For k failures – probability  R duplex (t)=

18 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.18 Duplex with Redundancy Reliability - Alternative Derivation (like Hybrid redundancy) 1-Individual processors fail at rate, and so Rate of failures in the duplex is 2 2-Probability C of each failure to be successfully dealt with, and 1-C to cause duplex failure 3-Failures that crash the duplex occur with rate 2 (1-C) 4-The reliability of the system is R duplex (t)=

19 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.19  NMR systems in which failing processors are identified and replaced from an infinite pool of spares - similar calculation to duplex  Finite set of spares - the summation in the reliability derivation is capped at that number of spares, rather than going to infinity  Other variations of duplex systems -  One processor is active while the second is a standby spare  Processors can be repaired when they become faulty

20 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.20 Markov Models for More Complex Systems  Combinatorial arguments may be insufficient for reliability calculation in more complex systems  If failure rates are constant, we can use Markov Models for reliability calculations  Markov Models provide a structured approach for the derivation of the reliability of complex systems that may include Coverage factors and Repair process

21 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.21 Markov Models Markov models are more generic than combinatorial models. They can handle repairs and much more complex situations. Assumption: Any component may in one the two states: working or failed; Probability of state transition depends only on the current state.  Failure rates and repair rates are constants.  Transition probability is proportional to the time that the component stays at a state.  Exponential distribution of the reliability/availability

22 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.22 Steps of Applying Markov Models A system consists of multiple components  Construct state transition diagram (1) (2)  Construct differential equations (3)  Solve the equations to obtain the probability in each state (4)  The reliability or availability is the sum of the probabilities of working states.

23 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.23 Step 1: Construct state transition diagram Example 1: Simplex system with repair System A(t) = p0(t)  0 1

24 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.24 Step 1: Construct state transition diagram Example 2: Reliability of TMR system with repair Input Output Module A Module B Module C Voter

25 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.25 Step 1: Construct state transition diagram Example 3: A ring system with different node and link failure rates  and . Assume  that the system fails if any two or more than components failed.   00 1001 11     (failed nodes, failed links) Failed

26 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.26 Step 2: Construct differential equations p0 (t +  t) = (1 – ·  t ) · p 0 (t) +  ·  t · p1 (t) p1 (t +  t) = ·  t · p0 (t) + (1 –  ·  t ) · p 1 (t) A(t) = p0(t) The question is how to obtain the probability of each state. = – · p0 (t) +  · p1 (t) = · p0 (t) –  · p1 (t) Solve the differential equations to obtain (p0 (t), p1 (t)).

27 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.27  0 1 Step 2: Construct differential equations = – · p0 (t) +  · p1 (t) = · p0 (t) –  · p1 (t)  p0p1p0p1 =

28 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.28 Step 3: Solve differential equations There are many different ways to solve differential equations -LaPlace Transformation -Tools like MathLab or Mathematica p0p0 (t)      e  (  )t (t)      e  (  )t p1p1  p0p1p0p1 =

29 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.29 Step 4: Find the Probabilities of Working States p0p0 (t)      e  (  )t (t)      e  (  )t p1p1 A(t)      e  (  )t p0p0 (t)  If  = 0, the probability at p 0 represents the reliability R(t)      e  (  )t p0p0 (t)  e  t =

30 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.30 Step 2: Construct differential equations (Find the pattern) p1P2p3p4p5p1P2p3p4p5 = dp 2 (t) dt dp 5 (t) dt …  ij 12 45 3  12  21  45  54  14  41  25  52  13  31  35  53  23  32  34  43

31 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.31 In general, assume a STD has n states and is fully connected. Any state has n incoming and n outgoing transitions:  ij  0 is the transition rate from state i to j. For i, j = 1, 2,..., n, and i ≠ j. Step 2: Construct differential equations (Find the pattern) + -

32 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.32 The probability in state j at t +  t = the probability in state j at t + incoming prob – outgoing prob where Math manipulation: Divide  t on both sides, let  t  0 Step 2: Construct differential equations (Find the pattern)

33 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.33 p1p2p3…pnp1p2p3…pn = dp 2 (t) dt dp n (t) dt …  21  31  41  n1 11  12  32  42  n2 22  13  23  43  n3 33  1n  2n  3n  4n nn where Step 2: Construct differential equations (found the pattern) -

34 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.34 Example 1: Apply the Pattern R(t) = p 1 (t) + p 2 (t)

35 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.35 Example 2 00 1001 11     R(t) = p 1 (t) + p 2 (t) + p 3 (t)

36 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.36 Markov Chains - Introduction  A Markov Chain is a stochastic process X(t) - an infinite sequence of random variables indexed by time t, with a special probabilistic structure Prob{X(t n )=j | X(t 0 )=I 0, x(t 1 )=i 1,..x(t n-1 )=i n-1 } = Prob{X(t n )=j | x(t n-1 )=i n-1 )} for every t 0 <t 1 <... <t n  For a stochastic process to be a Markov Chain, its future behavior must depend only on its present state, and not on any past state  X(t+s) depends on X(t), but given X(t), X(t+s) does not depend on any X(  ) for  < t  If X(t)=i - the chain is in state i at time t  We deal only with Markov Chains with continuous time ( 0  t   ) and discrete state (X(t)=0,1,2,…)

37 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.37 Markov Chain - Probabilistic Interpretation  Once the chain moves into state i, it stays there for a length of time which has an exponential distribution with parameter i - it has a constant rate i of leaving state I  The probability that when leaving state i the chain will move to state j (with j  i) - P ij ( )  Transition rate from state i to state j is ( )

38 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.38 State Probabilities  P i (t) - probability that the process is in state i at time t, given it started at state i 0 at time 0  For given time instant t, state i and a very small interval of time  t, the chain can be in state i at time t+  t in one of the following cases:  It was in state i at time t and has not moved during the interval  t – this event has probability of  It was at some state j at time t (j  i) and moved from j to i during  t : probability  Probability of more than one transition is negligible if  t is small enough  These assumptions result in

39 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.39 Differential Equations for State Probabilities P i (t) ,for Δt 0 Since   This (for i=0,1,2,...) can now be solved, using the initial conditions P i 0 (0)=1 and P j (0 )=0 for j  i 0

40 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.40 Markov Chain for a Duplex with a Standby  Example: One active processor and a one standby spare -connected when the active unit fails  Constant failure rate of an active processor  C- coverage factor - probability that a failure of the active processor is correctly detected and the spare processor is successfully connected  The Markov chain - Fig.2-15

41 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.41 Differential Equations for Duplex with Standby  Initial conditions:  P 2 (0)=1, P 1 (0)=P 0 (0)=0

42 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.42 Reliability of Duplex with Standby  Solution of differential equations:  Exercise - derive this expression using combinatorial arguments

43 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.43 Markov Chain for a Duplex with Repair  Two active processors: each with failure rate and repair rate  (repair time is exponential with parameter  )  The Markov model  The differential equations -  Initial conditions -  P 2 (0)=1, P 1 (0)=P 0 (0)=0 Fig. 2-16

44 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.44 Duplex with Repair - State Probabilities  The solution to the differential equations -

45 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.45 Availability vs. Reliability  In systems without repair, mainly the reliability measure is of significance  With repair – availability is more meaningful than reliability  Point Availability - Ap(t) = Prob{The system is operational at time t}=1-P 0 (t)  Reliability - R(t)=Prob{The system is operational during [0,t] } - can be calculated by removing the transition from state 0 to state 1, solving the resulting new differential equations - R(t)=1-P 0 (t)

46 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.46 Long-Run Availability  We calculate A - the long-run availability - the proportion of time in the long run that the system is operational  We first calculate the steady-state probabilities - P 2 (  ), P 1 (  ), and P 0 (  ) (or P 2,P 1,P 0 )  These steady-state probabilities can be calculated in one of the two methods:  letting t approach  in P i (t)  setting dP i (t)/dt=0 (i=0,1,2) and solving the linear equations for Pi, using the relationship P 2 +P 1 +P 0 =1  A=1-P 0

47 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.47 Duplex with Repair – Long-Run Availability  Steady state probabilities -  Long-run availability -  A = P 2(∞) + P 1(∞) = 1 - P 0

48 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.48

49 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.49 Homework Exercises: 7, 8, 9, 10, 11, 12, 14

50 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.50 Success Probabilities for a Standby System

51 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.51

52 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.52

53 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.53 Reliability of Duplex with Standby

54 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.54

55 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.55

56 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.56 Reliability of a Two Element System with Repair

57 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.57

58 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.58

59 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.59 The Effect of Coverage on system Reliability

60 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.60 P s0 (s) +P s1 (s) = S 0

61 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.61

62 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.62

63 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.63 P s0 + P s1 =

64 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.64 S 0 =

65 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.65

66 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.66

67 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.67

68 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.68

69 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.69

70 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.70

71 Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.71


Download ppt "Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance."

Similar presentations


Ads by Google