Fault-Tolerant Computing Systems #5 Reliability and Availability2 Pattara Leelaprute Computer Engineering Department Kasetsart University pattara.l@ku.ac.th
Reliability and Availability. Reliability: the probability that a system survives until time t (it has not failed before t). Availability: the probability that a system works properly at time t.
Preliminaries of Probability. Discrete sample space: tossing a coin, sample space {head, tail}. Continuous sample space: how long the PC stays up after a reboot, sample space {t | t > 0}. Random variable: a function mapping each element of the sample space to a real number, e.g. heads = 1, tails = 0.
Preliminaries. Random variable: a function mapping each element of the sample space to a real number. CDF (cumulative distribution function): $F(t) = \Pr[X \le t]$, which accumulates (never decreases) as t grows; when X is the time to failure, F(t) is the probability that the system has gone down by time t. pdf (probability density function): $f(t) = dF(t)/dt$. Expected value (mean): $E[X] = \int_0^\infty t\, f(t)\, dt$ for $X \ge 0$; this is the average outcome of the random experiment.
Exponential Distribution. The most commonly used distribution function in reliability modeling. CDF: $F(t) = 1 - e^{-\lambda t}$. pdf: $f(t) = \lambda e^{-\lambda t}$. Mean: $1/\lambda$. Memoryless property: let $Y = X - t$ be the remaining life at time t; then $G_t(y) = \Pr[Y \le y \mid X > t] = 1 - e^{-\lambda y}$. The distribution of the remaining life of a component does not depend on how long it has been working: the component does not age. Plotted example with $\lambda = 2$: $f(t) = 2e^{-2t}$, $F(t) = 1 - e^{-2t}$.
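The memoryless property can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the rate 2 is taken from the plotted example. It computes Pr[X - t <= y | X > t] directly from the CDF and compares it with F(y).

```python
import math

def exp_cdf(t, lam):
    """CDF of the exponential distribution: F(t) = 1 - e^(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

def remaining_life_cdf(y, t, lam):
    """Pr[X - t <= y | X > t], computed from the CDF by conditional probability."""
    survived = 1.0 - exp_cdf(t, lam)                          # Pr[X > t]
    failed_in_window = exp_cdf(t + y, lam) - exp_cdf(t, lam)  # Pr[t < X <= t + y]
    return failed_in_window / survived

lam = 2.0  # rate from the plotted example
for t in (0.0, 0.5, 3.0):
    # The second and third columns agree for every elapsed time t.
    print(t, remaining_life_cdf(1.0, t, lam), exp_cdf(1.0, lam))
```

Whatever elapsed time t is chosen, the conditional distribution of the remaining life matches the original CDF, which is what "does not age" means.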
Reliability R(t). Reliability: the probability that a system survives until time t, $R(t) = \Pr[X > t] = 1 - F(t)$. X is the random variable that represents the time to failure of the system (the life of the system). Example with an exponential distribution: $F(t) = 1 - e^{-2t}$, $R(t) = e^{-2t}$. [Figure: plots of F(t) and R(t) against time t.]
Reliability R(t). $R(t) = \Pr[X > t] = 1 - F(t)$. $R(0) = 1$: the system is initially working. $R(\infty) = 0$: no system has an infinite lifetime. Example with an exponential distribution: $F(t) = 1 - e^{-2t}$, $R(t) = e^{-2t}$.
Failure Rate. Probability that a fault occurs in the interval $[t, t+\Delta t]$: $f(t)\,\Delta t$. Probability that a fault occurs in $[t, t+\Delta t]$ given that the system is working properly at t: $f(t)\,\Delta t / R(t)$. Failure rate: $f(t) / R(t)$, where f(t) is the pdf of the time to failure, F(t) is its CDF, and R(t) is the reliability. Example with an exponential distribution: $f(t) = 2e^{-2t}$, $F(t) = 1 - e^{-2t}$, $R(t) = e^{-2t}$, so the failure rate is $2e^{-2t} / e^{-2t} = 2$, a constant.
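As a small illustration of the ratio f(t) / R(t), the sketch below evaluates it at several times for the exponential example with rate 2. The helper name `failure_rate` is hypothetical.

```python
import math

def failure_rate(f, R, t):
    """Failure rate at time t: pdf divided by reliability, f(t) / R(t)."""
    return f(t) / R(t)

lam = 2.0  # rate from the slide's example

def f(t):
    """pdf of the exponential time to failure."""
    return lam * math.exp(-lam * t)

def R(t):
    """Reliability of the exponential time to failure."""
    return math.exp(-lam * t)

rates = [failure_rate(f, R, t) for t in (0.0, 0.5, 1.0, 5.0)]
print(rates)  # every entry equals lam
```

The rate is the same at every t, which is the constant failure rate of the exponential distribution.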
Bathtub Curve. The general failure rate $f(t) / R(t)$ observed in empirical data collected from mechanical and electronic components follows a bathtub curve with three stages. 1. Initial stage: faults caused by inherent defects and faulty design. 2. Middle stage: constant failure rate. 3. Last stage: faults caused by age. When the lifetime distribution F(t) of a system is exponential, it has a constant failure rate (see previous slide).
Availability. The probability that the system works properly at time t. Availability is a measure that is frequently used to describe the behavior of a system. If the system has no repair or replacement, availability is equal to the reliability R(t), because R(t) is the probability that no failure has occurred during the whole period (0, t). [Figure: timeline alternating operational periods Xi, Xi+1, Xi+2 with repair periods Ui, Ui+1.]
MTTF (Mean Time To Failure). $E[X] = \int_0^\infty t\, f(t)\, dt = \int_0^\infty R(t)\, dt$: the expected value of the random variable X, which represents the time until a fault occurs in the system. When $R(t) = e^{-\lambda t}$ (X is exponentially distributed), the failure rate is $\lambda$ and MTTF $= 1/\lambda$.
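The identity MTTF = integral of R(t) can be checked numerically. This is a minimal sketch, assuming a trapezoidal rule with a finite upper limit is accurate enough. The function name, the cutoff of 20 and the rate 2 are illustrative choices, not values from the slide.

```python
import math

def mttf_numeric(R, upper, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to `upper`."""
    h = upper / steps
    total = 0.5 * (R(0.0) + R(upper))
    for i in range(1, steps):
        total += R(i * h)
    return total * h

lam = 2.0  # assumed failure rate for the illustration

def R(t):
    """Reliability of an exponentially distributed lifetime."""
    return math.exp(-lam * t)

estimate = mttf_numeric(R, upper=20.0)  # R(20) is negligible, so the tail is ignored
print(estimate, 1.0 / lam)
```

The numerical estimate agrees with 1/lam to within the discretization error.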
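MTTR (Mean Time To Repair). MTTR $= E[U_i]$: the expected value of the random variable $U_i$, which represents the downtime for the i-th repair or replacement. When $U_i$ is exponentially distributed with repair rate $\mu$, MTTR $= 1/\mu$. [Figure: timeline alternating operational periods Xi, Xi+1, Xi+2 with repair periods Ui, Ui+1.]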
Availability. Instantaneous availability: $A(t) = \Pr[\text{the component is functioning correctly at time } t]$. Steady-state availability (the general meaning of "availability"): $A = \lim_{t \to \infty} A(t)$.
Availability. When $X_i$ and $U_i$ are exponentially distributed: $F_{X_i}(t) = 1 - e^{-\lambda t}$, $F_{U_i}(t) = 1 - e^{-\mu t}$. Instantaneous availability: $A(t) = (\mu + \lambda e^{-(\lambda + \mu)t}) / (\mu + \lambda)$. Steady-state availability: $A = \lim_{t \to \infty} A(t) = \mu / (\mu + \lambda)$.
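The two availability formulas can be evaluated side by side. In the sketch below the function names and the rates (0.025 failures/hr, 0.2 repairs/hr) are assumptions chosen for illustration.

```python
import math

def instantaneous_availability(t, lam, mu):
    """A(t) = (mu + lam * e^(-(lam + mu) * t)) / (mu + lam)."""
    return (mu + lam * math.exp(-(lam + mu) * t)) / (mu + lam)

def steady_state_availability(lam, mu):
    """A = mu / (mu + lam), the limit of A(t) as t grows."""
    return mu / (mu + lam)

lam, mu = 0.025, 0.2  # assumed failure and repair rates (per hour)
for t in (0.0, 1.0, 10.0, 100.0):
    print(t, instantaneous_availability(t, lam, mu))
print("steady state:", steady_state_availability(lam, mu))
```

A(0) is 1 because the system starts in the working state, and A(t) decays toward the steady-state value as t increases.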
Availability and MTTF / MTTR. MTTR (mean time to repair): MTTR $= E[U_i]$, the expected value of the random variable $U_i$ that represents the downtime for the i-th repair or replacement. MTTF (mean time to failure): MTTF $= E[X_i]$, the expected value of the random variable $X_i$ that represents the duration of the i-th functioning period. Steady-state availability: $A = \text{MTTF} / (\text{MTTF} + \text{MTTR}) = \mu / (\mu + \lambda)$ (when $X_i$ and $U_i$ are exponentially distributed with parameters $\lambda$ and $\mu$).
Reliability Block Diagrams. Represent the logical structure of a system with regard to how the reliability of its components affects the system reliability. Series structure: every component has to be functioning, which decreases reliability; $R = R_1 R_2$. Parallel structure: at least one component has to be operational, which increases reliability; $R = 1 - (1 - R_1)(1 - R_2)$. Example: $R_1 = 0.95$ in series with the parallel pair $R_2 = 0.9$, $R_3 = 0.8$. $R_a = 1 - (1 - R_2)(1 - R_3) = 1 - (1 - 0.9)(1 - 0.8) = 1 - (0.1)(0.2) = 0.98$. $R_{total} = R_1 R_a = 0.95 \times 0.98 = 0.931$.
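The series and parallel rules compose naturally as functions. The sketch below reproduces the worked example (0.95 in series with 0.9 and 0.8 in parallel). The names `series` and `parallel` are hypothetical helpers, and component failures are assumed independent.

```python
import math

def series(*reliabilities):
    """Series structure: all components must work, so reliabilities multiply."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """Parallel structure: fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

r_a = parallel(0.9, 0.8)      # parallel pair R2, R3
r_total = series(0.95, r_a)   # R1 in series with the pair
print(round(r_a, 6), round(r_total, 6))
```

Nesting the calls mirrors the nesting of the block diagram, so larger diagrams can be evaluated the same way.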
Reliability Block Diagrams. Represent the logical structure of a system with regard to how the reliability of its components affects the system reliability. F(t) = distribution function of the failure time (unreliability). Series structure: $F(t) = 1 - (1 - F_1(t))(1 - F_2(t))$. Parallel structure: $F(t) = F_1(t) F_2(t)$. Example: $F_1 = 0.003$ in series with the parallel pair $F_2 = 0.001$, $F_3 = 0.002$, so $F(t) = 1 - (1 - F_1(t))(1 - F_2(t) F_3(t))$. $F_2(t) F_3(t) = 0.001 \times 0.002 = 0.000002$. $F(t) = 1 - (1 - 0.003)(1 - 0.000002) = 1 - (0.997)(0.999998) = 1 - 0.996998 \approx 0.003$.
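The same structures can be evaluated in terms of unreliability. This sketch uses hypothetical helper names, assumes independent component failures, and plugs in F1 = 0.003, F2 = 0.001, F3 = 0.002 from the example.

```python
def series_unreliability(*fs):
    """Series structure fails if any component fails: 1 - product of (1 - F_i)."""
    all_ok = 1.0
    for f in fs:
        all_ok *= (1.0 - f)
    return 1.0 - all_ok

def parallel_unreliability(*fs):
    """Parallel structure fails only if all components fail: product of F_i."""
    all_failed = 1.0
    for f in fs:
        all_failed *= f
    return all_failed

f_pair = parallel_unreliability(0.001, 0.002)   # F2 * F3
f_total = series_unreliability(0.003, f_pair)   # F1 in series with the pair
print(f_pair, f_total)
```

Because the redundant pair is very unlikely to fail as a whole, the system unreliability is dominated by the single series component F1.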
Example of Calculation
Reliability & Failure Rate. Calculate the reliability R over 2 years for a system that has a failure rate of 0.3 (failures/year). $R(t) = e^{-\lambda t}$ with failure rate $\lambda = 0.3$: $R(2) = e^{-0.3 \times 2} = e^{-0.6} = 2.71828183^{-0.6} \approx 0.5488$.
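The calculation above is a one-line function. The name `reliability` is a hypothetical helper, and an exponentially distributed lifetime is assumed, as in the slide.

```python
import math

def reliability(t, lam):
    """R(t) = e^(-lam * t) for an exponentially distributed lifetime."""
    return math.exp(-lam * t)

r_two_years = reliability(t=2.0, lam=0.3)  # t in years, lam in failures/year
print(round(r_two_years, 4))
```

The time unit of t must match the unit of the failure rate (years here).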
MTTF. What is the MTTF of this system? [Figure: timeline with operating periods of 100 hr, 120 hr and 140 hr and repair periods of 4 hr, 3 hr and 2 hr.] MTTF = (100 + 120 + 140) / 3 = 120 (hr/failure).
MTTF & Failure Rate. What is the failure rate of a system that has MTTF = 40 hr/failure? [Figure: timeline with operating periods of 30 hr, 40 hr and 50 hr and repair periods of 4 hr, 3 hr and 2 hr.] Failure rate $\lambda$ = 1 / MTTF = 1 / 40 = 0.025 (failures/hr).
MTTR. What is the MTTR of this system? [Figure: timeline with operating periods of 100 hr, 120 hr and 140 hr and repair periods of 4 hr, 3 hr and 2 hr.] MTTR = (3 + 2 + 4) / 3 = 3 (hr/repair).
MTTR & Repairing Rate. What is the MTTR of a system that has a repair rate of 0.2 (repairs/hr)? Repair rate $\mu$ = 1 / MTTR, so MTTR = 1 / $\mu$ = 1 / 0.2 = 5 (hr/repair).
Availability & MTTF & MTTR. What is the availability of this system? [Figure: timeline with operating periods of 100 hr, 120 hr and 140 hr and repair periods of 4 hr, 3 hr and 2 hr.] A = MTTF / (MTTF + MTTR). MTTF = (100 + 120 + 140) / 3 = 120 (hr/failure). MTTR = (3 + 2 + 4) / 3 = 3 (hr/repair). A = 120 / (120 + 3) = 120 / 123 ≈ 0.9756.
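The worked examples can be reproduced from the observed periods. In this sketch the list names are hypothetical and the values are the operating and repair durations read off the timeline.

```python
up_times = [100.0, 120.0, 140.0]   # operating periods in hours
repair_times = [4.0, 3.0, 2.0]     # repair periods in hours

mttf = sum(up_times) / len(up_times)           # mean time to failure
mttr = sum(repair_times) / len(repair_times)   # mean time to repair

failure_rate = 1.0 / mttf   # lambda, failures per hour
repair_rate = 1.0 / mttr    # mu, repairs per hour

availability = mttf / (mttf + mttr)
# Equivalent form in terms of the rates: mu / (mu + lambda)
availability_from_rates = repair_rate / (repair_rate + failure_rate)

print(mttf, mttr, round(availability, 4))
```

Both expressions for the steady-state availability give the same value, since MTTF = 1/lambda and MTTR = 1/mu.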
Fault Trees. A pictorial representation of the combination of events that can cause the occurrence of an undesirable event (failure). The starting point of the tree is the definition of a single undesirable event (failure). An event is reduced to a combination of lower-level events by means of logic gates: OR gate, AND gate, k-out-of-n gate. Signal values: 0 = normal, 1 = fails. F(t) = probability that the failure event has occurred (F(t) is a CDF). k-out-of-n gate over n inputs with the same F(t): $\sum_{i=k}^{n} \binom{n}{i} F(t)^i (1 - F(t))^{n-i}$. Example: TMR failure is a 2-out-of-3 gate over S1, S2, S3.
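The k-out-of-n gate is a binomial tail sum. A minimal sketch, assuming n independent inputs with the same failure probability; the function name and the value 0.1 are illustrative and not from the slide.

```python
import math

def k_out_of_n(k, n, f):
    """Probability that at least k of n independent inputs have failed,
    when each input fails with probability f."""
    return sum(math.comb(n, i) * f**i * (1.0 - f)**(n - i)
               for i in range(k, n + 1))

f_module = 0.1                              # assumed failure probability of one module
tmr_failure = k_out_of_n(2, 3, f_module)    # TMR fails when 2 or more of 3 modules fail
print(round(tmr_failure, 6))
```

With k = 1 the gate reduces to an OR gate and with k = n to an AND gate, so the one function covers all three gate types for identical inputs.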
Fault Tree Model & Reliability Block Model. Reliability block model: the structure that shows when the system is functioning. Fault tree model: the structure that shows when the system has failed; the output of the top event is a logic 1 (0 = normal, 1 = fails). Example system: 2 processors (P1, P2) and 3 memory modules (M1, M2, M3). [Figure: reliability block diagram of the system next to its fault tree, whose top event Failure is built from OR and AND gates over P1, P2, M1, M2, M3.]
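Example of Fault Tree. 0 = normal, 1 = fails. [Figure: the top event Failure is an AND gate whose inputs are S1 and an OR gate over S2 and S3.] OR gate output: $1 - (1 - F_2(t))(1 - F_3(t))$. AND gate output (top event): $F_1(t)\,(1 - (1 - F_2(t))(1 - F_3(t)))$.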
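The gate formulas in this example translate directly into two small functions. This is a sketch under the assumption that the gate inputs are independent; the function names and the probabilities 0.1, 0.2, 0.3 are made up for illustration.

```python
def and_gate(*fs):
    """Output fails only if all inputs fail: product of the F_i."""
    out = 1.0
    for f in fs:
        out *= f
    return out

def or_gate(*fs):
    """Output fails if any input fails: 1 - product of (1 - F_i)."""
    none_failed = 1.0
    for f in fs:
        none_failed *= (1.0 - f)
    return 1.0 - none_failed

f1, f2, f3 = 0.1, 0.2, 0.3   # assumed failure probabilities of S1, S2, S3
f_top = and_gate(f1, or_gate(f2, f3))
print(round(f_top, 6))
```

Evaluating the tree bottom-up like this is valid only when no component appears in more than one branch.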
Fault Trees (Basic). A fault tree with no repeated component: the failure distributions of the components are independent. Example system: 2 processors (P) and 3 memory modules (M). The system is operational if at least one processor and one memory module are operational. [Figure: reliability block diagram of the system next to its fault tree, built from OR and AND gates over P1, P2, M1, M2, M3.] F(t) = ??
Fault Trees (Advanced). A fault tree with a repeated component: the events feeding the gates are no longer independent, so we have to use the factoring technique. Suppose that instead of all three memory modules being shared between the two processors, one of the memory modules (M3) is shared and the other two are private, one for each processor. [Figure: fault tree with top event Failure, AND and OR gates, and components M1, M2, M3, P1, P2.] What is the reliability block diagram?
Factoring. Condition on the repeated component M3 (0 = normal, 1 = fails). Full tree, $F(t)$: Failure is an AND gate over (P1 OR (M1 AND M3)) and (P2 OR (M2 AND M3)). Case M3 has failed, $F_a(t)$: Failure is an AND gate over (P1 OR M1) and (P2 OR M2). Case M3 has not failed, $F_b(t)$: Failure is an AND gate over P1 and P2. $F(t) = F_{M3}(t)\,F_a(t) + (1 - F_{M3}(t))\,F_b(t)$: multiply the result for each case by the probability that the case happens, then add the products.
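Factoring can be cross-checked against brute-force enumeration of all component states. The sketch below assumes independent component failures and the tree structure described in this deck (each processor fails to be useful if it fails itself, or if both its private memory and the shared M3 fail). All function names and the numeric probabilities are hypothetical.

```python
from itertools import product

def or_gate(a, b):
    """Fails if either independent input fails."""
    return 1.0 - (1.0 - a) * (1.0 - b)

def failure_by_factoring(fp1, fp2, fm1, fm2, fm3):
    """F = F_M3 * F_a + (1 - F_M3) * F_b, conditioning on the shared module M3."""
    f_a = or_gate(fp1, fm1) * or_gate(fp2, fm2)  # M3 failed: each side needs P and private M
    f_b = fp1 * fp2                              # M3 working: only the processors matter
    return fm3 * f_a + (1.0 - fm3) * f_b

def failure_by_enumeration(fp1, fp2, fm1, fm2, fm3):
    """Sum the probability of every component state in which the system has failed."""
    probs = (fp1, fp2, fm1, fm2, fm3)
    total = 0.0
    for state in product((0, 1), repeat=5):      # 1 = failed, 0 = normal
        p1, p2, m1, m2, m3 = state
        side1_down = p1 or (m1 and m3)
        side2_down = p2 or (m2 and m3)
        if side1_down and side2_down:
            weight = 1.0
            for failed, f in zip(state, probs):
                weight *= f if failed else (1.0 - f)
            total += weight
    return total

args = (0.1, 0.1, 0.2, 0.2, 0.05)  # assumed F for P1, P2, M1, M2, M3
print(failure_by_factoring(*args), failure_by_enumeration(*args))
```

Enumeration grows as 2^n in the number of components, while factoring only doubles the work per repeated component, which is why factoring is preferred for trees with a few shared components.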