Fault-Tolerant Computing Systems #5 Reliability and Availability 2

Presentation transcript:

Fault-Tolerant Computing Systems #5: Reliability and Availability 2
Pattara Leelaprute, Computer Engineering Department, Kasetsart University
pattara.l@ku.ac.th

Reliability and Availability
Reliability: the probability that a system survives until time t (it has not failed up to time t).
Availability: the probability that a system works properly at time t.

Preliminaries of Probability
Discrete sample space: tossing a coin; the sample space is {head, tail}.
Continuous sample space: how long the PC stays up after a reboot; the sample space is {t | t > 0}.
Random variable: a function mapping each element of the sample space to a real number. Ex. heads = 1, tails = 0.

Preliminaries
Random variable: a function mapping each element of the sample space to a real number.
CDF (cumulative distribution function; it accumulates, so it never decreases): $F(t) = \Pr[X \le t]$, the probability that the system has gone down by time t.
pdf (probability density function): $f(t) = dF(t)/dt$.
Expected value (mean) of a random variable: $E[X] = \int_0^\infty t\, f(t)\, dt$ for $X \ge 0$. This is the average outcome of the random experiment.

Exponential Distribution
The most commonly used distribution function in reliability modeling.
CDF: $F(t) = 1 - e^{-\lambda t}$
pdf: $f(t) = \lambda e^{-\lambda t}$
Mean: $1/\lambda$
Memoryless property: let $Y = X - t$ be the remaining life at time t. Then $G_t(y) = \Pr[Y \le y \mid X > t] = 1 - e^{-\lambda y}$.
The distribution of the remaining life of a component does not depend on how long it has been working. The component does not AGE!
Example ($\lambda = 2$): $f(t) = 2e^{-2t}$, $F(t) = 1 - e^{-2t}$.
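
The memoryless property can be checked numerically. The sketch below is only an illustration: the helper names `exp_cdf` and `remaining_life_cdf` and the chosen times (0.5 and 3.0) are assumptions, not part of the slides. It computes the conditional probability straight from its definition and compares it with the CDF of a brand-new component.

```python
import math

def exp_cdf(t, lam):
    """CDF of the exponential distribution: F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

def remaining_life_cdf(y, t, lam):
    """Pr[X - t <= y | X > t], from the definition of conditional probability."""
    survive_t = 1.0 - exp_cdf(t, lam)                        # Pr[X > t]
    fail_in_window = exp_cdf(t + y, lam) - exp_cdf(t, lam)   # Pr[t < X <= t + y]
    return fail_in_window / survive_t

lam = 2.0                                  # rate used in the slide example
fresh = exp_cdf(0.5, lam)                  # new component fails within 0.5
aged = remaining_life_cdf(0.5, 3.0, lam)   # component that already survived 3.0
print(fresh, aged)                         # the two values agree up to rounding
```

Both numbers match because the factor for the elapsed time cancels between numerator and denominator, which is exactly what "the component does not age" means.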

Reliability R(t)
Reliability: the probability that a system survives until time t.
$R(t) = \Pr[X > t] = 1 - F(t)$
X: the random variable representing the time to failure of the system (the life of the system).
Example: if F(t) is the exponential distribution $F(t) = 1 - e^{-2t}$, then $R(t) = e^{-2t}$.
(Figure: time axis starting at 0, with the failure time X falling after time t.)

Reliability R(t)
$R(t) = \Pr[X > t] = 1 - F(t)$
$R(0) = 1$: the system is initially working.
$R(\infty) = 0$: no system has an infinite lifetime.
Example: with the exponential distribution $F(t) = 1 - e^{-2t}$, the reliability is $R(t) = e^{-2t}$.

Failure Rate
Probability that a fault occurs in the interval $[t, t+\Delta t]$: $f(t)\,\Delta t$.
Probability that a fault occurs in $[t, t+\Delta t]$, given that the system is working properly at t: $f(t)\,\Delta t / R(t)$.
Failure rate: $f(t) / R(t)$.
Here f(t) is the failure density (pdf), F(t) is the failure distribution (CDF), and R(t) is the reliability.
Example (exponential distribution): $f(t) = 2e^{-2t}$, $F(t) = 1 - e^{-2t}$, $R(t) = e^{-2t}$, so the failure rate is $f(t)/R(t) = 2$ for every t.
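
As a rough numerical check of this reasoning, the sketch below divides the conditional probability of failing in a short interval by the interval length and compares it with f(t)/R(t). The interval length `dt = 1e-6`, the evaluation time `t = 1.0` and the function names are assumptions chosen for illustration.

```python
import math

lam = 2.0  # rate used in the slide example

def F(t):
    """Failure distribution (CDF) of the exponential lifetime."""
    return 1.0 - math.exp(-lam * t)

def f(t):
    """Failure density (pdf)."""
    return lam * math.exp(-lam * t)

def R(t):
    """Reliability: probability of surviving beyond t."""
    return 1.0 - F(t)

def conditional_failure_prob(t, dt):
    """Pr[t < X <= t + dt | X > t]."""
    return (F(t + dt) - F(t)) / R(t)

t, dt = 1.0, 1e-6
approx_rate = conditional_failure_prob(t, dt) / dt  # finite-interval estimate
exact_rate = f(t) / R(t)                            # failure rate from the slide
print(approx_rate, exact_rate)
```

For the exponential case both values stay close to 2 whatever t is chosen, which is the constant failure rate.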

Bathtub Curve
The failure rate $f(t)/R(t)$ plotted against time typically follows a bathtub curve. This is the general failure-rate pattern observed in empirical data collected from mechanical and electronic components.
1. Initial stage: failures caused by inherent defects and faulty design.
2. Middle stage: constant failure rate.
3. Last stage: faults caused by age.
When the lifetime distribution F(t) of a system is exponential, the failure rate is constant (see the previous slide).

Availability
The probability that the system works properly at time t.
Availability is a measure frequently used to describe the behavior of a system.
If the system has no repair or replacement, availability equals the reliability R(t), because R(t) is the probability that no failure has occurred during the whole period (0, t).
(Figure: timeline alternating between operational periods $X_i, X_{i+1}, X_{i+2}$ and repair periods $U_i, U_{i+1}$; each failure starts a repair and each repair restores operation.)

MTTF (Mean Time To Failure)
$E[X] = \int_0^\infty t\, f(t)\, dt = \int_0^\infty R(t)\, dt$
The expected value of the random variable X, which represents the time until a fault occurs in the system.
When $R(t) = e^{-\lambda t}$ (X is exponentially distributed): failure rate $= \lambda$ and MTTF $= 1/\lambda$.
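
The identity MTTF = integral of R(t) can be sanity-checked with a simple numerical integration. This is a sketch under assumptions: the helper name `mttf_from_reliability`, the trapezoidal rule, the cut-off `upper=20.0` and the rate 2.0 are illustrative choices only.

```python
import math

def mttf_from_reliability(R, upper, steps=200_000):
    """Approximate the integral of R(t) over [0, upper] with the trapezoidal rule."""
    h = upper / steps
    total = 0.5 * (R(0.0) + R(upper))
    for i in range(1, steps):
        total += R(i * h)
    return total * h

lam = 2.0
# The tail beyond t = 20 is negligible for this rate, so a finite cut-off suffices.
mttf = mttf_from_reliability(lambda t: math.exp(-lam * t), upper=20.0)
print(mttf, 1.0 / lam)
```

The numerical result agrees with 1/lam to well within the integration error.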

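MTTR (Mean Time To Repair)
$\text{MTTR} = E[U_i]$
The expected value of the random variable $U_i$, which represents the downtime for the i-th repair or replacement.
When the repair time $U_i$ is exponentially distributed with repair rate $\mu$: MTTR $= 1/\mu$.
(Figure: timeline of operational periods $X_i, X_{i+1}, X_{i+2}$ separated by repair periods $U_i, U_{i+1}$.)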

Availability
Instantaneous availability (at a single instant): $A(t) = \Pr[\text{the component is functioning correctly at time } t]$
Steady-state availability (the general meaning of "availability"): $A = \lim_{t \to \infty} A(t)$

Availability
When $X_i$ and $U_i$ are exponentially distributed: $F_{X_i}(t) = 1 - e^{-\lambda t}$, $F_{U_i}(t) = 1 - e^{-\mu t}$.
Instantaneous availability: $A(t) = \dfrac{\mu + \lambda e^{-(\lambda + \mu)t}}{\mu + \lambda}$
Steady-state availability: $A = \lim_{t \to \infty} A(t) = \dfrac{\mu}{\mu + \lambda}$
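
A minimal sketch of these two formulas, with `lam` standing for the failure rate and `mu` for the repair rate. The function names and the numeric rates (0.025 and 0.2 per hour) are assumptions picked for illustration.

```python
import math

def instantaneous_availability(t, lam, mu):
    """A(t) for exponential up times (rate lam) and repair times (rate mu)."""
    return (mu + lam * math.exp(-(lam + mu) * t)) / (lam + mu)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows without bound."""
    return mu / (lam + mu)

lam, mu = 0.025, 0.2   # illustrative rates, per hour
a_start = instantaneous_availability(0.0, lam, mu)     # system starts in the working state
a_late = instantaneous_availability(1000.0, lam, mu)   # long after start-up
a_steady = steady_state_availability(lam, mu)
print(a_start, a_late, a_steady)
```

A(0) is 1 because the system starts out working, and A(t) decays toward the steady-state value at rate lam + mu.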

Availability and MTTF / MTTR
MTTR (mean time to repair): $\text{MTTR} = E[U_i]$, the expected value of the random variable $U_i$ that represents the downtime for the i-th repair or replacement.
MTTF (mean time to failure): $\text{MTTF} = E[X_i]$, the expected value of the random variable $X_i$ that represents the duration of the i-th functioning period.
Steady-state availability: $A = \dfrac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \dfrac{\mu}{\mu + \lambda}$ (the second form holds when $X_i$ and $U_i$ are exponentially distributed with parameters $\lambda$ and $\mu$).

Reliability Block Diagrams
Represent the logical structure of a system in terms of how the reliability of its components affects the system reliability.
Series structure: every component has to be functioning, so reliability decreases. $R = R_1 R_2$
Parallel structure: at least one component must be operational, so reliability increases. $R = 1 - (1 - R_1)(1 - R_2)$
Example: component 1 ($R_1 = 0.95$) in series with the parallel pair of component 2 ($R_2 = 0.9$) and component 3 ($R_3 = 0.8$).
$R_a = 1 - (1 - R_2)(1 - R_3) = 1 - (1 - 0.9)(1 - 0.8) = 1 - (0.1)(0.2) = 0.98$
$R_{total} = R_1 R_a = 0.95 \times 0.98 = 0.931$
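
The series and parallel rules can be sketched as two small helpers and applied to the numbers in this example. The function names `series` and `parallel` are assumptions, and the components are taken to fail independently.

```python
def series(*reliabilities):
    """All components must work: multiply the reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel(*reliabilities):
    """At least one component must work: one minus the chance that all fail."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

r_a = parallel(0.9, 0.8)       # components 2 and 3 in parallel
r_total = series(0.95, r_a)    # component 1 in series with that pair
print(r_a, r_total)
```

Nesting the calls mirrors the nesting of the block diagram, so larger diagrams can be evaluated the same way as long as no component appears twice.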

Reliability Block Diagrams
Represent the logical structure of a system in terms of how the reliability of its components affects the system reliability.
F(t) = distribution function of the failure time (unreliability).
Series structure: $F(t) = 1 - (1 - F_1(t))(1 - F_2(t))$
Parallel structure: $F(t) = F_1(t)\,F_2(t)$
Example: component 1 in series with the parallel pair of components 2 and 3: $F(t) = 1 - (1 - F_1(t))(1 - F_2(t)\,F_3(t))$
With $F_1 = 0.003$, $F_2 = 0.001$, $F_3 = 0.002$:
$F_2(t)\,F_3(t) = 0.001 \times 0.002 = 0.000002$
$F(t) = 1 - (1 - 0.003)(1 - 0.000002) = 1 - (0.997)(0.999998) \approx 1 - 0.996998 \approx 0.003002$

Example of Calculation

Reliability & Failure Rate
Calculate the reliability R over 2 years for a system with failure rate $\lambda = 0.3$ (failures/year).
$R(t) = e^{-\lambda t}$
$R(2) = e^{-0.3 \times 2} = e^{-0.6} = 2.71828183^{-0.6} \approx 0.5488$
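
The same calculation as a short sketch; the function name `reliability` is an assumption.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

r_two_years = reliability(2.0, 0.3)   # failure rate 0.3 per year, mission time 2 years
print(round(r_two_years, 4))
```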

MTTF
What is the MTTF of this system?
(Figure: timeline with operational periods of 100 hr, 120 hr and 140 hr, separated by repair periods of 4 hr, 3 hr and 2 hr.)
MTTF = (100 + 120 + 140) / 3 = 120 (hr/failure)

MTTF & Failure Rate
What is the failure rate of a system that has MTTF = 40 hr/failure?
(Figure: timeline with operational periods of 30 hr, 40 hr and 50 hr, separated by repair periods of 4 hr, 3 hr and 2 hr.)
Failure rate $\lambda$ = 1 / MTTF = 1 / 40 = 0.025 (failures/hr)

MTTR
What is the MTTR of this system?
(Figure: timeline with operational periods of 100 hr, 120 hr and 140 hr, separated by repair periods of 4 hr, 3 hr and 2 hr.)
MTTR = (3 + 2 + 4) / 3 = 3 (hr/repair)

MTTR & Repair Rate
What is the MTTR of a system that has a repair rate of 0.2 (repairs/hr)?
Repair rate $\mu$ = 1 / MTTR, so MTTR = 1 / $\mu$ = 1 / 0.2 = 5 (hr/repair)

Availability & MTTF & MTTR
What is the availability of this system?
(Figure: timeline with operational periods of 100 hr, 120 hr and 140 hr, separated by repair periods of 4 hr, 3 hr and 2 hr.)
A = MTTF / (MTTF + MTTR)
MTTF = (100 + 120 + 140) / 3 = 120 (hr/failure)
MTTR = (3 + 2 + 4) / 3 = 3 (hr/repair)
A = 120 / (120 + 3) = 120 / 123 ≈ 0.976
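
The worked examples above can be reproduced from the observed durations. In this sketch the list names `up_times` and `repair_times` are assumptions; the values are the ones shown on the timeline.

```python
up_times = [100.0, 120.0, 140.0]   # operational periods, in hours
repair_times = [4.0, 3.0, 2.0]     # repair periods, in hours

mttf = sum(up_times) / len(up_times)           # mean time to failure
mttr = sum(repair_times) / len(repair_times)   # mean time to repair
failure_rate = 1.0 / mttf                      # failures per hour
repair_rate = 1.0 / mttr                       # repairs per hour
availability = mttf / (mttf + mttr)            # steady-state availability

print(mttf, mttr, round(availability, 4))
```

Three observations give only a rough estimate of the means; with more history the same averages simply run over longer lists.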

Fault Trees
A pictorial representation of the combinations of events that can cause the occurrence of an undesirable event (failure).
The starting point (top of the tree) is the definition of a single undesirable event (failure).
An event is reduced to a combination of lower-level events by means of logic gates: OR gate, AND gate, k-out-of-n gate.
F(t) = probability of occurrence of the failure event (as a function of t, F(t) is a CDF). Signal values: 0 = normal, 1 = fails.
k-out-of-n gate, when every input has the same F(t): $\sum_{i=k}^{n} \binom{n}{i} F(t)^i \,(1 - F(t))^{n-i}$
Example: TMR failure is a 2-out-of-3 gate over S1, S2, S3.
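
A minimal sketch of the k-out-of-n gate formula, assuming independent inputs that all share the same failure probability. The function name and the input value 0.1 are illustrative assumptions.

```python
from math import comb

def k_out_of_n_failure(k, n, p_fail):
    """Probability that at least k of n independent, identical inputs have failed."""
    return sum(
        comb(n, i) * p_fail**i * (1.0 - p_fail) ** (n - i)
        for i in range(k, n + 1)
    )

# TMR fails when 2 or more of its 3 modules fail.
tmr_failure = k_out_of_n_failure(2, 3, 0.1)
print(tmr_failure)
```

With k = 1 the gate reduces to an OR gate and with k = n to an AND gate, so the one helper covers all three gate types for identical inputs.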

Fault Tree Model & Reliability Block Model
Reliability block model: the structure that shows when the system is functioning.
Fault tree model: the structure that shows when the system has failed. The output of the top event is a logic 1 (0 = normal, 1 = fails).
Example system: 2 processors (P1, P2) and 3 memory modules (M1, M2, M3).
(Figure: the reliability block diagram of the system beside the equivalent fault tree, built from an OR gate and AND gates over P1, P2, M1, M2, M3.)

Example of Fault Tree
0 = normal, 1 = fails.
OR gate over S2 and S3: $1 - (1 - F_2(t))(1 - F_3(t))$
AND gate over S1 and the OR output (top event, Failure): $F_1(t)\,\bigl(1 - (1 - F_2(t))(1 - F_3(t))\bigr)$
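
The gate arithmetic in this example can be sketched with two helpers. The names `and_gate` and `or_gate` and the three input probabilities are assumptions for illustration, and the inputs are taken to be independent.

```python
def and_gate(*fail_probs):
    """Output fails only if every input has failed."""
    result = 1.0
    for p in fail_probs:
        result *= p
    return result

def or_gate(*fail_probs):
    """Output fails if at least one input has failed."""
    none_fail = 1.0
    for p in fail_probs:
        none_fail *= (1.0 - p)
    return 1.0 - none_fail

f1, f2, f3 = 0.003, 0.001, 0.002          # illustrative failure probabilities
top_event = and_gate(f1, or_gate(f2, f3))  # Failure = S1 AND (S2 OR S3)
print(top_event)
```

An AND gate in a fault tree corresponds to a parallel structure in a reliability block diagram, and an OR gate to a series structure.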

Fault Trees (Basic)
A fault tree with no repeated components; the failure distributions of the components are independent.
System: 2 processors (P1, P2) and 3 memory modules (M1, M2, M3). The system is operational if at least one processor and one memory module are operational.
(Figure: the reliability block diagram beside the fault tree, whose top event Failure combines the processor failures and the memory-module failures through OR and AND gates.)
F(t) = ??
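
One way to work out the F(t) asked for here, assuming independent component failures. The subscripted symbols $F_{P1}$, $F_{M1}$ and so on stand for the component failure distributions, and $F_P$ and $F_M$ are intermediate names introduced only for this sketch. The system fails if both processors fail, or if all three memory modules fail:

```latex
\begin{aligned}
F_P(t) &= F_{P1}(t)\,F_{P2}(t)
  && \text{(AND gate: both processors have failed)} \\
F_M(t) &= F_{M1}(t)\,F_{M2}(t)\,F_{M3}(t)
  && \text{(AND gate: all memory modules have failed)} \\
F(t)   &= 1 - \bigl(1 - F_P(t)\bigr)\bigl(1 - F_M(t)\bigr)
  && \text{(OR gate at the top event)}
\end{aligned}
```

This works gate by gate only because no component appears under more than one gate.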

Fault Trees (Advanced)
A fault tree with a repeated component: because one component appears more than once, the failure events feeding the gates are not independent.
Suppose that instead of all three memory modules being shared between the two processors, one of the memory modules (M3) is shared and the other two are private, one for each processor.
(Figure: fault tree with AND and OR gates over P1, M1, M3, P2, M2, in which M3 appears twice.)
What is the reliability block diagram?
We have to use the factoring technique!

Factoring
Condition on the state of the repeated component M3 (0 = normal, 1 = fails):
$F(t) = F_{M3}(t)\,F_a(t) + (1 - F_{M3}(t))\,F_b(t)$
Case "M3 has failed", $F_a(t)$: the system fails if (P1 OR M1 fails) AND (P2 OR M2 fails).
Case "M3 has not failed", $F_b(t)$: the system fails if P1 AND P2 fail.
(Figure: the full fault tree for F(t), in which M3 appears under both branches, beside the two reduced trees for $F_a(t)$ and $F_b(t)$.)
Multiply the result for each case by the probability that the case happens, then add the products.
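
The factoring step can be checked against a brute-force enumeration of all component states. This is a sketch under assumptions: independent component failures, a processor branch that is down when its processor has failed or when both its private memory and M3 have failed, and made-up failure probabilities. All function names are illustrative.

```python
from itertools import product

def and_gate(*fail_probs):
    """Output fails only if every input has failed."""
    result = 1.0
    for p in fail_probs:
        result *= p
    return result

def or_gate(*fail_probs):
    """Output fails if at least one input has failed."""
    none_fail = 1.0
    for p in fail_probs:
        none_fail *= (1.0 - p)
    return 1.0 - none_fail

def factored_failure(fp1, fp2, fm1, fm2, fm3):
    """Condition on M3, then weight each case by its probability."""
    fa = and_gate(or_gate(fp1, fm1), or_gate(fp2, fm2))  # M3 has failed
    fb = and_gate(fp1, fp2)                              # M3 has not failed
    return fm3 * fa + (1.0 - fm3) * fb

def brute_force_failure(fp1, fp2, fm1, fm2, fm3):
    """Sum the probability of every component state in which the system is down."""
    probs = [fp1, fp2, fm1, fm2, fm3]
    total = 0.0
    for state in product([False, True], repeat=5):  # True means "failed"
        p1, p2, m1, m2, m3 = state
        branch1_down = p1 or (m1 and m3)
        branch2_down = p2 or (m2 and m3)
        if branch1_down and branch2_down:
            weight = 1.0
            for failed, p in zip(state, probs):
                weight *= p if failed else (1.0 - p)
            total += weight
    return total

args = (0.01, 0.02, 0.05, 0.04, 0.03)   # illustrative failure probabilities
f_factored = factored_failure(*args)
f_exact = brute_force_failure(*args)
print(f_factored, f_exact)
```

The two results agree, whereas evaluating the original tree gate by gate as if its branches were independent would count M3 twice.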