Part.2.1 In The Name of GOD FAULT TOLERANT SYSTEMS Part 2 – Canonical Structures Chapter 2 – Hardware Fault Tolerance.

Slides:



Advertisements
Similar presentations
+ Failure rates. + outline Failure rates as continuous distributions Weibull distribution – an example of an exponential distribution with the failure.
Advertisements

Reliability Engineering (Rekayasa Keandalan)
COE 444 – Internetwork Design & Management Dr. Marwan Abu-Amara Computer Engineering Department King Fahd University of Petroleum and Minerals.
8. Failure Rate Prediction Reliable System Design 2011 by: Amir M. Rahmani.
1 Non-observable failure progression. 2 Age based maintenance policies We consider a situation where we are not able to observe failure progression, or.
RELIABILITY Dr. Ron Lembke SCM 352. Reliability Ability to perform its intended function under a prescribed set of conditions Probability product will.
5.2 Continuous Random Variable
SMJ 4812 Project Mgmt and Maintenance Eng.
Reliable System Design 2011 by: Amir M. Rahmani
Reliability Engineering and Maintenance The growth in unit sizes of equipment in most industries with the result that the consequence of failure has become.
Time-Dependent Failure Models
Reliability of Systems
Helicopter System Reliability Analysis Statistical Methods for Reliability Engineering Mark Andersen.
Failure Patterns Many failure-causing mechanisms give rise to measured distributions of times-to-failure which approximate quite closely to probability.
Dependability Evaluation. Techniques for Dependability Evaluation The dependability evaluation of a system can be carried out either:  experimentally.
CSE 221: Probabilistic Analysis of Computer Systems Topics covered: Exponential distribution Reliability and failure rate (Sec )
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.2.1 FAULT TOLERANT SYSTEMS Part 2 – Canonical.
Visual Recognition Tutorial
1 Fundamentals of Reliability Engineering and Applications Dr. E. A. Elsayed Department of Industrial and Systems Engineering Rutgers University
1 Review Definition: Reliability is the probability that a component or system will perform a required function for a given period of time when used under.
Introduction Before… Next…
Copyright © 2014 reliability solutions all rights reserved Reliability Solutions Seminar Managing and Improving Reliability 2014 Agenda Martin Shaw – Reliability.
Reliability Chapter 4S.
8-1 Introduction In the previous chapter we illustrated how a parameter can be estimated from sample data. However, it is important to understand how.
1 2. Reliability measures Objectives: Learn how to quantify reliability of a system Understand and learn how to compute the following measures –Reliability.
PowerPoint presentation to accompany
Reliability Engineering - Part 1
System Reliability. Random State Variables System Reliability/Availability.
Project & Quality Management Quality Management Reliability.
1 Reliability Application Dr. Jerrell T. Stracener, SAE Fellow Leadership in Engineering EMIS 7370/5370 STAT 5340 : PROBABILITY AND STATISTICS FOR SCIENTISTS.
Chapter 6 Time dependent reliability of components and system.
-Exponential Distribution -Weibull Distribution
Transition of Component States N F Component fails Component is repaired Failed state continues Normal state continues.
Background on Reliability and Availability Slides prepared by Wayne D. Grover and Matthieu Clouqueur TRLabs & University of Alberta © Wayne D. Grover 2002,
Software Reliability SEG3202 N. El Kadri.
1 Basic probability theory Professor Jørn Vatn. 2 Event Probability relates to events Let as an example A be the event that there is an operator error.
Lecture 2: Combinatorial Modeling CS 7040 Trustworthy System Design, Implementation, and Analysis Spring 2015, Dr. Rozier Adapted from slides by WHS at.
Reliability Models & Applications Leadership in Engineering
Copyright © 2014 reliability solutions all rights reserved Reliability Solutions Seminar Managing and Improving Reliability 2015 Agenda Martin Shaw – Reliability.
Maintenance Policies Corrective maintenance: It is usually referred to as repair. Its purpose is to bring the component back to functioning state as soon.
 How do you know how long your design is going to last?  Is there any way we can predict how long it will work?  Why do Reliability Engineers get paid.
An Application of Probability to
5-1 ANSYS, Inc. Proprietary © 2009 ANSYS, Inc. All rights reserved. May 28, 2009 Inventory # Chapter 5 Six Sigma.
1 Component reliability Jørn Vatn. 2 The state of a component is either “up” or “down” T 1, T 2 and T 3 are ”Uptimes” D 1 and D 2 are “Downtimes”
Fault-Tolerant Computing Systems #4 Reliability and Availability
L Berkley Davis Copyright 2009 MER035: Engineering Reliability Lecture 6 1 MER301: Engineering Reliability LECTURE 6: Chapter 3: 3.9, 3.11 and Reliability.
Reliability Failure rates Reliability
Disk Failures Eli Alshan. Agenda Articles survey – Failure Trends in a Large Disk Drive Population – Article review – Conclusions – Criticism – Disk failure.
Quality Improvement PowerPoint presentation to accompany Besterfield, Quality Improvement, 9e PowerPoint presentation to accompany Besterfield, Quality.
CSE SW Metrics and Quality Engineering Copyright © , Dennis J. Frailey, All Rights Reserved CSE8314M12 8/20/2001Slide 1 SMU CSE 8314 /
CS203 – Advanced Computer Architecture Dependability & Reliability.
EMIS 7300 SYSTEMS ANALYSIS METHODS FALL 2005
More on Exponential Distribution, Hypo exponential distribution
Expectations of Random Variables, Functions of Random Variables
Quantitative evaluation of Dependability
Chapter 4 Continuous Random Variables and Probability Distributions
Fault-Tolerant Computing Systems #5 Reliability and Availability2
TIME TO FAILURE AND ITS PROBABILITY DISTRIBUTIONS
Reliability Failure rates Reliability
Parametric Survival Models (ch. 7)
The Exponential Distribution
Quantitative evaluation of Dependability
T305: Digital Communications
Continuous distributions
FAULT TOLERANT SYSTEMS ecs. umass
Chapter 6 Time dependent reliability of components and system
سیستم های تحمل پذیر خرابی
Simulation Berlin Chen
Definitions Cumulative time to failure (T): Mean life:
Presentation transcript:

Part.2.1 In The Name of GOD FAULT TOLERANT SYSTEMS Part 2 – Canonical Structures Chapter 2 – Hardware Fault Tolerance

Part.2.2 Final Project?  Choose a system that you can model and simulate.  Search its fault models.  Inject the Fault models and measure its reliability,… (Fault Simulation).  Find a Fault-Tolerance method for the injected fault model.  Apply the method and measure its reliability, … again.  Or you can wait to see other titles that you may be more eager to do as a project.

Part.2.3 Failure Rate  Rate at which a component suffers faults  Depends on age, ambient temperature, voltage or physical shocks that it suffers, and technology  Dependence on age is usually captured by the bathtub curve:

Part.2.4 Bathtub Curve  Young component – high failure rate  Good chance that some defective units slipped through manufacturing quality control and were released  Later - bad units weeded out – remaining units have a fairly constant failure rate  As component becomes very old, aging effects cause the failure rate to rise again

Part.2.5 Empirical Formula for - Failure Rate  =  L  Q (C 1  T  V + C 2  E )   L : Learning factor, (how mature the technology is)   Q : Manufacturing process Quality factor (0.25 to 20.00)   T : Temperature factor, (from 0.1 to 1000), proportional to exp(- Ea/kT) where Ea is the activation energy in electron-volts associated with the technology, k is the Boltzmann constant and T is the temperature in Kelvin   V : Voltage stress factor for CMOS devices (from 1 to 10 depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)   E : Environment shock factor: from about 0.4 (air-conditioned environment), to 13.0 (harsh environment - e.g., space, cars)  C 1, C 2 : Complexity factors; functions of number of gates on the chip and number of pins in the package  Further details: MIL-HDBK-217E handbook

Part.2.6 Reliability and MTTF of a Single Component (Module)  Module operational at time t =0  Remains operational until it is hit by a failure  All failures are permanent  T - lifetime of module - time until it fails  T is a random variable  f( t ) - density function of T  F( t ) - cumulative distribution function of T

Part.2.7 Probabilistic Interpretation of f(t) and F(t)  F(t) - probability that the component will fail at or before time t F(t) = Prob {T  t}  f(t) – not a probability, but the momentary rate of probability of failure at time t f(t)dt = Prob {t  T  t+dt}  Like any density function (defined for t  0 ) f(t)  0 (for all t  0) and  The functions F and f are related through and

Part.2.8 Reliability and Failure (Hazard) Rate  The reliability of a single module - R(t)  R(t) = Prob {T>t} = 1- F(t)  The conditional probability that the module will fail at time t, given it has not failed before, is Prob {t  T  t+dt | T  t} = Prob {t  T  t+dt} / Prob{T  t} = f(t)dt / (1-F(t))  The failure rate (or hazard rate) of a component at time t, (t), is defined as  (t) = f(t)/(1- F(t))  Since dR(t)/dt = - f(t), we get (t) = -1/R(t) dR(t)/dt

Part.2.9 Constant Failure Rate  If the module has a failure rate which is constant over time -  (t) =  dR(t) / dt = - R(t) ; R(0)=1  The solution of this differential equation is  A module has a constant failure rate if and only if T, the lifetime of the module, has an exponential distribution

Part.2.10 Mean Time to Failure (MTTF)  MTTF - expected value of the lifetime T  Two ways of calculating MTTF  First way :  Second way:  If the failure rate is a constant

Part.2.11 Weibull Distribution - Introduction  Most calculations of reliability assume that a module has a constant failure rate (or equivalently - an exponential distribution for the module lifetime T)  There are cases in which this simplifying assumption is inappropriate  Example - during the ‘’infant mortality” and ‘’wear-out” phases of the bathtub curve  Weibull distribution for the lifetime T can be used instead

Part.2.12 Weibull distribution - Equation  The Weibull distribution has two parameters, and   The density function of the component lifetime T:  The failure rate for the Weibull distribution is (t) is decreasing with time for  1, constant for  =1, appropriate for infant mortality, wearout and middle phases, respectively

Part.2.13 Reliability and MTTF for Weibull Distribution  Reliability for Weibull distribution is  MTTF for Weibull distribution is (  (x) is the Gamma function )  The special case  = 1 is the exponential distribution with a constant failure rate

Part.2.14 Canonical Structures  A canonical structure is constructed out of N individual modules  The basic canonical structures are  A series system  A parallel system  A mixed system  We will assume statistical independence between failures in the individual modules

Part.2.15 Reliability of a Series System  A series system - set of modules so that the failure of any one module causes the entire system to fail  Reliability of a series system - R s (t) - product of reliabilities of its N modules  R i (t) is the reliability of module i

Part.2.16 Series System – Modules Have Constant Failure Rates  Every module i has a constant failure rate i  s =  i is the constant failure rate of the series system  Mean Time To Failure of a series system -

Part.2.17 Reliability of a Parallel System  A Parallel System - a set of modules connected so that all the modules must fail before the system fails  Reliability of a parallel system -  is the reliability of module i

Part.2.18 Parallel System – Modules have Constant Failure Rates  Module i has a constant failure rate, i  Example - a parallel system with two modules  MTTF of a parallel system with the same

Part.2.19 Non Series/Parallel Systems  Each path represents a configuration allowing the system to operate successfully, e.g., ADF  The reliability can be calculated by expanding about a single module i :  R system =R i Prob{System works | i is fault-free} +(1-R i ) Prob{System works | i is faulty}  Draw two new diagrams: in (a) module i is operational; in (b) module i is faulty  Module i is selected so that the two new diagrams are closer to simple series/parallel structures

Part.2.20 Expanding about C  The process of expanding can be repeated until the resulting diagrams are of the series/parallel type  Figure (a) needs further expansion about E  Figure (a) should not be viewed as a parallel connection of A and B, connected serially to D and E in parallel. Such a diagram will have the path BCDF which is not a valid path (a) (b)

Part.2.21 Expanding about C and E  R system =R C Prob {System works | C is operational} +(1-R C ) R F [1-(1-R A R D )(1-R B R E )]  Expanding about E yields  Prob {System works | C is operational}= R E R F [1-(1-R A )(1-R B )] +(1-R E )R A R D R F  Substituting results in  R system =R C [R E R F (R A +R B -R A R B )+(1-R E ) R A R D R F ] +(1- R C ) [R F (R A R D +R B R E -R A R D R B R E )]  Example: R A =R B =R C =R D =R E =R F =R R system =R (R -3R +R+2) (a) (b)

Part.2.22 Upper Bound on Reliability  If structure is too complicated - derive upper and lower bounds on R system  An upper bound - R system  1 -  (1-R path _ i )  R path_i - reliability of modules in series along path i  Assuming all paths are in parallel  Example - the paths are ADF, BEF and ACEF  R system  1 -(1-R A R D R F )(1-R B R E R F )(1-R A R C R E R F )  If R A =R B =R C =R D =R E =R F =R then  Upper bound can be used to derive the exact expression: perform multiplication and replace every occurrence of by

Part.2.23 Lower Bound on Reliability  A lower bound is calculated based on minimal cut sets of the system diagram  A minimal cut set: a minimal list of modules such that the removal (due to a fault) of all modules will cause a working system to fail  Minimal cut sets: F, AB, AE, DE and BCD  The lower bound is  R system   (1-Q cut _ i )  Q cut_i - probability that the minimal cut i is faulty (i.e., all its modules are faulty)  Example - R A =R B =R C =R D =R E =R F =R

Part.2.24 Example – Comparison of Bounds  Example - R A =R B =R C =R D =R E =R F =R  Lower bound here is a very good estimate for a high-reliability system

Part.2.25 Exercises:  Book : Fault Tolerant Systems:  Pages: 48,49.  Exercises: 1,2,3,4,5.  Good Luck