A. Bobbio, Bertinoro, March 10-14, 2003. Dependability Theory and Methods, Part 1: Introduction and definitions. Andrea Bobbio, Dipartimento di Informatica.

Dependability Theory and Methods, Part 1: Introduction and definitions. Andrea Bobbio, Dipartimento di Informatica, Università del Piemonte Orientale “A. Avogadro”, Alessandria (Italy). Bertinoro, March 10-14, 2003.

Dependability: Definition
Dependability is the property of a system to be dependable in time, i.e. such that reliance can justifiably be placed on the service it delivers. Dependability extends the interest in the system from the design and construction phase to the operational phase (the whole life cycle).

What dependability theory and practice want to avoid

Dependability: Taxonomy
- Measures: reliability, availability, maintainability, safety, security
- Means: fault prevention, fault tolerance, fault removal, fault forecasting
- Threats: faults, errors, failures

Quantitative analysis
Quantitative analysis aims at numerically evaluating measures that characterize the dependability of an item. These measures support:
- risk assessment and safety;
- design specifications;
- technical assistance and maintenance;
- life cycle cost;
- market competition.

Risk assessment and safety
The risk R associated with an activity is proportional to the probability P that the undesired event occurs and to the magnitude M of its consequences:
$R = P \times M$
A safety-critical system is a system whose incorrect behavior may cause a risk to occur, with undesirable consequences for the item, the operators, the population, or the environment.
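
As a minimal sketch of $R = P \times M$, the Python snippet below compares two hypothetical events; the probabilities and magnitudes are invented for illustration and are not taken from the slides. It shows how a rare event with severe consequences can carry a higher risk than a frequent minor one.

```python
def risk(p: float, m: float) -> float:
    """Risk as the product of probability of occurrence and magnitude of consequences."""
    return p * m

# Hypothetical events: (probability per year, consequence in cost units)
frequent_minor = risk(p=0.1, m=1_000.0)
rare_severe = risk(p=1e-4, m=5_000_000.0)

print(frequent_minor, rare_severe)  # the rare, severe event carries the larger risk
```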

Design specifications
Technological items must be dependable. Sometimes dependability requirements (both qualitative and quantitative) are part of the design specifications, for example:
- mean time between failures;
- total down time.

Technical assistance and maintenance
The planning of all activities related to technical assistance and maintenance is linked to the system dependability (the expected number of failures in time):
- planning spare parts and maintenance crews;
- cost of technical assistance (warranty period);
- preventive vs. reactive maintenance.

Market competition
Consumers' choices are strongly influenced by perceived dependability:
- advertisement messages stress dependability;
- the image of a product or of a brand may depend on its dependability.

Purpose of evaluation
- Understanding a system: observation, operational environment, reasoning.
- Predicting the behavior of a system: this needs a model, i.e. a convenient abstraction, whose accuracy depends on the degree of extrapolation.

Methods of evaluation
- Measurement-based: most believable, most expensive; not always possible or cost-effective during system design.
- Model-based: less believable, less expensive; analytic vs. discrete-event simulation; combinatorial vs. state-space methods.

Measurement-based evaluation
Most believable, most expensive. Data are obtained by observing the behavior of physical objects through:
- field observations;
- measurements on prototypes;
- measurements on components (accelerated tests).

Models
(Figure: models can be solved analytically, yielding closed-form answers or a numerical solution, or by simulation.)
All models are wrong; some models are useful.

Methods of evaluation
(Figure: measurements and models, linked through a data bank.)

The probabilistic approach
The mechanisms that lead a technological object to failure are very complex and depend on many physical, chemical, technical, human, environmental… factors. The time to failure cannot be expressed by a deterministic law, so we are forced to treat the time to failure as a random variable. Quantitative dependability analysis is therefore based on a probabilistic approach.

Reliability
Reliability is a measurable attribute of dependability, defined as follows: the reliability R(t) of an item at time t is the probability that the item performs the required function in the interval (0, t), given the stress and environmental conditions in which it operates.

Basic Definitions: cdf
Let X be the random variable representing the time to failure of an item. The cumulative distribution function (cdf) F(t) of the r.v. X is given by:
$F(t) = \Pr\{X \le t\}$
F(t) represents the probability that the item has already failed at time t (unreliability).

Basic Definitions: cdf (terminology)
Equivalent terminology for F(t):
- CDF (cumulative distribution function)
- probability distribution function
- distribution function

Basic Definitions: cdf (properties)
(Figure: F(t) versus t, rising from 0 towards 1, with the values F(a) and F(b) marked at two instants a < b.)
- $F(0) = 0$
- $\lim_{t \to \infty} F(t) = 1$
- F(t) is non-decreasing.

Basic Definitions: Reliability
Let X be the random variable representing the time to failure of an item. The survivor function (sf) R(t) of the r.v. X is given by:
$R(t) = \Pr\{X > t\} = 1 - F(t)$
R(t) represents the probability that the item is correctly working at time t, i.e. it gives the reliability function.

Basic Definitions: Reliability (terminology)
Equivalent terminology for R(t) = 1 - F(t):
- reliability
- complementary distribution function
- survivor function

Basic Definitions: Reliability (properties)
(Figure: R(t) versus t, decreasing from 1 towards 0, with R(a) marked.)
- $R(0) = 1$
- $\lim_{t \to \infty} R(t) = 0$
- R(t) is non-increasing.

Basic Definitions: density
Let X be the random variable representing the time to failure of an item, and let F(t) be a differentiable cdf. The density function f(t) is defined as:
$f(t) = \frac{dF(t)}{dt}, \qquad f(t)\,dt = \Pr\{t \le X < t + dt\}$

Basic Definitions: Density (cont.)
(Figure: f(t) versus t, with the area between a and b shaded.)
$\int_a^b f(x)\,dx = \Pr\{a < X \le b\} = F(b) - F(a)$
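
The relations between F(t), R(t) and f(t) can be checked numerically. The sketch below assumes, purely for illustration, an exponential time to failure with a hypothetical rate `LAM = 0.5`; `integrate` is a hypothetical helper implementing the trapezoidal rule.

```python
import math

LAM = 0.5  # hypothetical failure rate, for illustration only

def F(t: float) -> float:
    """cdf: probability that the item has already failed at time t."""
    return 1.0 - math.exp(-LAM * t) if t >= 0 else 0.0

def R(t: float) -> float:
    """Reliability (survivor function): R(t) = 1 - F(t)."""
    return 1.0 - F(t)

def f(t: float) -> float:
    """Density: the derivative of F(t)."""
    return LAM * math.exp(-LAM * t) if t >= 0 else 0.0

def integrate(g, a: float, b: float, n: int = 10_000) -> float:
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

a, b = 1.0, 3.0
area = integrate(f, a, b)   # Pr{a < X <= b} computed from the density
diff = F(b) - F(a)          # the same probability computed from the cdf

# f(t) dt ~ Pr{t <= X < t + dt}: a central difference of F recovers f
t, dt = 2.0, 1e-6
slope = (F(t + dt) - F(t - dt)) / (2 * dt)

print(F(0.0), R(0.0))       # boundary values F(0) = 0 and R(0) = 1
print(area, diff, slope, f(t))
```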

Basic Definitions: Density (cont.)
(Figure: a density f(t) plotted versus t.)

Basic Definitions: Density (terminology)
Equivalent terminology for f(t):
- pdf
- probability density function
- density function
- density
f(t) = … (for a non-negative random variable)

Quiz 1
The higher the MTTF (mean time to failure), the higher the item reliability.
1. Correct
2. Wrong
The correct answer is 2: wrong!
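
One way to see why the answer is "wrong" is a counterexample. The sketch below compares two hypothetical components: A with an exponential time to failure, and B with a Weibull time to failure written as $R(t) = e^{-\lambda t^{\alpha}}$, whose MTTF is $\lambda^{-1/\alpha}\,\Gamma(1 + 1/\alpha)$. All parameter values are invented for illustration. A has the larger MTTF, yet B is more reliable over a short mission.

```python
import math

# Component A: exponential, hypothetical failure rate 0.1 per hour
lam_a = 0.1
mttf_a = 1.0 / lam_a                       # MTTF = 1/lambda

def r_a(t: float) -> float:
    return math.exp(-lam_a * t)

# Component B: Weibull R(t) = exp(-lam * t**alpha), hypothetical parameters
alpha_b = 4.0
lam_b = 1.0 / 8.0 ** 4
mttf_b = lam_b ** (-1.0 / alpha_b) * math.gamma(1.0 + 1.0 / alpha_b)

def r_b(t: float) -> float:
    return math.exp(-lam_b * t ** alpha_b)

mission = 5.0  # hours
print(f"MTTF: A = {mttf_a:.2f} h, B = {mttf_b:.2f} h")
print(f"R({mission}): A = {r_a(mission):.4f}, B = {r_b(mission):.4f}")
```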

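Hazard (failure) rate
- $h(t)\,\Delta t$ = conditional probability that the system will fail in $(t, t + \Delta t)$, given that it has survived until time t;
- $f(t)\,\Delta t$ = unconditional probability that the system will fail in $(t, t + \Delta t)$.
Since the conditioning event {X > t} has probability R(t), the two are related by $h(t) = f(t)/R(t)$.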

The Failure Rate of a Distribution
$h(t)\,\Delta t$ is the conditional probability that the unit will fail in the interval $(t, t + \Delta t)$, given that it is functioning at time t; $f(t)\,\Delta t$ is the unconditional probability that the unit will fail in the same interval. The difference between the two statements is like the difference between:
- the probability that someone will die between ages 90 and 91, given that they live to 90;
- the probability that someone will die between ages 90 and 91.
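
The difference between the two probabilities can be made concrete with a hypothetical lifetime model. The sketch below assumes, only for illustration, a Weibull lifetime $R(t) = e^{-\lambda t^{\alpha}}$ with invented parameters, and computes the unconditional and conditional probabilities of failure between ages 90 and 91, together with $h(t)\,\Delta t$, which approximates the conditional probability when $\Delta t$ is small.

```python
import math

# Hypothetical Weibull lifetime model (ages in years); parameters are invented
alpha = 8.0
lam = 80.0 ** (-alpha)

def R(t: float) -> float:
    return math.exp(-lam * t ** alpha)

def f(t: float) -> float:
    return lam * alpha * t ** (alpha - 1) * R(t)

def h(t: float) -> float:
    return f(t) / R(t)   # hazard rate = density / reliability

t, dt = 90.0, 1.0
unconditional = R(t) - R(t + dt)      # Pr{t < X <= t + dt}
conditional = unconditional / R(t)    # Pr{t < X <= t + dt | X > t}

print(f"unconditional: {unconditional:.4f}")
print(f"conditional:   {conditional:.4f}")
print(f"h(t) * dt:     {h(t) * dt:.4f}")
```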

Bathtub curve
(Figure: failure rate h(t) versus t, showing three phases.)
- Decreasing failure rate, DFR (infant mortality, burn-in);
- Constant failure rate, CFR (useful life);
- Increasing failure rate, IFR (wear-out phase).

Infant mortality (DFR)
Also called the infant mortality phase or reliability growth phase. The failure rate decreases with time.
- Caused by undetected hardware/software defects;
- Can cause significant prediction errors if steady-state failure rates are used;
- The Weibull model can be used.

Useful life (CFR)
The failure rate remains constant in time (age-independent).
- The failure rate is much lower than in the early-life period;
- Failures are caused by random effects (such as environmental shocks).

Wear-out phase (IFR)
The failure rate increases with age. This is characteristic of irreversible aging phenomena (deterioration, wear-out, fatigue, corrosion, etc.).
- Applicable to mechanical and other systems (properly qualified electronic parts do not exhibit wear-out failures during their intended service life);
- The Weibull failure model can be used.

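Exponential Distribution
The failure rate is age-independent (constant). With parameter λ:
- Cumulative distribution function: $F(t) = 1 - e^{-\lambda t}$
- Reliability: $R(t) = e^{-\lambda t}$
- Density function: $f(t) = \lambda e^{-\lambda t}$
- Failure rate (CFR): $h(t) = \frac{f(t)}{R(t)} = \lambda$
- Mean time to failure: $MTTF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}$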

The Cumulative Distribution Function of an Exponentially Distributed Random Variable with Parameter λ = 1
(Figure: F(t) versus t.) $F(t) = 1 - e^{-t}$

The Reliability Function of an Exponentially Distributed Random Variable with Parameter λ = 1
(Figure: R(t) versus t.) $R(t) = e^{-t}$

Exponential Density Function (pdf)
(Figure: f(t) versus t.) $MTTF = 1/\lambda$
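
A quick numerical check of the exponential formulas, under a hypothetical rate `lam = 0.25`: the ratio f(t)/R(t) stays equal to λ at every age, and integrating R(t) over [0, ∞), truncated where R(t) is negligible, gives 1/λ. `mttf_numeric` is a hypothetical helper name.

```python
import math

lam = 0.25  # hypothetical constant failure rate (per hour)

def R(t: float) -> float:
    return math.exp(-lam * t)

def f(t: float) -> float:
    return lam * math.exp(-lam * t)

def h(t: float) -> float:
    """Failure rate h(t) = f(t) / R(t)."""
    return f(t) / R(t)

def mttf_numeric(upper: float = 200.0, n: int = 20_000) -> float:
    """Trapezoidal estimate of the integral of R(t) over [0, upper]; R(200) is negligible."""
    step = upper / n
    return step * (0.5 * (R(0.0) + R(upper)) + sum(R(i * step) for i in range(1, n)))

print([h(t) for t in (0.0, 1.0, 10.0, 100.0)])  # equal to lam at every age, up to rounding
print(mttf_numeric(), 1.0 / lam)                 # both close to 4 hours
```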

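Memoryless Property of the Exponential Distribution
Assume X > t, i.e. we have observed that the component has not failed until time t. Let Y = X - t be the remaining (residual) lifetime. Its distribution, given survival to t, is:
$G_t(y) = \Pr\{Y \le y \mid X > t\} = \frac{\Pr\{t < X \le t + y\}}{\Pr\{X > t\}} = \frac{e^{-\lambda t} - e^{-\lambda (t + y)}}{e^{-\lambda t}} = 1 - e^{-\lambda y}$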

Memoryless Property of the Exponential Distribution (cont.)
- Thus $G_t(y)$ is independent of t and is identical to the original exponential distribution of X.
- The distribution of the remaining life does not depend on how long the component has been operating.
- An observed failure is the result of some suddenly appearing failure, not of gradual deterioration.
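
The memoryless property can also be checked by simulation. The sketch below draws hypothetical exponential lifetimes, keeps only the components that survived to an age `t0`, and estimates the distribution of their residual life; under the property, the estimate should match the original cdf $1 - e^{-\lambda y}$. The rate, ages and sample size are assumptions for illustration.

```python
import math
import random

random.seed(42)
lam = 0.5          # hypothetical failure rate
t0, y = 3.0, 1.0   # hypothetical age already survived and residual horizon

lifetimes = [random.expovariate(lam) for _ in range(200_000)]

# Residual lifetimes Y = X - t0 of the components that survived to t0
residual = [x - t0 for x in lifetimes if x > t0]

g_t0 = sum(r <= y for r in residual) / len(residual)   # estimate of G_t0(y)
g_0 = sum(x <= y for x in lifetimes) / len(lifetimes)  # estimate of F(y), i.e. age 0
exact = 1.0 - math.exp(-lam * y)

print(f"G_t0(y) ~ {g_t0:.3f}   F(y) ~ {g_0:.3f}   exact {exact:.3f}")
```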

Quiz 3
If two components (say, A and B) have independent, identically distributed exponential times to failure, which of the following is true by the “memoryless” property?
1. They will always fail at the same time.
2. They have the same probability of failing at time t during operation.
3. When the two components operate simultaneously, the one that has been operational for a shorter time will survive longer.

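Weibull Distribution
- Distribution function: $F(t) = 1 - e^{-\lambda t^{\alpha}}$
- Density function: $f(t) = \lambda \alpha t^{\alpha - 1} e^{-\lambda t^{\alpha}}$
- Reliability: $R(t) = e^{-\lambda t^{\alpha}}$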

A. BobbioBertinoro, March 10-14, Weibull Distribution  : shape parameter; : scale parameter. Failure Rate: Dfr Cfr Ifr

Failure Rate of the Weibull Distribution with Various Values of α
(Figure: h(t) versus t for several values of α.)

Weibull Distribution for Various Values of α
(Figures: cdf and density for several values of α.)

Failure Rate Models
We use a truncated Weibull model: the infant mortality phase is modeled by a DFR Weibull and the steady-state phase by the exponential.
(Figure 2.34: Weibull failure-rate model; failure-rate multiplier versus operating time, 0 to 17,520 hrs.)

Failure Rate Models (cont.)
This model has the form: … where: … = steady-state failure rate; … = Weibull shape parameter; failure-rate multiplier = …

Failure Rate Models (cont.)
There are several ways to incorporate time-dependent failure rates into availability models. The easiest is to approximate the continuous function by a piecewise constant step function.
(Figure: discrete failure-rate model; failure-rate multiplier versus operating time, 0 to 17,520 hrs.)
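
As an illustration of the step-function approach, the sketch below approximates a hypothetical DFR Weibull hazard by constant levels on intervals of 2190 hours (the spacing of the figure's time axis) and compares the resulting reliability with the exact one. It uses the standard relation $R(t) = e^{-\int_0^t h(x)\,dx}$, which is not spelled out in these slides. Taking the interval-average hazard as each level is an assumption of this sketch, made so that the two reliabilities agree exactly at the breakpoints.

```python
import math

# Hypothetical DFR Weibull hazard h(t) = LAM * ALPHA * t**(ALPHA - 1), t in hours
LAM, ALPHA = 0.01, 0.5

def cum_hazard_exact(t: float) -> float:
    """Exact integral of the Weibull hazard over [0, t]: LAM * t**ALPHA."""
    return LAM * t ** ALPHA

BREAKS = [0.0, 2190.0, 4380.0, 6570.0, 8760.0]   # hypothetical breakpoints (hours)
INTERVALS = list(zip(BREAKS, BREAKS[1:]))
# One constant level per interval: the average hazard over that interval
LEVELS = [(cum_hazard_exact(b) - cum_hazard_exact(a)) / (b - a) for a, b in INTERVALS]

def cum_hazard_step(t: float) -> float:
    """Integral of the step hazard over [0, t]; the last level is kept after BREAKS[-1]."""
    total = 0.0
    for (a, b), level in zip(INTERVALS, LEVELS):
        if t <= a:
            break
        total += level * (min(t, b) - a)
    if t > BREAKS[-1]:
        total += LEVELS[-1] * (t - BREAKS[-1])
    return total

def reliability(cum_hazard, t: float) -> float:
    """R(t) = exp(-cumulative hazard up to t)."""
    return math.exp(-cum_hazard(t))

for t in (1000.0, 4380.0, 8760.0):
    exact = reliability(cum_hazard_exact, t)
    step = reliability(cum_hazard_step, t)
    print(f"t = {t:7.0f} h  exact R = {exact:.4f}  step R = {step:.4f}")
```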

Failure Rate Models (cont.)
Here the discrete failure-rate model is defined by:

A lifetime experiment
N i.i.d. components are put in a life-test experiment at t = 0; $X_1, X_2, \ldots, X_N$ are their times to failure.
(Figure: N components starting at t = 0, with failure times X_1, X_2, X_3, X_4, …, X_N.)

A lifetime experiment (cont.)
(Figure: the failure times X_1, X_2, X_3, X_4, …, X_N of the N components.)
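
A life test like this one gives a direct estimate of the reliability: the fraction of the N components still working at time t. The sketch below simulates such an experiment with hypothetical exponential failure times (the true rate is used only to generate data) and compares the empirical fraction with $e^{-\lambda t}$. The estimator N(t)/N is the standard empirical survivor function, assumed here rather than taken from the slide.

```python
import math
import random

random.seed(1)
N = 10_000   # hypothetical number of components on test
lam = 0.2    # hypothetical true failure rate, used only to generate the data

# X_1, ..., X_N: times to failure of N i.i.d. components, all started at t = 0
failure_times = [random.expovariate(lam) for _ in range(N)]

def empirical_reliability(t: float) -> float:
    """N(t)/N: fraction of the N components still working at time t."""
    return sum(x > t for x in failure_times) / N

for t in (1.0, 5.0, 10.0):
    print(f"t = {t:4.1f}  N(t)/N = {empirical_reliability(t):.4f}  exp(-lam t) = {math.exp(-lam * t):.4f}")
```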

Repairable systems: Availability

Repairable systems
- $X_1, X_2, \ldots, X_n$: successive UP times
- $Y_1, Y_2, \ldots, Y_n$: successive DOWN times
(Figure: timeline alternating UP periods X_1, X_2, X_3 with DOWN periods Y_1, Y_2.)

Repairable systems (cont.)
The usual hypotheses in modeling repairable systems are:
- the successive UP times $X_1, X_2, \ldots, X_n$ are i.i.d. random variables, i.e. samples from a common cdf F(t);
- the successive DOWN times $Y_1, Y_2, \ldots, Y_n$ are i.i.d. random variables, i.e. samples from a common cdf G(t).

Repairable systems (cont.)
The dynamic behavior of a repairable system is characterized by:
- the r.v. X of the successive up times;
- the r.v. Y of the successive down times.
(Figure: timeline alternating UP periods X_1, X_2, X_3 with DOWN periods Y_1, Y_2.)

Maintainability
Let Y be the r.v. of the successive down times:
- $G(t) = \Pr\{Y \le t\}$ (maintainability)
- $g(t) = \frac{dG(t)}{dt}$ (density)
- $h_g(t) = \frac{g(t)}{1 - G(t)}$ (repair rate)
- $MTTR = \int_0^\infty t\, g(t)\, dt$ (mean time to repair)
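
To illustrate these definitions, the sketch below assumes, hypothetically, that a repair consists of two consecutive exponential phases with the same rate μ (an Erlang-2 repair time). It evaluates the repair rate $h_g(t)$, which is not constant in this case, and computes the MTTR by numerical integration of $t\,g(t)$; the exact value under this assumption is 2/μ.

```python
import math

MU = 0.5  # hypothetical rate (per hour) of each of the two repair phases

def G(t: float) -> float:
    """Maintainability: Pr{Y <= t} for an Erlang-2 repair time."""
    return 1.0 - math.exp(-MU * t) * (1.0 + MU * t)

def g(t: float) -> float:
    """Repair-time density dG/dt."""
    return MU * MU * t * math.exp(-MU * t)

def repair_rate(t: float) -> float:
    """h_g(t) = g(t) / (1 - G(t))."""
    return g(t) / (1.0 - G(t))

def mttr_numeric(upper: float = 100.0, n: int = 20_000) -> float:
    """Trapezoidal estimate of the integral of t * g(t) over [0, upper]."""
    step = upper / n
    vals = [i * step * g(i * step) for i in range(n + 1)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

print(mttr_numeric(), 2.0 / MU)                               # both close to 4 hours
print([round(repair_rate(t), 3) for t in (0.5, 2.0, 10.0)])   # increasing with t
```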

Availability
The availability A(t) of an item at time t is the probability that the item is correctly working at time t.

Availability (cont.)
The measure that characterizes a repairable system is the availability (or the unavailability):
$A(t) = \Pr\{\text{system UP at time } t\}, \qquad U(t) = \Pr\{\text{system DOWN at time } t\}, \qquad A(t) + U(t) = 1$

Definition of Availability
An important difference between reliability and availability is:
- reliability refers to failure-free operation during an interval (0, t);
- availability refers to failure-free operation at a given instant of time t (the time when a device or system is accessed to provide a required function), independently of the number of failure/repair cycles.

Definition of Availability (cont.)
(Figure: system failure and restoration process. The indicator function I(t) alternates between 1, operating and providing a required function, and 0, failed and being restored.)

Availability evaluation
In the special case when times to failure and times to restoration are both exponentially distributed, the alternating process can be viewed as a two-state homogeneous continuous-time Markov chain (CTMC) with time-independent failure rate λ and time-independent repair rate μ.

Two-State Markov Availability Model
(Figure: state UP (1) moves to state DN (0) at rate λ; state DN returns to UP at rate μ.)
Transient availability analysis: for each state, we apply a flow-balance equation:
rate of buildup = rate of flow IN - rate of flow OUT.

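Two-State Markov Availability Model (cont.)
Let $P_1(t)$ and $P_0(t)$ be the probabilities that the system is in state UP and DN at time t, with $P_1(0) = 1$. The flow-balance equations are:
$\frac{dP_1(t)}{dt} = -\lambda P_1(t) + \mu P_0(t), \qquad \frac{dP_0(t)}{dt} = \lambda P_1(t) - \mu P_0(t)$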

Two-State Markov Availability Model (cont.)
(Figure: A(t) versus t, decreasing from 1 towards the steady-state value A_ss.)
$A_{ss} = \frac{\mu}{\lambda + \mu}$

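Two-State Markov Model (cont.)
1) Pointwise availability: $A(t) = P_1(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$
2) Steady-state availability, the limiting value as $t \to \infty$: $A_{ss} = \frac{\mu}{\lambda + \mu}$
3) If there is no restoration (μ = 0), the availability becomes the reliability: $A(t) = R(t) = e^{-\lambda t}$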
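
The closed-form A(t) can be cross-checked against a direct numerical integration of the flow-balance equation for state UP, using $P_0(t) = 1 - P_1(t)$. The sketch below uses the explicit Euler method with hypothetical rates λ = 0.01/h and μ = 0.5/h; with μ = 0 the same closed form reduces to $e^{-\lambda t}$.

```python
import math

LAM, MU = 0.01, 0.5   # hypothetical failure and repair rates (per hour)

def availability(t: float, lam: float, mu: float) -> float:
    """Closed-form pointwise availability of the two-state CTMC, starting UP."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def availability_euler(t_end: float, lam: float, mu: float, dt: float = 1e-3) -> float:
    """Explicit Euler for dP1/dt = -lam*P1 + mu*P0 with P0 = 1 - P1 and P1(0) = 1."""
    p1 = 1.0
    for _ in range(round(t_end / dt)):
        p1 += dt * (-lam * p1 + mu * (1.0 - p1))
    return p1

a_ss = MU / (LAM + MU)
for t in (1.0, 5.0, 20.0):
    print(t, round(availability(t, LAM, MU), 6), round(availability_euler(t, LAM, MU), 6))
print("A_ss =", round(a_ss, 6))
```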

Steady-state Availability
In many system models, the limit $A_{ss} = \lim_{t \to \infty} A(t)$ exists and is called the steady-state availability. It represents the probability of finding the system operational after many fail-and-restore cycles.

Steady-state Availability (cont.)
(Figure: system state alternating between UP (1) and DOWN (0) over time.)
- Expected UP time: E[U(t)] = MUT = MTTF
- Expected DOWN time: E[D(t)] = MDT = MTTR

Availability: Example (I)
Let a system have a steady-state availability $A_{ss} = 0.95$. This means that, given a mission time T, the system is expected to work correctly for a total time of 0.95 T or, equivalently, to be out of service for a total time of $U_{ss} T = (1 - A_{ss}) T = 0.05\,T$.

Availability: Example (II)
Let a system have a rated productivity of W $/year. The loss due to the system being out of service can be estimated as $U_{ss} W = (1 - A_{ss}) W$. The availability (unavailability) is thus an index to estimate the real productivity, given the rated productivity. Conversely, if the goal is a net productivity of W $/year, the plant must be designed so that its rated productivity W' satisfies $A_{ss} W' = W$, i.e. $W' = W / A_{ss}$.
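
The arithmetic of both examples, with a hypothetical mission time of one year (8760 h) and a hypothetical productivity of 1,000,000 $/year. Since only the fraction $A_{ss}$ of the rated productivity is actually delivered, the rated productivity W' needed for a net W follows from $A_{ss} W' = W$.

```python
A_SS = 0.95                  # steady-state availability from Example (I)
U_SS = 1.0 - A_SS            # steady-state unavailability

T = 8760.0                   # hypothetical mission time: one year, in hours
expected_up = A_SS * T       # expected total UP time
expected_down = U_SS * T     # expected total DOWN time

W = 1_000_000.0              # hypothetical productivity, $/year
loss = U_SS * W              # expected loss due to down time at rated productivity W
W_rated = W / A_SS           # rated productivity W' that delivers a net W

print(f"UP {expected_up:.0f} h, DOWN {expected_down:.0f} h")
print(f"loss {loss:,.0f} $/year, rated productivity needed {W_rated:,.0f} $/year")
```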

Availability
We can show that:
$A_{ss} = \frac{MTTF}{MTTF + MTTR}$
This result is valid without making any assumptions on the form of the distributions of times to failure and times to repair. Also: …
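
The distribution-free character of this result can be checked by simulation. The sketch below assumes, hypothetically, Weibull UP times and uniform DOWN times (neither exponential), simulates many fail-and-restore cycles, and compares the long-run fraction of time spent UP with MTTF/(MTTF + MTTR). It relies on Python's `random.weibullvariate(alpha, beta)`, where `alpha` is the scale and `beta` the shape.

```python
import math
import random

random.seed(7)
SHAPE, SCALE = 2.0, 100.0        # hypothetical Weibull UP times (hours)
DOWN_LO, DOWN_HI = 2.0, 8.0      # hypothetical uniform DOWN times (hours)

mttf = SCALE * math.gamma(1.0 + 1.0 / SHAPE)   # mean of the Weibull UP time
mttr = (DOWN_LO + DOWN_HI) / 2.0               # mean of the uniform DOWN time

up_total = down_total = 0.0
for _ in range(100_000):                       # fail-and-restore cycles
    up_total += random.weibullvariate(SCALE, SHAPE)
    down_total += random.uniform(DOWN_LO, DOWN_HI)

a_sim = up_total / (up_total + down_total)     # long-run fraction of time UP
a_formula = mttf / (mttf + mttr)
print(round(a_sim, 4), round(a_formula, 4))
```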

Motivation – High Availability

Maintainability
MDT (Mean Down Time, or MTTR, mean time to restoration). The total down time Y consists of:
- failure detection time;
- alarm notification time;
- dispatch and travel time of the repair person(s);
- repair or replacement time;
- reboot time.

Maintainability (cont.)
The total down time Y consists of:
- logistic time:
  - administrative times;
  - dispatch and travel time of the repair person(s);
  - waiting time for spares, tools …;
- effective restoration time:
  - access and diagnosis time;
  - repair or replacement time;
  - test and reboot time.
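
Because expectation is linear, the MDT is the sum of the mean durations of the phases listed above, whether or not the phases are independent. The sketch below uses invented phase means (in hours) to show how the logistic time can dominate the down time even when the effective repair is short.

```python
# Hypothetical mean durations (hours) of the phases of one down time Y
logistic = {
    "administrative": 0.5,
    "dispatch and travel": 2.0,
    "waiting for spares and tools": 4.0,
}
restoration = {
    "access and diagnosis": 1.0,
    "repair or replacement": 1.5,
    "test and reboot": 0.5,
}

mdt = sum(logistic.values()) + sum(restoration.values())   # mean down time
logistic_share = sum(logistic.values()) / mdt
print(f"MDT = {mdt} h, logistic share = {logistic_share:.0%}")
```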

Maintenance Costs
The total cost of a maintenance action consists of:
- cost of spares and replaced parts;
- cost of person-hours for repair;
- down-time cost (loss of productivity).
The down-time cost (due to a loss of productivity) can be the most relevant cost factor.

Maintenance Policy
The maintenance policy is the sequence of actions that minimizes the total cost related to a down time:
- reactive maintenance: the maintenance action is triggered by a failure;
- proactive maintenance: a preventive maintenance policy.