1 Introduction to Engineering Spring 2007 Lecture 16: Reliability & Probability.


2 Review
Probability Part 2
Bayes' Rule
Using MATLAB

3 Review - Probability
DEFINITION: The probability of an event is the ratio of the number of cases in which the event occurs to the total number of possible cases:
$$P(\text{outcome}) = \frac{\text{number of desired outcomes}}{\text{number of possible outcomes}}$$
EXAMPLE: The probability of drawing a diamond out of a deck of cards is:
$$P(\text{diamond}) = \frac{13}{52}$$
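The card example can be checked with a few lines of Python. This is only a sketch: the helper name `probability` is hypothetical, and it assumes all outcomes are equally likely, as in the definition above.

```python
from fractions import Fraction


def probability(desired: int, possible: int) -> Fraction:
    """Classical probability: desired outcomes over equally likely possible outcomes."""
    if possible <= 0 or not 0 <= desired <= possible:
        raise ValueError("need possible > 0 and 0 <= desired <= possible")
    return Fraction(desired, possible)


# 13 diamonds in a standard 52-card deck.
p_diamond = probability(13, 52)
print(p_diamond, float(p_diamond))  # 1/4 0.25
```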

4 Outline
Introduction to Fault Tolerance
Reliability
Reliability calculations

5 Introduction to Fault Tolerance

6 A Dose of Reality
Everything breaks down:
a switch stuck at open
a wrong value in a program
a deviation from performance

7 Failure Chain
Problems at every stage can result in system failures.
Causes: specification mistakes, implementation mistakes, external disturbances, component defects.
These lead to software faults and hardware faults, which produce errors, which in turn produce system failures.
3 types of control: Fault Avoidance, Fault Masking, Fault Tolerance

8 Primary Design Techniques
Fault Avoidance: prevents faults in the first place. E.g. design review.
Fault Masking: localizes the fault and prevents the error from getting into the system's informational structure. E.g. error correcting codes.
Fault Tolerance: allows the system to perform its tasks in the presence of faults.

9 Ethical & Moral Responsibility
Computers are used where system failure would be catastrophic in terms of money, human lives, or the ecosystem.
As engineers, we have a responsibility to ensure that the systems we design provide the highest level of protection required by the application.

10 Failures

11 Downtime Costs

12 Downtime Survey
In 1992, Stratus commissioned major research on “Impact of Online Computer Systems Downtime on American Business”.
The study interviewed 450 senior information executives from American corporations in telecommunications, financial services, retail manufacturing, insurance, travel and transportation.
RESULT: downtime equates to lost revenue and customer dissatisfaction.
Executives reported a loss of $80,000 to $300,000 per hour of downtime.
The average company reported 9 downtime incidents per year, each averaging 4 hours.
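To get a feel for the scale of these figures, the survey numbers can be combined into a rough yearly loss. This is a back-of-the-envelope sketch, not part of the survey itself: the function name is hypothetical, and it assumes the hourly loss applies uniformly to every hour of every outage.

```python
def annual_downtime_cost(outages_per_year: int, hours_per_outage: float,
                         loss_per_hour: float) -> float:
    """Yearly loss, assuming every outage hour costs the same amount."""
    return outages_per_year * hours_per_outage * loss_per_hour


# Survey figures: 9 outages per year, 4 hours each, $80,000 to $300,000 per hour.
low = annual_downtime_cost(9, 4, 80_000)
high = annual_downtime_cost(9, 4, 300_000)
print(f"${low:,.0f} to ${high:,.0f} per year")  # $2,880,000 to $10,800,000 per year
```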

13 Competing Concerns
There is constant pressure to reduce costs and production time.
Fault tolerance (FT) adds cost in hardware, design, and verification, and lengthens the development cycle.
A compressed schedule can result in a greater number of errors, and those errors escape into the field.

14 Reliability

15 Reliability
The reliability of a system, R(t), is a function of time: it defines the probability that the system will perform correctly from time 0 to time t.
When reliability is specified as a design parameter, it is usually a high value. A reliability of 0.9999 is not uncommon; it is often noted by the number of 9's (four 9's reliability).
The design parameter may be something other than reliability:
mean time to failure (MTTF)
mean time between failures (MTBF)
mean time to repair (MTTR)

16 Failure Rate
The failure rate, $\lambda$, is the expected number of failures of a device or system per given time period. If a system fails on average once every 2000 hours, then there are 1/2000 failures/hour, or $\lambda = 0.0005$ failures/hour.
The failure rate for a device will change over time, and experience has shown that it follows a "bathtub" curve with three phases over time: the infant mortality phase, the useful life period, and the wear-out phase.

17 Exponential Failure Law
During the useful life phase, when the failure rate is a constant $\lambda$, the relationship between the reliability and the failure rate is an exponential:
$$R(t) = e^{-\lambda t}$$
DESIGN ISSUE: The design specifications will be in terms of a certain level of reliability over a given time period. To determine the reliability, however, we first need to know the failure rate of the components.
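A minimal sketch of the exponential failure law in Python. The function name `reliability` is hypothetical, and the failure rate is assumed constant (useful life phase only). The example reuses the 1-failure-per-2000-hours rate from the failure rate slide.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


lam = 1 / 2000  # failures per hour
for hours in (0, 500, 1000, 2000):
    print(hours, f"{reliability(lam, hours):.3f}")
```

Reliability starts at 1 at t = 0 and decays toward 0; at t = 2000 hours (one expected failure) it has already dropped to about 0.368.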

18 Design Issue
Reliability is often expressed as a design parameter.
PROBLEM: Given an estimate of the failure rate of a design, how do we calculate the reliability of the system?
This is a common problem: going from what we know about a design to a measure of a requirement.
There are several measures of reliability which are related to the failure rate:
Mean Time to Failure
Mean Time between Failures
Mean Time to Repair

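19 Mean Time to Failure
The expected time that a system will operate before the first failure occurs.
The MTTF is the expected (read: average) time to failure, which equals the integral of the reliability function over all time: $\text{MTTF} = \int_0^\infty R(t)\,dt$. Under the exponential failure law this gives $\text{MTTF} = 1/\lambda$.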
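The link between reliability and MTTF can be worked out in one line. This derivation assumes the exponential failure law $R(t) = e^{-\lambda t}$ with a constant failure rate $\lambda$; it is a standard result rather than something reproduced from the slide.

```latex
\begin{align*}
\text{MTTF} &= \int_0^\infty R(t)\,dt
             = \int_0^\infty e^{-\lambda t}\,dt
             = \left[-\frac{1}{\lambda}\,e^{-\lambda t}\right]_0^\infty
             = \frac{1}{\lambda}
\end{align*}
```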

20 Mean Time to Repair
The mean time to repair (MTTR) is the average time required to repair a system.
It is very difficult to estimate and is often determined experimentally by injecting faults into a system and measuring the time required to repair them.
It is normally specified in terms of a repair rate, $\mu$: $\text{MTTR} = 1/\mu$

21 Mean Time Between Failures
The average time between failures in a system includes the mean time to fail and the mean time to repair:
$$\text{MTBF} = \text{MTTF} + \text{MTTR}$$
The relationship between MTBF and MTTF (timeline figure): each MTBF interval spans an MTTF period followed by an MTTR period.

22 Other Performance Measures
There are several other performance measures related to reliability: maintainability and availability.
Availability is the probability that the system will be “up” during its scheduled working period.
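The measures above can be tied together in a short sketch. Note the assumptions: MTBF is taken as MTTF + MTTR, and availability is estimated with the common steady-state ratio MTTF / (MTTF + MTTR), which the slides do not state explicitly. The function names and the numeric inputs are hypothetical, chosen only for illustration.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up (assumed formula, see lead-in)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Illustrative values only: 1250 hours to fail, 4 hours to repair.
mttf, mttr = 1250.0, 4.0
print(mtbf(mttf, mttr))                                 # 1254.0
print(round(steady_state_availability(mttf, mttr), 4))  # 0.9968
```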

23 Safety
S(t) is the probability that the system does not fail in the interval [0,t] in such a manner as to cause unacceptable damage or other catastrophic effects.
Safety is a measure of the fail-safe capability of the system.
A system can be unreliable, yet safe.
The design is biased towards safe failure.

24 Reliability Calculations

25 Example
If we design a system made up of 4000 components, each with a failure rate of $2 \times 10^{-7}$ per hour, what is the MTTF of the whole system?
$\lambda = (2 \times 10^{-7})(4000) = 8 \times 10^{-4}$ failures/hour
$\text{MTTF} = 1/\lambda = 1250$ hours
What is the reliability of the system when t = MTTF?
$R(t) = e^{-\lambda t} = \exp(-t/\text{MTTF})$
$R(\text{MTTF}) = e^{-1} \approx 0.368$
RESULT: a system with an MTTF of 1250 hours has only a 36.8% chance of running 1250 hours without a failure.
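The example can be reproduced numerically. In this sketch the per-component failure rate of 2e-7 per hour is inferred from the stated result (an MTTF of 1250 hours for 4000 components), and the variable names are hypothetical. It also assumes that every component must work, so the component failure rates add.

```python
import math

COMPONENT_FAILURE_RATE = 2e-7  # failures per hour (inferred from the 1250-hour result)
N_COMPONENTS = 4000

# All components are needed, so the system failure rate is the sum of the parts.
system_failure_rate = COMPONENT_FAILURE_RATE * N_COMPONENTS
mttf_hours = 1.0 / system_failure_rate
reliability_at_mttf = math.exp(-system_failure_rate * mttf_hours)

print(f"{system_failure_rate:.1e}")  # 8.0e-04
print(f"{mttf_hours:.0f}")           # 1250
print(f"{reliability_at_mttf:.3f}")  # 0.368
```

The last value does not depend on the numbers chosen: under the exponential law, any system has about a 36.8% chance of surviving for a time equal to its own MTTF.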

26 Reliable Architectures
How do we make trade-offs in the system design to increase reliability?
First, produce reliable systems by selecting reliable components and testing, testing, testing.
Second, trade off cost vs reliability and speed vs reliability by adding extra components to the system.
Adding extra components implies that designers need to understand the impact of extra circuits on system reliability:
Series/Parallel systems
Specific fault tolerant architectures

27 Series System
Systems in which each subsystem must function if the system as a whole is to function.
What is the reliability of two systems connected in series? If subsystem failures are independent and $R_i$ is the reliability of subsystem i, then
$$R = \prod_{i=1}^{N} R_i$$
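A small sketch of the series formula, assuming independent subsystem failures. The function name and the example reliabilities are hypothetical.

```python
import math
from typing import Sequence


def series_reliability(subsystem_reliabilities: Sequence[float]) -> float:
    """Product of subsystem reliabilities: every subsystem must work."""
    if any(not 0.0 <= r <= 1.0 for r in subsystem_reliabilities):
        raise ValueError("each reliability must lie in [0, 1]")
    return math.prod(subsystem_reliabilities)


# Two subsystems in series: the result is lower than either one alone.
print(f"{series_reliability([0.95, 0.90]):.3f}")  # 0.855
```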

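28 Series Analysis
Given a series system, what is its MTTF? From the results on the prior slide, with each subsystem following the exponential failure law with failure rate $\lambda_i$:
$$R(t) = \prod_{i=1}^{N} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{N} \lambda_i\right) t}$$
So, the MTTF is:
$$\text{MTTF} = \frac{1}{\sum_{i=1}^{N} \lambda_i}$$
Thus the MTTF of the series system is much smaller than the MTTF of its components.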

29 Parallel Systems
Systems in which the correct operation of just one subsystem is sufficient for the system to function.
If the failures are independent and $R_i$ is the reliability of subsystem i, then
$$R = 1 - \prod_{i=1}^{N} (1 - R_i)$$
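The parallel case can be sketched the same way: the system fails only if every subsystem fails. This assumes independent failures; the function name and example values are hypothetical.

```python
import math
from typing import Sequence


def parallel_reliability(subsystem_reliabilities: Sequence[float]) -> float:
    """One minus the probability that all subsystems fail."""
    if any(not 0.0 <= r <= 1.0 for r in subsystem_reliabilities):
        raise ValueError("each reliability must lie in [0, 1]")
    return 1.0 - math.prod(1.0 - r for r in subsystem_reliabilities)


# Two 0.9 subsystems in parallel beat either one alone.
print(f"{parallel_reliability([0.9, 0.9]):.4f}")  # 0.9900
```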

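30 Parallel Analysis
The MTTF for a parallel system of N identical subsystems, each with constant failure rate $\lambda$, is given by:
$$\text{MTTF} = \frac{1}{\lambda}\sum_{i=1}^{N}\frac{1}{i}$$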
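For the special case of N identical subsystems with a constant failure rate, the standard closed form is MTTF = (1/λ)(1 + 1/2 + ... + 1/N). The sketch below compares that closed form against a direct numerical integration of the parallel reliability function. The restriction to identical subsystems, the function names, and the integration settings are all assumptions made for illustration.

```python
import math


def parallel_mttf_closed_form(failure_rate: float, n: int) -> float:
    """(1/lambda) * (1 + 1/2 + ... + 1/n) for n identical parallel subsystems."""
    return sum(1.0 / i for i in range(1, n + 1)) / failure_rate


def parallel_mttf_numeric(failure_rate: float, n: int,
                          steps: int = 20000, horizon: float = 50.0) -> float:
    """Trapezoidal integral of R(t) = 1 - (1 - exp(-lambda t))^n on [0, horizon/lambda]."""
    t_end = horizon / failure_rate
    h = t_end / steps

    def r(t: float) -> float:
        return 1.0 - (1.0 - math.exp(-failure_rate * t)) ** n

    total = 0.5 * (r(0.0) + r(t_end))
    for k in range(1, steps):
        total += r(k * h)
    return total * h


lam = 8e-4  # failures per hour
for n in (1, 2, 3):
    print(n, round(parallel_mttf_closed_form(lam, n), 1),
          round(parallel_mttf_numeric(lam, n), 1))
```

Each added parallel unit helps less than the one before it: the second adds half an MTTF and the third adds only a third.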

31 Specific Architectures
You are given a design specification which includes a required level of reliability of 0.9999, yet the best you can do for a given circuit is reach a documented reliability of 0.999. What do you do?
The trade-off is to increase the cost of the system by embedding your design in a fault tolerant architecture (and perhaps reduce speed as well).
Possible architectures:
Triple Modular Redundancy
Dynamic Redundancy
Hybrid Redundancy
Sift-Out Modular Redundancy
Self-Purging Redundancy
others...

32 Triple Modular Redundancy
An example of static redundancy (masking redundancy): using extra components so that the effect of a faulty component is instantaneously masked.
TMR uses three identical components and a voting element (majority component).
Originally suggested by John von Neumann in 1956.
(Figure: three modules M feeding a voter V.)

33 TMR Reliability
ASSUME: The voting circuit does not fail. Is this a good assumption?
If the reliability of the individual modules is $R_M$, then the reliability of the TMR scheme, $R_{TMR}$, is the probability that all three modules are functioning plus the probability that exactly two modules are functioning:
$$R_{TMR} = R_M^3 + 3R_M^2(1 - R_M) = 3R_M^2 - 2R_M^3$$
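The TMR expression can be sanity-checked by enumerating all eight working/failed combinations of the three modules. This sketch assumes a perfect voter and independent, identical modules; both function names are hypothetical.

```python
from itertools import product


def tmr_reliability(r_module: float) -> float:
    """Closed form: 3R^2 - 2R^3 (perfect voter assumed)."""
    return 3 * r_module**2 - 2 * r_module**3


def tmr_reliability_by_enumeration(r_module: float) -> float:
    """Sum the probability of every module state in which at least 2 of 3 work."""
    total = 0.0
    for states in product([True, False], repeat=3):
        if sum(states) >= 2:  # the majority voter still sees a correct result
            p = 1.0
            for working in states:
                p *= r_module if working else 1.0 - r_module
            total += p
    return total


r = 0.82
print(f"{tmr_reliability(r):.3f}", f"{tmr_reliability_by_enumeration(r):.3f}")
```

TMR only helps when the module reliability is above 0.5; at exactly 0.5 the closed form returns 0.5, and below that the redundant system is worse than a single module.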

34 Reliability Improvement
A more useful parameter for evaluating reliable systems is the reliability improvement factor, RIF.
It is the ratio of the probability of failure of the non-redundant system to that of the redundant system for a fixed mission time T.
Given $R_N$ and $R_R$ as the reliabilities of the non-redundant and the redundant systems at time T:
$$\text{RIF} = \frac{1 - R_N}{1 - R_R}$$

35 Simple Calculation
Given a system with a reliability of $R_N = 0.82$ at T = 100 hours, what is the RIF of a TMR system?
First, find the TMR reliability: $(0.82)^3 + 3(0.82)^2(1 - 0.82) = 0.914$
Second, find the RIF: $(1 - 0.82)/(1 - 0.914) = 2.1$
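The calculation above can be reproduced in a few lines. This is a sketch with hypothetical function names, and it assumes a perfect voter as in the TMR reliability slide.

```python
def tmr_reliability(r_module: float) -> float:
    """All three modules work, or exactly two of the three work."""
    return r_module**3 + 3 * r_module**2 * (1 - r_module)


def reliability_improvement_factor(r_non_redundant: float, r_redundant: float) -> float:
    """Ratio of failure probabilities: (1 - R_N) / (1 - R_R)."""
    if r_redundant >= 1.0:
        raise ValueError("redundant system reliability must be below 1")
    return (1 - r_non_redundant) / (1 - r_redundant)


r_n = 0.82  # non-redundant reliability at T = 100 hours
r_tmr = tmr_reliability(r_n)
rif = reliability_improvement_factor(r_n, r_tmr)
print(f"{r_tmr:.3f}")  # 0.914
print(f"{rif:.1f}")    # 2.1
```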

36 NMR
It is possible to use more than three copies of a system in a redundant architecture.
M-of-N structure: N identical modules, where M are required for the system to function properly. This system may tolerate N-M failures.
The reliability of such a system is:
$$R_{M\text{-of-}N} = \sum_{i=0}^{N-M} \frac{N!}{(N-i)!\,i!}\, R^{N-i}(1-R)^i$$
A 5MR system requires that 3 of the 5 modules remain fault free:
$$R_{3\text{-of-}5} = R^5 + 5R^4(1-R) + 10R^3(1-R)^2$$
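A short sketch of the M-of-N sum, where i counts the number of failed modules. It assumes identical, independent modules and ignores the voter; the function name is hypothetical.

```python
from math import comb


def m_of_n_reliability(r_module: float, m: int, n: int) -> float:
    """Probability that at least m of n identical modules are working."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    return sum(
        comb(n, i) * r_module ** (n - i) * (1 - r_module) ** i
        for i in range(n - m + 1)  # i failed modules, tolerating up to n - m
    )


r = 0.82
print(f"{m_of_n_reliability(r, 2, 3):.3f}")  # TMR is the 2-of-3 case
print(f"{m_of_n_reliability(r, 3, 5):.3f}")  # 5MR
```

Setting m = n gives the series case and m = 1 gives the parallel case, so this one function covers both earlier formulas for identical modules.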

37 Possible Quiz
Remember that even though each quiz is worth only 5 to 10 points, the points do add up to a significant contribution to your overall grade.
If there is a quiz, it might cover these issues:
Name one of the three types of fault control.
What is MTBF?
What is TMR?