1 Product Reliability Chris Nabavi BSc SMIEEE © 2006 PCE Systems Ltd

2 Reliability  Reliability is the probability that a piece of equipment will operate for some determined period of time under the working conditions for which it was designed

3 The “Bathtub” Curve  [Plot: failures per hour against time, showing three regions: infant mortality, operational life phase, and end of life]

4 Operational Strategy  1. Run the equipment without traffic until the infant mortality period has passed (the burn-in period)  2. Use the equipment during the operational life period  3. Retire or replace the equipment before the end-of-life period

5 Failure Rate  This is a statistical measure, applicable to a large number of samples  The failure rate, λ, is the number of failures per unit time, divided by the number of items in the test  λ is constant during the operational phase  λ is often expressed in % per 1000 hours or FITs (failures in 10^9 hours)

6 Mean Time Between Failures (MTBF)  This is a statistical measure, applicable to a large number of samples  The MTBF, θ, is the average time between failures multiplied by the number of items in the test  MTBF = 1 / failure rate (θ = 1 / λ) and failure rate = 1 / MTBF (λ = 1 / θ)

7 Measured MTBF and Failure Rate  A manufacturer tests 3000 light bulbs for 300 hours and observes 5 failures  Note: we don’t know the average time between failures from this test, because they have not all failed! But approximately:  MTBF = 3000 x 300 / 5 = 180,000 hours  Failure rate ≈ 0.56 % per 1000 hours  This measured MTBF is an underestimate of the true MTBF
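As a rough sketch of the arithmetic on this slide (the helper function name and print formatting are mine, not from the presentation):

```python
# Sketch of the light-bulb example: MTBF estimated as total device-hours
# divided by observed failures, assuming a constant failure rate.

def mtbf_from_test(items: int, hours: float, failures: int) -> float:
    """Estimated MTBF = (items under test x test hours) / failures observed."""
    return items * hours / failures

mtbf = mtbf_from_test(items=3000, hours=300, failures=5)  # 180,000 hours
failure_rate = 1.0 / mtbf                                 # failures per hour

print(f"MTBF ~ {mtbf:,.0f} hours")
print(f"Failure rate ~ {failure_rate * 1000 * 100:.2f} % per 1000 hours")  # ~0.56
print(f"Failure rate ~ {failure_rate * 1e9:.0f} FITs")                     # ~5556
```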

8 MTBF and End of Life  MTBF is a measure of quality and has nothing to do with the expected lifetime  To visualise this, think of a candle. After three hours, the wax will all be used up and it will have reached its end of life. This is its expected lifetime.  However, a quality candle (higher MTBF) will be less likely to fizzle out half way down. If we light a new candle, just as each old one runs out of wax, the mean time between being unexpectedly plunged into darkness is the MTBF

9 Failure Rate (Graphical Representation)  [Plot: failures per hour against time; the failure rate is the size of the gap marked on the curve]

10 Example: Typical Hard Disc  Rated or expected life = 5 years  Guaranteed life = 3 years  MTBF = 1,000,000 hours (approx. 114 years)  Modern hard discs are fairly reliable, but being mechanical, they wear out after a few years

11 Disc Replacement Strategy  Observation: The expected life is much less than the MTBF and discs are the “weak link” in the system  Conclusion: Replace the discs just before they wear out under a preventative maintenance program

12 Example MTBF Figures

13 Operational Life Phase  Reliability theory only works in the operational life phase, where the failure rates are constant  With this proviso, the maths is well established and closely related to statistics  There is a large amount of statistical theory concerned with sampling procedures, aimed at estimating the MTBF of components  From now on, we are only concerned with the operational phase

14 Probability of Survival  [Plot: probability of survival against time, falling from 1 towards 0]  p = e^(-λt)  At t = MTBF (θ), the probability of survival has fallen to 0.37 (1/e)
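A minimal sketch of the survival curve, assuming the constant failure rate of the operational phase (the MTBF figure used here is illustrative):

```python
import math

def survival(t_hours: float, mtbf_hours: float) -> float:
    """Probability of surviving to time t with constant failure rate 1/MTBF."""
    return math.exp(-t_hours / mtbf_hours)

# At t = MTBF the survival probability is 1/e, the 0.37 marked on the curve.
print(survival(t_hours=180_000, mtbf_hours=180_000))  # ~0.368
```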

15 Non-Redundant System Reliability  For a system S, made up of components A, B, C, etc.  1 / MTBF_S = 1 / MTBF_A + 1 / MTBF_B + 1 / MTBF_C + etc.  or λ_S = λ_A + λ_B + λ_C + etc.  These formulae are used to calculate the MTBF or failure rate of equipment, using published tables covering everything from a soldered joint to a disc sub-system
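A small sketch of the series formula above; the component MTBF values are made-up illustrative numbers, not figures from published tables:

```python
# Series (non-redundant) system: failure rates add, so the system MTBF
# is the reciprocal of the summed component failure rates.

def series_mtbf(component_mtbfs):
    return 1.0 / sum(1.0 / m for m in component_mtbfs)

mtbf_s = series_mtbf([1_000_000, 500_000, 250_000])  # hours, illustrative
print(f"System MTBF ~ {mtbf_s:,.0f} hours")          # ~142,857 hours
```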

16 Mean Time To Repair (MTTR)  The formulae discussed earlier assume zero maintenance, i.e. if a device breaks down, it is not fixed  Often, it is important to know the probability of fixing a broken system within a given time T  For this we need to know the MTTR, which is worked out by examining all the steps involved and the failure modes  The probability of fixing the broken system within time T can then be predicted using exponentials similar to those already seen

17 Operational Readiness  Operational readiness is the probability that a system will be ready to fulfil its function when called upon  E.g. the probability that an e-mail sent at a random time will get through  Operational readiness = MTBF / (MTBF + MTTR)
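The readiness formula is simple enough to show directly; the MTBF and MTTR figures below are assumptions for illustration:

```python
# Operational readiness (steady-state availability) = MTBF / (MTBF + MTTR).

def operational_readiness(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(operational_readiness(mtbf_hours=180_000, mttr_hours=3))  # ~0.99998
```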

18 Active Redundancy  Device 1 and Device 2 operate in parallel; either device can do the job

19 Active Redundancy Calculations  MTBF_S = 1/λ1 + 1/λ2 - 1/(λ1 + λ2)  Probability of survival = e^(-λ1 t) + e^(-λ2 t) - e^(-λ1 t) x e^(-λ2 t)  For an active redundancy system S made from two identical sub-systems A: MTBF_S = 1.5 x MTBF_A  Note: The failure rate is no longer constant with time
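A sketch of these two formulas (λ1 and λ2 are the failure rates of the two devices; the 100,000-hour MTBF is an assumed example value):

```python
import math

# Active redundancy: both devices run and either can carry the load.

def active_mtbf(lam1: float, lam2: float) -> float:
    return 1 / lam1 + 1 / lam2 - 1 / (lam1 + lam2)

def active_survival(t: float, lam1: float, lam2: float) -> float:
    return math.exp(-lam1 * t) + math.exp(-lam2 * t) - math.exp(-(lam1 + lam2) * t)

lam = 1 / 100_000                          # identical sub-systems, MTBF_A = 100,000 h
print(active_mtbf(lam, lam))               # 150,000 h = 1.5 x MTBF_A
print(active_survival(100_000, lam, lam))  # ~0.60 at t = MTBF_A
```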

20 Passive Redundancy  Device 1 operates with Device 2 on standby (not normally powered); when Device 1 fails, switch over to Device 2

21 Passive Redundancy Calculations  For a passive redundancy system S made from two identical sub-systems A, ignoring the reliability of the switch-over system:  Probability of survival = e^(-λ_A t) x (1 + λ_A t)  MTBF_S = 2 x MTBF_A  Note: The failure rate is no longer constant with time
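The same kind of sketch for the standby case, again with an assumed 100,000-hour sub-system MTBF and a perfect switch-over:

```python
import math

# Passive (standby) redundancy with an ideal switch-over.

def passive_survival(t: float, lam: float) -> float:
    return math.exp(-lam * t) * (1 + lam * t)

lam = 1 / 100_000                        # MTBF_A = 100,000 h
print(passive_survival(100_000, lam))    # ~0.74 at t = MTBF_A (vs ~0.60 for active)
print(2 / lam)                           # MTBF_S = 2 x MTBF_A = 200,000 h
```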

22 Error Detection and Correction  There are two trivially simple ways to guard against errors  Send the information twice: Then at the receiver, if they are different, we have detected an error  Send the information three times: Then at the receiver, accept the majority verdict to correct an error  But we can do better than this.....
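The “send it three times” idea is easy to sketch; the bitwise majority trick below is my own illustration, not from the slides:

```python
# Three-way repetition: a single corrupted copy is out-voted by the other two.

def majority(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)   # bitwise majority of three copies

sent = 0b1011
received = [sent, sent ^ 0b0100, sent]   # second copy corrupted in one bit
print(bin(majority(*received)))          # 0b1011 -> the error is corrected
```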

23 The Hamming (7,4) code  In the codeword table (not reproduced here), pink bits are parity check bits and green bits are information bits  The 16 codewords can be obtained by adding any rows mod 2  All 16 codewords are at Hamming distance 3 or more from one another, so the code can correct a single error  The Golay (23,12) code has 12 information bits and 11 parity check bits and can correct 3 errors
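A minimal sketch of Hamming (7,4) single-error correction. It uses the common p1 p2 d1 p3 d2 d3 d4 bit layout, which may differ from the exact generator table shown on the slide:

```python
# Hamming (7,4): 4 information bits, 3 parity bits, minimum distance 3,
# so any single-bit error can be located and corrected.

def hamming74_encode(d):                  # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # bit positions 1..7

def hamming74_correct(c):                 # c = received 7-bit word (list, mutated)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # parity over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3            # syndrome = position of the error (0 = none)
    if pos:
        c[pos - 1] ^= 1                   # flip the corrupted bit back
    return c

code = hamming74_encode([1, 0, 1, 1])
code[5] ^= 1                              # introduce a single-bit error
print(hamming74_correct(code))            # the original codeword is restored
```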

24 Redundant Array of Independent Discs  There are 5 RAID levels:  1. Mirrored discs  2. Hamming code error correction  3. Single check disc per group  4. Independent read and write  5. Spread data and parity over all discs  In each case, disc errors are corrected; the differences are largely in the system performance
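As a rough illustration of the “single check disc per group” idea (RAID 3/4-style parity; the byte values below are invented):

```python
# One parity disc holds the XOR of the data discs, so any single lost
# disc can be rebuilt from the parity and the surviving data.

data_discs = [0b10110010, 0b01101100, 0b11100001]   # illustrative data blocks
parity = 0
for block in data_discs:
    parity ^= block                      # parity disc = XOR of all data blocks

lost = data_discs[1]                     # pretend disc 1 fails
rebuilt = parity ^ data_discs[0] ^ data_discs[2]
print(rebuilt == lost)                   # True: the missing block is recovered
```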

25 Effect of Non-Maintenance  Consider a file server with 6 discs in an array  The probability of getting a disc failure in a 6-disc RAID array in one year is about 5%, which is fairly high. Assume that this happens and the problem is left unfixed for a further 3 weeks  The probability of another disc failing in this time is about 0.25%. If this happens too, you lose the server!  The odds of this happening in any year are 5% of 0.25%, or 1 in 8000, divided by the number of RAID arrays

26 Effect of Improving the MTTR  If the MTTR of the RAID had been 3 hours instead of 3 weeks, the odds are somewhat different:  The probability of the first failure is still 5%. But now the probability of getting a second failure in the ensuing 3 hours is 0.0003% instead of 0.25%  So the odds of losing the RAID in any year improve to about 1 in 6,666,667 from the previous 1 in 8000  Moral of the story: Fix the First Fault Fast
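A rough reconstruction of the arithmetic on these last two slides, taking the quoted probabilities at face value:

```python
# Chance of losing the array = P(first disc fails in a year)
#                            x P(second disc fails before the first is replaced).

p_first = 0.05   # ~5% chance of a disc failure in the 6-disc array per year

for label, p_second in [("3-week repair", 0.0025), ("3-hour repair", 0.000003)]:
    p_lose_array = p_first * p_second
    print(f"{label}: about 1 in {1 / p_lose_array:,.0f} per year, per array")
# 3-week repair: about 1 in 8,000
# 3-hour repair: about 1 in 6,666,667
```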