COT 5611 Operating Systems Design Principles Spring 2012

Presentation transcript:

COT 5611 Operating Systems Design Principles, Spring 2012
Dan C. Marinescu
Office: HEC 304
Office hours: M-Wd 5:00-6:00 PM

Lecture 14 - Monday, February 27
Reading assignment: Chapter 8 from the on-line text
Last time:
  Control mechanisms and decisions in the Internet
  The network layer
  The end-to-end layer

Today:
  Reliability and fault tolerance

Reliable Systems from Unreliable Components
The problem was first investigated in the mid-1940s by John von Neumann.
Steps to build reliable systems:
  Error detection
    Example: network protocols (link and end-to-end)
  Error containment → limit the effect of errors
    Example: enforced modularity (client-server architectures, virtual memory, etc.)
  Error masking → ensure correct operation in the presence of errors
    Example: network protocols using error correction, repetition, or interpolation for data with real-time constraints

Faults and errors Fault a flaw with the potential to cause problems; occurs when an error is not detected and masked Software Hardware Design  e.g., under-provisioning resources Implementation  e.g., setting the wrong device priorities on a bus Operation  setting the wrong date Environment  failure of the cooling system may cause intermittent memory errors Types of faults Latent  not active now Active Error  the consequence of an active fault. Failure  inability to produce the desired result. Distinction between failure and fault related to modularity a fault of a component may lead to the failure of the entire system but it may be detected and masked by other components. 12/31/2018 Lecture 14

Error containment in a layered system
Several design strategies are possible. The layer where an error occurs may:
  Mask the error → correct it internally so that the higher layer is not aware of it.
  Detect the error and report it to the higher layer → fail-fast.
  Stop → fail-stop.
  Do nothing.
Types of faults:
  Transient (caused by a passing external condition) / Persistent
  Soft / Hard → can or cannot be masked by a retry.
  Intermittent → occurs only occasionally and is not reproducible.
Latency of a fault → the time until the fault causes an error. A long latency may allow errors to accumulate and defeat periodic error correction.

Fault avoidance and fault tolerance
Fault avoidance → build the system using highly reliable components. This does not work for systems with a very large number N of components:
  p → probability of failure of one component
  If failures are independent, the probability that the system functions correctly is C = (1-p)^N → no matter how small p is, when N is large C becomes small.
Fault tolerance → design a reliable system from unreliable components.
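A minimal numeric sketch of C = (1-p)^N; the values of p and N below are illustrative, not from the lecture:

```python
# Sketch: system reliability when success requires all N components to work.
# The values of p and N are made-up examples.

def system_reliability(p: float, n: int) -> float:
    """Probability that a system of n independent components, each failing
    with probability p, has no failed component: C = (1 - p)**n."""
    return (1.0 - p) ** n

if __name__ == "__main__":
    p = 1e-4  # a "highly reliable" component: 1 failure in 10,000
    for n in (10, 1_000, 100_000, 1_000_000):
        print(f"N = {n:>9,}  C = {system_reliability(p, n):.4f}")
    # Even with p = 0.0001, a million-component system works with
    # probability roughly e**(-100), i.e., essentially never.
```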

The fault-tolerance design process is iterative
1. Begin the design of a fault-tolerance model:
   Identify potential faults.
   Estimate the risk of each one.
   Design methods to detect the errors caused by the highest-risk faults.
   Design methods to deal with those errors.
   Contain the damage from high-risk errors through modularity.
   Design procedures to deal with the errors detected, using:
     Temporal redundancy (retry the operation)
     Spatial redundancy (deploy multiple components)
   Update the model to account for the error-masking procedures.
   Iterate until the probability of untolerated faults is small.
2. Observe the system in the real world:
   Study the error logs.
   Identify the cause of each error.
   Use the information collected to improve the model and iterate again.

Measures of reliability
TTF → time to failure
MTTF → mean time to failure
TTR → time to repair
MTTR → mean time to repair
MTBF → mean time between failures; MTBF = MTTF + MTTR
Availability = MTTF / MTBF
Down-time fraction = 1 - Availability = MTTR / MTBF
These are backward-looking measures, used:
  to evaluate how a system performed in the past,
  to predict how the system will perform in the future.
Sometimes proxies are used to measure MTTF.
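A small sketch of these relations; the MTTF and MTTR values are made-up examples, not from the lecture:

```python
# Sketch: availability and yearly down time from MTTF and MTTR.

HOURS_PER_YEAR = 365 * 24

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    mtbf = mttf_hours + mttr_hours
    return mttf_hours / mtbf

if __name__ == "__main__":
    mttf, mttr = 2_000.0, 4.0        # hours, assumed
    a = availability(mttf, mttr)
    downtime = (1.0 - a) * HOURS_PER_YEAR
    print(f"Availability       : {a:.5f}")
    print(f"Down time per year : {downtime:.1f} hours")
```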

How to measure the averages MTTF, MTTR, MTBF
(1) Observe one system through N run-fail-repair cycles and use the TTF_i values.
(2) Observe N distinct systems, run them until all have failed, and use the corresponding TTF_i values. This works only if the failure process is ergodic.
Stochastic (random) process → instead of dealing with only one possible way the process might develop over time, in a stochastic process there is indeterminacy, described by probability distributions. There are discrete and continuous realizations.
  Processes modeled as stochastic time series include the stock market; signals such as speech, audio, and video; and medical data such as EKG and EEG.
  Examples of random fields include static images, random terrain (landscapes), and composition variations of a heterogeneous material.
A stochastic process has multiple realizations; one can compute:
  a time average over one realization,
  an ensemble average over multiple realizations.
Ergodic process → time averages over a single realization are equal to ensemble averages (averages over multiple realizations taken at the same time).
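A Monte Carlo sketch of the ergodicity that method (2) relies on, assuming a memoryless (exponential) failure process with a made-up true MTTF:

```python
# Sketch: time average of TTF over many cycles of one system vs. the ensemble
# average over many systems, for an exponential failure law.
import random

random.seed(1)
TRUE_MTTF = 500.0   # hours, assumed

# (1) one system observed through many run-fail-repair cycles
time_avg = sum(random.expovariate(1 / TRUE_MTTF) for _ in range(10_000)) / 10_000

# (2) many distinct systems, each run once until it fails
ensemble_avg = sum(random.expovariate(1 / TRUE_MTTF) for _ in range(10_000)) / 10_000

print(f"time average     MTTF ~ {time_avg:.1f} h")
print(f"ensemble average MTTF ~ {ensemble_avg:.1f} h   (true value {TRUE_MTTF} h)")
```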

The conditional failure rate - the bathtub curve
Conditional failure rate → the probability of failure conditioned on the length of time the component has been operational.
  Infant mortality → many components fail early in their life.
  Burn-out → components fail towards the end of their life cycle.
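A sketch of one common way to produce a bathtub-shaped hazard, not taken from the lecture: the sum of a decreasing term (infant mortality), a constant term (random failures), and an increasing term (wear-out). All parameters are illustrative.

```python
# Sketch: a bathtub-shaped conditional failure rate h(t) from three additive hazards.

def bathtub_hazard(t: float) -> float:
    infant  = 0.05 * t ** (-0.5) if t > 0 else float("inf")  # decreasing with age
    random_ = 0.001                                           # constant background rate
    wearout = 3e-9 * t ** 2                                   # increasing with age
    return infant + random_ + wearout

if __name__ == "__main__":
    for t in (1, 10, 100, 1_000, 5_000, 10_000):
        print(f"t = {t:>6} h   h(t) = {bathtub_hazard(t):.5f} per hour")
    # h(t) is high at first, dips to a flat middle region, then rises again.
```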

Reliability functions
Unconditional failure rate (failure density): f(t) = Pr(the component fails between t and t + dt).
Cumulative probability that the component has failed by time t: F(t) = ∫ f(s) ds, integrated from 0 to t.
Mean time to failure: MTTF = ∫ t·f(t) dt, integrated from 0 to ∞.
Reliability: R(t) = Pr(the component functions at time t, given that it was functioning at time 0); R(t) = 1 - F(t).
Conditional failure rate: h(t) = f(t) / R(t).
Some systems experience a uniform failure rate: h(t) is independent of the time the system has been operational, so h(t) is a straight line (not a bathtub) and R(t) is memoryless.
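A small numeric sketch of these functions for the memoryless exponential failure law, with an assumed MTTF; it checks that h(t) = f(t)/R(t) is constant:

```python
# Sketch: f, F, R, h for an exponential failure law with MTTF = 1,000 hours (assumed).
import math

MTTF = 1_000.0            # hours, assumed
lam = 1.0 / MTTF          # failure rate lambda

def f(t): return lam * math.exp(-lam * t)   # failure density
def F(t): return 1.0 - math.exp(-lam * t)   # cumulative failures by time t
def R(t): return 1.0 - F(t)                 # reliability
def h(t): return f(t) / R(t)                # conditional failure rate

for t in (0.0, 100.0, 1_000.0, 5_000.0):
    print(f"t={t:>7.0f}  f={f(t):.6f}  F={F(t):.4f}  R={R(t):.4f}  h={h(t):.6f}")
# h(t) stays at 1/MTTF = 0.001 for every t: the exponential law is memoryless.
```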

Memoryless random variables and processes
Discrete random variable X → Pr(X > m+n | X > m) = Pr(X > n).
  Example: the geometric distribution → the number of independent Bernoulli trials needed to get one "success", with a fixed probability p of "success" on each trial.
  Example: Pr(X > 50 | X > 35) = Pr(X > 15).
  Note that Pr(X > 50 | X > 35) = Pr(X > 50) would hold if and only if the events {X > 50} and {X > 35} were independent, which is not possible (the first event is contained in the second).
Continuous random variable X → Pr(X > t+s | X > t) = Pr(X > s).
  Example: the exponential distribution. Indeed, the conditional probability is
  Pr(X > t+s | X > t) = Pr(X > t+s) / Pr(X > t) = Pr(X > s).
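An empirical sketch of the discrete memoryless property for the geometric distribution; the values of p, m, and n are illustrative:

```python
# Sketch: Monte Carlo check that Pr(X > m+n | X > m) ~ Pr(X > n) for a
# geometric random variable X (number of trials up to the first success).
import random

random.seed(0)
p, m, n, N = 0.05, 35, 15, 200_000

def geometric(p: float) -> int:
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geometric(p) for _ in range(N)]
cond = [x for x in samples if x > m]                       # condition on X > m
lhs = sum(x > m + n for x in cond) / len(cond)             # Pr(X > m+n | X > m)
rhs = sum(x > n for x in samples) / N                      # Pr(X > n)
print(f"Pr(X > {m+n} | X > {m}) ~ {lhs:.3f}")
print(f"Pr(X > {n})           ~ {rhs:.3f}")
```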

MTTF, the failure rate, and availability
When the failure process is memoryless, the conditional failure rate is h(t) = 1/MTTF. (Prove it!) Often this condition is ignored.
Example: a manufacturer specifies the "MTTF" of a 3.5-inch disk as 300,000 hours (34 years).
  The manufacturer runs 1,000 disks for 3,000 hours and 10 disks fail during this time → (1,000 disks × 3,000 hours) / 10 failures = 1 failure per 300,000 hours of operation → h(t) = 1/300,000 per hour.
  But MTTF is not 1/h(t), because the process is not memoryless: the older the disks, the more likely it is that the mechanical parts will fail.
Availability → often expressed by counting the number of 9s:
  99.9% → three-nines availability → the system can be down 1.5 minutes/day, or about 8 hours/year.
  99.999% → five-nines availability → the system can be down about 5 minutes/year.
  99.99999% → seven-nines availability → the system can be down about 3 seconds/year.
Note that availability does not give information about MTTF.
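A small sketch translating a "number of nines" availability figure into allowed down time per year, reproducing the figures above:

```python
# Sketch: allowed down time per year for a given availability percentage.

SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability_percent: float) -> float:
    """Seconds of allowed down time per year for a given availability."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR

if __name__ == "__main__":
    for a in (99.9, 99.999, 99.99999):
        secs = downtime_per_year(a)
        print(f"{a:>9}% available -> {secs / 60:8.2f} minutes/year "
              f"({secs:7.1f} seconds/year)")
```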

Reliability as the number of σ of the distribution
σ → the standard deviation of a normal distribution.
Example: production of logic gates
  Mean propagation time: 10 nsec
  Maximum acceptable propagation time: 11.8 nsec
  Tolerance: 11.8 - 10.0 = 1.8 nsec
  4.5 σ tolerance → σ = 1.8/4.5 = 0.4 nsec
How to maintain a 4.5 σ tolerance (this applies only to production!): samples of the gates are measured, and if the standard deviation of the propagation delay exceeds 0.4 nsec, the production line is adjusted.
The expected fraction of components outside the specified tolerance is the integral of one tail of the normal distribution from 4.5 σ to ∞: no more than 3.4 per one million gates should have delays greater than 11.8 nsec.
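A short sketch computing that one-sided tail, reproducing the "3.4 per million beyond 4.5 σ" figure from the slide:

```python
# Sketch: probability mass of a normal distribution beyond k standard deviations.
import math

def normal_tail(k: float) -> float:
    """Pr(X > mu + k*sigma) for a normal random variable X."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

if __name__ == "__main__":
    mean, sigma, limit = 10.0, 0.4, 11.8          # nsec, from the slide
    k = (limit - mean) / sigma                    # 4.5 sigma
    p = normal_tail(k)
    print(f"k = {k:.1f} sigma, tail probability = {p:.2e}")
    print(f"expected bad gates per million: {p * 1e6:.1f}")
```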

Active fault handling
Do nothing → pass the problem to the larger system that includes this component.
Fail fast → report that something went wrong.
Fail safe → transform incorrect values into acceptable values.
Fail soft → the system continues to operate correctly with respect to some predictably degraded subset of its specifications.
Mask the error → correct the error.
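A hypothetical sketch contrasting two of these strategies for a sensor reading that must lie in an assumed valid range; the component and range are made up:

```python
# Sketch: fail-fast vs. fail-safe handling of an out-of-range temperature reading.

T_MIN, T_MAX = -40.0, 125.0   # assumed valid range, degrees C

def read_fail_fast(raw: float) -> float:
    """Fail fast: report that something went wrong and let the caller decide."""
    if not (T_MIN <= raw <= T_MAX):
        raise ValueError(f"sensor reading {raw} out of range")
    return raw

def read_fail_safe(raw: float) -> float:
    """Fail safe: transform an incorrect value into an acceptable one."""
    return min(max(raw, T_MIN), T_MAX)   # clamp to the valid range

if __name__ == "__main__":
    print(read_fail_safe(300.0))          # -> 125.0
    try:
        read_fail_fast(300.0)
    except ValueError as e:
        print("fail fast:", e)
```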

Types of errors
Detectable error → one that can be detected reliably.
Maskable error → one for which it is possible to devise a procedure to recover.
Tolerated error → one that can be detected and masked.
Untolerated error → one that is undetectable, undetected, unmaskable, or unmasked.

Fault tolerance model
1. Analyze the system and distinguish errors that can be reliably detected from errors that cannot.
2. For each undetectable error, evaluate the probability of its occurrence. If that probability is not negligible, modify the system design in whatever way necessary to make the error reliably detectable.
3. For each detectable error, implement a detection procedure and reclassify the module in which it is detected as fail-fast; then try to devise a way of masking the error, and if there is a way, reclassify it as a maskable error.
4. For each maskable error, evaluate its probability of occurrence, the cost of failure, and the cost of the masking method. If the evaluation indicates it is worthwhile, implement the masking method and reclassify this error as a tolerated error.

Replication - use multiple copies
Example: updating a sector on disk 2 of a five-disk RAID 4 system.
  To construct a new parity sector that includes the new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and perform three XORs.
  A faster way is to read just the old parity sector and the old data 2 sector and compute the new parity sector as: new parity ← old parity ⊕ old data 2 ⊕ new data 2.
Example: a quad-component superdiode (Shannon and Moore). The dotted line in the figure is a bridging connection, which allows the superdiode to tolerate a different set of failures: (i) a single short circuit and a single open circuit in any two diodes; (ii) an open circuit in both upper diodes plus a short circuit in one of the lower diodes.
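A toy sketch of the RAID 4 parity shortcut above, on made-up 4-byte "sectors", checking that the shortcut agrees with recomputing parity from all data sectors:

```python
# Sketch: new parity = old parity XOR old data 2 XOR new data 2 (bytewise).

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# four data sectors and their parity
data = [b"\x11\x22\x33\x44", b"\xaa\xbb\xcc\xdd",
        b"\x01\x02\x03\x04", b"\xf0\x0f\xf0\x0f"]
parity = data[0]
for sector in data[1:]:
    parity = xor(parity, sector)

# update data 2 (index 1) using only the old parity and old data 2
new_data2 = b"\xde\xad\xbe\xef"
new_parity = xor(xor(parity, data[1]), new_data2)
data[1] = new_data2

# check against full recomputation over all four sectors
full = data[0]
for sector in data[1:]:
    full = xor(full, sector)
assert new_parity == full
print("new parity:", new_parity.hex())
```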

NMR N-modular redundancy - voting Multiple (N) replicas of the same module. TMR – three-modular redundancy. R  reliability of a single module Modules fail independently Reliability of a super-module with 3 voting modules Rs= R3+3R2(1-R)= 3R2 – 2R3 Example: (a) R=0.8  Rs = 0.896 (b) R=0.999  Rs = 0.999997 If the voter is perfectly reliable the probability that an incorrect result will be accepted by the voter is that it is not more than (1- Rs). The super-module is not always fail-fast. If two replicas fail in exactly the same way, the voter will accept the erroneous result and call for repair of the correctly operating replica. Fully triple replicated super-module 12/31/2018 Lecture 14