COT 5611 Operating Systems Design Principles Spring 2014

Dan C. Marinescu Office: HEC 304 Office hours: M-Wd 3:30 – 5:30 PM

2 Lecture 14
Reading assignment: Chapter 8 from the on-line text
Last time:
  Control mechanisms and decisions in the Internet
  The network layer
  The end-to-end layer
11/14/2018 Lecture 14

Today: Reliability and fault tolerance

4 Reliable Systems from Unreliable Components
Problem first investigated in the mid-1940s by John von Neumann.
Steps to build reliable systems:
  Error detection
    Network protocols (link and end-to-end)
  Error containment – limit the effect of errors
    Enforced modularity: client-server architectures, virtual memory, etc.
  Error masking – ensure correct operation in the presence of errors
    Network protocols: error correction, repetition, interpolation for data with real-time constraints

5 Faults and errors
Fault → a flaw with the potential to cause problems
  Software
  Hardware
  Design → e.g., under-provisioning resources
  Implementation → e.g., setting the wrong device priorities on a bus
  Operation → e.g., setting the wrong date
  Environment → e.g., failure of the cooling system may cause intermittent memory errors
Types of faults
  Latent → not active now
  Active
Error → the consequence of an active fault.
Failure → inability to produce the desired result; occurs when an error is not detected and masked.
The distinction between failure and fault is related to modularity: a fault of a component may lead to the failure of the entire system, but it may also be detected and masked by other components.

6 Error containment in a layered system
Several design strategies are possible. The layer where an error occurs:
  Masks the error → corrects it internally so that the higher layer is not aware of it.
  Detects the error and reports it to the higher layer → fail-fast.
  Stops → fail-stop.
  Does nothing.
Types of faults
  Transient (caused by a passing external condition) / Persistent
  Soft / Hard → can or cannot be masked by a retry.
  Intermittent → occurs only occasionally and is not reproducible.
Latency of a fault – time until a fault causes an error
  A long latency may allow errors to accumulate and defeat periodic error correction.

7 Fault avoidance and fault tolerance
Fault avoidance → build the system using highly reliable components.
  Does not work for systems with a very large number N of components.
  p → probability of failure of one component
  If the failures are independent, the probability that the system functions correctly is C = (1-p)^N → no matter how small p is, C is small when N is large.
Fault tolerance → design a reliable system from unreliable components.
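A minimal sketch of why fault avoidance does not scale, evaluating C = (1-p)^N for growing N. The function name and the sample value p = 10^-6 are illustrative assumptions, not values from the lecture.

```python
def system_reliability(p: float, n: int) -> float:
    """Probability that all n independent components work,
    when each one fails with probability p."""
    return (1.0 - p) ** n


# Assumed per-component failure probability, chosen only for illustration.
p = 1e-6
for n in (10, 1_000, 1_000_000, 10_000_000):
    print(f"N = {n:>10,}: C = {system_reliability(p, n):.6f}")
```

Even with a one-in-a-million failure probability per component, C drops to roughly 1/e once N reaches one million, because (1-p)^N is close to e^(-pN).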

8 The fault-tolerance design process is iterative
Begin the design of a fault-tolerant model:
  Identify potential faults.
  Estimate the risk of each one.
  Design methods to detect the errors for the highest-risk faults.
  Design methods to deal with the errors for the highest-risk faults.
Contain the damage from high-risk errors through modularity.
Design procedures to mask the detected errors through:
  Temporal redundancy (retry the operation)
  Spatial redundancy (deploy multiple components)
Update the model to account for the error-masking procedures.
Iterate until the probability of untolerated faults is small.
Observe the system in the real world:
  Study the error logs.
  Identify the cause of each error.
Use the information collected to improve the model and iterate again.
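Temporal redundancy can be sketched as a bounded retry loop. This is an illustrative sketch only: `TransientError`, `with_retry`, and `make_flaky` are hypothetical names introduced here, and the sketch assumes the error has already been detected and is signaled as an exception.

```python
class TransientError(Exception):
    """Hypothetical exception standing for a detected, possibly transient error."""


def with_retry(operation, attempts=3):
    """Mask transient errors by retrying; report the error if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err  # remember the error and try again
    # All attempts failed: the error is persistent, so report it upward.
    raise last_error


def make_flaky(failures):
    """Build a test operation that fails `failures` times and then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("transient fault")
        return "ok"

    return operation


print(with_retry(make_flaky(2), attempts=3))
```

An operation that fails twice is masked by three attempts; one that keeps failing is reported to the caller, which is the fail-fast behavior for an error the retry cannot mask.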

9 Measures of reliability
TTF – time to failure
MTTF – mean time to failure
TTR – time to repair
MTTR – mean time to repair
MTBF – mean time between failures
  MTBF = MTTF + MTTR
Availability = MTTF/MTBF
Down time = (1 - Availability) = MTTR/MTBF
These are backward-looking measures, used:
  To evaluate how a system performed in the past
  To predict how the system will perform in the future
Proxies are sometimes used to measure MTTF.
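A short sketch of how these measures combine, computed from observed run and repair intervals. The sample intervals are made-up numbers for illustration, and the function names are assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


def down_time_fraction(mttf: float, mttr: float) -> float:
    """Fraction of time the system is down: MTTR / MTBF."""
    return mttr / (mttf + mttr)


# Hypothetical log of three run-fail-repair cycles, in hours.
ttf_hours = [950.0, 1020.0, 1027.0]
ttr_hours = [0.5, 1.5, 1.0]

mttf = mean(ttf_hours)
mttr = mean(ttr_hours)
print(f"MTTF = {mttf:.1f} h, MTTR = {mttr:.1f} h, MTBF = {mttf + mttr:.1f} h")
print(f"availability = {availability(mttf, mttr):.5f}")
print(f"down time    = {down_time_fraction(mttf, mttr):.5f}")
```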

10 How to measure the averages MTTF, MTTR, MTBF
(1) Observe one system through N run-fail-repair cycles and use the TTF_i values.
(2) Observe N distinct systems, run them until all have failed, and use the corresponding TTF_i values.
  Method (2) works only if the failure process is ergodic.
Stochastic (random) process → instead of a single possible way the process might develop over time, there is indeterminacy described by probability distributions.
  Discrete and continuous realizations.
  Processes modeled as stochastic time series include the stock market; signals such as speech, audio, and video; and medical data such as EKG and EEG.
  Examples of random fields include static images, random terrain (landscapes), and composition variations of a heterogeneous material.
A stochastic process has multiple realizations; one can compute:
  A time average of one realization
  An ensemble average over multiple realizations
Ergodic processes → time averages over a single realization are equal to ensemble averages (averages over multiple realizations taken at the same time).
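The two measurement methods can be compared with a small simulation. This sketch assumes exponentially distributed times to failure, a fixed random seed, and a made-up mixed population for the non-ergodic case; none of these choices come from the lecture.

```python
import random


def observe_one_system(rng, rate, cycles):
    """Method (1): mean TTF of a single system over many run-fail-repair cycles."""
    return sum(rng.expovariate(rate) for _ in range(cycles)) / cycles


def observe_many_systems(rng, rates):
    """Method (2): mean of the first TTF of each of many distinct systems."""
    return sum(rng.expovariate(r) for r in rates) / len(rates)


rng = random.Random(5611)  # fixed seed so the run is repeatable
n = 20_000

# Ergodic case: every system has the same failure rate (MTTF = 100 hours).
same = [0.01] * n
ergodic_time = observe_one_system(rng, same[0], n)
ergodic_ensemble = observe_many_systems(rng, same)

# Non-ergodic case: half the population has MTTF = 100 h, half has MTTF = 10 h.
mixed = [0.01] * (n // 2) + [0.1] * (n // 2)
mixed_time = observe_one_system(rng, mixed[0], n)  # happens to watch a good unit
mixed_ensemble = observe_many_systems(rng, mixed)

print(f"ergodic:     time avg = {ergodic_time:.1f}, ensemble avg = {ergodic_ensemble:.1f}")
print(f"non-ergodic: time avg = {mixed_time:.1f}, ensemble avg = {mixed_ensemble:.1f}")
```

When all systems share one failure rate, the two averages agree. In the mixed population, watching a single unit reveals only that unit's MTTF, so the time average and the ensemble average differ.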

11 The conditional failure rate – the bathtub curve
Conditional failure rate → probability of failure conditioned on the length of time the component has been operational
  Infant mortality → many components fail early.
  Burn out → components fail towards the end of their life cycle.

12 Reliability functions
Unconditional failure rate: f(t) = Pr(the component fails between t and t + dt)
Cumulative probability that the component has failed by time t: $F(t) = \int_0^t f(\tau)\, d\tau$
The mean time to failure: $MTTF = \int_0^\infty t\, f(t)\, dt$
Reliability: R(t) = Pr(the component functions at time t given that it was functioning at time 0)
  R(t) = 1 – F(t)
The conditional failure rate: h(t) = f(t)/R(t)
Some systems experience uniform failure rates: h(t) is independent of the time the system has been operational.
  h(t) is a horizontal straight line (not a bathtub).
  The failure process is memoryless.
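For the uniform-rate case, the definitions above can be chained into a short derivation. This is a sketch under the assumption that the constant rate is written as λ (a symbol introduced here) and that F is differentiable, so that f(t) = F'(t) = -R'(t).

```latex
\begin{align*}
h(t) &= \frac{f(t)}{R(t)} = -\frac{R'(t)}{R(t)} = \lambda
  && \text{since } f(t) = F'(t) = -R'(t) \\
R(t) &= e^{-\lambda t}
  && \text{solving with } R(0) = 1 \\
f(t) &= \lambda e^{-\lambda t}, \qquad F(t) = 1 - e^{-\lambda t} \\
\mathrm{MTTF} &= \int_0^\infty t\, \lambda e^{-\lambda t}\, dt = \frac{1}{\lambda}
\end{align*}
```

So a constant conditional failure rate implies an exponentially distributed time to failure, and in that case the rate is the reciprocal of the mean time to failure.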

13 Memoryless random variables and processes
Discrete random variable X → Pr(X > m+n | X > m) = Pr(X > n)
  Example: geometric distribution → the number of independent Bernoulli trials needed to get one "success", with a fixed probability p of "success" on each trial.
  Example: Pr(X > 50 | X > 35) = Pr(X > 15)
  Note that Pr(X > 50 | X > 35) = Pr(X > 50) if and only if the events X > 50 and X > 35 are independent, which is not possible.
Continuous random variable X → Pr(X > t+s | X > t) = Pr(X > s)
  Example: exponential distribution
  Indeed, from the conditional distribution: Pr(X > t+s) = Pr(X > s) Pr(X > t), which the exponential tail Pr(X > t) = e^(-λt) satisfies.
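The memoryless property of the geometric distribution can be checked numerically. This sketch uses the strict form Pr(X > m+n | X > m) = Pr(X > n), with X counting trials up to and including the first success; the value p = 0.1 is an arbitrary assumption.

```python
def geometric_tail(p: float, n: int) -> float:
    """Pr(X > n): the first n Bernoulli(p) trials are all failures."""
    return (1.0 - p) ** n


def conditional_tail(p: float, m: int, n: int) -> float:
    """Pr(X > m + n | X > m), from the definition of conditional probability."""
    return geometric_tail(p, m + n) / geometric_tail(p, m)


p = 0.1  # assumed success probability, for illustration only
lhs = conditional_tail(p, 35, 15)  # Pr(X > 50 | X > 35)
rhs = geometric_tail(p, 15)        # Pr(X > 15)
unconditional = geometric_tail(p, 50)  # Pr(X > 50)

print(f"Pr(X > 50 | X > 35) = {lhs:.6f}")
print(f"Pr(X > 15)          = {rhs:.6f}")
print(f"Pr(X > 50)          = {unconditional:.6f}")
```

The first two values coincide, while Pr(X > 50) is much smaller: having survived 35 trials changes the probability of exceeding 50, but not the distribution of the remaining wait.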

14 MTTF, the failure rate, and availability
When the failure process is memoryless, the conditional failure rate is h(t) = 1/MTTF. Prove it!
Often this condition is ignored!
Example: A manufacturer specifies the “MTTF” of a 3.5-inch disk as 300,000 hours (34 years).
  The manufacturer runs 1,000 disks for 3,000 hours and 10 disks fail during this time → (3,000 x 1,000)/10 = 300,000 hours of operation per failure → the failure rate is h(t) = 1/300,000 per hour.
  But MTTF is not 1/h(t) because the process is not memoryless: the older the disks, the more likely it is that the mechanical parts will fail!
Availability → often expressed by counting the number of 9s
  99.9% → three-nines availability → the system can be down 1.5 minutes/day or 8 hours/year.
  99.999% → five-nines availability → the system can be down 5 minutes/year.
  99.99999% → seven-nines availability → the system can be down 3 seconds/year.
Note that availability does not give information about MTTF.
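The arithmetic behind the disk example and the "number of 9s" figures can be reproduced with a few lines. This is a sketch; the function names are assumptions and a 365-day year is assumed.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a 365-day year


def hours_per_failure(disks: int, hours_each: float, failures: int) -> float:
    """Total device-hours observed divided by the number of failures."""
    return disks * hours_each / failures


def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Down time permitted in a period at an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return unavailability * period_seconds


print(f"disk test: {hours_per_failure(1_000, 3_000, 10):,.0f} hours per failure")
for nines in (3, 5, 7):
    per_day = allowed_downtime_seconds(nines, SECONDS_PER_DAY)
    per_year = allowed_downtime_seconds(nines, SECONDS_PER_YEAR)
    print(f"{nines} nines: {per_day:10.4f} s/day, {per_year:12.2f} s/year")
```

The figures quoted above are rounded versions of these values; for example, three nines allows about 86 seconds per day, or roughly 8.8 hours per year.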

15 Reliability as the number of σ of the distribution
σ → standard deviation of a normal distribution
Example: production of gates
  Mean propagation time: 10 nsec
  Maximum propagation time: 11.8 nsec
  Tolerance = 1.8 nsec
  4.5 σ tolerance → σ = 1.8/4.5 = 0.4 nsec
How to measure 4.5 σ tolerance (this applies only to production!!)
  Samples of the gates are measured; if the standard deviation of the propagation delay is more than 0.4 nsec, the production line should be updated.
The expected fraction of components that are outside the specified tolerance is the integral of one tail of the normal distribution from 4.5 to ∞.
  No more than 3.4 gates per million should have delays greater than 11.8 nanoseconds.
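The one-tail integral can be evaluated with the complementary error function from the Python standard library. This is a sketch assuming normally distributed propagation delays; the variable names are illustrative.

```python
import math


def upper_tail(k: float) -> float:
    """Pr(Z > k) for a standard normal random variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


mean_ns = 10.0   # mean propagation time
max_ns = 11.8    # maximum allowed propagation time
k = 4.5          # tolerance expressed in standard deviations

sigma_ns = (max_ns - mean_ns) / k
defects_per_million = upper_tail(k) * 1e6

print(f"sigma = {sigma_ns:.2f} nsec")
print(f"gates slower than {max_ns} nsec: {defects_per_million:.2f} per million")
```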

16 Active fault handling
Do nothing → pass the problem to the larger system which includes this component
Fail fast → report that something went wrong
Fail safe → transform incorrect values to acceptable values
Fail soft → the system continues to operate correctly with respect to some predictably degraded subset of its specifications
Mask the error → correct the error

17 Types of errors
Detectable error → one that can be detected reliably.
Maskable error → one for which it is possible to devise a procedure to recover.
Tolerated error → one that can be detected and masked.
Untolerated error → undetectable, undetected, unmaskable, or unmasked.

18 Fault tolerance model
1. Analyze the system and distinguish errors that can be reliably detected from errors that cannot be reliably detected.
2. For each undetectable error, evaluate the probability of its occurrence. If that probability is not negligible, modify the system design in whatever way necessary to make the error reliably detectable.
3. For each detectable error, implement a detection procedure and reclassify the module in which it is detected as fail-fast. Then try to devise a way of masking the error; if there is a way, reclassify this error as a maskable error.
4. For each maskable error, evaluate its probability of occurrence, the cost of failure, and the cost of the masking method. If the evaluation indicates it is worthwhile, implement the masking method and reclassify this error as a tolerated error.

19 Replication – use multiple copies
Update of a sector on disk 2 of a five-disk RAID 4 system. To construct a new parity sector that includes the new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and perform three more XORs. A faster way is to read just the old parity sector and the old data 2 sector and compute the new parity sector as:
  new parity ← old parity ⊕ old data 2 ⊕ new data 2
A quad-component superdiode (Shannon and Moore). The dotted line is a bridging connection, which allows it to tolerate different sets of failures:
  (i) a single short circuit and a single open circuit in any two diodes;
  (ii) open circuit in both upper diodes plus a short circuit in one of the lower diodes.
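The two ways of computing the new parity sector can be compared in a short sketch. The two-byte "sectors" and all names are illustrative assumptions; real sectors are much larger, but the XOR works the same way byte by byte.

```python
from functools import reduce


def xor_sectors(*sectors: bytes) -> bytes:
    """Byte-wise XOR of equally sized sectors."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


# Four data disks plus one parity disk (five-disk RAID 4); made-up contents.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
old_parity = xor_sectors(*data)

new_data2 = b"\xaa\xbb"  # replaces data[1], i.e. "data 2"

# Slow way: read data 1, data 3, data 4 and XOR them with the new data 2.
slow_parity = xor_sectors(data[0], new_data2, data[2], data[3])

# Fast way: read only the old parity and the old data 2.
fast_parity = xor_sectors(old_parity, data[1], new_data2)

print(slow_parity == fast_parity)

# The parity still lets a lost sector be rebuilt from the surviving ones.
rebuilt_data2 = xor_sectors(fast_parity, data[0], data[2], data[3])
print(rebuilt_data2 == new_data2)
```

The fast path works because XOR-ing the old data 2 into the old parity cancels its contribution, leaving the XOR of the other three data sectors.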

20 NMR N-modular redundancy - voting
Multiple (N) replicas of the same module.
TMR – triple-modular redundancy.
  R → reliability of a single module
  Modules fail independently.
  Reliability of a super-module with 3 voting modules: R_s = R^3 + 3R^2(1-R) = 3R^2 – 2R^3
  Example: (a) R = 0.8 → R_s = 0.896 (b) R = 0.999 → R_s ≈ 0.999997
If the voter is perfectly reliable, the probability that the voter accepts an incorrect result is no more than (1 - R_s).
The super-module is not always fail-fast. If two replicas fail in exactly the same way, the voter will accept the erroneous result and call for repair of the correctly operating replica.
Figure: Fully triple replicated super-module
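The TMR formula is a special case of a majority vote over N independent replicas, which can be sketched as a binomial sum. The sketch assumes a perfect voter and independent failures, as stated above; the function names are illustrative.

```python
import math


def tmr_reliability(r: float) -> float:
    """Reliability of a three-replica voting super-module: 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3


def nmr_reliability(n: int, r: float) -> float:
    """Probability that a strict majority of n independent replicas is correct."""
    majority = n // 2 + 1
    return sum(
        math.comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(majority, n + 1)
    )


for r in (0.8, 0.999, 0.4):
    print(f"R = {r}: TMR = {tmr_reliability(r):.9f}, 5-MR = {nmr_reliability(5, r):.9f}")
```

Voting helps only when the individual modules are better than a coin flip: for R below 0.5 the super-module is less reliable than a single module.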

