COP 5611 Operating Systems Fall 2011

Dan C. Marinescu Office: HEC 304 Office hours: Tu-Th 5:00-6:00 PM

Lecture 24
Today:
  Elements of queuing theory
  Faults, failures, and fault-tolerant design
  Measures of reliability and failure tolerance
  Tolerating active faults
Next time:
  Class review

Reliable Systems from Unreliable Components
The problem was first investigated in the mid-1940s by John von Neumann.
Steps to build reliable systems:
  Error detection
    Network protocols (link and end-to-end)
  Error containment: limit the effect of errors
    Enforced modularity: client-server architectures, virtual memory, etc.
  Error masking: ensure correct operation in the presence of errors
    Network protocols: error correction, repetition, interpolation for data with real-time constraints

Faults and errors
Fault: a flaw with the potential to cause problems
  Sources of faults: software, hardware, design, implementation, operation, environment
  Types of faults:
    Latent
    Active
Error: the consequence of an active fault.

Error containment in a layered system
Several design strategies are possible. The layer where an error occurs can do one of the following:
  Mask the error: correct it internally so that the higher layer is not aware of it.
  Detect the error and report it to the higher layer: fail-fast.
  Stop: fail-stop.
  Do nothing.
Types of faults:
  Transient (caused by a passing external condition) / persistent
  Soft / hard: can or cannot be masked by a retry
  Intermittent: occurs only occasionally and is not reproducible
Latency of a fault: the time until the fault causes an error
  A long latency may allow errors to accumulate and defeat periodic error correction.
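The difference between masking an error and failing fast can be sketched in a few lines of Python. This is only an illustration: the names TransientError, LayerError, make_flaky_read, masking_layer, and fail_fast_layer are hypothetical and do not come from the lecture, and the retry limit of 3 is an arbitrary assumption.

```python
class TransientError(Exception):
    """Hypothetical error raised by a lower layer when a fault becomes active."""


class LayerError(Exception):
    """Hypothetical error a fail-fast layer reports to the layer above it."""


def make_flaky_read(failures):
    """Return a read function that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def read():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("passing external condition")
        return "data"

    return read


def masking_layer(read, max_retries=3):
    """Mask the error: retry internally so the caller never sees a soft fault."""
    for attempt in range(max_retries + 1):
        try:
            return read()
        except TransientError:
            if attempt == max_retries:
                raise  # the retries did not help, so the fault cannot be masked


def fail_fast_layer(read):
    """Fail-fast: detect the error and report it to the higher layer at once."""
    try:
        return read()
    except TransientError as exc:
        raise LayerError("lower layer reported an error") from exc
```

With make_flaky_read(2), masking_layer returns the data and the caller never learns that two attempts failed, whereas fail_fast_layer raises LayerError on the first failure. A fault that outlasts every retry behaves as a hard fault from the point of view of the masking layer.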

The fault-tolerance design process is iterative
Begin the design of a fault-tolerant model:
  Identify potential faults.
  Estimate the risk of each one.
  Design methods to detect the errors for the highest-risk faults.
  Design methods to deal with the errors for the highest-risk faults:
    Contain the damage from high-risk errors through modularity.
    Design procedures to contain the detected errors, using:
      Temporal redundancy (retry the operation)
      Spatial redundancy (deploy multiple components)
  Update the model to account for the error-masking procedures.
  Iterate until the probability of untolerated faults is small.
Observe the system in the real world:
  Study the error logs.
  Identify the cause of each error.
  Use the information collected to improve the model and iterate again.
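Spatial redundancy can be illustrated with a small majority voter over replicated components. This is a minimal sketch under stated assumptions: the function majority_vote and the three example replicas are hypothetical, and the slides do not prescribe voting as the specific mechanism.

```python
from collections import Counter


def majority_vote(replicas, x):
    """Run every replica on the same input and return the majority result.

    Raises RuntimeError when no result is produced by more than half of the
    replicas: the error is then detected but cannot be masked.
    """
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: error detected but not masked")
    return value


def good_square(x):
    """A correctly working replica."""
    return x * x


def faulty_square(x):
    """A replica with an active fault: it always returns a wrong value."""
    return -1


# Two good replicas outvote the faulty one, so the error is masked.
masked = majority_vote([good_square, good_square, faulty_square], 3)
```

Here masked is 9 because two of the three replicas agree. If all three replicas disagreed, the voter would raise an exception instead of returning a guess, which corresponds to detecting the error and reporting it rather than masking it.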

Measures of reliability
TTF: time to failure
MTTF: mean time to failure
  MTTF = (1/N) ∑ TTF_i
TTR: time to repair
MTTR: mean time to repair
  MTTR = (1/N) ∑ TTR_i
MTBF: mean time between failures
  MTBF = MTTF + MTTR
Availability = MTTF / MTBF
Down time = 1 - Availability = MTTR / MTBF
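The measures above can be computed directly from observed failure and repair times. The sketch below uses made-up sample values (three failures, times in hours); the function name reliability_measures and the data are illustrative assumptions, not measurements from the course.

```python
def reliability_measures(ttf, ttr):
    """Compute MTTF, MTTR, MTBF, availability, and down time.

    ttf: observed times to failure; ttr: observed times to repair
    (same unit, one pair of values per failure).
    """
    mttf = sum(ttf) / len(ttf)      # MTTF = (1/N) * sum of TTF_i
    mttr = sum(ttr) / len(ttr)      # MTTR = (1/N) * sum of TTR_i
    mtbf = mttf + mttr              # MTBF = MTTF + MTTR
    availability = mttf / mtbf      # fraction of time the system is up
    down_time = mttr / mtbf         # equals 1 - availability
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": availability,
        "down_time": down_time,
    }


# Hypothetical observations, in hours.
measures = reliability_measures(ttf=[900, 1100, 1000], ttr=[8, 12, 10])
```

For these sample values MTTF is 1000 hours, MTTR is 10 hours, and MTBF is 1010 hours, so the availability is 1000/1010, roughly 0.990, and the down time is 10/1010, roughly 0.0099.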

The conditional failure rate

Reliability functions
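Unconditional failure rate: f(t) = Pr(module fails between t and t + dt)
Reliability: R(t) = Pr(module functions at time t, given that it was functioning at time 0)
When the conditional failure rate is constant, R(t) = e^(-t/MTTF) and the function is memoryless: the probability of surviving an additional interval does not depend on how long the module has already been running.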
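The memoryless property can be checked numerically. The sketch below assumes a constant conditional failure rate, that is, the exponential model R(t) = e^(-t/MTTF); that model is an assumption made here for illustration, and the function names and the MTTF value of 1000 hours are hypothetical.

```python
import math


def reliability(t, mttf):
    """R(t) under the assumed exponential model with a constant failure rate."""
    return math.exp(-t / mttf)


def conditional_survival(s, t, mttf):
    """Pr(module survives to s + t, given that it is functioning at time s)."""
    return reliability(s + t, mttf) / reliability(s, mttf)


# Memoryless check: having already run for 500 hours does not change the
# probability of surviving the next 100 hours.
fresh = reliability(100, 1000)
aged = conditional_survival(500, 100, 1000)
```

Under this model fresh and aged agree up to floating-point rounding. For a module whose failure rate changes with age the two values would differ, so the property belongs to the exponential model and not to R(t) in general.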
