
CS 505: Computer Structures, Fault Tolerance. Thu D. Nguyen, Computer Science, Rutgers University, Spring 2005.


1 CS 505: Computer Structures
Fault Tolerance
Thu D. Nguyen, Spring 2005
Computer Science, Rutgers University

2 Fault Tolerance
Computing components WILL fail
–Hardware, software, and people
The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures
Lots of jargon (as in all areas of computer science), so we need to start with terminology
–See the short paper I posted on the web today

3 Dependability, Reliability, Availability
Dependability: the ability of a computing system to deliver service that can justifiably be trusted
–The service delivered by a system is its behavior as perceived by the service's users
–Dependability is a general concept that encapsulates reliability, availability, etc.
Availability: readiness for correct service
–What percentage of the time is the service available?
Reliability: continuity of correct service
–How long until the next service failure?
Safety: absence of catastrophic consequences for the users and the environment, even in the presence of faults

4 Faults, Errors, and Failures
Failure: an event that occurs when the delivered service deviates from correct service
–By definition, a failure is visible to the user
A fault is a failure of a component of a computing system that may lead to service failure
–If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure
An error is the activation of a fault
–Faults may be dormant or latent
–For example, a disk fault may never become an error if the service never uses that disk again
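The disk example above can be made concrete with a small sketch. The `Disk` and `Service` classes are hypothetical and exist only for illustration: a fault is a flag on the disk, the error is the moment a read touches the faulty disk, and the failure is what the user of `Service.get` sees when nothing masks that error.

```python
class Disk:
    """Hypothetical disk: a fault stays dormant until the disk is read."""

    def __init__(self, name):
        self.name = name
        self.faulty = False  # a fault may be present without any visible effect
        self.blocks = {}

    def write(self, key, value):
        self.blocks[key] = value

    def read(self, key):
        if self.faulty:
            # The fault is activated here: it has become an error.
            raise OSError(f"{self.name}: read error")
        return self.blocks[key]


class Service:
    """Hypothetical service with no redundancy: any error becomes a failure."""

    def __init__(self, disk):
        self.disk = disk

    def get(self, key):
        return self.disk.read(key)


retired = Disk("retired")  # a disk the service never uses again
current = Disk("current")
current.write("k", "v")
service = Service(current)

retired.faulty = True      # dormant fault: no error, no service failure
print(service.get("k"))

current.faulty = True      # fault on the disk in use
try:
    service.get("k")       # activation (error) propagates to the user (failure)
except OSError as exc:
    print("service failure:", exc)
```

The fault on `retired` never matters because nothing reads that disk; the identical fault on `current` becomes a user-visible failure on the next read.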

5 Fault Tolerance
How to continue delivering correct service in the presence of errors
Error detection: figuring out that an error exists in the service
Fault diagnosis: figuring out the root cause of the detected error(s)
Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service
Fault prediction: predicting when faults are likely to occur
Fault prevention: proactive reconfiguration of the service to tolerate likely future faults
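A minimal sketch of two of these steps, error detection and error handling, under assumptions that are not in the slides: data is stored on hypothetical replicas together with a CRC-32 checksum, detection means the checksum no longer matches, and handling means dropping the bad replica from the active configuration and answering from another one. The `Replica` and `ReplicatedService` names are illustrative only.

```python
import zlib


class Replica:
    """Hypothetical replica storing each payload with its CRC-32 checksum."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (payload bytes, checksum)

    def put(self, key, payload):
        self.store[key] = (payload, zlib.crc32(payload))

    def get(self, key):
        return self.store[key]


class ReplicatedService:
    """Detects corrupted reads and reconfigures itself around them."""

    def __init__(self, replicas):
        self.active = list(replicas)
        self.suspected = []  # removed replicas, left for later fault diagnosis

    def put(self, key, payload):
        for replica in self.active:
            replica.put(key, payload)

    def get(self, key):
        for replica in list(self.active):
            payload, checksum = replica.get(key)
            if zlib.crc32(payload) == checksum:  # error detection
                return payload
            # Error handling: take the replica out of the configuration.
            self.active.remove(replica)
            self.suspected.append(replica)
        raise RuntimeError("service failure: no correct replica left")


a, b = Replica("a"), Replica("b")
service = ReplicatedService([a, b])
service.put("k", b"hello")

# Inject a fault: corrupt the payload on replica a, keep the old checksum.
a.store["k"] = (b"jello", a.store["k"][1])

print(service.get("k"))                    # served from replica b
print([r.name for r in service.suspected])
```

Diagnosis, prediction, and prevention are deliberately left out; the `suspected` list only marks where diagnosis would start.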

6 Mathematical Definitions
Availability = MTTF / (MTTF + MTTR)
Reliability = MTTF
–MTTF: mean time to failure; MTTR: mean time to repair
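As a worked example of the availability formula, here is a short sketch. The function names and the MTTF/MTTR figures are made up for illustration and are not taken from the course material.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), with both times in the same unit."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours per year during which the service is not available."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical system: fails every 1000 hours on average, repaired in 1 hour.
a = availability(mttf_hours=1000.0, mttr_hours=1.0)
print(f"availability: {a:.5f}")
print(f"downtime per year: {downtime_hours_per_year(a):.2f} hours")

# Halving the repair time improves availability without touching MTTF.
print(f"with faster repair: {availability(1000.0, 0.5):.5f}")
```

The last line illustrates why the formula matters for design: availability can be raised either by failing less often (larger MTTF) or by repairing faster (smaller MTTR).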

7 Tandem Case Study
Modularity
Fail-fast (fail-stop) hardware
–Extensive self-monitoring
–Fault model enforcement
–What happens when the self-monitoring and fault model enforcement hardware fails?
Replicate hardware for redundancy
–Tolerate single fault
Fault-tolerance software
On-line maintenance
Simplified user interface
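The combination of fail-fast modules and replication can be sketched in software. This is an illustration of the general idea only, not a description of Tandem's actual hardware: self-monitoring is modeled, as an assumption, by running each computation on two hypothetical units and comparing the results, and `FailFastModule`, `RedundantPair`, and `ModuleStopped` are invented names.

```python
class ModuleStopped(Exception):
    """Raised when a fail-fast module has stopped instead of answering."""


class FailFastModule:
    """Runs each request on two units; stops on any disagreement."""

    def __init__(self, unit_a, unit_b):
        self.units = (unit_a, unit_b)
        self.stopped = False

    def compute(self, x):
        if self.stopped:
            raise ModuleStopped("module is stopped")
        result_a, result_b = (unit(x) for unit in self.units)
        if result_a != result_b:
            # Self-check failed: stop rather than return a possibly wrong value.
            self.stopped = True
            raise ModuleStopped("self-check failed")
        return result_a


class RedundantPair:
    """Two fail-fast modules; tolerates a single stopped module."""

    def __init__(self, primary, backup):
        self.modules = [primary, backup]

    def compute(self, x):
        for module in self.modules:
            try:
                return module.compute(x)
            except ModuleStopped:
                continue  # fall over to the next module
        raise RuntimeError("service failure: both modules stopped")


def square(x):
    return x * x


def broken_square(x):
    return x * x + 1  # injected fault in one unit


primary = FailFastModule(square, broken_square)
backup = FailFastModule(square, square)
pair = RedundantPair(primary, backup)

print(pair.compute(3))   # answered by the backup
print(primary.stopped)   # the primary has stopped itself
```

The comparison step is assumed to be fault-free here, which is exactly the question the slide raises about failures in the self-monitoring hardware itself.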

8 Tandem NonStop

9 Tandem Integrity

10 Census of Tandem Availability

11 Census of Tandem Availability

12 Case Study of 1 Tandem Customer

13 Sources of Failures (Going Beyond Tandem)
Operator mistakes are a major source of service failures
Theory: insufficient infrastructural support is a major reason for operator mistakes
–System designers rarely consider human-system interactions
Chart titles: Public Switched Telephone Network; Average of 3 Internet Sites [Patterson et al. 2002]

14 Data from Vivo Project
Conducting a survey to understand database and network administration
–~100 respondents
–DBAs: all ≥ 2 years of experience, 71% ≥ 5 years of experience
–Networking: 98% ≥ 2 years of experience, 81% ≥ 5 years of experience
Source of failures

