CS 505: Thu D. Nguyen Rutgers University, Spring
CS 505: Computer Structures
Fault Tolerance
Thu D. Nguyen
Spring 2005
Computer Science, Rutgers University

Fault Tolerance
Computing components WILL fail
  –Hardware, software, and people
The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures
Lots of jargon (like all areas of computer science), so we need to start with terminology
  –See the short paper I posted on the web today

Dependability, Reliability, Availability
Dependability: the ability of a computing system to deliver service that can justifiably be trusted
  –The service delivered by a system is its behavior as perceived by the service's users
  –Dependability is a general concept that encapsulates reliability, availability, etc.
Availability: readiness for correct service
  –What percentage of time is the service available?
Reliability: continuity of correct service
  –How long until the next service failure?
Safety: absence of catastrophic consequences on the users and environment, even in the presence of faults

Faults, Errors, and Failures
Failure: an event that occurs when the delivered service deviates from correct service
  –By definition, a failure is visible to the user
A fault is a failure of a component of a computing system that may lead to service failure
  –If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure
An error is the activation of a fault
  –Faults may be dormant or latent
  –For example, a disk fault may not ever become an error if the service never uses that disk again
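The distinction between a dormant fault, an activated error, and a service failure can be illustrated with the disk example. The following is a minimal sketch, not part of the original slides: the `Disk` class, the `bad_blocks` parameter, and `mirrored_read` are hypothetical names, and mirroring is assumed here only as one possible way to tolerate the fault.

```python
class Disk:
    """Hypothetical disk whose bad blocks are dormant faults until read."""

    def __init__(self, blocks, bad_blocks=()):
        self.blocks = dict(blocks)
        self.bad_blocks = set(bad_blocks)  # faults: present but not yet activated

    def read(self, n):
        if n in self.bad_blocks:
            # Reading a bad block activates the fault: this is an error.
            raise OSError(f"block {n} unreadable")
        return self.blocks[n]


def mirrored_read(primary, mirror, n):
    """Read block n, falling back to the mirror if the primary reports an error."""
    try:
        return primary.read(n)
    except OSError:
        # The error is tolerated, so the user never sees a service failure.
        return mirror.read(n)


data = {0: "a", 1: "b"}
primary = Disk(data, bad_blocks={1})  # block 1 carries a latent fault
mirror = Disk(data)

print(mirrored_read(primary, mirror, 0))  # fault stays dormant
print(mirrored_read(primary, mirror, 1))  # fault activated, error masked by the mirror
```

If the service never reads block 1, the fault never becomes an error. Without the mirror, the same read would surface to the user as a service failure.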

Fault Tolerance
How to continue delivering correct service in the presence of errors
Error detection: figuring out that an error exists in the service
Fault diagnosis: figuring out the root cause of the detected error(s)
Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service
Fault prediction: predicting when faults are likely to occur
Fault prevention: proactive reconfiguration of the service to tolerate likely future faults

Mathematical Definitions
Availability = MTTF / (MTTF + MTTR)
Reliability = MTTF
  –MTTF: mean time to failure
  –MTTR: mean time to repair
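A short worked example may help make the availability formula concrete. This is a sketch under stated assumptions: the function names and the MTTF and MTTR values are hypothetical and chosen only for illustration, and a year is taken as 365 days.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as a fraction between 0 and 1."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of unavailability in a 365-day year."""
    return (1.0 - avail) * 365 * 24


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.4f}")
print(f"downtime per year: {downtime_hours_per_year(a):.2f} hours")

# Halving the repair time raises availability without changing MTTF.
print(f"with faster repair: {availability(999.0, 0.5):.6f}")
```

The last line shows that availability can be improved by shortening repairs as well as by making failures rarer, since MTTR appears only in the denominator.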

Tandem Case Study
Modularity
Fail-fast (fail-stop) hardware
  –Extensive self-monitoring
  –Fault model enforcement
  –What happens when the self-monitoring and fault model enforcement hardware fails?
Replicate hardware for redundancy
  –Tolerate single fault
Fault-tolerance software
On-line maintenance
Simplified user interface
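The combination of fail-fast modules and replication can be sketched in software. The code below is an illustration only and does not describe Tandem's actual hardware or software: `FailFastAdder`, `RedundantPair`, and the injected `faulty` flag are hypothetical, and the self-check itself is assumed to be correct, which is exactly the assumption the slide questions.

```python
class ModuleFailed(Exception):
    """Raised when a fail-fast module has stopped itself."""


class FailFastAdder:
    """Hypothetical module that checks its own result and stops on a mismatch."""

    def __init__(self, name, faulty=False):
        self.name = name
        self.faulty = faulty  # injected fault, for illustration only
        self.stopped = False

    def _compute(self, a, b):
        return a + b

    def _compute_check(self, a, b):
        # Redundant computation used for self-monitoring;
        # the injected fault makes it disagree with the primary result.
        result = a + b
        return result + 1 if self.faulty else result

    def add(self, a, b):
        if self.stopped:
            raise ModuleFailed(self.name)
        primary = self._compute(a, b)
        if primary != self._compute_check(a, b):
            # Fail fast: stop instead of returning a possibly wrong answer.
            self.stopped = True
            raise ModuleFailed(self.name)
        return primary


class RedundantPair:
    """Two replicas; tolerates a single failed module by using the other."""

    def __init__(self, first, second):
        self.replicas = [first, second]

    def add(self, a, b):
        for replica in self.replicas:
            try:
                return replica.add(a, b)
            except ModuleFailed:
                continue  # try the next replica
        raise ModuleFailed("all replicas failed")


pair = RedundantPair(FailFastAdder("A", faulty=True), FailFastAdder("B"))
print(pair.add(2, 3))  # module A stops itself; module B answers
```

Because each module either returns a checked result or stops, the pair only has to handle one kind of misbehavior (a stopped module), which is what makes simple failover sufficient for a single fault.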

Tandem NonStop

Tandem Integrity

Census of Tandem Availability

Case Study of 1 Tandem Customer

Sources of Failures (Going Beyond Tandem)
Operator mistakes are a major source of service failures
Theory: insufficient infrastructural support is a major reason for operator mistakes
  –System designers rarely consider human-system interactions
Figure panels: Public Switched Telephone Network; Average of 3 Internet Sites [Patterson et al. 2002]

Data from Vivo Project
Conducting a survey to understand database and network administration
  –~100 respondents
  –DBAs: all ≥ 2 years of experience, 71% ≥ 5 years of experience
  –Networking: 98% ≥ 2 years of experience, 81% ≥ 5 years of experience
Source of failures