CS 505: Computer Structures
Fault Tolerance
Thu D. Nguyen
Computer Science, Rutgers University, Spring 2005



Fault Tolerance
Computing components WILL fail
  – Hardware, software, and people
The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures
Lots of jargon (like all areas of computer science), so we need to start with terminology
  – See the short paper I posted on the web today

Dependability, Reliability, Availability
Dependability: the ability of a computing system to deliver service that can justifiably be trusted
  – Service delivered by a system is its behavior as perceived by the service's users
  – Dependability is a general concept that encapsulates reliability, availability, etc.
Availability: readiness for correct service
  – What percentage of time is the service available?
Reliability: continuity of correct service
  – How long until the next service failure?
Safety: absence of catastrophic consequences on the users and environment, even in the presence of faults
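To make the two questions above concrete, here is a minimal sketch that measures both quantities from an outage log. The log format (a list of `(failure_time, repair_time)` pairs in hours), the function names, and the sample numbers are all assumptions made for illustration; they do not come from the lecture.

```python
def measured_availability(outages, window_hours):
    """Fraction of the observation window during which service was up."""
    downtime = sum(repaired - failed for failed, repaired in outages)
    return (window_hours - downtime) / window_hours


def times_to_failure(outages):
    """Lengths of the uptime periods that ended in a service failure."""
    periods = []
    up_since = 0.0  # assume the service starts out working at time 0
    for failed, repaired in outages:
        periods.append(failed - up_since)
        up_since = repaired
    return periods


# Hypothetical log: three outages observed over a 1000-hour window.
outages = [(100.0, 102.0), (500.0, 501.0), (900.0, 905.0)]
avail = measured_availability(outages, 1000.0)
uptimes = times_to_failure(outages)
mean_time_to_failure = sum(uptimes) / len(uptimes)
```

Availability answers "what percentage of time" (here 992 of 1000 hours), while the uptime periods answer "how long until the next failure", which is the reliability view.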

Faults, Errors, and Failures
Failure: an event that occurs when the delivered service deviates from correct service
  – By definition, a failure is visible to the user
A fault is a failure of a component of a computing system that may lead to service failure
  – If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure
An error is the activation of a fault
  – Faults may be dormant or latent
  – For example, a disk fault may not ever become an error if the service never uses that disk again
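The disk example can be sketched in a few lines. The `Disk` class, the mirrored-read helper, and the block contents below are hypothetical names chosen for illustration: a bad block is a dormant fault, reading it activates the fault as an error, and the error becomes a service failure only if nothing tolerates it.

```python
class Disk:
    """Toy disk: bad blocks are faults that stay dormant until read."""

    def __init__(self, blocks, bad_blocks=()):
        self.blocks = dict(blocks)
        self.bad_blocks = set(bad_blocks)

    def read(self, n):
        if n in self.bad_blocks:
            # The fault is activated here and becomes an error.
            raise OSError(f"block {n} unreadable")
        return self.blocks[n]


def unprotected_read(disk, n):
    # No tolerance: an error propagates to the user as a service failure.
    return disk.read(n)


def mirrored_read(primary, mirror, n):
    # Tolerance: the error is masked by reading the redundant copy.
    try:
        return primary.read(n)
    except OSError:
        return mirror.read(n)


data = {0: b"boot", 1: b"user"}
primary = Disk(data, bad_blocks={1})  # block 1 holds a dormant fault
mirror = Disk(data)

ok_without_touching_fault = unprotected_read(primary, 0)
masked = mirrored_read(primary, mirror, 1)
```

As long as block 1 is never read, the fault on `primary` stays dormant and the user sees nothing; with the mirror in place, even an activated fault does not reach the user.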

Fault Tolerance
How to continue delivering correct service in the presence of errors
Error detection: figuring out that an error exists in the service
Fault diagnosis: figuring out the root cause of the detected error(s)
Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service
Fault prediction: predicting when faults are likely to occur
Fault prevention: proactive reconfiguration of the service to tolerate likely future faults
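One common way to realize error detection and dynamic reconfiguration is a heartbeat monitor. The sketch below is an illustration under assumptions, not a technique named in the lecture: the class name, the timeout value, and the explicit `now` arguments (used instead of a real clock so the example is deterministic) are all hypothetical.

```python
class HeartbeatMonitor:
    """Suspects nodes whose heartbeats stop and drops them from service."""

    def __init__(self, nodes, timeout):
        self.timeout = timeout
        self.last_seen = {node: 0.0 for node in nodes}
        self.active = set(nodes)

    def heartbeat(self, node, now):
        # Record that the node reported in at time `now`.
        self.last_seen[node] = now

    def check(self, now):
        # Error detection: no heartbeat within the timeout.
        suspected = {
            node for node in self.active
            if now - self.last_seen[node] > self.timeout
        }
        # Error handling: reconfigure by removing suspected nodes.
        self.active -= suspected
        return suspected


monitor = HeartbeatMonitor(["a", "b"], timeout=3.0)
monitor.heartbeat("a", now=5.0)       # node "b" has gone silent
suspected = monitor.check(now=6.0)
```

Note that this only detects and handles the error; diagnosing why node `b` stopped responding is a separate step.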

Mathematical Definitions
Availability = MTTF / (MTTF + MTTR)
Reliability = MTTF
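A short worked example of the availability formula, where MTTF is the mean time to failure and MTTR is the mean time to repair. The function name and the sample figures (1000 hours between failures, 1 hour to repair) are assumptions chosen only to show the arithmetic.

```python
HOURS_PER_YEAR = 365 * 24


def availability_from_mttf(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails on average every 1000 hours, repaired in 1 hour.
a = availability_from_mttf(1000.0, 1.0)
expected_downtime_per_year = (1.0 - a) * HOURS_PER_YEAR

# Halving the repair time improves availability without touching MTTF.
a_faster_repair = availability_from_mttf(1000.0, 0.5)
```

With these numbers the service is available about 99.9% of the time, which still amounts to several hours of downtime per year. The formula also shows that shortening repairs raises availability just as lengthening the time between failures does.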

Tandem Case Study
Modularity
Fail-fast (fail-stop) hardware
  – Extensive self-monitoring
  – Fault model enforcement
  – What happens when the self-monitoring and fault model enforcement hardware fails?
Replicate hardware for redundancy
  – Tolerate single fault
Fault-tolerance software
On-line maintenance
Simplified user interface
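The combination of fail-fast modules and replication can be illustrated in software. This is a toy sketch and not Tandem's actual design: the `FailFastAdder` class, its `corrupt` flag (which simulates a hardware fault), and the inverse-operation self-check are all hypothetical.

```python
class FailStop(Exception):
    """Raised when a module detects its own error and halts."""


class FailFastAdder:
    """Toy module that checks its own result and stops instead of lying."""

    def __init__(self, corrupt=False):
        self.corrupt = corrupt  # simulated fault, for illustration only
        self.stopped = False

    def add(self, x, y):
        if self.stopped:
            raise FailStop("module has halted")
        result = x + y + (1 if self.corrupt else 0)
        # Self-monitoring: verify the result with the inverse operation.
        if result - y != x:
            self.stopped = True
            raise FailStop("self-check failed")
        return result


def pair_add(primary, backup, x, y):
    # Replication: a single faulty module is tolerated by the backup.
    try:
        return primary.add(x, y)
    except FailStop:
        return backup.add(x, y)


primary = FailFastAdder(corrupt=True)
backup = FailFastAdder()
total = pair_add(primary, backup, 2, 3)
```

Because the faulty module stops rather than returning a wrong answer, the pair only has to handle one simple failure mode. The sketch also inherits the open question from the slide: it offers no protection if the self-check itself is faulty.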

Tandem NonStop

Tandem Integrity

Census of Tandem Availability

Census of Tandem Availability

Case Study of 1 Tandem Customer

Sources of Failures (Going Beyond Tandem)
Operator mistakes are a major source of service failures
Theory: insufficient infrastructural support is a major reason for operator mistakes
  – System designers rarely consider human-system interactions
Charts: Public Switched Telephone Network; Average of 3 Internet Sites [Patterson et al. 2002]

Data from Vivo Project
Conducting survey to understand database and network administration
  – ~100 respondents
  – DBAs: all ≥ 2 years experience, 71% ≥ 5 years experience
  – Networking: 98% ≥ 2 years experience, 81% ≥ 5 years experience
Source of failures