COP 5611 Operating Systems Spring 2010

COP 5611 Operating Systems Spring 2010 Dan C. Marinescu Office: HEC 439 B Office hours: M-Wd 1:00-2:00 PM

Lecture 13
- Reading assignment: Chapter 8 from the online textbook
- Homework 3 due on March 3
- Midterm: Wednesday, March 17, the first week after Spring Break
- Last time: End-to-end layer, Resource Management - Congestion
- Today:
  - Faults, Failures and Fault-Tolerant Design
  - Measures of Reliability and Failure Tolerance
  - Tolerating Active Faults
- Next time

Reliable Systems from Unreliable Components
- Problem first investigated in the mid-1940s by John von Neumann.
- Steps to build reliable systems:
  - Error detection
    - Network protocols (link and end-to-end)
  - Error containment – limit the effect of errors
    - Enforced modularity: client-server architectures, virtual memory, etc.
  - Error masking – ensure correct operation in the presence of errors
    - Network protocols: error correction, repetition, interpolation for data with real-time constraints
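Error detection, the first step above, can be sketched with a checksum appended to each message, roughly as a link-layer protocol would do. This is a minimal illustration and not the lecture's own code: the function names `frame` and `check` are hypothetical, and CRC-32 from Python's standard `zlib` module stands in for whatever code a real protocol uses.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes) -> bytes:
    """Receiver side: recompute the CRC-32 and report any mismatch."""
    payload, received = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != received:
        # The error is detected, not masked: the caller decides what to do.
        raise ValueError("checksum mismatch: error detected")
    return payload

message = frame(b"hello")
corrupted = bytes([message[0] ^ 0x01]) + message[1:]  # flip one bit in transit
```

Calling `check(message)` returns the payload, while `check(corrupted)` raises. Detection alone does not repair anything; it only makes containment and masking possible.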

Faults and errors
- Fault → a flaw with the potential to cause problems
  - Software
  - Hardware
  - Design
  - Implementation
  - Operation
  - Environment
- Types of faults
  - Latent
  - Active
- Error → the consequence of an active fault.

Error containment in a layered system
- Several design strategies are possible. The layer where an error occurs:
  - Masks the error → corrects it internally so that the higher layer is not aware of it.
  - Detects the error and reports it to the higher layer → fail-fast.
  - Stops → fail-stop.
  - Does nothing.
- Types of faults
  - Transient (caused by a passing external condition) / Persistent
  - Soft / Hard → can or cannot be masked by a retry.
  - Intermittent → occurs only occasionally and is not reproducible.
- Latency of a fault – time until a fault causes an error
  - A long latency may allow errors to accumulate and defeat periodic error correction.
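The difference between masking and fail-fast can be sketched as a layer that retries soft errors and reports hard ones upward. Everything here is an illustrative assumption: `unreliable_read` is a hypothetical lower-layer operation, and the limit of three attempts is arbitrary.

```python
import random

class TransientError(Exception):
    """A soft error: a retry may succeed."""

class LowerLayerFailure(Exception):
    """Reported upward when the layer cannot mask an error (fail-fast)."""

def unreliable_read(rng, failure_probability):
    """Hypothetical lowest-layer operation that sometimes fails."""
    if rng.random() < failure_probability:
        raise TransientError("read failed")
    return "data"

def masking_layer(rng, failure_probability, max_attempts=3):
    """Mask soft errors by retrying; report errors that persist."""
    for _ in range(max_attempts):
        try:
            return unreliable_read(rng, failure_probability)
        except TransientError:
            continue  # masked: the higher layer never sees this error
    # Retries did not help, so the error is treated as hard and reported.
    raise LowerLayerFailure(f"gave up after {max_attempts} attempts")
```

With `failure_probability=0.0` the call returns `"data"`; with `1.0` every attempt fails and the layer raises `LowerLayerFailure` instead of silently returning bad data.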

The fault-tolerance design process is iterative
- Begin the design of a fault-tolerance model
  - Identify potential faults
  - Estimate the risk of each one
  - Design methods to detect the errors caused by the highest-risk faults
- Design methods to deal with the errors caused by the highest-risk faults
  - Contain the damage from high-risk errors through modularity
  - Design procedures to mask the detected errors through:
    - Temporal redundancy (retry the operation)
    - Spatial redundancy (deploy multiple components)
- Update the model to account for the error-masking procedures
- Iterate until the probability of untolerated faults is small
- Observe the system in the real world
  - Study the error logs
  - Identify the cause of each error
- Use the information collected to improve the model and iterate again
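Spatial redundancy can be sketched as several replicas of one component whose outputs are compared by a voter. This is a minimal sketch under stated assumptions: the replicas `good` and `faulty` are made-up functions, and a strict majority is only one of several possible voting rules.

```python
from collections import Counter

def majority_vote(results):
    """Return the value reported by a strict majority of replicas."""
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # Disagreement without a majority: detected but not masked.
        raise RuntimeError("no majority among replicas")
    return value

def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])

good = lambda x: x * x
faulty = lambda x: x * x + 1  # hypothetical replica with an active fault
```

With three replicas, `run_replicated([good, good, faulty], 4)` returns 16 because the two correct replicas outvote the faulty one. With only two replicas that disagree, the voter can detect the error but cannot mask it.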

Measures of reliability
- TTF – time to failure
- MTTF – mean time to failure
  - MTTF = (1/N) ∑ TTF_i  (i = 1, ..., N)
- TTR – time to repair
- MTTR – mean time to repair
  - MTTR = (1/N) ∑ TTR_i  (i = 1, ..., N)
- MTBF – mean time between failures
  - MTBF = MTTF + MTTR
- Availability = MTTF / MTBF
- Down time = (1 - Availability) = MTTR / MTBF
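A short worked example may help tie these definitions together. The observation data below (three failures, times in hours) is invented for illustration, and the helper name `reliability_measures` is hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def reliability_measures(ttf_hours, ttr_hours):
    """Compute MTTF, MTTR, MTBF, availability and down time."""
    mttf = mean(ttf_hours)
    mttr = mean(ttr_hours)
    mtbf = mttf + mttr
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": mttf / mtbf,
        "down_time": mttr / mtbf,
    }

# Made-up observations: three failures, each followed by a repair.
m = reliability_measures([990.0, 1010.0, 1000.0], [8.0, 12.0, 10.0])
```

For this data MTTF is 1000 hours, MTTR is 10 hours and MTBF is 1010 hours, so availability is 1000/1010 (about 99%) and down time is 10/1010 (about 1%).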

The conditional failure rate

Reliability functions
- Unconditional failure rate: f(t) = Pr(module fails between t and t + dt)
- Reliability: R(t) = Pr(module functions at time t, given that it was functioning at time 0)
- If the conditional failure rate is constant, R(t) = e^(-t/MTTF) and the failure process is memoryless.
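The memoryless property can be checked numerically. The sketch below assumes a constant conditional failure rate of 1/MTTF, which gives the exponential reliability function R(t) = exp(-t/MTTF); the function names and the MTTF of 1000 hours are illustrative choices.

```python
import math

def reliability(t, mttf):
    """R(t) = exp(-t / MTTF), assuming a constant conditional failure rate."""
    return math.exp(-t / mttf)

def conditional_survival(s, t, mttf):
    """Pr(working at s + t, given working at s) = R(s + t) / R(s)."""
    return reliability(s + t, mttf) / reliability(s, mttf)

# A module that has already run 500 hours is as likely to survive the
# next 100 hours as a brand-new one: the elapsed time does not matter.
old_module = conditional_survival(500.0, 100.0, 1000.0)
new_module = reliability(100.0, 1000.0)
```

The two values agree up to floating-point rounding. This holds only under the constant-rate assumption; it does not hold for components that wear out or suffer infant mortality.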

Active faults
- How to respond to errors
  - Do nothing
  - Fail-fast – report the error
  - Fail-safe – operate in an acceptable manner (e.g., stop sign/flashing lights)
  - Fail-soft – operate correctly with reduced performance
  - Mask the error
- Errors
  - Detectable → the error can be reliably detected.
  - Maskable → the error can be corrected.
  - Untolerated error → undetectable, undetected, unmaskable, or unmasked.

Fault tolerance models
- Categorize all errors.
- Evaluate the probability of occurrence of each error.
- Modify the design to make the most likely errors detectable.
- Implement a detection procedure for each detectable error and identify the modules in which it is detected as fail-fast.
- Try to devise a masking procedure for each detectable error.
- Evaluate the probability of occurrence, the cost of the masking procedure, and the cost of failure for each detectable error.
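The final evaluation step can be sketched as a comparison between the expected loss of an error and the cost of masking it. This is one simple way to frame the trade-off, not a method taken from the lecture; the `ErrorClass` type, its fields, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ErrorClass:
    name: str
    probability: float    # assumed probability of occurrence per year
    failure_cost: float   # assumed cost if the error is not masked
    masking_cost: float   # assumed yearly cost of the masking procedure

    @property
    def expected_loss(self):
        """Expected yearly loss if the error is left unmasked."""
        return self.probability * self.failure_cost

    @property
    def worth_masking(self):
        """True when masking costs less than the loss it avoids."""
        return self.masking_cost < self.expected_loss

def rank_by_risk(errors):
    """Order error classes from highest to lowest expected loss."""
    return sorted(errors, key=lambda e: e.expected_loss, reverse=True)

# Made-up entries for illustration only.
errors = [
    ErrorClass("memory bit flip", 0.001, 20_000.0, 500.0),
    ErrorClass("disk failure", 0.05, 100_000.0, 1_000.0),
]
ranked = rank_by_risk(errors)
```

Here the disk failure ranks first and is worth masking, while the bit flip is not under these assumed costs. Changing the estimates changes the outcome, which is why the model is revisited on each iteration.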

Errors in analog and digital systems
- Analog systems → the designer specifies a tolerance to deal with small errors.
- Digital systems
  - Spatial redundancy
  - Temporal redundancy
  - Coding
  - Forward error correction
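Forward error correction through coding can be illustrated with the classic Hamming(7,4) code, which adds three parity bits to four data bits and corrects any single flipped bit. The slide does not name a specific code, so this choice is an assumption made for illustration; the function names are hypothetical.

```python
def hamming_encode(data):
    """Encode 4 data bits into 7 bits: [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1          # mask the error by flipping it back
    return [c[2], c[4], c[5], c[6]]

sent = hamming_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1                      # one bit corrupted in transit
recovered = hamming_decode(received)
```

The receiver recovers `[1, 0, 1, 1]` without asking for a retransmission. Two or more flipped bits in one codeword exceed what this code can correct and would be decoded incorrectly.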