COP 5611 Operating Systems Fall 2011

Presentation transcript:

COP 5611 Operating Systems Fall 2011 Dan C. Marinescu Office: HEC 304 Office hours: Tu-Th 5:00-6:00 PM

Lecture 24
Today:
- Faults, Failures and Fault-Tolerant Design
- Measures of Reliability and Failure Tolerance
- Tolerating active Faults
Next time:
- Class review
11/20/2018

Reliable Systems from Unreliable Components
Problem first investigated in the mid 1940s by John von Neumann.
Steps to build reliable systems:
- Error detection
  - Network protocols (link and end-to-end)
- Error containment: limit the effect of errors
  - Enforced modularity: client-server architectures, virtual memory, etc.
- Error masking: ensure correct operation in the presence of errors
  - Network protocols: error correction, repetition, interpolation for data with real-time constraints

Faults and errors
Fault → a flaw with the potential to cause problems
- Software
- Hardware
- Design
- Implementation
- Operation
- Environment
Types of faults
- Latent
- Active
Error → the consequence of an active fault.

Error containment in a layered system
Several design strategies are possible. The layer where an error occurs:
- Masks the error → corrects it internally so that the higher layer is not aware of it.
- Detects the error and reports it to the higher layer → fail-fast.
- Stops → fail-stop.
- Does nothing.
Types of faults
- Transient (caused by a passing external condition) / Persistent
- Soft / Hard → can be masked, or not, by a retry.
- Intermittent → occurs only occasionally and is not reproducible.
Latency of a fault: the time until a fault causes an error.
- A long latency may allow errors to accumulate and defeat periodic error correction.
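The difference between masking a soft fault by a retry and reporting a hard fault to the higher layer can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `TransientFault`, `with_retry`, and `make_flaky` are hypothetical, and the retry limit of 3 is an arbitrary assumption.

```python
class TransientFault(Exception):
    """Hypothetical error type raised by a lower layer for a soft fault."""


def with_retry(operation, attempts=3):
    """Mask soft faults by retrying; report the error once retries run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()      # success: the higher layer never sees the error
        except TransientFault as err:
            last_error = err        # possibly a soft fault: try again
    # The retries did not mask the fault, so it behaves like a hard fault.
    # Report it to the higher layer (fail-fast) instead of doing nothing.
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error


def make_flaky(failures_before_success):
    """Build a hypothetical operation that fails a fixed number of times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault("passing external condition")
        return "ok"

    return operation
```

With `make_flaky(2)` the two initial failures are masked and the caller simply receives `"ok"`; with `make_flaky(5)` the three attempts are exhausted and the caller receives a `RuntimeError`.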

The fault-tolerance design process is iterative
1. Begin the design of a fault-tolerant model.
   - Identify potential faults.
   - Estimate the risk of each one.
2. Design methods to detect the errors for the highest risk faults.
3. Design methods to deal with the errors for the highest risk faults.
   - Contain the damage from high risk errors through modularity.
   - Design procedures to contain the detected errors by:
     - Temporal redundancy (retry the operation)
     - Spatial redundancy (deploy multiple components)
4. Update the model to account for the error masking procedures.
5. Iterate until the probability of un-tolerated faults is small.
6. Observe the system in the real world.
   - Study the error logs.
   - Identify the cause of each error.
7. Use the information collected to improve the model and iterate again.
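Spatial redundancy is often combined with a voter that compares the outputs of the redundant components. The sketch below is one common way to do this (majority voting), offered as an illustration rather than as the method used in the lecture; the function name `majority_vote` is hypothetical, and replica outputs are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(replica_outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has a strict majority, so the error is
    reported to the caller instead of being passed on silently.
    """
    if not replica_outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if 2 * count <= len(replica_outputs):
        raise ValueError("no majority among replicas")
    return value
```

With three replicas, `majority_vote([7, 7, 9])` masks the single faulty output and returns 7, while `majority_vote([1, 2, 3])` raises because the error can be detected but not masked.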

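Measures of reliability
- TTF: time to failure
- MTTF: mean time to failure, $\text{MTTF} = \frac{1}{N} \sum_{i=1}^{N} \text{TTF}_i$
- TTR: time to repair
- MTTR: mean time to repair, $\text{MTTR} = \frac{1}{N} \sum_{i=1}^{N} \text{TTR}_i$
- MTBF: mean time between failures, $\text{MTBF} = \text{MTTF} + \text{MTTR}$
- $\text{Availability} = \text{MTTF} / \text{MTBF}$
- $\text{Down time} = 1 - \text{Availability} = \text{MTTR} / \text{MTBF}$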
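As a worked example, the measures above can be computed directly from observed failure and repair times. The function name `reliability_measures` and the sample values (in hours) are made up for illustration; they are not data from the lecture.

```python
def reliability_measures(ttf, ttr):
    """Compute MTTF, MTTR, MTBF, availability and down time.

    ttf: observed times to failure, ttr: observed times to repair,
    both non-empty lists in the same time unit.
    """
    mttf = sum(ttf) / len(ttf)      # MTTF = (1/N) * sum of TTF_i
    mttr = sum(ttr) / len(ttr)      # MTTR = (1/N) * sum of TTR_i
    mtbf = mttf + mttr              # MTBF = MTTF + MTTR
    availability = mttf / mtbf      # fraction of time the system is up
    down_time = mttr / mtbf         # equals 1 - availability
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": availability,
        "down_time": down_time,
    }


# Hypothetical observations, in hours.
measures = reliability_measures(ttf=[900, 1100, 1000], ttr=[8, 12, 10])
```

For these sample values MTTF is 1000 hours, MTTR is 10 hours, MTBF is 1010 hours, and the availability is 1000/1010, roughly 0.99.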

The conditional failure rate

Reliability functions
- Unconditional failure rate: f(t) = Pr(module fails between t and t + dt)
- Reliability: R(t) = Pr(module functions at time t given that it was functioning at time 0).
This function is memoryless.
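The standard relations among these functions can be written out as follows. The symbols $F(t)$ (probability of failure by time $t$), $h(t)$ (conditional failure rate), and $\lambda$ (a constant failure rate) are notation assumed here, not taken from the slides, and the memoryless property is shown only for the constant-rate case.

```latex
\begin{align*}
F(t) &= \int_0^t f(u)\,du, \qquad R(t) = 1 - F(t), \\
h(t) &= \frac{f(t)}{R(t)}, \\
\text{if } h(t) = \lambda:\quad R(t) &= e^{-\lambda t}, \qquad
\text{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}, \\
\Pr(\text{functions at } t+s \mid \text{functions at } t)
  &= \frac{R(t+s)}{R(t)} = e^{-\lambda s}.
\end{align*}
```

The last line does not depend on $t$: under a constant conditional failure rate, a module that has survived until time $t$ is as likely to survive a further interval $s$ as a new one, which is the sense in which the reliability function is memoryless.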
