Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Fault tolerant Measures.

Slides:



Advertisements
Similar presentations
Routing Complexity of Faulty Networks Omer Angel Itai Benjamini Eran Ofek Udi Wieder The Weizmann Institute of Science.
Advertisements

ECE 697: Real-Time Systems
1 Fault-Tolerant Computing Systems #6 Network Reliability Pattara Leelaprute Computer Engineering Department Kasetsart University
Copyright 2004 Koren & Krishna ECE655/DataRepl.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
1 MM3 - Reliability and Fault tolerance in Networks Service Level Agreements Jens Myrup Pedersen.
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Part 1 - Introduction.
School of Information University of Michigan Network resilience Lecture 20.
By: Swetha Kendyala Software Rejuvenation.
1 SANS Technology Institute - Candidate for Master of Science Degree 1 A Preamble into Aligning Systems Engineering and Information Security Risk Dr. Craig.
5/18/2015CPE 731, 4-Principles 1 Define and quantify dependability (1/3) How decide when a system is operating properly? Infrastructure providers now offer.
Reliable System Design 2011 by: Amir M. Rahmani
Reliability 1. Probability a product will perform as promoted for a given time period under given conditions Functional Failure: does not operate as designed.
On Modeling the Lifetime Reliability of Homogeneous Manycore Systems Lin Huang and Qiang Xu CUhk REliable computing laboratory (CURE) The Chinese University.
Review for Exam 4 School of Business Eastern Illinois University © Abdou Illia, Fall 2005.
1 Fault-Tolerant Consensus. 2 Failures in Distributed Systems Link failure: A link fails and remains inactive; the network may get partitioned Crash:
Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
Network Management 1 School of Business Eastern Illinois University © Abdou Illia, Fall 2006 (Week 16, Tuesday 12/5/2006)
Architecture and Real Time Systems Lab University of Massachusetts, Amherst An Application Driven Reliability Measures and Evaluation Tool for Fault Tolerant.
Dependability Evaluation. Techniques for Dependability Evaluation The dependability evaluation of a system can be carried out either:  experimentally.
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.2.1 FAULT TOLERANT SYSTEMS Part 2 – Canonical.
EEE499 Real Time Systems Software Reliability (Part II)
Error and Attack Tolerance of Complex Networks Albert, Jeong, Barabási (presented by Walfredo)
Software Testing and QA Theory and Practice (Chapter 15: Software Reliability) © Naik & Tripathy 1 Software Testing and Quality Assurance Theory and Practice.
NETWORK TOPOLOGY. WHAT IS NETWORK TOPOLOGY?  Network Topology is the shape or physical layout of the network. This is how the computers and other devices.
Lecture Objectives: 1)Draw a picture showing the connection between the processor and memory mapped IO devices. 2)Define the terms reliability, dependability,
System Testing There are several steps in testing the system: –Function testing –Performance testing –Acceptance testing –Installation testing.
Reliability and Fault Tolerance Setha Pan-ngum. Introduction From the survey by American Society for Quality Control [1]. Ten most important product attributes.
Handouts Software Testing and Quality Assurance Theory and Practice Chapter 15 Software Reliability
9/10/2015 IENG 471 Facilities Planning 1 IENG Lecture Schedule Design: The Sequel.
I/O – Chapter 8 Introduction Disk Storage and Dependability – 8.2 Buses and other connectors – 8.4 I/O performance measures – 8.6.
Lecture 03: Fundamentals of Computer Design - Trends and Performance Kai Bu
Software Reliability SEG3202 N. El Kadri.
Performance Evaluation of Computer Systems Introduction
FAULT-TOLERANT NETWORKS AND FAULT-TOLERANT ROUTING SONER DEDEOĞLU 10/12/
All the components of network are connected to the central device called “hub” which may be a hub, a router or a switch. There is no direct traffic between.
Lecture 2: Combinatorial Modeling CS 7040 Trustworthy System Design, Implementation, and Analysis Spring 2015, Dr. Rozier Adapted from slides by WHS at.
Vegard Joa Moseng – Student meeting. A LITTLE BIT ABOUT SYSTEM RELIABILITY:  Reliability: The ability of an item to perform a required function, under.
Ch. 1.  High-profile failures ◦ Therac 25 ◦ Denver Intl Airport ◦ Also, Patriot Missle.
Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.
The virtue of dependent failures in multi-site systems Flavio Junqueira and Keith Marzullo University of California, San Diego Workshop on Hot Topics in.
1 Fault Tolerant Computing Basics Dan Siewiorek Carnegie Mellon University June 2012.
Faults and fault-tolerance
Fault-Tolerant Computing Systems #4 Reliability and Availability
EML EML 4550: Engineering Design Methods Probability and Statistics in Engineering Design: Reliability Class Notes Hyman: Chapter 5.
COSC 3330/6308 Solutions to the Third Problem Set Jehan-François Pâris November 2012.
Reliability Failure rates Reliability
Copyright 2004 Koren & Krishna ECE655/Koren Part.8.1 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing ECE.
PROCESS RESILIENCE By Ravalika Pola. outline: Process Resilience  Design Issues  Failure Masking and Replication  Agreement in Faulty Systems  Failure.
Mean Time To Repair
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.12.1 FAULT TOLERANT SYSTEMS Part 12 - Networks.
CS203 – Advanced Computer Architecture Dependability & Reliability.
 Software reliability is the probability that software will work properly in a specified environment and for a given amount of time. Using the following.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
1 Introduction to Engineering Spring 2007 Lecture 16: Reliability & Probability.
PRODUCT RELIABILITY ASPECT RELIABILITY ENGG COVERS:- RELIABILITY MAINTAINABILITY AVAILABILITY.
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.4.1 FAULT TOLERANT SYSTEMS Part 4 – Analysis Methods Chapter 2 – HW Fault Tolerance.
Software Metrics and Reliability
Computer Network Collection of computers and devices connected by communications channels that facilitates communications among users and allows users.
Large Distributed Systems
Fault-Tolerant Computing Systems #5 Reliability and Availability2
Software Reliability PPT BY:Dr. R. Mall 7/5/2018.
Software Test Termination
Windows Azure 講師: 李智樺, Ruddy Lee
Reliability and Fault Tolerance
Reliability Failure rates Reliability
RAID Redundant Array of Inexpensive (Independent) Disks
T305: Digital Communications
Reliability.
Communication Driven Remapping of Processing Element (PE) in Fault-tolerant NoC-based MPSoCs Chia-Ling Chen, Yen-Hao Chen and TingTing Hwang Department.
Copyright 2004 Koren & Krishna ECE655/DataRepl.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
Presentation transcript:

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Fault tolerant Measures

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.2 Fault Tolerance Measures  It is important to have proper yardsticks - measures - by which to measure the effect of fault tolerance  A measure is a mathematical abstraction, which expresses only some subset of the object's nature  Measures? 

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.3 Traditional Measures - Reliability  Assumption: The system can be in one of two states: ‘’up” or ‘’down”  Examples:  Lightbulb - good or burned out  Wire - connected or broken  Reliability, R(t): Probability that the system is up during the whole interval [0,t], given it was up at time 0  Related measure - Mean Time To Failure, MTTF : Average time the system remains up before it goes down and has to be repaired or replaced

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.4 Traditional Measures - Availability  Availability, A(t) : Fraction of time system is up during the interval [0,t]  Point Availability, A p (t) : Probability that the system is up at time t  Long-Term Availability, A:  Availability is used in systems with recovery/repair  Related measures:  Mean Time To Repair, MTTR  Mean Time Between Failures, MTBF = MTTF + MTTR

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.5 Need For More Measures  The assumption of the system being in state ‘’up” or ‘’down” is very limiting  Example: A processor with one of its several hundreds of millions of gates stuck at logic value 0 and the rest is functional - may affect the output of the processor once in every 25,000 hours of use  The processor is not fault-free, but cannot be defined as being ‘’down”  More detailed measures than the general reliability and availability are needed

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.6 Computational Capacity Measures Example: N processors in a gracefully degrading system  System is useful as long as at least one processor remains operational  Let P i = Prob {i processors are operational}  Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute)  Computational capacity of i processors: C i = i  c  Average computational capacity of system:

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.7 Another Measure - Performability  Another approach - consider everything from the perspective of the application  Application is used to define ‘’accomplishment levels” L 1, L 2,...,L n  Each represents a level of quality of service delivered by the application  Example: L i indicates i system crashes during the mission time period T  Performability is a vector (P(L 1 ),P(L 2 ),...,P(L n )) where P(L i ) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level L i

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.8 Network Connectivity Measures  Focus on the network that connects the processors  Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected  Measure indicates how vulnerable the network is to disconnection  A network disconnected by the failure of just one (critically-positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.9 Connectivity - Examples

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.10 Network Resilience Measures  Classical connectivity distinguishes between only two network states: connected and disconnected  It says nothing about how the network degrades as nodes fail before becoming disconnected  Two possible resilience measures:  Average node-pair distance  Network diameter - maximum node-pair distance  Both calculated given probability of node and/or link failure