Chapter: Fault-Tolerant Design of Digital Systems



4.1 The Importance of Fault Tolerance
Fault-tolerant design can provide dramatic improvements in system availability and, because fewer system failures occur, a substantial reduction in maintenance costs.
Two different approaches to increasing reliability:
1. Fault prevention
2. Fault tolerance

4.2 Basic Concepts of Fault Tolerance
Fault-tolerant system: a system that has the built-in capability (without external assistance) to preserve the continued correct execution of its programs and input/output functions in the presence of a certain set of operational faults.
Types of faults:
a) Anticipated faults
b) Unanticipated faults

4.3 Static Redundancy
Also known as "masking redundancy". Two major techniques are employed:
1. Triple modular redundancy
2. Use of error-correcting codes

Triple Modular Redundancy (TMR)
TMR can be expanded to NMR (N-modular redundancy). An NMR system can tolerate up to $n$ module failures, where $n = (N-1)/2$. In general, $N$ is an odd number in an NMR system.
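The masking action of NMR can be illustrated with a small software sketch. The function names `max_tolerated_failures` and `majority_vote` are hypothetical, and a real voter is a hardware element rather than a Python function; the sketch only shows why up to $n = (N-1)/2$ faulty modules are outvoted.

```python
from collections import Counter


def max_tolerated_failures(n_modules: int) -> int:
    """Return n = (N - 1) / 2 for an odd number of modules N."""
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("N must be a positive odd number")
    return (n_modules - 1) // 2


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        # More than n modules disagree, so the fault can no longer be masked.
        raise RuntimeError("no majority: too many modules disagree")
    return value
```

For example, `majority_vote([1, 1, 0])` returns the agreed value of the two healthy modules even though the third module is faulty.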

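The reliability equation of an NMR system is:

$$R_{NMR} = \sum_{i=0}^{n} \binom{N}{i} (1 - R_M)^{i} R_M^{N-i}$$

For the TMR case, $N = 3$ and $n = 1$:

$$R_{TMR} = R_M^3 + 3 R_M^2 (1 - R_M) = 3 R_M^2 - 2 R_M^3$$

where $R_M$ is the reliability of each module.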

Note: $R_{TMR}$ can also be calculated in another way.
Exercise: Evaluate $R_{TMR}$ for $R_M$ = 0.6, 0.5, and 0.4.
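The exercise can be checked numerically. The sketch below assumes the standard binomial form of NMR reliability (independent module failures and a perfect voter), which for TMR reduces to $3R_M^2 - 2R_M^3$. The complement form in `tmr_reliability_complement` is one possible alternative calculation; it is an assumption and not necessarily the one the slide intended.

```python
from math import comb


def nmr_reliability(r_m: float, n_modules: int = 3) -> float:
    """Probability that at most n = (N - 1) / 2 of the N modules have failed."""
    n = (n_modules - 1) // 2
    return sum(
        comb(n_modules, i) * (1 - r_m) ** i * r_m ** (n_modules - i)
        for i in range(n + 1)
    )


def tmr_reliability_complement(r_m: float) -> float:
    """Same TMR result as 1 minus the probability that two or three modules fail."""
    q = 1 - r_m
    return 1 - (3 * r_m * q ** 2 + q ** 3)


for r_m in (0.6, 0.5, 0.4):
    print(f"R_M = {r_m:.1f} -> R_TMR = {nmr_reliability(r_m):.3f}")
# R_M = 0.6 -> R_TMR = 0.648
# R_M = 0.5 -> R_TMR = 0.500
# R_M = 0.4 -> R_TMR = 0.352
```

The three values show that TMR improves on a single module only when $R_M > 0.5$; at $R_M = 0.5$ the two are equal, and below that TMR is worse.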

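Reliability, MTBF, and Failure Rate
For a constant failure rate $\lambda$,

$$R(t) = e^{-\lambda t}, \qquad MTBF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}$$

Thus, for TMR,

$$R_{TMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}, \qquad MTBF_{TMR} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$$

where $R_M = e^{-\lambda t}$ is the reliability of each module.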
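The MTBF comparison can be reproduced by integrating the reliability curves numerically. This is a minimal sketch that assumes exponential module reliability $e^{-\lambda t}$ and the perfect-voter TMR expression $3e^{-2\lambda t} - 2e^{-3\lambda t}$; the helper `mtbf_numeric` and its parameters are hypothetical names chosen for illustration.

```python
import math


def simplex_reliability(t: float, lam: float) -> float:
    """Reliability of a single module with constant failure rate lam."""
    return math.exp(-lam * t)


def tmr_reliability(t: float, lam: float) -> float:
    """TMR reliability with a perfect voter: 3R^2 - 2R^3 with R = exp(-lam * t)."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)


def mtbf_numeric(reliability, lam: float, lifetimes: float = 40.0, steps: int = 100_000) -> float:
    """Trapezoidal integration of R(t) from 0 to lifetimes / lam."""
    t_end = lifetimes / lam
    h = t_end / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_end, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h
```

With `lam = 0.001` the integration gives roughly 1000 for the single module and roughly 833 for TMR, matching $1/\lambda$ and $5/(6\lambda)$. TMR therefore has a lower MTBF than a single module, even though it is more reliable for short missions.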

We should look for a more useful parameter than MTBF.
Other parameters for evaluating system reliability:

Reliability Improvement Factor (RIF):

$$RIF = \frac{1 - R_N}{1 - R_R}$$

where $1 - R_N$ is the probability of failure of the non-redundant system and $1 - R_R$ is the probability of failure of the redundant system.

Mission Time Improvement Factor (MTIF):

$$MTIF = \frac{T_R}{T_N}$$

where $R_f$ is some predetermined reliability (e.g. 0.90), and $T_R$ and $T_N$ are the times at which the system reliabilities $R_R(t)$ and $R_N(t)$, respectively, fall to the value $R_f$.
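Both factors can be evaluated for TMR against a single module. The sketch assumes $RIF = (1 - R_N)/(1 - R_R)$ and $MTIF = T_R/T_N$, exponential module reliability, and a perfect voter. The bisection helper `time_to_reach` is a hypothetical name, and it relies on the reliability function decreasing monotonically with time.

```python
import math


def rif(r_nonredundant: float, r_redundant: float) -> float:
    """Ratio of failure probabilities: non-redundant over redundant."""
    return (1 - r_nonredundant) / (1 - r_redundant)


def time_to_reach(reliability, r_f: float, t_hi: float, tol: float = 1e-12) -> float:
    """Bisection for the time at which a decreasing reliability(t) falls to r_f."""
    lo, hi = 0.0, t_hi
    while hi - lo > tol * t_hi:
        mid = (lo + hi) / 2
        if reliability(mid) > r_f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mtif_tmr(r_f: float, lam: float) -> float:
    """MTIF of TMR over a single module, both with failure rate lam."""
    t_n = -math.log(r_f) / lam  # closed form: exp(-lam * t) = r_f

    def tmr(t: float) -> float:
        return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

    t_r = time_to_reach(tmr, r_f, t_hi=10 / lam)
    return t_r / t_n
```

For $R_f = 0.90$, `mtif_tmr` comes out slightly above 2, so TMR roughly doubles the mission time over which reliability stays above 0.90. At a module reliability of 0.9, TMR reaches 0.972, which gives `rif(0.9, 0.972)` of about 3.6.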

The Reliability of the Voter Element
If the voter has the reliability $R_v$, then the reliability of the TMR system becomes:

$$R_{TMR} = R_v (3 R_M^2 - 2 R_M^3)$$

where $R_v$ is the reliability of the voter. If $R_v$ is too low, the reliability of the system is less than that of the original system for all $t$. Thus, we have to improve the reliability of the voter.
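How reliable the voter must be can be explored numerically. The sketch assumes the voter acts in series with the triplicated modules, so that the system reliability is $R_v(3R_M^2 - 2R_M^3)$. Under that assumption, TMR can beat a single module only if $R_v(3R_M - 2R_M^2) > 1$ for some $R_M$. The bracket peaks at $9/8$ when $R_M = 3/4$, which gives a threshold of $R_v > 8/9 \approx 0.889$. The function names are hypothetical.

```python
def tmr_with_voter(r_m: float, r_v: float) -> float:
    """System reliability with an imperfect voter in series with the TMR core."""
    return r_v * (3 * r_m ** 2 - 2 * r_m ** 3)


def beats_simplex_somewhere(r_v: float, grid: int = 10_000) -> bool:
    """True if some module reliability makes TMR better than a single module."""
    return any(
        tmr_with_voter(k / grid, r_v) > k / grid
        for k in range(1, grid)
    )
```

With a voter reliability of 0.88 no module reliability makes TMR worthwhile, whereas with 0.90 a narrow band around $R_M = 0.75$ does.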

The Major Advantages of the TMR Scheme
1. The fault-masking action occurs immediately; both temporary and permanent faults are masked.
2. No separate fault detection is necessary before masking.
3. The conversion from a non-redundant system to a TMR system is straightforward.

Dynamic Redundancy
A system with dynamic redundancy consists of several modules, only one of which operates at a time. If a fault is detected in the operating module, that module is switched out and replaced by a spare. This requires the consecutive actions of fault detection and fault recovery. A dynamic redundant system with $S$ spares has a reliability:

$$R = 1 - (1 - R_m)^{S+1}$$

where $R_m$ is the reliability of each module, active or spare, in the system. This reliability function assumes that fault detection and the switchover mechanism are perfect.
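A short calculation shows how quickly reliability grows with the number of spares in this idealized model. The sketch assumes the expression $R = 1 - (1 - R_m)^{S+1}$, that is, the system fails only when the active module and all $S$ spares have failed, with perfect detection and switchover. The function name is hypothetical.

```python
def standby_reliability(r_m: float, spares: int) -> float:
    """Reliability of one active module plus `spares` standby modules."""
    if spares < 0:
        raise ValueError("number of spares must be non-negative")
    return 1 - (1 - r_m) ** (spares + 1)


for s in range(4):
    print(f"S = {s}: R = {standby_reliability(0.9, s):.4f}")
# S = 0: R = 0.9000
# S = 1: R = 0.9900
# S = 2: R = 0.9990
# S = 3: R = 0.9999
```

Each extra spare removes a smaller absolute amount of unreliability than the previous one, so the benefit of additional spares diminishes rapidly.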

In this idealized model, the reliability $R$ is an increasing function of the number of spare modules. However, the use of too many spares may have a detrimental effect on the system reliability. Losq has shown that for every dynamic redundant system there exists a finite best number of spares for a given mission time:
When the mission time is extremely short, one spare is best.
When the mission time is less than one-tenth of the simplex (i.e. non-redundant) mean life, five spares or fewer are best.

The detection of a fault in the individual modules of a dynamic system can be achieved by using one of the following techniques:
1. Periodic tests: performed offline. Disadvantage: they cannot detect temporary faults unless these occur while the module is being tested.
2. Self-checking circuits: provide a very cost-effective method of fault detection.
3. Watchdog timers: use a timer and checkpoints.

Reconfiguration: switching out the faulty element and selecting the system output to come from one of the alternative modules.
Retry: repeating the operation so that a module will not be removed because of a temporary fault.
Self-repair: the replacement is invisible to the user and the system continues its operation uninterrupted.

In general, dynamic redundant systems can be divided into two categories:
a) Cold-standby systems
b) Hot-standby systems
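The interplay of detection, retry, and reconfiguration described above can be sketched in software. Everything here is an illustrative assumption: modules are modeled as Python callables, fault detection is modeled as a `FaultDetected` exception raised by a hypothetical self-check, and `run_with_standby` is a made-up name rather than an established API.

```python
class FaultDetected(Exception):
    """Raised by a module's (hypothetical) self-check when it detects a fault."""


def run_with_standby(modules, request, retries: int = 1):
    """Run the active module; retry on a fault, then switch to the next spare.

    Returns (index of the module that answered, result).
    """
    for index, module in enumerate(modules):
        for _attempt in range(retries + 1):
            try:
                return index, module(request)
            except FaultDetected:
                # Retry: the fault may be temporary, so keep the module for now.
                continue
        # Retries exhausted: treat the fault as permanent and reconfigure
        # by moving on to the next spare in the list.
    raise RuntimeError("all modules failed")
```

A module that fails once and then recovers keeps its place thanks to the retry, while a module that fails repeatedly is switched out in favor of the next spare.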