FTC (DS) - V - TT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Lecture 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)


FTC (DS) - V - TT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Lecture 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL) Winter semester 1999/2000 Instructor: Prof. Dr. Miroslaw Malek

FTC (DS) - V - TT - 1 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM LEVEL)
OBJECTIVE:
–TO INTRODUCE MAIN FAULT RECOVERY AND FAULT TOLERANCE TECHNIQUES FOR COMPUTER SYSTEMS
CONTENTS:
–DYNAMIC TECHNIQUES
–STATIC TECHNIQUES
–HYBRID TECHNIQUES

FTC (DS) - V - TT - 2 FAULT RECOVERY TECHNIQUES
FAULT RECOVERY IS INITIATED BY SUCCESSFUL FAULT DETECTION AND/OR FAULT LOCATION
HARDWARE RECOVERY TECHNIQUES INCLUDE:
–REPLACEMENT/REPAIR
–RECONFIGURATION
–FAULT MASKING
SOFTWARE RECOVERY TECHNIQUES INCLUDE:
–EXCEPTION HANDLING
–RECOVERY BLOCKS
–MASKING (N-VERSION PROGRAMMING)
–ROLL-BACKWARD
–ROLL-FORWARD

FTC (DS) - V - TT - 3 SYSTEM REPLICATION METHODS
DYNAMIC
–DUPLEX
–BACK-UP SPARING
–DUPLEX AND SPARE
–PAIR AND SPARE
–SOFTWARE-IMPLEMENTED FAULT TOLERANCE (SIFT)
STATIC
–TRIPLE MODULAR REDUNDANCY (TMR)
–N MODULAR REDUNDANCY (NMR)
–(4-2) CONCEPT
–SPECIAL LOGIC
–TMR WITH DUPLEX MODULES
HYBRID
–HYBRID REDUNDANCY (NMR WITH SPARES)
–TMR WITH TWO SPARES (SPACE SHUTTLE)
–SELF-PURGING REDUNDANCY
–SIFT-OUT MODULAR REDUNDANCY

FTC (DS) - V - TT - 4 DUPLEX SYSTEMS (1)
[Figure: duplex system, from Siewiorek and Swarz. Block labels: INPUT, PRIMARY UNIT (P1), SECONDARY UNIT (P2), OUTPUT 1, OUTPUT 2, COMPARATOR, Test and Reconfigure, OUTPUT SWITCH, OUTPUT.]

FTC (DS) - V - TT - 5 DUPLEX SYSTEMS (2)
If a mismatch occurs, the following methods can be used to identify the faulty system:
–Self-diagnostic program
–Self-checking logic (capabilities)
–Watchdog timer method (each processor periodically resets the timer of the other)
–Outside arbiter (may check signatures or run tests)
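As an illustration of the first method, the mismatch handling might be sketched as follows. This is only a minimal sketch: `resolve_duplex` and the two self-test callables are hypothetical names introduced here, and the self-diagnostic programs are assumed to return True when a unit believes it is healthy.

```python
def resolve_duplex(out1, out2, self_test1, self_test2):
    """Pick the output of a duplex pair (hypothetical helper, not from the slides).

    out1, out2   -- outputs of the primary and secondary unit
    self_test1/2 -- callables standing in for each unit's self-diagnostic
                    program; assumed to return True if the unit passes
    """
    # Comparator: matching outputs are passed through unchanged.
    if out1 == out2:
        return out1
    # Mismatch: run both self-diagnostics to locate the faulty unit.
    ok1, ok2 = self_test1(), self_test2()
    if ok1 and not ok2:
        return out1
    if ok2 and not ok1:
        return out2
    # Both pass or both fail: the comparator alone cannot decide.
    raise RuntimeError("mismatch could not be resolved by self-diagnosis")


# The secondary unit fails its self-test, so the primary output is used.
print(resolve_duplex(7, 9, lambda: True, lambda: False))  # 7
```

The unresolved case is where an outside arbiter or a watchdog timer would have to take over.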

FTC (DS) - V - TT - 6 DUPLEX SYSTEMS (3)
SYNCHRONIZATION METHODS
–At the end of each clock period (cycle or microcycle) (e.g., ESS systems, UDET)
–Update and match unit (UPM) compares every bus cycle (e.g., AXE telephone switching system)
–At the end of program execution: program or subroutine level comparison (e.g., COMTRAC railway control system)
RELIABILITY OF DUPLEX SYSTEMS
–C: coverage factor (represents the combined probability of successful fault detection and reconfiguration)
–R_k: reliability of the control, switching and matching circuitry
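The transcript defines C and R_k but does not preserve a formula that uses them. One commonly cited duplex model, shown here as an assumption and not as the slide's own equation, is R_duplex = R_k [R_m^2 + 2 C R_m (1 - R_m)], where R_m is the reliability of a single unit. The function name is illustrative.

```python
def duplex_reliability(r_m, c, r_k=1.0):
    """Duplex reliability under an ASSUMED textbook model (not from the transcript).

    r_m -- reliability of one unit
    c   -- coverage factor: probability that a fault is detected and the
           system reconfigures successfully
    r_k -- reliability of the control, switching and matching circuitry
    """
    both_ok = r_m ** 2                           # neither unit has failed
    one_failed_covered = 2 * c * r_m * (1 - r_m)  # one unit failed, recovery worked
    return r_k * (both_ok + one_failed_covered)


# Perfect coverage versus no coverage for units with R_m = 0.9.
print(round(duplex_reliability(0.9, c=1.0), 4))
print(round(duplex_reliability(0.9, c=0.0), 4))
```

Under this model, with C = 0 the pair is no better than two units in series, which is why the coverage factor dominates the benefit of duplication.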

FTC (DS) - V - TT - 7 DUPLEX SYSTEMS (4)
Back-up Sparing
[Figure: back-up sparing. Block labels: INPUT, MODULES 1, 2, ..., n, SWITCH, OUTPUT.]
HOT, WARM AND COLD SPARES

FTC (DS) - V - TT - 8 DUPLEX AND SPARE
[Figure: duplex and spare. Block labels: INPUT, MODULES, COMPARATOR, SWITCH, OUTPUT.]

FTC (DS) - V - TT - 9 PAIR AND SPARE
[Figure: pair and spare. Block labels: INPUT, MODULES (numbered up to 4), COMPARATOR, SWITCH/COMPARATOR, OUTPUT.]

FTC (DS) - V - TT - 10 TRIPLE MODULAR REDUNDANCY (TMR) (1)
–A method that incorporates static redundancy into system design
–The voter produces the correct output if the voter itself has not failed and at least two of the three modules are fault-free
[Figure: Triple Modular Redundancy (TMR) configuration. The input feeds modules A, B and C, whose outputs go to the voter; the voter drives the output.]
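The voter's behaviour can be sketched in a few lines. This is a minimal software illustration, assuming module outputs are directly comparable values; `majority_vote` is a name introduced here and a hardware voter would operate on signals instead.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Module B is faulty; A and C still agree, so the fault is masked.
print(majority_vote([42, 17, 42]))  # 42
```

Returning None when no majority exists makes the unmaskable case explicit: with three modules, two disagreeing faults defeat the voter.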

FTC (DS) - V - TT - 11 TRIPLE MODULAR REDUNDANCY (TMR) (2)
Reliability
–R_TMR = R_V × (reliability of 2 out of 3 modules)
–R_V: reliability of the voter
–R_m: reliability of each module
When does a TMR system have a higher reliability than the original single module?
–Must have R_TMR > R_m
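The "2 out of 3" term can be written out explicitly. Assuming the three modules fail independently and each has reliability R_m, the probability that at least two work is R_m^3 + 3 R_m^2 (1 - R_m) = 3 R_m^2 - 2 R_m^3. A short sketch (the function name is illustrative):

```python
def tmr_reliability(r_m, r_v=1.0):
    """R_TMR = R_V * (3 R_m^2 - 2 R_m^3), assuming independent module failures."""
    all_three_work = r_m ** 3
    exactly_two_work = 3 * r_m ** 2 * (1 - r_m)
    return r_v * (all_three_work + exactly_two_work)


# Compare TMR (perfect voter) with a single module at a few reliabilities.
for r_m in (0.3, 0.5, 0.9):
    print(r_m, round(tmr_reliability(r_m), 4))
```

At R_m = 0.5 the two are equal; below that point TMR is worse than the single module because a majority of faulty modules becomes the more likely outcome.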

FTC (DS) - V - TT - 12 TRIPLE MODULAR REDUNDANCY (TMR) (3)
–Assuming a perfect voter (R_V = 1), TMR is more reliable only if R_m > 0.5
–The voter must also be very reliable: must have R_V > 0.9 for R_TMR > R_m
–This technique can be generalized to any odd number of modules N
[Plot: system reliability R_sys versus module reliability R_m, comparing a single module with TMR.]
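Both thresholds can be checked against the 2-of-3 model R_TMR = R_V (3 R_m^2 - 2 R_m^3), assuming independent module failures. With a perfect voter, R_TMR > R_m reduces to (2 R_m - 1)(R_m - 1) < 0, which holds for 0.5 < R_m < 1. With an imperfect voter the condition becomes R_V > 1 / (3 R_m - 2 R_m^2); the right-hand side is smallest at R_m = 0.75, where it equals 8/9 (about 0.889), consistent with the rounded figure of 0.9 on the slide. A numerical sketch, with illustrative names:

```python
def min_voter_reliability(r_m):
    """Voter reliability at which R_V * (3 r^2 - 2 r^3) equals r_m.

    TMR beats a single module only when R_V exceeds this value.
    """
    return 1.0 / (3 * r_m - 2 * r_m ** 2)


# Scan module reliabilities on a 0.001 grid for the least demanding case.
grid = [i / 1000 for i in range(1, 1000)]
best_r_m = min(grid, key=min_voter_reliability)
print(best_r_m, round(min_voter_reliability(best_r_m), 4))
```

For any voter with R_V below 8/9, no choice of module reliability makes TMR better than a single module under this model.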

FTC (DS) - V - TT - 13 TMR WITH DUPLEX MODULES (USED IN JAPANESE TRAIN SHINKANSEN)
[Figure: TMR with duplex modules. Block labels: INPUT, MODULES, COMPARATORS / SWITCHERS, VOTER, OUTPUT.]

FTC (DS) - V - TT - 14 HYBRID REDUNDANT SYSTEM (1)
–One of the drawbacks of N-modular redundancy with voting (NMR) is that fault masking ability deteriorates as more copies fail.
–Hybrid redundancy combines NMR with backup sparing.
[Figure: basic organization of a hybrid-redundant system (Siewiorek & Swarz). N + S functional units M_1, M_2, M_3, ..., M_(N+S) feed a switch that selects N out of (N + S); the N selected outputs go to a voter that produces the voted output. A disagreement detector drives the switch's control lines; voter, switch and detector together form the Voter-Switch-Detector (VSD).]

FTC (DS) - V - TT - 15 HYBRID REDUNDANT SYSTEM (2)
Assuming the same reliability of modules on-line and on standby, the system reliability is:
P = ⌊N/2⌋ + S = the maximum number of modules that can fail without crashing the system
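Given the definition of P, one way to compute the system reliability is as a binomial sum: the system survives as long as at most P of the N + S modules have failed. The sketch below assumes independent failures, identical reliability R_m for on-line and standby modules, and a perfect voter-switch-detector. It is a reconstruction from the stated assumptions, not necessarily the exact expression shown on the slide, and the function name is illustrative.

```python
from math import comb


def hybrid_reliability(r_m, n, s):
    """Reliability of NMR with s spares: P(at most floor(n/2) + s modules failed).

    Assumes independent failures, equal on-line and standby reliability r_m,
    and a perfect voter-switch-detector.
    """
    total = n + s
    p = n // 2 + s  # maximum tolerable number of failed modules
    return sum(
        comb(total, i) * (1 - r_m) ** i * r_m ** (total - i)
        for i in range(p + 1)
    )


# Hybrid TMR (n = 3) with a growing number of spares, for R_m = 0.9.
for spares in (0, 1, 2):
    print(spares, round(hybrid_reliability(0.9, 3, spares), 6))
```

With S = 0 the sum collapses to the plain TMR expression 3 R_m^2 - 2 R_m^3.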

FTC (DS) - V - TT - 16
[Figure: plots of hybrid TMR system reliability (R_S) vs. individual module reliability (R_M); S is the number of spares (Siewiorek and Swarz). Panel a: system with standby failure rate equal to on-line failure rate. Panel b: system with standby failure rate 10% of on-line failure rate. Curves are shown for simplex, S = 0 (TMR), and larger numbers of spares.]

FTC (DS) - V - TT - 17 SELF-PURGING REDUNDANCY
[Figure: system using self-purging redundancy (Siewiorek and Swarz).]
–Potentially more reliable than hybrid
–Threshold gates are analog circuit elements

FTC (DS) - V - TT - 18 SIFT-OUT MODULAR REDUNDANCY
(N-2)-fault-tolerant
BASIC CONCEPT:
–COMPARE EACH PAIR AND ELIMINATE FAULTY UNITS
[Figure: basic configuration for sift-out redundancy. N redundant modules M_1, M_2, ..., M_N operate synchronously under a common clock. A comparator produces C(N, 2) lines E_12, E_13, ..., E_(N-1)N, each signaling the disagreement of a pair of modules. A detector produces N lines F_1, F_2, ..., F_N; line F_i signals the failure of module M_i. A collector produces the output.]
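The compare-and-eliminate idea can be sketched in software. This is a simplified model, not the circuit from the figure: it assumes a module is declared faulty when it disagrees with every other module still in service, and that the collector forwards the output of any surviving module. The names `sift_out`, `e` and `failed` are introduced here to mirror the E_ij and F_i lines.

```python
from itertools import combinations


def sift_out(outputs, failed=None):
    """One clock step of a simplified sift-out scheme.

    outputs -- list of module outputs for this step
    failed  -- indices of modules already sifted out in earlier steps
    Returns (system_output, updated_failed_set).
    """
    failed = set(failed or ())
    active = [i for i in range(len(outputs)) if i not in failed]

    # Comparator: e[(i, j)] is True when active modules i and j disagree.
    e = {(i, j): outputs[i] != outputs[j] for i, j in combinations(active, 2)}

    # Detector: flag module i if it disagrees with every other active module.
    for i in active:
        others = [j for j in active if j != i]
        if others and all(e[(min(i, j), max(i, j))] for j in others):
            failed.add(i)

    # Collector: forward the output of any module that is still in service.
    survivors = [i for i in range(len(outputs)) if i not in failed]
    if not survivors:
        raise RuntimeError("no two modules agree; output cannot be trusted")
    return outputs[survivors[0]], failed


# Four modules: module 2 fails first, then module 1 on a later step.
out, failed = sift_out([5, 5, 9, 5])
print(out, sorted(failed))  # 5 [2]
out, failed = sift_out([5, 7, 9, 5], failed)
print(out, sorted(failed))  # 5 [1, 2]
```

In this model the output stays correct as long as two fault-free modules remain to agree with each other, which matches the (N-2)-fault-tolerant claim.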

FTC (DS) - V - TT - 19 TMR WITH TWO SPARES (USED IN SPACE SHUTTLE)
[Figure: TMR with two spares. Block labels: INPUT, MODULES 1 to 5, VOTER / SWITCH, OUTPUT.]
–PRIMARY MODULES 1, 2 and 3
–"WARM" SPARE 4
–"COLD" SPARE 5