7. Fault Tolerance Through Dynamic or Standby Redundancy


7.5 Forward Recovery Systems

Upon detecting a failure, the system discards the current erroneous state and determines the correct state without any loss of computation. There are two approaches:

a) Hardware redundancy
   – Static redundancy
   – Dynamic redundancy
b) Software redundancy

7.5.1 Static Redundancy Approaches

There are three approaches to masking failures:

– Active Masking Redundancy
– Active Redundancy Using Fail-Stop Modules
– Active Redundancy Using Self-Diagnosis

Active Masking Redundancy: Uses a level of replication adequate to tolerate the failures and votes on the outputs of all the replicas. E.g., TMR (Triple Modular Redundant) systems mask a single failure without any performance loss.
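The voting step of a TMR system can be sketched as follows. This is a minimal illustration, not taken from the slides: the replica functions `good` and `faulty` and the helper names are hypothetical, and the voter itself is assumed to be fault-free.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x: int) -> int:
    # Hypothetical fault-free replica.
    return x * x


def faulty(x: int) -> int:
    # Hypothetical replica that produces a wrong result.
    return x * x + 1


# One faulty replica out of three is outvoted by the other two.
result = run_tmr([good, faulty, good], 7)
```

With three replicas a single wrong output is masked immediately; if all three outputs differ, the sketch raises an error instead of guessing.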

Active Redundancy Using Fail-Stop Modules: Multiple processor modules actively execute each process. Each processor is itself assumed to be fail-stop. Thus, if one of the processors fails, it stops executing, and the other processors executing the task continue functioning without any performance penalty, even in the presence of failures.

E.g., in a given system each subsystem is duplicated to form a pair, and one of the replicas is designated as the spare. Each subsystem and its spare are themselves made self-checking by replication, so the HW is replicated four times in total. All four copies of the HW are tightly synchronized. When the self-checking mechanism of a subsystem detects a fault, the subsystem disconnects itself and the spare takes over its service without any interruption or rollback.
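The pair-and-spare arrangement described above can be sketched in a few lines. This is a simplified model under stated assumptions: the class `SelfCheckingPair` and the function `pair_and_spare` are hypothetical names, each HW copy is modelled as a plain function, and tight synchronization is modelled by having both pairs execute every step.

```python
from typing import Callable, Optional

Copy = Callable[[int], int]


class SelfCheckingPair:
    """Two copies of a subsystem whose outputs are compared on every step."""

    def __init__(self, copy_a: Copy, copy_b: Copy) -> None:
        self.copies = (copy_a, copy_b)
        self.connected = True

    def step(self, x: int) -> Optional[int]:
        """Return the agreed output, or None once the pair has disconnected."""
        if not self.connected:
            return None
        a, b = self.copies[0](x), self.copies[1](x)
        if a != b:
            # Fail-stop behaviour: on a mismatch the pair disconnects itself.
            self.connected = False
            return None
        return a


def pair_and_spare(primary: SelfCheckingPair, spare: SelfCheckingPair, x: int) -> int:
    """Both pairs execute the step; the spare's output is used if the primary stops."""
    primary_out = primary.step(x)
    spare_out = spare.step(x)
    if primary_out is not None:
        return primary_out
    if spare_out is not None:
        return spare_out
    raise RuntimeError("both pairs have failed")


# Hypothetical example: one copy in the primary pair is faulty.
primary = SelfCheckingPair(lambda x: x + 1, lambda x: x + 2)
spare = SelfCheckingPair(lambda x: x + 1, lambda x: x + 1)
output = pair_and_spare(primary, spare, 1)
```

Because the spare has already computed the same step, its output is available at once, which is why no rollback is needed.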

Active Redundancy Using Self-Diagnosis: Analogous to the approach using fail-stop modules, except that self-diagnosis tasks, rather than a concurrent self-checking mechanism, are used to identify the faulty processor.

E.g., the reconfigurable duplication mechanism, in which the process is replicated on two processors whose outputs are continuously compared. A mismatch indicates that at least one of the processors in the pair has failed; each processor then runs self-diagnostic tasks to determine whether it has failed. Once the faulty processor is identified, the output of the fault-free processor can be accepted as correct.

The use of self-diagnostic tasks instead of concurrent self-checking results in a slight computation overhead for determining the faulty processor after a fault is detected.
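The compare-then-diagnose flow of reconfigurable duplication can be sketched as below. Everything named here is an assumption made for illustration: the `Processor` class, the `DIAGNOSTICS` table of inputs with known answers, and the choice of doubling as the replicated process. A real self-diagnosis routine would exercise the hardware far more thoroughly.

```python
from typing import Callable, List, Tuple

# Hypothetical diagnostic tasks: inputs with known correct answers for the
# replicated process (here assumed to be "double the input").
DIAGNOSTICS: List[Tuple[int, int]] = [(0, 0), (3, 6), (10, 20)]


class Processor:
    def __init__(self, name: str, compute: Callable[[int], int]) -> None:
        self.name = name
        self.compute = compute

    def run(self, x: int) -> int:
        return self.compute(x)

    def self_diagnose(self) -> bool:
        """Return True if every diagnostic task yields the expected answer."""
        return all(self.compute(x) == expected for x, expected in DIAGNOSTICS)


def duplex_step(p: Processor, q: Processor, x: int) -> int:
    """Compare both outputs; on a mismatch, run self-diagnosis to pick one."""
    out_p, out_q = p.run(x), q.run(x)
    if out_p == out_q:
        return out_p
    # The extra computation happens only after a mismatch is detected.
    healthy = [(proc, out) for proc, out in ((p, out_p), (q, out_q))
               if proc.self_diagnose()]
    if len(healthy) != 1:
        raise RuntimeError("self-diagnosis could not isolate a single faulty processor")
    return healthy[0][1]


ok = Processor("P1", lambda x: 2 * x)
bad = Processor("P2", lambda x: 2 * x + 1)  # hypothetical faulty processor
accepted = duplex_step(ok, bad, 4)
```

The diagnosis cost is paid only on the mismatch path, which matches the slight overhead noted above.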

7.5.2 Dynamic Redundancy Approaches

Forward recovery schemes based on dynamic redundancy and checkpointing try to avoid rollback even in the presence of failures. The fault is thus tolerated without the performance penalty of a rollback.

E.g., consider a duplex system that detects failures by periodically checkpointing its two modules and comparing their states. When a failure is detected, the roll-forward checkpointing scheme tries to determine which of the two processing modules, if any, is fault-free.
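One possible reading of the duplex example is sketched below, assuming a spare module is available to retry the disputed interval. The sketch is sequential and heavily simplified: module state is a single integer, `step` stands for one checkpoint interval of work, the fault is injected by hand, and the names `resolve_mismatch` and `run_duplex` are hypothetical. In an actual roll-forward scheme the retry would run concurrently with the next interval.

```python
from typing import Callable, List, Tuple


def resolve_mismatch(checkpoint: int, state_a: int, state_b: int,
                     step: Callable[[int], int]) -> Tuple[str, int]:
    """Retry the interval on the spare and pick the module that agrees with it."""
    spare_state = step(checkpoint)  # spare restarts from the last agreed checkpoint
    if spare_state == state_a:
        return "roll-forward", state_a  # A was fault-free; B is given A's state
    if spare_state == state_b:
        return "roll-forward", state_b  # B was fault-free; A is given B's state
    return "rollback", checkpoint       # no fault-free module could be identified


def run_duplex(intervals: int, step: Callable[[int], int],
               fault_in_b_at: int) -> Tuple[int, List[str]]:
    """Simulate modules A and B, checkpointing and comparing after each interval."""
    checkpoint = 0
    log: List[str] = []
    for i in range(intervals):
        state_a = step(checkpoint)
        state_b = step(checkpoint)
        if i == fault_in_b_at:
            state_b += 100  # injected transient fault in module B
        if state_a == state_b:
            checkpoint = state_a
            log.append("ok")
        else:
            action, checkpoint = resolve_mismatch(checkpoint, state_a, state_b, step)
            log.append(action)
    return checkpoint, log


final_state, log = run_duplex(4, lambda s: s + 1, fault_in_b_at=1)
```

In this run the fault in the second interval is resolved by copying the fault-free state forward, so the computation still finishes after four intervals.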

[Figure: Concurrent retry in the Roll-Forward Checkpointing Scheme (RFCS).]

Variations of the RFCS may assume that each module has a built-in fault detection capability, such as parity checks or exception detection. Thus, four different scenarios can be conceptualized:

                                         Resources used
Recovery strategy                        With spare           No spare
Optimistic (only single faults)          Roll-forward (I)     Rollback (I)*
Pessimistic (double faults may occur)    Roll-forward (II)    Rollback (II)

Table: Three different recovery schemes (* no built-in fault detection capability included).

Optimistic scheme with or without spare: Roll-forward (I).

In an optimistic recovery strategy, one trusts the built-in detection capability to the fullest extent. This scheme does not require a spare, even though one may be available.

[Figure: Roll-forward (I) for modules A and B over checkpoint intervals I1 and I2.]

Pessimistic schemes.

In the pessimistic recovery strategy, a more conservative action is taken: although module B is already suspected to be faulty, A might have experienced a failure during I1 that escaped the built-in detection capability.

[Figure: Pessimistic scheme with spare, rolling forward with all single faults.]
[Figure: Pessimistic scheme with spare, rolling back with double faults.]

[Figure: Three different roll-forward schemes. Axes: performance, reliability.]

The ideal curve 1 is preferred because it allows a small reduction in reliability to be traded off against a large gain in performance. (This is the case for the optimistic recovery strategies.)

– Generally, the mean completion time, given that a failure has occurred, is lower for the roll-forward scheme under both optimistic and pessimistic strategies.
– Without any failure, all the schemes perform similarly.
– When there is no built-in detection capability, the pessimistic scheme and the corresponding optimistic scheme have identical reliabilities. Without built-in detection, the faulty module can be identified only by comparing the operating modules with the spare.
– When there is 100% fault detection, the schemes with and without a spare have identical reliabilities.

Note: λ = failure rate; c = detection coverage (indicates the degree of built-in detection capability); n = number of checkpoint intervals.

[Figure: Performance comparison between optimistic and pessimistic schemes: mean completion time, given a fault. The optimistic scheme is better.]
[Figure: Reliability comparison between optimistic and pessimistic schemes. The pessimistic scheme is better.]
Legend labels: Rollback, Optimistic, Roll-forward, Pessimistic.

One of the important advantages of a roll-forward scheme is the minimal degradation in I/O performance. In the rollback scheme the delay is permanent: all outputs after I1 experience a delay of one checkpoint interval.

[Figure: Permanent delay in rollback scheme outputs in the event of a fault.]

In the roll-forward scheme, the outputs x and y are the only ones delayed; all other outputs occur at their regularly scheduled intervals.

[Figure: Temporary delay in roll-forward scheme outputs in the event of a fault. Modules A and B run intervals I1 to I6; the spare is activated for I1 and I2 and then released; x, y, z, w, v are system outputs.]
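The difference between a permanent and a temporary output delay can be made concrete with a small timing model. The model is an assumption made for illustration only: intervals have unit length, interval i nominally emits its output at time i, a single fault is detected at the end of interval `fault_interval`, each delayed output is late by one interval, and the hypothetical parameter `held` is the number of outputs the roll-forward scheme holds back (two, matching x and y).

```python
from typing import List


def rollback_output_times(n: int, fault_interval: int) -> List[int]:
    """Rollback re-executes the faulty interval, so every later output slips by one."""
    return [i if i < fault_interval else i + 1 for i in range(1, n + 1)]


def roll_forward_output_times(n: int, fault_interval: int, held: int = 2) -> List[int]:
    """Roll-forward delays only the outputs held while the spare is active."""
    return [i + 1 if fault_interval <= i < fault_interval + held else i
            for i in range(1, n + 1)]


def delays(times: List[int]) -> List[int]:
    """Delay of each output relative to its nominal time (interval index)."""
    return [t - i for i, t in enumerate(times, start=1)]


rollback_delays = delays(rollback_output_times(6, fault_interval=1))
roll_forward_delays = delays(roll_forward_output_times(6, fault_interval=1))
```

Under these assumptions the rollback delays never return to zero, while the roll-forward delays do once the spare is released.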

[Figure: Forward Recovery Using Checkpointing.]

7.5.3 Software Redundancy-Based Approach for Forward Error Recovery

The previous approaches primarily require HW redundancy (+300%). This approach requires a certain degree of SW redundancy as well as HW redundancy. SW redundancy is implemented by using recovery blocks. Recovery blocks are a language construct that supports the incorporation of program redundancy into a fault-tolerant program in a concise and easily readable form.

The syntax of the recovery block is:

    ensure T
        by B1
        else by B2
        ...
        else by Bn
        else error

where T is the acceptance test, B1 denotes the primary try block, and Bk (2 ≤ k ≤ n) denotes the (k – 1)th alternate try block.
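The recovery block construct can be emulated in a general-purpose language. The sketch below is one such emulation, not the construct as defined by any particular language: the function `recovery_block`, the sorting example, and the acceptance test are all hypothetical, and each try block is assumed to start from a fresh copy of the input state.

```python
import copy
from typing import Callable, List, Sequence


def recovery_block(acceptance_test: Callable[[List[int]], bool],
                   try_blocks: Sequence[Callable[[List[int]], List[int]]],
                   state: List[int]) -> List[int]:
    """ensure T by B1 else by B2 ... else by Bn else error."""
    for block in try_blocks:
        candidate = copy.deepcopy(state)  # every try starts from the entry state
        try:
            result = block(candidate)
        except Exception:
            continue  # a crashing try block counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("recovery block failed: every try block was rejected")


data = [3, 1, 2]


def primary(xs: List[int]) -> List[int]:
    # Hypothetical faulty primary: sorts but loses the largest element.
    return sorted(xs)[:-1]


def alternate(xs: List[int]) -> List[int]:
    # Hypothetical alternate built on an independent implementation.
    return sorted(xs)


def accept(out: List[int]) -> bool:
    # Acceptance test T: same length as the input and in non-decreasing order.
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    return len(out) == len(data) and ordered


result = recovery_block(accept, [primary, alternate], data)
```

The primary's output fails the acceptance test, so the first alternate runs on an untouched copy of the input and its result is accepted.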

[Figure: Distributed Recovery Block.]