ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering HIGH Level Fault-Tolerance: Checkpointing and recovery.


Introductory material

Overview
  Introduction and basic concept
  Fault model and fault coverage
  Checkpointing and backward error recovery (rollback)
    – General principles
    – Uniprocessor systems
  Summary
  Cost, Overhead, Latency issues
  Distributed Systems

Introduction
  References
    – Text Chapter 6
    – [Prad:96] Chapter 3 – sections on rollback and reconfiguration

Introduction (contd.)
  Somewhat higher level than ECC and watchdog; uses re-execution as the basic recovery strategy
  In practice it is a hardware-assisted software method
  Basic concept: save the fault-free state of the system; if and when an error is detected, reload the fault-free state and re-execute
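
The basic concept above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the state is a plain dictionary, `TransientFault` stands in for whatever error detector the system has, and `run_with_rollback`, `make_flaky_step`, and `max_retries` are hypothetical names introduced here.

```python
import copy


class TransientFault(Exception):
    """Raised by the (hypothetical) error detector when a step goes wrong."""


def run_with_rollback(state, steps, max_retries=3):
    """Run each step; on a detected error, reload the saved state and re-execute."""
    for step in steps:
        checkpoint = copy.deepcopy(state)          # save the fault-free state
        for _ in range(max_retries + 1):
            try:
                step(state)                        # may corrupt state, then raise
                break                              # no error detected: move on
            except TransientFault:
                state.clear()                      # discard the corrupted state
                state.update(copy.deepcopy(checkpoint))  # reload fault-free state
        else:
            # Every attempt failed, so re-execution cannot remove this error.
            raise RuntimeError("error persists after re-execution")
    return state


def make_flaky_step(fail_times):
    """Build a step that increments x but fails on its first `fail_times` calls."""
    remaining = [fail_times]

    def step(state):
        state["x"] += 1
        if remaining[0] > 0:
            remaining[0] -= 1
            state["x"] = -999                      # simulate corrupted data
            raise TransientFault()

    return step
```

A step that fails twice and then succeeds leaves the same final state as a fault-free run; a step that keeps failing exhausts the retries and is reported as a non-transient error.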

Introduction - Basic Concept (contd.)
  Three phases of recovery
    – Error detection
    – Damage assessment
    – Recovery: error elimination and arrival at the point where the error was detected
      often entails restarting afresh on a system presumed to be fault-free
  Backward error recovery
    – Current process is rolled back to some error-free point and re-executes
    – Trivial solution: start afresh from the beginning of the program

Fault model and fault coverage
  Possible scenarios
    – Hardware is faulty, software is fault-free
    – Fault detection mechanism exists, in hardware or in software form
    – Hardware is fault-free, software is faulty
    – Both hardware and software are faulty
  Assumptions for backward error recovery
    – A reliable error detection mechanism exists
    – The error can be removed by re-execution
    – Process state can be restored to a previous error-free state

Fault model and fault coverage (contd.)
  Based on the assumptions stated:
    – The method is normally applicable when an error detection mechanism exists, hardware faults are transient, and there are no software faults
  Methods to address other fault scenarios are
    – Re-configuration
    – Software fault-tolerance, e.g. recovery blocks and n-version programming

Checkpointing and Rollback
  General principles
    – Time redundancy is permissible
    – Transient hardware errors
    – For software errors (design or otherwise), either alternative modules exist or the errors are timing errors that may be resolved during re-execution
    – Reliable error detection mechanism
    – It is feasible to determine checkpoints (system states that need to be saved) in an application
    – Method can apply to redundant as well as nonredundant systems

Checkpointing and Rollback (contd.)
  General issues: checkpointing & rollback
    – Save system state at regular intervals
      How often to save: checkpoint interval
      How much to save: can be as little as PC and status flags, just one instruction, or as much as a log of all messages, the complete program, and associated data values at a given time
      How long between fault occurrence and its detection (error latency) is tolerable: a large error latency often makes this method less than ideal
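
The "how often to save" question is a tradeoff that a little arithmetic makes visible. The sketch below assumes a deliberately simple model that is not from the slides: checkpoints are equally spaced, each costs a fixed `save_cost`, and errors are detected immediately (zero error latency). The function name and parameters are hypothetical.

```python
def checkpoint_costs(total_work, interval, save_cost):
    """Return (fault-free overhead, worst-case re-execution), in the same time unit.

    Assumes equally spaced checkpoints and immediate error detection.
    """
    num_checkpoints = total_work // interval   # checkpoints taken during the run
    overhead = num_checkpoints * save_cost     # time spent saving state
    worst_case_redo = interval                 # work lost if the error hits just before a save
    return overhead, worst_case_redo
```

For 1000 time units of work and a save cost of 2, an interval of 100 gives an overhead of 20 and up to 100 units of re-execution, while an interval of 10 gives an overhead of 200 and at most 10 units of re-execution. A shorter interval buys faster recovery at the price of more time spent saving state.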

Checkpointing and Rollback (contd.)
  General issues: checkpointing & rollback
    – Rollback recovery
      Where do we go back to: damage assessment
      Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
      Restart the computation
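
One way to read "where do we go back to" is as a search for the latest checkpoint that cannot contain the error. The sketch below assumes a known upper bound on error latency and a sorted list of checkpoint times; both, along with the function name, are assumptions made for illustration.

```python
import bisect


def choose_rollback_point(checkpoint_times, detection_time, max_error_latency):
    """Return the latest checkpoint time strictly before the earliest possible
    fault time, or None if no saved checkpoint can be trusted.

    `checkpoint_times` must be sorted in ascending order.
    """
    earliest_fault_time = detection_time - max_error_latency
    # Index of the first checkpoint at or after the earliest possible fault time;
    # everything before that index was saved while the system was still fault-free.
    i = bisect.bisect_left(checkpoint_times, earliest_fault_time)
    return checkpoint_times[i - 1] if i > 0 else None
```

With checkpoints at times 0, 10, 20, and 30, an error detected at time 34 with a latency bound of 5 makes the checkpoint at 30 suspect, so the rollback target is 20. A result of `None` corresponds to the trivial solution of restarting from the beginning.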

Checkpointing and Rollback (contd.)
  What do we need
    – Error detection mechanism
      Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
    – Storage for state/data saving
      Large enough storage: PC, stack, data segments (static and dynamic), information about user and system files that may be open
      Access time: an issue during storing and retrieval
      Volatility and stability of the storage
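
The stability requirement on storage has a common software counterpart: a checkpoint file should never be left half-written. The sketch below shows one widely used approach (write to a temporary file, force it to disk, then rename over the old file) using only the Python standard library. The JSON format and the function names are assumptions for illustration, and the rename is atomic only when both files are on the same filesystem.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so a crash mid-write leaves the old checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # force the data out to the storage device
        os.replace(tmp_path, path)     # atomically replace the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # do not leave a partial file behind
        raise


def load_checkpoint(path):
    """Read back the most recently completed checkpoint."""
    with open(path) as f:
        return json.load(f)
```

The `fsync` call is where the access-time cost noted above shows up: it makes each save slower in exchange for knowing the checkpoint has left volatile buffers.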

Checkpointing and Rollback (contd.)
  What do we need (contd.)
    – Events
      Messages and transactions that should be logged and replayed
    – Procedures to handle errors and restart computation
    – What if errors continue to exist? A mechanism is needed to handle this
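
Logging and replaying events can be sketched as a checkpoint plus a log of the messages received since that checkpoint. The toy class below is a hypothetical illustration, and it assumes message processing is deterministic, so that replaying the same messages in the same order rebuilds the same state.

```python
import copy


class LoggedProcess:
    """Toy process whose state is a running total updated by incoming messages."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []                       # messages received since the last checkpoint

    def receive(self, msg):
        self._log.append(msg)                # log the message before processing it
        self._apply(msg)

    def _apply(self, msg):
        self.state["total"] += msg

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()                    # earlier messages are covered by the checkpoint

    def recover(self):
        self.state = copy.deepcopy(self._checkpoint)   # roll back to the saved state
        for msg in self._log:                # replay logged messages in original order
            self._apply(msg)
```

Because the log is replayed after the rollback, recovery returns the process to the point where the error was detected rather than to the older checkpoint, without asking senders to retransmit.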

Checkpointing: Uniprocessor systems
  Equivalence of uniprocess and uniprocessor systems
  Simplest scheme
    – Instruction re-execution
      Hardware (parity, self-checking, duplication) reports error
      Instruction is re-executed using previous data and state
    – Issues
      Register file update (commit)
      Latency, especially in pipelined systems
    – Key is to determine the state to be saved
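
The commit issue in instruction re-execution can be shown with a toy model: the register file is written only after the check passes, so a retry always reads the previous data and state. Everything here is an assumption made for illustration, including the parity-checked ALU, the dictionary register file, and the names `execute_with_retry` and `make_flaky_alu`.

```python
def parity(value):
    """Parity bit (number of 1 bits modulo 2) of a non-negative integer."""
    return bin(value).count("1") % 2


def execute_with_retry(regs, dest, op, src_a, src_b, alu, max_retries=2):
    """Execute one instruction; commit to the register file only if the check passes."""
    for _ in range(max_retries + 1):
        result, check_bit = alu(op, regs[src_a], regs[src_b])
        if parity(result) == check_bit:
            regs[dest] = result      # commit: the only point where state changes
            return True
        # Mismatch: regs is untouched, so the retry sees the previous data and state.
    return False


def make_flaky_alu(faults):
    """Hypothetical ALU that flips the low result bit on its first `faults` calls."""
    remaining = [faults]

    def alu(op, a, b):
        good = a + b if op == "add" else a - b
        check = parity(good)         # parity predicted from the correct result
        if remaining[0] > 0:
            remaining[0] -= 1
            return good ^ 1, check   # transient single-bit error in the result
        return good, check

    return alu
```

Deferring the commit matters most when the destination is also a source, as in `r1 = r1 + r2`: had the faulty result been written first, the retry would start from corrupted data.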

Checkpointing: Uniprocessor systems (contd.)
  Process control systems
    – A program that monitors a process behaves in a predetermined manner: known control flow and typically periodic
    – Define checkpoints statically

Checkpointing: Uniprocessor systems (contd.)
  Process control systems (contd.)
    – Typical objectives
      Recovery possible in a given time
      Minimize the total number of checkpoints
  Methods of this nature were studied in the 1960s

Checkpointing: Uniprocessor systems (contd.)
  General purpose systems
    – How much information to save
      System state consisting of register file, PC, stack, etc.
      Data?
        – All of it? Can be prohibitive (space and time)
        – So?
        – Only the data modified since the last checkpoint
        – How do we do this efficiently?
        – Caches provide a nice boundary to achieve this
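
Saving only the data modified since the last checkpoint can be illustrated in software with an undo log. This is a stand-in for the idea, not the cache-based hardware scheme the slide refers to: the class below records the old value of a location on its first write after a checkpoint, so the cost of saving grows with the number of modified locations rather than with the size of memory. All names are hypothetical.

```python
_MISSING = object()   # marks an address that did not exist at the last checkpoint


class CheckpointedMemory:
    """Toy memory that saves only the data modified since the last checkpoint."""

    def __init__(self, initial=None):
        self.data = dict(initial or {})
        self._undo = {}                      # address -> value at the last checkpoint

    def write(self, addr, value):
        if addr not in self._undo:           # first write since the checkpoint
            self._undo[addr] = self.data.get(addr, _MISSING)
        self.data[addr] = value

    def read(self, addr):
        return self.data[addr]

    def checkpoint(self):
        """Make the current contents the new checkpoint; return how many
        locations had been modified since the previous one."""
        modified = len(self._undo)
        self._undo.clear()
        return modified

    def rollback(self):
        """Restore every modified location to its value at the last checkpoint."""
        for addr, old in self._undo.items():
            if old is _MISSING:
                del self.data[addr]
            else:
                self.data[addr] = old
        self._undo.clear()
```

Repeated writes to the same address are recorded once, which is the property a cache boundary gives naturally: a line that is already dirty needs no further bookkeeping until the next checkpoint.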

Summary
  Discussed classical studies of checkpointing