Download presentation
Presentation is loading. Please wait.
Published byBarrie Chandler Modified over 9 years ago
1
ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering HIGH Level Fault-Tolerance: Checkpointing and recovery Introductory material
2
ECE 753 Fault Tolerant Computing2 Overview Introduction and basic concept Fault model and fault coverage Checkpointing and backward error recovery (rollback)Checkpointing and backward error recovery (rollback) –General principlesGeneral principles –Uniprocessor systemsUniprocessor systems Summary Cost, Overhead, Latency issues Distributed Systems
3
ECE 753 Fault Tolerant Computing3 Introduction References –Text Chapter 6Text Chapter 6 –[Prad:96] Chapter 3 – sections on rollback and reconfiguration[Prad:96] Chapter 3 – sections on rollback and reconfiguration
4
ECE 753 Fault Tolerant Computing4 Introduction (contd.) Some what higher level than ECC and watchdog, uses re-execution as basic recovery strategySome what higher level than ECC and watchdog, uses re-execution as basic recovery strategy It is a hardware assisted software method in practiceIt is a hardware assisted software method in practice Basic concept: save fault-free state of the system and if and when an error is detected, reload the fault-free state and re-executeBasic concept: save fault-free state of the system and if and when an error is detected, reload the fault-free state and re-execute
5
ECE 753 Fault Tolerant Computing5 Introduction - Basic Concept (contd.) Three phases of recovery –Error detectionError detection –Damage assessmentDamage assessment –Recovery – error elimination and arrival at the point where error was detectedRecovery – error elimination and arrival at the point where error was detected often entails re-starting fresh on a system presumably fault free often entails re-starting fresh on a system presumably fault free Backward error recovery –Current process is rolled back to some error-free point and re-executesCurrent process is rolled back to some error-free point and re-executes –Trivial solution – start afresh from the beginning of the programTrivial solution – start afresh from the beginning of the program
6
ECE 753 Fault Tolerant Computing6 Fault model and fault coverage Possible scenarios –Hardware is faulty, software is fault-freeHardware is faulty, software is fault-free –Fault detection mechanism exists – in hardware or in software formFault detection mechanism exists – in hardware or in software form –Hardware fault-free, software is faultyHardware fault-free, software is faulty –Both hardware software faultyBoth hardware software faulty Assumptions for backward error recovery –Reliable error detection mechanism existsReliable error detection mechanism exists –Error can be removed by re-executionError can be removed by re-execution –Process state can be restored to a previous error- free stateProcess state can be restored to a previous error- free state
7
ECE 753 Fault Tolerant Computing7 Fault model and fault coverage (contd.) Based on the assumptions stated: –The method is normally applicable when: error detection mechanism exists, transient hardware faults, and no-software faultsThe method is normally applicable when: error detection mechanism exists, transient hardware faults, and no-software faults Methods to address other fault scenario areMethods to address other fault scenario are –Re-configurationRe-configuration –Software fault-tolerance: e.g. recovery block and n-version programmingSoftware fault-tolerance: e.g. recovery block and n-version programming
8
ECE 753 Fault Tolerant Computing8 Checkpointing and Rollback General principles –Time redundancy is permissibleTime redundancy is permissible –Transient hardware errorsTransient hardware errors –If software errors (design or otherwise) alternative modules exist or there are timing errors that may be solved during re-executionIf software errors (design or otherwise) alternative modules exist or there are timing errors that may be solved during re-execution –Reliable error detection mechanismReliable error detection mechanism –It is feasible to determine checkpoints (system states that need to be saved) in an applicationIt is feasible to determine checkpoints (system states that need to be saved) in an application –Method can apply to redundant as well as nonredundant systemsMethod can apply to redundant as well as nonredundant systems
9
ECE 753 Fault Tolerant Computing9 Checkpointing and Rollback (contd.) General issues: checkpointing & rollback General issues: checkpointing & rollback –Save system state at regular intervalSave system state at regular interval How often to save - checkpoint interval How much to save - can be as little as PC and status flags, just one instruction or as mush as log of all messages, the complete program and associated data values at a given timeHow much to save - can be as little as PC and status flags, just one instruction or as mush as log of all messages, the complete program and associated data values at a given time How long between fault occurrence and its detection (error latency) is tolerable – often large error latency may make this method less than an ideal methodHow long between fault occurrence and its detection (error latency) is tolerable – often large error latency may make this method less than an ideal method
10
ECE 753 Fault Tolerant Computing10 Checkpointing and Rollback (contd.) General issues: checkpointing & rollbackGeneral issues: checkpointing & rollback –Rollback recoveryRollback recovery Where do we go back to: damage assessment Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted) Restart the computation
11
ECE 753 Fault Tolerant Computing11 Checkpointing and Rollback (contd.) What do we need –Error detection mechanismError detection mechanism Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests.Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests. –Storage for state/data savingStorage for state/data saving Large enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be openLarge enough storage – PC, stack, data segments (static and dynamic), information about user and system files that may be open Access time – issue during storing and retrieval Volatility and stability of the storage
12
ECE 753 Fault Tolerant Computing12 Checkpointing and Rollback (contd.) What do we need (contd.) –EventsEvents Messages and transactions that should be logged and replayedMessages and transactions that should be logged and replayed –Procedures to handle errors and restart computationProcedures to handle errors and restart computation –What if errors continue to exist? – mechanism to handle thisWhat if errors continue to exist? – mechanism to handle this
13
ECE 753 Fault Tolerant Computing13 Checkpointing: Uniprocessor systems Uniprocess and uniprocessor systems equivalenceUniprocess and uniprocessor systems equivalence Simplest scheme –Instruction re-executionInstruction re-execution Hardware (parity, self-checking, duplication) reports error Instruction is re-executed using previous data and state –IssuesIssues Register file update (commit) Latency, especially in pipeline systems –Key is to determine the state to be savedKey is to determine the state to be saved
14
ECE 753 Fault Tolerant Computing14 Checkpointing: Uniprocessor systems (contd.) Process control systems –Program that monitors a process behaves in a predetermined manner – known control flow and typically periodicProgram that monitors a process behaves in a predetermined manner – known control flow and typically periodic –Define checkpoints staticallyDefine checkpoints statically
15
ECE 753 Fault Tolerant Computing15 Checkpointing: Uniprocessor systems (contd.) Process control systems (contd.) –Typical objectivesTypical objectives Recovery possible in a given time Minimize the total number of checkpoints Methods of this nature studied in 60’s
16
ECE 753 Fault Tolerant Computing16 Checkpointing: Uniprocessor systems (contd.) General purpose systems –How much information to saveHow much information to save System state consisting of register file, PC, stack, etc. Data? –All of it? Can be prohibitive (space and time)All of it? Can be prohibitive (space and time) –So?So? –Only that data which is modified after the last checkpointOnly that data which is modified after the last checkpoint –How do we do this efficiently?How do we do this efficiently? –Caches provide a nice boundary to achieve thisCaches provide a nice boundary to achieve this
17
ECE 753 Fault Tolerant Computing17 Summary Discussed checkpointing classical studiesDiscussed checkpointing classical studies
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.