ECE 753: FAULT-TOLERANT COMPUTING Kewal K. Saluja Department of Electrical and Computer Engineering HIGH Level Fault-Tolerance: Checkpointing and recovery Lectures 21-22
FAULT-TOLERANT COMPUTING Jenn-Wei Lin Department of Computer Science and Information Engineering Fu Jen Catholic University HIGH Level Fault-Tolerance: Checkpointing and recovery Lecture Set 4
Overview
  Introduction and basic concept
  Fault model and fault coverage
  Checkpointing and backward error recovery (rollback)
    – General principles
    – Uniprocessor systems
    – Shared memory multiprocessor systems
    – Distributed and network systems
  Forward error recovery
  Summary
Introduction
  References
    – [Prad:96] Chapter 3: sections on rollback and reconfiguration
    – You can find a number of papers on the Microsoft web site that deal with checkpointing. The URL is [URL missing].
    – E.N. Elnozahy, L. Alvisi, Y-M Wang, and D.B. Johnson, “A Survey of Rollback-Recovery Protocols in Message Passing Systems”, ACM Computing Surveys, Vol. 34, No. 3, September 2002.
Introduction (contd.)
  Somewhat higher level than ECC and watchdog; uses re-execution as the basic recovery strategy
  In practice it is a hardware-assisted software method
  Basic concept: save a fault-free state of the system and, if and when an error is detected, reload the fault-free state and re-execute
Introduction - Basic Concept (contd.)
  Three phases of recovery
    – Error detection
    – Damage assessment
    – Recovery: error elimination and arrival at the point where the error was detected
        often entails restarting fresh on a system presumed to be fault-free
  Backward error recovery
    – The current process is rolled back to some error-free point and re-executed
    – Trivial solution: start afresh from the beginning of the program
Fault model and fault coverage
  Possible scenarios
    – Hardware is faulty, software is fault-free
    – Fault detection mechanism exists, in hardware or in software form
    – Hardware is fault-free, software is faulty
    – Both hardware and software are faulty
  Assumptions for backward error recovery
    – A reliable error detection mechanism exists
    – The error can be removed by re-execution
    – Process state can be restored to a previous error-free state
Fault model and fault coverage (contd.)
  Based on the assumptions stated:
    – The method is normally applicable when error detection is available, hardware faults are transient, and there are no software faults
  Methods to address other fault scenarios are
    – Reconfiguration
    – Software fault tolerance, e.g. recovery blocks and N-version programming
Checkpointing and Rollback
  General principles
    – Time redundancy is permissible
    – Transient hardware errors
    – For software errors (design or otherwise): alternative modules exist, or the errors are timing errors that may be resolved during re-execution
    – Reliable error detection mechanism
    – It is feasible to determine checkpoints (system states that need to be saved) in an application
    – The method can apply to redundant as well as nonredundant systems
Checkpointing and Rollback (contd.)
  General issues: checkpointing & rollback
    – Save system state at regular intervals
        How often to save: the checkpoint interval
        How much to save: can be as little as the PC and status flags (just one instruction) or as much as a log of all messages, the complete program, and the associated data values at a given time
        How long a delay between fault occurrence and its detection (error latency) is tolerable: a large error latency often makes this method less than ideal
Checkpointing and Rollback (contd.)
  General issues: checkpointing & rollback
    – Rollback recovery
        Where do we go back to: damage assessment
        Rollback: load the state vector (state of the processor, the data that may have been altered or corrupted)
        Restart the computation
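The save, detect, reload, re-execute cycle can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the whole process state is a dictionary, error detection is a caller-supplied predicate, and the names `CheckpointedTask`, `run_with_recovery`, and `max_retries` are illustrative, not taken from the lecture.

```python
import copy


class CheckpointedTask:
    """Toy process whose entire state is one dict (an assumption of this sketch)."""

    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)  # initial checkpoint

    def checkpoint(self):
        # Save a copy of the (presumed error-free) state vector.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        # Reload the last saved state vector; computation restarts from it.
        self.state = copy.deepcopy(self.saved)


def run_with_recovery(task, steps, detect_error, max_retries=3):
    """Run each step; on a detected error, roll back and re-execute the step."""
    for step in steps:
        for _ in range(max_retries + 1):
            step(task.state)
            if not detect_error(task.state):
                task.checkpoint()  # step succeeded: establish a new checkpoint
                break
            task.rollback()        # error detected: restore and try again
        else:
            # Errors that persist across retries need another mechanism.
            raise RuntimeError("error persists after re-execution")
    return task.state
```

The retry bound reflects the assumption that the fault is transient; a fault that survives every re-execution is outside what rollback alone can handle.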
Checkpointing and Rollback (contd.)
  What do we need
    – Error detection mechanism
        Various self-checking mechanisms, e.g. error detection, timers, watchdog, acceptance tests
    – Storage for state/data saving
        Large enough storage: PC, stack, data segments (static and dynamic), information about user and system files that may be open
        Access time: an issue during storing and retrieval
        Volatility and stability of the storage
Checkpointing and Rollback (contd.)
  What do we need (contd.)
    – Events
        Messages and transactions that should be logged and replayed
    – Procedures to handle errors and restart computation
    – What if errors continue to exist? A mechanism is needed to handle this
Checkpointing: Uniprocessor systems
  Uniprocess and uniprocessor systems equivalence
  Simplest scheme
    – Instruction re-execution
        Hardware (parity, self-checking, duplication) reports error
        Instruction is re-executed using previous data and state
    – Issues
        Register file update (commit)
        Latency, especially in pipelined systems
    – Key is to determine the state to be saved
Checkpointing: Uniprocessor systems (contd.)
  Process control systems
    – A program that monitors a process behaves in a predetermined manner: known control flow and typically periodic
    – Define checkpoints statically
Checkpointing: Uniprocessor systems (contd.)
  Process control systems (contd.)
    – Typical objectives
        Recovery possible in a given time
        Minimize the total number of checkpoints
  Methods of this nature were studied in the 1960s
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems
    – How much information to save
        System state consisting of register file, PC, stack, etc.
        Data?
          – All of it? Can be prohibitive (space and time)
          – So?
          – Only the data that has been modified since the last checkpoint
          – How do we do this efficiently?
          – Caches provide a nice boundary to achieve this
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems (contd.)
  Write-through cache policy: save the system state (all registers etc.) and those items that are sent to main memory
  [Figure: CPU, cache, and main memory]
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems (contd.)
    – Pictorial representation of checkpoints
  [Figure: timeline with checkpoints at t_ck1 and t_ck2; registers and cache are saved at each checkpoint, and all data that changes through write-through is saved in between]
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems (contd.): Issues
    – Write-through policy: high memory traffic and a lot of data to save
    – Copy-back policy: save all cache misses
    – Can reduce the data to be saved by using the dirty bit (often available in cache implementations)
    – Schemes have been studied that reduce the amount of data to be stored between checkpoints (also called active data) as opposed to checkpoint data (all data saved at a checkpoint)
    – These concepts can be extended to a multilevel memory hierarchy (cache, main memory, disk, archival memory, ...)
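The idea of saving only the data modified since the last checkpoint can be illustrated in software. The sketch below is not the cache-based hardware scheme of the slides; it is a software analogue that assumes a flat list of memory cells and uses a per-address undo record in the role of the dirty bit. `TrackedMemory` and its method names are hypothetical.

```python
class TrackedMemory:
    """Toy memory that tracks which addresses changed since the last checkpoint."""

    def __init__(self, size):
        self.cells = [0] * size
        self.undo = {}  # address -> value it held at the last checkpoint

    def write(self, addr, value):
        # Only the first write to an address after a checkpoint saves the old
        # value, much like setting a dirty bit once.
        if addr not in self.undo:
            self.undo[addr] = self.cells[addr]
        self.cells[addr] = value

    def checkpoint(self):
        # Establish a new checkpoint: current contents become the recovery
        # point, so the undo records are discarded. Returns how many addresses
        # were dirty, i.e. the amount of "active data".
        dirty = len(self.undo)
        self.undo.clear()
        return dirty

    def rollback(self):
        # Restore only the addresses that were modified since the checkpoint.
        for addr, old in self.undo.items():
            self.cells[addr] = old
        self.undo.clear()
```

The amount of saved data grows with the number of distinct addresses written in an interval, not with the memory size, which is the point of tracking modifications.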
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems (contd.): Issues
    – Clearly it can be seen that
        A large number of checkpoints causes high overhead (a lot of data to be saved, and often), implying loss of performance
        Fewer checkpoints imply long recovery time and a potentially non-recoverable state
    – Schemes can be devised which are suited to a given process/memory architecture
        Twin paging: virtual memory (see Pradhan)
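The trade-off between frequent and infrequent checkpoints can be made concrete with a first-order model. This model is not part of the slides; it assumes a fixed cost C to take one checkpoint, a checkpoint interval T, a mean time between failures M with T much smaller than M, and that a failure loses on average half an interval of work. The result is often called Young's approximation.

```latex
% Fraction of time lost, as a function of the checkpoint interval T:
%   C/T      : overhead of taking checkpoints
%   T/(2M)   : expected re-computation (half an interval per failure)
\[
  \text{overhead}(T) \approx \frac{C}{T} + \frac{T}{2M}
\]
% Setting the derivative to zero gives the interval that minimizes the loss:
\[
  -\frac{C}{T^{2}} + \frac{1}{2M} = 0
  \quad\Longrightarrow\quad
  T_{\text{opt}} \approx \sqrt{2\,C\,M}
\]
```

For example, with C = 1 minute and M = 200 minutes, the model suggests an interval of about 20 minutes. Shorter intervals are dominated by the checkpoint cost and longer ones by re-computation.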
Checkpointing: Uniprocessor systems (contd.)
  General purpose systems (contd.): Issues
    – How to handle failures that occur during checkpointing
        A low-probability event; nonetheless it can occur, because a lot of data often needs to be copied (which takes a long time) to make a checkpoint
        (Sequoia solution)
          – Use two saved states at each checkpoint
          – Time stamp these states at the beginning and end of each saved state
          – Normally the beginning time stamp of the second copy will be the same as the end time stamp of the first copy
          – If a failure occurs during any single state save, its second time stamp will be missing …
    – In a cache-oriented checkpointing system, checkpoints can be forced by causing cache-full, having unchangeable bits and forcing a change in such bits
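The two-copy scheme with begin and end time stamps can be sketched as follows. This is a simplified model and not the actual Sequoia implementation: stable storage is a dictionary with two slots named "A" and "B", time stamps are integers, and the `fail_in` parameter is a hypothetical hook that simulates a crash while one copy is being written.

```python
def write_copy(store, slot, state, begin, end=None):
    """Write one copy; `end` stays None if the save was interrupted."""
    store[slot] = {"begin": begin, "data": dict(state), "end": end}


def take_checkpoint(store, state, clock, fail_in=None):
    """Write copy "A", then copy "B". `fail_in` simulates a crash in one copy."""
    t0, t1, t2 = clock, clock + 1, clock + 2
    write_copy(store, "A", state, t0, None if fail_in == "A" else t1)
    if fail_in == "A":
        return  # crash: copy "B" still holds the previous checkpoint
    # The begin stamp of the second copy equals the end stamp of the first.
    write_copy(store, "B", state, t1, None if fail_in == "B" else t2)


def recover(store):
    """Return the data of the newest copy that has both time stamps."""
    complete = [c for c in store.values() if c["end"] is not None]
    if not complete:
        return None
    return max(complete, key=lambda c: c["end"])["data"]
```

Because the copies are written one after the other, a single failure can leave at most one of them without its end stamp, so a complete copy is always available once the first checkpoint has finished.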
Checkpointing: Shared memory systems
  Assumptions
    – Tightly coupled system
    – Processors with private caches
    – Bus-based architectures for cache coherence (the alternative is a directory-based architecture)
  The “consistency problem” arises because “processors” have their own private caches. Bus-based architectures use the following for coherence
    – Write invalidate: memory writes cause invalidation of copies of the data in local caches
    – Write update: correct data is sent to all caches
Checkpointing: Shared memory systems (contd.)
  Checkpointing mechanism
    – Establish a global checkpoint: all processes or processors establish a checkpoint they can roll back to in case a fault is detected
        Use extra hardware, for example extra lines to support the consistency protocol
        A processor raises a request to establish a checkpoint (when updates occur or appropriately defined conditions are met)
        All processors observe this signal and establish a checkpoint (all actions stop)
        On detection of a fault, the processor raises rollback
        All processors roll back
Checkpointing: Shared memory systems (contd.)
  Limitations
    – High rate of establishing checkpoints
    – It is possible to reduce the number of checkpoints by tracking the interactions between processors (note that all interactions occur on a single bus)
  Systems that are not single-bus oriented and directory-based architectures have also been studied
Checkpointing: Distributed systems
  System and failure model
    – Basics and definitions
  Checkpoint based recovery
    – Uncoordinated
    – Coordinated
    – Communication induced
  Log based recovery
    – Pessimistic logging
    – Optimistic logging
    – Causal
  Implementation issues
Checkpointing: Dist. Sys. (contd.)
  System and failure model: system model
    – N processors/processes
    – Interaction between processors/processes and the outside world by sending and/or receiving messages
    – Messages are non-deterministic events
    – Communication system protocol
        Lossy (messages may get lost, duplicated, or reordered in the communication system; the most commonly used scenario)
        Reliable (no messages are lost and they are always served in order, e.g. FIFO; a less frequently used assumption)
Checkpointing: Dist. Sys. (contd.)
  System and failure model: system model (a message passing system)
  [Figure: processes P1, P2, and P3 exchanging messages with each other and with the outside world]
Checkpointing: Dist. Sys. (contd.)
  System and failure model: Definitions
    – We will consider the case of building rollback recovery over a lossy communication system
    – Consistent system state
        A state that may occur in a legal execution of a distributed computation
        In other words, for every message that is received, the state of its sender shows that it has been sent
        It is important to note that the reconstructed state is not necessarily the state that occurred before the failure; it is a state that could have occurred before the failure in a legal execution. (This distinction is important, and even more so in logging based recovery.)
Checkpointing: Dist. Sys. (contd.)
  System and failure model: consistent and inconsistent system states
  [Figure: processes P1, P2, and P3 exchanging messages m1, m2, m3, and m4, with two state lines marked as consistent states and one marked as an inconsistent state]
Checkpointing: Dist. Sys. (contd.)
  System and failure model (contd.)
    – The last state is inconsistent because in the system state (where the state line intersects the process lines) m3 is received by process P2 but it is never sent by process P1
    – Note: a message that is sent but not received is acceptable because
        It is an in-transit message
        It may have been lost (lossy communication)
        If the rollback protocol is built over a reliable communication protocol, then in-transit messages should be included in the consistent system state (not discussed here)
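The definition of a consistent state (every received message is recorded as sent) translates directly into a small check. In this sketch a system state is assumed to be a dictionary that maps each process name to the sets of message identifiers it has sent and received up to the state line; this representation and the name `is_consistent` are illustrative.

```python
def is_consistent(cut):
    """cut: process name -> {"sent": set of message ids, "received": set of ids}."""
    sent = set()
    for local in cut.values():
        sent |= local["sent"]
    # Every message recorded as received must be recorded as sent by someone.
    # Messages sent but not received (in transit or lost) are allowed.
    return all(local["received"] <= sent for local in cut.values())
```

A state in which P2 has received m3 while P1 has not yet sent it is rejected, whereas a state with a sent but unreceived message is accepted.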
Checkpointing: Dist. Sys. (contd.)
  System and failure model (contd.)
    – Consistent global checkpoint
        A set of N local checkpoints, one for each process, that together form a consistent system state
        The key idea of this definition is that we can roll back to this state and re-compute from it to arrive at the present state
    – Recovery line
        The most recent consistent global checkpoint
Checkpointing: Dist. Sys. (contd.)
  System and failure model (contd.)
    – Goal of checkpointing
        Perform checkpointing in such a way that the recovery line is close to the failure point
          – This will reduce the re-computation effort
        Have as few checkpoints as possible
          – This will reduce the cost (performance loss and overhead)
        Together they should be such that the domino effect does not take place
          – What is the domino effect?
Checkpointing: Dist. Sys. (contd.)
  System and failure model (contd.)
    – Domino effect
  [Figure: processes P1, P2, and P3 with a fault, illustrating the domino effect]
Checkpointing: Dist. Sys. (contd.)
  System and failure model: system model with logging protocol
    – The recovery mechanism uses checkpoint information and the log of messages (all messages are kept in a log and replayed during recovery)
    – Recovery starts from a checkpoint, re-executes the program/process, and replays messages from the log
    – Note also that the messages are non-deterministic events
    – Generally this protocol is less susceptible to the domino effect
Checkpointing: Dist. Sys. (contd.)
  System and failure model: Other concepts
    – Stable storage: saved data (checkpoints, event logs, etc.) must persist through the tolerated system failures
        For example, a dual independent store will tolerate a single failure
    – Interaction with the external world
        Committing outputs: sending output without verifying it can speed up the process, but if such an action cannot be rolled back, verification should be done before committing
Checkpointing: Dist. Sys. (contd.)
  System and failure model: Other concepts
    – Garbage collection
        Not all checkpoints need to stay in storage
        Algorithms exist to determine the global recovery line, and thus to discard older checkpoints and recovery lines
        Algorithms can also identify useless checkpoints and discard them, or create a new recovery line from the present checkpoints
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Uncoordinated
    – Each process takes checkpoints independently
  Advantages
    – Inexpensive
    – Easy to implement
    – Low overhead
  Disadvantages
    – The domino effect may take place; potentially no recovery is possible
    – Normally longer recovery time
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Uncoordinated
    – Basic problem formulation: determination of the recovery line
        Concept of dependency graph
          – Define checkpoint intervals for each process; for example, I_{1,x} is the x-th checkpoint interval of process P1
  [Figure: P1 with checkpoints x-1 and x bounding interval I_{1,x}, and P2 with checkpoints y-1 and y bounding interval I_{2,y}]
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Uncoordinated
        Concept of dependency graph
          – The x-th checkpoint of P1 depends on the y-th checkpoint of P2; thus if P2 rolls back to y-1, then P1 must roll back to x-1
  [Figure: P1 with checkpoints x-1 and x bounding interval I_{1,x}, and P2 with checkpoints y-1 and y bounding interval I_{2,y}]
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Uncoordinated
        Algorithm for recovery line and garbage collection
          – Construct the dependency graph
          – Determine the global recovery line (we want a recovery line that is as close to the present time as possible). The following options have their relative advantages and disadvantages
              » Construct such a line on demand (when a fault occurs)
              » Create it as a background job and keep updating it
          – Discard checkpoints that are not part of the recovery line; these need not stay in the stable (persistent) store
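One way to determine a recovery line from the dependency information is rollback propagation, sketched below. The encoding is an assumption of this sketch: checkpoints of each process are numbered 0, 1, 2, ..., interval k is the execution between checkpoints k-1 and k, and each message is a tuple (sender, send interval, receiver, receive interval). `start` gives the checkpoint index each process initially rolls back to; for a process that has not failed, its current state can be treated as its highest index.

```python
def recovery_line(start, messages):
    """Propagate rollbacks until no received message is left without its send."""
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for sender, a, receiver, b in messages:
            # If the sender is rolled back before interval a, the send is undone.
            # A receiver whose state still includes interval b would then hold
            # a message that was never sent, so it must go back to b - 1.
            if line[sender] < a and line[receiver] >= b:
                line[receiver] = b - 1
                changed = True
    return line
```

With one message sent in I_{2,y} and received in I_{1,x}, rolling P2 back to y-1 forces P1 back to x-1, as in the dependency rule above. When messages criss-cross between intervals, the propagation can cascade back to the initial state, which is the domino effect.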
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Coordinated
    – To avoid susceptibility to the domino effect, checkpoints can be made in a coordinated manner to form a global consistent state
  Advantages
    – Faster recovery
    – Reduced garbage collection effort
    – Reduced stable storage overhead
  Disadvantages: process/processor autonomy is destroyed, implying possible performance loss and checkpointing overhead
    – Key concept
        A coordination session needs to be initiated before any output commit
        Block inter-process communication until the checkpointing protocol is executed
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Coordinated (contd.)
    – Checkpointing protocol (blocking)
        The coordinator broadcasts a checkpointing request
        Every process that receives this request
          – Takes a local checkpoint
          – Stops sending application messages
          – Replies that its local checkpoint is done
        The coordinator sends a global checkpoint done message to all processes when it has received a local checkpoint done message from every process
        Each process commits its new checkpoint and resumes sending application messages
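The blocking protocol can be sketched as a small single-threaded simulation. It assumes that message delivery between coordinator and processes is modeled as direct method calls and that process state is a single integer; the class `Process`, its method names, and the reply string are illustrative and not from the slides.

```python
class Process:
    def __init__(self, name, state=0):
        self.name = name
        self.state = state
        self.tentative = None   # local checkpoint taken but not yet committed
        self.committed = None   # last committed checkpoint
        self.blocked = False    # True while application sends are suspended

    def on_checkpoint_request(self):
        self.tentative = self.state   # take a local checkpoint
        self.blocked = True           # stop sending application messages
        return "local checkpoint done"

    def on_global_done(self):
        self.committed = self.tentative  # commit the new checkpoint
        self.blocked = False             # resume sending application messages


def coordinated_checkpoint(processes):
    """Coordinator: broadcast the request, collect replies, then announce done."""
    replies = [p.on_checkpoint_request() for p in processes]
    if all(r == "local checkpoint done" for r in replies):
        for p in processes:
            p.on_global_done()
        return True
    return False
```

Because no process sends application messages between taking its local checkpoint and the global done message, no message can be received before a checkpoint and sent after one, so the committed checkpoints form a consistent global state.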
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Coordinated (contd.)
    – Checkpointing protocol (nonblocking)
        Responsibility for maintaining consistency is shifted to the receiving process
        The problem we can run into is that post-checkpoint message(s) may arrive before the checkpoint request message
  [Figure: the coordinator sends the checkpoint request C_req to P1 and P2; message m from P1 reaches P2 before C_req does]
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Coordinated (contd.)
    – Checkpointing protocol (nonblocking)
        Possible solutions
          – P1 must send the checkpoint message to P2 before the message m (alter the queue used by P1 for sending messages to the other processes)
          – P1 must send the checkpoint message to P2 with the message m (that is, piggyback the checkpoint message on m)
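The piggybacking solution can be sketched by attaching a checkpoint index to every application message. The sketch assumes integer process state, a monotonically increasing checkpoint index called `epoch`, and messages represented as dictionaries; all of these names are hypothetical.

```python
class Node:
    def __init__(self, state=0):
        self.state = state
        self.epoch = 0               # index of the latest checkpoint taken
        self.checkpoints = {0: state}

    def take_checkpoint(self, epoch):
        # Idempotent: a request for a checkpoint already taken is ignored.
        if epoch > self.epoch:
            self.epoch = epoch
            self.checkpoints[epoch] = self.state

    def send(self, payload):
        # Piggyback the sender's checkpoint index on the application message.
        return {"epoch": self.epoch, "payload": payload}

    def receive(self, msg):
        # If the sender has already checkpointed, checkpoint BEFORE processing,
        # so the checkpoint does not include a post-checkpoint message.
        self.take_checkpoint(msg["epoch"])
        self.state += msg["payload"]
```

When the coordinator's request finally arrives at the receiver, `take_checkpoint` finds the checkpoint already taken and does nothing, so the late request is harmless.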
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Coordinated (contd.)
    – Synchronous checkpointing
        Clocks of all processes are synchronized (a difficult problem; nonetheless there are solutions for loose and approximate synchronization of clocks in distributed multiprocessor systems)
        Checkpoints are triggered and taken by all processes/processors at some predetermined times
        It is sufficient to ensure that all checkpoints taken at a given time carry the same time stamp; there is no need for a global checkpoint message
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Comm. induced
    – Place system-wide constraints on message passing (the communication pattern) and checkpointing to guarantee progression of the recovery line
    – For example, if within every checkpoint interval all messages received precede all messages sent, then the system is free of the domino effect. Such a message passing system will appear as follows
  [Figure: message pattern in which all receives precede all sends within each checkpoint interval, with the global checkpoint state marked]
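The receives-before-sends constraint can be checked, and enforced with forced checkpoints, on a per-process event trace. The sketch assumes a trace is a list of the strings "send", "recv", and "ckpt" in execution order; both function names are illustrative.

```python
def receives_precede_sends(events):
    """True if, within every checkpoint interval, no receive follows a send."""
    sent_in_interval = False
    for e in events:
        if e == "ckpt":
            sent_in_interval = False   # a checkpoint starts a new interval
        elif e == "send":
            sent_in_interval = True
        elif e == "recv" and sent_in_interval:
            return False
    return True


def force_checkpoints(events):
    """Insert a forced checkpoint before any receive that follows a send."""
    out = []
    sent_in_interval = False
    for e in events:
        if e == "recv" and sent_in_interval:
            out.append("ckpt")         # communication-induced checkpoint
            sent_in_interval = False
        elif e == "ckpt":
            sent_in_interval = False
        if e == "send":
            sent_in_interval = True
        out.append(e)
    return out
```

The forced checkpoints are the price of the guarantee: a trace that alternates sends and receives takes one checkpoint per receive.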
Checkpointing: Dist. Sys. (contd.)
  Checkpoint based recovery: Comm. induced
    – Consistency of state can be guaranteed (no domino effect will take place) if we take a checkpoint before every non-deterministic event (note that this is a special case of what we saw before). The challenge here is to reduce the number of checkpoints
    – Generalizations of the previous statement (all receives precede sends) have been studied in the literature
Checkpointing: Dist. Sys. (contd.)
Log based recovery
–Each checkpoint interval contains non-deterministic events, such as receipt of a message or an internal event (note: sending a message is not considered an event in most log based recovery techniques)
–Often such systems are built on a reliable communication protocol, thus messages are often authenticated before being acted upon
Key concept
–Save the following on a stable store
    checkpoint
    information about non-deterministic events
Note: often the system state is fully recoverable to the pre-failure state
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging
–Key concept
    Log each non-deterministic event before it is allowed to affect the computation. For example, a message is not delivered to the application program before it is logged. This is also called synchronous logging.
    Each process also takes periodic checkpoints to limit the recovery time (needed to re-execute and replay the log)
–The fundamental assumption that supports the use of this form of logging is that a fault can occur after every non-deterministic event in the computation
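The "log before deliver" rule can be sketched as below. This is a minimal illustration, not the lecture's implementation: the class `PessimisticLogger`, the JSON-lines file format, and the use of a local file with `os.fsync` as a stand-in for stable storage are all assumptions.

```python
import json
import os
import tempfile


class PessimisticLogger:
    """Writes each message to a log file before handing it to the application."""

    def __init__(self, path: str) -> None:
        self.path = path

    def deliver(self, message, handler) -> None:
        # Synchronous logging: the message reaches (assumed) stable storage first.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # Only now may the message affect the computation.
        handler(message)

    def logged(self) -> list:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


def demo():
    """Deliver two messages and return (what the application saw, what was logged)."""
    with tempfile.TemporaryDirectory() as d:
        log = PessimisticLogger(os.path.join(d, "p2.log"))
        seen = []
        for m in ["m2", "m5"]:
            log.deliver(m, seen.append)
        return seen, log.logged()
```

The blocking `fsync` on every delivery is exactly the performance cost that pessimistic logging pays for simple recovery.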
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging (contd.)
–Example and logs
–Logs of
    P0 is {m0, m4, m7}
    P1 is {m1, m3, m6}
    P2 is {m2, m5}
[Figure: message diagram for processes P0, P1, P2 with messages m0 to m7 and labels A, B, C]
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging (contd.)
–In case of a failure, the failed process rolls back to its previous checkpoint, starts re-execution, and replays the messages from the log
–For example, if P2 fails, it will roll back to checkpoint C and replay messages m2 and m5
–The failed process will be able to arrive at the state it was in before the failure was detected.
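Rollback and replay can be sketched as follows. The `Process` class, its integer state, and the in-memory list standing in for the stable-storage log are assumptions made for illustration; the only property that matters is that `apply` is deterministic, so replaying the same messages from the checkpoint rebuilds the same state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    state: int = 0
    checkpoint: int = 0
    log: List[int] = field(default_factory=list)  # stand-in for the stable-storage log

    def apply(self, msg: int) -> None:
        # Deterministic state transition: same state + same message -> same result.
        self.state = self.state * 10 + msg

    def receive(self, msg: int) -> None:
        self.log.append(msg)   # pessimistic: log first
        self.apply(msg)        # then let the message affect the computation

    def take_checkpoint(self) -> None:
        self.checkpoint = self.state
        self.log.clear()       # messages before the checkpoint are no longer needed

    def recover(self) -> None:
        self.state = self.checkpoint   # roll back to the last checkpoint
        for msg in self.log:           # replay logged messages in order
            self.apply(msg)
```

For example, after a checkpoint followed by messages 2 and 5, overwriting `state` to simulate a failure and calling `recover()` restores the pre-failure state without involving any other process.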
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging (contd.)
–Advantages of this scheme
    A process can commit output to the outside world without running a special protocol
    Recovery is simplified, as the effect of a failure is confined to the failed process
    The process restarts from the most recent checkpoint
    No need to run complex garbage collection protocols
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging (contd.)
–What happens if we do not log messages and save them on a stable store?
    Consider the case where m3 depends on m2
    A failure in P1 will prevent message m3 from being regenerated (because m2 was not saved on a stable store and hence will not be replayed). This makes the state of process P0 inconsistent (note: a message received but never sent); such a process is called an orphan process
[Figure: processes P0, P1, P2 with messages m1, m2, m3 and a failure]
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Pessimistic logging (contd.)
–Disadvantage(s)
    High price by way of performance penalty
–Solutions
    Special hardware assist
        Non-volatile semiconductor memory to implement stable storage
        Special bus to guarantee atomic logging
    Sender based message logging: log each message at the sender; this may improve performance at the expense of longer recovery time
Checkpointing: Dist. Sys. (contd.)
Log based recovery - Optimistic logging
–Log messages asynchronously
–The basic assumption is that logging will complete before a failure occurs
–Volatile logs are flushed to stable storage periodically
–The price paid is
    Complete recovery is not always possible, unlike in the case of pessimistic logging
    Orphan processes may result; therefore processes may have to roll back to un-receive messages
–Dependency tracking (similar to a dependency graph) during failure-free execution can help reduce the recovery time
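The asynchronous logging described above can be sketched as below, to contrast it with pessimistic logging. The class `OptimisticLogger`, the `flush_every` parameter, and the two in-memory lists standing in for volatile memory and stable storage are assumptions for illustration only.

```python
from typing import Callable, List


class OptimisticLogger:
    """Buffers log entries in volatile memory and flushes them in batches."""

    def __init__(self, flush_every: int = 3) -> None:
        self.flush_every = flush_every
        self.volatile: List[str] = []   # lost on a crash
        self.stable: List[str] = []     # stand-in for stable storage

    def deliver(self, msg: str, handler: Callable[[str], None]) -> None:
        self.volatile.append(msg)
        handler(msg)                    # delivered without waiting for stable storage
        if len(self.volatile) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self) -> List[str]:
        """Simulate a failure: return the messages whose log entries are lost."""
        lost = list(self.volatile)
        self.volatile.clear()
        return lost
```

Messages returned by `crash()` were already delivered but cannot be replayed. Any process that depends on their effects may become an orphan and have to roll back.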
Checkpointing: Dist. Sys. (contd.)
Log based recovery – Causal
–Key concept
    Information that relates to events is either fully logged or is available locally to the process for recovery
Checkpointing: Dist. Sys. (contd.)
Implementation issues
–The main difficulty is the complexity of handling recovery
–Nearly all practical message logging systems use pessimistic logging, since it makes recovery much simpler to implement
–Major source of overhead: stable storage access (communication overhead is much lower in comparison)
–Interaction with the outside world, which is constrained by checkpointing/logging, is an important issue
Checkpointing: Dist. Sys. (contd.)
Implementation issues (contd.)
–Performance improvement
    Memory protection hardware to protect checkpoint state information
    Compiler assistance can reduce overhead
    Reliable channel protocols: making communication channels more reliable can guarantee that no messages are lost
Forward error recovery
Consider the following scheme for checkpoint based rollback recovery
–Two processors P1 and P2
–At checkpoints, P1 and P2 compare results and save state (checkpoints)
–An error detected at such a comparison causes the processors P1 and P2 to roll back
–Note: if there were three processors, they could potentially mask the faulty process/processor
Forward error recovery (contd.)
Roll forward – two processors and a spare
[Figure: timelines of P1, P2 and the spare, with checkpoints chk1 and chk2, an error, and a compare step]
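The figure itself is not recoverable here, so the following is only a sketch of one common reading of roll-forward with a spare: when P1 and P2 disagree at a checkpoint, the spare re-executes the interval from the previous checkpoint and its result identifies which processor was correct. The function name `resolve_with_spare` and the modelling of each processor as a function from checkpoint state to result are assumptions. The sketch also runs the spare sequentially, whereas in a real roll-forward scheme P1 and P2 keep executing while the spare works.

```python
from typing import Callable, Optional, Tuple

Processor = Callable[[int], int]


def resolve_with_spare(
    checkpoint: int, p1: Processor, p2: Processor, spare: Processor
) -> Tuple[Optional[int], str]:
    """Compare P1 and P2 at a checkpoint; use the spare only on a mismatch."""
    r1, r2 = p1(checkpoint), p2(checkpoint)
    if r1 == r2:
        return r1, "match"            # no error detected, spare not needed
    rs = spare(checkpoint)            # spare re-executes the interval from chk1
    if rs == r1:
        return r1, "P2 faulty"        # P1's state is good; copy it to P2
    if rs == r2:
        return r2, "P1 faulty"        # P2's state is good; copy it to P1
    return None, "rollback"           # no two results agree: roll back
```

Only when all three results differ does the pair fall back to ordinary rollback; otherwise the correct state is kept and no work is repeated by the pair.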
Forward error recovery (contd.)
Additional issues
–If we had three processors to begin with, why not use fault masking?
    Think of more than one job and the number of processors required: a single spare can act as the spare for many pairs of jobs
–What if a second fault occurs while processor P1 is continuing?
    Use the spare for one more period
Summary
Discussed checkpointing and logging issues at length