Copyright 2004 Koren & Krishna, ECE655, Fall 2006
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing (ECE 655)
Checkpointing III
Coordinated Checkpointing Algorithms
Uncoordinated checkpointing may lead to the domino effect or to livelock.
Approaches to checkpoint coordination include:
- The Koo-Toueg algorithm, in which one process initiates the system-wide checkpointing process
- Algorithms that stagger checkpoints in time; staggering can help avoid near-simultaneous heavy loading of the disk system
- Communication-induced checkpointing procedures
- Using coordinated and uncoordinated checkpointing algorithms simultaneously; the latter is sufficient to deal with most isolated failures
Koo-Toueg Algorithm
Suppose P wants to establish a checkpoint at P_3.
This checkpoint will record that message q1 was received from Q; to prevent q1 from becoming an orphan, Q must take a checkpoint as well.
Thus, establishing a checkpoint at P_3 forces Q to take a checkpoint recording that q1 was sent.
An algorithm for such coordinated checkpointing uses two types of checkpoints: tentative and permanent.
P first records its current state in a tentative checkpoint, then sends a message to every process from which it has received a message since taking its last checkpoint.
Koo-Toueg Algorithm - Cont.
The message tells each such process (e.g., Q) the last message, m_qp, that P received from it before the tentative checkpoint was taken.
If m_qp was not yet recorded in a checkpoint by Q, then to prevent m_qp from becoming an orphan, Q is asked to take a tentative checkpoint recording that m_qp was sent.
If every process that needs to confirms taking the requested checkpoint, all tentative checkpoints are converted to permanent ones.
If some process is unable to checkpoint as requested, P and all the processes it contacted abandon their tentative checkpoints, and none are made permanent.
This may set off a chain reaction of checkpoints: each contacted process can in turn spawn checkpoint requests among the processes it has received messages from.
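The chain reaction described above can be sketched as a reachability computation over the "received a message from" relation. This is a minimal illustrative sketch, not the full algorithm: it omits the tentative-to-permanent commit phase and simply asks every sender (the real algorithm asks only those whose message m_qp is not yet recorded in a checkpoint). All names here are ours.

```python
def koo_toueg_checkpoint(initiator, received_from):
    """Return the set of processes that end up taking a tentative
    checkpoint when `initiator` starts a coordinated checkpoint.
    `received_from[p]` is the set of processes p has received
    messages from since p's last checkpoint."""
    tentative = {initiator}          # processes holding tentative checkpoints
    frontier = [initiator]
    while frontier:
        p = frontier.pop()
        # p asks every process it received a message from
        for q in received_from.get(p, set()):
            if q not in tentative:   # q has not yet checkpointed this round
                tentative.add(q)     # q takes a tentative checkpoint ...
                frontier.append(q)   # ... and may spawn further requests
    return tentative

# Example: P received from Q; Q received from R: a chain reaction P -> Q -> R
deps = {"P": {"Q"}, "Q": {"R"}, "R": set()}
print(sorted(koo_toueg_checkpoint("P", deps)))
```

If any process in this set refuses, all tentative checkpoints would be discarded; otherwise all are made permanent.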
Staggered Checkpointing
The Koo-Toueg algorithm, and others like it, can lead to a large number of processes taking checkpoints at nearly the same time.
If they all write to shared stable storage, e.g., a set of common disks, this surge can lead to congestion at the disks, the network, or both.
Either of two approaches can be used to ensure that, at any time, at most one process is writing its checkpoint:
(1) Write the checkpoint into a local buffer, then stagger the writes from buffer to stable storage (assuming a buffer of sufficiently large capacity).
(2) Stagger the checkpoints themselves in time.
Staggered Checkpointing - Cont.
Staggered checkpoints may not be consistent: there may be orphan messages in the system.
This can be avoided by a coordinating phase in which each process logs to stable storage all messages it has sent since its previous checkpoint.
The message-logging phases of the processes will overlap in time.
If the volume of messages is smaller than the size of the individual checkpoints, the disks and network will see a reduced surge.
Recovery From Failure
If a process fails, it can be restarted by rolling it back to its last checkpoint and playing back all the messages stored in its log.
The combination of a checkpoint and a message log is called a logical checkpoint.
The staggered checkpointing algorithm guarantees that the logical checkpoints form a consistent recovery line.
Phase One of Staggering Algorithm
Phase 1 - the checkpointing phase:
for (i = 0; i <= n-1; i++) {
    P_i takes a checkpoint
    P_i sends a message to P_{(i+1) mod n}, ordering the latter to take a checkpoint
}
When P_0 gets a message from P_{n-1} ordering it to checkpoint, this is the cue for P_0 to initiate the second (message-logging) phase.
It sends out a marker message on each of its outgoing channels. When a process P_i receives a marker message, it goes to phase 2.
Phase Two of Staggering Algorithm
Phase 2 - the message-logging phase:
if (no marker message was previously received in this round by P_i) {
    P_i sends a marker message on each of its outgoing channels
    P_i logs all the messages it has received since its preceding checkpoint
} else {
    P_i updates its message log by adding all the messages received since the last time the log was updated
}
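The ring structure of phase 1 can be sketched as follows; this is an illustrative simulation of the order in which checkpoints are taken, with the function name and structure ours rather than from the algorithm's statement.

```python
def phase_one(n):
    """Simulate phase 1 for n processes in a ring: return the order in
    which the (staggered) checkpoints are taken."""
    order = []
    i = 0
    while True:
        order.append(i)              # P_i takes its checkpoint ...
        nxt = (i + 1) % n            # ... then orders P_{(i+1) mod n} to do so
        if nxt == 0:                 # the order comes back to P_0: everyone is
            break                    # done, P_0 initiates the logging phase
        i = nxt
    return order

print(phase_one(3))   # checkpoints are taken strictly one at a time
```

Because each process checkpoints only after being ordered to by its predecessor, at most one checkpoint write is in progress at any time.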
Example of Staggering Algorithm - Phase One
P0 takes a checkpoint and sends a take_checkpoint order to P1.
P1 sends such an order to P2 after taking its own checkpoint.
P2 sends a take_checkpoint order back to P0.
At this point, each process has taken a checkpoint and the second phase can begin.
Example - Phase 2
P0 sends a message_log order to P1 and P2, telling them to log the messages they received since their last checkpoint.
P1 and P2 send out similar message_log orders.
Each time such an order is received, the process logs its messages.
If it is the first message_log order the process has received, it also sends out marker messages on each of its outgoing channels.
Recovery
Assumption: given its checkpoint and the messages it received, a process can be recovered.
We may have orphan messages with respect to the physical checkpoints taken in the first phase.
No orphan messages exist with respect to the latest (in time) logical checkpoints, which are generated from the physical checkpoints and the message logs.
Time-Based Synchronization
Orphan messages cannot occur if every process checkpoints at exactly the same time.
This is practically impossible: clock skews and message communication times cannot be reduced to zero.
Time-based synchronization can still be used to facilitate checkpointing, provided we take account of nonzero clock skews.
In time-based synchronization, processes are checkpointed at previously agreed times.
Example: ask each process to checkpoint when its local clock reads a multiple of 100 seconds.
Such a procedure by itself is not enough to avoid orphan messages.
Creation of an Orphan Message - Example
Each process checkpoints at time 1100 (local clock).
The skew between the two clocks is such that process P0 checkpoints much earlier (in real time) than process P1.
As a result, P0 sends a message to P1 after P0's checkpoint, which P1 receives before its own checkpoint.
This message is a potential orphan.
Preventing Creation of an Orphan Message
Suppose the skew between any two clocks in the distributed system is bounded by ε, and each process is asked to checkpoint when its local clock reads τ.
Following its checkpoint, a process Px should not send out messages to any process Py until it is certain that Py's local clock reads more than τ.
Px should remain silent over the interval [τ, τ + ε] (all times measured by Px's local clock).
If the inter-process message delivery time has a lower bound η, then to prevent orphan messages Px only needs to remain silent during the shorter interval [τ, τ + ε - η].
If η > ε, this interval has zero length: there is no need for Px to remain silent.
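The silence rule above can be captured in a small helper; `tau`, `eps`, and `eta` stand for the checkpoint time, the clock-skew bound, and the delivery-time lower bound just defined. The function itself is an illustrative sketch, not part of any published protocol.

```python
def silent_interval(tau, eps, eta=0.0):
    """Interval (by the sender's own clock) during which it must not send
    messages after checkpointing at tau; None if no silence is needed."""
    end = tau + eps - eta       # by then every receiver's clock is past tau
    if end <= tau:              # eta >= eps: delivery delay alone suffices
        return None
    return (tau, end)

print(silent_interval(1100, 5))       # skew 5, no delivery bound
print(silent_interval(1100, 5, 2))    # delivery bound shortens the interval
print(silent_interval(1100, 5, 7))    # eta > eps: no silence needed
```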
A Different Method of Prevention
Suppose message m is received by process Py when Py's clock reads t.
Given the delivery-time lower bound η, m must have been sent at least η earlier, i.e., before Py's clock read t - η.
Since the clock skew is at most ε, at that time Px's clock read at most t - η + ε.
If t - η + ε < τ, the sending of m is recorded in Px's checkpoint, so m cannot be an orphan.
Hence a message m received by Py when its clock reads less than τ - ε + η cannot be an orphan.
Orphan messages can be avoided by having Py neither use nor include in its checkpoint any message received during [τ - ε + η, τ] (Py's clock) until after it has taken its checkpoint at τ.
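The receiver-side rule can likewise be sketched as a predicate; again `tau`, `eps`, and `eta` are the checkpoint time, skew bound, and delivery-time lower bound, and the function is illustrative.

```python
def must_defer(t_recv, tau, eps, eta):
    """True if a message received at local time t_recv must be held out of
    the checkpoint taken at tau, because it could be an orphan."""
    if t_recv >= tau:              # received after the checkpoint: not in it
        return False
    # the sender's clock at send time read at most t_recv - eta + eps;
    # if that is still below tau, the send is surely in the sender's checkpoint
    return t_recv - eta + eps >= tau

# eps = 5, eta = 2: the danger window is [tau - 3, tau)
print(must_defer(1096, 1100, 5, 2))   # False: send surely recorded by sender
print(must_defer(1098, 1100, 5, 2))   # True: possibly an orphan, defer it
```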
Diskless Checkpointing
Memory is volatile and, by itself, unsuitable for storing a checkpoint.
However, with extra processors we can permit checkpointing in main memory.
By avoiding disk writes, checkpointing can be faster.
It is best used as one level in a two-level checkpointing scheme.
Redundant processors use RAID-like techniques to deal with failure.
Example: a distributed system with five executing processors and one extra processor.
Each executing processor stores its checkpoint in its own memory; the extra processor stores the parity of these checkpoints.
If an executing processor fails, its checkpoint can be reconstructed from the four surviving checkpoints plus the parity.
RAID-like Diskless Checkpointing
The inter-processor network must have enough bandwidth for sending checkpoints.
Example: with n executing processors and one checkpointing processor, if all executing processors send their checkpoints to the checkpointing processor to compute the parity, that processor becomes a potential hotspot.
Solution: distribute the parity computations across the processors (illustrated for n = 5).
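The parity scheme can be sketched with bytewise XOR. For simplicity this sketch assumes all checkpoints are byte strings of equal length (a real scheme would pad or chunk); the function names are ours.

```python
from functools import reduce

def parity(checkpoints):
    """Bytewise XOR of all checkpoints: what the extra processor stores."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*checkpoints))

def reconstruct(surviving, par):
    """Rebuild the failed processor's checkpoint from survivors + parity,
    using the fact that XOR-ing everything back in cancels the survivors."""
    return parity(surviving + [par])

ckpts = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]  # five executing processors
p = parity(ckpts)                # stored by the extra (parity) processor
lost = ckpts.pop(2)              # processor 2 fails, its memory is gone
assert reconstruct(ckpts, p) == lost   # its checkpoint is recovered
```

As with RAID, this tolerates the loss of any single checkpoint; two simultaneous losses cannot be reconstructed from one parity block.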
Two-Level Recovery
Coordinating checkpoints prevents orphan messages but imposes overhead.
Uncoordinated checkpointing does not affect correctness if failures are isolated, i.e., at most one process is in a failed/recovering state at any time.
The vast majority of failures are isolated.
Goal: make recovery from isolated failures fast, and accept longer recovery times for simultaneous failures.
This suggests a two-level recovery scheme:
First level: each process takes its own checkpoints without coordination (useful only when recovering from isolated failures); the checkpoint need not be written to disk and can be written into the memory of another processor.
Second level: occasionally the entire system undergoes a coordinated checkpoint (with higher overhead), which guards against non-isolated failures.
Two-Level Recovery - Example
P0 fails at t0; the system rolls it back to the latest first-level checkpoint; recovery is successful.
P1 fails at t1 and rolls back; at point tx (during this recovery), P2 also fails.
These failures are non-isolated, so the system rolls both processes back to the latest second-level checkpoint.
In general, the more common non-isolated failures are, the greater the frequency at which second-level checkpoints must be taken.
Message Logging
To continue computation beyond its latest checkpoint, a recovering process may require all the messages it received since then, played back in their original order.
With coordinated checkpointing, each process can be rolled back to its latest checkpoint and restarted: those messages will be resent during reexecution.
To avoid the overhead of coordination and let processes checkpoint independently, messages can be logged instead.
Two approaches to message logging:
Pessimistic logging - ensures that rollback will not spread, i.e., if a process fails, no other process needs to be rolled back to ensure consistency.
Optimistic logging - a process failure may trigger the rollback of other processes as well.
Pessimistic Message Logging
Simplest approach: the receiver stops whatever it is doing when it receives a message, logs the message onto stable storage, then resumes execution.
To recover a process from failure, roll it back to its latest checkpoint and play back to it the messages it received since that checkpoint, in the right order.
No orphan messages will exist: every message will either have been received before the latest checkpoint or explicitly saved in the message log.
Rolling back one process will not trigger the rollback of any other process.
Sender-Based Message Logging
Logging messages to stable storage can impose a significant overhead.
To guard against one isolated failure at a time, sender-based message logging can be used.
The sender of a message records it in a log; when required, the log is read to replay the message.
Each process has a send-counter and a receive-counter, incremented every time the process sends or receives a message.
Each message carries a Send Sequence Number (SSN): the value of the send-counter when it is transmitted.
A received message is allocated a Receive Sequence Number (RSN): the value of the receive-counter when it is received.
The receiver sends an ACK to the sender, including the RSN it has allocated to the message.
Upon receiving this ACK, the sender acknowledges the ACK in a message to the receiver.
Sender-Based Message Logging - Cont'd
From the time the receiver receives a message until it receives the sender's acknowledgment of its own ACK, the receiver is forbidden to send messages to other processes; this is essential to maintaining correct functioning upon recovery.
A message is fully-logged when the sending node knows both its SSN and its RSN; it is partially-logged when the sending node does not yet know its RSN.
When a process rolls back and restarts computation from its latest checkpoint, it sends the other processes a message listing the SSN of the latest message from each that was recorded in its checkpoint.
When a process receives this message, it knows which of its messages must be retransmitted, and retransmits them.
The recovering process must consume these messages in the same order as before the failure; this is easy for fully-logged messages, whose RSNs are available and can be used to sort them.
Partially-Logged Messages
The remaining problem is the partially-logged messages, whose RSNs are not available.
They were sent, but their ACK never reached the sender: either the receiver failed before the message was delivered, or it failed after receiving the message but before sending the ACK.
Recall that the receiver is forbidden to send out messages of its own between receiving a message and sending its ACK.
As a result, receiving the partially-logged messages in a different order the second time cannot affect any other process in the system: correctness is preserved.
Clearly, this approach is only guaranteed to work if at most one node is failed at any time.
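The replay step just described can be sketched as follows. This is an illustrative model, not the published protocol: we assume each sender's log holds (SSN, RSN, payload) triples, with RSN set to None for partially-logged messages, and all names are ours.

```python
def replay_order(sender_logs, last_seen_ssn):
    """Messages a recovering process must re-consume: fully-logged ones
    sorted by RSN (restoring the original order), then partially-logged
    ones, whose relative order provably does not matter."""
    pending = []
    for sender, log in sender_logs.items():
        for ssn, rsn, payload in log:
            if ssn > last_seen_ssn[sender]:   # not in the recovered checkpoint
                pending.append((rsn, payload))
    fully = sorted(m for m in pending if m[0] is not None)   # sort by RSN
    partial = [m for m in pending if m[0] is None]
    return [p for _, p in fully] + [p for _, p in partial]

# Q's message "a" got RSN 7, R's "b" got RSN 6; Q's "c" was never ACKed
logs = {"Q": [(1, 7, "a"), (2, None, "c")], "R": [(1, 6, "b")]}
print(replay_order(logs, {"Q": 0, "R": 0}))
```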
Optimistic Message Logging
Optimistic message logging has lower overhead than pessimistic logging; however, recovery from failure is much more complex, so optimistic logging is mostly of theoretical interest.
When messages are received, they are written into a volatile buffer which, at a suitable time, is copied to stable storage.
Process execution is not disrupted, so the logging overhead is very low.
Upon failure, the contents of the buffer can be lost, leading to multiple processes having to be rolled back.
We need a scheme to handle this situation.
Checkpointing in Shared-Memory Systems
A variant of CARER for shared-memory bus-based multiprocessors, where each processor has its own cache.
The algorithm is changed to maintain cache coherence among the multiple caches.
Instead of the single bit marking a line as unchangeable, we have a multi-bit identifier:
- a checkpoint identifier, C_id, with each cache line
- a (per-processor) checkpoint counter, C_count, keeping track of the current checkpoint number
Shared Memory - Cont.
To take a checkpoint, increment the counter.
A line last modified before the checkpoint will have its C_id less than the counter.
When a line is updated, set C_id = C_count.
If a line has been modified since being brought into the cache and C_id < C_count, the line is part of the checkpoint state and is therefore unwritable; any write into such a line must wait until the line is first written back to main memory.
If the counter has k bits, it rolls over to 0 after reaching 2^k - 1.
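The C_id/C_count test can be sketched as two small functions; the names follow the slide, while the code itself is an illustrative sketch that assumes each line carries a `modified` bit alongside its C_id.

```python
def is_unwritable(line_modified, c_id, c_count):
    """A modified line stamped before the current checkpoint belongs to the
    checkpoint state: it must be flushed to memory before being overwritten."""
    return line_modified and c_id < c_count

def take_checkpoint(c_count, k):
    """Advance the checkpoint counter, wrapping a k-bit counter back to 0."""
    return (c_count + 1) % (2 ** k)

print(is_unwritable(True, 3, 4))      # True: part of checkpoint, flush first
print(is_unwritable(True, 4, 4))      # False: modified in the current epoch
print(take_checkpoint(2 ** 4 - 1, 4)) # a 4-bit counter rolls over to 0
```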
Bus-Based Coherence Protocol
Modify a cache coherence algorithm to take account of checkpointing.
All traffic between caches and memory must use the bus, i.e., every cache can watch the traffic on the bus.
A cache line can be in one of four states: invalid, shared unmodified, exclusive modified, and exclusive unmodified.
Exclusive: this is the only valid copy in any cache. Modified: the line has been modified since it was brought into the cache from memory.
Bus-Based Coherence Protocol - Cont'd
If a processor wants to update a line in the shared unmodified state, the line moves into the exclusive modified state; other caches holding the same line must invalidate their copies, which are no longer current.
When a line is in the exclusive modified or exclusive unmodified state and another cache puts a read request on the bus, this cache must service that request, since it holds the only current copy of the line; as a byproduct, memory is also updated if necessary, and the line then moves to shared unmodified.
On a write miss, the line is brought into the cache in the exclusive modified state.
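The transitions just listed can be summarized as a table. This is a simplified sketch covering only the events mentioned on the slide (local write, remote read on the bus, remote write invalidating a copy, write miss); the event names are ours.

```python
# (current state, observed event) -> next state
TRANSITIONS = {
    ("shared_unmodified", "local_write"): "exclusive_modified",   # others invalidate
    ("exclusive_modified", "remote_read"): "shared_unmodified",   # serve request, update memory
    ("exclusive_unmodified", "remote_read"): "shared_unmodified", # serve request
    ("shared_unmodified", "remote_write"): "invalid",             # copy no longer current
    ("invalid", "local_write"): "exclusive_modified",             # write miss fills the line
}

def next_state(state, event):
    """Next state of a cache line; events not in the table leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

print(next_state("shared_unmodified", "local_write"))
```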
Bus-Based Coherence and Checkpointing Protocol
How can we modify this protocol to account for checkpointing?
The original exclusive modified state now splits into two: exclusive modified and unwritable.
Directory-Based Protocol
In this approach a directory is maintained centrally, recording the status of each line.
We can regard this directory as controlled by a shared-memory controller.
This controller handles all read and write misses and all other operations that change a line's state.
Example: if a line is in the exclusive unmodified state and the cache holding it wants to modify it, the cache notifies the controller of its intention; the controller then changes the state to exclusive modified.
It is then a simple matter to implement the checkpointing scheme atop such a protocol.
Other Uses of Checkpointing
(1) Process migration: a checkpoint represents process state, so migrating a process from one processor to another means moving the checkpoint; computation can then resume on the new processor. This can be used to recover from permanent or intermittent faults. The nature of the checkpoint determines whether the new processor must be of the same model and run the same operating system.
(2) Load balancing: better utilization of a distributed system by ensuring that the computational load is appropriately shared among the processors.
(3) Debugging: core files, dumped when a program exits abnormally, are essentially checkpoints containing full state information about the affected process; debuggers can read core files and aid in the debugging process.
(4) Snapshots: observing the program state at discrete epochs gives a deeper understanding of program behavior.