1 Distributed Systems 2007/08 Rollback-Recovery Alberto Montresor Università di Trento This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
2 Group communication, consensus, etc. Fault tolerance – offers the abstraction of a reliable system where faults are tolerated Recovery techniques Fault recovery – once a fault occurs, we must be able to recover to a previous correct state Examples: Transactions Data-oriented technique Rollback recovery Computation/communication oriented techniques What is a checkpoint? Why do we need them?
3 The idea (1) Assumption: the stable storage survives failures. [Figure: a Distributed System (Process 1 to Process 5) and the Stable Storage; labels: DATA, RECOVERY]
4 The idea (2) Processes: have access to stable storage that survives all tolerated failures (e.g., disks, RAID arrays). Stable storage: is used to save recovery information periodically during failure-free periods. Upon a failure: a failed process uses the saved information to restart the computation from an intermediate state, thereby reducing the amount of lost computation
5 Scientific computations ● Executions running for weeks ● Failures are rare ● Periodically store information about current state of the computation, restart when needed Main application area
6 Minimal information Store the states of the participating processes, called checkpoints Message/event logging ● Log messages exchanged among the processes ● Log interactions with input and output devices ● In general, log events that occur at each process Recovery information
7 Usual approaches: ● Make a periodic “big” checkpoint ● More frequently, make incremental additions When recovering ● Restore latest checkpoint ● Replay messages/events stored in logs Result: Combining checkpoints with message logging makes it possible to restore a state that lies beyond the most recent checkpoint Checkpoint-based vs. Log-based R-R techniques
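The recovery steps above (restore the latest checkpoint, then replay the log) can be sketched as follows. This is a minimal illustration, not a real recovery library: `Counter`, `StableStorage`, and `recover` are hypothetical names, and an in-memory object stands in for stable storage.

```python
import copy


class Counter:
    """Toy process: its state is a running total updated by each message."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        self.total += message


class StableStorage:
    """Stand-in for stable storage (assumed to survive the crash)."""

    def __init__(self):
        self.checkpoint = None
        self.log = []

    def save_checkpoint(self, process):
        self.checkpoint = copy.deepcopy(process.__dict__)
        self.log.clear()  # messages before the checkpoint are already reflected in it

    def append(self, message):
        self.log.append(message)


def recover(storage):
    """Restore the latest checkpoint, then replay the messages logged after it."""
    process = Counter()
    process.__dict__.update(copy.deepcopy(storage.checkpoint))
    for message in storage.log:
        process.handle(message)
    return process


storage = StableStorage()
p = Counter()
for m in [1, 2, 3]:
    storage.append(m)
    p.handle(m)
storage.save_checkpoint(p)      # periodic "big" checkpoint, total == 6
for m in [10, 20]:
    storage.append(m)           # incremental additions after the checkpoint
    p.handle(m)

recovered = recover(storage)    # p is assumed to have crashed at this point
print(recovered.total)          # 36, a state beyond the checkpoint
```

Replaying the log is what lets the recovered state lie beyond the most recent checkpoint; it only works here because `handle` is deterministic.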
8 Scientific computing systems might have massive data structures while they run...but maybe not all of them need to be “checkpointed” If the system uses big but unchanging tables, why write them out? ● If data change slightly, we could save only the increments! In general, a checkpoint only needs to include data that can't be “regenerated” in any simple, quick way Which state? (1)
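Saving only the increments can be sketched as a dictionary diff. The state layout and the names `incremental_checkpoint` and `apply_increment` are assumptions made for illustration; a real system would typically track dirty memory pages or application-level change sets instead.

```python
def incremental_checkpoint(previous, current):
    """Return only what changed between two snapshots of the state."""
    changed = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    removed = [k for k in previous if k not in current]
    return {"changed": changed, "removed": removed}


def apply_increment(base, increment):
    """Rebuild a later state from a full checkpoint plus one increment."""
    state = dict(base)
    state.update(increment["changed"])
    for key in increment["removed"]:
        state.pop(key, None)
    return state


# The big table never changes, so it is written only in the full checkpoint.
base = {"table": (1, 2, 3), "step": 10, "residual": 0.5}
later = {"table": (1, 2, 3), "step": 11, "residual": 0.25}

increment = incremental_checkpoint(base, later)
print(increment)  # {'changed': {'step': 11, 'residual': 0.25}, 'removed': []}
```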
9 GLOBAL STATE = PROCESSES' STATE + COMMUNICATION CHANNELS' STATE A system recovers correctly if its internal state is consistent with the observable behaviour of the system before the failure Which state? (2)
10 Extreme case: A checkpoint could include the entire state of a process ● Write out its memory “layout” ● Contents of all pages ● Contents of registers This way, we can resume the process by simply reloading its entire state. ● Windows XP and GNU/Linux do this for their “hibernate” feature ● Potentially, very fast Extreme checkpointing (1)
11 Problem: if a program is “temporarily deterministic” it may 1. Crash due to a corrupt data structure 2. Roll back to the last checkpoint 3. Reload the same (still corrupt) structure, then go to 1. The advantage of “rebuilding” data structures is that we avoid this risk Extreme checkpointing (2)
12 Rollback/Recovery in practice. Rollback-Recovery (RR) techniques: ● Checkpoint-based: uncoordinated, coordinated, communication-induced ● Log-based: pessimistic, optimistic, causal
14 Description: ● All processes periodically take checkpoints in an uncoordinated fashion ● Very easy to implement Problems: ● Determinism vs. non-determinism: when processes roll back, they may end up in a global state that could never happen in a failure-free scenario ● Cascading rollbacks (domino effect) Uncoordinated checkpointing (1)
15 Dealing with non-determinism ● Multi-threading ● Message re-ordering ● Applications that receive user input, timer interrupts, I/O from devices, or messages on multiple connections Basic concern: What if, after rolling back to a checkpoint, the application does not repeat the actions that occurred the “last time” the process was in that same state? Limitations of checkpoint-based RR
16 P and Q are interacting. Each makes checkpoints independently of the other. P sends the message m(1), then makes a checkpoint (CP); Q receives m(1) after its own CP. [Figure: timelines of P and Q with messages m(1) and m(2)] Uncoordinated checkpointing (2)
17 Q crashes and rolls back to its latest checkpoint. [Figure: timelines of P and Q with message m(1) and the failure] Uncoordinated checkpointing (3)
18 Q crashes and rolls back to its checkpoint. It will have “forgotten” the message from P! [Figure: timelines of P and Q with message m(1)] Uncoordinated checkpointing (4)
19 … Yet Q may even have replied. Who would care? Suppose the reply was “OK to release the cash. Account has been debited”. Uncoordinated checkpointing (5)
20 First, Q needs to see the request again, so that it will re-enter the state in which it sent the reply P must regenerate the input request But if Q is non-deterministic, it might not repeat those actions even with identical input So that might not be “enough” Two related concerns
21 P crashes and rolls back Problems with uncoordinated checkpoints (1)
22 P crashes and rolls back. Will P “reissue” the same request? Recall our non-determinism assumption: it might not! Problems with uncoordinated checkpoints (2)
23 Idea: if a process rolls back, make others roll back to a consistent state If a message m was sent after checkpoint → roll receiver back to a state before m was received If a message m was received after checkpoint, → roll the sender back to a state prior to sending m Channels will be “empty” after doing this A solution or a problem? Solution: Rollback propagation
24 Q crashes and rolls back Problems with uncoordinated checkpoints (3)
25 Q crashes and rolls back Problems with uncoordinated checkpoints (4)
26 P must also roll back It won’t be a problem if P happens not to resend the same request Problems with uncoordinated checkpoints (5)
27 But now we can get a cascade effect Problems with uncoordinated checkpoints: The Domino effect (1)
28 Q crashes, restarts from checkpoint… The Domino effect (2)
29 Forcing P to roll back for consistency… The Domino effect (3)
30 New inconsistency forces Q to roll back even further The Domino effect (4)
31 New inconsistency forces Q to roll back even further The Domino effect (5)
32 It arises when the creation of checkpoints is uncoordinated ● Can force a system to roll back to initial state ● Clearly undesirable in the extreme case… The Domino effect (6)
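Rollback propagation and the resulting domino effect can be reproduced with a small fixed-point computation. The sketch below is an illustration under stated assumptions, not a published algorithm: checkpoints are numbered from 0 (the initial state), an event in interval i happened after checkpoint i and before checkpoint i + 1, and the function name `recovery_line` and the message tuples are hypothetical.

```python
def recovery_line(latest, messages, failed):
    """Propagate rollbacks until no message crosses the recovery line.

    latest:   process -> index of its most recent checkpoint
    messages: list of (sender, send_interval, receiver, recv_interval)
    failed:   the process that crashed
    Returns process -> checkpoint index to restore; a value of
    latest[p] + 1 means "p keeps its current state".
    """
    pos = {p: c + 1 for p, c in latest.items()}
    pos[failed] = latest[failed]          # the failed process loses its volatile state
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            send_undone = s_int >= pos[sender]
            recv_undone = r_int >= pos[receiver]
            if send_undone and not recv_undone:
                pos[receiver] = r_int     # roll receiver back to before the receipt
                changed = True
            elif recv_undone and not send_undone:
                pos[sender] = s_int       # roll sender back to before the send
                changed = True
    return pos


# P and Q exchange messages so that each checkpoint is "crossed" by a message.
zigzag = [
    ("Q", 0, "P", 0),
    ("P", 1, "Q", 0),
    ("Q", 1, "P", 1),
    ("P", 2, "Q", 1),
    ("Q", 2, "P", 2),
]
domino = recovery_line({"P": 2, "Q": 2}, zigzag, failed="Q")
print(domino)  # {'P': 0, 'Q': 0}: both processes fall back to the initial state

# A single message sent before P's checkpoint and received after Q's checkpoint.
single = recovery_line({"P": 1, "Q": 1}, [("P", 0, "Q", 1)], failed="Q")
print(single)  # {'P': 0, 'Q': 1}
```

Each rollback only ever moves a process to an earlier checkpoint, so the loop terminates; in the zigzag pattern it terminates only at the initial state.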
33 One solution is to coordinate the creation of checkpoints and logging of messages. In effect, find a point at which we can “pause” the system. All processes make a checkpoint in a coordinated way, then resume their normal execution. Protocols for doing this are well-known and isomorphic to consistent cuts (“global snapshot”) The Domino effect (7)
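The "pause, checkpoint, resume" idea can be sketched as a blocking two-phase exchange driven by a coordinator. This is a single-threaded simulation under strong assumptions (no process fails during the protocol and no application messages are in flight while processes are paused); the class and method names are hypothetical.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.paused = False
        self.tentative = None
        self.checkpoint = None

    def work(self):
        if not self.paused:           # a paused process makes no progress
            self.state += 1

    def prepare(self):
        """Phase 1: pause and take a tentative checkpoint, then acknowledge."""
        self.paused = True
        self.tentative = self.state
        return True

    def commit(self):
        """Phase 2: make the tentative checkpoint permanent and resume."""
        self.checkpoint = self.tentative
        self.tentative = None
        self.paused = False

    def abort(self):
        """Discard the tentative checkpoint and resume."""
        self.tentative = None
        self.paused = False


def coordinated_checkpoint(processes):
    """Commit only if every process acknowledged the tentative checkpoint."""
    if all([p.prepare() for p in processes]):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False


procs = [Process("P"), Process("Q"), Process("R")]
for i, p in enumerate(procs):
    for _ in range(i + 1):
        p.work()

ok = coordinated_checkpoint(procs)
line = [(p.name, p.checkpoint) for p in procs]
print(ok, line)  # True [('P', 1), ('Q', 2), ('R', 3)]
```

Because all checkpoints are taken while every process is paused, the committed set forms a consistent cut and recovery never needs to look at older checkpoints.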
34 How many checkpoints should we maintain in the stable storage? Since we cannot know where the domino effect will lead... Possible solutions: ● Dependency graphs ● Checkpoint graph moreover...
35 How many checkpoints should we maintain in the stable storage? Garbage collection is the removal of entries from the stable storage once they become useless for the computation The garbage collection
36 Often we can’t control processes we didn’t code ourselves ● Most systems have many black-box components ● Can’t expect them to implement the checkpoint/rollback policy Not every process can make a checkpoint “on request” ● Might be in the middle of a costly computation that left big data structures around ● This interferes with coordination protocols Hence it isn’t really practical to do coordinated checkpointing if it includes system components Other problems
38 The limitations of “pure checkpointing” call for a mechanism based on message logging. This solves problems about ● I/O from the outside world ● Re-ordering of messages ● Any kind of non-deterministic event under the control of the application In other words: non-determinism Message logging
39 ...ehm wait, what's a nondeterministic event? Either ● the receipt of a message from another process, or ● an event internal to the process execution. These events must be managed carefully to keep the system in a recoverable state.
40 A state in which, if the state of a process reflects a message receipt, then the state of the corresponding sender reflects sending that message [Chandy, Lamport 1985] This way, if the receiver fails, the sender knows which message to re-send to it to achieve a consistent recovery, i.e., the information held by each process does not conflict with another's. Consistent system state
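The definition above translates directly into a check over recorded process states. The representation (each state lists the message identifiers it reflects as sent and as received) and the name `is_consistent` are assumptions made for this sketch.

```python
def is_consistent(states):
    """True if every receipt reflected in some state is matched by a reflected send.

    states: process -> {"sent": set of message ids, "received": set of message ids}
    """
    all_sent = set().union(*(s["sent"] for s in states.values()))
    return all(s["received"] <= all_sent for s in states.values())


# m2 was sent but not yet received: it is in transit, which is still consistent.
in_transit = {
    "P": {"sent": {"m1", "m2"}, "received": set()},
    "Q": {"sent": set(), "received": {"m1"}},
}

# Q reflects receiving m1, but P's state no longer reflects sending it.
orphan = {
    "P": {"sent": set(), "received": set()},
    "Q": {"sent": set(), "received": {"m1"}},
}

print(is_consistent(in_transit))  # True
print(is_consistent(orphan))      # False
```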
41 “Piecewise deterministic” (PWD) assumption: all nondeterministic events that a process executes can be identified, and the information necessary to replay each event during recovery can be logged in the event’s determinant. Determinant: 1) Sender, destination, content of a message 2) I/O coming from the Outside World The piecewise deterministic (PWD) assumption
42 Outside World ● A message-passing system often interacts with the outside world to receive input data or show the outcome of a computation. ● If a failure occurs, the outside world cannot be relied upon to roll back Examples ● A printer cannot roll back what it has printed ● An ATM cannot ask for the money back The Outside World Process (1)
43 Modeled as a special process (Outside World Process, OWP) (a) Cannot fail (b) Cannot maintain state (c) Cannot participate in recovery procedures (d) CANNOT ROLL BACK. The OWP must perceive a consistent behavior of the system despite failures. The Outside World Process (2)
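Under the PWD assumption, replaying the logged determinants in their original order reproduces the pre-failure state. The sketch below assumes a determinant with the fields named on the slide plus a `delivery_order` field, which is an addition made here so that the replay order can be reconstructed; `deliver` and `replay` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Determinant:
    sender: str
    destination: str
    content: str
    delivery_order: int   # assumption: position in the receiver's delivery sequence


def deliver(state, content):
    """Deterministic step: the new state depends only on the old state and the message."""
    return state + [content]


def replay(determinants):
    """Rebuild the state by re-delivering messages in their logged order."""
    state = []
    for d in sorted(determinants, key=lambda d: d.delivery_order):
        state = deliver(state, d.content)
    return state


# Failure-free run: Q receives from two senders in an order it does not control.
arrivals = [("P", "debit 10"), ("R", "credit 5"), ("P", "close")]
log, state = [], []
for order, (sender, content) in enumerate(arrivals):
    log.append(Determinant(sender, "Q", content, order))
    state = deliver(state, content)

# During recovery the determinants may be collected in any order.
recovered = replay(list(reversed(log)))
print(recovered == state)  # True
```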
44 From system to OWP: Before sending a message (output) to OWP, the system must ensure that the state from which the message is sent will be recovered despite any future failure From OWP to system: Input messages that a system receives from OWP may not be reproducible during recovery So, they must be saved so that they can be retrieved when needed for execution replay after a failure. A common approach is to save each input message on stable storage before allowing the application program to process it The Outside World Process (3)
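Both directions of interaction with the OWP can be sketched in a few lines: inputs are logged before they are processed, and the state is made recoverable before any output is released. The `Teller` example, its field names, and the in-memory `StableStorage` are assumptions made for illustration.

```python
class StableStorage:
    """Stand-in for stable storage (assumed to survive the crash)."""

    def __init__(self):
        self.checkpoint = 0   # last saved balance
        self.inputs = []      # inputs received since that checkpoint


class Teller:
    def __init__(self, storage):
        # Starting (or restarting) means: load the checkpoint, replay logged inputs.
        self.storage = storage
        self.balance = storage.checkpoint
        for amount in storage.inputs:
            self.balance += amount

    def on_input(self, amount):
        self.storage.inputs.append(amount)   # save the input first...
        self.balance += amount               # ...then let the application process it

    def release_output(self):
        # Make the sending state recoverable before the outside world sees the output.
        self.storage.checkpoint = self.balance
        self.storage.inputs.clear()
        return f"balance={self.balance}"


storage = StableStorage()
teller = Teller(storage)
teller.on_input(100)
teller.on_input(-30)

restarted = Teller(storage)           # the first instance is assumed to have crashed
print(restarted.balance)              # 70
receipt = restarted.release_output()
print(receipt)                        # balance=70
```

A process restarted after `release_output` still recovers a balance of 70, so the output already seen by the outside world stays consistent with the recovered state.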
45 Receiver based logging Log received messages; like an “extension” of the checkpoint Sender based logging Log messages before you send them, ensures you can resend them if needed Mixed mode (Alvisi) Does both, optimizes to log where doing so is most efficient (results in smallest log/overhead) Three options...
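The first two options differ only in where the log lives. The sketch below contrasts them with a hypothetical `Node` class and a `mode` flag; both logs are assumed to survive the failure, and a single sender is used so that delivery order is not an issue (with several senders, a sender-based scheme would also need to record the receive order).

```python
class Node:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode        # "receiver" or "sender" (hypothetical labels)
        self.recv_log = []      # receiver-based: messages logged on arrival
        self.sent_log = []      # sender-based: (destination, message) logged before sending
        self.delivered = []     # volatile application state, lost on a crash

    def send(self, dest, payload):
        if self.mode == "sender":
            self.sent_log.append((dest.name, payload))
        dest.receive(payload)

    def receive(self, payload):
        if self.mode == "receiver":
            self.recv_log.append(payload)
        self.delivered.append(payload)


def recover(failed, peers):
    """Rebuild the failed node's delivered messages from whichever log exists."""
    failed.delivered = []
    if failed.mode == "receiver":
        to_replay = list(failed.recv_log)            # replay from its own log
    else:
        to_replay = [m for peer in peers             # ask the senders to resend
                     for dest, m in peer.sent_log if dest == failed.name]
    failed.delivered.extend(to_replay)
    return failed.delivered


results = {}
for mode in ("receiver", "sender"):
    p, q = Node("P", mode), Node("Q", mode)
    p.send(q, "m1")
    p.send(q, "m2")
    results[mode] = recover(q, [p])   # q is assumed to have crashed here

print(results)  # {'receiver': ['m1', 'm2'], 'sender': ['m1', 'm2']}
```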