Rollback-Recovery Protocols in Message-Passing Systems. Based on "A Survey of Rollback-Recovery Protocols in Message-Passing Systems" by Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson
Motivation Large distributed systems have vast computing potential. In these systems a machine can stop participating in the execution of a distributed application as a result of: –disconnection from the network –shutdown or reboot by the user –power failure If any of these events occurs, we say that the node has failed. The computing potential is hampered by the nodes’ susceptibility to failures. –There is a need to preserve the correctness of a distributed execution despite failures.
Rollback Recovery Periodically use stable storage (e.g. disk) to save the processes’ state and maybe some additional useful data during failure-free execution. –A saved state of a process is called a checkpoint Upon a failure, restart a failed process from one of the saved checkpoints –reduces the amount of lost computation Of course, when recovering, consistency between processes must be maintained.
Flavors of Rollback Recovery There are techniques that –rely on the application to decide when and what to save, or –provide the programmer with linguistic constructs to be added to the application. There are also techniques, called transparent techniques, that do not require any intervention on the part of the application or the programmer. We focus on transparent rollback recovery.
System Model A constant number of processes (N) –Communicate only through messages –Interact with outside world through messages –Cooperate to execute a distributed program
System Model: Communication Most protocols assume that the communication network is immune to partitioning. Some protocols assume reliable FIFO delivery of messages. Other protocols assume unreliable communication, which means that messages can be –lost –duplicated –reordered
System Model: Failures A process that fails –loses its volatile state –stops execution –does not send any more messages Such behavior is called fail-stop Processes have a stable storage device that survives failures. Number of tolerated failures in different protocols varies from 1 to N. –Some protocols do not tolerate failures during recovery.
Consistent System States A global state of a message-passing system consists of: –individual states of all processes –the states of communication channels A consistent global state is a global state in which if a process’s state reflects a message receipt, then the state of the corresponding sender reflects sending that message
Consistent System States (2) Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. The goal of a rollback-recovery protocol is to bring the system into a consistent state
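The consistency condition can be checked mechanically. The following is a minimal sketch, assuming each process numbers its local events and a global state is described by how many events of each process it includes; the names `Message`, `cut`, and `is_consistent` are illustrative and not part of the survey.

```python
from typing import NamedTuple

class Message(NamedTuple):
    sender: int
    send_seq: int    # sender's local event number of the send
    receiver: int
    recv_seq: int    # receiver's local event number of the receipt

def is_consistent(cut, messages):
    """cut[p] = number of local events of process p reflected in the global state."""
    for m in messages:
        received = m.recv_seq <= cut[m.receiver]
        sent = m.send_seq <= cut[m.sender]
        if received and not sent:
            return False   # receipt is reflected but the send is not
    return True

msgs = [Message(sender=0, send_seq=2, receiver=1, recv_seq=1)]
inconsistent = is_consistent({0: 1, 1: 1}, msgs)   # receipt without send
in_transit = is_consistent({0: 2, 1: 0}, msgs)     # sent but not yet received
```

A message that is sent but not yet received leaves the state consistent. Only a receipt without the matching send violates the definition.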
Consistent Global Checkpoints and Recovery Line A consistent global checkpoint is a set of N checkpoints, one from each process, forming a consistent system state. –Any consistent global checkpoint can be used to restart process execution upon failure. It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.
Orphan Messages and Orphan Processes A message m sent by a process P i that has failed is an orphan message if the system cannot guarantee regeneration of the same m during the recovery of P i. A process P k whose state depends on a non-deterministic event (e.g. receipt of a message) that cannot be reproduced is called an orphan process. –The existence of orphan processes violates the integrity of the execution and therefore must be prevented.
In Transit Messages A message that has been sent but not yet received is called an in-transit message. Do rollback recovery protocols have to guarantee the delivery of in-transit messages? –Depends on whether reliable communication is assumed.
In-Transit Messages: Reliable Communication Reliable communication protocols cannot ensure reliability of message delivery if processes fail. For example, if an in-transit message is lost because the intended receiver has failed, then –conventional communication protocols will generate a timeout and inform the sender that the message cannot be delivered. –In a rollback-recovery system, however, the receiver will eventually recover, and therefore the system must: mask the timeout from the application program at the sender process, and make in-transit messages available to the intended receiver process after it recovers.
In Transit Messages: Unreliable Communication If unreliable communication is assumed, then: –In-transit messages lost due to failure of the receiver cannot be distinguished from those lost due to communication failures. –Loss of an in-transit message is a legal event. Therefore, the recovery protocol need not handle in-transit messages in any special way.
Interactions with the Outside World A message-passing system often interacts with the outside world to receive input data or show the outcome of a computation. If a failure occurs, the outside world cannot be relied on to roll back: –a printer cannot roll back the effects of printing a character –an automatic teller machine cannot recover the money that it dispensed to a customer.
Interactions with the Outside World: Output Messages It is therefore necessary that the outside world perceive a consistent behavior of the system despite failures. Before sending output to the outside world, the system must ensure that the state from which the output is sent can be recovered. –This is commonly called the output commit problem
Interactions with the Outside World: Input Messages Input messages that a system receives from the outside world may not be reproducible during recovery –It may not be possible for the outside world to regenerate them. –Recovery protocols must arrange to save these input messages so that they can be retrieved when needed for execution replay after a failure. A common approach is to save each input message on stable storage before allowing the application program to process it.
Stable Storage Rollback recovery uses stable storage to save checkpoints, event logs, and other recovery-related information despite failures. Stable storage in rollback recovery is only an abstraction. –Often confused with the disk storage used to implement it.
Stable Storage (2) There are different implementation styles of stable storage: –In a system that tolerates only a single failure, stable storage may consist of the volatile memory of another process. –In a system that wishes to tolerate an arbitrary number of transient failures, stable storage may consist of a local disk in each host. –In a system that tolerates non-transient failures, stable storage must consist of a persistent medium outside the host on which a process is running. A replicated file system is a possible implementation in such systems.
Garbage Collection As the application progresses and more recovery information is collected, a subset of the stored recovery information may become useless. –Deletion of such useless recovery information is called garbage collection. A common approach to garbage collection is to identify the recovery line and discard all data relating to events that occurred before that line. –For example, processes that coordinate their checkpoints to form consistent states will always restart from the most recent checkpoint of each process, and so all previous checkpoints can be discarded.
Z-Cycles and Z-Paths A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. Let $\rightarrow$ denote Lamport’s happened-before relation. Let $c_{i,x}$ denote the $x$-th checkpoint of process $P_i$. Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint). Let $\mathit{send}_i$ and $\mathit{deliver}_i$ be the communication events by process $P_i$.
Definition of Z-Path Given two checkpoints $c_{i,x}$ and $c_{j,y}$, a Z-path exists between $c_{i,x}$ and $c_{j,y}$ if and only if one of the following two conditions holds:
1. $x < y$ and $i = j$; or
2. There exists a sequence of messages $[m_0, m_1, \ldots, m_n]$, $n \ge 0$, such that:
–$c_{i,x} \rightarrow \mathit{send}_i(m_0)$;
–$\forall l < n$, either $\mathit{deliver}_k(m_l)$ and $\mathit{send}_k(m_{l+1})$ are in the same checkpoint interval, or $\mathit{deliver}_k(m_l) \rightarrow \mathit{send}_k(m_{l+1})$; and
–$\mathit{deliver}_j(m_n) \rightarrow c_{j,y}$
Z-Cycles and Z-Paths (2) A Z-cycle is a Z-path that begins and ends with the same checkpoint. –In the accompanying figure, [m 5, m 4, m 3 ] is a Z-cycle that starts and ends at checkpoint c 2,2. [m 1, m 2 ] and [m 3, m 4 ] are Z-paths between c 0,1 and c 2,2.
The Z-Cycles Theory The Z-cycle theory was first introduced as a framework for reasoning about consistent system states. The theory has proved a powerful tool for reasoning about a class of protocols known as communication-induced checkpointing. –In particular, it has been proven that a checkpoint involved in a Z-cycle cannot become part of a consistent state in a system that uses only checkpoints.
Types of Rollback Recovery Protocols
Checkpoint-based and Log-based Recovery Protocols Checkpoint-based rollback recovery protocols, a.k.a. checkpointing protocols, rely only on checkpointing to achieve fault-tolerance. Log-based rollback recovery protocols, a.k.a. logging protocols, combine checkpointing with logging of non-deterministic events.
Checkpoint-Based Protocols Rely only on checkpointing to achieve fault-tolerance –Upon a failure, strive to restore the system to the most recent consistent set of checkpoints (a.k.a. recovery line) The checkpointing protocols differ in the amount of cooperation between processes.
Classification of Checkpoint-based Protocols
1. Uncoordinated checkpointing – each process takes its checkpoints independently.
2. Coordinated checkpointing – processes coordinate their checkpoints in order to save a system-wide consistent state.
3. Communication-induced checkpointing – forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.
Uncoordinated Checkpointing Also known as independent checkpointing. A process decides when to take a checkpoint independently of other processes –it chooses the most convenient time, for example, when the amount of state information is small. The processes record dependencies among the checkpoints during the failure-free execution, in order to determine a consistent global checkpoint during recovery. Uncoordinated checkpointing protocols inherently suffer from the domino effect.
Rollback Propagation and The Domino Effect Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back. –This is commonly called rollback propagation. –If the processes have to roll back to the beginning of the computation, this is called the domino effect. Failure of P 2 causes rollback to the beginning of the computation
Monitoring the Dependencies Let c i,x be the x th checkpoint of process P i. We call x the checkpoint index. Let I i,x denote the interval between checkpoints c i,x-1 and c i,x. We call it the checkpoint interval. If process P i at interval I i,x sends a message m to P j, it piggybacks the pair (i,x) on m. When P j receives m during interval I j,y, it records the dependency from I i,x to I j,y –the dependency is later saved onto stable storage when Pj takes checkpoint c j,y.
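The piggybacking scheme can be sketched as follows. This is a minimal illustration, assuming the initial checkpoint has index 0 so that the first interval is I i,1; the class `TrackedProcess` and its dictionary `stable`, which stands in for stable storage, are hypothetical names.

```python
class TrackedProcess:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1      # index x of the current interval I_{pid,x}
        self.pending = set()   # dependencies recorded during the current interval
        self.stable = {}       # checkpoint index -> dependencies saved with it

    def send(self, payload):
        # Piggyback the pair (i, x) on the outgoing message.
        return {"payload": payload, "dep": (self.pid, self.interval)}

    def receive(self, msg):
        # Record the dependency from I_{i,x} to the current interval.
        self.pending.add(msg["dep"])
        return msg["payload"]

    def checkpoint(self):
        # Taking checkpoint c_{pid,y} saves the dependencies of I_{pid,y}.
        self.stable[self.interval] = frozenset(self.pending)
        self.pending = set()
        self.interval += 1

p0, p1 = TrackedProcess(0), TrackedProcess(1)
p1.receive(p0.send("a"))
p1.checkpoint()
p0.checkpoint()
later = p0.send("b")
```

After these steps, `p1.stable[1]` holds the single dependency `(0, 1)`, and the message `later` carries `(0, 2)` because P 0 has moved on to its second interval.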
Monitoring the Dependencies (2) The recorded dependencies are used at recovery time for calculation of the recovery line. There are two methods to do it: –Rollback-dependency graphs –Checkpoint-graphs
Rollback-Dependency Graphs Consider the system at the time of a failure. Let $C$ be the set of all the checkpoints, $F$ the set of failure points of the failed processes, and $L$ the set of current states of the living processes. Denote the current state of a process $P_i$ (failed or living) that follows a checkpoint $c_{i,x}$ by $c_{i,x+1}$. A rollback-dependency graph is a graph $G(V,E)$ such that: –$V = C \cup F \cup L$ –$E$ contains an edge from $c_{i,x}$ to $c_{j,y}$ only if either (1) $i \ne j$, and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or (2) $i = j$ and $y = x + 1$. If there is an edge from $c_{i,x}$ to $c_{j,y}$ and a failure forces $I_{i,x}$ to be rolled back, then $I_{j,y}$ must also be rolled back. –This is why it is called a “rollback-dependency graph”.
Rollback-Dependency Graphs (2) Algorithm to discover the recovery line from the rollback-dependency graph: –Mark the failure points. –Mark all the nodes reachable from the failure points. –In each process, the latest unmarked checkpoint belongs to the recovery line.
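The marking procedure on a rollback-dependency graph can be sketched as a plain reachability search. The sketch assumes nodes are pairs `(process, index)`, that index 0 is the initial checkpoint, and that the final node of each process is its failure point or current state; the function name and argument layout are illustrative only.

```python
from collections import deque

def recovery_line(last_index, edges, failed):
    """last_index[p]: index of p's final node (failure point or current state).
    edges: message edges ((i, x), (j, y)); same-process edges are implied.
    failed: set of failed process ids."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)

    marked = set()
    queue = deque((p, last_index[p]) for p in failed)   # mark the failure points
    while queue:
        node = queue.popleft()
        if node in marked:
            continue
        marked.add(node)
        p, x = node
        queue.extend(succ.get(node, []))
        if x < last_index[p]:
            queue.append((p, x + 1))                    # edge c_{p,x} -> c_{p,x+1}

    line = {}
    for p, x in last_index.items():
        while (p, x) in marked:                         # latest unmarked node
            x -= 1
        line[p] = x
    return line

# P0 fails; a message sent in I_{0,2} was received in I_{1,2}.
simple = recovery_line({0: 2, 1: 2}, {((0, 2), (1, 2))}, failed={0})
# Crossing messages push both processes back to their initial checkpoints.
domino = recovery_line({0: 2, 1: 2},
                       {((0, 2), (1, 1)), ((1, 1), (0, 1))}, failed={0})
```

In the second call the two messages cross the checkpoints in a way that leaves only the initial checkpoints unmarked, which is a small instance of the domino effect.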
Checkpoint Graphs Checkpoint graphs are similar to rollback-dependency graphs, except: –when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, a directed edge is drawn from $c_{i,x-1}$ to $c_{j,y}$ (instead of from $c_{i,x}$ to $c_{j,y}$). –failure points are not included in $V$ ($V = C \cup L$). In short, the rollback-dependency graph contains the edge $c_{i,x} \rightarrow c_{j,y}$, whereas the checkpoint graph contains the edge $c_{i,x-1} \rightarrow c_{j,y}$.
Checkpoint Graphs (2) The checkpoint graph represents the happened-before relationship between the checkpoints. The recovery line is calculated by the rollback propagation algorithm, which at each step rolls back the processes according to the recorded dependencies.
Rollback Propagation Algorithm
    include last checkpoint of each failed process as an element in set RootSet;
    include current state of each surviving process as an element in RootSet;
    mark all checkpoints reachable by following at least one edge from any member of RootSet;
    while (at least one member of RootSet is marked)
        replace each marked element in RootSet by the last unmarked checkpoint of the same process;
        mark all checkpoints reachable by following at least one edge from any member of RootSet
    end
    RootSet is the recovery line.
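The rollback propagation loop can be turned into runnable form roughly as follows. This is a sketch under stated assumptions: nodes are pairs `(process, index)`, index 0 is the initial checkpoint, edges follow the checkpoint-graph convention (a message from interval x of P i to interval y of P j gives an edge from `(i, x-1)` to `(j, y)`), and same-process edges are implied. The helper names are hypothetical.

```python
def reachable(roots, succ):
    """Nodes reachable from any root by following at least one edge."""
    seen, stack = set(), [n for r in roots for n in succ(r)]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ(node))
    return seen

def rollback_propagation(last_ckpt, edges, failed):
    """last_ckpt[p]: index of p's last checkpoint; failed: set of failed pids."""
    # Failed processes start at their last checkpoint, survivors at their
    # current state, which is numbered one past their last checkpoint.
    top = {p: (c if p in failed else c + 1) for p, c in last_ckpt.items()}

    def succ(node):
        p, x = node
        out = [dst for src, dst in edges if src == node]
        if p in top and x < top[p]:
            out.append((p, x + 1))
        return out

    root = dict(top)                                  # RootSet as pid -> index
    marked = reachable(root.items(), succ)
    while any(node in marked for node in root.items()):
        for p in root:
            while (p, root[p]) in marked:             # last unmarked checkpoint
                root[p] -= 1
        marked |= reachable(root.items(), succ)
    return root

# P0 fails; a message sent in I_{0,2} was received in I_{1,2}.
simple = rollback_propagation({0: 1, 1: 1}, {((0, 1), (1, 2))}, failed={0})
# Crossing messages: sent in I_{0,2} / received in I_{1,1}, and
# sent in I_{1,1} / received in I_{0,1}.
domino = rollback_propagation({0: 1, 1: 1},
                              {((0, 1), (1, 1)), ((1, 0), (0, 1))}, failed={0})
```

The returned dictionary maps each process to the index of its member of the recovery line. A surviving process whose entry is still one past its last checkpoint keeps its current state.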
Rollback-Dependency Graphs vs. Checkpoint Graphs The rollback-dependency graph and the checkpoint graph approaches are equivalent. –They always produce the same recovery line (as indeed they do in the example).
Recovery In order to be able to calculate the recovery line some process needs to collect all the dependency data recorded by all the processes. A process recovering from a failure broadcasts a dependency request message Each process that receives a dependency request –stops execution –replies with the local dependency information Then, the initiator –calculates the recovery line based on the received data –broadcasts a rollback request message containing the recovery line
Recovery (2) A process whose current state belongs to the recovery line resumes execution. Otherwise, it rolls back to a checkpoint indicated by the recovery line. [Figure: processes P 0, P 1, P 2 with checkpoints A, B, C, messages m 0 to m 3, and the recovery line]
Garbage Collection In order to prevent memory overflow and reduce storage overhead only useful checkpoints should be kept. Any checkpoint that precedes the recovery lines for all possible combinations of process failures can be discarded.
Garbage Collection Algorithm Build a rollback-dependency graph as if all the processes have failed. Run the algorithm for discovery of the recovery line. –The resulting recovery line is called global recovery line. All the checkpoints taken before the recovery line are obsolete.
Garbage Collection Example As can be seen from the example when the global recovery line is unable to advance because of rollback propagation, a large number of non-obsolete checkpoints may need to be retained.
Disadvantages of Uncoordinated Checkpointing Susceptible to the domino effect Checkpoints that will never be part of a global consistent state can be taken –Storage overhead –do not advance the recovery line A process needs to maintain multiple checkpoints and to use garbage collector to reclaim checkpoints that are no longer needed Not suitable for output commit, because output commit requires global coordination to compute the recovery line
Coordinated Checkpointing The processes cooperate in order to form a consistent global checkpoint. Only one checkpoint needs to be maintained on the stable storage at all times. –No need for garbage collection –Reduced storage overhead. Recovery is less complicated than in uncoordinated checkpointing. Expensive output commit –A global checkpoint is needed before output can be committed to the outside world.
Preventing Dependencies The main purpose of coordination is to avoid dependencies between the local checkpoints belonging to the same logical global checkpoint. The coordinated checkpointing protocols differ in the way they prevent dependencies.
Classification of Coordinated Checkpointing Protocols
Blocking Checkpoint Coordination Blocking Checkpoint Coordination is the most straightforward approach to implement coordinated checkpointing. A coordinator process orchestrates the checkpointing by sending a request to checkpoint to each process. –not very scalable
Blocking Checkpoint Coordination (2) Two checkpoints A and B become dependent in the following case: if P 0 rolls back to A, then P 1 has to roll back to a checkpoint that precedes B. Blocking Checkpoint Coordination prevents dependencies by blocking communication while the checkpointing protocol executes. –Can result in large overhead due to blocking
Algorithm for Blocking Checkpoint Coordination A coordinator takes a checkpoint and broadcasts a checkpoint request message to all processes. When a process receives this message, it –stops its execution –flushes all the communication channels –takes a tentative checkpoint –sends an acknowledgment message back to the coordinator. After the coordinator receives acknowledgments from all processes, it broadcasts a commit message
Algorithm for Blocking Checkpoint Coordination (2) After receiving the commit message, each process –removes the old permanent checkpoint –atomically makes the tentative checkpoint permanent –resumes execution [Figure: processes P 0, P 1, P 2 with checkpoints A, B, C and messages m 0 to m 5]
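The two phases of the blocking protocol can be sketched in a single-threaded simulation. This is only an illustration: message passing is replaced by direct method calls, channel flushing is omitted, and the names `Participant` and `coordinate` are hypothetical.

```python
class Participant:
    def __init__(self, state):
        self.state = state
        self.permanent = None    # last committed checkpoint
        self.tentative = None
        self.blocked = False

    def on_checkpoint_request(self):
        self.blocked = True                  # stop execution
        self.tentative = dict(self.state)    # take a tentative checkpoint
        return "ack"

    def on_commit(self):
        # Replace the old permanent checkpoint with the tentative one.
        self.permanent, self.tentative = self.tentative, None
        self.blocked = False                 # resume execution

def coordinate(participants):
    """Phase 1: request and collect acks. Phase 2: broadcast commit."""
    acks = [p.on_checkpoint_request() for p in participants]
    if all(a == "ack" for a in acks):
        for p in participants:
            p.on_commit()
        return True
    return False

procs = [Participant({"x": i}) for i in range(3)]
committed = coordinate(procs)
procs[0].state["x"] = 99    # later work does not alter the saved checkpoint
```

A real implementation would also have to make the swap of tentative and permanent checkpoints atomic on stable storage, which this in-memory sketch does not attempt.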
Non-Blocking Checkpoint Coordination An initiator sends all the other processes a request to take a checkpoint. If the channels are FIFO and reliable, the following algorithm can be used. –The initiator takes a checkpoint and sends a checkpoint request to all processes. –Each process takes a checkpoint upon receiving the first checkpoint request and rebroadcasts it to all processes.
Non-Blocking Checkpoint Coordination (2) If the channels are not FIFO, the checkpointing request can be piggybacked on every post-checkpoint message. Note that in both cases (FIFO and non-FIFO), each checkpoint request contains the identity of its initiator along with the sequence number of the request.
Synchronized Checkpoint Clocks Loosely synchronized clocks can be used for triggering the local checkpointing actions at approximately the same time instead of waiting for a request from an initiator. All processes take checkpoints at predefined times, according to their local clocks. A process takes a checkpoint and blocks for a period that equals the sum of - the maximum deviation between clocks, and - the maximum time to detect a failure in another process in the system.
Synchronized Checkpoint Clocks (2) If a failure occurs in another process during checkpointing or during the waiting time, the checkpoint just taken is discarded and the protocol is aborted.
Synchronized Checkpoint Clocks: Optimization Instead of blocking during the waiting time, a process can continue execution, but include a checkpointing request in the messages it sends A process that receives a checkpointing request with id i starts the i th checkpointing period if it has not started it yet. –The attached message is delivered only after the process has performed the i th checkpoint
Minimal Checkpointing Coordination If all processes participate in every checkpoint, the system does not scale –need to reduce the number of processes involved in the checkpoint Observation: only those processes that have communicated with the initiator of the current checkpoint, either directly or indirectly, since the last checkpoint need to take new checkpoints
Algorithm for Minimal Checkpointing Coordination The checkpoint initiator identifies all processes with which it has communicated since the last checkpoint and sends them a request. Upon receiving the request, each process in turn identifies all processes it has communicated with since the last checkpoints and sends them a request, and so on, until no more processes can be identified. –Hierarchical distribution of a checkpoint request, instead of one initiator. –The rest of the protocol is done according to either blocking or non-blocking approach
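The propagation of requests amounts to a transitive closure over the "has communicated with since the last checkpoint" relation. The sketch below computes that closure centrally for illustration, whereas the protocol itself discovers it hop by hop. The mapping `talked_to` is an assumed representation.

```python
from collections import deque

def checkpoint_participants(initiator, talked_to):
    """talked_to[p]: processes p has communicated with since its last checkpoint."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for peer in talked_to.get(p, ()):
            if peer not in seen:       # each process receives the request once
                seen.add(peer)
                queue.append(peer)
    return seen

talked_to = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}
participants = checkpoint_participants(0, talked_to)
```

Here processes 3 and 4 have only talked to each other, so a checkpoint initiated by process 0 does not involve them.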
Communication-induced Checkpointing Balances between uncoordinated and coordinated checkpointing –Allows processes to take some checkpoints independently. These checkpoints are called local checkpoints –Guarantees the eventual progress of the recovery line by forcing processes to take additional checkpoints, called forced checkpoints.
Communication-induced Checkpointing (2) Communication-induced checkpointing piggybacks protocol-related information on each application message. –In contrast with coordinated checkpointing, no special coordination messages are exchanged. The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line. The forced checkpoint must be taken before the application may process the contents of the message. –high latency and overhead –need to reduce the number of forced checkpoints
Classification of Communication-induced Protocols
Model-based Checkpointing Model-based checkpointing relies on preventing patterns of communications and checkpoints that could result in inconsistent states among the existing checkpoints. –A model is set up to detect the possibility that such patterns could be forming within the system. –A checkpoint is usually forced to prevent the undesirable patterns from occurring. –The decision to force a checkpoint is done locally using the available information.
Model-based Checkpointing Algorithms The MSR model: –In every checkpoint interval all message-receiving events precede all message-sending events. –Can be maintained by taking an additional checkpoint before every message-receiving event that is not separated from its previous message-sending event by a checkpoint. Another way to prevent the domino effect is to avoid rollback propagation completely by taking a checkpoint immediately after every message-sending event.
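The forced-checkpoint rule of the first model needs only one bit of local state. The following is a minimal sketch with hypothetical names, in which a checkpoint is represented by a counter rather than by saved state.

```python
class ReceiveBeforeSendProcess:
    def __init__(self):
        self.sent_since_checkpoint = False
        self.checkpoints = 0

    def checkpoint(self):
        self.checkpoints += 1
        self.sent_since_checkpoint = False

    def send(self):
        self.sent_since_checkpoint = True

    def receive(self):
        # A receive that follows a send in the same interval would break the
        # "all receives precede all sends" pattern, so force a checkpoint first.
        if self.sent_since_checkpoint:
            self.checkpoint()

p = ReceiveBeforeSendProcess()
p.receive(); p.receive()     # receives only: no checkpoint forced
p.send(); p.send()
p.receive()                  # forces one checkpoint
p.receive()                  # same interval, nothing forced
forced = p.checkpoints
```

The decision uses local information only, which is what makes this a model-based rule rather than a coordinated one.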
Unnecessary Checkpoints Model-based checkpointing usually takes more forced checkpoints than necessary. –The model used to detect possible inconsistencies is not precise and therefore forces local checkpoints to prevent the formation of undesirable patterns that may never actually materialize. –It is possible that two processes detect the potential for inconsistent checkpoints and independently force local checkpoints to prevent the formation of undesirable patterns that could be prevented by a single forced checkpoint.
Index-based checkpointing Index-based communication-induced checkpointing works by assigning monotonically increasing indexes to global checkpoints, such that the checkpoints having the same index at different processes form a consistent state. The indexes are piggybacked on application messages to help receivers decide when they should force a checkpoint. For instance, the protocol by Briatico et al. forces a process to take a checkpoint upon receiving a message with a piggybacked index greater than the local index.
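An index-based rule in the style described above can be sketched as follows. The slide only states that a checkpoint is forced when the piggybacked index exceeds the local one; adopting the received index as the new local index is an assumption made here, and the class name is hypothetical.

```python
class IndexedProcess:
    def __init__(self):
        self.index = 0
        self.checkpoints = [0]     # indexes of the checkpoints taken so far

    def local_checkpoint(self):
        self.index += 1
        self.checkpoints.append(self.index)

    def send(self, payload):
        return (self.index, payload)          # piggyback the local index

    def receive(self, msg):
        idx, payload = msg
        if idx > self.index:
            # Forced checkpoint, taken before the message is processed.
            self.index = idx                  # assumption: adopt sender's index
            self.checkpoints.append(idx)
        return payload

a, b = IndexedProcess(), IndexedProcess()
a.local_checkpoint()
delivered = b.receive(a.send("x"))   # index 1 > 0, so b is forced to checkpoint
a.receive(b.send("y"))               # index 1 == 1, nothing forced
```

After the exchange both processes hold a checkpoint with index 1, and under the scheme described above those two checkpoints belong to the same consistent state.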
Detailed Classification of Checkpoint-based Protocols
Log-based Rollback Recovery Log-based recovery views the execution of a process as a sequence of state intervals. –An interval starts with a non-deterministic event, such as: Receipt of a message (from a process or the outside world) Reading the contents of the local clock Interrupt –Execution during an interval is deterministic A process that is started from the same state and is subjected to the same non-deterministic event yields the same output.
Logging Protocols and the PWD assumption Log-based recovery assumes that –all non-deterministic events that a process executes can be identified, and that –the information necessary to replay each event during recovery can be logged Such information is called a determinant of the event Together, these conditions constitute the piecewise deterministic (PWD) assumption.
Logging Protocols and the PWD assumption (2) If the PWD assumption holds, log-based rollback-recovery protocols can recover a failed process and replay its execution exactly as it occurred before the failure. Therefore, they are: –useful when interactions with the outside world are frequent, because they eliminate the need to take expensive checkpoints before sending output. –generally not susceptible to the domino effect, thereby allowing processes to use uncoordinated checkpointing if desired.
Log-based Rollback Recovery (2) During failure-free operation, each process –logs the determinants of all the non-deterministic events that it observes to stable storage –periodically takes checkpoints to limit the amount of work during recovery. The pre-failure execution of a failed process can be reconstructed during recovery up to the first non-deterministic event whose determinant is not logged. The system must guarantee that upon recovery of all failed processes, there are no orphan processes.
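Checkpoint-plus-replay can be illustrated with a toy process whose only non-deterministic events are message deliveries. In this sketch the determinant is simply the delivered value, and the attributes `checkpoint` and `log` stand in for stable storage; all names are illustrative.

```python
class LoggedCounter:
    """Toy process: the state is a running total of delivered values."""
    def __init__(self):
        self.total = 0
        self.checkpoint = 0    # last checkpointed state
        self.log = []          # determinants logged since the checkpoint

    def deliver(self, value):
        self.log.append(value)     # log the determinant first
        self.total += value        # then let the event affect the state

    def take_checkpoint(self):
        self.checkpoint = self.total
        self.log.clear()           # earlier determinants are no longer needed

    def recover(self):
        self.total = self.checkpoint
        for value in self.log:     # deterministic replay, in logged order
            self.total += value

p = LoggedCounter()
p.deliver(3)
p.take_checkpoint()
p.deliver(4)
p.deliver(5)
p.total = None     # simulate loss of volatile state
p.recover()
```

Because execution between events is deterministic, replaying the two logged deliveries on top of the checkpoint reproduces the pre-failure state.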
Logging Protocols: Recoverable and Stable States A state interval is recoverable if there is sufficient information to replay the execution up to that state interval despite any future failures in the system. Also, a state interval is stable if the determinant of the non-deterministic event that started it is logged on stable storage. A recoverable state interval is always stable, but the opposite is not always true.
Message Logging Example Assume that processes P 1 and P 2 fail before logging the determinants corresponding to the deliveries of m 6 and m 5, respectively. Then: –m 7 is an orphan message –P 0 is an orphan process –States X, Y and Z form the maximum recoverable state, i.e., the most recent recoverable consistent system state.
Classification of Log-based Protocols
There are three main types of logging protocols, depending on when the determinants are logged to stable storage. 1.pessimistic logging – the application blocks waiting for the determinant of each non-deterministic event to be stored on stable storage before the effects become visible. 2.optimistic logging – the application does not block, and determinants are spooled to stable storage asynchronously. 3.causal logging - a balance between optimistic and pessimistic logging.
Types of Logging Protocols and Orphan Processes Pessimistic logging guarantees that orphan processes are never created due to a failure. –Simplifies recovery, garbage collection and output commit, at the expense of higher failure-free performance overhead. Optimistic logging reduces the failure-free performance overhead, but allows orphan processes to be created due to failures. –The possibility of having orphans complicates recovery, garbage collection and output commit. Causal logging attempts to combine the advantages of low performance overhead and fast output commit. –May require complex recovery and garbage collection.
The No-Orphans Consistency Condition Let e be a non-deterministic event that occurs at process p, we define: Depend(e) – the set of processes that are affected by a non-deterministic event e. This set consists of p, and any process whose state depends on the event e according to the Lamport’s happened before relation. Log(e) – the set of processes that have logged a copy of e’s determinant in their volatile memory. Stable(e) – a predicate that is true if e’s determinant is logged on stable storage.
The No-Orphans Consistency Condition (2) When a subset of processes fails, a surviving process that depends on an event e is not an orphan if: $\forall e: \neg \mathit{Stable}(e) \Rightarrow \mathit{Depend}(e) \subseteq \mathit{Log}(e)$ This property is called the always-no-orphans condition. It stipulates that if any surviving process depends on an event e, then either –the event is logged on stable storage, or –the process has a copy of the determinant of event e. If neither condition is true, then the process is an orphan because it depends on an event e that cannot be regenerated during recovery since its determinant has been lost.
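The condition can be checked directly from its three ingredients. The sketch below assumes each event is represented as a dictionary with the keys `stable`, `depend`, and `log`; this representation and the function names are illustrative only.

```python
def always_no_orphans(events):
    """For every event e: not Stable(e) implies Depend(e) is a subset of Log(e)."""
    return all(e["stable"] or e["depend"] <= e["log"] for e in events)

def orphans(events, failed):
    """Survivors that depend on an event whose determinant was lost."""
    result = set()
    for e in events:
        # The determinant is lost if it is not on stable storage and every
        # process holding a volatile copy has failed.
        lost = not e["stable"] and e["log"] <= failed
        if lost:
            result |= e["depend"] - failed
    return result

e1 = {"stable": False, "depend": {0, 1}, "log": {0}}      # only P0 holds a copy
e2 = {"stable": False, "depend": {0, 1}, "log": {0, 1}}   # both hold a copy

violates = always_no_orphans([e1])
holds = always_no_orphans([e2])
orphaned = orphans([e1], failed={0})
```

With `e1`, a failure of P 0 loses the only copy of the determinant and leaves P 1 an orphan. With `e2`, P 1 still holds a copy, so no orphan is created.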
Pessimistic Logging The determinant of each non-deterministic event is logged before the event can affect the computation. In their most straightforward form, pessimistic protocols log determinants to stable storage synchronously. –This approach is called synchronous logging. It incurs significant performance overhead during failure-free execution.
Pessimistic Logging Example During failure-free operation the logs of processes P 0, P 1 and P 2 contain the determinants needed to replay messages {m 0, m 4, m 7 }, {m 1, m 3, m 6 } and {m 2, m 5 }, respectively.
Advantages of Pessimistic Logging 1. Processes can commit output to the outside world without running a special protocol. 2. The frequency of checkpoints can be determined by trading off the desired runtime performance with the desired protection of the on-going execution. 3. Functioning processes that are not affected by failures continue to operate and never become orphans. –This is highly desirable in practical systems. 4. Older checkpoints and determinants of non-deterministic events that occurred before the most recent checkpoint can be discarded.
Hardware Techniques for Reducing Performance Overhead The performance overhead of synchronous logging can be lowered by using special hardware. Examples: –a fast non-volatile semiconductor memory to implement stable storage improves performance by orders of magnitude. –a special bus to guarantee atomic logging of all messages exchanged in the system. ensures that the log of one machine is automatically stored on a designated backup without blocking the execution of the application program. requires that all non-deterministic events be converted into external messages.
Sender-Based Message Logging Some pessimistic logging systems reduce the overhead of synchronous logging without relying on hardware. For example, Sender-Based Message Logging (SBML) –keeps the determinant corresponding to the delivery of each message m in the volatile memory of its sender. –logs the determinant of m, which consists of its content and the order in which it was delivered, in several steps: Before sending m, the sender logs its content in volatile memory. The receiver of m responds with an acknowledgment that includes the order in which the message was delivered. The sender then adds the ordering information to the determinant. SBML tolerates only one failure and cannot handle non-deterministic events internal to a process.
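The steps of sender-based logging can be sketched as below. This is an in-memory illustration under simplifying assumptions: acknowledgments are passed as return values, message identifiers are `(sender, sequence number)` pairs, and the class name is hypothetical rather than taken from any SBML implementation.

```python
class SbmlProcess:
    def __init__(self, pid):
        self.pid = pid
        self.next_id = 0
        self.delivered = 0
        self.send_log = {}    # volatile log at the sender: id -> [content, order]

    def send(self, content):
        mid = (self.pid, self.next_id)
        self.next_id += 1
        self.send_log[mid] = [content, None]   # log content before sending
        return mid, content

    def deliver(self, msg):
        mid, _content = msg
        self.delivered += 1
        return mid, self.delivered    # acknowledgment carrying the receive order

    def on_ack(self, ack):
        mid, order = ack
        self.send_log[mid][1] = order # complete the determinant

sender, receiver = SbmlProcess(0), SbmlProcess(1)
m1 = sender.send("a")
m2 = sender.send("b")
sender.on_ack(receiver.deliver(m2))   # messages arrive out of order
sender.on_ack(receiver.deliver(m1))

# Order in which a recovering receiver would have the messages replayed.
replay = [c for c, order in sorted(sender.send_log.values(), key=lambda e: e[1])]
```

The log lives only in the sender's volatile memory, which is why the scheme cannot survive the simultaneous failure of a sender and its receiver.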
Relaxing Logging Atomicity The performance overhead of pessimistic logging can be reduced by delivering a message or an event while deferring its logging until the host communicates with another host or with the outside world. –In the example, m 2 and m 4 are allowed to affect P 2 before they are logged, but they must be logged before m 6 is sent.
Relaxing Logging Atomicity (2) Systems that separate logging of an event from its delivery may lose the last messages delivered before a failure. –This may be a problem for applications that assume that processes communicate through reliable channels. –This problem does not arise in protocols that log messages at the sender or do not assume reliable communication channels
Optimistic Logging Processes log determinants asynchronously to stable storage. Determinants are kept in a volatile log, which is periodically flushed to stable storage. –No need to block waiting for the determinants to be written to stable storage –Temporary creation of orphan processes is permitted Needs garbage collection – multiple checkpoints may be kept Slower output commit –requires coordination to ensure that no failure revokes output More complicated recovery –has two flavors: synchronous recovery and asynchronous recovery
Optimistic Logging and Orphan Processes If a process fails, the determinants in its volatile log are lost and some state intervals cannot be recovered. If the failed process sent a message during any of the lost intervals the receiver of the message becomes an orphan process. When recovery is complete there is no orphan processes. –The orphan processes are rolled back until their states do not depend on any message whose determinant has been lost.
Optimistic Logging Example Note that the processes keep multiple checkpoints – non-trivial garbage collection is needed. If P 0 wants to commit output in state X, it must: –log m 4 and m 5 to stable storage –ask P 2 to log m 2 and m 5 to stable storage Suppose P 2 in fails before the determinant for m 5 is logged to stable storage. P 1 becomes orphan and needs to roll back to B, which forces P 0 to rollback to A
Synchronous Recovery and Dependency Tracking During synchronous recovery all processes run a recovery protocol to compute the maximum recoverable state based on: –Logged determinants and checkpoints –Dependency information gathered during the failure-free execution. There are two approaches to dependency tracking: direct and transitive –In both, during failure-free execution, each process increments a state interval index at the beginning of each state interval.
Direct Dependency Tracking The state interval index is piggybacked on each outgoing message. The receiver records the dependency directly caused by the message. These direct dependencies are assembled at recovery time to obtain complete dependency information.
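Assembling direct dependencies at recovery time amounts to a graph traversal. The sketch below is illustrative only: the function name `all_dependencies`, the representation of a state interval as a `(pid, index)` pair, and the rule that each interval depends on its own predecessor are assumptions made for the example.

```python
def all_dependencies(direct, start):
    """Return every interval that `start` depends on, directly or not.

    direct maps an interval (pid, index) to the set of intervals it
    directly depends on, as recorded from piggybacked indices.
    """
    seen = set()
    stack = [start]
    while stack:
        pid, idx = cur = stack.pop()
        deps = set(direct.get(cur, ()))
        if idx > 0:
            deps.add((pid, idx - 1))   # assumed: depends on the previous interval
        for d in deps:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen
```

The failure-free cost is a single index per message; the price is this extra assembly work during recovery.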
Transitive Dependency Tracking Each process P_i maintains a size-N vector TD_i, where: –TD_i[i] is P_i’s current state interval index –TD_i[j], j ≠ i, is the highest index of any state interval of P_j on which P_i depends. TD_i is sent in each outgoing message and is updated on each receipt of a message. Each interval of P_i is associated with a vector timestamp. –Two intervals are dependent if their vectors are comparable, i.e. no entry of one vector is larger than the corresponding entry of the other. Disadvantage: generally incurs a higher failure-free overhead for piggybacking and maintaining the dependency vectors. Advantage: allows faster output commit and recovery.
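The vector maintenance can be sketched as below. The class name `TDProcess` and the helper `comparable` are hypothetical, and the sketch assumes that every message delivery starts a new state interval.

```python
class TDProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.td = [0] * n          # td[pid] is the current interval index

    def start_interval(self):
        self.td[self.pid] += 1

    def send(self):
        return list(self.td)       # copy of TD piggybacked on the message

    def receive(self, piggybacked):
        self.start_interval()      # assumed: a delivery starts a new interval
        for j, idx in enumerate(piggybacked):
            if j != self.pid:
                # Keep the highest interval index of P_j we depend on.
                self.td[j] = max(self.td[j], idx)


def comparable(u, v):
    """True if no entry of one vector exceeds the matching entry of the other."""
    return all(a <= b for a, b in zip(u, v)) or all(b <= a for a, b in zip(u, v))
```

Because the whole vector travels with every message, a receiver learns transitive dependencies immediately, at the cost of N entries per message.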
Asynchronous Recovery In asynchronous recovery, a failed process restarts by broadcasting a rollback announcement. If, upon receiving a rollback announcement, a process detects that it has become an orphan with respect to that announcement, it –rolls back –broadcasts its own rollback announcement.
Incarnation Tracking When a process restarts execution from a checkpoint, we say that it starts a new incarnation. Multiple incarnations of a process may coexist in the system with asynchronous recovery. –Each process needs to track the dependency of its state on every incarnation of all processes to correctly detect orphaned states. Dependency tracking can be limited to a single incarnation of each process by forcing a process P_i to delay delivery of messages carrying a dependency on an unknown incarnation of a process P_j, until P_i receives all the preceding rollback announcements from P_j.
Exponential Rollbacks In asynchronous recovery protocols a single failure can cause another process to roll back an exponential number of times. –This is known as the exponential rollbacks phenomenon. P_i rolls back 2^(i-1) times.
Dealing with Exponential Rollbacks Several ways have been proposed: –Distinguish failure announcements from rollback announcements and broadcast only the former. –Piggyback the original rollback announcement from the failed process on every subsequent rollback announcement that it triggers. –Piggyback all rollback announcements on every application message.
Classification of Log-based Protocols
Causal Logging Ensure the always-no-orphans property by ensuring that the determinant of each non-deterministic event that precedes the state of a process, according to Lamport’s happened-before, is either stable or it is available locally to that process. The determinant of each of these events contains the order in which its original receiver delivered the corresponding message. The message sender, as in sender-based message logging, logs the message content.
Causal Logging Example Process P0 “guides” the recovery of P1 and P2 since it knows the order in which P1 should replay receipt of m1 and m3. The contents of m1 are obtained from the sender log of P0. The contents of m3 are deterministically regenerated during the recovery of P1 and P2. In state X the determinants of m0, m1, m2, m3 and m4 are either on stable storage or in volatile memory in P0. Messages m5 and m6 may be lost upon the failure.
Advantages of Causal Logging Causal logging has the failure-free performance advantages of optimistic logging while retaining most of the advantages of pessimistic logging: –avoids synchronous access to stable storage except during output commit. –allows each process to commit output independently: the sending process simply needs to save its log to stable storage. –never creates orphans. –limits the rollback of any failed process to the most recent checkpoint on stable storage. –reduces the storage overhead and the amount of work at risk. The above advantages come at the expense of a more complex recovery protocol.
Tracking Causality Processes piggyback the non-stable determinants in their volatile log on the messages they send to other processes. On receiving a message, a process first adds any piggybacked determinant to its volatile determinant log and then delivers the message to the application. The determinants are stored and sent in the form of an antecedence graph.
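The piggybacking rule can be sketched as follows. The class name `CausalProcess`, the message layout, and the representation of a determinant as a `(message id, receiver, delivery order)` tuple are assumptions for illustration; a flat set stands in for the antecedence graph.

```python
class CausalProcess:
    def __init__(self, pid):
        self.pid = pid
        self.volatile_log = set()   # non-stable determinants known locally
        self.delivered = []
        self.rsn = 0                # assumed local delivery counter

    def send(self, msg_id):
        # Piggyback every non-stable determinant in the volatile log.
        return {"id": msg_id, "determinants": frozenset(self.volatile_log)}

    def receive(self, msg):
        # First add the piggybacked determinants to the volatile log...
        self.volatile_log |= msg["determinants"]
        # ...then record this delivery's own determinant and deliver.
        self.volatile_log.add((msg["id"], self.pid, self.rsn))
        self.rsn += 1
        self.delivered.append(msg["id"])
```

After a chain of messages, each process holds the determinants of all deliveries that causally precede its state, which is what lets a survivor guide the recovery of a failed process.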
Antecedence Graph The antecedence graph of a process P is a directed graph G(V,E) such that: –V is the set of non-deterministic events that precede P’s current state (according to happened-before) –E contains an edge v → u if and only if v precedes u (according to happened-before)
Antecedence Graph Example
Efficient Transmission of Antecedence Graphs Carrying the entire graph on each application message is unacceptable. Solution: any message between processes p and q carries only the difference from the graph piggybacked on the previous message exchanged between them. Furthermore, if p has recently received a message from q, it can exclude the graph portions that were piggybacked on that message. This technique has low overhead in practice.
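A rough sketch of the incremental scheme, assuming the graph is held as a set of edges and that messages are not lost (a sender marks edges as known to the peer as soon as it sends them). The class `GraphNode` and its methods are hypothetical names.

```python
class GraphNode:
    def __init__(self):
        self.graph = set()     # edges (v, u) of the local antecedence graph
        self.known_by = {}     # peer -> edges that peer is known to hold

    def piggyback_for(self, peer):
        known = self.known_by.setdefault(peer, set())
        delta = self.graph - known   # only what the peer has not seen
        known |= delta               # assumes the message is delivered
        return delta

    def on_receive(self, peer, delta):
        self.graph |= delta
        # No need to send back what the peer itself just sent us.
        self.known_by.setdefault(peer, set()).update(delta)
```

Usage: after p sends an edge to q, the next message from p carries only newer edges, and q's replies omit the edges it learned from p.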
Family Based Logging Further reduction of the overhead is possible if the system only needs to tolerate fewer than N failures. Family Based Logging (FBL) protocols are parameterized by the number of tolerated failures f. –Log each non-deterministic event in the volatile store of f + 1 different hosts. –Propagation of information about an event stops when it has been recorded in f + 1 processes. For f < N: –Sender-based logging is used to support message replay during recovery, and determinants are piggybacked on application messages.
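The stopping rule can be illustrated with a small sketch. It is an assumption-laden simplification: the class `FBLProcess` is hypothetical, each determinant travels with the set of hosts known to hold it, and a sender does not learn of new holders unless that information comes back on a later message.

```python
class FBLProcess:
    def __init__(self, pid, f):
        self.pid = pid
        self.f = f
        self.log = {}   # determinant -> set of hosts known to hold it

    def new_event(self, det):
        self.log[det] = {self.pid}

    def piggyback(self):
        # Propagate only determinants not yet known to be at f + 1 hosts.
        return {d: set(h) for d, h in self.log.items()
                if len(h) < self.f + 1}

    def receive(self, piggybacked):
        for det, hosts in piggybacked.items():
            holders = self.log.setdefault(det, set())
            holders |= hosts
            holders.add(self.pid)   # this host now holds the determinant too
```

With f = 1, a determinant stops being piggybacked once two hosts are known to hold it.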
Family Based Logging (2) FBL protocols do not access stable storage except for checkpointing. –Reducing access to stable storage in turn reduces performance overhead and implementation complexity. An implementation of the protocol with f = 1 confirms that the performance overhead is very small. The described causal logging protocol is an FBL protocol corresponding to the case of f = N.
Detailed Classification of Rollback Recovery Protocols
Comparison