Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing ECE 655 Checkpointing III

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.2 Coordinated Checkpointing Algorithms
- Uncoordinated checkpointing may lead to the domino effect or to livelock
- Two basic approaches to checkpoint coordination:
  - The Koo-Toueg algorithm, in which one process initiates the system-wide checkpoint
  - An algorithm that staggers checkpoints in time; staggering can help avoid near-simultaneous heavy loading of the disk system
- Communication-induced checkpointing procedures
- Using coordinated and uncoordinated checkpointing simultaneously - the latter is sufficient to deal with most isolated failures

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.3 Koo-Toueg Algorithm
- Suppose P wants to establish a checkpoint at P_3 (in the accompanying figure); this checkpoint records that message q1 was received from Q
- To prevent q1 from being orphaned, Q must checkpoint as well; thus, P's establishing a checkpoint at P_3 forces Q to take a checkpoint recording that q1 was sent
- An algorithm for such coordinated checkpointing uses two types of checkpoints - tentative and permanent
- P first records its current state in a tentative checkpoint, then sends a message to every process from which it has received a message since taking its last checkpoint
- Call the set of such processes S

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.4 Koo-Toueg Algorithm - Cont.
- The message tells each process in S (e.g., Q) the last message, m_qp, that P received from it before the tentative checkpoint was taken
- If m_qp was not recorded in a checkpoint by Q: to prevent m_qp from being orphaned, Q is asked to take a tentative checkpoint recording that m_qp was sent
- If all processes in S that need to checkpoint confirm that they have done so as requested, then all tentative checkpoints are converted to permanent ones
- If some members of S are unable to checkpoint as requested, P and all members of S abandon their tentative checkpoints, and none are made permanent
- This may set off a chain reaction of checkpoints: each member of S can potentially spawn a set of checkpoints among the processes in its own corresponding set (a small code sketch of this two-phase structure follows)
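
The following is a minimal, single-process-space sketch of the tentative/permanent two-phase structure described above, assuming the set S is simply "every process the initiator has heard from since its last checkpoint". The class and function names (Process, coordinate_checkpoint, received_from) are illustrative, not from the slides, and the real request/confirm messages and the recursive spread of the protocol are abstracted away.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0                 # stand-in for application state
        self.tentative = None          # tentative checkpoint, if any
        self.permanent = None          # last permanent checkpoint
        self.received_from = {}        # sender name -> last message received since last checkpoint

    def take_tentative(self):
        self.tentative = self.state
        return True                    # a real process may refuse and return False

    def commit(self):
        self.permanent, self.tentative = self.tentative, None

    def abort(self):
        self.tentative = None

def coordinate_checkpoint(initiator, processes):
    """Initiator checkpoints tentatively, asks every process it has heard from
    since its last checkpoint to do the same, and commits only if all agree."""
    if not initiator.take_tentative():
        return False
    cohort = [processes[name] for name in initiator.received_from]   # the set S
    ok = all(p.take_tentative() for p in cohort)   # recursion into each p's own cohort omitted
    for p in [initiator] + cohort:
        p.commit() if ok else p.abort()
    return ok

# Tiny usage example: P has received q1 from Q since its last checkpoint
procs = {n: Process(n) for n in "PQR"}
procs["P"].received_from = {"Q": "q1"}
print(coordinate_checkpoint(procs["P"], procs))    # True: both checkpoints become permanent
```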

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.5 Staggered Checkpointing
- The Koo-Toueg algorithm - and others like it - can lead to a large number of processes taking checkpoints at nearly the same time
- If they are all writing to shared stable storage, e.g., a set of common disks, this surge can lead to congestion at the disks, the network, or both
- Either of two approaches can be used to ensure that, at any time, at most one process is writing its checkpoint to stable storage:
  - (1) Write the checkpoint into a local buffer, then stagger the writes from the buffers to stable storage (assuming buffers of sufficiently large capacity)
  - (2) Stagger the checkpoints themselves in time

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.6 Staggered Checkpointing - Cont.
- Staggered checkpoints may not be consistent - there may be orphan messages in the system
- This can be avoided by a coordinating phase in which each process logs to stable storage all messages it has sent out since its previous checkpoint
- The message-logging phases of the processes will overlap in time
- If the volume of messages is smaller than the size of the individual checkpoints, the disks and network see a reduced surge

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.7 Recovery From Failure
- If a process fails, it can be restarted by rolling it back to its last checkpoint and playing back all the messages stored in the log
- This combination of a checkpoint and a message log is called a logical checkpoint
- The staggered checkpointing algorithm guarantees that all the logical checkpoints form a consistent recovery line

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.8 Phase One of Staggering Algorithm
- Phase 1 - the checkpointing phase:
    for (i = 0; i <= n-1; i++) {
        P_i takes a checkpoint
        P_i sends a message to P_{(i+1) mod n}, ordering it to take a checkpoint
    }
- When P_0 gets a message from P_{n-1} ordering it to checkpoint, this is the cue for P_0 to initiate the second (message-logging) phase
- P_0 sends out a marker message on each of its outgoing channels; when a process P_i receives a marker message, it moves to phase 2

Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.9 Phase Two of Staggering Algorithm
- Phase 2 - the message-logging phase:
    if (no marker message was previously received in this round by P_i) {
        P_i sends a marker message on each of its outgoing channels
        P_i logs all the messages it has received since its preceding checkpoint
    } else {
        P_i updates its message log by adding all the messages received since the last time the log was updated
    }
- (A toy simulation of both phases is sketched below)
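
Below is a toy, single-threaded simulation of the two phases, just to make the control flow concrete. The fully connected channel model, the Proc/on_marker names, and the fact that the take_checkpoint orders of phase 1 are left implicit are all assumptions of this sketch, not part of the algorithm as stated on the slides.

```python
from collections import defaultdict

class Proc:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.checkpoint = None
        self.received = []        # messages received since the last checkpoint
        self.log = []
        self.logged = 0           # how many of those are already in the log
        self.saw_marker = False

    def take_checkpoint(self):    # phase 1
        self.checkpoint = ("state of", self.pid)
        self.received, self.log, self.logged = [], [], 0
        self.saw_marker = False

    def on_marker(self, channels):  # phase 2
        if not self.saw_marker:     # first marker this round: forward markers
            self.saw_marker = True
            for q in range(self.n):
                if q != self.pid:
                    channels[q].append(("marker", self.pid))
        # in either case, bring the message log up to date
        self.log.extend(self.received[self.logged:])
        self.logged = len(self.received)

def staggered_round(n=3):
    procs = [Proc(i, n) for i in range(n)]
    channels = defaultdict(list)      # pid -> incoming marker queue
    # Phase 1: checkpoints are taken one after another around the ring
    for p in procs:
        p.take_checkpoint()
    # Phase 2: P_0 starts the marker wave (as if cued by P_{n-1})
    procs[0].on_marker(channels)
    while any(channels.values()):     # deliver markers until the channels drain
        for q, inbox in list(channels.items()):
            while inbox:
                inbox.pop(0)
                procs[q].on_marker(channels)
    return procs

staggered_round()
```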

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Example of Staggering Algorithm - Phase One
- P0 takes a checkpoint and sends a take_checkpoint order to P1
- P1 sends such an order to P2 after taking its own checkpoint
- P2 sends a take_checkpoint order back to P0
- At this point, each of the processes has taken a checkpoint and the second phase can begin

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Example - Phase 2
- P0 sends a message_log (marker) order to P1 and P2, telling them to log the messages they have received since their last checkpoint
- P1 and P2 send out similar message_log orders
- Each time such a message is received, the process logs its messages
- If it is the first time the process receives such a message_log order, it also sends out marker messages on each of its own outgoing channels

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Recovery
- Assumption: given its checkpoint and the messages it received, a process can be recovered
- We may have orphan messages with respect to the physical checkpoints taken in the first phase
- Orphan messages will not exist with respect to the latest (in time) logical checkpoints, which are generated from the physical checkpoints and the message logs

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Time-Based Synchronization
- Orphan messages cannot arise if every process checkpoints at exactly the same time
- This is practically impossible - clock skews and message communication times cannot be reduced to zero
- Time-based synchronization can still be used to facilitate checkpointing, but we have to take account of nonzero clock skews
- In time-based synchronization, processes checkpoint at previously agreed times
- Example: ask each process to checkpoint when its local clock reads a multiple of 100 seconds
- Such a procedure by itself is not enough to avoid orphan messages

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Creation of an Orphan Message - Example
- Each process checkpoints when its local clock reads 1100
- The skew between the two clocks is such that process P0 checkpoints much earlier (in real time) than process P1
- As a result, P0 sends a message to P1 after its own checkpoint, and P1 receives it before its checkpoint
- This message is a potential orphan

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Preventing Creation of an Orphan Message
- Suppose the skew between any two clocks in the distributed system is bounded by δ, and each process is asked to checkpoint when its local clock reads τ
- Following its checkpoint, a process Px should not send out messages to any process Py until it is certain that Py's local clock reads more than τ
- Px should therefore remain silent over the interval [τ, τ + δ] (all times as measured by Px's local clock)
- If the inter-process message delivery time has a lower bound μ, then to prevent orphan messages Px only needs to remain silent during the shorter interval [τ, τ + δ - μ]
- If μ > δ, this interval is of zero length - there is no need for Px to remain silent

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Different Method of Prevention
- Suppose message m is received by process Py when its clock reads t
- m must have been sent (by Px) at least μ earlier - before Py's clock read t - μ
- Since the clock skew is at most δ, at that time Px's clock read at most t - μ + δ
- If t - μ + δ < τ, the sending of m is recorded in Px's checkpoint, so m cannot be an orphan
- Hence only a message received by Py when its clock reads at least τ - δ + μ can be an orphan
- Orphan messages can therefore be avoided by Py neither using nor including in its checkpoint at τ any message received during [τ - δ + μ, τ] (by Py's clock) until after it has taken its checkpoint at τ (a numeric illustration follows)
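
The short sketch below just turns the two prevention rules into arithmetic. The parameter names (tau, delta, mu) mirror the symbols in the text above; the numeric values are made up for illustration.

```python
def silent_interval(tau, delta, mu):
    """Interval (by Px's clock) during which Px must not send after its checkpoint."""
    return (tau, tau + max(0.0, delta - mu))       # empty when mu > delta

def must_defer(t, tau, delta, mu):
    """Should Py defer a message received at local time t until after its checkpoint at tau?"""
    return tau - delta + mu <= t <= tau            # only messages in this window can be orphans

# Example: skew bound 2 s, minimum delivery time 0.5 s, checkpoints at local time 1100 s
print(silent_interval(1100, 2.0, 0.5))     # (1100, 1101.5)
print(must_defer(1099.0, 1100, 2.0, 0.5))  # True  -> hold until after the checkpoint
print(must_defer(1090.0, 1100, 2.0, 0.5))  # False -> cannot be an orphan
```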

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Diskless Checkpointing
- Memory is volatile and by itself unsuitable for storing a checkpoint
- However, with extra processors, we can permit checkpointing in main memory
- By avoiding disk writes, checkpointing can be faster
- Best used as one level of a two-level checkpointing scheme
- Use redundant processors with RAID-like techniques to deal with failure
- Example: a distributed system with five executing processors and one extra processor
  - Each executing processor stores its checkpoint in its own memory; the extra processor stores the parity of these checkpoints
  - If an executing processor fails, its checkpoint can be reconstructed from the five remaining checkpoints - the four surviving data checkpoints plus the parity checkpoint (see the parity sketch below)
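
Here is a minimal sketch of the parity idea, assuming each checkpoint is a byte string of equal length (padding and multi-failure codes are ignored). The function names are illustrative.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity_checkpoint(checkpoints):
    """The checkpoint held by the extra (parity) processor."""
    return reduce(xor_bytes, checkpoints)

def reconstruct(surviving_checkpoints, parity):
    """Rebuild the failed processor's checkpoint from the survivors plus parity."""
    return reduce(xor_bytes, surviving_checkpoints, parity)

# Five executing processors, as in the example above
ckpts = [bytes([i] * 8) for i in range(1, 6)]
parity = parity_checkpoint(ckpts)
lost = ckpts.pop(2)                        # processor 2 fails, its memory is gone
assert reconstruct(ckpts, parity) == lost  # recovered from four checkpoints + parity
```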

Copyright 2004 Koren & Krishna ECE655/Ckpt Part RAID-like Diskless Checkpointing
- The inter-processor network must have enough bandwidth for sending checkpoints
- Example: with n executing processors and one checkpointing processor, having all the executing processors send their checkpoints to the checkpointing processor to calculate the parity makes that processor a potential hotspot
- Solution: distribute the parity computations (the accompanying figure shows the case n = 5)

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Two-Level Recovery
- Coordinating checkpoints prevents orphan messages but imposes overhead
- Skipping coordination will not affect correctness if failures are isolated, i.e., at most one process is in a failed/recovering state at any time
- The vast majority of failures are isolated
- So: make recovery from isolated failures fast, and accept longer recovery times for simultaneous failures
- This suggests a two-level recovery scheme (the rollback decision is sketched below):
  - First level: each process takes its own checkpoints without coordination (only useful when recovering from isolated failures); the checkpoint need not be written to disk - it can be written into the memory of another processor
  - Second level: occasionally the entire system undergoes a coordinated checkpoint (with higher overhead), which guards against non-isolated failures
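
A rough sketch of the two-level rollback decision follows. The Proc class, the checkpoint dictionaries, and the way "non-isolated" is detected (more than one process currently failed) are simplifying assumptions of this illustration.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.state = None
    def restore(self, ckpt):
        self.state = ckpt

def recover(failed, all_procs, local_ckpt, coordinated_ckpt):
    """Roll back only the failed process for an isolated failure; roll the whole
    system back to the coordinated checkpoint for overlapping failures."""
    if len(failed) == 1:                          # isolated failure: first level
        p = failed[0]
        p.restore(local_ckpt[p.name])             # cheap uncoordinated checkpoint
    else:                                         # non-isolated: second level
        for p in all_procs:
            p.restore(coordinated_ckpt[p.name])   # consistent coordinated checkpoint

# Usage: an isolated failure of P1
procs = [Proc(f"P{i}") for i in range(3)]
recover([procs[1]], procs,
        local_ckpt={"P1": "P1-local"},
        coordinated_ckpt={p.name: p.name + "-global" for p in procs})
```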

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Two-Level Recovery - Example
- P0 fails at t0; the system rolls P0 back to its latest first-level checkpoint; recovery is successful
- P1 fails at t1 and rolls back; at point tx (during recovery), P2 also fails
- These failures are non-isolated - the system rolls both processes back to the latest second-level checkpoint
- In general, the more common non-isolated failures are, the greater must be the frequency at which second-level checkpoints are taken

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Message Logging
- To continue computation beyond its latest checkpoint, a recovering process may require all the messages it received since then, played back in their original order
- With coordinated checkpointing, each process can simply be rolled back to its latest checkpoint and restarted: those messages will be resent during reexecution
- To avoid the overhead of coordination and let processes checkpoint independently, logging messages is an option
- Two approaches to message logging:
  - Pessimistic logging - ensures that rollback will not spread, i.e., if a process fails, no other process will need to be rolled back to ensure consistency
  - Optimistic logging - a process failure may trigger the rollback of other processes as well

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Pessimistic Message Logging
- Simplest approach: when a message arrives, the receiver stops whatever it is doing, logs the message onto stable storage, and then resumes execution
- To recover a process from failure, roll it back to its latest checkpoint and play back to it, in the right order, the messages it received since that checkpoint (see the sketch below)
- No orphan messages will exist - every message will either have been received before the latest checkpoint or explicitly saved in the message log
- Rolling back one process will therefore not trigger the rollback of any other process
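
A sketch of the receive/recover pair under pessimistic logging follows. Using an append-only file as "stable storage", the file name, and the assumption that the log is truncated at each checkpoint are all illustrative choices.

```python
import json

LOG = "msg_log.jsonl"   # stands in for stable storage; assumed truncated at each checkpoint

def receive(state, msg, handler):
    """Force the message to the log before doing anything with it, then resume."""
    with open(LOG, "a") as f:
        f.write(json.dumps(msg) + "\n")
        f.flush()
    return handler(state, msg)

def recover(checkpoint_state, handler):
    """Roll back to the latest checkpoint, then replay the logged messages in order."""
    state = checkpoint_state
    with open(LOG) as f:
        for line in f:
            state = handler(state, json.loads(line))
    return state

# Example handler: the process just sums the payloads it receives
total = receive(0, {"payload": 5}, lambda s, m: s + m["payload"])
```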

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Sender-Based Message Logging
- Logging messages onto stable storage can impose a significant overhead
- To protect against one isolated failure at a time, sender-based message logging can be used instead
- The sender of a message records it in a log - when required, the log is read to replay the message
- Each process has a send-counter and a receive-counter, incremented every time the process sends or receives a message
- Each message carries a Send Sequence Number (SSN) - the value of the send-counter when it is transmitted
- A received message is allocated a Receive Sequence Number (RSN) - the value of the receive-counter when it is received
- The receiver also sends an ACK to the sender, including the RSN it has allocated to the message
- Upon receiving this ACK, the sender acknowledges the ACK in a message to the receiver

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Sender-Based Message Logging - Cont'd
- Between the time the receiver receives the message (and sends its ACK) and the time it receives the sender's acknowledgment of that ACK, the receiver is forbidden to send messages to other processes - this is essential to maintaining correct functioning upon recovery
- A message is said to be fully logged when the sending node knows both its SSN and its RSN; it is partially logged when the sending node does not yet know its RSN
- When a process rolls back and restarts computation from its latest checkpoint, it sends the other processes a message listing, for each of them, the SSN of the latest message from that process recorded in its checkpoint
- When a process receives this message, it knows which of its messages must be retransmitted, and retransmits them
- The recovering process must now consume these messages in the same order as before it failed - easy for fully logged messages, since their RSNs are available and the messages can be sorted by RSN (a sketch of the sender-side bookkeeping follows)
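
The sketch below shows just the sender-side log and the SSN/RSN bookkeeping implied above; the transport, the ACK-of-ACK exchange, and failure detection are abstracted away, and all names are illustrative.

```python
class SenderLog:
    def __init__(self):
        self.ssn = 0
        self.entries = {}                   # ssn -> {"msg": ..., "rsn": ...}

    def send(self, msg):
        self.ssn += 1
        self.entries[self.ssn] = {"msg": msg, "rsn": None}   # partially logged
        return self.ssn                     # the SSN travels with the message

    def on_ack(self, ssn, rsn):
        self.entries[ssn]["rsn"] = rsn      # now fully logged; sender then ACKs the ACK

    def replay_after(self, last_ssn_in_checkpoint):
        """Messages the recovering receiver needs again: fully logged ones first,
        ordered by RSN; partially logged ones last (their order does not matter)."""
        missing = [e for s, e in self.entries.items() if s > last_ssn_in_checkpoint]
        return sorted(missing, key=lambda e: (e["rsn"] is None, e["rsn"] or 0))

# Usage
log = SenderLog()
s1 = log.send("m1")
s2 = log.send("m2")
log.on_ack(s1, rsn=7)            # m1 becomes fully logged
print(log.replay_after(0))       # m1 (sorted by RSN) first, then the partially logged m2
```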

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Partially-Logged Messages
- The remaining problem is the partially logged messages, whose RSNs are not available
- They were sent out, but their ACK was never received by the sender
- Either the receiver failed before the message could be delivered to it, or it failed after receiving the message but before it could send out the ACK
- Recall that between receiving a message and completing the ACK exchange for it, the receiver is forbidden to send out messages of its own to other processes
- As a result, receiving the partially logged messages in a different order the second time cannot affect any other process in the system - correctness is preserved
- Clearly, this approach is only guaranteed to work if there is at most one failed node at any time

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Optimistic Message Logging
- Optimistic message logging has a lower failure-free overhead than pessimistic logging; however, recovery from failure is much more complex
- As a result, optimistic logging is mainly of theoretical interest
- When messages are received, they are written into a volatile buffer which, at a suitable time, is copied to stable storage (sketched below)
- Process execution is not disrupted, so the logging overhead is very low
- Upon failure, the contents of the buffer can be lost, which may force multiple processes to be rolled back
- We need a scheme to handle this situation
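
A sketch of the buffered write path follows; the buffer size, flush policy, and file name are illustrative assumptions, and the recovery side (which is where the real complexity lies) is not shown.

```python
import json

class OptimisticLog:
    def __init__(self, path="opt_log.jsonl", flush_every=64):
        self.path, self.flush_every = path, flush_every
        self.buffer = []                      # volatile: lost if the process crashes

    def record(self, msg):
        self.buffer.append(msg)               # no disk write on the critical path
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        with open(self.path, "a") as f:       # copy the buffer to stable storage
            for m in self.buffer:
                f.write(json.dumps(m) + "\n")
        self.buffer.clear()

# Usage
log = OptimisticLog(flush_every=2)
log.record({"from": "P0", "payload": 1})      # stays in the volatile buffer
log.record({"from": "P2", "payload": 2})      # triggers a flush to the file
```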

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Checkpointing in Shared-Memory Systems
- A variant of CARER for shared-memory, bus-based multiprocessors in which each processor has its own cache
- The algorithm is changed to maintain cache coherence among the multiple caches
- Instead of the single bit marking a line as unchangeable, we have a multi-bit identifier:
  - A checkpoint identifier, C_id, with each cache line
  - A (per-processor) checkpoint counter, C_count, keeping track of the current checkpoint number

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Shared Memory - Cont.
- To take a checkpoint, increment the counter
- A line modified before the checkpoint will then have its C_id less than the counter
- When a line is updated, set C_id = C_count
- If a line has been modified since being brought into the cache and C_id < C_count, the line is part of the checkpoint state and is therefore unwritable; any write into such a line must wait until the line has first been written back to main memory (see the sketch below)
- If the counter has k bits, it rolls over to 0 after reaching 2^k - 1
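
The per-line write check implied by C_id and C_count might look roughly like the sketch below. The data structures, the write_back callback, and the handling of counter wraparound (ignored here) are assumptions of this illustration.

```python
class CacheLine:
    def __init__(self):
        self.modified = False
        self.c_id = 0
        self.data = 0

def take_checkpoint(proc):
    # k-bit counter: rolls over to 0 after reaching 2**k - 1
    proc["c_count"] = (proc["c_count"] + 1) % (1 << proc["k"])

def write(proc, line, value, write_back):
    # A modified line stamped before the current checkpoint belongs to the
    # checkpoint state: write it back to memory before it may be overwritten.
    if line.modified and line.c_id < proc["c_count"]:
        write_back(line)
        line.modified = False
    line.data = value
    line.modified = True
    line.c_id = proc["c_count"]     # stamp with the current checkpoint number

# Usage
proc = {"c_count": 0, "k": 4}
ln = CacheLine()
write(proc, ln, 42, write_back=lambda l: None)
take_checkpoint(proc)
write(proc, ln, 43, write_back=lambda l: print("write-back of checkpointed line"))
```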

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Bus-Based Coherence Protocol
- Modify a cache coherence algorithm to take account of checkpointing
- All traffic between the caches and memory must use the bus, i.e., all caches can watch (snoop on) the bus traffic
- A cache line can be in one of the following states: invalid, shared unmodified, exclusive modified, or exclusive unmodified
- Exclusive: this is the only valid copy of the line in any cache
- Modified: the line has been modified since it was brought into the cache from memory

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Bus-Based Coherence Protocol - Cont'd
- If a processor wants to update a line that is in the shared unmodified state, the line moves into the exclusive modified state
- Other caches holding the same line must invalidate their copies - they are no longer current
- If a line is in the exclusive modified or exclusive unmodified state and another cache puts out a read request for it on the bus, this cache must service that request (it holds the only current copy of that line)
- As a byproduct, memory is also updated if necessary; the line then moves to shared unmodified
- On a write miss, the line is brought into the cache in the exclusive modified state

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Bus-Based Coherence and Checkpointing Protocol
- How can we modify this protocol to account for checkpointing?
- The original exclusive modified state now splits into two states (a sketch of the resulting states follows):
  - Exclusive modified
  - Unwritable
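
One plausible reading of this split, consistent with the C_id/C_count rule above, is sketched below as a small state machine; it is not the slides' exact transition table, and bus invalidations for shared copies are abstracted away.

```python
from enum import Enum, auto

class State(Enum):
    INVALID = auto()
    SHARED_UNMODIFIED = auto()
    EXCLUSIVE_UNMODIFIED = auto()
    EXCLUSIVE_MODIFIED = auto()
    UNWRITABLE = auto()      # a modified line that now belongs to the checkpoint state

def on_checkpoint(state):
    # Dirty lines become part of the checkpoint and may no longer be overwritten in place
    return State.UNWRITABLE if state is State.EXCLUSIVE_MODIFIED else state

def on_local_write(state, write_back):
    if state is State.UNWRITABLE:
        write_back()         # flush the checkpointed value to memory first
    return State.EXCLUSIVE_MODIFIED

# Example: a dirty line survives a checkpoint and is then written again
s = on_checkpoint(State.EXCLUSIVE_MODIFIED)            # -> UNWRITABLE
s = on_local_write(s, write_back=lambda: print("write back before overwrite"))
```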

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Directory-Based Protocol
- In this approach a directory, maintained centrally, records the status of each line
- We can regard this directory as being controlled by a shared-memory controller
- This controller handles all read and write misses and all other operations that change a line's state
- Example: if a line is in the exclusive unmodified state and the cache holding that line wants to modify it, the cache notifies the controller of its intention; the controller then changes the state to exclusive modified
- It is then a simple matter to implement the checkpointing scheme atop such a protocol (a small controller sketch follows)
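
A minimal sketch of a directory controller mediating the write upgrade in the example above; the directory layout, the single-owner assumption, and the handling of the other cases (indicated only by a comment) are illustrative, not taken from the slides.

```python
directory = {}   # line address -> {"state": ..., "owner": ...}

def request_write(line, requester):
    entry = directory.setdefault(line, {"state": "invalid", "owner": None})
    if entry["state"] == "exclusive unmodified" and entry["owner"] == requester:
        entry["state"] = "exclusive modified"   # the upgrade described in the example
    else:
        # Shared copies to invalidate, unwritable checkpoint lines to flush first,
        # misses to service, etc. would be handled here before granting ownership.
        entry.update(state="exclusive modified", owner=requester)
    return entry

# Usage: cache 0 already holds line 0x40 exclusive unmodified and wants to write it
directory[0x40] = {"state": "exclusive unmodified", "owner": 0}
print(request_write(0x40, requester=0))
```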

Copyright 2004 Koren & Krishna ECE655/Ckpt Part Other Uses of Checkpointing
- (1) Process migration
  - A checkpoint represents process state - migrating a process from one processor to another amounts to moving the checkpoint and resuming computation on the new processor; this can be used to recover from permanent or intermittent faults
  - The nature of the checkpoint determines whether the new processor must be of the same model and run the same operating system
- (2) Load balancing
  - Better utilization of a distributed system by ensuring that the computational load is appropriately shared among the processors
- (3) Debugging
  - Core files are dumped when a program exits abnormally; these are essentially checkpoints, containing full state information about the affected process, and debuggers can read them to aid in the debugging process
- (4) Snapshots
  - Observing the program state at discrete epochs gives a deeper understanding of program behavior