Coordinated Checkpointing Presented by Sarah Arnold 1.

Slides:



Advertisements
Similar presentations
Chapter 16: Recovery System
Advertisements

Recovery Failure of a site/node in a distributed system causes inconsistencies in the state of the system. Recovery: bringing back the failed node in step.
Faults and Recovery Ludovic Henrio CNRS - projet OASIS Sources: - A survey of rollback-recovery protocols in message-passing systems.
Copyright 2004 Koren & Krishna ECE655/Ckpt Part.12.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
Faults and Recovery Ludovic Henrio INRIA - projet OASIS Sources: - A survey of rollback-recovery protocols in message-passing systems.
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
1 Message Logging Pessimistic & Optimistic CS717 Lecture 10/16/01-10/18/01 Kamen Yotov
1 CS 194: Distributed Systems Distributed Commit, Recovery Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering.
Causal Logging : Manetho Rohit C Fernandes 10/25/01.
TRANSACTIONS A sequence of SQL statements to be executed "together“ as a unit: A money transfer transaction: Reasons for Transactions : Concurrency control.
Academic Year 2014 Spring. MODULE CC3005NI: Advanced Database Systems “DATABASE RECOVERY” (PART – 1) Academic Year 2014 Spring.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:
1 Rollback-Recovery Protocols II Mahmoud ElGammal.
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.17.1 FAULT TOLERANT SYSTEMS Chapter 6 – Checkpointing.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems.
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
Switch off your Mobiles Phones or Change Profile to Silent Mode.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Distributed Transactions Chapter 13
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Fault Tolerant Systems
Lecture 12 Recoverability and failure. 2 Optimistic Techniques Based on assumption that conflict is rare and more efficient to let transactions proceed.
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 294 Database Systems II Coping With System Failures.
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14.
CS5204 – Operating Systems 1 Checkpointing-Recovery.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
CprE 545: Fault Tolerant Systems (G. Manimaran), Iowa State University1 CprE 545: Fault Tolerant Systems Rollback Recovery Protocols.
1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.
Chapter 16 Recovery Yonsei University 1 st Semester, 2015 Sanghyun Park.
Rollback-Recovery Protocols in Message-Passing Systems Based on A Survey of Rollback-Recovery Protocols in Message-Passing Systems by Mootaz Elnozahy Lorenzo.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Rollback-Recovery Protocols I Message Passing Systems Nabil S. Al Ramli.
Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems Sapna E. George, Ing-Ray Chen, Ying Jin Dept. of Computer Science Virginia.
Ludovic Henrio INRIA - projet OASIS
EEC 688/788 Secure and Dependable Computing Lecture 5 Wenbing Zhao Cleveland State University
1 Fault Tolerance and Recovery Mostly taken from
SYSTEMS IMPLEMENTATION TECHNIQUES TRANSACTION PROCESSING DATABASE RECOVERY DATABASE SECURITY CONCURRENCY CONTROL.
Jun-Ki Min. Slide Purpose of Database Recovery ◦ To bring the database into the last consistent stat e, which existed prior to the failure. ◦
Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.
1 Distributed Systems 2007/08 Rollback-Recovery Alberto Montresor Università di Trento This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike.
8.6. Recovery By Hemanth Kumar Reddy.
Ludovic Henrio CNRS - projet SCALE
Prepared by Ertuğrul Kuzan
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Chapter 10 Transaction Management and Concurrency Control
EEC 688/788 Secure and Dependable Computing
Parallel and Distributed Simulation
Recovery System.
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
Presentation transcript:

Coordinated Checkpointing Presented by Sarah Arnold 1

Coordinated Checkpointing Agenda Goals Fault Tolerance Failure Recovery System Overview Coordinated Checkpointing Communication-Induced Checkpointing Logging Conclusions 2

Coordinated Checkpointing Goals To recover the system after any type of fault has been introduced to the system and to minimize the amount of computation lost  Hardware  Software  Processors  Network  Memory  Disk 3 erroneous state error valid state failure causes fault leads to recovery

Coordinated Checkpointing Fault Tolerance Fault Tolerance – a design that enables a system to continue operation, rather than failing completely, when some part of the system fails  Looking at problem from system perspective in terms of the state of the system being its “memory state”  We know nothing of the application or outside world processes that may have introduced the error, but must still get the system back to a valid state 4

Coordinated Checkpointing Failure Recovery Failure Recovery – an attempt to put the system back into a valid state  Backward Recovery: Retreating back to an earlier state of the system Operation-based: Logs of operations are maintained and replayed State-based: Check-pointing particular states of the system as it evolves  Forward Recovery: Usually no previous state to retreat to; instead must fail into some forward condition Messages sent to outside world are sent and cannot be retrieved: Imagine trying to recover Space Shuttle after liftoff! 5

Coordinated Checkpointing System Model System interacts with outside world as well as sends messages internally  System must be kept in a coherent state with the outside world process 6 Processes Messages

Coordinated Checkpointing Orphan Messages Orphan Message: A message that is received but never sent (i.e. message m below); no sender can be identified  Due to the fact that, when restored back to their checkpoints, one part of the system is incoherent with another part of the system Checkpoint: Complete recorded state of the application or Failure: 7 X1X1 X m x1x1 Y y1y1

Coordinated Checkpointing Lost Messages If a process fails and has to recover to a previous state before it received a message, the message is lost  Sender might try and send again, but potential receiver doesn’t even know it had been sent already 8 Y X m y1y1 x1x1

Coordinated Checkpointing In-/Consistent States When rolling back to a checkpoint, the system is in a consistent state if there are no orphan messages (see a below) and is in an inconsistent state if there are orphan messages (see b below) 9

Coordinated Checkpointing Domino Effect In order to avoid orphan messages and rolling back to an inconsistent state, a failed process may trigger other processes to rollback as well – this is Domino Effect.  Goal is to checkpoint at most useful time/state Consider the effect if Z failed after sending message n 10 X n m y2y2 x1x1 z2z2 z1z1 Z Y y1y1 x2x2 x3x3

Coordinated Checkpointing Algorithm Considerations Output commit: when a message is sent to the outside world, there is no way to pull that message back; similarly, there may not be a way to reproduce a message from the outside world.  Therefore, the state of the system must be solid to ensure no failure past that point  Expense: Affects latency of message and additional checkpointing Garbage Collection: when can I get rid of older checkpoints? Stable Storage  All algorithms assume that the location of checkpointing data is on stable storage 11

Coordinated Checkpointing Logging Elements Determinant: The information that must be logged that is needed to recover a message  How to record this depends on type of algorithm Piecewise-Deterministic  Postulates that all nondeterministic events that a process executes can be identified and the information needed to replay the events can be logged in its determinant  By logging and replaying the nondeterministic events in their exact order, a process can deterministically recreate its pre-failure state, even without a checkpoint 12 L

Coordinated Checkpointing Recovery Algorithms 13 Rollback-Recovery checkpointinglogging uncoordinatedcoordinatedcommunication- induced pessimisticoptimisticcausal blockingnon-blockingindex-basedmodel-based

Coordinated Checkpointing Coordinated Checkpointing Protocol (Blocking) When a process takes a checkpoint, it engages a protocol to coordinate with other processes to also checkpoint  Coordinator takes a checkpoint; broadcasts a message to all processes  Process receives this message and halts execution; takes tentative checkpoint  Coordinator receives acknowledgement from all processes; broadcasts commit message to end protocol  Process receives commit message, removes old permanent checkpoint and makes tentative checkpoint permanent  Processes resume execution 14

Coordinated Checkpointing Recovery line: guarantee that system will never have to go back to a state earlier than this line  {x1, y1, z1} forms “recovery line”  Good for garbage collection Blocking: Application is paused and no messages can be in transit during checkpointing 15 Coordinated Checkpointing Protocol (Blocking)

Coordinated Checkpointing Coordinated Checkpointing Notation Each message has a sequence number (an increasing counter) affixed to it by the system When we checkpoint, we keep these vectors along with it 16 X Y last_label_rcvd X [Y] last_label_sent X [Y] first_label_sent Y [X] m.l (a message m and its label l) Last label X received before checkpoint was from Y Last label X sent before checkpoint was to Y First label Y sent after checkpoint was to X

Coordinated Checkpointing Coordinated Checkpointing Questions When to take a checkpoint?  Application specific  Balance the cost of taking the checkpoint against the amount of computation that you’re going to lose by not taking one and having to use an earlier one Checkpoint protocol  When should I do a checkpoint?  If I take a checkpoint, who else do I have to ensure also takes a checkpoint? and…  When must I rollback?  If I rollback, who else must rollback? Answers are based on label vectors! 17 ?

Coordinated Checkpointing 18 Coordinated Checkpointing Algorithm (1) When must I take a checkpoint? (2) Who else has to take a checkpoint when I do? (1) When I (Y) have sent a message to the checkpointing process, X, since my last checkpoint: last_label_rcvd X [Y] >= first_label_sent Y [X] >sl (2) Any other process from whom I have received messages since my last checkpoint. ckpt_cohort X = {Y | last_label_rcvd X [Y] >sl} tentative checkpoint X Z Y m y1y1 y2y2 x1x1 x2x2 z1z1 z2z2

Coordinated Checkpointing 19 Coordinated Checkpointing Algorithm (1) When must I rollback? (2) Who else might have to rollback when I do? (1) When I,Y, have received a message from the restarting process,X, since X's last checkpoint. last_label_rcvd Y (X) >last_label_sent X (Y) (2) Any other process to whom I can send messages. roll_cohort Y = {Z | Y can send message to Z} X Z Y y1y1 y2y2 x1x1 x2x2 z1z1 z2z2

Coordinated Checkpointing Key issue with coordinated checkpointing:  Being able to prevent a process from receiving application messages that could make the checkpoint inconsistent Problem can be avoided by preceding the first post-checkpoint message on each channel by a checkpoint request, forcing each process to take a checkpoint upon receiving the first checkpoint-request message 20

Coordinated Checkpointing Communication-Induced Checkpointing Avoids domino effect without coordinated checkpoints Processes take two kinds of checkpoints  Local: can be taken independently  Forced: must be taken to guarantee progress of recovery line Piggyback protocol-specific information on each application message Follow application trends to make sure checkpoint is necessary  Z-paths and Z-cycles form patterns 21

Coordinated Checkpointing Communication-Induced Checkpointing Z-path: sequence of messages in the interval between two checkpoints  [m 1, m 2 ], [m 1, m 4 ], [m 3, m 2 ] and [m 3, m 4 ] Z-cycle: Z-path that begins and ends within the same interval  [m 5, m 3, m 4 ]  Makes checkpoint c 2, 2 useless 22

Coordinated Checkpointing Logging Goal: Capture messages that are received and avoid orphan processes  Always-no-orphans condition: If any surviving processes depends on an event e, either the event is logged on stable storage or the process has a copy of e’s determinant. Uses checkpointing and logs Useful with applications that interact frequently with the outside world  Enables process to repeat its execution without having to take expensive checkpoints before sending messages Not susceptible to domino effect Piecewise determinism  Rollback recovery protocol can identify all nondeterministic events (messages received, input from outside world, etc.) executed and logs the determinant; can recover a failed process and replay its execution as it occurred before the failure 23

Coordinated Checkpointing Logging Recoverable: a state interval is recoverable if there is sufficient information to replay the execution up to that point despite any future failures Stable: a state interval is stable if the determinant of the nondeterministic event that started it is logged on stable storage  Recoverable is always stable, but opposite is not always true P 1 and P 2 fail before logging m 5 and m 6? M 7 becomes an orphan message  Maximum Recoverable State: X, Y, Z 24

Coordinated Checkpointing Pessimistic Logging Designed under assumption that a failure can occur after any nondeterministic event  Protocol logs determinant to stable storage before event is allowed to affect computation Periodic checkpoints are taken to aid in repeating execution  Application is restarted from most recent checkpoint and the logged determinants are used to recreate execution Pros:  Immediate output commit  Restart from most recent checkpoint  Recovery limited to failed processes Always-no-orphans: if a surviving process depends on an event, either the event is logged or that process has a copy of the event’s determinant  Simple garbage collection Con:  Performance Penalty due to synchronous logging 25 

Coordinated Checkpointing Optimistic Logging Log determinants asynchronously  Optimistic assumption that logging will complete before a failure occurs Determinants are kept in a volatile log that is periodically flushed to stable storage  No blocking necessary (less overhead)  More complicated recovery, garbage collection, and slower output commit Does not implement always-no-orphans  Permits temporary creation of orphan processes  Upon a failure, dependency information is used to recover latest global state of pre-failure execution in which no process is an orphan Great for failure free executions 26

Coordinated Checkpointing Causal Logging Failure-free performance from optimistic + allowing processes to commit output independently and always-no-orphans from pessimistic Determinants of all causally preceding events are logged to stable storage or are available locally Limits rollback to most recent checkpoint  Reduces overhead of storage and work at risk Piggybacks on each message information about preceding messages 27 X  Y

Coordinated Checkpointing Rollback-Recovery Protocols 28

Coordinated Checkpointing Conclusions Issues at hand: Piecewise determinism, performance overhead, storage overhead, ease of output commits, ease of garbage collection, ease of recovery, avoiding domino effect and orphan processes Checkpointing:  Coordinated: simplifies recovery and garbage collection, overall good performance  Uncoordinated: suffers from potential domino effects and complicates recovery  Communication-Induced: no domino effect or coordination, but nondeterministic nature complicates garbage collection and degrades performance Logging: Natural choice for applications that often interact with outside world  Pessimistic: simplifies recovery and output commit; simple and robust  Causal: reduces overhead, fast output commit and orphan-free recovery  Optimistic: reduces overhead more than Causal, but complicates recovery by increasing extent of future rollbacks 29

Coordinated Checkpointing Questions? Thank you! 30

Coordinated Checkpointing References “A Survey of Rollback-Recovery Protocols in Message-Passing Systems” by E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson Fault Tolerance: en.wikipedia.org/wiki/Fault_tolerance en.wikipedia.org/wiki/Fault_tolerance Checkpointing-Recovery, Dr. Dennis Kafura 31