CS5204 – Operating Systems 1 Checkpointing-Recovery.

Slides:



Advertisements
Similar presentations
Replicated Dictionary and Log
Advertisements

Recovery Failure of a site/node in a distributed system causes inconsistencies in the state of the system. Recovery: bringing back the failed node in step.
Faults and Recovery Ludovic Henrio CNRS - projet OASIS Sources: - A survey of rollback-recovery protocols in message-passing systems.
Uncoordinated Checkpointing The Global State Recording Algorithm.
Faults and Recovery Ludovic Henrio INRIA - projet OASIS Sources: - A survey of rollback-recovery protocols in message-passing systems.
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
MPICH-V: Fault Tolerant MPI Rachit Chawla. Outline  Introduction  Objectives  Architecture  Performance  Conclusion.
Distributed Systems CS Fault Tolerance- Part III Lecture 15, Oct 26, 2011 Majd F. Sakr, Mohammad Hammoud andVinay Kolar 1.
Copyright 2004 Koren & Krishna ECE655/Ckpt Part.11.1 Fall 2006 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
1 Message Logging Pessimistic & Optimistic CS717 Lecture 10/16/01-10/18/01 Kamen Yotov
1 CS 194: Distributed Systems Distributed Commit, Recovery Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering.
Causal Logging : Manetho Rohit C Fernandes 10/25/01.
CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture XI: Distributed Transactions.
CMPT Dr. Alexandra Fedorova Lecture XI: Distributed Transactions.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:
1 Rollback-Recovery Protocols II Mahmoud ElGammal.
Commit Protocols. CS5204 – Operating Systems2 Fault Tolerance Causes of failure: process failure machine failure network failure Goals : transparent:
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.17.1 FAULT TOLERANT SYSTEMS Chapter 6 – Checkpointing.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems.
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
Distributed Transactions Chapter 13
Distributed Systems CS Fault Tolerance- Part III Lecture 19, Nov 25, 2013 Mohammad Hammoud 1.
Distributed Systems Principles and Paradigms Chapter 07 Fault Tolerance 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency.
1 8.3 Reliable Client-Server Communication So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels?
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Fault Tolerant Systems
A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.
12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
CprE 545: Fault Tolerant Systems (G. Manimaran), Iowa State University1 CprE 545: Fault Tolerant Systems Rollback Recovery Protocols.
1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.
Coordinated Checkpointing Presented by Sarah Arnold 1.
Rollback-Recovery Protocols in Message-Passing Systems Based on A Survey of Rollback-Recovery Protocols in Message-Passing Systems by Mootaz Elnozahy Lorenzo.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Middleware for Fault Tolerant Applications Lihua Xu and Sheng Liu Jun, 05, 2003.
Rollback-Recovery Protocols I Message Passing Systems Nabil S. Al Ramli.
Ludovic Henrio INRIA - projet OASIS
EEC 688/788 Secure and Dependable Computing Lecture 5 Wenbing Zhao Cleveland State University
1 Fault Tolerance and Recovery Mostly taken from
Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.
1 Distributed Systems 2007/08 Rollback-Recovery Alberto Montresor Università di Trento This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike.
8.6. Recovery By Hemanth Kumar Reddy.
Ludovic Henrio CNRS - projet SCALE
Prepared by Ertuğrul Kuzan
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
Chapter 8 Fault Tolerance Part I Introduction.
EECS 498 Introduction to Distributed Systems Fall 2017
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Middleware for Fault Tolerant Applications
EEC 688/788 Secure and Dependable Computing
Fault Tolerant Distributed Computing system.
EEC 688/788 Secure and Dependable Computing
Operating System Reliability
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Last Class: Fault Tolerance
Operating System Reliability
Operating System Reliability
Presentation transcript:

CS5204 – Operating Systems 1 Checkpointing-Recovery

Checkpointing CS 5204 – Operating Systems2 Fault Tolerance erroneous state error valid state failure causes fault leads to recovery An error is a manifestation of a fault that can lead to a failure. Failure Recovery: backward recovery operation­based (do­undo­redo logs) state­based (checkpointing/logging) forward recovery

Checkpointing CS 5204 – Operating Systems3 System Model Basic approaches checkpointing : copying/restoring the state of a process logging : recording/replaying messages

Checkpointing CS 5204 – Operating Systems4 Orphan Message X m x1x1 Y y1y1

Checkpointing CS 5204 – Operating Systems5 Lost Messages Y X m y1y1 x1x1 Regenerating lost messages on recovery: if implemented on unreliable communication channels, the application is responsible if impelmented on reliable communication channels, the recovery algorithm is responsible

Checkpointing CS 5204 – Operating Systems6 Domino Effect Cases: X fails after x 3 Y fails after sending message m Z fails after sending message n x2x2 x3x3 X n m y2y2 x1x1 z2z2 z1z1 Z Y y1y1

Checkpointing CS 5204 – Operating Systems7 Other Issues Output commit  the state from which messages are sent to the “outside world” can be recovered  affects latency of message delivery to “outside world” and overhead of checkpoint/logging Stable storage  survives process failures  contains checkpoint/logging information Garbage collection  removal of checkpoints/logs no longer needed

Checkpointing CS 5204 – Operating Systems8 Logging Protocols Elements Piecewise deterministic (PWD) assumption – the system state can be recovered by replaying message receptions Determinant – record of information needed to recover receipt of message Determinants for m 5 and m 6 not logged

Checkpointing CS 5204 – Operating Systems9 Taxonomy Rollback-Recovery checkpointinglogging uncoordinatedcoordinatedcommunication -induced pessimisticoptimisticcausal blockingnon-blockingindex-basedmodel-based

Checkpointing CS 5204 – Operating Systems10 Uncoordinated Checkpointing Rollback-Recovery checkpointing uncoordinated susceptible to domino effect can generate useless checkpoints complicates storage/GC not suitable for frequent output commits

Checkpointing CS 5204 – Operating Systems11 Cordinated/Blocking Protocols Rollback-Recovery checkpointing coordinated blocking X Z Y m y1y1 y2y2 x1x1 x2x2 z1z1 z2z2 no messages can be in transit during checkpointing {x 1, y 1, z 1 } forms “recovery line”

Checkpointing CS 5204 – Operating Systems12 Coordinated/Blocking Notation Each node maintains: a monotonically increasing counter with which each message from that node is labeled. records of the last message from/to and the first message to all other nodes. X Y last_label_rcvd X [Y] last_label_sent X [Y] first_label_sent Y [X] m.l (a message m and its label l) Note: “sl” denotes a “smallest label” that is < any other label and “ll” denotes a “largest label” that is > any other label

Checkpointing CS 5204 – Operating Systems13 Coordinated/Blocking Algorithm (1) When must I take a checkpoint? (2) Who else has to take a checkpoint when I do? (1) When I (Y) have sent a message to the checkpointing process, X, since my last checkpoint: last_label_rcvd X [Y] >= first_label_sent Y [X] > sl (2) Any other process from whom I have received messages since my last checkpoint. ckpt_cohort X = {Y | last_label_rcvd X [Y] > sl} tentative checkpoint X Z Y m y1y1 y2y2 x1x1 x2x2 z1z1 z2z2

Checkpointing CS 5204 – Operating Systems14 Coordinated/Blocking Algorithm (1) When must I rollback? (2) Who else might have to rollback when I do? (1) When I,Y, have received a message from the restarting process,X, since X's last checkpoint. last_label_rcvd Y (X) > last_label_sent X (Y) (2) Any other process to whom I can send messages. roll_cohort Y = {Z | Y can send message to Z} X Z Y y1y1 y2y2 x1x1 x2x2 z1z1 z2z2

Checkpointing CS 5204 – Operating Systems15 Taxonomy Rollback-Recovery checkpointing coordinated non-blocking Approach: “tag” message to trigger checkpointing Example: global-state recording algorithm

Checkpointing CS 5204 – Operating Systems16 Communication-Induced Checkpointing checkpointing Z-path:[m 1,m 2 ] and [m 3,m 4 ] Z-cycle: [m 3,m 4,m 5 ] Checkpoints (like c 2,2 ) in a z-cycle are useless Cause checkpoints to be taken to avoid z-cycles Rollback-Recovery communication -induced

Checkpointing CS 5204 – Operating Systems17 Logging Rollback-Recovery logging pessimisticoptimisticcausal Orphan process: a non-failed process whose state depends on a non-deterministic event that cannot be reproduced during recovery. Determinant: the information need to “replay” the occurrence of a non-deterministic event (e.g., message reception). Avoid orphan processes by guaranteeing: For all e : not Stable(e) => Depend(e) < Log(e) where: Depend(e) – set of processes affected by event e Log(e) – set of processes with e logged on volatile memory Stable(e) – set of processes with e logged on stable storage

Checkpointing CS 5204 – Operating Systems18 Pessimistic Logging Determinant is logged to stable storage before message is delivered Disadvantage: performance penalty for synchronous logging Advantages: immediate output commit restart from most recent checkpoint recovery limited to failed process(es) simple garbage collection

Checkpointing CS 5204 – Operating Systems19 Optimistic Logging determinants are logged asynchronously to stable storage consider: P 2 fails before m 5 is logged advantage: better performance in failure-free execution disadvantages: coordination required on output commit more complex garbage collection

Checkpointing CS 5204 – Operating Systems20 Causal logging combines advantages of optimistic and pessimistic logging based on the set of events that causally precede the state of a process guarantees determinants of all causally preceding events are logged to stable storage or are available locally at non-failed process non-failed process “guides” recovery of failed processes piggybacks on each message information about causally preceding messages reduce cost of piggybacked information by send only difference between current information and information on last message