Application-Level Checkpoint-restart (CPR) for MPI Programs


Application-Level Checkpoint-restart (CPR) for MPI Programs Keshav Pingali Joint work with Dan Marques, Greg Bronevetsky, Paul Stodghill, Rohit Fernandes

The Problem

Old picture of high-performance computing:
- Turn-key big-iron platforms
- Short-running codes

Modern high-performance computing:
- Roll-your-own platforms: large clusters from commodity parts, grid computing
- Long-running codes: protein folding on Blue Gene may take a year
- Program runtimes are exceeding MTBF (ASCI, Blue Gene, Illinois Rocket Center)

Software view of hardware failures

Two classes of faults:
- Fail-stop: a failed processor ceases all operation and does not further corrupt system state
- Byzantine: arbitrary failures (nothing to do with adversaries)

Our focus: fail-stop faults.

Solution Space for Fail-stop Faults

Checkpoint-restart (CPR) [our choice]:
- Save application state periodically
- When a process fails, all processes go back to the last consistent saved state

Message logging:
- Processes save outgoing messages
- If a process goes down, it restarts and its neighbors resend it the old messages
- Checkpointing is used to trim the message log
- In principle, only failed processes need to be restarted
- Popular in the distributed-systems community
- Our experience: not practical for scientific programs because of communication volume

Solution Space for CPR

Two orthogonal design choices:
- Saving process state: application-level vs. system-level
- Checkpoint coordination: uncoordinated, coordinated (blocking or non-blocking), or quasi-synchronous

Saving process state

System-level checkpointing (SLC):
- Save all bits of the machine
- Program must be restarted on the same platform

Application-level checkpointing (ALC) [our choice]:
- Programmer chooses certain points in the program to save minimal state
- Programmer or compiler generates the save/restore code
- Amount of saved data can be much less than in system-level CPR (e.g., n-body codes)
- In principle, the program can be restarted on a totally different platform

Practice at national labs: demand that vendors provide SLC, but use hand-rolled ALC in practice!
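To illustrate why application-level state can be so much smaller than a full memory image: for an n-body style code it suffices to save the step counter, positions, and velocities, and recompute forces on restart. A minimal Python sketch (the file name, JSON format, and function names are illustrative choices, not from the talk):

```python
import json
import os
import tempfile

def save_checkpoint(path, step, positions, velocities):
    """Save only the minimal application state; forces etc. are recomputed."""
    state = {"step": step, "positions": positions, "velocities": velocities}
    # Write to a temporary file and rename, so a crash mid-write can
    # never corrupt the previous good checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Restore the minimal state saved by save_checkpoint."""
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["positions"], state["velocities"]
```

The atomic-rename idiom matters in practice: a checkpointer that can be killed at any moment must never leave a half-written checkpoint as the only copy.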

Coordinating checkpoints

Uncoordinated:
- Dependency tracking, time-coordinated, ...
- Suffers from exponential rollback

Coordinated [our choice]:
- Blocking: global snapshot at a barrier; used in current ALC implementations
- Non-blocking: Chandy-Lamport

Blocking Co-ordinated Checkpointing

[Diagram: processes P, Q, R taking checkpoints at successive barriers]

- Many programs are bulk-synchronous (the BSP model of Valiant)
- At a barrier, all processes can take checkpoints
  - Assumption: no messages are in flight across the barrier
- The parallel program reduces to a sequential state-saving problem
- But many new parallel programs do not have global barriers...

Non-blocking coordinated checkpointing

Processes must be coordinated, but do we really need to block all processes and flush the channels before taking a global checkpoint? Chandy and Lamport asked the same question, and answered it with a new idea: coordinated non-blocking checkpointing.

Global View

[Diagram: initiator, recovery lines, and epochs n and n+1 for processes P and Q]

- Initiator: root process that decides, once in a while, to take a global checkpoint
- Recovery line: the saved state of each process (plus some additional information); recovery lines do not cross
- Epoch: the interval between successive recovery lines
- Program execution is divided into a series of disjoint epochs
- A failure in epoch n requires that all processes roll back to the recovery line that began epoch n

Possible Types of Messages

[Diagram: past, late, early, and future messages relative to P's and Q's checkpoints]

On recovery:
- Past messages are left alone
- Future messages are re-executed
- Late messages are re-received but not resent
- Early messages are resent but not re-received

Non-blocking protocols must deal with late and early messages.
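The four classes are determined entirely by where the send and the receive fall relative to the two processes' checkpoints. A small Python sketch of the classification (the function and parameter names are illustrative):

```python
def classify_message(sent_before_sender_ckpt, received_before_receiver_ckpt):
    """Classify a message by the position of its send and receive events
    relative to the sender's and receiver's checkpoints."""
    if sent_before_sender_ckpt and received_before_receiver_ckpt:
        return "past"    # left alone on recovery
    if not sent_before_sender_ckpt and not received_before_receiver_ckpt:
        return "future"  # re-executed on recovery
    if sent_before_sender_ckpt:
        return "late"    # re-received but not resent (in-flight)
    return "early"       # resent but not re-received (inconsistent)
```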

Difficulties in recovery: (I)

[Diagram: late message m1 from Q to P crossing the recovery line]

Late message m1:
- Q sent it before taking its checkpoint; P receives it after taking its checkpoint
- Called an in-flight message in the literature
- On recovery, how does P re-obtain the message?

Difficulties in recovery: (II)

[Diagram: early message m2 from P to Q crossing the recovery line]

Early message m2:
- P sent it after taking its checkpoint; Q receives it before taking its checkpoint
- Called an inconsistent message in the literature

Two problems:
- How do we prevent m2 from being resent?
- How do we ensure that non-deterministic events in P relevant to m2 are replayed identically on recovery?

Approach in the systems community

[Diagram: saved states of P and Q forming a consistent cut]

- Ensure that we never have to worry about inconsistent messages during recovery
- Consistent cut: a set of saved states, one per process, with no inconsistent message
- Therefore the saved states must form a consistent cut
- Ensuring this: the Chandy-Lamport protocol

Chandy-Lamport protocol

- Processes: one process initiates the taking of the global snapshot
- Channels: directed, FIFO, reliable
- Process graph: fixed topology, strongly connected

[Diagram: processes p, q, r connected by channels c1-c4]

Speaker's note: in this model, a distributed system consists of processes and channels. Channels are directed, first-in-first-out, and error-free, so the system can be described by a labeled directed graph in which the vertices represent processes and the edges represent channels; this example system has three processes, p, q, and r.

Algorithm explanation

The protocol is presented in three steps:
1. Coordinating process state-saving: when must each process save its local state so that the saved states form a consistent cut, and how do we avoid inconsistent messages?
2. Saving in-flight messages: since processes are not blocked from receiving messages during the coordination phase, some messages must be saved as well.
3. Termination of the algorithm.

Step 1: Co-ordinating process state-saving

Initiator:
- Saves its local state
- Sends a marker token on each outgoing edge (an out-of-band, non-application message)

All other processes, on receiving a marker on an incoming edge for the first time:
- Save state immediately
- Propagate markers on all outgoing edges
- Resume execution; further markers are simply eaten up (they will be put to use in step 2)

Speaker's note: a system may have several processes capable of initiating a checkpoint, but we assume a single initiator here.
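The marker flooding of step 1 can be simulated in a few lines. This Python sketch (the data structures are illustrative) models only the control flow: each process saves its state on its first marker, forwards markers once, and eats further markers:

```python
from collections import deque

def chandy_lamport_step1(edges, initiator):
    """Simulate step 1 of the protocol on a directed process graph.

    edges: list of (src, dst) channel pairs; initiator: the starting process.
    Returns the order in which processes save their local state.
    """
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    saved_order = [initiator]                 # initiator saves immediately...
    saved = {initiator}
    pending = deque(out.get(initiator, []))   # ...and sends markers downstream
    while pending:
        p = pending.popleft()                 # a marker arrives at process p
        if p not in saved:                    # first marker: save and propagate
            saved.add(p)
            saved_order.append(p)
            pending.extend(out.get(p, []))
        # otherwise the further marker is simply eaten up
    return saved_order
```

Because the graph is strongly connected, a marker eventually reaches every process, so every process saves its state exactly once.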

Example

[Diagram: the initiator p saves its state and sends markers on its outgoing channels; q and r each checkpoint on receiving their first marker and propagate markers in turn. Drawing a cut through the checkpoints gives the candidate global state.]

So, does this algorithm return a consistent cut? Let us prove it.

Theorem: The saved states form a consistent cut

[Diagram: processes p and q, with a message m crossing the cut from p to q]

Proof sketch: assume there exists a message m that makes the cut inconsistent, i.e., from the perspective of the cut, m has been received but not sent. (p may or may not be the initiator.) We show the algorithm is correct by contradicting the existence of such a message.

Proof(cont’) p q x x1 x2 p m x1 q x2 p m x1 q x2 x1 is the 1st marker for process q q x2 p If there exists a message sent by p after p’s checkpoint, we know that there should be a marker X1 sent before message m and after p’s checkpoint. (by the algorithm) The algorithm stated in step 1, does not include any incoming messages into the cut Okay? m (2) x1 is not the 1st marker for process q x1 q x2

Step 2: Recording in-flight messages

Step 1 guarantees that the cut contains no inconsistent messages. But what about late messages: messages sent before the sender's checkpoint and received after the receiver's checkpoint? They are part of the global state and must be saved along with the local states. How can we detect them?

Rule: process p saves all messages on channel c that are received after p takes its own checkpoint but before p receives the marker token on channel c.

Example

[Diagram, two snapshots: (1) p is receiving messages; (2) p has just saved its state. An "x" marks a marker; numbered boxes are application messages, numbered by their order of arrival at p.]

We are in the middle of the checkpointing phase: the initiator has saved its state and sent markers on all outgoing edges, and p is receiving application messages as well as markers. By the time p receives its first marker (from q), process r has sent message 5 before saving its state, and process s has sent messages 4 and 6 before saving its state. These are the messages p will have to save.

Example(cont’) r s x q x p x x x q x x x p r x u s t x p’s chkpnt triggered by a marker from q r s x q x 7 1 2 3 4 5 6 7 8 p x 5 8 x x 3 6 2 q 1 4 x x x p Diagram name??? Process p receives messages in the order of one, two, three, … and so on. On the arrival of third message, which is the first marker to process p, process p saves its state, and propagate markers along all outgoing edges (t and u) And message 4, 5, and 6 arrives, and message 7 and 8 were sent just after process r and s saved their states. And the cut is this… and the outgoing edges are 4, 5, and 6. r x u s t x Next: Algorithm (revised)

Algorithm (revised)

Initiator, when it is time to checkpoint:
- Saves its local state
- Sends marker tokens on all outgoing edges
- Resumes execution, but also records incoming messages on each in-channel c until a marker arrives on channel c
- Once markers have been received on all in-channels, saves the in-flight messages to disk

Every other process, when it sees the first marker on any in-channel:
- Saves its state, then records its in-channels in the same way; further markers are no longer simply eaten up

Step 3 still remains.
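The recording rule can be sketched as a small Python function. Given, for each FIFO in-channel, the sequence of arrivals after the process's own checkpoint, everything before that channel's marker is in-flight and must be saved (the channel names and the MARKER sentinel are illustrative):

```python
MARKER = "MARKER"  # illustrative sentinel for a marker token

def record_in_flight(channel_arrivals):
    """For each FIFO in-channel, collect the application messages that
    arrive after this process's checkpoint but before the channel's marker.

    channel_arrivals: dict mapping channel name -> list of arrivals
    (application messages and possibly a MARKER), in arrival order.
    """
    in_flight = {}
    for chan, arrivals in channel_arrivals.items():
        recorded = []
        for item in arrivals:
            if item == MARKER:
                break          # the marker closes recording on this channel
            recorded.append(item)
        in_flight[chan] = recorded
    return in_flight
```

On the slide's example, the channel from q delivered its marker first (triggering p's checkpoint), so nothing is recorded there; messages 5 (from r) and 4 and 6 (from s) precede their markers and are recorded, while 7 and 8 arrive after the markers and belong to the next epoch.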

Step 3: Termination of the algorithm

How do we know that every process has saved its state and its in-flight messages? This is outside the scope of the Chandy-Lamport paper. Possibilities: a direct channel from each process to the initiator, or a spanning tree.

[Diagram: processes p, q, r reporting back to the initiator]

Comments on the C-L protocol

The protocol relies critically on some assumptions:
- A process can take a checkpoint at any time during execution: get first marker, save state
- FIFO communication
- Fixed communication topology
- Point-to-point communication only: no group communication primitives like broadcast

None of these assumptions are valid for application-level checkpointing of MPI programs.

Application-Level Checkpointing (ALC)

- At special points in the application, the programmer (or an automated tool) places calls to a take_checkpoint() function; checkpoints may be taken only at such spots
- State saving: either the programmer writes the code, or a preprocessor transforms the program into a version that saves its own state during calls to take_checkpoint()
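The effect of the transformation can be sketched in a few lines. This is a hypothetical Python miniature (the real systems transform C/MPI programs; the file name and variable names are illustrative): the loop saves its counter and live state at each take_checkpoint() call, and on restart the program resumes from the saved step instead of from the beginning:

```python
import json
import os

CKPT = "ckpt.json"  # illustrative checkpoint file name

def take_checkpoint(step, x):
    # Save the loop counter and the live state at a designated spot.
    with open(CKPT, "w") as f:
        json.dump({"step": step, "x": x}, f)

def run(n):
    # On restart, resume from the saved checkpoint instead of step 0.
    step, x = 0, 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            saved = json.load(f)
        step, x = saved["step"], saved["x"]
    while step < n:
        x += step                 # one unit of "real" work
        step += 1
        take_checkpoint(step, x)  # a possible checkpoint location
    return x
```

Calling run(3), then "failing" and calling run(5), resumes from step 3 and produces the same result as an uninterrupted run(5).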

Application-level checkpointing difficulties

[Diagram: possible checkpoint locations for processes P and Q]

- System-level checkpoints can be taken anywhere
- Application-level checkpoints can only be taken at certain places in the program
- This may lead to inconsistent messages: recovery lines in ALC may form inconsistent cuts

Our protocol (I)

[Diagram: the initiator sends pleaseCheckpoint to P and Q; the recovery line forms at their next available spots]

- The initiator checkpoints and sends a pleaseCheckpoint message to all other processes
- After receiving this message, a process checkpoints at the next available spot
- It then sends every other process Q the number of messages it sent to Q in the last epoch

Protocol Outline (II)

[Diagram: P and Q recording after their checkpoints]

After checkpointing, each process keeps a record containing:
- The data of messages from the last epoch (late messages)
- Non-deterministic events: in our applications, non-determinism arises from wildcard MPI receives

Protocol Outline (IIIa)

Globally, it is safe to stop recording when:
- All processes have received their late messages, and
- No process can still send an early message

A safe approximation of the second condition: all processes have taken their checkpoints.

Protocol Outline (IIIb)

[Diagram: P and Q send readyToStopRecording to the initiator]

Locally, when a process has received all of its late messages, it sends a readyToStopRecording message to the initiator.

Protocol Outline (IV)

[Diagram: the initiator broadcasts stopRecording; an application message from a process that has stopped recording also stops the receiver]

- When the initiator has received readyToStopRecording from everyone, it sends stopRecording to everyone
- A process stops recording when it receives the stopRecording message from the initiator, OR a message from a process that has itself stopped recording
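The stop-recording rule reduces to a small predicate; an illustrative Python sketch (names are not from the talk):

```python
def should_stop_recording(msg_kind, sender_has_stopped=False):
    """A recording process stops on the initiator's stopRecording message,
    or on an application message from a process that has itself stopped
    recording: such a message already belongs to the next epoch, so it
    must not enter this epoch's record."""
    return msg_kind == "stopRecording" or (
        msg_kind == "application" and sender_has_stopped
    )
```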

Protocol Discussion

[Diagram: an application message from Q, which has already stopped recording, arrives at P before P's stopRecording message]

Why can't a process simply wait for the stopRecording message? Because its record would then depend on a non-deterministic event, invalidating it: the application message may be different, or may not be resent at all, on recovery.

Non-FIFO channels

[Diagram: recovery line between epochs n and n+1 for processes P and Q]

In principle, we can piggyback the sender's epoch number on each message. The receiver then classifies each message as follows:
- Piggybacked epoch < receiver epoch: late
- Piggybacked epoch = receiver epoch: intra-epoch
- Piggybacked epoch > receiver epoch: early

Non-FIFO channels (cont'd)

[Diagram: message #51 crossing the recovery line]

We can reduce the piggybacked epoch number to one bit:
- The epoch color alternates between red and green
- Piggyback the sender's epoch color on each message
- If the piggybacked color differs from the receiver's epoch color:
  - The receiver is logging: late message
  - The receiver is not logging: early message
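The one-bit classification can be written directly from the rules above; an illustrative Python sketch:

```python
def classify_incoming(piggybacked_color, receiver_color, receiver_is_logging):
    """Classify an incoming message using only the sender's 1-bit epoch color.

    Epoch colors alternate red/green, so adjacent epochs always differ;
    whether the receiver is still logging tells us which side the sender is on.
    """
    if piggybacked_color == receiver_color:
        return "intra-epoch"
    # Colors differ: the sender is in an adjacent epoch.
    return "late" if receiver_is_logging else "early"
```

Since epochs at two neighboring processes can differ by at most one (late and early messages span only adjacent epochs), a single alternating bit is enough to distinguish the cases.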

Implementation details

- Out-of-band messages: whenever the application program does a send or receive, our thin layer also checks whether any out-of-band messages have arrived
  - This may cause a problem if a process does not exchange messages for a long time, but this is not a serious concern in practice
- MPI features: non-blocking communication, collective communication
- Saving the internal state of the MPI library
- Writing the global checkpoint out to stable storage

Research issue

- The protocol is sufficiently complex that it is easy to make errors
- The shared-memory protocol is even more subtle, because shared-memory programs have race conditions
- Is there a framework for proving these kinds of protocols correct?