Published by Eleanor Cross. Modified over 9 years ago.
1
Application-Level Checkpoint-Restart (CPR) for MPI Programs
Keshav Pingali
Joint work with Dan Marques, Greg Bronevetsky, Paul Stodghill, Rohit Fernandes
2
The Problem
Old picture of high-performance computing:
- Turn-key big-iron platforms
- Short-running codes
Modern high-performance computing:
- Roll-your-own platforms
  - Large clusters from commodity parts
  - Grid computing
- Long-running codes
  - Protein folding on Blue Gene may take 1 year
- Program runtimes are exceeding MTBF
  - ASCI, Blue Gene, Illinois Rocket Center
3
Software view of hardware failures
Two classes of faults:
- Fail-stop: a failed processor ceases all operation and does not further corrupt system state
- Byzantine: arbitrary failures
  - Nothing to do with adversaries
Our focus: fail-stop faults
4
Solution Space for Fail-stop Faults
Checkpoint-restart (CPR) [Our Choice]
- Save application state periodically
- When a process fails, all processes go back to the last consistent saved state
Message logging
- Processes save outgoing messages
- If a process goes down, it restarts and its neighbors resend it old messages
- Checkpointing is used to trim the message log
- In principle, only failed processes need to be restarted
- Popular in the distributed systems community
- Our experience: not practical for scientific programs because of communication volume
5
Solution Space for CPR
Checkpointing
- Saving process state
  - Application level
  - System level
- Coordination
  - Uncoordinated
  - Coordinated
    - Blocking
    - Non-blocking
  - Quasi-synchronous
6
Saving process state
System-level (SLC)
- Save all bits of the machine
- Program must be restarted on the same platform
Application-level (ALC) [Our Choice]
- Programmer chooses certain points in the program at which to save minimal state
- Programmer or compiler generates the save/restore code
- Amount of saved data can be much less than in system-level CPR (e.g., n-body codes)
- In principle, the program can be restarted on a totally different platform
Practice at national labs
- Demand that the vendor provide SLC
- But use hand-rolled ALC in practice!
7
Coordinating checkpoints
Uncoordinated
- Dependency-tracking, time-coordinated, …
- Suffer from exponential rollback
Coordinated [Our Choice]
- Blocking
  - Global snapshot at a barrier
  - Used in current ALC implementations
- Non-blocking
  - Chandy-Lamport
8
Blocking Coordinated Checkpointing
[Figure: timelines of processes Q and R crossing three global barriers]
- Many programs are bulk-synchronous (BSP model of Valiant)
- At a barrier, all processes can take checkpoints
  - Assumption: no messages are in flight across the barrier
- The parallel program reduces to a sequential state-saving problem
- But many new parallel programs do not have global barriers
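The barrier-based scheme above can be sketched with threads standing in for MPI processes. This is only an illustrative toy under stated assumptions: `bsp_run`, the per-rank `checkpoints` lists (a stand-in for stable storage), and the trivial "computation" are all hypothetical, and the sketch simply assumes that no message is in flight when a rank reaches the barrier.

```python
import threading


def bsp_run(n_procs=4, n_supersteps=3):
    """Run a toy bulk-synchronous program; return each rank's checkpoints."""
    barrier = threading.Barrier(n_procs)
    checkpoints = [[] for _ in range(n_procs)]  # stand-in for stable storage

    def worker(rank):
        local_state = rank
        for step in range(n_supersteps):
            local_state += step   # local computation for this superstep
            barrier.wait()        # assumed: no message is in flight here
            # Every rank is at the same barrier, so each one only has to
            # save its own local state; no extra coordination is needed.
            checkpoints[rank].append((step, local_state))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Each rank writes only to its own list, so the sketch needs no locking beyond the barrier itself.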
9
Non-blocking coordinated checkpointing
- Processes must be coordinated, but …
- Do we really need to block all processes before taking a global checkpoint?
[Figure: photographs of K. Mani Chandy and Leslie Lamport]
So the question is: do we really need to block processes and flush the channels? Chandy and Lamport asked the same question and came up with a new idea, which is called coordinated non-blocking checkpointing.
10
Global View
[Figure: timelines of processes P and Q, divided by recovery lines into epochs …, n, …, with the initiator marked]
- Initiator: root process that decides, once in a while, to take a global checkpoint
- Recovery line: saved state of each process (+ some additional information)
  - Recovery lines do not cross
- Epoch: interval between successive recovery lines
- Program execution is divided into a series of disjoint epochs
- A failure in epoch n requires that all processes roll back to the recovery line that began epoch n
11
Possible Types of Messages
[Figure: timelines of processes P and Q with P's checkpoint and Q's checkpoint; arrows show a past message, a future message, a late message, and an early message]
On recovery:
- A past message will be left alone.
- A future message will be re-executed.
- A late message will be re-received but not resent.
- An early message will be resent but not re-received.
Non-blocking protocols must deal with late and early messages.
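The four message types can be written down as a small classification. This is a minimal sketch: the function name, the two boolean flags, and the `RECOVERY_ACTION` table are illustrative names chosen here, with the recovery actions copied from the list above.

```python
def classify_message(sent_before_ckpt: bool, received_before_ckpt: bool) -> str:
    """Classify a message relative to the sender's and receiver's checkpoints."""
    if sent_before_ckpt and received_before_ckpt:
        return "past"
    if not sent_before_ckpt and not received_before_ckpt:
        return "future"
    if sent_before_ckpt:
        # Sent before the sender's checkpoint, received after the receiver's.
        return "late"
    # Sent after the sender's checkpoint, received before the receiver's.
    return "early"


RECOVERY_ACTION = {
    "past": "left alone",
    "future": "re-executed",
    "late": "re-received but not resent",
    "early": "resent but not re-received",
}
```

For example, `RECOVERY_ACTION[classify_message(True, False)]` gives the action for a late message.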
12
Difficulties in recovery: (I)
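[Figure: message m1 from Q to P; Q sends it before Q's checkpoint (x) and P receives it after P's checkpoint (x)]
- Late message: m1
  - Q sent it before taking its checkpoint
  - P receives it after taking its checkpoint
- Called an in-flight message in the literature
- On recovery, how does P re-obtain the message?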
13
Difficulties in recovery: (II)
[Figure: message m2 from P to Q; P sends it after P's checkpoint (x) and Q receives it before Q's checkpoint (x)]
- Early message: m2
  - P sent it after taking its checkpoint
  - Q receives it before taking its checkpoint
- Called an inconsistent message in the literature
- Two problems:
  - How do we prevent m2 from being re-sent?
  - How do we ensure that non-deterministic events in P relevant to m2 are replayed identically on recovery?
14
Approach in systems community
[Figure: timelines of processes P and Q, each with several checkpoints (x)]
- Ensure we never have to worry about inconsistent messages during recovery
- Consistent cut:
  - Set of saved states, one per process
  - No inconsistent message
- Saved states must form a consistent cut
- Ensuring this: Chandy-Lamport protocol
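The consistent-cut condition can be checked mechanically on a recorded trace. The sketch below is illustrative only: it assumes each process has a local event counter, that `cut[p]` is the counter value at p's checkpoint, and that each message is a hypothetical tuple `(sender, send_time, receiver, recv_time)` in those local times.

```python
from typing import Dict, List, Tuple

# (sender, send_time, receiver, recv_time), in each process's local time
Message = Tuple[str, int, str, int]


def inconsistent_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Messages sent after the sender's checkpoint but received before the receiver's."""
    return [
        (snd, t_send, rcv, t_recv)
        for (snd, t_send, rcv, t_recv) in messages
        if t_send > cut[snd] and t_recv <= cut[rcv]
    ]


def is_consistent_cut(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A cut is consistent when no message crosses it backwards."""
    return not inconsistent_messages(cut, messages)
```

A late (in-flight) message does not make the cut inconsistent; only a message that is received inside the cut but sent outside it does.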
15
Chandy-Lamport protocol
- Processes: one process initiates the taking of a global snapshot
- Channels: directed, FIFO, reliable
- Process graph: fixed topology, strongly connected
[Figure: example process graph with processes p, q, r and channels c1, c2, c3, c4]
Before we talk about the algorithm, let me briefly describe the model of a distributed system that we are going to assume. In today's talk, a distributed system consists of processes and channels. Channels are assumed to be directed, first-in-first-out, and error-free. The system can be described by a labeled, directed graph in which the vertices represent processes and the edges represent channels. For example, this system has three processes, p, q, and r, and p has…
16
Algorithm explanation
- Coordinating process state-saving
  - How do we avoid inconsistent messages?
- Saving in-flight messages
- Termination
This is their paper, and today we are going to state the algorithm (or protocol) in three steps. Step 1 covers when each process has to save its local state so that the saved states form a consistent cut. If we do not prevent processes from receiving messages during the coordination phase, we also need to save some messages to complete the cut; this is addressed in step 2. Step 3 is the termination condition of the algorithm, about which there is not much to say.
Next: Model of Distributed System
17
Step 1: Coordinating process state-saving
Initiator:
- Save its local state
- Send a marker token on each outgoing edge
  - Out-of-band (non-application) message
All other processes:
- On receiving a marker on an incoming edge for the first time:
  - save state immediately
  - propagate markers on all outgoing edges
  - resume execution
- Further markers will be eaten up
Each system has a special process that can initiate checkpointing of the system. There need not be only one: a system can have many initiators, but let me assume for today that each system has exactly one. To start checkpointing, the initiator saves its local state and sends a special message, a "marker", on all outgoing edges. We are going to use the further markers in step 2, but for now let us say that they can simply be eaten up.
Next: Example
18
Example
[Figure: processes p (initiator), q, and r connected by channels c1, c2, c3, c4; markers propagate from p, and each process takes its checkpoint (x) on receiving its first marker]
Draw a cut! So, does this algorithm return a consistent cut? Let us prove this.
Next: Proof
19
Theorem: Saved states form consistent cut
[Figure: message m from p to q crossing the cut]
Let us assume that a message m exists that makes our cut inconsistent. The sender p may or may not be the initiator. What makes a cut inconsistent? From the perspective of the cut, message m has been received but not sent: it is sent after p's checkpoint and received before q's checkpoint. Such messages, incoming into the cut, make it inconsistent. We assume that a message of this kind exists and show the algorithm's correctness by contradicting its existence.
Next: Proof (cont'd)
20
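Proof (cont'd)
[Figure: p takes its checkpoint, then sends marker x1 and later message m to q; q may also receive an earlier marker x2 from another process]
If there is a message m sent by p after p's checkpoint, then by the algorithm p sent a marker x1 on the same channel after its checkpoint and before m. Because channels are FIFO, x1 reaches q before m does.
(1) x1 is the 1st marker for process q: q saves its state on receiving x1, which is before m arrives.
(2) x1 is not the 1st marker for process q: q already saved its state on receiving an earlier marker x2, which again is before m arrives.
In both cases q receives m after its checkpoint, so the algorithm stated in step 1 does not include any incoming messages in the cut.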
21
Step 2: Recording in-flight messages
- Process p saves all messages on channel c that are received after p takes its own checkpoint but before p receives the marker token on channel c
From step 1, we know that the cut returned by the algorithm does not include any message that makes the cut inconsistent (no messages incoming into the cut). But what about the outgoing messages? These are sent before the sender's checkpoint and received after the receiver's checkpoint, so they must be part of the global state together with the local states. How can we detect these messages? Let us look at an example.
22
Example
[Figure, two panels: (1) p is receiving messages; (2) p has just saved its state. Processes q, r, and s send to p, and p sends to t and u; x marks a marker, and messages are numbered 1 to 8]
An x indicates that the message is a marker, and a small yellow box indicates a normal message. The numbers give the order in which the messages arrive at process p. We are now in the middle of the checkpointing phase: the initiator has already saved its state and sent markers on all outgoing edges, and process p is receiving normal messages as well as markers. When the first marker arrives from q, the status of each channel is as shown. The figure says that process r sent message 5 before it saved its state, and process s sent messages 4 and 6 before it saved its state. Now you may notice which messages to save.
23
Example (cont'd)
[Figure: p's checkpoint triggered by a marker from q; timeline of messages 1 to 8 arriving at p from q, r, and s, with p's markers going out to t and u]
Process p receives messages in the order one, two, three, and so on. On the arrival of the third message, which is the first marker to reach p, process p saves its state and propagates markers along all outgoing edges (to t and u). Then messages 4, 5, and 6 arrive; messages 7 and 8 were sent just after processes r and s saved their states. The cut is as shown, and the messages crossing it are 4, 5, and 6.
Next: Algorithm (revised)
24
Algorithm (revised)
Initiator: when it is time to checkpoint
- Save its local state
- Send marker tokens on all outgoing edges
- Resume execution, but also record incoming messages on each in-channel c until a marker arrives on channel c
- Once markers are received on all in-channels, save in-flight messages on disk
Every other process: when it sees the first marker on any in-channel
- Save state
Do not just eat up the further markers. But we still have step 3 left.
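The revised algorithm can be exercised in a small single-threaded simulation. This is a sketch under assumptions not found in the slides: three fully connected processes exchange "money" over FIFO channels, p is the initiator, and a non-initiator is assumed to propagate markers and record its in-channels just as the initiator does. All names (`Process`, `simulate`, `snapshot_total`) are hypothetical. Because money is neither created nor destroyed, a correct snapshot (saved states plus recorded in-flight messages) must add up to the initial total.

```python
import random
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance   # application state
        self.saved = None        # checkpointed copy of the application state
        self.recording = {}      # in-channel source -> messages recorded so far
        self.in_flight = {}      # in-channel source -> final recorded messages


def simulate(seed=0, steps=300, snapshot_at=20):
    rng = random.Random(seed)
    names = ["p", "q", "r"]
    procs = {n: Process(n, 100) for n in names}
    chans = {(a, b): deque() for a in names for b in names if a != b}

    def checkpoint(proc):
        proc.saved = proc.balance
        for (src, dst), chan in chans.items():
            if src == proc.name:
                chan.append(MARKER)        # marker on every outgoing edge
            if dst == proc.name:
                proc.recording[src] = []   # start recording every in-channel

    def deliver(src, dst):
        msg = chans[(src, dst)].popleft()
        proc = procs[dst]
        if msg == MARKER:
            if proc.saved is None:
                checkpoint(proc)           # first marker: save state now
            # A marker on this channel ends recording for this channel.
            proc.in_flight[src] = proc.recording.pop(src)
        else:
            proc.balance += msg
            if src in proc.recording:
                proc.recording[src].append(msg)   # an in-flight message

    for step in range(steps):
        if step == snapshot_at:
            checkpoint(procs["p"])         # p acts as the initiator
        src, dst = rng.sample(names, 2)
        if rng.random() < 0.5 and procs[src].balance > 0:
            amount = rng.randint(1, min(10, procs[src].balance))
            procs[src].balance -= amount
            chans[(src, dst)].append(amount)
        elif chans[(src, dst)]:
            deliver(src, dst)
    while any(chans.values()):             # drain all channels at the end
        for (src, dst), chan in chans.items():
            if chan:
                deliver(src, dst)
    return procs


def snapshot_total(procs):
    saved = sum(p.saved for p in procs.values())
    in_flight = sum(sum(m) for p in procs.values() for m in p.in_flight.values())
    return saved + in_flight
```

With the defaults, `snapshot_total(simulate())` should equal the initial 300 even though the processes never stop exchanging messages while the snapshot is taken.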
25
Step 3: Termination of algorithm
- Did every process save its state and its in-flight messages?
  - Outside the scope of the C-L paper
[Figure: processes p, q, r and the initiator]
- Direct channel to the initiator?
- Spanning tree?
Next: References
26
Comments on C-L protocol
Relies critically on some assumptions:
- A process can take a checkpoint at any time during execution
  - get first marker, then save state
- FIFO communication
- Fixed communication topology
- Point-to-point communication: no group communication primitives like bcast
None of these assumptions holds for application-level checkpointing of MPI programs
27
Application-Level Checkpointing (ALC)
- At special points in the application, the programmer (or an automated tool) places calls to a take_checkpoint() function.
- Checkpoints may be taken at such spots.
- State-saving:
  - Programmer writes code
  - Preprocessor transforms the program into a version that saves its own state during calls to take_checkpoint().
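A hand-written version of this idea, for a single sequential process, might look as follows. This is only a sketch: the slides name `take_checkpoint()` but give no signature, so the arguments, `restore_checkpoint`, the pickle file format, and the toy loop in `run` are all assumptions made here for illustration.

```python
import os
import pickle
import tempfile


def take_checkpoint(path, state):
    """Write the minimal state needed to resume, replacing the old file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def restore_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every 10 steps; optionally fail."""
    state = restore_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % 10 == 0:        # a programmer-chosen checkpoint spot
            take_checkpoint(path, state)
    return state["total"]


def demo():
    """Fail once at step 25, then restart from the last checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            run(path, n_steps=50, fail_at=25)
        except RuntimeError:
            pass
        resumed_from = restore_checkpoint(path)["step"]
        return resumed_from, run(path, n_steps=50)
```

Only two integers are saved here, which illustrates why application-level state can be much smaller than a full system-level image.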
28
Application-level checkpointing difficulties
- System-level checkpoints can be taken anywhere
- Application-level checkpoints can only be taken at certain places in the program
  - This may lead to inconsistent messages
  - Recovery lines in ALC may form inconsistent cuts
[Figure: timelines of processes P and Q showing P's checkpoint and the possible checkpoint locations]
29
Our protocol (I)
[Figure: the initiator sends pleaseCheckpoint to processes P and Q; their checkpoints form the recovery line]
- Initiator checkpoints and sends a pleaseCheckpoint message to all others
- After receiving this message, a process checkpoints at the next available spot
  - It then sends every other process Q the number of messages it sent to Q in the last epoch
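One plausible use of these counts is for the receiver to work out how many late messages are still outstanding. The slides do not spell this step out, so the sketch below is an assumption: `reported_sent` maps each sender to the count it reported for the last epoch, and `received_before_checkpoint` maps each sender to the number of last-epoch messages the receiver had already received when it checkpointed. Both names are hypothetical.

```python
def outstanding_late_messages(reported_sent, received_before_checkpoint):
    """Per sender, how many last-epoch messages have not yet been received."""
    outstanding = {}
    for sender, sent in reported_sent.items():
        received = received_before_checkpoint.get(sender, 0)
        if received > sent:
            raise ValueError(f"received more than {sender} reports having sent")
        outstanding[sender] = sent - received
    return outstanding


def all_late_messages_received(outstanding):
    """True once no sender has late messages still on the way."""
    return all(count == 0 for count in outstanding.values())
```

For example, if P reports 5 sends and 3 had arrived before the checkpoint, the receiver expects 2 late messages from P.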
30
Protocol Outline (II)
[Figure: after pleaseCheckpoint from the initiator, processes P and Q checkpoint and start recording]
- After checkpointing, each process keeps a record containing:
  - data of messages from the last epoch (late messages)
  - non-deterministic events
    - In our applications, non-determinism arises from wild-card MPI receives
31
Protocol Outline (IIIa)
[Figure: timelines of the initiator, process P, and process Q]
- Globally, ready to stop recording when
  - all processes have received their late messages
  - no process can send an early message
    - safe approximation: all processes have taken their checkpoints
32
Protocol Outline (IIIb)
[Figure: processes P and Q send readyToStopRecording to the initiator]
- Locally, when a process has received all its late messages, it sends a readyToStopRecording message to the initiator
33
Protocol Outline (IV)
[Figure: the initiator sends stopRecording to processes P and Q; an application message travels between P and Q]
- When the initiator receives readyToStopRecording from everyone, it sends stopRecording to everyone
- A process stops recording when it receives
  - a stopRecording message from the initiator, OR
  - a message from a process that has itself stopped recording
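The stop-recording rule can be sketched as two small state machines. The class and method names are hypothetical, and the sketch assumes that a receiver can tell whether the sender of an application message has already stopped recording (for example, through a piggybacked flag); the slides do not say how this is detected.

```python
class Initiator:
    """Collects readyToStopRecording messages from all processes."""

    def __init__(self, all_ranks):
        self.waiting_for = set(all_ranks)

    def on_ready(self, rank):
        """Return True once stopRecording should be sent to everyone."""
        self.waiting_for.discard(rank)
        return not self.waiting_for


class Recorder:
    """Per-process recording state."""

    def __init__(self):
        self.recording = False
        self.record = []

    def on_checkpoint(self):
        self.recording = True

    def on_stop_recording(self):
        self.recording = False

    def on_app_message(self, payload, sender_stopped):
        # A message from a process that has itself stopped recording
        # ends recording here too, before the message is logged.
        if sender_stopped:
            self.recording = False
        if self.recording:
            self.record.append(payload)
        return payload
```

In this sketch the second stopping condition takes effect before the triggering message is logged, so the record never contains a message from a process that had already stopped recording.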
34
Protocol Discussion
[Figure: process P has received stopRecording from the initiator and sends an application message to process Q, which has not yet received stopRecording]
- Why can't we just wait to receive the stopRecording message?
- Our record would depend on a non-deterministic event, invalidating it
  - The application message may be different or may not be resent on recovery
35
Non-FIFO channels
[Figure: timelines of processes P and Q crossing a recovery line from epoch n to epoch n+1]
- In principle, we can piggyback the epoch number of the sender on each message
- Receiver classifies the message as follows:
  - Piggybacked epoch < receiver epoch: late
  - Piggybacked epoch = receiver epoch: intra-epoch
  - Piggybacked epoch > receiver epoch: early
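The receiver-side rule translates directly into code. A minimal sketch; the function name is chosen here for illustration.

```python
def classify_by_epoch(piggybacked_epoch: int, receiver_epoch: int) -> str:
    """Classify a message from the sender's epoch number carried on it."""
    if piggybacked_epoch < receiver_epoch:
        return "late"
    if piggybacked_epoch > receiver_epoch:
        return "early"
    return "intra-epoch"
```

Because the classification uses only the piggybacked number, it does not depend on messages arriving in FIFO order.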
36
Non-FIFO channels
[Figure: message #51 between processes P and Q near the recovery line separating epoch n from epoch n+1]
We can reduce this to one bit:
- Epoch color alternates between red and green
- Piggyback the sender's epoch color on each message
- If the piggybacked color is not equal to the receiver's epoch color:
  - Receiver is logging: late message
  - Receiver is not logging: early message
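The one-bit variant can be sketched the same way. Which color goes with even epochs is an arbitrary assumption made here, and both function names are hypothetical.

```python
def epoch_color(epoch: int) -> str:
    """Alternate colors between successive epochs (even = red is an assumption)."""
    return "red" if epoch % 2 == 0 else "green"


def classify_by_color(message_color: str, receiver_epoch: int,
                      receiver_is_logging: bool) -> str:
    """Classify a message from one piggybacked bit plus the receiver's logging state."""
    if message_color == epoch_color(receiver_epoch):
        return "intra-epoch"
    # Colors differ: the logging state tells the two remaining cases apart.
    return "late" if receiver_is_logging else "early"
```

A single bit suffices only when the sender and receiver epochs differ by at most one, which is the situation the slide describes.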
37
Implementation details
- Out-of-band messages
  - Whenever the application program does a send or receive, our thin layer also checks whether any out-of-band messages have arrived
  - May cause a problem if a process does not exchange messages for a long time, but this is not a serious concern in practice
- MPI features
  - Non-blocking communication
  - Collective communication
- Save internal state of the MPI library
- Write global checkpoint out to stable storage
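The polling behavior of the thin layer can be illustrated with an in-memory stand-in. This is not MPI and not the authors' layer: `ThinLayer`, its queues, and its `send`/`recv` methods are hypothetical, and the sketch only shows that control messages are noticed when the application communicates.

```python
from collections import deque


class ThinLayer:
    """Toy wrapper that checks for out-of-band messages on every send/recv."""

    def __init__(self):
        self.control = deque()   # out-of-band (protocol) messages
        self.inbox = deque()     # application messages waiting to be received
        self.sent = []           # application messages handed to the network
        self.handled = []        # control messages the layer has processed

    def _poll_control(self):
        while self.control:
            self.handled.append(self.control.popleft())

    def send(self, dest, payload):
        self._poll_control()     # piggyback the check on the application's send
        self.sent.append((dest, payload))

    def recv(self):
        self._poll_control()     # and on the application's receive
        return self.inbox.popleft() if self.inbox else None
```

A control message placed in `control` stays unhandled until the next `send` or `recv`, which is the delay the slide mentions for processes that do not communicate for a long time.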
38
Research issue
- The protocol is sufficiently complex that it is easy to make errors
- The shared-memory protocol is even more subtle because shared-memory programs have race conditions
- Is there a framework for proving these kinds of protocols correct?