
1 Parallel Checkpointing - Sathish Vadhiyar

2 Introduction
• Checkpointing: storing an application’s state so that execution can be resumed from that point later.

3 Motivation
• The largest parallel systems in the world have 130,000 to 213,000 parallel processing elements (http://www.top500.org).
• Large-scale applications that can use these many processes are continuously being built.
• These applications are also long-running.
• As the number of processing elements increases, the Mean Time Between Failures (MTBF) decreases.
• So, how can long-running applications execute in this highly failure-prone environment? Checkpointing and fault tolerance.

4 Uses of Checkpointing
• Fault tolerance & rollback recovery
• Process migration & job swapping
• Debugging

5 Types of Checkpointing
• OS-level: process preemption, e.g. Berkeley’s checkpointing Linux
• User-level transparent: checkpointing performed by the program itself; transparency is achieved by linking the application with a special library, e.g. Plank’s libckpt
• User-level non-transparent: users insert checkpointing calls in their programs, e.g. SRS (Vadhiyar)

6 Rules for Consistent Checkpointing
• In a parallel program, each process has events and a local state; an event changes the local state of a process.
• Global state – an external view of the parallel application (e.g. lines S, S’, S’’) – used for checkpointing and restarting. It consists of the local states plus the messages in transit.

7 Rules for Consistent Checkpointing
• Types of global states:
• Consistent global state – one from which the program can be restarted correctly
• Inconsistent – otherwise

8 Rules for Consistent Checkpointing
• Chandy & Lamport – two rules for consistent global states:
1. If a receive event is part of the local state of a process, the corresponding send event must be part of the local state of the sender.
2. If a send event is part of the local state of a process and the matching receive is not part of the local state of the receiver, then the message must be part of the state of the network.
• S’’ violates rule 1, and hence cannot lead to a consistent global state.

9 Independent Checkpointing
• Processes checkpoint periodically without coordination.
• This can lead to the domino effect – each rollback of a process to a previous checkpoint forces another process to roll back even further.

10 Checkpointing Methods
1. Coordinated checkpointing
   • All processes coordinate to take a consistent checkpoint (e.g. using a barrier)
   • Always leads to consistent checkpoints
2. Checkpointing with message logging
   • Processes take independent checkpoints, and each process logs the messages it receives after its last checkpoint
   • Recovery thus uses the previous checkpoint plus the logged messages

11 Message Logging

12
• In message-logging protocols, each process stores the contents and receive sequence numbers of all messages it has sent or received in a message log.
• To trim its message log, a process can also checkpoint periodically.
• Once a process checkpoints, all messages sent or received before that checkpoint can be removed from the log.

13 Message Logging for MPI (e.g. MPICH-V, MPICH-V2)
• MPICH-V – uncoordinated, pessimistic, receiver-based message logging
• Its architecture consists of computational nodes, a dispatcher, channel memories, and checkpoint servers
• MPICH-V2 – uncoordinated, pessimistic, sender-based message logging
• Uses message ids and logical clocks
• Senders log message payloads and resend messages on recovery

14 MPICH-V – receiver-based
• Each process has an associated channel memory (CM) called its home CM.
• When P1 wants to send a message to P2, it contacts P2’s home CM and sends the message to it.
• P2 retrieves the message from its home CM.
• During a checkpoint, the process state is stored on a checkpoint server (CS).
• During restart, the checkpoint is retrieved from the CS and the messages from the CM.
• A dispatcher manages all these services.

15 MPICH-V

16 MPICH-V2 – sender-based
• Sender-based logging avoids channel memories.
• When q sends m to p, q logs m; m is associated with an id.
• When p receives m, it stores (id, l), where l is the logical clock.
• When p crashes and restarts from a checkpointed state, it retrieves the (id, l) sets from stable storage and asks the other processes to resend the messages in the sets.

17 Coordinated Checkpointing

18 Coordinated Checkpointing: SRS
• The user inserts checkpointing calls specifying the data to be checkpointed.
• Advantage: small amount of checkpointed data. Disadvantage: burden on the user.
• SRS (Stop-ReStart) – a user-level checkpointing library that allows reconfiguration of applications.
• Reconfiguration of the number of processors and/or the data distribution.

19 INTERNALS
[Diagram: the MPI application interacts with the SRS Runtime Support System (RSS) through Start, Poll, STOP, and ReStart operations; on restart it reads checkpointed data with possible redistribution.]

20 SRS Example – Original Code

  MPI_Init(&argc, &argv);
  local_size = global_size/size;
  if(rank == 0){
    for(i=0; i<global_size; i++){
      global_A[i] = i;
    }
  }
  MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
  iter_start = 0;
  for(i=iter_start; i<global_size; i++){
    proc_number = i/local_size;
    local_index = i%local_size;
    if(rank == proc_number){
      local_A[local_index] += 10;
    }
  }
  MPI_Finalize();

21 SRS Example – Modified Code

  MPI_Init(&argc, &argv);
  SRS_Init();
  local_size = global_size/size;
  restart_value = SRS_Restart_Value();
  if(restart_value == 0){
    if(rank == 0){
      for(i=0; i<global_size; i++){
        global_A[i] = i;
      }
    }
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
  }
  else{
    SRS_Read("A", local_A, BLOCK, NULL);
    SRS_Read("iterator", &iter_start, SAME, NULL);
  }
  SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
  SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);

22 SRS Example – Modified Code (Contd.)

  for(i=iter_start; i<global_size; i++){
    stop_value = SRS_Check_Stop();
    if(stop_value == 1){
      MPI_Finalize();
      exit(0);
    }
    proc_number = i/local_size;
    local_index = i%local_size;
    if(rank == proc_number){
      local_A[local_index] += 10;
    }
  }
  SRS_Finish();
  MPI_Finalize();

23 Challenges – a Precompiler for SRS
• A precompiler that reads plain source code and converts it into SRS-instrumented source code.
• It needs to automatically analyze which variables must be checkpointed.
• It needs to automatically determine data distributions.
• It should locate the phases between which checkpoints should be taken – automatically determining the phases of an application. (From Schulz et al., SC 2004.)

24 Challenges – Determining Checkpointing Intervals
• How frequently should checkpoints be taken?
• Too frequent – high checkpointing overhead.
• Too sporadic – high recovery cost.
• One approach models the application with a Markov chain consisting of running, recovery, and failure states.
• Failure distributions are used to determine the weights of the transitions between states.

25 Challenges – determining checkpointing intervals

26 Checkpointing Performance
• Reducing the times for checkpointing, recovery, etc.
• Checkpoint latency hiding:
• Checkpoint buffering – during a checkpoint, copy the data to a local buffer, then store the buffer to disk in parallel with application progress.
• Copy-on-write buffering – only modified pages are copied to a buffer and stored. Can be implemented using fork() (“forked checkpointing”).

27 Checkpointing Performance
• Reducing the checkpoint size:
• Memory exclusion – no need to store dead or read-only variables.
• Incremental checkpointing – store only the data modified since the previous checkpoint.
• Checkpoint compression.

28 References
• James S. Plank. “An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance”. University of Tennessee Technical Report CS-97-372, July 1997.
• James Plank and Thomason. “Processor Allocation and Checkpointing Interval Selection in Cluster Computing Systems”. JPDC, 2001.

29 References
• George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, and Anton Selikhov. “MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes”. SuperComputing 2002, Baltimore, USA, November 2002.
• Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, and Frédéric Magniette. “MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on the Pessimistic Sender Based Message Logging”. SuperComputing 2003, Phoenix, USA, November 2003.

30 References
• Vadhiyar, S. and Dongarra, J. “SRS – A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems”. Parallel Processing Letters, Vol. 13, No. 2, pp. 291–312, June 2003.

31 References
• Schulz et al. “Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs”. SC 2004.
