Parallel Checkpointing - Sathish Vadhiyar
Introduction Checkpointing: storing an application's state in order to resume it later.
Motivation The largest parallel systems in the world have from 130,000 to 213,000 parallel processing elements. Large-scale applications that can use these large numbers of processes are continuously being built, and these applications are also long-running. As the number of processing elements increases, the Mean Time Between Failures (MTBF) decreases. So, how can long-running applications execute in this highly failure-prone environment? Checkpointing and fault tolerance.
Uses of Checkpointing Fault tolerance and rollback recovery; process migration and job swapping; debugging.
Types of Checkpointing OS-level: process preemption, e.g. Berkeley's checkpointing for Linux (BLCR). User-level transparent: checkpointing performed by the program itself, with transparency achieved by linking the application with a special library, e.g. Plank's libckpt. User-level non-transparent: users insert checkpointing calls in their programs, e.g. SRS (Vadhiyar).
Rules for Consistent Checkpointing In a parallel program, each process has events and a local state; an event changes the local state of a process. A global state is an external view of the parallel application (e.g. the cuts S, S', S'' in the accompanying figure), used for checkpointing and restarting; it consists of the local states and the messages in transit.
Rules for Consistent Checkpointing Types of global states: a consistent global state is one from which the program can be restarted correctly; otherwise it is inconsistent.
Rules for Consistent Checkpointing Chandy & Lamport give 2 rules for consistent global states: 1. if a receive event is part of the local state of a process, the corresponding send event must be part of the local state of the sender; 2. if a send event is part of the local state of a process and the matching receive is not part of the local state of the receiver, then the message must be part of the state of the network. The cut S'' violates rule 1 and hence cannot lead to a consistent global state.
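As an illustration of these rules (not part of the original slides), here is a minimal C sketch that checks a set of recorded local states for consistency. It assumes every message carries a globally unique integer id and that each local state records the ids sent and received before the checkpoint; the type and function names are illustrative.

    #include <stdbool.h>

    #define MAX_MSGS 1024

    typedef struct {
        int sent[MAX_MSGS];     int n_sent;  /* ids of messages sent before the checkpoint */
        int received[MAX_MSGS]; int n_recv;  /* ids of messages received before the checkpoint */
    } LocalState;

    static bool was_sent_by_someone(const LocalState *states, int nprocs, int id)
    {
        for (int q = 0; q < nprocs; q++)
            for (int k = 0; k < states[q].n_sent; k++)
                if (states[q].sent[k] == id)
                    return true;
        return false;
    }

    /* Rule 1: every receive recorded in some local state must have its send
     * recorded in the sender's local state.  (Rule 2: messages that appear in
     * a sender's log but in no receiver's log are "in transit" and must be
     * saved as part of the network/channel state.) */
    bool is_consistent(const LocalState *states, int nprocs)
    {
        for (int p = 0; p < nprocs; p++)
            for (int m = 0; m < states[p].n_recv; m++)
                if (!was_sent_by_someone(states, nprocs, states[p].received[m]))
                    return false;   /* violates rule 1, like the cut S'' */
        return true;
    }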
Independent checkpointing Processors checkpoint periodically without coordination. This can lead to the domino effect: each rollback of a processor to a previous checkpoint forces another processor to roll back even further.
Checkpointing Methods 1. Coordinated checkpointing: all processes coordinate to take a consistent checkpoint (e.g. using a barrier; see the sketch below), which always leads to consistent checkpoints. 2. Checkpointing with message logging: processes take independent checkpoints, and a process logs the messages it receives after its last checkpoint; recovery thus uses the previous checkpoint plus the logged messages.
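A minimal MPI sketch of method 1, assuming the application has drained all outstanding communication before the call, so that the barrier defines a consistent cut. The per-rank file naming is illustrative, and the variable names follow the SRS example later in these slides.

    #include <mpi.h>
    #include <stdio.h>

    void coordinated_checkpoint(const int *local_A, int local_size, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Barrier(comm);              /* coordinate: all ranks reach the cut together */

        char path[64];
        snprintf(path, sizeof(path), "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            fwrite(local_A, sizeof(int), (size_t)local_size, f);
            fclose(f);
        }

        MPI_Barrier(comm);              /* everyone has finished writing */
    }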
Message Logging
In message-logging protocols, each process stores the contents and receive sequence number of all messages it has sent or received in a message log. To trim the message log, a process can also checkpoint periodically: once a process checkpoints, all messages sent/received before this checkpoint can be removed from the log (see the sketch below).
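A minimal sketch of such a receive log with trim-on-checkpoint; the structure and function names are illustrative and not taken from any particular library.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        int    seq;        /* receive sequence number */
        size_t len;
        char  *payload;    /* copy of the message contents */
    } LoggedMsg;

    typedef struct {
        LoggedMsg *msgs;
        int        count, capacity;
    } MessageLog;

    /* Append a received message (sequence number + contents) to the log. */
    void log_message(MessageLog *log, int seq, const void *buf, size_t len)
    {
        if (log->count == log->capacity) {
            log->capacity = log->capacity ? 2 * log->capacity : 16;
            log->msgs = realloc(log->msgs, log->capacity * sizeof(LoggedMsg));
        }
        LoggedMsg *m = &log->msgs[log->count++];
        m->seq = seq;
        m->len = len;
        m->payload = malloc(len);
        memcpy(m->payload, buf, len);
    }

    /* After a checkpoint, messages delivered before it are no longer needed
     * for replay, so the log can be trimmed. */
    void trim_log_after_checkpoint(MessageLog *log)
    {
        for (int i = 0; i < log->count; i++)
            free(log->msgs[i].payload);
        log->count = 0;
    }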
Message Logging for MPI (e.g. MPICH-V, MPICH-V2) MPICH-V: uncoordinated, pessimistic receiver-based message logging; the architecture consists of computational nodes, a dispatcher, channel memories, and checkpoint servers. MPICH-V2: uncoordinated, pessimistic sender-based message logging; it uses message ids and logical clocks, and senders log the message payload and resend messages during recovery.
MPICH-V – receiver-based Each process has an associated channel memory (CM) called its home CM. When P1 wants to send a message to P2, it contacts P2's home CM and sends the message to it; P2 then retrieves the message from its home CM. During a checkpoint, the process state is stored on a checkpoint server (CS). During restart, the checkpoint is retrieved from the CS and the messages from the CM. The dispatcher manages all services.
MPICH-V
MPICH-V2 – sender-based Sender-based logging avoids channel memories. When q sends m to p, q logs m, and m is associated with an id. When p receives m, it stores (id, l), where l is its logical clock. When p crashes and restarts from a checkpointed state, it retrieves the (id, l) sets from stable storage and asks the other processes to resend the corresponding messages.
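A generic sketch of the sender-side payload log and the receiver-side (id, logical clock) records described above; this only illustrates the idea and is not MPICH-V2's actual data structures or API.

    #include <stdio.h>

    #define LOG_SIZE 1024

    typedef struct { int id; int dest; char payload[256]; } SentRecord;   /* kept by the sender q */
    typedef struct { int id; int lclock; } DeliveryRecord;                /* kept durably by the receiver p */

    SentRecord     sender_log[LOG_SIZE]; int n_sent;
    DeliveryRecord recv_log[LOG_SIZE];   int n_recv;
    int logical_clock;

    /* q: log the payload (keyed by id) before sending, so it can be resent later. */
    void log_send(int id, int dest, const char *payload)
    {
        SentRecord *r = &sender_log[n_sent++];
        r->id = id;
        r->dest = dest;
        snprintf(r->payload, sizeof(r->payload), "%s", payload);
    }

    /* p: on delivery, record (id, logical clock) so replay order is known after a restart. */
    void log_delivery(int id)
    {
        recv_log[n_recv].id = id;
        recv_log[n_recv].lclock = ++logical_clock;
        n_recv++;
    }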
Coordinated Checkpointing
Coordinated Checkpointing: SRS The user inserts checkpointing calls specifying the data to be checkpointed. Advantage: a small amount of checkpointed data. Disadvantage: user burden. SRS (Stop-ReStart) is a user-level checkpointing library that allows reconfiguration of applications: reconfiguration of the number of processors and/or the data distribution.
INTERNALS (architecture figure) An MPI application linked with SRS interacts with the Runtime Support System (RSS): start, poll, STOP, restart, and read with possible redistribution.
SRS Example – Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;
    if(rank == 0){
        for(i=0; i<global_size; i++){
            global_A[i] = i;
        }
    }
    MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT,
                0, comm);
    iter_start = 0;
    for(i=iter_start; i<global_size; i++){
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    MPI_Finalize();
SRS Example – Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    if(restart_value == 0){            /* fresh start */
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A, local_size, MPI_INT,
                    0, comm);
        iter_start = 0;
    }
    else{                              /* restart: read checkpointed data, possibly redistributed */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
SRS Example – Modified Code (Contd..)

    for(i=iter_start; i<global_size; i++){
        stop_value = SRS_Check_Stop();   /* returns 1 if a stop has been requested via the RSS */
        if(stop_value == 1){
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
    }
    SRS_Finish();
    MPI_Finalize();
Challenges – a precompiler for SRS A precompiler that reads plain source code and converts it into SRS-instrumented source code. It needs to automatically analyze which variables must be checkpointed, automatically determine data distributions, and locate the phases between which checkpoints should be taken, i.e. automatically determine the phases of an application. (From Schulz et al., SC 2004.)
Challenges – determining checkpointing intervals How frequently should checkpoints be taken? Too frequent leads to high checkpointing overhead; too sporadic leads to high recovery cost. One approach is to model the application with a Markov model consisting of running, recovery and failure states, and to use failure distributions to determine the weights of the transitions between states.
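As a back-of-the-envelope illustration only (Young's first-order approximation, a simpler model than the Markov-chain approach above), the interval that balances checkpoint cost against expected recomputation is roughly sqrt(2 * C * MTBF), where C is the time to write one checkpoint. The numbers below are hypothetical.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double mtbf = 24.0 * 3600.0;   /* assumed system MTBF: 24 hours, in seconds */
        double c    = 600.0;           /* assumed time to write one checkpoint: 10 minutes */

        /* Young's approximation: checkpoint roughly every sqrt(2*C*MTBF) seconds
         * (about 10,180 s, i.e. ~2.8 hours, for these assumed values). */
        double interval = sqrt(2.0 * c * mtbf);
        printf("checkpoint every %.0f seconds\n", interval);
        return 0;
    }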
Checkpointing Performance Reducing the times for checkpointing, recovery, etc. Checkpoint latency hiding: checkpoint buffering – during a checkpoint, copy the data to a local buffer and store the buffer to disk in parallel with application progress; copy-on-write buffering – only modified pages are copied to a buffer and stored, which can be implemented using fork(), i.e. forked checkpointing (see the sketch below).
Checkpointing Performance Reducing checkpoint size: memory exclusion – no need to store dead and read-only variables; incremental checkpointing – store only the part of the data that has been modified since the previous checkpoint (see the sketch below); checkpoint compression.
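A minimal sketch of forked (copy-on-write) checkpointing, assuming the application state to be saved lives in one contiguous buffer; the function name and file path handling are illustrative.

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    void forked_checkpoint(const void *state, size_t size, const char *path)
    {
        pid_t pid = fork();               /* child gets a copy-on-write image of memory */
        if (pid == 0) {
            /* Child: write the (frozen) state to disk, then exit. */
            FILE *f = fopen(path, "wb");
            if (f != NULL) {
                fwrite(state, 1, size, f);
                fclose(f);
            }
            _exit(0);
        } else if (pid > 0) {
            /* Parent: continue computing immediately; the kernel copies pages
             * only when the parent modifies them.  Reap the child later
             * (e.g. with waitpid) to avoid zombie processes. */
        } else {
            perror("fork");
        }
    }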
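A minimal sketch of user-level incremental checkpointing via a shadow copy: only blocks that changed since the previous checkpoint are written, together with their offsets. The block size and names are illustrative.

    #include <stdio.h>
    #include <string.h>

    #define BLOCK 4096

    void incremental_checkpoint(const char *data, char *shadow, size_t size, FILE *out)
    {
        for (size_t off = 0; off < size; off += BLOCK) {
            size_t len = (size - off < BLOCK) ? (size - off) : BLOCK;
            if (memcmp(data + off, shadow + off, len) != 0) {
                /* Record the offset, length and new contents of the dirty block. */
                fwrite(&off, sizeof(off), 1, out);
                fwrite(&len, sizeof(len), 1, out);
                fwrite(data + off, 1, len, out);
                memcpy(shadow + off, data + off, len);   /* update the shadow copy */
            }
        }
    }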
References James S. Plank, "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance", University of Tennessee Technical Report, July 1997. James S. Plank and Michael G. Thomason, "Processor Allocation and Checkpointing Interval Selection in Cluster Computing Systems", Journal of Parallel and Distributed Computing (JPDC), 2001.
References "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes", George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, Anton Selikhov, SuperComputing 2002, Baltimore, USA, November 2002. "MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on the Pessimistic Sender Based Message Logging", Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette, SuperComputing 2003, Phoenix, USA, November 2003.
References Vadhiyar, S. and Dongarra, J., "SRS – A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems", Parallel Processing Letters, Vol. 13, No. 2, June 2003.
References Schulz et al., "Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs", SC 2004.