Checkpoint/restart in Slurm: current status and new developments


M. Rodríguez-Pascual, J.A. Moríñigo, R. Mayo-García

Outline
- Checkpoint/restart
- Integration in Slurm
- Our roadmap

Checkpoint/Restart: System vs. Application level

| Characteristic                        | System level                                 | Application level         |
| Triggered by                          | User/system                                  | Application               |
| Basic idea                            | Full memory dump                             | Save relevant information |
| When to checkpoint?                   | Any time                                     | Pre-fixed places          |
| Requires modification of application  | No (some technologies require recompilation) | Yes                       |
| Resulting file size                   | Big                                          | Small                     |
| Overhead in execution time            | ~1-2%                                        | Negligible                |

This work is about system-level checkpointing; application-level checkpointing is transparent to Slurm.

BLCR vs. CRIU vs. DMTCP (disclaimer: this is our personal experience)
- Integration with Slurm: only BLCR has it (CRIU and DMTCP do not, prior to this work)
- Requires application modification: BLCR requires recompiling the application; CRIU and DMTCP do not
- MPI applications: supported with BLCR and DMTCP
- Can checkpoint a running application without preloading: YES for CRIU and BLCR (* the library must be loaded)
- Overhead besides checkpoint: ranges from none to a startup time of a few seconds plus a ~1-2% CPU overhead, depending on the tool
- Can checkpoint containers (Docker & LXD)? YES with CRIU (* we have only tested Docker, not LXD)
- Infiniband support: N/A or NO (* we have not tried it; taken from the documentation)

Random thoughts
- DMTCP & CRIU do not require recompilation: useful for legacy & proprietary applications
- CRIU & BLCR do not require starting anything before the application
- DMTCP & BLCR can checkpoint MPI applications
- CRIU can checkpoint Docker & LXD containers
- There is a (small) overhead with CRIU & BLCR
... different solutions depending on your needs.

What have we done?
- Implemented Slurm drivers for DMTCP & CRIU
- Added "--with-criu=<path>" and "--with-dmtcp=<path>" flags at compile time
- Added "--with-criu" and "--with-dmtcp" flags to sbatch
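As a usage sketch (assuming the flag names shown above; the install prefixes and the job script name are illustrative, and the sbatch options are the ones added by this work, not stock Slurm):

```bash
# Build Slurm with the new checkpoint plugins (/opt/criu and /opt/dmtcp are example prefixes)
./configure --with-criu=/opt/criu --with-dmtcp=/opt/dmtcp
make && make install

# Submit a job whose processes can later be checkpointed with DMTCP ...
sbatch --with-dmtcp my_job.sh

# ... or with CRIU
sbatch --with-criu my_job.sh
```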

CRIU Plugin
- A direct adaptation of the existing BLCR plugin
- The idea:
  - Do nothing when starting the application
  - On checkpoint, execute a script that:
    - organizes the data (input, folder structure and so on)
    - gets the process PID
    - calls the CRIU checkpoint operation
  - On restart, execute another script that calls the CRIU restore operation
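A minimal sketch of what those two scripts could look like, assuming the standard criu command-line interface (criu dump / criu restore, which typically needs elevated privileges); the helper names, PID handling and directory layout are illustrative, not the plugin's actual code:

```bash
#!/bin/bash
# Hypothetical helpers for the CRIU plugin scripts (sketch only).

# Checkpoint: dump a running task with CRIU, keeping it alive afterwards.
ckpt_criu() {          # usage: ckpt_criu <pid> <image_dir>
    local pid=$1 img_dir=$2
    mkdir -p "$img_dir"
    criu dump --tree "$pid" --images-dir "$img_dir" --shell-job --leave-running
}

# Restart: restore the task from the saved image directory, detaching from it.
restart_criu() {       # usage: restart_criu <image_dir>
    local img_dir=$1
    criu restore --images-dir "$img_dir" --shell-job --restore-detached
}
```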

DMTCP Plugin
How does DMTCP work?
- DMTCP employs a coordinator
  - All jobs register with it
  - It coordinates the checkpoint of parallel applications
  - There can be more than one; we employ at most one per node
- The coordinator must start before the application
- The application is started through a wrapper
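For reference, a minimal sketch of the plain DMTCP workflow outside Slurm (the port and program name are illustrative, and flag spellings vary slightly between DMTCP versions):

```bash
# 1. Start a coordinator (in our setup, at most one per node), detached, on a known port
dmtcp_coordinator --daemon --port 7779

# 2. Launch the application through the DMTCP wrapper so it registers with that coordinator
dmtcp_launch --coord-host localhost --coord-port 7779 ./my_app

# 3. Ask the coordinator to checkpoint everything registered with it
dmtcp_command --coord-host localhost --coord-port 7779 --checkpoint

# 4. Later, restart from the generated checkpoint images
dmtcp_restart --coord-host localhost --coord-port 7779 ckpt_*.dmtcp
```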

DMTCP Plugin: checkpoint process
- The whole script submitted to sbatch is checkpointed
- When running a job, check whether a coordinator is already running on that node: if so, connect to it; if not, start one
- When running a multi-task job, only one task does this; the rest connect to that coordinator

DMTCP plugin implementation (1): checkpoint/restart
- Performed with a script, just like the BLCR one
- The only difficulty is knowing where the coordinator is running
- All together, quite straightforward
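A minimal sketch of such a checkpoint script, assuming the coordinator's host and port were recorded in a file under the job's checkpoint directory when the coordinator was started (the file name is hypothetical, not the plugin's real convention):

```bash
#!/bin/bash
# Hypothetical checkpoint script: ask the job's DMTCP coordinator to checkpoint.
CKPT_DIR=$1                                                # checkpoint directory passed in by the plugin
read COORD_HOST COORD_PORT < "$CKPT_DIR/coordinator.addr"  # written when the coordinator was started
dmtcp_command --coord-host "$COORD_HOST" --coord-port "$COORD_PORT" --checkpoint

# Restart is symmetric: DMTCP writes a dmtcp_restart_script.sh next to the
# checkpoint images, which can simply be executed, e.g.:
#   "$CKPT_DIR"/dmtcp_restart_script.sh
```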

DMTCP plugin implementation (2): job start
- Performed with a SPANK plugin
- We capture the command to be executed on the remote node (the user script) and replace it with a call to the DMTCP wrapper, passing the user script as the input argument
- On start, this wrapper:
  - uses a mutex, so that only one task (in a parallel job) executes it
  - organizes some files, folders and so on
  - connects to the coordinator if one is already running; if not, starts one
  - starts the real job to be executed (the user script)
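A simplified sketch of what such a wrapper could look like, assuming flock(1) for the per-node mutex; the environment variable, file names and port are illustrative, since the slides do not show the real wrapper:

```bash
#!/bin/bash
# Hypothetical per-node DMTCP wrapper; its argument is the user's job script.
USER_SCRIPT=$1
CKPT_DIR=${SLURM_CKPT_DIR:-$PWD/ckpt}      # illustrative, not an official Slurm variable
NODE_DIR="$CKPT_DIR/$(hostname)"           # node-local bookkeeping: one coordinator per node
ADDR_FILE="$NODE_DIR/coordinator.addr"
mkdir -p "$NODE_DIR"

# Mutex: only the first task on each node starts the coordinator.
exec 9>"$NODE_DIR/coordinator.lock"
flock 9
if [ ! -s "$ADDR_FILE" ]; then
    PORT=7779                              # illustrative fixed port
    dmtcp_coordinator --daemon --exit-on-last --port "$PORT"
    echo "$(hostname) $PORT" > "$ADDR_FILE"
fi
flock -u 9

# Every task launches the real job through DMTCP, pointing at its node's coordinator.
read COORD_HOST COORD_PORT < "$ADDR_FILE"
exec dmtcp_launch --coord-host "$COORD_HOST" --coord-port "$COORD_PORT" \
     --ckptdir "$CKPT_DIR" "$USER_SCRIPT"
```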

DMTCP & CRIU plugin implementation (3): modifications to Slurm
- The usual work related to a new plugin (Makefiles, ...)
- checkpoint.h: include the new libraries
- SPANK:
  - a new item, S_JOB_CHECKPOINT_DIR, so that spank_get_item returns the checkpoint folder
  - creation of spank_set_item to modify job parameters; at the moment only "S_JOB_ARGV" is supported

Current status
- CRIU plugin: finished & tested
- DMTCP plugin: the main implementation is finished; some issues related to the Infiniband driver are being debugged in DMTCP
- When everything works correctly, it will be posted to the Slurm mailing list

BLCR vs. CRIU vs. DMTCP (disclaimer: this is our personal experience)
- Integration with Slurm: YES (all three, once the new plugins are in place)
- Requires application modification: NO for CRIU and DMTCP; BLCR requires recompiling the application
- MPI applications: supported with BLCR and DMTCP
- Can checkpoint a running application without preloading: YES for CRIU and BLCR (* the library must be loaded)
- Overhead besides checkpoint: ranges from none to a startup time of a few seconds plus a ~1-2% CPU overhead, depending on the tool
- Can checkpoint containers (Docker & LXD)? YES with CRIU (* we have only tested Docker, not LXD)
- Infiniband support: N/A or NO (* we have not tried it; taken from the documentation)

Take a look! https://github.com/supermanue/slurm

Future: short term
- A full comparison of checkpoint mechanisms (characteristics, performance, scalability, ...)
- Finishing the implementation of a multi-checkpoint plugin that automatically determines which mechanism to employ for each job
Both will be ready by the end of the year.

Future: short term: the multi-checkpoint plugin algorithm
- Is the code integrated with BLCR? Yes -> use BLCR
- No: is it parallel? Yes -> use DMTCP; No -> use CRIU
This is our current naive approach; it will be revised after the performance & scalability analysis.
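A minimal sketch of that decision tree as a shell function; the function name and its yes/no inputs are hypothetical (the real plugin would have to inspect the job itself):

```bash
# Hypothetical selector mirroring the decision tree above.
# usage: select_ckpt_mechanism <uses_blcr: yes|no> <is_parallel: yes|no>
select_ckpt_mechanism() {
    local uses_blcr=$1 is_parallel=$2
    if [ "$uses_blcr" = "yes" ]; then
        echo blcr       # the application is already integrated with BLCR
    elif [ "$is_parallel" = "yes" ]; then
        echo dmtcp      # parallel/MPI jobs: DMTCP can checkpoint them
    else
        echo criu       # serial jobs: CRIU, no preloading needed
    fi
}

select_ckpt_mechanism no yes    # prints "dmtcp"
```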

Future: middle and long term
- Once Docker itself is well integrated in Slurm (security and so on), integrate Docker checkpointing in Slurm
- Employ checkpoint/restart for live job migration inside clusters

Thanks for your attention
manuel.rodriguez@ciemat.es