An Overview of Berkeley Lab’s Linux Checkpoint/Restart (BLCR)
Paul Hargrove with Jason Duell and Eric Roman
January 13th, 2004
(Based on slides by Jason Duell)

Linux checkpoint/restart

Outline
Project goals
System design
Extension interface
Current status
Future work

Uses of Checkpoint/Restart

Gang scheduling
● No queue drain for maintenance, policy change
● Higher utilization and/or more flexible scheduling

Process migration
● Save job if node failure imminent
● Pack jobs for optimal network performance

Periodic backup
● Not our main focus
● Application can always do more efficiently
● But may be useful for systems with long jobs, fast I/O, and/or high node failure rates

Implementation Strategies

Application-based checkpointing
● Efficient: save only needed data as step completes
● Good for fault tolerance, bad for preemption
● Requires per-application effort by programmer

Library-based checkpointing
● Portable across operating systems
● Transparent to application (but may require relink, etc.)
● Can’t (generally) restore all resources (e.g., process IDs)
● Can’t checkpoint shell scripts

Kernel-based checkpointing
● Not portable, and harder to implement
● Can save/restore all resources

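As a rough illustration of the application-based strategy, the Python sketch below saves only the loop counter and running total as each step completes, then resumes from the last saved step. The state layout, the file name passed in, and the `stop_after` parameter (which simulates the job being killed) are assumptions made for this example. None of it is part of BLCR.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write the minimal state needed to resume."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint so a crash never leaves a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum 0..n_steps-1, checkpointing after every completed step.

    stop_after (hypothetical) simulates the job dying once that many
    steps have completed; the function then returns None.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        if stop_after is not None and state["step"] == stop_after:
            return None
    return state["total"]
```

Calling `run` a second time with the same path picks up at the saved step. This shows both sides of the trade-off on the slide: very little data is written, but the programmer has to decide what counts as "needed" state, and the job can only stop cleanly at a step boundary.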
Design Goals

Target: parallel scientific applications
● MPI is a must
● But allow support for other programs/models, too
● Esoteric features (ptrace, Unix domain sockets) have lower implementation priority

Implementation: Linux kernel module
● Lower barrier to adoption than kernel patch
● Allows upgrades, bug fixes, without reboot

Design Goals II

Provide ‘toolkit’ for distributed C/R
● We provide single node checkpoint/restart
● We don’t support distributed operating system features
  ● No built-in support for TCP sockets, bproc namespaces, etc.
● We provide hooks to allow parallel runtimes/libraries to implement distributed checkpoint/restart
  ● So the MPI library needs to know about checkpointing, but user applications don’t

Extension Interface

Callback functions
● Registered at startup (or as needed)
● Run at checkpoint time, then resume at restart/continue
● Handle parallel coordination and/or unsupported objects

Two types of callbacks
● Signal handler context
  ● Run with same PID (LinuxThreads); no thread-safety needed
  ● But callback limited to calling signal-safe functions (small subset of POSIX)
● Separate thread context
  ● Can call any function
  ● But code needs to be thread-safe, and runs with a separate PID (LinuxThreads)

Critical sections
● Use to protect uncheckpointable sections of code

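The callback and critical-section ideas can be sketched with a toy, single-threaded model. Everything here is hypothetical and is not the BLCR API: the class `ToyCheckpointRuntime`, its methods, and the `outcome` argument (which simulates whether execution resumes in the original process or in a restarted one) are all invented for this example. Each callback is written as a Python generator that yields once at the point where the checkpoint is taken, mirroring "run at checkpoint time, then resume at restart/continue".

```python
from contextlib import contextmanager


class ToyCheckpointRuntime:
    """Toy stand-in for a checkpoint library (hypothetical; not the BLCR API)."""

    def __init__(self):
        self._callbacks = []
        self._cs_depth = 0      # nesting depth of critical sections
        self._pending = None    # outcome of a checkpoint deferred by a critical section
        self.images_taken = 0   # counts simulated checkpoints

    def register_callback(self, callback):
        """Register a generator function that yields once at the checkpoint point."""
        self._callbacks.append(callback)

    @contextmanager
    def critical_section(self):
        """Defer any checkpoint requested while the enclosed block runs."""
        self._cs_depth += 1
        try:
            yield
        finally:
            self._cs_depth -= 1
            if self._cs_depth == 0 and self._pending is not None:
                outcome, self._pending = self._pending, None
                self._take_checkpoint(outcome)

    def request_checkpoint(self, outcome="continue"):
        """Checkpoint now, or defer it if inside a critical section.

        Returns True if taken immediately, False if deferred.
        """
        if self._cs_depth > 0:
            self._pending = outcome
            return False
        self._take_checkpoint(outcome)
        return True

    def _take_checkpoint(self, outcome):
        started = []
        for callback in self._callbacks:
            gen = callback()
            next(gen)                # run the pre-checkpoint half, up to the yield
            started.append(gen)
        self.images_taken += 1       # stand-in for writing the process image
        for gen in reversed(started):
            try:
                gen.send(outcome)    # resume the callback with the outcome
            except StopIteration:
                pass


def make_network_callback(log):
    """Example callback for a hypothetical MPI-like library."""
    def callback():
        log.append("quiesce network")
        outcome = yield              # the checkpoint happens here
        if outcome == "restart":
            log.append("rebuild connections")
        else:
            log.append("resume traffic")
    return callback
```

In this model, a request made inside `with rt.critical_section():` is held until the block exits, at which point the registered callbacks run. The toy leaves out the signal-handler versus separate-thread distinction from the slide, which concerns where the callback code executes rather than what it does.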
Current Status

Support LAM-MPI jobs
● Both TCP and Myrinet supported
● Infrastructure in place for Infiniband, Quadrics
● Process migration: currently must restart whole job

Simple semantics for open files
● Reopen and seek to original position
● Must be regular files (pipe support coming soon)
● Files must exist in same location on filesystem

Single- and multi-threaded processes
● Checkpoint of ‘mpirun’ checkpoints whole MPI job
● Will support process groups, sessions in future
● Restore original PID

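The "reopen and seek" semantics for open files can be approximated in user space as follows. This is only a sketch of the idea, not how BLCR implements it inside the kernel. The function names and the record layout are assumptions made for this example.

```python
import os
import stat


def record_open_file(fd, path, flags):
    """Capture what is needed to reopen a file later: path, flags, offset."""
    mode = os.fstat(fd).st_mode
    if not stat.S_ISREG(mode):
        # Mirrors the restriction above: pipes, sockets, etc. are not handled.
        raise ValueError("only regular files can be recorded")
    return {
        "path": os.path.abspath(path),
        "flags": flags,
        "offset": os.lseek(fd, 0, os.SEEK_CUR),  # current file position
    }


def restore_open_file(record):
    """Reopen the file at the same path and seek to the saved offset."""
    # Never create or truncate at restart: the file must already exist
    # in the same location, otherwise os.open raises FileNotFoundError.
    flags = record["flags"] & ~(os.O_CREAT | os.O_TRUNC | os.O_EXCL)
    fd = os.open(record["path"], flags)
    os.lseek(fd, record["offset"], os.SEEK_SET)
    return fd
```

Note that this sketch does not preserve the original descriptor number and does not detect whether the file contents changed between checkpoint and restart.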
Current Status II

Work with wide variety of 2.4 kernels
● kernel.org versions onwards
● RedHat: 7.2 through 9
● SuSE: 7.2 through 9.0
● autoconf feature probing, so support of custom patched kernels likely to be automatic
● We’ll maintain 2.4 support once 2.6 comes out

Support both new and old pthreads
● I.e., old “LinuxThreads”, plus new 2.6 pthreads (backported to 2.4 by Red Hat)

Future Work

Support for sessions & process groups
● Including pipes, mmaps, etc., shared within group
● Full restoration of parent/child tree, with original PIDs

More semantics for files
● Allow checksum of file, with restart error if it has changed
● Allow saving contents of file (restore either clobbers, or opens anonymously)
● Support files that are not open at checkpoint time, but are specified as being part of the checkpoint

Laundry list of other resources to support
● Page 4 of “Design and Implementation” paper

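The proposed "checksum of file, with restart error if it has changed" could look roughly like the sketch below. The slides do not say which checksum would be used, so SHA-256 is an assumption here, as are the function names and the `FileChangedError` exception.

```python
import hashlib


class FileChangedError(RuntimeError):
    """Raised at restart when a file no longer matches its saved checksum."""


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_at_restart(path, saved_digest):
    """Compare the file against the digest recorded at checkpoint time."""
    if file_digest(path) != saved_digest:
        raise FileChangedError(f"{path} changed since the checkpoint was taken")
```

At checkpoint time the digest would be stored alongside the saved process state, and at restart `verify_at_restart` would run before the file is reopened.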
Future Work II

Integration with parallel job systems
● Funded to work within the suite from the DOE Scalable Systems Software SciDAC project; work is in progress
● Possibility of OpenPBS, PBSPro support
● Interested in others (LSF, SGE, SLURM, etc.)

More MPI implementations
● MPICH 2 support anticipated
● Vendor support (Quadrics)?
● LAM/MPI support for partial/live migration

Conclusion

Papers (available from website):
● “Design and Implementation of BLCR”: high-level system design, including description of user API
● “Requirements for Linux Checkpoint/Restart”: exhaustive list of Unix features we will support (or not)
● “A Survey of Checkpoint/Restart Implementations”: focusing on open source versions that run on Linux
● “The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing”: implementation with LAM/MPI