FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications
Prof. Heon Y. Yeom
Distributed Computing Systems Lab., Seoul National University
Motivation
- Condor supports a Checkpoint/Restart (C/R) mechanism only in the Standard Universe, i.e., for single-process jobs.
- C/R for parallel jobs is not provided in any of the current Condor universes.
- We would like to make C/R available for MPI programs.
Introduction
- Why the Message Passing Interface (MPI)?
  - Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
  - MPI is the most popular programming model in cluster computing.
  - Providing fault tolerance to MPI is more cost-effective than providing it to the OS or hardware.
- We have chosen the MPICH series.
Architecture: Concept
- FT-MPICH combines monitoring, failure detection, and a C/R protocol.
Architecture: Overall System
[Diagram: the Management System and the MPI processes are connected over Ethernet; on each node, the MPI process and its communication module exchange data through IPC via a message queue.]
Management System
- The Management System makes MPI more reliable by providing:
  - Failure detection
  - Checkpoint coordination
  - Recovery
  - Initialization coordination
  - Output management
  - Checkpoint transfer
Manager System
[Diagram: a Leader Manager with stable storage coordinates one Local Manager per MPI process. The Leader handles initialization, CKPT commands, CKPT transfer, and failure notification & recovery; the MPI processes communicate directly with each other to exchange data.]
Fault-Tolerant MPICH-P4
[Diagram: the FT-MPICH stack over Ethernet. Collective and P2P operations run on the ADI (Abstract Device Interface) and the ch_p4 (Ethernet) device; the FT module adds a recovery module, connection re-establishment, a checkpoint toolkit, and atomic message transfer.]
Startup in Condor
- Preconditions:
  - The Leader Manager already knows, from user input, the machines on which the MPI processes run and the number of MPI processes.
  - The Local Manager and MPI process binaries are located at the same path on each machine.
Startup in Condor
- Job submission description file:
  - Uses the Vanilla universe.
  - A shell script is used in the submission description file: executable points to a shell script, and that shell script only launches the Leader Manager.
- Example:

  Example.cmd:
    universe   = Vanilla
    executable = exe.sh
    output     = exe.out
    error      = exe.err
    log        = exe.log
    queue

  exe.sh (shell script):
    #!/bin/sh
    Leader_manager …
Startup in Condor
- The user submits the job using condor_submit; it starts up as a normal Condor job.
[Diagram: Condor pool with the Central Manager (Collector, Negotiator), the Submit Machine (Schedd, Shadow), and an Execute Machine (Startd, Starter) running the job, i.e., the Leader Manager.]
Startup in Condor
- The Leader Manager launches a Local Manager on each execute machine, and each Local Manager launches its MPI process via Fork() & Exec() (see the sketch below).
[Diagram: the Leader Manager job spawns Local Managers on Execute Machines 1-3; each Local Manager forks and execs an MPI process.]
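The slides only say that the Local Manager starts its MPI process with Fork() & Exec(); here is a minimal C sketch of that step, assuming a hypothetical ./mpi_app binary and a Local Manager that simply waits on the child. It is an illustration, not FT-MPICH code.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the Local Manager forks a child and execs the MPI
 * process binary, keeping the pid so it can monitor the child. */
static pid_t spawn_mpi_process(const char *binary, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {                 /* fork failed */
        perror("fork");
        return -1;
    }
    if (pid == 0) {                /* child: become the MPI process */
        execv(binary, argv);
        perror("execv");           /* only reached if exec fails */
        _exit(127);
    }
    return pid;                    /* parent (Local Manager) keeps the pid */
}

int main(void)
{
    char *args[] = { "./mpi_app", NULL };       /* assumed binary name */
    pid_t child  = spawn_mpi_process("./mpi_app", args);

    if (child > 0) {
        int status;
        waitpid(child, &status, 0);             /* watch the child exit */
        printf("MPI process exited with status %d\n", status);
    }
    return 0;
}
```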
Startup in Condor
- The MPI processes send their communication info to the Leader Manager, which aggregates it.
- The Leader Manager then broadcasts the aggregated info back to all MPI processes (a sketch of a possible format follows).
[Diagram: communication info flows from the MPI processes on Execute Machines 1-3 through their Local Managers to the Leader Manager and back.]
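The slides do not say what the "Communication Info" contains; a small C sketch, assuming it is just (rank, IP, port) tuples that the Leader Manager gathers into a table before broadcasting it back. The struct names and fields are illustrative, not FT-MPICH's actual format.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_PROCS 64

/* Assumed layout of the per-process communication info sent at startup. */
struct comm_info {
    int      rank;       /* MPI rank of the process      */
    char     ip[16];     /* dotted-quad listen address   */
    uint16_t port;       /* listen port for MPI messages */
};

/* Leader side: one entry per rank; once every rank has reported, the whole
 * table is broadcast so each process can connect to every other process. */
struct comm_table {
    int n;
    struct comm_info entry[MAX_PROCS];
};

static void leader_add_info(struct comm_table *t, const struct comm_info *ci)
{
    if (t->n < MAX_PROCS)
        t->entry[t->n++] = *ci;
}

int main(void)
{
    struct comm_table table = { 0 };
    struct comm_info  ci = { .rank = 0, .ip = "10.0.0.1", .port = 5000 };

    leader_add_info(&table, &ci);               /* rank 0 reports in */
    printf("rank %d listens on %s:%u\n",
           table.entry[0].rank, table.entry[0].ip,
           (unsigned)table.entry[0].port);
    return 0;
}
```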
Fault Tolerant MPI
- To provide MPI fault tolerance, we have adopted:
  - A coordinated checkpointing scheme (vs. an independent scheme): the Leader Manager is the coordinator!!
  - Application-level checkpointing (vs. kernel-level CKPT): this method does not require any effort on the part of cluster administrators.
  - A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of MPI source code.
Atomic Message Passing
- Coordination between MPI processes.
- Assumption: the communication channels are FIFO.
- Lock() and Unlock() delimit an atomic region of message transfer.
- If the CKPT SIG arrives outside an atomic region, the checkpoint is performed immediately; if it arrives inside, the checkpoint is delayed until Unlock() (see the sketch below).
[Diagram: Proc 0 and Proc 1; a CKPT SIG arriving inside Proc 1's Lock()/Unlock() region delays its checkpoint.]
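A minimal C sketch of the delayed-checkpoint rule, assuming the checkpoint request arrives as a signal (SIGUSR1 is an assumption; the slides only say "CKPT SIG") and that Lock()/Unlock() simply toggle an in-atomic flag. Real FT-MPICH implements this inside the MPICH device layer; this is only an illustration of the idea.

```c
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t ckpt_pending = 0;  /* CKPT SIG arrived?       */
static volatile sig_atomic_t in_atomic    = 0;  /* inside Lock()/Unlock()? */

/* Placeholder for the real checkpoint; printf is used only for the sketch
 * (a production handler must stay async-signal-safe). */
static void do_checkpoint(void)
{
    printf("checkpoint is performed\n");
}

/* Handler for the checkpoint signal sent by the Local Manager. */
static void ckpt_sig_handler(int sig)
{
    (void)sig;
    if (in_atomic)
        ckpt_pending = 1;   /* inside an atomic region: delay the checkpoint */
    else
        do_checkpoint();    /* outside an atomic region: checkpoint now      */
}

static void lock(void)   { in_atomic = 1; }     /* enter the atomic region */

static void unlock(void)
{
    in_atomic = 0;                              /* leave the atomic region */
    if (ckpt_pending) {                         /* a CKPT SIG was delayed  */
        ckpt_pending = 0;
        do_checkpoint();
    }
}

int main(void)
{
    signal(SIGUSR1, ckpt_sig_handler);
    lock();
    /* ... atomic message transfer (send & matching receive) goes here ... */
    unlock();
    return 0;
}
```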
Atomic Message Passing (Case 1)
- When an MPI process receives the CKPT SIG, it sends and receives a barrier message on each channel, then checkpoints.
[Diagram: Proc 0 and Proc 1 both receive the CKPT SIG outside an atomic region, exchange barrier messages, and take their checkpoints.]
Atomic Message Passing (Case 2)
- By sending and receiving the barrier messages, any in-transit message is pushed to its destination.
[Diagram: Proc 0's data message reaches Proc 1 ahead of the barrier; Proc 1's checkpoint is delayed until its atomic region ends.]
Atomic Message Passing (Case 3)
- The communication channels between MPI processes are flushed, so no dependency between MPI processes crosses the checkpoint line (see the sketch below).
[Diagram: with the channels flushed, both processes take consistent checkpoints.]
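A rough C sketch of the barrier-based channel flush under the FIFO assumption from the previous slides. The helpers send_barrier(), recv_message(), and deliver_to_application() are hypothetical stubs, not FT-MPICH functions; in the real system this happens inside the ch_p4 device.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPEERS 3   /* number of peer processes (illustrative) */

/* Hypothetical message descriptor: either application data or a barrier. */
struct msg { bool is_barrier; /* payload omitted in the sketch */ };

/* Stub channel helpers, for illustration only. */
static void send_barrier(int peer) { printf("barrier -> peer %d\n", peer); }

static bool recv_message(int peer, struct msg *m)
{
    (void)peer;
    m->is_barrier = true;           /* stub: pretend the barrier arrived */
    return true;
}

static void deliver_to_application(int peer, const struct msg *m)
{
    (void)m;
    printf("in-transit data from peer %d delivered\n", peer);
}

/* Flush every FIFO channel before checkpointing: send a barrier to each
 * peer, then drain each incoming channel until that peer's barrier shows
 * up. Because channels are FIFO, everything received before the barrier
 * is an in-transit message that must reach the application before the
 * checkpoint, so no dependency crosses the checkpoint line. */
static void flush_channels(void)
{
    for (int p = 0; p < NPEERS; p++)
        send_barrier(p);

    for (int p = 0; p < NPEERS; p++) {
        struct msg m;
        while (recv_message(p, &m) && !m.is_barrier)
            deliver_to_application(p, &m);
    }
}

int main(void)
{
    flush_channels();
    return 0;
}
```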
Checkpointing
- Coordinated checkpointing: the Leader Manager issues a checkpoint command to every rank (rank 0 through rank 3 in the figure).
- Each process saves its state (stack, data, text, heap) to stable storage, producing successive checkpoint versions (ver 1, ver 2); a leader-side sketch follows.
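The wire protocol between the Leader Manager and the Local Managers is not given in the slides; this is a leader-side sketch of the coordinated exchange, assuming simple CKPT_CMD/CKPT_DONE messages and stubbed transport helpers (none of these names come from FT-MPICH).

```c
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 4   /* number of ranks in the slide's example */

/* Hypothetical message types between the Leader and the Local Managers. */
enum msg_type { MSG_CKPT_CMD, MSG_CKPT_DONE };

/* Stub transport, for illustration only: a real implementation would use
 * the Ethernet connections between the Leader and the Local Managers. */
static bool send_to_manager(int rank, enum msg_type m, int version)
{
    printf("-> rank %d: %s (ver %d)\n", rank,
           m == MSG_CKPT_CMD ? "CKPT_CMD" : "CKPT_DONE", version);
    return true;
}

static bool recv_from_manager(int rank, enum msg_type expected, int version)
{
    (void)expected; (void)version;
    printf("<- rank %d: CKPT_DONE\n", rank);
    return true;
}

/* Coordinated checkpoint, leader side: tell every rank to checkpoint at
 * `version`, then wait until all of them report completion before the new
 * version is considered stable on storage. */
static bool coordinated_checkpoint(int version)
{
    for (int r = 0; r < NPROCS; r++)
        if (!send_to_manager(r, MSG_CKPT_CMD, version))
            return false;               /* could not even start: give up   */

    for (int r = 0; r < NPROCS; r++)
        if (!recv_from_manager(r, MSG_CKPT_DONE, version))
            return false;               /* keep using the previous version */

    printf("checkpoint version %d committed\n", version);
    return true;
}

int main(void)
{
    coordinated_checkpoint(2);          /* e.g., take "ver 2" after "ver 1" */
    return 0;
}
```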
Failure Recovery
- MPI process recovery: a new process is created and its address space (stack, data, text, heap) is restored from the CKPT image, yielding the restarted process.
Failure Recovery
- Connection re-establishment: each MPI process re-opens its socket and sends its IP and port info to the Local Manager (see the sketch below).
- This is the same procedure used at initialization time.
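A small C sketch of the per-process step: re-open a listening socket on an ephemeral port and report the chosen address. The actual message to the Local Manager is replaced by a printf, and everything beyond "re-opens socket and sends IP, Port" is an assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Re-open a listening socket on an ephemeral port and report the chosen
 * address, which the Local Manager would forward to the Leader Manager so
 * the aggregated table can be rebroadcast (reporting is stubbed here). */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = 0;                 /* let the kernel pick a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0) {
        perror("bind/listen");
        close(fd);
        return 1;
    }

    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);   /* learn the port */

    /* Stub for "send IP, Port info to Local Manager". */
    printf("report to Local Manager: port %u\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}
```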
Fault Tolerant MPI
- Recovery from failure: after failure detection, the Leader Manager directs ranks 0 through 3 to roll back to the last checkpoint version (ver 1) held in stable storage.
Fault Tolerant MPI in Condor
- The Leader Manager controls the MPI processes by issuing checkpoint commands and monitoring them.
- Condor is not aware of the failure incident.
[Diagram: the Leader Manager job oversees the Local Managers and MPI processes on Execute Machines 1-3 within the Condor pool.]
Fault-Tolerant MPICH Variants (Seoul National University)
[Diagram: the same FT module (recovery module, connection re-establishment, checkpoint toolkit, atomic message transfer) layered over the ADI (Abstract Device Interface) for three devices: Globus2 (Ethernet) in MPICH-GF, GM (Myrinet) in M3, and MVAPICH (InfiniBand) in SHIELD, each providing collective and P2P operations.]
Summary
- We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
- Currently, only the P4 (Ethernet) version works with Condor.
- We look forward to working with the Condor team.