Slide 1: FT-MPICH: Providing Fault Tolerance for MPI Parallel Applications
Prof. Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Condor Week 2006, yeom@snu.ac.kr
Slide 2: Motivation
- Condor supports a checkpoint/restart (C/R) mechanism only in the Standard Universe, and only for single-process jobs.
- C/R for parallel jobs is not provided in any of the current Condor universes.
- We would like to make C/R available for MPI programs.
Slide 3: Introduction
Why the Message Passing Interface (MPI)?
- Designing a generic fault-tolerance framework is extremely hard due to the diversity of hardware and software systems.
- We have chosen the MPICH series: MPI is the most popular programming model in cluster computing.
- Providing fault tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.
Slide 4: Architecture (Concept)
FT-MPICH combines three elements:
- Monitoring
- Failure detection
- C/R protocol
Slide 5: Architecture (Overall System)
[Diagram] The Management System and each MPI process have their own Communication layer; components on the same node interact over IPC (a message queue), while the Management System and the MPI processes on different nodes communicate over Ethernet.
Slide 6: Management System
The Management System makes MPI more reliable. Its responsibilities:
- Failure detection
- Checkpoint coordination
- Recovery
- Initialization coordination
- Output management
- Checkpoint transfer
Slide 7: Manager System
[Diagram] A Leader Manager coordinates one Local Manager per MPI process and uses stable storage for checkpoints. The manager hierarchy handles initialization, checkpoint commands, checkpoint transfer, and failure notification and recovery; the MPI processes communicate with each other directly to exchange application data.
Slide 8: Fault-Tolerant MPICH_P4
[Diagram] Layered view of FT-MPICH over Ethernet: collective and point-to-point operations sit on top of the ADI (Abstract Device Interface) and the ch_p4 (Ethernet) device; the FT module adds a recovery module (connection re-establishment), a checkpoint toolkit, and atomic message transfer.
Slide 9: Startup in Condor
Preconditions:
- The Leader Manager already knows, from user input, the machines on which the MPI processes will run and the number of MPI processes.
- The Local Manager and MPI process binaries are located at the same path on each machine.
Slide 10: Startup in Condor
Job submission description file:
- The job runs in the Vanilla Universe.
- The submit description file's executable points to a shell script, and that script only launches the Leader Manager.

Example:

    exe.sh (shell script):
        #!/bin/sh
        Leader_manager …

    Example.cmd (submit description file):
        universe   = Vanilla
        executable = exe.sh
        output     = exe.out
        error      = exe.err
        log        = exe.log
        queue
Slide 11: Startup in Condor
The user submits the job with condor_submit; startup then follows the normal Condor path.
[Diagram] Condor pool: the submit machine (condor_submit, schedd, shadow), the Central Manager (negotiator, collector), and an execute machine whose startd and starter launch the job, i.e. the Leader Manager.
Slide 12: Startup in Condor
- The Leader Manager starts a Local Manager on each execute machine.
- Each Local Manager starts its MPI process via fork() and exec().
[Diagram] Condor pool: the Leader Manager job spawns Local Managers on execute machines 1-3, and each Local Manager forks and execs an MPI process.
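The slides do not show the launch code; the following is a minimal C sketch of how a Local Manager might fork() and exec() its MPI process and then watch it. The binary name ./mpi_app and its arguments are hypothetical.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical launcher: fork a child, replace it with the MPI process
     * binary, and keep the pid so the Local Manager can monitor it. */
    static pid_t spawn_mpi_process(const char *binary, char *const argv[])
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {                 /* child: become the MPI process */
            execv(binary, argv);
            perror("execv");            /* only reached if exec fails */
            _exit(127);
        }
        return pid;                     /* parent (Local Manager) keeps the pid */
    }

    int main(void)
    {
        char *argv[] = { "mpi_app", NULL };          /* hypothetical binary */
        pid_t pid = spawn_mpi_process("./mpi_app", argv);
        if (pid > 0) {
            int status;
            waitpid(pid, &status, 0);                /* notice when it exits or dies */
            printf("MPI process %d exited with status %d\n", (int)pid, status);
        }
        return 0;
    }

Waiting on the child is also the natural hook for failure detection: an unexpected exit can be reported up to the Leader Manager.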
Slide 13: Startup in Condor
- Each MPI process sends its communication information to the Leader Manager, which aggregates it.
- The Leader Manager broadcasts the aggregated information back to all MPI processes.
[Diagram] Condor pool: communication info flows from the MPI processes on execute machines 1-3 through their Local Managers to the Leader Manager and back.
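As an illustration of what is being aggregated, here is a small C sketch of an endpoint table keyed by rank. The struct layout, addresses, and ports are assumptions; in the real system each entry would arrive over a socket via a Local Manager rather than being filled in locally.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NPROCS 4   /* hypothetical job size */

    /* Hypothetical record each MPI process reports to the Leader Manager. */
    struct endpoint {
        int      rank;
        char     ip[INET_ADDRSTRLEN];
        uint16_t port;
    };

    int main(void)
    {
        struct endpoint table[NPROCS];
        for (int r = 0; r < NPROCS; r++) {           /* stand-in for collection */
            table[r].rank = r;
            snprintf(table[r].ip, sizeof table[r].ip, "10.0.0.%d", r + 1);
            table[r].port = (uint16_t)(5000 + r);
        }

        /* The Leader Manager would broadcast the aggregated table so that
         * every rank can connect to every other rank. */
        for (int r = 0; r < NPROCS; r++)
            printf("rank %d -> %s:%u\n", table[r].rank, table[r].ip, table[r].port);
        return 0;
    }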
Slide 14: Fault Tolerant MPI
To provide MPI fault tolerance, we have adopted:
- Coordinated checkpointing (vs. an independent scheme): the Leader Manager is the coordinator.
- Application-level checkpointing (vs. kernel-level): this requires no effort on the part of cluster administrators.
- User-transparent checkpointing (vs. user-aware): this requires no modification of the MPI application source code.
Slide 15: Atomic Message Passing
Coordination between MPI processes.
- Assumption: the communication channel is FIFO.
- Lock() and Unlock() delimit an atomic region of message transfer.
[Diagram] If the checkpoint signal (CKPT SIG) arrives outside the atomic region, the checkpoint is performed immediately; if it arrives inside the region, the checkpoint is delayed until Unlock().
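The internals of Lock() and Unlock() are not shown in the slides. Below is a minimal C sketch of the delayed-checkpoint behavior using POSIX signal blocking, assuming the checkpoint request is delivered as a signal (SIGUSR1 here is an arbitrary choice); the actual FT-MPICH implementation may differ.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical checkpoint signal; the slides only call it "CKPT SIG". */
    #define CKPT_SIG SIGUSR1

    static void ckpt_handler(int sig)
    {
        (void)sig;
        /* placeholder for the real checkpoint routine */
        write(STDOUT_FILENO, "checkpoint performed\n", 21);
    }

    /* Lock(): block CKPT_SIG so a checkpoint request arriving inside the
     * atomic region stays pending instead of interrupting the transfer. */
    static void Lock(void)
    {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, CKPT_SIG);
        sigprocmask(SIG_BLOCK, &set, NULL);
    }

    /* Unlock(): unblock CKPT_SIG; a delayed checkpoint fires here. */
    static void Unlock(void)
    {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, CKPT_SIG);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
    }

    int main(void)
    {
        signal(CKPT_SIG, ckpt_handler);

        Lock();
        raise(CKPT_SIG);        /* arrives inside the atomic region: delayed */
        puts("atomic message transfer in progress");
        Unlock();               /* pending checkpoint is performed here */
        return 0;
    }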
Slide 16: Atomic Message Passing (Case 1)
When an MPI process receives the CKPT SIG, it sends and receives a barrier message.
[Diagram] Proc 0 and Proc 1 each receive the CKPT SIG, exchange barrier messages on the channel, and then take their checkpoints.
Slide 17: Atomic Message Passing (Case 2)
By sending and receiving the barrier message, any in-transit message is pushed to its destination before the checkpoint is taken.
[Diagram] A data message sent inside the atomic region reaches Proc 1 ahead of the barrier on the FIFO channel; the sender's checkpoint is delayed until the region ends.
Slide 18: Atomic Message Passing (Case 3)
- The communication channel between the MPI processes is flushed.
- The dependency between the MPI processes is removed, so the coordinated checkpoint captures a consistent state.
[Diagram] Same timeline as Case 2: barrier and data messages around the atomic region, with the checkpoint delayed until Unlock().
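The channel-flush idea can be shown in a toy C sketch over a UNIX socketpair (a stand-in for the FIFO channel between two ranks, not the FT-MPICH code): on the checkpoint request each side sends its barrier, then drains the channel until the peer's barrier arrives, so every in-transit message is delivered before the checkpoint.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MSG_DATA    'D'
    #define MSG_BARRIER 'B'

    /* Drain the FIFO channel until the peer's barrier arrives; everything
     * read before the barrier is an in-transit data message that must be
     * delivered (here: just counted) before checkpointing. */
    static void flush_channel(int fd)
    {
        char tag;
        int drained = 0;
        while (read(fd, &tag, 1) == 1 && tag != MSG_BARRIER)
            drained++;
        printf("drained %d in-transit message(s); safe to checkpoint\n", drained);
    }

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* FIFO channel stand-in */

        if (fork() == 0) {                    /* Proc 1 */
            char barrier = MSG_BARRIER;
            write(sv[1], &barrier, 1);        /* send own barrier on CKPT SIG */
            flush_channel(sv[1]);             /* wait for the peer's barrier */
            _exit(0);                         /* checkpoint would happen here */
        }

        /* Proc 0: a data message is still in transit when the CKPT SIG
         * arrives; its barrier then closes the channel for this epoch. */
        char data = MSG_DATA, barrier = MSG_BARRIER;
        write(sv[0], &data, 1);
        write(sv[0], &barrier, 1);
        flush_channel(sv[0]);
        wait(NULL);
        return 0;
    }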
Slide 19: Checkpointing
Coordinated checkpointing.
[Diagram] The Leader Manager issues the checkpoint command to rank 0 through rank 3; each rank saves its process image (text, data, heap, and stack segments) to storage, and successive checkpoints are versioned (ver 1, ver 2).
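To make the "save text, data, heap, stack to a versioned file" step concrete, here is a toy, Linux-specific C sketch of application-level checkpointing that walks /proc/self/maps and appends each readable, writable, private region to a ckpt.verN file. It only illustrates the idea; it is not the FT-MPICH checkpoint toolkit, and the file format is invented.

    #include <stdint.h>
    #include <stdio.h>

    /* Dump every rw, private mapping (data, heap, stack) to "ckpt.ver<N>". */
    static int save_checkpoint(int version)
    {
        char name[32];
        snprintf(name, sizeof name, "ckpt.ver%d", version);   /* ver 1, ver 2, ... */

        FILE *out = fopen(name, "wb");
        FILE *maps = fopen("/proc/self/maps", "r");
        if (!out || !maps)
            return -1;

        char line[512];
        while (fgets(line, sizeof line, maps)) {
            unsigned long start, end;
            char perms[8];
            if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
                continue;
            if (perms[0] != 'r' || perms[1] != 'w' || perms[3] != 'p')
                continue;                              /* keep rw, private regions */
            fwrite(&start, sizeof start, 1, out);      /* region header ...        */
            fwrite(&end, sizeof end, 1, out);
            fwrite((const void *)(uintptr_t)start, 1, end - start, out);  /* ...contents */
        }
        fclose(maps);
        fclose(out);
        return 0;
    }

    int main(void)
    {
        return save_checkpoint(1) == 0 ? 0 : 1;
    }

On restart, the matching recovery step (slide 20) would write these saved regions back into a freshly started process.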
Slide 20: Failure Recovery
MPI process recovery.
[Diagram] A new process is started and its memory segments (text, data, heap, stack) are restored from the checkpoint image, producing the restarted process.
Slide 21: Failure Recovery
Connection re-establishment:
- Each MPI process re-opens its socket and sends its IP and port info to the Local Manager.
- This is the same procedure used at initialization time.
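A minimal C sketch of the re-open step, assuming TCP sockets: bind a fresh listening socket to an ephemeral port and use getsockname() to learn the port the kernel assigned, which is the (IP, port) pair that would then be reported to the Local Manager. The reporting channel itself is omitted.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = 0;                      /* let the kernel pick a port */

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(fd, 8) < 0) {
            perror("socket/bind/listen");
            return 1;
        }

        socklen_t len = sizeof addr;
        getsockname(fd, (struct sockaddr *)&addr, &len);
        printf("report to Local Manager: port %u\n", ntohs(addr.sin_port));
        /* ...send this info to the Local Manager, then accept connections... */
        close(fd);
        return 0;
    }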
Slide 22: Fault Tolerant MPI
Recovery from failure.
[Diagram] After failure detection, the Leader Manager coordinates rank 0 through rank 3 in restarting from the saved checkpoint (ver 1) held in storage.
Slide 23: Fault Tolerant MPI in Condor
- The Leader Manager controls the MPI processes itself, issuing checkpoint commands and monitoring them.
- Condor is not aware of the failure incident.
[Diagram] Condor pool: the Leader Manager job manages the Local Managers and MPI processes on execute machines 1-3 directly.
Slide 24: Fault-Tolerant MPICH Variants (Seoul National University)
[Diagram] The same FT module (recovery module with connection re-establishment, checkpoint toolkit, atomic message transfer) and the collective and point-to-point layers sit on top of the ADI (Abstract Device Interface) for three devices: MPICH-GF on Globus2 (Ethernet), M3 on GM (Myrinet), and SHIELD on MVAPICH (InfiniBand).
Slide 25: Summary
- We can provide fault tolerance for parallel applications using MPICH on Ethernet, Myrinet, and InfiniBand.
- Currently, only the P4 (Ethernet) version works with Condor.
- We look forward to working with the Condor team.