Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)
Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
Hardware performance limitations are being overcome in line with Moore's Law, and these cutting-edge technologies make "tera-scale" clusters feasible. However, what about system reliability? Distributed systems are still fragile due to unexpected failures.
Motivation
High-performance network trend: every widely used MPICH variant demands fault resilience.
- MVAPICH (InfiniBand): high speed (up to 30 Gbps), expected to become popular, MPICH compatible
- MPICH-GM (Myrinet): high speed (10 Gbps), popular, MPICH compatible
- MPICH-G2 (Ethernet): good speed (1 Gbps), common, the MPICH standard
Introduction
Unreliability of distributed systems: even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure. Our goal is to construct a practical multiple fault-tolerant framework for the various MPICH variants running on high-performance clusters and Grids.
Introduction
Why the Message Passing Interface (MPI)? Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems, so we chose the MPICH series: MPI is the most popular programming model in cluster computing, and providing fault-tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.
Architecture - Concept
The multiple fault-tolerant framework combines monitoring, failure detection, a checkpoint/restart (C/R) protocol, and consensus & election protocols.
Architecture - Overall System
[Diagram: each node runs an MPI process and its communication layer; MPI traffic travels over a high-speed network (Myrinet, InfiniBand), while the management system reaches every node over Gigabit Ethernet.]
Architecture - Development History
2003: MPICH-GF (fault-tolerant MPICH-G2, Ethernet)
2004: FT-MPICH-GM (fault-tolerant MPICH-GM, Myrinet)
2005-current: FT-MVAPICH (fault-tolerant MVAPICH, InfiniBand)
Management System
The management system makes MPI more reliable by providing failure detection, checkpoint coordination, recovery, initialization coordination, output management, and checkpoint transfer.
Job Management System (1/2)
The job management system manages and monitors multiple MPI processes and their execution environments. It should be lightweight; it helps the system take consistent checkpoints and recover from failures, and it has a fault-detection mechanism. It has two main components: the Central Manager and the local Job Managers.
Job Management System (2/2)
Central Manager: manages all system functions and state; detects node failures through periodic heartbeats, as well as Job Manager failures. Job Manager: relays messages between the Central Manager and the MPI processes; detects unexpected MPI process failures. A heartbeat-detection sketch follows.
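To make the heartbeat mechanism concrete, here is a minimal sketch of the Central Manager's failure detector: it records the arrival time of each heartbeat and suspects any node that stays silent past a timeout. All names (node_t, HEARTBEAT_TIMEOUT, the simulated main) are illustrative, not taken from the actual FT-MPICH sources.

```c
#include <stdio.h>
#include <time.h>

#define NUM_NODES 4
#define HEARTBEAT_TIMEOUT 5   /* seconds of silence before a node is suspected */

typedef struct {
    time_t last_heartbeat;    /* updated whenever a heartbeat message arrives */
    int alive;
} node_t;

/* Called by the message loop when a heartbeat from node `id` is received. */
void on_heartbeat(node_t *nodes, int id) {
    nodes[id].last_heartbeat = time(NULL);
    nodes[id].alive = 1;
}

/* Called periodically; marks nodes whose heartbeat is overdue as failed. */
void check_failures(node_t *nodes) {
    time_t now = time(NULL);
    for (int i = 0; i < NUM_NODES; i++) {
        if (nodes[i].alive && now - nodes[i].last_heartbeat > HEARTBEAT_TIMEOUT) {
            nodes[i].alive = 0;
            printf("node %d suspected failed: silent for %ld s\n",
                   i, (long)(now - nodes[i].last_heartbeat));
            /* here the real system would trigger the recovery protocol */
        }
    }
}

int main(void) {
    node_t nodes[NUM_NODES];
    time_t now = time(NULL);
    for (int i = 0; i < NUM_NODES; i++) {
        nodes[i].last_heartbeat = now;
        nodes[i].alive = 1;
    }
    on_heartbeat(nodes, 2);   /* simulate a heartbeat arriving from node 2 */
    check_failures(nodes);    /* nothing is overdue yet */
    return 0;
}
```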
Fault-Tolerant MPI (1/3)
To provide MPI fault-tolerance, we adopt:
- a coordinated checkpointing scheme (vs. an independent scheme): the Central Manager is the coordinator;
- application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators;
- a user-transparent checkpointing scheme (vs. a user-aware one): this method requires no modification of MPI application source code.
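As an illustration of user-transparent, application-level checkpointing, the sketch below has the checkpoint request arrive as a signal, with a handler (which in the real system would be linked in by the MPI library, not written by the user) saving process state at a safe point. The file name, the saved state, and the use of SIGUSR1 are assumptions made for the example.

```c
#include <signal.h>
#include <stdio.h>

static double app_state[1024];                 /* stand-in for application memory */
static volatile sig_atomic_t ckpt_requested = 0;

static void ckpt_handler(int sig) {
    (void)sig;
    ckpt_requested = 1;                        /* defer real work out of the handler */
}

static void take_checkpoint(void) {
    FILE *f = fopen("rank0.ckpt", "wb");       /* illustrative file name */
    if (!f) { perror("fopen"); return; }
    fwrite(app_state, sizeof app_state, 1, f);
    fclose(f);
    ckpt_requested = 0;
}

int main(void) {
    signal(SIGUSR1, ckpt_handler);             /* coordinator sends the request */
    for (long iter = 0; iter < 1000000; iter++) {
        app_state[iter % 1024] += 1.0;         /* "computation" */
        if (iter == 500000)
            raise(SIGUSR1);                    /* simulate a coordinator request */
        if (ckpt_requested)
            take_checkpoint();                 /* safe point chosen by the library */
    }
    return 0;
}
```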
Fault-Tolerant MPI (2/3) - Coordinated Checkpointing
[Diagram: the Central Manager issues a checkpoint command to ranks 0-3; each rank writes its checkpoint to stable storage, producing successive consistent versions (ver 1, then ver 2).]
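A minimal sketch of the coordinator side of this protocol, under the assumption that a checkpoint version only becomes the recovery point once every rank has acknowledged it; the messaging functions are stubs, not the real manager API.

```c
#include <stdio.h>

#define NUM_RANKS 4

/* Stubs standing in for the manager's messaging layer. */
static void send_checkpoint_cmd(int rank, int version) {
    printf("-> rank %d: checkpoint version %d\n", rank, version);
}
static int wait_for_ack(int rank) {
    (void)rank;
    return 1;            /* pretend every rank acknowledges */
}

/* Coordinator: all ranks checkpoint version v; only when all acks arrive
 * does v become the global recovery point, so a crash mid-protocol still
 * recovers from version v-1. */
int coordinate_checkpoint(int version) {
    for (int r = 0; r < NUM_RANKS; r++)
        send_checkpoint_cmd(r, version);
    for (int r = 0; r < NUM_RANKS; r++)
        if (!wait_for_ack(r))
            return -1;   /* abort: keep the previous consistent version */
    return version;      /* commit: this version is now the recovery point */
}

int main(void) {
    int committed = coordinate_checkpoint(2);
    printf("committed checkpoint version: %d\n", committed);
    return 0;
}
```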
Fault-Tolerant MPI (3/3) - Recovery from Failures
[Diagram: when a failure is detected, the Central Manager directs ranks 0-3 to roll back to the last consistent checkpoint version (ver 1) in stable storage and resume.]
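On the rank side, recovery amounts to reloading the last committed checkpoint before re-establishing connections. A hedged sketch, with an illustrative file-naming scheme:

```c
#include <stdio.h>

static double app_state[1024];   /* stand-in for application memory */

/* Reload the state that rank `rank` saved for checkpoint `version`;
 * returns 0 on success, -1 if no usable checkpoint exists. */
int restore_checkpoint(int rank, int version) {
    char path[64];
    snprintf(path, sizeof path, "rank%d.v%d.ckpt", rank, version);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t ok = fread(app_state, sizeof app_state, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void) {
    if (restore_checkpoint(0, 1) == 0)
        printf("rank 0 restored version 1; re-establishing connections...\n");
    else
        printf("rank 0: no checkpoint, restarting from the beginning\n");
    return 0;
}
```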
Management System: MPICH-GF
Based on the Globus Toolkit 2. A hierarchical management system, suitable for multiple clusters; supports recovery from process, manager, and node failures. Limitations: it does not support recovery from multiple failures, and the Central Manager is a single point of failure.
Management System: FT-MPICH-GM and FT-MVAPICH
FT-MPICH-GM is a newer version that does not rely on the Globus Toolkit and drops the hierarchical structure (Myrinet/InfiniBand clusters no longer require it); it supports recovery from multiple failures. FT-MVAPICH is more robust still: it removes the single point of failure by electing a leader among the job managers (see the sketch below).
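The slides do not spell out the election algorithm, so the following is only a sketch of one simple rule that removes the single point of failure: the lowest-numbered manager that the failure detector still considers alive becomes the new leader. This is an assumption for illustration, not the published FT-MVAPICH protocol.

```c
#include <stdio.h>

#define NUM_MANAGERS 4

/* alive[i] reflects the failure detector's view of manager i. */
int elect_leader(const int alive[], int n) {
    for (int i = 0; i < n; i++)
        if (alive[i])
            return i;    /* lowest live id wins */
    return -1;           /* no survivors */
}

int main(void) {
    int alive[NUM_MANAGERS] = {0, 1, 1, 1};  /* manager 0 (old leader) failed */
    printf("new leader: manager %d\n", elect_leader(alive, NUM_MANAGERS));
    return 0;
}
```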
Fault-tolerant MPICH Variants - Layered Architecture
[Diagram: collective and point-to-point operations sit on top of the ADI (Abstract Device Interface) and are extended with an FT module comprising a recovery module, connection re-establishment, a checkpoint toolkit, and atomic message transfer. The same design is instantiated per device: MPICH-GF on Globus2 (Ethernet), FT-MPICH-GM on GM (Myrinet), and FT-MVAPICH on MVAPICH (InfiniBand).]
Future Work
We are working to incorporate our FT protocol into the GT-4 framework: MPICH-GF is GT-2 compliant, and the fault-tolerant management protocol is being ported to GT-4. We also want to make MPICH work across different clusters (Gigabit Ethernet, Myrinet, InfiniBand, as well as Open-MPI, VMI, etc.) and to support non-Intel CPUs such as the AMD Opteron.
GRID Issues
Who should be responsible for monitoring whether nodes are up or down, resubmitting failed processes, and allocating new nodes? Within Grid job management, the candidates are resource management, the scheduler, and health monitoring.