


1 Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH) Heon Y. Yeom Distributed Computing Systems Lab. Seoul National University

2 Contents: 1. Motivation, 2. Introduction, 3. Architecture, 4. Conclusion

3 Motivation  Hardware performance limitations keep being overcome in line with Moore's Law, and these cutting-edge technologies make "tera-scale" clusters feasible. But what about system reliability? Distributed systems are still fragile in the face of unexpected failures.

4 Motivation -High-performance Network Trend-  MPICH-G2 (Ethernet): good speed (1 Gbps), common, MPICH standard, demands fault-resilience. MPICH-GM (Myrinet): high speed (10 Gbps), popular, MPICH compatible, demands fault-resilience. MVAPICH (InfiniBand): high speed (up to 30 Gbps), will be popular, MPICH compatible, demands fault-resilience. All three trends call for a multiple fault-tolerant framework.

5 Introduction  Unreliability of distributed systems: even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure. Our goal is to construct a practical multiple-fault-tolerant framework for the various MPICH variants used on high-performance clusters and Grids.

6 Introduction  Why the Message Passing Interface (MPI)? Designing a generic fault-tolerance framework is extremely hard due to the diversity of hardware and software systems, so we chose the MPICH series. MPI is the most popular programming model in cluster computing, and providing fault-tolerance at the MPI level is more cost-effective than providing it in the OS or in hardware.

7 Architecture -Concept-  Monitoring, failure detection, a checkpoint/restart (C/R) protocol, and a consensus & election protocol together make up the multiple fault-tolerant framework.

8 Architecture -Overall System-  (Diagram.) Each node runs an MPI process with a communication module and other components. MPI processes communicate with each other over a high-speed network (Myrinet, InfiniBand), while the management system reaches every node over Gigabit Ethernet.

9 Architecture -Development History-  2003: MPICH-GF (fault-tolerant MPICH-G2, Ethernet). 2004: FT-MPICH-GM (fault-tolerant MPICH-GM, Myrinet). 2005-current: FT-MVAPICH (fault-tolerant MVAPICH, InfiniBand).

10 Management System  The Management System makes MPI more reliable by providing failure detection, checkpoint coordination, recovery, initialization coordination, output management, and checkpoint transfer.

11 Management System

12 Job Management System 1/2  The Job Management System manages and monitors multiple MPI processes and their execution environments. It should be lightweight; it helps the system take consistent checkpoints and recover from failures; and it has a fault-detection mechanism. Its two main components are the Central Manager and the local Job Manager.

13 Job Management System 2/2  The Central Manager manages all system functions and states; it detects node failures via periodic heartbeats and also detects Job Manager failures. The Job Manager relays messages between the Central Manager and the MPI processes, and detects unexpected MPI process failures.
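The heartbeat-based detection on this slide can be sketched in a few lines. This is an illustrative model only, not code from FT-MPICH; the class name, the timeout value, and the injectable clock are all assumptions made for the sketch:

```python
import time

class CentralManager:
    """Tracks the last heartbeat seen from each node and flags
    nodes whose heartbeat is overdue as failed (illustrative sketch)."""

    def __init__(self, timeout=3.0, clock=time.monotonic):
        self.timeout = timeout    # seconds without a heartbeat => suspected failure
        self.clock = clock        # injectable clock, so the sketch is testable
        self.last_seen = {}       # node id -> time of last heartbeat

    def heartbeat(self, node):
        """Called whenever a Job Manager's periodic heartbeat arrives."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the nodes whose heartbeats are overdue."""
        now = self.clock()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]

# Simulated run with a fake clock: node "n1" keeps sending, "n2" goes silent.
t = [0.0]
mgr = CentralManager(timeout=3.0, clock=lambda: t[0])
mgr.heartbeat("n1"); mgr.heartbeat("n2")
t[0] = 2.0; mgr.heartbeat("n1")   # n1 heartbeats again; n2 stays silent
t[0] = 4.0
print(mgr.failed_nodes())         # -> ['n2']
```

The same timeout rule applies one level down, where the Job Manager watches its local MPI processes.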

14 Fault-Tolerant MPI 1/3  To provide MPI fault-tolerance, we adopt: a coordinated checkpointing scheme (vs. an independent scheme), with the Central Manager as the coordinator; application-level checkpointing (vs. kernel-level checkpointing), which requires no effort on the part of cluster administrators; and a user-transparent checkpointing scheme (vs. user-aware), which requires no modification of MPI source code.
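The coordinated scheme above can be sketched as a two-step round: the coordinator broadcasts a checkpoint command, and the new checkpoint version is committed only after every rank acknowledges. A toy model under assumed names (Rank, Coordinator) rather than the actual FT-MPICH protocol:

```python
class Rank:
    """One MPI process; saves its state when told to checkpoint."""
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = 0
        self.saved = {}        # checkpoint version -> saved state

    def checkpoint(self, version):
        self.saved[version] = self.state
        return True            # acknowledgment back to the coordinator

class Coordinator:
    """Central Manager role: drives one coordinated checkpoint round."""
    def __init__(self, ranks):
        self.ranks = ranks
        self.committed = 0     # last globally consistent checkpoint version

    def checkpoint_round(self):
        version = self.committed + 1
        acks = [r.checkpoint(version) for r in self.ranks]  # broadcast command
        if all(acks):          # commit only if every rank checkpointed
            self.committed = version
        return self.committed

ranks = [Rank(i) for i in range(4)]   # rank0..rank3, as in the next slide
coord = Coordinator(ranks)
for r in ranks:
    r.state = 10 + r.rank_id          # some computation progress per rank
print(coord.checkpoint_round())       # -> 1
```

Because all ranks checkpoint under the same version number in one coordinated round, the set of saved states forms a consistent global snapshot, which is what makes rollback simple later.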

15 Fault-Tolerant MPI 2/3  Coordinated checkpointing (diagram): the Central Manager issues a checkpoint command to every rank (rank0-rank3); each rank writes its checkpoint to storage, producing globally consistent checkpoint versions (ver 1, then ver 2).

16 Fault-Tolerant MPI 3/3  Recovery from failures (diagram): upon failure detection, the Central Manager directs all ranks (rank0-rank3) to restart from the last consistent checkpoint version (ver 1) in storage.
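The rollback step shown in this diagram can be sketched as: every rank, including the restarted replacement for the failed one, reloads the state saved under the last committed checkpoint version, so the job resumes from a globally consistent cut. An illustrative sketch, not FT-MPICH code:

```python
class Rank:
    """One MPI process with access to its previously saved checkpoints."""
    def __init__(self, rank_id, saved):
        self.rank_id = rank_id
        self.saved = saved       # version -> state, written at checkpoint time
        self.state = None        # in-memory state, lost on failure

    def restart(self, version):
        """Roll back to the state saved under the given checkpoint version."""
        self.state = self.saved[version]

# Checkpoint version 1 was committed for all four ranks before one failed.
committed = 1
ranks = [Rank(i, saved={1: 100 + i}) for i in range(4)]
for r in ranks:                  # coordinator orders a global rollback
    r.restart(committed)
print([r.state for r in ranks])  # -> [100, 101, 102, 103]
```

Rolling back every rank, not just the failed one, is what the coordinated scheme buys: no in-transit message can cross the checkpoint line inconsistently.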

17 Management System  MPICH-GF: based on Globus Toolkit 2, with a hierarchical management system suitable for multiple clusters; it supports recovery from process, manager, and node failures. Limitations: it does not support recovery from multiple simultaneous failures, and the Central Manager is a single point of failure.

18 Management System  FT-MPICH-GM (new version): does not rely on the Globus Toolkit; removes the hierarchical structure, since Myrinet/InfiniBand clusters no longer require it; supports recovery from multiple failures.  FT-MVAPICH (more robust): removes the single point of failure by using leader election for the job manager.
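The slide names leader election but not the algorithm, so here is a minimal bully-style sketch of the idea (an assumption made for illustration): among the surviving job managers, the one with the highest id takes over as the new leader, which removes the old single point of failure.

```python
def elect_leader(alive_ids):
    """Bully-style rule (illustrative): the highest surviving id wins."""
    if not alive_ids:
        raise ValueError("no surviving job manager to elect")
    return max(alive_ids)

# Job managers 0..3 were running; manager 3, the old leader, has failed.
survivors = {0, 1, 2}
print(elect_leader(survivors))   # -> 2
```

Any deterministic rule all survivors agree on works here; the point is that the managers reach consensus on a unique successor without the failed leader's help.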

19 Fault-tolerant MPICH-variants  (Layered diagram.) The FT module, comprising a recovery module with connection re-establishment, a checkpoint toolkit, and atomic message transfer, sits above the ADI (Abstract Device Interface) and supports both collective and point-to-point operations. The three variants are MPICH-GF on Globus2 (Ethernet), FT-MPICH-GM on GM (Myrinet), and FT-MVAPICH on MVAPICH (InfiniBand).

20 Future Works  We are working to incorporate our FT protocol into the GT-4 framework: MPICH-GF is GT-2 compliant, so the fault-tolerant management protocol must be brought to GT-4. We also plan to make MPICH work with different clusters (Gig-E, Myrinet, InfiniBand; Open-MPI, VMI, etc.) and to support non-Intel CPUs such as AMD (Opteron).

21 GRID Issues  Who should be responsible for monitoring whether nodes are up or down, resubmitting failed processes, and allocating new nodes? GRID job management spans resource management, scheduling, and health monitoring.


