Fault Tolerance in MPI Programs


By: Venkatesh Kumar Auzumeedi. CS 523 Parallel Computing.

Introduction
This talk covers fault tolerance in general and what the MPI Standard has to say about it, then surveys several approaches to the problem: checkpointing, FT-MPI, and tool support. MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

The MPI Standard and Fault Tolerance
Reliable communication: the MPI implementation is responsible for detecting and handling network faults. "Handling" means either recovering from the error through retransmission of the message or informing the application that an error has occurred, allowing the application to take its own corrective action. Under no circumstances should an MPI application or library need to verify the integrity of the data it receives.

Error Handlers
MPI_ERRORS_ARE_FATAL specifies that if an MPI function returns unsuccessfully, all the processes in the communicator abort. MPI_ERRORS_RETURN specifies that MPI functions will attempt (whether the attempt succeeds may be implementation dependent) to return an error code instead. Users can also define their own error handlers and attach them to communicators. This ability to define an application-specific error handler is important for the approach to fault tolerance described here, since it lets a library or subsystem handle errors itself even while the rest of the application still uses MPI_ERRORS_ARE_FATAL.
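A minimal sketch of attaching an error handler using the standard MPI calls; the handler body and its message are illustrative, not prescribed by the standard:

    #include <mpi.h>
    #include <stdio.h>

    /* Application-specific handler: report the error and keep running. */
    static void report_handler(MPI_Comm *comm, int *errcode, ...)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(*errcode, msg, &len);
        fprintf(stderr, "MPI error caught: %s\n", msg);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler eh;
        MPI_Init(&argc, &argv);

        /* Option 1: have MPI calls return error codes instead of aborting. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* Option 2: attach a user-defined handler to the communicator. */
        MPI_Comm_create_errhandler(report_handler, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

        /* ... application code ... */

        MPI_Errhandler_free(&eh);
        MPI_Finalize();
        return 0;
    }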

Errors
When MPI_ERRORS_RETURN is the active error handler, a set of error codes is predefined, and implementations are allowed to extend this set. If an error is returned, the standard requires neither that subsequent operations succeed nor that they fail. The standard thus allows implementations to take various approaches to the fault tolerance issue and to trade performance against fault tolerance to meet user needs.
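A short sketch of decoding a returned error code with the standard MPI_Error_class and MPI_Error_string calls (buf, count, dest, tag, rank, and comm are assumed to be defined elsewhere):

    int rc, eclass, len;
    char msg[MPI_MAX_ERROR_STRING];

    rc = MPI_Send(buf, count, MPI_INT, dest, tag, comm);
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);     /* map the code to a standard error class */
        MPI_Error_string(rc, msg, &len);  /* get a human-readable description */
        fprintf(stderr, "rank %d: send failed (class %d): %s\n", rank, eclass, msg);
    }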

Failure Modes
On detecting a failure within a communicator, that communicator is marked as having a (possible) error. As soon as this occurs, the underlying system sends a state update to all other processes involved in that communicator. If the error was a communication error, not all communicators are forced to be updated; if it was a process exit, then all communicators that include that process are changed.

Communicator and communication handling
Once a communicator is in an error state, it can only recover by being rebuilt, using a modified version of one of the MPI communicator build functions such as MPI_Comm_{create, split, dup}. The new communicator follows certain semantics depending on its failure mode:
REBUILD: forces the creation of new processes to fill any gaps. The new processes can either be placed into the empty ranks, or the communicator can be shrunk and the new processes added at the end.
ABORT: a mode that takes effect as soon as an error is detected and forces a graceful abort. The user cannot trap this; the only option is to change the communicator to one of the other modes.
SHRINK: the communicator is shrunk so that there are no holes in its data structures.

FT-MPI usage example
Typical usage of FT-MPI takes the form of an error check followed by some corrective action, such as a communicator rebuild. A typical code fragment looks like:

    rc = MPI_Send(..., com);
    if (rc == MPI_ERR_OTHER) {
        MPI_Comm_dup(com, &newcom);
        com = newcom;
    }
    /* continue.. */
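A slightly fuller sketch of the same pattern. Note that rebuilding a failed communicator through MPI_Comm_dup is FT-MPI's extension, not standard MPI behavior, and buf, count, dest, and tag are placeholders:

    MPI_Comm com = MPI_COMM_WORLD, newcom;
    int rc;

    /* Errors must be returned rather than fatal for recovery to be possible. */
    MPI_Comm_set_errhandler(com, MPI_ERRORS_RETURN);

    rc = MPI_Send(buf, count, MPI_INT, dest, tag, com);
    if (rc == MPI_ERR_OTHER) {
        /* Under FT-MPI's REBUILD or SHRINK mode, duplicating the failed
           communicator yields a repaired one. */
        MPI_Comm_dup(com, &newcom);
        com = newcom;
        /* re-establish any application state, then retry or continue */
    }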

Approaches to Fault Tolerance in MPI Programs
Checkpointing: checkpointing is a common technique that periodically saves the state of a computation, allowing the computation to be restarted from that point in the event of a failure. For a small probability of failure and relatively small costs of creating and restoring a checkpoint, the added cost of using checkpoints is quite modest.
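A minimal sketch of application-level checkpointing in an MPI program; the file naming, state layout, and restart convention below are illustrative assumptions, not part of any standard:

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank writes its local state to its own file. */
    static void checkpoint(int rank, int iter, const double *state, int n)
    {
        char name[64];
        FILE *f;
        snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
        f = fopen(name, "wb");
        if (!f) return;                       /* skip this checkpoint on I/O failure */
        fwrite(&iter, sizeof(int), 1, f);
        fwrite(state, sizeof(double), (size_t)n, f);
        fclose(f);
    }

    /* Returns the iteration to resume from, or 0 if no usable checkpoint exists. */
    static int restore(int rank, double *state, int n)
    {
        char name[64];
        int iter = 0;
        FILE *f;
        snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
        f = fopen(name, "rb");
        if (f) {
            if (fread(&iter, sizeof(int), 1, f) != 1 ||
                fread(state, sizeof(double), (size_t)n, f) != (size_t)n)
                iter = 0;                     /* treat a short read as "no checkpoint" */
            fclose(f);
        }
        return iter;
    }

In the main loop, each rank would call restore once at startup and checkpoint every N iterations, typically after an MPI_Barrier so that all ranks save a mutually consistent state.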

FT-MPI Tool support
Hostinfo displays the state of the Virtual Machine. Cominfo displays processes and communicators in a color-coded fashion so that users know the state of an application's processes and communicators. Both tools are currently built using the X11 libraries, which are designed as a platform-independent, networked graphics framework, but will be rebuilt using the Java Swing toolkit to aid portability. The slides include example displays during a SHRINK communicator rebuild operation.

Cominfo use
Cominfo display for a healthy three-process MPI application: the colors of the inner boxes indicate the state of the processes, and the outer box indicates the communicator state.
Cominfo display for an application with an exited process: in this case the rank 1 process has exited. Note that the communicator is marked as having an error and that the number of processes and the size of the communicator now differ.
Cominfo display for the application after a communicator rebuild using the SHRINK option: note that the communicator status box has changed back to a blue (dark) color.

Conclusion
FT-MPI is an attempt to provide application programmers with different methods of dealing with failure within an MPI application. We have looked at what the MPI Standard provides in the way of support for writing fault-tolerant programs, and we have considered several approaches to doing so.

References
R. Batchu, J. Neelamegam, Z. Cui, M. Beddhu, A. Skjellum, Y. Dandass, and M. Apte. MPI/FT (TM): Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In Proceedings of the 1st IEEE International Symposium on Cluster Computing and the Grid, Melbourne, 2001.
George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, and Anton Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Proceedings of SC 2002. IEEE, 2002.
R. Butler, W. Gropp, and E. Lusk. Components and interfaces of a process management system for parallel programs. Parallel Computing, 27:1417–1429, 2001.
G. E. Fagg, A. Bukovsky, and J. J. Dongarra. HARNESS and fault tolerant MPI. Parallel Computing, 27(11):1479–1495, October 2001.
Graham Fagg and Jack Dongarra. Fault-tolerant MPI: Supporting dynamic applications in a dynamic world. In Jack Dongarra, Peter Kacsuk, and Norbert Podhorszki, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number 1908 in Springer Lecture Notes in Computer Science, pages 346–353, 2000. 7th European PVM/MPI Users' Group Meeting.
Graham E. Fagg and Jack J. Dongarra. Building and using a fault tolerant MPI implementation. To appear in the International Journal of High Performance Computer Applications and Supercomputing.
Al Geist and Christian Engelmann. Development of naturally fault tolerant algorithms for computing on 100,000 processors, 2002.
J. Gray and A. Reuter. Transaction Processing. Morgan Kaufmann Publishers, San Mateo (CA), USA, 1993.
W. Gropp and E. Lusk. Dynamic process management in an MPI setting. In Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, October 25–28, 1995, San Antonio, Texas, pages 530–534. IEEE Computer Society Press, 1995.
William Gropp and Ewing Lusk. Dynamic process management in an MPI setting. http://www.mcs.anl.gov/~gropp/bib/papers/.
http://www.shodor.org/media/content//petascale/materials/distributedMemory/presentations/MPI_Error_Example.pdf