Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji
Argonne National Laboratory

Presentation transcript:

User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that, while still a reference implementation, the runtime cost of the new API calls it introduces is relatively low.

What is ULFM?
[Figure: an MPI_Recv call returning MPI_ERR_PROC_FAILED, illustrating failure notification and failure propagation.]

ULFM introduces semantics to define failure notification, propagation, and recovery within MPI. For failure notification, ULFM uses traditional error codes and returns a new error, MPI_ERR_PROC_FAILED. For failure propagation, MPI_COMM_REVOKE notifies all processes of a failure. For failure recovery, MPI_COMM_SHRINK can be used to create a new communicator containing only the surviving processes. To assist with this primary functionality, there are also functions to acknowledge and query detected failures (MPI_COMM_FAILURE_ACK, MPI_COMM_FAILURE_GET_ACKED) and to perform distributed, fault-tolerant agreement (MPI_COMM_AGREE).

Why is this better than Checkpoint/Restart?
Traditional fault tolerance solutions require the entire application to synchronize and save its state to disk. In the event of a failure, all processes have to abort and roll back to the previously saved state. ULFM instead allows the application either to replace the failed process or to continue with fewer processes.

ULFM Implementation

Failure Detection
- Local failures are detected by Hydra and the netmods.
- Error codes are returned to the user from the API calls.

Agreement
- Uses two group-based allreduce operations.
- If either fails, an error is returned to the user.

Shrinking
- All processes construct a consistent group of failed processes via an allreduce.
- Uses MPI_COMM_CREATE_GROUP internally to create the new communicator.
- If a failure is detected during the operation, the shrink is started over.

Revocation
- Non-optimized implementation done with a message flood.
- Internally marked as done when a message has been received from all other processes.
- An internal reference count is kept until the revoke completes.

ULFM Evaluation

Agreement
Performance with failures is similar regardless of how many processes fail, because the algorithm lacks leave-early semantics; failure-free performance is much better. Acknowledging failures also does not significantly improve the runtime, because all of the failure-detection code must still be executed.

Shrinking
Slower than MPI_COMM_DUP because of the failure detection. As expected, introducing failures in the middle of the algorithm causes a large increase in runtime because the algorithm must restart.

Failure Recovery
[Figure: recovering from a process failure by creating a new communicator with MPI_COMM_SHRINK().]

Argonne National Laboratory is a U.S. Department of Energy laboratory managed by UChicago Argonne, LLC. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Fusion, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
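To make the notification/propagation/recovery cycle described above concrete, here is a minimal sketch (not taken from the poster) of how an application might drive those calls. It uses the MPIX_ spellings exposed by the MPICH and Open MPI ULFM prototypes; the poster refers to the same operations by their proposed MPI_ names, and some implementations additionally require #include <mpi-ext.h> for the MPIX_ prototypes. The iteration structure and error handling are illustrative assumptions, not the paper's benchmark code.

```c
#include <mpi.h>

/* Map an MPI return code to its error class for comparison. */
static int error_class(int rc)
{
    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);
    return eclass;
}

int main(int argc, char **argv)
{
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    /* Errors must be returned (not aborted) for ULFM to be usable. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    for (int iter = 0; iter < 100; iter++) {
        int token = iter;
        /* Failure notification: a failed peer surfaces as an error code. */
        int rc = MPI_Bcast(&token, 1, MPI_INT, 0, comm);

        if (error_class(rc) == MPIX_ERR_PROC_FAILED ||
            error_class(rc) == MPIX_ERR_REVOKED) {
            /* Failure propagation: make sure every survivor leaves this step. */
            MPIX_Comm_revoke(comm);

            /* Failure recovery: continue with fewer processes. */
            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm, &shrunk);
            MPI_Comm_free(&comm);
            comm = shrunk;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

            iter--;            /* redo the interrupted iteration */
            continue;
        }

        /* Fault-tolerant agreement: decide collectively whether to go on
         * (MPIX_Comm_agree computes a bitwise AND of 'ok' over survivors). */
        int ok = (rc == MPI_SUCCESS);
        MPIX_Comm_agree(comm, &ok);
        if (!ok)
            iter--;            /* someone saw a failure: redo after recovery */
    }

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```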
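The acknowledgment calls mentioned above can also be used to find out which ranks have failed, for example before respawning replacements or redistributing work. The following is a small sketch of that idiom under the same assumptions (MPIX_ spellings; the translation through the communicator's group is an illustration, not code from the poster):

```c
#include <mpi.h>
#include <stdio.h>

/* Print the ranks of 'comm' that have been detected as failed so far. */
void report_failed_ranks(MPI_Comm comm)
{
    MPI_Group comm_grp, failed_grp;
    int nfailed;

    /* Acknowledge locally detected failures, then retrieve them as a group. */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed_grp);
    MPI_Group_size(failed_grp, &nfailed);

    MPI_Comm_group(comm, &comm_grp);
    for (int i = 0; i < nfailed; i++) {
        int in_failed = i, in_comm;
        /* Translate rank i of the failed group to its rank in comm. */
        MPI_Group_translate_ranks(failed_grp, 1, &in_failed, comm_grp, &in_comm);
        printf("rank %d of the communicator has failed\n", in_comm);
    }

    MPI_Group_free(&comm_grp);
    MPI_Group_free(&failed_grp);
}
```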