Download presentation
Presentation is loading. Please wait.
Published byRandall Miller Modified over 9 years ago
1
Lessons Learned Implementing User-Level Failure Mitigation in MPICH Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji Argonne National Laboratory User-level failure mitigation (ULFM) is becoming the front- running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high- performance and widely portable implementation of the MPI standard. We demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low. 0 0 1 MPI_Recv MPI_ERR_PROC_FAILE D Failure Notification Failure Propagation ULFM introduces semantics to define failure notification, propagation, and recovery within MPI. For failure notification, ULFM uses traditional error codes and returns a new error, MPI_ERR_PROC_FAILED. For failure propagation, MPI_COMM_REVOKE will notify all processes of a failure. For failure recovery, MPI_COMM_SHRINK can be used to create a new communicator. To assist with the primary functionality, there are also functions to detect failure ( MPI_COMM_FAILURE_ACK, MPI_COMM_FAILURE_GET_ACKED ) and for distributed, fault- tolerant agreement ( MPI_COMM_AGREE ). Failure Detection Local failures detected by Hydra and netmods Error codes are returned back to the user from the API calls Agreement Uses two group-based allreduce operations If either fails, an error is returned to the user Shrinking All processes construct consistent group of failed procs via allreduce Uses MPI_COMM_CREATE_GROUP internally to create communicator If failure is detected, the shrink is started over Revocation Non-optimized implementation done with message flood Internally marked as done when a messages is received from all other processes Internal refcount kept until done ULFM Evaluation ULFM Implementation Agreement Shrinking Performance with any failures is similar due to the lack of leave-early semantics. Failure-free performance is much better. Acknowledging failures also does not significantly improve the runtime because all failure detection code must still be executed. Slower than MPI_COMM_DUP because of failure detection. As expected, introducing failures in the middle of the algorithm causes large runtime increase as the algorithm must restart. Failure Recovery MPI_COMM_SHRINK() What is ULFM? Traditional fault tolerance solutions require the entire application to synchronize and save its state to disc. In the event of a failure, all processes have to abort and roll back to the previously saved state. ULFM allows the application to either replace to failed process or continue with fewer processes. Why is this better than Checkpoint/Restart? Argonne National Laboratory is a U.S. Department of Energy laboratory managed by U Chicago Argonne, LLC. This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided on Fusion, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.