Simplifying the Recovery Model of User-Level Failure Mitigation
Wesley Bland
ExaMPI '14, New Orleans, LA, USA
November 17, 2014

Why Process Fault Tolerance?
• Two (main) kinds of resilience
– Data resilience (full backups and silent errors)
– Process resilience
• Data resilience is handled outside of MPI
– Checkpoint/Restart, FTI, SCR, GVR, Containment Domains, etc.
• Process fault tolerance must be at least partially handled within MPI
– Must recover communication channels

What has been done before?
• Checkpoint/Restart
– Requires restarting the entire job and waiting in the job queue
• FT-MPI (University of Tennessee)
– Automatic communicator repair, no customization
– Support dropped and never presented to the MPI Forum
• FA-MPI (University of Alabama at Birmingham)
– Try/catch semantics to simulate transactions
– Detects errors via timeouts
• Run-Through Stabilization (MPI Forum)
– Similar to ULFM, but too big to not fail

User-Level Failure Mitigation (ULFM)
• Handles (fail-stop) process failures
• A general resilience framework designed to support a wide range of recovery models
• Encourages libraries to build more user-friendly resilience on top of ULFM
• Failure notification via return codes and/or MPI_Errhandlers
– Local failure notification to maintain performance
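As a concrete illustration of the notification model, here is a minimal sketch of an error handler that records a failure instead of aborting. It assumes a ULFM-capable MPI: the MPIX_ERR_PROC_FAILED error class is provided by the MPICH and UTK Open MPI prototypes (Open MPI declares the MPIX_ symbols in mpi-ext.h), and install_fault_handler is a name invented here for illustration.

#include <mpi.h>
/* Assumption: a ULFM-capable MPI defining MPIX_ERR_PROC_FAILED
 * (declared in <mpi-ext.h> in the UTK Open MPI branch). */

static int failure_seen = 0;   /* set when a process failure is observed */

/* Record the failure locally and return, rather than aborting the job. */
static void fault_handler(MPI_Comm *comm, int *errcode, ...)
{
    (void)comm;                /* unused in this sketch */
    if (*errcode == MPIX_ERR_PROC_FAILED)
        failure_seen = 1;
}

void install_fault_handler(MPI_Comm comm)
{
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(fault_handler, &eh);
    MPI_Comm_set_errhandler(comm, eh);
    MPI_Errhandler_free(&eh);  /* the communicator keeps a reference */
}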

ULFM Continued
• Failure propagation happens manually, when necessary
– MPI_COMM_REVOKE
– All communication operations on the communicator are interrupted
• A new function call creates a replacement communicator without the failed processes
– MPI_COMM_SHRINK
• The group of failed processes is available via API calls

[Diagram: rank 1 fails; rank 0's MPI_Recv returns MPI_ERR_PROC_FAILED (failure notification); the communicator is revoked (failure propagation); MPI_COMM_SHRINK() builds the replacement (failure recovery)]
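A minimal sketch of that propagate-then-recover sequence, assuming the MPIX_-prefixed bindings used by the MPICH and UTK Open MPI prototypes:

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ prototypes in the UTK Open MPI branch */

/* After a failure is noticed locally: interrupt everyone else, fetch
 * the group of failed ranks, and build a replacement communicator. */
int repair_communicator(MPI_Comm comm, MPI_Comm *newcomm, MPI_Group *failed)
{
    /* Propagation: pending and future operations on comm now
     * complete with an MPIX_ERR_REVOKED error at every rank. */
    MPIX_Comm_revoke(comm);

    /* Acknowledge the failures we know about and get their group. */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, failed);

    /* Recovery: collectively create a communicator containing only
     * the surviving processes. */
    return MPIX_Comm_shrink(comm, newcomm);
}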

FT Agreement
• Collective over a communicator
• Allows processes to determine the status of an entire communicator in one call
– Bitwise AND over an integer value
• Ignores (acknowledged) failed processes
– Provides notification for unacknowledged failures
• Works on a revoked communicator
• Useful for “validating” previous work
– Communicator creation
– Application iteration(s)
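For example, a per-phase validation helper can be built directly on the agreement call (MPIX_Comm_agree in the prototype implementations):

#include <mpi.h>
#include <mpi-ext.h>

/* Returns nonzero only if every surviving process reports local_ok.
 * MPIX_Comm_agree combines 'flag' with a bitwise AND across all live
 * processes; if it observes a new, unacknowledged failure it returns
 * an error, and we count the phase as failed. */
int phase_succeeded(MPI_Comm comm, int local_ok)
{
    int flag = local_ok ? 1 : 0;
    int rc = MPIX_Comm_agree(comm, &flag);
    return (rc == MPI_SUCCESS) && flag;
}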

ULFM in MPICH
• Experimentally available in 3.2a2; planned for the full MPICH 3.2 release
– Still some known bugs
• Un-optimized implementation
• Some settings must be enabled for complete usage
– Configure flag: --enable-error-checking
– Environment variable: MPIR_CVAR_ENABLE_FT
• Also available in the Open MPI branch from UTK

Why is ULFM challenging?
• Feedback from users and developers:
– Challenging to track the state of an entire legacy application
– Checking all return codes / using error handlers requires lots of code changes
• Much of this is not wrong
– ULFM has been designed to be flexible, not simple
– Eventually, applications should use libraries for FT rather than ULFM itself (similar to how MPI itself was originally envisioned)
• However… all is not lost

BSP Applications
• A very common class of applications
• Typically involve some sort of iterative method
– Each iteration ends with a collective operation to decide what to do next
• Commonly already have checkpoint/restart built in for current-generation systems
– Already protecting data, and already have functions to regenerate communicators

ULFM for BSP
• Take advantage of existing synchrony
• Perform agreement on the result of the ending collective
• If there’s a problem, repair the communicator and data and re-execute
• Avoid restarting the job
– Sitting in the queue, restarting the job, re-reading from disk, etc.

while ( ... ) {
    ... application code ...
    rc = MPI_Allreduce(value)
    if (rc == MPI_SUCCESS) success = 1
    rc = MPI_Comm_agree(success)
    if (!success || rc != MPI_SUCCESS) repair()
}
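Fleshed out into compilable C, the loop above might look like the following sketch; repair() and converged() are hypothetical application-supplied helpers, and MPIX_Comm_agree is the prototype binding of the agreement call.

#include <mpi.h>
#include <mpi-ext.h>

void repair(MPI_Comm *comm);    /* hypothetical: restore comm and data */
int  converged(double result);  /* hypothetical: convergence test */

void bsp_loop(MPI_Comm comm)
{
    int done = 0;
    while (!done) {
        double value = 0.0, result = 0.0;

        /* ... application code for one iteration, producing value ... */

        int rc = MPI_Allreduce(&value, &result, 1, MPI_DOUBLE,
                               MPI_SUM, comm);
        int success = (rc == MPI_SUCCESS);

        /* Agree on one global verdict for the iteration. */
        rc = MPIX_Comm_agree(comm, &success);
        if (rc != MPI_SUCCESS || !success) {
            repair(&comm);   /* replace comm, restore protected data */
            continue;        /* re-execute the failed iteration */
        }
        done = converged(result);
    }
}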

Recovery Model
• Communication Recovery
– Dynamic processes (+ batch support): Shrink + Spawn
– No dynamic processes: start the job with extra processes and rearrange after a failure
• Data Recovery
– Full checkpoint/restart: already built into many apps, but high recovery and checkpointing overhead
– Application-level checkpointing: low overhead, but possible code modifications required

[Diagram: Failure → Shrink → Spawn]
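A hedged sketch of the Shrink + Spawn path under the dynamic-process model, using the prototype MPIX_ bindings; error handling is omitted for brevity, and shrink_and_respawn is a name invented here for illustration.

#include <mpi.h>
#include <mpi-ext.h>

/* Survivor-side sketch of Shrink + Spawn recovery. Newly spawned
 * processes would obtain the intercommunicator with
 * MPI_Comm_get_parent and call MPI_Intercomm_merge symmetrically. */
int shrink_and_respawn(MPI_Comm comm, const char *command,
                       MPI_Comm *repaired)
{
    MPI_Comm shrunk, inter;
    int old_size, new_size;

    /* A communicator's size still counts failed ranks, so the
     * difference after shrinking is the number of replacements. */
    MPI_Comm_size(comm, &old_size);
    MPIX_Comm_shrink(comm, &shrunk);
    MPI_Comm_size(shrunk, &new_size);

    MPI_Comm_spawn(command, MPI_ARGV_NULL, old_size - new_size,
                   MPI_INFO_NULL, 0, shrunk, &inter,
                   MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(inter, 0, repaired);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&shrunk);
    return MPI_SUCCESS;
}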

Code Intrusiveness
• Add 2–3 lines to check for failures
– Perform agreement to determine status
• Add a function to repair the application
– If the agreement says there’s a failure, call the repair function
– Repair the data and communicator(s)

Monte Carlo Communication Kernel (MCCK)
• Mini-app from the Center for Exascale Simulation of Advanced Reactors (CESAR)
• Monte Carlo, domain-decomposition stencil code
• Investigates communication costs of particle tracking

ULFM Modifications for MCCK

Application-Level Checkpointing
• Checkpoint particle positions to disk
• On failure, restart the job
– No job queue to introduce delay
• All processes restore data from disk together

User-Level Failure Mitigation
• Particle positions exchanged with neighbors
• On failure, call the repair function
– Shrink the communicator
– Spawn a replacement process
– Restore missing data from a neighbor
• No disk contention

Performance
• Ran on the Fusion cluster at Argonne
– 2 Xeons per node (8 cores)
– 36 GB memory per node
– QDR InfiniBand
• Experiments up to 1,024 processes
• 100,000 particles per process
• Checkpoints taken every 1, 3, and 5 iterations

Overall Runtime

Overhead

What’s next for ULFM?
• Fault tolerance libraries
– Incorporate data and communication recovery
– Abstract details away from user applications
• Working on standardization in the MPI Forum

Questions?
• MPICH BoF – Tuesday, 5:30 PM – Room
• MPI Forum BoF – Wednesday, 5:30 PM – Room 293