Fault Tolerant Runtime. Wesley Bland, Argonne National Laboratory (ANL). LBL Visit, 3/4/14

Exascale Trends
• Power/energy is a known issue for Exascale
–1 Exaflop of performance in a 20 MW power envelope
–We are currently at 34 Petaflops using 17 MW (Tianhe-2; not including cooling)
–Need a 25x increase in power efficiency to get to Exascale
–Unfortunately, power/energy and faults are strong duals: it is almost impossible to improve one without affecting the other
• Exascale-driven trends in technology
–Packaging density: data movement is a problem for performance/energy, so we can expect processing units, memory, network interfaces, etc., to be packaged as close to each other as vendors can get away with. High packaging density also means more heat, more leakage, and more faults.
–Near-threshold-voltage operation: circuitry will be operated with as little power as possible, which means bit flips and errors are going to become more common.
–IC verification: a growing gap between silicon capability and verification ability, with large amounts of dark silicon added to chips that cannot be effectively tested in all configurations.

Hardware Failures
• We expect future hardware to fail
• We don't expect full node failures to be as common as partial node failures
–The only thing that really causes full node failure is power failure
–Specific hardware will become unavailable
–How do we react to the failure of a particular portion of the machine?
–Is our reaction the same regardless of the failure?
• Low power thresholds will make this problem worse
[Image: Cray XK7]

Current Failure Rates
• Memory errors
–Jaguar: an uncorrectable ECC error every 2 weeks
• GPU errors
–LANL (Fermi): 4% of 40K runs have bad residuals
• Unrecoverable hardware errors
–Schroeder & Gibson (CMU): up to two failures per day on LANL machines

What protections do we have?
• There are existing hardware mechanisms that protect us from some failures
• Memory
–ECC
–2D error coding (usually too expensive to implement)
• CPU
–Instruction caches are more resilient than data caches
–Exponents are more reliable than mantissas
• Disks
–RAID

What can we do in software?
• Hardware won't fix everything for us
–Some problems will be too hard, too expensive, or too power hungry to fix in hardware
• For these problems we have to move to software
• Four different ways of handling failures in software:
1. Processes detect failures and all processes abort
2. Processes detect failures and only abort where errors occur
3. Other mechanisms detect failures and recover within existing processes
4. Failures are not explicitly detected, but are handled automatically

Processes detect failures and all processes abort
• Well studied already
–Checkpoint/restart
• People are studying improvements now
–Scalable Checkpoint/Restart (SCR)
–Fault Tolerance Interface (FTI)
• The default model of MPI up to now (version 3.0)

Processes detect failures and only abort where errors occur
• Some processes will remain around while others won't
• Need to consider how to repair many parts of the application
–Communication library (MPI)
–Computational capacity (if necessary)
–Data (C/R, ABFT, natural fault tolerance, etc.)

MPIXFT
• MPI-3-compatible recovery library
• Automatically repairs MPI communicators as failures occur
–Handles running in an n-1 model
• Virtualizes the MPI communicator
–The user gets an MPIXFT wrapper communicator
–On failure, the underlying MPI communicator is replaced with a new, working communicator
[Diagram: an MPIXFT_COMM wrapper around the underlying MPI_COMM]

MPIXFT Design
• Possible because of new MPI-3 capabilities (see the sketch below)
–Non-blocking equivalents for (almost) everything
–MPI_COMM_CREATE_GROUP
[Diagram: a failure-notification request (MPI_IBARRIER) is completed alongside application requests such as MPI_ISEND() via MPI_WAITANY; the communicator is rebuilt with COMM_CREATE_GROUP()]
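The key MPI-3 building block can be illustrated with plain standard calls. The following is a minimal sketch, not the MPIXFT implementation: it builds a replacement communicator that excludes a given rank using MPI_COMM_CREATE_GROUP, which, unlike MPI_COMM_CREATE, only has to be called by the members of the new group. The function name rebuild_without and the way dead_rank is discovered are assumptions for illustration.

#include <mpi.h>

/* Minimal sketch: rebuild a communicator without a given rank using
 * MPI-3's MPI_Comm_create_group. How "dead_rank" is discovered is left
 * to a recovery layer; MPIXFT's actual machinery is more involved. */
int rebuild_without(MPI_Comm comm, int dead_rank, MPI_Comm *newcomm)
{
    MPI_Group all, survivors;
    int err;

    MPI_Comm_group(comm, &all);
    MPI_Group_excl(all, 1, &dead_rank, &survivors);

    /* Collective only over the processes in "survivors": unlike
     * MPI_Comm_create, the excluded rank does not need to participate. */
    err = MPI_Comm_create_group(comm, survivors, /*tag=*/255, newcomm);

    MPI_Group_free(&all);
    MPI_Group_free(&survivors);
    return err;
}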

MPIXFT Results
• MCCK mini-app
–Domain decomposition communication kernel
–Overhead within the standard deviation
• Halo exchange (1D, 2D, 3D)
–Up to 6 outstanding requests at a time
–Very low overhead

User Level Failure Mitigation (ULFM)
• Proposed change to the MPI Standard for MPI-4
• Repair MPI after process failure
–Enables more customized recovery than MPIXFT
• Doesn't pick a particular recovery technique as better or worse than others
• Introduces minimal changes to MPI
• Treats process failures as fail-stop failures
–Transient failures are masked as fail-stop

TL;DR
• 5(ish) new functions (plus some non-blocking, RMA, and I/O equivalents)
–MPI_COMM_FAILURE_ACK / MPI_COMM_FAILURE_GET_ACKED: provide information about who has failed
–MPI_COMM_REVOKE: provides a way to propagate failure knowledge to all processes in a communicator
–MPI_COMM_SHRINK: creates a new communicator without failures from a communicator with failures
–MPI_COMM_AGREE: agreement algorithm to determine application completion, collective success, etc.
• 3 new error classes
–MPI_ERR_PROC_FAILED: a process has failed somewhere in the communicator
–MPI_ERR_REVOKED: the communicator has been revoked
–MPI_ERR_PROC_FAILED_PENDING: a failure somewhere prevents the request from completing, but the request is still valid

Failure Notification
• Failure notification is local
–Notification of a failure for one process does not mean that all other processes in a communicator have also been notified
• If a process failure prevents an MPI function from returning correctly, it must return MPI_ERR_PROC_FAILED (checking for this is sketched below)
–If the operation can return without an error, it should (e.g., point-to-point with non-failed processes)
–Collectives might have inconsistent return codes across the ranks (e.g., MPI_REDUCE)
• Some operations will always have to return an error:
–MPI_ANY_SOURCE receives
–MPI_ALLREDUCE / MPI_ALLGATHER / etc.
• Special return code for MPI_ANY_SOURCE
–MPI_ERR_PROC_FAILED_PENDING
–The request is still valid and can be completed later (after the acknowledgement on the next slide)
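These error classes are only visible if the application replaces the default error handler, which aborts. The following is a minimal sketch assuming a ULFM-capable MPI; implementations may spell the constants with an MPIX_ prefix (e.g., MPIX_ERR_PROC_FAILED), and recv_with_failure_check is a hypothetical helper name.

#include <mpi.h>
#include <stdio.h>

/* Sketch: opt in to seeing failure errors instead of aborting, then
 * classify the result of a receive. */
void recv_with_failure_check(void *buf, int count, int src, MPI_Comm comm)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);  /* don't abort on error */

    int rc = MPI_Recv(buf, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_PROC_FAILED)
            fprintf(stderr, "a named peer has failed\n");
        else if (eclass == MPI_ERR_PROC_FAILED_PENDING)
            fprintf(stderr, "a failure is pending on a wildcard receive\n");
        /* other classes: handle or propagate as appropriate */
    }
}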

Failure Notification
• To find out which processes have failed, use the two-phase functions (sketched below):
–MPI_Comm_failure_ack(MPI_Comm comm)
Internally "marks" the group of processes which are currently locally known to have failed (useful for MPI_COMM_AGREE later)
Re-enables MPI_ANY_SOURCE operations on the communicator now that the user knows about the failures (continuing old MPI_ANY_SOURCE requests or starting new ones)
–MPI_Comm_failure_get_acked(MPI_Comm comm, MPI_Group *failed_grp)
Returns an MPI_GROUP containing the processes which were marked by the previous call to MPI_COMM_FAILURE_ACK
Will always return the same set of processes until FAILURE_ACK is called again
• Must be careful to check that wildcards should continue before starting/restarting an operation
–Don't enter a deadlock because the failed process was supposed to send a message
• Future MPI_ANY_SOURCE operations will not return errors unless a new failure occurs
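A minimal sketch of the two-phase query, assuming the function names from the ULFM proposal (implementations may use an MPIX_ prefix); report_failed_ranks is a hypothetical helper name.

#include <mpi.h>
#include <stdio.h>

/* Sketch: acknowledge currently-known failures on comm, then translate the
 * acknowledged group into ranks of comm so the application can see who is gone. */
void report_failed_ranks(MPI_Comm comm)
{
    MPI_Group failed, comm_grp;
    int nfailed, i;

    MPI_Comm_failure_ack(comm);                 /* mark known failures, re-arm ANY_SOURCE */
    MPI_Comm_failure_get_acked(comm, &failed);  /* group of acknowledged failures */

    MPI_Group_size(failed, &nfailed);
    MPI_Comm_group(comm, &comm_grp);

    for (i = 0; i < nfailed; i++) {
        int in_failed = i, in_comm;
        MPI_Group_translate_ranks(failed, 1, &in_failed, comm_grp, &in_comm);
        printf("rank %d of the communicator has failed\n", in_comm);
    }

    MPI_Group_free(&comm_grp);
    MPI_Group_free(&failed);
}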

Recovery with only notification: Master/Worker Example
• Post work to multiple processes
• MPI_Recv returns an error due to the failure
–MPI_ERR_PROC_FAILED if named
–MPI_ERR_PROC_FAILED_PENDING if wildcard
• The master discovers which process has failed with ACK/GET_ACKED
• The master reassigns the work to worker 2 (sketched below)
[Diagram: timeline of Master, Worker 1, Worker 2, Worker 3: Send, Recv, Discovery, Send]
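A sketch of the master's side of this recovery, for illustration only. It assumes ULFM-proposal names (possibly MPIX_-prefixed in implementations), a rank-0 master with one work item per worker, and trivial bookkeeping; a real master would map the failed rank back to its outstanding item rather than hard-coding it.

#include <mpi.h>

/* Sketch: the master posts one item per worker, collects results with a
 * wildcard receive, and, when a receive fails, acknowledges the failure and
 * reassigns the lost item to worker 2 (as in the slide's example). */
void master_loop(MPI_Comm comm, int nworkers)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    /* Initial assignment: item i goes to worker rank i+1 (rank 0 is the master). */
    for (int item = 0; item < nworkers; item++)
        MPI_Send(&item, 1, MPI_INT, item + 1, 0, comm);

    int done = 0;
    while (done < nworkers) {
        double result;
        MPI_Status st;
        int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, comm, &st);

        if (rc == MPI_SUCCESS) { done++; continue; }

        /* A failure interfered with the wildcard receive: acknowledge it so
         * MPI_ANY_SOURCE can be used again, then reassign the lost item. */
        MPI_Group failed;
        MPI_Comm_failure_ack(comm);
        MPI_Comm_failure_get_acked(comm, &failed);
        /* For the sketch, resend item 0; a real master would look up which
         * item the failed rank was holding. */
        int lost_item = 0;
        MPI_Send(&lost_item, 1, MPI_INT, 2, 0, comm);   /* reassign to worker 2 */
        MPI_Group_free(&failed);
    }
}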

Failure Propagation
• When necessary, manual propagation is available
–MPI_Comm_revoke(MPI_Comm comm)
Interrupts all non-local MPI calls on all processes in comm. Once revoked, all non-local MPI calls on all processes in comm will return MPI_ERR_REVOKED.
–Exceptions are MPI_COMM_SHRINK and MPI_COMM_AGREE (covered later)
–Necessary for deadlock prevention
• Often unnecessary
–Let the application discover the error as it impacts correct completion of an operation

Failure Recovery
• Some applications will not need recovery
–Point-to-point applications can keep working and ignore the failed processes
• If collective communications are required, a new communicator must be created
–MPI_Comm_shrink(MPI_Comm comm, MPI_Comm *newcomm)
Creates a new communicator from the old communicator, excluding failed processes
If a failure occurs during the shrink, it is also excluded
There is no requirement that comm has a failure; in that case it acts identically to MPI_Comm_dup
• Can also be used to validate knowledge of all failures in a communicator (sketched below)
–Shrink the communicator, compare the new group to the old one, free the new communicator (if not needed)
–Same cost as querying all processes to learn about all failures
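A minimal sketch of the validation idea, assuming the ULFM-proposal name for the shrink call (implementations may use MPIX_Comm_shrink); count_failures is a hypothetical helper name.

#include <mpi.h>

/* Sketch: use MPI_Comm_shrink to learn whether any process in comm has
 * failed, as the slide suggests. Returns the number of known failures. */
int count_failures(MPI_Comm comm)
{
    MPI_Comm shrunk;
    int old_size, new_size;

    MPI_Comm_shrink(comm, &shrunk);    /* collective over the survivors */
    MPI_Comm_size(comm, &old_size);
    MPI_Comm_size(shrunk, &new_size);

    MPI_Comm_free(&shrunk);            /* only needed for the comparison */
    return old_size - new_size;        /* 0 means no known failures */
}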

Recovery with Revoke/Shrink: ABFT Example
• ABFT-style application
• Iterations with reductions
• After a failure, revoke the communicator
–Remaining processes shrink to form a new communicator
• Continue with fewer processes after repairing the data (sketched below)
[Diagram: iteration timeline showing MPI_ALLREDUCE, MPI_COMM_REVOKE, MPI_COMM_SHRINK]
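A sketch of this ABFT-style loop, assuming ULFM-proposal names (possibly MPIX_-prefixed in implementations); abft_loop is a hypothetical name and repair_data stands in for the application's ABFT data-recovery routine.

#include <mpi.h>

void repair_data(double *local, MPI_Comm comm);   /* hypothetical, provided by the app */

/* Sketch: iterate with a reduction; on failure, revoke the communicator,
 * shrink it, repair the data, and redo the interrupted iteration. */
void abft_loop(MPI_Comm comm_in, double *local, int niter)
{
    MPI_Comm comm;
    MPI_Comm_dup(comm_in, &comm);                  /* private handle we may replace */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    for (int it = 0; it < niter; it++) {
        double sum;
        int rc = MPI_Allreduce(local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

        if (rc != MPI_SUCCESS) {
            MPI_Comm_revoke(comm);                 /* interrupt everyone's pending calls */

            MPI_Comm shrunk;                       /* survivors build a new communicator */
            MPI_Comm_shrink(comm, &shrunk);
            MPI_Comm_free(&comm);
            comm = shrunk;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

            repair_data(local, comm);              /* ABFT data recovery */
            it--;                                  /* redo the interrupted iteration */
            continue;
        }
        /* ... use sum ... */
    }
    MPI_Comm_free(&comm);
}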

Fault Tolerant Consensus
• Sometimes it is necessary to decide if an algorithm is done
–MPI_Comm_agree(MPI_Comm comm, int *flag)
Performs fault-tolerant agreement over the boolean flag
Non-acknowledged failed processes cause MPI_ERR_PROC_FAILED
Will work correctly over a revoked communicator
–Expensive operation; should be used sparingly
• Can also be paired with collectives to provide global return codes if necessary (sketched below)
• Can also be used as a global failure detector
–A very expensive way of doing this, but possible
• Also includes a non-blocking version
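A sketch of pairing the agreement with a collective so all ranks see one outcome, assuming the ULFM-proposal name (implementations may call it MPIX_Comm_agree); uniform_bcast is a hypothetical helper name.

#include <mpi.h>

/* Sketch: run a collective, then agree on whether every surviving process
 * succeeded, so the return code is uniform even if the collective's local
 * return codes were inconsistent across ranks. */
int uniform_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    int rc = MPI_Bcast(buf, count, type, root, comm);

    int flag = (rc == MPI_SUCCESS);
    MPI_Comm_agree(comm, &flag);      /* fault-tolerant AND over all local flags */

    return flag ? MPI_SUCCESS : MPI_ERR_PROC_FAILED;   /* same decision everywhere */
}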

One-sided
• MPI_WIN_REVOKE
–Provides the same functionality as MPI_COMM_REVOKE
• The state of memory targeted by any process in an epoch in which operations raised an error related to process failure is undefined
–Local memory targeted by remote read operations is still valid
–It's possible that an implementation can provide stronger semantics; if so, it should do so and provide a description
–We may revisit this in the future if a portable solution emerges
• MPI_WIN_FREE has the same semantics as MPI_COMM_FREE

File I/O
• When an error is returned, the file pointer associated with the call is undefined
–Local file pointers can be set manually; the application can use MPI_COMM_AGREE to determine the position of the pointer (sketched below)
–Shared file pointers are broken
• MPI_FILE_REVOKE
–Provides the same functionality as MPI_COMM_REVOKE
• MPI_FILE_CLOSE has semantics similar to MPI_COMM_FREE
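One way to read the "set the pointer manually" bullet, as a minimal sketch: it assumes ULFM-proposal names (possibly MPIX_-prefixed), and known_good_offset stands in for application bookkeeping of the last offset known to be globally complete; checked_write is a hypothetical helper name.

#include <mpi.h>

/* Sketch: after a write with an individual file pointer, agree on whether
 * every process succeeded; if not, rewind the local pointer to a known
 * good offset so the application can redo the I/O from there. */
int checked_write(MPI_File fh, MPI_Comm comm, const void *buf, int count,
                  MPI_Offset known_good_offset)
{
    int rc = MPI_File_write(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    int ok = (rc == MPI_SUCCESS);
    MPI_Comm_agree(comm, &ok);                     /* did every process succeed? */

    if (!ok) {
        /* The individual file pointer is undefined after the error: put it
         * back explicitly before retrying or giving up. */
        MPI_File_seek(fh, known_good_offset, MPI_SEEK_SET);
        return MPI_ERR_IO;
    }
    return MPI_SUCCESS;
}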

Other mechanisms detect failure and recover within existing processes
• Data corruption
• Network failure
• Accelerator failure

GVR (Global View Resilience)
• Multi-versioned, distributed memory
–The application commits "versions", which are stored by a backend
–Versions are coordinated across the entire system
• Different from C/R
–Don't roll back the full application stack, just the specific data
[Diagram: parallel computation proceeds from phase to phase; phases create new logical versions; rollback & re-compute on an uncorrected error; app-semantics-based recovery]

MPI Memory Resilience (Planned Work)
• Integrate memory stashing within the MPI stack
–Data can be replicated across different kinds of memories (or storage)
–On error, repair data from backup memory (or disk)
[Diagram: MPI_PUT in the traditional model vs. the replicated model]

Network Failures
• Topology disruptions
–What is the tradeoff of completing our application with a broken topology vs. restarting and recreating the correct topology?
• NIC failures
–Fall back to other NICs
–Treat as a process failure (from the perspective of other off-node processes)
• Dropped packets
–Handled by lower levels of the network stack

VOCL: Transparent Remote GPU Computing
• Transparent utilization of remote GPUs
• Efficient GPU resource management:
–Migration (GPU / server)
–Power management: pVOCL
[Diagram: traditional model (the application calls the native OpenCL library on a compute node with a physical GPU) vs. the VOCL model (the application calls the VOCL library through the OpenCL API, which forwards over MPI to VOCL proxies and native OpenCL libraries on remote compute nodes with physical GPUs, exposing them as virtual GPUs)]

VOCL-FT (Fault Tolerant Virtual OpenCL)
• Double-bit (uncorrected) and single-bit (corrected) error counters may be queried in both models
• Synchronous detection model: the user application drives the ECC query and checkpointing
• Asynchronous detection model: a VOCL-FT thread performs the ECC query alongside the user application
• Minimum overhead, but double-bit errors will trash whole executions
[Diagram: timelines of bufferWrite / launchKernel / bufferRead / sync operations under the synchronous and asynchronous detection models]

Failures are not explicitly detected, but handled automatically
• Some algorithms take care of this automatically
–Iterative methods
–Naturally fault tolerant algorithms
• Can other things be done to support this kind of failure model?

Conclusion
• Lots of ways to provide fault tolerance:
1. Abort everyone
2. Abort some processes
3. Handle failure of a portion of the system without process failure
4. Handle failure recovery automatically via algorithms, etc.
• We are already looking at #2 and #3
–Some of this work will go / is going back into MPI (ULFM)
–Some will be made available externally (GVR)
• We want to make all parts of the system more reliable, not just the full-system view