
Fault Tolerant Ideas
Jack Dongarra, University of Tennessee

Super-Scale Architectures for Clusters and Grids
- For such a system, a failure is likely to be just a few hours, minutes, or seconds away.
- Application checkpoint/restart is today's typical fault tolerance method.
- A problem with MPI: the standard provides no recovery from faults.
- Widely deployed systems have 1,000 processors; current tera-scale supercomputers have up to 10,000 processors.
- Next-generation peta-scale systems will have 100,000 processors and more; such machines may scale beyond 100K processors in the next decade.

MPI Implementations with Fault Tolerance
(Figure: a classification of fault-tolerant message-passing systems along two axes: automatic vs. semi-automatic, and checkpoint-based vs. log-based (optimistic, causal, pessimistic) vs. other, implemented at the framework, API, or communications-layer level. Systems covered include CoCheck, Manetho, Starfish, Egida, Clip, MPI/FT, FT-MPI, Pruitt98, LAM/MPI, and MPI-FT; at the communications layer, sender-based message logging systems include MPICH-V/CL, LA-MPI, and MPICH-V2.)

FT-MPI (http://icl.cs.utk.edu/ft-mpi/)
- Defines the behavior of MPI in case an error occurs.
- FT-MPI is based on MPI 1.3, with a fault-tolerant model similar to what was done in PVM.
- Gives the application the possibility to recover from a node failure.
- A regular, non-fault-tolerant MPI program will run under FT-MPI.
- Sticks to the MPI-1 and MPI-2 specifications as closely as possible (e.g., no additional function calls).
What FT-MPI does not do:
- Recover user data (e.g., automatic checkpointing).
- Provide transparent fault tolerance.
(A sketch of the fault-aware programming pattern this enables follows below.)
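
As a hedged illustration of the programming model FT-MPI enables (this is not FT-MPI's own recovery API, and recover_state() is a hypothetical placeholder): the application switches MPI to returning error codes instead of aborting, checks the return code of each communication call, and branches into its own recovery path when a failure is reported.

```c
/* Minimal sketch of a fault-aware MPI pattern.  recover_state() is a
 * hypothetical placeholder; FT-MPI's communicator-rebuild modes are not
 * shown here. */
#include <mpi.h>
#include <stdio.h>

static void recover_state(void)
{
    /* Hypothetical: reload application data from an in-memory checkpoint. */
    fprintf(stderr, "recovering application state\n");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes to the caller rather than aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    double local = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD);

    if (rc != MPI_SUCCESS) {
        /* With a fault-tolerant MPI the surviving processes reach this
         * point after a node failure instead of being killed. */
        recover_state();
    }

    MPI_Finalize();
    return 0;
}
```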

Algorithm-Based Fault Tolerance Using Diskless Checkpointing
- Not transparent; it has to be built into the algorithm.
- N processors execute the computation, and each maintains its own checkpoint locally.
- M (M << N) extra processors maintain coding information so that if one or more processors die, they can be replaced.
- Today we look at M = 1 (a parity processor); more failures can be handled with Reed-Solomon coding.

How Diskless Checkpointing Works
- Similar to RAID for disks.
- If X = A XOR B, then X XOR B = A and A XOR X = B.
(A small worked example follows below.)
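
To make the identity concrete, here is a small, self-contained illustration in C (not from the slides): the parity of four equal-sized checkpoints is enough to rebuild any single lost checkpoint from the parity and the three survivors.

```c
/* A small illustration of the XOR identity behind diskless checkpointing:
 * the parity of N equal-sized checkpoints lets any single lost checkpoint
 * be rebuilt from the parity and the survivors. */
#include <stdio.h>
#include <string.h>

#define N   4        /* application "processors"                */
#define LEN 8        /* bytes of checkpoint state per processor */

int main(void)
{
    unsigned char ckpt[N][LEN], parity[LEN] = {0}, rebuilt[LEN];

    /* Fill each local checkpoint with some state and accumulate the parity. */
    for (int p = 0; p < N; p++)
        for (int i = 0; i < LEN; i++) {
            ckpt[p][i] = (unsigned char)(p * 17 + i);
            parity[i] ^= ckpt[p][i];
        }

    /* Suppose processor 2 fails: XOR the parity with the survivors. */
    memcpy(rebuilt, parity, LEN);
    for (int p = 0; p < N; p++)
        if (p != 2)
            for (int i = 0; i < LEN; i++)
                rebuilt[i] ^= ckpt[p][i];

    printf("recovered checkpoint matches: %s\n",
           memcmp(rebuilt, ckpt[2], LEN) == 0 ? "yes" : "no");
    return 0;
}
```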

Diskless Checkpointing
- The N application processors (4 in this case) each maintain their own checkpoints locally.
- M extra processors maintain coding information so that if one or more processors die, they can be replaced.
- Described here for M = 1 (parity).
- If a single processor fails, its state can be restored from the remaining live processors.
(Figure: application processors P0-P3 and parity processor P4, with P4 = P0 ⊕ P1 ⊕ P2 ⊕ P3. A sketch of building the parity with a single reduce follows below.)
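
One way to build the parity processor's coding information in parallel, sketched below under assumed process counts and buffer sizes, is a single bitwise-XOR reduction of every application processor's checkpoint bytes, rooted at the parity rank (here, the last rank of MPI_COMM_WORLD).

```c
/* Hedged sketch: the parity processor's coding information is produced by
 * one collective, a bitwise-XOR reduce of the checkpoint bytes. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CKPT_BYTES (1 << 20)          /* 1 MiB of checkpoint state per rank */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int parity_rank = size - 1;       /* last rank plays the role of P4 */
    unsigned char *ckpt   = malloc(CKPT_BYTES);
    unsigned char *parity = (rank == parity_rank) ? malloc(CKPT_BYTES) : NULL;

    /* Application ranks fill ckpt with their state; the parity rank
     * contributes zeros so it does not perturb the XOR. */
    memset(ckpt, (rank == parity_rank) ? 0 : rank + 1, CKPT_BYTES);

    /* parity = P0 xor P1 xor ... on the parity rank, in a single reduce. */
    MPI_Reduce(ckpt, parity, CKPT_BYTES, MPI_UNSIGNED_CHAR,
               MPI_BXOR, parity_rank, MPI_COMM_WORLD);

    free(ckpt);
    if (parity) free(parity);
    MPI_Finalize();
    return 0;
}
```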

Diskless Checkpointing
(Figure: P1 has failed; its state is recovered as P1 = P0 ⊕ P2 ⊕ P3 ⊕ P4 from the survivors P0, P2, P3 and the parity processor P4.)

Diskless Checkpointing
- P4 takes on the identity of P1 and the computation continues.
(Figure: P4 replaces P1 among the application processors P0, P2, P3.)

Algorithm Based
- Built into the algorithm; not transparent.
- Allows for heterogeneity.
- Developing prototype examples for ScaLAPACK and for iterative methods for Ax = b.

A Fault-Tolerant Parallel CG Solver
- Tightly coupled computation.
- Do a "backup" (checkpoint) every k iterations.
- Can survive the failure of a single process.
- Dedicate an additional process to holding data that can be used during the recovery operation.
- The work communicator excludes the backup process (see the sketch below).
- To survive m process failures (m < np), you need m additional processes.
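
The communicator layout can be set up with a single MPI_Comm_split; the sketch below assumes the last rank of MPI_COMM_WORLD plays the role of the backup process, which is an illustrative choice rather than the solver's actual convention.

```c
/* Minimal sketch: the solver's collectives run on a work communicator that
 * excludes the dedicated checkpoint (backup) process, while checkpoint
 * traffic uses MPI_COMM_WORLD. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int backup_rank = size - 1;
    int is_worker   = (rank != backup_rank);

    /* Workers get color 0; the backup process gets MPI_UNDEFINED and is
     * left out of the work communicator entirely. */
    MPI_Comm work_comm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   is_worker ? 0 : MPI_UNDEFINED, rank, &work_comm);

    if (is_worker) {
        /* ... CG iterations use work_comm for dot products etc. ... */
        MPI_Comm_free(&work_comm);
    } else {
        /* ... backup process waits to receive and serve checkpoints ... */
    }

    MPI_Finalize();
    return 0;
}
```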

The Checkpoint Procedure
- Four processes participate in the computation and one is dedicated to checkpointing and recovery, which suffices if the application only has to survive one process failure at a time.
- Implementation: a single reduce operation for a vector (sketched below).
- Keep a copy of the vector v that you used for the backup.
(Figure: ranks 0-3 hold the local pieces of the vector; rank 4 stores their element-wise sum as the checkpoint.)
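
A sketch of this checkpoint step, with made-up names and a local vector length matching the figure; it is a fragment meant to be called from the CG loop every k iterations, not a complete program. The checkpoint process contributes zeros so that one MPI_Reduce leaves the element-wise sum of the workers' pieces of v on the checkpoint rank, and each worker keeps the copy of v it contributed.

```c
/* Sketch of the checkpoint step: sum the workers' local pieces of v onto
 * the checkpoint process with one reduce, and remember the v used. */
#include <mpi.h>
#include <string.h>

#define NLOC 2                /* local vector length, as in the figure */

/* world: all processes; ckpt_rank: the dedicated checkpoint process.
 * ckpt_sum is only written on ckpt_rank; v_backup only on the workers. */
void checkpoint(const double *v, double *v_backup, double *ckpt_sum,
                int ckpt_rank, int rank, MPI_Comm world)
{
    /* The checkpoint process contributes zeros to the sum. */
    double contrib[NLOC] = {0.0, 0.0};
    if (rank != ckpt_rank)
        memcpy(contrib, v, sizeof contrib);

    /* One reduce: ckpt_sum on ckpt_rank = element-wise sum of all local v. */
    MPI_Reduce(contrib, ckpt_sum, NLOC, MPI_DOUBLE, MPI_SUM, ckpt_rank, world);

    /* Each worker remembers the exact v it used for this checkpoint. */
    if (rank != ckpt_rank)
        memcpy(v_backup, v, sizeof contrib);
}
```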

The Recovery Procedure
- Rebuild the work communicator and recover the data.
- Say we lose the process with rank 1 and the checkpoint lives in process 4: the remaining processes 0, 2, and 3, together with the checkpoint in process 4, are used to recover the data of process 1 (see the sketch below).
- Reset the iteration counter.
- On each process, copy the backup of vector v into the current version.
(Figure: the lost piece is obtained by subtracting the survivors' pieces from the checkpointed sum.)
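
A matching sketch of the recovery step, again with hypothetical names. It assumes, as in the diskless-checkpointing slides, that the checkpoint process takes over the failed rank, so it already holds the checkpointed sum: the survivors re-reduce their backed-up pieces onto it, and it subtracts that survivor sum from the checkpoint to reconstruct the lost piece.

```c
/* Sketch of the recovery step:  v_lost = ckpt_sum - sum(survivors' v_backup). */
#include <mpi.h>
#include <string.h>

#define NLOC 2                /* local vector length, as in the figure */

/* world: the rebuilt communicator containing the survivors and the
 * replacement process, which holds ckpt_sum and receives v_recovered.
 * Survivors may pass NULL for ckpt_sum and v_recovered. */
void recover(const double *v_backup, const double *ckpt_sum,
             double *v_recovered, int replacement_rank, int rank,
             MPI_Comm world)
{
    double contrib[NLOC] = {0.0, 0.0};
    double survivor_sum[NLOC];

    /* Survivors contribute their checkpointed pieces; the replacement
     * contributes zeros. */
    if (rank != replacement_rank)
        memcpy(contrib, v_backup, sizeof contrib);

    /* Sum of the survivors' checkpointed pieces, gathered on the replacement. */
    MPI_Reduce(contrib, survivor_sum, NLOC, MPI_DOUBLE, MPI_SUM,
               replacement_rank, world);

    /* The replacement reconstructs the failed process's piece of v. */
    if (rank == replacement_rank)
        for (int i = 0; i < NLOC; i++)
            v_recovered[i] = ckpt_sum[i] - survivor_sum[i];
}
```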

CG Data Storage
Think of the data as the matrix A, the right-hand side b, and 5 work vectors (one illustrative layout is sketched below).
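
Purely for illustration (the field names and the mapping of the five vectors are assumptions, not taken from the slides), the per-process CG state that has to be protected might look like this:

```c
/* Illustrative only: one plausible layout of the CG state to protect,
 * matching the "A, b, and 5 vectors" picture.  Field names are made up. */
typedef struct {
    int     nloc;       /* number of local rows                         */
    double *A_local;    /* this process's rows of the matrix A          */
    double *b_local;    /* this process's piece of the right-hand side  */
    double *x;          /* current solution                             */
    double *r;          /* residual                                     */
    double *p;          /* search direction                             */
    double *q;          /* A * p                                        */
    double *z;          /* preconditioned residual                      */
} cg_state;
```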

Parallel version
- Each processor holds its own piece of A, b, and the 5 vectors.
- There is no need to checkpoint every iteration; checkpoint, say, every k iterations.

Diskless version
(Figure: the application processors P0-P3 plus the parity processor P4 holding the encoded checkpoint.)

Preconditioned Conjugate Gradient Performance
Table 1: PCG performance on 25 nodes of a dual Pentium 4 (2.4 GHz) cluster; 24 nodes are used for computation and 1 node for the checkpoint. Checkpoint every 100 iterations (diagonal preconditioning).

Matrix (Size)        | MPICH 1.2.5 (sec) | FT-MPI (sec) | FT-MPI w/ ckpoint (sec) | FT-MPI w/ recovery (sec) | Recovery (sec) | Ckpoint overhead (%) | Recovery overhead (%)
bcsstk18.rsa (11948) | 9.81              | 9.78         | 10.0                    | 12.9                     | 2.31           | 2.4                  | 23.7
bcsstk17.rsa (10974) | 27.5              | 27.2         |                         | 30.5                     | 2.48           | 1.1                  | 9.1
nasasrb.rsa (54870)  | 577.              | 569.         | 570.                    |                          | 4.09           | 0.23                 | 0.72
bcsstk35.rsa (30237) | 860.              | 858.         | 859.                    | 872.                     | 3.17           | 0.12                 | 0.37

Futures
Investigate ideas for 10K to 100K processors in a Grid context:
- Processors hold backups of their neighbors.
- Unwind the computation to get back to the checkpoint.
- Local checkpoint-and-restart algorithms.
- Coordination of local checkpoints.
- Middleware-supported super-scale diskless checkpointing.
- Development of a super-scalable fault-tolerant MPI implementation with localized recovery.