Experiments with Fault Tolerant Linear Algebra Algorithms


Experiments with Fault Tolerant Linear Algebra Algorithms: Fault Tolerance for Sparse Linear Algebra Computations Implemented in a Grid Environment
Jack Dongarra, Jeffery Chen, Zhiao Shi, Asim YarKhan
University of Tennessee

Fault Tolerance: Motivation
- We are interested in using the VGrADS framework to find resources to solve problems and to increase ease of use on fault-prone systems.
- With Grids and some parallel systems there is an increased probability of a system or network failure.
- Mean Time to Failure grows shorter as system size increases.
- MPI is widely accepted in scientific computing, but process faults are not tolerated in the MPI model.
- There is a mismatch between the hardware and the (non-fault-tolerant) programming paradigm of MPI.

Fault Avoidance and Tolerance in Grids and Clusters
- Problem with the MPI standard: no recovery from faults. We have a version/extension of MPI (FT-MPI) that can trap and accommodate faults; it is compatible with MPI 1.2.
- Application checkpoint/restart is today's typical fault-tolerance method.
- By monitoring, one can identify performance problems, failure probability (fault prediction), and migration opportunities, and prepare for fault recovery.
- Large-scale fault-tolerance adaptation (resilience and recovery):
  - predictive techniques for probability of failure
  - resource classes and capabilities coupled to application usage modes
  - resilience implementation mechanisms, such as adaptive checkpoint frequency and in-memory checkpoints (BG/L is 1 GB now)

Fault Tolerance: Diskless Checkpointing Built into Software
- Checkpointing to disk is slow, and the system may not have any disks.
- Instead, have extra checkpointing processors and use "RAID-like" checkpointing to a processor, maintaining a system checkpoint in memory. All processors may be rolled back if necessary.
- Use k extra processors to encode checkpoints so that if up to k processors fail, their checkpoints may be restored (Reed-Solomon encoding).
- The idea is to build this into library routines; we are looking at iterative solvers. It is not transparent: it has to be built into the algorithm.
- Use the VGrADS framework to simplify the task.

How RAID for a Disk System Works
Similar to RAID for disks: if X = A XOR B, then both of the following hold:
    X XOR B = A
    A XOR X = B
So storing the parity X lets either data block be reconstructed from the other two values.
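A minimal sketch of that identity in plain Python (our illustration, not from the slides; the byte values are arbitrary):

    # XOR parity, as used in RAID: the parity X = A XOR B lets either
    # data block be rebuilt from the other block plus the parity.
    a = 0b10110010          # data block A (one byte, for illustration)
    b = 0b01101100          # data block B
    x = a ^ b               # parity block X = A XOR B

    recovered_a = x ^ b     # "lose" A; rebuild it from X and B
    recovered_b = a ^ x     # symmetrically, rebuild B from A and X
    assert recovered_a == a and recovered_b == b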

How Diskless Checkpointing Works
[Diagram: computational processors 1..p, each holding its data and a local checkpoint in memory, plus one checkpoint processor holding the checkpoint encoding.]
The encoding establishes an equality: C1 + C2 + ... + Cp = Cp+1. If one of the processors fails, this equality becomes a linear equation with only one unknown, so the lost data can be recovered by solving the equation.
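A small numpy sketch of this checksum scheme (our illustration; sizes and values are arbitrary):

    import numpy as np

    # Local checkpoint data C1..Cp held by p computational processors.
    p = 4
    local = [np.random.rand(6) for _ in range(p)]

    # The checkpoint processor stores Cp+1 = C1 + C2 + ... + Cp.
    encoding = np.sum(local, axis=0)

    # Suppose processor 2 fails: its data is the single unknown in the
    # equality, so it is recovered by subtracting the survivors.
    failed = 2
    survivors = [c for i, c in enumerate(local) if i != failed]
    recovered = encoding - np.sum(survivors, axis=0)
    assert np.allclose(recovered, local[failed])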

Diskless Checkpointing
The N application processors (4 in this case) each maintain their own checkpoints locally. K extra processors maintain coding information so that if one or more processors fail, they can be replaced. We describe the case k = 1 (parity): if a single processor fails, its state may be restored from the remaining live processors.
[Diagram: application processors P0-P3 and parity processor P4, with P4 = P0 XOR P1 XOR P2 XOR P3.]

Diskless Checkpointing: Recovery
When a failure occurs:
- control passes to a user-supplied handler
- an XOR over the surviving processors is performed to recover the missing data
- P4 takes on the identity of the failed processor (here P1) and the computation continues
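The handler logic can be simulated in a few lines (illustrative Python for the k = 1 parity case; the "processors" are just list entries, and integer data is used so that bitwise XOR applies):

    import numpy as np

    # States of application processors P0..P3, plus the parity on P4.
    states = [np.random.randint(0, 2**16, size=8) for _ in range(4)]
    parity = states[0] ^ states[1] ^ states[2] ^ states[3]   # P4

    def on_failure(failed_rank):
        # User-supplied handler: XOR the parity with every survivor's
        # state to reconstruct the lost state; P4 then takes on the
        # identity of the failed processor and execution continues.
        recovered = parity.copy()
        for rank, s in enumerate(states):
            if rank != failed_rank:
                recovered ^= s
        states[failed_rank] = recovered
        return recovered

    lost = states[1].copy()
    states[1] = None                 # simulate the failure of P1
    assert np.array_equal(on_failure(1), lost)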

A Fault-Tolerant Parallel CG Solver
- Tightly coupled computation; we are not expecting to do wide-area distributed computing. Cluster-based is ideal.
- Issues include how many processors are "optimal" for the problem, including the failure scenario.
- Do a "backup" (checkpoint) every j iterations for the changing data. This requires each process to keep a copy of its iteration-changing data from the checkpoint.
- The first example can survive the failure of a single process: dedicate one additional process to holding data that can be used during the recovery operation.
- For surviving k process failures (k << p), you need k additional processes (second example).

CG Data Storage
[Diagram: the matrix A, the right-hand side b, and 5 vectors.]
The 5 vectors change every iteration; A and b are checkpointed only initially.

Parallel Version
[Diagram: each processor holds its slice of A, b, and the 5 vectors.]
There is no need to checkpoint each iteration; checkpoint, say, every j iterations. Each processor needs a copy of its 5 vectors from the last checkpoint. A sketch of this loop structure follows.
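A schematic of the checkpoint-every-j-iterations pattern (our Python sketch of plain CG standing in for the talk's PCG; function and variable names are ours):

    import numpy as np

    def cg_step(A, b, x, r, p_vec):
        # One CG step; stands in for the PCG update in which
        # 5 vectors change every iteration.
        Ap = A @ p_vec
        alpha = (r @ r) / (p_vec @ Ap)
        x = x + alpha * p_vec
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        return x, r_new, r_new + beta * p_vec

    def ft_cg(A, b, j=2000, iters=20000, tol=1e-12):
        # A and b are checkpointed once, up front; only the
        # iteration-changing vectors are saved every j iterations.
        x = np.zeros_like(b)
        r = b - A @ x
        p_vec = r.copy()
        checkpoint = (x.copy(), r.copy(), p_vec.copy())
        for it in range(1, iters + 1):
            x, r, p_vec = cg_step(A, b, x, r, p_vec)
            if it % j == 0:
                # Kept locally on each process; in the real solver it is
                # also encoded onto the checkpoint process(es).
                checkpoint = (x.copy(), r.copy(), p_vec.copy())
            if np.linalg.norm(r) < tol:
                break
        return x, checkpoint

On a failure, every surviving process rolls back to its saved copy, and the failed process's copy is rebuilt from the encoding.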

Diskless Version
[Diagram: processes P0-P3 plus checkpoint process P4.]
Extra storage is needed on each process for the data that is changing. For this data we do not actually XOR; we add the information (as in the checksum equality above).

FT PCG Algorithm Analysis
- Checkpoint x, r, and p every k iterations.
- Global operations in PCG: three dot products, one preconditioning, and one matrix-vector multiplication per iteration.
- Global operation in checkpoint: encoding the local checkpoint. This global operation can be localized by sub-group (see the sketch below).
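One way to localize the encoding to a sub-group, sketched with mpi4py (illustrative only; the talk does not prescribe this API, and the sub-group size of 4 is made up):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Split the world communicator into sub-groups of 4 processes; the
    # checkpoint-encoding reduction then synchronizes only within a
    # sub-group rather than across all processes.
    group = comm.Split(color=rank // 4, key=rank)

    local_ckpt = np.random.rand(8)    # this process's local checkpoint
    encoded = np.empty_like(local_ckpt)

    # Sum-encode the local checkpoints onto the sub-group root, which
    # plays the role of that sub-group's checkpoint process.
    group.Reduce(local_ckpt, encoded, op=MPI.SUM, root=0)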

PCG: Experiment Configurations
All experiments are performed on 64 dual-processor 2.4 GHz AMD Opteron nodes; each node has 2 GB of memory, runs the Linux operating system, and the nodes are connected with Gigabit Ethernet.
Test matrix bcsstk17: size 10974 x 10974, 428650 non-zeros (39 non-zeros per row on average); source: linear equation from an elevated pressure vessel.
Problem scaling (10,974 unknowns per computational processor):

    Problem      Size of the problem   No. of comp. procs
    Problem #1   164,610               15
    Problem #2   329,220               30
    Problem #3   658,440               60
    Problem #4   1,316,880             120

PCG: Performance with Different MPI Implementations
Time in seconds for 20000 iterations:

    Problem      LAM-7.0.4   MPICH2-1.0   FT-MPI   FT-MPI ckpt /2000 iters   FT-MPI exit 1 proc @10000 iters
    Problem #1   522.5       536.3        517.8    518.9                     521.7
    Problem #2   532.9       542.9        532.2    533.3                     537.5
    Problem #3   545.5       553.0        546.5    547.8                     554.2
    Problem #4   674.3       624.4        622.9    —                         637.1

http://icl.cs.utk.edu/ft-mpi/

Protecting for More Than One Failure: Reed-Solomon (Checkpoint Encoding Matrices)
To be able to recover from any h (≤ number of checkpoint processes) failures, we need a checkpoint encoding.
With one checkpoint process we had p sets of data and an encoding vector A = (a1, a2, ..., ap) such that C = A*P, where P = (P1, P2, ..., Pp)^T and C is the checkpoint data (C and each Pi the same size). With A = (1, 1, ..., 1):
    C = a1 P1 + a2 P2 + ... + ap Pp
To recover a lost Pk, solve
    Pk = (C - a1 P1 - ... - a(k-1) P(k-1) - a(k+1) P(k+1) - ... - ap Pp) / ak
With k checkpoint processes we need a checkpoint-encoding matrix A of size k x p (k << p) such that C = A*P, where C = (C1, C2, ..., Ck)^T is the checkpoint data (each Ci the same size as each Pi).
When h failures occur, recover the data by taking the h x h submatrix of A corresponding to the failed processes, call it A', and solving A'P' = C' for the h "lost" P's; C' is made up from h surviving checkpoints (with the surviving processes' contributions subtracted).
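A numpy sketch of this multi-failure recovery (our illustration; a random encoding matrix is used, as the last slide suggests, and the surviving processes' contributions are subtracted from the checkpoints before solving):

    import numpy as np

    rng = np.random.default_rng(0)
    p, k, n = 4, 3, 6              # comp. procs, ckpt procs, data length
    P = rng.random((p, n))         # row i = data held by processor i
    A = rng.random((k, p))         # random k x p encoding matrix
    C = A @ P                      # row j = encoding on ckpt processor j

    # Say h = 2 processors fail, P2 and P3.
    failed = [2, 3]
    alive = [i for i in range(p) if i not in failed]
    h = len(failed)

    # Subtract the surviving contributions from h of the checkpoints,
    # then solve the h x h system A'P' = C' for the lost rows.
    A_sub = A[:h][:, failed]                       # A'
    C_sub = C[:h] - A[:h][:, alive] @ P[alive]     # C'
    P_lost = np.linalg.solve(A_sub, C_sub)
    assert np.allclose(P_lost, P[failed])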

PCG: Performance Overhead of Performing Recovery
Run PCG for 20000 iterations, taking a checkpoint every 2000 iterations (about 1 minute apart); simulate failures by exiting some processes at the 10000-th iteration. Times T are in seconds, with the original table's "ckpt T" values in parentheses.

    Comp. procs   0 proc   1 proc        2 proc        3 proc        4 proc        5 proc
    15            517.8    521.7 (2.8)   522.1 (3.2)   522.8 (3.3)   522.9 (3.7)   523.1 (3.9)
    30            532.2    537.5 (4.5)   537.7 (4.9)   538.1 (5.3)   538.5 (5.7)   538.6 (6.1)
    60            546.5    554.2 (6.9)   554.8 (7.4)   555.2 (7.6)   555.7 (8.2)   556.1 (8.7)
    120           622.9    637.1 (10.5)  637.2 (11.1)  637.7 (11.5)  638.0 (12.0)  638.5 (12.5)

GridSolve Architecture
GridSolve provides resource discovery, scheduling, load balancing, and fault tolerance.
[Diagram: the client call [x,y,z,info] = gridsolve('solver', A, B) sends a request to the agent, which returns a server list; the client then sends its data to a chosen server and receives the result.]

GridSolve Usage with VGrADS
- Simple-to-use access to complicated software libraries.
- Selection of the best machine in your grid to service the user request.
- Portability: non-portable code can be run from a client on any architecture, as long as there is a server provisioned with the code.
- Legacy codes are easily wrapped into services.
- Plugs into the VGrADS framework, using vgES for resource selection and launching of the application: integrated performance information, integrated monitoring, fault prediction, and integration of the software and resource information repositories.

VGrADS/GridSolve Architecture
[Diagram: the client call [x,y,z,info] = gridsolve('dgesv', A, B) goes to the agent; the agent queries the Service Catalog for the software location and issues a vgDL request to create a Virtual Grid; servers started there fetch software from the Software Repository and register their info with the agent; the client then transfers data to a server and receives the result.]

Agent
- The agent is tightly bound to the client; initially the agent contains no resource information.
- The agent requests information about the possible services and their complexity in order to estimate the resources required (expressed in vgDL). It may decide to run the service locally on the client.
- For each service request it estimates the resources required and issues a vgDL spec, e.g.:

    vgdl = Clusterof<node>[N]; node = {node.memory > 500MB, node.speed > 2000};
    vgid = vgCreateVG(vgserver, vgdl, 1000, ns-server-script)

- The set of resources is returned to the client. The ns-server-script fetches and deploys the needed services on the selected VGrADS resources.

Next Steps
- Software to determine the checkpointing interval and the number of checkpoint processors from the machine characteristics, perhaps using historical information.
- Monitoring, with migration of a task if there is a potential problem.
- Local checkpoint-and-restart algorithm: coordination of local checkpoints, with processors holding backups of their neighbors.
- Have the checkpoint processes participate in the computation and do data rearrangement when a failure occurs: use p processors for the computation and have k of them hold checkpoints.
- Generalize these ideas to provide a library of routines for diskless checkpointing.
- Look at "real applications" and investigate "lossy" algorithms.

Additional Details Can Be Found in Posters
- Poster on VGrADS and GridSolve (Zhiao Shi)
- Richard Huang's and Dan Nurmi's posters

[Speaker notes:]
* Somewhere between slides 26 and 27, we need to address why vgES helps (after all, you're already doing NetSolve without it). (Future) capabilities that spring to mind are integrated performance information for the vgrid, integrated monitoring for the vgrid, Rich's fault prediction work (to the extent that it becomes part of vgES, which I admit I don't quite grok), and integrating the software and resource information repositories. Whatever you think makes sense.
* In a similar vein, forward pointers to Shi's poster and (maybe) Richard Huang's and Dan Nurmi's posters would show VGrADS as a solidly unified project. (You may have been planning to insert these verbally...)

Reed-Solomon Approach
A*P = C, where A is k x p and made up of random numbers, P is p x n, and C is k x n. Here we use 4 processors and 3 checkpoint processors.
Say 2 processors fail, P2 and P3. Take the corresponding subset of A's columns (columns 2 and 3) and solve for P2 and P3.
One could use GF(2) instead; signal-processing applications do this, in which case A is a Vandermonde or Cauchy matrix. (Any square subset of A needs to be nonsingular.) We use A as a random matrix.
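The nonsingularity requirement can be checked directly; a brute-force sketch (ours, fine for the small k and p used here):

    import numpy as np
    from itertools import combinations

    def all_square_subsets_nonsingular(A, tol=1e-12):
        # Every h x h submatrix of the k x p encoding matrix A
        # (h checkpoint rows x h failed-process columns, h = 1..k)
        # must be nonsingular to guarantee recovery from any h failures.
        k, p = A.shape
        for h in range(1, k + 1):
            for rows in combinations(range(k), h):
                for cols in combinations(range(p), h):
                    if abs(np.linalg.det(A[np.ix_(rows, cols)])) < tol:
                        return False
        return True

    A = np.random.default_rng(1).random((3, 4))   # 3 ckpt procs, 4 procs
    print(all_square_subsets_nonsingular(A))      # True w.p. 1 for random A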