Fault Tolerance with FT-MPI for Linear Algebra Algorithms


1 Fault Tolerance with FT-MPI for Linear Algebra Algorithms
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee, and Computer Science and Mathematics Division, Oak Ridge National Laboratory

2 Fault Tolerance: Motivation
Trends in HPC:
- High-end systems with thousands of processors
- Increased probability of a node failure
- Most systems nowadays are robust
- MPI is widely accepted in scientific computing
- Process faults are not tolerated in the MPI model
There is a mismatch between the hardware and the (non-fault-tolerant) programming paradigm of MPI.

3 FT-MPI (http://icl.cs.utk.edu/ft-mpi/)
- Defines the behavior of MPI when an error occurs
- Based on MPI 1.3 (plus some MPI-2 features), with a fault-tolerant model similar to what was done in PVM
- A complete reimplementation, not based on other implementations
- Gives the application the possibility to recover from a node failure
- A regular, non-fault-tolerant MPI program will run under FT-MPI
- Runs over a Grid
What FT-MPI does not do:
- Recover user data (e.g., automatic checkpointing)
- Provide transparent fault tolerance

4 Related Work
A classification of fault-tolerant message-passing environments, considering (a) the level in the software stack where fault tolerance is managed and (b) the fault-tolerance technique used: automatic vs. non-automatic, checkpoint-based vs. log-based (optimistic, causal, or pessimistic logging).
[Figure: classification chart placing existing systems on these two axes, noting how many simultaneous faults each supports. Systems shown include: optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; Cocheck, independent of MPI [Ste96]; Manetho, n faults, causal logging plus coordinated checkpoint [EZ92]; Starfish, enrichment of MPI [AF99]; Clip, semi-transparent checkpoint [CLP97]; Egida [RAV99]; MPI/FT, redundancy of tasks [BNC01]; FT-MPI, modification of MPI routines with user fault treatment [FD00]; the new community MPI effort, Open MPI; Pruitt 98, 2 faults, sender-based [PRU98]; LAM/MPI; MPI-FT, n faults, centralized server [LNLE00]; MPICH-V and MPICH-V/CL, n faults, distributed logging, at the communication-layer level; sender-based message logging, 1 fault [JZ87]; and LA-MPI.]
Tolerating volatility in large-scale systems implies that (1) the system survives an arbitrary number of faults, (2) fault tolerance is transparent to the user and the programmer (otherwise one must be able to manage any combination of faults), and (3) the fault-tolerance technique is scalable (it does not rely on coordinated checkpointing). No existing system provides all of these features simultaneously; that is why MPICH-V was designed at the lowest level of the software stack, the communication-library level.

5 FT-MPI Failure Recovery Modes
- ABORT: abort, just as other MPI implementations do
- BLANK: leave a hole in the communicator
- SHRINK: re-order processes to make a contiguous communicator (some ranks change)
- REBUILD: re-spawn the lost processes and add them back into MPI_COMM_WORLD

6 Application Scenario: Normal Startup
rc = MPI_Init(…)
If this is a normal startup: install the error handler and set the longjmp point, then call app(…), and finally MPI_Finalize(…).

7 Application Scenario: On Error
rc = MPI_Init(…); set the longjmp error handler.
On error (raised automatically by the MPI runtime library; it could also be triggered by a "contract violation"): do recover(), longjmp back to the recovery point, call app(…) again, and finally MPI_Finalize(…).
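A minimal C sketch of this control flow, assuming standard MPI error-handler calls; `do_recover` and `call_app` are placeholders, and the FT-MPI-specific return code that distinguishes a re-spawned process from a normal startup is deliberately left as a comment:

```c
#include <mpi.h>
#include <setjmp.h>

static jmp_buf restart_point;

void do_recover(void) { /* placeholder: rebuild communicators, restore
                           data from the in-memory checkpoint */ }
void call_app(void)   { /* placeholder: the actual computation */ }

/* Invoked by the MPI runtime when a peer fails; instead of aborting,
 * jump back to the recovery point in main(). */
static void fault_handler(MPI_Comm *comm, int *err, ...)
{
    longjmp(restart_point, *err);
}

int main(int argc, char **argv)
{
    int rc = MPI_Init(&argc, &argv);
    /* Under REBUILD, a re-spawned process re-enters here; FT-MPI
     * distinguishes that case through the MPI_Init return code
     * (the exact constant is omitted in this sketch).  rc is
     * MPI_SUCCESS on a normal startup. */
    (void)rc;

    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(fault_handler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    if (setjmp(restart_point) != 0)
        do_recover();   /* we got here via longjmp after a failure */
    call_app();

    MPI_Finalize();
    return 0;
}
```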

8 Fault Tolerance: Diskless Checkpointing, Built into the Software
- Checkpointing to disk is slow, and there may not be any disks on the system
- Have extra checkpointing processors: use RAID-like checkpointing to a processor
- Maintain a system checkpoint in memory; all processors may be rolled back if necessary
- Use k extra processors to encode checkpoints so that if up to k processors fail, their checkpoints may be restored (Reed-Solomon encoding)
- The idea is to build this into library routines; we are looking at iterative solvers
- Not transparent: it has to be built into the algorithm

9 How Diskless Checkpointing Works
[Figure: k compute processors, each holding its data and a local checkpoint in memory, plus one checkpoint processor holding the checkpoint encoding.]
The encoding establishes an equality: C1 + C2 + … + Ck = Ck+1. If one of the processors fails, this equality becomes a linear equation with only one unknown, so the lost data can be solved for.
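Written out, the single-failure recovery is just a rearrangement of that equality:

```latex
C_{k+1} = C_1 + C_2 + \cdots + C_k
\quad\Longrightarrow\quad
C_j = C_{k+1} - \sum_{\substack{i=1 \\ i \neq j}}^{k} C_i
\quad \text{when processor } j \text{ fails.}
```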

10 Diskless Checkpointing
The N application processors (4 in this case) each maintain their own checkpoints locally. K extra processors maintain coding information so that if one or more processors fail, they can be replaced. Described here for k = 1 (parity): if a single processor fails, its state may be restored from the remaining live processors.
[Figure: application processors P0–P3 and parity processor P4, with P4 = P0 ⊕ P1 ⊕ P2 ⊕ P3.]

11 Diskless Checkpointing
When a failure occurs:
- Control passes to the user-supplied handler
- An XOR is performed to recover the missing data
- P4 takes on the role (and identity) of P1
- Execution continues
[Figure: after P1 fails, parity processor P4 assumes P1's identity and the computation proceeds with P0, P4, P2, P3.]
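A sketch of the k = 1 parity scheme in MPI, assuming each local checkpoint is a buffer of n unsigned words; the function names are illustrative, not FT-MPI calls. The same collective serves both directions, because XOR-ing the survivors' checkpoints with the stored parity reproduces the lost data:

```c
#include <mpi.h>

/* Checkpoint: the parity rank receives the XOR of every compute
 * rank's local checkpoint (the parity rank contributes zeros). */
void take_parity_checkpoint(const unsigned *local, unsigned *parity,
                            int n, int parity_rank, MPI_Comm comm)
{
    MPI_Reduce(local, parity, n, MPI_UNSIGNED, MPI_BXOR,
               parity_rank, comm);
}

/* Recovery: run on a communicator of the survivors plus the
 * replacement process (e.g. after a REBUILD).  Each survivor
 * contributes its local checkpoint, the former parity rank
 * contributes the stored parity, and a freshly re-spawned rank
 * contributes zeros.  The XOR of all of these is exactly the
 * failed rank's lost checkpoint, delivered at `root`. */
void recover_lost_checkpoint(const unsigned *mine, unsigned *lost,
                             int n, int root, MPI_Comm survivors)
{
    MPI_Reduce(mine, lost, n, MPI_UNSIGNED, MPI_BXOR,
               root, survivors);
}
```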

12 A Fault-Tolerant Parallel CG Solver
- A tightly coupled computation
- Take a "backup" (checkpoint) of the changing data every j iterations
- Requires each process to keep a copy of its iteration-changing data from the last checkpoint
- The first example can survive the failure of a single process: dedicate one additional process to holding data that can be used during the recovery operation
- To survive k process failures (k < np), you need k additional processes (second example)

13 CG Data Storage
Think of the data as the matrix A, the right-hand side b, and 5 vectors. Checkpoint A and b once, initially; only the vectors change.

14 Parallel Version
On each processor, think of the data the same way: a slice of A and b, plus the 5 vectors. There is no need to checkpoint on every iteration; do it, say, every k iterations. Each processor needs to keep a copy of its 5 vectors.
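A sketch of where the checkpoint fits in the driver loop, under the assumptions that the five changing vectors are x, r, p, z, and Ap, and that `cg_iterate` and `encode_checkpoint` are placeholder helpers (neither name comes from FT-MPI):

```c
/* Illustrative only: checkpoint the changing data every
 * ckpt_interval iterations; A and b were checkpointed at startup. */
void cg_iterate(double *x, double *r, double *p, double *z,
                double *Ap, int n);                 /* one CG update */
void encode_checkpoint(const double *x, const double *r,
                       const double *p, const double *z,
                       const double *Ap, int n);    /* send encoding */

void ft_pcg(double *x, double *r, double *p, double *z, double *Ap,
            int n, int max_iter, int ckpt_interval)
{
    for (int it = 1; it <= max_iter; ++it) {
        cg_iterate(x, r, p, z, Ap, n);
        if (it % ckpt_interval == 0)       /* e.g. every 100 iterations,
                                              as in the experiments */
            encode_checkpoint(x, r, p, z, Ap, n);
    }
}
```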

15 Diskless Version
[Figure: compute processors P0–P3 plus checkpoint processor P4, before and after a failure.]
Extra storage is needed on each process for the data that is changing. For the floating-point data we do not actually XOR; we add the information.
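The floating-point analogue of the parity call above, a sketch using the same illustrative naming: addition replaces XOR, so recovery subtracts the survivors' sum from the stored checkpoint instead of re-XOR-ing.

```c
#include <mpi.h>

/* Checkpoint by summation: the checkpoint rank accumulates the
 * element-wise sum of every compute rank's local vector. */
void take_sum_checkpoint(const double *local_vec, double *sum_vec,
                         int n, int ckpt_rank, MPI_Comm comm)
{
    MPI_Reduce(local_vec, sum_vec, n, MPI_DOUBLE, MPI_SUM,
               ckpt_rank, comm);
}
```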

16 Preconditioned Conjugate Gradient Performance
Table 1: PCG performance on 25 nodes of a dual Pentium 4 (2.4 GHz) cluster. 24 nodes are used for computation and 1 node for the checkpoint. Checkpoint every 100 iterations (diagonal preconditioning). Times in seconds.

Matrix (size)          MPICH 1.2.5   FT-MPI   FT-MPI     FT-MPI        Recovery   Ckpt          Recovery
                                              w/ ckpt    w/ recovery   time       overhead (%)  overhead (%)
bcsstk18.rsa (11948)     9.81          9.78    10.0       12.9          2.31       2.4           23.7
bcsstk17.rsa (10974)    27.5          27.2                30.5          2.48       1.1            9.1
nasasrb.rsa  (54870)   577.          569.     570.                      4.09       0.23           0.72
bcsstk35.rsa (30237)   860.          858.     859.        872.          3.17       0.12           0.37

20 Protecting Against More Than One Failure: Reed-Solomon (Checkpoint Encoding Matrices)
To be able to recover from any k failures (k ≤ the number of checkpoint processes), we need a checkpoint encoding.
With one checkpoint process we had p sets of data and a function A such that C = A*P, where P = (P1, P2, …, Pp)^T and C is the checkpoint data (C and each Pi the same size). With A = (1, 1, …, 1): C = P1 + P2 + … + Pp.
With k checkpoints we need a function A such that C = (C1, C2, …, Ck)^T, where A is a k × p checkpoint-encoding matrix (k ≪ p) and Ci = ai1 P1 + ai2 P2 + … + aip Pp. Each checkpoint process holds one of the Ci.

21 Protecting Against More Than One Failure: Reed-Solomon (continued)
The k × p checkpoint matrix A has to satisfy one condition: any square submatrix of A is non-singular. How do we find such an A? Vandermonde matrices, Cauchy matrices, …, or random-element matrices; we are using a random matrix to encode the checkpoints.
When h failures occur, recover the data by taking the h × h submatrix of A corresponding to the failed processes, call it A', and solving A'P' = C', where C' is made up of the surviving h checkpoints.
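A toy instance (not from the slides) with p = 3 data blocks and k = 2 checkpoints, using a Vandermonde-style A whose square submatrices are all non-singular:

```latex
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix},
\qquad
\begin{aligned}
C_1 &= P_1 + P_2 + P_3,\\
C_2 &= P_1 + 2P_2 + 3P_3.
\end{aligned}
```

If P1 and P3 fail, move the survivor P2 to the right-hand side and solve the resulting 2 × 2 system:

```latex
\begin{pmatrix} 1 & 1 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} P_1 \\ P_3 \end{pmatrix}
=
\begin{pmatrix} C_1 - P_2 \\ C_2 - 2P_2 \end{pmatrix}.
```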

22 FT PCG Performance on an AMD Opteron Cluster with Myrinet

Procs             54 comp    54 comp    54 comp    54 comp    54 comp
                  + 2 ckpt   + 4 ckpt   + 6 ckpt   + 8 ckpt   + 10 ckpt
T with ckpt       2570.9     2571.8     2573.2     2573.1     2574.3
T with recovery   2658.0     2662.7     2663.1     2665.9     2668.3
Baseline T with 54 compute processes and no checkpointing: 2570.4 (times in seconds).

The test matrix is audikw_1.rsa, of size 943,695 with 39,297,771 non-zeros. The PCG algorithm is from Templates for the Solution of Linear Systems, with a diagonal preconditioner. The fault-tolerance scheme is diskless checkpointing with floating-point Reed-Solomon encoding; the parity-check matrix is a pseudo-random (uniformly distributed) matrix generated from MT19937. We checkpoint every 100 iterations (about every 250 seconds), run PCG for 1000 iterations, and force some processes to fail at the 500th iteration.

23 Futures
Put this in the Grid / VGrADS context:
- Launch, and determine the number of processors for a particular problem, from the VGrADS framework
- Give the fault handler the ability to detect slow processors or networks and trigger the fault-handling path
- Use the checkpoint processors to participate in the work (requires reallocation or re-layout of data)
- Run with larger numbers of processors

24 Collaborators / Support
FT-MPI: Graham Fagg, UTK; Thara Angskun, UTK; George Bosilca, UTK; Jelena Pjesivac-Grbovic, UTK
FT Algorithms: Jeffery Chen, UTK; Julien Langou, UTK
VGrADS: Asim YarKhan, UTK; Zhiao Shi, UTK
For more information: http://icl.cs.utk.edu/ft-mpi/

