SPAA, June 13-15, 2010
Managing the Complexity of Lookahead for LU Factorization with Pivoting
Ernie Chan
Motivation
  Solving Linear Systems
    Solve for x
      A x = b
    Factorize A, O(n^3)
      P A = L U
    Forward and backward substitution, O(n^2)
      L y = P b
      U x = y
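The three steps on this slide can be sketched in plain C. This is a minimal unblocked illustration, not the libflame implementation: the function names `lu_piv` and `lu_solve`, the row-major layout, and the pivot encoding (`piv[k]` is the row swapped with row `k`) are assumptions made for the example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: factor the n x n row-major matrix a in place so that
   P A = L U. L (unit lower triangular) ends up below the diagonal, U on and
   above it. piv[k] records the row swapped with row k.
   Returns 0 on success, -1 if a zero pivot is found. */
int lu_piv(size_t n, double *a, size_t *piv)
{
    assert(n > 0);
    for (size_t k = 0; k < n; ++k) {
        /* Partial pivoting: pick the largest magnitude entry in column k. */
        size_t r = k;
        for (size_t i = k + 1; i < n; ++i)
            if (fabs(a[i * n + k]) > fabs(a[r * n + k]))
                r = i;
        piv[k] = r;
        if (a[r * n + k] == 0.0)
            return -1;
        if (r != k)
            for (size_t j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[r * n + j];
                a[r * n + j] = t;
            }
        /* Eliminate below the pivot; summed over all k this is O(n^3). */
        for (size_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (size_t j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Hypothetical helper: solve A x = b from the factors; b is overwritten by x. */
void lu_solve(size_t n, const double *lu, const size_t *piv, double *b)
{
    /* Form P b by replaying the row swaps. */
    for (size_t k = 0; k < n; ++k) {
        double t = b[k];
        b[k] = b[piv[k]];
        b[piv[k]] = t;
    }
    /* Forward substitution: L y = P b (unit diagonal), O(n^2). */
    for (size_t i = 1; i < n; ++i)
        for (size_t j = 0; j < i; ++j)
            b[i] -= lu[i * n + j] * b[j];
    /* Backward substitution: U x = y, O(n^2). */
    for (size_t i = n; i-- > 0;) {
        for (size_t j = i + 1; j < n; ++j)
            b[i] -= lu[i * n + j] * b[j];
        b[i] /= lu[i * n + i];
    }
}
```

Factoring once and reusing the factors is the point of the split: each additional right-hand side costs only the two O(n^2) substitutions.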
Goals
  Programmability
    Use tools provided by FLAME
  Parallelism
    Directed acyclic graph (DAG) scheduling
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
  P A = L U
LU Factorization with Partial Pivoting
  Formal Linear Algebra Method Environment (FLAME)
    High-level abstractions for expressing linear algebra algorithms
    Application programming interfaces (APIs) for seamlessly implementing algorithms in code
    Library of commonly used linear algebra operations in libflame
LU Factorization with Partial Pivoting
  Blocked Algorithm, Iteration 1
    [Diagram: 2 x 2 partitioning into A11, A12, A21, A22, shown in five steps]
    Partition: A11, A12, A21, A22
    LUpiv: covers A11 and A21
    PIV: covers A12 and A22
    TRSM: covers A12
    GEMM: covers A22
LU Factorization with Partial Pivoting
  Blocked Algorithm, Iteration 2
    [Diagram: 3 x 3 partitioning into A00 through A22, shown in five steps]
    Partition: A00, A01, A02, A10, A11, A12, A20, A21, A22
    LUpiv: covers A11 and A21
    PIV: covers A10 and A20, and A12 and A22
    TRSM: covers A12
    GEMM: covers A22
LU Factorization with Partial Pivoting
  Blocked Algorithm, Iteration 3
    [Diagram: 2 x 2 partitioning into A00, A01, A10, A11, shown in three steps]
    Partition: A00, A01, A10, A11
    LUpiv: covers A11
    PIV: covers A10
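The four steps of each blocked iteration (LUpiv, PIV, TRSM, GEMM) can be sketched as loops over a flat array. This is only an illustration under stated assumptions: the name `lu_piv_blocked`, the helper `swap_rows`, the row-major layout, and the block size parameter `nb` are all hypothetical, and the triple loops stand in for the tuned BLAS kernels a real implementation would call.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: swap rows r1 and r2 of the row-major n x n matrix a,
   restricted to columns [c0, c1). */
static void swap_rows(double *a, size_t n, size_t r1, size_t r2,
                      size_t c0, size_t c1)
{
    for (size_t j = c0; j < c1; ++j) {
        double t = a[r1 * n + j];
        a[r1 * n + j] = a[r2 * n + j];
        a[r2 * n + j] = t;
    }
}

/* Blocked right-looking LU with partial pivoting, panel width nb.
   piv[c] is the row swapped with row c. Returns 0, or -1 on a zero pivot. */
int lu_piv_blocked(size_t n, size_t nb, double *a, size_t *piv)
{
    assert(n > 0 && nb > 0);
    for (size_t k = 0; k < n; k += nb) {
        size_t kb = (n - k < nb) ? n - k : nb;
        size_t ke = k + kb; /* one past the last panel column */

        /* LUpiv: unblocked factorization of the panel (A11; A21). */
        for (size_t c = k; c < ke; ++c) {
            size_t r = c;
            for (size_t i = c + 1; i < n; ++i)
                if (fabs(a[i * n + c]) > fabs(a[r * n + c]))
                    r = i;
            piv[c] = r;
            if (a[r * n + c] == 0.0)
                return -1;
            if (r != c)
                swap_rows(a, n, c, r, k, ke);
            for (size_t i = c + 1; i < n; ++i) {
                a[i * n + c] /= a[c * n + c];
                for (size_t j = c + 1; j < ke; ++j)
                    a[i * n + j] -= a[i * n + c] * a[c * n + j];
            }
        }
        /* PIV: replay the panel's row swaps to its left and to its right. */
        for (size_t c = k; c < ke; ++c)
            if (piv[c] != c) {
                swap_rows(a, n, c, piv[c], 0, k);
                swap_rows(a, n, c, piv[c], ke, n);
            }
        /* TRSM: A12 := inv(L11) A12, with L11 unit lower triangular. */
        for (size_t i = k + 1; i < ke; ++i)
            for (size_t p = k; p < i; ++p)
                for (size_t j = ke; j < n; ++j)
                    a[i * n + j] -= a[i * n + p] * a[p * n + j];
        /* GEMM: A22 := A22 - A21 A12. */
        for (size_t i = ke; i < n; ++i)
            for (size_t p = k; p < ke; ++p)
                for (size_t j = ke; j < n; ++j)
                    a[i * n + j] -= a[i * n + p] * a[p * n + j];
    }
    return 0;
}
```

Most of the floating point work lands in the GEMM loop, which is why the blocked formulation matters for performance.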
Algorithm-by-Blocks
  FLASH
    Storage-by-blocks
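To make storage-by-blocks concrete, here is a small index-mapping sketch. It assumes a flat array in which each b x b block is contiguous, blocks are ordered row by row, and elements inside a block are column-major. That layout and the name `block_offset` are assumptions for illustration only; the slide does not describe how FLASH lays blocks out in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping: offset of element (i, j) of an n x n matrix stored
   by b x b blocks. Assumes n is a multiple of b. */
size_t block_offset(size_t n, size_t b, size_t i, size_t j)
{
    assert(b > 0 && n % b == 0 && i < n && j < n);
    size_t nblk = n / b;           /* blocks per row of blocks */
    size_t bi = i / b, bj = j / b; /* which block */
    size_t ii = i % b, jj = j % b; /* position inside the block */
    return (bi * nblk + bj) * b * b + jj * b + ii;
}
```

Because every block occupies one contiguous range, a task that operates on a single block touches one compact region of memory, which is what lets blocks serve as the unit of data and of computation.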
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );
FLA_Part_2x1( p,    &pT,
                    &pB,            0, FLA_TOP );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  FLA_Repart_2x1_to_3x1( pT,                  &p0,
                      /* ** */              /* ** */
                                              &p1,
                         pB,                  &p2,        1, FLA_BOTTOM );
  /*------------------------------------------------------------*/

  /* LUpiv: factor the current panel ( A11; A21 ), pivots go to p1 */
  FLA_Merge_2x1( A11,
                 A21,   &AB1 );
  FLASH_LU_piv( AB1, p1 );

  /* PIV: apply the pivots to the left of the panel */
  FLA_Merge_2x1( A10,
                 A20,   &AB0 );
  FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB0 );

  /* PIV: apply the pivots to the right of the panel */
  FLA_Merge_2x1( A12,
                 A22,   &AB2 );
  FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB2 );

  /* TRSM: A12 := inv( L11 ) * A12 */
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );

  /* GEMM: A22 := A22 - A21 * A12 */
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );

  /*------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */  /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
  FLA_Cont_with_3x1_to_2x1( &pT,                   p0,
                                                   p1,
                          /* ** */              /* ** */
                            &pB,                   p2,     FLA_TOP );
}
Algorithm-by-Blocks
  LU Factorization with Partial Pivoting
    [Diagram: each block is labeled with the numbered task that overwrites it; a task spanning several blocks appears once per block]
    Iteration 1: LUpiv 0; PIV 1, PIV 2; TRSM 3, TRSM 4; GEMM 5, GEMM 6, GEMM 7, GEMM 8
    Iteration 2: LUpiv 9; PIV 10, PIV 11; TRSM 12; GEMM 13
    Iteration 3: LUpiv 14; PIV 15, PIV 16
[Figure: DAG of the 17 tasks. Nodes: LUpiv 0; PIV 1, PIV 2; TRSM 3, TRSM 4; GEMM 5, GEMM 6, GEMM 7, GEMM 8; LUpiv 9; PIV 10, PIV 11; TRSM 12; GEMM 13; LUpiv 14; PIV 15, PIV 16]
SuperMatrix Runtime System
  Separation of Concerns
    Analyzer
      Decomposes subproblems into component tasks
      Stores tasks sequentially in a global task queue
      Internally calculates all dependencies between tasks, which form a directed acyclic graph (DAG), using only the input and output parameters of each task
    Dispatcher
      Spawns threads
      Schedules and dispatches tasks to threads in parallel
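The analyzer's idea, deriving dependencies purely from each task's inputs and outputs, can be sketched as follows. This is not the SuperMatrix code: the `task_t` struct, the bitmask encoding of blocks, and the function `analyze` are hypothetical, and the sketch records every conflicting pair, including edges that are implied transitively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task record: one bit per matrix block (or pivot vector). */
typedef struct {
    uint32_t in;  /* blocks the task reads */
    uint32_t out; /* blocks the task writes */
} task_t;

/* Fill dep (nt x nt, row-major) so that dep[j * nt + i] is 1 when task j
   must wait for an earlier task i. Tasks are given in sequential order. */
void analyze(size_t nt, const task_t *t, unsigned char *dep)
{
    assert(t != NULL && dep != NULL);
    for (size_t j = 0; j < nt; ++j)
        for (size_t i = 0; i < nt; ++i) {
            int flow   = (t[i].out & t[j].in)  != 0; /* read after write  */
            int anti   = (t[i].in  & t[j].out) != 0; /* write after read  */
            int output = (t[i].out & t[j].out) != 0; /* write after write */
            dep[j * nt + i] =
                (unsigned char)(i < j && (flow || anti || output));
        }
}
```

Since edges only ever point from an earlier task to a later one, the resulting graph is acyclic by construction.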
SuperMatrix Runtime System
  Dispatcher – Single Queue
    Set of all ready and available tasks
    FIFO, priority
    PE 0, PE 1, …, PE p-1
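A single FIFO queue of ready tasks can be simulated on one thread to see the order in which tasks would be handed out. The function `dispatch_fifo`, the 64-task limit, and the flat dependency matrix are assumptions for this sketch; the actual dispatcher runs the dequeued tasks on multiple threads, which is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-threaded simulation of a FIFO ready queue.
   dep[j * nt + i] != 0 means task j waits on task i.
   order receives task ids in dispatch order and doubles as the queue.
   Returns the number of tasks dispatched (nt when the graph is acyclic). */
size_t dispatch_fifo(size_t nt, const unsigned char *dep, size_t *order)
{
    size_t pending[64]; /* unfinished dependencies per task */
    size_t head = 0, tail = 0;
    assert(nt <= 64);

    /* Tasks with no dependencies are ready immediately. */
    for (size_t j = 0; j < nt; ++j) {
        pending[j] = 0;
        for (size_t i = 0; i < nt; ++i)
            if (dep[j * nt + i])
                ++pending[j];
        if (pending[j] == 0)
            order[tail++] = j;
    }
    /* Dequeue the oldest ready task, "execute" it, release its dependents. */
    while (head < tail) {
        size_t i = order[head++];
        for (size_t j = 0; j < nt; ++j)
            if (dep[j * nt + i] && --pending[j] == 0)
                order[tail++] = j;
    }
    return tail;
}
```

Replacing the FIFO dequeue with a choice by priority changes only which ready task is taken next, not the bookkeeping around it.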
[Figure: DAG of tasks 0 through 16 for LU factorization with partial pivoting, from LUpiv 0 to PIV 16]
SuperMatrix Runtime System
  Lookahead
    Schedule the GEMM 5 and GEMM 6 tasks first so that LUpiv 9 can be "computed ahead" in parallel with GEMM 7 and GEMM 8
    Implemented directly within the code, which increases complexity and detracts from programmability
      High-Performance LINPACK
SuperMatrix Runtime System
  Scheduling
    Sorting tasks by the height of each task in the DAG mimics lookahead
    Multiple queues
      Data affinity
      Work stealing
    Macroblocks
      Tasks overwriting more than one block at a time
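Sorting by height can be sketched as two small routines: one computes each task's height, taken here as the number of edges on the longest path from the task to a sink of the DAG, and one picks the ready task with the greatest height. The names `task_heights` and `pick_highest` and the flat dependency matrix are hypothetical, and the sketch assumes tasks are numbered in sequential order so that every edge goes from a lower to a higher index.

```c
#include <assert.h>
#include <stddef.h>

/* dep[j * nt + i] != 0 means task j depends on earlier task i (i < j).
   height[i] becomes the longest path length from task i to any sink. */
void task_heights(size_t nt, const unsigned char *dep, size_t *height)
{
    assert(dep != NULL && height != NULL);
    /* Walk backwards so every successor's height is already known. */
    for (size_t i = nt; i-- > 0;) {
        height[i] = 0;
        for (size_t j = i + 1; j < nt; ++j)
            if (dep[j * nt + i] && height[j] + 1 > height[i])
                height[i] = height[j] + 1;
    }
}

/* Among tasks with ready[i] != 0, return the one with the greatest height
   (lowest index on ties), or nt if no task is ready. */
size_t pick_highest(size_t nt, const unsigned char *ready,
                    const size_t *height)
{
    size_t best = nt;
    for (size_t i = 0; i < nt; ++i)
        if (ready[i] && (best == nt || height[i] > height[best]))
            best = i;
    return best;
}
```

A task that feeds the next panel factorization sits on a longer path than one that only updates the trailing matrix, so it is picked first. That is how the ordering mimics lookahead without any change to the algorithm code.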
Performance
  Implementations
    SuperMatrix + serial BLAS
      Partial and incremental pivoting
    LAPACK dgetrf + multithreaded BLAS
    Multithreaded dgetrf
  Double precision real floating point arithmetic
  Tuned block size per problem size
Performance
  Target Architecture – Linux
    4 socket 2.3 GHz AMD Opteron Quad-Core
      ranger.tacc.utexas.edu
      3936 SMP nodes
      16 cores per node
      2 MB shared L3 cache per socket
    OpenMP
      Intel compiler 10.1
    BLAS
      GotoBLAS2 1.00, MKL 10.0
Performance
  Target Architecture – Windows
    4 socket 2.4 GHz Intel Xeon E7330 Quad-Core
      Windows Server 2008 R2 Enterprise
      16 core UMA machine
      Two 3 MB shared L2 caches per socket
    OpenMP
      Microsoft Visual C
    BLAS
      GotoBLAS2 1.00, Intel MKL 10.2
Performance
  Results
    SuperMatrix is competitive with GotoBLAS and MKL
    Incremental pivoting ramps up in performance faster, but partial pivoting provides better asymptotic performance
    Linux and Windows platforms attain similar performance curves
Conclusion
  Separation of Concerns
    Programmability
    Allows us to experiment with different scheduling algorithms
Acknowledgements
  Andrew Chapman, Robert van de Geijn
    I thank the other members of the FLAME team for their support
  Funding
    Microsoft
    NSF grants
      CCF–
      CCF–
Conclusion
  More Information
  Questions?