Published by Charla Lane. Modified over 8 years ago.
1
SPAA 2010, June 13-15, 2010
Managing the Complexity of Lookahead for LU Factorization with Pivoting
Ernie Chan
2
Motivation
Solving Linear Systems
  Solve for x: A x = b
  Factorize A, O(n^3): P A = L U
  Forward and backward substitution, O(n^2): L y = P b, U x = y
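The two substitution steps can be sketched in plain C as follows. This is a minimal illustration, not code from the talk: the function name `lu_solve`, the row-major layout, and the LAPACK-like pivot vector (0-based, `piv[k]` is the row swapped with row `k` at step `k`) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A x = b given an in-place factorization P A = L U.
 * lu:  n x n, row-major (assumed layout). The strictly lower part holds L
 *      (unit diagonal implied), the upper part holds U.
 * piv: piv[k] is the row that was swapped with row k at step k.
 * x:   holds b on entry and the solution on exit. */
void lu_solve(size_t n, const double *lu, const size_t *piv, double *x)
{
    /* Form P b by replaying the row swaps in order. */
    for (size_t k = 0; k < n; ++k) {
        assert(piv[k] >= k && piv[k] < n);
        double t = x[k];
        x[k] = x[piv[k]];
        x[piv[k]] = t;
    }
    /* Forward substitution: L y = P b. */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < i; ++j)
            x[i] -= lu[i * n + j] * x[j];
    /* Backward substitution: U x = y. */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; ++j)
            x[i] -= lu[i * n + j] * x[j];
        assert(lu[i * n + i] != 0.0);   /* U must be nonsingular */
        x[i] /= lu[i * n + i];
    }
}
```

Both loops touch each entry of the factors once, which is where the O(n^2) cost of the substitution phase comes from.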
3
Goals
Programmability
  Use tools provided by FLAME
Parallelism
  Directed acyclic graph (DAG) scheduling
4
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
P A = L U
5
LU Factorization with Partial Pivoting
Formal Linear Algebra Methods Environment (FLAME)
  High-level abstractions for expressing linear algebra algorithms
  Application programming interfaces (APIs) for seamlessly implementing algorithms in code
  Library of commonly used linear algebra operations in libflame
7
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 1
[Figure: A partitioned into blocks A11, A12, A21, A22]
8
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 1
[Figure: LUpiv over A11 and A21; A12, A22]
9
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 1
[Figure: PIV over A12 and A22; A11, A21]
10
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 1
[Figure: TRSM over A12; A11, A21, A22]
11
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 1
[Figure: GEMM over A22; A11, A12, A21]
12
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 2
[Figure: A partitioned into blocks A00, A01, A02, A10, A11, A12, A20, A21, A22]
13
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 2
[Figure: LUpiv over A11 and A21; A00, A01, A02, A10, A12, A20, A22]
14
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 2
[Figure: PIV over A10 and A20, PIV over A12 and A22; A00, A01, A02, A11, A21]
15
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 2
[Figure: TRSM over A12; A00, A01, A02, A10, A11, A20, A21, A22]
16
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 2
[Figure: GEMM over A22; A00, A01, A02, A10, A11, A12, A20, A21]
17
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 3
[Figure: A partitioned into blocks A00, A01, A10, A11]
18
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 3
[Figure: LUpiv over A11; A00, A01, A10]
19
LU Factorization with Partial Pivoting
Blocked Algorithm, Iteration 3
[Figure: PIV over A10; A00, A01, A11]
20
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
P A = L U
21
Algorithm-by-Blocks
FLASH
  Storage-by-blocks
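To make storage-by-blocks concrete, here is a small C sketch of the index arithmetic. It does not reproduce the FLASH data structures: the function name `block_offset`, the restriction that n is a multiple of the block size b, and the column-major ordering (both of the blocks and within each block) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) of an n x n matrix stored as contiguous b x b
 * blocks. Assumed layout: blocks are ordered column-major, and each block
 * is column-major internally. */
size_t block_offset(size_t n, size_t b, size_t i, size_t j)
{
    assert(b > 0 && n % b == 0 && i < n && j < n);
    size_t nb = n / b;               /* number of blocks per dimension */
    size_t bi = i / b, bj = j / b;   /* block that holds the element */
    size_t ii = i % b, jj = j % b;   /* position inside that block */
    size_t block_start = (bj * nb + bi) * b * b;
    return block_start + jj * b + ii;
}
```

Because every block is contiguous in memory, a task that operates on one block works on a single compact region, which is what lets blocks serve as the unit of data for scheduling.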
22
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

FLA_Part_2x1( p,    &pT,
                    &pB,            0, FLA_TOP );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ******** */        /* **************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );

  FLA_Repart_2x1_to_3x1( pT,                  &p0,
                      /* ** */              /* ** */
                                              &p1,
                         pB,                  &p2,        1, FLA_BOTTOM );

  /*------------------------------------------------------*/

  FLA_Merge_2x1( A11,
                 A21,   &AB1 );
  FLASH_LU_piv( AB1, p1 );

  FLA_Merge_2x1( A10,
                 A20,   &AB0 );
  FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB0 );

  FLA_Merge_2x1( A12,
                 A22,   &AB2 );
  FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB2 );

  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );

  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );

  /*------------------------------------------------------*/

  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ********** */      /* ************* */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );

  FLA_Cont_with_3x1_to_2x1( &pT,                   p0,
                                                   p1,
                          /* ** */              /* ** */
                            &pB,                   p2,     FLA_TOP );
}
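For readers without libflame at hand, the same four steps per iteration (LUpiv on the panel, PIV to the left and right of it, TRSM, GEMM) can be sketched with plain loops. This is an illustrative stand-in, not the FLAME or FLASH implementation: the name `lu_piv_blocked`, the row-major array, the helper functions, and the 0-based pivot vector are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major element access; expects `a` and `n` in scope. */
#define AT(i, j) a[(i) * n + (j)]

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Swap rows r1 and r2 of a, restricted to columns [c0, c1). */
static void swap_rows(double *a, size_t n, size_t r1, size_t r2,
                      size_t c0, size_t c1)
{
    for (size_t j = c0; j < c1; ++j) {
        double t = AT(r1, j); AT(r1, j) = AT(r2, j); AT(r2, j) = t;
    }
}

/* In-place blocked LU with partial pivoting on an n x n matrix, block size b.
 * piv[c] receives the row swapped with row c. Returns -1 on a zero pivot. */
int lu_piv_blocked(size_t n, size_t b, double *a, size_t *piv)
{
    assert(b > 0);
    for (size_t k = 0; k < n; k += b) {
        size_t e = (k + b < n) ? k + b : n;   /* panel columns are [k, e) */

        /* LUpiv: unblocked factorization of the panel [A11; A21]. */
        for (size_t c = k; c < e; ++c) {
            size_t p = c;
            for (size_t i = c + 1; i < n; ++i)
                if (abs_d(AT(i, c)) > abs_d(AT(p, c))) p = i;
            piv[c] = p;
            if (AT(p, c) == 0.0) return -1;
            swap_rows(a, n, c, p, k, e);      /* swap inside the panel only */
            for (size_t i = c + 1; i < n; ++i) {
                AT(i, c) /= AT(c, c);
                for (size_t j = c + 1; j < e; ++j)
                    AT(i, j) -= AT(i, c) * AT(c, j);
            }
        }
        /* PIV: apply the panel's row swaps to the columns left and right. */
        for (size_t c = k; c < e; ++c) {
            swap_rows(a, n, c, piv[c], 0, k);
            swap_rows(a, n, c, piv[c], e, n);
        }
        /* TRSM: A12 := inv(L11) * A12, with L11 unit lower triangular. */
        for (size_t i = k; i < e; ++i)
            for (size_t r = k; r < i; ++r)
                for (size_t j = e; j < n; ++j)
                    AT(i, j) -= AT(i, r) * AT(r, j);
        /* GEMM: A22 := A22 - A21 * A12. */
        for (size_t i = e; i < n; ++i)
            for (size_t r = k; r < e; ++r)
                for (size_t j = e; j < n; ++j)
                    AT(i, j) -= AT(i, r) * AT(r, j);
    }
    return 0;
}
```

With b = 1 this reduces to the usual unblocked right-looking algorithm; larger b moves most of the arithmetic into the GEMM step.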
23
Algorithm-by-Blocks
LU Factorization with Partial Pivoting, Iteration 1
[Figure: tasks LUpiv 0; PIV 1, PIV 2; TRSM 3, TRSM 4; GEMM 5, GEMM 6, GEMM 7, GEMM 8]
24
Algorithm-by-Blocks
LU Factorization with Partial Pivoting, Iteration 2
[Figure: tasks LUpiv 9; PIV 10, PIV 11; TRSM 12; GEMM 13]
25
Algorithm-by-Blocks
LU Factorization with Partial Pivoting, Iteration 3
[Figure: tasks LUpiv 14; PIV 15, PIV 16]
26
[Figure: graph of all tasks: LUpiv 0, PIV 1, PIV 2, TRSM 3, TRSM 4, GEMM 5, GEMM 6, GEMM 7, GEMM 8, LUpiv 9, PIV 10, PIV 11, TRSM 12, GEMM 13, LUpiv 14, PIV 15, PIV 16]
27
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
P A = L U
28
SuperMatrix Runtime System
Separation of Concerns
Analyzer
  Decomposes subproblems into component tasks
  Stores tasks sequentially in a global task queue
  Internally calculates all dependencies between tasks, which form a directed acyclic graph (DAG), using only the input and output parameters of each task
Dispatcher
  Spawns threads
  Schedules and dispatches tasks to threads in parallel
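One way to derive dependencies from nothing but each task's inputs and outputs is sketched below in C. The slides do not show SuperMatrix's internal data structures, so the `Task` struct, the integer block ids, and the functions `depends_on` and `build_dag` are hypothetical. The sketch records every flow, anti, and output dependence, including transitive ones that a real analyzer might prune.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ARGS 3

/* Hypothetical task record: blocks are identified by small integer ids. */
typedef struct {
    const char *name;
    int in[MAX_ARGS];  size_t n_in;    /* ids of blocks read */
    int out[MAX_ARGS]; size_t n_out;   /* ids of blocks written */
} Task;

/* True if the two id lists share at least one block. */
static bool overlaps(const int *x, size_t nx, const int *y, size_t ny)
{
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            if (x[i] == y[j]) return true;
    return false;
}

/* True if `later` must wait for `earlier`, which precedes it in the queue. */
bool depends_on(const Task *later, const Task *earlier)
{
    assert(later->n_in <= MAX_ARGS && later->n_out <= MAX_ARGS);
    assert(earlier->n_in <= MAX_ARGS && earlier->n_out <= MAX_ARGS);
    return overlaps(earlier->out, earlier->n_out, later->in,  later->n_in)    /* flow   */
        || overlaps(earlier->in,  earlier->n_in,  later->out, later->n_out)   /* anti   */
        || overlaps(earlier->out, earlier->n_out, later->out, later->n_out);  /* output */
}

/* dep[j * n + i] is set when task j depends on the earlier task i. */
void build_dag(const Task *t, size_t n, bool *dep)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            dep[j * n + i] = (i < j) && depends_on(&t[j], &t[i]);
}
```

Since edges only ever point from an earlier task to a later one in the sequential queue, the resulting graph cannot contain a cycle.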
29
SuperMatrix Runtime System
Dispatcher: Single Queue
  Set of all ready and available tasks
  FIFO, priority
[Figure: one queue feeding processing elements PE 0, PE 1, …, PE p-1]
30
[Figure: graph of all tasks: LUpiv 0, PIV 1, PIV 2, TRSM 3, TRSM 4, GEMM 5, GEMM 6, GEMM 7, GEMM 8, LUpiv 9, PIV 10, PIV 11, TRSM 12, GEMM 13, LUpiv 14, PIV 15, PIV 16]
31
SuperMatrix Runtime System
Lookahead
  Schedule the GEMM 5 and GEMM 6 tasks first so that LUpiv 9 can be "computed ahead" in parallel with GEMM 7 and GEMM 8
  Implemented directly within the code, which increases complexity and detracts from programmability
    High-Performance LINPACK
32
SuperMatrix Runtime System
Scheduling
  Sorting tasks by the height of each task in the DAG mimics lookahead
Multiple queues
  Data affinity
  Work stealing
Macroblocks
  Tasks overwriting more than one block at a time
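Sorting by height can be sketched as follows, taking the height of a task to be the length of the longest path from that task to a task with no successors. The dependency matrix layout (`dep[j * n + i]` set when task j depends on task i) and the function names are assumptions for this example; the runtime's actual priority queue is not shown in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compute the height of every task. Edges only go from a lower to a higher
 * index because tasks are enqueued in sequential order, so one backward
 * sweep is enough. */
void task_heights(size_t n, const bool *dep, size_t *height)
{
    assert(dep != NULL && height != NULL);
    for (size_t i = n; i-- > 0; ) {
        height[i] = 0;
        for (size_t j = i + 1; j < n; ++j)
            if (dep[j * n + i] && height[j] + 1 > height[i])
                height[i] = height[j] + 1;
    }
}

/* Fill `order` with task ids sorted by decreasing height. Insertion sort
 * keeps it stable, so ties stay in their original queue order. */
void order_by_height(size_t n, const size_t *height, size_t *order)
{
    for (size_t i = 0; i < n; ++i) {
        size_t k = i;
        while (k > 0 && height[order[k - 1]] < height[i]) {
            order[k] = order[k - 1];
            --k;
        }
        order[k] = i;
    }
}
```

A task that feeds the next panel factorization sits on a longer path than a trailing update that feeds nothing further, so it is ordered first. The dispatcher still has to respect readiness: this ordering is only a priority among tasks whose dependencies are satisfied.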
33
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
P A = L U
34
Performance
Implementations
  SuperMatrix + serial BLAS
    Partial and incremental pivoting
  LAPACK dgetrf + multithreaded BLAS
  Multithreaded dgetrf
Double precision real floating point arithmetic
Tuned block size per problem size
35
Performance
Target Architecture: Linux
4 socket 2.3 GHz AMD Opteron Quad-Core
  ranger.tacc.utexas.edu
  3936 SMP nodes
  16 cores per node
  2 MB shared L3 cache per socket
OpenMP
  Intel compiler 10.1
BLAS
  GotoBLAS2 1.00, MKL 10.0
36
Performance
37
Performance
38
Performance
Target Architecture: Windows
4 socket 2.4 GHz Intel Xeon E7330 Quad-Core
  Windows Server 2008 R2 Enterprise
  16 core UMA machine
  Two 3 MB shared L2 caches per socket
OpenMP
  Microsoft Visual C++ 2010
BLAS
  GotoBLAS2 1.00, Intel MKL 10.2
39
Performance
40
Performance
41
Performance
Results
  SuperMatrix is competitive with GotoBLAS and MKL
  Incremental pivoting ramps up in performance faster, but partial pivoting provides better asymptotic performance
  Linux and Windows platforms attain similar performance curves
42
Outline
  LU Factorization with Partial Pivoting
  Algorithm-by-Blocks
  SuperMatrix Runtime System
  Performance
  Conclusion
P A = L U
43
Conclusion
Separation of Concerns
  Programmability
  Allows us to experiment with different scheduling algorithms
44
Acknowledgements
Andrew Chapman, Robert van de Geijn
  I thank the other members of the FLAME team for their support
Funding
  Microsoft
  NSF grants CCF–0540926, CCF–0702714
45
Conclusion
More Information
  http://www.cs.utexas.edu/~flame
Questions?
  echan@cs.utexas.edu