Managing the Complexity of Lookahead for LU Factorization with Pivoting. Ernie Chan. SPAA 2010, June 13-15, 2010.


1 Managing the Complexity of Lookahead for LU Factorization with Pivoting. Ernie Chan.

2 Motivation
Solving Linear Systems
- Solve for x: A x = b
- Factorize A, O(n^3): P A = L U
- Forward and backward substitution, O(n^2): L y = P b, U x = y
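The three steps on this slide can be illustrated with a minimal, unblocked sketch in plain C. It is not the FLAME implementation: the function names `lu_factor` and `lu_solve`, the row-major layout, and the `perm` array (which records which original row ends up in each position) are assumptions made for this example.

```c
#include <assert.h>

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Factor the n-by-n row-major matrix A in place so that P A = L U.
   L (unit lower triangular) is stored below the diagonal, U on and above it.
   perm[i] is the index of the original row that ends up in row i.
   Returns 0 on success, -1 if a zero pivot is found. */
int lu_factor(int n, double *A, int *perm)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) perm[i] = i;
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: largest magnitude entry in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (absd(A[i * n + k]) > absd(A[p * n + k])) p = i;
        if (A[p * n + k] == 0.0) return -1;
        if (p != k) {
            for (int j = 0; j < n; j++) {
                double tmp = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = tmp;
            }
            int tp = perm[k]; perm[k] = perm[p]; perm[p] = tp;
        }
        /* Eliminate below the pivot: O(n^3) work over all k. */
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
    return 0;
}

/* Solve A x = b from the factors: L y = P b, then U x = y (O(n^2)). */
void lu_solve(int n, const double *LU, const int *perm,
              const double *b, double *x)
{
    for (int i = 0; i < n; i++) {            /* forward substitution */
        double s = b[perm[i]];
        for (int j = 0; j < i; j++) s -= LU[i * n + j] * x[j];
        x[i] = s;
    }
    for (int i = n - 1; i >= 0; i--) {       /* backward substitution */
        double s = x[i];
        for (int j = i + 1; j < n; j++) s -= LU[i * n + j] * x[j];
        x[i] = s / LU[i * n + i];
    }
}
```

Once the factors are available, each additional right-hand side costs only the two O(n^2) substitutions, which is why the O(n^3) factorization is the step worth parallelizing.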

3 Goals
Programmability
- Use tools provided by FLAME
Parallelism
- Directed acyclic graph (DAG) scheduling

4 Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
P A = L U

5 LU Factorization with Partial Pivoting
Formal Linear Algebra Method Environment (FLAME)
- High-level abstractions for expressing linear algebra algorithms
- Application programming interfaces (APIs) for seamlessly implementing algorithms in code
- Library of commonly used linear algebra operations in libflame

6 [Figure; no text captured for this slide.]

7 LU Factorization with Partial Pivoting: Blocked Algorithm, Iteration 1. [Figure: matrix partitioned into blocks A11, A12, A21, A22.]

8 Blocked Algorithm, Iteration 1. [Figure: LUpiv on the panel formed by A11 and A21.]

9 Blocked Algorithm, Iteration 1. [Figure: PIV on the blocks A12 and A22.]

10 Blocked Algorithm, Iteration 1. [Figure: TRSM on A12.]

11 Blocked Algorithm, Iteration 1. [Figure: GEMM on A22.]

12 Blocked Algorithm, Iteration 2. [Figure: matrix partitioned into blocks A00, A01, A02, A10, A11, A12, A20, A21, A22.]

13 Blocked Algorithm, Iteration 2. [Figure: LUpiv on the panel formed by A11 and A21.]

14 Blocked Algorithm, Iteration 2. [Figure: PIV on the blocks A10 and A20, and PIV on the blocks A12 and A22.]

15 Blocked Algorithm, Iteration 2. [Figure: TRSM on A12.]

16 Blocked Algorithm, Iteration 2. [Figure: GEMM on A22.]

17 Blocked Algorithm, Iteration 3. [Figure: matrix partitioned into blocks A00, A01, A10, A11.]

18 Blocked Algorithm, Iteration 3. [Figure: LUpiv on A11.]

19 Blocked Algorithm, Iteration 3. [Figure: PIV on A10.]
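The four steps repeated in each iteration of the blocked algorithm (LUpiv, PIV, TRSM, GEMM) can be sketched in plain C on a dense row-major array. This is only an illustration under assumed names (`lu_blocked`, `swap_rows`) with hand-written loops in place of the BLAS calls; `piv[j]` holds the row that was swapped with row `j`, 0-based.

```c
#include <assert.h>

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Swap rows r1 and r2 of the n-column matrix A over columns [c0, c1). */
static void swap_rows(int n, double *A, int r1, int r2, int c0, int c1)
{
    if (r1 == r2) return;
    for (int c = c0; c < c1; c++) {
        double t = A[r1 * n + c];
        A[r1 * n + c] = A[r2 * n + c];
        A[r2 * n + c] = t;
    }
}

/* Blocked right-looking LU with partial pivoting, block size nb.
   Returns 0 on success, -1 if a zero pivot is found. */
int lu_blocked(int n, int nb, double *A, int *piv)
{
    assert(n > 0 && nb > 0);
    for (int k = 0; k < n; k += nb) {
        int e = (k + nb < n) ? k + nb : n;   /* panel = columns [k, e) */

        /* LUpiv: unblocked factorization of the panel (A11 over A21). */
        for (int j = k; j < e; j++) {
            int p = j;
            for (int i = j + 1; i < n; i++)
                if (absd(A[i * n + j]) > absd(A[p * n + j])) p = i;
            if (A[p * n + j] == 0.0) return -1;
            piv[j] = p;
            swap_rows(n, A, j, p, k, e);     /* swap inside the panel only */
            for (int i = j + 1; i < n; i++) {
                A[i * n + j] /= A[j * n + j];
                for (int c = j + 1; c < e; c++)
                    A[i * n + c] -= A[i * n + j] * A[j * n + c];
            }
        }

        /* PIV: apply the panel's swaps to the columns left and right of it. */
        for (int j = k; j < e; j++) {
            swap_rows(n, A, j, piv[j], 0, k);
            swap_rows(n, A, j, piv[j], e, n);
        }

        /* TRSM: A12 := inv(L11) * A12, with L11 unit lower triangular. */
        for (int i = k; i < e; i++)
            for (int r = k; r < i; r++)
                for (int c = e; c < n; c++)
                    A[i * n + c] -= A[i * n + r] * A[r * n + c];

        /* GEMM: A22 := A22 - A21 * A12. */
        for (int i = e; i < n; i++)
            for (int r = k; r < e; r++)
                for (int c = e; c < n; c++)
                    A[i * n + c] -= A[i * n + r] * A[r * n + c];
    }
    return 0;
}
```

Calling `lu_blocked(n, nb, A, piv)` with `nb >= n` degenerates to the unblocked algorithm; smaller block sizes shift most of the work into the GEMM update.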

20 Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
P A = L U

21 Algorithm-by-Blocks
FLASH
- Storage-by-blocks
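To make storage-by-blocks concrete, here is a small C sketch of one possible index mapping. It is an assumption made for illustration, not FLASH's actual data structure: the function `blocked_offset` is hypothetical, blocks are ordered column-major, elements inside each block are column-major, and the block size is assumed to divide the matrix size.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) when an n-by-n matrix is stored as contiguous
   b-by-b blocks. Blocks are laid out in column-major order, and so are
   the elements inside each block. */
size_t blocked_offset(size_t n, size_t b, size_t i, size_t j)
{
    assert(b > 0 && n % b == 0 && i < n && j < n);
    size_t nblk = n / b;              /* blocks per dimension */
    size_t bi = i / b, bj = j / b;    /* which block holds (i, j) */
    size_t ii = i % b, jj = j % b;    /* position inside that block */
    return (bj * nblk + bi) * b * b + jj * b + ii;
}
```

With this layout every block occupies `b * b` consecutive elements, so a task that operates on one block touches a single contiguous region of memory.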

22
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  FLA_Part_2x1( p,    &pT,
                      &pB,            0, FLA_TOP );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ******** */        /* **************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );

    FLA_Repart_2x1_to_3x1( pT,                  &p0,
                        /* ** */              /* ** */
                                                &p1,
                           pB,                  &p2,        1, FLA_BOTTOM );

    /*------------------------------------------------------*/

    FLA_Merge_2x1( A11, A21, &AB1 );
    FLASH_LU_piv( AB1, p1 );

    FLA_Merge_2x1( A10, A20, &AB0 );
    FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB0 );

    FLA_Merge_2x1( A12, A22, &AB2 );
    FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB2 );

    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );

    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );

    /*------------------------------------------------------*/

    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ********** */      /* ************* */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );

    FLA_Cont_with_3x1_to_2x1( &pT,                   p0,
                                                     p1,
                            /* ** */              /* ** */
                              &pB,                   p2,     FLA_TOP );
  }

23 Algorithm-by-Blocks: LU Factorization with Partial Pivoting, Iteration 1. [Figure: tasks LUpiv 0, PIV 1, PIV 2, TRSM 3, TRSM 4, GEMM 5, GEMM 6, GEMM 7, GEMM 8.]

24 Algorithm-by-Blocks: LU Factorization with Partial Pivoting, Iteration 2. [Figure: tasks LUpiv 9, PIV 10, PIV 11, TRSM 12, GEMM 13.]

25 Algorithm-by-Blocks: LU Factorization with Partial Pivoting, Iteration 3. [Figure: tasks LUpiv 14, PIV 15, PIV 16.]

26 [Figure: task graph of all 17 tasks, LUpiv 0 through PIV 16.]

27 Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
P A = L U

28 SuperMatrix Runtime System
Separation of Concerns
- Analyzer
  - Decomposes subproblems into component tasks
  - Stores tasks sequentially in a global task queue
  - Internally calculates all dependencies between tasks, which form a directed acyclic graph (DAG), using only the input and output parameters of each task
- Dispatcher
  - Spawns threads
  - Schedules and dispatches tasks to threads in parallel
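The analyzer's idea of deriving dependencies purely from each task's inputs and outputs can be sketched as follows. This is a simplified illustration, not the SuperMatrix code: the `Task` struct, `depends_on`, and `build_dag` are hypothetical names, blocks are identified by integer ids, and each task is assumed to overwrite exactly one block.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IN 4

typedef struct {
    int out;           /* id of the block the task overwrites */
    int in[MAX_IN];    /* ids of blocks the task only reads */
    int n_in;          /* number of valid entries in in[] */
} Task;

static bool reads(const Task *t, int block)
{
    for (int k = 0; k < t->n_in; k++)
        if (t->in[k] == block) return true;
    return false;
}

/* True if `later` must wait for `earlier`, where `earlier` precedes
   `later` in the sequential task queue. */
bool depends_on(const Task *later, const Task *earlier)
{
    if (later->out == earlier->out) return true;   /* output (write after write) */
    if (reads(later, earlier->out)) return true;   /* flow (read after write) */
    if (reads(earlier, later->out)) return true;   /* anti (write after read) */
    return false;
}

/* Fill dep[j * n + i] with true when task j depends on an earlier task i. */
void build_dag(const Task *tasks, int n, bool *dep)
{
    assert(n >= 0);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            dep[j * n + i] = (i < j) && depends_on(&tasks[j], &tasks[i]);
}
```

Because edges only ever point from an earlier task to a later one in queue order, the resulting graph is acyclic by construction. The sketch records every pairwise dependence, including ones already implied transitively.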

29 SuperMatrix Runtime System
Dispatcher: Single Queue
- Set of all ready and available tasks
- FIFO, priority
[Figure: processing elements PE 0, PE 1, …, PE p-1 served by a single queue.]

30 [Figure: task graph of all 17 tasks, LUpiv 0 through PIV 16.]

31 SuperMatrix Runtime System
Lookahead
- Schedule the GEMM 5 and GEMM 6 tasks first so that LUpiv 9 can be “computed ahead” in parallel with GEMM 7 and GEMM 8
- Implemented directly within the code, which increases the complexity and detracts from programmability
  - High-Performance LINPACK

32 SuperMatrix Runtime System
Scheduling
- Sorting tasks by the height of each task in the DAG mimics lookahead
- Multiple queues
  - Data affinity
  - Work stealing
Macroblocks
- Tasks overwriting more than one block at a time
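Sorting by height can be sketched in C as follows, taking the height of a task to be the length of the longest path from that task to a task with no successors. The names `task_heights` and `pick_highest` and the dense `dep` array are assumptions for this example, not the runtime's actual interface.

```c
#include <assert.h>

/* dep[j * n + i] != 0 means task j depends on task i, with i < j
   (tasks are numbered in sequential order, so every edge goes from a
   lower index to a higher one). height[i] becomes the length of the
   longest dependency chain that starts at task i. */
void task_heights(int n, const int *dep, int *height)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        height[i] = 0;
        for (int j = i + 1; j < n; j++)
            if (dep[j * n + i] && height[j] + 1 > height[i])
                height[i] = height[j] + 1;
    }
}

/* Among the ready tasks, return the id of one with the greatest height,
   or -1 if none are ready. */
int pick_highest(const int *ready, int n_ready, const int *height)
{
    int best = -1;
    for (int k = 0; k < n_ready; k++)
        if (best < 0 || height[ready[k]] > height[best])
            best = ready[k];
    return best;
}
```

Dispatching the ready task with the greatest height favors tasks on long dependency chains, which is how a priority queue can mimic lookahead without changing the algorithm's code.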

33 Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
P A = L U

34 Performance
Implementations
- SuperMatrix + serial BLAS
  - Partial and incremental pivoting
- LAPACK dgetrf + multithreaded BLAS
- Multithreaded dgetrf
- Double precision real floating point arithmetic
- Tuned block size per problem size

35 Performance
Target Architecture: Linux
- 4 socket 2.3 GHz AMD Opteron Quad-Core
  - 3936 SMP nodes
  - 16 cores per node
  - 2 MB shared L3 cache per socket
- OpenMP
  - Intel compiler 10.1
- BLAS
  - GotoBLAS2 1.00, MKL 10.0

36 Performance

37 Performance

38 Performance
Target Architecture: Windows
- 4 socket 2.4 GHz Intel Xeon E7330 Quad-Core
  - Windows Server 2008 R2 Enterprise
  - 16 core UMA machine
  - Two 3 MB shared L2 caches per socket
- OpenMP
  - Microsoft Visual C++ 2010
- BLAS
  - GotoBLAS2 1.00, Intel MKL 10.2

39 Performance

40 Performance

41 Performance
Results
- SuperMatrix is competitive with GotoBLAS and MKL
- Incremental pivoting ramps up in performance faster, but partial pivoting provides better asymptotic performance
- Linux and Windows platforms attain similar performance curves

42 Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
P A = L U

43 Conclusion
Separation of Concerns
- Programmability
- Allows us to experiment with different scheduling algorithms

44 Acknowledgements
Andrew Chapman, Robert van de Geijn
- I thank the other members of the FLAME team for their support
Funding
- Microsoft
- NSF grants
  - CCF–0540926
  - CCF–0702714

45 Conclusion
More Information
Questions?
