
Slide 1: "[Soccer] is a very simple game. It’s just very hard to play it simple." - Johan Cruyff. Dense Linear Algebra, RvdG.

Slide 2: Robert van de Geijn, "Designing a Library to be Multi-Accelerator Ready: A Case Study." Department of Computer Science and Institute for Computational Engineering and Sciences, The University of Texas at Austin.

Slide 3: Outline
- Motivation
- What is FLAME?
- Deriving algorithms to be correct
- Representing algorithms in code
- Of blocked algorithms and algorithms-by-blocks
- Runtime support for multicore, GPU, and multiGPU
- Extensions to distributed memory platforms
- Related work
- Conclusion

Slide 4: Moments of Inspiration
Birth of multi-threaded libflame:
- Aug. 2006 - an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
- Sept. 2006 - working prototype (by G. Quintana)
- Oct. 2006 - grant proposal (to NSF, later funded)
- Jan. 2007 - paper submitted (to SPAA07, accepted)
- April 2007 - released with libflame R1.0
Birth of multi-GPU libflame:
- Fall 2007 - runtime used to manage data and tasks on a single GPU (UJI, Spain)
- March 2008 - NVIDIA donates a four-GPU Tesla S870 system
- Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented
- Then the power cord fried…

Slide 5: Birth of MultiGPU libflame. [Two performance plots, labeled "Shortly after" and "After two hours".] G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP'09.

Slide 6: What Supports Our Productivity/Performance?
- Deep understanding of the domain
- Foundational computer science
- Derivation of algorithms
- Software implementation of hardware techniques
- Blocking for performance
- Abstraction
- Separation of concerns

Slide 7: Outline (same items as Slide 3).

Slide 8: What is FLAME?
- A notation for expressing linear algebra algorithms
- A methodology for deriving such algorithms
- A set of abstractions for representing such algorithms (in LaTeX, M-script, C, etc.)
- A modern library (libflame): an alternative to BLAS, LAPACK, ScaLAPACK, and related efforts; many new contributions to the theory and practice of dense linear algebra; also banded and Krylov subspace methods
- A set of tools supporting the above: mechanical derivation, automatic generation of code, Design-by-Transformation (DxT)

Slide 9: Who is FLAME?

Slide 10: Outline (same items as Slide 3).

Slide 11: Deriving Algorithms to be Correct
- Goal: know all algorithms for a given operation, then pick the right algorithm for the given architecture.
- Problem: how to find the right algorithm.
- Solution: formal derivation (Hoare, Dijkstra, …): given an operation, systematically derive the family of algorithms for computing it.

Slide 12: Notation. The notation used to express an algorithm should reflect how the algorithm is naturally explained.

Slide 13: Example: The Cholesky Factorization. Lower triangular case: A = L L^T with L lower triangular. Key to the solution of symmetric positive definite (s.p.d.) linear systems A x = b: rewrite as (L L^T) x = b, solve L y = b for y, then L^T x = y for x.
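
Written out in equation form (this simply restates the slide; it is standard material, not part of the original transcript):

    A x = b,\quad A = L L^{\mathsf T} \ (\text{s.p.d., } L \text{ lower triangular})
    \;\Longrightarrow\; (L L^{\mathsf T})\, x = b
    \;\Longrightarrow\; \text{solve } L y = b \text{ for } y \ (\text{forward substitution}),
    \ \text{then } L^{\mathsf T} x = y \text{ for } x \ (\text{backward substitution}).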

Slide 14: Algorithm Loop: Repartition. The 2x2 partition (A_TL, A_BL, A_BR) is repartitioned into a 3x3 form exposing A00, a10^T, A20, alpha11, a21, and A22. These are indexing operations only.

Slide 15: Algorithm Loop: Update. The real computation: alpha11 := sqrt(alpha11); a21 := a21 / alpha11; A22 := A22 - a21 a21^T.

Slide 16: Algorithm Loop: Merging. The 3x3 repartitioning is merged back into the 2x2 partition (A_TL, A_BL, A_BR), moving the boundary forward; again an indexing operation only.
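
For readers who prefer explicit loops over the partitioned notation, here is a minimal plain-C sketch of the same unblocked, right-looking update (column-major storage, lower triangle). It is illustrative only and is not the FLAME/C code shown later in the talk:

    #include <math.h>

    /* Unblocked right-looking Cholesky, lower triangular case.
       A is n x n, column-major with leading dimension lda; only the lower
       triangle is referenced and is overwritten with L.  Returns 0 on
       success, j+1 if the leading minor of order j+1 is not positive
       definite. */
    static int chol_lower_unb(int n, double *A, int lda)
    {
        for (int j = 0; j < n; j++) {
            double alpha11 = A[j + j * lda];
            if (alpha11 <= 0.0) return j + 1;
            alpha11 = sqrt(alpha11);          /* alpha11 := sqrt(alpha11)        */
            A[j + j * lda] = alpha11;
            for (int i = j + 1; i < n; i++)   /* a21 := a21 / alpha11            */
                A[i + j * lda] /= alpha11;
            for (int k = j + 1; k < n; k++)   /* A22 := A22 - a21 a21^T (lower)  */
                for (int i = k; i < n; i++)
                    A[i + k * lda] -= A[i + j * lda] * A[k + j * lda];
        }
        return 0;
    }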

Slide 17: Worksheet for Cholesky Factorization. [The worksheet figure is not reproduced in this transcript.]
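
As a reconstruction rather than the original figure: the worksheet derives the loop from a loop invariant. For the right-looking variant shown on the preceding slides, the invariant maintained at the top of the loop can be stated as follows, where \hat{A} denotes the original contents of A:

    \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
    =
    \begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^{\mathsf T} \end{pmatrix},
    \qquad
    \hat{A}_{TL} = L_{TL} L_{TL}^{\mathsf T},\quad
    \hat{A}_{BL} = L_{BL} L_{TL}^{\mathsf T}.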

Slide 18: Mechanical Derivation of Algorithms. Mechanical development from the mathematical specification A = L L^T via a mechanical procedure. Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation, UT-Austin, August 2006.

Slide 19: Is Formal Derivation Practical? libflame:
- 128+ matrix operations
- 1,389+ algorithm implementations
- Test suite created in 2011: 126,756 tests executed
- Only 3 minor bugs found in the library… (now fixed)

Slide 20: Impact on (Single) GPU Computing. [Performance plot vs. CUBLAS 2009 (which has been optimized since).]

Slide 21: Impact on (Single) GPU Computing. [Performance plot vs. CUBLAS 2009.] Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation, Univ. Jaume I, May 2011. Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37, Universidad Jaume I, updated May 21, 2009.

Slide 22: A Sampling of Functionality

    Operation                    | Classic FLAME | SuperMatrix (multithreaded/multiGPU) | lapack2flame
    Level-3 BLAS                 | y             | y                                    | N.A.
    Cholesky                     | y             | y                                    | y
    LU with partial pivoting     | y             | y                                    | y
    LU with incremental pivoting | y             | y                                    | N.A.
    QR (UT)                      | y             | y                                    | y
    LQ (UT)                      | y             | y                                    | y
    SPD/HPD inversion            | y             | y                                    | y
    Triangular inversion         | y             | y                                    | y
    Triangular Sylvester         | y             | y                                    | y
    Triangular Lyapunov          | y             | y                                    | y
    Up-and-downdate (UT)         | y             |                                      | N.A.
    SVD                          | next week     | soon                                 |
    EVD                          | next week     | soon                                 |

Slide 23: Outline (same items as Slide 3).

Slide 24: Representing Algorithms in Code. Code should closely resemble how an algorithm is presented, so that no bugs can be introduced when translating an algorithm to code.

Slide 25: Representing Algorithms in Code. Spark + APIs for C, F77, Matlab, LabView, and LaTeX. http://www.cs.utexas.edu/users/flame/Spark/

Slide 26: FLAME/C API

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );
    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      b = min( FLA_Obj_length( ABR ), nb_alg );
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
                          /* ************* */  /* ************************** */
                                                 &a10t, /**/ &alpha11, &a12t,
                             ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
                             1, 1, FLA_BR );
      /*--------------------------------------*/
      FLA_Sqrt( alpha11 );                       /* alpha11 := sqrt( alpha11 )   */
      FLA_Inv_scal( alpha11, a21 );              /* a21     := a21 / alpha11     */
      FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
               FLA_MINUS_ONE, a21, A22 );        /* A22     := A22 - a21 a21^T   */
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,  A00,  a01,     /**/ A02,
                                                  a10t, alpha11, /**/ a12t,
                              /* ************** */ /* ************************/
                                &ABL, /**/ &ABR,  A20,  a21,     /**/ A22,
                                FLA_TL );
    }

For now, libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS.

Slide 27: Outline (same items as Slide 3).

Slide 28: Multicore/MultiGPU - Issues
- Manage computation: assignment of tasks to cores and/or GPUs; granularity is important.
- Manage memory: manage data transfer between the "host" and the caches of the cores, or between the host and the GPUs' local memories; granularity is important; keep data in local memory as long as possible.

Slide 29: Where have we seen this before? Computer architecture, late 1960s:
- Superscalar units proposed
- Unit of data: a floating-point number; unit of computation: a floating-point operation
- Examine dependencies; execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
- Extract parallelism from a sequential instruction stream
R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D, 1967. This is the basis for the exploitation of ILP on current superscalar processors!

Slide 30: Of Blocks and Tasks. Dense matrix computation:
- Unit of data: a block of the matrix
- Unit of computation (task): an operation with blocks
- Dependency: input/output of an operation with blocks
- Instruction stream: the sequential libflame code, which generates a DAG
- The runtime system schedules the tasks
- Goal: minimize data transfer and maximize utilization
(A minimal sketch of such a task record follows.)
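
To make the block/task/dependency vocabulary concrete, here is a minimal, generic C sketch (not SuperMatrix internals; all names are illustrative) of a task record and of dependence detection between tasks that read and write blocks:

    #define MAX_OPERANDS 4

    typedef struct block { int i, j; void *data; } block_t;   /* one tile of the matrix */

    typedef struct task {
        const char *name;                 /* e.g. "CHOL", "TRSM", "SYRK"            */
        block_t *in[MAX_OPERANDS];        /* blocks read by this task               */
        block_t *out[MAX_OPERANDS];       /* blocks written by this task            */
        int n_in, n_out;
        struct task *deps[64];            /* tasks that must complete first (DAG)   */
        int n_deps;
    } task_t;

    /* A new task depends on an earlier task if it reads or writes a block the
       earlier task writes (flow or output dependence).  Anti-dependences are
       omitted in this sketch. */
    static int depends_on(const task_t *later, const task_t *earlier)
    {
        for (int o = 0; o < earlier->n_out; o++) {
            for (int i = 0; i < later->n_in;  i++)
                if (later->in[i]  == earlier->out[o]) return 1;
            for (int w = 0; w < later->n_out; w++)
                if (later->out[w] == earlier->out[o]) return 1;
        }
        return 0;
    }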

Slide 31: Review: Blocked Algorithms. Cholesky factorization, iteration by iteration (1st, 2nd, 3rd, …). In each iteration:
- A11 = L11 L11^T
- A21 := L21 = A21 L11^{-T}
- A22 := A22 - L21 L21^T
(These updates follow from the partitioned factorization; see the derivation sketch below.)
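
For completeness, the three updates can be read off from the partitioned equation A = L L^T; this is a standard derivation, restated here rather than taken from the slides:

    \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
    =
    \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} L_{11}^{\mathsf T} & L_{21}^{\mathsf T} \\ 0 & L_{22}^{\mathsf T} \end{pmatrix}
    \;\Longrightarrow\;
    \begin{cases}
    A_{11} = L_{11} L_{11}^{\mathsf T} & \text{(factor the diagonal block)}\\
    A_{21} = L_{21} L_{11}^{\mathsf T} \;\Rightarrow\; L_{21} = A_{21} L_{11}^{-\mathsf T} & \text{(triangular solve)}\\
    A_{22} - L_{21} L_{21}^{\mathsf T} = L_{22} L_{22}^{\mathsf T} & \text{(rank-}k\text{ update, then recurse on } A_{22}\text{)}
    \end{cases}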

Slide 32: Blocked Algorithms. Cholesky factorization A = L L^T, expressed with the APIs + tools:

    FLA_Part_2x2(…);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
      FLA_Repart_2x2_to_3x3(…);
      /*--------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(…);
    }

Slide 33: Simple Parallelization: Blocked Algorithms. Link with a multithreaded BLAS; the code is the same as on Slide 32. Within an iteration, FLA_Chol computes A11 = L11 L11^T, FLA_Trsm computes A21 := L21 = A21 L11^{-T}, and FLA_Syrk computes A22 := A22 - L21 L21^T.

Slide 34: Blocked Algorithms. There is more parallelism! [Figure: tasks can execute in parallel both inside the same iteration and across different iterations (1st iteration, 2nd iteration).]

Slide 35: Coding Algorithm-by-Blocks. The algorithm-by-blocks differs from Slide 32 only in that the Trsm and Syrk calls become FLASH_ routines:

    FLA_Part_2x2(…);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
      FLA_Repart_2x2_to_3x3(…);
      /*--------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );             /* A11 = L11 L11^T          */
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );                   /* A21 := L21 = A21 L11^-T  */
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );    /* A22 := A22 - L21 L21^T   */
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(…);
    }

Slide 36: Outline (same items as Slide 3).

Slide 37: Generating a DAG. [Figure: the algorithm-by-blocks code from Slide 35, run over a matrix of blocks, generates a DAG of tasks (nodes numbered 1-10 in the figure).] (A sketch of how calls can enqueue tasks instead of executing them follows.)
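
The following is a minimal, generic sketch (illustrative only; these are not the actual SuperMatrix or FLASH entry points) of the idea that each block-level call records a task into a queue, from which the DAG is built, instead of computing right away. It builds on the task_t/depends_on sketch given earlier:

    /* Hypothetical task queue; task_t, block_t, and depends_on() as sketched above. */
    static task_t queue[4096];
    static int    n_queued = 0;

    /* Record one block-level operation as a task and wire up its dependencies
       against previously enqueued tasks. */
    static task_t *enqueue_task(const char *name,
                                block_t *in[],  int n_in,
                                block_t *out[], int n_out)
    {
        task_t *t = &queue[n_queued++];
        t->name = name;  t->n_in = n_in;  t->n_out = n_out;  t->n_deps = 0;
        for (int i = 0; i < n_in;  i++) t->in[i]  = in[i];
        for (int i = 0; i < n_out; i++) t->out[i] = out[i];
        for (int k = 0; k < n_queued - 1; k++)
            if (depends_on(t, &queue[k]))
                t->deps[t->n_deps++] = &queue[k];
        return t;
    }

    /* A "FLASH-like" triangular solve over a column of blocks: one TRSM task
       per block of A21, each reading the factored diagonal block A11. */
    static void trsm_by_blocks(block_t *A11, block_t *A21_blocks[], int n_blocks)
    {
        for (int k = 0; k < n_blocks; k++) {
            block_t *in[]  = { A11, A21_blocks[k] };
            block_t *out[] = { A21_blocks[k] };
            enqueue_task("TRSM", in, 2, out, 1);
        }
    }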

Slide 38: Managing Tasks and Blocks. Separation of concerns:
- The sequential libflame routine generates the DAG.
- The runtime system (SuperMatrix) manages and schedules the DAG.
- As one moves from one architecture to another, only the runtime system needs to be updated: multicore, out-of-core, single GPU, multiGPU, distributed runtime, …

Slide 39: Runtime System - SuperMatrix. [Figure: SuperMatrix takes the DAG of tasks and schedules it onto a multicore using a scheduling heuristic.]

Slide 40: Runtime System for GPU - SuperMatrix. [Figure: the same DAG is scheduled onto CPU + GPU; the runtime also manages data transfer to and from the accelerator.]

Slide 41: Runtime System for MultiGPU - SuperMatrix. [Figure: the same DAG is scheduled onto CPU + multiple GPUs; the runtime manages data transfer among the multiple accelerators.]

Slide 42: MultiGPU. How do we program these? [Figure: CPU(s) connected via the PCIe bus and an interconnect to GPUs #0-#3.]

Slide 43: MultiGPU: a User's View

    FLA_Obj A;
    // Initialize conventional matrix: buffer, m, rs, cs
    // Obtain storage blocksize, # of threads: b, n_threads

    FLA_Init();
    FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
    FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );

    FLASH_Queue_set_num_threads( n_threads );
    FLASH_Queue_enable_gpu();

    FLASH_Chol( FLA_LOWER_TRIANGULAR, A );

    FLASH_Obj_free( &A );
    FLA_Finalize();

Slide 44: MultiGPU: Under the Cover. Naïve approach:
- Before execution, transfer the data to the device.
- Call CUBLAS operations (the implementation is "someone else's problem").
- Upon completion, retrieve the results back to the host.
Result: poor data locality. [Figure: CPU(s), PCIe bus, GPUs #0-#3, interconnect.] (A sketch of this per-call transfer pattern follows.)
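
As a sketch of why this is wasteful (illustrative only; gpu_trsm_block and the block layout are hypothetical, and only the CUDA runtime calls are real), the naïve pattern ships every operand across the PCIe bus for every single block operation and ships the result straight back:

    #include <cuda_runtime.h>

    /* Hypothetical device wrapper for one block-level TRSM; stands in for a
       CUBLAS call in this sketch. */
    void gpu_trsm_block(double *d_A11, double *d_A21, int b);

    /* Naive per-task transfers: host <-> device traffic for every task. */
    void trsm_block_naive(const double *A11, double *A21, int b)
    {
        size_t bytes = (size_t)b * b * sizeof(double);
        double *d_A11, *d_A21;

        cudaMalloc((void **)&d_A11, bytes);
        cudaMalloc((void **)&d_A21, bytes);

        cudaMemcpy(d_A11, A11, bytes, cudaMemcpyHostToDevice);  /* ship operands in  */
        cudaMemcpy(d_A21, A21, bytes, cudaMemcpyHostToDevice);

        gpu_trsm_block(d_A11, d_A21, b);                         /* compute on device */

        cudaMemcpy(A21, d_A21, bytes, cudaMemcpyDeviceToHost);   /* ship result out   */

        cudaFree(d_A11);
        cudaFree(d_A21);
    }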

Slide 45: MultiGPU: Under the Cover. How do we program these? View the system as a shared-memory multiprocessor plus a Distributed Shared Memory (DSM) architecture. [Figure: CPU(s), PCIe bus, GPUs #0-#3, interconnect.]

Slide 46: MultiGPU: Under the Cover. View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence): a memory M shared by processors P0+Cache0, P1+Cache1, P2+Cache2, P3+Cache3. [Figure maps this onto the CPU(s), PCIe bus, and GPUs #0-#3.]

Slide 47: MultiGPU: Under the Cover. Software Distributed Shared Memory (DSM):
- Software: flexibility vs. efficiency
- The underlying distributed memory is hidden
- Reduce memory transfers using write-back, write-invalidate, …
- A well-known approach, not too efficient as middleware for general applications, but the regularity of dense linear algebra makes a difference!

Slide 48: MultiGPU: Under the Cover. Reduce the number of data transfers: the runtime handles each device's memory as a software cache:
- Operates at the block level
- Software, therefore flexible
- Write-back
- Write-invalidate
[Figure: SuperMatrix schedules the DAG over the CPU(s) and GPUs #0-#3 across the PCIe bus.] (A sketch of such per-block cache bookkeeping follows.)
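
As a minimal, generic illustration of this bookkeeping (not the SuperMatrix implementation; all names are assumptions), the runtime can keep, for every block, a valid/dirty flag per device and flush or invalidate copies lazily:

    #define MAX_GPUS 4

    typedef struct block_cache_entry {
        void *host_copy;                 /* block in host memory                     */
        void *dev_copy[MAX_GPUS];        /* block in each GPU's memory (or NULL)     */
        int   valid[MAX_GPUS];           /* device copy is up to date                */
        int   dirty[MAX_GPUS];           /* device copy newer than host (write-back) */
    } block_cache_entry_t;

    /* Before GPU g reads the block: fetch it only if its copy is stale. */
    static void acquire_for_read(block_cache_entry_t *e, int g)
    {
        if (!e->valid[g]) {
            /* transfer host_copy -> dev_copy[g] over PCIe (omitted in this sketch) */
            e->valid[g] = 1;
        }
    }

    /* After GPU g writes the block: mark it dirty there and invalidate the other
       copies (write-invalidate); the host copy is refreshed only when some other
       device or the host needs it (write-back). */
    static void release_after_write(block_cache_entry_t *e, int g)
    {
        for (int d = 0; d < MAX_GPUS; d++) {
            e->valid[d] = (d == g);
            e->dirty[d] = (d == g);
        }
    }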

Slide 49: MultiGPU: Under the Cover. The loop body now consists of FLASH_ calls:

    FLA_Part_2x2(…);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
      FLA_Repart_2x2_to_3x3(…);
      /*--------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(…);
    }

Runtime action for FLASH_Chol: factor A11 on the host.

Slide 50: Multi-GPU: Under the Cover. (Same code as Slide 49.) Runtime action: transfer A11 from the host to the appropriate devices before it is used in subsequent computations (write-update).

Slide 51: Multi-GPU: Under the Cover. (Same code as Slide 49.) Runtime action: cache A11 in the receiving device(s) in case it is needed in subsequent computations.

Slide 52: Multi-GPU: Under the Cover. (Same code as Slide 49.) Runtime action for FLASH_Trsm: send blocks to the devices; perform the Trsm on blocks of A21 (hopefully using the cached A11); keep the updated A21 in the device until it is needed by other GPU(s) (write-back).

Slide 53: Multi-GPU: Under the Cover. (Same code as Slide 49.) Runtime action for FLASH_Syrk: send blocks to the devices; perform the Syrk/Gemm on blocks of A22 (hopefully using cached blocks of A21); keep the updated A22 in the device until it is needed by other GPU(s) (write-back).

Slide 54: C = C + A B^T on S1070 (Tesla x 4). [Performance plot.]

Slide 55: Cholesky on S1070 (Tesla x 4). [Performance plot.]

Slide 56: Cholesky on S1070 (Tesla x 4). [Performance plot.]

Slide 57: Sampling of LAPACK Functionality on S2050 (Fermi x 4). [Performance plot.]

Slide 58: Sampling of LAPACK Functionality on S2050 (Fermi x 4). [Performance plot.]

Slide 59: Outline (same as Slide 3, but without the Related work item).

Slide 60: libflame for Clusters (+ Accelerators)
- PLAPACK: distributed memory (MPI); inspired FLAME; recently modified so that each node can have a GPU, keeping data in GPU memory as much as possible.
- Elemental (Jack Poulson): distributed memory (MPI); inspired by FLAME/PLAPACK; can use a GPU at each node/core.
- libflame + SuperMatrix: the runtime schedules tasks and data transfer; appropriate for small clusters.

Slide 61: PLAPACK + GPU Accelerators. [Performance plot.] Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla). Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.

Slide 62: Targeting Clusters with GPUs: SuperMatrix Distributed Runtime. [Performance plot.] Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi). Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience, in review.

Slide 63: Elemental Cholesky Factorization. [Content not reproduced in the transcript.]

Slide 64: Elemental vs. ScaLAPACK: Cholesky on 8,192 cores of BlueGene/P. [Performance plot.] Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem). Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS, submitted.

Slide 65: Single-Chip Cloud Computer
- Intel SCC research processor
- 48-core "concept vehicle"
- Created for many-core software research
- Custom communication library (RCCE)

Slide 66: SCC Results. 48 Pentium cores; MPI replaced by RCCE. [Performance plot.] Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience, in review. Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience, to appear.

Slide 67: Outline (same items as Slide 3).

Slide 68: Related Work. Data-flow parallelism, dynamic scheduling, runtimes:
- Cilk
- OpenMP (task queues)
- StarSs (SMPSs)
- StarPU
- Threading Building Blocks (TBB)
- …
What we have is specific to dense linear algebra.

Slide 69: Dense Linear Algebra Libraries

    Target platform                     | LAPACK project     | FLAME project
    Sequential                          | LAPACK             | libflame
    Sequential + multithreaded BLAS     | LAPACK             | libflame
    Multicore/multithreaded             | PLASMA             | libflame + SuperMatrix
    Multicore + out-of-order scheduling | PLASMA + Quark     | libflame + SuperMatrix
    CPU + single GPU                    | MAGMA              | libflame + SuperMatrix
    Multicore + multiGPU                | DAGuE?             | libflame + SuperMatrix
    Distributed memory                  | ScaLAPACK          | libflame + SuperMatrix, PLAPACK, Elemental
    Distributed memory + GPU            | DAGuE? ScaLAPACK?  | libflame + SuperMatrix, PLAPACK, Elemental
    Out-of-core                         | ?                  | libflame + SuperMatrix

Slide 70: Comparison with Quark. [Performance plot.] Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.

Slide 71: Outline (same items as Slide 3).

Slide 72: Conclusions
- Programmability is the key to harnessing parallel computation: one code, many target platforms.
- Formal derivation provides confidence in the code: if there is a problem, it is not in the library!
- Separation of concerns: the library developer derives algorithms and codes them; executing the routines generates a DAG; parallelism, temporal locality, and spatial locality are captured in the DAG; the runtime system uses appropriate heuristics to schedule it.

Slide 73: What does this mean for you? One successful approach:
- Identify units of data and units of computation.
- Write a sequential program that generates a DAG.
- Hand the DAG to a runtime for scheduling.
(A toy scheduling sketch follows.)
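
As a closing illustration of this recipe (again a generic sketch building on the task_t queue from the earlier sketches, not a real runtime), the "hand the DAG to a runtime" step can be as simple as repeatedly executing any task whose predecessors have completed:

    /* Naive list scheduler over the task queue built earlier: repeatedly pick
       any task whose dependencies have all completed. */
    static void run_dag(task_t queue[], int n, void (*execute)(task_t *))
    {
        int done[4096] = { 0 };
        int remaining = n;

        while (remaining > 0) {
            for (int i = 0; i < n; i++) {
                if (done[i]) continue;
                int ready = 1;
                for (int d = 0; d < queue[i].n_deps; d++)
                    if (!done[queue[i].deps[d] - queue])   /* index of predecessor */
                        ready = 0;
                if (ready) {
                    execute(&queue[i]);   /* a real runtime dispatches to a core/GPU */
                    done[i] = 1;
                    remaining--;
                }
            }
        }
    }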

Slide 74: The Future
- Currently: the library is an instantiation in code.
- Future: create a repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture; mechanically generate a library for a target architecture, exactly as an expert would; Design-by-Transformation (DxT).
Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58, 2011.

Slide 75: Availability. Everything that has been discussed is available under the LGPL or BSD license.
- libflame + SuperMatrix: http://www.cs.utexas.edu/users/flame/
- Elemental: http://code.google.com/p/elemental/
"[Soccer] is a very simple game. It’s just very hard to play it simple." - Johan Cruyff. Dense Linear Algebra, RvdG.

