Slide 1: "[Soccer] is a very simple game. It's just very hard to play it simple." - Johan Cruyff
Dense Linear Algebra (RvdG), PACC2011, Sept. 2011
Slide 2: Designing a Library to be Multi-Accelerator Ready: A Case Study
Robert van de Geijn, Department of Computer Science and Institute for Computational Engineering and Sciences, The University of Texas at Austin
Slide 3: Outline
- Motivation
- What is FLAME?
- Deriving algorithms to be correct
- Representing algorithms in code
- Of blocked algorithms and algorithms-by-blocks
- Runtime support for multicore, GPU, and multiGPU
- Extensions to distributed memory platforms
- Related work
- Conclusion
Slide 4: Moments of Inspiration
Birth of multithreaded libflame:
- Aug. 2006: an insight: libflame + algorithms-by-blocks + out-of-order scheduling (runtime) = multithreaded library
- Sept. 2006: working prototype (by G. Quintana)
- Oct. 2006: grant proposal (to NSF; later funded)
- Jan. 2007: paper submitted (to SPAA07; accepted)
- April 2007: released with libflame R1.0
Birth of multi-GPU libflame:
- Fall 2007: the runtime used to manage data and tasks on a single GPU (UJI, Spain)
- March 2008: NVIDIA donates a four-GPU Tesla S870 system
- Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented. Then the power cord fried…
Slide 5: Birth of MultiGPU libflame. [Performance plots: "Shortly after" vs. "After two hours".] G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP'09.
Slide 6: What Supports Our Productivity/Performance?
- Deep understanding of the domain
- Foundational computer science: derivation of algorithms
- Software implementation of hardware techniques: blocking for performance
- Abstraction
- Separation of concerns
Slide 7: Outline (repeated).
Slide 8: What is FLAME?
- A notation for expressing linear algebra algorithms
- A methodology for deriving such algorithms
- A set of abstractions for representing such algorithms: in LaTeX, M-script, C, etc.
- A modern library (libflame): an alternative to BLAS, LAPACK, ScaLAPACK, and related efforts; many new contributions to the theory and practice of dense linear algebra; also banded and Krylov subspace methods
- A set of tools supporting the above: mechanical derivation, automatic generation of code, Design-by-Transformation (DxT)
Slide 9: Who is FLAME?
Slide 10: Outline (repeated).
Slide 11: Deriving Algorithms to be Correct
Include all algorithms for a given operation, then pick the right algorithm for the given architecture.
- Problem: how to find the right algorithm.
- Solution: formal derivation (Hoare, Dijkstra, …): given an operation, systematically derive the family of algorithms for computing it.
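As a sketch of the idea (added here for concreteness; this is the flavor of the method, not the full FLAME worksheet): each member of the family corresponds to a loop invariant derived from the specification $A = L L^T$. For the right-looking variant of Cholesky, one such invariant states what the quadrants of the partitioned matrix contain at the top of every iteration, with $\hat{A}$ denoting the original contents:

$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
\;\text{contains}\;
\begin{pmatrix} L_{TL} & \star \\ \hat{A}_{BL} L_{TL}^{-T} & \hat{A}_{BR} - L_{BL} L_{BL}^{T} \end{pmatrix},
\qquad L_{BL} = \hat{A}_{BL} L_{TL}^{-T}.
$$

Choosing a different invariant consistent with the same specification yields a different, equally correct, member of the algorithm family.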
Slide 12: Notation
The notation used to express an algorithm should reflect how the algorithm is naturally explained.
Slide 13: Example: The Cholesky Factorization
Lower triangular case: $A = L L^T$ with $L$ lower triangular. Key to the solution of s.p.d. linear systems: given $A x = b$, factor $(L L^T) x = b$, solve $L y = b$ for $y$, then solve $L^T x = y$ for $x$.
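The two triangular solves are simple loops; a minimal plain-C sketch (illustrative only, not libflame code; assumes column-major storage with leading dimension lda and L stored in the lower triangle of A):

  /* Given the Cholesky factor L (lower triangle of A), solve A x = b:
     forward substitution for L y = b, then back substitution for
     L^T x = y. The solution overwrites b. */
  void chol_solve( int n, const double *A, int lda, double *b )
  {
      for ( int j = 0; j < n; j++ ) {            /* L y = b */
          b[j] /= A[ j + j*lda ];
          for ( int i = j + 1; i < n; i++ )
              b[i] -= A[ i + j*lda ] * b[j];
      }
      for ( int j = n - 1; j >= 0; j-- ) {       /* L^T x = y */
          for ( int i = j + 1; i < n; i++ )
              b[j] -= A[ i + j*lda ] * b[i];
          b[j] /= A[ j + j*lda ];
      }
  }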
Slide 14: Algorithm Loop: Repartition (indexing operations)
$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
$$
(Only the lower triangle is referenced.)
Slide 15: Algorithm Loop: Update (the real computation)
$$
\alpha_{11} := \sqrt{\alpha_{11}}, \qquad
a_{21} := a_{21} / \alpha_{11}, \qquad
A_{22} := A_{22} - a_{21} a_{21}^T \;\text{(lower triangle only)}.
$$
Slide 16: Algorithm Loop: Merging (indexing operation)
$$
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
$$
so that $A_{TL}$ absorbs the just-computed $\alpha_{11}$ row and column, and the loop advances.
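Put together, the repartition/update/merge steps give the classic unblocked right-looking Cholesky algorithm. A minimal plain-C sketch (not FLAME code; assumes column-major storage with leading dimension lda):

  #include <math.h>

  /* Unblocked right-looking Cholesky of an n x n s.p.d. matrix.
     On return the lower triangle of A holds L. Returns 0 on success,
     j+1 if a nonpositive pivot is found in column j. */
  static int chol_unb( int n, double *A, int lda )
  {
      for ( int j = 0; j < n; j++ ) {
          if ( A[ j + j*lda ] <= 0.0 ) return j + 1;   /* not s.p.d. */
          double alpha11 = sqrt( A[ j + j*lda ] );     /* alpha11 := sqrt(alpha11) */
          A[ j + j*lda ] = alpha11;
          for ( int i = j + 1; i < n; i++ )            /* a21 := a21 / alpha11 */
              A[ i + j*lda ] /= alpha11;
          for ( int k = j + 1; k < n; k++ )            /* A22 := A22 - a21 a21^T */
              for ( int i = k; i < n; i++ )            /* (lower triangle only)  */
                  A[ i + k*lda ] -= A[ i + j*lda ] * A[ k + j*lda ];
      }
      return 0;
  }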
Slide 17: Worksheet for Cholesky Factorization.
Slide 18: Mechanical Derivation of Algorithms
Mechanical development from the mathematical specification $A = L L^T$, via a mechanical procedure. Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation, UT-Austin, August 2006.
Slide 19: Is Formal Derivation Practical?
libflame: 128+ matrix operations, 1389+ algorithm implementations. A test suite created in 2011 executed 126,756 tests and found only 3 minor bugs in the library… (now fixed).
Slide 20: Impact on (Single) GPU Computing. [Performance plot vs. CUBLAS 2009; CUBLAS has been optimized since.]
Slide 21: Impact on (Single) GPU Computing. [Performance plot vs. CUBLAS 2009.] Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation, Univ. Jaume I, May 2011. Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37, Universidad Jaume I, updated May 21, 2009.
Slide 22: A Sampling of Functionality

  Operation                    | Classic FLAME | SuperMatrix (multithreaded/multiGPU) | lapack2flame
  Level-3 BLAS                 | y             | y                                    | N.A.
  Cholesky                     | y             | y                                    | y
  LU with partial pivoting     | y             | y                                    | y
  LU with incremental pivoting | y             | y                                    | N.A.
  QR (UT)                      | y             | y                                    | y
  LQ (UT)                      | y             | y                                    | y
  SPD/HPD inversion            | y             | y                                    | y
  Triangular inversion         | y             | y                                    | y
  Triangular Sylvester         | y             | y                                    | y
  Triangular Lyapunov          | y             | y                                    | y
  Up-and-downdate (UT)         | y             | N.A.                                 |
  SVD                          | next week     | soon                                 |
  EVD                          | next week     | soon                                 |
Slide 23: Outline (repeated).
Slide 24: Representing Algorithms in Code
Code should closely resemble how an algorithm is presented, so that no bugs can be introduced when translating an algorithm to code.
Slide 25: Representing Algorithms in Code
Spark + APIs: C, F77, Matlab, LabVIEW, LaTeX. http://www.cs.utexas.edu/users/flame/Spark/
Slide 26: FLAME/C API

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00,  /**/ &a01,     &A02,
                        /* ************* */   /* ************************** */
                                              &a10t, /**/ &alpha11, &a12t,
                           ABL, /**/ ABR,     &A20,  /**/ &a21,     &A22,
                           1, 1, FLA_BR );
    /*--------------------------------------*/
    FLA_Sqrt( alpha11 );                         /* alpha11 := sqrt(alpha11)   */
    FLA_Inv_scal( alpha11, a21 );                /* a21 := a21 / alpha11       */
    FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
             FLA_MINUS_ONE, a21, A22 );          /* A22 := A22 - a21 a21^T     */
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00,  a01,     /**/ A02,
                                                  a10t, alpha11, /**/ a12t,
                           /* ************** */   /* ********************** */
                              &ABL, /**/ &ABR,    A20,  a21,     /**/ A22,
                              FLA_TL );
  }

For now, libflame employs an external BLAS: GotoBLAS, MKL, ACML, CUBLAS.
Slide 27: Outline (repeated).
Slide 28: Multicore/MultiGPU: Issues
- Manage computation: assignment of tasks to cores and/or GPUs; granularity is important.
- Manage memory: manage data transfer between the "host" and the caches of cores, or between the host and the GPUs' local memories; granularity is important; keep data in local memory as long as possible.
Slide 29: Where Have We Seen This Before?
Computer architecture, late 1960s: superscalar units proposed.
- Unit of data: floating-point number
- Unit of computation: floating-point operation
- Examine dependencies; execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
- Extract parallelism from a sequential instruction stream
R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D (1967). The basis for the exploitation of ILP on current superscalar processors!
Slide 30: Of Blocks and Tasks
Dense matrix computation:
- Unit of data: block in the matrix
- Unit of computation (task): operation with blocks
- Dependency: input/output of an operation with blocks
- Instruction stream: sequential libflame code, which generates a DAG
- The runtime system schedules the tasks
Goal: minimize data transfer and maximize utilization. (A sketch of a task record follows.)
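To make the analogy with Tomasulo's algorithm concrete, here is an illustrative C sketch of what a task record might look like (purely hypothetical; not the actual SuperMatrix data structures). The runtime adds a DAG edge from task s to task t whenever t reads or overwrites a block that s writes:

  typedef struct task {
      void  (*kernel)( void **blocks );  /* e.g., a Trsm on a pair of blocks */
      void   *in[3];                     /* blocks this task reads           */
      void   *out[1];                    /* blocks this task (over)writes    */
      int     n_in, n_out;
      int     n_pending;                 /* unfinished predecessor tasks     */
      struct task *succ[8];              /* successor tasks (DAG edges)      */
      int     n_succ;
  } task_t;

When a task finishes, the runtime decrements n_pending on each successor; tasks whose count reaches zero become ready and may be scheduled on any core or accelerator.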
Slide 31: Review: Blocked Algorithms
Cholesky factorization proceeds iteration by iteration (1st, 2nd, 3rd, …); in each iteration:
$$
A_{11} = L_{11} L_{11}^T, \qquad
A_{21} := L_{21} = A_{21} L_{11}^{-T}, \qquad
A_{22} := A_{22} - L_{21} L_{21}^T.
$$
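These three updates follow directly from equating $A = L L^T$ block-wise (a standard step, spelled out here for completeness):

$$
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\;\Rightarrow\;
\begin{aligned}
A_{11} &= L_{11} L_{11}^T, \\
A_{21} &= L_{21} L_{11}^T \;\Rightarrow\; L_{21} = A_{21} L_{11}^{-T}, \\
A_{22} - L_{21} L_{21}^T &= L_{22} L_{22}^T \quad \text{(continue recursively)}.
\end{aligned}
$$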
Slide 32: Blocked Algorithms
Cholesky factorization, $A = L L^T$, via the APIs + tools:

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }
Slide 33: Simple Parallelization: Blocked Algorithms
Link with a multithreaded BLAS: the loop is exactly as on slide 32; parallelism comes from the multithreaded implementations of the calls that perform $A_{11} = L_{11} L_{11}^T$, $A_{21} := L_{21} = A_{21} L_{11}^{-T}$, and $A_{22} := A_{22} - L_{21} L_{21}^T$.
Slide 34: Blocked Algorithms: There Is More Parallelism!
[Figures: parallelism inside the same iteration (1st iteration) and across different iterations (2nd iteration).]
Slide 35: Coding Algorithm-by-Blocks
Algorithm-by-blocks: the same loop, now with hierarchical (FLASH) operations on blocks; the updates remain $A_{11} = L_{11} L_{11}^T$, $A_{21} := L_{21} = A_{21} L_{11}^{-T}$, $A_{22} := A_{22} - L_{21} L_{21}^T$:

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }
Slide 36: Outline (repeated).
Slide 37: Generating a DAG
Executing the loop (same code as on slide 35) generates a DAG of tasks 1-10. [Figure: e.g., for a matrix of 3 x 3 blocks: Chol (1); Trsms (2, 3); Syrk/Gemm updates (4, 5, 6); then Chol (7), Trsm (8), Syrk (9), and the final Chol (10).]
Slide 38: Managing Tasks and Blocks
Separation of concerns:
- The sequential libflame routine generates the DAG.
- The runtime system (SuperMatrix) manages and schedules the DAG.
As one moves from one architecture to another (multicore, out-of-core, single GPU, multiGPU, distributed, …), only the runtime system needs to be updated.
Slide 39: Runtime System: SuperMatrix. [Figure: SuperMatrix dispatches the DAG of tasks 1-10 to multicore, using a scheduling heuristic.]
Slide 40: Runtime System for GPU: SuperMatrix. [Figure: SuperMatrix dispatches the DAG to CPU + GPU and manages data transfer to the accelerator.]
Slide 41: Runtime System for MultiGPU: SuperMatrix. [Figure: SuperMatrix dispatches the DAG to CPU + multiple GPUs and manages data transfer to the accelerators.]
Slide 42: MultiGPU
How do we program these? [Figure: CPU(s) connected via the PCI-e bus and an interconnect to GPU #0, #1, #2, #3.]
Slide 43: MultiGPU: a User's View

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  // Obtain storage blocksize, # of threads: b, n_threads
  FLA_Init();
  FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
  FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Queue_enable_gpu();
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
  FLASH_Obj_free( &A );
  FLA_Finalize();
Slide 44: MultiGPU: Under the Cover
The naive approach: before execution, transfer the data to the device; call CUBLAS operations (their implementation is "someone else's problem"); upon completion, retrieve the results back to the host. The result: poor data locality.
Slide 45: MultiGPU: Under the Cover
How do we program these? View the system as a shared-memory multiprocessor plus a Distributed Shared Memory (DSM) architecture.
Slide 46: MultiGPU: Under the Cover
View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence). [Figure: processors P0 + Cache0 through P3 + Cache3; each GPU plays the role of a processor with its own cache.]
Slide 47: MultiGPU: Under the Cover
Software Distributed Shared Memory (DSM):
- Software: flexibility vs. efficiency
- The underlying distributed memory is hidden
- Memory transfers are reduced using write-back, write-invalidate, …
- A well-known approach, though not very efficient as middleware for general applications; the regularity of dense linear algebra makes the difference!
Slide 48: MultiGPU: Under the Cover
Reduce the number of data transfers: the runtime handles device memory as a software cache, operating at the block level, with software flexibility in the policies (write-back, write-invalidate). A sketch of the idea follows.
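A minimal sketch of such a software cache (illustrative C only; the block layout and the transfer helpers, which might wrap cudaMemcpy, are hypothetical names, not the SuperMatrix implementation):

  #define MAX_DEV 8

  typedef struct block {
      void *host;               /* copy in host memory                       */
      void *dev[MAX_DEV];       /* copy in each GPU's memory                 */
      int   valid[MAX_DEV];     /* does dev[d] hold current data?            */
      int   host_valid;
      int   owner;              /* device with the only fresh copy, or -1    */
  } block_t;

  /* Hypothetical transfer helpers, e.g. wrapping cudaMemcpy. */
  void copy_to_host( block_t *b, int d );
  void copy_to_dev ( block_t *b, int d );

  /* Before a task on device d reads block b. */
  void acquire( block_t *b, int d )
  {
      if ( b->valid[d] ) return;            /* cache hit: no transfer        */
      if ( !b->host_valid && b->owner >= 0 ) {
          copy_to_host( b, b->owner );      /* write-back only on demand     */
          b->host_valid = 1;
      }
      copy_to_dev( b, d );
      b->valid[d] = 1;
  }

  /* After a task on device d writes block b. */
  void release( block_t *b, int d )
  {
      for ( int i = 0; i < MAX_DEV; i++ )   /* write-invalidate: all other   */
          b->valid[i] = ( i == d );         /* copies become stale           */
      b->host_valid = 0;
      b->owner = d;
  }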
Slide 49: MultiGPU: Under the Cover
Factor A11 on the host:

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ) {
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }
Slide 50: Multi-GPU: Under the Cover (same loop as slide 49)
Transfer A11 from the host to the appropriate devices before it is used in subsequent computations (write-update).
Slide 51: Multi-GPU: Under the Cover (same loop as slide 49)
Cache A11 in the receiving device(s) in case it is needed in subsequent computations.
Slide 52: Multi-GPU: Under the Cover (same loop as slide 49)
Send blocks to the devices; perform the Trsm on blocks of A21 (hopefully using the cached A11); keep the updated A21 on the device until it is needed by other GPU(s) (write-back).
Slide 53: Multi-GPU: Under the Cover (same loop as slide 49)
Send blocks to the devices; perform the Syrk/Gemm updates on blocks of A22 (hopefully using cached blocks of A21); keep the updated A22 on the device until it is needed by other GPU(s) (write-back).
Slide 54: [Performance plot: $C := C + A B^T$ on an NVIDIA S1070 (4 x Tesla).]
Slide 55: [Performance plot: Cholesky factorization on an NVIDIA S1070 (4 x Tesla).]
Slide 56: [Performance plot: Cholesky on the S1070 (4 x Tesla), continued.]
Slide 57: [Performance plots: a sampling of LAPACK functionality on an NVIDIA S2050 (4 x Fermi).]
Slide 58: [Performance plots: sampling of LAPACK functionality on the S2050 (4 x Fermi), continued.]
Slide 59: Outline (repeated).
Slide 60: libflame for Clusters (+ Accelerators)
- PLAPACK: distributed memory (MPI); inspired FLAME; recently modified so that each node can use a GPU, keeping data in GPU memory as much as possible.
- Elemental (Jack Poulson): distributed memory (MPI); inspired by FLAME/PLAPACK; can use a GPU at each node/core.
- libflame + SuperMatrix: the runtime schedules tasks and data transfers; appropriate for small clusters.
Slide 61: PLAPACK + GPU Accelerators. [Performance plot.] Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010. Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla).
Slide 62: Targeting Clusters with GPUs: SuperMatrix Distributed Runtime. [Performance plot.] Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience, in review. Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi).
Slide 63: Elemental Cholesky Factorization.
Slide 64: Elemental vs. ScaLAPACK. [Performance plot: Cholesky on 8192 cores of a BlueGene/P.] Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem). Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS, submitted.
Slide 65: Single-Chip Cloud Computer
Intel SCC research processor: a 48-core "concept vehicle" created for many-core software research, with a custom communication library (RCCE).
Slide 66: SCC Results
48 Pentium cores; MPI replaced by RCCE. [Performance plot.] Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience, in review. Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience, to appear.
Slide 67: Outline (repeated).
Slide 68: Related Work
Data-flow parallelism, dynamic scheduling, runtimes: Cilk; OpenMP (task queues); StarSs (SMPSs); StarPU; Threading Building Blocks (TBB); … What we have is specific to dense linear algebra.
Slide 69: Dense Linear Algebra Libraries

  Target platform                     | LAPACK project      | FLAME project
  Sequential                          | LAPACK              | libflame
  Sequential + multithreaded BLAS     | LAPACK              | libflame
  Multicore/multithreaded             | PLASMA              | libflame + SuperMatrix
  Multicore + out-of-order scheduling | PLASMA + Quark      | libflame + SuperMatrix
  CPU + single GPU                    | MAGMA               | libflame + SuperMatrix
  Multicore + multiGPU                | DAGuE?              | libflame + SuperMatrix
  Distributed memory                  | ScaLAPACK           | libflame + SuperMatrix, PLAPACK, Elemental
  Distributed memory + GPU            | DAGuE? ScaLAPACK?   | libflame + SuperMatrix, PLAPACK, Elemental
  Out-of-core                         | ?                   | libflame + SuperMatrix
Slide 70: Comparison with Quark. [Performance plot.] Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar 2010.
Slide 71: Outline (repeated).
Slide 72: Conclusions
- Programmability is the key to harnessing parallel computation: one code, many target platforms.
- Formal derivation provides confidence in the code: if there is a problem, it is not in the library!
- Separation of concerns: the library developer derives algorithms and codes them; execution of the routines generates a DAG; parallelism, temporal locality, and spatial locality are captured in the DAG; the runtime system uses appropriate heuristics to schedule it.
Slide 73: What Does This Mean for You?
One successful approach: identify the units of data and the units of computation; write a sequential program that generates a DAG; hand the DAG to a runtime for scheduling. A sketch of the pattern follows.
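A sketch of that pattern for the blocked Cholesky discussed earlier (illustrative C only; enqueue_task(), block(), runtime_execute(), and the task kinds are hypothetical names, not libflame calls):

  /* Hypothetical API: enqueue_task( kind, out, in1, in2 ) records a task
     writing block `out` and reading blocks `in1`, `in2` (NULL if unused);
     runtime_execute() schedules the recorded DAG. block(A,i,j) names the
     (i,j) block of the nb x nb blocked matrix A. */
  for ( int k = 0; k < nb; k++ ) {
      enqueue_task( CHOL,   block(A,k,k), NULL,         NULL );
      for ( int i = k + 1; i < nb; i++ )
          enqueue_task( TRSM,   block(A,i,k), block(A,k,k), NULL );
      for ( int j = k + 1; j < nb; j++ )
          for ( int i = j; i < nb; i++ )
              enqueue_task( UPDATE, block(A,i,j), block(A,i,k), block(A,j,k) );
  }
  runtime_execute();   /* the runtime runs tasks as dependencies resolve */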
Slide 74: The Future
Currently, the library is an instantiation in code. In the future: create a repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture; mechanically generate a library for the target architecture, exactly as an expert would: Design-by-Transformation (DxT). Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58, 2011.
Slide 75: Availability
Everything that has been discussed is available under the LGPL or BSD license:
- libflame + SuperMatrix: http://www.cs.utexas.edu/users/flame/
- Elemental: http://code.google.com/p/elemental/
"[Soccer] is a very simple game. It's just very hard to play it simple." - Johan Cruyff