Slide 1: Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations
Ernie Chan
Final defense, May 31, 2010
Slide 2: Introduction
Motivation: address the traditional lack of programmability in the domain of dense matrix computations, specifically on shared-memory computer architectures.
Goal: make the difficult easy and the impossible doable.
Slide 3: Teaser
[Performance graph previewing the results: achieved performance against theoretical peak; higher is better]
Slide 4: Solution
Programmability: use the tools provided by FLAME.
Parallelism: directed acyclic graph (DAG) scheduling.
Slide 5: Background
High-performance computing:
[1] C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2):95-104, April 2003.
[2] E. Anderson, Z. Bai, J. Demmel, J. E. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. E. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.
[3] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] Apostolos Gerasoulis and Izzy Nelken. Scheduling linear algebra parallel algorithms on MIMD architectures. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, pages 68-95, Chicago, IL, USA, December 1989.
Slide 6: Background
DAG scheduling:
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5):720-748, September 1999.
[6] C. L. McCreary, A. A. Khan, J. J. Thompson, and M. E. McArdle. A comparison of heuristics for scheduling DAGs on multiprocessors. In Proceedings of the Eighth International Parallel Processing Symposium, pages 446-451, Cancun, Mexico, April 1994.
[7] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Cluster '08: Proceedings of the 2008 IEEE International Conference on Cluster Computing, pages 142-151, Tsukuba, Japan, September 2008.
[8] Jeffrey D. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384-393, June 1975.
Slide 7: Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
[Illustration: example task DAG with numbered nodes]
Slide 8: SuperMatrix
Formal Linear Algebra Methods Environment (FLAME): high-level abstractions for expressing linear algebra algorithms.
Running example, the Cholesky factorization: A -> L L^T.
Slide 9: SuperMatrix
Blocked Cholesky factorization expressed with the FLAME/C API:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* **************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
Slide 10: SuperMatrix
Cholesky factorization, iterations 1 and 2. Each iteration applies the same three updates to the current partitioning:
CHOL: A11 := Chol( A11 )
TRSM: A21 := A21 A11^-T
SYRK: A22 := A22 - A21 A21^T
[Illustration: the partitioning advancing along the diagonal from iteration 1 to iteration 2]
Slide 11: SuperMatrix
LAPACK-style implementation:

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
Slide 12: SuperMatrix
FLASH: storage-by-blocks and algorithm-by-blocks. The matrix is stored hierarchically as a matrix of contiguously stored blocks, and algorithms operate on whole blocks as their basic unit.
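To make the idea concrete, here is a minimal sketch of storage-by-blocks, not the FLASH API itself: one level of hierarchy in which a matrix is a 2D array of pointers to contiguously stored b×b blocks. All names here are illustrative.

  #include <stdlib.h>

  /* One level of hierarchy: an m_b x n_b grid of pointers, each to a
     contiguously stored b x b block (column-major within the block). */
  typedef struct {
      int      m_b, n_b;   /* matrix dimensions, counted in blocks    */
      int      b;          /* block size                              */
      double **blocks;     /* blocks[i + j*m_b] -> b*b doubles        */
  } HierMatrix;

  HierMatrix *hier_create( int m_b, int n_b, int b )
  {
      HierMatrix *A = malloc( sizeof( *A ) );
      A->m_b = m_b;  A->n_b = n_b;  A->b = b;
      A->blocks = malloc( m_b * n_b * sizeof( double * ) );
      for ( int k = 0; k < m_b * n_b; k++ )
          A->blocks[ k ] = malloc( b * b * sizeof( double ) );
      return A;
  }

  /* Element (i,j) of the flat matrix, routed through the hierarchy. */
  double *hier_elem( HierMatrix *A, int i, int j )
  {
      double *blk = A->blocks[ (i / A->b) + (j / A->b) * A->m_b ];
      return &blk[ (i % A->b) + (j % A->b) * A->b ];
  }

FLASH generalizes this to arbitrary levels of hierarchy, and the pointers to the blocks are exactly what the analyzer later uses to detect dependencies.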
Slide 13: SuperMatrix
The same Cholesky algorithm as an algorithm-by-blocks, using the FLASH API; each unit of the partitioning is now one block, so the repartitioning step is 1:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* **************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
Slide 14: SuperMatrix
Cholesky factorization of a 3×3 blocked matrix, iteration 1:
CHOL0: A0,0 := Chol( A0,0 )
Slide 15: SuperMatrix
Iteration 1, continued. Both triangular solves read A0,0 and therefore depend on CHOL0:
TRSM1: A1,0 := A1,0 A0,0^-T
TRSM2: A2,0 := A2,0 A0,0^-T
Slide 16: SuperMatrix
Iteration 1, continued:
SYRK3: A1,1 := A1,1 - A1,0 A1,0^T
GEMM4: A2,1 := A2,1 - A2,0 A1,0^T
SYRK5: A2,2 := A2,2 - A2,0 A2,0^T
Slide 17: SuperMatrix
Iteration 2:
CHOL6: A1,1 := Chol( A1,1 )
TRSM7: A2,1 := A2,1 A1,1^-T
SYRK8: A2,2 := A2,2 - A2,1 A2,1^T
Slide 18: SuperMatrix
Iteration 3:
CHOL9: A2,2 := Chol( A2,2 )
[Illustration: the complete DAG over tasks CHOL0 through CHOL9]
Slide 19: SuperMatrix
Separation of concerns:
Analyzer:
- Decomposes subproblems into component tasks.
- Stores tasks sequentially in a global task queue.
- Internally calculates the data dependencies between all tasks, which form a DAG, using only the input and output parameters of each task.
Dispatcher:
- Spawns threads.
- Schedules and dispatches tasks to threads in parallel.
Slide 20: SuperMatrix
Analyzer:
- Detects flow and anti-dependencies between tasks (see the sketch below).
- The block size manifests as the size of the contiguously stored blocks.
- Pointers are embedded into the hierarchical matrices.
- The analysis can be performed statically.
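As a hedged sketch of how such an analyzer can work (illustrative, not the libflame implementation): record each task's input and output blocks, then compare pairs of tasks. A later task depends on an earlier one if it reads a block the earlier task writes (flow dependence), or writes a block the earlier task reads or writes (anti- and output dependence).

  #include <stdbool.h>

  #define MAX_OPS 4

  typedef struct {
      const char *name;                  /* e.g. "CHOL", "TRSM"        */
      void *in[ MAX_OPS ];   int n_in;   /* blocks read by the task    */
      void *out[ MAX_OPS ];  int n_out;  /* blocks written by the task */
  } Task;

  static bool touches( void *blk, void *ops[], int n )
  {
      for ( int k = 0; k < n; k++ )
          if ( ops[ k ] == blk ) return true;
      return false;
  }

  /* True if 'later' must wait for 'earlier': flow (read-after-write),
     anti (write-after-read), or output (write-after-write) dependence
     on any shared block.                                              */
  bool depends_on( Task *later, Task *earlier )
  {
      for ( int k = 0; k < later->n_in; k++ )          /* flow         */
          if ( touches( later->in[ k ], earlier->out, earlier->n_out ) )
              return true;
      for ( int k = 0; k < later->n_out; k++ )         /* anti, output */
          if ( touches( later->out[ k ], earlier->in,  earlier->n_in  ) ||
               touches( later->out[ k ], earlier->out, earlier->n_out ) )
              return true;
      return false;
  }

Because blocks are stored contiguously and addressed through the embedded pointers, comparing operands reduces to comparing pointers, which is why the analysis can run before any task executes.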
Slide 21: Outline (next: Scheduling)
Slide 22: Scheduling
Dispatcher:
  foreach task in DAG do
      if task is ready then
          Enqueue task
      end
  end
  while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
          Update dependent task
          if dependent task is ready then
              Enqueue dependent task
          end
      end
  end
[Illustration: example task DAG with numbered nodes]
Slide 23: Scheduling
(The same dispatcher pseudocode, animated as the example DAG executes. The loop is sketched in C below.)
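The loop translates almost directly into code. A sketch assuming a thread-safe ready queue and a per-task dependence counter; queue_push, queue_pop, and the Task fields are hypothetical, not the SuperMatrix source, and __atomic_sub_fetch is the GCC/Clang builtin. Each worker thread runs dispatcher():

  /* Hypothetical Task with a dependence counter and successor list. */
  typedef struct Task Task;
  struct Task {
      void (*execute)( Task * );
      Task **succ;  int n_succ;   /* tasks that depend on this one   */
      int    n_deps;              /* unmet dependencies              */
  };

  extern Task *queue_pop( void );    /* assumed: thread-safe, NULL   */
  extern void  queue_push( Task * ); /*   once the DAG is drained    */

  void dispatcher( void )
  {
      Task *t;
      while ( ( t = queue_pop() ) != NULL ) {
          t->execute( t );
          for ( int k = 0; k < t->n_succ; k++ ) {
              Task *d = t->succ[ k ];
              /* The last predecessor to finish makes 'd' ready.     */
              if ( __atomic_sub_fetch( &d->n_deps, 1,
                                       __ATOMIC_ACQ_REL ) == 0 )
                  queue_push( d );
          }
      }
  }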
Slide 24: Scheduling
Supermarket: p lines, one for each of p cashiers.
- Efficient enqueue and dequeue.
- The schedule depends on the task-to-thread assignment.
Bank: 1 line for p tellers.
- Enqueue and dequeue become bottlenecks.
- Dynamic dispatching of tasks to threads.
Slide 25: Scheduling
Single queue: the set of all ready and available tasks, ordered FIFO or by priority.
[Illustration: threads PE0 through PEp-1 enqueueing to and dequeueing from one shared queue]
Slide 26: Scheduling
Multiple queues: one queue per thread, with tasks balanced by work stealing or placed by data affinity.
[Illustration: threads PE0 through PEp-1, each with its own queue]
Slide 27: Scheduling
Work stealing:
  foreach task in DAG do
      if task is ready then
          Enqueue task
      end
  end
  while tasks are available do
      Dequeue task
      if task ≠ Ø then
          Execute task
          Update dependent tasks …
      else
          Steal task
      end
  end
Enqueue: place all dependent tasks on the queue of the same thread that executed the task.
Steal: select a random thread and remove a task from the tail of its queue (sketched below).
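A sketch of the dequeue-or-steal step under stated assumptions: each thread owns a deque, the owner pops from the head, and thieves remove from the tail of a random victim. deque_pop_head and deque_pop_tail are hypothetical helpers; real work-stealing deques [5] use more careful synchronization.

  #include <stdlib.h>

  typedef struct Task Task;

  /* Assumed per-thread deques: the owner pushes and pops at the head;
     thieves remove from the tail to reduce contention. Both helpers
     must be thread-safe and return NULL when empty.                  */
  extern int   n_threads;
  extern Task *deque_pop_head( int thread );
  extern Task *deque_pop_tail( int thread );

  Task *get_task( int self )
  {
      Task *t = deque_pop_head( self );   /* own queue first          */
      if ( t != NULL )
          return t;

      /* Own queue is empty: pick a random victim and steal from the
         tail of its queue.                                           */
      int victim = rand() % n_threads;
      if ( victim == self )
          return NULL;                    /* retry on next iteration  */
      return deque_pop_tail( victim );
  }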
Slide 28: Scheduling
Work stealing with mailboxes:
- Each thread has an associated mailbox.
- Each task is enqueued onto a queue and also placed in a mailbox.
- Tasks can be assigned to mailboxes using a 2D distribution.
- Before attempting a steal, a thread first checks its mailbox (sketched below).
- This optimizes for data locality instead of stealing at random.
- The mailbox is only checked when a steal would otherwise occur.
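Continuing the previous sketch, the mailbox optimization adds exactly one check before the steal; mailbox_take is a hypothetical helper that atomically removes a task from this thread's mailbox, or returns NULL.

  #include <stdlib.h>

  typedef struct Task Task;

  extern int   n_threads;
  extern Task *deque_pop_head( int thread );  /* owner end            */
  extern Task *deque_pop_tail( int thread );  /* thief end            */
  extern Task *mailbox_take( int thread );    /* hypothetical: atomic */

  Task *get_task_with_mailbox( int self )
  {
      Task *t = deque_pop_head( self );   /* own queue first          */
      if ( t != NULL )
          return t;

      /* A steal is about to happen: check the mailbox first. Tasks
         placed here by the 2D distribution touch blocks this thread
         is likely to have in cache.                                  */
      t = mailbox_take( self );
      if ( t != NULL )
          return t;

      int victim = rand() % n_threads;    /* fall back to stealing    */
      return ( victim == self ) ? NULL : deque_pop_tail( victim );
  }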
Slide 29: Scheduling
Data affinity:
- Assign all tasks that write to a particular block to the same thread: the owner-computes rule, here with a 2D block cyclic distribution (sketched below).
Execution trace, Cholesky factorization of a 4000×4000 matrix:
- Total time: 2D data affinity ≈ FIFO queue.
- Idle threads: 2D ≈ 27%, FIFO ≈ 17%.
[Illustration: 2D block cyclic assignment of blocks to threads]
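Under a 2D block cyclic distribution, the owner-computes rule reduces to one line of arithmetic; a sketch assuming a pr × pc grid of threads:

  /* Owner of block (i,j) under a 2D block cyclic distribution over a
     pr x pc grid of threads: thread indices cycle along each
     dimension independently.                                         */
  int owner( int i, int j, int pr, int pc )
  {
      return ( i % pr ) * pc + ( j % pc );
  }

Every task whose output is block (i,j) is then placed on the queue of thread owner(i, j, pr, pc), so each block is only ever written by one thread.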
Slide 30: Scheduling
Data granularity:
- The cost of a task must greatly exceed the cost of an enqueue and dequeue.
Single vs. multiple queues:
- A FIFO queue increases load balance.
- 2D data affinity decreases data communication.
- Combine the best aspects of both!
Slide 31: Scheduling
Cache affinity:
- A single priority queue, sorted by task height.
- A per-thread software cache: LRU replacement, one line per block, fully associative (sketched below).
[Illustration: threads PE0 through PEp-1 sharing one queue, each with its own software cache]
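A sketch of the per-thread software cache under the stated parameters: fully associative, LRU replacement, one line per block. It only models residency by recording block addresses; no data is moved. The names and line count are assumptions.

  #include <stdbool.h>
  #include <string.h>

  #define CACHE_LINES 8   /* assumed: N lines, one block per line     */

  /* Fully associative software cache: lines[0] is most recently
     used.                                                            */
  typedef struct {
      void *lines[ CACHE_LINES ];
      int   n;
  } SoftCache;

  bool cache_contains( SoftCache *c, void *blk )
  {
      for ( int k = 0; k < c->n; k++ )
          if ( c->lines[ k ] == blk ) return true;
      return false;
  }

  /* LRU insert: move blk to the front, evicting the least recently
     used line if the cache is full.                                  */
  void cache_insert( SoftCache *c, void *blk )
  {
      int k;
      for ( k = 0; k < c->n; k++ )
          if ( c->lines[ k ] == blk ) break;       /* hit: reorder    */
      if ( k == c->n && c->n < CACHE_LINES ) c->n++;
      if ( k == CACHE_LINES ) k = CACHE_LINES - 1; /* miss: evict LRU */
      memmove( &c->lines[ 1 ], &c->lines[ 0 ], k * sizeof( void * ) );
      c->lines[ 0 ] = blk;
  }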
Slide 32: Scheduling
Cache affinity:
Enqueue:
- Insert the task.
- Sort the queue by task heights.
Dequeue:
- Search the queue for a task whose output block is in this thread's software cache (sketched below).
- If found, return that task; otherwise return the head task.
Dispatcher:
- Updates the software caches via a cache coherency protocol with write invalidation.
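Using the SoftCache sketch above, the dequeue might look as follows. The queue is modeled as an array already sorted by task height, with queue[0] the head; task_output_block is a hypothetical accessor, and locking is elided.

  typedef struct Task Task;
  extern void *task_output_block( Task * );  /* hypothetical accessor */

  Task *cache_affinity_dequeue( Task *queue[], int *n, SoftCache *mine )
  {
      if ( *n == 0 ) return NULL;

      int pick = 0;                      /* default: the head task    */
      for ( int k = 0; k < *n; k++ )
          if ( cache_contains( mine, task_output_block( queue[ k ] ) ) ) {
              pick = k;                  /* output block already hot  */
              break;
          }

      Task *t = queue[ pick ];
      for ( int k = pick; k < *n - 1; k++ )  /* remove from the queue */
          queue[ k ] = queue[ k + 1 ];
      (*n)--;
      return t;
  }

After executing a task, the dispatcher inserts its output block into this thread's software cache and invalidates it in the others, mirroring write invalidation in a hardware coherency protocol.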
Slide 33: Scheduling
Optimizations:
Prefetching:
- Let N be the number of cache lines (blocks).
- Touch the first N blocks accessed by the DAG to preload the cache before execution starts (sketched below).
Thread preference:
- Allow the thread that enqueues a task to dequeue it before other threads have the opportunity.
- Limits the migration of blocks between threads.
[Illustration: processing elements with private caches kept coherent over shared memory]
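A sketch of the prefetching optimization, assuming tasks are stored in the order the analyzer issued them and that touch_block simply reads a block to pull it into the cache (both assumptions):

  typedef struct Task Task;
  extern void *task_output_block( Task * );  /* hypothetical accessor */
  extern void  touch_block( void *blk );     /* assumed: reads block  */

  /* Before execution begins, touch the output blocks of the first
     tasks in DAG order, up to the number of cache lines N, so the
     caches start warm instead of cold.                               */
  void prefetch( Task *tasks[], int n_tasks, int N )
  {
      for ( int k = 0; k < n_tasks && k < N; k++ )
          touch_block( task_output_block( tasks[ k ] ) );
  }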
Slide 34: Outline (next: Performance)
Slide 35: Performance
Target architecture:
- 4-socket 2.66 GHz Intel Dunnington, 24 cores, 16 MB shared L3 cache per socket, running Linux and Windows.
- OpenMP: Intel compiler 11.1.
- BLAS: Intel MKL 10.2.
Slide 36: Performance
Implementations:
- SuperMatrix + serial MKL (FIFO queue, cache affinity)
- FLAME + multithreaded MKL
- Multithreaded MKL
- PLASMA + serial MKL
All in double-precision real floating-point arithmetic, with tuned block sizes.
Slide 37: Performance
Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) 2.1.0, Innovative Computing Laboratory, University of Tennessee:
- Creates persistent POSIX threads.
- Static pipelining: all threads execute the sequential algorithm by tiles; if a task is ready, it executes, otherwise the thread stalls.
- The DAG is never explicitly constructed.
- Copies the matrix from column-major storage to block data layout and back again.
- Did not address programmability (circa 2009).
Slides 38-43: Performance
[Performance graphs comparing the implementations above on the 24-core Dunnington system]
Slide 44: Performance
Inversion of a symmetric positive definite (SPD) matrix, in three sweeps:
1. Cholesky factorization: A -> L L^T (CHOL)
2. Inversion of a triangular matrix: R := L^-1 (TRINV)
3. Triangular matrix multiplication by its transpose: A^-1 := R^T R (TTMM)
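In the algorithm-by-blocks setting the three sweeps compose naturally. A sketch, assuming FLASH front-ends analogous to the FLASH_Chol call shown earlier; FLASH_Trinv and FLASH_Ttmm, and their exact signatures, are assumptions here, not confirmed API:

  #include "FLAME.h"

  /* In-place inversion of an SPD matrix stored by blocks. FLASH_Chol
     appears earlier in this deck; FLASH_Trinv and FLASH_Ttmm are
     assumed by analogy and may differ from the actual libflame API. */
  void spd_inverse( FLA_Obj A )
  {
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A );           /* A -> L L^T   */
      FLASH_Trinv( FLA_LOWER_TRIANGULAR,
                   FLA_NONUNIT_DIAG, A );              /* R := L^-1    */
      FLASH_Ttmm( FLA_LOWER_TRIANGULAR, A );           /* A^-1 := R^T R */
  }

Because all three operations decompose into tasks over the same blocks, the analyzer can build a single DAG spanning them, so tasks from a later sweep may begin before an earlier sweep has finished.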
Slides 45-50: Performance
[Performance graphs for inversion of an SPD matrix]
Slide 51: Performance
Alternate target architecture:
- 16-socket 1.5 GHz Intel Itanium2, 16 cores, 6 MB L3 cache per socket, running Linux.
- OpenMP: Intel compiler 9.0.
- BLAS: Intel MKL 8.1.
Slides 52-53: Performance
[Performance graphs on the 16-core Itanium2 system]
Slide 54: Performance
Results:
- SuperMatrix out-of-order execution vs. PLASMA in-order execution.
- Cache affinity vs. a FIFO queue.
- Representative of the performance of other dense linear algebra operations.
- Cache affinity is the most robust scheduling algorithm across architectures with different memory hierarchies because it addresses both load balance and data locality.
Slide 55: Performance
LU factorization with partial pivoting: P A = L U; numerically stable in practice.
LU factorization with incremental pivoting: maps well to algorithm-by-blocks, with only slightly worse numerical behavior than partial pivoting.
[Illustration: incremental pivoting, showing row swaps for the pivot row and column nb+i]
Slides 56-58: Performance
[Performance graphs for LU factorization]
Slide 59: Outline (next: Conclusion)
Slide 60: What We Have Learned
Separation of concerns:
- Facilitates programmability.
Scheduling:
- Queueing theory motivates the different scheduling algorithms and heuristics.
- Data locality is as important as load balance when scheduling matrix computations.
Slide 61: Contributions
Computer science:
- Developed a simple and elegant solution, by addressing programmability, to a problem that was thought to be difficult.
Scientific computing:
- Instantiated the algorithms-by-blocks and scheduling algorithms within the open-source library libflame for use by the community.
Slide 62: Acknowledgments
Robert van de Geijn and Field Van Zee; thanks also to the other members of the FLAME team for their support.
Funding: Intel, Microsoft, and NSF grants CCF-0540926 and CCF-0702714.
Slide 63: References
[9] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[10] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 2008.
[11] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
Slide 64: Conclusion
More information: http://www.cs.utexas.edu/~flame
Questions? echan@cs.utexas.edu