
1 Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations
Ernie Chan
Final defense, May 31, 2010

2 Introduction
Motivation
 Solve the traditional lack of programmability for the domain of dense matrix computations, specifically for shared-memory computer architectures
Goal
 Make the difficult easy and the impossible doable

3 Teaser
[Performance graph; annotations: theoretical peak performance, higher is better]

4 Solution
Programmability
 Use tools provided by FLAME
Parallelism
 Directed acyclic graph (DAG) scheduling

5 Background: High-Performance Computing
[1] C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2):95-104, April 2003.
[2] E. Anderson, Z. Bai, J. Demmel, J. E. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. E. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.
[3] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] Apostolos Gerasoulis and Izzy Nelken. Scheduling linear algebra parallel algorithms on MIMD architectures. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, pages 68-95, Chicago, IL, USA, December 1989.

6 Background: DAG Scheduling
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5):720-748, September 1999.
[6] C. L. McCreary, A. A. Khan, J. J. Thompson, and M. E. McArdle. A comparison of heuristics for scheduling DAGs on multiprocessors. In Proceedings of the Eighth International Parallel Processing Symposium, pages 446-451, Cancun, Mexico, April 1994.
[7] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Cluster '08: Proceedings of the 2008 IEEE International Conference on Cluster Computing, pages 142-151, Tsukuba, Japan, September 2008.
[8] Jeffrey D. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384-393, June 1975.

7 Outline
 Introduction
 SuperMatrix
 Scheduling
 Performance
 Conclusion
[DAG figure with numbered task nodes]

8 SuperMatrix
Formal Linear Algebra Methods Environment (FLAME)
 High-level abstractions for expressing linear algebra algorithms
Cholesky factorization: A → L L^T

9 SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*-----------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );

    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );

    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* **************** */  /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

10 SuperMatrix
Cholesky factorization, blocked algorithm
 Iteration 1:
   CHOL: A11 := Chol( A11 )
   TRSM: A21 := A21 A11^-T
   SYRK: A22 := A22 - A21 A21^T
 Iteration 2: the same CHOL, TRSM, and SYRK updates, applied to the trailing submatrix
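The three task types fall directly out of partitioning A = L L^T conformally with the blocks above; restated here as a worked equation for completeness:

  \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
  =
  \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
  \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}

  \Rightarrow\quad
  A_{11} = L_{11} L_{11}^T \;(\text{CHOL}), \qquad
  L_{21} = A_{21} L_{11}^{-T} \;(\text{TRSM}), \qquad
  A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T \;(\text{SYRK, then recurse}).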

11 SuperMatrix
LAPACK-style implementation:

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

12 SuperMatrix
FLASH
 Storage-by-blocks, algorithm-by-blocks

13 SuperMatrix

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */    /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );

    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );

    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* **************** */  /* ***************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
  }

14 SuperMatrix
Cholesky factorization (algorithm-by-blocks), iteration 1:
 CHOL 0: A(0,0) := Chol( A(0,0) )

15 SuperMatrix
Cholesky factorization, iteration 1 (continued):
 CHOL 0: A(0,0) := Chol( A(0,0) )
 TRSM 1: A(1,0) := A(1,0) A(0,0)^-T
 TRSM 2: A(2,0) := A(2,0) A(0,0)^-T

16 SuperMatrix
Cholesky factorization, iteration 1 (continued):
 SYRK 3: A(1,1) := A(1,1) - A(1,0) A(1,0)^T
 GEMM 4: A(2,1) := A(2,1) - A(2,0) A(1,0)^T
 SYRK 5: A(2,2) := A(2,2) - A(2,0) A(2,0)^T

17 SuperMatrix
Cholesky factorization, iteration 2:
 CHOL 6: A(1,1) := Chol( A(1,1) )
 TRSM 7: A(2,1) := A(2,1) A(1,1)^-T
 SYRK 8: A(2,2) := A(2,2) - A(2,1) A(2,1)^T

18 SuperMatrix
Cholesky factorization, iteration 3:
 CHOL 9: A(2,2) := Chol( A(2,2) )
These ten tasks (CHOL 0 through CHOL 9) and the dependencies among them form the DAG.

19 SuperMatrix
Separation of concerns
 Analyzer
   - Decomposes subproblems into component tasks
   - Stores tasks sequentially in a global task queue
   - Internally calculates the data dependencies between all tasks, which form a DAG, using only the input and output parameters of each task
 Dispatcher
   - Spawns threads
   - Schedules and dispatches tasks to threads in parallel
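A minimal sketch of how an analyzer like this can detect dependencies from operand lists alone. The struct and function names below are illustrative, not libflame's actual internals; the key idea, matching the slide, is that comparing each task's input/output blocks against those of earlier tasks suffices to build the DAG:

  #include <stdatomic.h>
  #include <stddef.h>

  #define MAX_ARGS 4
  #define MAX_SUCC 64

  typedef struct task {
      void (*func)(struct task *);   /* kernel to run, e.g. Chol on one block */
      void *in[MAX_ARGS];            /* blocks this task reads                */
      void *out[MAX_ARGS];           /* blocks this task writes               */
      size_t n_in, n_out;
      _Atomic size_t n_pred;         /* unsatisfied predecessor count         */
      struct task *succ[MAX_SUCC];   /* successor (dependent) tasks           */
      size_t n_succ;
  } task_t;

  static int blocks_overlap(void **a, size_t na, void **b, size_t nb)
  {
      for (size_t i = 0; i < na; i++)
          for (size_t j = 0; j < nb; j++)
              if (a[i] == b[j]) return 1;
      return 0;
  }

  /* A later task depends on an earlier one when it reads a block the
     earlier task writes (flow), writes a block the earlier task writes
     (output), or writes a block the earlier task reads (anti). */
  static int depends_on(task_t *later, task_t *earlier)
  {
      return blocks_overlap(earlier->out, earlier->n_out, later->in,  later->n_in)
          || blocks_overlap(earlier->out, earlier->n_out, later->out, later->n_out)
          || blocks_overlap(earlier->in,  earlier->n_in,  later->out, later->n_out);
  }

  /* Analyzer: tasks arrive in sequential program order, so comparing each
     new task against all earlier tasks recovers the full DAG. */
  void analyze(task_t **queue, size_t n)
  {
      for (size_t j = 0; j < n; j++)
          for (size_t i = 0; i < j; i++)
              if (depends_on(queue[j], queue[i])) {
                  queue[i]->succ[queue[i]->n_succ++] = queue[j];
                  queue[j]->n_pred++;
              }
  }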

20 SuperMatrix
Analyzer
 Detects flow and anti-dependencies
 Block size manifests as the size of the contiguously stored blocks
 Embeds pointers into hierarchical matrices
 Can be performed statically

21 Outline
 Introduction
 SuperMatrix
 Scheduling
 Performance
 Conclusion
[DAG figure with numbered task nodes]

22 Scheduling
Dispatcher:

  foreach task in DAG do
    if task is ready then
      Enqueue task
    end
  end
  while tasks are available do
    Dequeue task
    Execute task
    foreach dependent task do
      Update dependent task
      if dependent task is ready then
        Enqueue dependent task
      end
    end
  end

[DAG figure with numbered task nodes]

23 Scheduling
Dispatcher (same pseudocode as the previous slide; the DAG figure advances as tasks complete)
[DAG figure with numbered task nodes]
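A sketch of that dispatcher loop in C, assuming the task_t record from the analyzer sketch above; queue_push/queue_pop and tasks_remaining are assumed helpers for a thread-safe ready queue, not libflame's API:

  #include <stdatomic.h>
  #include <stddef.h>

  extern void    queue_push(task_t *t);
  extern task_t *queue_pop(void);          /* returns NULL when empty      */
  extern _Atomic size_t tasks_remaining;   /* total tasks left in the DAG  */

  void dispatcher_worker(void)
  {
      while (atomic_load(&tasks_remaining) > 0) {
          task_t *t = queue_pop();
          if (t == NULL)
              continue;                    /* nothing ready yet: retry     */
          t->func(t);                      /* execute the task's kernel    */
          for (size_t i = 0; i < t->n_succ; i++) {
              task_t *s = t->succ[i];
              /* satisfying the last predecessor makes the successor ready */
              if (atomic_fetch_sub(&s->n_pred, 1) == 1)
                  queue_push(s);
          }
          atomic_fetch_sub(&tasks_remaining, 1);
      }
  }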

24 Scheduling
Supermarket
 p lines, one for each of p cashiers
 Efficient enqueue and dequeue
 Schedule depends on the task-to-thread assignment
Bank
 1 line for p tellers
 Enqueue and dequeue become bottlenecks
 Dynamic dispatching of tasks to threads

25 Scheduling
Single queue
 Set of all ready and available tasks
 FIFO or priority ordering
[Figure: one shared queue feeding processing elements PE 0, PE 1, ..., PE p-1]

26 Scheduling
Multiple queues
 Work stealing, data affinity
[Figure: one queue per processing element PE 0, PE 1, ..., PE p-1]

27 Scheduling
Work stealing:

  foreach task in DAG do
    if task is ready then
      Enqueue task
    end
  end
  while tasks are available do
    Dequeue task
    if task ≠ Ø then
      Execute task
      Update dependent tasks ...
    else
      Steal task
    end
  end

 Enqueue: place all dependent tasks on the queue of the same thread that executed the task
 Steal: select a random thread and remove a task from the tail of its queue

28 Scheduling
Work stealing with a mailbox
 Each thread has an associated mailbox
 Enqueue places the task on a queue and in a mailbox
   - Tasks can be assigned to mailboxes using a 2D distribution
 Before attempting a steal, a thread first checks its mailbox
   - Optimizes for data locality instead of random stealing
   - The mailbox is only checked when a steal would otherwise occur
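A sketch of the resulting dequeue order, not libflame's implementation; deque_pop_head, deque_pop_tail, and mailbox_take are assumed helpers, and each thread owns one deque and one mailbox:

  #include <stdlib.h>

  #define NTHREADS 24

  extern task_t *deque_pop_head(int owner);   /* owner takes from its own head */
  extern task_t *deque_pop_tail(int victim);  /* thieves steal from the tail   */
  extern task_t *mailbox_take(int me);        /* task assigned here by the 2D
                                                 distribution, if any          */

  task_t *next_task(int me)
  {
      task_t *t = deque_pop_head(me);         /* 1. own queue first            */
      if (t) return t;
      t = mailbox_take(me);                   /* 2. mailbox before stealing,
                                                    preferring local data      */
      if (t) return t;
      int victim = rand() % NTHREADS;         /* 3. random-victim steal        */
      return victim == me ? NULL : deque_pop_tail(victim);
  }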

29 Scheduling
Data affinity
 Assign all tasks that write to a particular block to the same thread
 Owner-computes rule
 2D block cyclic distribution
Execution trace
 Cholesky factorization: 4000×4000
 Total time: 2D data affinity ~ FIFO queue
 Idle threads: 2D ≈ 27%, FIFO ≈ 17%
[Figure: 2D block cyclic assignment of blocks to threads]
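The owner-computes rule with a 2D block cyclic distribution reduces to a one-line mapping; this is a sketch under the assumption of an r × c thread grid (r·c threads), with names of my own choosing:

  /* The task that writes block (i, j) always runs on the same thread. */
  static int owner_2d_block_cyclic(int i, int j, int r, int c)
  {
      return (i % r) * c + (j % c);
  }

For example, with a 4 × 6 grid on 24 cores, every task writing A(2,1) runs on thread (2 % 4) * 6 + (1 % 6) = 13.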

30 Scheduling
Data granularity
 Cost of a task >> cost of enqueue and dequeue
Single vs. multiple queues
 A FIFO queue improves load balance
 2D data affinity decreases data communication
 Combine the best aspects of both!

31 Scheduling
Cache affinity
 Single priority queue sorted by task height
 Software cache per thread
   - LRU replacement
   - Line = block
   - Fully associative
[Figure: shared priority queue feeding PE 0 ... PE p-1, each with a software cache $0 ... $p-1]

32 Scheduling
Cache affinity (continued)
 Enqueue
   - Insert the task
   - Sort the queue by task heights
 Dequeue
   - Search the queue for a task whose output block is in the software cache
   - If found, return that task; otherwise return the head task
 Dispatcher
   - Updates the software caches via a cache coherency protocol with write invalidation
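A sketch of that dequeue, not libflame's implementation: pq is assumed to be the priority queue stored as an array sorted by task height (head first), and cache_contains is an assumed query against the calling thread's software cache:

  #include <stddef.h>

  extern int cache_contains(int thread, const void *block);

  task_t *dequeue_cache_affine(task_t **pq, size_t *n, int me)
  {
      if (*n == 0)
          return NULL;
      size_t pick = 0;                    /* default: head task (tallest)  */
      for (size_t k = 0; k < *n; k++)
          if (pq[k]->n_out > 0 && cache_contains(me, pq[k]->out[0])) {
              pick = k;                   /* its output block should still
                                             be in our software cache      */
              break;
          }
      task_t *t = pq[pick];
      for (size_t k = pick; k + 1 < *n; k++)
          pq[k] = pq[k + 1];              /* close the gap                 */
      (*n)--;
      return t;
  }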

33 Scheduling
Optimizations
 Prefetching
   - N = number of cache lines (blocks)
   - Touch the first N blocks accessed by the DAG to preload the cache before execution starts
 Thread preference
   - Allow the thread that enqueues a task to dequeue it before other threads have the opportunity
   - Limits the variability of blocks migrating between threads
[Figure: PEs with caches kept coherent over shared memory]
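A sketch of the prefetching idea, with names of my own choosing: walk the tasks in program order, collect the first N distinct blocks they reference, and touch one byte per hardware cache line of each (assumes N ≤ 256 and 64-byte lines):

  #include <stddef.h>

  void prefetch_first_blocks(task_t **tasks, size_t n_tasks,
                             size_t n_lines, size_t block_bytes)
  {
      void  *seen[256];
      size_t n_seen = 0;
      for (size_t t = 0; t < n_tasks && n_seen < n_lines; t++)
          for (size_t a = 0; a < tasks[t]->n_in && n_seen < n_lines; a++) {
              void *blk = tasks[t]->in[a];
              int dup = 0;
              for (size_t s = 0; s < n_seen; s++)
                  if (seen[s] == blk) dup = 1;
              if (dup) continue;
              seen[n_seen++] = blk;
              volatile const char *p = blk;   /* touch one byte per line */
              for (size_t off = 0; off < block_bytes; off += 64)
                  (void)p[off];
          }
  }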

34 Outline
 Introduction
 SuperMatrix
 Scheduling
 Performance
 Conclusion
[DAG figure with numbered task nodes]

35 Performance
Target architecture
 4-socket 2.66 GHz Intel Dunnington (24 cores)
   - Linux and Windows
   - 16 MB shared L3 cache per socket
 OpenMP: Intel compiler 11.1
 BLAS: Intel MKL 10.2

36 Performance
Implementations
 SuperMatrix + serial MKL
   - FIFO queue, cache affinity
 FLAME + multithreaded MKL
 Multithreaded MKL
 PLASMA + serial MKL
 Double-precision real floating-point arithmetic
 Tuned block size

37 Performance
Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) 2.1.0
 Creates persistent POSIX threads
 Static pipelining
   - All threads execute the sequential algorithm by tiles
   - If a task is ready, execute it; otherwise, stall
   - The DAG is not explicitly constructed
 Copies the matrix from column-major storage to block data layout and back to column-major
 Does not address programmability; circa 2009
Innovative Computing Laboratory, University of Tennessee

38-43 Performance
[Performance graphs]

44 Performance
Inversion of a symmetric positive definite matrix, in three sweeps:
 Cholesky factorization: A → L L^T (CHOL)
 Inversion of a triangular matrix: R := L^-1 (TRINV)
 Triangular matrix multiplication by its transpose: A^-1 := R^T R (TTMM)
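A sketch of the three sweeps as FLASH-style calls over one hierarchical matrix A. FLASH_Trinv and FLASH_Ttmm are assumed here by analogy with FLASH_Chol and may not match libflame's exact spelling or signatures:

  /* SPD inversion as three algorithm-by-blocks sweeps.                      */
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );                     /* A = L L^T   */
  FLASH_Trinv( FLA_LOWER_TRIANGULAR, FLA_NONUNIT_DIAG, A );  /* R := L^-1   */
  FLASH_Ttmm( FLA_LOWER_TRIANGULAR, A );                     /* A := R^T R  */

Because the analyzer builds one DAG across all enqueued tasks, tasks from a later sweep can begin before the earlier sweep has fully finished.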

45-50 Performance
[Performance graphs]

51 Performance
Alternate target architecture
 16-socket 1.5 GHz Intel Itanium2 (16 cores)
   - Linux
   - 6 MB L3 cache per socket
 OpenMP: Intel compiler 9.0
 BLAS: Intel MKL 8.1

52-53 Performance
[Performance graphs]

54 Performance
Results
 SuperMatrix out-of-order vs. PLASMA in-order
 Cache affinity vs. FIFO queue
 Representative of the performance of other dense linear algebra operations
 Cache affinity is the most robust scheduling algorithm across computer architectures with different memory hierarchies because it addresses both load balance and data locality

55 Performance
LU factorization with partial pivoting: P A = L U
 Numerically stable in practice
LU factorization with incremental pivoting
 Maps well to algorithm-by-blocks
 Only slightly worse numerical behavior than partial pivoting
[Figure: incremental pivoting on column nb+i, swapping rows against the pivot row]

56-58 Performance
[Performance graphs]

59 Outline
 Introduction
 SuperMatrix
 Scheduling
 Performance
 Conclusion
[DAG figure with numbered task nodes]

60 What We Have Learned
Separation of concerns
 Facilitates programmability
Scheduling
 Queueing theory motivates the different scheduling algorithms and heuristics
 Data locality is as important as load balance for scheduling matrix computations

61 Contributions
Computer science
 Developed a simple and elegant solution, by addressing programmability, to a problem that was thought to be difficult
Scientific computing
 Instantiated algorithms-by-blocks and the scheduling algorithms within the open-source library libflame for use by the community

62 Acknowledgments
Robert van de Geijn, Field Van Zee
 I thank the other members of the FLAME team for their support
Funding
 Intel
 Microsoft
 NSF grants CCF-0540926 and CCF-0702714

63 References
[9] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[10] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 2008.
[11] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.

64 Conclusion
More information: http://www.cs.utexas.edu/~flame
Questions? echan@cs.utexas.edu

