1
Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Intel talk, March 22, 2010
2
Motivation
- "The Free Lunch Is Over" – Herb Sutter
- Parallelize or perish
- Popular libraries like the Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten:
  - FORTRAN 77
  - Column-major order matrix storage
  - 187+ operations for each datatype
  - One routine (algorithm) per operation
3
Teaser
- Better theoretical peak performance
4
Goals
- Programmability: use tools provided by FLAME
- Parallelism: directed acyclic graph (DAG) scheduling
5
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
6
SuperMatrix
- Formal Linear Algebra Methods Environment (FLAME)
- High-level abstractions for expressing linear algebra algorithms
- Cholesky factorization: A → L L^T
7
SuperMatrix
Blocked Cholesky factorization coded with the FLAME/C API:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*-----------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
8
SuperMatrix
Cholesky factorization, blocked algorithm (each iteration, here shown for iterations 1 and 2):
- CHOL: A11 := Chol( A11 )
- TRSM: A21 := A21 A11^-T
- SYRK: A22 := A22 − A21 A21^T
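Written out, one blocked iteration applies the following updates to the 2x2 partitioning used by the FLAME code above (a restatement in LaTeX of what the three tasks compute):

  A_{11} \leftarrow L_{11} = \mathrm{Chol}(A_{11}), \qquad
  A_{21} \leftarrow A_{21} L_{11}^{-T}, \qquad
  A_{22} \leftarrow A_{22} - A_{21} A_{21}^{T}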
9
SuperMatrix
LAPACK-style implementation:

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
10
SuperMatrix
- FLASH: storage-by-blocks, algorithm-by-blocks
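FLASH's actual data structures are more general, but a minimal C sketch of the storage-by-blocks idea, a matrix of pointers to contiguously stored nb x nb blocks, looks like this; all names here are illustrative, not the FLASH API:

  #include <stdlib.h>

  /* One contiguously stored nb x nb block (column-major inside the block). */
  typedef struct {
      int     nb;     /* block dimension           */
      double *data;   /* nb*nb contiguous elements */
  } block_t;

  /* A matrix stored by blocks: an mb x nb grid of block pointers. */
  typedef struct {
      int       mb, nb;   /* number of block rows / block columns */
      block_t **blocks;   /* blocks[i + j*mb] is block (i,j)      */
  } blocked_matrix_t;

  static blocked_matrix_t *blocked_matrix_create(int mb, int nb, int block_dim)
  {
      blocked_matrix_t *A = malloc(sizeof *A);
      A->mb = mb;
      A->nb = nb;
      A->blocks = malloc((size_t)mb * nb * sizeof *A->blocks);
      for (int i = 0; i < mb * nb; i++) {
          block_t *b = malloc(sizeof *b);
          b->nb   = block_dim;
          b->data = calloc((size_t)block_dim * block_dim, sizeof *b->data);
          A->blocks[i] = b;
      }
      return A;
  }

Because each block is a separate contiguous allocation, a task can name the block it reads or writes by a single pointer, which is what the analyzer later uses to detect dependencies.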
11
SuperMatrix
The same loop converted to an algorithm-by-blocks with FLASH (repartitioning one block at a time and calling FLASH task routines):

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
12
SuperMatrix
Cholesky factorization, iteration 1:
- CHOL 0: Chol( A(0,0) )
13
SuperMatrix
Cholesky factorization, iteration 1 (continued):
- CHOL 0: Chol( A(0,0) )
- TRSM 1: A(1,0) := A(1,0) A(0,0)^-T
- TRSM 2: A(2,0) := A(2,0) A(0,0)^-T
14
SuperMatrix
Cholesky factorization, iteration 1 (continued):
- CHOL 0: Chol( A(0,0) )
- TRSM 1: A(1,0) := A(1,0) A(0,0)^-T
- TRSM 2: A(2,0) := A(2,0) A(0,0)^-T
- SYRK 3: A(1,1) := A(1,1) − A(1,0) A(1,0)^T
- GEMM 4: A(2,1) := A(2,1) − A(2,0) A(1,0)^T
- SYRK 5: A(2,2) := A(2,2) − A(2,0) A(2,0)^T
15
SuperMatrix
Cholesky factorization, iteration 2:
- CHOL 6: Chol( A(1,1) )
- TRSM 7: A(2,1) := A(2,1) A(1,1)^-T
- SYRK 8: A(2,2) := A(2,2) − A(2,1) A(2,1)^T
16
SuperMatrix
Cholesky factorization, iteration 3:
- CHOL 9: Chol( A(2,2) )
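The task list built up on the preceding slides can be generated mechanically from the blocked algorithm. A hedged C sketch of that enumeration follows; add_task and BLK are illustrative helpers, not the FLASH/SuperMatrix API, and blocks that are read and written are passed as the out argument:

  typedef enum { CHOL, TRSM, SYRK, GEMM } op_t;

  /* blk[(i) + (j)*(b)] is the pointer to block (i,j) of a b x b grid. */
  #define BLK(blk, i, j, b) ((blk)[(i) + (j) * (b)])

  /* add_task(op, in0, in1, out): assumed helper that appends one task and
   * records which blocks it reads and which block it writes (for the analyzer). */
  void add_task(op_t op, void *in0, void *in1, void *out);

  /* Enumerate the Cholesky-by-blocks tasks in the same program order as the
   * slides: CHOL 0, TRSM 1, TRSM 2, SYRK 3, GEMM 4, SYRK 5, CHOL 6, ...       */
  static void cholesky_by_blocks(void **blk, int b)
  {
      for (int k = 0; k < b; k++) {
          add_task(CHOL, NULL, NULL, BLK(blk, k, k, b));          /* Chol( A(k,k) )             */
          for (int i = k + 1; i < b; i++)                         /* A(i,k) := A(i,k) A(k,k)^-T */
              add_task(TRSM, BLK(blk, k, k, b), NULL, BLK(blk, i, k, b));
          for (int j = k + 1; j < b; j++) {
              add_task(SYRK, BLK(blk, j, k, b), NULL,             /* A(j,j) -= A(j,k) A(j,k)^T  */
                       BLK(blk, j, j, b));
              for (int i = j + 1; i < b; i++)                     /* A(i,j) -= A(i,k) A(j,k)^T  */
                  add_task(GEMM, BLK(blk, i, k, b), BLK(blk, j, k, b),
                           BLK(blk, i, j, b));
          }
      }
  }

For b = 3 this produces exactly the ten tasks numbered 0 through 9 above.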
17
SuperMatrix
Separation of concerns:
- Analyzer
  - Decomposes subproblems into component tasks
  - Stores tasks in a global task queue sequentially
  - Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
- Dispatcher
  - Spawns threads
  - Schedules and dispatches tasks to threads in parallel
18
SuperMatrix
Analyzer:
- Detects flow, anti, and output dependencies
- Embeds pointers into hierarchical matrices
- Block size manifests as the size of the contiguously stored blocks
- Can be performed statically
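This is not the SuperMatrix implementation, but a minimal sketch of how flow (read-after-write), anti (write-after-read), and output (write-after-write) dependencies can be derived purely from each task's input and output block pointers, as the slide describes; the data structure names are illustrative:

  #include <stddef.h>

  #define MAX_SUCC 64

  /* A task reads up to two blocks and writes one (enough for Chol/Trsm/Syrk/Gemm). */
  typedef struct task {
      const void  *in[2];            /* blocks read                        */
      int          n_in;
      void        *out;              /* block written                      */
      struct task *succ[MAX_SUCC];   /* tasks that must wait for this one  */
      int          n_succ;
      int          n_unsatisfied;    /* incoming dependencies not yet met  */
  } task_t;

  /* Record that 'later' must wait for 'earlier'. */
  static void add_edge(task_t *earlier, task_t *later)
  {
      earlier->succ[earlier->n_succ++] = later;
      later->n_unsatisfied++;
  }

  /* Compare each task against all earlier tasks in program order,
   * using only their input/output blocks. */
  static void analyze(task_t *tasks, int n)
  {
      for (int j = 1; j < n; j++) {
          for (int i = 0; i < j; i++) {
              task_t *a = &tasks[i], *b = &tasks[j];
              for (int k = 0; k < b->n_in; k++)        /* flow: read after write    */
                  if (b->in[k] == a->out) add_edge(a, b);
              for (int k = 0; k < a->n_in; k++)        /* anti: write after read    */
                  if (a->in[k] == b->out) add_edge(a, b);
              if (a->out == b->out) add_edge(a, b);    /* output: write after write */
          }
      }
  }

Because blocks are contiguous allocations with unique base addresses, pointer equality is enough to decide whether two tasks touch the same block.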
19
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
20
Scheduling
Dispatcher:

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
          Update dependent task
          if dependent task is ready then
              Enqueue dependent task
  end
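A compact sketch of that dispatcher loop in C, using a single shared FIFO and the dependency counters from the analyzer sketch above; the pthread details, tasks_remain, and execute are illustrative assumptions, not the SuperMatrix runtime:

  #include <pthread.h>

  #define QUEUE_CAP 4096
  static task_t         *ready[QUEUE_CAP];
  static int             qhead = 0, qtail = 0;
  static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

  static void enqueue(task_t *t)
  {
      pthread_mutex_lock(&qlock);
      ready[qtail++ % QUEUE_CAP] = t;        /* illustrative: no overflow check */
      pthread_mutex_unlock(&qlock);
  }

  static task_t *dequeue(void)
  {
      task_t *t = NULL;
      pthread_mutex_lock(&qlock);
      if (qhead < qtail)
          t = ready[qhead++ % QUEUE_CAP];
      pthread_mutex_unlock(&qlock);
      return t;
  }

  /* Worker loop run by every thread: execute a task, then release its dependents. */
  static void *worker(void *arg)
  {
      (void)arg;
      while (tasks_remain()) {                 /* assumed: global termination test      */
          task_t *t = dequeue();
          if (t == NULL)
              continue;                        /* nothing ready yet                     */
          execute(t);                          /* assumed: calls the BLAS/LAPACK kernel */
          for (int i = 0; i < t->n_succ; i++)  /* update dependents; enqueue ready ones */
              if (__atomic_sub_fetch(&t->succ[i]->n_unsatisfied, 1, __ATOMIC_SEQ_CST) == 0)
                  enqueue(t->succ[i]);
      }
      return NULL;
  }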
22
Scheduling
- Supermarket: p lines, one for each of p cashiers
  - Efficient enqueue and dequeue
  - Schedule depends on the task-to-thread assignment
- Bank: one line for p tellers
  - Enqueue and dequeue become bottlenecks
  - Dynamic dispatching of tasks to threads
23
Scheduling
Single queue:
- Set of all ready and available tasks
- FIFO or priority ordering
- All processing elements (PE 0 … PE p−1) enqueue to and dequeue from the one shared queue
24
Scheduling
Multiple queues:
- One queue per processing element (PE 0 … PE p−1)
- Work stealing, data affinity
25
Scheduling
Work stealing:

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      if task ≠ Ø then
          Execute task
          Update dependent tasks …
      else
          Steal task
  end

- Enqueue: place all dependent tasks on the queue of the same thread that executes the task
- Steal: select a random thread and remove a task from the tail of its queue
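A minimal per-thread queue with one lock per queue, continuing the task_t sketch above: the owner takes work from the head, thieves steal from the tail of a random victim, as the slide describes. This is only a sketch (not a lock-free Cilk-style deque, and not SuperMatrix's code); MAX_THREADS and the mutexes being initialized at startup are assumptions:

  #include <pthread.h>
  #include <stdlib.h>

  #define MAX_THREADS 64            /* assumed upper bound, not from the slides */
  #define DEQUE_CAP   4096

  typedef struct {
      task_t         *buf[DEQUE_CAP];
      int             head, tail;   /* owner pops at head; thieves steal at tail */
      pthread_mutex_t lock;         /* initialized with pthread_mutex_init (omitted) */
  } deque_t;

  static deque_t deque[MAX_THREADS];
  static int     nthreads;

  static void push_local(int me, task_t *t)
  {
      pthread_mutex_lock(&deque[me].lock);
      deque[me].buf[deque[me].tail++ % DEQUE_CAP] = t;
      pthread_mutex_unlock(&deque[me].lock);
  }

  static task_t *pop_local(int me)
  {
      task_t *t = NULL;
      pthread_mutex_lock(&deque[me].lock);
      if (deque[me].head < deque[me].tail)
          t = deque[me].buf[deque[me].head++ % DEQUE_CAP];
      pthread_mutex_unlock(&deque[me].lock);
      return t;
  }

  /* Steal: pick a random victim and take a task from the tail of its queue. */
  static task_t *steal(int me)
  {
      int victim = rand() % nthreads;
      if (victim == me)
          return NULL;
      task_t *t = NULL;
      pthread_mutex_lock(&deque[victim].lock);
      if (deque[victim].head < deque[victim].tail)
          t = deque[victim].buf[--deque[victim].tail % DEQUE_CAP];
      pthread_mutex_unlock(&deque[victim].lock);
      return t;
  }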
26
Scheduling
Work stealing with a mailbox:
- Each thread has an associated mailbox
- Enqueue the task onto a queue and also place it in a mailbox
- Tasks can be assigned to mailboxes using a 2D distribution
- Before attempting a steal, a thread first checks its mailbox
- Optimizes for data locality instead of random stealing
- The mailbox is only checked when a steal would otherwise occur
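Continuing the sketch above, the mailbox can be as simple as a one-slot pointer per thread that is checked before a random steal; this is purely illustrative:

  /* One-slot mailbox per thread, filled by the enqueuer according to the
   * 2D distribution of output blocks.  Illustrative only.                */
  static task_t *mailbox[MAX_THREADS];

  static task_t *next_task(int me)
  {
      task_t *t = pop_local(me);
      if (t != NULL)
          return t;
      /* Check the mailbox before resorting to a random steal. */
      t = __atomic_exchange_n(&mailbox[me], NULL, __ATOMIC_SEQ_CST);
      if (t != NULL)
          return t;
      return steal(me);
  }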
27
Scheduling
Data affinity:
- Assign all tasks that write to a particular block to the same thread
- Owner-computes rule
- 2D block cyclic distribution
Execution trace (Cholesky factorization, 4000×4000):
- Total time: 2D data affinity ≈ FIFO queue
- Idle threads: 2D ≈ 27% and FIFO ≈ 17%
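The owner-computes assignment under a 2D block cyclic distribution follows directly from the block indices. A small sketch, where the r x c grid of threads is an assumed parameterization rather than a value taken from the slides:

  /* Owner of block (i, j) under a 2D block cyclic distribution over an
   * r x c grid of threads (r * c = number of threads). */
  static int owner_2d_block_cyclic(int i, int j, int r, int c)
  {
      return (i % r) * c + (j % c);
  }

Every task whose output block is (i, j) is then enqueued on (or mailed to) thread owner_2d_block_cyclic(i, j, r, c).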
28
Scheduling
- Data granularity: the cost of a task >> the cost of enqueue and dequeue
- Single vs. multiple queues:
  - A FIFO queue increases load balance
  - 2D data affinity decreases data communication
  - Combine the best aspects of both
29
Scheduling
Cache affinity:
- Single priority queue sorted by task height
- Software cache per thread
  - LRU replacement
  - Line = block
  - Fully associative
30
Scheduling
Cache affinity:
- Enqueue
  - Insert task
  - Sort queue by task heights
- Dequeue
  - Search the queue for a task whose output block is in the thread's software cache
  - If found, return that task; otherwise return the head task
- Dispatcher
  - Update the software caches via a cache coherency protocol with write invalidation
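A hedged sketch of that dequeue policy, continuing the earlier task_t sketch; the software cache here is just a small array of recently written block pointers per thread, LRU maintenance and locking are omitted, and all names are illustrative:

  #define CACHE_LINES 8                /* assumed software-cache size in blocks */

  typedef struct {
      const void *line[CACHE_LINES];   /* recently written output blocks */
  } soft_cache_t;

  static soft_cache_t soft_cache[MAX_THREADS];

  static int in_soft_cache(int me, const void *block)
  {
      for (int i = 0; i < CACHE_LINES; i++)
          if (soft_cache[me].line[i] == block)
              return 1;
      return 0;
  }

  /* queue[0..*qlen-1] is the shared priority queue sorted by task height;
   * caller must ensure *qlen > 0.  Prefer a task whose output block is
   * already in this thread's software cache, else take the head task.    */
  static task_t *dequeue_cache_affinity(int me, task_t **queue, int *qlen)
  {
      int pick = 0;
      for (int i = 0; i < *qlen; i++)
          if (in_soft_cache(me, queue[i]->out)) { pick = i; break; }
      task_t *t = queue[pick];
      for (int i = pick; i < *qlen - 1; i++)    /* remove the chosen task */
          queue[i] = queue[i + 1];
      (*qlen)--;
      return t;
  }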
31
Scheduling
Optimizations:
- Prefetching
  - N = number of cache lines (blocks)
  - Touch the first N blocks accessed by the DAG to preload the cache before execution starts
- Thread preference
  - Allow the thread that enqueues a task to dequeue it before other threads have the opportunity
  - Limits the variability of blocks migrating between threads
32
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
33
Performance
Target architecture:
- 4-socket 2.66 GHz Intel Dunnington (24 cores)
- Linux and Windows
- 16 MB shared L3 cache per socket
- OpenMP (Intel compiler 11.1)
- BLAS (Intel MKL 10.2)
34
Performance
Implementations:
- SuperMatrix + serial MKL
  - FIFO queue, cache affinity
- FLAME + multithreaded MKL
- Multithreaded MKL
- PLASMA + serial MKL
Double-precision real floating-point arithmetic; tuned block size
35
Performance
Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA)
- Innovative Computing Laboratory, University of Tennessee
- Create persistent POSIX threads
- Static pipelining
  - All threads execute the sequential algorithm by tiles
  - If a task is ready, execute it; otherwise, stall
- DAG is not explicitly constructed
- Copy the matrix from column-major order storage to block data layout and back to column-major
- Does not address programmability
Slides 36–40: Performance (graphs).
41
Performance
Inversion of a symmetric positive definite matrix:
- CHOL: Cholesky factorization, A → L L^T
- TRINV: inversion of a triangular matrix, R := L^-1
- TTMM: triangular matrix multiplication by its transpose, A^-1 := R^T R
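Chaining the three stages gives the inverse; as a one-line check in LaTeX:

  A = L L^{T}, \quad R = L^{-1} \;\Rightarrow\; A^{-1} = (L L^{T})^{-1} = L^{-T} L^{-1} = R^{T} R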
Slides 42–50: Performance (graphs).
51
Performance
LU factorization with partial pivoting: P A = L U
- Numerically stable in practice
LU factorization with incremental pivoting:
- Maps well to algorithm-by-blocks
- Only slightly worse numerical behavior than partial pivoting
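For reference, a minimal unblocked LU with partial pivoting on a column-major n x n matrix; this is the textbook kernel the P A = L U line refers to, not the incremental-pivoting algorithm-by-blocks, and the routine name is illustrative:

  #include <math.h>

  /* Factor A (column-major, lda >= n) in place as P A = L U; piv[k] records the
   * row swapped with row k.  Returns 0 on success, k+1 if a zero pivot occurs. */
  static int lu_partial_pivot(int n, double *A, int lda, int *piv)
  {
      for (int k = 0; k < n; k++) {
          int p = k;                               /* largest-magnitude entry in column k */
          for (int i = k + 1; i < n; i++)
              if (fabs(A[i + k * lda]) > fabs(A[p + k * lda]))
                  p = i;
          piv[k] = p;
          if (A[p + k * lda] == 0.0)
              return k + 1;
          if (p != k)                              /* swap rows k and p */
              for (int j = 0; j < n; j++) {
                  double tmp = A[k + j * lda];
                  A[k + j * lda] = A[p + j * lda];
                  A[p + j * lda] = tmp;
              }
          for (int i = k + 1; i < n; i++) {        /* eliminate below the pivot */
              A[i + k * lda] /= A[k + k * lda];
              for (int j = k + 1; j < n; j++)
                  A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
          }
      }
      return 0;
  }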
Slides 52–54: Performance (graphs).
55
Performance
Results:
- Cache affinity vs. FIFO queue
- SuperMatrix out-of-order vs. PLASMA in-order
- High variability of work stealing vs. predictable cache-affinity performance
- Representative of the performance of other dense linear algebra operations
56
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
57
Conclusion
- Separation of concerns
  - Allows us to experiment with different scheduling algorithms
- Locality, locality, locality
  - Data communication is as important as load balance when scheduling matrix computations
58
Conclusion
Future work:
- Intel Single-chip Cloud Computer (SCC)
  - Master-slave approach
  - Software-managed cache coherency
  - RCCE API: RCCE_send, RCCE_recv, RCCE_shmalloc
59
Acknowledgments
- Robert van de Geijn, Field Van Zee
- I thank the other members of the FLAME team for their support
- Funding: Intel, Microsoft, NSF grants CCF-0540926 and CCF-0702714
60
References
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116–125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123–132, Salt Lake City, UT, USA, February 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1–14:26, July 2009.
61
Conclusion
More information: http://www.cs.utexas.edu/~flame
Questions? echan@cs.utexas.edu