Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Intel talk, March 22, 2010
Motivation
"The Free Lunch Is Over" – Herb Sutter: parallelize or perish.
Popular libraries such as the Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten to exploit parallelism:
FORTRAN 77
Column-major order matrix storage
187+ operations for each datatype
One routine (algorithm) per operation
Teaser: better theoretical peak performance.
Goals
Programmability: use the tools provided by FLAME.
Parallelism: directed acyclic graph (DAG) scheduling.
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
SuperMatrix
Formal Linear Algebra Methods Environment (FLAME): high-level abstractions for expressing linear algebra algorithms.
Running example: Cholesky factorization, A → L L^T.
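For readers less familiar with the blocked algorithm that the FLAME loop below implements, here is the standard derivation of the per-iteration updates (general linear algebra, not specific to this talk):

$$
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
\;\Rightarrow\;
A_{11} = L_{11} L_{11}^T,\quad
L_{21} = A_{21} L_{11}^{-T},\quad
A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T .
$$

Each iteration therefore factors the diagonal block (CHOL), solves a triangular system against the panel below it (TRSM), applies a symmetric rank-k update to the trailing matrix (SYRK), and then continues with the trailing submatrix A22.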
SuperMatrix: blocked FLAME/C implementation of the Cholesky factorization.

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                      /* ************* */ /* ******************** */
                                           &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                         /* *************** */ /* ***************** */
                            &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix: Cholesky factorization, blocked algorithm by iteration (figure).
Iteration 1: CHOL: Chol(A11); TRSM: A21 := A21 A11^{-T}; SYRK: A22 := A22 - A21 A21^T.
Iteration 2: the same CHOL, TRSM, and SYRK updates applied to the repartitioned trailing submatrix.
SuperMatrix: LAPACK-style implementation.

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
SuperMatrix
FLASH: storage-by-blocks and algorithm-by-blocks.
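As an illustration of storage-by-blocks, here is a minimal conceptual sketch in C; the types and indexing are hypothetical and are not the actual FLASH hierarchical objects.

/* Conceptual sketch of storage-by-blocks (hypothetical types).  An
   n x n matrix is viewed as an N x N grid of pointers to contiguously
   stored b x b blocks, so tasks read and write whole blocks. */
typedef struct {
    int      b;       /* block (tile) size                    */
    int      N;       /* number of blocks per dimension       */
    double **blocks;  /* N*N block pointers, column-major grid */
} blocked_matrix;

/* Element (i, j) lives in block (i/b, j/b) at offset (i%b, j%b). */
static double *elem( blocked_matrix *A, int i, int j )
{
    double *blk = A->blocks[ (j / A->b) * A->N + (i / A->b) ];
    return &blk[ (j % A->b) * A->b + (i % A->b) ];
}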
SuperMatrix: algorithm-by-blocks implementation with FLASH; the matrix is a matrix of blocks, so the repartitioning advances one block at a time.

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                      /* ************* */ /* ******************** */
                                           &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                         /* *************** */ /* ***************** */
                            &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix: Cholesky factorization DAG, built one iteration at a time for a 3×3 matrix of blocks (figures).
Iteration 1: CHOL0: Chol(A0,0); TRSM1: A1,0 := A1,0 A0,0^{-T}; TRSM2: A2,0 := A2,0 A0,0^{-T}; SYRK3: A1,1 := A1,1 - A1,0 A1,0^T; GEMM4: A2,1 := A2,1 - A2,0 A1,0^T; SYRK5: A2,2 := A2,2 - A2,0 A2,0^T.
Iteration 2: CHOL6: Chol(A1,1); TRSM7: A2,1 := A2,1 A1,1^{-T}; SYRK8: A2,2 := A2,2 - A2,1 A2,1^T.
Iteration 3: CHOL9: Chol(A2,2).
SuperMatrix: separation of concerns.
Analyzer: decomposes subproblems into component tasks, stores the tasks sequentially in a global task queue, and internally calculates all dependencies between tasks, using only the input and output parameters of each task; the dependencies form a DAG.
Dispatcher: spawns threads, then schedules and dispatches tasks to threads in parallel.
SuperMatrix: the analyzer.
Detects flow (read-after-write), anti (write-after-read), and output (write-after-write) dependencies.
Embeds pointers into the hierarchical matrices; the block size manifests as the size of the contiguously stored blocks.
The analysis can be performed statically. A sketch of such a dependence analysis follows.
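A minimal sketch, in C, of how an analyzer can derive the DAG purely from each task's input and output blocks; the task structure and limits here are hypothetical placeholders, not the SuperMatrix implementation. Tasks are appended in program order, and every read-after-write (flow), write-after-read (anti), and write-after-write (output) overlap with an earlier task becomes an edge.

#include <stdbool.h>

#define MAX_ARGS 4
#define MAX_DEPS 64

typedef struct task {
    const void  *in[MAX_ARGS];   int n_in;    /* blocks read       */
    const void  *out[MAX_ARGS];  int n_out;   /* blocks written    */
    struct task *dep[MAX_DEPS];  int n_dep;   /* predecessor tasks */
} task_t;

static bool touches( const void *set[], int n, const void *blk )
{
    for ( int i = 0; i < n; i++ )
        if ( set[i] == blk ) return true;
    return false;
}

/* Compare a newly appended task t against all earlier tasks.  Duplicate
   edges are harmless for the purposes of this sketch. */
static void add_edges( task_t *earlier[], int n_earlier, task_t *t )
{
    for ( int k = 0; k < n_earlier; k++ ) {
        task_t *p = earlier[k];

        for ( int i = 0; i < t->n_in; i++ )    /* flow: p writes, t reads */
            if ( touches( p->out, p->n_out, t->in[i] ) )
                t->dep[t->n_dep++] = p;

        for ( int i = 0; i < t->n_out; i++ )   /* anti and output deps    */
            if ( touches( p->in,  p->n_in,  t->out[i] ) ||
                 touches( p->out, p->n_out, t->out[i] ) )
                t->dep[t->n_dep++] = p;
    }
}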
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Scheduling: the dispatcher loop.

Enqueue ready tasks
while tasks are available do
    Dequeue task
    Execute task
    foreach dependent task do
        Update dependent task
        if dependent task is ready then
            Enqueue dependent task
    end
end
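The same loop in C-like form, as a hedged sketch: the queue primitives and task structure are assumed placeholders rather than the actual SuperMatrix runtime. Every thread runs this loop; a dependent task is enqueued by whichever thread clears its last outstanding dependency.

#include <stdatomic.h>
#include <stddef.h>

typedef struct task {
    void        (*run)( struct task * );   /* wraps the BLAS-like kernel  */
    struct task  *succ[8];                 /* dependent (successor) tasks */
    int           n_succ;
    atomic_int    deps_left;               /* unsatisfied dependencies    */
} task_t;

/* Shared ready queue; these helpers are assumed thread-safe elsewhere. */
extern void    enqueue_ready( task_t *t );
extern task_t *dequeue_ready( void );      /* NULL if nothing is ready    */
extern int     tasks_remaining( void );

void dispatcher( void )                    /* executed by every thread    */
{
    while ( tasks_remaining() ) {
        task_t *t = dequeue_ready();
        if ( t == NULL ) continue;         /* nothing ready yet; retry    */

        t->run( t );                       /* execute the task            */

        /* Update dependents; the last cleared dependency enqueues.       */
        for ( int i = 0; i < t->n_succ; i++ )
            if ( atomic_fetch_sub( &t->succ[i]->deps_left, 1 ) == 1 )
                enqueue_ready( t->succ[i] );
    }
}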
Scheduling analogies.
Supermarket: p lines, one per each of p cashiers (per-thread queues). Enqueue and dequeue are efficient, but the schedule depends on the task-to-thread assignment.
Bank: one line for p tellers (a single shared queue). Enqueue and dequeue can become bottlenecks, but tasks are dispatched to threads dynamically.
Scheduling with a single queue (figure): the set of all ready and available tasks is held in one queue (FIFO or priority ordered), and all processing elements PE0 … PEp-1 enqueue to and dequeue from it.
Scheduling with multiple queues (figure): one queue per processing element, enabling work stealing and data affinity.
Scheduling: work stealing.

Enqueue ready tasks
while tasks are available do
    Dequeue task
    if task ≠ Ø then
        Execute task
        Update dependent tasks …
    else
        Steal task
end

Enqueue: place all dependent tasks on the queue of the same thread that executed the task.
Steal: select a random thread and remove a task from the tail of its queue.
Scheduling: work stealing with a mailbox.
Each thread has an associated mailbox. When a task is enqueued onto a queue, it is also placed in a mailbox; tasks can be assigned to mailboxes using a 2D distribution.
Before attempting a steal, a thread first checks its mailbox, optimizing for data locality instead of stealing at random. The mailbox is only checked when a steal would otherwise occur. A sketch of this task-selection path follows.
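A sketch of the resulting task-selection path in C; the deque, mailbox, and task helpers are illustrative assumptions, not the actual runtime. The thread's own queue is tried first, then its mailbox, and only then the tail of a randomly chosen victim's queue.

#include <stdlib.h>
#include <stddef.h>

typedef struct task task_t;                 /* opaque for this sketch     */

/* Assumed thread-safe helpers (hypothetical, not a real API). */
extern task_t *deque_pop_head( int thread );    /* own queue, head end    */
extern task_t *deque_steal_tail( int thread );  /* victim queue, tail end */
extern task_t *mailbox_take( int thread );      /* tasks assigned by the  */
                                                /* 2D distribution        */

task_t *next_task( int me, int nthreads )
{
    task_t *t = deque_pop_head( me );       /* 1. own queue               */
    if ( t != NULL ) return t;

    t = mailbox_take( me );                 /* 2. mailbox: locality first */
    if ( t != NULL ) return t;

    int victim = rand() % nthreads;         /* 3. steal from a random     */
    if ( victim == me ) return NULL;        /*    victim's tail           */
    return deque_steal_tail( victim );      /* may be NULL; caller retries */
}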
Scheduling: data affinity.
Assign all tasks that write to a particular block to the same thread (owner-computes rule), using a 2D block cyclic distribution.
Execution trace, Cholesky factorization of a 4000×4000 matrix: total time with 2D data affinity is roughly the same as with a FIFO queue, but threads sit idle ≈ 27% of the time under 2D data affinity versus ≈ 17% under the FIFO queue. A small sketch of the owner computation follows.
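Under the owner-computes rule with a 2D block cyclic distribution, the owning thread of a block follows directly from its block indices; a minimal sketch in C (the r × c thread-grid shape is an assumed parameter, not taken from the talk):

/* Block (i, j) of the matrix of blocks is always updated by the same
   thread: wrap the block indices onto an r x c grid of threads. */
static int owner( int i, int j, int r, int c )
{
    return (i % r) * c + (j % c);
}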
Scheduling: data granularity and queue choice.
The cost of executing a task far exceeds the cost of an enqueue or dequeue.
Single vs. multiple queues: a FIFO queue improves load balance, while 2D data affinity reduces data communication. The goal is to combine the best aspects of both.
Scheduling: cache affinity (figure: PE0 … PEp-1 share a single ready queue, and each PE has its own software cache).
A single priority queue sorted by task height.
Per-thread software cache: LRU replacement, fully associative, one cache line per block.
Scheduling: cache affinity.
Enqueue: insert the task and keep the queue sorted by task heights.
Dequeue: search the queue for a task whose output block is in this thread's software cache; if found, return that task, otherwise return the head task.
Dispatcher: update the software caches via a cache coherency protocol with write invalidation.
A sketch of this dequeue policy follows.
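A hedged sketch of that dequeue policy in C; the queue and software-cache helpers are hypothetical placeholders.

#include <stddef.h>

typedef struct task  task_t;
typedef struct queue queue_t;    /* priority queue sorted by task height */
typedef struct cache cache_t;    /* LRU software cache whose lines are blocks */

extern task_t *queue_head( queue_t *q );
extern task_t *queue_next( queue_t *q, task_t *t );
extern task_t *queue_remove( queue_t *q, task_t *t );
extern void   *task_output_block( task_t *t );
extern int     cache_contains( cache_t *c, void *block );

task_t *dequeue_cache_affinity( queue_t *q, cache_t *my_cache )
{
    if ( queue_head( q ) == NULL ) return NULL;        /* nothing ready  */

    /* Prefer a task whose output block this thread already caches.      */
    for ( task_t *t = queue_head( q ); t != NULL; t = queue_next( q, t ) )
        if ( cache_contains( my_cache, task_output_block( t ) ) )
            return queue_remove( q, t );

    return queue_remove( q, queue_head( q ) );          /* else: head task */
}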
Scheduling: optimizations.
Prefetching: with N the number of cache lines (blocks), touch the first N blocks accessed by the DAG to preload the caches before execution starts; a small sketch appears below.
Thread preference: allow the thread that enqueues a task to dequeue it before other threads have the opportunity, limiting the variability of blocks migrating between threads.
(Figure: processing elements with private caches kept coherent over shared memory.)
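A small sketch of the prefetching pass, with the same kind of illustrative helpers as above (assumed names, not a real API): before any thread starts executing, walk the tasks in program order and preload the first N distinct blocks they reference.

typedef struct task  task_t;
typedef struct cache cache_t;

extern int   task_num_args( task_t *t );
extern void *task_arg_block( task_t *t, int a );
extern int   cache_contains( cache_t *c, void *block );
extern void  cache_insert( cache_t *c, void *block );  /* touches/loads block */

/* N = number of software cache lines (blocks). */
void prefetch_dag( task_t *tasks[], int n_tasks, cache_t *cache, int N )
{
    int loaded = 0;
    for ( int k = 0; k < n_tasks && loaded < N; k++ )
        for ( int a = 0; a < task_num_args( tasks[k] ) && loaded < N; a++ ) {
            void *blk = task_arg_block( tasks[k], a );
            if ( !cache_contains( cache, blk ) ) {
                cache_insert( cache, blk );             /* preload        */
                loaded++;
            }
        }
}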
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Performance: target architecture.
Four-socket 2.66 GHz Intel Dunnington, 24 cores, 16 MB shared L3 cache per socket, running Linux and Windows.
OpenMP: Intel compiler 11.1. BLAS: Intel MKL 10.2.
Performance: implementations compared.
SuperMatrix + serial MKL (with both the FIFO-queue and the cache-affinity schedulers), FLAME + multithreaded MKL, multithreaded MKL alone, and PLASMA + serial MKL.
All use double-precision real floating-point arithmetic with a tuned block size.
Performance: Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA), from the Innovative Computing Laboratory, University of Tennessee.
Creates persistent POSIX threads and uses static pipelining: all threads execute the sequential algorithm by tiles; if a task is ready it is executed, otherwise the thread stalls. The DAG is not explicitly constructed.
The matrix is copied from column-major order storage to a block data layout and back to column-major order.
Does not address programmability.
Performance (results plots; figures not reproduced).
Performance: inversion of a symmetric positive definite matrix, in three sweeps.
CHOL: Cholesky factorization, A → L L^T.
TRINV: inversion of a triangular matrix, R := L^{-1}.
TTMM: triangular matrix multiplication by its transpose, A^{-1} := R^T R.
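The three sweeps compose into the inverse through a short standard identity (general linear algebra, not specific to this implementation):

$$
A = L L^T
\;\Rightarrow\;
A^{-1} = (L L^T)^{-1} = L^{-T} L^{-1} = R^T R,
\qquad R := L^{-1} ,
$$

so CHOL produces L, TRINV produces R, and TTMM forms A^{-1} = R^T R. Since each sweep is itself an algorithm-by-blocks, its tasks can be fed into the same task queue and DAG that SuperMatrix already builds.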
Performance (results plots; figures not reproduced).
Performance: LU factorization.
LU factorization with partial pivoting, P A = L U, is numerically stable in practice.
LU factorization with incremental pivoting maps well to algorithm-by-blocks and has only slightly worse numerical behavior than partial pivoting.
(Figure: pivot row and column nb+i, with the corresponding row swaps applied to L and U.)
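For context, a hedged sketch of how incremental pivoting differs from partial pivoting; this is the standard formulation of incremental (tile) pivoting, stated as background rather than taken from the slide, and the block names below are my notation. Partial pivoting chooses each pivot from an entire column of the matrix. Incremental pivoting first factors the diagonal block, then couples the resulting upper triangular factor with one subdiagonal block at a time,

$$
P_i \begin{pmatrix} U_{11} \\ A_{i1} \end{pmatrix}
= \begin{pmatrix} L^{(i)}_{11} \\ L_{i1} \end{pmatrix} \tilde U_{11} ,
$$

applying the corresponding transformations only to the two affected block rows. Every task therefore touches a bounded number of blocks, which is what lets the factorization be expressed as an algorithm-by-blocks, at the cost of a somewhat weaker pivoting strategy.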
Performance (results plots; figures not reproduced).
Performance: summary of results.
Cache affinity vs. the FIFO queue; SuperMatrix out-of-order execution vs. PLASMA in-order execution; the high variability of work stealing vs. the predictable performance of cache affinity.
These results are representative of the performance of other dense linear algebra operations.
Outline: Introduction, SuperMatrix, Scheduling, Performance, Conclusion.
Conclusion
Separation of concerns allows us to experiment with different scheduling algorithms.
Locality, locality, locality: data communication is as important as load balance when scheduling matrix computations.
Conclusion: future work.
Intel Single-chip Cloud Computer: a master-slave approach with software-managed cache coherency, using the RCCE API (RCCE_send, RCCE_recv, RCCE_shmalloc).
Acknowledgments
Robert van de Geijn and Field Van Zee; I thank the other members of the FLAME team for their support.
Funding: Intel, Microsoft, and NSF grants CCF– and CCF–.
Conclusion
More information. Questions?