1
Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Intel talk, March 22, 2010
2
Motivation
- "The Free Lunch Is Over" – Herb Sutter
- Parallelize or perish
- Popular libraries like the Linear Algebra PACKage (LAPACK) 3.0 must be completely rewritten:
  - FORTRAN 77
  - Column-major order matrix storage
  - 187+ operations for each datatype
  - One routine (algorithm) per operation
3
Teaser
- Better theoretical peak performance
4
Goals
- Programmability: use tools provided by FLAME
- Parallelism: directed acyclic graph (DAG) scheduling
5
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
6
SuperMatrix
- Formal Linear Algebra Methods Environment (FLAME)
- High-level abstractions for expressing linear algebra algorithms
- Cholesky factorization: A → L L^T
7
SuperMatrix
Blocked Cholesky factorization coded with the FLAME/C API:

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*-----------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
8
SuperMatrix
Cholesky factorization, blocked algorithm (each iteration, here shown for iterations 1 and 2):
- CHOL: A11 := Chol( A11 )
- TRSM: A21 := A21 A11^-T
- SYRK: A22 := A22 − A21 A21^T
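Written out, one blocked iteration applies the following updates to the 2x2 partitioning used by the FLAME code above (a restatement in LaTeX of what the three tasks compute):

  A_{11} \leftarrow L_{11} = \mathrm{Chol}(A_{11}), \qquad
  A_{21} \leftarrow A_{21} L_{11}^{-T}, \qquad
  A_{22} \leftarrow A_{22} - A_{21} A_{21}^{T}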
9
SuperMatrix
LAPACK-style implementation:

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
10
SuperMatrix
- FLASH: storage-by-blocks, algorithm-by-blocks
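FLASH's actual data structures are more general, but a minimal C sketch of the storage-by-blocks idea, a matrix of pointers to contiguously stored nb x nb blocks, looks like this; all names here are illustrative, not the FLASH API:

  #include <stdlib.h>

  /* One contiguously stored nb x nb block (column-major inside the block). */
  typedef struct {
      int     nb;     /* block dimension           */
      double *data;   /* nb*nb contiguous elements */
  } block_t;

  /* A matrix stored by blocks: an mb x nb grid of block pointers. */
  typedef struct {
      int       mb, nb;   /* number of block rows / block columns */
      block_t **blocks;   /* blocks[i + j*mb] is block (i,j)      */
  } blocked_matrix_t;

  static blocked_matrix_t *blocked_matrix_create(int mb, int nb, int block_dim)
  {
      blocked_matrix_t *A = malloc(sizeof *A);
      A->mb = mb;
      A->nb = nb;
      A->blocks = malloc((size_t)mb * nb * sizeof *A->blocks);
      for (int i = 0; i < mb * nb; i++) {
          block_t *b = malloc(sizeof *b);
          b->nb   = block_dim;
          b->data = calloc((size_t)block_dim * block_dim, sizeof *b->data);
          A->blocks[i] = b;
      }
      return A;
  }

Because each block is a separate contiguous allocation, a task can name the block it reads or writes by a single pointer, which is what the analyzer later uses to detect dependencies.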
11
SuperMatrix
The same loop converted to an algorithm-by-blocks with FLASH (repartitioning one block at a time and calling FLASH task routines):

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*-----------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*-----------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
  }
12
SuperMatrix
Cholesky factorization, iteration 1:
- CHOL 0: Chol( A(0,0) )
13
SuperMatrix
Cholesky factorization, iteration 1 (continued):
- CHOL 0: Chol( A(0,0) )
- TRSM 1: A(1,0) := A(1,0) A(0,0)^-T
- TRSM 2: A(2,0) := A(2,0) A(0,0)^-T
14
SuperMatrix
Cholesky factorization, iteration 1 (continued):
- CHOL 0: Chol( A(0,0) )
- TRSM 1: A(1,0) := A(1,0) A(0,0)^-T
- TRSM 2: A(2,0) := A(2,0) A(0,0)^-T
- SYRK 3: A(1,1) := A(1,1) − A(1,0) A(1,0)^T
- GEMM 4: A(2,1) := A(2,1) − A(2,0) A(1,0)^T
- SYRK 5: A(2,2) := A(2,2) − A(2,0) A(2,0)^T
15
SuperMatrix
Cholesky factorization, iteration 2:
- CHOL 6: Chol( A(1,1) )
- TRSM 7: A(2,1) := A(2,1) A(1,1)^-T
- SYRK 8: A(2,2) := A(2,2) − A(2,1) A(2,1)^T
16
SuperMatrix
Cholesky factorization, iteration 3:
- CHOL 9: Chol( A(2,2) )
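The task list built up on the preceding slides can be generated mechanically from the blocked algorithm. A hedged C sketch of that enumeration follows; add_task and BLK are illustrative helpers, not the FLASH/SuperMatrix API, and blocks that are read and written are passed as the out argument:

  typedef enum { CHOL, TRSM, SYRK, GEMM } op_t;

  /* blk[(i) + (j)*(b)] is the pointer to block (i,j) of a b x b grid. */
  #define BLK(blk, i, j, b) ((blk)[(i) + (j) * (b)])

  /* add_task(op, in0, in1, out): assumed helper that appends one task and
   * records which blocks it reads and which block it writes (for the analyzer). */
  void add_task(op_t op, void *in0, void *in1, void *out);

  /* Enumerate the Cholesky-by-blocks tasks in the same program order as the
   * slides: CHOL 0, TRSM 1, TRSM 2, SYRK 3, GEMM 4, SYRK 5, CHOL 6, ...       */
  static void cholesky_by_blocks(void **blk, int b)
  {
      for (int k = 0; k < b; k++) {
          add_task(CHOL, NULL, NULL, BLK(blk, k, k, b));          /* Chol( A(k,k) )             */
          for (int i = k + 1; i < b; i++)                         /* A(i,k) := A(i,k) A(k,k)^-T */
              add_task(TRSM, BLK(blk, k, k, b), NULL, BLK(blk, i, k, b));
          for (int j = k + 1; j < b; j++) {
              add_task(SYRK, BLK(blk, j, k, b), NULL,             /* A(j,j) -= A(j,k) A(j,k)^T  */
                       BLK(blk, j, j, b));
              for (int i = j + 1; i < b; i++)                     /* A(i,j) -= A(i,k) A(j,k)^T  */
                  add_task(GEMM, BLK(blk, i, k, b), BLK(blk, j, k, b),
                           BLK(blk, i, j, b));
          }
      }
  }

For b = 3 this produces exactly the ten tasks numbered 0 through 9 above.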
17
SuperMatrix
Separation of concerns:
- Analyzer
  - Decomposes subproblems into component tasks
  - Stores tasks in a global task queue sequentially
  - Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
- Dispatcher
  - Spawns threads
  - Schedules and dispatches tasks to threads in parallel
18
SuperMatrix
Analyzer:
- Detects flow, anti, and output dependencies
- Embeds pointers into hierarchical matrices
- Block size manifests as the size of the contiguously stored blocks
- Can be performed statically
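This is not the SuperMatrix implementation, but a minimal sketch of how flow (read-after-write), anti (write-after-read), and output (write-after-write) dependencies can be derived purely from each task's input and output block pointers, as the slide describes; the data structure names are illustrative:

  #include <stddef.h>

  #define MAX_SUCC 64

  /* A task reads up to two blocks and writes one (enough for Chol/Trsm/Syrk/Gemm). */
  typedef struct task {
      const void  *in[2];            /* blocks read                        */
      int          n_in;
      void        *out;              /* block written                      */
      struct task *succ[MAX_SUCC];   /* tasks that must wait for this one  */
      int          n_succ;
      int          n_unsatisfied;    /* incoming dependencies not yet met  */
  } task_t;

  /* Record that 'later' must wait for 'earlier'. */
  static void add_edge(task_t *earlier, task_t *later)
  {
      earlier->succ[earlier->n_succ++] = later;
      later->n_unsatisfied++;
  }

  /* Compare each task against all earlier tasks in program order,
   * using only their input/output blocks. */
  static void analyze(task_t *tasks, int n)
  {
      for (int j = 1; j < n; j++) {
          for (int i = 0; i < j; i++) {
              task_t *a = &tasks[i], *b = &tasks[j];
              for (int k = 0; k < b->n_in; k++)        /* flow: read after write    */
                  if (b->in[k] == a->out) add_edge(a, b);
              for (int k = 0; k < a->n_in; k++)        /* anti: write after read    */
                  if (a->in[k] == b->out) add_edge(a, b);
              if (a->out == b->out) add_edge(a, b);    /* output: write after write */
          }
      }
  }

Because blocks are contiguous allocations with unique base addresses, pointer equality is enough to decide whether two tasks touch the same block.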
19
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
20
Scheduling
Dispatcher:

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
          Update dependent task
          if dependent task is ready then
              Enqueue dependent task
  end
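A compact sketch of that dispatcher loop in C, using a single shared FIFO and the dependency counters from the analyzer sketch above; the pthread details, tasks_remain, and execute are illustrative assumptions, not the SuperMatrix runtime:

  #include <pthread.h>

  #define QUEUE_CAP 4096
  static task_t         *ready[QUEUE_CAP];
  static int             qhead = 0, qtail = 0;
  static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

  static void enqueue(task_t *t)
  {
      pthread_mutex_lock(&qlock);
      ready[qtail++ % QUEUE_CAP] = t;        /* illustrative: no overflow check */
      pthread_mutex_unlock(&qlock);
  }

  static task_t *dequeue(void)
  {
      task_t *t = NULL;
      pthread_mutex_lock(&qlock);
      if (qhead < qtail)
          t = ready[qhead++ % QUEUE_CAP];
      pthread_mutex_unlock(&qlock);
      return t;
  }

  /* Worker loop run by every thread: execute a task, then release its dependents. */
  static void *worker(void *arg)
  {
      (void)arg;
      while (tasks_remain()) {                 /* assumed: global termination test      */
          task_t *t = dequeue();
          if (t == NULL)
              continue;                        /* nothing ready yet                     */
          execute(t);                          /* assumed: calls the BLAS/LAPACK kernel */
          for (int i = 0; i < t->n_succ; i++)  /* update dependents; enqueue ready ones */
              if (__atomic_sub_fetch(&t->succ[i]->n_unsatisfied, 1, __ATOMIC_SEQ_CST) == 0)
                  enqueue(t->succ[i]);
      }
      return NULL;
  }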
22
Scheduling
- Supermarket: p lines, one for each of p cashiers
  - Efficient enqueue and dequeue
  - Schedule depends on the task-to-thread assignment
- Bank: one line for p tellers
  - Enqueue and dequeue become bottlenecks
  - Dynamic dispatching of tasks to threads
23
Scheduling
Single queue:
- Set of all ready and available tasks
- FIFO or priority ordering
- All processing elements (PE 0 … PE p−1) enqueue to and dequeue from the one shared queue
24
Scheduling
Multiple queues:
- One queue per processing element (PE 0 … PE p−1)
- Work stealing, data affinity
25
Scheduling
Work stealing:

  Enqueue ready tasks
  while tasks are available do
      Dequeue task
      if task ≠ Ø then
          Execute task
          Update dependent tasks …
      else
          Steal task
  end

- Enqueue: place all dependent tasks on the queue of the same thread that executes the task
- Steal: select a random thread and remove a task from the tail of its queue
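A minimal per-thread queue with one lock per queue, continuing the task_t sketch above: the owner takes work from the head, thieves steal from the tail of a random victim, as the slide describes. This is only a sketch (not a lock-free Cilk-style deque, and not SuperMatrix's code); MAX_THREADS and the mutexes being initialized at startup are assumptions:

  #include <pthread.h>
  #include <stdlib.h>

  #define MAX_THREADS 64            /* assumed upper bound, not from the slides */
  #define DEQUE_CAP   4096

  typedef struct {
      task_t         *buf[DEQUE_CAP];
      int             head, tail;   /* owner pops at head; thieves steal at tail */
      pthread_mutex_t lock;         /* initialized with pthread_mutex_init (omitted) */
  } deque_t;

  static deque_t deque[MAX_THREADS];
  static int     nthreads;

  static void push_local(int me, task_t *t)
  {
      pthread_mutex_lock(&deque[me].lock);
      deque[me].buf[deque[me].tail++ % DEQUE_CAP] = t;
      pthread_mutex_unlock(&deque[me].lock);
  }

  static task_t *pop_local(int me)
  {
      task_t *t = NULL;
      pthread_mutex_lock(&deque[me].lock);
      if (deque[me].head < deque[me].tail)
          t = deque[me].buf[deque[me].head++ % DEQUE_CAP];
      pthread_mutex_unlock(&deque[me].lock);
      return t;
  }

  /* Steal: pick a random victim and take a task from the tail of its queue. */
  static task_t *steal(int me)
  {
      int victim = rand() % nthreads;
      if (victim == me)
          return NULL;
      task_t *t = NULL;
      pthread_mutex_lock(&deque[victim].lock);
      if (deque[victim].head < deque[victim].tail)
          t = deque[victim].buf[--deque[victim].tail % DEQUE_CAP];
      pthread_mutex_unlock(&deque[victim].lock);
      return t;
  }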
26
Scheduling
Work stealing with a mailbox:
- Each thread has an associated mailbox
- Enqueue the task onto a queue and also place it in a mailbox
- Tasks can be assigned to mailboxes using a 2D distribution
- Before attempting a steal, a thread first checks its mailbox
- Optimizes for data locality instead of random stealing
- The mailbox is only checked when a steal would otherwise occur
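Continuing the sketch above, the mailbox can be as simple as a one-slot pointer per thread that is checked before a random steal; this is purely illustrative:

  /* One-slot mailbox per thread, filled by the enqueuer according to the
   * 2D distribution of output blocks.  Illustrative only.                */
  static task_t *mailbox[MAX_THREADS];

  static task_t *next_task(int me)
  {
      task_t *t = pop_local(me);
      if (t != NULL)
          return t;
      /* Check the mailbox before resorting to a random steal. */
      t = __atomic_exchange_n(&mailbox[me], NULL, __ATOMIC_SEQ_CST);
      if (t != NULL)
          return t;
      return steal(me);
  }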
27
Scheduling
Data affinity:
- Assign all tasks that write to a particular block to the same thread
- Owner-computes rule
- 2D block cyclic distribution
Execution trace (Cholesky factorization, 4000×4000):
- Total time: 2D data affinity ≈ FIFO queue
- Idle threads: 2D ≈ 27% and FIFO ≈ 17%
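The owner-computes assignment under a 2D block cyclic distribution follows directly from the block indices. A small sketch, where the r x c grid of threads is an assumed parameterization rather than a value taken from the slides:

  /* Owner of block (i, j) under a 2D block cyclic distribution over an
   * r x c grid of threads (r * c = number of threads). */
  static int owner_2d_block_cyclic(int i, int j, int r, int c)
  {
      return (i % r) * c + (j % c);
  }

Every task whose output block is (i, j) is then enqueued on (or mailed to) thread owner_2d_block_cyclic(i, j, r, c).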
28
Scheduling
- Data granularity: the cost of a task >> the cost of enqueue and dequeue
- Single vs. multiple queues:
  - A FIFO queue increases load balance
  - 2D data affinity decreases data communication
  - Combine the best aspects of both
29
Scheduling
Cache affinity:
- Single priority queue sorted by task height
- Software cache per thread
  - LRU replacement
  - Line = block
  - Fully associative
30
Scheduling
Cache affinity:
- Enqueue
  - Insert task
  - Sort queue by task heights
- Dequeue
  - Search the queue for a task whose output block is in the thread's software cache
  - If found, return that task; otherwise return the head task
- Dispatcher
  - Update the software caches via a cache coherency protocol with write invalidation
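A hedged sketch of that dequeue policy, continuing the earlier task_t sketch; the software cache here is just a small array of recently written block pointers per thread, LRU maintenance and locking are omitted, and all names are illustrative:

  #define CACHE_LINES 8                /* assumed software-cache size in blocks */

  typedef struct {
      const void *line[CACHE_LINES];   /* recently written output blocks */
  } soft_cache_t;

  static soft_cache_t soft_cache[MAX_THREADS];

  static int in_soft_cache(int me, const void *block)
  {
      for (int i = 0; i < CACHE_LINES; i++)
          if (soft_cache[me].line[i] == block)
              return 1;
      return 0;
  }

  /* queue[0..*qlen-1] is the shared priority queue sorted by task height;
   * caller must ensure *qlen > 0.  Prefer a task whose output block is
   * already in this thread's software cache, else take the head task.    */
  static task_t *dequeue_cache_affinity(int me, task_t **queue, int *qlen)
  {
      int pick = 0;
      for (int i = 0; i < *qlen; i++)
          if (in_soft_cache(me, queue[i]->out)) { pick = i; break; }
      task_t *t = queue[pick];
      for (int i = pick; i < *qlen - 1; i++)    /* remove the chosen task */
          queue[i] = queue[i + 1];
      (*qlen)--;
      return t;
  }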
31
Scheduling
Optimizations:
- Prefetching
  - N = number of cache lines (blocks)
  - Touch the first N blocks accessed by the DAG to preload the cache before execution starts
- Thread preference
  - Allow the thread that enqueues a task to dequeue it before other threads have the opportunity
  - Limits the variability of blocks migrating between threads
32
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
33
Performance
Target architecture:
- 4-socket 2.66 GHz Intel Dunnington (24 cores)
- Linux and Windows
- 16 MB shared L3 cache per socket
- OpenMP (Intel compiler 11.1)
- BLAS (Intel MKL 10.2)
34
Performance
Implementations:
- SuperMatrix + serial MKL
  - FIFO queue, cache affinity
- FLAME + multithreaded MKL
- Multithreaded MKL
- PLASMA + serial MKL
Double-precision real floating-point arithmetic; tuned block size
35
Performance
Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA)
- Innovative Computing Laboratory, University of Tennessee
- Create persistent POSIX threads
- Static pipelining
  - All threads execute the sequential algorithm by tiles
  - If a task is ready, execute it; otherwise, stall
- DAG is not explicitly constructed
- Copy the matrix from column-major order storage to block data layout and back to column-major
- Does not address programmability
Slides 36–40: Performance (graphs).
41
Performance
Inversion of a symmetric positive definite matrix:
- CHOL: Cholesky factorization, A → L L^T
- TRINV: inversion of a triangular matrix, R := L^-1
- TTMM: triangular matrix multiplication by its transpose, A^-1 := R^T R
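Chaining the three stages gives the inverse; as a one-line check in LaTeX:

  A = L L^{T}, \quad R = L^{-1} \;\Rightarrow\; A^{-1} = (L L^{T})^{-1} = L^{-T} L^{-1} = R^{T} R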
Slides 42–50: Performance (graphs).
51
Performance
LU factorization with partial pivoting: P A = L U
- Numerically stable in practice
LU factorization with incremental pivoting:
- Maps well to algorithm-by-blocks
- Only slightly worse numerical behavior than partial pivoting
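For reference, a minimal unblocked LU with partial pivoting on a column-major n x n matrix; this is the textbook kernel the P A = L U line refers to, not the incremental-pivoting algorithm-by-blocks, and the routine name is illustrative:

  #include <math.h>

  /* Factor A (column-major, lda >= n) in place as P A = L U; piv[k] records the
   * row swapped with row k.  Returns 0 on success, k+1 if a zero pivot occurs. */
  static int lu_partial_pivot(int n, double *A, int lda, int *piv)
  {
      for (int k = 0; k < n; k++) {
          int p = k;                               /* largest-magnitude entry in column k */
          for (int i = k + 1; i < n; i++)
              if (fabs(A[i + k * lda]) > fabs(A[p + k * lda]))
                  p = i;
          piv[k] = p;
          if (A[p + k * lda] == 0.0)
              return k + 1;
          if (p != k)                              /* swap rows k and p */
              for (int j = 0; j < n; j++) {
                  double tmp = A[k + j * lda];
                  A[k + j * lda] = A[p + j * lda];
                  A[p + j * lda] = tmp;
              }
          for (int i = k + 1; i < n; i++) {        /* eliminate below the pivot */
              A[i + k * lda] /= A[k + k * lda];
              for (int j = k + 1; j < n; j++)
                  A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
          }
      }
      return 0;
  }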
Slides 52–54: Performance (graphs).
55
Performance
Results:
- Cache affinity vs. FIFO queue
- SuperMatrix out-of-order vs. PLASMA in-order
- High variability of work stealing vs. predictable cache-affinity performance
- Representative of the performance of other dense linear algebra operations
56
Outline
- Introduction
- SuperMatrix
- Scheduling
- Performance
- Conclusion
57
Conclusion
- Separation of concerns
  - Allows us to experiment with different scheduling algorithms
- Locality, locality, locality
  - Data communication is as important as load balance when scheduling matrix computations
58
Conclusion
Future work:
- Intel Single-chip Cloud Computer (SCC)
  - Master-slave approach
  - Software-managed cache coherency
  - RCCE API: RCCE_send, RCCE_recv, RCCE_shmalloc
59
Acknowledgments
- Robert van de Geijn, Field Van Zee
- I thank the other members of the FLAME team for their support
- Funding: Intel, Microsoft, NSF grants CCF-0540926 and CCF-0702714
60
References
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations on SMP and multi-core architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116–125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123–132, Salt Lake City, UT, USA, February 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Field G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1–14:26, July 2009.
61
Conclusion
More information: http://www.cs.utexas.edu/~flame
Questions? echan@cs.utexas.edu