SuperMatrix: Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan, The University of Texas at Austin
SPAA 2007, June 9-11, 2007
Motivation: Motivating Example
Cholesky Factorization: A → L L^T
Motivation: Better Peak Performance (96 GFLOPS)
Outline: Performance, FLAME, SuperMatrix, Conclusion
Performance: Target Architecture
- 16-CPU Itanium2 NUMA (8 dual-processor nodes)
- OpenMP (Intel Compiler 9.0)
- BLAS: GotoBLAS 1.06, Intel MKL 8.1
Performance: Implementations
- Multithreaded BLAS with a sequential algorithm: LAPACK dpotrf, FLAME var3
- Serial BLAS with a parallel algorithm: data-flow
Performance: Implementations
- Column-major order storage
- Varying block sizes: { 64, 96, 128, 160, 192, 224, 256 }
- The best-performing block size is selected for each problem size
Performance [performance graph]
Outline: Performance, FLAME, SuperMatrix, Conclusion
FLAME: Formal Linear Algebra Methods Environment
- High-level abstraction away from indices
- "Views" into matrices
- Seamless transition from algorithms to code
FLAME: Cholesky Factorization (unblocked)

  for ( j = 0; j < n; j++ ) {
    A[j,j] = sqrt( A[j,j] );
    for ( i = j+1; i < n; i++ )
      A[i,j] = A[i,j] / A[j,j];
    for ( k = j+1; k < n; k++ )
      for ( i = k; i < n; i++ )
        A[i,k] = A[i,k] - A[i,j] * A[k,j];
  }
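The loop above translates directly into runnable code. Here is a minimal Python sketch (list-of-lists storage; as on the slide, only the lower triangle is read and written, so the strict upper triangle is ignored):

```python
import math

def chol_unblocked(A):
    """In-place unblocked Cholesky: overwrite the lower triangle of the
    symmetric positive-definite matrix A with L, where A = L @ L.T."""
    n = len(A)
    for j in range(n):
        A[j][j] = math.sqrt(A[j][j])          # factor the diagonal entry
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                # scale the column below it
        for k in range(j + 1, n):
            for i in range(k, n):             # rank-1 update, lower part only
                A[i][k] -= A[i][j] * A[k][j]
    return A
```

For A = [[4, 2], [2, 3]] this yields L with L[0][0] = 2, L[1][0] = 1, L[1][1] = sqrt(2).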
FLAME: LAPACK dpotrf (a different, right-looking variant)

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
FLAME: Partitioning Matrices
FLAME [figures]
FLAME: blocked Cholesky in FLAME/C

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
  }
FLAME [figure]
Outline: Performance, FLAME, SuperMatrix (data-flow; 2D data affinity; contiguous storage), Conclusion
SuperMatrix: Cholesky Factorization, Iteration 1 [diagram: Chol, then Trsm tasks, then Syrk and Gemm updates]
SuperMatrix: Cholesky Factorization, Iteration 2 [diagram: Chol, Trsm, Syrk]
SuperMatrix: Cholesky Factorization, Iteration 3 [diagram: Chol]
SuperMatrix: Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks
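The dependence analysis the analyzer performs can be sketched in a few lines of Python (illustrative names, not the FLAME API): each task carries the sets of blocks it reads and writes, and comparing tasks pairwise in program order yields flow (read-after-write), anti (write-after-read), and output (write-after-write) edges.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) in program order, where
    reads/writes are sets of block coordinates. Returns a list of edges
    (i, j, kind) meaning task j must wait for earlier task i."""
    edges = []
    for j, (_, rj, wj) in enumerate(tasks):
        for i, (_, ri, wi) in enumerate(tasks[:j]):
            if wi & rj:
                edges.append((i, j, "flow"))    # read-after-write
            if ri & wj:
                edges.append((i, j, "anti"))    # write-after-read
            if wi & wj:
                edges.append((i, j, "output"))  # write-after-write
    return edges
```

For the first tasks of a blocked Cholesky, Chol on block (0,0), Trsm reading (0,0) and writing (1,0), and Syrk reading (1,0) and writing (1,1), this produces the two flow edges Chol → Trsm → Syrk.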
SuperMatrix: Analyzer [diagram: task queue of Chol, Trsm, Syrk, Gemm, ... mapped to a DAG of tasks]
SuperMatrix: FLASH, a matrix of matrices
SuperMatrix: algorithm-by-blocks over a FLASH matrix (the repartitioning advances one block at a time, and execution is deferred until FLASH_Queue_exec)

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
  }

  FLASH_Queue_exec( );
SuperMatrix: Dispatcher
- Use the DAG to execute tasks out of order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
- SuperScalar vs. SuperMatrix
SuperMatrix: Dispatcher, 4 threads on a 5 x 5 matrix of blocks: 35 tasks, 14 stages [DAG diagram of Chol, Trsm, Syrk, and Gemm tasks]
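The 35-task count for a 5 x 5 matrix of blocks can be reproduced by enumerating the tasks of the blocked right-looking algorithm. This is a sketch with illustrative task names; the 14-stage figure additionally depends on how the dispatcher maps ready tasks onto the 4 threads, so it is not recomputed here.

```python
def cholesky_tasks(n):
    """Enumerate the tasks of blocked right-looking Cholesky on an
    n x n matrix of blocks: Chol on the diagonal block, Trsm on the
    panel below it, and Syrk/Gemm updates of the trailing lower
    triangle. Each task is tagged with the block it overwrites."""
    tasks = []
    for k in range(n):
        tasks.append(("Chol", k, k))
        for i in range(k + 1, n):
            tasks.append(("Trsm", i, k))          # panel solve
        for j in range(k + 1, n):
            tasks.append(("Syrk", j, j))          # diagonal update
            for i in range(j + 1, n):
                tasks.append(("Gemm", i, j))      # off-diagonal update
    return tasks
```

For n = 5 this gives 15 + 10 + 6 + 3 + 1 = 35 tasks, matching the slide.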
SuperMatrix [performance graph]
SuperMatrix: Dispatcher without data affinity: several tasks write to block [2,2] [DAG diagram of Chol, Trsm, Syrk, and Gemm tasks]
SuperMatrix: Assigning Tasks to Threads
- Blocks of matrices are to tasks as threads are to processors
- Owner-computes rule (data affinity): denote each task by the block it overwrites, and assign tasks to threads accordingly
- CPU affinity: binding threads to processors
SuperMatrix: Data Affinity
- 2D block-cyclic decomposition (as in ScaLAPACK)
- Example: 4 x 4 matrix of blocks assigned to a 2 x 2 mesh
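The 2D block-cyclic owner map is a one-liner. A sketch, assuming threads are numbered row-major on an r x c mesh:

```python
def owner(i, j, r, c):
    """2D block-cyclic owner of block (i, j) on an r x c thread mesh,
    returned as a flat thread id: mesh row i mod r, mesh column j mod c."""
    return (i % r) * c + (j % c)
```

On the slide's example, 4 x 4 blocks over a 2 x 2 mesh, each thread owns a 2 x 2 cyclic pattern of blocks.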
SuperMatrix [performance graph]
SuperMatrix: Contiguous Storage
- One level of blocking
- The user does not need to know about the underlying storage of the data
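The idea of storage-by-blocks can be sketched as follows (an illustration only, not FLASH's actual layout): copy a dense matrix into a matrix of blocks, each stored contiguously, so that every task operates on one contiguous buffer.

```python
def to_blocked(A, nb):
    """Copy a dense row-major matrix (list of lists, size divisible by nb)
    into a matrix of contiguous blocks: blocks[I][J] is the nb x nb block
    at block coordinates (I, J), stored as one flat row-major list."""
    n = len(A)
    N = n // nb
    return [[[A[I * nb + i][J * nb + j]
              for i in range(nb) for j in range(nb)]
             for J in range(N)] for I in range(N)]
```

For a 4 x 4 matrix with entries 0..15 and nb = 2, block (0,0) is [0, 1, 4, 5] and block (1,1) is [10, 11, 14, 15].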
SuperMatrix [performance graph]
SuperMatrix: GotoBLAS vs. MKL
- All previous graphs link with GotoBLAS
- MKL is better tuned for small matrices on Itanium2
SuperMatrix [performance graphs]
SuperMatrix: Results
- LAPACK chose a bad variant
- Data affinity and contiguous storage have a clear advantage
- Multithreaded GotoBLAS is tuned for large matrices; MKL is better tuned for small matrices
SuperMatrix [performance graphs]
Outline: Performance, FLAME, SuperMatrix, Conclusion
Conclusion: Key Points
- View blocks of matrices as units of computation instead of scalars
- Apply instruction-level parallelism to blocks
- Abstract away from the low-level details of scheduling
Authors: Ernie Chan and Robert van de Geijn (The University of Texas at Austin); Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí (Universidad Jaume I)
Acknowledgements: We thank the other members of the FLAME team for their support, in particular Field Van Zee. Funding: NSF grant CCF-0540926.
References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - Para 2006 (B. Kågström, E. Elmroth, eds.), Lecture Notes in Computer Science, Springer-Verlag, 2007.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.
Conclusion: More Information
http://www.cs.utexas.edu/users/flame
Questions? echan@cs.utexas.edu