SuperMatrix: Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan, The University of Texas at Austin
SPAA 2007, June 9-11, 2007
Motivation: Motivating Example
Cholesky Factorization: A → L L^T
Motivation: Better Peak Performance (96 GFLOPS)
Outline: Performance, FLAME, SuperMatrix, Conclusion
Performance: Target Architecture
- 16-CPU Itanium2 NUMA (8 dual-processor nodes)
- OpenMP (Intel Compiler 9.0)
- BLAS: GotoBLAS 1.06, Intel MKL 8.1
Performance: Implementations
- Multithreaded BLAS with a sequential algorithm: LAPACK dpotrf, FLAME var3
- Serial BLAS with a parallel algorithm: data-flow
Performance: Implementations
- Column-major order storage
- Varying block sizes: { 64, 96, 128, 160, 192, 224, 256 }
- The best-performing block size is selected for each problem size
Performance [performance graph]
Outline: Performance, FLAME, SuperMatrix, Conclusion
FLAME: Formal Linear Algebra Methods Environment
- High-level abstraction away from indices
- "Views" into matrices
- Seamless transition from algorithms to code
FLAME: Cholesky Factorization (unblocked)

  for ( j = 0; j < n; j++ ) {
    A[j,j] = sqrt( A[j,j] );
    for ( i = j+1; i < n; i++ )
      A[i,j] = A[i,j] / A[j,j];
    for ( k = j+1; k < n; k++ )
      for ( i = k; i < n; i++ )
        A[i,k] = A[i,k] - A[i,j] * A[k,j];
  }
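The loop above translates directly into runnable code. Here is a minimal Python sketch (list-of-lists storage; as on the slide, only the lower triangle is read and written, so the strict upper triangle is ignored):

```python
import math

def chol_unblocked(A):
    """In-place unblocked Cholesky: overwrite the lower triangle of the
    symmetric positive-definite matrix A with L, where A = L @ L.T."""
    n = len(A)
    for j in range(n):
        A[j][j] = math.sqrt(A[j][j])          # factor the diagonal entry
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                # scale the column below it
        for k in range(j + 1, n):
            for i in range(k, n):             # rank-1 update, lower part only
                A[i][k] -= A[i][j] * A[k][j]
    return A
```

For A = [[4, 2], [2, 3]] this yields L with L[0][0] = 2, L[1][0] = 1, L[1][1] = sqrt(2).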
FLAME: LAPACK dpotrf (a different, right-looking variant)

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
FLAME: Partitioning Matrices
FLAME [figures]
FLAME: blocked Cholesky in FLAME/C

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
  }
FLAME [figure]
Outline: Performance, FLAME, SuperMatrix (data-flow; 2D data affinity; contiguous storage), Conclusion
SuperMatrix: Cholesky Factorization, Iteration 1 [diagram: Chol, then Trsm tasks, then Syrk and Gemm updates]
SuperMatrix: Cholesky Factorization, Iteration 2 [diagram: Chol, Trsm, Syrk]
SuperMatrix: Cholesky Factorization, Iteration 3 [diagram: Chol]
SuperMatrix: Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks
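The dependence analysis the analyzer performs can be sketched in a few lines of Python (illustrative names, not the FLAME API): each task carries the sets of blocks it reads and writes, and comparing tasks pairwise in program order yields flow (read-after-write), anti (write-after-read), and output (write-after-write) edges.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) in program order, where
    reads/writes are sets of block coordinates. Returns a list of edges
    (i, j, kind) meaning task j must wait for earlier task i."""
    edges = []
    for j, (_, rj, wj) in enumerate(tasks):
        for i, (_, ri, wi) in enumerate(tasks[:j]):
            if wi & rj:
                edges.append((i, j, "flow"))    # read-after-write
            if ri & wj:
                edges.append((i, j, "anti"))    # write-after-read
            if wi & wj:
                edges.append((i, j, "output"))  # write-after-write
    return edges
```

For the first tasks of a blocked Cholesky, Chol on block (0,0), Trsm reading (0,0) and writing (1,0), and Syrk reading (1,0) and writing (1,1), this produces the two flow edges Chol → Trsm → Syrk.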
SuperMatrix: Analyzer [diagram: task queue of Chol, Trsm, Syrk, Gemm, ... mapped to a DAG of tasks]
SuperMatrix: FLASH, a matrix of matrices
SuperMatrix: algorithm-by-blocks over a FLASH matrix (the repartitioning advances one block at a time, and execution is deferred until FLASH_Queue_exec)

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
  }

  FLASH_Queue_exec( );
SuperMatrix: Dispatcher
- Use the DAG to execute tasks out of order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
- SuperScalar vs. SuperMatrix
SuperMatrix: Dispatcher, 4 threads on a 5 x 5 matrix of blocks: 35 tasks, 14 stages [DAG diagram of Chol, Trsm, Syrk, and Gemm tasks]
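The 35-task count for a 5 x 5 matrix of blocks can be reproduced by enumerating the tasks of the blocked right-looking algorithm. This is a sketch with illustrative task names; the 14-stage figure additionally depends on how the dispatcher maps ready tasks onto the 4 threads, so it is not recomputed here.

```python
def cholesky_tasks(n):
    """Enumerate the tasks of blocked right-looking Cholesky on an
    n x n matrix of blocks: Chol on the diagonal block, Trsm on the
    panel below it, and Syrk/Gemm updates of the trailing lower
    triangle. Each task is tagged with the block it overwrites."""
    tasks = []
    for k in range(n):
        tasks.append(("Chol", k, k))
        for i in range(k + 1, n):
            tasks.append(("Trsm", i, k))          # panel solve
        for j in range(k + 1, n):
            tasks.append(("Syrk", j, j))          # diagonal update
            for i in range(j + 1, n):
                tasks.append(("Gemm", i, j))      # off-diagonal update
    return tasks
```

For n = 5 this gives 15 + 10 + 6 + 3 + 1 = 35 tasks, matching the slide.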
SuperMatrix [performance graph]
SuperMatrix: Dispatcher without data affinity: several tasks write to block [2,2] [DAG diagram of Chol, Trsm, Syrk, and Gemm tasks]
SuperMatrix: Assigning Tasks to Threads
- Blocks of matrices are to tasks as threads are to processors
- Owner-computes rule (data affinity): denote each task by the block it overwrites, and assign tasks to threads accordingly
- CPU affinity: binding threads to processors
SuperMatrix: Data Affinity
- 2D block-cyclic decomposition (as in ScaLAPACK)
- Example: 4 x 4 matrix of blocks assigned to a 2 x 2 mesh
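The 2D block-cyclic owner map is a one-liner. A sketch, assuming threads are numbered row-major on an r x c mesh:

```python
def owner(i, j, r, c):
    """2D block-cyclic owner of block (i, j) on an r x c thread mesh,
    returned as a flat thread id: mesh row i mod r, mesh column j mod c."""
    return (i % r) * c + (j % c)
```

On the slide's example, 4 x 4 blocks over a 2 x 2 mesh, each thread owns a 2 x 2 cyclic pattern of blocks.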
SuperMatrix [performance graph]
SuperMatrix: Contiguous Storage
- One level of blocking
- The user does not need to know about the underlying storage of the data
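The idea of storage-by-blocks can be sketched as follows (an illustration only, not FLASH's actual layout): copy a dense matrix into a matrix of blocks, each stored contiguously, so that every task operates on one contiguous buffer.

```python
def to_blocked(A, nb):
    """Copy a dense row-major matrix (list of lists, size divisible by nb)
    into a matrix of contiguous blocks: blocks[I][J] is the nb x nb block
    at block coordinates (I, J), stored as one flat row-major list."""
    n = len(A)
    N = n // nb
    return [[[A[I * nb + i][J * nb + j]
              for i in range(nb) for j in range(nb)]
             for J in range(N)] for I in range(N)]
```

For a 4 x 4 matrix with entries 0..15 and nb = 2, block (0,0) is [0, 1, 4, 5] and block (1,1) is [10, 11, 14, 15].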
SuperMatrix [performance graph]
SuperMatrix: GotoBLAS vs. MKL
- All previous graphs link with GotoBLAS
- MKL is better tuned for small matrices on Itanium2
SuperMatrix [performance graphs]
SuperMatrix: Results
- LAPACK chose a bad variant
- Data affinity and contiguous storage have a clear advantage
- Multithreaded GotoBLAS is tuned for large matrices; MKL is better tuned for small matrices
SuperMatrix [performance graphs]
Outline: Performance, FLAME, SuperMatrix, Conclusion
Conclusion: Key Points
- View blocks of matrices as units of computation instead of scalars
- Apply instruction-level parallelism to blocks
- Abstract away from the low-level details of scheduling
Authors: Ernie Chan and Robert van de Geijn (The University of Texas at Austin); Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí (Universidad Jaume I)
Acknowledgements: We thank the other members of the FLAME team for their support, in particular Field Van Zee. Funding: NSF grant CCF-0540926.
References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - Para 2006 (B. Kågström, E. Elmroth, eds.), Lecture Notes in Computer Science, Springer-Verlag, 2007.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.
Conclusion: More Information
http://www.cs.utexas.edu/users/flame
Questions? echan@cs.utexas.edu