SuperMatrix: Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan, The University of Texas at Austin
SPAA, June 9-11, 2007

Motivation: Motivating Example
- Cholesky factorization: A → L L^T

Motivation: Better Peak Performance: 96 Gflops

Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

Performance: Target Architecture
- 16-CPU Itanium2 NUMA (8 dual-processor nodes)
- OpenMP (Intel Compiler 9.0)
- BLAS: GotoBLAS 1.06, Intel MKL 8.1

Performance: Implementations
- Multithreaded BLAS (sequential algorithm): LAPACK dpotrf, FLAME var3
- Serial BLAS (parallel algorithm): data-flow

Performance: Implementations
- Column-major order storage
- Varying block sizes: { 64, 96, 128, 160, 192, 224, 256 }
- Select the best performance for each problem size

Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

FLAME: Formal Linear Algebra Methods Environment
- High-level abstraction away from indices
- "Views" into matrices
- Seamless transition from algorithms to code

FLAME: Cholesky Factorization

  for ( j = 0; j < n; j++ ) {
    A[j,j] = sqrt( A[j,j] );

    for ( i = j+1; i < n; i++ )
      A[i,j] = A[i,j] / A[j,j];

    for ( k = j+1; k < n; k++ )
      for ( i = k; i < n; i++ )
        A[i,k] = A[i,k] - A[i,j] * A[k,j];
  }

FLAME: LAPACK dpotrf
- Different variant (right-looking)

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

FLAME: Partitioning Matrices

FLAME: blocked Cholesky factorization in FLAME/C

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************* */
                                              &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                              FLA_TL );
  }

Outline
- Performance
- FLAME
- SuperMatrix
  - Data-flow
  - 2D data affinity
  - Contiguous storage
- Conclusion

SuperMatrix: Cholesky Factorization, Iteration 1
- Tasks: Chol, Trsm, Syrk, Gemm

SuperMatrix: Cholesky Factorization, Iteration 2
- Tasks: Chol, Trsm, Syrk

SuperMatrix: Cholesky Factorization, Iteration 3
- Task: Chol
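
The per-iteration task breakdown above can be made concrete with a small counting sketch. This is illustrative C, not SuperMatrix code; the matrix-of-blocks size N is an assumption, set to 5 here so that the total matches the 35 tasks of the dispatcher example later in the talk.

  /* Count the tasks generated per iteration by a right-looking blocked
     Cholesky factorization of an N x N matrix of blocks (sketch only). */
  #include <stdio.h>

  int main( void )
  {
    int N = 5;                      /* blocks per dimension (assumption)   */
    int total = 0;

    for ( int k = 0; k < N; k++ )
    {
      int chol = 1;                 /* Chol on the diagonal block          */
      int trsm = N - k - 1;         /* Trsm on each block below it         */
      int syrk = N - k - 1;         /* Syrk on each diagonal block of A22  */
      int gemm = ( N - k - 1 ) * ( N - k - 2 ) / 2;   /* Gemm on the rest  */

      printf( "iteration %d: %d Chol, %d Trsm, %d Syrk, %d Gemm\n",
              k + 1, chol, trsm, syrk, gemm );
      total += chol + trsm + syrk + gemm;
    }
    printf( "total tasks: %d\n", total );             /* 35 when N = 5     */
    return 0;
  }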

SuperMatrix: Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks

SuperMatrix: Analyzer
[Figure: tasks (Chol, Trsm, Syrk, Gemm, ...) move from the task queue into the DAG of tasks]
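
To make the analyzer concrete, here is a minimal C sketch under assumed names (task_t, depends_on, and analyze are illustrative, not the SuperMatrix API). Each queued task carries a function pointer plus lists of the blocks it reads and writes; comparing those lists pairwise detects flow, anti, and output dependences, and the resulting DAG is represented here only by per-task dependence counts.

  #include <stdbool.h>

  #define MAX_OPERANDS 4

  typedef struct task
  {
    void (*func)( void ** );         /* e.g. a wrapper around dgemm        */
    void  *in [ MAX_OPERANDS ];      /* blocks read by the task            */
    void  *out[ MAX_OPERANDS ];      /* blocks overwritten by the task     */
    int    n_in, n_out;
    int    n_deps;                   /* unsatisfied incoming dependences   */
  } task_t;

  /* Does task b depend on the earlier task a?  Flow (a writes, b reads),
     anti (a reads, b writes), and output (both write) dependences.       */
  static bool depends_on( const task_t *a, const task_t *b )
  {
    for ( int i = 0; i < a->n_out; i++ )
    {
      for ( int j = 0; j < b->n_in;  j++ ) if ( a->out[i] == b->in [j] ) return true;
      for ( int j = 0; j < b->n_out; j++ ) if ( a->out[i] == b->out[j] ) return true;
    }
    for ( int i = 0; i < a->n_in; i++ )
      for ( int j = 0; j < b->n_out; j++ ) if ( a->in[i] == b->out[j] ) return true;
    return false;
  }

  /* Analyzer: compare every queued task against the tasks queued before it. */
  static void analyze( task_t *queue[], int n )
  {
    for ( int later = 1; later < n; later++ )
      for ( int earlier = 0; earlier < later; earlier++ )
        if ( depends_on( queue[earlier], queue[later] ) )
          queue[later]->n_deps++;
  }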

SuperMatrix: FLASH
- Matrix of matrices

SuperMatrix: Cholesky factorization with FLASH

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,     &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************* */
                                              &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,     &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,     A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                            /* ************** */   /* ***************** */
                              &ABL, /**/ &ABR,     A20, A21, /**/ A22,
                              FLA_TL );
  }

  FLASH_Queue_exec( );

SuperMatrix: Dispatcher
- Use the DAG to execute tasks out-of-order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism on blocks of computation
- SuperScalar vs. SuperMatrix
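
A sequential stand-in for such a dispatch loop is sketched below in C. The names (dag_node_t, dispatch) and the successor-list representation are assumptions for illustration, not the SuperMatrix implementation; the real dispatcher executes ready tasks concurrently on multiple threads and updates the counters under synchronization.

  #include <stdbool.h>

  #define MAX_SUCC 64

  typedef struct dag_node
  {
    void (*execute)( void * );        /* the task body, e.g. a dgemm wrapper */
    void  *args;                      /* the blocks the task operates on     */
    int    n_deps;                    /* predecessors not yet completed      */
    int    succ[ MAX_SUCC ];          /* indices of dependent tasks          */
    int    n_succ;
  } dag_node_t;

  static void dispatch( dag_node_t dag[], int n_tasks )
  {
    bool done[ 1024 ] = { false };    /* sketch only: assumes n_tasks <= 1024 */
    int  remaining = n_tasks;

    while ( remaining > 0 )
    {
      for ( int t = 0; t < n_tasks; t++ )
      {
        if ( done[t] || dag[t].n_deps > 0 )
          continue;                            /* already run, or not ready  */

        dag[t].execute( dag[t].args );         /* run the task               */
        done[t] = true;
        remaining--;

        for ( int s = 0; s < dag[t].n_succ; s++ )
          dag[ dag[t].succ[s] ].n_deps--;      /* release dependent tasks    */
      }
    }
  }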

SuperMatrix: Dispatcher
- 4 threads
- 5 x 5 matrix of blocks
- 35 tasks
- 14 stages
[Figure: DAG of Chol, Trsm, Syrk, and Gemm tasks]

SuperMatrix: Dispatcher
- Tasks write to block [2,2]
- No data affinity
[Figure: DAG of Chol, Trsm, Syrk, and Gemm tasks, showing the tasks that write to block [2,2]]

SuperMatrix: Blocks of Matrices → Tasks → Threads → Processors
- Denote tasks by the blocks overwritten
- Owner computes rule / data affinity: assigning tasks to threads
- CPU affinity: binding threads to processors

SuperMatrix: Data Affinity
- 2D block cyclic decomposition (ScaLAPACK)
- 4 x 4 matrix of blocks assigned to a 2 x 2 mesh
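
One way to express this owner-computes mapping is sketched below; the function name owner and the mesh constants are illustrative assumptions, using the slide's 4 x 4 matrix of blocks on a 2 x 2 mesh. The task that overwrites block (i, j) is assigned to thread owner(i, j).

  enum { MESH_ROWS = 2, MESH_COLS = 2 };      /* 2 x 2 mesh of threads       */

  static int owner( int i, int j )            /* block row i, block column j */
  {
    return ( i % MESH_ROWS ) * MESH_COLS + ( j % MESH_COLS );
  }

  /* For a 4 x 4 matrix of blocks this yields the 2D block cyclic assignment

         thread  0 1 0 1
                 2 3 2 3
                 0 1 0 1
                 2 3 2 3                                                     */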

SuperMatrix: Contiguous Storage
- One level of blocking
- The user does not need to know about the underlying storage of the data
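
A minimal sketch of one possible matrix-of-matrices layout with contiguously stored blocks follows; block_t, hier_matrix_t, and hier_create are illustrative names, not the FLASH API (FLASH builds its hierarchy out of FLA_Obj structures).

  #include <stdlib.h>

  typedef struct block
  {
    int     m, n;                /* block dimensions                          */
    double *buffer;              /* the block's elements, stored contiguously */
  } block_t;

  typedef struct hier_matrix
  {
    int      nb_rows, nb_cols;   /* number of blocks in each dimension        */
    block_t *blocks;             /* nb_rows x nb_cols block descriptors       */
  } hier_matrix_t;

  static hier_matrix_t *hier_create( int nb_rows, int nb_cols, int b )
  {
    hier_matrix_t *A = malloc( sizeof( *A ) );

    A->nb_rows = nb_rows;
    A->nb_cols = nb_cols;
    A->blocks  = malloc( (size_t) nb_rows * nb_cols * sizeof( block_t ) );

    for ( int i = 0; i < nb_rows * nb_cols; i++ )
    {
      A->blocks[i].m      = b;
      A->blocks[i].n      = b;
      A->blocks[i].buffer = malloc( (size_t) b * b * sizeof( double ) );
    }
    return A;
  }

Because each block is its own contiguous buffer, a task operates on whole blocks rather than on strided slices of a column-major matrix, which is why the user can remain unaware of the underlying storage.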

SuperMatrix: GotoBLAS vs. MKL
- All previous graphs link with GotoBLAS
- MKL is better tuned for small matrices on Itanium2

SuperMatrix: Results
- LAPACK chose a bad variant
- Data affinity and contiguous storage have a clear advantage
- Multithreaded GotoBLAS is tuned for large matrices
- MKL is better tuned for small matrices

Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

Conclusion: Key Points
- View blocks of matrices, instead of scalars, as the units of computation
- Apply instruction-level parallelism to blocks
- Abstract away from the low-level details of scheduling

Authors
- Ernie Chan, Robert van de Geijn (The University of Texas at Austin)
- Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí (Universidad Jaume I)

Acknowledgements
- We thank the other members of the FLAME team for their support: Field Van Zee
- Funding: NSF grant CCF

References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2), 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4), 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kagstrom. Three algorithms on distributed memory using packed storage. In Computational Science - Para, B. Kagstrom and E. Elmroth, eds., Lecture Notes in Computer Science, Springer-Verlag.
[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

Conclusion: More Information
Questions?