Slide 1: SuperMatrix: Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan, The University of Texas at Austin
SPAA 2007, June 9-11, 2007

Slide 2: Motivation
Motivating example: Cholesky factorization, A → L L^T (A symmetric positive definite, L lower triangular).

Slide 3: Motivation
Performance graph; the peak performance of the target machine is 96 GFLOPS.

Slide 4: Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

Slide 5: Performance
Target architecture: 16-CPU Itanium2 NUMA system (8 dual-processor nodes).
OpenMP: Intel Compiler 9.0.
BLAS: GotoBLAS 1.06 and Intel MKL 8.1.

Slide 6: Performance
Implementations:
- Multithreaded BLAS (sequential algorithm): LAPACK dpotrf; FLAME var3
- Serial BLAS (parallel algorithm): data-flow

Slide 7: Performance
Implementations use column-major order storage. Block sizes vary over { 64, 96, 128, 160, 192, 224, 256 }; the best-performing block size is selected for each problem size.

Slide 8: Performance (graph slide; no transcript text)

Slide 9: Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

Slide 10: FLAME
Formal Linear Algebra Methods Environment:
- High-level abstraction away from indices, using "views" into matrices
- Seamless transition from algorithms to code

Slide 11: FLAME
Cholesky factorization:

for ( j = 0; j < n; j++ ) {
    A[j,j] = sqrt( A[j,j] );
    for ( i = j+1; i < n; i++ )
        A[i,j] = A[i,j] / A[j,j];
    for ( k = j+1; k < n; k++ )
        for ( i = k; i < n; i++ )
            A[i,k] = A[i,k] - A[i,j] * A[k,j];
}
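
The slide's loop nest uses A[i,j] as shorthand. A minimal runnable C translation, assuming column-major storage with leading dimension lda (the names chol_unb and lda are ours, not from the slide):

#include <math.h>

/* Unblocked right-looking Cholesky on the lower triangle of an
   n x n column-major matrix A with leading dimension lda.
   Overwrites the lower triangle of A with L such that A = L * L^T. */
static void chol_unb( int n, double *A, int lda )
{
    for ( int j = 0; j < n; j++ ) {
        A[ j + j * lda ] = sqrt( A[ j + j * lda ] );   /* A[j,j] = sqrt( A[j,j] ) */
        for ( int i = j + 1; i < n; i++ )
            A[ i + j * lda ] /= A[ j + j * lda ];      /* scale column j */
        for ( int k = j + 1; k < n; k++ )              /* rank-1 update of the */
            for ( int i = k; i < n; i++ )              /* trailing submatrix   */
                A[ i + k * lda ] -= A[ i + j * lda ] * A[ k + j * lda ];
    }
}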

Slide 12: FLAME
LAPACK dpotrf uses a different variant (right-looking):

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

Slide 13: FLAME
Partitioning matrices: diagram of the 2 x 2 partitioning of A into quadrants ATL, ATR, ABL, ABR used by the code below.

Slides 14-15: FLAME (diagram slides; no transcript text)

Slide 16: FLAME

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************ */   /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************* */   /* **************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
}

Slide 17: FLAME (graph slide; no transcript text)

Slide 18: Outline
- Performance
- FLAME
- SuperMatrix: data-flow, 2D data affinity, contiguous storage
- Conclusion

Slide 19: SuperMatrix
Cholesky factorization, iteration 1 (3 x 3 matrix of blocks): one Chol task on the diagonal block, one Trsm per block of the column below it, and Syrk/Gemm updates of the trailing submatrix: Chol; Trsm, Trsm; Syrk, Gemm, Syrk.

Slide 20: SuperMatrix
Cholesky factorization, iteration 2: one Chol, one Trsm, and one Syrk task.

Slide 21: SuperMatrix
Cholesky factorization, iteration 3: a single Chol task.

Slide 22: SuperMatrix
Analyzer:
- Delay execution and place tasks on a queue; tasks are function pointers annotated with input/output information.
- Compute dependence information (flow, anti, output) between all tasks, creating a DAG of tasks.
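
A minimal sketch of what such an annotated task record and dependence test might look like (all names here, task_t, operand_t, depends_on, are our illustration, not the FLAME API). Flow (write then read), anti (read then write), and output (write then write) dependences all reduce to "the two tasks touch the same block and at least one of them writes it":

#include <stdbool.h>

#define MAX_OPERANDS 4

typedef struct {
    int  block_i, block_j;   /* coordinates of the block operand */
    bool written;            /* true if the task overwrites it   */
} operand_t;

typedef struct task {
    void      (*func)( struct task * );   /* the wrapped block operation */
    operand_t   operand[ MAX_OPERANDS ];  /* annotated input/output info */
    int         n_operands;
} task_t;

/* Later task t2 depends on earlier task t1 if they reference the same
   block and at least one of the two references is a write; this covers
   flow, anti, and output dependences in one test. */
static bool depends_on( const task_t *t1, const task_t *t2 )
{
    for ( int a = 0; a < t1->n_operands; a++ )
        for ( int b = 0; b < t2->n_operands; b++ )
            if ( t1->operand[ a ].block_i == t2->operand[ b ].block_i &&
                 t1->operand[ a ].block_j == t2->operand[ b ].block_j &&
                 ( t1->operand[ a ].written || t2->operand[ b ].written ) )
                return true;
    return false;
}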

Slide 23: SuperMatrix
Analyzer diagram: the linear task queue (Chol, Trsm, Syrk, Gemm, Chol, ...) and the DAG of tasks built from it.

Slide 24: SuperMatrix
FLASH: store a matrix hierarchically as a matrix of matrices, where each element of the top-level matrix refers to a submatrix block.
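
A rough sketch of the idea (not the actual FLASH data structure; the field and type names are ours): the top-level matrix holds one descriptor per block, and each block owns its own contiguous buffer.

/* A hypothetical matrix-of-matrices: the outer mb x nb array holds
   one descriptor per block, and each block owns a contiguous buffer. */
typedef struct {
    int     m, n;      /* dimensions of this block        */
    double *buffer;    /* contiguous storage of the block */
} block_t;

typedef struct {
    int      mb, nb;   /* matrix dimensions in blocks     */
    block_t *blocks;   /* mb x nb array, column-major     */
} hier_matrix_t;

/* Descriptor of block (i,j) of the hierarchical matrix. */
static block_t *block_at( hier_matrix_t *A, int i, int j )
{
    return &A->blocks[ i + j * A->mb ];
}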

Slide 25: SuperMatrix

FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,    0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************ */   /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                            /* ************* */   /* **************** */
                              &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                              FLA_TL );
}

FLASH_Queue_exec( );

Slide 26: SuperMatrix
Dispatcher:
- Uses the DAG to execute tasks out-of-order in parallel.
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation: SuperScalar vs. SuperMatrix.
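
In outline, a sketch of one possible dispatcher loop (not the SuperMatrix implementation; node_t, dep_count, claimed, and the rest are our names, building on the task_t sketch from slide 22): each worker repeatedly claims a task whose dependences are all satisfied, executes it, then releases its dependents.

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-task bookkeeping: how many predecessors in the DAG
   are unfinished, and which tasks wait on this one. */
typedef struct node {
    task_t        task;
    atomic_int    dep_count;     /* unfinished predecessors   */
    struct node **dependents;
    int           n_dependents;
    atomic_bool   claimed;       /* taken by some thread yet? */
} node_t;

/* Body of one worker thread: scan for a ready, unclaimed task, run it,
   then decrement the dependence counts of its successors. */
static void worker( node_t *nodes, int n_tasks, atomic_int *n_done )
{
    while ( atomic_load( n_done ) < n_tasks ) {
        for ( int t = 0; t < n_tasks; t++ ) {
            _Bool expected = 0;
            if ( atomic_load( &nodes[ t ].dep_count ) == 0 &&
                 atomic_compare_exchange_strong( &nodes[ t ].claimed,
                                                 &expected, 1 ) ) {
                nodes[ t ].task.func( &nodes[ t ].task );   /* execute */
                for ( int k = 0; k < nodes[ t ].n_dependents; k++ )
                    atomic_fetch_sub( &nodes[ t ].dependents[ k ]->dep_count, 1 );
                atomic_fetch_add( n_done, 1 );
            }
        }
    }
}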

Slide 27: SuperMatrix
Dispatcher example: 4 threads, 5 x 5 matrix of blocks, 35 tasks, 14 stages.
(Diagram of the Cholesky task DAG: Chol, Trsm, Syrk, and Gemm tasks per iteration.)
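
As a check on the 35-task figure (our annotation, not from the slide): iteration $k$ of the blocked algorithm on an $m \times m$ matrix of blocks generates one Chol task, $m-k$ Trsm tasks, and $(m-k)(m-k+1)/2$ Syrk/Gemm tasks, so

\[
\sum_{k=1}^{m} \left[ 1 + (m-k) + \frac{(m-k)(m-k+1)}{2} \right]
  = \frac{m(m+1)(m+2)}{6},
\]

which for $m = 5$ gives $5 \cdot 6 \cdot 7 / 6 = 35$ tasks, as on the slide.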

Slide 28: SuperMatrix (graph slide; no transcript text)

Slide 29: SuperMatrix
Dispatcher: multiple tasks in the DAG write to block [2,2], yet nothing binds them to a particular thread; there is no data affinity.
(Diagram of the task DAG highlighting the tasks that overwrite block [2,2].)

Slide 30: SuperMatrix
Mappings, following the owner-computes rule:
- Tasks are denoted by the blocks of matrices they overwrite.
- Tasks are assigned to threads via data affinity.
- Threads are bound to processors via CPU affinity.

Slide 31: SuperMatrix
Data affinity: 2D block cyclic decomposition (as in ScaLAPACK); e.g., a 4 x 4 matrix of blocks assigned to a 2 x 2 mesh.
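
The owning thread of a block under a 2D block-cyclic assignment can be computed directly; a small illustration in our own notation (not ScaLAPACK's API), where the pr x pc mesh repeats cyclically over the block grid:

/* Owner of block (i,j) on a pr x pc mesh under a 2D block-cyclic
   decomposition. */
static int block_owner( int i, int j, int pr, int pc )
{
    return ( i % pr ) * pc + ( j % pc );
}

For the slide's example (pr = pc = 2), blocks (0,0), (0,2), (2,0), and (2,2) all map to thread 0, so every task overwriting one of those blocks runs on that thread.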

Slide 32: SuperMatrix (graph slide; no transcript text)

Slide 33: SuperMatrix
Contiguous storage: each block is stored contiguously, with one level of blocking. The user inherently does not need to know about the underlying storage of the data.
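
To make "one level of blocking" concrete, a sketch of addressing element (i,j) when b x b blocks are each stored contiguously in column-major order (a generic blocked layout under our own assumptions, not necessarily the exact FLASH layout):

/* Address of element (i,j) of an n x n matrix stored as a column-major
   grid of b x b blocks, each block itself contiguous and column-major.
   Assumes b divides n for simplicity. */
static double *elem_at( double *A, int n, int b, int i, int j )
{
    int nb = n / b;                                        /* blocks per dimension */
    double *block = A + ( ( i / b ) + ( j / b ) * nb ) * ( b * b );
    return block + ( i % b ) + ( j % b ) * b;
}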

Slide 34: SuperMatrix (graph slide; no transcript text)

Slide 35: SuperMatrix
GotoBLAS vs. MKL: all previous graphs link with GotoBLAS; MKL is better tuned for small matrices on Itanium2.

Slides 36-39: SuperMatrix (performance graph slides; no transcript text)

Slide 40: SuperMatrix
Results:
- LAPACK chose a bad variant.
- Data affinity and contiguous storage have a clear advantage.
- Multithreaded GotoBLAS is tuned for large matrices; MKL is better tuned for small matrices.

Slides 41-43: SuperMatrix (performance graph slides; no transcript text)

Slide 44: Outline
- Performance
- FLAME
- SuperMatrix
- Conclusion

Slide 45: Conclusion
Key points:
- View blocks of matrices, instead of scalars, as the units of computation.
- Apply instruction-level parallelism techniques to blocks.
- Abstract away from the low-level details of scheduling.

Slide 46: Authors
Ernie Chan (The University of Texas at Austin)
Enrique S. Quintana-Ortí (Universidad Jaume I)
Gregorio Quintana-Ortí (Universidad Jaume I)
Robert van de Geijn (The University of Texas at Austin)

Slide 47: Acknowledgements
We thank the other members of the FLAME team for their support, in particular Field Van Zee.
Funding: NSF grant CCF-0540926.

Slide 48: References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science - PARA 2006, B. Kågström and E. Elmroth, eds., Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25-33, 1967.

Slide 49: Conclusion
More information: http://www.cs.utexas.edu/users/flame
Questions? echan@cs.utexas.edu

