June 9-11, 2007, SPAA 2007. SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures. Ernie Chan, The University of Texas at Austin.

Presentation transcript:

SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures. Ernie Chan, The University of Texas at Austin.

Motivation: Motivating example. Cholesky factorization: A → L L^T.

Motivation: Better peak performance (96 GFLOPS).

Outline: Performance, FLAME, SuperMatrix, Conclusion.

Performance: Target architecture. 16-CPU Itanium2 NUMA system (8 dual-processor nodes); OpenMP via Intel Compiler 9.0; BLAS from GotoBLAS 1.06 and Intel MKL 8.1.

Performance: Implementations. Multithreaded BLAS with a sequential algorithm (LAPACK dpotrf, FLAME variant 3) versus serial BLAS with a parallel algorithm (data-flow).

Performance: Implementations. Column-major order storage; block sizes varied over { 64, 96, 128, 160, 192, 224, 256 }, selecting the best performance for each problem size.

Outline: Performance, FLAME, SuperMatrix, Conclusion.

FLAME: Formal Linear Algebra Methods Environment. High-level abstraction away from indices ("views" into matrices) and a seamless transition from algorithms to code.

FLAME: Cholesky factorization (scalar pseudocode):

    for ( j = 0; j < n; j++ ) {
      A[j,j] = sqrt( A[j,j] );
      for ( i = j+1; i < n; i++ )
        A[i,j] = A[i,j] / A[j,j];
      for ( k = j+1; k < n; k++ )
        for ( i = k; i < n; i++ )
          A[i,k] = A[i,k] - A[i,j] * A[k,j];
    }
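
For concreteness, a minimal runnable C sketch of the same loop nest, assuming the matrix is stored in column-major order with leading dimension lda; the function name and calling convention are illustrative, not the FLAME interface.

    #include <math.h>

    /* Unblocked Cholesky factorization (lower-triangular case).
       Overwrites the lower triangle of the n x n column-major matrix A
       (leading dimension lda) with L such that A = L * L^T. */
    static void chol_unblocked(int n, double *A, int lda)
    {
        for (int j = 0; j < n; j++) {
            A[j + j*lda] = sqrt(A[j + j*lda]);       /* a_jj := sqrt(a_jj) */
            for (int i = j + 1; i < n; i++)
                A[i + j*lda] /= A[j + j*lda];        /* scale column j     */
            for (int k = j + 1; k < n; k++)          /* trailing update    */
                for (int i = k; i < n; i++)
                    A[i + k*lda] -= A[i + j*lda] * A[k + j*lda];
        }
    }

The blocked LAPACK and FLAME variants on the following slides perform the same updates on b x b blocks instead of scalars.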

FLAME: LAPACK dpotrf uses a different variant (right-looking):

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

FLAME: Partitioning matrices.

FLAME: FLAME/C code for the blocked Cholesky factorization:

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                          /* ************* */    /* ******************** */
                                                 &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                             b, b, FLA_BR );
      /*--------------------------------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                      A10, A11, /**/ A12,
                             /* *************** */    /* *************** */
                                &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                                FLA_TL );
    }

Outline: Performance, FLAME, SuperMatrix (data-flow, 2D data affinity, contiguous storage), Conclusion.

SuperMatrix: Cholesky factorization, iteration 1. Tasks: Chol, Trsm, Syrk, Gemm, Trsm, Syrk.

SuperMatrix: Cholesky factorization, iteration 2. Tasks: Chol, Trsm, Syrk.

SuperMatrix: Cholesky factorization, iteration 3. Task: Chol.

SuperMatrix: Analyzer. Delay execution and place tasks on a queue; tasks are function pointers annotated with input/output information. Compute dependence information (flow, anti, output) between all tasks and create a DAG of tasks.
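
As an illustration of this idea, here is a minimal sketch in C; the types and names (Task, analyzer_enqueue, and so on) are hypothetical and not the SuperMatrix/libflame API. A task records which blocks it reads and writes, and enqueuing a task adds a DAG edge from every earlier task with which it has a flow, anti, or output dependence.

    #include <stdbool.h>

    #define MAX_TASKS 256
    #define MAX_ARGS    4

    typedef struct Task Task;
    struct Task {
        void (*func)(Task *);                 /* operation to run: Chol, Trsm, Syrk, Gemm */
        void *in[MAX_ARGS];    int n_in;      /* blocks read                              */
        void *out[MAX_ARGS];   int n_out;     /* blocks written                           */
        Task *succ[MAX_TASKS]; int n_succ;    /* outgoing DAG edges                       */
        int   n_pred;                         /* number of unsatisfied dependences        */
    };

    static Task queue[MAX_TASKS];
    static int  n_tasks = 0;

    /* Do two argument lists reference any block in common? */
    static bool overlaps(void **a, int na, void **b, int nb) {
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                if (a[i] == b[j]) return true;
        return false;
    }

    /* Delay execution: record the task and add a DAG edge from every earlier
       task with which it has a flow (RAW), anti (WAR), or output (WAW)
       dependence. */
    static Task *analyzer_enqueue(Task t) {
        Task *cur = &queue[n_tasks++];
        *cur = t;
        cur->n_succ = 0;
        cur->n_pred = 0;
        for (int i = 0; i < n_tasks - 1; i++) {
            Task *prev = &queue[i];
            bool flow = overlaps(prev->out, prev->n_out, cur->in,  cur->n_in);
            bool anti = overlaps(prev->in,  prev->n_in,  cur->out, cur->n_out);
            bool outp = overlaps(prev->out, prev->n_out, cur->out, cur->n_out);
            if (flow || anti || outp) {
                prev->succ[prev->n_succ++] = cur;
                cur->n_pred++;
            }
        }
        return cur;
    }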

SuperMatrix: Analyzer. Figure: the task queue (Chol, Trsm, Syrk, Gemm, Chol, ...) and the corresponding DAG of tasks.

SuperMatrix: FLASH, a matrix of matrices.

SuperMatrix: FLASH code for the blocked Cholesky factorization:

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                          /* ************* */    /* ******************** */
                                                 &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                             1, 1, FLA_BR );
      /*--------------------------------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                      A10, A11, /**/ A12,
                             /* *************** */    /* *************** */
                                &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                                FLA_TL );
    }

    FLASH_Queue_exec( );

SuperMatrix: Dispatcher. Use the DAG to execute tasks out-of-order in parallel, akin to Tomasulo's algorithm and instruction-level parallelism applied to blocks of computation: SuperScalar vs. SuperMatrix.
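
Continuing the hypothetical sketch from the analyzer above (again not the SuperMatrix implementation), a dispatcher can repeatedly run any task whose dependences have all been satisfied and then release its successors; a real dispatcher would do this from multiple worker threads with a synchronized ready queue.

    /* Execute every enqueued task in dependence order.  Serial skeleton of
       the dispatch loop; a parallel version would let each worker thread
       claim ready tasks from a shared, synchronized ready queue. */
    static void dispatcher_run(void) {
        int done = 0;
        while (done < n_tasks) {
            for (int i = 0; i < n_tasks; i++) {
                Task *t = &queue[i];
                if (t->func != NULL && t->n_pred == 0) {   /* all inputs available   */
                    t->func(t);                            /* run the block operation */
                    for (int s = 0; s < t->n_succ; s++)
                        t->succ[s]->n_pred--;              /* release dependents     */
                    t->func = NULL;                        /* mark as completed      */
                    done++;
                }
            }
        }
    }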

SuperMatrix: Dispatcher example. 4 threads, a 5 x 5 matrix of blocks, 35 tasks, 14 stages. Figure: the DAG of Chol, Trsm, Syrk, and Gemm tasks.
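
A quick check of the 35-task count (my arithmetic, assuming one Chol, p-k Trsm, p-k Syrk, and (p-k)(p-k-1)/2 Gemm tasks in iteration k of a p x p matrix of blocks, here p = 5):

    Chol:  1 + 1 + 1 + 1 + 1  =  5
    Trsm:  4 + 3 + 2 + 1 + 0  = 10
    Syrk:  4 + 3 + 2 + 1 + 0  = 10
    Gemm:  6 + 3 + 1 + 0 + 0  = 10
    Total                     = 35 tasks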

SuperMatrix: Dispatcher. Several tasks in the DAG write to block [2,2]; there is no data affinity, so any thread may execute them. Figure: the DAG of Chol, Trsm, Syrk, and Gemm tasks.

SuperMatrix: Mapping blocks of matrices to tasks, tasks to threads, and threads to processors. Tasks are denoted by the blocks they overwrite (the owner-computes rule); assigning tasks to threads gives data affinity; binding threads to processors gives CPU affinity.

SuperMatrix: Data affinity. 2D block cyclic decomposition (as in ScaLAPACK); for example, a 4 x 4 matrix of blocks assigned to a 2 x 2 mesh.
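
A small illustrative sketch of that assignment (my own code, not the SuperMatrix interface): under a 2D block cyclic decomposition on an r x c mesh of threads, block (i, j) is owned by a fixed thread, and the owner executes the tasks that overwrite that block.

    /* 2D block cyclic data affinity: block (i, j) of the matrix of blocks is
       owned by mesh position (i mod r, j mod c), i.e. by the single thread
       with id (i mod r) * c + (j mod c). */
    static int block_owner(int i, int j, int r, int c) {
        return (i % r) * c + (j % c);
    }

For the 4 x 4 example on a 2 x 2 mesh, blocks (0,0), (0,2), (2,0), and (2,2) all map to thread 0, blocks (0,1), (0,3), (2,1), and (2,3) to thread 1, and so on.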

SuperMatrix: Contiguous storage. One level of blocking; the user does not need to know about the underlying storage of the data.
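
A minimal sketch of what "one level of blocking" can look like (hypothetical types, not the FLASH API): the top level is a grid of pointers to blocks, and each block is stored contiguously.

    #include <stdlib.h>

    /* One level of blocking: an mb x nb grid of pointers, each to a b x b
       block stored contiguously in column-major order. */
    typedef struct {
        int      mb, nb, b;   /* blocks per dimension, block size     */
        double **blocks;      /* mb * nb block pointers, row-major grid */
    } BlockedMatrix;

    static BlockedMatrix *blocked_alloc(int mb, int nb, int b) {
        BlockedMatrix *M = malloc(sizeof *M);
        M->mb = mb;  M->nb = nb;  M->b = b;
        M->blocks = malloc((size_t)mb * nb * sizeof *M->blocks);
        for (int i = 0; i < mb * nb; i++)
            M->blocks[i] = calloc((size_t)b * b, sizeof(double));
        return M;
    }

    /* Logical element (i, j): locate its block, then the offset within it. */
    static double *blocked_elem(BlockedMatrix *M, int i, int j) {
        double *blk = M->blocks[(i / M->b) * M->nb + (j / M->b)];
        return &blk[(i % M->b) + (j % M->b) * M->b];
    }

Because each block is a contiguous unit, tasks such as Chol, Trsm, Syrk, and Gemm operate on whole blocks, and the scheduler can track dependences at the block level.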

SuperMatrix: GotoBLAS vs. MKL. All previous graphs link with GotoBLAS; MKL is better tuned for small matrices on Itanium2.

SuperMatrix: Results. LAPACK chose a bad variant; data affinity and contiguous storage show a clear advantage; multithreaded GotoBLAS is tuned for large matrices, while MKL is better tuned for small matrices.

Outline: Performance, FLAME, SuperMatrix, Conclusion.

Conclusion: Key points. View blocks of matrices as the units of computation instead of scalars; apply instruction-level parallelism to blocks; provide abstractions away from the low-level details of scheduling.

Authors: Ernie Chan (The University of Texas at Austin), Enrique S. Quintana-Ortí (Universidad Jaume I), Gregorio Quintana-Ortí (Universidad Jaume I), and Robert van de Geijn (The University of Texas at Austin).

Acknowledgements: We thank the other members of the FLAME team for their support, in particular Field Van Zee. Funding: NSF grant CCF.

References:
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2), 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4), 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms on distributed memory using packed storage. In Computational Science – Para, B. Kågström and E. Elmroth, eds., accepted for Lecture Notes in Computer Science. Springer-Verlag.
[6] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

Conclusion: More information. Questions?