Generating Families of Practical Fast Matrix Multiplication Algorithms

Presentation transcript:

Generating Families of Practical Fast Matrix Multiplication Algorithms
Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn
The University of Texas at Austin
IPDPS 2017, May 31st, Orlando, FL

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

High-performance matrix multiplication (GEMM)

State-of-the-art GEMM in BLIS
BLAS-like Library Instantiation Software (BLIS) is a portable framework for instantiating BLAS-like dense linear algebra libraries.
BLIS provides a refactoring of the GotoBLAS algorithm (the best-known approach on CPUs) to implement GEMM.
The GEMM implementation in BLIS has six layers of loops. The outer five loops are written in C; the inner-most loop (the micro-kernel) is written in assembly for high performance.
Field G. Van Zee, and Robert van de Geijn. “BLIS: A Framework for Rapidly Instantiating BLAS Functionality.” ACM TOMS 41.3 (2015): 14.
Kazushige Goto, and Robert van de Geijn. “High-performance implementation of the level-3 BLAS.” ACM TOMS 35.1 (2008): 4.
Kazushige Goto, and Robert van de Geijn. “Anatomy of high-performance matrix multiplication.” ACM TOMS 34.3 (2008): 12.

GotoBLAS algorithm for GEMM in BLIS
[Figure: the loops around the micro-kernel. A kC x nC panel of B is packed into the L3 cache, an mC x kC block of A is packed into the L2 cache, and the micro-kernel multiplies an mR x kC sliver of A by a kC x nR sliver of B to update an mR x nR micro-tile of C held in registers.]
*Field G. Van Zee, and Tyler M. Smith. “Implementing high-performance complex matrix multiplication via the 3m and 4m methods.” ACM Transactions on Mathematical Software (TOMS), accepted.
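To make that loop structure concrete, here is a minimal Python/NumPy sketch of the five loops around the micro-kernel. The blocking parameters are chosen arbitrarily for illustration; the real BLIS implementation packs operands into special block/panel storage formats and calls an assembly micro-kernel, which this sketch does not reproduce.

```python
import numpy as np

# Illustrative blocking parameters (real values are tuned per architecture).
NC, KC, MC, NR, MR = 8, 8, 8, 4, 4

def blocked_gemm(A, B, C):
    """C += A @ B using the GotoBLAS/BLIS loop ordering (illustrative only)."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, NC):                      # 5th loop: nC columns of B and C
        for pc in range(0, k, KC):                  # 4th loop: kC slice along k
            Bp = B[pc:pc+KC, jc:jc+NC]              # "packed" kC x nC panel of B (L3 cache)
            for ic in range(0, m, MC):              # 3rd loop: mC rows of A and C
                Ap = A[ic:ic+MC, pc:pc+KC]          # "packed" mC x kC block of A (L2 cache)
                for jr in range(0, Bp.shape[1], NR):        # 2nd loop: nR columns
                    for ir in range(0, Ap.shape[0], MR):    # 1st loop: mR rows
                        # Micro-kernel: update an mR x nR tile of C held in registers.
                        C[ic+ir:ic+ir+MR, jc+jr:jc+jr+NR] += (
                            Ap[ir:ir+MR, :] @ Bp[:, jr:jr+NR]
                        )

A = np.random.rand(32, 32); B = np.random.rand(32, 32); C = np.zeros((32, 32))
blocked_gemm(A, B, C)
assert np.allclose(C, A @ B)
```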

High-performance Strassen *Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16.  

Strassen’s Algorithm Reloaded
M0 := (A00+A11)(B00+B11);
M1 := (A10+A11)B00;
M2 := A00(B01–B11);
M3 := A11(B10–B00);
M4 := (A00+A01)B11;
M5 := (A10–A00)(B00+B01);
M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6
C01 += M2 + M4
C10 += M1 + M3
C11 += M0 – M1 + M2 + M5

Reordered so that each Mi immediately updates C:
M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := (A10+A11)B00; C10 += M1; C11 –= M1;
M2 := A00(B01–B11); C01 += M2; C11 += M2;
M3 := A11(B10–B00); C00 += M3; C10 += M3;
M4 := (A00+A01)B11; C01 += M4; C00 –= M4;
M5 := (A10–A00)(B00+B01); C11 += M5;
M6 := (A01–A11)(B10+B11); C00 += M6;

General operation for one-level Strassen:
M := (X + δY)(V + εW); C += γ0 M; D += γ1 M;   with γ0, γ1, δ, ε ∈ {-1, 0, 1}.
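As a sketch of the reordered form above, the following NumPy code (illustrative only, not the BLIS-based implementation) performs one level of Strassen on square matrices with even dimensions by repeatedly applying the general operation M := (X + δY)(V + εW); C += γ0 M; D += γ1 M, and checks the result against a plain multiplication.

```python
import numpy as np

def one_level_strassen(A, B, C):
    """C += A @ B via one level of Strassen (square, even-sized matrices only)."""
    n = A.shape[0] // 2
    A00, A01, A10, A11 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B00, B01, B10, B11 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    C00, C01, C10, C11 = C[:n, :n], C[:n, n:], C[n:, :n], C[n:, n:]

    def op(X, dY, V, eW, updates):
        # General operation: M := (X + dY)(V + eW); then Csub += g*M for each (g, Csub).
        M = (X + dY) @ (V + eW)
        for g, Csub in updates:
            Csub += g * M            # Csub is a view into C, so C is updated in place

    op(A00,  A11, B00,  B11, [(+1, C00), (+1, C11)])   # M0
    op(A10,  A11, B00,    0, [(+1, C10), (-1, C11)])   # M1
    op(A00,    0, B01, -B11, [(+1, C01), (+1, C11)])   # M2
    op(A11,    0, B10, -B00, [(+1, C00), (+1, C10)])   # M3
    op(A00,  A01, B11,    0, [(+1, C01), (-1, C00)])   # M4
    op(A10, -A00, B00,  B01, [(+1, C11)])               # M5
    op(A01, -A11, B10,  B11, [(+1, C00)])               # M6

A = np.random.rand(64, 64); B = np.random.rand(64, 64); C = np.zeros((64, 64))
one_level_strassen(A, B, C)
assert np.allclose(C, A @ B)
```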

M := (X + δY)(V + εW); C += γ0 M; D += γ1 M;   γ0, γ1, δ, ε ∈ {-1, 0, 1}.
Special case: M := (X+Y)(V+W); C += M; D += M;

M := (X + δY)(V + εW); C += γ0 M; D += γ1 M;   γ0, γ1, δ, ε ∈ {-1, 0, 1}.
Compare: C += AB;
Special case: M := (X+Y)(V+W); C += M; D += M;

[Figure: the general operation M := (X + δY)(V + εW); C += γ0 M; D += γ1 M (γ0, γ1, δ, ε ∈ {-1, 0, 1}), and the special case M := (X+Y)(V+W); C += M; D += M, mapped onto the GotoBLAS memory hierarchy used for C += AB: a kC x nC panel in the L3 cache, an mC x kC block in the L2 cache, and an mR x nR micro-tile in registers.]

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

<2,2,2> Strassen’s Algorithm
M0 := (A0+A3)(B0+B3); C0 += M0; C3 += M0;
M1 := (A2+A3)B0; C2 += M1; C3 –= M1;
M2 := A0(B1–B3); C1 += M2; C3 += M2;
M3 := A3(B2–B0); C0 += M3; C2 += M3;
M4 := (A0+A1)B3; C1 += M4; C0 –= M4;
M5 := (A2–A0)(B0+B1); C3 += M5;
M6 := (A1–A3)(B2+B3); C0 += M6;

Fast Matrix Multiplication (FMM)

<3,2,3> Fast Matrix Multiplication (FMM)
M0 := ( –A1+A3+A5 )( B4+B5 ); C2 –= M0;
M1 := ( –A0+A2+A5 )( B0–B5 ); C2 –= M1; C7 –= M1;
M2 := ( –A0+A2 )( –B1–B4 ); C0 –= M2; C1 += M2; C7 += M2;
M3 := ( A2–A4+A5 )( B2–B3+B5 ); C2 += M3; C5 += M3; C6 –= M3;
M4 := ( –A0+A1+A2 )( B0+B1+B4 ); C0 –= M4; C1 += M4; C4 += M4;
M5 := ( A2 )( B0 ); C0 += M5; C1 –= M5; C3 += M5; C4 –= M5; C6 += M5;
M6 := ( –A2+A3+A4–A5 )( B3–B5 ); C2 –= M6; C5 –= M6;
M7 := ( A3 )( B3 ); C2 += M7; C3 += M7; C5 += M7;
M8 := ( –A0+A2+A4 )( B1+B2 ); C7 += M8;
M9 := ( –A0+A1 )( –B0–B1 ); C1 += M9; C4 += M9;
M10 := ( A5 )( B2+B5 ); C2 += M10; C6 += M10; C8 += M10;
M11 := ( A4–A5 )( B2 ); C2 += M11; C5 += M11; C7 –= M11; C8 += M11;
M12 := ( A1 )( B0+B1+B3+B4 ); C0 += M12;
M13 := ( A2–A4 )( B0–B2+B3–B5 ); C6 –= M13;
M14 := ( –A0+A1+A2–A3 )( B5 ); C2 –= M14; C4 –= M14;
*Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.
The same algorithm written with (row, column) submatrix indices:
M0 = ( -A(0,1)+A(1,1)+A(2,1) )( B(1,1)+B(1,2) ); C(0,2) -= M0;
M1 = ( -A(0,0)+A(1,0)+A(2,1) )( B(0,2)-B(1,1) ); C(0,2) -= M1; C(2,1) -= M1;
M2 = ( -A(0,0)+A(1,0) )( -B(0,1)-B(1,1) ); C(0,0) -= M2; C(0,1) += M2; C(2,1) += M2;
M3 = ( A(1,0)-A(2,0)+A(2,1) )( B(0,2)-B(1,0)+B(1,2) ); C(0,2) += M3; C(1,2) += M3; C(2,0) -= M3;
M4 = ( -A(0,0)+A(0,1)+A(1,0) )( B(0,0)+B(0,1)+B(1,1) ); C(0,0) -= M4; C(0,1) += M4; C(1,1) += M4;
M5 = ( A(1,0) )( B(0,0) ); C(0,0) += M5; C(0,1) -= M5; C(1,0) += M5; C(1,1) -= M5; C(2,0) += M5;
M6 = ( -A(1,0)+A(1,1)+A(2,0)-A(2,1) )( B(1,0)-B(1,2) ); C(0,2) -= M6; C(1,2) -= M6;
M7 = ( A(1,1) )( B(1,0) ); C(0,2) += M7; C(1,0) += M7; C(1,2) += M7;
M8 = ( -A(0,0)+A(1,0)+A(2,0) )( B(0,1)+B(0,2) ); C(2,1) += M8;
M9 = ( -A(0,0)+A(0,1) )( -B(0,0)-B(0,1) ); C(0,1) += M9; C(1,1) += M9;
M10 = ( A(2,1) )( B(0,2)+B(1,2) ); C(0,2) += M10; C(2,0) += M10; C(2,2) += M10;
M11 = ( A(2,0)-A(2,1) )( B(0,2) ); C(0,2) += M11; C(1,2) += M11; C(2,1) -= M11; C(2,2) += M11;
M12 = ( A(0,1) )( B(0,0)+B(0,1)+B(1,0)+B(1,1) ); C(0,0) += M12;
M13 = ( A(1,0)-A(2,0) )( B(0,0)-B(0,2)+B(1,0)-B(1,2) ); C(2,0) -= M13;
M14 = ( -A(0,0)+A(0,1)+A(1,0)-A(1,1) )( B(1,1) ); C(0,2) -= M14; C(1,1) -= M14;
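An FMM algorithm of this kind can be encoded by three coefficient matrices, often written U, V, W in Benson and Ballard's framework: column r of U and V gives the linear combinations of submatrices of A and B that form Mr, and column r of W says how Mr is scattered into the submatrices of C. The sketch below is illustrative Python, not the generated BLIS code: it applies a one-level <mt,kt,nt> algorithm given such coefficients and, as a check, feeds it the classical <2,2,2> Strassen coefficients.

```python
import numpy as np

def one_level_fmm(A, B, C, U, V, W, mt, kt, nt):
    """C += A @ B with a one-level <mt,kt,nt> FMM given coefficient matrices U, V, W."""
    mb, kb, nb = A.shape[0] // mt, A.shape[1] // kt, B.shape[1] // nt
    # Row-major lists of submatrix views (one level, so no Morton ordering needed yet).
    As = [A[i*mb:(i+1)*mb, j*kb:(j+1)*kb] for i in range(mt) for j in range(kt)]
    Bs = [B[i*kb:(i+1)*kb, j*nb:(j+1)*nb] for i in range(kt) for j in range(nt)]
    Cs = [C[i*mb:(i+1)*mb, j*nb:(j+1)*nb] for i in range(mt) for j in range(nt)]
    for r in range(U.shape[1]):                       # one column per multiplication Mr
        X = sum(U[i, r] * As[i] for i in range(len(As)) if U[i, r] != 0)
        Y = sum(V[i, r] * Bs[i] for i in range(len(Bs)) if V[i, r] != 0)
        M = X @ Y
        for i in range(len(Cs)):
            if W[i, r] != 0:
                Cs[i] += W[i, r] * M                  # views, so C is updated in place

# <2,2,2> Strassen coefficients (rows: A00,A01,A10,A11 etc.; columns: M0..M6).
U = np.array([[1,0,1,0,1,-1,0],[0,0,0,0,1,0,1],[0,1,0,0,0,1,0],[1,1,0,1,0,0,-1]])
V = np.array([[1,1,0,-1,0,1,0],[0,0,1,0,0,1,0],[0,0,0,1,0,0,1],[1,0,-1,0,1,0,1]])
W = np.array([[1,0,0,1,-1,0,1],[0,0,1,0,1,0,0],[0,1,0,1,0,0,0],[1,-1,1,0,0,1,0]])

A = np.random.rand(60, 40); B = np.random.rand(40, 80); C = np.zeros((60, 80))
one_level_fmm(A, B, C, U, V, W, 2, 2, 2)
assert np.allclose(C, A @ B)
```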

*[1] Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

[Figure (repeated): the general operation and the special case M := (X+Y)(V+W); C += M; D += M mapped onto the GotoBLAS memory hierarchy (kC x nC panel in the L3 cache, mC x kC block in the L2 cache, mR x nR micro-tile in registers), compared with C += AB.]

*[1] Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

*[1] Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

One-level Fast Matrix Multiplication (FMM)

Two-level Fast Matrix Multiplication (FMM)

Kronecker Product
If A is m x n and B is p x q, then A ⊗ B is (mp) x (nq).
https://en.wikipedia.org/wiki/Kronecker_product
https://en.wikipedia.org/wiki/Tensor_product
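A quick NumPy check of the definition and the dimensions (np.kron computes the Kronecker product):

```python
import numpy as np

A = np.arange(1, 7).reshape(2, 3)       # m x n = 2 x 3
B = np.arange(1, 13).reshape(3, 4)      # p x q = 3 x 4
K = np.kron(A, B)                       # (mp) x (nq) = 6 x 12
assert K.shape == (2 * 3, 3 * 4)
assert np.allclose(K[:3, :4], A[0, 0] * B)   # block (i, j) of A ⊗ B equals A[i, j] * B
```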

Morton-like Ordering
* https://en.wikipedia.org/wiki/Z-order_curve
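A minimal sketch of how a Morton-like index can be computed from per-level block coordinates. The assumption that every level uses the same mt x nt grid of blocks is for illustration only; it is not code from the paper.

```python
def morton_index(row_digits, col_digits, mt, nt):
    """Morton-like index from per-level (row, col) digits, most significant level first."""
    idx = 0
    for r, c in zip(row_digits, col_digits):
        idx = idx * (mt * nt) + r * nt + c   # interleave one level's row/col pair
    return idx

# Two-level 2x2 example: the block at level-1 position (1,0) inside the level-0
# block (0,1) gets index ((0*2 + 1) * 4) + (1*2 + 0) = 6.
assert morton_index([0, 1], [1, 0], 2, 2) == 6
```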

Two-level Fast Matrix Multiplication

Two-level Fast Matrix Multiplication
2-level Morton-like ordering index

Two-level <2,2,2> Strassen Algorithm

L-level Fast Matrix Multiplication
L-level Morton-like ordering index
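Putting the two ideas together: the coefficient matrices of an L-level algorithm are Kronecker powers of the one-level coefficients, applied to submatrices enumerated in the Morton-like (recursive) order. The sketch below is illustrative NumPy, not the generated code; it reuses the <2,2,2> Strassen coefficients listed in the earlier sketch and verifies a two-level application against a plain multiplication.

```python
import numpy as np
from functools import reduce

# One-level <2,2,2> Strassen coefficients (rows: quadrants in row-major order; columns: M0..M6).
U = np.array([[1,0,1,0,1,-1,0],[0,0,0,0,1,0,1],[0,1,0,0,0,1,0],[1,1,0,1,0,0,-1]])
V = np.array([[1,1,0,-1,0,1,0],[0,0,1,0,0,1,0],[0,0,0,1,0,0,1],[1,0,-1,0,1,0,1]])
W = np.array([[1,0,0,1,-1,0,1],[0,0,1,0,1,0,0],[0,1,0,1,0,0,0],[1,-1,1,0,0,1,0]])

def morton_blocks(M, levels):
    """Split M recursively into 2x2 quadrants, returning views in Morton-like order."""
    if levels == 0:
        return [M]
    r, c = M.shape[0] // 2, M.shape[1] // 2
    quads = [M[:r, :c], M[:r, c:], M[r:, :c], M[r:, c:]]
    return [blk for q in quads for blk in morton_blocks(q, levels - 1)]

def fmm(A, B, C, levels):
    """C += A @ B using the Kronecker-power coefficients of <2,2,2> Strassen."""
    UL, VL, WL = (reduce(np.kron, [X] * levels) for X in (U, V, W))
    As, Bs, Cs = (morton_blocks(M, levels) for M in (A, B, C))
    for r in range(UL.shape[1]):
        X = sum(UL[i, r] * As[i] for i in range(len(As)) if UL[i, r])
        Y = sum(VL[i, r] * Bs[i] for i in range(len(Bs)) if VL[i, r])
        M = X @ Y
        for i in range(len(Cs)):
            if WL[i, r]:
                Cs[i] += WL[i, r] * M        # views into C, updated in place

A = np.random.rand(64, 64); B = np.random.rand(64, 64); C = np.zeros((64, 64))
fmm(A, B, C, levels=2)                        # 49 multiplications on a 4x4 grid of blocks
assert np.allclose(C, A @ B)
```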

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

Generating the skeleton framework
Compute the Kronecker product of the coefficient matrices.
Generate the matrix partitions via a conceptual Morton-like ordering index.
Fringes: handled by dynamic peeling.
*Mithuna Thottethodi, Siddhartha Chatterjee, and Alvin R. Lebeck. "Tuning Strassen's matrix multiplication for memory efficiency." Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 1998.

Generating the typical operations
Packing routines.
Micro-kernel: a hand-coded GEMM kernel, plus automatically generated updates to multiple submatrices of C.
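As an illustration of folding the submatrix additions into packing, the sketch below forms X + δY into a contiguous buffer as it copies a block. This is a simplification under stated assumptions: the generated code packs into the BLIS block/panel storage formats, which plain NumPy does not reproduce.

```python
import numpy as np

def pack_with_sum(X, Y, delta, mc, kc, ic, pc):
    """Pack the mc x kc block at (ic, pc) of X + delta*Y into a contiguous buffer.

    The addition needs no separate pass over memory: it happens while the block
    is being copied into the packing buffer (here just a contiguous ndarray).
    """
    blk = X[ic:ic+mc, pc:pc+kc] + delta * Y[ic:ic+mc, pc:pc+kc]
    return np.ascontiguousarray(blk)

X = np.random.rand(128, 128); Y = np.random.rand(128, 128)
Ap = pack_with_sum(X, Y, -1, mc=64, kc=32, ic=0, pc=32)   # packs a block of X - Y
assert np.allclose(Ap, (X - Y)[0:64, 32:64])
```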

Variations on a theme
Naïve FMM: a traditional implementation with temporary buffers.
AB FMM: integrate the additions of submatrices of A and B into the packing routines.
ABC FMM: integrate the updates of multiple submatrices of C into the micro-kernel.

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

Performance Model
Performance metric
Total time breakdown: arithmetic operations, memory operations
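A minimal sketch of this style of model (the constants and the memory-operation count below are placeholders, not the paper's calibrated values): total time is estimated as T = Ta + Tm, where Ta charges arithmetic at the peak floating-point rate and Tm charges memory operations at the achievable bandwidth, and effective performance is reported as 2mnk / T.

```python
def modeled_effective_gflops(m, n, k, mt, kt, nt, n_mults, tau_a, tau_b, mem_ops):
    """Estimate effective GFLOPS of a one-level <mt,kt,nt> FMM (placeholder model).

    tau_a   - time per floating-point operation (1 / peak flop rate)
    tau_b   - time per memory operation (1 / effective bandwidth, in elements)
    mem_ops - total memory operations (packing, micro-kernel loads/stores, updates to C)
    """
    flops_per_mult = 2.0 * (m / mt) * (n / nt) * (k / kt)  # classical flops of one submatrix multiply
    T_a = n_mults * flops_per_mult * tau_a                 # arithmetic time (submatrix additions omitted here)
    T_m = mem_ops * tau_b                                  # memory time
    return 2.0 * m * n * k / (T_a + T_m) / 1e9             # classical flop count over modeled time

# Example with made-up machine constants and memory-operation count:
print(modeled_effective_gflops(8000, 8000, 8000, 2, 2, 2, n_mults=7,
                               tau_a=1/3.0e11, tau_b=1/1.0e10, mem_ops=4e9))
```

The paper's actual model also accounts for the extra submatrix additions and for the different memory traffic of the Naïve, AB, and ABC variants, which is what makes it useful for choosing among them.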

Actual vs. Modeled Performance
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)
[Figure: actual vs. modeled performance curves.]

Actual vs. Modeled Performance
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)
[Figure: actual vs. modeled performance curves.]

[Figure: additional actual vs. modeled performance comparisons.]

[Figure: additional actual vs. modeled performance comparisons.]

Selecting the FMM implementation with the performance model
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)
Best FMM: the best implementation among 20+ FMM algorithms with ABC, AB, and Naïve implementations.
Selected FMM: the FMM implementation selected with the guidance of the performance model.

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

Hybrid Partitions
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)

Parallel Performance
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)
* Reference: Austin R. Benson, and Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

Parallel Performance
Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket)
* Reference: Austin R. Benson, and Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.

Outline
Background
  High-performance GEMM
  High-performance Strassen
Fast Matrix Multiplication (FMM) Code Generation
Performance Model
Experiments
Conclusion

Our code generator …
Implements families of FMM algorithms automatically.
Expresses multi-level FMM as Kronecker products.
Incorporates the matrix summations into the operations inside GEMM.
Generates a relatively accurate performance model.

Acknowledgements
- NSF grant ACI-1550493.
- A gift from the Qualcomm Foundation.
- Intel Corporation, through an Intel Parallel Computing Center (IPCC).
- Access to the Maverick and Stampede supercomputers administered by TACC.
We thank Austin Benson and Grey Ballard for sharing their open-source code and for helpful discussions, the anonymous reviewers for their constructive comments, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their support.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Thank you! The source code can be downloaded from: https://github.com/flame/fmm-gen

Related work
[1]: A systematic way to identify and implement new FMM algorithms based on conventional calls to GEMM.
[2]: Strassen can be incorporated into a high-performance GEMM.
[3]: A Kronecker-product formulation for expressing multi-level Strassen’s algorithm.
*[1] Austin R. Benson, Grey Ballard. "A framework for practical parallel fast matrix multiplication." PPoPP'15.
*[2] Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16.
*[3] C.-H. Huang, Jeremy R. Johnson, and Robert W. Johnson. "A tensor product formulation of Strassen's matrix multiplication algorithm." Applied Mathematics Letters 3.3 (1990): 67-71.