Adaptive Strassen and ATLAS’s DGEMM

Adaptive Strassen and ATLAS's DGEMM
Paolo D'Alberto (CMU) and Alexandru Nicolau (UCI)
HPC Asia, 12/2/2005

The Problem: Matrix Computations
- The evolution of many systems is modeled by matrix computations.
- Predicting and evaluating such models of complex systems is fundamental in scientific computing: for example, solving systems of linear equations or least-squares problems.

The Problem: BLAS
- The Basic Linear Algebra Subprograms (BLAS) is an interface describing a set of basic matrix and vector computations.
- Historically, the BLAS was a set of algorithms.
- Libraries implementing the BLAS are the backbone of today's high-performance computing: for example, ESSL, PHiPAC, and ATLAS, on top of which packages such as ScaLAPACK are built.

The Problem: ATLAS
- Implementations of BLAS-3 are based on matrix multiplication (MM).
- In practice, ATLAS automatically generates a custom-tailored MM:
  - it probes the system;
  - it tailors an MM kernel to that specific system;
  - it uses the MM as the basic routine for the other BLAS-3 routines.

Matrix Multiplication (basics)
[Figure: C = A * B with C, A, and B partitioned into 2x2 blocks, computed quadrant by quadrant.]
C0 = A0 B0 + A1 B2
C1 = A0 B1 + A1 B3
C2 = A2 B0 + A3 B2
C3 = A2 B1 + A3 B3
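
The quadrant formulas can be checked directly. Below is a minimal numpy sketch (my own illustration, not code from the talk) that partitions two random even-size matrices into quadrants and verifies the four block products against a full multiply:

```python
import numpy as np

n = 8                                  # any even size
A, B = np.random.rand(n, n), np.random.rand(n, n)
h = n // 2

# Quadrants, numbered as on the slide: 0 = top-left, 1 = top-right,
# 2 = bottom-left, 3 = bottom-right.
A0, A1, A2, A3 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
B0, B1, B2, B3 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

C = np.empty((n, n))
C[:h, :h] = A0 @ B0 + A1 @ B2          # C0
C[:h, h:] = A0 @ B1 + A1 @ B3          # C1
C[h:, :h] = A2 @ B0 + A3 @ B2          # C2
C[h:, h:] = A2 @ B1 + A3 @ B3          # C3

assert np.allclose(C, A @ B)
```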

The Problem: MM
- ATLAS uses this classic matrix multiply.
- For square matrices of size n x n, the algorithm takes O(n^3) operations.
- It achieves 80-90% of peak performance.
- Strassen's algorithm is attractive for large problems because it reduces the number of computations (thus shortening the execution time).
- We investigate its effects on single-processor systems.
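
To make the "fewer computations" claim concrete, here is a small sketch (assuming power-of-two sizes and counting scalar multiplications only, ignoring the extra additions) of the two recurrences: the classic blocked algorithm does 8 half-size products per level, Strassen's does 7:

```python
def classic_muls(n):
    # Classic blocked MM: 8 half-size multiplications per level -> n^3 total.
    return n ** 3 if n <= 1 else 8 * classic_muls(n // 2)

def strassen_muls(n, cutoff=1):
    # Strassen: 7 half-size multiplications per level, classic below the cutoff.
    return n ** 3 if n <= cutoff else 7 * strassen_muls(n // 2, cutoff)

for n in (64, 512, 4096):
    print(n, classic_muls(n), strassen_muls(n))
```

In practice the crossover depends on the cost of the extra matrix additions, which is why a recursion point is tuned per system later in the talk.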

The Problem: Strassen's
- Strassen's algorithm, for power-of-two-size matrices, takes O(n^{log2 7}) ≈ O(n^2.81) operations.
- For even-size matrices, one recursive step is always applicable.
- Otherwise:
  - dynamic and static padding;
  - peeling for odd-size matrices [Hauss 97 & Luo 2004].
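
For reference, one recursive step of the textbook seven-product form on an even-size matrix, as a hedged numpy sketch (the standard Strassen formulas, not the authors' implementation):

```python
import numpy as np

def strassen_step(A, B):
    """One level of Strassen's recursion for even-size square A and B."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven half-size products instead of eight.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(strassen_step(A, B), A @ B)
```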

Odd-Size Square Matrices
[Figure: a (2n+1) x (2n+1) matrix partitioned so that A0 and B0 are 2n x 2n blocks.]
- A0 * B0 is an even-size problem, so Strassen can be applied once more.
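
A sketch of the peeling idea for an odd size m = 2n + 1 (my own numpy illustration of the technique, not the cited papers' code): strip the last row and column, hand the even-size 2n x 2n core to Strassen, and fix the border with cheap rank-1 and matrix-vector updates.

```python
import numpy as np

def peeled_multiply(A, B, even_mm):
    """C = A @ B for odd-size square A, B; even_mm multiplies the even-size core."""
    m = A.shape[0]
    k = m - 1                                   # even core size 2n
    A0, a01, a10, a11 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B0, b01, b10, b11 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]

    C = np.empty_like(A)
    C[:k, :k] = even_mm(A0, B0) + a01 @ b10     # Strassen-able block + rank-1 update
    C[:k, k:] = A0 @ b01 + a01 @ b11
    C[k:, :k] = a10 @ B0 + a11 @ b10
    C[k:, k:] = a10 @ b01 + a11 @ b11
    return C

A, B = np.random.rand(7, 7), np.random.rand(7, 7)
assert np.allclose(peeled_multiply(A, B, lambda X, Y: X @ Y), A @ B)
```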

Our Approach: Balanced Division
- For any matrix size, we apply a balanced Strassen division process.
- This reduces the number of computations further than odd/even splitting (or padding).
- Balanced division = balanced workload, and thus predictable performance.
- Balanced-size operands give better data-cache utilization.

Balanced Division
- Near-square matrices: m = n + p with |n - p| minimized.
[Figure: an m x m matrix split into quadrants A0/B0 (n x n), A1/B1 (n x p), A2/B2 (p x n), A3/B3 (p x p).]
- The quadrants are near-square matrices; at any step of the recursion, all sub-matrices are near square.
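
My reading of the split, as a tiny sketch: a dimension m is cut into n = ceil(m/2) and p = floor(m/2), so |n - p| <= 1 and every quadrant of an m x m matrix is near square.

```python
import math

def balanced_split(m):
    """Split dimension m into n + p with |n - p| minimized."""
    n = math.ceil(m / 2)
    return n, m - n

n, p = balanced_split(13)
# Quadrant shapes for a 13 x 13 matrix: (n, n), (n, p), (p, n), (p, p).
print(n, p)        # 7 6
```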

Balanced Matrices (new matrix addition and multiplication)
- Balanced division within Strassen's recursion needs a new matrix-addition (MA) definition, because it must add matrices of slightly different sizes.
- We generalize the operations such that:
  - the algorithm is correct;
  - the extra control for the irregular sizes is completely negligible and affects only the matrix additions.
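
One plausible way to realize the generalized addition the slide describes (an assumption on my part, not the paper's code) is implicit zero padding: when two operands differ by at most one row or column, the smaller one is treated as if padded with zeros and the result takes the larger shape.

```python
import numpy as np

def general_add(X, Y, sign=1):
    """X + sign * Y for matrices that may differ by one row/column."""
    rows, cols = max(X.shape[0], Y.shape[0]), max(X.shape[1], Y.shape[1])
    Z = np.zeros((rows, cols))
    Z[:X.shape[0], :X.shape[1]] = X             # copy X into the top-left corner
    Z[:Y.shape[0], :Y.shape[1]] += sign * Y     # add (or subtract) Y, zero-padded
    return Z

print(general_add(np.ones((7, 7)), np.ones((6, 6))).shape)   # (7, 7)
```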

Experimental Results
- We considered 14 systems.
- We hand-coded the MA for each specific system.
- We measured the performance of ATLAS's MM and of the MA.
- We determined an adaptive recursion-point size for each system and encoded it in the algorithm.
- We measured the relative performance of Strassen vs. ATLAS.
- Details for three systems follow.
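
The recursion point can be found empirically. A rough sketch of the idea (illustrative only: it times numpy's GEMM rather than ATLAS's MM, and the size range is an arbitrary choice): time one Strassen step against the plain multiply over increasing sizes and take the smallest size where the Strassen step wins.

```python
import time
import numpy as np

def bench(fn, *args, reps=3):
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def find_recursion_point(sizes, strassen_step):
    """Smallest size (from an increasing list of even sizes) where one Strassen step wins."""
    for n in sizes:
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        if bench(strassen_step, A, B) < bench(np.matmul, A, B):
            return n
    return None

# e.g. find_recursion_point(range(256, 4097, 256), strassen_step), reusing the
# strassen_step sketch shown earlier.
```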

Opteron
[Plot: performance of Strassen + ATLAS vs. ATLAS's MM (the higher the better).]

8600 PA-RISC
[Plot: performance of Strassen + ATLAS vs. ATLAS's MM.]

ALPHA
[Plot: performance of Strassen + ATLAS vs. ATLAS's MM.]

Conclusions
- Our approach uses balanced division, as Strassen's does.
- We performed exhaustive performance testing.
- Some architectures do not offer a practical opportunity for Strassen's.
- We use benchmarking of ATLAS's MM and MA for system-specific code tuning, in the spirit of adaptive software packages.
- We speed up ATLAS's MM without introducing any overhead due to data layout or extra control.

Future Work
- The algorithm extends to rectangular matrices; we will characterize its performance.
- Parallel formulation and performance.
- Power management: MM and MA compose the application, yet they utilize the architecture differently.
- Adaptation to hardware configurations (e.g., XScale).