Automatic Performance Tuning of Sparse Matrix Kernels: Observations and Experience. Performance tuning is tedious and time-consuming work. Richard Vuduc.

Presentation transcript:

Automatic Performance Tuning of Sparse Matrix Kernels: Observations and Experience
Performance tuning is tedious and time-consuming work.
Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida, Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
BeBOP: Berkeley Benchmarking and OPtimization Group

Context: The performance of many applications is dominated by a few computational kernels.

Needle in a haystack (figure): Planar slice of a large space of mathematically equivalent dense matrix multiply implementations. Each square is an implementation, color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.

Platform variability (figure): Distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler). Performance tuning for any one platform must be redone for each new platform.

An Approach to Automatic Tuning: For each kernel, identify and generate a space of implementations, and search for the best one.

Tuning Sparse Matrix-Vector Multiply: The SPARSITY system (Im & Yelick, 1999) applies the methodology to y = Ax, where A is sparse.

Extensions to New Kernels: Preliminary results for symmetric A, A^T A, and triangular solve.

Future Work: Integrating with applications, new kernels, further automation, understanding architectural implications.

Cache blocking (figure): Performance at various cache block sizes on a latent semantic indexing matrix.

Sparse triangular solve (figure): The implementation design space includes SSE2 instructions and "switch-to-dense".

A^T A times a vector (figure): The matrix A is brought through the memory hierarchy only once.

Applications need fast kernels
  - Scientific computing, information retrieval: dense and sparse linear algebra
  - Multimedia, audio and image processing: fast transforms
  - Databases: sorting
  - Security: "cryptokernels" (e.g., modular exponentiation)

Hardware complexity is increasing
  - Microprocessor performance is difficult to model
  - Widening processor-memory gap; deep memory hierarchies

Implementation space
  - Conceptually, the set of "interesting" implementations
  - Depends on the kernel and its input
  - May vary in:
      - instruction mix and order
      - memory access patterns
      - data structures and precisions
      - mathematical formulation

Searching
  - How? Exhaustively, or heuristically, guided by models
  - When? Once per kernel and platform, at compile time, at run time, or in hybrid approaches

Sparse matrix data structures
  - Store only the non-zeros
  - Cost: data structure overhead and irregular memory access

Implementation space for sparse matrix-vector multiply
  - Register blocking: exploit existing dense blocks (see the register-blocked sketch below)
  - Cache blocking: create better locality in x and y
  - Multiple vectors: reuse elements of A

Searching example: selecting a register block size
  - Off-line: one-time characterization of performance on a dense matrix stored in sparse format, for all r, c
  - At run time: estimate the r x c fill and choose r, c to maximize Mflops_dense(r,c) / Fill(r,c) (see the selection sketch below)

This approach has been applied successfully in dense linear algebra (PHiPAC '97; ATLAS '98) and signal processing (FFTW '98; SPIRAL '00), among others.

Register blocking profile (figure): One-time characterization of the machine (Mflop/s).

Exploiting symmetry (figure): When A = A^T, only half the matrix need be stored, and each element is used twice.
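Register blocking stores small dense r x c blocks so that the corresponding entries of x and y can stay in registers and be reused. A minimal sketch follows, assuming a BCSR-like layout with a fixed 2x2 block size; the type and field names (bcsr22_t, brow_ptr, bcol_idx, bvals) are illustrative only, not SPARSITY's actual data structures.

```c
/* Sketch: register-blocked (BCSR-style) sparse matrix-vector multiply,
 * y += A*x, with a fixed 2x2 block size.  Hypothetical layout, for
 * illustration of the register-blocking idea only. */
typedef struct {
    int     nbrows;    /* number of block rows (each covers 2 scalar rows)    */
    int    *brow_ptr;  /* nbrows+1 offsets into bcol_idx / bvals              */
    int    *bcol_idx;  /* scalar column index of each block's left column     */
    double *bvals;     /* blocks stored contiguously, 4 doubles per 2x2 block */
} bcsr22_t;

void spmv_bcsr_2x2(const bcsr22_t *A, const double *x, double *y)
{
    for (int I = 0; I < A->nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;                 /* register-resident partial sums */
        for (int k = A->brow_ptr[I]; k < A->brow_ptr[I + 1]; k++) {
            const double *b = &A->bvals[4 * k];    /* one 2x2 block, row-major */
            int j = A->bcol_idx[k];
            double x0 = x[j], x1 = x[j + 1];       /* x entries reused across both block rows */
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```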
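The block-size selection step above combines the off-line dense profile with a run-time fill estimate. Here is a sketch of that selection loop; mflops_dense[][] (the off-line machine profile) and estimate_fill() (run-time sampling of the matrix) are assumed placeholders, not real library routines.

```c
/* Sketch: choose (r,c) maximizing Mflops_dense(r,c) / Fill(r,c). */
#define MAX_R 8
#define MAX_C 8

extern double mflops_dense[MAX_R + 1][MAX_C + 1]; /* off-line profile (Mflop/s), placeholder */
extern double estimate_fill(int r, int c);        /* est. stored/true non-zeros, placeholder */

void choose_block_size(int *best_r, int *best_c)
{
    double best_score = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= MAX_R; r++) {
        for (int c = 1; c <= MAX_C; c++) {
            double fill  = estimate_fill(r, c);   /* >= 1: explicit zeros padded into blocks */
            double score = mflops_dense[r][c] / fill;
            if (score > best_score) {
                best_score = score;
                *best_r = r;
                *best_c = c;
            }
        }
    }
}
```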
Symmetric sparse matrix-vector multiply
  - Store only half of the non-zeros
  - Reuse each stored element twice (see the symmetric SpMV sketch below)

Sparse triangular solve
  - Compute T^(-1) x, where T is a sparse triangular matrix (see the substitution sketch below)
  - Exploit naturally occurring dense structure when T comes from certain applications (e.g., LU factorization)

Multiplying A^T A by a vector
  - A can be brought through the memory hierarchy only once (see the fused sketch below)
  - Arises in linear programming problems, among others

(The SPARSITY system optimizations also apply to these kernels!)

Exploiting new structures (figure): Symmetric matrix from a fluid flow modeling problem (left); triangular matrix from LU factorization (right).

Why do these four profiles look so different? We hope to answer this question and understand the implications for current and future architectures and applications.

Future work
  - Integration with applications and existing software libraries
  - Creating (via reordering) or exploiting other matrix structures
  - New sparse kernels (e.g., matrix powers A^k, the triple product A^T M A)
  - Further automation: generating implementation generators
  - Understanding performance in terms of the underlying architectures

SPARSITY (figure): Performance improvement after run-time block-size selection.

Multiple vectors (figure): Significant speedups are possible when multiplying by several vectors.

Register blocking example (figure): Portion of a sparse matrix with a 4x3 register block grid superimposed.
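For the symmetric kernel, only the diagonal and upper triangle are stored, and each off-diagonal entry contributes to both y_i and y_j, so it is loaded once and used twice. A minimal CSR-based sketch, assuming row_ptr/col_idx/val describe only the upper triangle (an illustration, not the tuned SPARSITY code):

```c
/* Sketch: symmetric SpMV y = A*x with only the upper triangle stored. */
void spmv_symm_csr(int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            double a = val[k];
            y[i] += a * x[j];          /* upper-triangle contribution       */
            if (j != i)
                y[j] += a * x[i];      /* mirrored lower-triangle reuse     */
        }
    }
}
```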
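The sparse triangular solve reduces to substitution over the stored non-zeros. Below is a plain CSR forward-substitution sketch for a lower-triangular T, assuming each row's entries are sorted with the diagonal stored last; the poster's tuned versions additionally exploit the dense trailing structure ("switch-to-dense") and SSE2, which this sketch omits.

```c
/* Sketch: x = T^(-1) b for lower-triangular T in CSR.
 * Assumes each row is sorted by column with the diagonal as its last entry. */
void sptrsv_lower_csr(int n, const int *row_ptr, const int *col_idx,
                      const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int last = row_ptr[i + 1] - 1;           /* diagonal entry of row i        */
        for (int k = row_ptr[i]; k < last; k++)
            s -= val[k] * x[col_idx[k]];         /* subtract already-solved terms  */
        x[i] = s / val[last];
    }
}
```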
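The A^T A kernel avoids forming A^T A or making two passes over A: each row a_i is read once to form t_i = a_i . x and immediately reused to accumulate t_i * a_i into y. A minimal CSR sketch of that fusion (illustrative, not the SPARSITY implementation):

```c
/* Sketch: y = A^T (A x) in a single pass over an m-by-n matrix A in CSR. */
void ata_times_vector(int m, int n, const int *row_ptr, const int *col_idx,
                      const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)   /* t_i = a_i . x          */
            t += val[k] * x[col_idx[k]];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)   /* y += t_i * a_i         */
            y[col_idx[k]] += t * val[k];                    /* row a_i read only once */
    }
}
```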