Optimizing the Performance of Sparse Matrix-Vector Multiplication

Presentation transcript:

Optimizing the Performance of Sparse Matrix-Vector Multiplication. Eun-Jin Im, U.C. Berkeley, 6/13/00.

Overview: Motivation; Optimization techniques (Register Blocking, Cache Blocking, Multiple Vectors); the Sparsity system; Related Work; Contribution; Conclusion.

Motivation: Usage. Sparse matrix-vector multiplication is used in iterative solvers, explicit methods, and eigenvalue and singular value problems, with applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas.

Motivation: Performance (1). Matrix-vector multiplication (BLAS2) is slower than matrix-matrix multiplication (BLAS3). For example, on a 167 MHz UltraSPARC I, vendor-optimized matrix-vector multiplication runs at 57 Mflops while vendor-optimized matrix-matrix multiplication runs at 185 Mflops. The reason is the lower ratio of floating point operations to memory operations.
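As a rough illustration (a back-of-the-envelope count for an n x n dense matrix, not taken from the slides):

```latex
% BLAS2: matrix-vector multiply performs 2n^2 flops over about n^2 + 2n words,
% with each matrix entry read exactly once, so
\frac{\text{flops}}{\text{words}} \approx \frac{2n^2}{n^2 + 2n} \approx 2
% BLAS3: matrix-matrix multiply performs 2n^3 flops over about 3n^2 words, so
\frac{\text{flops}}{\text{words}} \approx \frac{2n^3}{3n^2} = \frac{2n}{3}
```

The BLAS3 ratio grows with n, so well-blocked matrix-matrix code can keep the floating point units busy, while matrix-vector multiplication remains limited by memory bandwidth.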

Motivation: Performance (2). Sparse matrix operations are slower than dense matrix operations. For example, on a 167 MHz UltraSPARC I, dense matrix-vector multiplication runs at 38 Mflops with a naïve implementation and 57 Mflops with the vendor-optimized implementation, while naïve sparse matrix-vector multiplication runs at only 5.7 - 25 Mflops. The reason is the indirect data structure, and thus inefficient memory accesses.

Motivation: Optimized libraries. Old approach: hand-optimized libraries (vendor-supplied BLAS, LAPACK). New approach: automatic generation of libraries (PHiPAC and ATLAS for dense linear algebra, FFTW for the fast Fourier transform). Our approach: automatic generation of libraries for sparse matrices, which adds an additional dimension: the nonzero structure of the sparse matrix.

Sparse Matrix Formats. There are a large number of sparse matrix formats. Point-entry formats: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), ... Block-entry formats: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), ...

Compressed Sparse Row Format. We internally use the CSR format because it is a relatively efficient format; a minimal CSR kernel is sketched below.
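For reference, a minimal CSR matrix-vector multiply kernel looks roughly like this (a sketch of the standard algorithm; the variable names are illustrative and not taken from the Sparsity code):

```c
/* y = A*x for an m-row sparse matrix A in CSR format.
 * row_ptr has m+1 entries; col_idx and val each have nnz entries. */
void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect access into x */
        y[i] = sum;
    }
}
```

The indirect load x[col_idx[k]] is what makes the naïve kernel slow: each nonzero costs an index load, a value load, and an unpredictable access into the source vector.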

Optimization Techniques: Register Blocking, Cache Blocking, Multiple Vectors.

Register Blocking: Blocked Compressed Sparse Row (BCSR) format. [Figure: an example matrix stored in 2x2 BCSR, with block row pointers 0 2 4, block column indices 0 4 2 4, and the 2x2 blocks of values A00..A35, where partially dense blocks are padded with explicit zeros.] Advantages of the format: better temporal locality in registers, and the multiplication loop can be unrolled for better performance; see the sketch below.
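A sketch of a 2x2 register-blocked (BCSR) kernel, with the block multiply fully unrolled so that the two destination values and the two source values stay in registers (illustrative names; assumes the number of rows is a multiple of 2):

```c
/* y = A*x for A stored in 2x2 BCSR: brow_ptr indexes mb block rows,
 * bcol_idx holds the first column of each block, and val stores each
 * 2x2 block contiguously in row-major order. */
void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;            /* destination kept in registers */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *b = &val[4 * k];
            double x0 = x[bcol_idx[k]];       /* source reused by both rows */
            double x1 = x[bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}
```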

Register Blocking: Fill Overhead. We use a uniform block size, which adds fill overhead: the ratio of stored values (including explicit zeros) to true nonzeros. In the slide's example, fill overhead = 12/7 = 1.71. This increases both storage and the number of floating point operations.
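The fill overhead for a candidate block size r x c can be computed by counting how many r x c blocks the nonzero pattern touches (a simple exhaustive sketch; the Sparsity system estimates this quantity more cheaply, e.g. from a sample of the matrix, rather than scanning every nonzero for every block size):

```c
#include <stdlib.h>

/* Fill overhead of r-by-c blocking for an m-by-n matrix in CSR format:
 * (stored values, including explicit zeros) / (true nonzeros). */
double fill_overhead(int m, int n, const int *row_ptr, const int *col_idx,
                     int r, int c)
{
    long nnz = row_ptr[m], blocks = 0;
    int nbc = (n + c - 1) / c;              /* number of block columns */
    char *touched = calloc(nbc, 1);         /* scratch flags per block row */
    for (int br = 0; br * r < m; br++) {
        int lo = br * r, hi = (lo + r < m) ? lo + r : m;
        for (int i = lo; i < hi; i++)
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                if (!touched[col_idx[k] / c]) {
                    touched[col_idx[k] / c] = 1;   /* new block in this block row */
                    blocks++;
                }
        for (int j = 0; j < nbc; j++)
            touched[j] = 0;                 /* reset for the next block row */
    }
    free(touched);
    return (double)(blocks * r * c) / (double)nnz;
}
```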

Register Blocking: Dense matrix profile on an UltraSPARC I (input to the performance model).

Register Blocking: Selecting the block size. The hard part of the problem is picking a block size that both minimizes the fill overhead and maximizes the raw performance. Two approaches: exhaustive search, or using a model.

Register Blocking: Performance model. There are two components to the performance model: the multiplication performance of a dense matrix represented in the r x c blocked sparse format, and the estimated fill overhead. Predicted performance for block size r x c = (dense r x c blocked performance) / (fill overhead).
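Combining the two components, the model picks the (r, c) that maximizes the predicted performance; dense_mflops below is the machine profile from the dense-matrix measurements, and the array and function names are illustrative rather than the Sparsity system's own:

```c
/* Choose the register block size (r, c) that maximizes
 *   predicted(r, c) = dense_mflops[r][c] / fill[r][c],
 * where dense_mflops is the one-time machine profile and fill is the
 * estimated fill overhead of this matrix for each block size. */
void choose_block_size(int rmax, int cmax,
                       double dense_mflops[rmax][cmax],
                       double fill[rmax][cmax],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= rmax; r++)
        for (int c = 1; c <= cmax; c++) {
            double predicted = dense_mflops[r - 1][c - 1] / fill[r - 1][c - 1];
            if (predicted > best) {
                best = predicted;
                *best_r = r;
                *best_c = c;
            }
        }
}
```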

Benchmark matrices. Matrix 1: dense matrix (1000 x 1000). Matrices 2-17: Finite Element Method matrices. Matrices 18-39: matrices from structural engineering and device simulation. Matrices 40-44: linear programming matrices. Matrix 45: document retrieval matrix used for Latent Semantic Indexing. Matrix 46: random matrix (10000 x 10000, 0.15% nonzero).

Register Blocking: Performance. The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).

Register Blocking: Performance. Speedup is generally best on the MIPS R10000, where the blocked code is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).

Register Blocking: Validation of the Performance Model. Comparison to the performance of exhaustive search (yellow bars, block sizes in the lower row) on a subset of the benchmark matrices: the exhaustive search does not produce a much better result.

Register Blocking: Overhead. Pre-computation overhead: estimating the fill overhead (red bars) and reorganizing the matrix (yellow bars). The ratio gives the number of repetitions of the multiplication for which the optimization becomes beneficial.

Cache Blocking. Improves temporal locality of access to the source vector: the matrix is partitioned into large rectangular blocks so that the portions of the source vector x and the destination vector y used by each block stay in cache. [Figure: layout of the cache blocks and the vectors in memory.] A simplified kernel is sketched below.
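One simplified way to realize this is to pre-split the matrix into large sparse sub-blocks, each stored in its own CSR arrays, and multiply block by block so that only a cache-sized window of x and y is touched at a time (a sketch under assumed names; the actual Sparsity implementation details may differ):

```c
/* One cache block: a sparse submatrix covering rows [row0, row0 + nrows)
 * and a contiguous range of columns starting at col0, in local CSR form. */
typedef struct {
    int row0, col0, nrows;
    const int *row_ptr, *col_idx;   /* col_idx is relative to col0 */
    const double *val;
} CacheBlock;

/* y += A*x with A pre-split into cache blocks; y must be zeroed first.
 * While one block is processed, only small windows of x and y are used,
 * so they can remain in cache. */
void spmv_cache_blocked(int nblocks, const CacheBlock *blk,
                        const double *x, double *y)
{
    for (int b = 0; b < nblocks; b++) {
        const CacheBlock *B = &blk[b];
        const double *xb = x + B->col0;
        double *yb = y + B->row0;
        for (int i = 0; i < B->nrows; i++) {
            double sum = 0.0;
            for (int k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                sum += B->val[k] * xb[B->col_idx[k]];
            yb[i] += sum;
        }
    }
}
```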

Cache Blocking: Performance. Speedup is generally better on the MIPS, which has a larger cache and a larger miss penalty (26/589 ns for the MIPS vs. 36/268 ns for the UltraSPARC), except for the document retrieval and random matrices.

Cache Blocking: Performance on the document retrieval matrix. Document retrieval matrix: 10K x 256K with 37M nonzeros; the SVD is applied to it for LSI (Latent Semantic Indexing). The nonzero elements are spread across the matrix with no dense clusters. Performance peaks at a 16K x 16K cache block with a speedup of 3.1.

Cache Blocking: When and how to use cache blocking. From the experiments, the matrices for which cache blocking is most effective are large and “random”. We developed a measure of the “randomness” of a matrix, and we perform a coarse-grained search to decide the cache block size.

Combination of Register and Cache Blocking: UltraSPARC. The combination is rarely beneficial, and is often slower than either of the two optimizations alone.

Combination of Register and Cache Blocking: MIPS.

Multiple Vector Multiplication. Multiplying the same matrix by several vectors at once gives a better chance of optimization (the BLAS2 vs. BLAS3 effect), since each matrix element is reused across vectors instead of being reloaded for every single-vector multiplication. [Figure: repetition of the single-vector case vs. the multiple-vector case.] A sketch follows below.
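A sketch of the multiple-vector CSR kernel: the loop over vectors is placed inside the loop over nonzeros, so each matrix value and column index is loaded once and reused for all v vectors (illustrative code; here the vectors are stored as the rows of dense v-by-n and v-by-m arrays):

```c
/* Y = A*X for an m-by-n CSR matrix A and v source vectors.
 * X is v-by-n and Y is v-by-m, both row-major; each val[k]/col_idx[k]
 * pair is fetched once and reused v times. */
void spmv_multivec(int m, int n, int v,
                   const int *row_ptr, const int *col_idx, const double *val,
                   const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < v; j++)
            Y[j * m + i] = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            double a = val[k];                /* matrix element loaded once */
            int col = col_idx[k];
            for (int j = 0; j < v; j++)       /* reused for every vector */
                Y[j * m + i] += a * X[j * n + col];
        }
    }
}
```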

Multiple Vector Multiplication: Performance. [Figures: register blocking performance and cache blocking performance with multiple vectors.]

Multiple Vector Multiplication: Register Blocking Performance. The speedup is larger than for single-vector register blocking, and even the matrices that did not speed up before now improve (the middle group on the UltraSPARC).

Multiple Vector Multiplication: Cache Blocking Performance (UltraSPARC and MIPS). There is a noticeable speedup even for the matrices that did not speed up before (on the UltraSPARC), and the best block sizes are much smaller than those for single-vector cache blocking.

Sparsity System: Purpose. Guide the choice of optimization and automatically select optimization parameters such as the block size and the number of vectors. http://comix.cs.berkeley.edu/~ejim/sparsity

Sparsity System: Organization. [Diagram: the Sparsity machine profiler produces a machine performance profile; the Sparsity optimizer takes this profile, an example matrix, and the maximum number of vectors, and produces optimized code and drivers.]

Summary: Speedup of Sparsity on the UltraSPARC. Up to 3x for a single vector, and 4.7x for multiple vectors.

Summary: Speedup of Sparsity on the MIPS. Up to 3x for a single vector, and 6x for multiple vectors.

Summary: Overhead of the Sparsity Optimization. The number of iterations needed to amortize the optimization = (overhead time) / (time saved per iteration). The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.
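Restated as a formula (the relation the slide implies, with symbol names chosen here for clarity):

```latex
\text{break-even repetitions} \;=\; \frac{T_{\text{overhead}}}{T_{\text{unoptimized}} - T_{\text{optimized}}}
```

where $T_{\text{overhead}}$ is the one-time cost of estimating fill and reorganizing the matrix, and the denominator is the time saved per multiplication.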

Related Work (1). Dense matrix optimization: loop transformations by compilers (M. Wolf, etc.), hand-optimized libraries (BLAS, LAPACK), and automatic generation of libraries (PHiPAC, ATLAS, and FFTW). Sparse matrix standardization and libraries: BLAS Technical Forum, NIST Sparse BLAS, MV++, SparseLib++, TNT. Hand optimization of sparse matrix-vector multiplication: S. Toledo, Oliker et al.

Related Work (2). Sparse matrix packages: SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98. Compiling sparse matrix code: the sparse compiler (Bik) and the Bernoulli compiler (Kotlyar). On-demand code generation: NIST Sparse BLAS, the sparse compiler.

Contribution. A thorough investigation of memory hierarchy optimizations for sparse matrix-vector multiplication; a performance study on benchmark matrices; the development of a performance model to choose optimization parameters; and the Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication.

Conclusion. Memory hierarchy optimization for sparse matrix-vector multiplication: register blocking benefits matrices with dense local structure, cache blocking benefits large matrices with random structure, and multiple-vector multiplication improves performance further through reuse of matrix elements. The choice of optimization depends on both the matrix structure and the machine architecture, and the automated system helps with this complicated and time-consuming process.