Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures Ankit Jain, Vasily Volkov CS252 Final Presentation 5/9/2007


SpM×V and its Applications
Sparse Matrix Vector Multiply (SpM×V): y ← y + A∙x
– x, y are dense vectors (x: source vector, y: destination vector)
– A is a sparse matrix (<1% of entries are nonzero)
Applications employing SpM×V in the inner loop
– Least Squares Problems
– Eigenvalue Problems

Storing a Matrix in Memory: Compressed Sparse Row (CSR)
Data Structure and Algorithm
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]

  foreach row i do
    for l = ptr[i] to ptr[i+1] – 1 do
      y[i] ← y[i] + val[l] ∙ x[ind[l]]
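For concreteness, here is a minimal C rendering of the CSR kernel above, assuming 0-based indexing and the same array names (val, ind, ptr). It is a sketch of the reference loop, not the tuned code discussed later.

  #include <stddef.h>

  /* y <- y + A*x for an m-row matrix A in CSR format.
   * val[]: nonzero values, ind[]: their column indices, ptr[m+1]: row offsets. */
  void spmv_csr(size_t m, const double *val, const int *ind,
                const int *ptr, const double *x, double *y)
  {
      for (size_t i = 0; i < m; i++) {
          double yi = y[i];
          for (int l = ptr[i]; l < ptr[i + 1]; l++)
              yi += val[l] * x[ind[l]];   /* indirect, irregular access to x */
          y[i] = yi;
      }
  }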

What’s so hard about it?
Reasons for Poor Performance of the Naïve Implementation
– Poor locality (indirect and irregular memory accesses); limited by the speed of main memory
– Poor instruction mix (low ratio of flops to memory operations)
– Algorithm depends on the non-zero structure of the matrix (dense vs. sparse matrices)

Register-Level Blocking (SPARSITY): 3x3 Example

BCSR with uniform, aligned grid

Register-Level Blocking (SPARSITY): 3x3 Example
Fill in zeros: trade off extra ops for better efficiency

Blocked Compressed Sparse Row (BCSR)
– Inner loop performs a floating-point multiply-add on each non-zero in the block instead of just one non-zero
– Reduces the number of times the source vector x has to be brought back into memory
– Reduces the number of indices that have to be stored and loaded
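To make the blocked inner loop concrete, below is a hedged C sketch of a BCSR kernel for a fixed 2x2 register block (the slides use 3x3 and 4x2 blocks; the structure is the same). The array names b_val, b_ind, and b_ptr are illustrative, not taken from SPARSITY or OSKI.

  /* y <- y + A*x with A in 2x2 Blocked CSR (BCSR).
   * mb: number of block rows; b_ptr[mb+1]: block-row offsets;
   * b_ind[]: starting column of each block; b_val[]: 4 values per block, row-major. */
  void spmv_bcsr_2x2(int mb, const double *b_val, const int *b_ind,
                     const int *b_ptr, const double *x, double *y)
  {
      for (int I = 0; I < mb; I++) {
          double y0 = y[2 * I], y1 = y[2 * I + 1];   /* block row of y kept in registers */
          for (int b = b_ptr[I]; b < b_ptr[I + 1]; b++) {
              const double *v = &b_val[4 * b];
              int j = b_ind[b];
              double x0 = x[j], x1 = x[j + 1];       /* x reused by both rows of the block */
              y0 += v[0] * x0 + v[1] * x1;           /* one stored index serves 4 multiply-adds */
              y1 += v[2] * x0 + v[3] * x1;
          }
          y[2 * I] = y0;
          y[2 * I + 1] = y1;
      }
  }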

The Payoff: Speedups on Itanium 2 (plot: Mflop/s for the reference implementation vs. the best register blocking, 4x2)

Explicit Software Pipelining

ORIGINAL CODE:
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]

  foreach row i do
    for l = ptr[i] to ptr[i+1] – 1 do
      y[i] ← y[i] + val[l] ∙ x[ind[l]]

SOFTWARE PIPELINED CODE:
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]

  foreach row i do
    for l = ptr[i] to ptr[i+1] – 1 do
      y[i] ← y[i] + val_1 ∙ x_1
      val_1 ← val[l + 1]
      x_1 ← x[ind_2]
      ind_2 ← ind[l + 2]
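The pseudocode above elides the prologue and epilogue. The following C sketch (one way to express the transformation, not the exact Itanium code) shows the same idea: the value and index for the next element are loaded while the multiply-add for the current element executes, so load latency overlaps the floating-point work.

  /* Software-pipelined CSR inner loop (illustrative sketch). */
  void spmv_csr_swp(int m, const double *val, const int *ind,
                    const int *ptr, const double *x, double *y)
  {
      for (int i = 0; i < m; i++) {
          int lo = ptr[i], hi = ptr[i + 1];
          if (hi - lo < 2) {                    /* short rows: plain loop */
              for (int l = lo; l < hi; l++)
                  y[i] += val[l] * x[ind[l]];
              continue;
          }
          double v_next = val[lo];              /* prologue: pre-load first element */
          double x_next = x[ind[lo]];
          double yi = y[i];
          for (int l = lo; l < hi - 1; l++) {
              double v_cur = v_next, x_cur = x_next;
              v_next = val[l + 1];              /* loads for the next iteration ... */
              x_next = x[ind[l + 1]];
              yi += v_cur * x_cur;              /* ... overlap this multiply-add */
          }
          yi += v_next * x_next;                /* epilogue: last element of the row */
          y[i] = yi;
      }
  }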

Explicit Software Prefetching

ORIGINAL CODE:
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]

  foreach row i do
    for l = ptr[i] to ptr[i+1] – 1 do
      y[i] ← y[i] + val[l] ∙ x[ind[l]]

SOFTWARE PREFETCHED CODE:
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]

  foreach row i do
    for l = ptr[i] to ptr[i+1] – 1 do
      y[i] ← y[i] + val[l] ∙ x[ind[l]]
      pref(NTA, pref_v_amt + &val[l])
      pref(NTA, pref_i_amt + &ind[l])
      pref(NONE, &x[ind[l + pref_x_amt]])

* NTA: no temporal locality at any cache level
* NONE: temporal locality at the highest level
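A rough C approximation of the prefetched loop using the GCC/Clang __builtin_prefetch intrinsic. The prefetch distances PF_V, PF_I, and PF_X are illustrative tuning parameters, and the locality hints only approximate the slide's NTA/NONE hints (0 = little temporal locality, 3 = high).

  #define PF_V 64   /* how many elements ahead to prefetch val[] */
  #define PF_I 64   /* how many elements ahead to prefetch ind[] */
  #define PF_X 16   /* how far ahead in the index stream to prefetch x[] */

  void spmv_csr_pref(int m, const double *val, const int *ind,
                     const int *ptr, const double *x, double *y)
  {
      int nnz = ptr[m];                               /* total number of stored non-zeros */
      for (int i = 0; i < m; i++) {
          double yi = y[i];
          for (int l = ptr[i]; l < ptr[i + 1]; l++) {
              yi += val[l] * x[ind[l]];
              __builtin_prefetch(&val[l + PF_V], 0, 0);   /* streamed once: low locality */
              __builtin_prefetch(&ind[l + PF_I], 0, 0);
              if (l + PF_X < nnz)                         /* x entries are reused: keep cached */
                  __builtin_prefetch(&x[ind[l + PF_X]], 0, 3);
          }
          y[i] = yi;
      }
  }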

Characteristics of Modern Architectures
High set associativity in caches
– 4-way L1, 8-way L2, 12-way L3 on Itanium 2
Multiple load/store units
Multiple execution units
– Six integer execution units on Itanium 2
– Two floating-point multiply-add execution units on Itanium 2
Question: What if we broke the matrix into multiple streams of execution?

Parallel SpMV
Run different rows in different threads
Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
– What if rows have different lengths?
– One row finishes while others are still running
– Waiting threads keep processors idle
– Can we avoid this idleness?
Standard solution: Segmented scan
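On a multicore CPU the row-per-thread idea can be sketched with OpenMP as below (an assumption for illustration; compile with -fopenmp). The slide's concern is data-parallel SIMD/VLIW hardware, where lockstep execution turns uneven row lengths directly into idle lanes; dynamic scheduling hides some of the imbalance on a CPU but does not help a SIMD machine, which is what motivates the segmented scan.

  /* Row-parallel CSR SpMV: each thread handles whole rows. */
  void spmv_csr_rows_parallel(int m, const double *val, const int *ind,
                              const int *ptr, const double *x, double *y)
  {
      #pragma omp parallel for schedule(dynamic, 64)
      for (int i = 0; i < m; i++) {
          double yi = y[i];
          for (int l = ptr[i]; l < ptr[i + 1]; l++)   /* row lengths vary: load imbalance */
              yi += val[l] * x[ind[l]];
          y[i] = yi;
      }
  }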

Segmented Scan
Multiple segments (streams) of simultaneous execution
Single loop with branches inside to check whether we have reached the end of a row for each segment
– Reduces loop overhead
– Good if the average number of non-zeros per row is small
Changes the memory access pattern and can use caches more efficiently for some matrices
– Future work: pass SpM×V through a cache simulator to observe cache behavior

Itanium 2 Results (1.3 GHz, Millennium Cluster)

Conclusions & Future Work
The optimizations studied pay off and should be incorporated into OSKI
Develop parallel / multicore versions
– Dual-core, dual-socket Opterons, etc.

Questions?

Extra Slides

Algorithm #2: Segmented Scan
1x1x2 Segmented Scan Code
  type val : real[k]
  type ind : int[k]
  type ptr : int[m+1]
  type RowStart : int[VectorLength]

  r0 ← RowStart[0]
  r1 ← RowStart[1]
  nnz0 ← ptr[r0]
  nnz1 ← ptr[r1]
  EoR0 ← ptr[r0+1]
  EoR1 ← ptr[r1+1]

  while nnz0 < SegmentLength do
    y[r0] ← y[r0] + val[nnz0] ∙ x[ind[nnz0]]
    y[r1] ← y[r1] + val[nnz1] ∙ x[ind[nnz1]]
    if nnz0 = EoR0 then
      r0++
      EoR0 ← ptr[r0+1]
    if nnz1 = EoR1 then
      r1++
      EoR1 ← ptr[r1+1]
    nnz0 ← nnz0 + 1
    nnz1 ← nnz1 + 1
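For reference, a C sketch of the same two-stream traversal, under the assumptions that seg_len is the number of non-zeros assigned to each stream and that the two streams own disjoint row ranges; the end-of-row check is moved to the top of the loop so empty rows are also handled. Names such as row_start and seg_len are illustrative.

  /* Two interleaved CSR streams processed in a single loop. */
  void spmv_csr_segscan2(const double *val, const int *ind, const int *ptr,
                         const int row_start[2], int seg_len,
                         const double *x, double *y)
  {
      int r0 = row_start[0],  r1 = row_start[1];
      int nnz0 = ptr[r0],     nnz1 = ptr[r1];
      int eor0 = ptr[r0 + 1], eor1 = ptr[r1 + 1];      /* end-of-row markers */

      for (int k = 0; k < seg_len; k++) {
          while (nnz0 == eor0) { r0++; eor0 = ptr[r0 + 1]; }   /* advance past finished rows */
          while (nnz1 == eor1) { r1++; eor1 = ptr[r1 + 1]; }
          y[r0] += val[nnz0] * x[ind[nnz0]];  nnz0++;
          y[r1] += val[nnz1] * x[ind[nnz1]];  nnz1++;
      }
  }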

Measuring Performance
Measure Dense Performance(r,c)
– Performance (Mflop/s) of a dense matrix stored in sparse r×c blocked format
Estimate Fill Ratio(r,c) for all r,c
– Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
Choose the r,c that maximizes
– Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)
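A small C sketch of that selection heuristic: dense_perf would be a one-time, machine-specific benchmark table and estimate_fill_ratio would be computed from a sample of the actual matrix (both are assumed helpers, and RMAX/CMAX are illustrative bounds).

  #define RMAX 8
  #define CMAX 8

  extern double dense_perf[RMAX + 1][CMAX + 1];     /* Mflop/s of a dense matrix in r x c BCSR */
  extern double estimate_fill_ratio(int r, int c);  /* stored values / true non-zeros, >= 1 */

  /* Pick the (r,c) maximizing Dense Performance(r,c) / Fill Ratio(r,c). */
  void choose_block_size(int *best_r, int *best_c)
  {
      double best = 0.0;
      *best_r = 1; *best_c = 1;
      for (int r = 1; r <= RMAX; r++)
          for (int c = 1; c <= CMAX; c++) {
              double est = dense_perf[r][c] / estimate_fill_ratio(r, c);
              if (est > best) { best = est; *best_r = r; *best_c = c; }
          }
  }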

References
1. G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
2. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
3. E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
4. R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance modeling and analysis of cache blocking in sparse matrix vector multiply. Technical Report UCB/CSD, University of California, Berkeley, Berkeley, CA, USA, June 2004.
5. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
6. A. Schwaighofer. A MATLAB interface to SVMlight.
7. R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
8. R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).