A benchmark for sparse matrix-vector multiplication ● Hormozd Gahvari and Mark Hoemmen ● Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry

Topics for today: ● Sparse matrix-vector multiplication (SMVM) and the Sparsity optimization ● Preexisting SMVM benchmarks vs. ours ● Results: Performance predictors ● Test case: Desktop SIMD

Sparse matrix-vector multiplication ● Sparse vs. dense matrix * vector – Dense: Can take advantage of temporal, spatial locality (BLAS level 2,3) – Sparse: “Stream through” matrix one value at a time – Index arrays: Lose locality ● Compressed sparse row (CSR) format
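A reference CSR kernel makes the access pattern concrete (a minimal sketch; the array names are illustrative, not the benchmark's actual identifiers):

/* y = A*x with A in compressed sparse row (CSR) format.
   val[]     : nonzero values, stored row by row
   col_idx[] : column index of each nonzero
   row_ptr[] : start of each row in val/col_idx (length n+1) */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* indexed load of x: this is where locality is lost */
        y[i] = sum;
    }
}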

Register block optimization ● Many matrices have small dense blocks – FEM matrices especially – 2x2, 3x3, 6x6 are common ● Register blocking: like unrolling a loop (circumvents latencies) ● Sparsity: – Automatic, heuristic selection of the optimal block size
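For a 2x2 block size, a register-blocked (BSR) kernel keeps the block row's partial sums in registers and unrolls the block multiply (a sketch with illustrative names, not Sparsity's generated code):

/* y = A*x with A in 2x2 block sparse row (BSR) format, row-major within each block.
   brow_ptr/bcol_idx index block rows/columns; val holds 4 values per block. */
void spmv_bsr_2x2(int n_block_rows, const int *brow_ptr, const int *bcol_idx,
                  const double *val, const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; ++I) {
        double y0 = 0.0, y1 = 0.0;               /* block-row accumulators stay in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I+1]; ++k) {
            const double *b  = val + 4*k;        /* one 2x2 block */
            const double *xp = x + 2*bcol_idx[k];
            y0 += b[0]*xp[0] + b[1]*xp[1];       /* unrolled 2x2 block * vector */
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I]   = y0;
        y[2*I+1] = y1;
    }
}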

SMVM benchmarks: Three strategies 1) Actually do SMVM with test cases 2) Simpler ops “simulating” SMVM 3) Analytical / heuristic model

1) Actually do SMVM ● SparseBench: Iterative Krylov solvers – Tests other things besides SMVM! ● SciMark 2.0: – Fixed problem size – Uses unoptimized CSR (no reg. blocks) ● Doesn't capture potential performance with many types of matrices ● Register blocking: Large impact (as we will see)

2) Microbenchmarks “simulating” SMVM ● Goal: capture SMVM behavior with simple set of operations ● STREAM – “Sustained memory bandwidth” – Copy, Scale, Add, Triad – Triad: like dense level-1 BLAS DAXPY ● Rich Vuduc's indirect indexed variants – Resemble sparse matrix addressing – Still not predictive
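For reference, the Triad kernel and an indirectly indexed variant of the kind just mentioned look roughly like this (a sketch; triad, triad_indirect, and the idx array are illustrative, standing in for sparse column indices):

/* STREAM Triad: DAXPY-like streaming access. */
void triad(int n, double *a, const double *b, const double *c, double s)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

/* Indirectly indexed variant: the gather through idx resembles x[col_idx[k]] in SMVM. */
void triad_indirect(int n, double *a, const double *b, const double *c,
                    const int *idx, double s)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + s * c[idx[i]];
}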

3) Analytical models of SMVM performance ● Account for miss rates, latencies and bandwidths ● Sparsity: bounds as heuristic to predict best block dimensions for a machine ● Upper and lower bounds not tight, so difficult to use for performance prediction ● Sparsity's goal: optimization, not performance prediction

Our SMVM benchmark ● Do SMVM with BSR matrix: randomly scattered blocks – BSR format: Typically less structured matrices anyway ● “Best” block size, 1x1 – Characterize different matrix types – Take advantage of potential optimizations (unlike current benchmarks), but in a general way
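One way such a test matrix can be built is to scatter a fixed number of r x c blocks at random block columns within each block row (a sketch under that assumption; the actual generator may also sort block columns and avoid duplicates):

#include <stdlib.h>

/* Fill a BSR structure with randomly scattered r x c blocks of random values.
   brow_ptr has length n_block_rows+1; bcol_idx and val are sized for
   n_block_rows * blocks_per_row blocks. Names are illustrative. */
void fill_random_bsr(int n_block_rows, int n_block_cols, int blocks_per_row,
                     int r, int c, int *brow_ptr, int *bcol_idx, double *val)
{
    int k = 0;
    for (int I = 0; I < n_block_rows; ++I) {
        brow_ptr[I] = k;
        for (int b = 0; b < blocks_per_row; ++b, ++k) {
            bcol_idx[k] = rand() % n_block_cols;          /* randomly scattered block */
            for (int e = 0; e < r*c; ++e)
                val[k*r*c + e] = (double)rand() / RAND_MAX;
        }
    }
    brow_ptr[n_block_rows] = k;
}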

Dense matrix in sparse format ● Test this with optimal block size: – To show that fill doesn't affect performance much – Fill: affects locality of accesses to source vector

Data set sizing ● Size vectors to fit in largest cache, matrix out of cache – Tests “streaming in” of matrix values – Natural scaling to machine parameters! ● “Inspiration” SPECfp92 (small enough so manufacturers could size cache to fit all data) vs. SPECfp95 (data sizes increased) – Fill now machine-dependent: ● Tests show fill (locality of source vector accesses) has little effect
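The sizing rule can be written down roughly as follows (a sketch; cache_bytes is assumed to come from machine detection, and the 4x factor is illustrative rather than the benchmark's exact choice):

#include <stddef.h>

/* Vector length such that the source and destination vectors together fit
   in the largest cache. */
size_t vector_length_for_cache(size_t cache_bytes)
{
    return cache_bytes / (2 * sizeof(double));
}

/* Lower bound on matrix data size so the matrix streams from memory
   instead of fitting in cache. */
size_t min_matrix_bytes(size_t cache_bytes)
{
    return 4 * cache_bytes;
}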

Results: “Best” block size ● Highest Mflops/s value for the block sizes tested, for: – Sparse matrix (fill chosen as above) – Dense matrix in sparse format (4096 x 4096) ● Compare with Mflops/s for STREAM Triad (a[i] = b[i] + s * c[i])

Ranking processors according to benchmarks: ● For optimized (best block size) SMVM: – Peak memory bandwidth is a good predictor of the Itanium 2 / Pentium 4 / Pentium M relationship – STREAM mispredicts these ● STREAM: – Better predicts unoptimized (1x1) SMVM – Peak bandwidth no longer helpful

Our benchmark: Useful performance indicator ● Comparison with results for “real-life” matrices: – Works well for FEM matrices – Not always as well for non-FEM matrices – More wasted space in block data structure: directly proportional to slowdown

Comparison of Benchmark with Real Matrices ● The following two graphs show the MFLOP rates of matrices generated by our benchmark vs. matrices from the BeBOP group and a dense matrix in sparse format ● Plots compare by block size; the matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices. ● A comprehensive list of the BeBOP test suite matrices can be found in Vuduc et al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.

Comparison Conclusions ● Our benchmark does a good job modeling real data ● Dense matrix in sparse format looks good on Ultra 3, but is noticeably inferior to our benchmark for large block sizes on Itanium 2

Evaluating SIMD instructions ● SMVM benchmark: – A tool to evaluate architectural features ● e.g.: desktop SIMD floating point ● SSE-2 ISA: – Pentium 4 and M; AMD Opteron – Parallel ops on pairs of double-precision values ● {ADD|MUL|DIV}PD: arithmetic ● MOVAPD: load aligned pair

Vectorizing DAXPY ● Register block: small dense matrix * vector ● Depending on matrix data ordering: – Column-major (Fortran-style): ● Need a scalar * vector operation – Row-major (C-style): ● Need a “reduce” (dot product)
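In loop form, the two orientations of the r x c block update look like this (a sketch; these plain-C loops only illustrate the data access, not Sparsity's generated code):

/* Column-major block: each column contributes a scalar * vector (AXPY-like) update. */
void block_colmajor(int r, int c, const double *A, const double *x, double *y)
{
    for (int j = 0; j < c; ++j)
        for (int i = 0; i < r; ++i)
            y[i] += A[i + j*r] * x[j];
}

/* Row-major block: each row is a dot product, i.e. a vector reduction. */
void block_rowmajor(int r, int c, const double *A, const double *x, double *y)
{
    for (int i = 0; i < r; ++i) {
        double sum = 0.0;
        for (int j = 0; j < c; ++j)
            sum += A[i*c + j] * x[j];
        y[i] += sum;
    }
}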

Sparsity register block layout ● Row-major order within block – Vs. Sparse BLAS proposal (col-major)! – Vector reductions change associativity (results may differ from scalar version, due to roundoff) ● We chose to keep it for now – Can't just switch algorithm: orientation affects stride of vector loads – Need a good vector reduction

Vector reduce ● e.g. C. Kozyrakis' recent UC Berkeley Ph.D. thesis on multimedia vector ops ● “vhalf” instruction: – Copy lower half of src vector reg. --> upper half of dest. ● Iterate (vhalf, vector add) to reduce.

SSE-2 has “vhalf”!
# Sum the two doubles in %xmm1 (AT&T syntax):
shufpd  $0, %xmm1, %xmm0    # low 8B of %xmm1 --> high 8B of %xmm0
addpd   %xmm1, %xmm0        # high 8B of %xmm0 now holds the sum

One possible SSE-2 6x6 A*x ● %xmm0 <- (dest(0), 0) ● 6 MOVAPD: interleave matrix row pairs and src vector pairs ● Update indices ● 3x (MULPD, then ADDPD to %xmm0) ● Sum elems of %xmm0 – (SHUFPD and ADDPD) ● Extract and store sum
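In SSE-2 intrinsics, one row of such a row-major 6x6 block comes down to three MULPD/ADDPD pairs plus the SHUFPD-based reduce (a sketch assuming 16-byte-aligned operands; dot6_sse2 is a hypothetical helper, not the benchmark's code):

#include <emmintrin.h>   /* SSE-2 intrinsics */

/* Dot product of one 6-element block row with the 6 source-vector entries. */
static double dot6_sse2(const double *arow, const double *x)
{
    __m128d acc = _mm_setzero_pd();
    for (int k = 0; k < 6; k += 2) {
        __m128d a = _mm_load_pd(arow + k);        /* MOVAPD: aligned pair load */
        __m128d v = _mm_load_pd(x + k);
        acc = _mm_add_pd(acc, _mm_mul_pd(a, v));  /* MULPD then ADDPD */
    }
    /* "vhalf"-style reduce: SHUFPD swaps the halves, ADDSD sums them. */
    __m128d hi = _mm_shuffle_pd(acc, acc, 1);
    acc = _mm_add_sd(acc, hi);
    return _mm_cvtsd_f64(acc);                    /* extract and return the sum */
}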

SSE-2: gcc and Intel C compilers won't vectorize! ● Use SIMD registers for scalar math! – SSE-2 latency: 1 cycle less than x87 – x87 uses same fn unit as SIMD anyway ● Vector reduce sub-optimal? – Fewer ops: less latency-hiding potential – Only 8 XMM regs: Can't unroll ● Col-major suboptimal – No scalar * vector instruction! ● Or the alignment issue...

“Small matrix library” ● From Intel: matrix * vector ● Optimized for 6x6 or smaller ● Idea: – Replace Sparsity's explicit (BLAS-1-like) register block multiplication... – ...with an optimized function (BLAS-2-like) ● We're working on this ● Needed to say whether SIMD is valuable

SIMD load: alignment ● Possible reason for no automatic vectorization – Load pair needs alignment on 16-byte boundaries – Non-aligned load: slower – Compiler can't guarantee alignment ● Itanium 2: the same issue reappears...
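Guaranteeing that alignment by hand is straightforward (a sketch assuming a POSIX system; alloc_aligned is a hypothetical helper, not part of the benchmark):

#include <stdlib.h>

/* Allocate an array of doubles on a 16-byte boundary so paired loads
   (MOVAPD on SSE-2, ldfpd on Itanium 2) are legal. Returns NULL on failure. */
double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}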

SSE-2 results: Disappointing ● Pentium M: gains nothing ● Pentium 4: actually gains a little – SSE-2 1 cycle lower latency than x87 – Small blocks: latency dominates – x87 ISA harder to schedule ● AMD Opteron not available for testing – 16 XMM regs (vs. 8): better unrolling capability?

How SSE-2 should look: STREAM Scale b[0:N-1] = scalar * c[0:N-1] (speedup 1.72)
Loop:
    movapd  c(%eax), %xmm4    # load aligned pair from c
    mulpd   %xmm0, %xmm4      # multiply by the scalar held in %xmm0
    movntpd %xmm4, b(%eax)    # non-temporal store into b
    addl    $16, %eax         # advance 16 bytes (two doubles)
    cmpl    $ , %eax          # compare against the array size in bytes
    jl      Loop

NetBurst (Pentium 4/M architecture) (Note: diagram used without permission)

Can NetBurst keep up with DAXPY? ● In one cycle: – 1 load of an aligned pair, 1 store of an aligned pair, 1 SIMD flop (alternating ADDPD/MULPD) ● DAXPY (in row-major): Triad-like – y(i) = y(i) + A(i,j) * x(j) – If y(i) is already loaded: 2 loads, 1 multiply, 1 add, 1 store ● Ratio of loads to stores inadequate? – Itanium 2 changes this...

Itanium 2: Streaming fl-pt ● NO SSE-2 support!!! ● BUT: In 1 cycle: 2 MMF bundles: – 2 load pair (4 loads), 2 stores – 2 FMACs (a + s * b) ● (Or MFI: Load pair, FMAC, update idx) ● 1 cycle: theoretically 2x DAXPY!

Itanium 2: Alignment strikes again! ● Intel C Compiler won't generate “load pair” instructions!!! ● Why? – ldfpd (“load pair”) needs aligned data – Compiler doesn't see underlying dense BLAS 2 structure? – Register pressure?

SIMD conclusions: ● STREAM Triad suggests modest potential speedup ● Multiple scalar functional units: – More flexible than SIMD: Speedup independent of orientation ● Code scheduling difficult – Pragmas to tell compiler data is aligned – Encapsulate block A*x in hand-coded routine

Conclusions: ● Our benchmark: – Good SMVM performance prediction – Scales to any typical uniprocessor ● With “optimal” block sizes: – Performance tied to memory bandwidth ● With 1x1 blocks: – Performance related more to latency

Conclusions (2): ● SIMD: Need to test custom mini dense matrix * vector routines ● Development will continue after this semester: – More testing – Parallelization