A benchmark for sparse matrix- vector multiplication ● Hormozd Gahvari and Mark Hoemmen ● ● Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry
Topcs for today: ● Sparse matrix-vector multiplication (SMVM) and the Sparsity optimization ● Preexisting SMVM benchmarks vs. ours ● Results: Performance predictors ● Test case: Desktop SIMD
Sparse matrix-vector multiplication ● Sparse vs. dense matrix * vector – Dense: Can take advantage of temporal, spatial locality (BLAS level 2,3) – Sparse: “Stream through” matrix one value at a time – Index arrays: Lose locality ● Compressed sparse row (CSR) format
Register block optimization ● Many matrices have small blocks – FEM matrices especially – 2x2, 3x3, 6x6 common ● Register blocking: Like unrolling a loop (circumvent latencies) ● Sparsity: – Automatic heuristic optimal block size selection
SMVM benchmarks: Three strategies 1) Actually do SMVM with test cases 2) Simpler ops “simulating” SMVM 3) Analytical / heuristic model
1) Actually do SMVM ● SparseBench: Iterative Krylov solvers – Tests other things besides SMVM! ● SciMark 2.0: – Fixed problem size – Uses unoptimized CSR (no reg. blocks) ● Doesn't capture potential performance with many types of matrices ● Register blocking: Large impact (will see)
2) Microbenchmarks “simulating” SMVM ● Goal: capture SMVM behavior with simple set of operations ● STREAM – “Sustained memory bandwidth” – Copy, Scale, Add, Triad – Triad: like dense level-1 BLAS DAXPY ● Rich Vuduc's indirect indexed variants – Resemble sparse matrix addressing – Still not predictive
3) Analytical models of SMVM performance ● Account for miss rates, latencies and bandwidths ● Sparsity: bounds as heuristic to predict best block dimensions for a machine ● Upper and lower bounds not tight, so difficult to use for performance prediction ● Sparsity's goal: optimization, not performance prediction
Our SMVM benchmark ● Do SMVM with BSR matrix: randomly scattered blocks – BSR format: Typically less structured matrices anyway ● “Best” block size, 1x1 – Characterize different matrix types – Take advantage of potential optimizations (unlike current benchmarks), but in a general way
Dense matrix in sparse format ● Test this with optimal block size: – To show that fill doesn't affect performance much – Fill: affects locality of accesses to source vector
Data set sizing ● Size vectors to fit in largest cache, matrix out of cache – Tests “streaming in” of matrix values – Natural scaling to machine parameters! ● “Inspiration” SPECfp92 (small enough so manufacturers could size cache to fit all data) vs. SPECfp95 (data sizes increased) – Fill now machine-dependent: ● Tests show fill (locality of source vector accesses) has little effect
Results: “Best” block size ● Highest Mflops/s value for the block sizes tested, for: – Sparse matrix (fill chosen as above) – Dense matrix in sparse format (4096 x 4096) ● Compare with Mflops/s for STREAM Triad (a[i] = b[i] + s * c[i])
Rank processors acc. to benchmarks: ● For optimized (best block size) SMVM: – Peak mem bandwidth good predictor for Itanium 2, P4, PM relationship – STREAM mispredicts these ● STREAM: – Better predicts unoptimized (1 x 1) SMVM – Peak bandwidth no longer helpful
Our benchmark: Useful performance indicator ● Comparison with results for “real-life” matrices: – Works well for FEM matrices – Not always as well for non-FEM matrices – More wasted space in block data structure: directly proportional to slowdown
Comparison of Benchmark with Real Matrices ● Following two graphs show MFLOP rate of matrices generated by our benchmark vs. matrices from BeBOP group and a dense matrix in sparse format ● Plots compare by block size; matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices. ● A comprehensive list of the BeBOP test suite matrices can be found in Vuduc, et. al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.
Comparison Conclusions ● Our benchmark does a good job modeling real data ● Dense matrix in sparse format looks good on Ultra 3, but is noticeably inferior to our benchmark for large block sizes on Itanium 2
Evaluating SIMD instructions ● SMVM benchmark: – Tool to evaluate arch. features ● e.g.: Desktop SIMD floating-point ● SSE-2 ISA: – Pentium 4, M; AMD Opteron – Parallel ops on 2 floating-point doubles ● {ADD|MUL|DIV}PD: arithmetic ● MOVAPD: load aligned pair
Vectorizing DAXPY ● Register block: small dense Matrix * vector ● Dep. on matrix data ordering: – Column-major (Fortran-style): ● Need scalar * vector operation – Row-major (C-style): ● Need “reduce” (dot product)
Sparsity register block layout ● Row-major order within block – Vs. Sparse BLAS proposal (col-major)! – Vector reductions change associativity (results may differ from scalar version, due to roundoff) ● We chose to keep it for now – Can't just switch algorithm: orientation affects stride of vector loads – Need a good vector reduction
Vector reduce ● e.g. C. Kozyrakis' recent UC Berkeley Ph.D. thesis on multimedia vector ops ● “vhalf” instruction: – Copy lower half of src vector reg. --> upper half of dest. ● Iterate (vhalf, vector add) to reduce.
SSE-2 has “vhalf”! # Sum the 2 elements of %xmm1: # # Low 8B %xmm1 --> high 8B %xmm0 SHUFPD %xmm0, %xmm1 # High 8B of %xmm0 gets sum ADDPD %xmm0, %xmm1
One possible SSE-2 6x6 A*x ● %xmm0 <- (dest(0), 0) ● 6 MOVAPD: interleave matrix row pairs and src vector pairs ● Update indices ● 3x (MULPD, then ADDPD to %xmm0) ● Sum elems of %xmm0 – (SHUFPD and ADDPD) ● Extract and store sum
SSE-2: gcc and Intel C compilers won't vectorize! ● Use SIMD registers for scalar math! – SSE-2 latency: 1 cycle less than x87 – x87 uses same fn unit as SIMD anyway ● Vector reduce sub-optimal? – Fewer ops: less latency-hiding potential – Only 8 XMM regs: Can't unroll ● Col-major suboptimal – No scalar * vector instruction! ● Or the alignment issue...
“Small matrix library” ● From Intel: Matrix * vector ● Optimized for 6x6 or less ● Idea: – Replace Sparsity's explicit (BLAS-1-like) register block multiplication... –...with optimized function (BLAS-2-like) ● We're working on this ● Needed to say if SIMD valuable
SIMD load: alignment ● Possible reason for no automatic vectorization – Load pair needs alignm. on 16B bdys – Non-aligned load: slower – Compiler can't guarantee alignment ● Itanium 2: Same issue reappears...
SSE-2 results: Disappointing ● Pentium M: gains nothing ● Pentium 4: actually gains a little – SSE-2 1 cycle lower latency than x87 – Small blocks: latency dominates – x87 ISA harder to schedule ● AMD Opteron not available for testing – 16 XMM regs (vs. 8): better unrolling capability?
How SSE-2 should look: STREAM Scale b[0:N-1] = scalar * c[0:N-1] (speedup 1.72) Loop: movapd c(%eax), %xmm4 mulpd %xmm0, %xmm4 movntpd %xmm4, b(%eax) addl $16, %eax cmpl $ , %eax jl Loop
NetBurst (Pentium 4,M arch) (Note: diagram used w/out permission)
Can NetBurst keep up with DAXPY? ● One cycle: – 1 load aligned pair, 1 store aligned pair, 1 SIMD flop (alternate ADDPD/MULPD) ● DAXPY (in row-major): Triad - like – y(i) = y(i) + A(i,j) * x(j) – If y(i) loaded: 2 lds, 1 mul, 1 add, 1 store ● Ratio of loads to stores inadequate? – Itanium 2 changes this...
Itanium 2: Streaming fl-pt ● NO SSE-2 support!!! ● BUT: In 1 cycle: 2 MMF bundles: – 2 load pair (4 loads), 2 stores – 2 FMACs (a + s * b) ● (Or MFI: Load pair, FMAC, update idx) ● 1 cycle: theoretically 2x DAXPY!
Itanium 2: Alignment strikes again! ● Intel C Compiler won't generate “load pair” instructions!!! ● Why? – ldfpd (“load pair”) needs aligned data – Compiler doesn't see underlying dense BLAS 2 structure? – Register pressure?
SIMD conclusions: ● STREAM Triad suggests modest potential speedup ● Multiple scalar functional units: – More flexible than SIMD: Speedup independent of orientation ● Code scheduling difficult – Pragmas to tell compiler data is aligned – Encapsulate block A*x in hand-coded routine
Conclusions: ● Our benchmark: – Good SMVM performance prediction – Scales for any typical uniprocessor ● With “optimal” block sizes: – Performance tied to memory bandwidth ● With 1x1 blocks: – Performance related more to latency
Conclusions (2): ● SIMD: Need to test custom mini dense matrix * vector routines ● Development will continue after this semester: – More testing – Parallelization