1
Avoiding Communication in Sparse Matrix-Vector Multiply (SpMV)
Sequential and shared-memory performance is dominated by off-chip communication.
Distributed-memory performance is dominated by network communication.
The problem: SpMV has low arithmetic intensity.
2
SpMV Arithmetic Intensity (1)
Example: a tridiagonal A with dimension n = 5 and nnz = 3n - 2 nonzeros.
SpMV floating-point operations: 2⋅nnz
Floating-point words moved: nnz + 2⋅n
Assumption: A is invertible ⇒ nonzero in every row ⇒ nnz ≥ n.
The 2⋅nnz count overcounts flops by up to n (e.g., a diagonal A needs n multiplies and no adds).
3
SpMV Arithmetic Intensity (2)
[Figure: arithmetic intensity spectrum, increasing from O(1) to O(lg n) to O(n) flops per byte: SpMV and BLAS 1/2, stencils (PDEs), lattice methods; FFTs; dense linear algebra (BLAS3), particle methods.]
Arithmetic intensity := total flops / total DRAM bytes.
This is an upper bound based on compulsory traffic; it is further diminished by conflict or capacity misses.
SpMV: flops = 2⋅nnz, words moved = nnz + 2⋅n, so the arithmetic intensity is at most about 2 flops per word.
4
SpMV Arithmetic Intensity (3)
In practice, storing A requires at least nnz words of indexing data and zero padding beyond its values; how much more depends on
- the nonzero structure, e.g., banded or dense blocks,
- the data structure, e.g., CSR/C, COO, SKY, DIA, JDS, ELL, DCS/C, …, and their blocked generalizations,
- optimizations, e.g., index compression or variable block splitting.
[Figure: roofline for the Opteron 2356 (Barcelona): attainable Gflop/s versus actual flop:byte ratio, bounded by stream bandwidth and by the peak double-precision floating-point rate.]
With at most 2 flops per word of data and 8 bytes per double, the flop:byte ratio is ≤ 1/4: SpMV can't beat 1/16 of peak (worked out below).
How to do more flops per byte? Reuse data (x, y, and A) across multiple SpMVs.
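The 1/16-of-peak bound follows from the two numbers above, assuming (as the roofline suggests, but the slide does not state explicitly) a machine balance of roughly 4 flops per byte for this Opteron:

\[
\frac{\text{flops}}{\text{bytes}}
  \;\le\; \frac{2\ \text{flops per matrix word}}{8\ \text{bytes per double}}
  \;=\; \frac{1}{4},
\qquad
\frac{\text{attainable}}{\text{peak}}
  \;\le\; \frac{\tfrac14 \times \text{stream bandwidth}}{4 \times \text{stream bandwidth}}
  \;=\; \frac{1}{16}.
\]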
5
Combining multiple SpMVs
(1) k independent SpMVs. Used in: block Krylov methods; Krylov methods for multiple systems (AX = B).
(2) k dependent SpMVs. Used in: s-step Krylov methods and communication-avoiding Krylov methods, to compute k Krylov basis vectors.
(3) k dependent SpMVs, in-place variant. Used in: multigrid smoothers, the power method. Related to the Streaming Matrix Powers optimization for CA-Krylov methods.
Def. Krylov space (given A, x, s): K_s(A, x) := span{x, Ax, A^2 x, …, A^(s-1) x}.
What if we can amortize the cost of reading A over k SpMVs (k-fold reuse of A)? A sketch of the three access patterns follows.
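A minimal sketch of the three access patterns in Python/SciPy (the function names are mine, for illustration only); the communication-avoiding kernels on the next slides reorganize loops (2) and (3) so that A is streamed once:

```python
import numpy as np
import scipy.sparse as sp

def k_independent_spmvs(A, X):
    # (1) Y[:, i] = A @ X[:, i] for k independent source vectors (block Krylov, AX = B).
    # One sparse-matrix-times-dense-block product: A can be streamed once (SpMM).
    return A @ X

def k_dependent_spmvs(A, x, k):
    # (2) Krylov basis vectors [A x, A^2 x, ..., A^k x]; each SpMV depends on the previous one.
    V = []
    for _ in range(k):
        x = A @ x
        V.append(x)
    return np.column_stack(V)

def k_dependent_spmvs_in_place(A, x, k):
    # (3) In-place variant: keep only A^k x (power method, multigrid-smoother style).
    for _ in range(k):
        x = A @ x
    return x
```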
6
(1) k independent SpMVs (SpMM)
SpMM optimization: compute row-by-row, streaming A only once (sketch below).

1 SpMV: flops 2⋅nnz; words moved nnz + 2n; arith. intensity 2
k independent SpMVs (naive): flops 2k⋅nnz; words moved k⋅nnz + 2kn; arith. intensity 2
k independent SpMVs (using SpMM): flops 2k⋅nnz; words moved 1⋅nnz + 2kn; arith. intensity 2k
(arithmetic intensities for nnz = ω(n))
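A sketch of the row-by-row SpMM loop (plain Python over a CSR layout, written for clarity rather than speed): every nonzero of A is loaded once and applied to all k source vectors.

```python
import numpy as np
import scipy.sparse as sp

def spmm_rowwise(A_csr, X):
    """Y = A @ X for CSR A and an n-by-k block of dense source vectors X.
    The rows of A are streamed exactly once; each nonzero is reused k times."""
    n, k = A_csr.shape[0], X.shape[1]
    indptr, indices, data = A_csr.indptr, A_csr.indices, A_csr.data
    Y = np.zeros((n, k))
    for i in range(n):
        for jj in range(indptr[i], indptr[i + 1]):
            Y[i, :] += data[jj] * X[indices[jj], :]   # k-fold reuse of each matrix word
    return Y

# quick check against SciPy
A = sp.random(200, 200, density=0.05, format="csr", random_state=0)
X = np.random.default_rng(0).standard_normal((200, 4))
assert np.allclose(spmm_rowwise(A, X), A @ X)
```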
7
(2) k dependent SpMVs (Akx)
Naïve algorithm (no reuse): k repeated SpMVs.
Akx optimization: must satisfy the data dependencies while keeping the working set in cache.

1 SpMV: flops 2⋅nnz; words moved nnz + 2n; arith. intensity 2
k dependent SpMVs (naive): flops 2k⋅nnz; words moved k⋅nnz + 2kn; arith. intensity 2
k dependent SpMVs (using Akx): flops 2k⋅nnz; words moved 1⋅nnz + (k+1)n; arith. intensity 2k
(arithmetic intensities for nnz = ω(n))
8
(2) k dependent SpMVs (Akx)
Akx algorithm (reuse the nonzeros of A): stream A once while producing all k vectors, keeping the active parts of the vectors in cache. Costs are as in the table above; a sketch for the stencil case follows.
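The blocking idea is easiest to see for a banded matrix. Below is a minimal sequential sketch for a 3-point stencil (tridiag(1, 2, 1) with zero boundaries, standing in for a general sparse A, and chosen by me for illustration): each block of rows is computed for all k levels from a copy of x extended by k ghost entries per side, so the matrix/stencil rows for that block are touched once, at the price of redundant work in the halo.

```python
import numpy as np

def stencil_spmv(x):
    # y = A x for A = tridiag(1, 2, 1) with zero (Dirichlet) boundaries
    y = 2.0 * x
    y[1:]  += x[:-1]
    y[:-1] += x[1:]
    return y

def akx_blocked(x, k, block=1024):
    """Return V with V[j] = A^(j+1) x, computed one block of rows at a time.
    Only entries at distance >= j+1 from a truncated window edge are harvested,
    so the implicit zero padding at the window edges never contaminates them."""
    n = len(x)
    V = np.empty((k, n))
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        lo, hi = max(i0 - k, 0), min(i1 + k, n)   # ghost zone of width k on each side
        w = x[lo:hi].copy()
        for j in range(k):
            w = stencil_spmv(w)                   # the band of valid entries shrinks by 1 per level
            V[j, i0:i1] = w[i0 - lo:i1 - lo]
    return V

# check against k plain SpMVs
x = np.random.default_rng(0).standard_normal(10_000)
V = akx_blocked(x, k=4, block=1000)
w = x.copy()
for j in range(4):
    w = stencil_spmv(w)
    assert np.allclose(V[j], w)
```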
9
(3) k dependent SpMVs, in-place (Akx, last-vector-only)
Last-vector-only Akx optimization: reuses the matrix and the vector k times, instead of once, and overwrites intermediates without extra memory traffic.
Attains O(k) reuse even when nnz < n, e.g., when A is a stencil (implicit values and structure).

1 SpMV: flops 2⋅nnz; words moved nnz + 2n; arith. intensity 2
k dependent SpMVs, in-place (naive): flops 2k⋅nnz; words moved k⋅nnz + 2kn; arith. intensity 2
Akx, last-vector-only: flops 2k⋅nnz; words moved 1⋅nnz + 2n; arith. intensity 2k
(arithmetic intensities for any nnz)
10
Combining multiple SpMVs (summary of sequential results)
For each problem: flops, words moved (naive → optimized), the optimization used, and the relative bandwidth savings as n, nnz ⟶ ∞:

SpMV: flops 2⋅nnz; words moved nnz + 2n; no optimization.
k independent SpMVs: flops 2k⋅nnz; words moved k⋅nnz + 2kn → nnz + 2kn via SpMM; savings k if nnz = ω(n), ≤ min(c, k) if nnz = c⋅n, 1 if nnz = o(n).
k dependent SpMVs: flops 2k⋅nnz; words moved k⋅nnz + 2kn → nnz + (k+1)n via Akx; savings k if nnz = ω(n), 2 if nnz = o(n).
k dependent SpMVs, in-place: flops 2k⋅nnz; words moved k⋅nnz + 2kn → nnz + 2n via Akx, last-vector-only; savings k for any nnz.
11
Avoiding Serial Communication
[Figure: roofline for the Opteron 2356 (Barcelona) again: attainable Gflop/s versus actual flop:byte ratio, bounded by stream bandwidth and peak double-precision rate.]
Reduce compulsory misses by reusing data: more efficient use of memory, decreased bandwidth cost (Akx, asymptotically).
Must also consider the latency cost: how many cache lines? That depends on contiguous accesses.
When k = 16 ⇒ compute-bound? Only if we also:
- fully utilize the memory system,
- avoid additional memory traffic such as capacity and conflict misses,
- fully utilize in-core parallelism.
(Note: this still assumes no indexing data.)
In practice there are complex performance tradeoffs: autotune to find the best k?
12
On being memory bound
Assume that off-chip communication (cache to memory) is the bottleneck, e.g., that we express sufficient ILP to hide hits in L3.
When your multicore performance is bound by memory operations, is it because of latency or bandwidth?
- Latency-bound: the expressed concurrency times the memory access rate does not fully utilize the memory bandwidth (traversing a linked list, pointer-chasing benchmarks).
- Bandwidth-bound: the expressed concurrency times the memory access rate exceeds the memory bandwidth (SpMV, stream benchmarks).
Either way, it manifests as pipeline stalls on loads/stores (suboptimal throughput).
Caches can improve memory bottlenecks; exploit them whenever possible. They avoid memory traffic when you have temporal or spatial locality, but increase memory traffic when cache-line entries go unused (no locality).
Prefetchers can allow you to express more concurrency: they hide memory traffic when your access pattern has sequential locality (clustered or regularly strided access patterns).
13
Distributed-memory parallel SpMV
Harder to make general statements about performance:
- Many ways to partition x, y, and A across P processors.
- Communication, computation, and load balance are partition-dependent.
- What fits in cache? (What is "cache"?!)
A parallel SpMV involves 1 or 2 rounds of messages:
- (Sparse) collective communication, costly synchronization.
- Latency-bound (hard to saturate network bandwidth).
- Scatter entries of x and/or gather entries of y across the network.
k SpMVs cost O(k) rounds of messages. Can we do k SpMVs in one round of messages?
- k independent vectors? SpMM generalizes: distribute all source vectors in one round of messages, and avoid further synchronization.
- k dependent vectors? Akx generalizes: distribute the source vector plus additional ghost-zone entries in one round of messages.
- Last-vector-only Akx ≈ standard Akx in parallel: no savings from discarding intermediates.
14
Distributed-memory parallel Akx
Example: tridiagonal matrix, k = 3, n = 40, p = 4.
Naïve algorithm: k messages per neighbor (one per SpMV).
Akx optimization: 1 message per neighbor, carrying the k ghost-zone entries needed to compute all k levels locally. A simulated sketch follows.
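A simulated version of the slide's example (no MPI; each loop iteration plays the role of one processor). The point is that the whole exchange is one k-entry ghost-zone message per neighbor instead of k single-entry messages.

```python
import numpy as np

n, p, k = 40, 4, 3                              # tridiagonal A, k = 3, n = 40, p = 4
x = np.random.default_rng(0).standard_normal(n)

def stencil_spmv(x):
    # y = A x for the tridiagonal stencil A = tridiag(1, 2, 1), zero boundaries
    y = 2.0 * x
    y[1:]  += x[:-1]
    y[:-1] += x[1:]
    return y

pieces = []
for rows in np.array_split(np.arange(n), p):    # each processor owns 10 contiguous rows
    i0, i1 = rows[0], rows[-1] + 1
    lo, hi = max(i0 - k, 0), min(i1 + k, n)     # k ghost entries per neighbor,
    w = x[lo:hi].copy()                         # received in ONE message up front
    for _ in range(k):                          # then k local SpMVs, no further messages;
        w = stencil_spmv(w)                     # halo entries are recomputed redundantly
    pieces.append(w[i0 - lo:i1 - lo])           # locally owned entries of A^k x

ref = x.copy()
for _ in range(k):                              # naive: k rounds of neighbor messages
    ref = stencil_spmv(ref)
print(np.allclose(np.concatenate(pieces), ref))   # True
```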
15
Polynomial Basis for Akx
Given A, x, and k > 0, compute the basis vectors p_j(A)·x for j = 0, …, k, where p_j(A) is a degree-j polynomial in A. Choose the polynomials p_j for stability.
Today we considered the special case of the monomials, p_j(A) = A^j:
- Stability problems: the basis tends to lose linear independence.
- The basis vectors converge to the principal eigenvector.
A small numerical illustration follows.
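A small numerical illustration of why the monomial basis is troublesome (the synthetic SPD matrix with eigenvalues in (0, 1] is mine; the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 12
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(0.01, 1.0, n)) @ Q.T   # SPD test matrix, |lambda(A)| <= 1
x = rng.standard_normal(n)

# Monomial Krylov basis [x, Ax, ..., A^s x], with columns normalized:
# successive vectors align with the dominant eigenvector, so the basis
# rapidly loses numerical linear independence.
V = [x / np.linalg.norm(x)]
for _ in range(s):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))
V = np.column_stack(V)
print(f"condition number of the monomial basis: {np.linalg.cond(V):.2e}")

# The remedy sketched on the slide: compute p_j(A) x for a better-conditioned
# polynomial family (e.g. Newton or Chebyshev polynomials) instead of A^j x.
```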
16
Tuning space for Akx

Distributed memory optimizations: topology-aware sparse collectives; hypergraph partitioning; dynamic load balancing; overlapped communication and computation.
DLP optimizations: vectorization.
ILP optimizations: software pipelining; loop unrolling; eliminating branches, inlining functions.
TLP optimizations: explicit SMT.
Memory system optimizations: NUMA-aware affinity; software prefetching; TLB blocking.
Memory traffic optimizations: streaming stores (cache bypass); array padding; cache blocking; index compression; blocked sparse formats; stanza encoding.
Algorithmic variants: compositions of distributed-memory parallel, shared-memory parallel, and sequential algorithms; streaming or explicitly buffered workspace; explicit or implicit cache blocks; avoiding redundant computation/storage/traffic; last-vector-only optimization; removing low-rank components (blocking covers); different polynomial bases p_j(A).
Other: preprocessing optimizations; extended precision arithmetic; scalable data structures (sparse representations); dynamic value and/or pattern updates.
17
Krylov subspace methods (1)
Want to solve Ax = b (still assuming A is invertible).
How accurately can you hope to compute x? It depends on the condition number of A and on the accuracy of your inputs A and b.
Condition number with respect to matrix inversion:
- cond(A): how much A distorts the unit sphere (in some norm).
- 1/cond(A): how close A is to a singular matrix.
- Expect to lose log10(cond(A)) decimal digits relative to the (relative) input accuracy.
Idea: make successive approximations, and terminate when the accuracy is sufficient.
How good is an approximation x0 to x?
- Error: e0 = x0 - x. If you knew e0, you could compute x = x0 - e0 (and you would be done). Finding e0 is as hard as finding x, so assume you never have e0.
- Residual: r0 = b - Ax0. r0 = 0 ⇔ e0 = 0, but a small residual does not by itself imply a small error.
- cond(A) small ⇒ (r0 small ⇒ e0 small). The identity behind this is written out below.
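The standard identity behind the last bullets (not spelled out on the slide): the residual determines the error through A, and the condition number controls how much the relative residual can understate the relative error:

\[
r_0 \;=\; b - A x_0 \;=\; -A e_0
\quad\Longrightarrow\quad
e_0 \;=\; -A^{-1} r_0,
\qquad
\frac{\lVert e_0\rVert}{\lVert x\rVert} \;\le\; \operatorname{cond}(A)\,\frac{\lVert r_0\rVert}{\lVert b\rVert}.
\]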
18
Krylov subspace methods (2)
Given an approximation xold, refine it by adding a correction: xnew = xold + v. Pick v as the best possible choice from a search space V.
Krylov subspace methods:
1. V := K_m(A, r0) = span{r0, A r0, …, A^(m-1) r0}, the Krylov space generated by the current residual.
2. Expand V by one dimension.
3. xold = xnew. Repeat.
Once dim(V) = dim(A) = n, xnew should be exact.
Why Krylov subspaces?
- Cheap to compute (via SpMV).
- The search spaces V coincide with the residual spaces, which makes it cheaper to avoid repeating search directions.
- K(A, z) = K(c1⋅A - c2⋅I, c3⋅z) ⇒ invariant under scaling and translation. Without loss of generality, assume |λ(A)| ≤ 1.
- As s increases, K_s gets closer to the dominant eigenvectors of A. Intuitively, corrections v should target the largest-magnitude residual components.
19
Convergence of Krylov methods
Convergence = the process by which the residual goes to zero. If A isn't too poorly conditioned, the error should then be small.
Convergence is governed only by the angles θm between the spaces Km and A⋅Km: how fast does sin(θm) go to zero?
Not eigenvalues! You can construct a unitary system that produces the same sequence of residuals r0, r1, …
If A is normal, λ(A) provides bounds on convergence.
Preconditioning: transforming A in the hope of "improving" λ(A) or cond(A).
20
Conjugate Gradient (CG) Method
Given a starting approximation x0 to Ax = b, let p0 := r0 := b - Ax0. For m = 0, 1, 2, … until convergence, do:
- correct the candidate solution along the search direction,
- update the residual according to the new candidate solution,
- expand the search space.
Vector iterates: xm = candidate solution, rm = residual, pm = search direction.
Communication-bound: 1 SpMV and 2 dot products per iteration.
Plan: reformulate to use Akx, and do something about the dot products. The recurrences are written out below.
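The loop body appears on the slide only as bullets; here are the standard CG recurrences in Python, with comments mapping each line to the bullets and to the communication count (1 SpMV and 2 dot products per iteration). Works with a SciPy sparse A.

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, maxiter=1000):
    x = x0.copy()
    r = b - A @ x                  # residual r_0
    p = r.copy()                   # search direction p_0
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                 # the 1 SpMV per iteration
        alpha = rr / (p @ Ap)      # dot product #1
        x = x + alpha * p          # correct candidate solution along search direction
        r = r - alpha * Ap         # update residual according to new candidate solution
        rr_new = r @ r             # dot product #2
        if np.sqrt(rr_new) <= tol * bnorm:
            break
        p = r + (rr_new / rr) * p  # expand search space
        rr = rr_new
    return x
```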
21
Applying Akx to CG (1)
Ignore x, α, and β for now.
Unroll the CG loop s times (in your head) and observe that, for 0 ≤ j ≤ s, the iterates r_{m+j} and p_{m+j} lie in the span of two Krylov bases generated from p_m and r_m, i.e., two Akx calls (written out below).
This means we can represent r_{m+j} and p_{m+j} symbolically as linear combinations of those basis vectors, and perform the SpMV operations symbolically (the same holds for R_{j-1}): vectors of length n are replaced by coefficient vectors of length 2j+1.
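The formulas referred to above were images on the slide; the standard s-step observation they describe can be written as (a reconstruction, in the notation of the slides):

\[
p_{m+j},\; r_{m+j},\; x_{m+j}-x_m \;\in\;
\operatorname{span}\{p_m, A p_m, \dots, A^{s} p_m\}
\;+\;
\operatorname{span}\{r_m, A r_m, \dots, A^{s-1} r_m\},
\qquad 0 \le j \le s,
\]

so two Akx calls (degree s on p_m, degree s-1 on r_m, 2s+1 vectors in total) span everything CG touches for the next s steps.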
22
Applying Akx to CG (2)
6. Now substitute coefficient vectors for the vector iterates (e.g., for r). The SpMV is then performed symbolically by shifting coordinates within the Krylov basis, as checked below.
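A small numerical check of the "shift" identity for the monomial basis (a dense random A, chosen only to keep the check short): if v = V c with V = [x, Ax, …, A^s x] and c puts no weight on the top power, then A v = V (shift of c).

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 50, 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Krylov basis V = [x, Ax, ..., A^s x]
V = np.column_stack([np.linalg.matrix_power(A, j) @ x for j in range(s + 1)])

c = np.zeros(s + 1)
c[:s] = rng.standard_normal(s)       # degree < s, so A (V c) stays inside the basis
c_shift = np.zeros(s + 1)
c_shift[1:] = c[:s]                  # the "symbolic SpMV": shift coordinates up by one

print(np.allclose(A @ (V @ c), V @ c_shift))   # True
```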
23
Blocking CG dot products
7. Let's also compute the (2j+1)-by-(2j+1) Gram matrices of the basis vectors. Now we can perform all of CG's dot products symbolically, on the short coefficient vectors (see the two-line check below).
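The Gram-matrix trick in two lines (a random V stands in for the Krylov basis): after one reduction to form G, every dot product between iterates becomes a small computation on coefficient vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 11                      # m = 2j+1 basis vectors
V = rng.standard_normal((n, m))      # stands in for the Krylov basis
G = V.T @ V                          # Gram matrix: the only length-n reduction needed

c1, c2 = rng.standard_normal(m), rng.standard_normal(m)
u, v = V @ c1, V @ c2                # two vector iterates and their coefficient vectors
print(np.allclose(u @ v, c1 @ G @ c2))   # True: dot products done "symbolically"
```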
24
CA-CG
Take s steps of CG without communication.

Given an approximation x0 to Ax = b, let p0 := r0 := b - Ax0.
For m = 0, s, 2s, …, until convergence, do:
  1. Expand the Krylov basis, using the SpMM and Akx optimizations.  [communication: sequential and parallel]
  2. Represent the 2s+1 inner products of length n with a (2s+1)-by-(2s+1) Gram matrix.  [communication: sequential and parallel]
  3. For j = 0 to s - 1, do: take one step of the CG loop on coefficient vectors of length 2s+1 and 2s+2 that represent the length-n vector iterates; the SpMV operation is represented as a change of basis (here, a shift).  [no communication]
  4. Recover the length-n vector iterates.  [communication: sequential only]
End for.
A sketch of this loop follows.
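A minimal end-to-end sketch of the loop above, using the monomial basis. It mirrors the slide's structure (Akx/SpMM basis, one Gram matrix, s coefficient-vector CG steps, recovery of the length-n iterates) and is a sketch under those assumptions, not a reference implementation; for larger s the monomial basis needs the stabler polynomial bases discussed earlier.

```python
import numpy as np
import scipy.sparse as sp

def ca_cg(A, b, s=4, max_outer=1000, tol=1e-10):
    n = b.shape[0]
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    bnorm = np.linalg.norm(b)
    m = 2 * s + 1

    # Change-of-basis matrix B: A (V c) = V (B c) whenever c has no weight on the
    # top power of either block; true for the coefficients used in the s inner steps.
    B = np.zeros((m, m))
    for j in range(s):                # p-block: A^j p -> A^(j+1) p
        B[j + 1, j] = 1.0
    for j in range(s - 1):            # r-block: A^j r -> A^(j+1) r
        B[s + 2 + j, s + 1 + j] = 1.0

    for _ in range(max_outer):
        # Krylov basis V = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]  (Akx + SpMM;
        # the sequential and parallel communication step)
        P = [p]
        for _ in range(s):
            P.append(A @ P[-1])
        R = [r]
        for _ in range(s - 1):
            R.append(A @ R[-1])
        V = np.column_stack(P + R)
        G = V.T @ V                   # Gram matrix: one reduction / all-reduce

        # s CG steps on coefficient vectors of length 2s+1 (communication-free)
        pc = np.zeros(m); pc[0] = 1.0         # p     = V pc
        rc = np.zeros(m); rc[s + 1] = 1.0     # r     = V rc
        xc = np.zeros(m)                      # x_new = x_old + V xc
        for _ in range(s):
            wc = B @ pc                       # SpMV as a change of basis (a shift)
            alpha = (rc @ G @ rc) / (pc @ G @ wc)
            xc = xc + alpha * pc
            rc_new = rc - alpha * wc
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new

        # Recover the length-n iterates (sequential communication only)
        x = x + V @ xc
        r = V @ rc
        p = V @ pc
        if np.linalg.norm(r) <= tol * bnorm:
            break
    return x

# 1D Laplacian test problem (my choice of example)
n = 500
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = ca_cg(A, b, s=4)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```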
25
CA-CG complexity (1)

Kernel: s dependent SpMVs (1 source vector)
- Computation: 2s⋅nnz flops.
- Sequential communication: read s vectors of length n, write s vectors of length n, read A s times; bandwidth cost ≈ s⋅nnz + 2sn.
- Parallel communication: distribute 1 source vector s times.

Kernel: Akx (2 source vectors, both Akx and SpMM optimizations)
- Computation: 4s⋅nnz flops.
- Sequential communication: read 2 vectors of length n, write 2s-1 vectors of length n, read A once; bandwidth cost ≈ nnz + (2s+1)n.
- Parallel communication: distribute 2 source vectors once; communication volume and number of messages increase with s (ghost zones).
26
CA-CG complexity (2)

Kernel: 2s+1 dot products
- Computation (sequential): 2(2s+1)n flops ≈ 4sn.
- Computation (parallel): (2s+1)(2n+(p-1))/p ≈ 4sn/p.
- Sequential communication: read a vector of length n 2s+1 times.
- Parallel communication: 2s+1 all-reduce collectives, each with lg(p) rounds of messages: latency cost ≈ 2s⋅lg(p); 1 word to/from each processor: bandwidth cost ≈ 2s.

Kernel: Gram matrix
- Computation (sequential): (2s+1)²⋅n flops ≈ 4s²n.
- Computation (parallel): (2s+1)²(n/p + (p-1)/(2p)) ≈ 4s²n/p. Symbolic dot products cost an additional (2s+1)²(2s+3) flops ≈ 8s³.
- Sequential communication: read a matrix of size (2s+1)-by-n once.
- Parallel communication: 1 all-reduce collective with lg(p) rounds of messages: latency cost ≈ lg(p); (2s+1)²/2 words to/from each processor: bandwidth cost ≈ 4s².
27
CA-CG complexity (3)

Using the Gram matrix and coefficient vectors has additional costs for CA-CG:
- Dense work (besides the dot products / Gram matrix) does not increase with s: CG ≈ 6sn, CA-CG ≈ 3(2s+1)(2s+n) ≈ (6s+3)n.
- Sequential memory traffic decreases (by a factor of about 4.5): CG ≈ 6sn reads + 3sn writes, CA-CG ≈ (2s+1)n reads + 3n writes.

CG, s steps:
- Sequential flops: 2s⋅nnz + 10sn
- Sequential bandwidth: s⋅nnz + (13s+1)n
- Sequential latency: s⋅nnz/b + (13s+1)n/b
- Parallel flops: 2s⋅nnz/p + 10sn/p
- Parallel bandwidth: s⋅Expand(A) + 2s
- Parallel latency: (2s+1)⋅lg(p)

CA-CG(s), 1 step:
- Sequential flops: 4s⋅nnz + (4s+10)sn
- Sequential bandwidth: nnz + (6s+6)n
- Sequential latency: nnz/b + (6s+6)n/b
- Parallel flops: 4s⋅nnz/p + (4s+10)sn/p
- Parallel bandwidth: Expand(|A|^s) + Expand(|A|^(s-1)) + 4s²
- Parallel latency: lg(p)

(b = cache line size, p = number of processors)
32
CA-CG tuning

Performance optimizations:
- 3-term recurrence CA-CG formulation: avoid the auxiliary vector p. 2x decrease in Akx flops, bandwidth cost, and serial latency cost (vectors only; A is already optimal). 25% decrease in Gram matrix flops, bandwidth cost, and serial latency cost. Roughly equivalent dense flops.
- Streaming Akx formulation: interleave Akx and Gram matrix construction, then interleave Akx and vector reconstruction. 2x increase in Akx flops; factor-of-s decrease in Akx sequential bandwidth and latency costs (vectors only; A is already optimal); 2x increase in Akx parallel bandwidth and latency. Eliminates the Gram matrix sequential bandwidth and latency costs, and all dense bandwidth costs other than 3n writes. Decreases overall sequential bandwidth and latency by O(s), regardless of nnz. (Can also interleave Akx and Gram matrix construction without the extra work; this decreases sequential bandwidth and latency by 33% rather than by a factor of s.)

Stability optimizations:
- Scaled/shifted 2-term recurrence for Akx: increases Akx flops by (2s+1)n and dense flops by an O(s²) term.
- Scaled/shifted 3-term recurrence for Akx: increases Akx flops by 4sn.
- Extended precision: constant-factor cost increases.
- Restarting: constant-factor increase in Akx cost.
- Preconditioning: structure-dependent costs.
- Rank-revealing factorizations, reorthogonalization, etc.