1
Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder
Caltech ASCI Center
2
Why Use the GPU?
Semiconductor trends
–cost
–wires vs. compute
–Stanford streaming supercomputer
Parallelism
–many functional units
–graphics is prime example
Harvesting this power
–what application suitable?
–what abstractions useful?
History
–massively parallel SIMD machines
–media processing
[Figures: chart with "Possible" and "Actual" curves, courtesy Bill Dally; Imagine stream processor (Bill Dally, Stanford); Connection Machine CM2 (Thinking Machines)]
3
Contributions and Related Work
Contributions
–numerical algorithms on GPU
    unstructured grids: conjugate gradients
    regular grids: multigrid
–what abstractions are needed?
Numerical algorithms
–Goodnight et al. 2003 (MG)
–Hall et al. 2003 (cache)
–Harris et al. 2002 (FD sim.)
–Hillesland et al. 2003 (optimization)
–Krueger & Westermann 2003 (NLA)
–Strzodka (PDEs)
4
Streaming Model
Abstract model
–Purcell, et al. 2002
–data structures: streams
–algorithms: kernels
Concrete model
–render a rectangle
–data structures: textures
–algorithms: fragment programs
[Diagram, abstract model: a kernel reads an input record stream and globals and writes an output record stream]
[Diagram, concrete model: rasterizer (set up texture indices and all associated data); fragment program (for all pixels in parallel); texture as read-only memory; output goes to texture; bind buffer to texture]
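The abstract model can be illustrated with a minimal Python sketch. It assumes a kernel is a pure function of one input record plus read-only globals; the names `run_kernel` and `scale_add_kernel` are hypothetical and do not come from the slides.

```python
def run_kernel(kernel, input_stream, kernel_globals):
    """Apply `kernel` independently to every record of the input stream.

    Each output depends only on its own input record and the read-only
    globals, so the loop could run for all records in parallel, as a
    fragment program does for all pixels.
    """
    return [kernel(record, kernel_globals) for record in input_stream]


def scale_add_kernel(record, kernel_globals):
    # One record is an (x, y) pair; the global `alpha` is shared by all records.
    x, y = record
    return kernel_globals["alpha"] * x + y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
out = run_kernel(scale_add_kernel, list(zip(xs, ys)), {"alpha": 2.0})
```

In the concrete model the list comprehension corresponds to rendering a rectangle, the input list to textures, and the returned list to the texture that receives the output.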
5
Sparse Matrices: Geometric Flow
Ubiquitous in numerical computing
–discretization of PDEs: animation
    finite elements, difference, volumes
–optimization, editing, etc., etc.
Example here:
–processing of surfaces
Canonical non-linear problem
–mean curvature flow
–implicit time discretization
    solve sequence of SPD systems
[Figure: velocity opposite mean curvature normal]
6
Conjugate Gradients
High level code
–inner loop
–matrix-vector multiply
–sum-reduction
–scalar-vector MAD
Inner product
–fragment-wise multiply
–followed by sum-reduction
–odd dimensions can be handled
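The building blocks listed above can be sketched in plain Python. This is a textbook conjugate gradient loop, not the authors' fragment programs: the matrix is dense for brevity, the reduction is one-dimensional, and padding odd lengths with a zero is an assumption about how odd dimensions might be handled. All function names are hypothetical.

```python
def sum_reduce(values):
    """Pairwise sum-reduction: each pass halves the length, like a GPU pass."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        if len(vals) % 2 == 1:
            vals.append(0.0)      # assumption: pad odd lengths with a zero
        half = len(vals) // 2
        vals = [vals[i] + vals[i + half] for i in range(half)]
    return vals[0]


def dot(u, v):
    # Inner product = element-wise multiply followed by sum-reduction.
    return sum_reduce(a * b for a, b in zip(u, v))


def mad(alpha, u, v):
    # Scalar-vector multiply-add: alpha * u + v.
    return [alpha * a + b for a, b in zip(u, v)]


def matvec(A, x):
    # Dense stand-in for the sparse matrix-vector multiply.
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = [0.0] * len(b)
    r = list(b)                   # residual b - A x for the initial guess x = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = mad(alpha, p, x)      # x <- x + alpha * p
        r = mad(-alpha, Ap, r)    # r <- r - alpha * A p
        rr_new = dot(r, r)
        p = mad(rr_new / rr, p, r)
        rr = rr_new
    return x


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # small SPD example
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b)
```

Each iteration uses exactly one matrix-vector multiply, two sum-reductions, and three scalar-vector MADs, which is why those three kernels dominate the cost.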
7
y = Ax
[Figure legend]
A_j – off-diagonal matrix elements
R – pointers to segments
8
Row-Vector Product
[Figure legend]
X – vector elements
R – pointers to segments
A_i – diagonal matrix elements
J – pointers to x_j
A_j – off-diagonal matrix elements
[Figure label: fragment program]
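A minimal Python sketch of a row-vector product over the arrays named in the legend. The slides only say that R holds "pointers to segments"; storing one extra end pointer (as in a CSR-style layout) is an assumption here, and the parameter names are hypothetical.

```python
def sparse_matvec(diag, seg_ptr, col_idx, off_diag, x):
    """Compute y = A x with the diagonal stored separately from the rest.

    diag[i]     : diagonal entry of row i                          (A_i)
    seg_ptr[i]  : start of row i's segment; seg_ptr[i + 1] its end (R)
    col_idx[k]  : column index j of the k-th off-diagonal entry    (J)
    off_diag[k] : value of the k-th off-diagonal entry             (A_j)
    """
    y = []
    for i in range(len(diag)):
        acc = diag[i] * x[i]
        for k in range(seg_ptr[i], seg_ptr[i + 1]):
            acc += off_diag[k] * x[col_idx[k]]   # dependent fetch through J
        y.append(acc)
    return y


# A = [[4, 1, 0],
#      [1, 3, 1],
#      [0, 1, 2]]
diag = [4.0, 3.0, 2.0]
seg_ptr = [0, 1, 3, 4]
col_idx = [1, 0, 2, 1]
off_diag = [1.0, 1.0, 1.0, 1.0]
y = sparse_matvec(diag, seg_ptr, col_idx, off_diag, [1.0, 2.0, 3.0])
```

The inner loop length differs from row to row, which is the reason the following slides group rows with similar non-zero counts into batches.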
9
Apply to All Pixels
Two extremes
–one row at a time: setup overhead
–all rows at once: limited by worst row
Middle ground
–organize “batches” of work
How to arrange batches?
–order rows by non-zero entries
    optimal packing is NP-hard
We choose fixed-size rectangles
–fragment pipe is quantized
–simple experiments reveal best size
    26 x 18: 91% efficient
    wasted fragments on diagonal
[Plot: time vs. area (pixels)]
10
Packing (Greedy)
[Figure: non-zero entries per row (values between 4 and 15 in the example), shown before and after sorting into batches]
Each batch is bound to an appropriate fragment program.
All this setup is done only once, at the beginning of time. It depends only on mesh connectivity.
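A greedy packing along these lines can be sketched in Python. The sketch measures batches in rows rather than in pixel rectangles, and the efficiency it reports counts only wasted multiply-adds inside a batch; both simplifications, and the names `pack_rows` and `batch_size`, are assumptions for illustration.

```python
def pack_rows(nnz_per_row, batch_size):
    """Sort rows by descending non-zero count and cut them into fixed batches.

    Returns (batches, efficiency). Every row in a batch runs the fragment
    program sized for the batch's largest row, so shorter rows waste work.
    """
    order = sorted(range(len(nnz_per_row)), key=lambda i: -nnz_per_row[i])
    batches = [order[s:s + batch_size] for s in range(0, len(order), batch_size)]
    useful = sum(nnz_per_row)
    # Rows are sorted, so the first row of each batch has the most non-zeros.
    issued = sum(nnz_per_row[batch[0]] * len(batch) for batch in batches)
    return batches, useful / issued


nnz = [9, 9, 8, 8, 7, 15, 13, 12, 7, 7, 6, 5]
batches, efficiency = pack_rows(nnz, batch_size=4)
```

Because the result depends only on the non-zero pattern, it can be computed once per mesh and reused for every solve.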
11
Recomputing Matrix
Matrix entries depend on surface
–must “render” into matrix
–two additional indirection textures
    previous and next
12
Results (NV30@500MHz)
37k elements
–matrix multiply
    33 instructions, 120 per second
    only 13 flops
    latency limited
–reduction
    7 inst/frag/pass, 3400 per second
–CG solve: 20 per second
13
Regular Grids
Poisson solver as example
–multigrid approach
–this time variables on “pixel grid”
    e.g.: Navier-Stokes
    after discretization: solve Poisson eq. at each time step
14
Poisson Equation
Appears all over the place
–easy to discretize on regular grid
–matrix multiply is stencil application
–FD Laplace stencil, centered at (i,j):
     0   1   0
     1  -4   1
     0   1   0
Use iterative matrix solver
–just need application of stencil
    easy: just like filtering
    incorporate geometry (Jacobian)
    variable coefficients
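Applying the finite-difference Laplace stencil is the whole "matrix multiply" on a regular grid. A minimal Python sketch, assuming unit grid spacing and leaving boundary values at zero (the boundary treatment and the name `apply_stencil` are assumptions):

```python
LAPLACE = [[0.0, 1.0, 0.0],
           [1.0, -4.0, 1.0],
           [0.0, 1.0, 0.0]]


def apply_stencil(grid, stencil):
    """Apply a 3x3 stencil to the interior points; boundary outputs stay 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = sum(stencil[di + 1][dj + 1] * grid[i + di][j + dj]
                            for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return out


# u(i, j) = i*i + j*j has a discrete Laplacian of exactly 4 at unit spacing.
u = [[float(i * i + j * j) for j in range(5)] for i in range(5)]
lap = apply_stencil(u, LAPLACE)
```

Variable coefficients would replace the constant `LAPLACE` table with per-point stencil weights, which is what storing the matrix in nine textures provides.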
15
Multigrid
Fine to coarse to fine cycle
–high freq. error removed quickly
–lower frequency error takes longer
Relax, Project, Interpolate
[Diagram labels: Relax, Projection, Interpolation]
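The relax, project, interpolate cycle can be sketched for a 1D Poisson problem in Python. This is a textbook V-cycle chosen for brevity, not the 2D GPU implementation from the talk; weighted Jacobi relaxation, full-weighting projection, linear interpolation, and all function names are assumptions.

```python
def residual(u, f, h):
    # r = f - A u for the 1D operator A = (-1, 2, -1) / h^2, zero boundaries.
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h)
    return r


def relax(u, f, h, sweeps=2, omega=2.0 / 3.0):
    # Weighted Jacobi: damps high-frequency error quickly.
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u


def project(r):
    # Fine to coarse with full weighting (1, 2, 1) / 4.
    return [(r[2 * i] + 2.0 * r[2 * i + 1] + r[2 * i + 2]) / 4.0
            for i in range((len(r) - 1) // 2)]


def interpolate(e, n_fine):
    # Coarse to fine with linear interpolation.
    fine = [0.0] * n_fine
    for i, ei in enumerate(e):
        fine[2 * i + 1] += ei
        fine[2 * i] += 0.5 * ei
        fine[2 * i + 2] += 0.5 * ei
    return fine


def v_cycle(u, f, h):
    if len(u) == 1:
        return [f[0] * h * h / 2.0]          # exact solve on the coarsest grid
    u = relax(u, f, h)
    coarse_r = project(residual(u, f, h))
    e = v_cycle([0.0] * len(coarse_r), coarse_r, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, interpolate(e, len(u)))]
    return relax(u, f, h)


n = 15                                       # 2**k - 1 interior points
h = 1.0 / (n + 1)
f = [1.0] * n
u = [0.0] * n
norms = []
for _ in range(5):
    u = v_cycle(u, f, h)
    norms.append(max(abs(r) for r in residual(u, f, h)))
```

The coarse grids handle the low-frequency error that relaxation alone removes slowly, so the residual norms in `norms` should shrink by a roughly constant factor per cycle.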
16
Computations and Storage Layout
Lots of stencil applications
–matrix multiply: 3x3 stencil
–projection: 3x3 stencil
         1  2  1
  1/16 × 2  4  2
         1  2  1
–interpolation: 2x2(!)
    floor op in indexing
Storage for matrices and DOFs
–variables in one texture
–matrices in 9(=3x3) textures
–all textures packed
    exploit 4 channels (x, y, z, w)
    domain decomp.
    padded boundary
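The two transfer stencils can be sketched in Python for a vertex-centered square grid of size 2m+1 (such as 257). The interpolation below is one plausible reading of "2x2 with a floor op in indexing": each fine point averages the coarse values at the floor and ceiling of half its index. Copying boundary values during projection is also an assumption, as are the function names.

```python
FULL_WEIGHT = [[1, 2, 1],
               [2, 4, 2],
               [1, 2, 1]]          # scaled by 1/16 when applied


def project(fine):
    """Fine (2m+1 x 2m+1) to coarse (m+1 x m+1); boundary values are copied."""
    m = (len(fine) - 1) // 2
    coarse = [[fine[2 * I][2 * J] for J in range(m + 1)] for I in range(m + 1)]
    for I in range(1, m):
        for J in range(1, m):
            coarse[I][J] = sum(
                FULL_WEIGHT[di + 1][dj + 1] * fine[2 * I + di][2 * J + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 16.0
    return coarse


def interpolate(coarse):
    """Coarse (m+1 x m+1) to fine (2m+1 x 2m+1) with a 2x2 stencil.

    i // 2 and (i + 1) // 2 are the floor and ceiling of i / 2: they coincide
    for even i (copy the coarse value) and differ for odd i (average).
    """
    n = 2 * (len(coarse) - 1) + 1
    return [[0.25 * (coarse[i // 2][j // 2]
                     + coarse[i // 2][(j + 1) // 2]
                     + coarse[(i + 1) // 2][j // 2]
                     + coarse[(i + 1) // 2][(j + 1) // 2])
             for j in range(n)] for i in range(n)]


ones = [[1.0] * 5 for _ in range(5)]
coarse_ones = project(ones)
ramp = [[float(I + J) for J in range(3)] for I in range(3)]
fine_ramp = interpolate(ramp)
```

With this indexing every fine point reads exactly four coarse values, so the same fragment program serves even and odd positions without branching.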
17
Coarser Matrices
Operator at coarser level
–needed for relaxation at all levels
Triple matrix product…
–work out terms and map to stencils
    exploit local support of stencils
    straightforward but t-e-d-i-o-u-s
[Figure: the coarse operator A_c written as a triple matrix product of the fine operator A_f with S and P]
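The triple product can be illustrated in 1D with small dense matrices in Python. The slides name the transfer operators S and P without stating which is which, so the sketch uses the neutral names `restriction` and `interpolation`; the 1D setting, the full-weighting and linear-interpolation choices, and all names are assumptions.

```python
def matmul(X, Y):
    # Plain dense matrix product of two lists of rows.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def fine_operator(n):
    # 1D stencil (-1, 2, -1) on n interior points, grid spacing taken as 1.
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def interpolation(n_coarse):
    # Linear interpolation: column J holds (1/2, 1, 1/2) around fine index 2J + 1.
    n_fine = 2 * n_coarse + 1
    P = [[0.0] * n_coarse for _ in range(n_fine)]
    for J in range(n_coarse):
        P[2 * J][J] = 0.5
        P[2 * J + 1][J] = 1.0
        P[2 * J + 2][J] = 0.5
    return P


def restriction(P):
    # Full weighting, taken here as half the transpose of the interpolation.
    return [[0.5 * P[i][J] for i in range(len(P))] for J in range(len(P[0]))]


P = interpolation(3)
R = restriction(P)
A_f = fine_operator(7)
A_c = matmul(matmul(R, A_f), P)      # coarse operator as a triple product
```

In this example `A_c` comes out as the stencil (-1, 2, -1) / 4, the same operator discretized at twice the grid spacing. On the GPU the dense products would be replaced by the worked-out stencil terms the slide mentions.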
18
Results (NV30@500MHz)
257x257 grid
–matrix multiply: 27 instructions, 1370 per second
–interpolation: 10 inst.
–projection: 19 inst.
Overall performance
–257x257 at 80 fps!
19
Conclusions
Enhancements
–global registers for reductions
–texture fetch with offset
–rectangular texture border
–scalar versus vector problems
Where are we now?
–good streaming processor
–twice as fast as CPU implementation
–lots of room for improvement
Scientific computing compiler
–better languages! Brook? C*?
–manage layout in a buffer