
1 A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys
University of Virginia
Graphics Hardware 2003, July 26-27, San Diego, CA
Augmented by Klaus Mueller, Stony Brook University

2 General-Purpose GPU Programming
- Why do we port algorithms to the GPU?
- How much faster can we expect it to be, really?
- What is the challenge in porting?

3 Case Study
Problem: Implement a Boundary Value Problem (BVP) solver using the GPU
Could benefit an entire class of scientific and engineering applications, e.g.:
- Heat transfer
- Fluid flow

4 Related Work
- Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms
- Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
  - Very similar to our system
  - Developed concurrently
  - Complementary approach

5 Driving problem: Fluid mechanics simulation
- Problem domain is a warped disc discretized on a regular grid

6 BVPs: Background
- Boundary value problems are sometimes governed by PDEs of the form Lφ = f
  - L is some operator
  - φ is defined over the problem domain
  - f is a forcing function (source term)
- Given L and f, solve for φ

7 BVPs: Example - Heat Transfer
- Find a steady-state temperature distribution T in a solid of thermal conductivity k with thermal source S
- This requires solving a Poisson equation of the form k∇²T = -S
- This is a BVP where L is the Laplacian operator ∇²
- All our applications require a Poisson solver
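
On a uniform grid with spacing h (a standard second-order discretization; the slides do not give the grid spacing), each interior cell yields one linear equation: (T_E + T_W + T_N + T_S - 4 T_C) / h² = -S_C / k, so the discrete problem is a large, sparse linear system in the unknown cell temperatures.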

8 BVPs: Solving
- Most such problems cannot be solved analytically
- Instead, discretize onto a grid to form a set of linear equations, then solve:
  - Direct elimination
  - Gauss-Seidel iteration
  - Conjugate gradient
  - Strongly implicit procedures
  - Multigrid method

9 Multigrid method
- Iteratively corrects an approximation to the solution
- Operates at multiple grid resolutions
- Low-resolution grids are used to correct higher-resolution grids recursively
- Very fast, especially for large grids: O(n)

10 Multigrid method
- Use coarser grid levels to recursively correct an approximation to the solution
  - may converge slowly on the fine grid -> restrict to a coarse grid
  - pushes out long-wavelength errors quickly (single-grid solvers only smooth out high-frequency errors)
- Algorithm (see the sketch after this list):
  - smooth
  - compute the residual ε = Lφᵢ - f
  - restrict
  - recurse
  - interpolate
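
Below is a minimal CPU-side sketch of one V-cycle for lap(φ) = f, written in NumPy under assumptions of my own (power-of-two grid sizes, plain Jacobi smoothing, nearest-neighbour prolongation); the paper's GPU version instead smooths with red-black passes and interpolates bilinearly, so treat this only as an outline of the control flow.

```python
import numpy as np

def smooth(phi, f, h, iters=2):
    """Jacobi relaxation of the 5-point discretization of lap(phi) = f."""
    for _ in range(iters):
        phi[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                                  phi[1:-1, :-2] + phi[1:-1, 2:] -
                                  h * h * f[1:-1, 1:-1])
    return phi

def residual(phi, f, h):
    """r = f - lap(phi) on interior cells, zero on the boundary."""
    r = np.zeros_like(phi)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (
        phi[:-2, 1:-1] + phi[2:, 1:-1] + phi[1:-1, :-2] + phi[1:-1, 2:]
        - 4.0 * phi[1:-1, 1:-1]) / (h * h)
    return r

def restrict(r):
    """Average 2x2 blocks of the residual onto a grid of half the resolution."""
    return 0.25 * (r[0::2, 0::2] + r[1::2, 0::2] + r[0::2, 1::2] + r[1::2, 1::2])

def interpolate(e):
    """Prolong the coarse correction back to the fine grid (nearest neighbour here)."""
    return np.kron(e, np.ones((2, 2)))

def v_cycle(phi, f, h, level=0, max_level=4):
    phi = smooth(phi, f, h)                       # pre-smooth
    if level < max_level and min(phi.shape) > 4:
        r_coarse = restrict(residual(phi, f, h))  # restrict the residual
        e = v_cycle(np.zeros_like(r_coarse), r_coarse, 2 * h, level + 1, max_level)
        phi += interpolate(e)                     # apply the coarse-grid correction
    return smooth(phi, f, h)                      # post-smooth
```

For example, given a forcing array f of shape (128, 128), `v_cycle(np.zeros((128, 128)), f, 1.0 / 127)` performs one cycle; in practice V-cycles are repeated until the residual norm falls below a threshold (slide 16).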

11 Implementation - Overview
For each step of the algorithm (sketched below):
- Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)
- Set the target buffer for rendering
- Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)
- Render a grid-sized quad with multitexturing
[Diagram: source buffer texture -> fragment program -> render target buffer]
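
A schematic of that per-pass pattern; the helper functions below are stand-ins of my own that only record the sequence of operations, where the real implementation issues the corresponding OpenGL/p-buffer calls:

```python
log = []

def bind_texture(unit, name):    log.append(f"bind '{name}' to texture unit {unit}")
def set_render_target(name):     log.append(f"render into '{name}'")
def use_fragment_program(name):  log.append(f"activate '{name}' fragment program")
def draw_quad(n):                log.append(f"draw an {n}x{n} quad (one fragment per grid cell)")

def run_pass(kernel, target, sources, grid_size):
    for unit, buf in enumerate(sources):  # current solution, residual, source terms, ...
        bind_texture(unit, buf)
    set_render_target(target)             # buffer that receives this step's output
    use_fragment_program(kernel)          # smoothing, residual, restriction, or interpolation
    draw_quad(grid_size)

# One smoothing pass: read the front solution surface, write the back surface.
run_pass("smooth", "solution_back", ["solution_front", "operator_map", "red_black_map"], 257)
```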

12 Implementation - Overview

13 Input buffers
- Solution buffer: four-channel floating-point pixel buffer (p-buffer), sketched below
  - one channel each for solution, residual, source term, and a debugging term
  - toggle front and back surfaces used to hold the old and new solutions
- Operator map: contains the discretized operator (e.g., Laplacian)
- Red-black map: accelerates odd-even tests (see later)
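
A small sketch of that layout in NumPy terms (the names and grid size are mine): one float32 RGBA texel per grid cell, with front and back copies swapped between smoothing passes.

```python
import numpy as np

GRID = 257                                       # illustrative grid size
CHANNELS = ("solution", "residual", "source", "debug")

front = np.zeros((GRID, GRID, 4), dtype=np.float32)
back = np.zeros_like(front)

def swap(front, back):
    """Toggle which surface is read (texture) and which is written (render target)."""
    return back, front

front, back = swap(front, back)                  # after each smoothing pass
```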

14 Smoothing
- Jacobi method
  - consider one matrix row of the discretized system A x = b
  - calculate a new value for each solution-vector element: x_i = (b_i - Σ_{j≠i} a_ij x_j) / a_ii
  - in our application, the a_ij are the Laplacian (sparse matrix)
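
For a Laplacian row this update takes a particularly simple form; assuming a uniform grid spacing h (not stated on the slide), it reduces to averaging the four neighbours: φ_C ← (φ_E + φ_W + φ_N + φ_S - h²·f_C) / 4.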

15 Smoothing
- Also factor in the source term
- Use the red-black map to update only half of the grid cells in each pass (see the sketch below)
  - converges faster in practice
  - known as red-black iteration
  - requires two passes per iteration
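
A sketch of one red-black iteration for lap(φ) = f in NumPy (my own helper, not the paper's fragment program): cells are split by the parity of i + j, and each half is updated in its own pass, so the second pass already reads the values written by the first.

```python
import numpy as np

def red_black_iteration(phi, f, h):
    i, j = np.indices(phi.shape)
    interior = np.zeros_like(phi, dtype=bool)
    interior[1:-1, 1:-1] = True
    for parity in (0, 1):                      # red pass, then black pass
        mask = interior & ((i + j) % 2 == parity)
        neigh = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
                 np.roll(phi, 1, 1) + np.roll(phi, -1, 1))
        phi[mask] = 0.25 * (neigh[mask] - h * h * f[mask])
    return phi
```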

16 Calculate residual
- Apply the operator (Laplacian) and source term to the current solution: residual ε = k∇²T + S
- Store the result in the target surface
- Use an occlusion query to determine whether all solution fragments are below threshold (ε < threshold), as in the sketch below
  - occlusion query = true means all fragments are below threshold
  - this is an L∞ norm, which may be too strict
  - less strict norms (L1, L2) would require a reduction or a fragment accumulation register (not available yet); they could instead be computed on the CPU
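
A CPU analogue of this convergence test, for illustration only: the occlusion query effectively performs the L∞ test, while L1/L2 norms would need a reduction over all cells.

```python
import numpy as np

def norms(residual_grid):
    """Return (L1, L2, L-infinity) norms of a residual grid."""
    r = np.asarray(residual_grid)
    return np.mean(np.abs(r)), np.sqrt(np.mean(r * r)), np.max(np.abs(r))

def converged(residual_grid, threshold=1e-4):
    # True only if every cell is below the threshold, like occlusion query == true.
    return np.max(np.abs(residual_grid)) < threshold
```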

17 Multigrid reduction and refinement
- Average (restrict) the current residual into the coarser grid
- Iterate/smooth on the coarser grid, solving k∇²φ = -S
  - or restrict once more -> recursion
- Interpolate the correction back into the finer grid
  - use bilinear interpolation (see the sketch below)
- Update the fine grid with this correction
- Iterate/smooth on the fine grid
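
A minimal sketch of bilinear prolongation, assuming grid sizes of the form 2^k + 1 so that coarse points coincide with every other fine point (my own convention; the GPU version gets the same effect from bilinear texture reads):

```python
import numpy as np

def prolongate(coarse):
    """Bilinearly interpolate a coarse correction onto the next finer grid."""
    nc = coarse.shape[0]
    nf = 2 * (nc - 1) + 1
    fine = np.zeros((nf, nf))
    fine[0::2, 0::2] = coarse                                    # coincident points
    fine[1::2, 0::2] = 0.5 * (coarse[:-1, :] + coarse[1:, :])    # vertical midpoints
    fine[0::2, 1::2] = 0.5 * (coarse[:, :-1] + coarse[:, 1:])    # horizontal midpoints
    fine[1::2, 1::2] = 0.25 * (coarse[:-1, :-1] + coarse[1:, :-1] +
                               coarse[:-1, 1:] + coarse[1:, 1:]) # cell centres
    return fine
```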

18 Boundary conditions
- Dirichlet (prescribed value)
- Neumann (prescribed derivative)
- Mixed (coupled value and derivative)
  - U_k: value at grid point k
  - n_k: normal at grid point k
- Periodic boundaries result in a toroidal mapping
- Apply boundary conditions in the smoothing pass (a small example follows)
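
As a small illustration of how two of these conditions look on a discrete grid (my own NumPy helper, not the paper's shader): Dirichlet edges are pinned to a prescribed value, and a zero-flux Neumann edge copies its interior neighbour so the normal derivative is approximately zero.

```python
import numpy as np

def apply_boundaries(phi, dirichlet_value=0.0):
    """Dirichlet on the top/bottom rows, zero-derivative Neumann on the sides."""
    phi[0, :] = dirichlet_value      # prescribed value
    phi[-1, :] = dirichlet_value
    phi[:, 0] = phi[:, 1]            # copy neighbour -> normal derivative ~ 0
    phi[:, -1] = phi[:, -2]
    return phi
```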

19 Boundary conditions
- Only need to compute boundary conditions at the boundaries
  - boundary cells need significantly more computation
  - so restrict those computations to the boundaries
- GPUs do not allow branching
  - or rather, both branches are executed and the invalid fragment is discarded
  - which is even more wasteful
- Therefore, decompose the domain into boundary and interior areas
  - use a general (boundary) shader and a fast-path (interior) shader
  - run these in two separate passes, on their respective domains

20 Optimizing the Solver
- Detect steady-state natively on the GPU
- Minimize shader length
- Use special cases whenever possible
- Limit context switches

21 Optimizing the Solver: Steady-state
- How to detect convergence?
  - L1 norm - average error
  - L2 norm - RMS error (common in visual simulation)
  - L∞ norm - max error (common in science/engineering applications)
- Can use the occlusion query!
[Chart: seconds to steady state vs. grid size]

22 Optimizing the Solver: Shader length
- Minimize the number of registers used
- Vectorize as much as possible
- Use the rasterizer to perform computations of linearly-varying values
- Pre-compute invariants on the CPU
- Compute texture coordinate offsets in the vertex shader

shader      | original fp | fastpath fp | fastpath vp
smooth      | 79-6-1      | 20-4-1      | 12-2
residual    | 45-7-0      | 16-4-0      | 11-1
restrict    | 66-6-1      | 21-3-0      | 11-1
interpolate | 93-6-1      | 25-3-0      | 13-2

23 Optimizing the Solver: Special-case
- Fast path vs. slow path
  - write several variants of each fragment program to handle boundary cases
  - eliminates conditionals in the fragment program
  - equivalent to avoiding CPU inner-loop branching
[Figure: slow path with boundaries vs. fast path with no boundaries]

24 Optimizing the Solver: Special-case
- Fast path vs. slow path
  - write several variants of each fragment program to handle boundary cases
  - eliminates conditionals in the fragment program
  - equivalent to avoiding CPU inner-loop branching
[Chart: seconds per V-cycle vs. grid size]

25 Optimizing the Solver: Context-switching
- Find the best packing of data from multiple grid levels into the p-buffer surfaces - many p-buffers

26 Optimizing the Solver: Context-switching
- Find the best packing of data from multiple grid levels into the p-buffer surfaces - two p-buffers

27 Optimizing the Solver: Context-switching
- Find the best packing of data from multiple grid levels into the p-buffer surfaces - a single p-buffer
- Still one front and one back surface for iterative smoothing

28 Optimizing the Solver: Context-switching
- Remove context switching
  - can introduce operations with undefined results: reading from and writing to the same surface
- Why do we need to do this?
  - there is a chance that we write to and read from the same surface at the same time
- Can we get away with it?
  - yes, we can; we just need to be careful to avoid these conflicts
- What about RGBA parallelism?
  - not used in this implementation; may give another boost of up to a factor of 4

29 Data Layout
[Chart: performance - seconds to steady state vs. grid size]

30 Data Layout
- Possible additional vectorization: stacked domain
  - compute 4 values at a time
  - requires source, residual, and solution values to be in different buffers
  - complicates boundary calculations
  - adds setup and teardown overhead

31 Results: CPU vs. GPU
[Chart: performance - seconds to steady state vs. grid size]

32 Applications – Flow Simulation

33 Applications – High Dynamic Range
[Images: CPU vs. GPU results]

34 Conclusions
What we need going forward:
- Superbuffers
  - or: universal support for multiple-surface p-buffers
  - or: cheap context switching
- Developer tools
  - debugging tools
  - documentation
- Global accumulator
- Ever-increasing amounts of precision and memory
- Textures bigger than 2048 on a side

35 Acknowledgements
- Hardware: David Kirk, Matt Papakipos
- Driver support: Nick Triantos, Pat Brown, Stephen Ehmann
- Fragment programming: James Percy, Matt Pharr
- General-purpose GPU: Mark Harris, Aaron Lefohn, Ian Buck
- Funding: NSF Award #0092793

