Optimizing 3D Multigrid to Be Comparable to the FFT
Michael Maire and Kaushik Datta
Note: Several diagrams were taken from Kathy Yelick’s CS267 lectures.
Outline
  Multigrid Overview
  Multigrid Performance Model
  Multigrid Optimizations
  Results
3D Poisson’s Equation
  An elliptic PDE that arises in many physical problems (e.g. electrostatic or gravitational potential)
  Many different solution techniques are available
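3D Poisson’s Equation
  The continuous version is: $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = f$ (or $\nabla^2 u = f$)
  The discrete version is: T * x = b
  In 2D, the 9-point stencil looks like: [Figure: 2D 9-point stencil]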
Algorithms for Solving 2D Poisson’s Equation with N unknowns

  Algorithm        Serial Flops    Memory
  Dense LU         N^3             N^2
  Band LU          N^2             N^(3/2)
  Jacobi           N^2             N
  Explicit Inv.    N^2             N^2
  Conj. Grad.      N^(3/2)         N
  RB SOR           N^(3/2)         N
  Sparse LU        N^(3/2)         N*log N
  FFT              N*log N         N
  Multigrid        N               N
  Lower bound      N               N
Multigrid Overview
  Basic Algorithm:
    Replace the problem on the fine grid by an approximation on a coarser grid
    Solve the coarse grid problem approximately, and use the solution as a starting guess for the fine-grid problem, which is then iteratively updated
    Solve the coarse grid problem recursively, i.e. by using a still coarser grid approximation, etc.
  Success depends on the coarse grid solution being a good approximation to the fine grid
Multigrid Sketch on a 2D Mesh
  Consider a $(2^m + 1) \times (2^m + 1)$ grid
  Let $P^{(i)}$ be the problem of solving the discrete Poisson equation on a $(2^i + 1) \times (2^i + 1)$ grid in 2D
  Write the linear system as $T^{(i)} x^{(i)} = b^{(i)}$
  $P^{(m)}, P^{(m-1)}, \ldots, P^{(1)}$ is the sequence of problems from finest to coarsest
Multigrid Operators
  The four operators that we examine are:
    evaluateResidual: calculates the residual of our current solution
    applySmoother: performs a Jacobi relaxation step
    coarsen: maps from a $2^m \times 2^m \times 2^m$ grid to a $2^{m-1} \times 2^{m-1} \times 2^{m-1}$ grid
    prolongate: maps from a $2^{m-1} \times 2^{m-1} \times 2^{m-1}$ grid to a $2^m \times 2^m \times 2^m$ grid
  All these operators perform nearest-neighbor computations using a 27-point stencil
Multigrid V-cycle
  Just a picture of the call graph
  In time a V-cycle looks like the following:
  [Figure: V-cycle; vertical axis: level; horizontal axis: time; labeled steps: coarsen; prolongate, applySmoother, evaluateResidual]
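The recursion behind the V-cycle can be sketched in a few lines. The following is a minimal 1D illustration, not the 3D code discussed in these slides: it assumes the model problem -u'' = b with zero boundary values on a grid of 2^m + 1 points, and it assumes weighted Jacobi (weight 2/3), full-weighting restriction, and linear interpolation. The function names mirror the four operators but are otherwise hypothetical.

```python
def evaluate_residual(u, b, h):
    """r = b - T u, with (T u)_i = (2 u_i - u_{i-1} - u_{i+1}) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = b[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def apply_smoother(u, b, h, weight=2.0 / 3.0):
    """One weighted Jacobi sweep (the weight is an assumed value)."""
    new = u[:]
    for i in range(1, len(u) - 1):
        jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * b[i])
        new[i] = (1.0 - weight) * u[i] + weight * jacobi
    return new


def coarsen(r):
    """Full-weighting restriction: 2^m + 1 points -> 2^(m-1) + 1 points."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolongate(ec):
    """Linear interpolation: 2^(m-1) + 1 points -> 2^m + 1 points."""
    nf = 2 * (len(ec) - 1) + 1
    ef = [0.0] * nf
    for i in range(len(ec)):
        ef[2 * i] = ec[i]
    for i in range(1, nf, 2):
        ef[i] = 0.5 * (ef[i - 1] + ef[i + 1])
    return ef


def v_cycle(u, b, h):
    """One V-cycle: smooth, recurse on the coarse residual problem, correct, smooth."""
    if len(u) == 3:
        # Coarsest grid: a single unknown, solved exactly.
        return [0.0, 0.5 * h * h * b[1], 0.0]
    u = apply_smoother(u, b, h)
    coarse_rhs = coarsen(evaluate_residual(u, b, h))
    coarse_err = v_cycle([0.0] * len(coarse_rhs), coarse_rhs, 2.0 * h)
    fine_err = prolongate(coarse_err)
    u = [ui + ei for ui, ei in zip(u, fine_err)]
    return apply_smoother(u, b, h)


if __name__ == "__main__":
    n = 2 ** 5 + 1
    h = 1.0 / (n - 1)
    b = [0.0] + [1.0] * (n - 2) + [0.0]
    u = [0.0] * n
    for cycle in range(5):
        u = v_cycle(u, b, h)
        print(cycle, max(abs(x) for x in evaluate_residual(u, b, h)))
```

Each level does a constant amount of work per grid point and the grids shrink geometrically, which is where the O(N) serial cost of multigrid comes from.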
Why Multigrid Works
  Consider the error as a sum of sine curves of various frequencies
  Lower levels of multigrid have smaller frequency domains than higher levels
  A given level of multigrid damps the error in the upper half of its frequency domain by smoothing
  By going through all levels, all frequencies should be damped
Jacobi Smoothing (Relaxation)
  Initial error: “Rough”, lots of high frequency components, Norm = 1.65
  Error after 1 weighted Jacobi step: “Smoother”, less high frequency component, Norm = [value missing]
  Error after 2 weighted Jacobi steps: “Smooth”, little high frequency component, Norm = 0.9176, won’t decrease much more
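The smoothing behavior can be reproduced on a small 1D example. The sketch below is an illustration under assumptions, not the experiment behind the figure: it applies two weighted Jacobi steps (assumed weight 2/3) to a pure low-frequency and a pure high-frequency sine error on a 33-point grid with zero boundary values, and compares how much of each survives. The grid size and mode numbers are arbitrary choices.

```python
import math


def weighted_jacobi_step(e, weight=2.0 / 3.0):
    """One weighted Jacobi step on the error of the homogeneous 1D problem."""
    new = e[:]
    for i in range(1, len(e) - 1):
        new[i] = (1.0 - weight) * e[i] + weight * 0.5 * (e[i - 1] + e[i + 1])
    return new


def sine_mode(k, intervals):
    """Sine error mode with k half-waves on a grid of intervals + 1 points."""
    return [math.sin(k * math.pi * i / intervals) for i in range(intervals + 1)]


def max_norm(v):
    return max(abs(x) for x in v)


intervals = 32
low = sine_mode(1, intervals)    # smooth error component
high = sine_mode(24, intervals)  # oscillatory error component
for _ in range(2):
    low = weighted_jacobi_step(low)
    high = weighted_jacobi_step(high)

low_norm, high_norm = max_norm(low), max_norm(high)
print("low-frequency norm after 2 steps: ", low_norm)
print("high-frequency norm after 2 steps:", high_norm)
```

The oscillatory mode is almost eliminated while the smooth mode is nearly untouched, which is why the remaining error is handed to a coarser grid, where it looks oscillatory again.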
Multigrid Performance Model
  Memory access is the performance bottleneck
  Each pass over the 3D grid requires (per cell):
    27 integer operations (stencil coordinates)
    27 FP loads of surrounding grid locations
    27 (approx.) FP operations
    1 FP store
  Traversing the grid consecutively in memory causes 9 cache misses every (# doubles stored in a cache line) cells, i.e. 9 / (# doubles per cache line) misses per cell
  Grid size prevents reuse of cached values
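The per-cell counts above can be turned into a small calculator. This is a sketch of the arithmetic only; the function name, the 128-byte cache line, and the 8-byte double used as defaults are illustrative assumptions, not measurements from the slides.

```python
def per_cell_cost(line_bytes=128, double_bytes=8, load_streams=9):
    """Per-cell cost of one pass of a 27-point stencil under the model above.

    line_bytes and double_bytes are assumed values; load_streams is the number
    of distinct rows being read while walking the grid in memory order.
    """
    doubles_per_line = line_bytes // double_bytes
    return {
        "integer_ops": 27,
        "fp_loads": 27,
        "fp_ops": 27,
        "fp_stores": 1,
        # Each stream misses once per cache line, i.e. once every
        # doubles_per_line consecutive cells.
        "cache_misses": load_streams / doubles_per_line,
    }


if __name__ == "__main__":
    print(per_cell_cost())                 # 9 load streams
    print(per_cell_cost(load_streams=3))   # 3 load streams
```

Changing load_streams from 9 to 3 shows how the model responds when fewer rows have to be fetched from memory per cell.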
Multigrid Optimizations
  Optimizations are possible in 2 areas:
    Reducing ALU operations per cell
      Reusing stencil coordinates between cells
      Reusing partial sums common to consecutive cells
    Improving memory behavior
      Reducing # of loads (register blocking)
      Reducing # of cache misses (cache blocking)
Multigrid Optimizations
  Common subexpression elimination
  Loop unrolling
  Memoization
  Cache blocking
  Memoization + cache blocking
Optimizations – Loop Unrolling
  Reduces # of stencil coordinates computed per cell
  Exposes load reuse to the compiler
  Allows the compiler to use FP registers to store grid values, reducing loads
  Minimum number of loads is 9 per grid point (given a generous # of FP registers)
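Python cannot show register allocation, but the load reuse that unrolling exposes can be modeled by counting loads explicitly. The sketch below is an illustration under assumptions: it uses an unweighted 27-point sum, a hypothetical CountingGrid class that counts every element read, and local variables standing in for FP registers. Walking along the innermost index, the sliding version keeps the two previous 3x3 planes and loads only the 9 new values per cell.

```python
class CountingGrid:
    """n x n x n grid of doubles that counts every element load (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        self.data = [[[float((7 * i + 3 * j + k) % 5) for k in range(n)]
                      for j in range(n)] for i in range(n)]
        self.loads = 0

    def load(self, i, j, k):
        self.loads += 1
        return self.data[i][j][k]


def plane(grid, i, j, k):
    """Load the 9 values around (i, j) at a fixed innermost index k."""
    return [grid.load(i + di, j + dj, k) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def naive_row(grid, i, j):
    """27 loads per cell: every neighbor is reloaded for every cell."""
    return [sum(sum(plane(grid, i, j, k + dk)) for dk in (-1, 0, 1))
            for k in range(1, grid.n - 1)]


def sliding_row(grid, i, j):
    """9 loads per cell: the two previous planes stay in local variables."""
    prev, cur = plane(grid, i, j, 0), plane(grid, i, j, 1)
    out = []
    for k in range(1, grid.n - 1):
        nxt = plane(grid, i, j, k + 1)
        out.append(sum(prev) + sum(cur) + sum(nxt))
        prev, cur = cur, nxt
    return out


if __name__ == "__main__":
    a, b = CountingGrid(10), CountingGrid(10)
    assert naive_row(a, 4, 5) == sliding_row(b, 4, 5)
    print("naive loads:", a.loads, "sliding loads:", b.loads)
```

Holding 18 values live across iterations is what the "generous # of FP registers" condition refers to; with fewer registers a compiler would have to spill some of them.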
Optimizations – Memoization
  Traverses the grid once to precompute partial sums common to consecutive cells
  Traverses the grid again to compute actual cell values
  9 integer stencil operations/cell
  18 FP operations/cell
  Reduces FP register pressure by breaking the computation into two stages, but still uses 9 load streams per cell
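One way to picture the two-stage computation is sketched below. It is an illustration under assumptions rather than the implementation measured here: the stencil weights are hypothetical and depend only on the neighbor class (center, face, edge, corner), and the partial sums are precomputed one row at a time instead of in a separate pass over the whole grid. The first stage sums the 4 side neighbors and the 4 diagonal neighbors within each 3x3 plane; the second stage combines those sums from three consecutive planes.

```python
def make_grid(n):
    """Small test grid with integer-valued entries (arbitrary pattern)."""
    return [[[float((3 * i + 5 * j + 7 * k) % 11) for k in range(n)]
             for j in range(n)] for i in range(n)]


def zeros(n):
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]


def direct_stencil(u, w):
    """Reference: all 27 neighbors, weight w[c] with c = number of nonzero offsets."""
    n, out, offs = len(u), zeros(len(u)), (-1, 0, 1)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = sum(
                    w[abs(di) + abs(dj) + abs(dk)] * u[i + di][j + dj][k + dk]
                    for di in offs for dj in offs for dk in offs)
    return out


def memoized_stencil(u, w):
    """Two stages per row: in-plane partial sums, then a 3-plane combination."""
    n, out = len(u), zeros(len(u))
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Stage 1: partial sums shared by consecutive cells along k.
            side = [u[i - 1][j][k] + u[i + 1][j][k] + u[i][j - 1][k] + u[i][j + 1][k]
                    for k in range(n)]
            diag = [u[i - 1][j - 1][k] + u[i - 1][j + 1][k]
                    + u[i + 1][j - 1][k] + u[i + 1][j + 1][k] for k in range(n)]
            # Stage 2: combine the partial sums of planes k-1, k, k+1.
            for k in range(1, n - 1):
                out[i][j][k] = (w[0] * u[i][j][k]
                                + w[1] * (side[k] + u[i][j][k - 1] + u[i][j][k + 1])
                                + w[2] * (diag[k] + side[k - 1] + side[k + 1])
                                + w[3] * (diag[k - 1] + diag[k + 1]))
    return out


if __name__ == "__main__":
    weights = (-4.0, 0.5, 0.25, 0.125)  # hypothetical center/face/edge/corner weights
    grid = make_grid(6)
    assert direct_stencil(grid, weights) == memoized_stencil(grid, weights)
```

In this sketch each interior cell costs 6 additions in the first stage and 12 operations in the second, which is consistent with the 18 FP operations per cell quoted above, though the slides do not spell out how that count is composed.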
Optimizations – Cache Blocking
  Break the 3D grid into blocks that fit within cache
  Attempts to allow reuse between adjacent 2D slices
  Reduces memory traffic to 3 load streams per cell
  Overhead when switching between blocks
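The loop structure of a blocked traversal can be sketched as follows. This is a minimal illustration under assumptions: it uses an unweighted 27-point sum, tiles the two inner dimensions with a hypothetical block parameter, and streams through the outer dimension one 2D slice at a time. Python lists do not model a hardware cache, so the sketch only shows the traversal order and checks that it produces the same result as the plain sweep.

```python
def make_grid(n):
    """Small test grid with integer-valued entries (arbitrary pattern)."""
    return [[[float((2 * i + 3 * j + 5 * k) % 7) for k in range(n)]
             for j in range(n)] for i in range(n)]


def zeros(n):
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]


def stencil_sum(u, i, j, k):
    """Unweighted 27-point sum around (i, j, k)."""
    return sum(u[i + di][j + dj][k + dk]
               for di in (-1, 0, 1) for dj in (-1, 0, 1) for dk in (-1, 0, 1))


def plain_sweep(u):
    """Visit interior cells in memory order."""
    n, out = len(u), zeros(len(u))
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = stencil_sum(u, i, j, k)
    return out


def blocked_sweep(u, block):
    """Tile the (j, k) plane; within a tile, stream through slices along i.

    If three (block + 2) x (block + 2) slices fit in cache, slices i and i + 1
    are still resident when cell row i + 1 is computed, so only slice i + 2
    has to come from memory.
    """
    n, out = len(u), zeros(len(u))
    for j0 in range(1, n - 1, block):
        for k0 in range(1, n - 1, block):
            for i in range(1, n - 1):
                for j in range(j0, min(j0 + block, n - 1)):
                    for k in range(k0, min(k0 + block, n - 1)):
                        out[i][j][k] = stencil_sum(u, i, j, k)
    return out


if __name__ == "__main__":
    grid = make_grid(8)
    assert blocked_sweep(grid, 3) == plain_sweep(grid)
```

The extra loop nests and the partial tiles at the block edges are a simple picture of the switching overhead noted above, which grows as the block size shrinks.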
Summary & Continuing Work
  The overhead of cache blocking is too large for the small block sizes that fit in the IBM Power3’s L1 cache
  Memoization offers the greatest performance benefit, due to reduced FP operations
  We are currently analyzing performance data collected on other architectures