CS 267: Applications of Parallel Computers
Lecture 17 - Structured Grids
James Demmel, www.cs.berkeley.edu/~demmel/cs267_Spr11

The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)
• Motifs form key computational patterns
• Structured grids, the topic of this lecture, are one of these motifs  [figure omitted]

Outline
• Review PDEs
• Overview of methods for the Poisson equation
• Jacobi's method
• Red-Black SOR method
• Conjugate Gradient
• Multigrid
• (Next time: FFTs)

Solving PDEs
• Hyperbolic problems (waves): e.g. sound wave(position, time)
  - Use explicit time-stepping
  - The solution at each point depends on its neighbors at the previous time
• Elliptic (steady-state) problems: e.g. electrostatic potential(position)
  - Everything depends on everything else
  - This means locality is harder to find than in hyperbolic problems
• Parabolic (time-dependent) problems: e.g. temperature(position, time)
  - Involves an elliptic solve at each time step
• Focus here is on elliptic problems
  - The canonical example is the Poisson equation:
    ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)

Explicit Solution of PDEs
• Often used for hyperbolic PDEs
• Stability limits the size of the time step
• Computation corresponds to a matrix-vector multiply: combine nearest neighbors on the grid
• Use finite differences, with u[j,i] the solution at time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  - Initial conditions on u[j,0]
  - Boundary conditions on u[0,i] and u[N,i]
• At each time step i = 0,1,2,…:
    for j = 0 to N
      u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
  where z = C*dt/h²
  [Figure: space-time grid, time levels i = 0,…,5 above the initial values u[0,0], …, u[5,0] along the j axis]
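As an illustration of this update rule (a minimal sketch added here, not from the original slides; the grid size, C, and step count are arbitrary choices):

```python
import numpy as np

# Explicit time-stepping for the 1D heat equation u_t = C * u_xx,
# with fixed (zero) boundary values. Stability requires z <= 1/2.
N = 32                       # grid points j = 0..N
h = 1.0 / N
C = 1.0
dt = 0.4 * h**2 / C          # chosen so z = C*dt/h^2 = 0.4 < 1/2
z = C * dt / h**2

u = np.zeros(N + 1)
u[N // 2] = 1.0              # initial condition: a spike in the middle

for i in range(100):
    # One time step: each interior point is combined with its neighbors;
    # the boundary values u[0] and u[N] stay fixed.
    u[1:-1] = z * u[:-2] + (1 - 2*z) * u[1:-1] + z * u[2:]
```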

Matrix View of Explicit Method for Heat Equation
• u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
• u[:, i+1] = T * u[:, i], where T is tridiagonal:

    T = I - z*L = [ 1-2z    z              ]      L = [  2  -1          ]
                  [   z   1-2z    z        ]          [ -1   2  -1      ]
                  [         z   1-2z    z  ]          [     -1   2  -1  ]
                  [               z   1-2z ]          [         -1   2  ]

• L is called the Laplacian (in 1D); its graph is a chain, giving a "3-point stencil"
• For a 2D mesh (5-point stencil) the Laplacian has 5 diagonals
• For a 3D mesh there are 7 diagonals
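For concreteness, a short sketch (added here, not from the lecture) that builds T = I - z*L as a sparse matrix with SciPy, so that one time step is literally a sparse matrix-vector multiply:

```python
import numpy as np
import scipy.sparse as sp

n = 5                        # number of unknowns (illustrative)
z = 0.4

# 1D Laplacian L: tridiagonal with 2 on the diagonal, -1 off it.
L = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
T = sp.identity(n, format="csr") - z * L

u = np.random.rand(n)
u_next = T @ u               # one explicit time step as an SpMV
```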

Poisson's equation in 1D: ∂²u/∂x² = f(x)
• Solve T*u = f for u, where T is the tridiagonal matrix

    T = [  2  -1              ]
        [ -1   2  -1          ]
        [     -1   2  -1      ]
        [         -1   2  -1  ]
        [             -1   2  ]

• Graph and "stencil": each point is connected to its left and right neighbors, with stencil (-1, 2, -1)

Poisson's equation in 2D
• Similar to the 1D case, but the matrix T is now block tridiagonal (the 3D case is analogous)
• For a 3x3 grid of unknowns (n = 3, N = 9):

    T = [  4 -1  .   -1  .  .    .  .  . ]
        [ -1  4 -1    . -1  .    .  .  . ]
        [  . -1  4    .  . -1    .  .  . ]
        [ -1  .  .    4 -1  .   -1  .  . ]
        [  . -1  .   -1  4 -1    . -1  . ]
        [  .  . -1    . -1  4    .  . -1 ]
        [  .  .  .   -1  .  .    4 -1  . ]
        [  .  .  .    . -1  .   -1  4 -1 ]
        [  .  .  .    .  . -1    . -1  4 ]

• Graph and "5-point stencil": each grid point is coupled to its N, S, E, W neighbors, with 4 at the center and -1 at each neighbor
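A sketch (added here) of the standard Kronecker-product construction of the 2D matrix from the 1D one, T_2D = kron(I, T_1D) + kron(T_1D, I):

```python
import scipy.sparse as sp

n = 3                        # n-by-n grid, so T2 is n^2-by-n^2
T1 = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n))
I = sp.identity(n)

# The 2D 5-point Laplacian is the "Kronecker sum" of two 1D Laplacians.
T2 = sp.kron(I, T1) + sp.kron(T1, I)
print(T2.toarray())          # 9x9, with 4 on the diagonal and -1 couplings
```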

Algorithms for the 2D (3D) Poisson Equation (N = n² (n³) unknowns)

  Algorithm        Serial             PRAM                       Memory              #Procs
  Dense LU         N^3                N                          N^2                 N^2
  Band LU          N^2 (N^(7/3))      N                          N^(3/2) (N^(5/3))   N (N^(4/3))
  Jacobi           N^2 (N^(5/3))      N (N^(2/3))                N                   N
  Explicit Inv.    N^2                log N                      N^2                 N^2
  Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) * log N  N                   N
  Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))          N                   N
  Sparse LU        N^(3/2) (N^2)      N^(1/2)                    N*log N (N^(4/3))   N
  FFT              N*log N            log N                      N                   N
  Multigrid        N                  (log N)^2                  N                   N
  Lower bound      N                  log N                      N

• PRAM is an idealized parallel model with zero-cost communication
• Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997
• Notes on the entries:
  - Band LU serial complexity = dimension * bandwidth²: n²*n² in 2D, n³*(n²)² in 3D
  - Band LU storage = dimension * bandwidth; #procs = bandwidth²
  - Jacobi complexity = complexity(SpMV) * #iterations, where #iterations = condition number = n² in 2D and 3D
  - CG complexity = [complexity(SpMV) + complexity(dot product)] * #iterations, where #iterations = sqrt(condition number) = n in 2D and 3D
  - Red/Black SOR complexity = complexity(SpMV) * #iterations, where #iterations = sqrt(condition number) = n in 2D and 3D

Jacobi's Method
• To derive Jacobi's method, write Poisson's equation as:
  u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j)) / 4
• Let u(i,j,m) be the approximation to u(i,j) after m steps:
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) + u(i,j+1,m) + b(i,j)) / 4
• I.e., u(i,j,m+1) is a weighted average of its neighbors
• Motivation: u(i,j,m+1) is chosen to exactly satisfy the equation at (i,j)
• The number of steps to converge is proportional to the problem size, N = n²
• Therefore, serial complexity is O(N²)
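A minimal NumPy sketch of this iteration (added here; the array layout, point-source right-hand side, and convergence test are illustrative assumptions):

```python
import numpy as np

n = 32
u = np.zeros((n + 2, n + 2))        # interior n x n plus a fixed zero boundary
b = np.zeros((n + 2, n + 2))
b[n // 2, n // 2] = 1.0             # rhs: point source at the center

for m in range(2000):
    # Jacobi: every new value is the average of the OLD neighbor values.
    u_new = u.copy()
    u_new[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                         u[1:-1, :-2] + u[1:-1, 2:] + b[1:-1, 1:-1]) / 4
    if np.max(np.abs(u_new - u)) < 1e-8:
        break
    u = u_new
```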

Convergence of Nearest Neighbor Methods
• Jacobi's method involves a nearest-neighbor computation on an n×n grid (N = n²)
• So it takes O(n) = O(sqrt(N)) iterations for information to propagate
• E.g., consider a right-hand side b that is 0, except for a 1 at the center
  [Figure: the exact solution]
• Even in the best case, any nearest-neighbor computation will take n/2 steps to propagate across an n×n grid

Convergence of Nearest Neighbor Methods (continued)
• The "best 5-step solution" shown is not from a particular method; it uses the true solution at points within 5 steps of the center, and zero elsewhere
  [Figure: true solution vs. best 5-step solution]
• It takes O(n) steps to propagate information across an n×n grid

Parallelizing Jacobi's Method
• Reduces to sparse-matrix-vector multiply by (nearly) T:
  U(m+1) = (I - T/4) * U(m) + B/4
• Each value of U(m+1) may be updated independently
  - Keep 2 copies, for iterations m and m+1
• Requires that boundary values be communicated
  - If each processor owns an n/√p × n/√p subgrid (n²/p elements) to update, the amount of data communicated, n/√p words per neighbor, is relatively small if n >> p
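A sketch of the boundary exchange this implies, using mpi4py and a 1D strip partition for simplicity (the partitioning, variable names, and use of Sendrecv are my illustrative choices, not the lecture's code; a 2D block partition would exchange four edges instead of two):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

n = 64                               # global grid is n x n; assumes p divides n
rows = n // p                        # rows owned by this rank
u = np.zeros((rows + 2, n + 2))      # owned rows plus one ghost row above/below
b = np.zeros((rows + 2, n + 2))

up   = rank - 1 if rank > 0     else MPI.PROC_NULL
down = rank + 1 if rank < p - 1 else MPI.PROC_NULL

for m in range(100):
    # Exchange ghost rows with neighbors (PROC_NULL makes edge ranks no-ops).
    comm.Sendrecv(u[1].copy(),    dest=up,   recvbuf=u[rows + 1], source=down)
    comm.Sendrecv(u[rows].copy(), dest=down, recvbuf=u[0],        source=up)
    # Local Jacobi update on owned rows (NumPy evaluates the right-hand
    # side before assigning, so old values are used throughout).
    u[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                     u[1:-1, :-2] + u[1:-1, 2:] + b[1:-1, 1:-1]) / 4
```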

Communication-Avoiding Jacobi
• Replace k iterations of y = A*x with [Ax, A^2x, …, A^k x]
• Example: A tridiagonal, n = 32, k = 3
• Like the matrix-powers kernel, but simpler:
  - Don't need to store A explicitly (it's Jacobi)
  - Only need to save A^k x
  [Figure: dependency diagram with rows x, A·x, A^2·x, A^3·x over grid points 1,…,32]
• C. J. Pfeifer, "Data flow and storage allocation for the PDQ-5 program on the Philco-2000," Comm. ACM 6, No. 7 (1963), 365-366. Referred to by Leiserson, Rao, and Toledo in their 1993 paper on blocking covers.

Communication-Avoiding Jacobi: Sequential Algorithm
• Replace k iterations of y = A*x with [Ax, A^2x, …, A^k x]
• Example: A tridiagonal, n = 32, k = 3
• Like the matrix-powers kernel, but simpler:
  - Don't need to store A explicitly (it's Jacobi)
  - Only need to save A^k x, so we move O(n) words instead of O(kn)
  [Figure: rows x, A·x, A^2·x, A^3·x computed block by block in Steps 1-4, each block reading only the block below it]

Communication-Avoiding Jacobi: Parallel Algorithm
• Replace k iterations of y = A*x with [Ax, A^2x, …, A^k x]
• Example: A tridiagonal, n = 32, k = 3, with the grid points partitioned among Procs 1-4
• Entries in the overlapping regions (triangles in the figure) are computed redundantly
• Send O(1) messages instead of O(k)
  [Figure: rows x, A·x, A^2·x, A^3·x over points 1,…,32, split among 4 processors, with overlapping triangles at block boundaries]
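A sketch of the core idea for the tridiagonal case (added here, under the assumptions in the comments): fetch k ghost points per side once, then take k update steps with no further communication, accepting some redundant work near the block boundaries:

```python
import numpy as np

def k_steps_local(x_ext, z, k):
    # x_ext holds this processor's block plus k ghost points on each side,
    # fetched in ONE message per neighbor instead of one per iteration.
    v = x_ext.copy()
    for s in range(k):
        # Each step consumes one layer of ghost points; entries near the
        # ends become invalid and are computed redundantly by the neighbor.
        v[1:-1] = z * v[:-2] + (1 - 2*z) * v[1:-1] + z * v[2:]
    return v[k:-k]                   # the k-step result on the owned block

# usage sketch: a block of 8 owned points with k = 3 ghost points per side
z, k = 0.4, 3
x_ext = np.random.rand(8 + 2 * k)
y = k_steps_local(x_ext, z, k)       # k Jacobi-type steps, one communication
```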

Remotely Dependent Entries for [x, Ax, A^2x, A^3x], 2D Laplacian  [figure omitted]

References for communication optimizations in Jacobi
• bebop.cs.berkeley.edu
• "Autotuning Stencil Codes for Cache-Based Multicore Platforms," K. Datta, UCB PhD thesis, 2009
• "Avoiding Communication in Computing Krylov Subspaces," J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick, 2007
• "Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors," K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, SIAM Review
• Michelle Strout on polyhedral tiling: www.cs.colostate.edu/~mstrout/Papers/pubs-poly.php
• Ian Foster et al., on grids (SC2001)
• "Efficient out-of-core algorithms for linear relaxation using blocking covers," C. Leiserson, S. Rao, S. Toledo, FOCS, 1993
• "Data flow and storage allocation for the PDQ-5 program on the Philco-2000," C. Pfeifer, CACM, 1963

Improvements to Jacobi
• Similar to Jacobi: u(i,j,m+1) will be computed as a linear combination of neighbors
• The numeric coefficients and update order are different
• 2 improvements:
  - Use the "most recent values" of u that are available, since these are probably more accurate
  - Update the value of u(m+1) "more aggressively" at each step
• First, note that while evaluating sequentially
  u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + …
  some of the values for step m+1 are already available, so use them:
  u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) + …
  where "latest" is either m or m+1

Gauss-Seidel
• Updating in left-to-right, row-wise order, we get the Gauss-Seidel algorithm:
    for i = 1 to n
      for j = 1 to n
        u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m) + b(i,j)) / 4
  [Figure: sweep order; points above/left already updated (step m+1), points below/right not yet updated (step m)]
• Cannot be parallelized as written, because of the dependencies

Red-Black Gauss-Seidel
• Gauss-Seidel in row-wise order cannot be parallelized because of its dependencies, so instead we use a "red-black" order (the grid is colored like a checkerboard):
    forall black points u(i,j):
      u(i,j,m+1) = (u(i-1,j,m) + …   (average of red neighbors at step m)
    forall red points u(i,j):
      u(i,j,m+1) = (u(i-1,j,m+1) + …  (average of black neighbors at step m+1)
• For a general graph, use "graph coloring"
  - Can use repeated maximal independent sets to color
  - Graph(T) is bipartite => 2-colorable (red and black)
  - Nodes of each color can be updated simultaneously
• Same optimizations as for Jacobi apply, using submatrices

Successive Overrelaxation (SOR)
• Red-black Gauss-Seidel converges twice as fast as Jacobi, but there are twice as many parallel steps, so it is about the same in practice
• To motivate the next improvement, write the basic step in the algorithm as:
  u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
• If "correction" is a good direction to move in, then one should move even further in that direction, by some factor w > 1:
  u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
• Called successive overrelaxation (SOR); parallelizes like Jacobi
• Can prove that w = 2/(1 + sin(π/(n+1))) gives the best convergence for Poisson
  - Number of steps to converge = parallel complexity = O(n), instead of O(n²) for Jacobi
  - Serial complexity O(n³) = O(N^(3/2)), instead of O(n⁴) = O(N²) for Jacobi
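A vectorized NumPy sketch of red-black SOR (added here; the checkerboard masks and the point-source right-hand side are illustrative assumptions):

```python
import numpy as np

n = 32
u = np.zeros((n + 2, n + 2))                # interior n x n, zero boundary
b = np.zeros((n + 2, n + 2))
b[n // 2, n // 2] = 1.0
w = 2.0 / (1.0 + np.sin(np.pi / (n + 1)))   # optimal w for Poisson

# Checkerboard masks over the interior points.
ii, jj = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
red = (ii + jj) % 2 == 0
black = ~red

for m in range(500):
    for mask in (red, black):               # update one color at a time
        gs = (u[:-2, 1:-1] + u[2:, 1:-1] +
              u[1:-1, :-2] + u[1:-1, 2:] + b[1:-1, 1:-1]) / 4
        interior = u[1:-1, 1:-1]
        # SOR: move past the Gauss-Seidel value by the factor w.
        interior[mask] += w * (gs[mask] - interior[mask])
```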

Conjugate Gradient Algorithm for Solving Ax = b
    initial guess x
    r = b - A*x, j = 1
    repeat
      rho = r^T*r                         … dot product
      if j = 1, p = r, else
        beta = rho/old_rho; p = r + beta*p   … saxpy
      endif
      q = A*p                             … sparse matrix-vector multiply, or stencil
      alpha = rho / (p^T*q)               … dot product
      x = x + alpha*p                     … saxpy
      r = r - alpha*q                     … saxpy
      old_rho = rho; j = j + 1
    until rho small enough
• Converges in O(n) = O(N^(1/2)) steps, like SOR, but is more general
• Can be reorganized to use the matrix powers kernel [Ax, A^2x, …, A^k x]
  - "Communication Avoiding Krylov Subspace Methods," M. Hoemmen, UCB PhD thesis, bebop.cs.berkeley.edu, 2010
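A direct NumPy transcription of this pseudocode (added here; the dense test matrix and tolerance are illustrative, and in practice A*p would be a stencil application or SpMV):

```python
import numpy as np

def cg(A_mul, b, tol=1e-10, maxit=1000):
    # Conjugate gradients; A_mul(p) applies the SPD operator A to p.
    x = np.zeros_like(b)
    r = b - A_mul(x)
    p = r.copy()
    rho = r @ r
    for j in range(maxit):
        q = A_mul(p)                     # SpMV or stencil
        alpha = rho / (p @ q)            # dot product
        x += alpha * p                   # saxpy
        r -= alpha * q                   # saxpy
        rho_new = r @ r
        if rho_new < tol**2:
            break
        p = r + (rho_new / rho) * p      # saxpy, with beta = rho_new/rho
        rho = rho_new
    return x

# usage sketch on the 1D Poisson matrix T = tridiag(-1, 2, -1)
n = 16
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = cg(lambda v: T @ v, np.ones(n))
```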

Multigrid Motivation
• Recall that Jacobi, SOR, CG, or any other sparse-matrix-vector-multiply-based algorithm can only move information one grid cell at a time
  - Takes at least n steps to move information across an n×n grid
• Therefore, converging in O(1) steps requires moving information across the grid faster than one neighboring grid cell per step
  - One step can't just do a sparse-matrix-vector multiply

Multigrid Motivation (continued): it takes O(n) steps to propagate information across an n×n grid  [figure omitted]

Big Idea used in multigrid and elsewhere: if you are far away, the problem looks simpler
• For gravity: approximate the earth, distant galaxies, … by point masses
• Can solve such a coarse approximation to get an approximate solution, iterating if necessary
• Solve the coarse approximation problem by using an even coarser approximation of it, and so on recursively
• Examples:
  - Graph partitioning (used to parallelize SpMV): replace the graph to be partitioned by a coarser graph
  - Fast Multipole Method and Barnes-Hut, for computing gravitational forces on n particles in O(n log n) time: approximate the particles in a box by their total mass and center of gravity
  - Multigrid, for solving a PDE in O(n) time: use a coarser mesh to get an approximate solution of Poisson's equation

Fine and Coarse Approximations  [Figure: smoothing and coarse-grid projection; FEM applied to itself, recursively]

Multigrid Overview
• Basic algorithm:
  - Replace the problem on a fine grid by an approximation on a coarser grid
  - Solve the coarse-grid problem approximately, and use its solution as a starting guess for the fine-grid problem, which is then iteratively updated
  - Solve the coarse-grid problem recursively, i.e. by using a still coarser grid approximation, etc.
• Success depends on the coarse-grid solution being a good approximation to the fine-grid one
  [Figure: fine grid and coarse grid]

Multigrid Sketch in 1D
• Consider a 2^m + 1 point grid in 1D for simplicity
• Let P(i) be the problem of solving the discrete Poisson equation on a 2^i + 1 point grid in 1D
  - Write the linear system as T(i) * x(i) = b(i)
• P(m), P(m-1), …, P(1) is a sequence of problems from finest to coarsest

Multigrid Sketch in 2D
• Consider a 2^m + 1 by 2^m + 1 grid in 2D
• Let P(i) be the problem of solving the discrete Poisson equation on a 2^i + 1 by 2^i + 1 grid in 2D
  - Write the linear system as T(i) * x(i) = b(i)
• P(m), P(m-1), …, P(1) is a sequence of problems from finest to coarsest

Multigrid Operators
• For problem P(i) at varying coarsening levels (i; grid size grows with i):
  - b(i) is the right-hand side (RHS) and x(i) is the current estimated solution
• All the following operators just average values on neighboring grid points (so information moves fast on coarse grids)
• The restriction operator R(i) maps P(i) to P(i-1)
  - Restricts the problem on fine grid P(i) to coarse grid P(i-1), using sampling or averaging
  - b(i-1) = R(i)(b(i))
• The interpolation operator In(i-1) maps an approximate solution x(i-1) to x(i)
  - Interpolates the solution on coarse grid P(i-1) to fine grid P(i)
  - x(i) = In(i-1)(x(i-1))
• The solution operator S(i) takes P(i) and improves the solution x(i)
  - Uses "weighted" Jacobi or SOR on a single level of the grid
  - x_improved(i) = S(i)(b(i), x(i))
• Overall algorithm next, then details of the operators
• (b(i) and x(i) both live on grids of size 2^i - 1)

Multigrid V-Cycle Algorithm
    function MGV( b(i), x(i) )
      … solve T(i)*x(i) = b(i) given b(i) and an initial guess for x(i)
      … return an improved x(i)
      if (i = 1)
        compute exact solution x(1) of P(1)          … only 1 unknown
        return x(1)
      else
        x(i) = S(i)( b(i), x(i) )                    … improve solution by damping high-frequency error
        r(i) = T(i)*x(i) - b(i)                      … compute residual
        d(i) = In(i-1)( MGV( R(i)( r(i) ), 0 ) )     … solve T(i)*d(i) = r(i) recursively
        x(i) = x(i) - d(i)                           … correct fine-grid solution
        x(i) = S(i)( b(i), x(i) )                    … improve solution again
        return x(i)
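To make the operators concrete before their detail slides, here is a self-contained 1D sketch (added here: the weight-2/3 Jacobi smoother, full-weighting restriction, and linear interpolation follow the slides below, but the code itself is an assumed illustration, not the lecture's implementation):

```python
import numpy as np

def poisson_apply(x, h):
    # Apply T(i) = (1/h^2) * tridiag(-1, 2, -1) with zero Dirichlet boundaries.
    xp = np.pad(x, 1)
    return (2 * xp[1:-1] - xp[:-2] - xp[2:]) / h**2

def smooth(x, b, h, steps=2):
    # Solution operator S(i): weighted Jacobi (weight 2/3), which damps
    # the high-frequency components of the error.
    w = 2.0 / 3.0
    for _ in range(steps):
        xp = np.pad(x, 1)
        jac = (xp[:-2] + xp[2:] + h**2 * b) / 2.0   # plain Jacobi step
        x = (1 - w) * x + w * jac
    return x

def restrict(r):
    # Restriction operator R(i): full weighting, a 1/4-1/2-1/4 average.
    return 0.25 * r[:-1:2] + 0.5 * r[1::2] + 0.25 * r[2::2]

def interpolate(xc):
    # Interpolation operator In(i-1): copy coincident points,
    # average the two coarse neighbors in between.
    xf = np.zeros(2 * len(xc) + 1)
    xf[1::2] = xc
    xcp = np.pad(xc, 1)
    xf[0::2] = 0.5 * (xcp[:-1] + xcp[1:])
    return xf

def MGV(b, x, h):
    # V-cycle for T(i)*x = b on 2^i - 1 unknowns with mesh spacing h.
    if len(x) == 1:
        return np.array([b[0] * h**2 / 2.0])        # exact solve, 1 unknown
    x = smooth(x, b, h)                             # pre-smooth
    r = poisson_apply(x, h) - b                     # residual
    d = MGV(restrict(r), np.zeros((len(x) - 1) // 2), 2 * h)
    x = x - interpolate(d)                          # coarse-grid correction
    return smooth(x, b, h)                          # post-smooth

# usage sketch: 2^6 - 1 = 63 unknowns
n, h = 63, 1.0 / 64
b = np.ones(n)
x = np.zeros(n)
for _ in range(8):
    x = MGV(b, x, h)
print(np.max(np.abs(poisson_apply(x, h) - b)))      # residual shrinks quickly
```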

This is called a V-Cycle
• Just a picture of the call graph
• Plotted against time, a V-cycle descends to the coarsest grid and comes back up, tracing a "V"  [figure omitted]

Complexity of a V-Cycle
• On a serial machine:
  - Work at each point in a V-cycle is O(the number of unknowns)
  - Cost of level i is (2^i - 1)² = O(4^i) for a 2D grid
  - If the finest grid level is m, total time is Σ_{i=1}^{m} O(4^i) = O(4^m) for a 2D grid = O(# unknowns) in general
• On a parallel machine (PRAM) with one processor per grid point and free communication, each step in the V-cycle takes constant time
  - Total V-cycle time is O(m) = O(log # unknowns)

Full Multigrid (FMG)
• Intuition:
  - Improve the solution by doing multiple V-cycles
  - Avoid expensive fine-grid (high-frequency) cycles
  - The analysis of why this works is beyond the scope of this class
    function FMG( b(m), x(m) )
      … return improved x(m) given an initial guess
      compute the exact solution x(1) of P(1)
      for i = 2 to m
        x(i) = MGV( b(i), In(i-1)( x(i-1) ) )
• In other words:
  - Solve the problem with 1 unknown
  - Given a solution to the coarser problem P(i-1), map it to a starting guess for P(i)
  - Solve the finer problem using the multigrid V-cycle
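Continuing the 1D sketch above (reusing its assumed MGV and interpolate helpers, and assuming the caller supplies the right-hand side at every level), FMG is only a few lines:

```python
import numpy as np

def FMG(b_levels):
    # b_levels[l-1] is the RHS on the 2^l - 1 point grid with spacing 2^(-l).
    x = np.array([b_levels[0][0] * 0.5**2 / 2.0])   # exact solve of P(1), h = 1/2
    for l in range(2, len(b_levels) + 1):
        # Map the coarse solution to a starting guess, then run one V-cycle.
        x = MGV(b_levels[l - 1], interpolate(x), 2.0**(-l))
    return x
```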

Full Multigrid Cost Analysis
• One V-cycle for each level in FMG (people also use W-cycles and other compositions)
• Serial time: Σ_{i=1}^{m} O(4^i) = O(4^m) = O(# unknowns)
• PRAM time: Σ_{i=1}^{m} O(i) = O(m²) = O(log² # unknowns)

Complexity of Solving Poisson's Equation
• Theorem: the error after one FMG call satisfies
  error_after <= 0.5 * error_before
  independent of the number of unknowns
  - I.e., we gain at least 1 bit of accuracy each time
• Corollary: we can make the error smaller than any fixed tolerance in a fixed number of steps, independent of the size of the finest grid
• This is the most important convergence property of multigrid, distinguishing it from all other methods, which converge more slowly for large grids
• Total complexity is just proportional to the cost of one FMG call

The Solution Operator S(i): Details
• The solution operator S(i) is a weighted Jacobi
• Consider the 1D problem. At level i, pure Jacobi replaces:
  x(j) := 1/2 * ( x(j-1) + x(j+1) + b(j) )
• Weighted Jacobi uses:
  x(j) := 1/3 * ( x(j-1) + x(j) + x(j+1) + b(j) )
• In 2D, a similar average of nearest neighbors is used

Weighted Jacobi is chosen to damp high-frequency error
• Initial error: "rough," with lots of high-frequency components; norm = 1.65
• Error after 1 weighted Jacobi step: "smoother," with less high-frequency component; norm = 1.06
• Error after 2 weighted Jacobi steps: "smooth," with little high-frequency component; norm = 0.92, and it won't decrease much more
  [Figures: the error after 0, 1, and 2 smoothing steps; the norm here is the 2-norm, the root-sum-of-squares of the error]

The Restriction Operator R(i): Details
• The restriction operator R(i) takes a problem P(i) with RHS b(i) and maps it to a coarser problem P(i-1) with RHS b(i-1)
• In 1D, average the values of neighbors:
  x_coarse(i) = 1/4 * x_fine(i-1) + 1/2 * x_fine(i) + 1/4 * x_fine(i+1)
• In 2D, average with all 8 neighbors (N, S, E, W, NE, NW, SE, SW)

The Interpolation Operator In(i-1): Details
• The interpolation operator In(i-1) takes a function on a coarse grid P(i-1) and produces a function on a fine grid P(i)
• In 1D, linearly interpolate between the nearest coarse neighbors:
  x_fine(i) = x_coarse(i) if fine grid point i is also a coarse one, else
  x_fine(i) = 1/2 * x_coarse(left of i) + 1/2 * x_coarse(right of i)
• In 2D, interpolation requires averaging with the 4 nearest neighbors (NW, SW, NE, SE)

Convergence Picture of Multigrid in 1D  [figure omitted]

Convergence Picture of Multigrid in 2D  [figure omitted]

Parallel 2D Multigrid
• Multigrid in 2D requires nearest-neighbor (up to 8) computation at each level of the grid
• Start with an n = 2^m + 1 by 2^m + 1 grid (here m = 5)
• Use an s by s processor grid (here s = 4)
  [figure omitted]

Performance Model of Parallel 2D Multigrid (V-cycle)
• Assume a 2^m + 1 by 2^m + 1 grid of points, n = 2^m - 1, N = n²
• Assume p = 4^k processors, arranged in a 2^k by 2^k grid
  - Each processor starts with a 2^(m-k) by 2^(m-k) subgrid of unknowns
• Consider a V-cycle starting at level m
  - At levels m through k of the V-cycle, each processor does some work
  - At levels k-1 through 1, some processors are idle, because a 2^(k-1) by 2^(k-1) grid of unknowns cannot occupy each processor
• Cost of one level in the V-cycle:
  - If level j >= k, then cost =
      O(4^(j-k))       … flops, proportional to the number of grid points per processor
    + O(1) α           … send a constant number of messages to neighbors
    + O(2^(j-k)) β     … number of words sent
  - If level j < k, then cost =
      O(1)             … flops
    + O(1) α           … messages
    + O(1) β           … words sent
• Sum over all levels in all V-cycles in FMG to get the total complexity

Comparison of Methods (in the O(·) sense)

  Method   # Flops               # Messages                   # Words sent
  MG       N/p + (log N)²        (N/p)^(1/2) + log p * log N  log p * log N
  FFT      N log N / p           p^(1/2)                      N/p
  SOR      N^(3/2) / p           N^(1/2)                      N/p

• SOR is slower than the others on all counts
• Flops for MG and FFT depend on the accuracy required of MG
• MG communicates less total data (bandwidth)
• Total messages (latency) depends … this coarse analysis can't say whether MG or FFT is better when α >> β

Practicalities
• In practice, we don't go all the way down to P(1)
• In sequential code, the coarsest grids are negligibly cheap, but on a parallel machine they are not
  - Consider 1000 points per processor
  - In 2D, the surface to communicate is 4*sqrt(1000) ≈ 126 points, or about 13%
  - In 3D, the surface is 1000 - 8³ ≈ 500 points, or about 50%
• See Tuminaro and Womble, SIAM J. Sci. Comp., v14, n5, 1993, for analysis of MG on a 1024-processor nCUBE2
  - On a 64x64 grid of unknowns, only 4 per processor: efficiency of 1 V-cycle was .02, and of FMG .008
  - On a 1024x1024 grid, efficiencies were .7 (MG V-cycle) and .42 (FMG)
  - Despite its worse parallel efficiency, FMG is 2.6 times faster than V-cycles alone
  - The nCUBE had fast communication relative to its slow processors

Multigrid on an Adaptive Mesh
• For problems with a very large dynamic range, another level of refinement is needed
• Build an adaptive mesh and solve multigrid (typically) at each level
• Can't afford to use the finest mesh everywhere

Multiblock Applications
• Solve a system of equations on a union of rectangles (a subproblem of AMR)  [example figure omitted]

Adaptive Mesh Refinement (AMR)
• Data structures in AMR  [figure omitted]
• The usual parallelism is to assign grids on each level to processors
• Load balancing is a problem

Support for AMR
• Domains in Titanium are designed for this problem
• Kelp, Boxlib, and AMR++ are libraries for this
• Primitives help with boundary-value updates, etc.

Multigrid on an Unstructured Mesh
• Another approach to variable activity is to use an unstructured mesh that is more refined in areas of interest
• Adaptive rectangular or unstructured?
  - Numerics are easier on rectangular meshes
  - Supposedly easier to implement (arrays without indirection), but boundary cases tend to dominate the code
• Up to 39M unknowns on 960 processors, with 50% efficiency (source: M. Adams)

Multigrid on an Unstructured Mesh (continued)
• Need to partition the graph for parallelism
• What does it mean to do multigrid here, anyway?
• Need to be able to coarsen the grid (a hard problem)
  - Can't just pick "every other grid point" anymore
  - Use "maximal independent sets" again
  - How to make the coarse graph approximate the fine one?
• Need to define R() and In()
  - How do we convert from the coarse to the fine mesh and back?
• Need to define S()
  - How do we define the coarse matrix (no longer given by a formula, as for Poisson)?
• Dealing with coarse meshes efficiently
  - Should we switch to using fewer processors on coarse meshes?
  - Should we switch to another solver on coarse meshes?

Source of Unstructured Finite Element Mesh: Vertebra  [figure omitted]
Source: M. Adams, H. Bayraktar, T. Keaveny, P. Papadopoulos, A. Gupta

Multigrid for Nonlinear Elastic Analysis of Bone (source: M. Adams et al.)
• Mechanical testing for material properties; 3D image; µFE mesh  [figures omitted]
• 2.5 mm³ cube, 44 µm elements
• Micro computed tomography at 22 µm resolution
• Up to 537M unknowns on 4088 processors (ASCI White), with 70% parallel efficiency

Extra Slides

Preconditioning
• One can change A and b by preconditioning:
  M1^(-1) A (M2^(-1) M2) x = M1^(-1) b
  is the same equation as before, for any choice of matrices M1 and M2
• Rewriting:
  Anew = M1^(-1) A M2^(-1),  xnew = M2 x,  bnew = M1^(-1) b
• Anew xnew = bnew has the same form as above, and we can apply any of the methods that we used on A x = b
• We choose M1 and M2 to make our standard methods perform better
• All these choices are designed to accelerate the convergence of iterative methods
• There are specialized preconditioning ideas, and perhaps better general approaches such as multigrid and incomplete LU (ILU) decomposition
(Slide source: Geoffrey Fox)
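A small sketch (added here, not from the slide) of preconditioned CG with one of the simplest choices, the diagonal (Jacobi) preconditioner M = diag(A):

```python
import numpy as np

def pcg(A_mul, b, M_inv_mul, tol=1e-10, maxit=1000):
    # Preconditioned CG: M_inv_mul(r) applies M^(-1); identity gives plain CG.
    x = np.zeros_like(b)
    r = b - A_mul(x)
    z = M_inv_mul(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        q = A_mul(p)
        alpha = rz / (p @ q)
        x += alpha * p
        r -= alpha * q
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_mul(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# usage sketch: Jacobi preconditioner for the 1D Poisson matrix
n = 16
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = pcg(lambda v: A @ v, np.ones(n), lambda r: r / np.diag(A))
```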

Multigrid Philosophically
• Suppose we have a finest level M(1) with N by N points (in 2D)
• Then the k'th coarsest approximation M(k) to this has N/2^k by N/2^k points
• One way to think about multigrid is that M(k+1) can form a preconditioner for M(k), so that one can replace the natural matrix A(k) by A^(-1)(k+1) A(k)
  - A^(-1)(k+1) A(k) is a nicer matrix than A(k), and iterative solvers converge faster, as the long-wavelength terms have been removed
• The basic issue is that A(k) and A(k+1) are of different sizes, so we need prolongation and restriction to map solutions between levels k and k+1
• We apply this idea recursively
(Slide source: Geoffrey Fox and, indirectly, Ulrich Ruede)

Multigrid Algorithm
    procedure MG( level, A, u, f )
      if level = coarsest then
        solve the coarsest grid equation A_level u = f "exactly," by another method
      else
        smooth A_level u = f  (m1 times)
        compute residual r = f - A_level u
        restrict F = R r                       (R is the restriction operator)
        call MG( level+1, A(level+1), V, F )   (mc times)
        interpolate v = P V                    (interpolate the new solution to this level)
        correct u_new = u + v
        smooth A u_new = f  (m2 times) and set u = u_new
      endif
    endprocedure
• Here A_level u_exact = f, and the correction solves A_level v = r = f - A_level u, so u_exact = u + v
(Slide source: Geoffrey Fox)

Multigrid Cycles
• [Figures: V-cycle and W-cycle, from the finest grid down to the coarsest grid and back]
• The parameter mc determines the exact strategy for how one iterates through the different meshes
• One ends up with a full multigrid cycle as shown  [Figure: full multigrid cycle, finest to coarsest]
(Slide source: Geoffrey Fox)

Templates for choosing an iterative algorithm for Ax = b
• Use only matrix-vector multiplication, dot products, saxpy, … (www.netlib.org/templates)
• Decision tree:
  - Is A symmetric?
    - No: is A^T available?
      - No: try CGS or Bi-CGSTAB
      - Yes: is storage expensive?
        - Yes: try QMR
        - No: try GMRES
    - Yes: is A definite?
      - No: try Minres or CG
      - Yes: are the eigenvalues known?
        - No: try CG
        - Yes: try CG or Chebyshev

Summary of Jacobi, SOR and CG
• Jacobi, SOR, and CG all perform sparse-matrix-vector multiplies
• For Poisson, this means nearest-neighbor communication on an n-by-n grid
• It takes n = N^(1/2) steps for information to travel across an n-by-n grid
• Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
  - FFT
  - Multigrid