High Performance Computing 1 Multigrid Some material from lectures of J. Demmel, UC Berkeley Dept of CS.

High Performance Computing 1 Multigrid Multigrid is a “unite-and-conquer” algorithm for solving elliptic PDEs. Multigrid begins by obtaining (or improving) a solution on an NxN grid by using an (N/2)x(N/2) grid (and proceeding recursively). This strategy, in addition to reducing computational work, shifts the frequency structure of the errors present in the approximate solution. For a more detailed mathematical introduction to the multigrid algorithm, see "A Multigrid Tutorial" by W. Briggs.

High Performance Computing 1 Overview Consider a (2^m-1)-by-(2^m-1) grid of unknowns (a convenient number to work with). Adding the nodes at the boundary, we get a (2^m+1)-by-(2^m+1) grid on which the algorithm will operate; we let n = 2^m+1. Let P(i) denote the problem of solving a discrete Poisson equation on a (2^i+1)-by-(2^i+1) grid, with (2^i-1)^2 unknowns. The problem is specified by the grid size i, the coefficient matrix T(i), and the right-hand side b(i). Generate a sequence of related problems P(m), P(m-1), P(m-2),..., P(1) on coarser and coarser grids, where the solution to P(i-1) is a good approximation to the solution of P(i).

High Performance Computing 1 Grids

High Performance Computing 1 Method Let b(i) be the right-hand side of the linear system P(i), and x(i) an approximate solution to P(i). Thus, x(i) and b(i) are (2^i-1)-by-(2^i-1) arrays of values at each grid point. Next define operators that transfer information between grids. –The restriction operator R(i) takes a pair (b(i),x(i)), consisting of a problem P(i) and its approximate solution x(i), and maps it to (b(i-1),x(i-1)), the companion problem on the next coarser grid. On grid (i-1), start with a guess at the solution x(i-1): ( b(i-1), x(i-1) ) = R(i)( b(i), x(i) ) We will see that restriction is implemented simply by computing a weighted average of each grid point value with its nearest neighbors.

High Performance Computing 1 Method –The interpolation operator In(i-1) takes an approximate solution x(i-1) for P(i-1) and converts it to an approximation x(i) for the problem P(i) on the next finer grid: ( b(i), x(i) ) = In(i-1)( b(i-1), x(i-1) ) Its implementation also requires just a weighted average with nearest neighbors. –The solution operator S(i) takes a problem P(i) and an approximate solution x(i), and computes an improved x(i): x_improved(i) = S(i)( b(i), x(i) ) We often use a variation of Jacobi's method as the solution operator (recall the symmetry of Jacobi, in contrast to, say, Gauss-Seidel).

High Performance Computing 1 Restriction

High Performance Computing 1 Interpolation or Prolongation

High Performance Computing 1 V Cycle Starting with a problem on a fine grid ( b(i), x(i) ), we want to damp the high frequency error with a smoothing operator, and reduce the low frequency error by coarsening the grid, where those components become high frequency and can be smoothed in turn. First take a cut at the high frequency error by smoothing/approximate solution: x(i) = S(i)( b(i), x(i) ). Coarsen the problem: R(i)( b(i), x(i) ). Solve the coarser problem recursively: MGV( R(i)( b(i), x(i) ) ).

High Performance Computing 1 V Cycle Interpolate the coarse grid correction back to the fine grid: d(i) = In(i-1)( MGV( R(i)( b(i), x(i) ) ) ). Subtract the correction computed on the coarse grid from the fine grid solution: x(i) = x(i) - d(i). Improve the solution further by smoothing: x(i) = S(i)( b(i), x(i) ).
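In code, the whole V-cycle is only a few lines. Below is a minimal Python/NumPy sketch of the recursion, with hypothetical callables smooth, restrict, interpolate, matvec, and coarse_solve standing in for S(i), R(i), In(i-1), multiplication by T(i), and the exact solve of P(1); the size bookkeeping (2^i-1 unknowns at level i) assumes the 1D grids worked out later in these notes.

    import numpy as np

    def mgv(b, x, level, smooth, restrict, interpolate, matvec, coarse_solve):
        """One multigrid V-cycle for P(level); returns an improved x."""
        if level == 1:
            return coarse_solve(b)            # P(1) has one unknown: solve exactly
        x = smooth(b, x, level)               # damp the high-frequency error
        r = matvec(x, level) - b              # residual r(i) = T(i)*x(i) - b(i)
        d = mgv(restrict(r), np.zeros(len(r) // 2), level - 1,
                smooth, restrict, interpolate, matvec, coarse_solve)
        x = x - interpolate(d)                # subtract the coarse-grid correction
        return smooth(b, x, level)            # post-smoothing pass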

High Performance Computing 1 Grids

High Performance Computing 1 V Cycle The work at each point in the algorithm is proportional to the number of unknowns, since the value at each grid point is just averaged with its nearest neighbors. So the grid at level i costs (2^i-1)^2 = O(4^i) operations. If the finest grid is at level m, the total serial work is given by the geometric sum O(4^1) + O(4^2) +... + O(4^m) = O(4^m) = O( # unknowns ), traversing m = O( log # unknowns ) levels.

High Performance Computing 1 Full Multigrid The Full Multigrid algorithm uses the V-cycle as a building block. FMG( b(m), x(m) ) begins by solving the simplest problem P(1) exactly. Given a solution x(i-1) of the coarse problem P(i-1), prolongate it to a starting guess x(i) for the next finer problem P(i): In(i-1)( b(i-1), x(i-1) ). Solve the finer problem using the Multigrid V-cycle with this starting guess: MGV( In(i-1)( b(i-1), x(i-1) ) )
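Under the same assumptions, the FMG driver is a short loop around the mgv sketch above. It assumes the right-hand sides b_levels[i] of every P(i) are available (for example, from discretizing the continuous right-hand side on each grid); all names are ours.

    def fmg(b_levels, m, smooth, restrict, interpolate, matvec, coarse_solve):
        """Full Multigrid: solve P(1) exactly, then work up to P(m)."""
        x = coarse_solve(b_levels[1])         # exact solution of the 1-unknown P(1)
        for i in range(2, m + 1):
            x = interpolate(x)                # prolongate to a starting guess for P(i)
            x = mgv(b_levels[i], x, i,        # improve it with one V-cycle
                    smooth, restrict, interpolate, matvec, coarse_solve)
        return x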

High Performance Computing 1 Full Multigrid To get a sense of the work, note there is one V-cycle for each call to MGV in the inner loop of FMG. On a serial machine, the V-cycle at level i costs O(4^i) operations, so the total serial cost is again O( 4^m ) = O( # unknowns ). The serial complexity is optimal, in that a constant amount of work per unknown is performed. Analysis of a parallel version is considered later.

High Performance Computing 1 1D in detail Restriction and prolongation are easier in 1D, and it is easier to get a sense of the algorithmic details. Consider P(i), the problem on a 1D grid with 2^i+1 points, the middle 2^i-1 of which are the unknowns.

High Performance Computing 1 1D version

High Performance Computing 1 MG 1D Let T(i) be the coefficient matrix of problem P(i): a scaled (2^i-1)-by-(2^i-1) tridiagonal matrix with 2s on the diagonal and -1s on the off-diagonals. The scale factor 4^(-i) = h^2 shows up because T(i) approximates a second derivative on a grid with grid spacing h = 2^(-i).

High Performance Computing 1 MG 1D

                [  2 -1             ]
                [ -1  2 -1          ]
T(i) = 4^(-i) * [    -1  2 -1       ]
                [        .  .  .    ]
                [          -1  2 -1 ]
                [             -1  2 ]
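For concreteness, a minimal NumPy sketch that builds T(i) exactly as written above (the function name is ours):

    import numpy as np

    def T(i):
        """The scaled (2^i-1)-by-(2^i-1) tridiagonal matrix T(i)."""
        n = 2**i - 1
        return 4.0**(-i) * (2.0*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))

    print(T(2))   # the 3-by-3 case: 4^(-2) * tridiag(-1, 2, -1)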

High Performance Computing 1 MG 1D Recall from Fourier analysis that any function (in this case, the error in our approximate solution) can be written as a linear combination of basis vectors - here, sine functions of different frequencies. These sine functions are in fact the eigenvectors of the matrix T(i). The following figure shows some of these sine curves for i=5, where T(5) is 31-by-31. The top plot shows the eigenvalues (frequencies) of 4^5*T(5), from lowest to highest. The subsequent plots are the eigenvectors (sine curves), starting with the lowest 4 frequencies, and then a few more frequencies up to the highest (number 31).

High Performance Computing 1 MG 1D

High Performance Computing 1 MG 1D Let v(j) be the j-th eigenvector of T(i), and V the eigenvector matrix V=[v(1),v(2),...] whose j-th column is v(j). V is an orthogonal matrix, so any vector x can be written as a linear combination x = sum_j alpha(j)*v(j), where alpha(j) = v(j)'*x is the j-th frequency component of x. On the finest grid P(m), multigrid damps the upper half of the frequency components of the error using the solution operator S(m). On the next coarser grid, with half as many points, multigrid damps the upper half of the remaining frequency components in the error; and so on, until we solve the exact (1 unknown) problem P(1).
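These facts are easy to check numerically: the n-by-n matrix tridiag(-1,2,-1) has eigenvalues 2 - 2*cos(k*pi/(n+1)) with sine-curve eigenvectors, a standard result. A small sketch (variable names are ours):

    import numpy as np

    i = 5
    n = 2**i - 1                                            # 31 unknowns, as in the figure
    T0 = 2.0*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # this is 4^i * T(i)
    j = np.arange(1, n + 1)
    for k in (1, 2, 15, 31):                                # a few frequencies
        v = np.sin(k * np.pi * j / (n + 1))                 # candidate eigenvector
        lam = 2.0 - 2.0*np.cos(k * np.pi / (n + 1))         # its eigenvalue
        assert np.allclose(T0 @ v, lam * v)                 # v is indeed an eigenvector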

High Performance Computing 1 S(i) Here we consider a so-called weighted Jacobi smoothing operator S(i). Pure Jacobi would replace the j-th component of the approximate solution x(i) by the weighted average: x(i)(j) = .5*( x(i)(j-1) + x(i)(j+1) + 4^i * b(i)(j) ) = x(i)(j) + .5*( x(i)(j-1) - 2*x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) ) Weighted Jacobi instead uses: x(i)(j) = x(i)(j) + w*.5*( x(i)(j-1) - 2*x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) )

High Performance Computing 1 S(i) Choose the weight w to damp the highest frequencies; w=2/3 is a good choice, giving: x(i)(j) = (1/3)*( x(i)(j-1) + x(i)(j) + x(i)(j+1) + 4^i * b(i)(j) )
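A minimal vectorized sketch of one weighted-Jacobi sweep, implementing exactly the update above with zero boundary values (function and variable names are ours):

    import numpy as np

    def weighted_jacobi(b, x, i, w=2.0/3.0):
        """One sweep of weighted Jacobi on P(i); x holds the 2^i-1 unknowns."""
        xp = np.pad(x, 1)                     # attach the zero boundary values
        update = xp[:-2] - 2.0*x + xp[2:] + 4.0**i * b
        return x + 0.5 * w * update           # x(i)(j) += w*.5*( ... )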

High Performance Computing 1 S(i) Take i=6, so there are 2^6-1=63 unknowns. In the figures that follow, the true solution is a sine curve, shown as a dotted line in the leftmost plot of each row; the approximate solution is the solid line in the same plot. The middle plot shows the error alone. The rightmost plot shows the frequency components of the error. Initially, the norm of the error decreases rapidly, from 1.78 to 1.11, but then decays more gradually, because there is little error left in the high frequencies to damp. Thus, it only makes sense to do 1 or perhaps 2 iterations of S(i) at a time.

High Performance Computing 1 Errors 1

High Performance Computing 1 Errors 2

High Performance Computing 1 Errors 3

High Performance Computing 1 R(i) The restriction operator R(i) takes a problem P(i) with right-hand side b(i) and an approximate solution x(i), and maps it to a simpler problem P(i-1) with right-hand side b(i-1) and approximate solution x(i-1). Let r(i) be the residual of the approximate solution x(i): r(i) = T(i)*x(i) - b(i) If x(i) were the exact solution, r(i) would be zero. If we could solve the equation T(i) * d(i) = r(i) exactly for the correction d(i), then x(i) - d(i) would be the solution we seek, because T(i) * ( x(i) - d(i) ) = T(i)*x(i) - T(i)*d(i) = ( r(i) + b(i) ) - r(i) = b(i)

High Performance Computing 1 R(i) Thus, to apply R(i) to ( b(i), x(i) ) and get P(i-1): compute the residual r(i), restrict it to the next coarser grid to get b(i-1) = r(i-1), and let the initial guess x(i-1) be all zeros.

High Performance Computing 1 R(i) Thus, P(i-1) will be the problem of solving T(i-1)*d(i-1) = r(i-1) for the correction d(i-1), starting with the initial guess x(i-1) of all zeros. The restriction is accomplished by averaging nearest neighbors of r(i) on the fine grid (with 2^i-1 unknowns) to get an approximation on the coarse grid (with 2^(i-1)-1 unknowns): the value at a coarse grid point is .5 times the value at the corresponding fine grid point, plus .25 times each of that fine grid point's neighbors.
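In code, this full-weighting restriction is one line of slicing. A sketch for the 1D case, assuming r holds the 2^i-1 fine-grid residual values (names are ours):

    import numpy as np

    def restrict(r):
        """Full weighting: coarse value = .5*(fine point) + .25*(each neighbor).
        Maps 2^i-1 fine values to 2^(i-1)-1 coarse values."""
        return 0.25*r[:-2:2] + 0.5*r[1:-1:2] + 0.25*r[2::2]

    print(restrict(np.arange(7.0)))           # 7 fine values -> 3 coarse values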

High Performance Computing 1 In(i-1) The interpolation operator In(i-1) takes the solution d(i-1) of the residual equation T(i-1)*d(i-1) = r(i-1) and uses it to correct the fine grid approximation: x(i) = x(i) - In(i-1)( d(i-1) ).
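Linear interpolation is equally short: coarse values inject directly into alternate fine-grid points, and the remaining fine points average their two coarse neighbors (with zero boundaries). A sketch with our own names; note that when wiring restrict, interpolate, and the smoother together across levels, the per-level scaling of T(i) and b(i) must be kept consistent (see Briggs' tutorial for the details).

    import numpy as np

    def interpolate(d):
        """Map a coarse correction (2^(i-1)-1 values) to the fine grid (2^i-1)."""
        out = np.zeros(2*len(d) + 1)
        out[1::2] = d                          # coarse points carry over directly
        out[2:-1:2] = 0.5*(d[:-1] + d[1:])     # interior fine-only points: average
        out[0], out[-1] = 0.5*d[0], 0.5*d[-1]  # endpoints see a zero boundary value
        return out

    print(interpolate(np.array([1.0, 3.0, 5.0])))   # 3 coarse -> 7 fine values
    # The correction step is then: x = x - interpolate(d)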

High Performance Computing 1 Errors again The next graphs summarize the convergence of Multigrid in two ways. The graph on the right shows the residual residual(k) = norm( T(i)*x(i) - b(i) ) after k iterations of Full Multigrid (FMG). The graph on the left shows the ratio of consecutive residuals, showing how the residual is decreased by a factor bounded away from 1 at each step (for less smooth solutions, the residual will tend to decrease by a factor closer to .5 at each step).

High Performance Computing 1 Error again

High Performance Computing 1 Parallel Multigrid Look more carefully at a parallel version of multigrid in 2D, including communication costs. Multigrid requires each grid point to be updated using as many as 8 neighbors (those to the N, E, S, W, NW, SW, SE and NE); this determines the communication costs. Assume an n-by-n grid of data with n = 2^m+1, and that p = s^2 is a perfect square. Assume the data is laid out on an s-by-s grid of processors, with each processor owning an ((n-1)/s)-by-((n-1)/s) subgrid.

High Performance Computing 1 Parallel Multigrid This is illustrated in the following diagram by the pink dotted line, which encircles the grid points owned by the corresponding processor. The grid points in the top processor row have been labeled (and color coded) by the grid number i of the problem P(i) in which they participate. There is exactly one point labeled 2 per processor. The only grid point in P(1) with a nonboundary value is owned by the processor above the one with the pink dotted line. In general, if p >= 16, there will be .5*log_2(p) - 1 grids P(i) in which fewer than all processors own participating grid points.

High Performance Computing 1 [figure: grid points labeled by multigrid level, with one processor's points encircled by the pink dotted line]

High Performance Computing 1 Parallel MG The processor indicated by the pink dotted line requires certain grid point data from its neighbors to execute the algorithm. For example, to update its own (blue) grid points for P(5), the processor requires 8 blue grid point values from its N, S, E, and W neighbors, as well as single blue grid point values from its NW, SW, SE and NE neighbors. Similarly, updating the (red) grid points for P(4) requires 4 red grid point values from the N, S, E and W neighbors, and one each from the NW, SW, SE and NE neighbors. This pattern continues until each processor has only one grid point. After this, only some processors participate in the computation, requiring one value each from 8 other processors.

High Performance Computing 1 Parallel MG From this, we can do a simple complexity analysis of the communication costs. For simplicity, let p=2^(2*k), so each processor owns a 2^(m-k)-by-2^(m-k) subgrid. Consider a V-cycle starting at level m. Then in levels m through k of the V-cycle, each processor will own at least one grid point. After that, in levels k-1 through 1, fewer than all processors will own an active grid point, and so some processors will remain idle. Let f be the time per flop, alpha the time for a message startup, and beta the time per word to send a message.

High Performance Computing 1 Parallel MG At levels m >= i >= k, the time in the V-cycle will be

Time at level i = O( 4^(i-k) ) * f     ... number of flops, proportional to the number of grid points per processor
                + O( 1 ) * alpha       ... send a constant number of messages to neighbors
                + O( 2^(i-k) ) * beta  ... number of words sent

High Performance Computing 1 Parallel MG Summing all these terms from i=k to m yields

Time at levels k to m = O( 4^(m-k) )*f + O( m-k )*alpha + O( 2^(m-k) )*beta
                      = O( n^2/p )*f + O( log(n/sqrt(p)) )*alpha + O( n/sqrt(p) )*beta

High Performance Computing 1 Parallel MG On levels k >= i >= 1, this effect disappears, because each processor needs to communicate about as many words as it does floating point operations:

Time at level i = O( 1 ) * f      ... number of flops, proportional to # of grid points per processor
                + O( 1 ) * alpha  ... send a constant number of messages to neighbors
                + O( 1 ) * beta   ... number of words sent

High Performance Computing 1 Parallel MG Summing all these terms yields Time at levels 1 to k-1 = O( k-1 )*f + O( k-1 )*alpha + O( k-1 )*beta = O( log(p) )*f + O( log(p) )*alpha + O( log(p) )*beta

High Performance Computing 1 Parallel MG The total time for a V-cycle starting at the finest level is therefore Time = O( n^2/p + log(p) )*f + O( log(n) )*alpha + O( n/sqrt(p) + log(p) )*beta
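To see how these terms trade off, here is a small sketch that evaluates the V-cycle time model above with every hidden constant set to 1; the machine parameters f, alpha, and beta are illustrative defaults, not measurements, so only the trends are meaningful.

    import math

    def vcycle_time(n, p, f=1e-9, alpha=1e-6, beta=1e-9):
        """Time = O(n^2/p + log p)*f + O(log n)*alpha + O(n/sqrt(p) + log p)*beta."""
        return ((n*n/p + math.log2(p)) * f
                + math.log2(n) * alpha
                + (n/math.sqrt(p) + math.log2(p)) * beta)

    for p in (1, 16, 256, 4096):
        print(p, vcycle_time(n=2**12 + 1, p=p))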

High Performance Computing 1 Parallel MG For Full Multigrid, assuming n^2 >= p, so that each processor has at least 1 grid point: Time = O( N/p + log(p)*log(N) )*f + O( (log(N))^2 )*alpha + O( sqrt(N/p) + log(p)*log(N) )*beta where N = n^2 is the number of unknowns. The speedup over the serial floating point work O( N ) is nearly perfect when N >> p, but the running time reduces to O( (log(N))^2 ) when p = N (the PRAM model).

High Performance Computing 1 Parallel MG In practice, the time spent on the coarsest grids, when each processor has one or zero active grid points, is negligible on a serial machine but significant on a parallel machine, enough so to make the efficiency noticeably lower.