Optimizing 3D Multigrid to Be Comparable to the FFT Michael Maire and Kaushik Datta Note: Several diagrams were taken from Kathy Yelick’s CS267 lectures.


Outline
- Multigrid Overview
- Multigrid Performance Model
- Multigrid Optimizations
- Results

3D Poisson’s Equation
- An elliptic PDE that arises in many physical problems (e.g. electrostatic or gravitational potential)
- Many different techniques for solving it are available

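3D Poisson’s Equation
- The continuous version is: $\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2} = \rho$ (or $\Delta \phi = \rho$)
- The discrete version is: T * x = b
- In 2D, the 9-point stencil looks like:
[Figure: 2D 9-point stencil]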

Algorithms for Solving 2D Poisson’s Equation with N unknowns

Algorithm       Serial Flops   Memory
Dense LU        N^3            N^2
Band LU         N^2            N^(3/2)
Jacobi          N^2            N
Explicit Inv.   N^2            N^2
Conj. Grad.     N^(3/2)        N
RB SOR          N^(3/2)        N
Sparse LU       N^(3/2)        N*log N
FFT             N*log N        N
Multigrid       N              N
Lower bound     N              N

Multigrid Overview
Basic Algorithm:
- Replace the problem on the fine grid by an approximation on a coarser grid
- Solve the coarse grid problem approximately, and use the solution as a starting guess for the fine-grid problem, which is then iteratively updated
- Solve the coarse grid problem recursively, i.e. by using a still coarser grid approximation, etc.
- Success depends on the coarse grid solution being a good approximation to the fine grid

Multigrid Sketch on a 2D Mesh
- Consider a (2^m + 1) by (2^m + 1) grid
- Let P^(i) be the problem of solving the discrete Poisson equation on a (2^i + 1) by (2^i + 1) grid in 2D
- Write the linear system as T(i) * x(i) = b(i)
- P^(m), P^(m-1), …, P^(1) is the sequence of problems from finest to coarsest

Multigrid Operators
The four operators that we examine are:
- evaluateResidual: calculates the residual of our current solution
- applySmoother: performs a Jacobi relaxation step
- coarsen: maps from a (2^m x 2^m x 2^m) grid to a (2^(m-1) x 2^(m-1) x 2^(m-1)) grid
- prolongate: maps from a (2^(m-1) x 2^(m-1) x 2^(m-1)) grid to a (2^m x 2^m x 2^m) grid
All these operators perform nearest-neighbor computations using a 27-point stencil

Multigrid V-cycle
- Just a picture of the call graph
- In time a V-cycle looks like the following:
[Figure: V-cycle plotted as level versus time; labels: coarsen; prolongate, applySmoother, evaluateResidual]
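
The recursion behind the V-cycle can be sketched in a few lines. The following is a minimal 1D analogue, not the authors' 3D code: it assumes the model problem -u'' = f on [0, 1] with zero boundary values, a 3-point stencil instead of the 27-point one, full-weighting restriction, linear interpolation, and a weighted Jacobi smoother with omega = 2/3. The function names mirror the four operators named in the slides; `v_cycle` and `solve_demo` are hypothetical helpers introduced for illustration.

```python
import math

def evaluate_residual(u, f, h):
    """r = f - T u with T = tridiag(-1, 2, -1) / h^2; boundary entries stay 0."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def apply_smoother(u, f, h, omega=2.0 / 3.0):
    """One weighted Jacobi step: u + omega * D^-1 * r, where D = 2 / h^2."""
    r = evaluate_residual(u, f, h)
    return [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]

def coarsen(r):
    """Full-weighting restriction from 2^m + 1 points to 2^(m-1) + 1 points."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolongate(ec):
    """Linear interpolation from the coarse grid back to the fine grid."""
    nf = 2 * (len(ec) - 1) + 1
    ef = [0.0] * nf
    for i, value in enumerate(ec):
        ef[2 * i] = value
    for i in range(1, nf, 2):
        ef[i] = 0.5 * (ef[i - 1] + ef[i + 1])
    return ef

def v_cycle(u, f, h):
    """One V-cycle: smooth, recurse on the coarse residual equation, correct, smooth."""
    if len(u) == 3:
        # Coarsest grid has one interior unknown, so solve it exactly.
        return [0.0, f[1] * h * h / 2.0, 0.0]
    u = apply_smoother(u, f, h)
    rc = coarsen(evaluate_residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolongate(ec))]
    return apply_smoother(u, f, h)

def solve_demo(m=5, cycles=10):
    """Solve -u'' = pi^2 sin(pi x), whose continuous solution is sin(pi x)."""
    n = 2 ** m
    h = 1.0 / n
    x = [i * h for i in range(n + 1)]
    f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]
    u = [0.0] * (n + 1)
    for _ in range(cycles):
        u = v_cycle(u, f, h)
    return x, u, f, h
```

Calling `solve_demo()` runs ten V-cycles on a 33-point grid. Each cycle touches every level once on the way down and once on the way up, which is where the O(N) work estimate comes from.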

Why Multigrid Works
- Consider the error as a sum of sine curves of various frequencies
- Lower levels of multigrid have smaller frequency domains than higher levels
- A given level of multigrid damps the error in the upper half of its frequency domain by smoothing
- By going through all levels, all frequencies should be damped

Jacobi Smoothing (Relaxation)
[Figure: three plots of the error]
- Initial error: “rough”, lots of high frequency components; Norm = 1.65
- Error after 1 weighted Jacobi step: “smoother”, less high frequency component
- Error after 2 weighted Jacobi steps: “smooth”, little high frequency component; Norm = .9176, won’t decrease much more
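
The smoothing behavior can be checked numerically. The sketch below assumes the 1D model problem with T = tridiag(-1, 2, -1) on n intervals and a weighted Jacobi weight of omega = 2/3 (both are illustrative assumptions, and the exact numbers differ for the 3D 27-point stencil). For that problem, one weighted Jacobi step multiplies the k-th sine mode of the error by 1 - 2 * omega * sin^2(k * pi / (2n)).

```python
import math

def jacobi_damping(k, n, omega=2.0 / 3.0):
    """Factor applied to error mode k by one weighted Jacobi step (1D model problem)."""
    return 1.0 - 2.0 * omega * math.sin(k * math.pi / (2.0 * n)) ** 2

n = 64
factors = [abs(jacobi_damping(k, n)) for k in range(1, n)]
low = factors[: n // 2 - 1]    # modes k = 1 .. n/2 - 1 (smooth)
high = factors[n // 2 - 1:]    # modes k = n/2 .. n - 1 (oscillatory)

print("worst damping, upper half of spectrum:", max(high))
print("worst damping, lower half of spectrum:", max(low))
```

Under these assumptions every mode in the upper half of the spectrum shrinks by at least a factor of 3 per step, while the smoothest mode is barely touched. This is why the norm stalls after a couple of steps and the remaining smooth error is handed to a coarser grid.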

Multigrid Performance Model
- Memory access is the performance bottleneck
- Each pass over the 3D grid requires (per cell):
  - 27 integer operations (stencil coordinates)
  - 27 FP loads of surrounding grid locations
  - 27 (approx.) FP operations
  - 1 FP store
- Traversing the grid consecutively in memory causes 9 cache misses every (# doubles stored in a cache line) cells, i.e. 9/(# doubles stored in a cache line) misses per cell
- Grid size prevents reuse of cached values
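
The cache-miss estimate is simple arithmetic, sketched below. The 128-byte cache line and 8-byte doubles are assumptions chosen for illustration, and `misses_per_cell` is a hypothetical helper, not part of the original model.

```python
def misses_per_cell(load_streams, line_bytes=128, double_bytes=8):
    """Average cache misses per cell when each load stream is read consecutively.

    Each stream misses once per cache line, and a line holds
    line_bytes / double_bytes consecutive grid values.
    """
    doubles_per_line = line_bytes // double_bytes
    return load_streams / doubles_per_line

# 9 load streams: the 3 x 3 rows of the 27-point stencil that run along
# the unit-stride direction.
print(misses_per_cell(9))   # 9 streams, 16 doubles per line
print(misses_per_cell(3))   # 3 streams, the figure quoted for cache blocking
```

With these assumed sizes, 9 streams cost 9/16 of a miss per cell, so cutting the number of streams reduces memory traffic in direct proportion.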

Multigrid Optimizations
Optimizations are possible in 2 areas:
- Reducing ALU operations per cell
  - Reusing stencil coordinates between cells
  - Reusing partial sums common to consecutive cells
- Improving memory behavior
  - Reducing # of loads (register blocking)
  - Reducing # of cache misses (cache blocking)

Multigrid Optimizations
- Common subexpression elimination
- Loop unrolling
- Memoization
- Cache blocking
- Memoization + cache blocking

Optimizations – Loop Unrolling
- Reduces # of stencil coordinates computed per cell
- Exposes load reuse to the compiler
- Allows the compiler to use FP registers to store grid values, reducing loads
- Minimum number of loads is 9/grid point (given a generous # of FP registers)
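
The 9-loads-per-point limit follows from a short count, sketched here under the assumption that the loop is unrolled along the unit-stride direction and that every loaded value stays in a register for the whole unrolled body. The helper `loads_per_cell` is hypothetical.

```python
def loads_per_cell(unroll):
    """FP loads per output cell for a 27-point stencil unrolled `unroll` times.

    Computing `unroll` consecutive cells in a row touches unroll + 2
    cross-sections of 3 x 3 = 9 values each, and each value is loaded once.
    """
    return 9 * (unroll + 2) / unroll

for u in (1, 2, 4, 8, 16):
    print(u, loads_per_cell(u))
```

With no unrolling the count is the full 27 loads per cell; it falls toward 9 as the unroll factor grows, and the number of available FP registers limits how far this can go in practice.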

Optimizations - Memoization
- Traverses the grid once to precompute partial sums common to consecutive cells
- Traverses the grid again to compute the actual cell values
- 9 integer stencil operations/cell
- 18 FP operations/cell
- Reduces FP register pressure by breaking the computation into two stages, but still uses 9 load streams per cell
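
One way such a two-pass scheme can work is sketched below. This is an illustrative reconstruction, not necessarily the authors' exact formulation: it assumes the 27 stencil weights depend only on the neighbor class (center, face, edge, corner), given as `w[0]` to `w[3]`. The names `stencil_naive`, `stencil_memoized`, and `make_grid` are hypothetical.

```python
import random

def make_grid(n, seed=0):
    """n x n x n grid of reproducible random values, indexed u[i][j][k]."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(n)] for _ in range(n)] for _ in range(n)]

def zeros(n):
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]

def stencil_naive(u, w):
    """Reference: 27 loads and 27 multiply-adds per interior cell."""
    n = len(u)
    out = zeros(n)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                s = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for dk in (-1, 0, 1):
                            s += w[abs(di) + abs(dj) + abs(dk)] * u[i + di][j + dj][k + dk]
                out[i][j][k] = s
    return out

def stencil_memoized(u, w):
    """Two passes: per-plane partial sums first, then a cheap combination."""
    n = len(u)
    edge, diag, out = zeros(n), zeros(n), zeros(n)
    # Pass 1: within each plane i, sum the 4 edge-adjacent and the 4 diagonal
    # in-plane neighbors. Cells i-1, i, and i+1 all reuse these sums.
    for i in range(n):
        p = u[i]
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                edge[i][j][k] = p[j - 1][k] + p[j + 1][k] + p[j][k - 1] + p[j][k + 1]
                diag[i][j][k] = (p[j - 1][k - 1] + p[j - 1][k + 1]
                                 + p[j + 1][k - 1] + p[j + 1][k + 1])
    # Pass 2: combine the partial sums of planes i-1, i, i+1 by neighbor class.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = (
                    w[0] * u[i][j][k]
                    + w[1] * (edge[i][j][k] + u[i - 1][j][k] + u[i + 1][j][k])
                    + w[2] * (diag[i][j][k] + edge[i - 1][j][k] + edge[i + 1][j][k])
                    + w[3] * (diag[i - 1][j][k] + diag[i + 1][j][k])
                )
    return out
```

In this sketch the first pass costs 6 additions per cell and the second costs 8 additions and 4 multiplications, for 18 floating-point operations per cell in total, and the first pass reads the 9 in-plane values per cell. Both results agree up to rounding, since only the order of the additions changes.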

Optimizations – Cache Blocking
- Breaks the 3D grid into blocks that fit within cache
- Attempts to allow reuse between adjacent 2D slices
- Reduces memory traffic to 3 load streams per cell
- Overhead when switching between blocks
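
The change in traversal order can be sketched as follows. This is a minimal illustration under assumptions: each 2D slice is tiled in its j and k dimensions and the sweep streams along i within a tile, so that three adjacent slice tiles could stay in cache. The block sizes `bj` and `bk` and all function names are hypothetical, and a uniform-weight 27-point sum stands in for the real operator. Pure Python will not show the cache effect; the sketch only shows that the blocked loop nest visits the same cells and produces the same values.

```python
def make_grid(n):
    """Deterministic n x n x n test grid, indexed u[i][j][k]."""
    return [[[float(i * n * n + j * n + k) for k in range(n)]
             for j in range(n)] for i in range(n)]

def sum27(u, i, j, k):
    """Uniform-weight 27-point stencil at one interior cell."""
    return sum(u[i + di][j + dj][k + dk]
               for di in (-1, 0, 1) for dj in (-1, 0, 1) for dk in (-1, 0, 1))

def sweep_unblocked(u):
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = sum27(u, i, j, k)
    return out

def sweep_blocked(u, bj, bk):
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    # Outer loops pick a (j, k) tile; the switch between tiles is where
    # the blocking overhead is paid.
    for j0 in range(1, n - 1, bj):
        for k0 in range(1, n - 1, bk):
            j1 = min(j0 + bj, n - 1)
            k1 = min(k0 + bk, n - 1)
            # Stream through the slices; the tile of slice i is reused
            # when computing slices i - 1 and i + 1.
            for i in range(1, n - 1):
                for j in range(j0, j1):
                    for k in range(k0, k1):
                        out[i][j][k] = sum27(u, i, j, k)
    return out
```

Smaller tiles fit more easily in cache but are switched more often, which is the overhead trade-off noted above.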

Summary & Continuing Work
- Overhead of cache blocking is too large for the small block sizes that fit in the IBM Power3’s L1 cache
- Memoization offers the greatest performance benefit due to reduced FP operations
- We are currently analyzing performance data collected on other architectures