Conjugate gradients, sparse matrix-vector multiplication, graphs, and meshes. Thanks to Aydin Buluc, Umit Catalyurek, Alan Edelman, and Kathy Yelick for some of these slides.

The middleware of scientific computing: continuous physical modeling → linear algebra (Ax = b) → computers.

Example: The Temperature Problem. A cabin in the snow (a square region); wall temperature is 0°, except for a radiator at 100°. What is the temperature in the interior?

The physics: Poisson’s equation
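
In two dimensions, Poisson’s equation for the temperature t(x,y) reads ∂²t/∂x² + ∂²t/∂y² = f(x,y); for the cabin problem the interior satisfies it with f = 0 (Laplace’s equation), and the wall and radiator temperatures supply the boundary values.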

Many Physical Models Use Stencil Computations: PDE models of heat, fluids, structures, …; weather, airplanes, bridges, bones, …; the Game of Life; many, many others.

Model Problem: Solving Poisson’s equation for temperature. Discrete approximation to Poisson’s equation: t(i) = ¼ ( t(i-k) + t(i-1) + t(i+1) + t(i+k) ), where k = n^(1/2). Intuitively: the temperature at a point is the average of the temperatures at the surrounding points.
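
As a concrete (serial, illustrative) rendering of this update: one Jacobi-style sweep over the k-by-k interior, assuming the values are stored in a (k+2)-by-(k+2) array whose border holds the fixed wall and radiator temperatures (the array layout and function name are illustrative, not from the slides).

/* One Jacobi sweep for the discrete temperature (Poisson) problem.
   The grid is (k+2) x (k+2): row 0, row k+1, column 0, and column k+1
   hold the fixed boundary temperatures and are never updated. */
void jacobi_sweep(int k, const double *t_old, double *t_new)
{
    int stride = k + 2;
    for (int i = 1; i <= k; i++) {
        for (int j = 1; j <= k; j++) {
            /* new temperature = average of the four neighboring temperatures */
            t_new[i*stride + j] = 0.25 * ( t_old[(i-1)*stride + j]
                                         + t_old[(i+1)*stride + j]
                                         + t_old[i*stride + (j-1)]
                                         + t_old[i*stride + (j+1)] );
        }
    }
}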

Examples of stencils: 5-point stencil in 2D (temperature problem); 9-point stencil in 2D (Game of Life); 7-point stencil in 3D (3D temperature problem); 25-point stencil in 3D (seismic modeling); … and many more.

Parallelizing Stencil Computations. Parallelism is simple: the grid is a regular data structure, so an even decomposition across processors gives load balance. Spatial locality limits communication cost: communicate only boundary values from neighboring patches. Communication volume v = total # of boundary cells between patches.

Two-dimensional block decomposition. With n mesh cells and p processors, each processor has a patch of n/p cells. Block row (or block column) layout: v = 2 * p * sqrt(n). 2-dimensional block layout: v = 4 * sqrt(p) * sqrt(n).
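
As a worked example with illustrative numbers: for n = 10^6 mesh cells and p = 100 processors, sqrt(n) = 1000 and sqrt(p) = 10, so the block row layout communicates v = 2 * 100 * 1000 = 200,000 boundary cells while the 2-dimensional block layout communicates v = 4 * 10 * 1000 = 40,000, a factor of sqrt(p)/2 = 5 less.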

Detailed complexity measures for data movement I: Latency/Bandwidth Model. Data moves between processors by message-passing. Machine parameters: α (or t_startup), the latency or message startup time in seconds, and β (or t_data), the inverse bandwidth in seconds per word (the slide gives measured values of α and β between nodes of Triton). Time to send & recv or bcast a message of w words: α + w*β. With t_comm the total communication time and t_comp the total computation time, the total parallel time is t_p = t_comp + t_comm.
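
A worked example with made-up machine parameters (not the Triton values): if α = 10^-5 seconds and β = 10^-8 seconds per word, a w = 1000-word boundary message costs α + w*β = 10^-5 + 10^-5 = 2 × 10^-5 seconds, so latency and bandwidth contribute about equally; a 10-word message costs about 1.01 × 10^-5 seconds, almost pure latency, which is one reason sending fewer, larger messages pays off.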

Ghost Nodes in Stencil Computations. Comm cost = α * (#messages) + β * (total size of messages). Keep a ghost copy of neighbors’ boundary nodes; communicate every second iteration, not every iteration. This reduces #messages, not the total size of messages, and costs extra memory and computation. Can also use more than one layer of ghost nodes. (In the figure: green = my interior nodes; yellow = neighbors’ boundary nodes = my “ghost nodes”; blue = my boundary nodes.)

Parallelism in Regular Meshes. Computing a stencil on a regular mesh requires communicating mesh points near the boundary to neighboring processors, often done with ghost regions. The surface-to-volume ratio keeps communication down, but it can still be problematic in practice. Implemented using “ghost” regions, which adds memory overhead.
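
A minimal MPI sketch of the ghost-region exchange for a 1D (block row) decomposition, assuming each process owns local_rows rows of width n stored in a (local_rows+2)-by-n array whose first and last rows are the ghost rows; the function and variable names are illustrative, and error checking is omitted.

/* Exchange ghost rows with the processes above and below.
   Row 0 and row local_rows+1 of u are the ghost rows. */
#include <mpi.h>

void exchange_ghost_rows(double *u, int local_rows, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first interior row up; receive the lower neighbor's row into my bottom ghost row */
    MPI_Sendrecv(&u[1*n],                n, MPI_DOUBLE, up,   0,
                 &u[(local_rows+1)*n],   n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my last interior row down; receive the upper neighbor's row into my top ghost row */
    MPI_Sendrecv(&u[local_rows*n],       n, MPI_DOUBLE, down, 1,
                 &u[0*n],                n, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}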

Model Problem: Solving Poisson’s equation for temperature. For each i from 1 to n, except on the boundaries: – t(i-k) – t(i-1) + 4*t(i) – t(i+1) – t(i+k) = 0, with k = n^(1/2). That is n equations in n unknowns: A*t = b. Each row of A has at most 5 nonzeros. (In three dimensions, k = n^(1/3) and each row has at most 7 nonzeros.)

A Stencil Computation Solves a System of Linear Equations. Solve Ax = b for x: matrix A, right-hand side vector b, unknown vector x. A is sparse: most of its entries are 0.

Conjugate gradient iteration to solve A*x = b. One matrix-vector multiplication per iteration, two vector dot products per iteration, four n-vectors of working storage.
x_0 = 0,  r_0 = b,  d_0 = r_0   (these are all vectors)
for k = 1, 2, 3, ...
    α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    [step length]
    x_k = x_{k-1} + α_k d_{k-1}                          [approximate solution]
    r_k = r_{k-1} – α_k A d_{k-1}                        [residual = b – A x_k]
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              [improvement]
    d_k = r_k + β_k d_{k-1}                              [search direction]
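
A minimal serial C sketch of exactly this iteration, assuming the caller supplies a matvec routine for the (symmetric positive definite) matrix A and that starting from x_0 = 0 is acceptable; the names and the simple residual-norm stopping test are illustrative choices, not part of the slide.

/* Conjugate gradient: one matvec and two dot products per iteration,
   four n-vectors of working storage (x, r, d, Ad). */
#include <math.h>
#include <stdlib.h>

typedef void (*matvec_fn)(int n, const double *w, double *v);  /* v = A*w */

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

void cg(int n, matvec_fn matvec, const double *b, double *x, int maxit, double tol)
{
    double *r  = malloc(n * sizeof *r);    /* residual         */
    double *d  = malloc(n * sizeof *d);    /* search direction */
    double *Ad = malloc(n * sizeof *Ad);   /* A times d        */

    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; d[i] = b[i]; }
    double rr = dot(n, r, r);

    for (int k = 1; k <= maxit && sqrt(rr) > tol; k++) {
        matvec(n, d, Ad);                                        /* the one matvec       */
        double alpha = rr / dot(n, d, Ad);                       /* step length          */
        for (int i = 0; i < n; i++) x[i] += alpha * d[i];        /* approximate solution */
        for (int i = 0; i < n; i++) r[i] -= alpha * Ad[i];       /* residual             */
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;                               /* improvement          */
        for (int i = 0; i < n; i++) d[i] = r[i] + beta * d[i];   /* search direction     */
        rr = rr_new;
    }
    free(r); free(d); free(Ad);
}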

Vector and matrix primitives for CG.
DAXPY: v = α*v + β*w (vectors v, w; scalars α, β). Broadcast the scalars α and β, then independent * and +; comm volume = 2p, span = log n.
DDOT: α = v^T*w = Σ_j v[j]*w[j] (vectors v, w; scalar α). Independent *, then a + reduction; comm volume = p, span = log n.
Matvec: v = A*w (matrix A, vectors v, w). The hard part. But all you need is a subroutine to compute v from w. Sometimes you don’t need to store A (e.g. the temperature problem); usually you do need to store A, but it’s sparse...
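
A sketch of the two easy primitives on p MPI processes, assuming each process owns a contiguous block of n_local entries of every vector and that the scalars have already been broadcast; only the dot product needs communication (one MPI_Allreduce, a log-depth reduction).

#include <mpi.h>

/* DAXPY-style update v = alpha*v + beta*w: purely local, no communication */
void daxpy(int n_local, double alpha, double *v, double beta, const double *w)
{
    for (int i = 0; i < n_local; i++)
        v[i] = alpha * v[i] + beta * w[i];
}

/* DDOT: local partial sum, then a global sum across all processes */
double ddot(int n_local, const double *v, const double *w, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += v[i] * w[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}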

Broadcast and reduction. Broadcast of 1 value to p processors in log p time; reduction of p values to 1 in log p time. Takes advantage of associativity in +, *, min, max, etc. (The figure shows a broadcast of α and an add-reduction.)

Where’s the data (temperature problem)? The matrix A: nowhere!! The vectors x, b, r, d: each vector holds one value per stencil point; divide the stencil points among processors, n/p points each. How do you divide up the sqrt(n)-by-sqrt(n) region of points? Block row (or block column) layout: v = 2 * p * sqrt(n). 2-dimensional block layout: v = 4 * sqrt(p) * sqrt(n).

How do you partition the sqrt(n)-by-sqrt(n) stencil points? First version: number the grid by rows. This leads to a block row decomposition of the region, with v = 2 * p * sqrt(n).

How do you partition the sqrt(n)-by-sqrt(n) stencil points? Second version: a 2D block decomposition. The numbering is a little more complicated, but v = 4 * sqrt(p) * sqrt(n).

The Landscape of Ax = b Algorithms. (The slide’s 2-by-2 picture: Gaussian elimination versus iterative methods, for any matrix versus a symmetric positive definite matrix; toward elimination the methods are more robust, toward iterative they need less storage, and toward any matrix they are more general.)

Conjugate gradient in general. CG can be used to solve any system Ax = b, if the matrix A is symmetric (a_ij = a_ji) and positive definite (all eigenvalues > 0). Symmetric positive definite matrices occur a lot in scientific computing & data analysis! But usually the matrix isn’t just a stencil, so now we do need to store the matrix A. Where’s the data? The key is to use graph data structures and algorithms.

Graphs and Sparse Matrices. A sparse matrix is a representation of a (sparse) graph: matrix entries are edge weights, the number of nonzeros per row is the vertex degree, and edges represent the data dependencies in matrix-vector multiplication.
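
A small example: the graph with vertices {1, 2, 3} and edges 1-2 of weight 4 and 2-3 of weight 5 corresponds to a symmetric matrix with A(1,2) = A(2,1) = 4 and A(2,3) = A(3,2) = 5. Row 2 has two off-diagonal nonzeros because vertex 2 has degree 2, and in the matvec y = A*x the entry y(2) = 4*x(1) + 5*x(3) depends on x exactly at the neighbors of vertex 2; that is the data dependency the edges represent.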

Parallel Dense Matrix-Vector Product (Review). y = A*x, where A is a dense matrix. Layout: 1D by rows; A(i) is the n/p-by-n block row that processor P_i owns, and X(j) is the block of x that processor P_j owns. Algorithm: for each processor j, broadcast X(j); then each processor i computes Y(i) = A(i)*X = Σ_j A(i,j)*X(j), where A(i,j) is the sub-block of A(i) that multiplies X(j).
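
A sketch of this 1D-by-rows dense matvec in MPI, assuming n is divisible by p so every process owns nloc = n/p rows of A and nloc entries of x and y; MPI_Allgather plays the role of broadcasting the X(j) blocks, after which every process holds the full x. Names are illustrative.

#include <mpi.h>
#include <stdlib.h>

void dense_matvec_1d(int n, int nloc,
                     const double *A_local,   /* nloc x n block row, stored by rows */
                     const double *x_local,   /* my nloc entries of x               */
                     double *y_local,         /* my nloc entries of y               */
                     MPI_Comm comm)
{
    double *x_full = malloc(n * sizeof *x_full);

    /* gather everyone's block of x so each process has the whole vector */
    MPI_Allgather(x_local, nloc, MPI_DOUBLE, x_full, nloc, MPI_DOUBLE, comm);

    for (int i = 0; i < nloc; i++) {          /* my rows of A            */
        double sum = 0.0;
        for (int j = 0; j < n; j++)           /* full row times full x   */
            sum += A_local[i*n + j] * x_full[j];
        y_local[i] = sum;
    }
    free(x_full);
}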

Parallel sparse matrix-vector product. Lay out the matrix and vectors by rows; y(i) = Σ_j A(i,j)*x(j), computing only the terms with A(i,j) ≠ 0. Algorithm: each processor i broadcasts x(i), then computes y(i) = A(i,:)*x. Optimizations: only send each processor the parts of x it needs, to reduce communication; reorder the matrix for better locality by graph partitioning; worry about balancing the number of nonzeros per processor if rows have very different nonzero counts.

Data structure for sparse matrix A (stored by rows). Full matrix: a 2-dimensional array of real or complex numbers, using nrows*ncols memory. Sparse matrix: compressed row storage (CSR), using about 2*nzs + nrows memory.
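
A minimal sketch of compressed row storage and the corresponding serial matvec, which is also the local kernel y(i) = A(i,:)*x used in the parallel algorithm above; field and function names are illustrative. With nnz nonzeros, colind and val take 2*nnz words and rowptr takes nrows+1, matching the memory estimate on the slide.

typedef struct {
    int nrows;
    int *rowptr;     /* rowptr[i] .. rowptr[i+1]-1 index row i's nonzeros */
    int *colind;     /* column index of each nonzero                      */
    double *val;     /* value of each nonzero                             */
} csr_matrix;

void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
            sum += A->val[k] * x[A->colind[k]];   /* only terms with A(i,j) != 0 */
        y[i] = sum;
    }
}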

Distributed-memory sparse matrix data structure. The rows are divided among processors P0, P1, …, Pn; each processor stores the number of its local nonzeros, the range of local rows it owns, and those rows’ nonzeros in CSR form.
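
One possible C rendering of the per-processor piece just described; the field names are illustrative, not from any particular library.

typedef struct {
    int row_start, row_end;   /* range of global rows owned locally        */
    int local_nnz;            /* number of local nonzeros                  */
    int *rowptr;              /* local rows in CSR form ...                */
    int *colind;              /* ... with global column indices            */
    double *val;              /* nonzero values                            */
} dist_csr_matrix;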

Irregular mesh: NASA Airfoil in 2D

Composite Mesh from a Mechanical Structure

Converting the Mesh to a Matrix

Adaptive Mesh Refinement (AMR): adaptive mesh around an explosion; refinement is done by calculating errors

Adaptive Mesh: shock waves in gas dynamics using AMR (Adaptive Mesh Refinement); the figure shows fluid density

Irregular mesh: Tapered Tube (Multigrid)

Scientific computation and data analysis: continuous physical modeling → linear algebra → computers.

Scientific computation and data analysis: continuous physical modeling → linear algebra → computers; discrete structure analysis → graph theory → computers.

Scientific computation and data analysis: continuous physical modeling and discrete structure analysis → linear algebra & graph theory → computers.