
Minimizing Communication in Numerical Linear Algebra: Optimizing Krylov Subspace Methods
Jim Demmel, EECS & Math Departments, UC Berkeley
www.cs.berkeley.edu/~demmel

Outline of Lecture 8: Optimizing Krylov Subspace Methods

k steps of a typical iterative solver for Ax=b or Ax=λx
– Does k SpMVs with a starting vector (e.g. with b, if solving Ax=b)
– Finds the "best" solution among all linear combinations of these k+1 vectors
– Many such "Krylov Subspace Methods": Conjugate Gradients, GMRES, Lanczos, Arnoldi, …

Goal: minimize communication in Krylov Subspace Methods
– Assume the matrix is "well-partitioned," with modest surface-to-volume ratio
– Parallel implementation
   Conventional: O(k log p) messages, because of k calls to SpMV
   New: O(log p) messages – optimal
– Serial implementation
   Conventional: O(k) moves of data from slow to fast memory
   New: O(1) moves of data – optimal
– Lots of speedup possible (modeled and measured)
– Price: some redundant computation
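As a point of reference for the costs above, here is a minimal sketch (not the lecture's code; NumPy/SciPy and the function name are illustrative assumptions) of the conventional kernel: k repeated SpMVs, each of which touches all of A again.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis_naive(A, x, k):
    """Return [x, A@x, A^2@x, ..., A^k@x] by k calls to SpMV.

    Each iteration reads all of A again: O(k) passes over A in the serial
    model, and one round of neighbor messages per SpMV in the parallel model.
    """
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])        # one SpMV = one pass over A
    return V

# Example: 1D Poisson (tridiagonal) matrix, k = 8
n, k = 16, 8
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis_naive(A, np.ones(n), k)
```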

Figure: locally dependent entries for [x, Ax], A tridiagonal, 2 processors. The vectors x, Ax, …, A^8x are shown split between Proc 1 and Proc 2; the highlighted entries can be computed without communication.

Figure: locally dependent entries for [x, Ax, A^2x], A tridiagonal, 2 processors; these entries can be computed without communication.

Figure: locally dependent entries for [x, Ax, …, A^3x], A tridiagonal, 2 processors; these entries can be computed without communication.

Figure: locally dependent entries for [x, Ax, …, A^4x], A tridiagonal, 2 processors; these entries can be computed without communication.

Figure: locally dependent entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors; these entries can be computed without communication, giving k=8-fold reuse of A.

Figure: remotely dependent entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors.
– One message suffices to get the data needed to compute the remotely dependent entries, not k=8 messages
– Minimizes the number of messages = latency cost
– Price: redundant work ∝ "surface/volume ratio"

Figure: spy plot of A with the sparsity pattern of a 5-point stencil, natural order.


Figure: spy plot of A with the sparsity pattern of a 5-point stencil, natural order, assigned to p=9 processors.
For an n x n mesh assigned to p processors in this ordering, each processor needs 2n data from neighbors.

Figure: spy plot of A with the sparsity pattern of a 5-point stencil, nested dissection order.

Figure: spy plot of A with the sparsity pattern of a 5-point stencil, nested dissection order.
For an n x n mesh assigned to p processors in this ordering, each processor needs 4n/p^{1/2} data from neighbors.

Figure: remotely dependent entries for [x, Ax, …, A^3x], A with the sparsity pattern of a 5-point stencil.

Figure (continued): remotely dependent entries for [x, Ax, …, A^3x], A with the sparsity pattern of a 5-point stencil.

Figure: remotely dependent entries for [x, Ax, A^2x, A^3x], A irregular, multiple processors.


Figure: reducing redundant work for a tridiagonal matrix.

Summary of Parallel Optimizations so far, for the "Akx kernel": (A, x, k) -> [Ax, A^2x, …, A^kx]

Possible to reduce #messages from O(k) to O(1)
– Depends on the matrix being "well-partitioned": each processor only needs data from a few neighbors
– Price: redundant computation of some entries of A^jx
– Amount of redundant work depends on the "surface-to-volume ratio" of the partition: ideally each processor needs only a little data from neighbors, compared to what it owns itself
   Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; #flops grows by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
   Pretty good: 2D mesh (n x n): need O(n/p^{1/2}) data from neighbors, versus O(n^2/p) locally; #flops grows by 1 + O(k/(data_per_proc)^{1/2})
   Less good: 3D mesh (n x n x n): need O(n^2/p^{2/3}) data from neighbors, versus O(n^3/p) locally; #flops grows by 1 + O(k/(data_per_proc)^{1/3})

(A minimal code sketch of this kernel for the tridiagonal case follows below.)
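To make the schedule concrete, here is a minimal sketch (not the lecture's implementation; NumPy/SciPy assumed, function names illustrative) of one processor's work in the Akx kernel for a tridiagonal A: a single up-front fetch of k ghost values per neighbor, then k local stencil sweeps over a shrinking valid window, with the out-of-block work being the redundant "surface" computation.

```python
import numpy as np
import scipy.sparse as sp

def apply_row(v, i, n):
    # (A v)[i] for the 1D Poisson stencil 2*v[i] - v[i-1] - v[i+1],
    # with zero Dirichlet values outside indices 0..n-1.
    left = v[i - 1] if i > 0 else 0.0
    right = v[i + 1] if i < n - 1 else 0.0
    return 2.0 * v[i] - left - right

def akx_local(x, lo, hi, k):
    """Entries lo:hi of [A x, A^2 x, ..., A^k x], computed from ONE up-front
    fetch of (up to) k ghost values per side instead of k separate exchanges."""
    n = len(x)
    ext_lo, ext_hi = max(0, lo - k), min(n, hi + k)   # one message per neighbor
    cur = np.zeros(n)
    cur[ext_lo:ext_hi] = x[ext_lo:ext_hi]             # extended local copy of x
    v_lo, v_hi = ext_lo, ext_hi                       # window where cur is valid
    owned = []
    for _ in range(k):
        # The valid window shrinks by one at each interior boundary per step;
        # the work done outside lo:hi is the redundant "surface" computation.
        new_lo = v_lo + 1 if ext_lo > 0 else 0
        new_hi = v_hi - 1 if ext_hi < n else n
        nxt = np.zeros(n)
        for i in range(new_lo, new_hi):
            nxt[i] = apply_row(cur, i, n)
        cur, v_lo, v_hi = nxt, new_lo, new_hi
        owned.append(cur[lo:hi].copy())
    return owned

# Check the owned entries against the naive k-SpMV loop on a small problem.
n, k, lo, hi = 16, 4, 8, 16
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
x = np.arange(1.0, n + 1)
V, v = [], x
for _ in range(k):
    v = A @ v
    V.append(v)
assert all(np.allclose(V[j][lo:hi], akx_local(x, lo, hi, k)[j]) for j in range(k))
```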

Figure: predicted speedup of [Ax, A^2x, …, A^kx] on a model Petascale machine, 2D (n x n) mesh; axes are k and log2 n.

Figure: predicted fraction of time spent on arithmetic for [Ax, A^2x, …, A^kx] on a model Petascale machine, 2D (n x n) mesh; axes are k and log2 n.

Figure: predicted ratio of extra arithmetic for [Ax, A^2x, …, A^kx] on a model Petascale machine, 2D (n x n) mesh; axes are k and log2 n.

Figure: predicted speedup of [Ax, A^2x, …, A^kx] on a model Petascale machine, 3D (n x n x n) mesh; axes are k and log2 n.

Figure: predicted fraction of time spent on arithmetic for [Ax, A^2x, …, A^kx] on a model Petascale machine, 3D (n x n x n) mesh; axes are k and log2 n.

Figure: predicted ratio of extra arithmetic for [Ax, A^2x, …, A^kx] on a model Petascale machine, 3D (n x n x n) mesh; axes are k and log2 n.

Minimizing #messages versus #words_moved

Parallel case
– Can't reduce the #words needed from other processors; they are required to get the right answer

Sequential case
– Can reduce the #words needed from slow memory
– Assume A and x are too large to fit in fast memory
– A naïve implementation of [Ax, A^2x, …, A^kx] by k calls to SpMV needs to move A and x between fast and slow memory k times
– We will move them ≈ one time – clearly optimal

Figure: sequential [x, Ax, …, A^4x] with a memory hierarchy.
– One read of the matrix from slow memory, not k=4 reads
– Minimizes #words_moved = bandwidth cost
– No redundant work

Figure: remotely dependent entries for [x, Ax, …, A^3x], A 100x100 with bandwidth 2; only ~25% of A and the vectors fit in fast memory.

In what order should the sequential algorithm process a general sparse matrix?
– For a band matrix, we obviously want to process the matrix "from left to right", since the data we need for the next partition is already in fast memory
– It is not obvious what the best order is for a general matrix
– We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP):
   One vertex per matrix partition
   Weight of edge (j, k) is the memory cost of processing partition k right after partition j
   TSP: find the lowest-cost "tour" visiting all vertices
(A small sketch of this formulation, with a greedy heuristic in place of an exact TSP solver, follows below.)
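A minimal sketch of one way to instantiate the TSP above (an assumption for illustration, not the lecture's implementation): each partition is described by the set of column indices it touches, the edge weight (j, k) approximates the words that must be re-loaded into fast memory when k is processed right after j, and a greedy nearest-neighbor tour stands in for an exact TSP solver.

```python
def greedy_partition_order(touched):
    """touched: list of sets; touched[j] = column indices partition j needs."""
    def cost(j, k):
        return len(touched[k] - touched[j])   # words not already resident

    remaining = set(range(1, len(touched)))
    tour = [0]                                # start from partition 0
    while remaining:
        j = tour[-1]
        k = min(remaining, key=lambda t: cost(j, t))
        tour.append(k)
        remaining.remove(k)
    return tour

# Example: band-like partitions; the heuristic recovers left-to-right order.
parts = [set(range(0, 40)), set(range(30, 70)), set(range(60, 100))]
print(greedy_partition_order(parts))   # -> [0, 1, 2]
```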

What about multicore?

Two kinds of communication to minimize
– Between processors on the chip
– Between on-chip cache and off-chip DRAM

Use a hybrid of both techniques described so far
– Use the parallel optimization so each core can work independently
– Use the sequential optimization to minimize the off-chip DRAM traffic of each core

Figure: speedups on Intel Clovertown (8 cores).
– Test matrices include stencils and practical matrices
– See the SC09 paper on bebop.cs.berkeley.edu for details

Minimizing Communication of GMRES to solve Ax=b

GMRES: find x in span{b, Ab, …, A^kb} minimizing ||Ax-b||_2
Cost of k steps of standard GMRES vs new GMRES

Standard GMRES
   for i = 1 to k
      w = A · v(i-1)
      MGS(w, v(0), …, v(i-1))
      update v(i), H
   endfor
   solve LSQ problem with H
   Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS
   Parallel: #messages = O(k) from SpMV + O(k^2·log p) from MGS

Communication-Avoiding GMRES (CA-GMRES)
   W = [v, Av, A^2v, …, A^kv]
   [Q, R] = TSQR(W)   … "Tall Skinny QR"
   build H from R, solve LSQ problem
   Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
   Parallel: #messages = O(1) from computing W + O(log p) from TSQR

Oops – W is from the power method, precision lost!
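The structure of one CA-GMRES cycle can be sketched as follows (a hedged illustration, not the lecture's code: NumPy/SciPy assumed, plain QR standing in for TSQR, and the monomial-basis change-of-basis matrix E spelled out so H can be recovered from R via A·W[:, :k] = W·E):

```python
import numpy as np
import scipy.sparse as sp

def ca_gmres_cycle(A, b, k):
    n = A.shape[0]
    # Akx kernel: W = [b, Ab, ..., A^k b]  (one communication phase in parallel)
    W = np.empty((n, k + 1))
    W[:, 0] = b
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    # TSQR stand-in (one reduction in parallel)
    Q, R = np.linalg.qr(W)
    # Monomial basis satisfies A·W[:, :k] = W[:, 1:] = W·E
    E = np.eye(k + 1, k, k=-1)
    H = (R @ E) @ np.linalg.inv(R[:k, :k])      # (k+1) x k upper Hessenberg
    # Least-squares problem, as in standard GMRES
    y, *_ = np.linalg.lstsq(H, Q.T @ b, rcond=None)
    return Q[:, :k] @ y                          # approximate solution

# Small well-conditioned example; for larger k the monomial basis loses
# precision, which is exactly the instability discussed on the next slides.
n, k = 50, 8
A = sp.diags([-1, 4, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = ca_gmres_cycle(A, b, k)
print(np.linalg.norm(A @ x - b))
```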


How to make CA-GMRES stable?

Use a different polynomial basis for the Krylov subspace: not the monomial basis W = [v, Av, A^2v, …], but instead [v, p_1(A)v, p_2(A)v, …]

Possible choices:
– Newton basis: W_N = [v, (A – θ_1 I)v, (A – θ_2 I)(A – θ_1 I)v, …], where the shifts θ_i are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so "free")
– Chebyshev basis: W_C = [v, T_1(A)v, T_2(A)v, …], where the T_i(z) are chosen to be small on a region of the complex plane containing the large eigenvalues
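A minimal sketch of the Newton-basis version of the kernel (assuming the shifts are supplied, e.g. Ritz values from a preliminary Arnoldi run; not the lecture's code). The loop has the same communication pattern as the monomial Akx kernel, only each step applies (A − θ_j I) instead of A; the change-of-basis matrix then becomes bidiagonal with the θ_j on its diagonal, so H can still be recovered from R.

```python
import numpy as np

def newton_basis(A, v, shifts):
    """W_N = [v, (A-θ1 I)v, (A-θ2 I)(A-θ1 I)v, ...] for shifts = [θ1, θ2, ...]."""
    k = len(shifts)
    W = np.empty((len(v), k + 1), dtype=complex)   # shifts may be complex Ritz values
    W[:, 0] = v
    for j, theta in enumerate(shifts):
        W[:, j + 1] = A @ W[:, j] - theta * W[:, j]   # apply (A - θ_{j+1} I)
    return W

# Tiny usage example with a dense A and real shifts:
A = np.diag(np.arange(1.0, 6.0))
W = newton_basis(A, np.ones(5), shifts=[1.0, 2.0, 3.0])
```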

Figure: the "monomial" basis [Ax, …, A^kx] fails to converge, while the Newton polynomial basis does converge.

Figure: speedups on the 8-core Clovertown.

Summary of what is known (1/3), and open questions

GMRES
– Can independently choose k to optimize speed and the restart length r to optimize convergence
– Need to "co-tune" the Akx kernel and TSQR
– Know how to use more stable polynomial bases
– Proven speedups

Can similarly reorganize other Krylov methods
– Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
– Conjugate Gradients, for Ax = b
– Biconjugate Gradients, for Ax = b
– BiCGStab?
– Other Krylov methods?

Summary of what is known (2/3), and open questions

Preconditioning: MAx = Mb
– Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
– Can think of this as the union of two of the earlier kernels: [x, (MA)x, (MA)^2x, …, (MA)^kx] and [x, Ax, (AM)Ax, (AM)^2Ax, …, (AM)^kAx]
– For which preconditioners M can we minimize communication?
   Easy: diagonal M
   How about block-diagonal M?
(A short sketch of this alternating kernel follows below.)
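A minimal sketch (an illustration, not from the lecture) of the alternating kernel named above, built by alternating applications of A and of the preconditioner M:

```python
import numpy as np

def preconditioned_kernel(A, M, x, k):
    """Return [x, Ax, MAx, AMAx, MAMAx, ...] with 2k vectors after x."""
    V = [x]
    for j in range(2 * k):
        op = A if j % 2 == 0 else M        # A on even steps, M on odd steps
        V.append(op @ V[-1])
    return V

# Tiny usage example with a Jacobi (diagonal) preconditioner:
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
M = np.diag(1.0 / np.diag(A))
seq = preconditioned_kernel(A, M, np.ones(2), k=2)   # x, Ax, MAx, AMAx, MAMAx
```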

Figures (sequence of five slides): examining [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal.

Summary of what is known (3/3), and open questions

Preconditioning: MAx = Mb
– Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
– For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
– Can take advantage of the low rank to minimize communication:
   y_i = ((AM)^k)_{ij} · x_j = U·Σ·V · x_j = U·Σ·(V · x_j)
   Compute (V · x_j) on proc j, send it to proc i, reconstruct y_i on proc i
– Works (in principle) for Hierarchically Semi-Separable M
– How does it work in practice?

For details (and open problems) see M. Hoemmen's PhD thesis
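A minimal sketch of the low-rank trick above (an illustration, not Hoemmen's implementation; the factor names and sizes are made up): if the off-diagonal block ((AM)^k)_{ij} = U @ V has small rank r, proc j sends only the r-vector V @ x_j instead of an m-vector, and proc i reconstructs its contribution to y_i with U.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 1000, 3                      # block size, (small) off-diagonal rank
U = rng.standard_normal((m, r))     # stored on proc i
V = rng.standard_normal((r, m))     # stored on proc j
x_j = rng.standard_normal(m)

t = V @ x_j                         # computed on proc j: r words to send, not m
y_i = U @ t                         # reconstructed on proc i
assert np.allclose(y_i, (U @ V) @ x_j)
```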

Of things not said (much about) …

Other Communication-Avoiding (CA) dense factorizations
– QR with column pivoting (tournament pivoting)
– Cholesky with diagonal pivoting
– GE with "complete pivoting"
– LDL^T? Maybe with complete pivoting?

CA sparse factorizations
– Cholesky, assuming good separators
– Lower bounds from Lipton/Rose/Tarjan
– Matrix multiplication, anything else?

Strassen, and all linear algebra with #words_moved = O(n^ω / M^{ω/2-1})
– Parallel?

Extending CA-Krylov methods to the "bottom solver" in multigrid with Adaptive Mesh Refinement (AMR)

Summary

Don't Communic…

Time to redesign all dense and sparse linear algebra

EXTRA SLIDES