Minisymposia 9 and 34: Avoiding Communication in Linear Algebra
Jim Demmel, UC Berkeley
bebop.cs.berkeley.edu

Motivation (1)
Increasing parallelism to exploit
– From Top500 to multicores in your laptop
Exponentially growing gaps between
– Floating point time << 1/Network BW << Network Latency
  Improving 59%/year vs 26%/year vs 15%/year
– Floating point time << 1/Memory BW << Memory Latency
  Improving 59%/year vs 23%/year vs 5.5%/year
Goal 1: reorganize linear algebra to avoid communication
– Not just hiding communication (speedup at most 2x)
– Arbitrary speedups possible
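For a rough sense of scale, a back-of-the-envelope calculation (not a figure from the talk) using the improvement rates quoted above shows how quickly the gap between flop rate and latency widens:

```python
# Growth of the flop-rate vs. latency gap implied by the slide's annual
# improvement rates: 59% (flops), 15% (network latency), 5.5% (memory latency).
years = 10
flops, net_lat, mem_lat = 1.59, 1.15, 1.055
print("network latency gap grows about %.0fx per decade" % (flops / net_lat) ** years)  # ~25x
print("memory latency gap grows about %.0fx per decade" % (flops / mem_lat) ** years)   # ~60x
```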

Motivation (2)
Algorithms and architectures getting more complex
– Performance harder to understand
– Can't count on conventional compiler optimizations
Goal 2: Automate algorithm reorganization ("autotuning")
– Emulate success of PHiPAC, ATLAS, FFTW, OSKI, etc.
Example: sparse matrix-vector multiply (SpMV) on multicore, Cell
– Sam Williams, Rich Vuduc, Lenny Oliker, John Shalf, Kathy Yelick

Autotuned Performance of SpMV (1)
– Clovertown was already fully populated with DIMMs
– Gave Opteron as many DIMMs as Clovertown
– Firmware update for Niagara2
– Array padding to avoid inter-thread conflict misses
– PPEs use ~1/3 of Cell chip area
[Figure: SpMV performance on Intel Clovertown, AMD Opteron, and Sun Niagara2 (Huron), stacking optimizations: naïve, naïve Pthreads, +NUMA/affinity, +SW prefetching, +compression, +cache/TLB blocking, +more DIMMs (Opteron), FW fix and array padding (Niagara2), etc.]

Autotuned Performance of SpMV (2)
– Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
– Enabled 1x1 BCOO: ~16% improvement
[Figure: SpMV performance on IBM Cell Blade (SPEs), Sun Niagara2 (Huron), AMD Opteron, and Intel Clovertown; same stacked optimizations as above, plus a better Cell implementation]

Outline of Minisymposia 9 & 34
Minimize communication in linear algebra, autotuning
MS9: Direct methods (now)
– Dense LU: Laura Grigori
– Dense QR: Julien Langou
– Sparse LU: Hua Xiang
MS34: Iterative methods (Thursday, 4-6pm)
– Jacobi iteration with stencils: Kaushik Datta
– Gauss-Seidel iteration: Michelle Strout
– Bases for Krylov subspace methods: Marghoob Mohiyuddin
– Stable Krylov subspace methods: Mark Hoemmen

Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
– Can be computed without communication
[Figure: entries of x, Ax, ..., A^8 x assigned to Proc 1 and Proc 2, marking those computable without communication]

Locally Dependent Entries for [x, Ax, A^2 x], A tridiagonal, 2 processors
– Can be computed without communication
[Figure: as above, for [x, Ax, A^2 x]]

Locally Dependent Entries for [x, Ax, ..., A^3 x], A tridiagonal, 2 processors
– Can be computed without communication
[Figure: as above, for [x, Ax, ..., A^3 x]]

Locally Dependent Entries for [x, Ax, ..., A^4 x], A tridiagonal, 2 processors
– Can be computed without communication
[Figure: as above, for [x, Ax, ..., A^4 x]]

Locally Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
– Can be computed without communication
– k=8-fold reuse of A
[Figure: as above, for [x, Ax, ..., A^8 x]]

Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
– One message to get data needed to compute remotely dependent entries, not k=8
– Minimizes number of messages = latency cost
– Price: redundant work, proportional to the "surface/volume ratio"
[Figure: as above, marking the remotely dependent entries each processor fetches in its single message]
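To make the parallel idea concrete, here is a minimal numpy sketch (an illustration, not code from the talk) that simulates the two-processor tridiagonal case: each "processor" receives k ghost entries of x from its neighbor in a single message-equivalent copy, then computes its rows of Ax, ..., A^k x with no further communication, redoing a small amount of surface work near the block boundary. The stencil tridiag(-1, 2, -1) and the function name local_matrix_powers are illustrative assumptions.

```python
import numpy as np

def local_matrix_powers(x, k, lo, hi, n):
    """This 'processor' owns rows lo..hi-1 of A = tridiag(-1, 2, -1).
    The k ghost entries on each side stand in for the one message exchanged
    with each neighbor; all k levels are then computed locally."""
    ext_lo, ext_hi = max(lo - k, 0), min(hi + k, n)
    v = x[ext_lo:ext_hi].copy()          # owned entries + ghost entries (one message per side)
    owned = slice(lo - ext_lo, hi - ext_lo)
    levels = []
    for _ in range(k):
        w = np.full_like(v, np.nan)      # NaN marks entries whose stencil reaches past the ghost zone
        w[1:-1] = -v[:-2] + 2 * v[1:-1] - v[2:]
        if ext_lo == 0:                  # physical boundary rows have only two nonzeros
            w[0] = 2 * v[0] - v[1]
        if ext_hi == n:
            w[-1] = -v[-2] + 2 * v[-1]
        v = w
        levels.append(v[owned].copy())   # owned rows stay valid thanks to the k ghost layers
    return levels

if __name__ == "__main__":
    n, k = 16, 4
    x = np.random.rand(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    ref = [x]
    for _ in range(k):
        ref.append(A @ ref[-1])
    for lo, hi in [(0, n // 2), (n // 2, n)]:   # 2 processors, block row layout
        local = local_matrix_powers(x, k, lo, hi, n)
        for j in range(1, k + 1):
            assert np.allclose(local[j - 1], ref[j][lo:hi])
    print("owned rows of Ax, ..., A^k x match the direct computation")
```

The redundant work is exactly the stencil applications inside the ghost region, which is the surface/volume price mentioned above.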

Remotely Dependent Entries for [x, Ax, ..., A^3 x], A irregular, multiple processors

Fewer Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
– Reduce redundant work by half
[Figure: as above, with fewer remotely dependent entries]

Sequential [x, Ax, ..., A^4 x], with memory hierarchy
– One read of matrix from slow memory, not k=4
– Minimizes words moved = bandwidth cost
– No redundant work
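As a companion to the parallel sketch, here is a small numpy sketch of the sequential idea (again an illustrative assumption, not code from the talk) for a tridiagonal A stored by diagonals: the loops are skewed so that while row i of A is "read", every level whose dependencies are ready advances by one entry, so A conceptually streams through fast memory once and no entry is recomputed.

```python
import numpy as np

def streaming_matrix_powers(dl, d, du, x, k):
    """Compute V[j] = A^j x, j = 0..k, for tridiagonal A given by its sub-,
    main-, and super-diagonals (dl, d, du), in skewed order.  Only the last
    ~k rows of A and a moving window of each vector are needed at any time,
    so A is read from slow memory once and there is no redundant work."""
    n = len(x)
    V = np.full((k + 1, n), np.nan)
    V[0] = x
    for i in range(n + k - 1):           # outer step i "reads" row i of A
        for j in range(1, k + 1):
            r = i - (j - 1)              # row of level j whose inputs are ready now
            if 0 <= r < n:
                v = d[r] * V[j - 1, r]
                if r > 0:
                    v += dl[r - 1] * V[j - 1, r - 1]
                if r < n - 1:
                    v += du[r] * V[j - 1, r + 1]
                V[j, r] = v
    return V

if __name__ == "__main__":
    n, k = 12, 4
    dl, d, du = -np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)
    x = np.random.rand(n)
    A = np.diag(d) + np.diag(dl, -1) + np.diag(du, 1)
    V = streaming_matrix_powers(dl, d, du, x, k)
    ref = x.copy()
    for j in range(1, k + 1):
        ref = A @ ref
        assert np.allclose(V[j], ref)
    print("all levels match the direct computation")
```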

Design Goals for [x, Ax, ..., A^k x]
Parallel case
– Goal: constant number of messages, not O(k)
  Minimizes latency cost
  Possible price: extra flops and/or extra words sent, amount depends on surface/volume
Sequential case
– Goal: move A, vectors once through memory hierarchy, not k times
  Minimizes bandwidth cost
  Possible price: extra flops, amount depends on surface/volume

Design Space for [x, Ax, ..., A^k x] (1)
Mathematical operation
– Keep last vector A^k x only: Jacobi, Gauss-Seidel
– Keep all vectors: Krylov subspace methods
– Preconditioning (Ay=b becomes MAy=Mb): [x, Ax, MAx, AMAx, MAMAx, ..., (MA)^k x]
– Improving conditioning of basis W = [x, p_1(A)x, p_2(A)x, ..., p_k(A)x], where p_i(A) is a degree-i polynomial chosen to reduce cond(W) (see the sketch below)
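To illustrate the last point, a small numpy experiment (an illustrative sketch; the evenly spaced shifts are an assumption standing in for the Ritz-value or Leja-point shifts used in practice) compares the conditioning of the monomial basis [x, Ax, ..., A^k x] with a Newton-style basis built from shifted products:

```python
import numpy as np

def monomial_basis(A, x, k):
    """W = [x, Ax, ..., A^k x]: the cheapest basis, but its later columns
    align with the dominant eigenvectors, so cond(W) grows quickly with k."""
    W = [x]
    for _ in range(k):
        W.append(A @ W[-1])
    return np.column_stack(W)

def newton_basis(A, x, k, shifts):
    """W = [x, p_1(A)x, ..., p_k(A)x] with p_j(t) = prod_{i<j} (t - shifts[i]).
    Same communication pattern (one SpMV per column), typically far better
    conditioned when the shifts cover the spectrum of A."""
    W = [x]
    for j in range(k):
        W.append(A @ W[-1] - shifts[j] * W[-1])
    return np.column_stack(W)

def scaled_cond(W):
    """Condition number after normalizing columns, to factor out pure scaling."""
    return np.linalg.cond(W / np.linalg.norm(W, axis=0))

if __name__ == "__main__":
    n, k = 200, 10
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian, tridiag(-1,2,-1)
    x = np.random.rand(n)
    lmin, lmax = np.linalg.eigvalsh(A)[[0, -1]]
    shifts = np.linspace(lmin, lmax, k)   # assumed shifts spread over the spectrum
    print("cond(monomial basis):", scaled_cond(monomial_basis(A, x, k)))
    print("cond(Newton basis):  ", scaled_cond(newton_basis(A, x, k, shifts)))
```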

Design Space for [x, Ax, ..., A^k x] (2)
Representation of sparse A
– Zero pattern may be explicit or implicit
– Nonzero entries may be explicit or implicit
– Implicit saves memory, communication
– Explicit pattern, explicit nonzeros: general sparse matrix
– Implicit pattern, explicit nonzeros: image segmentation
– Explicit pattern, implicit nonzeros: Laplacian(graph), for graph partitioning
– Implicit pattern, implicit nonzeros: "stencil matrix", e.g. tridiag(-1,2,-1) (see the sketch below)
Representation of dense preconditioners M
– Low-rank off-diagonal blocks (semiseparable)
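As a concrete instance of the fully implicit corner, a minimal sketch (function name is illustrative) applies the stencil matrix tridiag(-1, 2, -1) without storing any matrix entries or indices, so only vectors move through memory:

```python
import numpy as np

def stencil_matvec(x):
    """y = A x for the implicit stencil matrix A = tridiag(-1, 2, -1):
    neither the zero pattern nor the nonzero values are stored."""
    y = 2 * x
    y[:-1] -= x[1:]    # superdiagonal contribution
    y[1:] -= x[:-1]    # subdiagonal contribution
    return y

if __name__ == "__main__":
    n = 10
    x = np.random.rand(n)
    # Explicit representation, for comparison: stores ~3n values plus indices.
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    assert np.allclose(stencil_matvec(x), A @ x)
```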

Design Space for [x, Ax, ..., A^k x] (3)
Parallel implementation
– From simple indexing, with redundant flops proportional to surface/volume
– To complicated indexing, with no redundant flops but some extra communication
Sequential implementation
– Depends on whether vectors fit in fast memory
Reordering rows, columns of A
– Important in parallel and sequential cases
Plus all the optimizations for one SpMV!

Examples from later talks (MS34)
Kaushik Datta
– Autotuning of stencils in parallel case
– Example: 66 Gflops on Cell (measured)
Michelle Strout
– Autotuning of Gauss-Seidel for general sparse A
– Example speedup: 4.5x (measured)
Marghoob Mohiyuddin
– Tuning [x, Ax, ..., A^k x] for general sparse A
– Example speedups: 22x on petascale machine (modeled), 3x on out-of-core (measured)
Mark Hoemmen
– How to use [x, Ax, ..., A^k x] stably in GMRES, other Krylov methods
– Requires communication-avoiding QR decomposition ...

Minimizing Communication in QR
QR decomposition of m x n matrix W, m >> n; P processors, block row layout
Usual algorithm
– Compute Householder vector for each column
– Number of messages is about n log P
Communication-avoiding algorithm
– Reduction operation, with QR as operator
– Number of messages is about log P
[Figure: reduction tree; W split into block rows W1..W4, local QRs give R1..R4, pairs combine into R12 and R34, then into R1234]
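A minimal numpy sketch of the reduction idea (an illustration, not the talk's implementation; it forms only R and leaves the implicit Q factors aside): each block row is factored locally, then the small R factors are combined pairwise up a tree, so only n-by-n triangles ever need to move between "processors".

```python
import numpy as np

def tsqr_r(W, p):
    """R factor of a tall-skinny W (m >> n) via a QR reduction tree over
    p block rows: local QR on each block, then pairwise QR of stacked R
    factors.  Only small n-by-n R factors are communicated, so the tree
    needs O(log p) messages instead of O(n log p)."""
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # leaf level, no communication
    while len(Rs) > 1:                                  # reduction tree
        nxt = [np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode="r")
               for i in range(0, len(Rs) - 1, 2)]
        if len(Rs) % 2:
            nxt.append(Rs[-1])
        Rs = nxt
    return Rs[0]

if __name__ == "__main__":
    W = np.random.rand(1000, 8)
    R_tree = tsqr_r(W, p=4)
    R_ref = np.linalg.qr(W, mode="r")
    # R is unique up to the signs of its rows.
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))
    print("TSQR R factor matches the direct QR up to row signs")
```

A real TSQR keeps the local Q factors in place and applies them implicitly; the tree shape (deep, flat, or mixed) is the architecture-dependent choice discussed on the next slide.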

Design space for QR
TSQR = Tall Skinny QR (m >> n)
– Shape of reduction tree depends on architecture
  Parallel: use "deep" tree, saves messages/latency
  Sequential: use flat tree, saves words/bandwidth
  Multicore: use mixture
– QR([R1; R2]): save half the flops since R_i triangular
– Recursive QR
General QR
– Use TSQR for panel factorizations
If it works for QR, why not LU?

Examples from later talks (MS9)
Laura Grigori
– Dense LU
– How to pivot stably?
– 12x speedups (measured)
Julien Langou
– Dense QR
– Speedups up to 5.8x (measured), 23x (modeled)
Hua Xiang
– Sparse LU
– More important to reduce communication

Summary
Possible to reduce communication to theoretical minimum in various linear algebra computations
– Parallel: O(1) or O(log p) messages to take k steps, not O(k) or O(k log p)
– Sequential: move data through memory once, not O(k) times
– Lots of speedup possible (modeled and measured)
Lots of related work
– Some ideas go back to the 1960s, some new
– Rising cost of communication forcing us to reorganize linear algebra (among other things!)
Lots of open questions
– For which preconditioners M can we avoid communication in [x, Ax, MAx, AMAx, MAMAx, ..., (MA)^k x]?
– Can we avoid communication in direct eigensolvers?

bebop.cs.berkeley.edu