Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder
Caltech ASCI Center

Presentation transcript:

Why Use the GPU?
Semiconductor trends
  – cost
  – wires vs. compute
  – Stanford streaming supercomputer
Parallelism
  – many functional units
  – graphics is prime example
Harvesting this power
  – what application suitable?
  – what abstractions useful?
History
  – massively parallel SIMD machines
  – media processing
[Chart: "Possible" vs. "Actual"; courtesy Bill Dally]
[Images: Imagine stream processor (Bill Dally, Stanford); Connection Machine CM2 (Thinking Machines)]

Contributions and Related Work
Contributions
  – numerical algorithms on GPU
      unstructured grids: conjugate gradients
      regular grids: multigrid
  – what abstractions are needed?
Numerical algorithms
  – Goodnight et al. (MG)
  – Hall et al. (cache)
  – Harris et al. (FD sim.)
  – Hillesland et al. (optimization)
  – Krueger & Westermann 2003 (NLA)
  – Strzodka (PDEs)

Streaming Model
Abstract model
  – Purcell, et al
  – data structures: streams
  – algorithms: kernels
Concrete model
  – render a rectangle
  – data structures: textures
  – algorithms: fragment programs
[Diagram: input record stream → Kernel (with globals) → output record stream]
[Diagram: Rasterizer (set up texture indices and all associated data) → Fragment program (for all pixels in parallel); texture as read-only memory; output goes to texture; bind buffer to texture]
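As a rough CPU-side analogy for the abstract model above, a kernel can be viewed as a pure function applied independently to every record of an input stream, with read-only globals. The sketch below is illustrative only; `run_kernel` and `scale_add` are hypothetical names and do not correspond to any GPU API.

```python
from typing import Callable, Sequence, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def run_kernel(kernel: Callable[[In, dict], Out],
               stream: Sequence[In],
               globals_: dict) -> list[Out]:
    """Apply `kernel` to each record independently (no cross-record state).

    In the concrete model this corresponds to rendering a rectangle: the
    input stream is a texture, the kernel a fragment program, and the
    output stream another texture.
    """
    return [kernel(record, globals_) for record in stream]


def scale_add(record: tuple[float, float], g: dict) -> float:
    """Example kernel: a * x + y, with the scalar `a` passed as a global."""
    x, y = record
    return g["a"] * x + y


out = run_kernel(scale_add, [(1.0, 2.0), (3.0, 4.0)], {"a": 2.0})
print(out)
```

Because each output record depends only on its own input record and the globals, the records can be processed in any order, which is what lets the hardware run them in parallel.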

Sparse Matrices: Geometric Flow
Ubiquitous in numerical computing
  – discretization of PDEs: animation
      finite elements, difference, volumes
  – optimization, editing, etc., etc.
Example here:
  – processing of surfaces
Canonical non-linear problem
  – mean curvature flow
  – implicit time discretization
      solve sequence of SPD systems
[Figure: velocity opposite mean curvature normal]

Conjugate Gradients
High level code
  – inner loop
  – matrix-vector multiply
  – sum-reduction
  – scalar-vector MAD
Inner product
  – fragment-wise multiply
  – followed by sum-reduction
  – odd dimensions can be handled
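The three building blocks named above (matrix-vector multiply, sum-reduction, scalar-vector MAD) are enough to write the textbook conjugate gradient loop. The following is a minimal CPU sketch in plain Python, not the authors' GPU code; the helper names `dot`, `axpy`, and `conjugate_gradient` are chosen here for illustration.

```python
from typing import Callable

Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    # Element-wise ("fragment-wise") multiply followed by a sum-reduction.
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    # Scalar-vector multiply-add (MAD): alpha * x + y.
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(matvec: Callable[[Vector], Vector], b: Vector,
                       tol: float = 1e-10, max_iter: int = 1000) -> Vector:
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for the initial guess x = 0
    p = list(r)                  # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:      # stop once the residual norm is below tol
            break
        Ap = matvec(p)           # matrix-vector multiply
        alpha = rr / dot(p, Ap)  # sum-reduction
        x = axpy(alpha, p, x)    # MAD: x += alpha * p
        r = axpy(-alpha, Ap, r)  # MAD: r -= alpha * A p
        rr_new = dot(r, r)       # sum-reduction
        p = axpy(rr_new / rr, p, r)  # MAD: p = r + beta * p
        rr = rr_new
    return x


# Small SPD example: A = [[4, 1], [1, 3]], b = [1, 2]; exact x = (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
x = conjugate_gradient(lambda v: [dot(row, v) for row in A], [1.0, 2.0])
print(x)
```

Passing the matrix as a `matvec` callable mirrors the slide's point that the solver only ever needs the product A x, so any sparse storage scheme can be plugged in.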

y = Ax
A_j – off-diagonal matrix elements
R – pointers to segments

Row-Vector Product
X – vector elements
R – pointers to segments
A_i – diagonal matrix elements
J – pointers to x_j
A_j – off-diagonal matrix elements
[Figure: fragment program]
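One way to read the layout above is as a compressed-row scheme with the diagonal stored separately. The sketch below assumes that R holds one start offset per row plus a final end offset (a CSR-style convention that the slide does not spell out) and that J holds a column index for every off-diagonal entry. The function name `sparse_matvec` and its parameter names are hypothetical.

```python
def sparse_matvec(a_diag: list[float], a_off: list[float],
                  j_idx: list[int], r_ptr: list[int],
                  x: list[float]) -> list[float]:
    """Compute y = A x with the diagonal stored apart from the off-diagonals.

    a_diag[i]             diagonal entry A_ii             (A_i on the slide)
    a_off[k], j_idx[k]    k-th off-diagonal value, column (A_j and J)
    r_ptr[i]:r_ptr[i+1]   segment of a_off/j_idx for row i (R, assumed CSR-like)
    """
    y = []
    for i in range(len(x)):              # one output element per row
        acc = a_diag[i] * x[i]
        for k in range(r_ptr[i], r_ptr[i + 1]):
            acc += a_off[k] * x[j_idx[k]]  # indirect fetch of x_j through J
        y.append(acc)
    return y


# 3x3 example: A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
a_diag = [2.0, 2.0, 2.0]
a_off = [-1.0, -1.0, -1.0, -1.0]
j_idx = [1, 0, 2, 1]
r_ptr = [0, 1, 3, 4]
y = sparse_matvec(a_diag, a_off, j_idx, r_ptr, [1.0, 2.0, 3.0])
print(y)
```

The inner loop length differs from row to row, which is the reason the following slides group rows into batches with similar non-zero counts.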

Apply to All Pixels
Two extremes
  – one row at a time: setup overhead
  – all rows at once: limited by worst row
Middle ground
  – organize “batches” of work
How to arrange batches?
  – order rows by non-zero entries
      optimal packing NP hard
We choose fixed size rectangles
  – fragment pipe is quantized
  – simple experiments reveal best size
      26 x 18 – 91% efficient
      wasted fragments on diagonal
[Plot: time vs. area (pixels)]

Packing (Greedy)
[Figure: … non-zero entries per row; each batch bound to an appropriate fragment program]
All this setup is done only once, at the beginning of time. It depends only on mesh connectivity.
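The batching idea can be sketched on the CPU as follows: sort rows by their non-zero count, cut the sorted list into fixed-size batches, and let each batch use a loop length equal to its largest row. This is a simplified illustration and not the authors' packing code. `make_batches` and `wasted_work` are hypothetical helpers, and the padding count is an illustrative measure that ignores partially filled rectangles.

```python
def make_batches(nnz_per_row: list[int], batch_rows: int) -> list[dict]:
    """Sort rows by non-zero count, then cut into fixed-size batches.

    Each batch is handled by one program whose loop length is the largest
    non-zero count in that batch, so sorting keeps the padding small.
    """
    order = sorted(range(len(nnz_per_row)), key=lambda i: nnz_per_row[i])
    batches = []
    for start in range(0, len(order), batch_rows):
        rows = order[start:start + batch_rows]
        batches.append({
            "rows": rows,
            "loop_length": max(nnz_per_row[i] for i in rows),
        })
    return batches


def wasted_work(nnz_per_row: list[int], batches: list[dict]) -> int:
    """Padding operations: loop slots executed minus non-zeros present."""
    total = sum(b["loop_length"] * len(b["rows"]) for b in batches)
    return total - sum(nnz_per_row)


nnz = [6, 5, 7, 6, 5, 6, 7, 6]
batches = make_batches(nnz, batch_rows=4)
print(batches, wasted_work(nnz, batches))
```

Since the result depends only on the non-zero pattern, it can be computed once from the mesh connectivity and reused for every solve.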

Recomputing Matrix
Matrix entries depend on surface
  – must “render” into matrix
  – two additional indirection textures
      previous and next

Results
37k elements
  – matrix multiply
      33 instructions, 120 per second
      only 13 flops
      latency limited
  – reduction
      7 inst/frag/pass, 3400 per second
  – CG solve: 20 per second

Regular Grids
Poisson solver as example
  – multigrid approach
  – this time variables on “pixel grid”
e.g.: Navier-Stokes
  – after discretization: solve Poisson eq. at each time step

Poisson Equation
Appears all over the place
  – easy to discretize on regular grid
  – matrix multiply is stencil application
  – FD Laplace stencil: [Figure: stencil centered at (i,j)]
Use iterative matrix solver
  – just need application of stencil
      easy: just like filtering
      incorporate geometry (Jacobian)
      variable coefficients
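To make "matrix multiply is stencil application" concrete, the sketch below applies a Laplace stencil to a small grid. The stencil figure is not preserved in the transcript, so the standard 5-point finite-difference stencil with unit grid spacing is assumed here, and boundary values are simply left at zero. `apply_laplace` is a hypothetical name.

```python
def apply_laplace(u: list[list[float]]) -> list[list[float]]:
    """Apply the 5-point stencil 4*u[i][j] - (sum of 4 neighbours).

    Only interior points are computed; boundary entries of the result stay
    at zero (an assumption made for this sketch).
    """
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (4.0 * u[i][j]
                         - u[i - 1][j] - u[i + 1][j]
                         - u[i][j - 1] - u[i][j + 1])
    return out


# u(i, j) = i*i has second difference 2 in i and 0 in j, so the stencil
# (which represents the negative Laplacian) gives -2 at interior points.
u = [[float(i * i) for _ in range(4)] for i in range(4)]
lap = apply_laplace(u)
print(lap[1][1], lap[2][2])
```

No matrix is ever stored: the same few weights are applied at every grid point, exactly like an image filter, which is why this maps so directly onto a fragment program.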

Multigrid
Fine to coarse to fine cycle
  – high freq. error removed quickly
  – lower frequency error takes longer
Relax, Project, Interpolate
[Diagram: Relax, Projection, Interpolation]
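The fine-to-coarse-to-fine cycle can be sketched for a 1D model problem, -u'' = f on [0, 1] with zero boundary values. Everything below is an assumption made for illustration rather than the authors' 2D GPU implementation: weighted Jacobi for relaxation, full weighting for projection, linear interpolation, and a re-discretized operator on each coarser grid.

```python
def relax(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps; removes high-frequency error quickly."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jac
        u = new
    return u


def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def project(r):
    """Full-weighting restriction, fine grid to coarse grid."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def interpolate(ec, n_fine):
    """Linear interpolation, coarse grid to fine grid."""
    e = [0.0] * n_fine
    for i in range(len(ec)):
        e[2 * i] = ec[i]
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e


def v_cycle(u, f, h):
    if len(u) <= 3:                      # coarsest grid: one unknown, solve exactly
        u = u[:]
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = relax(u, f, h)                               # relax
    rc = project(residual(u, f, h))                  # project
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)       # coarse-grid correction
    u = [a + b for a, b in zip(u, interpolate(ec, len(u)))]  # interpolate
    return relax(u, f, h)                            # relax again


n = 33                                   # grid size must be 2**k + 1
h = 1.0 / (n - 1)
f = [1.0] * n
u = [0.0] * n
for _ in range(15):
    u = v_cycle(u, f, h)
exact = [0.5 * (i * h) * (1.0 - i * h) for i in range(n)]
err = max(abs(a - b) for a, b in zip(u, exact))
print(err)
```

Relaxation alone would need many sweeps to remove the smooth part of the error; moving that part to a coarser grid, where it looks oscillatory again, is what keeps the number of cycles small.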

Computations and Storage Layout
Lots of stencil applications
  – matrix multiply: 3x3 stencil
  – projection: 3x3 stencil
  – interpolation: 2x2(!)
      floor op in indexing
Storage for matrices and DOFs
  – variables in one texture
  – matrices in 9(=3x3) textures
  – all textures packed
      exploit 4 channels
      domain decomp.
      padded boundary
[Figure: packing into the four channels x, y, z, w]

Coarser Matrices
Operator at coarser level
  – needed for relaxation at all levels
Triple matrix product…
  – work out terms and map to stencils
      exploit local support of stencils
      straightforward but t-e-d-i-o-u-s
A_c = S A_f P
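A small 1D check shows why the triple product can be mapped back onto stencils. It assumes the product has the form A_c = S A_f P, with S the projection (full weighting) and P the interpolation (linear), and uses dense matrices purely for clarity. With these choices the coarse operator for the 1D Laplacian is again a three-point stencil. `matmul` is a hypothetical helper.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Dense matrix product, adequate for this tiny illustration."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


n_f, n_c = 7, 3  # interior points on the fine and coarse grids

# Fine operator: 1D Laplacian stencil [-1, 2, -1] with unit spacing.
A_f = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
        for j in range(n_f)] for i in range(n_f)]

# Interpolation P (coarse to fine): weights 1/2, 1, 1/2 around each coarse point.
P = [[0.0] * n_c for _ in range(n_f)]
for j in range(n_c):
    P[2 * j][j] = 0.5
    P[2 * j + 1][j] = 1.0
    P[2 * j + 2][j] = 0.5

# Projection S (fine to coarse): full weighting 1/4, 1/2, 1/4 = 0.5 * P^T.
S = [[0.5 * P[i][j] for i in range(n_f)] for j in range(n_c)]

A_c = matmul(matmul(S, A_f), P)
for row in A_c:
    print(row)
```

Every row of `A_c` is the stencil [-1, 2, -1] divided by 4, that is, the same operator on a grid with twice the spacing. In 2D the same local-support argument applies, only with more terms to work out by hand.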

Results
257x257 grid
  – matrix multiply: 27 instructions, 1370 per second
  – interpolation: 10 inst.
  – projection: 19 inst.
Overall performance
  – 257x257 at 80 fps!

Conclusions
Enhancements
  – global registers for reductions
  – texture fetch with offset
  – rectangular texture border
  – scalar versus vector problems
Where are we now?
  – good streaming processor
  – twice as fast as CPU implementation
  – lots of room for improvement
Scientific computing compiler
  – better languages! Brook? C*?
  – manage layout in a buffer