Parallelizing the Conjugate Gradient Algorithm for Multilevel Toeplitz Systems
Jie Chen (Argonne National Laboratory) and Tom L. H. Li (University of Missouri—St. Louis)
ICCS 2013

Toeplitz
- What is a Toeplitz matrix? A matrix that is constant along each diagonal: T_ij = t_{i-j}, so an n-by-n Toeplitz matrix is determined by only 2n-1 entries.
- Where does it come from? Discretizations on a one-dimensional regular grid.
- Another example: the standard 1D Laplacian (tridiagonal, with stencil [-1, 2, -1]) is Toeplitz, although it is usually treated as a sparse matrix instead.
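As a quick illustration (a minimal sketch, not from the slides), the 1D Laplacian can be built as a Toeplitz matrix with `scipy.linalg.toeplitz`, which takes the first column (and optionally the first row):

```python
import numpy as np
from scipy.linalg import toeplitz

n = 5
col = np.zeros(n)
col[0], col[1] = 2.0, -1.0    # first column: stencil [2, -1, 0, ...]
T = toeplitz(col)             # symmetric Toeplitz: constant diagonals

# every diagonal of T is constant
assert all(len(set(np.diag(T, k))) == 1 for k in range(-n + 1, n))
```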

Multilevel Toeplitz
- A multilevel Toeplitz matrix is defined recursively with respect to the number of levels: a d-level Toeplitz matrix is block Toeplitz, with each block itself (d-1)-level Toeplitz.
- Where does it come from? Discretizations on a d-dimensional regular grid.
- Think of the 2D Laplacian: block tridiagonal with Toeplitz blocks, i.e., 2-level Toeplitz.
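For example (a sketch, not from the slides), the 2D Laplacian on an n-by-n grid is a Kronecker sum of 1D Laplacians, which makes the 2-level (block Toeplitz with Toeplitz blocks) structure explicit:

```python
import numpy as np
from scipy.linalg import toeplitz

n = 4
col = np.zeros(n)
col[0], col[1] = 2.0, -1.0
T1 = toeplitz(col)                       # 1D Laplacian: Toeplitz
I = np.eye(n)
T2 = np.kron(I, T1) + np.kron(T1, I)     # 2D Laplacian: 2-level Toeplitz

# the (i, j) block of T2 depends only on i - j  (block-Toeplitz structure)
B = lambda i, j: T2[i*n:(i+1)*n, j*n:(j+1)*n]
assert np.array_equal(B(1, 0), B(2, 1)) and np.array_equal(B(0, 0), B(1, 1))
```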

Why Solve (Multilevel) Toeplitz Systems?
- Scattered data interpolation [figure omitted]
- The covariance matrix K is multilevel Toeplitz when the sample points X_i lie on a regular grid
- K^{-1} also appears in many other problems, such as maximum likelihood estimation
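To make the covariance claim concrete, here is a small sketch (assuming a stationary kernel, a Gaussian here; not from the slides): on a regular 1D grid, K_ij = k(x_i - x_j) depends only on i - j, so K is Toeplitz.

```python
import numpy as np

n = 6
x = np.arange(n) * 0.5                  # regular 1D grid, spacing 0.5
k = lambda r: np.exp(-r**2)             # stationary (Gaussian) kernel
K = k(x[:, None] - x[None, :])          # K_ij = k(x_i - x_j)

# constant along every diagonal, hence Toeplitz
assert all(np.allclose(np.diag(K, d), np.diag(K, d)[0]) for d in range(n))
```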

How to Solve?
- General direct methods via matrix factorizations: O(n^3)
- Fast direct methods for 1-level Toeplitz: O(n^2)
  - Levinson-Durbin, 1947, 1960
  - Bareiss, 1969
- Superfast direct methods for 1-level Toeplitz: O(n log^α n)
  - Pan, 1993
  - Stewart, 2003
  - Chandrasekaran et al., 2007
- Methods for specialized systems: banded, block Toeplitz, Toeplitz-block, etc.
- General method for any-level Toeplitz: O(n log n)
  - Chan and Jin, 2007
  - Use an iterative solver (e.g., conjugate gradient)
  - Matrix-vector multiplication through FFT
  - Circulant preconditioner

We parallelize this last method.

Conjugate Gradient (CG)
[Algorithm slide: the preconditioned CG iteration, annotated to show where the multilevel Toeplitz matrix enters (the matrix-vector product) and where the multilevel circulant preconditioner enters (the preconditioner solve).]
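For reference, here is a standard textbook formulation of preconditioned CG (a sketch, not the paper's exact code). It touches the operators only through two callables, which is what lets the Toeplitz multiply and the circulant solve be swapped in as FFT-based routines:

```python
import numpy as np

def pcg(matvec, precond, b, tol=1e-8, maxit=500):
    """Preconditioned conjugate gradient for an SPD system T x = b.
    matvec(y)  returns T y      (here: the FFT-based Toeplitz multiply)
    precond(r) returns M^{-1} r (here: the FFT-based circulant solve)"""
    x = np.zeros_like(b, dtype=float)
    r = b - matvec(x)            # initial residual
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        v = matvec(p)            # the only place T is needed
        alpha = rz / (p @ v)
        x += alpha * p
        r -= alpha * v
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = precond(r)           # the only place M is needed
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x
```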

Toeplitz-Multiply
- Circulant embedding (1-level case): embed the n-by-n Toeplitz matrix T into a larger circulant matrix C, and zero-pad y to y'
- In case of symmetry, both T and C are represented by their first columns, t and c
- Circulant-multiply:
  1. λ = fft(c)
  2. v' = ifft( λ .* fft(y') )
- Simple to generalize to the d-level case:
  - t is a d-dimensional tensor
  - Circulant embedding is done along all dimensions
  - FFT and IFFT become multidimensional
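A minimal sketch of the 1-level, symmetric case (assuming an embedding into a circulant of size 2n, with one zero inserted; other embedding sizes such as 2n-2 also work):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec_sym(t, y):
    """v = T y for a symmetric Toeplitz T with first column t,
    via embedding into a circulant of size 2n and three FFTs."""
    n = len(t)
    c = np.concatenate([t, [0.0], t[-1:0:-1]])    # first column of circulant C
    lam = np.fft.fft(c)                           # eigenvalues of C
    y_pad = np.concatenate([y, np.zeros(n)])      # y': zero-padded y
    v_pad = np.fft.ifft(lam * np.fft.fft(y_pad))  # v' = C y'
    return v_pad[:n].real                         # truncate v' back to length n

# check against a dense multiply
t = 2.0 ** -np.arange(8)
y = np.random.rand(8)
assert np.allclose(toeplitz_matvec_sym(t, y), toeplitz(t) @ y)
```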

Circulant Preconditioning
- Multilevel Toeplitz T (data representation: d-D tensor t) → multilevel circulant preconditioner M (data representation: d-D tensor m of the same size)
- A 3-D example to construct m:
  1. Initialize m(:,:,:) = t(:,:,:)
  2. s(j,:,:) = [ (n1-j) * m(j,:,:) + j * m(n1-j,:,:) ] / n1, for j = 0:n1-1; then copy s to m
  3. s(:,j,:) = [ (n2-j) * m(:,j,:) + j * m(:,n2-j,:) ] / n2, for j = 0:n2-1; then copy s to m
  4. s(:,:,j) = [ (n3-j) * m(:,:,j) + j * m(:,:,n3-j) ] / n3, for j = 0:n3-1; then copy s to m
- In the 1-level case, this preconditioner yields superlinear convergence for CG
- In the higher-level case, superlinear convergence is lost, but performance is still good in practice
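A dimension-agnostic sketch of this construction in Python (illustrative; the index n-j is taken modulo n so the j = 0 term stays in range, and its weight is zero anyway):

```python
import numpy as np

def chan_circulant(t):
    """First column, as a d-D tensor, of the multilevel circulant
    preconditioner built by averaging t along each dimension in turn."""
    m = np.asarray(t, dtype=float).copy()
    for axis in range(m.ndim):
        n = m.shape[axis]
        j = np.arange(n)
        rev = np.take(m, (n - j) % n, axis=axis)   # m(n-j) along this axis
        w = j.reshape([n if a == axis else 1 for a in range(m.ndim)])
        m = ((n - w) * m + w * rev) / n            # s(j) = [(n-j) m(j) + j m(n-j)] / n
    return m
```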

Toeplitz CG
[Algorithm slide: the complete CG iteration for a multilevel Toeplitz system, with T applied via FFT-based circulant-multiply and the circulant preconditioner solved via FFT.]
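Wiring the pieces together for the 1-level case (a usage sketch reusing the hypothetical helpers `pcg`, `toeplitz_matvec_sym`, and `chan_circulant` from the sketches above; a circulant solve is just an elementwise division in Fourier space):

```python
import numpy as np

n = 256
t = 2.0 ** -np.arange(n)                 # SPD Toeplitz data (first column)
b = np.random.rand(n)

lam_M = np.fft.fft(chan_circulant(t))    # eigenvalues of the preconditioner

x = pcg(matvec=lambda y: toeplitz_matvec_sym(t, y),
        precond=lambda r: np.fft.ifft(np.fft.fft(r) / lam_M).real,
        b=b)
assert np.allclose(toeplitz_matvec_sym(t, x), b, atol=1e-6)
```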

Parallelization 1: Toeplitz-Multiply

Naïve approach:
1. y' = Embed y
2. z = multidimensional-FFT(y')
3. w = λ .* z
4. v' = multidimensional-IFFT(w)
5. v = Truncate v'

Less-communication approach:
1. y'' = Embed y along unpartitioned dims
2. y' = FFT(y'') along unpartitioned dims
3. Transpose y'
4. z' = Embed y' along unpartitioned dims
5. z = FFT(z') along unpartitioned dims
6. w = λ .* z
7. w'' = IFFT(w) along unpartitioned dims
8. w' = Truncate w'' along unpartitioned dims
9. Transpose w'
10. v' = IFFT(w') along unpartitioned dims
11. v = Truncate v' along unpartitioned dims

(The original slide marked in red the steps that require MPI_Alltoall: in the less-communication approach only the two transposes communicate, since every other step acts along unpartitioned dims; in the naïve approach, communication hides inside the distributed multidimensional FFT/IFFT and in the embedding/truncation along partitioned dims. See the transpose sketch below.)
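A minimal mpi4py sketch (an illustration, not the paper's implementation) of the transpose step for a 2-D array partitioned by rows over p ranks; a single MPI_Alltoall redistributes the data so the other axis becomes local:

```python
# A 2-D array of shape (n, n), partitioned by rows over p ranks.
# Run with: mpiexec -n <p> python transpose_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
n = 8 * p                                # global size, divisible by p
nb = n // p                              # my number of rows

local = np.random.rand(nb, n) + 0j       # my row block
local = np.fft.fft(local, axis=1)        # FFT along the local (unpartitioned) axis

# redistribute with one MPI_Alltoall: split my rows into p column blocks,
# exchange block d with rank d, then stack the received blocks by rows
send = np.ascontiguousarray(local.reshape(nb, p, nb).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)
local_t = recv.reshape(n, nb)            # now a column block: the other axis is local

local_t = np.fft.fft(local_t, axis=0)    # FFT along the formerly partitioned axis
```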

Parallelization 1: Toeplitz-Multiply
- How to partition a d-dimensional data cube?
  - Use a 1-dimensional array of processes
  - Use a 2-dimensional grid of processes
  - In general, use a d'-dimensional grid of processes (d' = 1, 2, …, d)
- The larger d' is, the more processes one can use
- The larger d' is, the smaller the total size of the MPI_Alltoall exchanges (with p = p0 · p1 ⋯ p_{d'-1} processes; see the process-grid sketch below)
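A small mpi4py sketch (illustrative, not the paper's code) of setting up a d'-dimensional process grid; each transpose's MPI_Alltoall then runs within a one-dimensional sub-communicator of the grid:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
d_prime = 2                                        # dimension of the process grid
dims = MPI.Compute_dims(comm.Get_size(), d_prime)  # e.g. 12 ranks -> [4, 3]
cart = comm.Create_cart(dims, periods=[False] * d_prime)

# one sub-communicator per grid dimension; each transpose's MPI_Alltoall
# involves only the ranks along that single dimension
subcomms = [cart.Sub([i == k for i in range(d_prime)]) for k in range(d_prime)]
print(cart.Get_coords(comm.Get_rank()), [sc.Get_size() for sc in subcomms])
```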

Parallelization 2: Eliminate Allreduce
Computing v = Ty and σ = (v, y) simultaneously:
- By Parseval's identity, (v, y) = (v', y') = (1/N)(w, z), since y' is simply y zero-padded, v' = IFFT(w), and z = FFT(y')
- So use the alltoall already performed in the IFFT of w to sum the local partial inner products of z and w
- This eliminates the separate allreduce
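A serial sketch verifying the identity that makes this possible (notation as on the Toeplitz-multiply slide; here N = 2n is the embedding size):

```python
import numpy as np

n = 16
t = 2.0 ** -np.arange(n)                      # symmetric Toeplitz data
y = np.random.rand(n)

c = np.concatenate([t, [0.0], t[-1:0:-1]])    # circulant embedding, size N = 2n
lam = np.fft.fft(c)
z = np.fft.fft(np.concatenate([y, np.zeros(n)]))   # z = fft(y')
w = lam * z
v = np.fft.ifft(w)[:n].real                   # v = truncate(ifft(w))

sigma = v @ y                                 # the inner product CG needs
sigma_from_w_z = (w @ np.conj(z)).real / (2 * n)   # Parseval: no reduction over v needed
assert np.allclose(sigma, sigma_from_w_z)
```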

Overall Solver Performance
[Figure slide: strong-scaling and weak-scaling results (weak scaling with 2^20 grid points per core).]

Summary
- Multilevel Toeplitz matrices appear in, e.g., statistics
- Iterative methods have so far been the methods of choice for multilevel Toeplitz systems
- We parallelize CG:
  - Use a multidimensional grid of processes to partition the multidimensional data
  - Eliminate communication in the data embedding
  - Eliminate the allreduce communication for computing inner products
- Largest experiment: a 1B-by-1B matrix solved with 1K processes in 1 minute
- Other iterative methods (e.g., GMRES) can be parallelized similarly