ScicomP 10, Aug 9-13, 2004
Parallel Out-of-Core LU and QR Factorization
Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX
Enrique Quintana-Ortí, Depto. de Ingenieria y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain
Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX
Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX

Motivation
Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory. (Figure: an m×n matrix with an in-core slab of columns.)

Motivation
While this is effective for many applications, it is inherently unscalable: as m grows much larger than n, fewer and fewer columns fit into memory. (Figure: a tall, skinny matrix with m >> n.)
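The scalability problem can be put in numbers. With M bytes of memory and an m-row matrix of 8-byte doubles, only M // (8*m) full columns fit in core under the slab approach, independent of n. A small sketch (the 32 GB figure is borrowed from the IBM P690 configuration reported later in the talk):

```python
# Columns that fit in core under the slab approach: M // (8 * m),
# shrinking as the row dimension m grows.
M = 32 * 2**30                # 32 GB of memory, as on the IBM P690 runs
for m in (10**6, 10**7, 10**8):
    print(f"m = {m:>9}: {M // (8 * m)} columns fit")
```

For m = 10^8 rows, only 42 columns fit at once, which is why a tile-based (rather than slab-based) decomposition is needed.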

Out-of-Core QR Factorization
Given the m×n matrix A, we wish to apply the factorization A = QR, using the compact WY representation Q = I + Y T Y^T, where:
- Q is an orthogonal matrix
- R is upper triangular
- Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
- T is r×r upper triangular
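A minimal sketch of the compact WY representation, not the presentation's actual PLAPACK code: a Householder QR that accumulates Y and T so that the product of reflectors can be applied as one matrix-matrix operation. Note the sign convention: this builds Q = I - Y T Y^T (LAPACK style); the slides' Q = I + Y T Y^T simply absorbs the minus sign into T.

```python
import numpy as np

def wy_qr(A):
    """Householder QR of an m x n matrix (m >= n), returning Y, T, R with
    Q = I - Y @ T @ Y.T. Y is unit lower trapezoidal, T upper triangular."""
    m, n = A.shape
    R = A.astype(float).copy()
    Y = np.zeros((m, n))                  # Householder vectors, w[0] = 1
    T = np.zeros((n, n))                  # triangular accumulator
    for k in range(n):
        x = R[k:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha                     # v = x - alpha * e1
        w = v / v[0]                      # scale so w[0] = 1
        tau = 2.0 * v[0] ** 2 / (v @ v)   # reflector H_k = I - tau * w w^T
        R[k:, k:] -= tau * np.outer(w, w @ R[k:, k:])
        Y[k:, k] = w
        T[:k, k] = -tau * (T[:k, :k] @ (Y[k:, :k].T @ w))  # grow T by a column
        T[k, k] = tau
    return Y, T, np.triu(R)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Y, T, R = wy_qr(A)
Q = np.eye(6) - Y @ T @ Y.T
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))   # True True
```

The point of the representation is that applying Q to the remaining tiles is done with GEMM-level operations on Y and T, rather than one rank-1 update per Householder vector.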

QR Factorization: Out-of-Core Implementation
Step 1: Begin with an unfactored matrix, which resides on disk. (In the slide figures, tiles are marked as either stored on disk or in memory.)

Step 2: Divide the matrix into a mesh of tiles of size t×t, where each tile is stored as a separate file.
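Step 2 in miniature, as a hedged sketch rather than the POOCLAPACK I/O layer: the tile mesh maps naturally onto one file per tile, indexed by block row and block column (file names and `.npy` storage are illustrative choices, not the presentation's format).

```python
import numpy as np
import os
import tempfile

def write_tiles(A, t, dirpath):
    """Split A into a mesh of t x t tiles (edge tiles may be smaller)
    and store each tile as its own file, as in Step 2."""
    m, n = A.shape
    for i in range(0, m, t):
        for j in range(0, n, t):
            np.save(os.path.join(dirpath, f"tile_{i // t}_{j // t}.npy"),
                    A[i:i + t, j:j + t])

def read_tile(dirpath, bi, bj):
    """Bring a single tile back into memory by block index."""
    return np.load(os.path.join(dirpath, f"tile_{bi}_{bj}.npy"))

d = tempfile.mkdtemp()
A = np.arange(36.0).reshape(6, 6)
write_tiles(A, 3, d)
print(np.array_equal(read_tile(d, 1, 0), A[3:6, 0:3]))   # True
```

Because each tile is an independent file, the algorithm below can bring exactly one working set of tiles into memory at a time, regardless of how large the full matrix is.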

Step 3: Read in the first tiles and factor, saving the T_i matrices and overwriting the lower part of the tile with Y_i.

Step 4: Read in the remaining tiles in the row and apply Q = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.

Step 5: Factor the next tile in the first column using the QR update algorithm.

Step 6: Apply the transformations to the remaining tiles in the row.

Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles.

Step 8: Repeat Steps 1-7 on the lower quadrant. Continue until the entire matrix has been factored.
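The control flow of Steps 3-8 can be sketched in-core, with the disk reads and writes replaced by array slicing. This is a simplified stand-in for the algorithm, not the presentation's implementation: it assumes m and n are multiples of the tile size t, uses `np.linalg.qr` per tile instead of the compact WY kernels, and discards the orthogonal factors, keeping only R.

```python
import numpy as np

def tiled_qr(A, t):
    """Tile-by-tile QR control flow (Steps 3-8), in-core for illustration.
    Assumes m and n are multiples of t. Returns the n x n factor R."""
    A = A.astype(float).copy()
    m, n = A.shape
    for k in range(0, n, t):
        # Step 3/4: factor the diagonal tile, apply Q^T across its row
        Q, _ = np.linalg.qr(A[k:k + t, k:k + t], mode="complete")
        A[k:k + t, k:] = Q.T @ A[k:k + t, k:]
        for i in range(k + t, m, t):
            # Step 5: QR update, folding the next tile of the column
            # into the triangle above it
            Q, _ = np.linalg.qr(np.vstack([A[k:k + t, k:k + t],
                                           A[i:i + t, k:k + t]]),
                                mode="complete")
            # Step 6: apply the same transformation across both tile rows
            both = Q.T @ np.vstack([A[k:k + t, k:], A[i:i + t, k:]])
            A[k:k + t, k:], A[i:i + t, k:] = both[:t], both[t:]
        # Steps 7/8: the outer loops advance to the remaining rows/quadrant
    return np.triu(A[:n, :n])

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
R = tiled_qr(A, 2)
print(np.allclose(R.T @ R, A.T @ A))   # True: R matches a one-shot QR up to signs
```

The correctness check uses the identity A^T A = R^T R, which holds for any QR factorization of A and is insensitive to the sign ambiguity in R's rows.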

Out-of-Core LU Factorization
Given the m×n matrix A, we wish to apply the factorization PA = LU, where:
- P is a permutation matrix
- L is lower trapezoidal
- U is n×n upper triangular
The implementation is analogous to the out-of-core QR factorization.
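For reference, the per-tile kernel this builds on is ordinary LU with partial pivoting. A textbook sketch for a square tile (the out-of-core code works on rectangular panels; this is an illustration, not the PLAPACK routine):

```python
import numpy as np

def lu_partial_pivot(A):
    """Textbook PA = LU with partial pivoting for a square matrix.
    Returns P, L (unit lower triangular), U (upper triangular)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.eye(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))   # partial pivoting
        A[[k, p]] = A[[p, k]]                 # swap rows (multipliers too)
        P[[k, p]] = P[[p, k]]
        A[k + 1:, k] /= A[k, k]               # multipliers stored below diagonal
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(n)
    return P, L, np.triu(A)

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
P, L, U = lu_partial_pivot(A)
print(np.allclose(P @ A, L @ U))   # True
```

Just as Y_i and T_i are saved in the QR case, here each tile factorization saves P_i alongside the in-place L_i and U_i so that later row updates can replay the pivoting.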

LU Factorization: Out-of-Core Implementation
Step 1: Factor the first tile, saving the permutation matrix P_i; the tile is overwritten with L_i and U_i.

Step 2: Update the remaining tiles in the row using panels of L and the saved permutation matrices.
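The row update of Step 2 is a permuted triangular solve: once P_i A_ii = L_i U_i has been computed for the diagonal tile, each remaining tile A_ij in that row is overwritten with U_ij = L_i^{-1} (P_i A_ij). A minimal sketch with made-up sizes (the real code streams L_i in panels rather than holding it whole):

```python
import numpy as np

rng = np.random.default_rng(2)
t = 4
Li = np.tril(rng.standard_normal((t, t)), -1) + np.eye(t)  # unit lower triangular
Pi = np.eye(t)[[2, 0, 3, 1]]                               # a sample permutation
Aij = rng.standard_normal((t, t))                          # tile to the right

Uij = np.linalg.solve(Li, Pi @ Aij)   # apply the pivots, then solve with L_i
print(np.allclose(Li @ Uij, Pi @ Aij))   # True
```

In a production kernel this solve would be a BLAS-3 TRSM call, which is exactly the kind of large matrix-matrix operation the tile sizes are chosen to favor.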

Step 3: Factor the next tile in the first column using the LU update algorithm.

Step 4: Update the remaining tiles in the row using panels of L and the stored permutation matrices.

Development Environment
Parallel Linear Algebra Package (PLAPACK)
- Optimized parallel routines (FORTRAN and C interfaces)
- View-based infrastructure
- Uses standard MPI and BLAS libraries
Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
- Out-of-core extension to PLAPACK
- Handles the complexity of the I/O operations (i.e., hidden from the user)
- Uses standard read/write functions for portability

Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak of 5.2 Gflops, DGEMM of Gflops

Performance for Sequential OOC LU

Earth Science Application
Gravity Recovery And Climate Experiment (GRACE): a collaborative effort between
- The University of Texas Center for Space Research (CSR)
- The Jet Propulsion Laboratory (JPL)
- GeoForschungsZentrum (GFZ)
- Deutsches Zentrum für Luft- und Raumfahrt (DLR)
- National Aeronautics and Space Administration (NASA)

Earth Science Application
The goal was to compute a rigorous 360x360 gravity model:
- No approximation techniques; translates to roughly 100 km^2 resolution
- Involves the least-squares estimation of ~130,000 parameters
- Requires the combination of hundreds of millions of observations: surface gravity data (land), ~1/2 TB of altimetry-based mean sea surface data (ocean), and GRACE data (satellite)
Using the new parallel OOC QR algorithm:
- A 360x360 field was generated, complete with full covariance: the largest rigorous gravity field model ever created
- Used a single IBM P690 node: the OOC QR required only 32 GB, while doing it in-core would require 165 GB of memory
- Required ~6 days of wall-clock time to compute (2326 CPU hours); a single-processor machine with sufficient memory would require 3.2 months
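A back-of-the-envelope check makes the in-core memory claim concrete: a dense N×N array of 8-byte doubles for N ~ 130,000 estimated parameters already exceeds 100 GB for the matrix alone (the slide's 165 GB figure is presumably the full in-core working set, including workspace).

```python
# Dense storage for an N x N double-precision matrix, N ~ 130,000.
N = 130_000
gib = N * N * 8 / 2**30
print(round(gib))   # ~126 GiB for the matrix alone
```

Either way, the requirement dwarfs the 32 GB actually available on the node, which is what makes the out-of-core formulation necessary rather than merely convenient.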

Conclusion
- Tile-based out-of-core algorithms provide scalability: the size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size
- The algorithms achieve excellent performance: the large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations, which helps offset the I/O cost of moving tiles to and from disk
- Use of PLAPACK and POOCLAPACK greatly simplified the implementation: it reduces the complexity of the code and makes it portable
- The approach has already proven valuable to Earth science applications

Conclusion
- Broad spectrum of applications: large-scale problems, small clusters, embedded systems, and other small-memory machines
- The tile-based OOC approach can be extended to other dense linear algebra operations: Cholesky, matrix inverse, BLAS-3, etc.
- The goal is to provide a full suite of OOC utilities

For More Information
Visit the PLAPACK website:
Visit the GRACE website: