Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory A. Newman (LBNL)

Overview: introduction to geophysical modeling on GPUs; iterative Krylov solvers on GPU and implementation details; Krylov solver performance tests; conclusions.

CSEM data inversion using QMR: EMGeo-GPU has already been run successfully on 16 NVIDIA Tesla C2050 (Fermi) GPUs (3 GB memory and 448 parallel CUDA processor cores each), compared to 16 × 8 Intel quad-core Nehalem CPUs at 2.4 GHz, for a CSEM imaging experiment of the Troll gas field (North Sea).

ERT data inversion using CG: CO₂ plume imaging study.

SIP data inversion using BiCG: Rifle SIP monitoring study.

Finite-difference representation of the Maxwell and Poisson equations: Maxwell equation → 13-point stencil; Poisson equation → 7-point stencil.
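For concreteness, a textbook second-order 7-point discretization of the Poisson operator on a uniform grid with spacing h (the slide only names the stencil, so this standard form is given for reference) is

    (\nabla^2 u)_{i,j,k} \approx \frac{u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1} - 6\,u_{i,j,k}}{h^2},

which contributes 7 non-zeros per matrix row; the staggered-grid curl-curl (Maxwell) operator couples the field components and yields 13 non-zeros per row.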

Iterative Krylov subspace methods Solution of the linear system involves constructing the Krylov subspace in order to compute the optimal approximation
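For a linear system Ax = b with initial guess x_0 and residual r_0 = b - A x_0, the m-th Krylov subspace is

    \mathcal{K}_m(A, r_0) = \operatorname{span}\{\, r_0,\; A r_0,\; A^2 r_0,\; \dots,\; A^{m-1} r_0 \,\},

and CG, BiCG and QMR all pick their m-th iterate from x_0 + \mathcal{K}_m(A, r_0), differing only in the optimality or biorthogonality condition they enforce; each iteration is therefore dominated by one or two sparse matrix-vector products.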

Numerical modeling on GPUs. Main challenge: managing memory access in the most efficient way.

Sparse matrix types arising in electrical and electromagnetic modeling problems. Maxwell: controlled-source EM, magnetotellurics. Poisson: electrical resistivity tomography, induced polarization.

Sparse matrix storage formats. Structured: Diagonal (DIA), Ellpack (ELL). Unstructured: Compressed Row (CSR), Hybrid (HYB), Coordinate (COO).

ELLPACK format: storage of N non-zeros per matrix row; zero-padding for rows with fewer than N non-zeros; ease of implementation.
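As a small illustration (not taken from the slides), a 4 × 4 matrix with at most N = 2 non-zeros per row stored in ELLPACK format:

    A = | 10  0  2  0 |      value = | 10  2 |      index = | 0  2 |
        |  3  9  0  0 |              |  3  9 |              | 0  1 |
        |  0  7  8  0 |              |  7  8 |              | 1  2 |
        |  0  0  0  5 |              |  5  0 |              | 3  0 |

The last row holds only one non-zero, so its second slot is zero-padded (value 0, index arbitrary).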

ELL SpMV GPU implementation. n: number of rows in the matrix (large); m: maximum number of non-zeros per row (small). [Figure: the n × m index and value matrices, the input vector x and the output vector y]

ELL SpMV GPU implementation, variant 1: one thread per row, row concatenation. Memory access is not coalesced. [Figure: memory positions of the matrix elements, e.g. element (1,3), and the GPU thread numbers assigned to them]
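A minimal CUDA sketch of this variant (one thread per row, with the index and value arrays stored row after row); kernel and array names are illustrative, not EMGeo's actual code:

    __global__ void ell_spmv_row_major(int n, int m, const int *idx,
                                       const float *val, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float sum = 0.0f;
            for (int j = 0; j < m; ++j) {
                // entry j of this row sits at row*m + j: consecutive threads
                // read addresses m elements apart, so the loads are not coalesced
                sum += val[row * m + j] * x[idx[row * m + j]];
            }
            y[row] = sum;
        }
    }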

ELL SpMV GPU implementation, variant 2: many threads per row, row concatenation, with in-block reduction. Reads are coalesced, but the reduction and the writing of the right-hand side are slow. [Figure: memory positions of the matrix elements and the GPU threads assigned to them]

ELL SpMV GPU implementation, variant 3: one thread per row, column concatenation. Coalesced reads and no reductions. [Figure: memory positions of the matrix elements and the assigned GPU threads, some belonging to another block]
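A corresponding sketch of this preferred variant, with the ELL arrays stored column after column so that entry j of row r sits at position j·n + r and consecutive threads read consecutive addresses (again with illustrative names, assuming zero-padded values):

    __global__ void ell_spmv_col_major(int n, int m, const int *idx,
                                       const float *val, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            float sum = 0.0f;
            for (int j = 0; j < m; ++j) {
                // coalesced: threads row, row+1, ... access j*n + row, j*n + row + 1, ...
                sum += val[j * n + row] * x[idx[j * n + row]];
            }
            y[row] = sum;   // one write per row, no reduction needed
        }
    }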

ELL SpMV GPU implementation: results for 13 non-zero elements per row on a Tesla C2050. [Figure: performance comparison]

Minimize memory bandwidth: use fused kernels; use pointer swaps instead of memory copies when possible.
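For example, when a Krylov iteration's updated vector becomes the "previous" vector of the next iteration, swapping device pointers replaces a device-to-device copy (a generic sketch with hypothetical buffer names, not EMGeo code):

    // instead of: cudaMemcpy(d_p_old, d_p_new, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float *tmp = d_p_old;
    d_p_old = d_p_new;   // next iteration reads the freshly computed vector
    d_p_new = tmp;       // ...and overwrites the old buffer

Fusing kernels works in the same spirit: combining, say, an AXPY update with the following dot product into one kernel means the vector is read from global memory once instead of twice.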

CPU communication

Multi-GPU communication: use the same layout for vectors on the CPU and the GPU. This simplifies the MPI communication routines, at the cost of an extra complication in the data transfer to the CPU.

Multi-GPU communication. [Figure: GPU communication diagram]

Multi-GPU communication: blocking communication.

Multi-GPU communication: non-blocking communication.
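A hedged sketch of the non-blocking pattern for exchanging halo (boundary) values between subdomains: the interior part of the SpMV needs no remote data, so it can run while the messages are in flight. Buffer, kernel and neighbor names are assumptions for illustration:

    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_halo_nonblocking(const float *d_boundary, float *d_halo,
                                   float *h_send, float *h_recv,
                                   int nb, int neighbor)
    {
        MPI_Request req[2];

        // stage the boundary layer on the host
        cudaMemcpy(h_send, d_boundary, nb * sizeof(float), cudaMemcpyDeviceToHost);

        MPI_Irecv(h_recv, nb, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(h_send, nb, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &req[1]);

        // ... launch the SpMV kernel for interior rows here; it overlaps with
        //     the communication because it never touches the halo ...

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        cudaMemcpy(d_halo, h_recv, nb * sizeof(float), cudaMemcpyHostToDevice);

        // ... then launch the kernel for the rows that reference halo values ...
    }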

Iterative Krylov solver performance tests. Typically used for EM problems: CG, BiCG, QMR.

Computing times for 1000 Krylov solver iterations

SpMV with “constant-coefficient matrix”: vector Helmholtz equation, with angular frequency ω = 2πf.
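One common frequency-domain form of this equation for the electric field, assuming an e^{i\omega t} time dependence and neglecting displacement currents (the slide does not spell out the convention), is

    \nabla \times \nabla \times \mathbf{E} + i\omega\mu\sigma\,\mathbf{E} = -\,i\omega\mu\,\mathbf{J}_s, \qquad \omega = 2\pi f,

whose staggered-grid finite-difference discretization produces the 13-point-stencil system solved here.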

SpMV with constant-coefficient matrix: choose Dirichlet boundary conditions such that the operator lies in ℝ^(n×n).

Pseudo code for SpMV with a “standard” matrix: Ax = b. [Figure: pseudo code]

Pseudo code for SpMV with constant-coefficient matrix: Cx + dx = b, with scaling of the solution vector and scaling of the right-hand-side vector. [Figure: pseudo code]
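A minimal sketch of what such a constant-coefficient SpMV can look like for a 7-point stencil on an nx × ny × nz grid: all interior rows share the 7 stencil coefficients c[0..6], while a per-row array d carries the variable (diagonal) part, so the large per-row value matrix disappears. The scaling of the solution and right-hand-side vectors indicated on the slide would be applied outside this kernel. All names and details are illustrative assumptions, not the actual EMGeo code:

    __global__ void spmv_ccm_7pt(int nx, int ny, int nz,
                                 const float *c,   // 7 constant stencil coefficients
                                 const float *d,   // per-row diagonal (variable part)
                                 const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int n = nx * ny * nz;
        if (i < n) {
            // lexicographic ordering of interior unknowns; homogeneous Dirichlet
            // boundaries mean out-of-range neighbors simply contribute nothing
            int ix = i % nx, iy = (i / nx) % ny, iz = i / (nx * ny);
            float sum = c[0] * x[i] + d[i] * x[i];
            if (ix > 0)      sum += c[1] * x[i - 1];
            if (ix < nx - 1) sum += c[2] * x[i + 1];
            if (iy > 0)      sum += c[3] * x[i - nx];
            if (iy < ny - 1) sum += c[4] * x[i + nx];
            if (iz > 0)      sum += c[5] * x[i - nx * ny];
            if (iz < nz - 1) sum += c[6] * x[i + nx * ny];
            y[i] = sum;
        }
    }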

QMR solver performance on CPU and GPU using CCM: solution times for 1000 Krylov solver iterations. Example grid size: 190 × 190 × 100.

QMR solver performance on GPU using CCM: memory throughput.

Grid intervals → coefficients. Example grid size: 100 × 100 × 100.

Grid intervals → solution times. Increase in computing time: ≈ 17%.

Grid intervals → memory usage. The only significant portion is given by the index array.
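As a rough, illustrative estimate for the example grid: with 100 × 100 × 100 ≈ 10^6 rows, 13 entries per row and 4-byte integers, an explicitly stored ELL index array alone occupies about 10^6 × 13 × 4 B ≈ 52 MB, whereas the constant-coefficient values reduce to a handful of stencil coefficients plus one diagonal entry per row.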

Conclusions: our GPU implementation of iterative Krylov methods exploits the massive parallelism of modern GPU hardware; efficiency increases with problem size; memory limitations are overcome by a multi-GPU scheme and a novel SpMV method for structured grids.

Thanks to the National Energy Research Scientific Computing Center (NERSC) for support provided through the NERSC Petascale Program.