
1 Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory A. Newman (LBNL)

2 Overview: • Introduction: Geophysical modeling on GPUs • Iterative Krylov solvers on GPU and implementation details • Krylov solver performance tests • Conclusions

3 CSEM data inversion using QMR: EMGeo-GPU has already been run successfully on 16 NVIDIA Tesla C2050 (Fermi) GPUs (3 GB memory, 448 parallel CUDA processor cores each), compared to 16 × 8 Intel quad-core Nehalem CPUs at 2.4 GHz. CSEM imaging experiment of the Troll gas field (North Sea).

4 ERT data inversion using CG: CO₂ plume imaging study

5 SIP data inversion using BiCG: Rifle SIP monitoring study

6 Finite-difference representation of Maxwell and Poisson equations: Maxwell equation → 13-point stencil; Poisson equation → 7-point stencil

7 Iterative Krylov subspace methods: solution of the linear system Ax = b involves constructing the Krylov subspace K_m(A, r_0) = span{r_0, A r_0, …, A^(m−1) r_0} in order to compute the optimal approximation x_m from x_0 + K_m.
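As a concrete reference for the iteration itself (not the EMGeo-GPU solver), the following minimal host-side conjugate-gradient sketch in C, compilable with nvcc or gcc, shows how one SpMV per iteration expands the Krylov subspace; the 1-D Poisson test matrix, the size N and the tolerance are placeholders chosen only for illustration.

/* Minimal unpreconditioned conjugate-gradient sketch (host code).
   Illustrative only: the matrix is a 1-D Poisson stencil, not a
   geophysical modeling operator. */
#include <stdio.h>
#include <math.h>

#define N 64

/* y = A*x for the 1-D Poisson matrix tridiag(-1, 2, -1) */
static void spmv(const double *x, double *y) {
    for (int i = 0; i < N; ++i) {
        double v = 2.0 * x[i];
        if (i > 0)     v -= x[i - 1];
        if (i < N - 1) v -= x[i + 1];
        y[i] = v;
    }
}

static double dot(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void) {
    double b[N], x[N], r[N], p[N], Ap[N];
    for (int i = 0; i < N; ++i) { b[i] = 1.0; x[i] = 0.0; r[i] = b[i]; p[i] = r[i]; }
    double rho = dot(r, r);                       /* squared residual norm */
    for (int k = 0; k < 1000 && sqrt(rho) > 1e-10; ++k) {
        spmv(p, Ap);                              /* one SpMV expands the Krylov subspace */
        double alpha = rho / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rho_new = dot(r, r);
        double beta = rho_new / rho;
        for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
        rho = rho_new;
    }
    printf("final residual norm: %g\n", sqrt(rho));
    return 0;
}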

8 Numerical modeling on GPUs. Main challenge: manage memory access in the most efficient way

9 Sparse matrix types arising in electrical and electromagnetic modeling problems Maxwell: Controlled-source EM, Magnetotelluric Poisson: Electrical resistivity tomography, Induced polarization

10 Sparse matrix storage formats, ordered from structured to unstructured: Diagonal (DIA), Ellpack (ELL), Compressed Row (CSR), Hybrid (HYB), Coordinate (COO)

11 ELLPACK Format: • Storage of N non-zeros per matrix row • Zero-padding for rows with < N non-zeros • Ease of implementation
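As a small illustration (not taken from the slides), a 4 × 4 matrix with at most N = 2 non-zeros per row would be stored in ELLPACK as two dense 4 × 2 arrays, one for the values and one for the column indices; the numerical values here are made up for the example.

/* ELLPACK storage of a 4x4 matrix with at most 2 non-zeros per row.
   Padded value entries are 0.0; padded column indices simply repeat a
   valid index, so the padded products contribute nothing. */
#define NROWS 4
#define MAXNZ 2

/* A = [ 10   0   1   0 ]
       [  0  20   0   0 ]
       [  0   2  30   0 ]
       [  0   0   0  40 ] */
static const double ell_val[NROWS][MAXNZ] = {
    { 10.0,  1.0 },
    { 20.0,  0.0 },   /* padded */
    {  2.0, 30.0 },
    { 40.0,  0.0 }    /* padded */
};
static const int ell_col[NROWS][MAXNZ] = {
    { 0, 2 },
    { 1, 1 },         /* padded */
    { 1, 2 },
    { 3, 3 }          /* padded */
};

/* Reference SpMV over the padded arrays: y = A*x */
static void ell_spmv_ref(const double *x, double *y) {
    for (int i = 0; i < NROWS; ++i) {
        double sum = 0.0;
        for (int j = 0; j < MAXNZ; ++j)
            sum += ell_val[i][j] * x[ell_col[i][j]];
        y[i] = sum;
    }
}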

12 ELL SpMV GPU implementation: n – number of rows in the matrix (large), m – maximum number of non-zeros per row (small). (Figure: index matrix, value matrix, input vector x, output vector y.)

13–16 ELL SpMV GPU implementation: one thread per row, row concatenation. (Animation: memory position of matrix element (1,3), handled by GPU thread 1.) Memory access is not coalesced!
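A minimal CUDA sketch of this first variant, assuming row-concatenated (row-major) value and index arrays; the kernel name and signature are illustrative, not the EMGeo-GPU code.

/* One thread per row, row-concatenated (row-major) ELL arrays.
   Element j of row i lives at val[i*max_nz + j], so neighbouring
   threads read addresses max_nz apart: the loads are NOT coalesced. */
__global__ void ell_spmv_row_major(int n, int max_nz,
                                   const double *val, const int *col,
                                   const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int j = 0; j < max_nz; ++j)
            sum += val[row * max_nz + j] * x[col[row * max_nz + j]];
        y[row] = sum;
    }
}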

17–19 ELL SpMV GPU implementation: many threads per row, row concatenation, with an in-block reduction. Reads are coalesced, but the reduction and the writing of the rhs are slow!

20–22 ELL SpMV GPU implementation: one thread per row, column concatenation. Coalesced reads and no reductions.
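A corresponding sketch of the coalesced variant, again with an illustrative signature: the ELL arrays are concatenated column by column (all first non-zeros of every row, then all second non-zeros, and so on), so consecutive threads read consecutive addresses and no reduction is needed.

/* One thread per row, column-concatenated (column-major) ELL arrays.
   Element j of row i lives at val[j*n + i], so consecutive threads
   read consecutive addresses: coalesced loads, no reduction. */
__global__ void ell_spmv_col_major(int n, int max_nz,
                                   const double *val, const int *col,
                                   const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int j = 0; j < max_nz; ++j)
            sum += val[j * n + row] * x[col[j * n + row]];
        y[row] = sum;
    }
}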

23 ELL SpMV GPU implementation: results for 13 non-zero elements per row on a Tesla C2050.

24 Minimize memory bandwidth: use fused kernels; use pointer swaps instead of memory copies when possible.
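Two hedged sketches of what these two points can look like in practice (the names are illustrative, not the EMGeo-GPU routines): a fused kernel that performs the two CG vector updates in one pass over memory, and a pointer swap that replaces a device-to-device copy.

/* (a) Fused kernel: x += alpha*p and r -= alpha*Ap are done in a single
       pass over the vectors instead of two separate axpy kernels. */
__global__ void fused_cg_update(int n, double alpha,
                                const double *p, const double *Ap,
                                double *x, double *r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += alpha * p[i];
        r[i] -= alpha * Ap[i];
    }
}

/* (b) Pointer swap: instead of
       cudaMemcpy(d_old, d_new, n*sizeof(double), cudaMemcpyDeviceToDevice)
       at the end of an iteration, just exchange the device pointers. */
static void swap_vectors(double **d_old, double **d_new)
{
    double *tmp = *d_old;
    *d_old = *d_new;
    *d_new = tmp;
}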

25 CPU communication

26 Multi-GPU communication: use the same layout for vectors on the CPU and on the GPU; this simplifies the MPI communication routines. The extra complication is the data transfer to the CPU.

27 Multi-GPU communication: GPU communication diagram.

28 Multi-GPU communication: blocking communication.

29 Multi-GPU communication: non-blocking communication.
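A hedged host-side sketch of the non-blocking pattern for a simple 1-D domain decomposition; the kernel, buffer layout and sizes are placeholders rather than the EMGeo-GPU routines. Halo values are exchanged with MPI_Isend/MPI_Irecv while the GPU works on the interior rows, and the boundary rows are finished only after MPI_Waitall.

/* Non-blocking halo exchange overlapped with interior work (sketch).
   Device array layout: [ ghost_lo | NLOC owned values | ghost_hi ].
   Edge ranks can pass MPI_PROC_NULL as a neighbour rank. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NLOC 1024   /* rows owned by this rank (illustrative) */

/* stand-in for the SpMV: out[i] = v[i-1] + v[i] + v[i+1] on rows [first, last) */
__global__ void stencil_rows(const double *v, double *out, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last) out[i] = v[i - 1] + v[i] + v[i + 1];
}

void halo_exchange_overlap(double *d_v, double *d_out, int rank_lo, int rank_hi)
{
    double h_send[2], h_recv[2];
    MPI_Request reqs[4];

    /* 1. copy the owned boundary values GPU -> CPU */
    cudaMemcpy(&h_send[0], d_v + 1,    sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_send[1], d_v + NLOC, sizeof(double), cudaMemcpyDeviceToHost);

    /* 2. post non-blocking receives and sends to both neighbours */
    MPI_Irecv(&h_recv[0], 1, MPI_DOUBLE, rank_lo, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&h_recv[1], 1, MPI_DOUBLE, rank_hi, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&h_send[0], 1, MPI_DOUBLE, rank_lo, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&h_send[1], 1, MPI_DOUBLE, rank_hi, 0, MPI_COMM_WORLD, &reqs[3]);

    /* 3. overlap: interior rows 2 .. NLOC-1 need no remote data */
    stencil_rows<<<(NLOC + 255) / 256, 256>>>(d_v, d_out, 2, NLOC);

    /* 4. finish communication, push the received ghost values to the GPU */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpy(d_v,             &h_recv[0], sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v + NLOC + 1,  &h_recv[1], sizeof(double), cudaMemcpyHostToDevice);

    /* 5. boundary rows 1 and NLOC use the ghost values */
    stencil_rows<<<1, 32>>>(d_v, d_out, 1, 2);
    stencil_rows<<<1, 32>>>(d_v, d_out, NLOC, NLOC + 1);
    cudaDeviceSynchronize();
}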

30 Iterative Krylov solver performance tests Typically used for EM problems: CG, BiCG, QMR

31 Computing times for 1000 Krylov solver iterations

32 SpMV with "Constant-Coefficient-Matrix": vector Helmholtz equation, angular frequency ω = 2πf

33 SpMV with Constant-Coefficient-Matrix: choose Dirichlet boundary conditions such that the operator is in ℝ^(n×n)


36 Pseudo code for SpMV with “standard” matrix: Ax=b

37 Pseudo code for SpMV with Constant-Coefficient-Matrix: Cx + dx = b, with scaling of the solution vector and scaling of the rhs vector
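A hedged sketch of the idea as we read it from these slides: the off-diagonal stencil coefficients are identical for every row (the constant-coefficient matrix C, e.g. 13 entries for the Maxwell stencil), only a diagonal term d varies, and the grid-dependent scaling of the solution and rhs vectors is applied separately. The kernel below is illustrative and does not reproduce the actual EMGeo-GPU pseudo code.

/* SpMV for y = C*x + d.*x, where the stencil coefficients in c_coef are the
   same for every row and are kept in constant memory (filled once from the
   host with cudaMemcpyToSymbol). */
#define MAX_NZ 13                     /* 13-point Maxwell stencil */

__constant__ double c_coef[MAX_NZ];   /* constant stencil coefficients */

__global__ void ccm_spmv(int n, const int *col,   /* column-major index array */
                         const double *d,         /* variable diagonal term    */
                         const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = d[row] * x[row];                /* variable part */
        for (int j = 0; j < MAX_NZ; ++j)
            sum += c_coef[j] * x[col[j * n + row]];  /* constant part */
        y[row] = sum;
    }
}

With this layout only the integer index array and the diagonal/scaling vectors still grow with the grid, which is consistent with slide 42's observation that the index array is the only significant memory consumer.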

38 QMR solver performance on CPU & GPU using CCM – solution times for 1000 Krylov solver iterations. Example grid size: 190 × 190 × 100

39 QMR solver performance on GPU using CCM – memory throughput

40 Grid intervals → coefficients. Example grid size: 100 × 100 × 100

41 Grid intervals → solution times. Increase in computing time: ≈ 17 %

42 Grid intervals → memory usage. The only significant portion is given by the index array.

43 Conclusions: our GPU implementation of iterative Krylov methods exploits the massive parallelism of modern GPU hardware; efficiency increases with problem size; memory limitations are overcome by the multi-GPU scheme and a novel SpMV method for structured grids.

44 Thanks to the National Energy Research Scientific Computing Center (NERSC) for support provided through the NERSC Petascale Program

