GPU Implementations for Finite Element Methods

GPU Implementations for Finite Element Methods
Brian S. Cohen
12 December 2016

Last Time… Stiffness Matrix Assembly
[Figure: log-log plot of CPU time [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7) for stiffness matrix assembly, comparing Julia v0.47 and MATLAB v2016b.]
B. Cohen – 21 November 2018

Goals
- Implement an efficient GPU-based assembly routine to interface with the EllipticFEM.jl package
  - Speed test all implementations and compare against the CPU algorithm using varied mesh densities
  - Investigate where the GPU implementation choke points are and how these can be improved in the future
- Implement a GPU-based linear solver routine
  - Speed test the solver and compare against the CPU algorithm

Finite Element Mesh
- A finite element mesh is a set of nodes and elements that divide a geometric domain on which our PDE can be solved
- Other relevant information for the mesh may be necessary
  - Element centroids
  - Element edge lengths
  - Element quality
  - Subdomain tags
- EllipticFEM.jl stores this information in the object meshData
- All meshes are generated using linear 2D triangle elements
- Node data stored as a Float64 2D Array
- Element data stored as an Int64 2D Array

[Figure: element e with its three nodes p_i, p_j, p_k.]

\[ \mathbf{elements} = \begin{bmatrix} \cdots & p_i & \cdots \\ \cdots & p_j & \cdots \\ \cdots & p_k & \cdots \end{bmatrix}, \qquad \mathbf{nodes} = \begin{bmatrix} x_1 & \cdots & x_n \\ y_1 & \cdots & y_n \end{bmatrix} \]
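The column-wise storage described on this slide can be sketched as follows. This is a minimal illustration in plain Python, not the EllipticFEM.jl code: the two-triangle unit-square mesh, the 0-based node indices, and the helper names (element_vertices, centroid, signed_area) are all assumptions made for the example.

```python
# Hypothetical mesh mirroring the slide's layout:
# nodes is 2 x n (row 0 = x, row 1 = y), elements is 3 x m (rows p_i, p_j, p_k).
# Node indices are 0-based here; Julia arrays would be 1-based.
nodes = [
    [0.0, 1.0, 1.0, 0.0],  # x-coordinates
    [0.0, 0.0, 1.0, 1.0],  # y-coordinates
]
elements = [
    [0, 0],  # p_i of each element
    [1, 2],  # p_j of each element
    [2, 3],  # p_k of each element
]


def element_vertices(nodes, elements, e):
    """Return the three (x, y) vertices of element e (one column of elements)."""
    return [(nodes[0][elements[r][e]], nodes[1][elements[r][e]]) for r in range(3)]


def centroid(verts):
    """Arithmetic mean of the three vertices."""
    return (sum(x for x, _ in verts) / 3.0, sum(y for _, y in verts) / 3.0)


def signed_area(verts):
    """Signed triangle area; positive when vertices are ordered counter-clockwise."""
    (x1, y1), (x2, y2), (x3, y3) = verts
    return 0.5 * ((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))


n_elements = len(elements[0])
centroids = [centroid(element_vertices(nodes, elements, e)) for e in range(n_elements)]
areas = [signed_area(element_vertices(nodes, elements, e)) for e in range(n_elements)]
```

Each element only reads its own column of elements and the matching columns of nodes, which is what makes per-element quantities such as centroids natural candidates for data-parallel evaluation.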

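Finite Element Matrix Assembly
- Consider the simple linear system $\mathbf{K}\mathbf{u} = \mathbf{f}$

\[ \mathbf{K} = \begin{bmatrix} K_{1,1} & \cdots & K_{1,nDOF} \\ \vdots & \ddots & \vdots \\ K_{nDOF,1} & \cdots & K_{nDOF,nDOF} \end{bmatrix} \]

- The stiffness matrix $\mathbf{K}$ is an assembly of all element contributions

\[ \mathbf{K} = \sum_{e=1}^{m} \mathbf{k}_e, \qquad \mathbf{k}_e = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix} \]

- In code: K = sparse(I,J,V)
- Element contributions are derived from the “hat” function used to approximate the solution on each element

\[ \mathbf{k}_e = \int \mathbf{J}^{-T}\mathbf{B}^{T}\,\mathbf{E}\,\mathbf{J}^{-1}\mathbf{B}\,dA \]

[Figure: “hat” basis function $u_i$ plotted over the x-y plane.]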
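The assembly into (I, J, V) triplets can be sketched as below. This is an illustrative Python version under stated assumptions, not the talk's implementation: it assumes the Laplace operator (that is, $\mathbf{E}$ taken as the identity), for which the element integral over a linear triangle has a well-known closed form, and it uses a plain dictionary as a stand-in for the sparse(I,J,V) constructor, which sums entries that share the same (i, j) pair. The mesh and all function names are hypothetical.

```python
# Hypothetical two-triangle mesh on the unit square (0-based indices).
nodes = [[0.0, 1.0, 1.0, 0.0],   # x-coordinates
         [0.0, 0.0, 1.0, 1.0]]   # y-coordinates
elements = [[0, 0],              # p_i
            [1, 2],              # p_j
            [2, 3]]              # p_k


def element_stiffness(verts):
    """3x3 stiffness of a linear triangle, assuming E = identity (Laplace operator)."""
    (x1, y1), (x2, y2), (x3, y3) = verts
    # The hat-function gradients are constant on the element:
    # grad(phi_a) = (b[a], c[a]) / (2 * area)
    b = [y2 - y3, y3 - y1, y1 - y2]
    c = [x3 - x2, x1 - x3, x2 - x1]
    area = 0.5 * ((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    return [[(b[r] * b[s] + c[r] * c[s]) / (4.0 * area) for s in range(3)]
            for r in range(3)]


def assemble_triplets(nodes, elements):
    """Build the I, J, V vectors: nine entries per element, duplicates allowed."""
    I, J, V = [], [], []
    for e in range(len(elements[0])):
        conn = [elements[r][e] for r in range(3)]
        verts = [(nodes[0][p], nodes[1][p]) for p in conn]
        ke = element_stiffness(verts)
        for r in range(3):
            for s in range(3):
                I.append(conn[r])
                J.append(conn[s])
                V.append(ke[r][s])
    return I, J, V


def sparse_from_triplets(I, J, V):
    """Dictionary stand-in for sparse(I, J, V): repeated (i, j) pairs are summed."""
    K = {}
    for i, j, v in zip(I, J, V):
        K[(i, j)] = K.get((i, j), 0.0) + v
    return K


I, J, V = assemble_triplets(nodes, elements)
K = sparse_from_triplets(I, J, V)
```

The nine values per element depend only on that element's vertices, so the V vector can be filled independently per element; only the final summation of duplicates couples elements that share nodes.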

GPU Implementation A
[Flowchart: three stages (Pre-Processing, Assemble $\mathbf{K}$ Matrix, Solve), with each step marked as running on the CPU or the GPU. Steps shown: Read Equation Data, Generate Geometric Data, Generate Mesh Data, Generate (I,J) Vectors, Generate Ke_Values Array, Call sparse() constructor, u = K\b.]
- Double for-loop implementation

GPU Implementation B
[Flowchart: three stages (Pre-Processing, Assemble $\mathbf{K}$ Matrix, Solve), with each step marked as running on the CPU or the GPU. Steps shown: Read Equation Data, Generate Geometric Data, Generate Mesh Data, Generate (I,J) Vectors, Generate Ke_Values Array, Call sparse() constructor, u = K\b.]
- Transfer node and element arrays only to GPU

CPU vs. GPU Implementations
[Figure: log-log plot of CPU time for I, J, V assembly [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7), comparing the CPU implementation, GPU Implementation A, and GPU Implementation B. Hardware: GeForce GTX 765M, 2048 MB.]

Runtime Diagnostics
[Figure: two panels (Implementation A, Implementation B) showing CPU runtime [s] versus number of DOFs (10^2 to 10^6), broken down into CPU -> GPU transfer, I, J, V array assembly, GPU -> CPU transfer, and sparse() assembly.]
- Overhead to transfer mesh data from CPU → GPU is low
- Overhead to transfer mesh data from GPU → CPU is high
- It would be nice to be able to perform sparse matrix construction on the GPU
- It would be even nicer to solve the problem on the GPU

Solving the Model
- Now we want to solve the linear model: $\mathbf{K}\mathbf{u} = \mathbf{f}$
- ArrayFire.jl does not currently support sparse matrices
  - Dense matrix operations seem to be comparable in speed or slower on GPUs than on CPUs
- CUSPARSE.jl wraps NVIDIA CUSPARSE library functions
  - High performance sparse linear algebra library
  - Does not wrap any solver routines
  - Built on the CUDArt.jl package
    - Wraps the CUDA runtime API
- Both packages require the CUDA Toolkit (v8.0)

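GPU Solver Implementation
- Preconditioned Conjugate Gradient Method
  - $\mathbf{K}$ is a sparse symmetric positive definite matrix
  - Improves convergence if $\mathbf{K}$ is not well conditioned
- Uses the Incomplete Cholesky Factorization: $\mathbf{K} \approx \mathbf{M} = \mathbf{R}^{T}\mathbf{R}$
- Rather than solve the original system $\mathbf{K}\mathbf{u} = \mathbf{f}$
- We solve the following system: $\mathbf{R}^{-T}\mathbf{K}\mathbf{R}^{-1}\,(\mathbf{R}\mathbf{u}) = \mathbf{R}^{-T}\mathbf{f}$

Algorithm:
    $\mathbf{r} \leftarrow \mathbf{f} - \mathbf{K}\mathbf{u}$
    for $i = 1, 2, \ldots$ until convergence do
        solve $\mathbf{M}\mathbf{z} = \mathbf{r}$ for $\mathbf{z}$
        $\rho_i \leftarrow \mathbf{r}^{T}\mathbf{z}$
        if $i = 1$ then
            $\mathbf{p} \leftarrow \mathbf{z}$
        else
            $\beta \leftarrow \rho_i / \rho_{i-1}$
            $\mathbf{p} \leftarrow \mathbf{z} + \beta\mathbf{p}$
        end if
        $\mathbf{q} \leftarrow \mathbf{K}\mathbf{p}$
        $\alpha \leftarrow \rho_i / (\mathbf{p}^{T}\mathbf{q})$
        $\mathbf{u} \leftarrow \mathbf{u} + \alpha\mathbf{p}$
        $\mathbf{r} \leftarrow \mathbf{r} - \alpha\mathbf{q}$
    end for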
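The iteration on this slide can be sketched on the CPU as follows. This is a plain Python illustration, not the talk's GPU code. The assumptions are: dense list-of-lists storage for brevity (a real solver would use a sparse format), a zero-fill incomplete Cholesky variant, IC(0), as the factorization, and a hypothetical 4 x 4 finite-difference Laplacian as the test matrix. All function names are made up for the example.

```python
import math


def incomplete_cholesky(K):
    """IC(0): upper-triangular R with R^T R ~= K, restricted to K's nonzero pattern.

    Assumes the pivots stay positive (true for the Laplacian-type matrix used here;
    IC(0) can break down for a general SPD matrix).
    """
    n = len(K)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(K[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            if K[i][j] != 0.0:  # drop any fill-in outside the pattern of K
                s = sum(R[k][i] * R[k][j] for k in range(i))
                R[i][j] = (K[i][j] - s) / R[i][i]
    return R


def apply_preconditioner(R, r):
    """Solve M z = r with M = R^T R via two triangular solves."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):  # forward substitution with R^T
        y[i] = (r[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution with R
        z[i] = (y[i] - sum(R[i][k] * z[k] for k in range(i + 1, n))) / R[i][i]
    return z


def matvec(K, x):
    return [sum(a * b for a, b in zip(row, x)) for row in K]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcg(K, f, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradients; returns (u, iterations used)."""
    R = incomplete_cholesky(K)
    u = [0.0] * len(f)
    r = [fi - ki for fi, ki in zip(f, matvec(K, u))]
    f_norm = math.sqrt(dot(f, f)) or 1.0
    p, rho_prev = None, None
    for i in range(1, max_iter + 1):
        if math.sqrt(dot(r, r)) <= tol * f_norm:
            return u, i - 1
        z = apply_preconditioner(R, r)
        rho = dot(r, z)
        if i == 1:
            p = z
        else:
            beta = rho / rho_prev
            p = [zi + beta * pi for zi, pi in zip(z, p)]
        q = matvec(K, p)
        alpha = rho / dot(p, q)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_prev = rho
    return u, max_iter


def laplacian_2d(g):
    """Hypothetical test matrix: 5-point Laplacian on a g x g grid (SPD)."""
    n = g * g
    K = [[0.0] * n for _ in range(n)]
    for a in range(g):
        for b in range(g):
            i = a * g + b
            K[i][i] = 4.0
            if a > 0:
                K[i][i - g] = -1.0
            if a < g - 1:
                K[i][i + g] = -1.0
            if b > 0:
                K[i][i - 1] = -1.0
            if b < g - 1:
                K[i][i + 1] = -1.0
    return K


K = laplacian_2d(4)
f = [1.0] * len(K)
u, iterations = pcg(K, f)
residual = [fi - ki for fi, ki in zip(f, matvec(K, u))]
```

Per iteration the work is one matrix-vector product, two triangular solves, and a handful of dot products and vector updates, which are the kernels a GPU version would need from a sparse linear algebra library.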

Solver Results
[Figure: log-log plot of CPU time to solve Ku=f [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7), comparing the CPU implementation and the GPU implementation.]

Conclusion
- GPU computing in Julia shows promise of speeding up FEM matrix assembly and solve routines
  - Potentially greater gains to be made with higher order 2D/3D elements
- Minimize data transfer to the GPU needed to assemble FEM matrices
  - Keeping code vectorized helps
  - Removing any temporary data copies on GPU
- ArrayFire.jl should (and hopefully soon will) support sparse matrix assembly and arithmetic
  - Open issue on GitHub since late September, 2016
- CUSPARSE.jl should (and hopefully soon will) wrap additional functions
  - COO matrix constructor and iterative solvers would be especially useful
- Large impact on optimization problems where matrix assembly routines and solvers are called many times