1
GPU Implementations for Finite Element Methods
Brian S. Cohen 12 December 2016
2
Stiffness Matrix Assembly
Last Time…
[Figure: log-log plot of CPU time [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7) for stiffness matrix assembly, comparing Julia v0.47 and MATLAB v2016b]
B. Cohen – 21 November 2018
3
Goals
Implement an efficient GPU-based assembly routine to interface with the EllipticFEM.jl package
Speed test all implementations and compare against CPU algorithm using varied mesh densities
Investigate where GPU implementation choke points are and how these can be improved in the future
Implement a GPU-based linear solver routine
Speed test solver and compare against CPU algorithm
4
Finite Element Mesh
A finite element mesh is a set of nodes and elements that divide a geometric domain on which our PDE can be solved
Other relevant information for the mesh may be necessary
  Element centroids
  Element edge lengths
  Element quality
  Subdomain tags
EllipticFEM.jl stores this information in object meshData
All meshes are generated using linear 2D triangle elements
  Node data stored as Float64 2D Array
  Element data stored as Int64 2D Array
[Figure: element e with nodes p_i, p_j, p_k]
$$\mathbf{elements} = \begin{bmatrix} \cdots & p_i & \cdots \\ \cdots & p_j & \cdots \\ \cdots & p_k & \cdots \end{bmatrix}, \qquad \mathbf{nodes} = \begin{bmatrix} x_1 & \cdots & x_n \\ y_1 & \cdots & y_n \end{bmatrix}$$
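The node and element layout described on this slide can be sketched in a few lines. This is a plain-Python, CPU-only illustration, not the EllipticFEM.jl code: the two-triangle unit-square mesh and the helper names (element_points, centroid, edge_lengths, area) are hypothetical, and indices are 0-based where Julia would use 1-based.

```python
import math

# Hypothetical example mesh: the unit square split into two linear triangles.
# nodes is 2 x n (row 0 holds x, row 1 holds y); elements is 3 x m, one column
# per element. Indices are 0-based here, whereas Julia arrays are 1-based.
nodes = [[0.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
elements = [[0, 0],
            [1, 2],
            [2, 3]]

def element_points(nodes, elements, e):
    """Return the three (x, y) corner points of element e."""
    return [(nodes[0][elements[k][e]], nodes[1][elements[k][e]]) for k in range(3)]

def centroid(pts):
    """Centroid of a triangle: the mean of its corner points."""
    return (sum(p[0] for p in pts) / 3.0, sum(p[1] for p in pts) / 3.0)

def edge_lengths(pts):
    """Lengths of the edges (p_i, p_j), (p_j, p_k), (p_k, p_i)."""
    lengths = []
    for k in range(3):
        (xa, ya), (xb, yb) = pts[k], pts[(k + 1) % 3]
        lengths.append(math.hypot(xb - xa, yb - ya))
    return lengths

def area(pts):
    """Unsigned triangle area from the cross product of two edge vectors."""
    (x1, y1), (x2, y2), (x3, y3) = pts
    return 0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

# Derived per-element data of the kind the slide lists.
for e in range(len(elements[0])):
    pts = element_points(nodes, elements, e)
    print(e, centroid(pts), edge_lengths(pts), area(pts))
```

Storing one element per column means the three node indices of an element can be read with a single column access, which is the natural layout for column-major arrays such as Julia's.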
5
Finite Element Matrix Assembly
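Consider the simple linear system
$$\mathbf{K}\mathbf{u} = \mathbf{f}, \qquad \mathbf{K} = \begin{bmatrix} K_{1,1} & \cdots & K_{1,nDOF} \\ \vdots & \ddots & \vdots \\ K_{nDOF,1} & \cdots & K_{nDOF,nDOF} \end{bmatrix}$$
The stiffness matrix $\mathbf{K}$ is an assembly of all element contributions
$$\mathbf{K} = \sum_{e=1}^{m} \mathbf{k}_e, \qquad \mathbf{k}_e = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}$$
K = sparse(I,J,V)
Element contributions are derived from the “hat” function used to approximate the solution on each element
$$\mathbf{k}_e = \int \mathbf{J}^{-T}\mathbf{B}^{T}\,\mathbf{E}\,\mathbf{J}^{-1}\mathbf{B}\;dA$$
[Figure: hat function u_i plotted over the x-y plane]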
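As a concrete instance of the sum over element contributions, here is a minimal Python sketch for linear triangles. It assumes the Laplace operator (E taken as the identity) and uses the closed-form hat-function gradients instead of forming J and B explicitly. The function names and the dictionary storage for K are illustrative choices, not the EllipticFEM.jl implementation.

```python
def element_stiffness(pts):
    """3x3 stiffness matrix of a linear triangle, assuming E = identity (Laplace)."""
    (x1, y1), (x2, y2), (x3, y3) = pts
    # The gradient of hat function i is (b[i], c[i]) / (2 * signed area).
    b = [y2 - y3, y3 - y1, y1 - y2]
    c = [x3 - x2, x1 - x3, x2 - x1]
    area = 0.5 * abs(x1 * b[0] + x2 * b[1] + x3 * b[2])
    return [[(b[i] * b[j] + c[i] * c[j]) / (4.0 * area) for j in range(3)]
            for i in range(3)]

def assemble(nodes, elements):
    """Sum all element contributions into K, stored as a {(row, col): value} dict."""
    K = {}
    for e in range(len(elements[0])):
        idx = [elements[k][e] for k in range(3)]          # global node indices
        ke = element_stiffness([(nodes[0][n], nodes[1][n]) for n in idx])
        for a in range(3):
            for b in range(3):
                key = (idx[a], idx[b])
                K[key] = K.get(key, 0.0) + ke[a][b]       # overlapping entries add up
    return K

# Hypothetical mesh: unit square split into two triangles (0-based indices).
nodes = [[0.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
elements = [[0, 0],
            [1, 2],
            [2, 3]]
K = assemble(nodes, elements)
```

Because every hat-function gradient sums to zero over an element, each row of the assembled K sums to zero before boundary conditions are applied, which is a quick sanity check on any assembly routine.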
6
GPU Implementation A
[Diagram: workflow in three stages, with each step drawn in either a CPU lane or a GPU lane]
Pre-Processing: Read Equation Data, Generate Geometric Data, Generate Mesh Data
Assemble K Matrix: Generate (I,J) Vectors, Generate Ke_Values Array (GPU), Call sparse() constructor
Solve: u = K\b
Double for-loop implementation
7
GPU Implementation B
[Diagram: workflow in three stages, with each step drawn in either a CPU lane or a GPU lane]
Pre-Processing: Read Equation Data, Generate Geometric Data, Generate Mesh Data
Assemble K Matrix: Generate (I,J) Vectors, Generate Ke_Values Array (GPU), Call sparse() constructor
Solve: u = K\b
Transfer node and element arrays only to GPU
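Both implementations build K from index vectors (I,J) and a Ke_Values array. The following Python sketch illustrates why the two can be generated separately: the index vectors depend only on element connectivity, while the values depend on node coordinates. The ordering of the nine entries per element, the helper names, and the placeholder values are assumptions made for illustration.

```python
def build_ij(elements):
    """Row and column index vectors: 9 (row, col) pairs per element, from connectivity alone."""
    I, J = [], []
    for e in range(len(elements[0])):
        idx = [elements[k][e] for k in range(3)]
        for a in range(3):
            for b in range(3):
                I.append(idx[a])
                J.append(idx[b])
    return I, J

def sparse_from_triplets(I, J, V):
    """Mimic sparse(I, J, V): entries that share the same (row, col) are summed."""
    K = {}
    for i, j, v in zip(I, J, V):
        K[(i, j)] = K.get((i, j), 0.0) + v
    return K

# Hypothetical connectivity: two triangles sharing the edge between nodes 0 and 2.
elements = [[0, 0],
            [1, 2],
            [2, 3]]
I, J = build_ij(elements)
V = [1.0] * len(I)   # placeholder values, so K counts the elements touching each pair
K = sparse_from_triplets(I, J, V)
```

With placeholder values of 1.0, an entry of 2.0 marks a node pair shared by both elements, which makes the duplicate-summing behavior of the sparse constructor easy to see.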
8
CPU vs. GPU Implementations
[Figure: log-log plot of CPU time for I, J, V assembly [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7), comparing the CPU Implementation, GPU Implementation A, and GPU Implementation B. Hardware: GeForce GTX 765M, 2048MB]
9
Runtime Diagnostics
[Figure: two panels, Implementation A and Implementation B, each showing CPU runtime [s] versus number of DOFs (10^2 to 10^6), broken down into CPU -> GPU transfer, I, J, V array assembly, GPU -> CPU transfer, and sparse() assembly]
Overhead to transfer mesh data from CPU → GPU is low
Overhead to transfer mesh data from GPU → CPU is high
It would be nice to be able to perform sparse matrix construction on the GPU
It would be even nicer to solve the problem on the GPU
10
Solving the Model
Now we want to solve the linear model:
$$\mathbf{K}\mathbf{u} = \mathbf{f}$$
ArrayFire.jl does not currently support sparse matrices
  Dense matrix operations seem to be comparable in speed or slower on GPUs than CPUs
CUSPARSE.jl wraps NVIDIA CUSPARSE library functions
  High performance sparse linear algebra library
  Does not wrap any solver routines
  Built on CUDArt.jl package
    Wraps the CUDA runtime API
Both packages require CUDA Toolkit (v8.0)
11
GPU Solver Implementation
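Preconditioned Conjugate Gradient Method
  $\mathbf{K}$ is a sparse symmetric positive definite matrix
  Improves convergence if $\mathbf{K}$ is not well conditioned
  Uses the Incomplete Cholesky Factorization
$$\mathbf{K} \approx \mathbf{M} = \mathbf{R}^{T}\mathbf{R}$$
Rather than solve the original system
$$\mathbf{K}\mathbf{u} = \mathbf{f}$$
We solve the following system
$$\mathbf{R}^{-T}\mathbf{K}\mathbf{R}^{-1}\left(\mathbf{R}\mathbf{u}\right) = \mathbf{R}^{-T}\mathbf{f}$$
Algorithm
  $\mathbf{r} \leftarrow \mathbf{f} - \mathbf{K}\mathbf{u}$
  for $i = 1, 2, \ldots$ until convergence do
    solve $\mathbf{M}\mathbf{z} \leftarrow \mathbf{r}$
    $\rho_i \leftarrow \mathbf{r}^{T}\mathbf{z}$
    if $i == 1$ then
      $\mathbf{p} \leftarrow \mathbf{z}$
    else
      $\beta \leftarrow \rho_i / \rho_{i-1}$
      $\mathbf{p} \leftarrow \mathbf{z} + \beta\mathbf{p}$
    end if
    $\mathbf{q} \leftarrow \mathbf{K}\mathbf{p}$
    $\alpha \leftarrow \rho_i / (\mathbf{p}^{T}\mathbf{q})$
    $\mathbf{u} \leftarrow \mathbf{u} + \alpha\mathbf{p}$
    $\mathbf{r} \leftarrow \mathbf{r} - \alpha\mathbf{q}$
  end for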
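The preconditioned conjugate gradient loop on this slide can be traced in a small CPU-only Python sketch. The assumptions are substantial: matrices are dense lists of lists for brevity, the incomplete Cholesky variant is taken to be zero fill-in (the slide does not say which variant was used), the initial guess is zero, and laplacian_2d is a hypothetical SPD test matrix, not an FEM stiffness matrix from the talk. Here L plays the role of R^T, so M = L L^T.

```python
import math

def ichol0(K):
    """Incomplete Cholesky with zero fill-in: K ~= L L^T, L keeps K's lower sparsity."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = K[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(pivot)  # raises ValueError if the factorization breaks down
        for i in range(j + 1, n):
            if K[i][j] != 0.0:      # zero fill-in: skip positions where K is zero
                s = sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def solve_m(L, r):
    """Solve M z = r with M = L L^T by forward, then backward substitution."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):
        y[i] = (r[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z

def matvec(K, x):
    return [sum(a * b for a, b in zip(row, x)) for row in K]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg(K, f, tol=1e-10, max_iter=200):
    """Preconditioned CG for K u = f; returns (u, iterations used)."""
    L = ichol0(K)
    u = [0.0] * len(f)                                   # zero initial guess
    r = [fi - ki for fi, ki in zip(f, matvec(K, u))]     # r <- f - K u
    f_norm = math.sqrt(dot(f, f)) or 1.0
    p, rho_prev = [], 0.0
    for i in range(1, max_iter + 1):
        if math.sqrt(dot(r, r)) <= tol * f_norm:
            return u, i - 1
        z = solve_m(L, r)                                # solve M z <- r
        rho = dot(r, z)
        if i == 1:
            p = z[:]
        else:
            beta = rho / rho_prev
            p = [zi + beta * pi for zi, pi in zip(z, p)]
        q = matvec(K, p)
        alpha = rho / dot(p, q)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_prev = rho
    return u, max_iter

def laplacian_2d(m):
    """Dense 5-point Laplacian on an m-by-m grid (hypothetical SPD test matrix)."""
    n = m * m
    K = [[0.0] * n for _ in range(n)]
    for a in range(m):
        for b in range(m):
            i = a * m + b
            K[i][i] = 4.0
            if a > 0: K[i][i - m] = -1.0
            if a < m - 1: K[i][i + m] = -1.0
            if b > 0: K[i][i - 1] = -1.0
            if b < m - 1: K[i][i + 1] = -1.0
    return K

K = laplacian_2d(3)
u_true = [float(i + 1) for i in range(9)]
f = matvec(K, u_true)
u, iters = pcg(K, f)
```

A practical solver would keep K and L in a sparse format so that each matrix-vector product and triangular solve costs time proportional to the number of nonzeros, which is where a GPU sparse library would come in.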
12
Solver Results
[Figure: log-log plot of CPU time to solve Ku=f [s] (10^-4 to 10^2) versus number of DOFs (10^2 to 10^7), comparing the CPU Implementation and the GPU Implementation]
13
Conclusion
GPU computing in Julia shows promise of speeding up FEM matrix assembly and solve routines
  Potentially greater gains to be made with higher order 2D/3D elements
Minimize data transfer to the GPU needed to assemble FEM matrices
  Keeping code vectorized helps
  Removing any temporary data copies on GPU
ArrayFire.jl should (and hopefully soon will) support sparse matrix assembly and arithmetic
  Open issue on GitHub since late September, 2016
CUSPARSE.jl should (and hopefully soon will) wrap additional functions
  COO matrix constructor and iterative solvers would be especially useful
Large impact on optimization problems where matrix assembly routines and solvers are called many times