Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Fine-grained Jacobian Filling in INCOMP3D
Lixiang (Eric) Luo, Jack Edwards, Hong Luo
Department of Mechanical and Aerospace Engineering
Frank Mueller
Department of Computer Science
North Carolina State University

Recent Publications
- L. Luo, J. R. Edwards, H. Luo, F. Mueller, “A fine-grained block ILU scheme on regular structures for GPGPU,” Computers & Fluids, Vol. 119.
- L. Luo, J. R. Edwards, H. Luo, F. Mueller, “Fine-grained Optimizations of Implicit Algorithms in An Incompressible Navier-Stokes Solver on GPGPU,” AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX.

LHS Filling in INCOMP3D
[Figure: block structure of the LHS matrix. The row for grid location (i,j,k) holds the seven blocks α, β, γ, δ, ε, ζ, η, all with subscript i,j,k. The first row holds δ, ε, ζ, η at 1,1,1; the last row holds α, β, γ, δ at I,J,K.]

Challenges: LHS Filling in INCOMP3D
- Two primary components in LHS filling
  – AFILL: inviscid flux Jacobian
  – TSD: viscous flux Jacobian and time derivative linearization
- LHS filling is heavily memory-bound
  – Before optimization: one GPU thread per grid location
  – Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
- Test case: URANS, 15M grid, 204 blocks, 2000 steps
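
The data volume quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, the 8-byte (double precision) value size is an assumption not stated on the slide, and storing the three spatial directions separately is also an assumption.

```python
def lhs_values_per_point(block_dim=6, blocks_per_direction=3, directions=1):
    """Number of matrix entries written for one grid location.

    Defaults follow the slide: 6x6 blocks, 3 blocks per grid location
    per spatial direction. `directions` is 1 for a single sweep direction.
    """
    return block_dim * block_dim * blocks_per_direction * directions


def lhs_bytes_per_point(bytes_per_value=8, **kwargs):
    """Bytes written per grid location; 8 bytes per value is an assumption."""
    return bytes_per_value * lhs_values_per_point(**kwargs)


if __name__ == "__main__":
    print(lhs_values_per_point())              # entries for one direction
    print(lhs_values_per_point(directions=3))  # entries if all three are stored
    print(lhs_bytes_per_point(directions=3))   # bytes under the assumptions above
```

With one thread per grid location, each thread would have to write all of these entries itself, which is one way to see why the coarse-grained version is bound by output traffic.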

Challenges of LHS Filling
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
- FGBILU
  – Output = input
  – Memory-bound due to overall data amount
  – Homogeneous: fine-grained algorithms do not cause branching
- LHS
  – Output >> input
  – Memory-bound due to output data amount
  – Inhomogeneous part: coefficient calculations cause branching
  – Homogeneous part: matrix filling is highly homogeneous

Optimization Strategy for FGBILU
[Figure: two data-flow diagrams, each showing Input Data, Computation, and Output Data. Panels: "Coarse-grained" and "Fully Fine-grained".]

Two Steps of LHS Filling
- Step 1: calculation of common coefficients
  – Inhomogeneous: different coefficients are determined by different mathematical expressions
  – To ensure reasonable data locality, this step must be carried out in coarse grain: one thread per grid location
- Step 2: filling of submatrix blocks
  – Highly homogeneous
  – All elements are calculated from the common coefficients and geometry data
  – This step can be carried out in fine grain

A Fully Fine-grained Scheme Is Not Suitable
[Figure: data-flow diagrams (Input Data, Computation, Output Data) for the "Coarse-grained" and "Fully fine-grained" schemes. Annotations: "Too much branching in Step 1" and "Memory bound".]

2-step Mixed-grained Approach
- Ideally, granularity would change within one kernel
  – Dynamic Parallelism attempts to address this, but is probably not efficient for LHS filling: too few child threads per grid location
- A two-step approach
  – Coarse-grained Step 1: Input Data → Computation → Common Coefficients
  – Fine-grained Step 2: Input Data + Common Coefficients → Output Data
  – Parallel data reading: no bottleneck
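
The two-step structure can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the INCOMP3D kernels: the loops stand in for GPU threads, the function names are hypothetical, and the coefficient and fill formulas are made-up placeholders chosen only so that Step 1 branches and Step 2 does not.

```python
BLOCK = 6  # block dimension for RANS, as given in the slides


def step1_coefficients(state):
    """Coarse grain: one loop iteration stands in for one thread per grid location."""
    coeffs = []
    for u in state:
        # Placeholder expressions; branching is tolerated in this step.
        c0 = u if u > 0.0 else 0.0
        c1 = -u if u < 0.0 else 0.0
        coeffs.append((c0, c1))
    return coeffs  # intermediate "common coefficients" storage


def step2_fill(coeffs, geom):
    """Fine grain: one loop iteration stands in for one thread per matrix element."""
    out = [0.0] * (len(coeffs) * BLOCK * BLOCK)
    for tid in range(len(out)):
        point, elem = divmod(tid, BLOCK * BLOCK)  # which grid location
        row, col = divmod(elem, BLOCK)            # which entry of its block
        c0, c1 = coeffs[point]
        # Same placeholder formula for every element: no divergent branch.
        out[tid] = geom[point] * (c0 * (row == col) + c1)
    return out


if __name__ == "__main__":
    blocks = step2_fill(step1_coefficients([2.0, -1.0]), geom=[1.0, 0.5])
    print(len(blocks), blocks[0], blocks[36])
```

In this sketch every Step 2 "thread" reads the same small coefficient tuple for its grid location and writes exactly one entry, so consecutive thread indices touch consecutive output addresses.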

Further Optimization Techniques
- Avoid unrolled private arrays
  – Instead, use existing global arrays to store intermediate results
- Merge spatial directions
  – Increases concurrent work by three times
  – Improves data locality by reusing shared data within a grid location
  – Odd-even coloring scheme is no longer necessary
- Replace long branches with short branches
  – Short branches may be compiled into predicated operations on the GPU, which incur no branching penalty
- Replace mathematical branches with logical coefficients
  – Avoids branching
  – May also be compiled into predicated operations
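
As an illustration of replacing a mathematical branch with a logical coefficient, the sketch below rewrites a simple upwind-style selection in both forms. The selection itself and the function names are hypothetical examples, not expressions taken from INCOMP3D.

```python
def upwind_branch(u, left, right):
    """Branching form: pick the left or right value based on the sign of u."""
    if u > 0.0:
        return u * left
    return u * right


def upwind_logical(u, left, right):
    """Branch-free form: the comparison becomes a 0/1 coefficient."""
    s = float(u > 0.0)  # 1.0 when u > 0, otherwise 0.0
    return u * (s * left + (1.0 - s) * right)


if __name__ == "__main__":
    for u in (-2.0, 0.0, 3.0):
        print(u, upwind_branch(u, 1.5, 4.0), upwind_logical(u, 1.5, 4.0))
```

The arithmetic form evaluates both operands every time, so it is only a safe substitute when both are finite and cheap to compute.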

Preliminary Results
- The new strategy significantly improves the performance of the LHS filling subroutines
  – AFILL reaches a 14.5X speedup and TSD a 6.3X speedup (TSD was less memory-bound originally)
  – Block sizes are small in this test case, so the speedup numbers are far from optimal
- Data transfer (not data packing) is now the bottleneck
- Test case: URANS, 3M grid, 128 blocks, 200 steps

Upcoming Tasks
- High-order extension of RHS
  – By adopting multiple-step computation and intermediate storage, data contention can be avoided
  – Since the coloring scheme is no longer needed, high-order schemes become much more tractable
- Improve performance of the L2 norm calculation
  – Classic sum reduction, currently consuming 45% of the RHS run time
- Improve MPI data transfer performance
  – Masking: concurrent MPI transfer and computation
- Use CPU memory to store LHS matrices
  – Can potentially allow many more blocks per GPU
  – Masking: concurrent GPU-CPU transfers and kernel executions
- Run large-scale simulations with INCOMP3D
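
For reference, the classic sum reduction behind an L2 norm can be sketched as a pairwise (tree) reduction. This is a CPU emulation with hypothetical function names; the inner loop stands in for the threads of one block working on a shared buffer, and it says nothing about how INCOMP3D currently implements the reduction.

```python
import math


def tree_sum(values):
    """Pairwise (tree) sum: the active range halves on every pass."""
    buf = list(values)
    n = 1
    while n < len(buf):
        n *= 2
    buf.extend([0.0] * (n - len(buf)))  # pad to a power of two
    stride = n // 2
    while stride > 0:
        # Each tid below would be one thread; a barrier would follow each pass.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
    return buf[0]


def l2_norm(residual):
    """L2 norm as the square root of a tree-reduced sum of squares."""
    return math.sqrt(tree_sum(r * r for r in residual))


if __name__ == "__main__":
    print(l2_norm([3.0, 4.0]))
```

The reduction needs about log2(n) passes, and the number of active threads halves on each pass, which is one reason this step tends to use the GPU poorly compared with the filling kernels.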