Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Lixiang (Eric) Luo, Jack Edwards, Hong Luo (Department of Mechanical and Aerospace Engineering)
Frank Mueller (Department of Computer Science)
North Carolina State University
Recent Publications
L. Luo, J. R. Edwards, H. Luo, F. Mueller, "A fine-grained block ILU scheme on regular structures for GPGPU," Computers & Fluids, Vol. 119, pp. 149-161, 2015.
L. Luo, J. R. Edwards, H. Luo, F. Mueller, "Fine-grained Optimizations of Implicit Algorithms in an Incompressible Navier-Stokes Solver on GPGPU," AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX, 2015.
LHS Filling in INCOMP3D
[Figure: block-banded structure of the LHS matrix. Each grid location (i,j,k) contributes seven submatrix blocks, labeled α, β, γ, δ, ε, ζ, η, with δ on the main diagonal and the other six coupling to neighboring locations in the three spatial directions.]
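For concreteness, below is a minimal sketch of how such a seven-diagonal block LHS could be laid out in GPU memory, one dense block per diagonal per grid location. The struct, array names, and indexing are illustrative assumptions, not INCOMP3D's actual data structures; the 6×6 block size corresponds to the RANS case described on the next slide.

// Illustrative layout for a seven-diagonal block LHS on an I x J x K grid.
// BS = block size (6x6 for the RANS case); all names are hypothetical.
#define BS 6
#define NDIAG 7   // alpha, beta, gamma, delta (diagonal), epsilon, zeta, eta

struct BlockLHS {
    int I, J, K;        // grid dimensions
    double* blocks;     // I*J*K * NDIAG * BS*BS entries, one contiguous allocation
};

// Flat index of entry (r,c) of diagonal d at grid location (i,j,k).
__host__ __device__ inline
size_t lhsIndex(const BlockLHS& A, int i, int j, int k, int d, int r, int c)
{
    size_t cell = ((size_t)k * A.J + j) * A.I + i;   // linearized grid location
    return ((cell * NDIAG + d) * BS + r) * BS + c;   // entry within that cell
}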
Challenges of LHS Filling in INCOMP3D
Two primary components in LHS filling:
– AFILL: inviscid flux Jacobian
– TSD: viscous flux Jacobian and time-derivative linearization
LHS filling is heavily memory-bound:
– Before optimization: one GPU thread per grid location
– Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
Test case: URANS, 15M grid, 204 blocks, 2000 steps
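To make the data volume concrete (assuming double-precision storage, which the slides do not state explicitly): 3 blocks × 36 entries × 8 bytes ≈ 0.84 KB per grid location per spatial direction, roughly 2.5 KB per location across all three directions, and on the order of 40 GB of Jacobian data for the 15M-point test case. Writing that much data dwarfs the cost of the arithmetic, which is why LHS filling is memory-bound.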
Challenges of LHS Filling
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
FGBILU:
– Output = input; memory-bound due to the overall data amount
– Homogeneous: fine-grained algorithms do not cause branching
LHS filling:
– Output >> input; memory-bound due to the output data amount
– Inhomogeneous part: coefficient calculations cause branching
– Homogeneous part: matrix filling is highly homogeneous
Optimization Strategy for FGBILU
[Diagram: comparison of the coarse-grained and fully fine-grained schemes, each shown as a flow from input data through computation to output data.]
Two Steps of LHS Filling
Step 1: calculation of common coefficients
– Inhomogeneous: different coefficients are determined by different mathematical expressions
– To ensure reasonable data locality, this step must be carried out in coarse grain: one thread per grid location
Step 2: filling of submatrix blocks
– Highly homogeneous
– All elements are calculated based on the common coefficients and geometry data
– This step can be carried out in fine grain
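A minimal CUDA-style sketch of this two-kernel split is given below. The slides do not show INCOMP3D's actual kernels, so the kernel names, argument lists, and the per-element work are illustrative assumptions; the point is the change of granularity between the two steps.

// Step 1 (coarse grain): one thread per grid location computes the common
// coefficients.  The expressions differ between coefficients, so this kernel
// contains the inhomogeneous, branch-prone work.
__global__ void computeCommonCoeffs(const double* state, const double* geom,
                                    double* coeffs, int ncells, int ncoeff)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncells) return;
    for (int c = 0; c < ncoeff; ++c) {
        // placeholder for the (inhomogeneous) coefficient expressions
        coeffs[cell * ncoeff + c] = state[cell] * geom[cell] + c;
    }
}

// Step 2 (fine grain): one thread per matrix entry fills the submatrix blocks
// from the precomputed coefficients.  The work is homogeneous, so all threads
// follow the same path.
__global__ void fillBlocks(const double* coeffs, const double* geom,
                           double* blocks, int ncells, int ncoeff,
                           int nblk, int bs)
{
    int elemsPerCell = nblk * bs * bs;
    long long tid = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (tid >= (long long)ncells * elemsPerCell) return;
    int cell = (int)(tid / elemsPerCell);
    int e    = (int)(tid % elemsPerCell);   // which block entry within the cell
    // placeholder: every entry is a homogeneous function of the coefficients
    blocks[tid] = coeffs[cell * ncoeff + e % ncoeff] * geom[cell];
}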
A Fully Fine-grained Scheme Is Not Suitable
[Diagram: coarse-grained vs. fully fine-grained data flow, from input data through computation to output data. A fully fine-grained scheme suffers too much branching in Step 1 and remains memory-bound.]
2-Step Mixed-grained Approach
A two-step approach: a coarse-grained Step 1 computes the common coefficients, and a fine-grained Step 2 reads them in parallel (no bottleneck) to fill the output data.
Ideally, granularity would change within one kernel:
– Dynamic Parallelism attempts to address this, but is probably not efficient for LHS filling: too few child threads per grid location.
[Diagram: input data feeds the coarse-grained Step 1, which produces the common coefficients; the fine-grained Step 2 reads them in parallel and writes the output data.]
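The host-side driver for the sketch above would launch the two kernels with different granularities rather than relying on Dynamic Parallelism; again this is an illustrative sketch, not the actual INCOMP3D driver. Step 1 uses one thread per grid location, Step 2 one thread per block entry.

void fillLHS(const double* d_state, const double* d_geom,
             double* d_coeffs, double* d_blocks,
             int ncells, int ncoeff, int nblk, int bs)
{
    const int tpb = 256;

    // Step 1: coarse grain, ncells threads in total
    int grid1 = (ncells + tpb - 1) / tpb;
    computeCommonCoeffs<<<grid1, tpb>>>(d_state, d_geom, d_coeffs, ncells, ncoeff);

    // Step 2: fine grain, ncells * nblk * bs * bs threads in total
    long long nelem = (long long)ncells * nblk * bs * bs;
    int grid2 = (int)((nelem + tpb - 1) / tpb);
    fillBlocks<<<grid2, tpb>>>(d_coeffs, d_geom, d_blocks,
                               ncells, ncoeff, nblk, bs);
}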
Further Optimization Techniques
Avoid unrolled private arrays
– Instead, use existing global arrays to store intermediate results
Merge spatial directions
– Increases concurrent work threefold
– Improves data locality by reusing shared data within a grid location
– The odd-even coloring scheme is no longer necessary
Replace long branches with short branches
– Short branches may be compiled into predicated operations on the GPU, which do not incur a branching penalty
Replace mathematical branches with logical coefficients
– Avoids branching; may also be compiled into predicated operations
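As an illustration of the last two techniques (a generic upwind-selection example, not code from INCOMP3D), a data-dependent branch can be replaced by a 0/1 logical coefficient so that every thread executes the same straight-line arithmetic:

// Branchy version: threads within a warp may diverge on the sign test.
__device__ double upwindFluxBranch(double a, double fL, double fR)
{
    if (a > 0.0) return a * fL;
    else         return a * fR;
}

// Logical-coefficient version: s is 1.0 when a > 0 and 0.0 otherwise, so the
// selection becomes arithmetic the compiler can turn into predicated operations.
__device__ double upwindFluxLogical(double a, double fL, double fR)
{
    double s = (double)(a > 0.0);
    return a * (s * fL + (1.0 - s) * fR);
}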
Preliminary Results
The new strategy significantly improves the performance of the LHS filling subroutines:
– AFILL reaches a 14.5X speedup and TSD a 6.3X speedup (TSD was less memory-bound originally)
– Block sizes are small in this test case, so the speedup numbers are far from optimal
Data transfer (not data packing) is now the bottleneck.
Test case: URANS, 3M grid, 128 blocks, 200 steps
Upcoming Tasks
High-order extension of the RHS
– By adopting multi-step computation with intermediate storage, data contention can be avoided. Since a coloring scheme is no longer needed, high-order schemes become much more tractable.
Improve the performance of the L2 norm calculation
– A classic sum reduction; it currently consumes 45% of the RHS run time
Improve MPI data transfer performance
– Masking: overlap MPI transfers with computation
Use CPU memory to store LHS matrices
– Can potentially allow many more blocks per GPU
– Masking: overlap GPU-CPU transfers with kernel executions
Run large-scale simulations with INCOMP3D
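A sketch of the "masking" idea for the last two items follows. It is purely illustrative (the slides do not show INCOMP3D's implementation): asynchronous copies on a separate CUDA stream let one grid block's LHS stream to pinned host memory while the kernels for the next block run.

// Overlap device-to-host copies of each grid block's LHS with computation on
// the following block.  True overlap requires pinned (page-locked) host memory.
void streamLHSToHost(double* const* d_lhs, double* const* h_lhs_pinned,
                     const size_t* bytes, int nblocks,
                     cudaStream_t compute, cudaStream_t copy)
{
    for (int b = 0; b < nblocks; ++b) {
        // ... launch the LHS-filling kernels for grid block b on `compute` ...
        cudaStreamSynchronize(compute);   // block b's LHS is now complete
        // copy block b to host while block b+1 is being computed
        cudaMemcpyAsync(h_lhs_pinned[b], d_lhs[b], bytes[b],
                        cudaMemcpyDeviceToHost, copy);
    }
    cudaStreamSynchronize(copy);          // wait for the last transfer
}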