Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D July 20, 2015 Lixiang (Eric) Luo, Jack Edwards,


1 Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Lixiang (Eric) Luo, Jack Edwards, Hong Luo, Department of Mechanical and Aerospace Engineering
Frank Mueller, Department of Computer Science
North Carolina State University

2 Recent Publications
L. Luo, J. R. Edwards, H. Luo, F. Mueller, “A fine-grained block ILU Scheme on regular structures for GPGPU,” Computers & Fluids, Vol. 119, pp. 149-161, 2015.
L. Luo, J. R. Edwards, H. Luo, F. Mueller, “Fine-grained Optimizations of Implicit Algorithms in An Incompressible Navier-Stokes Solver on GPGPU,” AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX, 2015.

3 LHS Filling in INCOMP3D
[Figure: block structure of the LHS matrix. Each grid location (i,j,k) carries seven coefficient blocks, α, β, γ, δ, ε, ζ and η, shown for grid indices running from (1,1,1) to (I,J,K).]

4 Challenges LHS Filling in INCOMP3D
Two primary components in LHS filling
  – AFILL: inviscid flux Jacobian
  – TSD: viscous flux Jacobian and time derivative linearization
LHS filling is heavily memory-bound
  – Before optimization: one GPU thread per grid location
  – Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
Test case: URANS, 15M grid, 204 blocks, 2000 steps
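To put the data volume per grid location in numbers, here is a rough count. The 6×6 block size and the 3 blocks per direction come from the slide; the three spatial directions and the 8-byte double-precision storage are assumptions made here for illustration, and the function names are hypothetical.

```python
# Rough estimate of the LHS output volume (a sketch, not INCOMP3D code).
BLOCK_DIM = 6              # from the slide: each RANS block is 6x6
BLOCKS_PER_DIRECTION = 3   # from the slide: 3 blocks per grid location per direction
NUM_DIRECTIONS = 3         # assumption: three spatial directions
BYTES_PER_VALUE = 8        # assumption: double-precision storage


def lhs_values_per_location():
    """Number of matrix entries written for one grid location."""
    return BLOCK_DIM * BLOCK_DIM * BLOCKS_PER_DIRECTION * NUM_DIRECTIONS


def lhs_bytes(num_locations):
    """Total bytes written for a grid with num_locations points."""
    return num_locations * lhs_values_per_location() * BYTES_PER_VALUE


print(lhs_values_per_location())        # entries per grid location
print(lhs_bytes(15_000_000) / 1e9)      # gigabytes for a 15M-point grid
```

Under these assumptions a single coarse-grained thread would write a few hundred values, which is consistent with the slide's point that the output volume, not the arithmetic, dominates.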

5 Challenges of LHS Filling
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
FGBILU
  – Output = input
  – Memory-bound due to overall data amount
  – Homogeneous: fine-grained algorithms do not cause branching
LHS
  – Output >> input
  – Memory-bound due to output data amount
  – Inhomogeneous part: coefficient calculations cause branching
  – Homogeneous part: matrix filling is highly homogeneous

6 Optimization Strategy for FGBILU
[Diagram: input data, computation and output data under a coarse-grained mapping, compared with a fully fine-grained mapping.]

7 Two Steps of LHS Filling
Step 1: calculations of common coefficients
  – Inhomogeneous: different coefficients are determined by different mathematical expressions
  – To ensure reasonable data locality, this step must be carried out in coarse grain: one thread per grid location
Step 2: filling of submatrix blocks
  – Highly homogeneous
  – All elements are calculated based on the common coefficients and geometry data
  – This step can be carried out in fine grain
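The two steps can be sketched as follows, with plain Python loops standing in for GPU threads. This is a minimal illustration only: the coefficient formulas, the function names and the single 6×6 block per location are hypothetical and are not taken from INCOMP3D.

```python
# Two-step sketch: coarse-grained coefficients, then fine-grained filling.
BLOCK_DIM = 6  # block size from the slides (RANS)


def step1_common_coefficients(u):
    """Coarse grain: one loop iteration plays the role of one thread per
    grid location. The formulas are hypothetical; they only show that each
    coefficient follows a different expression."""
    coeffs = []
    for ui in u:
        c0 = ui if ui > 0.0 else 0.0   # one expression, with a branch
        c1 = ui * ui + 1.0             # a different expression
        coeffs.append((c0, c1))
    return coeffs


def step2_fill_blocks(coeffs, geom):
    """Fine grain: one loop iteration plays the role of one thread per
    matrix element. Every element uses the same formula."""
    elems = BLOCK_DIM * BLOCK_DIM
    out = [0.0] * (len(coeffs) * elems)
    for tid in range(len(out)):
        loc, elem = divmod(tid, elems)       # which grid location
        row, col = divmod(elem, BLOCK_DIM)   # which entry of its block
        c0, c1 = coeffs[loc]
        # (row == col) acts as a 0/1 factor instead of an if/else
        out[tid] = c0 * geom[loc] + c1 * (row == col)
    return out


coeffs = step1_common_coefficients([2.0, -1.0])
blocks = step2_fill_blocks(coeffs, geom=[0.5, 3.0])
print(len(blocks), blocks[0], blocks[1])
```

In the sketch the intermediate list `coeffs` plays the role of the common coefficients passed from Step 1 to Step 2; on a GPU the two functions would correspond to two separate kernel launches with different thread counts.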

8 A Fully Fine-grained Scheme Not Suitable
[Diagram: input data, computation and output data under the coarse-grained and the fully fine-grained mappings. Annotations: too much branching in Step 1; memory bound.]

9 2-step Mixed-grained Approach
Ideally, granularity would change within one kernel
  – Dynamic Parallelism attempts to address this, but is probably not efficient for LHS filling: too few child threads per grid location.
A two-step approach
[Diagram: coarse-grained Step 1 takes input data through computation to common coefficients; fine-grained Step 2 takes input data and common coefficients to output data. Annotation: parallel data reading, no bottleneck.]

10 Further Optimization Techniques
Avoid unrolled private arrays
  – Instead, use existing global arrays to store intermediate results
Merge spatial directions
  – Increases concurrent work by three times
  – Improves data locality by reusing shared data within a grid location
  – Odd-even coloring scheme is no longer necessary
Replace long branches with short branches
  – May be compiled into predicate operations on GPU, which do not incur a branching penalty
Replace mathematical branches with logical coefficients
  – Avoids branching
  – May also be compiled into predicate operations
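As an illustration of replacing a mathematical branch with a logical coefficient, the sketch below writes the same sign-dependent selection in both forms. The upwind-style expression and the function names are hypothetical examples, not formulas from INCOMP3D.

```python
# Branching form versus "logical coefficient" form (illustrative sketch).
def select_with_branch(u, left, right):
    """Mathematical branch: the sign of u decides which side is used."""
    if u > 0.0:
        return u * left
    else:
        return u * right


def select_with_logical_coefficient(u, left, right):
    """The comparison becomes a 0/1 coefficient; both sides are always
    evaluated and blended, so there is no if/else on the data."""
    s = float(u > 0.0)   # 1.0 when u > 0, otherwise 0.0
    return u * (s * left + (1.0 - s) * right)


for u in (-2.0, 0.0, 3.0):
    print(u, select_with_branch(u, 1.5, 4.0),
          select_with_logical_coefficient(u, 1.5, 4.0))
```

The two forms agree for finite inputs. The blended form evaluates both sides every time, so it trades a little extra arithmetic for uniform control flow, and it would behave differently if the unused side were infinite or NaN.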

11 Preliminary Results
The new strategy significantly improves performance of LHS filling subroutines
  – AFILL reaches 14.5X speedup, and TSD reaches 6.3X speedup (TSD is less memory-bound originally)
  – Block sizes are small in this test case, so speedup numbers are far from optimal
Data transfer (not data packing) is now the bottleneck
Test case: URANS, 3M grid, 128 blocks, 200 steps

12 Upcoming Tasks
High-order extension of RHS
  – By adopting multiple-step computation and intermediate storage, data contingency can be avoided. Since the coloring scheme is no longer needed, high-order schemes become much more tractable.
Improve performance of L2 norm calculation
  – Classic sum reduction, currently consumes 45% of the RHS run time
Improve MPI data transfer performance
  – Masking: concurrent MPI transfer and computation
Use CPU memory to store LHS matrices
  – Can potentially allow many more blocks per GPU
  – Masking: concurrent GPU-CPU transfers and kernel executions
Run large-scale simulations with INCOMP3D
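For reference, the classic sum reduction behind an L2 norm can be sketched as a pairwise (tree) reduction. The slides name only the pattern, so this is the textbook form written as sequential Python, not the INCOMP3D implementation; the function name is hypothetical.

```python
import math


def l2_norm_tree(values):
    """L2 norm via a pairwise (tree) sum of squares.
    Each pass halves the active range; on a GPU the inner loop
    iterations of one pass would run concurrently."""
    buf = [v * v for v in values]
    n = len(buf)
    # Pad with zeros to a power of two so every pass pairs elements evenly.
    size = 1
    while size < n:
        size *= 2
    buf.extend([0.0] * (size - n))
    stride = size // 2
    while stride > 0:
        for i in range(stride):
            buf[i] += buf[i + stride]
        stride //= 2
    return math.sqrt(buf[0])


print(l2_norm_tree([3.0, 4.0]))
print(l2_norm_tree([1.0, 2.0, 2.0]))
```

The number of passes grows with the logarithm of the input length, and each pass leaves half of the participating threads idle, which is one reason a plain reduction can take a noticeable share of run time.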

