Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D, July 20, 2015


Fine-grained Adoption of Jacobian Matrix Filling in INCOMP3D
July 20, 2015
Lixiang (Eric) Luo, Jack Edwards, Hong Luo, Department of Mechanical and Aerospace Engineering; Frank Mueller, Department of Computer Science; North Carolina State University

Recent Publications
L. Luo, J. R. Edwards, H. Luo, F. Mueller, "A fine-grained block ILU scheme on regular structures for GPGPU," Computers & Fluids, Vol. 119.
L. Luo, J. R. Edwards, H. Luo, F. Mueller, "Fine-grained Optimizations of Implicit Algorithms in an Incompressible Navier-Stokes Solver on GPGPU," AIAA Aviation and Aeronautics Forum and Exposition, Dallas, TX.

LHS Filling in INCOMP3D
[Figure: structure of the LHS block matrix. Each grid location (i, j, k) contributes seven submatrix blocks α, β, γ, δ, ε, ζ, η, with δ on the main diagonal and the α/η, β/ζ, γ/ε pairs coupling the neighboring locations in the three spatial directions.]

Challenges of LHS Filling in INCOMP3D
Two primary components in LHS filling:
– AFILL: inviscid flux Jacobian
– TSD: viscous flux Jacobian and time-derivative linearization
LHS filling is heavily memory-bound:
– Before optimization: one GPU thread per grid location
– Large amount of data per grid location: for RANS, each block is 6×6, with 3 blocks per grid location per spatial direction
Test case: URANS, 15M grid, 204 blocks, 2000 steps
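A back-of-the-envelope estimate makes the memory-bound claim concrete. The sketch below uses only the figures on this slide (6×6 blocks, 3 blocks per grid location per spatial direction); double precision and 3 spatial directions are my assumptions, not stated here.

```c
/* Rough LHS storage per grid location, from the figures above:
   6x6 blocks, 3 blocks per location per spatial direction.
   Double precision and 3 directions are assumptions. */
long lhs_bytes_per_cell(void) {
    const long block_entries  = 6 * 6;   /* entries in one Jacobian block */
    const long blocks_per_cell = 3 * 3;  /* 3 per direction x 3 directions */
    return blocks_per_cell * block_entries * (long)sizeof(double);
}
```

Under these assumptions, each grid location produces 2592 B of LHS output, or roughly 39 GB for the 15M-grid test case: an enormous write volume per unit of arithmetic, which is why the kernel is memory-bound.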

Challenges of LHS Filling
Though memory-bound like FGBILU factorization, LHS filling poses unique challenges.
FGBILU:
– Output = input
– Memory-bound due to the overall data amount
– Homogeneous: fine-grained algorithms do not cause branching
LHS:
– Output >> input
– Memory-bound due to the output data amount
– Inhomogeneous part: coefficient calculations cause branching
– Homogeneous part: matrix filling is highly homogeneous

Optimization Strategy for FGBILU
[Figure: coarse-grained vs. fully fine-grained mapping of computation between input data and output data.]

Two Steps of LHS Filling
Step 1: calculation of common coefficients
– Inhomogeneous: different coefficients are determined by different mathematical expressions
– To ensure reasonable data locality, this step must be carried out at coarse grain: one thread per grid location
Step 2: filling of submatrix blocks
– Highly homogeneous
– All elements are calculated from the common coefficients and geometry data
– This step can be carried out at fine grain

A Fully Fine-grained Scheme Is Not Suitable
[Figure: coarse-grained vs. fully fine-grained data flow between input and output.]
– Too much branching in Step 1
– Memory-bound

2-step Mixed-grained Approach
A two-step approach: a coarse-grained Step 1 computes the common coefficients from the input data; a fine-grained Step 2 reads them in parallel (no bottleneck) and fills the output data.
Ideally, granularity would change within one kernel:
– Dynamic Parallelism attempts to address this, but is probably not efficient for LHS filling: too few child threads per grid location.
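The two-step structure can be sketched as two loops standing in for two CUDA kernels. This is a minimal serial sketch, not INCOMP3D's actual code: `geom`, `coef`, `lhs`, and the formulas are placeholders; on the GPU, each inner iteration would be one thread.

```c
/* Serial sketch of the mixed-grained two-step LHS filling. */
enum { NCELL = 8, NCOEF = 4, BLK = 6 * 6 };

/* Step 1 (coarse grain): one "thread" per grid location computes the
   common coefficients. This part is inhomogeneous (branchy). */
void step1_coefficients(const double *geom, double *coef) {
    for (int c = 0; c < NCELL; ++c)          /* one thread per cell on GPU */
        for (int n = 0; n < NCOEF; ++n)
            coef[c * NCOEF + n] = (geom[c] > 0.0)
                ? geom[c] * (n + 1)          /* placeholder branch */
                : -geom[c];
}

/* Step 2 (fine grain): one "thread" per matrix element fills the blocks
   from the shared coefficients. This part is homogeneous (no branching). */
void step2_fill(const double *coef, double *lhs) {
    for (int c = 0; c < NCELL; ++c)
        for (int e = 0; e < BLK; ++e)        /* one thread per element on GPU */
            lhs[c * BLK + e] = coef[c * NCOEF + e % NCOEF];
}
```

The key design point is the intermediate `coef` array: Step 1 writes it once per cell at coarse grain, and Step 2's many fine-grained threads read it concurrently, so the branchy work never runs at fine grain.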

Further Optimization Techniques
Avoid unrolled private arrays
– Instead, use existing global arrays to store intermediate results
Merge spatial directions
– Increases concurrent work threefold
– Improves data locality by reusing shared data within a grid location
– The odd-even coloring scheme is no longer necessary
Replace long branches with short branches
– Short branches may be compiled into predicated operations on the GPU, which do not incur a branching penalty
Replace mathematical branches with logical coefficients
– Avoids branching
– May also be compiled into predicated operations
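The logical-coefficient technique can be shown with a toy upwind-style switch (a hypothetical example, not INCOMP3D's coefficient code): the comparison result becomes a 0/1 multiplier, so divergent threads in a warp execute one identical instruction stream.

```c
/* Branching version: threads in a warp taking different sides diverge. */
double flux_branchy(double u, double qL, double qR) {
    if (u >= 0.0)
        return u * qL;   /* upwind from the left  */
    else
        return u * qR;   /* upwind from the right */
}

/* Branchless version: a 0/1 logical coefficient selects the operand,
   so every thread executes the same arithmetic. */
double flux_branchless(double u, double qL, double qR) {
    double s = (double)(u >= 0.0);          /* logical coefficient: 1 or 0 */
    return u * (s * qL + (1.0 - s) * qR);
}
```

Both functions return identical values; the second trades a branch for two extra multiplies and an add, a good trade on SIMT hardware where divergence serializes the warp.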

Preliminary Results
The new strategy significantly improves the performance of the LHS filling subroutines
– AFILL reaches a 14.5× speedup and TSD a 6.3× speedup (TSD is less memory-bound to begin with)
– Block sizes are small in this test case, so the speedup numbers are far from optimal
Data transfer (not data packing) is now the bottleneck
Test case: URANS, 3M grid, 128 blocks, 200 steps

Upcoming Tasks
High-order extension of RHS
– By adopting multi-step computation and intermediate storage, data contention can be avoided. Since the coloring scheme is no longer needed, high-order schemes become much more tractable.
Improve performance of the L2 norm calculation
– A classic sum reduction; currently consumes 45% of the RHS run time
Improve MPI data transfer performance
– Masking: concurrent MPI transfers and computation
Use CPU memory to store LHS matrices
– Can potentially allow many more blocks per GPU
– Masking: concurrent GPU-CPU transfers and kernel executions
Run large-scale simulations with INCOMP3D