
GenIDLEST Co-Design
Amit Amritkar & Danesh Tafti
Collaborators: Wu-chun Feng, Paul Sathre, Kaixi Hou, Sriram Chivukula, Hao Wang, Tom Scogland, Eric de Sturler & Kasia Swirydowicz
Virginia Tech
AFOSR-BRI Workshop, July 23, 2014

Solver co-design with the Math team
Amit Amritkar, Danesh Tafti, Eric de Sturler, Katarzyna Swirydowicz

Solution of the pressure Poisson equation
- Most time-consuming function (50 to 90 % of total time)
- Solving multiple linear systems Ax = b; 'A' remains constant from one time step to the next in many CFD calculations
rGCROT/rGCRODR algorithm
- Recycling of basis vectors from one time step to the subsequent ones (the projection step is sketched below)
Hybrid approach
- rGCROT to build the outer vector space initially
- rBiCG-STAB for subsequent systems for faster performance
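To make the recycling idea concrete, here is a compact sketch of the projection step used in GCRO/GCROT-type methods, in standard notation assumed here rather than taken from the slides: a recycle space spanned by U is carried over from earlier time steps, C = AU is orthonormalized, and each new solve starts by projecting out the recycled directions.

```latex
% GCRO/GCROT-style recycling projection (sketch; standard notation, not from the slides).
% U: recycled basis carried over from earlier solves; C = AU with C^T C = I.
\begin{align*}
  x_0 &\leftarrow x_0 + U C^{T} r_0   && \text{(reuse recycled directions immediately)} \\
  r_0 &\leftarrow r_0 - C C^{T} r_0   && \text{(residual made orthogonal to range}(C)\text{)} \\
      &\text{then build the new Krylov space with } (I - C C^{T})A &&
\end{align*}
```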

Co-design with the CS team
Amit Amritkar, Danesh Tafti, Wu-chun Feng, Paul Sathre, Kaixi Hou, Sriram Chivukula, Hao Wang, Tom Scogland

Manual CUDA code optimization
- From 5x to 10x
OpenACC version of the code
- OpenACC vs CUDA code performance (currently at 0.4x)
Integration with the "Library" (a dot-product example is sketched below)
- Dot product
- Inter-mesh-block communication
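Purely as an illustration of the kind of library integration mentioned above (the co-designed "Library" is not cuBLAS and its interface is not shown on these slides), a device-resident dot product can be delegated to a library call like this:

```cuda
// Illustrative only: replacing a hand-written dot-product kernel with a
// library call on device data (cuBLAS shown as a stand-in example).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> h(n, 1.0);                 // host data: all ones
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemcpy(d_x, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double dot = 0.0;
    cublasDdot(handle, n, d_x, 1, d_y, 1, &dot);   // dot product runs on the GPU
    printf("dot = %.1f (expected %d)\n", dot, n);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```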

Publications
- Amit Amritkar, Eric de Sturler, Katarzyna Swirydowicz, Danesh Tafti and Kapil Ahuja. "Recycling Krylov subspaces for CFD application." To be submitted to Computer Methods in Applied Mechanics and Engineering.
- Amit Amritkar and Danesh Tafti. "CFD computations using preconditioned Krylov solver on GPUs." Proceedings of the ASME 2014 Fluids Engineering Division Summer Meeting, August 3-7, 2014, Chicago, Illinois, USA.
- Katarzyna Swirydowicz, Amit Amritkar, Eric de Sturler and Danesh Tafti. "Recycling Krylov subspaces for CFD application." Presentation at the ASME 2014 Fluids Engineering Division Summer Meeting, August 3-7, 2014, Chicago, Illinois, USA.
- Amit Amritkar, Danesh Tafti, Paul Sathre, Kaixi Hou, Sriram Chivukula and Wu-chun Feng. "Accelerating Bio-Inspired MAV Computations using GPUs." Proceedings of the AIAA Aviation and Aeronautics Forum and Exposition 2014, 16-20 June 2014, Atlanta, Georgia.

Future work
- Use GPU-aware MPI (a sketch follows this list); GPUDirect v2 gives about a 25% performance improvement
- Overlapping computations with communications
- Integrate with the library developed by the CS team
- Assess performance on multiple architectures
- Evaluation of RK methods (Rosenbrock-Krylov) and IMEX DIMSIM for the fractional step algorithm
- Evaluation of OpenACC for portability and OpenMP 4.0 for accelerator programming
- Use of co-processing (Intel MIC): combination of CPU and co-processor; data copy between CPU/GPU with face data pack/unpack
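As a minimal sketch of the first item, assuming a CUDA-aware (GPUDirect-enabled) MPI build and paired ranks; the buffer layout, names, and sizes are illustrative, not GenIDLEST's:

```cuda
// Minimal sketch of a face exchange with CUDA-aware MPI: device pointers are
// passed straight to MPI, avoiding explicit staging through host buffers.
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_face(double* d_buf, int n, int neighbor, MPI_Comm comm) {
    MPI_Request req[2];
    // Receive the neighbor's face into the first half, send ours from the second half.
    MPI_Irecv(d_buf,     n, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(d_buf + n, n, MPI_DOUBLE, neighbor, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                    // illustrative face size
    double* d_buf;
    cudaMalloc(&d_buf, 2 * n * sizeof(double));
    cudaMemset(d_buf, 0, 2 * n * sizeof(double));

    int neighbor = rank ^ 1;               // pair ranks 0<->1, 2<->3, ... (even rank count)
    exchange_face(d_buf, n, neighbor, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```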

Comparison of execution time for OpenACC vs CUDA vs CPU (serial)

Recap
- GPU version of GenIDLEST: strategy
- Validation studies of the GPU code: turbulent channel flow, turbulent pipe flow
- Application: bat flight

Outline
- GenIDLEST code: features and capabilities
- Co-design: CS team, Math team
- Application: bat flight – scaling study; other modifications
- Future work

GenIDLEST Flow Chart

Data Structures and Mapping to Co-Processor Architectures
(Diagram: the global/nodal grid is decomposed into mesh blocks, distributed via MPI across the compute node's CPUs and co-processors (GPU, Intel MIC); each mesh block is further split into cache blocks handled by OpenMP threads or offloaded to the co-processor. An illustrative sketch follows.)
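A rough sketch of the two-level hierarchy implied by this slide; the type names and fields are assumptions for illustration, not GenIDLEST's actual data structures:

```cuda
// Illustrative only: global grid -> mesh blocks (MPI distribution) -> cache blocks
// (work units for OpenMP threads or GPU thread blocks).
struct CacheBlock {
    int ni, nj, nk;        // cells in this sub-block (sized to fit cache / shared memory)
    double* phi;           // field data for this sub-block
};

struct MeshBlock {         // unit of MPI distribution across CPUs / co-processors
    int id;                // block id within the global grid
    int numCacheBlocks;    // number of cache-block work units
    CacheBlock* blocks;
};
```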

(Figure: performance comparison, GPU (60 GPUs) vs CPU (60 CPU cores).)

GPU code scaling study
- Strong scaling study: bat flight, 24 million grid nodes
- HokieSpeed – no RDMA

Comparison of mean time taken on 32 GPUs
- The time taken in data-exchange-related calls increases for the 256-mesh-block case
- Local copy on a GPU is expensive

Code profiling
Consolidated profiling data – time spent (percentage):

                      256 mesh blocks   32 mesh blocks
  Communication             61               40
  GPU calculations          22               35
  CPU calculations           8               15
  Other                      9               10

- CPU calculations are pre- and post-processing of data, including I/O
- Communication costs dominate; potentially reduced by using RDMA (GPUDirect v3)
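As a hedged sketch of how the GPU portion of timings like these can be gathered (the profiling actually used here may rely on different tooling), CUDA events bracket the kernel launches of one code section:

```cuda
// Sketch: timing one GPU section (e.g. the pressure solve) with CUDA events.
// The section contents are omitted; the pattern is what matters.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... launch the kernels that make up this section ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait until the section has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("section time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```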

Modifications to the point Jacobi preconditioner
Original version on GPU:
- Only one iteration per load into memory
- Synchronization across thread blocks after every inner iteration
Modified version on GPU (kernel sketch below):
- Use of shared memory
- Multiple inner iterations
- Synchronization across thread blocks after all inner iterations
(Loop structure: do iterations → launch kernel → stencil calculations → end)
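The sketch below illustrates the modified scheme in simplified form (a 1-D, 3-point stencil with zero Dirichlet boundaries; GenIDLEST's actual stencils, data layout, and preconditioner differ): each thread block loads its tile plus halo into shared memory once, performs several Jacobi sweeps locally while the halo values stay frozen, and synchronizes across thread blocks only at the next kernel launch.

```cuda
// Simplified sketch of the modified point Jacobi preconditioner idea:
// multiple inner iterations per shared-memory load, cross-block sync only
// between kernel launches. Assumes a 1-D Poisson problem for brevity.
__global__ void jacobi_inner(const double* __restrict__ f,
                             const double* __restrict__ u_in,
                             double* __restrict__ u_out,
                             int n, double h2, int inner_iters)
{
    extern __shared__ double s[];                           // tile + 2 halo cells
    const int gi = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    const int li = threadIdx.x + 1;                         // local index (skip left halo)

    // Load the tile and its halo once.
    s[li] = (gi < n) ? u_in[gi] : 0.0;
    if (threadIdx.x == 0)
        s[0] = (gi > 0) ? u_in[gi - 1] : 0.0;               // zero Dirichlet boundary
    if (threadIdx.x == blockDim.x - 1)
        s[blockDim.x + 1] = (gi + 1 < n) ? u_in[gi + 1] : 0.0;
    __syncthreads();

    // Several inner sweeps against the (frozen) halo values.
    for (int it = 0; it < inner_iters; ++it) {
        double unew = 0.0;
        if (gi < n)
            unew = 0.5 * (s[li - 1] + s[li + 1] + h2 * f[gi]);
        __syncthreads();                                    // finish reads before overwriting
        if (gi < n) s[li] = unew;
        __syncthreads();
    }

    if (gi < n) u_out[gi] = s[li];
}

// Illustrative launch:
// jacobi_inner<<<blocks, threads, (threads + 2) * sizeof(double)>>>(f, u_in, u_out, n, h2, iters);
```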

Modified preconditioner
Time in the pressure solve for one time step (same number of iterations to converge):

Turbulent channel flow
  # GPUs   Original kernel (s)   Modified kernel (s)   Speedup
     1            0.236                 0.159            1.32x
     4            3.075                 2.033            1.33x

Bat wing flapping
  # GPUs   Original kernel (s)   Modified kernel (s)   Speedup
    32            5.13                  3.04             1.4x
    64            1.96                  1.61             1.17x

Non-orthogonal case optimization
- Pipe flow calculations, 19-point stencil
- Time in pressure calculations for one time step; same number of iterations to converge
- Use of shared memory: size limitations
  - Reduce the cache block size further, or
  - Use another strategy to load all variables into shared memory

  # GPUs   Original (s)   Optimized (s)   Speedup
    36         0.67           0.625         1.06x

Other modifications
- Modifications to accommodate multiple layers of cells for communication
- Modified in-situ calculation of the i,j,k mapping to a thread block
- Modified indices in dot_product
- Convergence check on GPUs: reduction kernel (sketched below)
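As a hedged sketch of the last item (a generic reduction pattern, not GenIDLEST's actual kernel), each thread block sums the squared residual over its chunk; a second pass or a host-side sum over block_sums then finishes the norm for the convergence check.

```cuda
// Generic block-level reduction for a GPU-side convergence check:
// each thread block writes the partial sum of r[i]^2 to block_sums[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void residual_norm2(const double* __restrict__ r,
                               double* __restrict__ block_sums, int n)
{
    extern __shared__ double s[];
    const int tid = threadIdx.x;
    const int gi  = blockIdx.x * blockDim.x + tid;

    s[tid] = (gi < n) ? r[gi] * r[gi] : 0.0;
    __syncthreads();

    // Tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];
}

// Illustrative launch:
// residual_norm2<<<blocks, threads, threads * sizeof(double)>>>(r, block_sums, n);
```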