Co-Design Update
12/19/14
Brent Pickering, Virginia Tech AOE Department

Collaboration

Continuing to work with Dr. Sandu's research group on the time-integration test code ("SENSEI-lite").
– Recap: MATLAB (MEX)-based 2D proxy of the SENSEI FVM code.
– The full Navier-Stokes equations are implemented for the RHS, which can be split into viscous and inviscid contributions. Primary test case: 2D unsteady cylinder.
– Fixed issues with time-integration order (higher-order schemes now working successfully).
– Added the ability to modify the viscosity coefficient (e.g., to change the "stiffness" of the diffusive equations).
– Returns RHS vectors only at present, split into viscous/inviscid.

Future work: LHS function.
– Can be implemented in the same MATLAB framework (SENSEI-lite), permitting the equation split and any additional custom features.
– Another possibility would be to fork SENSEI: more work overall to get the equation splitting working, but the implicit machinery is already in place (and we could run more problem cases).
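The viscous/inviscid RHS split described above can be sketched as follows. This is a minimal illustration in C, not SENSEI-lite code: the `Model` struct, the stand-in flux kernels, and all function names are hypothetical, and the point is only the interface, i.e. the full RHS is assembled as the sum of separately callable inviscid and viscous parts so a time integrator (e.g. an IMEX scheme) can treat the stiff viscous contribution differently.

```c
#include <stddef.h>

/* Hypothetical model state: problem size plus the viscosity
 * coefficient that acts as the "stiffness" knob for the
 * diffusive terms. */
typedef struct {
    size_t n;    /* number of degrees of freedom */
    double mu;   /* viscosity coefficient        */
} Model;

/* Placeholder kernels: in a real FVM code these would evaluate
 * flux sums; here they just apply toy linear operators. */
static void rhs_inviscid(const Model *m, const double *u, double *out) {
    for (size_t i = 0; i < m->n; ++i)
        out[i] = -u[i];            /* stand-in for convective terms */
}

static void rhs_viscous(const Model *m, const double *u, double *out) {
    for (size_t i = 0; i < m->n; ++i)
        out[i] = -m->mu * u[i];    /* stand-in for diffusive terms  */
}

/* Full RHS = inviscid + viscous, using the same split interface. */
static void rhs_full(const Model *m, const double *u,
                     double *work, double *out) {
    rhs_inviscid(m, u, out);
    rhs_viscous(m, u, work);
    for (size_t i = 0; i < m->n; ++i)
        out[i] += work[i];
}
```

An integrator can then call `rhs_inviscid` explicitly and `rhs_viscous` implicitly, or `rhs_full` for fully explicit stepping.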

Moving Forward with SENSEI

The most straightforward approach at present seems to be OpenACC:
– Do any compilers offer full OpenMP 4.0 accelerator support? Is support for other directive APIs competitive with OpenACC?
– However, we are having some issues compiling SENSEI with PGI (see bug report).

SENSEI likely requires some refactoring to work with any directive-based programming scheme.
– Still have the problem with virtual methods/function pointers.
– Additionally, there appear to be "loop-carried dependency" issues in the current implementation (which may prevent automatic parallelization).
– Some data structures (e.g., flux-limiter storage) are optimized to avoid redundant calculation and storage, and are not currently well suited to SIMD.
– A common issue with FVM is the need to write flux data calculated at each interface back to the cells. In SENSEI, there is also some data exchange used to avoid redundant computation (limiters, viscous subgrid).
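The flux write-back issue mentioned above can be made concrete with a toy 1D example (illustrative C, not SENSEI code; the central-average flux and function names are assumptions). In a face-based loop each interface flux updates both adjacent cells, so neighboring iterations write to the same cell and the loop carries a dependency that blocks naive SIMD or GPU parallelization. A cell-based loop recomputes each face flux twice, trading redundant arithmetic for independent writes:

```c
#include <stddef.h>

/* Toy central-average flux at face f (between cells f and f+1). */
static double flux(const double *u, size_t f) {
    return 0.5 * (u[f] + u[f + 1]);
}

/* Face loop: each flux is computed once, but the residual updates
 * from adjacent faces collide on shared cells (write conflict). */
static void residual_face_loop(const double *u, double *res, size_t ncell) {
    for (size_t i = 0; i < ncell; ++i) res[i] = 0.0;
    for (size_t f = 0; f + 1 < ncell; ++f) {
        double F = flux(u, f);
        res[f]     -= F;   /* left cell                          */
        res[f + 1] += F;   /* right cell: conflicts in parallel  */
    }
}

/* Cell loop: each cell recomputes the fluxes on both of its faces.
 * Arithmetic is duplicated, but iteration i writes only res[i], so
 * the loop parallelizes cleanly (e.g., under an OpenACC kernel). */
static void residual_cell_loop(const double *u, double *res, size_t ncell) {
    for (size_t i = 0; i < ncell; ++i) {
        double Fl = (i > 0)         ? flux(u, i - 1) : 0.0;
        double Fr = (i + 1 < ncell) ? flux(u, i)     : 0.0;
        res[i] = Fl - Fr;
    }
}
```

Both loops produce identical residuals; the choice is between minimal arithmetic (face loop) and conflict-free parallel writes (cell loop).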

Moving Forward with SENSEI

In the near term, working on a simplified proxy function to test our FVM algorithms with OpenACC directives.
– Looking at many possibilities, such as different loop orders, storing flux values, and separate kernels for each flux direction.
– The higher parallel throughput of the GPU may offset redundant arithmetic.
– It might be necessary to take a minor hit in CPU performance for a general scheme that works well on accelerator platforms.

Longer term:
– Probably need to replace some of our virtual methods with switch-case constructs.
– Restructure data-dependent loops (the LHS build-up routines could be tricky and may require separate functions for OpenACC). Apply loop fusion where possible.
– We still need a GPU iterative-solver library to run implicit cases.

Upcoming Publications

Almost finished with the OpenACC journal article (based on the SciTech paper). Will submit to Computers & Fluids at the beginning of January.
– Finished with the LDC code after this! (Note: observed a strange performance regression (~25%) in the LDC code with newer PGI compiler versions.)