An Update on Accelerating CICE with OpenACC

An Update on Accelerating CICE with OpenACC
Two main challenges: low computational intensity and frequent halo updates
Dec 3, 2015
LA-UR-15-29184

Outline
- Current status
- What we did
- GPUDirect for MPI communication
- Strategy for going forward
  - Tools: profiling, mercurial
  - Methodologies: sandbox, unit testing, incremental changes

Outline
- Current status
- What we did
- GPUDirect for MPI communication
- Strategy for going forward

Current Status
- Modified dynamics transport routines
- Implemented a GPUDirect version of halo updates
- Used buffers to aggregate messages
- Recent attempt to run a larger 320-rank problem
  - Ran out of device memory on Titan
  - Would be nice if OpenACC could provide async device memory allocation
- OpenACC illustrates a problem with the directive-based programming model: you don't know what code gets generated

Outline
- Current status
- What we did
- GPUDirect for MPI communication
- Strategy for going forward
  - Tools: profiling, mercurial
  - Methodologies: sandbox, unit testing, incremental changes

Test Platform – Nvidia PSG cluster
- Mix of GPUs: K20, K40 and K80
- Nodes have 2 to 8 devices
- K80 is essentially 2 K40s on a single card
- CPUs: Ivy Bridge and Haswell
- Testing is done on either 20-core Ivy Bridge nodes with 6 K40s/node or 32-core Haswell nodes with 8 K80s/node

Test Platform – Nvidia PSG cluster
- Asymmetric device distribution: 1-2 cards on socket 0 and 3-4 cards on socket 1
- Displayed using 'nvidia-smi topo -m'
[nvidia-smi topo -m matrix for GPU0-GPU5 and mlx5_0: links reported as PIX/PHB/SOC, CPU affinity 0-9 and 10-19]

Test Platform – K80 nodes
- K80 is 2 K40s on a single card
- 'nvidia-smi topo -m' shows 8 devices, but only 4 K80 cards
[nvidia-smi topo -m matrix for GPU0-GPU7 and mlx5_0: links reported as PIX/PXB/PHB/SOC, CPU affinity 0-15 and 16-31]

Profile using gprof + gprof2dot.py

Accelerating Dynamics Transport
- Focused on horizontal_remap
- Minimize data movement
- Overlap host-to-device and device-to-host transfers with kernel computations using async streams (see the sketch below)
- Use Fortran pointers into a single large memory block
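As a rough illustration of the overlap idea (not the actual CICE code; the array names, bounds, and routine below are hypothetical, and both arrays are assumed to already be resident on the device, e.g. via enter data create):

  ! Hypothetical sketch: queue 2 carries a host-to-device update while queue 1
  ! runs a kernel, so the transfer overlaps the computation.
  subroutine overlap_example(a, b, alpha, nx, ny)
     integer, intent(in)    :: nx, ny
     real,    intent(in)    :: alpha
     real,    intent(inout) :: a(nx,ny), b(nx,ny)
     integer :: i, j

     !$acc update device(b) async(2)            ! H2D copy of b proceeds in the background
     !$acc parallel loop collapse(2) async(1)   ! kernel on a runs concurrently on queue 1
     do j = 1, ny
        do i = 1, nx
           a(i,j) = alpha * a(i,j)
        enddo
     enddo
     !$acc update self(a) async(1)              ! D2H copy of a, queued behind its kernel
     !$acc wait                                 ! join both queues before using results on the host
  end subroutine overlap_example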

Results

                      Haswell   Ivybridge
  Baseline              55.12       65.63
  Version 53             55.82       65.55
                         55.33       65.78
  Buffered baseline      55.18       65.29
  GPUDirect              57.39       67.19

Pointers into Blocks
Use Fortran pointers into large memory chunks:

  real, allocatable, target :: mem_chunk(:,:,:)
  real, pointer             :: v(:,:), w(:,:)

  allocate( mem_chunk(nx,ny,2) )
  v => mem_chunk(:,:,1)
  w => mem_chunk(:,:,2)

  !$acc enter data create(mem_chunk)
  !$acc update device(mem_chunk)

  !$acc data present(v,w)
  !$acc parallel loop collapse(2)
  do j = 1,ny
     do i = 1,nx
        v(i,j) = alpha * w(i,j)
     enddo
  enddo
  !$acc end data

CUDA streams
- Use CUDA streams for asynchronous kernels
- If loops are data independent, launch them on separate streams, along with data updates to host/device

  !$acc parallel loop collapse(2) async(1)
  do i = 1,n
     do j = 1,m
        a(i,j) = a(i,j) * w(i,j)
     enddo
  enddo

  !$acc parallel loop collapse(2) async(2)
  do i = 1,n
     do j = 1,m
        b(i,j) = b(i,j) + alpha * t(i,j)
     enddo
  enddo

CUDA streams
- Pass the stream number to subroutines called inside loops
- If each iteration is data independent, the ACC kernels inside construct_fields can use iblk as the asynchronous stream number (see the sketch below)
- In theory we should get concurrent execution

  do iblk = 1, nblocks
     call construct_fields(iblk,…)
  enddo
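A minimal sketch of the idea (the argument list and body here are illustrative, not CICE's actual construct_fields): each block's kernels go onto the async queue identified by the block index, so kernels from different blocks can overlap.

  ! Hypothetical sketch: kernels inside the routine use the caller's block index
  ! as the async queue, so independent blocks can execute concurrently.
  ! Assumes fld is already present on the device.
  subroutine construct_fields(iblk, fld, nx, ny)
     integer, intent(in)    :: iblk, nx, ny
     real,    intent(inout) :: fld(nx,ny)
     integer :: i, j
     !$acc parallel loop collapse(2) async(iblk) present(fld)
     do j = 1, ny
        do i = 1, nx
           fld(i,j) = 2.0 * fld(i,j)   ! placeholder computation
        enddo
     enddo
  end subroutine construct_fields

  ! After the block loop, the caller joins all queues with:
  !   !$acc wait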

NVPROF – revision 53

NVPROF – revision 69

Excessive kernel launches
The nvprof profiler shows for rev 53:

  ======== API calls:
  Time(%)     Time     Calls       Avg       Min       Max  Name
   69.55%  2.33983s        1  2.33983s  2.33983s  2.33983s  cudaFree
    9.02%  303.34ms     1052  288.35us  1.5210us  1.5327ms  cuEventSynchronize
    6.52%  219.28ms      386  568.08us  1.8180us  4.8207ms  cuStreamSynchronize
    6.32%  212.54ms    22944  9.2630us  8.0320us  272.22us  cuLaunchKernel
    3.46%  116.31ms        4  29.078ms  26.526ms  35.518ms  cuMemHostAlloc

- cudaFree appears to be called very early on; perhaps part of device initialization/overhead. Need to track this down. Perhaps this overhead can be amortized with larger problem sizes running for longer.
- Note the time spent in cuEventSynchronize. Using 'async' triggers cuEventRecord/cuEventSynchronize.

Reduce kernel launches – push down loops
Refactor some loops by pushing the loop inside the called routine (a fuller sketch follows below):

  ! Before: loop over blocks in the caller
  do iblk = 1,nblocks
     call departure_points(ilo(iblk), ihi(iblk),…)
  enddo

  ! After: single call, loop pushed into the callee
  call departure_points_all(nblocks,…)
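A rough sketch of what the pushed-down routine might look like (the argument list and loop body are hypothetical, not CICE's actual code, and the per-block ilo/ihi bounds are omitted): the block loop moves inside the routine and becomes part of one OpenACC kernel, so nblocks launches collapse into a single launch.

  ! Hypothetical sketch: the block loop is inside the routine and joins the
  ! kernel via collapse(3), replacing one launch per block with one launch total.
  subroutine departure_points_all(nblocks, nx, ny, dpx, dpy, uvel, vvel, dt)
     integer, intent(in)    :: nblocks, nx, ny
     real,    intent(in)    :: uvel(nx,ny,nblocks), vvel(nx,ny,nblocks), dt
     real,    intent(inout) :: dpx(nx,ny,nblocks), dpy(nx,ny,nblocks)
     integer :: iblk, i, j
     !$acc parallel loop collapse(3) present(uvel,vvel,dpx,dpy)
     do iblk = 1, nblocks
        do j = 1, ny
           do i = 1, nx
              dpx(i,j,iblk) = -uvel(i,j,iblk) * dt   ! placeholder computation
              dpy(i,j,iblk) = -vvel(i,j,iblk) * dt
           enddo
        enddo
     enddo
  end subroutine departure_points_all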

Reduce kernel launches – code fusion
Combine code and loops into a single kernel. The original loop looks like:

  do nt = 1,ntrace
     if ( tracer_type(nt) == 1 ) then
        call limited_gradient(…, mxav, myav, …)
        mtxav = funcx(i,j)
        mtyav = funcy(i,j)
        :
     else if ( tracer_type(nt) == 2 ) then
        nt1 = depend(nt)
        call limited_gradient(…, mtxav(nt1), mtyav(nt1), …)
     else if ( tracer_type(nt) == 3 ) then
        :
     endif
  enddo

Reduced kernel launches
The nvprof profiler shows for rev 69:

  ======== API calls:
  Time(%)     Time     Calls       Avg       Min       Max  Name
   74.13%  2.36824s        1  2.36824s  2.36824s  2.36824s  cudaFree
   10.85%  346.63ms     1052  329.49us  1.5490us  1.6895ms  cuEventSynchronize
    5.10%  162.87ms      434  375.28us  1.7690us  3.6578ms  cuStreamSynchronize
    3.79%  121.04ms        4  30.260ms  27.768ms  35.218ms  cuMemHostAlloc
    1.86%  59.298ms    44036  1.3460us     233ns  129.30us  cuPointerGetAttribute
    1.55%  49.416ms      498  99.229us     190ns  4.3737ms  cuDeviceGetAttribute
    0.84%  26.792ms     1824  14.688us  8.7890us  230.45us  cuLaunchKernel
    0.51%  16.285ms       37  440.14us  4.6900us  9.0301ms  cuMemAlloc

cuLaunchKernel calls drop from 22944 in rev 53 to 1824 in rev 69.

Reduce kernel launches – code fusion
The fused code becomes (a schematic of such a fused routine is sketched below):

  call fused_limited_gradient_indep(mxav, myav, mtxav, mtyav, …)
  call fused_limited_gradient_tracer2(mtxav, mtyav, …)

- Disadvantage: code duplication for variants of the limited_gradient subroutine
- Dependencies among tracers in construct_fields limited kernel concurrency
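As a schematic only (the argument list and body below are illustrative, not the actual CICE fusion, and the placeholder difference stands in for the real limited-gradient computation), the point is that work formerly issued as separate launches inside the tracer loop is computed by one OpenACC kernel:

  ! Hypothetical sketch of code fusion: gradient work for the independent
  ! tracers runs in a single kernel instead of one launch per tracer.
  subroutine fused_limited_gradient_indep(nx, ny, ntrace, phi, gx, gy)
     integer, intent(in)    :: nx, ny, ntrace
     real,    intent(in)    :: phi(nx,ny,ntrace)
     real,    intent(inout) :: gx(nx,ny,ntrace), gy(nx,ny,ntrace)
     integer :: nt, i, j
     !$acc parallel loop collapse(3) present(phi,gx,gy)
     do nt = 1, ntrace
        do j = 2, ny-1
           do i = 2, nx-1
              ! placeholder centered difference; the real routine applies a
              ! limited gradient, which is not reproduced here
              gx(i,j,nt) = 0.5 * ( phi(i+1,j,nt) - phi(i-1,j,nt) )
              gy(i,j,nt) = 0.5 * ( phi(i,j+1,nt) - phi(i,j-1,nt) )
           enddo
        enddo
     enddo
  end subroutine fused_limited_gradient_indep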

GPU Affinity
- Bind each MPI rank to a device on the same socket
- Uses the hwloc library: get the list of devices on the local socket, then set the device in round-robin fashion (see the sketch below)
- Can also restrict the list using the environment variable CICE_GPU_DEVICES
  - Used to restrict K80s to use only one of the two devices on a card
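A simplified sketch of the round-robin assignment (this version uses only the node-local MPI rank and the OpenACC runtime; the actual code queries hwloc for socket locality and honors CICE_GPU_DEVICES, both omitted here, and it assumes zero-based device numbering as with the PGI runtime):

  ! Hypothetical sketch: assign each node-local MPI rank a device round-robin.
  subroutine set_gpu_affinity(comm)
     use mpi
     use openacc
     implicit none
     integer, intent(in) :: comm
     integer :: node_comm, local_rank, ndevices, mydevice, ierr

     ! Ranks sharing a node get consecutive local ranks
     call MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm, ierr)
     call MPI_Comm_rank(node_comm, local_rank, ierr)

     ! Round-robin over the visible NVIDIA devices
     ndevices = acc_get_num_devices(acc_device_nvidia)
     mydevice = mod(local_rank, ndevices)
     call acc_set_device_num(mydevice, acc_device_nvidia)

     call MPI_Comm_free(node_comm, ierr)
  end subroutine set_gpu_affinity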

Outline
- Current status
- What we did
- GPUDirect for MPI communication
- Strategy for going forward
  - Tools: profiling, mercurial
  - Methodologies: sandbox, unit testing, incremental changes

GPUDirect
- Network adapter reads/writes directly from GPU memory
- Requires a supporting network adapter (Mellanox InfiniBand) and a Tesla-class card
- Introduced in CUDA 5.0
[Diagram: GPU and CPU attached to the chipset over PCI-e, with the InfiniBand adapter accessing GPU memory directly]

GPUDirect
- Allows bypass of copies to/from CPU memory for MPI communication, if the MPI library and network interconnect support it (see the sketch below)
- Modified halo updates to use GPUDirect
- Added additional buffering capability to aggregate into larger messages: ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd
- Pushed buffering to the CPU halo updates
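As a minimal sketch of how a device-resident buffer can be handed to a CUDA-aware MPI library (the routine, buffer names, and message details are hypothetical, not the actual ice_halo* internals, and the buffers are assumed to already be present on the device), OpenACC's host_data construct exposes the device addresses to the MPI calls:

  ! Hypothetical sketch: exchange a halo buffer directly from device memory.
  ! Requires a CUDA-aware MPI build with GPUDirect support.
  subroutine halo_exchange_gpudirect(sendbuf, recvbuf, bufsize, nbr_rank, comm)
     use mpi
     implicit none
     integer, intent(in)         :: bufsize, nbr_rank, comm
     real(kind=8), intent(in)    :: sendbuf(bufsize)
     real(kind=8), intent(inout) :: recvbuf(bufsize)
     integer :: req(2), ierr
     integer, parameter :: tag = 17

     ! host_data passes the device addresses of the buffers to MPI,
     ! so the interconnect can read/write GPU memory directly
     !$acc host_data use_device(sendbuf, recvbuf)
     call MPI_Irecv(recvbuf, bufsize, MPI_REAL8, nbr_rank, tag, comm, req(1), ierr)
     call MPI_Isend(sendbuf, bufsize, MPI_REAL8, nbr_rank, tag, comm, req(2), ierr)
     !$acc end host_data
     call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
  end subroutine halo_exchange_gpudirect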

Buffered halo updates
New interface: ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd

  call ice_haloBegin(haloInfo, num_fields, updateInfo)
  call ice_haloAddUpdate(dpx,…)
  call ice_haloAddUpdate(dpy,…)
  call ice_haloFlush(haloInfo, updateInfo)
  call ice_haloAddUpdate(mx,…)
  call ice_haloEnd(haloInfo, updateInfo)

We refactored some CPU halo updates to use the new buffered scheme.

Outline
- Current status
- What we did
- GPUDirect for MPI communication
- Strategy for going forward
  - Tools: profiling, mercurial
  - Methodologies: sandbox, unit testing, incremental changes

Going Forward
- More physics being added to column physics, so we will revisit it with OpenACC
  - No halo updates during computations
- Explore task parallelism
  - Can we restructure code execution to expose tasks that translate to GPU kernels?
- Expand GPU acceleration to other parts of CICE
- Tools: profiling, mercurial
- Methodologies: sandbox, unit testing, incremental changes