An Update on Accelerating CICE with OpenACC


1 An Update on Accelerating CICE with OpenACC
Two main challenges: low computational intensity and frequent halo updates
Dec 3, 2015 LA-UR

2 Outline
Current status
What we did
GPUDirect for MPI communication
Strategy for going forward
  Tools: profiling, mercurial
  Methodologies: sandbox, unit testing, incremental changes

3 Outline
Current status
What we did
GPUDirect for MPI communication
Strategy for going forward

4 Current Status
Modified dynamics transport routines
Implemented a GPUDirect version of halo updates
  Used buffers to aggregate messages
Recent attempt to run a larger 320-rank problem
  Ran out of device memory on Titan
  Would be nice if OpenACC could provide asynchronous device memory allocation
OpenACC illustrates a problem with the directive-based programming model: you don't know what code gets generated

5 Outline
Current status
What we did
GPUDirect for MPI communication
Strategy for going forward
  Tools: profiling, mercurial
  Methodologies: sandbox, unit testing, incremental changes

6 Test Platform – Nvidia PSG cluster
Mix of GPUs: K20, K40, and K80; nodes have 2 to 8 devices
K80 is essentially 2 K40s on a single card
CPUs: Ivy Bridge and Haswell
Testing is done on either 20-core Ivy Bridge nodes with 6 K40s/node or 32-core Haswell nodes with 8 K80s/node

7 Test Platform – Nvidia PSG cluster
Asymmetric device distribution: 1-2 cards on socket 0 and 3-4 cards on socket 1
Displayed using 'nvidia-smi topo -m'
[Slide table: nvidia-smi topology matrix for GPU0-GPU5 and mlx5_0, showing X/PIX/PHB/SOC link types and CPU affinities 0-9 and 10-19]

8 Test Platform – K80 nodes
K80 is 2 K40s on a single card
'nvidia-smi topo -m' shows 8 devices, but only 4 K80 cards
[Slide table: nvidia-smi topology matrix for GPU0-GPU7 and mlx5_0, showing X/PIX/PXB/PHB/SOC link types and CPU affinities 0-15 and 16-31]

9 Profile using gprof + gprof2dot.py

10 Accelerating Dynamics Transport
Focused on horizontal_remap
Minimize data movement
Overlap host-to-device and device-to-host transfers with kernel computations using async streams (a small sketch follows)
Use Fortran pointers into a single large memory block
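
A minimal sketch of the overlap idea, not CICE code: array names, sizes, and queue ids below are illustrative. The data update and an independent kernel are placed on different async queues so the transfer can proceed while the kernel runs.

program overlap_demo
  implicit none
  integer, parameter :: nx = 512, ny = 512
  real, allocatable :: a(:,:), b(:,:)
  real :: alpha = 0.5
  integer :: i, j

  allocate(a(nx,ny), b(nx,ny))
  a = 1.0
  b = 2.0
  !$acc enter data copyin(a) create(b)

  !$acc update device(b) async(2)          ! transfer b on queue 2 ...
  !$acc parallel loop collapse(2) async(1) ! ... while a is scaled on queue 1
  do j = 1, ny
     do i = 1, nx
        a(i,j) = alpha * a(i,j)
     enddo
  enddo
  !$acc wait                               ! join both queues before using results

  !$acc exit data copyout(a) delete(b)
end program overlap_demo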

11 Results
                    Haswell   Ivybridge
Baseline             55.12     65.63
Version 53           55.82     65.55
                     55.33     65.78
Buffered baseline    55.18     65.29
GPUDirect            57.39     67.19

12 Pointers into Blocks
Use Fortran pointers into large memory chunks

real, allocatable, target :: mem_chunk(:,:,:)
real, pointer :: v(:,:), w(:,:)

allocate( mem_chunk(nx,ny,2) )
! v and w alias slices of the single device-resident chunk
v => mem_chunk(:,:,1)
w => mem_chunk(:,:,2)
!$acc enter data create(mem_chunk)
!$acc update device(mem_chunk)
!$acc data present(v,w)
!$acc parallel loop collapse(2)
do j = 1,ny
   do i = 1,nx
      v(i,j) = alpha * w(i,j)
   enddo
enddo
!$acc end data

13 CUDA streams
Use CUDA streams for asynchronous kernels
If loops are data independent, launch them on separate streams along with data updates to host/device

!$acc parallel loop collapse(2) async(1)
do i = 1,n
   do j = 1,m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

!$acc parallel loop collapse(2) async(2)
do i = 1,n
   do j = 1,m
      b(i,j) = b(i,j) + alpha * t(i,j)
   enddo
enddo
!$acc wait

14 CUDA streams
Pass the stream id to subroutines inside loops
If each iteration is data independent, ACC kernels inside construct_fields can use iblk as the asynchronous stream id (see the sketch below)
In theory we should get concurrent execution

do iblk = 1, nblocks
   call construct_fields(iblk,…)
enddo
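
A hedged sketch of what this could look like inside construct_fields; the interface and the field names mx, mc below are illustrative, not the actual CICE routine. Each block's kernel is queued on its own async stream, so the caller's loop can launch work for all blocks before waiting.

subroutine construct_fields(iblk, nx, ny, mx, mc)
  implicit none
  integer, intent(in) :: iblk, nx, ny
  real, intent(inout) :: mx(:,:,:), mc(:,:,:)
  integer :: i, j

  ! Kernel launched on a per-block queue (iblk); data assumed already device-resident
  !$acc parallel loop collapse(2) async(iblk) present(mx, mc)
  do j = 1, ny
     do i = 1, nx
        mc(i,j,iblk) = mc(i,j,iblk) + mx(i,j,iblk)
     enddo
  enddo
end subroutine construct_fields

! Caller, as on the slide, followed by a join of all queues:
! do iblk = 1, nblocks
!    call construct_fields(iblk, ...)
! enddo
! !$acc wait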

15 NVPROF – revision 53

16 NVPROF – revision 69

17 Excessive kernel launches
nvprof profiler shows for rev 53:
======== API calls:
Time(%)   Name
 69.55%   cudaFree
  9.02%   cuEventSynchronize
  6.52%   cuStreamSynchronize
  6.32%   cuLaunchKernel
  3.46%   cuMemHostAlloc
cudaFree appears to be called very early on; perhaps part of device initialization/overhead. Need to track this down. Perhaps this overhead can be amortized with larger problem sizes running longer.
Note the time spent in cuEventSynchronize: using 'async' triggers cuEventRecord/cuEventSynchronize.

18 Reduce kernel launches - push down loops
Refactor some loops by pushing the loop inside the called routine (a sketch of the pushed-down routine follows)

Before:
do iblk = 1,nblocks
   call departure_points(ilo(iblk), ihi(iblk),…)
enddo

After:
call departure_points_all(nblocks,…)
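
A hedged sketch of the pushed-down form; the interface (jlo/jhi bounds, the dpx field) and the loop body are illustrative placeholders, not the real departure_points_all. Moving the block loop inside the routine lets one kernel cover all blocks, replacing nblocks separate launches with a single launch.

subroutine departure_points_all(nblocks, ilo, ihi, jlo, jhi, dpx)
  implicit none
  integer, intent(in) :: nblocks, ilo(:), ihi(:), jlo(:), jhi(:)
  real, intent(inout) :: dpx(:,:,:)
  integer :: iblk, i, j

  ! One kernel launch for all blocks instead of one per block
  !$acc parallel loop present(dpx) copyin(ilo, ihi, jlo, jhi)
  do iblk = 1, nblocks
     !$acc loop collapse(2)
     do j = jlo(iblk), jhi(iblk)
        do i = ilo(iblk), ihi(iblk)
           dpx(i,j,iblk) = 0.0   ! placeholder body; the real routine computes departure points
        enddo
     enddo
  enddo
end subroutine departure_points_all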

19 Reduce kernel launches – code fusion
Combine code and loops into a single kernel

do nt = 1,ntrace
   if ( tracer_type(nt) == 1 ) then
      call limited_gradient(…, mxav, myav, …)
      mtxav = funcx(i,j)
      mtyav = funcy(i,j)
      :
   else if ( tracer_type(nt) == 2 ) then
      nt1 = depend(nt)
      call limited_gradient(…, mtxav(nt1), mtyav(nt1), …)
   else if ( tracer_type(nt) == 3 ) then
      :
   endif
enddo

20 Reduced kernel launches
nvprof profiler shows for rev 69:
======== API calls:
Time(%)   Name
 74.13%   cudaFree
 10.85%   cuEventSynchronize
  5.10%   cuStreamSynchronize
  3.79%   cuMemHostAlloc
  1.86%   cuPointerGetAttribute
  1.55%   cuDeviceGetAttribute
  0.84%   cuLaunchKernel
  0.51%   cuMemAlloc

21 Reduce kernel launches – code fusion
Fused code becomes:
call fused_limited_gradient_indep(mxav, myav, mtxav, mtyav,…)
call fused_limited_gradient_tracer2(mtxav, mtyav,…)
Disadvantages:
  Code duplication for variants of the limited_gradient subroutine
  Dependencies among tracers in construct_fields limited kernel concurrency

22 GPU Affinity
Bind each MPI rank to a device on the same socket
Using the hwloc library:
  Get the list of devices on the local socket
  Set the device in round-robin fashion
Can also restrict the list using the environment variable CICE_GPU_DEVICES
  Used to restrict K80s to use only one of the devices on the card
(A simplified sketch of the round-robin assignment follows.)
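
The deck uses hwloc for socket-aware selection; below is a simplified, hedged sketch of only the round-robin part, using the MPI shared-memory communicator and the OpenACC runtime instead of hwloc. CICE_GPU_DEVICES handling and socket awareness are omitted.

subroutine bind_gpu_round_robin()
  use mpi
  use openacc
  implicit none
  integer :: ierr, node_comm, local_rank, ndev, mydev

  ! Rank within the node (shared-memory communicator)
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  call MPI_Comm_rank(node_comm, local_rank, ierr)

  ! Round-robin assignment of local ranks to visible devices
  ndev  = acc_get_num_devices(acc_device_nvidia)
  mydev = mod(local_rank, ndev)
  call acc_set_device_num(mydev, acc_device_nvidia)

  call MPI_Comm_free(node_comm, ierr)
end subroutine bind_gpu_round_robin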

23 Outline
Current status
What we did
GPUDirect for MPI communication
Strategy for going forward
  Tools: profiling, mercurial
  Methodologies: sandbox, unit testing, incremental changes

24 GPUDirect
Network adapter reads/writes directly from GPU memory
Requires a supporting network adapter (Mellanox InfiniBand) and a Tesla-class card
Introduced in CUDA 5.0
[Slide diagram: CPU, GPU, chipset, and InfiniBand adapter connected over PCI-e]

25 GPUDirect
Allows bypass of copies to/from CPU memory for MPI communication, if the MPI library and network interconnect support it (see the sketch below)
Modified halo updates to use GPUDirect
Added additional buffering capability to aggregate into larger messages: ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd
Pushed buffering to CPU halo updates
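
A minimal sketch of how a GPUDirect-style exchange is typically written in OpenACC Fortran, not the actual CICE halo code; buffer, neighbor, and routine names are illustrative. host_data use_device hands the device addresses of the buffers to a CUDA-aware MPI library, so no staging copy through host memory is needed.

subroutine exchange_halo_gpudirect(sendbuf, recvbuf, n, neighbor)
  use mpi
  implicit none
  integer, intent(in) :: n, neighbor
  real(8), intent(in)    :: sendbuf(n)   ! assumed already resident on the device
  real(8), intent(inout) :: recvbuf(n)
  integer :: req(2), ierr
  integer, parameter :: tag = 100

  ! Pass device addresses straight to the MPI library; with GPUDirect RDMA
  ! the network adapter reads/writes GPU memory directly.
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, neighbor, tag, MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, neighbor, tag, MPI_COMM_WORLD, req(2), ierr)
  !$acc end host_data
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_halo_gpudirect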

26 Buffered halo updates
ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd

call ice_haloBegin(haloInfo, num_fields, updateInfo)
call ice_haloAddUpdate(dpx,…)
call ice_haloAddUpdate(dpy,…)
call ice_haloFlush(haloInfo, updateInfo)
call ice_haloAddUpdate(mx,…)
call ice_haloEnd(haloInfo, updateInfo)

We refactored some CPU halo updates to use the new buffered scheme.

27 Outline
Current status
What we did
GPUDirect for MPI communication
Strategy for going forward
  Tools: profiling, mercurial
  Methodologies: sandbox, unit testing, incremental changes

28 Going Forward
More physics is being added to column physics, so we will revisit it with OpenACC
  No halo updates during computations
Explore task parallelism
  Can we restructure code execution to expose tasks that translate to GPU kernels?
Expand GPU acceleration to other parts of CICE
Tools: profiling, mercurial
Methodologies: sandbox, unit testing, incremental changes

