An Update on Accelerating CICE with OpenACC
Two main challenges: low computational intensity and frequent halo updates
Dec 3, 2015
LA-UR-15-29184
Outline Current status What we did GPUDirect for MPI communication Strategy for going forward Tools: profiling, mercurial Methodologies: sandbox, unit testing, incremental changes
Outline Current status What we did GPUDirect for MPI communication Strategy for going forward
Current Status
Modified dynamics transport routines
Implemented a GPUDirect version of halo updates
Used buffers to aggregate messages
Recent attempt to run a larger 320-rank problem
Ran out of device memory on Titan
Would be nice if OpenACC could provide asynchronous device memory allocation
OpenACC illustrates a problem with the directive-based programming model: you don't know what code gets generated
Outline Current status What we did GPUDirect for MPI communication Strategy for going forward Tools: profiling, mercurial Methodologies: sandbox, unit testing, incremental changes
Test Platform – Nvidia PSG cluster
Mix of GPUs: K20, K40 and K80
Nodes have 2 to 8 devices
K80 is essentially 2 K40s on a single card
CPUs: Ivy Bridge and Haswell
Testing is done on either 20-core Ivy Bridge nodes with 6 K40s per node or 32-core Haswell nodes with 8 K80s per node
Test Platform – Nvidia PSG cluster
Asymmetric device distribution: 1-2 cards on socket 0 and 3-4 cards on socket 1
Displayed using 'nvidia-smi topo -m'
[Topology matrix for GPU0-GPU5 and mlx5_0: interconnect types X/PIX/PHB/SOC, CPU affinities 0-9 and 10-19]
Test Platform – K80 nodes
K80 is 2 K40s on a single card
'nvidia-smi topo -m' shows 8 devices, but only 4 K80 cards
[Topology matrix for GPU0-GPU7 and mlx5_0: interconnect types X/PIX/PXB/PHB/SOC, CPU affinities 0-15 and 16-31]
Profile using gprof + gprof2dot.py
Accelerating Dynamics Transport
Focused on horizontal_remap
Minimize data movement
Overlap host-to-device and device-to-host transfers with kernel computations using async streams
Use Fortran pointers into a single large memory block
Results

                     Haswell   Ivybridge
Baseline             55.12     65.63
Version 53           55.82     65.55
                     55.33     65.78
Buffered baseline    55.18     65.29
GPUDirect            57.39     67.19
Pointers into Blocks
Use Fortran pointers into large memory chunks

real, allocatable, target :: mem_chunk(:,:,:)
real, pointer :: v(:,:), w(:,:)

allocate( mem_chunk(nx,ny,2) )
v => mem_chunk(:,:,1)
w => mem_chunk(:,:,2)
!$acc enter data create(mem_chunk)
!$acc update device(mem_chunk)
!$acc data present(v,w)
!$acc parallel loop collapse(2)
do j = 1,ny
  do i = 1,nx
    v(i,j) = alpha * w(i,j)
  enddo
enddo
!$acc end data
CUDA streams
Use CUDA streams for asynchronous kernels
If loops are data independent, launch them on separate streams, along with data updates to/from the host and device

!$acc parallel loop collapse(2) async(1)
do i = 1,n
  do j = 1,m
    a(i,j) = a(i,j) * w(i,j)
  enddo
enddo
!$acc parallel loop collapse(2) async(2)
do i = 1,n
  do j = 1,m
    b(i,j) = b(i,j) + alpha * t(i,j)
  enddo
enddo
!$acc wait
CUDA streams
Pass the stream number to subroutines called inside loops
If each iteration is data independent, and the ACC kernels inside construct_fields use iblk as the async stream, then in theory we should get concurrent execution

do iblk = 1, nblocks
  call construct_fields(iblk,…)
enddo
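A minimal sketch of how the block index could be used as the async queue inside the called routine; the argument list, array shapes, and the arithmetic are illustrative placeholders, not the actual CICE interface:

subroutine construct_fields(iblk, nx, ny, mc, mx, my)
  integer, intent(in) :: iblk, nx, ny
  real(kind=8), intent(in)    :: mc(nx,ny)
  real(kind=8), intent(inout) :: mx(nx,ny), my(nx,ny)
  integer :: i, j

  ! Each block launches its kernel on queue iblk, so kernels from
  ! different blocks can overlap on the device.
  !$acc parallel loop collapse(2) present(mc, mx, my) async(iblk)
  do j = 1, ny
    do i = 1, nx
      mx(i,j) = mx(i,j) * mc(i,j)
      my(i,j) = my(i,j) * mc(i,j)
    enddo
  enddo
end subroutine construct_fields

A '!$acc wait' after the block loop is still required before the results are used on the host or in dependent kernels.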
NVPROF – revision 53
NVPROF – revision 69
Excessive kernel launches
nvprof profiler shows for rev 53:

======== API calls:
Time(%)      Time    Calls        Avg        Min        Max  Name
 69.55%  2.33983s        1   2.33983s   2.33983s   2.33983s  cudaFree
  9.02%  303.34ms     1052   288.35us   1.5210us   1.5327ms  cuEventSynchronize
  6.52%  219.28ms      386   568.08us   1.8180us   4.8207ms  cuStreamSynchronize
  6.32%  212.54ms    22944   9.2630us   8.0320us   272.22us  cuLaunchKernel
  3.46%  116.31ms        4   29.078ms   26.526ms   35.518ms  cuMemHostAlloc

cudaFree appears to be called very early on; perhaps it is part of device initialization/overhead. Need to track this down. Perhaps this overhead can be amortized by larger problem sizes running for longer.
Note the time spent in cuEventSynchronize: using 'async' triggers cuEventRecord/cuEventSynchronize.
Reduce kernel launches – push down loops
Refactor some loops by pushing the loop inside the called routine

Before:
do iblk = 1,nblocks
  call departure_points(ilo(iblk), ihi(iblk),…)
enddo

After:
call departure_points_all(nblocks,…)
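A hedged sketch of what the pushed-down routine could look like, so that one launch covers every block; the argument names (jlo, jhi, uvel, vvel, dt) and the departure-point arithmetic are illustrative placeholders:

subroutine departure_points_all(nblocks, ilo, ihi, jlo, jhi, uvel, vvel, dt, dpx, dpy)
  integer, intent(in) :: nblocks
  integer, intent(in) :: ilo(nblocks), ihi(nblocks), jlo(nblocks), jhi(nblocks)
  real(kind=8), intent(in)  :: uvel(:,:,:), vvel(:,:,:)
  real(kind=8), intent(in)  :: dt
  real(kind=8), intent(out) :: dpx(:,:,:), dpy(:,:,:)
  integer :: iblk, i, j

  ! Single kernel launch spanning all blocks; arrays are assumed to be
  ! already present on the device from an enclosing data region.
  !$acc parallel loop gang present(uvel, vvel, dpx, dpy, ilo, ihi, jlo, jhi)
  do iblk = 1, nblocks
    !$acc loop worker
    do j = jlo(iblk), jhi(iblk)
      !$acc loop vector
      do i = ilo(iblk), ihi(iblk)
        dpx(i,j,iblk) = -uvel(i,j,iblk) * dt
        dpy(i,j,iblk) = -vvel(i,j,iblk) * dt
      enddo
    enddo
  enddo
end subroutine departure_points_all

This replaces nblocks separate launches of departure_points with one launch of departure_points_all.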
Reduce kernel launches – code fusion
Combine code and loops into a single kernel

do nt = 1,ntrace
  if ( tracer_type(nt) == 1 ) then
    call limited_gradient(…,mxav, myav,…)
    mtxav = funcx(i,j)
    mtyav = funcy(i,j)
    :
  else if ( tracer_type(nt) == 2 ) then
    nt1 = depend(nt)
    call limited_gradient(…,mtxav(nt1), mtyav(nt1),…)
  else if ( tracer_type(nt) == 3 ) then
  endif
enddo
Reduced kernel launches
nvprof profiler shows for rev 69:

======== API calls:
Time(%)      Time    Calls        Avg        Min        Max  Name
 74.13%  2.36824s        1   2.36824s   2.36824s   2.36824s  cudaFree
 10.85%  346.63ms     1052   329.49us   1.5490us   1.6895ms  cuEventSynchronize
  5.10%  162.87ms      434   375.28us   1.7690us   3.6578ms  cuStreamSynchronize
  3.79%  121.04ms        4   30.260ms   27.768ms   35.218ms  cuMemHostAlloc
  1.86%  59.298ms    44036   1.3460us      233ns   129.30us  cuPointerGetAttribute
  1.55%  49.416ms      498   99.229us      190ns   4.3737ms  cuDeviceGetAttribute
  0.84%  26.792ms     1824   14.688us   8.7890us   230.45us  cuLaunchKernel
  0.51%  16.285ms       37   440.14us   4.6900us   9.0301ms  cuMemAlloc
Reduce kernel launches – code fusion
Fused code becomes:

call fused_limited_gradient_indep(mxav, myav, mtxav, mtyav,…)
call fused_limited_gradient_tracer2(mtxav, mtyav,…)

Disadvantage: code duplication for variants of the limited_gradient subroutine
Dependencies between tracers in construct_fields limited kernel concurrency
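To make the fusion idea concrete, a minimal before/after sketch; grad_x, grad_y, funcx, and funcy are placeholder arrays standing in for the limited_gradient results and the tracer functions, not the actual CICE code:

! Before fusion: two separate kernel launches
!$acc parallel loop collapse(2)
do j = 1, ny
  do i = 1, nx
    mxav(i,j) = grad_x(i,j)
    myav(i,j) = grad_y(i,j)
  enddo
enddo
!$acc parallel loop collapse(2)
do j = 1, ny
  do i = 1, nx
    mtxav(i,j) = funcx(i,j)
    mtyav(i,j) = funcy(i,j)
  enddo
enddo

! After fusion: one launch does both updates
!$acc parallel loop collapse(2)
do j = 1, ny
  do i = 1, nx
    mxav(i,j)  = grad_x(i,j)
    myav(i,j)  = grad_y(i,j)
    mtxav(i,j) = funcx(i,j)
    mtyav(i,j) = funcy(i,j)
  enddo
enddo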
GPU Affinity
Bind each MPI rank to a device on the same socket
Using the hwloc library:
Get the list of devices on the local socket
Set the device in round-robin fashion
Can also restrict the list using the environment variable CICE_GPU_DEVICES
Used to restrict K80s to use only one of the devices on the card
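A simplified sketch of the round-robin binding; it uses the node-local MPI rank and the OpenACC runtime rather than the hwloc socket query described above, and it ignores the CICE_GPU_DEVICES filtering:

subroutine bind_gpu_round_robin()
  use mpi
  use openacc
  implicit none
  integer :: node_comm, local_rank, ndev, ierr

  ! Node-local rank via a shared-memory sub-communicator
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  call MPI_Comm_rank(node_comm, local_rank, ierr)

  ! Round-robin assignment over the devices visible to this process
  ndev = acc_get_num_devices(acc_device_nvidia)
  call acc_set_device_num(mod(local_rank, ndev), acc_device_nvidia)

  call MPI_Comm_free(node_comm, ierr)
end subroutine bind_gpu_round_robin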
Outline Current status What we did GPUDirect for MPI communication Strategy for going forward Tools: profiling, mercurial Methodologies: sandbox, unit testing, incremental changes
GPUDirect
Network adapter reads/writes directly from GPU memory
Requires a supporting network adapter (Mellanox InfiniBand) and a Tesla-class card
Introduced in CUDA 5.0
[Diagram: CPU, GPU, InfiniBand adapter, chipset, PCI-e]
GPUDirect
Allows bypass of copies to/from CPU memory for MPI communication
If the MPI library and network interconnect support it
Modified halo updates to use GPUDirect
Added additional buffering capability to aggregate into larger messages:
ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd
Pushed buffering to CPU halo updates
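A minimal sketch of the GPUDirect-style exchange, assuming the aggregated send/receive buffers already live on the device; the buffer and variable names (sendBuf, recvBuf, bufSize, nbrRank, tag, req) are illustrative, not the actual ice_halo internals:

! Pass device addresses of the buffers straight to MPI (GPUDirect path)
!$acc host_data use_device(sendBuf, recvBuf)
call MPI_Irecv(recvBuf, bufSize, MPI_REAL8, nbrRank, tag, &
               MPI_COMM_WORLD, req(1), ierr)
call MPI_Isend(sendBuf, bufSize, MPI_REAL8, nbrRank, tag, &
               MPI_COMM_WORLD, req(2), ierr)
!$acc end host_data
call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)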
Buffered halo updates
ice_haloBegin, ice_haloAddUpdate, ice_haloFlush, ice_haloEnd

call ice_haloBegin(haloInfo, num_fields, updateInfo)
call ice_haloAddUpdate(dpx,…)
call ice_haloAddUpdate(dpy,…)
call ice_haloFlush(haloInfo, updateInfo)
call ice_haloAddUpdate(mx,…)
call ice_haloEnd(haloInfo, updateInfo)

We refactored some CPU halo updates to use the new buffered scheme.
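A hedged sketch of the aggregation idea behind ice_haloAddUpdate: each added field is packed into a shared device-resident buffer so that one larger message can be sent later. The names (pack_halo_field, bufData, bufOffset) and the single-column packing are illustrative placeholders, not the real ice_halo implementation:

subroutine pack_halo_field(field, ny, bufData, bufOffset)
  ! Append one field's east-edge column to the aggregated device buffer
  integer, intent(in) :: ny
  real(kind=8), intent(in)    :: field(:,:)
  real(kind=8), intent(inout) :: bufData(:)
  integer, intent(inout) :: bufOffset
  integer :: j, icol

  icol = size(field, 1) - 1      ! last interior column (illustrative)
  ! Assumes field and bufData are already on the device
  !$acc parallel loop present(field, bufData)
  do j = 1, ny
    bufData(bufOffset + j) = field(icol, j)
  enddo
  bufOffset = bufOffset + ny
end subroutine pack_halo_field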
Outline Current status What we did GPUDirect for MPI communication Strategy for going forward Tools: profiling, mercurial Methodologies: sandbox, unit testing, incremental changes
Going Forward
More physics is being added to the column physics, so we will revisit it with OpenACC
No halo updates during those computations
Explore task parallelism
Can we restructure code execution to expose tasks that translate to GPU kernels?
Expand GPU acceleration to other parts of CICE
Tools: profiling, mercurial
Methodologies: sandbox, unit testing, incremental changes