CUDA Fortran Programming with the IBM XL Fortran Compiler
Rafik Zurob, XL Fortran FE Development, IBM Canada Lab
Agenda
  - Introduce some features that make CUDA Fortran easier to use than CUDA C
  - Introduce XL Fortran’s support for CUDA Fortran
  - Compare CUDA programming with OpenMP 4.0 / 4.5 programming
8/29/2019
CUDA Fortran
  - CUDA Fortran is a set of extensions to the Fortran programming language that allow access to the GPU
  - Created by the Portland Group and NVIDIA in 2009-2010
  - Provides seamless integration of CUDA into Fortran declarations and statements
  - Functionally equivalent to CUDA C
A Quick Introduction to CUDA Fortran
  - The device memory type is specified via attributes
  - Procedure prefixes specify procedure targets
  CUDA Fortran:
      module m
        real, constant :: r_c
        integer, device :: i_d
      contains
        attributes(global) subroutine kernel(arg)
          integer arg(*)
          integer, shared :: i_s
        end subroutine
      end module

      program main
        real, device :: r_d(10)
        real, managed :: r_m
      end program
A Quick Introduction to CUDA Fortran
  - Allocation and deallocation are automatic for local device variables on the host
  - The memory type of the target is part of the declaration
  - The compiler calls the right cudaMalloc and cudaFree functions
  CUDA C:
      void sub() {
        float *r_d, *r_m;
        cudaMalloc(&r_d, 40);
        cudaMallocManaged(&r_m, 4, cudaMemAttachGlobal);
        // …
        cudaFree(r_d);
        cudaFree(r_m);
      }
  CUDA Fortran:
      subroutine sub
        real, device :: r_d(10)
        real, managed :: r_m
        ! …
      end subroutine
A Quick Introduction to CUDA Fortran
  - Allocation, deallocation, and pointer assignment are done via standard Fortran syntax
  CUDA Fortran:
      program main
        real, device, allocatable, target :: r_d(:)
        real, device, pointer :: p_d(:)
        allocate(r_d(10))
        p_d => r_d
        nullify(p_d)
        deallocate(r_d)
      end program
A Quick Introduction to CUDA Fortran
  - Data transfer can be done via Fortran assignment
  - No knowledge of the CUDA API is required for data transfer
  - The program is easier to understand
  CUDA C:
      int main() {
        int *r_d, r_h[10];
        …
        cudaMemset(r_d, -1, 40);
        cudaMemcpy(r_h, r_d, 40, cudaMemcpyDefault);
        for (int i = 0; i < 10; ++i)
          r_h[i] -= 1;
      }
  CUDA Fortran:
      program main
        integer, device :: r_d(10)
        integer :: r_h(10)
        r_d = -1
        r_h = r_d - 1
      end program
A Quick Introduction to CUDA Fortran
  - Asynchronous data transfer can be done via Fortran assignment with a different default stream
  CUDA Fortran:
      program main
        use cudafor                               ! Gives access to the CUDA Runtime API
        real, device :: r_d(10)
        real, allocatable, pinned :: r_h(:)
        integer(cuda_stream_kind) stream
        integer istat
        allocate(r_h(10))
        istat = cudaStreamCreate(stream)
        istat = cudaforSetDefaultStream(stream)   ! Changes the default stream used by assignment and kernel calls
        r_d = 1                                   ! Assignment is now asynchronous
        r_h = r_d + 1
      end program
A Quick Introduction to CUDA Fortran
  - The CUDA Runtime API is also available from CUDA Fortran
  - Fortran modules providing bind(c) interfaces for the CUDA C libraries are available, e.g. libdevice, libm, cublas, cufft, cusparse, and curand
XL Fortran
  - XL Fortran is a full-featured Fortran compiler that has been targeting the POWER platform since 1990
  - One of the first compilers to offer full Fortran 2003 support
  - Supports a large subset of Fortran 2008 and TS 29113
  - The compiler team works closely with the POWER hardware team
  - XL Fortran takes maximum advantage of IBM processor technology as it becomes available
  - The XL compiler products use common optimizer and backend technology
CUDA Fortran support in XL Fortran
  [Figure: compilation pipeline; stage order reconstructed from the diagram labels]
  - Common path: Fortran source -> XL Fortran Frontend -> W-Code (XL IR) -> High-Level Optimizer (data flow, loop, other optimizations) -> W-Code -> CPU/GPU W-Code Partitioner
  - Host path: CPU W-Code -> POWER Low-level Optimizer (low-level optimizations, POWER code generation, register allocation + scheduling) -> System Linker
  - Device path: GPU W-Code -> W-Code to LLVM IR translator -> NVVM IR -> libNVVM (LLVM Optimizer, PTX CodeGen) -> PTX -> PTX Assembler -> nvlink (with XL Device Libraries and CUDA Device Libraries) -> System Linker
  - Link step: System Linker (with CUDA Runtime Libraries and CUDA Driver) -> Executable for POWER system / GPU
  Notes:
  - XL’s optimizer sees both host and device code
  - CPU code is aggressively optimized for POWER
  - CUDA Toolkit optimizes device code
  - NVVM = LLVM with NVIDIA enhancements
Usability Enhancements in XL Fortran
  - CUDA C programs calling CUDA API functions usually wrap the call in a macro that checks the return code
  - XL Fortran can automatically check API functions: -qcudaerr = [ none | stmts | all ]
    - Default is to check API calls made by the compiler
    - Can check all CUDA Runtime API functions
    - Can check user-defined functions you designate as returning CUDA API return codes
  - Checking does not clear the CUDA error state, so it won’t interfere with programs that do their own checking
  - Checking can stop the program on error or just warn
Error Checking Example
  CUDA Fortran:
      module m
      contains
        attributes(global) subroutine kernel()
        end subroutine
      end module

      program main
        use m
        integer num_threads
        read(*, *) num_threads
        call kernel<<<1, num_threads>>>()
      end program
  Output:
      $ ./a.out
      2048
      "largegrid.cuf", line 11: 1525-244 API function cudaLaunchKernel failed with error code 9: invalid configuration argument.
  - The launch fails because 2048 threads per block exceeds the 1024-thread limit of current NVIDIA GPUs
OpenMP 4.0 / 4.5
  - A device-independent GPU programming model
  - Raises the level of abstraction
  - Shifts the burden of hardware exploitation to compiler toolchains
  - Increases programmer productivity
  - BUT: You need a good optimizing compiler
OpenMP 4.0 / 4.5
  Target construct
  - Instructs the compiler to attempt to offload the contents of the region
  - The compiler will outline the contents of the target region into device and host procedures
  - The OpenMP runtime library will invoke a kernel that calls the device procedure
  - The host procedure will only be called if the device procedure couldn’t be executed
  OpenMP:
      !$omp target map(tofrom: C) map(to: A, B)
      …
      !$omp end target
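The target construct and its map clauses can be sketched in C, the language used for the OpenMP build later in this deck. This is a minimal illustration, not code from the presentation: the function name fused_multiply_add and the loop body are assumptions. If no device is available, or the compiler is run without OpenMP support, the region simply runs on the host.

```c
/* Hypothetical example: c[i] += a[i] * b[i], offloaded with a target region. */
void fused_multiply_add(const double *a, const double *b, double *c, int n)
{
    /* map(to: ...) copies a and b to the device on entry.
       map(tofrom: ...) copies c to the device on entry and back on exit.
       Array sections [0:n] are required because a, b and c are pointers. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(tofrom: c[0:n])
    {
        for (int i = 0; i < n; ++i)
            c[i] += a[i] * b[i];
    }
}
```

Without teams or parallel constructs inside it, the region runs on a single device thread, which is why the constructs on the following slides are needed.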
OpenMP 4.0 / 4.5
  Teams construct
  - Creates a league of thread teams; the master thread of each team executes the region
  - In a target region running on the device, teams are similar to hardware blocks of threads
  OpenMP:
      !$omp teams num_teams(nteams) thread_limit(nthreads)
      …
      !$omp end teams
OpenMP 4.0 / 4.5
  Distribute construct
  - Distributes the execution of the iterations of one or more loops between the master threads of all teams in an enclosing teams construct
  OpenMP:
      !$omp distribute
      do i = 1, N
        …
      end do
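The teams and distribute constructs are commonly combined with target in a single directive. The following C sketch shows that combination under stated assumptions: the function name scale and its parameters are hypothetical, and the num_teams and thread_limit values are requests that an implementation may adjust.

```c
/* Hypothetical example: scale a vector in place, one chunk of iterations per team. */
void scale(float *x, float alpha, int n, int nteams, int nthreads)
{
    /* target:     offload the loop to the device (host fallback otherwise)
       teams:      create a league of up to nteams teams
       distribute: split the loop iterations across the team master threads */
    #pragma omp target teams distribute \
            num_teams(nteams) thread_limit(nthreads) \
            map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= alpha;
}
```

Only the master thread of each team executes iterations here. To also use the other threads within each team, the combined form `distribute parallel for` can be used instead.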
Example
  - Consider 4096 x 4096 matrix multiplication
  - The most common source optimization for the device is tiling
  - Tiling, also known as blocking, refers to partitioning data into smaller blocks that are easier to work with
    - Tiles can be assigned to blocks of threads (CUDA Fortran) or teams (OpenMP)
    - Tiles can be stored in faster memory or in contiguous temporaries
    - This can be done in both CUDA and OpenMP
  - On NVIDIA GPUs, we can use shared memory to increase reuse between threads
    - In CUDA, this can be done by the user
    - In OpenMP, this has to be done by the compiler
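The loop structure of tiling can be sketched in plain C. This is a host-only illustration and not the benchmark code from the presentation: the name matmul_tiled, the row-major layout, and the tile size TILE are assumptions. On a GPU, each (ii, jj) tile would typically be assigned to a block of threads or a team, with the tile data staged in shared memory.

```c
#define TILE 4  /* hypothetical tile edge; real codes tune this per device */

/* C = A * B for n x n row-major matrices, computed tile by tile. */
void matmul_tiled(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n * n; ++i)
        c[i] = 0.0;

    /* Outer loops walk over tiles; inner loops work within one tile. */
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                int i_end = ii + TILE < n ? ii + TILE : n;
                int j_end = jj + TILE < n ? jj + TILE : n;
                int k_end = kk + TILE < n ? kk + TILE : n;
                for (int i = ii; i < i_end; ++i)
                    for (int k = kk; k < k_end; ++k) {
                        double a_ik = a[i * n + k];
                        for (int j = jj; j < j_end; ++j)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
            }
}
```

Each tile of A and B is reused across a whole tile of C before the loops move on, which is the reuse that shared memory exploits on the device.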
Example
  CUDA C:       nvcc -O3 -arch=sm_35
  CUDA Fortran: xlcuf -Ofast
  OpenMP:       xlc -qsmp=omp -qoffload -Ofast
XL’s Optimizer makes a difference
OpenMP performance depends on compiler
  - Data obtained from untuned alpha compilers
  - Data illustrates that compiler optimizations can sometimes backfire
  - With OpenMP, performance is highly dependent on having a tuned compiler
Example
  - Data obtained from an untuned research compiler from November 2015
  - Note that OpenMP performance has improved significantly since this data was obtained
  - This again illustrates the dependency on having a well-tuned compiler
Summary
  - CUDA Fortran is a high-level GPU programming language that is functionally equivalent to CUDA C
  - CUDA Fortran users can take full advantage of the NVIDIA GPU
  - XL Fortran 15.1.4 provides support for a commonly used subset of CUDA Fortran, while maintaining its industry-leading CPU optimization
  - OpenMP 4.5 is a high-level, portable programming model that is dependent on having good compilers
Questions?
Additional Information
  - XL POWER compilers community: http://ibm.biz/xl-power-compilers
  - XL Fortran home page: http://www.ibm.com/software/products/en/fortcompfami/
  - XL Fortran Café: https://ibm.biz/Bdx8XX
  - XL Fortran F2008 Status: https://ibm.biz/Fortran2008Status
  - XL Fortran TS29113 Status: https://ibm.biz/FortranTS29113Status