CUDA Fortran Programming with the IBM XL Fortran Compiler

1 CUDA Fortran Programming with the IBM XL Fortran Compiler
Rafik Zurob, XL Fortran FE Development, IBM Canada Lab

2 Agenda
- Introduce some features that make CUDA Fortran easier to use than CUDA C
- Introduce XL Fortran's support for CUDA Fortran
- Compare CUDA programming with OpenMP 4.0 / 4.5 programming

3 CUDA Fortran
- CUDA Fortran is a set of extensions to the Fortran programming language that allow access to the GPU
- Created by the Portland Group and NVIDIA
- Provides seamless integration of CUDA into Fortran declarations and statements
- Functionally equivalent to CUDA C

4 A Quick Introduction to CUDA Fortran
The device memory type is specified via attributes. Procedure prefixes specify procedure targets.

CUDA Fortran:

    module m
      real, constant :: r_c
      integer, device :: i_d
    contains
      attributes(global) subroutine kernel(arg)
        integer arg(*)
        integer, shared :: i_s
      end subroutine
    end module

    program main
      real, device :: r_d(10)
      real, managed :: r_m
    end program

5 A Quick Introduction to CUDA Fortran
Allocation and deallocation are automatic for local device variables on the host. The memory type of the target is part of the declaration, so the compiler calls the right cudaMalloc and cudaFree functions.

CUDA C:

    void sub() {
      float *r_d, *r_m;
      cudaMalloc(&r_d, 40);
      cudaMallocManaged(&r_m, 4, cudaMemAttachGlobal);
      // ...
      cudaFree(r_d);
      cudaFree(r_m);
    }

CUDA Fortran:

    subroutine sub
      real, device :: r_d(10)
      real, managed :: r_m
      ! ...
    end subroutine

6 A Quick Introduction to CUDA Fortran
Allocation, deallocation, and pointer assignment are done via standard Fortran syntax.

CUDA Fortran:

    program main
      real, device, allocatable, target :: r_d(:)
      real, device, pointer :: p_d(:)
      allocate(r_d(10))
      p_d => r_d
      nullify(p_d)
      deallocate(r_d)
    end program

7 A Quick Introduction to CUDA Fortran
Data transfer can be done via Fortran assignment. No knowledge of the CUDA API is required for data transfer, and the program is easier to understand.

CUDA C:

    int main() {
      int *r_d, r_h[10];
      cudaMalloc((void **)&r_d, 40);  /* allocate 10 ints on the device */
      cudaMemset(r_d, -1, 40);
      cudaMemcpy(r_h, r_d, 40, cudaMemcpyDefault);
      for (int i = 0; i < 10; ++i)
        r_h[i] -= 1;
    }

CUDA Fortran:

    program main
      integer, device :: r_d(10)
      integer :: r_h(10)
      r_d = -1
      r_h = r_d - 1
    end program

8 A Quick Introduction to CUDA Fortran
Asynchronous data transfer can be done via Fortran assignment with a different default stream.

CUDA Fortran:

    program main
      use cudafor   ! gives access to the CUDA Runtime API
      real, device :: r_d(10)
      real, allocatable, pinned :: r_h(:)
      integer(cuda_stream_kind) stream
      integer istat
      allocate(r_h(10))
      istat = cudaStreamCreate(stream)
      istat = cudaforSetDefaultStream(stream)  ! changes the default stream used by assignment and kernel calls
      r_d = 1
      r_h = r_d + 1   ! assignment is now asynchronous
    end program
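Because the transfers now run asynchronously on a non-default stream, the host should wait for them to complete before reading r_h. A minimal sketch of the extra step, added before end program and assuming the cudafor stream API:

    istat = cudaStreamSynchronize(stream)  ! wait for the asynchronous transfer to finish
    print *, r_h(1)                        ! now safe to read on the host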

9 A Quick Introduction to CUDA Fortran
The CUDA Runtime API is also available from CUDA Fortran. Fortran modules providing bind(c) interfaces for the CUDA C libraries are available, e.g. libdevice, libm, cublas, cufft, cusparse, and curand (a sketch of one such call follows).
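As an illustration, here is a minimal sketch of calling cuBLAS through such a module. It assumes a PGI-style cublas module whose cublasSgemm interface accepts device arrays; the exact module and interface names may differ between implementations.

    ! A sketch of calling cuBLAS sgemm on device arrays from CUDA Fortran.
    ! Assumes a cublas module with a legacy-BLAS-style cublasSgemm interface.
    program use_cublas
      use cublas
      integer, parameter :: n = 1024
      real, device :: A_d(n,n), B_d(n,n), C_d(n,n)
      A_d = 1.0
      B_d = 2.0
      C_d = 0.0
      ! C_d = 1.0 * A_d * B_d + 0.0 * C_d
      call cublasSgemm('N', 'N', n, n, n, 1.0, A_d, n, B_d, n, 0.0, C_d, n)
    end program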

10 XL Fortran
- XL Fortran is a full-featured Fortran compiler that has been targeting the POWER platform since 1990
- One of the first compilers to offer full Fortran 2003 support
- Supports a large subset of Fortran 2008 and TS 29113
- The compiler team works closely with the POWER hardware team; XL Fortran takes maximum advantage of IBM processor technology as it becomes available
- The XL compiler products use common optimizer and backend technology

11 CUDA Fortran support in XL Fortran
The compilation pipeline (from the slide's architecture diagram):

- Fortran source → XL Fortran Frontend → W-Code (XL IR)
- High-Level Optimizer: data flow, loop, and other optimizations; XL's optimizer sees both host and device code
- CPU/GPU W-Code Partitioner splits the W-Code into CPU W-Code and GPU W-Code
- CPU path: POWER Low-level Optimizer (low-level optimizations) → POWER code generation (register allocation + scheduling); CPU code is aggressively optimized for POWER
- GPU path: W-Code to LLVM IR translator → NVVM IR → libNVVM (LLVM Optimizer, PTX CodeGen; NVVM = LLVM with NVIDIA enhancements) → PTX → PTX Assembler → nvlink with the XL device libraries and the CUDA device libraries; the CUDA Toolkit optimizes device code
- The system linker combines both paths with the CUDA runtime libraries and the CUDA driver into an executable for a POWER system with GPUs

12 Usability Enhancements in XL Fortran
CUDA C programs calling CUDA API functions usually wrap the call in a macro that checks the return code. XL Fortran can check API functions automatically:

- -qcudaerr=[none | stmts | all]
- The default is to check API calls made by the compiler
- Can check all CUDA Runtime API functions
- Can check user-defined functions you designate as returning CUDA API return codes
- Checking does not clear the CUDA error state, so it won't interfere with programs that do their own checking (a manual-check sketch follows)
- Checking can stop the program on error or just warn
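For comparison, a minimal sketch of the manual checking that -qcudaerr automates, assuming the cudafor module's error-query routines:

    ! Manual error checking in CUDA Fortran; this is the pattern
    ! that -qcudaerr performs automatically.
    program manual_check
      use cudafor
      integer :: istat
      istat = cudaDeviceSynchronize()
      if (istat /= cudaSuccess) then
        write(*, *) 'CUDA error ', istat, ': ', cudaGetErrorString(istat)
        stop 1
      end if
    end program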

13 Error Checking Example
CUDA Fortran:

    module m
    contains
      attributes(global) subroutine kernel()
      end subroutine
    end module

    program main
      use m
      integer num_threads
      read(*, *) num_threads
      call kernel<<<1, num_threads>>>()
    end program

Running it with a thread count above the hardware limit:

    $ ./a.out
    2048
    "largegrid.cuf", line 11: API function cudaLaunchKernel failed with error code 9: invalid configuration argument.

14 OpenMP 4.0 / 4.5: A device-independent GPU programming model

- Raises the level of abstraction
- Shifts the burden of hardware exploitation to compiler toolchains
- Increases programmer productivity
- BUT: you need a good optimizing compiler

15 OpenMP 4.0 / 4.5: Target construct

- Instructs the compiler to attempt to offload the contents of the region
- The compiler will outline the contents of the target region into device and host procedures
- The OpenMP runtime library will invoke a kernel that calls the device procedure
- The host procedure will only be called if the device procedure couldn't be executed

OpenMP:

    !$omp target map(tofrom: C) map(to: A, B)
    ...
    !$omp end target

16 OpenMP 4.0 / 4.5: Teams construct

- Creates a league of thread teams; the master thread of each team executes the teams region
- In a target region running on the device, teams are similar to hardware blocks of threads

OpenMP:

    !$omp teams num_teams(nteams) thread_limit(nthreads)
    !$omp end teams

17 OpenMP 4.0 / 4.5: Distribute construct

- Distributes the execution of the iterations of one or more loops among the master threads of all teams in an enclosing teams construct (a combined example follows)

OpenMP:

    !$omp distribute
    do i = 1, N
      ...
    end do
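Putting the three constructs together, a minimal sketch of an offloaded matrix multiply using the combined target teams distribute parallel do construct; the subroutine name and loop structure are illustrative, not from the slides:

    ! A sketch combining target, teams, and distribute for matrix multiply.
    ! Assumes C has been initialized by the caller.
    subroutine mm_offload(A, B, C, n)
      integer, intent(in) :: n
      real, intent(in) :: A(n,n), B(n,n)
      real, intent(inout) :: C(n,n)
      integer :: i, j, k
      !$omp target teams distribute parallel do collapse(2) &
      !$omp&  map(to: A, B) map(tofrom: C)
      do j = 1, n
        do i = 1, n
          do k = 1, n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
        end do
      end do
      !$omp end target teams distribute parallel do
    end subroutine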

18 Example: Consider 4096 x 4096 matrix multiplication

- The most common source optimization for the device is tiling
- Tiling, also known as blocking, refers to partitioning data into smaller blocks that are easier to work with
- Tiles can be assigned to blocks of threads (CUDA Fortran) or teams (OpenMP)
- Tiles can be stored in faster memory or in contiguous temporaries; this can be done in both CUDA and OpenMP
- On NVIDIA GPUs, shared memory can be used to increase reuse between threads; in CUDA this can be done by the user, while in OpenMP it has to be done by the compiler (a tiled kernel sketch follows)
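To make the CUDA side concrete, here is a minimal sketch of a shared-memory tiled matrix multiply kernel in CUDA Fortran. The module and kernel names, the tile size, and the assumption that n is a multiple of the tile size are all illustrative:

    ! A sketch of a tiled matrix multiply kernel using shared memory.
    ! Assumes n is a multiple of TILE and TILE x TILE thread blocks.
    module tiled_mm
      integer, parameter :: TILE = 32
    contains
      attributes(global) subroutine matmul_tiled(A, B, C, n)
        integer, value :: n
        real :: A(n,n), B(n,n), C(n,n)
        real, shared :: As(TILE,TILE), Bs(TILE,TILE)
        integer :: i, j, kb, k
        real :: s
        i = (blockIdx%x - 1) * TILE + threadIdx%x   ! global row
        j = (blockIdx%y - 1) * TILE + threadIdx%y   ! global column
        s = 0.0
        do kb = 0, n/TILE - 1
          ! each thread stages one element of the A and B tiles
          As(threadIdx%x, threadIdx%y) = A(i, kb*TILE + threadIdx%y)
          Bs(threadIdx%x, threadIdx%y) = B(kb*TILE + threadIdx%x, j)
          call syncthreads()
          do k = 1, TILE
            s = s + As(threadIdx%x, k) * Bs(k, threadIdx%y)
          end do
          call syncthreads()
        end do
        C(i, j) = s
      end subroutine
    end module

A host would launch it with something like call matmul_tiled<<<dim3(n/TILE, n/TILE, 1), dim3(TILE, TILE, 1)>>>(A_d, B_d, C_d, n), with A_d, B_d, and C_d declared as device arrays.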

19 Example: compiler invocations

- CUDA C: nvcc -O3 -arch=sm_35
- CUDA Fortran: xlcuf -Ofast
- OpenMP: xlc -qsmp=omp -qoffload -Ofast

20 XL’s Optimizer makes a difference

21 OpenMP performance depends on the compiler

Data obtained from untuned alpha compilers. The data illustrates that compiler optimizations can sometimes backfire; with OpenMP, performance is highly dependent on having a tuned compiler.

22 Example: Data obtained from an untuned research compiler from November. OpenMP performance has improved significantly since this data was obtained, which again illustrates the dependency on having a well-tuned compiler.

23 Summary

- CUDA Fortran is a high-level GPU programming language that is functionally equivalent to CUDA C
- CUDA Fortran users can take full advantage of the NVIDIA GPU
- XL Fortran provides support for a commonly used subset of CUDA Fortran, while maintaining its industry-leading CPU optimization
- OpenMP 4.5 is a high-level, portable programming model that is dependent on having good compilers

24 Questions?

25 Additional Information
- XL POWER compilers community
- XL Fortran home page
- XL Fortran Café
- XL Fortran F2008 Status
- XL Fortran TS29113 Status

