CUDA Fortran Programming with the IBM XL Fortran Compiler


CUDA Fortran Programming with the IBM XL Fortran Compiler
Rafik Zurob, XL Fortran FE Development, IBM Canada Lab

Agenda
- Introduce some features that make CUDA Fortran easier to use than CUDA C
- Introduce XL Fortran's support for CUDA Fortran
- Compare CUDA programming with OpenMP 4.0 / 4.5 programming

8/29/2019

CUDA Fortran
- CUDA Fortran is a set of extensions to the Fortran programming language that allow access to the GPU
- Created by the Portland Group and NVIDIA in 2009-2010
- Provides seamless integration of CUDA into Fortran declarations and statements
- Functionally equivalent to CUDA C
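To give a flavor of that integration, the classic SAXPY kernel can be written entirely in Fortran. The sketch below is illustrative (the names mathOps and saxpy are mine, not from the slides); note that thread and block indices are 1-based in CUDA Fortran:

```fortran
module mathOps
contains
  ! Device kernel: y = a*x + y, one element per thread
  attributes(global) subroutine saxpy(n, a, x, y)
    integer, value :: n
    real, value :: a
    real :: x(*), y(*)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine
end module

program main
  use cudafor
  use mathOps
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  real, device :: x_d(n), y_d(n)   ! device copies
  x = 1.0; y = 2.0
  x_d = x; y_d = y                 ! host-to-device copy via assignment
  call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
  y = y_d                          ! device-to-host copy via assignment
  print *, maxval(abs(y - 4.0))    ! expect 0.0
end program
```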

A Quick Introduction to CUDA Fortran
- The device memory type is specified via attributes
- Procedure prefixes specify procedure targets

CUDA Fortran:

    module m
      real, constant :: r_c
      integer, device :: i_d
    contains
      attributes(global) subroutine kernel(arg)
        integer arg(*)
        integer, shared :: i_s
      end subroutine
    end module

    program main
      real, device :: r_d(10)
      real, managed :: r_m
    end program

A Quick Introduction to CUDA Fortran
- Allocation and deallocation are automatic for local device variables on the host
- The memory type of the target is part of the declaration
- The compiler calls the right cudaMalloc and cudaFree functions

CUDA C:

    void sub() {
      float *r_d, *r_m;
      cudaMalloc(&r_d, 40);
      cudaMallocManaged(&r_m, 4, cudaMemAttachGlobal);
      // ...
      cudaFree(r_d);
      cudaFree(r_m);
    }

CUDA Fortran:

    subroutine sub
      real, device :: r_d(10)
      real, managed :: r_m
      ! ...
    end subroutine

A Quick Introduction to CUDA Fortran
- Allocation, deallocation, and pointer assignment are done via standard Fortran syntax

CUDA Fortran:

    program main
      real, device, allocatable, target :: r_d(:)
      real, device, pointer :: p_d(:)
      allocate(r_d(10))
      p_d => r_d
      nullify(p_d)
      deallocate(r_d)
    end program

A Quick Introduction to CUDA Fortran
- Data transfer can be done via Fortran assignment
- No knowledge of the CUDA API is required for data transfer
- The program is easier to understand

CUDA C:

    int main() {
      int *r_d, r_h[10];
      // ...
      cudaMemset(r_d, -1, 40);
      cudaMemcpy(r_h, r_d, 40, cudaMemcpyDefault);
      for (int i = 0; i < 10; ++i)
        r_h[i] -= 1;
    }

CUDA Fortran:

    program main
      integer, device :: r_d(10)
      integer :: r_h(10)
      r_d = -1
      r_h = r_d - 1
    end program

A Quick Introduction to CUDA Fortran
- Asynchronous data transfer can be done via Fortran assignment with a different default stream

CUDA Fortran:

    program main
      use cudafor                  ! gives access to the CUDA Runtime API
      real, device :: r_d(10)
      real, allocatable, pinned :: r_h(:)
      integer(cuda_stream_kind) stream
      integer istat
      allocate(r_h(10))
      istat = cudaStreamCreate(stream)
      ! Changes the default stream used by assignment and kernel calls
      istat = cudaforSetDefaultStream(stream)
      r_d = 1
      r_h = r_d + 1                ! assignment is now asynchronous
    end program

A Quick Introduction to CUDA Fortran
- The CUDA Runtime API is also available from CUDA Fortran
- Fortran modules providing bind(c) interfaces to the CUDA C libraries are available, e.g. libdevice, libm, cublas, cufft, cusparse, and curand
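As an illustration of those library modules, the cublas module shipped with CUDA Fortran compilers overloads the standard BLAS names for device arrays. The sketch below assumes that overloaded sgemm interface (as documented for the PGI/NVIDIA cublas module); no explicit handle or memcpy plumbing is needed:

```fortran
program gemm_example
  use cublas                 ! bind(c) interfaces to cuBLAS
  integer, parameter :: n = 512
  real, device :: a_d(n,n), b_d(n,n), c_d(n,n)
  real :: c(n,n)
  a_d = 1.0; b_d = 2.0; c_d = 0.0
  ! Overloaded BLAS interface: operates directly on device arrays
  call sgemm('n', 'n', n, n, n, 1.0, a_d, n, b_d, n, 0.0, c_d, n)
  c = c_d                    ! copy the result back via assignment
end program
```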

XL Fortran
- XL Fortran is a full-featured Fortran compiler that has been targeting the POWER platform since 1990
- One of the first compilers to offer full Fortran 2003 support
- Supports a large subset of Fortran 2008 and TS 29113
- The compiler team works closely with the POWER hardware team, so XL Fortran takes maximum advantage of IBM processor technology as it becomes available
- The XL compiler products use common optimizer and backend technology

CUDA Fortran support in XL Fortran

[Compiler pipeline diagram] Fortran source enters the XL Fortran frontend, which produces W-Code (the XL IR). The high-level optimizer (data flow, loop, and other optimizations) sees both host and device code, so CPU code is aggressively optimized for POWER. A CPU/GPU W-Code partitioner then splits the W-Code:
- CPU W-Code goes to the POWER low-level optimizer for low-level optimizations, POWER code generation, and register allocation and scheduling.
- GPU W-Code goes through a W-Code to LLVM IR translator into NVVM IR (NVVM = LLVM with NVIDIA enhancements), then through libNVVM's LLVM optimizer and PTX code generator, and finally the PTX assembler; the CUDA Toolkit optimizes the device code, and nvlink links it against the XL device libraries and CUDA device libraries.
The system linker combines both halves with the CUDA runtime libraries and the CUDA driver to produce an executable for a POWER system with GPU.

Usability Enhancements in XL Fortran
- CUDA C programs calling CUDA API functions usually wrap each call in a macro that checks the return code
- XL Fortran can check API functions automatically: -qcudaerr=[none|stmts|all]
  - The default is to check API calls made by the compiler
  - Can check all CUDA Runtime API functions
  - Can check user-defined functions you designate as returning CUDA API return codes
- Checking does not clear the CUDA error state, so it won't interfere with programs that do their own checking
- Checking can stop the program on error or just warn
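What -qcudaerr automates is the hand-written checking pattern sketched below (the check subroutine and its call sites are illustrative, not from the slides):

```fortran
! Manual error checking that -qcudaerr makes unnecessary
subroutine check(istat)
  use cudafor
  integer, intent(in) :: istat
  if (istat /= cudaSuccess) then
    print *, 'CUDA error: ', cudaGetErrorString(istat)
    stop 1
  end if
end subroutine

! Typical call sites:
!   call check(cudaStreamCreate(stream))
!   call kernel<<<grid, block>>>(...)
!   call check(cudaGetLastError())
```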

Error Checking Example

CUDA Fortran:

    module m
    contains
      attributes(global) subroutine kernel()
      end subroutine
    end module

    program main
      use m
      integer num_threads
      read(*, *) num_threads
      call kernel<<<1, num_threads>>>()
    end program

    $ ./a.out
    2048
    "largegrid.cuf", line 11: 1525-244 API function cudaLaunchKernel failed with error code 9: invalid configuration argument.

(A block of 2048 threads exceeds the hardware limit of 1024 threads per block, so the launch fails, and the compiler-inserted check reports it.)

OpenMP 4.0 / 4.5
- A device-independent GPU programming model
- Raises the level of abstraction
- Shifts the burden of hardware exploitation to compiler toolchains
- Increases programmer productivity
- BUT: you need a good optimizing compiler

OpenMP 4.0 / 4.5: Target construct
- Instructs the compiler to attempt to offload the contents of the region
- The compiler outlines the contents of the target region into device and host procedures
- The OpenMP runtime library invokes a kernel that calls the device procedure
- The host procedure is only called if the device procedure couldn't be executed

OpenMP:

    !$omp target map(tofrom: C) map(to: A, B)
      ...
    !$omp end target

OpenMP 4.0 / 4.5: Teams construct
- Creates a league of thread teams; the region is executed by the master thread of each team
- In a target region running on the device, teams are similar to hardware blocks of threads

OpenMP:

    !$omp teams num_teams(nteams) thread_limit(nthreads)
      ...
    !$omp end teams

OpenMP 4.0 / 4.5: Distribute construct
- Distributes the execution of the iterations of one or more loops among the master threads of all teams in an enclosing teams construct

OpenMP:

    !$omp distribute
    do i = 1, N
      ...
    end do
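In practice the three constructs are usually combined on a single directive. A minimal sketch of an offloaded vector add in Fortran (the subroutine and array names are illustrative):

```fortran
subroutine vadd(n, a, b, c)
  integer, intent(in) :: n
  real, intent(in) :: a(n), b(n)
  real, intent(out) :: c(n)
  integer :: i
  ! Offload to the device, create a league of teams, distribute the
  ! iterations across teams, and parallelize within each team
  !$omp target teams distribute parallel do map(to: a, b) map(from: c)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
end subroutine
```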

Example
- Consider 4096 x 4096 matrix multiplication
- The most common source optimization for the device is tiling
  - Tiling, also known as blocking, refers to partitioning data into smaller blocks that are easier to work with
  - Tiles can be assigned to blocks of threads (CUDA Fortran) or teams (OpenMP)
  - Tiles can be stored in faster memory or in contiguous temporaries
  - This can be done in both CUDA and OpenMP
- On NVIDIA GPUs, we can use shared memory to increase reuse between threads
  - In CUDA, this can be done by the user
  - In OpenMP, this has to be done by the compiler
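The shared-memory tiling described above can be sketched as a CUDA Fortran kernel. This is the textbook pattern rather than the presenter's actual benchmark code; it assumes n is a multiple of the tile size and a launch of dim3(n/16, n/16, 1) blocks of dim3(16, 16, 1) threads:

```fortran
module mmul_mod
  use cudafor
contains
  attributes(global) subroutine mmul_tiled(a, b, c, n)
    integer, value :: n
    real :: a(n,n), b(n,n), c(n,n)
    integer, parameter :: TILE = 16
    real, shared :: as(TILE,TILE), bs(TILE,TILE)
    integer :: i, j, k, kb
    real :: s
    i = (blockIdx%x - 1) * TILE + threadIdx%x   ! global row
    j = (blockIdx%y - 1) * TILE + threadIdx%y   ! global column
    s = 0.0
    do kb = 0, n - 1, TILE
      ! Each thread stages one element of the a and b tiles in shared memory
      as(threadIdx%x, threadIdx%y) = a(i, kb + threadIdx%y)
      bs(threadIdx%x, threadIdx%y) = b(kb + threadIdx%x, j)
      call syncthreads()
      ! Each element staged here is reused by TILE threads of the block
      do k = 1, TILE
        s = s + as(threadIdx%x, k) * bs(k, threadIdx%y)
      end do
      call syncthreads()
    end do
    c(i, j) = s
  end subroutine
end module
```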

Example: compiler invocations
- CUDA C: nvcc -O3 -arch=sm_35
- CUDA Fortran: xlcuf -Ofast
- OpenMP: xlc -qsmp=omp -qoffload -Ofast

XL's Optimizer makes a difference

[Performance chart not reproduced in the transcript.]

OpenMP performance depends on compiler
- Data obtained from untuned alpha compilers
- The data illustrates that compiler optimizations can sometimes backfire
- With OpenMP, performance is highly dependent on having a tuned compiler

Example
- Data obtained from an untuned research compiler from November 2015
- OpenMP performance has improved significantly since this data was obtained
- This again illustrates the dependency on having a well-tuned compiler

Summary
- CUDA Fortran is a high-level GPU programming language that is functionally equivalent to CUDA C
- CUDA Fortran users can take full advantage of the NVIDIA GPU
- XL Fortran 15.1.4 supports a commonly used subset of CUDA Fortran while maintaining its industry-leading CPU optimization
- OpenMP 4.5 is a high-level, portable programming model that depends on having good compilers

Questions?

Additional Information
- XL POWER compilers community: http://ibm.biz/xl-power-compilers
- XL Fortran home page: http://www.ibm.com/software/products/en/fortcompfami/
- XL Fortran Café: https://ibm.biz/Bdx8XX
- XL Fortran F2008 Status: https://ibm.biz/Fortran2008Status
- XL Fortran TS29113 Status: https://ibm.biz/FortranTS29113Status