Heterogeneous Computing with D


Heterogeneous Computing with D
Using the PTX and SPIR-V targets with a system programming language
Kai Nacke, 4 February 2018, LLVM dev room @ FOSDEM‘18

Hello! I am Kai and I am the maintainer of LDC, the LLVM-based D compiler. My talk today is about an exciting extension to the compiler: the ability to create applications which run on normal CPUs and which can use your graphics card (or GPU, graphics processing unit) for special computing tasks. Because the application runs on different kinds of hardware, this is called heterogeneous computing. The vision we try to realize is to minimize the difference between CPU and GPU programming at the source level. The application developer should be able to use one tool to target both CPUs and GPUs. He should also be able to use the same programming language. A system programming language like D seems to be a good choice.

GPU Architecture
- Uses the Single Instruction Multiple Thread (SIMT) model
- Contains a huge number of cores
- Blocks of threads executing the same code ("kernel")
- Hierarchy of Grid / Block / Warp
- Different memory types: global and local memory, shared memory, constant memory
- Memory may be cached; the register file uses memory, too
- Possibly a lot of restrictions, e.g. no double data type or no recursion

First let's have a closer look at GPU architecture. GPUs are different from normal CPUs because they are massively parallel. They use a special model called SIMT, Single Instruction Multiple Thread. Basically there is a huge number of cores (or threads) which all execute the same code. Code callable from a host device is called a kernel. Because there are so many cores (the graphics card in my notebook has 640 cores) they are managed in a hierarchy. A warp contains N threads, usually 32. A block is made up of N warps. And the grid contains all the blocks. The cores of a warp (actually a half-warp) execute in parallel. This means they are all executing the same code but on different data. This is important for conditional instructions: if a condition evaluates differently on one thread than on the other threads of the warp, then that thread stalls while the other code path is executed. You should try to distribute the work in such a way that within each warp every thread takes the same code path. Another difference from CPUs is that a GPU has different memory types and usually no or only simple caching. The memory types differ in who can access them and how long the data lives. Global memory is accessible from all threads and the host. Local memory and the register file are only accessible within one thread. And shared memory is shared between the threads of a block. All this happens independently of the CPU. The kernel is simply launched on the GPU and the CPU is free to do other work until the GPU has finished the task. GPUs are restricted devices. For example, not all devices support the double data type or recursion. You have to be aware of these restrictions.
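To make the Grid / Block / Warp hierarchy a bit more concrete: each thread can compute a globally unique index from its position inside its block and the position of the block inside the grid. The tiny D sketch below is illustrative only; the names blockIdx, blockDim and threadIdx are hypothetical parameters standing in for the target intrinsics, and dcompute's GlobalIndex abstraction (used in the saxpy kernel later) already hides this arithmetic.

    // Illustrative sketch, not real dcompute code: how a global thread
    // index is typically derived from the block/thread hierarchy.
    size_t globalIndexX(size_t blockIdx, size_t blockDim, size_t threadIdx)
    {
        // Each block holds blockDim threads; the global index is the
        // offset of the block plus the thread's position within it.
        return blockIdx * blockDim + threadIdx;
    }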

Development model
- There are open and proprietary APIs: OpenCL and CUDA
- General development flow:
  - Kernels are developed in a high-level language
  - Compiled into an intermediate representation
  - The GPU driver compiles to machine instructions at load time
  - The host application loads and launches the kernels

The kernel is written in some programming language. Usually it is C, C++ or Fortran. In my talk I show how to do it with D. There are two widely used APIs: OpenCL and CUDA. OpenCL is an open standard while CUDA is supported by only one vendor. The kernel is compiled into an intermediate representation. For OpenCL, this is SPIR-V, and for CUDA it is PTX. This intermediate representation is compiled by the GPU driver into machine code at load time. (Saving the machine code is also possible.) The kernel code cannot run standalone. A host application has to load and launch the kernel. The host application also provides and consumes the data the GPU uses.
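To visualize this flow from the host's point of view, here is a minimal sketch in D. All names in it (deviceAlloc, copyToDevice, launchKernel, copyFromDevice, gridSizeFor) are hypothetical placeholders, not the real OpenCL, CUDA or dcompute driver APIs, which differ in detail; only the sequence of steps is the point.

    // Hypothetical host-side flow: allocate device buffers, copy input
    // data, launch the kernel, and copy the results back.
    void runSaxpyOnDevice(float[] x, float[] y, float alpha, float[] result)
    {
        auto dX   = deviceAlloc!float(x.length);      // global memory on the GPU
        auto dY   = deviceAlloc!float(y.length);
        auto dRes = deviceAlloc!float(result.length);

        copyToDevice(dX, x);                          // host -> device
        copyToDevice(dY, y);

        // The driver has already translated the intermediate representation
        // (SPIR-V or PTX) into machine code when the kernel was loaded.
        launchKernel("saxpy", gridSizeFor(x.length), dRes, dX, dY, alpha, x.length);

        copyFromDevice(result, dRes);                 // device -> host
    }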

Applications of GPU computing
- Well suited for many scientific applications
- Examples: machine learning, Monte-Carlo simulations, pattern matching, ray tracing, ...
- Games

There are many scientific problems which fit very well onto the SIMT model. Because GPUs are very cheap compared to CPUs for the compute power they deliver, they are very popular, e.g. in supercomputing. Most applications have in common that they process a large amount of data in a uniform way. This includes machine learning, Monte-Carlo simulations (e.g. for financial problems), pattern matching in a genome database, and ray tracing. There are countless other examples. And we should not forget the origin of GPUs, which is still a very important application space: the acceleration of games, e.g. the computation of physics effects.

Challenges for developers
- Application developer: SIMT requires different thinking; different memory spaces increase complexity
- Compiler developer: application source compiles to different targets; address spaces must be mapped to language constructs; separation of compiler and library

GPU programming has some challenges for both the application and the compiler developer. The application developer has to think differently to benefit from SIMT. And he must be able to decide for which computations it is worth using the GPU. If the problem is not suitable for parallel computation then there is no advantage in using a GPU. The different memory spaces increase the complexity of the application. Moreover, a GPU often has only simple caching. The memory layout of data structures is much more important than on a CPU. Bad memory access patterns can slow down the computation significantly (a small sketch follows below). The compiler developer has different challenges. A source module can contain both host and kernel code. The compiler must be able to produce code for several targets in one invocation. He must also deal with the different address spaces and how they are mapped to language constructs. Last but not least, he must decide which features the compiler must know about and which belong in a library.
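As a concrete illustration of the memory access point: in the SIMT model it matters which addresses neighbouring threads touch at the same time. The sketch below is ordinary D with no device code; think of i as a thread's global index within a warp.

    // Coalesced access: consecutive threads read consecutive elements,
    // so the hardware can combine the loads of a warp into few wide
    // memory transactions.
    float readCoalesced(const(float)[] data, size_t i)
    {
        return data[i];
    }

    // Strided access: consecutive threads read elements far apart, which
    // on a GPU typically splits one wide transaction into many narrow ones
    // and wastes memory bandwidth.
    float readStrided(const(float)[] data, size_t i, size_t stride)
    {
        return data[i * stride];
    }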

LLVM support
- LLVM generates code for different GPUs: nvptx, nvptx64 (http://llvm.org/docs/NVPTXUsage.html); r600, amdgcn (http://llvm.org/docs/AMDGPUUsage.html)
- OpenCL uses SPIR-V; SPIR is derived from LLVM IR
- No SPIR-V backend in official LLVM; requires a custom LLVM (https://github.com/thewilsonator/llvm/releases)

LDC uses the LLVM backend. LLVM has backends for NVIDIA and AMD GPUs. They can be used like the other LLVM targets. For details please have a look at the official documentation. OpenCL uses an intermediate representation called SPIR-V. This IR is derived from LLVM IR but it is not compatible with it. At the Khronos Group (the organization behind OpenCL) there exists an LLVM version with a SPIR-V backend. Unfortunately this version is somewhat old (based on LLVM 3.6.1) and there is no official release of it. Because of this the dcompute developer ported the SPIR-V backend to newer LLVM versions. Please follow the link. Given that OpenCL is an open API, this situation is unsatisfying. There should be a SPIR-V backend in LLVM.

A closer look at the PTX backend
- Global variables and pointer types need an address space:
  %res = alloca float addrspace(1)*
- Functions have a different calling convention:
  define ptx_kernel void @mykernel()
  declare ptx_device i32 @llvm.nvvm.read.ptx.sreg.warpsize()
- Kernel functions need special metadata:
  !nvvm.annotations = !{!0}
  !0 = !{void ()* @mykernel, !"kernel", i32 1}

Let's have a closer look at the PTX backend. What changes to the code generation do we have to implement? First, global variables and pointer types need to be augmented with an address space. There are 6 different address spaces. Intrinsics are provided to convert pointers between address spaces. Next, functions have a different calling convention. Kernel functions (the functions called from the host) use the ptx_kernel calling convention. Device functions (only called from kernels and other device functions) use the ptx_device calling convention. And last, for each kernel function an annotation is required to identify this function as a kernel. Besides some special intrinsics for the target, this is all you need to know.

D source level

@compute(CompileFor.deviceOnly)
module dcompute.tests.kernels;

pragma(LDC_no_moduleinfo);

import ldc.dcompute;
import dcompute.std.index;

@kernel void saxpy(GlobalPointer!(float) res,
                   GlobalPointer!(float) x,
                   GlobalPointer!(float) y,
                   float alpha, size_t N)
{
    auto i = GlobalIndex.x;
    if (i >= N) return;
    res[i] = alpha*x[i] + y[i];
}

At the D level, there are some new elements. A module which contains device and kernel functions must be marked with the @compute() attribute. The value CompileFor.deviceOnly indicates that this module contains only code for a GPU. This attribute is defined in the module ldc.dcompute. You have to import this module. Because a GPU is a very limited device, it can't support all D features. Notably, it makes no sense to generate module information. A pragma is used to turn this off. A kernel function is marked with the @kernel attribute. All other functions are device functions. There are templates to declare pointers into the various address spaces. GlobalPointer is one example. The others are PrivatePointer, SharedPointer, ConstantPointer and GenericPointer. The runtime library provides some useful abstractions like GlobalIndex and GlobalDim. Their implementation relies on intrinsics and is different for OpenCL and CUDA. The source shown can be compiled for both targets.
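For comparison, and to underline the goal of keeping CPU and GPU code close to each other at the source level, the same computation as a plain CPU-side D function looks like this; the kernel above is essentially the loop body, with the loop replaced by the thread's global index.

    // Plain D on the CPU: the sequential counterpart of the saxpy kernel.
    void saxpyCpu(float[] res, const(float)[] x, const(float)[] y, float alpha)
    {
        foreach (i; 0 .. res.length)
            res[i] = alpha * x[i] + y[i];
    }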

Implementation details
- Uses attributes and templates: @compute(), @kernel, struct Pointer(AddrSpace as, T)
- Requires the -betterC switch because of target limitations (no garbage collection, no exceptions, no module init, ...)
- Compiles to different targets in one invocation; requires tweaks because of global variables (DataLayout, Target, ...)
- New ABI implementations, as usual
- A lot more stuff in the runtime library

ldc2 -mdcompute-targets=cuda-500 -oq -betterC dcompute\tests\kernels.d

The implementation in the compiler is not that complex. The compiler needs to know the new magic attributes and templates. Then it is easy to generate the necessary code. Because GPU code is very limited, it is not possible to implement all D features for these targets. Garbage collection, exceptions and type information are among these features. Luckily there is a compiler switch which turns these features off and simply lets D act like a better C. Naturally, the command line switch is called -betterC. For the ease of the developer, the compiler should be able to produce code for several devices in one invocation. The new -mdcompute-targets switch takes a comma-separated list of GPU targets and generates code for all of them. This sounds easy. In reality, the compiler uses a handful of global variables, which makes it nasty to switch between different targets. Removing these global variables is a work in progress. As usual with LLVM, if you use a new target you have to implement the ABI. Therefore LDC got two new ABI classes, for OpenCL and PTX. The bulk of the work is in the dcompute library. This library aims to hide the differences between the different targets. A lot of stuff is already in this library. There are plans to further enhance it.
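Since -mdcompute-targets accepts a comma-separated list, a single invocation can, for example, emit both CUDA and OpenCL kernels. The exact OpenCL version suffix below (ocl-220) is an assumption based on my reading of the dcompute documentation; check the LDC help output for the values your installation supports.

    ldc2 -mdcompute-targets=ocl-220,cuda-500 -oq -betterC dcompute\tests\kernels.d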

Conclusion
- Adding GPU targets was pretty easy
- Global variables in the compiler were the main obstacle
- Still work in progress, especially the library
- The library should enable uniform handling of PTX and OpenCL
- What we really miss is a SPIR-V backend in official LLVM

With dcompute an application developer can seamlessly develop for CPU and GPU targets. Adding the GPU targets was pretty easy. The tasks which had to be done were not that different from those for other CPU targets. Only the internal compiler structure with some global variables required some tricky work. The runtime library already contains a lot of stuff. To reach the goal of uniform treatment of the PTX and OpenCL targets, more work needs to be done. The only part we really miss is a SPIR-V backend in LLVM. Please add this!

Questions?

Thank you very much! Do you have questions?