GPU Architecture and Programming


GPU vs CPU https://www.youtube.com/watch?v=fKK933KK6Gg

GPU Architecture GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering. Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's first GPU in 1999.

CPU + GPU is a powerful combination. CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU, while parallel portions run on the GPU.

Architecture of GPU NVIDIA GPUs contain a number of multiprocessors, each of which executes in parallel with the others. The high-end Tesla accelerators have 30 multiprocessors; the high-end Fermi has 16 multiprocessors. Each multiprocessor has a group of stream processors: on Tesla, each multiprocessor has one group of 8 stream processors (cores); on Fermi, each multiprocessor has two groups of 16 stream processors (cores). So the high-end Tesla accelerators have 30 x 8 = 240 cores, and the high-end Fermi has 16 x 2 x 16 = 512 cores. Each core has integer and single-precision floating-point functional units; a shared special function unit in each multiprocessor handles double-precision operations. Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion: all cores in the same group execute the same instruction at the same time, much like classical SIMD processors. A group of 32 threads forms one execution unit, called a warp, and code is executed in groups of warps. On Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp (32 threads) in four clock cycles. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to act as a single 16-core double-precision multiprocessor. (The point is that a GPU can execute many threads in parallel.) Reference: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm Images copied from http://www.pgroup.com/lit/articles/insider/v2n1a5.htm and http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf

CUDA Programming CUDA (Compute Unified Device Architecture) is a parallel programming platform created by NVIDIA for its GPUs. By using CUDA, you can write programs that directly access the GPU. The CUDA platform is accessible to programmers via CUDA libraries and extensions to programming languages such as C, C++ and Fortran. C/C++ programmers use "CUDA C/C++", compiled with the nvcc compiler; Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler.

Terminology: Host: The CPU and its memory (host memory) Device: The GPU and its memory (device memory)

Programming Paradigm Each parallel function of the application executes as a kernel. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Programming Flow: (1) copy input data from CPU memory to GPU memory; (2) load the GPU program and execute it; (3) copy results from GPU memory back to CPU memory.

Each parallel function of the application is executed as a kernel. That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins. Fermi has some support for multiple, independent kernels executing simultaneously, but most kernels are large enough to fill the entire machine.

The host program launches a sequence of kernels, and the execution of a kernel is divided into the execution of many threads on the GPU. Let's see how threads are organized for a kernel. Overall: threads are grouped into blocks, and multiple blocks form a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid. Kernels can use these indices to compute array subscripts. Threads in a single block will be executed on a single multiprocessor; a warp will always be a subset of threads from a single block. There is a hard limit on the size of a thread block: 512 threads (16 warps) for Tesla, 1024 threads (32 warps) for Fermi. A Tesla multiprocessor can have 1024 threads simultaneously active (32 warps), from up to 8 thread blocks; a Fermi multiprocessor can have 48 simultaneously active warps (1536 threads), from up to 8 thread blocks. Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf
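As a quick illustration (not part of the original slides), here is a minimal sketch of how these built-in index variables appear inside a kernel; the kernel name whoAmI and the output array are hypothetical:

    // Minimal sketch: each thread computes one global index from its
    // block index (blockIdx.x) and its local thread index (threadIdx.x),
    // assuming a 1-D grid of 1-D blocks.
    __global__ void whoAmI(int *out)
    {
        int globalId = blockIdx.x * blockDim.x + threadIdx.x;
        out[globalId] = globalId;   // each thread writes its own global id
    }

    // Launched from the host, for example, as:
    //   whoAmI<<<numBlocks, threadsPerBlock>>>(d_out);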

Hello World! Example __global__ is a CUDA C/C++ keyword meaning that mykernel() will be executed on the device and will be called from the host. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
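The code on this slide is not reproduced in the transcript; based on the cited deck, it is essentially the following minimal program (a sketch, not an exact copy of the slide):

    #include <stdio.h>

    // An empty kernel: __global__ marks a function that runs on the device
    // and is launched from host code.
    __global__ void mykernel(void)
    {
    }

    int main(void)
    {
        mykernel<<<1, 1>>>();       // launch the kernel with 1 block of 1 thread
        printf("Hello World!\n");
        return 0;
    }

Note the triple angle brackets <<<...>>>: they mark a call from host code to device code (a kernel launch) and specify the number of blocks and the number of threads per block.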

Addition Example Since add runs on the device, the pointers a, b, and c must point to device memory. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
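The kernel on this slide is omitted from the transcript; it is along these lines (a sketch following the cited deck):

    // Adds two integers on the device; a, b and c must point to device memory.
    __global__ void add(int *a, int *b, int *c)
    {
        *c = *a + *b;
    }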

CUDA API for managing device memory: cudaMalloc(), cudaFree(), and cudaMemcpy(), similar to the C equivalents malloc(), free(), and memcpy(). Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
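Putting the pieces together, the host code for the addition example (also omitted from the transcript) looks roughly like this sketch:

    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c)
    {
        *c = *a + *b;
    }

    int main(void)
    {
        int a = 2, b = 7, c;        // host copies of a, b, c
        int *d_a, *d_b, *d_c;       // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for the device copies
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Copy inputs from CPU memory to GPU memory
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch the add() kernel on the GPU
        add<<<1, 1>>>(d_a, d_b, d_c);

        // Copy the result from GPU memory back to CPU memory
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }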

Vector Addition Example Kernel Function: The execution of the kernel on the GPU is actually the execution of many threads, and the kernel body specifies what each thread needs to do. Each thread needs the index of the data it will manipulate. Each thread has a globally unique thread id, so we can map the thread id to the index of the data it manipulates. For this case we need N threads (assume the size of the array is N). One way to organize these N threads is to create N blocks in one dimension, with each block containing 1 thread. The id of each block is given by the variable blockIdx: block ID (blockIdx.x, blockIdx.y). For this case, therefore, the ids of the blocks are (0, 0), (1, 0), (2, 0), …, (N-1, 0). The id of a thread within its block is given by the variable threadIdx: thread id within a block (threadIdx.x, threadIdx.y, threadIdx.z). For this case, therefore, the index of the single thread within each block is (0, 0, 0). Each thread then has a globally unique id, which can be calculated from its block id and its local thread id; here a thread is globally identified by blockIdx.x + threadIdx.x = blockIdx.x. Then we map the global thread id to the index of the data it will process. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
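The kernel shown on this slide is not reproduced in the transcript; with N blocks of one thread each, it is essentially:

    // One block per element, one thread per block:
    // the single thread of block blockIdx.x handles element blockIdx.x.
    __global__ void add(int *a, int *b, int *c)
    {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }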

main: All threads are grouped into N blocks. Each block contains 1 thread. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
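The corresponding main (omitted from the transcript) is sketched below; N is a hypothetical vector size and the initialization loop stands in for whatever input the slide used. It relies on the add kernel from the previous sketch:

    #include <stdlib.h>

    #define N 512

    int main(void)
    {
        int *a, *b, *c;             // host copies of the vectors
        int *d_a, *d_b, *d_c;       // device copies of the vectors
        int size = N * sizeof(int);

        // Allocate device memory
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Allocate and initialize host memory
        a = (int *)malloc(size);
        b = (int *)malloc(size);
        c = (int *)malloc(size);
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        // Copy inputs to the device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch N blocks of 1 thread each
        add<<<N, 1>>>(d_a, d_b, d_c);

        // Copy the result back and clean up
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }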

Alternative 1: Alternatively, with N threads, we can have one block that contains all N threads. The block id is (0, 0). Within this block, the thread ids are (0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), …, (N-1, 0, 0). The global id of each thread is then 0, 1, 2, …, N-1. Then map the global thread id to the index of the data it will manipulate. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
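In code (again a sketch; the slide's version is not transcribed), the kernel now indexes with threadIdx.x, and the launch uses one block of N threads:

    // One block of N threads: each thread handles element threadIdx.x.
    __global__ void add(int *a, int *b, int *c)
    {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }

    // Launched from the host as:
    //   add<<<1, N>>>(d_a, d_b, d_c);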

Alternative 2: Now assume we use multiple blocks, and each block has multiple threads. If the number of threads in a block is M, then the total number of blocks we need is N/M. The global thread id is computed as: int globalThreadId = threadIdx.x + blockIdx.x * M; // M is the number of threads in a block. Since the built-in variable blockDim.x holds the number of threads per block, this is written as: int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x; Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

So the kernel becomes the following. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
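A sketch of that kernel (the slide's code is not in the transcript):

    // Multiple blocks of multiple threads: combine blockIdx.x and
    // threadIdx.x into one global index per thread.
    __global__ void add(int *a, int *b, int *c)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        c[index] = a[index] + b[index];
    }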

The main becomes the following. Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
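A sketch of that main; N and THREADS_PER_BLOCK are illustrative values (the slide's exact numbers are not in the transcript), the allocation and copy code follows the same pattern as before, and it relies on the add kernel sketched above:

    #include <stdlib.h>

    #define N (2048 * 2048)
    #define THREADS_PER_BLOCK 512

    int main(void)
    {
        int size = N * sizeof(int);
        int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
        int *d_a, *d_b, *d_c;

        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // The only real change: launch N/THREADS_PER_BLOCK blocks,
        // each with THREADS_PER_BLOCK threads
        add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }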

Handling Arbitrary Vector Sizes Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
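The code for this slide is not in the transcript. The standard approach, sketched here with the vector length n passed as an extra kernel argument, is to launch enough blocks to cover all n elements and to guard against the surplus threads in the last block:

    // Guard against threads whose index falls beyond the end of the vectors.
    __global__ void add(int *a, int *b, int *c, int n)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    // Host side: round the number of blocks up so every element is covered
    // (M is the number of threads per block):
    //   add<<<(n + M - 1) / M, M>>>(d_a, d_b, d_c, n);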