GPGPU Programming with CUDA Leandro Avila - University of Northern Iowa Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa

Outline Introduction Architecture Description Introduction to CUDA API

Introduction There is a shift away from the traditional paradigm of sequential programming, towards parallel processing. Scientific computing needs to change in order to deal with vast amounts of data. Hardware changes have contributed to the move towards parallel processing.

Three Walls of Serial Performance Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems Memory Wall: the growing discrepancy between memory and CPU performance Instruction Level Parallelism Wall: the effort put into ILP keeps increasing, with diminishing returns Power Wall: clock frequency vs. heat dissipation

Accelerators In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload. In the old days (the 1980s), supercomputers had array processors, for vector operations on arrays, and floating point accelerators. More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware. Courtesy of Henry Neeman

Accelerators Advantages They make your code run faster Disadvantages More expensive Harder to program Code is not portable from one accelerator to another (OpenCL attempts to change this) Courtesy of Henry Neeman

Introducing GPGPU General Purpose Computing on Graphics Processing Units A great example of the trend of moving away from the traditional model.

Why GPUs? Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering. They became very popular with video gamers, because they have produced better and better images, at lightning speed. And prices have been extremely good, ranging from three figures at the low end to four figures at the high end. GPUs mostly do work like rendering images, which is mostly floating point arithmetic – the same arithmetic people use supercomputing for! Courtesy of Henry Neeman

GPU vs. CPU Flop Rate From the Nvidia CUDA Programming Guide

Architecture

Architecture Comparison

CPU vs. GPU From the Nvidia CUDA Programming Guide

Components Texture Processor Clusters Streaming Multiprocessors Streaming Processors

Streaming Multiprocessors Blocks of threads are assigned to SMs An SM contains 8 Scalar Processors Tesla C1060: Number of SMs = 30, Number of Cores = 240 The more SMs you have, the better

Hardware Hierarchy The Stream Processor Array contains 10 Texture Processor Clusters Each Texture Processor Cluster contains 3 Streaming Multiprocessors Each Streaming Multiprocessor contains 8 Scalar Processors The Scalar Processors do the work :)

Connecting some dots... Great! We have seen that the GPU architecture is different from the traditional CPU. So... now what? What does all this mean? How do we use it?

Glossary The HOST – the machine executing the main program The DEVICE – the card with the GPU The KERNEL – the routine that runs on the GPU A THREAD – the basic execution unit on the GPU A BLOCK – a group of threads A GRID – a group of blocks A WARP – a group of 32 threads

CUDA Kernel Execution Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID. The GRID can have 2 dimensions, X and Y Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 A BLOCK can have 3 dimensions, X, Y and Z Maximum sizes of each dimension of a block: 512 x 512 x 64 Prior to launching a kernel we set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS
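
These limits vary between GPU generations, so a program can query them at run time instead of hard-coding them. A minimal sketch (querying device 0 is an assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("Max grid dims:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Max block dims: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }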

Scheduling in Hardware The grid is launched Blocks are distributed to the available SMs The SM initiates processing of warps The SM schedules warps that are ready As warps finish and resources are freed, new warps are scheduled An SM can hold up to 1024 threads, e.g. 4 blocks of 256 threads or 8 blocks of 128 threads [Figure: the host launches Kernel 1 and Kernel 2; each kernel runs as a grid of blocks on the device, and each block is a 2-D array of threads] Kirk & Hwu – University of Illinois Urbana-Champaign

Memory Layout Registers and shared memory are the fastest Local memory is per-thread memory that actually resides in global memory Global memory is the slowest From the Nvidia CUDA Programming Guide

Thread Memory Access Threads access memory as follows: Registers – Read & Write Local Memory – Read & Write Shared Memory – Read & Write (block level) Global Memory – Read & Write (grid level) Constant Memory – Read only (grid level) Remember that Local Memory is implemented as virtual memory in a region that resides in Global Memory.
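
As an illustration (not from the original slides), a small kernel that touches several of these spaces; the names, the 256-thread block size, and the qualifiers (introduced on later slides) are assumptions:

    __constant__ float scale;                    // constant memory: read-only in the kernel,
                                                 // set from the host with cudaMemcpyToSymbol

    __global__ void scaleAndCopy(const float *in, float *out, int n) {
        __shared__ float tile[256];              // shared memory: visible to the whole block

        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register
        if (i < n)
            tile[threadIdx.x] = in[i] * scale;   // read global memory, write shared memory
        __syncthreads();                         // every thread in the block waits here

        if (i < n)
            out[i] = tile[threadIdx.x];          // write the result back to global memory
    }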

CUDA API

Programming Pattern The host reads input and allocates memory on the device The host copies data to the device The host invokes a kernel that executes in parallel on the device, using that data and the device hardware, to do some useful work The host copies the results back from the device for post-processing
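
A hedged end-to-end sketch of this pattern, using vector addition as the example work and the API calls introduced on the following slides (the kernel, sizes, and variable names are illustrative, not from the slides):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t size = n * sizeof(float);

        // 1. Host prepares input and allocates memory on the device
        float *a_h = (float*)malloc(size), *b_h = (float*)malloc(size), *c_h = (float*)malloc(size);
        for (int i = 0; i < n; ++i) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

        float *a_d, *b_d, *c_d;
        cudaMalloc((void**)&a_d, size);
        cudaMalloc((void**)&b_d, size);
        cudaMalloc((void**)&c_d, size);

        // 2. Host copies data to the device
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

        // 3. Host invokes the kernel, which runs in parallel on the device
        dim3 dimBlock(256);
        dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
        vecAdd<<<dimGrid, dimBlock>>>(a_d, b_d, c_d, n);

        // 4. Host copies the results back for post-processing
        cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
        printf("c[10] = %f (expected %f)\n", c_h[10], a_h[10] + b_h[10]);

        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
        free(a_h); free(b_h); free(c_h);
        return 0;
    }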

Kernel Setup __global__ void myKernel(float* d_b, float* d_a); // declaration dim3 dimGrid(2,2,1); dim3 dimBlock(4,8,8); myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );

Device Memory Allocation cudaMalloc(&myDataAddress, sizeOfData) Takes the address of a pointer that will receive the allocated device memory, and the size of that memory. cudaFree(myDataPointer) Used to free the allocated memory on the device. Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
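
A short sketch of checking the status these calls return (the helper function name is an assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Allocates a device buffer of 'size' bytes, reporting any failure.
    float* allocateDeviceBuffer(size_t size) {
        float *elements_d = NULL;
        cudaError_t err = cudaMalloc((void**)&elements_d, size);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return NULL;
        }
        return elements_d;   // the caller later releases it with cudaFree(elements_d)
    }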

Device Data Transfer cudaMemcpy() Requires: pointer to destination, pointer to source, size, type of transfer Examples: cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice); cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);

Function Declaration __global__ is used to declare a kernel. It must return void.

Useful Variables gridDim.(x|y) = grid dimensions in x and y blockDim.(x|y|z) = number of threads in a block, per dimension blockIdx.(x|y) = block index within the grid threadIdx.(x|y|z) = thread index within a block
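
Typical use of these variables is computing each thread's position in the overall data; for example, a 2-D mapping onto a width x height image (an illustrative kernel, not from the slides):

    __global__ void invert(unsigned char *img, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column of this thread
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row of this thread
        if (x < width && y < height)                     // guard against partial blocks
            img[y * width + x] = 255 - img[y * width + x];
    }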

Variable Type Qualifiers Variable type qualifiers specify the memory location of a variable in the device's memory __device__ Declares a variable in device memory __constant__ Declares a constant in device constant memory __shared__ Declares a variable in the shared memory of a thread block Note: all variables declared as extern __shared__ (dynamically sized shared memory) start at the same address, so you must use offsets when multiple arrays live in that shared memory region.
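
A hedged sketch of the three qualifiers, including the offset technique the note describes for dynamically sized (extern __shared__) memory; all names here are illustrative:

    __device__   float d_scratch[64];   // a variable in device (global) memory
    __constant__ float c_coeff[16];     // a constant in device constant memory

    // Dynamic shared memory: its size is given as the third <<< >>> launch argument,
    // and every extern __shared__ declaration aliases the start of the same region,
    // so multiple logical arrays must be laid out with manual offsets.
    extern __shared__ float s_mem[];

    __global__ void useTwoSharedArrays(int n) {
        float *s_a = s_mem;       // first n floats of the shared region
        float *s_b = s_mem + n;   // next n floats, offset past s_a
        s_a[threadIdx.x] = 0.0f;  // example use, assuming blockDim.x <= n
        s_b[threadIdx.x] = 1.0f;
    }

    // Host-side launch, reserving room for both arrays:
    // useTwoSharedArrays<<<grid, block, 2 * n * sizeof(float)>>>(n);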