CUDA and GPU Training Workshop April 21, 2014


CUDA and GPU Training Workshop April 21, 2014 University of Georgia CUDA Teaching Center

UGA CUDA Teaching Center UGA, through the efforts of Professor Thiab Taha, has been selected by NVIDIA as a CUDA Teaching Center, established in 2011. Presenters: Jennifer Rouan and Sayali Kale. Visit us at http://cuda.uga.edu. We are sponsored in part by a grant from NVIDIA, so if we make some claims that are still debatable in the industry at large, bear in mind that our bias is toward NVIDIA's philosophy.

Workshop Outline Introduction to GPUs and CUDA; CUDA Programming Concepts; Current GPU Research at UGA; GPU Computing Resources at UGA; "My First CUDA Program" – a hands-on programming project

A Little Bit of GPU Background A GPU is a computer chip that performs rapid mathematical calculations in parallel, primarily for the purpose of rendering images. NVIDIA introduced the first GPU, the GeForce 256, in 1999 and remains one of the major players in the market. Using CUDA, GPUs can be used for general-purpose processing; this approach is known as GPGPU.

Question What are the different ways hardware designers make computers run faster? Higher clock speeds, more work per clock cycle, and more processors.

What is CUDA? Compute Unified Device Architecture. It is a parallel computing platform and programming model created by NVIDIA and implemented by the GPUs (Graphics Processing Units) that they produce. The CUDA compiler uses a variation of C, with future support for C++. CUDA was released on February 15, 2007 for the PC, with a beta version for Mac OS X following on August 19, 2008.

Why CUDA? CUDA provides the ability to use high-level languages such as C to develop applications that can take advantage of the high performance and scalability that GPU architectures offer. GPUs allow the creation of very large numbers of concurrently executing threads at very low system resource cost. CUDA also exposes fast shared memory (16 KB) that can be shared between threads.

CPU vs. GPU A Central Processing Unit (CPU) consists of a few cores optimized for sequential, serial processing. A Graphics Processing Unit (GPU) consists of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.

CUDA Computing System GPU-accelerated computing offers great performance by offloading portions of the application to the GPU, while the remainder of the code still runs on the CPU.

CUDA Computing System A CUDA computing system consists of a host (CPU) and one or more devices (GPUs). The portions of the program that can be evaluated in parallel are executed on the device. The host handles the serial portions and the transfer of execution and data to and from the device.

CUDA Program Source Code A CUDA program is a unified source code encompassing both host and device code. Convention: program_name.cu. NVIDIA's compiler (nvcc) separates the host and device code at compilation. The host code is compiled by the host's standard C compiler; the device code is further compiled by nvcc for execution on the GPU.
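As a quick illustration (the file name my_program.cu is hypothetical), building and running such a program from the command line might look like:

    nvcc my_program.cu -o my_program
    ./my_program

nvcc hands the host portions to the system's standard C compiler and compiles the device portions itself.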

CUDA Program Execution steps: 1. The CPU allocates storage on the GPU (cudaMalloc). 2. The CPU copies input data from CPU to GPU (cudaMemcpy). 3. The CPU launches the kernel (the invoked program) on the GPU to process the data (kernel launch). 4. The CPU copies the results back from GPU to CPU (cudaMemcpy).

Processing Flow Processing flow of CUDA: 1. Copy data from main memory to GPU memory. 2. The CPU instructs the GPU to begin processing. 3. The GPU executes in parallel on each core. 4. Copy the results from GPU memory back to main memory.

CUDA Program Execution Execution of a CUDA program begins on the host CPU. When a kernel function (or simply "kernel") is launched, execution is transferred to the device and a massive "grid" of lightweight threads is spawned. When all threads of a kernel have finished executing, the grid terminates and control of the program returns to the host until another kernel is launched.

Thread Batching: Grids and Blocks A kernel is executed as a grid of thread blocks, and all threads in the grid share the same data memory space. A thread block is a batch of threads that can cooperate with each other. Threads and blocks have IDs, so each thread can decide what data to work on. Grid dimensions: 1D or 2D. Block dimensions: 1D, 2D, or 3D. (Diagram: the host launches Kernel 1, spawning Grid 1 of 3x2 blocks on the device, then Kernel 2, spawning Grid 2; one block, Block (1, 1), is expanded to show its 5x3 array of threads. Courtesy: NVIDIA)

CUDA Program Structure example

int main(void)
{
    float *a_h, *a_d;                  // pointers to host and device arrays
    const int N = 10;                  // number of elements in array
    size_t size = N * sizeof(float);   // size of array in memory

    // allocate memory on host and device for the array
    // initialize array on host (a_h)
    // copy array a_h to allocated device memory location (a_d)
    // kernel invocation code – to have the device perform the parallel operations
    // copy a_d from the device memory back to a_h
    // free allocated memory on device and host
}

In this case, since it's such a simple program, we are going to write over our input data with our results on both the host and device. For a more complicated program, you would probably use more arrays to store the results on the device and host.

Data Movement example

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);                 // size of array in memory

    a_h = (float *)malloc(size);                     // allocate array on host
    cudaMalloc((void **) &a_d, size);                // allocate array on device

    for (int i = 0; i < N; i++) a_h[i] = (float)i;   // initialize array

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // kernel invocation code
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    cudaFree(a_d); free(a_h);                        // free allocated memory
}

Be aware: a device pointer looks just like a host pointer, and there is no way for the computing system to tell whether it refers to an address on the host or the device. Use meaningful variable names with "h" or "d" in them, and pay attention when dereferencing. The copy call has the form cudaMemcpy(dest_addr, source_addr, size, direction).
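Not shown on the slide, but worth noting: every CUDA runtime call returns a status code. A minimal, hedged sketch of checking one of the copies above (the error-handling style is our own, not from the deck):

    cudaError_t err = cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;   // bail out rather than compute on garbage
    }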

Kernel Invocation example

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);                 // size of array in memory

    a_h = (float *)malloc(size);                     // allocate array on host
    cudaMalloc((void **) &a_d, size);                // allocate array on device

    for (int i = 0; i < N; i++) a_h[i] = (float)i;   // initialize array

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;                              // set up execution parameters
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    cudaFree(a_d); free(a_h);                        // free allocated memory
}

Kernel Function Call

#include <stdio.h>
#include <cuda.h>

// kernel function that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host

Compare with the serial C version:

void square_array(float *a, int N)
{
    int i;
    for (i = 0; i < N; i++) a[i] = a[i] * a[i];
}

CUDA Thread Organization Since all threads of a grid execute the same code, they rely on a two-level hierarchy of coordinates to distinguish themselves: blockIdx and threadIdx The example code fragment: ID = blockIdx.x * blockDim.x + threadIdx.x; will yield a unique ID for every thread across all blocks of a grid Every thread executes the exact same code. They calculate their individual ID, then use it as the index to an input array and an output array. Rather than iterate through an array, each thread operates on a single element at the same time.

Execution Parameters and Kernel Launch A kernel is invoked by the host program with execution parameters surrounded by ‘<<<’ and ‘>>>’ as in: function_name <<< grid_dim, block_dim >>> (arg1, arg2); At kernel launch, a “grid” is spawned on the device. A grid consists of a one- or two-dimensional array of “blocks”. In turn, a block consists of a one-, two-, or three-dimensional array of “threads”. Grid and block dimensions are passed to the kernel function at invocation as execution parameters.

Execution Parameters and Kernel Launch gridDim and blockDim are CUDA built-in variables of type dim3, essentially a C struct with three unsigned integer fields: x, y, and z. Since a grid is generally two-dimensional, gridDim.z is ignored but should be set to 1 for clarity.

dim3 grid_d(n_blocks, 1, 1);       // this is still
dim3 block_d(block_size, 1, 1);    //   host code
function_name <<< grid_d, block_d >>> (arg1, arg2);

For one-dimensional grids and blocks, scalar values can be used instead of the dim3 type.

Execution Parameters and Kernel Launch

dim3 grid_dim(2, 2, 1);
dim3 block_dim(4, 2, 2);

The programmer chooses the dimensions based on what works for the data and what they want to do with it. For example, when using two-dimensional arrays, you would probably choose two-dimensional blocks, as in the sketch below.
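As a hedged illustration (our own sketch, not from the deck; the kernel name scale_matrix and its parameters are hypothetical), a kernel for a width x height array might derive two-dimensional indices like this:

__global__ void scale_matrix(float *m, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y covers rows
    if (col < width && row < height)                   // guard the ragged edges
        m[row * width + col] *= factor;                // one element per thread
}

// host code: 16x16 threads per block, enough blocks to cover the matrix
dim3 block_dim(16, 16, 1);
dim3 grid_dim((width + 15) / 16, (height + 15) / 16, 1);
scale_matrix <<< grid_dim, block_dim >>> (m_d, width, height, 2.0f);

The guard test matters because the grid is rounded up to whole blocks, so some threads may fall outside the array.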

Kernel Functions A kernel function specifies the code to be executed by all threads in parallel – an instance of single-program, multiple-data (SPMD) parallel programming. A kernel function declaration is a C function extended with one of three keywords: "__device__", "__global__", or "__host__".

                                   Executed on the:    Only callable from the:
__device__ float DeviceFunc()      device              device
__global__ void  KernelFunc()      device              host
__host__   float HostFunc()        host                host
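As a hedged sketch of how the qualifiers combine in practice (square_val is a hypothetical helper, not from the slides):

// callable only from device code, runs on the device
__device__ float square_val(float x)
{
    return x * x;
}

// launched from the host, runs on the device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = square_val(a[idx]);
}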

CUDA Device Memory Types Global memory and constant memory can be accessed by both the host and the device. Constant memory serves read-only data to the device at high bandwidth; global memory is read-write and has a longer latency. Registers, local memory, and shared memory are accessible only to the device. Registers and local memory are available only to their own thread; shared memory is accessible to all threads within the same block (see the sketch below). There is no communication between blocks, due to synchronization constraints. More on "latency hiding" later.
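To make the block-level scope of shared memory concrete, here is a hedged sketch (our own example, assuming a block size of at most 256 and an array length that is an exact multiple of the block size): each block stages its slice of the array in fast shared memory, synchronizes, and writes the slice back reversed within the block.

__global__ void reverse_within_block(float *a)
{
    __shared__ float tile[256];                     // one tile per block, visible to all its threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = a[idx];                     // each thread stages one element
    __syncthreads();                                // wait until the whole block has written

    a[idx] = tile[blockDim.x - 1 - threadIdx.x];    // read a neighbor's element from shared memory
}

The __syncthreads() barrier is what makes the cross-thread read safe; there is no equivalent barrier across blocks, which is why blocks cannot communicate this way.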

CUDA Device Memory Model Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; and read (only) per-grid constant memory. (Diagram: the device memory hierarchy, with registers and local memory per thread, shared memory per block, and global and constant memory per grid; the host reads and writes global and constant memory.) Global, constant, and texture memory spaces are persistent across kernels called by the same application.

CUDA Advantages Huge increase in processing power over conventional CPU processing: early reports suggest speed increases of 10x to 200x over CPU processing speed. Researchers can use several GPUs to perform the same amount of operations as many servers in less time, thus saving money, time, and space. The C language is widely used, so it is easy for developers to learn how to program for CUDA. All graphics cards in the G80 series and beyond support CUDA. CUDA harnesses the power of the GPU through parallel processing, running thousands of simultaneous threads.

Disadvantages Limited user base: only NVIDIA G80 and onward video cards can use CUDA, which leaves out all ATI users. Speeds may be bottlenecked at the bus between CPU and GPU. Mainly developed for researchers; not many uses for average users. The system is still in development.

Current GPU Research at UGA Institute of Bioinformatics "Solving Nonlinear Systems of First Order Ordinary Differential Equations Using a Galerkin Finite Element Method", Al-Omari, A., Schuttler, H.-B., Arnold, J., and Taha, T. R. IEEE Access, 1, 408-417 (2013). "Solving Large Nonlinear Systems of First-Order Ordinary Differential Equations With Hierarchical Structure Using Multi-GPGPUs and an Adaptive Runge Kutta ODE Solver", Al-Omari, A., Arnold, J., Taha, T. R., Schuttler, H.-B. IEEE Access, 1, 770-777 (2013).

Current GPU Research at UGA Institute of Bioinformatics GTC Poster: http://cuda.uga.edu/docs/GPGPU_runge-kutta.pdf

Current GPU Research at UGA Department of Physics and Astronomy “A generic, hierarchical framework for massively parallel Wang-Landau sampling”, T. Vogel, Y. W. Li, T. Wüst, and D. P. Landau, Phys. Rev. Lett. 110, 210603 (2013). “Massively parallel Wang-Landau Sampling on Multiple GPUs”, J. Yin and D. P. Landau, Comput. Phys. Commun. 183, 1568-1573 (2012).


Current GPU Research at UGA Department of Computer Science "Analysis of Surface Folding Patterns of DICCCOLS Using the GPU-Optimized Geodesic Field Estimate", Mukhopadhyay, A., Lim, C.-W., Bhandarkar, S. M., Chen, H., New, A., Liu, T., Rasheed, K. M., Taha, T. R. Nagoya: Proceedings of MICCAI Workshop on Mesh Processing in Medical Image Analysis (2013).


Current GPU Research at UGA Department of Computer Science “Using Massively Parallel Evolutionary Computation on GPUs for Biological Circuit Reconstruction”, Cholwoo Lim, master's thesis under the direction of Dr. Khaled Rasheed (2013).


Current GPU Research at UGA Department of Computer Science "GPU Acceleration of High-Dimensional k-Nearest Neighbor Search for Face Recognition using EigenFaces", Jennifer Rouan, master's thesis under the direction of Dr. Thiab Taha (2014). "Using CUDA for GPUs over MPI to solve Nonlinear Evolution Equations", Jennifer Rouan and Thiab Taha, presented at: The Eighth IMACS International Conference on Nonlinear Evolution Equations and Wave Phenomena: Computation and Theory (2013).

Current GPU Research at UGA Department of Computer Science Research Day 2014 Poster: http://cuda.uga.edu/docs/J_Rouan_Research_Day_poster.pdf Waves 2013 Poster: http://cuda.uga.edu/docs/waves_Poster_Rouan.pdf

More CUDA Training Resources University of Georgia CUDA Teaching Center: http://cuda.uga.edu NVIDIA training and education site: http://developer.nvidia.com/cuda-education-training Stanford University course on iTunes U: http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322 University of Illinois: http://courses.engr.illinois.edu/ece498/al/Syllabus.html University of California, Davis: https://smartsite.ucdavis.edu/xsl-portal/site/1707812c-4009-4d91-a80e-271bde5c8fac/page/de40f2cc-40d9-4b0f-a2d3-e8518bd0266a University of Wisconsin: http://sbel.wisc.edu/Courses/ME964/2011/me964Spring2011.pdf University of North Carolina at Charlotte: http://coitweb.uncc.edu/~abw/ITCS6010S11/index.html

GPUs Available at UGA CUDA Teaching Center lab in 207A: twelve NVIDIA GeForce GTX 480 GPUs across six Linux hosts on the cs.uga.edu domain: cuda01, cuda02, cuda03, cuda04, cuda05, and cuda06. SSH from nike.cs.uga.edu with your CS login and password. More GPUs are available on the Z-cluster; visit http://gacrc.uga.edu for an account and more information. Do they need to get on an access list?

References Kirk, D., & Hwu, W. (2010). Programming Massively Parallel Processors: A Hands-on Approach, 1–75. Tarjan, D. (2010). Introduction to CUDA, Stanford University on iTunes U. Atallah, M. J. (Ed.) (1998). Algorithms and Theory of Computation Handbook. Boca Raton, FL: CRC Press. von Neumann, J. (1945). First Draft of a Report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania. Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54–62. Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. Edmonton, Canada. Vandenbout, Dave (2008). My First CUDA Program, http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/

References "Using Massively Parallel Evolutionary Computation on GPUs for Biological Circuit Reconstruction", Cholwoo Lim, master's thesis under the direction of Dr. Khaled Rasheed (2013); Prof. Taha was one of the advisory committee members. "Solving Large Nonlinear Systems of ODEs with Hierarchical Structure Using Multi-GPGPUs and an Adaptive Runge Kutta", Ahmad Al-Omari, Thiab Taha, B. Schuttler, Jonathan Arnold, presented at: GPU Technology Conference, March 2014. "Using CUDA for GPUs over MPI to solve Nonlinear Evolution Equations", Jennifer Rouan and Thiab Taha, presented at: The Eighth IMACS International Conference on Nonlinear Evolution Equations and Wave Phenomena: Computation and Theory, March 2013. "GPU Acceleration of High-Dimensional k-Nearest Neighbor Search for Face Recognition using EigenFaces", Jennifer Rouan and Thiab Taha, presented at: UGA Department of Computer Science Research Day, April 2014.