CS 193G Lecture 2: GPU History & CUDA Programming Basics

Outline of CUDA Basics
- Basic Kernels and Execution on GPU
- Basic Memory Management
- Coordinating CPU and GPU Execution
- See the Programming Guide for the full API

BASIC KERNELS AND EXECUTION ON GPU

CUDA Programming Model
- Parallel code (kernel) is launched and executed on a device by many threads
- Launches are hierarchical: threads are grouped into blocks, and blocks are grouped into grids
- Familiar serial code is written for a single thread
- Each thread is free to execute a unique code path
- Built-in thread and block ID variables identify each thread

High Level View
[diagram: CPU connected through the chipset and PCIe bus to the GPU, which contains on-chip SMEM (shared memory) and off-chip global memory]

Blocks of threads run on an SM
[diagram: a streaming processor runs one thread, with its own registers and memory; a streaming multiprocessor (SM) runs a thread block, which has per-block shared memory (SMEM)]

Whole grid runs on GPU
[diagram: many blocks of threads spread across the GPU's SMs, all sharing global memory]

Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- Grid = all blocks for a given launch
- A thread block is a group of threads that can (see the sketch below):
  - Synchronize their execution
  - Communicate via shared memory
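
To make the last two points concrete, here is a minimal sketch (ours, not from the slides) of a block staging data in per-block shared memory and synchronizing before reading it back; the fixed 256-thread block size is an assumption:

__global__ void reverseInBlock(int *d_in, int *d_out)
{
    __shared__ int s[256];  // per-block shared memory (block size assumed 256)
    int t = threadIdx.x;
    s[t] = d_in[blockIdx.x * blockDim.x + t];
    __syncthreads();        // all writes complete before any thread reads
    // Now it is safe to read elements written by other threads in the block
    d_out[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];
}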

Memory Model
[diagram: sequential kernels — Kernel 0, then Kernel 1 — each reading and writing the same per-device global memory]

Memory Model
[diagram: host memory and each device's memory (device 0, device 1) are separate spaces; data moves between them via cudaMemcpy()]

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    // (assumes N is a multiple of 256)
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

(This slide highlights the kernel: the device code.)

Example: Vector Addition Kernel
(Same code as the previous slide; this time the launch of vecAdd inside main is highlighted: the host code.)

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;  // initialized inputs
float *h_C = …;            // empty, will receive the result

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Example: Host code for vecAdd (2)

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// do something with the result…

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
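
The slides omit error handling for brevity, but every CUDA runtime call returns a cudaError_t. A minimal hedged sketch of checking those return values (the CHECK macro name is our own, not part of the CUDA API):

#include <stdio.h>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            printf("CUDA error: %s\n", cudaGetErrorString(err)); \
            return 1;                                            \
        }                                                        \
    } while (0)

// Usage inside main():
// CHECK( cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost) );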

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: every element is set to 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: each element is set to the ID of the block that wrote it

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: each element is set to the thread's ID within its block

Code executed on GPU
- C/C++ with some restrictions:
  - Can only access GPU memory
  - No variable number of arguments
  - No static variables
  - No recursion
  - No dynamic polymorphism
- Must be declared with a qualifier (see the sketch below):
  - __global__ : launched by the CPU, cannot be called from the GPU, must return void
  - __device__ : called from other GPU functions, cannot be called by the CPU
  - __host__ : can be called by the CPU
  - __host__ and __device__ can be combined; sample use: overloading operators
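
A minimal sketch (ours, not from the slides) showing the qualifiers side by side:

// Combined qualifiers: compiled for both CPU and GPU
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: callable from kernels and other __device__ functions
__device__ float plusOne(float x) { return x + 1.0f; }

// Kernel: launched from the CPU with <<<...>>>, must return void
__global__ void apply(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = plusOne(square(a[i]));
}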

Memory Spaces
- CPU and GPU have separate memory spaces
  - Data is moved across the PCIe bus
  - Use functions to allocate/set/copy memory on the GPU; very similar to the corresponding C functions
- Pointers are just addresses
  - Can't tell from the pointer value whether the address is on the CPU or the GPU
  - Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa (see the sketch below)
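
A short sketch of the hazard (our own illustration, not from the slides): d_p below holds a GPU address, so dereferencing it on the host is undefined behavior; cudaMemcpy is the correct way to touch that memory from the CPU:

float *d_p = 0;
cudaMalloc( (void**)&d_p, sizeof(float) );
// *d_p = 1.0f;  // WRONG: dereferences a GPU address on the CPU
float one = 1.0f;
cudaMemcpy( d_p, &one, sizeof(float), cudaMemcpyHostToDevice );  // correct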

GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
- cudaMalloc(void **pointer, size_t nbytes)
- cudaMemset(void *pointer, int value, size_t count)
- cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree(d_a);

Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
- returns after the copy is complete
- blocks the CPU thread until all bytes have been copied
- doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

Non-blocking copies are also available (see the sketch below).
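
A hedged sketch of the non-blocking variant using cudaMemcpyAsync and a stream (for the copy to actually overlap with CPU work, the host buffer must be page-locked, e.g. allocated with cudaMallocHost):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Enqueue the copy; this call returns to the CPU immediately
cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream);

// ... the CPU is free to do other work here ...

cudaStreamSynchronize(stream);  // block until the copy has finished
cudaStreamDestroy(stream);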

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Example: Shuffling Data

// Reorder values based on keys
// Each thread moves one element
__global__ void shuffle(int* prev_array, int* new_array, int* indices)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    new_array[i] = prev_array[indices[i]];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}

(This slide highlights the launch inside main: the host code.)

IDs and Dimensions
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions are set at launch and can be unique for each grid
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[diagram: a device running a 3x2 grid of blocks; block (1,1) is expanded to show its 5x3 array of threads, each labeled with its (x, y) thread ID]

Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

// kernel( int *a, int dimx, int dimy ) as defined on the previous slide

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Blocks must be independent
- Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - shared queue pointer: OK (see the sketch below)
  - shared lock: BAD … can easily deadlock
- The independence requirement is what gives scalability
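
A sketch (ours, not from the slides) of the "shared queue pointer" pattern: blocks coordinate through an atomic counter but never wait on one another, so any execution order remains valid:

__global__ void drainQueue(int *queue_head, int *work, int num_items)
{
    __shared__ int my_item;
    // One representative thread per block claims the next work item
    if (threadIdx.x == 0)
        my_item = atomicAdd(queue_head, 1);  // coordinate, don't synchronize
    __syncthreads();

    if (my_item < num_items)
        work[my_item] = my_item;  // stand-in for real per-item work
}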

Questions?

CS 193G History of GPUs

Graphics in a Nutshell
- Make great images: intricate shapes, complex optical effects, seamless motion
- Make them fast: invent clever techniques, use every trick imaginable, build monster hardware
[image credit: Eugene d'Eon, David Luebke, Eric Enderton, in Proc. EGSR 2007 and GPU Gems 3]

The Graphics Pipeline
Vertex Transform & Lighting → Triangle Setup & Rasterization → Texturing & Pixel Shading → Depth Test & Blending → Framebuffer
(The original slides step through this pipeline one stage at a time.)

The Graphics Pipeline
- Key abstraction of real-time graphics
- Hardware used to look like this: one chip/board per stage
- Fixed data flow through the pipeline
[diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline
- Everything was fixed function, with a certain number of modes
- The number of modes for each stage grew over time
- Hard to optimize the hardware
- Developers always wanted more flexibility
[diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline
- Remains a key abstraction
- Vertex & pixel processing became programmable, and new stages were added
- GPU architecture increasingly centers around shader execution
[diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline
- Programmability exposed an (at first limited) instruction set for some stages
- At first: limited instructions & instruction types, and no control flow
- Later expanded to a full ISA
[diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Why GPUs scale so nicely
- The workload and programming model provide lots of parallelism
- Applications provide large groups of vertices at once
  - Vertices can be processed in parallel
  - The same transform is applied to all vertices
- Triangles contain many pixels
  - Pixels from a triangle can be processed in parallel
  - The same shader is applied to all pixels
- Very efficient hardware hides serialization bottlenecks

With Moore’s Law…
[diagram: a pipeline with one unit per stage (Vertex, Raster, Pixel, Blend) grows into one with replicated parallel units — Vrtx 0–2 feeding Raster, then Pixel 0–3 feeding Blend]

More Efficiency
- Note that we do the same thing for lots of pixels/vertices, so ALUs can share control logic
[diagram: many ALUs sharing one control unit, instead of one control unit per ALU]
- A warp = 32 threads launched together
- Usually, they execute together as well
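
A hedged sketch (our own example) of what warp-wide execution means for CUDA code: a data-dependent branch that splits a warp makes the hardware run the two paths one after the other:

__global__ void branchy(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads 0-15 and 16-31 of each warp take different paths,
    // so the warp executes both branches serially
    if ((threadIdx.x & 31) < 16)
        a[i] = a[i] * 2.0f;
    else
        a[i] = a[i] + 1.0f;
}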

Early GPGPU
- All this performance attracted developers
- To use GPUs, they re-expressed their algorithms as graphics computations
- Very tedious, limited usability
- Still, there were some very nice results
- This was the lead-up to CUDA

Questions?