Basic CUDA Programming Computer Architecture 2015 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang

From Graphics to General Purpose Processing – CPU vs GPU  CPU: general-purpose computation (SISD)  GPU: data-parallel computation (SIMD)

What is CUDA?  Compute Unified Device Architecture  Hardware or software?  A programming model  A parallel computing platform

Heterogeneous computing: CPU+GPU Cooperation  Host-device architecture: CPU (host) + GPU with local DRAM (device)

Heterogeneous computing: CUDA Code Execution (1/2)

Heterogeneous computing: CUDA Code Execution (2/2)

Heterogeneous computing: NVIDIA G80 series  Texture Processor Cluster (TPC)  Streaming Multiprocessor (SM)  Streaming Processor (SP) / CUDA core  Special Function Unit (SFU)  For the NV G8800 (G80), the number of SPs is 128

Heterogeneous computing: NVIDIA G80 series – CUDA mode

Heterogeneous computing: NVIDIA CUDA Compiler (NVCC)  NVCC separates CPU and GPU source code into two parts.  For host code, NVCC invokes a standard C compiler such as GCC, the Intel C compiler, or the MS C compiler.  All device code is compiled by NVCC itself.  Device source files should use the “.cu” extension.  Every executable with CUDA code requires:  the CUDA core library (cuda)  the CUDA runtime library (cudart)

CUDA Programming Model (1/7)  Defines  Programming model  Memory model  Helps developers map existing applications or algorithms onto CUDA devices more easily and clearly.  NVIDIA GPUs have a different architecture from common CPUs, so following CUDA’s programming model is important for obtaining high execution performance.

CUDA Programming Model (2/7)  C/C++ for CUDA  Subset of C with extensions  C++ templates for GPU code  CUDA goals:  Scale code to 100s of cores and 1000s of parallel threads.  Facilitate heterogeneous computing: CPU + GPU  CUDA defines:  Programming model  Memory model

CUDA Programming Model (3/7)  CUDA Kernels and Threads:  Parallel portions of an application are executed on the device as kernels, and only one kernel is executed at a time.  All threads execute the same kernel at a time.  Differences between CUDA threads and CPU threads:  CUDA threads are extremely lightweight  Very little creation overhead  Fast switching  CUDA uses 1000s of threads to achieve efficiency  Multi-core CPUs can only use a few

CUDA Programming Model (4/7) Arrays of Parallel Threads:  A CUDA kernel is executed by an array of threads  All threads run the same code  Each thread has an ID that it uses to compute memory addresses and make control decisions (see the sketch below)
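For example, a minimal sketch of this idea (the kernel name is made up for illustration): every thread uses its ID to pick its own array element.

__global__ void fill_ids(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID across the whole grid
    out[idx] = idx;                                   // each thread computes its own address
}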

CUDA Programming Model (5/7) Thread Batching:  A kernel launches a grid of thread blocks  Threads within a block cooperate via shared memory  Threads in different blocks cannot cooperate  This allows programs to scale transparently to different GPUs

CUDA Programming Model (6/7) CUDA Programming Model:  A kernel is executed by a grid of thread blocks  Blocks can be indexed in 1D or 2D.  A thread block is a batch of threads  Threads can be indexed in 1D, 2D, or 3D.  Within a block, data can be shared through shared memory and execution can be synchronized  But threads from different blocks can’t cooperate.

CUDA Programming Model (7/7) Memory Model:  Registers  Per thread  Data lifetime = thread lifetime  Local memory  Per-thread off-chip memory (physically in device DRAM)  Data lifetime = thread lifetime  Shared memory  Per-thread-block on-chip memory  Data lifetime = block lifetime  Global (device) memory  Accessible by all threads as well as the host (CPU)  Data lifetime = from allocation to deallocation  Host (CPU) memory  Not directly accessible by CUDA threads

CUDA C Basics

GPU Memory Allocation/Release  Memory allocation on the GPU  cudaMalloc(void **pointer, size_t nbytes)  Set a value over a memory area  cudaMemset(void *pointer, int value, size_t count)  Release a memory allocation  cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);      // size of the buffer in bytes
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);  // allocate device memory
cudaMemset(d_a, 0, nbytes);        // zero it
cudaFree(d_a);                     // release it

Data Copies  cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);  direction specifies the locations (host or device) of src and dst  Blocks the CPU thread: returns after the copy is complete  Doesn’t start copying until previous CUDA calls complete  enum cudaMemcpyKind  cudaMemcpyHostToDevice  cudaMemcpyDeviceToHost  cudaMemcpyDeviceToDevice
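As a minimal sketch of the typical round trip (the buffer names here are illustrative):

int h_a[256];                    // host buffer, filled elsewhere
int *d_a = 0;
size_t nbytes = 256 * sizeof(int);
cudaMalloc((void**)&d_a, nbytes);
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // blocks until the copy completes
// ... launch kernels that read/write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // bring results back
cudaFree(d_a);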

Function Qualifiers  __global__ : invoked from host (CPU) code; cannot be called from device (GPU) code; must return void  __device__ : called from other GPU functions; cannot be called from host (CPU) code  __host__ : can only be executed by the CPU; called from the host
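A short sketch showing all three qualifiers together (the function names are made up for illustration):

__device__ float square(float x)      // GPU-only helper, callable from kernels
{
    return x * x;
}

__global__ void square_all(float *a)  // entry point launched from the host; returns void
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = square(a[idx]);
}

__host__ void launch(float *d_a)      // plain CPU function (the default when unqualified)
{
    square_all<<<4, 64>>>(d_a);       // d_a must point to device memory
}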

Variable Qualifiers (GPU code)  __device__  Stored in device memory (large capacity, high latency, uncached)  Allocated with cudaMalloc (the __device__ qualifier is implied)  Accessible by all threads  Lifetime: application  __shared__  Stored in on-chip shared memory (SRAM, low latency)  Allocated by the execution configuration or at compile time  Accessible by all threads in the same thread block  Lifetime: duration of the thread block  Unqualified variables:  Scalars and built-in vector types are stored in registers  Arrays may be in registers or local memory (registers are not addressable)
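As a small illustration of __shared__ (a hypothetical kernel, assuming it is launched with blockDim.x == 64):

__global__ void reverse_each_block(int *data)
{
    __shared__ int tile[64];        // one copy per thread block, on-chip
    int base = blockIdx.x * blockDim.x;
    int t = threadIdx.x;
    tile[t] = data[base + t];       // stage this block's chunk in shared memory
    __syncthreads();                // all threads must finish writing first
    data[base + t] = tile[63 - t];  // write the chunk back reversed
}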

CUDA Built-in Device Variables  All __global__ and __device__ functions have access to these automatically defined variables  dim3 gridDim;  Dimensions of the grid in blocks (at most 2D)  dim3 blockDim;  Dimensions of the block in threads  dim3 blockIdx;  Block index within the grid  dim3 threadIdx;  Thread index within the block

Executing Code on the GPU  Kernels are C functions with some restrictions  Can only access GPU memory  Must have void return type  No variable number of arguments (“varargs”)  Not recursive  No static variables  Function arguments are automatically copied from CPU to GPU memory

Launching Kernels  Modified C function call syntax:  kernel<<<dim3 grid, dim3 block>>>(…);  Execution configuration (“<<< >>>”):  Grid dimensions: x and y  Thread-block dimensions: x, y, and z

dim3 grid(16, 16);
dim3 block(16, 16);
kernel1<<<grid, block>>>(…);  // 16×16 blocks of 16×16 threads
kernel2<<<32, 512>>>(…);      // 32 blocks of 512 threads (1D)

Data Decomposition  Often we want each thread in a kernel to access a different element of an array.  (Figure: with blockDim.x = 5, each thread’s global index is idx = blockIdx.x*blockDim.x + threadIdx.x)

Data Decomposition Example: Increment Array Elements  Increment an N-element vector a by a scalar b

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

int main()
{
    …
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)                      // guard: the grid may be larger than N
        a[idx] = a[idx] + b;
}

int main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N/(float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

LAB0: Setup CUDA Environment & Device Query

CUDA Environment Setup  Install Microsoft Visual Studio 2010  Available from  Express version from the MS website  Check your NVIDIA GPU  Compute capability  GPU generation   Download the CUDA development files   CUDA driver  CUDA toolkit  PC Room ED417B used version 5.5.  Install CUDA  Test CUDA  Device Query (found in the sample codes)

Setup CUDA for MS Visual Studio (ED417B) In PC Room ED417B:  CUDA device: NV 8400GS  CUDA toolkit version: 5.5  Visual Studio 2010  Modify from an existing project  CUDA sample codes  Please refer to 

CUDA Device Query Example (CUDA toolkit 6.0)

Lab1: First CUDA Program Yet Another Random Number Sequence Generator

Yet Another Random Number Sequence Generator  Implemented on both CPU and GPU  Functionality:  Given a random integer array A holding 8192 elements  Generated by rand()  Regenerate the random numbers by multiplying 256 times, ignoring any overflow that occurs:  B[i]new = B[i]old * A[i] (GPU)  C[i]new = C[i]old * A[i] (CPU)  Check the consistency between the CPU and GPU execution results.

Data Manipulation between Host and Device  cudaError_t cudaMalloc( void** devPtr, size_t count )  Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr  cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)  Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst  kind indicates the type of memory transfer  cudaMemcpyHostToHost  cudaMemcpyHostToDevice  cudaMemcpyDeviceToHost  cudaMemcpyDeviceToDevice  cudaError_t cudaFree( void* devPtr )  Frees the memory space pointed to by devPtr
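Putting these calls together, a hedged sketch of the host-side flow for this lab; SIZE, GPU_kernel, nBlk, and nTid follow the handout’s names, while the (d_A, d_B) kernel signature is an assumption:

int *A = (int*)malloc(SIZE * sizeof(int));  // host input, filled with rand()
int *B = (int*)malloc(SIZE * sizeof(int));  // host copy of the GPU result
int *d_A = 0, *d_B = 0;
cudaMalloc((void**)&d_A, SIZE * sizeof(int));
cudaMalloc((void**)&d_B, SIZE * sizeof(int));
cudaMemcpy(d_A, A, SIZE * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, SIZE * sizeof(int), cudaMemcpyHostToDevice);  // B starts from its initial values
GPU_kernel<<<nBlk, nTid>>>(d_A, d_B);                            // assumed signature
cudaMemcpy(B, d_B, SIZE * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_B);
// ...then compare B against the CPU result C element by element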

Now, go and finish your first CUDA program!!!

 Source code download   Create a VS project and add the following files  How to create a VS project from CUDA sample code   main.cu  Random input generation, output validation, result reporting  device.cu  Launches the GPU kernel; GPU kernel code  parameter.h  Fill in the appropriate APIs  GPU_kernel() in device.cu  Please change SIZE in parameter.h to

Lab2: Make the Parallel Code Faster Yet Another Random Number Sequence Generator

Parallel Processing in CUDA  Parallel code can be partitioned into blocks and threads  cuda_kernel<<<nBlk, nTid>>>( … )  Multiple tasks are initialized, each with a different block ID and thread ID  The tasks are dynamically scheduled  Tasks within the same block are scheduled on the same streaming multiprocessor  Each task takes care of a single data partition according to its block ID and thread ID

Locate Data Partition by Built-in Variables  Built-in variables  gridDim  x, y  blockIdx  x, y  blockDim  x, y, z  threadIdx  x, y, z

Data Partition for Previous Example  When processing 64 integer data: cuda_kernel<<<…>>>(…)

int total_task = gridDim.x * blockDim.x;              // total number of tasks (threads)
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;  // this task's serial number
int length = SIZE / total_task;                       // elements per task
int head = task_sn * length;                          // first element of this task's partition

Processing Single Data Partition
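The figure for this slide is not reproduced in the transcript. As a stand-in, here is a minimal sketch of a kernel walking its contiguous partition, assuming SIZE from parameter.h, the Lab1 multiply-256 loop, and a hypothetical (A, B) signature:

__global__ void cuda_kernel(int *A, int *B)
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int length = SIZE / total_task;
    int head = task_sn * length;
    for (int i = head; i < head + length; i++)  // this task's contiguous partition
        for (int j = 0; j < 256; j++)
            B[i] = B[i] * A[i];                 // overflow intentionally ignored
}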

Parallelize Your Program!!!

 Change to a larger SIZE (in parameter.h)  SIZE = 1024, 2048, 4096 …  Partition the kernel into threads  Increase nTid from 1 to 512  Keep nBlk = 1  Group threads into blocks  Adjust nBlk and see if it helps  Keep the total number of threads at 512 or below, and make sure SIZE is divisible by that number.  e.g. nBlk * nTid <= 512

Lab3: Resolve Memory Contention Yet Another Random Number Sequence Generator

Parallel Memory Architecture  Memory is divided into banks to achieve high bandwidth  Each bank can service one address per cycle  Successive 32-bit words are assigned to successive banks

Lab2 Review  When processing 64 integer data: cuda_kernel<<<…>>>(…)  (Figure: each task reads a contiguous partition of the array)

How about Interleaved Accessing?  When processing 64 integer data: cuda_kernel<<<…>>>(…)  (Figure: neighboring tasks access neighboring elements)

Implementation of Interleaved Accessing  head = task_sn  stripe = total_task  cuda_kernel<<< … >>>( … )  (see the sketch below)
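With head = task_sn and stripe = total_task, a sketch of the same kernel with interleaved (strided) accesses, under the same assumptions as the Lab2 sketch; this way neighboring tasks touch neighboring words:

__global__ void cuda_kernel(int *A, int *B)
{
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int head = task_sn;      // start at this task's own index
    int stripe = total_task; // stride by the total number of tasks
    for (int i = head; i < SIZE; i += stripe)  // interleaved partition
        for (int j = 0; j < 256; j++)
            B[i] = B[i] * A[i];
}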

Improve Your Program!!!

 Modify the original kernel code in an interleaved manner  cuda_kernel() in device.cu  Adjust nBlk and nTid as in Lab2 and examine the effect  Keep the total number of threads at 512 or below, and make sure 8192 is divisible by that number.  e.g. nBlk * nTid <= 512

Thank You  Lab3 answer:  * Group members & demo time should be registered after the final, ED412