1
Basic CUDA Programming
Computer Architecture 2014 (Prof. Chih-Wei Liu)
Final Project – CUDA Tutorial
TA: Cheng-Yen Yang (chenyen.yang@gmail.com)
2
From Graphics to General-Purpose Processing – CPU vs. GPU
- CPU: general-purpose computation (SISD)
- GPU: data-parallel computation (SIMD)
3
What is CUDA?
- Compute Unified Device Architecture
- Hardware or software?
  - A programming model
  - A parallel computing platform
4
Heterogeneous Computing: CPU + GPU Cooperation
Host–Device architecture:
- CPU (host)
- GPU with local DRAM (device)
5
Heterogeneous Computing: NVIDIA CUDA Compiler (NVCC)
- NVCC separates the source code into CPU and GPU parts.
  - For host code, NVCC invokes a regular C compiler such as GCC, the Intel C compiler, or the Microsoft C compiler.
  - All device code is compiled by NVCC itself. Device source files should use the ".cu" extension.
- Every executable containing CUDA code requires:
  - the CUDA core library (cuda)
  - the CUDA runtime library (cudart)
A minimal sketch of such a mixed source file follows.
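To make the split concrete, here is a minimal sketch of a mixed ".cu" file; NVCC compiles the kernel itself and hands main() to the host compiler. The file name and kernel are illustrative, not part of the course labs.

    // hello.cu -- illustrative; compile with: nvcc hello.cu -o hello
    // (nvcc links the cuda/cudart libraries automatically)
    #include <cstdio>

    __global__ void fill_ids(int *out)        // device code: compiled by NVCC
    {
        out[threadIdx.x] = threadIdx.x;       // each thread writes its own ID
    }

    int main(void)                            // host code: handed to GCC/MSVC/etc.
    {
        int h_out[4], *d_out;
        cudaMalloc((void**)&d_out, 4 * sizeof(int));
        fill_ids<<<1, 4>>>(d_out);            // launch 4 threads in 1 block
        cudaMemcpy(h_out, d_out, 4 * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 4; i++)
            printf("%d ", h_out[i]);          // prints: 0 1 2 3
        cudaFree(d_out);
        return 0;
    }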
6
Heterogeneous Computing: CUDA Code Execution (1/2)
7
Heterogeneous Computing: CUDA Code Execution (2/2)
8
Heterogeneous Computing: NVIDIA G80 Series
- Texture Processor Cluster (TPC)
- Streaming Multiprocessor (SM)
- Streaming Processor (SP) / CUDA core
- Special Function Unit (SFU)
- The NVIDIA GeForce 8800 (G80) has 128 SPs.
9
Heterogeneous Computing: NVIDIA G80 Series – CUDA Mode
10
CUDA Programming Model (1/7)
- Defines:
  - a programming model
  - a memory model
- Helps developers map applications and algorithms onto CUDA devices more easily and clearly.
- NVIDIA GPUs have a different architecture from common CPUs, so following CUDA's programming model is important for achieving high execution performance.
11
CUDA Programming Model (2/7)
- C/C++ for CUDA:
  - a subset of C with extensions
  - C++ templates for GPU code
- CUDA goals:
  - scale code to hundreds of cores and thousands of parallel threads
  - facilitate heterogeneous computing: CPU + GPU
- CUDA defines:
  - a programming model
  - a memory model
12
CUDA Programming Model (3/7)
CUDA kernels and threads:
- Parallel portions of an application execute on the device as kernels. Only one kernel executes at a time, and all threads execute that same kernel.
- Differences between CUDA threads and CPU threads:
  - CUDA threads are extremely lightweight: very little creation overhead, fast switching.
  - CUDA uses thousands of threads to achieve efficiency, whereas multi-core CPUs can use only a few.
13
CUDA Programming Model (4/7)
Arrays of parallel threads:
- A CUDA kernel is executed by an array of threads.
- All threads run the same code.
- Each thread has an ID that it uses to compute memory addresses and make control decisions.
14
CUDA Programming Model (5/7)
Thread batching:
- A kernel launches a grid of thread blocks.
- Threads within a block cooperate via shared memory.
- Threads in different blocks cannot cooperate.
- This restriction allows programs to scale transparently to different GPUs.
15
CUDA Programming Model (6/7)
- A kernel is executed by a grid of thread blocks; a grid can be 1D or 2D.
- A thread block is a batch of threads; a block can be 1D, 2D, or 3D.
- Within a block, threads can share data through shared memory and synchronize their execution, but threads from different blocks cannot cooperate (see the sketch below).
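A minimal sketch of this hierarchy (the kernel name and the fixed block size of 64 are assumptions for illustration): threads in one block cooperate through shared memory and synchronize, while blocks remain independent.

    // assumes a launch with exactly 64 threads per block,
    // e.g. reverse_in_block<<<N / 64, 64>>>(d_data);
    __global__ void reverse_in_block(int *data)
    {
        __shared__ int tile[64];                      // visible to this block only
        int base = blockIdx.x * blockDim.x;           // start of this block's segment
        tile[threadIdx.x] = data[base + threadIdx.x]; // each thread loads one element
        __syncthreads();                              // wait until the whole tile is loaded
        data[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
    }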
16
CUDA Programming Model (7/7)
Memory model:
- Registers: per thread; data lifetime = thread lifetime.
- Local memory: per thread, off-chip (physically in device DRAM); data lifetime = thread lifetime.
- Shared memory: per thread block, on-chip; data lifetime = block lifetime.
- Global (device) memory: accessible by all threads as well as the host (CPU); data lifetime = from allocation to deallocation.
- Host (CPU) memory: not directly accessible by CUDA threads.
17
CUDA C Basics
18
GPU Memory Allocation/Release
- Allocate memory on the GPU: cudaMalloc(void **pointer, size_t nbytes)
- Set a value across a memory area: cudaMemset(void *pointer, int value, size_t count)
- Release an allocation: cudaFree(void *pointer)

    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);
19
Data Copies
- cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  - direction specifies the locations (host or device) of src and dst
  - blocks the CPU thread: returns only after the copy is complete
  - doesn't start copying until all previous CUDA calls complete
- enum cudaMemcpyKind:
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice
A typical host-side round trip is sketched below.
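A minimal host-side sketch of the usual allocate/copy/free round trip, assuming an int array of 1024 elements (names are illustrative):

    #include <cstdlib>

    void round_trip(void)
    {
        int n = 1024;
        size_t nbytes = n * sizeof(int);
        int *h_a = (int*)malloc(nbytes);               // host buffer
        int *d_a = 0;                                  // device buffer
        cudaMalloc((void**)&d_a, nbytes);
        cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // blocks until done
        // ... launch kernels that read/write d_a ...
        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // waits for prior CUDA calls
        cudaFree(d_a);
        free(h_a);
    }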
20
Function Qualifiers
- __global__: invoked from host (CPU) code, cannot be called from device (GPU) code; must return void.
- __device__: called from other GPU functions, cannot be called from host (CPU) code.
- __host__: can only be executed by the CPU, called from the host.
All three appear in the sketch below.
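A small sketch showing the three qualifiers together (all names are illustrative):

    __device__ float square(float x)          // GPU-only helper, callable from kernels
    {
        return x * x;
    }

    __global__ void apply_square(float *a, int N)   // launched from host, returns void
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = square(a[idx]);
    }

    __host__ void run_on_gpu(float *d_a, int N)     // plain CPU function (__host__ is the default)
    {
        apply_square<<<(N + 255) / 256, 256>>>(d_a, N);
    }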
21
Variable Qualifiers (GPU code)
- __device__:
  - stored in device memory (large capacity, high latency, uncached)
  - allocated with cudaMalloc (the __device__ qualifier is implied)
  - accessible by all threads
  - lifetime: application
- __shared__:
  - stored in on-chip shared memory (SRAM, low latency)
  - allocated by the execution configuration or at compile time
  - accessible by all threads in the same thread block
  - lifetime: duration of the thread block
- Unqualified variables:
  - scalars and built-in vector types are stored in registers
  - arrays may be placed in registers or local memory (registers are not addressable)
The sketch below combines these qualifiers.
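An illustrative sketch combining the qualifiers; it assumes a launch with exactly 128 threads per block and computes one partial sum per block.

    __device__ float d_scale = 2.0f;          // device memory, lifetime = application

    // assumes blockDim.x == 128, matching the tile size
    __global__ void block_sums(const float *in, float *out)
    {
        __shared__ float tile[128];           // on-chip, lifetime = this thread block
        int t = threadIdx.x;                  // unqualified scalar: held in a register
        tile[t] = in[blockIdx.x * blockDim.x + t] * d_scale;
        __syncthreads();
        if (t == 0) {                         // thread 0 reduces the whole tile
            float sum = 0.0f;
            for (int i = 0; i < blockDim.x; i++)
                sum += tile[i];
            out[blockIdx.x] = sum;            // one partial sum per block
        }
    }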
22
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables:
- dim3 gridDim: dimensions of the grid in blocks (at most 2D)
- dim3 blockDim: dimensions of the block in threads
- dim3 blockIdx: block index within the grid
- dim3 threadIdx: thread index within the block
The sketch below combines them into global indices.
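For example, a 2D launch can combine the four variables into global coordinates (a sketch; the kernel name and launch shape are assumptions):

    // e.g. launched as: dim3 grid(4, 4), block(8, 8); tag_threads<<<grid, block>>>(d_out);
    __global__ void tag_threads(int *out)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
        int pitch = gridDim.x * blockDim.x;              // total threads along x (32 here)
        out[y * pitch + x] = x + y;                      // one slot per thread
    }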
23
Executing Code on the GPU
Kernels are C functions with some restrictions:
- can only access GPU memory
- must have void return type
- no variable number of arguments ("varargs")
- not recursive
- no static variables
Function arguments are automatically copied from CPU to GPU memory.
24
Launching Kernels
Modified C function-call syntax:

    kernel<<<grid, block>>>(…);

Execution configuration ("<<< >>>"):
- grid dimensions: x and y
- thread-block dimensions: x, y, and z

    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel1<<<grid, block>>>(…);
    kernel2<<<…>>>(…);
25
Data Decomposition
Often we want each thread in a kernel to access a different element of an array.
(figure: a 15-element array, indices 0–14, split across three thread blocks with blockDim.x = 5; each thread locates its element via)

    idx = blockIdx.x * blockDim.x + threadIdx.x
26
Data Decomposition Example: Increment Array Elements
Increment each element of an N-element vector a by a scalar b.

CPU program:

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        …
        increment_cpu(a, b, N);
    }

CUDA program:

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        …
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }
27
LAB0: Setup CUDA Environment & Device Query
28
CUDA Environment Setup
- Check your NVIDIA GPU's compute capability (GPU generation): https://developer.nvidia.com/cuda-gpus
- Download the CUDA development files: https://developer.nvidia.com/cuda-toolkit-archive
  - CUDA driver
  - CUDA toolkit (PC Room ED417B used version 2.3)
  - CUDA SDK (optional)
- Install CUDA.
- Test CUDA with Device Query (check in the sample codes).
29
Setup CUDA for MS Visual Studio (Windows XP / VS2005–2010)
In PC Room ED417B:
- CUDA device: NVIDIA 8400GS
- CUDA toolkit version: 2.0 or 2.3
- VS2005 or VS2008
Options:
- From scratch: http://forums.nvidia.com/index.php?showtopic=30273
- CUDA VS Wizard: http://sourceforge.net/projects/cudavswizard/
- Modified from an existing project
30
Setup CUDA for MS Visual Studio 2012 (Win7 / VS2012)
- With MS Visual Studio 2012 installed, download and install CUDA toolkit 5.0: https://developer.nvidia.com/cuda-toolkit-50-archive
- Follow the instructions in this blog post: http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/
- You can start writing a CUDA program by creating a new project from VS2012's template.
- If you have projects from a previous VS version, it is recommended to re-create them with VS2012's new template.
31
CUDA Device Query Example (CUDA toolkit 6.0)
The deviceQuery sample prints each device's name, compute capability, and resources; a bare-bones equivalent is sketched below.
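The SDK's deviceQuery is more elaborate, but a minimal version needs only a few runtime calls (a sketch, not the SDK sample itself):

    #include <cstdio>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);                // number of CUDA-capable devices
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);     // fill in the device's properties
            printf("Device %d: %s, compute capability %d.%d, %d multiprocessor(s)\n",
                   i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        }
        return 0;
    }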
32
Lab1: First CUDA Program
Yet Another Random Number Sequence Generator
33
Yet Another Random Number Sequence Generator
- Implemented on both the CPU and the GPU.
- Functionality:
  - Given a random integer array A holding 8192 elements, generated by rand().
  - Re-generate the random numbers by applying the update below 256 times, ignoring any overflow that occurs:
    - B[i]_new = B[i]_old * A[i] (CPU)
    - C[i]_new = C[i]_old * A[i] (GPU)
  - Check that the CPU and GPU results are consistent.
A hypothetical sketch of the GPU kernel follows.
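A hypothetical sketch of what GPU_kernel() might look like; the real signature lives in lab1's device.cu, and this version assumes one thread per element (nBlk * nTid == 8192), which Lab2 then generalizes.

    // hypothetical; assumes nBlk * nTid == 8192 so idx covers every element
    __global__ void GPU_kernel(int *C, const int *A)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int c = C[idx];
        for (int t = 0; t < 256; t++)
            c = c * A[idx];        // overflow deliberately ignored
        C[idx] = c;
    }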
34
Data Manipulation between Host and Device
- cudaError_t cudaMalloc(void** devPtr, size_t count)
  Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr.
- cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst; kind indicates the type of transfer:
  - cudaMemcpyHostToHost
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice
- cudaError_t cudaFree(void* devPtr)
  Frees the memory space pointed to by devPtr.
35
Now, go and finish your first CUDA program!
36
Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
Open the project with Visual C++ 2008 (lab1/cuda_lab/cuda_lab.vcproj):
- main.cu: random input generation, output validation, result reporting
- device.cu: launches the GPU kernel; GPU kernel code
- parameter.h
Fill in the appropriate APIs in GPU_kernel() in device.cu.
37
Lab2: Make the Parallel Code Faster
Yet Another Random Number Sequence Generator
38
Parallel Processing in CUDA
- Parallel code can be partitioned into blocks and threads:
    cuda_kernel<<<nBlk, nTid>>>(…)
- Multiple tasks are initialized, each with a different block ID and thread ID.
- The tasks are dynamically scheduled; tasks within the same block are scheduled on the same streaming multiprocessor.
- Each task takes care of a single data partition according to its block ID and thread ID.
39
Locate a Data Partition by Built-in Variables
Built-in variables:
- gridDim: x, y
- blockIdx: x, y
- blockDim: x, y, z
- threadIdx: x, y, z
40
Data Partition for the Previous Example
When processing 64 integer data with cuda_kernel<<<…>>>(…):

    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int length     = SIZE / total_task;
    int head       = task_sn * length;
41
Processing a Single Data Partition
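A hedged sketch of how the variables from the previous slide drive each task's loop over its own contiguous slice; the update body is a placeholder, and SIZE is assumed to mirror parameter.h in the lab skeleton.

    #define SIZE 8192    // assumption: mirrors parameter.h in the lab skeleton

    __global__ void cuda_kernel(int *C, const int *A)
    {
        int total_task = gridDim.x * blockDim.x;
        int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
        int length     = SIZE / total_task;          // SIZE must be divisible by total_task
        int head       = task_sn * length;
        for (int i = head; i < head + length; i++)   // a contiguous slice per task
            C[i] = C[i] * A[i];                      // placeholder for the lab's update
    }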
42
Parallelize Your Program!
43
- Partition the kernel into threads:
  - increase nTid from 1 to 512
  - keep nBlk = 1
- Group threads into blocks:
  - adjust nBlk and see if it helps
- Maintain the total number of threads below 512 (e.g., nBlk * nTid < 512), and make sure that 8192 is divisible by that number.
44
Lab3: Resolve Memory Contention
Yet Another Random Number Sequence Generator
45
Parallel Memory Architecture
- Memory is divided into banks to achieve high bandwidth.
- Each bank can service one address per cycle.
- Successive 32-bit words are assigned to successive banks.
46
Lab2 Review
When processing 64 integer data: cuda_kernel<<<…>>>(…)
47
What About Interleaved Accessing?
When processing 64 integer data: cuda_kernel<<<…>>>(…)
48
Implementation of Interleaved Accessing
For cuda_kernel<<<…>>>(…), each task starts at its own serial number and strides by the total task count:

    head   = task_sn;
    stripe = total_task;

The full loop is sketched below.
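A hedged sketch of the interleaved kernel, under the same assumptions as the Lab2 sketch (SIZE from parameter.h, placeholder update body); only the loop changes.

    __global__ void cuda_kernel(int *C, const int *A)
    {
        int total_task = gridDim.x * blockDim.x;
        int head   = blockIdx.x * blockDim.x + threadIdx.x;  // start at own task number
        int stripe = total_task;                             // hop over all other tasks
        for (int i = head; i < SIZE; i += stripe)            // neighbors touch adjacent words
            C[i] = C[i] * A[i];                              // placeholder for the lab's update
    }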
49
Improve Your Program!
50
- Modify the original kernel code to access memory in an interleaved manner: cuda_kernel() in device.cu
- Adjust nBlk and nTid as in Lab2 and examine the effect.
- Maintain the total number of threads below 512 (e.g., nBlk * nTid < 512), and make sure that 8192 is divisible by that number.
51
Thank You
http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project:
- Subject: porting & optimizing any algorithm on any multi-core platform
- Demo: 1 week after the final exam @ ED412
- Group: 1–2 people per group
* Group members & demo time should be registered after the final exam @ ED412