GPU Computing CIS-543 Lecture 03: Introduction to CUDA


1 GPU Computing CIS-543 Lecture 03: Introduction to CUDA
Dr. Muhammad Abid, DCIS, PIEAS

2 Programmer's View of Computing System
To a CUDA programmer, the computing system consists of a host and one or more devices. On a PC motherboard, the northbridge (or host bridge) is one of the two chips in the core-logic chipset, the other being the southbridge; the northbridge typically handles communication among the CPU, in some cases RAM, and PCI Express (or AGP) video cards, while the southbridge handles the slower I/O devices. In modern software applications, program sections often exhibit a rich amount of data parallelism, a property allowing many arithmetic operations to be safely performed on program data structures in a simultaneous manner. CUDA devices accelerate the execution of these applications by harvesting this large amount of data parallelism.

3 Data Parallelism
GPUs expedite the execution of program sections exhibiting a rich amount of data parallelism. Data parallelism refers to the program property whereby many arithmetic operations can be safely performed on data structures in a simultaneous manner. Many software applications exhibit this property, ranging from image processing to bioinformatics, molecular dynamics, and physics simulations. Concretely, data parallelism refers to scenarios in which the same operation is performed concurrently (that is, in parallel) on the elements of a source collection or array: the source collection is partitioned so that multiple threads can operate on different segments concurrently.
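As a concrete illustration (a minimal sketch, not from the original slides; the kernel name and sizes are chosen for this example), element-wise vector addition is data parallel: every thread applies the same operation to a different element. Managed memory is used only to keep the sketch short and assumes a reasonably recent GPU and CUDA version.

#include <stdio.h>

// Each thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard threads past the array end
        C[i] = A[i] + B[i];
}

int main(void){
    const int n = 1024;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * sizeof(float));
    cudaMallocManaged(&B, n * sizeof(float));
    cudaMallocManaged(&C, n * sizeof(float));
    for (int i = 0; i < n; i++){ A[i] = 1.0f; B[i] = 2.0f; }
    vecAdd<<<(n + 255) / 256, 256>>>(A, B, C, n);  // enough blocks of 256 threads
    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);     // expect 3.000000
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}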

4 Data Parallelism Example
Matrix Multiplication: P = M × N, where P, M, and N are matrices. Each element of P is generated by performing a dot product between a row of input matrix M and a column of input matrix N. (The original slide shows M, N, and P as WIDTH × WIDTH squares.) All dot-product operations can be performed in parallel, so matrix multiplication of large dimensions can have a very large amount of data parallelism. For large matrices, the number of dot products can be very large; for example, a 1000 × 1000 matrix multiplication has 1,000,000 independent dot products, each involving 1000 multiply and 1000 accumulate arithmetic operations. By executing many dot products in parallel, a CUDA device can significantly accelerate the execution of the matrix multiplication over a traditional host CPU.
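A sketch of what such a kernel might look like (assumptions: this is not the course's kernel, the name matMul is mine, and the matrices are square and stored row-major): each thread computes one dot product, that is, one element of P.

__global__ void matMul(const float *M, const float *N, float *P, int width){
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width){
        float sum = 0.0f;
        // One dot product: row `row` of M with column `col` of N.
        for (int k = 0; k < width; k++)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}

For width = 1000, a launch such as matMul<<<dim3(63,63), dim3(16,16)>>>(M, N, P, 1000) creates roughly the 1,000,000 threads mentioned on slide 6 (63 × 16 = 1008 threads per dimension, with the boundary check discarding the excess).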

5 CUDA Program Structure
A unified CUDA program contains both host code and device code: host code for the serial parts of the application, and device code for the program sections exhibiting data parallelism. The host code is straight ANSI C; the device code is ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures. The NVIDIA C/C++ compiler (nvcc) separates the two during the compilation process: the host code is further compiled with the host's standard C compilers and runs as an ordinary CPU process, while the device code is typically further compiled by nvcc and executed on a GPU device.

6 Kernel Functions
A data-parallel function in CUDA is known as a kernel function, or simply a kernel; it generates a large number of threads to exploit data parallelism. A matrix multiplication kernel can generate thousands of threads, where each thread computes one dot product. For the case M = N = P = 1000 × 1000, the kernel will generate 1,000,000 threads when invoked (called). CUDA threads are of much lighter weight than CPU threads: they take very few cycles to generate and schedule, thanks to efficient hardware support.

7 Kernel Functions
A kernel specifies the code to be executed by all threads during a parallel phase. Because all of these threads execute the same code, CUDA programming is an instance of the well-known single-program, multiple-data (SPMD) parallel programming style.

8 Execution of a CUDA Program
The execution starts with host (CPU) execution. When a kernel function is invoked, or launched, execution moves to a device (GPU), where a large number of threads are generated to take advantage of abundant data parallelism. All the threads generated by a kernel during an invocation are collectively called a grid. The slide's figure (not reproduced here) shows the execution of two grids of threads; how these grids are organized is discussed later. When all threads of a kernel complete their execution, the corresponding grid terminates, and execution continues on the host until another kernel is invoked.
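A minimal sketch of this host/device alternation (kernel names and launch sizes are hypothetical): each launch produces one grid, and host execution resumes in between.

__global__ void kernelA(void){ /* data-parallel phase 1 */ }
__global__ void kernelB(void){ /* data-parallel phase 2 */ }

int main(void){
    // Serial host code runs first.
    kernelA<<<8, 128>>>();     // grid 1: 8 blocks of 128 threads
    cudaDeviceSynchronize();   // grid 1 terminates here
    // More serial host code may run in between.
    kernelB<<<4, 256>>>();     // grid 2
    cudaDeviceSynchronize();   // grid 2 terminates here
    return 0;
}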

9 Hello from CUDA host! CUDA: Basic example HelloCuda1.cu
#include <stdio.h>

int main(void){
    printf("Hello from CUDA host! \n");
    return(0);
}

To build the program, use the nvcc compiler:
% nvcc -o helloCuda1 helloCuda1.cu

10 Hello from CUDA device! CUDA: Basic example HelloCuda2.cu
#include <stdio.h>

__global__ void printkernel(void){
    printf("Hello, I am CUDA kernel! Nice to meet you!\n");
}

int main(void){
    printkernel<<<2,2>>>();    // 2 blocks of 2 threads: the message prints 4 times
    cudaDeviceSynchronize();
    return(0);
}

Notes: Devices with compute capability 2.x or higher support calls to printf from within a CUDA kernel (you must be using CUDA version 3.1 or higher). The GPU executes threads in a SIMT fashion: threads are divided into warps of 32 threads, and each instruction is executed per warp. Multiple kernels in the default stream are executed in the order specified in the program. A kernel launch is asynchronous with respect to the host.

11 Hello from CUDA device kernel!
CUDA: Basic example HelloCuda3.cu

#include <stdio.h>

__global__ void printkernel(void){
    printf("Hello, I am CUDA thread %d! Nice to meet you!\n", threadIdx.x);
}

int main(void){
    printkernel<<<1,4>>>();    // 1 block of 4 threads: threads 0..3 each print once
    cudaDeviceSynchronize();
    return(0);
}

12 Compiling CUDA programs “nvcc”
NVIDIA provides nvcc, the NVIDIA CUDA C/C++ compiler. It is actually a compiler driver: it separates out the code for the host and the code for the device. A regular C/C++ compiler is used for the host and needs to be available. The programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system.

13 Compiling code - Linux Command line:
nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart

A CUDA source file that includes device code has the extension .cu. nvcc separates the code for the CPU from the code for the GPU and compiles both; a regular C compiler needs to be installed for the CPU side.
-O3: optimization level, if you want optimized code
-I: directories for #include files
-L: directories for libraries
-l: libraries to be linked
See "The CUDA Compiler Driver NVCC" from NVIDIA for more details.

14 Compilation process
The nvcc frontend divides the code into host and device parts. The host part is compiled by a regular C compiler; the device part is compiled by the NVIDIA device compiler. The two compiled parts are combined into one executable.
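To observe this split in practice, nvcc can keep its intermediate files or emit PTX, the device's intermediate assembly (standard nvcc options, shown here only as an illustration):

% nvcc --keep -o helloCuda2 helloCuda2.cu    # keep intermediate host and device files
% nvcc -ptx helloCuda2.cu                    # compile only the device code, to helloCuda2.ptx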

15 Executing Program
Simply type the name of the executable created by nvcc. The fatbinary contains the code for the device only; the embedded fatbinary is inspected by the CUDA runtime system whenever the device code is launched by the host program, to obtain an appropriate fatbinary image for the current GPU.

16 Kernel Execution Configuration
Kernel_name<<<Dg, Db, Ns, S>>>(arg1, arg2, …);
Dg is of type dim3; the grid dimension; it specifies the number of thread blocks in the grid. Dg.x * Dg.y * Dg.z equals the number of thread blocks being launched.
Db is of type dim3; the thread block dimension; it specifies the number of threads in a thread block. Db.x * Db.y * Db.z equals the number of threads per block.

17 Kernel Execution Configuration
Ns is of type size_t and specifies the number of bytes of dynamically allocated shared memory; Ns is an optional argument which defaults to 0.
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
Examples:
Kn<<<1,4>>>(); // creates 4 threads at run time (1 block of 4 threads)
Kn<<<4,4>>>(); // creates 16 threads at run time (4 blocks of 4 threads)
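A runnable sketch of these parameters in use (the kernel kn and all dimensions are illustrative, not from the slides); Ns and S are simply omitted, so they take their defaults:

#include <stdio.h>

__global__ void kn(void){
    printf("block (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main(void){
    dim3 dg(2, 2);     // Dg: 2*2*1 = 4 thread blocks in the grid
    dim3 db(4, 2);     // Db: 4*2*1 = 8 threads per block
    kn<<<dg, db>>>();  // Ns and S omitted: 0 bytes of dynamic shared memory, default stream
    cudaDeviceSynchronize();
    return 0;
}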

18 Formatted Output
Formatted output is only supported by devices of compute capability 2.x and higher.
int printf(const char *format[, arg, ...]);
prints formatted output from a kernel to a host-side output stream. The printf() command is executed as any other device-side function: per-thread, and in the context of the calling thread. The in-kernel printf() function behaves in a similar way to the standard C-library printf() function, and the user is referred to the host system's manual pages for a complete description of printf() behavior. In essence, the string passed in as format is output to a stream on the host, with substitutions made from the argument list wherever a format specifier is encountered. Supported format specifiers are listed on the next slide. Unlike the C-standard printf(), which returns the number of characters printed, CUDA's printf() returns the number of arguments parsed.

19 Formatted Output
Format specifiers take the form: %[flags][width][.precision][size]type
Flags: '#', ' ', '0', '+', '-'
Width: '*', '0-9'
Precision: '0-9'
Size: 'h', 'l', 'll'
Type: '%cdiouxXpeEfgGaAs'
Note: CUDA's printf() will accept any combination of flag, width, precision, size, and type, whether or not they form a valid format specifier overall. See a printf reference for details.

20 Formatted Output Limitations
Final formatting of the printf() output takes place on the host system, so the format string must be understood by the host system's compiler and C library. The printf() command can accept at most 32 arguments in addition to the format string; additional arguments beyond this will be ignored, and the format specifier output as-is. Every effort has been made to ensure that the format specifiers supported by CUDA's printf() form a universal subset of those supported by the most common host compilers, but exact behavior will be host-OS-dependent.

21 Formatted Output Limitations
Make sure that the compiling system and the running system use the same data type sizes. For example, consider a differing size of the long type: a kernel compiled on a machine where long is 8 bytes but then run on a machine where long is 4 bytes will see corrupted output for all format strings which include "%ld". It is recommended that the compilation platform match the execution platform to ensure safety.
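One defensive option (my suggestion, not from the slides) is to avoid size-dependent specifiers such as %ld by casting to a type of known width; the 'll' size modifier is among the supported specifiers listed on slide 19:

#include <stdio.h>

__global__ void printCount(long count){
    // Cast to long long (at least 64 bits everywhere) and use %lld,
    // avoiding the 4-byte-versus-8-byte long mismatch described above.
    printf("count = %lld\n", (long long)count);
}

int main(void){
    printCount<<<1,1>>>(123456789L);
    cudaDeviceSynchronize();
    return 0;
}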

22 Formatted Output Limitations
The output buffer for printf() is set to a fixed size before kernel launch. It is circular: if more output is produced during kernel execution than can fit in the buffer, older output is overwritten.
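The buffer size can be queried and changed from the host before any kernel is launched, using the CUDA runtime's device-limit API (a brief sketch; the 1 MB value is arbitrary):

#include <stdio.h>

int main(void){
    size_t size;
    cudaDeviceGetLimit(&size, cudaLimitPrintfFifoSize);
    printf("default printf buffer: %zu bytes\n", size);
    // Enlarge the buffer so chatty kernels overwrite less of their output.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);
    return 0;
}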

23 Output Buffer
The buffer is flushed only when one of these actions is performed:
Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and, if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well);
Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize().
Notes on CUDA_LAUNCH_BLOCKING: when it is set to 1, every kernel launch itself blocks and flushes the buffer, so no other synchronization method is needed for flushing. The variable affects only the runtime and has no link with compilation. Use export CUDA_LAUNCH_BLOCKING=1 and then run the application; to delete the variable, use: unset CUDA_LAUNCH_BLOCKING.

24 Output Buffer
The buffer is also flushed by:
Memory copies via any blocking version of cudaMemcpy() or cuMemcpy();
Module loading/unloading via cuModuleLoad() or cuModuleUnload();
Context destruction via cudaDeviceReset() or cuCtxDestroy();
Prior to executing a stream callback added by cudaStreamAddCallback() or cuStreamAddCallback().
Note: the buffer is not flushed automatically when the program exits.
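A small sketch of the consequence (illustrative, not from the slides): without one of the flushing actions above, kernel output can be lost when the program exits.

#include <stdio.h>

__global__ void greet(void){
    printf("this message may never appear\n");
}

int main(void){
    greet<<<1,1>>>();
    // No cudaDeviceSynchronize() or other flushing action: the host may
    // exit before the device's printf buffer is flushed, losing the output.
    return 0;
}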

