An Introduction to GPU Computing

1 An Introduction to GPU Computing
Ryan Szypowski
Department of Mathematics and Statistics

2 Outline
- Who am I?
- Why GPU?
- General Model
- OpenCL
- Example

3 What are GPUs good for?

(Slides 3-6 are image slides titled "What are GPUs good for?")
7 What are GPUs REALLY good for?
- Lots and lots of independent calculations
- Specifically, the same "kernel" of computation on independent "streams" of data
- Development is driven by the deep-pocketed gaming industry
- The computing and memory-bandwidth gap between CPU and GPU is widening

8 Shameless theft (stolen from a talk by Wen-mei Hwu and John Stone and credited to John Owens)

9 General Model
- Based on "stream" processing
- A stream is a set of records requiring the same processing
- Data parallelism or loop-level parallelism
- The processing is called a "kernel"
- In graphics applications, the streams are vertices and fragments, and the kernels are vertex and fragment shaders

10 General Model
- Data inside a kernel is either input or output, never both
- Ideal GPU computing has a large amount of data requiring a significant amount of computation, with the computations as independent as possible

11 OpenCL (stolen from the Khronos Group talk by Ofer Rosenberg)

12 OpenCL
- OpenCL has a single host program (standard C code, or another preferred language)
- The computation to be parallelized is called a work-item
- The code for a work-item is stored in a kernel
- For platform independence, the kernels are compiled at run time

13 OpenCL
- Work-items are grouped into workgroups
- Work-items within a workgroup execute simultaneously
- Workgroups are scheduled asynchronously
- Workgroups can be organized in different topologies
- The memory model is complicated and must be dealt with explicitly

14 OpenCL

15 OpenCL: Basic Structure
- Create a "Context": basically something that contains all the other structures
- Get the "Device(s)" that you will work on
- Create a "Command Queue" in your context

16 OpenCL: Basic Structure
- Allocate memory "Buffers" in your context
- Compile your "Kernel"
- Copy data from host memory into device memory
- "Execute" the kernel!
- Copy the results back

17 Example
Code modified (very slightly) from Erik Smistad's blog post about OpenCL.

Vector addition:

(CPU)
for i = 1:n
    c[i] = a[i] + b[i]

(GPU)
i = get_global_id(0)
c[i] = a[i] + b[i]

18 Example: vector_add_kernel.cl
__kernel void vector_add(__global int *A, __global int *B, __global int *C) {
    // Get the index of the current element
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}

19 Example: main.c (portions)
// Get platform and device information
cl_platform_id platform_id = NULL;
cl_device_id device_id = NULL;
cl_uint ret_num_devices;
cl_uint ret_num_platforms;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id,
                     &ret_num_devices);

20 Example: main.c (portions)
// Create an OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

// Create a command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_id,
                                                     0, &ret);

21 Example: main.c (portions)
// Create memory buffers on the device for each vector
cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);
cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);
cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);

22 Example: main.c (portions)
// Copy the lists A and B to their respective memory buffers
ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                           LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0,
                           LIST_SIZE * sizeof(int), B, 0, NULL, NULL);

23 Example: main.c (portions)
// Build the program and create the OpenCL kernel
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

// Set the arguments of the kernel
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

24 Example: main.c (portions)
// Execute the OpenCL kernel on the list
size_t global_item_size = LIST_SIZE;
size_t local_item_size = 64; // groups of 64
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, &local_item_size,
                             0, NULL, NULL);

25 Example: main.c (portions)
// Read the memory buffer C on the device to the local variable C
int *C = (int *)malloc(sizeof(int) * LIST_SIZE);
ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
                          LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

26 Results
- Run on home desktop running Fedora 17
- Intel i (quad core + hyperthreading)
- NVIDIA GeForce GTX 570 (480 CUDA cores)

27 Results

28 References
- http://www.khronos.org/opencl/
- ncl_lec1.pdf
- opencl-and-gpu-computing/

29 Thanks!
