An Introduction to GPU Computing
Ryan Szypowski, Department of Mathematics and Statistics

Outline
- Who am I?
- Why GPU?
- General Model
- OpenCL
- Example

What are GPUs good for?

What are GPUs REALLY good for?
- Lots and lots of independent calculations
- Specifically, the same “kernel” of computation applied to independent “streams” of data
- Development is driven by the deep-pocketed gaming industry
- The compute and memory-bandwidth gap between CPU and GPU is widening

Shameless theft (stolen from a talk by Wen-mei Hwu and John Stone and credited to John Owens)

General Model
- Based on “stream” processing
- A stream is a set of records requiring the same processing
- This is data parallelism, or loop-level parallelism
- The processing applied is called a “kernel”
- In graphics applications, the streams are vertices and fragments, and the kernels are vertex and fragment shaders

General Model
- Data inside a kernel is either input or output, never both
- The ideal GPU computation has a large amount of data, needs a significant amount of computing per element, and keeps that computing as independent as possible
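
To make the input/output rule concrete, here is a minimal, hypothetical kernel (not from the talk): one buffer is only ever read, the other only ever written.

__kernel void scale(__global const float *in,  // input: only read
                    __global float *out,       // output: only written
                    const float factor)
{
    int i = get_global_id(0);  // this work-item's element
    out[i] = factor * in[i];   // in and out never alias
}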

OpenCL (stolen from the Khronos Group talk by Ofer Rosenberg)

OpenCL
- OpenCL has a single host program (standard C code, or your preferred language)
- The computation to be parallelized is called a work-item
- The code for a work-item is stored in a kernel
- For platform independence, the kernels are compiled at run-time

OpenCL
- Work-items are grouped into workgroups
- Work-items within a workgroup execute simultaneously
- Workgroups are scheduled asynchronously
- Workgroups can be organized in different topologies
- The memory model is complicated and must be dealt with explicitly
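
As a sketch of what “explicitly” means (an illustrative kernel, not from the talk): a workgroup reduction that stages data in fast __local memory and synchronizes with barriers. The host would size the __local buffer by passing a NULL pointer and a byte count to clSetKernelArg.

__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local float *scratch)
{
    int lid  = get_local_id(0);    // index within the workgroup
    int size = get_local_size(0);  // workgroup size

    scratch[lid] = in[get_global_id(0)];  // stage into local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // wait for the whole workgroup

    // Tree reduction within the workgroup
    for (int s = size / 2; s > 0; s /= 2) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)  // one partial sum per workgroup
        out[get_group_id(0)] = scratch[0];
}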

OpenCL: Basic Structure
- Create a “Context”: basically something that contains all the other structures
- Get the “Device(s)” that you will work on
- Create a “Command Queue” in your context

OpenCL: Basic Structure
- Allocate memory “Buffers” in your context
- Compile your “Kernel”
- Copy data from host memory into device memory
- “Execute” the kernel!
- Copy results back

Example
Code modified (very slightly) from Erik Smistad's blog post about OpenCL. Vector addition:

(CPU)
for i = 1:n
    c[i] = a[i] + b[i]

(GPU)
i = get_global_id(0)
c[i] = a[i] + b[i]

Example: vector_add_kernel.cl

__kernel void vector_add(__global int *A,
                         __global int *B,
                         __global int *C)
{
    // Get the index of the current element
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}

Example: main.c (portions)

// Get platform and device information
cl_platform_id platform_id = NULL;
cl_device_id device_id = NULL;
cl_uint ret_num_devices;
cl_uint ret_num_platforms;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1,
                     &device_id, &ret_num_devices);

Example: main.c (portions)

// Create an OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id,
                                     NULL, NULL, &ret);

// Create a command queue
cl_command_queue command_queue =
    clCreateCommandQueue(context, device_id, 0, &ret);

Example: main.c (portions)

// Create memory buffers on the device for each vector
// (the vectors are ints, matching the kernel's arguments)
cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);
cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);
cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  LIST_SIZE * sizeof(int), NULL, &ret);

Example: main.c (portions)

// Copy the lists A and B to their respective memory buffers
ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                           LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0,
                           LIST_SIZE * sizeof(int), B, 0, NULL, NULL);
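
The next slide builds a cl_program named program whose creation the excerpts skip. A minimal sketch of the missing step, along the lines of Smistad's post (MAX_SOURCE_SIZE is a host-side constant, e.g. 0x100000; error checking omitted): the kernel source is read from vector_add_kernel.cl and, as promised earlier, compiled at run-time.

// Load the kernel source code into a string
FILE *fp = fopen("vector_add_kernel.cl", "r");
char *source_str = (char *)malloc(MAX_SOURCE_SIZE);
size_t source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);

// Create the program object from the source text
cl_program program = clCreateProgramWithSource(context, 1,
    (const char **)&source_str, (const size_t *)&source_size, &ret);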

Example: main.c (portions)

// Build the program and create the OpenCL kernel
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

// Set the arguments of the kernel
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

Example: main.c (portions)

// Execute the OpenCL kernel on the list
size_t global_item_size = LIST_SIZE;  // total number of work-items
size_t local_item_size = 64;          // workgroups of 64
                                      // (LIST_SIZE must divide evenly by 64)
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, &local_item_size,
                             0, NULL, NULL);

Example: main.c (portions)

// Read the memory buffer C on the device into
// the local variable C
int *C = (int *)malloc(LIST_SIZE * sizeof(int));
ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
                          LIST_SIZE * sizeof(int), C, 0, NULL, NULL);
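
Not shown on the slides: once the results are back, the program would release its OpenCL objects and host memory. A sketch of that teardown, mirroring Smistad's post (A and B are the host arrays written earlier):

// Flush and finish outstanding work, then release everything
ret = clFlush(command_queue);
ret = clFinish(command_queue);
ret = clReleaseKernel(kernel);
ret = clReleaseProgram(program);
ret = clReleaseMemObject(a_mem_obj);
ret = clReleaseMemObject(b_mem_obj);
ret = clReleaseMemObject(c_mem_obj);
ret = clReleaseCommandQueue(command_queue);
ret = clReleaseContext(context);
free(A);
free(B);
free(C);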

Results
- Run on my home desktop, running Fedora 17
- Intel i7-3770 (quad core + hyperthreading)
- NVIDIA GeForce GTX 570 (480 CUDA cores)

References
http://www.khronos.org/opencl/
http://www.ks.uiuc.edu/Research/gpu/files/upcrc_opencl_lec1.pdf
http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

Thanks!