OpenCL introduction II.


OpenCL infrastructure
- OpenCL platform
- Compute devices
- OpenCL context
- Command queue
- Synchronization
- Memory objects
- OpenCL programs
- OpenCL functions

OpenCL platform: Querying platforms
- num_entries: number of platforms to query
- platforms: array that receives the platform IDs
- num_platforms: number of platforms returned

cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)

OpenCL platform: Querying platforms
- Query the number of platforms
- Allocate memory for the results
- Query the platform IDs

cl_uint num_platforms = 0;
clGetPlatformIDs(0, NULL, &num_platforms);
cl_platform_id *platforms = new cl_platform_id[num_platforms];
clGetPlatformIDs(num_platforms, platforms, NULL);

OpenCL platform: Platform information
- platform: ID of the selected platform
- param_name: the parameter to query
- param_value_size: maximum size of the answer in bytes
- param_value: the memory to store the answer in
- param_value_size_ret: actual size of the answer in bytes

cl_int clGetPlatformInfo(cl_platform_id platform, cl_platform_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

OpenCL platform: Platform information
- CL_PLATFORM_PROFILE (FULL_PROFILE or EMBEDDED_PROFILE)
- CL_PLATFORM_VERSION
- CL_PLATFORM_NAME
- CL_PLATFORM_VENDOR
- CL_PLATFORM_EXTENSIONS
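A minimal sketch of how the two calls combine in practice: the count-then-query pattern from the previous slide, followed by clGetPlatformInfo for two of the string properties listed above. Error checking is omitted for brevity.

#include <CL/cl.h>
#include <iostream>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    cl_platform_id *platforms = new cl_platform_id[num_platforms];
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256], version[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(version), version, NULL);
        std::cout << name << " (" << version << ")" << std::endl;
    }
    delete[] platforms;
    return 0;
}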

OpenCL Device: The available devices in a platform
- device_type: the type of the queried device
  - CL_DEVICE_TYPE_CPU
  - CL_DEVICE_TYPE_GPU
  - CL_DEVICE_TYPE_ACCELERATOR
  - CL_DEVICE_TYPE_DEFAULT
  - CL_DEVICE_TYPE_ALL

cl_int clGetDeviceIDs(cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices, cl_uint *num_devices)
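Device enumeration follows the same count-then-query pattern. A sketch, assuming the platforms array from the previous example and that the first platform exposes at least one GPU:

cl_uint num_devices = 0;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
cl_device_id *devices = new cl_device_id[num_devices];
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);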

OpenCL Device: Device information
- device: the ID of the device to query
- param_name: the property to query
- ...

cl_int clGetDeviceInfo(cl_device_id device, cl_device_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

OpenCL Device: Device properties
- CL_DEVICE_TYPE
- CL_DEVICE_COMPILER_AVAILABLE
- CL_DEVICE_NAME
- CL_DEVICE_VENDOR
- CL_DEVICE_VERSION
- CL_DRIVER_VERSION
- CL_DEVICE_EXTENSIONS

OpenCL Device: Device properties
- CL_DEVICE_MAX_COMPUTE_UNITS
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
- CL_DEVICE_MAX_WORK_ITEM_SIZES
- CL_DEVICE_MAX_WORK_GROUP_SIZE
- CL_DEVICE_AVAILABLE

OpenCL Device: Device memory properties
- CL_DEVICE_MAX_PARAMETER_SIZE
- CL_DEVICE_GLOBAL_MEM_SIZE
- CL_DEVICE_MAX_MEM_ALLOC_SIZE
- CL_DEVICE_LOCAL_MEM_SIZE
- CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
- CL_DEVICE_MAX_CONSTANT_ARGS
- CL_DEVICE_IMAGE_SUPPORT
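A short sketch of querying a few of these properties with clGetDeviceInfo, assuming the devices array from the enumeration sketch above. String properties are returned into a char buffer, while sizes come back as cl_ulong or size_t values.

char device_name[256];
cl_ulong global_mem = 0;
size_t max_wg = 0;
clGetDeviceInfo(devices[0], CL_DEVICE_NAME, sizeof(device_name), device_name, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
std::cout << device_name << ": " << (global_mem >> 20) << " MB global memory, "
          << "max work-group size " << max_wg << std::endl;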

OpenCL Context: Creating a context
- props: list of properties (CL_CONTEXT_PLATFORM followed by a cl_platform_id, terminated with 0)
- num_devices: number of devices in the devices argument
- devices: devices to be associated with the context

cl_context clCreateContext(const cl_context_properties *props, cl_uint num_devices, const cl_device_id *devices, void (*pfn_notify)(...), void *user_data, cl_int *errcode_ret)

OpenCL Context: Creating a context
- pfn_notify: callback function for error handling
  - Asynchronous error reporting for the context
  - The callback must be thread-safe
  - Arguments: const char* errinfo (the text message of the error), const void* private_info (implementation-dependent debug information), size_t cb (size of the private_info argument), void *user_data (user data passed through)

cl_context clCreateContext(const cl_context_properties *props, cl_uint num_devices, const cl_device_id *devices, void (*pfn_notify)(...), void *user_data, cl_int *errcode_ret)

OpenCL Context: Creating a context
- user_data: user data, handed over to the pfn_notify function
- errcode_ret: error information

cl_context clCreateContext(const cl_context_properties *props, cl_uint num_devices, const cl_device_id *devices, void (*pfn_notify)(...), void *user_data, cl_int *errcode_ret)
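Putting the last three slides together, a hedged sketch of context creation with an error callback; platforms, devices, and num_devices are assumed from the earlier sketches.

void CL_CALLBACK contextNotify(const char *errinfo, const void *private_info,
                               size_t cb, void *user_data) {
    // Called asynchronously by the runtime; keep it thread-safe.
    std::cerr << "OpenCL context error: " << errinfo << std::endl;
}

cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0], 0
};
cl_int err = CL_SUCCESS;
cl_context context = clCreateContext(props, num_devices, devices, contextNotify, NULL, &err);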

OpenCL Context
Increase the context reference counter
Decrease the context reference counter
Context information, param_name:
- CL_CONTEXT_REFERENCE_COUNT
- CL_CONTEXT_DEVICES
- CL_CONTEXT_PROPERTIES

cl_int clRetainContext(cl_context context)
cl_int clReleaseContext(cl_context context)
cl_int clGetContextInfo(cl_context context, cl_context_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

OpenCL Command Queue: Creating a command queue
- properties:
  - CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
  - CL_QUEUE_PROFILING_ENABLE

cl_command_queue clCreateCommandQueue(cl_context context, cl_device_id device, cl_command_queue_properties properties, cl_int *errcode_ret)
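A minimal sketch: create an in-order queue with profiling enabled on the first device, assuming the context and devices from the sketches above.

cl_int err = CL_SUCCESS;
cl_command_queue commands = clCreateCommandQueue(context, devices[0],
                                                 CL_QUEUE_PROFILING_ENABLE, &err);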

OpenCL Command Queue
Increase the command queue reference counter
Decrease the command queue reference counter
Command queue information, param_name:
- CL_QUEUE_CONTEXT
- CL_QUEUE_DEVICE
- CL_QUEUE_PROPERTIES

cl_int clRetainCommandQueue(cl_command_queue queue)
cl_int clReleaseCommandQueue(cl_command_queue queue)
cl_int clGetCommandQueueInfo(cl_command_queue queue, cl_command_queue_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Synchronization
- Waiting on an event
- Barrier
- Implicit synchronization
- Explicit synchronization

cl_int clEnqueueWaitForEvents(cl_command_queue queue, cl_uint num_events, const cl_event *event_list)
cl_int clEnqueueBarrier(cl_command_queue queue)

Synchronization: Emptying the command queue
- clFlush: issues all previously queued commands to the device without waiting for them
- clFinish: blocks until all previously queued commands have completed

cl_int clFlush(cl_command_queue queue)
cl_int clFinish(cl_command_queue queue)
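A short sketch combining these primitives: a non-blocking write followed by an event wait so that a later kernel launch only starts after the transfer, and a clFinish before the host touches the results. The buffer, kernel, hostData, bytes and globalSize names are placeholders assumed to exist.

cl_event write_done;
clEnqueueWriteBuffer(commands, buffer, CL_FALSE, 0, bytes, hostData, 0, NULL, &write_done);
// Commands enqueued after this point wait until the write has finished.
clEnqueueWaitForEvents(commands, 1, &write_done);
clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
// Block on the host until everything submitted so far has completed.
clFinish(commands);
clReleaseEvent(write_done);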

Memory Objects
Buffer object
- Linear memory area
- Scalar, vector, and struct types
- Can be addressed with pointers
Image object
- 2D or 3D memory area
- Predefined texture formats
- Can be accessed through samplers

Memory Objects: Creating memory objects
- size: size of the buffer in bytes
- host_ptr: pointer to a previously allocated host memory area

cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)

Memory Objects: Creating memory objects
- flags:
  - CL_MEM_READ_WRITE
  - CL_MEM_WRITE_ONLY
  - CL_MEM_READ_ONLY
  - CL_MEM_USE_HOST_PTR
  - CL_MEM_ALLOC_HOST_PTR
  - CL_MEM_COPY_HOST_PTR

cl_mem clCreateBuffer(... cl_mem_flags flags, ...)
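A hedged sketch of buffer creation: a read-only device buffer initialized directly from host memory via CL_MEM_COPY_HOST_PTR, using the context created earlier.

const size_t N = 1024;
float hostData[N];
for (size_t i = 0; i < N; ++i) hostData[i] = (float)i;

cl_int err = CL_SUCCESS;
// The flag combination copies hostData into the new buffer at creation time.
cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * N, hostData, &err);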

Memory Objects: Reading memory objects
- buffer: ID of the buffer to be read
- blocking_read: synchronous or asynchronous operation

cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)

Memory Objects: Reading memory objects
- offset: offset into the buffer in bytes
- cb: number of bytes to read
- ptr: host memory where the result is stored

cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)

Memory Objects: Reading memory objects
- num_events_in_wait_list: number of events to wait on
- event_wait_list: events to wait on
- event: event signaling the end of the read operation

cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)

Memory Objects: Writing memory objects
- Arguments are analogous to those of the read operation

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
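A minimal round-trip sketch using blocking transfers, assuming the commands queue and the input buffer, hostData and N from the earlier sketches:

float results[N];
// Blocking write: returns only after hostData has been copied into the buffer.
clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * N, hostData, 0, NULL, NULL);
// ... enqueue kernels that work on the buffer here ...
// Blocking read: returns only after the buffer contents are available in results.
clEnqueueReadBuffer(commands, input, CL_TRUE, 0, sizeof(float) * N, results, 0, NULL, NULL);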

Memory Objects: Copying memory objects
- src_buffer, src_offset: the source memory area
- dst_buffer, dst_offset: the target memory area
- cb: number of bytes to copy

cl_int clEnqueueCopyBuffer(cl_command_queue command_queue, cl_mem src_buffer, cl_mem dst_buffer, size_t src_offset, size_t dst_offset, size_t cb, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)

Memory Objects
Increase the memory object reference counter
Decrease the memory object reference counter
Memory object information, param_name:
- CL_MEM_TYPE
- CL_MEM_SIZE
- CL_MEM_MAP_COUNT

cl_int clRetainMemObject(cl_mem memobj)
cl_int clReleaseMemObject(cl_mem memobj)
cl_int clGetMemObjectInfo(cl_mem memobj, cl_mem_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Memory Objects: Mapping memory objects
- map_flags:
  - CL_MAP_READ
  - CL_MAP_WRITE

void* clEnqueueMapBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_map, cl_map_flags map_flags, size_t offset, size_t cb, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event, cl_int *errcode_ret)

Memory Objects: Unmapping memory objects
- memobj: the memory object to be unmapped
- mapped_ptr: host pointer to the mapped memory

cl_int clEnqueueUnmapMemObject(cl_command_queue command_queue, cl_mem memobj, void *mapped_ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
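A hedged sketch of the map/unmap pattern on the input buffer from the earlier sketches: map it for reading, inspect it through the returned host pointer, then unmap it.

cl_int err = CL_SUCCESS;
// Blocking map: the pointer is valid as soon as the call returns.
float *mapped = (float*)clEnqueueMapBuffer(commands, input, CL_TRUE, CL_MAP_READ,
                                           0, sizeof(float) * N, 0, NULL, NULL, &err);
std::cout << "first element: " << mapped[0] << std::endl;
// Release the mapping; the pointer must not be used afterwards.
clEnqueueUnmapMemObject(commands, input, mapped, 0, NULL, NULL);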

Program Objects
A program object stores:
- Context information
- The source of the program or its binary representation
- The result of the last successful compilation
- Compile information: options, build log
- The number of kernel objects

Program Objects: Creating a program object
- count: number of source strings
- strings: the source of the program
- lengths: the lengths of the source strings

cl_program clCreateProgramWithSource(cl_context context, cl_uint count, const char **strings, const size_t *lengths, cl_int *errcode_ret)

Program Objects: Creating program objects
- device_list: devices the binaries belong to
- binaries: binaries for the listed devices
- binary_status: whether loading the binary succeeded for each device

cl_program clCreateProgramWithBinary(cl_context context, cl_uint num_devices, const cl_device_id *device_list, const size_t *lengths, const unsigned char **binaries, cl_int *binary_status, cl_int *errcode_ret)

Program Objects
Increase the program reference counter
Decrease the program reference counter
Program object information, param_name:
- CL_PROGRAM_DEVICES
- CL_PROGRAM_SOURCE
- CL_PROGRAM_BINARIES

cl_int clRetainProgram(cl_program program)
cl_int clReleaseProgram(cl_program program)
cl_int clGetProgramInfo(cl_program program, cl_program_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Program Objects: Building a program
- program: the program object to be compiled
- num_devices: number of devices to build the program on
- device_list: devices to build the program on

cl_int clBuildProgram(cl_program program, cl_uint num_devices, const cl_device_id *device_list, const char* options, void (*pfn_notify)(...), void *user_data)
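A minimal sketch that creates a program from an embedded source string and builds it for the devices enumerated earlier; the kernel source here is just an illustrative placeholder.

const char *source =
    "__kernel void square(__global float* data) {"
    "    int id = get_global_id(0);"
    "    data[id] = data[id] * data[id];"
    "}";

cl_int err = CL_SUCCESS;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, num_devices, devices, "-cl-mad-enable", NULL, NULL);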

Program Objects: Compiler options
Preprocessor options
- -D name
- -D name=definition
- -I dir
Math intrinsic options
- -cl-single-precision-constant
- -cl-denorms-are-zero
Warning options
- -w
- -Werror

Program Objects: Compiler options
Optimization options
- -cl-opt-disable
- -cl-strict-aliasing
- -cl-mad-enable
- -cl-no-signed-zeros
- -cl-unsafe-math-optimizations
- -cl-finite-math-only
- -cl-fast-relaxed-math

Program Objects: Compile information
- CL_PROGRAM_BUILD_STATUS
- CL_PROGRAM_BUILD_OPTIONS
- CL_PROGRAM_BUILD_LOG
Removing the compiler
- The compiler can be removed from memory
- Just a hint; if the compiler is needed again, the system restores it

cl_int clGetProgramBuildInfo(cl_program program, cl_device_id device, cl_program_build_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)
cl_int clUnloadCompiler(void)
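When clBuildProgram fails, the build log usually explains why. A sketch of retrieving it with clGetProgramBuildInfo, assuming the program, devices and err variables from the build sketch above:

if (err != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = new char[log_size];
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    std::cerr << "Build failed:" << std::endl << log << std::endl;
    delete[] log;
}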

Kernel Objects: Creating kernel objects
- kernel_name: the name of the kernel to create
Increase the kernel reference counter
Decrease the kernel reference counter

cl_kernel clCreateKernel(cl_program program, const char *kernel_name, cl_int *errcode_ret)
cl_int clRetainKernel(cl_kernel kernel)
cl_int clReleaseKernel(cl_kernel kernel)

Kernel Objects: Kernel object information
- param_name:
  - CL_KERNEL_FUNCTION_NAME
  - CL_KERNEL_NUM_ARGS
  - CL_KERNEL_PROGRAM

cl_int clGetKernelInfo(cl_kernel kernel, cl_kernel_info param_name, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Kernel Objects: Kernel arguments
- arg_index: index of the argument to be set
- arg_size: the size of the corresponding argument
- arg_value: pointer to the value of the corresponding argument

cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)

Kernel Objects: Kernel argument setup

__kernel void sampleKernel(const float count, __global float* data){ ... }

// Host side: sampleKernel is the cl_kernel handle created for the kernel above,
// fData is a float and dataPtr is a cl_mem buffer.
clSetKernelArg(sampleKernel, 0, sizeof(float), &fData);
clSetKernelArg(sampleKernel, 1, sizeof(cl_mem), &dataPtr);

Kernel Objects: Work-group information
- pname:
  - CL_KERNEL_WORK_GROUP_SIZE
  - CL_KERNEL_COMPILE_WORK_GROUP_SIZE
  - CL_KERNEL_LOCAL_MEM_SIZE

cl_int clGetKernelWorkGroupInfo(cl_kernel kernel, cl_device_id device, cl_kernel_work_group_info pname, size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Kernel Objects: Executing a kernel
- work_dim: number of dimensions of the NDRange
- global_work_offset: offset into the index space (must be NULL in OpenCL 1.0)
- global_work_size: size of the whole problem space
- local_work_size: size of a single work-group

cl_int clEnqueueNDRangeKernel(cl_command_queue queue, cl_kernel kernel, cl_uint work_dim, const size_t *global_work_offset, const size_t *global_work_size, const size_t *local_work_size, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
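A hedged launch sketch: the global size is rounded up to a multiple of a chosen work-group size (the kernel then has to guard against out-of-range indices). The commands queue, N, and a kernel handle named squareKernel are assumptions carried over from the earlier sketches.

size_t local = 64;                                  // assumed work-group size
size_t global = ((N + local - 1) / local) * local;  // round N up to a multiple of local
clEnqueueNDRangeKernel(commands, squareKernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(commands);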

OpenCL example: Host program

void square(){
    cl_kernel squareKernel = clCreateKernel(program, "square", NULL);
    const int data_size = 1024;
    float* inputData = (float*)malloc(sizeof(float) * data_size);
    for(int i = 0; i < data_size; ++i){
        inputData[i] = i;
    }
    cl_mem clInputData = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                        sizeof(float) * data_size, NULL, NULL);
    clEnqueueWriteBuffer(commands, clInputData, CL_TRUE, 0,
                         sizeof(float) * data_size, inputData, 0, NULL, NULL);
    // ...

OpenCL example: Host program

    float* data = (float*)malloc(sizeof(float) * data_size);
    cl_mem clData = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                   sizeof(float) * data_size, NULL, NULL);

    clSetKernelArg(squareKernel, 0, sizeof(cl_mem), &clInputData);
    clSetKernelArg(squareKernel, 1, sizeof(cl_mem), &clData);
    clSetKernelArg(squareKernel, 2, sizeof(int), &data_size);

    size_t workgroupSize = 0;
    clGetKernelWorkGroupInfo(squareKernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(workgroupSize), &workgroupSize, NULL);

    // Global size; assumes workgroupSize divides data_size evenly.
    size_t workSize = data_size;
    clEnqueueNDRangeKernel(commands, squareKernel, 1, NULL,
                           &workSize, &workgroupSize, 0, NULL, NULL);
    clFinish(commands);

OpenCL example: Host program

    clEnqueueReadBuffer(commands, clData, CL_TRUE, 0,
                        sizeof(float) * data_size, data, 0, NULL, NULL);

    int wrong = 0;
    for(int i = 0; i < data_size; ++i){
        if(data[i] != inputData[i] * inputData[i]){
            wrong++;
        }
    }
    std::cout << "Wrong squares: " << wrong << std::endl;

    clReleaseMemObject(clInputData);
    clReleaseMemObject(clData);
    clReleaseKernel(squareKernel);
    free(data);
    free(inputData);
}

OpenCL example: OpenCL program

__kernel void square(__global float* inputData,
                     __global float* outputData,
                     const int data_size){
    __private int id = get_global_id(0);
    // Guard against work-items beyond the valid range when the global size is rounded up.
    if(id < data_size){
        outputData[id] = inputData[id] * inputData[id];
    }
}