Heterogeneous Computing with OpenCL Dr. Sergey Axyonov
Overview
- What is OpenCL?
- Execution Model
- Conceptual OpenCL Device Architecture
- Program execution sequence
- Kernel Functions & Examples
1. OpenCL allows you to
- Use different processors (CPU and GPU) to accelerate parallel computation
- Get speedups for computationally intensive applications
- Write portable code across different devices and architectures
2. OpenCL to CUDA data parallelism model mapping
OpenCL Parallelism Concept | CUDA Equivalent
kernel | kernel
host program | host program
NDRange (index space) | grid
work item | thread
work group | block
3. OpenCL execution model
4. Mapping of OpenCL dimensions and indices to CUDA
OpenCL API Call | Explanation | CUDA Equivalent
get_global_id(0) | global index of the work item in the x dimension | blockIdx.x × blockDim.x + threadIdx.x
get_local_id(0) | local index of the work item within the work group in the x dimension | threadIdx.x
get_global_size(0) | size of the NDRange in the x dimension | gridDim.x × blockDim.x
get_local_size(0) | size of each work group in the x dimension | blockDim.x
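The index relationships in the table can be checked with plain host-side arithmetic. The following is a minimal sketch in C, not OpenCL API code: the helper names `global_id_1d` and `global_size_1d` are hypothetical, and a zero global work offset is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global index of a work item in one dimension,
   assuming a zero global offset. It mirrors the CUDA expression
   blockIdx.x * blockDim.x + threadIdx.x, with
   group_id ~ blockIdx.x, local_size ~ blockDim.x, local_id ~ threadIdx.x. */
size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size); /* a local index stays inside its work group */
    return group_id * local_size + local_id;
}

/* Hypothetical helper: size of the NDRange in one dimension,
   mirroring gridDim.x * blockDim.x. */
size_t global_size_1d(size_t num_groups, size_t local_size)
{
    return num_groups * local_size;
}
```

For example, work item 5 of work group 2 with 64 work items per group has global index 2 × 64 + 5 = 133, and 4 such groups give an NDRange of 256 work items.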
5. Conceptual OpenCL Device Architecture
6. Mapping OpenCL memory types to CUDA
OpenCL Memory Type | CUDA Equivalent
global memory | global memory
constant memory | constant memory
local memory | shared memory
private memory | local memory
7. Program execution sequence
Set up
- Set work sizes for kernel execution
- Allocate and init host data buffers
- Create context for device
- Query compute devices
- Create command queue
- Create buffers on the device
- Create and build program
- Create kernel and set its arguments
Core sequence
- Copy data from host to device
- Launch kernel in command queue
- Copy data from device to host
Clean up
8. OpenCL context for device management
9. Useful functions
To get the list of platforms available:
    cl_int clGetPlatformIDs(cl_uint num_entries,
                            cl_platform_id *platforms,
                            cl_uint *num_platforms);
To get the list of devices available on a platform:
    cl_int clGetDeviceIDs(cl_platform_id platform,
                          cl_device_type device_type,
                          cl_uint num_entries,
                          cl_device_id *devices,
                          cl_uint *num_devices);
10. Kernel example
    __kernel void vectorAdd(__global const float *a,
                            __global const float *b,
                            __global float *result)
    {
        int id = get_global_id(0);
        result[id] = a[id] + b[id];
    }
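To see what the kernel computes without a device, its body can be emulated on the host. This is a minimal sketch in plain C, not OpenCL: the function names `vector_add_work_item` and `vector_add_host` are hypothetical, and a serial loop stands in for the 1-D NDRange, with the loop counter playing the role of get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one work item of the vectorAdd kernel;
   `id` plays the role of get_global_id(0). */
void vector_add_work_item(size_t id, const float *a, const float *b,
                          float *result)
{
    result[id] = a[id] + b[id];
}

/* Serial emulation of a 1-D NDRange of n work items. On a device the
   iterations may run in parallel and in any order; that is safe here
   because each work item writes only its own element. */
void vector_add_host(const float *a, const float *b, float *result, size_t n)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < n; ++id)
        vector_add_work_item(id, a, b, result);
}
```

A host reference of this kind is also a convenient way to verify the values read back from the device.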
11. Kernel storage
The kernel source can be kept in a separate file or embedded in the host program as an array of strings (char **):
    const char *simple_kernel[] = {
        "__kernel void vectorAdd(__global const float *a, \n",
        "__global const float *b, __global float *result) \n",
        "{\n",
        "int id = get_global_id(0);\n",
        "result[id] = a[id] + b[id];\n",
        "}\n"
    };
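For the "in file" option, the source has to be read into memory before it can be handed to the runtime. The following is a minimal sketch using only the C standard library; the helper name `load_kernel_source` is hypothetical, and it assumes the file is small enough to read in one piece.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole kernel source file into a
   NUL-terminated heap buffer. Returns NULL on failure. The caller
   frees the result. If length_out is not NULL it receives the number
   of bytes read (without the terminator). */
char *load_kernel_source(const char *path, size_t *length_out)
{
    assert(path != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *text = malloc((size_t)size + 1);
    if (text == NULL) { fclose(f); return NULL; }

    size_t got = fread(text, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(text); return NULL; }

    text[got] = '\0'; /* the runtime can then treat it as one C string */
    if (length_out != NULL)
        *length_out = got;
    return text;
}
```

The returned buffer is a single string, so it would be passed to clCreateProgramWithSource with a string count of 1 instead of the array length used for the embedded form.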
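12. Host code: context & program
    cl_context context;
    context = clCreateContext(NULL, 1, devices, NULL, NULL, &err);

    cl_program program;
    program = clCreateProgramWithSource(context,
                  sizeof(simple_kernel) / sizeof(*simple_kernel),
                  simple_kernel, NULL, &err);
    /* Build for the device; a kernel cannot be created from an unbuilt program. */
    err = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
    clUnloadCompiler();  /* deprecated since OpenCL 1.2 */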
13. Host code: memory objects
    /* The kernel works on float data, so buffers are sized in floats. */
    cl_mem input_Abuffer;
    input_Abuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   sizeof(float)*NUM_OF_ELEMENTS, NULL, &err);
    cl_mem input_Bbuffer;
    input_Bbuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   sizeof(float)*NUM_OF_ELEMENTS, NULL, &err);
    cl_mem output_buffer;
    output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                   sizeof(float)*NUM_OF_ELEMENTS, NULL, &err);
14. Host code: kernel & queue
    cl_kernel kernel;
    /* The name must match the __kernel function in the program source. */
    kernel = clCreateKernel(program, "vectorAdd", &err);
    clSetKernelArg(kernel, 0, sizeof(input_Abuffer), &input_Abuffer);
    clSetKernelArg(kernel, 1, sizeof(input_Bbuffer), &input_Bbuffer);
    clSetKernelArg(kernel, 2, sizeof(output_buffer), &output_buffer);

    cl_command_queue queue;
    queue = clCreateCommandQueue(context, devices[0], 0, &err);
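15. Host code: Copy data from host to device & back
    /* Blocking write (CL_TRUE) of the whole first input array. */
    /* input_Bbuffer is filled the same way from its own host array. */
    clEnqueueWriteBuffer(queue, input_Abuffer, CL_TRUE, 0,
                         NUM_OF_ELEMENTS*sizeof(float), sourceA,
                         0, NULL, NULL);
    /* Blocking read of the single result element with index i, issued after the kernel has run. */
    clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, i*sizeof(float),
                        sizeof(float), &data, 0, NULL, NULL);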
16. Host code: Kernel
    cl_event kernel_completion;
    size_t global_work_size[1] = { NUM_OF_ELEMENTS };
    /* Local work size is NULL, so the runtime chooses the work-group size. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL,
                           0, NULL, &kernel_completion);
    clWaitForEvents(1, &kernel_completion);
    clReleaseEvent(kernel_completion);
17. Host code: Clean up
    clReleaseMemObject(input_Abuffer);
    clReleaseMemObject(input_Bbuffer);
    clReleaseMemObject(output_buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);