OpenCL introduction III.
Parallel reduction on large data
- Combines the input elements into a single result
- Requires an associative binary operation: min, max, add, multiply (subtraction is not associative, so it does not qualify)
(Source: Introduction to Parallel Computing, University of Oregon, IPCC)
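Associativity is what makes a parallel schedule legal: the elements may be grouped in any order. As a minimal host-side illustration (plain C++; the function name is ours, not part of the slides), a pairwise tree reduction produces the same maximum as a sequential left-to-right fold:

    #include <algorithm>
    #include <vector>

    // Illustrative sketch: because max is associative (and commutative), a
    // pairwise "tree" reduction may combine the elements in any grouping and
    // still match a sequential left-to-right fold.
    float reduce_tree(std::vector<float> v) {
        while (v.size() > 1) {
            std::vector<float> next;
            for (size_t i = 0; i + 1 < v.size(); i += 2)
                next.push_back(std::max(v[i], v[i + 1]));
            if (v.size() % 2 != 0)
                next.push_back(v.back());   // carry an odd leftover element
            v.swap(next);
        }
        return v[0];
    }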
Parallel reduction on large data
- Split the original input data into multiple partitions
- Run parallel reduction on the partitions
- Store the results
- Run reduction again on the previous results
- Repeat until one element remains
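To make the pass structure concrete, here is a small worked example; the sizes are assumed for illustration only: 2^20 input elements, 256 work-items per work-group, each work-group reducing 2 * 256 = 512 elements to one partial result.

    #include <cstdio>

    int main() {
        size_t n = 1 << 20;                      // assumed input size
        size_t perGroup = 2 * 256;               // elements reduced per work-group
        for (int pass = 1; n > 1; ++pass) {
            n = (n + perGroup - 1) / perGroup;   // one partial result per work-group
            std::printf("after pass %d: %zu element(s) remain\n", pass, n);
        }
        // prints 2048, then 4, then 1: three kernel launches in total
    }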
Parallel reduction on large data
- Use multiple work-groups
- Each work-group copies its portion of the data into a local array
- Each work-group performs a simple reduction
- The result of each work-group is stored in an output array
- Run the same kernel multiple times, until a single element remains
Reduction – original solution

__kernel void reduce_global(__global float* data) {
    int id = get_global_id(0);
    for (unsigned int s = get_global_size(0) / 2; s > 0; s >>= 1) {
        if (id < s)
            data[id] = max(data[id], data[id + s]);
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Note: barrier() only synchronizes work-items within a single work-group, so this version is only correct when the whole NDRange fits into one work-group.
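A brief host-side sketch of that restriction (queue, kernel, and dataBuffer are assumed to exist): launched with the global size equal to the local size, everything stays inside one work-group and the barrier is valid; anything larger needs the multi-work-group scheme below.

    size_t globalSize = 256;                 // must not exceed one work-group
    size_t localSize  = 256;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dataBuffer);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
    // after clFinish(queue), data[0] holds the maximum of the 256 elements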
Parallel reduction on large data

[Figure: two work-groups (wg 0, wg 1) reading from the global array. Each work-group processes 2 * work-group-size elements starting at offset = 2 * work-group ID * work-group size, loading them into its local array:
    l_data[2 * l_id + 0] = g_data[offset + 2 * l_id + 0]
    l_data[2 * l_id + 1] = g_data[offset + 2 * l_id + 1]
Each work-group then runs the normal reduction on its local data, and work-item 0 writes the partial result back to the global output array: if (0 == local ID) result[work-group ID] = l_data[0]]
Parallel reduction on large data

__kernel void reduce_global(__global float* data, __global float* output) {
    __local float l_data[2048];
    int wgid = get_group_id(0);
    int localSize = get_local_size(0);
    int lid = get_local_id(0);
    int offset = 2 * wgid * localSize;

    l_data[2 * lid + 0] = data[offset + 2 * lid + 0];
    l_data[2 * lid + 1] = data[offset + 2 * lid + 1];
    barrier(CLK_LOCAL_MEM_FENCE);
Parallel reduction on large data

    // continued: the reduction loop in local memory; the barrier must be
    // inside the loop so every step is visible to all work-items
    for (unsigned int s = localSize; s > 0; s >>= 1) {
        if (lid < s)
            l_data[lid] = max(l_data[lid], l_data[lid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (0 == lid)
        output[wgid] = l_data[0];
}
Parallel reduction on large data

// input and output are cl_mem* handles that get swapped between passes;
// assumes dataSize is a power of two and the buffers are padded to a full
// work-group with the identity element (e.g. -FLT_MAX for max)
size_t kernelNum = dataSize / 2;                   // global size: one work-item per two elements
size_t outputSize = kernelNum / maxWorkGroupSize;  // one partial result per work-group
while (outputSize >= 1) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), output);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &kernelNum, &maxWorkGroupSize, 0, NULL, NULL);
    clFinish(queue);

    // the output of this pass becomes the input of the next one
    cl_mem* tmp = input;
    input = output;
    output = tmp;

    kernelNum = outputSize / 2;
    outputSize = (kernelNum + maxWorkGroupSize - 1) / maxWorkGroupSize;
    kernelNum = std::max(kernelNum, maxWorkGroupSize); // pad the last pass to a full work-group
}

float gpuMaxValue = 0.0f;
clEnqueueReadBuffer(queue, *input, CL_TRUE, 0, sizeof(float) * 1, &gpuMaxValue, 0, NULL, NULL);
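A quick way to validate the result is a reference reduction on the host. This is a sketch: hostData is assumed to hold the same dataSize floats that were uploaded to the device.

    #include <algorithm>
    #include <cstdio>

    // sequential reference reduction for verification
    float cpuMaxValue = hostData[0];
    for (size_t i = 1; i < dataSize; ++i)
        cpuMaxValue = std::max(cpuMaxValue, hostData[i]);

    if (cpuMaxValue != gpuMaxValue)
        std::printf("mismatch: CPU %f vs GPU %f\n", cpuMaxValue, gpuMaxValue);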
InOrder vs OutOfOrder Execution

In-order execution
- Commands submitted to the command queue are executed in the order of submission

Out-of-order execution
- Commands in the queue can be scheduled in any order
InOrder vs OutOfOrder Execution

Running multiple tasks:
[Figure: with an in-order queue, each write data / task / read data triple executes strictly after the previous one. With an out-of-order queue, the write, task, and read commands of different instances can overlap, so explicit synchronization with events is needed to keep each task after its write and each read after its task.]
InOrder vs OutOfOrder Execution

inOrderQueue = clCreateCommandQueue(context, deviceID, CL_QUEUE_PROFILING_ENABLE, &err);
if (!CheckCLError(err)) exit(-1);

outOfOrderQueue = clCreateCommandQueue(context, deviceID, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
if (!CheckCLError(err)) exit(-1);

// in-order queue: blocking writes and reads, no events needed
for (int i = 0; i < instances; i++) {
    clEnqueueWriteBuffer(inOrderQueue, inputBuffers[i], CL_TRUE, 0,
                         sizeof(float) * dataSize, hostBuffer, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuffers[i]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputBuffers[i]);
    clEnqueueNDRangeKernel(inOrderQueue, kernel, 1, NULL, &dataSize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(inOrderQueue, outputBuffers[i], CL_TRUE, 0,
                        sizeof(float) * dataSize, hostBuffer, 0, NULL, NULL);
}
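The CL_QUEUE_PROFILING_ENABLE flag on the in-order queue is what allows timing individual commands. As a sketch (kernelEvent is a variable introduced here, not part of the slides): pass an event to the enqueue call and query device-side timestamps once the command has finished.

    cl_event kernelEvent;
    clEnqueueNDRangeKernel(inOrderQueue, kernel, 1, NULL, &dataSize, NULL, 0, NULL, &kernelEvent);
    clFinish(inOrderQueue);

    cl_ulong start = 0, end = 0;   // device timestamps in nanoseconds
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
    clReleaseEvent(kernelEvent);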
InOrder vs OutOfOrder Execution

// out-of-order queue: non-blocking calls, ordering enforced through events
for (int i = 0; i < instances; i++) {
    clEnqueueWriteBuffer(outOfOrderQueue, inputBuffers[i], CL_FALSE, 0,
                         sizeof(float) * dataSize, hostBuffer, 0, NULL, &events[2 * i + 0]);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuffers[i]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputBuffers[i]);
    // the kernel waits for the write, the read waits for the kernel
    clEnqueueNDRangeKernel(outOfOrderQueue, kernel, 1, NULL, &dataSize, NULL,
                           1, &events[2 * i + 0], &events[2 * i + 1]);
    clEnqueueReadBuffer(outOfOrderQueue, outputBuffers[i], CL_FALSE, 0,
                        sizeof(float) * dataSize, hostBuffer, 1, &events[2 * i + 1], NULL);
}
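One step the slide leaves implicit: with non-blocking reads on an out-of-order queue, the host must not touch hostBuffer until everything has completed. A minimal sketch:

    // wait for all enqueued commands before reusing hostBuffer on the host
    clFinish(outOfOrderQueue);

    // the events are no longer needed once the commands have completed
    for (int i = 0; i < 2 * instances; i++)
        clReleaseEvent(events[i]);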