Introduction to OpenCL 2.0

Introduction to OpenCL 2.0
Jeng Bai-Cheng, Engineer, Computing Platform Technology Division in CTO, MediaTek, Inc.
Bai-Cheng.Jeng@mediatek.com
May 4, 2015, NTHU, HsinChu

Background It’s a Heterogeneous World today A modern platform includes: One or more CPUs One or more GPUs Optional accelerators (e.g., DSPs)

What is OpenCL? An open, royalty-free standard from the Khronos Group for parallel programming across heterogeneous devices.

OpenCL-based Ecosystem

OpenCL Platform Model: a host connected to one or more OpenCL devices; each device consists of compute units, which in turn consist of processing elements.

OpenCL Execution Model

- Decompose the task into work-items
- Define an N-dimensional computation domain
- Execute a kernel at each point in the computation domain

An N-dimensional domain of work-items
- Define the N-dimensional index space for your algorithm
- Kernels are executed across a global domain of work-items
- Work-items are grouped into local work-groups
(A minimal kernel sketch follows.)
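A minimal sketch of this model in OpenCL C: each work-item handles one point of a 1-D domain (vector addition, the classic introductory example):

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
        size_t i = get_global_id(0);   // this work-item's position in the global domain
        c[i] = a[i] + b[i];            // one element of the computation per work-item
    }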

Anatomy of an OpenCL Application Serial code executes in a Host (CPU) thread Parallel code executes in many Device (GPU) threads across multiple processing elements
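A hedged host-side sketch of that anatomy, assuming a kernel named vec_add and an already created context, program, command queue, buffers bufA/bufB/bufC, and element count N (error checks omitted):

    // Host (CPU) side: serial code that sets up and launches the parallel part.
    cl_int err;
    cl_kernel kernel = clCreateKernel(program, "vec_add", &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    size_t global = N;                 // N work-items, one per element
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);                   // wait for the device-side work to complete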

OpenCL memory hierarchy: private memory per work-item, local memory per work-group, global and constant memory per device, plus host memory on the CPU side.
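The hierarchy appears directly in kernel code as address-space qualifiers; a small illustrative sketch:

    __kernel void hierarchy_demo(__global float *g,      // global: visible to all work-items
                                 __constant float *c,    // constant: read-only device memory
                                 __local float *l) {     // local: shared within one work-group
        float p = c[0];                                  // private: per-work-item storage
        l[get_local_id(0)] = g[get_global_id(0)] * p;
        barrier(CLK_LOCAL_MEM_FENCE);                    // make local writes visible to the group
        g[get_global_id(0)] = l[get_local_id(0)];
    }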

Big Picture

OpenCL 2.0 New Features

Shared Virtual Memory
Before OpenCL 2.0: there is no guarantee that a pointer assigned on the host side can be used to access data on the device side. Data structures containing pointers cannot be shared between the two sides, and the application has to be designed accordingly, for example using indices instead of pointers.
(Diagram: separate device and host address spaces.)

Shared Virtual Memory
After OpenCL 2.0: enables the host and device portions of an application to seamlessly share pointers and complex pointer-containing data structures.
Benefits:
- Host-allocated buffers are directly accessible by devices
- Host and devices can refer to the same memory locations using the same pointer values
- Reduces data movement between host and devices
(Diagram: a single virtual address space shared by host and device.)
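A minimal coarse-grained SVM sketch using the OpenCL 2.0 host API, assuming an existing context, queue, and kernel (error handling omitted):

    size_t N = 1024;                                   // illustrative element count
    // Allocate memory that host and device address with the same pointer.
    float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                      N * sizeof(float), 0);
    // Coarse-grained SVM: map before the host touches it, unmap before the device does.
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, N * sizeof(float), 0, NULL, NULL);
    for (size_t i = 0; i < N; ++i) data[i] = (float)i;
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);
    // Pass the raw pointer to the kernel: no cl_mem buffer object needed.
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &N, NULL, 0, NULL, NULL);
    clFinish(queue);
    clSVMFree(context, data);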

Category of SVM

    SVM feature                                    | Coarse-grained buffer | Fine-grained buffer (w/o atomics) | Fine-grained buffer (with atomics) | Fine-grained system
    Shared virtual address space                   | yes                   | yes                               | yes                                | yes
    Fine-grained coherent access (cache coherent)  | no (non-coherent)     | yes                               | yes                                | yes
    Fine-grained synchronization                   | no                    | no                                | yes                                | yes
    Sharing the entire host address space (MMU & OS support) | no         | no                                | no                                 | yes

The coarse-grained buffer is the core feature; the others are optional.
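Which tier a device supports can be queried at run time; a small sketch, assuming a cl_device_id named device:

    // Query the device's SVM capability bits (OpenCL 2.0 host API).
    cl_device_svm_capabilities caps;
    clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(caps), &caps, NULL);
    if (caps & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER) printf("coarse-grained buffer\n");
    if (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER)   printf("fine-grained buffer\n");
    if (caps & CL_DEVICE_SVM_FINE_GRAIN_SYSTEM)   printf("fine-grained system\n");
    if (caps & CL_DEVICE_SVM_ATOMICS)             printf("SVM atomics\n");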

C11 Atomics Support
Core feature.
Before OpenCL 2.0: OpenCL 1.2 implements atomic operations using built-ins.
After OpenCL 2.0: the atomics mainly aim at compatibility with the C11 standard, controlling the memory-synchronization ordering and scope of each atomic operation:
- order: relaxed, acquire, release, acq_rel, seq_cst
- scope: work_group, device, all_svm_devices

C11 Atomics Support
Benefits:
- The work-group scope generally provides more optimization opportunities for the compiler
- Platform coherency provides coherent access to shared memory locations across host and devices: an efficient way to coordinate between host and devices, e.g. when both can dispatch kernels to a task queue
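A minimal OpenCL C 2.0 sketch showing the explicit order and scope arguments (the kernel and its counting task are illustrative only):

    // C11-style atomics with explicit memory order and scope.
    __kernel void count_matches(__global const int *in, int key,
                                __global atomic_int *counter) {
        if (in[get_global_id(0)] == key)
            atomic_fetch_add_explicit(counter, 1,
                                      memory_order_relaxed,   // no ordering needed here
                                      memory_scope_device);   // visible device-wide
    }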

SVM example
- User launches the demo application
- The application immediately starts playing back a 1-minute looping video clip
- Face detection runs on the GPU; faces are identified on-screen with a blue rectangle
- Face recognition runs on the CPU; faces are tagged with the person's name
(Screenshot: detected faces tagged Courtney, Anna, Britney.)

Without SVM: the CPU must wait until the GPU detection kernel completes before it can start recognizing any face, so the latency for Face 3 spans the whole sequence: kernel launch, GPU face detection, kernel completed, then Recognize Face 1, 2, 3 on the CPU.
With a fine-grained buffer: the GPU issues a callback as each face is detected, so the CPU can start recognizing Face 1 while detection continues, and the latency for Face 3 shrinks accordingly.
(Timeline diagram: CPU and GPU activity over time, without SVM vs. with a fine-grained buffer.)

Generic Address Space
Core feature.
Before OpenCL 2.0: the programmer had to specify the address space of what a pointer points to when that pointer was declared or passed as an argument to a function.
After OpenCL 2.0: the pointer itself remains in the private address space, but what it points to now defaults to the generic address space, meaning it can point to any of the named address spaces contained in the generic address space.

Generic Address Space
Benefits:
- Makes it easy to write functions without having to worry about which address space the arguments point to
- Allows a single pointer to refer to different memory segments reached from different control-flow paths
- Simplifies compiler implementation
- Reduces the number of function entries needed to serve arguments from different memory segments
SW requirements:
- Compiler: implicit and explicit casting rules between named and generic address spaces; new built-in functions, e.g. is_global(), is_local(), cl_mem_fence_flags get_fence()
- Driver: conversions between segmented and flat addressing
(A sketch of the benefit follows.)
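A small OpenCL C 2.0 sketch of the first benefit; the helper and kernel names are illustrative:

    // Before 2.0 this helper needed one overload per address space;
    // with the generic address space a single version serves both.
    float first3_sum(const float *p) {        // p is a generic pointer
        return p[0] + p[1] + p[2];
    }

    __kernel void use_both(__global float *g, __local float *l,
                           __global float *out) {
        l[get_local_id(0)] = g[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        // The same function accepts global and local pointers.
        out[get_global_id(0)] = first3_sum(g) + first3_sum(l);
    }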

Nested Parallelism
Core feature: the ability to launch new tasks from the device.

Nested Parallelism with Data-Dependent Parallelism
Computational power is allocated to regions of interest.
Benefits:
- Allows a kernel to dispatch new kernels without having to go back to the host, reducing overheads
- Fits nested or recursive algorithms
(Diagram: traditional host-driven dispatch vs. nested parallelism; a sketch follows.)
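A minimal device-enqueue sketch using OpenCL C 2.0 block syntax; the child grid size of 256 and the doubling work are illustrative only:

    __kernel void parent(__global int *data) {
        if (get_global_id(0) == 0) {               // one work-item spawns the child grid
            queue_t q = get_default_queue();
            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(256),
                           ^{ data[get_global_id(0)] *= 2; });  // child kernel as a block
        }
    }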

Pipe
Core feature: a new mechanism for passing data packets between kernels.
Before OpenCL 2.0: data transmission only between host and devices.
After OpenCL 2.0: data can also be passed between kernels inside a device.

Pipe
Benefits:
- Enables producer-consumer relationships, similar to inter-process communication
- Pipes can be combined with the nested parallelism feature of OpenCL 2.0 to dynamically construct computational data-flow graphs
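A minimal producer-consumer sketch; the packet contents are illustrative, and the host is assumed to create the pipe with clCreatePipe:

    // Producer: each work-item writes one int packet into the pipe.
    __kernel void producer(__write_only pipe int out) {
        int v = (int)get_global_id(0);
        write_pipe(out, &v);                 // returns 0 on success
    }
    // Consumer: each work-item reads one packet and stores it.
    __kernel void consumer(__read_only pipe int in, __global int *dst) {
        int v;
        if (read_pipe(in, &v) == 0)          // 0 means a packet was read
            dst[get_global_id(0)] = v;
    }
    // Host side: cl_mem p = clCreatePipe(context, 0, sizeof(int), N, NULL, &err);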

Work-group Functions
Core feature: built-ins that provide popular parallel primitives operating at the work-group level: value broadcast, reduce, and scan. The reduce and scan algorithms support add, min, and max operations.
(Diagram: reduce with the add operation.)

Work-group Functions example
Benefits:
- Work-group functions are convenient
- They are more performance-efficient, as they use hardware-specific optimizations internally
(Diagram: traditional implementation vs. OpenCL 2.0 work-group function; a sketch follows.)
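A small sketch of a work-group reduction: one built-in call replaces a hand-written local-memory tree reduction (kernel name and output layout are illustrative):

    __kernel void partial_sums(__global const float *in, __global float *out) {
        // Every work-item receives the sum over its work-group.
        float sum = work_group_reduce_add(in[get_global_id(0)]);
        if (get_local_id(0) == 0)
            out[get_group_id(0)] = sum;      // one partial sum per work-group
    }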

OpenCL 2.0 sample

Linked list with SVM
In both Intel's and AMD's linked-list SVM samples, the pointers merely point into an array; it is not really a linked list. We wondered what happens if we implement a traditional linked list with SVM:
- Just give the pointer to the head to the GPU
- Each work-item takes a node in parallel
(Diagram: work-items 0, 1, 2 each taking one node; see the sketch below.)
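A hedged reconstruction of the experiment; the node layout and kernel are our sketch, not the SDK sample:

    // Node layout shared by host and device through SVM.
    typedef struct Node { int value; __global struct Node *next; } Node;

    // Each work-item walks from the head to "its" node and updates it.
    __kernel void visit_nodes(__global Node *head) {
        __global Node *n = head;
        for (size_t i = 0; i < get_global_id(0) && n; ++i)
            n = n->next;                  // host-built pointers remain valid on the device
        if (n) n->value *= 2;
    }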

The implementation of a linked list with SVM
Yes, a traditional linked list works with the AMD OpenCL SDK 3.0 beta.
Input: 0->1->2->3... Output: 0->2->4->6...
However, there is a strange bug: it can only create 43,645 nodes.

Performance of Linked list with SVM

More info on the linked list with SVM
Addresses: 0xec00000000 (head), 0xec00010000 (first node), 0xec00020000, ..., 0xecaa7c0000, 0xecaa7d0000, 0xee01140000
Page size: 64 KB
Max size of SVM buffer: ~2.4 GB
Each node allocation lands on its own 64 KB page (the addresses step by 0x10000), so tens of thousands of nodes suffice to exhaust the maximum SVM buffer size, which is consistent with the 43,645-node limit above.

Parallel Binary Search
- Divide the array into segments; each work-item takes one segment
- Find the segment to which the key belongs, then further divide that segment
- If the key is between the lower bound and upper bound of segment 0, only work-item 0 writes to the output buffer
(Diagram: segments A_0...A_n, A_(n+1)...A_2n, A_(2n+1)...A_3n assigned to work-items 0, 1, 2.)

Parallel Binary Search (cont.)
According to the output, we narrow the search space by subdividing the selected segment, then proceed to the next pass.
(Diagram: the selected segment A_0...A_n is split again into A_0...A_m, A_(m+1)...A_2m, ..., A_(n-m)...A_n for work-items 0, 1, 2.)

Binary Search without Device Enqueue
In the AMD OpenCL 1.x test case, the host has to read the output to decide which segment becomes the input and how many work-items should be dispatched for the next pass. This causes busy communication between host and device.

Binary Search with Device Enqueue
With OpenCL 2.0, the kernel recursively enqueues itself until it finds the key or the sub-segment cannot be divided any further.
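A hedged reconstruction of the idea, not AMD's sample code: each pass splits the candidate range into nseg segments and re-enqueues itself on the one segment that can still contain the key (sorted data and a device-side default queue assumed):

    __kernel void search_pass(__global const int *data, int lo, int hi, int key,
                              __global int *result) {
        int nseg = (int)get_global_size(0);
        int seg  = (int)get_global_id(0);
        int len  = (hi - lo) / nseg;
        int s = lo + seg * len;
        int e = (seg == nseg - 1) ? hi : s + len;
        if (key < data[s] || (e < hi && key >= data[e])) return; // key not in my segment
        if (e - s <= 1) { *result = s; return; }                 // cannot divide further
        queue_t q = get_default_queue();
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(nseg),
                       ^{ search_pass(data, s, e, key, result); }); // next pass, no host trip
    }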

Performance of Binary Search
We would expect device enqueue to improve the performance of this algorithm. However, there is a bug in AMD's implementation of the OpenCL 1.x test case: it enqueues the kernel only once, then lets the CPU finish the remaining work.

Prefix Sum
Input: [a_0, a_1, ..., a_(n-1)]
Output: [a_0, (a_0 + a_1), ..., (a_0 + a_1 + ... + a_(n-2))]
Example:
Input: [3 1 7 0 4 1 6 3]
Output: [3 4 11 11 15 16 22]
Sequential algorithm: O(n)

Parallel Prefix Sum: simple version

Parallel Prefix Sum: work-efficient version (Blelloch 1990)
- Step 1, the up-sweep: O(n)
- Step 2, the down-sweep: O(n)

Parallel Prefix Sum with OpenCL
- Local prefix sum: use shared local memory and barriers to implement a per-work-group prefix sum
- Global prefix sum: merge the results of the local prefix sums
Local results:
workgroup 0: [3 4 11 11 15 16 22]
workgroup 1: [4 9 18 19 22 24 24]
workgroup 2: [3 8 15 15 17 21 27]
After the global prefix sum:
workgroup 0: [3 4 11 11 15 16 22]
workgroup 1: [26 31 40 41 44 46 46]
workgroup 2: [49 54 61 61 63 67 73]

Local Prefix Sum with Work-group Function
(Comparison: implementation without the work-group function vs. with it; a sketch of both follows.)
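The contrast the slide draws, as a sketch: (a) a manual Hillis-Steele inclusive scan in local memory vs. (b) the OpenCL 2.0 work-group built-in (kernel names are illustrative):

    __kernel void scan_manual(__global const int *in, __global int *out,
                              __local int *tmp) {
        uint lid = get_local_id(0);
        tmp[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (uint off = 1; off < get_local_size(0); off <<= 1) {
            int v = (lid >= off) ? tmp[lid - off] : 0;  // read before anyone writes
            barrier(CLK_LOCAL_MEM_FENCE);
            tmp[lid] += v;
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        out[get_global_id(0)] = tmp[lid];
    }

    __kernel void scan_wgf(__global const int *in, __global int *out) {
        // One built-in call replaces the loop and barriers above.
        out[get_global_id(0)] = work_group_scan_inclusive_add(in[get_global_id(0)]);
    }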

Global prefix sum: merge the local results by adding the totals of the preceding work-groups.
Local prefix sums (work-groups 0..2):
[3 4 11 11 15 16 22]
[4 9 18 19 22 24 24]
[3 8 15 15 17 21 27]
Global prefix sums after merging:
[26 31 40 41 44 46 46] (work-group 1: +22)
[49 54 61 61 63 67 73] (work-group 2: +22+24)
(The intermediate row [27 32 39 39 41 45 51] in the slide shows work-group 2 after adding only work-group 1's total.)

Performance of Prefix Sum
We compare the two implementations of prefix sum in the AMD SDK:
- For small data sizes, the work-group function provides better performance
- When the data size exceeds 512K, performance drops because of the global prefix sum, which is implemented as the simple (non-work-efficient) parallel prefix sum

Issue of Global Prefix Sum
The prefix-sum implementation with work-group functions is not work-efficient:
Local results: [3 4 11 11 15 16 22] (wg 0), [4 9 18 19 22 24 24] (wg 1), [3 8 15 15 17 21 27] (wg 2)
Merged: [26 31 40 41 44 46 46], [27 32 39 39 41 45 51], [49 54 61 61 63 67 73]
The implementation without work-group functions first scans the array of work-group totals, [0 22 46], then adds one offset per work-group:
[3 4 11 11 15 16 22] + 0 (wg 0)
[4 9 18 19 22 24 24] + 22 -> [26 31 40 41 44 46 46] (wg 1)
[3 8 15 15 17 21 27] + 46 -> [49 54 61 61 63 67 73] (wg 2)

More info on local prefix sum
To measure the performance of the work-group function alone, we remove the global prefix sum step.
Input: [a_0, a_1, ..., a_(n-1)]
Output: [a_0, (a_0 + a_1), ..., (a_0 + a_1 + ... + a_(n-2))]
Example: Input [3 1 7 0 4 1 6 3], Output [3 4 11 11 15 16 22]
(Chart: performance comparison, y-axis log2(time).)

Conclusion
- OpenCL is the most popular open programming standard for heterogeneous computing
- OpenCL 2.0 is a big step forward, with key new features in its execution and memory models
- OpenCL 2.0 is a key driving force for our heterogeneous computing HW and SW technology