China MCP OpenCL

Agenda: OpenCL Overview, Usage, Memory Model, Synchronization, Operational Flow, Availability

OpenCL Overview: Motivation
[Slide collage of TI target markets: radar and communications computing, high-performance and cloud computing, video and audio infrastructure, networking, DVR/NVR and smart cameras, wireless testers, industrial control, home AVR and automotive audio, portable mobile radio, medical imaging, mission-critical systems, media processing, industrial electronics, industrial imaging, analytics]

OpenCL Overview: Motivation
Many current TI DSP users:
- Comfortable working with TI platforms
- Large software teams, low-level programming models for algorithmic control
- Understand DSP programming
Many customers in new markets like high-performance compute:
- Often not DSP programmers
- Not familiar with TI proprietary software, especially in early stages
- Comfortable with workstation parallel programming models
It is important that customers in these new markets are comfortable leveraging TI's heterogeneous multicore offerings.

OpenCL Overview: What It Is
A framework for expressing programs in which parallel computation is dispatched to any attached heterogeneous device. Open, standard, and royalty-free. It consists of two components:
1. An API for the host program to create and submit kernels for execution (a host-based generic header and a vendor-supplied library file)
2. A cross-platform language for expressing kernels (based on C99 C, with some additions, restrictions, and built-in functions)
It promotes portability of applications from device to device and across generations of a single device roadmap.

OpenCL Overview: Where It Fits In
[Diagram: Node 0 through Node N connected by MPI communication APIs]
MPI allows expression of parallelism across nodes in a distributed system. MPI's first specification was in 1994.

OpenCL Overview: Where It Fits In
[Diagram: each node runs OpenMP threads on its CPU; nodes connected by MPI communication APIs]
OpenMP allows expression of parallelism across homogeneous, shared-memory cores. OpenMP's first specification was in 1997.

OpenCL Overview: Where It Fits In
[Diagram: each node has a CPU running OpenMP threads and a GPU programmed with CUDA/OpenCL; nodes connected by MPI communication APIs]
CUDA/OpenCL can leverage parallelism across heterogeneous computing devices in a system, even with distinct memory spaces. CUDA's first specification was in 2007; OpenCL's first specification was in 2008.

OpenCL Overview: Where It Fits In
[Diagram: each node has a CPU running OpenMP threads and a DSP programmed with OpenCL; nodes connected by MPI communication APIs]
Focus on OpenCL as an open alternative to CUDA. Focus on OpenCL devices other than GPUs, like DSPs.

OpenCL Overview: Where It Fits In
[Diagram: each node's CPU is itself programmed with OpenCL; nodes connected by MPI communication APIs]
OpenCL is expressive enough to allow efficient control over all compute engines in a node.

OpenCL Overview: Model
Host connected to one or more OpenCL devices:
- Commands are submitted from the host to the OpenCL devices
- The host can also be an OpenCL device
An OpenCL device is a collection of one or more compute units (cores):
- The OpenCL device is viewed by the programmer as a single virtual processor
- The programmer does not need to know how many cores are in the device (though it can be queried; see the sketch below)
- The OpenCL runtime efficiently divides the total processing effort across the cores
Example on the 66AK2H12:
- An A15 running the OpenCL process acts as the host
- 8 C66x DSPs are available as a single device (Accelerator type, 8 compute units)
- 4 A15s are available as a single device (CPU type, 4 compute units)
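For illustration, a minimal host-side sketch of that query, assuming the OpenCL C++ bindings (cl.hpp) used throughout this deck; the names and counts printed will vary by platform:

#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Create a context holding every accelerator (e.g., DSP) device.
    cl::Context context(CL_DEVICE_TYPE_ACCELERATOR);
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

    for (size_t i = 0; i < devices.size(); ++i)
    {
        // Each device reports how many compute units (cores) it contains.
        std::cout << devices[i].getInfo<CL_DEVICE_NAME>() << ": "
                  << devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
                  << " compute units" << std::endl;
    }
    return 0;
}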

Agenda: OpenCL Overview, OpenCL Usage, Memory Model, Synchronization, Operational Flow, Availability

OpenCL Usage: Platform Layer

Context context(CL_DEVICE_TYPE_ACCELERATOR);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

Platform-layer APIs allow an OpenCL application to:
- Query the platform for OpenCL devices
- Query OpenCL devices for their configuration and capabilities
- Create OpenCL contexts using one or more devices
Context:
- The environment within which work-items execute
- Includes devices, their memories, and command queues
Kernels dispatched within this context will run on accelerators (DSPs). To change the program to run kernels on a CPU device instead, change CL_DEVICE_TYPE_ACCELERATOR to CL_DEVICE_TYPE_CPU.

Usage: Contexts & Command Queues
Typical flow: query the platform for all available accelerator devices, create an OpenCL context containing all those devices, then query the context to enumerate the devices and place them in a vector.

C:
int err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
if (err != CL_SUCCESS) { ... }
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
if (!context) { ... }
commands = clCreateCommandQueue(context, device_id, 0, &err);
if (!commands) { ... }

C++:
Context context(CL_DEVICE_TYPE_CPU);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
CommandQueue Q(context, devices[0]);

Usage: Execution Model (© Khronos Group, 2009)
OpenCL C kernel:
- The basic unit of executable code on a device, similar to a C function
- Can be data-parallel or task-parallel
OpenCL C program:
- A collection of kernels and other functions
OpenCL applications queue kernel execution instances:
- The application defines command queues; a command queue is tied to a specific device, and any or all devices may have command queues
- The application enqueues kernels to these queues; the kernels then run asynchronously to the main application thread
- Queues can be defined to execute in-order or to allow out-of-order execution

Usage: Data Kernel Execution
Kernel enqueuing is a combination of:
1. An OpenCL C kernel definition (expressing an algorithm for a single work-item)
2. A description of the total number of work-items required for the kernel

CommandQueue Q(context, devices[0]);
Kernel kernel(program, "mpy2");
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024), NullRange);

kernel void mpy2(global int *p)
{
    int i = get_global_id(0);
    p[i] *= 2;
}

Work-items for a kernel execution are grouped into workgroups:
- A workgroup is executed by a compute unit (core)
- The size of a workgroup can be specified, or left to the runtime to define
- Different workgroups can execute asynchronously across multiple cores

Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024), NDRange(128));

The line above enqueues the kernel with 1024 work-items grouped into workgroups of 128 work-items each: 1024/128 => 8 workgroups, which could execute simultaneously on 8 cores.
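To make the indexing concrete, a hedged sketch (show_ids is a hypothetical kernel, not part of the deck) of how the OpenCL C ID built-ins relate under the 1024/128 launch above:

kernel void show_ids(global int *out)
{
    int gid = get_global_id(0);   // 0..1023: index among all work-items
    int lid = get_local_id(0);    // 0..127: index within this workgroup
    int wg  = get_group_id(0);    // 0..7: index of this workgroup
    // For a 1-D launch: gid == wg * get_local_size(0) + lid
    out[gid] = wg;
}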

Usage: Execution Order of Work-Items & Workgroups
The execution order of work-items in a workgroup is not defined by the spec. Portable OpenCL code must assume they could all execute concurrently:
- GPU implementations typically execute work-items within a workgroup concurrently
- CPU and DSP implementations typically serialize work-items within a workgroup
- OpenCL C barrier instructions can be used to ensure that all work-items in a workgroup reach the barrier before any work-item in the workgroup proceeds past it (see the sketch below)
The execution order of workgroups associated with one kernel execution is also not defined by the spec. Portable OpenCL code must assume any order is valid. No mechanism exists in OpenCL to synchronize or order workgroups.
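A minimal sketch of that barrier idiom (reverse_wg is a hypothetical kernel): each work-item stages one value into local memory, and the barrier guarantees every store completes before any work-item reads a neighbor's slot:

kernel void reverse_wg(global int *data, local int *tmp)
{
    int lid  = get_local_id(0);
    int size = get_local_size(0);

    // Stage this work-item's value into the workgroup's local memory.
    tmp[lid] = data[get_global_id(0)];

    // No work-item proceeds until every work-item in this
    // workgroup has written its tmp slot.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Safe: the slot being read was written by another work-item.
    data[get_global_id(0)] = tmp[size - 1 - lid];
}

The host sizes the local buffer when setting the kernel argument (in the C++ bindings, for example, with __local(n * sizeof(int))).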

Usage: Example
OpenCL host code:

Context context(CL_DEVICE_TYPE_ACCELERATOR);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
Program program(context, devices, source);
program.build(devices);
Buffer buf(context, CL_MEM_READ_WRITE, sizeof(input));
Kernel kernel(program, "mpy2");
kernel.setArg(0, buf);
CommandQueue Q(context, devices[0]);
Q.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(input), input);
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
Q.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(input), input);

OpenCL kernel:

kernel void mpy2(global int *p)
{
    int i = get_global_id(0);
    p[i] *= 2;
}

The host code uses the optional OpenCL C++ bindings: it creates a buffer and a kernel, sets the arguments, writes the buffer, invokes the kernel, and reads the buffer. The kernel is purely algorithmic; it deals with no DMAs, cache flushing, communication protocols, etc.

Usage: Compiling & Linking
When compiling, tell gcc where the headers are:
gcc -I$TI_OCL_INSTALL/include ...
Link with the TI OpenCL library as:
gcc -L$TI_OCL_INSTALL/lib -lTIOpenCL ...
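Folded into a makefile, a minimal sketch of the build; host.cpp and the host target are hypothetical names, TI_OCL_INSTALL is assumed to be set as above, and g++ stands in for gcc because the deck's host code uses the C++ bindings:

CXX      = g++
CXXFLAGS = -I$(TI_OCL_INSTALL)/include
LDFLAGS  = -L$(TI_OCL_INSTALL)/lib -lTIOpenCL

host: host.cpp
	$(CXX) $(CXXFLAGS) host.cpp -o host $(LDFLAGS)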

Agenda: OpenCL Overview, OpenCL Usage, Memory Model, Synchronization, Operational Flow, Availability

OpenCL Memory Model: Overview (© Khronos Group, 2009)
Private memory:
- Per work-item
- Typically registers
Local memory:
- Shared within a workgroup
- Local to a compute unit (core)
Global/constant memory:
- Shared across all compute units (cores) in a device
Host memory:
- Attached to the host CPU
- Can be distinct from global memory (read/write buffer model) or the same as global memory (map/unmap buffer model)
[Diagram: host with host memory, connected to a compute device containing workgroups of work-items, each with private memory, per-workgroup local memory, and device-wide global/constant memory]

OpenCL Memory: Resources (© Khronos Group, 2009)
Buffers:
- Simple chunks of memory
- Kernels can access them however they like (arrays, pointers, structs)
- Kernels can both read and write buffers
Images:
- Opaque 2D or 3D formatted data structures
- Kernels access them only via read_image() and write_image()
- Each image can be read or written in a kernel, but not both
- Only required for GPU devices!
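TI's DSP devices do not require images, but for completeness, a hedged sketch of the image access model on a GPU-style device (copy_img is a hypothetical kernel): one image is read-only, the other write-only, and all access goes through the built-ins:

kernel void copy_img(read_only image2d_t src, write_only image2d_t dst)
{
    // Sampler: unnormalized integer coordinates, clamp at edges, no filtering.
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE   |
                          CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));

    // Images are opaque: no pointers, only the read/write built-ins.
    float4 pixel = read_imagef(src, smp, pos);
    write_imagef(dst, pos, pixel);
}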

OpenCL Memory: Distinct Host and Global Device Memory

char *ary = (char*)malloc(globsz);
for (int i = 0; i < globsz; i++) ary[i] = i;
Buffer buf(context, CL_MEM_READ_WRITE, globsz);
Q.enqueueWriteBuffer(buf, CL_TRUE, 0, globsz, ary);
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
Q.enqueueReadBuffer(buf, CL_TRUE, 0, globsz, ary);
for (int i = 0; i < globsz; i++) ... = ary[i];

[Diagram: host memory holds 0,1,2,3,...; the write copies it to device global memory; the kernel doubles it to 0,2,4,6,...; the read copies it back to host memory]

OpenCL Memory: Shared Host and Global Device Memory

Buffer buf(context, CL_MEM_READ_WRITE, globsz);
char* ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, globsz);  // ownership to host
for (int i = 0; i < globsz; i++) ary[i] = i;
Q.enqueueUnmapMemObject(buf, ary);                                             // ownership to device
Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, globsz);         // ownership to host
for (int i = 0; i < globsz; i++) ... = ary[i];
Q.enqueueUnmapMemObject(buf, ary);                                             // ownership to device

[Diagram: a single shared host + device global memory region holds 0,1,2,3,... before the kernel and 0,2,4,6,... after; map/unmap passes ownership between host and device rather than copying]

Agenda: OpenCL Overview, OpenCL Usage, Memory Model, Synchronization, Operational Flow, Availability

OpenCL Synchronization
- Kernel execution is defined to be the execution and completion of all work-items associated with an enqueued kernel command
- Kernel executions can synchronize at their boundaries through OpenCL events at the host API level (see the sketch below)
- Within a workgroup, work-items can synchronize through barriers and fences, expressed as OpenCL C built-in functions
- Workgroups cannot synchronize with other workgroups, and work-items in different workgroups cannot synchronize
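A hedged host-side sketch of boundary synchronization through events, assuming kernels k1 and k2 were created as in the earlier examples: k2 is not allowed to begin until k1's execution (all of its work-items) completes.

Event k1_done;
std::vector<Event> deps;

// Launch k1 and capture its completion event.
Q.enqueueNDRangeKernel(k1, NullRange, NDRange(1024), NullRange,
                       NULL, &k1_done);

// k2 waits on k1's event before it may start.
deps.push_back(k1_done);
Q.enqueueNDRangeKernel(k2, NullRange, NDRange(1024), NullRange,
                       &deps, NULL);

With an in-order command queue this ordering is implicit; the explicit event list matters when the queue allows out-of-order execution or when the two kernels sit in different queues.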

Agenda: OpenCL Overview, OpenCL Usage, Memory Model, Synchronization, Operational Flow, Availability

OpenCL Operational Flow
[Slide consists of a flow diagram not captured in the transcript]

Agenda: OpenCL Overview, OpenCL Usage, Memory Model, Synchronization, Operational Flow, Availability

TI OpenCL 1.1 Products
Advantech DSPC8681 with four 8-core DSPs; Advantech DSPC8682 with eight 8-core DSPs:
- Each 8-core DSP is an OpenCL device
- Ubuntu Linux PC as the OpenCL host
- OpenCL in limited-distribution alpha; GA approximately end of quarter
[Board diagram: four TMS320C66x 8-core DSPs, each with attached DDR3]
OpenCL on a chip (66AK2H12):
- 4 ARM A15s running Linux as the OpenCL host
- The 8-core DSP as an OpenCL device
- 6 MB on-chip shared memory; up to 10 GB attached DDR3
- GA approximately end of quarter
* Product is based on a published Khronos specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.

Backup: KeyStone OpenCL

Usage: Vector Sum Reduction Example

int acc = 0;
for (int i = 0; i < N; ++i) acc += buffer[i];
return acc;

Sequential in nature, not parallel.

Usage: Example // Vector Sum Reduction

kernel void sum_reduce(global float* buffer, global float* result)
{
    int gid = get_global_id(0); // which work-item am I of all work-items
    int lid = get_local_id(0);  // which work-item am I within my workgroup

    for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
    {
        if (lid < offset) buffer[gid] += buffer[gid + offset];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    if (lid == 0) result[get_group_id(0)] = buffer[gid];
}

Usage: Example // Vector Sum Reduction (Iterative DSP)

kernel void sum_reduce(global float* buffer, local float* acc, global float* result)
{
    int gid = get_global_id(0); // which work-item am I out of all work-items
    int lid = get_local_id(0);  // which work-item am I within my workgroup
    bool first_wi = (lid == 0);
    bool last_wi  = (lid == get_local_size(0) - 1);
    int wg_index  = get_group_id(0); // which workgroup am I

    if (first_wi) acc[wg_index] = 0;
    acc[wg_index] += buffer[gid];
    if (last_wi) result[wg_index] = acc[wg_index];
}

Not valid on a GPU. Could be valid on a device that serializes the work-items in a workgroup, e.g. a DSP.

OpenCL Memory: Example // Vector Sum Reduction (Local Memory)

kernel void sum_reduce(global float* buffer, local float* scratch, global float* result)
{
    int lid = get_local_id(0); // which work-item am I within my workgroup
    scratch[lid] = buffer[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
    {
        if (lid < offset) scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) result[get_group_id(0)] = scratch[lid];
}
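All three reduction kernels leave one partial sum per workgroup in result, so the host finishes the job. A minimal hedged sketch, assuming the host created the result buffer earlier and num_wgs is the workgroup count (globSz/wgSz):

// Blocking read: one partial sum per workgroup.
std::vector<float> partial(num_wgs);
Q.enqueueReadBuffer(result, CL_TRUE, 0, num_wgs * sizeof(float), &partial[0]);

// Accumulate the final sum on the host.
float total = 0.0f;
for (int i = 0; i < num_wgs; ++i)
    total += partial[i];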