OpenCL Introduction A TECHNICAL REVIEW LU OCT. 11 2014.


OPENCL INTRODUCTION | APRIL 11, 2014

CONTENTS
1. OpenCL Architecture
2. OpenCL Programming
3. A Matrix Multiplication Example

1. OPENCL ARCHITECTURE

OPENCL ARCHITECTURE
1. Four Architectural Models
 – Platform Model
 – Execution Model
 – Memory Model
 – Programming Model
2. OpenCL Framework

FOUR ARCHITECTURAL MODELS
 Platform Model
 Execution Model
 Memory Model
 Programming Model

PLATFORM MODEL

PLATFORM MODEL (CONT.)
 One host connected to one or more OpenCL devices.
 An OpenCL device consists of one or more compute units (CUs).
 A CU consists of one or more processing elements (PEs).
 – Computations on a device occur within the PEs.

EXECUTION MODEL
 Kernels
 – execute on one or more OpenCL devices
 Host Program
 – executes on the host
 – defines the context for the kernels
 – manages the execution of kernels

EXECUTION MODEL (CONT.)
 NDRange
 – an N-dimensional index space, where N is 1, 2, or 3
 WORK-ITEM
 – an instance of the kernel
 – identified by a global ID in the NDRange
 – executes the same code in parallel; the specific execution pathway through the code and the data operated upon can vary per work-item

EXECUTION MODEL (CONT.)
 WORK-GROUP
 – Provides a coarse-grained decomposition of the NDRange.
 – Is assigned a unique work-group ID with the same dimensionality as the NDRange.
 – Uses a unique local ID to identify each of its work-items.
 – Its work-items execute concurrently on the PEs of a single CU.
 – Kernels can use synchronization controls within a work-group.
 – The NDRange size should be a multiple of the work-group size in each dimension.

EXECUTION MODEL (CONT.)

EXECUTION MODEL (CONT.)
 Context
 – The host defines a context for the execution of the kernels.
 Resources in the context:
 – Devices: the collection of OpenCL devices to be used by the host.
 – Kernels: the OpenCL functions that run on OpenCL devices.
 – Program objects: the program source and executable that implement the kernels.
 – Memory objects: a set of memory objects visible to the host and the OpenCL devices; memory objects contain values that can be operated on by instances of a kernel.

EXECUTION MODEL (CONT.)
 Command-queue
 – The host creates a data structure called a command-queue to coordinate execution of the kernels on the devices.
 – The host places commands into the command-queue, which then schedules them for execution on a device within the context.
 – Commands execute asynchronously between the host and the device.

EXECUTION MODEL (CONT.)
 Commands in a command-queue:
 – Kernel execution commands: execute a kernel on the processing elements of a device.
 – Memory commands: transfer data to, from, or between memory objects, or map and unmap memory objects from the host address space.
 – Synchronization commands: constrain the order of execution of commands.

EXECUTION MODEL (CONT.)
 Command execution modes:
 – In-order execution
 – Out-of-order execution: any order constraints are enforced by the programmer through explicit synchronization commands.

MEMORY MODEL

MEMORY MODEL (CONT.)
 Private memory
 – per work-item
 Local memory
 – shared within a work-group
 Global/constant memory
 – visible to all work-groups; constant memory is cached
 Host memory
 – on the CPU
 Memory management is explicit
 – data must be moved host -> global -> local and back

MEMORY MODEL (CONT.)
 Memory regions
 – allocation and memory-access capabilities

MEMORY MODEL (CONT.)
 Memory consistency
 – OpenCL uses a relaxed-consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
 – Within a work-item, memory has load/store consistency.
 – Within a work-group, local memory is consistent across work-items at a barrier.
 – Global memory is consistent across the work-items of a single work-group at a barrier, but is not guaranteed consistent across different work-groups.
 – Consistency of memory objects shared between enqueued commands is enforced through synchronization.

PROGRAMMING MODEL
 Data-parallel programming model
 – All the work-items in the NDRange execute in parallel.
 Task-parallel programming model
 – A kernel executes on a compute unit with a work-group containing a single work-item.
 – Parallelism is expressed by: using vector data types implemented by the device, enqueuing multiple tasks, and/or enqueuing native kernels developed using a programming model orthogonal to OpenCL.

PROGRAMMING MODEL (CONT.)
 Synchronization
 – Between work-items in a single work-group: work-group barrier.
 – Between commands enqueued to command-queue(s) in a single context: command-queue barrier, or waiting on an event.

PROGRAMMING MODEL (CONT.)
 Event synchronization

OPENCL FRAMEWORK
 OpenCL platform layer
 – allows a host program to discover OpenCL devices and their capabilities and to create contexts.
 OpenCL runtime
 – allows the host program to manipulate created contexts.
 OpenCL compiler
 – creates executable programs containing OpenCL kernels; the OpenCL programming language it implements supports a subset of the ISO C99 language with parallelism extensions.

2. OPENCL PROGRAMMING

BASIC STEPS
 Step 1: Discover and initialize the platforms
 Step 2: Discover and initialize the devices
 Step 3: Create the context
 Step 4: Create a command queue
 Step 5: Create device buffers
 Step 6: Write the host data to device buffers

BASIC STEPS (CONT.)
 Step 7: Create and compile the program
 Step 8: Create the kernel
 Step 9: Set the kernel arguments
 Step 10: Configure the work-item structure
 Step 11: Enqueue the kernel for execution
 Step 12: Read the output buffer back to the host
 Step 13: Release the OpenCL resources
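The thirteen steps above map almost one-for-one onto OpenCL 1.x API calls. The following pseudocode-style outline (a sketch, not the deck's code: error handling is elided; the kernel name "matmul" and the variables src, hostA, hostC, bytes, and N are assumed to be defined; it will not run without an OpenCL SDK and device) shows the sequence:

```
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);                                    /* Step 1 */

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);      /* Step 2 */

cl_int err;
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);    /* Step 3 */
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);     /* Step 4 */

cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err); /* Step 5 */
cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);
clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, bytes, hostA,
                     0, NULL, NULL);                                     /* Step 6 */

cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);   /* Step 7 */
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(prog, "matmul", &err);                 /* Step 8 */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);                        /* Step 9 */

size_t global[2] = {N, N};                                               /* Step 10 */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                       0, NULL, NULL);                                   /* Step 11 */

clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, bytes, hostC,
                    0, NULL, NULL);                                      /* Step 12 */

clReleaseKernel(kernel); clReleaseProgram(prog);                         /* Step 13 */
clReleaseMemObject(bufA); clReleaseMemObject(bufC);
clReleaseCommandQueue(queue); clReleaseContext(ctx);
```

Because the write and read use CL_TRUE (blocking), the host can safely touch hostC after Step 12 without an explicit clFinish.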

BASIC STRUCTURE
 Host program
 – Query compute devices (platform layer)
 – Create the context and command-queue
 – Create memory objects associated with the context
 – Compile and create kernel objects
 – Issue commands to the command-queue
 – Synchronize commands
 – Release OpenCL resources (runtime)
 Kernels
 – C code with some restrictions and extensions (language)

3. AN EXAMPLE

DESCRIPTION OF THE PROBLEM
(figure: the matrix multiplication C = A × B)

SERIAL IMPLEMENTATION

CALCULATION PROCEDURE DIAGRAM
(figure: matrices A, B, and C)

CHARACTERISTICS OF THE CALCULATION

OPENCL IMPLEMENTATION

OPENCL MATRIX-MULTIPLY CODE
 Kernel

OPENCL IMPLEMENTATION
(flow diagram spanning the platform layer, the runtime layer, and the compiler)
 Query platform
 Query devices
 Command queue
 Create buffers
 Compile program
 Create kernel
 Set arguments
 Execute kernel

THANK YOU!

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.