
OpenCL
Peter Holvenstot

OpenCL
- Designed as an API and language specification
- Standards maintained by the Khronos Group
  - Current versions: 1.0, 1.1, and 1.2
- Manufacturers release their own SDKs and drivers
- Major backers: Apple, AMD/ATI, Intel
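
Because each vendor ships its own SDK and drivers, a host program can check at run time which OpenCL version each installed platform actually implements. A minimal sketch using the standard platform API (not from the original slides):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint count = 0;
    if (clGetPlatformIDs(8, platforms, &count) != CL_SUCCESS)
        return 1;

    for (cl_uint i = 0; i < count; ++i) {
        char version[128];
        /* CL_PLATFORM_VERSION returns e.g. "OpenCL 1.2 <vendor info>" */
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION,
                          sizeof(version), version, NULL);
        printf("Platform %u: %s\n", i, version);
    }
    return 0;
}
```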

OpenCL
- Alternative to CUDA
- Not limited to ATI GPUs
- Designed for "heterogeneous computing"
- Executable on many devices, including CPUs, GPUs, DSPs, and FPGAs
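
That heterogeneity is visible directly in the API: the same host call enumerates every OpenCL-capable device in the system, whatever its type. A sketch (assumes at least one platform is installed):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* CL_DEVICE_TYPE_ALL matches CPUs, GPUs, and accelerators alike */
    cl_device_id devices[16];
    cl_uint count = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &count);

    for (cl_uint i = 0; i < count; ++i) {
        char name[128];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device %u: %s\n", i, name);
    }
    return 0;
}
```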

OpenCL
- Similar structure of host programs and kernels (to CUDA)
- Set of compute devices is called a 'context'
- Kernels executed by 'processing elements'
- Kernels can be compiled at run-time or build-time
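
Run-time compilation is the common path: the host creates a context over its devices, hands the kernel source to clCreateProgramWithSource, and builds it for those devices. A minimal sketch, with error handling omitted and an illustrative kernel:

```c
#include <CL/cl.h>

/* Illustrative kernel source, compiled at run time */
static const char *source =
    "__kernel void scale(__global float *a, float k) {"
    "    size_t i = get_global_id(0);"
    "    a[i] *= k;"
    "}";

cl_context make_and_build(cl_device_id device, cl_program *prog_out) {
    /* A context groups the devices that will share programs and buffers */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Compile the kernel source at run time for this device */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    *prog_out = prog;
    return ctx;
}
```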

OpenCL
- Task parallelism – many kernels running at once
- OpenCL 1.2 – device can be partitioned down to a single Compute Unit
- Built-in kernels for device-specific functionality
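
Partitioning goes through the OpenCL 1.2 call clCreateSubDevices. The sketch below splits a device into sub-devices of one compute unit each; it assumes the device supports equal partitioning and fails gracefully otherwise:

```c
#include <CL/cl.h>

/* Split 'device' into sub-devices of one compute unit apiece
 * (OpenCL 1.2; returns 0 if the device does not support it). */
cl_uint partition_per_cu(cl_device_id device,
                         cl_device_id *subs, cl_uint max_subs) {
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 1, 0   /* 1 compute unit each */
    };
    cl_uint created = 0;
    if (clCreateSubDevices(device, props, max_subs, subs, &created)
            != CL_SUCCESS)
        return 0;
    return created;
}
```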

Advantages
- Same code can be run on different devices
  - Can also be run on NVIDIA GPUs!
- AMD/ATI attempting to integrate compute elements into other platforms (Accelerated Processing Units)
- Limited library of portable math routines
  - Most common BLAS and FFT routines

Performance

Disadvantages
- No "official" implementation
- Vendors may meet the spec or add restrictions
  - Apple adds restrictions on work-group size
- Devices need appropriate settings to perform well
  - Different capabilities → different performance
  - Solution: tuning/load-balancing framework
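
"Appropriate settings" usually starts with the work-group size: the portable approach is to query what the device and the compiled kernel actually allow instead of hard-coding one value for all vendors. A sketch:

```c
#include <CL/cl.h>

/* Pick a launch size the device/kernel pair actually supports */
size_t pick_work_group_size(cl_kernel kernel, cl_device_id device) {
    size_t device_max = 0, kernel_max = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);
    return kernel_max < device_max ? kernel_max : device_max;
}
```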

Non-Optimized Performance

Restrictions
- No recursion, variadics, or function pointers
- Cannot dynamically allocate memory from the device
- No native variable-length arrays or double precision
- Some can be worked around by extensions
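
Double precision is the classic extension workaround: a kernel opts in with a pragma, valid only on devices that advertise cl_khr_fp64 in CL_DEVICE_EXTENSIONS. A sketch of the kernel side (names illustrative):

```c
/* OpenCL C kernel enabling double precision via the cl_khr_fp64
 * extension; the host should first check CL_DEVICE_EXTENSIONS. */
static const char *fp64_kernel =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable          \n"
    "__kernel void axpy(double a,                           \n"
    "                   __global const double *x,           \n"
    "                   __global double *y) {               \n"
    "    size_t i = get_global_id(0);                       \n"
    "    y[i] = a * x[i] + y[i];                            \n"
    "}                                                      \n";
```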

Terminology
CUDA                       OpenCL
Scalar Core                Stream Core
Streaming Multiprocessor   Compute Unit
Warp                       Wavefront
PTX                        Intermediate Language

Terminology
CUDA                       OpenCL
Host Memory                Host Memory
Global/Device Memory       Global Memory
Constant Memory            Constant Memory
Shared Memory              Local Memory
Local Memory / Registers   Private Memory
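
The mapping shows up in kernel code as address-space qualifiers. A small illustrative kernel (kernel and argument names are ours, not from the slides):

```c
/* OpenCL C kernel illustrating the address-space qualifiers */
__kernel void demo(__global float *data,      /* global   = CUDA global/device */
                   __constant float *coeffs,  /* constant = CUDA constant      */
                   __local float *scratch)    /* local    = CUDA shared        */
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    float acc;                        /* private = CUDA registers/local */
    scratch[lid] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE);     /* cf. __syncthreads() in CUDA */
    acc = scratch[lid] * coeffs[0];
    data[gid] = acc;
}
```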

Terminology
CUDA                       OpenCL
Grid                       NDRange
Block                      Work group
Thread                     Work item
Thread ID                  Global ID
Block Index                Work group ID
Thread Index               Local ID
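
In code, the index mapping means CUDA's manual global-index arithmetic collapses into a single OpenCL call. A side-by-side sketch for a 1-D launch (the two kernels would live in separate source files, each built by its own toolchain):

```c
/* CUDA: global index built from block and thread indices */
__global__ void scale_cuda(float *a, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= k;
}

/* OpenCL equivalent: get_global_id() already combines
 * work-group ID, local size, and local ID */
__kernel void scale_cl(__global float *a, float k) {
    size_t i = get_global_id(0);
    a[i] *= k;
}
```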
