OpenCL: The Open Standard for Parallel Programming of Heterogeneous Systems. James Xu.


Introduction: Parallel applications are becoming commonplace, from GPGPU and MATLAB to quad-core CPUs.

Challenges: Vendor-specific APIs; the programming gap between CPUs and GPGPUs.

OpenCL (Open Computing Language): Introduces uniformity with a "close-to-silicon" interface; parallel computing using all available resources on the end system. Initially proposed by Apple, now maintained by the Khronos Group (also behind OpenGL and OpenAL), with major vendor support.

OpenCL Overview: All computational resources on an end system are seen as peers: CPUs, GPUs, ARM cores, DSPs, etc. Strict IEEE 754 floating-point behavior, with defined rounding and error bounds. Defines architecture models and a software stack.

Architecture Model – Platform
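The platform model can be explored directly with the discovery API. A minimal sketch (error handling omitted, device count capped at 8 for brevity) that lists the devices an end system exposes as peers:

#include <stdio.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices;

    /* take the first platform and ask for every device type it exposes */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device %u: %s\n", (unsigned)i, name);
    }
    return 0;
}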

Architecture – Execution Model: Kernel – the smallest unit of execution, similar to a C function. Host program – a collection of kernels. Work item – an instance of a kernel at run time. Work group – a collection of work items.
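To make these terms concrete, a minimal hypothetical vector-add kernel: each work item computes one output element, and the host groups work items into work groups when it enqueues the kernel:

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t gid = get_global_id(0);   /* unique id of this work item in the whole index space */
    size_t lid = get_local_id(0);    /* id of this work item within its work group */
    size_t grp = get_group_id(0);    /* id of the work group itself */
    (void)lid; (void)grp;            /* shown only to name the concepts */
    c[gid] = a[gid] + b[gid];
}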

Architecture – Execution Model

Architecture – Memory Model
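The address spaces in the memory model map directly onto kernel qualifiers; a small hypothetical kernel touching each of them:

__constant float scale = 2.0f;                       /* constant memory: read-only for all work items */

__kernel void memory_spaces(__global float *data,    /* global memory: visible to every work item */
                            __local  float *scratch) /* local memory: shared within one work group */
{
    int lid = get_local_id(0);
    float tmp = data[get_global_id(0)];              /* private memory: registers of this work item */

    scratch[lid] = tmp * scale;                      /* stage the value through local memory */
    barrier(CLK_LOCAL_MEM_FENCE);                    /* make local writes visible to the whole group */

    data[get_global_id(0)] = scratch[lid];
}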

Architecture – Programming Model: Data parallel – a work group consists of instances of the same kernel (work items), and different data elements are fed to the work items in the group. Task parallel – a work group consists of a single work item (one kernel instance); work groups run independently, and each compute device executes a number of work groups in parallel, hence task parallelism.

Architecture – Programming Model: Only CPUs are expected to provide task-parallel mechanisms; the data-parallel model must be supported by all OpenCL-compatible devices.
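On the host side the two models correspond to two different enqueue calls. A sketch, assuming a command queue and kernels (data_parallel_kernel, task_kernel_a, task_kernel_b) and a problem size N already exist:

/* data parallel: N work items, grouped into work groups of 64 */
size_t global_size = N;
size_t local_size  = 64;
clEnqueueNDRangeKernel(queue, data_parallel_kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);

/* task parallel: each kernel runs as a single work item; independent tasks queue back to back */
clEnqueueTask(queue, task_kernel_a, 0, NULL, NULL);
clEnqueueTask(queue, task_kernel_b, 0, NULL, NULL);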

OpenCL Runtime: The kernel language is derived from ISO C99. Restrictions: no recursion, no function pointers. All standard data types are supported, including vector types. An extension provides OpenGL interoperability.
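The built-in vector types let a single work item operate on several values at once; a small hypothetical example using float4:

/* float4 packs four floats; arithmetic is applied component-wise */
__kernel void saxpy4(__global const float4 *x,
                     __global float4 *y,
                     const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];    /* four multiply-adds per work item */
}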

OpenCL Software Stack: shows the steps to develop an OpenCL program.

OpenCL Example in C – an FFT kernel executed on the GPU (first part):

__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);                    /* index of this work item within its work group */
    int blockIdx = get_group_id(0) * 1024 + tid;  /* starting offset of this work item's data */
    float2 data[16];

    /* offset the input/output pointers and load 16 points from global memory */
    in  = in  + blockIdx;
    out = out + blockIdx;
    globalLoads(data, in, 64);

OpenCL Example in C – the FFT kernel, continued:

    fftRadix16Pass(data);                  /* in-place radix-16 pass */
    twiddleFactorMul(data, tid, 1024, 0);  /* twiddle-factor multiplication */

    /* exchange data between work items through local memory */
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);
    twiddleFactorMul(data, tid, 64, 4);
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    /* four radix-4 passes */
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    globalStores(data, out, 64);           /* write results back to global memory */
}

OpenCL Example in C – host code that sets up the context, buffers, and program:

context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
clGetContextInfo(context, CL_CONTEXT_DEVICES, sizeof(device_id), &device_id, NULL);
queue = clCreateCommandQueue(context, device_id, 0, NULL);

/* input buffer initialized from the host array srcA, output buffer left uninitialized */
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL, NULL);

/* compile the kernel source and extract the fft1D_1024 kernel */
program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "fft1D_1024", NULL);

/* one-dimensional index space: n work items in work groups of 64 */
global_work_size[0] = n;
local_work_size[0] = 64;

OpenCL Example in C – setting the kernel arguments and launching the kernel:

/* buffer objects for the two __global arguments, local-memory sizes for the two __local arguments */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

/* enqueue the kernel over the index space defined by the work sizes */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size,
                       0, NULL, NULL);
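To complete the example, the host would typically wait for the queue to drain, read the result back, and release all objects. A minimal sketch, assuming a host array dstB of the same size as the buffers:

clFinish(queue);                                    /* wait for the kernel to finish */
clEnqueueReadBuffer(queue, memobjs[1], CL_TRUE, 0,  /* blocking read of the output buffer */
                    sizeof(float)*2*num_entries, dstB, 0, NULL, NULL);

clReleaseMemObject(memobjs[0]);
clReleaseMemObject(memobjs[1]);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);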