Introduction to OpenCL* Ohad Shacham Intel Software and Services Group Thanks to Elior Malul, Arik Narkis, and Doron Singer


Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Evolution of OpenCL* – Sequential Programs

void scalar_mul(int n, const float *a, const float *b, float *c) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main() {
    // read input
    scalar_mul(…);
    return 0;
}

Evolution of OpenCL* – Multi-threaded Programs

void scalar_mul(int n, const float *a, const float *b, float *c) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main() {
    // read input
    pthread_create(…, scalar_mul, …);
    scalar_mul(n/2, …);
    pthread_join(…);
    return 0;
}

Problems – concurrent programs
Writing concurrent programs is hard:
– concurrent algorithms, threads, work balancing
– programs need to be updated when new cores are added to the system
– data races, livelocks, deadlocks
Debugging concurrent programs is even harder.

Evolution of OpenCL* – Vector instruction utilization

void scalar_mul(int n, const float *a, const float *b, float *c) {
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 a_vec = _mm_load_ps(a + i);
        __m128 b_vec = _mm_load_ps(b + i);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + i, c_vec);
    }
}

int main() {
    // read input
    scalar_mul(…);
    return 0;
}

Problems – vector instruction usage
Utilizing vector instructions is also not a trivial task:
– vendor-dependent code
– usage is not future proof (new, more efficient instructions; wider vector registers)

GPGPU
GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications.

GPU utilization
Many cores can be utilized for computation. GPUs became programmable – GPGPU (e.g., CUDA*).
Problems:
– each vendor has its own language
– requires tweaking to get performance
– how can I run on both CPUs and GPUs?

What do we need?
– Heterogeneous: automatically utilizes all available processing units
– Portable
– High performance: utilizes hardware characteristics
– Future proof
– Abstracts concurrency away from the user

OpenCL* – heterogeneous computing
(Diagram based on a deck presented at the OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* chair.)

OpenCL* in a nutshell
An OpenCL* application consists of two parts:
– a set of C APIs for compiling and running OpenCL* “kernels”
– kernel code that is executed on the device by the OpenCL* runtime

Data parallelism
A fundamental pattern in high-performance parallel algorithms: applying the same computation logic across multiple data elements.
(Diagram: C[i] = A[i] * B[i] evaluated sequentially, i = 0, 1, 2, …, versus all iterations i = 0 … N-1 performed in parallel.)

Data parallelism – usage
Client machines: video transcoding and editing, pro image editing, facial recognition
Workstations: CAD tools, 3D content creation
Servers: science and simulations, medical imaging, oil & gas, finance (e.g., Black-Scholes), …

OpenCL* kernel example

void array_mul(int n, const float *a, const float *b, float *c) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c) {
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

OpenCL* kernel example

__kernel void array_mul(__global const float *a, __global const float *b, __global float *c) {
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

(Diagram: each work item reads a and b at position get_global_id(0) and writes the product to c.)

Execution Model
(Diagram: the global index space is divided into work groups; each work group consists of work items.)

The OpenCL* model
– The OpenCL* runtime is invoked on the host CPU (using the OpenCL* API); the host chooses the target device(s) for parallel computation
– Data-parallel functions, called kernels, are compiled on the host for specific target devices (CPU, GPU, etc.)
– Data chunks (called buffers) are moved across devices
– Kernel “commands” are queued for execution on target devices; execution is asynchronous
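The steps above map onto a fixed sequence of OpenCL* API calls. A condensed host-side outline (error handling and most setup elided, so this is a sketch rather than a runnable program):

```c
/* Outline of the host-side flow for the array_mul kernel; error checks omitted. */
cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

cl_context       ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, &err);

/* Compile the kernel source for the chosen device (JIT). */
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel  k    = clCreateKernel(prog, "array_mul", &err);

/* Move data to the device as buffers, bind kernel arguments. */
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(float), a, &err);
clSetKernelArg(k, 0, sizeof(cl_mem), &bufA);
/* ... bufB, bufC likewise ... */

/* Queue the kernel command; execution is asynchronous. */
size_t global_size = n;
clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, n * sizeof(float), c, 0, NULL, NULL);  /* blocking read */
```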

The OpenCL* C language
– Derived from ISO C99, with a few restrictions (e.g., no recursion, no function pointers)
– Short vector types, e.g., float4, short2, int16
– Built-in functions: math (e.g., sin), geometric, common (e.g., min, clamp)

OpenCL* key features
– Unified programming model for all devices: develop once, run everywhere
– Designed for massive data parallelism
– Implicitly takes care of threading and intrinsics for optimal performance

OpenCL* key features
– Dynamic compilation model (just-in-time, JIT)
– Future proof, provided vendors update their implementations
– Enables heterogeneous computing: a clever application can use all resources of the platform simultaneously

Benefits to the user
– Hardware abstraction: write once, run everywhere – across devices, across vendors
– Automatic parallelization
– Good tradeoff between development simplicity and performance
– Future-proof optimizations
– Open standard, supported by many vendors

Benefits to the hardware vendor
– Enables good hardware time to market
– The programming model enables good hardware utilization
– Applications are automatically portable and future proof (JIT compilation)

OpenCL* cons
– Low level: based on C99, no heap, lean framework
– An expert tool, in terms of both correctness and performance
– OpenCL* is not performance portable: tweaking is needed for each vendor (future specs and implementations may reduce the need for tweaking)

Vector dot multiplication

void vectorDotMul(int* vecA, int* vecB, int size, int* result) {
    *result = 0;
    for (int i = 0; i < size; ++i)
        *result += vecA[i] * vecB[i];
}

Single work item
(Diagram: one work item walks both vectors, multiplying element pairs and accumulating the products into the result.)

Vector dot multiplication in OpenCL*

__kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                           int size, __global int* result) {
    if (get_global_id(0) == 0) {
        *result = 0;
        for (int i = 0; i < size; ++i)
            *result += vecA[i] * vecB[i];
    }
}

Single work group
(Diagram: each work item in the group multiplies and accumulates its own chunk of the vectors; the partial sums are then combined.)

__kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                           int size, __global int* result) {
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);
    int work = size / localSize;
    int start = id * work;
    int end = start + work;
    partialSum[id] = 0;
    for (int j = start; j < end; ++j)          // work item calculation
        partialSum[id] += vecA[j] * vecB[j];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (id == 0) {                             // reduction
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}

Efficient reduction
(Diagram illustrating a more efficient reduction of the partial sums.)

Vectorization
Processors provide vector units: SIMD on CPUs, warps on GPUs. These can be used to perform several operations in parallel:
– arithmetic operations
– binary operations
– memory operations

Loop vectorization

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Loop vectorization

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; i += 4) {
        c[i] = a[i] * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}

Loop vectorization

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 a_vec = _mm_load_ps(a + i);
        __m128 b_vec = _mm_load_ps(b + i);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + i, c_vec);
    }
}

Automatic loop vectorization
Is there a dependency between a, b, and c?

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Automatic loop vectorization
If c and b overlap in memory, the iterations are not independent:

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Automatic loop vectorization
With such an overlap, the 4-at-a-time version reads elements that the scalar loop would already have overwritten, so the compiler may vectorize only when it can prove no dependency exists:

void mul(int size, int* a, int* b, int* c) {
    for (int i = 0; i < size; i += 4) {
        c[i] = a[i] * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}

Automatic vectorization in OpenCL*

__kernel void mul(int size, __global int* a, __global int* b, __global int* c) {
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

Automatic vectorization in OpenCL*

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    c[id] = a[id] * b[id];
}

Automatic vectorization in OpenCL*

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    c[id] = a[id] * b[id];
    c[id+1] = a[id+1] * b[id+1];
    c[id+2] = a[id+2] * b[id+2];
    c[id+3] = a[id+3] * b[id+3];
}

Automatic vectorization in OpenCL*

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128 a_vec = _mm_load_ps(a + id);
    __m128 b_vec = _mm_load_ps(b + id);
    __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
    _mm_store_ps(c + id, c_vec);
}

Single work group
(Diagram: each work item processes a contiguous chunk, so at any given step adjacent work items access memory locations that are far apart.)

Vectorizer friendly
(Diagram: work items process interleaved elements, so at any given step adjacent work items access consecutive memory locations.)

__kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                           int size, __global int* result) {
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);
    partialSum[id] = 0;
    for (int j = id; j < size; j += localSize)   // work item calculation
        partialSum[id] += vecA[j] * vecB[j];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (id == 0) {                               // reduction
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}

Predication

__kernel void mul(int size, __global int* a, __global int* b, __global int* c) {
    int id = get_global_id(0);
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}

Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}

How can we vectorize the loop?

Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    bool mask = (id > 6);
    int c1 = a[id] * b[id];
    int c2 = a[id] + b[id];
    c[id] = (mask) ? c1 : c2;
}

Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i idVec = /* vector of consecutive ids */;
    __m128 mask = _mm_castsi128_ps(_mm_cmpgt_epi32(idVec, Vec6));
    __m128 a_vec = _mm_load_ps(a + id);
    __m128 b_vec = _mm_load_ps(b + id);
    __m128 c1_vec = _mm_mul_ps(a_vec, b_vec);
    __m128 c2_vec = _mm_add_ps(a_vec, b_vec);
    __m128 c3_vec = _mm_blendv_ps(c2_vec, c1_vec, mask);  // mask set: take c1
    _mm_store_ps(c + id, c3_vec);
}

General tweaking
– Consecutive memory accesses (SIMD, warp)
– How can we vectorize with control flow? Can we somehow create efficient code with control flow? Uniform control flow helps; control flow that diverges within the SIMD width is costly
– Enough work groups to utilize the machine

Architecture tweaking
CPU:
– locality
– no local memory (local memory is also slow on some GPUs)
– enough compute per work group to overcome thread-creation overhead
GPU:
– use local memory
– avoid bank conflicts

Conclusion
– OpenCL* is an open standard that lets developers write the same code for any type of processor and use all available resources of a platform in one application
– Automatic parallelism
– OpenCL* applications are automatically portable and forward compatible
– OpenCL* is still an expert tool: it is not performance portable, and tweaking is needed for each vendor

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. 
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision # Legal Disclaimer & Optimization Notice Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 51