Introduction to OpenCL*
Ohad Shacham, Intel Software and Services Group
Thanks to Elior Malul, Arik Narkis, and Doron Singer

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Evolution of OpenCL*: Sequential Programs

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    //read input
    scalar_mul(…);
    return 0;
}

Evolution of OpenCL*: Multi-threaded Programs

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    //read input
    pthread_create(…, scalar_mul);   // a second thread handles half of the elements
    scalar_mul(n/2, …);              // the main thread handles the other half
    pthread_join(…);
    return 0;
}

Problems – concurrent programs
Writing concurrent programs is hard
–Concurrent algorithms
–Threads
–Work balancing
–Need to update programs when adding new cores to the system
–Data races, livelocks, deadlocks
Solving bugs in concurrent programs is harder

Evolution of OpenCL*: Vector instruction utilization

void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    /* assumes n is a multiple of 4 and a, b, c are 16-byte aligned */
    for (i = 0; i < n; i += 4) {
        __m128 a_vec = _mm_load_ps(a + i);
        __m128 b_vec = _mm_load_ps(b + i);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + i, c_vec);
    }
}

int main()
{
    //read input
    scalar_mul(…);
    return 0;
}

Problems – vector instructions usage
Utilizing vector instructions is also not a trivial task
Vendor dependent code
Usage is not future proof
–New efficient instructions
–Wider vector registers

GPGPU
GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications (www.gpgpu.org).
Photo taken from: http://folding.stanford.edu/English/FAQ-NVIDIA

GPUs utilization
Many cores can be utilized for computation
GPUs become programmable - GPGPU
–CUDA*
Problems
–Each vendor has its own language
–Requires tweaking to get performance
How can I run both on CPUs and GPUs?

What do we need?
Heterogeneous
–Automatically utilizes all available processing units
Portable
High Performance
–Utilize hardware characteristics
Future Proof
Abstract concurrency from the user

OpenCL* – heterogeneous computing
Diagram based on deck presented in OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* Chair

OpenCL* in a nutshell
An OpenCL* application consists of two parts:
–A set of APIs in C that allows compiling and running OpenCL* kernels
–“Kernels”: code that is executed on the device by the OpenCL* runtime

Data parallelism
A fundamental pattern in high-performance parallel algorithms
Applying same computation logic across multiple data elements
[Figure: a sequential loop (i = 0; i = i + 1) computing C[i] = A[i] * B[i], next to the same computation applied independently for i = 0, 1, 2, 3, …, N-2, N-1]

Data parallelism Usage
Client machines
–Video transcoding and editing
–Pro image editing
–Facial recognition
Workstations
–CAD tools
–3D data content creation
Servers
–Science and simulations
–Medical imaging
–Oil & Gas
–Finance (e.g., Black-Scholes)
–…

OpenCL* kernel example

void array_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

OpenCL* kernel example

__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

[Figure: arrays a, b and c, with get_global_id(0) marking the element handled by one work item]

Execution Model
[Figure: the global index space, divided into work groups, each made up of work items]
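The relation between the global range, work groups and work items can be emulated on the host in plain C. This is a minimal sketch, assuming a one-dimensional range with no global offset and a global size that is a multiple of the work-group size; `emulate_ndrange` is a hypothetical helper written for illustration and is not part of the OpenCL* API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side emulation of a 1-D OpenCL index space.
   For every work item it records the work group it belongs to and
   its position inside that group. */
void emulate_ndrange(size_t global_size, size_t local_size,
                     size_t *group_of, size_t *local_of)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;

    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            /* Value get_global_id(0) would return for this work item */
            size_t global = group * local_size + local;
            group_of[global] = group;  /* what get_group_id(0) would return */
            local_of[global] = local;  /* what get_local_id(0) would return */
        }
    }
}
```

With a global size of 8 and a work-group size of 4, work item 5 lands in work group 1 at local position 1.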

The OpenCL* model
OpenCL* runtime is invoked on Host CPU (using OpenCL* API)
–Choose target device/s for parallel computation
Data-parallel functions, called Kernels, are compiled (on host)
–Compiled for specific target devices (CPU, GPU, etc.)
Data chunks (called Buffers) are moved across devices
Kernel “commands” queued for execution on target devices
–Asynchronous execution

The OpenCL* - C language
Derived from ISO C99
–A few restrictions, e.g., no recursion, no function pointers
–Short vector types, e.g., float4, short2, int16
–Built-in functions: math (e.g., sin), geometric, common (e.g., min, clamp)

OpenCL* key features
Unified programming model for all devices
–Develop once, run everywhere
Designed for massive data-parallelism
–Implicitly takes care of threading and intrinsics for optimal performance

OpenCL* key features
Dynamic compilation model (Just In Time - JIT)
–Future proof, provided vendors update their implementations
Enables heterogeneous computing
–A clever application can use all resources of the platform simultaneously

Benefits to User
Hardware abstraction
–Write once, run everywhere
–Cross devices, cross vendors
Automatic parallelization
Good tradeoff between development simplicity and performance
Future proof optimizations
Open standard
–Supported by many vendors

Benefits to Hardware Vendor
Enables good hardware ‘time to market’
Programming model enables good hardware utilization
Applications are automatically portable and future proof
–JIT compilation

OpenCL* Cons
Low level – based on C99
–No heap!
–Lean framework
Expert tool
–In terms of correctness and performance
OpenCL* is not performance portable
–Tweaking is needed for each vendor
–Future specs and implementations may require no tweaking?

Vector dot multiplication

void vectorDotMul(int* vecA, int* vecB, int size, int* result)
{
    *result = 0;
    for (int i = 0; i < size; ++i)
        *result += vecA[i] * vecB[i];
}

Single work item
[Figure: a vector of 1s is multiplied element by element with a vector of 2s; a single work item accumulates the products one after another, giving the running sums 2, 4, 6, 8, 10, 12, 14, 16]

Vector dot multiplication in OpenCL*

__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    // Only work item 0 computes; all other work items stay idle
    if (get_global_id(0) == 0) {
        *result = 0;
        for (int i = 0; i < size; ++i)
            *result += vecA[i] * vecB[i];
    }
}

Single work group
[Figure: a vector of 1s is multiplied element by element with a vector of 2s; each of four work items produces a partial sum of 4, and the partial sums are then added one after another: 4, 8, 12, 16]

Vector dot multiplication in a single work group

__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];   // assumes localSize <= MAX_SIZE
    int localSize = get_local_size(0);
    int work = size / localSize;                 // assumes size is a multiple of localSize
    int start = id * work;
    int end = start + work;

    // Work item calculation
    partialSum[id] = 0;
    for (int j = start; j < end; ++j)
        partialSum[id] += vecA[j] * vecB[j];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction
    if (id == 0) {
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}
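The two phases of this kernel, the per-work-item calculation and the reduction by work item 0, can be traced sequentially on the host. The following is a sketch only, assuming that `size` is a multiple of the work-group size; `dot_single_group` and `MAX_LOCAL_SIZE` are hypothetical names chosen for the illustration.

```c
#include <assert.h>

#define MAX_LOCAL_SIZE 64  /* hypothetical upper bound on the work-group size */

/* Sequential emulation of one work group computing a dot product. */
int dot_single_group(const int *vecA, const int *vecB, int size, int local_size)
{
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert(size % local_size == 0);

    int partial[MAX_LOCAL_SIZE];
    int work = size / local_size;

    /* Phase 1: each emulated work item sums its own contiguous chunk. */
    for (int id = 0; id < local_size; ++id) {
        partial[id] = 0;
        for (int j = id * work; j < (id + 1) * work; ++j)
            partial[id] += vecA[j] * vecB[j];
    }

    /* In the kernel, barrier(CLK_LOCAL_MEM_FENCE) separates the phases. */

    /* Phase 2: work item 0 adds the partial sums one after another. */
    int result = 0;
    for (int i = 0; i < local_size; ++i)
        result += partial[i];
    return result;
}
```

For eight 1s multiplied by eight 2s with four work items, each partial sum is 4 and the result is 16, matching the figures in the deck.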

Efficient reduction
[Figure: a vector of 1s is multiplied element by element with a vector of 2s; four work items produce the partial sums 4, 4, 4, 4, which are combined pairwise into 8, 8 and then into 16]
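The pairwise combination in the figure can be written as a loop that halves the number of active work items at every step. This is a host-side sketch in plain C, assuming the number of partial sums is a power of two; `tree_reduce` is a hypothetical name, and in a real kernel each inner-loop iteration would be a separate work item with a barrier after every step.

```c
#include <assert.h>

/* In-place pairwise (tree) reduction of n partial sums.
   n must be a power of two. Returns the total, left in partial[0]. */
int tree_reduce(int *partial, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    for (int stride = n / 2; stride > 0; stride /= 2) {
        /* Each id below corresponds to one active work item in this step. */
        for (int id = 0; id < stride; ++id)
            partial[id] += partial[id + stride];
        /* A kernel would place barrier(CLK_LOCAL_MEM_FENCE) here. */
    }
    return partial[0];
}
```

The serial reduction needs one addition per partial sum in sequence, while this form needs only log2(n) steps when the additions inside a step run in parallel.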

Vectorization
Processors provide vector units
–SIMD on CPUs
–Warp on GPUs
Use them to perform several operations in parallel
–Arithmetic operations
–Binary operations
–Memory operations

Loop vectorization

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Loop vectorization

void mul(int size, int* a, int* b, int* c)
{
    /* assumes size is a multiple of 4 */
    for (int i = 0; i < size; i += 4) {
        c[i]   = a[i]   * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}
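The unrolled loop is only correct when `size` is a multiple of 4. One common way to lift that restriction is a scalar tail loop for the leftover elements. The sketch below is an illustration and not code from the deck; `mul_unrolled` is a hypothetical name.

```c
#include <assert.h>

/* Multiply unrolled by 4, with a scalar tail loop so that any
   non-negative size is handled. */
void mul_unrolled(int size, const int *a, const int *b, int *c)
{
    assert(size >= 0);

    int i = 0;
    int main_end = size - (size % 4);  /* largest multiple of 4 <= size */

    /* Main body: four elements per iteration, a candidate for one vector op. */
    for (; i < main_end; i += 4) {
        c[i]     = a[i]     * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
    }

    /* Tail: at most three remaining elements, processed one by one. */
    for (; i < size; ++i)
        c[i] = a[i] * b[i];
}
```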

Loop vectorization

void mul(int size, int* a, int* b, int* c)
{
    /* Integer SSE intrinsics are used because a, b and c hold ints
       (_mm_mullo_epi32 requires SSE4.1); assumes size is a multiple of 4 */
    for (int i = 0; i < size; i += 4) {
        __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
        _mm_storeu_si128((__m128i *)(c + i), c_vec);
    }
}

Automatic loop vectorization
Is there a dependency between a, b, and c?

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Automatic loop vectorization
[Figure: arrays c and b]

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}

Automatic loop vectorization
[Figure: arrays c and b]

void mul(int size, int* a, int* b, int* c)
{
    for (int i = 0; i < size; i += 4) {
        c[i]   = a[i]   * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}
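The dependency question matters because a C compiler cannot assume that the three pointers refer to separate memory. The sketch below shows what can go wrong, under the assumption (made up for this illustration) that `c` starts one element after `b` in the same buffer. `mul_scalar` and `mul_vec4` are hypothetical names; `mul_vec4` imitates a 4-wide vector unit by performing all four loads before any store.

```c
#include <assert.h>

/* Reference: one element at a time, in program order. */
void mul_scalar(int size, const int *a, const int *b, int *c)
{
    for (int i = 0; i < size; ++i)
        c[i] = a[i] * b[i];
}

/* Emulates a 4-wide vector unit: loads four elements of a and b,
   then stores four results. */
void mul_vec4(int size, const int *a, const int *b, int *c)
{
    assert(size % 4 == 0);
    for (int i = 0; i < size; i += 4) {
        int av[4], bv[4];
        for (int k = 0; k < 4; ++k) {   /* vector loads */
            av[k] = a[i + k];
            bv[k] = b[i + k];
        }
        for (int k = 0; k < 4; ++k)     /* vector multiply and store */
            c[i + k] = av[k] * bv[k];
    }
}
```

When `c` overlaps `b` this way, the scalar loop reads values it wrote in earlier iterations, while the vector version reads the old values, so the two produce different results. This is why an automatic vectorizer has to prove independence or fall back to scalar code.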

Automatic vectorization in OpenCL*

__kernel void mul(int size, __global const int* a, __global const int* b, __global int* c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

Automatic vectorization in OpenCL*

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    c[id] = a[id] * b[id];
}

Automatic vectorization in OpenCL*

for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    c[id]   = a[id]   * b[id];
    c[id+1] = a[id+1] * b[id+1];
    c[id+2] = a[id+2] * b[id+2];
    c[id+3] = a[id+3] * b[id+3];
}

Automatic vectorization in OpenCL*

/* Integer SSE intrinsics are used because a, b and c hold ints
   (_mm_mullo_epi32 requires SSE4.1) */
for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + id));
    __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
    _mm_storeu_si128((__m128i *)(c + id), c_vec);
}

Single work group
[Figure: a vector of 1s is multiplied element by element with a vector of 2s; four work items produce the partial sums 4, 4, 4, 4, which are combined pairwise into 8, 8 and then into 16]

Vectorizer friendly
[Figure: a vector of 1s is multiplied element by element with a vector of 2s; four work items produce the partial sums 4, 4, 4, 4, which are combined pairwise into 8, 8 and then into 16]

Vectorizer-friendly vector dot multiplication

__kernel void vectorDotMul(__global const int* vecA, __global const int* vecB,
                           int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];   // assumes localSize <= MAX_SIZE
    int localSize = get_local_size(0);

    // Work item calculation: neighbouring work items read neighbouring elements
    partialSum[id] = 0;
    for (int j = id; j < size; j += localSize)
        partialSum[id] += vecA[j] * vecB[j];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction
    if (id == 0) {
        *result = 0;
        for (int i = 0; i < localSize; ++i)
            *result += partialSum[i];
    }
}
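The difference from the earlier blocked kernel lies in which element each work item touches at a given loop step. The helpers below are a sketch with hypothetical names, assuming the blocked version splits the vector into equal contiguous chunks and the vectorizer-friendly version interleaves work items with a stride equal to the work-group size.

```c
#include <assert.h>

/* Element read by work item `id` on its `step`-th iteration when each
   work item owns a contiguous block of size / local_size elements. */
int blocked_index(int id, int step, int size, int local_size)
{
    assert(local_size > 0 && size % local_size == 0);
    int work = size / local_size;
    return id * work + step;
}

/* Element read by work item `id` on its `step`-th iteration when work
   items are interleaved with a stride of local_size. */
int interleaved_index(int id, int step, int local_size)
{
    assert(local_size > 0);
    return step * local_size + id;
}
```

With 8 elements and 4 work items, neighbouring work items are 2 elements apart in the blocked layout but exactly 1 apart in the interleaved layout. Adjacent lanes reading adjacent addresses is the pattern that maps onto a single vector load, whereas the blocked layout tends to need a gather.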

Predication

__kernel void mul(int size, __global const int* a, __global const int* b, __global int* c)
{
    int id = get_global_id(0);
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}

Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}

How can we vectorize the loop?

Predication

for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
    bool mask = (id > 6);
    int c1 = a[id] * b[id];    // computed for every id
    int c2 = a[id] + b[id];    // computed for every id
    c[id] = (mask) ? c1 : c2;  // the mask selects which result is kept
}

Predication

/* Integer SSE intrinsics are used because a, b and c hold ints
   (_mm_mullo_epi32 and _mm_blendv_epi8 require SSE4.1) */
for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i idVec  = /* vector of consecutive ids: id, id+1, id+2, id+3 */;
    __m128i mask   = _mm_cmpgt_epi32(idVec, Vec6);   // Vec6: vector with 6 in every lane
    __m128i a_vec  = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec  = _mm_loadu_si128((const __m128i *)(b + id));
    __m128i c1_vec = _mm_mullo_epi32(a_vec, b_vec);
    __m128i c2_vec = _mm_add_epi32(a_vec, b_vec);
    __m128i c3_vec = _mm_blendv_epi8(c2_vec, c1_vec, mask);  // c1 where mask is set, c2 elsewhere
    _mm_storeu_si128((__m128i *)(c + id), c3_vec);
}
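Predication can be checked without vector hardware by emulating the lanes in plain C and comparing against the branching version. This is a sketch under stated assumptions: four lanes, a range whose length is a multiple of 4, and the hypothetical names `mul_or_add_branch` and `mul_or_add_predicated`.

```c
#include <assert.h>

/* Reference: the original control flow, one id at a time. */
void mul_or_add_branch(int start, int end, const int *a, const int *b, int *c)
{
    for (int id = start; id < end; ++id) {
        if (id > 6)
            c[id] = a[id] * b[id];
        else
            c[id] = a[id] + b[id];
    }
}

/* Predicated form: both results are computed for every lane and a
   per-lane mask selects which one is stored. */
void mul_or_add_predicated(int start, int end, const int *a, const int *b, int *c)
{
    assert((end - start) % 4 == 0);
    for (int id = start; id < end; id += 4) {
        for (int lane = 0; lane < 4; ++lane) {  /* one emulated vector op */
            int i = id + lane;
            int mask = (i > 6);
            int c1 = a[i] * b[i];
            int c2 = a[i] + b[i];
            c[i] = mask ? c1 : c2;
        }
    }
}
```

The predicated form does extra arithmetic, since every lane computes both the product and the sum, but it removes the branch, so all lanes execute the same instruction stream.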

General tweaking
Consecutive memory accesses
–SIMD, WARP
How can we vectorize with control flow?
Can we somehow create efficient code with control flow?
–Uniform control flow (CF)
–CF that diverges only at SIMD-size granularity
Enough work groups to utilize machine

Architecture tweaking
CPU
–Locality
–No local memory (also slow in some GPUs)
–Enough compute for a work group: overcome thread creation overhead
GPU
–Use local memory
–Avoid bank conflicts

Conclusion
OpenCL* is an open standard that lets developers:
–Write the same code for any type of processor
–Use all existing resources of a platform in their application
Automatic parallelism
OpenCL* applications are automatically portable and forward compatible
OpenCL* is still an expert tool
–OpenCL* is not performance portable
–Tweaking is needed for each vendor

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
