Heterogeneous Programming: OpenCL and High Level Programming Tools
Stéphane BIHAN, CAPS
Stream Computing Workshop, Dec. 7-9 2009, KTH, Stockholm



Outline
- Introduction to accelerating technologies
- OpenCL: overview, portability and performance
- OpenCL with high level programming tools

CAPS Profile

Company Profile
- Founded in 2002
  - Spin-off of a French INRIA research lab
  - Expertise in processor architecture and code generation
  - 30 employees
- Mission: help customers leverage the performance of manycore processors
  - Software vendor of HMPP
  - Professional services: training, porting, consulting

Accelerating Technologies: Current and Upcoming

Multicore/Manycore Architectures
- Multicore/manycore is mainstream
  - Intel Nehalem
  - AMD Shanghai, with Magny-Cours to come
  - About 100 GFLOPS for 100 W/core
  - It is about performance, not parallelism
- Manycore is here now
  - Performance/Watt is the new efficiency scale
  - NVIDIA Tesla as coprocessor (4 TFLOPS SP on the S1070), Fermi
  - AMD FireStream and the Fusion project
  - Intel Larrabee: CPU-GPU convergence

Multiple Levels of Parallelism
- Amdahl's law is forever: all levels of parallelism need to be exploited
- Programming the various hardware components of a node cannot be done separately
[Diagram: a node with CPUs and GPUs connected over PCIe gen2 and a network; MPI across nodes, OpenMP on the CPU cores, stream programming on the GPUs]

Stream Programming
- Hardware-oriented languages and APIs
  - Brook, the pioneer (Stanford University)
  - AMD Brook+ with CAL/IL
  - NVIDIA CUDA compiler and libraries
  - Intel Ct library with compiler runtime
  - OpenCL standardization (Apple initiative, Khronos Group)
- Directive-based compiler technologies
  - CAPS HMPP Workbench
  - PGI compiler

Stream Computing
- The same computation is performed on each element of a collection of data (a stream)
- There are no data dependences between the computations on different stream elements

GPU Stream Architecture
- Massively data parallel
- Needs thousands of computation threads to be efficient
- Optimization consists in
  - Using memories/registers efficiently
  - Providing a high compute-to-bandwidth ratio

OpenCL Overview

OpenCL
- Open Computing Language
  - Royalty-free, cross-platform, C-based programming interface
  - Subset of ISO C99 with language extensions
  - Data-, vector- and task-parallel compute model
- Host-to-compute-devices (GPUs) model
- Platform layer API and runtime API
  - Hardware abstraction layer, …
  - Manage resources
- Just-in-time compilation
  - Kernel source is sent to the driver
- Supported on most OSs

OpenCL Memory Hierarchy
[Diagram of the OpenCL memory hierarchy not reproduced in the transcript.]
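As an illustration only (not from the deck), a kernel can name the four OpenCL address spaces explicitly; the kernel name and arguments below are made up:

__constant float scale = 2.0f;                         // __constant: read-only, visible to all work-items

__kernel void scaled_copy(__global const float* in,    // __global: device DRAM
                          __global float* out,
                          __local float* tile)         // __local: shared within one work-group
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float v = in[gid];                                 // __private: per-work-item registers
    tile[lid] = v;
    barrier(CLK_LOCAL_MEM_FENCE);                      // make the local store visible to the whole group
    out[gid] = scale * tile[lid];
}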

Platform Layer API and Runtime
- Context
  - Collection of devices
- Workgroups and work-items
- Data buffer objects
- Command queues
  - Kernel execution commands
  - Memory commands (transfer or mapping)
  - Synchronization
- Platform layer
  - Querying devices
  - Creating contexts
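A minimal host-side sketch of the platform-layer calls listed above (illustrative, not from the deck; error checking omitted, CL/cl.h assumed included, and the variable names and nbytes are made up):

cl_platform_id platform;
cl_device_id   device;
clGetPlatformIDs(1, &platform, NULL);                                    // query a platform
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);          // query a GPU device

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);    // context = collection of devices
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);     // commands are submitted here
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL); // data buffer object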

Command Queues
[Diagram of command queues not reproduced in the transcript.]
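The command-queue slide is only a diagram in the deck; the sketch below (illustrative, reusing the queue, buf and kernel names assumed above) shows the three kinds of commands it refers to, namely memory commands, kernel execution and synchronization:

// Memory command: non-blocking copy of host data into a device buffer
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, host_src, 0, NULL, NULL);

// Kernel execution command over a 1D index space
size_t global = N, local = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

// Synchronization: blocking read, then wait for the whole queue to drain
clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, out_bytes, host_dst, 0, NULL, NULL);
clFinish(queue);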

OpenCL Data Parallelism
- A kernel is executed by the work-items

// OpenCL kernel function for element-by-element vector addition
__kernel void VectorAdd(__global const float8* a, __global const float8* b,
                        __global float8* c)
{
    // get oct-float index into global data array
    int iGID = get_global_id(0);

    // read inputs into registers
    float8 f8InA = a[iGID];
    float8 f8InB = b[iGID];
    float8 f8Out = (float8)0.0f;

    // add the vector elements
    f8Out.s0 = f8InA.s0 + f8InB.s0;
    f8Out.s1 = f8InA.s1 + f8InB.s1;
    f8Out.s2 = f8InA.s2 + f8InB.s2;
    f8Out.s3 = f8InA.s3 + f8InB.s3;
    f8Out.s4 = f8InA.s4 + f8InB.s4;
    f8Out.s5 = f8InA.s5 + f8InB.s5;
    f8Out.s6 = f8InA.s6 + f8InB.s6;
    f8Out.s7 = f8InA.s7 + f8InB.s7;

    // write back out to GMEM
    c[get_global_id(0)] = f8Out;
}
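Since each work-item of the kernel above adds one float8, the host launches N/8 work-items; a hedged launch sketch (the buffer and kernel names are illustrative):

// N is the number of floats per vector and is assumed to be a multiple of 8
size_t globalSize = N / 8;
clSetKernelArg(vecAddKernel, 0, sizeof(cl_mem), &dA);
clSetKernelArg(vecAddKernel, 1, sizeof(cl_mem), &dB);
clSetKernelArg(vecAddKernel, 2, sizeof(cl_mem), &dC);
clEnqueueNDRangeKernel(queue, vecAddKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);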

OCL Kernel Example

__kernel void DotProduct(__global const float16* a, __global const float16* b,
                         __global float4* c,
                         __local float16* f16InA,   // LOCAL_WORK_SIZE elements, sized by the host
                         __local float16* f16InB,   // LOCAL_WORK_SIZE elements, sized by the host
                         __local float4*  f4Out)    // LOCAL_WORK_SIZE elements, sized by the host
{
    // find position in global 16-float array
    int iGID = get_global_id(0);
    int iLID = get_local_id(0);

    // read 16 floats into LMEM from GMEM for each input array
    f16InA[iLID] = a[iGID];
    f16InB[iLID] = b[iGID];

    // compute 4 dot products into output LMEM
    f4Out[iLID].x = f16InA[iLID].s0 * f16InB[iLID].s0 +
                    f16InA[iLID].s1 * f16InB[iLID].s1 +
                    f16InA[iLID].s2 * f16InB[iLID].s2 +
                    f16InA[iLID].s3 * f16InB[iLID].s3;
    ...
    f4Out[iLID].w = f16InA[iLID].sc * f16InB[iLID].sc +
                    f16InA[iLID].sd * f16InB[iLID].sd +
                    f16InA[iLID].se * f16InB[iLID].se +
                    f16InA[iLID].sf * f16InB[iLID].sf;

    // write out 4 floats to GMEM
    c[iGID] = f4Out[iLID];
}

OCL Application Example

int main(int /*argc*/, char **argv)
{
    cl_context       cxMainContext;       // OpenCL context
    cl_command_queue cqCommandQue;        // OpenCL command queue
    cl_device_id*    cdDevices;           // OpenCL device list
    cl_program       cpProgram;           // OpenCL program
    cl_kernel        ckKernel;            // OpenCL kernel
    cl_mem           cmMemObjs[6];        // OpenCL memory buffer objects, host & device
    size_t           szGlobalWorkSize[1]; // Total # of work-items
    size_t           szLocalWorkSize[1];  // # of work-items in the work-group
    size_t           szParmDataBytes;     // Byte size of context info
    size_t           szKernelLength;      // Byte size of kernel code
    cl_int           ciErr1, ciErr2;      // Error code vars
    int              iTestN = /* … */ * 16;  // Size of vectors to process

    // set global and local work size dimensions
    #define LOCAL_WORK_SIZE 32
    szGlobalWorkSize[0] = iTestN >> 2;    // compute 4 at a time
    szLocalWorkSize[0]  = LOCAL_WORK_SIZE;

    // start log and timers 0 and 1
    ocutSetLogFileName("OpenclSdkDotProductTest.txt");
    ocutWriteLog(LOGBOTH, 0.0, "oclDotProduct.exe Starting,...");
    ocutDeltaT(0);
    ocutDeltaT(1);

OCL Application Example (2)

    // Allocate and initialize host arrays for golden computations
    srcA   = (void *) malloc(sizeof(cl_float4) * iTestN);
    srcB   = (void *) malloc(sizeof(cl_float4) * iTestN);
    dst    = (void *) malloc(sizeof(cl_float)  * iTestN);
    Golden = (void *) malloc(sizeof(cl_float)  * iTestN);
    ocutFillArray((float*)srcA, 4 * iTestN);
    ocutFillArray((float*)srcB, 4 * iTestN);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "Allocate... \n");

    // Create the OpenCL context on a GPU device
    cxMainContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, ...);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateContextFromType\n");

    // Get the list of GPU devices associated with the context
    ciErr1 |= clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, 0, ...);
    cdDevices = (cl_device_id*)malloc(szParmDataBytes);
    ciErr1 |= clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, ...);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clGetContextInfo\n");
    ocutPrintDeviceInfo(cdDevices[0]);

    // Create a command queue
    cqCommandQue = clCreateCommandQueue(cxMainContext, cdDevices[0], 0, &ciErr2);
    ciErr1 |= ciErr2;
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateCommandQueue\n");

OCL Application Example (3)

    // Allocate and initialize OpenCL source and result buffers:
    // pinned memory objects on the host
    cmMemObjs[0] = clCreateBuffer(cxMainContext, ...);  ciErr1 |= ciErr2;
    cmMemObjs[1] = clCreateBuffer(cxMainContext, ...);  ciErr1 |= ciErr2;
    cmMemObjs[2] = clCreateBuffer(cxMainContext, ...);  ciErr1 |= ciErr2;
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateBuffer pinned\n");

    // Allocate the OpenCL source and result buffer memory objects on the device GMEM
    cmMemObjs[3] = clCreateBuffer(cxMainContext, ...);  ciErr1 |= ciErr2;
    cmMemObjs[4] = clCreateBuffer(cxMainContext, ...);  ciErr1 |= ciErr2;
    cmMemObjs[5] = clCreateBuffer(cxMainContext, CL_MEM_WRITE_ONLY, ...);  ciErr1 |= ciErr2;
    if (ciErr1 != CL_SUCCESS) exit (...);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateBuffer GMEM\n");

    // Read the kernel in from file
    const char* cPathAndName = ocutFindFilePath(clSourcefile, argv[0]);
    char* cDotProduct = ocutLoadProgramSource(cPathAndName, "// My comment\n", &szKernelLength);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "ocutLoadProgramSource\n");

OCL Application Example (4)

    // Create the program
    cpProgram = clCreateProgramWithSource(cxMainContext, 1, (const char **)&cDotProduct,
                                          &szKernelLength, &ciErr1);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateProgramWithSource\n");

    // Build the program
    ciErr1 |= clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
    if (ciErr1 != CL_SUCCESS)
    {
        // write out standard error
        ocutWriteLog(LOGBOTH | ERRORMSG, (double)ciErr1, STDERROR);

        // write out the build log
        char cBuildLog[10240];
        clGetProgramBuildInfo(cpProgram, ocutGetFirstDevice(cxMainContext),
                              CL_PROGRAM_BUILD_LOG, sizeof(cBuildLog), cBuildLog, NULL);
        ocutWriteLog(LOGBOTH, 0.0, "\n\nLog:\n%s\n\n\n", cBuildLog);

        // write out the ptx and then exit
        char* cPtx;
        size_t szPtxLength;
        ocutGetProgramBinary(cpProgram, ocutGetFirstDevice(cxMainContext), &cPtx, &szPtxLength);
        ocutWriteLog(LOGBOTH | CLOSELOG, 0.0, "\n\nPtx:\n%s\n\n\n", cPtx);
        exit(-1);
    }
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clBuildProgram\n");

OCL Application Example (5)

    // Create the kernel
    ckKernel = clCreateKernel(cpProgram, "DotProduct", &ciErr1);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clCreateKernel\n");

    // Set the argument values
    ciErr1  = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&cmMemObjs[3]);
    ciErr1 |= clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&cmMemObjs[4]);
    ciErr1 |= clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&cmMemObjs[5]);
    ciErr1 |= clSetKernelArg(ckKernel, 3, (LOCAL_WORK_SIZE * sizeof(cl_float16)), NULL);
    ciErr1 |= clSetKernelArg(ckKernel, 4, (LOCAL_WORK_SIZE * sizeof(cl_float16)), NULL);
    ciErr1 |= clSetKernelArg(ckKernel, 5, (LOCAL_WORK_SIZE * sizeof(cl_float4)),  NULL);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clSetKernelArg\n");

    // Warm up the GPU driver
    ciErr1 |= clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL, szGlobalWorkSize,
                                     szLocalWorkSize, 0, NULL, NULL);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "Warmup GPU Driver\n");

    // Execute kernel iNumIterations times
    for (int i = 0; i < iNumIterations; i++) {
        ciErr1 |= clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL, szGlobalWorkSize,
                                         szLocalWorkSize, 0, NULL, NULL);
    }
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1)/iNumIterations,
                 "clEnqueueNDRangeKernel (compute)\n");

OCL Application Example (6)

    // Read output
    ciErr1 |= clEnqueueReadBuffer(cqCommandQue, cmMemObjs[5], CL_TRUE, 0,
                                  sizeof(cl_float4) * szGlobalWorkSize[0], dst, 0, NULL, NULL);
    if (ciErr1 != CL_SUCCESS)
        exit(ocutWriteLog(LOGBOTH | ERRORMSG | CLOSELOG, (double)ciErr1, STDERROR));
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "clEnqueueReadBuffer\n");

    // Release kernel, program, and memory objects
    free(cdDevices);
    free(cDotProduct);
    clReleaseKernel(ckKernel);
    clReleaseProgram(cpProgram);
    clReleaseCommandQueue(cqCommandQue);
    clReleaseContext(cxMainContext);
    ocutDeleteMemObjs(cmMemObjs, 6);
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1), "Release OpenCL objects\n");
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(0), "Total Program Time\n\n");

    // Compute results for golden-host (execute iNumIterations times)
    for (int i = 0; i < iNumIterations; i++) {
        DotProductHost((const float*)srcA, (const float*)srcB, (float*)Golden, iTestN);
    }
    ocutWriteLog(LOGBOTH | DELTAT, ocutDeltaT(1)/(float)iNumIterations, "Host Processing\n");

    // Compare results (golden-host vs. device) and report errors and pass/fail
    ocutDiffArray((const float*)dst, (const float*)Golden, iTestN);

How Portable Is an OpenCL Application?
- OpenCL provides a standard syntax to program GPUs (and CPUs)
- Stream programming is a mix of parallel programming and hardware resource management
  - GPUs are not time-shared devices (unlike CPUs)
  - The memory hierarchy is exposed
- When moving to a different architecture, hardware resource constraints may make OpenCL programs
  - inefficient in some cases,
  - incorrect in others
- Optimized OpenCL codes tend to map very tightly to a given set of resources

OpenCL Tuning
- Hardware resources are implicitly part of the OpenCL parallel programming model via the work-group threads
  - Work-groups are associated with stream multiprocessors
  - Threads in a work-group share a fixed amount of hardware resources
  - The number of threads is usually linked to the problem size
  - Synchronization is within a work-group
- Per work-group (i.e. per stream multiprocessor) limits:
  - Maximum number of threads
  - Maximum number of registers
  - Maximum amount of local memory
  - Maximum code size
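These per-work-group limits can be queried at run time before choosing a work-group size; an illustrative sketch (not from the deck), assuming a valid device and kernel:

size_t   max_wg, kernel_wg;
cl_ulong local_mem_bytes;

// Device-wide limits
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem_bytes), &local_mem_bytes, NULL);

// Limit for this particular kernel, which accounts for its register and local-memory use
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_wg), &kernel_wg, NULL);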

Shared Memory and Blocks of Threads
- Local memory use depends on the number of work-items in a group
- Shared memory overflow leads to incorrect programs

Threads and Registers
- The registers of a work-group are distributed among its threads
- Register use per thread depends on
  - Kernel code
  - GPU compiler
  - Optimizations (via pragmas etc.)
- There is no way to predict register use
- Spill code is very inefficient on a GPU; too much spilling leads to non-executable code (code size is limited)

Can the OpenCL Compiler Help?
- Very unlikely
- Work-group sizes are declared in host data structures:

    size_t szGlobalWorkSize[1];  // Total # of work-items
    size_t szLocalWorkSize[1];   // # of work-items in the work-group

- The kernel code itself is plain text and is thread-ID dependent; the JIT compiler cannot change the work-group configuration:

    char* cDotProduct = ocutLoadProgramSource(cPathAndName, "// My comment\n", &szKernelLength);

OpenCL Portability and Performance
- OCL kernels have to be tuned for each device
  - Express data parallelism
  - Take device-specific information into account
    - Number of work-items in a work-group
    - Local memory
  - This has a dramatic impact on performance
  - A kernel may even underperform
- OpenCL has to support a large variety of devices
  - Abstract away the specifics of the hardware
- Is OpenCL a good candidate for high level compilers?

HMPP: A Directive-Based Compiler

HMPP Objectives
To give developers a high level abstraction for manycore programming:
- Incrementally program GPU-accelerated applications
- Rapidly build hybrid applications
- Keep accelerated kernels hardware independent
- Ensure application portability and interoperability

Overview
- C and Fortran GPU programming directives
  - Define and execute GPU-accelerated versions of code
  - Optimize CPU-GPU data movements
  - Complementary to OpenMP and MPI
- A source-to-source hybrid compiler
  - Generate powerful accelerated kernels
  - Works with standard compilers and target tools
  - Tuning directives to optimize accelerated kernels
- A runtime library
  - Dispatch computations on available GPUs
  - Scale to multi-GPU systems

History and Status
- HMPP 1.x: high level abstraction of GPU programming
  - Focuses on efficiently offloading computations to remote accelerators
  - Introduction of the CUDA back-end generator
- HMPP 2.x: a set of directives that fully exploit GPU capabilities
  - Programming directives
    - Groups of codelets and data mapping
    - Partial transfers, resident data
    - Regions of code
  - Tuning directives
    - Exploit device memories
    - Complex, reduction
    - Parallelizing directives
  - New CAL/IL and OpenCL back-ends to be released
  - Windows version
Timeline: HMPP 1.0 launched Nov 2007; HMPP CUDA generator Nov 2008; HMPP 2.0 June 2009; CAL/IL, OpenCL and Windows support Nov 2009.

HMPP Targeting OpenCL
- OpenCL integrates naturally into the HMPP workflow
[Workflow diagram: the HMPP preprocessor and OpenCL generator extract the OCL kernels from the HMPP-annotated application; the main application is built by a standard CPU compiler and the kernels by the OpenCL GPU compiler, producing the application binary and the kernel binaries.]

HMPP Directive Programming

Accelerate Codelet Function
- Declare GPU-accelerated versions of a function

// Declare the OCL codelet
#pragma hmpp sgemm codelet, target=OCL, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv)
{
    /*... */
    for( j = 0 ; j < 2 ; j++ ) {
        // Synchronous codelet call
        #pragma hmpp sgemm callsite
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
    }
    /*... */
}

Allocate and Release
- Avoid initializing and allocating at each codelet call

int main(int argc, char **argv)
{
    /*... */
    // Allocate and initialize the device early
    #pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
    /*... */
    for( j = 0 ; j < 2 ; j++ ) {
        #pragma hmpp sgemm callsite
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
        /*... */
    }
    /*... */
    // Release the device
    #pragma hmpp sgemm release
}

Optimize Data Movements
- Preload data before the codelet call

int main(int argc, char **argv)
{
    /*... */
    #pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
    /*... */
    // Preload data
    #pragma hmpp sgemm advancedload, args[vin1;m;n;k;alpha;beta]
    /*... */
    for( j = 0 ; j < 2 ; j++ ) {
        // Avoid reloading data
        #pragma hmpp sgemm callsite &
        #pragma hmpp sgemm args[m;n;k;alpha;beta;vin1].advancedload=true
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
        /*... */
    }
    /*... */
    #pragma hmpp sgemm release
}

Compute Asynchronously
- Perform CPU/GPU computations asynchronously

int main(int argc, char **argv)
{
    /*... */
    #pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
    /*... */
    for( j = 0 ; j < 2 ; j++ ) {
        // Execute asynchronously
        #pragma hmpp sgemm callsite, asynchronous
        sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
        /*... */
    }
    /*... */
    #pragma hmpp sgemm synchronize
    // Download the result when needed
    #pragma hmpp sgemm delegatedstore, args[vout]
    #pragma hmpp sgemm release
}

Grouping Codelets
- More flexibility and performance
- To optimize CPU-GPU data movement
- Codelets can share variables
  - Keep data on the GPU between two codelets
  - Avoid useless data transfers
  - Map arguments of different functions to the same GPU memory location (similar to a Fortran EQUIVALENCE declaration)

Data Mapping Example
- Share data between codelets of the same group
- The mapped data share the same memory space on the device

#pragma hmpp group, target=OCL
#pragma hmpp map, args[f1::inm; f2::inm]

#pragma hmpp f1 codelet, args[outv].io=inout
static void matvec1(int sn, int sm, float inv[sn],
                    float inm[sn][sm], float outv[sm])
{
    ...
}

#pragma hmpp f2 codelet, args[v2].io=inout
static void otherfunc2(int sn, int sm, float v2[sn],
                       float inm[sn][sm])
{
    ...
}

Overlapping Kernel Execution with Data Transfers
- Codelet 1 execution overlaps codelet 2 data transfers

#pragma hmpp c1 callsite, asynchronous
sgemm(m, n/2, k, alpha, vin1, vin2, beta, vout );

#pragma hmpp c2 advancedload, args[vin1;vout], asynchronous

#pragma hmpp c1 synchronize
#pragma hmpp c1 delegatedstore, args[vout]

#pragma hmpp c2 callsite
sgemm(m, n/2, k, alpha, vin1, &vin2[n/2*k], beta, &vout[n/2*m]);
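In plain OpenCL, the same overlap is typically expressed with two command queues (or one out-of-order queue) and events; a hedged sketch, not from the deck, with all names illustrative:

// Queue q0 computes the first half while queue q1 uploads data for the second half;
// whether the transfer and the kernel actually overlap depends on the device and driver.
cl_event upload_done;
clEnqueueNDRangeKernel(q0, sgemmKernel, 1, NULL, &halfSize, &localSize, 0, NULL, NULL);
clEnqueueWriteBuffer(q1, dVin2_hi, CL_FALSE, 0, halfBytes, hVin2_hi, 0, NULL, &upload_done);

// The second half waits only on its own transfer, not on q0
clSetKernelArg(sgemmKernel, 1, sizeof(cl_mem), &dVin2_hi);
clEnqueueNDRangeKernel(q1, sgemmKernel, 1, NULL, &halfSize, &localSize, 1, &upload_done, NULL);

clFinish(q0);
clFinish(q1);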

HMPP Directive Tuning

Tuning Directives
- To add code properties
  - Force loop parallelization
  - Indicate parameter aliasing
- To apply code transformations
  - Loop unroll-and-jam, blocking, tiling, permute, …
- To control the mapping of computations
  - Gridification
  - Place variables in shared or constant memory
  - Thread synchronization barriers

Tuning Directive Example
- 1D gridification using 64 threads, plus loop transformations

#pragma hmpp dgemm codelet, target=CUDA:CAL, args[C].io=inout
void dgemm( int n, double alpha, const double *A, const double *B,
            double beta, double *C )
{
    int i;
    #pragma hmppcg(CUDA) grid blocksize "64x1"        // 1D gridification, 64 threads
    #pragma hmppcg(CUDA) permute j,i                  // loop transformations
    #pragma hmppcg(CUDA) unroll(8), jam, split, noremainder
    #pragma hmppcg parallel
    for( i = 0 ; i < n; i++ ) {
        int j;
        #pragma hmppcg(CUDA) unroll(4), jam(i), noremainder
        #pragma hmppcg parallel
        for( j = 0 ; j < n; j++ ) {
            int k;
            double prod = 0.0f;
            for( k = 0 ; k < n; k++ ) {
                prod += VA(k,i) * VB(j,k);
            }
            VC(j,i) = alpha * prod + beta * VC(j,i);
        }
    }
}

Use of Shared Memory – Native Computation
- Matrix multiply: 1 thread per 1 computed element
[Diagram: matrices A, B and C with loop indices i, j, k]

for( i = 0 ; i < n; i++ ) {
    for( j = 0 ; j < n; j++ ) {
        float prod = 0.0f;
        for( k = 0 ; k < n; k++ ) {
            prod += A(k,j) * B(i,k);
        }
        C(i,j) = prod + C(i,j);
    }
}

Use of Shared Memory – Step 1
- 1D gridification: use 64 threads per block
- Unroll the inner loop by 16 to improve the reuse of A(k,j)
- 16 computations per thread, 64 threads per block

#pragma hmppcg(CUDA) grid blocksize 64x1
for( i=0 ; i<n; i+=UNROLL16 ) {
    for( j=0 ; j<n; j++ ) {
        for( k = 0 ; k < n; k++ ) {
            prod[0]  += A(k,j) * B(i,k);
            prod[1]  += A(k,j) * B(i+1,k);
            ...
            prod[15] += A(k,j) * B(i+15,k);
        }
        C(i,j)    = prod[0]  + C(i,j);
        C(i+1,j)  = prod[1]  + C(i+1,j);
        ...
        C(i+15,j) = prod[15] + C(i+15,j);
    }
}

Use of Shared Memory – Step 1
- Use HMPP tuning directives to unroll and gridify
- 16 computations per thread, 64 threads per block

#pragma hmppcg(CUDA) grid blocksize 64x1
for( i=0 ; i<n; i+=UNROLL16 ) {
    for( j=0 ; j<n; j++ ) {
        for( k = 0 ; k < n; k++ ) {
            #pragma hmppcg fullunroll
            for (di=0;di<UNROLL16;di++) {
                prod[di] += A(k,j) * B(i+di,k);
            }
        }
        #pragma hmppcg fullunroll
        for (di=0;di<UNROLL16;di++) {
            C(i+di,j) = prod[di] + C(i+di,j);
        }
    }
}

Use of Shared Memory – Step 1 Result
- 64 threads, each computing 16 elements
- A and C accesses are properly coalesced
- B memory accesses are badly coalesced

Use of Shared Memory – Step 2
- Create blocks: each thread computes blocks instead of columns
- More efficient memory loading

Use of Shared Memory – Step 2 (code)
- Apply blocking and unrolling again, on the k-loop

#pragma hmppcg(CUDA) grid blocksize 64x1
for( i=0 ; i<n; i+=UNROLL16 ) {
    for( j=0 ; j<n; j++ ) {
        float prod[UNROLL16];               // local sums (in registers)
        for( k=0 ; k<n; k+=UNROLL2 ) {
            #pragma hmppcg fullunroll
            for(dk=0;dk<UNROLL2;dk++) {
                #pragma hmppcg fullunroll
                for (di=0;di<UNROLL16;di++) {
                    prod[di] += A(k+dk,j) * B(i+di,k+dk);
                }
            }
        }
        #pragma hmppcg fullunroll
        for (di=0;di<UNROLL16;di++) {
            C(i+di,j) = prod[di] + C(i+di,j);
        }
    }
}

Use of Shared Memory – Step 3
- Create a shared variable and load B blocks line by line: efficient use of shared memory
- 64 threads (4 half-warps) load a 16x16 block
  - 16*16/64 = 4 loads per thread
  - Each half-warp loads 4 well coalesced rows
- The HMPP intrinsic RankInBlock(j) provides the thread index (0..63) in the j-gridification

Use of Shared Memory – Step 3 Result
- K-loop kernel

float BUF[16][16];
...
#pragma grid shared BUF
...
for( k=0 ; k<n; k+=16 ) {
    #pragma grid barrier
    ...  // BUF[0:15][0:15] = transpose(B(0:15,0:15))
    #pragma grid barrier
    #pragma hmppcg fullunroll
    for(dk=0;dk<16;dk++) {
        #pragma hmppcg fullunroll
        for (di=0;di<16;di++) {
            prod[di] += A(k+dk,j) * BUF[di][dk];
        }
    }
}

HMPP Performance Figures

HMPP CUDA – SGEMM & DGEMM
[Performance charts not reproduced in the transcript.]

HMPP CUDA – CGEMM
[Performance chart not reproduced in the transcript.]

Conclusion

Conclusion
- OpenCL
  - Very similar to CUDA, but a low level programming interface
  - Portable across various devices
  - Not simple, but much simpler than OpenGL graphics programming
  - Can be seen as the x86 assembly of manycore programming
  - Targets expert developers
  - Suitable as a target for higher level programming languages and tools
- HMPP
  - OpenMP-like directives for programming and tuning GPU-accelerated applications
  - Offers incremental levels of programming, from minimal to advanced and expert
  - A source-to-source C and Fortran compiler targeting OpenCL

Innovative Software for Manycore Paradigms