EPGPU: Expressive Programming for GPGPU. Dr. Orion Sky Lawlor, U. Alaska Fairbanks, 2011-05-31.

Presentation transcript:

Dr. Lawlor, U. Alaska: EPGPU 1 EPGPU: Expressive Programming for GPGPU Dr. Orion Sky Lawlor U. Alaska Fairbanks

Dr. Lawlor, U. Alaska: EPGPU 2 Obligatory Introductory Quote “He who controls the past, controls the future.” George Orwell, 1984

Dr. Lawlor, U. Alaska: EPGPU 3 In Parallel Programming... “He who controls the past, controls the future.” George Orwell, 1984 “He who controls the writes, controls performance.” Orion Lawlor, 2011

Dr. Lawlor, U. Alaska: EPGPU 4 Talk Outline Embedding OpenCL inside C++ FILL kernels and parallelism Who controls the writes? Application Performance Conclusions

EPGPU: Embedding OpenCL in C++

Dr. Lawlor, U. Alaska: EPGPU 6 Why Bother?
Parallel hardware is here
 – Needs good parallel software
Why OpenCL?
 – Just as fast as CUDA on the GPU
 – Same *binary* works on ATI, NVIDIA, x86 SMP, cellphone, ...
Why C++?
 – Similar syntax to OpenCL
 – Macros, templates, operators, ...

Dr. Lawlor, U. Alaska: EPGPU 7 Motivation for Expressive OpenCL

/** "simple" OpenCL example program
    Adapted from the SHOC OpenCL FFT caller code.
    Dr. Orion Sky Lawlor, (Public Domain) */
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include "CL/cl.h"

#define CL_CHECK_ERROR(err) do{ if (err) { printf("FATAL ERROR %d at " __FILE__ ":%d\n",err,__LINE__); exit(1); } } while(0)

cl_device_id theDev;
cl_context theCtx;
cl_command_queue theQueue;
cl_kernel theKrnl;
cl_program theProg;

static const char *theSource="/* Lots more code here! */\n"
"__kernel void writeArr(__global float *arr,float v) {\n"
"  int i=get_global_id(0);\n"
"  arr[i]+=v;\n"
"}\n";

int main() {
  cl_int err;
  // Set up OpenCL: get the platform
  enum {MAX_PLAT=8, MAX_DEVS=8};
  cl_platform_id platforms[MAX_PLAT];
  cl_uint num_platforms=MAX_PLAT;
  err= clGetPlatformIDs(MAX_PLAT,platforms,&num_platforms); CL_CHECK_ERROR(err);
  cl_platform_id cpPlatform=platforms[0];
  // Get the devices
  cl_device_id cdDevices[MAX_DEVS];
  err=clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, MAX_DEVS, cdDevices, NULL);
  theDev=cdDevices[0]; CL_CHECK_ERROR(err);
  // Now get the context
  theCtx = clCreateContext(NULL, 1, &theDev, NULL, NULL, &err); CL_CHECK_ERROR(err);
  // Get a queue
  theQueue = clCreateCommandQueue(theCtx, theDev, CL_QUEUE_PROFILING_ENABLE, &err); CL_CHECK_ERROR(err);

  // Create the program...
  theProg = clCreateProgramWithSource(theCtx, 1, &theSource, NULL, &err); CL_CHECK_ERROR(err);
  // ...and build it
  const char *args = " -cl-mad-enable -cl-fast-relaxed-math ";
  err = clBuildProgram(theProg, 0, NULL, args, NULL, NULL);
  if (err != CL_SUCCESS) {...}

  // Set up input memory
  int n=64; int bytes=n*sizeof(float);
  cl_mem devP = clCreateBuffer(theCtx, CL_MEM_READ_WRITE, bytes, NULL, &err); CL_CHECK_ERROR(err);
  float f=1.2345;
  err = clEnqueueWriteBuffer(theQueue, devP, CL_TRUE, 0, sizeof(float), &f, 0, NULL, NULL); CL_CHECK_ERROR(err);

  // Create kernel
  theKrnl = clCreateKernel(theProg, "writeArr", &err); CL_CHECK_ERROR(err);
  // Call the kernel
  err=clSetKernelArg(theKrnl, 0, sizeof(cl_mem), &devP); CL_CHECK_ERROR(err);
  float addThis=1000;
  err=clSetKernelArg(theKrnl, 1, sizeof(float), &addThis); CL_CHECK_ERROR(err);
  size_t localsz = 32;
  size_t globalsz = n;
  err = clEnqueueNDRangeKernel(theQueue, theKrnl, 1, NULL, &globalsz, &localsz, 0, NULL, NULL); CL_CHECK_ERROR(err);

  // Read back the results
  for (int i=0;i<n;i+=4) {
    err = clEnqueueReadBuffer(theQueue, devP, CL_TRUE, i*sizeof(float), sizeof(float), &f, 0, NULL, NULL); CL_CHECK_ERROR(err);
    std::cout<<"arr["<<i<<"]="<<f<<"\n";
  }

  // Cleanup
  clReleaseMemObject(devP);
  clReleaseKernel(theKrnl);
  clReleaseProgram(theProg);
  clReleaseCommandQueue(theQueue);
  clReleaseContext(theCtx);
  return 0;
}

Dr. Lawlor, U. Alaska: EPGPU 8 Motivation for Expressive OpenCL
[Same code listing as slide 7; callout highlights the kernel source: "GPU code: quoted string"]

Dr. Lawlor, U. Alaska: EPGPU 9 Motivation: Inline Kernels
OpenCL code goes in as a string
 – More futureproof than PTX
 – Allows metaprogramming
Hardcoded strings are tedious
 – Must quote every line
Strings from files: I/O paths, CPU/GPU disconnect

static const char *theSource="/* Simple OpenCL: */\n"
"__kernel void writeArr(__global float *arr,float v) {\n"
"  int i=get_global_id(0);\n"
"  arr[i]+=v;\n"
"}\n";

Dr. Lawlor, U. Alaska: EPGPU 10 Solution: Stringify with Macro
Use the preprocessor to make the string
 – Feels a bit like C#/Java/C++0x
Supports multi-line expressions
 – But bare commas need varargs (see the sketch below)
Allows OpenCL and C++ to be intermixed naturally

#define QUOTE_OPENCL(code) #code
static const char *theSource=QUOTE_OPENCL(
  __kernel void writeArr(__global float *arr,float v) {
    int i=get_global_id(0);
    arr[i]+=v;
  }
);
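The "bare commas" caveat matters because the preprocessor splits macro arguments at top-level commas, as in (float2)(a,b). A minimal sketch of the variadic fix, using standard C99/C++0x variadic macros; this mirrors the idea, not necessarily EPGPU's exact macro:

#define QUOTE_OPENCL(...) #__VA_ARGS__ /* stringify everything, commas included */
static const char *theSource=QUOTE_OPENCL(
  __kernel void scale2(__global float2 *arr,float a,float b) {
    int i=get_global_id(0);
    arr[i]*=(float2)(a,b); /* this comma would break a one-parameter macro */
  }
);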

Dr. Lawlor, U. Alaska: EPGPU 11 Motivation for Expressive OpenCL
[Same code listing as slide 7; callout highlights the clEnqueueNDRangeKernel call: "Workgroup Size: many constraints, performance critical"]

Dr. Lawlor, U. Alaska: EPGPU 12 Workgroup Size Determination
The workgroup size MUST be no larger than CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE
 – Yet still be big enough (performance)
 – And evenly divide the global size (else a runtime error)
In theory: a constrained autotuner
In practice: a hardcoded constant (the limits can be queried, as sketched below)

size_t localsz = 256;
size_t globalsz = n;
err = clEnqueueNDRangeKernel(theQueue, theKrnl, 1, NULL,
        &globalsz, &localsz, 0, NULL, NULL);
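For reference, a hedged sketch of how those two limits can be queried with standard OpenCL 1.x calls; the variable names here are illustrative, not from the talk:

size_t kernelMax=0, deviceMax=0;
clGetKernelWorkGroupInfo(theKrnl, theDev, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernelMax), &kernelMax, NULL);
clGetDeviceInfo(theDev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(deviceMax), &deviceMax, NULL);
size_t localsz = (kernelMax < deviceMax) ? kernelMax : deviceMax;
if (localsz > 256) localsz = 256; /* cap at the hardcoded constant above */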

Dr. Lawlor, U. Alaska: EPGPU 13 Workgroup Size vs Time
[Chart: kernel time vs workgroup size. Lower is better; the floor is set by 1/memory bandwidth.]

Dr. Lawlor, U. Alaska: EPGPU 14 Workgroup Size vs Time
[Annotated chart: too small = workgroup creation overhead; too big = not enough parallelism; the middle is "Just Right!"]

Dr. Lawlor, U. Alaska: EPGPU 15 Solution: Remove Constraints
Generate OpenCL code "if (i<n)..."
 – Adds one highly coherent branch
 – Automatically added to your kernel
Round up the global size to a multiple of an efficient workgroup size (sketched below)
 – Obey hardware constraints
 – Correct answer for any global size, even large prime numbers
 – Still get good performance
Do the hard work once, in a library
 – Don't need to re-solve the problem
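A minimal sketch of that rounding strategy, assuming a helper named roundUpTo (my name, not EPGPU's API); the generated "if (i<n)" guard masks off the surplus threads:

size_t roundUpTo(size_t n, size_t group) {
  return ((n + group - 1) / group) * group; /* next multiple of group */
}
size_t localsz  = 256;                      /* efficient, hardware-legal size */
size_t globalsz = roundUpTo(n, localsz);    /* correct even when n is prime */
err = clEnqueueNDRangeKernel(theQueue, theKrnl, 1, NULL,
                             &globalsz, &localsz, 0, NULL, NULL);
/* inside the kernel, the auto-generated guard: if (i<n) { ...user code... } */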

Dr. Lawlor, U. Alaska: EPGPU 16 Motivation for Expressive OpenCL
[Same code listing as slide 7; callout highlights the clSetKernelArg calls: "Kernel Arguments: too much code, not typesafe, silent failure modes"]

Dr. Lawlor, U. Alaska: EPGPU 17 OpenCL Kernel Argument Passing
One function call per argument
 – Discourages use of the GPU
All parameters must match, or a runtime error
 – Pass too many? Runtime error.
 – Pass double to cl_mem? Crash!
 – Pass int to float? Wrong answer!
C'mon! Do it at compile time!

// OpenCL kernel arguments: (__global float *ptr,int value)
cl_mem devPtr=...;
err=clSetKernelArg(theKrnl, 0, sizeof(cl_mem), &devPtr);
int val=1000;
err=clSetKernelArg(theKrnl, 1, sizeof(int), &val);

Dr. Lawlor, U. Alaska: EPGPU 18 Solution: Template Arguments
Specialized templated function object (a fuller sketch follows below)
 – "weird C++ magic" (cf Boost, Thrust)
Instantiate the template from our kernel macro
 – So the same arguments appear in OpenCL & C++
Compile-time argument promotion and typecheck, zero runtime cost (inline)

// OpenCL kernel arguments: (__global float *ptr,int value)
// C++ template instance: gpu_kernel<void (cl_mem,int)>
template <class T0,class T1>
class gpu_kernel<void (T0,T1)> { public: ...
  void operator()(T0 A0,T1 A1) {
    checkErr(clSetKernelArg(k, 0, sizeof(T0), &A0));
    checkErr(clSetKernelArg(k, 1, sizeof(T1), &A1));
  }
};
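A self-contained sketch of the mechanism: this illustrates the partial-specialization trick the slide names, not EPGPU's verbatim class layout.

#include <CL/cl.h>

// The primary template is only declared; each argument count gets a
// partial specialization, so arity is checked at compile time.
template <class FnT> class gpu_kernel;

template <class T0, class T1>
class gpu_kernel<void (T0, T1)> {
  cl_kernel k;
public:
  explicit gpu_kernel(cl_kernel k_) : k(k_) {}
  void operator()(T0 A0, T1 A1) {
    // Passing an int where T1=float converts at compile time, and
    // sizeof(T0)/sizeof(T1) now match the declared argument types.
    clSetKernelArg(k, 0, sizeof(T0), &A0);
    clSetKernelArg(k, 1, sizeof(T1), &A1);
  }
};

// Usage: gpu_kernel<void (cl_mem,int)> writeArr(theKrnl); writeArr(devPtr,1000);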

FILL Kernels and Expressive Programming

Dr. Lawlor, U. Alaska: EPGPU 20 Problem: Shared Memory Access
Multithreaded code is hard
It's easy to have multiple threads overwrite each others' results
 – Array indexing malfunctions
 – 2D or 3D arrays: row, column?
 – Glitchy, varies with HW/SW/config
"Where does this data go?" is useless cognitive burden
 – Again, solve it *once*

Dr. Lawlor, U. Alaska: EPGPU 21 Solution: FILL kernel
result ≡ your array value, at your index
 – Read and written by EPGPU automatically
 – Essentially a new language keyword
 – Automatically does the array indexing
 – Surprisingly easier for the user
Lots of new potential for the library
 – Higher-locality array indexing: write in 2D Morton order, via bit interleave (sketched below)
 – Sensitivity analysis (write the Jacobian of the results)
 – Fault tolerance (replicate computes)

GPU_FILLKERNEL(float, addf, (float v),
  { result+=v; }
)
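For reference, a small illustration of Morton-order indexing by bit interleave; this sketch is mine, not EPGPU's implementation:

// Interleave the low 16 bits of x and y: x bits land in even bit
// positions, y bits in odd, giving a locality-friendly 1D index.
unsigned morton2d(unsigned x, unsigned y) {
  unsigned z = 0;
  for (int b = 0; b < 16; b++) {
    z |= ((x >> b) & 1u) << (2*b);
    z |= ((y >> b) & 1u) << (2*b + 1);
  }
  return z;
}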

Dr. Lawlor, U. Alaska: EPGPU 22 Example: FILL kernel

GPU_FILLKERNEL(float, addf, (float v),
  { result+=v; }
)

Reading the pieces: "float" is the input and return type; "addf" is the kernel name (in OpenCL and C++); "(float v)" is the argument list; "{ result+=v; }" is the user code.

// Call from C++ using operator= (one possible mechanism is sketched below)
myArray=addf(2.34);
// Plain C++ CPU-side equivalent:
for (int i=0;i<len;i++) addf(&myArray[i],2.34);
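One plausible way operator= can trigger the kernel launch: addf(2.34) returns a small bound-call object, and assignment runs it over the target array. This is a guess at the mechanism, with invented names (fill_call, run), not EPGPU's verbatim source:

#include <CL/cl.h>

// Bound-call object: holds the kernel; user arguments (v, etc.) were
// already bound at indices 2+ when the call object was built.
struct fill_call {
  cl_kernel k;
  void run(cl_command_queue q, cl_mem buf, int len) const {
    // The library prepends the array's length and buffer, then enqueues.
    clSetKernelArg(k, 0, sizeof(int), &len);
    clSetKernelArg(k, 1, sizeof(cl_mem), &buf);
    size_t global = len; // rounded up and guarded in the real library
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
  }
};

template <class T>
class gpu_array {
  cl_command_queue q; cl_mem buf; int len;
public:
  gpu_array &operator=(const fill_call &c) { c.run(q, buf, len); return *this; }
};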

Dr. Lawlor, U. Alaska: EPGPU 23 Example: FILL kernel (Generated)

GPU_FILLKERNEL(float, addf, (float v),
  { result+=v; }
)

// Generated OpenCL:
__kernel void addf(int length,__global float *array, float v) {
  int i=get_global_id(0);
  if (i<length) { // bounds check (local vs global size)
    const int result_index=i;
    float result=array[result_index];
    { result+=v; } // user code; extra arguments prepended by the library
    array[result_index]=result;
  }
}
// A 2D variant does the same with 2D indexing (sketched below).
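A plausible shape for that 2D variant; this is a sketch of what the generator might emit, and the actual generated source may differ:

__kernel void addf_2d(int w,int h,__global float *array, float v) {
  int i=get_global_id(0), j=get_global_id(1);
  if (i<w && j<h) {                 /* 2D bounds check */
    const int result_index=i+w*j;   /* row-major 2D-to-1D indexing */
    float result=array[result_index];
    { result+=v; }                  /* user code */
    array[result_index]=result;
  }
}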

Dr. Lawlor, U. Alaska: EPGPU 24 Related Work
Thrust: like the STL for the GPU
 – But CUDA is NVIDIA-only
Intel ArBB: SIMD from a kernel
 – Based on RapidMind
 – When will we see GPU support?
Many other parallel languages
My "GPGPU" library
 – Based on GLSL: nice, but limited!

Performance Examples

Dr. Lawlor, U. Alaska: EPGPU 26 Example: EPGPU Hello World

#include "epgpu.h"
#include <iostream>

/* OpenCL code: return value = array index plus a constant */
GPU_FILLKERNEL(float, do_work, (float k),
  { result = i+k; }
)

/* C++ code: allocate, run, and print */
int main() {
  int n=1000;
  gpu_array<float> arr(n); /* make storage on GPU */
  arr=do_work(1000.0); /* run code on GPU; constant value illustrative */
  for (int i=0;i<n;i+=100) { /* read the result */
    std::cout<<"arr["<<i<<"] = "<<arr[i]<<"\n";
  }
}

Dr. Lawlor, U. Alaska: EPGPU 27 Example: EPGPU Mandelbrot

#include "epgpu.h"
#include <fstream>

/* OpenCL code */
GPU_FILLKERNEL_2D(unsigned char, mandelbrot, (float sz,float xoff,float yoff),
  {
    /* Create complex numbers c and z */
    float2 c=(float2)(i*sz+xoff,(h-1-j)*sz+yoff);
    float2 z=c;
    /* Run the mandelbrot iteration */
    int count;
    enum { max_count=250 };
    for (count=0;count<max_count;count++) {
      if ((z.x*z.x+z.y*z.y)>4.0f) break;
      z=(float2)( z.x*z.x-z.y*z.y + c.x, 2.0f*z.x*z.y + c.y );
    }
    /* Return the output pixel color */
    result=count;
  }
)

/* C++ main function */
int main() {
  int w=1024, h=1024;
  gpu_array2d<unsigned char> img(w,h);
  img=mandelbrot(0.0001,0.317,0.414); /* sz value illustrative */
  // Write rendered data to output file
  std::ofstream file("mandel.ppm");
  file<<"P5\n"<<w<<" "<<h<<"\n255\n";
  unsigned char *ca=new unsigned char[w*h];
  img.read(ca);
  file.write((char *)ca,w*h);
}

Dr. Lawlor, U. Alaska: EPGPU 28 Example: EPGPU Stencil

... headers, initial conditions ...

/* Do one neighborhood averaging pass over the src array. */
GPU_FILLKERNEL_2D(float, stencil_sweep, (__global<float> src),
  {
    int n=i+w*j; // 2D to 1D indexing
    if (i>0 && i<w-1 && j>0 && j<h-1) { // Interior
      result = (src[n-1]+src[n+1]+src[n-w]+src[n+w])*0.25;
    } else { // Boundary--copy old value
      result = src[n];
    }
  }
)

int main() {
  int w=1024, h=1024;
  gpu_array2d<float> stencil_src(w,h);
  stencil_src=stencil_initial(0.01,6.0,2.4,3.0); // an EPGPU FILL kernel (not shown)
  for (int time=0;time<1000;time++) {
    gpu_array2d<float> stencil_dst(w,h); // cheap, due to buffer reuse inside the library
    stencil_dst=stencil_sweep(stencil_src); // EPGPU kernel above
    std::swap(stencil_dst,stencil_src); // ping-pongs the buffers
  }
  ... do something with the resulting image ...
}

Dr. Lawlor, U. Alaska: EPGPU 29 Example Performance
EPGPU seems to be performance-competitive with hand-coded OpenCL & CUDA
Most GPU applications are memory-bound (gigabytes, not gigaflops)
Fermi cards (460M, 580) are much more lenient about memory access patterns

The Future

Dr. Lawlor, U. Alaska: EPGPU 31 GPU/CPU Convergence
GPU, per socket:
 – SIMD: 32-way ("warps")
 – SMT: many-way (register limited)
 – SMP: 4-36 way ("SMs")
CPUs will get there, soon!
 – SIMD: 8-way AVX (or 64-way SWAR)
 – SMT: 2-way Intel; 4-way IBM
 – SMP: 6-8 way/socket already
 – Intel has shown 48-way many-core chips
Biggest difference: the CPU has branch prediction, cache, and out-of-order execution

Dr. Lawlor, U. Alaska: EPGPU 32 The Future: Memory Bandwidth
Today: 1 TF/s of compute, but only 0.1 TB/s of memory bandwidth
Don't communicate, recompute
 – Multistep stencil methods
 – FILL lets the compiler reorder writes
64-bit -> 32-bit -> 16-bit -> 8?
 – Spend flops scaling the data
 – Split solution + residual storage
 – Most flops use fewer bits, in the residual
 – Fight roundoff with stochastic rounding (see the sketch below): add noise to improve precision
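A tiny illustration of stochastic rounding, assuming a uniform random source; this sketch is mine, not from the talk:

#include <cmath>
#include <cstdlib>

// Round x to a multiple of `step`, rounding up with probability equal
// to the fractional distance, so the quantization error is zero-mean.
float round_stochastic(float x, float step) {
  float lo   = step * std::floor(x / step);          // nearest multiple below x
  float frac = (x - lo) / step;                      // in [0,1)
  float r    = (float)std::rand() / (float)RAND_MAX; // uniform [0,1]
  return (r < frac) ? lo + step : lo;
}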

Dr. Lawlor, U. Alaska: EPGPU 33 Conclusions
C++ is dead. Long live C++!
CPU and GPU are on a collision course
 – SIMD+SMT+SMP+network
 – Software is the bottleneck
 – Exciting time to build software!
EPGPU model
 – Mix C++ and OpenCL easily
 – Simplify the programmer's life
 – Add flexibility for the runtime system
 – BUT must scale to real applications!

Backup Slides

Dr. Lawlor, U. Alaska: EPGPU 35 GLSL vs CUDA
[Venn diagram: the set of GLSL Programs]

Dr. Lawlor, U. Alaska: EPGPU 36 GLSL vs CUDA
[Venn diagram: GLSL Programs (mipmaps; texture writes) overlapping CUDA Programs (arbitrary writes)]

Dr. Lawlor, U. Alaska: EPGPU 37 GLSL vs CUDA
[Venn diagram adds the set of Correct Programs]

Dr. Lawlor, U. Alaska: EPGPU 38 GLSL vs CUDA
[Venn diagram adds the set of High Performance Programs]