Presentation transcript:

Illinois UPCRC Summer School 2010
The OpenCL Programming Model, Part 2: Case Studies
Wen-mei Hwu and John Stone, with special contributions from Deepthi Nandakumar
© Wen-mei W. Hwu and John Stone, Urbana, July 22, 2010

OpenCL Data Parallel Model
- Parallel work is submitted to devices by launching kernels.
- Kernels run over global dimension index ranges (NDRange), broken up into "work groups" and "work items".
- Work items executing within the same work group can synchronize with each other using barriers or memory fences.
- Work items in different work groups cannot synchronize with each other, except by launching a new kernel.
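As an illustration of within-group synchronization, here is a minimal sketch (not taken from the slides; the kernel name and access pattern are invented for the example). The __local buffer is sized by the host via clSetKernelArg with a NULL pointer.

__kernel void group_sync_example(__global float *data, __local float *tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    tile[lid] = data[gid];            // stage one value per work item into __local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // every work item in this work group waits here
    // After the barrier, each work item can safely read what its neighbors in the
    // same work group wrote; there is no comparable barrier across work groups.
    data[gid] = tile[(lid + 1) % get_local_size(0)];
}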

Transparent Scalability
- Hardware is free to assign work groups to any processor at any time.
- A kernel therefore scales across any number of parallel processors.
[Figure: the same NDRange of eight work groups (Group 0 through Group 7) scheduled on a two-processor device over four time steps, and on an eight-processor device in a single step. Each group can execute in any order relative to other groups.]

OpenCL NDRange Configuration
[Figure: a 2-D NDRange of Global Size(0) x Global Size(1) work items, tiled into work groups of Local Size(0) x Local Size(1) work items. Each work group has a 2-D Group ID, and each work item has a 2-D index within its group.]

Mapping Data Parallelism Models: OpenCL to CUDA

  OpenCL Parallelism Concept   | CUDA Equivalent
  kernel                       | kernel
  host program                 | host program
  NDRange (index space)        | grid
  work item                    | thread
  work group                   | block

A Simple Running Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in OpenCL programs:
- Private register usage
- Work item ID usage
- Memory data transfer API between host and device
- Square matrices are assumed for simplicity

Programming Model: Square Matrix Multiplication Example
- P = M * N, all of size WIDTH x WIDTH
- Without tiling:
  - One work item calculates one element of P
  - M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH.]

Memory Layout of a Matrix in C
[Figure: a 4 x 4 matrix M stored in row-major order; the elements of row 0 (M(0,0) .. M(0,3)) are followed in memory by row 1 (M(1,0) .. M(1,3)), then row 2, then row 3.]
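In code, the row-major layout means element (row, col) of a WIDTH x WIDTH matrix stored in a flat array is addressed as in this small helper (added here for illustration, not from the slides):

float matrix_element(const float *M, int Width, int row, int col)
{
    return M[row * Width + col];   // row-major: rows are contiguous in memory
}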

Step 1: Matrix Multiplication, A Simple Host Version in C
[Figure: row i of M and column j of N combine to produce element (i, j) of P.]

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    size_t size = Width * Width * sizeof(float);   // all three matrices are Width x Width
    cl_mem Md, Nd, Pd;

    Md = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, NULL);
    Nd = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, NULL);

    clEnqueueWriteBuffer(clcmdque, Md, CL_FALSE, 0, size, (const void *)M, 0, NULL, NULL);
    clEnqueueWriteBuffer(clcmdque, Nd, CL_FALSE, 0, size, (const void *)N, 0, NULL, NULL);

    Pd = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, NULL);

Step 3: Output Matrix Data Transfer (Host-side Code)

    // Kernel invocation code (shown later)

    // Read P from the device
    clEnqueueReadBuffer(clcmdque, Pd, CL_FALSE, 0, size, (void *)P, 0, NULL, &ReadDone);

    // Free device matrices
    clReleaseMemObject(Md);
    clReleaseMemObject(Nd);
    clReleaseMemObject(Pd);
}
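The slides omit error handling for brevity; in a real host program each OpenCL call's cl_int return code should be checked. A small helper along these lines keeps that readable (the helper name is made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static void clCheck(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "OpenCL error %d during %s\n", (int)err, what);
        exit(EXIT_FAILURE);
    }
}

/* Example use with the transfer above:
   clCheck(clEnqueueWriteBuffer(clcmdque, Md, CL_FALSE, 0, size,
                                (const void *)M, 0, NULL, NULL),
           "clEnqueueWriteBuffer(Md)");
*/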

Matrix Multiplication Using Multiple Work Groups
- Break up Pd into tiles of TILE_WIDTH x TILE_WIDTH elements.
- Each work group calculates one tile:
  - Each work item calculates one element.
  - Set the work group size to the tile size (a tiled kernel is sketched below).
[Figure: Pd partitioned into TILE_WIDTH x TILE_WIDTH sub-blocks; work group (bx, by) computes sub-block Pd_sub, and work item (tx, ty) within it computes one element.]
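The slides do not show the tiled kernel itself; the following is a minimal sketch of one, using __local memory and assuming Width is a multiple of TILE_WIDTH (the kernel name is chosen for the example):

#define TILE_WIDTH 16

__kernel void MatrixMulTiled(__global float* Md, __global float* Nd,
                             __global float* Pd, int Width)
{
    __local float Ms[TILE_WIDTH][TILE_WIDTH];
    __local float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = get_local_id(0), ty = get_local_id(1);
    int Col = get_group_id(0) * TILE_WIDTH + tx;
    int Row = get_group_id(1) * TILE_WIDTH + ty;

    float Pvalue = 0.0f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Each work item stages one element of Md and one element of Nd into __local memory
        Ms[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Ns[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        barrier(CLK_LOCAL_MEM_FENCE);      // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        barrier(CLK_LOCAL_MEM_FENCE);      // wait before overwriting the tile
    }
    Pd[Row * Width + Col] = Pvalue;
}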

A Very Small Example
[Figure: a 4 x 4 matrix P with TILE_WIDTH = 2, partitioned into four 2 x 2 tiles; Group (0,0) covers P(0,0), P(0,1), P(1,0), P(1,1), and Groups (0,1), (1,0), (1,1) cover the remaining tiles.]

A Very Small Example: Multiplication
[Figure: the rows of Md and columns of Nd that work group (0,0) reads to compute its 2 x 2 tile of Pd; work item (0,0) forms Pd(0,0) as the dot product of one row of Md with one column of Nd.]

OpenCL Matrix Multiplication Kernel

__kernel void MatrixMulKernel(__global float* Md, __global float* Nd,
                              __global float* Pd, int Width)
{
    // Calculate the row index of the Pd element (and of M)
    int Row = get_global_id(1);
    // Calculate the column index of Pd (and of N)
    int Col = get_global_id(0);

    // Each work item computes one element of the output matrix
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}

Revised Step 5: Kernel Invocation (Host-side Code)

// Set up the execution configuration
size_t cl_DimBlock[2], cl_DimGrid[2];
cl_DimBlock[0] = TILE_WIDTH;
cl_DimBlock[1] = TILE_WIDTH;
cl_DimGrid[0] = Width;    // the global size is the total number of work items
cl_DimGrid[1] = Width;    // per dimension, not the number of work groups as in CUDA

// Argument order must match the kernel's parameter list (Md, Nd, Pd, Width)
clSetKernelArg(clkern, 0, sizeof(cl_mem), (void *)(&deviceM));
clSetKernelArg(clkern, 1, sizeof(cl_mem), (void *)(&deviceN));
clSetKernelArg(clkern, 2, sizeof(cl_mem), (void *)(&deviceP));
clSetKernelArg(clkern, 3, sizeof(int),    (void *)(&Width));

// Launch the device kernel
clEnqueueNDRangeKernel(clcmdque, clkern, 2, NULL, cl_DimGrid, cl_DimBlock,
                       0, NULL, &DeviceDone);

A Real Application Example

Electrostatic Potential Maps
Electrostatic potentials are evaluated on a 3-D lattice. Applications include:
- Ion placement for structure building
- Time-averaged potentials for simulation
- Visualization and analysis
[Figure: electrostatic potential map of isoleucine tRNA synthetase.]

Direct Coulomb Summation
Each lattice point accumulates the electrostatic potential contribution from all atoms:

    potential[j] += charge[i] / r_ij

where r_ij is the distance from lattice point j to atom i.
[Figure: lattice point j being evaluated against atom i at distance r_ij.]
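Written out as a formula (with the Coulomb constant folded into the charges, as in the expression on the slide), the potential accumulated at lattice point j is:

V_j = \sum_{i=1}^{N_{\mathrm{atoms}}} \frac{q_i}{r_{ij}}, \qquad r_{ij} = \lVert \mathbf{r}_j - \mathbf{r}_i \rVert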

Data Parallel Direct Coulomb Summation Algorithm
- Work is decomposed into tens of thousands of independent calculations, multiplexed onto all of the processing units on the target device (hundreds, in the case of modern GPUs).
- Single-precision FP arithmetic is adequate for the intended application.
- Numerical accuracy can be improved by compensated summation, spatially ordered summation groupings, or accumulation of the potential in double precision (a sketch of compensated summation follows).
- This is the starting point for more sophisticated linear-time algorithms such as multilevel summation.
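For reference, compensated (Kahan) summation as mentioned above looks like this; the array name and length are placeholders standing in for the per-atom contributions at one lattice point (a sketch, not code from the slides):

float sum = 0.0f, c = 0.0f;              // running sum and compensation term
for (int i = 0; i < numatoms; i++) {
    float y = contrib[i] - c;            // apply the correction carried from the last step
    float t = sum + y;                   // add; low-order bits of y may be lost here
    c = (t - sum) - y;                   // recover what was lost
    sum = t;
}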

DCS Data Parallel Decomposition (unrolled, coalesced)
[Figure: the potential map is covered by a grid of work groups; each work item computes up to 8 potentials, skipping by the memory coalescing width, so unrolling increases the computational tile size. The lattice is padded out to whole tiles, producing some "padding waste" at the edges.]

Direct Coulomb Summation in OpenCL
[Figure: the host transfers atomic coordinates and charges into GPU constant memory; an NDRange containing all work items is decomposed into work groups, and each work item computes up to 8 potentials, skipping by the coalesced memory width. The device side uses global memory, texture, the parallel data cache, and constant memory; the lattice is padded to fill whole work groups.]

Direct Coulomb Summation Kernel Setup

OpenCL:
__kernel void clenergy(…) {
    unsigned int xindex = (get_global_id(0) - get_local_id(0)) * UNROLLX
                          + get_local_id(0);
    unsigned int yindex = get_global_id(1);
    unsigned int outaddr = get_global_size(0) * UNROLLX * yindex + xindex;

CUDA:
__global__ void cuenergy(…) {
    unsigned int xindex = blockIdx.x * blockDim.x * UNROLLX + threadIdx.x;
    unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;

DCS Inner Loop (CUDA)

…
for (atomid = 0; atomid < numatoms; atomid++) {
    float dy = coory - atominfo[atomid].y;
    float dyz2 = (dy * dy) + atominfo[atomid].z;
    float dx1 = coorx - atominfo[atomid].x;
    float dx2 = dx1 + gridspacing_coalesce;
    float dx3 = dx2 + gridspacing_coalesce;
    float dx4 = dx3 + gridspacing_coalesce;
    float charge = atominfo[atomid].w;
    energyvalx1 += charge * rsqrtf(dx1*dx1 + dyz2);
    energyvalx2 += charge * rsqrtf(dx2*dx2 + dyz2);
    energyvalx3 += charge * rsqrtf(dx3*dx3 + dyz2);
    energyvalx4 += charge * rsqrtf(dx4*dx4 + dyz2);
}

DCS Inner Loop (OpenCL on NVIDIA GPU)

…
for (atomid = 0; atomid < numatoms; atomid++) {
    float dy = coory - atominfo[atomid].y;
    float dyz2 = (dy * dy) + atominfo[atomid].z;
    float dx1 = coorx - atominfo[atomid].x;
    float dx2 = dx1 + gridspacing_coalesce;
    float dx3 = dx2 + gridspacing_coalesce;
    float dx4 = dx3 + gridspacing_coalesce;
    float charge = atominfo[atomid].w;
    energyvalx1 += charge * native_rsqrt(dx1*dx1 + dyz2);
    energyvalx2 += charge * native_rsqrt(dx2*dx2 + dyz2);
    energyvalx3 += charge * native_rsqrt(dx3*dx3 + dyz2);
    energyvalx4 += charge * native_rsqrt(dx4*dx4 + dyz2);
}

DCS Inner Loop (OpenCL on AMD CPU)

float4 gridspacing_u4 = { 0.f, 1.f, 2.f, 3.f };
gridspacing_u4 *= gridspacing_coalesce;
float4 energyvalx = 0.0f;
…
for (atomid = 0; atomid < numatoms; atomid++) {
    float dy = coory - atominfo[atomid].y;
    float dyz2 = (dy * dy) + atominfo[atomid].z;
    float4 dx = gridspacing_u4 + (coorx - atominfo[atomid].x);
    float charge = atominfo[atomid].w;
    energyvalx += charge * native_rsqrt(dx*dx + dyz2);
}

Wait a Second, Why Two Different OpenCL Kernels???
- Existing OpenCL implementations don't necessarily autovectorize your code for the native hardware's SIMD vector width.
- Although you can run the same code on very different devices and get the correct answer, performance will vary wildly.
- In many cases, getting peak performance on multiple device types, or on hardware from different vendors, will presently require multiple OpenCL kernels.
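One practical way to choose between such kernels at run time is to query the device's preferred vector width. A minimal sketch, assuming the program contains both the scalar "clenergy" kernel from the slides and a hypothetical float4 variant named "clenergy_vec4":

cl_uint vecwidth = 0;
clerr = clGetDeviceInfo(cldevs[0], CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                        sizeof(vecwidth), &vecwidth, NULL);
const char *kernname = (vecwidth >= 4) ? "clenergy_vec4" : "clenergy";
cl_kernel clkern = clCreateKernel(clpgm, kernname, &clerr);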

OpenCL Host Code
Roughly analogous to the CUDA driver API:
- Memory allocations, memory copies, etc.
- Create and manage device context(s) and associated work queue(s), etc.
- OpenCL uses reference counting on all objects.
OpenCL programs are normally compiled entirely at runtime, which must be managed by host code.
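Because all OpenCL objects are reference counted, host code is expected to release everything it created once it is done with it. A minimal teardown sketch, using the object names from the surrounding slides:

clReleaseKernel(clkern);          // drop the reference created by clCreateKernel
clReleaseProgram(clpgm);          // ... by clCreateProgramWithSource
clReleaseCommandQueue(clcmdq);    // ... by clCreateCommandQueue
clReleaseContext(clctx);          // ... by clCreateContextFromType
free(cldevs);                     // plain malloc'd device list from clGetContextInfo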

OpenCL Context Setup Code (simple)

cl_int clerr = CL_SUCCESS;
cl_context clctx = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL,
                                           NULL, NULL, &clerr);

size_t parmsz;
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);

cl_device_id* cldevs = (cl_device_id *) malloc(parmsz);
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);

cl_command_queue clcmdq = clCreateCommandQueue(clctx, cldevs[0], 0, &clerr);

OpenCL Kernel Compilation Example

// OpenCL kernel source code as one big string
const char* clenergysrc =
    "__kernel __attribute__((reqd_work_group_size_hint(BLOCKSIZEX, BLOCKSIZEY, 1))) \n"
    "void clenergy(int numatoms, float gridspacing, __global float *energy, __constant float4 *atominfo) { \n"
    […etc and so forth…]

// Give the raw source code string(s) to OpenCL
cl_program clpgm;
clpgm = clCreateProgramWithSource(clctx, 1, &clenergysrc, NULL, &clerr);

// Set compiler flags, compile the source, and retrieve a handle to the "clenergy" kernel
char clcompileflags[4096];
sprintf(clcompileflags,
        "-DUNROLLX=%d -cl-fast-relaxed-math -cl-single-precision-constant "
        "-cl-denorms-are-zero -cl-mad-enable", UNROLLX);
clerr = clBuildProgram(clpgm, 0, NULL, clcompileflags, NULL, NULL);
cl_kernel clkern = clCreateKernel(clpgm, "clenergy", &clerr);
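When clBuildProgram fails it only returns an error code; the compiler's build log has to be fetched explicitly. A minimal sketch (not on the slide) using the handles defined above:

if (clerr != CL_SUCCESS) {
    size_t logsz = 0;
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logsz);
    char *buildlog = (char *) malloc(logsz + 1);
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, logsz, buildlog, NULL);
    buildlog[logsz] = '\0';
    fprintf(stderr, "OpenCL build failed:\n%s\n", buildlog);
    free(buildlog);
}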

Host Code for OpenCL Kernel Launch

doutput = clCreateBuffer(clctx, CL_MEM_READ_WRITE, volmemsz, NULL, NULL);
datominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,
                           MAXATOMS * sizeof(cl_float4), NULL, NULL);
…
clerr = clSetKernelArg(clkern, 0, sizeof(int),    &runatoms);
clerr = clSetKernelArg(clkern, 1, sizeof(float),  &zplane);
clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &doutput);
clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &datominfo);

cl_event event;
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL, Gsz, Bsz, 0, NULL, &event);
clerr = clWaitForEvents(1, &event);
clerr = clReleaseEvent(event);
…
clEnqueueReadBuffer(clcmdq, doutput, CL_TRUE, 0, volmemsz, energy, 0, NULL, NULL);
clReleaseMemObject(doutput);
clReleaseMemObject(datominfo);

Apples to Oranges Performance Results: OpenCL Direct Coulomb Summation Kernels
(MADD, RSQRT = 2 FLOPS; all other FP instructions = 1 FLOP)

OpenCL Target Device                                     | OpenCL "cores" | Scalar kernel (ported from the original CUDA kernel) | 4-vector kernel (manually unrolled loop iterations replaced with float4 vector ops)
AMD 2.2 GHz Opteron 148 CPU (a very old Linux test box)  |                | Bevals/sec, 2.19 GFLOPS                              | 0.49 Bevals/sec, 3.59 GFLOPS
Intel 2.2 GHz Core2 Duo (Apple MacBook Pro)              |                | Bevals/sec, 6.55 GFLOPS                              | 2.38 Bevals/sec, GFLOPS
IBM QS22 CellBE (*** __constant not implemented yet)     |                | Bevals/sec, GFLOPS ****                              | 6.21 Bevals/sec, GFLOPS ****
AMD Radeon 4870 GPU                                      |                | Bevals/sec, GFLOPS                                   | Bevals/sec, GFLOPS
NVIDIA GeForce GTX 285 GPU                               |                | Bevals/sec, GFLOPS                                   | Bevals/sec, GFLOPS

To Learn More
- Khronos OpenCL headers, specification, samples, and tutorials
- AMD OpenCL resources
- NVIDIA OpenCL resources
- Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann

Summary
- Incorporating OpenCL into an application requires adding far more "plumbing" than with the CUDA runtime API.
- Although OpenCL code is portable in terms of correctness, performance of any particular kernel is not guaranteed across different device types or vendors.
- Applications have to check performance-related properties of target devices, e.g. whether __local memory is fast or slow (query CL_DEVICE_LOCAL_MEM_TYPE, as sketched below).
- It remains to be seen how OpenCL "platforms" will allow applications to concurrently use an AMD CPU runtime and an NVIDIA GPU runtime (this may already work on MacOS X?).
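The CL_DEVICE_LOCAL_MEM_TYPE query mentioned above looks like this (a minimal sketch; the device handle name is carried over from the earlier context-setup code):

cl_device_local_mem_type lmemtype;
clerr = clGetDeviceInfo(cldevs[0], CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(lmemtype), &lmemtype, NULL);
if (lmemtype == CL_LOCAL)
    printf("__local memory is dedicated on-chip storage (fast)\n");
else  /* CL_GLOBAL */
    printf("__local memory is emulated in global memory (slow); tiling may not pay off\n");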

Acknowledgements
Additional information, questions, and source code requests: John Stone
Acknowledgements:
- J. Phillips, D. Hardy, J. Saam, UIUC Theoretical and Computational Biophysics Group, NIH Resource for Macromolecular Modeling and Bioinformatics
- Christopher Rodrigues, UIUC IMPACT Group
- CUDA team at NVIDIA
- UIUC NVIDIA CUDA Center of Excellence
- NIH support: P41-RR

Publications
- Probing Biomolecular Machines with Graphics Processors. J. Phillips, J. Stone. Communications of the ACM, 52(10):34-41.
- GPU Clusters for High Performance Computing. V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, W. Hwu. Workshop on Parallel Programming on Accelerator Clusters (PPAC), IEEE Cluster. In press.
- Long time-scale simulations of in vivo diffusion using GPU hardware. E. Roberts, J. Stone, L. Sepulveda, W. Hwu, Z. Luthey-Schulten. In IPDPS'09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Computing, pp. 1-8.
- High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs. J. Stone, J. Saam, D. Hardy, K. Vandivort, W. Hwu, K. Schulten. 2nd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-2), ACM International Conference Proceeding Series, volume 383, pp. 9-18.
- Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35.

Publications (cont.)
- Adapting a message-driven parallel application to GPU-accelerated clusters. J. Phillips, J. Stone, K. Schulten. Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press.
- GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, and W. Hwu. Proceedings of the 2008 Conference on Computing Frontiers.
- GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 96.
- Accelerating molecular modeling applications with graphics processors. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten. J. Comp. Chem., 28.
- Continuous fluorescence microphotolysis and correlation spectroscopy. A. Arkhipov, J. Hüve, M. Kahms, R. Peters, K. Schulten. Biophysical Journal, 93.