Illinois UPCRC Summer School 2010 – The OpenCL Programming Model, Part 1: Basic Concepts – Wen-mei Hwu and John Stone, with special contributions from Deepthi Nandakumar


Illinois UPCRC Summer School 2010
The OpenCL Programming Model
Part 1: Basic Concepts
Wen-mei Hwu and John Stone, with special contributions from Deepthi Nandakumar
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Why GPU Computing
– An enlarging peak performance advantage:
  – Calculation: 1 TFLOPS vs. 100 GFLOPS
  – Memory Bandwidth: GB/s vs GB/s
– GPU in every PC and workstation – massive volume and potential impact
Courtesy: John Owens
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Role of GPUs - large data sets © Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Future Apps Reflect a Concurrent World
– Exciting applications in future computing have traditionally been considered "supercomputing applications"
  – Video and audio synthesis/analysis, 3D imaging and visualization, consumer game physics, virtual reality products, computational financing, molecular dynamics simulation, computational fluid dynamics
  – These "super-apps" represent and model the physical, concurrent world
– Various granularities of parallelism exist, but…
  – the programming model must not hinder scalable implementation
  – data delivery needs careful management
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

DRAM Bandwidth Trends Set the Programming Agenda
– Random access BW is 1.2% of peak for DDR3-1600, 0.8% for GDDR (and falling)
– 3D stacking and optical interconnects are unlikely to help
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

UIUC/NCSA AC Cluster
– 32 nodes
  – 4-GPU (GTX280, Tesla), 1-FPGA, quad-core Opteron node at NCSA
  – GPUs donated by NVIDIA
  – FPGA donated by Xilinx
  – 128 TFLOPS single precision, 10 TFLOPS double precision
– Coulomb Summation:
  – 1.78 TFLOPS/node
  – 271x speedup vs. Intel QX6700 CPU core w/ SSE
– A partnership between NCSA and academic departments
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

What is (Historical) GPGPU?
– General purpose computation using a GPU and graphics API in applications other than 3D graphics
  – GPU accelerates the critical path of the application
– Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
– Applications – see GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Previous GPGPU Constraints
– Dealing with the graphics API
  – Working with the corner cases of the graphics API
– Addressing modes
  – Limited texture size/dimension
– Shader capabilities
  – Limited outputs
– Instruction sets
  – Lack of integer & bit ops
– Communication limited
  – Between pixels
  – Scatter a[i] = p
[Diagram: fragment program with input registers, constants, texture, temp registers (per thread / per shader / per context), output registers, and FB memory]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

G80 – Graphics Mode
– The future of GPUs is programmable processing
– So – build the architecture around the processor
[Diagram: G80 graphics pipeline – host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, arrays of thread processors (SP) with L1/TF, L2 caches, and framebuffer (FB) partitions]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

CUDA – Recent OpenCL Predecessor
– "Compute Unified Device Architecture"
– General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
– Targeted software stack
  – Compute-oriented drivers, language, and tools
– Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

G80 CUDA Mode – A Device Example
– Processors execute computing threads
– New operating mode/HW interface for computing
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

What is OpenCL?
– Cross-platform parallel computing API and C-like language for heterogeneous computing devices
– Code is portable across various target devices:
  – Correctness is guaranteed
  – Performance of a given kernel is not guaranteed across differing target devices
– OpenCL implementations already exist for AMD and NVIDIA GPUs and x86 CPUs
– In principle, OpenCL could also target DSPs, Cell, and perhaps FPGAs
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

More on Multi-Platform Targeting
– Targets a broader range of CPU-like and GPU-like devices than CUDA
  – Targets devices produced by multiple vendors
  – Many features of OpenCL are optional and may not be supported on all devices
– OpenCL code must be prepared to deal with much greater hardware diversity
– A single OpenCL kernel will likely not achieve peak performance on all device types
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Overview
– OpenCL programming model – basic concepts and data types
– OpenCL application programming interface – basics
– Simple examples to illustrate basic concepts and functionality
– Case study to illustrate performance considerations
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Programs
– An OpenCL "program" contains one or more "kernels" and any supporting routines that run on a target device (see the sketch below)
– An OpenCL kernel is the basic unit of parallel code that can be executed on a target device
[Diagram: an OpenCL program containing Kernel A, Kernel B, Kernel C, and misc support functions]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
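To make "supporting routines" concrete, here is a hedged sketch (kernel and helper names are hypothetical) of a program that pairs a kernel with a plain program-scope function it calls:

/* Support function: ordinary OpenCL C function, not a kernel entry point */
float average(float x, float y) { return 0.5f * (x + y); }

/* Kernel: the unit of parallel code the host can launch */
__kernel void avg(__global const float *a,
                  __global const float *b,
                  __global float *result)
{
    int id = get_global_id(0);
    result[id] = average(a[id], b[id]);  /* kernels may call program-scope helpers */
}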

OpenCL Execution Model
– Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial Code (host)
...
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
...
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Kernels
– Code that actually executes on target devices
– The kernel body is instantiated once for each work item
  – An OpenCL work item is equivalent to a CUDA thread
– Each OpenCL work item gets a unique index

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Array of Parallel Work Items
– An OpenCL kernel is executed by an array of work items
  – All work items run the same code (SPMD)
  – Each work item has an index that it uses to compute memory addresses and make control decisions

… int id = get_global_id(0); result[id] = a[id] + b[id]; …

[Diagram: an array of work items (threads), each executing the same kernel code with a different index]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Work Groups: Scalable Cooperation
– Divide the monolithic work item array into work groups
  – Work items within a work group cooperate via shared memory, atomic operations, and barrier synchronization (a cooperation sketch follows below)
  – Work items in different work groups cannot cooperate

… int id = get_global_id(0); result[id] = a[id] + b[id]; …

[Diagram: the work item array partitioned into work groups 0 through 7, each group running the same code]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
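A hedged sketch (hypothetical kernel) of this cooperation: work items stage data in __local memory, synchronize with a barrier, then read a neighbor's element. Only work items in the same work group see the staged data:

__kernel void rotate_left(__global const float *in,
                          __global float *out,
                          __local float *stage)   /* sized by the host */
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);
    stage[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);          /* all stores to stage now visible */
    out[gid] = stage[(lid + 1) % lsz];     /* read a neighbor in this group   */
}

The host supplies the __local buffer's size with clSetKernelArg(kernel, 2, bytes, NULL).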

OpenCL Data Parallel Model Summary
– Parallel work is submitted to devices by launching kernels (a launch sketch follows below)
– Kernels run over global dimension index ranges (NDRange), broken up into "work groups" and "work items"
– Work items executing within the same work group can synchronize with each other with barriers or memory fences
– Work items in different work groups can't sync with each other, except by launching a new kernel
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
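A hedged host-side sketch of such a launch, reusing the clcmdq/clkern/clerr handles introduced in the setup and compilation examples later in this deck: a 1-D NDRange of 1024 work items split into work groups of 64:

size_t global[1] = { 1024 };   /* total work items in the NDRange        */
size_t local[1]  = { 64 };     /* work items per work group => 16 groups */
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 1, NULL, global, local,
                               0, NULL, NULL);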

OpenCL Host Code
– Prepares and triggers device code execution (outlined below)
  – Create and manage device context(s) and associated work queue(s), etc.
  – Memory allocations, memory copies, etc.
  – Kernel launch
– OpenCL programs are normally compiled entirely at runtime, which must be managed by host code
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
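As an orientation aid (not from the original slides), the typical call sequence the host must manage looks roughly like this:

/* Typical OpenCL host-side flow (illustrative outline):
   1. clCreateContextFromType()                     - create a context over devices
   2. clCreateCommandQueue()                        - one queue per device
   3. clCreateBuffer()                              - allocate device memory objects
   4. clCreateProgramWithSource(), clBuildProgram() - compile kernels at runtime
   5. clCreateKernel(), clSetKernelArg()            - get kernel handle, bind arguments
   6. clEnqueueWriteBuffer(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer()      */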

OpenCL Hardware Abstraction
– OpenCL exposes CPUs, GPUs, and other accelerators as "devices" (a query sketch follows below)
– Each "device" contains one or more "compute units", i.e. cores, SMs, etc.
– Each "compute unit" contains one or more SIMD "processing elements"
[Diagram: an OpenCL device containing compute units, each with processing elements (PEs)]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
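A hedged sketch of querying this hierarchy at runtime, assuming a device ID such as the cldevs[0] obtained in the context setup code later in this deck:

cl_uint ncu;
clerr = clGetDeviceInfo(cldevs[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cl_uint), &ncu, NULL);  /* number of compute units */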

An Example of Physical Reality Behind the OpenCL Abstraction
– CPU (host)
– GPU w/ local DRAM (device)
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Context
– Contains one or more devices
– OpenCL memory objects are associated with a context, not a specific device
– clCreateBuffer() is the main data object allocation function
  – It is an error if an allocation is too large for any device in the context
– Each device needs its own work queue(s)
– Memory transfers are associated with a command queue (thus a specific device)
[Diagram: an OpenCL context containing multiple OpenCL devices]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Context Setup Code (simple)

cl_int clerr = CL_SUCCESS;
cl_context clctx = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL,
                                           NULL, NULL, &clerr);
size_t parmsz;
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);
cl_device_id* cldevs = (cl_device_id *) malloc(parmsz);
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);
cl_command_queue clcmdq = clCreateCommandQueue(clctx, cldevs[0], 0, &clerr);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
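In real code one would check clerr after each of these calls; a minimal hedged sketch:

if (clerr != CL_SUCCESS) { /* report and bail out; the handling policy is up to the app */ }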

OpenCL Memory Model Overview
– Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access
– We will focus on global memory for now
[Diagram: an NDRange over global memory; each work group has its own local memory, and each thread its own private memory; the host sits outside the device]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device Memory Allocation
– clCreateBuffer()
  – Allocates an object in device global memory
  – Returns a pointer to the object
  – Requires five parameters:
    – OpenCL context pointer
    – Flags for access type by device
    – Size of the allocated object
    – Host memory pointer, if used in copy-from-host mode
    – Error code
– clReleaseMemObject()
  – Frees the object
    – Pointer to the freed object
[Diagram: device memory hierarchy – grid, global memory, per-block shared memory, per-thread registers, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device Memory Allocation (cont.)
Code example:
– Allocate a 1024-element single precision float array
– Attach the allocated storage to d_a
– "d" is often used to indicate a device data structure

int VECTOR_SIZE = 1024;
cl_mem d_a;
int size = VECTOR_SIZE * sizeof(float);
d_a = clCreateBuffer(clctx, CL_MEM_READ_ONLY, size, NULL, NULL);
clReleaseMemObject(d_a);

OpenCL Device Command Execution
[Diagram: applications enqueue commands into per-device command queues; each command queue feeds one OpenCL device within a shared OpenCL context]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Host-to-Device Data Transfer
– clEnqueueWriteBuffer()
  – Memory data transfer to the device
  – Requires nine parameters:
    – OpenCL command queue pointer
    – Destination OpenCL memory buffer
    – Blocking flag
    – Offset in bytes
    – Size in bytes of the written data
    – Host memory pointer
    – List of events to be completed before execution of this command
    – Event object tied to this command
– Asynchronous transfer later
[Diagram: device memory hierarchy – grid, global memory, per-block shared memory, per-thread registers, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device-to-Host Data Transfer
– clEnqueueReadBuffer()
  – Memory data transfer to the host
  – Requires nine parameters:
    – OpenCL command queue pointer
    – Source OpenCL memory buffer
    – Blocking flag
    – Offset in bytes
    – Size in bytes of the read data
    – Destination host memory pointer
    – List of events to be completed before execution of this command
    – Event object tied to this command
– Asynchronous transfer later
[Diagram: device memory hierarchy – grid, global memory, per-block shared memory, per-thread registers, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Host-Device Data Transfer (cont.)
Code example:
– Transfer a 64 * 64 single precision float array
– a is in host memory and d_a is in device memory

clEnqueueWriteBuffer(clcmdq, d_a, CL_FALSE, 0, mem_size,
                     (const void *) a, 0, NULL, NULL);
clEnqueueReadBuffer(clcmdq, d_result, CL_FALSE, 0, mem_size,
                    (void *) host_result, 0, NULL, NULL);
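Because the blocking flag is CL_FALSE, both transfers above are asynchronous; before the host reads host_result it must synchronize, for example (a minimal sketch):

clFinish(clcmdq);   /* block until every command in the queue has completed */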

OpenCL Host-Device Data Transfer (cont.)
– clCreateBuffer and clEnqueueWriteBuffer can be combined into a single command using special flags, e.g.:

d_A = clCreateBuffer(clctxt, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     mem_size, h_A, NULL);

– Combination of two flags here; CL_MEM_COPY_HOST_PTR is to be used only if a valid host pointer is specified
– This creates a memory buffer on the device and copies data from h_A into d_A
– Includes an implicit clEnqueueWriteBuffer operation, for all devices/command queues tied to the context clctxt
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Memory Systems
– __global – large, long latency
– __private – on-chip device registers
– __local – memory accessible from multiple PEs or work items; may be SRAM or DRAM, must query…
– __constant – read-only constant cache
– Device memory is managed explicitly by the programmer, as with CUDA (a kernel sketch follows below)
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
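A hedged sketch (hypothetical kernel) touching all four address spaces:

__constant float scale = 2.0f;                 /* __constant: read-only data      */

__kernel void scale_copy(__global const float *in,   /* __global: visible to all  */
                         __global float *out,
                         __local float *tmp)         /* __local: per work group   */
{
    int lid = get_local_id(0);                       /* __private: per work item  */
    int gid = get_global_id(0);
    tmp[lid] = in[gid] * scale;
    barrier(CLK_LOCAL_MEM_FENCE);                    /* make tmp stores visible   */
    out[gid] = tmp[lid];
}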

OpenCL Kernel Execution Launch
[Diagram: applications enqueue kernel launches into per-device command queues within a shared OpenCL context]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Kernel Compilation Example

// OpenCL kernel source code as a big string
const char* vaddsrc =
    "__kernel void vadd(__global float *d_A, __global float *d_B, "
    "__global float *d_C, int N) { \n"
    […etc and so forth…]

// Give the raw source code string(s) to OpenCL
cl_program clpgm;
clpgm = clCreateProgramWithSource(clctx, 1, &vaddsrc, NULL, &clerr);

// Set compiler flags, compile the source, and retrieve a handle to the "vadd" kernel
char clcompileflags[4096];
sprintf(clcompileflags, "-cl-mad-enable");
clerr = clBuildProgram(clpgm, 0, NULL, clcompileflags, NULL, NULL);
cl_kernel clkern = clCreateKernel(clpgm, "vadd", &clerr);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
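If clBuildProgram fails, the compiler's diagnostics can be retrieved with clGetProgramBuildInfo; a hedged sketch, assuming the cldevs array from the context setup example:

if (clerr != CL_SUCCESS) {
    size_t logsz;
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logsz);
    char *log = (char *) malloc(logsz);
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, logsz, log, NULL);
    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}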

Summary: Host code for vadd

int main()
{
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;

    // allocate device (GPU) memory
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         N * sizeof(float), h_A, NULL);
    d_B = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         N * sizeof(float), h_B, NULL);
    d_C = clCreateBuffer(clctx, CL_MEM_WRITE_ONLY,
                         N * sizeof(float), NULL, NULL);

    clkern = clCreateKernel(clpgm, "vadd", NULL);
    clerr = clSetKernelArg(clkern, 0, sizeof(cl_mem), (void *) &d_A);
    clerr = clSetKernelArg(clkern, 1, sizeof(cl_mem), (void *) &d_B);
    clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), (void *) &d_C);
    clerr = clSetKernelArg(clkern, 3, sizeof(int), &N);

    cl_event event = NULL;
    // vadd uses get_global_id(0), so this is a 1-D NDRange (work_dim = 1)
    clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 1, NULL, Gsz, Bsz,
                                   0, NULL, &event);
    clerr = clWaitForEvents(1, &event);
    clEnqueueReadBuffer(clcmdq, d_C, CL_TRUE, 0, N * sizeof(float),
                        h_C, 0, NULL, NULL);

    clReleaseMemObject(d_A);
    clReleaseMemObject(d_B);
    clReleaseMemObject(d_C);
}

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
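Gsz and Bsz are assumed to be defined elsewhere in the program; a plausible 1-D definition (hypothetical sizes, assuming N is a multiple of the work-group size):

size_t Gsz[1] = { N };     /* global work size: one work item per element */
size_t Bsz[1] = { 256 };   /* local work-group size                       */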

Let’s take a break! © Wen-mei W. Hwu and John Stone, Urbana July 22, 2010