Advanced / Other Programming Models
Sathish Vadhiyar
OpenCL – Command Queues, Runtime Compilation, Multiple Devices
Sources: OpenCL overview and OpenCL learning kit from AMD
Introduction
OpenCL is a programming framework for heterogeneous computing resources. These resources include CPUs, GPUs, the Cell Broadband Engine, FPGAs, and DSPs. OpenCL has many similarities with CUDA.
Command Queues
A command queue is the mechanism by which the host requests that an action be performed by a device: perform a memory transfer, begin executing a kernel, and so on. Kernels are enqueued, and dependencies between commands are expressed using events. A separate command queue is required for each device. Commands within a queue can be synchronous or asynchronous, and can execute in-order or out-of-order.
(Slides: Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011)
Example – Image Rotation
See slides 8 and 11-16 of lecture 5 in the OpenCL University Kit.
Synchronization
Synchronization in OpenCL
Synchronization is required when using an out-of-order command queue or multiple command queues. Coarse synchronization granularity is per command queue; finer granularity is per OpenCL operation, using events.
OpenCL Command Queue Control
Command queue synchronization methods work on a per-queue basis:
Flush: clFlush(cl_command_queue) sends all commands in the queue to the compute device; there is no guarantee that they are complete when clFlush returns.
Finish: clFinish(cl_command_queue) waits for all commands in the command queue to complete before proceeding (the host blocks on this call).
Barrier: clEnqueueBarrier(cl_command_queue) enqueues a synchronization point that ensures all prior commands in the queue have completed before any further commands execute.
OpenCL Events
The previous synchronization functions operate only at per-command-queue granularity. OpenCL events are needed to synchronize at the granularity of individual operations. Explicit synchronization is required for out-of-order command queues and for multiple command queues.
Using User Events
A simple example of a user event being triggered and used to gate a command in a command queue:

// Create a user event that will gate the write of buf1
user_event = clCreateUserEvent(ctx, NULL);
clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
// The write of buf1 is now enqueued and waiting on user_event
X = foo();  // Lots of complicated host processing code
clSetUserEventStatus(user_event, CL_COMPLETE);
// The enqueued write to buf1 can now proceed, after foo() has run
Multiple Devices
Multiple Devices
OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP, etc.). OpenCL does not assume that data can be transferred directly between devices, so commands exist only to move data from host to device or from device to host. Copying from one device to another therefore requires an intermediate transfer through the host. OpenCL events are used to synchronize execution on different devices within a context.
Compiling Code for Multiple Devices
Charm++
Source: tutorial slides from the Parallel Programming Laboratory, UIUC (Laxmikant Kale, Eric Bohm)
Virtualization: Object-based Decomposition
In MPI, the number of processes is typically equal to the number of processors. Virtualization instead divides the computation into a large number of pieces, independent of (and typically larger than) the number of processors, and lets the runtime system map objects to processors.
The Charm++ Model
Parallel objects (chares) communicate via asynchronous method invocations (entry methods). The runtime system maps chares onto processors and schedules execution of entry methods. Chares can be created dynamically on any available processor and can be accessed from remote processors. (Charm++ Basics)
Processor Virtualization
[Figure: "user view" (a graph of interacting objects/VPs) vs. "system implementation" (those objects mapped onto processors).] The user is concerned only with the interaction between objects (VPs).
Adaptive Overlap via Data-driven Objects
Problem: processors wait too long at "receive" statements. With virtualization, you get data-driven execution: there are multiple entities (objects, threads) on each processor, so no single object or thread holds up the processor; each one is "continued" when its data arrives. This achieves automatic and adaptive overlap of computation and communication.
Load Balancing
Using Dynamic Mapping to Processors
Migration: Charm++ objects can migrate from one processor to another; migration creates a new object on the destination processor while destroying the original. This is used for dynamic (and static, initial) load balancing, with measurement-based, predictive strategies based on object communication patterns and computational loads.
Summary: Primary Advantages
Automatic mapping. Migration and load balancing. Asynchronous, message-driven communication. Computation-communication overlap.
How does it look?
Asynchronous Hello World
The program's asynchronous flow: the mainchare sends a message to the Hello object; the Hello object prints "Hello World!"; the Hello object sends a message back to the mainchare; the mainchare quits the application.
Code and Workflow
Hello World: Array Version – Main Code
Array Code
Result
$ ./charmrun +p3 ./hello 10
Running "Hello World" with 10 elements using 3 processors.
"Hello" from Hello chare #0 on processor 0 (told by -1)
"Hello" from Hello chare #1 on processor 0 (told by 0)
"Hello" from Hello chare #2 on processor 0 (told by 1)
"Hello" from Hello chare #3 on processor 0 (told by 2)
"Hello" from Hello chare #4 on processor 1 (told by 3)
"Hello" from Hello chare #5 on processor 1 (told by 4)
"Hello" from Hello chare #6 on processor 1 (told by 5)
"Hello" from Hello chare #7 on processor 2 (told by 6)
"Hello" from Hello chare #8 on processor 2 (told by 7)
"Hello" from Hello chare #9 on processor 2 (told by 8)