1 Advanced / Other Programming Models Sathish Vadhiyar

2 OpenCL – Command Queues, Runtime Compilation, Multiple Devices
  Sources: OpenCL overview from AMD; OpenCL learning kit from AMD (slides by Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Ben Gaster, AMD, © 2011).

3 Introduction
  - OpenCL is a programming framework for heterogeneous computing resources.
  - Resources include CPUs, GPUs, the Cell Broadband Engine, FPGAs, and DSPs.
  - It has many similarities with CUDA.

4

5 Command Queues
  - A command queue is the mechanism for the host to request that an action be performed by the device: perform a memory transfer, begin executing a kernel, etc.
  - Interesting concept of enqueuing kernels and satisfying dependencies using events.
  - A separate command queue is required for each device.
  - Commands within the queue can be synchronous or asynchronous.
  - Commands can execute in-order or out-of-order.
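
  As a concrete illustration (not from the slides; error handling is omitted and the function name and arguments are placeholders), the basic host-side pattern looks roughly like this: one command queue is created for the chosen device, and transfers and kernel launches are enqueued on it.

    // Minimal sketch (placeholder names, no error checking). The kernel is
    // assumed to have been built for 'device' within context 'ctx'.
    #include <CL/cl.h>

    void enqueue_example(cl_context ctx, cl_device_id device,
                         cl_kernel kernel, size_t n, float *host_data)
    {
        // One command queue per device; all requests go through it.
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, NULL);

        // Asynchronous (non-blocking) host-to-device transfer.
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                             host_data, 0, NULL, NULL);

        // Request kernel execution; in this default in-order queue it will
        // run only after the write has completed.
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // Block the host until everything in the queue has finished.
        clFinish(queue);
    }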

6 Example – Image Rotation

7 See slides 8 and 11-16 of lecture 5 in the OpenCL University Kit.
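
  Those external slides are not reproduced here; as a rough placeholder, an image-rotation kernel of the kind that example uses might look like the following OpenCL C sketch (the kernel name, argument list, and rotation convention are assumptions, not the kit's exact code).

    __kernel void img_rotate(__global float *dest, __global const float *src,
                             int W, int H, float sinTheta, float cosTheta)
    {
        // Each work-item computes one destination pixel.
        const int ix = get_global_id(0);
        const int iy = get_global_id(1);

        // Rotate the pixel's coordinates about the image centre.
        const float x0 = W / 2.0f, y0 = H / 2.0f;
        const float xOff = ix - x0, yOff = iy - y0;
        const int xpos = (int)( xOff * cosTheta + yOff * sinTheta + x0);
        const int ypos = (int)(-xOff * sinTheta + yOff * cosTheta + y0);

        // Copy the source pixel only if it falls inside the image.
        if (xpos >= 0 && xpos < W && ypos >= 0 && ypos < H)
            dest[iy * W + ix] = src[ypos * W + xpos];
    }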

8 Synchronization

9 Synchronization in OpenCL
  - Synchronization is required if we use an out-of-order command queue or multiple command queues.
  - Coarse synchronization granularity: per command queue.
  - Finer synchronization granularity: per OpenCL operation, using events.

10 OpenCL Command Queue Control
  - Command queue synchronization methods work on a per-queue basis.
  - Flush: clFlush(cl_command_queue). Sends all commands in the queue to the compute device; there is no guarantee that they will be complete when clFlush returns.
  - Finish: clFinish(cl_command_queue). Waits for all commands in the command queue to complete before proceeding (the host blocks on this call).
  - Barrier: clEnqueueBarrier(cl_command_queue). Enqueues a synchronization point that ensures all prior commands in the queue have completed before any further commands execute.
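
  A small sketch of how the three calls are typically combined (the queue and kernel handles are assumed to exist already; this is illustrative, not from the learning kit):

    #include <CL/cl.h>

    // Illustrative sketch: 'queue', 'kernelA' and 'kernelB' are assumed to exist.
    void queue_control_example(cl_command_queue queue,
                               cl_kernel kernelA, cl_kernel kernelB, size_t n)
    {
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &n, NULL, 0, NULL, NULL);

        // Barrier: everything enqueued before this point must finish before
        // anything enqueued after it may start (matters for out-of-order queues).
        clEnqueueBarrier(queue);

        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &n, NULL, 0, NULL, NULL);

        // Flush: hand the commands to the device, but do not wait for them.
        clFlush(queue);

        // ... the host can do unrelated work here ...

        // Finish: block the host until every command in the queue has completed.
        clFinish(queue);
    }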

11 OpenCL Events
  - The previous OpenCL synchronization functions only operate at a per-command-queue granularity.
  - OpenCL events are needed to synchronize at the granularity of individual commands (functions).
  - Explicit synchronization is required for out-of-order command queues and for multiple command queues.
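
  For example, a dependency between commands in two different queues of the same context can be expressed with an event (illustrative sketch; the names are assumptions):

    #include <CL/cl.h>

    // kernelB in queueB is made to wait for kernelA in queueA
    // (both queues belong to the same context).
    void event_sync_example(cl_command_queue queueA, cl_command_queue queueB,
                            cl_kernel kernelA, cl_kernel kernelB, size_t n)
    {
        cl_event done_a;

        // kernelA produces an event when it completes.
        clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &n, NULL,
                               0, NULL, &done_a);

        // kernelB lists that event in its wait list, so it is ordered after
        // kernelA even though it was enqueued in a different queue.
        clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &n, NULL,
                               1, &done_a, NULL);

        clFinish(queueB);
        clReleaseEvent(done_a);
    }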

12 Using User Events
  A simple example of user events being triggered and used in a command queue:

    // Create a user event that will gate the write of buf1
    user_event = clCreateUserEvent(ctx, NULL);
    clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
    // The write of buf1 is now enqueued and waiting on user_event
    X = foo();   // Lots of complicated host processing code
    clSetUserEventStatus(user_event, CL_COMPLETE);
    // The write to buf1 can now proceed, using the output of foo()

13 Multiple Devices

14 Multiple Devices
  - OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP, etc.).
  - OpenCL does not assume that data can be transferred directly between devices, so commands only exist to move data from host to device or from device to host. Copying from one device to another requires an intermediate transfer to the host.
  - OpenCL events are used to synchronize execution on different devices within a context.
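
  A sketch of what such a staged device-to-device copy might look like with two queues in one context (variable names and the staging strategy are assumptions; error handling omitted):

    #include <CL/cl.h>
    #include <stdlib.h>

    // Copy 'bytes' bytes from buf0 (on dev0) to buf1 (on dev1) by staging
    // through a host buffer, using an event to order the write after the read.
    void device_to_device_copy(cl_context ctx,
                               cl_device_id dev0, cl_device_id dev1,
                               cl_mem buf0, cl_mem buf1, size_t bytes)
    {
        cl_command_queue q0 = clCreateCommandQueue(ctx, dev0, 0, NULL);
        cl_command_queue q1 = clCreateCommandQueue(ctx, dev1, 0, NULL);

        void *staging = malloc(bytes);
        cl_event read_done;

        // Device 0 -> host ...
        clEnqueueReadBuffer(q0, buf0, CL_FALSE, 0, bytes, staging,
                            0, NULL, &read_done);

        // ... then host -> device 1, ordered after the read via the event.
        clEnqueueWriteBuffer(q1, buf1, CL_FALSE, 0, bytes, staging,
                             1, &read_done, NULL);

        clFinish(q1);
        clReleaseEvent(read_done);
        free(staging);
    }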

15 Compiling Code for Multiple Devices
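
  The slide content is a figure; the usual runtime-compilation path it depicts can be sketched roughly as follows (names are placeholders): one program object is built for all devices in the context, and kernels created from it can then be enqueued on any of those devices.

    #include <CL/cl.h>

    // Illustrative sketch: build one program for all listed devices, then
    // create a kernel object by name.
    cl_kernel build_for_devices(cl_context ctx, cl_device_id *devices,
                                cl_uint num_devices, const char *source,
                                const char *kernel_name)
    {
        cl_program program =
            clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);

        // Compile and link the OpenCL C source for every listed device.
        clBuildProgram(program, num_devices, devices, NULL, NULL, NULL);

        // The resulting kernel can later be enqueued on any of those devices.
        return clCreateKernel(program, kernel_name, NULL);
    }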

16 Charm++
  Source: tutorial slides from the Parallel Programming Laboratory, UIUC (Laxmikant Kale, Eric Bohm).

17 Virtualization: Object-based Decomposition
  - In MPI, the number of processes is typically equal to the number of processors.
  - Virtualization: divide the computation into a large number of pieces, independent of the number of processors and typically larger than the number of processors.
  - Let the system map objects to processors.

18 The Charm++ Model
  - Parallel objects (chares) communicate via asynchronous method invocations (entry methods).
  - The runtime system maps chares onto processors and schedules execution of entry methods.
  - Chares can be dynamically created on any available processor and can be accessed from remote processors.
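
  As a minimal sketch of how this looks in code (the module, chare, and method names are hypothetical), entry methods are declared in a Charm++ interface (.ci) file, and invoking one through a proxy is asynchronous:

    // worker.ci -- Charm++ interface file (hypothetical names)
    module worker {
      chare Worker {
        entry Worker();
        entry void doWork(int n);   // entry method: remotely, asynchronously invocable
      };
    };

    // worker.C
    #include "worker.decl.h"        // generated from worker.ci

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      void doWork(int n) { /* ... some computation ... */ }
    };

    // From any processor:
    //   CProxy_Worker w = CProxy_Worker::ckNew();  // runtime places the new chare
    //   w.doWork(10);                              // asynchronous method invocation

    #include "worker.def.h"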

19 Processor Virtualization
  (Figure: user view vs. system implementation.) The user is only concerned with the interaction between objects (virtual processors, VPs).

20 Adaptive Overlap via Data-driven Objects
  - Problem: processors wait too long at "receive" statements.
  - With virtualization, you get data-driven execution: there are multiple entities (objects, threads) on each processor.
  - No single object or thread holds up the processor; each one is "continued" when its data arrives.
  - Result: automatic and adaptive overlap of computation and communication.

21 Load Balancing

22 Using Dynamic Mapping to Processors
  - Migration: Charm++ objects can migrate from one processor to another. Migration creates a new object on the destination processor while destroying the original.
  - Migration is used for dynamic (and static, initial) load balancing.
  - Measurement-based, predictive strategies: based on object communication patterns and computational loads.
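
  To be migratable, a chare supplies a migration constructor and a pup() routine that serializes its state so the runtime can move it between processors; a minimal sketch with a hypothetical array element class (not from the slides):

    #include "worker.decl.h"   // assumed generated header for a 'Worker' chare array

    class Worker : public CBase_Worker {
      int    step;             // example state that must travel with the object
      double partial;
     public:
      Worker() : step(0), partial(0.0) {}
      Worker(CkMigrateMessage *m) {}     // migration constructor

      void pup(PUP::er &p) {             // called for both packing and unpacking
        CBase_Worker::pup(p);
        p | step;
        p | partial;
      }
    };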

23 Summary: Primary Advantages
  - Automatic mapping
  - Migration and load balancing
  - Asynchronous, message-driven communication
  - Computation-communication overlap

24 How does it look?

25 Asynchronous Hello World
  The program's asynchronous flow:
  - The mainchare sends a message to the Hello object.
  - The Hello object prints "Hello World!".
  - The Hello object sends a message back to the mainchare.
  - The mainchare quits the application.

26 Code and Workflow
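
  The slide's code is shown as an image; the following is a sketch of what the asynchronous Hello World described above typically looks like (file and class names follow common Charm++ tutorial conventions and are assumptions):

    // hello.ci (assumed)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      chare Hello {
        entry Hello();
        entry void sayHi();
      };
    };

    // hello.C (assumed)
    #include "hello.decl.h"

    /* readonly */ CProxy_Main mainProxy;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        mainProxy = thisProxy;
        CProxy_Hello hello = CProxy_Hello::ckNew();  // create the Hello chare
        hello.sayHi();                               // asynchronous invocation
      }
      void done() { CkExit(); }                      // mainchare quits the application
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      void sayHi() {
        CkPrintf("Hello World!\n");
        mainProxy.done();                            // message back to the mainchare
      }
    };

    #include "hello.def.h"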

27 Hello World: Array Version – Main Code

28 Array Code
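
  Slides 27 and 28 show the main and array code as images; building on the previous sketch, a version consistent with the output on the next slide might look like this (names and argument handling are assumptions): each array element prints its greeting and then "tells" the next element, and the last one reports back to the mainchare.

    // hello.ci additions for the array version (assumed)
    //   readonly int nElements;
    //   array [1D] Hello {
    //     entry Hello();
    //     entry void sayHi(int from);
    //   };

    #include "hello.decl.h"
    #include <cstdlib>

    /* readonly */ CProxy_Main mainProxy;
    /* readonly */ int nElements;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        nElements = (m->argc > 1) ? atoi(m->argv[1]) : 5;
        CkPrintf("Running \"Hello World\" with %d elements using %d processors.\n",
                 nElements, CkNumPes());
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(nElements);  // create the chare array
        arr[0].sayHi(-1);                                    // start the chain
      }
      void done() { CkExit(); }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void sayHi(int from) {
        CkPrintf("\"Hello\" from Hello chare #%d on processor %d (told by %d)\n",
                 thisIndex, CkMyPe(), from);
        if (thisIndex < nElements - 1)
          thisProxy[thisIndex + 1].sayHi(thisIndex);  // tell the next element
        else
          mainProxy.done();                           // last element reports back
      }
    };

    #include "hello.def.h"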

29 Result
  $ ./charmrun +p3 ./hello 10
  Running "Hello World" with 10 elements using 3 processors.
  "Hello" from Hello chare #0 on processor 0 (told by -1)
  "Hello" from Hello chare #1 on processor 0 (told by 0)
  "Hello" from Hello chare #2 on processor 0 (told by 1)
  "Hello" from Hello chare #3 on processor 0 (told by 2)
  "Hello" from Hello chare #4 on processor 1 (told by 3)
  "Hello" from Hello chare #5 on processor 1 (told by 4)
  "Hello" from Hello chare #6 on processor 1 (told by 5)
  "Hello" from Hello chare #7 on processor 2 (told by 6)
  "Hello" from Hello chare #8 on processor 2 (told by 7)
  "Hello" from Hello chare #9 on processor 2 (told by 8)

30

