Advanced / Other Programming Models Sathish Vadhiyar

OpenCL – Command Queues, Runtime Compilation, Multiple Devices
Sources: OpenCL overview from AMD; OpenCL learning kit from AMD

Introduction
- OpenCL is a programming framework for heterogeneous computing resources
- Resources include CPUs, GPUs, the Cell Broadband Engine, FPGAs, and DSPs
- Many similarities with CUDA

Command Queues
- A command queue is the mechanism for the host to request that an action be performed by the device
  - Perform a memory transfer, begin executing a kernel, etc.
  - Interesting concept: enqueuing kernels and satisfying dependencies using events
- A separate command queue is required for each device
- Commands within the queue can be synchronous or asynchronous
- Commands can execute in-order or out-of-order
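As a concrete illustration, here is a minimal host-side sketch of creating a queue and enqueuing commands on it (not from the original slides; it assumes a context, device, kernel, and buffers were created earlier, and all variable names are illustrative):

    cl_int err;
    size_t global = N;  // total number of work-items (assumed defined)

    // One queue per device; the out-of-order flag is optional
    cl_command_queue queue = clCreateCommandQueue(context, device,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    // Asynchronous (non-blocking) write, then a kernel launch
    err = clEnqueueWriteBuffer(queue, d_in, CL_FALSE, 0, nbytes, h_in,
                               0, NULL, NULL);
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                 0, NULL, NULL);

    // Synchronous read: CL_TRUE makes the host block until the data is back
    err = clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, nbytes, h_out,
                              0, NULL, NULL);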

Example – Image Rotation

See slides 8 and following of lecture 5 in the OpenCL University Kit.

Synchronization

Synchronization in OpenCL
- Synchronization is required if we use an out-of-order command queue or multiple command queues
- Coarse synchronization granularity: per command queue
- Finer synchronization granularity: per OpenCL operation, using events

OpenCL Command Queue Control
- Command queue synchronization methods work on a per-queue basis (see the sketch below)
- Flush: clFlush(cl_command_queue)
  - Sends all commands in the queue to the compute device
  - No guarantee that they will be complete when clFlush returns
- Finish: clFinish(cl_command_queue)
  - Waits for all commands in the command queue to complete before proceeding (the host blocks on this call)
- Barrier: clEnqueueBarrier(cl_command_queue)
  - Enqueues a synchronization point that ensures all prior commands in the queue have completed before any further commands execute
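A short sketch of how these three calls are typically used, continuing with the illustrative queue and sizes from the earlier example:

    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFlush(queue);            // submit to the device; returns immediately

    clEnqueueBarrier(queue);   // later commands in this queue wait for earlier ones
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL, 0, NULL, NULL);

    clFinish(queue);           // host blocks until every command has completed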

OpenCL Events
- The command-queue synchronization functions above operate only at per-queue granularity
- OpenCL events are needed to synchronize at the granularity of individual operations
- Explicit synchronization is required for
  - Out-of-order command queues
  - Multiple command queues
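For example, an event can order work across two queues; the following sketch (queue and kernel names are illustrative, not from the slides) makes kernel2 wait for kernel1:

    cl_event k1_done, k2_done;
    clEnqueueNDRangeKernel(queue0, kernel1, 1, NULL, &global, NULL,
                           0, NULL, &k1_done);
    // kernel2, on a different queue, starts only after kernel1 completes
    clEnqueueNDRangeKernel(queue1, kernel2, 1, NULL, &global, NULL,
                           1, &k1_done, &k2_done);
    clWaitForEvents(1, &k2_done);  // host blocks on one specific operation
    clReleaseEvent(k1_done);
    clReleaseEvent(k2_done);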

Using User Events
A simple example of a user event gating a command in the queue until the host triggers it:

    // Create a user event that will gate the write of buf1
    user_event = clCreateUserEvent(ctx, NULL);
    clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
    // The write of buf1 is now enqueued and waiting on user_event
    X = foo();  // lots of complicated host processing code
    clSetUserEventStatus(user_event, CL_COMPLETE);
    // The enqueued write of buf1 can now proceed, using the output of foo()

Multiple Devices

Multiple Devices
- OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP, etc.)
- OpenCL does not assume that data can be transferred directly between devices; commands exist only to move data from host to device or from device to host
  - Copying from one device to another therefore requires an intermediate transfer to the host (see the sketch below)
- OpenCL events are used to synchronize execution on different devices within a context
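A sketch of such a device-to-device copy staged through the host, with an event ordering the two halves (queue and buffer names are illustrative):

    cl_event read_done;
    // 1. Non-blocking read from device 0 into a host staging buffer
    clEnqueueReadBuffer(queue_dev0, buf_dev0, CL_FALSE, 0, nbytes,
                        h_staging, 0, NULL, &read_done);
    // 2. Write to device 1, ordered after the read via the event wait list
    clEnqueueWriteBuffer(queue_dev1, buf_dev1, CL_FALSE, 0, nbytes,
                         h_staging, 1, &read_done, NULL);
    clFinish(queue_dev1);
    clReleaseEvent(read_done);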

Compiling Code for Multiple Devices
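The content of this slide was not captured in the transcript. As a hedged sketch, a single program object can be compiled at runtime for all (or a subset of) the devices in a context:

    cl_int err;
    cl_program program = clCreateProgramWithSource(ctx, 1, &source_str,
                                                   NULL, &err);
    // num_devices = 0 and device_list = NULL builds for every device
    // associated with the context
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    // ...or restrict the build to an explicit subset of devices:
    // err = clBuildProgram(program, num_devices, device_list, NULL, NULL, NULL);

    // The resulting kernel can be enqueued on any device it was built for
    cl_kernel kernel = clCreateKernel(program, "vecadd", &err);  // name illustrative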

Charm++
Source: Tutorial slides from the Parallel Programming Lab, UIUC (authors: Laxmikant Kale, Eric Bohm)

Virtualization: Object-based Decomposition
- In MPI, the number of processes is typically equal to the number of processors
- Virtualization: divide the computation into a large number of pieces
  - Independent of the number of processors
  - Typically larger than the number of processors
- Let the system map objects to processors

The Charm++ Model
- Parallel objects (chares) communicate via asynchronous method invocations (entry methods)
- The runtime system maps chares onto processors and schedules execution of entry methods
- Chares can be dynamically created on any available processor
- Chares can be accessed from remote processors
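In code, an entry-method invocation goes through a proxy and returns immediately; a minimal sketch (Hello and sayHi are illustrative names, not from the slides):

    // Create a chare; the runtime chooses the processor it lives on
    CProxy_Hello helloProxy = CProxy_Hello::ckNew();

    // Asynchronous invocation: this call returns immediately, and the
    // runtime delivers the message wherever the chare happens to live
    helloProxy.sayHi(17);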

Processor Virtualization
[Figure: "User View" (a collection of interacting objects/VPs) vs. "System Implementation" (those objects mapped onto physical processors)]
The user is only concerned with the interaction between objects (VPs).

Adaptive Overlap via Data-driven Objects
- Problem: processors wait too long at "receive" statements
- With virtualization, you get data-driven execution
  - There are multiple entities (objects, threads) on each processor
  - No single object or thread holds up the processor
  - Each one is "continued" when its data arrives
- Result: automatic and adaptive overlap of computation and communication

Load Balancing

Using Dynamic Mapping to Processors
- Migration
  - Charm objects can migrate from one processor to another
  - Migration creates a new object on the destination processor while destroying the original
  - This is used for dynamic (and static, initial) load balancing
- Measurement-based, predictive strategies
  - Based on object communication patterns and computational loads
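Migration requires that an object can serialize its state; in Charm++ this is written as a pup() routine. A hedged sketch (Worker and its members are illustrative):

    #include "pup_stl.h"   // PUP support for STL containers

    class Worker : public CBase_Worker {
      std::vector<double> data;         // hypothetical per-object state
    public:
      Worker() { usesAtSync = true; }   // opt in to measurement-based load balancing
      Worker(CkMigrateMessage *m) {}    // constructor invoked on the destination processor
      void pup(PUP::er &p) {
        p | data;                       // packs on the source, unpacks on the destination
      }
    };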

Summary: Primary Advantages
- Automatic mapping
- Migration and load balancing
- Asynchronous and message-driven communication
- Computation-communication overlap

How does it look?

Asynchronous Hello World
The program's asynchronous flow:
- The mainchare sends a message to the Hello object
- The Hello object prints "Hello World!"
- The Hello object sends a message back to the mainchare
- The mainchare quits the application

Code and Workflow
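The code on this slide was shown as an image and is not in the transcript; below is a hedged reconstruction of the asynchronous flow just described (exact names may differ from the original slides):

    // hello.ci (interface file)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      chare Hello {
        entry Hello();
        entry void sayHi();
      };
    };

    // hello.C
    #include "hello.decl.h"

    CProxy_Main mainProxy;   // set once in Main, then readable everywhere

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        mainProxy = thisProxy;
        CProxy_Hello hello = CProxy_Hello::ckNew();  // create the Hello chare
        hello.sayHi();                               // asynchronous message
      }
      void done() { CkExit(); }                      // mainchare quits the application
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      void sayHi() {
        CkPrintf("Hello World!\n");
        mainProxy.done();   // message back to the mainchare
      }
    };

    #include "hello.def.h"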

Hello World: Array Version – Main Code
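This slide's code was also an image; a hedged reconstruction of the main code, written to be consistent with the result shown two slides below:

    // hello.ci additions (sketch): a 1D chare array plus readonly globals
    //   readonly CProxy_Main mainProxy;
    //   readonly int nElements;
    //   array [1D] Hello { entry Hello(); entry void sayHi(int from); };

    CProxy_Main mainProxy;
    int nElements;

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        nElements = atoi(m->argv[1]);   // e.g. 10 in the run shown later
        CkPrintf("Running \"Hello World\" with %d elements using %d processors.\n",
                 nElements, CkNumPes());
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(nElements);  // create the array
        arr[0].sayHi(-1);                                   // start the chain
      }
      void done() { CkExit(); }
    };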

Array Code
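And a matching sketch of the array element: each element prints its greeting, forwards the message to the next element, and the last one tells the mainchare to quit:

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void sayHi(int from) {
        CkPrintf("\"Hello\" from Hello chare #%d on processor %d (told by %d)\n",
                 thisIndex, CkMyPe(), from);
        if (thisIndex < nElements - 1)
          thisProxy[thisIndex + 1].sayHi(thisIndex);  // forward to the next element
        else
          mainProxy.done();                           // last element: tell Main to quit
      }
    };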

Result

    $ ./charmrun +p3 ./hello 10
    Running "Hello World" with 10 elements using 3 processors.
    "Hello" from Hello chare #0 on processor 0 (told by -1)
    "Hello" from Hello chare #1 on processor 0 (told by 0)
    "Hello" from Hello chare #2 on processor 0 (told by 1)
    "Hello" from Hello chare #3 on processor 0 (told by 2)
    "Hello" from Hello chare #4 on processor 1 (told by 3)
    "Hello" from Hello chare #5 on processor 1 (told by 4)
    "Hello" from Hello chare #6 on processor 1 (told by 5)
    "Hello" from Hello chare #7 on processor 2 (told by 6)
    "Hello" from Hello chare #8 on processor 2 (told by 7)
    "Hello" from Hello chare #9 on processor 2 (told by 8)