Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
Prasanna Pandit and R. Govindarajan
Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore
CGO 2014

Outline
Introduction
The Problem
The FluidiCL Runtime
Optimizations
Results
Conclusion

Introduction

Introduction
Heterogeneous devices, such as multicore CPUs and GPUs, have become common in today's computing systems
Each comes with its own programming model: OpenMP, CUDA, MPI, OpenCL
Heterogeneous programming models like OpenCL still require manual effort from the programmer
How can a single parallel program choose the right device, and utilize all available devices where possible?

OpenCL
An open standard for heterogeneous computing
Support for different devices: CPU, GPU, accelerators, FPGAs...
Program portability, as long as the device vendor supports it
The programmer selects the device to use
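To make the last point concrete, here is a minimal host-side sketch, not from the talk, of how a programmer explicitly picks a GPU device in OpenCL (function name is ours; error checking omitted):

#include <CL/cl.h>

cl_context create_gpu_context(void)
{
    cl_platform_id platform;
    cl_device_id device;

    /* Take the first platform and ask it for one GPU device. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* The context is tied to the device the programmer chose. */
    return clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
}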

OpenCL Host and Device
The diagram shows the OpenCL device model: there is a separate host program and a device program, and the device program is called a kernel. The diagram also shows the logical entities in a device, compute units and processing elements, which are where work-groups and work-items run. Source: Khronos Group

OpenCL
Expects the programmer to specify parallelism in terms of:
Work-groups
Work-items
The OpenCL memory consistency model does not guarantee that writes by a work-item will be visible to work-items of other work-groups
A work-group can therefore be used as a unit of scheduling across devices
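As an illustration (ours, not from the talk), a trivial kernel where each work-item handles one element; the work-group it belongs to is the granularity at which FluidiCL later schedules work:

__kernel void scale(__global float *a, float factor)
{
    size_t gid = get_global_id(0);  /* unique index of this work-item */
    /* get_group_id(0) identifies the enclosing work-group -- the unit
       that can be scheduled across the CPU and the GPU */
    a[gid] *= factor;
}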

GPU Architecture
Let's look at the GPU architecture. What's shown here is a streaming multiprocessor (SM); a GPU has tens of such SMs. Each SM has many streaming processors, shown as cores in green. A large register file maintains the context of thousands of in-flight threads, and a small shared memory and L1 cache are present. High-end GPUs typically have discrete memory, which means data has to be moved to the device before the computation. Source: NVIDIA

The Problem

The Problem
CPUs and GPUs: different architectures, varying performance
CPUs have fewer cores and large caches, and give good single-thread performance
GPUs emphasize high throughput over single-thread performance
Data transfer overheads for discrete GPUs
How do I run every OpenCL kernel across both the CPU and the GPU to get the best possible performance?

The Problem
Which device should my OpenCL program run on? Different applications run better on different devices.
[Charts: ATAX and SYR2K achieve their best performance on different devices]

The Problem
Which device should my OpenCL program run on? The same application runs differently on different inputs.
[Charts: SYRK with input size 1100 vs. SYRK with input size 2048]

The Problem
What if my program has multiple kernels, and each performs best on a different device? Where the data resides is important!

Solution
We want a framework that can:
identify the best-performing work allocation
dynamically adapt to different input sizes
factor in data transfer costs
transparently handle data merging
support multiple kernels
without requiring prior training, profiling, or programmer annotations

The FluidiCL Runtime

Overview
[Diagram: without FluidiCL, the application talks to the OpenCL runtime, which drives a single device]

Overview
[Diagram: with FluidiCL, the application talks to the FluidiCL runtime, which drives two OpenCL runtimes, one for the CPU and one for the GPU]

Kernel Execution
Data transfers go to both devices
Two additional buffers on the GPU: one for data coming from the CPU, one to hold an original copy
[Diagram: the host's data is placed on the GPU as GPUData, alongside the Twin (original copy) and CPUData buffers]
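A possible setup of these buffers, sketched with hypothetical names matching the diagram (GPUData, Twin, CPUData) and no error checking; this is our reconstruction, not FluidiCL's actual code:

#include <CL/cl.h>

static void setup_fluidicl_buffers(cl_context ctx, cl_command_queue q,
                                   size_t size, const void *host_ptr,
                                   cl_mem *gpudata, cl_mem *twin,
                                   cl_mem *cpudata)
{
    *gpudata = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
    *twin    = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
    *cpudata = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);

    /* Send the input once, then copy it device-to-device into the twin,
       preserving an untouched original copy on the GPU. */
    clEnqueueWriteBuffer(q, *gpudata, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);
    clEnqueueCopyBuffer(q, *gpudata, *twin, 0, 0, size, 0, NULL, NULL);
}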

Modified Kernels
[Flowchart: the CPU kernel first checks whether the work-group is within its assigned range, and executes the kernel code only if it is; the GPU kernel first checks whether the work-group has already finished on the CPU, exiting if it has and executing the kernel code otherwise]
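Our reconstruction of the flowchart as code, with hypothetical parameter names; the actual FluidiCL transformation may differ in detail:

/* CPU-side kernel: run only work-groups in [wg_start, wg_end]. */
__kernel void someKernel_cpu(__global int *A, int n, int wg_start, int wg_end)
{
    int wg = get_group_id(0);
    if (wg < wg_start || wg > wg_end)
        return;                     /* outside this device's range */
    /* ... original kernel body ... */
}

/* GPU-side kernel: skip work-groups already finished on the CPU. */
__kernel void someKernel_gpu(__global int *A, int n,
                             __global volatile int *cpu_status)
{
    if (cpu_status[get_group_id(0)] != 0)
        return;                     /* CPU result will be used instead */
    /* ... original kernel body ... */
}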

Kernel Execution
The CPU subkernel is considered complete only if its data reaches the GPU
Duplicate execution of work-groups is legal as long as results are chosen carefully
[Diagram: for a 500-work-group kernel, the GPU works upward from work-group 0 (0-99, 100-199, 200-299, 300-349) while the CPU executes subkernels from the other end (450-499, 400-449, 350-399, 300-349), sending Data+Status to the GPU after each subkernel; the results are then merged]
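One way to picture the host side of this scheme is the schematic loop below; all helper names are ours (not FluidiCL's actual API), and the chunk size is taken from the diagram:

/* Hypothetical helpers, declared only so the sketch compiles. */
extern void run_cpu_subkernel(int first_wg, int last_wg);
extern void send_data_and_status_to_gpu(int first_wg);
extern int  gpu_has_reached(int wg);       /* nonzero if GPU passed wg */
extern void merge_results(void);

#define CHUNK 50   /* subkernel size shown in the diagram */

/* The CPU claims chunks from the top of the work-group range while the
 * GPU was launched over the whole range from the bottom; after each
 * chunk, results and a status word are flushed to the GPU. */
static void cpu_side_loop(int total_wgs)
{
    for (int hi = total_wgs - CHUNK; hi >= 0; hi -= CHUNK) {
        if (gpu_has_reached(hi))           /* the two fronts have met */
            break;
        run_cpu_subkernel(hi, hi + CHUNK - 1);
        send_data_and_status_to_gpu(hi);   /* Data + Status transfer  */
    }
    merge_results();                       /* final data merge        */
}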

Data Merging
[Diagram: CPUData, GPUData, and the Twin copy are combined to produce the merged result]
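A plausible reading of the merge, sketched as a kernel (an assumption on our part; the talk does not show the merge code): any location where the CPU's result differs from the untouched twin was updated by the CPU, so it overwrites the GPU's copy:

__kernel void merge(__global int *gpudata, __global const int *twin,
                    __global const int *cpudata)
{
    size_t i = get_global_id(0);
    if (cpudata[i] != twin[i])      /* CPU wrote this location */
        gpudata[i] = cpudata[i];
}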

Optimizations

GPU Work-group Aborts
Work-group aborts in loops are necessary to reduce duplicate execution
Code to check the CPU status is inserted inside innermost loops

__kernel void someKernel(__global int *A, int n)
{
    int i;
    bool done = checkCPUExecutionStatus();
    if (done == true)
        return;
    for (i = 0; i < n; i++) {
        // ...
    }
}
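The slide leaves checkCPUExecutionStatus undefined; one plausible implementation (an assumption on our part) polls a per-work-group status word that the runtime updates as CPU results arrive on the GPU:

/* Hypothetical helper (the real one is not shown in the talk). */
bool checkCPUExecutionStatus(__global volatile int *status)
{
    return status[get_group_id(0)] != 0;
}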

Loop Unrolling
Inserting checks inside loops can inhibit the loop unrolling optimization by the GPU compiler
Restructure the loop to perform the unrolling manually; this keeps the GPU's pipelined ALUs occupied

int i;
for (i = 0; i < n; i += (1 + UNROLL_FACTOR)) {
    int k1, ur = UNROLL_FACTOR;
    bool done = checkCPUExecutionStatus();
    if (done == true)
        return;
    if ((i + ur) >= n)
        ur = (n - 1) - i;
    for (k1 = i; k1 <= (i + ur); k1++) {
        // Expression where i is replaced with k1
    }
}

Other Optimizations
Buffer Management:
  Maintain a pool of buffers for additional FluidiCL data
  Create a new buffer only if no existing one is big enough
  Reclaim older buffers
Data Location Tracking: where is the up-to-date data?
  If the CPU has it, no transfer is required when the host requests it:
    if the CPU executed the entire grid
    if device-to-host transfers have completed
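A minimal sketch of the buffer-pool idea (data structure and names are ours): reuse a free buffer that is big enough, otherwise allocate a new one:

#include <CL/cl.h>
#include <stddef.h>

#define POOL_SIZE 16

typedef struct { cl_mem buf; size_t size; int in_use; } PoolEntry;
static PoolEntry pool[POOL_SIZE];

static cl_mem pool_acquire(cl_context ctx, size_t size)
{
    /* First pass: recycle a free buffer that is large enough. */
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].buf && !pool[i].in_use && pool[i].size >= size) {
            pool[i].in_use = 1;
            return pool[i].buf;
        }
    }
    /* Second pass: allocate a fresh buffer in an empty slot. */
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].buf) {
            pool[i].buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size,
                                         NULL, NULL);
            pool[i].size = size;
            pool[i].in_use = 1;
            return pool[i].buf;
        }
    }
    return NULL;   /* pool full: older buffers must be reclaimed */
}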

Results

Experimental Setup
CPU: Intel Xeon W3550 (4 cores, 8 threads)
GPU: NVIDIA Tesla C2070 (Fermi)
System Memory: 16 GB
GPU Memory: 5 GB
OpenCL CPU Runtime: AMD APP 2.7
OpenCL GPU Runtime: NVIDIA CUDA 4.2

Benchmarks
Benchmarks from the Polybench GPU suite

Benchmark   Input Size       No. of kernels   No. of work-groups
ATAX        (28672, 28672)   2                896, 896
BICG        (24576, 24576)   2                96, 96
CORR        (2048, 2048)     4                8, 8, 16384, 8
GESUMMV     (20480)          1                80
SYR2K       (1000, 1000)     1                4000
SYRK        (2500, 2500)     1                24727

Overall Results
With all optimizations enabled, FluidiCL is 64% faster than running on the GPU alone and 88% faster than running on the CPU alone

Different Input Sizes SYRK benchmark: Close to 2x faster than the best of CPU and GPU on average across different input sizes

Work-group Aborts inside Loops
Increase performance by 27% on average

Effect of Loop Unrolling
Aborts inside loops included, but without unrolling
Performance would suffer by up to 2x if loop unrolling were not done

Comparison to SOCL (StarPU)
FluidiCL performs 26% faster than the profiled StarPU DMDA scheduler of SOCL

Related Work
Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Kim et al. [PPoPP 2011]
  Considers multiple identical GPUs
  Performs buffer range analysis to identify partitioning
  Works for regular programs
A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL, Grewe et al. [CC 2011]
  Uses a machine learning model
  Decides at a kernel level
StarPU Extension for OpenCL, Sylvain Henry
  Adds OpenCL support for StarPU
  Requires profiling for each application
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations, V. T. Ravi et al. [ICS 2010]
  Work-sharing approach for generalized reduction applications

Conclusion
A heterogeneous OpenCL runtime called FluidiCL, which performs runtime work distribution and data management
A scheme that does not require prior profiling or training
Can adapt to different inputs
CPU + Fermi GPU: outperforms the single best device by 14% overall

Future Work
Use compiler analysis to reduce data transfer
Explore the scheduling idea further:
  on integrated CPU-GPU platforms
  on multiple accelerators

Thank you! Questions?