Synergistic Execution of Stream Programs on Multicores with Accelerators
Abhishek Udupa et al., Indian Institute of Science



Abstract
- Orchestrates the execution of a stream program on a multicore platform with an accelerator (GPU or CellBE)
- Formulates the partitioning of work between the CPU cores and the GPU as an integer linear program (ILP), considering
  - the latencies for data transfer, and
  - the required data layout transformation
- Also proposes a heuristic partitioning algorithm
- Reports a speedup of 50.96X over single-threaded CPU execution

Challenges
- The CPU cores and the GPU operate on separate address spaces, requiring explicit DMA transfers to move data into or out of the GPU address space
- The communication buffers between StreamIt filters need to be laid out in a specific fashion
  - Accesses need to be coalesced for the GPU
  - But the coalesced layout causes cache misses on the CPU
- Work partitioning between the CPU and the GPU is complicated by
  - the DMA and buffer-transformation latencies, and
  - the fact that filters have non-identical execution times on the two devices

Organization of the NVIDIA GeForce 8800 Series of GPUs
[Figures: architecture of the GeForce 8800 GPU; architecture of an individual SM]

CUDA Memory Model
- All the threads of up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- A kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid
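As a minimal sketch of this hierarchy (not from the paper; the kernel, buffer size, and launch configuration are invented for illustration), the following CUDA program dispatches one kernel call as exactly one grid of thread blocks:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of the buffer.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // One kernel call = one grid. Here each block has 128 threads, and the
    // grid has enough blocks to cover all n elements; the hardware assigns
    // blocks to SMs (on the GeForce 8800, up to 8 blocks per SM).
    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```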

Buffer Layout Consideration
[Table: execution time in ms for the serial vs. shuffled buffer layouts on the CPU and the GPU; the numeric values did not survive extraction]
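The layout trade-off can be made concrete with a hedged sketch (not the paper's code): consecutive threads touching consecutive addresses coalesce into few memory transactions on the GPU, while a strided layout scatters each warp's accesses.

```cuda
#include <cuda_runtime.h>

// Coalesced ("shuffled") access: thread t reads element t, so the threads
// of a warp touch consecutive addresses and the loads coalesce.
__global__ void readCoalesced(const float *buf, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = buf[t];
}

// Strided ("serial") access: thread t reads every stride-th element, so a
// warp's accesses are scattered and cannot be coalesced.
__global__ void readStrided(const float *buf, float *out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = buf[(t * stride) % n];
}

int main() {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    dim3 block(128), grid((n + block.x - 1) / block.x);
    readCoalesced<<<grid, block>>>(buf, out, n);    // fast path on the GPU
    readStrided<<<grid, block>>>(buf, out, n, 32);  // slow path on the GPU
    cudaDeviceSynchronize();
    cudaFree(buf); cudaFree(out);
    return 0;
}
```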

A Motivating Example
- Assume the steady-state multiplicity of each actor is one
- B is a stateful actor, which must run on the CPU
- Shuffle and deshuffle costs are assumed to be zero
[Figure: the original stream graph, a pipeline A -> B -> C -> D -> E, with each actor annotated with its CPU and GPU execution costs]

Naïve Partitioning
- Naïvely map the stateful filter B onto the CPU and execute all the other filters on the GPU
- CPU Load = 20, GPU Load = 75, DMA Load = 30
- MII = 75
[Figure: the original stream graph and the naïve partitioning, with B on the CPU and A, C, D, E on the GPU]

Greedy Partitioning
- Greedily move each actor to the device (CPU or GPU) where it is most beneficial to execute
- CPU Load = 40, GPU Load = 35, DMA Load = 70
- MII = 70
[Figure: the original stream graph and the greedy partitioning, with A, B, E on the CPU and C, D on the GPU]

Optimal Partitioning
- CPU Load = 45, GPU Load = 40, DMA Load = 40
- MII = 45
[Figure: the original stream graph and the optimal partitioning, with B, D, E on the CPU and A, C on the GPU]
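The MII values above are just the bottleneck resource of each partition: the initiation interval of the software-pipelined schedule can be no smaller than the busiest of the CPU, the GPU, and the DMA engine. A minimal host-side sketch, with the loads hard-coded from these slides:

```cuda
#include <algorithm>
#include <cstdio>

// The minimum initiation interval (MII) of a partition is bounded below by
// its most heavily loaded resource: CPU, GPU, or DMA engine.
static int mii(int cpuLoad, int gpuLoad, int dmaLoad) {
    return std::max(cpuLoad, std::max(gpuLoad, dmaLoad));
}

int main() {
    printf("naive   MII = %d\n", mii(20, 75, 30));  // 75
    printf("greedy  MII = %d\n", mii(40, 35, 70));  // 70
    printf("optimal MII = %d\n", mii(45, 40, 40));  // 45
    return 0;
}
```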

Software Pipelined Kernel
[Figure: software-pipelined kernel schedule]

Compilation Process
[Figure: overview of the compilation flow]

Overview of the Proposed Method
- To obtain performance, increase the multiplicities of the steady state
- All filters that execute on the CPU are assumed to execute 128 times per invocation
  - This reduces the complexity of the formulation
  - 128 is a common factor of the candidate GPU thread counts, i.e. 128, 256, 384, 512
- Identify the number of instances of each actor
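A tiny sketch of the multiplicity scaling (the per-actor multiplicities here are invented; only the factor of 128 comes from the slide):

```cuda
#include <cstdio>

int main() {
    // Hypothetical steady-state multiplicities for actors A..E.
    const char *name[] = {"A", "B", "C", "D", "E"};
    const int multiplicity[] = {1, 2, 2, 4, 1};
    const int scale = 128;  // common factor of 128, 256, 384, 512 GPU threads

    for (int i = 0; i < 5; ++i)
        printf("%s executes %d times per invocation\n",
               name[i], multiplicity[i] * scale);
    return 0;
}
```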

Partitioning: Two Steps
- Task partitioning [ILP or heuristic algorithm] (see the sketch after this list)
  - Partition the stream graph into two sets, one for the GPU and one for the CPU cores
  - A filter (with all its instances) executes either on the CPU cores or on the GPU [reduced complexity]
- Instance partitioning [ILP]
  - Partition the instances of each filter across the CPU cores or across the SMs of the GPU
  - To obtain performance, increase the multiplicities of the steady state
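To ground the task-partitioning step, here is a hedged brute-force sketch over the five-actor example (the paper formulates this as an ILP rather than enumerating assignments, and the per-edge DMA cost below is a made-up placeholder, since real DMA costs depend on buffer sizes):

```cuda
#include <algorithm>
#include <cstdio>

int main() {
    // Pipeline A->B->C->D->E with CPU/GPU costs loosely taken from the
    // motivating example; B (index 1) is stateful and must stay on the CPU.
    const char *name = "ABCDE";
    const int cpu[] = {10, 20, 80, 15, 10};
    const int gpu[] = {20, 0, 20, 15, 20};  // B's GPU entry is never used
    const int edgeDma = 10;                 // hypothetical cost per crossing edge

    int bestMii = 1 << 30, bestMask = 0;
    for (int mask = 0; mask < 32; ++mask) {  // bit i set => actor i on the GPU
        if (mask & 2) continue;              // keep the stateful B on the CPU
        int cpuLoad = 0, gpuLoad = 0, dmaLoad = 0;
        for (int i = 0; i < 5; ++i) {
            if (mask >> i & 1) gpuLoad += gpu[i]; else cpuLoad += cpu[i];
        }
        for (int i = 0; i < 4; ++i)          // edges i -> i+1 crossing devices
            if (((mask >> i) & 1) != ((mask >> (i + 1)) & 1)) dmaLoad += edgeDma;
        int m = std::max(cpuLoad, std::max(gpuLoad, dmaLoad));
        if (m < bestMii) { bestMii = m; bestMask = mask; }
    }
    printf("best MII = %d, GPU actors:", bestMii);
    for (int i = 0; i < 5; ++i)
        if (bestMask >> i & 1) printf(" %c", name[i]);
    printf("\n");
    return 0;
}
```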

DMA Transfers and Shuffle/Deshuffle Operations
- Whenever data is transferred from the CPU to the GPU,
  - a DMA transfer from the CPU to the GPU takes place, and
  - a shuffle operation is then performed on the GPU
- For GPU-to-CPU transfers,
  - a deshuffle is performed on the GPU, and
  - then the DMA transfer takes place
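A minimal sketch of the two sequences (the shuffle kernel below is a hypothetical transpose-style reordering standing in for the paper's actual layout transformation):

```cuda
#include <cuda_runtime.h>

// Hypothetical shuffle: reorder a rows x cols buffer from row-major
// (CPU-friendly) to column-major (coalesced for GPU threads). Calling it
// with the dimensions swapped performs the inverse (the "deshuffle").
__global__ void shuffle(const float *in, float *out, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        int r = i / cols, c = i % cols;
        out[c * rows + r] = in[i];
    }
}

int main() {
    const int rows = 256, cols = 128, n = rows * cols;
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;
    float *dRaw, *dShuffled;
    cudaMalloc(&dRaw, n * sizeof(float));
    cudaMalloc(&dShuffled, n * sizeof(float));

    // CPU -> GPU: DMA transfer first, then shuffle on the GPU.
    cudaMemcpy(dRaw, host, n * sizeof(float), cudaMemcpyHostToDevice);
    shuffle<<<(n + 127) / 128, 128>>>(dRaw, dShuffled, rows, cols);

    // GPU -> CPU: deshuffle on the GPU first, then DMA the data out.
    shuffle<<<(n + 127) / 128, 128>>>(dShuffled, dRaw, cols, rows);
    cudaMemcpy(host, dRaw, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dRaw); cudaFree(dShuffled); delete[] host;
    return 0;
}
```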

Orchestrate the Execution
- Orchestrate the execution with simple modulo scheduling of
  - the filters,
  - the DMA transfers, and
  - the shuffle and deshuffle operations
- The shuffle and deshuffle operations are always assigned to the GPU

Stage Assignment
[Figure: fission and processor assignment followed by the stage assignment; actor B is fissed into B1 and B2 with split (S) and join (J) nodes, and the actors, DMA transfers, and split/join operations are placed into pipeline stages 0 through 4]
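As a hedged sketch of how a stage-assigned schedule executes (the three-stage split and the function names are invented; the loop sequentially emulates a pipeline whose stages would actually run concurrently on the CPU, the DMA engine, and the GPU):

```cuda
#include <cstdio>

// Invented per-stage work for a 3-stage software pipeline:
// stage 0 = CPU filters, stage 1 = DMA + shuffle, stage 2 = GPU filters.
static void runCpuFilters(int iter)    { printf("iter %d: CPU filters\n", iter); }
static void runDmaAndShuffle(int iter) { printf("iter %d: DMA + shuffle\n", iter); }
static void runGpuFilters(int iter)    { printf("iter %d: GPU filters\n", iter); }

int main() {
    const int stages = 3, iters = 6;
    // In steady state, iteration t's stage-0 work overlaps iteration t-1's
    // stage-1 work and iteration t-2's stage-2 work. This loop emulates
    // that schedule sequentially, including the prologue and epilogue.
    for (int t = 0; t < iters + stages - 1; ++t) {
        if (t < iters)                   runCpuFilters(t);
        if (t - 1 >= 0 && t - 1 < iters) runDmaAndShuffle(t - 1);
        if (t - 2 >= 0 && t - 2 < iters) runGpuFilters(t - 2);
    }
    return 0;
}
```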

Heuristic Algorithm
- Intuitively, the nodes assigned to the CPU should be the nodes most beneficial to execute on the CPU
- Define a per-node benefit metric [definition lost in extraction]
- The intuition is that
  - the nodes with the highest benefit are assigned to the CPU, and
  - some of their neighbouring nodes are assigned to the CPU as well
- DMA, shuffle, and deshuffle costs are taken into account (a sketch follows below)
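A hedged sketch of such a heuristic (the benefit metric, the fill rule, and the omission of DMA/shuffle costs are all simplifications of mine, since the slide's definition did not survive extraction; the costs are loosely adapted from the motivating example):

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

struct Actor { const char *name; int cpu, gpu; bool stateful; };

int main() {
    std::vector<Actor> a = {{"A", 10, 20, false}, {"B", 20, 0, true},
                            {"C", 80, 20, false}, {"D", 15, 15, false},
                            {"E", 10, 20, false}};

    // Assumed benefit of CPU placement: how much cheaper the actor runs on
    // the CPU than on the GPU; stateful actors get unbounded benefit.
    auto benefit = [&](int i) {
        return a[i].stateful ? (1 << 30) : a[i].gpu - a[i].cpu;
    };
    std::vector<int> order;
    for (int i = 0; i < (int)a.size(); ++i) order.push_back(i);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return benefit(x) > benefit(y); });

    // Take the highest-benefit actors onto the CPU while that still lowers
    // the bottleneck (i.e., CPU load stays below the remaining GPU load).
    int cpuLoad = 0, gpuLoad = 0;
    for (const Actor &f : a) gpuLoad += f.gpu;
    for (int i : order) {
        if (a[i].stateful || cpuLoad + a[i].cpu <= gpuLoad - a[i].gpu) {
            cpuLoad += a[i].cpu; gpuLoad -= a[i].gpu;
            printf("%s -> CPU\n", a[i].name);
        } else {
            printf("%s -> GPU\n", a[i].name);
        }
    }
    printf("CPU load = %d, GPU load = %d\n", cpuLoad, gpuLoad);
    return 0;
}
```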

Performance of Heuristic Partitioning
[Table: initiation interval II (ns) under the ILP and heuristic partitioners, and the resulting % degradation, for the benchmarks Bitonic, Bitonic-Rec, ChannelVocoder, DCT, DES, FFT-C, FFT-F, Filterbank, FMRadio, MatrixMult, MPEG2Subset, and TDE; the numeric values did not survive extraction]

Performance of the ILP vs. Heuristic Partitioner
[Figure]

Comparison of Synergistic Execution with Other Schemes
[Figure]

Questions?