Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.

Slides:



Advertisements
Similar presentations
Workshop on HPC in India Programming Models, Languages, and Compilation for Accelerator-Based Architectures R. Govindarajan SERC, IISc
Advertisements

Accelerators for HPC: Programming Models Accelerators for HPC: StreamIt on GPU High Performance Applications on Heterogeneous Windows Clusters
ADSP Lecture2 - Unfolding VLSI Signal Processing Lecture 2 Unfolding Transformation.
ECE-777 System Level Design and Automation Hardware/Software Co-design
Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Introduction to Parallel Computing
A Dataflow Programming Language and Its Compiler for Streaming Systems
Implementation Approaches with FPGAs Compile-time reconfiguration (CTR) CTR is a static implementation strategy where each application consists of one.
ACCELERATING MATRIX LANGUAGES WITH THE CELL BROADBAND ENGINE Raymes Khoury The University of Sydney.
Parallel Database Systems The Future Of High Performance Database Systems David Dewitt and Jim Gray 1992 Presented By – Ajith Karimpana.
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Phased Scheduling of Stream Programs Michal Karczmarek, William Thies and Saman Amarasinghe MIT LCS.
University of Michigan Electrical Engineering and Computer Science MacroSS: Macro-SIMDization of Streaming Applications Amir Hormati*, Yoonseo Choi ‡,
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware
Synergistic Execution of Stream Programs on Multicores with Accelerators Abhishek Udupa et. al. Indian Institute of Science.
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
© John A. Stratton, 2014 CS 395 CUDA Lecture 6 Thread Coarsening and Register Tiling 1.
VLSI DSP 2008Y.T. Hwang3-1 Chapter 3 Algorithm Representation & Iteration Bound.
Optimus: Efficient Realization of Streaming Applications on FPGAs University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke IBM Research: David.
University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.
1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts.
Jan Programming Models for Accelerator-Based Architectures R. Govindarajan HPC Lab,SERC, IISc
Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language.
October 26, 2006 Parallel Image Processing Programming and Architecture IST PhD Lunch Seminar Wouter Caarls Quantitative Imaging Group.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Communication Overhead Estimation on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
FPGA FPGA2  A heterogeneous network of workstations (NOW)  FPGAs are expensive, available on some hosts but not others  NOW provide coarse- grained.
1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.
Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.
EECS 583 – Class 20 Research Topic 2: Stream Compilation, GPU Compilation University of Michigan December 3, 2012 Guest Speakers Today: Daya Khudia and.
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.
A Reconfigurable Architecture for Load-Balanced Rendering Graphics Hardware July 31, 2005, Los Angeles, CA Jiawen Chen Michael I. Gordon William Thies.
A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.
Task Graph Scheduling for RTR Paper Review By Gregor Scott.
Modeling and Analysis of Printer Data Paths using Synchronous Data Flow Graphs in Octopus Ashwini Moily Under the supervision of Dr. Lou Somers, Prof.
1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of.
StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, Michael Gordon, David Maze, Jasper Lin, Ali Meli, Andrew Lamb, Chris.
Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling University of Michigan November 30, 2011 Guest Speaker Today:
Michael I. Gordon, William Thies, and Saman Amarasinghe
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore.
University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.
Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe.
Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.
StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.
VEAL: Virtualized Execution Accelerator for Loops Nate Clark 1, Amir Hormati 2, Scott Mahlke 2 1 Georgia Tech., 2 U. Michigan.
3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,
Martin Kruliš by Martin Kruliš (v1.0)1.
High Performance Embedded Computing © 2007 Elsevier Lecture 4: Models of Computation Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.
Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer.
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
StreamIt: A Language for Streaming Applications
Code Optimization.
Linear Filters in StreamIt
Teleport Messaging for Distributed Stream Programs
Adaptation Behavior of Pipelined Adaptive Filters
Cache Aware Optimization of Stream Programs
High Performance Stream Processing for Mobile Sensing Applications
StreamIt: High-Level Stream Programming on Raw
STUDY AND IMPLEMENTATION
EE 4xx: Computer Architecture and Performance Programming
Mattan Erez The University of Texas at Austin
Presentation transcript:

Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney

2 The Free Lunch Is Over Uniprocessor hits the physical limit  Multicore, many core systems Programming challenges now  Multiple flows of control and memories  Finding algorithm that can run in parallel  Correctness of the program  Synchronization Issues  Debugging the program 2

3 Stream Programming Paradigm It expresses parallelism inherently It leverages program structure to discover parallelism and delivers high performance It is structured around notion of a “stream” A streaming computation represents  A sequence of transformations on the data streams  A stream program is the composition of filters into a stream graph Filter Input channels Output channels

4 StreamIt Language An implementation of stream programming It exposes parallelism It is architecture independent Modular parallel computation may be any StreamIt language construct joiner splitter pipeline feedback loop joiner splitter splitjoin filter

5 A Simple Filter in StreamIt int->int filter A { init { // empty } work pop 1 peek 2 push 1 { push(peek(0) + peek(1)); pop(); } A

StreamIt Example float->float pipeline Example() { add A(); add B(); add splitjoin { split duplicate; add D(); add E(); join roundrobin; } add G(); add H(); } 6 G H A B D split join E z

Research Question? How to parallelize the stream program  Optimize throughput Synchronous model  Describes bandwidths of streams in steady state Bottleneck actor  An actor constrains “z” 7 G H A B D split join E z

8 8 Static Translation of Stream Programs We propose  A simple quantitative analysis to resolve bottlenecks in stream programs  Actors mapping technique to a parallel system  Scheduling of mapped actors in each processor Our goal  To statically optimize the throughput of a stream program

Finding Closed Form We assume bandwidth functions are “linear functions” We have a system of simultaneous linear equations The output are shown as weights We express output of each actor as a function of “z” 9 G H A B D split join E z

10 Mapping Actors to Processors An NP-hard problem  Takes too long time Approximation algorithm  Much faster O(n log n)  Non-optimal solution We formulate Integer Linear Programming problem considering  Input, output and processing bandwidths of each actor  Processing capacity each processor  Inter processor communication bandwidths P1 P2 P3 G H A B D split join E z

11 Mapping Actors to Processors Condt. We develop a test for a given “z” We find a mapping using “Binary Search” 0 z qiven 1 Solution space mid

Resolve Bottleneck We find bottleneck by quantitative analysis  Actors constraining the system bandwidth Duplicate hot actors Then we run the whole process again Reason of considering bottleneck after mapping 12 0 z sys 1 < bw actor

Actor Scheduling for Processors 13 t Actor A Actor B Schedule of processor P 1

14 Entire Framework Finding a Closed Form Resolve Bottleneck Find Mapping by ILP/Approx. Schedule Actors for Processors

15 Execution Time of ILP Solver

Execution Time of a Variation of Bin-packing Algorithm 16

17 Optimal vs. Approximation

Summary We develop a synchronous model for stream programs Statically optimize the throughput of stream programs Resolving bottleneck by simple quantitative analysis Finding an approximation for the mapping problem 18

Related Works [1] Static Scheduling of SDF Programs for DSP [Lee ‘87] [2] StreamIt: A language for streaming applications [Thies ‘02] [3] Phased Scheduling of Stream Programs [Thies ’03] [4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies ‘06] [5] Orchestrating the Execution of Stream Programs on Cell [Scott ’08] [6] Software Pipelined Execution of Stream Programs on GPUs [Udupa‘09] [7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa ‘09] 19