Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures. S. M. Farhad (University of Sydney). Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz.

Presentation transcript:

Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
S. M. Farhad (University of Sydney)
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz

2 Outline
- Motivation
  - Multicore
  - Stream programming
- Research question
- Our work
  - Contributions
  - Overview
  - Details
- Summary

3 Cores are the New Gates
Figure: # cores/chip (Shekhar Borkar, Intel; courtesy: Scott '08), annotated with programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming.

4 Stream Programming Paradigm
- Research topic in parallel programming
- Various forms of parallelism
  - Pipeline, task, and data
- Applications
  - Signal processing
  - Multimedia
  - High-performance computing
- Programs are expressed as stream graphs
  - Streams: infinite sequences of data elements (a.k.a. tokens)
  - Actors: functions applied to streams
Figure labels: Actor, Stream.

5 Properties of Stream Programs
- Regular and repeating computation
- Independent actors with explicit communication
  - Producer/consumer dependencies
Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF 1 to LPF 3, HPF 1 to HPF 3, Joiner, Adder, Speaker.

6 StreamIt Language [ASPLOS'2&6, PLDI'3]
- An implementation of stream programming
- Model of computation
  - Synchronous data flow model
- A program is a graph of independent actors and streams
- Actors have an execution step with known input/output rates
- The compiler schedules actors and manages buffers
Figure: pipeline A, B, C with rates 1, 5, 1, 1 and repetition counts x5, x1, x1.
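The known input/output rates are what let a compiler derive a steady-state schedule. The sketch below is an illustration, not StreamIt's implementation: it solves the synchronous data flow balance equations for a small graph. The function name `repetition_vector` and the reading of the figure's rates (A pushes 1, B pops 5 and pushes 1, C pops 1) are assumptions made for this example.

```python
from fractions import Fraction
from math import gcd, lcm  # multi-argument gcd/lcm need Python 3.9+


def repetition_vector(actors, edges):
    """Smallest positive firing counts that balance every edge.

    edges: list of (producer, push_rate, consumer, pop_rate).
    Balance condition per edge: reps[producer] * push == reps[consumer] * pop.
    """
    reps = {actors[0]: Fraction(1)}  # fix one actor, solve relative to it
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, push, dst, pop = edge
            if src in reps and dst not in reps:
                reps[dst] = reps[src] * push / pop
            elif dst in reps and src not in reps:
                reps[src] = reps[dst] * pop / push
            elif src in reps and dst in reps:
                if reps[src] * push != reps[dst] * pop:
                    raise ValueError("inconsistent rates: no steady state")
            else:
                continue  # neither endpoint reached yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in reps.values()))
    counts = {a: int(r * scale) for a, r in reps.items()}
    common = gcd(*counts.values())
    return {a: n // common for a, n in counts.items()}


# Assumed reading of the slide's figure: A -(1:5)-> B -(1:1)-> C
schedule = repetition_vector(["A", "B", "C"],
                             [("A", 1, "B", 5), ("B", 1, "C", 1)])
print(schedule)
```

Under that reading, A must fire five times for each firing of B and C, which matches the x5, x1, x1 annotations on the slide.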

7 StreamIt Language Contd.
- Each construct has a single input/output stream
- Hierarchical structure
- Filters can be stateful or stateless
Figure: the StreamIt constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter); each nested element may be any StreamIt language construct.

8 Research Question: How to Orchestrate a Stream Graph?
- Mapping actors
- Eliminating bottlenecks (a.k.a. hot actors)

9 We Focus on?
- Algorithm for actor allocation
Figure: stream graph A (5), B (60), C (60), D (5) mapped onto three cores. A and D share one core (load = 10); B and C each occupy a core (load = 60).
- Make span = 60, Speedup = 130/60 = 2.17
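The make span and speedup figures on this slide can be reproduced with a few lines. This is a minimal sketch: the helper name `makespan_and_speedup` and the core labels are made up for illustration, and the mapping (A and D together, B and C alone) is inferred from the loads shown on the slide.

```python
def makespan_and_speedup(work, mapping):
    """Return (per-core loads, make span, speedup over one core).

    work: actor -> work per steady-state iteration
    mapping: actor -> core identifier
    """
    loads = {}
    for actor, core in mapping.items():
        loads[core] = loads.get(core, 0) + work[actor]
    makespan = max(loads.values())           # the busiest core sets the pace
    speedup = sum(work.values()) / makespan  # single-core time / parallel time
    return loads, makespan, speedup


work = {"A": 5, "B": 60, "C": 60, "D": 5}
mapping = {"A": "core1", "D": "core1", "B": "core2", "C": "core3"}
loads, makespan, speedup = makespan_and_speedup(work, mapping)
print(loads, makespan, round(speedup, 2))
```

The busiest core bounds the throughput, which is why B and C, at 60 units each, cap the speedup regardless of where A and D are placed.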

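10 We Focus on?
- Bottleneck elimination
Figure: hot actor duplication. B (60) becomes splitter s1 (2), copies B_1, B_2, B_3 (20 each), and joiner j1 (2); C (60) becomes splitter s2 (2), copies C_1, C_2, C_3 (20 each), and joiner j2 (2).
- Core 1: A (5), s1 (2), B_1 (20), C_1 (20); load = 47
- Core 2: B_2 (20), j1 (2), s2 (2), C_2 (20), j2 (2); load = 46
- Core 3: B_3 (20), C_3 (20), D (5); load = 45
- Make span = 47, Speedup = 130/47 ≈ 2.77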

11 Orchestration of Stream Program Contd.
- Current state of the art
  - Integer Linear Programming: intractable
  - Heuristics: unknown performance
- How to find a fast and good solution?
  - Approximation algorithms that have
    - Polynomial runtime
    - A quality bound for the solution

12 Our Contribution
- A simple quantitative analysis to detect and eliminate bottlenecks
- A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
- Experiments and comparison
  - Results are within 5% of the optimal solution
  - Achieves a geometric mean speedup of 6.95x for 8 processors over single-processor execution

13 Focus of Our Work
Figure: StreamIt compiler phases (Stream Graph Scheduling, Stream Graph Partitioning, Layout on Target Architecture, Communication Scheduling) together with the components Linear Functional Equation Solver, Bottleneck Resolver, and Actor Allocation on Processors.

14 Our Data Transfer Model
- The arrival rate depends on the data rates of the actors; the goal is to maximize it
- The data transfer model forms a system of simultaneous linear functional equations
- We compute a closed form of the output data rate
- We also consider a processor utilization function for each actor
Figure: pipeline A, B, C with rates 1, 5, 1, 1 and arrival rate z.

15 Bottleneck Analysis
- The arrival rate is limited by
  - Processor capacity of the cores
  - Memory bandwidth
- A quantitative analysis determines
  - An upper bound of the arrival rate imposed by an actor
  - An upper bound of the arrival rate imposed by the parallel system
- Hot actor
  - Upper bound (actor) < upper bound (system)
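The hot-actor test can be illustrated with a simplified sketch. It makes two assumptions that are not stated on the slide: an actor with work w per iteration limits the arrival rate to 1/w on a single core, and a system of P cores limits it to P divided by the total work. Memory bandwidth is ignored. The name `hot_actors` is hypothetical.

```python
def hot_actors(work, cores):
    """Actors whose own rate bound is below the system-wide bound.

    Assumed model (processor capacity only, no memory bandwidth):
      upper bound imposed by an actor  = 1 / work[actor]
      upper bound imposed by the system = cores / sum(work)
    """
    system_bound = cores / sum(work.values())
    return [a for a, w in work.items() if 1.0 / w < system_bound]


# Example graph from the earlier slides: B and C dominate the work.
original = {"A": 5, "B": 60, "C": 60, "D": 5}
print(hot_actors(original, cores=3))

# After hot actor duplication (three copies plus splitter/joiner overhead).
duplicated = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20, "j1": 2,
              "s2": 2, "C_1": 20, "C_2": 20, "C_3": 20, "j2": 2, "D": 5}
print(hot_actors(duplicated, cores=3))
```

Under these assumptions B and C are hot in the original graph (1/60 is below 3/130), and no actor is hot once they are split into copies of 20.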

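16 Resolving Bottlenecks
Figure: three versions of the example graph.
- Original: A (5), B (60), C (60), D (5)
- Hot region duplication: splitter s1, three parallel copies of the hot region (B_1, B_2, B_3 followed by C_1, C_2, C_3; 20 each), joiner j1
- Hot actor duplication: s1 (2), B_1 to B_3 (20 each), j1 (2), then s2 (2), C_1 to C_3 (20 each), j2 (2)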

17 Actor Allocation Problem
- The actor allocation problem (AAP) is NP-hard
- For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
- There exist approximation algorithms for bin-packing
  - Polynomial running time
  - Solution quality is bounded

18 Actor Allocation Constraint
- Actors with their utilizations (%)
- Each core has 100% utilization
Figure: actors numbered 1 to n and the cores they are assigned to.

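19 Binary Search
Figure: the solution space for the arrival rate runs from 0 to ub(z) (axis labels 0, ub(z), 1.0), with markers left, mid, and right. At each step the question is: allocation possible?

20 Binary Search
Figure: the same search after one step, with left, mid, and right moved.

21 Binary Search
Figure: the same search after a further step.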

22 Actor Allocation of Bottleneck Free Program
Figure: the graph after hot region duplication (A, s1, B_1 to B_3, C_1 to C_3, j1, D) mapped onto three cores. Figure labels: Efficient Bottleneck Resolving, Mapping.
- Core 1: A (5), B_1 (20), C_1 (20); load = 45
- Core 2: s1 (2), B_2 (20), j1 (2), C_2 (20); load = 44
- Core 3: B_3 (20), C_3 (20), D (5); load = 45
- Make span = 45, Speedup = 130/45 = 2.89

23 Experiments
- Our method is implemented as an extension of the StreamIt compiler
- We compare to an ILP-based method [Scott 08] (solved with CPLEX)
- Hardware setup
  - 2.33 GHz dual quad-core Intel Xeon processors, 16 GB memory, Linux kernel version
  - Profiler uses the x86-64's hardware cycle counters

24 Experiments Contd.
- Experimental process
  - Profiling
  - Computing the closed form
  - Resolving bottlenecks
  - Computing the mapping
  - Computing the layout scheduling
  - Invoking the StreamIt back end
  - Finally, measuring the performance

25 Experiments Contd.
Benchmark Actors Stateful: DCT2218 FMRadio6723 TDE5527 FFT2614 MergeSort312 FilterBankNew5334 RadixSort132 EqualizerProgram6522 BitonicSort4522 DES MPEG397 MatrixMult522

26 Experimental Results for 2-4 Processors
Table columns: Benchmark, Proc#, ILP Time (Optimal) (s). Benchmarks: DCT, FMRadio, TDE, FFT, Equalizer, BitonicSort, DES, MPEG.
- Our method's run time: < 1 s

27 Experimental Results for 2-4 Processors
Table columns: Benchmark, Proc#, Arrival rate ratio (Approx./Optimal), Approx. Arrival Rate (MBps). Benchmarks: DCT, FMRadio, TDE, FFT, Equalizer, BitonicSort, DES, MPEG.

28 Speedup Results

29 Summary
- Approximation algorithm for solving the actor allocation problem
- Data rate transfer model that resolves bottlenecks
- We separate the bottleneck elimination from the actor allocation
- We implemented our approach and compared it with an optimal approach
  - The optimal approach has unpredictable running time
  - Our approach has negligible running time for all benchmarks
  - The quality of our approach is at most 5% off the optimum
- For up to 8 processors we achieve a geometric mean speedup of 6.95x over single-processor execution

30 Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]

Questions?