Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.


Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School of Information Technology University of Sydney

Multicores Are Here!
[Figure: chart of # of cores per chip. Processors plotted: Pentium, P2, P3, P4, Itanium, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]

Multicores Are Here!
Uniprocessors: C is the common machine language.
For uniprocessors, C was: portable, high performance, composable, malleable, maintainable.
[Figure: the same # of cores chart, with the processors from Pentium through Ambric AM2045.]

Multicores Are Here!
What is the common machine language for multicores?
[Figure: the same # of cores chart, with the processors from Pentium through Ambric AM2045.]

Common Machine Languages
Uniprocessors:
– Common properties: single flow of control, single memory image.
– Differences: register file (handled by register allocation), ISA (handled by instruction selection), functional units (handled by instruction scheduling).
Multicores:
– Common properties: multiple flows of control, multiple local memories.
– Differences: number and capabilities of cores, communication model, synchronization model.
von-Neumann languages represent the common properties and abstract away the differences.
Need common machine language(s) for multicores.

Properties of Stream Programs
A large (possibly infinite) amount of data
– Limited lifespan of each data item
– Little processing of each data item
A regular, static computation pattern
– Stream program structure is relatively constant
– A lot of opportunities for compiler optimizations

Streaming as a Common Machine Language
Regular and repeating computation
Independent filters with explicit communication
– Segregated address spaces and multiple program counters
Natural expression of parallelism:
– Producer / consumer dependencies
– Enables powerful, whole-program transformations
[Figure: FM radio stream graph with nodes AtoD, FMDemod, Scatter, LPF 1 to LPF 3, HPF 1 to HPF 3, Gather, Adder, Speaker.]

Application of Streaming Programming

Model of Computation
Synchronous Dataflow [Lee ‘92]
– Graph of autonomous filters
– Communicate via FIFO channels
Static I/O rates
– Compiler decides on an order of execution (schedule)
– Static estimation of computation
[Figure: example graph with A/D, Band Pass, Duplicate, and four LED Detect filters.]
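Static I/O rates are what allow a schedule to be fixed at compile time. As a minimal sketch (the three-filter pipeline and its push/pop rates below are made-up illustration data, not taken from the slides), the following Python solves the synchronous dataflow balance equations to find how many times each filter fires in one steady-state iteration:

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(filters, channels):
    """Solve the balance equations reps[src] * push == reps[dst] * pop.

    filters  -- list of filter names (hypothetical example data)
    channels -- list of (src, dst, push_rate, pop_rate) tuples
    Assumes a connected graph; raises ValueError on inconsistent rates.
    """
    rate = {filters[0]: Fraction(1)}  # fix one filter, derive the rest
    pending = list(channels)
    while pending:
        progressed = False
        for ch in list(pending):
            src, dst, push, pop = ch
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * push / pop
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * pop / push
            elif src in rate and dst in rate:
                if rate[src] * push != rate[dst] * pop:
                    raise ValueError("inconsistent rates")
            else:
                continue  # neither end known yet, retry later
            pending.remove(ch)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {f: int(rate[f] * scale) for f in filters}

# Hypothetical pipeline: A pushes 2, B pops 3 and pushes 1, C pops 2.
reps = repetition_vector(["A", "B", "C"],
                         [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(reps)
```

Firing each filter the computed number of times leaves every channel with the same number of items it started with, which is why the steady state can repeat indefinitely.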

Example StreamIt Filter (stateless)
[Figure: FIR filter between its input and output channels.]
float->float filter FIR (int N, float[N] weights) {
    work push 1 pop 1 peek N {
        float result = 0;
        for (int i = 0; i < N; i++) {
            result += weights[i] * peek(i);
        }
        pop();
        push(result);
    }
}
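To make the peek/pop/push semantics concrete, here is a small Python simulation of the FIR work function. It is only a sketch: the `deque` stands in for the input FIFO channel, and the names `fir_work` and `run_fir` are invented for this illustration and are not part of StreamIt.

```python
from collections import deque

def fir_work(channel, weights, output):
    """One firing of the FIR work function: peek N, pop 1, push 1."""
    result = 0.0
    for i in range(len(weights)):
        result += weights[i] * channel[i]  # peek(i): read without consuming
    channel.popleft()                      # pop(): consume the oldest item
    output.append(result)                  # push(result)

def run_fir(samples, weights):
    """Fire the filter for as long as N items are available to peek."""
    channel = deque(samples)
    output = []
    while len(channel) >= len(weights):
        fir_work(channel, weights, output)
    return output

# Two-tap moving average over four samples.
print(run_fir([1, 2, 3, 4], [0.5, 0.5]))
```

Because `fir_work` reads only its arguments and keeps nothing between firings, it models the stateless variant of the filter.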

Example StreamIt Filter (stateful)
[Figure: FIR filter between its input and output channels.]
float->float filter FIR (int N) {
    float[N] weights;
    work push 1 pop 1 peek N {
        float result = 0;
        weights = adaptChannel(weights);
        for (int i = 0; i < N; i++) {
            result += weights[i] * peek(i);
        }
        pop();
        push(result);
    }
}

StreamIt Language Overview
StreamIt is a novel language for streaming
– Exposes parallelism and communication
– Architecture independent
– Modular and composable: simple structures are composed to create complex graphs
– Malleable: program behavior changes with small modifications
[Figure: the StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter).]

Mapping to Multicores
Baseline techniques
3-phase solution

Baseline 1: Task Parallelism
Inherent task parallelism between two processing pipelines
Task Parallel Model:
– Only parallelize explicit task parallelism
– Fork/join parallelism
Execute this on a 2 core machine: ~2x speedup over single core
What about 4, 16, 1024, … cores?
[Figure: Splitter feeding two identical pipelines (BandPass, Compress, Process, Expand, BandStop), merged by a Joiner and followed by Adder.]

Baseline 2: Fine-Grained Data Parallelism
Each of the filters in the example is stateless, so we can introduce data parallelism
Fine-grained Data Parallel Model:
– Fiss each stateless filter N ways (N is the number of cores)
– Remove scatter/gather if possible
Example: 4 cores
– Each fission group occupies the entire machine
[Figure: the two pipelines (BandPass, Compress, Process, Expand, BandStop) and the Adder, with every filter fissed and wrapped in its own Splitter and Joiner.]
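Fission of a stateless filter can be pictured as a round-robin splitter, N identical replicas, and a round-robin joiner. The Python sketch below assumes a one-in/one-out filter and runs the replicas sequentially; on a real multicore each lane would run on its own core. The function name `fiss` is chosen for this illustration only.

```python
def fiss(work, items, n):
    """Run a stateless one-in/one-out filter as n round-robin replicas."""
    # Splitter: replica k receives items k, k+n, k+2n, ...
    lanes = [items[k::n] for k in range(n)]
    # Replicas: independent of each other because the filter keeps no state.
    results = [[work(x) for x in lane] for lane in lanes]
    # Joiner: interleave the lanes back into the original stream order.
    return [results[i % n][i // n] for i in range(len(items))]

# A 4-way fission produces the same stream as the unfissed filter.
square = lambda x: x * x
print(fiss(square, list(range(10)), 4))
```

The equivalence holds only because the filter is stateless; a stateful filter such as the adaptive FIR cannot be split this way without changing its output.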

3-Phase Solution [2006]
Target a 4 core machine
Data parallel, but too little work!
[Figure: stream graph RectPolar, Splitter, AdaptDFT, Joiner, Splitter, two branches of Amplify, Diff, UnWrap, Accum, Joiner, PolarRect; data-parallel filters are marked.]

Data Parallelize
Target a 4 core machine
[Figure: the same graph with RectPolar and PolarRect each fissed between a Splitter and a Joiner.]

Data + Task Parallel Execution
Target 4 core machine
[Figure: schedule chart with axes Time and Cores (the value 21 is marked), showing the fissed graph with RectPolar and its Splitter/Joiner pairs.]

We Can Do Better!
Target 4 core machine
[Figure: schedule chart with axes Time and Cores, showing the fissed graph with RectPolar and its Splitter/Joiner pairs.]

Phase 3: Coarse-Grained Software Pipelining
New steady state is free of dependencies
Schedule the new steady state using greedy partitioning
[Figure: RectPolar graph unrolled into a Prologue followed by the New Steady State.]

Greedy Partitioning
Target 4 core machine
[Figure: the filters to schedule, packed onto a chart with axes Time and Cores (the value 16 is marked).]
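The slides do not say which greedy rule is used, so the following Python sketch assumes one common choice: take the filters heaviest first and place each on the currently least-loaded core. The work estimates in the example are invented for illustration.

```python
import heapq

def greedy_partition(work, cores):
    """Assign filters to cores, heaviest first, each to the least-loaded core.

    work  -- dict mapping filter name to its estimated work (hypothetical data)
    cores -- number of cores
    Returns (assignment, loads).
    """
    heap = [(0, c) for c in range(cores)]  # (current load, core id)
    assignment = {}
    for name, cost in sorted(work.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core = heapq.heappop(heap)   # least-loaded core so far
        assignment[name] = core
        heapq.heappush(heap, (load + cost, core))
    loads = [0] * cores
    for name, core in assignment.items():
        loads[core] += work[name]
    return assignment, loads

# Six filters with made-up work estimates on a 2-core machine.
assignment, loads = greedy_partition(
    {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4, "f": 2}, 2)
print(assignment, loads)
```

This works because the software-pipelined steady state has no dependencies between filters, so the partitioner only has to balance load and can ignore ordering.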

Orchestrating the Execution of Stream Programs on Multicore Platforms (SGMS) [June 08]
An integrated unfolding and partitioning step, based on integer linear programming, unfolds data-parallel actors as needed and maximally packs actors onto cores.
Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores.
Experimental results (Cell architecture).

Cell Architecture

Motivation
It is critical that the compiler leverage a synergistic combination of parallelism while avoiding
– Structural and resource hazards
The objective is to maximize concurrent execution of actors across multiple cores
– While hiding communication overhead to minimize stalls (pauses)

SGMS: A Phase-ordered Approach (2 Steps)
First: an integrated actor fission and partitioning step assigns actors to processors so that the work is balanced as evenly as possible.
– Data-parallel actors are selectively replicated and split to increase the opportunities for even work distribution (formulated as an integer linear program).
Second: stage assignment.
– Each actor is assigned to a pipeline stage.
– Data dependencies must be satisfied.
– Inter-processor communication latency is maximally overlapped with computation.

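Orchestrating the Execution of Stream Programs on Multicore Platforms
[Figure: original stream graph with actors A, B, C, D; after fission and processor assignment, B is split into B1 and B2 between a splitter S and a joiner J (each labelled 2).]
Maximum speedup (unmodified graph) = 60/40 = 1.5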

Orchestrating the Execution of Stream Programs on Multicore Platforms
[Figure: original stream graph with actors A, B, C, D; after fission and processor assignment, the graph has actors A, S, B1, B2, J, C, D (S and J labelled 2), with Proc 1 = 27 and Proc 2 = 32.]
Maximum speedup (unmodified graph) = 60/40 = 1.5
Maximum speedup (modified graph) = 60/32 ≈ 1.9 (roughly 2)
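The speedup figures on this slide follow from dividing the total single-core work by the load of the busiest processor. The Python sketch below uses only the numbers shown on the slide; reading the 40 as the heaviest per-processor load of the unmodified graph is an assumption made here, and `max_speedup` is a name invented for the illustration.

```python
def max_speedup(total_work, loads):
    """Upper bound on speedup: total work divided by the busiest processor."""
    return total_work / max(loads)

# Unmodified graph: the busiest processor carries 40 of the 60 work units.
print(max_speedup(60, [40]))
# After fission: Proc 1 = 27 and Proc 2 = 32, as given on the slide.
print(max_speedup(60, [27, 32]))
```

The bound improves after fission because splitting the heaviest actor lowers the maximum load, even though the splitter and joiner add a little extra work.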

Stage Assignment
[Figure: the fissed graph (A, S, B1, B2, J, C, D; Proc 1 = 27, Proc 2 = 32) from the fission and processor assignment step, laid out over pipeline stages Stage 0 to Stage 4, with DMA transfers between the stages that hold actors.]
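One way to read the stage layout is sketched below in Python. Two things are assumptions rather than statements from the slides: the rule that a consumer on a different processor sits two stages after its producer (leaving one stage for the DMA transfer), and the processor assignment, which is inferred from the transfers named on the Cell schedule slide (B1toJ, StoB2, AtoC, JtoD, CtoD).

```python
def assign_stages(order, edges, proc):
    """Assign a pipeline stage to every actor.

    order -- actors in topological order
    edges -- list of (producer, consumer) pairs
    proc  -- dict mapping actor to processor
    Assumed rule: same processor keeps the producer's stage at minimum;
    a different processor needs a gap of 2 so a DMA stage fits in between.
    """
    stage = {actor: 0 for actor in order}
    for actor in order:
        for src, dst in edges:
            if dst == actor:
                gap = 0 if proc[src] == proc[dst] else 2
                stage[actor] = max(stage[actor], stage[src] + gap)
    return stage

# Fissed graph from the slides; processor assignment is inferred (see above).
order = ["A", "S", "C", "B1", "B2", "J", "D"]
edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
         ("B1", "J"), ("B2", "J"), ("J", "D"), ("C", "D")]
proc = {"A": 1, "S": 1, "B1": 1, "D": 1, "B2": 2, "C": 2, "J": 2}
print(assign_stages(order, edges, proc))
```

Under these assumptions the actors land in stages 0, 2 and 4, leaving stages 1 and 3 for DMA, which is consistent with the five stages labelled on the slide.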

Schedule Running on Cell
[Figure: timeline with columns SPE1, DMA1, DMA2, SPE2, divided into a Prologue and a Steady State. The SPEs execute actors A, S, B1, B2, C, J, D; the DMA engines carry the transfers B1toJ, StoB2, AtoC, JtoD, CtoD.]

Naïve Unfolding [1991]
[Figure: stream graph with actors A to F unfolded three times (A1 to F1, A2 to F2, A3 to F3) across PE1, PE2 and PE3, connected by DMA. Legend: state data dependence, stream data dependence.]

Benchmark Characteristics
Benchmark    Actors  Stateful  Peeking  State size (bytes)
bitonic      28      2         0        4
channel
dct          36      2         0        4
des          33      2         0        4
fft          17      2         0        4
filterbank
fmradio
tde          28      2         0        4
mpeg
vocoder
radar

Results

Compare

Effect of Exposed DMA Latency

ILP vs Greedy Partitioning