1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe, Massachusetts Institute of Technology.

Presentation transcript:

1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts Institute of Technology ASPLOS October 2006 San Jose, CA

2 Multicores Are Here! [chart: # of cores per chip over time; processors include Pentium, P2, P3, P4, Itanium, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045, ??]

3 Multicores Are Here! For uniprocessors, C was: Portable, High Performance, Composable, Malleable, Maintainable. Uniprocessors: C is the common machine language. [same #-of-cores chart as slide 2]

4 Multicores Are Here! What is the common machine language for multicores? [same #-of-cores chart as slide 2]

5 Common Machine Languages Uniprocessors: common properties are a single flow of control and a single memory image; differences are the register file, ISA, and functional units (handled by register allocation, instruction selection, and instruction scheduling). Multicores: common properties are multiple flows of control and multiple local memories; differences are the number and capabilities of cores, the communication model, and the synchronization model. von-Neumann languages represent the common properties and abstract away the differences. Need common machine language(s) for multicores.

6 Streaming as a Common Machine Language Regular and repeating computation Independent filters with explicit communication –Segregated address spaces and multiple program counters Natural expression of parallelism: –Producer / consumer dependencies –Enables powerful, whole-program transformations [example stream graph: AtoD, FMDemod, scatter over LPF 1–3 and HPF 1–3, gather, Adder, Speaker]

7 Types of Parallelism Task Parallelism –Parallelism explicit in algorithm –Between filters without producer/consumer relationship Data Parallelism –Peel iterations of filter, place within scatter/gather pair (fission) –Can't parallelize filters with state Pipeline Parallelism –Between producers and consumers –Stateful filters can be parallelized [graph labels: Scatter, Gather, Task]

8 Types of Parallelism Task Parallelism –Parallelism explicit in algorithm –Between filters without producer/consumer relationship Data Parallelism –Between iterations of a stateless filter –Place within scatter/gather pair (fission) –Can't parallelize filters with state Pipeline Parallelism –Between producers and consumers –Stateful filters can be parallelized [graph labels: Scatter, Gather, Task, Pipeline, Data, Data Parallel]

9 Types of Parallelism Traditionally: Task Parallelism –Thread (fork/join) parallelism Data Parallelism –Data parallel loop (forall) Pipeline Parallelism –Usually exploited in hardware [graph labels: Scatter, Gather, Task, Pipeline, Data]

10 Problem Statement Given: –Stream graph with compute and communication estimate for each filter –Computation and communication resources of the target machine Find: –Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

11 Our 3-Phase Solution 1. Coarsen: fuse stateless sections of the graph 2. Data Parallelize: parallelize stateless filters 3. Software Pipeline: parallelize stateful filters Compile to a 16-core architecture –11.2x mean throughput speedup over single core [diagram: Coarsen Granularity, then Data Parallelize, then Software Pipeline]

12 Outline StreamIt language overview Mapping to multicores –Baseline techniques –Our 3-phase solution

13 The StreamIt Project Applications –DES and Serpent [PLDI 05] –MPEG-2 [IPDPS 06] –SAR, DSP benchmarks, JPEG, … Programmability –StreamIt Language (CC 02) –Teleport Messaging (PPOPP 05) –Programming Environment in Eclipse (P-PHEC 05) Domain Specific Optimizations –Linear Analysis and Optimization (PLDI 03) –Optimizations for bit streaming (PLDI 05) –Linear State Space Analysis (CASES 05) Architecture Specific Optimizations –Compiling for Communication-Exposed Architectures (ASPLOS 02) –Phased Scheduling (LCTES 03) –Cache Aware Optimization (LCTES 05) –Load-Balanced Rendering (Graphics Hardware 05) [compiler-flow diagram: StreamIt Program, Front-end, Stream-Aware Optimizations, then Uniprocessor / Cluster / Raw / IBM X10 backends emitting C/C++, MPI-like C/C++, C per tile + msg code, and Streaming X10 runtime / Annotated Java, plus a simulator (Java library)]

14 Model of Computation Synchronous Dataflow [Lee '92] –Graph of autonomous filters –Communicate via FIFO channels Static I/O rates –Compiler decides on an order of execution (schedule) –Static estimation of computation [example graph: A/D, Band Pass, Duplicate, four LED Detect filters]
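The "static I/O rates" bullet is what makes a compile-time schedule possible: the compiler solves balance equations over the filters' push/pop rates. Below is a minimal illustrative sketch in Python, not the StreamIt compiler's implementation; the function name and the chain encoding are invented here, and only a linear chain of filters is handled.

```python
import math
from fractions import Fraction

def steady_state_repetitions(chain):
    # chain: list of (name, push_rate, pop_rate); filter i feeds filter i+1.
    reps = [Fraction(1)]  # repetitions of the first filter, up to scaling
    for i in range(1, len(chain)):
        push_prev, pop_cur = chain[i - 1][1], chain[i][2]
        # balance equation: reps[i-1] * push_prev == reps[i] * pop_cur
        reps.append(reps[-1] * Fraction(push_prev, pop_cur))
    # scale to the smallest positive integer solution
    scale = math.lcm(*(r.denominator for r in reps))
    counts = [int(r * scale) for r in reps]
    g = math.gcd(*counts)
    return {name: c // g for (name, _, _), c in zip(chain, counts)}
```

For a chain Source (push 2), Filter (pop 3, push 1), Sink (pop 1) this yields {Source: 3, Filter: 2, Sink: 2}: one steady-state iteration moves 6 items across the first channel and 2 across the second, returning every FIFO to its initial occupancy.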

15 Example StreamIt Filter (stateless)

float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

[diagram: FIR filter with input and output tapes]
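The peek/pop/push semantics of the work function can be mimicked over an ordinary FIFO. A rough Python rendering follows; names like fir_work and run_fir are invented for illustration and are not part of StreamIt.

```python
from collections import deque

def fir_work(channel, weights):
    # One execution of the FIR work function: peek N items, pop 1, push 1.
    result = 0.0
    for i, w in enumerate(weights):
        result += w * channel[i]   # peek(i): read without consuming
    channel.popleft()              # pop(): consume one input item
    return result                  # push(result)

def run_fir(stream, weights):
    # Drive the filter over a stream, firing whenever N items are buffered.
    n = len(weights)
    channel, out = deque(), []
    for x in stream:
        channel.append(x)
        if len(channel) >= n:
            out.append(fir_work(channel, weights))
    return out
```

run_fir([1, 2, 3, 4], [1.0, 0.0]) slides a 2-tap window over the stream, consuming one item and producing one output per firing, giving [1.0, 2.0, 3.0].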

16 Example StreamIt Filter (stateful)

float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

[diagram: FIR filter with input and output tapes]

17 StreamIt Language Overview StreamIt is a novel language for streaming –Exposes parallelism and communication –Architecture independent –Modular and composable –Simple structures composed to create complex graphs –Malleable –Change program behavior with small modifications [constructs: filter; pipeline; splitjoin (splitter and joiner enclosing parallel computation); feedback loop; each component may be any StreamIt language construct]

18 Outline StreamIt language overview Mapping to multicores –Baseline techniques –Our 3-phase solution

19 Baseline 1: Task Parallelism [graph: splitter, two pipelines (BandPass, Compress, Process, Expand, BandStop), joiner, Adder] Inherent task parallelism between two processing pipelines Task Parallel Model: –Only parallelize explicit task parallelism –Fork/join parallelism Execute this on a 2 core machine ~2x speedup over single core What about 4, 16, 1024, … cores?

20 Evaluation: Task Parallelism Raw microprocessor: 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; cycle-accurate simulator. Parallelism: Not matched to target! Synchronization: Not matched to target!

21 Baseline 2: Fine-Grained Data Parallelism Each of the filters in the example is stateless Fine-grained Data Parallel Model: –Fiss each stateless filter N ways (N is number of cores) –Remove scatter/gather if possible We can introduce data parallelism –Example: 4 cores Each fission group occupies entire machine [graph: every filter (BandPass, Compress, Process, Expand, BandStop) replaced by a splitter/joiner group of four replicas, feeding the final Adder]
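Fission as described here replicates a stateless filter N ways between a scatter and a gather. The Python sketch below is a simplification of what the compiler actually generates (it runs the replicas sequentially and uses round-robin scatter/gather), and fiss is an invented name.

```python
def fiss(stream, work, n):
    # Scatter inputs round-robin to n replicas of a stateless filter,
    # run each replica on its share, then gather outputs round-robin.
    # Legal only because each output depends solely on its own input item.
    lanes = [stream[i::n] for i in range(n)]                # scatter
    results = [[work(x) for x in lane] for lane in lanes]   # n replicas
    out = []
    for j in range(max(map(len, results), default=0)):      # gather
        for lane in results:
            if j < len(lane):
                out.append(lane[j])
    return out
```

Because the filter is stateless, fiss(stream, work, n) produces exactly the same output as applying work to each item in order, for any n.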

22 Evaluation: Fine-Grained Data Parallelism Good Parallelism! Too Much Synchronization!

23 Outline StreamIt language overview Mapping to multicores –Baseline techniques –Our 3-phase solution

24 Phase 1: Coarsen the Stream Graph [graph: splitter, two pipelines (BandPass, Compress, Process, Expand, BandStop), joiner, peeking Adder] Before data-parallelism is exploited Fuse stateless pipelines as much as possible without introducing state –Don't fuse stateless with stateful –Don't fuse a peeking filter with anything upstream

25 Phase 1: Coarsen the Stream Graph [graph: each pipeline fused into a single BandPass–Compress–Process–Expand filter, with BandStop and the peeking Adder left separate] Before data-parallelism is exploited Fuse stateless pipelines as much as possible without introducing state –Don't fuse stateless with stateful –Don't fuse a peeking filter with anything upstream Benefits: –Reduces global communication and synchronization –Exposes inter-node optimization opportunities
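Fusing a stateless pipeline amounts to composing the work functions so that intermediate items never leave the core. A toy Python sketch, assuming 1-in/1-out stages (real fusion must also match push/pop rates, which this illustration omits; fuse is an invented name):

```python
def fuse(*stages):
    # Fuse a pipeline of stateless per-item filters into one coarse
    # filter: items flow between stages through a local variable
    # instead of inter-core FIFO channels.
    def fused(item):
        for stage in stages:
            item = stage(item)
        return item
    return fused
```

fuse(lambda x: x + 1, lambda x: 2 * x) behaves like the two-stage pipeline but runs as one filter, which is what lets the next phase data-parallelize it as a unit.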

26 Phase 2: Data Parallelize Data Parallelize for 4 cores [graph: the Adder fissed 4 ways within a splitter/joiner pair] Fiss 4 ways, to occupy entire chip

27 Phase 2: Data Parallelize Data Parallelize for 4 cores [graph: each fused BandPass–Compress–Process–Expand filter fissed 2 ways within splitter/joiner pairs; Adder fissed 4 ways] Task parallelism! Each fused filter does equal work Fiss each filter 2 times to occupy entire chip

28 Phase 2: Data Parallelize Data Parallelize for 4 cores [graph: fused filters and BandStop filters each fissed 2 ways; Adder fissed 4 ways] Task parallelism, each filter does equal work Fiss each filter 2 times to occupy entire chip Task-conscious data parallelization –Preserve task parallelism Benefits: –Reduces global communication and synchronization

29 Evaluation: Coarse-Grained Data Parallelism Good Parallelism! Low Synchronization!

30 Simplified Vocoder Target a 4 core machine [graph: AdaptDFT, splitter, two pipelines (Amplify, Diff, Unwrap, Accum), joiner, RectPolar, PolarRect; RectPolar is data parallel, but has too little work!]

31 Data Parallelize Target a 4 core machine [graph: RectPolar and PolarRect each fissed within splitter/joiner pairs; the rest of the vocoder graph unchanged]

32 Data + Task Parallel Execution Target 4 core machine [figure: execution schedule, cores vs. time, showing splitter/joiner stages and the fissed RectPolar]

33 We Can Do Better! Target 4 core machine [figure: the same schedule, cores vs. time, with the fissed RectPolar and splitter/joiner stages]

34 Phase 3: Coarse-Grained Software Pipelining [figure: prologue followed by the new steady state, with RectPolar iterations overlapped] New steady-state is free of dependencies Schedule new steady-state using a greedy partitioning
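The prologue/steady-state idea can be simulated in a few lines: at time step t, stage s works on iteration t - s, so after a short prologue every stage is busy on a different iteration each step, and the dependences cross steps rather than cores. An illustrative Python sketch (software_pipeline is an invented name, not compiler output):

```python
def software_pipeline(stages, inputs):
    # Time step t runs stage s on iteration t - s (when valid), so after
    # a prologue of len(stages)-1 steps every stage fires every step.
    # Returns final outputs and the schedule as (time, stage, iteration).
    n, k = len(inputs), len(stages)
    schedule = []
    values = [list(inputs)] + [[None] * n for _ in range(k)]
    for t in range(n + k - 1):
        for s in range(k):
            it = t - s
            if 0 <= it < n:
                values[s + 1][it] = stages[s](values[s][it])
                schedule.append((t, s, it))
    return values[k], schedule
```

With two stages and inputs [0, 1, 2], step 1 already runs both stages at once (stage 0 on iteration 1, stage 1 on iteration 0): within a step there are no dependences between stages, so stateful stages can occupy different cores.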

35 Greedy Partitioning Target 4 core machine [figure: the filters to schedule packed onto the cores; cores vs. time]
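The greedy partitioning of the steady state can be approximated with the classic longest-processing-time heuristic: sort filters by decreasing work estimate and always place the next one on the least-loaded core. The talk does not specify this exact heuristic, so the sketch below is an assumption, and greedy_partition is an invented name.

```python
import heapq

def greedy_partition(filters, num_cores):
    # filters: list of (name, work_estimate).  Keep a min-heap of
    # (load, core) so the least-loaded core is always popped first.
    heap = [(0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    assignment = {}
    for name, work in sorted(filters, key=lambda f: -f[1]):
        load, core = heapq.heappop(heap)
        assignment[name] = core
        heapq.heappush(heap, (load + work, core))
    loads = [0] * num_cores
    for name, work in filters:
        loads[assignment[name]] += work
    return loads, assignment
```

For filters with work estimates 6, 5, 4, 3 on 2 cores this yields loads of 9 and 9, a perfectly balanced schedule in this small case.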

36 Evaluation: Coarse-Grained Task + Data + Software Pipelining Best Parallelism! Lowest Synchronization!

37 Generalizing to Other Multicores Architectural requirements: –Compiler controlled local memories with DMA –Efficient implementation of scatter/gather To port to other architectures, consider: –Local memory capacities –Communication to computation tradeoff Did not use processor-to-processor communication on Raw

38 Related Work Streaming languages: –Brook [Buck et al. ’04] –StreamC/KernelC [Kapasi ’03, Das et al. ’06] –Cg [Mark et al. ‘03] –SPUR [Zhang et al. ‘05] Streaming for Multicores: –Brook [Liao et al., ’06] Ptolemy [Lee ’95] Explicit parallelism: –OpenMP, MPI, & HPF

39 Conclusions Good speedups across varied benchmark suite Algorithms should be applicable across multicores

                    Task         Fine-Grained   Coarse-Grained   Coarse-Grained Task + Data
                                 Data           Task + Data      + Software Pipeline
    Parallelism     Not matched  Good           Good             Best
    Synchronization Not matched  High           Low              Lowest

Streaming model naturally exposes task, data, and pipeline parallelism This parallelism must be exploited at the correct granularity and combined correctly