EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan
November 30, 2011
Guest Speaker Today: Daya Khudia

- 1 - Announcements & Reading Material
v This class
»“Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008
v Next class – GPU compilation
»“Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008

- 2 - Stream Graph Modulo Scheduling (SGMS)
v Coarse grain software pipelining
»Equal work distribution
»Communication/computation overlap
»Synchronization costs
v Target: Cell processor
»Cores with disjoint address spaces
»Explicit copy to access remote data
 DMA engine independent of PEs
v Filters = operations, cores = function units
[Cell diagram: SPE0–SPE7, each SPU with a 256 KB local store and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]

- 3 - Preliminaries
v Synchronous Data Flow (SDF) [Lee ’87]
v StreamIt [Thies ’02]
v Stateless filter – push and pop items from input/output FIFOs:

  int->int filter FIR(int N, int wgts[N]) {
    work pop 1 push 1 {
      int i, sum = 0;
      for (i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
      push(sum);
      pop();
    }
  }

v Stateful variant – filter state (the weights) is updated across firings:

  int wgts[N];
  wgts = adapt(wgts);
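The work function above can be mimicked outside StreamIt. The following Python sketch fires a stateless FIR filter against plain FIFOs — deques stand in for StreamIt's channels; this is an illustration of the pop/push/peek discipline, not the StreamIt runtime:

```python
from collections import deque

def fir_work(in_fifo, out_fifo, wgts):
    """One firing of a stateless FIR filter: peek N items, push 1, pop 1."""
    n = len(wgts)
    # peek(i) reads without consuming; the filter needs N items buffered
    acc = sum(in_fifo[i] * wgts[i] for i in range(n))
    out_fifo.append(acc)   # push(sum)
    in_fifo.popleft()      # pop()

# Usage: a 3-tap moving sum over a short stream
wgts = [1, 1, 1]
inp = deque([1, 2, 3, 4, 5])
out = deque()
while len(inp) >= len(wgts):
    fir_work(inp, out, wgts)
print(list(out))  # [6, 9, 12]
```

Because the firing touches no state outside its FIFOs, any number of copies of this filter can run in parallel on disjoint chunks of the stream — exactly the property fission exploits later in the lecture.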

- 4 - SGMS Overview
[Pipeline diagram: the stream graph is partitioned across PE0–PE3; successive iterations overlap so that T1/T4 ≈ 4, with DMA transfers overlapped against computation and explicit prologue and epilogue phases around the steady state]

- 5 - SGMS Phases
v Fission + processor assignment → load balance
v Stage assignment → causality, DMA overlap
v Code generation

- 6 - Processor Assignment: Maximizing Throughput
v Assign each filter to a processor: for all filters i = 1, …, N and all PEs j = 1, …, P, minimize II
v Example: filters A–F with per-filter workloads (W: 20, 30, 50, 30, …); single-processor time T1 = 170
v A balanced workload across four processing elements (PE0–PE3) achieves the minimum II: T2 = 50, so T1/T2 = 3.4 → maximum throughput
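Minimizing II amounts to minimizing the maximum per-PE load. A brute-force Python sketch over the slide's four PEs — the two filter workloads the slide does not show are assumed values, chosen so the totals match the slide's T1 = 170 and II = 50:

```python
from itertools import product

# Per-filter workloads. 20/30/50/30 come from the slide; the values for
# E and F are assumed so that the total matches T1 = 170.
work = {"A": 20, "B": 30, "C": 50, "D": 30, "E": 20, "F": 20}

def min_ii(work, num_pes):
    """Minimum II = smallest achievable maximum per-PE load (brute force)."""
    filters = list(work)
    best = sum(work.values())  # single-PE upper bound (T1)
    for assign in product(range(num_pes), repeat=len(filters)):
        loads = [0] * num_pes
        for f, pe in zip(filters, assign):
            loads[pe] += work[f]
        best = min(best, max(loads))
    return best

t1 = sum(work.values())
ii = min_ii(work, 4)
print(t1, ii, t1 / ii)  # 170 50 3.4
```

Note that II can never drop below the heaviest single filter (here C at 50) — which is why fission, on the next slides, is needed to go further.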

- 7 - Need More Than Just Processor Assignment
v Assign filters to processors
»Goal: equal work distribution
v Graph partitioning? Bin packing?
v Example: a four-filter graph A–D with total work 60; the best two-PE mapping of the original stream program has max load 40 → speedup = 60/40 = 1.5
v Modified stream program: fiss B into B1/B2 with a split S and join J; the two-PE mapping then balances to max load 32 → speedup = 60/32 ≈ 2

- 8 - Filter Fission Choices
v Across four PEs (PE0–PE3): which fission choice approaches speedup ≈ 4?

- 9 - Integrated Fission + PE Assign
v Exact solution based on Integer Linear Programming (ILP)
»Split/join overhead factored in
v Objective function: maximal load on any PE
»Minimize it
v Result
»Number of times to “split” each filter
»Filter → processor mapping
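A stripped-down version of the assignment ILP — a sketch only: the fission variables and the split/join overhead mentioned on the slide are omitted, and the names a_ij (assignment indicator) and w_i (filter workload) are assumed notation:

```latex
\begin{aligned}
\text{minimize }\; & II \\
\text{subject to }\; & \sum_{j=1}^{P} a_{ij} = 1 && \forall\, i = 1,\dots,N \\
& \sum_{i=1}^{N} w_i\, a_{ij} \le II && \forall\, j = 1,\dots,P \\
& a_{ij} \in \{0,1\}
\end{aligned}
```

The first constraint places each filter on exactly one PE; the second makes II an upper bound on every PE's load, so minimizing II balances the partition. The full formulation adds integer replication counts per stateless filter plus the split/join work they induce.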

Step 2: Forming the Software Pipeline
v To achieve speedup
»All chunks should execute concurrently
»Communication should be overlapped
v Processor assignment alone is insufficient information
[Diagram: with A on PE0 and B on PE1, running B_i immediately after A_i serializes on the A→B transfer; pipelining instead overlaps A_{i+2} with B_i, so the A→B DMA hides under computation]

Stage Assignment
v Preserve causality (producer-consumer dependence)
»For an edge i → j within one PE: S_j ≥ S_i
v Communication-computation overlap
»For an edge i → j crossing PEs (via DMA): S_DMA > S_i and S_j = S_DMA + 1
v Data flow traversal of the stream graph
»Assign stages using the above two rules
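The two rules can be applied in a single data-flow pass over the graph. A Python sketch — function and variable names are illustrative, and the toy three-filter chain is not from the slides:

```python
def assign_stages(edges, placement):
    """edges: (producer, consumer) pairs in topological order.
    placement: filter -> PE id.
    Returns filter -> stage, giving each cross-PE DMA its own stage."""
    stage = {}
    for prod, cons in edges:
        stage.setdefault(prod, 0)
        if placement[prod] == placement[cons]:
            # Same PE: causality only, S_cons >= S_prod
            stage[cons] = max(stage.get(cons, 0), stage[prod])
        else:
            # Cross-PE edge: DMA stage S_dma > S_prod,
            # and the consumer starts at S_dma + 1
            s_dma = stage[prod] + 1
            stage[cons] = max(stage.get(cons, 0), s_dma + 1)
    return stage

edges = [("A", "B"), ("B", "C")]
placement = {"A": 0, "B": 0, "C": 1}
print(assign_stages(edges, placement))  # {'A': 0, 'B': 0, 'C': 2}
```

A and B share PE 0 and a stage; the B→C edge crosses to PE 1, so the DMA occupies stage 1 and C lands in stage 2 — matching the pattern of the example on the next slide.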

Stage Assignment Example
v Graph: A fissed through split S into B1/B2 with join J, plus C and D
v PE 0: A, S, B1, D; PE 1: C, B2, J
»Stage 0: A, S, B1
»Stage 1: DMA
»Stage 2: C, B2, J
»Stage 3: DMA
»Stage 4: D

Step 3: Code Generation for Cell
v Target the Synergistic Processing Elements (SPEs)
»PS3 – up to 6 SPEs
»QS20 – up to 16 SPEs
v One thread / SPE
v Challenge
»Making a collection of independent threads implement a software pipeline
»Adapt the kernel-only code schema of a modulo schedule

Complete Example

  void spe1_work() {
    char stage[5] = {0};
    int i, k;
    stage[0] = 1;
    for (i = 0; i < MAX; i++) {
      if (stage[0]) { A(); S(); B1(); }
      if (stage[1]) { /* DMA in flight: AtoC, StoB2, B1toJ */ }
      if (stage[2]) { JtoD(); CtoD(); }
      if (stage[3]) { /* DMA in flight: J->D, C->D transfers complete */ }
      if (stage[4]) { D(); }
      /* Advance the stage predicates so each iteration enables the next
         stage (prologue fill-in); this update is elided on the slide. */
      for (k = 4; k > 0; k--) stage[k] = stage[k - 1];
      barrier();
    }
  }

[Timeline (columns SPE1 / DMA1 / SPE2 / DMA2 over time): each iteration SPE1 runs A, S, B1; DMA1 carries AtoC, StoB2, B1toJ; SPE2 runs C, B2, J; DMA2 carries JtoD, CtoD; D runs four stages behind A]

SGMS (ILP) vs. Greedy (MIT method, ASPLOS ’06)
v Solver time < 30 seconds for 16 processors

SGMS Conclusions
v Streamroller
»Efficient mapping of stream programs to multicore
»Coarse grain software pipelining
v Performance summary
»14.7x speedup on 16 cores
»Up to 35% better than greedy solution (11% on average)
v Scheduling framework
»Tradeoff memory space vs. load balance
 Memory constrained (embedded) systems
 Cache based systems

Discussion Points
v Is it possible to convert stateful filters into stateless ones?
v What if the application does not behave as you expect?
»Filters change execution time?
»Memory faster/slower than expected?
v Could this be adapted for a more conventional multiprocessor with caches?
v Can C code be automatically streamized?
v Now you have seen 3 forms of software pipelining:
»1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
»Where else can it be used?

“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”

Static Versus Dynamic Scheduling
v Stream graph modulo scheduling is performed on the stream graph statically.
v What happens in case of dynamic resource changes?
[Diagram: a stream graph with splitters/joiners and replicated filters (B1–B4, D1–D2) mapped onto four cores, each with its own memory; at runtime one of the cores may become unavailable]

Overview of Flextream
v Goal: perform adaptive stream graph modulo scheduling (streaming application in, MSL commands out).
v Static phases:
»Prepass replication – adjust the amount of parallelism for the target system by replicating actors.
»Work partitioning – find an optimal modulo schedule for a virtualized member of a family of processors.
v Dynamic phases – light-weight adaptation of the schedule for the current configuration of the target hardware:
»Partition refinement – tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance).
»Stage assignment – specifies how actors execute in time in the new actor-processor mapping.
»Buffer allocation – tries to efficiently allocate the storage requirements of the new schedule into available memory units.

Overall Execution Flow
v Every application may see multiple iterations of: resource change requested → resource change granted → schedule adapted

Prepass Replication [static]
v Heavily loaded actors are replicated to expose parallelism for the virtualized target.
[Example: graph A–F; E is split into E0/E1, then C, D, and E are further replicated (C0–C3, D0–D1, E0–E3) with splitters S0–S2 and joiners inserted, moving per-processor loads from a highly imbalanced 0–566 to a balanced 141.5–184.5]

Partition Refinement [dynamic 1]
v Available resources at runtime can be more limited than the resources in the static target architecture.
v Partition refinement tunes the actor-to-processor mapping for the active configuration.
v A greedy iterative algorithm is used to achieve this goal.

Partition Refinement Example
v Pick the processors with the most actors
v Sort the actors (by work)
v Find the processor with max work
v Reassign its smallest actors until its load drops below a threshold
[Example: actors B, E2, C0–C3 and splitters/joiners S0–S2, J0–J2 are migrated; per-PE loads settle around 270–289 once the remaining processors absorb the displaced work]
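A minimal sketch of the greedy idea, assuming the simplest case: refinement must absorb actors whose processors disappeared at runtime. This simplifies Flextream's heuristic (no threshold loop, no actor sorting per source processor), and all names and numbers are illustrative:

```python
def refine_partition(mapping, work, live_pes):
    """Greedy refinement sketch: actors on PEs that disappeared at runtime
    are reassigned, largest first, to the currently least-loaded live PE."""
    loads = {pe: 0 for pe in live_pes}
    new_map = {}
    orphans = []
    for actor, pe in mapping.items():
        if pe in live_pes:
            new_map[actor] = pe
            loads[pe] += work[actor]
        else:
            orphans.append(actor)
    for actor in sorted(orphans, key=lambda a: -work[a]):
        target = min(loads, key=loads.get)  # least-loaded live PE
        new_map[actor] = target
        loads[target] += work[actor]
    return new_map, loads

work = {"A": 40, "B": 30, "C": 30, "D": 20}
mapping = {"A": 0, "B": 1, "C": 2, "D": 2}   # static schedule used 3 PEs
new_map, loads = refine_partition(mapping, work, live_pes=[0, 1])
print(loads)  # {0: 60, 1: 60}
```

Placing displaced actors largest-first keeps the final loads close to balanced without re-solving the ILP — the point of doing refinement as a light-weight dynamic phase.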

Stage Assignment [dynamic 2]
v Processor assignment only specifies how actors are overlapped across processors.
v Stage assignment finds how actors are overlapped in time.
v Relative start times of the actors are based on stage numbers.
v DMA operations have a separate stage.

Stage Assignment Example
[Example: the replicated graph from prepass replication (A; B; C0–C3 with S0/J0; D0–D1 with S1/J1; E0–E3 with S2/J2; F) is re-staged for the new actor-processor mapping, with DMA stages separating producers and consumers that sit on different processors]

Performance Comparison

Overhead Comparison

Flextream Conclusions
v Static scheduling approaches are promising but not enough.
v Dynamic adaptation is necessary for future systems.
v Flextream provides a hybrid static/dynamic approach to improve efficiency.