Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer.

Slides:



Advertisements
Similar presentations
Accelerators for HPC: Programming Models Accelerators for HPC: StreamIt on GPU High Performance Applications on Heterogeneous Windows Clusters
Advertisements

DSPs Vs General Purpose Microprocessors
Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property.
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
Course Outline Traditional Static Program Analysis –Theory Compiler Optimizations; Control Flow Graphs Data-flow Analysis – today’s class –Classic analyses.
Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
ANTLR in SSP Xingzhong Xu Hong Man Aug Outline ANTLR Abstract Syntax Tree Code Equivalence (Code Re-hosting) Future Work.
A Dataflow Programming Language and Its Compiler for Streaming Systems
Fast Paths in Concurrent Programs Wen Xu, Princeton University Sanjeev Kumar, Intel Labs. Kai Li, Princeton University.
SSP Re-hosting System Development: CLBM Overview and Module Recognition SSP Team Department of ECE Stevens Institute of Technology Presented by Hongbing.
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Component Technologies for Embedded Systems Johan Eker.
Phased Scheduling of Stream Programs Michal Karczmarek, William Thies and Saman Amarasinghe MIT LCS.
CS 536 Spring Intermediate Code. Local Optimizations. Lecture 22.
University of Michigan Electrical Engineering and Computer Science MacroSS: Macro-SIMDization of Streaming Applications Amir Hormati*, Yoonseo Choi ‡,
Pipelining and Retiming 1 Pipelining  Adding registers along a path  split combinational logic into multiple cycles  increase clock rate  increase.
X := 11; if (x == 11) { DoSomething(); } else { DoSomethingElse(); x := x + 1; } y := x; // value of y? Phase ordering problem Optimizations can interact.
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
(Page 554 – 564) Ping Perez CS 147 Summer 2001 Alternative Parallel Architectures  Dataflow  Systolic arrays  Neural networks.
Math for CSLecture 11 Mathematical Methods for Computer Science Lecture 1.
Intermediate Code. Local Optimizations
VLSI DSP 2008Y.T. Hwang3-1 Chapter 3 Algorithm Representation & Iteration Bound.
Topic 1: Introduction to Computers and Programming
University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.
BRASS Analysis of QuasiStatic Scheduling Techniques in a Virtualized Reconfigurable Machine Yury Markovskiy, Eylon Caspi, Randy Huang, Joseph Yeh, Michael.
1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts.
Generic Software Pipelining at the Assembly Level Markus Pister
Compiler Research in HPC Lab R. Govindarajan High Performance Computing Lab.
PLC introduction1 Discrete Event Control Concept Representation DEC controller design DEC controller implementation.
CMPE 511 DATA FLOW MACHINES Mustafa Emre ERKOÇ 11/12/2003.
Charles Kime & Thomas Kaminski © 2004 Pearson Education, Inc. Terms of Use (Hyperlinks are active in View Show mode) Terms of Use Lecture 12 – Design Procedure.
CHALLENGING SCHEDULING PROBLEM IN THE FIELD OF SYSTEM DESIGN Alessio Guerri Michele Lombardi * Michela Milano DEIS, University of Bologna.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Communication Overhead Estimation on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.
Efficient Mapping onto Coarse-Grained Reconfigurable Architectures using Graph Drawing based Algorithm Jonghee Yoon, Aviral Shrivastava *, Minwook Ahn,
Dataflow Frequency Analysis based on Whole Program Paths Eduard Mehofer Institute for Software Science University of Vienna
Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.
SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Leiden Embedded Research Center Prof. Dr. Ed F. Deprettere, Dr. Bart Kienhuis, Dr. Todor Stefanov Leiden Embedded Research Center (LERC) Leiden Institute.
StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.
1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of.
StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, Michael Gordon, David Maze, Jasper Lin, Ali Meli, Andrew Lamb, Chris.
Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
A Common Machine Language for Communication-Exposed Architectures Bill Thies, Michal Karczmarek, Michael Gordon, David Maze and Saman Amarasinghe MIT Laboratory.
Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.
Michael I. Gordon, William Thies, and Saman Amarasinghe
Hy-C A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors Philip Sweany 8/27/2010.
University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.
Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe.
Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.
StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.
A Compiler Infrastructure for Stream Programs Bill Thies Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin, Andrew Lamb, David Maze, Rodric.
High Performance Flexible DSP Infrastructure Based on MPI and VSIPL 7th Annual Workshop on High Performance Embedded Computing MIT Lincoln Laboratory
Memory-Aware Compilation Philip Sweany 10/20/2011.
Slack Analysis in the System Design Loop Girish VenkataramaniCarnegie Mellon University, The MathWorks Seth C. Goldstein Carnegie Mellon University.
Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.
StreamIt: A Language for Streaming Applications
Marilyn Wolf1 With contributions from:
Code Optimization.
A Common Machine Language for Communication-Exposed Architectures
Linear Filters in StreamIt
What is this “Viterbi Decoding”
Teleport Messaging for Distributed Stream Programs
Cache Aware Optimization of Stream Programs
High Performance Stream Processing for Mobile Sensing Applications
StreamIt: High-Level Stream Programming on Raw
Presentation transcript:

Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer Science

Andrew A. Lamb4/28/ Motivation Digital devices, massive computation pervade modern life (cell phones, MP3, HDTV, etc.) Devices complex, software more complicated Performance constraints (real time, power consumption) dictate high level of optimization Best performance assembly (50% cell phone code is written in assembly) Assembly is (very) hard to reuse Automatic optimization is critical

Andrew A. Lamb4/28/ Outline Motivation StreamIt Linear Dataflow Analysis Performance Optimizations Results

Andrew A. Lamb4/28/ The StreamIt Language Goals: High performance Improved programmer productivity (modularity) Contributions: Structured model of streams Compiler buffer management Automated scheduling (Michal Karczmarek) Target complex architecture (Mike Gordon) Domain specific optimizations (Andrew Lamb)

Andrew A. Lamb4/28/ Programs in StreamIt Traditional: stream programs are graphs No simple textual representation Difficult to analyze and optimize Insight: stream programs have structure unstructuredstructured

Andrew A. Lamb4/28/ Why Structured Streams? Compared to structured control flow PRO:more robust, more analyzable CON: “restricted” style of programming GOTO statementsif / else / for statements

Andrew A. Lamb4/28/ Basic programmable unit: Filter Hierarchical structures: Pipeline SplitJoin Feedback Loop Structured Streams

Andrew A. Lamb4/28/ Representing Filters Autonomous unit of computation No access to global resources One input, one output Communicates through FIFO channels pop(), peek(index) push(value) “Firing” is the atomic execution step A firing’s peek, pop, push rates must be constant Code within filter is general purpose – like C

Andrew A. Lamb4/28/ Outline Motivation StreamIt Linear Dataflow Analysis Performance Optimizations Results

Andrew A. Lamb4/28/ What is a Linear Filter? Generic filters generate outputs (possibly) based on their inputs Linear filters: standard sense of linearity outputs ( y j ) are weighted sums of the inputs ( x i ) (possibly plus a constant) b constant a i constant for all i e is the number of inputs ?

Andrew A. Lamb4/28/ Linearity and Matricies Represent multiple inputs, multiple outputs with matrix multiply We treat inputs ( x i ) and outputs ( y j ) as vectors of values (x, and y respectively) Filter representation: Matrix of weights A Vector of constants b peek, pop, push rates A filter firing computes: y = xA + b

Andrew A. Lamb4/28/ Goal: convert the filter’s imperative code into an equivalent linear node y = xA + b Technique: “Linear Dataflow Analysis” Resembles standard constant propagation “Linear form” is a vector and a constant Keep mapping from each expr. to linear form Extract linear form for each value pushed Extracting Linear Filters expr

Andrew A. Lamb4/28/ Outline Motivation StreamIt Linear Dataflow Analysis Performance Optimizations Results

Andrew A. Lamb4/28/ Combining Linear Filters Pipelines and splitjoins containing only linear filters can be collapsed into a single node Example: pipeline with peek(B)=pop(B) [A][B] [C] x 2 y = x 2 B = x 1 A y = x 1 A’ B’ = x 1 C x1x1 y C = A’ B’ where A’ and B’ have been scaled and duplicated so that the dimensions match. x1x1 x2x2 y

Andrew A. Lamb4/28/ Frequency Replacement First, identify linear nodes with FIR filters from discrete time linear systems =(A,b,3,M,N) h 1 [n] =(A,b,peek,pop,push) Linear Node DT System h N-1 [n] switch decimator M M

Andrew A. Lamb4/28/ Automatic Selection Applying optimizations blindly is not good Combination example: =(A,b,3,1,1) =(A,b,1,1,3) =(A,b,3,1,3) originally 2 mults output 3 mults output after “optimization”

Andrew A. Lamb4/28/ Outline Motivation StreamIt Linear Dataflow Analysis Performance Optimizations Results

Andrew A. Lamb4/28/ Fmul Reduction

Andrew A. Lamb4/28/ Execution Speedup 4.53%

Andrew A. Lamb4/28/ Conclusion StreamIt is a new language for high performance DSP applications Personal research contributions: Dataflow analysis determines a linear node that represents input/output relationship Combination and optimization using linear nodes Average performance speedup of 450% Using StreamIt and domain specific optimizations, modularity does not sacrifice performance.