1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of.

Slides:

Advertisements

Similar presentations

Programmable FIR Filter Design

Advertisements

School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) SSA Guo, Yao.

DSP-CIS Chapter-5: Filter Realization

ECE 454 Computer Systems Programming Compiler and Optimization (I) Ding Yuan ECE Dept., University of Toronto

SirenDetect Alerting Drivers about Emergency Vehicles Jennifer Michelstein Department of Electrical Engineering Adviser: Professor Peter Kindlmann May.

Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.

A Dataflow Programming Language and Its Compiler for Streaming Systems

The Assembly Language Level

VLSI Communication SystemsRecap VLSI Communication Systems RECAP.

An Empirical Characterization of Stream Programs and its Implications for Language and Compiler Design Bill Thies 1 and Saman Amarasinghe 2 1 Microsoft.

StreamBit: Sketching high-performance implementations of bitstream programs Armando Solar-Lezama, Rastislav Bodik UC Berkeley.

Common Sub-expression Elim Want to compute when an expression is available in a var Domain:

Phased Scheduling of Stream Programs Michal Karczmarek, William Thies and Saman Amarasinghe MIT LCS.

1 Intermediate representation Goals: –encode knowledge about the program –facilitate analysis –facilitate retargeting –facilitate optimization scanning.

Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.

Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.

Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.

Recap from last time: live variables x := 5 y := x + 2 x := x + 1 y := x y...

Manipulating Lossless Video in the Compressed Domain William Thies 1, Steven Hall 2, Saman Amarasinghe 2 1 Microsoft Research India 2 Massachusetts Institute.

University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.

1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts.

L29:Lower Power Embedded Architecture Design 성균관대학교 조 준 동 교수,

Development in hardware – Why? Option: array of custom processing nodes Step 1: analyze the application and extract the component tasks Step 2: design.

SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.

Optimizing Loop Performance for Clustered VLIW Architectures by Yi Qian (Texas Instruments) Co-authors: Steve Carr (Michigan Technological University)

Trigger design engineering tools. Data flow analysis Data flow analysis through the entire Trigger Processor allow us to refine the optimal architecture.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

Telecommunications and Signal Processing Seminar Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

Communication Overhead Estimation on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.

Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University.

Carnegie Mellon Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus.

Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

 Embedded Digital Signal Processing (DSP) systems  Specification with floating-point data types  Implementation in fixed-point architectures  Precision.

Stream Programming: Luring Programmers into the Multicore Era Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts Institute.

Investigating Adaptive Compilation using the MIPSpro Compiler Keith D. Cooper Todd Waterman Department of Computer Science Rice University Houston, TX.

StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, Michael Gordon, David Maze, Jasper Lin, Ali Meli, Andrew Lamb, Chris.

RICE UNIVERSITY DSPs for future wireless systems Sridhar Rajagopal.

Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.

Gedae, Inc. Gedae: Auto Coding to a Virtual Machine Authors: William I. Lundgren, Kerry B. Barnes, James W. Steed HPEC 2004.

Compiler Optimizations ECE 454 Computer Systems Programming Topics: The Role of the Compiler Common Compiler (Automatic) Code Optimizations Cristiana Amza.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.

Michael I. Gordon, William Thies, and Saman Amarasinghe

Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe.

DR. SIMING LIU SPRING 2016 COMPUTER SCIENCE AND ENGINEERING UNIVERSITY OF NEVADA, RENO CS 219 Computer Organization.

Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.

StreamIt – A Programming Language for the Era of Multicores Saman Amarasinghe

StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

SR: 599 report Channel Estimation for W-CDMA on DSPs Sridhar Rajagopal ECE Dept., Rice University Elec 599.

A Compiler Infrastructure for Stream Programs Bill Thies Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin, Andrew Lamb, David Maze, Rodric.

Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer.

Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.

StreamIt: A Language for Streaming Applications

Fang Fang James C. Hoe Markus Püschel Smarahara Misra

Linear Filters in StreamIt

Embedded Systems Design

Teleport Messaging for Distributed Stream Programs

Introduction to cosynthesis Rabi Mahapatra CSCE617

Cache Aware Optimization of Stream Programs

High Performance Stream Processing for Mobile Sensing Applications

StreamIt: High-Level Stream Programming on Raw

Multiplier-less Multiplication by Constants

Armando Solar-Lezama, Rastislav Bodik UC Berkeley

Real time signal processing

Fixed-point Analysis of Digital Filters

Presentation transcript:

1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of Technology 2 Sandbridge Technologies CASES

2 Streaming Application Domain Based on a stream of data –Graphics, multimedia, software radio –Radar tracking, microphone arrays, HDTV editing, cell phone base stations Properties of stream programs –Regular and repeating computation –Parallel, independent actors with explicit communication –Data items have short lifetimes AtoD Decode duplicate LPF 2 LPF 1 LPF 3 HPF 2 HPF 1 HPF 3 Transmit roundrobin Encode

3 Conventional DSP Design Flow DSP Optimizations Rewrite the program Design the Datapaths (no control flow) Architecture-specific Optimizations (performance, power, code size) Spec. (data-flow diagram) Coefficient Tables C/Assembly Code Signal Processing Expert in Matlab Software Engineer in C and Assembly

4 Ideal DSP Design Flow DSP Optimizations High-Level Program (dataflow + control) Architecture-Specific Optimizations C/Assembly Code Application-Level Design Compiler Application Programmer Challenge: maintaining performance

5 The StreamIt Language Goals: –Provide a high-level stream programming model –Invent new compiler technology for streams Contributions: –Language design [CC ’02, PPoPP ’05] –Compiling to tiled architectures [ASPLOS ’02, ISCA ’04, Graphics Hardware ’05] –Cache-aware scheduling [LCTES ’03, LCTES ’05] –Domain-specific optimizations [PLDI ’03, CASES ‘05]

6 void->void pipeline FMRadio(int N, float lo, float hi) { add AtoD(); add FMDemod(); add splitjoin { split duplicate; for (int i=0; i<N; i++) { add pipeline { add LowPassFilter(lo + i*(hi - lo)/N); add HighPassFilter(lo + i*(hi - lo)/N); } join roundrobin(); } add Adder(); add Speaker(); } Adder Speaker AtoD FMDemod LPF 1 Duplicate RoundRobin LPF 2 LPF 3 HPF 1 HPF 2 HPF 3 Programming in StreamIt

7 Example StreamIt Filter float->float filter LowPassButterWorth (float sampleRate, float cutoff) { float coeff; float x; init { coeff = calcCoeff(sampleRate, cutoff); } work peek 2 push 1 pop 1 { x = peek(0) + peek(1) + coeff * x; push(x); pop(); } filter

8 Focus: Linear State Space Filters Properties: 1. Outputs are linear function of inputs and states 2. New states are linear function of inputs and states Most common target of DSP optimizations –FIR / IIR filters –Linear difference equations –Upsamplers / downsamplers –DCTs

9 Representing State Space Filters u A state space filter is a tuple  A, B, C, D  x’ = Ax + Bu y = Cx + Du  A, B, C, D  inputs states outputs

10 Representing State Space Filters u A state space filter is a tuple  A, B, C, D  x’ = Ax + Bu y = Cx + Du  A, B, C, D  float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } inputs states outputs

11 Representing State Space Filters B = A = u 2 2 C = 2 A state space filter is a tuple  A, B, C, D  x’ = Ax + Bu D = y = Cx + Du float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } inputs states outputs

12 Representing State Space Filters A state space filter is a tuple  A, B, C, D  B = A = u 2 2 C = 2 x’ = Ax + Bu D = y = Cx + Du float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } inputs states outputs

13 Representing State Space Filters A state space filter is a tuple  A, B, C, D  B = A = u 2 2 C = 2 x’ = Ax + Bu D = y = Cx + Du float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } inputs states outputs

14 Representing State Space Filters A state space filter is a tuple  A, B, C, D  B = A = u 2 2 C = 2 x’ = Ax + Bu D = y = Cx + Du float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } inputs states outputs

15 float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } Representing State Space Filters A state space filter is a tuple  A, B, C, D  B = A = u 2 2 C = 2 x’ = Ax + Bu D = y = Cx + Du inputs states outputs

16 Representing State Space Filters float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } A state space filter is a tuple  A, B, C, D  B = A = u 2 2 C = 2 x’ = Ax + Bu D = y = Cx + Du Linear dataflow analysis inputs states outputs

17 State Space Optimizations 1.State removal 2.Reducing the number of parameters 3.Combining adjacent filters

18 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation

19 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix Tx’ = TAx + TBu y = Cx + Du

20 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix Tx’ = TA(T -1 T)x + TBu y = C(T -1 T)x + Du

21 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix Tx’ = TAT -1 (Tx) + TBu y = CT -1 (Tx) + Du

22 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix, z = Tx Tx’ = TAT -1 (Tx) + TBu y = CT -1 (Tx) + Du

23 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix, z = Tx z’ = TAT -1 z + TBu y = CT -1 z + Du

24 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix, z = Tx z’ = A’z + B’u y = C’z + D’u A’ = TAT -1 B’ =TB C’ = CT -1 D’ = D

25 x’ = Ax + Bu y = Cx + Du Change-of-Basis Transformation T = invertible matrix, z = Tx z’ = A’z + B’u y = C’z + D’u A’ = TAT -1 B’ =TB C’ = CT -1 D’ = D Can map original states x to transformed states z = Tx without changing I/O behavior

26 1) State Removal Can remove states which are: a. Unreachable – do not depend on input b. Unobservable – do not affect output To expose unreachable states, reduce [A | B] to a kind of row-echelon form –For unobservable states, reduce [A T | C T ] Automatically finds minimal number of states

27 State Removal Example x + x’ = y = x + 2u float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } u T = x + x’ = y = x + 2u u

28 State Removal Example x + x’ = y = x + 2u float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } u T = x + x’ = y = x + 2u u x1 is unobservable

29 State Removal Example x + x’ = y = x + 2u float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } u T = float->float filter IIR { float x; work push 1 pop 1 { float u = pop(); push(2*(x+u)); x = 0.9*x + 0.5*u; } } x’ = 0.9x + 0.5u y = 2x + 2u

30 State Removal Example float->float filter IIR { float x1, x2; work push 1 pop 1 { float u = pop(); push(2*(x1+x2+u)); x1 = 0.9*x *u; x2 = 0.9*x *u; } } float->float filter IIR { float x; work push 1 pop 1 { float u = pop(); push(2*(x+u)); x = 0.9*x + 0.5*u; } } 9 FLOPs 12 load/store 5 FLOPs 8 load/store output

31 2) Parameter Reduction Goal: Convert matrix entries (parameters) to 0 or 1 Allows static evaluation: 1*x  x Eliminate 1 multiply 0*x + y  y Eliminate 1 multiply, 1 add Algorithm (Ackerman & Bucy, 1971) –Also reduces matrices [A | B] and [A T | C T ] –Attains a canonical form with few parameters

32 Parameter Reduction Example x’ = 0.9x + 0.5u y = 2x + 2u 2 T = x’ = 0.9x + 1u y = 1x + 2u 6 FLOPs output 4 FLOPs output

33 Filter 1 3) Combining Adjacent Filters Filter 2 y u z y = D 1 u z = D 2 y E Combined Filter u z z = Eu z = D 2 D 1 u

34 Filter 1 3) Combining Adjacent Filters Filter 2 y u z Combined Filter u z z = D 2 C 1 C 2 x + D 2 D 1 u A 1 0 B 2 C 1 A 2 B1B2D1 B1B2D1 x’ = x + u Also in paper: - combination of parallel streams - combination of feedback loops - expansion of mis-matching filters

35 IIR Filter Combination Example x’ = 0.9x + u y = x + 2u Decimator y = [1 0] u1u2u1u2 IIR / Decimator x’ = 0.81x + [0.9 1] y = x + [2 0] u1u2u1u2 8 FLOPs output 6 FLOPs output u1u2u1u2

36 IIR Filter Combination Example x’ = 0.9x + u y = x + 2u Decimator y = [1 0] u1u2u1u2 8 FLOPs output 6 FLOPs output IIR / Decimator x’ = 0.81x + [0.9 1] y = x + [2 0] u1u2u1u2 u1u2u1u2 As decimation factor goes to , eliminate up to 75% of FLOPs.

37 Combination Hazards Combination sometimes increases FLOPs Example: FFT –Combination results in DFT –Converts O(n log n) algorithm to O(n 2 ) Solution: only apply where beneficial –Operations known at compile time –Using selection algorithm, FLOPs never increase See PLDI ’03 paper for details

38 Results Subsumes combination of linear components –Evaluated previously [PLDI ’03] Applications: FIR, RateConvert, TargetDetect, Radar, FMRadio, FilterBank, Vocoder, Oversampler, DtoA –Removed 44% of FLOPs –Speedup of 120% on Pentium 4 Results using state space analysis Speedup (Pentium 3) IIR + 1:2 Decimator49% IIR + 1:16 Decimator87%

39 Ongoing Work Experimental evaluation –Evaluate real applications on embedded machines –In progress: MPEG2, JPEG, radar tracker Numerical precision constraints –Precision often influences choice of coefficients –Transformations should respect constraints

40 Related Work Linear stream optimizations [Lamb et al. ’03] –Deals with stateless filters Automatic optimization of linear libraries –SPIRAL, FFTW, ATLAS, Sparsity Stream languages –Lustre, Esterel, Signal, Lucid, Lucid Synchrone, Brook, Spidle, Cg, Occam, Sisal, Parallel Haskell Common sub-expression elimination

41 Conclusions Linear state space analysis: An elegant compiler IR for DSP programs Optimizations using state space representation: 1. State removal 2. Parameter reduction 3. Combining adjacent filters Step towards adding efficient abstraction layers that remove the DSP expert from the design flow