StreamIt: High-Level Stream Programming on Raw

Slides:



Advertisements
Similar presentations
Accelerators for HPC: Programming Models Accelerators for HPC: StreamIt on GPU High Performance Applications on Heterogeneous Windows Clusters
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property.
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
Programmability Issues
Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
A Dataflow Programming Language and Its Compiler for Streaming Systems
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Distributed Computations
Phased Scheduling of Stream Programs Michal Karczmarek, William Thies and Saman Amarasinghe MIT LCS.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
Optimus: Efficient Realization of Streaming Applications on FPGAs University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke IBM Research: David.
1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts.
Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language.
Topic #10: Optimization EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Automatic Identification of Concurrency in Handel-C Joseph C Libby, Kenneth B Kent, Farnaz Gharibian Faculty of Computer Science University of New Brunswick.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
Communication Overhead Estimation on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.
StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.
A Reconfigurable Architecture for Load-Balanced Rendering Graphics Hardware July 31, 2005, Los Angeles, CA Jiawen Chen Michael I. Gordon William Thies.
Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.
1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of.
StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, Michael Gordon, David Maze, Jasper Lin, Ali Meli, Andrew Lamb, Chris.
Lab 2 Parallel processing using NIOS II processors
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
A Common Machine Language for Communication-Exposed Architectures Bill Thies, Michal Karczmarek, Michael Gordon, David Maze and Saman Amarasinghe MIT Laboratory.
EECS 583 – Class 20 Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling University of Michigan November 30, 2011 Guest Speaker Today:
Michael I. Gordon, William Thies, and Saman Amarasinghe
Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe.
Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.
StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.
A Compiler Infrastructure for Stream Programs Bill Thies Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin, Andrew Lamb, David Maze, Rodric.
Memory-Aware Compilation Philip Sweany 10/20/2011.
Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer.
Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.
StreamIt: A Language for Streaming Applications
Spring 2016 Program Analysis and Verification
Advanced Computer Systems
User-Written Functions
Introduction to Compiler Construction
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Parallel processing is not easy
Computer Programming.
Static Memory Management for Efficient Mobile Sensing Applications
SOFTWARE DESIGN AND ARCHITECTURE
A Common Machine Language for Communication-Exposed Architectures
Linear Filters in StreamIt
Compiler Construction
Teleport Messaging for Distributed Stream Programs
Chapter 8: Introduction to High-Level Language Programming
Introduction to cosynthesis Rabi Mahapatra CSCE617
Cache Aware Optimization of Stream Programs
Big-Oh and Execution Time: A Review
A Reconfigurable Architecture for Load-Balanced Rendering
High Performance Stream Processing for Mobile Sensing Applications
Chapter 11 Introduction to Programming in C
Hossein Omidian, Guy Lemieux
Introduction to MapReduce
5/7/2019 Map Reduce Map reduce.
Parallel Programming in C with MPI and OpenMP
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

StreamIt: High-Level Stream Programming on Raw Michael Gordon, Michal Karczmarek, Andrew Lamb, Jasper Lin, David Maze, William Thies, and Saman Amarasinghe March 6, 2003

The StreamIt Language Why use the StreamIt compiler? Automatic partitioning and load balancing Automatic layout Automatic switch code generation Automatic buffer management Aggressive domain-specific optimizations All with a simple, high-level syntax! Language is architecture-independent

A Simple Counter void->void pipeline Counter() { add IntSource(); add IntPrinter(); } void->int filter IntSource() { int x; init { x = 0; } work push 1 { push (x++); } int->void filter IntPrinter() { work pop 1 { print(pop()); } Counter IntSource IntPrinter

Demo Compile and run the program counter % knit --raw 4 Counter.str counter % make –f Makefile.streamit run Inspect graphs of program counter % dotty schedule.dot counter % dotty layout.dot

Representing Streams Hierarchical structures: Pipeline SplitJoin Feedback Loop Basic programmable unit: Filter

Representing Filters Autonomous unit of computation No access to global resources Communicates through FIFO channels - pop() - peek(index) - push(value) Peek / pop / push rates must be constant Looks like a Java class, with An initialization function A steady-state “work” function

Filter Example: LowPassFilter float->float filter LowPassFilter (int N) { float[N] weights; init { weights = calcWeights(N); } work push 1 pop 1 peek N { float result = 0; for (int i=0; i<N; i++) { result += weights[i] * peek(i); push(result); pop();

Filter Example: LowPassFilter float->float filter LowPassFilter (int N) { float[N] weights; init { weights = calcWeights(N); } work push 1 pop 1 peek N { float result = 0; for (int i=0; i<N; i++) { result += weights[i] * peek(i); push(result); pop(); N LPF

Filter Example: LowPassFilter float->float filter LowPassFilter (int N) { float[N] weights; init { weights = calcWeights(N); } work push 1 pop 1 peek N { float result = 0; for (int i=0; i<N; i++) { result += weights[i] * peek(i); push(result); pop(); N LPF

Filter Example: LowPassFilter float->float filter LowPassFilter (int N) { float[N] weights; init { weights = calcWeights(N); } work push 1 pop 1 peek N { float result = 0; for (int i=0; i<N; i++) { result += weights[i] * peek(i); push(result); pop(); N LPF

Filter Example: LowPassFilter float->float filter LowPassFilter (int N) { float[N] weights; init { weights = calcWeights(N); } work push 1 pop 1 peek N { float result = 0; for (int i=0; i<N; i++) { result += weights[i] * peek(i); push(result); pop(); N LPF

SplitJoin Example: BandPass Filter float->float pipeline BandPassFilter(float low, float high) { add BPFCore(low, high); add Subtract(); } float->float splitjoin BPFCore(float low, float high) { split duplicate; add LowPassFilter(high); add LowPassFilter(low); join roundrobin; float->float filter Subtract { work pop 2 push 1 { float val1 = pop(); float val2 = pop(); push(val1 – val2); } BandPassFilter BPFCore duplicate LPF LPF roundrobin Subtract

Parameterization: Equalizer float->float pipeline Equalizer (int N) { add splitjoin { split duplicate; float freq = 10000; for (int i = 0; i < N; i ++, freq*=2) { add BandPassFilter(freq, 2*freq); } join roundrobin; add Adder(N); Equalizer duplicate BPF BPF BPF roundrobin Adder

FM Radio float->float pipeline FMRadio { FMRadio add FloatSource(); add LowPassFilter(); add FMDemodulator(); add Equalizer(8); add FloatPrinter(); } FMRadio FloatSource LowPassFilter FMDemodulator Equalizer FloatPrinter

Demo: Compile and Run fm % knit --raw 4 –-partition -–numbers 10 FMRadio.str fm % make –f Makefile.streamit run Options used: --raw 4 target 4x4 raw machine --partition use automatic greedy partitioning --numbers 10 gather numbers for 10 iterations, and store in results.out

Compiler Flow Summary StreamIt Front-End Partitioning Any Java Kopi StreamIt code StreamIt Front-End Partitioning Legal Java file Any Java Kopi Load-balanced Compiler Front-End Stream Graph Layout Class file Parse Tree StreamIt SIR Quickly explain front-end explain the details of each phase explain the scheduler Since we support parameterized graph construction, we propagate constants and expand the graph to its full form Scheduler Filters assigned Java Library Conversion to Raw tiles SIR Code Processor (unexpanded) Generation Code Graph Expansion SIR Communication Switch (expanded) Scheduler Code

Stream Graph Before Partitioning fm % dotty before.dot

Stream Graph After Partitioning fm % dotty after.dot

Layout on Raw fm % dotty layout.dot

Initial and Steady-State Schedule fm % dotty schedule.dot

Work Estimates (Graph) fm % dotty work-before.dot

Work Estimates (Table) fm % cat work-before.txt Filter Reps Measured Work Estimated Work (Measured-Estimated)/Measured Total Measured Work FMDemodulator__31 1 219 LowPassFilter__21 119 LowPassFilter__49 103 LowPassFilter__67 FloatSource__3 5 8 40 FloatPrinter__82 21 Adder__79 15 Subtract__72 10

Collected Results fm % cat results.out Performance Results Tiles in configuration: 16 Tiles assigned (to filters or joiners): 16 Run for 10 steady state cycles. With 0 items skipped for init. With 1 items printed per steady state. cycles MFLOPS work_count -------------------------- 2153 350 19227 2220 347 19731 2229 310 18963 2229 291 18512

Collected Results fm % cat results.out Performance Results Tiles in configuration: 16 Tiles assigned (to filters or joiners): 16 Run for 10 steady state cycles. With 0 items skipped for init. With 1 items printed per steady state. cycles MFLOPS work_count -------------------------- 2153 350 19227 2220 347 19731 2229 310 18963 2229 291 18512 2229 292 18537 2229 293 18559 2229 291 18513 2229 292 18557 2229 289 18510 2229 291 18530 Summmary: Steady State Executions: 10 Total Cycles: 22205 Avg Cycles per Steady-State: 2220 Thruput per 10^5: 45 Avg MFLOPS: 304 workCount* = 187639 / 355280

Understanding Performance

Understanding Performance

Demo: Linear Optimization fm % knit --linearreplacement --raw 4 -–numbers 10 FMRadio.str fm % make –f Makefile.streamit run New option: --linearreplacement identifies filters which compute linear functions of their input, and replaces adjacent linear nodes with a single matrix-multiply

Stream Graph Before Partitioning fm % dotty before.dot

Stream Graph Before Partitioning fm % dotty before.dot Entire Equalizer collapsed! without linear replacement

Results with Linear Optimization fm % cat results.out Summmary: Steady State Executions: 10 Total Cycles: 7260 Avg Cycles per Steady-State: 726 Thruput per 10^5: 137 Avg MFLOPS: 128 workCount* = 15724 / 116160

Results with Linear Optimization fm % cat results.out Summmary: Steady State Executions: 10 Total Cycles: 7260 Avg Cycles per Steady-State: 726 Thruput per 10^5: 137 Avg MFLOPS: 128 workCount* = 15724 / 116160 Speedup by factor of 3

Results with Linear Optimization fm % cat results.out Summmary: Steady State Executions: 10 Total Cycles: 7260 Avg Cycles per Steady-State: 726 Thruput per 10^5: 137 Avg MFLOPS: 128 workCount* = 15724 / 116160 Speedup by factor of 3 Allows programmer to write simple, modular filters which compiler combines automatically

Other Results: Processor Utilization

Speedup Over Single Tile throughput of streamit on 16 tiles normalized to a c implementation on a single tile. all application get at least a 4x speedup for the remaining 3 benchmarks, the c implementation does not fit on a single tile For Radio we obtained the C implementation from a 3rd party For FIR, Sort, FFT, Filterbank, and 3GPP we wrote the C implementation following a reference algorithm.

Scaling of Throughput throughput of streamit on 16 tiles normalized to a c implementation on a single tile. all application get at least a 4x speedup for the remaining 3 benchmarks, the c implementation does not fit on a single tile

Compiler Status Raw backend has been working for more than a year Robust partitioning, layout, and scheduling Still working on improvements: Dynamic programming partitioner Optimized scheduling, routing, code generation Frontend is relatively new Semantic checker still in progress Some malformed inputs cause Exceptions We are eager to gain user feedback! throughput of streamit on 16 tiles normalized to a c implementation on a single tile. all application get at least a 4x speedup for the remaining 3 benchmarks, the c implementation does not fit on a single tile

Library Support Option: --library StreamIt code Option: --library Run with Java library, not the compiler. Greatly facilitates application development, debugging, and verification. Given File.str, the frontend will produce File.java, which you can edit and instrument like a normal Java file. StreamIt Front-End Legal Java file Any Java Kopi Compiler Front-End Class file Parse Tree StreamIt SIR throughput of streamit on 16 tiles normalized to a c implementation on a single tile. all application get at least a 4x speedup for the remaining 3 benchmarks, the c implementation does not fit on a single tile Java Library Conversion SIR (unexpanded) Graph Expansion SIR (expanded)

Library Support Option: --library StreamIt code Option: --library Run with Java library, not the compiler. Greatly facilitates application development, debugging, and verification. Given File.str, the frontend will produce File.java, which you can edit and instrument like a normal Java file. StreamIt Front-End Legal Java file Any Java Kopi Compiler Front-End Class file Parse Tree StreamIt SIR throughput of streamit on 16 tiles normalized to a c implementation on a single tile. all application get at least a 4x speedup for the remaining 3 benchmarks, the c implementation does not fit on a single tile Java Library Conversion SIR (unexpanded) Graph Expansion Many more options will be documented in the release. SIR (expanded)

Summary StreamIt Homepage http://cag.lcs.mit.edu/streamit Why use StreamIt? High-level, architecture-independent syntax Automatic partitioning, load balancing, layout, switch code generation, and buffer management Aggressive domain-specific optimizations Many graphical outputs for programmer Release by next Friday, 3/14/03 StreamIt Homepage http://cag.lcs.mit.edu/streamit

Backup Slides

N-Element Merge Sort (3-level)

N-Element Merge Sort (K-level) pipeline MergeSort (int N, int K) { if (K==1) { add Sort(N); } else { add splitjoin { split roundrobin; add MergeSort(N/2, K-1); joiner roundrobin; } add Merge(N);

Example: Radar App. (Original) Splitter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter Joiner Splitter Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner

Example: Radar App. (Original)

Example: Radar App. (Original) Splitter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter Joiner Splitter Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner

Example: Radar App. (Original) Splitter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter Joiner Splitter Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner

Example: Radar App. Joiner Joiner Splitter Splitter Detector FirFilter Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner

Example: Radar App. Joiner Joiner Splitter Splitter Detector FirFilter Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner

Example: Radar App. Joiner Joiner Splitter Splitter Detector FirFilter Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Vector Mult FirFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. Joiner Joiner Splitter Splitter FIRFilter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. (Balanced) Splitter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter Joiner Splitter Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Vector Mult FIRFilter Magnitude Detector Joiner

Example: Radar App. (Balanced)

A Moving Average void->void pipeline MovingAverage() { add IntSource(); add Averager(10); add IntPrinter(); } int->int filter Averager(int N) { work pop 1 push 1 peek N-1 { int sum = 0; for (int i=0; i<N; i++) { sum += peek(i); } push(sum/N); pop(); Counter IntSource Averager IntPrinter

A Moving Average void->void pipeline MovingAverage() { add IntSource(); add Averager(4); add IntPrinter(); } int->int filter Averager(int N) { work pop 1 push 1 peek N-1 { int sum = 0; for (int i=0; i<N; i++) { sum += peek(i); } push(sum/N); pop(); Counter IntSource N Averager IntPrinter

A Moving Average void->void pipeline MovingAverage() { add IntSource(); add Averager(4); add IntPrinter(); } int->int filter Averager(int N) { work pop 1 push 1 peek N-1 { int sum = 0; for (int i=0; i<N; i++) { sum += peek(i); } push(sum/N); pop(); Counter IntSource N Averager IntPrinter

A Moving Average void->void pipeline MovingAverage() { add IntSource(); add Averager(4); add IntPrinter(); } int->int filter Averager(int N) { work pop 1 push 1 peek N-1 { int sum = 0; for (int i=0; i<N; i++) { sum += peek(i); } push(sum/N); pop(); Counter IntSource N Averager IntPrinter

A Moving Average void->void pipeline MovingAverage() { add IntSource(); add Averager(4); add IntPrinter(); } int->int filter Averager(int N) { work pop 1 push 1 peek N-1 { int sum = 0; for (int i=0; i<N; i++) { sum += peek(i); } push(sum/N); pop(); Counter IntSource N Averager IntPrinter