An Empirical Characterization of Stream Programs and its Implications for Language and Compiler Design Bill Thies 1 and Saman Amarasinghe 2 1 Microsoft.

Slides:



Advertisements
Similar presentations
Exploiting Execution Order and Parallelism from Processing Flow Applying Pipeline-based Programming Method on Manycore Accelerators Shinichi Yamagiwa University.
Advertisements

System Integration and Performance
DSPs Vs General Purpose Microprocessors
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling.
Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
A Dataflow Programming Language and Its Compiler for Streaming Systems
1 HDTV in StreamIT: A Case Study Andrew A. Lamb MIT Laboratory for Computer Science Computer Architecture Group (8/2/2002)
Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Phased Scheduling of Stream Programs Michal Karczmarek, William Thies and Saman Amarasinghe MIT LCS.
University of Michigan Electrical Engineering and Computer Science MacroSS: Macro-SIMDization of Streaming Applications Amir Hormati*, Yoonseo Choi ‡,
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
Dataflow Process Networks Lee & Parks Synchronous Dataflow Lee & Messerschmitt Abhijit Davare Nathan Kitchen.
CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware
Code and Decoder Design of LDPC Codes for Gbps Systems Jeremy Thorpe Presented to: Microsoft Research
Saman Amarasinghe. Lets stick with current sequential languages Parallel Programming is hard! Billons of LOC written in sequential languages Let the compiler.
Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School.
Optimus: Efficient Realization of Streaming Applications on FPGAs University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke IBM Research: David.
Research Directions for On-chip Network Microarchitectures Luca Carloni, Steve Keckler, Robert Mullins, Vijay Narayanan, Steve Reinhardt, Michael Taylor.
Manipulating Lossless Video in the Compressed Domain William Thies 1, Steven Hall 2, Saman Amarasinghe 2 1 Microsoft Research India 2 Massachusetts Institute.
ECE669 L23: Parallel Compilation April 29, 2004 ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation.
Center for Embedded Computer Systems University of California, Irvine and San Diego SPARK: A Parallelizing High-Level Synthesis.
1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts.
Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language.
Topic #10: Optimization EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
NOVA: CONTINUOUS PIG/HADOOP WORKFLOWS. storage & processing scalable file system e.g. HDFS distributed sorting & hashing e.g. Map-Reduce dataflow programming.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Telecommunications and Signal Processing Seminar Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.
Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.
Communication Overhead Estimation on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.
StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.
A Reconfigurable Architecture for Load-Balanced Rendering Graphics Hardware July 31, 2005, Los Angeles, CA Jiawen Chen Michael I. Gordon William Thies.
RICE UNIVERSITY “Joint” architecture & algorithm designs for baseband signal processing Sridhar Rajagopal and Joseph R. Cavallaro Rice Center for Multimedia.
Stream Programming: Luring Programmers into the Multicore Era Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts Institute.
1 Optimizing Stream Programs Using Linear State Space Analysis Sitij Agrawal 1,2, William Thies 1, and Saman Amarasinghe 1 1 Massachusetts Institute of.
StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, Michael Gordon, David Maze, Jasper Lin, Ali Meli, Andrew Lamb, Chris.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
RICE UNIVERSITY DSPs for future wireless systems Sridhar Rajagopal.
Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.
Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.
A Common Machine Language for Communication-Exposed Architectures Bill Thies, Michal Karczmarek, Michael Gordon, David Maze and Saman Amarasinghe MIT Laboratory.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
Michael I. Gordon, William Thies, and Saman Amarasinghe
Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe.
From Error Control to Error Concealment Dr Farokh Marvasti Multimedia Lab King’s College London.
StreamIt – A Programming Language for the Era of Multicores Saman Amarasinghe
StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.
SR: 599 report Channel Estimation for W-CDMA on DSPs Sridhar Rajagopal ECE Dept., Rice University Elec 599.
A Pattern Language for Parallel Programming Beverly Sanders University of Florida.
Sunpyo Hong, Hyesoon Kim
A Compiler Infrastructure for Stream Programs Bill Thies Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin, Andrew Lamb, David Maze, Rodric.
Linear Analysis and Optimization of Stream Programs Masterworks Presentation Andrew A. Lamb 4/30/2003 Professor Saman Amarasinghe MIT Laboratory for Computer.
UT-Austin CART 1 Mechanisms for Streaming Architectures Stephen W. Keckler Computer Architecture and Technology Laboratory Department of Computer Sciences.
Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.
StreamIt: A Language for Streaming Applications
Parallel processing is not easy
Data Compression.
Parallel Programming By J. H. Wang May 2, 2017.
A Common Machine Language for Communication-Exposed Architectures
Linear Filters in StreamIt
Embedded Systems Design
Data Compression.
Teleport Messaging for Distributed Stream Programs
Cache Aware Optimization of Stream Programs
High Performance Stream Processing for Mobile Sensing Applications
StreamIt: High-Level Stream Programming on Raw
Parallel Programming in C with MPI and OpenMP
Presentation transcript:

An Empirical Characterization of Stream Programs and its Implications for Language and Compiler Design Bill Thies 1 and Saman Amarasinghe 2 1 Microsoft Research India 2 Massachusetts Institute of Technology PACT 2010

What Does it Take to Evaluate a New Language? Lines of Code Small studies make it hard to assess: - Experiences of new users over time - Common patterns across large programs

What Does it Take to Evaluate a New Language? Lines of Code StreamIt (PACT’10) 10K 020K 30K

What Does it Take to Evaluate a New Language? Lines of Code StreamIt (PACT’10) 10K 020K 30K Our characterization: - 65 programs - 34,000 lines of code - Written by 22 students - Over period of 8 years This allows: - Non-trivial benchmarks - Broad picture of application space - Understanding long-term user experience

Streaming Application Domain For programs based on streams of data –Audio, video, DSP, networking, and cryptographic processing kernels –Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics Adder Speaker AtoD FMDemod LPF 1 Duplicate RoundRobin LPF 2 LPF 3 HPF 1 HPF 2 HPF 3

Streaming Application Domain For programs based on streams of data –Audio, video, DSP, networking, and cryptographic processing kernels –Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics Properties of stream programs –Regular and repeating computation –Independent filters with explicit communication Adder Speaker AtoD FMDemod LPF 1 Duplicate RoundRobin LPF 2 LPF 3 HPF 1 HPF 2 HPF 3

StreamIt: A Language and Compiler for Stream Programs Key idea: design language that enables static analysis Goals: 1.Improve programmer productivity in the streaming domain 2.Expose and exploit the parallelism in stream programs Project contributions: –Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05] –Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06, MIT’10] –Domain-specific optimizations [PLDI'03, CASES'05, MM'08] –Cache-aware scheduling [LCTES'03, LCTES'05] –Extracting streams from legacy code [MICRO'07] –User + application studies [PLDI'05, P-PHEC'05, IPDPS'06]

StreamIt Language Basics High-level, architecture-independent language –Backend support for uniprocessors, multicores (Raw, SMP), cluster of workstations Model of computation: synchronous dataflow –Program is a graph of independent filters –Filters have an atomic execution step with known input / output rates –Compiler is responsible for scheduling and buffer management Extensions to synchronous dataflow –Dynamic I/O rates –Support for sliding window operations –Teleport messaging [PPoPP’05] Decimate Input Output x 10 x 1 [Lee & Messerschmidt, 1987]

Stateful float->float filter LowPassFilter (int N, work peek N push 1 pop 1 { float result = 0; for (int i=0; i<weights.length; i++) { result += weights[i] * peek(i); } push(result); pop(); } Example Filter: Low Pass Filter N filter Stateless float[N] weights; ) { weights = adaptChannel();

Structured Streams may be any StreamIt language construct joiner splitter pipeline feedback loop joiner splitter splitjoin filter Each structure is single- input, single-output Hierarchical and composable

StreamIt Benchmark Suite (1/2) Realistic applications (30): –MPEG2 encoder / decoder –Ground Moving Target Indicator –Mosaic –MP3 subset –Medium Pulse Compression Radar –JPEG decoder / transcoder –Feature Aided Tracking –HDTV –H264 subset –Synthetic Aperture Radar –GSM Decoder –802.11a transmitte –DES encryption –Serpent encryption –Vocoder –RayTracer –3GPP physical layer –Radar Array Front End –Freq-hopping radio –Orthogonal Frequency Division Multiplexer –Channel Vocoder –Filterbank –Target Detector –FM Radio –DToA Converter

StreamIt Benchmark Suite (2/2) Libraries / kernels (23): –Autocorrelation –Cholesky –CRC –DCT (1D / 2D, float / int) –FFT (4 granularities) –Lattice Graphics pipelines (4): –Reference pipeline –Phong shading Sorting routines (8) –Bitonic sort (3 versions) –Bubble Sort –Comparison counting –Matrix Multiplication –Oversampler –Rate Convert –Time Delay Equalization –Trellis –VectAdd –Shadow volumes –Particle system –Insertion sort –Merge sort –Radix sort

3GPP

802.11a

Bitonic Sort

DCT

FilterBank

GSM Decoder

MP3 Decoder Subset

Radar Array Frontend

Vocoder

Characterization Overview Focus on architecture-independent features –Avoid performance artifacts of the StreamIt compiler –Estimate execution time statically (not perfect) Three categories of inquiry: 1.Throughput bottlenecks 2.Scheduling characteristics 3.Utilization of StreamIt language features

Lessons Learned from the StreamIt Language What we did right What we did wrong Opportunities for doing better

1. Expose Task, Data, & Pipeline Parallelism Splitter Joiner Task Data parallelism Analogous to DOALL loops Task parallelism Pipeline parallelism

1. Expose Task, Data, & Pipeline Parallelism Data parallelism Task parallelism Pipeline parallelism Splitter Joiner Splitter Joiner Task Pipeline Data Stateless

1. Expose Task, Data, & Pipeline Parallelism Data parallelism 74% of benchmarks contain entirely data-parallel filters In other benchmarks, 5% to 96% (median 71%) of work is data-parallel Task parallelism 82% of benchmarks contain at least one splitjoin Median of 8 splitjoins per benchmark Pipeline parallelism Splitter Joiner Splitter Joiner Task Pipeline Data

Characterizing Stateful Filters 763 Filter Types 49 Stateful Types 6% Stateful 94% Stateless 55% Avoidable State 45% Algorithmic State 27 Types with “Avoidable State” Due to induction variables Sources of Algorithmic State –MPEG2: bit-alignment, reference frame encoding, motion prediction, … –HDTV: Pre-coding and Ungerboeck encoding –HDTV + Trellis: Ungerboeck decoding –GSM: Feedback loops –Vocoder: Accumulator, adaptive filter, feedback loop –OFDM: Incremental phase correction –Graphics pipelines: persistent screen buffers

Characterizing Stateful Filters 2. Eliminate Stateful Induction Variables 763 Filter Types 49 Stateful Types 6% Stateful 94% Stateless 55% Avoidable State 45% Algorithmic State 27 Types with “Avoidable State” Due to message handlers Due to induction variables Due to Granularity Sources of Induction Variables –MPEG encoder: counts frame # to assign picture type –MPD / Radar: count position in logical vector for FIR –Trellis: noise source flips every N items –MPEG encoder / MPD: maintain logical 2D position (row/column) –MPD: reset accumulator when counter overflows Opportunity: Language primitive to return current iteration?

3. Expose Parallelism in Sliding Windows Legacy codes obscure parallelism in sliding windows –In von-Neumann languages, modulo functions or copy/shift operations prevent detection of parallelism in sliding windows Sliding windows are prevalent in our benchmark suite –57% of realistic applications contain at least one sliding window –Programs with sliding windows have 10 instances on average –Without this parallelism, 11 of our benchmarks would have a new throughput bottleneck (work: 3% - 98%, median 8%) input output FIR 01

Characterizing Sliding Windows 44% FIR Filters push 1 pop 1 peek N 3GPP, OFDM, Filterbank, TargetDetect, DToA, Oversampler, RateConvert, Vocoder, ChannelVocoder, FMRadio 34 Sliding Window Types 29% One-item windows pop N peek N+1 Mosaic, HDTV, FMRadio, JPEG decode / transcode, Vocoder 27% Miscellaneous MP3: reordering (peek >1000) : error codes (peek 3-7) Vocoder / A.beam: skip data Channel Vocoder: sliding correlation (peek 100)

4. Expose Startup Behaviors Example: difference encoder (JPEG, Vocoder) Required by 15 programs: –For delay: MPD, HDTV, Vocoder, 3GPP, Filterbank, DToA, Lattice, Trellis, GSM, CRC –For picture reordering (MPEG) –For initialization (MPD, HDTV, ) –For difference encoder or decoder: JPEG, Vocoder Stateless int->int filter Diff_Encoder() { int state = 0; work push 1 pop 1 { push(peek(0) – state); state = pop(); } int->int filter Diff_Encoder() { prework push 1 pop 1 { push(peek(0)); } work push 1 pop 1 peek 2 { push(peek(1) – peek(0)); pop(); } Stateful

5. Surprise: Mis-Matched Data Rates Uncommon This is a driving application in many papers –Eg: [MBL94] [TZB99] [BB00] [BML95] [CBL01] [MB04] [KSB08] –Due to large filter multiplicities, clever scheduling is needed to control code size, buffer size, and latency But are mis-matched rates common in practice? No! x 147x 98x 28x 32 CD-DAT benchmark multiplicities Converts CD audio (44.1 kHz) to digital audio tape (48 kHz)

5. Surprise: Mis-Matched Data Rates Uncommon Excerpt from JPEG transcoder Execute once per steady state

Characterizing Mis-Matched Data Rates In our benchmark suite: –89% of programs have a filter with a multiplicity of 1 –On average, 63% of filters share the same multiplicity –For 68% of benchmarks, the most common multiplicity is 1 Implication for compiler design: Do not expect advanced buffering strategies to have a large impact on average programs –Example: Karczmarek, Thies, & Amarasinghe, LCTES’03 –Space saved on CD-DAT: 14x –Space saved on other programs (median): 1.2x

A multi-phase filter divides its execution into many steps –Formally known a cyclo-static dataflow –Possible benefits: Shorter latencies More natural code We implemented multi-phase filters, and we regretted it –Programmers did not understand the difference between a phase of execution, and a normal function call –Compiler was complicated by presences of phases However, phases proved important for splitters / joiners –Routing items needs to be done with minimal latency –Otherwise buffers grow large, and deadlock in one case (GSM) 6. Surprise: Multi-Phase Filters Cause More Harm than Good Step 1 F F Step 2

7. Programmers Introduce Unnecessary State in Filters Programmers do not implement things how you expect Opportunity: add a “stateful” modifier to filter decl? –Require programmer to be cognizant of the cost of state void->int filter SquareWave() { int x = 0; work push 1 { push(x); x = 1 - x; } void->int filter SquareWave() { work push 2 { push(0); push(1); } Stateless Stateful

8. Leverage and Improve Upon Structured Streams Overall, programmers found it useful and tractable to write programs using structured streams –Syntax is simple to write, easy to read However, structured streams are occasionally unnatural –And, in rare cases, insufficient

8. Leverage and Improve Upon Structured Streams Original: Structured: Compiler recovers unstructured graph using synchronization removal [Gordon 2010]

8. Leverage and Improve Upon Structured Streams Original: Structured: Characterization: –49% of benchmarks have an Identity node –In those benchmarks, Identities account for 3% to 86% (median 20%) of instances Opportunity: –Bypass capability (ala GOTO) for streams

Related Work Benchmark suites in von-Neumann languages often include stream programs, but lose high-level properties –MediaBench –ALPBench –Berkeley MM Workload Brook language includes 17K LOC benchmark suite –Brook disallows stateful filters; hence, more data parallelism –Also more focus on dynamic rates & flexible program behavior Other stream languages lack benchmark characterization –StreamC / KernelC –Cg In-depth analysis of 12 StreamIt “core” benchmarks published concurrently to this paper [Gordon 2010] –HandBench –MiBench –NetBench –Baker –SPUR –SPEC –PARSEC –Perfect Club –Spidle

Conclusions First characterization of a streaming benchmark suite that was written in a stream programming language –65 programs; 22 programmers; 34 KLOC Implications for streaming languages and compilers: –DO: expose task, data, and pipeline parallelism –DO: expose parallelism in sliding windows –DO: expose startup behaviors –DO NOT: optimize for unusual case of mis-matched I/O rates –DO NOT: bother with multi-phase filters –TRY: to prevent users from introducing unnecessary state –TRY: to leverage and improve upon structured streams –TRY: to prevent induction variables from serializing filters Exercise care in generalizing results beyond StreamIt

Acknowledgments: Authors of the StreamIt Benchmarks Sitij Agrawal Basier Aziz Jiawen Chen Matthew Drake Shirley Fung Michael Gordon Ola Johnsson Andrew Lamb Chris Leger Michal Karczmarek David Maze Ali Meli Mani Narayanan Satish Ramaswamy Rodric Rabbah Janis Sermulins Magnus Stenemo Jinwoo Suh Zain ul-Abdin Amy Williams Jeremy Wong