StreamIt – A Programming Language for the Era of Multicores Saman Amarasinghe


Moore's Law
[Chart: number of transistors over time for Pentium, P2, P3, P4, Itanium, and Itanium 2, tracking Moore's Law. From David Patterson; data from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

Uniprocessor Performance (SPECint)
[Chart: the transistor-count curve for Pentium through Itanium 2 overlaid with SPECint performance. From David Patterson; data from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic performance scaling:
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism
All major manufacturers are moving to multicore architectures.
[Chart: uniprocessor performance flattening despite continued transistor growth. From David Patterson; data from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

5 Programming Languages for Modern Architectures
Two choices:
– Bend over backwards to support old languages like C/C++
– Develop high-performance architectures that are hard to program
[Diagram: C bridges programs to the von Neumann machine; no comparable bridge exists to a modern architecture.]

6 Parallel Programmer's Dilemma
[Chart: approaches plotted by productivity, portability, and malleability (high to low) against parallel performance (low to high).]
– Rapid prototyping: MATLAB, Ptolemy
– Automatic parallelization: FORTRAN compilers, C/C++ compilers
– Natural parallelization: StreamIt
– Manual parallelization: C/C++ with MPI
– Optimal parallelization: assembly code

7 Stream Application Domain
– Graphics
– Cryptography
– Databases
– Object recognition
– Network processing and security
– Scientific codes
– …

8 StreamIt Project
Language Semantics / Programmability
– StreamIt Language (CC 02)
– Programming Environment in Eclipse (P-PHEC 05)
Optimizations / Code Generation
– Phased Scheduling (LCTES 03)
– Cache Aware Optimization (LCTES 05)
Domain Specific Optimizations
– Linear Analysis and Optimization (PLDI 03)
– Optimizations for bit streaming (PLDI 05)
– Linear State Space Analysis (CASES 05)
Parallelism
– Teleport Messaging (PPOPP 05)
– Compiling for Communication-Exposed Architectures (ASPLOS 02)
– Load-Balanced Rendering (Graphics Hardware 05)
Applications
– SAR, DSP benchmarks, JPEG, MPEG [IPDPS 06], DES and Serpent [PLDI 05], …
[Diagram: a StreamIt program enters the front-end, passes through stream-aware optimizations, and reaches four backends: uniprocessor (C), cluster (MPI-like C), Raw (C per tile + message code), and IBM X10 (annotated Java on a streaming X10 runtime).]

9 Compiler-Aware Language Design
– Programmability: boost productivity, enable faster development and rapid prototyping
– Domain specific optimizations: simple and effective optimizations for domain specific abstractions
– Enable parallel execution: target multicores, clusters, tiled architectures, DSPs, graphics processors, …

10 Streaming Application Design
Structured block level diagram describes computation and flow of data
Conceptually easy to understand
– Clean abstraction of functionality
Example: MPEG-2 decoder. The MPEG bit stream enters VLD, which produces macroblocks and motion vectors. A splitter separates frequency-encoded macroblocks (ZigZag, IQuantization with the quantization coefficients, IDCT, Saturation) from differentially coded motion vectors (Motion Vector Decode, Repeat); a joiner recombines them into spatially encoded macroblocks. A second splitjoin performs Motion Compensation against reference pictures for the Y, Cb, and Cr channels, with Channel Upsample for the chroma channels. Picture Reorder (driven by the picture type) and Color Space Conversion produce the recovered picture.

    add VLD(QC, PT1, PT2);
    add splitjoin {
      split roundrobin(N*B, V);
      add pipeline {
        add ZigZag(B);
        add IQuantization(B) to QC;
        add IDCT(B);
        add Saturation(B);
      }
      add pipeline {
        add MotionVectorDecode();
        add Repeat(V, N);
      }
      join roundrobin(B, V);
    }
    add splitjoin {
      split roundrobin(4*(B+V), B+V, B+V);
      add MotionCompensation(4*(B+V)) to PT1;
      for (int i = 0; i < 2; i++) {
        add pipeline {
          add MotionCompensation(B+V) to PT1;
          add ChannelUpsample(B);
        }
      }
      join roundrobin(1, 1, 1);
    }
    add PictureReorder(3*W*H) to PT2;
    add ColorSpaceConversion(3*W*H);

11 StreamIt Philosophy
Preserve program structure
– Natural for application developers to express
Leverage program structure to discover parallelism and deliver high performance
Programs remain clean
– Portable and malleable
[Same MPEG-2 stream graph and StreamIt code as the previous slide.]

12 StreamIt Philosophy
[Same MPEG-2 stream graph and StreamIt code, now showing the recovered picture output to the player.]

13 Stream Abstractions in StreamIt
[Same MPEG-2 stream graph and StreamIt code, annotated to identify its filters, pipelines, and splitjoins.]

14 StreamIt Language Highlights Filters Pipelines Splitjoins Teleport messaging

15 Example StreamIt Filter

    float->float filter FIR (int N) {
      work push 1 pop 1 peek N {
        float result = 0;
        for (int i = 0; i < N; i++) {
          result += weights[i] * peek(i);
        }
        push(result);
        pop();
      }
    }

[Diagram: the filter peeks at input items 0 through N-1, pops one, and pushes one output.]
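The peek/pop/push discipline above can be modeled in ordinary Python (a sketch, not StreamIt; the function and variable names are illustrative): each firing reads a sliding window of N items, pushes one weighted sum, and pops a single item.

```python
from collections import deque

def fir_filter(weights, inputs):
    """Model of the StreamIt FIR work function: push 1, pop 1, peek N."""
    N = len(weights)
    tape = deque(inputs)          # the input channel
    out = []                      # the output channel
    while len(tape) >= N:         # the filter can fire only when N items are peekable
        result = sum(weights[i] * tape[i] for i in range(N))  # peek(i)
        out.append(result)        # push(result)
        tape.popleft()            # pop()
    return out

# a 2-tap moving average
print(fir_filter([0.5, 0.5], [1, 2, 3, 4]))   # [1.5, 2.5, 3.5]
```

Note that buffer management is entirely implicit here, just as in the StreamIt version; the C version on the next slide makes the same logic much harder to see.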

16 FIR Filter in C

    void FIR(int* src, int* dest,
             int* srcIndex, int* destIndex,
             int srcBufferSize, int destBufferSize, int N) {
      float result = 0.0;
      for (int i = 0; i < N; i++) {
        result += weights[i] * src[(*srcIndex + i) % srcBufferSize];
      }
      dest[*destIndex] = result;
      *srcIndex = (*srcIndex + 1) % srcBufferSize;
      *destIndex = (*destIndex + 1) % destBufferSize;
    }

FIR functionality is obscured by buffer management details
The programmer must commit to a particular buffer implementation strategy

17 StreamIt Language Highlights Filters Pipelines Splitjoins Teleport messaging

18 Example StreamIt Pipeline
Pipeline
– Connect components in sequence
– Expose pipeline parallelism

    float->float pipeline 2D_iDCT (int N) {
      add Column_iDCTs(N);
      add Row_iDCTs(N);
    }

[Diagram: Column_iDCTs feeding Row_iDCTs.]

19 Preserving Program Structure
From Figures 7-1 and 7-4 of the MPEG-2 Specification (ISO/IEC 13818-2, pp. 61, 66)

    int->int pipeline BlockDecode(portal quantiserData,
                                  portal macroblockType) {
      add ZigZagUnordering();
      add InverseQuantization() to quantiserData, macroblockType;
      add Saturation(-2048, 2047);
      add MismatchControl();
      add 2D_iDCT(8);
      add Saturation(-256, 255);
    }

Can be reused for JPEG decoding

20 In Contrast: C Code Excerpt
– Explicit for-loops iterate through picture frames
– Frames passed through global arrays, handled with pointers
– Mixing of parser, motion compensation, and spatial decoding

    EXTERN unsigned char *backward_reference_frame[3];
    EXTERN unsigned char *forward_reference_frame[3];
    EXTERN unsigned char *current_frame[3];
    /* ...etc... */

    Decode_Picture() {
      for (;;) {
        parser();
        for (;;) {
          decode_macroblock();
          motion_compensation();
          if (condition) break;
        }
        frame_reorder();
      }
    }

    decode_macroblock() {
      parser();
      motion_vectors();
      for (comp = 0; comp < block_count; comp++) {
        parser();
        Decode_MPEG2_Block();
      }
    }

    motion_vectors() {
      parser();
      decode_motion_vector();
      parser();
    }

    Decode_MPEG2_Block() {
      for (int i = 0; ; i++) {
        parsing();
        ZigZagUnordering();
        inverseQuantization();
        if (condition) break;
      }
    }

    motion_compensation() {
      for (channel = 0; channel < 3; channel++)
        form_component_prediction();
      for (comp = 0; comp < block_count; comp++) {
        Saturate();
        IDCT();
        Add_Block();
      }
    }

21 StreamIt Language Highlights Filters Pipelines Splitjoins Teleport messaging

22 Example StreamIt Splitjoin
Splitjoin
– Connect components in parallel
– Expose task parallelism and data distribution

    float->float splitjoin Row_iDCT (int N) {
      split roundrobin(N);
      for (int i = 0; i < N; i++) {
        add 1D_iDCT(N);
      }
      join roundrobin(N);
    }

[Diagram: a splitter fanning out to N 1D_iDCT filters, recombined by a joiner.]

23 Example StreamIt Splitjoin

    float->float splitjoin Column_iDCT (int N) {
      split roundrobin(1);
      for (int i = 0; i < N; i++) {
        add 1D_iDCT(N);
      }
      join roundrobin(1);
    }

    float->float splitjoin Row_iDCT (int N) {
      split roundrobin(N);
      for (int i = 0; i < N; i++) {
        add 1D_iDCT(N);
      }
      join roundrobin(N);
    }

    float->float pipeline 2D_iDCT (int N) {
      add Column_iDCTs(N);
      add Row_iDCTs(N);
    }

[Diagram: the 2D iDCT as a pipeline of two splitjoins, one over columns and one over rows.]
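The roundrobin split and join semantics used above can be sketched in Python (illustrative helper names, not part of StreamIt): a split deals k consecutive items to each of n branches in turn, and the matching join interleaves them back in the same order.

```python
def split_roundrobin(stream, n, k):
    """Deal items to n branches, k consecutive items per branch per round."""
    branches = [[] for _ in range(n)]
    for i, item in enumerate(stream):
        branches[(i // k) % n].append(item)
    return branches

def join_roundrobin(branches, k):
    """Interleave items back, taking k at a time from each branch in order."""
    out, pos = [], [0] * len(branches)
    while any(pos[b] < len(branches[b]) for b in range(len(branches))):
        for b in range(len(branches)):
            out.extend(branches[b][pos[b]:pos[b] + k])
            pos[b] += k
    return out

# split roundrobin(2) over two branches, then join roundrobin(2): a no-op overall
branches = split_roundrobin(list(range(8)), 2, 2)   # [[0, 1, 4, 5], [2, 3, 6, 7]]
assert join_roundrobin(branches, 2) == list(range(8))
```

With k = 1 this is the pixel-at-a-time distribution used in the MPEG-2 decoder's chroma splitjoin; with larger weights it scatters whole blocks.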

24 Naturally Expose Data Distribution

    add splitjoin {
      split roundrobin(4*(B+V), B+V, B+V);
      add MotionCompensation();
      for (int i = 0; i < 2; i++) {
        add pipeline {
          add MotionCompensation();
          add ChannelUpsample(B);
        }
      }
      join roundrobin(1, 1, 1);
    }

The 4:1:1 splitter scatters macroblocks to the Y, Cb, and Cr channels according to the chroma format; the 1:1:1 joiner gathers one pixel at a time into the recovered picture.

25 Stream Graph Malleability
Changing from the 4:2:0 chroma format to 4:2:2 only reshapes the splitjoin: the 4,1,1 split over Y/Cb/Cr becomes a 4,(2+2) split, with a nested 1,1 splitjoin distributing the two chroma channels to their MC and Upsample pipelines.

26 StreamIt Code Sample
red = code added or modified to support the 4:2:2 format

    // C = blocks per chroma channel per macroblock
    // C = 1 for 4:2:0, C = 2 for 4:2:2
    add splitjoin {
      split roundrobin(4*(B+V), 2*C*(B+V));
      add MotionCompensation();
      add splitjoin {
        split roundrobin(B+V, B+V);
        for (int i = 0; i < 2; i++) {
          add pipeline {
            add MotionCompensation();
            add ChannelUpsample(C, B);
          }
        }
        join roundrobin(1, 1);
      }
      join roundrobin(1, 1, 1);
    }

27 In Contrast: C Code Excerpt
red = pointers used for address calculations
The values used for address calculations must be adjusted depending on the chroma format in use.

    /* Y */
    form_component_prediction(src[0]+(sfield?lx2>>1:0), dst[0]+(dfield?lx2>>1:0),
                              lx, lx2, w, h, x, y, dx, dy, average_flag);
    if (chroma_format != CHROMA444) {
      lx >>= 1; lx2 >>= 1; w >>= 1; x >>= 1; dx /= 2;
    }
    if (chroma_format == CHROMA420) {
      h >>= 1; y >>= 1; dy /= 2;
    }
    /* Cb */
    form_component_prediction(src[1]+(sfield?lx2>>1:0), dst[1]+(dfield?lx2>>1:0),
                              lx, lx2, w, h, x, y, dx, dy, average_flag);
    /* Cr */
    form_component_prediction(src[2]+(sfield?lx2>>1:0), dst[2]+(dfield?lx2>>1:0),
                              lx, lx2, w, h, x, y, dx, dy, average_flag);

28 StreamIt Language Highlights Filters Pipelines Splitjoins Teleport messaging

29 Teleport Messaging
– Avoids muddling data streams with control-relevant information
– Localized interactions in large applications
– A scalable alternative to global variables or excessive parameter passing
[Diagram: a message sent from VLD directly to Order, bypassing the IQ and MC filters between them.]

30 Motion Prediction and Messaging

    portal PT;
    add splitjoin {
      split roundrobin(4*(B+V), B+V, B+V);
      add MotionCompensation() to PT;
      for (int i = 0; i < 2; i++) {
        add pipeline {
          add MotionCompensation() to PT;
          add ChannelUpsample(B);
        }
      }
      join roundrobin(1, 1, 1);
    }

[Diagram: the picture type is delivered through portal PT to all three MotionCompensation filters (Y, Cb, Cr), each reading its reference picture, before Channel Upsample and the joiner produce the recovered picture.]

31 Teleport Messaging Overview
Looks like a method call, but is timed relative to data in the stream
Simple and precise for the user
– Exposes dependences to compiler
– Adjustable latency
– Can send upstream or downstream

    void setPicturetype(int p) {
      reconfigure(p);
    }

    TargetFilter x;
    if (newPictureType(p)) {
      x.setPicturetype(p) at 0;
    }

32 Messaging Equivalent in C
[Diagram: in the C implementation, File Parsing, Decode Picture, Decode Macroblock, Decode Motion Vectors, Decode Block (ZigZagUnordering, Inverse Quantization, Saturate, IDCT), Motion Compensation (including Motion Compensation for a single channel), and Frame Reordering all communicate through a global variable space between the MPEG bitstream and the output video.]

33 Compiler-Aware Language Design
– Programmability: boost productivity, enable faster development and rapid prototyping
– Domain specific optimizations: simple and effective optimizations for domain specific abstractions
– Enable parallel execution: target multicores, clusters, tiled architectures, DSPs, graphics processors, …

34 Multicores Are Here!
[Chart: number of cores per chip over time. Unicore lineage: Pentium, P2, P3, P4, Itanium, Itanium 2, Power4, Opteron, Power6, Niagara, Yonah, Pentium Extreme, Athlon, Xeon MP, Opteron 4P, PA-8800, Tanglewood. Multicores: Raw, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom, Ambric AM2045, Tilera, …]

35 Von Neumann Languages
Why did C (FORTRAN, C++, etc.) become so successful?
– Abstracted out the differences between von Neumann machines
– Directly exposed their common properties
– Can be mapped very efficiently to a von Neumann machine
– "C is the portable machine language for von Neumann machines"
Von Neumann languages are a curse for multicores
– We have squeezed all the performance out of C
– But C cannot easily be mapped onto multicores

36 Common Machine Languages
Unicores:
– Common properties: single flow of control, single memory image
– Differences: register file, ISA, functional units (abstracted by register allocation, instruction selection, instruction scheduling)
Multicores:
– Common properties: multiple flows of control, multiple local memories
– Differences: number and capabilities of cores, communication model, synchronization model
Von Neumann languages represent the common properties and abstract away the differences.

37 Bridging the Abstraction Layers
StreamIt exposes the data movement
– Graph structure is architecture independent
StreamIt exposes the parallelism
– Explicit task parallelism
– Implicit but inherent data and pipeline parallelism
Each multicore is different in granularity and topology
– Communication is exposed to the compiler
The compiler needs to efficiently bridge the abstraction
– Map the computation and communication pattern of the program to the cores, memory, and the communication substrate

38 Types of Parallelism
Task parallelism
– Parallelism explicit in algorithm
– Between filters without a producer/consumer relationship
Data parallelism
– Peel iterations of a filter, place within a scatter/gather pair (fission)
– Can't parallelize filters with state
Pipeline parallelism
– Between producers and consumers
– Stateful filters can be parallelized

39 Types of Parallelism
Task parallelism
– Parallelism explicit in algorithm
– Between filters without a producer/consumer relationship
Data parallelism
– Between iterations of a stateless filter
– Place within a scatter/gather pair (fission)
– Can't parallelize filters with state
Pipeline parallelism
– Between producers and consumers
– Stateful filters can be parallelized

40 Types of Parallelism
Traditionally:
– Task parallelism: thread (fork/join) parallelism
– Data parallelism: data parallel loop (forall)
– Pipeline parallelism: usually exploited in hardware
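Fission of a stateless filter, the data-parallel case above, can be sketched in Python (names are illustrative): because each output depends only on one input item, iterations can be dealt out to worker threads and gathered back in order, producing exactly the sequential result.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    # a stateless filter body: the output depends only on this input item
    return 0.25 * x

def fissed(items, workers=4):
    """Fission: run iterations of a stateless filter on several workers.
    A scatter deals the items out; a gather restores their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, items))   # map preserves input order

# same result regardless of the number of workers
assert fissed(list(range(8))) == [scale(x) for x in range(8)]
```

A filter with state (say, a running sum) could not be fissed this way, because each iteration would depend on the previous one; that is exactly why the compiler treats stateful filters with pipelining instead.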

41 Problem Statement
Given:
– Stream graph with a compute and communication estimate for each filter
– Computation and communication resources of the target machine
Find:
– A schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources

42 Our 3-Phase Solution
1. Coarsen: fuse stateless sections of the graph
2. Data parallelize: parallelize stateless filters
3. Software pipeline: parallelize stateful filters
Compiled to a 16-core architecture
– 11.2x mean throughput speedup over a single core

43 Baseline 1: Task Parallelism
[Stream graph: a splitter feeding two identical BandPass, Compress, Process, Expand, BandStop pipelines, joined and summed by an Adder.]
Inherent task parallelism between the two processing pipelines
Task parallel model:
– Only parallelize explicit task parallelism
– Fork/join parallelism
Executing this on a 2-core machine gives ~2x speedup over a single core
What about 4, 16, 1024, … cores?

44 Evaluation: Task Parallelism
Raw microprocessor:
– 16 in-order, single-issue cores with D$ and I$
– 16 memory banks, each bank with DMA
– Cycle-accurate simulator
[Chart: per-benchmark speedups. Parallelism: not matched to target! Synchronization: not matched to target!]

45 Baseline 2: Fine-Grained Data Parallelism
Each of the filters in the example is stateless
Fine-grained data parallel model:
– Fiss each stateless filter N ways (N is the number of cores)
– Remove scatter/gather if possible
We can introduce data parallelism
– Example: 4 cores
Each fission group occupies the entire machine
[Stream graph: every filter (BandPass, Compress, Process, Expand, BandStop, Adder) is wrapped in its own splitter/joiner pair and replicated four ways.]

46 Evaluation: Fine-Grained Data Parallelism
[Chart: per-benchmark speedups. Good parallelism, but too much synchronization!]

47 Phase 1: Coarsen the Stream Graph
Before data parallelism is exploited
Fuse stateless pipelines as much as possible without introducing state
– Don't fuse stateless with stateful
– Don't fuse a peeking filter with anything upstream
[Stream graph: the BandPass, Compress, Process, Expand stages of each branch are candidates for fusion; the peeking BandStop filters and the Adder stay separate.]

48 Phase 1: Coarsen the Stream Graph
Before data parallelism is exploited
Fuse stateless pipelines as much as possible without introducing state
– Don't fuse stateless with stateful
– Don't fuse a peeking filter with anything upstream
Benefits:
– Reduces global communication and synchronization
– Exposes inter-node optimization opportunities
[Stream graph: each branch is now one fused BandPass-Compress-Process-Expand filter followed by BandStop, then the Adder.]
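Fusing a stateless pipeline can be sketched in Python (the stage bodies are made-up stand-ins): composing the per-item functions yields one coarser filter with no intermediate channels, which is what makes the fused graph cheaper to synchronize.

```python
def bandpass(v): return v + 1      # stand-ins for stateless pipeline stages
def compress(v): return v * 2

def fuse(*stages):
    """Fuse a stateless pipeline into a single filter: one work function
    runs all stages back to back, with no channel between them."""
    def fused(v):
        for stage in stages:
            v = stage(v)
        return v
    return fused

coarse = fuse(bandpass, compress)
assert [coarse(x) for x in range(4)] == [compress(bandpass(x)) for x in range(4)]
```

Fusion is legal here precisely because neither stage carries state between firings; fusing across a stateful or peeking filter would change the graph's semantics, which is why the coarsening phase refuses to do it.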

49 Phase 2: Data Parallelize
Data parallelize for 4 cores
– Fiss the Adder 4 ways, to occupy the entire chip
[Stream graph: the Adder is wrapped in a splitter/joiner pair and replicated four times; the fused branches are unchanged.]

50 Phase 2: Data Parallelize
Data parallelize for 4 cores
– Task parallelism! Each fused filter does equal work
– Fiss each filter 2 times to occupy the entire chip
[Stream graph: each fused BandPass-Compress-Process-Expand filter is fissed two ways; the Adder remains fissed four ways.]

51 Phase 2: Data Parallelize
Data parallelize for 4 cores
– Task parallelism: each filter does equal work
– Fiss each filter 2 times to occupy the entire chip
Task-conscious data parallelization
– Preserve task parallelism
Benefits:
– Reduces global communication and synchronization
[Stream graph: the BandStop filters are also fissed two ways within their task-parallel branches.]

52 Evaluation: Coarse-Grained Data Parallelism
[Chart: per-benchmark speedups. Good parallelism, low synchronization!]

53 Simplified Vocoder
[Stream graph: AdaptDFT feeding a splitter into two Amplify, Diff, Unwrap, Accum pipelines, joined into RectPolar, then PolarRect.]
Target a 4-core machine
RectPolar is data parallel, but has too little work!

54 Data Parallelize
Target a 4-core machine
[Stream graph: RectPolar and PolarRect are each fissed within splitter/joiner pairs; the Amplify, Diff, Unwrap, Accum pipelines are unchanged.]

55 Data + Task Parallel Execution
Target a 4-core machine
[Schedule: execution time across the 4 cores, with splitter/joiner synchronization between stages; the cores sit partly idle.]

56 We Can Do Better!
Target a 4-core machine
[Schedule: the same graph with stages from different iterations overlapped in time, reducing idle cycles.]

57 Phase 3: Coarse-Grained Software Pipelining
[Diagram: a prologue fills the pipeline; afterwards a new steady state executes filters from different iterations concurrently.]
– The new steady state is free of dependencies
– Schedule the new steady state using a greedy partitioning

58 Greedy Partitioning
Target a 4-core machine
[Schedule: the steady-state filters to schedule are greedily packed onto the cores over time.]

59 Evaluation: Coarse-Grained Task + Data + Software Pipelining
[Chart: per-benchmark speedups. Best parallelism, lowest synchronization!]

60 Compiler-Aware Language Design
– Programmability: boost productivity, enable faster development and rapid prototyping
– Domain specific optimizations: simple and effective optimizations for domain specific abstractions
– Enable parallel execution: target multicores, clusters, tiled architectures, DSPs, graphics processors, …

61 Conventional DSP Design Flow
[Flow: a signal processing expert in MATLAB produces a spec (data-flow diagram) and coefficient tables, designs the datapaths (no control flow), and applies DSP optimizations, rewriting the program; a software engineer in C and assembly then applies architecture-specific optimizations (performance, power, code size) to produce C/assembly code.]

62 Design Flow with StreamIt
[Flow: the application programmer does application-level design and writes a StreamIt program (dataflow + control); the StreamIt compiler applies the DSP optimizations and architecture-specific optimizations to produce C/assembly code.]

63 Design Flow with StreamIt
Benefits of programming in a single, high-level abstraction
– Modular
– Composable
– Portable
– Malleable
The challenge: maintaining performance

64 Focus: Linear State Space Filters
Properties:
1. Outputs are a linear function of inputs and states
2. New states are a linear function of inputs and states
Most common target of DSP optimizations
– FIR / IIR filters
– Linear difference equations
– Upsamplers / downsamplers
– DCTs

65 Representing State Space Filters
A state space filter is a tuple ⟨A, B, C, D⟩ mapping inputs u and states x to outputs y:
    x' = Ax + Bu
    y  = Cx + Du

66 Representing State Space Filters
A state space filter is a tuple ⟨A, B, C, D⟩:
    x' = Ax + Bu
    y  = Cx + Du

    float->float filter IIR {
      float x1, x2;
      work push 1 pop 1 {
        float u = pop();
        push(2*(x1 + x2 + u));
        x1 = 0.9*x1 + 0.3*u;
        x2 = 0.9*x2 + 0.2*u;
      }
    }

67 Representing State Space Filters
Linear dataflow analysis extracts the tuple ⟨A, B, C, D⟩ from the filter code:

    float->float filter IIR {
      float x1, x2;
      work push 1 pop 1 {
        float u = pop();
        push(2*(x1 + x2 + u));
        x1 = 0.9*x1 + 0.3*u;
        x2 = 0.9*x2 + 0.2*u;
      }
    }

    A = | 0.9   0  |     B = | 0.3 |
        |  0   0.9 |         | 0.2 |

    C = [ 2  2 ]         D = [ 2 ]

    x' = Ax + Bu
    y  = Cx + Du
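The equivalence between the filter code and the tuple ⟨A, B, C, D⟩ can be checked numerically in Python (a sketch; the 0.9 state feedback comes from the slide, while the 0.3 and 0.2 input coefficients are illustrative stand-ins where the transcript is unclear):

```python
def iir_direct(inputs):
    """The IIR work function executed directly (x1, x2 are the states)."""
    x1 = x2 = 0.0
    out = []
    for u in inputs:
        out.append(2 * (x1 + x2 + u))                       # push
        x1, x2 = 0.9 * x1 + 0.3 * u, 0.9 * x2 + 0.2 * u     # state update
    return out

def iir_state_space(inputs):
    """The same filter as the tuple <A, B, C, D>: x' = Ax + Bu, y = Cx + Du."""
    A = [[0.9, 0.0], [0.0, 0.9]]
    B = [0.3, 0.2]
    C = [2.0, 2.0]
    D = 2.0
    x, out = [0.0, 0.0], []
    for u in inputs:
        out.append(C[0] * x[0] + C[1] * x[1] + D * u)       # y = Cx + Du
        x = [A[0][0] * x[0] + A[0][1] * x[1] + B[0] * u,    # x' = Ax + Bu
             A[1][0] * x[0] + A[1][1] * x[1] + B[1] * u]
    return out

ins = [1.0, -2.0, 3.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(iir_direct(ins), iir_state_space(ins)))
```

Once the filter is in this matrix form, the combination and change-of-basis optimizations on the following slides become plain matrix algebra.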

68 Linear Optimizations
1. Combining adjacent filters
2. Transformation to frequency domain
3. Change of basis transformations
4. Transformation selection

69 1) Combining Adjacent Filters
Filter 1: y = D1·u
Filter 2: z = D2·y
Combined filter: z = D2·D1·u = E·u, where E = D2·D1

70 IIR Filter Combination Example
IIR filter (8 FLOPs per decimated output):
    x' = 0.9x + u
    y  = x + 2u
Decimator:
    y = [1 0] [u1; u2]
Combined IIR/decimator (6 FLOPs per output):
    x' = 0.81x + [0.9 1] [u1; u2]
    y  = x + [2 0] [u1; u2]

71 IIR Filter Combination Example
IIR filter:
    x' = 0.9x + u
    y  = x + 2u
Decimator:
    y = [1 0] [u1; u2]
Combined IIR/decimator:
    x' = 0.81x + [0.9 1] [u1; u2]
    y  = x + [2 0] [u1; u2]
8 FLOPs per output before combining, 6 after.
As the decimation factor goes to ∞, up to 75% of the FLOPs are eliminated.
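The combination above can be verified numerically in Python (a sketch with made-up inputs): the cascade of the IIR filter and the decimator produces the same outputs as the fused filter, which consumes two inputs per firing and spends 6 FLOPs per output instead of 8.

```python
def cascade(inputs):
    """IIR (x' = 0.9x + u, y = x + 2u) followed by a decimator that
    keeps the first of every two outputs."""
    x, ys = 0.0, []
    for u in inputs:
        ys.append(x + 2 * u)      # 2 FLOPs
        x = 0.9 * x + u           # 2 FLOPs -> 4 per input, 8 per kept output
    return ys[::2]

def combined(inputs):
    """The fused filter: x' = 0.81x + [0.9 1]u, y = x + [2 0]u.
    One firing eats two inputs and emits one output in 6 FLOPs."""
    x, ys = 0.0, []
    for i in range(0, len(inputs) - 1, 2):
        u1, u2 = inputs[i], inputs[i + 1]
        ys.append(x + 2 * u1)               # 2 FLOPs
        x = 0.81 * x + 0.9 * u1 + u2        # 4 FLOPs
    return ys

ins = [1.0, 2.0, 3.0, 4.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(cascade(ins), combined(ins)))
```

The fused state update folds two IIR steps into one (0.81 = 0.9 squared), and the decimated output is never computed at all, which is where the savings come from.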

72 Floating-Point Operations Reduction
[Chart: per-benchmark FLOPs reduction from combining adjacent filters; one benchmark improves by only 0.3%.]

73 2) Transformation to Frequency Domain
Painful to do by hand:
– Blocking
– Coefficient calculations
– Startup, etc.

74 FLOPs Reduction
[Chart: per-benchmark FLOPs reduction when the frequency-domain transformation is also applied; results range from -140% (more FLOPs than before) to large reductions, with one benchmark at 0.3%.]

75 3) Change-of-Basis Transformation
    x' = Ax + Bu
    y  = Cx + Du
Let T be an invertible matrix and z = Tx. Then:
    z' = A'z + B'u
    y  = C'z + D'u
where A' = TAT^-1, B' = TB, C' = CT^-1, D' = D.
We can map the original states x to transformed states z = Tx without changing the I/O behavior.
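The invariance can be checked in Python (a sketch; the example system and the matrix T are illustrative): simulating ⟨A, B, C, D⟩ and ⟨TAT^-1, TB, CT^-1, D⟩ from the zero state yields identical outputs.

```python
def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def mat_mat(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def simulate(A, B, C, D, inputs):
    """x' = Ax + Bu, y = Cx + Du, starting from the zero state."""
    x, out = [0.0] * len(A), []
    for u in inputs:
        out.append(sum(c * xi for c, xi in zip(C, x)) + D * u)
        x = [ax + b * u for ax, b in zip(mat_vec(A, x), B)]
    return out

A, B, C, D = [[0.9, 0.0], [0.0, 0.9]], [0.3, 0.2], [2.0, 2.0], 2.0
T    = [[1.0, 1.0], [0.0, 1.0]]          # any invertible T works
Tinv = [[1.0, -1.0], [0.0, 1.0]]

A2 = mat_mat(mat_mat(T, A), Tinv)                                  # A' = T A T^-1
B2 = mat_vec(T, B)                                                 # B' = T B
C2 = [sum(C[k] * Tinv[k][j] for k in range(2)) for j in range(2)]  # C' = C T^-1

ins  = [1.0, -1.0, 2.0]
orig = simulate(A, B, C, D, ins)
xfrm = simulate(A2, B2, C2, D, ins)
assert all(abs(a - b) < 1e-9 for a, b in zip(orig, xfrm))
```

Because T only relabels the internal states, the compiler is free to pick whichever T removes states or zeroes out matrix entries, the two optimizations on the next slide.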

76 Change-of-Basis Optimizations
1. State removal
– Minimize the number of states in the system
2. Parameter reduction
– Increase the number of 0's and 1's in the multiplications
Both are formulated as general matrix operations.

77 4) Transformation Selection
When to apply which transformations?
– Linear filter combination can increase the computation cost
– Shifting to the frequency domain is expensive for filters with pop > 1: all outputs are computed, then decimated by the pop rate
– Some expensive transformations may later enable other transformations, reducing the overall cost

78 FLOPs Reduction with Optimization Selection
[Chart: per-benchmark FLOPs reduction when transformations are selected automatically; the -140% regression from always applying the frequency transformation is avoided, and the smallest remaining reduction is 0.3%.]

79 Execution Speedup
[Chart: per-benchmark execution speedup on a Pentium IV; the smallest speedup shown is 5%.]

80 Conclusion
Streaming programming model
– Can break the von Neumann bottleneck
– A natural fit for a large class of applications
– An ideal machine language for multicores
StreamIt is a natural programming language for many streaming applications
– Better modularity, composability, malleability, and portability than C
The compiler can easily extract explicit and inherent parallelism
– Parallelism is abstracted away from the architectural details of multicores
– Sustainable speedups (5x to 19x on the 16-core Raw)
Can we remove the DSP engineer from the design flow?
– On average, 90% of the FLOPs are eliminated and a 450% speedup is attained
Increased abstraction does not have to sacrifice performance
The compiler and benchmarks are available on the web

Thanks for Listening! Any questions?