A Parameterized Dataflow Language Extension for Embedded Streaming Systems Yuan Lin 1, Yoonseo Choi 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali Chakrabarti.

Presentation transcript:

A Parameterized Dataflow Language Extension for Embedded Streaming Systems
Yuan Lin (1), Yoonseo Choi (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2)
(1) Advanced Computer Architecture Lab, University of Michigan at Ann Arbor
(2) Department of Electrical Engineering, Arizona State University

Embedded Streaming Systems
- Mobile computing: multimedia anywhere, at any time
- Many of its key workloads are embedded streaming systems
  - Video/audio coding (e.g. H.264)
  - Wireless communications (e.g. W-CDMA)
  - 3D graphics
  - and others
- Cell phones are getting more complex
- PCs are getting more mobile

Characteristics of Streaming Systems
[Figure: W-CDMA physical layer processing, between the analog frontend and the upper layer. Transmitter blocks: LPF-Tx, scrambler, spreader, interleaver, channel encoder. Receiver blocks: LPF-Rx, searcher, descrambler/despreader pairs, combiner, interleaver, channel decoder (Viterbi/Turbo).]
- Data are processed in a pipeline of DSP algorithm kernels
- Mostly vector/matrix-based data computation
- Periodic system reconfigurations
  - e.g. changing from voice communication to data communication

Embedded DSP Processors
[Figure: an ARM core and four data engines (DEs), each containing SIMD unit and local memory blocks, connected to a global memory.]
- Current trend: multi-core DSPs for streaming applications
  - IBM Cell processor
  - TI OMAP
  - Many other SoCs
- Common hardware characteristics
  - Multiple (potentially heterogeneous) data engines (DEs)
  - Software-managed scratchpad memories
  - Explicit DMA transfer operations
- Our DSP case study: SODA, a multi-core DSP processor

Programming Challenge
- How to automatically compile streaming systems onto multi-core DSP hardware?
  - How to divide the system into multiple threads?
  - How to SIMDize DSP kernels?
  - When and where to issue DMA transfers?
  - How to schedule VLIW execution?
  - How to manage the local and global memory?
  - Who does the execution scheduling?
  - and many other problems

Compile for Multi-core DSPs
- Two-tier compilation approach
[Figure: the W-CDMA transmitter and receiver pipeline is mapped onto the SODA system architecture (an ARM core, four PEs with execution units and local memories, and a global memory). Kernel functions such as void Turbo() {... } are mapped onto a PE, whose datapath shows a 32-lane SIMD ALU, SIMD RF, 32-lane SSN, SIMD-to-scalar unit, scalar RF, 16-bit ALU, and SIMD and scalar data memories.]
- This presentation focuses on system-level language and compilation
- Compiling functions, not instructions

System Compilation Overview
[Figure: SPEX, Frontend, SPIR, Backend, then the ARM core and DE0]
- Coarse-grained compilation
  - Function-level, not instruction-level
  - C/C++-to-C compiler
- SPEX: Signal Processing EXtension
  - Our high-level language extension
- Frontend compilation
  - Translate from SPEX into SPIR
- SPIR: Signal Processing IR
  - System compiler’s IR
  - Models function-level interactions
- Backend compilation
  - Function-level compilation
  - Generate multi-threaded C code

SPIR: Function-level IR
- Must capture stream applications’ system-level behaviors
- Based on the dataflow computation model
  - Good for modeling streaming computations
  - Easy to generate parallel code
- But which dataflow model?
[Figure: dataflow nodes connected by FIFO buffers]

Synchronous Dataflow
- Synchronous dataflow (SDF)
  - Simplest dataflow model
  - Static dataflow
  - No conditional dataflow allowed
- Pros
  - Efficiency: execution schedules can be generated at compile time
  - Optimality: we know how to compile SDFs for multi-processor DSPs (Berkeley Ptolemy project, MIT StreamIt compiler)
- Cons
  - Lack of flexibility: cannot describe run-time reconfigurations in stream computations
[Figure: a node with input_rate = 2 and output_rate = 3]
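The compile-time scheduling that SDF allows rests on balancing token production and consumption on every edge. The following is a minimal sketch for a single edge, assuming a producer that pushes a fixed number of tokens per firing and a consumer that pops a fixed number per firing. The function names are hypothetical and are not part of SPEX or SPIR.

```cpp
#include <cassert>
#include <utility>

// Greatest common divisor by Euclid's algorithm.
int gcd_int(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: smallest firing counts that balance one SDF edge.
// The producer pushes `produce` tokens per firing and the consumer pops
// `consume` tokens per firing. Returns {producer_firings, consumer_firings}
// such that producer_firings * produce == consumer_firings * consume.
std::pair<int, int> balance_edge(int produce, int consume) {
    assert(produce > 0 && consume > 0);
    const int g = gcd_int(produce, consume);
    return {consume / g, produce / g};
}
```

For a producer with an output rate of 3 feeding a consumer with an input rate of 2, the balanced schedule fires the producer twice and the consumer three times per period, so the FIFO between them returns to its starting occupancy.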

Parameterized Dataflow
- Parameterized dataflow (PDF)
  - Use parameters to model run-time system reconfiguration
  - Each parameter is a variable with a finite set of discrete values
  - First proposed by B. Bhattacharya and S. S. Bhattacharyya, “Parameterized Dataflow Modeling for DSP Systems,” IEEE Transactions on Signal Processing, Oct. 2001
- Parameterized attributes in SPIR
  - Dataflow rates, e.g. a node with input_rate = {1, 4, 8} and output_rate = {2, 8}
  - Conditional dataflow, e.g. if_cond = {true, false} selecting between an if node and an else node, with edge rates {1,4,8}, {2,8}, {6,8}, {2,4}
  - Number of dataflow actors, e.g. nodes A[0], A[1], ..., A[n] between a split and a merge, with number of A nodes = {1, 4, 12}
  - Streaming size between reconfigurations, e.g. stream_size = {10k, 20k}
- There are also other modifications to the dataflow model
  - Please refer to the paper for further details
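One way to picture a parameterized attribute is as a variable that may only take values from a declared finite set, and that becomes a plain constant once it is bound. The sketch below assumes that reading. The types Param, PdfNode, and SdfNode and the function bind_params are hypothetical illustrations, not the SPIR data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical: a parameter with a finite set of legal values.
struct Param {
    std::vector<int> allowed;
    bool accepts(int v) const {
        return std::find(allowed.begin(), allowed.end(), v) != allowed.end();
    }
};

// Hypothetical: a node whose dataflow rates are parameters.
struct PdfNode {
    Param input_rate;
    Param output_rate;
};

// Hypothetical: the same node once its rates are constants.
struct SdfNode {
    int input_rate;
    int output_rate;
};

// Bind each parameter to one constant, yielding a static (SDF) node.
// Values outside the declared sets are rejected.
SdfNode bind_params(const PdfNode& n, int in, int out) {
    if (!n.input_rate.accepts(in) || !n.output_rate.accepts(out)) {
        throw std::invalid_argument("value outside the parameter's set");
    }
    return SdfNode{in, out};
}
```

With input_rate = {1, 4, 8} and output_rate = {2, 8}, binding (4, 8) gives a node with constant rates, while binding (3, 2) is rejected because 3 is not in the input set.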

PDF Run-time Execution Model
- Three-stage run-time execution model
  - Goal: provide the efficiency of synchronous dataflow execution on parameterized dataflow
- Stage 1: dataflow initialization
  - Convert a PDF graph into an SDF graph by setting parameter variables to constant values
  - Perform other initialization computation
- Stage 2: dataflow computation
  - Dataflow computation following static SDF execution schedules, from stream input to stream output
- Stage 3: dataflow finalization
  - Update the dataflow states with calculated results
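The three stages can be sketched as three functions called once per reconfiguration period. This is an illustration under simplifying assumptions: the graph is reduced to a single gain parameter, and GraphState, init_stage, compute_stage, final_stage, and run_period are hypothetical names rather than anything generated by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical graph state for illustration only.
struct GraphState {
    int gain = 1;                 // example parameter, fixed during stage 2
    std::size_t stream_size = 0;  // items to process before reconfiguring
    long total = 0;               // example state updated in stage 3
};

// Stage 1: bind parameters to constants so the graph is static (SDF).
void init_stage(GraphState& g, int gain, std::size_t stream_size) {
    g.gain = gain;
    g.stream_size = stream_size;
}

// Stage 2: run the static schedule; the state is read-only here.
std::vector<int> compute_stage(const GraphState& g, const std::vector<int>& in) {
    assert(in.size() >= g.stream_size);
    std::vector<int> out;
    out.reserve(g.stream_size);
    for (std::size_t i = 0; i < g.stream_size; ++i) {
        out.push_back(in[i] * g.gain);
    }
    return out;
}

// Stage 3: fold the computed results back into the graph state.
void final_stage(GraphState& g, const std::vector<int>& out) {
    for (int v : out) g.total += v;
}

// One reconfiguration period: initialize, compute, finalize.
long run_period(GraphState& g, int gain, const std::vector<int>& in) {
    init_stage(g, gain, in.size());
    final_stage(g, compute_stage(g, in));
    return g.total;
}
```

Passing the state by const reference in stage 2 mirrors the idea that parameters do not change while the static schedule runs; they change only between periods.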

System Compilation Frontend
- Start from a stream system described in C or C++ with SPEX
- Translate the description into a dataflow representation

- Q: Why can’t we compile pure C/C++?
- A: Some of C/C++’s language features cannot be translated into dataflow
  - e.g. passing pointers as function arguments
  - C/C++: a pointer’s memory location can be both read and written
  - Dataflow: edges are either read-only or write-only

    #include                              // SPEX definition headers
    class WCDMA: spex_kernel {
        // Functions for declaring dataflow nodes
        pdf_node(interleaver)(...) {... }
        pdf_node(turbo_dec)(...) {... }

        // Functions for declaring a dataflow graph
        pdf_graph(wcdma_rec)()
        {
            ...
            interleaver(intlv_to_turbo, intlv_in);
            turbo_dec(turbo_out, intlv_to_turbo);
            ...
        }
    };

- SPEX is a set of keywords and language restrictions
  - A guideline for programmers to write stylized C/C++ code that can be translated into dataflow
  - Dataflow-safe C/C++ programming
- SPEX code can be compiled directly with g++

SPEX pdf_node Code Snippets

    pdf_node(fir)(channel in, channel & out)
    {
        ...
        z[0] = in.pop();                  // FIR’s dataflow input
        for (i = 0; i < TAPS; i++) {
            sum += z[i] * coeff[i];
        }
        out.push(sum);                    // FIR’s dataflow output
        ...
    }

- in: read-only input dataflow edge
- out: write-only output dataflow edge
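To make the pop/compute/push pattern concrete, here is a self-contained sketch that compiles as ordinary C++ without the SPEX headers. The Channel class is a hypothetical stand-in for the SPEX channel type, Fir is a hypothetical plain class rather than a pdf_node, and the delay-line shift is an assumption about what the elided parts of the slide's snippet do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the SPEX channel type: a FIFO of ints.
class Channel {
public:
    void push(int v) { q_.push_back(v); }
    int pop() {
        assert(!q_.empty());
        int v = q_.front();
        q_.pop_front();
        return v;
    }
    std::size_t size() const { return q_.size(); }
private:
    std::deque<int> q_;
};

constexpr std::size_t TAPS = 3;

// One firing of an FIR node consumes one sample and produces one sample.
class Fir {
public:
    explicit Fir(const std::array<int, TAPS>& coeff) : coeff_(coeff) {}

    void run(Channel& in, Channel& out) {
        // Shift the delay line (assumed step), then read the newest sample.
        for (std::size_t i = TAPS - 1; i > 0; --i) z_[i] = z_[i - 1];
        z_[0] = in.pop();
        int sum = 0;
        for (std::size_t i = 0; i < TAPS; ++i) sum += z_[i] * coeff_[i];
        out.push(sum);
    }

private:
    std::array<int, TAPS> coeff_;
    std::array<int, TAPS> z_{};  // delay line, zero-initialized
};
```

Feeding an impulse (1, 0, 0) through a filter with coefficients (1, 2, 3) yields the coefficients themselves, which is a quick check that the node only reads from its input edge and only writes to its output edge.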

SPEX Code Snippets

    pdf_graph(WCDMA_rec)()
    {
        // Static PDF node and edge declarations
        FIR fir;
        ...
        channel fir_to_rake;
        ...
        // PDF scope: a PDF graph description
        pdf {
            for (i = 0; i < slot_size; i++) {
                fir.run(fir_to_rake, AtoD);
                rake.run(rake_out, fir_to_rake);
                if (mode == voice)
                    viterbi.run(mac_in, rake_out);
                else
                    turbo.run(mac_in, rake_out);
                mac(mac_in);
            }
        }
    }

    // Descriptions for the dataflow initialization and finalization stages
    pdf_graph_init(WCDMA_rec)() {... }
    pdf_graph_final(WCDMA_rec)() {... }

- Language restrictions within PDF scope, e.g.:
  - Must only use for-loop constructions with constant loop bounds
  - Must only include function calls to pdf_node functions
- A guideline for writing dataflow-safe C++ code
[Figure: resulting dataflow graph with nodes fir, rake, if, vit, tur, if, mac]

System Compilation Frontend
- Translate SPEX into a parameterized dataflow representation
  - Use traditional control-flow and dataflow analysis
  - Semantic error-checking to ensure dataflow-safe C/C++ code
- Possible to support other high-level languages

System Compilation Backend
- Function-level compilation
  - Node-to-DE assignments
  - Memory buffer allocations
  - DMA assignments
- Function-level optimizations
  - Software pipelining
- Code generation
  - Parallel thread generation
  - Physical buffer allocation
  - If-conversion and predicate propagation

Conclusion
- System-level compilation framework
  - We have a working compiler for SPEX
  - Target: SODA-like multi-core DSPs
- Parameterized dataflow is used as the compiler IR
- SPEX is a set of language extensions for efficient translation from C/C++ into dataflow

Questions

Shared Variables In Dataflow
- Shared variables are not allowed in traditional dataflow models
- SPIR allows shared variables between dataflow nodes
  - Multi-dimensional streaming patterns
  - Non-sequential streaming patterns
  - Decoupled streaming
  - Shared memory buffers

Backend Compilation
- Problem with function-level compilation
  - Requires function-level parallelism
  - Wireless protocols do not have many concurrent functions
[Figure: FIR, Rake, and Turbo mapped to PE0, PE1, and PE2, processing in[0..N]]

Backend Compilation
- Utilize existing compiler optimizations
  - Function-level software pipelining
  - Processing each stream data item is analogous to one loop iteration
  - Modulo scheduling applied to function-level compilation
[Figure: FIR, Rake, and Turbo on PE0, PE1, and PE2 run as a software pipeline over in[i], in[i+1], and in[i+2]]
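The pipelining idea can be illustrated with a small schedule table. This sketch assumes one function per processing element, equal latency for every stage, and a new input started at every step. The function pipeline_table is hypothetical and much simpler than a real modulo scheduler, which also has to account for differing stage latencies and resource constraints.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: a pipeline table for `stages` functions (for example
// FIR, Rake, Turbo), one per processing element, over `items` inputs.
// table[t][s] is the input index that stage s works on at step t,
// or -1 if that stage is idle at that step.
std::vector<std::vector<int>> pipeline_table(int stages, int items) {
    assert(stages > 0 && items > 0);
    const int steps = items + stages - 1;  // fill, steady state, and drain
    std::vector<std::vector<int>> table(steps, std::vector<int>(stages, -1));
    for (int t = 0; t < steps; ++t) {
        for (int s = 0; s < stages; ++s) {
            const int item = t - s;  // stage s lags stage 0 by s steps
            if (item >= 0 && item < items) table[t][s] = item;
        }
    }
    return table;
}
```

With three stages and four inputs, step 2 has the first stage on input 2, the second on input 1, and the third on input 0, so all three processing elements are busy even though each input still passes through the functions in order.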