A Parameterized Dataflow Language Extension for Embedded Streaming Systems
Yuan Lin (1), Yoonseo Choi (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2)
(1) Advanced Computer Architecture Lab, University of Michigan at Ann Arbor
(2) Department of Electrical Engineering, Arizona State University
Embedded Streaming Systems
- Mobile computing: multimedia anywhere, at any time
- Many of its key workloads are embedded streaming systems
  - Video/audio coding (e.g. H.264)
  - Wireless communications (e.g. W-CDMA)
  - 3D graphics and others
- Cell phones are getting more complex
- PCs are getting more mobile
Characteristics of Streaming Systems
[Figure: W-CDMA physical layer processing, between the analog frontend and the upper layer. Transmitter blocks: channel encoder, interleaver, spreader, scrambler, LPF-Tx. Receiver blocks: LPF-Rx, searcher, descrambler/despreader paths, combiner, interleaver, channel decoder (Viterbi/Turbo).]
- Data are processed in a pipeline of DSP algorithm kernels
- Mostly vector/matrix-based data computation
- Periodic system reconfigurations, e.g. changing from voice communication to data communication
Embedded DSP Processors
[Figure: multi-core DSP with an ARM processor, four data engines (DEs), each with a SIMD unit and local memory, and a global memory]
- Current trend: multi-core DSPs for streaming applications
  - IBM Cell processor
  - TI OMAP
  - Many other SoCs
- Common hardware characteristics
  - Multiple (potentially heterogeneous) data engines (DEs)
  - Software-managed scratchpad memories
  - Explicit DMA transfer operations
- Our DSP case study: SODA, a multi-core DSP processor
Programming Challenge
How to automatically compile streaming systems onto multi-core DSP hardware?
[Figure: the multi-core DSP (ARM processor, four DEs with SIMD units and local memory, global memory) with question marks over the mapping]
- How to divide the system into multiple threads?
- How to SIMDize DSP kernels?
- When and where to issue DMA transfers?
- VLIW execution scheduling?
- How to manage the local and global memory?
- Who does the execution scheduling?
- And many other problems
Compile for Multi-core DSPs
Two-tier compilation approach
[Figure: the W-CDMA transmitter/receiver pipeline is mapped onto the SODA system architecture (an ARM processor, four PEs with execution units and local memory, and a global memory); kernel functions such as void Turbo() {... } are assigned to PEs. Inset: one PE with a SIMD pipeline (32-lane SIMD ALU, SIMD RF, 32-lane SSN, SIMD data memory, SIMD-to-scalar path, STV and VTS) and a scalar pipeline (16-bit ALU, scalar RF, scalar data memory).]
- This presentation focuses on system-level language and compilation
- Compiling functions, not instructions
System Compilation Overview
[Figure: compilation flow: SPEX, frontend, SPIR, backend, then code for the ARM and DE0]
- Coarse-grained compilation
  - Function-level, not instruction-level
  - C/C++-to-C compiler
- SPEX: Signal Processing EXtension
  - Our high-level language extension
- Frontend compilation
  - Translate from SPEX into SPIR
- SPIR: Signal Processing IR
  - System compiler's IR
  - Models function-level interactions
- Backend compilation
  - Function-level compilation
  - Generate multi-threaded C code
SPIR: Function-level IR
[Figure: a dataflow graph of nodes connected by FIFO buffers]
- Must capture stream applications' system-level behaviors
- Based on the dataflow computation model
  - Good for modeling streaming computations
  - Easy to generate parallel code
- But which dataflow model?
Synchronous Dataflow
[Figure: a dataflow node with input_rate = 2 and output_rate = 3]
- Synchronous dataflow (SDF)
  - Simplest dataflow model
  - Static dataflow
  - No conditional dataflow allowed
- Pros
  - Efficiency: execution schedules can be generated at compile time
  - Optimality: we know how to compile SDFs for multi-processor DSPs (Berkeley Ptolemy project, MIT StreamIt compiler)
- Cons
  - Lack of flexibility: cannot describe run-time reconfigurations in stream computations
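To make the compile-time scheduling claim concrete, here is a minimal sketch of how fixed rates determine firing counts. It assumes a linear chain of nodes, which the slide does not state; SdfEdge and repetition_vector are hypothetical names and are not part of SPIR. For each edge, the tokens produced per schedule period must equal the tokens consumed, so a node with output rate 3 feeding a node with input rate 2 has to fire 2 and 3 times respectively.

```cpp
#include <cassert>
#include <vector>

// One edge of a linear SDF chain (illustrative type, not from SPIR):
// the producer pushes `produce` tokens per firing and the consumer
// pops `consume` tokens per firing.
struct SdfEdge {
    int produce;
    int consume;
};

// Greatest common divisor of two non-negative values; gcd(0, x) == x.
long long gcd_ll(long long a, long long b) {
    while (b != 0) {
        long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Returns the smallest positive firing counts r[0..n] such that, for
// every edge i, r[i] * produce == r[i + 1] * consume (the balance
// equations of a static-rate chain).
std::vector<long long> repetition_vector(const std::vector<SdfEdge>& chain) {
    std::vector<long long> r(1, 1);
    for (const SdfEdge& e : chain) {
        assert(e.produce > 0 && e.consume > 0);
        long long tokens = r.back() * e.produce;
        // Scale all earlier counts so the consumer count is an integer.
        long long scale = e.consume / gcd_ll(tokens, e.consume);
        for (long long& x : r) x *= scale;
        r.push_back(r.back() * e.produce / e.consume);
    }
    // Reduce to the minimal solution.
    long long g = 0;
    for (long long x : r) g = gcd_ll(g, x);
    for (long long& x : r) x /= g;
    return r;
}
```

Because the rates are constants, this computation needs no run-time information, which is what lets an SDF compiler fix the schedule ahead of time.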
Parameterized Dataflow
[Figure: a dataflow node with input_rate = {1, 4, 8} and output_rate = {2, 8}]
- Parameterized dataflow (PDF)
  - Use parameters to model run-time system reconfiguration
  - Each parameter is a variable with a finite set of discrete values
- Parameterized attributes in SPIR
  - Dataflow rates
First proposed by: B. Bhattacharya and S. S. Bhattacharyya, "Parameterized Dataflow Modeling for DSP Systems," IEEE Transactions on Signal Processing, Oct. 2001
Parameterized Dataflow
- Parameterized attributes in SPIR
  - Dataflow rates
  - Conditional dataflow
[Figure: an IF construct with if_cond = {true, false} selecting between an if node and an else node; edge rates are labeled {1,4,8}, {2,8}, {6,8}, and {2,4}]
Parameterized Dataflow
- Parameterized attributes in SPIR
  - Dataflow rates
  - Conditional dataflow
  - Number of dataflow actors
[Figure: a split node fans out to actors A[0], A[1], ..., A[n], which feed a merge node; number of A nodes = {1, 4, 12}]
Parameterized Dataflow
- Parameterized attributes in SPIR
  - Dataflow rates
  - Conditional dataflow
  - Number of dataflow actors
  - Streaming size between reconfigurations, e.g. stream_size = {10k, 20k}
- There are also other modifications to the dataflow model
  - Please refer to the paper for further details
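The idea of a parameter as a variable over a finite value set can be sketched as follows. PdfParam, PdfNode, SdfNode, bind_rates, and configuration_count are hypothetical names chosen for illustration and are not taken from SPIR. The sketch shows that once each parameter is bound to one of its allowed values, what remains is an ordinary fixed-rate node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A parameter restricted to a finite set of allowed values
// (illustrative type, not from SPIR).
struct PdfParam {
    std::vector<int> allowed;

    // True if `v` is one of the allowed values.
    bool admits(int v) const {
        return std::find(allowed.begin(), allowed.end(), v) != allowed.end();
    }
};

// A node whose rates are parameters rather than constants.
struct PdfNode {
    PdfParam input_rate;
    PdfParam output_rate;
};

// The same node after its parameters have been fixed.
struct SdfNode {
    int input_rate;
    int output_rate;
};

// Binds both rate parameters to constants, yielding a static-rate node.
// Values outside the declared sets are rejected.
SdfNode bind_rates(const PdfNode& node, int in, int out) {
    assert(node.input_rate.admits(in));
    assert(node.output_rate.admits(out));
    SdfNode bound = {in, out};
    return bound;
}

// Number of distinct static configurations this node can take.
std::size_t configuration_count(const PdfNode& node) {
    return node.input_rate.allowed.size() * node.output_rate.allowed.size();
}
```

With input_rate = {1, 4, 8} and output_rate = {2, 8} as in the node figure, there are six possible static configurations, so the set of schedules a compiler might have to consider stays finite.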
PDF Run-time Execution Model
- Three-stage run-time execution model
- Goal: provide the efficiency of synchronous dataflow execution on parameterized dataflow
PDF Run-time Execution Model
- Stage 1: dataflow initialization
  - Convert a PDF graph into an SDF graph
  - Set parameter variables to constant values
  - Perform other initialization computation
PDF Run-time Execution Model
- Stage 2: dataflow computation
  - Dataflow computation follows static SDF execution schedules
[Figure labels: stream input, stream output]
PDF Run-time Execution Model
- Stage 3: dataflow finalization
  - Update the dataflow states with calculated results
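The three stages can be sketched as a loop over reconfiguration periods. Everything here is an assumption made for illustration: GraphState, StaticConfig, and the pdf_init, pdf_compute, pdf_finalize, and run_stream functions are hypothetical, the only parameter is a block size drawn from {2, 4}, and the computation is a plain sum. The point is only the shape of the loop: parameters are frozen in stage 1, stage 2 runs with constant bounds, and stage 3 writes results back and may pick new parameter values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// State that persists across reconfiguration periods (hypothetical).
struct GraphState {
    int block_size;   // parameter with allowed values {2, 4}
    long long total;  // result accumulated across periods
};

// Parameters frozen to constants for one period.
struct StaticConfig {
    int block_size;
};

// Stage 1: bind parameters to constant values.
StaticConfig pdf_init(const GraphState& state) {
    assert(state.block_size == 2 || state.block_size == 4);
    StaticConfig cfg = {state.block_size};
    return cfg;
}

// Stage 2: run with fixed bounds; nothing here depends on run-time
// parameter changes, so the loop could be scheduled statically.
long long pdf_compute(const StaticConfig& cfg, const std::vector<int>& in,
                      std::size_t offset) {
    long long sum = 0;
    for (int i = 0; i < cfg.block_size; ++i) {
        sum += in[offset + static_cast<std::size_t>(i)];
    }
    return sum;
}

// Stage 3: update state with the results and choose the next
// parameter value (here it simply alternates between 2 and 4).
void pdf_finalize(GraphState& state, long long result) {
    state.total += result;
    state.block_size = (state.block_size == 2) ? 4 : 2;
}

// Repeats the three stages while a full block of input remains.
long long run_stream(GraphState state, const std::vector<int>& in) {
    std::size_t offset = 0;
    while (offset + static_cast<std::size_t>(state.block_size) <= in.size()) {
        StaticConfig cfg = pdf_init(state);
        long long result = pdf_compute(cfg, in, offset);
        offset += static_cast<std::size_t>(cfg.block_size);
        pdf_finalize(state, result);
    }
    return state.total;
}
```

Starting with a block size of 2 on the input 1 through 6, the first period consumes two items and the second consumes four, so reconfiguration only ever happens between periods.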
System Compilation Frontend
- Start from a stream system described in C or C++ with SPEX
- Translate the description into a dataflow representation
Q: Why can't we compile pure C/C++?
A: Some of C/C++'s language features cannot be translated into dataflow, e.g. passing pointers as function arguments
- C/C++: a pointer's memory locations can be both read and written
- Dataflow: has read-only and write-only edges
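One way to picture the difference is with two restricted views of a FIFO. This is a sketch under assumptions, not how SPEX implements its channel type: Fifo, ReadEnd, WriteEnd, and scale_by_two are hypothetical names. A node that receives these views can only pop from its input and only push to its output, which a raw pointer argument cannot guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// A simple FIFO edge (illustrative; not the SPEX channel type).
class Fifo {
public:
    void push(int v) { q_.push_back(v); }
    int pop() {
        assert(!q_.empty());
        int v = q_.front();
        q_.pop_front();
        return v;
    }
    std::size_t size() const { return q_.size(); }

private:
    std::deque<int> q_;
};

// Consumer-side view: exposes pop only.
class ReadEnd {
public:
    explicit ReadEnd(Fifo& f) : f_(f) {}
    int pop() { return f_.pop(); }

private:
    Fifo& f_;
};

// Producer-side view: exposes push only.
class WriteEnd {
public:
    explicit WriteEnd(Fifo& f) : f_(f) {}
    void push(int v) { f_.push(v); }

private:
    Fifo& f_;
};

// A node body that can read `in` and write `out`, and nothing else.
// Calling in.push(...) or out.pop() would fail to compile.
void scale_by_two(ReadEnd in, WriteEnd out) {
    out.push(2 * in.pop());
}
```

With direction encoded in the types, a tool reading the function signature can tell which edges are inputs and which are outputs without analyzing pointer aliasing.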
    #include SPEX definition headers
    class WCDMA: spex_kernel {
        // Functions for declaring dataflow nodes
        pdf_node(interleaver)(...) {... }
        pdf_node(turbo_dec)(...) {... }

        // Functions for declaring a dataflow graph
        pdf_graph(wcdma_rec)()
        {...
            interleaver(intlv_to_turbo, intlv_in);
            turbo_dec(turbo_out, intlv_to_turbo);
        ... }
    };
- SPEX is a set of keywords and language restrictions
  - A guideline for programmers to write stylized C/C++ code that can be translated into dataflow
  - Dataflow-safe C/C++ programming
- SPEX code can be compiled directly with g++
SPEX pdf_node Code Snippets
    // in:  read-only input dataflow edge
    // out: write-only output dataflow edge
    pdf_node(fir)(channel in, channel & out)
    {...
        z[0] = in.pop();              // FIR's dataflow input
        for (i = 0; i < TAPS; i++) {
            sum += z[i] * coeff[i];
        }
        out.push(sum);                // FIR's dataflow output
    ... }
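For readers who want to run the FIR firing outside SPEX, here is a self-contained sketch in plain C++. It makes several assumptions: std::deque<int> stands in for the channel type, the Fir class and its member names are hypothetical, and the delay-line shift and the reset of the accumulator are guesses at what the elided parts of the snippet do.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Plain-C++ stand-in for an FIR dataflow node. Each call to run()
// is one firing: it consumes one input token and produces one output.
class Fir {
public:
    explicit Fir(const std::vector<int>& coeff)
        : coeff_(coeff), z_(coeff.size(), 0) {
        assert(!coeff_.empty());
    }

    void run(std::deque<int>& in, std::deque<int>& out) {
        assert(!in.empty());
        // Shift the delay line by one sample (assumed; the slide
        // elides this step).
        for (std::size_t i = z_.size() - 1; i > 0; --i) {
            z_[i] = z_[i - 1];
        }
        z_[0] = in.front();  // corresponds to z[0] = in.pop()
        in.pop_front();

        int sum = 0;
        for (std::size_t i = 0; i < coeff_.size(); ++i) {
            sum += z_[i] * coeff_[i];
        }
        out.push_back(sum);  // corresponds to out.push(sum)
    }

private:
    std::vector<int> coeff_;  // filter taps
    std::vector<int> z_;      // delay line, most recent sample first
};
```

With taps {1, 2} and inputs 3 then 5, the first firing sees the delay line {3, 0} and the second sees {5, 3}, so the node keeps state between firings while its only external effects are one pop and one push.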
SPEX Code Snippets
    pdf_graph(WCDMA_rec)()
    {
        // Static PDF node and edge declarations
        FIR fir;
        ...
        channel fir_to_rake;
        ...
        // PDF scope: a PDF graph description
        pdf {
            for (i = 0; i < slot_size; i++) {
                fir.run(fir_to_rake, AtoD);
                rake.run(rake_out, fir_to_rake);
                if (mode == voice)
                    viterbi.run(mac_in, rake_out);
                else
                    turbo.run(mac_in, rake_out);
                mac(mac_in);
            }
        }
    }
    // Descriptions for dataflow initialization and finalization stages
    pdf_graph_init(WCDMA_rec)() {... }
    pdf_graph_final(WCDMA_rec)() {... }
- Language restrictions within PDF scope, i.e.
  - Must only use for-loop constructions with constant loop bounds
  - Must only include function calls to pdf_node functions
- A guideline for writing dataflow-safe C++ code
[Figure: the resulting dataflow graph: fir, rake, if, then vit or tur, if, mac]
System Compilation Frontend
- Translate SPEX into a parameterized dataflow representation
- Use traditional control-flow and dataflow analysis
- Semantic error-checking to ensure dataflow-safe C/C++ code
- Possible to support other high-level languages
System Compilation Backend
- Function-level compilation
  - Node-to-DE assignments
  - Memory buffer allocations
  - DMA assignments
- Function-level optimizations
  - Software pipelining
- Code generation
  - Parallel thread generation
  - Physical buffer allocation
  - If-conversion and predicate propagation
Conclusion
- System-level compilation framework
  - We have a working compiler for SPEX
  - Target: SODA-like multi-core DSPs
- Parameterized dataflow is used as the compiler IR
- SPEX is a set of language extensions for efficient translation from C/C++ into dataflow
Questions
Shared Variables In Dataflow
- Shared variables are not allowed in traditional dataflow models
- SPIR allows shared variables between dataflow nodes
  - Multi-dimensional streaming patterns
  - Non-sequential streaming patterns
  - Decoupled streaming
  - Shared memory buffers
Backend Compilation
- Problem with function-level compilation
  - Requires function-level parallelism
  - Wireless protocols do not have many concurrent functions
[Figure: FIR, Rake, and Turbo assigned to PE0, PE1, and PE2, processing the input stream in[0..N]]
Backend Compilation
- Utilize existing compiler optimizations
  - Function-level software pipelining
  - Processing each stream data item is analogous to one loop iteration
  - Modulo scheduling applied to function-level compilation
[Figure: software-pipelined execution on PE0, PE1, and PE2; the FIR, Rake, and Turbo stages of inputs in[i], in[i+1], and in[i+2] overlap in time]
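The overlap in the figure can be written down as a small schedule function. This is a simplified sketch that assumes one pipeline stage per PE and that every stage takes exactly one time step; item_at and total_steps are hypothetical names, and a real modulo scheduler would also account for unequal stage latencies and buffer constraints.

```cpp
#include <cassert>

// Index of the stream item that PE `pe` works on at time step `step`,
// or -1 if that PE is idle (pipeline fill or drain). Stage k of item i
// runs at step i + k, so PE k handles item step - k.
int item_at(int pe, int step, int num_items) {
    assert(pe >= 0 && step >= 0 && num_items >= 0);
    int item = step - pe;
    return (item >= 0 && item < num_items) ? item : -1;
}

// Total time steps to push `num_items` items through `num_stages`
// stages: the first item needs num_stages steps and each later item
// adds one more.
int total_steps(int num_stages, int num_items) {
    assert(num_stages > 0 && num_items > 0);
    return num_items + num_stages - 1;
}
```

With three stages (FIR on PE0, Rake on PE1, Turbo on PE2) and three items, step 2 has all three PEs busy on different items, and the whole run takes 5 steps instead of the 9 needed when the items are processed one after another.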