1
Hierarchical Coarse-grained Stream Compilation for Software Defined Radio
Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Laboratory
University of Michigan at Ann Arbor
2
Software Defined Radio
  Use software routines instead of ASICs for the physical-layer operations of wireless communication systems
  Advantages:
    Multi-mode operation
    Lower costs
    Faster time to market
    Prototyping and bug fixes
    Chip volumes
    Longevity of platforms
    Enables future wireless communication innovations
  Complexity favors software-based solutions
3
Case Study: W-CDMA
  Key software characteristics
    Multiple kernels connected together as a system
    Streaming computation
    Vector-based inter-kernel communications
    Mostly static computation patterns
4
SODA: An SDR DSP Architecture (ISCA 06)
  Control-data decoupled multi-core architecture
  1 ARM general-purpose control processor
    Scalar algorithms and protocol controls
  4 data processing elements
    SIMD + scalar units
    Used for high-throughput DSP algorithms
5
SODA Execution Model
  Software-managed scratchpad memories
    Each PE can only access its local memory
  DMA operations
    Access global memory
    Inter-PE communications
  Algorithms statically mapped onto PEs
    RPCs from the ARM control processor
6
Compilation Challenges for SDR
  Compilation support for SDR is essential
    Flexibility
    Lower development cost
    More complex protocols
  Compilation support for SDR is challenging
    Heterogeneous multiprocessor hardware
      ARM + DSPs
      Two-level scratchpad memories
    Multiple software constraints
      Throughput + code & data size + real-time execution + others
7
2-Tier Compilation Process
  Multiprocessor system compilation
  DSP kernel compilation
  This study is focused on system compilation
    Kernel compilation is treated as a black box
      Existing libraries
      SIMD compilers
  Objective
    Kernel-to-PE assignments
    Memory allocations
  Subject to
    Throughput constraints
    Memory constraints
8
System Compilation Outline
  SPIR: function-level IR
    Traditional IR is not adequate
    Complex inter-function interactions
  Backend compilation
    Scheduling functions instead of instructions
    Function-level modulo scheduling
9
SPIR Overview
  Dataflow programming model
    Graph consists of nodes and edges
  Two types of nodes
    Kernel (yellow) nodes for modeling functions
    Memory (blue) nodes for modeling vector buffers
      Buffer stream description + vector stream description
  Dataflow edges
    Synchronous dataflow (in the scope of this paper)
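The two node types described on this slide can be sketched as a small data structure. This is only an illustration, not SPIR's actual representation: the class names, the `rate` field, and the node names `lpf`, `lpf_out`, and `consumer` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    kind: str  # "kernel" (models a function) or "memory" (models a vector buffer)


@dataclass
class Edge:
    src: str
    dst: str
    rate: int  # elements moved per firing (hypothetical field, not from the slides)


@dataclass
class DataflowGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, name: str, kind: str) -> None:
        if kind not in ("kernel", "memory"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = Node(name, kind)

    def add_edge(self, src: str, dst: str, rate: int) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(src, dst, rate))

    def kernels(self) -> List[str]:
        # Names of kernel nodes, in insertion order.
        return [n.name for n in self.nodes.values() if n.kind == "kernel"]


# Illustrative graph: kernel -> memory buffer -> kernel.
g = DataflowGraph()
g.add_node("lpf", "kernel")
g.add_node("lpf_out", "memory")
g.add_node("consumer", "kernel")
g.add_edge("lpf", "lpf_out", rate=1)
g.add_edge("lpf_out", "consumer", rate=1)
```

Keeping memory nodes explicit, rather than folding buffers into edges, lets a later pass attach a size and a location to each buffer.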
10
SPIR Overview
  Problems with flat dataflow graph representations
    All kernels are matched to the highest rate
    SDR kernels have very different stream rates
      Turbo decoder: input rate = 9600; output rate = 3200
      LPF: input rate = 1; output rate = 1
11
SPIR Overview
  Problems with flat dataflow graph representations
    All kernels must match the Turbo decoder's rate of 9600
      Minimum LPF rate: input = 38.4K, output = 38.4K
    Stream rates translate to memory buffers
      Unnecessarily large memory buffers
12
SPIR Overview
  Hierarchical dataflow graphs
    Different hierarchy levels with different streaming rates
  Streaming vectors are modeled as hierarchical communications
    Top level: buffer queue descriptions
    Bottom level: vector streaming descriptions
13
SPIR Overview
  W-CDMA
    Modeled with a 3-level hierarchy in SPIR
    Memory nodes are inserted between nodes with child graphs
    4x decrease in memory buffer usage
14
Coarse-grained System Compilation
  Three major tasks
    Resource allocation (processor, memory and DMA)
    Kernel execution ordering
    Kernel execution timing
  Static or dynamic?
    Static: compiler
      Less flexible, more efficient
    Dynamic: run-time scheduler or OS
      More flexible, less efficient
  For SDR applications
    Resource allocation: static
    Kernel execution ordering: static
    Kernel execution timing: dynamic
15
Software Pipelining Streaming Kernels
  Problem with coarse-grained compilation
    Requires kernel-level parallelism to utilize the PEs
    SDR protocols do not have many data-independent kernels
  Compiler optimization: coarse-grained software pipelining
    Stream computation: pipeline parallelism
    Modulo scheduling
16
Coarse-grained System Compilation
  Input: hierarchical graph
  Step 1: Dataflow rate matching
  Step 2: Stream size selection
  Step 3: Modulo scheduling
  Step 4: Hierarchical compilation
  [Figure: compilation flow with stages labeled Modulo compilation, Dataflow rate matching, Stream size selection, Hierarchical scheduling]
17
Coarse-grained System Compilation
  Step 1: Dataflow rate matching
    Producer and consumer pair must have the same rates
      Edges are memory buffers
    Well studied with many existing algorithms
      Single appearance schedule
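Rate matching for synchronous dataflow is commonly done by solving the balance equations: for every edge, (producer firings) x (elements produced) must equal (consumer firings) x (elements consumed). The sketch below computes the smallest integer firing counts for a connected graph. It shows the textbook approach rather than the algorithm used in this work. The chain `lpf -> turbo -> sink` is hypothetical and only reuses the per-kernel rates quoted earlier (LPF 1/1, Turbo decoder 9600 in, 3200 out).

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(nodes, edges):
    """Return the minimal firing count per node for a connected SDF graph.

    edges: list of (src, dst, produced_per_firing, consumed_per_firing).
    """
    rate = {nodes[0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative rates until every node is reached
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
    # Every edge must balance, otherwise no static schedule exists.
    for src, dst, produced, consumed in edges:
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError(f"inconsistent rates on edge {src}->{dst}")
    # Scale the fractions to the smallest integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {n: int(r * scale) for n, r in rate.items()}


# Hypothetical chain reusing the rates quoted on the earlier slides.
reps = repetition_vector(
    ["lpf", "turbo", "sink"],
    [("lpf", "turbo", 1, 9600), ("turbo", "sink", 3200, 1)],
)
```

In this flat chain the low-rate kernel has to fire 9600 times for each Turbo decoder firing, so the buffer between them must hold 9600 elements. This is the buffer growth the hierarchical representation is meant to avoid.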
18
Coarse-grained System Compilation
  Step 2: Stream size selection
    Pick optimal input/output buffer size
      Multiple of the base rate
    Binary search algorithm
      Modulo schedule each candidate buffer size
  Example: rate = 1, streaming N elements
    Case 1: N iterations
      Too much DMA overhead
    Case 2: 1 iteration
      Cannot software pipeline
    Case 3: N/M iterations
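The binary search over candidate stream sizes might look like the sketch below. The slides do not say what is tested for each candidate beyond modulo scheduling it, so the `feasible` callback is an assumption: it stands in for "this size can be modulo scheduled within the constraints" and is assumed to be monotonic, so that once a size fails all larger sizes fail too. The function name, the double-buffering predicate, and the 4096-element scratchpad are likewise illustrative.

```python
def select_stream_size(base_rate, max_multiple, feasible):
    """Return the largest multiple of base_rate accepted by feasible(), or None.

    Assumes feasible(size) is monotonic: True up to some size, False beyond it.
    """
    lo, hi = 1, max_multiple
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        size = mid * base_rate
        if feasible(size):
            best = size      # accepted: try a larger stream
            lo = mid + 1
        else:
            hi = mid - 1     # rejected: try a smaller stream
    return best


# Hypothetical predicate: a double-buffered stream must fit in a
# 4096-element scratchpad.
def fits(size):
    return 2 * size <= 4096


chosen = select_stream_size(base_rate=64, max_multiple=1024, feasible=fits)
```

In a real flow the callback would run the modulo scheduler on the candidate size and report whether a schedule was found.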
19
Coarse-grained System Compilation
  Step 3: Function-level modulo scheduling
    II (initiation interval) selection
      Interval between the start of successive iterations
      MinII = Max(ResMII, RecMII)
        ResMII: total latency of all nodes divided by # of PEs
        RecMII: maximum latency of feedback paths
    Constraint-based modulo scheduling
      SMT-based algorithm
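The MinII bound can be worked through in a few lines. The sketch follows the definitions on the slide, with two assumptions: the division is rounded up to a whole time unit, and each feedback path is given as a list of kernel names. The latencies and kernel names are made up for illustration.

```python
def min_ii(latency, num_pes, feedback_paths=()):
    """Lower bound on the initiation interval.

    latency: dict mapping kernel name -> latency
    feedback_paths: iterable of kernel-name lists, one per feedback path
    """
    total = sum(latency.values())
    res_mii = -(-total // num_pes)  # ceiling division (rounding up is an assumption)
    rec_mii = max(
        (sum(latency[k] for k in path) for path in feedback_paths),
        default=0,                  # no feedback paths: recurrence bound is 0
    )
    return max(res_mii, rec_mii)


# Hypothetical kernel latencies.
lat = {"a": 40, "b": 60, "c": 100, "d": 20}
no_feedback = min_ii(lat, num_pes=4)                      # resource-bound
with_feedback = min_ii(lat, num_pes=4,
                       feedback_paths=[["a", "d"]])       # recurrence-bound
```

With a total latency of 220 on 4 PEs the resource bound is 55. Adding a feedback path of latency 60 raises MinII to 60.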
20
SMT-based Modulo Scheduling
  Using a Satisfiability Modulo Theories (SMT) solver
    Yices
    Input: a set of constraints expressed as equations
    Output: a set of conditions where the constraints evaluate to true
  Constraints
    Throughput constraints
      i.e. total execution time must be less than or equal to II
    Memory constraints
      i.e. buffer size less than PE’s scratchpad memories
    Communication constraints
      i.e. DMA added for communicating kernels on different PEs
  [Equation legend: status of kernel v_i assigned to processor j (1 or 0); number of kernels]
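To show how the three constraint families interact, the sketch below replaces the SMT solver with a brute-force search over kernel-to-PE assignments. It is not how Yices works and not the formulation on the slide, just the same kind of constraints checked exhaustively on a tiny instance. All names and numbers are hypothetical. Charging the DMA cost to the producer's PE and allowing a buffer total equal to the scratchpad size are assumptions.

```python
from itertools import product


def find_assignment(latency, buf, edges, num_pes, ii, scratchpad, dma_cost):
    """Return a kernel -> PE mapping meeting all constraints, or None."""
    kernels = list(latency)
    for choice in product(range(num_pes), repeat=len(kernels)):
        pe_of = dict(zip(kernels, choice))
        time = [0] * num_pes
        mem = [0] * num_pes
        for k in kernels:
            time[pe_of[k]] += latency[k]
            mem[pe_of[k]] += buf[k]
        # Communication: add a DMA cost when an edge crosses PEs
        # (assumption: the cost is charged to the producer's PE).
        for src, dst in edges:
            if pe_of[src] != pe_of[dst]:
                time[pe_of[src]] += dma_cost
        throughput_ok = all(t <= ii for t in time)      # per-PE time within II
        memory_ok = all(m <= scratchpad for m in mem)   # buffers fit locally
        if throughput_ok and memory_ok:
            return pe_of
    return None


# Hypothetical three-kernel pipeline on two PEs.
latency = {"a": 30, "b": 30, "c": 30}
buf = {"a": 100, "b": 100, "c": 100}
edges = [("a", "b"), ("b", "c")]
found = find_assignment(latency, buf, edges, num_pes=2, ii=70,
                        scratchpad=256, dma_cost=5)
too_tight = find_assignment(latency, buf, edges, num_pes=2, ii=50,
                            scratchpad=256, dma_cost=5)
```

Exhaustive search grows as (number of PEs) to the power of (number of kernels), which is why a constraint solver is attractive for realistic graphs.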
21
Coarse-grained System Compilation
  Step 4: Hierarchical scheduling
    Bottom-up scheduling
      Treat each child graph as a single node
    Memory nodes assigned to global memory
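Bottom-up scheduling can be pictured as a recursion that schedules each child graph first and then replaces it with a single node whose latency is the child's schedule length. In the sketch below a graph is a nested dict with integer latencies at the leaves. The `schedule` callback is a placeholder for the real modulo scheduler, and here it simply sums latencies as if kernels ran back to back. The hierarchy and the numbers are invented.

```python
def collapse(node, schedule):
    """Return the latency of a node, scheduling child graphs bottom-up.

    node: an int latency (leaf kernel) or a dict name -> node (child graph)
    schedule: function mapping {name: latency} to a schedule length
    """
    if isinstance(node, int):
        return node
    # Schedule every child first, then treat this graph as a single node.
    child_latency = {name: collapse(child, schedule)
                     for name, child in node.items()}
    return schedule(child_latency)


def sequential(latencies):
    # Placeholder scheduler (assumption): run kernels back to back.
    return sum(latencies.values())


# Hypothetical two-level graph.
graph = {
    "front_end": {"filter": 8, "sync": 12},
    "receiver": {"stage1": 10, "stage2": 20, "stage3": 5},
    "decoder": 100,
}
top_level_latency = collapse(graph, sequential)
```

Replacing `sequential` with a function that returns the initiation interval found by a modulo scheduler would give each level its own pipelined schedule.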
22
Conclusion
  Compilation support for SDR is essential
  2-tiered compilation process
    System compilation
    DSP compilation
  System compilation is function-level scheduling
    Hierarchical dataflow IR
      ~4x saving in memory buffer allocation
    SMT-based modulo scheduling
      Linear speedup up to 8 PEs
      Resulting in ~23% faster schedules than greedy
23
Questions
24
Case Study: W-CDMA
25
Results: Average Speedup