Published by Todd Dalton. Modified over 9 years ago.
1
RICE UNIVERSITY
DSPs for 4G wireless systems
Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF
2
Motivation
[Block diagram: wireless mobile device with RF unit, A/D and D/A converters, and a baseband programmable communications processor]
Mobile: switch between standards and between parameters
Base-station: varying number of users with different parameters
Programmability: flexibility is good
3
The problem
[Figure: GPP, DSP, FPGA and VLSI placed against performance and flexibility]
Best architecture for power and area constraints?
4
An approach for the solution
Algorithms are well understood at the data-flow level
Real-time systems can be designed in VLSI
Pushing the implementation higher in the chain
Current DSPs are not powerful enough for our application
Use an architecture simulator to design our own
5
Proposed solution
Current solutions to meet real-time (racks of DSPs)
Programmable processor for 4G wireless systems < x cm
Future wireless architectures
x = 2.5 (W-CDMA BS)
x = 2.0 (W-LAN BS)
x = 1.5 (Mobile Handset)
JOE
6
Past work
[Diagram labels, in slide order]
Algorithms, DSP, VLSI, FPGA, IMAGINE
Multiuser channel estimation, Multiuser detection
Task-partitioning, Parallelism, Pipelining
Conventional arithmetic, On-line arithmetic
Architecture innovations, Functional unit design and usage
Distant Past, Recent Past, Recent and Near Future
System Design
7
Contents
Motivation
The “Imagine” simulator
Parallel algorithms for estimation/detection/decoding
Performance comparisons and results
8
The IMAGINE architecture
9
Why IMAGINE simulator?
RSIM, SimpleScalar: GPP simulators
IMAGINE is great for media processing algorithms
It has a VLIW-based cluster, which allows DSP comparisons
A good base architecture: 1024-pt FFT
10
Simulator knobs that we can turn
Cycle-accurate simulator
Varying number of functional units and their design
Varying memory and register sizes
Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
Almost anything can be changed, some changes easier than others!
11
Caveats
Two-level C++ programming
StreamC: transfers streams of data between main memory and the stream register file (SRF)
KernelC: transfers streams from the SRF to the ALU clusters
Code is optimized to the number of ALU clusters and the size of the data
Compiler not yet fully developed
13
Typical workload representation (Base-station)
Wireless LAN: Equalization?, FFT, Viterbi decoding
W-CDMA: Multiuser channel estimation, Multiuser detection, Viterbi decoding
Advanced receiver schemes: Turbo decoding, Multiple antenna systems (MIMO)
14
Parallel estimation/detection/decoding
Multiuser estimation: replaced matrix inversion by gradient descent
Multiuser detection: Parallel Interference Cancellation (PIC); pipelined algorithm that avoids block-based detection
Viterbi decoding: trellis structures suited for decoding; register exchange for survivor memory; no traceback latency
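To illustrate what replacing matrix inversion by gradient descent can look like, here is a minimal sketch that solves R x = b iteratively. It is not the authors' estimation kernel: the function name, the fixed step size and the 2x2 example are assumptions chosen for illustration, and R is assumed symmetric positive definite.

```python
def gradient_descent_solve(R, b, step, iterations):
    """Approximate the solution x of R x = b without inverting R.

    Assumes R is symmetric positive definite (given as a list of rows) and
    that step is smaller than 2 / (largest eigenvalue of R).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Gradient of 0.5 * x'Rx - b'x is (R x - b).
        grad = [sum(R[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        x = [x[i] - step * grad[i] for i in range(n)]
    return x


# Hypothetical 2x2 system; the exact solution is x = [1/11, 7/11].
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gradient_descent_solve(R, b, step=0.2, iterations=200)
```

Each iteration needs only multiply-accumulate operations, and every row of the matrix-vector product is independent of the others, so the work can be spread across parallel units.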
15
Estimation/Detection (64, 32 sizes)
Multiuser Estimation: Kernels 1, 2, 3
Massaging matrices for detection: Kernels 4, 5
Multiuser Detection: Kernels 6, 7
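The multiuser detection in this talk is based on parallel interference cancellation (PIC). The sketch below shows the basic idea on a toy problem. It is a stage-by-stage version rather than the pipelined form the talk refers to, and the function name, the known amplitudes, the noise-free received vector and the 3-user correlation matrix are all assumptions made for illustration.

```python
def pic_detect(y, R, amplitudes, stages):
    """Hard-decision parallel interference cancellation.

    y: matched-filter outputs, R: cross-correlation matrix (unit diagonal),
    amplitudes: assumed-known user amplitudes, stages: number of PIC stages.
    """
    n = len(y)

    def sign(v):
        return 1 if v >= 0 else -1

    # Stage 0: plain matched-filter decisions.
    bits = [sign(v) for v in y]
    for _ in range(stages):
        # Every user subtracts the interference estimated from the previous
        # stage's decisions of all other users; the updates are independent
        # of each other, so they can run in parallel.
        bits = [
            sign(y[k] - sum(R[k][j] * amplitudes[j] * bits[j]
                            for j in range(n) if j != k))
            for k in range(n)
        ]
    return bits


# Hypothetical near-far example: a weak third user among two strong users.
R = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
A = [2.0, 2.0, 1.0]
sent = [1, 1, -1]
y = [sum(R[k][j] * A[j] * sent[j] for j in range(3)) for k in range(3)]

matched_filter_only = pic_detect(y, R, A, stages=0)
after_two_stages = pic_detect(y, R, A, stages=2)
```

In this example the matched filter alone misdetects the weak user, and the cancellation stages recover the transmitted bits.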
16
[Figure: trellis for a rate 1/2 code with K = 5, states X(0) to X(15); panels: a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis]
Upper bound on parallel clusters for good FU utilization: $N/2^k$
Maximum 8 parallel units for rate 1/2 with 16 states
17
Trellis structures for parallel Viterbi
Definition: Let each present state $p \in [1..N]$ have a set of next states $\{m_p\}$, i.e. $p \to \{m_p\}$, where $\{m_p\}$ has $2^k$ elements and $k$ is the number of inputs at the encoder. If for all $i, j \in [1..N]$ either $\{m_i\} = \{m_j\}$ or $\{m_i\} \cap \{m_j\} = \emptyset$, then a trellis that satisfies this property is denoted as a “separable” or a “fast” trellis.
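The separability property can be checked mechanically. The sketch below tests whether all next-state sets are pairwise equal or disjoint. The helper names are hypothetical, and the shift-register state update used to build the example trellis is an assumption for illustration, not necessarily the encoder used in the talk.

```python
def is_separable(next_states):
    """next_states[p] is the collection of states reachable from state p.

    Returns True when any two next-state sets are either equal or disjoint.
    """
    sets = [frozenset(s) for s in next_states]
    for a in sets:
        for b in sets:
            if a != b and a & b:
                return False
    return True


def shift_register_trellis(num_states, k=1):
    """Next-state sets for an assumed shift-register encoder with k input bits."""
    fan_out = 2 ** k
    return [
        {(p * fan_out + u) % num_states for u in range(fan_out)}
        for p in range(num_states)
    ]


rate_half = shift_register_trellis(16, k=1)
rate_two_thirds = shift_register_trellis(16, k=2)

# Number of distinct next-state groups, each of which can be processed independently.
groups_k1 = len({frozenset(s) for s in rate_half})
groups_k2 = len({frozenset(s) for s in rate_two_thirds})
```

For this assumed structure the 16-state trellis splits into 8 independent groups for k = 1 and 4 groups for k = 2, which agrees with the $N/2^k$ bound quoted on the trellis slides.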
18
[Figure: trellis for a rate 2/3 code with K = 5, states X(0) to X(15) and Y(0) to Y(15); panels: a. Shuffled Suitable Trellis for ‘k=2’, b. Rearranged Shuffled Suitable Trellis for ‘k=2’]
Upper bound on parallel clusters for good FU utilization: $N/2^k$
Maximum 4 parallel units for rate 2/3 with 16 states (having 8 will involve interprocessor communication overhead)
19
Survivor Management in Viterbi
Two techniques: Traceback (commonly used) and Register Exchange
Traceback is good for VLSI architectures, where the information bits can be decoded sequentially by proper survivor memory addressing
Drawback: sequential, with additional latency
20
Register exchange for decoding
The register for a given node at a given time contains the information bits associated with the surviving partial path that ends in that state
Survivors are calculated in conjunction with path metrics
Latency in conventional traceback is avoided
Higher power consumption, as the entire survivor memory contents are updated for all states for each bit
Suited to a parallel programmable implementation, as storing bits in a register for traceback touches the previous survivors anyway
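A minimal sketch of register-exchange survivor management is shown below. It uses a small rate 1/2 code with K = 3 and generators 7 and 5 (octal) with hard decisions, chosen here only to keep the example short; the talk's codes use K = 5. All function names are hypothetical.

```python
def parity(x):
    return bin(x).count("1") & 1


def encode(bits, K=3, gens=(0b111, 0b101)):
    """Rate 1/2 convolutional encoder; state holds the last K-1 input bits."""
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg >> 1
    return out


def viterbi_register_exchange(received, K=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoder that keeps one survivor register per state."""
    n_states = 1 << (K - 1)
    INF = float("inf")
    metrics = [0] + [INF] * (n_states - 1)      # encoder starts in state 0
    registers = [[] for _ in range(n_states)]   # decoded bits of each survivor
    for symbol in received:
        new_metrics = [INF] * n_states
        new_registers = [None] * n_states
        for state in range(n_states):
            if metrics[state] == INF:
                continue
            for u in (0, 1):
                reg = (u << (K - 1)) | state
                expected = tuple(parity(reg & g) for g in gens)
                nxt = reg >> 1
                cost = metrics[state] + sum(e != r for e, r in zip(expected, symbol))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Register exchange: copy the winning predecessor's
                    # register and append the new information bit.
                    new_registers[nxt] = registers[state] + [u]
        metrics, registers = new_metrics, new_registers
    best = min(range(n_states), key=lambda s: metrics[s])
    return registers[best]   # no traceback pass is needed


message = [1, 0, 1, 1, 0, 0, 1, 0]
coded = encode(message)
noisy = list(coded)
noisy[2] = (1 - noisy[2][0], noisy[2][1])   # flip one channel bit
decoded = viterbi_register_exchange(noisy)
```

The decoded bits are read directly from the register of the best final state. The cost is that every state's register is rewritten at every step, which is the power trade-off noted on the slide.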
22
Lower bounds on + and *
[Figure: adders and multipliers required to meet real-time for estimation, detection and decoding in a W-CDMA multiuser system, plotted on a logarithmic axis against number of users (0 to 300); curves: Add and Mul; cases: SLOW FADING (estimation every 1000 bits), MEDIUM FADING (estimation every 100 bits), FAST FADING (estimation every 10 bits); annotation: DATA RATES]
23
Kernel 2 (mmult) for 3 +, 2 *
[Figure: FU schedule over TIME with the LOOP marked; legend: Communication (waiting for input), FU unavailable (input ready but FU busy)]
Adders have limited FU utilization
$O(N^3)$ *, $O(N^3)$ +
Multipliers 100% in loop
Divider not being utilized: replace / with *
24
Kernel 2 (mmult) for 3 +, 3 *
[Figure: FU schedule over TIME]
Better adder utilization
Needs sufficient registers for scaling (register allocation may fail)
Code may also need slight tuning of variables for optimization
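The adder utilization difference between the 2-multiplier and 3-multiplier configurations can be reasoned about with a simple operation count. The sketch below counts multiplies and adds in a plain matrix multiply and applies a throughput-only bound. This is an idealized model assumed for illustration (one operation per unit per cycle, no latencies, no loop-overhead adds), not output from the Imagine simulator, and the function names are hypothetical.

```python
import math


def matmul_with_counts(A, B):
    """Multiply A (n x m) by B (m x p) and count scalar multiplies and adds."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    mults = adds = 0
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += A[i][k] * B[k][j]   # one multiply and one add
                mults += 1
                adds += 1
            C[i][j] = acc
    return C, mults, adds


def adder_utilization(mults, adds, n_mul, n_add):
    """Idealized adder utilization when the busiest unit type sets the cycle count."""
    cycles = max(math.ceil(mults / n_mul), math.ceil(adds / n_add))
    return adds / (n_add * cycles)


N = 4
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
_, mults, adds = matmul_with_counts(identity, identity)   # 64 multiplies, 64 adds

util_3add_2mul = adder_utilization(mults, adds, n_mul=2, n_add=3)
util_3add_3mul = adder_utilization(mults, adds, n_mul=3, n_add=3)
```

Because the multiply and add counts are equal, two fully busy multipliers can feed at most two of the three adders in this model, so adder utilization is capped near two thirds. Adding a third multiplier removes that cap, consistent with the better adder utilization reported for the 3 +, 3 * configuration.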
25
Kernel computational time
Time available at 128 Kbps for each of 32 users at 500 MHz: 4000 cycles
*Numbers subject to change
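The cycle budget can be sanity-checked with a one-line calculation. The sketch assumes that "Kbps" means 1000 bits per second and that the budget is the number of clock cycles in one bit period; under those assumptions the result is close to the rounded figure of 4000 cycles on the slide.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles that elapse during one bit period."""
    return clock_hz / bit_rate_bps


budget = cycles_per_bit(clock_hz=500e6, bit_rate_bps=128e3)
print(budget)  # 3906.25
```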
26
Communication overhead
Kernels (micro-controller executing)
Memory operations
Initialization
Idle time between kernels
27
Comparisons with TI C6701 DSPs
[Figure: execution time in seconds (logarithmic axis, 10^-6 to 10^-2) against number of users (0 to 35); curves: single DSP implementation, 2 DSP implementation, our architecture based on Imagine; target data rate: 128 Kbps/user]
28
Kernel comparisons
[Chart: execution time per kernel for IMAGINE, TI C67 with internal memory, and TI C67 with external memory]
29
4Gone Conclusions
Various programmable architectures can be investigated QUICKLY for 4G systems, depending on algorithms, time, area and power constraints
Enormous potential for 4G system prototyping
Programmable baseband processor design with broad system functionality, flexibility and low power consumption that allows a smooth and fast transition from 2G to 3G to 4G systems
30
Future work
Investigating bottlenecks, functional unit design and other innovations needed to attain real-time
Power and area constraints
Scalability with data rates
Handset algorithms
The insights gained from the design can also be applied to DSPs and other processors.