

1 RICE UNIVERSITY DSPs for 4G wireless systems
Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF

2 RICE UNIVERSITY Motivation
[Figure: wireless mobile device with RF Unit, A/D, D/A and Baseband Programmable Communications Processor]
• Mobile: switch between standards and between parameters
• Base-station: varying number of users with different parameters
• Programmability - flexibility is good

3 RICE UNIVERSITY The problem
[Figure: GPP, DSP, FPGA and VLSI compared on Performance and Flexibility]
Best architecture for power and area constraints?

4 RICE UNIVERSITY An approach for the solution
• Algorithms well understood at data-flow level
• Can design real-time systems in VLSI
• Pushing implementation higher in the chain
• Current DSPs not powerful enough for our application
• Use an architecture simulator to design our own

5 RICE UNIVERSITY Proposed solution Current solutions to meet real-time (Racks of DSPs) Programmable Processor for 4G wireless systems < x cm Future wireless architectures x = 2.5 (W-CDMA BS) x = 2.0 (W-LAN BS) x = 1.5 (Mobile Handset) JOE

6 RICE UNIVERSITY Past work
[Diagram labels: Algorithms; DSP; VLSI; FPGA; IMAGINE; Multiuser channel estimation; Multiuser detection; Task-partitioning; Parallelism; Pipelining; Conventional arithmetic; On-line arithmetic; Architecture innovations; Functional unit design and usage; System Design. Timeline: Distant Past, Recent Past, Recent and Near Future]

7 RICE UNIVERSITY Contents
• Motivation
• The “Imagine” simulator
• Parallel algorithms for estimation/detection/decoding
• Performance comparisons and results

8 RICE UNIVERSITY The IMAGINE architecture

9 RICE UNIVERSITY Why IMAGINE simulator?
• RSIM, SimpleScalar: GPP simulators
• Great for media processing algorithms
• Has a VLIW-based cluster -- DSP comparisons
• A good base architecture: 1024-pt FFT

10 RICE UNIVERSITY Simulator knobs that we can turn
• Cycle-accurate simulator
• Varying number of functional units and their design
• Varying memory, register sizes
• Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
• Almost anything can be changed, some changes easier than others!

11 RICE UNIVERSITY Caveats
• Two-level C++ programming
  • StreamC: transfers streams of data between main memory and the stream register file (SRF)
  • KernelC: transfers streams from the SRF to the ALU clusters
• Code is optimized to the number of ALU clusters and the size of the data
• Compiler not yet fully developed

12 RICE UNIVERSITY Contents
• Motivation
• The “Imagine” simulator
• Parallel algorithms for estimation/detection/decoding
• Performance comparisons and results

13 RICE UNIVERSITY Typical workload representation (Base-station)
Wireless LAN
• Equalization?
• FFT
• Viterbi decoding
W-CDMA
• Multiuser channel estimation
• Multiuser detection
• Viterbi decoding
Advanced receiver schemes
• Turbo decoding
• Multiple antenna systems (MIMO)

14 RICE UNIVERSITY Parallel estimation/detection/decoding
• Multiuser estimation
  • Replaced matrix inversion by gradient descent
• Multiuser detection
  • Parallel Interference Cancellation (PIC)
  • Pipelined algorithm that avoids block-based detection
• Viterbi decoding
  • Trellis structures suited for decoding
  • Register exchange for survivor memory
  • No traceback latency
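The idea of replacing a matrix inversion with gradient descent can be sketched as follows. This is a minimal illustration, not the estimation kernel from the talk: it assumes the estimate solves a linear system R x = b with a symmetric positive definite R, and the names `gradient_descent_solve`, `step` and `iterations`, as well as the 2x2 example values, are made up for the sketch.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def gradient_descent_solve(R, b, step, iterations):
    """Approximate the solution of R x = b without inverting R.

    Each iteration moves x against the gradient (R x - b) of the
    quadratic cost 0.5 * x^T R x - b^T x. Convergence assumes R is
    symmetric positive definite and step < 2 / (largest eigenvalue of R).
    """
    x = [0.0] * len(b)
    for _ in range(iterations):
        gradient = [ri - bi for ri, bi in zip(mat_vec(R, x), b)]
        x = [xi - step * gi for xi, gi in zip(x, gradient)]
    return x


# Hypothetical 2x2 system; the exact solution is x = [1/11, 7/11].
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
estimate = gradient_descent_solve(R, b, step=0.2, iterations=100)
```

Each iteration needs only multiply-accumulate operations, which is presumably why an iterative update maps more easily onto adder and multiplier units than a direct inversion would.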

15 RICE UNIVERSITY Estimation/Detection (64,32 sizes)
• Multiuser Estimation: Kernels 1, 2, 3
• Massaging matrices for detection: Kernels 4, 5
• Multiuser Detection: Kernels 6, 7
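The multiuser detection step is named earlier as Parallel Interference Cancellation (PIC). The sketch below shows the textbook form of PIC for a synchronous system, not the pipelined variant from the talk: every user's bit is re-decided in parallel after subtracting the interference reconstructed from the other users' previous decisions. The function name `pic_detect`, the cross-correlation matrix `R`, the amplitudes and the two-user example are all assumptions made for illustration.

```python
def sign(x):
    """Hard decision: map a soft value to +1 or -1."""
    return 1 if x >= 0 else -1


def pic_detect(y, R, amplitudes, stages):
    """Parallel interference cancellation on matched-filter outputs y.

    y[k] is assumed to equal sum_j R[k][j] * amplitudes[j] * bit[j] + noise,
    with R[k][k] == 1. Stage 0 is the plain matched-filter decision.
    """
    n_users = len(y)
    bits = [sign(v) for v in y]
    for _ in range(stages):
        new_bits = []
        for k in range(n_users):
            # Interference on user k, rebuilt from the previous stage's decisions.
            interference = sum(
                R[k][j] * amplitudes[j] * bits[j]
                for j in range(n_users)
                if j != k
            )
            new_bits.append(sign(y[k] - interference))
        bits = new_bits  # all users are updated together (in parallel)
    return bits


# Hypothetical two-user case with a strong interferer (near-far situation).
R = [[1.0, 0.5], [0.5, 1.0]]
amplitudes = [1.0, 3.0]
sent = [1, -1]
y = [sum(R[k][j] * amplitudes[j] * sent[j] for j in range(2)) for k in range(2)]

matched_filter_only = pic_detect(y, R, amplitudes, stages=0)
after_one_stage = pic_detect(y, R, amplitudes, stages=1)
```

In this noiseless example the matched filter alone gets the weak user's bit wrong, and one cancellation stage corrects it. Because the inner loop over users has no dependence between users within a stage, it is a natural candidate for spreading across parallel clusters.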

16 RICE UNIVERSITY [Figure: trellis for rate ½ code with K = 5, 16 states X(0) to X(15). Panels: a. Unsuitable Trellis; b. Suitable Trellis; c. Shuffled Suitable Trellis]
Upper bound on parallel clusters for good FU utilization: N/2^k
Maximum 8 parallel units for rate ½ with 16 states

17 RICE UNIVERSITY Trellis structures for parallel Viterbi
Definition: Let each present state $p \in [1..N]$ lead to a set of next states $\{m_p\}$, i.e. $p \to \{m_p\}$, where $\{m_p\}$ has $2^k$ elements and $k$ is the number of inputs at the encoder. If $\forall\, i,j \in [1..N]$ either $\{m_i\} = \{m_j\}$ or $\{m_i\} \cap \{m_j\} = \emptyset$, the trellis is denoted a “separable” or a “fast” trellis.
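The definition can be checked mechanically. The sketch below assumes a simple shift-register trellis in which the next state is formed by shifting k input bits into the state; that model, and the names `next_states`, `is_separable` and `parallel_groups`, are illustrative assumptions rather than the encoders used in the talk. States are numbered from 0 here instead of 1.

```python
def next_states(state, n_states, k):
    """Next-state set of an assumed shift-register trellis with k input bits."""
    return frozenset(((state << k) | bits) % n_states for bits in range(1 << k))


def is_separable(next_sets):
    """True if every pair of next-state sets is either identical or disjoint."""
    sets = list(next_sets)
    return all(a == b or a.isdisjoint(b) for a in sets for b in sets)


def parallel_groups(next_sets):
    """Number of distinct next-state sets, i.e. independent groups of states."""
    return len(set(next_sets))


N = 16
one_input = [next_states(s, N, k=1) for s in range(N)]
two_inputs = [next_states(s, N, k=2) for s in range(N)]

# A counter-example: neighbouring states share only one of their two successors.
overlapping = [frozenset({s, (s + 1) % N}) for s in range(N)]

print(is_separable(one_input), parallel_groups(one_input))    # True 8
print(is_separable(two_inputs), parallel_groups(two_inputs))  # True 4
print(is_separable(overlapping))                              # False
```

For this model the number of independent groups equals N/2^k, which matches the 8 and 4 parallel units quoted for 16 states with k = 1 and k = 2.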

18 RICE UNIVERSITY [Figure: trellis for rate 2/3 code with K = 5, 16 states X(0) to X(15) mapped to Y(0) to Y(15). Panels: a. Shuffled Suitable Trellis for ‘k=2’; b. Rearranged Shuffled Suitable Trellis for ‘k=2’]
Upper bound on parallel clusters for good FU utilization: N/2^k
Maximum 4 parallel units for rate 2/3 with 16 states (having 8 will involve interprocessor communication overhead)

19 RICE UNIVERSITY Survivor Management in Viterbi
• Two techniques
  • Traceback: commonly used
  • Register Exchange
• Traceback is good for VLSI architectures, where the information bits can be decoded sequentially by proper survivor memory addressing
• Drawback: sequential, with additional latency

20 RICE UNIVERSITY Register exchange for decoding
• The register for a given node at a given time contains the information bits associated with the surviving partial path that ends in that state
• Survivors are calculated in conjunction with path metrics
• Latency in conventional traceback is avoided
• Higher power consumption, as the entire survivor memory contents are updated for all states for each bit
• Suited to a parallel programmable implementation, as storing bits in a register for traceback touches the previous survivors anyway
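A compact sketch of register exchange is given below. It is an illustration under stated assumptions, not the decoder from the talk: it uses a small rate ½ code with K = 3 and generators 7 and 5 (octal) instead of the K = 5 codes on the slides, hard-decision Hamming metrics, and Python lists standing in for the survivor registers. All function names are hypothetical.

```python
def parity(x):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(x).count("1") & 1


def encode(bits, gens, K):
    """Rate 1/len(gens) convolutional encoder starting in state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state  # newest bit in the top position
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg >> 1
    return out


def viterbi_register_exchange(symbols, gens, K):
    """Hard-decision Viterbi decoding with register-exchange survivors."""
    n_states = 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)     # encoder assumed to start in state 0
    survivors = [[] for _ in range(n_states)]  # one "register" per state
    for sym in symbols:
        new_metrics = [inf] * n_states
        new_survivors = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                nxt = reg >> 1
                expected = tuple(parity(reg & g) for g in gens)
                cost = metrics[state] + sum(e != r for e, r in zip(expected, sym))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Register exchange: copy the winning predecessor's register
                    # and append the decoded bit, alongside the metric update.
                    new_survivors[nxt] = survivors[state] + [b]
        metrics, survivors = new_metrics, new_survivors
    best = min(range(n_states), key=lambda s: metrics[s])
    return survivors[best]  # decoded bits are read out directly, no traceback


GENS, K = (0b111, 0b101), 3
message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
symbols = encode(message, GENS, K)
decoded = viterbi_register_exchange(symbols, GENS, K)
```

The copy inside the inner loop is the cost noted above: every state's register is rewritten at every step. In exchange, the decoded sequence is available as soon as the last metric is computed.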

21 RICE UNIVERSITY Contents
• Motivation
• The “Imagine” simulator
• Parallel algorithms for estimation/detection/decoding
• Performance comparisons and results

22 RICE UNIVERSITY Lower bounds on + and *
[Figure: Adders/Multipliers required to meet real-time Estimation, Detection and Decoding in a W-CDMA multiuser system. x-axis: Number of users (0 to 300); y-axis: log scale (10^0 to 10^3); curves: Add, Mul; annotation: DATA RATES]
• SLOW FADING (estimation every 1000 bits)
• MEDIUM FADING (estimation every 100 bits)
• FAST FADING (estimation every 10 bits)

23 RICE UNIVERSITY Kernel 2 (mmult) for 3 +, 2 *
[Figure: functional-unit schedule over TIME with the LOOP marked. Legend: Communication (waiting for input); FU unavailable (input ready but FU busy)]
• O(N^3) *, O(N^3) +
• Adders have limited FU utilization
• Multipliers 100% in loop
• Divider not being utilized
• Replace / with *

24 RICE UNIVERSITY Kernel 2 (mmult) for 3 +, 3 *
[Figure: functional-unit schedule over TIME]
• Better adder utilization
• Needs sufficient registers for scaling [register allocation may fail]
• Code may also need slight tuning of variables for optimization

25 RICE UNIVERSITY Kernel computational time Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles *Numbers subject to change
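The 4000-cycle figure is consistent with dividing the clock rate by the per-user bit rate. The short sketch below only computes that ratio; it assumes Kbps means 1000 bits per second, and the function name `cycles_per_bit` is made up for the illustration. Whether the budget is per user or shared across all 32 users is not stated on the slide, so the sketch does not model it.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles that elapse during one bit period."""
    return clock_hz / bit_rate_bps


# 500 MHz clock, 128 Kbps per user (assuming 1 Kbps = 1000 bit/s).
budget = cycles_per_bit(500_000_000, 128_000)
print(budget)  # 3906.25, i.e. roughly the 4000 cycles quoted on the slide
```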

26 RICE UNIVERSITY Communication overhead
[Figure legend: Kernels (Micro-controller executing); Memory operations; Initialization; Idle time between kernels]

27 RICE UNIVERSITY Comparisons with TI C6701 DSPs
[Figure: Execution time (in seconds, log scale 10^-6 to 10^-2) versus Users (0 to 35). Curves: Single DSP implementation; 2 DSP implementation; Our architecture based on Imagine. Reference: Target data rate - 128 Kbps/user]

28 RICE UNIVERSITY Kernel comparisons
[Figure: Execution time per kernel. Series: IMAGINE; TI C67: Internal Memory; TI C67: External Memory]

29 RICE UNIVERSITY 4Gone Conclusions
• Various programmable architectures can be investigated QUICKLY for 4G systems, depending on algorithms, time, area and power constraints
• Enormous potential for 4G system prototyping
• Programmable baseband processor design with broad system functionality, flexibility and low power consumption that allows a smooth and fast transition from 2G to 3G to 4G systems

30 RICE UNIVERSITY Future work
• Investigating bottlenecks, functional unit design and other innovations needed to attain real-time
• Power and area constraints
• Scalability with data rates
• Handset algorithms
• The insights gained from the design can also be applied to DSPs and other processors

