Presentation is loading. Please wait.

Presentation is loading. Please wait.

A programmable communications processor for future wireless systems

Similar presentations


Presentation on theme: "A programmable communications processor for future wireless systems"— Presentation transcript:

1 A programmable communications processor for future wireless systems
Sridhar Rajagopal Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang This work has been supported by Nokia, TI, TATP and NSF

2 Overview of research at Rice
Center for Multimedia Communications Behnaam Aazhang (wireless communications) Joseph R. Cavallaro (VLSI signal processing) Computer Architecture Scott Rixner (Microprocessor architecture) Vijay Pai (Simulators, Network Processors)

3 Motivation Mobile: Switch between standards and between parameters
Base-station: varying no. of users with different parameters Programmability - flexibility is good Wireless Mobile device Baseband Programmable Communications Processor RF Unit A/D D/A

4 Motivation GPP DSP FPGA VLSI Performance Flexibility

5 Lower bounds on + and * for a 500 MHz system
Estimation, Detection and Decoding in a W-CDMA multiuser system 3 10 FAST FADING (estimation every 10 bits) MEDIUM FADING (estimation every 100 bits) 2 10 Adders/Multipliers required to meet real-time SLOW FADING (estimation every 1000 bits) 1 10 DATA RATES Add Mul 10 50 100 150 200 250 300 Number of users

6 The Problem Algorithms well understood at data-flow level
Can design real-time systems in VLSI. Pushing implementation higher in the chain Current DSPs not powerful enough for our application Use an architecture simulator to design our own

7 Current solutions to meet real-time
Proposed solution < x cm Programmable Processor for 4G wireless systems < x cm Future wireless architectures x = 2.5 (W-CDMA BS) x = (W-LAN BS) x = 1.5 (Mobile Handset) Current solutions to meet real-time (Racks of DSPs)

8 Ph.D. Thesis Outline Algorithm (in Matlab) New architecture design
Complexity ? Parallelize ? Fixed point ? Compiler Operation Count Parameter-free Architecture Design New Algorithm Characteristics? Real-Time (Area/Power) Requirements Architecture Synthesizer Processor Architecture Parameters (# Functional units, # registers, # memory ....) Architecture Code New architecture design Future Work

9 Advantages of this solution
Fast and smooth transition to future standards that simultaneously meets real-time and other constraints Avoids re-designing the system from scratch Joint algorithm–architecture hardware-software co-design Matlab code can be re-used when new standards are being designed. Tries to account for data rate increases and future algorithm changes

10 Past research contributions
Algorithms Multiuser channel estimation Multiuser detection Distant Past VLSI Task-partitioning Parallelism Pipelining System Design FPGA Recent Past Conventional arithmetic On-line arithmetic DSP Recent and Near Future Architecture innovations Functional unit design and usage IMAGINE

11 Contents Motivation Parallel algorithms for estimation/detection/decoding The “Imagine” simulator Performance comparisons and results

12 Typical workload representation (Base-station)
Equalization? FFT Viterbi decoding Multiuser channel estimation Multiuser detection Turbo decoding Multiple antenna systems (MIMO) Wireless LAN W-CDMA Advanced receiver schemes

13 Parallel estimation/detection/decoding
Multiuser estimation replaced matrix inversion by gradient descent Multiuser detection Parallel Interference Cancellation (PIC) Pipelined algorithm that avoids block-based detection Viterbi decoding Trellis structures suited for decoding Register exchange for survivor memory No traceback latency

14 Estimation/Detection (64,32 sizes)
Multiuser Estimation Kernel 1,2,3 Massaging matrices for detection Kernel 4, 5 Multiuser Detection Kernel 6, 7

15 Trellis for rate ½ code with K = 5
X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) a. Unsuitable Trellis b. Suitable Trellis c. Shuffled Suitable Trellis Upper bound on parallel clusters for good FU utilization : N/2k Maximum 8 parallel units for rate ½ with 16 states

16 Survivor Management in Viterbi
Two techniques Traceback : Commonly used Register Exchange Traceback is good for VLSI architectures Drawback: Sequential and additional latency Register exchange is good for programmable solutions Parallel updates Packing decoded bits in the register needs to access the entire register

17 Contents Motivation Parallel algorithms for estimation/detection/decoding The “Imagine” simulator Performance comparisons and results

18 The IMAGINE architecture
Stream Register File Network Interface Stream Controller Imagine Stream Processor Host Processor ALU Cluster 0 ALU Cluster 1 ALU Cluster 2 ALU Cluster 3 ALU Cluster 4 ALU Cluster 5 ALU Cluster 6 ALU Cluster 7 SDRAM Streaming Memory System Microcontroller

19 Why IMAGINE simulator? RSIM, SimpleScalar: GPP simulators
Great for media processing algorithms Has a VLIW-based cluster -- DSP comparisons A good base architecture : 1024-pt FFT

20 Simulator knobs that we can turn
Cycle-accurate simulator Varying number of Functional units and their design Varying memory, register sizes Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead … Almost anything can be changed, some changes easier than others!

21 Programming Imagine StreamC:
2 level C++ programming StreamC: transfers streams of data between main memory and stream register file (SRF) KernelC: transfers streams from the SRF to the ALU clusters Code optimized to the number of ALU clusters and the size of the data

22 Contents Motivation Parallel algorithms for estimation/detection/decoding The “Imagine” simulator Performance comparisons and results

23 Communication (waiting for input) FU unavailable (input ready but FU busy) Kernel 2 (mmult) for 3 +,2* Adders have limited FU utilization O(N3) *, O(N3) + Multipliers 100% in loop Divider not being utilized Replace / with * TIME LOOP

24 Kernel 2 (mmult)for 3 +,3* better adder utilization needs sufficient registers for scaling [register allocation may fail] code may also need slight tuning of variables for optimization TIME

25 Kernel computational time
Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles

26 Communication overhead
Kernels (Micro-controller executing) Memory operations Initialization Idle time between kernels

27 Comparisons with TI C6701 DSPs
5 10 15 20 25 30 35 -6 -5 -4 -3 -2 Execution time (in seconds) Users Single DSP implementation 2 DSP implementation Target data rate Kbps/user Our architecture based on Imagine I Efficiency = ? IMAGINE with increasing functional units 1 DSP 2 DSPs

28 Future work Real-time design possible with larger number of functional units but efficiency is the key Eliminating communication stalls between kernels Support for matrix transposes and bit-level operations Power and area constraints Scalability with data rates – Boundaries of architecture Handset algorithms

29 Conclusions Various programmable architectures can be investigated and implemented for future systems depending on algorithms, time, area and power constraints QUICKLY The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power.


Download ppt "A programmable communications processor for future wireless systems"

Similar presentations


Ads by Google