RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department Ph.D.

RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department sridhar@rice.edu Ph.D. Thesis Proposal

RICE UNIVERSITY A proposed cellular base-station  High data rates in emerging wireless systems (Mbps)  Sophisticated algorithms for high spectral efficiency  Multiuser estimation, multiuser detection, Viterbi  What does it take to build it?

RICE UNIVERSITY Designing wireless systems  Traditional design - Area-Time-Power - ASICs  high data rates, low power, minimum area  Flexibility became more important - FPGAs, DSPs  faster algorithm evaluation and prototyping  Heterogeneous solutions -DSPs+FPGAs+ASICs  Task-partitioning & hardware-software co-design  New challenges  Flexibility in algorithms -- multi-standard support  rapidly evaluate and implement new algorithms  adapting architectures to ever increasing data rates

RICE UNIVERSITY Emerging wireless systems  GOPS of computations in emerging base-stations  Example real-time target : 32 users at 128 Kbps each

RICE UNIVERSITY Part I - Build real-time base-stations  Current single processor DSPs not powerful enough  Algorithms well understood at data-flow level  I can design real-time architectures in VLSI  What do I lose when I make the architecture programmable? Can I quantify the loss?  Can I solve the existing bottlenecks, if any?  Build a real-time “efficient” communications processor

RICE UNIVERSITY Part II - Robust to future updates  Data rates will increase  Decoding constraint lengths may increase  Number of users, spreading length may increase  Algorithms will change - with MORE computations  System complexity WILL increase.  I want to test new algorithms and changes quickly  I want my design to adapt quickly to those changes

RICE UNIVERSITY Part III - Extensions  W-LAN base-station  Different algorithms  FFT, Viterbi, FIR - main computational blocks  100+ Mbps data rates  Handset  Different algorithms - simpler  Power and area - more critical  How will my design extend to these wireless systems?

RICE UNIVERSITY Expected thesis contributions  A programmable processor design for communications  adapts to real-time requirements  “efficient” in terms of #functional units, their utilization and memory stalls  Hardware-software co-design framework to rapidly evaluate new algorithms  and rapidly implement them too.  Design limits and extensions to other wireless systems

RICE UNIVERSITY Outline  Motivation  Thesis proposal  Background  Initial results  Work proposed for thesis completion  Comparisons with existing work

RICE UNIVERSITY I - Efficient communications processor  Propose using an architecture simulator to design the communications processor  high performance streaming processor simulator based on the “Imagine” architecture  Streaming processor because  GPP architectures not good for media, wireless  streaming processor shown to be good for media applications such as FFT and FIR.  Media and communication algorithms similar  Media architectures popular --> wireless architectures?

RICE UNIVERSITY I - Simulator functionality  Simulator  cycle accurate  allows us to investigate bottlenecks  number and type of functional units flexible  gives functional unit utilization  Can propose and evaluate solutions to solve bottlenecks in the implementation

RICE UNIVERSITY New Algorithm Complexity Parallelism Fixed-point Library of existing algorithms Algorithm selection for end-to-end system Existing architecture parameters Compile Assembly Code Real-time/area/power satisfied? Done 54 3 2 1 Run on existing architecture Architecture design scaling (# Functional units, # clusters) New architecture parameters Operations count Real-time/ (area/power) requirements Compile Assembly Code Fabrication feasible? New Architecture YES Design failed Re-design algorithms/ architecture 12 11 109 8 6 7 NO YES II - Robust to future updates

RICE UNIVERSITY II - Propose this methodology  New algorithms in high level language  Easy to evaluate functionality and can use the same algorithms for actual design  If new algorithms are similar in algorithm and complexity, still real-time with no changes in architecture  If not  An automated tool scales the architecture appropriately  The proposed scaling algorithm for the tool targets real- time and FU utilization simultaneously while scaling.

RICE UNIVERSITY Outline  Motivation  Thesis proposal  Background -- Architecture and Algorithms  Initial results  Work proposed for thesis completion  Comparisons with existing work

RICE UNIVERSITY The Imagine architecture Figure borrowed from S. Rixner

RICE UNIVERSITY Arithmetic clusters  VLIW control  3 adders, 2 multipliers, 1 divider  Scratch-pad and communication unit  Distributed register files Figure borrowed from S. Rixner

RICE UNIVERSITY Stream programming  StreamC  Executes on host  C++  Controls stream transfers between main memory and SRF void main() { Stream a(256); Stream b(256); Stream c(256); Stream d(1024);... example1(a, b, c); example2(c, d);... }  KernelC  Executes on clusters  C-like Syntax  Kernel computation  Compiled by iscd KERNEL example1(istream a, istream b, ostream c) { loop_stream(a) { int ai, bi, ci; a >> ai; b >> bi; ci = ai * 2 + bi * 3; c << ci; } Figure borrowed from S. Rixner

RICE UNIVERSITY Parallel W-CDMA Estimation/Detection/Decoding  Multiuser estimation  replaced matrix inversion by gradient descent  Multiuser detection  Parallel Interference Cancellation (PIC)  Pipelined algorithm that avoids block-based detection  Viterbi decoding  Trellis structures suited for decoding  Register exchange for survivor memory  No traceback latency

RICE UNIVERSITY Estimation/Detection (64,32 sizes) Multiuser Estimation Multiuser Detection Prepare Matrices for Detection

RICE UNIVERSITY X(0) X(2) X(4) X(6) X(8) X(10) X(12) X(14) X(1) X(3) X(5) X(7) X(9) X(11) X(13) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) b. Shuffled Trellisa. Trellis X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) Viterbi trellis for rate ½ code with K = 5

RICE UNIVERSITY Stream data flow Matrix transpose Viterbi kernel Matrix mult kernel Correlation update kernel Matrix mul C kernel Data rearrangement Buffer Estimation bits Detection bits Multiuser Channel Estimation Multiuser Detection Decoding Computation Communication Iteration update kernel Matched filter kernel Matrix mul L kernel PIC kernel

RICE UNIVERSITY Matrix multiplication kernel (Imagine)  32 cycle loop  Executed on all 8 clusters  Complexity  O(N 3 ) multiplies  O(N 3 ) adds  100% multiplier utilization in the loop  Divider is unnecessary! Inner Loop Instruction Communication (waiting for input) FU unavailable (input ready but FU busy) ADD0ADD1ADD2MUL0MUL1DIV0

RICE UNIVERSITY Replace divider with multiplier  22 cycle loop  Executed on all 8 clusters  97% multiplier utilization in the loop  85% adder utilization in the loop  Changing functional units  Supported by simulator/compiler  Architecturally realistic Instruction ADD0ADD1ADD2MUL0MUL1MUL2

RICE UNIVERSITY Definition of “efficiency”  Idle time includes time spent by functional units not doing any computations and the time spent between kernel operations.  Alternate metric: Idle time...any USEFUL computations....  Some computations unavoidable in a programmable architecture ( i = i +1)  Good “Efficiency” means min. memory stalls and high FU utilization

RICE UNIVERSITY Kernel computational time

RICE UNIVERSITY Estimation and detection execution Kernel ExecutionMemory TransfersCycle Stalled waiting for data from memory Estimation Detection (10 bits)

RICE UNIVERSITY Viterbi execution Initialization Decode (32 bits) Kernel ExecutionMemory TransfersCycle

RICE UNIVERSITY Viterbi discussion  Viterbi: not enough computations for Imagine.  Significantly large communication between trellis states  If re-ordering done outside kernel,  shows up as memory stalls  If re-ordering done within kernel,  shows up as poor FU utilization  poor “efficiency” in any respect  Unsolved bottleneck at this point

RICE UNIVERSITY C6711 DSP comparisons  48X improvement for estimation, 42X for detection over DSP simulator  Approx. additional 15X improvement over actual DSP  Viterbi is actually slower than a DSP implementation

RICE UNIVERSITY Summary for proposal - Part I  Estimation, detection kernels well behaved  Stalls in estimation kernel - but normalized over detection bits.  A 3 adder, 3 multiplier per cluster configuration attained reasonable good FU on most kernels.  Decoding kernel behaves extremely poorly on this architecture.

RICE UNIVERSITY Issues in scaling architectures  Scaling parameters to meet real-time  #FU, their type and #clusters  Increasing clusters more area and energy efficient than increasing #FU as intra-cluster communication and distributing instructions high [Khailany2002]  [Khailany 2002] provide exhaustive searches for best scaling and show effects on area,energy, delay.  Do not target FU utilization and which FU’s to change  Do not show how and what to scale to meet real-time

RICE UNIVERSITY “Ad hoc” algorithm for scaling architectures  Changes #FUs as well as their type  Algorithm optimizes for functional unit utilization simultaneously (not “efficiency”)  looks at functional units that have a very high efficiency  they could be bottlenecking other functional units  add more of them to see when their utilization decreases  utilization of other units will increase  look at the functional unit with the next highest utilization.

RICE UNIVERSITY Algorithm (contd.)  Once functional units have high utilization (or, increasing them by one does not give any more benefits), look at real-time requirements  Increase cluster sizes to meet real-time requirements  The remaining factor is to be achieved by scaling functional units  Will not work if all units have poor efficiency  Good and bad

RICE UNIVERSITY Example - Matrix multiplication  Base: 3 adders (53%), 2 multipliers (91%)  bottlenecked on multipliers  multipliers =  (91/53) * 2  = 3  New: 3 adders(85%), 3 multipliers(99%)  Could have been done automatically before instead of looking at the flops computations  Better as it scales FUs such as inter-cluster communication units that are based on computations.

RICE UNIVERSITY Work proposed for completion  Detailed comparisons to other architectures  Innovations needed to solve bottlenecks  Formalize scaling architectures and methodology  Extensions to W-LAN and handsets

RICE UNIVERSITY Time-scales

RICE UNIVERSITY Approach to complete work  Comparisons:  Time-consuming but straight-forward  Innovations to solve bottlenecks:  At the algorithm level re-designing to minimize data re-arrangements  At the software level BLAS sub-routines for blocking computations  At the hardware level Shuffle networks for Viterbi trellis Instruction set extensions for Viterbi acs Permutation-based interleaved memory

RICE UNIVERSITY Approach (Contd.)  Scaling algorithm for automated tool  From “ad hoc” to “optimal”  Extensions to W-LAN base-station  Imagine already shown to be good for FFT,FIR.  Viterbi already understood well by this time  Extensions to hand-sets  Power, area more critical  Dynamic adaptation of wordlengths in FUs [Leung98]

RICE UNIVERSITY Goals and deliverables  Goals  real-time processor design for communications  methodology for fast evaluation and implementation  Deliverables  Innovations in architecture  Comparison points with existing work  Algorithm for automated tool to scale architectures  Analysis of extensions to W-LAN and handset

RICE UNIVERSITY Existing work - I  Texas Instruments approach  DSP+ASIC  C64x : Viterbi & Turbo co-processor (2000) - 4.7 Mbps  Rack of DSPs needed for our W-CDMA base-station  makes sense as a system with single user algorithms  sliding correlator, matched filter, Viterbi - linear with users  DSP per user -- processing “independent”  Does not work with multiuser algorithms  exchange information between multiple DSPs  communication bottlenecks

RICE UNIVERSITY Existing work - I  Reconfigurable architectures  Chameleon (2001), PACT(2000)  RaPiD(2001), PipeRench(1999), Stallion (1998)  Adapt architectures dynamically with computations  Promising but (personal view)  harder to program than DSPs  needs detailed knowledge of hardware design  difficult to design big systems  not as robust to future systems as my proposed solution

RICE UNIVERSITY Existing work - II  Embedded architecture design and automation (CAD Tools)  PICO - VLIW (HP Labs) (April 2001)  Aachen - Blume (July 2002)  University of Austin - Lapinskii (August 2002)  Design for heterogeneous systems on a chip  ASIC/DSP/FPGA system on a chip  My target: A homogeneous, adaptable and flexible design

RICE UNIVERSITY Conclusions  Flexibility, fast evaluation and adapting architectures to ever increasing real-time are new challenges in designing wireless systems in addition to previous area-time-power efficiency.  My proposal tries to address these goals by  designing a programmable and “efficient” processor for communications  providing a hardware-software co-design methodology with a fast design cycle time between evaluation to a real-time implementation.

RICE UNIVERSITY Other research contributions  Multiuser estimation and detection algorithms  Implementation issues  Parallel, iterative, fixed point, pipelining  Task-partitioning estimation-detection on 2 DSPs  On-line arithmetic and its application to detection  Programmable processor design for W-LAN base- station [Nokia Research Center]

RICE UNIVERSITY Survivor management in Viterbi decoding  Two techniques  Traceback – commonly used  Register exchange  Traceback is simpler  Less area in VLSI architectures  Drawback: Sequential and additional latency  Register exchange is faster  Parallel updates  Packing decoded bits in the register needs to access the entire register

RICE UNIVERSITY Variable computations/ Non-2^  Variable computations with fading  Worst case design for different fading models?  If fewer computations,  Dynamic Voltage /Frequency scaling?  Users not a power of 2 then,  not all kernels will scale down  dummy data?  spare codes available that some users can use to increase data rates Multiple antennas, multi-rate systems

RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department Ph.D.

Similar presentations

Presentation on theme: "RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department Ph.D."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department Ph.D.

Similar presentations

Presentation on theme: "RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department Ph.D."— Presentation transcript:

Similar presentations

About project

Feedback