Download presentation
Presentation is loading. Please wait.
Published byBetty Merilyn Carson Modified over 9 years ago
1
RICE UNIVERSITY A real-time baseband communications processor for high data rate wireless systems Sridhar Rajagopal ECE Department sridhar@rice.edu Ph.D. Thesis Proposal
2
RICE UNIVERSITY A proposed cellular base-station High data rates in emerging wireless systems (Mbps) Sophisticated algorithms for high spectral efficiency Multiuser estimation, multiuser detection, Viterbi What does it take to build it?
3
RICE UNIVERSITY Designing wireless systems Traditional design - Area-Time-Power - ASICs high data rates, low power, minimum area Flexibility became more important - FPGAs, DSPs faster algorithm evaluation and prototyping Heterogeneous solutions -DSPs+FPGAs+ASICs Task-partitioning & hardware-software co-design New challenges Flexibility in algorithms -- multi-standard support rapidly evaluate and implement new algorithms adapting architectures to ever increasing data rates
4
RICE UNIVERSITY Emerging wireless systems GOPS of computations in emerging base-stations Example real-time target : 32 users at 128 Kbps each
5
RICE UNIVERSITY Part I - Build real-time base-stations Current single processor DSPs not powerful enough Algorithms well understood at data-flow level I can design real-time architectures in VLSI What do I lose when I make the architecture programmable? Can I quantify the loss? Can I solve the existing bottlenecks, if any? Build a real-time “efficient” communications processor
6
RICE UNIVERSITY Part II - Robust to future updates Data rates will increase Decoding constraint lengths may increase Number of users, spreading length may increase Algorithms will change - with MORE computations System complexity WILL increase. I want to test new algorithms and changes quickly I want my design to adapt quickly to those changes
7
RICE UNIVERSITY Part III - Extensions W-LAN base-station Different algorithms FFT, Viterbi, FIR - main computational blocks 100+ Mbps data rates Handset Different algorithms - simpler Power and area - more critical How will my design extend to these wireless systems?
8
RICE UNIVERSITY Expected thesis contributions A programmable processor design for communications adapts to real-time requirements “efficient” in terms of #functional units, their utilization and memory stalls Hardware-software co-design framework to rapidly evaluate new algorithms and rapidly implement them too. Design limits and extensions to other wireless systems
9
RICE UNIVERSITY Outline Motivation Thesis proposal Background Initial results Work proposed for thesis completion Comparisons with existing work
10
RICE UNIVERSITY I - Efficient communications processor Propose using an architecture simulator to design the communications processor high performance streaming processor simulator based on the “Imagine” architecture Streaming processor because GPP architectures not good for media, wireless streaming processor shown to be good for media applications such as FFT and FIR. Media and communication algorithms similar Media architectures popular --> wireless architectures?
11
RICE UNIVERSITY I - Simulator functionality Simulator cycle accurate allows us to investigate bottlenecks number and type of functional units flexible gives functional unit utilization Can propose and evaluate solutions to solve bottlenecks in the implementation
12
RICE UNIVERSITY New Algorithm Complexity Parallelism Fixed-point Library of existing algorithms Algorithm selection for end-to-end system Existing architecture parameters Compile Assembly Code Real-time/area/power satisfied? Done 54 3 2 1 Run on existing architecture Architecture design scaling (# Functional units, # clusters) New architecture parameters Operations count Real-time/ (area/power) requirements Compile Assembly Code Fabrication feasible? New Architecture YES Design failed Re-design algorithms/ architecture 12 11 109 8 6 7 NO YES II - Robust to future updates
13
RICE UNIVERSITY II - Propose this methodology New algorithms in high level language Easy to evaluate functionality and can use the same algorithms for actual design If new algorithms are similar in algorithm and complexity, still real-time with no changes in architecture If not An automated tool scales the architecture appropriately The proposed scaling algorithm for the tool targets real- time and FU utilization simultaneously while scaling.
14
RICE UNIVERSITY Outline Motivation Thesis proposal Background -- Architecture and Algorithms Initial results Work proposed for thesis completion Comparisons with existing work
15
RICE UNIVERSITY The Imagine architecture Figure borrowed from S. Rixner
16
RICE UNIVERSITY Arithmetic clusters VLIW control 3 adders, 2 multipliers, 1 divider Scratch-pad and communication unit Distributed register files Figure borrowed from S. Rixner
17
RICE UNIVERSITY Stream programming StreamC Executes on host C++ Controls stream transfers between main memory and SRF void main() { Stream a(256); Stream b(256); Stream c(256); Stream d(1024);... example1(a, b, c); example2(c, d);... } KernelC Executes on clusters C-like Syntax Kernel computation Compiled by iscd KERNEL example1(istream a, istream b, ostream c) { loop_stream(a) { int ai, bi, ci; a >> ai; b >> bi; ci = ai * 2 + bi * 3; c << ci; } Figure borrowed from S. Rixner
18
RICE UNIVERSITY Parallel W-CDMA Estimation/Detection/Decoding Multiuser estimation replaced matrix inversion by gradient descent Multiuser detection Parallel Interference Cancellation (PIC) Pipelined algorithm that avoids block-based detection Viterbi decoding Trellis structures suited for decoding Register exchange for survivor memory No traceback latency
19
RICE UNIVERSITY Estimation/Detection (64,32 sizes) Multiuser Estimation Multiuser Detection Prepare Matrices for Detection
20
RICE UNIVERSITY X(0) X(2) X(4) X(6) X(8) X(10) X(12) X(14) X(1) X(3) X(5) X(7) X(9) X(11) X(13) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) b. Shuffled Trellisa. Trellis X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) Viterbi trellis for rate ½ code with K = 5
21
RICE UNIVERSITY Outline Motivation Thesis proposal Background Initial results Work proposed for thesis completion Comparisons with existing work
22
RICE UNIVERSITY Stream data flow Matrix transpose Viterbi kernel Matrix mult kernel Correlation update kernel Matrix mul C kernel Data rearrangement Buffer Estimation bits Detection bits Multiuser Channel Estimation Multiuser Detection Decoding Computation Communication Iteration update kernel Matched filter kernel Matrix mul L kernel PIC kernel
23
RICE UNIVERSITY Matrix multiplication kernel (Imagine) 32 cycle loop Executed on all 8 clusters Complexity O(N 3 ) multiplies O(N 3 ) adds 100% multiplier utilization in the loop Divider is unnecessary! Inner Loop Instruction Communication (waiting for input) FU unavailable (input ready but FU busy) ADD0ADD1ADD2MUL0MUL1DIV0
24
RICE UNIVERSITY Replace divider with multiplier 22 cycle loop Executed on all 8 clusters 97% multiplier utilization in the loop 85% adder utilization in the loop Changing functional units Supported by simulator/compiler Architecturally realistic Instruction ADD0ADD1ADD2MUL0MUL1MUL2
25
RICE UNIVERSITY Definition of “efficiency” Idle time includes time spent by functional units not doing any computations and the time spent between kernel operations. Alternate metric: Idle time...any USEFUL computations.... Some computations unavoidable in a programmable architecture ( i = i +1) Good “Efficiency” means min. memory stalls and high FU utilization
26
RICE UNIVERSITY Kernel computational time
27
RICE UNIVERSITY Estimation and detection execution Kernel ExecutionMemory TransfersCycle Stalled waiting for data from memory Estimation Detection (10 bits)
28
RICE UNIVERSITY Viterbi execution Initialization Decode (32 bits) Kernel ExecutionMemory TransfersCycle
29
RICE UNIVERSITY Viterbi discussion Viterbi: not enough computations for Imagine. Significantly large communication between trellis states If re-ordering done outside kernel, shows up as memory stalls If re-ordering done within kernel, shows up as poor FU utilization poor “efficiency” in any respect Unsolved bottleneck at this point
30
RICE UNIVERSITY C6711 DSP comparisons 48X improvement for estimation, 42X for detection over DSP simulator Approx. additional 15X improvement over actual DSP Viterbi is actually slower than a DSP implementation
31
RICE UNIVERSITY Summary for proposal - Part I Estimation, detection kernels well behaved Stalls in estimation kernel - but normalized over detection bits. A 3 adder, 3 multiplier per cluster configuration attained reasonable good FU on most kernels. Decoding kernel behaves extremely poorly on this architecture.
32
RICE UNIVERSITY Issues in scaling architectures Scaling parameters to meet real-time #FU, their type and #clusters Increasing clusters more area and energy efficient than increasing #FU as intra-cluster communication and distributing instructions high [Khailany2002] [Khailany 2002] provide exhaustive searches for best scaling and show effects on area,energy, delay. Do not target FU utilization and which FU’s to change Do not show how and what to scale to meet real-time
33
RICE UNIVERSITY “Ad hoc” algorithm for scaling architectures Changes #FUs as well as their type Algorithm optimizes for functional unit utilization simultaneously (not “efficiency”) looks at functional units that have a very high efficiency they could be bottlenecking other functional units add more of them to see when their utilization decreases utilization of other units will increase look at the functional unit with the next highest utilization.
34
RICE UNIVERSITY Algorithm (contd.) Once functional units have high utilization (or, increasing them by one does not give any more benefits), look at real-time requirements Increase cluster sizes to meet real-time requirements The remaining factor is to be achieved by scaling functional units Will not work if all units have poor efficiency Good and bad
35
RICE UNIVERSITY Example - Matrix multiplication Base: 3 adders (53%), 2 multipliers (91%) bottlenecked on multipliers multipliers = (91/53) * 2 = 3 New: 3 adders(85%), 3 multipliers(99%) Could have been done automatically before instead of looking at the flops computations Better as it scales FUs such as inter-cluster communication units that are based on computations.
36
RICE UNIVERSITY Outline Motivation Thesis proposal Background Initial results Work proposed for thesis completion Comparisons with existing work
37
RICE UNIVERSITY Work proposed for completion Detailed comparisons to other architectures Innovations needed to solve bottlenecks Formalize scaling architectures and methodology Extensions to W-LAN and handsets
38
RICE UNIVERSITY Time-scales
39
RICE UNIVERSITY Approach to complete work Comparisons: Time-consuming but straight-forward Innovations to solve bottlenecks: At the algorithm level re-designing to minimize data re-arrangements At the software level BLAS sub-routines for blocking computations At the hardware level Shuffle networks for Viterbi trellis Instruction set extensions for Viterbi acs Permutation-based interleaved memory
40
RICE UNIVERSITY Approach (Contd.) Scaling algorithm for automated tool From “ad hoc” to “optimal” Extensions to W-LAN base-station Imagine already shown to be good for FFT,FIR. Viterbi already understood well by this time Extensions to hand-sets Power, area more critical Dynamic adaptation of wordlengths in FUs [Leung98]
41
RICE UNIVERSITY Goals and deliverables Goals real-time processor design for communications methodology for fast evaluation and implementation Deliverables Innovations in architecture Comparison points with existing work Algorithm for automated tool to scale architectures Analysis of extensions to W-LAN and handset
42
RICE UNIVERSITY Outline Motivation Thesis proposal Background Initial results Work proposed for thesis completion Comparisons with existing work
43
RICE UNIVERSITY Existing work - I Texas Instruments approach DSP+ASIC C64x : Viterbi & Turbo co-processor (2000) - 4.7 Mbps Rack of DSPs needed for our W-CDMA base-station makes sense as a system with single user algorithms sliding correlator, matched filter, Viterbi - linear with users DSP per user -- processing “independent” Does not work with multiuser algorithms exchange information between multiple DSPs communication bottlenecks
44
RICE UNIVERSITY Existing work - I Reconfigurable architectures Chameleon (2001), PACT(2000) RaPiD(2001), PipeRench(1999), Stallion (1998) Adapt architectures dynamically with computations Promising but (personal view) harder to program than DSPs needs detailed knowledge of hardware design difficult to design big systems not as robust to future systems as my proposed solution
45
RICE UNIVERSITY Existing work - II Embedded architecture design and automation (CAD Tools) PICO - VLIW (HP Labs) (April 2001) Aachen - Blume (July 2002) University of Austin - Lapinskii (August 2002) Design for heterogeneous systems on a chip ASIC/DSP/FPGA system on a chip My target: A homogeneous, adaptable and flexible design
46
RICE UNIVERSITY Conclusions Flexibility, fast evaluation and adapting architectures to ever increasing real-time are new challenges in designing wireless systems in addition to previous area-time-power efficiency. My proposal tries to address these goals by designing a programmable and “efficient” processor for communications providing a hardware-software co-design methodology with a fast design cycle time between evaluation to a real-time implementation.
47
RICE UNIVERSITY Other research contributions Multiuser estimation and detection algorithms Implementation issues Parallel, iterative, fixed point, pipelining Task-partitioning estimation-detection on 2 DSPs On-line arithmetic and its application to detection Programmable processor design for W-LAN base- station [Nokia Research Center]
48
RICE UNIVERSITY Survivor management in Viterbi decoding Two techniques Traceback – commonly used Register exchange Traceback is simpler Less area in VLSI architectures Drawback: Sequential and additional latency Register exchange is faster Parallel updates Packing decoded bits in the register needs to access the entire register
49
RICE UNIVERSITY Variable computations/ Non-2^ Variable computations with fading Worst case design for different fading models? If fewer computations, Dynamic Voltage /Frequency scaling? Users not a power of 2 then, not all kernels will scale down dummy data? spare codes available that some users can use to increase data rates Multiple antennas, multi-rate systems
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.