DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,

DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston, TX This work has been supported by Nokia, TI, TATP and NSF

Wireless Communication Systems Flexibility is required Mobile –Switch between standards –Switch between parameters Base-station –Varying number of users –Each user has different parameters Wireless Mobile Device Baseband Programmable Communications Processor RF Unit A/D D/A

Integration of Cellular/Wireless LAN W-CDMA base-station –4Mbps –Delay constraints –Area constraints? W-LAN base-station –100Mbps –Delay constraints –Area constraints? Mobile –W-CDMA & W-LAN –1Mbps & 100Mbps/# of users –Delay, area, and power constraints!

Computation Requirements Estimation, Detection and Decoding in a 4Mbps W-CDMA cellular multiuser system 050100150200250300 10 0 1 2 3 ALUs required for real-time at 500 MHz Number of W-CDMA Cellular Users Add Multiply SLOW FADING (estimation every 1000 bits) MEDIUM FADING (estimation every 100 bits) FAST FADING (estimation every 10 bits) DATA RATES PER USER

Proposed DSP System Evolution Current solutions to meet real-time (Racks of DSPs) Programmable DSP Processor for 4G wireless systems < x cm Future wireless DSP architectures x = 2.5 (W-CDMA BS) x = 2.0 (W-LAN BS) x = 1.5 (Mobile Handset)

The System Design Challenge Current single processor DSPs not powerful enough for next generation multi-standard applications Algorithms well understood at data-flow level Can design real-time systems in fixed VLSI Pushing towards programmable implementation Stream processors provide an interesting alternative

Research Contributions Algorithms for future wireless communications –Multiuser channel estimation and detection –Task partitioning, parallelism, pipelining –Used DSPs to develop and understand algorithms Special-purpose implementations –VLSI and FPGA mappings of algorithms –Conventional and on-line arithmetic Flexible implementations (current work) –Future DSP architectures? –Stream processors? –Architectural innovation –Functional unit design and usage

Outline Motivation Parallel Algorithms for Estimation, Detection, and Decoding Stream Processor Architecture Performance Comparisons and Results

Typical Base-station Algorithms Equalization? FFT Viterbi decoding Multiuser channel estimation Multiuser detection Viterbi decoding Turbo decoding Multiple antenna systems Wireless LAN W-CDMA Advanced receiver schemes

Parallel W-CDMA Estimation/Detection/Decoding Multiuser estimation –replaced matrix inversion by gradient descent Multiuser detection –Parallel Interference Cancellation (PIC) –Pipelined algorithm that avoids block-based detection Viterbi decoding –Trellis structures suited for decoding –Register exchange for survivor memory –No traceback latency

Estimation/Detection (64,32 sizes) Multiuser Estimation Multiuser Detection Prepare Matrices for Detection

X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) X(0) X(2) X(4) X(6) X(8) X(10) X(12) X(14) X(1) X(3) X(5) X(7) X(9) X(11) X(13) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) a. Unsuitable Trellisb. Suitable Trellisc. Shuffled Suitable Trellis X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) Viterbi Trellis for Rate ½ Code with K = 5

Survivor Management in Viterbi Decoding Two techniques –Traceback – commonly used –Register exchange Traceback is simpler –Less area in VLSI architectures –Drawback: Sequential and additional latency Register exchange is faster –Parallel updates –Packing decoded bits in the register needs to access the entire register

DSP Evolution and Trends DSP Architectures –Increased parallelism and computational throughput –TI TMS 320C6x generation of VLIW DSPs Media Processing Architectures –Orders of magnitude increase in parallelism and computational throughput – 3D graphics! –Imagine processor developed at MIT/Stanford Prototype fabricated and licensed by TI Flexible and extensible VLIW multiple cluster architecture –Applicable to wireless communications?

The Imagine Architecture

Arithmetic Clusters VLIW control 3 adders, 2 multipliers, 1 divider Scratch-pad and communication unit Distributed register files

Bandwidth Hierarchy 41.2 32-bit operations per word of memory bandwidth 2GB/s32GB/s SDRAM Stream Register File ALU Cluster 544GB/s

Stream Programming StreamC –Executes on host processor –C++ –Controls stream transfers between main memory and SRF void main() { Stream a(256); Stream b(256); Stream c(256); Stream d(1024);... example1(a, b, c); example2(c, d);... } KernelC –Executes on clusters –C-like Syntax –Kernel computation –Compiled by iscd KERNEL example1(istream a, istream b, ostream c) { loop_stream(a) { int ai, bi, ci; a >> ai; b >> bi; ci = ai * 2 + bi * 3; c << ci; }

1024-point FFT Performance ProcessorFrequencyDataRadixTimePowerEnergy Imagine500MHz Float (32-bit) 2 7.4  s 3.8W 28  J C6711150MHz Float (32-bit) 2 138  s ~1.3W C6411300MHz Fixed (16-bit) mixed 21  s ~0.5W C6201100MHz Fixed (16-bit) 2 227  s 0.6W 144  J Virtex II125MHz Fixed (16-bit) 4 2s2s

Media vs. Communications Similarities –Data parallelism –Low-precision data –High computation rates Different characteristics of communications processing –More data reorganization, such as matrix transposes –Bit-level operations Explore space of stream processor architectures with isim –Cycle-accurate stream processor simulation –Flexible machine description language (read by both simulator and compiler) –Vary number and design of functional units –Vary memory, register sizes –etc.

Stream Data Flow Matrix transpose Viterbi kernel Matrix mult kernel Correlation update kernel Matrix mul C kernel Data rearrangement Buffer Estimation bits Detection bits Multiuser Channel Estimation Multiuser Detection Decoding Computation Communication Iteration update kernel Matched filter kernel Matrix mul L kernel PIC kernel

Matrix Multiplication Kernel (Imagine) 32 cycle loop Executed on all 8 clusters Complexity –O(N 3 ) multiplies –O(N 3 ) adds 100% multiplier utilization in the loop Divider is unnecessary! Inner Loop Instruction Communication (waiting for input) FU unavailable (input ready but FU busy) ADD0ADD1ADD2MUL0MUL1DIV0

Replace Divider with Multiplier 22 cycle loop Executed on all 8 clusters 97% multiplier utilization in the loop 85% adder utilization in the loop Changing functional units –Supported by simulator/compiler –Architecturally realistic Instruction ADD0ADD1ADD2MUL0MUL1MUL2

Kernel Computational Time

Estimation and Detection Execution Kernel ExecutionMemory TransfersCycle Stalled waiting for data from memory Estimation Detection (10 bits)

Viterbi Execution Initialization Decode (32 bits) Kernel ExecutionMemory TransfersCycle

Real-time Performance Slow FadingMedium FadingFast Fading 0 0.5 1 1.5 2 2.5 x 10 4 estimation detection decoding stall time Real-Time at 500 MHz Clock cycles

Rough DSP Comparison Estimation Execution time IMAGINE TI C67: Internal Memory TI C67: External Memory Glue Matrices Detection

Future Work Achieve real-time rates –Additional functional units (that can be used efficiently!) –Eliminate communication stalls between kernels –Support for matrix transposes and bit-level operations Power and area constraints –Low power stream processing –Scaling the architecture for handsets Scalability with data rates –Boundaries of the architecture Handset algorithms

Conclusions Future wireless communications algorithms –exceed the capabilities of current DSPs –require flexibility to change algorithms and parameters –require efficient use of resources because of delay, area, and power constraints Architectural developments are needed for future DSPs –Stream processing is a promising approach –Additional hardware acceleration, akin to Viterbi coprocessor on C64? The insights gained from our designs can be applied to DSPs and other processors with constraints on delay, area and power.

DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,

Similar presentations

Presentation on theme: "DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,

Similar presentations

Presentation on theme: "DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,"— Presentation transcript:

Similar presentations

About project

Feedback