DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,

Presentation transcript:

DSP Architectural Considerations for Optimal Baseband Processing
Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang
Rice University, Houston, TX
This work has been supported by Nokia, TI, TATP and NSF

Wireless Communication Systems
Flexibility is required
Mobile
– Switch between standards
– Switch between parameters
Base-station
– Varying number of users
– Each user has different parameters
Figure labels: Wireless Mobile Device; Baseband Programmable Communications Processor; RF Unit; A/D; D/A

Integration of Cellular/Wireless LAN
W-CDMA base-station
– 4 Mbps
– Delay constraints
– Area constraints?
W-LAN base-station
– 100 Mbps
– Delay constraints
– Area constraints?
Mobile
– W-CDMA & W-LAN
– 1 Mbps & 100 Mbps/# of users
– Delay, area, and power constraints!

Computation Requirements
Estimation, detection, and decoding in a 4 Mbps W-CDMA cellular multiuser system
Figure: ALUs (add and multiply) required for real-time operation at 500 MHz, plotted against the number of W-CDMA cellular users and the data rate per user, for three cases:
– Slow fading (estimation every 1000 bits)
– Medium fading (estimation every 100 bits)
– Fast fading (estimation every 10 bits)
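The relationship behind an "ALUs required for real-time" plot can be sketched as simple arithmetic. This is a minimal sketch, assuming each ALU completes one operation per clock cycle at full utilization; the function name `alus_required` and the workload of 1000 operations per bit are hypothetical and are not taken from the slide.

```python
import math

def alus_required(ops_per_bit: float, bit_rate_hz: float, clock_hz: float) -> int:
    """ALUs needed to sustain ops_per_bit * bit_rate_hz operations per second,
    assuming one operation per ALU per clock cycle (an idealization)."""
    ops_per_second = ops_per_bit * bit_rate_hz
    return math.ceil(ops_per_second / clock_hz)

# Hypothetical workload: 1000 additions per bit at 4 Mbps on a 500 MHz clock.
print(alus_required(1000, 4e6, 500e6))
```

Faster fading raises the operations per bit because estimation is repeated more often, which is why the requirement grows from the slow-fading to the fast-fading case.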

Proposed DSP System Evolution
Current solutions to meet real-time: racks of DSPs
Future wireless DSP architectures: programmable DSP processor for 4G wireless systems, < x cm
– x = 2.5 (W-CDMA BS)
– x = 2.0 (W-LAN BS)
– x = 1.5 (Mobile Handset)

The System Design Challenge
Current single-processor DSPs are not powerful enough for next-generation multi-standard applications
Algorithms are well understood at the data-flow level
Real-time systems can be designed in fixed VLSI
Pushing towards a programmable implementation
Stream processors provide an interesting alternative

Research Contributions
Algorithms for future wireless communications
– Multiuser channel estimation and detection
– Task partitioning, parallelism, pipelining
– Used DSPs to develop and understand algorithms
Special-purpose implementations
– VLSI and FPGA mappings of algorithms
– Conventional and on-line arithmetic
Flexible implementations (current work)
– Future DSP architectures?
– Stream processors?
– Architectural innovation
– Functional unit design and usage

Outline
Motivation
Parallel Algorithms for Estimation, Detection, and Decoding
Stream Processor Architecture
Performance Comparisons and Results

Typical Base-station Algorithms
Wireless LAN
– Equalization?
– FFT
– Viterbi decoding
W-CDMA
– Multiuser channel estimation
– Multiuser detection
– Viterbi decoding
Advanced receiver schemes
– Turbo decoding
– Multiple antenna systems

Parallel W-CDMA Estimation/Detection/Decoding
Multiuser estimation
– Replaced matrix inversion by gradient descent
Multiuser detection
– Parallel Interference Cancellation (PIC)
– Pipelined algorithm that avoids block-based detection
Viterbi decoding
– Trellis structures suited for decoding
– Register exchange for survivor memory
– No traceback latency
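To illustrate what replacing a matrix inversion with gradient descent means, here is a minimal sketch that solves R x = b iteratively for a small symmetric positive-definite R. The fixed step size, the 2x2 matrix, and the names `matvec` and `gradient_descent_solve` are assumptions for illustration; the slide does not give the authors' exact update rule.

```python
def matvec(R, x):
    """Multiply matrix R (list of rows) by vector x."""
    return [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in R]

def gradient_descent_solve(R, b, step, iterations):
    """Approximate the solution of R x = b without inverting R.
    Each iteration moves x along the residual b - R x, which is the negative
    gradient of 0.5 * x^T R x - b^T x for symmetric R."""
    x = [0.0] * len(b)
    for _ in range(iterations):
        residual = [b_i - rx_i for b_i, rx_i in zip(b, matvec(R, x))]
        x = [x_i + step * r_i for x_i, r_i in zip(x, residual)]
    return x

# Hypothetical 2x2 system; the exact solution is [1/11, 7/11].
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gradient_descent_solve(R, b, step=0.2, iterations=200)
print(x)
```

Each iteration needs only matrix-vector multiplies and additions, which map onto adders and multipliers and avoid the divisions that a direct inversion requires.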

Estimation/Detection (64,32 sizes)
Figure labels: Multiuser Estimation; Multiuser Detection; Prepare Matrices for Detection
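The basic cancellation step in parallel interference cancellation (PIC) can be sketched as follows. This is a simplified, hard-decision sketch and not the pipelined bit-streaming variant the slides describe; the `coupling` matrix (cross-correlation times the interfering user's amplitude), the two-user numbers, and the function names are hypothetical.

```python
def sign(v):
    """Hard decision: map a soft value to +1 or -1."""
    return 1 if v >= 0 else -1

def pic_detect(y, coupling, stages):
    """Hard-decision PIC. y[k] is the matched-filter output of user k and
    coupling[k][j] is how strongly user j's bit leaks into y[k]."""
    n = len(y)
    bits = [sign(v) for v in y]  # stage 0: plain matched-filter decisions
    for _ in range(stages):
        # All users cancel the other users' estimated interference in parallel,
        # using the bit decisions from the previous stage.
        bits = [
            sign(y[k] - sum(coupling[k][j] * bits[j] for j in range(n) if j != k))
            for k in range(n)
        ]
    return bits

# Hypothetical near-far case: user 2 is three times stronger than user 1.
true_bits = [1, -1]
coupling = [[1.0, 1.5], [0.5, 3.0]]
y = [sum(coupling[k][j] * true_bits[j] for j in range(2)) for k in range(2)]
matched_filter_only = pic_detect(y, coupling, stages=0)
after_one_stage = pic_detect(y, coupling, stages=1)
print(matched_filter_only, after_one_stage)
```

In this example the matched filter alone decodes the weak user incorrectly, and one cancellation stage recovers the transmitted bits.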

Viterbi Trellis for Rate ½ Code with K = 5
Figure: 16-state trellis over states X(0) to X(15), drawn three ways: a. Unsuitable Trellis; b. Suitable Trellis; c. Shuffled Suitable Trellis (output states ordered X(0), X(2), ..., X(14), X(1), X(3), ..., X(15))
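One plausible reading of the shuffled trellis is sketched below: if new path metrics are written with even states in the first half and odd states in the second half, every add-compare-select butterfly reads positions i and i + 8 and writes positions i and i + 8. The shift-register convention, the function names, and the branch-metric callback are assumptions for illustration, not details taken from the figure.

```python
NUM_STATES = 16          # 2**(K-1) states for constraint length K = 5
HALF = NUM_STATES // 2

# Even states first, then odd states, matching the order listed in the figure.
SHUFFLED_ORDER = [2 * i for i in range(HALF)] + [2 * i + 1 for i in range(HALF)]

def predecessors(state):
    """Two states that can reach `state`, assuming
    next = ((prev << 1) | input_bit) mod 16."""
    low = state >> 1
    return low, low + HALF

def acs_shuffled(old_metrics, branch_metric):
    """One add-compare-select step with output stored in shuffled order.
    Butterfly i reads old[i] and old[i + 8], writes new[i] and new[i + 8]."""
    new_shuffled = [0] * NUM_STATES
    for i in range(HALF):
        a, b = old_metrics[i], old_metrics[i + HALF]
        for bit in (0, 1):
            state = 2 * i + bit
            new_shuffled[bit * HALF + i] = min(
                a + branch_metric(i, state),
                b + branch_metric(i + HALF, state),
            )
    return new_shuffled

def unshuffle(shuffled):
    """Return metrics in natural state order 0..15."""
    natural = [0] * NUM_STATES
    for position, state in enumerate(SHUFFLED_ORDER):
        natural[state] = shuffled[position]
    return natural
```

With this layout the loads and stores of each butterfly are regular across i, which suits data-parallel clusters better than a gather over irregular state indices.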

Survivor Management in Viterbi Decoding
Two techniques
– Traceback (commonly used)
– Register exchange
Traceback is simpler
– Less area in VLSI architectures
– Drawback: sequential, with additional latency
Register exchange is faster
– Parallel updates
– Packing decoded bits in the register needs to access the entire register
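A minimal sketch of register-exchange survivor management follows: every state carries its own register of decoded bits, and each trellis step copies the winning predecessor's register and appends one bit, so no traceback pass is needed. For brevity it assumes a small rate 1/2 code with K = 3 and generators (7, 5) octal rather than the K = 5 code on the slides, and it uses hard decisions; all names are illustrative.

```python
K = 3                          # constraint length (assumption; the slides use K = 5)
GENERATORS = (0b111, 0b101)    # rate 1/2 generator polynomials (assumption)
NUM_STATES = 1 << (K - 1)

def parity(x):
    return bin(x).count("1") & 1

def step(state, bit):
    """Shift one input bit in; return (next_state, two output symbols)."""
    reg = (state << 1) | bit
    return reg & (NUM_STATES - 1), tuple(parity(reg & g) for g in GENERATORS)

def encode(bits):
    state, out = 0, []
    for bit in bits:
        state, symbols = step(state, bit)
        out.extend(symbols)
    return out

def decode_register_exchange(received):
    """Hard-decision Viterbi decoding with one survivor register per state."""
    inf = float("inf")
    metrics = [0] + [inf] * (NUM_STATES - 1)      # encoder starts in state 0
    registers = [[] for _ in range(NUM_STATES)]
    for t in range(0, len(received), 2):
        pair = received[t:t + 2]
        new_metrics = [inf] * NUM_STATES
        new_registers = [None] * NUM_STATES
        for state in range(NUM_STATES):
            if metrics[state] == inf:
                continue                           # state not yet reachable
            for bit in (0, 1):
                nxt, symbols = step(state, bit)
                cost = metrics[state] + sum(a != b for a, b in zip(symbols, pair))
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    # Register exchange: copy the whole survivor, append the bit.
                    new_registers[nxt] = registers[state] + [bit]
        metrics, registers = new_metrics, new_registers
    best = min(range(NUM_STATES), key=lambda s: metrics[s])
    return registers[best]                         # decoded bits, no traceback

message = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * (K - 1)  # data plus tail zeros
noisy = encode(message)
noisy[3] ^= 1                                      # inject one channel error
decoded = decode_register_exchange(noisy)
```

The full-register copy on every update is the cost the slide points to: the survivors update in parallel, but each update touches an entire register.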

Outline
Motivation
Parallel Algorithms for Estimation, Detection, and Decoding
Stream Processor Architecture
Performance Comparisons and Results

DSP Evolution and Trends
DSP Architectures
– Increased parallelism and computational throughput
– TI TMS320C6x generation of VLIW DSPs
Media Processing Architectures
– Orders of magnitude increase in parallelism and computational throughput (3D graphics!)
– Imagine processor developed at MIT/Stanford
   Prototype fabricated and licensed by TI
   Flexible and extensible VLIW multiple cluster architecture
– Applicable to wireless communications?

The Imagine Architecture

Arithmetic Clusters
VLIW control
3 adders, 2 multipliers, 1 divider
Scratch-pad and communication unit
Distributed register files

Bandwidth Hierarchy
bit operations per word of memory bandwidth
Figure: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU Cluster (544 GB/s)

Stream Programming
StreamC
– Executes on host processor
– C++
– Controls stream transfers between main memory and SRF

    void main() {
      Stream a(256);
      Stream b(256);
      Stream c(256);
      Stream d(1024);
      ...
      example1(a, b, c);
      example2(c, d);
      ...
    }

KernelC
– Executes on clusters
– C-like syntax
– Kernel computation
– Compiled by iscd

    KERNEL example1(istream a, istream b, ostream c) {
      loop_stream(a) {
        int ai, bi, ci;
        a >> ai;
        b >> bi;
        ci = ai * 2 + bi * 3;
        c << ci;
      }
    }
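For readers without the Imagine toolchain, the kernel's behavior can be mimicked in plain Python. This is only a stand-in: lists play the role of streams, the loop plays the role of `loop_stream`, and the input values are made up; it says nothing about how the clusters schedule the work.

```python
def example1(a, b):
    """Plain-Python stand-in for the KernelC kernel example1."""
    c = []
    for ai, bi in zip(a, b):          # loop_stream(a): one element per stream per iteration
        c.append(ai * 2 + bi * 3)     # ci = ai * 2 + bi * 3; c << ci
    return c

def main():
    # Host-side role of StreamC: create the streams and invoke the kernel.
    a = list(range(256))              # hypothetical input data
    b = [1] * 256
    return example1(a, b)

c = main()
print(c[:4])
```

The split mirrors the two languages: the host code only moves whole streams around, while the kernel touches individual elements.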

1024-point FFT Performance
Processor | Frequency | Data | Radix | Time | Power | Energy
Imagine | 500 MHz | Float (32-bit) | … | … µs | 3.8 W | 28 µJ
C… | … MHz | Float (32-bit) | … | … µs | ~1.3 W |
C… | … MHz | Fixed (16-bit) | mixed | 21 µs | ~0.5 W |
C… | … MHz | Fixed (16-bit) | … | … µs | 0.6 W | 144 µJ
Virtex II | 125 MHz | Fixed (16-bit) | 4 | 2 µs | |
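The power and energy columns are tied to execution time by energy = average power × time. The sketch below applies that relation to the two rows that list both power and energy. The resulting times are implied values under that assumption, not figures read from the slide, and the function name is hypothetical.

```python
def time_from_energy(energy_joules, power_watts):
    """Execution time implied by energy = average power * time."""
    return energy_joules / power_watts

# Row listing 3.8 W and 28 uJ (Imagine, 32-bit float).
imagine_time = time_from_energy(28e-6, 3.8)
# Row listing 0.6 W and 144 uJ (16-bit fixed point).
fixed_time = time_from_energy(144e-6, 0.6)

print(f"{imagine_time * 1e6:.1f} us, {fixed_time * 1e6:.0f} us")
```

Under this assumption the two rows imply roughly 7.4 µs and 240 µs respectively.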

Media vs. Communications
Similarities
– Data parallelism
– Low-precision data
– High computation rates
Different characteristics of communications processing
– More data reorganization, such as matrix transposes
– Bit-level operations
Explore space of stream processor architectures with isim
– Cycle-accurate stream processor simulation
– Flexible machine description language (read by both simulator and compiler)
– Vary number and design of functional units
– Vary memory, register sizes
– etc.

Outline
Motivation
Parallel Algorithms for Estimation, Detection, and Decoding
Stream Processor Architecture
Performance Comparisons and Results

Stream Data Flow
Figure: stream data flow through three stages (Multiuser Channel Estimation, Multiuser Detection, Decoding)
– Computation kernels: correlation update, iteration update, matrix mult, matrix mul L, matrix mul C, matched filter, PIC, Viterbi
– Communication: matrix transpose, data rearrangement, buffer
– Data: estimation bits, detection bits

Matrix Multiplication Kernel (Imagine)
32-cycle loop
Executed on all 8 clusters
Complexity
– O(N^3) multiplies
– O(N^3) adds
100% multiplier utilization in the loop
Divider is unnecessary!
Figure: inner-loop schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0; legend: instruction, communication (waiting for input), FU unavailable (input ready but FU busy)
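The O(N^3) multiply and add counts can be checked with a small instrumented matrix multiply. This is a plain triple loop for counting purposes only; it does not model how the kernel is scheduled across the 8 clusters, and the function name is hypothetical.

```python
def matmul_counted(A, B):
    """Multiply A (n x m) by B (m x p) and count scalar multiplies and adds."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    multiplies = adds = 0
    for i in range(n):
        for j in range(p):
            acc = A[i][0] * B[0][j]
            multiplies += 1
            for k in range(1, m):
                acc += A[i][k] * B[k][j]
                multiplies += 1
                adds += 1
            C[i][j] = acc
    return C, multiplies, adds

N = 4
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
M = [[i * N + j for j in range(N)] for i in range(N)]
product, multiplies, adds = matmul_counted(M, identity)
print(multiplies, adds)
```

For N x N operands this gives N^3 multiplies and N^2 (N - 1) adds, so neither count involves a division, consistent with the slide's remark that the divider goes unused.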

Replace Divider with Multiplier
22-cycle loop
Executed on all 8 clusters
97% multiplier utilization in the loop
85% adder utilization in the loop
Changing functional units
– Supported by simulator/compiler
– Architecturally realistic
Figure: inner-loop schedule for functional units ADD0, ADD1, ADD2, MUL0, MUL1, MUL2; legend: instruction
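The quoted utilization figures are consistent with a fixed amount of work per loop iteration. The sketch below assumes utilization = operations issued / (loop cycles × number of units). The 64 multiplies per loop are inferred from the earlier 32-cycle loop running 2 multipliers at 100%, and the 56 adds are an assumption chosen because it is the integer count that reproduces the quoted 85%; neither number is stated on the slides.

```python
def utilization(ops_per_loop, loop_cycles, num_units):
    """Fraction of issue slots used by a class of functional units in one loop."""
    return ops_per_loop / (loop_cycles * num_units)

MULTIPLIES_PER_LOOP = 32 * 2   # inferred: 32 cycles x 2 multipliers at 100%
ADDS_PER_LOOP = 56             # assumption: reproduces the quoted 85%

before = utilization(MULTIPLIES_PER_LOOP, loop_cycles=32, num_units=2)
after_mul = utilization(MULTIPLIES_PER_LOOP, loop_cycles=22, num_units=3)
after_add = utilization(ADDS_PER_LOOP, loop_cycles=22, num_units=3)
print(round(before * 100), round(after_mul * 100), round(after_add * 100))
```

With a third multiplier, 64 multiplies need at least 22 cycles (ceil(64 / 3)), which matches the shorter loop length reported.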

Kernel Computational Time

Estimation and Detection Execution
Figure: kernel execution and memory transfers versus cycle for estimation and detection (10 bits), showing intervals stalled waiting for data from memory

Viterbi Execution
Figure: kernel execution and memory transfers versus cycle for initialization and decode (32 bits)

Real-time Performance
Figure: clock cycles (x 10^4) for estimation, detection, decoding, and stall time under slow, medium, and fast fading, against the real-time limit at 500 MHz

Rough DSP Comparison
Figure: execution time for estimation, glue matrices, and detection on Imagine, TI C67 (internal memory), and TI C67 (external memory)

Future Work
Achieve real-time rates
– Additional functional units (that can be used efficiently!)
– Eliminate communication stalls between kernels
– Support for matrix transposes and bit-level operations
Power and area constraints
– Low-power stream processing
– Scaling the architecture for handsets
Scalability with data rates
– Boundaries of the architecture
Handset algorithms

Conclusions
Future wireless communications algorithms
– exceed the capabilities of current DSPs
– require flexibility to change algorithms and parameters
– require efficient use of resources because of delay, area, and power constraints
Architectural developments are needed for future DSPs
– Stream processing is a promising approach
– Additional hardware acceleration, akin to Viterbi coprocessor on C64?
The insights gained from our designs can be applied to DSPs and other processors with constraints on delay, area, and power.