RICE UNIVERSITY Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston.

Presentation transcript:

RICE UNIVERSITY Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX March 31, 2003 This work has been supported in part by Nokia, TI, TATP and NSF

Future wireless devices
- High data rate mobile devices with multimedia
- Multiple antennas w/ complex algorithms, GOPs of computation
- Area-Time-Power constraints
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
[Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN]

Aim of the talk
Design me

Trends
FLEXIBILITY

Change in flexibility requirements
[Figure: protocol stack (Physical Layer, MAC Layer, Network Layer, Application Layer) on a scale from "No change (already flexible)" to "Maximum change (needs to support multiple environments, algorithms and standards)"]

Architecture trade-offs
- Past: more DSP + less ASIC. Current: less DSP + more ASIC
- Reason: need less flexibility OR DSPs not powerful enough?
- Can't we build better DSPs?
- How much flexibility do we need?
[Figure: spectrum from ASICs through Intermediate to Programmable; labels: Area-Time-Power benefits; Flexibility, Time-to-market, Software updates]

What is the right architecture?
ASICs not good:
- Need much more flexibility
- Multiple complex algorithms and multiple environments
- Cannot keep adding co-processors
DSPs not good either:
- 1 Mbps with 100 MHz processor
- 100 cycles available per bit (GOPs)
- Power: bigger color displays and more complex algorithms
- Only ~100 mW for baseband
Need a methodology to explore flexibility-architecture tradeoffs
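
The cycle budget quoted on this slide is simply the clock rate divided by the data rate. A minimal sketch of that arithmetic follows; the function name `cycles_per_bit` is a hypothetical helper, not something from the talk.

```c
#include <assert.h>

/* Processor cycles available per received bit at a given clock and data rate.
 * Hypothetical helper: it only restates the slide's arithmetic. */
double cycles_per_bit(double clock_hz, double bit_rate_bps) {
    assert(clock_hz > 0.0 && bit_rate_bps > 0.0);
    return clock_hz / bit_rate_bps;
}
```

For the slide's numbers, `cycles_per_bit(100e6, 1e6)` gives 100 cycles per bit. With algorithms that need GOPs of computation, this implies many operations must complete in every cycle.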

My contributions
- Algorithms: Parallel, fixed point algorithms for multiuser estimation and detection
- Architectures: Dynamic truncation in ASICs using on-line arithmetic
- Processors: Scalable Wireless Application-specific Processors (SWAPs)
- Design methodology to explore flexibility vs. architecture tradeoffs

Problems with current DSPs
- Current DSPs
  - Not enough functional units (FUs) for GOPs of computation
  - Need 100's of FUs
  - Not low power enough!!
- Cannot extend to more FUs
  - Limited Instruction Level Parallelism (ILP)
  - Limited Subword Parallelism (such as MMX)
  - Cannot support more registers (area, ports)
  - Compilers: difficult to find ILP as FUs increase

Solution: SWAPs
- Exploit data parallelism (DP)
- Available in many wireless algorithms
- This is what ASICs do!!
- Example:

    enum { N = 1024, K = 1024 };
    int i, a[N], b[N], c[N];        // 32 bits
    short int d[K], e[K], f[K];     // 16 bits, packed
    for (i = 0; i < N; ++i) {       // iterations are independent: DP
        a[i] = b[i] + c[i];         // the two statements are independent: ILP
        d[i] = e[i] + f[i];         // 16-bit operands can be packed: subword
    }

SWAPs: stream processors for wireless
[Figure: stream-kernel graph. Legend: Kernel, Stream, Input Data, Output Data. Labels: received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits]
- Kernels (computation) and streams (communication)
- Operations on kernels use local data
- Streams expose data parallelism
- Imagine stream processor at Stanford
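
To make the kernel/stream split concrete, here is a rough sketch in plain C. It is not the Imagine or SWAPs programming model: `stream_t`, `correlate_kernel` and `decide_kernel` are hypothetical names, and the two kernels (a despreading correlator and a hard-decision detector) are simplified stand-ins for the blocks in the figure.

```c
#include <assert.h>
#include <stddef.h>

/* A stream is a sequence of records handed from one kernel to the next. */
typedef struct { int *data; size_t len; } stream_t;

/* Hypothetical correlator kernel: despread chips with a +/-1 code of length sf.
 * Each output symbol depends only on its own sf input chips, so the outer
 * loop iterations are independent (data parallelism). */
void correlate_kernel(const stream_t *chips, const int *code, size_t sf,
                      stream_t *symbols) {
    assert(sf > 0 && chips->len % sf == 0 && symbols->len == chips->len / sf);
    for (size_t k = 0; k < symbols->len; ++k) {
        int acc = 0;                                 /* kernel-local data */
        for (size_t j = 0; j < sf; ++j)
            acc += chips->data[k * sf + j] * code[j];
        symbols->data[k] = acc;
    }
}

/* Hypothetical detector kernel: hard decision on each soft symbol. */
void decide_kernel(const stream_t *symbols, stream_t *bits) {
    assert(bits->len == symbols->len);
    for (size_t k = 0; k < symbols->len; ++k)
        bits->data[k] = symbols->data[k] >= 0 ? 1 : 0;
}
```

Chaining the two calls passes data between kernels only through streams, while each kernel works on local variables. That is the property the slide relies on to expose data parallelism.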

DSP vs. SWAPs
[Figure: DSP (1 cluster, ILP) vs. SWAPs (max. clusters, ILP + DP); labels: Internal Memory, Stream Register File (SRF). All clusters are the same and do the same operations.]

Arithmetic clusters
- FUs (+, *, /)
- Scratch-pad (Sp)
  - Indexed accesses
- Comm. unit (CU)
  - Intercluster comm.
- Distributed reg. files
  - more FUs
[Figure labels: Intercluster Network, From/To SRF, Cross Point, Local Register File, CU, Sp, SRF, functional units (+, *, /)]

SWAPs vs. Imagine trade-offs
Imagine (Stanford):
- Optimized for media processing
- Floating point with 8 clusters
- 3 adders, 2 multipliers, 1 divider in each
- Architecture simulator tool
- Vary number of clusters, functional units, registers ….
SWAPs (Rice):
- Optimized for wireless communications
- Minimized access to data memory
- Fixed point with clusters adapting to available DP
- Functional units adapting to available ILP

SWAPs vs. DSPs trade-offs
- Same internal memory size as DSPs
- Dependent on application, not architecture
- Needs more area to support more functional units
- Area is less of a constraint than power
- Varying levels of DP in applications
- Needs reconfiguration!!
- Need to turn off unused clusters (and FUs)
- More parallelism → lower clock frequency → lower voltage → low power (dynamic power ∝ CV²f, plus leakage) in spite of larger area
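
The last bullet can be checked with the standard dynamic-power relation P = C·V²·f. The sketch below ignores leakage and assumes the supply voltage can actually be lowered when each parallel copy runs at a lower clock. The function names and the example voltages are illustrative assumptions, not figures from the talk.

```c
#include <assert.h>

/* Dynamic switching power, P = C * V^2 * f. Leakage is not modeled. */
double dynamic_power(double cap_farads, double volts, double freq_hz) {
    assert(cap_farads >= 0.0 && freq_hz >= 0.0);
    return cap_farads * volts * volts * freq_hz;
}

/* Ratio of dynamic power after replicating a datapath `ways` times, clocking
 * each copy at f/ways and lowering the supply from v_nominal to v_scaled.
 * Throughput is unchanged and area grows by roughly `ways`. */
double parallel_power_ratio(int ways, double v_nominal, double v_scaled) {
    assert(ways > 0 && v_nominal > 0.0);
    double base = dynamic_power(1.0, v_nominal, 1.0);
    double par = ways * dynamic_power(1.0, v_scaled, 1.0 / ways);
    return par / base;
}
```

With these assumptions the frequency reduction and the replication cancel, so the saving comes entirely from the voltage term. Halving the voltage cuts dynamic power to about a quarter at the same throughput.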

Design methodology
[Flow chart labels, in slide order: Chain of receiver algorithms; Low "complexity", parallel, fixed point; High level language implementation; Modular programmable architecture design; ASIC design; FPGA, customized, reconfigurable, heterogeneous designs; DSP, SWAPs (learn); H-SWAPs (learn); Algorithm-specific; Architecture exploration; Flexibility-performance tradeoffs]

Physical layer of wireless receivers
[Figure: Antenna → RF Front-end → Baseband processing (Channel estimation, Detection, Decoding) → Higher (MAC/Network/OS) Layers]
Receiver more complex than transmitter

Algorithms for
- Multiple antenna systems (MIMO systems)
- Complexity exponential with transmit * receive antennas
- Wide range of extremely complex algorithms
- Optimal depends on fading, mobility, bandwidth, antennas
- GOPs of computations
- Estimation: Linear MMSE, blind, conjugate gradient….
- Detection: FFT, (blind) interference cancellation….
- Decoding: Viterbi, Turbo, LDPC….
- Implement ALL of them AND the NEXT one in line
- Use the best for the situation
Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding
1. Add-Compare-Select (ACS): trellis interconnect
   - Parallelism depends on constraint length (#states)
2. Conventional Traceback
   - Sequential (No DP)
   - Difficult to implement in parallel architecture
   - Use Register Exchange (RE) → parallel solution
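
The two steps can be sketched as a small hard-decision decoder in which every state carries its own decoded-bit register, so no sequential traceback is needed. This is an illustrative toy, not the talk's implementation. The constraint length 3, the generators 7 and 5 (octal), the state numbering and all function names are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { K = 3, NSTATES = 1 << (K - 1) };     /* constraint length 3: 4 states */
static const unsigned G0 = 07, G1 = 05;     /* assumed generators (octal) */

static unsigned parity(unsigned x) {
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Two coded bits produced when input bit b leaves state s (b is the newest bit). */
static unsigned branch_output(unsigned s, unsigned b) {
    unsigned reg = (b << (K - 1)) | s;
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

static unsigned next_state(unsigned s, unsigned b) {
    return ((b << (K - 1)) | s) >> 1;
}

static unsigned hamming2(unsigned x) { return (x & 1u) + ((x >> 1) & 1u); }

void encode(const uint8_t *bits, int n, uint8_t *sym) {
    unsigned s = 0;
    for (int t = 0; t < n; ++t) {
        sym[t] = (uint8_t)branch_output(s, bits[t]);
        s = next_state(s, bits[t]);
    }
}

/* One trellis step. Each next state is updated independently of the others,
 * which is the data parallelism that grows with the number of states. */
static void acs_re_step(unsigned rx, unsigned pm[NSTATES], uint32_t reg[NSTATES]) {
    unsigned new_pm[NSTATES];
    uint32_t new_reg[NSTATES];
    for (unsigned ns = 0; ns < NSTATES; ++ns) {
        unsigned b  = ns >> (K - 2);                  /* input bit leading into ns */
        unsigned p0 = (ns << 1) & (NSTATES - 1);      /* the two predecessors */
        unsigned p1 = p0 | 1u;
        unsigned m0 = pm[p0] + hamming2(rx ^ branch_output(p0, b));  /* add */
        unsigned m1 = pm[p1] + hamming2(rx ^ branch_output(p1, b));
        unsigned win = (m1 < m0) ? p1 : p0;           /* compare, select */
        new_pm[ns]  = (m1 < m0) ? m1 : m0;
        new_reg[ns] = (reg[win] << 1) | b;            /* register exchange */
    }
    memcpy(pm, new_pm, sizeof new_pm);
    memcpy(reg, new_reg, sizeof new_reg);
}

/* Decode n symbols (n <= 32); the first decoded bit ends up most significant. */
uint32_t viterbi_re_decode(const uint8_t *sym, int n) {
    assert(n > 0 && n <= 32);
    unsigned pm[NSTATES];
    uint32_t reg[NSTATES] = {0};
    for (unsigned s = 0; s < NSTATES; ++s) pm[s] = (s == 0) ? 0u : 1000u;
    for (int t = 0; t < n; ++t) acs_re_step(sym[t], pm, reg);
    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; ++s) if (pm[s] < pm[best]) best = s;
    return reg[best];
}
```

The register-exchange line is the key design choice: the winner's register is copied and extended in the same loop as the ACS, which costs extra storage and data movement per state but removes the sequential traceback.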

Re-ordering for parallel Viterbi
[Figure: a. Trellis, with states X(0) to X(15) in natural order on both sides; b. Shuffled Trellis, with states reordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)]
Exploiting Viterbi DP in SWAPs:
- Re-order ACS, RE
- Overhead
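
The shuffled ordering in panel b places the even-indexed states first and the odd-indexed states second. A minimal sketch of that permutation follows; `shuffle_even_odd` is a hypothetical name. The motivation is also an assumption rather than something the slide states: in a shift-register trellis, states 2j and 2j+1 feed the same pair of next states, so separating evens from odds is one way to line up operands across clusters.

```c
#include <assert.h>

/* Gather even-indexed entries first, then odd-indexed ones. n must be even. */
void shuffle_even_odd(const int *in, int *out, int n) {
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; ++i) {
        out[i] = in[2 * i];              /* X(0), X(2), X(4), ... */
        out[n / 2 + i] = in[2 * i + 1];  /* X(1), X(3), X(5), ... */
    }
}
```

The copy itself is the "Overhead" bullet: the reordering adds data movement that a dedicated ASIC interconnect would get for free.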

SWAP: Algorithms + Architecture
Algorithm design for parallelism
Architecture design?

SWAP design
- Decide how many clusters
- Exploit DP
- Decide what to put within each cluster
- Maximize ILP with high functional unit efficiency
- Search design space with "explore" tool
- See how it meets time-area-power constraints
[Figure: clusters built from adders (+) and multipliers (*); ILP within a cluster, DP across clusters; contents and cluster count still open (?)]

Inside a SWAP cluster: EXPLORE
[Figure: auto-exploration of adders and multipliers for "ACS"; points labeled (Adder FU%, Multiplier FU%)]

"Explore" tool benefits
- Instruction count vs. functional unit efficiency
- What goes inside each cluster
- Explore all algorithms → turn off functional units not in use for given kernel
- Design customized application-specific units
- Better performance with increased FU utilization
Example:
  Algorithm 1:  3 adders, 3 multipliers, 32 clusters
  Algorithm 2:  4 adders, 1 multiplier, 64 clusters
  Architecture: 4 adders, 3 multipliers, 64 clusters
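
In the example on this slide, the architecture row is the element-wise maximum of the two algorithm rows. The sketch below reproduces that rule and reports what could be switched off while a given algorithm runs. `config_t`, `cover` and `idle` are hypothetical names, and the actual "explore" tool may weigh other factors.

```c
#include <assert.h>

typedef struct { int adders, multipliers, clusters; } config_t;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest single configuration that covers every per-algorithm configuration. */
config_t cover(const config_t *algos, int n) {
    assert(n > 0);
    config_t arch = algos[0];
    for (int i = 1; i < n; ++i) {
        arch.adders      = max_int(arch.adders, algos[i].adders);
        arch.multipliers = max_int(arch.multipliers, algos[i].multipliers);
        arch.clusters    = max_int(arch.clusters, algos[i].clusters);
    }
    return arch;
}

/* Resources that can be turned off while `algo` runs on `arch`. */
config_t idle(config_t arch, config_t algo) {
    config_t off = { arch.adders - algo.adders,
                     arch.multipliers - algo.multipliers,
                     arch.clusters - algo.clusters };
    return off;
}
```

For the slide's two algorithms, `cover` returns 4 adders, 3 multipliers and 64 clusters. Algorithm 2 then leaves 2 multipliers per cluster idle, and Algorithm 1 leaves 32 clusters idle.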

Viterbi reconfiguration
- Packet 1: Constraint length 7 (16 clusters)
- Packet 2: Constraint length 9 (64 clusters)
- Packet 3: Constraint length 5 (4 clusters)
[Figure labels: DP; Can be turned OFF]
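
A Viterbi decoder for constraint length K has 2^(K-1) states, and the three cluster counts on this slide match 4 states per cluster. The sketch below encodes that observation. The ratio of 4 and the cap of 64 clusters are inferred from the slide's numbers, and the function names are hypothetical.

```c
#include <assert.h>

enum { MAX_CLUSTERS = 64, STATES_PER_CLUSTER = 4 };  /* inferred, not stated */

/* Number of trellis states for constraint length k. */
int viterbi_states(int k) {
    assert(k >= 2 && k <= 16);
    return 1 << (k - 1);
}

/* Clusters kept active for a packet with constraint length k. */
int active_clusters(int k) {
    int c = viterbi_states(k) / STATES_PER_CLUSTER;
    if (c < 1) c = 1;
    if (c > MAX_CLUSTERS) c = MAX_CLUSTERS;
    return c;
}
```

Under this reading, the clusters left over out of 64 (for example 48 of them for K = 7) are the ones that can be turned off.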

Reconfiguration: 1: Data transfer
- Move data to appropriate clusters via comm units
- Significant performance loss, additional SRF memory required
- Can turn off SRF too!
[Figure labels: SRF, Clusters, CU]

Reconfiguration: 2: Conditional streams
- Transfer data via comm unit (CU) and scratchpad (Sp)
- Minimal loss in performance
- Cannot turn off SRF, comm unit, scratchpad in clusters

Reconfiguration: 3: Multiplexed buffers
- Use mux-demux buffers
- Minimal loss in performance
- Can turn off clusters entirely: more power savings

[Figure: execution time (cycles) split between Clusters and Memory for three 64-bit packets at rate ½ (Packet 1: K = 7, Packet 2: K = 9, Packet 3: K = 5); Kernels (Computation), No Data Memory accesses]

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
[Plot: frequency needed to attain real-time (in MHz) vs. number of clusters, for K = 9, K = 7 and K = 5; series: Static architecture, SWAPs, DSP]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

SWAPs: Salient features
- 1-2 orders of magnitude better than 1 processor DSP
- Any constraint length
- 10 MHz at 128 Kbps
- Same code for all constraint lengths
- no need to re-compile or load another code
- as long as parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling

Expected SWAP power consumption
- 64 clusters and 1 multiplier per cluster:
- 0.13 micron, 1.2 V
- Peak Active Power: ~9 mW at 1 MHz
- Area: ~53.7 mm²
- 10 MHz, 128 Kbps with reconfiguration

Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW

[Plot: Power (in mW) vs. Active Clusters (max 64)]
* Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al., Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp

Flexibility vs. performance
- Suitable for mobile devices?
- SWAPs: Real-time at ~ mW
- Maybe; but can we do better?
- ASICs: Real-time at ~ μW
- No special customization for the application
- No application-specific units
- Generic inter-cluster communication network
- Overhead for extracting parallelism
- SWAPs suitable for base-stations?
- Why not? Power is not a primary constraint!

Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power: base-station
- 32 user base-station with 3 X's per cluster and 64 clusters:
- 0.13 micron, 1.2 V
- Peak Active Power: ~18.19 mW for 1 MHz (increased *)
- Area: ~93.4 mm²
- Total Peak Base-station power consumption:
- ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
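
The total on this slide follows from scaling the per-MHz peak figure linearly with clock frequency. The same rule gives the ~90 mW quoted for the 64-cluster Viterbi decoder at 10 MHz. The sketch below assumes exactly that linear scaling; `peak_power_mw` is a hypothetical helper.

```c
#include <assert.h>

/* Peak active power in mW, assuming it scales linearly with clock frequency. */
double peak_power_mw(double mw_per_mhz, double clock_mhz) {
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

Here `peak_power_mw(18.19, 1000.0)` is about 18190 mW, or 18.19 W, for the base-station case, and `peak_power_mw(9.0, 10.0)` is 90 mW for the handset Viterbi case.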

Current research
- SWAPs: Completely flexible and general
- How do we trade-off flexibility for better performance?
- Handset SWAPs (H-SWAPs)

Let's look at ASICs

Architecture   Clock for real-time   Throughput
DSP            200 MHz               ~1 bit / 2000 cycles
SWAP           10 MHz                ~1 bit / 100 cycles
ASIC/FPGA      128 KHz *             1 bit / cycle

[Figure: execution time bars for DSP, SWAP and ASIC/FPGA; labels: DP, Task Pipelining, Dedicated interconnect]
* VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

Handset SWAPs: H-SWAPs
- Trade Data Parallelism for Task Pipelining
- Design SWAPlets and customize each SWAPlet
[Figure: SWAPs (max. clusters and reconfigure); SWAPlet (limit clusters, limited DP); H-SWAPs (collection of customized SWAPlets, each with limited DP)]

H-SWAPs: Viterbi decoding
- Survivor management: serial
- Finding parallel solution for SWAPs: expensive
- > 50% of execution time: overhead
- Serial solution now possible with H-SWAPs
- Better performance with less flexibility!!
[Figure: H-SWAPs for Viterbi decoding: ACS unit (ACS blocks with adders, limited DP) and Traceback unit (TBU)]

H-SWAPs: Potential advantages
[Figure: execution time, two panels. SWAPs panel: DSP (RE); SWAP (DP); ASIC/FPGA with real-time performance (Task Pipelining, Dedicated interconnect). H-SWAPs panel: DSP (RE); H-SWAP (Partial DP + Task Pipelining, Application-specific units); ASIC/FPGA with real-time performance (Dedicated interconnect)]

Current research
- Task vs. data parallelism tradeoffs
- Evaluation of specialized inter-cluster communication
- Integrating specialized arithmetic units (ACS, on-line)
- Learning to migrate from H-SWAPs to SWAPs
- Scale to future systems!!

Future research: efficient algorithms

Future research: architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
- Image processing:
  - Cameras: variety of compression algorithms
- Biomedical applications:
  - Hearing aids: DSP running on body heat *
- Sensor networks
  - Compression of data before transmission
* Quote: Gene Frantz, TI Fellow

Conclusions
- Need flexible architectures for future wireless devices
- Higher data rates, lower power, more complex algorithms
- Design methodology (SWAPs, H-SWAPs, ASICs)
- Flexibility vs. performance trade-offs
- Blurs distinction between ASICs and programmable solutions
- Also need parallel, low precision algorithms for efficient mapping
- Inter-disciplinary research:
  - Computer architecture, VLSI, wireless communications, computer arithmetic, compilers