RICE UNIVERSITY SWAPs: Re-thinking mobile and base-station architectures Sridhar Rajagopal VLSI Signal Processing Group Center for Multimedia Communication.

Presentation transcript:

SWAPs: Re-thinking mobile and base-station architectures
Sridhar Rajagopal
VLSI Signal Processing Group, Center for Multimedia Communication
Department of Electrical and Computer Engineering, Rice University, Houston TX
March 23, 2003
This work has been supported in part by Nokia, TI, TATP and NSF

Future wireless devices:
- High data rate mobile devices with multimedia
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN

Aim of the talk
- How do I build such a device?
  - Challenges
  - Constraints
  - Solutions

Wireless Trends
Figure columns: Past, Current, Future?

Trend comparisons

Change in flexibility requirements
- Application Layer: no change (already flexible)
- Network Layer
- MAC Layer
- Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)

Summary of Challenges
- Sophisticated algorithms (GOPs of computation)
  - ~1-10 Mbps, 500 mW
- Flexibility required at physical layer
  - Multiple algorithms, multiple standards, multiple environments
- What we would also like:
  - Time to market
  - Rapid evaluation and implementation
  - Scalable architecture design methodologies

Physical layer of a receiver
Block diagram: Antenna → RF Front-end → Baseband processing (Channel estimation → Detection → Decoding) → Higher (MAC/Network/OS) Layers
Receiver more complex than transmitter

Past implementations: Physical layer
Figure blocks: Analog RF, Analog Baseband, Digital Baseband
Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs", TI Report SPRA650, March 2000

Past architectures: Baseband (my domain)
- DSP for baseband
- ASIC for compute-intensive operations
- Microcontroller for higher layers

Current "proposed" handsets
- Increased co-processors (not better DSPs?)
Figure sources:
- TI: U. Ko, M. McMahan and E. Auslander, "DSP for the third generation wireless communications", International Conference on Computer Design, 1999, pp. 516-520
- VIA: H. Chen, "Introduction to W-CDMA SoC design approach", VIA Technologies, August INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF

Can this methodology scale?
- Baseband increasingly important for real-time and power
  - Need much more flexibility
  - Environment-specific sophisticated algorithms
  - Cannot keep adding co-processors → lose flexibility of a programmable solution
- 1 Mbps with 100 MHz processor
  - 100 cycles per bit to do all your work (GOPs/bit)
- Power consumption with bigger color displays, video and more complex algorithms
  - May have only ~100 mW for baseband
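
The cycle budget quoted on this slide is just the clock rate divided by the data rate. A minimal Python sketch of that arithmetic follows; the function name `cycles_per_bit` is made up for this illustration, and the 100 MHz and 1 Mbps inputs are the slide's own figures.

```python
def cycles_per_bit(clock_hz: float, data_rate_bps: float) -> float:
    """Processor cycles available for each received bit."""
    return clock_hz / data_rate_bps


# Slide figures: 100 MHz processor, 1 Mbps data rate.
budget = cycles_per_bit(100e6, 1e6)
print(f"{budget:.0f} cycles per bit")
```

Every stage of the receiver (estimation, detection, decoding) has to fit inside that per-bit budget, which is why raising the data rate without raising the clock forces more parallelism.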

Architecture trade-offs
Figure: ASIC solutions, intermediate solutions and programmable solutions placed on axes of area-time-power performance and flexibility

Motivation
Now that we know the challenges and constraints, Design me

Design
How do we choose
- the right algorithms?
- the right amount of flexibility?
Do we build DSPs, ASICs, heterogeneous, reconfigurable?
- If ASICs, how to build better ASICs?
- If programmable, how to build better DSPs?
- If both, how do we mix them better?
Answers dependent on
- level of flexibility needed
- area-time-power architecture tradeoffs

My contributions
- "Low-complexity" algorithms for wireless:
  - Parallel, fixed point algorithms for multiuser estimation and detection
- ASIC design for wireless using computer arithmetic techniques:
  - Dynamic truncation using on-line arithmetic
- Programmable architecture design for wireless:
  - Scalable Wireless Application-specific Processors (SWAPs)

Outline
- Design Methodology
- Programmable architecture design
  - Scalable Wireless Application-specific Processors (SWAPs)
- Applications to base-stations and handsets
- Future research directions

Design methodology
Flowchart boxes: Chain of receiver algorithms; Low "complexity", parallel, fixed point; High level language implementation; Programmable implementation; Modular programmable architecture design; ASIC implementation; FPGA, customized, reconfigurable, heterogeneous implementations
Flowchart labels: Example: Pentium, DSP, SWAPs; Area-Time-Power specs: no; specs: no; learn; Example: H-SWAPs

Choosing the right algorithms: theory
Algorithm research:
- Spectral efficiency
- Low power (RF)
Metrics:
- Bit error rate
- Frame error rate
Figure: Bit Error Rate versus Signal to Noise Ratio, with labels Past, Current, Future and Theory

Choosing right algorithms: practice
- Refine candidates from theory (using linear algebra / optimization)
  - Lower "complexity", parallel, fixed-point
- "Complexity": number of operations of equivalent type
Optimization: Area: A; Time: B; Power: A; Energy: A/B
Multi-parameter optimization?
Figure: Complexity, Complexity/Parallelism and Execution Time for Original, Candidate A and Candidate B

Programmable architectures
- Current DSPs
  - Not enough functional units (FUs) for GOPs of computation
  - Cannot extend to more FUs
    - Limited Instruction Level Parallelism (ILP)
    - Cannot support more registers (register area increases quadratically with FUs)
    - Compilers: difficult to find ILP as FUs increase

Solution
- Exploit data parallelism (DP)
  - Lots available in wireless algorithms
  - Example:
      Int i,a,b,c,d; // 32 bits
      Half2 e,f; // 16 bits
      for (i = 1: 1024) {
          a[i] = b[i] + c[i];
          d[i] = e[i] * f[i];
      }
Figure labels: ILP (the independent add and multiply inside the loop body), DP (the independent loop iterations)
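
To make the ILP/DP distinction concrete, here is a minimal Python sketch of the slide's loop with the iterations dealt out to clusters. It is a sequential model only, not the stream-processor implementation: `run_on_clusters` and `num_clusters` are hypothetical names, and in hardware every cluster would run its share of the iterations at the same time.

```python
def run_on_clusters(b, c, e, f, num_clusters):
    """Model of the slide's loop with iterations striped across clusters."""
    n = len(b)
    a = [0] * n
    d = [0] * n
    for cluster in range(num_clusters):
        # DP: each cluster owns iterations cluster, cluster + num_clusters, ...
        # No iteration depends on another, so clusters never need to synchronize.
        for i in range(cluster, n, num_clusters):
            # ILP: the add and the multiply are independent of each other,
            # so one cluster can issue both in the same cycle.
            a[i] = b[i] + c[i]
            d[i] = e[i] * f[i]
    return a, d


b = list(range(1024))
c = [2 * x for x in b]
e = [x % 7 for x in b]
f = [x % 5 for x in b]
a, d = run_on_clusters(b, c, e, f, num_clusters=8)
```

The result does not depend on `num_clusters`, which is the property that lets the same kernel run on architectures with different cluster counts.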

DSP vs. SWAPs
Figure: a DSP (1 cluster) with Internal Memory, labeled ILP, next to SWAPs (max. clusters) with a Stream Register File, labeled ILP within a cluster and DP across clusters

SWAPs trade-offs
- Same internal memory size as DSPs
  - Dependent on application, not architecture
- Needs more area to support more functional units
  - Area is less of a constraint than power
- Varying levels of DP in applications
  - Needs reconfiguration!!
  - Need to turn off unused clusters
- More parallelism → lower clock frequency → lower voltage → low power (∝ CV²f + leakage) in spite of larger area
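
The last point can be checked with the dynamic-power term CV²f alone. The Python sketch below is illustrative only: the 4x cluster count and the assumption that a 4x lower clock permits dropping the supply from 1.5 V to 1.2 V are hypothetical inputs chosen for the example, and leakage is ignored.

```python
def relative_dynamic_power(capacitance_scale, frequency_scale, voltage, base_voltage):
    """Dynamic power relative to a baseline, using P proportional to C * V^2 * f."""
    return capacitance_scale * frequency_scale * (voltage / base_voltage) ** 2


# Hypothetical design point: 4x the clusters (4x switched capacitance),
# clocked 4x slower, with the supply lowered from 1.5 V to 1.2 V.
ratio = relative_dynamic_power(
    capacitance_scale=4.0, frequency_scale=0.25, voltage=1.2, base_voltage=1.5
)
print(f"dynamic power relative to baseline: {ratio:.2f}")
```

The larger capacitance and the lower frequency cancel, so the saving comes entirely from the quadratic voltage term; this is why the extra area can still pay off in power.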

Communication example: Viterbi decoding
- Popular and widely used method for decoding convolutional codes
- Consists of 3 parts
  - Trellis initialization
  - Add-Compare-Select operations
  - Traceback
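
The three parts listed above can be sketched in a few lines of Python. This is a minimal hard-decision decoder written for illustration, not the implementation from the talk; the rate 1/2, constraint length 3 code with generators 7 and 5 (octal) and all function names are assumptions chosen to keep the example small.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits, k=3, gens=(0b111, 0b101)):
    """Convolutional encoder (one output bit per generator), starting in state 0."""
    state, out = 0, []
    for bit in bits:
        reg = (bit << (k - 1)) | state  # newest bit sits in the top position
        out.extend(parity(reg & g) for g in gens)
        state = reg >> 1
    return out


def viterbi_decode(received, k=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoder matching encode() above."""
    num_states = 1 << (k - 1)
    n = len(gens)
    inf = float("inf")

    # Part 1: trellis initialization (the encoder is known to start in state 0).
    metrics = [0] + [inf] * (num_states - 1)
    history = []

    # Part 2: add-compare-select for every received symbol.
    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_metrics = [inf] * num_states
        decisions = [None] * num_states
        for state in range(num_states):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                reg = (bit << (k - 1)) | state
                expected = [parity(reg & g) for g in gens]
                branch = sum(r != x for r, x in zip(symbol, expected))
                candidate = metrics[state] + branch      # add
                nxt = reg >> 1
                if candidate < new_metrics[nxt]:         # compare
                    new_metrics[nxt] = candidate         # select
                    decisions[nxt] = (state, bit)
        metrics = new_metrics
        history.append(decisions)

    # Part 3: traceback from the best final state.
    state = min(range(num_states), key=lambda s: metrics[s])
    decoded = []
    for decisions in reversed(history):
        state, bit = decisions[state]
        decoded.append(bit)
    decoded.reverse()
    return decoded


message = [1, 0, 1, 1, 0, 0, 1, 0]
coded = encode(message)
coded[3] ^= 1  # inject one channel error
print(viterbi_decode(coded) == message)
```

The add-compare-select loop touches every state independently, while the traceback loop walks one path backwards step by step; that asymmetry is what the parallel formulation on the next slide addresses.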

Parallel Viterbi Decoding
- Add-Compare-Select (ACS): trellis interconnect
  - Re-order for exploiting DP
  - Parallelism depends on constraint length (#states)
- Conventional Traceback: sequential
  - Use Register Exchange (RE) for parallel solution
Exploiting DP in a programmable architecture implies:
- Re-order ACS
- Re-order RE
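
Register exchange replaces the sequential traceback with per-state work: each state carries the decoded bits of its own survivor path and copies the winner's register during ACS. The Python sketch below illustrates the idea under the same assumptions as a textbook example (rate 1/2, constraint length 3, generators 7 and 5 octal, encoder starting in state 0); the function names are hypothetical and the loop over states is written sequentially even though each iteration is independent.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def register_exchange_decode(received, k=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoding with register exchange (no traceback)."""
    num_states = 1 << (k - 1)
    n = len(gens)
    inf = float("inf")
    metrics = [0] + [inf] * (num_states - 1)
    registers = [[] for _ in range(num_states)]  # decoded bits per survivor

    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_metrics = [inf] * num_states
        new_registers = [[] for _ in range(num_states)]
        # Each target state is handled independently: this loop is the DP.
        for nxt in range(num_states):
            bit = nxt >> (k - 2)  # the input bit that leads into nxt
            for low in (0, 1):
                reg = (nxt << 1) | low
                prev = reg & (num_states - 1)
                expected = [parity(reg & g) for g in gens]
                branch = sum(r != x for r, x in zip(symbol, expected))
                candidate = metrics[prev] + branch
                if candidate < new_metrics[nxt]:
                    new_metrics[nxt] = candidate
                    # Exchange: inherit the winner's bits and append the new one.
                    new_registers[nxt] = registers[prev] + [bit]
        metrics, registers = new_metrics, new_registers

    best = min(range(num_states), key=lambda s: metrics[s])
    return registers[best]


# 11 10 00 01 is the (7, 5) encoding of 1 0 1 1 from the all-zero state.
print(register_exchange_decode([1, 1, 1, 0, 0, 0, 0, 1]))
```

The price is memory traffic: every state copies a whole register per step, which trades storage and bandwidth for the removal of the serial traceback.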

SWAP design
- Decide how many clusters
  - Exploit DP
  - Look at the for loop count
- Decide what to put within each cluster
  - Maximize ILP with high functional unit efficiency
  - Search design space
- See how it meets time-area-power constraints

What goes inside a cluster?

Re-ordering for parallel Viterbi
Figure: (a) Trellis: states X(0) to X(15) in natural order on both sides. (b) Shuffled Trellis: one side re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15) (even-indexed states, then odd-indexed states), the other side in natural order X(0) to X(15).
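
One way to read the shuffled trellis is that the even/odd re-ordering lines up the two predecessors of every output state at a fixed stride, so ACS becomes an element-wise operation on two aligned halves. The Python sketch below checks that property for 16 states. It assumes a shift-right state register (predecessors of state s are 2s and 2s+1, modulo the number of states); the slide does not state its numbering convention, and both function names are hypothetical.

```python
def predecessors(state, num_states):
    """Two states feeding `state`, assuming a shift-right state register."""
    base = (state << 1) & (num_states - 1)
    return base, base | 1


def shuffle_even_odd(values):
    """Re-order as X(0), X(2), ..., then X(1), X(3), ... as in the figure."""
    return values[0::2] + values[1::2]


num_states = 16
half = num_states // 2
order = shuffle_even_odd(list(range(num_states)))
print(order)

# After the shuffle, position j holds state 2j and position j + half holds
# state 2j + 1: exactly the two predecessors of output states j and j + half.
aligned = all(
    (order[j], order[j + half]) == predecessors(j, num_states)
    and (order[j], order[j + half]) == predecessors(j + half, num_states)
    for j in range(half)
)
print(aligned)
```

With the metrics stored in this order, each cluster reads its two inputs from the same offset in the two halves instead of gathering them through the trellis interconnect.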

Viterbi reconfiguration
- Packet 1: Constraint length 7 (16 clusters)
- Packet 2: Constraint length 9 (64 clusters)
- Packet 3: Constraint length 5 (4 clusters)
Figure labels: DP; Can be turned OFF
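
The cluster counts on this slide are consistent with a fixed four trellis states per cluster, since a constraint length K code has 2^(K-1) states. The Python sketch below encodes that reading; the ratio of 4 is inferred from the slide's numbers rather than stated by the author, the 64-cluster maximum is taken from the power slide, and the function names are hypothetical.

```python
STATES_PER_CLUSTER = 4  # inferred: 64 states on 16 clusters for K = 7
MAX_CLUSTERS = 64       # cluster count quoted for the SWAP design point


def active_clusters(constraint_length: int) -> int:
    """Clusters needed to decode a code with the given constraint length."""
    num_states = 1 << (constraint_length - 1)
    needed = num_states // STATES_PER_CLUSTER
    if needed < 1 or needed > MAX_CLUSTERS:
        raise ValueError(f"constraint length {constraint_length} does not fit")
    return needed


def idle_clusters(constraint_length: int) -> int:
    """Clusters that could be turned off while this packet is decoded."""
    return MAX_CLUSTERS - active_clusters(constraint_length)


for k in (7, 9, 5):  # packets 1, 2 and 3 on the slide
    print(k, active_clusters(k), idle_clusters(k))
```

Keeping the states-per-cluster ratio fixed is what allows the same kernel code to serve every constraint length; only the number of powered clusters changes from packet to packet.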

How to reconfigure?
- Move data to appropriate clusters and turn off unused clusters and SRF
  - Significant loss in performance
  - Maximum power savings
- Use Conditional Streams
  - Cannot turn off SRF, comm, scratchpad in clusters
  - Minimal loss in performance
- Use mux-demux buffers
  - Can turn off clusters entirely: more power savings
  - Minimal loss in performance

Figure: Kernels (Computation) and Memory accesses for three 64-bit packets, all Rate 1/2: Packet 1 (Constraint Length 7), Packet 2 (Constraint Length 9), Packet 3 (Constraint Length 5)

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

Viterbi decoding: Execution time
Figure labels: Ideal DSP C64x (w/o co-proc); DSP (RE); SWAP; ASIC/FPGA*; 128 KHz (1 bit/cycle); Real-time performance; DP; Task Pipelining; Dedicated interconnect
*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

Salient features of this solution
- Any constraint length
  - 10 MHz at 128 Kbps (handset)
- Same code for all constraint lengths
  - no need to re-compile or load another code
  - as long as parallelism/cluster ratio is constant
- Exploiting parallelism for real-time:
  - Instruction Level Parallelism (DSP)
  - Subword Parallelism (DSP)
  - Data Parallelism (Imagine)
  - Dynamic Cluster Scaling (SWAP)
- Power savings due to dynamic cluster scaling

Expected SWAP power numbers: Viterbi decoding
- 64 clusters and 1 multiplier per cluster:
  - Process: 0.13 micron
  - Voltage: 1.5 V (to min. leakage when not active)
  - R-T Frequency: f ~10 MHz
  - Peak Active Power: ~16 mW/MHz (11 mW/MHz if 1.2 V)
  - Area: ~53.7 mm²
- 10 MHz, 128 Kbps
  - ~160 (110) mW for K = 9
  - ~53.33 (36.7) mW for K = 7
  - ~26.67 (12.5) mW for K = 5
- ASICs: ~μW
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp

Problems
- Suitable for handsets? Not yet!
  - Still too general
  - Not low power enough!!!
  - No special customization for the application
    - Except for a fixed-point architecture
    - Generic instruction set
    - Generic ALUs (though they can be powered down)
    - Generic inter-cluster communication network
- Suitable for base-stations?
  - Why not? Power is not a primary constraint.

Multiuser Estimation-Detection + Decoding
Figure: Frequency needed to attain real-time (in MHz) versus Number of clusters, with labels FAST, MEDIUM and SLOW; points marked for a 32-user 3G base-station and a Hand-set
Real-time target: 128 Kbps per user

Expected power numbers
- 32 user base-station with 3 multipliers per cluster and 64 clusters:
  - Process: 0.13 micron
  - Voltage: 1.2 V (always active, leakage less important)
  - R-T Frequency: f ~1 GHz
  - Peak Active Power: ~19.88 mW/MHz (increased *)
  - Area: ~93.4 mm²
- Total Base-station power consumption:
  - ~19.88 W at 1 GHz for 32 users at 128 Kbps/user
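
The totals on the power slides follow from multiplying the peak active power per MHz by the real-time clock frequency. A short Python check of that arithmetic, using only figures quoted in the talk (the function name `active_power_mw` is made up for this example):

```python
def active_power_mw(power_per_mhz_mw: float, frequency_mhz: float) -> float:
    """Peak active power in mW for a given clock frequency."""
    return power_per_mhz_mw * frequency_mhz


# Handset Viterbi decoding, K = 9: ~16 mW/MHz at ~10 MHz (1.5 V).
handset_mw = active_power_mw(16.0, 10.0)

# 32-user base-station: ~19.88 mW/MHz at ~1 GHz (1.2 V).
base_station_w = active_power_mw(19.88, 1000.0) / 1000.0

print(f"handset: {handset_mw:.0f} mW, base-station: {base_station_w:.2f} W")
```

This reproduces the ~160 mW and ~19.88 W entries; the lower handset figures for K = 7 and K = 5 additionally reflect clusters being turned off and are not derived here.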

H-SWAPs
- Trade Data Parallelism for Task Pipelining
- Customize each SWAPlet
Figure: SWAPs (max. clusters and reconfigure); SWAPlet (limit clusters, Limited DP); H-SWAPs (collection of customized SWAPlets, each with Limited DP)

Viterbi decoding
- Survivor management: serial
  - Finding parallel solution for SWAPs: expensive
  - > 50% of execution time: overhead
- Serial solution now possible with H-SWAPs
Figure: H-SWAPs for Viterbi decoding: ACS unit (ACS clusters with Limited DP) and Traceback unit (TBU)

Potential advantages
Figure: performance comparison of SWAPs and H-SWAPs
- SWAPs: DSP (RE) → SWAP (DP) → ASIC/FPGA, real-time performance (Task Pipelining, Dedicated interconnect)
- H-SWAPs: DSP (RE) → H-SWAP (Partial DP + Task Pipelining, Application-specific units) → ASIC/FPGA, real-time performance (Dedicated interconnect)

Current research
- How to trade-off task vs. data parallelism?
- Evaluation of specialized inter-cluster communication
- Integrating specialized arithmetic units (ACS, on-line)
- Area-Time-Power efficiency of Handset SWAPs
- Learning to migrate from H-SWAPs to SWAPs
  - Scale to future systems!!

Future work
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
- Image processing:
  - Cameras: variety of compression algorithms
- Biomedical applications:
  - Hearing aids: DSP running on body heat*
- Sensor networks
*Quote: Gene Frantz, TI Fellow

Conclusions
- Exciting times for wireless algorithm and architecture research
  - More complex algorithms
  - Higher data rates: meet real-time requirements
  - Lower power
  - Low area
- Seek to design flexible architectures
  - learn from ASIC solutions
- Inter-disciplinary research needed:
  - Computer architecture, VLSI, wireless communications, computer arithmetic, compilers