RICE UNIVERSITY Handset architectures
Sridhar Rajagopal, sridhar@rice.edu, http://www.ece.rice.edu/~sridhar
[Figure labels: ASICs, Programmable]
Support for this work, in part by Nokia, TI and NSF, is gratefully acknowledged.

RICE UNIVERSITY 2G handsets
- ASIC for compute-intensive operations (spreading etc.)
- DSP for most of the baseband
- Microcontroller for higher layers
Source: "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs", M. L. McMahan, TI Report SPRA650, March 2000

RICE UNIVERSITY Proposed 3G handsets
- TI: "DSP for the third generation wireless communications", U. Ko, M. McMahan and E. Auslander, International Conference on Computer Design, 1999, pp. 516-520
- VIA: "Introduction to W-CDMA SoC design approach", H. Chen, VIA Technologies, August 2002, www.itpilot.org.tw/provisional/910802/INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF
- Increased number of co-processors, as DSPs are unable to do most of the baseband

RICE UNIVERSITY Motivation
- How does this scale?
- Do we need a DSP or should we build ASICs?
- If ASICs, how to build better ASICs?
- If programmable, how to build better DSPs?
- If both, how do we mix them better?
Answers depend on
- the level of programmability needed
- area-time-power architecture tradeoffs

RICE UNIVERSITY Rice innovations for ASICs and DSPs
- ASICs: On-line arithmetic for dynamic truncation
- Programmable: Scalable Wireless Application-specific Processors (SWAPs)
- Mix and match: Hybrid SWAPs (H-SWAPs)
[Figure labels: ASICs, Programmable]

RICE UNIVERSITY Outline
- On-line arithmetic for dynamic truncation
- SWAPs
- H-SWAPs

RICE UNIVERSITY ASIC designs
- Finite precision arithmetic
  - Faster
  - Low power
  - Low area
- How to keep finite precision bounded:
  - Saturation
  - Truncation

RICE UNIVERSITY Keeping precision bounded
- Example of truncation
  - Multiplication by the step size in gradient descent
  - Sign detection
- Example of saturation
  - Avoiding overflows
  - When the probability of useful MSBs is low
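The two ways of bounding precision named on this slide can be sketched as follows. This is a minimal illustration assuming two's-complement integers; the function names `saturate` and `truncate_lsbs` and the word widths are hypothetical and do not come from the slides.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a signed integer to the range of a `bits`-wide two's-complement word."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def truncate_lsbs(value: int, drop: int) -> int:
    """Discard the `drop` least significant bits (arithmetic shift, rounds toward -inf)."""
    return value >> drop


# Saturation: a wide accumulator value squeezed into 8 bits keeps its sign
# and clips to the largest representable magnitude instead of wrapping.
clipped = saturate(300, 8)        # 127

# Truncation: multiplying by an assumed step size of 2**-4 is a 4-bit
# right shift, which simply drops the low-order bits.
scaled = truncate_lsbs(300, 4)    # 18
```

Saturation discards most significant bits that carry no information, while truncation discards least significant bits and therefore always costs some accuracy.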

RICE UNIVERSITY Dynamic precision requirements
- Precision needs change with algorithms and SNR
- Adapt hardware dynamically to save power
  - 25-35% power reduction possible
- Dynamic saturation vs. dynamic truncation
  - Easy, as LSBs come first, vs. difficult
  - No error vs. significant error
  - Throughput benefits vs. no benefits

RICE UNIVERSITY On-line arithmetic for dynamic truncation
- Works Most Significant Digit First
  - Natural way of truncation
- Digit-serial
  - Dynamic truncation
- Redundant number system
  - Error only in LSD
- Throughput benefits, as it is digit-serial

RICE UNIVERSITY Example for sign detection
[Figure: (a) Truncated conventional arithmetic: the products a_i * b_i feed a tree addition (level 1 onward) and the result is available only after the full tree, at time t_CONV-MF. (d) Dynamically truncated on-line arithmetic: the same products and tree addition pass through registers (R) digit by digit; the sign is determined after 2 MSDs (t_OL), at which point computation stops, compared with t_OL-MF for the full result.]
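The early-stop idea behind this example can be sketched in a few lines. The sketch assumes a radix-2 signed-digit representation with digits in {-1, 0, 1}; the slides only say "redundant number system", so the radix, the digit set and the names `sign_msd_first` and `sd_value` are assumptions. It shows only why most-significant-digit-first processing lets sign detection terminate early, and does not model the on-line multipliers, adder tree or on-line delay of the actual design.

```python
from typing import Iterable, Sequence, Tuple


def sign_msd_first(digits: Iterable[int]) -> Tuple[int, int]:
    """Return (sign, digits_consumed) for a radix-2 signed-digit fraction.

    Digits arrive most significant first, each in {-1, 0, 1}.
    The first non-zero digit fixes the sign, because all remaining
    digits together weigh strictly less than that digit.
    """
    consumed = 0
    for d in digits:
        consumed += 1
        if d not in (-1, 0, 1):
            raise ValueError("digit outside {-1, 0, 1}")
        if d != 0:
            return (1 if d > 0 else -1), consumed
    return 0, consumed


def sd_value(digits: Sequence[int]) -> float:
    """Reference value sum(d_i * 2**-(i+1)), used only to check the sign."""
    return sum(d * 2.0 ** -(i + 1) for i, d in enumerate(digits))


word = [0, 1, -1, -1, 0, 1, -1, 1]     # 8 digits, MSD first
sign, used = sign_msd_first(word)      # sign is known after 2 of 8 digits
```

A conventional LSB-first datapath would have to produce all eight digits before the sign bit becomes available.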

RICE UNIVERSITY Throughput comparisons

RICE UNIVERSITY Area comparisons

RICE UNIVERSITY ASIC design conclusion
Details: Predrag
Using on-line arithmetic for dynamic truncation and conventional arithmetic for dynamic saturation, one can design efficient ASICs for handsets.

RICE UNIVERSITY Outline
- On-line arithmetic for dynamic truncation
- SWAPs
- H-SWAPs

RICE UNIVERSITY Programmable architectures
- Current DSPs
  - Not enough functional units (FUs)
- Cannot extend to more FUs
  - Limited Instruction Level Parallelism (ILP)
  - Cannot support more registers (register area increases quadratically with FUs)
  - Compilers: difficult to find ILP as FUs increase

RICE UNIVERSITY Solution
- Exploit data parallelism (DP)
  - Lots available in wireless algorithms
- Example:
    for (i = 0; i < 1024; i++) {
        a[i] = b[i] + c[i];
        d[i] = b[i] * c[i];
    }
[Figure labels: ILP, DP]
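The loop on this slide can be emulated in software to show where the two kinds of parallelism sit. This is a sketch only: `run_on_clusters` and the interleaved assignment of iterations to clusters are illustrative assumptions, not the SWAP mapping itself.

```python
def run_on_clusters(b, c, num_clusters):
    """Emulate splitting the loop over `num_clusters` identical clusters.

    Cluster k handles indices k, k + num_clusters, k + 2*num_clusters, ...
    Inside a cluster the add and the multiply are independent (ILP);
    across clusters the iterations are independent (DP).
    """
    n = len(b)
    a = [0] * n
    d = [0] * n
    for k in range(num_clusters):               # conceptually parallel
        for i in range(k, n, num_clusters):     # this cluster's share
            a[i] = b[i] + c[i]
            d[i] = b[i] * c[i]
    return a, d


b = list(range(1024))
c = [3] * 1024
a, d = run_on_clusters(b, c, num_clusters=8)
```

Because no iteration reads a value written by another, the result is the same for any number of clusters; only the time per cluster changes.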

RICE UNIVERSITY DSP vs. SWAPs
[Figure: a DSP (1 cluster) is a single group of adders (+) and multipliers (*) on top of an internal memory and exploits ILP only. SWAPs (max. clusters) replicate that cluster many times over an internal memory, exploiting ILP within each cluster and DP across clusters.]

RICE UNIVERSITY SWAPs trade-offs
- Same internal memory size as DSPs
  - Dependent on application, not architecture
- Needs more area to support more functional units
  - Area is not a constraint (power is)
- Varying levels of DP in applications
  - Needs reconfiguration!
  - Need to turn off unused clusters
- More parallelism → lower clock frequency → lower voltage → low power (∝ CV²f + leakage) in spite of larger area
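The last bullet can be made concrete with a small calculation. All numbers below are hypothetical and normalised, chosen only to illustrate the CV²f + leakage argument; the assumed supply reduction to 0.6 and the assumption that capacitance and leakage scale linearly with the number of clusters are not taken from the slides.

```python
def dynamic_power(cap, vdd, freq):
    """Switching power, proportional to C * V^2 * f (activity folded into cap)."""
    return cap * vdd ** 2 * freq


def total_power(cap, vdd, freq, leakage):
    """Dynamic power plus a fixed leakage term."""
    return dynamic_power(cap, vdd, freq) + leakage


# One cluster at full clock and full voltage (normalised units).
single = total_power(cap=1.0, vdd=1.0, freq=1.0, leakage=0.05)

# Eight clusters: 8x the switched capacitance and leakage, 1/8 the clock,
# and an assumed lower supply voltage made possible by the slower clock.
clusters = 8
parallel = total_power(cap=clusters * 1.0, vdd=0.6, freq=1.0 / clusters,
                       leakage=clusters * 0.05)

# At unchanged voltage the dynamic term would not move at all:
same_voltage = dynamic_power(cap=clusters * 1.0, vdd=1.0, freq=1.0 / clusters)
```

With these numbers the parallel design draws about 0.76 against 1.05 for the single cluster. The whole saving comes from the V² term, and the larger leakage share shows why unused clusters need to be turned off.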

RICE UNIVERSITY Example: Viterbi Decoding
- Add-Compare-Select (ACS): trellis interconnect
  - Re-order for exploiting DP
- Traceback: sequential
  - Use Register Exchange (RE)
Exploiting DP in a programmable architecture implies:
- Re-order ACS
- Re-order RE
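A single register-exchange update can be sketched as below, to show why it parallelises where traceback does not. The sketch assumes a shift-register trellis in which new state j has predecessors (2j) mod N and (2j + 1) mod N and the newest input bit is the top bit of j; that convention and the name `register_exchange_step` are assumptions for illustration, not details from the slides.

```python
def register_exchange_step(registers, decisions):
    """One register-exchange update for an N-state shift-register trellis.

    registers[s] is the survivor bit list of state s.
    decisions[j] is 0 or 1 and selects predecessor (2*j + decisions[j]) % N
    for new state j (the outcome of the add-compare-select for state j).
    The decoded bit appended for state j is its top bit, j >= N // 2.
    """
    n = len(registers)
    new_registers = []
    for j in range(n):                        # independent per state: DP
        pred = (2 * j + decisions[j]) % n
        bit = 1 if j >= n // 2 else 0
        new_registers.append(registers[pred] + [bit])
    return new_registers


regs = [[] for _ in range(4)]                 # 4 states (constraint length 3)
regs = register_exchange_step(regs, [0, 1, 0, 1])
regs = register_exchange_step(regs, [1, 0, 0, 1])
```

Every state updates its own register from one predecessor, so all states can be processed at once, whereas traceback must follow a single path backwards one step at a time.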

RICE UNIVERSITY Re-ordering for parallel Viterbi
[Figure: (a) Trellis: both columns list the states in natural order, X(0), X(1), ..., X(15). (b) Shuffled trellis: one column is ordered X(0), X(2), ..., X(14), X(1), X(3), ..., X(15); the other stays in natural order X(0), X(1), ..., X(15).]
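The effect of the shuffle on add-compare-select can be sketched as follows. The sketch assumes the common shift-register trellis in which new state j has predecessors (2j) mod N and (2j + 1) mod N; the function names and the branch-metric layout `bm[j][t]` are hypothetical, and the re-ordering actually used on SWAPs may differ in detail.

```python
def acs_natural(pm, bm):
    """Add-compare-select in natural state order.

    pm[s]   : path metric of old state s (N states, N a power of two).
    bm[j][t]: branch metric into new state j from its t-th predecessor,
              where the predecessors of j are (2*j) % N and (2*j + 1) % N.
    """
    n = len(pm)
    new_pm, decisions = [], []
    for j in range(n):
        cand0 = pm[(2 * j) % n] + bm[j][0]
        cand1 = pm[(2 * j + 1) % n] + bm[j][1]
        decisions.append(0 if cand0 <= cand1 else 1)
        new_pm.append(min(cand0, cand1))
    return new_pm, decisions


def shuffle(pm):
    """Reorder metrics as X(0), X(2), ..., X(N-2), X(1), X(3), ..., X(N-1)."""
    return pm[0::2] + pm[1::2]


def acs_shuffled(pm_shuffled, bm):
    """Same update on shuffled metrics: new state j reads positions
    j % (N/2) and j % (N/2) + N/2, a fixed stride-free access pattern
    that is identical for every state and so maps onto parallel clusters."""
    n = len(pm_shuffled)
    half = n // 2
    new_pm, decisions = [], []
    for j in range(n):                        # conceptually parallel
        k = j % half
        cand0 = pm_shuffled[k] + bm[j][0]
        cand1 = pm_shuffled[k + half] + bm[j][1]
        decisions.append(0 if cand0 <= cand1 else 1)
        new_pm.append(min(cand0, cand1))
    return new_pm, decisions


pm = [5, 1, 4, 2, 7, 0, 3, 6]
bm = [[(j + t) % 3 for t in range(2)] for j in range(8)]
result = acs_shuffled(shuffle(pm), bm)
```

Both versions return the new metrics in natural order, so the shuffle has to be applied again before the next trellis stage.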

RICE UNIVERSITY Viterbi reconfiguration
- Packet 1: constraint length 7 (16 clusters)
- Packet 2: constraint length 9 (64 clusters)
- Packet 3: constraint length 5 (4 clusters)
[Figure labels: DP; Can be turned OFF]
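The cluster counts on this slide follow a simple pattern that can be checked directly. A constraint-length-K convolutional code has 2^(K-1) trellis states; the figure of four states per cluster is inferred from the three numbers on the slide and is not stated there. The function names are illustrative.

```python
def trellis_states(constraint_length: int) -> int:
    """A constraint-length-K convolutional code has 2**(K-1) trellis states."""
    return 1 << (constraint_length - 1)


def active_clusters(constraint_length: int, states_per_cluster: int = 4) -> int:
    """Clusters kept on for one packet; the remainder can be turned off.

    states_per_cluster = 4 is inferred from the slide's figures
    (K = 7 -> 16 clusters, K = 9 -> 64, K = 5 -> 4), not stated in it.
    """
    return trellis_states(constraint_length) // states_per_cluster


for k in (7, 9, 5):
    print(k, trellis_states(k), active_clusters(k))
```

On a machine sized for constraint length 9, a constraint-length-5 packet would therefore leave 60 of 64 clusters idle, which is the motivation for turning them off.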

RICE UNIVERSITY [Figure: kernels (computation) and memory accesses for three 64-bit, rate 1/2 packets: Packet 1 with constraint length 7, Packet 2 with constraint length 9, Packet 3 with constraint length 5]

RICE UNIVERSITY Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

RICE UNIVERSITY Viterbi decoding: Comparisons
[Figure: comparison of DSP C64x (w/o co-proc), DSP (RE), SWAP and FPGA; annotations: DP, task pipelining, dedicated interconnect, 128 KHz (1 bit/cycle)]
*VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

RICE UNIVERSITY Salient features of this solution
- Any constraint length
  - 10 MHz at 128 Kbps
- Same code for all constraint lengths
  - No need to re-compile or load another code
  - As long as the parallelism/cluster ratio is constant
- Exploiting parallelism at 3 levels for real-time:
  - Instruction Level Parallelism (DSP)
  - Subword Parallelism (DSP)
  - Data Parallelism (SWAP)

RICE UNIVERSITY Problems
- Suitable for handsets? Not yet!
  - Still too general
  - Not low power enough!
- No special customization for the application
  - Except for a fixed-point architecture
- Generic instruction set
- Generic ALUs (though they can be powered down)
- Generic inter-cluster communication network

RICE UNIVERSITY Outline
- On-line arithmetic for dynamic truncation
- SWAPs
- Hybrid SWAPs (H-SWAPs)

RICE UNIVERSITY H-SWAPs
- Trade Data Parallelism for Task Pipelining
- Customize each mini-SWAP
[Figure: SWAPs (max. clusters and reconfigure) compared with a mini-SWAP (limit clusters, limited DP) and with H-SWAPs, a collection of customized mini-SWAPs, each with limited DP and its own mix of adders (+) and multipliers (*)]

RICE UNIVERSITY Work in progress
- How to trade off task vs. data parallelism?
- Power estimation for SWAPs (actual numbers)
- Comparisons with ASIC solutions in terms of area-time-power
- Evaluation of specialized inter-cluster communication
- Specialized instructions (ACS) and arithmetic units (on-line)
I am looking for jobs!!!
