Published by Ralf Griffith. Modified over 8 years ago.
1
RICE UNIVERSITY. Handset architectures. Sridhar Rajagopal, sridhar@rice.edu, http://www.ece.rice.edu/~sridhar [Figure labels: ASICs, Programmable] Support for this work, in part by Nokia, TI and NSF, is gratefully acknowledged.
2
2G handsets: ASIC for compute-intensive operations (spreading etc.); DSP for most of the baseband; microcontroller for higher layers. Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPS," TI Report SPRA650, March 2000.
3
Proposed 3G handsets: the number of co-processors increases because DSPs are unable to do most of the baseband. Sources: (TI) U. Ko, M. McMahan and E. Auslander, "DSP for the third generation wireless communications," International Conference on Computer Design, 1999, pp. 516-520; (VIA) H. Chen, "Introduction to W-CDMA SoC design approach," VIA Technologies, August 2002, www.itpilot.org.tw/provisional/910802/INTRODUCTION%20TO%20WCDMA%20SOC%20.PDF
4
Motivation: How does this scale? Do we need a DSP, or should we build ASICs? If ASICs, how do we build better ASICs? If programmable, how do we build better DSPs? If both, how do we mix them better? The answers depend on the level of programmability needed and on area-time-power architecture tradeoffs.
5
Rice innovations for ASICs and DSPs. ASICs: on-line arithmetic for dynamic truncation. Programmable: Scalable Wireless Application-specific Processors (SWAPs). Mix and match: Hybrid SWAPs (H-SWAPs). [Figure labels: ASICs, Programmable]
6
Outline: on-line arithmetic for dynamic truncation; SWAPs; H-SWAPs.
7
ASIC designs use finite-precision arithmetic: faster, low power, low area. How to keep finite precision bounded: saturation, truncation.
8
Keeping precision bounded. Examples of truncation: multiplication by the step size in gradient descent; sign detection. Examples of saturation: avoiding overflows; cases where the probability of useful MSBs is low.
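The two ways of bounding precision named on this slide can be sketched on plain integers. This is a minimal illustration, not the hardware described in the talk: the helper names `saturate` and `truncate_lsbs`, the 16-bit word width and the step size of 2**-4 are all assumptions chosen for the example.

```python
def saturate(value, bits):
    """Clamp a signed integer into the range of a `bits`-bit two's-complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def truncate_lsbs(value, drop):
    """Discard the `drop` least significant bits (arithmetic shift, rounds toward -infinity)."""
    return value >> drop


# Saturation: a 16-bit accumulator that would overflow is pinned at the largest value.
acc = saturate(30000 + 10000, 16)

# Truncation: scaling a gradient by an assumed step size of 2**-4 is a shift
# that drops the 4 LSBs of the product.
weight, gradient = 1000, 37
weight_next = weight - truncate_lsbs(gradient, 4)
```

Saturation acts on the MSB side of the word and truncation on the LSB side, which is why the order in which an arithmetic unit produces its digits matters for each of them.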
9
Dynamic precision requirements. Precision needs change with algorithms and SNR; adapting the hardware dynamically saves power (25-35% power reduction possible). Dynamic saturation vs. dynamic truncation: saturation is easy (as LSBs come first), truncation is difficult; saturation causes no error, truncation causes significant error; saturation gives throughput benefits, truncation gives none.
10
On-line arithmetic for dynamic truncation: works most significant digit (MSD) first, a natural way of truncating; digit-serial, which allows dynamic truncation; redundant number system, so the error is only in the least significant digit (LSD); throughput benefits, as it is digit-serial.
11
Example for sign detection. [Figure: (a) truncated conventional arithmetic: products a_i * b_i, tree addition over levels 1 to log(d), result at time t_CONV-MF; (d) dynamically truncated on-line arithmetic: the sign is determined after 2 MSDs at time t_OL, at which point the computation stops; the full on-line result would take time t_OL-MF.]
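The early stop in the on-line case can be illustrated in software. The sketch below assumes a radix-2 signed-digit fraction with digits in {-1, 0, 1}, delivered most significant digit first; the slides do not state the radix or digit set, so that choice and the names `sd_value` and `online_sign` are assumptions for illustration. For a finite digit string in this number system, the digits after the first nonzero one sum to strictly less than one unit of its position, so the sign is known as soon as that digit arrives.

```python
from fractions import Fraction


def sd_value(digits):
    """Exact value of a signed-digit fraction: sum of d_i * 2**-i, MSD first."""
    return sum(Fraction(d, 2 ** i) for i, d in enumerate(digits, start=1))


def online_sign(digits):
    """Return (sign, digits_consumed), reading digits MSD first and stopping early."""
    for consumed, d in enumerate(digits, start=1):
        if d not in (-1, 0, 1):
            raise ValueError("digits must be -1, 0 or 1")
        if d != 0:
            # The remaining digits cannot outweigh this one, so the sign is final.
            return (1 if d > 0 else -1), consumed
    return 0, len(digits)


digits = [0, 1, -1, -1, 0, 1]           # value 5/64
sign, consumed = online_sign(digits)    # sign known after 2 of the 6 digits
```

A conventional LSB-first unit would have to produce every digit before the sign bit is available; here the remaining digits never need to be computed.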
12
Throughput comparisons
13
Area comparisons
14
ASIC design conclusion: using on-line arithmetic for dynamic truncation and conventional arithmetic for dynamic saturation, one can design efficient ASICs for handsets. Details: Predrag.
15
Outline: on-line arithmetic for dynamic truncation; SWAPs; H-SWAPs.
16
Programmable architectures. Current DSPs: not enough functional units (FUs). They cannot be extended to more FUs: instruction-level parallelism (ILP) is limited; more registers cannot be supported (register area increases quadratically with FUs); compilers find it difficult to find ILP as FUs increase.
17
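Solution: exploit data parallelism (DP); lots of it is available in wireless algorithms. Example: for (i = 0; i < 1024; i++) { a[i] = b[i] + c[i]; d[i] = b[i] * c[i]; } The two statements in the loop body are independent of each other (ILP); the 1024 iterations are independent of each other (DP).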
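The loop on this slide can be used to show how the two kinds of parallelism divide the work. The following is a sequential simulation only, under the assumption that each cluster has one adder and one multiplier and handles one array element per step; `run_on_clusters` and `num_clusters` are names invented for the sketch.

```python
def run_on_clusters(b, c, num_clusters):
    """Simulate the loop with `num_clusters` clusters working in lock step.

    Returns the two result arrays and the number of lock-step iterations.
    """
    n = len(b)
    a = [0] * n
    d = [0] * n
    steps = 0
    for start in range(0, n, num_clusters):
        # One step: every cluster works on its own element (data parallelism).
        for cluster in range(num_clusters):
            i = start + cluster
            if i < n:
                a[i] = b[i] + c[i]  # adder of this cluster
                d[i] = b[i] * c[i]  # multiplier of this cluster, independent of the add (ILP)
        steps += 1
    return a, d, steps


b = list(range(1024))
c = list(range(1024, 2048))
a1, d1, steps1 = run_on_clusters(b, c, 1)   # single cluster, DSP-like
a8, d8, steps8 = run_on_clusters(b, c, 8)   # eight clusters
```

The results are identical in both runs; only the step count changes, from 1024 with one cluster to 128 with eight.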
18
DSP vs. SWAPs. [Figure: a DSP is 1 cluster of adders (+) and multipliers (*) on an internal memory, exploiting ILP; SWAPs replicate that cluster up to the maximum number of clusters on an internal memory, exploiting ILP within a cluster and DP across clusters.]
19
SWAPs trade-offs. Same internal memory size as DSPs: memory size depends on the application, not the architecture. More area is needed to support more functional units, but area is not a constraint (power is). Applications have varying levels of DP, so reconfiguration is needed: unused clusters must be turned off. More parallelism allows a lower clock frequency, hence a lower voltage, hence low power (CV^2 f + leakage) in spite of the larger area.
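The power argument on this slide can be put into rough numbers. This is a back-of-the-envelope sketch of the dynamic term C * V^2 * f only: leakage is ignored, switched capacitance is assumed to grow linearly with the number of clusters, and the voltage of 0.7 (relative to 1.0) at a quarter of the clock is a hypothetical figure, not one from the talk.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power C * V**2 * f, in arbitrary normalized units."""
    return capacitance * voltage ** 2 * frequency


# One cluster at full clock and full voltage (normalized to 1).
baseline = dynamic_power(1.0, 1.0, 1.0)

# Four clusters at a quarter of the clock but the same voltage: no saving.
same_voltage = dynamic_power(4.0, 1.0, 0.25)

# Four clusters at a quarter of the clock and an assumed lower voltage of 0.7.
parallel = dynamic_power(4.0, 0.7, 0.25)
```

With these assumed numbers the parallel design uses about half the dynamic power of the baseline. The saving comes entirely from the voltage reduction that the slower clock permits, since frequency and capacitance cancel each other out.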
20
Example: Viterbi decoding. Add-Compare-Select (ACS) has a trellis interconnect: re-order it to exploit DP. Traceback is sequential: use Register Exchange (RE) instead. Exploiting DP in a programmable architecture implies re-ordering both ACS and RE.
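To make the ACS and register-exchange steps concrete, here is a small hard-decision decoder. It is a plain software sketch, not the SWAP kernel from the talk: the constraint length 3, the generator polynomials (7, 5 octal), the Hamming branch metric and the function names are assumptions chosen to keep the example short.

```python
def parity(x):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(x).count("1") & 1


def encode(bits, K=3, gens=(0b111, 0b101)):
    """Rate-1/2 convolutional encoder; the state holds the last K-1 input bits."""
    mask = (1 << (K - 1)) - 1
    state, out = 0, []
    for bit in bits:
        reg = (state << 1) | bit
        out.append(tuple(parity(reg & g) for g in gens))
        state = reg & mask
    return out


def viterbi_re(symbols, K=3, gens=(0b111, 0b101)):
    """Viterbi decoding with register exchange instead of traceback."""
    n_states = 1 << (K - 1)
    half = n_states >> 1
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)       # encoder starts in state 0
    registers = [[] for _ in range(n_states)]    # survivor bits, one register per state
    for received in symbols:
        new_metrics = [inf] * n_states
        new_registers = [None] * n_states
        for s in range(n_states):                # iterations are independent: DP
            bit = s & 1                          # input bit that leads into state s
            best_metric, best_prev = None, None
            for prev in (s >> 1, (s >> 1) + half):
                reg = (prev << 1) | bit
                expected = tuple(parity(reg & g) for g in gens)
                # Add: accumulate the Hamming distance of this branch.
                cand = metrics[prev] + sum(e != r for e, r in zip(expected, received))
                # Compare and select: keep the better of the two incoming paths.
                if best_metric is None or cand < best_metric:
                    best_metric, best_prev = cand, prev
            new_metrics[s] = best_metric
            # Register exchange: copy the survivor's register and append the new bit.
            new_registers[s] = registers[best_prev] + [bit]
        metrics, registers = new_metrics, new_registers
    winner = min(range(n_states), key=metrics.__getitem__)
    return registers[winner]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # last two zeros return the encoder to state 0
clean = encode(message)
noisy = list(clean)
noisy[2] = (clean[2][0] ^ 1, clean[2][1])  # flip one channel bit
decoded_clean = viterbi_re(clean)
decoded_noisy = viterbi_re(noisy)
```

Because every state carries its own decoded register forward, the answer is read directly from the winning state at the end, and the loop over states has no dependence between iterations.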
21
Re-ordering for parallel Viterbi. [Figure: 16-state trellis, X(0) to X(15). (a) Trellis: states in natural order X(0), X(1), ..., X(15) on both sides. (b) Shuffled trellis: one side re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15).]
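One reading of the shuffled trellis is sketched below. It assumes the usual shift-register state numbering, in which new state s is reached from old states s >> 1 and (s >> 1) + N/2; under that assumption, storing the new states with the even ones first and the odd ones second lets butterfly p read old positions p and p + N/2 and write new positions p and p + N/2. The numbering convention and the helper names are assumptions, not taken from the slides.

```python
def shuffled_order(n_states):
    """Even-numbered states first, then odd-numbered states."""
    return list(range(0, n_states, 2)) + list(range(1, n_states, 2))


def predecessors(state, n_states):
    """The two old states that feed `state` in a shift-register trellis."""
    half = n_states // 2
    return (state >> 1, (state >> 1) + half)


n_states = 16
order = shuffled_order(n_states)
half = n_states // 2

# For each butterfly p, list the new states stored at positions p and p + half
# together with the old states they read from.
butterflies = [
    (p, order[p], order[p + half], predecessors(order[p], n_states))
    for p in range(half)
]
```

Each tuple in `butterflies` shows that the two new states held at positions p and p + half share the same pair of predecessors, namely old states p and p + half, so every butterfly touches the same two positions on input and output.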
22
Viterbi reconfiguration. Packet 1: constraint length 7 (16 clusters). Packet 2: constraint length 9 (64 clusters). Packet 3: constraint length 5 (4 clusters). The available DP changes with each packet; clusters that are not needed can be turned OFF.
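The cluster counts on this slide follow a simple pattern: a code with constraint length K has 2^(K-1) trellis states, and 16, 64 and 4 clusters for K = 7, 9 and 5 correspond to 4 states per cluster. The sketch below encodes that pattern; the ratio of 4 is inferred from the slide's numbers rather than stated by the author, and `clusters_needed`, `plan` and the 64-cluster maximum are illustrative names and assumptions.

```python
STATES_PER_CLUSTER = 4  # inferred from the slide: 64 states -> 16 clusters, and so on


def clusters_needed(constraint_length):
    """Clusters required to keep 4 trellis states per cluster."""
    n_states = 1 << (constraint_length - 1)
    return max(1, n_states // STATES_PER_CLUSTER)


def plan(constraint_lengths, max_clusters=64):
    """For each packet, return (constraint length, clusters on, clusters off)."""
    schedule = []
    for k in constraint_lengths:
        active = clusters_needed(k)
        if active > max_clusters:
            raise ValueError("constraint length %d needs %d clusters" % (k, active))
        schedule.append((k, active, max_clusters - active))
    return schedule


schedule = plan([7, 9, 5])
```

For the three packets on the slide, this gives 16, 64 and 4 active clusters, with 48, 0 and 60 clusters idle and available to be switched off.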
23
[Figure: kernels (computation) and memory accesses for three 64-bit packets: packet 1, rate 1/2, constraint length 7; packet 2, rate 1/2, constraint length 9; packet 3, rate 1/2, constraint length 5.]
24
Viterbi decoding: rate 1/2 at 128 Kbps needs a 10 MHz clock
25
Viterbi decoding: comparisons. [Figure: comparison of DSP C64x (w/o co-proc), DSP (RE), SWAP and FPGA*; labels: 128 KHz (1 bit/cycle), DP, task pipelining, dedicated interconnect.] *VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding, M. Vaya, J. R. Cavallaro, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong.
26
Salient features of this solution. Any constraint length: 10 MHz at 128 Kbps. The same code serves all constraint lengths: there is no need to re-compile or load other code as long as the parallelism/cluster ratio is constant. Parallelism is exploited at 3 levels for real-time operation: instruction-level parallelism (DSP), subword parallelism (DSP), and data parallelism (SWAP).
27
Problems. Suitable for handsets? Not yet! Still too general, and not low power enough. No special customization for the application, except for a fixed-point architecture: generic instruction set; generic ALUs (though they can be powered down); generic inter-cluster communication network.
28
Outline: on-line arithmetic for dynamic truncation; SWAPs; Hybrid SWAPs (H-SWAPs).
29
H-SWAPs: trade data parallelism for task pipelining, and customize each mini-SWAP. [Figure: SWAPs (max. clusters and reconfigure); a mini-SWAP (limit clusters, limited DP); H-SWAPs (collection of customized mini-SWAPs, each with limited DP and its own mix of adders (+) and multipliers (*)).]
30
Work in progress: how to trade off task vs. data parallelism; power estimation for SWAPs (actual numbers); comparisons with ASIC solutions in terms of area-time-power; evaluation of specialized inter-cluster communication; specialized instructions (ACS) and arithmetic units (on-line). I am looking for jobs!!!