Published by Eugene George. Modified over 9 years ago.
1
RICE UNIVERSITY
SWAPs: Re-thinking mobile and base-station architectures
Sridhar Rajagopal
VLSI Signal Processing Group, Center for Multimedia Communication
Department of Electrical and Computer Engineering
Rice University, Houston TX 77005
March 23, 2003
This work has been supported in part by Nokia, TI, TATP and NSF.
2
Future wireless devices
- High-data-rate mobile devices with multimedia
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
[Figure labels: Bluetooth / home networks, wireless cellular, wireless LAN]
3
Aim of the talk
How do I build such a device?
- Challenges
- Constraints
- Solutions
4
Trend comparisons
5
Change in flexibility requirements
[Figure: physical, MAC, network and application layers placed on a scale from "no change (already flexible)" to "maximum change (needs to support multiple environments, algorithms and standards)"]
6
Summary of challenges
- Sophisticated algorithms (GOPs of computation)
- 10s of Mbps at < 500 mW
- Flexibility required at the physical layer: multiple algorithms, multiple standards, multiple environments
What we would also like:
- Time to market
- Rapid evaluation and implementation
- Scalable architecture design methodologies
7
Physical layer of a receiver
[Figure labels: antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/network/OS) layers]
The receiver is more complex than the transmitter.
8
Physical layer architecture
[Figure labels: analog RF, analog baseband, audio, A/D, D/A, digital baseband, DSP, ASICs, controller]
Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs", TI Report SPRA650, March 2000.
9
Architecture trade-offs
- Past: more DSP + less ASIC
- Current "proposed" solutions: less DSP + more ASICs
- Reason: DSPs are not powerful enough
- Can't we build better DSPs?
[Figure: ASIC, intermediate and programmable solutions compared on area-time-power performance and flexibility]
10
Can this methodology scale?
- Baseband is increasingly important for real-time operation and power
- Need much more flexibility: environment-specific, sophisticated algorithms
- Cannot keep adding co-processors: that loses the flexibility of a programmable solution
- 1 Mbps on a 100 MHz processor leaves 100 cycles per bit to do all your work (GOPs/bit)
- Power consumption grows with bigger color displays, video and more complex algorithms
- May have only ~100 mW for baseband
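The cycle budget on this slide is a single division, sketched below so it can be re-run for other clock and data rates. The function name is hypothetical and not part of the original talk.

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> float:
    """Processor cycles available for each received bit."""
    return clock_hz / bit_rate_bps


# Slide figures: a 100 MHz processor receiving 1 Mbps.
budget = cycles_per_bit(100e6, 1e6)
print(budget)  # 100.0
```

Every stage of the receiver (estimation, detection, decoding) has to share this budget, which is why the slide treats it as a hard constraint.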
11
Motivation
Now that we know the challenges and constraints, design me
12
Design
- How do we choose the right algorithms? The right amount of flexibility?
- Do we build DSPs, ASICs, heterogeneous or reconfigurable architectures?
- If ASICs, how do we build better ASICs?
- If programmable, how do we build better DSPs?
- If both, how do we mix them better?
Answers depend on:
- the level of flexibility needed
- area-time-power architecture trade-offs
13
My contributions
- "Low-complexity" algorithms for wireless: parallel, fixed-point algorithms for multiuser estimation and detection
- ASIC design for wireless using computer arithmetic techniques: dynamic truncation using on-line arithmetic
- Programmable architecture design for wireless: Scalable Wireless Application-specific Processors (SWAPs)
14
Programmable architectures
Current DSPs:
- Not enough functional units (FUs) for GOPs of computation
- Cannot extend to more FUs
- Limited instruction-level parallelism (ILP)
- Limited subword parallelism (SP)
- Cannot support more registers (register area increases quadratically with FUs)
- Compilers: difficult to find ILP as FUs increase
15
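Solution
- Exploit data parallelism (DP)
- Lots available in wireless algorithms
Example:
    int a[1024], b[1024], c[1024];        /* 32-bit operands */
    short int d[1024], e[1024], f[1024];  /* 16-bit operands, packed */
    int i;
    for (i = 0; i < 1024; i++) {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }
[Figure labels: ILP, DP, SP. The two independent statements in the loop body expose ILP, the independent loop iterations expose DP, and the packed 16-bit operands expose SP.]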
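To make the subword parallelism (SP) in the example concrete, the sketch below adds two 16-bit values at once inside one 32-bit word, the way a packed-add instruction would. It uses a standard bit-masking trick to stop carries crossing between lanes; the helper names are hypothetical and the talk does not prescribe this particular method.

```python
LANE_MSB = 0x80008000  # top bit of each 16-bit lane
LANE_LOW = 0x7FFF7FFF  # remaining 15 bits of each lane


def pack16(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int) -> tuple:
    """Split a 32-bit word back into its two 16-bit lanes."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add16(x: int, y: int) -> int:
    """Add both 16-bit lanes at once; each lane wraps modulo 2**16."""
    # Add the low 15 bits of each lane; the carry stops at the lane's top bit.
    low_sum = (x & LANE_LOW) + (y & LANE_LOW)
    # Restore the top bit of each lane without propagating a carry out of it.
    return (low_sum ^ ((x ^ y) & LANE_MSB)) & 0xFFFFFFFF


print(unpack16(packed_add16(pack16(1, 2), pack16(3, 4))))  # (4, 6)
```

One 32-bit addition therefore does the work of two `d[i] = e[i] + f[i]` iterations, which is the saving the "packed" comment in the example refers to.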
16
DSP vs. SWAPs
[Figure: a DSP (1 cluster) shown as one group of adders (+) and multipliers (*) attached to internal memory, exploiting ILP; SWAPs (max. clusters) shown as many such clusters attached to a stream register file, exploiting ILP and DP]
17
Builds on the Imagine media processor
18
SWAPs trade-offs
- Same internal memory size as DSPs: dependent on application, not architecture
- Needs more area to support more functional units, but area is less of a constraint than power
- Varying levels of DP in applications: needs reconfiguration! Need to turn off unused clusters
- More parallelism → lower clock frequency → lower voltage → lower power (CV²f + leakage) in spite of larger area
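The last trade-off can be sketched numerically with the dynamic-power term CV²f. The model below is an illustration under stated assumptions, not a result from the talk: switched capacitance is assumed to grow linearly with the number of clusters, the clock is assumed to drop by the same factor for equal throughput, leakage is ignored, and the two supply voltages are example values.

```python
def dynamic_power(c_switched: float, v_supply: float, f_clock: float) -> float:
    """Dynamic power C * V^2 * f (leakage not modelled)."""
    return c_switched * v_supply ** 2 * f_clock


def parallel_power_ratio(n_clusters: int, v_base: float, v_scaled: float) -> float:
    """Power of an n-cluster design relative to a 1-cluster design.

    Assumes n times the capacitance at 1/n of the clock, so the C*f product
    is unchanged and all of the saving comes from the lower supply voltage.
    """
    base = dynamic_power(1.0, v_base, 1.0)
    scaled = dynamic_power(float(n_clusters), v_scaled, 1.0 / n_clusters)
    return scaled / base


# Example: 4 clusters, supply lowered from 1.5 V to 1.2 V (assumed values).
print(round(parallel_power_ratio(4, 1.5, 1.2), 2))  # 0.64
```

Under these assumptions the larger design uses about 64% of the power, because the lower clock is what allows the lower voltage.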
19
Design methodology
[Figure: design-flow chart. Boxes: chain of receiver algorithms; low-"complexity", parallel, fixed-point algorithms; high-level language implementation; programmable implementation; modular programmable architecture design; ASIC implementation; FPGA, customized, reconfigurable and heterogeneous implementations. Examples shown: Pentium, DSP, SWAPs; H-SWAPs. Decision points test the area-time-power specs, with "no" branches and a "learn" feedback path.]
20
Choosing the right algorithms: theory
- Algorithm research: spectral efficiency, low power (RF)
- Metrics: bit error rate, frame error rate
[Figure: bit error rate (10^-8 to 10^0) versus signal-to-noise ratio, with curves labelled past, current, future and theory]
21
Choosing the right algorithms: practice
- Refine candidates from theory (using linear algebra / optimization): lower "complexity", parallel, fixed-point
- Optimization: Area: A, Time: B, Power: A, Energy: A/B
- Multi-parameter optimization?
- "Complexity": number of operations of equivalent type
[Figure: bar chart (scale 0 to 80) comparing the original algorithm, candidate A and candidate B on complexity, complexity/parallelism and execution time]
22
Example: parallel Viterbi decoding
- Add-Compare-Select (ACS): trellis interconnect; re-order to exploit DP
- Parallelism depends on constraint length (number of states)
- Conventional traceback is sequential; use Register Exchange (RE) for a parallel solution
- Exploiting DP in a programmable architecture implies: re-order ACS, re-order RE
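The sketch below shows add-compare-select with register exchange on a deliberately small code, to illustrate why the per-state loop is data parallel. Everything specific here is an assumption for illustration: constraint length 3, generators 7 and 5 (octal), hard-decision Hamming metrics, and the function names. It is not the implementation evaluated in the talk.

```python
K = 3                    # constraint length (assumed; the talk uses 5, 7, 9)
G = (0b111, 0b101)       # rate-1/2 generators (assumed: octal 7 and 5)
N_STATES = 1 << (K - 1)
STATE_MASK = N_STATES - 1


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def branch_output(state: int, bit: int) -> tuple:
    """Two coded bits produced when `bit` enters the encoder in `state`."""
    reg = (bit << (K - 1)) | state
    return tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        out.extend(branch_output(state, b))
        state = ((b << (K - 1)) | state) >> 1
    return out


def decode(received):
    """Hard-decision Viterbi decoding with register exchange."""
    metrics = [0] + [float("inf")] * (N_STATES - 1)  # encoder starts in state 0
    registers = [[] for _ in range(N_STATES)]        # survivor bits per state
    for t in range(0, len(received), 2):
        r = tuple(received[t:t + 2])
        new_metrics, new_registers = [], []
        # Each iteration of this loop is independent: this is the DP.
        for nxt in range(N_STATES):
            bit = nxt >> (K - 2)                     # input bit that leads to nxt
            candidates = []
            for j in (0, 1):
                prev = ((nxt << 1) & STATE_MASK) | j
                dist = sum(e != x for e, x in zip(branch_output(prev, bit), r))
                candidates.append((metrics[prev] + dist, prev))  # add
            best_metric, best_prev = min(candidates)             # compare-select
            new_metrics.append(best_metric)
            # Register exchange: copy the survivor's bits, append the new bit.
            new_registers.append(registers[best_prev] + [bit])
        metrics, registers = new_metrics, new_registers
    best_state = min(range(N_STATES), key=metrics.__getitem__)
    return registers[best_state]
```

Because every state carries its own decoded-bit register, no sequential traceback pass is needed at the end; the cost is copying a register per state per step, which is the re-ordering burden the slide refers to.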
23
SWAP design
- Decide how many clusters: exploit DP; look at the for() loop count
- Decide what to put within each cluster: maximize ILP with high functional unit efficiency
- Search the design space: see how it meets time-area-power constraints
24
What goes inside a cluster?
25
Re-ordering for parallel Viterbi
[Figure: (a) trellis, with states X(0) to X(15) in natural order on both sides; (b) shuffled trellis, with the states on one side re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15) and the other side in natural order]
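One way to read the shuffled ordering in the figure (even-indexed states first, then odd-indexed states) is sketched below. It assumes a shift-register trellis in which new state n is fed by old states 2n mod S and 2n+1 mod S; that convention and the function names are assumptions, since the slide shows only the orderings.

```python
def shuffle_states(values):
    """Even-indexed entries first, then odd-indexed entries."""
    return values[0::2] + values[1::2]


def predecessor_positions(state: int, n_states: int) -> tuple:
    """Positions, in the shuffled array, of the two states feeding `state`."""
    half = n_states // 2
    m = state % half
    return m, m + half


shuffled = shuffle_states(list(range(16)))
print(shuffled)  # [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```

After the shuffle, the two inputs of every ACS operation sit at the same offset in the two halves of the array, so the same instruction sequence can run across all clusters with regular, stride-free reads.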
26
Viterbi reconfiguration
- Packet 1: constraint length 7 (16 clusters)
- Packet 2: constraint length 9 (64 clusters)
- Packet 3: constraint length 5 (4 clusters)
[Figure labels: DP; unused clusters can be turned OFF]
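The cluster counts on this slide are consistent with a fixed number of trellis states per cluster. The sketch below assumes 2^(K-1) states for constraint length K and four states per cluster; the value four is inferred from the slide's numbers (for example 64 states on 16 clusters), not stated in the talk.

```python
def viterbi_states(constraint_length: int) -> int:
    """Number of trellis states for a given constraint length."""
    return 1 << (constraint_length - 1)


def clusters_needed(constraint_length: int, states_per_cluster: int = 4) -> int:
    """Clusters kept active, assuming a fixed states-per-cluster ratio."""
    return viterbi_states(constraint_length) // states_per_cluster


for k in (7, 9, 5):
    print(k, clusters_needed(k))  # 7 16, then 9 64, then 5 4
```

Keeping this ratio fixed is what lets the number of active clusters change from packet to packet while the per-cluster work stays the same.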
27
How to reconfigure?
- Move data to the appropriate clusters and turn off unused clusters and the SRF: significant loss in performance, maximum power savings
- Use conditional streams: cannot turn off the SRF, comm and scratchpad in clusters; minimal loss in performance
- Use mux-demux buffers: can turn off clusters entirely (more power savings); minimal loss in performance
28
[Figure: timeline of kernels (computation) and memory accesses for three 64-bit, rate-1/2 packets: packet 1 with constraint length 7, packet 2 with constraint length 9, packet 3 with constraint length 5]
29
Viterbi decoding: rate 1/2 at 128 Kbps runs in real time at ~10 MHz
30
Viterbi decoding: execution time
[Figure: execution-time comparison against real-time performance. Labels: ideal DSP C64x (without co-processor), DSP (RE), SWAP, ASIC/FPGA*, 128 KHz (1 bit/cycle), DP, task pipelining, dedicated interconnect]
*M. Vaya, J. R. Cavallaro, "VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong.
31
Salient features of this solution
- Any constraint length: 10 MHz at 128 Kbps (handset)
- Same code for all constraint lengths: no need to re-compile or load another code as long as the parallelism/cluster ratio is constant
- Exploiting parallelism for real-time:
  - Instruction-level parallelism (DSP)
  - Subword parallelism (DSP)
  - Data parallelism (Imagine)
  - Dynamic cluster scaling (SWAP)
- Power savings due to dynamic cluster scaling
32
Expected SWAP power numbers
Viterbi decoding, 64 clusters and 1 multiplier per cluster:
- Process: 0.13 micron
- Voltage: 1.5 V (to min. leakage when not active)
- Real-time frequency: f ~ 10 MHz
- Peak active power: ~16 mW/MHz (11 mW/MHz if 1.2 V)
- Area: ~53.7 mm²
At 10 MHz, 128 Kbps:
- ~160 (110) mW for K = 9
- ~53.33 (36.7) mW for K = 7
- ~26.67 (12.5) mW for K = 5
- ASICs: ~10-100 W
*Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors", Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164.
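The K = 9 figures follow directly from the quoted power per MHz and the real-time clock. The sketch below reproduces that arithmetic and also derives an energy per decoded bit; the energy value is computed here from the slide's numbers and does not appear in the talk, and the function names are hypothetical.

```python
def active_power_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Active power at a given clock, from a per-MHz power figure."""
    return mw_per_mhz * clock_mhz


def energy_per_bit_uj(power_mw: float, bit_rate_kbps: float) -> float:
    """Energy per bit in microjoules (mW divided by kbit/s gives uJ/bit)."""
    return power_mw / bit_rate_kbps


p_15v = active_power_mw(16, 10)   # 1.5 V figure: 160 mW
p_12v = active_power_mw(11, 10)   # 1.2 V figure: 110 mW
print(p_15v, p_12v, energy_per_bit_uj(p_15v, 128))  # 160 110 1.25
```

The K = 7 and K = 5 figures also depend on how many clusters are switched off, which this sketch does not attempt to model.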
33
Problems
Suitable for handsets? Not yet!
- Still too general
- Not low power enough!
- No special customization for the application, except for a fixed-point architecture
- Generic instruction set
- Generic ALUs (though they can be powered down)
- Generic inter-cluster communication network
Suitable for base-stations? Why not? Power is not a primary constraint.
34
Multiuser estimation-detection + decoding
[Figure: frequency needed to attain real time (in MHz) versus number of clusters, with curves labelled fast, medium and slow, and the 32-user 3G base-station and handset cases marked]
Real-time target: 128 Kbps per user
35
Expected power numbers
32-user base-station with 3 multipliers per cluster and 64 clusters:
- Process: 0.13 micron
- Voltage: 1.2 V (always active, leakage less important)
- Real-time frequency: f ~ 1 GHz
- Peak active power: ~19.88 mW/MHz (increased *)
- Area: ~93.4 mm²
Total base-station power consumption: ~19.88 W at 1 GHz for 32 users at 128 Kbps/user
36
H-SWAPs
- Trade data parallelism for task pipelining
- Customize each SWAPlet
[Figure: SWAPs (max. clusters and reconfigure); a SWAPlet (limit clusters, limited DP); H-SWAPs (a collection of customized SWAPlets, each with limited DP and a different mix of adders (+) and multipliers (*))]
37
Viterbi decoding
- Survivor management is serial
- Finding a parallel solution for SWAPs is expensive: > 50% of execution time is overhead
- A serial solution is now possible with H-SWAPs
[Figure: H-SWAPs for Viterbi decoding, with an ACS unit (ACS clusters, limited DP) and a traceback unit (TBU)]
38
Potential advantages
[Figure: two performance comparisons against real-time performance. SWAPs panel labels: DSP (RE), SWAP, ASIC/FPGA, DP, task pipelining, dedicated interconnect. H-SWAPs panel labels: DSP (RE), H-SWAP, ASIC/FPGA, partial DP + task pipelining, application-specific units, dedicated interconnect]
39
Current research
- How to trade off task vs. data parallelism?
- Evaluation of specialized inter-cluster communication
- Integrating specialized arithmetic units (ACS, on-line)
- Area-time-power efficiency of handset SWAPs
- Learning to migrate from H-SWAPs to SWAPs
- Scale to future systems!
40
Future research: efficient algorithms
41
Future research: architectures
- Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
- Some other potential applications:
  - Image processing: cameras (variety of compression algorithms)
  - Biomedical applications: hearing aids (DSP running on body heat*)
  - Sensor networks
*Quote: Gene Frantz, TI Fellow
42
Conclusions
- Exciting times for wireless algorithm and architecture research:
  - More complex algorithms
  - Higher data rates (meet real-time requirements)
  - Lower power
  - Low area
- Seek to design flexible architectures; learn from ASIC solutions
- Inter-disciplinary research needed: computer architecture, VLSI, wireless communications, computer arithmetic, compilers