RICE UNIVERSITY DSP architectures for wireless communications Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston.

RICE UNIVERSITY
DSP architectures for wireless communications
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston, TX
ECE Pizza Talk, March 28, 2003
This work has been supported in part by Nokia, TI, TATP and NSF.

RICE UNIVERSITY 2 Future wireless devices
- High data rate mobile devices with multimedia
- Multiple antennas with complex algorithms, GOPs of computation
- Area-Time-Power constraints
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
[Figure: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN]

RICE UNIVERSITY 3 Aim of the talk Design me

RICE UNIVERSITY 4 Trends FLEXIBILITY

RICE UNIVERSITY 5 Change in flexibility requirements
[Figure: protocol stack (Physical Layer, MAC Layer, Network Layer, Application Layer) on a scale from "No change (already flexible)" to "Maximum change (needs to support multiple environments, algorithms and standards)"]

RICE UNIVERSITY 6 Architecture trade-offs
Past: more DSP + less ASIC. Current: less DSP + more ASIC.
Reason: need less flexibility OR DSPs not powerful enough?
Can't we build better DSPs? How much flexibility do we need?
[Figure: spectrum of ASICs, Intermediate, Programmable, with labels Area-Time-Power benefits, Flexibility, Time-to-market, Software updates]

RICE UNIVERSITY 7 Problems with current DSPs
- Current DSPs
- Not enough functional units (FUs) for GOPs of computation
- Need 100's of FUs
- Not low power enough!!
- Cannot extend to more FUs
- Limited Instruction Level Parallelism (ILP)
- Limited Subword Parallelism (such as MMX)
- Cannot support more registers (area, ports)
- Compilers: difficult to find ILP as FUs increase

RICE UNIVERSITY 8 Scalable Wireless Application-specific Processors (SWAPs)
- Exploit data parallelism (DP)
- Available in many wireless algorithms
- This is what ASICs do!!
- Example:
    int i, a[1024], b[1024], c[1024];      // 32 bits
    short int d[1024], e[1024], f[1024];   // 16 bits packed
    for (i = 0; i < 1024; ++i) {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }
- Parallelism labeled on the example: ILP, DP, Subword
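The "16 bits packed" part of the slide's example can be sketched in portable C. This is a minimal illustration, not the talk's implementation: the names `packed_add16` and `packed_vector_add` are hypothetical, and a real DSP or cluster would use a native subword add instruction rather than this bit trick.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two pairs of 16-bit values packed into 32-bit words.
   Each 16-bit lane wraps on overflow, and no carry crosses from the
   low lane into the high lane. */
uint32_t packed_add16(uint32_t x, uint32_t y)
{
    const uint32_t low15 = 0x7FFF7FFFu;         /* lower 15 bits of each lane */
    uint32_t sum = (x & low15) + (y & low15);   /* carries stay inside a lane */
    return sum ^ ((x ^ y) & ~low15);            /* restore each lane's top bit */
}

/* Data-parallel loop: every iteration is independent of the others, so the
   iterations could be spread across clusters. n_words counts packed words,
   i.e. two 16-bit elements per word. */
void packed_vector_add(uint32_t *d, const uint32_t *e, const uint32_t *f,
                       size_t n_words)
{
    for (size_t i = 0; i < n_words; ++i)
        d[i] = packed_add16(e[i], f[i]);
}
```

The loop carries the data parallelism, and the packed add carries the subword parallelism. A loop body with several independent statements, as in the slide, would additionally expose ILP.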

RICE UNIVERSITY 9 SWAPs: stream processors for wireless
- Kernels (computation) and streams (communication)
- Operations on kernels use local data
- Streams expose data parallelism
- Imagine stream processor at Stanford
[Figure: kernels connected by streams, from Input Data to Output Data. Labels: received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits]

RICE UNIVERSITY 10 DSP vs. SWAPs
[Figure: DSP (1 cluster) with Internal Memory, labeled ILP, next to SWAPs (max. clusters; all clusters same and do same operations) with a Stream Register File (SRF), labeled ILP and DP. Each cluster is drawn as adders (+) and multipliers (*).]

RICE UNIVERSITY 11 Arithmetic clusters
- FUs (+,*,/)
- Scratch-pad (Sp)
  - Indexed accesses
- Comm. unit (CU)
  - Intercluster comm.
- Distributed reg. files
  - more FUs
[Figure: cluster with FUs (+, *, /), Sp and CU, each with a Local Register File, joined by a Cross Point; connections From/To SRF and to the Intercluster Network]

RICE UNIVERSITY 12 SWAPs vs. DSPs trade-offs
- Same internal memory size as DSPs
  - Dependent on application, not architecture
- Needs more area to support more functional units
  - Area is less of a constraint than power
- Varying levels of DP in applications
  - Needs reconfiguration!!
  - Need to turn off unused clusters (and FUs)
- More parallelism -> lower clock frequency -> lower voltage -> low power ($\propto CV^2 f$ + leakage) in spite of larger area
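The last bullet of this slide can be worked through to first order. The symbols $N$ (number of clusters sharing the work) and $V'$ (the reduced supply voltage) are introduced here for illustration, and the assumption that switched capacitance grows linearly with $N$ is a simplification, not a figure from the talk.

```latex
P_{\text{dyn}} \propto C V^{2} f
\quad\Longrightarrow\quad
\frac{P'_{\text{dyn}}}{P_{\text{dyn}}}
  = \frac{(N C)\, V'^{2}\, (f/N)}{C\, V^{2}\, f}
  = \left(\frac{V'}{V}\right)^{2}
```

Under these assumptions, spreading the work over more clusters does not by itself reduce dynamic power. The saving comes from the lower voltage that the lower clock frequency permits. Leakage grows with area, which is one reason unused clusters need to be turned off.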

RICE UNIVERSITY 13 Design methodology
[Figure: design flow. Boxes: Chain of receiver algorithms; Low "complexity", parallel, fixed point; High level language implementation; Modular programmable architecture design (DSP, SWAPs); H-SWAPs; FPGA, customized, reconfigurable, heterogeneous designs; ASIC design. Arrows labeled "learn". Annotations: Architecture exploration, Flexibility-performance tradeoffs]

RICE UNIVERSITY 14 Physical layer of wireless receivers
[Figure: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher (MAC/Network/OS) Layers]
Receiver more complex than transmitter

RICE UNIVERSITY 15 Algorithms for
- Multiple antenna systems (MIMO systems)
- Complexity exponential with transmit * receive antennas
- Wide range of extremely complex algorithms
- Optimal depends on fading, mobility, bandwidth, antennas
- GOPs of computations
- Estimation: Linear MMSE, blind, conjugate gradient...
- Detection: FFT, (blind) interference cancellation...
- Decoding: Viterbi, Turbo, LDPC...
- Implement ALL of them AND the NEXT one in line
- Use the best for the situation
Example for concept demonstration: Viterbi decoding

RICE UNIVERSITY 16 Parallel Viterbi Decoding
- 1. Add-Compare-Select (ACS): trellis interconnect
  - Parallelism depends on constraint length (#states)
- 2. Conventional Traceback
  - Sequential (no DP)
  - Difficult to implement in parallel architecture
  - Use Register Exchange (RE) -> parallel solution
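A minimal sketch of one trellis stage that combines Add-Compare-Select with Register Exchange may help make the two steps concrete. The function name `acs_re_step`, the branch-metric layout, and the state-numbering convention are assumptions made for illustration and are not taken from the talk. Path-metric normalization is omitted, and each survivor register simply keeps the 32 most recent decoded bits.

```c
#include <stdint.h>

/* One trellis stage of Viterbi decoding with register exchange.
   Assumed state convention: next = (prev >> 1) | (bit << (K - 2)),
   so states 2j and 2j+1 both feed states j and j + n_states/2.
   n_states must be an even number (a power of two in practice). */
void acs_re_step(int n_states,
                 const int *old_metric,     /* [n_states] path metrics      */
                 const int *branch_metric,  /* [2 * n_states], 2*prev + bit */
                 const uint32_t *old_path,  /* [n_states] survivor bits     */
                 int *new_metric,           /* [n_states] output            */
                 uint32_t *new_path)        /* [n_states] output            */
{
    int half = n_states / 2;

    /* Iterations are independent of each other: this is the data
       parallelism that grows with the number of states. */
    for (int next = 0; next < n_states; ++next) {
        int bit = next / half;          /* input bit leading into `next` */
        int p0 = 2 * (next % half);     /* the two candidate predecessors */
        int p1 = p0 + 1;

        /* Add */
        int m0 = old_metric[p0] + branch_metric[2 * p0 + bit];
        int m1 = old_metric[p1] + branch_metric[2 * p1 + bit];

        /* Compare and select (ties keep p0) */
        int best = (m1 < m0) ? p1 : p0;
        new_metric[next] = (m1 < m0) ? m1 : m0;

        /* Register exchange: copy the winner's survivor, append the bit */
        new_path[next] = (old_path[best] << 1) | (uint32_t)bit;
    }
}
```

Because every state carries its own survivor register forward, no sequential traceback pass is needed, at the cost of copying a register per state per stage.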

RICE UNIVERSITY 17 Re-ordering for parallel Viterbi
[Figure a. Trellis: states X(0) to X(15) in natural order in both columns]
[Figure b. Shuffled Trellis: one column reordered as X(0) X(2) X(4) X(6) X(8) X(10) X(12) X(14) X(1) X(3) X(5) X(7) X(9) X(11) X(13) X(15), the other column X(0) to X(15) in natural order]
Exploiting Viterbi DP in SWAPs:
- Re-order ACS, RE
- Overhead
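The shuffled ordering listed on this slide places the even-indexed states first and the odd-indexed states second. A minimal sketch of that permutation follows; the name `shuffle_even_odd` is hypothetical, and the talk does not say how the re-ordering is actually implemented on the clusters.

```c
#include <stddef.h>

/* Reorder n entries (n even) so that even-indexed entries come first,
   followed by odd-indexed entries: X(0) X(2) ... X(n-2) X(1) X(3) ... X(n-1). */
void shuffle_even_odd(const int *in, int *out, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i] = in[2 * i];             /* even entries fill the first half  */
        out[half + i] = in[2 * i + 1];  /* odd entries fill the second half  */
    }
}
```

Under the common state numbering in which states 2j and 2j+1 feed states j and j + n/2 (an assumption here), this shuffle moves the two inputs of each butterfly to positions j and j + n/2, the same positions as its outputs, so every cluster sees the same regular access pattern. The extra data movement is the overhead the slide mentions.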

RICE UNIVERSITY 18 SWAP: Algorithms + Architecture Algorithm design for parallelism Architecture design?

RICE UNIVERSITY 19 SWAP design
- Decide how many clusters
  - Exploit DP
- Decide what to put within each cluster
  - Maximize ILP with high functional unit efficiency
- Search design space with "explore" tool
- See how it meets time-area-power constraints
[Figure: clusters of adders (+) and multipliers (*) with open questions (?) on the FU mix within a cluster (ILP) and on the number of clusters (DP)]

RICE UNIVERSITY 20 Inside a SWAP cluster: EXPLORE
[Figure: auto-exploration of adders and multipliers for "ACS", annotated with (Adder FU%, Multiplier FU%)]

RICE UNIVERSITY 21 "Explore" tool benefits
- Instruction count vs. functional unit efficiency
- What goes inside each cluster
- Explore all algorithms -> turn off functional units not in use for given kernel
- Design customized application-specific units
- Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
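The numbers on this slide are consistent with an architecture that takes, for each resource, the largest requirement among the algorithms it must run. The sketch below encodes that reading; it is an interpretation of the slide, and the type `cluster_config` and function `cover_configs` are hypothetical names, not part of the "explore" tool.

```c
/* Resources one algorithm asks for (hypothetical type for illustration). */
typedef struct {
    int adders;       /* adders per cluster      */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters      */
} cluster_config;

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Smallest configuration that covers every algorithm's requirement,
   taken as the per-field maximum. Returns all zeros if n_algos is 0. */
cluster_config cover_configs(const cluster_config *algos, int n_algos)
{
    cluster_config arch = {0, 0, 0};
    for (int i = 0; i < n_algos; ++i) {
        arch.adders = max_int(arch.adders, algos[i].adders);
        arch.multipliers = max_int(arch.multipliers, algos[i].multipliers);
        arch.clusters = max_int(arch.clusters, algos[i].clusters);
    }
    return arch;
}
```

With the slide's two algorithms as input, this yields 4 adders, 3 multipliers and 64 clusters. The difference between the architecture and what a given algorithm needs is what can be turned off while that algorithm runs.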

RICE UNIVERSITY 22 Viterbi reconfiguration
- Packet 1: Constraint length 7 (16 clusters)
- Packet 2: Constraint length 9 (64 clusters)
- Packet 3: Constraint length 5 (4 clusters)
[Figure labels: DP; Can be turned OFF]
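The cluster counts on this slide match 2^(K-1) trellis states divided by 4, that is, four states per cluster for each of K = 5, 7 and 9. The sketch below encodes that ratio. The constant `STATES_PER_CLUSTER` is inferred from the slide's three data points rather than stated in the talk, and the function names are hypothetical.

```c
/* Inferred from the slide: 64 states -> 16 clusters, 256 -> 64, 16 -> 4. */
#define STATES_PER_CLUSTER 4u

/* Number of trellis states for constraint length k (k >= 1): 2^(k-1). */
unsigned viterbi_states(unsigned k)
{
    return 1u << (k - 1);
}

/* Clusters to keep switched on for a packet with constraint length k,
   capped at the number of clusters the architecture has. The remaining
   clusters are candidates for being turned off. */
unsigned active_clusters(unsigned k, unsigned max_clusters)
{
    unsigned wanted = viterbi_states(k) / STATES_PER_CLUSTER;
    if (wanted < 1u)
        wanted = 1u;    /* always keep at least one cluster active */
    return wanted < max_clusters ? wanted : max_clusters;
}
```

For a 64-cluster machine, `active_clusters(7, 64)`, `active_clusters(9, 64)` and `active_clusters(5, 64)` give 16, 64 and 4, the three cases shown for packets 1 to 3.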

RICE UNIVERSITY 23 Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
[Figure: Frequency needed to attain real-time (in MHz) vs. Number of clusters, both on log scales; curves for K = 9, K = 7, K = 5; labels: Static architecture, SWAPs, DSP]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

RICE UNIVERSITY 24 SWAPs: Salient features
- 1-2 orders of magnitude better than 1 processor DSP
- Any constraint length
  - 10 MHz at 128 Kbps
- Same code for all constraint lengths
  - no need to re-compile or load another code
  - as long as parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling

RICE UNIVERSITY 25 Expected SWAP power consumption
- 64 clusters and 1 multiplier per cluster:
  - 0.13 micron, 1.2 V
  - Peak Active Power: ~9 mW at 1 MHz
  - Area: ~53.7 mm^2
- 10 MHz, 128 Kbps with reconfiguration
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al., Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
[Figure: Power (in mW) vs. Active Clusters (max 64)]
Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW

RICE UNIVERSITY 26 Flexibility vs. performance
- Suitable for mobile devices?
  - SWAPs: Real-time at ~10-100 mW
- Maybe; but can we do better?
  - ASICs: Real-time at ~10-100 µW
  - No special customization for the application
  - No application-specific units
  - Generic inter-cluster communication network
  - Overhead for extracting parallelism
- SWAPs suitable for base-stations?
  - Why not? Power is not a primary constraint!

RICE UNIVERSITY 27 Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

RICE UNIVERSITY 28 Current research
- SWAPs: Completely flexible and general
- How do we trade off flexibility for better performance?
- Handset SWAPs (H-SWAPs)

RICE UNIVERSITY 29 H-SWAPs: Potential advantages
[Figure: execution time for SWAPs and H-SWAPs. SWAP bar: DSP (RE), DP, Task Pipelining, Dedicated interconnect, down to ASIC/FPGA real-time performance. H-SWAP bar: DSP (RE), Partial DP + Task Pipelining, Application-specific units, Dedicated interconnect, down to ASIC/FPGA real-time performance]

RICE UNIVERSITY 30 Conclusions
- Need flexible architectures for future wireless devices
  - Higher data rates, lower power, more complex algorithms
- Design methodology (SWAPs, H-SWAPs, ASICs)
  - Flexibility vs. performance trade-offs
  - Blurs distinction between ASICs and programmable solutions
- Also need parallel, low precision algorithms for efficient mapping
- Inter-disciplinary research:
  - Computer architecture, VLSI, wireless communications, computer arithmetic, compilers
