RICE UNIVERSITY Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston.


1 RICE UNIVERSITY Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar – University of Rochester March 31, 2003 This work has been supported in part by Nokia, TI, TATP and NSF

2 Future wireless devices:
- High data rate mobile devices with multimedia
- Multiple antennas with complex algorithms, GOPs of computation
- Area-Time-Power constraints
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
(Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN)

3 Change in flexibility requirements
(Figure labels: Physical Layer, MAC Layer, Network Layer, Application Layer; scale from "No change (already flexible)" to "Maximum change (needs to support multiple environments, algorithms and standards)")

4 Challenges faced in achieving this goal
- Long time-to-market
- Algorithm research
- Implementation issues on current architectures
- Architecture research
- Ad-hoc design methodology for architecture designs
- ASICs
- DSPs
- Heterogeneous
- Reconfigurable

5 Research vision
- Architecture design methodology to explore:
- Flexibility: support variety of sophisticated algorithms
- High Performance: GOPs of computation (Mbps)
- Low Power: < 500 mW
- Algorithms:
- Need efficient algorithms for mapping to architectures
Design me

6 My contributions: Algorithms
Multi-user Estimation: [Jnl. of VLSI Sig. Proc.’02, ASAP’00]
- matrix inversions, high numerical stability
- Numerical techniques
- conjugate-gradient descent for complexity reduction
Multi-user Detection: [ISCAS’01]
- Block-based computation to streaming computations
- Pipelining, lower memory requirements
Parallel, fixed-point, streaming VLSI implementations [Trans. Wireless Comm.’02]

7 My contributions: Architectures
Heterogeneous system designs: [ICSPAT’00]
Computer arithmetic: [Symp. on Comp. Arith’01]
Dynamic truncation in ASICs using on-line arithmetic
[Ph.D. Thesis]
Design methodology to explore flexibility-power-performance tradeoffs
Scalable Wireless Application-specific Processors (SWAPs)

8 SWAP design methodology
(Flow diagram labels: Chain of receiver algorithms; Low “complexity”, parallel, fixed point; High level language implementation; Scalable programmable architecture design; ASIC design; FPGA, customized, reconfigurable, heterogeneous designs; SWAPs; learn; Architecture exploration; Flexibility-performance tradeoffs)

9 Benefits of this approach
- Provides a framework to explore:
- algorithms
- flexible, high performance, low power architectures (SWAPs)
- Understanding of both algorithms and ASICs used for better SWAP designs
- Flexibility-performance trade-off with increasing customization in SWAPs
Inter-disciplinary research: Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, CAD, Compilers

10 Talk Outline
- SWAPs: framework
SWAP concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals

11 DSP solutions
- Current DSPs
- Not enough functional units (FUs) for GOPs of computation
- TI C6x DSP has 8 FUs; need 100s of FUs
- Not low power enough!
- Cannot extend to more FUs
- Limited Instruction Level Parallelism (ILP)
- Limited Subword Parallelism (MMX)
- Cannot support more registers (area, ports)
- Compilers: difficult to find ILP as FUs increase

12 Solution: SWAPs
- Exploit data parallelism (DP)
- Available in many wireless algorithms
- This is what ASICs do!
- Example:
    #define N 1024
    int i, a[N], b[N], c[N];        // 32 bits
    short int d[N], e[N], f[N];     // 16 bits packed
    for (i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }
(Annotations on the example: ILP, DP, Subword)
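The "16 bits packed" comment in the slide's example refers to subword parallelism: two 16-bit operands share one 32-bit word and are added together. A minimal sketch of that idea in portable C follows. The helper names `pack16x2`, `add16x2` and `add_packed` are illustrative assumptions, not part of the talk; a real DSP or stream processor would use a dedicated packed-add instruction instead of this bit trick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word: lo in bits 0-15, hi in bits 16-31. */
uint32_t pack16x2(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Add both 16-bit lanes using ordinary 32-bit operations.
 * Each lane wraps modulo 2^16 and no carry leaks into the neighbouring lane. */
uint32_t add16x2(uint32_t x, uint32_t y) {
    const uint32_t top = 0x80008000u;        /* top bit of each lane */
    uint32_t low = (x & ~top) + (y & ~top);  /* add the low 15 bits of each lane */
    return low ^ ((x ^ y) & top);            /* fold the top bits in without a carry-out */
}

/* d[i] = e[i] + f[i] for n 16-bit elements, two elements per packed add. */
void add_packed(const uint16_t *e, const uint16_t *f, uint16_t *d, size_t n) {
    assert(n % 2 == 0);                      /* this sketch handles even n only */
    for (size_t i = 0; i < n; i += 2) {
        uint32_t sum = add16x2(pack16x2(e[i], e[i + 1]),
                               pack16x2(f[i], f[i + 1]));
        d[i] = (uint16_t)(sum & 0xFFFFu);
        d[i + 1] = (uint16_t)(sum >> 16);
    }
}
```

The loop in `add_packed` has no dependence between iterations, which is the data parallelism (DP) the slide points at: each iteration could run on a different cluster.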

13 SWAPs: stream processors for wireless
(Figure: kernels connected by streams from input data to output data. Labels: received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits)
- Kernels (computation) and streams (communication)
- Operations on kernels use local data in clusters
- Streams expose data parallelism
- Imagine stream processor at Stanford

14 DSP vs. SWAPs
(Figure: a DSP drawn as 1 cluster of adders and multipliers with internal memory, exploiting ILP; a SWAP drawn as many such clusters fed by a Stream Register File (SRF), exploiting ILP and DP)
SWAPs: max. clusters; clusters same, same operations
Power-down unused FUs, clusters

15 Arithmetic clusters
- FUs (+, *, /)
- Scratch-pad (Sp)
- Indexed accesses
- Comm. unit (CU)
- Intercluster comm.
- Distributed reg. files
- more FUs
(Figure labels: Intercluster Network, From/To SRF, Cross Point, Local Register File, CU, Sp, SRF)

16 Talk Outline
- SWAPs: framework
SWAP concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals

17 Physical layer of wireless receivers
(Figure labels: Antenna; RF Front-end; Baseband processing: Channel estimation, Detection, Decoding; Higher (MAC/Network/OS) Layers)
Receiver more complex than transmitter

18 Algorithms for
- Multiple antenna systems (MIMO systems)
- Complexity exponential with transmit * receive antennas
- Wide range of extremely complex algorithms
- Optimal depends on fading, mobility, bandwidth, antennas
- GOPs of computations
- Estimation: Linear MMSE, blind, conjugate gradient, ...
- Detection: FFT, (blind) interference cancellation, ...
- Decoding: Viterbi, Turbo, LDPC, ...
- Implement ALL of them AND the NEXT one in line
- Use the best for the situation
Example for concept demonstration: Viterbi decoding

19 Parallel Viterbi Decoding
- Add-Compare-Select (ACS): trellis interconnect
- Parallelism depends on constraint length (#states)
- Conventional Traceback
- Sequential (No DP)
- Difficult to implement in parallel architecture
- Use Register Exchange (RE)
- parallel solution
(Figure labels: Detected bits, ACS Unit, Traceback Unit, Decoded bits)
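To make the ACS and register-exchange idea concrete, here is a small hard-decision Viterbi decoder sketch in C. It is not the implementation from the talk: the constraint length K = 3, the generator polynomials (7, 5 octal), the state numbering and all function names are assumptions chosen to keep the example short. Each state keeps its own survivor register of decoded bits, so no sequential traceback pass is needed, and the loop over next states has no dependence between iterations.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

enum { K = 3, NSTATES = 1 << (K - 1) };   /* assumed toy constraint length */
static const unsigned G[2] = { 07, 05 };  /* assumed rate-1/2 generators (octal) */

unsigned parity(unsigned x) {
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Two coded bits (as a 2-bit symbol) emitted when input bit u leaves state s. */
unsigned branch_output(unsigned s, unsigned u) {
    unsigned r = (u << (K - 1)) | s;
    return (parity(r & G[0]) << 1) | parity(r & G[1]);
}

/* Encode the low nbits of msg, most significant bit first. */
void encode(uint64_t msg, int nbits, unsigned char *sym) {
    unsigned s = 0;
    for (int t = 0; t < nbits; ++t) {
        unsigned u = (unsigned)((msg >> (nbits - 1 - t)) & 1u);
        sym[t] = (unsigned char)branch_output(s, u);
        s = ((u << (K - 1)) | s) >> 1;
    }
}

/* Hard-decision Viterbi decoder using register exchange instead of traceback. */
uint64_t viterbi_re(const unsigned char *sym, int nbits) {
    unsigned metric[NSTATES], new_metric[NSTATES];
    uint64_t reg[NSTATES] = { 0 }, new_reg[NSTATES];
    assert(nbits >= 1 && nbits <= 64);              /* survivor registers are 64 bits wide */
    for (unsigned s = 0; s < NSTATES; ++s)
        metric[s] = (s == 0) ? 0u : UINT_MAX / 2;   /* encoder starts in state 0 */

    for (int t = 0; t < nbits; ++t) {
        /* Every next state ns is independent of the others: the data-parallel loop. */
        for (unsigned ns = 0; ns < NSTATES; ++ns) {
            unsigned u = ns >> (K - 2);             /* input bit that leads into ns */
            unsigned best = UINT_MAX, best_p = 0;
            for (unsigned b = 0; b < 2; ++b) {      /* add, compare, select */
                unsigned p = ((ns << 1) & (NSTATES - 1)) | b;
                unsigned d = branch_output(p, u) ^ sym[t];
                unsigned m = metric[p] + (d & 1u) + (d >> 1);  /* Hamming distance */
                if (m < best) { best = m; best_p = p; }
            }
            new_metric[ns] = best;
            new_reg[ns] = (reg[best_p] << 1) | u;   /* register exchange */
        }
        for (unsigned s = 0; s < NSTATES; ++s) {
            metric[s] = new_metric[s];
            reg[s] = new_reg[s];
        }
    }
    unsigned best_s = 0;
    for (unsigned s = 1; s < NSTATES; ++s)
        if (metric[s] < metric[best_s]) best_s = s;
    return reg[best_s];                             /* decoded bits, first bit in the MSB */
}
```

A caller would fill a symbol buffer with `encode(msg, nbits, sym)` and recover the message with `viterbi_re(sym, nbits)`. The price of register exchange is that every state copies a whole survivor register at every step, which is a regular, parallel data movement rather than a serial pointer chase.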

20 Re-ordering for parallel Viterbi
(Figure: X(0), X(1), ..., X(15) in natural order re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15))
Exploiting Viterbi DP in SWAPs:
- Re-order ACS, RE
(Figure label: DP)
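The re-ordering in the figure lists the even-indexed entries first and the odd-indexed entries second. The slide does not state why, so the following reading is an assumption: in a shift-register trellis where old states 2j and 2j+1 both feed new states j and j + N/2, storing evens and odds in separate halves puts both inputs of butterfly j at the same offset j, which suits identical clusters doing identical work. A sketch of the permutation itself, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Even-odd re-ordering: out = in[0], in[2], ..., in[n-2], in[1], in[3], ..., in[n-1]. */
void even_odd_reorder(const int *in, int *out, size_t n) {
    assert(n % 2 == 0);                 /* this sketch expects an even number of states */
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i] = in[2 * i];             /* even-indexed entries fill the first half */
        out[half + i] = in[2 * i + 1];  /* odd-indexed entries fill the second half */
    }
}
```

With 16 inputs numbered 0 to 15 this produces 0, 2, ..., 14 followed by 1, 3, ..., 15, which matches the order shown on the slide.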

21 SWAP: Algorithms + Architecture
Algorithm design for parallelism
Architecture design?

22 SWAP design
- Decide how many clusters
- Exploit DP
- Decide what to put within each cluster
- Maximize ILP with high functional unit efficiency
- Search design space with “explore” tool
- See how it meets time-area-power constraints
(Figure: clusters with an open mix of adders and multipliers; axes labelled ILP and DP)

23 Inside a SWAP cluster: EXPLORE
(Plot: auto-exploration of adders and multipliers for “ACS”; points labelled (Adder FU%, Multiplier FU%))

24 “Explore” tool benefits
- Instruction count vs. functional unit efficiency
- What goes inside each cluster
- Explore all algorithms
- turn off functional units not in use for given kernel
- Design customized application-specific units
- Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
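The three configuration lines on this slide are consistent with taking, for each resource, the largest requirement across the algorithms. The sketch below assumes that rule; the slide does not say how the actual “explore” tool combines results, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Resources per configuration: FUs per cluster and number of clusters. */
struct cluster_cfg { int adders; int multipliers; int clusters; };

int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers both requirements (element-wise maximum). */
struct cluster_cfg cover_cfg(struct cluster_cfg a, struct cluster_cfg b) {
    struct cluster_cfg c = {
        max_int(a.adders, b.adders),
        max_int(a.multipliers, b.multipliers),
        max_int(a.clusters, b.clusters)
    };
    return c;
}

/* Adders and multipliers per cluster, and whole clusters, left unused by alg on arch. */
struct cluster_cfg idle_units(struct cluster_cfg arch, struct cluster_cfg alg) {
    assert(arch.adders >= alg.adders);
    assert(arch.multipliers >= alg.multipliers);
    assert(arch.clusters >= alg.clusters);
    struct cluster_cfg r = {
        arch.adders - alg.adders,
        arch.multipliers - alg.multipliers,
        arch.clusters - alg.clusters
    };
    return r;
}
```

Applied to the slide's numbers, covering (3, 3, 32) and (4, 1, 64) gives (4, 3, 64). Running Algorithm 1 on that architecture leaves 1 adder per cluster and 32 clusters idle, which are the candidates for powering down.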

25 Saving Power
- Turning off FUs
- Easy
- Use the right FUs for statically scheduled algorithm
- Turning off clusters
- Not so easy
- Each cluster does not have access to entire SRF
- Need data from SRF of other clusters

26 Reconfiguration 1: Data transfer
Move data to appropriate clusters via comm units
Significant performance loss, additional SRF memory required
Can turn off SRF too!
(Figure labels: SRF, Clusters, CU)

27 Reconfiguration 2: Conditional streams
Transfer data via comm unit (CU) and scratchpad (Sp)
Minimal loss in performance
Cannot turn off SRF, comm unit, scratchpad in clusters
(Figure label: Sp)

28 Reconfiguration 3: Multiplexed buffers
Use mux-demux buffers
Minimal loss in performance
Can turn off clusters entirely: more power savings

29 Viterbi reconfiguration
Packet 1: Constraint length 7 (16 clusters)
Packet 2: Constraint length 9 (64 clusters)
Packet 3: Constraint length 5 (4 clusters)
(Figure labels: DP; Can be turned OFF)
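The cluster counts on this slide follow from the number of trellis states, 2^(K-1) for constraint length K, if each cluster handles 4 states: K = 7 gives 64 states and 16 clusters, K = 9 gives 256 and 64, K = 5 gives 16 and 4. The ratio of 4 states per cluster is inferred from those numbers and is not stated in the talk. The sketch below assumes it, along with the 64-cluster maximum; the names are illustrative.

```c
#include <assert.h>

/* Assumptions: 64 clusters at most, 4 trellis states per cluster (inferred ratio). */
enum { MAX_CLUSTERS = 64, STATES_PER_CLUSTER = 4 };

/* Number of trellis states for a convolutional code of constraint length k. */
int viterbi_states(int k) {
    assert(k >= 2 && k <= 16);
    return 1 << (k - 1);
}

/* Clusters kept powered for constraint length k; the remaining clusters can be off. */
int active_clusters(int k) {
    int c = viterbi_states(k) / STATES_PER_CLUSTER;
    if (c < 1) c = 1;                        /* always keep at least one cluster */
    if (c > MAX_CLUSTERS) c = MAX_CLUSTERS;  /* larger codes would need several passes */
    return c;
}
```

For the three packets on the slide, `active_clusters` returns 16, 64 and 4, so 48, 0 and 60 of the 64 clusters could be switched off respectively.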

30 (Figure labels: 64-bit; Packet 1, Rate ½, K = 7; Packet 2, K = 9; Packet 3, K = 5; Kernels (Computation); No Data Memory accesses; Execution Time (cycles); Clusters; Memory)

31 Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
(Plot: frequency needed to attain real-time (in MHz) versus number of clusters; legend: K = 9, K = 7, K = 5, Static architecture, SWAPs, DSP)
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

32 SWAPs: Salient features
- 1-2 orders of magnitude better than 1 processor DSP
- Any constraint length
- 10 MHz at 128 Kbps
- Same code for all constraint lengths
- no need to re-compile or load another code
- as long as parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling

33 Expected SWAP power consumption
- 64 clusters and 1 multiplier per cluster:
- 0.13 micron, 1.2 V
- Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW at 1 MHz)
- Area: ~53.7 mm²
- 10 MHz, 128 Kbps with reconfiguration (DSP ~200 mW)
(Plot: Power (in mW) versus Active Clusters (max 64))
Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164

34 Flexibility vs. performance
- Suitable for mobile devices?
- SWAPs: 128 Kbps at ~10-100 mW for Viterbi
- What if we want to do better?
- No special customization for the application
- No application-specific units
- Generic inter-cluster communication network
- Overhead for extracting parallelism
- SWAPs suitable for base-stations?
- Why not? Power is not a primary constraint!

35 Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

36 Expected SWAP power: base-station
- 32 user base-station with 3 X’s per cluster and 64 clusters:
- 0.13 micron, 1.2 V
- Peak Active Power: ~18.19 mW for 1 MHz (increased *)
- Area: ~93.4 mm²
- Total Peak Base-station power consumption:
- ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
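The power figures in this talk scale the per-MHz peak power by the clock rate: ~9 mW per MHz at 10 MHz gives the ~90 mW quoted for the 64-cluster handset case, and ~18.19 mW per MHz at 1 GHz gives ~18.19 W for the base-station. The sketch below only reproduces that arithmetic. It assumes power is proportional to clock frequency at a fixed supply voltage, and the function name is hypothetical.

```c
#include <assert.h>

/* Peak power in mW, assuming power scales linearly with clock frequency. */
double peak_power_mw(double mw_per_mhz, double clock_mhz) {
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

For example, `peak_power_mw(9.0, 10.0)` reproduces the handset figure and `peak_power_mw(18.19, 1000.0) / 1000.0` gives the base-station figure in watts.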

37 Talk Outline
- SWAPs: framework
SWAP concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals

38 Current research
- SWAPs: Completely flexible and general
- How do we trade off flexibility for better performance?
- Handset SWAPs (H-SWAPs)

39 Handset SWAPs: H-SWAPs
- Trade Data Parallelism for Task Pipelining
- Design SWAPlets and customize each SWAPlet
(Figure: SWAPs (max. clusters and reconfigure); a SWAPlet (limit clusters, limited DP); H-SWAPs (collection of customized SWAPlets, each with limited DP and its own mix of adders and multipliers))

40 H-SWAPs: Potential advantages
DSPs: ILP, Subword
SWAPs: ILP, Subword, DP
H-SWAPs: ILP, Subword, DP, Task Pipelining, Custom FUs
Programmable solutions with increased customization
Performance, Power benefits

41 Future research: efficient algorithms

42 Future research: architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
- Image processing:
- Cameras: variety of compression algorithms
- Biomedical applications:
- Hearing aids: DSP running on body heat *
- Sensor networks
- Compression of data before transmission
*Quote: Gene Frantz, TI Fellow

43 Conclusions
- Need flexible architectures for future wireless devices
- Higher data rates, lower power, more complex algorithms
- Design methodology (SWAPs concept)
- Flexibility vs. performance trade-offs
- SWAPs:
- Exploit data parallelism like ASICs
- 1-2 orders better than DSPs
- Turn off unused clusters and unused FUs for low power
- H-SWAPs for better performance and power benefits

