Published by Tabitha Greer. Modified over 8 years ago.
1
RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston, TX
Faculty Candidate Seminar – Southern Methodist University, April 23, 2003
This work has been supported in part by NSF, Nokia and Texas Instruments.
2
Future wireless devices demand flexibility
- Multiple algorithms and environments supported in the same device
- High data rate mobile devices with multimedia
- Flexible algorithms: multiple antennas, complex signal processing
- Flexible architectures: high performance (Mbps), low power (mW)
- Fast design with structured exploration
[Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN]
3
Flexibility needed in different layers
- Physical Layer
- MAC Layer
- Network Layer
- Application Layer (Puppeteer project at Rice: http://www.cs.rice.edu/CS/Systems/Puppeteer/)
[Figure labels: Analog RF, Flexible Algorithms, Mapping, Flexible Architectures]
4
Research vision: Attain flexibility
- Algorithms: flexibility to support a variety of sophisticated algorithms
- Architectures: flexibility to adapt hardware to algorithms
- Fast, structured design exploration
- Design me
5
Contributions: Algorithms
- Multi-user channel estimation [Jnl. of VLSI Sig. Proc.’02, ASAP’00]
  - Matrix inversions
  - Numerical techniques (conjugate-gradient descent) for complexity reduction
- Multi-user detection [ISCAS’01]
  - Block-based computation to streaming computations
  - Pipelining, lower memory requirements
- Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]
6
Contributions: Architectures
- Heterogeneous DSP-FPGA system designs [ICSPAT’00]
- Computer arithmetic [Symp. on Comp. Arith.’01]
  - Dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation
- [Ph.D. Thesis] Scalable Wireless Application-specific Processors (SWAPs)
  - Rapid, structured architectures with flexibility-performance tradeoffs
7
Scalable Wireless Application-specific Processors
- Family of flexible programmable processors
- Clusters of ALUs
- High performance by supporting 100’s of ALUs
- Can provide customization for various algorithms
- Adapts (“swaps”) architecture dynamically for power
[Figure: clusters of adders (+) and multipliers (*); labels: Scale Clusters, Scale ALUs]
8
Rapid, structured design for SWAPs
[Figure labels: Low-“complexity”, parallel, fixed-point algorithms; Architecture Exploration; ASIC design (apply); DSP design (apply); SWAPs]
9
Research vision summary
- Provide a structured framework to rapidly explore flexible, high-performance, low-power architectures (SWAPs)
- Efficient algorithm design for mapping to SWAPs
- Understanding of algorithms, DSPs and ASICs used
- Flexibility-performance trade-offs
- Inter-disciplinary research: wireless communications, VLSI signal processing, computer architecture, computer arithmetic, circuits, CAD, compilers
10
Talk Outline
- Research vision
- SWAPs: Background
- Algorithm design for SWAPs
- Architecture design for SWAPs
- Current and Future Research Goals
11
SWAPs borrow from DSPs
- DSPs use instruction-level parallelism (ILP) and subword parallelism (MMX)
- Not enough ALUs for GOPs of computation: need 100’s
  - TI C6x has 8 ALUs
- Why not more ALUs?
  - Cannot support more registers (area, ports)
  - Difficult to find ILP as ALUs increase
[Figure: register file (RF) and ALU diagram; numeric labels 32, 1, 4, 16]
12
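SWAPs borrow from ASICs
- Exploit data parallelism (DP), available in many wireless algorithms
- This is what ASICs do!

    #define N 1024

    int i, a[N], b[N], sum[N];        /* 32 bits */
    short int c[N], d[N], diff[N];    /* 16 bits, packed (subword) */

    /* DP: every iteration is independent of the others.
       ILP: the two statements in the body can issue together.
       Subword: the 16-bit operands can share one 32-bit datapath. */
    for (i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }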
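The "Subword" label on this slide refers to processing two packed 16-bit operands in one 32-bit word. As a rough illustration only, the sketch below emulates a two-lane 16-bit add in portable C. The helper names `pack16` and `add16x2` are made up for this example, and a real DSP would provide the operation as a single instruction rather than the masking shown here.

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hypothetical helper). */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Add both 16-bit lanes at once (hypothetical helper).
   The low 15 bits of each lane are added normally; the top bit of each
   lane is then patched in with XOR, so a carry out of the low lane can
   never reach the high lane. Each lane wraps modulo 65536. */
uint32_t add16x2(uint32_t x, uint32_t y)
{
    const uint32_t low15 = 0x7FFF7FFFu;
    const uint32_t top1  = 0x80008000u;
    return ((x & low15) + (y & low15)) ^ ((x ^ y) & top1);
}
```

For example, `add16x2(pack16(7, 0xFFFF), pack16(0, 1))` wraps the low lane to 0 and leaves the high lane at 7.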
13
SWAPs borrow from stream processors
- Kernels (computation) and streams (communication)
- Use local data in clusters, providing GOPs support
- Imagine stream processor at Stanford [Rixner’01]
[Figure labels: Kernel, Stream, Input Data, Output Data; received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits]
Reference: Scott Rixner. Stream Processor Architecture. Kluwer Academic Publishers: Boston, MA, 2001.
14
SWAPs are multi-cluster DSPs
- DSP (1 cluster): adders (+) and multipliers (*) with internal memory; exploits ILP
- SWAPs: many such clusters; ILP within a cluster, DP across clusters; memory is the Stream Register File (SRF)
- SWAPs adapt clusters to DP
- Identical clusters, same operations
- Power down unused FUs and clusters
15
Arithmetic clusters in SWAPs
[Figure labels: adders (+), multipliers (*), dividers (/); Distributed Register Files (supports more ALUs); Cross Point; Intercluster Network; Comm. Unit; Scratchpad (indexed accesses); From/To SRF]
16
Talk Outline
- Research vision
- SWAPs: Background
- Algorithm design for SWAPs
- Architecture design for SWAPs
- Current and Future Research Goals
17
SWAPs: Physical layer algorithms
- Complex signal processing algorithms with GOPs of computation
[Figure labels: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher (MAC/Network/OS) Layers]
18
SWAP mapping example: Viterbi decoding
- Multiple antenna systems (MIMO systems)
  - Complexity exponential with transmit x receive antennas
- Estimation: linear MMSE, blind, conjugate gradient, …
- Detection: FFT, (blind) interference cancellation, …
- Decoding: Viterbi, Turbo, LDPC, … and joint schemes
- SWAP flexibility lets you use the best algorithms for the situation
- Example for concept demonstration: Viterbi decoding
19
Parallel Viterbi Decoding for SWAPs
- Add-Compare-Select (ACS): trellis interconnect, computations
  - Parallelism depends on constraint length (#states)
- Traceback: searching
  - Conventional: sequential (no DP) with dynamic branching
  - Difficult to implement in a parallel architecture
  - Use Register Exchange (RE) as the parallel solution
[Figure labels: Detected bits, ACS Unit, Traceback Unit, Decoded bits]
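To make the ACS and register-exchange steps concrete, here is a minimal hard-decision sketch in C. It is not the implementation from the talk: the code assumes a small rate-1/2 code with constraint length K = 3 and generators 7 and 5 (octal), Hamming branch metrics, and messages of at most 64 bits, and all function names are illustrative. The inner loop over next states has no dependence between iterations, which is the data parallelism the slide refers to, and the survivor update replaces traceback by copying the winning predecessor's decoded bits and appending the new bit.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Assumed code parameters (not from the slides): K = 3, generators 7, 5. */
enum { K = 3, NSTATES = 1 << (K - 1), G0 = 7, G1 = 5 };

static int parity(int x)
{
    int p = 0;
    for (; x != 0; x >>= 1)
        p ^= x & 1;
    return p;
}

/* Two coded bits produced when the encoder leaves state s with input u. */
static int branch_output(int s, int u)
{
    int reg = (s << 1) | u;
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

/* Encode the low nbits of msg, most significant bit first. */
void conv_encode(uint64_t msg, int nbits, unsigned char *sym)
{
    int s = 0;
    for (int t = 0; t < nbits; ++t) {
        int u = (int)((msg >> (nbits - 1 - t)) & 1u);
        sym[t] = (unsigned char)branch_output(s, u);
        s = ((s << 1) | u) & (NSTATES - 1);
    }
}

/* Hard-decision Viterbi decoding with register exchange (nbits <= 64). */
uint64_t viterbi_re_decode(const unsigned char *sym, int nbits)
{
    int metric[NSTATES], next_metric[NSTATES];
    uint64_t surv[NSTATES] = {0}, next_surv[NSTATES];

    for (int s = 0; s < NSTATES; ++s)
        metric[s] = (s == 0) ? 0 : INT_MAX / 2;  /* encoder starts in state 0 */

    for (int t = 0; t < nbits; ++t) {
        /* ACS: each next state is computed independently (the DP dimension). */
        for (int ns = 0; ns < NSTATES; ++ns) {
            int u  = ns & 1;                     /* input bit that leads to ns */
            int p0 = ns >> 1;                    /* the two predecessor states */
            int p1 = p0 | (NSTATES >> 1);
            int d0 = branch_output(p0, u) ^ sym[t];
            int d1 = branch_output(p1, u) ^ sym[t];
            int m0 = metric[p0] + (d0 & 1) + (d0 >> 1);   /* add */
            int m1 = metric[p1] + (d1 & 1) + (d1 >> 1);
            int best = (m1 < m0) ? p1 : p0;               /* compare */
            next_metric[ns] = (m1 < m0) ? m1 : m0;        /* select */
            /* Register exchange: copy the winner's bits and append u. */
            next_surv[ns] = (surv[best] << 1) | (uint64_t)u;
        }
        memcpy(metric, next_metric, sizeof metric);
        memcpy(surv, next_surv, sizeof surv);
    }

    int best = 0;
    for (int s = 1; s < NSTATES; ++s)
        if (metric[s] < metric[best])
            best = s;
    return surv[best];
}
```

Encoding a 10-bit message with `conv_encode` and passing the symbols to `viterbi_re_decode` returns the original message, including when a single coded bit early in the stream is flipped.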
20
Parallel Viterbi needs re-ordering for SWAPs
- Exploiting Viterbi DP in SWAPs:
  - Use RE instead of regular traceback
  - Re-order ACS, RE
[Figure: 16-state trellis with states X(0) … X(15). Panels: "Regular ACS" and "ACS in SWAPs". One column lists the states re-ordered as X(0), X(2), …, X(14), X(1), X(3), …, X(15); label: DP vector.]
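The re-ordered column in the figure lists the even-indexed states followed by the odd-indexed ones. One plausible reading, sketched below, is a gather that places the even and odd entries of the state vector in two contiguous halves so that each half can be handled as one DP vector. The function name `reorder_even_odd` is hypothetical, and the slides do not show the actual kernel code.

```c
#include <stddef.h>

/* Gather x[0], x[2], ... into the first half of out and x[1], x[3], ...
   into the second half. n must be even and out must not alias x. */
void reorder_even_odd(const int *x, int *out, size_t n)
{
    size_t half = n / 2;
    for (size_t j = 0; j < half; ++j) {
        out[j]        = x[2 * j];
        out[half + j] = x[2 * j + 1];
    }
}
```

With 16 entries holding their own indices, the output reads 0, 2, …, 14, 1, 3, …, 15, matching the order shown in the figure.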
21
Talk Outline
- Research vision
- SWAPs: Background
- Algorithm design for SWAPs
- Architecture design for SWAPs
- Current and Future Research Goals
22
SWAP architecture design
- More clusters are better than more ALUs per cluster (if #clusters > 2)
1. Decide how many clusters: exploit DP
2. Decide what to put within each cluster: maximize ILP with high functional unit efficiency
- Search design space with “explore” tool
- Time-power-area characterization
[Figure: clusters of adders (+) and multipliers (*); labels: ILP, DP]
23
Design a SWAP cluster: “Explore”
[Figure: auto-exploration of adders and multipliers for “ACS”; points labelled (Adder util%, Multiplier util%)]
24
“Explore” tool benefits
- Instruction count vs. ALU efficiency
- What goes inside each cluster
- Design customized application-specific units
- Better performance with increased ALU utilization
- Explore multiple algorithms
  - Turn off functional units not in use for a given kernel
  - Vdd-gating, clock-gating techniques
25
Example for SWAP architecture design
- Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters
- Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters
- Explore Algorithm 3: 2 adders, 2 multipliers, 64 clusters
- Explore Algorithm 4: 2 adders, 2 multipliers, 16 clusters
- Chosen architecture: 4 adders, 3 multipliers, 64 clusters
[Figure labels: ILP, DP]
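The chosen architecture on this slide matches the per-field maximum of the four exploration results. A small sketch of that selection rule follows; the struct and function names are invented for illustration, the adder and multiplier counts are assumed to be per cluster, and the actual “explore” tool may weigh time, power and area in ways the slide does not show.

```c
/* One exploration result (hypothetical type; counts assumed per cluster). */
struct swap_config {
    int adders;
    int multipliers;
    int clusters;
};

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Smallest configuration that covers every per-algorithm requirement. */
struct swap_config choose_architecture(const struct swap_config *req, int n)
{
    struct swap_config chosen = {0, 0, 0};
    for (int i = 0; i < n; ++i) {
        chosen.adders      = max_int(chosen.adders, req[i].adders);
        chosen.multipliers = max_int(chosen.multipliers, req[i].multipliers);
        chosen.clusters    = max_int(chosen.clusters, req[i].clusters);
    }
    return chosen;
}
```

Feeding in the four results listed above yields 4 adders, 3 multipliers and 64 clusters; units an algorithm does not need are then candidates for being turned off.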
26
SWAP flexibility provides power savings
- Multiple algorithms: different ALU and cluster requirements
- Turning off ALUs (–add –mul compiler options)
  - Use the right #ALUs from the “explore” tool
- Turning off clusters
  - Data is spread across the SRF of all clusters
  - A cluster only has access to its own SRF
  - The next kernel may need data from the SRF of other clusters
  - Reconfiguration support needs to be provided
27
SWAPs provide cluster reconfiguration
- Additional latency (few cycles) due to microcontroller stalls
- Minimal loss in performance
[Figure labels: SRF, Mux-Demux Network with Stream buffers, Clusters]
28
Cluster reconfiguration for Viterbi
- Packet 1: constraint length 7 (16 clusters)
- Packet 2: constraint length 9 (64 clusters)
- Packet 3: constraint length 5 (4 clusters)
[Figure labels: DP; Can be turned OFF]
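The three data points above (K = 7, 9, 5 mapped to 16, 64, 4 clusters) are consistent with 2^(K-1) trellis states spread at four states per cluster. The sketch below encodes that reading. The ratio of four is inferred from the slide's numbers rather than stated in the talk, and the function names are hypothetical.

```c
/* Number of trellis states for constraint length K. */
int viterbi_states(int K)
{
    return 1 << (K - 1);
}

/* Clusters to keep powered for constraint length K, capped at the
   clusters physically present. The remainder can be turned off. */
int active_clusters(int K, int max_clusters)
{
    const int states_per_cluster = 4;  /* assumption inferred from the slide */
    int wanted = viterbi_states(K) / states_per_cluster;
    if (wanted < 1)
        wanted = 1;
    return wanted < max_clusters ? wanted : max_clusters;
}
```

On a 64-cluster machine this gives 16, 64 and 4 active clusters for K = 7, 9 and 5, leaving 48, 0 and 60 clusters idle respectively.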
29
[Figure: execution time (cycles) for 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Labels: Kernels (Computation), No Data Memory accesses, Clusters, Memory.]
SWAPs provide flexibility at negligible overhead
30
SWAP exploration for Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz, log scale 1 to 1000) vs. number of clusters (log scale 1 to 100), for K = 9, K = 7 and K = 5. Curves: different SWAPs (without reconfiguration), same SWAP (with reconfiguration), DSP. Marker: Max DP.]
- Ideal C64x (w/o co-proc) needs ~200 MHz for real-time
31
SWAPs: Salient features
- 1-2 orders of magnitude better than a DSP
- Any constraint length: 10 MHz at 128 Kbps
- Same code for all constraint lengths
  - No need to re-compile or load another code as long as the parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling
32
Expected SWAP power consumption
- Power model based on [Khailany’03]
- 64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V
- Peak active power: ~9 mW at 1 MHz (DSP ~1 mW)
- Area: ~53.7 mm²
- 10 MHz, 128 Kbps with reconfiguration

Viterbi       Clusters used   Peak power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW

[Figure: power (in mW, 0 to 90) vs. active clusters (max 64)]
Reference: Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors. Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.
33
Multiuser Estimation-Detection+Decoding
- Real-time target: 128 Kbps per user
- Ideal C64x (w/o co-proc) needs ~15 GHz for real-time
- Fading scenarios
34
Expected SWAP power: base-station
- 32-user base-station with 3 multipliers (X) per cluster and 64 clusters: 0.13 micron, 1.2 V
- Peak active power: ~18.19 mW at 1 MHz (increased multipliers)
- Area: ~93.4 mm²
- Total peak base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
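The jump from ~18.19 mW at 1 MHz to ~18.19 W at 1 GHz implies that peak active power is being scaled linearly with clock frequency. The same rule reproduces the ~90 mW figure for 64 clusters at 10 MHz on the earlier power slide (~9 mW per MHz). A one-line sketch of that arithmetic, with a hypothetical function name:

```c
/* Peak active power in mW, assuming linear scaling with clock frequency
   (an assumption read off the slide's numbers, not a full power model). */
double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    return mw_per_mhz * freq_mhz;
}
```

Calling `peak_power_mw(18.19, 1000.0)` gives about 18190 mW, the 18.19 W quoted for the base-station.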
35
Talk Outline
- Research vision
- SWAPs: Background
- Algorithm design for SWAPs
- Architecture design for SWAPs
- Current and Future Research Goals
36
Current research: Flexibility vs. performance
- SWAPs: 128 Kbps at ~10-100 mW for Viterbi
  - Borrow DP from ASICs!
  - Suitable for base-stations: flexibility more important than power
  - Suitable for mobile devices: power constraints tighter; can be customized for further power savings
- Handset SWAPs (H-SWAPs)
  - Borrow task pipelining from ASICs!
  - Application-specific units and specialized comm. network
37
Handset SWAPs: H-SWAPs
- Trade data parallelism for task pipelining
[Figure: SWAPs (max. clusters and reconfigure); SWAPlet (limit clusters, limited DP); H-SWAPs (collection of customized SWAPlets, each with limited DP and its own mix of adders (+) and multipliers (*))]
38
Sample points in architecture exploration
- DSPs (1 cluster): ILP, subword
- SWAPs (multiple clusters): ILP, subword, DP
- H-SWAPs (optimized for handsets): ILP, subword, DP, task pipelining, custom ALUs
- Programmable solutions with increased customization
- Performance and power benefits (with decreasing flexibility)
39
Future: Efficient algorithms and mapping
- Multiple antenna systems with 1-2 orders-of-magnitude higher complexity
40
Future research: Architectures
- Generalized and structured framework and tools
- Joint algorithm-architecture exploration
- Area-time-power-flexibility tradeoffs
- Potential applications: embedded systems
  - Image and video processing: cameras, variety of compression algorithms
  - Biomedical applications: hearing aids, DSP running on body heat*
  - Sensor networks: compression of data before transmission
*Quote: Gene Frantz, TI Fellow
41
SWAPs: Flexibility, Performance, Power
- Need flexibility in future wireless devices: algorithms and architectures
- Rapid exploration for Scalable Wireless Application-specific Processors
  - Structured approach with flexibility-performance trade-offs
- SWAPs: flexibility, high performance and low power
  - Exploit data parallelism like ASICs
  - 1-2 orders better performance than DSPs
  - Turn off unused clusters and unused ALUs for low power