Algorithms and Architectures for Future Wireless Base-Stations
Sridhar Rajagopal and Joseph Cavallaro
ECE Department, Rice University
April 19, 2000
This work is supported by Texas Instruments, Nokia, the Texas Advanced Technology Program, and NSF.
4/19/00 TI Meeting

Overview
Future Base-Stations
Current DSP Implementation
Our Approach
–Make algorithms computationally effective
–Task partitioning for pipelining, parallelism
Processor Design for Accelerating Wireless
Evolution of Wireless Communications
First Generation: Voice
Second/Current Generation: Voice + Low-rate Data (9.6 Kbps)
Third Generation: Voice + High-rate Data (2 Mbps) + Multimedia (W-CDMA)

Communication System: Uplink
Signals from User 1 and User 2 reach the Base Station over a direct path and reflected paths, corrupted by noise and multiple-access interference (MAI).

Main Processing Blocks
Baseband layer of the base-station receiver: Channel Estimation → Detection → Decoding

Proposed Base-Station
No Multiuser Detection in TI's Wireless Base-Station

Real-Time Requirements
Multiple data rates by varying spreading factors
Detection needs to be done in real time
–1953 cycles available on a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
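The per-bit cycle budget quoted above follows directly from the DSP clock rate divided by the bit rate; a quick check:

```python
# Cycle budget per detected bit = DSP clock / bit rate (figures from the slide).
clock_hz = 250_000_000   # TI C6x DSP at 250 MHz
bit_rate = 128_000       # target data rate, 128 Kbps
cycles_per_bit = clock_hz // bit_rate
print(cycles_per_bit)    # 1953
```

Everything the detector does per bit, across all users, must fit inside this budget.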
Current DSP Implementation

Complexity
Algorithm choice limited by complexity
–Multistage detection halves the achievable data rate
Main features
–Matrix-based operations
–High levels of parallelism
–Bit-level computations
32x32 problem size shown for the detector; estimation and decoding assumed pipelined

Reasons
Sophisticated, compute-intensive algorithms
Need more MIPS/FLOPS performance
Unable to fully exploit pipelining or parallelism
Bit-level computations / storage
Our Approach
Make algorithms computationally effective
–without sacrificing error-rate performance
Task partitioning on multiple processing elements
–DSPs: core operations
–FPGAs: application-specific / bit-level computations
Processor with reconfigurable support and extensions for wireless

Algorithms
Channel Estimation
–Avoid matrix inversion via an iterative scheme
Detection
–Avoid block-based detection by pipelining

Computations Involved
Model: received bits r_i of spreading length N for K users
Compute correlation matrices from the bits b_i, b_{i+1} of K asynchronous users aligned at times i and i-1, with per-user time delays
Multishot Detection
Solve for the channel estimate A_i

Differencing Multistage Detection
Stage 0: matched filter
Stage 1 and successive stages: interference cancellation using S = diag(A^H A)
y: soft decisions; d: detected bits (hard decisions)
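A minimal sketch of the multistage idea: stage 0 takes hard decisions on the matched-filter outputs, and each later stage subtracts the estimated multiple-access interference of the other users before re-deciding. The 2-user correlation values and bits below are illustrative assumptions, not numbers from the talk.

```python
# Multistage detection (parallel interference cancellation), toy 2-user case.
def sign(x):
    return 1 if x >= 0 else -1

# Toy cross-correlation matrix R = A^H A and its diagonal S = diag(R).
R = [[1.0, 0.7],
     [0.7, 1.0]]

# Matched-filter (soft-decision) outputs y = A^H r for true bits b = (+1, -1).
b_true = [1, -1]
y = [sum(R[i][j] * b_true[j] for j in range(2)) for i in range(2)]

# Stage 0: hard decision on the matched filter alone.
d = [sign(v) for v in y]

# Successive stages: d <- sign(y - (R - S) d), i.e. cancel off-diagonal MAI.
for _ in range(3):
    d = [sign(y[i] - sum(R[i][j] * d[j] for j in range(2) if j != i))
         for i in range(2)]

print(d)  # [1, -1]
```

Each extra stage refines the decisions but also adds latency, which is why the slide notes that multistage detection cuts the achievable data rate when run block by block.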
Iterative Scheme
Tracking via the method of steepest descent
Stable convergence behavior
Same error-rate performance as direct inversion
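The inversion-free idea can be sketched as a steepest-descent solve of the estimation equation R_bb A^H = R_br: iterate on the residual instead of inverting R_bb. The matrices, right-hand side, and step size mu below are illustrative assumptions.

```python
# Solve Rbb * X = Rbr for the channel estimate X = A^H by steepest descent.
Rbb = [[2.0, 0.5],
       [0.5, 1.0]]
Rbr = [4.5, 2.0]          # single right-hand side, for simplicity
X = [0.0, 0.0]
mu = 0.4                  # step size; needs 0 < mu < 2/lambda_max(Rbb)

for _ in range(200):
    # residual e = Rbr - Rbb X, then descend: X <- X + mu * e
    e = [Rbr[i] - sum(Rbb[i][j] * X[j] for j in range(2)) for i in range(2)]
    X = [X[i] + mu * e[i] for i in range(2)]

print([round(v, 3) for v in X])  # [2.0, 1.0]
```

Each iteration costs only a matrix-vector product, and the running estimate doubles as a tracker when the channel drifts slowly, which matches the "tracking" use on the fading-channel slide.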
Simulations: AWGN Channel
Parameters: detection window D = 12, SINR = 0, 3 paths, preamble L = 150, spreading N = 31, K users
MF: Matched Filter; ML: Maximum Likelihood; ACT: using inversion

Fading Channel with Tracking
Doppler = 10 Hz, 1000 bits, 15 users, 3 paths

Block-Based Detector
Each window runs Matched Filter → Stage 1 → Stage 2 → Stage 3 to completion before the next begins: bits 2-11, then bits 12-21
Pipelined Detector
Matched Filter → Stage 1 → Stage 2 → Stage 3, with successive detection windows overlapped in the pipeline
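The throughput gain from overlapping windows can be seen with a back-of-the-envelope pipeline count; the assumption of one time slot per stage per window is illustrative, not a figure from the talk.

```python
# Block-based vs pipelined multistage detection, counting time slots.
# Assume 4 stages (matched filter + 3 cancellation stages), one slot each.
stages, windows = 4, 10

block_time = stages * windows            # windows processed one after another
pipelined_time = stages + (windows - 1)  # a new window enters every slot

print(block_time, pipelined_time)  # 40 13
```

The pipelined schedule approaches one window per slot in steady state, which is how the design avoids the data-rate penalty of block-based multistage detection.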
Task Decomposition [Asilomar99]
Block I: correlation matrices (per bit) — R_br O(KN), R_bb O(K^2)
Block II: matrix products — A_0^H A_0, A_0^H A_1, A_1^H A_1, each O(K^2 N)
Block III: channel estimation — solve R_bb A^H = R_br, O(K^2 N), driven by pilot bits b or detected data bits d (multiplexed)
Block IV: matched filter A^H r, O(KND), and multistage detection (per window), O(DK^2 M_e)
Achieved Data Rates

VLSI Implementation
Channel estimation as a case study
Area-time efficient architecture
Real-time implementation
Bit-level computations: FPGAs
Core operations: DSPs

Motivation for Architecture
Wireless, the next wave after multimedia
Highly compute-intensive algorithms
Real-time requirements
Outline
Processor core with reconfigurable support
Permutation-based interleaved memory
Processor architecture: EPIC
Instruction set extensions
Truncated multipliers
Software support needed

Characteristics of Wireless Algorithms
Massive parallelism
Bit-level computations
Matrix-based operations
Memory intensive
Complex-valued data
Approximate computations

What's wrong with current architectures for these applications?

Problems with Current Architectures
UltraSPARC, C6x, MMX, IA-64:
–Not enough MIPS/FLOPS
–Unable to fully exploit parallelism
–No support for bit-level computations
–Memory bottlenecks
–No specialized instructions for wireless communications
Why Reconfigurable?
Adapt algorithms to the environment
Seamless, continuous data processing during handoffs between networks: home-area wireless LAN, high-speed office wireless LAN, outdoor CDMA cellular network

Reconfigurable Support
OSI Layers 3-7: user interface, translation, synchronization, transport, network
OSI Layer 2: data link layer (converts frames to bits)
OSI Layer 1: physical layer (hardware; raw bit stream)

Different Protocols
Source coding/decoding: MPEG-4, H. (voice, multimedia)
Channel coding/decoding: convolutional, turbo
Multiuser detection, channel estimation
A New Architecture
Processor core (GPP/DSP) with cache, connected through a crossbar to reconfigurable logic and main memory
Real-time I/O bit stream from the RF unit
Processor add-on / PCMCIA card

Why Reconfigurable?
Process initial bit-level computations
Optimize for fast I/O transfer of the real-time bit stream from the RF unit

Reconfigurable Support
GARP architecture at UC Berkeley
–Configuration caches
–Two 64-bit data buses, one 64-bit address bus
–Control blocks and sequencer
–Boolean values, 64-bit datapath, fast I/O
Reconfigurable Support
Wide path to memory
–Data transfer
–Minimize load times
Configuration caches
–Recently displaced configurations reloadable in 5 cycles
–Can hold 4 full-size configurations
Independent execution

Reconfigurable Support
Access to the same memory system as the processor
–Minimize overhead
When idle
–Load configurations
–Transfer data

Memory Interface
Access to main memory and L1 data cache
–Large, fast memory store
Memory prefetch queues for sequential accesses
–Read-aheads and write-behinds
FPGA and processor core (GPP/DSP) share the crossbar to the L1 data cache and main memory; separate instruction cache
Permutation-Based Interleaved (PBI) Memory
High memory bandwidth needed
Stride-insensitive memory system for matrices
Multiple banks
Sustained peak throughput (95%)

Processor Core
64-bit EPIC architecture with extensions (IA-64/C6x)
Statically determined parallelism; exploit ILP
Execution-time predictability

EPIC Principle
Explicitly Parallel Instruction Computing
Evolution of VLIW computing
Compiler plays the key role; the architecture assists the compiler
Better copes with the dynamic factors that limited VLIW parallelism
Instruction Set Extensions
To accelerate bit-level computations in wireless
Real/complex integer-bit multiplications
–Used in multiuser detection, decoding
Bit-bit multiplications
–Used in outer-product updates
–Correlation, channel estimation
Complex integer-integer multiplications
Useful in other signal processing applications
–Speech, video, ...

Architecture Support
Support via instruction set extensions
Minimal ALU modifications necessary
Transparent to register files/memory
Additional 8-bit special-purpose registers
Integer-Bit Multiplications
D = D + b*C: 64-bit registers C and D, accumulated by add/subtract under control of an 8-bit bit register b
E.g., cross-correlation
Register renaming?
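Why this needs only "minimal ALU modifications": when b is an antipodal bit (+1 or -1), the multiply degenerates to a conditional add or subtract. The 0/1 storage encoding below is an illustrative assumption.

```python
# Integer-bit multiply-accumulate D += b * C, with the bit b stored as
# 1 (meaning +1) or 0 (meaning -1). No multiplier array is needed:
# the bit just steers the ALU between add and subtract.
def mac_bit(D, C, b_stored):
    return D + C if b_stored else D - C

D = 10
D = mac_bit(D, 3, 1)   # b = +1: add 3
D = mac_bit(D, 5, 0)   # b = -1: subtract 5
print(D)  # 8
```

An 8-bit bit register can steer eight such accumulations, one per spreading-code chip, which is the cross-correlation pattern the slide refers to.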
8-bit to 64-bit Conversions
D = D + b*b^T, e.g. auto-correlation, from an 8-bit register b into a 64-bit register A
b1 = b(1:8) replicated eight times: b(1:8), b(1:8), ..., b(1:8)
b2 = each bit repeated eight times: b(1)b(1)...b(1), ..., b(8)b(8)...b(8)
Bit-Bit Multiplications
D = D + b*b^T, e.g. auto-correlation
64-bit registers A = b1 and B = b2; the bit-bit products b1*b2 are computed by Ex-NOR into 64-bit register C
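The Ex-NOR trick rests on a simple identity: with +1 encoded as 1 and -1 as 0, the product of two antipodal bits is exactly their XNOR, so a whole register of bit products costs one logic operation. A small check of the identity, plus an 8-bit packed example (the bit patterns are arbitrary):

```python
# Product of antipodal bits == XNOR of their 0/1 encodings.
def to_pm(bit):          # 0/1 encoding -> -1/+1 value
    return 1 if bit else -1

for b1 in (0, 1):
    for b2 in (0, 1):
        xnor = 1 ^ (b1 ^ b2)
        assert to_pm(xnor) == to_pm(b1) * to_pm(b2)

# Word-level version: 8 bit-products in one XNOR.
w1, w2 = 0b10110010, 0b11010110
products = ~(w1 ^ w2) & 0xFF  # XNOR, masked to 8 bits
print(bin(products))          # 0b10011011
```

On a 64-bit datapath the same single operation yields 64 outer-product entries per cycle, which is where the speedup in the correlation updates comes from.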
Increment/Decrement
D = D + b*b^T, e.g. auto-correlation
64-bit register D incremented/decremented by 1 according to the bits of the 8-bit product register b1*b2

Complex-Valued Data Processing
Is it easy to add?
Is it worth additional ALU support?
Typically supported in software!
Truncated Multipliers
Many applications need only approximate computations
Adaptive algorithms: Y = Y + mu*(Y*C)
Truncate the lower bits
Truncated multipliers: half the area, half the delay
Can do 2 truncated multiplies in parallel with a regular multiplier in the ALU
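A rough behavioral model of the idea: keep only the upper half of the product and accept a small error. (A hardware truncated multiplier drops low-order partial products before summation, so its error profile differs slightly from this simple right-shift model; the 8x8 width here is an illustrative assumption.)

```python
# Model of an 8x8 truncated multiplier that returns the top 8 product bits.
def truncated_mul8(a, b):
    return (a * b) >> 8      # discard the low 8 bits of the 16-bit product

exact = 200 * 180                        # 36000
trunc = truncated_mul8(200, 180) << 8    # re-align for comparison
print(exact, trunc, exact - trunc)       # 36000 35840 160
```

For an adaptive update like Y = Y + mu*(Y*C), the small step size mu makes this truncation error negligible relative to the algorithm's own noise, which is why the area/delay trade is attractive.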
Software Support
Greater interaction between compilers and architectures
–EPIC
–Reconfigurable logic
Compiler needs to find and exploit bit-level computations
Reconfigurable logic programming

Other Uses
Reconfigurable logic
–For accelerating loops of general-purpose processors
Bit-level support
–For other voice, video, and multimedia applications

Software Suggestions
Limited OS support
Compiler efficiency
–No more assembly!
Performance analysis tools (Code Composer Studio 1.2)

Conclusions
DSPs to play a major role in future base-stations
Search for computationally efficient algorithms and better processor designs to meet real-time constraints
Reduced-complexity algorithms designed
Processor core with reconfigurable support developed
Extra Slides
PBI Scheme
N: address length; M = 2^n banks; 2^(N-n) words in each bank
To access a word:
–n-bit bank number
–(N-n)-bit address within the bank (high-order bits)
Calculation of the n-bit bank number:

Calculate Bank Number
Use all N address bits to get the n-bit bank number: Y = A*X, where A is an n x N matrix of 0's and 1's
Splitting the address into high and low parts, Y = A_h*X_h + A_l*X_l, with X_h the high N-n bits, X_l the low n bits, and A_l of rank n
Each row of A drives an N-bit parity circuit with log_k N levels of XOR gates (fan-in k); a decoder turns the n parity bits into 2^n bank-select signals
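A small behavioral sketch of the bank-number computation: each bank-number bit is the parity (XOR) of a masked subset of the address bits, i.e. Y = A*X over GF(2). The particular matrix A below is an illustrative choice satisfying the rank condition, not the one from the talk.

```python
# PBI bank-number computation for N = 8 address bits, n = 2 (4 banks).
N, n = 8, 2

# Row i of A, stored as an N-bit mask, selects which address bits
# feed parity bit i. The low-n columns of A form a rank-n submatrix.
A = [0b10110101,
     0b01101011]

def parity(x):
    return bin(x).count("1") & 1

def bank_number(addr):
    # Bit i of the bank number = parity of (row i of A) AND address.
    return sum(parity(A[i] & addr) << i for i in range(n))

# Consecutive (stride-1) addresses spread across the 4 banks:
print([bank_number(a) for a in range(8)])  # [0, 3, 2, 1, 1, 2, 3, 0]
```

Because the mapping mixes all address bits, matrix accesses at many different strides still hit distinct banks, which is what makes the scheme stride-insensitive and lets it sustain near-peak throughput.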
Interleaved Memory Model
Address source and data source/sink feed input/output buffers through a data sequencer to memory banks M(0), M(1), ..., M(M-1)

Aspects of EPIC
Designing the plan of execution (POE) at compile time
Permitting the compiler to play statistics
–Conditional branches, memory references
Communicating the POE to the hardware
–Static scheduling
–Branch information

Architecture Features in EPIC
Static scheduling
–MultiOp
–Non-unit assumed latency (NUAL)
The branch problem
–Predicated execution
–Control speculation
–Predicated code motion
The memory problem
–Cache specifiers
–Data speculation

Operation of Reconfigurable Logic
Load configuration
–If in the configuration cache, minimal time
Copy initial data with coprocessor move instructions
Start execution
Issue a wait instruction that interlocks while the logic is active
Copy registers back at kernel completion