Algorithms and Architectures for Future Wireless Base-Stations. Sridhar Rajagopal and Joseph Cavallaro, ECE Department, Rice University. April 19, 2000.


Algorithms and Architectures for Future Wireless Base-Stations
  Sridhar Rajagopal and Joseph Cavallaro
  ECE Department, Rice University
  April 19, 2000
  This work is supported by Texas Instruments, Nokia, Texas Advanced Technology Program and NSF

4/19/00 TI Meeting
Overview
  Future Base-Stations
  Current DSP Implementation
  Our Approach
  –Make algorithms computationally effective
  –Task partitioning for pipelining and parallelism
  Processor Design for Accelerating Wireless

Evolution of Wireless Comm
  First Generation: Voice
  Second/Current Generation: Voice + Low-rate Data (9.6 Kbps)
  Third Generation: Voice + High-rate Data (2 Mbps) + Multimedia (W-CDMA)

Communication System
  Figure labels: Uplink, Direct Path, Reflected Paths, Noise + MAI, User 1, User 2, Base Station

Main Processing Blocks
  Figure labels: Channel Estimation, Detection, Decoding (Baseband Layer of Base-Station Receiver)

Proposed Base-Station
  No Multiuser Detection
  TI's Wireless Basestation

Real-Time Requirements
  Multiple data rates by varying spreading factors
  Detection needs to be done in real time
  –1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
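
The cycle budget quoted above follows from dividing the clock rate by the bit rate. A minimal sketch of that arithmetic is below; the function name is illustrative and not from the slides.

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> float:
    """Processor cycles available to handle one bit in real time."""
    return clock_hz / bit_rate_bps


# 250 MHz clock and 128 Kbps data rate, as stated on the slide.
budget = cycles_per_bit(250e6, 128e3)
print(int(budget))  # 1953
```

Doubling the data rate halves the budget, which is why higher-rate users put pressure on the detector.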

Current DSP Implementation

Complexity
  Algorithm choice limited by complexity
  –Multistage reduces data rate by half.
  Main features
  –Matrix-based operations
  –High levels of parallelism
  –Bit-level computations
  32x32 problem size for the detector shown
  Estimation and decoding assumed pipelined.

Reasons
  Sophisticated, compute-intensive algorithms
  Need more MIPS/FLOPS performance
  Unable to fully exploit pipelining or parallelism
  Bit-level computations / storage

Our Approach
  Make algorithms computationally effective
  –without sacrificing error rate performance
  Task partitioning on multiple processing elements
  –DSPs: core
  –FPGAs: application-specific / bit-level computations
  Processor with reconfigurable support and extensions for wireless

Algorithms
  Channel Estimation
  –Avoid inversion by iterative scheme
  Detection
  –Avoid block-based detection by pipelining

Computations Involved
  Model
  Compute correlation matrices
  Bits of K async. users aligned at times i and i-1
  Received bits of spreading length N for K users
  Figure labels: $r_i$, $b_i$, $b_{i+1}$, time, delay

Multishot Detection
  Solve for the channel estimate, $A_i$

Differencing Multistage Detection
  Stage 0: Matched Filter
  Stage 1
  Successive Stages
  $S = \mathrm{diag}(A^H A)$
  y: soft decision
  d: detected bits (hard decision)
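
The stage equations did not survive in the transcript, so the following is only a sketch of one common form of differencing multistage detection, under stated assumptions: data are real-valued, `G` stands in for $A^H A$, the diagonal $S$ is skipped when cancelling interference, and each stage updates the soft decisions using only the bits that changed since the previous stage. The function names and the update rule are assumptions, not the authors' exact formulation.

```python
def sign(v):
    """Hard decision: map a soft value to +1 or -1."""
    return 1 if v >= 0 else -1


def multistage_detect(G, y0, stages=3):
    """Sketch of differencing multistage detection.

    G  : K x K matrix standing in for A^H A (assumed real-valued here)
    y0 : matched-filter soft outputs (stage 0)
    """
    K = len(y0)
    d_prev = [0] * K                 # no decisions exist before stage 0
    d = [sign(v) for v in y0]        # stage 0 hard decisions
    y = list(y0)
    for _ in range(stages):
        # Only bits that flipped since the last stage contribute.
        delta = [d[k] - d_prev[k] for k in range(K)]
        # Subtract off-diagonal interference, i.e. (G - S) * delta.
        y = [
            y[k] - sum(G[k][j] * delta[j] for j in range(K) if j != k)
            for k in range(K)
        ]
        d_prev, d = d, [sign(v) for v in y]
    return d


# Two users; the weak user's bit is wrong after the matched filter
# and is corrected once the strong user's interference is cancelled.
G = [[1.0, 1.8], [1.8, 9.0]]
y0 = [-0.8, -7.2]                    # equals G applied to true bits [+1, -1]
print(multistage_detect(G, y0, stages=0))  # [-1, -1]
print(multistage_detect(G, y0, stages=3))  # [1, -1]
```

Once decisions stop changing, `delta` is all zeros and later stages do no useful work, which is the saving the differencing form is after.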

Iterative Scheme
  Tracking
  Method of Steepest Descent
  Stable convergence behavior
  Same performance
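
To illustrate how steepest descent can replace a matrix inversion, here is a minimal sketch that solves a linear system $R X = B$ iteratively. In the setting of these slides, $R$ would play the role of $R_{bb}$, $B$ of $R_{br}$, and $X$ of $A^H$. The step size `mu`, the iteration count and the function names are assumptions chosen for the example; the slides do not give them.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def steepest_descent_solve(R, B, mu, iterations=200):
    """Approximate X in R X = B without inverting R.

    Assumes R is symmetric positive definite and 0 < mu < 2 / lambda_max(R),
    which is the usual condition for this fixed-step iteration to converge.
    """
    rows, cols = len(B), len(B[0])
    X = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        RX = matmul(R, X)
        # Move against the gradient of 0.5 * ||R X - B|| in the R-norm sense.
        X = [
            [X[i][j] - mu * (RX[i][j] - B[i][j]) for j in range(cols)]
            for i in range(rows)
        ]
    return X


R = [[2.0, 1.0], [1.0, 3.0]]
B = [[4.0], [7.0]]          # chosen so that the exact solution is [[1], [2]]
X = steepest_descent_solve(R, B, mu=0.3)
print([round(row[0], 6) for row in X])  # [1.0, 2.0]
```

For tracking, one would start each update from the previous estimate instead of zeros, so that only a few iterations are needed per new bit.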

Simulations - AWGN Channel
  Detection Window = 12
  SINR = 0
  Paths = 3
  Preamble L = 150
  Spreading N = 31
  Users K = bits/user
  MF: Matched Filter
  ML: Maximum Likelihood
  ACT: using inversion

Fading Channel with Tracking
  Doppler = 10 Hz, 1000 bits, 15 users, 3 paths

Block Based Detector
  Figure labels: Matched Filter, Stage 1, Stage 2, Stage 3 (Bits 2-11); Matched Filter, Stage 1, Stage 2, Stage 3 (Bits 12-21)

Pipelined Detector
  Figure labels: Matched Filter, Stage 1, Stage 2, Stage

Task Decomposition [Asilomar99]
  Figure labels: Pilot, Data, MUX, b, d, Data'; Block I, Block II, Block III, Block IV; Channel Estimation, Matched Filter, Multistage Detector
  Stages: Correlation Matrices (Per Bit), Inverse, Matrix Products, Multistage Detection (Per Window)
  Operation costs:
  –$R_{br}[I]$: $O(KN)$
  –$R_{br}[R]$: $O(KN)$
  –$R_{bb}$: $O(K^2)$
  –$R_{bb} A^H = R_{br}[I]$: $O(K^2 N)$
  –$R_{bb} A^H = R_{br}[R]$: $O(K^2 N)$
  –$A_0^H A_0$: $O(K^2 N)$
  –$A_0^H A_1$: $O(K^2 N)$
  –$A_1^H A_1$: $O(K^2 N)$
  –$A^H r$: $O(KND)$
  –Multistage Detection (Per Window): $O(D K^2 Me)$

Achieved Data Rates

VLSI Implementation
  Channel Estimation as a case study
  Area-time efficient architecture
  Real-time implementation
  Bit-level computations: FPGAs
  Core operations: DSPs

Motivation for Architecture
  Wireless, the next wave after Multimedia
  Highly compute-intensive algorithms
  Real-time requirements

Outline
  Processor Core with Reconfigurable Support
  Permutation Based Interleaved Memory
  Processor Architecture: EPIC
  Instruction Set Extensions
  Truncated Multipliers
  Software Support Needed

Characteristics of Wireless Algorithms
  Massive parallelism
  Bit-level computations
  Matrix-based operations
  Memory intensive
  Complex-valued data
  Approximate computations

What's wrong with current architectures for these applications?

Problems with Current Architectures
  UltraSPARC, C6x, MMX, IA-64
  Not enough MIPS/FLOPS
  Unable to fully exploit parallelism
  Bit-level computations
  Memory bottlenecks
  Specialized instructions for wireless communications

Why Reconfigurable
  Adapt algorithms to environment
  Seamless and continuous data processing during handoffs
  Figure labels: Home Area Wireless LAN, High Speed Office Wireless LAN, Outdoor CDMA Cellular Network

Reconfigurable Support
  OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
  OSI Layer 2: Data Link Layer (converts frames to bits)
  OSI Layer 1: Physical Layer (hardware; raw bit stream)

Different Protocols
  Figure labels: Source Coding, Channel Coding, Channel Decoding, Source Decoding, Multiuser Detection, Channel Estimation
  MPEG-4, H Voice, Multimedia
  Convolutional, Turbo - Channel Coding

A New Architecture
  Figure labels: Processor Core (GPP/DSP), Cache, Q, Q, Crossbar, Reconfigurable Logic, Real-Time I/O Bit Stream, Main Memory, RF Unit, Processor, Add-on PCMCIA Card

Why Reconfigurable
  Process initial bit-level computations
  Optimize for fast I/O transfer
  Figure labels: Reconfigurable Logic, Real-Time I/O Bit Stream, RF Unit

Reconfigurable Support
  GARP Architecture at UC Berkeley
  Figure labels: Configuration Caches, 2 64-bit data buses, 1 64-bit address bus, Control Blocks, Sequencer, Boolean values, 64-bit Datapath, Fast I/O

Reconfigurable Support
  Wide path to memory
  –Data transfer
  –Minimize load times
  Configuration caches
  –Recently displaced configurations (5 cycles)
  –Can hold 4 full-size configurations
  Independent execution

Reconfigurable Support
  Access to same memory system as processor
  –Minimize overhead
  When idle
  –Load configurations
  –Transfer data

Memory Interface
  Access to main memory and L1 data cache
  –Large, fast memory store
  Memory prefetch queues for sequential accesses
  –Read-aheads and write-behinds
  Figure labels: Processor Core (GPP/DSP), L1 Data Cache, Q, Q, Crossbar, Main Memory, FPGA, Instruction Cache

Permutation Based Interleaved Memory (PBI)
  High memory bandwidth needed
  Stride-insensitive memory system for matrices
  Multiple banks
  Sustained peak throughput (95%)
  Figure labels: L1 Data Cache, Main Memory

Processor Core
  64-bit EPIC architecture with extensions (IA-64/C6x)
  Statically determined parallelism; exploit ILP
  Execution time predictability
  Figure labels: Processor Core (GPP/DSP), Cache, Q, Q, Crossbar, FPGA

EPIC Principle
  Explicitly Parallel Instruction Computing
  Evolution of VLIW computing
  Compiler plays the key role
  Architecture to assist compiler
  Better cope with dynamic factors
  –which limited VLIW parallelism

Instruction Set Extensions
  To accelerate bit-level computations in wireless
  Real/complex integer-bit multiplications
  –Used in multiuser detection, decoding
  Bit-bit multiplications
  –Used in outer product updates
  –Correlation, channel estimation
  Complex integer-integer multiplications
  Useful in other signal processing applications
  –Speech, video, ...

Architecture Support
  Support via instruction set extensions
  Minimal ALU modifications necessary
  Transparent to register files/memory
  Additional 8-bit special-purpose registers

Integer-Bit Multiplications
  D = D + b*C
  E.g.: cross-correlation
  Figure labels: 64-bit Register A, 64-bit Register C, +/-, 64-bit Register D, 8-bit Register b
  Register renaming?
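
When each element of b is a single bit standing for +1 or -1, the product b*C needs no multiplier: each element of C is either added to or subtracted from D. The sketch below shows that idea on plain Python lists. The lane layout (one bit of `b_bits` per element, bit value 1 meaning +1) and the function name are assumptions made for the example, not details taken from the slides.

```python
def int_bit_mac(D, C, b_bits):
    """Return D + b*C where b[i] is +1 if bit i of b_bits is set, else -1.

    Each lane is a conditional add or subtract, which is what the
    '+/-' unit in the figure suggests.
    """
    return [
        d + c if (b_bits >> i) & 1 else d - c
        for i, (d, c) in enumerate(zip(D, C))
    ]


D = [0, 0, 0]
C = [5, 7, 2]
print(int_bit_mac(D, C, 0b101))  # [5, -7, 2]
```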

8-bit to 64-bit conversions
  D = D + b*b^T
  E.g.: auto-correlation
  b1 = b(1:8), b(1:8), ..., b(1:8)
  b2 = b(1)b(1)...b(8)b(8)
  Figure labels: 8-bit Register b, 64-bit Register A

Bit-Bit Multiplications
  D = D + b*b^T
  E.g.: auto-correlation
  Figure labels: 64-bit Register A = b1, 64-bit Register B = b2, Ex-NOR, 64-bit Register C = b1*b2

Increment/Decrement
  D = D + b*b^T
  E.g.: auto-correlation
  Figure labels: 64-bit Register D, +/-, 64-bit Register (D + b1*b2), 8-bit Register b1*b2, 1
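
The three steps above (expand the 8-bit vector into two 64-bit words, XNOR them, then increment or decrement) can be sketched as follows. This is one reading of the slides, under these assumptions: bit value 1 stands for +1 and 0 for -1, b1 repeats the whole byte eight times, b2 repeats each bit eight times, and the accumulator D is kept as an 8x8 list of integers rather than packed into a register. All function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def expand_tile(b):
    """b1: the 8-bit word repeated eight times to fill 64 bits."""
    word = 0
    for _ in range(8):
        word = (word << 8) | (b & 0xFF)
    return word


def expand_stretch(b):
    """b2: each bit of b repeated eight times."""
    word = 0
    for i in range(8):
        if (b >> i) & 1:
            word |= 0xFF << (8 * i)
    return word


def xnor64(x, y):
    """Bitwise XNOR: the product of two +/-1 values encoded as bits."""
    return ~(x ^ y) & MASK64


def autocorr_update(D, b):
    """D += b * b^T for a +/-1 vector b encoded in 8 bits."""
    prod = xnor64(expand_tile(b), expand_stretch(b))
    for i in range(8):
        for j in range(8):
            # Bit 8*i + j holds the sign of b[i] * b[j].
            D[i][j] += 1 if (prod >> (8 * i + j)) & 1 else -1
    return D


D = [[0] * 8 for _ in range(8)]
autocorr_update(D, 0b00000001)   # b[0] = +1, all other elements -1
print(D[0][:3], D[1][:3])        # [1, -1, -1] [-1, 1, 1]
```

The single XNOR produces all 64 sign bits of the outer product at once, so the remaining work per element is only an increment or a decrement.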

Complex-valued Data Processing
  Is it easy to add?
  Is this worth additional ALU support?
  Typically supported by software!

Truncated Multipliers
  Many applications need approximate computations
  Adaptive algorithms: Y = Y + mu*(Y*C)
  Truncate lower bits
  Truncated multipliers: half the area, half the delay
  Can do 2 truncated multiplies in parallel with the regular multiplier
  Figure labels: Multiplier 1, Multiplier 2, Truncated Multiplier, ALU, Multipliers
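
A truncated multiplier saves hardware by not forming the low-order columns of the partial-product array and returning only the upper half of the result. The sketch below emulates that for unsigned n-bit operands so the approximation error can be seen; it is a behavioral model under assumed parameters (unsigned operands, the lowest n columns dropped, no correction constant), not a description of the multiplier design referred to in the slides.

```python
def truncated_multiply(a, b, n=8):
    """Upper n bits of a*b, computed without the lowest n columns.

    a and b are unsigned n-bit integers. Each partial product is masked
    so that bits below column n never enter the sum.
    """
    keep = ~((1 << n) - 1)           # mask that clears columns 0..n-1
    total = 0
    for i in range(n):
        if (b >> i) & 1:
            total += (a << i) & keep
    return total >> n


def exact_high(a, b, n=8):
    """Upper n bits of the exact 2n-bit product, for comparison."""
    return (a * b) >> n


print(truncated_multiply(0x80, 0x80), exact_high(0x80, 0x80))  # 64 64
print(exact_high(255, 255) - truncated_multiply(255, 255) <= 7)  # True
```

Because the dropped columns sum to less than n times 2^n, the result is never above the exact upper half and falls short of it by at most n - 1, which is tolerable in adaptive updates where the step is already approximate.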

Software Support
  Greater interaction between compilers and architectures
  –EPIC
  –Reconfigurable logic
  Compiler needs to find and exploit bit-level computations
  Reconfigurable logic programming

Other Uses
  Reconfigurable logic
  –For accelerating loops of general-purpose processors
  Bit-level support
  –For other voice, video and multimedia applications

Software Suggestions
  Limited OS support
  Compiler efficiency
  –No more assembly!
  Performance analysis tools
  Code Composer Studio 1.2

Conclusions
  DSPs to play a major role in future base-stations
  Search for computationally efficient algorithms and better processor designs to meet real-time requirements
  Reduced-complexity algorithms designed
  Processor core with reconfigurable support developed

Extra Slides

PBI Scheme
  N: address length
  M = $2^n$ banks
  $2^{N-n}$ words in each bank
  To access a word:
  –n-bit bank number
  –(N-n)-bit address (high-order)
  Calculation of the n-bit bank number

Calculate Bank Number
  Use all N bits to get the n-bit vector
  $Y = A X$, where $A$ is an $n \times N$ matrix of 0's and 1's
  $Y = A_h X_h + A_l X_l$, split as (N-n, n), with $A_l$ of rank n
  N-bit parity circuit with $\log_k N$ levels of XOR gates (fan-in k)
  Figure labels: N-bit address; Parity Ckt. for Row 0 of A, Row 1 of A, ..., Row n-1 of A; n parity bit signals; Decoder; $2^n$ bank select signals
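
The product $Y = A X$ is taken over GF(2), so each bit of the bank number is the parity of the address bits selected by one row of $A$. A minimal sketch is below. Each row of $A$ is stored as an N-bit mask; the particular masks used in the example (N = 4, n = 2) are hypothetical and chosen only so that the low-order columns form a full-rank $A_l$.

```python
def parity(x):
    """XOR of all bits of x (1 if an odd number of bits are set)."""
    return bin(x).count("1") & 1


def bank_number(address, rows):
    """Compute Y = A X over GF(2).

    rows[i] is row i of A as a bit mask over the address bits;
    bit i of the result is the parity of the selected address bits.
    """
    y = 0
    for i, mask in enumerate(rows):
        y |= parity(address & mask) << i
    return y


# Hypothetical A for a 4-bit address and 4 banks.
rows = [0b0101, 0b1010]

# Stride-4 accesses would all land in bank 0 with plain low-order
# interleaving; here they spread over all four banks.
print([bank_number(a, rows) for a in (0, 4, 8, 12)])  # [0, 1, 2, 3]
print([bank_number(a, rows) for a in (0, 1, 2, 3)])   # [0, 1, 2, 3]
```

Using high-order address bits in the parity is what makes the mapping insensitive to power-of-two strides, the access pattern that column-wise matrix traversal produces.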

Interleaved Memory Model
  Figure labels: Address Source, M(0), M(1), M(M-1), Data Sink, Data Sequencer, Input Buffers, Output Buffers, Memory Banks

Aspects of EPIC
  Designing Plan of Execution (POE) at compile time
  Permitting compiler to play statistics
  –Conditional branches, memory references
  Communicating POE to the hardware
  –Static scheduling
  –Branch information

Architecture Features in EPIC
  Static scheduling
  –MultiOP
  –Non-Unit Assumed Latency (NUAL)
  The branch problem
  –Predicated execution
  –Control speculation
  –Predicated code motion
  The memory problem
  –Cache specifiers
  –Data speculation

Operation of Reconfigurable Logic
  Load configuration
  –If in configuration cache, minimal time
  Copy initial data with coprocessor move instructions
  Start execution
  Issue wait that interlocks while active
  Copy registers back at kernel completion