Arithmetic Acceleration Techniques for Wireless Communication Receivers
Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R. Cavallaro
{suman,sridhar,chaitali,cavallar}@rice.edu
Rice University
http://www.ece.rice.edu/
This work is supported by Nokia, Texas Instruments, the Texas Advanced Technology Program and NSF.
Objective
- Next-Generation Wireless Base Station
  - Real-Time Requirements
  - Multiuser Channel Estimation and Detection
- High-Complexity Algorithms for Advanced Receiver Structures
- Task Decomposition
  - Potential for Parallelism
- Application-Specific Design / Single Processor
Outline
- Motivation
- Real-Time Requirements
- Joint Estimation and Detection
- Task Decomposition
- Results
- Summary
Motivation
- Next-Generation Wireless Systems
  - Higher Data Rates, up to 2 Mbps
  - Multimedia Capabilities
  - Multi-rate, QoS
- High Complexity in Proposed Algorithms
  - Pressure on existing hardware
  - Time, power, size constraints
- Acceleration on Hardware Needed
Wireless Communication Uplink
- Asynchronous CDMA System
- Multiple Users
- Channel Effects
  - Fading
  - Multiple Paths
- Multiple Access Interference (MAI)
[Figure: User 1 and User 2 transmit to the Base Station over a Direct Path and Reflected Paths; the received signal includes Noise + MAI.]
Base-Station Receiver: The Physical Layer
[Figure: block diagram of the base-station receiver. Labels: Multiple Users, Antenna, Demodulator, Pilot, Data, MUX, Channel Estimation, Multiuser Detection, Decoder, Decision Feedback, Delay, Detected Bits, d, b.]
Real-Time Requirements
- W-CDMA
  - Transmission done by multiplication with a signature waveform (Spreading)
  - Data Transmission in 10 ms Frames
  - Multiple Data Rates by Varying Spreading Factors
- Detection needs to be done in real time
  - 1953 cycles available on a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
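The cycle budget quoted above follows directly from the clock rate and the target bit rate. A minimal sketch of that arithmetic, assuming 1 Kbps = 1000 bits per second (the function and constant names are illustrative, not from the slides):

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> int:
    """Whole processor cycles available to detect one bit at the given rate."""
    return int(clock_hz // bit_rate_bps)


C6X_CLOCK_HZ = 250e6     # 250 MHz C6x clock, from the slide
TARGET_RATE_BPS = 128e3  # 128 Kbps target, from the slide

budget = cycles_per_bit(C6X_CLOCK_HZ, TARGET_RATE_BPS)
print(budget)  # 250e6 / 128e3 = 1953.125, so 1953 whole cycles
```

Any detector that needs more than this many cycles per bit on a single processor cannot keep up with the 128 Kbps stream.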
Joint Estimation and Detection
- Algorithm to jointly estimate the channel response and detect all users' bits
- Shown to have better performance as well as reduced computational complexity
- Maximum Likelihood Based Channel Estimation [C. Sengupta et al.: PIMRC'1998, WCNC'1999]
- Differencing Multistage Detection based on Parallel Interference Cancellation [G. Xu et al.: SPIE'1999]
Computations Involved
- Model
- Compute Correlation Matrices
[Figure: timeline showing r_i, b_i and b_(i-1) with the users' delays. Captions: "Bits of K async. users aligned at times i and i-1"; "Received bits of spreading length N for K users".]
- Solve for the channel estimate, A_i
- Multishot Detection
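The estimation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the model r_i = A b_i, where b_i stacks the K users' bits at times i and i-1, accumulates R_bb and R_br as sums of outer products over the pilot bits, and solves R_bb A^T = R_br (the relation listed in the task decomposition as RbbAH = Rbr). Everything is real-valued here; the slides solve the real and imaginary parts of R_br separately. The channel matrix, pilot sequence and all function names are hypothetical.

```python
def outer_accumulate(R, u, v):
    """In place: R += u v^T for real vectors u and v."""
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            R[i][j] += ui * vj


def solve(M, B):
    """Solve M X = B by Gaussian elimination with partial pivoting."""
    n, m = len(M), len(B[0])
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + m):
                aug[r][c] -= f * aug[col][c]
    X = [[0.0] * m for _ in range(n)]
    for r in range(n - 1, -1, -1):          # back substitution
        for c in range(m):
            s = aug[r][n + c] - sum(aug[r][k] * X[k][c] for k in range(r + 1, n))
            X[r][c] = s / aug[r][r]
    return X


def estimate_channel(bits, received):
    """Return the estimate of A^T (2K x N) from stacked bits and chip vectors."""
    two_k, n = len(bits[0]), len(received[0])
    R_bb = [[0.0] * two_k for _ in range(two_k)]
    R_br = [[0.0] * n for _ in range(two_k)]
    for b, r in zip(bits, received):
        outer_accumulate(R_bb, b, b)  # O(K^2) work per bit
        outer_accumulate(R_br, b, r)  # O(KN) work per bit
    return solve(R_bb, R_br)


# Hypothetical noise-free example: K = 2 users, spreading length N = 3.
A = [[1.0, 0.5, 0.0, 0.25],
     [0.0, 1.0, 0.5, 0.0],
     [0.5, 0.0, 1.0, 0.5]]
pilots = [(-1, -1), (1, 1), (1, 1), (1, -1), (1, -1), (-1, 1)]
# Stack the bits at time i on top of the bits at time i-1.
bits = [list(pilots[i]) + list(pilots[i - 1]) for i in range(1, len(pilots))]
received = [[sum(a * b for a, b in zip(row, bvec)) for row in A] for bvec in bits]

A_hat_T = estimate_channel(bits, received)
```

Without noise the estimate reproduces A up to rounding, provided the pilot bits make R_bb invertible, as the sequence chosen here does. With noise, the same computation gives a least-squares fit.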
Differencing Multistage Detection
- Successive Stages
- S = diag(A^H A)
- y: soft decision
- d: detected bits (hard decision)
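The stage equations did not survive on this slide, so the following is only a sketch of the general technique, assuming the usual parallel interference cancellation update y = A^H r - (A^H A - S) d with d = sign(y), and a real-valued correlation matrix H standing in for A^H A. The differencing variant updates the previous soft decision using only the bits that changed between stages, which is why later stages need fewer computations. The matrix, matched-filter outputs and function names are hypothetical.

```python
def sign(x):
    """Hard decision on a soft value."""
    return 1 if x >= 0 else -1


def pic_conventional(y0, H, stages):
    """Each stage recomputes the full interference term from scratch."""
    K = len(y0)
    d = [sign(v) for v in y0]
    for _ in range(stages):
        y = [y0[j] - sum(H[j][k] * d[k] for k in range(K) if k != j)
             for j in range(K)]
        d = [sign(v) for v in y]
    return d


def pic_differencing(y0, H, stages):
    """After the first stage, correct y only for bits that flipped."""
    K = len(y0)
    d_prev = [sign(v) for v in y0]
    y = [y0[j] - sum(H[j][k] * d_prev[k] for k in range(K) if k != j)
         for j in range(K)]
    d = [sign(v) for v in y]
    flipped_per_stage = []
    for _ in range(stages - 1):
        x = [d[k] - d_prev[k] for k in range(K)]      # entries are 0 or +/-2
        changed = [k for k in range(K) if x[k] != 0]
        flipped_per_stage.append(len(changed))
        for k in changed:                             # skip unchanged bits
            for j in range(K):
                if j != k:
                    y[j] -= H[j][k] * x[k]
        d_prev, d = d, [sign(v) for v in y]
    return d, flipped_per_stage


# Hypothetical 3-user example: user 3's matched-filter output is corrupted.
H = [[4, 2, 0],
     [2, 4, 2],
     [0, 2, 4]]
y0 = [6, 5, 1]

d_conv = pic_conventional(y0, H, stages=3)
d_diff, flipped = pic_differencing(y0, H, stages=3)
print(d_conv, d_diff, flipped)
```

Both detectors reach the same decisions in this example, but the differencing version touches one column of H in the second stage and none in the third.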
Structure of A^H A
- Block Bi-Diagonal Matrix
Bottlenecks
- Identify using C6x DSP Implementation
- Channel Estimation
  - Can be done less frequently
  - Depends on BER needed
- Multiuser Detection
  - Needs to be done all the time
- Differencing Multistage
  - Fewer computations on successive stages
- Analysis on Various Levels of Optimization for Detection
Task Decomposition
[Figure: block diagram; inputs Pilot, Data, Data', b and d routed through MUXes. Recoverable contents:]
- Channel Estimation
  - Block I: Correlation Matrices (Per Bit)
    - Rbr[R]: O(KN)
    - Rbr[I]: O(KN)
    - Rbb: O(K^2)
  - Block II: Inverse
    - Rbb A^H = Rbr[R]: O(K^2 N)
    - Rbb A^H = Rbr[I]: O(K^2 N)
  - Block III: Matrix Products
    - A0^H A1: O(K^2 N)
    - A0^H A0: O(K^2 N)
    - A1^H A1: O(K^2 N)
- Multistage Detection
  - Block IV: Multistage Detection (Per Window)
    - Task A: A^H r, O(KND)
    - Task B: d, O(D K^2 Me)
Sequential / Pipeline
- Block IV
  - Task A: A^H r, O(KND): 13272 cycles
  - Task B: d, O(D K^2 Me): 3367*Me cycles
- Real-time: 1953 cycles, 128 Kbps
- (Single PE) Sequential: A + B: 13272 + 3367*Me cycles: 10.7 Kbps
- (2 PE) Pipeline: A, B: max(13272, 3367*Me) cycles: 18.8 Kbps
* Me = 3
Parallel A, B
- Block IV
  - Task A split across K processing elements (1 ... K), each computing A^H r, O(ND): 885 cycles
  - Task B: d, O(D K^2 Me): 3367*Me cycles
- Real-time: 1953 cycles, 128 Kbps
- (K+1 PE) Parallel A, B: 3367*Me cycles: 24.75 Kbps
Parallel A, Pipeline B / Parallel A, Parallel + Pipeline B
- Task A split across K processing elements (1 ... K): 885 cycles, O(N)
- Task B, one stage: 3367 cycles, O(K^2)
- Task B, one stage parallelised: 225 cycles, O(K)
- Real-time: 1953 cycles, 128 Kbps
- (K+3 PE) Parallel A, Pipeline B: 3367 cycles: 74.25 Kbps
- ((Me+1)K PE) Parallel A, Parallel + Pipeline B: 885 cycles: 282.5 Kbps
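The data rates quoted for the five configurations can be reproduced from the cycle counts: each rate is the 250 MHz clock divided by the cycle count of the slowest step that must finish once per bit. A minimal sketch of that arithmetic, using the cycle counts from the slides; the names are illustrative, and reading the 225-cycle figure as the parallelised Task B stage is an assumption:

```python
CLOCK_HZ = 250e6           # C6x clock, from the slides
TASK_A = 13272             # A^H r on a single PE
TASK_A_PARALLEL = 885      # A^H r split across K PEs
TASK_B_STAGE = 3367        # one detection stage on a single PE
TASK_B_STAGE_PARALLEL = 225  # assumed: one detection stage, parallelised
ME = 3                     # number of stages (Me = 3 on the slides)

bottleneck_cycles = {
    "sequential A+B": TASK_A + TASK_B_STAGE * ME,
    "pipeline A, B": max(TASK_A, TASK_B_STAGE * ME),
    "parallel A, B": max(TASK_A_PARALLEL, TASK_B_STAGE * ME),
    "parallel A, pipeline B": max(TASK_A_PARALLEL, TASK_B_STAGE),
    "parallel A, parallel+pipeline B": max(TASK_A_PARALLEL, TASK_B_STAGE_PARALLEL),
}

rates_kbps = {name: CLOCK_HZ / cycles / 1e3
              for name, cycles in bottleneck_cycles.items()}

for name, rate in rates_kbps.items():
    print(f"{name}: {bottleneck_cycles[name]} cycles, {rate:.2f} Kbps")
```

Only the last configuration clears the 128 Kbps requirement, since it is the only one whose bottleneck falls below the 1953-cycle budget.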
At this step
[Figure: block diagram. Labels: Block I & II, Block III, Block IV; Data; Task A split across 1 ... K; Task B: Multistage Detection as Stage 1, Stage 2, Stage 3…]
Achieved Data Rates
[Figure: "Data Rates for Different Levels of Pipelining and Parallelism". x-axis: Number of Users (9 to 15); y-axis: Data Rates (0.5 to 3 x 10^5). Curves: (Parallel A)(Parallel+Pipe B); (Parallel A)(Pipe B); (Parallel A) B; A B; Sequential A + B. Reference line: Data Rate Requirement = 128 Kbps.]
Mapping to Hardware
- Analysis independent of hardware
  - DSP with coprocessors
  - Multiple Processors
  - Combination of a processor with ASIC/FPGA
  - Single ASIC
- Minimize idle time in processing elements
- Some computations can be shared
- Assumptions
  - Critical processing elements have functional units similar to C6x
  - No communication overhead between processors
  - Number of elements dependent on number of users
Summary
- Acceleration Techniques for Multiuser Estimation and Detection: a computationally intensive algorithm
- Task Decomposition
- C6x DSP Simulator
- Real-Time Analysis
- Hardware Mapping Issues
- Application-Specific Design more effective than a single-processor solution
Future Work
- Fixed Point Implementation
- LU Decomposition
  - Other Algorithms for decomposition
- Matrix Oriented Architectures
  - Vector Processor with SIMD
  - 2 Levels of Parallelism
  - Complex Arithmetic
DSP Implementation
- Texas Instruments C6x Simulator
  - TI TMS320C6701 Floating Point DSP
  - Code and Program optimized to fit in internal memory
- 32-bit VLIW Architecture
  - 8 Functional Units
    - 2 Multipliers
    - 4 Adders
    - 2 Load/Store
- TI C Compiler