Channel Equalization in MIMO Downlink and ASIP Architectures Predrag Radosavljevic Rice University March 29, 2004
Wireless System Downlink transmission in MIMO wireless system Physical layer of the mobile handset Linear channel equalization Hardware implementation using ASIP architectures
Motivation MIMO Downlink and Equalization MIMO: high data rate and high spectral efficiency Interference from each antenna introduces MAI DS-CDMA signals in multipath environment: user orthogonality is destroyed, which causes ISI Solution: powerful channel equalization to mitigate ISI and MAI and restore user orthogonality Chip-level channel equalization based on iterative CG and adaptive LMS algorithms
Motivation ASIP Hardware Implementation Future generations of mobile handsets: high speed, flexibility and low power Traditional approaches: ASIC and DSP processors ASIC: No flexibility: a family of ASICs is needed High probability of design errors, high design cost DSP: Not optimized for a given application Often limited instruction and data level parallelism ASIP: Tradeoff between efficiency of ASICs and flexibility of DSPs
Thesis Contributions Channel equalization in broad range of environments 16-bit fixed point implementation Flexible ASIP architecture design Same hardware - different equalization (slow/fast fading, CG/LMS) Extension of ASIP instruction set with application-specific operations Customized architecture: Real-time requirements for 1xEV-DV standard ( Mc/s) Reasonable clock frequency (up to 150MHz) and power dissipation Automatic hardware design: from C to gate level Hardware synthesis for FPGA and CMOS libraries
Outline Data model Channel equalization ASIP hardware implementations Conclusions and future work
Data Model: Transmission Side Alternating symbols over transmit antennas Spreading: orthogonality between users Scrambling: Reduction of inter-cell interference Transmission over multipath correlated channels
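The transmit-side steps listed on this slide can be sketched in a few lines. This is a minimal illustration, not the model used in the thesis: the Walsh-Hadamard codes, the real-valued +/-1 scrambling sequence, and all function names are assumptions chosen for clarity, and the multipath channel is left out.

```python
import random


def walsh_codes(order):
    """Return 2**order mutually orthogonal +/-1 Walsh-Hadamard codes."""
    codes = [[1]]
    for _ in range(order):
        codes = ([row + row for row in codes]
                 + [row + [-c for c in row] for row in codes])
    return codes


def demux_to_antennas(symbols, n_tx):
    """Alternate symbols over the transmit antennas (round robin)."""
    return [symbols[a::n_tx] for a in range(n_tx)]


def spread(symbols, code):
    """Repeat every symbol over the chips of one user's spreading code."""
    return [s * c for s in symbols for c in code]


def scramble(chips, sequence):
    """Multiply chip by chip with a +/-1 scrambling sequence.

    Applying the same sequence twice undoes the scrambling.
    """
    return [x * q for x, q in zip(chips, sequence)]


def despread(chips, code):
    """Correlate each code-length block of chips with the code."""
    n = len(code)
    return [sum(x * c for x, c in zip(chips[i:i + n], code)) / n
            for i in range(0, len(chips), n)]


codes = walsh_codes(3)                       # 8 codes of length 8
streams = demux_to_antennas([1, -1, 1, -1], n_tx=2)
chips = spread(streams[0], codes[3])         # antenna 0, user code 3
rng = random.Random(0)
sequence = [rng.choice((-1, 1)) for _ in chips]
tx = scramble(chips, sequence)
print(despread(scramble(tx, sequence), codes[3]))
```

Without a channel, despreading with the matching code returns the symbols and any other code returns zeros. Multipath breaks that orthogonality, which is what the equalizer is meant to restore.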
Receiver Implementations RAKE Receiver, Multiuser Detector, Kalman filter, LMMSE equalization RAKE: Deteriorated performance in highly loaded system Not appropriate for MIMO environments Multiuser Detectors: High computational complexity Limited knowledge about the activity of other users Kalman filter: Optimal solution in the sense of MSE Prohibitive complexity in MIMO environments
LMMSE Equalization Lower complexity in comparison with other receivers Independent of the number of users Iterative Solutions Good performance in highly scattered environments LMMSE Receiver
LMMSE Equalization Linear system to be solved: Covariance: block Toeplitz and positive definite A and B: Toeplitz Hermitian matrices C: Toeplitz matrix
LMMSE Approaches LMMSE solution: Cholesky decomposition More complex hardware primitives Conjugate Gradient (CG) Iterative solution, fast convergence Block algorithm – modifications for fast fading channels Least Mean Square (LMS) Adaptive algorithm Sensitivity to learning step
Equalization in Time-Varying Channels Spatially correlated, frequency selective (multipaths), fading channels Data-rate: MChips/sec Antenna correlation: Base Station: 50.18% Mobile: 43.99%
Channel Equalization: CG Algorithm N samples: 4096 in slow fading channels
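For readers unfamiliar with CG, the textbook conjugate gradient iteration for a Hermitian positive definite system is sketched below. The names `R` (covariance matrix), `h` (right-hand side) and `w` (equalizer taps) are assumptions for illustration; the slide's own equations did not survive extraction, and this sketch does not exploit the block Toeplitz structure or use fixed-point arithmetic.

```python
def conjugate_gradient(R, h, iterations=10, tol=1e-12):
    """Solve R w = h for Hermitian positive definite R (textbook CG)."""
    n = len(h)
    w = [0j] * n
    r = list(h)                      # residual h - R w for the start w = 0
    p = list(r)                      # first search direction
    rs_old = sum(abs(x) ** 2 for x in r)
    for _ in range(iterations):
        if rs_old < tol:             # residual already negligible
            break
        Rp = [sum(R[i][j] * p[j] for j in range(n)) for i in range(n)]
        # p^H R p is real for Hermitian R; keep only the real part.
        pRp = sum(complex(p[i]).conjugate() * Rp[i] for i in range(n)).real
        alpha = rs_old / pRp
        w = [w[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Rp[i] for i in range(n)]
        rs_new = sum(abs(x) ** 2 for x in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return w


# Small Hermitian example: R = [[2, j], [-j, 2]], h = [1, 0].
print(conjugate_gradient([[2, 1j], [-1j, 2]], [1, 0]))
```

Each iteration needs one matrix-vector product and a few inner products, so the hardware primitives are multiply-accumulate operations plus one division per iteration.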
CG Equalization in Veh. A 30km/h Sliding Window (SW) approach Faster variations: more frequent update of filter coefficients
CG Equalization: Velocity of 120km/h Multiple sub-blocks instead of two blocks Partial channel estimation for each sub-block Apply weights for global channel estimation: Weights are adjusted according to the channel variations The faster the channel fades, the faster the weighting coefficients drop to 0
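The weighting idea can be illustrated as a weighted average of per-sub-block channel estimates. The exponential weight profile, the `decay` parameter and the function names are assumptions made for this sketch only; the slide does not state how the weights are actually computed.

```python
def decaying_weights(n_blocks, current, decay):
    """Normalized weights that shrink with distance from the current sub-block.

    decay is in [0, 1]; a smaller value models faster fading, so the
    weights of distant sub-blocks drop towards 0 more quickly.
    """
    raw = [decay ** abs(k - current) for k in range(n_blocks)]
    total = sum(raw)
    return [x / total for x in raw]


def combine_estimates(partials, weights):
    """Weighted sum of partial channel estimates, tap by tap."""
    n_taps = len(partials[0])
    return [sum(w * est[t] for w, est in zip(weights, partials))
            for t in range(n_taps)]


# Hypothetical two-tap estimates from four sub-blocks.
partials = [[1.0, 0.2], [0.9, 0.3], [0.7, 0.4], [0.4, 0.6]]
weights = decaying_weights(len(partials), current=0, decay=0.5)
print(weights)
print(combine_estimates(partials, weights))
```

With `decay=0.0` only the current sub-block contributes, which corresponds to the limiting case of a channel that changes completely between sub-blocks.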
Architectural Alternative: LMS Equalization Adaptive LMS:
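Since the update equation on this slide was lost, the standard complex LMS recursion is sketched here as a stand-in. The names `received`, `training`, `n_taps` and the step size `mu` are assumptions; the floating-point arithmetic and the single-antenna toy channel are simplifications, not the thesis setup.

```python
import random


def lms_equalize(received, training, n_taps, mu):
    """Adapt an FIR equalizer with the standard complex LMS rule.

    Output:  y[n] = sum_k conj(w[k]) * received[n - k]
    Error:   e[n] = training[n] - y[n]
    Update:  w[k] += mu * received[n - k] * conj(e[n])
    """
    w = [0j] * n_taps
    errors = []
    for n in range(n_taps - 1, len(received)):
        x = [received[n - k] for k in range(n_taps)]
        y = sum(wk.conjugate() * xk for wk, xk in zip(w, x))
        e = training[n] - y
        w = [wk + mu * xk * e.conjugate() for wk, xk in zip(w, x)]
        errors.append(abs(e))
    return w, errors


# Toy case: a single-tap channel with gain 0.5 and known +/-1 chips.
rng = random.Random(1)
chips = [rng.choice((-1, 1)) for _ in range(300)]
received = [0.5 * c for c in chips]
w, errors = lms_equalize(received, chips, n_taps=1, mu=0.5)
print(w[0], errors[-1])
```

The filter is updated on every chip, unlike the block-wise CG update. The step size `mu` trades convergence speed against stability, which is the sensitivity to the learning step noted earlier.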
Performance: Slow Fading Environments From 32-bit floating to 16-bit fixed point Control of quantization error Pedestrian A – 3km/h Pedestrian B – 10km/h
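The move from floating point to 16-bit fixed point can be sketched as a quantizer with saturation. The Q1.15 format (one sign bit, 15 fractional bits) and the function names are assumptions; the slides do not say which scaling the fixed-point implementation uses.

```python
def to_fixed(x, frac_bits=15, word_bits=16):
    """Quantize x to a signed fixed-point integer, saturating on overflow.

    Python's round() rounds ties to even; hardware may round differently.
    """
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, round(x * (1 << frac_bits))))


def to_float(q, frac_bits=15):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


for value in (0.3, 0.5, 1.0, -1.0):
    q = to_fixed(value)
    print(value, q, value - to_float(q))   # last column: quantization error
```

Inside the representable range the rounding error is at most half of one least significant bit. Values outside the range saturate, so intermediate results have to be scaled to keep the quantization error under control.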
Performance: Vehicular A 30km/h CG with sliding window (CG-SW): Improvement in comparison with basic CG
CG–SW Approach: Fixed Point 32-bit floating point and 16-bit fixed point About 1 % BER difference Vehicular A – 30km/h
Performance: Velocity of 120km/h CG with sliding window and weights averaging CG-SW-WA with different numbers of sub-blocks Performance improvement if weights are applied Pedestrian A - 120km/h Vehicular A - 120km/h
Computational Complexity Number of operations per chip in 1 second CG filter update is less complex Reason: block-level filter update algorithm
Directions for Architecture Implementation Equalization in different environments Block CG, adaptive LMS for slow fading environments Modifications of CG for fast fading channels Different computational complexity and amount of parallelism Flexible hardware for different equalizations and CG modifications Programmable architecture Application specific
ASIP Architecture for Equalization: Required Features Flexible architecture able to operate in different channel environments Slow/fast fading Low/high scattering Architecture customization Implementation of application-specific operations Instruction and data level parallelism Fast execution of complex algorithms Automatic hardware-software co-design Fast processor design starting from C/C++ code of application
ASIP Architecture Based on TTA Flexible architecture No limits on adding new FUs, buses, registers Customizable architecture Implementation of Special Function Units (SFUs) Instruction and data level parallelism VLIW architecture principle Efficient and parallel data flow Fast processor design Automatic search for best processor VHDL processor representation
General Structure of TTA Transport of operands triggers the appropriate operation as a side effect Only one instruction: “move” instruction 32-bit architecture
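The transport-triggered principle can be shown with a toy software model in which the program consists only of moves and writing a unit's trigger port starts its operation. The class, port names and program encoding are hypothetical and are not taken from the MOVE tools; the model executes one move at a time, while a real TTA runs several moves in parallel on its buses.

```python
class AddUnit:
    """Toy function unit: writing the trigger port starts the addition."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value                 # stored, nothing executes
        elif port == "trigger":
            self.result = self.operand + value   # side effect of the move
        else:
            raise KeyError(port)


def read(source, registers, units):
    """Fetch a value from an immediate, a register, or a unit's result port."""
    if isinstance(source, int):
        return source
    if "." in source:
        unit, port = source.split(".")
        if port != "result":
            raise KeyError(port)
        return units[unit].result
    return registers[source]


def execute(moves, registers, units):
    """Run a program made only of (source, destination) moves."""
    for source, destination in moves:
        value = read(source, registers, units)
        if "." in destination:
            unit, port = destination.split(".")
            units[unit].write(port, value)
        else:
            registers[destination] = value
    return registers


program = [("r0", "add.operand"), ("r1", "add.trigger"), ("add.result", "r2")]
print(execute(program, {"r0": 3, "r1": 4, "r2": 0}, {"add": AddUnit()}))
```

No add instruction appears in the program: the addition happens because an operand was moved to the trigger port.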
TTA Design Flow: MOVE Tool Design space exploration for optimal architecture
Customization of ASIP Implementation of application specific operations User-defined Special Function Units (SFUs) Sacrificing architecture generality for optimization and performance improvement Designed SFUs: Real multiplication with shifting ability Complex multiplication with shifting Sub-word arithmetic operations Sign-test and add/subtract
SFU: Complex Multiplication Reduction of data transports between FUs Fewer buses and smaller interconnection network Smaller instruction word Instruction and data parallelism are placed inside CXMUL
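A behavioral sketch of a complex multiply with shifting is given below. The Q1.15 operand format, the default shift of 15 and the function name `cxmul` are assumptions, and saturation is omitted; the actual SFU datapath is not described on the slide.

```python
def cxmul(a_re, a_im, b_re, b_im, shift=15):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) on integers.

    Four multiplications and two add/subtract steps are followed by an
    arithmetic right shift that rescales the product to the operand format.
    """
    re = (a_re * b_re - a_im * b_im) >> shift
    im = (a_re * b_im + a_im * b_re) >> shift
    return re, im


HALF = 1 << 14                       # 0.5 in the assumed Q1.15 format
# (0.5 + 0.5j) * (0.5 - 0.5j) = 0.5 + 0j
print(cxmul(HALF, HALF, HALF, -HALF))
```

Built from generic multipliers and adders, the same result needs every partial product to travel over the buses. Folding the operation into one unit keeps those intermediate values internal, so only the four inputs and two outputs use the interconnect.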
Performance Improvement with SFUs Bus reduction of 50% Instruction word length reduction of about 50%
TTA Processors for MIMO Equalization 1. Two co-processors (CG equalization) Co-processor for updating equalizer coefficients Co-processor for filtering and user detection 2. Single processor for all parts of equalization algorithm (CG/LMS equalization) Identical architectures for slow and fast fading environments
Single Processor vs. Two Coprocessors Single processor Smaller area and power dissipation Higher clock frequency
Processor Flexibility Identical customized processor for broad range of channel environments Identical processor for LMS and CG equalization
Example of Designed Processor Coprocessor for CG filter update
Hardware synthesis design flow MOVEGen: generates VHDL representation of processor core Xilinx tools for fast FPGA prototyping Mentor Graphics tools for CMOS gate level design
VHDL Template of TTA Processor Automatic VHDL generation of processor core, control and interconnection FUs, SFUs, peripherals: pre-designed or defined by user
MoveProc Synthesis on Xilinx FPGA CG/LMS equalizer including user detection no SFUs 32 buses Xilinx FPGA part: XC2V8000 Slices: 38,757 out of 46,592 BRAMs: 148 out of 168 IOBs: 263 out of 1108 MULT18x18s: 24 out of 168
MoveProc Synthesis on Xilinx FPGA Customized CG/LMS equalizer including user detection with SFUs 16 buses Xilinx FPGA part: XC2V6000 Slices: 21,126 out of 33,792 BRAMs: 107 out of 144 IOBs: 229 out of 1104 MULT18x18s: 11 out of 144
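The figures from the two FPGA synthesis slides can be compared with a few lines of arithmetic. The dictionary layout and function names are illustrative. The two designs target different parts (XC2V8000 and XC2V6000), so utilization percentages refer to different totals and only the absolute counts are directly comparable.

```python
designs = {
    "no SFUs": {"buses": 32, "slices": (38757, 46592), "mult18x18": (24, 168)},
    "with SFUs": {"buses": 16, "slices": (21126, 33792), "mult18x18": (11, 144)},
}


def utilization(used, available):
    """Percentage of a device resource that the design occupies."""
    return 100.0 * used / available


def reduction(before, after):
    """Percentage saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


for name, d in designs.items():
    print(name, "slice utilization: %.1f%%" % utilization(*d["slices"]))

base, custom = designs["no SFUs"], designs["with SFUs"]
print("bus reduction: %.1f%%" % reduction(base["buses"], custom["buses"]))
print("slice reduction: %.1f%%"
      % reduction(base["slices"][0], custom["slices"][0]))
```

The bus count drops from 32 to 16, which matches the 50% bus reduction reported for the SFU-based design.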
Gate Level CMOS Synthesis Mentor Graphics Tools 0.5 µm CMOS library Customized CG/LMS equalizer including user detection (with SFUs) Synthesis estimate of processor core: 182,887 gates
Conclusions Equalization algorithms for broad range of channel environments Slow fading: CG/LMS Fast fading: Modifications of basic CG equalization ASIP architecture design based on TTA Same architecture – different equalization algorithms Optimization with application-specific operations Reasonable frequency and power dissipation for 3GPP data rate Fast processor design VHDL representation of optimal processor FPGA synthesis and CMOS gate level synthesis
Future Work Processor layout synthesis IC Station software tool from Mentor Graphics Precise timing, area, and power analysis Implementation of hybrid word length Reduced precision for filter application part Implementation on C5x DSP for comparison
Acknowledgements Thanks to: Professor Cavallaro Dr. De Baynast Professor Aazhang Dr. Dabak Dr. Sabharwal Texas Instruments Nokia