Task Partitioning Wireless Base-Station Receiver Algorithms on Multiple DSPs and FPGAs
Sridhar Rajagopal, Bryan A. Jones, and Joseph R. Cavallaro
Rice University
This work is supported by Nokia, TI, TATP, and NSF.
Motivation
- Build wireless multimedia communication systems (Kbps to Mbps)
- Sophisticated algorithms have exponential complexity
- Approach: sub-optimal algorithms with O(n^2) to O(n^3) complexity
- Better hardware implementations needed
Hardware implementations
- DSP: programmable
- ASIC: customized hardware
- FPGA: programmable "ASICs"
- A single DSP is too slow
- Need flexibility (for different protocols) and speed (to meet real-time deadlines)
- A multiple DSP-FPGA solution is investigated
Contributions
- Efficient task partitioning of multiuser estimation and detection algorithms on fixed hardware
  - maximize performance, minimize overhead
- 1.19X-5.92X speedup with 2 DSPs (additional processing power and internal memory)
- Use of FPGAs to accelerate multiuser detection
- Multiple DSPs and FPGAs to meet real-time requirements
Outline
- Introduction
- Multiprocessor system at Rice
- Single and multiprocessor simulations
- FPGAs for acceleration
- Summary
Multiuser estimation and detection
[Figure: base station receiving direct and reflected paths from User 1 and User 2, plus noise and interference]
- Jointly estimate attenuations, fading, and delays
- Jointly detect the data of all users
Benefits of multiuser estimation and detection
[Figure: bit error rate vs. SNR (dB), comparing single-user (channel estimation + detection), multiuser estimation + single-user detection, and multiuser (channel estimation + detection)]
Base-station receiver
[Figure: antenna -> multiuser channel estimation (training, tracking) -> multiuser detection -> decoding -> information bits]
Sub-optimal estimation and detection
- Maximum-likelihood estimation: O(users^2 * spreading gain)
  - avoids matrix inversion by an iterative scheme
- Multiuser detection with interference cancellation:
  - single-user detector (code matched filter): O(users * spreading gain)
  - 3 stages of parallel interference cancellation: O(users^2)
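The detection chain above can be sketched as follows. This is a minimal illustration, not the implementation from the slides: the function and variable names are hypothetical, and the cross-correlation matrix is assumed precomputed.

```python
def matched_filter(received, codes):
    """Stage 0: correlate received chips with each user's spreading code.
    Cost is O(users * spreading_gain)."""
    return [sum(r * c for r, c in zip(received, code)) for code in codes]

def pic_stage(soft, cross_corr):
    """One parallel interference cancellation (PIC) stage: subtract the
    estimated interference of all other users from each user's statistic.
    Cost is O(users^2)."""
    bits = [1 if s >= 0 else -1 for s in soft]          # hard decisions
    n = len(soft)
    return [soft[k] - sum(cross_corr[k][j] * bits[j]
                          for j in range(n) if j != k)
            for k in range(n)]
```

Three such PIC stages, applied in sequence after the matched filter, give the interference-cancelling detector whose per-stage O(users^2) cost the slide quotes.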
Multiprocessor implementations
- A single DSP is too slow
- Multiple DSPs introduce communication overhead
- Partition estimation and detection onto different DSPs
  - narrow communication link
  - maximize performance
- Data rates then depend only on detection
Multiprocessor system at Rice
- Prototype multiprocessor board from Sundance Inc.
- Two TI C67 DSPs and two Xilinx 300K-gate FPGAs
- Inter-processor communication at 20 MBps
[Figure: PC host connected to DSP1 (multiuser estimation) and DSP2 (multiuser detection), with FPGA1 and FPGA2; received bits in, detected bits out]
Base-case implementation (single DSP)
- Multiuser estimation is 10X-50X slower than multiuser detection
  - different algorithm complexity
- Multiuser detection fits in internal memory (64 KB)
- Multiuser estimation uses internal and off-chip memory
Base-case simulation
[Figure: execution time (seconds, log scale) vs. users (5-25) for multiuser estimation, single-user estimation, multiuser detection, and single-user detection]
Dual-DSP implementation
- Both estimation and detection now fit in internal memory
- 2X-12.66X speedup in estimation (DSP1 vs. single DSP)
- No change in detection performance
- Estimation still 3X slower than detection
- Inter-processor communication overhead: O(users * spreading gain) = 16-512 KB
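A rough sizing of the DSP1-to-DSP2 transfer can make the overhead concrete. Only the O(users * spreading gain) scaling and the 20 MBps link rate come from the slides; the coefficient width and any oversampling or window factors are guesses, which is why this toy formula does not reproduce the 16-512 KB range exactly.

```python
LINK_MBPS = 20  # Sundance board inter-processor link rate from the slides

def transfer_cost(users, spreading_gain, bytes_per_coeff=4):
    """Hypothetical per-update transfer: users * spreading_gain coefficients,
    bytes_per_coeff bytes each. Returns (size in KB, time in ms) over the
    20 MBps link. Constants are illustrative, not measured."""
    size_kb = users * spreading_gain * bytes_per_coeff / 1024.0
    time_ms = size_kb / (LINK_MBPS * 1024) * 1000.0
    return size_kb, time_ms
```

The point of the sketch is the scaling: transfer size (and hence time on the fixed 20 MBps link) grows linearly with both the user count and the spreading gain.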
Dual-DSP simulations
[Figure: execution time (seconds, log scale) vs. users for multiuser estimation (single DSP vs. DSP1), multiuser detection (single DSP vs. DSP2), and the DSP1-DSP2 communication overhead]
Balancing the division of tasks
- Unbalanced task division:
  - estimation 3X slower than detection
  - huge communication overhead (> estimation, detection)
- Data rates depend only on detection
- Update channel estimates less frequently
  - reasonable for slow-fading channels (e.g., indoor environments)
Frequency of estimation updates
- Can update more frequently with more users
  - once every 48 bits for a single user
  - once every 9 bits for 32 users
- Relatively larger overhead for fewer users
  - estimation, detection = O(users^2)
  - communication overhead = O(users)
Frequency of channel estimate updates
[Figure: estimation update frequency (1 in 'x' bits) vs. users (5-50)]
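The amortization argument behind the update-frequency trade-off can be sketched as follows. The O(users^2) estimation/detection and O(users) communication scalings are from the slides; the constant factors are placeholders chosen only for illustration.

```python
def per_bit_cost(users, x, c_det=1.0, c_est=10.0, c_comm=50.0):
    """Amortized work per detected bit when the channel is re-estimated
    once every x bits: detection runs per bit, while estimation and the
    inter-DSP transfer are paid once per update and spread over x bits.
    The c_* constants are hypothetical."""
    detection = c_det * users ** 2
    estimation = c_est * users ** 2
    comm = c_comm * users
    return detection + (estimation + comm) / x
```

Because the communication term grows only linearly while estimation grows quadratically, the overhead is relatively larger for fewer users, consistent with the plotted trend (updates every 48 bits for 1 user vs. every 9 bits for 32 users).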
Limitations of DSP implementations
- Further acceleration needed for real-time performance
- Matrix-based, massively parallel algorithms
- Detection of bits in {+1, -1} requires bit-level operations
- On DSPs:
  - bit multiplications are wasteful (add/subtract suffices, as on an FPGA)
  - bit storage is not convenient
  - parallelism cannot be fully exploited
FPGAs for acceleration
- Combine flexibility with the efficiency of ASICs
- Good for parallelism and bit-level operations
[Figure: pipeline of DSP1 (multiuser estimation), FPGA1 (code matched filter detector, PIC stage 1), FPGA2 (PIC stage 2), and DSP2; received bits in, detected bits out]
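The bit-level trick the slides allude to can be shown in a few lines. Since detected bits are in {+1, -1}, a multiply-accumulate y += b * x reduces to a conditional add/subtract, which maps to cheap FPGA logic with no hardware multipliers. The sketch below (hypothetical names) expresses that reduction in software:

```python
def mac_with_bits(xs, bits):
    """Accumulate sum(b * x) for bits b in {+1, -1} without multiplying:
    each term is just an add or a subtract, mirroring the FPGA datapath."""
    acc = 0
    for x, b in zip(xs, bits):
        acc = acc + x if b > 0 else acc - x  # add/sub replaces multiply
    return acc
```

On an FPGA, one such add/subtract unit per user can run in parallel, which is how the interference-cancellation stages exploit both the bit-level and the matrix-level parallelism that a DSP leaves on the table.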
Multiprocessor simulations
[Figure: execution time (seconds, log scale) vs. users (5-35) for the single-DSP implementation, the 2-DSP implementation, and 2 DSPs + 2 FPGAs, against the target data rate of 128 Kbps/user]
Multiprocessor advantages
- 1.19X-5.92X speedup using 2 DSPs
- Up to 50X acceleration possible by task balancing with additional FPGAs
  - subject to DSP-FPGA communication overhead
- Just 2 DSPs and 2 FPGAs can meet the 128 Kbps/user real-time requirement for up to 7 users
Outline
- Introduction
- Multiprocessor system at Rice
- Single and multiprocessor simulations
- FPGAs for acceleration
- Future work and summary
Future work
- DSP-FPGA communication overhead (transferring KBs of data into FPGAs)
- Implementation of channel decoding
- Complete real-time system
Summary
- Efficient task partitioning of multiuser estimation and detection algorithms on fixed hardware
  - maximize performance, minimize overhead
- 1.19X-5.92X speedup with 2 DSPs (additional processing power and internal memory)
- Use of FPGAs to accelerate multiuser detection
- Multiple DSPs and FPGAs to meet real-time requirements