RICE UNIVERSITY DSPs for future wireless systems Sridhar Rajagopal

Motivation
  [Figure: wireless mobile device. Block labels: RF Unit, A/D, D/A, Baseband Programmable Communications Processor, Higher Layers, Add-on PCMCIA Network Interface Card]
  Mobile: switch between standards and between parameters
  Base-station: varying number of users with different parameters

The problem
  [Figure: GPP, DSP, FPGA, and VLSI compared on performance, power, and flexibility]

An approach for the solution
  Algorithms are well understood at the VLSI level; real-time systems can be designed there
  Pushing the design higher in the chain
  Current DSPs are not powerful enough for our application
  Using the IMAGINE simulator to see which architecture features would be useful in a future DSP for such applications

History of my work
  [Diagram: labels as recovered from the slide]
  Areas: Algorithms, DSP, VLSI, FPGA, IMAGINE
  Topics: multiuser channel estimation, multiuser detection; task-partitioning, parallelism, pipelining; conventional arithmetic, on-line arithmetic; instruction set extensions, co-processor support, functional unit design and usage
  Timeline: distant past, recent past, recent and near future

Contents
  Programmable architecture design using the IMAGINE simulator
  Multiuser estimation and detection implementation
  Performance comparisons and results
  Other extensions for possible integration
  Conclusions

The IMAGINE architecture and simulator
  IMAGINE is a media signal processor

Why the IMAGINE simulator?
  Great for media processing algorithms
  Has a VLIW-based cluster, which allows comparisons with DSPs
  A good base architecture: 1024-pt FFT
  RSIM, SimpleScalar…: more general-purpose architecture simulators

What does the simulator give us?
  Execution time for the different parts of the code
  Functional unit utilization
  Insights into the bottlenecks
  Flexibility to add and remove functional units already present, or to design your own
  Graphical view of the schedule on the functional units

Down-side
  Two-level C++ programming
    StreamC: transfers streams of data between main memory and the stream register file (SRF)
    KernelC: transfers streams from the SRF to the ALU clusters
  Code is optimized to the number of ALU clusters and the size of the data
  The compiler may fail register allocation if there are too many variables or if the functional units are modified

Typical workload representation (Base-station)
  Wireless LAN: equalization, FFT, Viterbi decoding
  W-CDMA: channel estimation, multiuser detection, Viterbi/Turbo decoding
  If you felt that life was too easy: multiple antennas, long spreading codes, space-time codes

Estimation/Detection (64, 32 sizes)
  Multiuser estimation: kernels 1, 2, 3
  Massaging matrices for detection: kernels 4, 5
  Multiuser detection: kernels 6, 7

Kernels
  1. Update: update Rbb, Rbr
  2. Mmult: multiply Rbb * A
  3. Iterate: gradient descent
  4. MmultL: calculate L
  5. MmultC: calculate C
  6. Mf: matched filter
  7. Pic: one parallel interference cancellation stage
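
To make the kernel list more concrete, here is a minimal sketch of what kernels 3, 6, and 7 might compute. It uses the textbook real-valued (BPSK) forms of gradient descent, matched filtering, and parallel interference cancellation, which are assumptions; the slides do not give the exact equations, and the function names, the step size `mu`, and the two-user example data are all hypothetical.

```python
# Illustrative sketch only: textbook forms of kernels 3, 6 and 7 for
# real-valued (BPSK) signals. All names and data here are hypothetical.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def iterate(Rbb, Rbr, a, mu):
    """Kernel 3 (assumed form): one gradient-descent step toward Rbb * a = Rbr."""
    residual = [p - q for p, q in zip(matvec(Rbb, a), Rbr)]
    return [x - mu * r for x, r in zip(a, residual)]

def matched_filter(S, r):
    """Kernel 6 (assumed form): correlate the received chips r with each
    user's spreading code (one code per row of S)."""
    return matvec(S, r)

def sign(x):
    """Hard decision on a soft value."""
    return 1 if x >= 0 else -1

def pic_stage(y, R, b_prev):
    """Kernel 7 (assumed form): one parallel interference cancellation stage.
    For every user, subtract the estimated contribution of all other users
    (using the previous bit decisions), then decide again."""
    b_next = []
    for k in range(len(y)):
        interference = sum(R[k][j] * b_prev[j] for j in range(len(y)) if j != k)
        b_next.append(sign(y[k] - interference))
    return b_next

# Two users with correlated codes; user 2 is three times stronger (near-far case).
S = [[1, 1, 1, 1], [1, 1, 1, -1]]
true_bits = [1, -1]
r = [-2, -2, -2, 4]               # 1 * (+1) * S[0] + 3 * (-1) * S[1]
R = [[4, 6], [2, 12]]             # R[k][j] = amplitude_j * dot(S[k], S[j])

y = matched_filter(S, r)          # [-2, -10]
b0 = [sign(v) for v in y]         # [-1, -1]: user 1 is decided wrongly
b1 = pic_stage(y, R, b0)          # [1, -1]: one stage recovers both bits
print(y, b0, b1)
```

In this toy case the matched filter alone misdetects the weak user, and a single cancellation stage corrects it, which is the reason a detector of this kind follows kernel 6 with kernel 7.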

Kernel 2 (mmult) for 3 +, 2 *
  O(N^3) multiplications, O(N^3) additions
  Multipliers are 100% utilized in the loop
  Adders have limited FU utilization
  Divider is not utilized
  Replace / with *: swap the idle divider for another multiplier

Kernel 2 (mmult) for 3 +, 3 *
  Better adder utilization
  Needs sufficient registers for scaling (register allocation may fail)
  Code may also need slight tuning of variables for optimization
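
The effect of the functional unit mix on kernel 2 can be illustrated with an idealized bound. The sketch below assumes a dense N x N matrix multiply needs N^3 multiplications and N^3 additions (accumulating from zero), and that the busiest unit type alone sets the execution time. It ignores latencies, register pressure, and memory traffic, so it is not the simulator's model; the function names are hypothetical.

```python
# Idealized utilization bound for a dense N x N matrix multiply.
# Assumption: the busiest functional unit type determines the cycle count.

def mmult_counts(n):
    """Operation counts for an n x n by n x n multiply (accumulating from zero)."""
    mults = n ** 3
    adds = n ** 3
    return mults, adds

def ideal_utilization(n, adders, multipliers):
    """Lower bound on cycles and the resulting utilization of each unit type."""
    mults, adds = mmult_counts(n)
    cycles = max(mults / multipliers, adds / adders)
    return {
        "cycles": cycles,
        "adder_util": adds / (adders * cycles),
        "mult_util": mults / (multipliers * cycles),
    }

for adders, multipliers in [(3, 2), (3, 3)]:
    result = ideal_utilization(32, adders, multipliers)
    print(adders, multipliers, result)
```

Under these assumptions, 3 adders and 2 multipliers leave the multipliers saturated and the adders about two-thirds busy, while 3 adders and 3 multipliers balance the two, which matches the direction of the change reported on the slides.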

FU utilization on each cluster
  [Figure: functional unit utilization on each cluster]
  Time for detection at 128 Kbps for each of 32 users at 500 MHz: 4000 cycles
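
One way to read the 4000-cycle figure is as the number of clock cycles that fit in a single bit period, within which one bit for every user has to be detected. This reading is an assumption, as is taking 1 Kbps to mean 1000 bits per second; the short calculation below only checks that the numbers on the slide are consistent with it.

```python
# Back-of-the-envelope check. The "cycles per bit period" interpretation
# and Kbps = 1000 bit/s are assumptions, not statements from the slides.

def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles that elapse during one bit period."""
    return clock_hz / bit_rate_bps

CLOCK_HZ = 500e6       # 500 MHz clock, from the slide
BIT_RATE_BPS = 128e3   # 128 Kbps per user, from the slide
USERS = 32             # from the slide

budget = cycles_per_bit(CLOCK_HZ, BIT_RATE_BPS)
per_user_bit = budget / USERS  # share of the budget per detected bit

print(budget)        # 3906.25, i.e. roughly 4000 cycles
print(per_user_bit)  # 122.0703125
```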

Comparisons with DSPs
  [Figure: execution time (in seconds) versus number of users. Series: single DSP implementation, 2 DSP implementation, our architecture based on Imagine, target data rate (Kbps/user)]

Current work
  Evaluating the performance of wireless communication algorithms such as estimation, detection, and decoding on this architecture
  Studying the bottlenecks and the functional unit design needed to attain real-time performance
  The insights gained from the design can also be applied to other processors such as DSPs