Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems
Ali Irturk†, Bridget Benson†, Nikolay Laptev‡, Ryan Kastner†
† Department of Computer Science and Engineering, University of California, San Diego {airturk, b1benson,
‡ Department of Computer Science, University of California, Los Angeles
April 2009
Motivation
- Matrix decompositions are essential computations for wireless communications.
- Matrix decompositions simplify matrix inversion, which is used in:
  - equalization algorithms, to remove the effect of the channel on the signal;
  - minimum mean square error algorithms, for pre-coding in spatial multiplexing;
  - detection-estimation algorithms, in space-time coding.
[Figure labels: QR, A^-1]
Motivation
- A number of tools translate Matlab algorithms to a hardware description language.
- However, we believe that the majority of these tools take the wrong approach.
- We take a more focused approach: a tool that specifically targets matrix computation algorithms.
Computing Platforms
Platforms: ASICs, DSPs, FPGAs, GPU, Cell BE
- ASICs: exceptional performance, long time to market, substantial costs
- DSPs: ease of development, fast time to market, low performance
- FPGAs: ease of development, fast time to market, ASIC-like performance
Major Contributions
- Design of a novel tool, GUSTO, for automatic generation and optimization of application-specific matrix computation architectures from a given Matlab algorithm.
- Comparison of different matrix decomposition methods across matrix dimensions, bit widths and degrees of parallelism.
- Thorough study of the area and throughput tradeoffs of matrix decomposition architectures using different parameterizations.
- A case study: implementation of an Adaptive Weight Calculation core using the QRD-RLS algorithm.
GUSTO: General architecture design Utility and Synthesis Tool for Optimization
- GUSTO is an easy-to-use tool for more efficient design space exploration and development.
- It automatically generates and optimizes application-specific architectures.
- It creates a prototype hardware system in just minutes instead of days or weeks.
Inputs: Algorithm (e.g. QR decomposition); Bit width (e.g. 19 bits of precision); Resource Allocation (e.g. 4 multipliers and 3 adders); Modes (e.g. heterogeneous cores connected using hierarchical datapaths)
Outputs: HDL files; Error Analysis (average error versus number of bits used)
Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Decomposition Methods
- Results
  - Inflection Point Analysis
  - Architectural Design Alternatives
- Conclusions
GUSTO Design Flow
Steps: Algorithm Analysis; Instruction Generation; Resource Allocation; Error Analysis; Architecture Generation; Collecting Scheduling Information; Resource Trimming for Hardware Optimization
Inputs: Algorithm; Type and # of Arithmetic Resources; Design Library; Data Representation
Outputs: General Purpose Architecture; Application Specific Architecture; Area, Latency and Throughput Results; Simulation Results
GUSTO Design Flow: Algorithm Analysis
GUSTO provides options to divide the given algorithm into smaller processing elements (PEs), which are small in area and highly optimized for throughput.
[Figure: a processing element built from an instruction controller, a memory controller, adders (A) and multipliers (M); PEs as building blocks of a software defined radio]
GUSTO Design Flow: Instruction Generation and Resource Allocation
- GUSTO uses instruction scheduling for better resource utilization and provides different scheduling methods.
- GUSTO generates resource-constrained architectures, i.e. the user chooses the number and type of arithmetic units.
Inputs: Type and # of Arithmetic Resources; Design Library (+, -, *, /)
[Figure: a processing element built from an instruction controller, a memory controller, adders (A) and multipliers (M)]
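The slides do not say which scheduling methods GUSTO offers. As an illustration only, here is a minimal sketch of one common resource-constrained technique, greedy list scheduling. The `Instr` record, the one-cycle latency for every operation, and the example program `dot4` are all assumptions made for this sketch and are not taken from GUSTO.

```python
from collections import namedtuple

# Hypothetical instruction record: a name, an operation type, and the names
# of the instructions whose results it consumes.
Instr = namedtuple("Instr", ["name", "op", "deps"])


def list_schedule(instrs, resources):
    """Greedy list scheduling under a fixed number of arithmetic units.

    resources maps an operation type (e.g. "mul") to the number of units.
    Assumption: every operation takes exactly one cycle.
    Returns a list where entry c holds the names issued in cycle c.
    """
    done_cycle = {}          # instruction name -> cycle in which it was issued
    pending = list(instrs)
    schedule = []
    cycle = 0
    while pending:
        free = dict(resources)   # units still available in this cycle
        issued = []
        for ins in pending:
            # An instruction is ready once all its inputs were produced
            # in an earlier cycle.
            ready = all(d in done_cycle and done_cycle[d] < cycle
                        for d in ins.deps)
            if ready and free.get(ins.op, 0) > 0:
                free[ins.op] -= 1
                issued.append(ins)
        if not issued:
            raise ValueError("nothing can be issued: check deps and resources")
        for ins in issued:
            done_cycle[ins.name] = cycle
            pending.remove(ins)
        schedule.append([ins.name for ins in issued])
        cycle += 1
    return schedule


# Example program: a 4-element dot product (4 multiplies, 3 additions).
dot4 = [
    Instr("m1", "mul", []), Instr("m2", "mul", []),
    Instr("m3", "mul", []), Instr("m4", "mul", []),
    Instr("s1", "add", ["m1", "m2"]),
    Instr("s2", "add", ["m3", "m4"]),
    Instr("s3", "add", ["s1", "s2"]),
]

small = list_schedule(dot4, {"mul": 2, "add": 1})   # 4 cycles
large = list_schedule(dot4, {"mul": 4, "add": 2})   # 3 cycles
```

With two multipliers and one adder the dot product needs four cycles; doubling the units shortens it to three. This is the kind of area versus throughput tradeoff that the choice of arithmetic units controls.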
GUSTO Design Flow: Error Analysis
- GUSTO employs fixed-point arithmetic in generated architectures.
- GUSTO performs error analysis to find a fixed-point representation whose results are close in accuracy to those of a floating-point implementation.
- On user-defined input data, the fixed-point results from GUSTO (using variable bit width) are compared with the floating-point results from MATLAB (single/double precision).
Error analysis metrics: 1) Mean Error 2) Peak Error 3) Standard Deviation of Error 4) Mean Percentage Error
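The four metrics can be illustrated with a small sketch that compares a fixed-point dot product against a double-precision reference for several bit widths. The quantization model (round to nearest, no overflow handling), the helper names `to_fixed`, `fixed_dot` and `error_metrics`, and the input data are assumptions made for illustration; the slides do not describe how GUSTO computes these values.

```python
import math


def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no overflow model)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def fixed_dot(xs, ys, frac_bits):
    """Dot product with inputs, products and the running sum all quantized."""
    acc = 0.0
    for x, y in zip(xs, ys):
        prod = to_fixed(to_fixed(x, frac_bits) * to_fixed(y, frac_bits),
                        frac_bits)
        acc = to_fixed(acc + prod, frac_bits)
    return acc


def error_metrics(reference, approx):
    """Mean, peak, standard deviation and mean percentage of absolute error."""
    errors = [abs(r - a) for r, a in zip(reference, approx)]
    n = len(errors)
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    # Percentage error is only defined where the reference is non-zero.
    pct = [100.0 * e / abs(r) for e, r in zip(errors, reference) if r != 0]
    return {
        "mean": mean,
        "peak": max(errors),
        "std": std,
        "mean_pct": sum(pct) / len(pct) if pct else 0.0,
    }


# Hypothetical user-defined input data: pairs of vectors.
inputs = [
    ([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]),
    ([1.5, -0.7, 0.25], [0.3, 0.9, -1.1]),
    ([0.05, 0.6, -0.9], [0.8, -0.2, 0.35]),
]
# Floating-point (double precision) reference results.
reference = [sum(x * y for x, y in zip(xs, ys)) for xs, ys in inputs]

# Sweep the number of fractional bits and collect the metrics for each.
results = {
    bits: error_metrics(reference,
                        [fixed_dot(xs, ys, bits) for xs, ys in inputs])
    for bits in (4, 8, 12, 16)
}
```

Plotting `results[bits]["mean"]` against `bits` gives a curve of average error versus number of bits used, from which a designer can pick the smallest bit width that meets an accuracy target.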
GUSTO Design Flow: Architecture Generation
GUSTO generates a CPU-like architecture with:
- dynamic instruction scheduling;
- dynamic memory assignments;
- full connectivity between functional units.
[Figure: instruction controller, memory controller and arithmetic units (adders, multipliers) with full connectivity]
GUSTO Design Flow: Collecting Scheduling Information
- GUSTO collects scheduling information from the instruction and memory controllers.
- GUSTO uses this information to eliminate unneeded resources, automatically creating a small, fast, statically scheduled architecture.
[Figure: instruction controller, memory controller and arithmetic units (adders, multipliers), now with static instruction scheduling and static memory assignments]
GUSTO Design Flow: Resource Trimming for Hardware Optimization
- GUSTO simulates the architecture to determine the usage of arithmetic units, multiplexers, register entries and input/output ports, and trims away the unused components together with their interconnects.
- GUSTO's optimization provides substantial silicon savings while preserving the correctness of the solution.
[Figure: multiplier, adder and memory with full connectivity, reduced to the required connectivity]
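The trimming idea can be sketched as a set operation: start from full connectivity, replay the transfers observed in simulation, and keep only the connections that were exercised. The data layout and the trace below are hypothetical, and the port names are borrowed from the trimming figure purely for illustration. This is not GUSTO's implementation.

```python
def trim_connectivity(full, trace):
    """Keep only the source -> input connections seen during simulation.

    full:  dict mapping each input port to the set of sources wired to it.
    trace: iterable of (source, input_port) transfers recorded in simulation.
    """
    used = {}
    for src, dst in trace:
        if src not in full.get(dst, set()):
            raise ValueError(f"{src} -> {dst} is not in the full architecture")
        used.setdefault(dst, set()).add(src)
    # Inputs that never received data end up with an empty source set.
    return {dst: used.get(dst, set()) for dst in full}


def connection_count(connectivity):
    """Total number of wires, a rough proxy for multiplexer area."""
    return sum(len(sources) for sources in connectivity.values())


sources = ["Out_A", "Out_B", "Out_mem1", "Out_mem2"]
input_ports = ["In_A1", "In_A2", "In_B1", "In_B2", "In_mem1"]

# Full connectivity: every output can drive every input.
full = {port: set(sources) for port in input_ports}

# Hypothetical transfers recorded over the simulation runs.
trace = [
    ("Out_mem1", "In_A1"), ("Out_mem2", "In_A2"),
    ("Out_A", "In_B1"), ("Out_mem1", "In_B2"),
    ("Out_B", "In_mem1"),
    ("Out_mem1", "In_A1"), ("Out_B", "In_A2"),
]

trimmed = trim_connectivity(full, trace)
before, after = connection_count(full), connection_count(trimmed)
```

In this toy trace, 20 connections shrink to 6. The trimmed design only supports the schedule that was simulated, which is why this kind of trimming pairs naturally with static scheduling.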
GUSTO Trimming Feature
[Figure: arithmetic units A (In_A1, In_A2, Out_A) and B (In_B1, In_B2, Out_B) and memory (In_mem1, Out_mem1, Out_mem2). For unit A, simulation runs are used to select among the candidate sources Out_A, Out_B, Out_mem1 and Out_mem2 for inputs In_A1 and In_A2.]
GUSTO Trimming Feature
[Figure: arithmetic units A (In_A1, In_A2, Out_A) and B (In_B1, In_B2, Out_B) and memory (In_mem1, Out_mem1, Out_mem2). For unit B, simulation runs are used to select among the candidate sources Out_A, Out_B, Out_mem1 and Out_mem2 for inputs In_B1 and In_B2.]
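Matrix Decompositions: QR, LU and Cholesky
- QR: A = Q R, where A is the given matrix, Q is an orthogonal matrix and R is an upper triangular matrix.
- LU: A = L U, where L is a lower triangular matrix, A is the given matrix and U is an upper triangular matrix.
- Cholesky: A = G G^T, where G is the unique lower triangular matrix (Cholesky triangle), G^T is its transpose and A is the given matrix.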
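To make two of the factorizations concrete, here is a minimal textbook sketch in pure Python of LU (Doolittle form, without pivoting) and Cholesky. It is not the hardware algorithm from the talk. The function names and the example matrix are chosen for illustration, and the Cholesky routine assumes a symmetric positive definite input.

```python
import math


def lu(a):
    """Doolittle LU without pivoting: returns (L, U) with A = L U."""
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    u = [[float(v) for v in row] for row in a]
    for k in range(n):
        if u[k][k] == 0:
            raise ValueError("zero pivot; pivoting is not implemented")
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            # Eliminate column k from row i.
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def cholesky(a):
    """Cholesky factor G (lower triangular) with A = G G^T."""
    n = len(a)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(g[i][k] * g[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                g[i][j] = math.sqrt(d)
            else:
                g[i][j] = (a[i][j] - s) / g[j][j]
    return g


# A small symmetric positive definite example matrix.
A = [[4, 12, -16],
     [12, 37, -43],
     [-16, -43, 98]]

L, U = lu(A)
G = cholesky(A)
```

For this matrix, `G` is `[[2, 0, 0], [6, 1, 0], [-8, 5, 3]]` and `U` is `[[4, 12, -16], [0, 1, 5], [0, 0, 9]]`. Both factorizations cost on the order of n^3 operations, but Cholesky does roughly half the arithmetic of LU because it exploits symmetry.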
Matrix Inversion
A A^-1 = I, where A is the given matrix, A^-1 is the inverse matrix and I is the identity matrix.
Full matrix inversion is costly!
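A decomposition avoids full inversion by reducing it to triangular solves. The sketch below, written under stated assumptions, factors A = Q R with modified Gram-Schmidt and then obtains A^-1 = R^-1 Q^T by back substitution, one column at a time. Modified Gram-Schmidt is a choice made for this sketch, since the slides do not say which QR variant is used. The helper names and the 2 × 2 example are illustrative.

```python
import math


def qr_mgs(a):
    """QR by modified Gram-Schmidt: returns (Q, R) with A = Q R."""
    n = len(a)
    cols = [[float(a[i][j]) for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the already computed columns of Q,
        # updating v after each projection (the "modified" variant).
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for vk, qk in zip(v, q)]
        norm = math.sqrt(sum(vk * vk for vk in v))
        if norm == 0:
            raise ValueError("matrix is singular")
        r[j][j] = norm
        q_cols.append([vk / norm for vk in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def invert_via_qr(a):
    """A^-1 = R^-1 Q^T, computed column by column with back substitution."""
    n = len(a)
    q, r = qr_mgs(a)
    inv = [[0.0] * n for _ in range(n)]
    for k in range(n):
        b = q[k]                      # column k of Q^T is row k of Q
        x = [0.0] * n
        for i in reversed(range(n)):  # solve R x = b from the bottom up
            s = sum(r[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (b[i] - s) / r[i][i]
        for i in range(n):
            inv[i][k] = x[i]
    return inv


def matmul(a, b):
    """Plain matrix product, used here only to check the result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


A = [[4.0, 3.0],
     [6.0, 3.0]]
A_inv = invert_via_qr(A)
check = matmul(A, A_inv)   # should be close to the identity matrix
```

Because Q is orthogonal, its inverse is just its transpose, so the only real inversion work left is the triangular matrix R, which back substitution handles directly.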
Results: Inflection Point Analysis (Sequential)
Results: Inflection Point Analysis (Parallel)
Results: Finding the Optimal Hardware (Decomposition Methods)
Decrease in area from the General Purpose Architecture to the Application Specific Architecture:
Method      Decrease in Area (Percentage)
QR          94%
LU          83%
Cholesky    86%
Results: Finding the Optimal Hardware (Decomposition Methods)
Increase in throughput from the General Purpose Architecture (Mode 1) to the Application Specific Architecture (Mode 2):
Method      Increase in Throughput (Percentage)
QR          68%
LU          16%
Cholesky    14%
Results: Finding the Optimal Hardware, Matrix Inversion (using QR)
- average of 59% decrease in area
- 3X increase in throughput
Results: Architectural Design Alternatives
Results: Comparison with Previously Published Work, Adaptive Weight Calculation (AWC) Core
Designs compared: Edman et al., Karkooti et al., Dick et al., GUSTO
Application: Matrix Inversion (Edman et al., Karkooti et al.); Beamformer (Dick et al.); AWC (GUSTO)
Method: QR
Matrix Size: 4 × 4 3 × 35 × 54 × 4
Bit width:
Data type: fixed (Edman et al.); floating (Karkooti et al.); NR (Dick et al.); fixed (GUSTO)
Device type: Virtex 2, Virtex 4
Slices:
DSP48s: NR
BRAMs: NR961
Throughput (10^6 × s^-1):
References:
F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).
C. Dick, F. Harris, M. Pajic, D. Vuletic, "Real-Time QRD-Based Beamforming on an FPGA Platform", Asilomar Conference on Signals, Systems and Computers (2006).
GUSTO: General architecture design Utility and Synthesis Tool for Optimization
- GUSTO is a tool that provides automatic generation and optimization of a variety of application-specific processing elements (PEs) with different parameterization options.
- Current projects include the implementation of a Short Preamble Processing unit for an OFDM receiver design.
Inputs: Algorithm (e.g. QR decomposition); Bit width (e.g. 19 bits of precision); Resource Allocation (e.g. 4 multipliers and 3 adders); Modes (e.g. heterogeneous cores connected using hierarchical datapaths)
Outputs: HDL files; Error Analysis
Thank You