Distributed Arithmetic

Distributed Arithmetic
Dr Sumam David S., Dept. of E&C, NITK Surathkal
Courtesy for slides: Xilinx Professor’s Workshop Resources

Objective
Distributed arithmetic: What? Where? How?

What is DA?
Multiplication using look-up tables (LUTs).
Used to implement multipliers in LUT-rich FPGAs.

Twos Complement Multiplication
One bit at a time: for every multiplier bit except the last, the multiplicand (in this example, -127) is sign-extended and added to the partial sum (when that bit is 1).
For the last multiplier bit (the sign bit), the multiplicand is subtracted instead, which handles negative multipliers.
For each multiplier bit, one product bit is determined and output.
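The add-then-subtract rule above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit from the slides: the function name and the 8-bit default width are assumptions, and Python's unbounded integers stand in for the sign extension that hardware has to do explicitly.

```python
def twos_complement_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Shift-and-add multiply; `multiplier` is a `bits`-wide two's complement value."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= multiplier <= hi:
        raise ValueError("multiplier does not fit in the given width")
    pattern = multiplier & ((1 << bits) - 1)  # raw two's complement bit pattern
    acc = 0
    for i in range(bits):
        if (pattern >> i) & 1:
            term = multiplicand << i  # multiplicand aligned to this bit's weight
            # The last (sign) bit has weight -2^(bits-1): subtract instead of add.
            acc += -term if i == bits - 1 else term
    return acc


print(twos_complement_multiply(-127, -3))  # multiplicand from the slide, multiplier chosen here
```

With the slide's multiplicand of -127 and an arbitrarily chosen multiplier of -3, the result matches the ordinary product, 381.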

SDA 1-Tap FIR Filter
[Figure: N-bit-wide sample data X0 passes through a parallel-to-serial converter and addresses (A0) a partial product ROM, whose output feeds a scaling accumulator (+/-, Z-1). The LUT contains two locations: address 0 holds 00000...0 and address 1 holds C0.]
Distributed arithmetic is based on saving partial products in memories. Because the coefficients are known ahead of time, it is possible to pre-calculate the result of a multiplication. In this example, we are looking at a 1-tap FIR filter. For each serial bit of the sample, the result of the multiplication is either 0 x coef or 1 x coef. Hence the LUT, used in ROM mode, is initialized with 0 at location 0 and C0 at location 1. The same idea extends to 2 taps.

Distributed Arithmetic for a 2-Tap Filter
Partial products of equal weight are added together before being summed with the partial products of the next higher weight.
Create a look-up table of the summed partial products.

Bit weights: -2^3  2^2  2^1  2^0
C0 = 1 0 0 1  (-7)     C1 = 0 1 1 0  ( 6)
X0 = 0 1 1 1  ( 7)     X1 = 0 1 0 1  ( 5)

C0 x X0 = (-49) and C1 x X1 = (30), so C0 x X0 + C1 x X1 = (-19).

Bit of X0, X1     C0 term + C1 term     Weighted sum
bit 0 (2^0)       (1001 + 0110)         (-1)
bit 1 (2^1)       (1001 + 0000)         (-14)
bit 2 (2^2)       (1001 + 0110)         (-4)
bit 3 (-2^3)      (0000 + 0000)         (0)
Total             1 1 1 0 1 1 0 1       (-19)

Partial products are sign-extended before they are added.
This basically involves changing the order of the computations: multiply bit 0 of each sample by its coefficient (bit 0 of X0 by C0, bit 0 of X1 by C1), add the two partial products together, and repeat for each higher bit.
(Serial-Data / Tap-Parallel Multiply)
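The reordering can be checked numerically. The sketch below uses the slide's values (C0 = -7, C1 = 6, X0 = 7, X1 = 5, 4-bit words); the helper names are made up for illustration, and the sign bit is given weight -2^3 as in the slide.

```python
def bit(value: int, position: int) -> int:
    """Return one bit of `value`; Python's >> sign-extends negative ints."""
    return (value >> position) & 1


def weighted_partial_sums(coeffs, samples, bits):
    """One entry per bit position: (sum of Ck * bit_b(Xk)) times the bit weight."""
    sums = []
    for b in range(bits):
        same_weight = sum(c * bit(x, b) for c, x in zip(coeffs, samples))
        weight = -(1 << b) if b == bits - 1 else (1 << b)  # sign bit weighs -2^(bits-1)
        sums.append(same_weight * weight)
    return sums


coeffs = [-7, 6]   # C0, C1 from the slide
samples = [7, 5]   # X0, X1 from the slide
partials = weighted_partial_sums(coeffs, samples, bits=4)
print(partials, sum(partials))
```

The per-bit sums come out as -1, -14, -4 and 0, and they add up to -19, the same as computing -49 + 30 directly.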

SDA 2-Tap FIR Filter
[Figure: N-bit-wide sample data X0 and X1 are serialized and form the address bits A0 and A1 of a partial product ROM, whose output feeds a scaling accumulator (+/-, Z-1).]
The LUT contains all possible sums of the partial products (address written as A1 A0):
00: 0000...0
01: C0
10: C1
11: C0 + C1
This is the 2-tap version, showing the partial products that the ROM outputs.

SDA 4-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X3 are serialized into address bits A0 to A3. Each bit addresses its own partial product ROM holding 0000...0 and its coefficient (C0 to C3); the ROM outputs are summed by adders and fed to a scaling accumulator (+/-, Z-1).]
Here is the 4-tap version, drawn with 4 ROMs, each using only two of its 16 locations to store the 0 value and its coefficient. Because the LUT has four inputs, the four ROMs and their adders can instead be pre-programmed into a single 16x1 ROM, with the four address bits provided by the outputs of the parallel-to-serial converters.
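To see why four small ROMs plus adders collapse into one 16-entry table, the following sketch builds both and compares them. The coefficient values are hypothetical (the slides give none for the 4-tap case), and the function names are illustrative. In hardware, a multi-bit table word would presumably need one 16x1 LUT per output bit; the model below ignores word width.

```python
def rom_bank(coeffs, addr):
    """Four 2-entry ROMs (0 or Ck) whose outputs go through an adder tree."""
    return sum(c if (addr >> k) & 1 else 0 for k, c in enumerate(coeffs))


def build_lut(coeffs):
    """Fold the ROMs and adders into one table with 2**len(coeffs) entries."""
    return [rom_bank(coeffs, addr) for addr in range(1 << len(coeffs))]


coeffs = [3, -5, 2, 7]  # hypothetical C0..C3, not taken from the slides
lut = build_lut(coeffs)

# Address bit k is the current serial bit of sample Xk.
for addr, value in enumerate(lut):
    print(f"{addr:04b} -> {value}")
```

Every address returns the same value the adder tree would have produced, so the adders disappear from the datapath and only the scaling accumulator remains.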

SDA 8-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X7 are serialized. X0 to X3 address one partial product ROM (A0 to A3) and X4 to X7 address a second one (A0 to A3). A pre-adder sums the two ROM outputs ahead of the scaling accumulator (+/-, Z-1). Each 4-input LUT contains all possible sums of the partial products.]
Because the FPGA look-up tables have 4 inputs, taps are grouped by four in order to efficiently address the LUTs preloaded with partial products. As the block diagram suggests, there may be an advantage in using a multiple of 4 taps to make full use of the distributed memory. More design tricks are covered in the DSP Implementation Techniques course.

Xilinx DA FIR Performance
[Figure: two plots against Filter Length (Taps), one of Sample Rate (MSPS) and one of Performance (MMACs/s), with curves for Single MAC, Dual MAC, Serial FPGA FIR, and DA FIR at B = 8, 12, and 16.]
fclk = 200 MHz for both processor and FPGA. B = data sample precision for the FPGA.
As the number of taps increases, a MAC-based filter’s sample rate falls in inverse proportion to the filter length, because each output needs one multiply-accumulate per tap, whereas a serial DA-based FIR filter has a constant sample rate independent of the number of taps. For the DA FIR filter, the sample rate depends instead on the sample size: as B increases, the sample rate decreases. Note that the hardware resources are a function of both sample size and number of taps. In the right-hand figure, performance is given in terms of mega MACs per slice.
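A rough first-order model makes the two trends concrete. It assumes one multiply-accumulate per tap per clock on each MAC unit and one sample bit per clock for serial DA, and it ignores pipeline latency and any extra accumulator cycles; the function names are illustrative and the numbers are not read from the plots.

```python
def mac_sample_rate(fclk_hz: float, taps: int, mac_units: int = 1) -> float:
    """Each output needs `taps` MACs shared across `mac_units` engines."""
    return fclk_hz * mac_units / taps


def serial_da_sample_rate(fclk_hz: float, sample_bits: int) -> float:
    """Serial DA spends one clock per sample bit, regardless of the tap count."""
    return fclk_hz / sample_bits


FCLK = 200e6  # 200 MHz, as stated on the slide
for taps in (10, 50, 100, 250):
    print(taps, mac_sample_rate(FCLK, taps) / 1e6, "MSPS (single MAC)")
for b in (8, 12, 16):
    print(b, serial_da_sample_rate(FCLK, b) / 1e6, "MSPS (serial DA, any tap count)")
```

Under these assumptions a single MAC at 100 taps manages about 2 MSPS, while serial DA with B = 8 stays at 25 MSPS however long the filter is.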

Trade Clock Cycles for Logic Area
[Figure: bits b7 to b0 of an 8-bit sample processed at four levels of hardware over-sampling, ranging from Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s) as more bits are processed per clock cycle.]
Hardware over-sampling = 8: the sample is serialized and processed 1 bit per clock cycle, so 8 clock cycles are required to process the whole sample.
Hardware over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle, so 4 clock cycles are required to process the whole sample.
Hardware over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle.
Hardware over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.
Processing the data serially, one bit at a time, can result in slow computation rates. When the input variables are B bits in length, B clock cycles are required to complete an inner-product calculation. Additional speed may be obtained in several ways. One approach is to partition the input words into L subwords and process these subwords in parallel. This method requires L times as many memory look-up tables, and so comes at the cost of a linear increase in storage requirements. Maximum speed is achieved by factoring the input variables into single-bit subwords. With this factoring, a new output sample is computed on each clock cycle, which results in a fully parallel DA FIR (PDAFIR) architecture.
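The trade-off can be modeled by splitting each sample into L subwords, giving every subword its own copy of the look-up table, and counting clock cycles. This is a behavioral sketch only: the names are assumptions, samples are assumed to fit in `bits`-wide two's complement, and the inner loop over subwords stands for lookups that hardware would perform in the same clock.

```python
def build_lut(coeffs):
    """Table of all possible coefficient sums; address bit k selects Ck."""
    n = len(coeffs)
    return [sum(coeffs[k] for k in range(n) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, samples, bits, subwords=1):
    """Return (result, clock_cycles) for a DA inner product."""
    if bits % subwords:
        raise ValueError("bits must be a multiple of subwords")
    lut = build_lut(coeffs)
    slice_len = bits // subwords
    acc = 0
    for cycle in range(slice_len):         # one iteration models one clock cycle
        for s in range(subwords):          # one LUT copy per subword
            b = s * slice_len + cycle      # bit position handled by this LUT copy
            addr = 0
            for k, x in enumerate(samples):
                addr |= ((x >> b) & 1) << k
            term = lut[addr] << b          # scale by the bit weight
            acc += -term if b == bits - 1 else term  # sign bit is subtracted
    return acc, slice_len


for l in (1, 2, 4):
    print(l, da_inner_product([-7, 6], [7, 5], bits=4, subwords=l))
```

For the earlier 2-tap example the result stays at -19 while the cycle count drops from 4 to 2 to 1 as the number of subwords (and LUT copies) goes from 1 to 2 to 4.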

Conclusion
Efficiency of computation
Slow, as it is bit-serial
Memory requirements

References
The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com