A Bit-Serial Method of Improving Computational Efficiency of Dot-Products



• Distributed arithmetic (DA) is a bit-serial technique that greatly reduces the resources needed for the dot-product calculation
• So called because the arithmetic resources are not easily recognizable: “Where’s the MAC module?”
• Takes advantage of small tables of precomputed coefficients and a clever rearrangement of the math

• In signal processing the most common operation is the dot product
• DA lends itself well to FPGA implementation due to its use of lookup tables
• DA can reduce gate count by 50%-80% in signal processing arithmetic!

• It turns out that the dot product is used extensively in DSP (FIR, FFT, etc.)
• Recall that the dot product is a sum of products: $y = A_1 x_1 + A_2 x_2 + \dots + A_K x_K$
• Written as a summation: $y = \sum_{k=1}^{K} A_k x_k$

• Simple example: smoothing data via DSP (low-pass filter)
• Accomplished with an FIR filter. General form: [equation not preserved in the transcript]
• So we could implement a “3-tap (K=4) moving average filter” (in this special case, $A_1 = A_2 = A_3 = 0.33$)
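A minimal sketch of the moving-average idea, assuming equal weights of 1/taps (the slide rounds these to 0.33) and assuming the output is produced only once a full window of samples is available. The function name `moving_average` is hypothetical.

```python
def moving_average(samples, taps=3):
    """Smooth `samples` with an equal-weight FIR filter.

    Each output is the dot product of the `taps` most recent samples
    with a constant coefficient vector [1/taps, ..., 1/taps].
    """
    coeff = 1.0 / taps
    out = []
    for n in range(taps - 1, len(samples)):
        window = samples[n - taps + 1 : n + 1]
        out.append(sum(coeff * x for x in window))
    return out


if __name__ == "__main__":
    noisy = [3, 6, 9, 6, 3]
    print(moving_average(noisy))  # three smoothed values, one per full window
```

Every output sample is one dot product with fixed coefficients, which is exactly the operation DA targets.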

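• Recall the goal: $y = \sum_{k=1}^{K} A_k x_k$
• $x$ is the filter input (digital!), so let's consider its two's complement representation (scaled so that $|x_k| < 1$ for cleanliness): $x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$, where $N$ is the total number of bits and $b_{k0}$ is the sign bit
• Putting them together: $y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$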
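As an illustration of the scaled two's complement format, the sketch below encodes a value in [-1, 1) as N bits (sign bit first) and decodes it again, giving the sign bit a weight of -1 and bit n a weight of 2^-n. The helper names `to_bits` and `from_bits` are hypothetical.

```python
def to_bits(x, n_bits):
    """Encode x in [-1, 1) as n_bits two's complement bits, sign bit first."""
    half = 1 << (n_bits - 1)
    scaled = round(x * half)
    if not -half <= scaled < half:
        raise ValueError("x does not fit in the requested word length")
    word = scaled & ((1 << n_bits) - 1)  # wrap negatives into n_bits
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def from_bits(bits):
    """Decode: the sign bit weighs -1, bit n (n >= 1) weighs 2**-n."""
    return -bits[0] + sum(b * 2.0 ** -n for n, b in enumerate(bits) if n >= 1)


if __name__ == "__main__":
    bits = to_bits(-0.625, 4)
    print(bits, from_bits(bits))  # bits are [1, 0, 1, 1], decoding back to -0.625
```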

• Expand the summation: $y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0})$
• Since each $b_{kn}$ is 0 or 1 (two possible values), the bracketed sum has only $2^K$ possible values
• We can precompute all terms that depend on the input bits and store them in a ROM of size $2^{K+1}$
• The x inputs can then be used to address the ROM directly: LUT!
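The rearrangement can be checked numerically. The sketch below computes a dot product two ways: directly (each coefficient times its decoded input), and bit plane by bit plane (sum the coefficients selected by bit n of every input, then weight that sum by 2^-n, with the sign-bit plane subtracted). Integer coefficients and exact fractions are assumptions made to keep the comparison exact; the function names are hypothetical.

```python
from fractions import Fraction


def value(bits):
    """Two's complement fraction: sign bit weighs -1, bit n weighs 2**-n."""
    return -bits[0] + sum(Fraction(b, 2 ** n) for n, b in enumerate(bits) if n >= 1)


def dot_direct(coeffs, words):
    """Sum of products, one multiplication per input word."""
    return sum(a * value(w) for a, w in zip(coeffs, words))


def dot_bit_planes(coeffs, words):
    """Same result, reordered so the inner sum runs over inputs for one bit position."""
    n_bits = len(words[0])
    total = sum(-a * w[0] for a, w in zip(coeffs, words))  # sign-bit term
    for n in range(1, n_bits):
        plane = sum(a * w[n] for a, w in zip(coeffs, words))  # one of 2**K values
        total += Fraction(plane, 2 ** n)
    return total


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]
    words = [[1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
    print(dot_direct(coeffs, words), dot_bit_planes(coeffs, words))
```

Because each per-plane sum depends only on K bits, it can be looked up instead of computed, which is what makes the ROM approach possible.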

• Non-DA hardware implementation, based on the original equation
[Figure: block diagram; labels: 8-bit multiplier, 8-bit adder]

• We said this is a ‘bit-serial’ technique, so how can we perform multiplication?
• Here, x is a 4-bit input and A is an 8-bit constant
• Example multiplication: x = 1011, A = [value not preserved in the transcript]
[Figure: scaling accumulator; labels: x, A, AND with 1 parallel and 1 serial input, shift right by 1, result register]
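A minimal integer model of a scaling accumulator, as a sketch only. It assumes x is unsigned and fed in LSB first, and it uses A = 45 purely as a hypothetical 8-bit constant, since the slide's value of A did not survive. Each cycle ANDs the parallel constant with one serial bit, adds the result at the top of the accumulator, and shifts right by 1.

```python
def bit_serial_multiply(a, x, n_bits):
    """Multiply constant `a` by the unsigned n_bits-wide `x`, one bit per cycle."""
    acc = 0
    for i in range(n_bits):
        bit = (x >> i) & 1                 # serial input, LSB first
        partial = a if bit else 0          # AND of the parallel constant with the bit
        # Add at the top of the accumulator, then shift right by 1.
        # The extra n_bits of headroom mean no low-order bits are lost.
        acc = (acc + (partial << n_bits)) >> 1
    return acc


if __name__ == "__main__":
    A = 45          # hypothetical 8-bit constant (not from the slides)
    x = 0b1011      # the 4-bit input from the example
    print(bit_serial_multiply(A, x, 4))  # 495, i.e. 45 * 11
```

After N cycles the term added in cycle i has been shifted right N - i times, so it ends up weighted by 2^i, as in ordinary shift-and-add multiplication.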

• So, now we substitute the scaling accumulator into our original design. Getting closer...

• Let’s rearrange the hardware to match our expanded equation:
  - We first sum the products of each input bit and its constant
  - Then we add and scale each of those terms

• Now recall that we had the clever idea to use precomputed sums in a LUT for the bitwise addition

Address  Data
0001     $C_0$
0010     $C_1$
0011     $C_0 + C_1$
...      ...
1111     $C_0 + C_1 + C_2 + C_3$
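A sketch of how such a table of precomputed sums could be generated, assuming address bit i selects coefficient C_i (so address 0001 holds C_0 and 0010 holds C_1). The function name `build_lut` and the example coefficient values are hypothetical.

```python
def build_lut(coeffs):
    """Return a list of 2**K sums; entry `addr` adds every C_i whose address bit i is 1."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


if __name__ == "__main__":
    # Easy-to-read hypothetical coefficients C_0..C_3
    lut = build_lut([1, 10, 100, 1000])
    for addr, data in enumerate(lut):
        print(f"{addr:04b}  {data}")
```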

• We need to accommodate the negative term, so we add one more address line to the LUT, called $T_s$. The ROM size is now $2^{K+1}$
• $T_s$ is a timing signal: $T_s = 1$ during sign-bit time, 0 otherwise
• We also need this bit to know when the final result is ready
• For all addresses with $T_s = 1$ the ROM contains the negative of the appropriate sum, e.g. $-(C_0 + C_1 + C_2 + C_3)$ when every input bit is 1

• This is an example of K=4 DA dot-product hardware
• ROM size = $2^{K+1} = 2^5 = 32$
• Here is our scaling accumulator
• Switch SWA moves to position 2 after $T_s = 1$, at which point y contains the final result
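The K=4 datapath can be modelled in a few lines. This is a sketch under stated assumptions, not the hardware itself: inputs are N-bit two's complement integers rather than scaled fractions, bits arrive LSB first, coefficients are integers, and the switch SWA is represented simply by returning the accumulator after the sign-bit cycle. The names `build_rom` and `da_dot` are hypothetical.

```python
def build_rom(coeffs):
    """2**(K+1) words. Address = (T_s << K) | input bits.

    The T_s = 0 half holds the sums of selected coefficients;
    the T_s = 1 half holds the negatives of the same sums.
    """
    k = len(coeffs)
    sums = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]
    return sums + [-s for s in sums]


def da_dot(coeffs, xs, n_bits):
    """Dot product of coeffs and xs in n_bits cycles, with no multiplier."""
    k = len(coeffs)
    rom = build_rom(coeffs)
    acc = 0
    for n in range(n_bits):                    # one bit of every input per cycle
        ts = 1 if n == n_bits - 1 else 0       # high only during sign-bit time
        addr = ts << k
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i        # bit n of input i drives address line i
        acc = (acc + (rom[addr] << n_bits)) >> 1   # scaling accumulator
    return acc                                 # valid once T_s has been seen


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]
    xs = [-5, 6, 1, -1]                        # each fits in 4-bit two's complement
    print(da_dot(coeffs, xs, 4), sum(c * x for c, x in zip(coeffs, xs)))
```

The loop body contains only a table lookup, an add, and a shift, which is why the multiply-accumulate unit is hard to spot in the resulting hardware.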

• Computes an N-bit dot product in N cycles
• Reduced area and high speed due to the ROM
• However, requires a ROM of size $2^{K+1}$ (grows exponentially with the number of input lines)
• With 16 input lines (K = 16), the ROM needs $2^{17}$ = 128K words!
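A quick calculation of how the ROM depth grows with the number of input lines, assuming one word per address and the extra T_s address line. The function name `rom_words` is hypothetical.

```python
def rom_words(k):
    """Number of ROM words for K input lines plus the T_s line."""
    return 1 << (k + 1)


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(f"K={k}: {rom_words(k)} words")
    # K=16 gives 131072 words, i.e. 128K
```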

• Bit-serial means an N-bit dot product requires N cycles... Slower than parallel?
• Fully parallel HW multipliers are not generally practical due to large area/power!
• Time-multiplexing a single parallel HW multiplier means you lose the speed gain: K cycles for the K products, versus N cycles for DA
• Example: K=8, N=8 takes the same time on time-multiplexed parallel HW as on bit-serial DA

• We can reduce the ROM size to $2^K$ with some tricks:
  - Replace the adder with an adder/subtractor
  - $T_s$ becomes the control line for the adder/subtractor
  - ROM size is reduced by half
• There are other math tricks to reduce the size further, to $2^{K-1}$
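A sketch of the adder/subtractor variant, under the same assumptions as an integer bit-serial model (N-bit two's complement integer inputs, LSB first, integer coefficients). The ROM holds only the 2^K positive sums, and T_s selects subtraction during the sign-bit cycle instead of addressing a second half of the ROM. The function name `da_dot_addsub` is hypothetical.

```python
def da_dot_addsub(coeffs, xs, n_bits):
    """DA dot product with a 2**K-word ROM and an adder/subtractor."""
    k = len(coeffs)
    rom = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]
    acc = 0
    for n in range(n_bits):
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i
        word = rom[addr] << n_bits
        ts = n == n_bits - 1               # T_s now controls add vs. subtract
        acc = ((acc - word) if ts else (acc + word)) >> 1
    return acc


if __name__ == "__main__":
    print(da_dot_addsub([3, -5, 7, 2], [-5, 6, 1, -1], 4))  # -40
```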

• Speed is determined by the serial nature of the input: 1 bit at a time (1 BAAT)
• We can expand the HW to process multiple bits at a time:
  - Introduce the input as bit pairs $x_{10} x_{11}$, $x_{12} x_{13}$, etc.
  - Shift the LSB-of-pair result by 1
  - Shift the accumulator feedback by 2
  - Requires 2 ROMs instead of 1
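A sketch of the two-bits-at-a-time idea, assuming an even word length, N-bit two's complement integer inputs fed LSB first, and an adder/subtractor for the sign bit. In this integer model the result for the more significant bit of each pair is shifted left by 1 relative to the other, which is the same relative weighting as shifting the LSB result right by 1 in hardware. The two lookups per cycle stand in for the two ROMs. The function name `da_dot_2baat` is hypothetical.

```python
def da_dot_2baat(coeffs, xs, n_bits):
    """DA dot product consuming two bits of every input per cycle."""
    if n_bits % 2:
        raise ValueError("this sketch assumes an even word length")
    k = len(coeffs)
    rom = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]

    def address(n):
        """Pack bit n of every input into one ROM address."""
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i
        return addr

    acc = 0
    for n in range(0, n_bits, 2):
        lo = rom[address(n)]               # first ROM: LSB of the pair
        hi = rom[address(n + 1)]           # second ROM: MSB of the pair
        if n + 1 == n_bits - 1:
            hi = -hi                       # the sign bit contributes negatively
        pair = lo + (hi << 1)              # relative shift of 1 within the pair
        acc = (acc + (pair << n_bits)) >> 2   # accumulator feedback shifted by 2
    return acc


if __name__ == "__main__":
    print(da_dot_2baat([3, -5, 7, 2], [-5, 6, 1, -1], 4))  # -40, in 2 cycles instead of 4
```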

• DA lends itself easily to DSP because it applies directly to the dot product
• DA is easy to implement on an FPGA because of the similar architecture: LUTs (of course, it is better still on custom hardware)
• DA is not limited to the dot product; it will work for any algorithm where precomputed values can be leveraged

• DA is a very efficient means of mechanizing the dot product
• The use of DA can save 50-80% area over the parallel approach
• Like everything, DA has tradeoffs:
  - ROM size ↔ input lines
  - Speed ↔ area (multi-ROM)

• Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. White, Stanley. IEEE ASSP Magazine, July 1989. (I pulled most of the basic talk info from here)
• Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters. Hwang, H. and Su, C. IEEE Xplore, VLSI Signal Processing IX. (This has some slight remarks about bit-parallel vs. bit-serial, plus an auto-regressive moving average filter example)
• Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers. Waelchli, G. et al. Journal of Electrical and Computer Engineering, volume 2010. (Application to GPS)
• An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform. Al-Haj, Ali. Informatica 29 (2005). (DSP example using a Virtex FPGA)