Distributed Arithmetic: Implementations and Applications


Distributed Arithmetic: Implementations and Applications A Tutorial

Distributed Arithmetic (DA) [Peled and Liu, 1974]
An efficient technique for calculating a sum of products, also known as a vector dot product, inner product, or multiply-accumulate (MAC)
The MAC operation is very common in all digital signal processing algorithms
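As a point of reference, the direct form of this computation is trivial to state in software. A minimal sketch (the coefficients are the ones used in the worked example later in these slides; the input samples are made up for illustration):

```python
# Direct multiply-accumulate (MAC): y = sum of A[k] * x[k].
A = [0.72, -0.30, 0.95, 0.11]   # fixed coefficients (from the slides' example)
x = [0.50, -0.25, 0.125, 0.75]  # input samples (illustrative)

y = 0.0
for a, v in zip(A, x):
    y += a * v                   # one multiply and one accumulate per term

print(y)
```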

So Why Use DA?
The advantages of DA are best exploited in data-path circuit design
Area savings from using DA can be up to 80%, and are seldom less than 50%, in digital signal processing hardware designs
An old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP)
DA efficiently implements the MAC using basic FPGA building blocks (Look-Up Tables)

An Illustration of MAC Operation
The following expression represents a multiply-and-accumulate operation:
y = Σ_{k=1..K} Ak·xk = A1·x1 + A2·x2 + … + AK·xK
A numerical example (values chosen for illustration): with A = [2, 3] and x = [4, 5], y = 2·4 + 3·5 = 23

A Few Points about the MAC
Consider y = Σ_{k=1..K} Ak·xk and note a few points:
A = [A1, A2, …, AK] is a vector of constant coefficients
x = [x1, x2, …, xK] is a vector of input variables
Each Ak is M bits wide
Each xk is N bits wide
y must be wide enough to accommodate the result without overflow

A Possible Hardware (NOT DA Yet!!!)
Let y = Σ_{k=1..K} Ak·xk
[Figure: one scaling accumulator per term; each computes Ak × xk from the bits of xk shifted out of a shift register, using a multi-bit AND gate, an adder/subtractor, and a shift-right register to hold the sum of partial products]

How does DA work?
The “basic” DA technique is bit-serial in nature
DA is essentially a bit-level rearrangement of the multiply-accumulate operation
DA hides the explicit multiplications behind ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs)

Moving Closer to Distributed Arithmetic
Consider once again
y = Σ_{k=1..K} Ak·xk …(1)
a. Let xk be an N-bit scaled two’s-complement number, i.e. |xk| < 1, with bits xk : {bk0, bk1, bk2, …, bk(N-1)}, where bk0 is the sign bit
b. We can express xk as
xk = -bk0 + Σ_{n=1..N-1} bkn·2^-n …(2)
c. Substituting (2) in (1),
y = Σ_{k=1..K} Ak·[ -bk0 + Σ_{n=1..N-1} bkn·2^-n ] …(3)
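Equation (2) can be checked numerically. A small sketch (the helper names `to_bits` and `from_bits` are mine, not from the slides):

```python
# Verify Eq. (2): an N-bit two's-complement fraction xk with |xk| < 1
# equals -bk0 + sum over n = 1..N-1 of bkn * 2^-n.
N = 8

def to_bits(xk):
    """Bits [b0, b1, ..., b_{N-1}] of xk, with b0 the sign bit (MSB)."""
    q = round(xk * 2 ** (N - 1)) & (2 ** N - 1)   # two's-complement integer
    return [(q >> (N - 1 - n)) & 1 for n in range(N)]

def from_bits(b):
    # Eq. (2): sign bit weighted -1, remaining bits weighted 2^-n
    return -b[0] + sum(b[n] * 2.0 ** -n for n in range(1, N))

xk = -0.296875                  # exactly representable: -38 / 128
print(to_bits(xk))              # [1, 1, 0, 1, 1, 0, 1, 0]
print(from_bits(to_bits(xk)))   # -0.296875
```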

Moving Closer to DA (continued)
Expanding the bracketed part of (3):
y = Σ_{k=1..K} Ak·(-bk0) + Σ_{k=1..K} Σ_{n=1..N-1} Ak·bkn·2^-n

Moving Still Closer to DA
Interchanging the order of the summations over k and n:
y = -Σ_{k=1..K} Ak·bk0 + Σ_{n=1..N-1} [ Σ_{k=1..K} Ak·bkn ]·2^-n

Almost There!
The Final Reformulation:
y = -Σ_{k=1..K} Ak·bk0 + Σ_{n=1..N-1} [ Σ_{k=1..K} Ak·bkn ]·2^-n …(4)

Let’s See the Change of Hardware
Our original equation: y = Σ_{k=1..K} Ak·xk
Bit-level rearrangement: y = -Σ_{k=1..K} Ak·bk0 + Σ_{n=1..N-1} [ Σ_{k=1..K} Ak·bkn ]·2^-n

So where does the ROM come in?
Note the bracketed portion of (4): the inner sum Σ_{k=1..K} Ak·bkn can be treated as a function of the nth bits of the serial inputs, {b1n, b2n, …, bKn}

The ROM Construction
Σ_{k=1..K} Ak·bkn …(5)
has only 2^K possible values, i.e. (5) can be pre-calculated for all possible values of b1n b2n … bKn
We can store these in a look-up table of 2^K words addressed by K bits, i.e. by b1n b2n … bKn
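Putting (4) and (5) together gives the whole bit-serial DA algorithm. A behavioral sketch (all function names are mine; this models the arithmetic, not the hardware timing):

```python
from itertools import product

A = [0.72, -0.30, 0.95, 0.11]
K, N = len(A), 8

def to_bits(xk):
    """N-bit two's-complement bits of a fraction |xk| < 1, MSB first."""
    q = round(xk * 2 ** (N - 1)) & (2 ** N - 1)
    return [(q >> (N - 1 - n)) & 1 for n in range(N)]

# Pre-calculate the 2^K-word ROM: every possible value of sum_k A_k * b_kn.
rom = [sum(a for a, bit in zip(A, addr) if bit)
       for addr in product([0, 1], repeat=K)]

def da_dot(x):
    bits = [to_bits(v) for v in x]
    y = 0.0
    for n in range(N):
        addr = 0
        for b in bits:                    # pack bit n of each input: b1n..bKn
            addr = (addr << 1) | b[n]
        term = rom[addr]
        # Eq. (4): the sign-bit term (n = 0) is subtracted, the rest are
        # scaled by 2^-n and added.
        y += -term if n == 0 else term * 2.0 ** -n
    return y

x = [0.5, -0.25, 0.125, 0.75]             # exactly representable in 8 bits
xq = [round(v * 2 ** (N - 1)) / 2 ** (N - 1) for v in x]
print(abs(da_dot(x) - sum(a * v for a, v in zip(A, xq))) < 1e-9)   # True
```

Note that no multiplier appears anywhere: each of the N cycles is one table look-up plus one scaled addition.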

Let’s See an Example
Let the number of taps K = 4
The fixed coefficients are A1 = 0.72, A2 = -0.30, A3 = 0.95, A4 = 0.11
We need a 2^K = 2^4 = 16-word ROM

ROM: Address and Contents

b1n b2n b3n b4n | Contents
0 0 0 0 | 0
0 0 0 1 | A4 = 0.11
0 0 1 0 | A3 = 0.95
0 0 1 1 | A3 + A4 = 1.06
0 1 0 0 | A2 = -0.30
0 1 0 1 | A2 + A4 = -0.19
0 1 1 0 | A2 + A3 = 0.65
0 1 1 1 | A2 + A3 + A4 = 0.76
1 0 0 0 | A1 = 0.72
1 0 0 1 | A1 + A4 = 0.83
1 0 1 0 | A1 + A3 = 1.67
1 0 1 1 | A1 + A3 + A4 = 1.78
1 1 0 0 | A1 + A2 = 0.42
1 1 0 1 | A1 + A2 + A4 = 0.53
1 1 1 0 | A1 + A2 + A3 = 1.37
1 1 1 1 | A1 + A2 + A3 + A4 = 1.48

Key Issue: ROM Size
The size of the ROM is critical for both high-speed implementation and area efficiency
ROM size grows exponentially with each added input address line
The number of address lines equals the number of elements in the vector, i.e. K
Vectors of 16 or more elements are common => 2^16 = 64K words of ROM!!!
We have to reduce the size of the ROM

A Very Neat Trick
Rewrite xk as half the difference between itself and its negative:
xk = (1/2)·[ xk - (-xk) ] …(6)
In 2’s complement, negation inverts every bit and adds one LSB:
-xk = -b̄k0 + Σ_{n=1..N-1} b̄kn·2^-n + 2^-(N-1) …(7)
(b̄kn denotes the complement of bkn)

Re-Writing xk in a Different Code
Substituting (7) into (6):
xk = (1/2)·[ -(bk0 - b̄k0) + Σ_{n=1..N-1} (bkn - b̄kn)·2^-n - 2^-(N-1) ]
Define the offset code: ck0 = -(bk0 - b̄k0) and ckn = bkn - b̄kn for n ≠ 0, so every ckn ∈ {-1, 1}
Finally
xk = (1/2)·[ Σ_{n=0..N-1} ckn·2^-n - 2^-(N-1) ] …(8)
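Equation (8) can be verified directly. A small sketch (helper names are mine):

```python
# Check Eq. (8): with ckn in {-1, +1} derived from the bits of xk,
# xk = (1/2) * [ sum over n = 0..N-1 of ckn * 2^-n  -  2^-(N-1) ].
N = 8

def offset_code(bits):
    c = [2 * b - 1 for b in bits]   # bkn - !bkn maps 0 -> -1, 1 -> +1
    c[0] = -c[0]                    # ck0 = -(bk0 - !bk0): sign bit negated
    return c

def from_offset(c):
    return 0.5 * (sum(c[n] * 2.0 ** -n for n in range(N)) - 2.0 ** -(N - 1))

bits = [1, 1, 0, 1, 1, 0, 1, 0]     # -0.296875 in 8-bit two's complement
print(from_offset(offset_code(bits)))   # -0.296875
```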

Using the New xk
Substitute (8) for xk in y = Σ_{k=1..K} Ak·xk:
y = Σ_{k=1..K} Ak·(1/2)·[ Σ_{n=0..N-1} ckn·2^-n - 2^-(N-1) ] …(9)

The New Formulation in Offset Code
Let Q(bn) = (1/2)·Σ_{k=1..K} Ak·ckn and Q(0) = -(1/2)·Σ_{k=1..K} Ak (a constant)
Then (9) becomes
y = Σ_{n=0..N-1} Q(bn)·2^-n + 2^-(N-1)·Q(0)

The Benefit: Only Half the Values to Store

b1n b2n b3n b4n | c1n c2n c3n c4n | Contents
0 0 0 0 | -1 -1 -1 -1 | -1/2·(A1 + A2 + A3 + A4) = -0.74
0 0 0 1 | -1 -1 -1  1 | -1/2·(A1 + A2 + A3 - A4) = -0.63
0 0 1 0 | -1 -1  1 -1 | -1/2·(A1 + A2 - A3 + A4) = 0.21
0 0 1 1 | -1 -1  1  1 | -1/2·(A1 + A2 - A3 - A4) = 0.32
0 1 0 0 | -1  1 -1 -1 | -1/2·(A1 - A2 + A3 + A4) = -1.04
0 1 0 1 | -1  1 -1  1 | -1/2·(A1 - A2 + A3 - A4) = -0.93
0 1 1 0 | -1  1  1 -1 | -1/2·(A1 - A2 - A3 + A4) = -0.09
0 1 1 1 | -1  1  1  1 | -1/2·(A1 - A2 - A3 - A4) = 0.02
1 0 0 0 |  1 -1 -1 -1 | -1/2·(-A1 + A2 + A3 + A4) = -0.02
1 0 0 1 |  1 -1 -1  1 | -1/2·(-A1 + A2 + A3 - A4) = 0.09
1 0 1 0 |  1 -1  1 -1 | -1/2·(-A1 + A2 - A3 + A4) = 0.93
1 0 1 1 |  1 -1  1  1 | -1/2·(-A1 + A2 - A3 - A4) = 1.04
1 1 0 0 |  1  1 -1 -1 | -1/2·(-A1 - A2 + A3 + A4) = -0.32
1 1 0 1 |  1  1 -1  1 | -1/2·(-A1 - A2 + A3 - A4) = -0.21
1 1 1 0 |  1  1  1 -1 | -1/2·(-A1 - A2 - A3 + A4) = 0.63
1 1 1 1 |  1  1  1  1 | -1/2·(-A1 - A2 - A3 - A4) = 0.74

Note the inverse symmetry: the bottom half is the negative of the top half read in reverse, so only 2^(K-1) = 8 words need to be stored

Hardware Using Offset Coding
x1’s current bit selects between the two symmetric halves of the (half-size) ROM
Ts indicates the time slot when the sign bit arrives, at which point the ROM output is subtracted rather than added
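The offset-code scheme can be modeled end to end. A behavioral sketch (names are mine; it stores only the 2^(K-1) words with b1n = 0 and recovers the rest through the mirror symmetry):

```python
A = [0.72, -0.30, 0.95, 0.11]
K, N = len(A), 8

def to_bits(xk):
    q = round(xk * 2 ** (N - 1)) & (2 ** N - 1)
    return [(q >> (N - 1 - n)) & 1 for n in range(N)]

# Half-size ROM: only addresses with b1n = 0, i.e. 2^(K-1) words.
# Entry = Q(bn) = (1/2) * sum_k A_k * ckn, with ckn = 2*bkn - 1.
half_rom = []
for a in range(2 ** (K - 1)):
    b = [0] + [(a >> (K - 2 - i)) & 1 for i in range(K - 1)]
    half_rom.append(0.5 * sum(ak * (2 * bk - 1) for ak, bk in zip(A, b)))

def q_lookup(bits_n):
    # Inverse symmetry: flipping every address bit negates the entry,
    # so x1's bit selects between the stored half and its mirror image.
    b1 = bits_n[0]
    addr = 0
    for bit in bits_n[1:]:
        addr = (addr << 1) | (bit ^ b1)   # mirror the address when b1 = 1
    return -half_rom[addr] if b1 else half_rom[addr]

x = [0.5, -0.25, 0.125, 0.75]             # exactly representable in 8 bits
bits = [to_bits(v) for v in x]
col = lambda n: [b[n] for b in bits]

y = 2.0 ** -(N - 1) * half_rom[0]         # constant term: Q(0) = -1/2 * sum(A)
y += -q_lookup(col(0))                    # Ts slot: sign-bit c's are negated
y += sum(q_lookup(col(n)) * 2.0 ** -n for n in range(1, N))

print(abs(y - sum(a * v for a, v in zip(A, x))) < 1e-9)   # True
```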

Alternate Technique: Decomposing the ROM
Splitting the address bits across two smaller ROMs shrinks the total storage, but requires an additional adder to sum the partial outputs

Speed Concerns
So far we have considered One Bit At A Time (1 BAAT) operation
Number of clock cycles required = N
If K = N, we effectively spend one cycle per term of the dot product - not bad!
Opportunity for parallelism exists, but at the cost of more hardware
We could process 2 BAAT, or up to N BAAT in the extreme case
N BAAT => one complete result per cycle
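A 2 BAAT version can be sketched by consuming two bit positions per loop iteration, at the price of a second ROM copy (or port) and an extra adder (names are mine):

```python
from itertools import product

A = [0.72, -0.30, 0.95, 0.11]
K, N = len(A), 8                      # N assumed even

def to_bits(xk):
    q = round(xk * 2 ** (N - 1)) & (2 ** N - 1)
    return [(q >> (N - 1 - n)) & 1 for n in range(N)]

rom = [sum(a for a, bit in zip(A, addr) if bit)
       for addr in product([0, 1], repeat=K)]

def address(bits, n):
    i = 0
    for b in bits:                    # pack bit n of each input: b1n..bKn
        i = (i << 1) | b[n]
    return i

x = [0.5, -0.25, 0.125, 0.75]
bits = [to_bits(v) for v in x]

# 2 BAAT: two ROM reads and an extra adder per cycle, N/2 cycles in total.
y = 0.0
for n in range(0, N, 2):
    t0 = rom[address(bits, n)] * 2.0 ** -n
    if n == 0:
        t0 = -t0                      # position 0 is the sign bit: subtract
    t1 = rom[address(bits, n + 1)] * 2.0 ** -(n + 1)
    y += t0 + t1

print(abs(y - sum(a * v for a, v in zip(A, x))) < 1e-9)   # True
```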

Illustration of 2 BAAT

Illustration of N BAAT

The Speed Limit: Carry Propagation
The speed of the critical path is limited by the width of the carry propagation in the accumulator
Speed can be improved by using techniques that limit the carry propagation

Speeding Up Further: Using RNS + DA
By using the Residue Number System (RNS), the computation can be broken down into smaller elements that execute in parallel
Since we operate on smaller arguments, the carry propagation is naturally limited
So by combining RNS and DA, greater speed benefits can be attained, especially for higher-precision calculations
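The idea can be illustrated on integer-scaled data: each residue channel computes the whole MAC on small operands, and the channel results are recombined with the Chinese Remainder Theorem. A sketch (the moduli and data values are my own illustrations, not from the slides):

```python
from math import prod

# Integer-scaled coefficients and inputs (values illustrative).
A = [72, -30, 95, 11]
x = [64, -32, 16, 96]
moduli = [251, 255, 256]            # pairwise coprime, ~8-bit channels

M = prod(moduli)

# Each channel runs the whole MAC on reduced operands, fully in parallel,
# with carry chains no wider than one channel.
residues = [sum(((a % m) * (v % m)) for a, v in zip(A, x)) % m
            for m in moduli]

# Recombine the channels with the Chinese Remainder Theorem.
y = 0
for r, m in zip(residues, moduli):
    Mi = M // m
    y += r * Mi * pow(Mi, -1, m)    # pow(Mi, -1, m): modular inverse
y %= M
if y > M // 2:                      # map back to a signed result
    y -= M

print(y, sum(a * v for a, v in zip(A, x)))   # 8144 8144
```

The reconstruction is valid as long as the true result fits in the dynamic range M of the chosen moduli.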

Conclusion
References:
Stanley A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, July 1989
Xilinx Application Note, “The Role of Distributed Arithmetic in FPGA-Based Signal Processing”