DISTRIBUTED ARITHMETIC (DA)

DEFINITION
DA is basically (but not necessarily) a bit-serial computational operation that forms an inner (dot) product of a pair of vectors in a single direct step.
Why is it called DA? The arithmetic operations that appear in signal processing (+, -, *) are not "lumped" in a comfortably familiar fashion; they are distributed, often in an unrecognizable fashion.

MOTIVATION
Computational efficiency is the most important advantage of DA, and it is best exploited in circuit design.
With careful design, the total gate count of a signal processing arithmetic unit can be reduced by an amount that is seldom smaller than 50 percent and often larger than 80 percent.

IS DA SLOW?
DA may seem slow because of its bit-serial nature, but that is not the case:
1. If the number of elements in each vector is commensurate with the number of bits in each vector element, serial input costs nothing extra. For example, the time required to input eight 8-bit words one at a time in parallel fashion is exactly the same as the time required to input all eight words serially.
2. The speed can be increased further with techniques such as bit pairing, or by partitioning the input words into the most significant half and the least significant half, the least significant half of the most significant half, and so on.

SO WHY USE DA?
The advantages of DA are best exploited in data-path circuit design.
Area savings from using DA can be up to 80% and are seldom less than 50% in digital signal processing hardware designs.
DA is an old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP).
DA efficiently implements the multiply-accumulate (MAC) operation using the basic building blocks of FPGAs (look-up tables).

INNER-PRODUCT (DA EXAMPLE):
The following expression represents an inner-product (multiply and accumulate) operation: $y = \sum_{k=1}^{K} A_k x_k$
[Figure: numerical example]

A FEW POINTS ABOUT THE INNER-PRODUCT
Consider $y = \sum_{k=1}^{K} A_k x_k$ and note a few points:
A = [A_1, A_2, ..., A_K] is a vector of "constant" values.
x = [x_1, x_2, ..., x_K] is a vector of input "variables".
Each A_k is M bits wide.
Each x_k is N bits wide.
y must be wide enough to accommodate the result.
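As a reference point for the later slides, the inner product can be written as a plain multiply-accumulate loop. This is a minimal sketch, not part of the original deck; the function names and the bit-width bound M + N + ceil(log2 K) are assumptions added for illustration (the bound is a safe upper limit for signed operands, not a figure taken from the slides).

```python
def inner_product(A, x):
    """Direct multiply-accumulate: y = sum over k of A[k] * x[k]."""
    if len(A) != len(x):
        raise ValueError("A and x must have the same number of elements K")
    y = 0
    for a_k, x_k in zip(A, x):
        y += a_k * x_k  # one multiplication and one addition per element
    return y


def result_bits(M, N, K):
    """Safe upper bound on the width of y for M-bit A_k, N-bit x_k, K terms."""
    growth = (K - 1).bit_length()  # equals ceil(log2(K)) for K >= 1
    return M + N + growth


print(inner_product([3, -1, 4, 2], [5, 6, -7, 8]))
print(result_bits(8, 8, 8))
```

The DA schemes on the following slides compute the same y without any explicit multiplier.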

A POSSIBLE HARDWARE FOR INNER-PRODUCT
[Figure: block diagram with an 8-bit multiplier and an 8-bit adder]

THE REALIZATION OF MULTIPLIER IN HARDWARE
The add-and-shift method is used to realize a multiplier.
Example of a 4-bit multiplier: 1011 * 0101 = 0011,0111.
[Table: step-by-step trace with columns Multiplier, Multiplicand, Partial result and Result, processing the multiplier bits from MSB to LSB with an AND and a shift (<<) at each step]
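The add-and-shift procedure can be sketched in a few lines. This is an illustrative version, assuming unsigned operands and an MSB-first scan of the multiplier, as the (MSB)/(LSB) labels and the << steps in the trace suggest; which operand the slide treats as the multiplier does not affect the product.

```python
def shift_add_multiply(multiplicand, multiplier, bits=4):
    """Unsigned multiplication using only AND, shift and add."""
    result = 0
    for i in range(bits - 1, -1, -1):         # scan multiplier bits, MSB first
        bit = (multiplier >> i) & 1
        partial = multiplicand if bit else 0  # multi-bit AND with one multiplier bit
        result = (result << 1) + partial      # shift the running sum, add the partial product
    return result


product = shift_add_multiply(0b1011, 0b0101)
print(format(product, "08b"))  # 00110111, i.e. 11 * 5 = 55
```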

A POSSIBLE HARDWARE (NOT DA YET!!!)
[Figure: bit-serial inner-product hardware. Labels: shift registers; multi-bit AND gate; adder/subtractor; shift right; registers to hold the sum of partial products]
Each scaling accumulator calculates A_i × x_i.

HOW DOES DA WORK?
The "basic" DA technique is bit-serial in nature.
DA is basically a bit-level rearrangement of the multiply and accumulate operation.
DA replaces the explicit multiplications with ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs).

MOVING CLOSER TO DISTRIBUTED ARITHMETIC
Consider once again $y = \sum_{k=1}^{K} A_k x_k$, where the A_k are fixed coefficients and the x_k are the input data words.
Each x_k is an N-bit signed fraction, so |x_k| < 1.
x_k: {b_k0, b_k1, b_k2, ..., b_k(N-1)}, where b_k0 is the sign bit.
We can express x_k either in sign-and-magnitude form or in two's complement form:
Sign and magnitude: $x_k = (-1)^{b_{k0}} \sum_{n=1}^{N-1} b_{kn} 2^{-n}$
Two's complement: $x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$ ... (2)

CONTINUE …. For two’s complement representation: From this point, the distributed arithmetic (DA) can be define Because this made the summation has only Instead of compute these value on line, we may precompute the values and store them in ROM with size. The input data (X) can be used to directly address the memory and the result can be dropped into an accumulator. This term represent the negative value of X, Therefore, we need a ROM of size to cover all the negative and positive value of X 13

LET US REMEMBER OUR HARDWARE

BY USING THE REARRANGED EQUATION, THE ROM COMES HERE

THE ROM CONSTRUCTION
The bracketed term $\sum_{k=1}^{K} A_k b_{kn}$ (5) has only $2^K$ possible values, so it can be pre-calculated for all possible values of b_1n b_2n ... b_Kn.
We can store these values in a look-up table of $2^K$ words addressed by the K bits b_1n b_2n ... b_Kn.

EXAMPLE: The memory must contain all possible combinations (16 values).

b1n b2n b3n b4n   Contents
 0   0   0   0    0
 0   0   0   1    A4
 0   0   1   0    A3
 0   0   1   1    A3 + A4
 0   1   0   0    A2
 0   1   0   1    A2 + A4
 0   1   1   0    A2 + A3
 0   1   1   1    A2 + A3 + A4
 1   0   0   0    A1
 1   0   0   1    A1 + A4
 1   0   1   0    A1 + A3
 1   0   1   1    A1 + A3 + A4
 1   1   0   0    A1 + A2
 1   1   0   1    A1 + A2 + A4
 1   1   1   0    A1 + A2 + A3
 1   1   1   1    A1 + A2 + A3 + A4 = 1.48

What about negative values?
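Filling the 16-word memory can be sketched as follows. The coefficient values are assumptions, chosen only so that A1 + A2 + A3 + A4 = 1.48 as in the last table entry; the individual values are not recoverable from the slide. The function name is also illustrative.

```python
from fractions import Fraction


def build_rom(A):
    """Return the 2^K partial sums, indexed by the address b1n b2n ... bKn."""
    K = len(A)
    rom = []
    for addr in range(2 ** K):
        # b1n is the most significant address bit, bKn the least significant.
        word = sum((a for k, a in enumerate(A) if (addr >> (K - 1 - k)) & 1),
                   Fraction(0))
        rom.append(word)
    return rom


# Assumed coefficients A1..A4 whose sum is 1.48.
A = [Fraction("0.72"), Fraction("-0.30"), Fraction("0.95"), Fraction("0.11")]
rom = build_rom(A)
for addr, word in enumerate(rom):
    print(f"{addr:04b}  {float(word):+.2f}")
```

With these assumed coefficients the last line printed is `1111  +1.48`, the one value that the table preserves.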

CONTINUE ...
Address (5 bits): Ts b1n b2n b3n b4n. Data: the contents below.

b1n b2n b3n b4n   Contents for Ts = 0 (1 <= n <= N-1)   Contents for Ts = 1 (n = 0)
 0   0   0   0    0                                      0
 0   0   0   1    A4                                     -A4
 0   0   1   0    A3                                     -A3
 0   0   1   1    A3 + A4                                -(A3 + A4)
 0   1   0   0    A2                                     -A2
 0   1   0   1    A2 + A4                                -(A2 + A4)
 0   1   1   0    A2 + A3                                -(A2 + A3)
 0   1   1   1    A2 + A3 + A4                           -(A2 + A3 + A4)
 1   0   0   0    A1                                     -A1
 1   0   0   1    A1 + A4                                -(A1 + A4)
 1   0   1   0    A1 + A3                                -(A1 + A3)
 1   0   1   1    A1 + A3 + A4                           -(A1 + A3 + A4)
 1   1   0   0    A1 + A2                                -(A1 + A2)
 1   1   0   1    A1 + A2 + A4                           -(A1 + A2 + A4)
 1   1   1   0    A1 + A2 + A3                           -(A1 + A2 + A3)
 1   1   1   1    A1 + A2 + A3 + A4 = 1.48               -(A1 + A2 + A3 + A4) = -1.48

Ts is a single-bit timing signal. During the sign-bit time the control signal Ts = 1; otherwise Ts = 0.

ADDER AND FULL MEMORY
The input data (two's complement numbers) are delivered one bit at a time (1BAAT).
During the sign-bit time Ts = 1; otherwise Ts = 0.
Switch SWA has two positions:
1. Add and shift, during the accumulation time.
2. Pass the final result to y (output).
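A behavioural sketch of this adder-plus-full-memory arrangement follows. It is an illustration, not the circuit on the slide: it assumes the bits arrive LSB first with the sign bit last, that the accumulator is halved (shifted right) before each addition, and that x_k = X_k / 2^(N-1) for N-bit two's-complement integers X_k. The coefficient values and all names are assumed.

```python
from fractions import Fraction


def build_full_rom(A):
    """2^(K+1) words addressed by Ts b1n ... bKn; the Ts = 1 half is negated."""
    K = len(A)
    rom = {}
    for ts in (0, 1):
        for addr in range(2 ** K):
            word = sum((a for k, a in enumerate(A) if (addr >> (K - 1 - k)) & 1),
                       Fraction(0))
            rom[(ts << K) | addr] = -word if ts else word
    return rom


def da_full_memory(A, X, N):
    """Bit-serial DA: one ROM look-up and one add-and-shift per bit time."""
    K = len(A)
    rom = build_full_rom(A)
    acc = Fraction(0)
    for n in range(N - 1, -1, -1):       # LSB first, sign bit (n = 0) last
        ts = 1 if n == 0 else 0          # timing signal for the sign-bit time
        addr = 0
        for v in X:                      # gather bit n of every input word
            addr = (addr << 1) | ((v >> (N - 1 - n)) & 1)
        acc = acc / 2 + rom[(ts << K) | addr]
    return acc


A = [Fraction("0.72"), Fraction("-0.30"), Fraction("0.95"), Fraction("0.11")]  # assumed
X = [3, -5, 7, -8]                                                           # 4-bit words
expected = sum(a * Fraction(v, 8) for a, v in zip(A, X))
print(da_full_memory(A, X, 4) == expected)
```

The loop uses only look-ups, additions and halvings; no multiplication by a coefficient appears.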

KEY ISSUE: ROM SIZE
The size of the ROM is very important for high-speed implementation as well as for area efficiency.
The ROM size grows exponentially with each added input address line.
The number of address lines equals the number of elements in the vector plus one, i.e. K+1.
Vectors of 16 or more elements are common, which gives $2^{17}$ = 128K words of ROM!
We have to reduce the size of the ROM.
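The growth is easy to tabulate. A small sketch (the function name is illustrative) that counts the words of the full memory for a few vector lengths:

```python
def rom_words_full(K):
    """Words in the full DA memory: K data address lines plus the Ts line."""
    return 2 ** (K + 1)


for K in (4, 8, 16):
    print(f"K = {K:2d}: {rom_words_full(K)} words")
```

For K = 16 this gives 131072 words, which is the 128K quoted above.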

SOLUTION OF THE MEMORY SIZE:
The first solution reduces the memory from $2^{K+1}$ to $2^K$ words by replacing the adder with an adder/subtractor and using Ts as the add/subtract control line.
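A sketch of this first reduction, under the same assumptions as before (LSB-first input, accumulator halved each bit time, x_k = X_k / 2^(N-1), illustrative coefficients and names): the ROM holds only the 2^K positive sums, and Ts decides whether the look-up is added or subtracted.

```python
from fractions import Fraction


def build_rom(A):
    """2^K partial sums addressed by b1n ... bKn (no negated half)."""
    K = len(A)
    return [sum((a for k, a in enumerate(A) if (addr >> (K - 1 - k)) & 1),
                Fraction(0))
            for addr in range(2 ** K)]


def da_add_sub(A, X, N):
    """Bit-serial DA with a 2^K-word ROM and an adder/subtractor."""
    rom = build_rom(A)
    acc = Fraction(0)
    for n in range(N - 1, -1, -1):       # LSB first, sign bit last
        ts = 1 if n == 0 else 0
        addr = 0
        for v in X:
            addr = (addr << 1) | ((v >> (N - 1 - n)) & 1)
        word = rom[addr]
        # Ts acts as the add/subtract control line.
        acc = acc / 2 - word if ts else acc / 2 + word
    return acc


A = [Fraction("0.72"), Fraction("-0.30"), Fraction("0.95"), Fraction("0.11")]  # assumed
X = [3, -5, 7, -8]
expected = sum(a * Fraction(v, 8) for a, v in zip(A, X))
print(len(build_rom(A)), da_add_sub(A, X, 4) == expected)
```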

SOLUTION OF THE MEMORY SIZE:
A second reduction brings the ROM size down to $2^{K-1}$ words.
It starts from the identity: two's complement = one's complement + 1.

CONTINUE ...

USING THE NEW FORMULA OF x_k:

THE NEW FORMULATION IN OFFSET CODE
[Equation with one term marked "Constant"]
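A sketch of how the offset-code form follows from the identity two's complement = one's complement + 1. The symbol $Q$ is an assumption introduced here for the table contents; $c_{kn}$ are the $\pm 1$ digits that index the offset-code table, and $\bar{b}_{kn}$ denotes the complemented bit.

```latex
% Negating x_k: complement every bit and add one LSB.
-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}

% Write x_k as half the difference of x_k and -x_k.
x_k = \tfrac{1}{2}\left[ x_k - (-x_k) \right]
    = \tfrac{1}{2}\left[ -(b_{k0} - \bar{b}_{k0})
      + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right]

% Define the offset-code digits, each of which is +1 or -1.
c_{kn} = b_{kn} - \bar{b}_{kn} \quad (n \neq 0), \qquad
c_{k0} = -(b_{k0} - \bar{b}_{k0})

x_k = \tfrac{1}{2}\left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]

% Substitute into y = \sum_k A_k x_k and interchange the summations.
y = \sum_{n=0}^{N-1} Q(b_n)\, 2^{-n} + 2^{-(N-1)}\, Q(0), \qquad
Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2}\, c_{kn}, \qquad
Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2}
```

The term $2^{-(N-1)} Q(0)$ does not depend on the input data, so it can be loaded into the accumulator as a constant. Complementing every address bit flips the sign of every $c_{kn}$ and therefore of $Q(b_n)$, which is what allows half of the table to be dropped.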

THE BENEFIT: ONLY HALF THE VALUES TO STORE

b1n b2n b3n b4n   c1n c2n c3n c4n   Contents
 0   0   0   0    -1  -1  -1  -1    -1/2 (A1 + A2 + A3 + A4) = -0.74
 0   0   0   1    -1  -1  -1  +1    -1/2 (A1 + A2 + A3 - A4)
 0   0   1   0    -1  -1  +1  -1    -1/2 (A1 + A2 - A3 + A4)
 0   0   1   1    -1  -1  +1  +1    -1/2 (A1 + A2 - A3 - A4)
 0   1   0   0    -1  +1  -1  -1    -1/2 (A1 - A2 + A3 + A4)
 0   1   0   1    -1  +1  -1  +1    -1/2 (A1 - A2 + A3 - A4)
 0   1   1   0    -1  +1  +1  -1    -1/2 (A1 - A2 - A3 + A4)
 0   1   1   1    -1  +1  +1  +1    -1/2 (A1 - A2 - A3 - A4)
 1   0   0   0    +1  -1  -1  -1    -1/2 (-A1 + A2 + A3 + A4)
 1   0   0   1    +1  -1  -1  +1    -1/2 (-A1 + A2 + A3 - A4)
 1   0   1   0    +1  -1  +1  -1    -1/2 (-A1 + A2 - A3 + A4)
 1   0   1   1    +1  -1  +1  +1    -1/2 (-A1 + A2 - A3 - A4)
 1   1   0   0    +1  +1  -1  -1    -1/2 (-A1 - A2 + A3 + A4)
 1   1   0   1    +1  +1  -1  +1    -1/2 (-A1 - A2 + A3 - A4)
 1   1   1   0    +1  +1  +1  -1    -1/2 (-A1 - A2 - A3 + A4)
 1   1   1   1    +1  +1  +1  +1    -1/2 (-A1 - A2 - A3 - A4) = 0.74

Inverse symmetry: the entry for any address is the negative of the entry for the complemented address (for example 0000 and 1111, or 0001 and 1110). Only the eight entries with b1n = 0 need to be stored; the other half is obtained by inverting the address and negating the result.

HARDWARE USING OFFSET CODING
[Figure: offset-coded DA hardware]
x1 selects between the two symmetric halves of the table.
The rest of the input data address the ROM.
Ts indicates when the sign bit arrives.
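The offset-coded scheme can be sketched behaviourally as below. This is an illustration under stated assumptions, not the circuit on the slide: bits arrive LSB first with the sign bit last, x_k = X_k / 2^(N-1), the remaining address bits are XORed with the first bit to reach the stored half, the add/subtract control is (first bit XOR Ts), and the accumulator starts at twice the all-zeros table entry so that the constant term ends up with weight 2^-(N-1). Coefficients and names are assumed.

```python
from fractions import Fraction


def build_half_rom(A):
    """The 2^(K-1) offset-code entries whose first address bit is 0."""
    K = len(A)
    rom = []
    for addr in range(2 ** (K - 1)):
        total = -A[0]                           # first bit 0 gives c1 = -1
        for k in range(1, K):
            bit = (addr >> (K - 1 - k)) & 1
            total += A[k] if bit else -A[k]     # c_k = +1 for bit 1, -1 for bit 0
        rom.append(Fraction(total) / 2)
    return rom


def da_offset_code(A, X, N):
    """Bit-serial DA with a 2^(K-1)-word ROM and an adder/subtractor."""
    K = len(A)
    rom = build_half_rom(A)
    mask = (1 << (K - 1)) - 1
    acc = 2 * rom[0]                            # constant term, halved N times below
    for n in range(N - 1, -1, -1):              # LSB first, sign bit last
        ts = 1 if n == 0 else 0
        bits = [(v >> (N - 1 - n)) & 1 for v in X]
        rest = 0
        for b in bits[1:]:
            rest = (rest << 1) | b
        addr = rest ^ (mask if bits[0] else 0)  # first bit selects the mirrored half
        subtract = bits[0] ^ ts                 # negate for the mirrored half or the sign bit
        acc = acc / 2 - rom[addr] if subtract else acc / 2 + rom[addr]
    return acc


A = [Fraction("0.72"), Fraction("-0.30"), Fraction("0.95"), Fraction("0.11")]  # assumed
X = [3, -5, 7, -8]
expected = sum(a * Fraction(v, 8) for a, v in zip(A, X))
print(len(build_half_rom(A)), da_offset_code(A, X, 4) == expected)
```

With the assumed coefficients the first stored entry is -0.74, which agrees with the first row of the offset-code table.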

INCREASING THE SPEED OF DA
Because the data enter serially, one bit at a time (1 BAAT), the number of clock cycles required per result is N.
A single parallel multiplier/accumulator needs K cycles per dot product, one per vector element. If K = N, the DA processor takes the same N clock cycles per dot product. If K > N, the DA processor is faster than the single parallel multiplier/accumulator.
The speed can be increased by entering L bits at a time instead of one.
We could have 2 BAAT, or up to N BAAT in the extreme case. N BAAT gives one complete result per cycle.

TWO BITS AT A TIME
[Figure: 2 BAAT hardware. Labels: both memories hold the same contents; two-bit shift]
We can increase the parallelism as much as we want, but doing so increases the number of gates needed.
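A sketch of the two-bits-at-a-time idea, assuming an even word length N, two look-ups per cycle into identical tables (modelled here as one shared list), LSB-first input and a two-bit accumulator shift, with x_k = X_k / 2^(N-1). It reuses the adder/subtractor variant for the sign bit; coefficients and names are illustrative.

```python
from fractions import Fraction


def build_rom(A):
    """2^K partial sums addressed by b1n ... bKn."""
    K = len(A)
    return [sum((a for k, a in enumerate(A) if (addr >> (K - 1 - k)) & 1),
                Fraction(0))
            for addr in range(2 ** K)]


def da_two_baat(A, X, N):
    """Process two bit positions per cycle; returns (y, cycles used)."""
    if N % 2:
        raise ValueError("this sketch assumes an even word length N")
    rom = build_rom(A)                   # both memories hold the same contents

    def address(n):
        addr = 0
        for v in X:
            addr = (addr << 1) | ((v >> (N - 1 - n)) & 1)
        return addr

    acc = Fraction(0)
    cycles = 0
    for m in range(N - 2, -1, -2):       # bit pairs (m, m + 1), least significant pair first
        high = rom[address(m)]           # more significant bit of the pair
        low = rom[address(m + 1)]        # less significant bit, half the weight
        if m == 0:
            high = -high                 # the sign bit contributes negatively
        acc = acc / 4 + high + low / 2   # two-bit shift of the accumulator
        cycles += 1
    return acc, cycles


A = [Fraction("0.72"), Fraction("-0.30"), Fraction("0.95"), Fraction("0.11")]  # assumed
X = [3, -5, 7, -8]
expected = sum(a * Fraction(v, 8) for a, v in zip(A, X))
y, cycles = da_two_baat(A, X, 4)
print(y == expected, cycles)
```

The result is unchanged while the cycle count drops from N to N/2, at the cost of a second memory and a wider adder.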

INCREASE SPEED BY REDUCING THE CARRY PROPAGATION OF ADDER
The speed of the critical path is limited by the carry propagation across the width of the adder.
Speed can be improved by using techniques that limit the carry propagation:
Carry-save adder
Carry-skip adder
Carry-lookahead adder
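As one illustration of limiting carry propagation, a carry-save adder reduces three operands to a sum word and a carry word using only bitwise logic, so no carry ripples across the word; a single carry-propagate addition is needed at the very end. The sketch below assumes non-negative integers and is not tied to the specific hardware in the slides.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to (sum word, carry word) without carry propagation."""
    sum_word = a ^ b ^ c                             # per-bit sum, carries ignored
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved up one position
    return sum_word, carry_word


def sum_with_carry_save(operands):
    """Accumulate many operands, propagating carries only once at the end."""
    s, c = 0, 0
    for value in operands:
        s, c = carry_save_add(s, c, value)
    return s + c                                     # the only carry-propagate addition


print(carry_save_add(5, 6, 7))
print(sum_with_carry_save([3, 9, 27, 81]))
```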

CONCLUSION
DA is a very efficient means of mechanizing computations that are dominated by inner products.
Using DA instead of a conventional multiplier-based design can reduce the area substantially.
DA is not necessarily slow despite its serial nature; in some cases it is faster than a parallel multiplier/accumulator.

Questions?