VLSI Arithmetic Lecture 10: Multipliers

Presentation transcript:

VLSI Arithmetic Lecture 10: Multipliers Prof. Vojin G. Oklobdzija University of California http://www.ece.ucdavis.edu/acsel

Multiplication Algorithm (from Parhami) 27 November 2018

Multiplication (from Parhami)

Multiplier Recoding (from Parhami)

Multiplication by Constants (from Parhami)

Fast Multipliers (from Parhami)

Using Higher Radix Multiplier (from Parhami)

Higher Radix Multiplier (from Parhami)

Booth’s Recoding (from Parhami)

Higher Radix Multipliers (from Parhami)

Tree and Array Multipliers (from Parhami)

Generating Partial Products (from G. Bewick)

Generating Partial Products using Booth’s Recoding (from G. Bewick)

Booth Partial Product Selector Logic (from G. Bewick)

Tree Multipliers (from Parhami)

Reduction using 4:2 Compressors (from G. Bewick)

A Method for Generation of Fast Parallel Multipliers, by Vojin G. Oklobdzija, David Villeger, and Simon S. Liu, Electrical and Computer Engineering, University of California, Davis

Fast Parallel Multipliers
Objective: improved speed of the parallel multiplier via:
Improvements in partial-product bit reduction techniques
Optimization of the final adder for the uneven signal arrival profile from the multiplier tree
27 November 2018 Multiplier Design

Multiplication Algorithm (recurrence shown on the slide: an initial value, one step for each j = 0, ..., n-1, and p(n) = XY after n steps)

Traditionally, the multiplication operation has been performed in a variety of forms, in hardware and software, depending on the cost and transistor budget allocated for this particular operation. Today it is more likely to find a full hardware implementation of multiplication because of the growing demand for speed and the decreasing cost of hardware.

We show a basic multiplication algorithm which operates on positive n-bit integers X and Y, resulting in the product P, which is 2n bits long. The product is the sum of n partial products P_i, i = 0, ..., n-1. Each partial product P_i is obtained by a simple arithmetic left shift of X by i positions and multiplication by the single digit y_i. For the binary radix (r = 2), y_i can only be 0 or 1, and multiplication by the digit y_i is a simple AND operation.

The addition of the n terms can be performed at once, by passing the partial products through a network of adders, or sequentially, by adding partial products using an adder n times. The algorithm to perform the multiplication of X and Y is shown in the slide. It can be proved without difficulty that after n steps this recurrence results in the product p(n) = XY.
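
The shift-and-add idea described in these notes can be sketched in Python. The first function sums the shifted partial products directly. The second uses a right-shift recurrence, p(j+1) = (p(j) + 2^n * X * y_j) / 2 with p(0) = 0; this particular form is an assumption, since the equation on the slide did not survive in the transcript. The function names are illustrative only.

```python
def multiply_by_partial_products(x: int, y: int, n: int) -> int:
    """Sum the n partial products P_i = (X AND y_i) shifted left by i."""
    product = 0
    for i in range(n):
        y_i = (y >> i) & 1               # i-th multiplier bit
        partial = (x << i) if y_i else 0  # AND with y_i, then shift by i
        product += partial
    return product


def multiply_by_recurrence(x: int, y: int, n: int) -> int:
    """Assumed right-shift recurrence: p(j+1) = (p(j) + 2^n * x * y_j) / 2."""
    p = 0                                 # p(0) = 0
    for j in range(n):
        y_j = (y >> j) & 1
        p = (p + ((x << n) if y_j else 0)) >> 1  # division by 2 is exact here
    return p                              # p(n) = x * y


print(multiply_by_partial_products(13, 11, 4))  # 143
print(multiply_by_recurrence(13, 11, 4))        # 143
```

Both forms perform n additions; the recurrence keeps the adder aligned with the upper half of the running product, which is why it is a common way to describe a sequential multiplier.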

Parallel Multipliers

An alternative approach to sequential multiplication involves simultaneous generation of all bit products and their summation with an array of full adders. This approach uses an n by n array of AND gates to form the bit products and an array of n x n adders (and half adders) to sum the bit products in a carry-save fashion. Finally, a 2n-bit Carry-Propagate Adder is used to finish the summation and produce the result.

Wallace introduced a way of summing the partial product bits in parallel using a tree of Carry-Save Adders, which became generally known as the "Wallace Tree". A suggestion for improved efficiency of addition of the partial products was published by Dadda in 1965. Dadda introduced the notion of a counter, which takes a number of bits p in the same bit position and outputs a number q that represents the count of ones at the input.

This process is shown in the slide illustrating an 8 by 8 multiplication. The input, an 8 by 8 matrix of dots in which each dot represents a bit product, is shown as Matrix 0. Columns having more than six dots are reduced by the use of half adders and full adders. Each half adder takes in two dots and outputs one in the same column and one in the next more significant column. Each full adder takes in three dots and outputs one in the same column and one in the next more significant column. As a result, no column in Matrix 1 has more than six dots. Half adders are shown by a "crossed" line in the succeeding matrix and full adders are shown by a line in the succeeding matrix. In each case the rightmost dot of the pair connected by a line is in the column from which the adder inputs were taken. In the succeeding steps, reduction is performed to Matrix 2 with no more than four dots per column, Matrix 3 with no more than three dots per column, and finally Matrix 4 with no more than two dots per column.

The height of the matrices is determined by working back from the final matrix and limiting the height of each matrix to the largest integer that is no more than 1.5 times the height of its successor. Each matrix is produced from its predecessor in one adder delay. Since the number of matrices is logarithmically related to the number of bits in the words to be multiplied, the delay of the matrix reduction process is proportional to log(n). The adder that reduces the final two-row matrix to the final product can be implemented as a fast adder, which also has logarithmic delay. The total delay of this multiplier is therefore proportional to the logarithm of the size of its operands.

The effort to improve the speed of the parallel multiplier continued for almost 30 years.
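
The height rule in the notes (work back from a final height of 2, each earlier matrix at most 1.5 times taller than its successor) can be sketched as follows. The function name is hypothetical; the rule itself is the one stated above.

```python
def dadda_heights(n: int) -> list[int]:
    """Target heights of the successive matrices for n rows of bit products.

    Built backwards from the final height 2: each earlier height is the
    largest integer no more than 1.5 times its successor.
    """
    heights = []
    h = 2
    while h < n:
        heights.append(h)
        h = (h * 3) // 2   # floor(1.5 * h)
    return heights[::-1]   # order in which the reduction is applied


print(dadda_heights(8))        # [6, 4, 3, 2]
print(len(dadda_heights(24)))  # 7
```

For the 8 by 8 example this reproduces the sequence in the notes: at most six, four, three, and then two dots per column.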

Minimum Number of Stages (Dadda’s Rule)

Their Schemes

Use of 4:2 Compressors (A. Weinberger, 1981; M. Santoro, 1988)

4:2 Compressor

In 1981 Weinberger disclosed a structure which he called the "4-2 carry-save module". This structure contains a combination of Full Adder cells in an intricate interconnection structure which yields faster partial product compression than the use of 3:2 counters. The structure actually compresses five partial product bits into three; however, it is connected in such a way that four of the inputs come from the same bit position of weight j, while one bit, known as the carry-in, is fed from the neighboring position j-1. The output of such a 4:2 module consists of one bit in position j and two bits in position j+1. This structure does not represent a counter, though it became erroneously known as a "4:2 counter", but a "compressor" that compresses four partial product bits into two.

The structure of the 4:2 compressor is shown in this slide. Such a structure reduces partial product bits more efficiently: it reduces the number of partial product bits by one half at each stage. In the redesigned version of the 4:2 compressor the speed is determined by three XOR gates in series, making such a scheme more efficient than the one using 3:2 counters in a regular "Wallace Tree". The other equally important feature of the 4:2 compressor is that the interconnections between 4:2 cells follow a more regular pattern than is the case in the "Wallace Tree".
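
As a small check of the "five bits in, three bits out" description, here is a sketch of a 4:2 compressor built from two full adders. The wiring shown is one common arrangement and is an assumption, not necessarily Weinberger's exact module; the function names are illustrative.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """Four bits of weight j plus a carry-in from position j-1.

    Returns (s, carry, cout): s has weight j, carry and cout have weight j+1.
    cout depends only on x1..x3, so carries do not ripple along a row.
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


# Exhaustive check of the arithmetic identity over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

Because cout is independent of cin, a row of these cells reduces four rows of bits to two without a carry chain, which is what makes the cell a compressor rather than a counter.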

Critical Signal Path in a 4:2 Compressor Tree

Re-designed 4:2 Compressor with 3 XOR Delay (Nagamatsu, Toshiba)
[Figure labels: inputs I1, I2, I3, I4 and carry-in; outputs S, C and Cout]
This slide shows a redesigned 4:2 compressor as introduced by Toshiba. The advantage of this compressor is that it has a maximal delay of 3 XOR gates, as opposed to the 4 XOR delays of a regular implementation using Full-Adder cells. In the next slide we will show that this redesign was not necessary. The missing point was an understanding of how to balance the delays of the individual full adders.

Critical Path in a 4:2 Compressor

Signal Arrival Profile for RWT (3:2) and MWT (4:2)

Using 9:2 Compressors (P. Song, G. De Micheli 1991)

Compressor Tree Implemented with 9:2 Compressors

9:2 Compressor Structure

Critical Path (Equivalent XOR Gate Delays)
[Plot: delay in XOR gates (2 to 24) versus multiplier width in bits (10 to 100) for the 3:2 counter, the redesigned 4:2 compressor, and the redesigned 9:2 compressor]

Delay Expressed as No. of XOR Gate Delays

Use of Higher-Order Compressors (D. Villeger, V. G. Oklobdzija 1993)

Design of a 13:2 Compressor from a 9:2 Compressor

Delay Profile of a 24:2 Compressor Tree

Compressor Family Characteristics

Using Carry-Propagate Adders (G. Bewick 1993) (D. Villeger & V. G. Oklobdzija 1993)

Column Compression Tree Consisting of 4-bit Adders

Bit Reduction Using 4-bit Adders (24X24)

Idea !!!!!

A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach – TDM (Oklobdzija, Villeger, Liu, 1994)

Signal Delays in a Full Adder (3,2) Counter
[Figure: full adder with inputs A, B, Cin and outputs Sum, Carry; markers identify the fast input and the fast output]

Three-Dimensional Optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)

A further improvement in the speed of a parallel multiplier was achieved by the introduction of the TDM method in 1996. TDM optimizes the entire Partial Product Reduction Tree in one pass, thus the name Three-Dimensional optimization Method. The important aspect of this method is the sorting of fast inputs and fast outputs. It was realized that the most important step in achieving fast partial product reduction is to interconnect the elements properly. Thus the appropriate counters, 3:2 adders in this particular case, were characterized in a way that identifies the delay from each input to each output. The Partial Product Reduction Tree was then interconnected so that signals with large delays are connected to "fast inputs" and signals with small delays to "slow inputs". This slide illustrates how a 4:2 compressor with 3 XOR delays can be obtained by a simple application of the TDM method, without the need for redesign.
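
The effect of connecting late signals to fast inputs can be illustrated with a small timing model of a 4:2 compressor made of two full adders. The pin-to-pin delays below (in XOR-gate units) are assumptions chosen for illustration: A and B reach Sum in 2 and Carry in 1, while Cin reaches both outputs in 1. The slides do not give these numbers, and the carry-in is assumed to come from an identical neighboring cell.

```python
# Assumed full-adder pin-to-pin delays, in units of one XOR gate delay.
FA_DELAY = {
    ("A", "S"): 2, ("B", "S"): 2, ("Cin", "S"): 1,
    ("A", "C"): 1, ("B", "C"): 1, ("Cin", "C"): 1,
}


def fa_output_times(arrival: dict) -> tuple[int, int]:
    """Map pin arrival times to (sum_time, carry_time)."""
    s = max(arrival[p] + FA_DELAY[(p, "S")] for p in ("A", "B", "Cin"))
    c = max(arrival[p] + FA_DELAY[(p, "C")] for p in ("A", "B", "Cin"))
    return s, c


def compressor_delay(late_signal_on_fast_pin: bool) -> int:
    """Worst-case output time of a 4:2 compressor built from two full adders."""
    # First full adder: I1, I2, I3 all arrive at time 0.
    s1, cout = fa_output_times({"A": 0, "B": 0, "Cin": 0})
    cin = cout   # carry-in from the identical neighboring cell
    i4 = 0
    if late_signal_on_fast_pin:
        # TDM-style wiring: the latest signal (s1) goes to the fast pin.
        second = {"A": i4, "B": cin, "Cin": s1}
    else:
        # Naive wiring: the latest signal lands on a slow pin.
        second = {"A": s1, "B": i4, "Cin": cin}
    return max(fa_output_times(second))


print(compressor_delay(False))  # 4
print(compressor_delay(True))   # 3
```

Under these assumed delays the same two full adders give 4 XOR delays with naive wiring and 3 with delay-aware wiring, which is the kind of gain the notes attribute to proper interconnection.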

Method

Computer Tools

Design of Parallel Multipliers

Algorithm for Automatic Generation of Partial Product Array.

Initialize:
  Form 2N-1 lists Li (i = 0, ..., 2N-2), each consisting of pi elements, where:
    pi = i+1 for i ≤ N-1 and pi = 2N-1-i for i ≥ N
  An element of a list Li (j = 0, ..., pi-1) is a pair <nj, Dj>i where:
    nj is a unique node-identifying name
    Dj is a delay associated with that node, representing the delay of a signal arriving at the node nj with respect to some reference point.
  For i = 0, 1 and 2N-2: connect nodes from the corresponding lists Li directly to the CPA.

For i = 2 to i = 2N-3 {Partial Product Array Generation}
Begin For
  if length of Li is even Then
  Begin If
    sort the elements of Li in ascending order by the values of delay dj
    connect an HA to the first 2 elements of Li, starting with the slowest input
    Ds = max {dA + dA-S, dB + dB-S}
    Dc = max {dA + dA-C, dB + dB-C}
    remove 2 elements from Li
    insert the pair <Ds, NetName> into Li
    insert the pair <Dc, NetName> into Li+1
    decrement the length of Li
    increment the length of Li+1
  End If;
  while length of Li > 3
  Begin While
    sort the elements of Li in ascending order by the values of delay dj
    connect an FA to the first 3 elements of Li, starting with the slowest input of the FA:
    Ds = max {dA + dA-S, dB + dB-S, dCi + dCi-S}
    Dc = max {dA + dA-C, dB + dB-C, dCi + dCi-C}
    remove 3 elements from Li
    insert the pair <Ds, NetName> into Li
    insert the pair <Dc, NetName> into Li+1
    subtract 2 from the length of Li
    increment the length of Li+1
  End While;
  sort the elements of Li
  connect an FA to the last 3 nodes of Li
  connect the S and C to the bits i and i+1 of the CPA
End For;
End Method;
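
The following Python sketch follows the spirit of the algorithm above but is a simplified variant, not the authors' exact method: it uses full adders only, omits the half-adder step and the explicit CPA hookup, and stops once every column holds at most two signals. Each signal carries its bit value as well as its delay so the result can be checked. The pin delays are assumed values in XOR-gate units, and all names are hypothetical.

```python
import heapq

# Assumed full-adder delays (XOR units): slow pins A/B and fast pin Ci.
SLOW_TO_SUM, SLOW_TO_CARRY = 2, 1
FAST_TO_SUM, FAST_TO_CARRY = 1, 1


def reduce_partial_products(x: int, y: int, n: int) -> list:
    """Reduce the n x n bit-product array to at most two signals per column.

    Each signal is a pair (delay, bit). Returns the list of columns.
    """
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append((0, ((x >> i) & 1) & ((y >> j) & 1)))
    i = 0
    while i < len(cols):
        col = cols[i]
        heapq.heapify(col)               # earliest-arriving signals first
        while len(col) > 2:
            da, a = heapq.heappop(col)   # two earliest -> slow pins
            db, b = heapq.heappop(col)
            dc, c = heapq.heappop(col)   # latest of the three -> fast pin
            d_sum = max(da + SLOW_TO_SUM, db + SLOW_TO_SUM, dc + FAST_TO_SUM)
            d_carry = max(da + SLOW_TO_CARRY, db + SLOW_TO_CARRY,
                          dc + FAST_TO_CARRY)
            total = a + b + c
            heapq.heappush(col, (d_sum, total & 1))
            if i + 1 == len(cols):
                cols.append([])
            cols[i + 1].append((d_carry, total >> 1))
        i += 1
    return cols


def value_of(cols: list) -> int:
    """Weighted sum of all remaining bits; should equal x * y."""
    return sum(bit << i for i, col in enumerate(cols) for _, bit in col)


cols = reduce_partial_products(13, 11, 4)
print(value_of(cols))                      # 143
print(max(d for c in cols for d, _ in c))  # worst arrival time at the CPA
```

The per-column delays left in `cols` form the uneven arrival profile that the final adder has to absorb.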

Competing Approaches

Algorithm for Implementation of Fast Parallel Multipliers
[1] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach," IEEE Transactions on Computers, Vol. 45, No. 3, March 1996.
[2] V. G. Oklobdzija and D. Villeger, "Improving Multiplier Design by Using Improved Column Compression Tree and Optimized Final Adder in CMOS Technology," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, June 1995, 25 pages.
[3] V. G. Oklobdzija and D. Villeger, "Multiplier Design Utilizing Improved Column Compression Tree and Optimized Final Adder in CMOS Technology," Proceedings of the 1993 International Symposium on VLSI Technology, Systems and Applications, pp. 209-212, 1993.
[4] P. Stelling, C. Martel, V. G. Oklobdzija, R. Ravi, "Optimal Circuits for Parallel Multipliers," IEEE Transactions on Computers, Vol. 47, No. 3, pp. 273-285, March 1998.

Organization of Hitachi's DPL multiplier
An example of a fast multiplier is Hitachi's DPL multiplier, which was the first to achieve a delay under 5 ns for a 54-bit floating-point mantissa. This multiplier has a regular structure consisting of: (a) a Booth Recoder, (b) a Partial Product Reduction Tree, and (c) a final Carry-Propagate Adder (CPA), as shown in this slide.

Hitachi's 4:2 compressor structure
The key to the performance of Hitachi's multiplier lies in the use of DPL circuits and the efficiency with which DPL can realize a 4:2 compressor. The structure of Hitachi's 4:2 compressor is shown in this slide. The realization of the 4:2 function consists entirely of DPL multiplexers, which introduce only one pass-transistor delay in the critical path, as shown in the next slide. Indeed, this structure is one of the fastest transistor realizations of the Partial Product Reduction Tree. The speed of this multiplier can be further optimized by applying the TDM algorithm to realize optimal interconnections. Such a structure yields a 4.1 ns delay in 0.25 µm technology. For larger multipliers this structure may start showing degraded performance because of the long pass-transistor chain in the 4:2 compressors used. However, this is not of much concern, since 54 bits covers the double-precision floating-point format and larger multiplier sizes are rarely used in practice.

DPL multiplexer circuit
An efficient pass-transistor realization of multiplexers using Hitachi’s DPL logic is shown in this slide. This multiplexer introduces only one pass-transistor delay in the critical path, thus allowing three pass-transistor delays for one 4:2 compressor. The entire partial product reduction tree results in 12 pass-transistor delays in the critical path. The use of TDM reduces this number to 10.
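
One way to arrive at the figure of 12 is sketched below. It assumes that radix-4 Booth recoding of a 54-bit multiplier yields about 27 partial products (sign-handling rows are ignored) and that each level of 4:2 compressors halves the number of rows; neither number is stated on the slide. The three pass-transistor delays per compressor come from the notes above.

```python
import math


def compressor_levels(rows: int) -> int:
    """Number of 4:2 compressor levels needed to reduce `rows` rows to 2."""
    levels = 0
    while rows > 2:
        rows = math.ceil(rows / 2)   # each 4:2 level halves the row count
        levels += 1
    return levels


PASS_DELAYS_PER_COMPRESSOR = 3       # from the notes: three per 4:2 compressor
partial_products = 54 // 2           # assumed: radix-4 Booth on 54 bits

levels = compressor_levels(partial_products)  # 27 -> 14 -> 7 -> 4 -> 2
print(levels)                                 # 4
print(levels * PASS_DELAYS_PER_COMPRESSOR)    # 12
```

The reduction to 10 delays with TDM is reported in the notes and is not derived here.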

Addition Under Non-equal Signal Arrival Profile Assumption
P. Stelling, V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol. 14, No. 3, December 1996

Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree

Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June 1995

Performing Multiply-Add Operation in the Multiply Time
P. Stelling, V. G. Oklobdzija, "Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5-9, 1997.

Final Adder: Implementation

RECOMMENDATIONS

Fast Parallel Multipliers
Different counter and compressor families were compared.
The best way is to build a compressor of the maximal size (i.e. the entire size of the multiplier).
The essence of the optimal tree is optimal wiring and NOT the choice of counter/compressor family.
The use of Carry-Propagate Adders in the first stage is advantageous for larger multipliers and for particular technologies.
Tuning the Final Adder to the signal arrival profile is more important than the speed of the Final Adder.

THE END

Hollywood