VLSI Arithmetic Lecture 10: Multipliers Prof. Vojin G. Oklobdzija University of California http://www.ece.ucdavis.edu/acsel
Multiplication Algorithm* *from Parhami 27 November 2018
Multiplication* *from Parhami
Multiplier Recoding* *from Parhami
Multiplication by Constants *from Parhami
Fast Multipliers *from Parhami
Using Higher Radix Multiplier *from Parhami
Higher Radix Multiplier *from Parhami
Booth’s Recoding *from Parhami
Higher Radix Multipliers *from Parhami
Tree and Array Multipliers *from Parhami
Generating Partial Products *from G. Bewick
Generating Partial Products using Booth’s Recoding *from G. Bewick
Booth Partial Product Selector Logic *from G. Bewick
Tree Multipliers *from Parhami
Reduction using 4:2 Compressors *from G. Bewick
A Method for Generation of Fast Parallel Multipliers by Vojin G. Oklobdzija David Villeger Simon S. Liu Electrical and Computer Engineering University of California Davis
Fast Parallel Multipliers
Objective: Improved speed of the parallel multiplier via:
Improvements in partial-product bit reduction techniques
Optimization of the final adder for the uneven signal arrival profile from the multiplier tree
27 November 2018 Multiplier Design
Multiplication Algorithm
Initially $p^{(0)} = 0$; for $j = 0, \ldots, n-1$: $p^{(j+1)} = \frac{1}{r}\left(p^{(j)} + r^{n} X y_j\right)$; after n steps $p^{(n)} = XY$.

Traditionally, multiplication is performed in a variety of forms, in hardware and software, depending on the cost and transistor budget allocated for this particular operation. Today it is more likely to find a full hardware implementation of multiplication because of the growing demand for speed and the decreasing cost of hardware. We show a basic multiplication algorithm which operates on positive n-bit integers X and Y, resulting in the product P, which is 2n bits long: $P = XY = \sum_{i=0}^{n-1} X y_i r^{i}$. This expression indicates that multiplication is performed by summing n partial products $P_i$. Each $P_i$ is obtained by a simple arithmetic left shift of X by i positions and multiplication by the single digit $y_i$. For the binary radix (r = 2), $y_i$ can only be 0 or 1, and multiplication by the digit $y_i$ is a simple AND operation.

The addition of the n terms can be performed at once, by passing the partial products through a network of adders, or sequentially, by using a single adder n times. The recurrence at the top of the slide performs the multiplication of X and Y sequentially. It can be proved without difficulty that after n steps this recurrence results in the product $p^{(n)} = XY$.
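The sequential form can be sketched in a few lines of Python. This is only an illustration, assuming the usual right-shift recurrence p(j+1) = (p(j) + r^n X y_j) / r with p(0) = 0; the function name `shift_add_multiply` and its parameters are hypothetical and do not come from the slides.

```python
def shift_add_multiply(x, y, n, r=2):
    """Multiply two n-digit non-negative radix-r integers sequentially."""
    assert 0 <= x < r**n and 0 <= y < r**n
    p = 0                                  # p(0) = 0
    for j in range(n):
        y_j = (y // r**j) % r              # j-th digit of the multiplier Y
        # Add the partial product aligned at the top, then shift right one digit.
        # The division is exact: p(j) is always a multiple of r**(n - j).
        p = (p + r**n * x * y_j) // r
    return p                               # p(n) = X * Y


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))
```

For r = 2 the digit y_j is 0 or 1, so the product `x * y_j` inside the loop reduces to the AND operation mentioned in the notes.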
Parallel Multipliers
An alternative approach to sequential multiplication involves simultaneous generation of all bit products and their summation with an array of full adders. This approach uses an n by n array of AND gates to form the bit products and an array of n x n adders (and half adders) to sum the bit products in a carry-save fashion. Finally, a 2n-bit Carry-Propagate Adder is used to finish the summation and produce the result.

Wallace introduced a way of summing the partial product bits in parallel using a tree of Carry-Save Adders, which became generally known as the "Wallace Tree". A suggestion for improved efficiency of addition of the partial products was published by Dadda in 1965. Dadda introduced the notion of a counter, which takes p bits in the same bit position and outputs a q-bit number representing the count of ones at the input.

This process is shown in the slide illustrating the 8 by 8 multiplication process. The input, an 8 by 8 matrix of dots in which each dot represents a bit product, is shown as Matrix 0. Columns having more than six dots are reduced by the use of half adders and full adders so that no column in Matrix 1 has more than six dots. Each half adder takes in two dots and outputs one in the same column and one in the next more significant column. Each full adder takes in three dots and outputs one in the same column and one in the next more significant column. Half adders are shown by a "crossed" line in the succeeding matrix and full adders are shown by a line in the succeeding matrix. In each case the rightmost dot of the pair connected by a line is in the column from which the inputs were taken for the adder. In the succeeding steps, reduction is performed to Matrix 2 with no more than four dots per column, Matrix 3 with no more than three dots per column, and finally Matrix 4 with no more than two dots per column.

The height of the matrices is determined by working back from the final matrix and limiting the height of each matrix to the largest integer that is no more than 1.5 times the height of its successor. Each matrix is produced from its predecessor in one adder delay. Since the number of matrices is logarithmically related to the number of bits in the words to be multiplied, the delay of the matrix reduction process is proportional to log(n). The adder that reduces the final two-row matrix to the final product can be implemented as a fast adder, which also has logarithmic delay. The total delay of this multiplier is therefore proportional to the logarithm of the size of its operands. The effort to improve the speed of the parallel multiplier continued for almost 30 years.
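The height rule can be checked with a small sketch. It works back from a final height of 2 and multiplies by 1.5 (rounded down) until the operand width is reached; the function name `dadda_heights` is a hypothetical helper, not something defined in the slides.

```python
def dadda_heights(n):
    """Height limits of the successive matrices for an n by n multiplier.

    Starts from the final two-row matrix and grows each height to the largest
    integer no more than 1.5 times its successor, stopping below n.
    The result is listed in reduction order (first target first).
    """
    heights = []
    h = 2
    while h < n:
        heights.append(h)
        h = h * 3 // 2          # floor(1.5 * h)
    return heights[::-1]


if __name__ == "__main__":
    for width in (8, 24, 54):
        limits = dadda_heights(width)
        print(width, limits, "stages:", len(limits))
```

For the 8 by 8 case the limits come out as 6, 4, 3, 2, matching Matrix 1 through Matrix 4 in the notes, and the number of entries is the number of adder delays in the reduction.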
Minimum Number of Stages (Dadda's Rule)
Their Schemes
Use of 4:2 Compressors (A. Weinberger 1981, M. Santoro 1988)
4:2 Compressor
In 1981 Weinberger disclosed a structure which he called a "4-2 carry-save module". This structure contains a combination of full-adder cells in an intricate interconnection that yields faster partial product compression than the use of 3:2 counters. The structure actually compresses five partial product bits into three; however, it is connected in such a way that four of the inputs come from the same bit position of weight j, while one bit, known as carry-in, is fed from the neighboring position j-1. The output of such a 4:2 module consists of one bit in position j and two bits in position j+1. This structure is not a counter, though it became erroneously known as a "4:2 counter", but a "compressor" that compresses four partial product bits into two. The structure of the 4:2 compressor is shown in this slide.

Such a structure reduces partial product bits more efficiently: it halves the number of partial product bits at each stage. In the redesigned version of the 4:2 compressor the delay is that of 3 XOR gates in series, which makes this scheme more efficient than the one using 3:2 counters in a regular "Wallace Tree". The other equally important feature of the 4:2 compressor is that the interconnections between 4:2 cells follow a more regular pattern than in the "Wallace Tree".
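The input/output weights described above can be verified exhaustively. The sketch below assumes the plain construction from two cascaded full adders (not the redesigned cell); the names `full_adder` and `compressor_4to2` are illustrative only.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4to2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two full adders (assumed structure).

    x1..x4 have weight j, cin comes from position j-1.
    Returns (s, c, cout): s has weight j, c and cout have weight j+1.
    """
    s1, cout = full_adder(x1, x2, x3)   # cout does not depend on cin
    s, c = full_adder(s1, x4, cin)
    return s, c, cout


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        s, c, cout = compressor_4to2(*bits)
        # Five input bits of weight 1 equal one bit of weight 1 plus two of weight 2.
        assert sum(bits) == s + 2 * (c + cout)
    print("4:2 compressor identity holds for all 32 input combinations")
```

Because `cout` is produced by the first full adder only, a carry never ripples through more than one neighboring cell, which is what lets a row of these cells work in carry-save fashion.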
Critical Signal Path in a 4:2 Compressor Tree
Re-designed 4:2 Compressor with 3 XOR Delay (Nagamatsu, Toshiba)
[Figure: compressor with inputs I1, I2, I3, I4 and Cin, and outputs S, C and Cout]
This slide shows a redesigned 4:2 compressor as introduced by Toshiba. The advantage of this compressor is that its maximal delay is 3 XOR gates, as opposed to the 4 XOR delays of a regular implementation using full-adder cells. In the next slide we will show that this redesign was not necessary. What was missing was an understanding of how to balance the delays of the individual full adders.
Critical Path in a 4:2 Compressor
Signal Arrival Profile for RWT (3:2) and MWT (4:2)
Using 9:2 Compressors (P. Song, G. De Micheli 1991)
Compressor Tree Implemented with 9:2 Compressors
9:2 Compressor Structure
Critical Path (Equivalent XOR Gate Delays)
[Plot: delay in XOR gates (axis 2 to 24) versus multiplier width in bits (axis 10 to 100) for the 3:2 counter, the redesigned 4:2 compressor, and the redesigned 9:2 compressor]
Delay Expressed as No. of XOR Gate Delays
Use of Higher-Order Compressors (D. Villeger, V. G. Oklobdzija 1993)
Design of a 13:2 Compressor from a 9:2 Compressor
Delay Profile of a 24:2 Compressor Tree
Compressor Family Characteristics
Using Carry-Propagate Adders (G. Bewick 1993) (D. Villeger & V. G. Oklobdzija 1993)
Column Compression Tree Consisting of 4-bit Adders
Bit Reduction Using 4-bit Adders (24X24)
Idea !!!!!
A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach – TDM (Oklobdzija, Villeger, Liu, 1994)
Signal Delays in a Full Adder (3,2) Counter
[Figure: full adder with inputs A, B, Cin and outputs Sum, Carry; the legend marks the fast input and the fast output]
Three-Dimensional optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)
A further improvement in the speed of the parallel multiplier was achieved with the introduction of the TDM method in 1996. TDM optimizes the entire Partial Product Reduction Tree in one pass, hence the name Three-Dimensional optimization Method. The important aspect of this method is the sorting of fast inputs and fast outputs. It was realized that the most important step in achieving fast partial product reduction is to interconnect the elements properly. Thus the counters, 3:2 adders in this particular case, were characterized in a way that identifies the delay from each input to each output. The Partial Product Reduction Tree was then interconnected so that signals with large delays are connected to "fast inputs" and signals with small delays to "slow inputs". This slide illustrates how a 4:2 compressor with 3 XOR delays can be obtained by a simple application of the TDM method, without the need for redesign.
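The effect of sorting signals onto fast and slow inputs can be illustrated with a tiny delay model. The numbers below are assumptions chosen for illustration (in XOR-gate units: A and B reach Sum in 2, Cin reaches Sum in 1, every input reaches Carry in 1); the slides do not list the actual values, and the names `FA_TO_SUM`, `FA_TO_CARRY`, `fa_delays` and `compressor_delay` are hypothetical.

```python
# Assumed full-adder pin-to-pin delays in XOR-gate units (illustrative only).
FA_TO_SUM = {"A": 2, "B": 2, "Cin": 1}     # Cin is the "fast input"
FA_TO_CARRY = {"A": 1, "B": 1, "Cin": 1}   # Carry is the "fast output"


def fa_delays(arrival):
    """arrival maps pin name to arrival time; returns (sum_time, carry_time)."""
    s = max(arrival[pin] + FA_TO_SUM[pin] for pin in arrival)
    c = max(arrival[pin] + FA_TO_CARRY[pin] for pin in arrival)
    return s, c


def compressor_delay(sort_inputs):
    """Worst-case output time of a 4:2 compressor made of two full adders."""
    # First full adder: three partial product bits, all arriving at time 0.
    s1, cout = fa_delays({"A": 0, "B": 0, "Cin": 0})
    # Second full adder sees the intermediate sum, the fourth bit (time 0) and
    # the carry from the identical cell in the neighboring column (time cout).
    signals = [s1, 0, cout]
    if sort_inputs:
        signals.sort()      # earliest signal to slow pin A, latest to fast pin Cin
    s, c = fa_delays(dict(zip(("A", "B", "Cin"), signals)))
    return max(s, c)


if __name__ == "__main__":
    print("unsorted wiring:", compressor_delay(False), "XOR delays")
    print("sorted wiring:  ", compressor_delay(True), "XOR delays")
```

Under these assumed delays the only difference between the two cases is which pin of the second full adder receives the late intermediate sum; the cells themselves are unchanged.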
Method
Computer Tools
Design of Parallel Multipliers
Algorithm for Automatic Generation of Partial Product Array.
Initialize:
Form 2N-1 lists Li ( i = 0,...,2N-2 ), each consisting of pi elements, where:
  pi = i+1 for i ≤ N-1 and pi = 2N-1-i for i ≥ N
An element of a list Li ( j = 0,...,pi-1 ) is a pair <nj, Dj>i where:
  nj : is a unique node-identifying name
  Dj : is a delay associated with that node, representing the delay of a signal arriving at the node nj with respect to some reference point.
For i = 0, 1 and 2N-2: connect nodes from the corresponding lists Li directly to the CPA.
For i=2 to i=2N-3 {Partial Product Array Generation}
Begin For
  if length of Li is even Then
  Begin If
    sort the elements of Li in ascending order by the values of delay dj
    connect an HA to the first 2 elements of Li starting with the slowest input
    Ds = max {dA+dA-S, dB+dB-S}
    Dc = max {dA+dA-C, dB+dB-C}
    remove 2 elements from Li
    insert the pair <Ds,NetName> into Li
    insert the pair <Dc,NetName> into Li+1
    decrement the length of Li
    increment the length of Li+1
  End If;
  while length of Li > 3
  Begin While
    sort the elements of Li in ascending order by the values of delay dj
    connect an FA to the first 3 elements of Li starting with the slowest input of the FA:
    Ds = max {dcA+dcA-S, dcB+dcB-S, dcCi+dcCi-S}
    Dc = max {dcA+dcA-C, dcB+dcB-C, dcCi+dcCi-C}
    remove 3 elements from Li
    insert the pair <Ds,NetName> into Li
    insert the pair <Dc,NetName> into Li+1
    subtract 2 from the length of Li
    increment the length of Li+1
  End While;
  sort the elements of Li
  connect an FA to the last 3 nodes of Li
  connect the S and C to the bit i and i+1 of the CPA
End For;
End Method;
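The column-by-column procedure above can be prototyped directly. The sketch below follows the listed steps but is not the authors' tool: the delay constants are assumed values in XOR-gate units, each list element carries a bit value instead of a net name so that the result can be checked numerically, and all identifiers (`reduce_partial_products`, `full_adder`, `FA_SUM` and so on) are hypothetical.

```python
# Assumed pin-to-pin delays in XOR-gate units (illustrative, not from the slides).
FA_SUM, FA_CARRY = (2, 2, 1), (1, 1, 1)    # order: A, B, Cin
HA_SUM, HA_CARRY = (1, 1), (1, 1)          # order: A, B


def full_adder(three):
    """Wire the earliest signal to the slowest pin; return (sum, carry) pairs."""
    (da, a), (db, b), (dc, c) = sorted(three)
    ds = max(da + FA_SUM[0], db + FA_SUM[1], dc + FA_SUM[2])
    dcy = max(da + FA_CARRY[0], db + FA_CARRY[1], dc + FA_CARRY[2])
    return (ds, a ^ b ^ c), (dcy, (a & b) | (a & c) | (b & c))


def reduce_partial_products(x, y, n):
    """Reduce the n by n bit-product array of x*y to at most two bits per column.

    Returns cpa, where cpa[i] holds the (delay, bit) pairs feeding CPA bit i.
    """
    assert n >= 4, "this sketch assumes every processed column reaches 3 elements"
    cols = [[] for _ in range(2 * n)]
    for a in range(n):
        for b in range(n):
            cols[a + b].append((0, (x >> a) & (y >> b) & 1))  # arrive at t = 0
    cpa = [[] for _ in range(2 * n)]
    for i in range(2, 2 * n - 2):
        col = cols[i]
        if len(col) % 2 == 0:               # one half adder makes the length odd
            col.sort()
            (da, a), (db, b) = col[0], col[1]
            del col[:2]
            col.append((max(da + HA_SUM[0], db + HA_SUM[1]), a ^ b))
            cols[i + 1].append((max(da + HA_CARRY[0], db + HA_CARRY[1]), a & b))
        while len(col) > 3:                 # full adders on the earliest signals
            col.sort()
            s, c = full_adder(col[:3])
            del col[:3]
            col.append(s)
            cols[i + 1].append(c)
        s, c = full_adder(col)              # last three nodes feed the CPA
        cpa[i].append(s)
        cpa[i + 1].append(c)
    for i in (0, 1, 2 * n - 2):             # short columns go straight to the CPA
        cpa[i].extend(cols[i])
    return cpa


if __name__ == "__main__":
    cpa = reduce_partial_products(200, 123, 8)
    for i, pairs in enumerate(cpa):
        print(i, [delay for delay, _ in pairs])   # arrival profile at the CPA
```

Printing the per-bit arrival times gives the uneven signal arrival profile that the final adder has to absorb; changing the assumed delay tuples changes that profile but not the arithmetic result.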
Competing Approaches
Algorithm for Implementation of Fast Parallel Multipliers
[1] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach," IEEE Transactions on Computers, Vol. 45, No. 3, March 1996.
[2] V. G. Oklobdzija and D. Villeger, "Improving Multiplier Design by Using Improved Column Compression Tree and Optimized Final Adder in CMOS Technology," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, June 1995, 25 pages.
[3] V. G. Oklobdzija and D. Villeger, "Multiplier Design Utilizing Improved Column Compression Tree and Optimized Final Adder in CMOS Technology," Proceedings of the 1993 International Symposium on VLSI Technology, Systems and Applications, pp. 209-212, 1993.
[4] P. Stelling, C. Martel, V. G. Oklobdzija, R. Ravi, "Optimal Circuits for Parallel Multipliers," IEEE Transactions on Computers, Vol. 47, No. 3, pp. 273-285, March 1998.
Organization of Hitachi's DPL multiplier
An example of a fast multiplier is Hitachi's DPL multiplier, which was the first to achieve a delay under 5 ns for a 54-bit floating-point mantissa. This multiplier has a regular structure consisting of: (a) a Booth recoder, (b) a Partial Product Reduction Tree, and (c) a final Carry-Propagate Adder (CPA), as shown in this slide.
Hitachi's 4:2 compressor structure
The key to the performance of Hitachi's multiplier lies in the use of DPL circuits and the efficiency with which DPL can realize a 4:2 compressor. The structure of Hitachi's 4:2 compressor is shown in this slide. The realization of the 4:2 function consists entirely of DPL multiplexers, each of which introduces only one pass-transistor delay in the critical path, as shown in the next slide. Indeed, this structure is one of the fastest transistor realizations of the Partial Product Reduction Tree. The speed of this multiplier can be further optimized by applying the TDM algorithm to realize optimal interconnections. Such a structure yields a 4.1 ns delay in 0.25 µm technology. For larger multipliers this structure may start showing degraded performance because of the long pass-transistor chain in the 4:2 compressors used. However, this is not of much concern, since 54 bits covers the mantissa of the double-precision floating-point format and larger multiplier sizes are rarely used in practice.
DPL multiplexer circuit
An efficient pass-transistor realization of multiplexers using Hitachi's DPL logic is shown in this slide. This multiplexer introduces only one pass-transistor delay in the critical path, which results in three pass-transistor delays for one 4:2 compressor. The entire partial product reduction tree has 12 pass-transistor delays in its critical path. The use of TDM reduces this number to 10.
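The figure of 12 pass-transistor delays is consistent with a simple level count. The sketch below is a back-of-the-envelope check under two assumptions that are not stated in the slides: that radix-4 Booth recoding leaves about 27 partial-product rows for a 54-bit operand, and that a leftover group of three rows is handled by a 3:2 stage within the same level. The function name `levels_of_4to2` is hypothetical.

```python
def levels_of_4to2(rows):
    """Number of 4:2 compressor levels needed to reduce `rows` rows to two."""
    levels = 0
    while rows > 2:
        groups, rest = divmod(rows, 4)
        # Each group of four rows becomes two; three leftover rows become two,
        # while one or two leftover rows pass through unchanged.
        rows = 2 * groups + (2 if rest == 3 else rest)
        levels += 1
    return levels


if __name__ == "__main__":
    rows = 27                     # assumed row count after Booth recoding
    delays_per_compressor = 3     # pass-transistor delays per 4:2 compressor
    print(levels_of_4to2(rows) * delays_per_compressor)
```

Under these assumptions the row count goes 27, 14, 8, 4, 2, which is four levels of three pass-transistor delays each.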
Addition Under Non-equal Signal Arrival Profile Assumption
P. Stelling, V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol. 14, No. 3, December 1996
Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree
Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June 1995
Performing Multiply-Add Operation in the Multiply Time
P. Stelling, V. G. Oklobdzija, "Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5-9, 1997.
Final Adder: Implementation
RECOMMENDATIONS
Fast Parallel Multipliers
Different counter and compressor families were compared.
The best way is to build a compressor of the maximal size (i.e. the entire size of the multiplier).
The essence of the optimal tree is optimal wiring and NOT the choice of counter/compressor family.
The use of Carry-Propagate Adders is advantageous for larger multipliers in the first stage and for particular technologies.
Tuning of the final adder to the signal arrival profile is more important than the speed of the final adder.
THE END