Download presentation
Presentation is loading. Please wait.
1
VLSI Arithmetic Lecture 10: Multipliers
Prof. Vojin G. Oklobdzija University of California
2
Multiplication Algorithm*
*from Parhami 27 November 2018
3
Multiplication Algorithm*
*from Parhami 27 November 2018
4
Multiplication Algorithm*
*from Parhami 27 November 2018
5
*from Parhami 27 November 2018
6
Multiplication* *from Parhami 27 November 2018
7
Multiplication* *from Parhami 27 November 2018
8
*from Parhami 27 November 2018
9
*from Parhami 27 November 2018
10
Multiplier Recoding* *from Parhami 27 November 2018
11
*from Parhami 27 November 2018
12
Multiplication by Constants
*from Parhami 27 November 2018
13
Multiplication by Constants
*from Parhami 27 November 2018
14
Fast Multipliers *from Parhami 27 November 2018
15
Using Higher Radix Multiplier
*from Parhami 27 November 2018
16
Using Higher Radix Multiplier
*from Parhami 27 November 2018
17
Higher Radix Multiplier
*from Parhami 27 November 2018
18
*from Parhami 27 November 2018
19
Booth’s Recoding *from Parhami 27 November 2018
20
Booth’s Recoding *from Parhami 27 November 2018
21
Booth’s Recoding *from Parhami 27 November 2018
22
*from Parhami 27 November 2018
23
Higher Radix Multipliers
*from Parhami 27 November 2018
24
Tree and Array Multipliers
*from Parhami 27 November 2018
25
Tree and Array Multipliers
*from Parhami 27 November 2018
26
Generating Partial Products
27 November 2018 *from G. Bewick
27
Generating Partial Products
*from G. Bewick 27 November 2018
28
Generating Partial Products using Booth’s Recoding
*from G. Bewick 27 November 2018
29
Generating Partial Products using Booth’s Recoding
*from G. Bewick 27 November 2018
30
Booth Partial Product Selector Logic
*from G. Bewick 27 November 2018
31
Tree Multipliers *from Parhami 27 November 2018
32
27 November 2018
33
27 November 2018
34
27 November 2018
35
27 November 2018
36
27 November 2018
37
Tree Multipliers *from Parhami 27 November 2018
38
Tree Multipliers *from Parhami 27 November 2018
39
Tree Multipliers *from Parhami 27 November 2018
40
27 November 2018
41
27 November 2018
42
27 November 2018
43
Reduction using 4:2 Compressors
*from G. Bewick 27 November 2018
44
A Method for Generation of Fast Parallel Multipliers
by Vojin G. Oklobdzija David Villeger Simon S. Liu Electrical and Computer Engineering University of California Davis
45
Fast Parallel Multipliers
Objective Improved Speed of Parallel Multiplier via: Improvements in Partial-Product Bit Reduction Techniques Optimization of the Final Adder for the Uneven Signal Arrival Profile from the Multiplier Tree 27 November 2018 Multiplier Design
46
Multiplication Algorithm: initially for j=0,....,n-1
Traditionally multiplication operation is performed in a variety of forms, in hardware and software, depending on the cost and transistor budget allocated for this particular operation. Today it is more likely to find a full hardware implementation of the multiplication because of growing demand for speed and decreasing cost of hardware. We show a basic multiplication algorithm which operates on positive n-bit long integers X and Y resulting in the product P which is 2n - bits long. This expression indicates that the multiplication process is performed by summing n terms of a partial product Pi. This product Pi is obtained by simple arithmetic left shift of X for the i positions and multiplication by the single digit yi. For the binary radix (r=2), yi can only be 0 or 1 and multiplication by the digit yi is a simple AND operation. The addition of n terms can be performed at once, by passing the partial products through a network of adders or sequentially, by adding partial products using an adder n times. The algorithm to perform the multiplication of X and Y is shown in the slide. It can be proved without difficulties that after n steps this recurrence results in a product p(n)=XY. p(n)=XY after n steps 27 November 2018 Multiplier Design
47
27 November 2018 Multiplier Design
48
27 November 2018 Multiplier Design
49
Parallel Multipliers Multiplier Design 27 November 2018
An alternative approach to sequential multiplication involves simultaneous generation of all bit products and their summation with an array of full adders. This approach uses an n by n array of AND gates to form the bit products, an array of n x n adders (and half adders) to sum the bit products in a carry-save fashion. Finally a 2n Carry-Propagate Adder is used in the final step to finish the summation and produce the result. Wallace introduced a way of summing the partial product bits in parallel using a tree of Carry Save Adders which became generally known as the “Wallace Tree” . A suggestion for improved efficiency of addition of the partial products was published by Dadda in 1965. Dadda introduces a notion of a counter which will take a number of bits p in the same bit position and output a number q which represent the count of ones at the input. This process is shown in the slide illustrating 8 by 8 multiplication process. An input of 8 by 8 matrix of dots, each dot represents a bit product, is shown as a Matrix 0. Columns having more than six dots are reduced by the use of half adders. Each half adder takes in two dots and outputs one in the same column and one in the next more significant column. Each full adder takes in three dots and outputs one in the same column and one in the next more significant column. No column in Matrix 1 will have more than six dots. Half adders are shown by a “crossed” line in the succeeding matrix and full adders are shown by a line in the succeeding matrix. In each case the right most dot of the pair that are connected by a line is in the column from which the inputs were taken for the adder. In the succeeding steps reduction is performed to Matrix 2 with no more than four dots per column, Matrix 3 with no more than three dots per column, and finally Matrix 4 with no more than two dots per column is. The height of the matrices is determined by working back from the final matrix and limiting the height of each matrix to the largest integer that is no more than 1.5 times the height of its successor. Each matrix is produced from its predecessor in one adder delay. Since the number of matrices is logarithmically related to the number of bits in the words to be multiplied, the delay of the matrix reduction process is proportional to log(n). The adder that reduces the final two row matrix to the final product can be implemented as a fast adder, which also has logarithmic delay. The total delay for this multiplier is proportional to the logarithm of the size of its operands. The effort of improving the speed of the parallel multiplier continued for almost 30 years. 27 November 2018 Multiplier Design
50
27 November 2018 Multiplier Design
51
27 November 2018 Multiplier Design
52
27 November 2018 Multiplier Design
53
27 November 2018 Multiplier Design
54
27 November 2018 Multiplier Design
55
Minimum Number of Stages (Dada’s Rule)
27 November 2018 Multiplier Design
56
Their Schemes
57
Use of 4:2 Compressors A. Weinberger 1981 M. Santoro 1988
27 November 2018 Multiplier Design
58
4:2 Compressor Multiplier Design 27 November 2018
In 1981 Weinberger disclosed a structure which he called "4-2 carry-save module". This structure contained a combination of Full Adder cells in an intricate interconnection structure which yields a faster partial product compression than the use of 3:2 counters. The structure actually compresses five partial product bits into three, however it is connected in such a way that four of the inputs are coming from the same bit position of the weight j while one bit is fed from the neighboring position j-1 also known as carry-in. The output of such a 4:2 module consists of one bit in the position j and two bits in the position j+1. This structure does not represent a counter, though it became erroneously known as "4:2 counter“, but a "compressor" that would compress four partial product bits into two. The structure of 4:2 compressor is shown in this slide. The efficiency of such a structure to reduce partial product bits is higher. It reduces the number of partial product bits by one half at each stage. The speed of such a 4:2 compressor has been determined by the speed of 3 XOR gates in series, in the redesigned version of 4:2 compressor, making such a scheme more efficient that the one using 3:2 counters in a regular "Wallace Tree". The other equally important feature of the use of 4:2 compressor is that the interconnections between 4:2 cells follow more regular pattern than it is the case of the "Wallace Tree". 27 November 2018 Multiplier Design
59
Critical Signal Path in a 4:2 Compressor Tree
27 November 2018 Multiplier Design
60
Re-designed 4:2 Compressor with 3 XOR Delay (Nagamatsu, Toshiba)
in I1 I2 S I3 I4 1 C This slide shows a re-design 4:2 compressor as introduced by Toshiba. The advantage of this compressor is that it results in 3 XOR gate maximal delay as opposed to 4 XOR delay in a regular implementation using Full-Adder cells. In the next slide we will show how this re-design was not necessary. The missing point was in the lack of understanding of how to balance delays of individual full-adders. C out 27 November 2018 Multiplier Design
61
Critical Path in a 4:2 Compressor
27 November 2018 Multiplier Design
62
Signal Arrival Profile for RWT (3:2) and MWT (4:2)
27 November 2018 Multiplier Design
63
Using 9:2 Compressors (P. Song, G. De Michelli 1991)
27 November 2018 Multiplier Design
64
Compressor Tree Implemented with 9:2 Compressors
27 November 2018 Multiplier Design
65
9:2 Compressor Structure
27 November 2018 Multiplier Design
66
Title Multiplier Design Critical Path: (Equivalent XOR Gate Delays)
24 Title 4:2 Compressor (Redesigned) 22 Delay (XOR Gates) 9to2 Compressor (Redesigned ) 20 3,2 Counter 18 16 14 12 10 8 6 4 2 10 20 30 40 50 60 70 80 90 100 Multiplier Width (bits) 27 November 2018 Multiplier Design
67
Delay Expressed as No. of XOR Gate Delays
27 November 2018 Multiplier Design
68
Use of Higher-Order Compressors
D. Villeger, V.G. Oklobdzija 1993 27 November 2018 Multiplier Design
69
Design of a 13:2 Compressor from a 9:2 Compressor
27 November 2018 Multiplier Design
70
Delay Profile of a 24:2 Compressor Tree
27 November 2018 Multiplier Design
71
Compressor Family Characteristics
27 November 2018 Multiplier Design
72
Using Carry-Propagate Adders
(G. Bewick 1993) (D. Villeger & V. G. Oklobdzija 1993) 27 November 2018 Multiplier Design
73
Column Compression Tree Consisting of 4-bit Adders
27 November 2018 Multiplier Design
74
Bit Reduction Using 4-bit Adders (24X24)
27 November 2018 Multiplier Design
75
Idea !!!!!
76
(Oklobdzija, Villeger, Liu, 1994)
A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach – TDM (Oklobdzija, Villeger, Liu, 1994) 27 November 2018 Multiplier Design
77
27 November 2018 Multiplier Design
78
27 November 2018 Multiplier Design
79
27 November 2018 Multiplier Design
80
Signal Delays in a Full Adder
(3,2) Counter A B Cin Sum * Fast Input Fast Output Carry 27 November 2018 Multiplier Design
81
Three-Dimensional optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)
The further improvement in speed of a parallel multiplier was achieved by introduction of TDM method in 1996. TDM optimizes the entire Partial Product Reduction Tree in one pass, thus the name Three Dimensional optimizaiton Method. The important aspect of this method is in sorting of fast inputs and fast outputs. It was realized that the most important step in achieving fast partial product reduction is to properly interconnect the elements. Thus, appropriate counters, 3:2 adders in this particular case, were characterized in a way which identifies delay of each input to each output. Interconnecting of the Partial Product Reduction Tree was done in a way in which signals with large delays are connected to "fast inputs" and signals with small delay to "slow inputs" . This slide illustrates how an 4:1 compressor with 3 XOR delay can be obtained by a simple application of TDM method without the need for redesign. 27 November 2018 Multiplier Design
82
Method
83
27 November 2018 Multiplier Design
84
27 November 2018 Multiplier Design
85
27 November 2018 Multiplier Design
86
27 November 2018 Multiplier Design
87
Computer Tools
88
27 November 2018 Multiplier Design
89
Design of Parallel Multipliers
Algorithm for Automatic Generation of Partial Product Array. Initialize: Form 2N-1 lists Li ( i = 0, 2N-2 ) each consisting of pi elements where: p i = i+1 for i £ N-1 and p i = 2N-1-i for i N An element of a list Li ( j = 0,...,pi-1 ) is a pair: <nj, Dj>i where: nj : is a unique node identifying name Dj : is a delay associated with that node representing a delay of a signal arriving to the node nj with respect to some reference point. For i = 0,1 and 2N-2: connect nodes from the corresponding lists Li directly to the CPA. 27 November 2018 Multiplier Design
90
27 November 2018 Multiplier Design
91
For i=2 to i=2N-3 {Partial Product Array Generation} Begin For if length of Li is even Then Begin If sort the elements of Li in ascending order by the values of delay dj connect an HA to the first 2 elements of Li starting with the slowest input Ds =max {dA+dA-S, dB+dB-S} Dc =max {dA+dA-C, dB+dB-C} remove 2 elements from Li insert the pair <Ds,NetName> into Li insert the pair <Dc,NetName> into Li decrement the length of Li increment the length of Li End If; 27 November 2018 Multiplier Design
92
while length of Li > 3 Begin While sort the elements of Li in ascending order by the values of delay dj connect an FA to the first 3 elements of Li starting with the slowest input of the FA: Ds =max {dcA+dcA-S, dcB+dcB-S, dcCi+dcCi-S} Dc = max {dcA+dcA-C, dcB+dcB-C, dcCi+dcCi-C} remove 3 elements from Li insert the pair <Ds,NetName> into Li insert the pair <Dc,NetName> into Li+1 subtract 2 from the length of Li increment the length of Li+1 End While; sort the elements of Li connect an FA to the last 3 nodes of Li connect the S and C to the bit i and i+1 of the CPA End For; End Method;
93
Competing Approaches
95
27 November 2018 Multiplier Design
96
27 November 2018 Multiplier Design
97
Algorithm for Implementation of Fast Parallel Multipliers
[1] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A Method For Speed Optimized Partial Product Reduction And Generation Of Fast Parallel Multipliers Using An Algorithmic Approach,” IEEE Transactions on Computers, Vol 45, No.3, March, 1996. [2] V. G. Oklobdzija and D. Villeger, “Improving Multiplier Design By Using Improved Column Compression Tree And Optimized Final Adder In CMOS Technology,” IEEE Transactions on VLSI Systems, Vol.3, No.2, June, 1995, 25 pages. [3] V. G. Oklobdzija and D. Villeger, “Multiplier Design Utilizing Improved Column Compression Tree And Optimized Final Adder In CMOS Technology,” Proceedings of the 1993 International Symposium on VLSI Technology, Systems and Applications, pp , 1993. [4] P. Stelling, C. Martel, V. G. Oklobdzija, R. Ravi, “Optimal Circuits for Parallel Multipliers,” IEEE Transaction on Computers, Vol. 47, No.3, pp , March, 1998. 27 November 2018 Multiplier Design
98
Organization of Hitachi's DPL multiplier
And example of a fast multiplier is Hitachi's DPL multiplier which was the first one to achieve under 5nS speed for a 54-bit floating-point mantissa. This multiplier is of a regular structure including: (a.) A Booth Recoder (b.) A Partial Product Reduction Tree and (c.) A final Carry Propagate Adder (CPA) as shown in this slide. 27 November 2018 Multiplier Design
99
Hitachi's 4:2 compressor structure
The key to performance of Hitachi's multiplier lays in the use of DPL circuits and the efficiency with which DPL can realize 4:2 compressor. The structure of Hitachi's 4:2 compressor is shown in this slide. The realization of the 4:2 function consists entirely of DPL multiplexers which introduce only one pass-transistor delay in the critical path as shown in the next slide. Indeed this structure is one of the fastest transistor realizations for the Partial Product Reduction Tree. The speed of this multiplier can be further optimized by applying TDM algorithm realizing optimal interconnections. Such a structure yields 4.1nS delay in 0.25u technology. For larger size multipliers this structure may start showing degraded performance because of the long pass-transistor chain in 4:2 compressors used. However, this is of not much concern since 54-bit represents double precession floating point format and we rarely use larger multiplier sizes in practice. 27 November 2018 Multiplier Design
100
DPL multiplexer circuit
An efficient pass transistor realization of multiplexers using Hitachi’s DPL logic is shown in this slide. This multiplexer introduces only one pass-transistor delay in the critical path, thus allowing three pass transistor delays for one 4:2 compressor. The entire partial product reduction tree results in 12 pass-transistor delays in the critical path. The use of TDM reduces this number to 10. 27 November 2018 Multiplier Design
101
Addition Under Non-equal Signal Arrival Profile Assumption
P. Stelling , V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol.14, No.3, December 1996 27 November 2018 Multiplier Design
102
Signal Arrival Profile form the Parallel Multiplier Partial-Product Recuction Tree
27 November 2018 Multiplier Design
103
Oklobdzija, Villeger, IEEE Transactions on VLSI Systems, June, 1995
27 November 2018 Multiplier Design
104
Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June, 1995
27 November 2018 Multiplier Design
105
27 November 2018 Multiplier Design
106
27 November 2018 Multiplier Design
107
27 November 2018 Multiplier Design
108
27 November 2018 Multiplier Design
109
27 November 2018 Multiplier Design
110
27 November 2018 Multiplier Design
111
27 November 2018 Multiplier Design
112
27 November 2018 Multiplier Design
113
Performing Multiply-Add Operation in the Multiply Time
P. Stelling, V. G. Oklobdzija, " Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5 - 9, 1997. 27 November 2018 Multiplier Design
114
27 November 2018 Multiplier Design
115
Final Adder: Implementation
27 November 2018 Multiplier Design
116
Final Adder: Implementation
27 November 2018 Multiplier Design
117
Final Adder: Implementation
27 November 2018 Multiplier Design
118
Final Adder: Implementation
27 November 2018 Multiplier Design
119
RECOMENDATIONS
120
Fast Parallel Multipliers
Different Counter and Compressor Families were compared. The best way is to build a compressor of the maximal size (i.e. the entire size of the multiplier) The Essence of the optimal tree is optimal wiring and NOT the use of counter/compressor family The use of Carry-Propagate Adders is advantageous for larger size multipliers in the first stage and for particular technology Tuning of the Final Adder into the signal arrival profile is more important than the speed of the Final Adder. 27 November 2018 Multiplier Design
121
THE END 27 November 2018 Multiplier Design
122
Hollywood
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.