
http://www.ece.umn.edu/users/kia/Courses/EE5324 Kia Bazargan
EE 5324 – VLSI Design II, Part III: Multipliers and Shifters
Kia Bazargan, University of Minnesota, Spring 2006
© Kia Bazargan

References and Copyright
Textbooks referenced:
[WE92] N. H. E. Weste, K. Eshraghian, “Principles of CMOS VLSI Design: A System Perspective,” Addison-Wesley, 2nd ed., 1992.
[Rab96] J. M. Rabaey, “Digital Integrated Circuits: A Design Perspective,” Prentice Hall, 1996.
[Par00] B. Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press, 2000.

References and Copyright (cont.)
Slides used (modified by Kia when necessary):
[©Hauck] © Scott A. Hauck; G. Borriello, C. Ebeling, S. Burns, 1995, University of Washington
[©Prentice Hall] © Prentice Hall 1995, © UCB; slides for [Rab96]
[©Oxford U Press] © Oxford University Press, New York; slides for [Par00], with permission from the author

Why Multipliers?
Used in many DSP applications:
- Vector products, matrix multiplication
- Convolution
- Filtering (tap filters, FIR, ...)
“At least one good reason for studying multiplication and division is that there is an infinite number of ways of performing these operations and hence there is an infinite number of PhDs (or expense-paid visits to conferences in USA) to be won from inventing new forms of multiplier” (Alan Clements, The Principles of Computer Hardware, 1986, quoted in [Par00])

Outline
- Serial multiplier
- Multiplier arrays
- Carry-save adder (CSA) and multiple-operand addition
- Booth encoding
- Pipelined multipliers
- Wallace tree
- Signed multiplication
- Shifters

Multiplication Example
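Example: 12 × 5
Multiplicand: 1100 (12)
Multiplier: 0101 (5)
The 4-bit multiplier produces 4 partial products. Each one is either 0000 or the multiplicand, shifted left by the position of the corresponding multiplier bit; their sum is 00111100 (60).
Each partial product can be generated using an array of AND gates.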
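A minimal Python sketch of the AND-array idea, assuming unsigned N-bit operands: each multiplier bit y_i gates every multiplicand bit through one AND gate, and the resulting row is weighted by 2^i. The function name `partial_products` is hypothetical.

```python
def partial_products(mcand: int, mplier: int, n: int) -> list[int]:
    """Return the n shifted partial products of two unsigned n-bit numbers."""
    rows = []
    for i in range(n):
        y_i = (mplier >> i) & 1
        # One AND gate per multiplicand bit: x_j AND y_i
        bits = [((mcand >> j) & 1) & y_i for j in range(n)]
        row = sum(b << j for j, b in enumerate(bits))
        rows.append(row << i)  # the weight of multiplier bit i is 2^i
    return rows


pps = partial_products(0b1100, 0b0101, 4)  # 12 x 5
for p in pps:
    print(f"{p:08b}")
print(f"{sum(pps):08b} = {sum(pps)}")
```

For 12 × 5 this prints the four rows 00001100, 00000000, 00110000, 00000000 and their sum 00111100 = 60.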

Sequential Multiplier
A shift register, initialized with the multiplicand, shifts it left once for each partial product. One bit of the multiplier is presented to the AND gates each cycle. The shift register, adder, and result register are all 2N bits wide.
We have to wait for the 2N-bit addition to complete each cycle (or do we? What if we wait only for N bits to propagate?)
N-bit propagation delay: remember Slide #74 of the adders lecture (“8-bit Ripple Carry Addition: Example”)? What happens if the LSB and MSB have zero inputs? How long should we wait for the sum signal to stabilize? In that slide, does it matter if we add more bits to the left and to the right, as long as the input to all of them is 00?
[Figure: shift register, AND gates, adder, and result register] [©Hauck]
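The shift-add loop described above can be sketched cycle by cycle in Python. This is a behavioral model under the assumption of unsigned N-bit operands; `sequential_multiply` is a hypothetical name, and the masks stand in for the 2N-bit register widths.

```python
def sequential_multiply(mcand: int, mplier: int, n: int) -> int:
    """Shift-add multiplier with a 2N-bit shift register, adder, and result register."""
    mask_2n = (1 << (2 * n)) - 1
    shift_reg = mcand   # 2N-bit shift register, initialized with the multiplicand
    result = 0          # 2N-bit result register
    for cycle in range(n):
        y = (mplier >> cycle) & 1                # one multiplier bit per cycle
        pp = shift_reg if y else 0               # AND gates
        result = (result + pp) & mask_2n         # 2N-bit adder
        shift_reg = (shift_reg << 1) & mask_2n   # shift the multiplicand left
    return result


print(sequential_multiply(12, 5, 4))
```

Each loop iteration corresponds to one clock cycle, so an N-bit multiply takes N cycles, each limited by the 2N-bit addition.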

Sequential Multiplier – Resource Requirements
- Adder: 2N bits
- Registers: 2N bits wide
Better design: shift the result register to the right instead of shifting the multiplicand left:
- Uses N AND gates
- Uses an N-bit adder
[Figure: original datapath (adder, register, shift register) versus the improved datapath (register, adder, shift register)] [©Hauck]
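A sketch of the improved design, assuming (as is common, though the slide does not say so) that the multiplier is loaded into the lower half of the 2N-bit result register and consumed one bit per shift. Only N AND gates and an N-bit adder with a carry-out are needed. The function name is hypothetical.

```python
def shift_right_multiply(mcand: int, mplier: int, n: int) -> int:
    """Improved sequential multiplier: N-bit adder, result register shifted right."""
    hi = 0                          # upper N bits of the 2N-bit result register
    lo = mplier & ((1 << n) - 1)    # lower N bits start out holding the multiplier
    for _ in range(n):
        # N AND gates: the register LSB selects M or 0; the N-bit adder may carry out
        s = hi + (mcand if lo & 1 else 0)
        # Shift {carry, sum, lo} right by one position
        lo = (lo >> 1) | ((s & 1) << (n - 1))
        hi = s >> 1
    return (hi << n) | lo


print(shift_right_multiply(12, 5, 4))
```

Because `hi` never exceeds N bits after the shift, the adder stays N bits wide; the carry-out simply shifts into the top of the register.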

Combinational Multiplier: Idea
Use an array of AND gates to generate all the partial products in parallel.
[Figure: multiplicand and multiplier bits (LSB on the right) feeding an array of AND gates] [©Hauck]

Combinational Multiplier: Adding PProds
[Figure: 4×4 array multiplier built from half adders (HA) and full adders (FA); multiplicand bits X3..X0, multiplier bits Y3..Y0, product bits Z7..Z0] [WE92] p547 [Rab96] p.409
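A bit-level Python model of adding the partial products with adder cells. It is a simplified row-by-row ripple-carry array, not necessarily the exact cell arrangement in the figure; a half adder is modeled as a full adder with one input tied to 0. The names `full_adder` and `array_multiply` are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One FA cell; an HA is the same cell with c = 0."""
    s = a ^ b ^ c
    cout = (a & b) | (a & c) | (b & c)
    return s, cout


def array_multiply(x: int, y: int, n: int) -> int:
    """Unsigned n x n array multiplier, modeled row by row with rippling carries."""
    acc = [0] * (2 * n)              # running sum bits, LSB first
    for i in range(n):               # one adder row per multiplier bit y_i
        y_i = (y >> i) & 1
        carry = 0
        for j in range(i, 2 * n):    # the carry ripples across the row
            x_bit = (x >> (j - i)) & 1 if j - i < n else 0
            pp = x_bit & y_i         # AND gate
            acc[j], carry = full_adder(acc[j], pp, carry)
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(0b1100, 0b0101, 4))
```

The final carry out of each row is always 0 here, since every running sum of partial products fits in 2N bits.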

Combinational Multiplier: Critical Path(s)
There are many critical paths, all with the same delay (AND gates not shown).
[Figure: M×N array of HA/FA cells with two highlighted critical paths] M = number of multiplier bits, N = number of multiplicand bits.
Delay = (M+N−2)·t_carry + (N−1)·t_sum + t_AND
Does it make sense to use faster adders for the last row? No! You have to shorten ALL the critical paths simultaneously. [Rab96] p.410

Combinational Multiplier: Layout
Better floorplan for a compact layout: send the partial products diagonally, which results in better area. (The AND gates, and hence the first row, are not shown.)
[Figure: HA/FA array with diagonal partial-product routing] M = number of multiplier bits, N = number of multiplicand bits. [WE92] p548 [Rab96] p.412

Carry-Save Adder: the Idea
When adding k n-bit numbers, we don't need to optimize the carry chain of each row. Below is the old-style ripple-adder array.
[Figure: rows of HA/FA cells with rippling carries]

Carry-Save Adder: structure
Postpone the carry-propagation operation to the last stage.
Delay = N·t_carry + t_AND + t_merge
[Figure: carry-save array of HA/FA cells followed by a vector-merging row] [Rab96] p.411
Note how different combinations of HAs and FAs are used. Kia: we can get rid of the MSB HAs, right? They have only one input, which is basically a wire. If we remove them, we use the same number of blocks as in the ripple-adder array.
Catch? An additional row is needed to merge the carry and sum vectors (the vector-merging stage). Now we can optimize this last (merging) row using faster adder structures.
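A Python sketch of carry-save accumulation, assuming unsigned operands: a CSA row is a set of independent full adders that outputs a sum vector and a carry vector (shifted one column left), so no carry propagates inside the row. Only the final vector-merging addition propagates carries. The names `csa` and `carry_save_multiply` are hypothetical.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """One row of full adders with no carry chain: a + b + c == sum + carry."""
    s = a ^ b ^ c                                   # sum bit of every FA
    carry = ((a & b) | (a & c) | (b & c)) << 1      # carries move one column left
    return s, carry


def carry_save_multiply(x: int, y: int, n: int) -> int:
    """Unsigned n-bit multiply: CSA rows, then one vector-merging add."""
    pps = [(x if (y >> i) & 1 else 0) << i for i in range(n)]
    s, c = pps[0], 0
    for pp in pps[1:]:      # each row takes the previous sum and carry vectors
        s, c = csa(s, c, pp)
    return s + c            # vector-merging stage: the only carry propagation


print(carry_save_multiply(12, 5, 4))
```

In hardware the final `s + c` is the place to use a fast adder (e.g., a carry-lookahead adder), since it is the only stage with a long carry chain.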

Carry-Save Adder: Details
[Figure: carry-save array layout of half-adder (H) and full-adder (F) cells]

CSA: Intermediate FA Cells
It is better to have equal sum and carry delays, since both contribute to the critical path. The structure uses transmission gates extensively.
[Figure: transmission-gate full adder with inputs A, B, Ci, propagate signal P, outputs S and Co, and a setup stage] [Rab96] p.410

Booth Multiplier: an Introduction
Recode each 1 in the multiplier as “+2 −1”. This converts a sequence of 1s into 10…0(−1), which might reduce the number of 1s. It is based on the idea that 1111 can be rewritten as +1 0 0 0 −1 (16 − 1 = 15).
Does this recoding help us speed up the “Sequential Multiplier”? No! The datapath still goes through the adder. It is useful in constant multiplication, or in asynchronous multipliers.
The real benefit comes when we group the digits into pairs, as we will see in “Modified Booth Encoding”.
When does this encoding reduce the number of 1s?
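A minimal Python sketch of radix-2 Booth recoding using the rule d_i = y_(i−1) − y_i with y_(−1) = 0: scanning from the LSB, the start of a run of 1s becomes −1 and the position just past its end becomes +1. For an unsigned multiplier, a 0 is added on the left (one extra digit). The helper names are hypothetical.

```python
def booth_recode(y: int, width: int) -> list[int]:
    """Radix-2 Booth digits of y, LSB first: d_i = y_(i-1) - y_i, with y_(-1) = 0."""
    digits = []
    prev = 0
    for i in range(width):
        bit = (y >> i) & 1
        digits.append(prev - bit)   # start of a run of 1s: -1, just past its end: +1
        prev = bit
    return digits


def digits_value(digits: list[int]) -> int:
    """Weighted sum of signed digits, LSB first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


d = booth_recode(0b01111, 5)        # 15, with a 0 added on the left
print(d[::-1], digits_value(d))     # MSB first
```

This prints `[1, 0, 0, 0, -1] 15`, the +1 0 0 0 −1 form of 1111.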

Booth Multiplier: Recoding (Encoding) Example
[Example: each 1 of the multiplier is written as a (+1 −1) pair, and the pairs are summed column by column into the recoded last row; the binary digits did not survive extraction.]
If you use the last row in the multiplication, you should get exactly the same result as using the first row (after all, they represent the same number!). The fact that I have written the (+1 −1) pairs in three rows doesn't mean anything; I just couldn't write them all in a single row.

Booth Recoding: Multiplication Example
[Example: multiplication by (−6) with sign-extended partial products; the bit patterns did not survive extraction.]
We first recode the multiplier (14), then multiply the multiplicand by the recoded number. We must sign-extend so that the partial products add up correctly: whenever we add a k-bit negative number to an L-bit number (L > k), we first convert the k-bit representation to an L-bit representation (replicating the sign bit into the higher bit positions does it) and then perform the addition.
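A Python sketch of Booth multiplication with explicit sign extension. As an assumption matching the numbers visible on the slide, the multiplicand is the signed 4-bit value −6 and the multiplier is the unsigned value 14 (recoded with an extra 0 bit). Masking a negative partial product to the accumulator width is exactly its sign-extended two's-complement pattern. The function names and the accumulator width are hypothetical.

```python
def booth_recode(y: int, width: int) -> list[int]:
    """Radix-2 Booth digits of y, LSB first: d_i = y_(i-1) - y_i, with y_(-1) = 0."""
    digits, prev = [], 0
    for i in range(width):
        bit = (y >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


def booth_multiply(mcand: int, mplier: int, n: int) -> int:
    """Signed n-bit multiplicand times unsigned n-bit multiplier via Booth digits."""
    width = 2 * n + 2                  # assumed accumulator width, wide enough
    mask = (1 << width) - 1
    acc = 0
    for i, d in enumerate(booth_recode(mplier, n + 1)):  # extra 0 bit on the left
        pp = (d * mcand) << i          # d in {-1, 0, +1}: negate, zero, or pass
        acc = (acc + (pp & mask)) & mask   # pp & mask: pp sign-extended to width bits
    return acc - (1 << width) if acc >> (width - 1) else acc  # read as signed


print(booth_multiply(-6, 14, 4))
```

Without the sign extension (e.g., keeping only n bits of each negative partial product), the column sums would be wrong whenever a partial product is negative.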

Booth Recoding: Advantages and Disadvantages
It depends on the architecture.
Potential advantage: it might reduce the number of 1s in the multiplier.
In the multipliers that we have seen so far, it:
- Doesn't save time (we still have to wait for the critical path, e.g., the shift-add delay in the sequential multiplier)
- Increases area (recoding circuitry and subtraction)

Modified Booth Multiplier: Idea
Booth recoding gets rid of 3s (sequences of 1s in general). Grouping the recoded digits in pairs leaves only −2, −1, 0, 1, 2, and halves the number of partial products.
[Example: (+1 −1) pairs as in the Booth recoding example; the bit patterns did not survive extraction.] [©Hauck]

Modified Booth Multiplier: Idea (cont.)
We can encode the digits by looking at three bits at a time.
Booth recoding table:
Bit i+1  Bit i  Bit i−1 | Add
0        0      0       | 0·M
0        0      1       | +1·M
0        1      0       | +1·M
0        1      1       | +2·M
1        0      0       | −2·M
1        0      1       | −1·M
1        1      0       | −1·M
1        1      1       | 0·M
We must be able to add the multiplicand times −2, −1, 0, 1, and 2. Since Booth recoding got rid of 3s, generating the partial products is not that hard (shifting and negating). [©Hauck]

Modified Booth Multiplier: Idea (cont.)
Interpretation of the Booth recoding table:
Bit i+1  Bit i  Bit i−1 | Add   | Explanation
0        0      0       | 0·M   | No string of 1s in sight
0        0      1       | +1·M  | End of a string of 1s
0        1      0       | +1·M  | Isolated 1
0        1      1       | +2·M  | End of a string of 1s
1        0      0       | −2·M  | Beginning of a string of 1s
1        0      1       | −1·M  | End one string, begin a new one
1        1      0       | −1·M  | Beginning of a string of 1s
1        1      1       | 0·M   | Continuation of a string of 1s
[Par00] p. 160
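The recoding table can be expressed as a lookup keyed on the three bits (y_(i+1), y_i, y_(i−1)); every entry equals −2·y_(i+1) + y_i + y_(i−1). This Python sketch assumes an even number of bits n and treats y as an n-bit two's-complement pattern with y_(−1) = 0; for an unsigned multiplier, pad with leading zeros to an even width of at least n+1. The names are hypothetical.

```python
# Modified Booth table: (y_(i+1), y_i, y_(i-1)) -> multiple of M to add
BOOTH4 = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth4_recode(y: int, n: int) -> list[int]:
    """Radix-4 digits (LSB first) of the n-bit two's-complement pattern y; n even."""
    def bit(k: int) -> int:
        return (y >> k) & 1 if k >= 0 else 0   # y_(-1) = 0

    return [BOOTH4[(bit(i + 1), bit(i), bit(i - 1))] for i in range(0, n, 2)]


def radix4_value(digits: list[int]) -> int:
    """Weighted sum of radix-4 digits, LSB first."""
    return sum(d * 4**j for j, d in enumerate(digits))


print(booth4_recode(0b0111, 4), radix4_value(booth4_recode(0b0111, 4)))
```

For 0111 (7) this prints `[-1, 2] 7`: two digits replace four bits, i.e., half as many partial products.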

(Modified) Booth Multiplier: Example
Retire two bits per shift operation.
Addition is signed: sign-extend by 2 bits if adding two partial products at a time.
[Example: multiplication with recoded digits −1 and −2, using the Booth recoding table; the bit patterns did not survive extraction.]
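A Python sketch of a radix-4 (modified Booth) multiplier that retires two multiplier bits per step, assuming both operands are n-bit two's-complement numbers with n even. Each partial product is 0, ±M, or ±2M (a shift), and it is sign-extended to 2N bits by masking before it is added. The function name is hypothetical.

```python
def booth4_multiply(mcand: int, mplier: int, n: int) -> int:
    """Signed n-bit x signed n-bit (n even), retiring two multiplier bits per step."""
    width = 2 * n
    mask = (1 << width) - 1

    def bit(k: int) -> int:
        return (mplier >> k) & 1 if k >= 0 else 0   # y_(-1) = 0

    acc = 0
    for j, i in enumerate(range(0, n, 2)):
        d = -2 * bit(i + 1) + bit(i) + bit(i - 1)   # same values as the recoding table
        pp = (d * mcand) << (2 * j)                 # 0, +/-M, or +/-2M (one extra shift)
        acc = (acc + (pp & mask)) & mask            # pp sign-extended to 2N bits
    return acc - (1 << width) if acc >> (width - 1) else acc


print(booth4_multiply(5, 0b1101, 4))   # 5 x (-3)
```

Only n/2 partial products are generated, and no multiplication by 3 is ever needed.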

Modified Booth Recoding: Summary
Grouping multiplier bits into pairs:
- An idea orthogonal to Booth recoding
- Reduces the number of partial products by half
- If Booth recoding is not used → we have to be able to multiply by 3 (hard: shift + add)
Applying the grouping idea to Booth → Modified Booth Recoding (Encoding):
- We already got rid of sequences of 1s → no multiplication by 3
- Just negate, and shift once or twice

Modified Booth Multiplier: Summary (cont.)
Uses a high radix to reduce the number of intermediate addition operands.
Can go higher: radix-8, radix-16.
- Radix-8 must implement ×3, ×−3, ×4, ×−4 (in addition to 0, ±1, ±2)
- Recoding and partial-product generation become more complex
Automatically takes care of signed multiplication (we will see why).

Pipelined Multipliers
- Insert registers (latches) between rows
- Insert registers for the bits of the multiplier
- Schedule the MSBs to arrive later
[Figure: array of HA/FA cells with pipeline registers between rows]
Is the timing EXACTLY the same as in the combinational multiplier? No, the latency is different. In the combinational multiplier, the first bit of the second row of adders could start as soon as the 2nd bit of the first row was done (a lot of parallelism, overlapping the carry propagation of different rows). With this simple pipelined architecture, you can't do that.

Pipelined Multiplier: Example
[Figure: 5×5 pipelined array multiplier; multiplicand bits a4..a0, multiplier bits x0..x4, product bits p9..p0; each cell is an FA with an AND gate and latches (for a_i, the intermediate sum, and the carry), with the sum/carry paths marked] [Par00] p186 [© Oxford U Press]
Is this architecture any different from the previous one? Can carry propagations in different rows overlap? Yes. This one has smaller latency, but more area.

Wallace Tree: Idea
Idea: divide and conquer. Why add the k numbers one by one? A tree structure → logarithmic delay.
For now, let's assume we are going to add seven 6-bit numbers, which are NOT partial products and hence not shifted. What's the fastest way to add them? [Par00] p131

Wallace Tree Example
- Circles represent digits; boxes show FAs
- Diagonal lines correspond to the (sum, carry) pairs generated by the FA cells in the previous stage (same color)
- The dotted box is some carry-propagate adder (e.g., a CLA)
Delay = 4 CSA + 1 CLA [Par00] p130 [© Oxford U Press]

Wallace Tree: Structure for 7 k-bit Numbers
[Figure: tree of k-bit CSAs reducing seven k-bit operands to two, with the bit range of each sum and carry vector labeled (e.g., [0,k−1], [1,k], [2,k+1]), followed by a k-bit CPA that produces the final sum bits [0, k+2]] [Par00] p131
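A Python sketch of Wallace-style reduction: at each level, every group of three operands goes through one CSA (3 in, 2 out) in parallel, and leftover operands pass to the next level, until two remain for a single carry-propagate add. The greedy grouping is an assumption and may differ from the exact wiring in the figure, but for seven operands it gives the same 4 CSA levels (7 → 5 → 4 → 3 → 2). The names are hypothetical.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 reduction: a + b + c == sum + carry, with no carry propagation."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_sum(operands: list[int]) -> tuple[int, int]:
    """Add operands with levels of parallel CSAs plus one CPA; returns (sum, levels)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        groups = len(ops) // 3 * 3
        nxt = []
        for k in range(0, groups, 3):   # each group of three feeds one CSA
            nxt.extend(csa(*ops[k:k + 3]))
        nxt.extend(ops[groups:])        # 0-2 leftover operands wait a level
        ops = nxt
        levels += 1
    return sum(ops), levels             # final carry-propagate adder (e.g., a CLA)


print(wallace_sum([63, 1, 2, 3, 4, 5, 6]))   # seven 6-bit numbers
```

This prints `(84, 4)`: four CSA levels followed by one carry-propagate addition.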

Wallace Tree: Timing
At each step, the number of operands reduces to 2/3 of its previous value:
n k-bit numbers → (2/3)·n numbers → (2/3)^2·n numbers → ... → after h levels, (2/3)^h·n = 2
[Figure: h levels of CSAs]

Wallace Tree: Timing (cont.)
Delay depends on the height h: h = O(log n) → logarithmic delay.
Maximum number N of k-bit numbers that can be added using a Wallace tree of height h:
h   N    |  h   N    |  h   N
0   2    |  7   28   |  14  474
1   3    |  8   42   |  15  711
2   4    |  9   63   |  16  1066
3   6    |  10  94   |  17  1599
4   9    |  11  141  |  18  2398
5   13   |  12  211  |  19  3597
6   19   |  13  316  |  20  5395
[Par00] p132 [© Oxford U Press]
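The table can be regenerated from a simple recurrence (our own derivation, not stated on the slide): one CSA level turns N operands into ⌈2N/3⌉, so the largest N that reaches a count M after one level is ⌊3M/2⌋, starting from 2 operands at height 0. The function name is hypothetical.

```python
def max_operands(h: int) -> int:
    """Largest operand count a Wallace tree of height h can reduce to two."""
    n = 2                  # h = 0: two operands go straight to the carry-propagate adder
    for _ in range(h):
        # A CSA level maps N operands to ceil(2N/3); the largest N that
        # reaches the previous count M is floor(3M/2)
        n = 3 * n // 2
    return n


print([max_operands(h) for h in range(21)])
```

The printed list matches the table: 2, 3, 4, 6, 9, 13, 19, 28, ..., 5395.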

Multiplying Signed Numbers
Coding of the numbers:
- Signed-magnitude → trivial
- 2's complement?
2's complement, multiplier positive, multiplicand +/−: sign-extend the partial products when adding them up.
[Example: the bit patterns did not survive extraction.]
How would you do signed multiplication of sign-magnitude numbers?
Do we need all the sign-extension bits, or is one bit enough? It depends on the adder bit width that we use to add up the partial products.
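A Python sketch of the positive-multiplier case, assuming an n-bit two's-complement multiplicand of either sign and a positive n-bit multiplier. Each selected partial product is sign-extended to the full 2N-bit adder width before it is added; in Python, masking a negative value produces exactly that sign-extended bit pattern. The function name is hypothetical.

```python
def signed_mcand_multiply(mcand: int, mplier: int, n: int) -> int:
    """Two's-complement n-bit multiplicand (either sign) times a positive multiplier."""
    width = 2 * n
    mask = (1 << width) - 1
    acc = 0
    for i in range(n):
        if (mplier >> i) & 1:
            # Masking the (possibly negative) shifted multiplicand gives its
            # two's-complement pattern sign-extended to `width` bits
            acc = (acc + ((mcand << i) & mask)) & mask
    return acc - (1 << width) if acc >> (width - 1) else acc


print(signed_mcand_multiply(-3, 3, 4))
```

Here the partial products of −3 are 11111101 and 11111010 (sign-extended to 8 bits), and their sum reads as −9.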

Multiplying Signed Numbers (cont.)
2's complement (cont.), multiplier negative, multiplicand +/−:
Ad-hoc solution: convert the negative multiplier to positive, do the multiplication, then negate the result.
[Example: the bit patterns did not survive extraction.]
Negating a number takes an N-bit addition time (a 2N-bit addition for the product).

Multiplying Signed Numbers: Efficient Method
Using almost the same architecture, we can do signed multiplication without negating the result.
Idea: “What if we had negated the multiplier?”
Consider α and β as positive magnitudes (forget about the 2's complement convention for now): the k-bit multiplier is 1α (MSB 1 followed by the k−1 bits of α), and its negation is 0β.
We want to use the computation α·M.
Previously, we negated 1α to get 0β, then computed β·M and negated it.
Example: M = +5, multiplier = 1101 = −3 (so α = 101); negating gives 0011 = +3 (so β = 011).

Multiplying Signed Numbers: Efficient Method
The negation process (k-bit numbers, read as unsigned):
1α = 2^(k−1) + α
0β = 2^k − 1α = 2^k − (2^(k−1) + α) = (2^k − 2^(k−1)) − α = 2^(k−1) − α
In the last step, we can guarantee that the MSB of 0β is actually 0: since 2^(k−1) > α, the resulting 2^(k−1) − α is positive (the exception is α = 0, i.e., the most negative number −2^(k−1), whose negation does not fit in k bits). The final result is valid ONLY if the MSB of the original number 1α is 1.
Hence β = 2^(k−1) − α.

Machine's understanding vs. our interpretation of the k-bit multiplier 1α:
β = 2^(k−1) − α, and 1α = −β
We used to compute −(β·M). Instead:
−β·M = −(2^(k−1) − α)·M = −2^(k−1)·M + α·M
- Normal multiplication for the first k−1 bits (α·M)
- Subtract the multiplicand for the last bit (weight 2^(k−1))

Multiplying Signed Numbers: Example
[Example: multiplication by (−5); the bit patterns did not survive extraction.]
- Normal multiplication for the first k−1 bits
- Use a subtractor for the last partial product
We still have to do sign extension (here, since the multiplicand is +5, I haven't shown the zeros of the sign extension).
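A Python sketch of the efficient method, assuming a k-bit two's-complement multiplier pattern and a signed multiplicand: the first k−1 bits are handled by the normal shift-and-add, and the MSB, whose weight is −2^(k−1), triggers a subtraction of the shifted multiplicand. Python integers are unbounded, so sign extension is implicit here; a hardware datapath still has to sign-extend the partial products. The function name is hypothetical.

```python
def signed_multiply(mcand: int, mplier: int, k: int) -> int:
    """k-bit two's-complement multiplier pattern times a signed multiplicand."""
    acc = 0
    for i in range(k - 1):              # normal multiplication for the first k-1 bits
        if (mplier >> i) & 1:
            acc += mcand << i
    if (mplier >> (k - 1)) & 1:         # the MSB carries weight -2^(k-1):
        acc -= mcand << (k - 1)         # subtract the multiplicand for the last bit
    return acc


print(signed_multiply(5, 0b1011, 4))    # +5 x (-5)
```

For +5 × 1011, the first three bits contribute 5 + 10 = 15 and the MSB subtracts 40, giving −25 with no final negation.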

Booth Recoding: Signed Numbers
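For unsigned numbers, increase the bit width of the multiplier and multiplicand (add a 0 to the left). If dealing with signed numbers, discard the extra bit.
Why does it work? Let α be the k-bit multiplier pattern read as an unsigned number, with MSB 1 (a negative multiplier). Without the extra 0 bit, the Booth digits represent α − 2^k, so
M·(α − 2^k) = −M·(2^k − α) = −M·β
where β = 2^k − α is the positive 2's complement of α.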
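A quick numerical check of this argument in Python: for k-bit patterns with MSB 1, recoding with the extra 0 bit reproduces the unsigned value α, while discarding it yields α − 2^k, the two's-complement (signed) value. The helper names are hypothetical.

```python
def booth_digits(y: int, width: int) -> list[int]:
    """Radix-2 Booth digits (LSB first): d_i = y_(i-1) - y_i, with y_(-1) = 0."""
    return [((y >> (i - 1)) & 1 if i > 0 else 0) - ((y >> i) & 1) for i in range(width)]


def digits_value(digits: list[int]) -> int:
    """Weighted sum of signed digits, LSB first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


k = 4
for alpha in (0b1011, 0b1110):                              # k-bit patterns, MSB = 1
    unsigned = digits_value(booth_digits(alpha, k + 1))     # extra 0 bit kept
    signed = digits_value(booth_digits(alpha, k))           # extra bit discarded
    print(alpha, unsigned, signed, alpha - (1 << k))
```

This prints `11 11 -5 -5` and `14 14 -2 -2`: dropping the extra digit turns α into α − 2^k = −β.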

Booth Recoding: Signed Mult Example
[Example: a signed multiplication involving (+10); the bit patterns did not survive extraction.]
Note: when calculating by hand, a column containing ‘1111’ sums to 4 = 100 in binary, so it generates a carry of ‘10’ (i.e., 2) into the next column.

Multiplier: Summary
Goals differ from those of addition:
- In some structures, the sum and carry delays are equal
- Analysis is more difficult: multiple critical paths
Different levels of optimization:
- Data encoding: Booth
- Architecture level: Wallace tree
- Gate level: pipelining
- Transistor level: equal sum and carry delays
More to cover:
- Constant multiplication
- Floating point, precision

Shift and Rotate Operations
Used in: microprocessors, encryption algorithms.
For a fixed shift, simply wire the inputs to the correct output positions.
Variable shift:
- One-bit shifter
- Barrel shifter
- Logarithmic shifter

One-bit Shifter
[Figure: bit-slice i of a one-bit shifter; the Right, NOP, and Left controls select among inputs A_i, A_(i−1) to drive outputs B_i, B_(i−1)] [©Prentice Hall]

Simple n-bit Shifter
Quadratic number of transistors: one switch per input-output path.
[Figure: inputs in1..in4 connected to outputs out1..out4 through a switch array] [©Hauck]

Barrel Shifter
[Figure: 4-bit barrel shifter with data inputs A3..A0, outputs B3..B0, and control wires Sh0..Sh3; bit 3 is wrapped around]
Area is dominated by wiring.
This structure does “sign extension”, so it is an arithmetic shift (division by a power of two): look how A3 is used to fill the missing bits. [©Prentice Hall]
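A behavioral Python sketch of the barrel shifter described above, assuming a one-hot control [Sh0, Sh1, ...]: each output B_i passes through a single switch from A_(i+s), and positions beyond the top are filled with A_(n−1), which makes it an arithmetic right shift. The function and helper names are hypothetical.

```python
def barrel_shift_right(a_bits: list[int], sh: list[int]) -> list[int]:
    """Arithmetic right shift; a_bits[i] is A_i (LSB first), sh is one-hot [Sh0, ...]."""
    n = len(a_bits)
    s = sh.index(1)   # exactly one control wire is asserted
    # Output B_i connects through one switch to A_(i+s); positions past the
    # top are filled with A_(n-1), which is the sign extension
    return [a_bits[i + s] if i + s < n else a_bits[n - 1] for i in range(n)]


def to_bits(v: int, n: int) -> list[int]:
    """Two's-complement bits of v, LSB first."""
    return [(v >> i) & 1 for i in range(n)]


def from_bits_signed(bits: list[int]) -> int:
    """Read LSB-first bits as a two's-complement number."""
    u = sum(b << i for i, b in enumerate(bits))
    return u - (1 << len(bits)) if bits[-1] else u


print(from_bits_signed(barrel_shift_right(to_bits(-6, 4), [0, 1, 0, 0])))  # -6 >> 1
```

Any shift amount goes through exactly one switch, which is why the delay is O(1), at the cost of n² switches and heavy wiring.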

Logarithmic Shifter
Simplified structure, but more stages (greater delay).
[Figure: two-stage logarithmic shifter with inputs i1..i4, outputs o1..o4, and stage controls S1/S1' and S2/S2']
The structure extends to more levels: the next level would shift the lines by 4 bit positions, the next one by 8, etc. [©Hauck]
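A behavioral Python sketch of the logarithmic shifter, assuming a logical right shift with a configurable fill bit: stage j either shifts every line by 2^j (its S control asserted) or passes it straight through (S'), so ⌈log2 n⌉ stages cover every shift amount from 0 to n−1. The function and helper names are hypothetical.

```python
def log_shift_right(a_bits: list[int], amount: int, fill: int = 0) -> list[int]:
    """Logarithmic shifter: stage j shifts by 2**j when bit j of `amount` is 1."""
    bits = list(a_bits)            # bits[i] is input i, LSB first
    n = len(bits)
    stage = 0
    while (1 << stage) < n:        # ceil(log2 n) stages: shift by 1, 2, 4, ...
        step = 1 << stage
        if (amount >> stage) & 1:  # S asserted: take the line `step` positions over
            bits = [bits[i + step] if i + step < n else fill for i in range(n)]
        # Otherwise S' passes each line straight through
        stage += 1
    return bits


def to_bits(v: int, n: int) -> list[int]:
    """Unsigned bits of v, LSB first."""
    return [(v >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    """Read LSB-first bits as an unsigned number."""
    return sum(b << i for i, b in enumerate(bits))


print(from_bits(log_shift_right(to_bits(0b1101, 4), 3)))  # 13 >> 3
```

Each stage uses about n switches, giving roughly n log n switches overall and a delay that grows with the number of stages.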

Shift: Summary
Trade-off between area and delay:
- Barrel shifter: fastest, O(1) delay, n² transistors; a wire-dominated circuit
- Logarithmic shifter: O(log n) delay, n log n transistors
- One-bit shifter: O(n) delay, n transistors
