Published by Mitchell Garrett. Modified over 6 years ago.
1
CSE 575 Computer Arithmetic Spring 2003 Mary Jane Irwin (www. cse. psu
2
* and / Considerations
It is possible to build really fast multipliers (Wallace tree: 2 log n with a fast CPA) and "sort of" fast dividers (base 4 SRT: n/2 add (CSA) cycles), at the cost of silicon area and energy.
What if area (and energy) are more important metrics than performance?
3
Array Multipliers & Dividers
Slow, but:
Very regular structure
Use only short wires to nearest-neighbor cells
Thus, very simple and efficient layout in VLSI
Can be easily and efficiently pipelined
4
Multiply Review
Right shift and add (serial) integer multiplication
Partial products are accumulated from the top
Only requires an n-bit adder
[Figure: the n-bit multiplicand D and the n-bit multiplier Q feed the partial product array, which produces the 2n-bit double-precision product P.]
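The serial scheme reviewed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: unsigned operands, and a hypothetical function name `shift_add_multiply` that does not come from the slides. It keeps the upper half of the product in `p` and shifts completed low-order bits out into `low`, so each addition is only n bits wide.

```python
def shift_add_multiply(d: int, q: int, n: int) -> int:
    """Multiply two unsigned n-bit integers with a right-shift-and-add loop.

    d is the multiplicand, q the multiplier; the result is the 2n-bit product.
    (Illustrative sketch; names are not taken from the slides.)
    """
    assert 0 <= d < (1 << n) and 0 <= q < (1 << n)
    p = 0    # upper part of the running product (needs n + 1 bits before the shift)
    low = 0  # lower part of the product, filled from the top as bits shift out
    for i in range(n):
        if (q >> i) & 1:
            p += d                      # the only adder needed is n bits wide
        # Shift the (p, low) pair right by one bit.
        low = (low >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | low
```

For example, `shift_add_multiply(13, 11, 4)` returns 143.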
5
Example Array Multiplier
[Figure: 4 x 4 array of cells M00 to M33. Row i is driven by multiplier bit qi and the columns take multiplicand bits d3 to d0 (d0 is the lsb). Rows 0 to 2 deliver p0 to p2 and the last row delivers p3 to p7.]
Shifts come from the correct positioning of the cells
Need O(n**2) cells (signed will need more!)
Delay increases linearly with operand length, as shown on the next slide
Note that the first row doesn't really do any work: it just adds in zeros, so it can be reduced to AND gates
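A bit-level sketch may help relate the array to the arithmetic. It assumes unsigned operands and models every cell as a full adder whose second input is the AND of dj and qi; the helper names (`full_adder`, `array_multiply`) are hypothetical. The shift between rows is done purely by re-indexing the row outputs, mirroring the "shifts from positioning" note.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(d: int, q: int, n: int) -> int:
    """Unsigned n x n multiply modelled as n rows of n ripple-carry cells."""
    dbits = [(d >> j) & 1 for j in range(n)]
    qbits = [(q >> i) & 1 for i in range(n)]
    sums = [0] * n          # sum inputs to the current row (all zero for row 0)
    product = []            # product bits, lsb first
    for i in range(n):
        carry = 0
        out = []
        for j in range(n):  # carry ripples across the row, lsb cell first
            s, carry = full_adder(sums[j], dbits[j] & qbits[i], carry)
            out.append(s)
        product.append(out[0])    # lsb of the row leaves the array as p_i
        sums = out[1:] + [carry]  # remaining bits move one column over
    product.extend(sums)          # last row supplies p_n .. p_(2n-1)
    return sum(bit << k for k, bit in enumerate(product))
```

Row 0 here still goes through full adders with zero sum inputs, which is exactly the wasted work the slide points out.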
6
Square Array Multiplier
[Figure: the same array drawn as a square, with inputs d0 to d3 and q0 to q3, outputs p0 to p7 (lsb marked), and the carry and sum connections between cells labelled.]
7
Array Multiplier Delay
[Figure: the array with inputs q0 to q3 and outputs p0 to p7.]
Longest delay path: 2n + n - 2 = 3n - 2
For lecture: also notice the computational wavefronts of signal glitching (the back diagonals, in green), so the array is probably not energy efficient
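The 3n - 2 figure can be checked with a simple arrival-time model. The sketch below is an assumption-laden illustration, not the slide's own derivation: every cell costs one unit of delay, the first row is counted as a full ripple row, and the leftmost cell of each row takes its sum input from the carry out of the row above. The function name `ripple_array_delay` is hypothetical.

```python
def ripple_array_delay(n: int) -> int:
    """Unit-delay estimate of the longest path through an n x n ripple array."""
    # t[i][j] is the time at which cell (i, j) has valid outputs.
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            carry_in = t[i][j - 1] if j > 0 else 0
            if i == 0:
                sum_in = 0                    # first row adds in zeros
            elif j < n - 1:
                sum_in = t[i - 1][j + 1]      # shifted sum from the row above
            else:
                sum_in = t[i - 1][n - 1]      # carry out of the row above
            t[i][j] = 1 + max(carry_in, sum_in)
    return t[n - 1][n - 1]
```

Under these assumptions the result equals 3n - 2 for every n, for instance 10 cell delays when n = 4.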
8
Multiplier Cell Structure
[Figure: multiplier cell. A full adder (FA) with sum input, carry in, sum output and carry out; inputs dj and qi; delay elements labelled 2D and 1D on the input lines.]
For lecture: want to design the cells so that tsum ~ tcarry for delay balancing. Add extra delays to the qi and dj lines to complete the delay balancing (Leap Frog multiplier), so that all inputs to a multiplier cell arrive simultaneously. The top row, left column and right column have to be treated as special cases.
9
Identical Delays for Carry and Sum
Delay Balanced FA
[Figure: 22-transistor full adder schematic with inputs x, y and cin, internal signals p and !p, and outputs s and !cout, divided into signal set-up, sum generation and carry generation sections.]
Want balanced delays from the inputs to both the sum and carry outputs to minimize glitching. But notice that !cout is produced: does the inverter needed to form cout spoil the balance?
10
Pipelined Array Multiplier
[Figure: the 4 x 4 array of cells M00 to M33 with pipeline registers clocked by clk; inputs d0 to d3 and q0 to q3, outputs p0 to p7.]
Time between clks is the ripple add time across one row
Is there a faster way?
11
Array Multiplier with Recoding
[Figure: recoding array with CTRL blocks, multiplier inputs q0 to q3 and product outputs p0 to p7; cells M10, M20, M21, M30, M31 and M32 are labelled.]
CTRL does differentiating recoding of the multiplier
Note: now shifting right rather than left
Recode to take care of a negative multiplier, and sign extend to accommodate a negative multiplicand
2n - 1 by n cells with a worst-case delay path of 3n - 2 ? (or is it 2n - 1?)
Can pipeline just like the previous unsigned scheme (with an increase in delay per row, since have to wait for the worst-case timing, exhibited by the last-row carry ripple of 2n - 1 cells)
12
Multiplier Cell Structure
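[Figure: recoding multiplier cell. A full adder (FA) with sum input, carry in, sum output and carry out; inputs dj and q'i; control signals Z and A/S.]
Operations selected by Z and A/S:
zero: previous partial product + 0
add: previous partial product + D
subt: previous partial product - D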
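The zero / add / subtract choice per row can be illustrated with radix-2 Booth recoding. That specific recoding is an assumption here: the slides only say that CTRL recodes the multiplier, so treat this as one plausible instance rather than the scheme actually drawn. The names `booth_recode` and `recoded_multiply` are hypothetical, and operands are two's complement values in the range -2**(n-1) to 2**(n-1) - 1.

```python
def booth_recode(q: int, n: int) -> list[int]:
    """Recode an n-bit two's complement multiplier into digits -1, 0, +1 (lsb first).

    Digit i is q[i-1] - q[i], with q[-1] taken as 0 (radix-2 Booth, assumed here).
    """
    digits = []
    prev = 0
    for i in range(n):
        bit = (q >> i) & 1     # Python's >> is arithmetic, so this also works for q < 0
        digits.append(prev - bit)
        prev = bit
    return digits


def recoded_multiply(d: int, q: int, n: int) -> int:
    """Signed multiply: each recoded digit selects zero, add D or subtract D."""
    p = 0
    for i, digit in enumerate(booth_recode(q, n)):
        if digit == 1:         # add: previous partial product + D
            p += d << i
        elif digit == -1:      # subt: previous partial product - D
            p -= d << i
        # digit == 0 is the zero case: previous partial product + 0
    return p
```

Because the digits already carry the sign of the multiplier, no separate correction step is needed for a negative multiplier; a negative multiplicand is handled here by Python's signed integers, where hardware would sign extend.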
13
CSA Array Multiplier
[Figure: array of carry-save cells M00, M01, ... with inputs d0 to d3 and q0 to q3 and outputs p0 to p7. Each CSA cell takes dj, qi, a sum input and a carry input, and produces a sum output and a carry output.]
Delay is still linear, but less!
Once again, the first row doesn't do any real work: it just forms the first row of partial product terms (AND gates)
The last row of cells has to propagate the carry, so those cells have a slightly different micro-architecture: they have four things to add together, which can be done with a CSA feeding a CPA
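At word level, the carry-save idea is that each row reduces three numbers to two without propagating a carry, and only one carry-propagate addition is paid at the end. The sketch below assumes unsigned operands and uses hypothetical names (`csa`, `csa_array_multiply`); it models the arithmetic, not the exact cell layout or the special last row described above.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder on whole words.

    Returns (sum_word, carry_word) such that a + b + c == sum_word + carry_word.
    No carry travels between bit positions inside this step.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def csa_array_multiply(d: int, q: int, n: int) -> int:
    """Unsigned multiply: carry-save accumulation followed by one final CPA."""
    s, c = 0, 0
    for i in range(n):
        partial = (d << i) if (q >> i) & 1 else 0   # row of AND gates
        s, c = csa(s, c, partial)                   # one CSA delay per row
    return s + c                                    # carry-propagate add at the end
```

The final `s + c` stands in for the carry ripple that the slide places in the last row of cells.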
14
CSA Array Multiplier
[Figure: the CSA array with inputs q0 to q3 and outputs p0 to p7.]
Delay is still linear, but less! Only have to pay for the carry to ripple across the last row.
Longest delay path: n + n - 1 = 2n - 1
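A unit-delay model reproduces the 2n - 1 count. It assumes one delay unit per cell, n - 1 carry-save rows whose cells depend only on the row above, and a last row of n cells in which the carry also ripples from the neighbouring cell. The function name `csa_array_delay` is hypothetical.

```python
def csa_array_delay(n: int) -> int:
    """Unit-delay estimate of the longest path through an n-row CSA array."""
    # t[i][j] is the time at which cell (i, j) has valid outputs.
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            above = t[i - 1][j] if i > 0 else 0   # sum and carry both come from above
            if i < n - 1:
                t[i][j] = 1 + above               # carry-save row: no horizontal path
            else:
                left = t[i][j - 1] if j > 0 else 0
                t[i][j] = 1 + max(above, left)    # last row ripples its carry
    return t[n - 1][n - 1]
```

For n = 4 this gives 7 cell delays, compared with 10 for the ripple-carry array's 3n - 2.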
15
Pipelined CSA Array Multiplier
[Figure: pipelined CSA array clocked by clk. Rows of cells M00 to M23 take inputs d3 to d0 and q0 to q2 and deliver p0 to p2; the last row, M'30 to M'33 on q3, delivers p3 to p7.]
But what about the delay in the last row? It will set the rate for the clk, so this is no faster than the previous design!
16
Augmented Pipelined CSA Array Multiplier
[Figure: pipelined CSA array of cells M00 to M33, augmented with extra cells M41, M42, M43, M52, M53 and M63; inputs d0 to d3 and q0 to q3, outputs p0 to p7, clocked by clk.]
Now the delay in each row is defined by one CSA time, but latency is increased for the most significant bits of the product
17
Constructing Big Mult’s from Small
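Can synthesize a 2b x 2b multiplier from four b x b multipliers and a three-operand addition operation
[Figure: operands A = (AH, AL) and B = (BH, BL), each half b bits wide. The partial products AL BL, AL BH, AH BL and AH BH are aligned by weight and summed into the 4b-bit product. Width labels in the figure: b bits, 3b bits.]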
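A short sketch shows why only a three-operand addition is needed: AH BH and AL BL occupy disjoint bit ranges of the 4b-bit result, so they can be concatenated into a single operand. The function name `multiply_2b` and the stand-in `small_multiply` (representing one b x b hardware multiplier) are hypothetical; operands are assumed unsigned.

```python
def small_multiply(x: int, y: int) -> int:
    """Stand-in for one b x b hardware multiplier (result fits in 2b bits)."""
    return x * y


def multiply_2b(a: int, b_operand: int, b: int) -> int:
    """Build a 2b x 2b unsigned multiply from four b x b multiplies."""
    mask = (1 << b) - 1
    ah, al = a >> b, a & mask
    bh, bl = b_operand >> b, b_operand & mask

    ll = small_multiply(al, bl)   # weight 2**0
    lh = small_multiply(al, bh)   # weight 2**b
    hl = small_multiply(ah, bl)   # weight 2**b
    hh = small_multiply(ah, bh)   # weight 2**(2b)

    # hh and ll do not overlap, so they form one 4b-bit operand by concatenation.
    outer = (hh << (2 * b)) | ll
    # Three-operand addition: outer + two shifted middle terms.
    return outer + (lh << b) + (hl << b)
```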
18
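Division Operation
Left shift and subtract (serial) fractional division
[Figure: the 2n-bit dividend .P and the n-bit divisor .D feed the partial remainder array (pra), which produces the n-bit quotient .Q and the n-bit remainder .R.]
Operand conditions: P < D and 1/2 <= D < 1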
19
Restoring Array Divider
[Figure: 4-row restoring array with cells R11 to R14, R22 to R25, R33 to R36 and R44 to R47 (lsb). Inputs are the dividend bits p1 to p7 and a constant 1 on each row; outputs are the quotient bits q1 to q4 and the remainder bits r5 to r8.]
Layout resembles the dots in a dot diagram
In each row the difference between the ppr and the divisor is formed (trial subtraction):
if the result is positive, cout = 0 so qi+1 = 1
if the result is negative, cout = 1 so qi+1 = 0 (restoring division)
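The row-by-row behaviour corresponds to the classic restoring algorithm, sketched here on scaled integers. The assumptions: the fractions are represented as integers (a 2n-bit dividend `p` and an n-bit divisor `d`), the condition P < D becomes `p < d * 2**n`, and the normalization 1/2 <= D < 1 is not enforced because the arithmetic does not depend on it. The function name `restoring_divide` is hypothetical.

```python
def restoring_divide(p: int, d: int, n: int) -> tuple[int, int]:
    """Restoring division: returns (q, r) with p == q * d + r and 0 <= r < d.

    p is a 2n-bit dividend, d an n-bit divisor, and p < d * 2**n so that the
    quotient fits in n bits.
    """
    assert 0 < d < (1 << n) and 0 <= p < (d << n)
    r = p
    q = 0
    for i in range(1, n + 1):             # one array row per quotient bit
        trial = r - (d << (n - i))        # trial subtraction of the aligned divisor
        if trial >= 0:
            q = (q << 1) | 1              # result positive: quotient bit is 1
            r = trial                     # keep the difference
        else:
            q = q << 1                    # result negative: quotient bit is 0
            # restore: r keeps the previous partial remainder (the mux choice)
    return q, r
```

For example, `restoring_divide(100, 9, 4)` returns `(11, 1)`, since 100 = 11 * 9 + 1.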
20
Restoring Divider Cell Structure
[Figure: restoring divider cell. A full adder (FA) used as a subtractor cell, with partial remainder input, di, carry in and carry out; a mux drives the partial remainder output.]
The mux selects the previous ppr if qi+1 = 0; otherwise the output of the FA is selected
21
Restoring Array Divider Delay
[Figure: the restoring array with inputs p1 to p7 and a constant 1, outputs q1 to q4 and r5 to r8; cells R25, R35, R36, R45, R46 and R47 are labelled.]
For lecture: need O(n**2) cells and O(n**2) delay, since have to wait for the ripple in each row and for all n rows
Longest delay path: n * n = n**2
22
Pipelined Restoring Array Divider
[Figure: the restoring array of cells R11 to R47 with pipeline registers clocked by clk; inputs p5 to p7 and a constant 1 on each row, outputs q1 to q4 and r5 to r8.]
Pipelining cuts the delay per stage to O(n), defined by the ripple time per row
23
Nonrestoring Array Divider
[Figure: 4-row nonrestoring array with cells R'11 to R'14, R'22 to R'25, R'33 to R'36 and R'44 to R'47; inputs p1 to p7 and a control input of 1 on the first row; outputs q1' to q4' and r5 to r8.]
Same size and ~speed as the restoring array (still O(n**2))
In each row the difference between the ppr and the divisor is formed if the control (top left input) is 1. Note that this input wraps around and sets the carry in (1 on subtract, meaning do the subtract because qi+1 = 1).
Also note that the carry out of the previous row becomes the control input of the next row (if carry out = 1 then subtract in that row and add in the next row, and vice versa).
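The add-or-subtract control can be sketched on scaled integers as follows. Assumptions: a 2n-bit dividend `p`, an n-bit divisor `d`, `p < d * 2**n`, and the first row always subtracts. The final remainder correction is an addition made here so that the function returns a true remainder; the slides do not mention it. The function name `nonrestoring_divide` is hypothetical.

```python
def nonrestoring_divide(p: int, d: int, n: int) -> tuple[int, int]:
    """Nonrestoring division: returns (q, r) with p == q * d + r and 0 <= r < d."""
    assert 0 < d < (1 << n) and 0 <= p < (d << n)
    r = p
    q = 0
    subtract = True                       # control input of the first row is 1
    for i in range(1, n + 1):
        step = d << (n - i)               # divisor aligned to this row
        r = r - step if subtract else r + step
        bit = 1 if r >= 0 else 0          # quotient bit from the sign of the result
        q = (q << 1) | bit
        subtract = bit == 1               # this row's outcome controls the next row
    if r < 0:
        r += d                            # correction step (not shown on the slides)
    return q, r
```

No partial remainder is ever restored: a negative result is simply repaired by adding the divisor in the next row, which is why the cell needs no mux.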
24
R’ Divider Cell Structure
[Figure: nonrestoring divider cell. A full adder (FA) with partial remainder input, di, carry in, carry out and partial remainder output.]
25
Pipelined Nonrestoring Array Divider
[Figure: the nonrestoring array of cells R'11 to R'47 with pipeline registers clocked by clk; a control input of 1 on the first row, inputs p5 to p7, outputs q1 to q4 and r5 to r8.]