CSE 575 Computer Arithmetic Spring 2003 Mary Jane Irwin (www.cse.psu.edu/~mji)

* and / Considerations
It is possible to build really fast multipliers (Wallace tree: 2 log n with a fast CPA) and “sort of” fast dividers (base 4 SRT: n/2 add (CSA) cycles), at the cost of silicon area and energy.
What if area (and energy) are more important metrics than performance?

Array Multipliers & Dividers
Slow, but:
Very regular structure.
Use only short wires to nearest-neighbor cells.
Thus, very simple and efficient layout in VLSI.
Can be easily and efficiently pipelined.

Multiply Review
Right shift and add (serial) integer multiplication.
Partial products are accumulated from the top.
Only requires an n-bit adder.
[Figure: n-bit multiplicand D, multiplier Q, the partial product array (n), and the 2n-bit double precision product P.]
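
The right shift and add scheme reviewed above can be sketched in a few lines of Python. This is only an illustrative word-level model for unsigned operands; the function name `shift_add_multiply` and its argument order are assumptions, not something taken from the slides.

```python
def shift_add_multiply(d, q, n):
    """Right-shift-and-add multiplication of two unsigned n-bit integers.

    d is the multiplicand D, q is the multiplier Q, and the return value is
    the 2n-bit product P. (Names are illustrative, not from the slides.)
    """
    assert 0 <= d < (1 << n) and 0 <= q < (1 << n)
    p = 0  # accumulator for the partial product
    for i in range(n):
        if (q >> i) & 1:
            p += d << n  # add D into the upper half (an n-bit add plus carry out)
        p >>= 1          # shift the partial product right by one position
    return p


print(shift_add_multiply(13, 11, 4))  # 13 * 11 = 143
```

Each addition only touches the upper half of the accumulator, which is why an n-bit adder is enough even though the product is 2n bits wide.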

Example Array Multiplier
[Figure: 4 x 4 array of cells Mij. Row i is driven by multiplier bit qi (q0 at the top) and the columns by multiplicand bits d3 d2 d1 d0. Rows 0 to 3 emit product bits p0 (lsb) to p3 at the right edge; the last row also emits p7 p6 p5 p4.]
The shifts come from the correct positioning of the cells.
Needs O(n**2) cells (signed will need more!).
Delay increases linearly with operand length, as shown on the Array Multiplier Delay slide.
Note that the first row doesn't really do any work, it just adds in zeros, so it can be reduced to AND gates.
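
To make the row structure concrete, here is a hedged bit-level sketch of an unsigned array multiplier: each row is a ripple-carry chain of full-adder cells that adds the gated multiplicand (d_j AND q_i) to the shifted sums from the row above. The cell wiring is an assumption made for illustration; the helper names `full_adder` and `array_multiply` are not from the slides.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(d, q, n):
    """Bit-level model of an n x n unsigned array multiplier (illustrative)."""
    d_bits = [(d >> j) & 1 for j in range(n)]
    q_bits = [(q >> i) & 1 for i in range(n)]
    acc = [0] * n        # sum inputs for the current row (zeros for the first row)
    product_bits = []
    for i in range(n):
        carry = 0
        row = []
        for j in range(n):  # carry ripples across the row from lsb to msb
            s, carry = full_adder(acc[j], d_bits[j] & q_bits[i], carry)
            row.append(s)
        product_bits.append(row[0])  # lsb of row i is product bit p_i
        acc = row[1:] + [carry]      # the shift comes from cell positioning
    product_bits.extend(acc)         # last row supplies the upper product bits
    return sum(bit << k for k, bit in enumerate(product_bits))


print(array_multiply(13, 11, 4))  # 143
```

Because the first row always sees zeros on its sum inputs, its full adders simply pass the AND terms through, which matches the remark that the first row can be reduced to AND gates.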

Square Array Multiplier
[Figure: the array multiplier drawn as a square array, with d0 d1 d2 d3 entering across the top, q0 to q3 entering the rows, carry and sum signals passed between cells, and product bits p0 (lsb) to p7.]

Array Multiplier Delay
[Figure: square array multiplier (q0 to q3, p0 to p7) annotated with the longest delay path.]
Longest delay path: 2n + n - 2 = 3n - 2.
Also notice the computational wavefronts (back diagonals, in green) of signal glitching, so the array is probably not energy efficient.
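
The 3n - 2 figure can be checked with a small timing model. The sketch below assumes one unit of delay per cell, a carry that ripples across each row, and sum inputs taken from the neighboring cell in the row above (with the previous row's carry out feeding the most significant cell). These wiring assumptions and the function name `array_multiplier_delay` are illustrative, not taken from the slides.

```python
def array_multiplier_delay(n):
    """Longest path, in cell delays, through an n x n ripple-carry array.

    Assumes unit delay per cell (illustrative model, not from the slides).
    """
    t = [[0] * n for _ in range(n)]  # t[i][j]: time at which cell (i, j) settles
    for i in range(n):
        for j in range(n):
            carry_in = t[i][j - 1] if j > 0 else 0
            if i == 0:
                sum_in = 0                 # first row adds in zeros
            elif j < n - 1:
                sum_in = t[i - 1][j + 1]   # shifted sum from the row above
            else:
                sum_in = t[i - 1][n - 1]   # carry out of the row above
            t[i][j] = 1 + max(carry_in, sum_in)
    return t[n - 1][n - 1]


for n in (4, 8, 16):
    print(n, array_multiplier_delay(n), 3 * n - 2)
```

Under this model cell (i, j) settles at time 2i + j + 1, so the last cell of the last row settles at 3n - 2, and cells with equal 2i + j form the diagonal wavefronts mentioned on the slide.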

Multiplier Cell Structure
[Figure: multiplier cell built around a full adder (FA), with inputs dj, qi, sum input, and carry in, outputs sum output and carry out, and elements labeled 1D and 2D.]
Want to design the cells so that t_sum ≈ t_carry for delay balancing.
Add extra delays to the qi and dj lines to complete the delay balancing (the Leap Frog multiplier), so that all inputs to a multiplier cell arrive simultaneously.
The top row, left column, and right column have to be treated as special cases.

Delay Balanced FA: Identical Delays for Carry and Sum
[Figure: 22-transistor full adder with signal set-up, sum generation, and carry generation sections; inputs x, y, cin; internal signals p, !p, !y; outputs s and !cout.]
Want balanced delays from the inputs to both the sum and carry outputs to minimize glitching.
But notice that !cout is produced: does the inverter needed to form cout spoil the balance?

Pipelined Array Multiplier
[Figure: 4 x 4 array of cells M00 to M33 with clk-driven pipeline stages; inputs d0 to d3 and q0 to q3, outputs p0 to p7.]
Time between clks is the ripple add time across one row.
Is there a faster way?

Array Multiplier with Recoding
[Figure: array of cells (M10, M20, M21, M30, M31, M32, ...) with CTRL blocks; inputs q0 to q3, outputs p0 to p7.]
CTRL does the differentiating recoding of the multiplier.
Note that we are now shifting right rather than left.
Recode to take care of a negative multiplier, and sign extend to accommodate a negative multiplicand.
2n - 1 by n cells with a worst case delay path of 3n - 2? (or is it 2n - 1?)
Can pipeline just like the previous unsigned scheme, with an increase in delay per row, since we have to wait for the worst case timing exhibited by the last row's carry ripple across 2n - 1 cells.
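
Multiplier Cell Structure
[Figure: recoded multiplier cell built around an FA; inputs dj, q'i, sum input, carry in, and control signals Z and A/S; outputs sum output and carry out.]

Operation  Z  A/S  Result
zero       0  0    previous partial product + 0
add        1  0    previous partial product + D
subt       1  1    previous partial product - D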
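
The Z and A/S table can be modeled at the row level. The sketch below assumes a common realization that the slides do not spell out: the operand fed to the adders is (D AND Z) XOR A/S, and A/S is also injected as the row's carry in, which gives a two's complement subtract. The function name `recoded_row` and the fixed `width` parameter are hypothetical.

```python
def recoded_row(prev, d, z, a_s, width):
    """One row of a recoded array multiplier (illustrative sketch).

    prev  : previous partial product, as a width-bit two's complement pattern
    d     : multiplicand D
    z     : 0 forces the operand to zero, 1 passes D
    a_s   : 0 = add, 1 = subtract (invert the operand and set carry in)
    """
    mask = (1 << width) - 1
    operand = (d & mask) if z else 0
    if a_s:
        operand ^= mask                     # bitwise inversion of the operand
    return (prev + operand + a_s) & mask    # a_s doubles as the carry in


print(recoded_row(9, 5, 0, 0, 8))  # zero: 9
print(recoded_row(9, 5, 1, 0, 8))  # add:  14
print(recoded_row(9, 5, 1, 1, 8))  # subt: 4
```

Results are width-bit two's complement patterns, so a negative partial product such as 3 - 5 comes back as 254 when width is 8.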

CSA Array Multiplier
[Figure: array of CSA cells M00 to M23 (cell inputs dj, qi, sum input, carry in; cell outputs sum and carry out); inputs d0 to d3 and q0 to q3, outputs p0 to p7.]
Delay is still linear, but less!
Once again, the first row doesn't do any real work, it just forms the first row of partial product terms (AND gates).
The last row of cells has to propagate the carry, so those cells have a slightly different micro-architecture: they have four things to add together, which can be done with a CSA feeding a CPA.

CSA Array Multiplier
[Figure: CSA array (M00, M01, M02, ...; q0 to q3, p0 to p7) annotated with the longest delay path.]
Longest delay path: n + n - 1 = 2n - 1.
Delay is still linear, but less! Only have to pay for the carry to ripple across the last row.
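
A word-level sketch of the carry-save idea follows: every row combines the running sum word, the running carry word, and one partial product without propagating carries, and only the final step performs a carry-propagate add. The function names `csa` and `csa_array_multiply` are illustrative assumptions, and Python integers stand in for the rows of cells.

```python
def csa(a, b, c):
    """Carry-save adder: reduces three words to a sum word and a carry word."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries move one column left
    return s, carry


def csa_array_multiply(d, q, n):
    """Unsigned n x n multiply using carry-save rows and one final CPA."""
    s, c = 0, 0
    for i in range(n):
        pp = (d << i) if (q >> i) & 1 else 0  # partial product for row i
        s, c = csa(s, c, pp)                  # no carry ripple inside the row
    return s + c                              # carry-propagate add at the end


print(csa_array_multiply(13, 11, 4))  # 143
```

The invariant is that s + c always equals the sum of the partial products consumed so far, so the carry only has to ripple once, in the last step.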

Pipelined CSA Array Multiplier
[Figure: rows M03 to M00, M13 to M10, M23 to M20, and a modified last row M'33 to M'30, with clk-driven pipeline stages; inputs d3 d2 d1 d0 and q0 to q3, outputs p0 to p7.]
But what about the delay in the last row? It will set the rate for the clk, so this is no faster than the previous design!

Augmented Pipelined CSA Array Multiplier
[Figure: array of cells M00 to M33 augmented with extra cells M41 M42 M43, M52 M53, and M63, with clk-driven pipeline stages; inputs d0 to d3 and q0 to q3, outputs p0 to p7.]
Now the delay in each row is defined by one CSA time, but latency is increased for the most significant bits of the product.

Constructing Big Mult’s from Small
Can synthesize a 2b x 2b multiplier from four b x b multipliers and a three-operand addition.
[Figure: operands split into halves AH, AL and BH, BL (b bits each); partial products AL BL, AL BH, AH BL, and AH BH aligned for the addition (labels: b bits, 3b bits); 4b-bit product.]
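
As a sketch of this construction, the code below splits each operand into b-bit halves, forms the four b x b products, and combines them with a single three-operand addition: AH·BH and AL·BL do not overlap and are concatenated into one operand, while the two cross products are shifted left by b bits. The function name `mult_2b` and the use of Python's `*` as the b x b multiplier are assumptions for illustration.

```python
def mult_2b(a, b_op, b):
    """2b x 2b unsigned multiply from four b x b products (illustrative)."""
    mask = (1 << b) - 1
    ah, al = a >> b, a & mask        # A = AH * 2**b + AL
    bh, bl = b_op >> b, b_op & mask  # B = BH * 2**b + BL

    ll = al * bl  # each product is at most 2b bits wide
    lh = al * bh
    hl = ah * bl
    hh = ah * bh

    op1 = (hh << (2 * b)) | ll  # AH*BH and AL*BL concatenated, no overlap
    op2 = lh << b               # cross terms are aligned b bits up
    op3 = hl << b
    return op1 + op2 + op3      # the three-operand addition


print(hex(mult_2b(0xAB, 0xCD, 4)))  # 0xAB * 0xCD = 0x88ef
```

The concatenation is safe because AL·BL is always less than 2**(2b), so it never reaches the bit positions occupied by AH·BH.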

Division Operation
Left shift and subtract (serial) fractional division.
[Figure: 2n-bit dividend .P, n-bit divisor .D, n-bit quotient .Q, n-bit remainder .R, and the partial remainder array (pra).]
Conditions: P < D and 1/2 ≤ D < 1.

Restoring Array Divider
[Figure: rows of cells R11 to R14, R22 to R25, R33 to R36, R44 to R47, each row with a constant 1 input; dividend bits p1 to p7 (lsb), quotient bits q1 to q4, remainder bits r5 to r8.]
The layout resembles the dots in a dot diagram.
In each row the difference between the ppr and the divisor is formed (trial subtraction):
if the result is positive, cout = 0, so qi+1 = 1;
if the result is negative, cout = 1, so qi+1 = 0 (restoring division).

Restoring Divider Cell Structure
[Figure: subtractor cell with an FA and a mux; inputs partial remainder input, di, and carry in; outputs carry out and partial remainder output.]
The mux selects the previous ppr if qi+1 = 0; otherwise the output of the FA is selected.
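
The trial subtraction and mux selection described above can be sketched at the word level. This model uses integers rather than fractions: the dividend p is 2n bits, the divisor d is n bits, and p < d * 2**n plays the role of the P < D condition so that the quotient fits in n bits. The function name `restoring_divide` is an assumption.

```python
def restoring_divide(p, d, n):
    """Left-shift-and-subtract restoring division (illustrative sketch).

    Returns (quotient, remainder) with p == quotient * d + remainder.
    """
    assert d > 0 and 0 <= p < (d << n)
    dd = d << n  # divisor aligned with the upper half of the dividend
    r = p
    q = 0
    for _ in range(n):
        r <<= 1               # left shift the partial remainder
        trial = r - dd        # trial subtraction
        if trial >= 0:
            r = trial         # keep the difference, quotient bit is 1
            q = (q << 1) | 1
        else:
            q = q << 1        # "restore": the mux keeps the previous remainder
    return q, r >> n


print(restoring_divide(150, 13, 4))  # (11, 7), since 150 = 11 * 13 + 7
```

In the array, each loop iteration corresponds to one row, the trial subtraction to the row of FA cells, and the if/else to the per-cell mux.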

Restoring Array Divider Delay
[Figure: restoring array (cells R25, R35, R36, R45, R46, R47 shown; dividend bits p1 to p7, quotient bits q1 to q4, remainder bits r5 to r8) annotated with the longest delay path.]
Needs O(n**2) cells and O(n**2) delay, since we have to wait for the ripple in each row, for all n rows.
Longest delay path: n * n = n**2.

Pipelined Restoring Array Divider
[Figure: rows R11 to R14, R22 to R25, R33 to R36, R44 to R47 with clk-driven pipeline stages; inputs p5 to p7 entering rows 2 to 4, quotient bits q1 to q4, remainder bits r5 to r8.]
Pipelining brings the delay per clk down to O(n), defined by the ripple time across one row.

Nonrestoring Array Divider
[Figure: rows of cells R’11 to R’14, R’22 to R’25, R’33 to R’36, R’44 to R’47, with a constant 1 at the top-left control input; dividend bits p1 to p7, quotient outputs q1’ to q4’, remainder bits r5 to r8.]
Same size and about the same speed as the restoring array (still O(n**2)).
In each row, the difference between the ppr and the divisor is formed if the control input (top left) is 1. Note that this input wraps around and also sets the carry in of the row (1 on a subtract, meaning subtract because qi+1 = 1).
Also note that the carry out of the previous row becomes the control input of the next row (if carry out = 1 then subtract in that row and add in the next row, and vice versa).
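
For comparison with the restoring scheme, here is a hedged word-level sketch of nonrestoring division. Instead of undoing a failed trial subtraction, a negative partial remainder is kept and the next step adds the divisor instead of subtracting it; the sign of each result selects the quotient bit and the operation for the next step. The integer framing (p < d * 2**n), the final correction step, and the function name `nonrestoring_divide` are assumptions for illustration and are not taken from the slides.

```python
def nonrestoring_divide(p, d, n):
    """Nonrestoring division of a 2n-bit dividend by an n-bit divisor.

    Returns (quotient, remainder) with p == quotient * d + remainder.
    """
    assert d > 0 and 0 <= p < (d << n)
    dd = d << n
    r = p
    q = 0
    subtract = True  # the first row always subtracts (control input = 1)
    for _ in range(n):
        r <<= 1
        r = r - dd if subtract else r + dd
        bit = 1 if r >= 0 else 0
        q = (q << 1) | bit
        subtract = (bit == 1)  # the sign of this row controls the next row
    if r < 0:
        r += dd                # final correction of a negative remainder
    return q, r >> n


print(nonrestoring_divide(150, 13, 4))  # (11, 7)
print(nonrestoring_divide(135, 13, 4))  # (10, 5), uses the final correction
```

Each row still needs a full-width add or subtract, which is consistent with the remark that the array has the same size and about the same speed as the restoring version.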

R’ Divider Cell Structure
[Figure: nonrestoring divider cell built around an FA; inputs partial remainder input, di, and carry in; outputs carry out and partial remainder output.]
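
Pipelined Nonrestoring Array Divider
[Figure: rows R’11 to R’14, R’22 to R’25, R’33 to R’36, R’44 to R’47 with clk-driven pipeline stages and a constant 1 at the top-left control input; inputs p5 to p7 entering rows 2 to 4, quotient bits q1 to q4, remainder bits r5 to r8.]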