EECS Components and Design Techniques for Digital Systems Lec 16 – Arithmetic II (Multiplication) David Culler Electrical Engineering and Computer Sciences University of California, Berkeley
Overview Review of Addition Overflow Multiplication Further adder optimizations for multiplication CLA in the large – parallel prefix
Review Circuit design for unsigned addition –Full adder per bit slice –Delay limited by Carry Propagation »Ripple is algorithmically slow, but wires are short Carry select –Simple, resource-intensive –Excellent layout Carry look-ahead –Excellent asymptotic behavior –Great at the board level, but wire length effects are significant on chip Digital number systems –How to represent negative numbers –Simple operations –Clean algorithmic properties 2s complement is most widely used –Circuit for unsigned arithmetic –Subtract by complement and carry in –Overflow when cin xor cout of sign-bit is 1
Computer Number Systems Positional notation –$D_{n-1} D_{n-2} \ldots D_0$ represents $D_{n-1} B^{n-1} + D_{n-2} B^{n-2} + \ldots + D_0 B^0$ where $D_i \in \{0, \ldots, B-1\}$ 2s Complement –$D_{n-1} D_{n-2} \ldots D_0$ represents $-D_{n-1} 2^{n-1} + D_{n-2} 2^{n-2} + \ldots + D_0 2^0$ –MSB has negative weight
2s Complement Overflow Add two positive numbers to get a negative number, or two negative numbers to get a positive number. Examples: a sum of two positives that comes out as -8! A sum of two negatives that comes out as +7! How can you tell an overflow occurred?
2s comp. Overflow Detection [Figure: example additions labeled "Overflow" and "No overflow"] Overflow occurs when the carry into the sign bit does not equal the carry out of the sign bit
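The detection rule can be tried on small cases. The following is a minimal Python sketch, assuming 4-bit operands passed as bit patterns; the name `add_with_overflow`, its return layout, and the sample operands are illustrative choices, not taken from the lecture.

```python
def add_with_overflow(a, b, n=4):
    """Add two n-bit 2s complement bit patterns (ints in 0..2**n - 1).

    Returns (result_pattern, signed_value, overflow), where overflow is
    (carry into the sign bit) XOR (carry out of the sign bit).
    """
    mask = (1 << n) - 1
    low_mask = (1 << (n - 1)) - 1
    a &= mask
    b &= mask
    # Carry into the sign bit = carry out of the low n-1 bits.
    carry_in = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = a + b
    # Carry out of the sign bit = bit n of the full sum.
    carry_out = total >> n
    result = total & mask
    signed = result - (1 << n) if result >> (n - 1) else result
    return result, signed, bool(carry_in ^ carry_out)


print(add_with_overflow(0b0101, 0b0011))  # 5 + 3     -> (8, -8, True)
print(add_with_overflow(0b1001, 0b1110))  # -7 + (-2) -> (7, 7, True)
print(add_with_overflow(0b0101, 0b0010))  # 5 + 2     -> (7, 7, False)
```

The first two calls wrap around to -8 and +7, and the XOR of the two carries flags both.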
2s Complement Adder/Subtractor $A - B = A + (-B) = A + \overline{B} + 1$
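The identity above maps onto one adder plus a row of XOR gates: the subtract control inverts B and also drives the carry in. A minimal Python sketch, assuming a 4-bit datapath; `add_sub` and `to_signed` are hypothetical helper names.

```python
def add_sub(a, b, subtract, n=4):
    """n-bit adder/subtractor: returns the n-bit result pattern."""
    mask = (1 << n) - 1
    # XOR every bit of B with the control line: B when adding, ~B when subtracting.
    b_in = (b ^ (mask if subtract else 0)) & mask
    # The same control line is the carry in, supplying the "+ 1".
    carry_in = 1 if subtract else 0
    return ((a & mask) + b_in + carry_in) & mask


def to_signed(x, n=4):
    """Interpret an n-bit pattern as a 2s complement value."""
    return x - (1 << n) if x >> (n - 1) else x


print(to_signed(add_sub(0b0101, 0b0011, True)))   # 5 - 3 -> 2
print(to_signed(add_sub(0b0011, 0b0101, True)))   # 3 - 5 -> -2
print(to_signed(add_sub(0b0011, 0b0100, False)))  # 3 + 4 -> 7
```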
Adders on the Xilinx Virtex Dedicated carry logic provides fast arithmetic carry capability for high-speed arithmetic functions. The Virtex-E CLB supports two separate carry chains, one per Slice. The height of the carry chains is two bits per CLB. The arithmetic logic includes an XOR gate and an AND gate that allow a 2-bit full adder to be implemented within a slice. Cin to Cout delay = 0.1ns, versus 0.4ns for F to X delay. How do we map a 2-bit adder to one slice?
Time / Space (resource) Trade-offs Carry select and CLA utilize more silicon to reduce time. Can we use more time to reduce silicon? How few FAs does it take to do addition?
Bit-serial Adder Addition of two n-bit numbers: –takes n clock cycles, –uses 1 FF, 1 FA cell, plus registers –the bit streams may come from or go to other circuits, therefore the registers may be optional. Requires controller –What does the FSM look like? Implemented? Final carry out? A, B, and R held in shift registers. Shift right once per clock cycle, LSB first. Reset is asserted by controller.
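The datapath can be simulated one clock cycle at a time. The sketch below is a minimal Python model, assuming the operands are plain integers read LSB first and the single flip-flop holds the carry; `bit_serial_add` is an illustrative name, and the controller FSM is reduced to a loop counter.

```python
def bit_serial_add(a, b, n):
    """Add two n-bit numbers with one full-adder cell over n cycles.

    Returns (n_bit_sum, final_carry_out).
    """
    carry = 0    # the single flip-flop, cleared when reset is asserted
    result = 0   # models the R shift register
    for cycle in range(n):
        a_bit = (a >> cycle) & 1   # bit leaving the A shift register
        b_bit = (b >> cycle) & 1   # bit leaving the B shift register
        s = a_bit ^ b_bit ^ carry  # full-adder sum output
        carry = (a_bit & b_bit) | (a_bit & carry) | (b_bit & carry)
        result |= s << cycle
    return result, carry


print(bit_serial_add(0b1101, 0b1011, 4))  # 13 + 11 = 24 -> (8, 1)
print(bit_serial_add(0b0101, 0b0010, 4))  # 5 + 2 -> (7, 0)
```

The final carry stays in the flip-flop after the last cycle, which is one possible answer to the "final carry out?" question.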
Discussion What is sign extension and why does it work? Where is addition used in the project? Where might you want more powerful arithmetic operations?
Announcements Reading: 5.8 (4 pages!) Digital Design in the news – from UCB –UC Berkeley is among six universities to be part of the program started by IBM Corp. and Google Inc. on college campuses to promote computer-programming techniques for clusters of processors known as "clouds". Cloud computing allows computers in remote data centers to run in parallel, increasing their processing power. Each company will spend between $20 million and $25 million for hardware, software and services that can be used by computer-science professors and students.
Basic concept of multiplication (unsigned)
        1101   (13)  multiplicand
      × 1011   (11)  multiplier
        ----
        1101
       1101          partial products
      0000
     1101
    --------
    10001111   (143)
Product of two n-bit numbers is a 2n-bit number –sum of n n-bit partial products
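The worked example can be reproduced by generating the partial products explicitly. A minimal Python sketch, assuming unsigned operands; `partial_products` is an illustrative name.

```python
def partial_products(a, b, n):
    """Return the n partial products of a * b, one per multiplier bit.

    Partial product i is the multiplicand shifted left by i if bit i of
    the multiplier is 1, and 0 otherwise.
    """
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]


pps = partial_products(0b1101, 0b1011, 4)
print([format(pp, "08b") for pp in pps])
# -> ['00001101', '00011010', '00000000', '01101000']
print(format(sum(pps), "08b"), sum(pps))  # -> 10001111 143
```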
Combinational Multiplier: accumulation of partial products
                    A3   A2   A1   A0
                    B3   B2   B1   B0
                    A3B0 A2B0 A1B0 A0B0
               A3B1 A2B1 A1B1 A0B1
          A3B2 A2B2 A1B2 A0B2
     A3B3 A2B3 A1B3 A0B3
S7   S6   S5   S4   S3   S2   S1   S0
Array Multiplier Each row: n-bit adder with AND gates What is the critical path? Generates all n partial products simultaneously.
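One way to see the row structure is a gate-level model in which each row is a set of AND gates feeding an n-bit ripple adder. This is a minimal Python sketch under that assumption; `full_adder` and `array_multiply` are illustrative names, and the model says nothing about timing.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(a, b, n=4):
    """Unsigned n x n multiply built from AND gates and full adders."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    acc = [0] * (2 * n)  # running sum, LSB first
    for j in range(n):
        # AND gates form partial product j.
        row = [a_bits[i] & b_bits[j] for i in range(n)]
        carry = 0
        # n-bit ripple adder for this row, offset by j columns.
        for i in range(n):
            acc[i + j], carry = full_adder(acc[i + j], row[i], carry)
        acc[j + n] = carry  # carry out of the row becomes the next column
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(0b1101, 0b1011))  # -> 143
```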
“Shift and Add” Multiplier Sums each partial product, one at a time. In binary, each partial product is a shifted version of A, or 0. Control Algorithm:
1. P ← 0, A ← multiplicand, B ← multiplier
2. If LSB of B == 1 then add A to P, else add 0
3. Shift [P][B] right 1
4. Repeat steps 2 and 3 n-1 times.
5. [P][B] has product.
Cost ∝ n, time = n clock cycles. What is the critical path for determining the min clock period?
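The control algorithm can be traced in software. The following minimal Python sketch assumes unsigned n-bit operands and models P as n bits plus the adder's carry out; the function name `shift_add_multiply` is illustrative.

```python
def shift_add_multiply(a, b, n):
    """Unsigned shift-and-add multiply; returns the 2n-bit product [P][B]."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    p = 0                                # step 1: P <- 0
    for _ in range(n):                   # steps 2 and 3, n times in total
        p += a if (b & 1) else 0         # step 2: add A or 0 (keeps carry out)
        # Step 3: shift [P][B] right by 1; P's LSB moves into B's MSB.
        b = ((p & 1) << (n - 1)) | (b >> 1)
        p >>= 1
    return (p << n) | b                  # step 5: [P][B] holds the product


print(shift_add_multiply(0b1101, 0b1011, 4))  # 13 * 11 -> 143
```

Each iteration uses the same n-bit adder once, which is where the n-cycle latency comes from.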
Carry-save Addition Speeding up multiplication is a matter of speeding up the summing of the partial products. “Carry-save” addition can help. Carry-save addition passes (saves) the carries to the output, rather than propagating them. Example: sum three numbers, $3_{10} = 0011$, $2_{10} = 0010$, $3_{10} = 0011$
  0011 + 0010:                      c 0100 = $4_{10}$, s 0001 = $1_{10}$
  carry-save add of c, s and 0011:  c 0010 = $2_{10}$, s 0110 = $6_{10}$
  carry-propagate add of c and s:   0010 + 0110 = 1000 = $8_{10}$
In general, carry-save addition takes in 3 numbers and produces 2, whereas carry-propagate addition takes 2 and produces 1. With this technique, we can avoid carry propagation until the final addition.
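The 3-in, 2-out behavior is a bank of independent full adders, so it can be written with bitwise operations only. A minimal Python sketch, assuming non-negative integer operands; `carry_save_add` is an illustrative name.

```python
def carry_save_add(x, y, z):
    """3:2 carry-save adder: returns (c, s) with c + s == x + y + z.

    Each bit position is an independent full adder; no carry ripples
    between positions.
    """
    s = x ^ y ^ z                            # per-bit sum outputs
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carries, weighted one column up
    return c, s


# Reproduce the example: 3 + 2 + 3.
c1, s1 = carry_save_add(0b0011, 0b0010, 0)   # 3 + 2
print(format(c1, "04b"), format(s1, "04b"))  # -> 0100 0001
c2, s2 = carry_save_add(c1, s1, 0b0011)      # fold in the third operand
print(format(c2, "04b"), format(s2, "04b"))  # -> 0010 0110
print(c2 + s2)                               # final carry-propagate add -> 8
```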
Carry-save Circuits When adding sets of numbers, carry-save can be used on all but the final sum. Standard adder (carry propagate) is used for final sum.
Array Mult. using Carry-save Addition [Figure: carry-save array whose final stage is a fast carry-propagate adder]
Another Representation Building block: full adder + AND gate. 4 x 4 array of building blocks, followed by a CPA for the final add.
Carry-save Addition CSA is associative and commutative. For example: $(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$ A balanced tree can be used to reduce the logic delay. This structure is the basis of the Wallace Tree Multiplier. Partial products are summed with the CSA tree. Fast CPA (ex: CLA) is used for final sum. Multiplier delay $\propto \log_{3/2} N + \log_2 N$
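The tree reduction can be sketched by repeatedly grouping operands three at a time. The Python below is a simplified model, assuming non-negative integer operands and counting CSA levels rather than gate delays; `csa` and `csa_tree_sum` are illustrative names, and the grouping is one reasonable choice rather than a specific Wallace tree layout.

```python
def csa(x, y, z):
    """3:2 carry-save adder: returns (carry_word, sum_word)."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def csa_tree_sum(values):
    """Sum operands with levels of CSAs, then one carry-propagate add.

    Returns (total, csa_levels).
    """
    operands = list(values)
    levels = 0
    while len(operands) > 2:
        nxt = []
        # Each full group of three operands feeds one CSA (3 in, 2 out).
        for i in range(0, len(operands) - 2, 3):
            nxt.extend(csa(*operands[i:i + 3]))
        # Leftover operands (one or two) pass straight to the next level.
        leftover = len(operands) % 3
        if leftover:
            nxt.extend(operands[-leftover:])
        operands = nxt
        levels += 1
    return sum(operands), levels  # sum() stands in for the final fast CPA


# Partial products of 13 * 11.
print(csa_tree_sum([13, 26, 0, 104]))  # -> (143, 2)
```

Each level shrinks the operand count by roughly a factor of 3/2, which is the source of the $\log_{3/2} N$ term.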
Signed Multiplier Signed Multiplication: Remember for 2’s complement numbers MSB has negative weight: ex: $-6 = 1010_2 = -1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = -8 + 2 = -6$ Therefore for multiplication: a) subtract final partial product b) sign-extend partial products Modifications to shift & add circuit: a) adder/subtractor b) sign-extender on P shift register
Signed multiplication (2s complement)
        1101   (-3)   multiplicand
      × 1011   (-5)   multiplier
        ----
    11111101   (-3)   sign-extended
    1111101    (-6)   sign-extended
    000000     (0)
  - 11101     -(-24)  final partial product is subtracted
    --------
    00001111   (15)
Product of two n-bit numbers is a 2n-bit number –sum of n n-bit partial products
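The two rules (sign-extend each partial product, subtract the last one) can be checked numerically. A minimal Python sketch, assuming n-bit 2s complement bit patterns as inputs; `to_signed` and `signed_multiply` are illustrative names, and sign extension is modeled by working with the multiplicand's signed value.

```python
def to_signed(x, n):
    """Interpret the low n bits of x as a 2s complement value."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >> (n - 1) else x


def signed_multiply(a, b, n):
    """Multiply two n-bit 2s complement patterns; returns a 2n-bit pattern."""
    a_val = to_signed(a, n)  # equivalent to sign-extending the multiplicand
    total = 0
    for i in range(n):
        if (b >> i) & 1:
            pp = a_val << i
            # The multiplier's MSB has weight -2^(n-1): subtract its partial product.
            total += -pp if i == n - 1 else pp
    return total & ((1 << (2 * n)) - 1)


product = signed_multiply(0b1101, 0b1011, 4)       # (-3) * (-5)
print(format(product, "08b"), to_signed(product, 8))  # -> 00001111 15
```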
Signed Array Multiplier [Figure: array multiplier with inputs a3..a0 and b3..b0 and outputs P7..P0] Implicit sign extension
“Shift and Add” Signed Multiplier Sign-extend the partial product at each stage. Final step is a subtract.
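The sequential version differs from the unsigned one in two places: the shift of P is arithmetic (it preserves the sign), and the last step subtracts. A minimal Python sketch, assuming n-bit 2s complement patterns as inputs and an unbounded signed integer for P; the names are illustrative.

```python
def to_signed(x, n):
    """Interpret the low n bits of x as a 2s complement value."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >> (n - 1) else x


def signed_shift_add(a, b, n):
    """Signed shift-and-add multiply; returns the 2n-bit product pattern."""
    a_val = to_signed(a, n)
    b &= (1 << n) - 1
    p = 0
    for step in range(n):
        if b & 1:
            # Final step is a subtract, because the multiplier's MSB is negative.
            p = p - a_val if step == n - 1 else p + a_val
        # Shift [P][B] right by 1; P's LSB moves into B's MSB.
        b = ((p & 1) << (n - 1)) | (b >> 1)
        p >>= 1  # Python's >> on a negative int is arithmetic (sign-extending)
    return ((p << n) | b) & ((1 << (2 * n)) - 1)


print(to_signed(signed_shift_add(0b1101, 0b1011, 4), 8))  # (-3) * (-5) -> 15
print(to_signed(signed_shift_add(0b0111, 0b1000, 4), 8))  # 7 * (-8) -> -56
```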
Carry Look-ahead Adders In general, for n-bit addition the best we can achieve is delay $\propto \log(n)$. How do we arrange this? (think trees) First, reformulate basic adder stage: carry “kill” $k_i = a_i' b_i'$, carry “propagate” $p_i = a_i \oplus b_i$, carry “generate” $g_i = a_i b_i$, $c_{i+1} = g_i + p_i c_i$, $s_i = p_i \oplus c_i$
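Unrolling the recurrence $c_{i+1} = g_i + p_i c_i$ expresses every carry directly in terms of the $p$, $g$ signals and $c_0$. The Python below is a minimal sketch of that flat (non-hierarchical) form, assuming unsigned n-bit operands; `lookahead_add` is an illustrative name.

```python
def lookahead_add(a, b, c0, n):
    """Add using carries computed only from p, g and c0.

    Returns (n_bit_sum, carries) where carries[i] is c_i, for i = 0..n.
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # generate
    carries = [c0]
    for i in range(n):
        # c_{i+1} = g_i + p_i g_{i-1} + ... + (p_i ... p_1) g_0 + (p_i ... p_0) c_0
        c = 0
        prod = 1  # running product p_i p_{i-1} ... p_{j+1}
        for j in range(i, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        c |= prod & c0
        carries.append(c)
    s = sum((p[i] ^ carries[i]) << i for i in range(n))  # s_i = p_i xor c_i
    return s, carries


print(lookahead_add(0b1101, 0b1011, 0, 4))  # -> (8, [0, 1, 1, 1, 1])
```

The product terms grow with i, which is why practical designs group bits and build P and G hierarchically.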
Carry Look-ahead Adders – in blocks “Group” propagate and generate signals, formed from the bit-level signals $p_i, g_i, p_{i+1}, g_{i+1}, \ldots, p_{i+k}, g_{i+k}$: P true if the group as a whole propagates a carry to $c_{out}$, G true if the group as a whole generates a carry. $P = p_i p_{i+1} \ldots p_{i+k}$ $G = g_{i+k} + p_{i+k} g_{i+k-1} + \ldots + (p_{i+1} p_{i+2} \ldots p_{i+k}) g_i$ $C_{out} = G + P C_{in}$ Group P and G can be generated hierarchically.
Carry Look-ahead Adders 9-bit example of hierarchically generated P and G signals: group a covers bits 0 to 2 ($P_a$, $G_a$), group b covers bits 3 to 5 ($P_b$, $G_b$), group c covers bits 6 to 8 ($P_c$, $G_c$). $c_3 = G_a + P_a c_0$ $c_6 = G_b + P_b c_3$ $P = P_a P_b P_c$ $G = G_c + P_c G_b + P_b P_c G_a$ $c_9 = G + P c_0$
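The 9-bit example can be checked against ordinary addition. This is a minimal Python sketch, assuming three 3-bit groups as in the example; `group_pg` and `cla9_carries` are illustrative names.

```python
def group_pg(p, g):
    """Group propagate/generate for per-bit lists p, g (index 0 = lowest bit)."""
    P = 1
    G = 0
    for pj, gj in zip(p, g):
        P &= pj
        # A generate from lower bits survives only if this bit propagates it.
        G = gj | (pj & G)
    return P, G


def cla9_carries(a, b, c0=0):
    """Return (c3, c6, c9) for a 9-bit add using hierarchical P and G."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(9)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(9)]
    Pa, Ga = group_pg(p[0:3], g[0:3])
    Pb, Gb = group_pg(p[3:6], g[3:6])
    Pc, Gc = group_pg(p[6:9], g[6:9])
    c3 = Ga | (Pa & c0)
    c6 = Gb | (Pb & c3)
    # Second level: P and G for the whole 9-bit block.
    P = Pa & Pb & Pc
    G = Gc | (Pc & Gb) | (Pb & Pc & Ga)
    c9 = G | (P & c0)
    return c3, c6, c9


print(cla9_carries(0b111111111, 0b000000001))  # carry ripples through all groups -> (1, 1, 1)
```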
Parallel Prefix (generalizing CLA) Compute all the prefixes $F_i = F_{i-1} \;\mathrm{op}\; F_{i-2} \;\mathrm{op}\; \ldots \;\mathrm{op}\; F_0$ Assume op is associative and commutative [Figure: tree of op nodes combining inputs pairwise]
8-bit Carry Look-ahead Adder [Figure: tree of p,g leaf cells and P,G combining cells over bits 0 to 7, producing carries $c_1$ through $c_8$ from $c_0$] Leaf cell (inputs $a_i$, $b_i$, $c_i$; outputs $s_i$, p, g): $p = a \oplus b$, $g = ab$, $s = p \oplus c_i$, $c_{i+1} = g + c_i p$. Combining cell (lower group a, upper group b): $P = P_a P_b$, $G = G_b + G_a P_b$, $C_{out} = G + c_{in} P$
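The combining cell defines an operator on (G, P) pairs, and all carries are prefixes under that operator. The Python sketch below uses a Kogge-Stone style scan, which is one of several possible prefix networks and not necessarily the tree drawn on the slide; it relies only on the operator being associative. All function names are illustrative.

```python
def combine(hi, lo):
    """Merge an upper group (G_b, P_b) with a lower group (G_a, P_a)."""
    Gb, Pb = hi
    Ga, Pa = lo
    return (Gb | (Ga & Pb), Pa & Pb)


def prefix_scan(items, op):
    """Return (prefixes, levels), prefixes[i] = items[i] op ... op items[0]."""
    out = list(items)
    dist, levels = 1, 0
    while dist < len(out):
        # Every position is updated independently: one level of parallel logic.
        out = [op(out[i], out[i - dist]) if i >= dist else out[i]
               for i in range(len(out))]
        dist *= 2
        levels += 1
    return out, levels


def prefix_adder(a, b, c0=0, n=8):
    """n-bit add with carries from a parallel prefix over (g, p) pairs."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    prefixes, levels = prefix_scan(list(zip(g, p)), combine)
    # c_{i+1} = G + c_in P, using the group spanning bits 0..i.
    carries = [c0] + [G | (P & c0) for G, P in prefixes]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n], levels


print(prefix_adder(200, 100))  # 300 = 256 + 44 -> (44, 1, 3)
```

For 8 bits the scan needs 3 levels, matching the $\log_2 n$ depth the lecture is aiming for.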
Summary 2s complement number systems –Algebraic and corresponding bit manipulations –Overflow detection –Significance of “sign bit”: weight $-2^{n-1}$ Carry look-ahead is a form of parallel prefix Time / Space tradeoffs –Bit serial adder Binary Multiplication algorithm –Array multiplier –Serial multiply (with bit parallel adder) Signed multiplication –Sign extend multiplicand –Sign bit of multiplier treated as subtract