1
CSE 575 Computer Arithmetic, Spring 2005, Mary Jane Irwin (www.cse.psu.edu)
2
Review: Administrivia
Mary Jane Irwin, 348C IST Bldg; office hours by appointment. Optional text: Digital Arithmetic, Ercegovac & Lang, Morgan Kaufmann, 2004. Prerequisites: CSE 477 and CSE 431. Lectures are scheduled from 2:30 to 4:30 (but will usually be over by 4:00, especially if a colloquium is scheduled).
3
Remaining Lecture Schedule
Mar 15: Introduction, number representation (Dr. Irwin) Chp 1
Mar 17: Local project design review (Theo T.)
Mar 22: Global project review (Dr. Vijay)
Mar 24:
Mar 29: Addition Chp 2
Apr 1: Redundant representations & their uses
Apr 5: Multiplication Chp 4
Apr 7: Local/Global project review
Apr 12: Division Chp 5
Apr 14: Floating point representation & operation Chp 8
Apr 19: Function evaluation Chp 10, 11
Apr 21: Final global project review
Apr 26: Other number systems
Apr 28:
4
Review: Complement Rep Options
Two auxiliary operations are required:
negation (computing M - X)
computation of residues mod M
Select M so that these two operations are simple.
Two's complement: M = 2^k; negation is Xcompl + ulp; mod M: ignore the high order carry out
One's complement: M = 2^k - ulp; negation is Xcompl; mod M: do the end-around carry
complement: replace each digit by its radix complement [(r-1) - x]; adding a unit in the lsd gives RC(Y) = r^k - Y
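The two complement options can be sketched in a few lines of Python (an illustrative model with hypothetical helper names, not code from the course):

```python
# Sketch of the two complement-representation options for k-digit,
# radix-r numbers (helper names are illustrative, not from the slides).

def radix_complement(y, r, k):
    """RC(Y) = r^k - Y (mod r^k): negation when M = r^k
    (two's complement for r = 2)."""
    return (r**k - y) % r**k

def digit_complement(y, r, k):
    """DC(Y) = (r^k - 1) - Y: replace each digit by (r-1) - digit
    (one's complement for r = 2); adding a ulp gives RC(Y)."""
    return (r**k - 1) - y

# r = 2, k = 4: negate 3 (0011)
assert radix_complement(3, 2, 4) == 0b1101   # two's complement
assert digit_complement(3, 2, 4) == 0b1100   # one's complement
assert (digit_complement(3, 2, 4) + 1) % 2**4 == radix_complement(3, 2, 4)
```

The same helpers work for any radix, e.g. ten's complement of 05 with r = 10, k = 2 is 95.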
5
Winograd’s Lower Bound on Addition
(f,r) circuit as the basic gate model:
f: fan-in of the logic gate
r: radix of the system; each input and output takes one of r values
Each gate takes unit time to compute its output.
A (2,2) gate has 2 inputs, each 2-valued, so 16 distinct logic functions can be computed in unit time:
c = a; c = b; c = !a; c = !b; c = ab; c = a|b; c = !ab; c = a!b; c = !a|b; c = a|!b; c = a xor b; c = a xnor b; c = a nand b; c = a nor b; c = 0; c = 1
For a (4,4) gate, how many logic functions can be computed in unit time?
6
Lower Bound Theorem: using (f,r) circuits, evaluating an r-valued output which is a function of all n r-valued inputs takes time T >= ceil(log_f n)
Proof: by induction from the definition of (f,r) circuits.
So the lower bound on addition is T >= ceil(log_f n): look at the high order carry out of an adder; it is an r-valued output which is a function of all n r-valued inputs.
7
Theorem “Examples” (2,2) circuits with n = 8
3 = ceil(log2 8) with (2,2) gates; 2 = ceil(log4 8) with fan-in 4 gates
8
Model Limitations Assumes unlimited fan-out
in reality, the larger the fan-out, the slower the gate
Assumes no propagation delay between logic blocks
in reality, we also have to worry about interconnect wire delay (capacitance, resistance)
Assumes an (f,r) circuit can be made to work in unit time
in reality, gate speeds are technology, fan-in, and fan-out dependent
9
Review: Binary Full Adder (FA)
2ci+1 + si = xi + yi + ci   (xi: addend, yi: augend, ci: carry-in, ci+1: carry-out, si: sum)
Carry status signals: generate gi = xi & yi; propagate pi = xi ^ yi; kill ki = !xi & !yi
si = xi ^ yi ^ ci = pi ^ ci   (odd parity function)
ci+1 = xi&yi | xi&ci | yi&ci   (majority function) = gi | pi&ci
A VERY common operation, so worth spending some time trying to optimize. And often in the critical path, so we need to look at both logic level optimizations and circuit level optimizations.
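The FA equations above can be checked exhaustively with a small Python model (illustrative, not from the slides):

```python
def full_adder(x, y, c):
    """One-bit full adder: s is the odd parity of the inputs,
    cout is their majority, expressed via generate/propagate."""
    g = x & y            # generate
    p = x ^ y            # propagate
    s = p ^ c            # si = xi ^ yi ^ ci
    cout = g | (p & c)   # same as x&y | x&c | y&c
    return s, cout

# exhaustive check of the defining equation 2*cout + s == x + y + c
for x in (0, 1):
    for y in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(x, y, c)
            assert 2 * cout + s == x + y + c
```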
10
FA Gate Level Implementations
AND/XOR/OR adder: 10 gates = 32 transistors, but we would have to map to CMOS gates, so build the xor with a NOR feeding the or input of an AOI21 gate (10 transistors), or two inverters feeding an AOI22 (12 transistors); remember that an or with inverters on its inputs is really a nand. 4 gate delays to sum out, 4 gate delays to carry out; max fan-out of 3 gates on x, y, and cin.
Static CMOS complex gate adder: 3 gate delays to sum out, 2 gate delays to cout; max fan-in of 2 (no more than 2 transistors in series in any gate); 8 gates = 40 transistors; fan-out of 3 gates for t1, x, and y, and 4 gates for cin.
11
Mirror FA: 24 + 4 transistors (inputs x, y, cin; outputs !cout, !s)
A 28-transistor full adder (24 plus 4 for the carry and sum output inverters). No more than 3 transistors in series. Loads: x 8, y 8, cin 6, !cout 2 (in transistors, not logic gates). Number of “gate delays” to sum: 3?
The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors from cin to !cout, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
When laying out the cell, the most critical issue is the minimization of the capacitances at node !cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances). Shared diffusions can reduce the stack node capacitances. The transistors connected to cin are placed closest to the output, since cin is the latest arriving signal.
Only the transistors in the carry stage have to be sized for speed; all transistors in the sum stage can be minimum size (to reduce power). But note that the cell generates !s and !cout.
Sizing: since each input in the carry circuit has a logical effort of 2, the optimal fan-out for each is also 2. Since !cout drives 2 internal and 2 inverter gates (to form cin for the next more significant bit's adder), we should oversize the carry circuit.
12
XOR FA: 16 transistors (inputs cin, x, y; outputs s, cout; built from two xor stages)
16 transistors; Vesterbacka, SiPS 1999.
13
CPL FA: dual rail, 20 + 4*2 = 28 transistors (inputs x/!x, y/!y, cin/!cin; outputs s/!s, cout/!cout; xor networks form the sum, and/or networks form the carry)
Beware of threshold drops. How many transistors in series in the worst case? (3 to cout.) How big a threshold drop? (Vdd - Vt, once again in the or block forming cout.) How about voltage drops (especially in the cout circuitry)?
14
Ripple Carry Adder (RCA)
x3 y3 x2 y2 x1 y1 x0 y0; cout = c4 <- FA <- FA <- FA <- FA <- c0 = cin; s3 s2 s1 s0
Worst case delay T = O(n): T_RCA ≈ T_FA(x,y -> cout) + (n-2)·T_FA(cin -> cout) + T_FA(cin -> s)
So go back and look at the gate level implementations of the FA, concentrating on optimizing the carry path delay. Real goal: make the fastest possible carry path.
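A behavioral sketch of the n-bit ripple structure (Python, illustrative only); the loop makes the O(n) carry path explicit:

```python
def ripple_carry_add(x, y, n, cin=0):
    """n-bit ripple-carry adder built from full adders: the carry
    from each position feeds the next, so worst-case delay is O(n)."""
    s, c = 0, cin
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        s |= (xi ^ yi ^ c) << i
        c = (xi & yi) | ((xi ^ yi) & c)   # gi | pi&ci
    return s, c                            # (sum, carry-out c_n)

assert ripple_carry_add(0b1011, 0b0110, 4) == (0b0001, 1)   # 11 + 6 = 17
```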
15
Carry Paths: on average, the longest carry chain in adding n-bit numbers is of length log2(n) (see proof in Parhami); this matches Winograd's lower bound! Experimental results verify the log2(n) approximation and suggest that log2(1.25n) is an even better estimate. But typical carry chains are usually quite short (and usually at the least significant end of the operand).
16
Coping with Carries Carry-completion sensing (asynchronous)
sense when the carry is done (since the average carry length is O(log n))
Synchronous word parallel adders: compute the carry as quickly as possible (Manchester carry chain, carry lookahead, prefix, carry skip, carry select, …)
Defer carry assimilation (e.g., in multiplication): carry save adders, signed digit adders
17
Binary Adders
synchronous word parallel adders:
carry propagate adders; signed-digit adders (T = O(1), A = O(n)); residue adders
carry propagate adders:
ripple carry (T = O(n), A = O(n))
fast carry propagation: Manchester carry chain (T = O(n), A = O(n)); carry lookahead and prefix (T = O(log n), A = O(n log n)); conditional sum; carry select; carry skip
Tradeoffs: speed versus complexity versus power consumption, but we have to worry about the constants.
There are also bit (digit) serial adders and asynchronous adders.
18
Fast Carry Chain Design
The key to fast addition is a low latency carry network; what matters is whether in a given position a carry is
generated: gi = xi & yi
propagated: pi = xi ^ yi
annihilated (killed): ai = !xi & !yi = !(xi | yi)
transferred: ti = gi | pi = !ai = xi | yi
Note that one and only one of the signals pi, gi, and ai is 1.
This gives the carry recurrence ci+1 = gi | pi&ci:
c1 = g0 | p0&c0
c2 = g1 | p1&g0 | p1&p0&c0
c3 = g2 | p2&g1 | p2&p1&g0 | p2&p1&p0&c0
c4 = g3 | p3&g2 | p3&p2&g1 | p3&p2&p1&g0 | p3&p2&p1&p0&c0
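The recurrence and its unrolled forms can be cross-checked in Python (a behavioral sketch, not course code):

```python
def carries(x, y, c0, n):
    """Form all carries c0..cn from gi = xi & yi and pi = xi ^ yi
    via the recurrence c_{i+1} = gi | pi & ci."""
    c = [c0]
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g, p = xi & yi, xi ^ yi
        c.append(g | (p & c[i]))
    return c

# the high-order carry must agree with ordinary integer addition
for x in range(16):
    for y in range(16):
        for c0 in (0, 1):
            assert carries(x, y, c0, 4)[4] == ((x + y + c0) >> 4) & 1
```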
19
Mirror FA Carry Chain: cout = x&y | y&cin | x&cin
kill (0-propagate) branch: !cout = 1 if !x&!y or !cin&(!x|!y)
generate (1-propagate) branch: !cout = 0 if x&y or cin&(x|y)
cout = x&y | y&cin | x&cin
s = x&y&cin | !cout&(x|y|cin)
20
Manchester Carry Chain Adders
Switches controlled by gi and pi. Total delay = time to form the switch control signals gi and pi + setup time for the switches + signal propagation delay through n switches in the worst case.
When clock is low, the carry nodes precharge; when clock goes high, if gi is high, ci+1 is asserted (goes low).
To prevent gi from affecting ci, the signal pi must be computed as the xor (rather than the or) of xi and yi; this is not a problem, since we need the xor of xi and yi for computing the sum anyway.
Delay is roughly proportional to n^2 (as n pass transistors are connected in series), so usually 4 stages are grouped together and the carry chain is buffered with an inverter between each group.
21
Four Bit-Sliced MC Adder
(figure: four-bit Manchester carry chain slice; inputs x3 y3 x2 y2 x1 y1 x0 y0 and clock; control signals g3 p3 g2 p2 g1 p1 g0 p0; internal nodes !c3 !c2 !c1 !cin; outputs cout, s3 s2 s1 s0)
Dynamic circuit: impact on clock power and timing (have to allow for precharge time). Limit of 4 transistors in a row for speed; beyond that the carry chain has to be buffered.
22
Domino Manchester Carry Chain
(figure: domino Manchester carry chain; clock, cin, g0..g3, p0..p3 in; cout and the intermediate carries out)
The stages compute:
!(g0 | p0&cin)
!(g1 | p1&g0 | p1&p0&cin)
!(g2 | p2&g1 | p2&p1&g0 | p2&p1&p0&cin)
!(g3 | p3&g2 | p3&p2&g1 | p3&p2&p1&g0 | p3&p2&p1&p0&cin)
Note the four pass transistors in series (p3 p2 p1 p0), plus cin and the clock isolation pull-down of the first gate. The chain automatically forms all the intermediate carries as well. Sizing assumes only integer multiples are allowed; should the pfets all be 3?
23
Unrolling the Carry Recurrence
Expanding the recurrence cout = g | p&cin:
c0 = carry_in
c1 = g0 | p0&c0
c2 = g1 | p1&c1 = g1 | p1&g0 | p1&p0&c0
. . .
ci+1 = gi | pi&gi-1 | … | pi&pi-1&…&p1&g0 | pi&pi-1&…&p0&c0
All the p's and g's can be formed in parallel, then all the carries can be formed in parallel, then all the sums can be formed in parallel: approximately 1 + Tcarry_gate + 1 for any size adder, except for fan-in and fan-out limitations on computing the carries!
So, 4 units to do an add?!? What are we forgetting here?
24
4-Bit Carry Lookahead Adder (CLA)
gi = xi & yi, pi = xi ^ yi, si = pi ^ ci
c1 = g0 | p0c0
c2 = g1 | p1g0 | p1p0c0
c3 = g2 | p2g1 | p2p1g0 | p2p1p0c0
c4 = g3 | p3g2 | p3p2g1 | p3p2p1g0 | p3p2p1p0c0
ci+1 requires i+2 inputs to its largest AND and OR terms, so this is only really feasible for n = 4 as shown. We may want to form c4 from g3, p3, and c3 in the normal way if the speed of c4 is not an issue.
Assume gi and pi can be formed in one delay unit, the c terms in 2 units, and the final sum in one unit; then c1, c2, c3, and c4 are all ready in 3 units, and s1, s2, s3 are ready one more unit later (4 units; s0 is ready in 2 units). A constant time adder!?!
Can we build a similar 16-bit CLA? 32-bit CLA? The problem is that the fan-in grows unreasonably.
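A behavioral sketch of the flattened 4-bit lookahead equations (Python, illustrative only):

```python
def cla4(x, y, c0):
    """4-bit CLA: all carries formed directly from the flattened
    AND/OR terms, then si = pi ^ ci."""
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(4)]
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(4)]
    c = [c0] * 5
    c[1] = g[0] | p[0] & c[0]
    c[2] = g[1] | p[1] & g[0] | p[1] & p[0] & c[0]
    c[3] = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c[0]
    c[4] = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
            | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c[0])
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

# all carries agree with ordinary 4-bit addition
for x in range(16):
    for y in range(16):
        assert cla4(x, y, 0) == ((x + y) & 0xF, (x + y) >> 4)
```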
25
Lookahead “Gates”: one gate forms !c1 from g0, p0, and c0; another forms !c2 from g1, g0, p1, p0, and c0
26
!c4 Lookahead “Gate”: inputs g3 g2 g1 g0, p3 p2 p1 p0, c0; output !c4
Still 5 transistors in series, but only 18 transistors for c4 (a similar gate is needed for c3, c2, and c1). The book talks about a Ling adder, another attempt to reduce the “gate” count (the logic complexity) in the carry lookahead logic, but the above “gate” can't be beat! Note that the latest arriving signal, the carry, is closest to the output.
27
16-Bit CLA: Ripple between 4-bit blocks
(figure: four 4-bit lookahead blocks with g3 p3 … g0 p0 inputs; the carry ripples between groups of 4 bits, from c0 up through c4, c8, c12 to c16)
c4 is ready in 3 units; c8 is ready 2 units later (5 units), since the p's and g's are sitting there ready; c12 in 2 more (7 units); c16 in 2 more (9 units); and the high order sum bit, s15, is ready in 10 units.
Question: how long would a ripple carry adder take using the same delay metrics? Assume 3 units for the first FA to form its carry, 2 for each middle FA carry, and 1 unit for the last FA to form its sum once it has its carry: 3 + 2(14) + 1 = 32 units.
How about a comparison of gate complexity; interconnect; power?
28
A Faster 16-Bit CLA Recall
Recall c4 = g3 | p3g2 | p3p2g1 | p3p2p1g0 | p3p2p1p0c0. Define the block generate G0 = g3 | p3g2 | p3p2g1 | p3p2p1g0 and block propagate P0 = p3p2p1p0 (the subscript is the block number); then
c4 = G0 | P0c0
c8 = G1 | P1G0 | P1P0c0
c12 = G2 | P2G1 | P2P1G0 | P2P1P0c0
etc.
Group generate: this group will generate (start) a carry. Group propagate: this group will propagate (continue) a carry.
Assume it takes 2 units (after the little p's and g's are ready) to form the capital P's and G's.
29
16-Bit CLA - Version 2: lookahead between 4-bit blocks
Each block forms G = g3 | p3g2 | p3p2g1 | p3p2p1g0 and P = p3p2p1p0; a second-level lookahead forms
c4 = G0 | P0c0
c8 = G1 | P1G0 | P1P0c0
c12 = G2 | P2G1 | P2P1G0 | P2P1P0c0
c16 = G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0c0
The capital P's and G's are all ready in 3 units, 2 more units form c4, c8, c12, and c16, and then 3 more form the high order sum, s15, for a total of 8 units.
The savings is more dramatic for bigger adders, and yet another level of carry lookahead can be added using the same approach.
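The two-level scheme can be sketched behaviorally (Python, illustrative; block G/P first, then the between-block carries, then ripple within each block for the sums):

```python
def cla16(x, y, c0=0):
    """16-bit two-level CLA sketch: little p/g per bit, block G/P per
    4-bit group, lookahead between groups for c4, c8, c12, c16."""
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(16)]
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(16)]
    G, P = [], []
    for b in range(4):
        gb, pb = g[4*b:4*b+4], p[4*b:4*b+4]
        G.append(gb[3] | pb[3] & gb[2] | pb[3] & pb[2] & gb[1]
                 | pb[3] & pb[2] & pb[1] & gb[0])
        P.append(pb[3] & pb[2] & pb[1] & pb[0])
    bc = [c0]                       # block carries: c0, c4, c8, c12, c16
    for b in range(4):
        bc.append(G[b] | P[b] & bc[b])
    s = 0
    for b in range(4):              # sums from each block's carry-in
        c = bc[b]
        for i in range(4*b, 4*b + 4):
            s |= (p[i] ^ c) << i
            c = g[i] | p[i] & c
    return s, bc[4]

assert cla16(40000, 30000) == ((40000 + 30000) & 0xFFFF, 1)
```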
30
64-Bit Adder Comparisons
RCA version: 2 + 62·1 + 1 = 65 units
4-bit blocks (level 1 LA), RC between blocks: (1 + 2) + 14·2 + (2 + 1) = 34 units
4-bit blocks, level 2 LA between blocks, RC between level 2 lookaheads: ( ) + 2·2 + ( ) = 14 units
4-bit blocks, level 2 and 3 LA between blocks: ( ) + 1·2 + ( ) = 12 units
Diminishing returns, especially considering the complexity of the interconnect.
31
Adder Delay Comparisons
n    RCA   4b CLA blks, RC between blks   4b CLA blks, L2 CLA   L2&3 CLA
8    9     6                              -                     -
16   17    10                             8                     -
32   33    18                             10                    -
48   49    26                             12                    -
64   65    34                             14                    12
Timing model: p, g = 1 unit; 1b FA sum out, carry out = 1 unit if p and g available; P, G = 2 units; 4b CLA logic = 2 units
32
Time Bound on Blocked CLA
8b CLA with a blocking factor of 2 (figure: a tree of cells, 8 A cells at the leaves over levels of B, C, and D cells, with E at the root receiving c0)
A: form p's and g's (1u), pass down
B: form P's and G's (2u), pass down
C: form PP's and GG's (2u), pass down
D: form PPP's and GGG's (2u), pass down
E: form carries (2u), pass up; then D, C, and B each form carries (2u) and pass up; finally A forms the sums (1u)
CLA delay = 1 + 2u·(2·(# of CLA levels) - 1) + 1, where # of CLA levels = log_r n and r is the blocking factor, so CLA delay = 4·log_r n
But what is the area bound? There is a lot of interconnect here; we can get a better feel when we look at recursion adders. What about power?
33
Lookahead Adder Layout Considerations
27b CLA with a blocking factor of 3; 108b CLA (figure: recursive floorplan of AB3, C3, D3, and E4 tiles)
I/O comes in from all four sides; more typical is to have inputs on top and outputs on bottom, from the datapath perspective.
T = O(log n), A = O(n), max wire length = O(√n)
34
Ultimate Fate of Lookahead Adders
The approach has problems with gate fan-in (unless a blocking factor of 2 is used) and with irregular wiring and inefficient layout (unless the previous scheme is used, which works only for certain adder sizes and blocking factors, and which has inputs/outputs coming into/out of all sides, which may not be appropriate for typical (linear) datapaths).
Fortunately, there is another way to design log n adders that doesn't have these problems.
35
Prefix Adders T = O(log n) A = O(n log n)
36
Parallel Prefix Adders
Given the p and g terms for each bit position, computing all the carries is equivalent to finding all the prefixes of
(g0,p0) € (g1,p1) € (g2,p2) € … € (gn-2,pn-2) € (gn-1,pn-1)
in parallel. Since € is associative, we can group the terms in any order (but note that € is not commutative).
37
Parallel Prefix Adders (PPAs)
Define the carry operator € on (g,p) signal pairs:
(g’’,p’’) € (g’,p’) = (g,p), where g = g’’ | p’’&g’ and p = p’’&p’
€ is associative: [(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]
By example: (g’’’,p’’’) € (g’’,p’’) = (g’’’ | p’’’&g’’, p’’’&p’’), and then (g’’’ | p’’’&g’’, p’’’&p’’) € (g’,p’) = (g’’’ | p’’’&g’’ | p’’’&p’’&g’, p’’’&p’’&p’). Thus the pairs can be grouped in any order.
But € is not commutative, since g’’ | p’’&g’ is in general not equal to g’ | p’&g’’.
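The associativity and non-commutativity claims can be checked by brute force (Python; `op` stands in for the € symbol):

```python
from itertools import product

def op(hi, lo):
    """Carry operator: (g'', p'') € (g', p') = (g'' | p''&g', p''&p')."""
    (g2, p2), (g1, p1) = hi, lo
    return (g2 | p2 & g1, p2 & p1)

pairs = list(product((0, 1), repeat=2))
for a, b, c in product(pairs, repeat=3):
    # associative: (a € b) € c == a € (b € c)
    assert op(op(a, b), c) == op(a, op(b, c))

# but not commutative: g''|p''&g' differs from g'|p'&g'' in general
assert op((0, 0), (1, 1)) != op((1, 1), (0, 0))
```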
38
PPA General Structure
Stages: p, g logic (1 unit delay); parallel prefix logic tree (1 unit delay per level); sum logic (1 unit delay)
Measures to consider:
tree cell height (time)
tree cell area; number of € cells
cell fan-in and fan-out
max wiring length
wiring congestion
delay path variation (glitching)
39
Brent-Kung PPA: Tadd = Tsetup + (2·log2n - 2)·t€ + Tsum
(figure: 16-bit Brent-Kung prefix tree of € cells over g15 p15 … g0 p0, producing c16 … c1; a forward tree of height log2n and a reverse tree of height log2n - 2)
We are assuming c0 = 0, so c1 = g0 (in general c1 = g0 | p0c0).
Time = 1 (form the p's and g's) + (2·log2n - 2) (form all the carries) + 1 (form the sums). For n = 16 as shown: 1 + 12 + 1 = 14 units.
Area: width n, height 2·log2n - 2 rows of carry cells.
A regular structure with limited fan-in for all gates; the only two issues to worry about are fan-out (but there is room to insert buffers to deal with this) and a maximum wire length of n/2. How about power??
There are several other kinds of recurrence solvers: Kogge-Stone, Elm, and hybrids.
40
A Faster Yet PPA: the Brent-Kung adder has the time bound TBK = 1 + (2·log2n - 2) + 1. There is an even faster PPA approach, Kogge-Stone:
a faster prefix tree (log2n levels for KS versus 2·log2n - 2 for BK)
fan-out limited to two
takes more cells (n·log2n - n + 1 for KS versus 2n - 2 - log2n for BK) and has more wiring
41
Kogge-Stone PPAdder: Tadd = Tsetup + (log2n)·t€ + Tsum
(figure: 16-bit Kogge-Stone prefix tree; log2n levels of € cells, width n, producing c16 … c1; cin folded in at bit 0)
T = log2n. Note that c16 can be computed at the same time as s15.
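A behavioral Kogge-Stone sketch in Python (illustrative; cin is folded into bit 0 so that every prefix G value is directly a carry):

```python
def kogge_stone_add(x, y, n=16, c0=0):
    """Kogge-Stone prefix adder sketch: log2(n) levels of € cells,
    each position combining with the value 2^level positions back."""
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(n)]
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]
    G[0], P[0] = g[0] | p[0] & c0, 0       # fold c0 in: G[0] becomes c1
    d = 1
    while d < n:                            # log2(n) levels
        G = [G[i] if i < d else G[i] | P[i] & G[i - d] for i in range(n)]
        P = [P[i] if i < d else P[i] & P[i - d] for i in range(n)]
        d *= 2
    c = [c0] + G                            # c_{i+1} = G[i]
    s = sum((p[i] ^ c[i]) << i for i in range(n))
    return s, c[n]

assert kogge_stone_add(40000, 40000) == (80000 & 0xFFFF, 1)
```

Note the two list comprehensions per level both read the previous level's G and P, mirroring the hardware's simultaneous cell updates.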
42
A 4-bit KS Example
Reminder: (g’’,p’’) € (g’,p’) = (g’’ | p’’&g’, p’’&p’)
Level 1 (span 1):
g = g0 | p0&c0 = c1, p = p0&c0
g = g1 | p1&g0, p = p1&p0
g = g2 | p2&g1, p = p2&p1
g = g3 | p3&g2, p = p3&p2
Level 2 (span 2):
g = g1 | p1&g0 | p1&p0&c0 = c2, p = p1&p0&c0
g = g2 | p2&g1 | p2&p1&(g0 | p0&c0) = c3, p = p2&p1&p0&c0
g = g3 | p3&g2 | p3&p2&(g1 | p1&g0), p = p3&p2&p1&p0
Level 3 (span 4):
g = g3 | p3&g2 | p3&p2&g1 | p3&p2&p1&g0 | p3&p2&p1&p0&c0 = c4, p = p3&p2&p1&p0&c0
Check against the unrolled recurrence:
c2 = g1 | p1&c1 = g1 | p1&g0 | p1&p0&c0
c3 = g2 | p2&c2 = g2 | p2&g1 | p2&p1&g0 | p2&p1&p0&c0
c4 = g3 | p3&c3 = g3 | p3&g2 | p3&p2&g1 | p3&p2&p1&g0 | p3&p2&p1&p0&c0
43
PPA Comparisons
Measure            BK PPA (n=64)                KS PPA (n=64)
# of € cells       2n - 2 - log2n = 120         n·log2n - n + 1 = 321
tree depth         2·log2n - 2 = 10             log2n = 6
tree area (WxH)    (n/2)·(2·log2n - 2) = 320    n·log2n = 384
cell fan-in        2                            2
cell fan-out                                    2
max wire length    n/4 = 16                     n/2 = 32
wiring density     sparse                       dense
glitching          high                         low
44
Other Prefix Adders
Hybrid BK-KS: 1st level BK, middle levels KS, last level BK; faster than BK, slower than KS; the regular wiring of KS, but with less wiring congestion
Elm: the tree cells compute the partial sum bits along with the partial carries; reduces the number of inter-cell interconnections, leading to smaller, faster adders; faster if wire length matters more than the number of gates traversed in determining the delay
Han-Carlson
Wei-Thompson
45
Carry Skip Adders (aka carry bypass adders): T = O(n) with fixed-size blocks (O(√n) with optimized block sizes), A = O(n)
46
Carry Skip Adder x3 y3 x2 y2 x1 y1 x0 y0 c3 FA FA FA FA cin cout s3 s2
BP = p0 & p1 & p2 & p3, the “block propagate” (same as the CLA block propagate)
Idea: if (p0 & p1 & p2 & p3) = 1, then cout = cin; else the block kills the carry or generates a carry internally
47
Carry Skip Chain Implementation
(figure: block carry-in cin and g0 p0 p1 p2 p3 g1 g2 g3 feed the chain; BP steers the block carry around it to the block carry-out !cout)
Only 10% to 20% area overhead. Only 2 “gate delays” to produce cout if a skip occurs.
48
4b Block Carry Skip Adder
(figure: four 4-bit blocks, bits 0 to 3 through bits 12 to 15, each with setup, carry propagation, and sum stages; cin enters the low block)
Setup forms the p's and g's. For n bits in n/m blocks of m bits each, the worst-case delay is a carry from bit 0 to bit 15: the carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two blocks (m = block size in bits), and ripples in the last block from bit 12 to bit 15:
TCSkA = Tsetup + m·Tcarry + ((n/m) - 2)·Tskip + (m - 1)·Tcarry + Tsum
49
Optimal Block Size Assuming Tsetup = Tcarry = Tskip = Tsum = 1
TCSkA = 1 + m + (n/m - 2) + (m - 1) + 1   [Tsetup; ripple in 1st blk; skips; ripple in last blk; Tsum]
      = 2m + n/m - 1
So the optimal block size m satisfies dTCSkA/dm = 2 - n/m² = 0, giving mopt = √(n/2),
and the optimal time is TCSkA = 4·√(n/2) - 1 = 2·√(2n) - 1.
So if n = 32, mopt = 4 bits and Topt = 15, compared to 32 or more for a ripple-carry adder: more than twice as fast.
The pass chain used to implement BP would also argue for no more than 4 bits in a group.
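The optimization arithmetic is easy to sanity check (Python, assuming the slide's unit-delay model):

```python
from math import sqrt

def t_cska(n, m):
    """TCSkA = 2m + n/m - 1 under unit delays: setup + ripple in the
    first block + skips + ripple in the last block + sum."""
    return 2 * m + n / m - 1

n = 32
m_opt = sqrt(n / 2)                 # from dT/dm = 2 - n/m^2 = 0
assert m_opt == 4.0
assert t_cska(n, m_opt) == 2 * sqrt(2 * n) - 1 == 15
```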
50
Variable Carry Skip Addition
Clearly, a carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so we can allow more ripple stages (wider blocks) for the inner carries without increasing the overall delay.
Blocks bl(t-1), bl(t-2), …, bl(1), bl(0), where t is the number of blocks. The key is DELAY BALANCING:
from carry path 2 (one fewer skip): block t-2 can be made one bit wider than block t-1 without increasing the total adder delay
from carry path 3 (one fewer skip): block 1 can be made one bit wider without increasing the delay
51
Optimal Variable Block Sizes
Block sizes: m, m+1, …, m + t/2 - 1, m + t/2 - 1, …, m+1, m (t blocks in all).
Since the total number of bits in the t blocks is 2·[m + (m+1) + … + (m + t/2 - 1)] = t·(m + t/4 - 1/2) = n, we get m = n/t - t/4 + 1/2,
giving an adder delay of
TVSkA = 1 + m + (t - 2) + (m - 1) + 1   [setup; ripple in 1st blk; skips; ripple in last blk; sum]
      = 2m + t - 1 = 2n/t + t/2
So the optimal number of blocks satisfies dTVSkA/dt = 0, giving topt = 2·√n, and the optimal time is TVSkA = 2·√n.
Note that the optimal number of blocks is √2 times larger than that obtained with fixed size blocks. For our 32-bit example, TVSkA = 12 (vs 15 for TCSkA). Also note that with the optimal number of blocks, m becomes 1/2, so we take it to be 1. Toptimal is roughly a factor of √2 smaller than that obtained with optimal fixed size skip blocks.
52
Variable Carry Skip Adder
Delay balancing considerations (figure: a maximum-width skip adder with 8 units of total delay; block sizes 1, 2, 3, 4, 3, 2, 1 give 16 bits of width; the annotations show the ripple time, including p and g, and the BP arrival in each block)
Remember that the example only looks at carries that start at the ls end and end at the ms end; other variations must also be considered to determine the optimal sizes. The timing shows the considerations for when the carry ripple is working from the least significant end.
However, we also have to consider how we are forming BP and doing the ripple inside the blocks; e.g., with a Manchester carry chain, the delay in the carry chain grows as the square of the block width, so this may not be the optimal design in practice.
53
Adder Delay Comparisons
n    RCA   1-level CSkA (#b/blk, #blks)   1-level VSkA (blk sizes)
8    9     7 = 2+2+3 (2,4)                6 ( )
16   17    9 = 1+4+4 (3,5 & 1,1)          ( )
32   33    15 = 4+6+5 (4,8)               12 ( )
48   49    17 = 3+8+6 (5,9 & 3,1)         14 ( )
64   65    20 = 4+9+7 (6,10 & 1,4)        ( )
(from 1999 notes; needs updating)
Timing model: p, g = 1 unit; 1b FA sum out, carry out = 1 unit if p and g available; skip blk = 1 unit
54
Carry Skip Adder Comparisons
(figure: delay plots for carry skip adder configurations, including m=3 t=4, m=4 t=5, m=5 t=7, m=6 t=8, t=9, and m=1 t=5, m=2 t=7/t=9, m=3 t=10/t=12)
Speaker note: THE NUMBERS HERE ARE WRONG; the m and t numbers need checking and the plot lines need adding. This slide is supposed to show what happens when you don't start with m = 1 (as in the single plot line from the previous table's numbers).
55
Multilevel Carry Skip Addition
What about allowing a carry to skip over several blocks at once?
(figure: blocks of width 6, 7, and 5 plus a second skip level; the level-2 skip signal is the AND of the first level skip signals, the BP's; total delay 10)
A carry that would need 3 time units to skip the last three blocks in a single level carry skip adder can now do so in a single time unit. If a block is short, there may not be any advantage to skipping (over just allowing the carry to ripple).
Notice that this adder is NO LONGER delay balanced if the carry starts in the next least significant block: its output settles in 6 units, and with two skips (assuming the next two blocks have their BP set) the input to the high order block is ready in 8 units (NOT 6), giving an overall delay of 12 (NOT 10).
56
Carry Select Adders: T = O(n) (or O(√n) with some work), A = O(n)
57
Carry Select Adder x’s y’s m-b Setup
Idea: precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (this can be done for all blocks in parallel), and then select the correct one: each block computes its sums and carries both with and without a carry-in, and a multiplexer picks the right set once the real carry-in is known.
(figure: setup forms the p's and g's; “0” and “1” carry propagation chains feed a multiplexer; sum generation produces the s's)
58
Carry Select Adder: Critical Path
(figure: four 4-bit blocks, bits 0 to 3 through bits 12 to 15; each block has setup, a “0” carry chain, a “1” carry chain, a mux, and sum generation; the mux select ripples from block to block, +1 unit per block)
Tadd = Tsetup + m·Tcarry + (n/m)·Tmux + Tsum, where n is the number of bits in the adder and m is the number of bits per block.
According to the book, it is easy to show that the carry select adder is more cost effective than the ripple carry adder if n > 16/(α - 1), where α is defined by cadd(n) = α·n for RCAs. For α = 4 and τ = 2, the carry select approach is almost always preferable to ripple carry.
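A behavioral sketch of the select scheme (Python, illustrative; each block's two outcomes are "precomputed" and a conditional plays the role of the mux):

```python
def carry_select_add(x, y, n=16, m=4, cin=0):
    """Carry-select adder sketch: every m-bit block forms its sum and
    carry-out for both cin = 0 and cin = 1; the real block carry-in
    then selects between them (the mux chain)."""
    mask = (1 << m) - 1
    s, c = 0, cin
    for b in range(0, n, m):
        xb, yb = (x >> b) & mask, (y >> b) & mask
        t0 = xb + yb          # block result assuming carry-in 0
        t1 = xb + yb + 1      # block result assuming carry-in 1
        t = t1 if c else t0   # the "mux": pick with the real carry-in
        s |= (t & mask) << b
        c = t >> m
    return s, c

assert carry_select_add(0xFFFF, 1) == (0, 1)
```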
59
Square Root Carry Select Adder
(figure: blocks of increasing size covering bits 0 to 1, 2 to 4, 5 to 8, 9 to 13, and 14 to 19; each block has setup, “0” carry, “1” carry, mux, and sum generation; block carry times +2, +3, +4, +5, +6)
Delay balancing: make the later blocks bigger, so that each block's precomputed carries are ready just as its mux select arrives.
Tadd = Tsetup + 2·Tcarry + √n·Tmux + Tsum
How about two-level carry select, as in the book?
60
Adder Delay Comparisons
n    RCA   1-level CSkA   1-level VSkA   CCSlA (4b blks, #blks)   VCSlA (sizes)
8    9     7              6              8 = 2+6 (2)              7 = 2+1+4 (4-3-2)
16   17                                  10 = 2+2+6 (4)           9 = 2+3+4 ( )
32   33    15             12             14 = 2+6+6 (8)           11 = 2+5+4 ( )
48   49                   14             18 = 2+10+6 (12)         13 = 2+7+4 (10-9-…-3-2)
64   65    20                            22 = 2+14+6 (16)         14 = 2+8+4 (11-10-…-3-2)
(from 1999 notes; needs updating)
Timing model: p, g = 1 unit; 1b FA sum out, carry out = 1 unit if p and g available; skip blk = 1 unit; 2x1 mux = 1 unit
61
Adder Delay Comparisons
n    RCA   1-level CSkA   1-level VSkA   CCSlA (4b blks)   VCSlA   BK PPA   KS PPA
8    9     7              6              8                 7       6        5
16   17    9                             10                9       8        6
32   33    15             12             14                11      10       7
48   49    17             14             18                13
64   65    20                            22                14      12       8
(from 1999 notes; needs updating; BK and KS entries follow 1 + (2·log2n - 2) + 1 and 1 + log2n + 1 under the timing model below)
Timing model: p, g = 1 unit; 1b FA sum out, carry out = 1 unit if p and g available; skip blk = 1 unit; 2x1 mux = 1 unit; € cell = 1 unit
62
PPA Adder Comparisons
63
Sparse-Tree Adder: add slides on the adder presented in “A 4GHz 130nm Address Generation Unit with 32b Sparse-Tree Adder Core” by Mathew et al. (Intel), IEEE Journal of Solid-State Circuits, 38(5), May 2003, pp. 689. It is a combination of a front-end “sparse-tree” KS (prefix) adder with back-end 4b carry select adders.
64
Conditions and Exceptions
Adder condition flags:
cout: indicating a carry-out of 1
overflow: indicating the sum is incorrect
negative: indicating the sum is negative
zero: indicating the sum is zero
For unsigned numbers, cout and overflow are the same; the sign is irrelevant.
Overflow2’sC = cn ^ cn-1
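The four flags, including the two's-complement overflow rule Overflow = cn ^ cn-1, can be modeled directly (Python, illustrative):

```python
def add_flags(x, y, n=8):
    """Condition flags for an n-bit add; c_{n-1} (the carry into the
    msb) is recovered from x ^ y ^ s at the top bit position."""
    mask = (1 << n) - 1
    x, y = x & mask, y & mask
    s = (x + y) & mask
    cout = (x + y) >> n                      # c_n
    c_top = ((x ^ y ^ s) >> (n - 1)) & 1     # c_{n-1}
    return {
        "sum": s,
        "cout": cout,
        "overflow": cout ^ c_top,            # two's-complement overflow
        "negative": (s >> (n - 1)) & 1,
        "zero": int(s == 0),
    }

f = add_flags(0x7F, 0x01)    # 127 + 1 overflows in 8-bit two's complement
assert f["overflow"] == 1 and f["negative"] == 1 and f["cout"] == 0
```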
65
Domino Zero Detector Circuit
(figure: s7 s6 s5 s4 s3 s2 s1 s0 into a single clocked domino gate producing “not zero”)
How would you build it in static CMOS? A 16-wide fan-in OR function using a tree composed of NAND and NOR gates, as opposed to one dynamic gate!
66
ALUs: ALUs have to be able to do more than add and subtract (e.g., bit wise and, or, xor, …):
“snip” the carry chain (for bit wise operations)
“mine” the FA gates to do the logic operations; e.g., go back and look at the FA gate implementations and see if the logic can be mined to produce each logic function
(figure: x3 y3 x2 y2 x1 y1 x0 y0 into a chain of four FAs, cout … cin, producing s3 s2 s1 s0)
67
Key References
Brent, Kung, A regular layout for parallel adders, IEEE Trans. Computers, 31, 1982.
Chan, Schlag, Analysis and design of CMOS Manchester adders with variable carry skip, IEEE Trans. Computers, 39(8), 1990.
Chan, Delay optimization of carry skip adders and block carry lookahead adders, IEEE Trans. Computers, 41(8), 1992.
Han, Carlson, Fast area-efficient VLSI adders, Proc. ARITH 8, 49-56, 1987.
Kantabutra, Designing optimum one level carry skip adders, IEEE Trans. Computers, 42(6), 1993.
Kelliher, Elm: a fast addition algorithm discovered by a program, IEEE Trans. Computers, 41(9), 1992.
Knowles, A family of adders, Proc. ARITH 14, 1999.
Kogge, Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Computers, C-22(8), 1973.
Ladner, Fischer, Parallel prefix computation, JACM, 27(4), 1980.
Ling, High-speed binary adder, IBM J. Research and Dev., 25(3), 1981.
Nagendra, Power, Delay & Area Tradeoffs in CMOS Arithmetic Modules, PhD Thesis, PSU, 1996.
Ngai, Irwin, Regular area-time efficient carry-lookahead adders, J. Parallel and Dist. Computing, 3(3):92-105, 1984.
Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, 1996.
Sklansky, Conditional-sum addition logic, IRE Trans. Electronic Computers, 9(2), 1960.
Sugla, Carlson, Extreme area-time optimal adder design, IEEE Trans. Computers, 39(2), 1990.
Vesterbacka, A 14-transistor CMOS full adder with full voltage swing nodes, Proc. SiPS, Oct 1999.
Wei, Thompson, Area-time optimal adder design, IEEE Trans. Computers, 39(5), 1990.