Spring 2006 EE VLSI Design II - © Kia Bazargan
Slide 68: EE 5324 – VLSI Design II
Kia Bazargan, University of Minnesota
Part II: Adders
Slide 69: References and Copyright
Textbooks referenced
– [WE92] N. H. E. Weste, K. Eshraghian, "Principles of CMOS VLSI Design: A System Perspective", Addison-Wesley, 2nd Ed.
– [Rab96] J. M. Rabaey, "Digital Integrated Circuits: A Design Perspective", Prentice Hall.
– [Par00] B. Parhami, "Computer Arithmetic: Algorithms and Hardware Designs", Oxford University Press, 2000.
Slide 70: References and Copyright (cont.)
Slides used
– [©Hauck] © Scott A. Hauck; G. Borriello, C. Ebeling, S. Burns, 1995, University of Washington
– [©Prentice Hall] © Prentice Hall 1995, © UCB 1996. Slides for [Rab96]
– [©Oxford U Press] © Oxford University Press, New York, 2000. Slides for [Par00], with permission from the author
Slide 71: Outline
– One-bit adder, basic ripple-carry adder
– Carry-lookahead adders (CLA)
– Manchester carry chain
– Carry bypass
– Carry select adder
– Brent-Kung adder
Slide 72: Why Adders?
– Addition: a fundamental operation
  o Basic block of most arithmetic operations
  o Address calculation
– Faster, faster and faster. How?
  o Architectural-level optimization
  o Gate-level optimization
– Speed/area trade-off
Slide 73: Adding Two One-bit Operands
– One-bit half adder (HA): Sum = A ⊕ B, Cout = A.B
– One-bit full adder (FA): Sum = A ⊕ B ⊕ Cin, Cout = A.B + B.Cin + A.Cin
[Figure: HA and FA block symbols with their truth tables (inputs A, B, Cin; outputs Sum, Cout)]
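The half-adder and full-adder equations can be checked by brute force. The sketch below is a minimal Python model, not part of the original slides: the function names `half_adder`, `full_adder` and `check_adders` are illustrative, and each signal is modeled as an integer 0 or 1. It confirms that Sum + 2*Cout equals the arithmetic sum of the inputs for every input combination.

```python
from itertools import product

def half_adder(a, b):
    """Sum = A xor B, Cout = A.B (one-bit operands as ints 0/1)."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """Sum = A xor B xor Cin, Cout = A.B + B.Cin + A.Cin."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def check_adders():
    """Return True if both cells agree with integer addition on every input."""
    for a, b in product((0, 1), repeat=2):
        s, cout = half_adder(a, b)
        if s + 2 * cout != a + b:
            return False
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        if s + 2 * cout != a + b + cin:
            return False
    return True

print(check_adders())  # True
```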
Slide 74: N-Bit Ripple-Carry Adder: Series of FA Cells
– To add two n-bit numbers, chain n FA cells: cell i takes Ai, Bi and Ci, and produces Si and Ci+1
– Note: adder delay = Tc * n, where Tc = Cin-to-Cout delay of one FA
[Figure: FA cells for bits 0 to n-1 connected from C0 to Cn]
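As a behavioral illustration of the ripple-carry structure, the following Python sketch chains full-adder cells from the least significant bit upward. It models function only, not delay. The helper names (`to_bits`, `from_bits`, `ripple_carry_add`) are assumptions made for this example, and bit lists are stored LSB first.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def to_bits(x, n):
    """Integer to list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """List of bits (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, c0=0):
    """Add two equal-length bit lists; the carry ripples from bit 0 upward."""
    carry = c0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # Cout of this cell feeds the next
        sum_bits.append(s)
    return sum_bits, carry

s, c4 = ripple_carry_add(to_bits(0b0011, 4), to_bits(0b0101, 4))
print(from_bits(s), c4)  # 8 0
```

The loop makes the linear dependence visible: cell i cannot finish until cell i-1 has produced its carry, which is where the Tc * n delay comes from.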
Slide 75: 4-bit Ripple Carry Addition: Example
– A=0011, B=0101
– T=0: S=0000; the sum settles as the carry ripples through C1, C2, C3, reaching S=1000 at T=4
[Figure: four FA cells with carries C0 to C4, animated over T=0 to T=4]
Slide 76: One-bit Full Adder Implementation
– Direct gate implementation
– Cout = A.B + B.Cin + A.Cin = A.B + Cin.(A+B)
– Sum = A ⊕ B ⊕ Cin
– 32 transistors used [WE92] p516
[Figure: gate-level schematic producing Sum and Cout from A, B, Cin]
Slide 77: One-Bit Full Adder: Share Logic
– An observation: almost always, Sum = NOT Cout
– The exceptions are inputs 000 and 111, so: Sum = A.B.Cin + (A+B+Cin).Cout'
  o The term A.B.Cin includes the 111 case
  o The factor (A+B+Cin) excludes the 000 case
[Figure: FA truth table (Cin, A, B, Sum, Cout)]
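The shared-logic expression for Sum, with Cout taken in complemented form, can be verified against the XOR definition over all eight input combinations. This is a minimal sketch; the function name `shared_logic_fa` is an assumption for illustration.

```python
from itertools import product

def shared_logic_fa(a, b, cin):
    """Full adder where Sum reuses the complement of Cout."""
    cout = (a & b) | (cin & (a | b))
    not_cout = 1 - cout
    s = (a & b & cin) | ((a | b | cin) & not_cout)
    return s, cout

# Collect every input where the shared-logic Sum differs from A xor B xor Cin.
mismatches = [
    (a, b, cin)
    for a, b, cin in product((0, 1), repeat=3)
    if shared_logic_fa(a, b, cin)[0] != a ^ b ^ cin
]
print(mismatches)  # []
```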
Slide 78: One-Bit Full Adder: Transistor Implementation
– Cout = A.B + C.(A+B)
– Sum = A.B.C + (A+B+C).Cout'
– Use inverters to get Cout and Sum
– C transistors close to output
– Cout delay: 2 inverting stages (1-stage possible?)
– Sum delay: 3 inverting stages (not an issue, though)
– 28 transistors [WE92] p517, [Rab96] p390
[Figure: transistor-level schematic of the full adder]
Slide 79: One-Bit Full Adder: Inverted Inputs
– An observation: inverting all inputs inverts both outputs
– Exploit this property: get rid of the inverter on the carry critical path
[Figure: an FA with inverted inputs A, B, Cin is equivalent to an FA with inverted outputs Sum, Cout]
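The inverting property is easy to confirm exhaustively. The sketch below (illustrative names, signals modeled as 0/1 integers) checks that feeding complemented inputs to a full adder yields the complement of both outputs.

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def inverting_property_holds():
    """True if FA(A', B', Cin') == (Sum', Cout') for all eight inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, cout_inv) != (1 - s, 1 - cout):
            return False
    return True

print(inverting_property_holds())  # True
```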
Slide 80: Ripple Carry Adder: Inverting Property
– FA' is similar to FA, but with no inverters on the outputs
– Much faster (1 stage)
– Disadvantage: not a regular data path
[Figure: chain of FA' cells for bits 0 to 3 with carries C0 to C4]
Slide 81: Summary: Ripple-Carry Adder
– Basic ripple carry: AND-OR gates
  o Area: 32 transistors (per bit position)
  o Delay: 2 stages of inverting logic (per bit position)
– Direct CMOS logic, share Cout'
  o Area: 28 transistors
  o Delay: 2 stages
– Use the "inverting" property
  o Area: 27 (odd bits: 26, even bits: 28)
  o Delay: ~1 stage
– So far: transistor/logic manipulation. Is that all we can do?
Slide 83: Carry-Lookahead Adder: Idea
– New look: carry propagation
– Idea: try to "predict" Ck earlier than Tc*k
– Instead of passing through k stages, compute Ck separately using 1-stage CMOS logic
[Figure: carry propagation example, with rows Bit position, Carry, A, B, Sum]
Slide 84: Carry-Lookahead Adder (CLA): One Bit
– What happens to the propagating carry in bit position k?
  o A=0, B=0: Cout = 0 (kill)
  o A≠B: Cout = Cin (propagate, whether Cin is 0 or 1)
  o A=1, B=1: Cout = 1 (generate)
– p = A+B (or A ⊕ B)
– g = A.B
[Figure: truth table (A, B, Cin, Cout) and one-bit carry circuit; [Rab96] p391]
Slide 85: CLA: Propagation Equations
– If C4 = 1, then one of the following holds:
  o g3: generated at bit position 3
  o g2.p3: generated at bit position 2, propagated through 3
  o g1.p2.p3: generated at bit position 1, propagated through 2, 3
  o g0.p1.p2.p3: generated at bit position 0, propagated through 1, 2, 3
  o Cin.p0.p1.p2.p3: input carry, propagated through 0, 1, 2, 3
– C4 = g3 + g2.p3 + g1.p2.p3 + g0.p1.p2.p3 + Cin.p0.p1.p2.p3
– Implement C4 as one-stage CMOS logic: delay = 1 (or is it?)
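The lookahead equation for C4 can be checked against ordinary integer addition. The sketch below is illustrative only: it assumes p = A+B (OR) and g = A.B per bit, evaluates both the sum-of-products form and an algebraically equivalent factored form, and compares each with the true carry out of a 4-bit addition. All function names are made up for this example.

```python
def bits(x, n=4):
    """Integer to list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def propagate_generate(a, b):
    """Per-bit p = A+B (OR) and g = A.B (AND)."""
    p = [ai | bi for ai, bi in zip(a, b)]
    g = [ai & bi for ai, bi in zip(a, b)]
    return p, g

def c4_sum_of_products(a, b, cin):
    p, g = propagate_generate(a, b)
    return (g[3]
            | (g[2] & p[3])
            | (g[1] & p[2] & p[3])
            | (g[0] & p[1] & p[2] & p[3])
            | (cin & p[0] & p[1] & p[2] & p[3]))

def c4_factored(a, b, cin):
    p, g = propagate_generate(a, b)
    return g[3] | (p[3] & (g[2] | (p[2] & (g[1] | (p[1] & (g[0] | (p[0] & cin)))))))

def check_c4():
    """True if both forms equal the real carry out for all 512 input cases."""
    for x in range(16):
        for y in range(16):
            for cin in (0, 1):
                expected = (x + y + cin) >> 4
                if c4_sum_of_products(bits(x), bits(y), cin) != expected:
                    return False
                if c4_factored(bits(x), bits(y), cin) != expected:
                    return False
    return True

print(check_c4())  # True
```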
Slide 86: CLA: Static Logic Implementation
[Figure: static CMOS gate computing C4 from Cin, p0 to p3 and g0 to g3, with transistors labeled d, e, f, h, j, k, l, m, n, o, q, r, s, t, u, v, w, x and two annotated paths to C4; [©Hauck], [Rab96] p405]
Slide 87: CLA: Dynamic Logic Implementation
– Dynamic gate implementation: C4 = g3 + p3.(g2 + p2.(g1 + p1.(g0 + p0.Cin)))
– 6 transistors in series
[Figure: dynamic gate with inputs Cin, p0 to p3, g0 to g3 and output C4; [©Hauck], [WE92] p529]
Slide 88: CLA: Dynamic Logic Implementation
– Can we reuse logic? Can we get C1, C2 and C3 from the same circuit?
– No! C1, C2 and C3 may be floating (not precharged)
– Charge sharing problem
[Figure: the C4 dynamic gate with candidate taps C1?, C2?, C3?; [©Hauck]]
Slide 89: CLA: Dynamic Logic Implementation
[Figure: four separate dynamic gates, one each for C1, C2, C3 and C4, built from Cin and the p, g signals of the lower bits; [WE92] p529]
Slide 90: CLA: Basic Block (4 Bits) Architecture
– Block of 4-bit p, g, Cout
[Figure: per-bit p,g cells take Ai, Bi (i = 0 to 3) and produce pi, gi; the carry logic takes C0 and produces C1 to C4; sums S0 to S3]
Slide 91: CLA: N-Bit Architecture
– Put it all together
[Figure: 4-bit blocks (bits 0-3, bits 4-7, ...) with p,g cells and a carry generator; block carries C0, C4, C8, ... chain between blocks]
Slide 92: CLA: 12-Bit Example
[Figure: 12-bit addition of A and B, animated over time steps up to T=4]
Slide 93: Summary: Carry Lookahead Adder
– CLA compared to ripple-carry adder:
  o Faster ("4 times"?), but delay still linear (w.r.t. number of bits)
  o Larger area: P, G signal generation; carry generation circuits; a carry generation circuit for each bit position (no re-use)
– Limitation: cannot go beyond 4 bits of look-ahead
  o Large p,g fan-out slows down carry generation
– Next: Manchester carry chains
  o Try to reuse logic by pre-charging each carry position
Slide 95: Recap: Carry Look-Ahead
– Charge sharing problem
[Figure: dynamic C4 gate with inputs Cin, p0 to p3, g0 to g3 and internal taps C1?, C2?, C3?]
Slide 96: Manchester Carry Chain: First Shot
– Improvement over CLA: precharge internal nodes (C1, C2, C3) to avoid the charge-sharing problem
– Fastest way to do small adders
– 6 transistors on the critical path
[Figure: Manchester carry chain; [©Hauck]]
Slide 97: Manchester Carry Chain: Sizing
[Figure: plot of delay, where "k" is the sizing factor; [©Prentice Hall]]
Slide 98: Manchester Carry Chain: An Improvement
– Problem: Cin arrives late, so move it closer to the output
– Use bypass logic
[Figure: carry chain (g0 to g3, p0 to p3) producing C4, plus a bypass path from Cin gated by p0, p1, p2, p3; [©Hauck]]
Slide 99: Manchester Carry Chain: the Improvement
– Direct implementation
– Advantages of the carry bypass circuitry:
  o Only 5 series transistors
  o Less capacitance in internal nodes
  o Cin close to the output
[Figure: chain with internal carries C1, C2, C3 and carry bypass circuitry driving C4; [©Hauck]]
Slide 100: Manchester Carry Chain: Summary
– Compared to CLA:
  o Smaller area: pre-charge internal nodes; reuse logic for intermediate carry signals
  o Cin close to the output
– Carry chain can be any length
  o Series propagate is slow (O(n^2) delay), so buffer every 4 bits
– Compact adder: good for up to 16 bits
– Using carries to compute the sum slows down the MCC
  o Use two carry chains: one for sum, one for carry propagation
[©Hauck]
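The O(n^2) delay of the series propagate path can be motivated with a rough RC estimate. As a sketch, assume every pass transistor in the chain has the same on-resistance R and every internal node the same capacitance C; both symbols are assumptions introduced here, not values from the slides. The Elmore delay of an n-stage chain is then:

```latex
t_p \approx 0.69 \sum_{i=1}^{n} C \left( \sum_{j=1}^{i} R \right)
    = 0.69\, R C \sum_{i=1}^{n} i
    = 0.69\, \frac{n(n+1)}{2}\, R C
```

Under the same assumptions, splitting the chain into segments of m bits with a buffer of (hypothetical) delay t_buf after each segment gives roughly (n/m) * (0.69 * m(m+1)/2 * RC + t_buf), which grows linearly in n for a fixed m. That is the reasoning behind buffering every few bits.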
Slide 102: Carry Bypass Adder: Idea
– The "bypass" idea is general
  o Not just for the Manchester carry chain
  o The local carry chain could be a ripple-carry adder
– Structure
  o Could be static, dynamic, pass transistor
  o Carry and sum paths shown in different colors
  o Bypass logic determines: "pass" or "kill/generate"?
[Figure: block for bits i to i+k with Setup, Local Carry Chain and Sum stages; input carry Ci, bypass multiplexer, output carry Ci+k+1]
Slide 103: Carry Bypass Adder: Cell Examples
– Static implementation, using a ripple-carry adder as the local carry chain: FA cells, with bypass controlled by p0.p1.p2.p3
– Dynamic, Manchester (mux = wire!)
[Figure: both cells; the Manchester cell uses g0 to g3, p0 to p3 and Cin and produces C4; [Rab96] p398]
Slide 104: Carry Bypass Adder: Cell Examples (cont.)
– Static (pass transistor logic), Manchester
– T1 = (p0.p1.p2).p3
– T2 = p3
– T3 = p0.p1.p2.p3
[Figure: pass-transistor chain from C0 to C4 using p0 to p2, g0 to g3 and T1, T2, T3; [WE92] p531]
Slide 105: Carry Bypass Adder: the Structure and Timing
– Timing (critical path shown in a different color):
  1-Setup
  2-Local carry generate/kill, MUX select line ready
  3-C0 to C16 carry propagate (if applicable)
[Figure: four cascaded blocks starting at C0 (Bit 0-3, Bit 4-7, Bit 8-11 and one more), each with Setup, Local Carry Chain and Sum; [Rab96] p.399]
Slide 106: Carry Bypass Adder: Timing of a Sub-block
For an intermediate stage (e.g. Bit 8-11), after setup:
– If in pass mode
  o Local carry vector computes intermediate carries (possibly incorrectly)
  o At the same time, mux selection is set to pass
  o When input carry arrives, intermediate carries might be recomputed
  o Meanwhile, input carry is sent to Cout
– If not in pass mode (assume bit 10 generates)
  o Local carry vector computes intermediate carries (bits 10, 11 correct)
  o At the same time, mux selection is set to local
  o Meanwhile, output carry is sent to Cout correctly
  o When input carry arrives, intermediate carries C8 and C9 (S8, S9, S10) will be recomputed correctly
Slide 107: Carry Bypass Adder: Timing
– Delay = t_setup + max{t_select, 4 x t_FA} + 3 x t_mux_pass + … x t_FA + t_sum
[Figure: four cascaded blocks starting at C0 (Bit 0-3, Bit 4-7, Bit 8-11 and one more), each with Setup, Local Carry Chain and Sum]
Slide 108: Carry Bypass Adder: Pros and Cons
– Speed:
  o Faster than ripple adder
  o Still linear!
– Area overhead:
  o Mux (setup?)
– Not worth it for small adders (N<8)
– 10-20% for large adders
[Figure: propagation delay versus number of bits for ripple adder and bypass adder, with the 4..8 bit range marked; [Rab96] p.399]
Slide 110: Carry Select Adder: the Idea
– Similar to bypass
– Instead of "waiting" for the input carry, "precompute" the carry output
  o Compute Ci+k for both cases Ci=0 and Ci=1
  o When Ci arrives, select the appropriate result
– Sum computed in one step after the intermediate carry signals are ready
[Figure: k-bit block with Setup (p,g), a 0-carry propagation chain, a 1-carry propagation chain, multiplexers selected by Ci producing Ci+k and the carry vector, then Sum Generation; [Rab96] p.400]
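The select idea can be sketched behaviorally in Python. This is a simplified model under stated assumptions: each block is computed twice with a plain ripple chain (once for carry-in 0, once for carry-in 1), and the multiplexer picks whole block results rather than a carry vector followed by separate sum generation. The names `ripple_block` and `carry_select_add` and the defaults n=16, k=4 are illustrative.

```python
def ripple_block(a_bits, b_bits, cin):
    """Ripple through one block; returns (sum_bits, carry_out)."""
    carry = cin
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a | b))
    return out, carry

def carry_select_add(x, y, n=16, k=4, c_in=0):
    """Add two n-bit integers using k-bit carry-select blocks."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    carry = c_in
    sum_bits = []
    for lo in range(0, n, k):
        blk_a, blk_b = a[lo:lo + k], b[lo:lo + k]
        # Precompute both outcomes; they do not depend on the incoming carry.
        sum0, cout0 = ripple_block(blk_a, blk_b, 0)
        sum1, cout1 = ripple_block(blk_a, blk_b, 1)
        # Multiplexer: the incoming block carry selects one precomputed result.
        block_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        sum_bits.extend(block_sum)
    total = sum(bit << i for i, bit in enumerate(sum_bits))
    return total, carry

print(carry_select_add(0x00FF, 0x0001))  # (256, 0)
```

In software the two candidate results are computed one after the other; in hardware all blocks evaluate both chains concurrently, so only the multiplexer delay accumulates from block to block.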
Slide 111: Linear Carry Select Adder: Structure
[Figure: four cascaded blocks, each with Setup, 0-Carry and 1-Carry chains, a multiplexer and Sum; block carries C0, C4, C8, C12, C16; [Rab96] p.401]
Slide 112: Linear Carry Select Adder: Timing
– Delay = 7 (16 bits)
[Figure: blocks for Bits 0-3, 4-7, 8-11 and one more, each with Setup, 0-Carry and 1-Carry chains, a multiplexer and Sum, annotated with signal arrival times from C0 to C16; [Rab96] p.401]
Slide 113: Square Root Carry Select Adder: the Idea
– Later stages have to wait for the multiplexers in the earlier stages
– Why not give them bigger chunks of data to compute?
  o Balances the delay paths
  o Sub-linear delay (we will see why)
Slide 114: Square Root Carry Select Adder: the Structure
– Assuming the following delays: setup = 1, carry propagate = 1/bit, mux = 1
– Delay from all paths = 8 (20 bits)
[Figure: blocks of growing width (Bits 0-1, 2-4, 5-8, 9-13 and a final block) with block carries C0, C2, C5, C9, C14, C19 and arrival times 4, 5, 6, 7 marked; [Rab96] p.402]
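The delay figure can be reproduced with a small timing model. The sketch below is built on assumptions that go beyond what the slide states: all blocks start at t=0, each block's local carry chains finish at setup + (block width) x (carry delay per bit), every block including the first passes its result through one multiplexer, and the final sum generation is not counted. The function name `carry_out_time` is illustrative.

```python
def carry_out_time(block_sizes, t_setup=1, t_carry=1, t_mux=1):
    """Time at which the last block's selected carry is available.

    Assumed model: a block's multiplexer can switch only when both its
    local carry chains and the incoming block carry are ready.
    """
    selected = 0  # arrival time of the incoming block carry (C0 at t=0)
    for size in block_sizes:
        local_ready = t_setup + size * t_carry
        selected = max(local_ready, selected) + t_mux
    return selected

# Block widths 2, 3, 4, 5, 6 (20 bits in total)
print(carry_out_time([2, 3, 4, 5, 6]))  # 8
```

With these widths each block's local chains finish exactly when the incoming carry arrives, so neither path waits for the other. That is the sense in which the growing block sizes balance the delay paths.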
Slide 115: Square Root Carry Select Adder: Delay
– Assume an N-bit adder with P stages (delay directly depends on P)
– First stage computes M bits
– For M << N (e.g. N=64, M=2), the first term dominates: N ≈ P^2/2
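The approximation N ≈ P^2/2 can be derived in a few lines. As a sketch, assume each stage is one bit wider than the previous one, so the stage widths are M, M+1, ..., M+P-1; this widening rule is an assumption made here for the derivation. Summing the widths gives:

```latex
N = M + (M+1) + (M+2) + \dots + (M+P-1)
  = MP + \frac{P(P-1)}{2}
  = \frac{P^2}{2} + P\left(M - \frac{1}{2}\right)

\text{For } M \ll N:\quad N \approx \frac{P^2}{2}
\quad\Longrightarrow\quad P \approx \sqrt{2N}
```

Since the delay grows with the number of stages P, it grows roughly as the square root of N. As a sanity check, M=2 and P=5 give N = 12.5 + 7.5 = 20 bits.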
Slide 116: Carry Select Adder: Trade-offs
– Area overhead:
  o An additional carry path and a multiplexer (not the whole adder)
  o About 30% more than a ripple-carry adder
– Delay:
  o Sub-linear (we can beat that too!)
[Figure: delay versus number of bits for ripple adder, linear select and square root select; [©Prentice Hall]]
Slide 118: Binary Carry-Lookahead or Brent-Kung Adder
– Idea: use a binary tree for carry propagation, giving logarithmic delay
– Linear chain over A0 to A7: tp ∝ N
– Binary tree over A0 to A7: tp ∝ log2(N)
[Figure: chain and tree structures combining A0 to A7 into F; [©Prentice Hall]]
Slide 119: Brent-Kung Adder
– Basic component: concatenation of a left (MSB side) group and a right (LSB side) group
– (g, p) = (g_left, p_left) ∘ (g_right, p_right)
  o g = g_left + p_left.g_right
  o p = p_left.p_right
[©Hauck]
Slide 120: Brent-Kung Adder: Structure
– Per-bit signals: gi = Ai.Bi, pi = Ai ⊕ Bi
– Define (Gi, Pi): generate and propagate for bit positions 0 to i
  o (G0, P0) = (g0, p0)
  o for i>0: (Gi, Pi) = (gi, pi) ∘ (Gi-1, Pi-1) = (gi, pi) ∘ (gi-1, pi-1) ∘ ... ∘ (g0, p0)
– Key to Brent-Kung adder: use a tree structure to perform the concatenations
– C5? No! Doesn't know about C0-3 yet!
[©Hauck]
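The concatenation operator and a tree-shaped evaluation order can be sketched in Python. This is an illustrative model rather than the slides' circuit: it assumes the number of bits is a power of two, a zero input carry, and p = A xor B so that the same signal can form the sum. The up-sweep/down-sweep schedule is one common way to arrange a Brent-Kung style prefix network, and all function names are made up for this example.

```python
def combine(left, right):
    """(g, p) = (g_left + p_left.g_right, p_left.p_right)."""
    g_l, p_l = left
    g_r, p_r = right
    return g_l | (p_l & g_r), p_l & p_r

def brent_kung_prefix(gp):
    """Return [(G_i, P_i)] for all i; len(gp) must be a power of two."""
    x = list(gp)
    n = len(x)
    d = 1
    while d < n:  # up-sweep: build group signals for aligned blocks
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d *= 2
    d //= 2
    while d >= 1:  # down-sweep: fill in the remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d //= 2
    return x

def brent_kung_add(x, y, n=8):
    """Add two n-bit integers; returns (sum mod 2**n, carry_out)."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    gp = [(ai & bi, ai ^ bi) for ai, bi in zip(a, b)]
    prefix = brent_kung_prefix(gp)
    carries = [0] + [g for g, _ in prefix]  # carries[i] = carry into bit i
    total = sum((a[i] ^ b[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]

print(brent_kung_add(181, 107))  # (32, 1)
```

Each sweep has about log2(n) levels, and the `combine` calls within a level are independent of each other, which is what lets a hardware tree evaluate them in parallel.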
Slide 121: Brent-Kung: the Complete Tree
– t_add ∝ log2(N)
[Figure: tree over (g0,p0) to (g7,p7) producing carries C0 to C7; [©Prentice Hall]]
Slide 122: Brent-Kung: Timing
[Figure: 16-input Brent-Kung network with inputs x0 to x15 and outputs s0 to s15, drawn level by level; [©Oxford U Press], [Par00] p.102]
Slide 123: Brent-Kung Adder: Summary
– Area
  o On average, twice as large as a ripple adder
  o Layout of the cells is very compact
– Delay
  o Logarithmic time
  o Once carry signals are ready, sum bits are derived in constant time
– Good for wide adders
Slide 124: Comparing Adder Designs
[Figure: two plots versus number of bits, tp (sec) and area (mm2), for static, mirror, manchester, bypass, select and Brent-Kung adders; [©Prentice Hall]]
Slide 125: Combining Different Adders
[Figure: [©Oxford U Press], [Par00] p.103]
Slide 126: Combining Different Adders
– Two-level carry skip adder
– Delay = 8 cycles
– Number of bits: 30
[Figure: Blocks A to F between Cin (t=0) and Cout (t=8), with block labels {8, 1}, {7, 2}, {6, 3}, {5, 4}, {4, 5}, {3, 8}, skip stages S2, and T produce / T assimilate annotations; [©Oxford U Press], [Par00] p.113]
Slide 127: Combining Different Adders
[Figure: 64 Bit Adder built from a 40 Bit Carry Select Adder (RA(63:24), RB(63:24), EA(63:24)) and a 24 Bit Differential Carry Lookahead Adder (RA(23:0), RB(23:0), EA(23:0), cout23), with MSB and LSB ends marked; outputs go to TLB Compare (real_add(40:0), hit/miss/data) and Data Cache Compare]
© Dan Stasiak, IBM Rochester, 2001
Slide 128: Combining Different Adders
[Figure: two adder sections, one covering EA(24:63) and a 24 Bit Adder Section covering EA(0:23) & EA_L(0:23)]
© Dan Stasiak, IBM Rochester
Slide 129: Combining Different Adders
(Should appear before slide 126)
– Ripple+skip adder: delay = 8. Max adder width?
– Assume: p, g, ripple, skip signal, skipping: 1 unit delay each
– Carry signals
  o Pass mode: ready at time x through skip logic, which limits the number of blocks
  o Local generation mode: a block can process y bits and still have time to deliver a locally generated carry by time x for the next block
– Sum signals
  o If in local generation mode, y is OK
  o If in pass mode, y is not OK for the left bits (e.g., block E receives cin at x=5, so it can process at most z=3 bits to meet the delay bound of 8 on the sum bits)
[Figure: blocks A to G of widths b between Cin and Cout, with skip stages S; [©Oxford U Press], [Par00] p.112]
Slide 130: CLA Static Logic: Trimmed Down
(Should appear before slide 86)
[Figure: static gate computing C1 from Cin, p0 and g0, using transistors h, j, k, s, t, u; [©Hauck], [Rab96] p405]