Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University
Motivation Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI –large fan-out points –wide wiring channels Recent insights: can remove both and do... –absolute difference –late increment –media processing
Structure of Presentation Parallel Prefix Adder theory –Kogge-Stone, Ladner-Fischer New log-depth prefix trees –Knowles’ “family of adders” New applications of prefix adders –late operations, media adder
I. Parallel Prefix Adder theory
Prefix adder structure [Figure: operands A(0:w-1) and B(0:w-1) enter the bit propagate and generate cells, which produce g(0:w-1) and p(0:w-1); the prefix carry tree produces c(1:w); the sum cells (XOR gates) produce s(0:w)]
Prefix Equations - 1 $g(i) = a(i) \wedge b(i)$ “carry generate” $p(i) = a(i) \oplus b(i)$ “carry propagate” $k(i) = \overline{a(i) \vee b(i)}$ “carry kill” g(i), p(i), & k(i) are mutually exclusive –Use any two: g(i) & k(i) = NAND & NOR –p(i) needed as well: $s(i) = p(i) \oplus c(i)$
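The bit-level signals can be checked directly. The following is a minimal Python sketch; the function names `bit_signals` and `sum_bit` are illustrative and do not come from the slides.

```python
def bit_signals(a, b):
    """Return (g, p, k) for one pair of operand bits a, b in {0, 1}."""
    g = a & b          # carry generate: both bits set
    p = a ^ b          # carry propagate: exactly one bit set
    k = 1 - (a | b)    # carry kill: neither bit set (NOR)
    return g, p, k


def sum_bit(a, b, c):
    """Sum bit s(i) = p(i) XOR c(i), where c is the carry into this bit."""
    _, p, _ = bit_signals(a, b)
    return p ^ c


# Mutual exclusivity: exactly one of g, p, k is high for every input pair.
for a in (0, 1):
    for b in (0, 1):
        assert sum(bit_signals(a, b)) == 1
```

Because exactly one of the three signals is high, any two of them determine the third, which is why a tree can carry just two signals per group.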
Prefix Equations - 2 Generate and Not Kill signals are combined to form “Group Signals” $G^{x}_{z}$ and $\bar{K}^{x}_{z}$
$G^{x}_{z}$  $\bar{K}^{x}_{z}$  interpretation
0  0  c(x+1) = 0
0  1  c(x+1) = c(z)
1  0  Don’t care
1  1  c(x+1) = 1
Prefix Equations - Interpretation Group signals yield carry signals: Tree outputs: $c(i+1) = G^{i}_{0}$ Tree inputs: $G^{i}_{i} = g(i)$; $\bar{K}^{i}_{i} = \bar{k}(i)$
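The operator that merges two adjacent groups did not survive on these slides, so the sketch below assumes the usual form: the merged group generates if the upper group generates, or if the upper group does not kill and the lower group generates; the merged group is not killed only if neither part is killed. The names `combine`, `carries` and `add` are illustrative.

```python
def combine(hi, lo):
    """Merge (G, not-kill) of an upper group with the adjacent lower group."""
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def carries(a, b, w):
    """Return [c(1), ..., c(w)] for w-bit operands, using c(i+1) = G^i_0."""
    out = []
    group = None
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit = (ai & bi, ai | bi)          # tree inputs: g(i) and not-kill(i)
        group = bit if group is None else combine(bit, group)
        out.append(group[0])              # G over bits i..0 is the carry c(i+1)
    return out


def add(a, b, w):
    """Form the (w+1)-bit sum s(0:w) from the propagate bits and the carries."""
    c = [0] + carries(a, b, w)            # c(0) = 0: no carry-in
    s = 0
    for i in range(w):
        p = ((a >> i) ^ (b >> i)) & 1
        s |= (p ^ c[i]) << i
    return s | (c[w] << w)
```

This version walks the bits serially, so it only illustrates the equations; the trees on the following slides compute the same group signals in logarithmic depth.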
Prefix Equations - characteristics Associative –sub-terms may be pre-computed in parallel
Prefix Equations - characteristics Idempotent –sub-terms may be “overlapped” [Figure: 3-bit prefix tree with inputs g(0), k(0); g(1), k(1); g(2), k(2) and carry outputs c(1), c(2), c(3)]
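Both characteristics can be confirmed by brute force. The sketch below assumes the usual (G, not-kill) form of the prefix operator, since the operator itself is not legible on the slides; `combine`, `group` and `overlap_ok` are illustrative names.

```python
from itertools import product


def combine(hi, lo):
    """Prefix operator on (G, not-kill) pairs; hi is the more significant group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


SIGNALS = list(product((0, 1), repeat=2))   # every possible (G, not-kill) pair

# Associativity: the grouping of three operands does not matter.
associative = all(
    combine(combine(x, y), z) == combine(x, combine(y, z))
    for x, y, z in product(SIGNALS, repeat=3)
)

# Idempotency: combining a group with itself changes nothing.
idempotent = all(combine(x, x) == x for x in SIGNALS)


def group(bits, hi, lo):
    """Group signal covering bit positions hi down to lo, built serially."""
    acc = bits[lo]
    for i in range(lo + 1, hi + 1):
        acc = combine(bits[i], acc)
    return acc


def overlap_ok(bits):
    """Check that groups x..y and m..z still merge to x..z when they overlap."""
    w = len(bits)
    for x in range(w):
        for z in range(x + 1):
            for y in range(z, x + 1):
                for m in range(max(y - 1, z), x + 1):   # m >= y - 1: adjacent or overlapping
                    if combine(group(bits, x, y), group(bits, m, z)) != group(bits, x, z):
                        return False
    return True
```

Associativity is what lets sub-terms be pre-computed in parallel; idempotency is what lets the two operands of a cell cover some of the same bits without changing the result.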
4-bit Ladner-Fischer prefix tree 1 sub-term pre-computed Logarithmic depth Fan-out = 2 in 2nd row (laterally)
8-bit Ladner-Fischer prefix tree Log depth; lateral fan-out = 4 in 3rd row No exploitation of idempotency
16-bit Ladner-Fischer prefix tree Log depth with large fan-out in final row
4-bit Kogge-Stone prefix graph Fan-out = 1 (laterally) 1 extra cell; parallel wires in 2nd row
8-bit Kogge-Stone prefix graph More cells & wiring than Ladner-Fischer
16-bit Kogge-Stone prefix graph Low fan-out but wider wiring channels No exploitation of idempotency
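The two classical trees can be compared with a small model. The sketch below describes each level as a list of (destination, source) cell pairs and assumes the usual (G, not-kill) prefix operator; the function names are illustrative, and the Ladner-Fischer builder is the minimum-depth form with fan-outs {1,2,4,8…}.

```python
from collections import Counter


def combine(hi, lo):
    """Prefix operator on (G, not-kill) pairs; hi is the more significant group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


def kogge_stone(w):
    """At distance d = 2^level, every position i >= d takes its input from i - d."""
    levels, d = [], 1
    while d < w:
        levels.append([(i, i - d) for i in range(d, w)])
        d *= 2
    return levels


def ladner_fischer(w):
    """The upper half of each 2d-bit block takes the top position of its lower half."""
    levels, d = [], 1
    while d < w:
        levels.append([(i, (i // d) * d - 1) for i in range(w) if (i // d) % 2 == 1])
        d *= 2
    return levels


def carries(a, b, levels, w):
    """Evaluate a tree; returns [c(1), ..., c(w)] with c(i+1) = G^i_0."""
    sig = [((a >> i) & (b >> i) & 1, ((a >> i) | (b >> i)) & 1) for i in range(w)]
    for cells in levels:
        prev = list(sig)                  # all cells of one level work in parallel
        for dest, src in cells:
            sig[dest] = combine(prev[dest], prev[src])
    return [g for g, _ in sig]


def cell_count(levels):
    return sum(len(cells) for cells in levels)


def lateral_fanouts(levels):
    """Largest number of cells driven by a single source position, per level."""
    return [max(Counter(src for _, src in cells).values()) for cells in levels]
```

For 16 bits this model gives 49 cells with fan-outs [1, 1, 1, 1] for Kogge-Stone and 32 cells with fan-outs [1, 2, 4, 8] for Ladner-Fischer, which is the trade-off between wiring and fan-out described on these slides.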
Black cells and grey cells Carries, $c(i) = G^{i-1}_{0}$; $\bar{K}^{i-1}_{0}$ terms not needed G-only cells called and coloured “grey”
The story so far… Parallel prefix adders available in VLSI Log-depth adders possible: –high fan-outs {1,2,4,8…} & low cell count –low fan-outs {1,1,1,1…} & high cell count Problematic in VLSI (buffering, area) Idempotency of the prefix operator not exploited
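A cell can be grey exactly when its output group reaches down to bit 0, because that output is a final carry. The sketch below counts such cells in a Kogge-Stone tree by tracking the lowest bit each position covers; the (destination, source) level model and the function names are illustrative assumptions.

```python
def kogge_stone(w):
    """Levels of (dest, src) cells: at distance d, position i takes input from i - d."""
    levels, d = [], 1
    while d < w:
        levels.append([(i, i - d) for i in range(d, w)])
        d *= 2
    return levels


def grey_and_black(levels, w):
    """Count cells whose output group reaches bit 0 (grey) and all others (black)."""
    low = list(range(w))              # lowest bit covered by the group at each position
    grey = black = 0
    for cells in levels:
        prev = list(low)              # inputs come from the previous level
        for dest, src in cells:
            low[dest] = prev[src]     # merged group extends down to the source's low bit
            if low[dest] == 0:
                grey += 1             # output is a carry: only G is needed
            else:
                black += 1            # output feeds later levels: G and not-kill needed
    return grey, black
```

Under this model a 16-bit Kogge-Stone tree has 15 grey cells (one per carry c(2)..c(16)) and 34 black cells.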
II. Knowles’ “Family of Adders”
Log-depth prefix trees In VLSI: –L-F trees require too much buffering delay –K-S trees require too much area (wire flux) Fan-outs characterised as: –{1,2,4,8…} Ladner-Fischer –{1,1,1,1…} Kogge-Stone
Knowles’ insight Use other fan-out schemes 5 possible 8-bit log-depth prefix trees:
–{1,1,1}  17 cells  Kogge-Stone
–{1,1,2}  17 cells  uses idempotency
–{1,1,4}  14 cells  no idempotency
–{1,2,2}  14 cells  no idempotency
–{1,2,4}  12 cells  Ladner-Fischer
Knowles’ 8-bit prefix trees All trees are log-depth
Tree construction rules Levels are labelled 0,1,2... Fan-out at jth level, $2^k$, satisfies $2^k \le 2^j$ Fan-out at jth level $\le$ fan-out at (j+1)th level Lateral wire length at jth level is $2^j$
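The construction rules are enough to enumerate the whole family. The sketch below reads the rules as "fan-out at level j is a power of two no larger than 2^j" and "fan-out never decreases from one level to the next"; the function name `fanout_schemes` is illustrative.

```python
def fanout_schemes(levels):
    """All fan-out lists allowed by the construction rules for `levels` levels."""
    schemes = [[]]
    for j in range(levels):
        options = [1 << k for k in range(j + 1)]      # powers of two with 2^k <= 2^j
        schemes = [
            s + [f]
            for s in schemes
            for f in options
            if not s or s[-1] <= f                    # fan-out is non-decreasing with level
        ]
    return schemes
```

With 3 levels (8 bits) this yields 5 schemes and with 4 levels (16 bits) it yields 14, matching the tree counts quoted on the neighbouring slides.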
Knowles’ 16-bit trees - I
{1,1,1,1} 49 cells    {1,1,1,8} 42 cells
{1,1,1,2} 49 cells    {1,2,2,2} 42 cells
{1,1,1,4} 49 cells    {1,1,4,4} 40 cells
{1,1,2,2} 49 cells    {1,1,4,8} 36 cells
{1,1,2,4} 49 cells    {1,2,2,8} 36 cells
{1,1,2,8} 42 cells    {1,2,4,4} 36 cells
{1,2,2,4} 42 cells    {1,2,4,8} 32 cells
Knowles’ 16-bit trees - II
{1,1,1,1}               {1,1,1,8}
{1,1,1,2} Idempotent    {1,2,2,2}
{1,1,1,4} Idempotent    {1,1,4,4}
{1,1,2,2} Idempotent    {1,1,4,8}
{1,1,2,4} Idempotent    {1,2,2,8}
{1,1,2,8} Idempotent    {1,2,4,4}
{1,2,2,4} Idempotent    {1,2,4,8}
Knowles’ 16-bit trees - III
{1,1,1,1}         {1,1,1,8} R
{1,1,1,2} I       {1,2,2,2} R
{1,1,1,4} I       {1,1,4,4} R
{1,1,2,2} I       {1,1,4,8} R
{1,1,2,4} I       {1,2,2,8} R
{1,1,2,8} R, I    {1,2,4,4} R
{1,2,2,4} R, I    {1,2,4,8} R
Quick way of spotting R, I Define span(l) as distance from start of wire to first cell in lth level: $\mathrm{span}(l) = 2^{l} - \mathrm{fanout}(l) + 1$ Tree characteristics: –R if $\mathrm{span}(j) \ge \mathrm{span}(k)$ for j < k –I if span(i) + span(j) = span(k) for i < j < k
Examples of R & I spotting
fanout(l)    span(l)      characteristic
[1,1,1,1]    [1,2,4,8]    neither R nor I
[1,1,2,2]    [1,2,3,7]    I only
[1,2,2,2]    [1,1,3,7]    R only
[1,2,2,4]    [1,1,3,5]    R & I
Are R & I adders “best”?
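The span test is easy to automate. In the sketch below the span formula follows the worked examples (2^l - fanout(l) + 1). Two readings are assumptions: R is taken to mean that some earlier span is at least as large as a later one, and I is taken to mean that a later span equals the sum of two or more earlier spans. Both were chosen because they reproduce the R and I labels shown for the 16-bit trees; the function names are illustrative.

```python
from itertools import combinations


def spans(fanouts):
    """span(l) = 2^l - fanout(l) + 1 for each level l."""
    return [2 ** l - f + 1 for l, f in enumerate(fanouts)]


def is_r(fanouts):
    """Assumed reading: some earlier level has a span >= that of a later level."""
    s = spans(fanouts)
    return any(s[j] >= s[k] for j, k in combinations(range(len(s)), 2))


def is_i(fanouts):
    """Assumed reading: a later span equals the sum of two or more earlier spans."""
    s = spans(fanouts)
    for k in range(2, len(s)):
        for n in range(2, k + 1):
            if any(sum(c) == s[k] for c in combinations(s[:k], n)):
                return True
    return False


def classify(fanouts):
    """Return the label set as a string: '', 'R', 'I' or 'R, I'."""
    labels = [name for name, hit in (("R", is_r(fanouts)), ("I", is_i(fanouts))) if hit]
    return ", ".join(labels)
```

For example, `classify([1, 2, 2, 4])` returns `'R, I'` because its spans [1, 1, 3, 5] repeat the value 1 and satisfy 1 + 1 + 3 = 5.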
VLSI design of prefix adders Adders laid out as rectangular array of prefix cells (and gaps) Assume cells measure 10 µm × 4 µm –2 cells per significance → 20 µm / bit Key design parameters: –buffering (area & delay) –wiring channels (area)
16-bit adder example Assumptions Maximum fan-out without buffering: –3 cells + 80 µm wire (4 cell widths) Maximum fan-out with buffering: –9 cells + wire (12 cell widths) Employ {1,2,2,4} architecture
{1,2,2,4} prefix adder layout
Area vs Time for 32-bit adders [Plot: area against delay for K-S {1,1,1,1,1}, {1,1,2,2,2}, {1,2,2,4,4} (spans [1,1,3,5,13]) and L-F {1,2,4,8,16}]
32-bit prefix tree adders Exploitable trade-off between adder’s delay and area –Kogge-Stone adder 16% faster than Ladner-Fischer but 66% larger –{1,2,2,4,4} adder 8% faster than Ladner-Fischer but only 3% larger –buffering also trades off speed for area
III. New applications of prefix adders
Other addition operations Late increment –Mod $2^{w} - 1$ addition for Reed-Solomon coding –floating-point rounding Late complement –absolute difference for video motion estimation –sign-magnitude addition Typically use 2 adders and a MUX
Increments in prefix trees Row of prefix cells = ‘late +1’ operation Ladner-Fischer comprises many late +1’s –1 × 8-bit, 2 × 4-bit, 4 × 2-bit, & 8 × 1-bit
Late increment tree Adder returns A+B if inc = 0 Adder returns A+B+1 if inc = 1
Late increment logic “Late Carry” lc(i) set high if: –c(i) = 1 or –inc = 1 and $(a(n), b(n)) \neq (0,0)$ $\forall n: 0 \le n < i$ [Figure: late increment cell with signals p(i), inc, the group generate and not-kill signals, c(i), lc(i) and s(i)]
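The late-carry condition above can be written as lc(i) = c(i) OR (inc AND not-kill over bits i-1..0). The sketch below is a serial model of that logic rather than a tree; it assumes the (G, not-kill) group signals and treats the empty group below bit 0 as (0, 1). The function name `late_increment_add` is illustrative.

```python
def late_increment_add(a, b, inc, w):
    """Return (A + B + inc) mod 2^w using a single pass over the group signals."""
    g_grp, nk_grp = 0, 1                      # empty group below bit 0: no generate, not killed
    s = 0
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        lc = g_grp | (inc & nk_grp)           # late carry into bit i
        s |= ((ai ^ bi) ^ lc) << i            # s(i) = p(i) XOR lc(i)
        g, nk = ai & bi, ai | bi
        g_grp, nk_grp = g | (nk & g_grp), nk & nk_grp   # extend the group to bits i..0
    return s
```

The `inc` input is only used in the final OR/XOR step, which is why the increment can arrive late without re-running the carry computation.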
Late complement theory In 2’s-complement, $\bar{N} = -(N+1)$ $A + \bar{B} = A - B - 1$; late increment then yields $A - B$ $\overline{(A + \bar{B})} = -(A - B - 1 + 1) = B - A$ Absolute difference readily available
Absolute difference logic If c(w) = 0, result negative –if c(w) = 0, invert all the bits –else always perform late increment with $\bar{K}^{i-1}_{0}$ [Figure: absolute difference cell with signals p(i), c(i), c(w), the group not-kill signal and s(i)]
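The selection rule can be modelled in a few lines. The sketch below adds A to the bitwise complement of B, then either inverts the sum (carry-out 0) or applies the late increment (carry-out 1). It is a serial model assuming unsigned w-bit operands and (G, not-kill) group signals; the name `abs_diff` is illustrative.

```python
def abs_diff(a, b, w):
    """Return |A - B| for unsigned w-bit operands from one pass over A + not(B)."""
    mask = (1 << w) - 1
    nb = ~b & mask                              # bitwise complement of B
    g_grp, nk_grp = 0, 1                        # empty group below bit 0
    plain = incremented = 0
    for i in range(w):
        ai, bi = (a >> i) & 1, (nb >> i) & 1
        p = ai ^ bi
        plain |= (p ^ g_grp) << i                    # bit i of A + not(B) = A - B - 1
        incremented |= (p ^ (g_grp | nk_grp)) << i   # bit i with late +1 = A - B
        g_grp, nk_grp = (ai & bi) | ((ai | bi) & g_grp), (ai | bi) & nk_grp
    carry_out = g_grp                           # c(w): high only when A > B
    if carry_out == 0:
        return ~plain & mask                    # invert all bits: B - A
    return incremented
```

When A equals B the plain sum is all ones with no carry-out, so the inversion branch returns 0 as required.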
Summary of “late” ops Available on all prefix adders Extra delay: 1 gate’s delay + buffering Extra hardware: w black cells This technique used in floating-point units –late increment for rounding –late complement for true subtraction
Media (“packed”) arithmetic Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations Examples: –32-bit adder → 4 × 8-bit adders –32-bit multiplier → 2 × 16-bit multipliers
Partitioning an adder Criteria: –support carries propagating within sub-adders –prevent carries propagating between sub-adders Solutions: –put AND gates on carry chains → slower adder –put dummy 0’s on operand bits → larger adder Use prefix adder!!
Packed prefix adder - 1 Force $\bar{k}(n) = 0$ at partition points –prevents carries propagating across bit n –exploits don’t care condition $(g, \bar{k}) = (1,0)$ Implementation –change $\bar{k}(n)$ gate to (2,1) OR-AND gate –delay-neutral modification
Packed prefix adder - 2 Force $c(n) = G^{n-1}_{0} = 0$ at partition points –prevents c(n) → s(n) errors Implementation –insert AND gates (off critical path) or –change $G^{n-1}_{0}$ gate to ({2,1},1) complex gate –BUT need $G^{n-1}_{0}$ signal for sub-adder overflows
Packed prefix adder - 3 Sub-adder carries complete early Extraneous cells automatically do nothing
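The two partitioning steps can be modelled together. The sketch below is a serial model, not a tree, and assumes the (G, not-kill) group signals with bit n taken as the least significant bit of each upper sub-adder. It forces the not-kill input to 0 at each partition point and masks the carry into the sum bit there, while keeping the unmasked group generate as the lower sub-adder's overflow. The name `packed_add` and the `lane` parameter are illustrative.

```python
def packed_add(a, b, w, lane):
    """Add two w-bit words as independent lane-bit sub-words (lane divides w).

    Returns (packed sum, list of carry-outs, one per sub-adder, lowest first).
    """
    g_grp, nk_grp = 0, 1                      # empty group below bit 0
    s = 0
    overflow = []
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        boundary = i > 0 and i % lane == 0    # bit i starts a new sub-adder
        c = g_grp                             # carry from the bits below
        if boundary:
            overflow.append(c)                # still needed as the sub-adder overflow
            c = 0                             # force c(n) = 0 so s(n) ignores it
        s |= ((ai ^ bi) ^ c) << i
        g = ai & bi
        nk = 0 if boundary else (ai | bi)     # force not-kill(n) = 0 at the partition point
        g_grp, nk_grp = g | (nk & g_grp), nk & nk_grp
    overflow.append(g_grp)                    # carry-out of the top sub-adder
    return s, overflow
```

At a partition point with a(n) = b(n) = 1 the inputs are (g, not-kill) = (1, 0), the don't-care row of the group-signal table; the operator still treats it as a generate, so bit n can start a carry in its own sub-adder while blocking any carry from below.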
Last Slide Recent developments in prefix adders: –new “family” of log-depth trees –late operations –packed arithmetic for media processing Future possibilities: –systematic exploitation of idempotency –trees with reduced buffering –combine packed arithmetic/late ops
ANY QUESTIONS OR COMMENTS?