Download presentation
Presentation is loading. Please wait.
Published byBryanna Roderick Modified over 10 years ago
1
Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University
2
Motivation Parallel Prefix Adders (e.g. Kogge- Stone) mostly ignored for deep submicron VLSI –large fan-out points –wide wiring channels Recent insights: can remove both and do... –absolute difference –late increment –media processing
3
Structure of Presentation Parallel Prefix Adder theory –Kogge-Stone, Ladner-Fisher New log-depth prefix trees –Knowles’ “family of adders” New applications of prefix adders –late operations, media adder
4
I. Parallel Prefix Adder theory
5
Prefix adder structure A(0:w-1) Bit propagate and generate cells g(0:w-1)p(0:w-1) B(0:w-1) c(1:w) Prefix carry tree s(0:w) Sum cells (XOR gates)
6
Prefix Equations - 1 g(i) = a(i) b(i)“carry generate” p(i) = a(i) b(i)“carry propagate” k(i) = {a(i) b(i)}“carry kill” g(i), p(i), & k(i) are mutually exclusive –Use any two: g(i) & k(i) = NAND & NOR –p(i) needed as well: s(i) = p(i) c(i)
7
Prefix Equations - 2 Generate and Not Kill signals are com- bined to form “Group Signals” G x z K x z interpretation 0 0c(x+1) = 0 0 1c(x+1) = c(z) 1 0Don’t care 1 1c(x+1) = 1
8
Prefix Equations - Interpretation Group signals yield carry signals: Tree outputs: c(i+1) = G i 0 Tree inputs: G i i = g(i) ; K i i = k(i)
9
Prefix Equations - characteristics Associative –sub-terms may be pre-computed in parallel
10
Prefix equations - characteristics Idempotent –sub-terms may be “overlapped” g(0), k(0)g g(1), k(1)g g(2), k(2)g GK 1 0 22 11 22 00 c(3)c c(2)c c(1)c
11
4-bit Ladner-Fisher prefix tree 1 sub-term pre-computed Logarithmic depth Fan-out = 2 in 2 nd row (laterally)
12
8-bit Ladner-Fisher prefix tree Log depth; lateral fan-out = 4 in 3 rd row No exploitation of idempotency
13
16-bit Ladner-Fisher prefix tree Log depth with large fan-out in final row
14
4-bit Kogge-Stone prefix graph Fan-out = 1 (laterally) 1 extra cell parallel wires in 2 nd row
15
8-bit Kogge-Stone prefix graph More cells & wiring than Ladner-Fisher
16
16-bit Kogge-Stone prefix graph Low fan-out but wider wiring channels No exploitation of idempotency
17
Black cells and grey cells Carries, c(i) = G i-1 0 ; K i-1 0 terms not needed G-only cells called and coloured “grey”
18
The story so far… Parallel prefix adders available in VLSI Log-depth adders possible: –high fan-outs {1,2,4,8…} & low cell count –low fan-outs {1,1,1,1…} & high cell count Problematic in VLSI (buffering, area) Idempotency of ‘ ’ operator not exploited
19
II. Knowles’ “Family of Adders”
20
Log-depth prefix trees In VLSI: –L-F trees require too much buffering delay –K-S trees require too much area (wire flux) Fan-outs characterised as: –{1,2,4,8…} Ladner-Fisher –{1,1,1,1…} Kogge-Stone
21
Knowles’ insight Use other fan-out schemes 5 possible 8-bit log-depth prefix trees: –{1,1,1}17 cellsKogge-Stone –{1,1,2}17 cellsuses idempotency –{1,1,4}14 cellsno idempotency –{1,2,2}14 cellsno idempotency –{1,2,4} 12 cellsLadner-Fisher
22
Knowles’ 8-bit prefix trees All trees are log-depth
23
Tree construction rules Levels are labelled 0,1,2... Fan-out at j th level, 2 k, satisfies 2 k 2 j Fan-out at j th level fan-out at j+1 th level Lateral wire length at j th level is 2 j
24
Knowles’ 16-bit trees - I {1,1,1,1} 49 cells{1,1,1,8}42 cells {1,1,1,2} 49cells {1,2,2,2} 42 cells {1,1,1,4} 49cells {1,1,4,4} 40 cells {1,1,2,2} 49cells {1,1,4,8} 36 cells {1,1,2,4} 49cells {1,2,2,8} 36 cells {1,1,2,8} 42cells {1,2,4,4} 36 cells {1,2,2,4} 42cells {1,2,4,8} 32 cells
25
Knowles’ 16-bit trees - II {1,1,1,1} {1,1,1,8} {1,1,1,2} Idempotent{1,2,2,2} {1,1,1,4} Idempotent {1,1,4,4} {1,1,2,2} Idempotent {1,1,4,8} {1,1,2,4} Idempotent {1,2,2,8} {1,1,2,8} Idempotent {1,2,4,4} {1,2,2,4} Idempotent{1,2,4,8}
26
Knowles’ 16-bit trees - III {1,1,1,1} {1,1,1,8}R {1,1,1,2} I{1,2,2,2} R {1,1,1,4} I{1,1,4,4} R {1,1,2,2} I{1,1,4,8} R {1,1,2,4} I{1,2,2,8} R {1,1,2,8} R, I{1,2,4,4} R {1,2,2,4} R, I{1,2,4,8} R
27
Quick way of spotting R, I Define span(l) as distance from start of wire to first cell in l th level span(l) = 2 l fanout(l) 1 tree characteristics –R if span(j) span(k) for j < k –I if span(i) + span(j) = span(k) for i < j < k
28
Examples of R & I spotting fanout(l)span(l) characteristic [1,1,1,1] [1,2,4,8] neither R nor I [1,1,2,2] [1,2,3,7] I only [1,2,2,2] [1,1,3,7] R only [1,2,2,4] [1,1,3,5] R & I Are R & I adders “best”?
29
VLSI design of prefix adders Adders laid out as rectangular array of prefix cells (and gaps) Assume cells measure 10 m 4 m –2 cells per significance 20 m / bit Key design parameters: –buffering (area & delay) –wiring channels (area)
30
16-bit adder example Assumptions Maximum fan-out without buffering: –3 cells + 80 m wire (4 cell widths) Maximum fan-out with buffering: –9 cells + 240 m wire (12 cell widths) Employ {1,2,2,4} architecture
31
{1,2,2,4} prefix adder layout
32
Area vs Time for 32-bit adders Delay 1212.51313.514 24 26 28 30 32 34 36 38 40 Area K-S {1,1,1,1,1} {1,1,2,2,2} L-F {1,2,4,8,16} {1,2,2,4,4} [1,1,3,5,13]
33
32-bit prefix tree adders Exploitable trade-off between adder’s delay and area –Kogge-Stone adder 16% faster than Ladner- Fisher but 66% larger –{1,2,2,4,4} adder 8% faster than Ladner-Fisher but only 3% larger –buffering also trades off speed for area
34
III. New applications of prefix adders
35
Other addition operations Late increment –Mod 2 w -1 addition for Reed-Solomon coding –floating-point rounding Late complement –absolute difference for video motion estimation –sign-magnitude addition Typically use 2 adders and a MUX
36
Increments in prefix trees Row of prefix cells = ‘late +1’ operation Ladner-Fisher comprises many late +1’s –1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit
37
Late increment tree Adder returns A+B if inc = 0 Adder returns A+B+1 if inc = 1 inc
38
Late increment logic “Late Carry” lc(i) set high if: –c(i) = 1 or –inc = 1 and a(n),b(n) 0,0 n: 0 n < i p(i)p(i) s(i)s(i) inc K i 0 c(i) = G i 0 lc(i)
39
Late complement theory In 2’s-complement, N = -(N+1) A + B = A B 1 * late increment then yields A B (A + B) = -(A B 1+1) = B A Absolute difference readily available
40
Absolute difference logic If c(w) = 0, result negative –if c(w) = 0, invert all the bits –else always perform late increment with K i-1 0 p(i) s(i)s(i) c(w)c(w) K i 0 c(i)
41
Summary of “late” ops Available on all prefix adders Extra delay: 1 gate’s delay + buffering Extra hardware: w black cells This technique used in floating-point units –late increment for rounding –late complement for true subtraction
42
Media (“packed”) arithmetic Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations Examples: –32-bit adder 4 8-bit adders –32-bit multiplier 2 16-bit multipliers
43
Partitioning an adder Criteria: –support carries propagating within sub-adders –prevent carries propagating between sub- adders Solutions: –put AND gates on carry chains slower adder –put dummy 0’s on operand bits larger adder Use prefix adder!!
44
Packed prefix adder - 1 Force k(n) = 0 at partition points –prevents carries propagating across bit n –exploits don’t care condition (g, k) = (1,0) Implementation –change k(n) gate to (2,1) OR-AND gate –delay-neutral modification
45
Packed prefix adder - 2 Force c(n) = G n-1 0 = 0 at partition points –prevents c(n) s(n) errors Implementation –insert AND gates (off critical path) or –change G n-1 0 gate to ({2,1},1) complex gate –BUT need G n-1 0 signal for sub-adder overflows
46
Packed prefix adder - 3 Sub-adder carries complete early Extraneous cells automatically do nothing
47
Last Slide Recent developments in prefix adders: –new “family” of log-depth trees –late operations –packed arithmetic for media processing Future possibilities: –systematic exploitation of idempotency –trees with reduced buffering –combine packed arithmetic/late ops
48
ANY QUESTIONS OR COMMENTS?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.