Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University.



Motivation
Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI
  – large fan-out points
  – wide wiring channels
Recent insights: can remove both and do...
  – absolute difference
  – late increment
  – media processing

Structure of Presentation
Parallel Prefix Adder theory
  – Kogge-Stone, Ladner-Fischer
New log-depth prefix trees
  – Knowles’ “family of adders”
New applications of prefix adders
  – late operations, media adder

I. Parallel Prefix Adder theory

Prefix adder structure
[Figure: operands A(0:w-1) and B(0:w-1) feed the bit propagate and generate cells, giving g(0:w-1) and p(0:w-1); the prefix carry tree gives c(1:w); the sum cells (XOR gates) give s(0:w)]

Prefix Equations - 1
g(i) = a(i) ∧ b(i)       “carry generate”
p(i) = a(i) ⊕ b(i)       “carry propagate”
k(i) = ¬{a(i) ∨ b(i)}    “carry kill”
g(i), p(i), & k(i) are mutually exclusive
  – Use any two: ¬g(i) & k(i) = NAND & NOR
  – p(i) needed as well: s(i) = p(i) ⊕ c(i)
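
The bit-level signals above can be sketched in a few lines of Python. This is only an illustration: the function name `bit_signals` and the LSB-first list layout are assumptions, not part of the slides.

```python
def bit_signals(a: int, b: int, w: int):
    """Return per-bit lists (g, p, k) for two w-bit operands, LSB first."""
    g, p, k = [], [], []
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g.append(ai & bi)          # carry generate: both inputs are 1
        p.append(ai ^ bi)          # carry propagate: exactly one input is 1
        k.append(1 - (ai | bi))    # carry kill: both inputs are 0
    return g, p, k


g, p, k = bit_signals(0b1011, 0b0110, 4)
# exactly one of g(i), p(i), k(i) is high at every bit position
assert all(gi + pi + ki == 1 for gi, pi, ki in zip(g, p, k))
```

Because the three signals are mutually exclusive, any two of them determine the third, which is why the slide only needs NAND and NOR gates plus an XOR for p(i).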

Prefix Equations - 2
Generate and Not Kill signals are combined to form “Group Signals”
G^x_z   ¬K^x_z   interpretation
0       0        c(x+1) = 0
0       1        c(x+1) = c(z)
1       0        Don’t care
1       1        c(x+1) = 1

Prefix Equations - Interpretation
Group signals yield carry signals:
Tree outputs: c(i+1) = G^i_0
Tree inputs: G^i_i = g(i); ¬K^i_i = ¬k(i)
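
A minimal sketch of how group signals yield carries, written as a serial reference model rather than a tree. The slides' own definition of the combining operator did not survive in the transcript, so `combine` below assumes the usual formulation, G = G_hi ∨ (¬K_hi ∧ G_lo) and ¬K = ¬K_hi ∧ ¬K_lo; the names `combine` and `prefix_add` are illustrative.

```python
def combine(hi, lo):
    """Prefix operator on (G, notK) pairs; `hi` covers the more significant bits."""
    G_hi, nK_hi = hi
    G_lo, nK_lo = lo
    return (G_hi | (nK_hi & G_lo), nK_hi & nK_lo)


def prefix_add(a: int, b: int, w: int) -> int:
    """Add two w-bit numbers via group signals; returns the (w+1)-bit sum."""
    # tree inputs: G(i:i) = g(i), notK(i:i) = not k(i) = a(i) OR b(i)
    pairs = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1)
             for i in range(w)]
    carries = [0]                      # c(0) = 0: no carry-in
    group = None
    for i in range(w):
        group = pairs[i] if group is None else combine(pairs[i], group)
        carries.append(group[0])       # tree output: c(i+1) = G(i:0)
    s = 0
    for i in range(w):
        p_i = ((a >> i) ^ (b >> i)) & 1
        s |= (p_i ^ carries[i]) << i   # s(i) = p(i) XOR c(i)
    return s | (carries[w] << w)       # top sum bit is the carry out


assert prefix_add(9, 7, 4) == 16
```

A real prefix adder computes the same group signals in logarithmic depth; the serial loop here only fixes what each tree output must equal.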

Prefix Equations - characteristics
Associative
  – sub-terms may be pre-computed in parallel

Prefix equations - characteristics
Idempotent
  – sub-terms may be “overlapped”
[Figure: 3-bit example with inputs (g(0), k(0)), (g(1), k(1)), (g(2), k(2)) and carry outputs c(1), c(2), c(3)]
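
Both characteristics can be checked exhaustively on small cases. The sketch below assumes the usual operator on (G, ¬K) pairs (the slides' definition is not in the transcript); `group`, `is_associative` and `overlap_is_harmless` are hypothetical helper names.

```python
from itertools import product


def combine(hi, lo):
    """Assumed prefix operator on (G, notK) pairs; `hi` is more significant."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


# (G, notK) pairs that real inputs can produce; (1, 0) is the don't-care state
VALID = [(0, 0), (0, 1), (1, 1)]


def group(bits, x, z):
    """Serially combine bit pairs z..x (x >= z) into the group signal x:z."""
    acc = bits[z]
    for i in range(z + 1, x + 1):
        acc = combine(bits[i], acc)
    return acc


def is_associative():
    return all(combine(combine(a, b), c) == combine(a, combine(b, c))
               for a, b, c in product(VALID, repeat=3))


def overlap_is_harmless(w=4):
    """Check group(x:j) o group(m:z) == group(x:z) whenever z <= j <= m + 1."""
    for bits in product(VALID, repeat=w):
        for x in range(w):
            for z in range(x + 1):
                whole = group(bits, x, z)
                for m in range(z, x + 1):
                    for j in range(z, min(m + 1, x) + 1):
                        if combine(group(bits, x, j), group(bits, m, z)) != whole:
                            return False
    return True


assert is_associative()
assert overlap_is_harmless()
```

The case j = m + 1 is plain associativity (adjacent sub-terms); every smaller j is an overlapped pair, which is what idempotency permits.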

4-bit Ladner-Fischer prefix tree
1 sub-term pre-computed
Logarithmic depth
Fan-out = 2 in 2nd row (laterally)

8-bit Ladner-Fischer prefix tree
Log depth; lateral fan-out = 4 in 3rd row
No exploitation of idempotency

16-bit Ladner-Fischer prefix tree
Log depth with large fan-out in final row

4-bit Kogge-Stone prefix graph
Fan-out = 1 (laterally)
1 extra cell; parallel wires in 2nd row

8-bit Kogge-Stone prefix graph
More cells & wiring than Ladner-Fischer

16-bit Kogge-Stone prefix graph
Low fan-out but wider wiring channels
No exploitation of idempotency
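
To make the two classic trees concrete, the sketch below lists each prefix cell as a `(level, bit, source_bit)` triple and evaluates the tree level by level. The triple encoding, the function names and the assumed operator on (G, ¬K) pairs are illustrative choices, not taken from the slides; w is assumed to be a power of two.

```python
def kogge_stone(w):
    """Fan-out {1,1,1,...}: at level j, bit i combines with bit i - 2**j."""
    levels = w.bit_length() - 1          # log2(w) for w a power of two
    return [(j, i, i - 2 ** j) for j in range(levels) for i in range(2 ** j, w)]


def ladner_fischer(w):
    """Fan-out {1,2,4,...}: bits with bit j set share one source per block."""
    levels = w.bit_length() - 1
    return [(j, i, (i >> j << j) - 1)
            for j in range(levels) for i in range(w) if (i >> j) & 1]


def run_tree(cells, a, b, w):
    """Evaluate a prefix tree level by level and return a + b."""
    sig = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1)
           for i in range(w)]
    for level in sorted({c[0] for c in cells}):
        prev = list(sig)                 # cells in one level all read the previous level
        for j, i, src in cells:
            if j == level:
                G, nK = prev[i]
                sig[i] = (G | (nK & prev[src][0]), nK & prev[src][1])
    carries = [0] + [G for G, _ in sig]  # c(i+1) = G(i:0)
    p = a ^ b
    low = p ^ sum(c << i for i, c in enumerate(carries[:w]))
    return low | (carries[w] << w)


# cell counts: Kogge-Stone needs more cells than Ladner-Fischer at every width
print(len(kogge_stone(8)), len(ladner_fischer(8)))
print(len(kogge_stone(16)), len(ladner_fischer(16)))
```

The counts ignore the distinction between black and grey cells and any buffers, so they measure only the number of prefix operations.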

Black cells and grey cells
Carries: c(i) = G^{i-1}_0; K^{i-1}_0 terms not needed
G-only cells called and coloured “grey”

The story so far…
Parallel prefix adders available in VLSI
Log-depth adders possible:
  – high fan-outs {1,2,4,8…} & low cell count
  – low fan-outs {1,1,1,1…} & high cell count
Problematic in VLSI (buffering, area)
Idempotency of ‘•’ operator not exploited

II. Knowles’ “Family of Adders”

Log-depth prefix trees
In VLSI:
  – L-F trees require too much buffering → delay
  – K-S trees require too much area (wire flux)
Fan-outs characterised as:
  – {1,2,4,8…} Ladner-Fischer
  – {1,1,1,1…} Kogge-Stone

Knowles’ insight
Use other fan-out schemes
5 possible 8-bit log-depth prefix trees:
  – {1,1,1}   17 cells   Kogge-Stone
  – {1,1,2}   17 cells   uses idempotency
  – {1,1,4}   14 cells   no idempotency
  – {1,2,2}   14 cells   no idempotency
  – {1,2,4}   12 cells   Ladner-Fischer

Knowles’ 8-bit prefix trees All trees are log-depth

Tree construction rules
Levels are labelled 0,1,2...
Fan-out at j-th level, 2^k, satisfies 2^k ≤ 2^j
Fan-out at j-th level ≤ fan-out at (j+1)-th level
Lateral wire length at j-th level is 2^j
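
One possible way to turn these rules into a tree generator is sketched below. The backward, requirement-driven pass is a reconstruction made for illustration and is not claimed to be Knowles' or the presenter's procedure: each bit records the lowest bit its group must reach, a cell is placed only where the shared source lies inside that range, and overlapping groups are accepted. Under those assumptions it reproduces the cell counts quoted for the 8-bit family. All names are hypothetical.

```python
def knowles(fanouts):
    """Return (w, cells) for a fan-out list such as [1, 2, 2, 4].

    Each cell is (level, bit, source_bit); w = 2 ** len(fanouts).
    """
    levels = len(fanouts)
    w = 2 ** levels
    need = [0] * w                 # after the last level every group must reach bit 0
    cells = []
    for j in reversed(range(levels)):
        f = fanouts[j]
        before = list(range(w))    # lowest bit each group must reach before level j
        for i in range(w):
            src = (i | (f - 1)) - 2 ** j       # one shared source per block of f bits
            if need[i] < i and src >= need[i]:
                cells.append((j, i, src))
                before[i] = min(before[i], src + 1)      # own group must touch the source
                before[src] = min(before[src], need[i])  # source supplies the lower part
            else:
                before[i] = min(before[i], need[i])      # no cell: requirement passes down
        need = before
    return w, cells


def add_with(fanouts, a, b):
    """Add two numbers with the generated tree (assumed (G, notK) operator)."""
    w, cells = knowles(fanouts)
    sig = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1)
           for i in range(w)]
    for level in range(len(fanouts)):
        prev = list(sig)
        for j, i, src in cells:
            if j == level:
                G, nK = prev[i]
                sig[i] = (G | (nK & prev[src][0]), nK & prev[src][1])
    carries = [0] + [G for G, _ in sig]
    low = (a ^ b) ^ sum(c << i for i, c in enumerate(carries[:w]))
    return low | (carries[w] << w)


for f in ([1, 1, 1], [1, 1, 2], [1, 1, 4], [1, 2, 2], [1, 2, 4]):
    print(f, len(knowles(f)[1]), "cells")
```

The source index `(i | (f - 1)) - 2 ** j` encodes the last two rules: all f bits of a block share the top of the block as reference point, and the lateral wire reaches back 2^j positions from it.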

Knowles’ 16-bit trees - I
{1,1,1,1}   49 cells      {1,1,1,8}   42 cells
{1,1,1,2}   49 cells      {1,2,2,2}   42 cells
{1,1,1,4}   49 cells      {1,1,4,4}   40 cells
{1,1,2,2}   49 cells      {1,1,4,8}   36 cells
{1,1,2,4}   49 cells      {1,2,2,8}   36 cells
{1,1,2,8}   42 cells      {1,2,4,4}   36 cells
{1,2,2,4}   42 cells      {1,2,4,8}   32 cells

Knowles’ 16-bit trees - II
{1,1,1,1}                 {1,1,1,8}
{1,1,1,2}   Idempotent    {1,2,2,2}
{1,1,1,4}   Idempotent    {1,1,4,4}
{1,1,2,2}   Idempotent    {1,1,4,8}
{1,1,2,4}   Idempotent    {1,2,2,8}
{1,1,2,8}   Idempotent    {1,2,4,4}
{1,2,2,4}   Idempotent    {1,2,4,8}

Knowles’ 16-bit trees - III
{1,1,1,1}                 {1,1,1,8}   R
{1,1,1,2}   I             {1,2,2,2}   R
{1,1,1,4}   I             {1,1,4,4}   R
{1,1,2,2}   I             {1,1,4,8}   R
{1,1,2,4}   I             {1,2,2,8}   R
{1,1,2,8}   R, I          {1,2,4,4}   R
{1,2,2,4}   R, I          {1,2,4,8}   R

Quick way of spotting R, I
Define span(l) as distance from start of wire to first cell in l-th level
span(l) = 2^l - fanout(l) + 1
Tree characteristics:
  – R if span(j) ≥ span(k) for j < k
  – I if span(i) + span(j) = span(k) for i < j < k

Examples of R & I spotting
fanout(l)     span(l)       characteristic
[1,1,1,1]  →  [1,2,4,8]     neither R nor I
[1,1,2,2]  →  [1,2,3,7]     I only
[1,2,2,2]  →  [1,1,3,7]     R only
[1,2,2,4]  →  [1,1,3,5]     R & I
Are R & I adders “best”?
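
The span rule is easy to mechanise. The sketch below computes span(l) = 2^l - fanout(l) + 1 and applies the R test and the two-term I test as they read on the previous slide, with "≥" assumed for the comparison symbol lost in the R rule; the function names are illustrative.

```python
from itertools import combinations


def spans(fanouts):
    """span(l) = 2**l - fanout(l) + 1 for each level l."""
    return [2 ** l - f + 1 for l, f in enumerate(fanouts)]


def is_R(fanouts):
    """R if span(j) >= span(k) for some pair of levels j < k."""
    s = spans(fanouts)
    return any(s[j] >= s[k] for j, k in combinations(range(len(s)), 2))


def is_I_two_term(fanouts):
    """I if span(i) + span(j) == span(k) for some levels i < j < k."""
    s = spans(fanouts)
    return any(s[i] + s[j] == s[k]
               for i, j, k in combinations(range(len(s)), 3))


for f in ([1, 1, 1, 1], [1, 1, 2, 2], [1, 2, 2, 2], [1, 2, 2, 4]):
    print(f, spans(f), "R" if is_R(f) else "-", "I" if is_I_two_term(f) else "-")
```

The spans and the R flags agree with all four examples. Read literally, the two-term I test returns False for [1,2,2,4] (spans [1,1,3,5]), which the slide lists as R & I; only the sum of all three earlier spans, 1 + 1 + 3, reaches 5, so the rule on the slide may be a shorthand for a more general sum.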

VLSI design of prefix adders
Adders laid out as rectangular array of prefix cells (and gaps)
Assume cells measure 10 µm × 4 µm
  – 2 cells per significance → 20 µm / bit
Key design parameters:
  – buffering (area & delay)
  – wiring channels (area)

16-bit adder example
Assumptions
Maximum fan-out without buffering:
  – 3 cells + 80 µm wire (4 cell widths)
Maximum fan-out with buffering:
  – 9 cells + … µm wire (12 cell widths)
Employ {1,2,2,4} architecture

{1,2,2,4} prefix adder layout

Area vs Time for 32-bit adders
[Figure: plot of area against delay; labelled points: K-S {1,1,1,1,1}, {1,1,2,2,2}, {1,2,2,4,4} → [1,1,3,5,13], L-F {1,2,4,8,16}]

32-bit prefix tree adders
Exploitable trade-off between adder’s delay and area
  – Kogge-Stone adder 16% faster than Ladner-Fischer but 66% larger
  – {1,2,2,4,4} adder 8% faster than Ladner-Fischer but only 3% larger
  – buffering also trades off speed for area

III. New applications of prefix adders

Other addition operations
Late increment
  – Mod (2^w - 1) addition for Reed-Solomon coding
  – floating-point rounding
Late complement
  – absolute difference for video motion estimation
  – sign-magnitude addition
Typically use 2 adders and a MUX

Increments in prefix trees
Row of prefix cells = ‘late +1’ operation
Ladner-Fischer comprises many late +1’s
  – 1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit

Late increment tree
Adder returns A+B if inc = 0
Adder returns A+B+1 if inc = 1

Late increment logic
“Late Carry” lc(i) set high if:
  – c(i) = 1 or
  – inc = 1 and (a(n), b(n)) ≠ (0,0) ∀ n: 0 ≤ n < i
[Figure: sum-cell logic combining p(i), inc, the group not-kill signal ¬K and the carry c(i) = G into lc(i) and s(i)]
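
A bit-level sketch of the late-carry rule, assuming s(i) = p(i) ⊕ lc(i) in the final row and modelling the carry tree with a serial loop for brevity. `late_increment_add` is a hypothetical name.

```python
def late_increment_add(a: int, b: int, inc: int, w: int) -> int:
    """Return a + b + inc as a (w+1)-bit value using one set of group signals."""
    G, nK = 0, 1                  # empty group: nothing generated, nothing killed
    lc = [inc]                    # lc(0) = inc
    for i in range(w):
        g_i = (a >> i) & (b >> i) & 1
        nk_i = ((a >> i) | (b >> i)) & 1
        G, nK = g_i | (nk_i & G), nk_i & nK   # group signals for bits i..0
        # lc(i+1): the ordinary carry, or inc passing through bits i..0 unkilled
        lc.append(G | (inc & nK))
    p = a ^ b
    s = sum((((p >> i) & 1) ^ lc[i]) << i for i in range(w))
    return s | (lc[w] << w)


assert late_increment_add(5, 6, 0, 4) == 11
assert late_increment_add(5, 6, 1, 4) == 12
```

The point of the technique is that `inc` enters only in the last row, so it may arrive after the carry tree has already settled.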

Late complement theory
In 2’s-complement, ¬N = -(N+1)
A + ¬B = A - B - 1
  * late increment then yields A - B
¬(A + ¬B) = -(A - B - 1 + 1) = B - A
Absolute difference readily available

Absolute difference logic
If c(w) = 0, result negative
  – if c(w) = 0, invert all the bits
  – else always perform late increment with ¬K^{i-1}_0
[Figure: sum-cell logic with signals p(i), c(w), the group not-kill signal ¬K, c(i) and s(i)]
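
The two cases can be sketched at bit level as follows, again with a serial loop standing in for the carry tree. The adder computes A + ¬B once; the carry out c(w) then selects between inverting the sum and applying the late increment. `abs_diff` is an illustrative name.

```python
def abs_diff(a: int, b: int, w: int) -> int:
    """Return |a - b| for w-bit unsigned a, b from a single A + notB addition."""
    mask = (1 << w) - 1
    nb = ~b & mask                       # bitwise complement of B
    G, nK = 0, 1
    c, nk_group = [0], [1]               # c(0) = 0; empty group kills nothing
    for i in range(w):
        g_i = (a >> i) & (nb >> i) & 1
        nk_i = ((a >> i) | (nb >> i)) & 1
        G, nK = g_i | (nk_i & G), nk_i & nK
        c.append(G)                      # c(i+1) = G(i:0)
        nk_group.append(nK)              # notK(i:0)
    p = a ^ nb
    out = 0
    for i in range(w):
        p_i = (p >> i) & 1
        if c[w] == 0:
            bit = 1 ^ p_i ^ c[i]                 # negative: invert A + notB to get B - A
        else:
            bit = p_i ^ (c[i] | nk_group[i])     # late increment: lc(i) = c(i) OR notK(i-1:0)
        out |= bit << i
    return out


assert abs_diff(3, 10, 4) == 7
assert abs_diff(10, 3, 4) == 7
```

When A equals B the carry out is 0 and inverting the all-ones sum gives 0, so no special case is needed.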

Summary of “late” ops
Available on all prefix adders
Extra delay: 1 gate’s delay + buffering
Extra hardware: ≈ w black cells
This technique used in floating-point units
  – late increment for rounding
  – late complement for true subtraction

Media (“packed”) arithmetic
Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations
Examples:
  – 32-bit adder → 4 8-bit adders
  – 32-bit multiplier → 2 16-bit multipliers

Partitioning an adder
Criteria:
  – support carries propagating within sub-adders
  – prevent carries propagating between sub-adders
Solutions:
  – put AND gates on carry chains → slower adder
  – put dummy 0’s on operand bits → larger adder
Use prefix adder!!

Packed prefix adder - 1
Force ¬k(n) = 0 at partition points
  – prevents carries propagating across bit n
  – exploits don’t care condition (g, ¬k) = (1,0)
Implementation
  – change ¬k(n) gate to (2,1) OR-AND gate
  – delay-neutral modification

Packed prefix adder - 2
Force c(n) = G^{n-1}_0 = 0 at partition points
  – prevents c(n) → s(n) errors
Implementation
  – insert AND gates (off critical path) or
  – change G^{n-1}_0 gate to ({2,1},1) complex gate
  – BUT need G^{n-1}_0 signal for sub-adder overflows
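
The two forcing steps can be modelled together in a short sketch. It assumes that a partition point n is the least significant bit of an upper sub-adder and uses a serial loop in place of the carry tree; `packed_add` and its `partitions` argument are hypothetical names.

```python
def packed_add(a: int, b: int, w: int, partitions) -> int:
    """Add two w-bit words as independent sub-words.

    `partitions` holds the LSB index n of every sub-adder except the lowest.
    """
    G, nK = 0, 1
    c = [0]
    for i in range(w):
        g_i = (a >> i) & (b >> i) & 1
        nk_i = ((a >> i) | (b >> i)) & 1
        if i in partitions:
            nk_i = 0                     # step 1: force notk(n) = 0, may give (g, notk) = (1, 0)
        G, nK = g_i | (nk_i & G), nk_i & nK
        c.append(G)                      # c(i+1); at a partition this is just g(n)
    p = a ^ b
    out = 0
    for i in range(w):
        c_i = 0 if i in partitions else c[i]   # step 2: force c(n) = 0 into the sum cell
        out |= (((p >> i) & 1) ^ c_i) << i
    return out


# a 32-bit word treated as four independent 8-bit lanes
lanes = {8, 16, 24}
result = packed_add(0x01FF80FF, 0x01010180, 32, lanes)
```

The unforced value `c[n]` is still computed inside the loop, which matches the slide's remark that G^{n-1}_0 remains available as the overflow of the lower sub-adder.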

Packed prefix adder - 3 Sub-adder carries complete early Extraneous cells automatically do nothing

Last Slide
Recent developments in prefix adders:
  – new “family” of log-depth trees
  – late operations
  – packed arithmetic for media processing
Future possibilities:
  – systematic exploitation of idempotency
  – trees with reduced buffering
  – combine packed arithmetic/late ops

ANY QUESTIONS OR COMMENTS?