CALTECH CS137 Winter2006 -- DeHon 1 CS137: Electronic Design Automation Day 9: January 30, 2006 Parallel Prefix.

Presentation transcript:

CALTECH CS137 Winter DeHon 1 CS137: Electronic Design Automation Day 9: January 30, 2006 Parallel Prefix

CALTECH CS137 Winter DeHon 2 Today: Bit-Level (Addition, LUT Cascades); For Sums (Applications); FSMs; SATADD; Data Forwarding; Pointer Jumping (Applications)

CALTECH CS137 Winter DeHon 3 Introduction / Reminder Addition in Log Time

CALTECH CS137 Winter DeHon 4 Ripple Carry Addition Simple “definition” of addition Serially resolve carry at each bit

CALTECH CS137 Winter DeHon 5 CLA Think about each adder bit as computing a function on the carry in –c[i]=g(c[i-1]) –Particular function g will depend on a[i], b[i] –g=f(a[i],b[i])

CALTECH CS137 Winter DeHon 6 Functions What functions can g(c[i-1]) be? –g(x)=1 when a[i]=b[i]=1 –g(x)=x when a[i] xor b[i]=1 –g(x)=0 when a[i]=b[i]=0

CALTECH CS137 Winter DeHon 7 Functions What functions can g(c[i-1]) be? –g(x)=1 Generate: a[i]=b[i]=1 –g(x)=x Propagate: a[i] xor b[i]=1 –g(x)=0 Squash: a[i]=b[i]=0

CALTECH CS137 Winter DeHon 8 Combining Want to combine functions –Compute c[i]=g[i](g[i-1](c[i-2])) –Compute the composition of two functions What functions will the composition of two of these functions be? –Same as before: Propagate, generate, squash

CALTECH CS137 Winter DeHon 9 Compose Rules (LSB MSB)
Compose  Result
GG       ?
GP       ?
GS       ?
PG       ?
PP       ?
PS       ?
SG       ?
SP       ?
SS       ?

CALTECH CS137 Winter DeHon 10 Compose Rules (LSB MSB)
Compose  Result
GG       G
GP       G
GS       S
PG       G
PP       P
PS       S
SG       G
SP       S
SS       S
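
The compose rules can be checked mechanically by treating G, P and S as functions on the carry-in: the MSB stage wins unless it propagates, so generate followed by generate is still generate. The following is a minimal Python sketch under that reading; the names `FUNCS`, `compose`, `bit_kind` and `carry_out` are illustrative and not from the slides.

```python
from functools import reduce

# Each adder bit acts on its carry-in as one of three functions.
FUNCS = {
    "G": lambda c: 1,  # generate: carry out is 1
    "P": lambda c: c,  # propagate: carry out equals carry in
    "S": lambda c: 0,  # squash: carry out is 0
}

def compose(lsb, msb):
    """Combined behaviour of two adjacent bits (LSB applied first)."""
    return lsb if msb == "P" else msb

def bit_kind(a, b):
    """Classify one bit position from its two operand bits."""
    if a and b:
        return "G"
    return "P" if a != b else "S"

def carry_out(a_bits, b_bits, carry_in=0):
    """Carry out of a whole adder; bit lists are LSB first and non-empty."""
    kinds = [bit_kind(a, b) for a, b in zip(a_bits, b_bits)]
    return FUNCS[reduce(compose, kinds)](carry_in)
```

Here `reduce` combines the stages serially; because `compose` is associative, the same combination could equally be grouped as a tree.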

CALTECH CS137 Winter DeHon 11 Combining Do it again… Combine g[i-3,i-2] and g[i-1,i] What do we get?

CALTECH CS137 Winter DeHon 12 Reduce Tree

CALTECH CS137 Winter DeHon 13 Associative Reduce → Prefix Shows us how to compute the Nth value in O(log(N)) time Can actually produce all intermediate values in this time –w/ only a constant factor more hardware

CALTECH CS137 Winter DeHon 14 Prefix Tree

CALTECH CS137 Winter DeHon 15 Parallel Prefix Important Pattern Applicable any time operation is associative Function Composition is always associative
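
As a rough software model of the pattern, the sketch below computes every prefix of a list under any associative operator. Each recursion level stands for one layer of a tree; the `op` calls inside a level are independent, so in hardware they could all run at once. The function name `parallel_prefix` is an assumption for illustration, and the code runs serially in Python.

```python
def parallel_prefix(values, op):
    """Inclusive prefix of `values` under an associative `op`.

    Uses O(N) calls to `op` in total, arranged in O(log N) levels.
    """
    n = len(values)
    if n <= 1:
        return list(values)
    # Reduce step: combine adjacent pairs (one tree level).
    pairs = [op(values[i], values[i + 1]) for i in range(0, n - 1, 2)]
    # Recurse on the half-length problem.
    pair_prefix = parallel_prefix(pairs, op)
    # Distribute step: fill in the positions the pairs skipped.
    out = [values[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(pair_prefix[i // 2])
        else:
            out.append(op(pair_prefix[i // 2 - 1], values[i]))
    return out
```

Only associativity is required; the operator does not need to be commutative, since operands are always combined in their original left-to-right order.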

CALTECH CS137 Winter DeHon 16 Generalizing LUT Cascade

CALTECH CS137 Winter DeHon 17 Cascaded LUT Delay Model Tcascade =T(3LUT) + T(mux) Don’t pay –General interconnect –Full 4-LUT delay

CALTECH CS137 Winter DeHon 18 Parallel Prefix LUT Cascade? Can we do better than N×Tmux? Can we compute LUT cascade in O(log(N)) time? Can we compute mux cascade using parallel prefix? Can we make mux cascade associative?

CALTECH CS137 Winter DeHon 19 Parallel Prefix Mux cascade How can mux transform S → mux-out? –A=0, B=0 → mux-out=0 –A=1, B=1 → mux-out=1 –A=0, B=1 → mux-out=S –A=1, B=0 → mux-out=/S

CALTECH CS137 Winter DeHon 20 Parallel Prefix Mux cascade How can mux transform S → mux-out? –A=0, B=0 → mux-out=0 Stop = S –A=1, B=1 → mux-out=1 Generate = G –A=0, B=1 → mux-out=S Buffer = B –A=1, B=0 → mux-out=/S Invert = I

CALTECH CS137 Winter DeHon 21 Parallel Prefix Mux cascade How can 2 muxes transform input? Can I compute 2-mux transforms from 1 mux transforms?

CALTECH CS137 Winter DeHon 22 Two-mux transforms
SS → S   SG → G   SB → S   SI → G
GS → S   GG → G   GB → G   GI → S
BS → S   BG → G   BB → B   BI → I
IS → S   IG → G   IB → I   II → B
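
The two-mux table can be derived rather than memorized: Stop, Generate, Buffer and Invert are exactly the four functions from one bit to one bit, so any composition of them must again be one of the four. A small Python sketch, with illustrative names (`TRANSFORMS`, `compose`, `table`) that are not from the slides:

```python
# The four ways one mux stage can act on the incoming signal s.
TRANSFORMS = {
    "S": lambda s: 0,      # stop:     A=0, B=0
    "G": lambda s: 1,      # generate: A=1, B=1
    "B": lambda s: s,      # buffer:   A=0, B=1
    "I": lambda s: 1 - s,  # invert:   A=1, B=0
}

def truth_table(f):
    """Identify a one-bit function by its outputs on 0 and 1."""
    return (f(0), f(1))

NAME_OF = {truth_table(f): name for name, f in TRANSFORMS.items()}

def compose(first, second):
    """Name of the transform 'apply first, then second'."""
    f, g = TRANSFORMS[first], TRANSFORMS[second]
    return NAME_OF[truth_table(lambda s: g(f(s)))]

# All sixteen two-mux transforms, keyed like the slide ("SI", "BB", ...).
table = {a + b: compose(a, b) for a in "SGBI" for b in "SGBI"}
```

Because `compose` is ordinary function composition, it is associative, which is what lets the mux cascade be evaluated with a prefix tree.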

CALTECH CS137 Winter DeHon 23 Generalizing mux-cascade How can N muxes transform the input? Is mux transform composition associative?

CALTECH CS137 Winter DeHon 24 Associative Reduce Mux-Cascade Can be hardwired, no general interconnect

CALTECH CS137 Winter DeHon 25 For Sums

CALTECH CS137 Winter DeHon 26 Prefix Sum Common Operation: –Want B[x] such that B[x]=A[0]+A[1]+…+A[x] –For I=0 to x: B[I]=B[I-1]+A[I] (with B[-1]=0)

CALTECH CS137 Winter DeHon 27 Prefix Sum Compute in tree fashion –A[I]+A[I+1] –A[I]+A[I+1]+A[I+2]+A[I+3] –… Combine partial sums back down tree –S(0:7)+S(8:9)+S(10)=S(0:10)

CALTECH CS137 Winter DeHon 28 Other simple operators Prefix-OR Prefix-AND Prefix-MAX Prefix-MIN

CALTECH CS137 Winter DeHon 29 Find-First One Useful for arbitration –Finds first (highest-priority) requestor –Also magnitude finding in numbers How: –Prefix-OR –Locally compute X[I-1]^X[I] –Flags the first one
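
The two steps above can be sketched in a few lines of Python. `itertools.accumulate` stands in for the prefix-OR network (it runs serially here), and the function name `find_first_one` is an assumption for illustration.

```python
from itertools import accumulate
from operator import or_

def find_first_one(x):
    """One-hot flag vector marking the first 1 in the bit list `x`."""
    seen = list(accumulate(x, or_))  # prefix-OR: 1 from the first 1 onward
    prev = [0] + seen[:-1]           # the same vector shifted by one position
    # The XOR of neighbours is 1 only where the prefix-OR steps from 0 to 1.
    return [p ^ s for p, s in zip(prev, seen)]
```

For the request vector `[0, 0, 1, 0, 1]` the prefix-OR is `[0, 0, 1, 1, 1]`, and the neighbour XOR flags only position 2.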

CALTECH CS137 Winter DeHon 30 Arbitration Often want to find first M requestors –E.g. Assign unique memory ports to first M processors requesting Prefix-sum across all potential requesters Counts requesters, giving unique number to each Know if one of first M –Perhaps which resource assigned
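
A minimal sketch of this use of prefix-sum, assuming requests are given as a list of 0/1 bits and ports are numbered from 0. The name `arbitrate` and the `None` convention for "not granted" are illustrative choices, and `accumulate` is a serial stand-in for the prefix network.

```python
from itertools import accumulate

def arbitrate(requests, m):
    """Per requester: the granted port number, or None if not granted."""
    counts = list(accumulate(requests))  # inclusive prefix-sum of request bits
    # A requester's count is its 1-based position among all requesters.
    return [c - 1 if r and c <= m else None
            for r, c in zip(requests, counts)]
```

With `requests = [1, 0, 1, 1, 0, 1]` and `m = 2`, the first two requesters receive ports 0 and 1 and the remaining two are turned away.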

CALTECH CS137 Winter DeHon 31 Partitioning Use something to order –E.g. spectral linear ordering –…or 1D cellular swap to produce linear order Parallel prefix on area of units –If not all same area Know where the midpoint is

CALTECH CS137 Winter DeHon 32 Channel Width Prefix sum on delta wires at each node –To compute net channel widths at all points along channel –E.g. 1D ordered Maybe use with cellular placement scheme

CALTECH CS137 Winter DeHon 33 Rank Finding Looking for I’th ordered element Do a prefix-sum on high-bit only –Know m=number of things > … High-low search on result –I.e. if number > I, recurse on half with leading zero –If number < I, search for (I-m)’th element in half with high-bit true Find median in log^2(N) time

CALTECH CS137 Winter DeHon 34 FA/FSM Evaluation (regular expression recognition)

CALTECH CS137 Winter DeHon 35 Finite Automata Machine has finite state: S On each cycle –Input I –Compute output and new state Based on inputs and current state O_i, S_(i+1) = f(S_i, I_i) Intuitively, a sequential process –Must know previous state to compute next –Must know state to compute output

CALTECH CS137 Winter DeHon 36 Function Specialization But, this is just functions –…and function composition is associative Given that we know input sequence: –I_0, I_1, I_2, … Can compute specialized functions: –f_i(s)=f(s,I_i) What is f_i(s)? –Worst-case, a translation table: S=0 → NS0, S=1 → NS1, …

CALTECH CS137 Winter DeHon 37 Function Composition Now: O_(i+m), S_(i+m+1) = f_(i+m)(f_(i+m-1)(f_(i+m-2)(…f_i(S_i)))) Can we compute the function composition? –f_(i+1,i)(s)=f_(i+1)(f_i(s)) –What is f_(i+1,i)(s)? A translation table just like f_i(s) and f_(i+1)(s) Table of size |S|, can fill in in O(|S|) time

CALTECH CS137 Winter DeHon 38 Recursive Function Composition Now: O_(i+m), S_(i+m+1) = f_(i+m)(f_(i+m-1)(f_(i+m-2)(…f_i(S_i)))) We can compute the composition –f_(i+1,i)(s)=f_(i+1)(f_i(s)) Repeat to compute –f_(i+3,i)(s)=f_(i+3,i+2)(f_(i+1,i)(s)) –Etc. until have computed: f_(i+m,i)(s) in O(log(m)) steps
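
To make the translation-table idea concrete, here is a small Python sketch. The 3-state machine (it latches state 2 once two consecutive 1s have been seen) is a made-up example, and all names are illustrative. Each input symbol is specialized into a table, tables are composed pairwise, and the levels of `tree_reduce` correspond to the O(log(m)) steps.

```python
# Hypothetical FSM over inputs {0, 1}: NEXT_STATE[state][symbol].
NEXT_STATE = {0: (0, 1), 1: (0, 2), 2: (2, 2)}
STATES = tuple(NEXT_STATE)

def specialize(symbol):
    """f_i(s) = f(s, I_i), stored as a translation table indexed by state."""
    return tuple(NEXT_STATE[s][symbol] for s in STATES)

def compose(first, second):
    """Table for 'apply first, then second'; filled in O(|S|) time."""
    return tuple(second[first[s]] for s in STATES)

def tree_reduce(tables):
    """Compose a non-empty list of tables level by level, keeping order."""
    while len(tables) > 1:
        nxt = [compose(tables[i], tables[i + 1])
               for i in range(0, len(tables) - 1, 2)]
        if len(tables) % 2:
            nxt.append(tables[-1])  # odd table out waits for the next level
        tables = nxt
    return tables[0]

def run(inputs, start=0):
    """Final state after consuming `inputs`, via composed tables."""
    if not inputs:
        return start
    return tree_reduce([specialize(x) for x in inputs])[start]
```

The composed table answers the question for every possible start state at once, which is why the start state is only looked up at the very end.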

CALTECH CS137 Winter DeHon 39 Implications If can get input stream, –Any FA can be evaluated in O(log(N)) time –Regular Expression recognition in O(log(N)) Any streaming operator with finite state –Where the input stream is independent of the output stream –Can be run arbitrarily fast by using parallel-prefix on FSM evaluation

CALTECH CS137 Winter DeHon 40 Saturated Addition S_(i+1)=max(min(I_i+S_i,maxval),minval) Could model as FSM with: –|S|=maxval-minval+1 So, in theory, FSM result applies …but |S| might be 2^16, 2^24

CALTECH CS137 Winter DeHon 41 SATADD Composition Can compute composition efficiently [Papadantonakis et al. FPT2005]
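
One way to see why a compact composition is possible: a saturating add is a function of the form "add a constant, then clamp to a range", and composing two such functions gives another function of the same form. The sketch below stores each function as a triple and composes triples directly instead of building a table of size |S|. This is an illustration under that assumption, not necessarily the formulation used in the cited paper, and the names are hypothetical.

```python
def clamp(x, lo, hi):
    """Limit x to the range [lo, hi]."""
    return max(lo, min(x, hi))

def satadd(a, minval, maxval):
    """The function s -> clamp(s + a), stored as a triple (add, lo, hi)."""
    return (a, minval, maxval)

def apply_fn(f, s):
    """Evaluate a stored triple on state s."""
    add, lo, hi = f
    return clamp(s + add, lo, hi)

def compose(first, second):
    """Triple for 'apply first, then second'; constant work per pair."""
    a1, lo1, hi1 = first
    a2, lo2, hi2 = second
    # Shift first's limits by second's addend, then clamp them to second's range.
    return (a1 + a2,
            clamp(lo1 + a2, lo2, hi2),
            clamp(hi1 + a2, lo2, hi2))
```

Since each composition takes constant work and composition is associative, a reduce or prefix tree over these triples evaluates a chain of saturating adds in logarithmic depth.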

CALTECH CS137 Winter DeHon 42 SATADD Composition

CALTECH CS137 Winter DeHon 43 SATADD Reduce Tree

CALTECH CS137 Winter DeHon 44 Data Forwarding UltraScalar From Henry, Kuszmaul, et al. ARVLSI’99, SPAA’99, ISCA’00

CALTECH CS137 Winter DeHon 45 Consider Machine Each FU has a full RF –FU=Functional Unit –RF=Register File Build network between FUs –use network to connect produce/consume –use register names to configure interconnect Signal data ready along network

CALTECH CS137 Winter DeHon 46 Ultrascalar: concept model

CALTECH CS137 Winter DeHon 47 Ultrascalar Concept Linear delay O(1) register cost / FU Complete renaming at each FU –different set of registers –so when say complete RF at each FU, that’s only the logical registers

CALTECH CS137 Winter DeHon 48 Ultrascalar: cyclic prefix

CALTECH CS137 Winter DeHon 49 Parallel Prefix Basic idea is one we saw with adders An FU will either – produce a register (generate) –or transmit a register (propagate) –can do tree combining pair of FUs will either both propagate or will generate compute function by pair in one stage recurse to next stage get log-depth tree network connecting producer and consumer

CALTECH CS137 Winter DeHon 50 Ultrascalar: cyclic prefix

CALTECH CS137 Winter DeHon 51 Pointer Jumping

CALTECH CS137 Winter DeHon 52 Pointer Jumping Motivation Have a tree –E.g. is-a relationship tree in NETL Want to know if a node is of a particular type (is-a mammal) How long to find out? –Naïve: O(distance) Spread one level per timestep

CALTECH CS137 Winter DeHon 53 Following Pointer Chain Naïve: spread/color from target node –On each step push down to children Most nodes idle –Only active on the step something arrives Can the idle nodes do something to accelerate?

CALTECH CS137 Winter DeHon 54 Jumping Intermediates Add notion of transitive parent Initially: transitive-parent=parent On each step: –If my transitive-parent marked Mark self –else Transitive-parent = transitive-parent(transitive-parent)

CALTECH CS137 Winter DeHon 55 How Much Jumping? On each step: –If my transitive-parent marked Mark self –else Transitive-parent = transitive-parent(transitive-parent) How many such steps? –O(log(distance))
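
The step rule can be simulated directly. The sketch below assumes the tree is given as a `parent` list in which the root points to itself, and it updates all nodes synchronously from the previous step's values, as a parallel machine would. The function name `mark_descendants` is illustrative.

```python
def mark_descendants(parent, target):
    """Mark `target` and every node below it by pointer jumping.

    Returns (marked, steps), where steps counts the rounds that changed
    something.
    """
    n = len(parent)
    tp = list(parent)  # transitive-parent, initially the parent
    marked = [v == target for v in range(n)]
    steps = 0
    while True:
        # Every node looks at its transitive parent's old mark.
        new_marked = [marked[v] or marked[tp[v]] for v in range(n)]
        # Unmarked nodes jump to their transitive parent's transitive parent.
        new_tp = [tp[v] if new_marked[v] else tp[tp[v]] for v in range(n)]
        if new_marked == marked and new_tp == tp:
            return marked, steps
        marked, tp = new_marked, new_tp
        steps += 1
```

On a chain of 17 nodes with the target at the root, the deepest node is at distance 16, and the marks reach it in far fewer than 16 rounds because both the marked region and the jump distance double each step.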

CALTECH CS137 Winter DeHon 56 Pointer Jumping Same basic idea as data forwarding Can find length of a list in O(log(length)) time
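
A minimal sketch of the list-length case, assuming the list is given as a successor array `nxt` whose tail points to itself (an illustrative encoding, not from the slides). Each node tracks the distance covered by its current pointer, and every round doubles how far the pointers reach.

```python
def list_ranks(nxt):
    """Distance from each node to the tail of a linked list.

    Returns (dist, rounds); the list length is dist[head] + 1.
    """
    n = len(nxt)
    nxt = list(nxt)
    # dist[v] is always the number of links between v and nxt[v].
    dist = [0 if nxt[v] == v else 1 for v in range(n)]
    rounds = 0
    while any(nxt[nxt[v]] != nxt[v] for v in range(n)):
        # All nodes update at once from the previous round's values.
        dist = [dist[v] + dist[nxt[v]] for v in range(n)]
        nxt = [nxt[nxt[v]] for v in range(n)]
        rounds += 1
    return dist, rounds
```

The loop stops once every pointer has reached the tail, at which point each `dist[v]` is the node's distance from the end of the list.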

CALTECH CS137 Winter DeHon 57 Variations

CALTECH CS137 Winter DeHon 58 Segmented Parallel Prefix f_i() can ignore its input –…or the function can let special I’s tell it to reset the state E.g. build huge/hardwired carry chain hardware and configurably break into separate adders (LUT cascades)
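
One common way to express segmentation, shown here for sums: pair every value with a flag that says "a new segment starts here", and use a combining operator that discards everything to its left when the right operand carries the flag. The operator stays associative, so the same prefix tree applies. The names `seg_op` and `segmented_sums` are illustrative, and `accumulate` is a serial stand-in for the prefix network.

```python
from itertools import accumulate

def seg_op(left, right):
    """Associative combine on (starts_segment, value) pairs."""
    lflag, lval = left
    rflag, rval = right
    # A flagged right operand ignores its input, which resets the running sum.
    return (lflag or rflag, rval if rflag else lval + rval)

def segmented_sums(values, starts):
    """Running sums that restart wherever `starts` holds a 1."""
    pairs = list(zip(starts, values))
    return [v for _, v in accumulate(pairs, seg_op)]
```

For `values = [1, 2, 3, 4, 5]` with `starts = [1, 0, 1, 0, 0]`, the sums restart at the third element, giving `[1, 3, 3, 7, 12]`.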

CALTECH CS137 Winter DeHon 59 Cyclic Segmented Parallel Prefix Wrap output back to input Configurable segmentation defines the starting/stopping point E.g. –In Ultrascalar data forwarding Leave data in place and use FUs in FIFO fashion, redefining the “head” at each cycle –Priority allocation scheme Mark priority item as start of segment –Perhaps choose randomly (e.g. hardware router)

CALTECH CS137 Winter DeHon 60 Admin Class Wed. Baseline due Friday

CALTECH CS137 Winter DeHon 61 Big Ideas Any associative operation can be made parallel –Performed in O(log(N)) time with O(N) hardware Any Finite Automata computation can be accelerated with parallelism –(FA evaluation ∈ NC) Function composition is associative –→ all functional operations can be associative