Published by Daniela Snow. Modified over 9 years ago.
1
CALTECH CS137 Winter2006 -- DeHon
CS137: Electronic Design Automation
Day 9: January 30, 2006
Parallel Prefix
2
Today
Bit-Level
  – Addition
  – LUT Cascades
For Sums
  – Applications
FSMs
SATADD
Data Forwarding
Pointer Jumping
  – Applications
3
Introduction / Reminder
Addition in Log Time
4
Ripple Carry Addition
Simple “definition” of addition
Serially resolve carry at each bit
5
CLA (carry lookahead)
Think about each adder bit as computing a function on the carry in
  – c[i]=g(c[i-1])
  – The particular function g will depend on a[i], b[i]
  – g=f(a,b)
6
Functions
What functions can g(c[i-1]) be?
  – g(x)=1 when a[i]=b[i]=1
  – g(x)=x when a[i] xor b[i]=1
  – g(x)=0 when a[i]=b[i]=0
7
Functions
What functions can g(c[i-1]) be?
  – g(x)=1: Generate, when a[i]=b[i]=1
  – g(x)=x: Propagate, when a[i] xor b[i]=1
  – g(x)=0: Squash, when a[i]=b[i]=0
8
Combining
Want to combine functions
  – Compute c[i]=g[i](g[i-1](c[i-2]))
  – Compute the composition of two functions
What functions will the composition of two of these functions be?
  – Same as before
      Propagate, generate, squash
9
Compose Rules
What is the result of composing each (LSB MSB) pair?
GG, GP, GS, PG, PP, PS, SG, SP, SS
10
Compose Rules
Compose (LSB MSB)   Result
G G                 G
G P                 G
G S                 S
P G                 G
P P                 P
P S                 S
S G                 G
S P                 S
S S                 S
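The compose rules can be checked by treating G, P, and S as small functions on the carry. The following is a minimal sketch, not code from the lecture: the names `classify`, `compose`, `apply`, and `carries` are hypothetical, and the running composition is written as a serial loop only to keep the example short.

```python
# Carry functions as symbols: 'G' (always 1), 'P' (pass carry-in), 'S' (always 0).

def classify(a_bit, b_bit):
    """Return the carry function g for one adder bit."""
    if a_bit and b_bit:
        return "G"
    if a_bit or b_bit:
        return "P"
    return "S"

def compose(lsb, msb):
    """Compose two carry functions; the MSB stage is applied after the LSB stage."""
    # A propagating MSB defers to the LSB; otherwise the MSB decides alone.
    return lsb if msb == "P" else msb

def apply(fn, carry_in):
    """Evaluate a carry function on a concrete carry-in bit."""
    return {"G": 1, "P": carry_in, "S": 0}[fn]

def carries(a, b, width, carry_in=0):
    """Carry out of every bit position, LSB first, via a running composition."""
    out, acc = [], "P"  # 'P' is the identity function
    for i in range(width):
        acc = compose(acc, classify((a >> i) & 1, (b >> i) & 1))
        out.append(apply(acc, carry_in))
    return out
```

Because `compose` is associative, the running composition in `carries` could be replaced by a tree of `compose` operations without changing the result.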
11
Combining
Do it again…
Combine g[i-3,i-2] and g[i-1,i]
What do we get?
12
Reduce Tree
13
Associative Reduce Prefix
The reduce tree shows us how to compute the Nth value in O(log(N)) time
Can actually produce all intermediate values in this time
  – with only a constant factor more hardware
14
Prefix Tree
15
Parallel Prefix
Important Pattern
Applicable any time the operation is associative
Function composition is always associative
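As an illustration of the pattern, here is a minimal sketch of an inclusive scan that works for any associative operator. It is an assumption of this example, not the lecture's prefix tree: it uses the simple "doubling" formulation, which finishes in ceil(log2(N)) rounds but performs O(N log N) operations rather than the O(N) of the tree form. The name `parallel_prefix` is hypothetical.

```python
import operator

def parallel_prefix(values, op):
    """Inclusive scan in ceil(log2(N)) rounds; returns (results, rounds).

    Within a round every position is updated independently, so each round
    corresponds to one parallel step.
    """
    result = list(values)
    n = len(result)
    step, rounds = 1, 0
    while step < n:
        # Position i combines with the partial result `step` places earlier.
        result = [
            result[i] if i < step else op(result[i - step], result[i])
            for i in range(n)
        ]
        step *= 2
        rounds += 1
    return result, rounds

sums, rounds = parallel_prefix([1, 2, 3, 4, 5], operator.add)
```

The operator only needs to be associative, not commutative; string concatenation, for example, works because the earlier operand is always passed on the left.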
16
Generalizing LUT Cascade
17
Cascaded LUT Delay Model
Tcascade = T(3LUT) + T(mux)
Don’t pay for
  – General interconnect
  – Full 4-LUT delay
18
Parallel Prefix LUT Cascade?
Can we do better than N×Tmux?
Can we compute LUT cascade in O(log(N)) time?
Can we compute mux cascade using parallel prefix?
Can we make mux cascade associative?
19
Parallel Prefix Mux cascade
How can a mux transform S → mux-out?
  – A=0, B=0 → mux-out=0
  – A=1, B=1 → mux-out=1
  – A=0, B=1 → mux-out=S
  – A=1, B=0 → mux-out=/S
20
Parallel Prefix Mux cascade
How can a mux transform S → mux-out?
  – A=0, B=0 → mux-out=0: Stop = S
  – A=1, B=1 → mux-out=1: Generate = G
  – A=0, B=1 → mux-out=S: Buffer = B
  – A=1, B=0 → mux-out=/S: Invert = I
21
Parallel Prefix Mux cascade
How can 2 muxes transform input?
Can I compute 2-mux transforms from 1-mux transforms?
22
Two-mux transforms
SS → S   SG → G   SB → S   SI → G
GS → S   GG → G   GB → G   GI → S
BS → S   BG → G   BB → B   BI → I
IS → S   IG → G   IB → I   II → B
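The two-mux table can be derived mechanically by composing the four one-mux transforms. The sketch below assumes that the first mux's output drives the select input of the second, and that a mux outputs B when its select is 1 and A otherwise; the names `TRANSFORMS`, `mux_transform`, `compose`, and `TABLE` are hypothetical.

```python
# Each transform maps the select/input bit s to the mux output.
TRANSFORMS = {
    "S": lambda s: 0,      # stop: output is 0
    "G": lambda s: 1,      # generate: output is 1
    "B": lambda s: s,      # buffer: output follows the input
    "I": lambda s: 1 - s,  # invert: output is the complement
}

# A transform is identified by its outputs for s=0 and s=1.
NAME_OF = {(0, 0): "S", (1, 1): "G", (0, 1): "B", (1, 0): "I"}

def mux_transform(a, b):
    """Name the transform of a mux that outputs b when select is 1, else a."""
    return NAME_OF[(a, b)]

def compose(first, second):
    """Transform of two cascaded muxes: `first` is applied, then `second`."""
    truth = tuple(TRANSFORMS[second](TRANSFORMS[first](s)) for s in (0, 1))
    return NAME_OF[truth]

# Pair "XY" means X is applied first and Y second.
TABLE = {x + y: compose(x, y) for x in "SGBI" for y in "SGBI"}
```

Since the set {S, G, B, I} is closed under composition and composition is associative, an N-mux cascade can be reduced with the same tree structure used for carries.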
23
Generalizing mux-cascade
How can N muxes transform the input?
Is mux transform composition associative?
24
Associative Reduce Mux-Cascade
Can be hardwired, no general interconnect
25
For Sums
26
Prefix Sum
Common Operation:
  – Want B[x] such that B[x]=A[0]+A[1]+…+A[x]
  – For i=0 to x
      B[i]=B[i-1]+A[i]
27
Prefix Sum
Compute in tree fashion
  – A[I]+A[I+1]
  – A[I]+A[I+1]+A[I+2]+A[I+3]
  – …
Combine partial sums back down tree
  – S(0:7)+S(8:9)+S(10)=S(0:10)
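The up-then-down structure can be sketched recursively: sum adjacent pairs, solve the half-size problem, then combine the partial sums back into every position. This is an illustrative sketch under that reading of the slide, with a hypothetical function name; the list comprehensions stand in for steps that would run in parallel.

```python
def prefix_sum_tree(a):
    """Inclusive prefix sums via pairwise reduce (up) and distribute (down)."""
    n = len(a)
    if n <= 1:
        return list(a)
    # Up the tree: sum adjacent pairs (independent, so one parallel step).
    pairs = [a[i] + a[i + 1] for i in range(0, n - 1, 2)]
    # Recurse on the half-length problem.
    pair_prefix = prefix_sum_tree(pairs)
    # Back down: odd positions take a pair prefix directly; even positions
    # add their own element to the prefix of the pairs before them.
    out = []
    for i in range(n):
        if i % 2 == 1:
            out.append(pair_prefix[i // 2])
        elif i == 0:
            out.append(a[0])
        else:
            out.append(pair_prefix[i // 2 - 1] + a[i])
    return out
```

The recursion depth is about log2(N), and the total number of additions stays linear in N, which matches the "constant factor more hardware" claim for producing all intermediate values.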
28
Other simple operators
Prefix-OR
Prefix-AND
Prefix-MAX
Prefix-MIN
29
Find-First One
Useful for arbitration
  – Finds first (highest-priority) requester
  – Also magnitude finding in numbers
How:
  – Prefix-OR
  – Locally compute X[I-1]^X[I] on the prefix-OR result X
  – Flags the first one
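A minimal sketch of the two steps, assuming the input is a list of 0/1 request bits with index 0 as the highest priority. The function name is hypothetical, and the prefix-OR is written as a loop for brevity even though it would be a log-depth tree in hardware.

```python
def find_first_one(x):
    """Return a one-hot list flagging the first 1 in x (all zeros if none)."""
    # Prefix-OR: p[i] is 1 once any of x[0..i] is 1. OR is associative,
    # so a log-depth prefix tree could compute the same values.
    p, seen = [], 0
    for bit in x:
        seen |= bit
        p.append(seen)
    # Local step: the flag is set only where the prefix-OR changes 0 -> 1.
    return [p[i] ^ (p[i - 1] if i > 0 else 0) for i in range(len(p))]
```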
30
Arbitration
Often want to find first M requesters
  – E.g. assign unique memory ports to first M processors requesting
Prefix-sum across all potential requesters
Counts requesters, giving unique number to each
Each requester knows whether it is one of the first M
  – Perhaps which resource is assigned
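The following sketch shows how the prefix sum turns into grants, assuming request bits are 0/1 and that ports are numbered 0 to M-1 in request order. `grant_first_m` is a hypothetical name, and the serial loop stands in for the prefix-sum tree.

```python
def grant_first_m(requests, m):
    """Assign port numbers 0..m-1 to the first m requesters; None otherwise."""
    grants, count = [], 0
    for req in requests:
        # `count` is the exclusive prefix sum of the request bits: the number
        # of requesters ahead of this one. A prefix tree would compute all of
        # these counts in O(log N) steps; the loop is only for clarity.
        if req and count < m:
            grants.append(count)
        else:
            grants.append(None)
        count += req
    return grants
```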
31
Partitioning
Use something to order
  – E.g. spectral linear ordering
  – …or 1D cellular swap to produce linear order
Parallel prefix on area of units
  – If not all same area
Know where the midpoint is
32
Channel Width
Prefix sum on delta wires at each node
  – To compute net channel widths at all points along channel
  – E.g. 1D ordered
Maybe use with cellular placement scheme
33
Rank Finding
Looking for I’th ordered element
Do a prefix-sum on high-bit only
  – Know m=number of things > 01111111…
High-low search on result
  – I.e. if number > I, recurse on half with leading zero
  – If number < I, search for (I-m)’th element in half with high-bit true
Find median in log 2 (N) time
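One way to read the high-low search is as a bit-by-bit selection. The sketch below fixes a convention that is an assumption of this example, not taken from the slide: it finds the k-th smallest value (1-indexed) among unsigned integers, and at each bit it counts the candidates whose bit is 0. The name `kth_smallest` is hypothetical.

```python
def kth_smallest(values, k, width):
    """k-th smallest (1-indexed) of unsigned `width`-bit integers."""
    candidates = list(values)
    for bit in reversed(range(width)):
        # Counting the candidates whose current bit is 0 is a reduction
        # (the last entry of a prefix sum), so it parallelizes.
        zeros = [v for v in candidates if not (v >> bit) & 1]
        ones = [v for v in candidates if (v >> bit) & 1]
        if k <= len(zeros):
            # The answer has a 0 in this bit.
            candidates = zeros
        else:
            # Skip all the smaller (bit = 0) candidates.
            k -= len(zeros)
            candidates = ones
    return candidates[0]
```

Each of the `width` iterations needs one count, so with a log-depth count the total time grows with the word width times log(N).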
34
FA/FSM Evaluation (regular expression recognition)
35
Finite Automata
Machine has finite state: S
On each cycle
  – Input I
  – Compute output and new state
      Based on inputs and current state
      O_i, S_{i+1} = f(S_i, I_i)
Intuitively, a sequential process
  – Must know previous state to compute next
  – Must know state to compute output
36
Function Specialization
But, this is just functions
  – …and function composition is associative
Given that we know input sequence:
  – I_0, I_1, I_2, …
Can compute specialized functions:
  – f_i(s) = f(s, I_i)
What is f_i(s)?
  – Worst-case, a translation table:
      S=0 → NS0, S=1 → NS1, …
37
Function Composition
Now: O_{i+m}, S_{i+m+1} = f_{i+m}(f_{i+m-1}(f_{i+m-2}(… f_i(S_i))))
Can we compute the function composition?
  – f_{i+1,i}(s) = f_{i+1}(f_i(s))
  – What is f_{i+1,i}(s)?
      A translation table just like f_i(s) and f_{i+1}(s)
      Table of size |S|, can fill in in O(|S|) time
38
Recursive Function Composition
Now: O_{i+m}, S_{i+m+1} = f_{i+m}(f_{i+m-1}(f_{i+m-2}(… f_i(S_i))))
We can compute the composition
  – f_{i+1,i}(s) = f_{i+1}(f_i(s))
Repeat to compute
  – f_{i+3,i}(s) = f_{i+3,i+2}(f_{i+1,i}(s))
  – Etc. until have computed f_{i+m,i}(s) in O(log(m)) steps
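The recursive composition can be sketched with translation tables stored as tuples. This is an illustrative example only: the machine `DELTA` (which remembers whether the substring "ab" has been seen) and all function names are hypothetical, and each `while` iteration stands for one level of the composition tree.

```python
# Hypothetical example machine: state 2 means the substring "ab" has been seen.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}

def specialize(delta, symbol, num_states):
    """f_i as a translation table: entry s holds the next state from s."""
    return tuple(delta[(s, symbol)] for s in range(num_states))

def compose(first, second):
    """Table for `second` applied after `first`; O(|S|) work to fill in."""
    return tuple(second[s] for s in first)

def run_fsm_by_composition(delta, num_states, inputs, start):
    """Final state after consuming `inputs`, via a balanced composition tree."""
    tables = [specialize(delta, sym, num_states) for sym in inputs]
    if not tables:
        return start
    while len(tables) > 1:
        # One tree level: all pairwise compositions are independent.
        nxt = [compose(tables[i], tables[i + 1])
               for i in range(0, len(tables) - 1, 2)]
        if len(tables) % 2:
            nxt.append(tables[-1])  # odd table out moves up unchanged
        tables = nxt
    return tables[0][start]
```

For example, `run_fsm_by_composition(DELTA, 3, "bbab", 0)` composes four tables in two levels and then looks up the start state once at the end.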
39
Implications
If we can get the input stream,
  – Any FA can be evaluated in O(log(N)) time
  – Regular expression recognition in O(log(N))
Any streaming operator with finite state
  – Where the input stream is independent of the output stream
  – Can be run arbitrarily fast by using parallel prefix on FSM evaluation
40
Saturated Addition
S_{i+1} = max(min(I_i + S_i, maxval), minval)
Could model as FSM with:
  – |S| = maxval - minval
So, in theory, FSM result applies
…but |S| might be 2^16, 2^24
41
SATADD Composition
Can compute composition efficiently
[Papadantonakis et al. FPT2005]
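One compact way to see why the composition stays small is to represent each saturating-add step as a triple (add, lo, hi) meaning s → clamp(s + add, lo, hi). Composing two such triples gives another triple, so no table of size |S| is needed. This sketch is an assumption of this example and is not necessarily the formulation used in the cited paper; all names are hypothetical.

```python
def clamp(x, lo, hi):
    """Limit x to the closed range [lo, hi]."""
    return max(lo, min(x, hi))

def satadd(value, minval, maxval):
    """One saturating-add step as a triple: s -> clamp(s + value, minval, maxval)."""
    return (value, minval, maxval)

def apply(fn, s):
    """Evaluate a triple on a concrete state."""
    add, lo, hi = fn
    return clamp(s + add, lo, hi)

def compose(first, second):
    """Triple for `second` applied after `first`."""
    a1, lo1, hi1 = first
    a2, lo2, hi2 = second
    # Shift the first step's limits by the second addend, then clamp them
    # into the second step's range.
    return (a1 + a2, clamp(lo1 + a2, lo2, hi2), clamp(hi1 + a2, lo2, hi2))
```

Because each composed function is still three numbers, a reduce tree over N inputs needs only constant-size combine operations per node.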
42
SATADD Composition
43
SATADD Reduce Tree
44
Data Forwarding
UltraScalar
From Henry, Kuszmaul, et al. ARVLSI’99, SPAA’99, ISCA’00
45
Consider Machine
Each FU has a full RF
  – FU=Functional Unit
  – RF=Register File
Build network between FUs
  – use network to connect produce/consume
  – use register names to configure interconnect
Signal data ready along network
46
Ultrascalar: concept model
47
Ultrascalar Concept
Linear delay
O(1) register cost per FU
Complete renaming at each FU
  – different set of registers
  – so when we say complete RF at each FU, that’s only the logical registers
48
Ultrascalar: cyclic prefix
49
Parallel Prefix
Basic idea is one we saw with adders
An FU will either
  – produce a register (generate)
  – or transmit a register (propagate)
  – can do tree combining
      pair of FUs will either both propagate or will generate
      compute function by pair in one stage
      recurse to next stage
      get log-depth tree network connecting producer and consumer
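The generate/propagate idea for a single register can be sketched as an exclusive prefix over per-FU writes. This is a simplified model for illustration, not the Ultrascalar design itself: `forward_register` and its arguments are hypothetical, FUs are assumed to be in program order, and `None` stands for an FU that only propagates the register.

```python
def forward_register(writes, initial):
    """Value of one register as seen by each FU, in program order.

    writes[i] is the value FU i produces for this register, or None if
    FU i only passes the register along (propagate).
    """
    def combine(earlier, later):
        # Associative: a later producer wins; a propagate defers to earlier.
        return later if later is not None else earlier

    seen, acc = [], initial
    for w in writes:
        seen.append(acc)       # FU i sees the result of all earlier FUs
        acc = combine(acc, w)
    return seen
```

Since `combine` is associative, the serial loop could be replaced by a log-depth prefix tree, which is the network connecting producers to consumers.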
50
Ultrascalar: cyclic prefix
51
Pointer Jumping
52
Pointer Jumping Motivation
Have a tree
  – E.g. is-a relationship tree in NETL
Want to know if a node is of a particular type (is-a mammal)
How long to find out?
  – Naïve: O(distance)
      Spread one level per timestep
53
Following Pointer Chain
Naïve: spread/color from target node
  – On each step push down to children
Most nodes idle
  – Only active on the step something arrives
Can the idle nodes do something to accelerate?
54
Jumping Intermediates
Add notion of transitive parent
Initially: transitive-parent=parent
On each step:
  – If my transitive-parent marked
      Mark self
  – else
      Transitive-parent = transitive-parent(transitive-parent)
55
How Much Jumping?
On each step:
  – If my transitive-parent marked
      Mark self
  – else
      Transitive-parent = transitive-parent(transitive-parent)
How many such steps?
  – O(log(distance))
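The step rule above can be simulated directly. In this sketch the tree is an array of parent indices with the root pointing to itself, and every node reads the previous step's values so that all nodes act "at once". The function name and the array encoding are assumptions of this example.

```python
def steps_to_mark(parent, target):
    """Synchronous pointer jumping; returns (marked flags, number of steps).

    parent[v] is v's parent index; the root points to itself.
    """
    n = len(parent)
    tparent = list(parent)                    # transitive-parent = parent
    marked = [v == target for v in range(n)]  # only the target starts marked
    steps = 0
    while True:
        new_marked, new_tparent = list(marked), list(tparent)
        for v in range(n):                    # all nodes act in parallel
            if marked[v]:
                continue
            if marked[tparent[v]]:
                new_marked[v] = True
            else:
                new_tparent[v] = tparent[tparent[v]]
        if new_marked == marked and new_tparent == tparent:
            return marked, steps              # nothing changed: done
        marked, tparent, steps = new_marked, new_tparent, steps + 1

# A chain 0 <- 1 <- 2 <- ... <- 8 with the target at the root.
flags, steps = steps_to_mark([0, 0, 1, 2, 3, 4, 5, 6, 7], 0)
```

On this chain the farthest node is 8 links from the target, yet it is marked after 4 steps, whereas spreading one level per timestep would take 8.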
56
Pointer Jumping
Same basic idea as data forwarding
Can find length of a list in O(log(length)) time
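As a sketch of the list-length claim, each node can keep a running distance to the tail and double its pointer every round. The list encoding (successor indices with `None` at the tail) and the name `list_ranks` are assumptions of this example.

```python
def list_ranks(next_node):
    """Distance from each node to the end of a linked list, by pointer jumping.

    next_node[v] is the successor index, or None at the tail.
    Returns (ranks, rounds).
    """
    n = len(next_node)
    nxt = list(next_node)
    rank = [0 if nxt[v] is None else 1 for v in range(n)]
    rounds = 0
    while any(p is not None for p in nxt):
        # All nodes update at once from the previous round's values.
        new_rank = [
            rank[v] + (rank[nxt[v]] if nxt[v] is not None else 0)
            for v in range(n)
        ]
        new_nxt = [nxt[nxt[v]] if nxt[v] is not None else None for v in range(n)]
        rank, nxt, rounds = new_rank, new_nxt, rounds + 1
    return rank, rounds

ranks, rounds = list_ranks([1, 2, 3, 4, None])  # list 0 -> 1 -> 2 -> 3 -> 4
```

The list length is the head's rank plus one; for the five-node list above that is 4 + 1, found in 3 rounds.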
57
Variations
58
Segmented Parallel Prefix
f_i() can ignore its input
  – …or the function can let special I’s tell it to reset the state
E.g. build huge/hardwired carry chain hardware and configurably break into separate adders (LUT cascades)
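A common way to express segmentation is to pair each value with a "segment starts here" flag and fold the reset into the operator, which keeps it associative. The sketch below does this for sums; the names are hypothetical and the loop again stands in for a prefix tree.

```python
def segmented_prefix_sum(values, starts):
    """Inclusive prefix sums that restart wherever starts[i] is True."""
    def combine(left, right):
        # Pairs are (segment_started, running_sum). If the right operand
        # begins a segment, it ignores everything to its left.
        lflag, lsum = left
        rflag, rsum = right
        return (lflag or rflag, rsum if rflag else lsum + rsum)

    out, acc = [], None
    for item in zip(starts, values):
        acc = item if acc is None else combine(acc, item)
        out.append(acc[1])
    return out
```

With the flags acting as configuration bits, one long chain behaves like several independent shorter ones, which is the carry-chain example on the slide.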
59
Cyclic Segmented Parallel Prefix
Wrap output back to input
Configurable segmentation defines the starting/stopping point
E.g.
  – In Ultrascalar data forwarding
      Leave data in place and use FUs in FIFO fashion, redefining the “head” at each cycle
  – Priority allocation scheme
      Mark priority item as start of segment
  – Perhaps choose randomly (e.g. hardware router)
60
Admin
Class Wed.
Baseline due Friday
61
Big Ideas
Any associative operation can be made parallel
  – Performed in O(log(N)) time with O(N) hardware
Any Finite Automata computation can be accelerated with parallelism
  – (FA evaluation is in NC)
Function composition is associative
  – all functional operations can be associative