Pipelining combinational circuits


1 Pipelining combinational circuits
Constructive Computer Architecture: Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology December 31, 2013

2 Contributors to the course material
Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran Vijayaraghavan
Staff and students in (Spring 2013), 6.S195 (Fall 2012, 2013), 6.S078 (Spring 2012)
Andy Wright, Asif Khan, Richard Ruhler, Sang Woo Jun, Abhinav Agarwal, Myron King, Kermin Fleming, Ming Liu, Li-Shiuan Peh
External
Prof Amey Karkare & students at IIT Kanpur
Prof Jihong Kim & students at Seoul National University
Prof Derek Chiou, University of Texas at Austin
Prof Yoav Etsion & students at Technion

3 Contents
IFFT
Inelastic versus Elastic pipelines
The role of FIFOs
Concurrency issues
BSV Concepts
The Maybe Type
Concurrency analysis

4 Combinational IFFT
[Figure: combinational IFFT datapath built from Bfly4 blocks (x16) and Permute blocks, with outputs out0 ... out63; each Bfly4 uses *, +, - and *j operators and constants t0 to t3]
Constants t0 to t3 are different for each box and can have a dramatic impact on optimizations.
All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

5 BSV code: 4-way Butterfly
function Vector#(4,Complex#(s)) bfly4
        (Vector#(4,Complex#(s)) t, Vector#(4,Complex#(s)) x);
    Vector#(4,Complex#(s)) m, y, z;
    m[0] = x[0] * t[0];   m[1] = x[1] * t[1];
    m[2] = x[2] * t[2];   m[3] = x[3] * t[3];
    y[0] = m[0] + m[2];   y[1] = m[0] - m[2];
    y[2] = m[1] + m[3];   y[3] = i*(m[1] - m[3]);
    z[0] = y[0] + y[2];   z[1] = y[1] + y[3];
    z[2] = y[0] - y[2];   z[3] = y[1] - y[3];
    return(z);
endfunction
[Figure: butterfly datapath with *, +, - and *i operators, producing the intermediate values m, y and z]
Polymorphic code: works on any type of numbers for which *, + and - have been defined.
Note: Vector does not mean storage; just a group of wires with names.
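To check the butterfly arithmetic by hand, the function can be mirrored in software. The following is a minimal Python sketch, not BSV: it assumes Python's built-in complex type in place of Complex#(s) (so no fixed-point effects are modelled) and uses 1j for the constant written as i on the slide.

```python
def bfly4(t, x):
    """Software mirror of the slide's bfly4: t = 4 twiddle factors, x = 4 inputs."""
    # First column of the datapath: multiply each input by its twiddle factor.
    m = [x[k] * t[k] for k in range(4)]
    # Second column: sums and differences; the last one is also multiplied by i.
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    # Third column: final sums and differences.
    z = [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]
    return z
```

Like the BSV version, the sketch only relies on *, + and - being defined for the element type, so it also accepts plain integers or floats.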

6 Combinational IFFT: stage_f function
[Figure: one stage (Bfly4 x16 followed by Permute, outputs out0 ... out63) is the stage_f function]
function Vector#(64, Complex#(n)) stage_f
        (Bit#(2) stage, Vector#(64, Complex#(n)) stage_in);
function Vector#(64, Complex#(n)) ifft
        (Vector#(64, Complex#(n)) in_data);
ifft: repeat stage_f three times

7 BSV Code: Combinational IFFT
function Vector#(64, Complex#(n)) ifft
        (Vector#(64, Complex#(n)) in_data);
    //Declare vectors
    Vector#(4,Vector#(64, Complex#(n))) stage_data;
    stage_data[0] = in_data;
    for (Bit#(2) stage = 0; stage < 3; stage = stage + 1)
        stage_data[stage+1] = stage_f(stage,stage_data[stage]);
    return(stage_data[3]);
endfunction
The for-loop is unfolded and stage_f is inlined during static elaboration.

8 Folded IFFT: Reusing the stage combinational circuit
[Figure: the inputs feed a single stage (Bfly4 blocks followed by Permute) that is reused under the control of a Stage Counter, producing out0 ... out63]
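The trade-off of folding can be illustrated with a small behavioral model. This is a Python sketch under stated assumptions: stage_f here is a hypothetical stand-in (a rotation plus an offset), not the real Bfly4-and-Permute stage, and FoldedIFFT, load and tick are names invented for the illustration. It shows that one stage circuit plus a register and a stage counter yields the same result as the unfolded loop, but only after three clock cycles.

```python
def stage_f(stage, data):
    # Hypothetical stand-in for the real stage (16 Bfly4 blocks + Permute):
    # rotate the vector by one position and add the stage number.
    rotated = data[1:] + data[:1]
    return [v + stage for v in rotated]

def ifft_combinational(in_data):
    # Unfolded version: three copies of the stage back to back.
    data = list(in_data)
    for stage in range(3):
        data = stage_f(stage, data)
    return data

class FoldedIFFT:
    """One stage circuit reused three times under a stage counter."""
    def __init__(self):
        self.sreg = None   # data register holding the intermediate vector
        self.stage = 0     # stage counter

    def load(self, in_data):
        self.sreg = list(in_data)
        self.stage = 0

    def tick(self):
        # One clock cycle: apply the single stage circuit to the register.
        self.sreg = stage_f(self.stage, self.sreg)
        self.stage += 1
        # The result is available only after the third pass.
        return self.sreg if self.stage == 3 else None
```

Calling load once and tick three times returns None twice and then the same vector that ifft_combinational computes in a single step.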

9 Superfolded IFFT: Just one Bfly-4 node!
[Figure: a single Bfly4 and a Permute block between in0 ... in63 and out0 ... out63, with 64 2-way muxes (Stage: 0 to 2), four 16-way muxes and four 16-way demuxes (Index: 0 to 15), and an "Index == 15?" test]
f will be invoked for 48 dynamic values of stage; each invocation will modify 4 numbers in sReg. After 16 invocations a permutation would be done on the whole sReg.

10 Folding versus Pipelining
[Figure: three-stage pipeline f0, f1, f2 with 3 different datasets x(i+1), x(i), x(i-1) in the pipeline]
Combinational version: a lot of area and a long combinational delay.
Folded or multi-cycle version: can save area and reduce the combinational delay, but throughput per clock cycle gets worse.
Pipelining: a method to increase the circuit throughput by concurrently evaluating multiple inputs.

11 Inelastic vs Elastic pipeline
[Figure: x -> inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
Inelastic: all pipeline stages move synchronously.
[Figure: x -> inQ -> f1 -> fifo1 -> f2 -> fifo2 -> f3 -> outQ]
Elastic: a pipeline stage can process data if its input FIFO is not empty and its output FIFO is not full.
Most complex processor pipelines are a combination of the two styles.

12 Inelastic vs Elastic Pipelines
Inelastic pipeline:
- typically only one rule or mutually exclusive rules; the designer controls precisely which activities go on in parallel
- downside: the designer must program the starting and draining of the pipeline. The rule can get complicated: easy to make mistakes, difficult to make changes
Elastic pipeline:
- several smaller rules, each easy to write; easier to make changes
- downside: sometimes rules do not fire concurrently when they should

13 Inelastic pipeline
[Figure: x -> inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
rule sync_pipeline (True);
    inQ.deq();
    sReg1 <= f0(inQ.first());
    sReg2 <= f1(sReg1);
    outQ.enq(f2(sReg2));
endrule
This rule can fire only if
- inQ has an element
- outQ has space
Atomicity: either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated.

14 FIFO Module: methods with guarded interfaces
[Figure: FIFO module with rdy and enab signals for its methods; enq is ready when the FIFO is not full, deq and first are ready when it is not empty]
fifo.enq(x);       // action method
fifo.deq();        // action method
y = fifo.first()   // value method

15 Inelastic pipeline: making implicit guard conditions explicit
[Figure: inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
rule sync_pipeline (!inQ.empty() && !outQ.full());
    inQ.deq();
    sReg1 <= f0(inQ.first());
    sReg2 <= f1(sReg1);
    outQ.enq(f2(sReg2));
endrule
Suppose sReg1 and sReg2 have data, outQ is not full but inQ is empty. What behavior do you expect? Leave the green and red data (the tokens in sReg1 and sReg2) in the pipeline?
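The question on this slide can be explored with a small behavioral model of the guarded rule. This is a Python sketch, not BSV: the stage functions f0, f1, f2, the class name InelasticPipeline, the reset value 0 of the registers and the bounded outQ capacity are all assumptions made for the illustration. The rule either updates all four state elements or none of them, and it stops firing as soon as inQ is empty.

```python
from collections import deque

# Hypothetical stage functions, chosen only to make the data flow visible.
def f0(x): return x + 1
def f1(x): return x * 2
def f2(x): return x - 3

class InelasticPipeline:
    def __init__(self, out_capacity=2):
        self.inQ = deque()
        self.outQ = deque()
        self.out_capacity = out_capacity
        self.sReg1 = 0   # assumed reset value
        self.sReg2 = 0   # assumed reset value

    def tick(self):
        """Try to fire the rule in one clock cycle; return True if it fired."""
        # Explicit guard: inQ not empty and outQ not full.
        if not self.inQ or len(self.outQ) >= self.out_capacity:
            return False   # guard false: no state element changes
        # All right-hand sides are computed from the OLD register values ...
        new_s1 = f0(self.inQ.popleft())
        new_s2 = f1(self.sReg1)
        out = f2(self.sReg2)
        # ... and then all updates are made together (atomicity).
        self.sReg1, self.sReg2 = new_s1, new_s2
        self.outQ.append(out)
        return True
```

In this model, after feeding two inputs and ticking three times, the third tick does not fire and both tokens stay in sReg1 and sReg2. The model also enqueues results computed from the reset values during the first two cycles, which is the problem that valid bits address.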

16 Pipeline bubbles
[Figure: x -> inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
rule sync_pipeline (True);
    inQ.deq();
    sReg1 <= f0(inQ.first());
    sReg2 <= f1(sReg1);
    outQ.enq(f2(sReg2));
endrule
Red and Green tokens must move even if there is nothing in inQ! Also, if there is no token in sReg2 then nothing should be enqueued in the outQ.
Valid bits or the Maybe type: modify the rule to deal with these conditions.

17 Explicit encoding of Valid/Invalid data
[Figure: inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
typedef union tagged {void Valid; void Invalid;
} Validbit deriving (Eq, Bits);
rule sync_pipeline (True);
    if (inQ.notEmpty()) begin
        sReg1 <= f0(inQ.first());
        inQ.deq();
        sReg1f <= Valid;
    end
    else
        sReg1f <= Invalid;
    sReg2 <= f1(sReg1);
    sReg2f <= sReg1f;
    if (sReg2f == Valid)
        outQ.enq(f2(sReg2));
endrule

18 When is this rule enabled?
rule sync_pipeline (True);
    if (inQ.notEmpty()) begin
        sReg1 <= f0(inQ.first());
        inQ.deq();
        sReg1f <= Valid;
    end
    else
        sReg1f <= Invalid;
    sReg2 <= f1(sReg1);
    sReg2f <= sReg1f;
    if (sReg2f == Valid)
        outQ.enq(f2(sReg2));
endrule
[Figure: inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
The rule is blocked only when sReg2f is Valid and outQ is full, because that is the only case in which the guarded outQ.enq is actually needed and not ready.
inQ  sReg1f  sReg2f  outQ   fires?
NE   V       V       NF     Yes
NE   V       V       F      No
NE   V       I       NF     Yes
NE   V       I       F      Yes
NE   I       V       NF     Yes
NE   I       V       F      No
NE   I       I       NF     Yes
NE   I       I       F      Yes
E    V       V       NF     Yes
E    V       V       F      No
E    V       I       NF     Yes
E    V       I       F      Yes
E    I       V       NF     Yes
E    I       V       F      No
E    I       I       NF     Yes1
E    I       I       F      Yes1
NE = Not Empty; E = Empty; NF = Not Full; F = Full; V = Valid; I = Invalid
Yes1 = yes but no change

19 The Maybe type A useful type to capture valid/invalid data
typedef union tagged {
    void   Invalid;
    data_T Valid;
} Maybe#(type data_T);
[Figure: a register holding a data field and a valid/invalid bit]
Registers contain Maybe type values.
Some useful functions on the Maybe type:
- isValid(x) returns true if x is Valid
- fromMaybe(d,x) returns the data value in x if x is Valid, and the default value d if x is Invalid

20 Using the Maybe type
[Figure: a register holding a data field and a valid/invalid bit]
Registers contain Maybe type values.
rule sync_pipeline if (True);
    if (inQ.notEmpty()) begin
        sReg1 <= Valid (f0(inQ.first()));
        inQ.deq();
    end
    else
        sReg1 <= Invalid;
    sReg2 <= isValid(sReg1) ? Valid (f1(fromMaybe(d, sReg1))) : Invalid;
    if (isValid(sReg2))
        outQ.enq(f2(fromMaybe(d, sReg2)));
endrule

21 The Maybe type data using the pattern matching syntax
[Figure: a register holding a data field and a valid/invalid bit]
Registers contain Maybe type values.
rule sync_pipeline if (True);
    if (inQ.notEmpty()) begin
        sReg1 <= Valid (f0(inQ.first()));
        inQ.deq();
    end
    else
        sReg1 <= Invalid;
    case (sReg1) matches
        tagged Valid .sx1: sReg2 <= Valid (f1(sx1));
        tagged Invalid   : sReg2 <= Invalid;
    endcase
    case (sReg2) matches
        tagged Valid .sx2: outQ.enq(f2(sx2));
    endcase
endrule
sx1 will get bound to the appropriate part of sReg1.
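The effect of the Maybe-typed registers can be seen in a behavioral model. This is a Python sketch under assumptions, not BSV: None plays the role of Invalid (so the data itself must never be None), the class name BubblePipeline and the outQ capacity parameter are invented for the illustration, and the stage functions are passed in by the caller.

```python
from collections import deque

class BubblePipeline:
    """Inelastic 3-stage pipeline whose registers hold either a value or None (Invalid)."""
    def __init__(self, f0, f1, f2, out_capacity=4):
        self.f0, self.f1, self.f2 = f0, f1, f2
        self.inQ = deque()
        self.outQ = deque()
        self.out_capacity = out_capacity
        self.sReg1 = None
        self.sReg2 = None

    def tick(self):
        """One clock cycle; returns True if the rule fired."""
        # The only blocking case: a valid token must leave but outQ is full.
        if self.sReg2 is not None and len(self.outQ) >= self.out_capacity:
            return False
        # Compute every update from the old state.
        new_s1 = self.f0(self.inQ.popleft()) if self.inQ else None
        new_s2 = self.f1(self.sReg1) if self.sReg1 is not None else None
        if self.sReg2 is not None:
            self.outQ.append(self.f2(self.sReg2))
        # Commit the register updates together.
        self.sReg1, self.sReg2 = new_s1, new_s2
        return True
```

With two inputs and four ticks, both tokens drain out of the pipeline even though inQ is empty from the third cycle on, and no result is enqueued for a bubble.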

22 Generalization: n-stage pipeline
[Figure: x -> inQ -> f(0) -> sReg[0] -> f(1) -> sReg[1] -> f(2) -> ... -> sReg[n-2] -> f(n-1) -> outQ]
rule sync_pipeline (True);
    if (inQ.notEmpty()) begin
        sReg[0] <= Valid (f(0, inQ.first()));
        inQ.deq();
    end
    else
        sReg[0] <= Invalid;
    for (Integer i = 1; i < n-1; i = i+1) begin
        case (sReg[i-1]) matches
            tagged Valid .sx: sReg[i] <= Valid (f(i, sx));
            tagged Invalid:   sReg[i] <= Invalid;
        endcase
    end
    case (sReg[n-2]) matches
        tagged Valid .sx: outQ.enq(f(n-1, sx));
    endcase
endrule
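The indexing in the n-stage rule is easy to get wrong, so a software model helps to check it. This is a Python sketch under assumptions: None stands for Invalid, outQ is treated as unbounded, n is at least 2, and the names tick and run are invented for the illustration. Stage i applies f(i, ...), so the first register receives f(0, ...) and the output receives f(n-1, ...).

```python
from collections import deque

def tick(s_reg, in_q, out_q, f):
    """One clock cycle of an n-stage inelastic pipeline with n-1 registers."""
    n = len(s_reg) + 1
    new = [None] * (n - 1)
    # Stage 0: take a new input if there is one, otherwise insert a bubble.
    new[0] = f(0, in_q.popleft()) if in_q else None
    # Stages 1 .. n-2: each reads the OLD value of the previous register.
    for i in range(1, n - 1):
        prev = s_reg[i - 1]
        new[i] = f(i, prev) if prev is not None else None
    # Stage n-1: enqueue a result only for a valid token.
    last = s_reg[n - 2]
    if last is not None:
        out_q.append(f(n - 1, last))
    return new

def run(n, f, inputs, cycles):
    s_reg = [None] * (n - 1)
    in_q, out_q = deque(inputs), deque()
    for _ in range(cycles):
        s_reg = tick(s_reg, in_q, out_q, f)
    return list(out_q)
```

With a stage function such as f(i, x) = x + 10**i, every stage leaves a distinct digit in the result, which makes a skipped or repeated stage index visible immediately.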

23 Elastic pipeline Use FIFOs instead of pipeline registers
[Figure: x -> inQ -> f1 -> fifo1 -> f2 -> fifo2 -> f3 -> outQ]
rule stage1 if (True);
    fifo1.enq(f1(inQ.first()));
    inQ.deq();
endrule
rule stage2 if (True);
    fifo2.enq(f2(fifo1.first()));
    fifo1.deq();
endrule
rule stage3 if (True);
    outQ.enq(f3(fifo2.first()));
    fifo2.deq();
endrule
What is the firing condition for each rule?
Can tokens be left inside the pipeline?
No need for Maybe types.

24 Firing conditions for each rule
[Figure: x -> inQ -> f1 -> fifo1 -> f2 -> fifo2 -> f3 -> outQ]
inQ   fifo1   fifo2   outQ   rule1  rule2  rule3
NE    NE,NF   NE,NF   NF     Yes    Yes    Yes
NE    NE,NF   NE,NF   F      Yes    Yes    No
NE    NE,NF   NE,F    NF     Yes    No     Yes
NE    NE,NF   NE,F    F      Yes    No     No
....
This is the first example we have seen where multiple rules may be ready to execute concurrently.
Can we execute multiple rules together?

25 Informal analysis
[Figure: x -> inQ -> f1 -> fifo1 -> f2 -> fifo2 -> f3 -> outQ]
inQ   fifo1   fifo2   outQ   rule1  rule2  rule3
NE    NE,NF   NE,NF   NF     Yes    Yes    Yes
NE    NE,NF   NE,NF   F      Yes    Yes    No
NE    NE,NF   NE,F    NF     Yes    No     Yes
NE    NE,NF   NE,F    F      Yes    No     No
....
FIFOs must permit concurrent enq and deq for all three rules to fire concurrently.
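The rows of the firing-condition table follow mechanically from the implicit guards of enq, deq and first. The following Python sketch recomputes them; the Fifo class, its capacity parameter and the function name ready are assumptions made for the illustration and are not part of any BSV library.

```python
from collections import deque

class Fifo:
    """Bounded FIFO exposing only the two guard conditions."""
    def __init__(self, capacity, items=()):
        self.capacity = capacity
        self.items = deque(items)

    def not_empty(self):
        return len(self.items) > 0

    def not_full(self):
        return len(self.items) < self.capacity

def ready(inQ, fifo1, fifo2, outQ):
    """A stage can fire if its input FIFO is not empty and its output FIFO is not full."""
    return {
        "rule1": inQ.not_empty() and fifo1.not_full(),
        "rule2": fifo1.not_empty() and fifo2.not_full(),
        "rule3": fifo2.not_empty() and outQ.not_full(),
    }
```

For example, with two-element FIFOs, one element in inQ, fifo1 and fifo2 and a full outQ, the function reports rule1 and rule2 as ready and rule3 as blocked, matching the second row of the table.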

26 Concurrency when the FIFOs do not permit concurrent enq and deq
[Figure: x -> inQ -> f1 -> fifo1 -> f2 -> fifo2 -> f3 -> outQ, annotated with the required conditions: inQ not empty; fifo1 not empty & not full; fifo2 not empty & not full; outQ not full]
At best, alternate stages in the pipeline will be able to fire concurrently.
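The "alternate stages" claim can be observed in a small simulation. This is a Python sketch under assumptions: fifo1 and fifo2 are modelled as one-element FIFOs that cannot enq and deq in the same cycle (None means empty, so the data must never be None), inQ and outQ are unbounded lists, and the stage functions f1, f2, f3 are hypothetical.

```python
# Hypothetical stage functions.
def f1(x): return x + 1
def f2(x): return x * 2
def f3(x): return x - 3

def simulate(inputs, cycles):
    """Return (outputs, per-cycle sets of the stage rules that fired)."""
    inQ, outQ = list(inputs), []
    fifo1 = fifo2 = None      # one-element FIFOs; None means empty
    log = []
    for _ in range(cycles):
        # Guards are evaluated on the state at the start of the cycle.
        r1 = bool(inQ) and fifo1 is None
        r2 = fifo1 is not None and fifo2 is None
        r3 = fifo2 is not None
        new_fifo1, new_fifo2 = fifo1, fifo2
        fired = set()
        if r1:
            new_fifo1 = f1(inQ.pop(0))
            fired.add(1)
        if r2:                # never ready together with rule 1 or rule 3
            new_fifo2 = f2(fifo1)
            new_fifo1 = None
            fired.add(2)
        if r3:
            outQ.append(f3(fifo2))
            new_fifo2 = None
            fired.add(3)
        fifo1, fifo2 = new_fifo1, new_fifo2
        log.append(fired)
    return outQ, log
```

In this model stages 1 and 3 fire together in some cycles, but adjacent stages never do, so a result leaves the pipeline only every second cycle.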

27 Pipelined designs expressed using Multiple rules
- If rules for different pipeline stages never fire in the same cycle, then the design can hardly be called a pipelined design.
- If all the enabled rules fire in parallel every cycle then, in general, wrong results can be produced.

28 BSV Execution Model
Repeatedly:
- Select a rule to execute
- Compute the state updates
- Make the state updates
Highly non-deterministic; user annotations can be used in rule selection.
One-rule-at-a-time semantics: a legal behavior of a BSV program can be explained by observing the state updates obtained by applying only one rule at a time.
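The execution model can be written down almost literally as a loop. This is a Python sketch under assumptions: a rule is modelled as a (guard, body) pair over a dictionary of state elements, the random choice stands in for the non-deterministic rule selection, and the example rules are the three elastic-pipeline stages with hypothetical stage functions and two-element FIFOs.

```python
import random

def run(rules, state, rng, max_steps=1000):
    """One-rule-at-a-time execution: select, compute the updates, make the updates."""
    state = dict(state)
    for _ in range(max_steps):
        enabled = [rule for rule in rules if rule[0](state)]
        if not enabled:
            break
        guard, body = rng.choice(enabled)   # select a rule to execute
        updates = body(state)               # compute the state updates
        state.update(updates)               # make the state updates
    return state

CAP = 2  # assumed capacity of fifo1 and fifo2

rules = [
    # stage1: inQ -> fifo1, applying x + 1
    (lambda s: bool(s["inQ"]) and len(s["fifo1"]) < CAP,
     lambda s: {"inQ": s["inQ"][1:], "fifo1": s["fifo1"] + (s["inQ"][0] + 1,)}),
    # stage2: fifo1 -> fifo2, applying x * 2
    (lambda s: bool(s["fifo1"]) and len(s["fifo2"]) < CAP,
     lambda s: {"fifo1": s["fifo1"][1:], "fifo2": s["fifo2"] + (s["fifo1"][0] * 2,)}),
    # stage3: fifo2 -> outQ (unbounded here), applying x - 3
    (lambda s: bool(s["fifo2"]),
     lambda s: {"fifo2": s["fifo2"][1:], "outQ": s["outQ"] + (s["fifo2"][0] - 3,)}),
]

def initial_state():
    return {"inQ": (1, 2, 3), "fifo1": (), "fifo2": (), "outQ": ()}
```

Running this with different random seeds changes the order in which the rules fire, but for this pipeline every order ends with the same contents in outQ.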

29 Concurrent scheduling of rules
The one-rule-at-a-time semantics plays the central role in defining functional correctness and verification, but for meaningful hardware design it is necessary to execute multiple rules concurrently without violating the one-rule-at-a-time semantics.
What do we mean by concurrent scheduling?
- First: some hardware intuition
- Later: the semantics of concurrent scheduling

30 Hardware intuition for concurrent scheduling

31 Rule Execution
Application of a rule modifies some state elements of the system in a deterministic manner.
[Figure: the current state feeds the next-state computation (f, x), which produces the next state values (nextState) and the register enables (reg en's)]

32 Executing multiple rules in one clock cycle
[Figure, sequential composition: current state -> rule 1 -> rule2 -> next-state mux -> next values and Reg en. Callout: "Forwards old values unless updated"]
[Figure, parallel composition: rule 1 and rule2 both read the current state; a merge block produces NextState and Reg en. Callout: "ensures that no double updates are made on any register"]
Sequential composition preserves the one-rule-at-a-time semantics but generally increases the critical combinational path.
Parallel composition does not create longer combinational paths but may not preserve the one-rule-at-a-time semantics.

33 Violation of sequential semantics
rule ra;
    x <= y;
endrule
rule rb;
    y <= x;
endrule
Suppose initially x is x0 and y is y0:
{x0,y0}  ra      {y0,y0}  rb  {y0,y0}     (ra < rb)
{x0,y0}  rb      {x0,x0}  ra  {x0,x0}     (rb < ra)
{x0,y0}  rb||ra  {y0,x0}                  (parallel)
Parallel execution does not behave like either ra<rb or rb<ra.
We do not want to allow concurrent execution of ra and rb.
Next time: compiler analysis to determine which rules can be executed concurrently.
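The three executions above can be reproduced in a few lines. This is a Python sketch, not BSV: the state is a dictionary, each rule returns the updates it wants to make, and the helper names sequential and parallel are invented for the illustration.

```python
def ra(s):
    return {"x": s["y"]}   # x <= y

def rb(s):
    return {"y": s["x"]}   # y <= x

def sequential(state, *rules):
    """Apply the rules one at a time; each rule sees the previous rule's updates."""
    for rule in rules:
        state = {**state, **rule(state)}
    return state

def parallel(state, *rules):
    """All rules read the same old state; their updates are merged afterwards."""
    updates = {}
    for rule in rules:
        updates.update(rule(state))
    return {**state, **updates}

s0 = {"x": "x0", "y": "y0"}
ra_then_rb = sequential(s0, ra, rb)   # {'x': 'y0', 'y': 'y0'}
rb_then_ra = sequential(s0, rb, ra)   # {'x': 'x0', 'y': 'x0'}
both = parallel(s0, ra, rb)           # {'x': 'y0', 'y': 'x0'}
```

The parallel result swaps x and y, which neither sequential order can produce, so a scheduler that respects one-rule-at-a-time semantics must not fire ra and rb in the same cycle.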

34 Area estimates
Tool: Synopsys Design Compiler
Design          Combinational area   Noncombinational area
Comb. FFT       (missing)            9279
Folded FFT      (missing)            11603
Pipelined FFT   (missing)            18558
Are the results surprising? Why is the folded implementation not smaller?
Explanation: because of the constant propagation optimization, each bfly4 gets reduced by 60% when twiddle factors are specified. The folded design disallows this optimization because of the sharing of bfly4's.

