Performance Specifications

Performance Specifications
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology

Simple processor pipeline
(Pipeline diagram: stages IF, Dec, Exe, Mem, Wb with buffers bF, bD, bE, bW; register file RF, iMem, dMem, and the Bz? branch decision)
Cycle time? Area? Execution time?
Functional behavior is well understood, but intuition about performance is lacking:
Should the branch be resolved in the Decode or the Execute stage?
Should the branch target address be latched before its use?
Experimentation is required to evaluate design alternatives. (This works for designs other than processors too, e.g., the GCD example in the paper.)
We present a design flow that makes such experimentation easy for the designer.

Need for Performance Specs
(Pipeline diagram: stages IF, Dec, Exe, Mem, Wb with buffers bF, bD, bE, bW; iMem, dMem)
Rules, grouped by pipeline stage:
F = {Fetch}
D = {DecAdd, DecBz, …}
E = {ExeAdd, ExeBzTaken, ExeBzNotTaken, …}
M = {MemLd, MemSt, MemWB, …}
W = {Wb}
What is the design's performance / throughput?
The reference model implies one-rule-per-cycle execution, but the designer's goal is usually different and based on the application!

Pipelining via Performance Specification
The designer wants a pipeline which executes one instruction every cycle.
Performance spec for a pipelined processor: W < M < E < D < F
(Diagram: a cycle in slow motion, with instructions I0 … I5 spread across the stages IF, Dec, Exe, Mem, Wb and buffers bF, bD, bE, bW)
Note the composition: in effect we are creating one gigantic composite rule.

More Performance Specification
F = {Fetch}
D = {DecAdd, DecBz, …}
E = {ExeAdd, ExeBzTaken, ExeBzNotTaken, …}
M = {MemLd, MemSt, MemWB, …}
W = {Wb}
We allow the designer to specify performance!
W < M < E < D < F ≡ pipelined
F < D < E < M < W ≡ unpipelined (assuming buffers start empty)
W < W < M < M < E < E < D < D < F < F ≡ two-way superscalar!
What do the following mean?
1) W < M < E* < D < F
2) W < M < ExeBzTaken ≡ pipelined except for ExeBzTaken
The danger is that the critical path gets longer.
Synthesis algorithms ensure that performance specs are satisfied and guarantee that functionality is not altered.

Why is functionality maintained? A few observations about rule-based systems:
Adding a new rule to a system can only introduce new behaviors.
If the new rule is a derived rule, then it does not add new behaviors.
Composed rules. Given rules:
Ra: when πa(s) => s := δa(s);
Rb: when πb(s) => s := δb(s);
the composed rule is a derived rule:
Ra,b: when πa(s) & πb(δa(s)) => s := δb(δa(s));
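
As a concrete illustration, here is a small hand-written sketch in BSV (a hypothetical module, not from the talk): ra and rb are ordinary rules, and ra_b is the composed rule written out by hand, with guard πa(s) && πb(δa(s)) and body δb(δa(s)).

   module mkComposeExample (Empty);
      Reg#(Bit#(8)) cnt <- mkReg(0);

      rule ra (cnt < 10);                 // pi_a(s): cnt < 10
         cnt <= cnt + 1;                  // delta_a(s): cnt := cnt + 1
      endrule

      rule rb ((cnt & 1) == 0);           // pi_b(s): cnt is even
         cnt <= cnt * 2;                  // delta_b(s): cnt := 2 * cnt
      endrule

      // Derived rule Ra,b: fires only when ra can fire AND rb can fire on the
      // state ra would produce; its body performs both updates atomically.
      rule ra_b (cnt < 10 && ((cnt + 1) & 1) == 0);
         cnt <= (cnt + 1) * 2;
      endrule
   endmodule

Any step ra_b takes could already be taken by firing ra and then rb, so adding it introduces no new behaviors; a real flow would generate such composite rules during scheduling rather than keep all three rules side by side.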

Scheduling Specifications
(Diagram: a CPU with rules fetch_and_decode and execute, state elements pc and rf, and the buffer bu)

rule fetch_and_decode (!stallfunc(instr, bu));
   bu.enq(newIt(instr, rf));
   pc <= predIa;
endrule

rule execAdd (it matches tagged EAdd {dst:.rd, src1:.va, src2:.vb});
   rf.upd(rd, va + vb);
   bu.deq();
endrule

rule execBzTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& (cv == 0));
   pc <= av;
   bu.clear();
endrule

rule execBzNotTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& !(cv == 0));
   bu.deq();
endrule

rule execLoad (it matches tagged ELoad {dst:.rd, addr:.av});
   rf.upd(rd, dMem.read(av));
   bu.deq();
endrule

rule execStore (it matches tagged EStore {value:.vv, addr:.av});
   dMem.write(av, vv);
   bu.deq();
endrule

Desired schedule:
execAdd < fetch
execBzTaken < fetch
execBzNotTaken < fetch ?
execLoad < fetch
execStore < fetch

Implications for modules
(Diagram: CPU with rules fetch_and_decode and execute, state elements pc and rf, and the buffer bu)

rule fetch_and_decode (!stallfunc(instr, bu));
   bu.enq(newIt(instr, rf));
   pc <= predIa;
endrule

rule execAdd (it matches tagged EAdd {dst:.rd, src1:.va, src2:.vb});
   rf.upd(rd, va + vb);
   bu.deq();
endrule

execAdd < fetch requires:
rf: sub > upd
bu: {find, enq} > {first, deq}

Branch rules
(Diagram: CPU with rules fetch_and_decode and execute, state elements pc and rf, and the buffer bu)

rule fetch_and_decode (!stallfunc(instr, bu));
   bu.enq(newIt(instr, rf));
   pc <= predIa;
endrule

rule execBzTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& (cv == 0));
   pc <= av;
   bu.clear();
endrule

rule execBzNotTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& !(cv == 0));
   bu.deq();
endrule

execBzTaken < fetch ?  Should be treated as a conflict – give priority to execBzTaken.
execBzNotTaken < fetch requires bu: {first, deq} < {find, enq}

Load-Store Rules
(Diagram: CPU with rules fetch_and_decode and execute, state elements pc and rf, and the buffer bu)

rule fetch_and_decode (!stallfunc(instr, bu));
   bu.enq(newIt(instr, rf));
   pc <= predIa;
endrule

rule execLoad (it matches tagged ELoad {dst:.rd, addr:.av});
   rf.upd(rd, dMem.read(av));
   bu.deq();
endrule

rule execStore (it matches tagged EStore {value:.vv, addr:.av});
   dMem.write(av, vv);
   bu.deq();
endrule

execLoad < fetch ?  Same as execAdd, i.e., rf: upd < sub and bu: {first, deq} < {find, enq}
execStore < fetch ?

Properties Required of the Register File & FIFO to meet performance specs
Register file: rf.upd < rf.sub
FIFO bu: {first, deq} < {find, enq}, i.e.:
bu.first < bu.find
bu.first < bu.enq
bu.deq < bu.find
bu.deq < bu.enq

The good news ...
It is always possible to transform your design to meet the desired concurrency and functionality, though the critical path, and hence the clock period, may increase.

Register Interfaces
An ordinary register gives read < write. Can we also provide write < read' ?
(Circuit: a D flip-flop holds the state; a mux controlled by write.en selects between the flop output and write.x to produce read', i.e., read' = write.en ? write.x : current state)
read' – returns the current state when write is not enabled
read' – returns the value being written if write is enabled

Ephemeral History Register (EHR) [Rosenband, MEMOCODE'04]
read0 < write0 < read1 < write1 < …
(Circuit: a D flip-flop holds the state; read0 is the flop output, and a mux controlled by write0.en bypasses write0.x to read1; the output of the write1 mux feeds the D input)
write(i+1) takes precedence over write(i)
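
The slides do not give an implementation; the following minimal BSV sketch shows one common way to build a 2-port EHR, using a ConfigReg plus RWires (an assumption, not taken from the talk). The interface name echoes the EHReg2 used on the next slides, but the slides write full.write0 <= True, i.e., Reg-style write ports, whereas this sketch exposes ordinary Action methods for simplicity.

   import ConfigReg :: *;

   interface EHReg2#(type t);
      method t      read0();
      method Action write0(t x);
      method t      read1();
      method Action write1(t x);
   endinterface

   module mkEHReg2#(t init) (EHReg2#(t)) provisos (Bits#(t, tSz));
      Reg#(t)   r  <- mkConfigReg(init);  // ConfigReg: its read and write do not constrain rule order
      RWire#(t) w0 <- mkRWire();          // value written through port 0 this cycle, if any
      RWire#(t) w1 <- mkRWire();          // value written through port 1 this cycle, if any

      // The state as seen by port 1: port 0's write, if any, has already "happened"
      let s1 = fromMaybe(r, w0.wget());

      // Commit the final value at the end of the cycle; write1 overrides write0
      rule canonicalize;
         r <= fromMaybe(s1, w1.wget());
      endrule

      method t read0();
         return r;
      endmethod
      method Action write0(t x);
         w0.wset(x);
      endmethod
      method t read1();
         return s1;
      endmethod
      method Action write1(t x);
         w1.wset(x);
      endmethod
   endmodule

Because read1 looks through w0, any rule that calls read1 must schedule after rules that call write0, which is what yields the read0 < write0 < read1 < write1 ordering.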

Transformation for Performance
Goal: execAdd < fetch, execBzTaken < fetch, execLoad < fetch, execStore < fetch

rule fetch_and_decode (!stallfunc1(instr, bu));
   bu.enq1(newIt(instr, rf));
   pc <= predIa;
endrule

rule execAdd (it matches tagged EAdd {dst:.rd, src1:.va, src2:.vb});
   rf.upd0(rd, va + vb);
   bu.deq0();
endrule

rule execBzTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& (cv == 0));
   pc <= av;
   bu.clear();
endrule

rule execBzNotTaken (it matches tagged Bz {cond:.cv, addr:.av} &&& !(cv == 0));
   bu.deq0();
endrule

rule execLoad (it matches tagged ELoad {dst:.rd, addr:.av});
   rf.upd0(rd, dMem.read(av));
   bu.deq0();
endrule

rule execStore (it matches tagged EStore {value:.vv, addr:.av});
   dMem.write(av, vv);
   bu.deq0();
endrule

One Element FIFO using EHRs

module mkFIFO1 (FIFO#(t));
   EHReg2#(t)    data <- mkEHReg2U();
   EHReg2#(Bool) full <- mkEHReg2(False);

   method Action enq0(t x) if (!full.read0);
      full.write0 <= True; data.write0 <= x;
   endmethod

   method Action enq1(t x) if (!full.read1);
      full.write1 <= True; data.write1 <= x;
   endmethod

   method Action deq0() if (full.read0);
      full.write0 <= False;
   endmethod

   method t first0() if (full.read0);
      return (data.read0);
   endmethod

   method Action clear0();
      full.write0 <= False;  // empty the FIFO
   endmethod
endmodule

first0 < deq0 < enq1
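
A minimal usage sketch (hypothetical test rules, assuming the FIFO#(t) interface above exposes enq1, deq0, and first0): because first0 < deq0 < enq1, a rule that dequeues through port 0 and a rule that enqueues through port 1 can both fire in the same clock cycle, which is exactly what the pipelined schedule requires of the one-element buffer.

   module mkFIFO1Test (Empty);
      FIFO#(Bit#(8)) fifo <- mkFIFO1();
      Reg#(Bit#(8))  n    <- mkReg(0);

      rule consume;                 // stands in for the exec rules: uses port 0
         let x = fifo.first0();
         fifo.deq0();
         $display("dequeued %0d", x);
      endrule

      rule produce;                 // stands in for fetch_and_decode: uses port 1
         fifo.enq1(n);
         n <= n + 1;
      endrule
   endmodule

In a cycle where the buffer holds one element, consume drains it through port 0 and produce refills it through port 1, so the single-element FIFO behaves like a full pipeline stage.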

Experiments in scheduling (Dan Rosenband, ICCAD 2005)
What happens if the user specifies, with no change to the rules:
Wb < Wb < Mem < Mem < Exe < Exe < Dec < Dec < IF < IF
A superscalar processor!
(Diagram: a cycle in slow motion, with instructions I0 … I9 moving two at a time through the stages IF, Dec, Exe, Mem, Wb and buffers bI, bD, bE, bW)
Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design.

4-Stage Processor Results (Dan Rosenband & Arvind, 2004)

Design                    | Benchmark (cycles) | Area @10 ns (µm²) | Timing @10 ns (ns) | Area @2 ns (µm²) | Timing @2 ns (ns)
1-element FIFO, No Spec   | 18525              | 24762             | 5.85               | 26632            | 2.00
1-element FIFO, Spec 1    | 11115              | 25094             | 6.83               | 33360            | –
1-element FIFO, Spec 2    | –                  | 25264             | 6.78               | 34099            | 2.04
2-element FIFO, No Spec   | –                  | 32240             | 7.38               | 39033            | –
2-element FIFO, Spec 1    | –                  | 32535             | 8.38               | 47084            | 2.63
2-element FIFO, Spec 2    | 7410               | 45296             | 9.99               | 62649            | 4.72

Benchmark: a program containing additions / jumps / loadc's.

Summary
For most designs the BSV compiler does a good job of scheduling rules, given some user annotations for priority.
However, for complex designs concurrency control is sometimes quite difficult and requires that the designer understand the concurrency issues well.
Performance specification is a good, safe solution but is not implemented in the compiler yet; the user can do manual "renaming" and use EHRs to meet most performance goals.
RWires can solve any problem but exacerbate the correctness issue.
Synchronous pipelines (a single rule) avoid many problems but are not recommended for complex designs.