

Asynchronous Pipelines: Concurrency Issues
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
February 20, 2009, L08
http://csg.csail.mit.edu/6.375

Synchronous vs Asynchronous Pipelines

In a synchronous pipeline:
- typically only one rule; the designer controls precisely which activities go on in parallel
- downside: the rule can get too complicated, so it is easy to make a mistake and difficult to make changes

In an asynchronous pipeline:
- several smaller rules, each easy to write and easier to change
- downside: sometimes rules do not fire concurrently when they should

Two-stage Asynchronous Pipeline

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execute (True);
      case (it) matches
        tagged EAdd {dst:.rd, src1:.va, src2:.vb}: begin
          rf.upd(rd, va+vb);
          bu.deq();
        end
        tagged EBz {cond:.cv, addr:.av}:
          if (cv == 0) begin
            pc <= av;
            bu.clear();
          end
          else bu.deq();
        tagged ELoad {dst:.rd, addr:.av}: begin
          rf.upd(rd, dMem.read(av));
          bu.deq();
        end
        tagged EStore {value:.vv, addr:.av}: begin
          dMem.write(av, vv);
          bu.deq();
        end
      endcase
    endrule

Can these rules fire concurrently? Does it matter?

The tension

- If the two rules never fire in the same cycle, then the machine can hardly be called a pipelined machine
- If both rules fire in parallel every cycle when they are enabled, then wrong results would be produced

The compiler issue

- Can the compiler detect all the conflicting conditions? This is important for correctness.
- Does the compiler detect conflicts that do not exist in reality? Yes. False positives lower the performance. The main reason is that sometimes the compiler cannot detect under what conditions the two rules are mutually exclusive or conflict-free.
- What can the user specify easily? Rule priorities to resolve nondeterministic choice.

In many situations the correctness of the design is not enough; the design is not done unless the performance goals are met.

Some insight into concurrent rule firing

[Figure: rule semantics drawn as a sequence of rule steps Ri, Rj, Rk; HW drawn as clock cycles in which Ri, Rj, Rk fire]

- There are more intermediate states in the rule semantics (a state after each rule step)
- In the HW, states change only at clock edges

Parallel execution reorders reads and writes

[Figure: rule steps, each a reads/writes pair in sequence, compared with HW clocks, where the reads of all rules in a cycle precede their writes]

- In the rule semantics, each rule sees (reads) the effects (writes) of previous rules
- In the HW, rules only see the effects from previous clocks, and only affect subsequent clocks

Correctness

[Figure: rule steps Ri, Rj, Rk compared with HW clocks]

- Rules are allowed to fire in parallel only if the net state change is equivalent to sequential rule execution
- Consequence: the HW can never reach a state unexpected in the rule semantics

Executing Multiple Rules Per Cycle: Conflict-free rules

    rule ra (z > 10);
      x <= x + 1;
    endrule

    rule rb (z > 20);
      y <= y + 2;
    endrule

Parallel execution behaves like ra < rb or, equivalently, rb < ra.

Here $\pi$ denotes a rule's predicate (guard) and $\delta$ its state-transition function. Rule_a and Rule_b are conflict-free if

$\forall s.\ \pi_a(s) \wedge \pi_b(s) \Rightarrow$
1. $\pi_a(\delta_b(s)) \wedge \pi_b(\delta_a(s))$
2. $\delta_a(\delta_b(s)) = \delta_b(\delta_a(s))$
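The two conflict-free conditions can be checked mechanically on a small state space. The following is a minimal Python model, not BSV, and every name in it (`conflict_free`, `parallel`, `STATES`) is made up for illustration. It encodes ra and rb from this slide as guard/action pairs and compares one "clock cycle" (both rules read the old state) against both sequential orders.

```python
from itertools import product

# A state is a dict with keys x, y, z. Actions return a NEW state.
def ra_guard(s): return s["z"] > 10
def ra_action(s): return {**s, "x": s["x"] + 1}   # x <= x + 1
def rb_guard(s): return s["z"] > 20
def rb_action(s): return {**s, "y": s["y"] + 2}   # y <= y + 2

def conflict_free(ga, da, gb, db, states):
    """Check both conflict-free conditions wherever both guards hold."""
    for s in states:
        if ga(s) and gb(s):
            if not (ga(db(s)) and gb(da(s))):   # condition 1: guards survive
                return False
            if da(db(s)) != db(da(s)):          # condition 2: same final state
                return False
    return True

def parallel(s):
    """One clock cycle: every enabled rule reads the OLD state s."""
    nxt = dict(s)
    if ra_guard(s):
        nxt["x"] = ra_action(s)["x"]
    if rb_guard(s):
        nxt["y"] = rb_action(s)["y"]
    return nxt

# Small illustrative state space (an assumption of this sketch).
STATES = [{"x": x, "y": y, "z": z}
          for x, y, z in product(range(3), range(3), (5, 15, 25))]

if __name__ == "__main__":
    print(conflict_free(ra_guard, ra_action, rb_guard, rb_action, STATES))
```

The exhaustive check only covers the sampled states; it illustrates the definition rather than proving the property in general.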

Mutually Exclusive Rules

Rule_a and Rule_b are mutually exclusive if they can never be enabled simultaneously:

$\forall s.\ \pi_a(s) \Rightarrow \neg\pi_b(s)$

Mutually exclusive rules are conflict-free by definition.

Executing Multiple Rules Per Cycle: Sequentially Composable rules

    rule ra (z > 10);
      x <= y + 1;
    endrule

    rule rb (z > 20);
      y <= y + 2;
    endrule

Parallel execution behaves like ra < rb.

Rule_a and Rule_b are sequentially composable if

$\forall s.\ \pi_a(s) \wedge \pi_b(s) \Rightarrow$
1. $\pi_b(\delta_a(s))$
2. $\mathrm{Prj}_{R(R_b)}(\delta_b(s)) = \mathrm{Prj}_{R(R_b)}(\delta_b(\delta_a(s)))$

- $R(R_b)$ is the range of rule Rb
- $\mathrm{Prj}_{st}$ is the projection selecting $st$ from the total state
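A small Python model (not BSV; all names are hypothetical) makes the asymmetry concrete: for the ra and rb on this slide, one parallel cycle matches the order ra < rb but not rb < ra, and the two sequential-composability conditions hold on the sampled states.

```python
from itertools import product

def ra_guard(s): return s["z"] > 10
def ra_action(s): return {**s, "x": s["y"] + 1}   # x <= y + 1
def rb_guard(s): return s["z"] > 20
def rb_action(s): return {**s, "y": s["y"] + 2}   # y <= y + 2

RANGE_RB = ("y",)   # R(Rb): the state elements written by rb

def project(s, names):
    """Prj: select the named state elements from the total state."""
    return tuple(s[n] for n in names)

def sequentially_composable(states):
    """Check the two SC conditions (for ra < rb) wherever both guards hold."""
    for s in states:
        if ra_guard(s) and rb_guard(s):
            if not rb_guard(ra_action(s)):                      # condition 1
                return False
            if project(rb_action(s), RANGE_RB) != \
               project(rb_action(ra_action(s)), RANGE_RB):      # condition 2
                return False
    return True

def parallel(s):
    """One clock cycle: both rules read the OLD state s."""
    nxt = dict(s)
    if ra_guard(s):
        nxt["x"] = ra_action(s)["x"]
    if rb_guard(s):
        nxt["y"] = rb_action(s)["y"]
    return nxt

# Small illustrative state space (an assumption of this sketch).
STATES = [{"x": x, "y": y, "z": z}
          for x, y, z in product(range(3), range(3), (5, 15, 25))]

if __name__ == "__main__":
    s = {"x": 0, "y": 1, "z": 25}
    print(parallel(s), rb_action(ra_action(s)), ra_action(rb_action(s)))
```

In the order rb < ra, ra would read the updated y, which the parallel hardware cannot provide without extra logic; that is why only ra < rb is a valid reading here.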

Compiler determines if two rules can be executed in parallel

Rule_a and Rule_b are conflict-free if

$\forall s.\ \pi_a(s) \wedge \pi_b(s) \Rightarrow$
1. $\pi_a(\delta_b(s)) \wedge \pi_b(\delta_a(s))$
2. $\delta_a(\delta_b(s)) = \delta_b(\delta_a(s))$

Sufficient check: $D(R_a) \cap R(R_b) = \emptyset$, $D(R_b) \cap R(R_a) = \emptyset$, $R(R_a) \cap R(R_b) = \emptyset$

Rule_a and Rule_b are sequentially composable if

$\forall s.\ \pi_a(s) \wedge \pi_b(s) \Rightarrow$
1. $\pi_b(\delta_a(s))$
2. $\mathrm{Prj}_{R(R_b)}(\delta_b(s)) = \mathrm{Prj}_{R(R_b)}(\delta_b(\delta_a(s)))$

Sufficient check: $D(R_b) \cap R(R_a) = \emptyset$

- These properties can be determined by examining the domains $D$ and ranges $R$ of the rules in a pairwise manner
- These conditions are sufficient but not necessary
- Parallel execution of CF and SC rules does not increase the critical path delay
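The pairwise domain/range test is easy to sketch. The Python function below is an illustrative model only (the name `classify` and its return strings are invented here, and it is not how any particular compiler is implemented); it applies the sufficient conditions from this slide to sets of state-element names.

```python
def classify(d_a, r_a, d_b, r_b):
    """Classify a rule pair from domains (reads) and ranges (writes).

    d_a, r_a: sets of state elements read / written by rule a
    d_b, r_b: the same for rule b
    The checks are sufficient, not necessary, so "conflict" may be pessimistic.
    """
    if not (d_a & r_b) and not (d_b & r_a) and not (r_a & r_b):
        return "CF"
    if not (d_b & r_a):
        return "SC a<b"      # b never reads what a writes
    if not (d_a & r_b):
        return "SC b<a"      # symmetric case: a never reads what b writes
    return "conflict"

if __name__ == "__main__":
    # ra: x <= x + 1 when z > 10; rb: y <= y + 2 when z > 20
    print(classify({"x", "z"}, {"x"}, {"y", "z"}, {"y"}))
    # ra: x <= y + 1 when z > 10; rb: y <= y + 2 when z > 20
    print(classify({"y", "z"}, {"x"}, {"y", "z"}, {"y"}))
```

Guards count as reads, which is why z appears in both domains.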

Muxing structure

Muxing logic requires determining for each register (action method) the rules that update it and under what conditions.

[Figure: and-or mux structures. Conflict-free / mutually exclusive case: inputs ($\pi_1$, $\delta_1$) and ($\pi_2$, $\delta_2$). Sequentially composable case: inputs ($\pi_1$ and ~$\pi_2$, $\delta_1$) and ($\pi_2$, $\delta_2$).]

If two CF rules update the same element then they must be mutually exclusive ($\pi_1 \Rightarrow \neg\pi_2$).
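The two mux structures can be modeled in a few lines of Python. This is a sketch under the reading of the figure given above (rule 1 ordered before rule 2 in the sequentially composable case); the function names are hypothetical.

```python
def and_or_mux(old, inputs):
    """AND-OR mux for one register.

    inputs: list of (enable, value) pairs. At most one enable may be true,
    which is what mutual exclusion guarantees in the CF/ME case.
    """
    selected = [value for enable, value in inputs if enable]
    if len(selected) > 1:
        raise ValueError("two writers enabled at once: mux output undefined")
    return selected[0] if selected else old

def cf_mux(old, pi1, d1, pi2, d2):
    """Conflict-free / mutually exclusive writers."""
    return and_or_mux(old, [(pi1, d1), (pi2, d2)])

def sc_mux(old, pi1, d1, pi2, d2):
    """Sequentially composable writers, rule 1 < rule 2.

    Rule 2 is logically last, so its value wins; rule 1's write is
    gated by (pi1 and not pi2).
    """
    return and_or_mux(old, [(pi1 and not pi2, d1), (pi2, d2)])

if __name__ == "__main__":
    print(sc_mux(0, True, 5, True, 7))
```

Gating rule 1 with `not pi2` is what keeps the and-or structure well defined when both rules fire in the same cycle.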

Scheduling and control logic

[Figure: Modules (current state) feed the rules; each rule i produces a condition $\pi_i$ ("CAN_FIRE") and an action $\delta_i$; the scheduler turns $\pi_1 \ldots \pi_n$ into $\phi_1 \ldots \phi_n$ ("WILL_FIRE"); muxing logic combines $\phi_i$ and $\delta_i$ to produce Modules (next state)]

$\phi_i \wedge \phi_j \Rightarrow R_i$ and $R_j$ are conflict-free or sequentially composable
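One simple way to satisfy the WILL_FIRE constraint is a greedy, priority-ordered scheduler. The Python sketch below is an assumption made for illustration (the slide does not say how the scheduler is built, and `schedule` is a made-up name): a rule fires if it can fire and no higher-priority conflicting rule has already been chosen.

```python
def schedule(can_fire, conflicts):
    """Compute WILL_FIRE from CAN_FIRE.

    can_fire:  list of bools, index 0 being the highest-priority rule
    conflicts: set of frozenset({i, j}) pairs that are neither conflict-free
               nor sequentially composable, so they must not fire together
    """
    will_fire = []
    for i, ready in enumerate(can_fire):
        blocked = any(will_fire[j] and frozenset({i, j}) in conflicts
                      for j in range(i))
        will_fire.append(ready and not blocked)
    return will_fire

if __name__ == "__main__":
    # Rule 0 has priority over rule 1, with which it conflicts;
    # rule 2 is compatible with both.
    print(schedule([True, True, True], {frozenset({0, 1})}))
```

By construction, any two rules that both end up with WILL_FIRE set are not in the conflict set, which is the property stated on this slide.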

Concurrency analysis: Two-stage Pipeline

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execute (True);
      case (it) matches
        tagged EAdd {dst:.rd, src1:.va, src2:.vb}: begin
          rf.upd(rd, va+vb);
          bu.deq();
        end
        tagged EBz {cond:.cv, addr:.av}:
          if (cv == 0) begin
            pc <= av;
            bu.clear();
          end
          else bu.deq();
        tagged ELoad {dst:.rd, addr:.av}: begin
          rf.upd(rd, dMem.read(av));
          bu.deq();
        end
        tagged EStore {value:.vv, addr:.av}: begin
          dMem.write(av, vv);
          bu.deq();
        end
      endcase
    endrule

Conflicts around: pc, bu, rf.

Let us split the execute rule for the sake of analysis.

Concurrency analysis: Add Rule

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execAdd (it matches tagged EAdd {dst:.rd, src1:.va, src2:.vb});
      rf.upd(rd, va+vb);
      bu.deq();
    endrule

Methods called by each rule:
- fetch_and_decode: rf: sub; bu: find, enq; pc: read, write
- execAdd: rf: upd; bu: first, deq

fetch < execAdd requires:
- rf: sub < upd
- bu: {find, enq} < {first, deq}

execAdd < fetch requires:
- rf: sub > upd
- bu: {find, enq} > {first, deq}

Does either of these concurrency properties hold?

Register File concurrency properties

A normal register file implementation guarantees rf.sub < rf.upd; that is, reads happen before writes in concurrent execution.

But concurrent rf.sub(r1) and rf.upd(r2,v) where r1 ≠ r2 behaves like both:
- rf.sub(r1) < rf.upd(r2,v)
- rf.sub(r1) > rf.upd(r2,v)

To guarantee rf.upd < rf.sub:
- either bypass the input value to the output when register names match
- or make sure that on concurrent calls rf.upd and rf.sub do not operate on the same register. This is true for our rules because of stalls, but it is too difficult for the compiler to detect.

Bypass Register File

    module mkBypassRFFull (RegFile#(RName,Value));
      RegFile#(RName,Value) rf <- mkRegFileFull();
      // Carries this cycle's write, if any, to concurrent readers
      RWire#(Tuple2#(RName,Value)) rw <- mkRWire();

      method Action upd (RName r, Value d);
        rf.upd(r, d);
        rw.wset(tuple2(r, d));
      endmethod

      method Value sub (RName r);
        case (rw.wget()) matches
          // A write is in flight: forward it if the register names match
          tagged Valid {.wr, .d}: return (wr == r) ? d : rf.sub(r);
          tagged Invalid:         return rf.sub(r);
        endcase
      endmethod
    endmodule

This will work only if the compiler lets us ignore conflicts on the rf made by mkRegFileFull.
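The behavioral difference between the two register files can be seen in a small cycle-level Python model. This is a sketch only: the classes and the explicit `tick` method are invented for illustration and say nothing about how the BSV library modules are implemented.

```python
class RegFile:
    """Ordinary register file: sub < upd.

    A read in the same cycle as a write returns the OLD value; the write
    becomes visible at the next clock edge (tick).
    """
    def __init__(self, size=4):
        self.regs = [0] * size
        self.pending = None          # (register, value) written this cycle

    def upd(self, r, v):
        self.pending = (r, v)

    def sub(self, r):
        return self.regs[r]

    def tick(self):
        """Clock edge: commit the pending write, if any."""
        if self.pending is not None:
            r, v = self.pending
            self.regs[r] = v
            self.pending = None


class BypassRegFile(RegFile):
    """Bypass register file: upd < sub.

    A read of the register being written this cycle sees the NEW value,
    like the RWire forwarding path in mkBypassRFFull.
    """
    def sub(self, r):
        if self.pending is not None and self.pending[0] == r:
            return self.pending[1]
        return self.regs[r]


if __name__ == "__main__":
    rf, brf = RegFile(), BypassRegFile()
    rf.upd(1, 42)
    brf.upd(1, 42)
    print(rf.sub(1), brf.sub(1))   # same-cycle read of the written register
```

After `tick`, both models return the same values; they differ only for a read of the register that is being written in the same cycle.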

Unsafe modules

- Bluespec allows you to import Verilog modules by identifying wires that correspond to methods
- Such modules can be made safe either by asserting the correct scheduling properties of the methods or by wrapping the unsafe modules in appropriate Bluespec code

FIFOs

- Ordinary one-element FIFO: deq and enq conflict, so it won't do
- Loopy FIFO: deq < enq
- Bypass FIFO: enq < deq

What about first, clear, find?

Concurrency analysis: One-Element "Loopy" FIFO

[Figure: FIFO module with enq and deq ports, each with rdy and enab signals; deq is ready when not empty, enq is ready when not full or when deq is enabled]

    module mkLFIFO1 (FIFO#(t));
      Reg#(t)      data  <- mkRegU();
      Reg#(Bool)   full  <- mkReg(False);
      RWire#(void) deqEN <- mkRWire();
      // True when deq is being called in this cycle
      Bool deqp = isValid(deqEN.wget());

      method Action enq(t x) if (!full || deqp);
        full <= True;
        data <= x;
      endmethod

      method Action deq() if (full);
        full <= False;
        deqEN.wset(?);
      endmethod

      method t first() if (full);
        return (data);
      endmethod

      method Action clear();
        full <= False;
      endmethod
    endmodule

Concurrency properties:
- bu.first < bu.enq
- bu.deq < bu.enq
- bu.enq < bu.clear
- bu.deq < bu.clear

Let us extend it with find.
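The deq < enq ordering can be modeled at the cycle level in Python. The class below is an illustrative sketch (its `cycle` interface and the `_NO_ENQ` sentinel are inventions of this example, not part of the BSV module); it applies the same guards as mkLFIFO1 and shows that a full FIFO accepts an enq in the cycle in which it is dequeued.

```python
_NO_ENQ = object()   # sentinel: no enq requested this cycle


class LoopyFifo1:
    """Cycle-level model of a one-element loopy FIFO (deq < enq)."""

    def __init__(self):
        self.full = False
        self.data = None

    def first(self):
        """Guarded by full, like the BSV method."""
        if not self.full:
            raise RuntimeError("first called on an empty FIFO")
        return self.data

    def cycle(self, enq=_NO_ENQ, deq=False):
        """Simulate one clock. Returns (deq_fired, enq_fired)."""
        deq_fired = deq and self.full                     # guard: full
        enq_fired = (enq is not _NO_ENQ
                     and (not self.full or deq_fired))    # guard: !full || deqp
        if deq_fired:
            self.full = False
        if enq_fired:                  # enq is logically last, so it wins
            self.full = True
            self.data = enq
        return deq_fired, enq_fired


if __name__ == "__main__":
    f = LoopyFifo1()
    print(f.cycle(enq="a"))             # fill the FIFO
    print(f.cycle(enq="b", deq=True))   # replace the element in one cycle
    print(f.first())
```

With an ordinary one-element FIFO the second call could not fire both methods, so the FIFO would alternate between full and empty and halve the throughput.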

One-Element Searchable FIFO

    module mkSFIFO1#(function Bool findf(tr r, t x)) (SFIFO#(t,tr));
      Reg#(t)      data  <- mkRegU();
      Reg#(Bool)   full  <- mkReg(False);
      RWire#(void) deqEN <- mkRWire();
      // True when deq is being called in this cycle
      Bool deqp = isValid(deqEN.wget());

      method Action enq(t x) if (!full || deqp);
        full <= True;
        data <= x;
      endmethod

      method Action deq() if (full);
        full <= False;
        deqEN.wset(?);
      endmethod

      method t first() if (full);
        return (data);
      endmethod

      method Action clear();
        full <= False;
      endmethod

      method Bool find(tr r);
        return (findf(r, data) && full);
      endmethod
    endmodule

The slide annotates find with (full && !deqp) as a replacement for full; with that change find observes a deq made in the same cycle, which gives bu.deq < bu.find.

Concurrency properties:
- bu.first < bu.enq
- bu.deq < bu.enq
- bu.find < bu.enq
- bu.deq < bu.find
- bu.enq < bu.clear
- bu.deq < bu.clear
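The effect of the `(full && !deqp)` annotation can be shown with two plain functions. This Python sketch assumes the annotation is meant to replace `full` in find; the function names and the `same_dst` helper are hypothetical.

```python
def find_ignoring_deq(full, data, r, findf):
    """find as written in the module body: findf(r, data) && full."""
    return full and findf(r, data)

def find_after_deq(full, deqp, data, r, findf):
    """find with the annotation: findf(r, data) && (full && !deqp).

    An entry that is being dequeued in this cycle is no longer visible,
    which corresponds to the ordering bu.deq < bu.find.
    """
    return full and not deqp and findf(r, data)

def same_dst(r, entry):
    """Illustrative findf: the entry is the destination register name."""
    return r == entry

if __name__ == "__main__":
    # The FIFO holds an instruction writing r3 and execute dequeues it now.
    print(find_ignoring_deq(True, "r3", "r3", same_dst))
    print(find_after_deq(True, True, "r3", "r3", same_dst))
```

The second form is what lets the fetch stage skip a stall on an instruction that leaves the FIFO in the same cycle.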

What concurrency do we want?

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

Suppose fetch and execAdd happened in the same cycle. The meaning could be either of the following.

fetch < execAdd:
- instructions will fly through the FIFO (no pipelining!), supposing bu is empty initially
- the rf and bu modules will need the properties
  rf: sub < upd (Ordinary RF)
  bu: {find, enq} < {first, deq} (Bypass FIFO)

execAdd < fetch:
- execAdd will make space for the fetched instructions (i.e., how pipelining is supposed to work)
- the rf and bu modules will need the properties
  rf: upd < sub (Bypass RF)
  bu: {first, deq} < {find, enq} (Loopy FIFO)

Now we will focus only on the pipeline case.

Concurrency analysis: Branch Rules

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execBzTaken (it matches tagged EBz {cond:.cv, addr:.av} &&& (cv == 0));
      pc <= av;
      bu.clear();
    endrule

    rule execBzNotTaken (it matches tagged EBz {cond:.cv, addr:.av} &&& !(cv == 0));
      bu.deq();
    endrule

- execBzTaken < fetch? This should be treated as a conflict: give priority to execBzTaken.
- execBzNotTaken < fetch? Requires bu: {first, deq} < {find, enq} (Loopy FIFO).

Concurrency analysis: Load-Store Rules

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execLoad (it matches tagged ELoad {dst:.rd, addr:.av});
      rf.upd(rd, dMem.read(av));
      bu.deq();
    endrule

    rule execStore (it matches tagged EStore {value:.vv, addr:.av});
      dMem.write(av, vv);
      bu.deq();
    endrule

- execLoad < fetch? Same as execAdd, i.e., rf: upd < sub (Bypass RF) and bu: {first, deq} < {find, enq} (Loopy FIFO).
- execStore < fetch? Requires bu: {first, deq} < {find, enq} (Loopy FIFO).

Properties Required of Register File and FIFO for Instruction Pipelining

Register File (Bypass RF):
- rf.upd(r1, v) < rf.sub(r2)

FIFO (Loopy FIFO):
- bu: {first, deq} < {find, enq}, which expands to
  bu.first < bu.find
  bu.first < bu.enq
  bu.deq < bu.find
  bu.deq < bu.enq

Concurrency analysis: Two-stage Pipeline

[Figure: CPU with fetch & decode and execute stages connected by the FIFO bu; shared state pc and rf]

    rule fetch_and_decode (!stallFunc(instr, bu));
      bu.enq(newIt(instr, rf));
      pc <= predIa;
    endrule

    rule execAdd (it matches tagged EAdd {dst:.rd, src1:.va, src2:.vb});
      rf.upd(rd, va+vb);
      bu.deq();
    endrule

    rule execBz (it matches tagged EBz {cond:.cv, addr:.av});
      if (cv == 0) begin
        pc <= av;
        bu.clear();
      end
      else bu.deq();
    endrule

    rule execLoad (it matches tagged ELoad {dst:.rd, addr:.av});
      rf.upd(rd, dMem.read(av));
      bu.deq();
    endrule

    rule execStore (it matches tagged EStore {value:.vv, addr:.av});
      dMem.write(av, vv);
      bu.deq();
    endrule

It all works.

A lot of nontrivial analysis, but no change in the processor code!

What was needed: FIFOs and register files with the appropriate concurrency properties.

Bypassing

- After decoding, the newIt function must read the new register values if available (i.e., the values that are still to be committed in the register file). This will happen automatically if we use bypassRF.
- The instruction fetch must not stall if the new value of the register to be read exists. The old stall function is correct but unable to take advantage of bypassing, so it stalls unnecessarily.

The stall function for the synchronous pipeline

    function Bool newStallFunc (Instr instr, Reg#(Maybe#(InstTemplate)) buReg);
      case (buReg) matches
        tagged Invalid: return False;
        tagged Valid .it:
          case (instr) matches
            tagged Add {dst:.rd, src1:.ra, src2:.rb}:
              return (findf(ra, it) || findf(rb, it));
            …

Previously we stalled when ra matched the destination register of the instruction in the execute stage. Now we bypass that information when we read, so no stall is necessary, and the slide replaces the return in the Add case with:

    return (False);

The stall function for the asynchronous pipeline

    function Bool newStallFunc (Instr instr, SFIFO#(InstTemplate, RName) bu);
      case (instr) matches
        tagged Add {dst:.rd, src1:.ra, src2:.rb}:
          return (bu.find(ra) || bu.find(rb));
        tagged Bz {cond:.rc, addr:.addr}:
          return (bu.find(rc) || bu.find(addr));
        …

- bu.find in our loopy-searchable FIFO happens after deq. This means that if bu can hold at most one instruction, as in the synchronous case, we do not have to stall.
- Otherwise, we will still need to check for hazards and stall.

No change in the stall function.
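The interaction between the stall function and a find that happens after deq can be sketched in Python. Everything here is a simplified model with invented names: instructions are tuples, the FIFO holds at most one entry identified by its destination register, and only the Add and Bz cases shown on the slide are modeled.

```python
def make_find(full, deqp, dst):
    """Build find(r) for a one-element searchable FIFO.

    full: the FIFO holds an instruction whose destination register is dst
    deqp: that instruction is being dequeued in this cycle
    Models 'find after deq': an entry being dequeued is invisible.
    """
    return lambda r: full and not deqp and r == dst

def stall_func(instr, find):
    """Stall when a source register of instr is still pending in the FIFO.

    instr: ("Add", rd, ra, rb) or ("Bz", rc, addr)
    """
    op = instr[0]
    if op == "Add":
        _, _rd, ra, rb = instr
        return find(ra) or find(rb)
    if op == "Bz":
        _, rc, addr = instr
        return find(rc) or find(addr)
    # The remaining cases are elided on the slide, so they are not modeled.
    raise ValueError(f"instruction kind not modeled: {op}")

if __name__ == "__main__":
    add = ("Add", 1, 2, 3)                       # reads r2 and r3
    print(stall_func(add, make_find(True, False, 3)))   # producer stays in bu
    print(stall_func(add, make_find(True, True, 3)))    # producer leaves now
```

The stall function itself is unchanged between the two calls; only the behavior of find differs, which mirrors the slide's point that the function needs no change.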