Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Slides:

Advertisements

Similar presentations

Elastic Pipelines and Basics of Multi-rule Systems Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.

Advertisements

Constructive Computer Architecture: Multirule systems and Concurrent Execution of Rules Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.

Constructive Computer Architecture: FIFO Lab Comments Revisiting CF FIFOs Andy Wright TA October 20, 2014http://csg.csail.mit.edu/6.175L14-1.

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February 20, 2013http://csg.csail.mit.edu/6.375L05-1.

Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Computer Architecture: A Constructive Approach Next Address Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts.

Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

1 Tutorial: Lab 4 Again Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 8,

October 22, 2009http://csg.csail.mit.edu/korea Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

Elastic Pipelines: Concurrency Issues

EHRs: Designing modules with concurrent methods

February 28, 2005http://csg.csail.mit.edu/6.884/L09-1 Bluespec-3: Modules & Interfaces Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Bluespec-3: A non-pipelined processor Arvind

Concurrency properties of BSV methods and rules

Bluespec-6: Modeling Processors

Bluespec-6: Modules and Interfaces

Scheduling Constraints on Interface methods

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Caches-2 Constructive Computer Architecture Arvind

Performance Specifications

Pipelining combinational circuits

Multirule Systems and Concurrent Execution of Rules

Constructive Computer Architecture: Lab Comments & RISCV

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Pipelining combinational circuits

Modular Refinement Arvind

Modular Refinement Arvind

EHR: Ephemeral History Register

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Bluespec-7: Scheduling & Rule Composition

Modeling Processors: Concurrency Issues

Modules with Guarded Interfaces

Pipelining combinational circuits

Elastic Pipelines: Concurrency Issues

Bluespec-3: A non-pipelined processor Arvind

Modular Refinement - 2 Arvind

IP Lookup: Some subtle concurrency issues

Elastic Pipelines: Concurrency Issues

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-5: Modeling Processors

Constructive Computer Architecture: Guards

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-7: Scheduling & Rule Composition

Control Hazards Constructive Computer Architecture: Arvind

Pipelined Processors Constructive Computer Architecture: Arvind

IP Lookup: Some subtle concurrency issues

Tutorial 4: RISCV modules Constructive Computer Architecture

Modeling Processors Arvind

Modeling Processors Arvind

Modular Refinement Arvind

Control Hazards Constructive Computer Architecture: Arvind

Implementing for Correct Concurrency

Modular Refinement Arvind

Constructive Computer Architecture: Well formed BSV programs

Bluespec-8: Modules and Interfaces

Presentation transcript:

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375

Inelastic vs Elastic Pipelines In a Inelastic pipeline: typically only one rule; the designer controls precisely which activities go on in parallel downside: The rule can get too complicated -- easy to make a mistake; difficult to make changes In an Elastic pipeline: several smaller rules, each easy to write, easier to make changes downside: sometimes rules do not fire concurrently when they should Easy: cycle-level concurrency Difficult: precise functional correctness Easy: functional correctness Difficult: precise cycle-level concurrency February 28, 2011L08-2http://csg.csail.mit.edu/6.375

Processor Pipelines and FIFOs fetch execute iMem rf CPU decode memory pc write- back dMem It is better to think in terms of FIFOs as opposed to pipeline registers. February 28, 2011L08-3http://csg.csail.mit.edu/6.375

Fetch & Decode Rule: corrected fetch & decode execute pc rf CPU bu rule decodeAdd (instr matches Add{dst:.rd,src1:.ra,src2:.rb} bu.enq (EAdd{dst:rd, op1:rf[ra], op2:rf[rb]}); pc <= predIa; endrule &&& !bu.find(ra) &&& !bu.find(rb)) February 28, 2011L08-4http://csg.csail.mit.edu/6.375

SFIFO (glue between stages) interface SFIFO#(type t, type tr); method Action enq(t);// enqueue an item method Action deq();// remove oldest entry method t first();// inspect oldest item method Action clear();// make FIFO empty method Bool find(tr);// search FIFO endinterface n = # of bits needed to represent the values of type “t“ m = # of bits needed to represent the values of type “tr" not full not empty rdy enab n n rdy enab rdy enq deq first SFIFO module clear enab find m bool more on searchable FIFOs later February 28, 2011L08-5http://csg.csail.mit.edu/6.375

Two-Stage Pipeline fetch & decode execute pc rf CPU bu module mkCPU#(Mem iMem, Mem dMem)(Empty); Reg#(Iaddress) pc <- mkReg(0); RegFile#(RName, Bit#(32)) rf <- mkRegFileFull(); SFIFO#(InstTemplate, RName) bu <- mkSFifo(findf); Instr instr = iMem.read(pc); Iaddress predIa = pc + 1; InstTemplate it = bu.first(); rule fetch_decode... endmodule February 28, 2011L08-6http://csg.csail.mit.edu/6.375

Rules for Add rule decodeAdd(instr matches Add{dst:.rd,src1:.ra,src2:.rb}) bu.enq (EAdd{dst:rd,op1:rf[ra],op2:rf[rb]}); pc <= predIa; endrule rule executeAdd(it matches EAdd{dst:.rd,op1:.va,op2:.vb}) rf.upd(rd, va + vb); bu.deq(); endrule implicit check: fetch & decode execute pc rf CPU bu bu notfull bu notempty February 28, 2011L08-7http://csg.csail.mit.edu/6.375

Fetch & Decode Rule: Reexamined Wrong! Because instructions in bu may be modifying ra or rb stall ! fetch & decode execute pc rf CPU bu rule decodeAdd (instr matches Add{dst:.rd,src1:.ra,src2:.rb}) bu.enq (EAdd{dst:rd, op1:rf[ra], op2:rf[rb]}); pc <= predIa; endrule February 28, 2011L08-8http://csg.csail.mit.edu/6.375

Rules for Branch rule decodeBz(instr matches Bz{condR:.rc,addrR:.addr}) &&& !bu.find(rc) &&& !bu.find(addr)); bu.enq (EBz{cond:rf[rc],tAddr:rf[addr]}); pc <= predIa; endrule rule bzTaken(it matches EBz{cond:.vc,tAddr:.va}) &&& (vc==0)); pc <= va; bu.clear(); endrule rule bzNotTaken (it matches EBz{cond:.vc,tAddr:.va}) &&& (vc != 0)); bu.deq; endrule fetch & decode execute pc rf CPU bu rule-atomicity ensures that pc update, and discard of pre- fetched instrs in bu, are done consistently February 28, 2011L08-9http://csg.csail.mit.edu/6.375

Fetch & Decode Rule function InstrTemplate newIt(Instr instr); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return EAdd{dst:rd,op1:rf[ra],op2:rf[rb]}; tagged Bz {condR:.rc,addrR:.addr}: return EBz{cond:rf[rc],tAddr:rf[addr]}; tagged Load {dst:.rd,addrR:.addr}: return ELoad{dst:rd,addrR:rf[addr]}; tagged Store{valueR:.v,addrR:.addr}: return EStore{val:rf[v],addr:rf[addr]}; endcase endfunction rule fetch_and_decode (!stallFunc(instr, bu)); bu.enq(newIt(instr)); pc <= predIa; endrule Same as before February 28, 2011L08-10http://csg.csail.mit.edu/6.375

The Stall Signal Bool stall = stallFunc(instr, bu); This need to search the contents of the FIFO is why we need an SFIFO, not just a FIFO function Bool stallFunc (Instr instr, SFIFO#(InstTemplate, RName) bu); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (bu.find(ra) || bu.find(rb)); tagged Bz {condR:.rc,addrR:.addr}: return (bu.find(rc) || bu.find(addr)); tagged Load {dst:.rd,addrR:.addr}: return (bu.find(addr)); tagged Store {valueR:.v,addrR:.addr}: return (bu.find(v)) || bu.find(addr)); endcase endfunction February 28, 2011L08-11http://csg.csail.mit.edu/6.375

The findf function When we make a searchable FIFO we need to supply a function that determines if a register is going to be updated by an instruction template mkSFifo can be parameterized by such a search function SFIFO#(InstrTemplate, RName) bu <- mkSFifo(findf); function Bool findf (RName r, InstrTemplate it); case (it) matches tagged EAdd{dst:.rd,op1:.v1,op2:.v2}: return (r == rd); tagged EBz {cond:.c,tAddr:.a}: return (False); tagged ELoad{dst:.rd,addr:.a}: return (r == rd); tagged EStore{val:.v,addr:.a}: return (False); endcase endfunction Same as before February 28, 2011L08-12http://csg.csail.mit.edu/6.375

Execute Rule rule execute (True); case (it) matches tagged EAdd{dst:.rd,op1:.va,op2:.vb}: begin rf.upd(rd, va+vb); bu.deq(); end tagged EBz {cond:.cv,tAddr:.av}: if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); bu.deq(); end tagged EStore{val:.vv,addr:.av}: begin dMem.write(av, vv); bu.deq(); end endcase endrule February 28, 2011L08-13http://csg.csail.mit.edu/6.375

Concurrency rule fetch_and_decode (!stallFunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rule execute (True); case (it) matches tagged EAdd{dst:.rd,op1:.va,op2:.vb}: begin rf.upd(rd, va+vb); bu.deq(); end tagged EBz {cond:.cv,tAddr:.av}: if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); bu.deq(); end tagged EStore{val:.vv,addr:.av}: begin dMem.write(av, vv); bu.deq(); end endcase endrule fetch & decode execute pc rf CPU bu Can these rules fire concurrently ? Does it matter? February 28, 2011L08-14http://csg.csail.mit.edu/6.375

The tension If the two rules never fire in the same cycle then the machine can hardly be called a pipelined machine Scheduling cannot be too conservative If both rules are enabled and are executed together then in some cases wrong results would be produced Too aggressive a scheduling would violate one-rule- at-time-semantics Case 1: Back-to-back dependencies? Two rules won’t be enabled together (stall function) Case 2: Branch taken? Two rules will be enabled together but only one rule should fire. branch-taken should have priority February 28, 2011L08-15http://csg.csail.mit.edu/6.375

rule execAdd(it matches tagged EAdd{dst:.rd,op1:.va,op2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule rule bzTaken(it matches tagged EBz {cond:.cv,tAddr:.av}) &&& (cv == 0); pc <= av; bu.clear(); endrule rule bzNotTaken(it matches tagged EBz {cond:.cv,tAddr:.av}); &&& !(cv == 0); bu.deq(); endrule rule execLoad(it matches tagged ELoad{dst:.rd,addr:.av}); rf.upd(rd, dMem.read(av)); bu.deq(); endrule rule execStore(it matches tagged EStore{val:.vv,addr:.av}); dMem.write(av, vv); bu.deq(); endrule fetch & decode execute pc rf CPU bu Execution rules Split the execution rule for analysis February 28, 2011L08-16http://csg.csail.mit.edu/6.375

Concurrency analysis Add Rule fetch < execAdd  rf: sub < upd bu: {find, enq} < {first, deq} execAdd < fetch  rf: sub > upd bu: {find, enq} > {first, deq} rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule fetch & decode execute pc rf CPU bu rule execAdd(it matches tagged EAdd{dst:.rd,op1:.va,op2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule rf : sub bu : find, enq pc : read,write execAdd rf : upd bu : first, deq Bypass RF Pipeline SFIFO Ordinary RF Bypass SFIFO February 28, 2011L08-17http://csg.csail.mit.edu/6.375

What concurrency do we want? If fetch and execAdd happened in the same cycle and the meaning was: fetch < execAdd  instructions will fly through the FIFO (No pipelining!)  rf and bu modules will need the properties; rf: sub < upd bu: {find, enq} < {first, deq} execAdd < fetch  execAdd will make space for the fetched instructions (i.e., how pipelining is supposed to work)  rf and bu modules will need the properties; rf: upd < sub bu: {first, deq} < {find, enq} fetch & decode execute pc rf CPU bu Suppose bu is empty initially Now we will focus only on the pipeline case Ordinary RF Bypass RF Bypass SFIFO Pipeline SFIFO February 28, 2011L08-18http://csg.csail.mit.edu/6.375

Concurrency analysis Branch Rules bzTaken < fetch  Should be treated as a conflict; give priority to bzTaken bzNotTaken < fetch  bu: {first, deq} < {find, enq} rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule fetch & decode execute pc rf CPU bu Rule bzTaken(it matches tagged EBz {cond:.cv,tAddr:.av} &&& (cv == 0)); pc <= av; bu.clear(); endrule rule bzNotTaken(it matches tagged EBz {cond:.cv,tAddr:.av} &&& !(cv == 0)); bu.deq(); endrule Pipeline SFIFO February 28, 2011L08-19http://csg.csail.mit.edu/6.375

Concurrency analysis Load-Store Rules execLoad < fetch  rf: upd < sub; bu: {first, deq} < {find, enq} execStore < fetch  bu: {first, deq} < {find, enq} rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule fetch & decode execute pc rf CPU bu rule execStore(it matches tagged EStore{val:.vv,addr:.av}); dMem.write(av, vv); bu.deq(); endrule rule execLoad(it matches tagged ELoad{dst:.rd,addr:.av}); rf.upd(rd, dMem.read(av)); bu.deq(); endrule Bypass RF Pipeline SFIFO February 28, 2011L08-20http://csg.csail.mit.edu/6.375

Properties Required of Register File and FIFO for Instruction Pipelining Register File: rf.upd(r1, v) < rf.sub(r2) Bypass RF FIFO bu: {first, deq} < {find, enq}   bu.first < bu.find  bu.first < bu.enq  bu.deq < bu.find  bu.deq < bu.enq Pipeline SFIFO February 28, 2011L08-21http://csg.csail.mit.edu/6.375

One Element Searchable Pipeline SFIFO module mkSFIFO1#(function Bool findf(tr r, t x)) (SFIFO#(t,tr)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkConfigReg(False); RWire#(void) deqEN <- mkRWire(); Bool deqp = isValid (deqEN.wget())); method Action enq(t x) if (!full || deqp); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); endmethod method t first() if (full); return (data); endmethod method Action clear(); full <= False; endmethod method Bool find(tr r); return (findf(r, data) && full); endmethod endmodule bu.enq > bu.deq bu.enq > bu.first bu.enq < bu.clear (full && !deqp)); bu.find < bu.enq bu.find < bu.deq bu.find < bu.clear bu.deq > bu.first bu.deq < bu.clear bu.find < bu.enq bu.find > bu.deq bu.find < bu.clear February 28, 2011L08-22http://csg.csail.mit.edu/6.375

Suppose we used the wrong SFIFO? bu.find < bu.deq Will the system produce wrong results? NO because the fetch rule will simply conflict with the execute rules February 28, 2011L08-23http://csg.csail.mit.edu/6.375

Register File concurrency properties Normal Register File implementation guarantees: rf.sub < rf.upd  that is, reads happen before writes in concurrent execution But concurrent rf.sub(r1) and rf.upd(r2,v) where r1 ≠ r2 behaves like both rf.sub(r1) < rf.upd(r2,v) rf.sub(r1) > rf.upd(r2,v) To guarantee rf.upd < rf.sub Either bypass the input value to output when register names match Or make sure that on concurrent calls rf.upd and rf.sub do not operate on the same register True for our rules because of stalls but it is too difficult for the compiler to detect February 28, 2011L08-24http://csg.csail.mit.edu/6.375

Bypass Register File module mkBypassRFFull(RegFile#(RName,Value)); RegFile#(RName,Value) rf <- mkRegFileFullWCF(); RWire#(Tuple2#(RName,Value)) rw <- mkRWire(); method Action upd (RName r, Value d); rf.upd(r,d); rw.wset(tuple2(r,d)); endmethod method Value sub(RName r); case rw.wget() matches tagged Valid {.wr,.d}: return (wr==r) ? d : rf.sub(r); tagged Invalid: return rf.sub(r); endcase endmethod endmodule Will work only if the compiler lets us ignore conflicts on the rf made by mkRegFileFull “Config reg file” February 28, 2011L08-25http://csg.csail.mit.edu/6.375

Since our rules do not really require a Bypass Register File, the overhead of bypassing can be avoided by simply using the “Config Regfile” February 28, 2011L08-26http://csg.csail.mit.edu/6.375

Concurrency analysis Two-stage Pipeline rule fetch_and_decode (!stallfunc(instr, bu)); bu.enq(newIt(instr,rf)); pc <= predIa; endrule rule execAdd (it matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}); rf.upd(rd, va+vb); bu.deq(); endrule rule BzTaken(it matches tagged Bz {cond:.cv,addr:.av}) &&& (cv == 0); pc <= av; bu.clear(); endrule rule BzNotTaken(it matches tagged Bz {cond:.cv,addr:.av}); &&& !(cv == 0); bu.deq(); endrule rule execLoad(it matches tagged ELoad{dst:.rd,addr:.av}); rf.upd(rd, dMem.read(av)); bu.deq(); endrule rule execStore(it matches tagged EStore{value:.vv,addr:.av}); dMem.write(av, vv); bu.deq(); endrule fetch & decode execute pc rf CPU bu all concurrent cases work X February 28, 2011L08-27http://csg.csail.mit.edu/6.375

Lot of nontrivial analysis but no change in processor code! Needed Fifos and Register files with the appropriate concurrency properties February 28, 2011L08-28http://csg.csail.mit.edu/6.375

Bypassing After decoding the newIt function must read the new register values if available (i.e., the values that are still to be committed in the register file) Will happen automatically if we use bypassRF The instruction fetch must not stall if the new value of the register to be read exists The old stall function is correct but unable to take advantage of bypassing and stalls unnecessarily February 28, 2011L08-29http://csg.csail.mit.edu/6.375

The stall function for the elastic pipeline function Bool newStallFunc (Instr instr, SFIFO#(InstTemplate, RName) bu); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (bu.find(ra) || bu.find(rb)); tagged Bz {cond:.rc,addr:.addr}: return (bu.find(rc) || bu.find(addr)); … bu.find in our Pipeline SFIFO happens after deq. This means that if bu can hold at most one instruction like in the inelastic case, we do not have to stall. Otherwise, we will still need to check for hazards and stall. No change in the stall function February 28, 2011L08-30http://csg.csail.mit.edu/6.375