Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 23, 2007 http://csg.csail.mit.edu/6.375/
Topics Simultaneous enq & deq in a FIFO The RWire solution Dead cycle elimination in the IP circular pipeline code Two-stage processor pipeline Value forwarding to reduce bubbles http://csg.csail.mit.edu/6.375/ February 23, 2007
Implicit guards (conditions) Rule rule <name> (<guard>); <action>; endrule where <action> ::= r <= <exp> | m.g(<exp>) | if (<exp>) <actions> endif m.gB(<exp>) when m.gG make implicit guards explicit http://csg.csail.mit.edu/6.375/ February 23, 2007
Guards vs If’s A guard on one action of a parallel group of actions affects every action within the group (a1 when p1); (a2 when p2) ==> (a1; a2) when (p1 && p2) A condition of a Conditional action only affects the actions within the scope of the conditional action (if (p1) a1); a2 p1 has no effect on a2 ... Mixing ifs and whens (if (p) (a1 when q)) ; a2 ((if (p) a1); a2) when (p&q | !p) http://csg.csail.mit.edu/6.375/ February 23, 2007
Example: making guards explicit rule recirculate (True); if (p) fifo.enq(8); r <= 7; endrule rule recirculate ((p && fifo.engG) || !p); if (p) fifo.enqB(8); r <= 7; endrule http://csg.csail.mit.edu/6.375/ February 23, 2007
A problem ... (from the last lecture) rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule The fifo needs to be able to do enq and deq simultaneously for this rule to make sense http://csg.csail.mit.edu/6.375/ February 23, 2007
One Element FIFO module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; method t first() if (full); return (data); method Action clear(); endmodule enq and deq cannot even be enabled together much less fire concurrently! The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally n not empty not full rdy enab enq deq module FIFO http://csg.csail.mit.edu/6.375/ February 23, 2007
RWire to rescue interface RWire#(type t); method Action wset(t x); method Maybe#(t) wget(); endinterface Like a register in that you can read and write it but unlike a register - read happens after write - data disappears in the next cycle http://csg.csail.mit.edu/6.375/ February 23, 2007
One Element “Loopy” FIFO module mkLFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); RWire#(void) deqEN <- mkRWire(); method Action enq(t x) if (!full || isValid (deqEN.wget())); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); method t first() if (full); return (data); method Action clear(); full <= False; endmodule not empty not full rdy enab enq deq module FIFO !full or http://csg.csail.mit.edu/6.375/ February 23, 2007
Problem solved! What if fifo is empty? LFIFO fifo <- mkLFIFO; // use a loopy fifo rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule What if fifo is empty? http://csg.csail.mit.edu/6.375/ February 23, 2007
The Dead Cycle Problem rule enter (True); done? RAM cbuf inQ fifo rule enter (True); Token tok <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ext(ip[31:16])); fifo.enq(tuple2(ip[15:0], tok)); inQ.deq(); endrule rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule Can a new request enter the system simultaneously with an old one leaving? http://csg.csail.mit.edu/6.375/ February 23, 2007
Scheduling conflicting rules When two rules conflict on a shared resource, they cannot both execute in the same clock The compiler produces logic that ensures that, when both rules are applicable, only one will fire Which one? source annotations (* descending_urgency = “recirculateh, enter” *) http://csg.csail.mit.edu/6.375/ February 23, 2007
A slightly simpler example enter? done? RAM inQ fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule In general these two rules conflict but when isLeaf(p) is true there is no apparent conflict! http://csg.csail.mit.edu/6.375/ February 23, 2007
Rule Spliting rule foo (True); if (p) r1 <= 5; else r2 <= 7; endrule rule fooT (p); r1 <= 5; endrule rule fooF (!p); r2 <= 7; rule fooT and fooF can be scheduled independently with some other rule http://csg.csail.mit.edu/6.375/ February 23, 2007
Spliting the recirculate rule rule recirculate (!isLeaf(ram.peek())); IP rip = fifo.first(); fifo.enq(rip << 8); ram.req(ram.peek() + rip[15:8]); fifo.deq(); ram.deq(); endrule rule exit (isLeaf(ram.peek())); outQ.enq(ram.peek()); fifo.deq(); ram.deq(); endrule rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule Now rules enter and exit can be scheduled simultaneously http://csg.csail.mit.edu/6.375/ February 23, 2007
Sometimes rule splitting is not possible makes life difficult rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule You will have to resort to interface changes and/or the use of RWires http://csg.csail.mit.edu/6.375/ February 23, 2007
Packaging a module: Turning a rule into a method enter? done? RAM cbuf fifo inQ rule enter (True); Token t <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(tuple2(ip[15:0], t)); inQ.deq(); endrule method Action enter (IP ip); Token t <- cbuf.getToken(); ram.req(ip[31:16]); fifo.enq(tuple2(ip[15:0], t)); endmethod http://csg.csail.mit.edu/6.375/ February 23, 2007
Processor with a two-stage pipeline 5-minute break to stretch you legs February 23, 2007 http://csg.csail.mit.edu/6.375/
Processor Pipelines and FIFOs pc rf decode execute memory write- back fetch iMem dMem CPU http://csg.csail.mit.edu/6.375/ February 23, 2007
SFIFO (glue between stages) interface SFIFO#(type t, type tr); method Action enq(t); // enqueue an item method Action deq(); // remove oldest entry method t first(); // inspect oldest item method Action clear(); // make FIFO empty method Bool find(tr); // search FIFO endinterface n enab enq rdy not full n = # of bits needed to represent the values of type “t“ m = # of bits needed values of type “tr" enab rdy deq SFIFO module not empty n first not empty rdy enab more on searchable FIFOs later clear bool m find http://csg.csail.mit.edu/6.375/ February 23, 2007
Two-Stage Pipeline module mkCPU#(Mem iMem, Mem dMem)(Empty); fetch & decode execute pc rf CPU bu module mkCPU#(Mem iMem, Mem dMem)(Empty); Reg#(Iaddress) pc <- mkReg(0); RegFile#(RName, Bit#(32)) rf <- mkRegFileFull(); SFIFO#(InstTemplate, RName) bu <- mkSFifo(findf); Instr instr = iMem.read(pc); Iaddress predIa = pc + 1; InstTemplate it = bu.first(); rule fetch_decode ... endmodule This first method describe what happens in the fetch of this pipeline. This methods says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this method can fire forever, but because bf is bounded, when a method enqueue into bf, there is an implied predicate that say bf must not be full. http://csg.csail.mit.edu/6.375/ February 23, 2007 8 8 6 8
Instructions & Templates typedef union tagged { struct {RName dst; RName src1; RName src2} Add; struct {RName cond; RName addr} Bz; struct {RName dst; RName addr} Load; struct {RName value; RName addr} Store; } Instr deriving(Bits, Eq); typedef union tagged { struct {RName dst; Value op1; Value op2} EAdd; struct {Value cond; Iaddress tAddr} EBz; struct {RName dst; Daddress addr} ELoad; struct {Value data; Daddress addr} EStore; } InstTemplate deriving(Eq, Bits); typedef Bit#(32) Iaddress; typedef Bit#(32) Daddress; typedef Bit#(32) Value; http://csg.csail.mit.edu/6.375/ February 23, 2007
Rules for Add CPU implicit check: pc rf fetch & decode execute bu rule decodeAdd(instr matches Add{dst:.rd,src1:.ra,src2:.rb}) bu.enq (EAdd{dst:rd,op1:rf[ra],op2:rf[rb]}); pc <= predIa; endrule implicit check: This first method describe what happens in the fetch of this pipeline. This methods says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this method can fire forever, but because bf is bounded, when a method enqueue into bf, there is an implied predicate that say bf must not be full. bu notfull rule executeAdd(it matches EAdd{dst:.rd,op1:.va,op2:.vb}) rf.upd(rd, va + vb); bu.deq(); endrule bu notempty http://csg.csail.mit.edu/6.375/ February 23, 2007 8 6 8 8
Fetch & Decode Rule: Reexamined execute pc rf CPU bu rule decodeAdd (instr matches Add{dst:.rd,src1:.ra,src2:.rb}) bu.enq (EAdd{dst:rd, op1:rf[ra], op2:rf[rb]}); pc <= predIa; endrule This first method describe what happens in the fetch of this pipeline. This methods says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this method can fire forever, but because bf is bounded, when a method enqueue into bf, there is an implied predicate that say bf must not be full. Wrong! Because instructions in bu may be modifying ra or rb stall ! http://csg.csail.mit.edu/6.375/ February 23, 2007 8 6 8 8
Fetch & Decode Rule: corrected execute pc rf CPU bu rule decodeAdd (instr matches Add{dst:.rd,src1:.ra,src2:.rb} bu.enq (EAdd{dst:rd, op1:rf[ra], op2:rf[rb]}); pc <= predIa; endrule &&& !bu.find(ra) &&& !bu.find(rb)) This first method describe what happens in the fetch of this pipeline. This methods says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this method can fire forever, but because bf is bounded, when a method enqueue into bf, there is an implied predicate that say bf must not be full. http://csg.csail.mit.edu/6.375/ February 23, 2007 8 8 6 8
Rules for Branch CPU rule-atomicity ensures that pc update, and fetch & decode execute pc rf CPU bu rule-atomicity ensures that pc update, and discard of pre- fetched instrs in bu, are done consistently rule decodeBz(instr matches Bz{cond:.rc,addr:.addr}) &&& !bu.find(rc) &&& !bu.find(addr)); bu.enq (EBz{cond:rf[rc],addr:rf[addr]}); pc <= predIa; endrule This first method describe what happens in the fetch of this pipeline. This methods says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this method can fire forever, but because bf is bounded, when a method enqueue into bf, there is an implied predicate that say bf must not be full. rule bzTaken(it matches EBz{cond:.vc,addr:.va}) &&& (vc==0)); pc <= va; bu.clear(); endrule rule bzNotTaken (it matches EBz{cond:.vc,addr:.va}) &&& (vc != 0)); bu.deq; endrule http://csg.csail.mit.edu/6.375/ February 23, 2007 8 6 8 8
The Stall Signal Bool stall = case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (bu.find(ra) || bu.find(rb)); tagged Bz {cond:.rc,addr:.addr}: return (bu.find(rc) || bu.find(addr)); tagged Load {dst:.rd,addr:.addr}: return (bu.find(addr)); tagged Store {value:.v,addr:.addr}: return (bu.find(v)) || bu.find(addr)); endcase; This first rule describe what happens in the fetch of this pipeline. This rules says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this rule can fire forever, but because bf is bounded, when a rule enqueue into bf, there is an implied predicate that say bf must not be full. Need to extend the fifo interface with the “find” method where “find” searches the fifo using the findf function http://csg.csail.mit.edu/6.375/ February 23, 2007 8 8 6 8
Parameterization: The Stall Function function Bool stallfunc (Instr instr, SFIFO#(InstTemplate, RName) bu); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return (bu.find(ra) || bu.find(rb)); tagged Bz {cond:.rc,addr:.addr}: return (bu.find(rc) || bu.find(addr)); tagged Load {dst:.rd,addr:.addr}: return (bu.find(addr)); tagged Store {value:.v,addr:.addr}: return (bu.find(v)) || bu.find(addr)); endcase endfunction This first rule describe what happens in the fetch of this pipeline. This rules says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this rule can fire forever, but because bf is bounded, when a rule enqueue into bf, there is an implied predicate that say bf must not be full. We need to include the following call in the mkCPU module Bool stall = stallfunc(instr, bu); no extra gates! http://csg.csail.mit.edu/6.375/ February 23, 2007 8 8 6 8
The findf function function Bool findf (RName r, InstrTemplate it); case (it) matches tagged EAdd{dst:.rd,op1:.ra,op2:.rb}: return (r == rd); tagged EBz {cond:.c,addr:.a}: return (False); tagged ELoad{dst:.rd,addr:.a}: return (r == rd); tagged EStore{value:.v,addr:.a}: return (False); endcase endfunction This first rule describe what happens in the fetch of this pipeline. This rules says that given any processor term, we can always rewrite it to a new term wehre, and we enqueue the current instruction into bf. And at the same time the pc field is incremende by one It may look like this rule can fire forever, but because bf is bounded, when a rule enqueue into bf, there is an implied predicate that say bf must not be full. SFIFO#(InstrTemplate, RName) bu <- mkSFifo(findf); mkSFifo can be parameterized by the search function! http://csg.csail.mit.edu/6.375/ no extra gates! February 23, 2007 8 8 6 8
Fetch & Decode Rule rule fetch_and_decode(!stall); case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: bu.enq(EAdd{dst:rd,op1:rf[ra],op2:rf[rb]}); tagged Bz {cond:.rc,addr:.addr}: bu.enq(EBz{cond:rf[rc],addr:rf[addr]}); tagged Load {dst:.rd,addr:.addr}: bu.enq(ELoad{dst:rd,addr:rf[addr]}); tagged Store{value:.v,addr:.addr}: bu.enq(EStore{value:rf[v],addr:rf[addr]}); endcase pc<= predIa; endrule http://csg.csail.mit.edu/6.375/ February 23, 2007
Fetch & Decode Rule another style InstrTemplate newIt = case (instr) matches tagged Add {dst:.rd,src1:.ra,src2:.rb}: return EAdd{dst:rd,op1:rf[ra],op2:rf[rb]}; tagged Bz {cond:.rc,addr:.addr}: return EBz{cond:rf[rc],addr:rf[addr]}; tagged Load {dst:.rd,addr:.addr}: return ELoad{dst:rd,addr:rf[addr]}; tagged Store{value:.v,addr:.addr}: return EStore{value:rf[v],addr:rf[addr]}; endcase; rule fetch_and_decode (!stall); bu.enq(newIt); pc <= predIa; endrule Conceptually cleaner; hides unnecessary details http://csg.csail.mit.edu/6.375/ February 23, 2007
Execute Rule rule execute (True); case (it) matches tagged EAdd{dst:.rd,src1:.va,src2:.vb}: begin rf.upd(rd, va+vb); bu.deq(); end tagged EBz {cond:.cv,addr:.av}: if (cv == 0) then begin pc <= av; bu.clear(); end else bu.deq(); tagged ELoad{dst:.rd,addr:.av}: begin rf.upd(rd, dMem.read(av)); bu.deq(); end tagged EStore{value:.vv,addr:.av}: begin dMem.write(av, vv); bu.deq(); endcase endrule Next time simultaneous execution of fetch and execute rules http://csg.csail.mit.edu/6.375/ February 23, 2007