Control Hazards Constructive Computer Architecture: Arvind

Slides:



Advertisements
Similar presentations
Constructive Computer Architecture: Data Hazards in Pipelined Processors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.
Advertisements

Constructive Computer Architecture: Multirule systems and Concurrent Execution of Rules Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.
Constructive Computer Architecture: Branch Prediction: Direction Predictors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.
Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.
Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1
Constructive Computer Architecture: Branch Prediction Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.
September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.
Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.
Computer Architecture: A Constructive Approach Branch Prediction - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of.
Computer Architecture: A Constructive Approach Next Address Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts.
Computer Architecture: A Constructive Approach Branch Direction Prediction – Pipeline Integration Joel Emer Computer Science & Artificial Intelligence.
Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.
February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.
1 Tutorial: Lab 4 Again Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
Constructive Computer Architecture Realistic Memories and Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.
Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 8,
October 22, 2009http://csg.csail.mit.edu/korea Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.
Constructive Computer Architecture Tutorial 6: Five Details of SMIPS Implementations Andy Wright 6.S195 TA October 7, 2013http://csg.csail.mit.edu/6.s195T05-1.
Computer Architecture: A Constructive Approach Multi-Cycle and 2 Stage Pipelined SMIPS Implementations Teacher: Yoav Etsion Taken (with permission) from.
October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.
Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010
Elastic Pipelines: Concurrency Issues
6.175: Constructive Computer Architecture Tutorial 5 Epochs, Debugging, and Caches Quan Nguyen (Troubled by the two biggest problems in computer science…
Control Hazards Constructive Computer Architecture: Arvind
Bluespec-6: Modeling Processors
Scheduling Constraints on Interface methods
Tutorial 7: SMIPS Epochs Constructive Computer Architecture
Sequential Circuits: Constructive Computer Architecture
Constructive Computer Architecture Tutorial 6: Discussion for lab6
Branch Prediction Constructive Computer Architecture: Arvind
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Branch Prediction Constructive Computer Architecture: Arvind
Multistage Pipelined Processors and modular refinement
in Pipelined Processors
Performance Specifications
Pipelining combinational circuits
Multirule Systems and Concurrent Execution of Rules
Constructive Computer Architecture: Lab Comments & RISCV
Non-Pipelined Processors - 2
Pipelining combinational circuits
Modular Refinement Arvind
Modular Refinement Arvind
Constructive Computer Architecture Tutorial 5 Epoch & Branch Predictor
Lab 4 Overview: 6-stage SMIPS Pipeline
Non-Pipelined and Pipelined Processors
in Pipelined Processors
Control Hazards Constructive Computer Architecture: Arvind
Modules with Guarded Interfaces
Pipelining combinational circuits
Branch Prediction Constructive Computer Architecture: Arvind
Multistage Pipelined Processors and modular refinement
Modular Refinement Arvind
Realistic Memories and Caches
Branch Prediction: Direction Predictors
Branch Prediction: Direction Predictors
Modular Refinement - 2 Arvind
Multistage Pipelined Processors and modular refinement
in Pipelined Processors
Pipelined Processors Arvind
Pipelined Processors Constructive Computer Architecture: Arvind
Branch Prediction: Direction Predictors
Tutorial 4: RISCV modules Constructive Computer Architecture
Modeling Processors Arvind
Modeling Processors Arvind
Modular Refinement Arvind
Control Hazards Constructive Computer Architecture: Arvind
Modular Refinement Arvind
Tutorial 7: SMIPS Labs and Epochs Constructive Computer Architecture
Pipelined Processors: Control Hazards
Presentation transcript:

Control Hazards Constructive Computer Architecture: Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October 11, 2017 http://csg.csail.mit.edu/6.175

Synchronous 2-Stage Pipeline pc ir Execute Fetch invalid real pc guessed next pc fetched instruction, ... next state we will call it pcF Fetch and Execute are concurrently active on two different instructions; Fetch guesses the next pc and Execute corrects it when necessary October 11, 2017 http://csg.csail.mit.edu/6.175

Synchronous 2-Stage Pipeline pc ir Execute Fetch invalid real pc guessed next pc fetched instruction, ... Synchronous 2-Stage Pipeline rule doPipeline ; let newInst = iMem.req(pcF); let newPcF = nap(pcF); let newIR= Valid(Fetch2Decode{pc:pcF,ppc:newPcF, inst:newInst}); if(isValid(ir)) begin let x = fromMaybe(?, ir); let pc = x.pc; let inst = x.inst; let dInst = decode(inst); ... register fetch, exec, memory op, rf update ... let nextPC = eInst.brTaken ? eInst.addr : pc + 4; if (x.ppc != nextPC) begin newIR = Invalid; newPcF = nextPC; end end pcF <= newPcF; ir <= newIR; endrule pass pcF and predicted pc to the execute stage fetch execute October 11, 2017 http://csg.csail.mit.edu/6.175

Performance? pc ir fetch execute invalid real pc guessed next pc fetched instruction, ... Performance? rule doPipeline ; let newInst = iMem.req(pcF); let newPcF = nap(pcF); let newIR=Valid(Fetch2Decode{pc:pcF, ppc:newPcF, inst:newInst}); if(isValid(ir)) begin let x = fromMaybe(?, ir); let pc = x.pc; let inst = x.inst; let dInst = decode(inst); ... register fetch, exec, memory op, rf update, nextPC ... if (x.ppc != nextPC) begin newIR = Invalid; newPcF = nextPC; end end pcF <= newPcF; ir <= newIR; endrule fetch Notice there is always a bubble (dead cycle) after every miss-prediction The critical path: max{tnewPcF, tnewIr}  max{ max{tnap, texec}, max{tiMem, texec}}  max{tiMem, texec} execute texec includes tdecode etc. The critical path is not (tiMem + texec) October 11, 2017 http://csg.csail.mit.edu/6.175

Elastic two-stage pipeline Fetch pc redirect Execute PC f2d <inst, pc, ppc> We replace f2d register by a FIFO to make the machine more elastic, that is, Fetch keeps putting instructions into f2d and Execute keeps removing and executing instructions from f2d Fetch passes the pc and predicted pc in addition to the inst to Execute; Execute redirects the PC in case of a miss-prediction October 11, 2017 http://csg.csail.mit.edu/6.175

An elastic Two-Stage pipeline rule doFetch ; let inst = iMem.req(pcF); let newPcF = nap(pcF); pcF <= newPcF; f2d.enq(Fetch2Decode{pc:pcF, ppc:newPcF, inst:inst}); endrule rule doExecute ; let x = f2d.first; let pc = x.pc; let inst = x.inst; ... register fetch, exec, memory op, rf update, nextPC ... if (x.ppc != nextPC) begin pcF <= eInst.addr; f2d.clear; end else f2d.deq; Can these rules execute concurrently assuming the FIFO allows concurrent enq, deq and clear? No, double writes in pc Fetch: reads and writes PC, enq ir Execute: writes PC, clear and deq ir The machine works correctly with Bypass and CF Fifos, but only CF Fifo will give us the pipelined behavior So, Fetch < Execute in one cycle These rules can execute in any order, however, the execution of doExecute may throw away fetched instructions clear vs deq ? October 11, 2017 http://csg.csail.mit.edu/6.175

For concurrency make pc into an EHR design 1 rule doFetch ; let inst = iMem.req(pcF[0]); let newPcF = nap(pcF[0]); pcF[0] <= newPcF; f2d.enq(Fetch2Decode{pc:pcF[0], ppc:newPcF, inst:inst}); endrule rule doExecute ; let x = f2d.first; let pc = x.pc; let inst = x.inst; ... register fetch, exec, memory op, rf update, nextPC ... if (x.ppc != nextPC) begin pcF[1] <= eInst.addr; f2d.clear; end else f2d.deq; doFetch < doExecute Notice, for concurrency, f2d implementation must guarantee that (enq < clear) Fetch: reads and writes PC, enq ir Execute: writes PC, clear and deq ir The machine works correctly with Bypass and CF Fifos, but only CF Fifo will give us the pipelined behavior So, Fetch < Execute in one cycle October 11, 2017 http://csg.csail.mit.edu/6.175

Concurrency and Correctness Fetch < Execute pc redirect Execute PC f2d <inst, pc, ppc> Once Execute redirects the PC, no wrong path instruction should be executed the next instruction executed must be the redirected one Thus, concurrent execution requires (enq < clear) Performance? A dead-cycle or pipeline bubble after each miss prediction October 11, 2017 http://csg.csail.mit.edu/6.175

doFetch < doExecute Design 1 Performance doFetch < doExecute enq < clear rule doFetch ; let inst = iMem.req(pcF[0]); let newPcF = nap(pcF[0]); pcF[0] <= newPcF; f2d.enq(Fetch2Decode{pc:pcF[0], ppc:newPcF, inst:inst}); endrule rule doExecute ; let x = f2d.first; let pc = x.pc; let inst = x.inst; ... register fetch, exec, memory op, rf update, nextPC ... if (x.ppc != nextPC) begin pcF[1] <= eInst.addr; f2d.clear; end else f2d.deq; f2d is guaranteed to be empty after each misprediction (the same as the synchronous design) Fetch: reads and writes PC, enq ir Execute: writes PC, clear and deq ir The machine works correctly with Bypass and CF Fifos, but only CF Fifo will give us the pipelined behavior So, Fetch < Execute in one cycle October 11, 2017 http://csg.csail.mit.edu/6.175

Design 2 doExecute < doFetch rule doFetch ; let inst = iMem.req(pcF[1]); let newPcF = nap(pcF[1]); pcF[1] <= newPcF; f2d.enq(Fetch2Decode{pc:pcF[1], ppc:newPcF, inst:inst}); endrule rule doExecute ; let x = f2d.first; let pc = x.pc; let inst = x.inst; ... register fetch, exec, memory op, rf update, nextPC ... if (x.ppc != nextPC) begin pcF[0] <= eInst.addr; f2d.clear; end else f2d.deq; 1. Concurrency: should (clear < enq ) ? 2. Does this design have better performance? Fetch: reads and writes PC, enq ir Execute: writes PC, clear and deq ir The machine works correctly with Bypass and CF Fifos, but only CF Fifo will give us the pipelined behavior So, Fetch < Execute in one cycle October 11, 2017 http://csg.csail.mit.edu/6.175

Design 2 correctness/concurrency Execute < Fetch pc redirect Execute PC f2d <inst, pc, ppc> Once Execute redirects the PC, no wrong path instruction should be executed the next instruction executed must be the redirected one Thus, concurrent execution requires (clear < enq) Performance? No dead-cycle but the critical path length is (tiMem + texec) Slower clock means every instruction will take longer! October 11, 2017 http://csg.csail.mit.edu/6.175

Takeaway Get the functionality right before worrying about concurrency Introduce EHRs systematically to avoid rule conflicts; analyze various designs for dead cycles and critical path lengths BSV compiler produces information about conflicts Dead cycles can be estimated by running suitable benchmark programs Estimation of critical paths is often difficult and requires hardware synthesis tools October 11, 2017 http://csg.csail.mit.edu/6.175

Killing fetched instructions In the simple design with combinational memory we have discussed so far, all the mispredicted instructions were present in f2d. So the Execute stage can atomically: Clear f2d Set pc to the correct target In highly pipelined machines there can be multiple mispredicted and partially executed instructions in the pipeline; it will generally take more than one cycle to kill all such instructions Need a more general solution then clearing the f2d FIFO October 11, 2017 http://csg.csail.mit.edu/6.175

Epoch: a method to manage control hazards Add an epoch register in the processor state The Execute stage changes the epoch whenever the pc prediction is wrong and sets the pc to the correct value The Fetch stage associates the current epoch with every instruction when it is fetched The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed the instruction is thrown away PC iMem nap f2d Epoch Fetch Execute inst targetPC October 11, 2017 http://csg.csail.mit.edu/6.175

An epoch based solution Can these rules execute concurrently ? rule doFetch ; let instF=iMem.req(pcF[0]); let ppcF=nap(pcF[0]); pcF[0]<=ppcF; f2d.enq(Fetch2Decode{pc:pcF[0],ppc:ppcF,epoch:epoch, inst:instF}); endrule rule doExecute; let x=f2d.first; let pc=x.pc; let inEp=x.epoch; let inst = x.inst; if(inEp == epoch) begin ...decode, register fetch, exec, memory op, rf update nextPC ... if (x.ppc != nextPC) begin pcF[1] <= eInst.addr; epoch <= next(epoch); end end f2d.deq; endrule yes two values for epoch are sufficient Fetch reads and writes PC, writes epoch and enqs ir Execute writes PC, writes epoch and deqs ir Fetch < Execute within the same cycle The machine works with both CF and Bypass FIFOs for ir, but only CF Fifo gives pipelined behavior If inEp was replaced by epoch in Execute, then Fetch and Execute can not be scheduled concurrently October 11, 2017 http://csg.csail.mit.edu/6.175

Discussion Epoch based solution kills one wrong-path instruction at a time in the execute stage It may be slow, but it is more robust in more complex pipelines, if you have multiple stages between fetch and execute or if you have outstanding instruction requests to the iMem It requires the Execute stage to set the pc and epoch registers simultaneously which may result in a long combinational path from Execute to Fetch October 11, 2017 http://csg.csail.mit.edu/6.175

Decoupled Fetch and Execute <corrected pc, new epoch> Fetch Execute <inst, pc, ppc, epoch> In decoupled systems a subsystem reads and modifies only local state atomically In our solution, pc and epoch are read by both rules Properly decoupled systems permit greater freedom in independent refinement of subsystems October 11, 2017 http://csg.csail.mit.edu/6.175

A decoupled solution using epochs Fetch fEpoch eEpoch Execute Add fEpoch and eEpoch registers to the processor state; initialize them to the same value The epoch changes whenever Execute detects the pc prediction to be wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via a message from Execute to Fetch Associate fEpoch with every instruction when it is fetched In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch October 11, 2017 http://csg.csail.mit.edu/6.175

Control Hazard resolution A robust two-rule solution eEpoch fEpoch FIFO Register File redirect PC f2d Decode Execute +4 FIFO Inst Memory Data Memory Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it examines the redirect (PC) fifo October 11, 2017 http://csg.csail.mit.edu/6.175 19

Two-stage pipeline Decoupled code structure module mkProc(Proc); Fifo#(Fetch2Execute) f2d <- mkFifo; Fifo#(Addr) redirect <- mkFifo; Reg#(Bool) fEpoch <- mkReg(False); Reg#(Bool) eEpoch <- mkReg(False); rule doFetch; let inst = iMem.req(pcF); ... f2d.enq(... inst ..., fEpoch); endrule rule doExecute; if(inEp == eEpoch) begin Decode and execute the instruction; update state; In case of misprediction, redirect.enq(correct pc); end f2d.deq; endrule endmodule October 11, 2017 http://csg.csail.mit.edu/6.175

The Fetch rule rule doFetch; let inst = iMem.req(pcF); if(!redirect.notEmpty) begin let newPcF = nap(pcF); pcF <= newPcF; f2d.enq(Fetch2Execute{pc: pcF, ppc: newPcF, inst: inst, epoch: fEpoch}); end else begin fEpoch <= !fEpoch; pcF <= redirect.first; redirect.deq; end endrule Notice: In case of PC redirection, nothing is enqueued into f2d October 11, 2017 http://csg.csail.mit.edu/6.175

The Execute rule Can doFetch and doExecute execute concurrently? rule doExecute; let x = f2d.first; let inst = x.inst; let pc = x.pc; let inEp = x.epoch; if(inEp == eEpoch) begin ...decode, register fetch, exec, memory op, rf update nextPC ... if (x.ppc != nextPC) begin redirect.enq(eInst.addr); eEpoch <= !inEp; end end f2d.deq; endrule Can doFetch and doExecute execute concurrently? yes, assuming CF FIFOs October 11, 2017 http://csg.csail.mit.edu/6.175

Epoch mechanism is independent of the sophisticated branch prediction schemes that we will study later October 11, 2017 http://csg.csail.mit.edu/6.175