Pipelined Processors: Control Hazards


Constructive Computer Architecture: Pipelined Processors: Control Hazards
Arvind
Computer Science & Artificial Intelligence Lab., Massachusetts Institute of Technology
January 1, 2013
http://csg.csail.mit.edu/6.S195/CDAC

Contributors to the course material
Arvind, Rishiyur S. Nikhil, Joel Emer, Muralidaran Vijayaraghavan
Staff and students in 6.375 (Spring 2013), 6.S195 (Fall 2012, 2013), 6.S078 (Spring 2012): Andy Wright, Asif Khan, Richard Ruhler, Sang Woo Jun, Abhinav Agarwal, Myron King, Kermin Fleming, Ming Liu, Li-Shiuan Peh
External: Prof. Amey Karkare & students at IIT Kanpur; Prof. Jihong Kim & students at Seoul National University; Prof. Derek Chiou, University of Texas at Austin; Prof. Yoav Etsion & students at Technion

Two-Cycle SMIPS: Analysis
[Diagram: Fetch stage and Execute stage sharing the PC, +4, Decode, Execute, Register File, Inst Memory, and Data Memory blocks]
In any given clock cycle, a lot of the hardware sits unused! Pipelining the execution of instructions increases the throughput.

Problems in instruction pipelining
[Diagram: Inst_i in the Execute stage and Inst_i+1 being fetched, with PC, +4, Inst Memory, f2d, Decode, Register File, Execute, and Data Memory]
Control hazard: Inst_i+1 is not known until Inst_i is at least decoded. So which instruction should be fetched?
Structural hazard: two instructions in the pipeline may require the same resource at the same time, e.g., contention for memory.
Data hazard: Inst_i may affect the state of the machine (pc, rf, dMem); Inst_i+1 must be fully cognizant of this change.
None of these hazards were present in the FFT pipeline.

Arithmetic versus instruction pipelining
[Diagram: arithmetic pipeline inQ -> f0 -> sReg1 -> f1 -> sReg2 -> f2 -> outQ]
The data items in an arithmetic pipeline, e.g., FFT, are independent of each other. The entities in an instruction pipeline affect each other. This causes pipeline stalls or requires other fancy tricks to avoid stalls. Processor pipelines are therefore significantly more complicated than arithmetic pipelines.

Control Hazards
[Diagram: the same two-stage datapath, with Inst_i in Execute and Inst_i+1 in Fetch]
Inst_i+1 is not known until Inst_i is at least decoded. So which instruction should be fetched?
General solution: speculate, i.e., predict the next instruction address.
- This requires next-instruction-address prediction machinery; it can be as simple as pc+4 (see the sketch below).
- The prediction machinery is usually more elaborate because it dynamically learns from the past behavior of the program.
What if speculation goes wrong? We need machinery to kill the wrong-path instructions, restore the correct processor state, and restart execution at the correct pc.
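As a concrete illustration of the simplest prediction mentioned above, here is a minimal BSV sketch, assuming Addr is the course's address type; the lecture's own nextAddr function may be defined differently elsewhere.

  // Simplest possible next-address prediction: always guess pc+4.
  function Addr nextAddr(Addr pc) = pc + 4;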

Two-stage Pipelined SMIPS
[Diagram: Fetch stage and Decode-RegisterFetch-Execute-Memory-WriteBack stage connected by f2d; Execute sends the correct pc back to Fetch and kills the mispredicted instruction]
The Fetch stage must predict the next instruction to fetch to get any pipelining. In case of a misprediction, the Execute stage must kill the mispredicted instruction in f2d.

Two-stage Pipelined SMIPS
[Diagram: the same two-stage datapath as above]
f2d must hold a Maybe type value because sometimes the fetched instruction is killed. The Fetch2Decode type captures all the information that needs to be passed from Fetch to Decode, i.e., Fetch2Decode {pc: Addr, ppc: Addr, inst: Inst}.
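In BSV this can be written as the following struct and register declaration; a minimal sketch, assuming the Addr and Inst types come from the course's Types package.

  typedef struct {
    Addr pc;    // pc of the fetched instruction
    Addr ppc;   // predicted next pc
    Inst inst;  // the fetched instruction
  } Fetch2Decode deriving (Bits, Eq);

  // Maybe wrapper: Invalid means the slot is empty or the instruction was killed
  Reg#(Maybe#(Fetch2Decode)) f2d <- mkReg(Invalid);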

Pipelining Two-Cycle SMIPS – single rule

rule doPipeline;
  // fetch
  let instF = iMem.req(pc);
  let ppcF = nextAddr(pc);
  let nextPc = ppcF;
  let newf2d = Valid (Fetch2Decode{pc: pc, ppc: ppcF, inst: instF});
  // execute
  if (isValid(f2d)) begin
    let x = fromMaybe(?, f2d);
    let pcD = x.pc; let ppcD = x.ppc; let instD = x.inst;
    let dInst = decode(instD);
    ... register fetch ...;
    let eInst = exec(dInst, rVal1, rVal2, pcD, ppcD);
    ... memory operation ...
    ... rf update ...
    if (eInst.mispredict) begin
      nextPc = eInst.addr;   // nextPc and newf2d are being redefined here
      newf2d = Invalid;
    end
  end
  pc <= nextPc;
  f2d <= newf2d;
endrule

Inelastic versus elastic pipeline
The pipeline just presented is inelastic, that is, it relies on executing Fetch and Execute together, atomically. In a realistic machine, Fetch and Execute behave more asynchronously; for example, memory latency or a functional unit may take a variable number of cycles. If we replace the f2d register by a FIFO, it becomes possible to make the machine more elastic: Fetch keeps putting instructions into f2d, and Execute keeps removing and executing instructions from f2d, as sketched below.
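The structural change is just the declaration of f2d; a minimal sketch, assuming the course's Fifo#(n, t) interface and the mkCFFifo constructor shown later in this lecture.

  // Inelastic: one register holding at most one (possibly killed) instruction
  Reg#(Maybe#(Fetch2Decode)) f2d <- mkReg(Invalid);

  // Elastic: a FIFO between the Fetch and Execute rules
  Fifo#(2, Fetch2Decode) f2d <- mkCFFifo;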

An elastic two-stage pipeline

rule doFetch;
  let instF = iMem.req(pc);
  let ppcF = nextAddr(pc);
  pc <= ppcF;
  f2d.enq(Fetch2Decode{pc: pc, ppc: ppcF, inst: instF});
endrule

rule doExecute;
  let x = f2d.first;
  let pcD = x.pc; let ppcD = x.ppc; let instD = x.inst;
  let dInst = decode(instD);
  ... register fetch ...;
  let eInst = exec(dInst, rVal1, rVal2, pcD, ppcD);
  ... memory operation ...
  ... rf update ...
  if (eInst.mispredict) begin
    pc <= eInst.addr;
    f2d.clear;
  end
  else f2d.deq;
endrule

Can these rules execute concurrently, assuming the FIFO allows concurrent enq, deq, and clear? No, because of a possible double write to pc: Fetch reads and writes pc and enqueues f2d; Execute writes pc and clears or deqs f2d.

An elastic two-stage pipeline: for concurrency, make pc an EHR

rule doFetch;
  let instF = iMem.req(pc[0]);
  let ppcF = nextAddr(pc[0]);
  pc[0] <= ppcF;
  f2d.enq(Fetch2Decode{pc: pc[0], ppc: ppcF, inst: instF});
endrule

rule doExecute;
  let x = f2d.first;
  let pcD = x.pc; let ppcD = x.ppc; let instD = x.inst;
  let dInst = decode(instD);
  ... register fetch ...;
  let eInst = exec(dInst, rVal1, rVal2, pcD, ppcD);
  ... memory operation ...
  ... rf update ...
  if (eInst.mispredict) begin
    pc[1] <= eInst.addr;
    f2d.clear;
  end
  else f2d.deq;
endrule

These rules can execute concurrently, assuming the FIFO has (enq CF deq) and (enq < clear). Fetch reads and writes pc and enqueues f2d; Execute writes pc and clears or deqs f2d. The machine works correctly with both Bypass and CF FIFOs, but only a CF FIFO gives us the pipelined behavior, so we want Fetch < Execute within one cycle. The double write to pc has been replaced by prioritized writes through the EHR.

Conflict-free FIFO with a Clear method
[Diagram: two-element FIFO with slots da (older) and db (younger)]

module mkCFFifo(Fifo#(2, t)) provisos(Bits#(t, tSz));
  Ehr#(3, t)    da <- mkEhr(?);
  Ehr#(3, Bool) va <- mkEhr(False);
  Ehr#(3, t)    db <- mkEhr(?);
  Ehr#(3, Bool) vb <- mkEhr(False);

  rule canonicalize if (vb[2] && !va[2]);
    da[2] <= db[2]; va[2] <= True; vb[2] <= False;
  endrule

  method Action enq(t x) if (!vb[0]);
    db[0] <= x; vb[0] <= True;
  endmethod

  method Action deq if (va[0]);
    va[0] <= False;
  endmethod

  method t first if (va[0]);
    return da[0];
  endmethod

  method Action clear;
    va[1] <= False; vb[1] <= False;
  endmethod
endmodule

If there is only one element in the FIFO, it resides in da. The resulting concurrency properties: first CF enq, deq CF enq, first < deq, enq < clear. Canonicalize must be the last rule to fire!

Why canonicalize must be the last rule to fire

rule foo;
  f.deq;
  if (p) f.clear;
endrule

Consider rule foo. If p is false, then canonicalize must fire after deq for proper concurrency. If canonicalize used EHR indices between those of deq and clear, then canonicalize would not fire when p is false. (Recall the properties: first CF enq, deq CF enq, first < deq, enq < clear.)

Correctness issue
[Diagram: Fetch and Execute connected by a FIFO carrying <pc, ppc, inst>, with the PC shared between them]
Once Execute redirects the PC, no wrong-path instruction should be executed; the next instruction executed must be the redirected one. This holds for the code shown because Execute changes the pc and clears the FIFO atomically, and Fetch reads the pc and enqueues into the FIFO atomically.

Killing fetched instructions
In the simple design with combinational memory discussed so far, the mispredicted instruction is present in f2d, so the Execute stage can atomically clear f2d and set the pc to the correct target. In highly pipelined machines there can be multiple mispredicted and partially executed instructions in the pipeline; it will generally take more than one cycle to kill all such instructions. We need a more general solution than clearing the f2d FIFO.

Epoch: a method for managing control hazards
[Diagram: Fetch and Execute sharing the PC and Epoch registers; f2d carries the fetched instruction, and Execute sends the target PC back to Fetch]
- Add an epoch register to the processor state.
- The Execute stage changes the epoch whenever the pc prediction is wrong, and sets the pc to the correct value.
- The Fetch stage associates the current epoch with every instruction when it is fetched.
- The epoch of the instruction is examined when it is ready to execute; if the processor epoch has changed, the instruction is thrown away.
The state needed for this scheme is sketched below.
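A minimal sketch of the added state and of the Fetch2Decode struct extended with an epoch field, assuming the Addr and Inst types from the course library; the field names are chosen to match the code on the next slide.

  typedef struct {
    Addr pc;
    Addr ppc;
    Bit#(1) epoch;  // the epoch in which the instruction was fetched
    Inst inst;
  } Fetch2Decode deriving (Bits, Eq);

  Ehr#(2, Addr) pc    <- mkEhr(0);  // reset pc; the real design may start elsewhere
  Reg#(Bit#(1)) epoch <- mkReg(0);  // two values of epoch suffice for this pipeline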

An epoch based solution

rule doFetch;
  let instF = iMem.req(pc[0]);
  let ppcF = nextAddr(pc[0]);
  pc[0] <= ppcF;
  f2d.enq(Fetch2Decode{pc: pc[0], ppc: ppcF, epoch: epoch, inst: instF});
endrule

rule doExecute;
  let x = f2d.first;
  let pcD = x.pc; let inEp = x.epoch;
  let ppcD = x.ppc; let instD = x.inst;
  if (inEp == epoch) begin
    let dInst = decode(instD);
    ... register fetch ...;
    let eInst = exec(dInst, rVal1, rVal2, pcD, ppcD);
    ... memory operation ...
    ... rf update ...
    if (eInst.mispredict) begin
      pc[1] <= eInst.addr;
      epoch <= epoch + 1;
    end
  end
  f2d.deq;
endrule

Can these rules execute concurrently? Yes; two values of epoch are sufficient. Fetch reads and writes pc, reads epoch, and enqueues f2d; Execute writes pc, writes epoch, and deqs f2d; Fetch < Execute within the same cycle. The machine works with both CF and Bypass FIFOs for f2d, but only a CF FIFO gives pipelined behavior. If inEp were replaced by epoch in Execute, then Fetch and Execute could not be scheduled concurrently.

Discussion
The epoch based solution kills one wrong-path instruction at a time in the Execute stage. It may be slow, but it is more robust in more complex pipelines, e.g., if there are multiple stages between Fetch and Execute or if there are outstanding instruction requests to the iMem. It requires the Execute stage to set the pc and epoch registers simultaneously, which may result in a long combinational path from Execute to Fetch.

Decoupled Fetch and Execute
[Diagram: Fetch and Execute connected by two FIFOs, one carrying <inst, pc, ppc, epoch> forward and one carrying <corrected pc, new epoch> back]
In decoupled systems a subsystem reads and modifies only local state, atomically. In our solution so far, pc and epoch are read by both rules. Properly decoupled systems permit greater freedom in the independent refinement of subsystems.

A decoupled solution using epochs
[Diagram: fEpoch local to Fetch, eEpoch local to Execute]
- Add fEpoch and eEpoch registers to the processor state; initialize them to the same value.
- The epoch changes whenever Execute detects that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch, via a message from Execute to Fetch.
- Associate fEpoch with every instruction when it is fetched.
- In the Execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch.

Control hazard resolution: a robust two-rule solution
[Diagram: Fetch with fEpoch and Execute with eEpoch, connected by the f2d FIFO forward and a redirect-PC FIFO backward, alongside the PC, +4, Decode, Execute, Register File, Inst Memory, and Data Memory blocks]
Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect-PC FIFO.

Two-stage pipeline: decoupled code structure

module mkProc(Proc);
  Fifo#(Fetch2Execute) f2d <- mkFifo;
  Fifo#(Addr) execRedirect <- mkFifo;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);

  rule doFetch;
    let instF = iMem.req(pc);
    ...
    f2d.enq(... instF ..., fEpoch);
  endrule

  rule doExecute;
    if (inEp == eEpoch) begin
      // decode and execute the instruction; update state;
      // in case of a misprediction, execRedirect.enq(correct pc)
    end
    f2d.deq;
  endrule
endmodule

The Fetch rule

rule doFetch;
  let instF = iMem.req(pc);
  if (!execRedirect.notEmpty) begin
    // pass the pc and the predicted pc to the execute stage
    let ppcF = nextAddrPredictor(pc);
    pc <= ppcF;
    f2d.enq(Fetch2Execute{pc: pc, ppc: ppcF, inst: instF, epoch: fEpoch});
  end
  else begin
    fEpoch <= !fEpoch;
    pc <= execRedirect.first;
    execRedirect.deq;
  end
endrule

Notice: in case of a PC redirection, nothing is enqueued into f2d.

The Execute rule
exec returns a flag indicating whether there was a fetch misprediction.

rule doExecute;
  let instD = f2d.first.inst;
  let pcD   = f2d.first.pc;
  let ppcD  = f2d.first.ppc;
  let inEp  = f2d.first.epoch;
  if (inEp == eEpoch) begin
    let dInst = decode(instD);
    let rVal1 = rf.rd1(validRegValue(dInst.src1));
    let rVal2 = rf.rd2(validRegValue(dInst.src2));
    let eInst = exec(dInst, rVal1, rVal2, pcD, ppcD);
    if (eInst.iType == Ld)
      eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
    else if (eInst.iType == St)
      let d <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
    if (isValid(eInst.dst))
      rf.wr(validRegValue(eInst.dst), eInst.data);
    if (eInst.mispredict) begin
      execRedirect.enq(eInst.addr);
      eEpoch <= !inEp;
    end
  end
  f2d.deq;
endrule

Can these rules execute concurrently? Yes, assuming CF FIFOs.

Overview of control prediction
[Diagram: pipeline PC -> Decode -> Reg Read -> Execute -> Write Back, with a Next Addr Pred at the PC stage and misprediction-correction paths coming back from the later stages. The next PC is needed immediately at Fetch; the instruction type and PC-relative targets become available at Decode; simple conditions and register targets become available at Reg Read; complex conditions become available at Execute. Mispredicted instructions must be filtered.]
With respect to branches the important steps are… With "stall style" dependence resolution, the correct next-PC calculation waits for… To alleviate stalls, speculate the next PC; the dependence information now becomes a speculation check.
Given (pc, ppc), a misprediction can be corrected (used to redirect the pc) as soon as it is detected. In fact, the pc can be redirected as soon as we have a "better" prediction. However, for forward progress it is important that a correct pc is never redirected.

Next Address Predictor (NAP): first attempt
[Diagram: the pc indexes a Branch Target Buffer (BTB) of 2^k entries alongside the iMem; the BTB supplies the predicted target]
The pc is the only information the NAP has. Fetch: ppc = the target looked up in the BTB. Later, check the prediction; if it is wrong, kill the instruction and update the BTB.

Branch Target Buffer (BTB)
[Diagram: a 2^k-entry direct-mapped BTB indexed by k bits of the PC; each entry holds a valid bit, an entry PC for the tag match, and a predicted target PC]
- Keep (pc, target pc) pairs in the BTB.
- pc+4 is predicted if no pc match is found.
- The BTB is updated only for branches and jumps.
- This permits ppc to be determined before the instruction is decoded. A sketch of such a predictor module is shown below.
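A minimal BSV sketch of a direct-mapped BTB along these lines, assuming a 32-bit Addr type from the course's Types package; the interface name, table size, and helper function here are illustrative, not the lab's actual mkBtb.

  import Vector::*;

  interface NextAddrPred;
    method Addr predPc(Addr pc);                 // predicted next pc
    method Action update(Addr pc, Addr target);  // train on a taken branch/jump
  endinterface

  // 64-entry (k = 6) direct-mapped BTB; the size is arbitrary for illustration
  module mkBtbSketch(NextAddrPred);
    Vector#(64, Reg#(Addr)) targets <- replicateM(mkRegU);
    Vector#(64, Reg#(Addr)) tags    <- replicateM(mkRegU);
    Vector#(64, Reg#(Bool)) valids  <- replicateM(mkReg(False));

    // index with k bits of the pc (word-aligned, so skip the low 2 bits)
    function Bit#(6) idx(Addr pc) = truncate(pc >> 2);

    method Addr predPc(Addr pc);
      let i = idx(pc);
      // predict the stored target on a tag match, otherwise pc+4
      return (valids[i] && tags[i] == pc) ? targets[i] : pc + 4;
    endmethod

    method Action update(Addr pc, Addr target);
      let i = idx(pc);
      tags[i] <= pc; targets[i] <= target; valids[i] <= True;
    endmethod
  endmodule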