
Computer Architecture: A Constructive Approach
Data Hazards and Multistage Pipelines
Teacher: Yoav Etsion
Taken (with permission) from Arvind et al.*, Massachusetts Institute of Technology, and Derek Chiou, The University of Texas at Austin
* Joel Emer, Li-Shiuan Peh, Murali Vijayaraghavan, Asif Khan, Abhinav Agarwal, Myron King

Two-Stage pipeline: a robust two-rule solution
[Block diagram: PC and +4 drive the Inst Memory; Decode reads the Register File; the ir FIFO carries decoded instructions from the fetch/decode stage to Execute, which accesses the Data Memory; the nextPC FIFO carries redirects from Execute back to fetch; fEpoch and eEpoch tag the two stages.]
ir is drawn as a pipeline FIFO and nextPC as a bypass FIFO; either FIFO can be a normal (>1 element) FIFO.
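The slides use one-element pipeline and bypass FIFOs without showing their implementation. A minimal sketch of a one-element pipeline FIFO (deq ordered before enq in the same cycle), written against the EHR interface (ports r0/w0 before r1/w1) that the scoreboard slides below also use, might look like the following; the interface and module names are illustrative, not the lecture's actual library code.

interface Fifo#(type t);
  method Bool   notFull;
  method Action enq(t x);
  method Bool   notEmpty;
  method t      first;
  method Action deq;
endinterface

// Assumes Vector::* and the course's EHR package are imported.
module mkPipelineFifo(Fifo#(t)) provisos (Bits#(t, tSz));
  Reg#(t)       data <- mkRegU;
  EHR#(2, Bool) full <- mkEHR(False);

  // first/deq use port 0 and enq uses port 1, so deq < enq:
  // a slot freed by deq can be refilled by enq in the same cycle.
  method Bool notEmpty = full.r0;
  method t first if (full.r0);
    return data;
  endmethod
  method Action deq if (full.r0);
    full.w0(False);
  endmethod

  method Bool notFull = !full.r1;
  method Action enq(t x) if (!full.r1);
    data <= x;
    full.w1(True);
  endmethod
endmodule

A bypass FIFO reverses the ordering (enq < deq); there the data must also be held in an EHR so that first can return a value enqueued in the same cycle.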

A different 2-Stage pipeline: the 2-Stage-DH pipeline
[Block diagram: the same structure as before: PC/+4, Inst Memory, Decode, Register File, Execute, Data Memory; the fetch-to-execute stage register is now called itr, with nextPC, fEpoch, and eEpoch handling redirects.]

TypeDecode2Execute
The decoded instruction carries register values instead of register names:

typedef struct {
  Addr        pc;
  Bool        epoch;
  DecodedInst dInst;
  Data        rVal1;
  Data        rVal2;
} TypeDecode2Execute deriving (Bits, Eq);

2-Stage-DH pipeline: first attempt

module mkProc(Proc);
  Reg#(Addr)    pc     <- mkRegU;
  RFile         rf     <- mkRFile;
  IMemory       iMem   <- mkIMemory;
  DMemory       dMem   <- mkDMemory;
  PipeReg#(TypeDecode2Execute) itr <- mkPipeReg;
  Reg#(Bool)    fEpoch <- mkReg(False);
  Reg#(Bool)    eEpoch <- mkReg(False);
  FIFOF#(TypeNextPCE) nextPC <- mkBypassFIFOF;

where

typedef struct {
  Addr npc;
  Bool nepoch;
} TypeNextPCE deriving (Bits, Eq);

2-Stage-DH pipeline: doFetch rule, first attempt

rule doFetch (itr.notFull);
  let inst  = iMem(pc);
  let dInst = decode(inst);
  let rVal1 = rf.rd1(fromMaybe(?, dInst.src1));
  let rVal2 = rf.rd2(fromMaybe(?, dInst.src2));
  itr.enq(TypeDecode2Execute{pc:pc, epoch:fEpoch, dInst:dInst,
                             rVal1:rVal1, rVal2:rVal2});
  if (nextPC.notEmpty) begin
    let npc    = nextPC.first.npc;
    let nepoch = nextPC.first.nepoch;
    pc     <= npc;
    fEpoch <= nepoch;
    nextPC.deq;
  end
  else pc <= pc + 4;
endrule

Not quite correct!

2-Stage-DH pipeline: doExecute rule, first attempt

rule doExecute (itr.notEmpty);
  let itrpc = itr.first.pc;
  let dInst = itr.first.dInst;
  let rVal1 = itr.first.rVal1;
  let rVal2 = itr.first.rVal2;
  if (itr.first.epoch == eEpoch) begin
    let eInst   = execute(dInst, rVal1, rVal2, itrpc);
    let memData <- dMemAction(eInst, dMem);
    regUpdate(eInst, memData, rf);
    if (eInst.brTaken) begin
      let nepoch = next(eEpoch);
      eEpoch <= nepoch;
      nextPC.enq(TypeNextPCE{npc:eInst.addr, nepoch:nepoch});
    end
  end
  itr.deq;
endrule
endmodule

Not quite correct! Fetch is potentially reading stale values from rf.

Data Hazards

I1: Add(R1,R2,R3)
I2: Add(R4,R1,R2)

I2 reads R1, which I1 writes, so I2 must be stalled until I1 updates the register file.

Without a stall (incorrect):
  time      t0   t1   t2   t3   t4   ...
  FD stage  FD1  FD2  FD3  FD4  FD5
  EX stage       EX1  EX2  EX3  EX4  EX5

With a one-cycle stall of I2 in fetch/decode:
  time      t0   t1   t2   t3   t4   t5   ...
  FD stage  FD1  FD2  FD2  FD3  FD4  FD5
  EX stage       EX1       EX2  EX3  EX4  EX5

2-Stage-DH pipeline: stall logic
[Block diagram: the 2-Stage-DH pipeline (PC/+4, Inst Memory, Decode, Register File, itr, Execute, Data Memory, nextPC, fEpoch, eEpoch) extended with a scoreboard that the fetch stage consults before reading the register file.]

Data Hazard
Given two source registers and a destination register, determine whether there is a potential data hazard. The src1, src2, and rDst fields of the decoded instruction are changed from Rindx to Maybe#(Rindx).

function Bool dataHazard(Maybe#(Rindx) src1, Maybe#(Rindx) src2,
                         Maybe#(Rindx) dst);
  return (isValid(dst) &&
          ( (isValid(src1) && fromMaybe(?, dst) == fromMaybe(?, src1))
         || (isValid(src2) && fromMaybe(?, dst) == fromMaybe(?, src2)) ));
endfunction

Scoreboard: keeping track of instructions in execution
A scoreboard is a data structure that keeps track of the destination registers of the instructions beyond the fetch stage. It has three methods:
- insert: inserts the destination (if any) of an instruction into the scoreboard when the instruction is decoded
- search(src1, src2): searches the scoreboard for data hazards against the given source registers
- remove: deletes the oldest entry when an instruction commits
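The Scoreboard interface itself is not shown on these slides; a minimal declaration consistent with the module that follows might look like this (the method names come from the slides, the rest is an assumption):

interface Scoreboard#(numeric type size);
  // record the destination register (if any) of a newly decoded instruction
  method Action insert(Maybe#(Rindx) dst);
  // True if either source register matches a destination still in flight
  method Bool search(Maybe#(Rindx) src1, Maybe#(Rindx) src2);
  // drop the oldest entry when an instruction commits
  method Action remove;
endinterface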

Scoreboard

module mkScoreboard(Scoreboard#(size));
  Vector#(size, EHR#(2, Maybe#(Rindx))) sb <- replicateM(mkEHR(Invalid));
  Reg#(Bit#(TAdd#(TLog#(size),1)))   iidx <- mkReg(0);
  Reg#(Bit#(TAdd#(TLog#(size),1)))   ridx <- mkReg(0);
  EHR#(2, Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR(0);
  Integer vsize = valueOf(size);
  Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize);

  method Action insert(Maybe#(Rindx) r) if (cnt.r1 != sz);
    sb[iidx].w1(r);
    iidx <= (iidx == sz-1) ? 0 : iidx + 1;
    cnt.w1(cnt.r1 + 1);
  endmethod

Scoreboard (cont.)

  method Action remove if (cnt.r0 != 0);
    sb[ridx].w0(Invalid);
    ridx <= (ridx == sz-1) ? 0 : ridx + 1;
    cnt.w0(cnt.r0 - 1);
  endmethod

  method Bool search(Maybe#(Rindx) s1, Maybe#(Rindx) s2);
    Bool j = False;
    for (Integer i = 0; i < vsize; i = i+1)
      j = (j || dataHazard(s1, s2, sb[i].r1));
    return j;
  endmethod
endmodule

The EHR ports enforce the intra-cycle ordering remove < search < insert.

2-Stage-DH pipeline

module mkProc(Proc);
  Reg#(Addr)    pc     <- mkRegU;
  RFile         rf     <- mkBypassRFile;
  IMemory       iMem   <- mkIMemory;
  DMemory       dMem   <- mkDMemory;
  PipeReg#(TypeDecode2Execute) itr <- mkPipeReg;
  Scoreboard#(1) sb    <- mkScoreboard;   // contains only one instruction
  Reg#(Bool)    fEpoch <- mkReg(False);
  Reg#(Bool)    eEpoch <- mkReg(False);
  FIFOF#(TypeNextPCE) nextPC <- mkBypassFIFOF;

2-Stage-DH pipeline: doFetch rule

rule doFetch (itr.notFull);
  let inst  = iMem(pc);
  let dInst = decode(inst);
  let stall = sb.search(dInst.src1, dInst.src2);
  if (!stall) begin
    let rVal1 = rf.rd1(fromMaybe(?, dInst.src1));
    let rVal2 = rf.rd2(fromMaybe(?, dInst.src2));
    itr.enq(TypeDecode2Execute{pc:pc, epoch:fEpoch, dInst:dInst,
                               rVal1:rVal1, rVal2:rVal2});
    sb.insert(dInst.rDst);
    if (nextPC.notEmpty) begin
      let npc    = nextPC.first.npc;
      let nepoch = nextPC.first.nepoch;
      pc     <= npc;
      fEpoch <= nepoch;
      nextPC.deq;
    end
    else pc <= pc + 4;
  end
endrule

2-Stage-DH pipeline: doExecute rule

rule doExecute (itr.notEmpty);
  let itrpc = itr.first.pc;
  let dInst = itr.first.dInst;
  let rVal1 = itr.first.rVal1;
  let rVal2 = itr.first.rVal2;
  if (itr.first.epoch == eEpoch) begin
    let eInst   = execute(dInst, rVal1, rVal2, itrpc);
    let memData <- dMemAction(eInst, dMem);
    regUpdate(eInst, memData, rf);
    if (eInst.brTaken) begin
      let nepoch = next(eEpoch);
      eEpoch <= nepoch;
      nextPC.enq(TypeNextPCE{npc:eInst.addr, nepoch:nepoch});
    end
  end
  itr.deq;
  sb.remove;
endrule
endmodule

Concurrency analysis
doExecute < doFetch implies that the method calls of any module whose methods are called by both rules must be ordered the same way:
- {itr.first, itr.deq} < {itr.enq}  =>  pipeline FIFO
- sb.remove < {sb.search, sb.insert}  =>  scoreboard built from EHRs, as above
- {nextPC.enq} < {nextPC.first, nextPC.deq}  =>  bypass FIFO
- {rf.wr} < {rf.rd1, rf.rd2}  =>  bypass register file
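The bypass register file (mkBypassRFile) required by the last constraint is not shown in the slides. A minimal sketch under the same EHR convention as the scoreboard (writes on port 0, reads on port 1, so rf.wr < rf.rd1/rf.rd2 and a value written by doExecute is readable by doFetch in the same cycle) might look like this; the RFile interface shape, the 32-register size, and the hard-wired zero register are assumptions based on the surrounding code.

interface RFile;
  method Action wr(Rindx r, Data d);
  method Data   rd1(Rindx r);
  method Data   rd2(Rindx r);
endinterface

// Assumes Vector::* and the course's EHR package are imported,
// and that Rindx is Bit#(5) and Data is Bit#(32).
module mkBypassRFile(RFile);
  Vector#(32, EHR#(2, Data)) rfile <- replicateM(mkEHR(0));

  // register 0 always reads as 0; reads use port 1 so they see this cycle's write
  function Data read(Rindx r) = (r == 0) ? 0 : rfile[r].r1;

  method Action wr(Rindx r, Data d);
    if (r != 0) rfile[r].w0(d);
  endmethod
  method Data rd1(Rindx r) = read(r);
  method Data rd2(Rindx r) = read(r);
endmodule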

Multi-stage pipeline with Data Hazards

Three Stage Pipeline Bypass (1)
[Block diagram: the 2-Stage-DH pipeline (PC/+4, Inst Memory, Decode, Register File, itr, Execute, Data Memory, scoreboard, nextPC, fEpoch, eEpoch) extended with an additional pipeline register cr after Execute, forming a third stage.]

What Sort of Logic?
What information is needed? Does anything need to be done to the pipeline?

Three Stage Pipeline Bypass (2)
[Block diagram: the same three-stage pipeline as before (PC/+4, Inst Memory, Decode, Register File, itr, Execute, Data Memory, cr, scoreboard, nextPC, fEpoch, eEpoch), now with the bypass path under discussion.]

What Sort of Logic?
What information is needed? Does anything need to be done to the pipeline?

Bypass Issues
- Need to ensure that data is never "lost": conceptually, data needs to live until everyone who needs it has it
- Naming is important: there can be different versions of a value throughout the pipeline
- Bypassing once is logically straightforward, but not necessarily easy to implement: what if you make a change to the pipeline structure?
- One elegant bypassing strategy is to rename registers: you only need to look for one tag, which eliminates the complexity of bypassing for a specific pipeline

Computer Architecture: A Constructive Approach Branch Prediction

Control Flow Penalty
[Pipeline diagram: Fetch (PC, I-cache, Fetch Buffer), Decode (Issue Buffer), Execute (Func. Units, Result Buffer), Commit (Arch. State); the next fetch is started long before the branch is executed.]
Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution. How much work is lost if the pipeline doesn't follow the correct instruction flow? Roughly loop length x pipeline width.

Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:

             SPECint92   SPECfp92
  ALU           39 %       13 %
  FPU Add         -        20 %
  FPU Mult        -        13 %
  load          26 %       23 %
  store          9 %        9 %
  branch        16 %        8 %
  other         10 %       12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdijdp2, su2cor

What is the average run-length between branches? With branches at 16% and 8% of the mix, that is roughly one branch every 6 instructions for SPECint92 and every 12 for SPECfp92.

MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

  Instruction   Taken known?         Target known?
  J             After Inst. Decode   After Inst. Decode
  JR            After Inst. Decode   After Reg. Fetch
  BEQZ/BNEZ     After Exec           After Inst. Decode

Currently our simple pipelined architecture does very simple branch prediction. What is it? The branch is predicted not taken: pc, pc+4, pc+8, ... Can we do better?

Branch Prediction Bits
Assume 2 BP bits per instruction and use a saturating counter: move up on taken, down on not taken.

  11  Strongly taken
  10  Weakly taken
  01  Weakly not taken
  00  Strongly not taken
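As a concrete illustration (not code from the lecture), the 2-bit saturating-counter update can be written as a small pure function in BSV; the function name is an assumption, and the encoding simply follows the table above:

function Bit#(2) updateBpBits(Bit#(2) cnt, Bool taken);
  if (taken)
    return (cnt == 2'b11) ? cnt : cnt + 1;   // saturate at Strongly taken
  else
    return (cnt == 2'b00) ? cnt : cnt - 1;   // saturate at Strongly not taken
endfunction

A branch is then predicted taken whenever the counter's high bit is 1 (states 11 and 10).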

Branch History Table (BHT)
[Figure: the fetch PC, with its two low-order 00 bits dropped, provides a k-bit index into a 2^k-entry BHT with 2 bits per entry; the output is a Taken/not-Taken prediction, combined with the opcode/offset from the I-cache to form the target PC (PC + offset).]
A 4K-entry BHT, 2 bits/entry, gives ~80-90% correct predictions.
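A direction predictor organized along these lines might be sketched as follows; the module and interface names, the 256-entry table size, and the assumption that Addr is Bit#(32) are illustrative choices, not the lecture's code. The index simply drops the two byte-offset bits of the PC, as in the figure.

interface Bht;
  method Bool predTaken(Addr pc);
  method Action update(Addr pc, Bool taken);
endinterface

// Assumes Vector::* is imported and Addr is Bit#(32).
module mkBht(Bht);
  // 256 two-bit counters, initialized to Weakly not taken (01)
  Vector#(256, Reg#(Bit#(2))) bht <- replicateM(mkReg(2'b01));

  function Bit#(8) index(Addr pc) = truncate(pc >> 2);

  method Bool predTaken(Addr pc);
    Bit#(2) c = bht[index(pc)];
    return c[1] == 1;                       // taken in states 11 and 10
  endmethod

  method Action update(Addr pc, Bool taken);
    Bit#(2) c = bht[index(pc)];
    bht[index(pc)] <= taken ? ((c == 2'b11) ? c : c + 1)
                            : ((c == 2'b00) ? c : c - 1);
  endmethod
endmodule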

Where does the BHT fit in the processor pipeline?
The BHT can only be used after instruction decode. What should we do at the fetch stage? We also need a mechanism to update the BHT: where does the update information come from?

Overview of branch prediction
[Figure: the prediction pipeline. At Fetch, the next PC is needed immediately, so a Next Addr Pred sits in a tight loop with the PC. After Decode, the instruction type and PC-relative targets are available (BP, JMP, Ret), forming a looser loop. After Reg Read, simple conditions and register targets are available; after Execute, complex conditions are available.]
Best predictors reflect program behavior.

Next Address Predictor (NAP): first attempt
[Figure: the PC indexes both the iMem and a Branch Target Buffer with 2^k entries; each entry holds BP bits (BPb) and a predicted target.]
BP bits are stored with the predicted target address.
IF stage: nPC = if (BP = taken) then target else pc+4.
Later: check the prediction; if wrong, kill the instruction and update the BTB and BPb, else just update BPb.
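A next-address predictor in the spirit of this figure is sketched below; the entry count, interface, and method names are assumptions, the single taken bit per entry stands in for the slide's BP bits, and, like the first attempt in the figure, the table is untagged, which is exactly why the address collisions discussed on the next slide arise.

interface NextAddrPred;
  method Addr predPc(Addr pc);
  method Action update(Addr pc, Addr target, Bool taken);
endinterface

// Assumes Vector::* is imported and Addr is Bit#(32).
module mkNap(NextAddrPred);
  // 128-entry direct-mapped table: predicted target plus a taken bit, no tags
  Vector#(128, Reg#(Addr)) targetArr <- replicateM(mkRegU);
  Vector#(128, Reg#(Bool)) takenArr  <- replicateM(mkReg(False));

  function Bit#(7) index(Addr pc) = truncate(pc >> 2);

  // IF stage: nPC = if (BP = taken) then target else pc+4
  method Addr predPc(Addr pc);
    let i = index(pc);
    return takenArr[i] ? targetArr[i] : pc + 4;
  endmethod

  // later: on a taken branch/jump record the target, otherwise clear the taken bit
  method Action update(Addr pc, Addr target, Bool taken);
    let i = index(pc);
    takenArr[i] <= taken;
    if (taken) targetArr[i] <= target;
  endmethod
endmodule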

Address Collisions
[Figure: instruction memory containing an Add at address 1028 and a Jump 100 elsewhere whose entry aliases with 1028 in a 128-entry NAP.]
What will be fetched after the instruction at 1028? The NAP entry, filled in by the aliasing Jump, predicts 236, but the correct target is 1032, so we must kill PC=236 and fetch PC=1032.
Is this a common occurrence? Can we avoid these bubbles?

Use the NAP for Control Instructions only
The NAP contains useful information for branch and jump instructions only, so do not update it for other instructions. For all other instructions the next PC is (PC)+4! How can we achieve this effect without decoding the instruction?