Computer Architecture: A Constructive Approach
Multi-Cycle and 2 Stage Pipelined SMIPS Implementations

Teacher: Yoav Etsion
Taken (with permission) from Arvind et al.*, Massachusetts Institute of Technology, and Derek Chiou, The University of Texas at Austin
* Joel Emer, Li-Shiuan Peh, Murali Vijayaraghavan, Asif Khan, Abhinav Agarwal, Myron King

Single-Cycle SMIPS: Clock Speed

[Figure: single-cycle datapath with PC, +4, Inst Memory, Decode, Register File, Execute, Data Memory]

$t_{Clock} > t_M + t_{DEC} + t_{RF} + t_{ALU} + t_M + t_{WB}$

We can improve the clock speed if we execute each instruction in two clock cycles (this may or may not be faster overall):

$t_{Clock} > \max\{t_M,\ (t_{DEC} + t_{RF} + t_{ALU} + t_M + t_{WB})\}$

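To make the trade-off concrete, here is a small Python sketch that evaluates both bounds. The stage delays are invented for illustration (they do not come from the slides), and register setup overheads are ignored.

```python
# Hypothetical stage delays in nanoseconds (illustrative values only).
delays = {"M": 2.0, "DEC": 0.5, "RF": 0.5, "ALU": 1.0, "WB": 0.5}

def single_cycle_period(d):
    # Instruction fetch and data access each go through a memory,
    # so t_M appears twice on the critical path.
    return d["M"] + d["DEC"] + d["RF"] + d["ALU"] + d["M"] + d["WB"]

def two_cycle_period(d):
    # Cycle 1 is the fetch (t_M); cycle 2 is everything else.
    return max(d["M"], d["DEC"] + d["RF"] + d["ALU"] + d["M"] + d["WB"])

def time_per_instruction(d):
    # (single-cycle design, two-cycle design without overlap)
    return single_cycle_period(d), 2 * two_cycle_period(d)
```

With these numbers the two-cycle machine has the shorter clock (4.5 vs 6.5) but takes longer per instruction (9.0 vs 6.5). In this simple model that is always the direction of the trade-off, since 2·max(a, b) ≥ a + b; the shorter clock pays off once the two stages overlap.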
Two-Cycle SMIPS

[Figure: the same datapath with two added registers, ir and stage]

Introduce a register “ir” to hold a fetched instruction and a register “stage” to remember which stage (fetch/execute) we are in.

Additional Types

    typedef struct {
      Addr pc;
      Bit#(32) inst;
    } TypeFetch2Decode deriving (Bits, Eq);

    typedef enum {Fetch, Execute} TypeStage deriving (Bits, Eq);

Two-Cycle SMIPS

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      IMemory  iMem <- mkIMemory;
      DMemory  dMem <- mkDMemory;
      Reg#(TypeFetch2Decode) ir <- mkRegU;
      Reg#(TypeStage)     stage <- mkReg(Fetch);

      // Fetch: latch the instruction and its pc, then hand over to Execute
      rule doFetch (stage == Fetch);
        let inst = iMem.req(pc);
        ir <= TypeFetch2Decode{pc: pc, inst: inst};
        stage <= Execute;
      endrule

Two-Cycle SMIPS

      rule doExecute (stage == Execute);
        let irpc = ir.pc;
        let inst = ir.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
        stage <= Fetch;
      endrule
    endmodule

(The execute logic is unchanged from the single-cycle design.)

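The alternation between the two rules can be mimicked with a small Python model. This is a behavioral sketch only: the instruction encoding (`nop`, `jmp`, `halt`) and the names `PROGRAM` and `run_two_cycle` are hypothetical stand-ins, not part of the SMIPS code.

```python
# Toy program: pc -> (op, arg). 'jmp' carries its target in arg.
PROGRAM = {0: ("nop", None), 4: ("jmp", 12), 8: ("nop", None), 12: ("halt", None)}

def run_two_cycle(program, max_cycles=100):
    pc, stage, ir = 0, "Fetch", None
    executed, cycles = [], 0
    while cycles < max_cycles:
        cycles += 1
        if stage == "Fetch":
            ir = (pc, program[pc])          # ir <= {pc, inst}
            stage = "Execute"
        else:
            irpc, (op, arg) = ir
            executed.append(irpc)
            if op == "halt":
                break
            pc = arg if op == "jmp" else pc + 4
            stage = "Fetch"
    return executed, cycles
```

Every instruction costs exactly two cycles here, and a taken jump needs no special handling because the next fetch only starts after pc has been updated.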
Princeton versus Harvard Architecture

- Harvard architecture uses different memories for instructions and data; this is needed for a single-cycle implementation
- Princeton architecture uses the same memory for instructions and data, and thus requires at least two cycles to execute Load/Store instructions
- The two-cycle implementations of Princeton and Harvard architectures are almost the same

SMIPS Princeton Architecture

[Figure: datapath with PC, +4, a single Memory, Decode, Register File, Execute, and the registers ir and stage]

Since both the Fetch and Execute stages want to use the memory, there is a structural hazard in accessing memory.

Two-Cycle SMIPS Princeton

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      DMemory   mem <- mkDMemory;
      Reg#(TypeFetch2Decode) ir <- mkRegU;
      Reg#(TypeStage)     stage <- mkReg(Fetch);

      // Fetch is a load from the single shared memory
      rule doFetch (stage == Fetch);
        let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
        ir <= TypeFetch2Decode{pc: pc, inst: inst};
        stage <= Execute;
      endrule

Two-Cycle SMIPS Princeton

      rule doExecute (stage == Execute);
        let irpc = ir.pc;
        let inst = ir.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, mem);
        regUpdate(eInst, memData, rf);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
        stage <= Fetch;
      endrule
    endmodule

Two-Cycle SMIPS: Analysis

[Figure: two-cycle datapath (registers ir and stage) with the hardware used by Fetch and by Execute marked]

In any given clock cycle, a lot of the hardware is unused!

Pipeline the execution of instructions to increase the throughput.

Instruction pipelining

- Much more complicated than arithmetic pipelines, e.g., IFFT
- The entities in an instruction pipeline are not independent of each other
- This causes pipeline stalls or requires other fancy tricks to avoid stalls

[Figure: arithmetic pipeline with inQ, stages f0, f1, f2, stage registers sReg1 and sReg2 (Valid/Invalid), and outQ]

Hazards in instruction pipelining

- Structural hazard: Two instructions in the pipeline may require the same resource at the same time, e.g., contention for memory
- Control hazard: An instruction in the pipeline may determine the next instruction to be executed, e.g., branches
- Data hazard: An instruction in the pipeline may affect the state of the machine (pc, rf, dMem); the next instruction must be fully cognizant of this change

Notice that none of these hazards are present in the IFFT pipeline.

The power of computers comes from the fact that the instructions in a program are not independent of each other, so we must deal with hazards.

Two-stage Pipelined SMIPS (Harvard)

[Figure: datapath with PC, +4, Inst Memory, ir, Decode, Register File, Execute, Data Memory]

Let us assume we keep fetching instructions from pc, pc+4, pc+8, … and correct the pc when a control hazard is detected.

ir: The instruction register

- You may recall from our earlier discussion of pipelining (e.g., IFFT) that the intermediate (pipeline) registers may not contain any meaningful data
- We can associate a (Valid/Invalid) bit with ir
- Equivalently, we can think of a pipeline register as a one-element FIFO

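The equivalence between a register with a valid bit and a one-element FIFO can be spelled out in a few lines of Python. The class and method names below are hypothetical and chosen to echo the FIFO methods used in the slides; the sketch does not model a pipeline FIFO's ability to dequeue and enqueue in the same cycle.

```python
class OneElementFifo:
    """A data register plus a Valid/Invalid bit, seen through a FIFO interface."""

    def __init__(self):
        self.data = None       # payload, meaningful only while valid
        self.valid = False     # the Valid/Invalid bit

    def not_empty(self):
        return self.valid

    def not_full(self):
        return not self.valid

    def enq(self, x):
        assert not self.valid, "enq on a full FIFO"
        self.data, self.valid = x, True

    def first(self):
        assert self.valid, "first on an empty FIFO"
        return self.data

    def deq(self):
        assert self.valid, "deq on an empty FIFO"
        self.valid = False
```

The assertions play the role of the implicit guards: a method may only be called when its condition holds.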
Two-stage pipeline SMIPS (Harvard) – first attempt

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      IMemory  iMem <- mkIMemory;
      DMemory  dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();

      rule doFetch (ir.notFull);
        let inst = iMem.req(pc);
        ir.enq(TypeFetch2Decode{pc: pc, inst: inst});  // enq carries an implicit guard (ir not full)
        pc <= pc+4;                                     // simple branch prediction
      endrule

Two-stage pipeline SMIPS (Harvard) – first attempt

      rule doExecute (ir.notEmpty);
        let irpc = ir.first.pc;       // first/deq carry an implicit guard (ir not empty)
        let inst = ir.first.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if (eInst.brTaken) pc <= eInst.addr;  // correct the branch prediction (fix the control hazard)
        ir.deq;
      endrule

Not quite correct!

Killing mispredicted fetched instructions

- If a branch is taken, then the instructions in ir are useless and need to be killed; in fact, all instructions fetched after the branch instruction and before the pc is corrected need to be thrown away
- Different designs, or even different timing, can result in a different number of mispredicted instructions; all such instructions have to be killed before the correct target instruction can execute

Killing fetched instructions

In the simple design with combinational memory we have discussed so far, all the mispredicted instructions are present in the ir fifo. So doExecute can atomically
- Clear the ir fifo (provided the fifo has a method to do so)
- Set the pc to the correct target

    rule doExecute (ir.notEmpty);
      ...
      if (eInst.brTaken) begin
        pc <= eInst.addr;
        ir.clear;
      end
      else ir.deq;
    endrule

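The effect of the kill can be seen in a small Python model of the two-stage pipeline. This is a sketch under assumptions: both stages act in every cycle, fetch always speculates pc+4, and the toy instruction set (`nop`, `jmp`, `halt`) and the names `PROGRAM` and `run_pipeline` are hypothetical.

```python
PROGRAM = {0: ("nop", None), 4: ("jmp", 12), 8: ("nop", None), 12: ("halt", None)}

def run_pipeline(program, kill, max_cycles=50):
    pc, ir = 0, None                 # ir: None (invalid) or (pc, inst)
    executed = []
    for _ in range(max_cycles):
        redirect = None
        if ir is not None:           # Execute stage
            irpc, (op, arg) = ir
            executed.append(irpc)
            if op == "halt":
                break
            if op == "jmp":
                redirect = arg
        # Fetch stage: speculatively fetch from pc and predict pc+4.
        next_ir = (pc, program.get(pc, ("nop", None)))
        next_pc = pc + 4
        if redirect is not None:
            next_pc = redirect       # correct the pc
            if kill:
                next_ir = None       # ir.clear: drop the wrong-path fetch
        pc, ir = next_pc, next_ir
    return executed
```

Without the kill, the instruction at address 8 (fetched while the jump was executing) still runs: `run_pipeline(PROGRAM, kill=False)` returns `[0, 4, 8, 12]`. With the kill it returns `[0, 4, 12]`.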
Scheduling issues

- If both rules want to update the pc or ir, then only one of them should execute. Which one?
- The doFetch rule would not fire once the ir fifo is full. However, it would be preferable to give the doExecute rule priority over the doFetch rule in case both can execute. Why?
- For proper pipelining, both rules must fire together whenever possible. Can they?

    rule doFetch (ir.notFull);
      ...
      ir.enq(...);
      pc <= ...;
    endrule

    rule doExecute (ir.notEmpty);
      ...
      ir.deq;
      if (eInst.brTaken) begin
        pc <= ...;
        ir.clear;
      end
    endrule

Bluespec concurrency model

In Bluespec, two rules A and B can execute together only if the concurrent execution results in a state that can be obtained by executing either A before B or B before A.

Example:

    rule A (pA(x)); x <= f(x); endrule
    rule B (pB(y)); y <= g(x); endrule

    rule BA;
      if (pA(x)) x <= f(x);   // action 1
      if (pB(y)) y <= g(x);   // action 2
    endrule

Rule BA behaves as if rule B happened before rule A (all the register reads happen before any register write, i.e., action 2 before action 1).

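The claim that the combined rule matches "B before A" can be checked with a small Python model. The functions `f` and `g` are arbitrary placeholders, the guards are assumed true, and the state is a plain dictionary; none of this is BSV.

```python
def f(x):
    return x + 1

def g(x):
    return 2 * x

def rule_A(s):                 # x <= f(x)
    return {**s, "x": f(s["x"])}

def rule_B(s):                 # y <= g(x)
    return {**s, "y": g(s["x"])}

def concurrent(s):
    # One clock cycle: every right-hand side reads the old state,
    # then all register writes take effect together.
    return {"x": f(s["x"]), "y": g(s["x"])}

s0 = {"x": 3, "y": 0}
b_then_a = rule_A(rule_B(s0))  # B sees the old x
a_then_b = rule_B(rule_A(s0))  # B sees the x already updated by A
```

Here `concurrent(s0)` equals `b_then_a` and differs from `a_then_b`, because in the concurrent step rule B reads x before rule A's write becomes visible.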
Concurrency analysis: two-stage SMIPS pipeline, the first attempt

In the current example, doFetch reads and updates pc and enqueues into ir; doExecute dequeues ir and sometimes also updates pc and clears ir.

Suppose we want doExecute < doFetch. Then consider two cases:
- pc is corrected by doExecute: conflict, so fire only the doExecute rule
- pc is not corrected by doExecute: no conflict, so fire both rules

The Bluespec compiler would not allow these rules to fire in parallel because, in general, it cannot do such analysis in the presence of conditionals.

Rewriting the 2-stage pipeline SMIPS (Harvard): single rule; not correct

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      IMemory  iMem <- mkIMemory;
      DMemory  dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();

      rule doProc;
        // action 1 (Fetch)
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);  // fromMaybe takes a default value as its first argument
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;

Let brTaken be a Maybe type variable to carry information from Execute to Fetch about whether the branch was taken or not.

Rewriting 2-stage pipeline SMIPS (Harvard): single rule; not correct

        // action 2 (Execute)
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);
          let memData <- dMemAction(eInst, dMem);
          regUpdate(eInst, memData, rf);
          if (eInst.brTaken) brTaken <= Valid (eInst.addr);
          ir.deq;
        end
      endrule
    endmodule

Problems:
- brTaken should be initialized to Invalid
- action 1 should read the value being set by action 2, but that is not what the code does! Fix: reorder the actions

Rewriting the 2-stage pipeline SMIPS (Harvard): single rule

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      IMemory  iMem <- mkIMemory;
      DMemory  dMem <- mkDMemory;
      PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;

      rule doProc;
        Maybe#(Addr) brTaken = Invalid;
        // action 2 (Execute) now comes first
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);
          let memData <- dMemAction(eInst, dMem);
          regUpdate(eInst, memData, rf);

Reorder the actions to get the desired data dependencies.

Rewriting 2-stage pipeline SMIPS (Harvard): single rule

          if (eInst.brTaken) brTaken = Valid (eInst.addr);
          ir.deq;
        end
        // action 1 (Fetch): brTaken was assigned in the Execute action
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);  // fromMaybe takes a default value as its first argument
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule
    endmodule

Code Reordering in Pipelines

- In pipelined code, a stage with an older instruction has priority over any stage with a younger instruction
- Consequently, reordering the code to reflect that Execute happens before Fetch is OK

Reminder: Sequential assignments

In BSV, multiple conditional assignments to in-scope variables result in muxes. Variable assignment within a rule follows sequential semantics.

    Bit#(32) x = 0;
    let y = x+1;
    if (p) x = 100;
    let z = x+1;

[Figure: the resulting circuit, with a mux controlled by p and +1 adders producing y and z]

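The same fragment can be written in Python with the mux made explicit. The helper `mux` and the renaming of x into `x0` and `x1` are illustrative devices, not BSV constructs.

```python
def mux(sel, a, b):
    # Two-input multiplexer: a when sel is true, otherwise b.
    return a if sel else b

def block(p):
    x0 = 0
    y = x0 + 1              # y sees the x from before the conditional
    x1 = mux(p, 100, x0)    # if (p) x = 100;
    z = x1 + 1              # z sees the x from after the conditional
    return y, z
```

`block(True)` gives `(1, 101)` and `block(False)` gives `(1, 1)`: y never depends on p, while z is fed by the mux.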
FIFO with a “clear” method

- For correct functioning, the effect of clear has to come after deq if both methods are executed concurrently
- FIFO interface properties: concurrent enq, deq and clear have to be permitted, with the functionality deq < enq < clear
- It is easy to extend both the pipeline FIFO and the normal FIFO with clear
- To avoid compiler surprises, it is sometimes desirable to check guards (not-full, not-empty) explicitly (no run-time cost: the compiler will eliminate duplicate checks)

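The ordering deq < enq < clear can be illustrated with a one-element FIFO model in Python, where all method calls of one cycle are applied in that order. The class `PipelineFifo` and its `cycle` method are hypothetical; this is not the BSV library FIFO.

```python
class PipelineFifo:
    """One-element FIFO; a cycle's calls take effect in the order deq < enq < clear."""

    def __init__(self):
        self.slot = None

    def cycle(self, deq=False, enq=None, clear=False):
        if deq:                       # deq first: frees the slot
            assert self.slot is not None, "deq on an empty FIFO"
            self.slot = None
        if enq is not None:           # enq sees the slot freed by deq
            assert self.slot is None, "enq on a full FIFO"
            self.slot = enq
        if clear:                     # clear last: also kills this cycle's enq
            self.slot = None
        return self.slot
```

Because deq comes before enq, a full FIFO can accept a new element in the cycle in which it is dequeued; because clear comes last, it removes whatever was enqueued in the same cycle, which is what killing a wrong-path fetch requires.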
Atomicity and single rules

In a single rule, actions can communicate information to each other in the same cycle. This is not possible when the actions reside in two different rules or methods, unless we consider concurrent scheduling.

    rule exchange;
      x <= f(y); y <= g(x);
    endrule

is not equivalent to (≠)

    rule exX; x <= f(y); endrule
    rule exY; y <= g(x); endrule

but is equivalent to (=)

    rule exXwire; x <= f(y); xwire.set(x); endrule
    rule exY; y <= g(fromMaybe(?, xwire)); endrule

which works only when both the rules are scheduled together!

It is better to write a single rule when atomicity needs to be preserved.

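A Python sketch of the difference, with `f` and `g` as arbitrary placeholder functions: in the single rule both right-hand sides read the old register values, whereas two rules firing one after the other let the second see the first one's update.

```python
def f(v):
    return v + 10

def g(v):
    return v * 2

def exchange(x, y):
    # Single rule: both reads see the old values of x and y.
    return f(y), g(x)

def exX_then_exY(x, y):
    # Two separate rules firing in sequence.
    x = f(y)
    y = g(x)        # reads the x already updated by exX
    return x, y
```

Starting from x = 1, y = 2, `exchange` yields `(12, 2)` while `exX_then_exY` yields `(12, 24)`.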
2-stage pipeline SMIPS (Harvard) – two rules

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      IMemory  iMem <- mkIMemory;
      DMemory  dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();
      Wire#(Maybe#(Addr)) brTakenWire <- mkDWire(Invalid);

      // action 1 (Fetch)
      rule doFetch;
        if (isValid(brTakenWire)) begin
          pc <= fromMaybe(?, brTakenWire);  // fromMaybe takes a default value as its first argument
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule

For illustrative purposes only; this style is not recommended.

Two-stage pipeline SMIPS (Harvard) – two rules

      // action 2 (Execute)
      rule doExec (ir.notEmpty);
        let irpc = ir.first.pc;
        let inst = ir.first.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if (eInst.brTaken) brTakenWire <= Valid (eInst.addr);
        ir.deq;
      endrule
    endmodule

Correctness relies on the compiler scheduling the two rules together (doExec < doFetch).

Two-stage Pipelined SMIPS Princeton Architecture

[Figure: datapath with PC, +4, a single Memory, ir, Decode, Register File, Execute]

Just like the Harvard design, except for an additional structural hazard when a memory-type instruction is in the execute phase.

Pipelined SMIPS (Princeton) – single rule, no wires

    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile      rf <- mkRFile;
      DMemory   mem <- mkDMemory;
      PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;

      rule doProc;
        Maybe#(Addr) brTaken = Invalid;
        Bool memAcc = False;
        // action 1 (Execute)
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);

Pipelined SMIPS (Princeton) – single rule, no wires (cont)

          MemResp memData = ?;
          if (memType(eInst.iType)) begin
            memData <- mem.req(MemReq{
              op:   eInst.iType == Ld ? Ld : St,
              addr: eInst.addr,
              data: eInst.data});
            memAcc = True;   // the memory port is taken this cycle
          end
          regUpdate(eInst, memData, rf);
          if (eInst.brTaken) brTaken = Valid (eInst.addr);
          ir.deq;
        end

Pipelined SMIPS (Princeton) – single rule, no wires (cont)

        // action 2 (Fetch): only when Execute did not use the memory
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);  // fromMaybe takes a default value as its first argument
          ir.clear;
        end
        else if (ir.notFull && !memAcc) begin
          let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule
    endmodule

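The role of memAcc can be checked with a Python model that counts how often the single memory port is used in each cycle. As before this is a sketch: the toy opcodes (`nop`, `ld`, `st`, `jmp`, `halt`) and the names `PROGRAM` and `run_princeton` are hypothetical.

```python
PROGRAM = {0: ("nop", None), 4: ("ld", None), 8: ("nop", None), 12: ("halt", None)}

def run_princeton(program, max_cycles=50):
    pc, ir = 0, None
    executed, port_uses = [], []
    for _ in range(max_cycles):
        uses, redirect, mem_acc, halt = 0, None, False, False
        if ir is not None:                       # Execute
            irpc, (op, arg) = ir
            executed.append(irpc)
            ir = None                            # ir.deq
            halt = op == "halt"
            if op in ("ld", "st"):
                uses += 1                        # data access takes the port
                mem_acc = True
            if op == "jmp":
                redirect = arg
        if halt:
            port_uses.append(uses)
            break
        if redirect is not None:
            pc = redirect                        # ir is already empty here
        elif not mem_acc:                        # Fetch only if the port is free
            ir = (pc, program.get(pc, ("nop", None)))
            uses += 1
            pc += 4
        port_uses.append(uses)
    return executed, port_uses
```

For the sample program the port is never used more than once per cycle, and the load costs one bubble: four instructions take six cycles instead of five.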
Compiler issues

- For this code to work, the BSV compiler needs to figure out that the mem.req port is not being used by two different actions concurrently!
- Indeed, the compiler is able to figure out that memAcc makes the two uses of mem.req disjoint
- Removing the synthesis boundary from mem automatically duplicates the port and makes the conflict disappear (not quite Princeton)

Killing fetched instructions

- Our simple solution is not enough if the design permits outstanding instruction requests in the fetch stage
- A solution in terms of “epochs”

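The slide only names the idea, so the following Python sketch shows one common way an epoch scheme is arranged; it is an assumption about the general technique, not the course's implementation. Each fetched instruction is tagged with the current epoch bit, a redirect flips the bit, and Execute silently drops any instruction whose tag does not match. In this model fetch cannot see a redirect issued in the same cycle, so one wrong-path instruction is always in flight after a taken jump.

```python
from collections import deque

PROGRAM = {0: ("nop", None), 4: ("jmp", 12), 8: ("nop", None), 12: ("halt", None)}

def run_with_epochs(program, max_cycles=50):
    pc, epoch = 0, 0
    q = deque()                          # in-flight fetches: (pc, inst, epoch)
    executed = []
    for _ in range(max_cycles):
        redirect = None
        if q:
            ipc, (op, arg), iepoch = q.popleft()
            if iepoch == epoch:          # a stale epoch means wrong path: drop it
                executed.append(ipc)
                if op == "halt":
                    break
                if op == "jmp":
                    redirect = arg
        # Fetch still issues from the old pc with the old epoch tag.
        q.append((pc, program.get(pc, ("nop", None)), epoch))
        if redirect is None:
            pc += 4
        else:
            pc, epoch = redirect, epoch ^ 1   # later fetches carry the new epoch
    return executed
```

Unlike clearing the FIFO, this needs no access to the requests already in flight: the instruction at address 8 is fetched and delivered, but discarded on arrival.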