Computer Architecture: A Constructive Approach Multi-Cycle and 2 Stage Pipelined SMIPS Implementations Teacher: Yoav Etsion Taken (with permission) from.

Presentation transcript:

Computer Architecture: A Constructive Approach
Multi-Cycle and 2 Stage Pipelined SMIPS Implementations
Teacher: Yoav Etsion
Taken (with permission) from:
Arvind et al.*, Massachusetts Institute of Technology
Derek Chiou, The University of Texas at Austin
* Joel Emer, Li-Shiuan Peh, Murali Vijayaraghavan, Asif Khan, Abhinav Agarwal, Myron King

Single-Cycle SMIPS: Clock Speed
[Figure: single-cycle datapath with PC, Inst Memory, Decode, Register File, Execute, Data Memory, +4]
$t_{Clock} > t_M + t_{DEC} + t_{RF} + t_{ALU} + t_M + t_{WB}$
We can improve the clock speed if we execute each instruction in two clock cycles (this may or may not be faster overall):
$t_{Clock} > \max\{t_M,\ (t_{DEC} + t_{RF} + t_{ALU} + t_M + t_{WB})\}$
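The two clock-period bounds can be compared numerically. The following is a minimal Python sketch; the delay values are hypothetical and chosen only for illustration (they do not come from the slides), and the function names are made up for this example.

```python
# Hypothetical component delays in nanoseconds (illustrative only).
delays = {"t_M": 2.0, "t_DEC": 0.5, "t_RF": 0.5, "t_ALU": 1.0, "t_WB": 0.5}

def single_cycle_period(d):
    # Instruction fetch, decode, register read, ALU, data memory and
    # write-back all have to fit into one clock cycle.
    return d["t_M"] + d["t_DEC"] + d["t_RF"] + d["t_ALU"] + d["t_M"] + d["t_WB"]

def two_cycle_period(d):
    # The clock must cover the slower of the two stages.
    fetch = d["t_M"]
    execute = d["t_DEC"] + d["t_RF"] + d["t_ALU"] + d["t_M"] + d["t_WB"]
    return max(fetch, execute)

single = single_cycle_period(delays)
two = two_cycle_period(delays)

# Time per instruction: one cycle versus two cycles.
time_single = 1 * single
time_two = 2 * two

print(single, two, time_single, time_two)
```

With these particular numbers the two-cycle clock is shorter, but each instruction takes two cycles, so the time per instruction is longer. This is one concrete case of "may or may not be faster overall".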

Two-Cycle SMIPS
[Figure: datapath with PC, Inst Memory, Decode, Register File, Execute, Data Memory, +4, and the new registers ir and stage]
Introduce register “ir” to hold a fetched instruction and register “stage” to remember which stage (fetch/execute) we are in

Additional Types
    typedef struct {
      Addr pc;
      Bit#(32) inst;
    } TypeFetch2Decode deriving (Bits, Eq);

    typedef enum {Fetch, Execute} TypeStage deriving (Bits, Eq);

Two-Cycle SMIPS
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      Reg#(TypeFetch2Decode) ir <- mkRegU;
      Reg#(TypeStage) stage <- mkReg(Fetch);

      // Fetch: read the instruction at pc and hand it to the Execute stage
      rule doFetch (stage == Fetch);
        let inst = iMem.req(pc);
        ir <= TypeFetch2Decode{pc: pc, inst: inst};
        stage <= Execute;
      endrule

Two-Cycle SMIPS
      rule doExecute (stage == Execute);
        let irpc = ir.pc;
        let inst = ir.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
        stage <= Fetch;
      endrule
    endmodule
Slide annotation: no change from single-cycle

Princeton versus Harvard Architecture
Harvard architecture uses different memories for instructions and data; this is needed for a single-cycle implementation
Princeton architecture uses the same memory for instructions and data and thus requires at least two cycles to execute Load/Store instructions
The two-cycle implementations of Princeton and Harvard architectures are almost the same

SMIPS Princeton Architecture
[Figure: datapath with PC, a single Memory, Decode, Register File, Execute, +4, and registers ir and stage]
Since both the Fetch and Execute stages want to use the memory, there is a structural hazard in accessing memory

Two-Cycle SMIPS Princeton
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      DMemory mem <- mkDMemory;
      Reg#(TypeFetch2Decode) ir <- mkRegU;
      Reg#(TypeStage) stage <- mkReg(Fetch);

      // Fetch: instructions come from the same memory as data
      rule doFetch (stage == Fetch);
        let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
        ir <= TypeFetch2Decode{pc: pc, inst: inst};
        stage <= Execute;
      endrule

Two-Cycle SMIPS Princeton
      rule doExecute (stage == Execute);
        let irpc = ir.pc;
        let inst = ir.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, mem);
        regUpdate(eInst, memData, rf);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
        stage <= Fetch;
      endrule
    endmodule

Two-Cycle SMIPS: Analysis
[Figure: two-cycle datapath (PC, Inst Memory, Decode, Register File, Execute, Data Memory, +4, registers fr and stage) with the Fetch and Execute phases marked]
In any given clock cycle, a lot of the hardware is unused!
Pipeline the execution of instructions to increase the throughput

Instruction pipelining
Much more complicated than arithmetic pipelines, e.g., IFFT
The entities in an instruction pipeline are not independent of each other
This causes pipeline stalls or requires other fancy tricks to avoid stalls
[Figure: arithmetic pipeline with input x, inQ, stages f0, f1, f2, pipeline registers sReg1 and sReg2 (Valid/Invalid), and outQ]

Hazards in instruction pipelining
Structural hazard: Two instructions in the pipeline may require the same resource at the same time, e.g., contention for memory
Control hazard: An instruction in the pipeline may determine the next instruction to be executed, e.g., branches
Data hazard: An instruction in the pipeline may affect the state of the machine (pc, rf, dMem); the next instruction must be fully cognizant of this change
Notice that none of these hazards is present in the IFFT pipeline.

The power of computers comes from the fact that the instructions in a program are not independent of each other ⇒ we must deal with hazards

Two-stage Pipelined SMIPS (Harvard)
[Figure: datapath with PC, Inst Memory, Decode, Register File, Execute, Data Memory, +4, and pipeline register ir]
Let us assume we keep fetching instructions from pc, pc+4, pc+8, … and correct it when a control hazard is detected

ir: The instruction register
You may recall from our earlier discussion of pipelining (e.g., IFFT) that the intermediate or pipeline registers may not contain any meaningful data
We can associate a (Valid/Invalid) bit with ir
Equivalently, we can think of a pipeline register as a one-element FIFO

Two-stage pipeline SMIPS (Harvard) – first attempt
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();

      rule doFetch (ir.notFull);
        let inst = iMem.req(pc);
        ir.enq(TypeFetch2Decode{pc: pc, inst: inst});  // enq also carries an implicit guard
        pc <= pc+4;                                    // simple branch prediction
      endrule

Two-stage pipeline SMIPS (Harvard) – first attempt
      rule doExecute (ir.notEmpty);
        let irpc = ir.first.pc;
        let inst = ir.first.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if (eInst.brTaken) pc <= eInst.addr;  // correct the branch prediction (fix the control hazard)
        ir.deq;                               // first and deq also carry an implicit guard
      endrule
Not quite correct!

Killing mispredicted fetched instructions
If a branch is taken, then the instructions in ir are useless and need to be killed; in fact, all instructions fetched after the branch instruction and before the pc is corrected need to be thrown away
Different designs, or even different timing, can result in a different number of mispredicted instructions; all such instructions have to be killed before the correct target instruction can execute

Killing fetched instructions
In the simple design with combinational memory we have discussed so far, all the mispredicted instructions are present in the ir fifo. So doExecute can atomically:
Clear the ir fifo (provided the fifo has a method to do so)
Set the pc to the correct target
    rule doExecute (ir.notEmpty);
      ...
      if (eInst.brTaken) begin
        pc <= eInst.addr;
        ir.clear;
      end
      else ir.deq;
    endrule

Scheduling issues
In case both rules want to update the pc or ir, only one of them should execute. Which one?
The doFetch rule would not fire once the ir fifo is full. However, it would be preferable to give the doExecute rule priority over the doFetch rule in case both can execute. Why?
For proper pipelining, both rules must fire together whenever possible. Can they?
    rule doFetch (ir.notFull);
      ...
      ir.enq(...);
      pc <= ...;
    endrule

    rule doExecute (ir.notEmpty);
      ...
      ir.deq;
      if (eInst.brTaken) begin
        pc <= ...;
        ir.clear;
      end
    endrule

Bluespec concurrency model
In Bluespec, two rules A and B can execute together only if the concurrent execution results in a state that can be obtained by executing either A before B or B before A
Example:
    rule A (pA(x)); x <= f(x); endrule
    rule B (pB(y)); y <= g(x); endrule

    // Both rules combined into one
    rule BA;
      if (pA(x)) x <= f(x);  // action 1
      if (pB(y)) y <= g(x);  // action 2
    endrule
Behaves as if rule B happened before rule A (all the register reads happen before any register write, i.e., action 2 before action 1)
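The claim that the combined rule behaves like "B before A" can be checked with a small model. This is a Python sketch, not BSV: registers are plain dictionary entries, the functions f and g are arbitrary stand-ins, and both predicates pA and pB are assumed to be true.

```python
def f(v):
    return v + 1      # hypothetical stand-in for f

def g(v):
    return v * 10     # hypothetical stand-in for g

def rule_a(state):
    # x <= f(x)
    return {"x": f(state["x"]), "y": state["y"]}

def rule_b(state):
    # y <= g(x)
    return {"x": state["x"], "y": g(state["x"])}

def rule_ba(state):
    # Concurrent execution: every read sees the value from the start of the
    # cycle, and all writes take effect together at the end of the cycle.
    old_x = state["x"]
    return {"x": f(old_x), "y": g(old_x)}

s0 = {"x": 1, "y": 0}
a_then_b = rule_b(rule_a(s0))   # B reads the x already updated by A
b_then_a = rule_a(rule_b(s0))   # B reads the old x
concurrent = rule_ba(s0)

print(a_then_b, b_then_a, concurrent)
```

In this model the concurrent result matches the B-then-A order and differs from the A-then-B order, because rule B reads x while rule A writes it.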

Concurrency analysis: two-stage SMIPS pipeline – the first attempt
In the current example, doFetch reads and updates pc and enqueues into ir; doExecute dequeues ir and sometimes also updates pc and clears ir
Suppose we want doExecute < doFetch; then consider two cases:
pc is corrected by doExecute ⇒ conflict ⇒ fire only the doExecute rule
pc is not corrected by doExecute ⇒ no conflict ⇒ fire both rules
The Bluespec compiler would not allow these rules to fire in parallel because, in general, it cannot do such analysis in the presence of conditionals

Rewriting the 2-stage pipeline SMIPS (Harvard): single rule; not correct
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();

      rule doProc;
        // action 1 (Fetch)
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
Let brTaken be a Maybe type variable to carry information from Execute to Fetch about whether the branch was taken or not

Rewriting 2-stage pipeline SMIPS (Harvard): single rule; not correct
        // action 2 (Execute)
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);
          let memData <- dMemAction(eInst, dMem);
          regUpdate(eInst, memData, rf);
          if (eInst.brTaken) brTaken <= Valid (eInst.addr);
          ir.deq;
        end
      endrule
    endmodule
Problems: brTaken should be initialized to Invalid; action 1 should read the value being set by action 2, but that is not what the code does!
Fix: reorder the actions

Rewriting the 2-stage pipeline SMIPS (Harvard): single rule
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;

      rule doProc;
        Maybe#(Addr) brTaken = Invalid;
        // action 2 (Execute), now placed first
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);
          let memData <- dMemAction(eInst, dMem);
          regUpdate(eInst, memData, rf);
Reorder the actions to get the desired data dependencies

Rewriting 2-stage pipeline SMIPS (Harvard): single rule
          if (eInst.brTaken) brTaken = Valid (eInst.addr);
          ir.deq;
        end
        // action 1 (Fetch)
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule
    endmodule
brTaken was assigned in the Execute action
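The behaviour of the single-rule design can be traced with a cycle-level model. The sketch below is Python rather than BSV and makes several assumptions that are not in the slides: a tiny made-up instruction set ("nop", "jmp", "halt"), a program stored as a dictionary from address to instruction, and a pipeline register that accepts a new element in the same cycle in which it is dequeued.

```python
def run(program, max_cycles=100):
    pc = 0
    ir = None        # one-element pipeline register; None means Invalid
    executed = []    # addresses of the instructions that really executed
    cycles = 0
    while cycles < max_cycles:
        cycles += 1
        br_taken = None  # plays the role of the Maybe#(Addr) variable brTaken

        # Execute action: the older instruction goes first.
        if ir is not None:
            ir_pc, inst = ir
            executed.append(ir_pc)
            if inst[0] == "halt":
                return executed, cycles
            if inst[0] == "jmp":
                br_taken = inst[1]
            ir = None    # deq

        # Fetch action: sees the brTaken value set by Execute above.
        if br_taken is not None:
            pc = br_taken
            ir = None    # clear
        elif ir is None:  # notFull
            ir = (pc, program[pc])
            pc += 4
    return executed, cycles


# Hypothetical program: the jump at address 4 skips addresses 8 and 12.
program = {
    0: ("nop",),
    4: ("jmp", 16),
    8: ("nop",),
    12: ("nop",),
    16: ("nop",),
    20: ("halt",),
}
executed, cycles = run(program)
print(executed, cycles)
```

In this model the instructions at addresses 8 and 12 never execute, and the taken jump costs one cycle in which nothing is fetched, since the redirect takes the place of a fetch in that cycle.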

Code Reordering in Pipelines
In pipelined code, a stage with an older instruction has priority over any stage with a younger instruction
Consequently, reordering the code to reflect that Execute happens before Fetch is OK

Reminder: Sequential assignments
In BSV, multiple conditional assignments to in-scope variables result in muxes.
Variable assignment within a rule follows sequential semantics
    Bit#(32) x = 0;
    let y = x+1;
    if (p) x = 100;
    let z = x+1;
[Figure: resulting circuit with x, p, +1, y, z]
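The same fragment can be written once with sequential assignments and once with the mux made explicit. This is a Python sketch for illustration only; the names sequential and as_muxes, and the intermediate names x0 and x1, are made up here.

```python
def sequential(p):
    # Mirrors the slide's code line by line.
    x = 0
    y = x + 1
    if p:
        x = 100
    z = x + 1
    return y, z

def as_muxes(p):
    # Each assignment creates a new value; the conditional becomes a mux.
    x0 = 0
    y = x0 + 1
    x1 = 100 if p else x0   # mux selected by p
    z = x1 + 1
    return y, z

for p in (False, True):
    print(p, sequential(p), as_muxes(p))
```

Both versions give the same y and z for either value of p: y always sees the initial x, and z sees the output of the mux.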

FIFO with a “clear” method
For correct functioning, the effect of clear has to come after deq if both methods are executed concurrently
FIFO interface properties: concurrent enq, deq and clear have to be permitted, with the functionality deq < enq < clear
It is easy to extend both the pipeline FIFO and the normal FIFO with clear
To avoid compiler surprises, it is sometimes desirable to check guards (notFull, notEmpty) explicitly (no run-time cost; the compiler will eliminate duplicate checks)
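The ordering deq < enq < clear can be illustrated with a toy one-element FIFO. This is a Python sketch and not the BSV library implementation; the class name PipelineFifoModel and its cycle method are hypothetical, and guard failures are modeled as exceptions rather than as a rule that does not fire.

```python
class PipelineFifoModel:
    """Toy model of a one-element pipeline FIFO with a clear method."""

    def __init__(self):
        self.slot = None  # None means the FIFO is empty

    def not_empty(self):
        return self.slot is not None

    def cycle(self, deq=False, enq=None, clear=False):
        # Apply the methods called in one clock cycle in the order
        # deq < enq < clear. enq=None means "no enq this cycle",
        # so None itself cannot be stored as data in this model.
        if deq:
            if self.slot is None:
                raise RuntimeError("deq guard failed: FIFO is empty")
            self.slot = None
        if enq is not None:
            if self.slot is not None:
                raise RuntimeError("enq guard failed: FIFO is full")
            self.slot = enq
        if clear:
            self.slot = None


fifo = PipelineFifoModel()
fifo.cycle(enq="i0")                        # FIFO now holds i0
fifo.cycle(deq=True, enq="i1")              # full FIFO: deq first, then enq
after_deq_enq = fifo.slot
fifo.cycle(deq=True, enq="i2", clear=True)  # clear comes last and wins
after_clear = fifo.slot
print(after_deq_enq, after_clear)
```

Because clear is applied last, an element enqueued in the same cycle as a clear is discarded, which is the behaviour needed to kill a wrongly fetched instruction.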

Atomicity and single rules
In a single rule it is possible for actions to communicate information to each other in the same cycle, which is not possible when the actions reside in two different rules or methods, unless we consider concurrent scheduling
    // Single rule
    rule exchange;
      x <= f(y);
      y <= g(x);
    endrule

    // Two rules: not equivalent (≠) to exchange
    rule exX; x <= f(y); endrule
    rule exY; y <= g(x); endrule

    // Two rules with a wire: equivalent (=) to exchange, but this
    // works only when both the rules are scheduled together!
    rule exXwire; x <= f(y); xwire.set(x); endrule
    rule exY; y <= g(fromMaybe(?, xwire)); endrule
It is better to write a single rule when atomicity needs to be preserved
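The difference between the single exchange rule and the two separate rules can be seen in a small model. This is a Python sketch in which f and g are arbitrary stand-in functions and the two rules are assumed to fire one after the other.

```python
def f(v):
    return v + 1   # hypothetical stand-in for f

def g(v):
    return v * 2   # hypothetical stand-in for g

def exchange(x, y):
    # Single rule: both right-hand sides read the old x and the old y.
    return f(y), g(x)

def ex_x_then_ex_y(x, y):
    x = f(y)       # rule exX fires first
    y = g(x)       # rule exY then reads the new x
    return x, y

def ex_y_then_ex_x(x, y):
    y = g(x)       # rule exY fires first
    x = f(y)       # rule exX then reads the new y
    return x, y

atomic = exchange(3, 5)
x_first = ex_x_then_ex_y(3, 5)
y_first = ex_y_then_ex_x(3, 5)
print(atomic, x_first, y_first)
```

Neither sequential order reproduces the result of the single rule in this example, which is why the old value has to be carried across explicitly when the actions are split into two rules.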

2-stage pipeline SMIPS (Harvard) – two rules
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      IMemory iMem <- mkIMemory;
      DMemory dMem <- mkDMemory;
      FIFOF#(TypeFetch2Decode) ir <- mkPipelineFIFO();
      Wire#(Maybe#(Addr)) brTakenWire <- mkDWire(Invalid);

      // action 1 (Fetch)
      rule doFetch;
        if (isValid(brTakenWire)) begin
          pc <= fromMaybe(?, brTakenWire);
          ir.clear;
        end
        else if (ir.notFull) begin
          let inst = iMem.req(pc);
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule
For illustrative purposes only; this style is not recommended

Two-stage pipeline SMIPS (Harvard) – two rules
      // action 2 (Execute)
      rule doExec (ir.notEmpty);
        let irpc = ir.first.pc;
        let inst = ir.first.inst;
        let eInst = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if (eInst.brTaken) brTakenWire <= Valid (eInst.addr);
        ir.deq;
      endrule
    endmodule
Correctness relies on the compiler to schedule the two rules together (doExec < doFetch)

Two-stage Pipelined SMIPS Princeton Architecture
[Figure: datapath with PC, a single Memory, Decode, Register File, Execute, +4, and pipeline register ir]
Just like the Harvard design except for an additional structural hazard when a memory-type instruction is in the execute phase

Pipelined SMIPS (Princeton) – single rule, no wires
    module mkProc(Proc);
      Reg#(Addr) pc <- mkRegU;
      RFile rf <- mkRFile;
      DMemory mem <- mkDMemory;
      PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;

      rule doProc;
        Maybe#(Addr) brTaken = Invalid;
        Bool memAcc = False;
        // action 1
        if (ir.notEmpty) begin
          let irpc = ir.first.pc;
          let inst = ir.first.inst;
          let eInst = decodeExecute(irpc, inst, rf);

Pipelined SMIPS (Princeton) – single rule, no wires (cont)
          MemResp memData = ?;
          if (memType(eInst.iType)) begin
            memData <- mem.req(MemReq{
              op: eInst.iType==Ld ? Ld : St,
              addr: eInst.addr,
              data: eInst.data});
            memAcc = True;  // the memory port is taken in this cycle
          end
          regUpdate(eInst, memData, rf);
          if (eInst.brTaken) brTaken = Valid (eInst.addr);
          ir.deq;
        end

Pipelined SMIPS (Princeton) – single rule, no wires (cont)
        // action 2
        if (isValid(brTaken)) begin
          pc <= fromMaybe(?, brTaken);
          ir.clear;
        end
        else if (ir.notFull && !memAcc) begin
          let inst <- mem.req(MemReq{op: Ld, addr: pc, data: ?});
          ir.enq(TypeFetch2Decode{pc: pc, inst: inst});
          pc <= pc+4;
        end
        else pc <= pc;
      endrule
    endmodule

Compiler issues
For this code to work, the BSV compiler needs to figure out that the mem.req port is not being used by two different actions concurrently!
Indeed, the compiler is able to figure out that memAcc makes the two uses of mem.req disjoint
Removing the synthesis boundary from mem automatically duplicates the port and makes the conflict disappear (not quite Princeton)
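How memAcc keeps the two uses of the memory port apart can be traced with a cycle-level model. The sketch below is Python, not BSV, and rests on assumptions that are not in the slides: a made-up instruction set ("nop", "ld", "st", "jmp", "halt"), a program stored as a dictionary from address to instruction, and a counter port_uses that stands in for the single mem.req port.

```python
def run_princeton(program, max_cycles=100):
    pc = 0
    ir = None            # one-element pipeline register; None means Invalid
    executed = []
    fetch_stalls = 0     # cycles in which fetch lost the port to a ld/st
    for cycle in range(1, max_cycles + 1):
        br_taken = None
        mem_acc = False
        port_uses = 0    # uses of the single memory port in this cycle

        # Execute action.
        if ir is not None:
            ir_pc, inst = ir
            executed.append(ir_pc)
            if inst[0] == "halt":
                return executed, cycle, fetch_stalls
            if inst[0] in ("ld", "st"):
                mem_acc = True
                port_uses += 1
            if inst[0] == "jmp":
                br_taken = inst[1]
            ir = None    # deq

        # Fetch action: allowed to use the port only if Execute did not.
        if br_taken is not None:
            pc = br_taken
            ir = None    # clear
        elif ir is None and not mem_acc:
            port_uses += 1
            ir = (pc, program[pc])
            pc += 4
        else:
            fetch_stalls += 1

        # The two uses of the port are disjoint in every cycle.
        assert port_uses <= 1
    return executed, max_cycles, fetch_stalls


# Hypothetical program with one load and one store.
program = {0: ("ld",), 4: ("nop",), 8: ("st",), 12: ("halt",)}
executed, cycles, fetch_stalls = run_princeton(program)
print(executed, cycles, fetch_stalls)
```

In this model every load or store costs one fetch cycle, which is the structural hazard of the Princeton design, and the port is never used twice in the same cycle.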

Killing fetched instructions
Our simple solution is not enough if the design permits outstanding instruction requests in the fetch stage
A solution in terms of “epochs”