 The actual result $1 - $3 is computed in clock cycle 3, before it’s needed in cycles 4 and 5  We forward that value to later instructions, to prevent.

1 Forwarding
• The actual result $1 - $3 is computed in clock cycle 3, before it’s needed in cycles 4 and 5.
• We forward that value to later instructions, to prevent data hazards:
  - In clock cycle 4, AND gets the value $1 - $3 from EX/MEM.
  - In cycle 5, OR gets that same result from MEM/WB.
[Pipeline diagram, clock cycles 1-7: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2]

2 Outline of forwarding hardware
• A forwarding unit selects the correct ALU inputs for the EX stage:
  - No hazard: the ALU’s operands come from the register file, like normal.
  - Data hazard: operands come from either the EX/MEM or MEM/WB pipeline registers instead.
• The ALU sources will be selected by two new multiplexers, with control signals named ForwardA and ForwardB.
[Pipeline diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2]

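3 Simplified datapath with forwarding muxes
[Datapath diagram: PC, instruction memory, registers, ALU and data memory separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers. Two three-input muxes (inputs 0, 1, 2), controlled by ForwardA and ForwardB, select the ALU operands.]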

4 Detecting EX/MEM Hazards
• How to detect an impending hazard?
• In the above case, it occurs in cycle 3, when SUB is in EX and OR is in ID.
  - Hazard because: ID/EX.rd == IF/ID.rs
• An EX/MEM hazard occurs between the instruction currently in its EX stage and the previous instruction if:
  1. The previous instruction will write to the register file, and
  2. The destination is one of the ALU source registers in the EX stage.
[Pipeline diagram: sub $2, $1, $3; or $12, $2, $5]

5 EX/MEM data hazard equations
• The first ALU source comes from the pipeline register when necessary:
    if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rs) ForwardA = 2
• The second ALU source is similar:
    if (EX/MEM.RegWrite and EX/MEM.rd == ID/EX.rt) ForwardB = 2
[Pipeline diagram: sub $2, $1, $3; and $12, $2, $5]

6 MEM/WB data hazards
• A MEM/WB hazard may occur between an instruction in the EX stage and the instruction from two cycles ago.
• One new problem is if a register is updated twice in a row:
    add $1, $2, $3
    add $1, $1, $4
    sub $5, $5, $1
• Register $1 is written by both of the previous instructions, but only the most recent result (from the second ADD) should be forwarded.

7 MEM/WB hazard equations
• Here is an equation for detecting and handling MEM/WB hazards for the first ALU source:
    if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rs and
        (EX/MEM.rd ≠ ID/EX.rs or not(EX/MEM.RegWrite)))
      ForwardA = 1
• The second ALU operand is handled similarly:
    if (MEM/WB.RegWrite and MEM/WB.rd == ID/EX.rt and
        (EX/MEM.rd ≠ ID/EX.rt or not(EX/MEM.RegWrite)))
      ForwardB = 1
• Both cases are handled by a forwarding unit, which uses the control signals stored in the pipeline registers to set the values of ForwardA and ForwardB.
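The EX/MEM and MEM/WB equations can be combined into one small selection function. The following is only a sketch in Python, not the hardware itself: the `PipeReg` class and the `forwarding_unit` name are hypothetical, and the "EX/MEM does not also match" clause is expressed by simply testing EX/MEM first.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two pipeline-register fields the forwarding unit looks at (hypothetical model)."""
    reg_write: bool = False  # RegWrite control bit
    rd: int = 0              # destination register number


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB): 0 = register file, 1 = MEM/WB, 2 = EX/MEM."""
    def select(src):
        # EX/MEM hazard: the most recent result takes priority.
        if ex_mem.reg_write and ex_mem.rd == src:
            return 2
        # MEM/WB hazard: only reached when EX/MEM did not match.
        if mem_wb.reg_write and mem_wb.rd == src:
            return 1
        return 0

    return select(id_ex_rs), select(id_ex_rt)
```

For the double-write sequence on the previous slide (add $1, add $1, sub $5, $5, $1), both pipeline registers hold destination 1 while SUB is in EX, and the sketch picks the EX/MEM copy, which is the result of the second ADD.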

8 Simplified datapath with forwarding

9 Example
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Assume again each register initially contains its number plus 100.
  - After the first instruction, $2 should contain -2 (= 101 - 103).
  - The other instructions should all use -2 as one of their operands.
• We’ll try to keep the example short:
  - Assume no forwarding is needed except for register $2.
  - We’ll skip the first two cycles, since they’re the same as before.

10 Clock cycle 3
[Datapath snapshot. EX: sub $2, $1, $3; ID: and $12, $2, $5; IF: or $13, $6, $2. The ALU computes 101 - 103 = -2; ForwardA = 0 and ForwardB = 0.]

11 Clock cycle 4: forwarding $2 from EX/MEM
[Datapath snapshot. MEM: sub $2, $1, $3; EX: and $12, $2, $5; ID: or $13, $6, $2; IF: add $14, $2, $2. ID/EX.RegisterRs = 2 matches EX/MEM.RegisterRd = 2, so ForwardA = 2 and the ALU’s first input is -2 from EX/MEM; ForwardB = 0.]

12 Clock cycle 5: forwarding $2 from MEM/WB
[Datapath snapshot. WB: sub $2, $1, $3; MEM: and $12, $2, $5; EX: or $13, $6, $2; ID: add $14, $2, $2; IF: sw $15, 100($2). ID/EX.RegisterRt = 2 matches MEM/WB.RegisterRd = 2, so ForwardB = 1 and the ALU’s second input is -2 from MEM/WB; ForwardA = 0.]

13 Forwarding resolved two data hazards
• The data hazard during cycle 4:
  - The forwarding unit notices that the ALU’s first source register for the AND is also the destination of the SUB instruction.
  - The correct value is forwarded from the EX/MEM register, overriding the incorrect old value still in the register file.
• The data hazard during cycle 5:
  - The ALU’s second source (for OR) is the SUB destination again.
  - This time, the value has to be forwarded from the MEM/WB pipeline register instead.
• There are no other hazards involving the SUB instruction:
  - During cycle 5, SUB writes its result back into register $2.
  - The ADD instruction can read this new value from the register file in the same cycle.

14 Complete pipelined datapath...so far
[Datapath diagram: instruction memory, register file, ALU and data memory separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, with the EX, M and WB control signals carried along. The forwarding unit takes the ID/EX Rs and Rt fields plus EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the two ALU input muxes.]

15 What about stores?
• Two “easy” cases:
  - The add result is the value being stored:
      add $1, $2, $3
      sw $1, 0($4)
  - The add result is the base address of the store:
      add $1, $2, $3
      sw $4, 0($1)
[Pipeline diagrams for both cases, clock cycles 1-6]

16 Store Bypassing: Version 1
[Datapath diagram with forwarding unit. EX: sw $4, 0($1); MEM: add $1, $2, $3.]

17 Store Bypassing: Version 2
[Datapath diagram with forwarding unit. EX: sw $1, 0($4); MEM: add $1, $2, $3.]

18 What about stores?
• A harder case:
    lw $1, 0($2)
    sw $1, 0($4)
• In what cycle is the load value available?
  - End of cycle 4
• In what cycle is the store value needed?
  - Start of cycle 5
• What do we have to add to the datapath?
[Pipeline diagram, clock cycles 1-6]

19 Load/Store Bypassing: Extend the Datapath
[Datapath diagram for the sequence lw $1, 0($2); sw $1, 0($4): a new two-input mux, controlled by ForwardC, is added alongside the existing forwarding muxes.]

20 Miscellaneous comments
• Each MIPS instruction writes to at most one register.
  - This makes the forwarding hardware easier to design, since there is only one destination register that ever needs to be forwarded.
• Forwarding is especially important with deep pipelines like the ones in all current PC processors.
• The textbook has some additional material not shown here:
  - Their hazard detection equations also ensure that the source register is not $0, which can never be modified.
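One way to read the extra $0 guard is as an additional term in the EX/MEM condition. The sketch below assumes the guard is applied to the destination register of the earlier instruction (which, together with the equality test, also means the source is not $0); the function name `ex_mem_hazard` is hypothetical.

```python
def ex_mem_hazard(ex_mem_reg_write, ex_mem_rd, id_ex_src):
    """EX/MEM forwarding condition with an added $0 guard (assumed form)."""
    return bool(
        ex_mem_reg_write
        and ex_mem_rd != 0          # register $0 can never be modified
        and ex_mem_rd == id_ex_src  # destination matches an ALU source
    )
```

Without the guard, an instruction that names $0 as its destination would cause its ALU result to be forwarded in place of the constant zero.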

Load-Use Data Hazard
• Need to stall for one cycle

22 What about loads?
• Consider the instruction sequence shown below:
  - The load data doesn’t come from memory until the end of cycle 4.
  - But the AND needs that value at the beginning of the same cycle!
• This is a “true” data hazard: the data is not available when we need it.
• We call this a load-use hazard.
    lw $2, 20($3)
    and $12, $2, $5
[Pipeline diagram, clock cycles 1-6]

23 Stalling
• The easiest solution is to stall the pipeline.
• We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble.
• Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU.
[Pipeline diagram, clock cycles 1-7: lw $2, 20($3); and $12, $2, $5]

24 Stalling and forwarding
• Without forwarding, we’d have to stall for two cycles to wait for the LW instruction’s writeback stage.
• In general, you can always stall to avoid hazards, but dependencies are very common in real code, and frequent stalling can reduce performance by a significant amount.
[Pipeline diagram, clock cycles 1-8: lw $2, 20($3); and $12, $2, $5]

Load-Use Hazard Detection
• Check when the using instruction is decoded in the ID stage.
• ALU operand register numbers in the ID stage are given by IF/ID.RegisterRs and IF/ID.RegisterRt.
• Load-use hazard when
    ID/EX.MemRead and
      ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
       (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert a bubble.
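The detection condition translates directly into a predicate. This is a minimal sketch with a hypothetical function name; the arguments are just the pipeline-register fields named in the condition.

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID uses the register a load in EX will write."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
```

For lw $2, 20($3) followed by and $12, $2, $5, the load is in EX with rt = 2 while the AND is in ID with rs = 2 and rt = 5, so the predicate is true and a bubble is inserted.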

How to Stall the Pipeline
• Force control values in the ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
• Prevent update of the PC and the IF/ID register
  - The using instruction is decoded again
  - The following instruction is fetched again
  - The 1-cycle stall allows MEM to read data for lw
  - Can subsequently forward to the EX stage

27 Stalling delays the entire pipeline
• If we delay the second instruction, we’ll have to delay the third one too.
  - This is necessary to make forwarding work between AND and OR.
  - It also prevents problems such as two instructions trying to write to the same register in the same cycle.
[Pipeline diagram, clock cycles 1-8: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2]

28 What about EX, MEM, WB
• But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?
• Those units aren’t used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s.
[Pipeline diagram, clock cycles 1-8: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2]

29 Detecting Stalls, cont.
• When should stalls be detected?
  - EX stage (of the instruction causing the stall)
• What is the stall condition?
    if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)) then stall
[Pipeline diagram: lw $2, 20($3); and $12, $2, $5, with the if/id, id/ex, ex/mem and mem/wb pipeline registers marked]

30 Adding hazard detection to the CPU

Stalls and Performance
• Stalls reduce performance
  - But are required to get correct results
• Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of a load result in the next instruction.
• Example: C code for A = B + E; C = B + F;
Original order (13 cycles):
    lw $t1, 0($t0)
    lw $t2, 4($t0)
    add $t3, $t1, $t2    (stall)
    sw $t3, 12($t0)
    lw $t4, 8($t0)
    add $t5, $t1, $t4    (stall)
    sw $t5, 16($t0)
Reordered (11 cycles):
    lw $t1, 0($t0)
    lw $t2, 4($t0)
    lw $t4, 8($t0)
    add $t3, $t1, $t2
    sw $t3, 12($t0)
    add $t5, $t1, $t4
    sw $t5, 16($t0)
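The 13-cycle and 11-cycle figures can be reproduced with a small count. The sketch below assumes a five-stage pipeline with full forwarding, so the only stalls are one-cycle load-use stalls between adjacent instructions; the tuple encoding of instructions and the name `count_cycles` are hypothetical.

```python
def count_cycles(program):
    """program: list of (opcode, destination, sources) tuples.

    Assumes a 5-stage pipeline with forwarding: the first instruction needs
    5 cycles, each later one adds 1, and each load-use pair adds 1 stall.
    """
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_srcs) in zip(program, program[1:]):
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return len(program) + 4 + stalls


original = [
    ("lw", "t1", ("t0",)),
    ("lw", "t2", ("t0",)),
    ("add", "t3", ("t1", "t2")),  # uses t2 right after its load
    ("sw", None, ("t3", "t0")),
    ("lw", "t4", ("t0",)),
    ("add", "t5", ("t1", "t4")),  # uses t4 right after its load
    ("sw", None, ("t5", "t0")),
]

# Same instructions, with the third load hoisted above the first add.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(count_cycles(original), count_cycles(reordered))
```

Moving the third load up separates each load from its first use by at least one instruction, which is why both stalls disappear.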

33 Branches in the original pipelined datapath
• When are they resolved?
[Datapath diagram: the original pipelined datapath with its control signals (PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemToReg), including the Shift left 2 and Add units for the branch target and the ALU Zero output.]

Branch Hazards
• If branch outcome determined in MEM:
  - Flush these instructions (set control values to 0)
[Pipeline diagram marking the instructions fetched after the branch, before the PC is updated]

Reducing Branch Delay
• Move hardware to determine outcome to the ID stage
  - Target address adder
  - Register comparator
• Example: branch taken
    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
    ...
    72: lw $4, 50($7)

Example: Branch Taken

Data Hazards for Branches
• If a comparison register is a destination of the 2nd or 3rd preceding ALU instruction
  - Can resolve using forwarding
    add $1, $2, $3
    add $4, $5, $6
    …
    beq $1, $4, target

Data Hazards for Branches
• If a comparison register is a destination of the preceding ALU instruction or the 2nd preceding load instruction
  - Need 1 stall cycle
    lw $1, addr
    add $4, $5, $6
    beq $1, $4, target    (beq stalled for 1 cycle)

Data Hazards for Branches
• If a comparison register is a destination of the immediately preceding load instruction
  - Need 2 stall cycles
    lw $1, addr
    beq $1, $0, target    (beq stalled for 2 cycles)
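The three branch cases follow one pattern when the branch is resolved in ID: an ALU result can be forwarded once its producer is at least two instructions ahead, and a load result once its producer is at least three ahead. The sketch below encodes that reading; the function name is hypothetical, and the zero-stall answer for a load three instructions ahead is an assumption that the slides do not state.

```python
def branch_stall_cycles(distance, producer_is_load):
    """Stall cycles for a branch compared in ID.

    distance: 1 means the producer immediately precedes the branch.
    """
    # A load's value exists one stage later than an ALU result.
    ready_distance = 3 if producer_is_load else 2
    return max(0, ready_distance - distance)
```

With these numbers, an immediately preceding load gives 2 stall cycles, and a preceding ALU instruction or second-preceding load gives 1.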

Branch Prediction
• Longer pipelines can’t readily determine branch outcome early
  - Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
  - Only stall if prediction is wrong
• Simplest prediction strategy: predict branches not taken
  - Works well for loops if the loop tests are done at the start
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Two pipeline diagrams: prediction correct, and prediction incorrect]

Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch penalty is more significant
• Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer
• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
• Only change prediction on two successive mispredictions

Calculating the Branch Target
• Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
• Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
  - If hit and instruction is branch predicted taken, can fetch target immediately
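A branch target buffer can be pictured as a small lookup table consulted at fetch time. The sketch below uses a Python dictionary, which is a simplification: a real buffer has a fixed size and is typically indexed by some of the PC bits. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy cache of branch targets, keyed by the branch's PC."""

    def __init__(self):
        self.targets = {}  # branch PC -> target address

    def record(self, pc, target):
        """Remember the target once a taken branch has been resolved."""
        self.targets[pc] = target

    def next_fetch(self, pc, predict_taken):
        """Address to fetch next: the cached target on a hit that is predicted taken."""
        target = self.targets.get(pc)
        if target is not None and predict_taken:
            return target
        return pc + 4  # miss or predicted not taken: fall through


btb = BranchTargetBuffer()
btb.record(40, 72)  # beq at address 40 branching to 72, as in the earlier example
print(btb.next_fetch(40, predict_taken=True))
```

On a hit with a taken prediction the fetch stage can go straight to address 72, without waiting for the target adder.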

Concluding Remarks
• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
• Hazards: structural, data, control
• Main additions in hardware:
  - forwarding unit
  - hazard detection and stalling
  - branch predictor
  - branch target table

