MIPS Pipelining
Chapter 4, Sections 4.5 – 4.8
Dr. Iyad F. Jafar
Outline
  Introduction
  Why Pipelining?
  MIPS Pipelined Datapath
  MIPS Pipelined Control
  Pipelining Hazards
    Structural Hazards
    Data Hazards
    Control Hazards
  Exceptions and Interrupts
  Fallacies and Pitfalls
  Reading Assignment
Introduction
Single-cycle datapath
  Simple! But what about hardware replication? And the cycle time?
Multi-cycle datapath
  More involved
  Less hardware replication of major units
  Better performance if the delay of the major functional units is balanced!
Can we do any better? Pipelining!
Introduction
Pipelining
In multi-cycle, only one major unit is used in each cycle while the other units are idle! Why not use them to do something else?
Basically, start the next instruction before the current one is finished!
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
LW        IFetch   Dec      Exec     Mem      WB
SW                 IFetch   Dec      Exec     Mem      WB
R-Type                      IFetch   Dec      Exec     Mem      WB
Introduction Pipelining The time required to execute one instruction (Instruction latency) is not affected! However, the number of instructions finished per unit time (Throughput) is increased Thus, Pipelining improves the throughput not latency! Most modern processors are pipelined! Notes As in multi-cycle, the cycle time is determined by the slowest unit! However, similar to single-cycle, we can get one instruction done every cycle! It is assumed that all instructions take the same number of cycles!
Introduction
[Figure: clock timing of lw, sw, and an R-type instruction in the single-cycle, multiple-cycle, and pipelined implementations over cycles 1 to 10. The single-cycle timeline marks the wasted part of the long cycle; in the multiple-cycle and pipelined timelines each instruction passes through IFetch, Dec, Exec, Mem, and WB.]
Why Pipelining?
For Performance!
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to single-cycle)
[Figure: pipeline diagram of Inst 1 to Inst 5 in instruction order against time (clock cycles); each instruction uses IM, Reg, ALU, DM, and Reg. The first cycles are labeled as the time to fill the pipeline.]
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
Consider a program consisting of a large number of LOAD instructions only, executed on a single-cycle CPU and on a 5-stage pipelined CPU. The operation time of each major unit (memory, ALU, and register file) is 200 ps in both cases.
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
2) Determine the time required to finish executing the first 3 LOAD instructions.
3) Repeat (1) and (2) if the delay of the register file is 100 ps instead of 200 ps.
Cycle times for the two implementations
CC_SC = 200 + 200 + 200 + 200 + 200 = 1000 ps
CC_PP = 200 ps
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
Single-cycle: Time_SC = 1000 ps x 1,000,000 = 1,000,000,000 ps
Pipelining: Time_PP = 1000 ps + 200 ps x 999,999 = 200,000,800 ps
After 200 x 5 = 1000 ps, the pipeline is full and we get 1 instruction per cycle afterwards.
Speedup = 1,000,000,000 / 200,000,800 = 4.99998 (very close to the number of stages)
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
2) Determine the time required to finish executing the first 3 LOAD instructions and compute the speedup of pipelining.
Single-cycle: Time_SC = 1000 x 3 = 3000 ps
Pipelining: Time_PP = 200 x 5 + 200 + 200 = 1400 ps
Speedup = 3000 / 1400 = 2.14 (less than the number of stages)
Why Pipelining?
Example 1. Comparing pipelining to single-cycle
3) Repeat (1) and (2) if the delay of the register file is 100 ps.
CC_SC = 200 + 100 + 200 + 200 + 100 = 800 ps
CC_PP = 200 ps
For 1,000,000 instructions
Time_SC = 800 x 1,000,000 = 800,000,000 ps
Time_PP = 1000 + 200 x 999,999 = 200,000,800 ps
Speedup = 800,000,000 / 200,000,800 = 3.99998 (< 5)
For 3 instructions
Time_SC = 800 x 3 = 2400 ps
Time_PP = 1000 + 200 x 2 = 1400 ps
Speedup = 2400 / 1400 = 1.71 (< 5)
Why Pipelining?
Example 1. Summary
Ideally, the pipelined implementation is n times faster than the single-cycle one, where n is the number of pipeline stages. In the 5-stage MIPS, the pipelined version would be 5 times faster.
When the pipeline is full, the throughput is one instruction per cycle.
Many factors affect pipelining performance
  Time to fill and empty the pipeline
  Number of instructions to execute
  Unbalanced delay of pipeline stages
  Instruction mix
  Pipeline hazards
Ideally, the number of cycles required to finish M instructions in an N-stage pipeline is N + M – 1
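The numbers in Example 1 can be reproduced with a short Python sketch. It assumes the ideal case described in the summary (no hazards, cycle time set by the slowest stage, N + M – 1 cycles in total); the function names are illustrative only.

```python
def single_cycle_time_ps(num_instructions, stage_delays_ps):
    """Single-cycle: every instruction takes one long cycle (sum of all stage delays)."""
    return num_instructions * sum(stage_delays_ps)


def pipelined_time_ps(num_instructions, stage_delays_ps):
    """Ideal pipeline: N + M - 1 cycles, each as long as the slowest stage."""
    cycle_ps = max(stage_delays_ps)
    num_stages = len(stage_delays_ps)
    return (num_stages + num_instructions - 1) * cycle_ps


# Stage delays of a load in the order IF, ID, EX, MEM, WB (values from Example 1).
balanced = [200, 200, 200, 200, 200]
fast_register_file = [200, 100, 200, 200, 100]

for name, delays in (("balanced", balanced), ("fast register file", fast_register_file)):
    for count in (1_000_000, 3):
        t_sc = single_cycle_time_ps(count, delays)
        t_pp = pipelined_time_ps(count, delays)
        print(f"{name}, {count} loads: SC={t_sc} ps, PP={t_pp} ps, speedup={t_sc / t_pp:.5f}")
```

Note that the speedup approaches the number of stages only when the stage delays are balanced and the instruction count is large.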
Pipelined MIPS Datapath
What do we need to implement pipelining? We need to consider the following:
The execution of instructions is divided into 5 stages (cycles): Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB)
Instruction flow is from left to right except in two cases
  In the write-back stage, where the result is written into the register file in the middle of the datapath
  Choosing between the incremented PC and the branch address in the MEM stage
In pipelining, all units are operating in every cycle; thus we have to duplicate hardware where needed
Since the execution is over multiple cycles, we need to add state (pipeline) registers between stages to preserve intermediate data and control for each instruction. These registers hold the values to be used in later stages for as long as they are needed.
Pipelined MIPS Datapath
[Figure: pipelined datapath with stages IF, ID, EX, MEM, and WB, showing the PC, instruction memory, register file, sign extend (16 to 32), shift left 2, adders, ALU, and data memory, separated by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB and driven by the system clock.]
Note two exceptions to the left-to-right flow
  WB, which writes the result back into the register file in the middle of the datapath
  Selection of the next value of the PC, where one input is the branch address calculated in the MEM stage
Only later instructions in the pipeline can be influenced by these two REVERSE data movements. The first one (WB to ID) leads to data hazards. The second one (MEM to IF) leads to control hazards.
All instructions must update some state in the processor (the register file, the memory, or the PC), so no pipeline register is needed after the WB stage; it would be redundant with the state that is updated.
The PC can be thought of as a pipeline register: the one that feeds the IF stage of the pipeline. Unlike all of the other pipeline registers, the PC is part of the visible architecture state; its content must be saved when an exception occurs (the contents of the other pipeline registers are discarded).
Any problem?
Pipelined MIPS Datapath
[Figure: the same pipelined datapath, corrected so that the destination register number travels with the instruction through the pipeline registers.]
Need to preserve the destination register!
Pipelined MIPS Datapath Example 2. Execution of LW instruction (1) Instruction Fetch: Put PC and the loaded instruction in the IF/ID register
Pipelined MIPS Datapath Example 2. Execution of LW instruction (2) Instruction Decode and Read Registers: Store Reg[rs], Reg[rt], sign extended offset , rd, rt, and the updated PC (why?) in the ID/EX register
MIPS Pipelining Example 2. Execution of LW instruction (3) Execute Or Address Calculation: Store branch address, Reg[rt], result, and zero flag in the EX/MEM register
Pipelined MIPS Datapath Example 2. Execution of LW instruction (4) Memory Access: Store the data from memory into MEM/WB register
Pipelined MIPS Datapath Example 2. Execution of LW instruction (5) Write Back: Copy the data loaded in the MEM/WB register to register file
Pipelined MIPS Datapath
Required data fields in the pipeline registers
Data fields are moved from one pipeline register to another every clock cycle until they are no longer needed
Pipeline Register | Data Fields | Register Size
IF/ID  | Instruction and PC | 64 bits
ID/EX  | PC, Reg[rs], Reg[rt], sign-extended offset, rt, rd | 138 bits
EX/MEM | Branch address, Zero, ALU result, Reg[rt], destination register address (rt or rd) | 102 bits (32 + 1 + 32 + 32 + 5)
MEM/WB | ALU result, data from memory, destination register address | 69 bits
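As a quick check of the register sizes, the widths of the listed data fields can be summed in Python. This is only a sketch: it counts the data fields named on this slide (no control bits), and the dictionary keys are illustrative names. With these fields, EX/MEM adds up to 102 bits.

```python
# Width in bits of each data field held in the pipeline registers (data only, no control).
PIPELINE_REGISTER_FIELDS = {
    "IF/ID": {"instruction": 32, "pc": 32},
    "ID/EX": {"pc": 32, "reg_rs": 32, "reg_rt": 32, "sign_ext_offset": 32, "rt": 5, "rd": 5},
    "EX/MEM": {"branch_address": 32, "zero": 1, "alu_result": 32, "reg_rt": 32, "dest_reg": 5},
    "MEM/WB": {"alu_result": 32, "memory_data": 32, "dest_reg": 5},
}


def register_width(name):
    """Total number of data bits stored in the named pipeline register."""
    return sum(PIPELINE_REGISTER_FIELDS[name].values())


for name in PIPELINE_REGISTER_FIELDS:
    print(f"{name}: {register_width(name)} bits")
```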
Pipelined MIPS Control All control signals can be determined during Decode stage while they are needed in later stages! Solution! Expand the pipeline registers to store and move the control signals between stages until they are needed
Pipelined MIPS Control
Define the control signals and generate them in the decode stage
For the time being, no explicit write signals are required for the pipeline registers since they are updated every cycle
Pipelined MIPS Control
Control signals needed in each stage (their values are based on the instruction type)
Pipeline Stage | Control signals
IF  | None
ID  | None
EX  | RegDst, ALUOp1, ALUOp0, ALUSrc
MEM | Branch, MemRead, MemWrite
WB  | MemtoReg, RegWrite
MIPS Pipeline
Example 3. Given the code segment and the register contents below, show the contents of the data and control fields in the pipeline registers when the sixth instruction has been fetched (i.e. at the beginning of cycle 7)
Register | Contents
$1  | 1
$2  | 5
$3  | 3
$4  | -6
$5  | 2
$6  | 7
$11 | 12
$12 | -15
$13 | 10
Address    | Instruction
0x00000000 | lw  $10, 20($1)
0x00000004 | sub $11,$1,$2
0x00000008 | add $12,$3,$4
0x0000000c | lw  $13, 24($1)
0x00000010 | add $3,$2,$1
0x00000014 | sub $1,$5,$6
MIPS Pipeline
Example 3. Multi-cycle diagram
[Figure: pipeline diagram in instruction order against time for lw $10, 20($1); sub $11,$1,$2; add $12,$3,$4; lw $13, 24($1); add $3,$2,$1; sub $1,$5,$6. Each instruction uses IM, Reg, ALU, DM, and Reg, starting one cycle after the previous one.]
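MIPS Pipeline
Example 3. Single-cycle diagram
[Figure: snapshot of the datapath in one clock cycle, with sub $1,$5,$6; add $3,$2,$1; lw $13, 24($1); add $12,$3,$4; and sub $11,$1,$2 occupying the five stages from IF to WB.]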
MIPS Pipeline Example 3. At the beginning of cycle 7, the sixth instruction is stored in the IF/ID register while the data and control for earlier instructions are pushed to next pipeline registers and the register files. Thus, IF/ID register No control signals are stored Store the instruction sub $1,$5,$6 and PC+4 IF/ID.Instruction = 0x00A60822 IF/ID.PC = 0x00000018
MIPS Pipeline
Example 3. ID/EX register
Stores the information of add $3,$2,$1 and PC+4
ID/EX.PC = 0x00000014
ID/EX.RegRsContents = 0x00000005
ID/EX.RegRtContents = 0x00000001
ID/EX.RegRt = (00001)2
ID/EX.RegRd = (00011)2
ID/EX.SignExtend = 0x00001820
Control Information
ID/EX.MemToReg = 0
ID/EX.RegWrite = 1
ID/EX.MemRead = 0
ID/EX.MemWrite = 0
ID/EX.Branch = 0
ID/EX.ALUSrc = 0
ID/EX.RegDst = 1
ID/EX.ALUOp = (10)2
MIPS Pipeline
Example 3. EX/MEM register
Stores the information of lw $13,24($1): the branch address and the memory address
EX/MEM.BranchAddress = 0x00000070
EX/MEM.ALUOut = 0x00000019
EX/MEM.Zero = 0
EX/MEM.RegDestination = (01101)2
EX/MEM.RegRtContents = 0x0000000A
Control Information
EX/MEM.MemToReg = 1 (a load writes the data read from memory, not the ALU result)
EX/MEM.RegWrite = 1
EX/MEM.MemRead = 1
EX/MEM.MemWrite = 0
EX/MEM.Branch = 0
MIPS Pipeline
Example 3. MEM/WB register
Stores the information of add $12,$3,$4: the addition result and the data memory output (not used by add)
MEM/WB.RegDestination = (01100)2
MEM/WB.ALUOut = 0xFFFFFFFD
MEM/WB.MemoryData = XXXX
Control Information
MEM/WB.MemToReg = 0
MEM/WB.RegWrite = 1
For sub $11,$1,$2: it is no longer held in a pipeline register; its write back of (1 - 5) = -4 to $11 takes place in cycle 6.
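Several of the hexadecimal values in Example 3 can be cross-checked with a small Python sketch. It assumes the standard MIPS R-type field layout (opcode 0, then rs, rt, rd, shamt, funct) and uses the register contents given in the example; the helper names are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def encode_r_type(rs, rt, rd, funct, shamt=0):
    """Encode an R-type instruction (opcode 0) as a 32-bit word."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def sign_extend_16(value):
    """Sign-extend the low 16 bits of value to 32 bits."""
    value &= 0xFFFF
    if value & 0x8000:
        return value | 0xFFFF0000
    return value


# Register contents from Example 3 (only the registers that are read).
regs = {1: 1, 2: 5, 3: 3, 4: -6, 5: 2, 6: 7}

# IF/ID holds sub $1,$5,$6 (funct 0x22).
if_id_instruction = encode_r_type(rs=5, rt=6, rd=1, funct=0x22)

# ID/EX holds add $3,$2,$1 (funct 0x20); the sign-extend field is its low 16 bits.
id_ex_sign_extend = sign_extend_16(encode_r_type(rs=2, rt=1, rd=3, funct=0x20))

# EX/MEM holds lw $13,24($1), fetched from address 0x0000000c.
lw_pc_plus_4 = 0x0000000C + 4
ex_mem_branch_address = (lw_pc_plus_4 + (sign_extend_16(24) << 2)) & MASK32
ex_mem_alu_out = (regs[1] + sign_extend_16(24)) & MASK32

# MEM/WB holds add $12,$3,$4.
mem_wb_alu_out = (regs[3] + regs[4]) & MASK32

print(f"IF/ID.Instruction    = 0x{if_id_instruction:08X}")
print(f"ID/EX.SignExtend     = 0x{id_ex_sign_extend:08X}")
print(f"EX/MEM.BranchAddress = 0x{ex_mem_branch_address:08X}")
print(f"EX/MEM.ALUOut        = 0x{ex_mem_alu_out:08X}")
print(f"MEM/WB.ALUOut        = 0x{mem_wb_alu_out:08X}")
```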
Pipelining Hazards
In general, pipelining is effective!
The MIPS ISA makes it even easier
  All instructions are of the same length (32 bits)
    Can fetch the next instruction while the current one is being decoded
  Few instruction formats, with symmetry across them
    Can read the register file in the 2nd stage
  Memory access is only through the Load and Store instructions
    Can use the execute stage to compute the address
  Each MIPS instruction writes at most one result, in the MEM or WB stage
Is it that easy? Any complications? YES! PIPELINING HAZARDS!
Pipelining Hazards
Hazards: problems that might occur during pipeline operation
Three basic sources
  Structural Hazards
    In pipelining, all functional units are used in every cycle
    What if two instructions use the same functional unit in the same cycle?
  Data Hazards
    In pipelining, the execution of instructions is overlapped
    What if the operand(s) of some instruction come from an earlier instruction that is still in the pipeline?
  Control Hazards
    In pipelining, an instruction is fetched every cycle
    What if an instruction is a jump, or a branch that evaluates to true? The following instruction(s) in the pipeline might not be the correct ones!
Simple Solution? Wait until the issue is resolved!
Structural Hazards
Single Memory!
[Figure: pipeline diagram against time (clock cycles) for lw followed by Inst 1 to Inst 4, all using the same memory (Mem) for instruction fetch and data access.]
Reading from memory twice in the same cycle!
Solution: Use two memories; Data and Instruction!
Structural Hazards
Single Register File!
[Figure: pipeline diagram against time (clock cycles) for add $1, followed by Inst 1, Inst 2, and add $2,$1,; it marks the clock edge that controls loading of the pipeline state registers and the clock edge that controls register writing.]
One instruction is writing and the other is reading the register file?
Solution: Design the register file to write in the first half of the cycle and read in the second half!
Data Hazards
[Figure: pipeline diagram for the instruction sequence below.]
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Dependencies backward in time cause hazards
This is called a Read-after-Write (RAW) data hazard
Here it is a register-use data hazard
Solution?
Data Hazards
Simply, wait for the earlier instruction to finish! This is called stalling the pipeline!
However, this affects the CPI!
[Figure: pipeline diagram of add $1, followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.]
Do we need two stalls all the time? No. If the conflicting instruction is not immediately after the earlier instruction but one instruction later, then only one stall is needed.
Data Hazards
[Figure: pipeline diagram for the instruction sequence below.]
lw $1,5($s1)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Dependencies backward in time cause hazards
It is a Read-after-Write (RAW) data hazard
Here it is a load-use data hazard
Solution?
Data Hazards
Again, wait for the LW instruction to finish by stalling the pipeline!
However, this affects the CPI!
[Figure: pipeline diagram of lw $1, followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.]
Data Hazards
Example 4. How many cycles are actually required to execute the following code? Assume the pipeline is already full.
add $1, $2, $5
add $5, $3, $1     # register-use data hazard on $1: adds 2 cycles by stalls
sub $10, $7, $8
sub $5, $6, $7
lw $3, 45($9)
add $3, $3, $8     # load-use data hazard on $3: adds 2 cycles by stalls
Ideally, and since the pipeline is full, each instruction requires 1 cycle. Thus, we need 6 cycles (CPI = 6/6 = 1). However, the two hazards add 4 stall cycles.
Thus, 10 cycles are needed. CPI = 10/6 = 1.667
?? Performance ?? Can we do any better?
Data Hazards
Fixing Register-use Hazard by Forwarding
Note that data produced by an instruction and needed by a later instruction is pushed through the pipeline registers until it is saved into the register file!
Why not read the data from the pipeline registers before it is stored? This is called forwarding!
What is required?
  Need to detect the hazard
    Is any of the source registers of the instruction the same as the destination register of an earlier instruction that is still in the pipeline?
  Need to create a path to pass the data between pipeline stages
    Instead of reading the source registers of the instruction from the register file, read them from the pipeline registers
Data Hazards
Fixing Register-use Hazard by Forwarding
[Figure: pipeline diagram of add $1, followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; and xor $4,$1,$5, with the value of $1 forwarded to the dependent instructions.]
No Stalls!
Data Hazards
Forwarding Hardware implementation
Note that forwarding could be from EX/MEM or from MEM/WB! Why?
The dependent instruction could be either of the two instructions that follow, and the data of the earlier instruction moves one pipeline register further every cycle.
add $1
sub $4, $1
sub $5, $1
Data Hazards
Forwarding Hardware implementation
Inside the forwarding unit
Forwarding from EX/MEM (MEM stage)
if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd = ID/EX.RegRs)) then ForwardA = From EX/MEM
if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd = ID/EX.RegRt)) then ForwardB = From EX/MEM
Why check the RegWrite signal? A branch instruction does not write any register, yet the instructions that follow it may use register numbers that match its fields.
Why check for the $zero register? To avoid forwarding a non-zero value when $zero is the destination of an instruction and a following instruction uses it as a source!
add $zero,$2,$3
sub $7,$zero,$2    # without the check, the value used for $zero would be $2 + $3
Consider the code below. We have a WAW dependency followed by a RAW dependency on $2. From where should sub forward, MEM or WB? The forwarding unit has to give priority to the MEM stage (EX/MEM), which holds the most recent value!
lw $1, 18($15)
add $2,$1,$14
addi $2,$1,20
sub $3,$1,$2
or $4,$3,$1
addi $22,$23,$23
sub $5,$6,$7
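Data Hazards
Forwarding Hardware implementation
Inside the forwarding unit
Forwarding from MEM/WB (WB stage)
if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd = ID/EX.RegRs)) then ForwardA = From MEM/WB
if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd = ID/EX.RegRt)) then ForwardB = From MEM/WB
If the EX/MEM forwarding condition also holds for the same operand, forward from EX/MEM instead, since it holds the more recent value.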
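The two forwarding conditions can be sketched in Python as a single selection function for one ALU operand. This is a behavioral sketch only, not a hardware description; the `StageReg` class and the returned labels are hypothetical names introduced for illustration. It gives EX/MEM priority over MEM/WB so that the most recent value is used.

```python
from dataclasses import dataclass


@dataclass
class StageReg:
    """The fields of EX/MEM or MEM/WB that the forwarding unit looks at."""
    reg_write: bool
    dest: int  # destination register number (rt or rd)


def forward_select(src, ex_mem, mem_wb):
    """Choose where the ALU operand with source register number src comes from."""
    if ex_mem.reg_write and ex_mem.dest != 0 and ex_mem.dest == src:
        return "EX/MEM"  # most recent result has priority
    if mem_wb.reg_write and mem_wb.dest != 0 and mem_wb.dest == src:
        return "MEM/WB"
    return "REGFILE"  # no hazard: use the value read in ID


# sub $3,$1,$2 in EX, addi $2,$1,20 in MEM, add $2,$1,$14 in WB:
# both earlier instructions write $2, so the newer one (EX/MEM) must win.
print(forward_select(2, StageReg(True, 2), StageReg(True, 2)))  # EX/MEM
print(forward_select(1, StageReg(True, 2), StageReg(True, 2)))  # REGFILE
```

The same function is evaluated twice per cycle, once with ID/EX.RegRs (ForwardA) and once with ID/EX.RegRt (ForwardB).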
Data Hazards
Can the forwarding hardware be used with the load-use data hazard?
[Figure: pipeline diagram of lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; and xor $4,$1,$5.]
We still need 1 stall for the instruction following the load!
Data Hazards
How to stall the pipeline?
A stall is required when the instruction in the EX stage is a load and the one in the ID stage depends on the loaded value
The load instruction moves normally to EX/MEM on the next cycle
The conflicting instruction (the instruction following the load) should stay in the decode stage. How?
  Don’t write the IF/ID register: need an IF/IDWrite signal
  Don’t update the PC: need a PCWrite signal
  Store the control signals of the instruction in the decode stage as 0’s in ID/EX: need a multiplexor for the control signals
    Why 0’s? All-zero control signals prevent any change of the program state, since nothing is written to the register file or to memory
Controlling the process requires a special unit: the Hazard Detection Unit
Data Hazards Stall Implementation
Data Hazards
Stall Implementation
Inside the hazard detection unit
if (ID/EX.MemRead and ((ID/EX.RegRt == IF/ID.RegRs) or (ID/EX.RegRt == IF/ID.RegRt))) then
  PCWrite = 0
  IF/IDWrite = 0
  Select 0’s as control signals
Any problem? Do we need to stall in all cases?
The condition is met whenever a load instruction is followed by any instruction whose rs or rt field is the same as the rt field of the load. A stall is not actually needed in all such cases. How about j and jal that come immediately after a load, with the bits in the rs and/or rt field positions matching the rt field of the load?
Solution: perform the check one stage later!
if (EX/MEM.MemRead and ((EX/MEM.RegDestination == ID/EX.RegRs) or (EX/MEM.RegDestination == ID/EX.RegRt)) and (not ID/EX.Jump)) then STALL
However, this requires modifying the datapath!
  We should have an ID/EXWrite signal to keep the information of the instruction that follows the load
  The mux of the control signals is moved to the EX stage to reset the control signals
  IF/IDWrite and PCWrite are still there to preserve the following instructions
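The basic load-use check (the first condition above, evaluated while the load is in EX and the next instruction is in ID) can be sketched in Python as follows. The function and key names are illustrative; the returned dictionary simply mirrors the three actions taken on a stall.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall controls for the basic load-use check.

    id_ex_mem_read: True if the instruction in EX is a load
    id_ex_rt:       destination register number of that load
    if_id_rs/rt:    register fields of the instruction being decoded
    """
    stall = id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": 0 if stall else 1,       # hold the PC on a stall
        "IFIDWrite": 0 if stall else 1,     # hold the instruction in IF/ID
        "ZeroControls": 1 if stall else 0,  # insert a bubble into ID/EX
    }


# lw $1,4($2) in EX and sub $4,$1,$5 in ID: rs of sub matches rt of lw.
print(hazard_detection(True, 1, 1, 5))
# add (not a load) in EX: no stall even if register numbers match.
print(hazard_detection(False, 1, 1, 5))
```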
Data Hazards Example 5. Consider the following code segment in C A = B + E C = B + F (1) Generate the MIPS code assuming that variables A, B, C, E, and F are in memory and addressable with offsets 0, 4, 8, 12, and 16 from $t0 (2) Find all the data hazards and determine the number of cycles required to run the code. Assume forwarding is implemented. (3) Can you reorder the code to reduce the stalls ?
Data Hazards
Example 5.
lw $t1, 4($t0)      # loads B
lw $t2, 12($t0)     # loads E
add $t3, $t1, $t2   # A = B + E; load-use data hazard on $t2 adds 1 cycle as a stall
sw $t3, 0($t0)      # stores A
lw $t4, 16($t0)     # loads F
add $t5, $t1, $t4   # C = B + F; load-use data hazard on $t4 adds 1 cycle as a stall
sw $t5, 8($t0)      # stores C
Ideally, each instruction requires 1 cycle after the pipeline is full. Thus, we need (5 + 7 - 1) = 11 cycles. CPI = 11/7 = 1.57
With the two stalls, 13 cycles are needed. CPI = 13/7 = 1.86
?? Performance ??
Data Hazards
Example 5. Reducing stalls by instruction reordering
lw $t1, 4($t0)      # loads B
lw $t2, 12($t0)     # loads E
lw $t4, 16($t0)     # loads F (moved up)
add $t3, $t1, $t2   # A = B + E
sw $t3, 0($t0)      # stores A
add $t5, $t1, $t4   # C = B + F
sw $t5, 8($t0)      # stores C
Moving this instruction fills the first stall and eliminates the second one!
Thus, 11 cycles are needed. CPI = 11/7 = 1.57
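The effect of reordering in Example 5 can be checked with a small Python sketch. It assumes forwarding is implemented, so the only remaining stall is one cycle when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding (operation, destination, sources) is an illustrative choice.

```python
def load_use_stalls(program):
    """Count 1-cycle load-use stalls in a pipeline with forwarding.

    program is a list of (op, dest, sources) tuples. A stall occurs when an
    instruction reads the register loaded by the lw immediately before it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    return stalls


def total_cycles(program, num_stages=5):
    """Cycles to run the program on an initially empty pipeline."""
    return num_stages + len(program) - 1 + load_use_stalls(program)


lw_b = ("lw", "$t1", ["$t0"])
lw_e = ("lw", "$t2", ["$t0"])
lw_f = ("lw", "$t4", ["$t0"])
add_a = ("add", "$t3", ["$t1", "$t2"])
sw_a = ("sw", None, ["$t3", "$t0"])
add_c = ("add", "$t5", ["$t1", "$t4"])
sw_c = ("sw", None, ["$t5", "$t0"])

original = [lw_b, lw_e, add_a, sw_a, lw_f, add_c, sw_c]
reordered = [lw_b, lw_e, lw_f, add_a, sw_a, add_c, sw_c]

for name, prog in (("original", original), ("reordered", reordered)):
    print(f"{name}: {load_use_stalls(prog)} stalls, {total_cycles(prog)} cycles")
```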
Data Hazards
Example 6. Assume that the pipelined MIPS processor without forwarding is used to run a program with the following instruction mix: 20% loads, 20% stores, and 60% ALU. Compute the average CPI given that
  10% of the ALU instructions result in load-use hazards.
  15% of the ALU instructions result in read-after-write (register-use) hazards.
Solution
Ideally, the average CPI is 1 for each instruction
With no forwarding
  Load-use hazards add two cycles
  Register-use hazards add two cycles
Average CPI = 0.2 x 1 + 0.2 x 1 + 0.75 x 0.60 x 1 + 0.1 x 0.60 x 3 + 0.15 x 0.60 x 3 = 1.30
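The weighted average in Example 6 can be written out in Python as a quick check. The variable names are illustrative; the percentages and the two-cycle penalties are those stated in the example.

```python
# Instruction mix from Example 6.
load_fraction = 0.20
store_fraction = 0.20
alu_fraction = 0.60

# Breakdown of the ALU instructions and the cycles each kind takes
# (1 base cycle plus 2 stall cycles for either kind of hazard, no forwarding).
alu_cases = [
    (0.75, 1),  # no hazard
    (0.10, 3),  # load-use hazard
    (0.15, 3),  # register-use hazard
]

alu_cpi = sum(share * cycles for share, cycles in alu_cases)
average_cpi = load_fraction * 1 + store_fraction * 1 + alu_fraction * alu_cpi
print(f"Average CPI = {average_cpi:.2f}")
```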
Control Hazards For the pipelined datapath designed so far, the branch address and decision are known by the end of the MEM stage Instructions following the branch instruction in the pipeline are not correct if the branch evaluates to true! If the branch is true, then these instructions should be removed from the pipeline and execution should continue from the branch address Otherwise, no action is required! This is a dependency backward in time Control Hazard
Control Hazards
Solution! Once it is known that the instruction is a branch, stall the pipeline for 3 cycles? Is it actually a stall?
[Figure: a branch followed in the pipeline by Inst1, Inst2, and Inst3.]
Effectively, we have to flush the IF/ID register for 3 cycles instead of stalling, since stalling may result in an error in the program execution when the branch evaluates to true.
Flushing requires clearing the IF/ID register and preventing the update of the program counter.
Note that when the IF/ID register is flushed because there is a branch instruction in the ID/EX register, the PC has to be changed back to the address of the flushed instruction that follows the branch!
Control Hazards
[Figure: pipeline diagram of beq followed by three stalls and then the next instruction (Inst).]
Are these actual stalls? Why not start the execution of the following instructions normally and, if the branch is true, flush these instructions?!
If we don’t use stalls and start executing the instructions following the branch, we only lose three cycles if the branch is true!
Fetching from instruction memory is either from PC+4 or from the branch address, depending on the branch result
Control Hazards Reducing the Cost of Branch Hazard Note that three cycles are lost if the branch evaluates to true in order to remove the three instructions following the branch instruction! This could affect the performance significantly! Can we reduce this cost? Move the branch address computation to the decode stage Add additional hardware to compare the two registers in the ID stage! Whenever there is a branch instruction in the ID/EX register (ID/EX.branch =1), flush the instruction in the IF/ID register. The branch penalty in this case will be 1 cycle instead of 3 cycles!
Control Hazards Reducing the Cost of Branch Hazard If we don’t flush the instruction, it will be executed later. Note that on flushing, the pc is not updated, so it is still pointing to the instruction that follows the branch.
Control Hazards
Reducing the Cost of Branch Hazard
Modifying the Hazard Detection Unit
IF (ID/EX.Branch) then Flush IF/ID register
[Figure: pipeline diagram of beq followed by one stall and then lw.]
Note that we lose one cycle whenever a branch instruction is encountered! Can we do any better?
Control Hazards
Reducing the Cost of Branch Hazard
Approach I – Static Branch Prediction
Always predict the branch as Not Taken and start fetching the instruction following the branch
  If the branch evaluates to Not Taken, then the prediction is correct and no further action is required!
  If the branch evaluates to Taken, then the prediction is not correct! Remove the fetched instruction and start fetching from the branch address
In this approach, we only lose one cycle if the prediction is not correct
Inside the hazard detection unit
IF (ID/EX.Branch) and (ID/EX.ZERO) Then Flush IF/ID register
Control Hazards
Reducing the Cost of Branch Hazard
Approach II – Dynamic Branch Prediction
Prediction could be Taken or Not Taken
  If the branch is predicted as Not Taken
    Fetch the next instruction
    If the prediction is false, flush the instruction. One cycle is lost!
  If the branch is predicted as Taken
    Fetch the instruction from the branch address
    If the prediction is false, flush and fetch from PC+4
How to store the branch prediction?
  Use a Branch History Table or Branch Prediction Buffer
  The table is addressed by the lower bits of the branch instruction address
If the branch is predicted as Taken, do we need to wait for the branch address to be computed?
  Use a Branch Target Buffer
Control Hazards Approach II – Dynamic Branch Prediction 1-bit Branch Predictor Basically we have two states (Taken and Not Taken) One bit is used to store the prediction Prediction state is changed when prediction is wrong Performance Issues Consider branching in loops? EXAMPLE?
Control Hazards Approach II – Dynamic Branch Prediction 2-bit Branch Predictor Basically we have four states two bits are used to store the prediction Prediction state is changed when prediction is wrong twice
Control Hazards
Example 7. Consider a program that has a conditional branch instruction whose actual outcomes, when the program is executed, are given below.
T-T-N-T-T-N-T
List the predictions for the following branch prediction schemes and find the prediction accuracy.
  Predict always taken
  Predict always not taken
  1-bit predictor, initialized to predict taken
  2-bit predictor, initialized to weakly predict taken
Control Hazards
Example 7. Actual branch actions: T-T-N-T-T-N-T
Predict as always taken
  Predictions: T-T-T-T-T-T-T
  Accuracy = 5/7 = 71%
Predict as always not taken
  Predictions: N-N-N-N-N-N-N
  Accuracy = 2/7 = 29%
1-bit predictor initialized to predict taken
  Predictions: T-T-T-N-T-T-N
  Accuracy = 3/7 = 43%
2-bit predictor initialized to weakly predict taken
  Predictions: T-T-T-T-T-T-T
  Accuracy = 5/7 = 71%
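The four schemes of Example 7 can be simulated with a short Python sketch. The 2-bit predictor is modeled here as a saturating counter with states 0 (strongly not taken) to 3 (strongly taken), which is one common way to realize the four-state scheme; this encoding and the function names are assumptions made for illustration.

```python
def static_predictions(outcomes, taken):
    """Always predict the same direction."""
    return [taken] * len(outcomes)


def one_bit_predictions(outcomes, initial_taken=True):
    """1-bit predictor: predict whatever the branch did last time."""
    state = initial_taken
    predictions = []
    for actual in outcomes:
        predictions.append(state)
        state = actual
    return predictions


def two_bit_predictions(outcomes, initial_counter=2):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    counter = initial_counter  # 2 = weakly taken
    predictions = []
    for actual in outcomes:
        predictions.append(counter >= 2)
        counter = min(3, counter + 1) if actual else max(0, counter - 1)
    return predictions


def correct_count(predictions, outcomes):
    return sum(p == a for p, a in zip(predictions, outcomes))


def as_text(predictions):
    return "-".join("T" if p else "N" for p in predictions)


outcomes = [c == "T" for c in "TTNTTNT"]
schemes = {
    "always taken": static_predictions(outcomes, True),
    "always not taken": static_predictions(outcomes, False),
    "1-bit": one_bit_predictions(outcomes),
    "2-bit": two_bit_predictions(outcomes),
}
for name, preds in schemes.items():
    hits = correct_count(preds, outcomes)
    print(f"{name}: {as_text(preds)}  accuracy = {hits}/{len(outcomes)}")
```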
Pipelining Performance
Example 8. Let’s compare the performance of the single-cycle, multi-cycle, and pipelined implementations of the MIPS processor given the operation times and instruction mix below. For the pipelined implementation, assume that:
1) The branch decision is made in the MEM cycle. Branch handling in the pipelined implementation is done by stalling the pipeline.
2) Half of the load instructions incur a load-use hazard.
3) Forwarding is implemented.
4) The jump instruction is completed in the ID stage.
Instruction type | Percentage %
ALU    | 52
Load   | 25
Store  | 10
Branch | 11
Jump   | 2
Unit | Time (ps)
Memory         | 200
ALU and adders | 100
Register File  | 50
Pipelining Performance
Example 8.
Clock cycle time
  Single-cycle = 200 + 50 + 100 + 200 + 50 = 600 ps
  Multi-cycle = 200 ps
  Pipeline = 200 ps
CPI
  Single-cycle = 1
  Multi-cycle = 5 x 0.25 + 4 x 0.52 + 4 x 0.10 + 3 x 0.11 + 3 x 0.02 = 4.12
  Pipeline = 0.125 x 2 + 0.125 x 1 + 0.52 x 1 + 0.1 x 1 + 0.11 x 4 + 0.02 x 2 = 1.475
Execution time per instruction
  Single-cycle = 600 ps
  Multi-cycle = 4.12 x 200 ps = 824 ps
  Pipeline = 1.475 x 200 ps = 295 ps
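The CPI and execution-time figures of Example 8 can be recomputed with the Python sketch below. The per-instruction cycle counts are the ones used in the solution above (for the pipeline: 2 cycles for a load with a load-use hazard, 4 for a branch, 2 for a jump); the dictionary names are illustrative.

```python
# Instruction mix from Example 8 (fractions of all instructions).
mix = {"alu": 0.52, "load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02}

# Cycles per instruction type in the multi-cycle implementation.
multi_cycle_cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

# Effective cycles per instruction type in the pipeline:
# half of the loads take 2 cycles (load-use stall), the other half take 1.
pipeline_cycles = {
    "alu": 1,
    "load": 0.5 * 2 + 0.5 * 1,
    "store": 1,
    "branch": 4,  # 1 cycle plus 3 stall cycles
    "jump": 2,    # completed in the ID stage
}


def average_cpi(cycles):
    return sum(mix[kind] * cycles[kind] for kind in mix)


single_cycle_ps = 200 + 50 + 100 + 200 + 50  # IM, Reg read, ALU, DM, Reg write
cycle_ps = 200                               # slowest unit (memory)

multi_cpi = average_cpi(multi_cycle_cycles)
pipeline_cpi = average_cpi(pipeline_cycles)

print(f"Single-cycle: {single_cycle_ps} ps per instruction")
print(f"Multi-cycle:  CPI = {multi_cpi:.2f}, {multi_cpi * cycle_ps:.0f} ps per instruction")
print(f"Pipeline:     CPI = {pipeline_cpi:.3f}, {pipeline_cpi * cycle_ps:.0f} ps per instruction")
```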
Pipelining Performance
Example 9. Redo Example 8 assuming that branch prediction is employed and one quarter of the branch instructions are mispredicted.
Exceptions & Interrupts
Exceptions and interrupts are unexpected events that require a change in the flow of control
The two terms are often used interchangeably, depending on the ISA
  Intel x86 uses the term interrupt only
  In MIPS
    Exceptions: any internal unexpected change in the flow (undefined opcode, overflow, system calls)
    Interrupts: the event is external (I/O controller request)
Dealing with them
  Is a challenging part of processor design
  Affects performance
Exceptions & Interrupts
In MIPS, when an exception is generated, the following sequence of steps is taken
  The address of the offending instruction is saved into a special register called the Exception Program Counter (EPC).
  The cause of the exception is saved in a special register called the Cause Register.
  Control is transferred to the operating system by loading a special address (0x8000 0180) into the PC.
  The code starting at this address determines what actions the operating system takes in response to the exception, based on the value found in the Cause Register.
  The operating system may terminate the program or resume the execution using the value found in the EPC.
Overflow Exception
Modifications to the Datapath
The exception is raised in the execute stage (the offending instruction is in the ID/EX register)
Modifications (part of the hardware is already available from what is used to correct branch mispredictions)
  Add the Cause and EPC registers
  Clear the control signals of the offending instruction (add the muxes in the EX stage). These are needed instead of an EX/MEMFlush signal since the exception is dealt with within the same cycle.
  Flush the IF/ID and ID/EX registers
  Expand the MUX at the PC input to include the exception address
Fallacies Fallacy 1. Pipelining is easy ! Not true ! Hazards complicate the operation Fallacy 2. Pipelining is independent of technology! Why didn’t we have pipelined processors before ? Advanced technology allowed more transistors and thus more operations !
Reading Assignment Read the following from the textbook Section 4.9 – Exceptions Section 4.10 – Parallelism and Advanced Instruction Level Parallelism