Lecture 8 Pipeline Hazard

Peng Liu

Pipelining and Clock Cycle Time
- Min clock cycle = longest combinational delay + FF setup + clock skew
- Pipelining reduces the combinational delay
  - Less work per pipeline stage
  - Ideally, N stages reduce the delay to 1/N
- Best you can achieve is clock cycle = FF setup + clock skew
  - Diminishing returns from ever longer pipelines
  - Imbalance between stages also reduces the benefit of subdividing
- Even if you could continuously improve clock frequency
  - Power consumption ∝ Frequency^2
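
As a rough illustration of the diminishing returns described above, the following Python sketch evaluates the ideal cycle time for several pipeline depths. The function name and the delay values (10 ns of logic, 0.3 ns FF setup, 0.2 ns skew) are assumptions chosen for illustration, and the logic is assumed to split perfectly evenly across stages.

```python
def min_clock_cycle(comb_delay_ns, stages, ff_setup_ns, skew_ns):
    """Ideal minimum cycle time when the logic is split evenly into `stages`."""
    return comb_delay_ns / stages + ff_setup_ns + skew_ns


# Hypothetical numbers, not taken from the lecture.
COMB, SETUP, SKEW = 10.0, 0.3, 0.2
base = min_clock_cycle(COMB, 1, SETUP, SKEW)

for n in (1, 2, 5, 10, 20, 100):
    t = min_clock_cycle(COMB, n, SETUP, SKEW)
    # The speedup flattens out because setup + skew never shrinks.
    print(f"{n:3d} stages: cycle = {t:5.2f} ns, speedup = {base / t:5.2f}x")
```

However deep the pipeline gets, the cycle time in this model never drops below FF setup + clock skew (0.5 ns here).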

Pipelining & CPI: Dependencies and Hazards
- Hazards: situations that prevent starting the next instruction in the next cycle
  - Wasted cycles, CPI > 1
- Hazards are due to dependencies between instructions
  - Two instructions share resources or data
  - Pipelining may lead to overlapping their execution
- Types of hazards
  - Structural hazard (resource conflict): two instructions need to use the same piece of hardware
  - Data hazard: an instruction depends on the result of an instruction still in the pipeline
  - Control hazard: instruction fetch depends on the result of an instruction in the pipeline

Structural Hazards Resource conflict
- Occurs when two instructions try to use the same hardware
- Often arises when a functional unit is not fully pipelined
- Simple example: MIPS pipeline with a single unified memory
  - No separate instruction and data memories
  - Load/store requires a data access
  - Instruction fetch would have to stall for that cycle
  - Would cause a pipeline “bubble”
- Stalling is also used for units that are not fully pipelined (mult, div)

Data Dependencies
- Data dependencies for instruction j following instruction i
  - Read after Write (RAW, true dependence): j tries to read an operand before i writes it
  - Write after Write (WAW, output dependence): j tries to write an operand before i writes its value
  - Write after Read (WAR, anti dependence): j tries to write a destination before it is read by i
- No such thing as a Read after Read (RAR) hazard, since there is never a problem reading twice
- Dependencies are a property of your program (always there)
- Dependencies may lead to hazards on a specific pipeline
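
The three dependence types can be checked mechanically from the destination and source registers of two instructions. The sketch below is only an illustration: the `Instr` record and the `dependencies` helper are hypothetical names, and special cases such as a hard-wired zero register are ignored.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def dependencies(i: Instr, j: Instr) -> Set[str]:
    """Dependencies of the later instruction j on the earlier instruction i."""
    deps = set()
    if i.dest is not None and i.dest in j.srcs:
        deps.add("RAW")       # j reads what i writes
    if i.dest is not None and i.dest == j.dest:
        deps.add("WAW")       # both write the same register
    if j.dest is not None and j.dest in i.srcs:
        deps.add("WAR")       # j overwrites something i still has to read
    return deps


add = Instr("add r1,r2,r3", "r1", ("r2", "r3"))
sub = Instr("sub r4,r1,r5", "r4", ("r1", "r5"))
xor = Instr("xor r4,r1,r5", "r4", ("r1", "r5"))

print(dependencies(add, sub))   # sub reads r1 written by add
print(dependencies(sub, xor))   # sub and xor both write r4
```

Whether any of these dependencies becomes a hazard depends on the pipeline, which is the point of the last bullet above.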

Dealing with RAW Hazards
- Must keep our “promise” in the instruction set
  - Each instruction fully completes before the next one starts
  - All RAW dependencies are respected
- Pipelining may break this promise
  - Overlapping i and j
  - i writes late in the pipeline (WB); j reads early (ID)
- Must ensure that programmers cannot observe this behavior
  - Without necessarily reverting to a single-cycle design

The Five Stages of Load Instruction
[Figure: lw passing through IFetch, Dec, Exec, Mem, WB in Cycle 1 to Cycle 5]
- IFetch: instruction fetch and PC update
- Dec: register fetch and instruction decode
- Exec: execute R-type; calculate memory address
- Mem: read/write the data from/to the data memory
- WB: write the result data into the register file
Each of these five steps takes one clock cycle to complete.

Pipelined MIPS Processor
- Start the next instruction while still working on the current one
  - Improves throughput or bandwidth: total amount of work done in a given time (average instructions per second or per clock)
  - Instruction latency is not reduced (time from the start of an instruction to its completion)
  - Pipeline clock cycle (pipeline stage time) is limited by the slowest stage
  - For some instructions, some stages are wasted cycles
[Figure: lw, sw, and R-type overlapped in the pipeline, each passing through IFetch, Dec, Exec, Mem, WB; time axis Cycle 1 to Cycle 8]
Notes:
- Latency = execution time (delay or response time): the total time from start to finish of ONE instruction
- For processors, one important measure is THROUGHPUT (or execution bandwidth): the total amount of work done in a given amount of time
- For memories, one important measure is BANDWIDTH: the amount of information communicated across an interconnect (e.g., a bus) per unit time; the number of operations performed per second (the WIDTH of the operation and the RATE of the operation)

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing diagrams for the three implementations]
- Single cycle implementation: Load in Cycle 1, Store in Cycle 2; the unused part of the Store cycle is waste
- Multiple cycle implementation (Cycle 1 to Cycle 10): lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; the R-type then starts with IFetch
- Pipeline implementation: lw, sw, and R-type each take IFetch, Dec, Exec, Mem, WB, overlapped; some stages are “wasted” cycles
Notes: These timing diagrams show the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles.

Multiple Cycle v. Pipeline, Bandwidth v. Latency
[Figure: the multiple cycle and pipeline timing diagrams for lw, sw, and R-type, repeated from the previous slide]
- Latency per lw = 5 clock cycles for both
- Bandwidth of lw is 1 per clock (IPC) for the pipeline vs. 1/5 IPC for multicycle
- Pipelining improves instruction bandwidth, not instruction latency
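
The cycle counts behind the latency and bandwidth comparison can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are made up, the pipeline is assumed to be hazard-free, and the 4-cycle R-type in the multicycle design is an assumption (the figure only shows lw taking 5 cycles and sw taking 4).

```python
def pipelined_cycles(n_instr, stages=5):
    """Cycles to finish n_instr instructions in an ideal pipeline with no stalls."""
    return stages + (n_instr - 1)


def multicycle_cycles(cycles_per_instr):
    """Cycles in a multicycle design, where instructions do not overlap."""
    return sum(cycles_per_instr)


# lw, sw, R-type; the R-type cycle count (4) is an assumption.
multicycle_cost = {"lw": 5, "sw": 4, "R-type": 4}

print("pipeline  :", pipelined_cycles(3), "cycles")
print("multicycle:", multicycle_cycles(multicycle_cost.values()), "cycles")

# Throughput approaches 1 instruction per clock as the pipeline fills.
for n in (3, 10, 1000):
    print(n, "instructions ->", round(n / pipelined_cycles(n), 3), "IPC")
```

For the three-instruction sequence the pipeline needs seven cycles, matching the slide notes, while the latency of each individual lw stays at five cycles.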

Pipelining the MIPS ISA
- What makes it easy
  - All instructions are the same length (32 bits)
    - Easier to fetch in the 1st stage and decode in the 2nd stage
  - Few instruction formats (three) with symmetry across formats
    - Can begin reading the register file in the 2nd stage
  - Memory operations can occur only in loads and stores
    - Can use the execute stage to calculate memory addresses
  - Each MIPS instruction writes at most one result and does so near the end of the pipeline
- What makes it hard
  - Structural hazards: what if we had only one memory?
  - Control hazards: what about branches?
  - Data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
- What do we need to add/modify in our MIPS datapath?
  - Registers between pipeline stages to isolate them
[Figure: pipelined MIPS datapath with stages IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack, separated by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB and driven by the system clock]
- Note two exceptions to the left-to-right flow
  - WB, which writes the result back into the register file in the middle of the datapath
  - Selection of the next value of the PC, where one input comes from the branch address calculated in the MEM stage
Notes:
- Only later instructions in the pipeline can be influenced by these two reverse data movements. The first one (WB to ID) leads to data hazards. The second one (MEM to IF) leads to control hazards.
- All instructions must update some state in the processor (the register file, the memory, or the PC), so a separate pipeline register would be redundant with the state that is updated and is not needed.
- The PC can be thought of as a pipeline register: the one that feeds the IF stage of the pipeline. Unlike all of the other pipeline registers, the PC is part of the visible architectural state; its content must be saved when an exception occurs (the contents of the other pipeline registers are discarded).

Graphically Representing MIPS Pipeline
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
[Figure: one instruction drawn as the sequence of resources it uses: IM, Reg, ALU, DM, Reg]

Why Pipeline? For Throughput!
[Figure: Inst 0 to Inst 4 in program order, each drawn as IM, Reg, ALU, DM, Reg and offset by one clock cycle along the time axis; the first cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle.

Important Observation
- Each functional unit can only be used once per instruction (since 4 other instructions are executing)
- If a functional unit is used at different stages by different instructions, this leads to hazards:
  - Load uses the Register File’s write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr)
  - R-type uses the Register File’s write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr)
- 2 ways to solve this pipeline hazard
Notes: I already told you that for the pipeline to work perfectly, each functional unit can ONLY be used once per instruction. What I have not told you is that this (the 1st bullet) is a necessary but NOT sufficient condition for the pipeline to work. The other condition to prevent pipeline hiccups is that each functional unit must be used at the same stage for all instructions. Here, the load instruction uses the Register File’s write port during its 5th stage, but the R-type instruction uses it during its 4th stage. This (5 versus 4) is what causes our problem. How do we solve it? We have 2 solutions.

Solution 1: Insert “Bubble” into the Pipeline
[Figure: a Load among R-type instructions over Cycle 1 to Cycle 9; a bubble is inserted after the Load so that the Load and the following R-type do not write the register file in the same cycle]
- Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
  - The control logic can be complex
  - Lose an instruction fetch and issue opportunity
- No instruction is started in Cycle 6!
Notes: The first solution is to insert a “bubble” into the pipeline AFTER the load instruction, pushing back by one cycle every instruction after the load that is already in the pipeline. At the same time, the bubble delays by one cycle the instruction fetch of the instruction that is about to enter the pipeline. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution has a negative impact on performance. Because of the “extra” stage (Mem) that the Load instruction has, we will not have one instruction finishing every cycle (see Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1, because the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.

Solution 2: Delay R-type’s Write by One Cycle
- Delay R-type’s register write by one cycle:
  - Now R-type instructions also use the Reg File’s write port at Stage 5
  - Mem stage is a NOP stage: nothing is being done
[Figure: R-type stages 1 to 5 are Ifetch, Reg/Dec, Exec, Mem, Wr; over Cycle 1 to Cycle 9 a mix of R-type and Load instructions flows through the pipeline, every instruction taking five stages]
Notes: One thing we can do is add a “NOP” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up with a more efficient pipeline.
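
The write-port conflict and its fix can be checked with a small model. The sketch below assumes one instruction is issued per cycle starting in Cycle 1; the function name and the dictionary layout are hypothetical.

```python
def write_port_conflicts(program, wb_stage):
    """Return {cycle: [(index, kind), ...]} for cycles with more than one register write.

    program  : instruction kinds, issued one per cycle starting at cycle 1
    wb_stage : stage number in which each kind uses the register file write port
    """
    writes = {}
    for idx, kind in enumerate(program):
        issue_cycle = idx + 1
        write_cycle = issue_cycle + wb_stage[kind] - 1
        writes.setdefault(write_cycle, []).append((idx, kind))
    return {c: w for c, w in writes.items() if len(w) > 1}


program = ["load", "rtype", "rtype"]

# R-type writes in its 4th stage: collides with the preceding load in cycle 5.
print(write_port_conflicts(program, {"load": 5, "rtype": 4}))

# Solution 2: R-type gets a NOP Mem stage and writes in its 5th stage.
print(write_port_conflicts(program, {"load": 5, "rtype": 5}))
```

With every instruction writing in the same stage, two instructions issued in different cycles can never reach the write port together, which is why Solution 2 needs no extra control logic.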

Can Pipelining Get Us Into Trouble?
- Yes: pipeline hazards
  - Structural hazards: attempt to use the same resource by two different instructions at the same time
  - Data hazards: attempt to use data before it is ready
    - An instruction’s source operands are produced by a prior instruction still in the pipeline
    - A load instruction followed immediately by an ALU instruction that uses the load operand as a source value
  - Control hazards: attempt to make a decision before the condition has been evaluated
    - Branch instructions
- Can always resolve hazards by waiting
  - Pipeline control must detect the hazard
  - Take action (or delay action) to resolve hazards

A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 to Inst 4 in program order, each using Mem, Reg, ALU, Mem, Reg; in the cycle in which lw is reading data from memory, Inst 3 is reading its instruction from the same memory]

How About Register File Access?
[Figure: add r1, followed by Inst 1, Inst 2, add r2,r1, and Inst 4 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Potential read before write data hazard

How About Register File Access?
[Figure: add r1, followed by Inst 1, Inst 2, add r2,r1, and Inst 4 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Potential read before write data hazard
Can fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
[Figure: add r1,r2,r3 followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Which are read before write data hazards?

Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
[Figure: add r1,r2,r3 followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Read before write data hazards
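
Which of the four following instructions actually has a read before write hazard can be worked out from stage timing. The sketch below assumes the five-stage pipeline of this lecture with no forwarding, a producer that writes in WB (stage 5), consumers that read in ID (stage 2), and the split-cycle register file from the earlier slide; the function name is hypothetical.

```python
def raw_hazard(distance, split_cycle_regfile=True, id_stage=2, wb_stage=5):
    """True if a consumer `distance` instructions after the producer reads stale data.

    With the producer fetched in cycle 1, it writes the register in cycle wb_stage,
    and the consumer reads the register file in cycle distance + id_stage.
    """
    read_cycle = distance + id_stage
    write_cycle = wb_stage
    if split_cycle_regfile:
        # Write in the first half, read in the second half of the same cycle is fine.
        return read_cycle < write_cycle
    return read_cycle <= write_cycle


followers = ["sub r4,r1,r5", "and r6,r1,r7", "or r8,r1,r9", "xor r4,r1,r5"]
for distance, text in enumerate(followers, start=1):
    print(f"{text:14s} hazard: {raw_hazard(distance)}")
```

Under these assumptions only sub and and read r1 too early; or reads it in the same cycle that add writes it, and xor reads it afterwards.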

Loads Can Cause Data Hazards
Dependencies backward in time cause hazards
[Figure: lw r1,100(r2) followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Load-use data hazard
Notes: lw is just another example of register usage (beyond ALU ops).

One Way to “Fix” a Data Hazard
Can fix a data hazard by waiting (stall), but this affects throughput
[Figure: add r1,r2,r3 followed by two stall cycles, then sub r4,r1,r5 and and r6,r1,r7]

Another Way to “Fix” a Data Hazard
Can fix a data hazard by forwarding results as soon as they are available to where they are needed.
[Figure: add r1,r2,r3 followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in program order, each drawn as IM, Reg, ALU, DM, Reg, with forwarding paths from add to the later instructions]
Notes: Forwarding paths are valid only if the destination stage is later in time than the source stage. Forwarding is harder if there are multiple results to forward per instruction, or if they need to write a result early in the pipeline.

Forwarding with Load-use Data Hazards
[Figure: lw r1,100(r2) followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in program order, each drawn as IM, Reg, ALU, DM, Reg]
Will still need one stall cycle even with forwarding
Notes: lw is just another example of register usage (beyond ALU ops). Need to stall even with forwarding when the data hazard involves a load.

Branch Instructions Cause Control Hazards
Dependencies backward in time cause hazards
[Figure: beq followed by lw, Inst 3, and Inst 4 in program order, each drawn as IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Control Hazard
Can fix a branch hazard by waiting (stall), but this affects throughput
[Figure: beq followed by three stall cycles, then lw and Inst 3]
Notes: Another “solution” is to put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline. That would reduce the number of stalls to only one. A third approach is to use prediction to handle branches, e.g., always predict that branches will be untaken. When right, the pipeline proceeds at full speed. When wrong, have to stall (and make sure nothing that shouldn’t have completes and changes machine state).

Corrected Datapath to Save RegWrite Addr
Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory) with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB carrying the write register address back to the register file]

MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode and held in the state registers between pipeline stages
[Figure: the same pipelined datapath with a Control unit in the ID stage feeding control fields through ID/EX, EX/MEM, and MEM/WB]

Control Settings
[Table: rows R-type, lw, sw, beq; columns grouped by stage. EX stage: RegDst, ALUOp1, ALUOp0, ALUSrc. MEM stage: Brch, MemRead, MemWrite. WB stage: RegWrite, MemtoReg.]
Q: Why not show write enable for pipeline registers?
A: Written every clock cycle (like PC)
Q: Why not show control for IF and ID stages?
A: Control is the same for all instructions in the IF and ID stages: fetch instruction, increment PC

Other Pipeline Structures Are Possible
- What about a (slow) multiply operation?
  - Let it take two cycles
  [Figure: pipeline drawn as IM, Reg, MUL spanning two stages, Reg]
- What if the data memory access is twice as slow as the instruction memory?
  - Make the clock twice as slow, or
  - Let the data memory access take two cycles (and keep the same clock rate)
  [Figure: pipeline drawn as IM, Reg, ALU, DM1, DM2, Reg]
Notes: We don’t need the output of MUL until the WB cycle, so we can span two pipeline stages with the MUL hardware (so the multiplier is a two-stage pipelined multiplier).

Sample Pipeline Alternatives (for ARM ISA)
- ARM7 (3-stage pipeline): stages drawn as IM, Reg, EX
- StrongARM-1 (5-stage pipeline): stages drawn as IM, Reg, ALU, DM, Reg
- XScale (7-stage pipeline): stages drawn with the labels IM1, IM2, Reg, SHFT, ALU, DM1, DM2, Reg
[Figure: per-stage activity labels for the three pipelines: PC update, BTB access, start IM access, IM access, decode, reg access, reg 1 access, reg 2 access, shift/rotate, ALU op, start DM access, DM access, DM write, exception, reg write, commit result (write back)]

Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- Associate resources with states
  - Be sure there are no structural hazards: one use per clock cycle
- Add pipeline registers between stages to balance the clock cycle
  - Amdahl’s Law suggests splitting the longest stage
- Resolve all data and control dependencies
  - If backwards in time in the pipeline drawing to registers => data hazard: forward or stall to resolve them
  - If backwards in time in the pipeline drawing to PC => control hazard
- Assert control in the appropriate stage
- Develop test instruction sequences likely to uncover pipeline bugs
  - If you don’t test it, it won’t work

Data Forwarding (aka Bypassing)
- Applies to any data dependence line that goes backwards in time
  - EX stage generating R-type ALU results or effective address calculation
  - MEM stage generating lw results
- Forward by taking the inputs to the ALU from any pipeline register rather than just ID/EX, by
  - adding multiplexors to the inputs of the ALU so Rd can be passed back to either (or both) of the EX stage’s Rs and Rt ALU inputs
    - 00: normal input (ID/EX pipeline registers)
    - 10: forward from previous instr (EX/MEM pipeline registers)
    - 01: forward from instr 2 back (MEM/WB pipeline registers)
  - adding the proper control hardware
- With forwarding, can run at full speed even in the presence of data dependencies

Data Forwarding Control Conditions (1/4)
- EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU):
    if (EX/MEM.RegisterRd == ID/EX.RegisterRs) ForwardA = 10
    if (EX/MEM.RegisterRd == ID/EX.RegisterRt) ForwardB = 10
- MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU):
    if (MEM/WB.RegisterRd == ID/EX.RegisterRs) ForwardA = 01
    if (MEM/WB.RegisterRd == ID/EX.RegisterRt) ForwardB = 01
- “RegisterRd” is the number of the register to be written (RD or RT)
- “RegisterRs” is the number of the RS register
- “RegisterRt” is the number of the RT register
- “ForwardA, ForwardB” control the forwarding muxes
- What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)

Data Forwarding Control Conditions (2/4)
- EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU, provided it writes):
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
- MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU, provided it writes):
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
- What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)

Data Forwarding Control Conditions (3/4)
- EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU, provided it writes and != R0):
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
- MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU, provided it writes and != R0):
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
- What’s wrong with this hazard control?

Yet Another Complication!
Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction. Which should be forwarded? The more recent result!
[Figure: add $1,$1,$2 followed by add $1,$1,$3 and add $1,$1,$4 in program order, each drawn as IM, Reg, ALU, DM, Reg]

Corrected Data Forwarding Control Conditions
MEM/WB hazard:
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRs)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs || ~EX/MEM.RegWrite))
      ForwardA = 01
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRt)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRt || ~EX/MEM.RegWrite))
      ForwardB = 01
Forward if this instruction writes AND it’s not writing R0 AND this dest reg == source AND the in-between instr either has a dest. reg. that doesn’t match OR doesn’t write a reg.
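
One way to see that the corrected conditions pick the more recent result is to model the forwarding decision for a single ALU input. This is a minimal sketch, not the lecture's hardware: `StageRegs` and `forward_select` are hypothetical names, and checking EX/MEM first is used here as a simpler stand-in for the explicit "EX/MEM does not match or does not write" term, adequate for the cases exercised below.

```python
from dataclasses import dataclass


@dataclass
class StageRegs:
    reg_write: bool   # RegWrite control bit held in the pipeline register
    rd: int           # destination register number (RegisterRd)


def forward_select(ex_mem: StageRegs, mem_wb: StageRegs, src: int) -> str:
    """Mux select for one ALU input whose source register number is `src`."""
    # EX/MEM hazard: the previous instruction holds the most recent value.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "10"
    # MEM/WB hazard: only reached when EX/MEM does not supply the value.
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "01"
    return "00"       # normal input from ID/EX


# Third add of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
# Both older instructions write $1; the EX/MEM copy is the more recent one.
print(forward_select(StageRegs(True, 1), StageRegs(True, 1), src=1))

# Previous instruction does not write a register (e.g. a store): use MEM/WB.
print(forward_select(StageRegs(False, 1), StageRegs(True, 1), src=1))
```

ForwardA and ForwardB would each call this function once, with RegisterRs and RegisterRt respectively.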

Datapath with Forwarding Hardware
[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; Control, ALU cntrl, Branch, PCSrc) extended with a Forward Unit]
- Control line inputs to the Forward Unit, EX/MEM.RegWrite and MEM/WB.RegWrite, are not shown on the diagram
- Other Forward Unit inputs: EX/MEM.RegisterRd, MEM/WB.RegisterRd, IF/ID.RegisterRs, IF/ID.RegisterRt
Notes: How many bits wide is each pipeline register now? ID/EX – x = = 157

Forwarding with Load-use Data Hazards
[Figure: lw r1,100(r2) followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5; after the lw a bubble (flush) enters the pipeline, and sub and every later instruction are held back one cycle]
Notes: The one case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register.

Load-use Hazard Detection Unit
- Need a hazard detection unit in the ID stage that inserts a stall between the load and its use
- ID hazard detection:
    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
          or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
      stall the pipeline
- The first line tests whether the instruction is a load; the next two lines check whether the destination register of the load in the EX stage matches either source register of the instruction in the ID stage
- After this 1-cycle stall, the forwarding logic can handle the remaining data hazards
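
The detection condition translates almost directly into code. In the sketch below the function name and the output signal names (`PCWrite`, `IF_ID_Write`, `ZeroControl`) are assumptions for illustration; the lecture only says that the unit controls writing of the PC and IF/ID registers and the selection of zeroed control values.

```python
def hazard_detect(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard check performed in the ID stage.

    id_ex_mem_read : True if the instruction in EX is a load
    id_ex_rt       : destination register of that load
    if_id_rs/rt    : source registers of the instruction in ID
    """
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,       # hold the PC during a stall
        "IF_ID_Write": not stall,   # hold the instruction in ID
        "ZeroControl": stall,       # send 0's down the pipeline as a bubble
    }


# lw r1,100(r2) in EX, sub r4,r1,r5 in ID: rs = 1 matches the load's rt = 1.
print(hazard_detect(True, 1, 1, 5))

# add in EX (not a load): forwarding handles it, no stall.
print(hazard_detect(False, 1, 1, 5))
```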

Stall Hardware
- In addition to the hazard detection unit, we have to implement the stall
- Prevent the IF and ID stage instructions from making progress down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing
  - Hazard detection unit controls the writing of the PC and IF/ID registers
- A bubble (noop) must be inserted into the back half of the pipeline, starting with the EX stage
  - Must deassert the control signals (setting them to 0) in the EX, MEM, and WB control fields of the ID/EX pipeline register
  - Hazard detection unit controls the multiplexor that chooses between the real control values and 0’s
  - Assume that 0’s are benign values in the datapath: nothing changes

Adding the Hazard Hardware
[Figure: the forwarding datapath extended with a Hazard Unit whose inputs include ID/EX.MemRead and ID/EX.RegisterRt, and which controls the PC, the IF/ID register, and the mux that zeros the control signals]
Notes: In reality, only the signals RegWrite and MemWrite need to be 0; the other control signals can be don’t cares. Another consideration is energy, where clock gating is called for.

Memory-to-Memory Copies
- For loads immediately followed by stores (memory-to-memory copies), can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input
  - Would need to add a Forward Unit to the memory access stage
[Figure: lw $1,10($2) followed by sw $1,10($3), each drawn as IM, Reg, ALU, DM, Reg]
Notes: Should avoid stalling on such a load. However, if $1 were used to compute the effective address of the sw, it would be a load-use data hazard and would require a stall between the lw and the sw.

Control Hazards
- When the flow of instruction addresses is not what the pipeline expects; incurred by change-of-flow instructions
  - Conditional branches (beq, bne)
  - Unconditional branches (j)
- Possible solutions
  - Stall
  - Move decision point earlier in the pipeline
  - Predict
  - Delay decision (requires compiler support)
- Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

Datapath Branch and Jump Hardware
[Figure: pipelined datapath with forwarding, extended with branch and jump hardware: a Shift left 2 unit and adder for the branch target, a Shift left 2 unit combined with PC+4[31-28] for the jump target, and PC muxes selected by Branch/PCSrc and Jump]

Jumps Incur One Stall
- Jumps are not decoded until ID, so one stall is needed
[Figure: j followed by one stall cycle, then lw and and]
- Fortunately, jumps are very infrequent: only 2% of the SPECint instruction mix
Notes: Jumps are less than 2% of the SPECint instructions and x% of the SPECfp ones.

Review: Branches Incur Three Stalls
Can fix a branch hazard by waiting (stall), but this affects throughput
[Figure: beq followed by three stall cycles, then lw and and]

Moving Branch Decisions Earlier in Pipe
- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall cycles to two
  - Adds an AND gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision in the ID stage
  - Reduces the number of stall cycles to one (like with jumps)
  - Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can’t be done until after the RegFile read, so comparing and updating the PC adds a comparator, an AND gate, and a 3x1 mux to the ID timing path
  - Need forwarding hardware in the ID stage
- For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution
Notes: Want a small branch penalty. The second option (the one-stall implementation) needs more forwarding and hazard detection hardware, since a branch that depends on a result still in the pipeline (one of the source operands of the comparison logic) must have that result forwarded from the EX/MEM or MEM/WB pipeline latches.

Early Branch Forwarding Issues
- Bypass of source operands from the EX/MEM register:
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRs))
      ForwardC = 1
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRt))
      ForwardD = 1
  - Forwards the result from the second previous instr. to either input of the Compare
- MEM/WB “forwarding” is taken care of by the normal RegFile write-before-read operation
- If the instruction immediately before the branch produces one of the branch compare source operands, then a stall is required, since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation

Supporting ID Stage Branches
[Figure: pipelined datapath with the branch adder and a Compare unit moved into the ID stage, an IF.Flush control line into IF/ID, the Hazard Unit, and two Forward Units (one for the ALU inputs, one for the ID stage compare)]

Branch Prediction
- Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
- Predict not taken: always predict branches will not be taken and continue to fetch from the sequential instruction stream; only when the branch is taken does the pipeline stall
  - If taken, flush the instructions in the pipeline after the branch
    - in IF, ID, and EX if the branch logic is in MEM: three stalls
    - in IF if the branch logic is in ID: one stall
  - Ensure that those flushed instructions haven’t changed machine state; this is automatic in the MIPS pipeline since machine-state-changing operations are at the tail end of the pipeline (MemWrite or RegWrite)
  - Restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)
[Figure: instructions at addresses 4: beq $1,$2,2; 8: sub $4,$1,$5 (flushed); 16: and $6,$1,$7; 20: or $8,$1,$9; each drawn as IM, Reg, ALU, DM, Reg]
Notes: The branch address is PC-relative: branch to 4 + 4 + 2*4 = 16. To flush the IF stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop).
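
The PC-relative arithmetic in the note above (4 + 4 + 2*4 = 16) can be written as a one-line helper. The function name is an assumption; the offset is the word offset held in the branch instruction.

```python
def branch_target(branch_pc, offset_words):
    """PC-relative MIPS branch target: (PC of branch + 4) + offset * 4."""
    return branch_pc + 4 + offset_words * 4


# beq $1,$2,2 at address 4 branches to the `and` at address 16.
print(branch_target(4, 2))
```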

Branch Prediction, cont’d
- Resolve branch hazards by statically assuming a given outcome and proceeding
- Predict taken: always predict branches will be taken
  - Predict taken always incurs at least one stall (assuming the branch destination address hardware has been moved up to the ID stage)
  - Predict not taken is therefore easier, since the sequential instruction address can be computed in the IF stage
- As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance
- With more hardware, it is possible to try to predict branch behavior dynamically during program execution
- Dynamic branch prediction: predict branches at run-time using run-time information

Dynamic Branch Prediction
- A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was executed
  - The bit may predict incorrectly (it may be from a different branch with the same low-order PC bits, or may be a wrong prediction for this branch), but this doesn’t affect correctness, just performance
  - If the prediction is wrong, flush the incorrect instructions in the pipeline, restart the pipeline with the right instructions, and invert the prediction bit
- The BHT predicts when a branch is taken, but does not tell where it’s taken to!
  - A branch target buffer (BTB) in the IF stage can cache the branch target address (or even the branch target instruction) so that a stall can be avoided
Notes: With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. 4096 entries is about as good as an infinite table, but 4096 is a lot of hardware. Ask the class why one would want to store the branch target instruction in the BTB.

1-bit Prediction Accuracy
A 1-bit predictor on a loop branch mispredicts twice for each pass through the whole loop, even though the branch is not taken only once. Assume predict_bit = 0 to start (indicating branch not taken) and that loop control is at the bottom of the loop code.
  Loop: 1st loop instr
        2nd loop instr
        ...
        last loop instr
        bne $1,$2,Loop
        fall out instr
First time through the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; invert the prediction bit (predict_bit = 1). As long as the branch is taken (looping), the prediction is correct. Exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken and execution falls out of the loop; invert the prediction bit (predict_bit = 0). For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time.
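The 80% figure can be reproduced by replaying the loop branch outcomes through a 1-bit predictor. This is a minimal sketch; the function name `one_bit_accuracy` and the encoding of outcomes as booleans (True = taken) are assumptions made for illustration.

```python
def one_bit_accuracy(outcomes, predict_bit=False):
    """Fraction of branches predicted correctly by a 1-bit predictor.

    predict_bit=False means 'predict not taken', as in the text above.
    """
    correct = 0
    for taken in outcomes:
        if predict_bit == taken:
            correct += 1
        else:
            predict_bit = not predict_bit  # invert the bit on a misprediction
    return correct / len(outcomes)


# 10 times through the loop: taken 9 times, then not taken on exit
loop = [True] * 9 + [False]
print(one_bit_accuracy(loop))  # 0.8
```

The bit ends in the not-taken state, so the next execution of the same loop starts with another misprediction.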

2-bit Predictors
A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction bit is changed.
  Loop: 1st loop instr
        2nd loop instr
        ...
        last loop instr
        bne $1,$2,Loop
        fall out instr
The predictor is right on the 1st iteration, right 9 times in total, and wrong only on the loop fall out.
[Figure: 2-bit predictor state diagram with two Predict Taken states and two Predict Not Taken states; transitions are labeled Taken and Not taken.]
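One common way to realize a 2-bit scheme is a saturating counter, which is what the sketch below assumes; the slide's state diagram may differ in its exact transitions. The function name, the 0 to 3 state encoding, and the starting state (strongly taken) are illustrative choices.

```python
def two_bit_accuracy(outcomes, state=3):
    """Accuracy of a 2-bit saturating-counter predictor.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    Assumption: the counter starts at 3 (strongly predict taken).
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


loop = [True] * 9 + [False]
print(two_bit_accuracy(loop))      # 0.9
print(two_bit_accuracy(loop * 2))  # 0.9
```

After the loop exit the counter drops only to state 2, which still predicts taken, so re-entering the loop does not cost a second misprediction.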

Delayed Decision
First, move the branch decision hardware and target address calculation to the ID pipeline stage. A delayed branch always executes the next sequential instruction; the branch takes effect after that next instruction. MIPS software moves a safe instruction (one that is not affected by the branch) to immediately after the branch, thereby hiding the branch delay. As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed. Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches, and growth in available transistors has made dynamic approaches relatively cheaper. No processor uses delayed branches of more than 1 cycle. For longer branch delays, hardware-based branch prediction is used.

Scheduling Branch Delay Slots
A. From before branch
     add $1,$2,$3
     if $2=0 then
       delay slot
   becomes
     if $2=0 then
       add $1,$2,$3
B. From branch target
     sub $4,$5,$6
     ...
     add $1,$2,$3
     if $1=0 then
       delay slot
   becomes
     add $1,$2,$3
     if $1=0 then
       sub $4,$5,$6
C. From fall through
     add $1,$2,$3
     if $1=0 then
       delay slot
     sub $4,$5,$6
   becomes
     add $1,$2,$3
     if $1=0 then
       sub $4,$5,$6
Limitations on delayed-branch scheduling come from 1) restrictions on the instructions that can be moved/copied into the delay slot and 2) limited ability to predict at compile time whether a branch is likely to be taken or not. A is the best choice: it fills the delay slot and reduces instruction count (IC). In B and C, the use of $1 in the branch condition prevents the add instruction from being moved to the delay slot. In B, the sub instruction may need to be copied, increasing IC, because it could be reached by another path. B is preferred when the branch is taken with high probability (such as loop branches). In B and C, it must be okay to execute sub when the branch does not go the expected way (not taken in B, taken in C).
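The legality check for case A (filling the slot from before the branch) comes down to one dependence test: the branch must not read the register that the moved instruction writes. The sketch below encodes only that test; the `Instr` record and the function name are hypothetical, and a real scheduler would check more than this.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def can_fill_from_before(candidate: Instr, branch: Instr) -> bool:
    """True if `candidate`, which sits just before `branch`, may be moved
    into the delay slot: the branch must not read what it writes."""
    return candidate.dest is None or candidate.dest not in branch.srcs


add = Instr("add $1,$2,$3", "$1", ("$2", "$3"))
branch_on_2 = Instr("if $2=0 then", None, ("$2",))
branch_on_1 = Instr("if $1=0 then", None, ("$1",))

print(can_fill_from_before(add, branch_on_2))  # True  (case A)
print(can_fill_from_before(add, branch_on_1))  # False (cases B and C)
```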

In Conclusion
Data dependencies in pipelines are often resolved by forwarding. Need to be sure the prior instruction will write, its destination matches the source, and no other in-flight instruction has priority (the most recent write wins). Need forwarding hardware every place a result can be forwarded to (EX stage, MEM stage for store, ID stage for early branch); stall if a stage needs to wait for a result. Loads require a stall, since the load's MEM stage overlaps the EX stage of the dependent instruction. Branches may require a stall too. Control hazards are reduced via delayed branch/jump in the ISA, static branch prediction, and dynamic branch prediction. With prediction, the hard part of the design is recovering from misprediction.
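The forwarding conditions listed above (the producer will write, its destination matches the source, and the most recent producer wins) can be written as a small selection function, together with the load-use stall check. This is a sketch under assumptions: the signal names follow the common textbook convention for the 5-stage MIPS pipeline (EX/MEM, MEM/WB, RegWrite, MemRead), and the function names are made up.

```python
def forward_select(id_ex_rs: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> str:
    """Pick the source for one ALU operand in the EX stage."""
    # Most recent producer first: the instruction now in MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    # Otherwise the older producer, now in WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "REG"  # no hazard: use the value read from the register file


def load_use_stall(id_ex_memread: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """A load followed immediately by a consumer needs one stall cycle."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)


# Both older instructions write $1; the more recent one (EX/MEM) wins.
print(forward_select(1, True, 1, True, 1))  # EX/MEM
print(load_use_stall(True, 2, 2, 3))        # True
```

Register 0 is excluded because $0 is hardwired to zero in MIPS, so a write to it must never be forwarded.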

Summary: Designing a Pipelined Processor
Go back and examine your data path and control diagram. Associate resources with states. Be sure there are no structural hazards: one use per clock cycle. Add pipeline registers between stages to balance the clock cycle; Amdahl’s Law suggests splitting the longest stage. Resolve all data and control dependencies. If an arrow goes backwards in time in the pipeline drawing to registers, it is a data hazard: forward or stall to resolve it. If an arrow goes backwards in time in the pipeline drawing to the PC, it is a control hazard: we’ll see next time. A 5-stage pipeline with reads early in a stage and writes later in the same stage avoids WAR/WAW hazards. Assert control in the appropriate stage. Develop test instruction sequences likely to uncover pipeline bugs. (If you don’t test it, it won’t work.)

Acknowledgements
These slides contain material from courses: UCB CS152 and Stanford EE108B.
