1 Lecture 8 Pipeline Hazard Peng Liu
2 Pipelining and Clock Cycle Time
Min clock cycle = longest combinatorial delay + FF setup + clock skew
Pipelining reduces the combinatorial delay
–Less work per pipeline stage
–Ideally, N stages reduce the delay to 1/N
–The best you can achieve is clock cycle = FF setup + clock skew
Diminishing returns from ever longer pipelines
Imbalance between stages also reduces the benefit of subdividing
Even if you could continuously improve clock frequency
–Power consumption ∝ Frequency^2
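As a rough sketch of the cycle-time relation above, the following Python snippet assumes the combinational logic splits perfectly evenly across stages (an idealization) and uses made-up delay values; the function names are hypothetical.

```python
def min_clock_cycle(comb_delay, n_stages, ff_setup, clock_skew):
    """Cycle time when comb_delay is split evenly over n_stages (idealized)."""
    return comb_delay / n_stages + ff_setup + clock_skew


def speedup_over_unpipelined(comb_delay, n_stages, ff_setup, clock_skew):
    """Ratio of the unpipelined cycle time to the pipelined cycle time."""
    unpipelined = min_clock_cycle(comb_delay, 1, ff_setup, clock_skew)
    return unpipelined / min_clock_cycle(comb_delay, n_stages, ff_setup, clock_skew)


if __name__ == "__main__":
    # Illustrative numbers only (ns): 10 of logic, 0.5 setup, 0.5 skew.
    for n in (1, 5, 10, 20, 100):
        print(n, min_clock_cycle(10.0, n, 0.5, 0.5),
              round(speedup_over_unpipelined(10.0, n, 0.5, 0.5), 2))
```

With these numbers the speedup stays below N for every N, and the cycle time never drops below setup plus skew, which is the diminishing-returns point the slide makes.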
3 Pipelining & CPI: Dependencies and Hazards
Hazards: situations that prevent starting the next instruction in the next cycle
–Wasted cycles, CPI > 1
Hazards are due to dependencies between instructions
–Two instructions share resources or data
–Pipelining may lead to overlapping their execution
Types of hazards
–Structural hazard (resource conflict): two instructions need to use the same piece of hardware
–Data hazard: an instruction depends on the result of an instruction still in the pipeline
–Control hazard: instruction fetch depends on the result of an instruction in the pipeline
4 Structural Hazards
Resource conflict
–Occurs when two instructions try to use the same hardware
–Often arises when a functional unit is not fully pipelined
Simple example: MIPS pipeline with a single unified memory (no separate instruction & data memories)
–Load/store requires data access
–Instruction fetch would have to stall for that cycle, which would cause a pipeline “bubble”
–Stalls are also used for units that are not fully pipelined (mult, div)
5 Data Dependencies
Data dependencies for instruction j following instruction i
–Read after Write (RAW) (true dependence): instruction j tries to read an operand before instruction i writes it
–Write after Write (WAW) (output dependence): instruction j tries to write an operand before i writes its value
–Write after Read (WAR) (anti dependence): instruction j tries to write a destination before it is read by i
No such thing as a Read after Read (RAR) hazard, since there is never a problem reading twice
Dependencies are a property of your program (always there)
Dependencies may lead to hazards on a specific pipeline
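The three dependence types can be sketched as a small classifier. This is an illustrative Python model, not part of the original slides: `Instr` and `classify_dependencies` are hypothetical names, and registers are plain strings.

```python
from collections import namedtuple

# dest is the register written (or None); srcs are the registers read.
Instr = namedtuple("Instr", ["dest", "srcs"])


def classify_dependencies(i, j):
    """Return the dependence types of later instruction j on earlier instruction i."""
    deps = set()
    if i.dest is not None and i.dest in j.srcs:
        deps.add("RAW")  # j reads what i writes
    if i.dest is not None and i.dest == j.dest:
        deps.add("WAW")  # both write the same register
    if j.dest is not None and j.dest in i.srcs:
        deps.add("WAR")  # j overwrites something i still has to read
    return deps


if __name__ == "__main__":
    add = Instr("r1", ("r2", "r3"))   # add r1,r2,r3
    sub = Instr("r4", ("r1", "r5"))   # sub r4,r1,r5
    print(classify_dependencies(add, sub))
```

Whether any of these dependencies becomes a hazard depends on the pipeline, which is why the classifier says nothing about stalls.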
6 Dealing with RAW Hazards
Must keep our “promise” in the instruction set
–Each instruction fully completes before the next one starts
–All RAW dependencies are respected
Pipelining may break this promise
–Overlapping i and j
–i writes late in the pipeline (WB); j reads early (ID)
Must ensure that programmers cannot observe this behavior
–Without necessarily reverting to a single-cycle design
7 The Five Stages of Load Instruction
IFetch: Instruction fetch and update PC
Dec: Register fetch and instruction decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the data memory
WB: Write the result data into the register file
[Figure: lw occupies IFetch, Dec, Exec, Mem, WB in cycles 1 to 5]
8 Pipelined MIPS Processor
Start the next instruction while still working on the current one
–improves throughput or bandwidth: the total amount of work done in a given time (average instructions per second or per clock)
–instruction latency is not reduced (time from the start of an instruction to its completion)
–pipeline clock cycle (pipeline stage time) is limited by the slowest stage
–for some instructions, some stages are wasted cycles
[Figure: lw, sw, and R-type overlapped in time, each passing through IFetch, Dec, Exec, Mem, WB; time axis shows cycles 1 to 8]
9 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison of three implementations. Single cycle: Load and Store each take one long clock cycle, with part of the Store cycle marked as waste. Multiple cycle: lw (IFetch, Dec, Exec, Mem, WB), then sw (IFetch, Dec, Exec, Mem), then R-type, over cycles 1 to 10. Pipeline: lw, sw, and R-type overlapped, each using IFetch, Dec, Exec, Mem, WB, with some “wasted” cycles.]
10 Multiple Cycle v. Pipeline, Bandwidth v. Latency
[Figure: multiple-cycle and pipeline timing diagrams for lw, sw, and R-type over cycles 1 to 10]
Latency per lw = 5 clock cycles for both
Bandwidth of lw is 1 per clock (IPC) for pipeline vs. 1/5 IPC for multicycle
Pipelining improves instruction bandwidth, not instruction latency
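The bandwidth-versus-latency point can be checked with a little arithmetic. The sketch below is an illustration only: it assumes an ideal pipeline with no stalls, and the per-instruction cycle counts for the multicycle machine (5 for lw, 4 for sw and R-type) are the usual textbook values rather than numbers stated on the slide.

```python
def multicycle_cycles(cycles_per_instr):
    """Multicycle machine: instructions run back to back, never overlapped."""
    return sum(cycles_per_instr)


def pipelined_cycles(n_instrs, n_stages=5):
    """Ideal pipeline: n_stages cycles for the first instruction, then one per cycle."""
    return n_stages + (n_instrs - 1)


if __name__ == "__main__":
    # lw, sw, R-type as in the diagram (assumed 5, 4, 4 cycles on the multicycle design).
    print("multicycle:", multicycle_cycles([5, 4, 4]))
    print("pipelined: ", pipelined_cycles(3))
    # For long instruction streams the pipelined rate approaches 1 instruction per cycle.
    print("IPC over 1000 instrs:", 1000 / pipelined_cycles(1000))
```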
11 Pipelining the MIPS ISA
What makes it easy
–all instructions are the same length (32 bits): easier to fetch in the 1st stage and decode in the 2nd stage
–few instruction formats (three) with symmetry across formats: can begin reading the register file in the 2nd stage
–memory operations can occur only in loads and stores: can use the execute stage to calculate memory addresses
–each MIPS instruction writes at most one result and does so near the end of the pipeline
What makes it hard
–structural hazards: what if we had only one memory?
–control hazards: what about branches?
–data hazards: what if an instruction’s input operands depend on the output of a previous instruction?
12 MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath?
–registers between pipeline stages to isolate them: IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB
[Figure: MIPS datapath divided into five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) with a pipeline register between each pair of stages, all driven by the system clock]
13 Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
–how many cycles does it take to execute this code?
–what is the ALU doing during cycle 4?
–is there a hazard, why does it occur, and how can it be fixed?
14 Why Pipeline? For Throughput!
[Figure: Inst 0 to Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg; the initial cycles are labeled “Time to fill the pipeline”]
Once the pipeline is full, one instruction is completed every cycle
15 Important Observation
Each functional unit can be used only once per instruction (since 4 other instructions are executing)
If a functional unit is used at different stages by different instructions, hazards result:
–Load uses the Register File’s write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr)
–R-type uses the Register File’s write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr)
There are 2 ways to solve this pipeline hazard.
16 Solution 1: Insert “Bubble” into the Pipeline
Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
–The control logic can be complex.
–Lose an instruction fetch and issue opportunity.
No instruction is started in Cycle 6!
[Figure: a sequence of R-type and Load instructions over cycles 1 to 9, with a pipeline bubble inserted]
17 Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:
–Now R-type instructions also use the Reg File’s write port at Stage 5
–Mem stage is a NOP stage: nothing is being done.
[Figure: R-type and Load instructions over cycles 1 to 9, every instruction now using Ifetch, Reg/Dec, Exec, Mem, Wr]
18 Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
–structural hazards: attempt to use the same resource by two different instructions at the same time
–data hazards: attempt to use data before it is ready (instruction source operands are produced by a prior instruction still in the pipeline; for example, a load instruction followed immediately by an ALU instruction that uses the load operand as a source value)
–control hazards: attempt to make a decision before the condition has been evaluated (branch instructions)
Can always resolve hazards by waiting
–pipeline control must detect the hazard
–take action (or delay action) to resolve hazards
19 A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 to Inst 4 in instruction order against time (clock cycles), each drawn as Mem, Reg, ALU, Mem, Reg; in the same cycle one instruction is reading data from memory while another is reading an instruction from memory]
20 How About Register File Access?
[Figure: add r1, followed by Inst 1, Inst 2, add r2,r1, and Inst 4, each drawn as IM, Reg, ALU, DM, Reg]
Potential read before write data hazard
21 How About Register File Access?
Can fix the register file access hazard by doing writes in the first half of the cycle and reads in the second half.
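The split-cycle register file can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions (one write port, any number of read ports, register 0 hardwired to zero as in MIPS); the class and method names are hypothetical.

```python
class RegisterFile:
    """Toy 32-entry register file: write in the first half-cycle, read in the second."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write=None, reads=()):
        # First half of the cycle: perform the write-back, if any.
        if write is not None:
            rd, value = write
            if rd != 0:  # register 0 always reads as zero
                self.regs[rd] = value
        # Second half of the cycle: reads observe the value just written.
        return tuple(self.regs[r] for r in reads)


if __name__ == "__main__":
    rf = RegisterFile()
    # An instruction in WB writes r1 in the same cycle that another in ID reads r1.
    print(rf.cycle(write=(1, 42), reads=(1, 2)))
```

Because the write lands before the read within the cycle, the instruction three slots behind the producer sees the new value without any forwarding path.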
22 Register Usage Can Cause Data Hazards
Instruction order: add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: pipeline diagram of the five instructions, each drawn as IM, Reg, ALU, DM, Reg]
Dependencies backward in time cause hazards
Which are read before write data hazards?
23 Register Usage Can Cause Data Hazards
Instruction order: add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: same pipeline diagram, with the read before write data hazards marked]
Dependencies backward in time cause hazards
24 Loads Can Cause Data Hazards
Instruction order: lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: pipeline diagram of the five instructions, with the load-use data hazard marked]
Dependencies backward in time cause hazards
25 One Way to “Fix” a Data Hazard
[Figure: add r1,r2,r3, then stall, then sub r4,r1,r5 and and r6,r1,r7, each drawn as IM, Reg, ALU, DM, Reg]
Can fix a data hazard by waiting (stall), but this affects throughput
26 Another Way to “Fix” a Data Hazard
Instruction order: add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: pipeline diagram of the five instructions, each drawn as IM, Reg, ALU, DM, Reg]
Can fix a data hazard by forwarding results as soon as they are available to where they are needed.
28 Forwarding with Load-use Data Hazards
Instruction order: lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: pipeline diagram of the five instructions, each drawn as IM, Reg, ALU, DM, Reg]
Will still need one stall cycle even with forwarding
29 Branch Instructions Cause Control Hazards
[Figure: pipeline diagram with beq, lw, Inst 3, and Inst 4, each drawn as IM, Reg, ALU, DM, Reg]
Dependencies backward in time cause hazards
30 One Way to “Fix” a Control Hazard
[Figure: beq, then stall, then lw and Inst 3]
Can fix a branch hazard by waiting (stall), but this affects throughput
31 Corrected Datapath to Save RegWrite Addr Need to preserve the destination register address in the pipeline state registers
33 MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode
–and held in the state registers between pipeline stages
34 Control Settings
Control signals by stage:
–EX stage: RegDst, ALUOp1, ALUOp0, ALUSrc
–MEM stage: Branch, MemRead, MemWrite
–WB stage: RegWrite, MemtoReg
Rows: R-type, lw, sw, beq (for sw and beq, RegDst and MemtoReg are don’t-cares, X)
Q: Why not show write enable for pipeline registers?
A: Written every clock cycle (like PC)
Q: Why not show control for IF and ID stages?
A: Control is the same for all instructions in the IF and ID stages: fetch instruction, increment PC
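The control table can be written down as a lookup keyed by instruction class. The numeric entries below are the standard textbook settings for this five-stage MIPS datapath and are supplied here as an assumption, since the slide text preserves only the don’t-care (X) entries; `control_for` and `NOP_CONTROL` are hypothetical names.

```python
X = None  # don't-care

SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",   # EX stage
           "Branch", "MemRead", "MemWrite",          # MEM stage
           "RegWrite", "MemtoReg")                   # WB stage

# Assumed values (standard textbook table), in SIGNALS order.
CONTROL = {
    "R-type": (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":     (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":     (X, 0, 0, 1, 0, 0, 1, 0, X),
    "beq":    (X, 0, 1, 0, 1, 0, 0, 0, X),
}

# All-zero control word: what a bubble (nop) carries down the pipeline.
NOP_CONTROL = dict.fromkeys(SIGNALS, 0)


def control_for(kind):
    """Return the control word decoded in ID for one instruction class."""
    return dict(zip(SIGNALS, CONTROL[kind]))


if __name__ == "__main__":
    for kind in CONTROL:
        print(kind, control_for(kind))
```

In a pipelined design this word is produced once in ID and then carried through the ID/EX, EX/MEM, and MEM/WB registers, with each stage consuming its own group of signals.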
35 Other Pipeline Structures Are Possible
What about a (slow) multiply operation?
–let it take two cycles
What if the data memory access is twice as slow as the instruction memory?
–make the clock twice as slow or …
–let the data memory access take two cycles (and keep the same clock rate)
[Figures: a pipeline with a MUL unit alongside the ALU; a pipeline with the data memory access split into DM1 and DM2]
36 Sample Pipeline Alternatives (for ARM ISA)
ARM7 (3-stage pipeline): IM, Reg, EX
–PC update, IM access; decode, reg access; ALU op, DM access, shift/rotate, commit result (write back)
StrongARM-1 (5-stage pipeline): IM, Reg, ALU, DM, Reg
XScale (7-stage pipeline): IM1, IM2, Reg, SHFT, ALU, DM1, DM2
–PC update, BTB access, start IM access; IM access; decode, reg 1 access; shift/rotate, reg 2 access; ALU op; start DM access, exception; DM write, reg write
37 Peer Instruction
Suppose a big data cache results in a data cache latency of 2 clock cycles and a 6-stage pipeline. (Pipelined, so can do 1 access / clock cycle.) What is the impact?
1. Instruction bandwidth is now 5/6 of the 5-stage pipeline
2. Instruction bandwidth is now 1/2 of the 5-stage pipeline
3. The branch delay slot is now 2 instructions
4. The load-use hazard can be with 2 instructions following the load
5. Both 3 and 4: branch delay and load-use now 2 instructions
6. None of the above
[Figure: 1st, 2nd, and 3rd lw over cycles 1 to 7, each using Ifetch, Reg/Dec, Exec, Mem1, Mem2, Wr]
39 Peer Instruction
Suppose a big I-cache results in an I-cache latency of 2 clock cycles and a 6-stage pipeline. (Pipelined, so can do 1 access / clock cycle.) What is the impact?
1. Instruction bandwidth is now 5/6 of the 5-stage pipeline
2. Instruction bandwidth is now 1/2 of the 5-stage pipeline
3. The branch delay slot is now 2 instructions
4. The load-use hazard can be with 2 instructions following the load
5. Both 3 and 4: branch delay and load-use now 2 instructions
6. None of the above
[Figure: 1st lw over cycles 1 to 7, using Ifetch1, Ifetch2, Reg/Dec, Exec, Mem, Wr]
42 Peer Instruction
Suppose we use a 4-stage pipeline that combines the memory access and write back stages for all instructions but load, stalling when there are structural hazards. Impact?
1. The branch delay slot is now 0 instructions
2. Most loads cause a stall, since there is often a structural hazard on reg. writes
3. Most stores cause a stall, since they have a structural hazard
4. Both 2 & 3: most loads & stores cause a stall due to structural hazards
5. Most loads cause a stall, but there is no load-use hazard anymore
6. Both 2 & 3, but there is no load-use hazard anymore
7. None of the above
[Figure: 1st add, 2nd lw, 3rd add over cycles 1 to 7; the adds use Ifetch, Reg/Dec, Exec, Mem/Wr, while the lw uses Ifetch, Reg/Dec, Exec, Mem, Wr]
Q: Why not say every load stalls?
A: Not all next instructions write in the Wr stage
43 Designing a Pipelined Processor
Go back and examine your data path and control diagram
Associate resources with states
–Be sure there are no structural hazards: one use / clock cycle
Add pipeline registers between stages to balance clock cycle
–Amdahl’s Law suggests splitting the longest stage
Resolve all data and control dependencies
–If backwards in time in pipeline drawing to registers => data hazard: forward or stall to resolve them
–If backwards in time in pipeline drawing to PC => control hazard
Assert control in the appropriate stage
Develop test instruction sequences likely to uncover pipeline bugs
–If you don’t test it, it won’t work
44 MIPS Pipeline Data and Control Paths
[Figure: pipelined MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and a Control unit; labeled signals are RegWrite, MemWrite, MemRead, MemtoReg, RegDst, ALUOp, ALUSrc, Branch, and PCSrc]
45 Data Forwarding (aka Bypassing)
Applies to any data dependence line that goes backwards in time
–EX stage generating R-type ALU results or effective address calculation
–MEM stage generating lw results
Forward by taking the inputs to the ALU from any pipeline register rather than just ID/EX by
–adding multiplexors to the inputs of the ALU so that Rd can be passed back to either (or both) of the EX stage’s Rs and Rt ALU inputs
  00: normal input (ID/EX pipeline registers)
  10: forward from previous instr (EX/MEM pipeline registers)
  01: forward from instr 2 back (MEM/WB pipeline registers)
–adding the proper control hardware
With forwarding, the pipeline can run at full speed even in the presence of data dependencies (load-use hazards excepted: they still need one stall cycle)
46 Data Forwarding Control Conditions (1/4)
–“RegisterRd” is the number of the register to be written (RD or RT)
–“RegisterRs” is the number of the RS register
–“RegisterRt” is the number of the RT register
–“ForwardA, ForwardB” control the forwarding muxes
1. EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU):
if (EX/MEM.RegisterRd == ID/EX.RegisterRs) ForwardA = 10
if (EX/MEM.RegisterRd == ID/EX.RegisterRt) ForwardB = 10
2. MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU):
if (MEM/WB.RegisterRd == ID/EX.RegisterRs) ForwardA = 01
if (MEM/WB.RegisterRd == ID/EX.RegisterRt) ForwardB = 01
What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)
47 Data Forwarding Control Conditions (2/4)
1. EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU, provided it writes):
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
2. MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU, provided it writes):
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)
48 Data Forwarding Control Conditions (3/4)
1. EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU, provided it writes and != R0):
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
2. MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU, provided it writes and != R0):
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
What’s wrong with this hazard control?
49 Yet Another Complication!
Instruction order: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
[Figure: pipeline diagram of the three instructions, each drawn as IM, Reg, ALU, DM, Reg]
Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction: which should be forwarded? The more recent result!
50 Corrected Data Forwarding Control Conditions
2. MEM/WB hazard:
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs) and (EX/MEM.RegisterRd != ID/EX.RegisterRs || ~EX/MEM.RegWrite)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRt) and (EX/MEM.RegisterRd != ID/EX.RegisterRt || ~EX/MEM.RegWrite)) ForwardB = 01
Forward if this instruction writes AND it’s not writing R0 AND this dest reg == source AND the in-between instr either has a dest. reg. that doesn’t match OR doesn’t write a reg.
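The forwarding conditions for one ALU input can be sketched as a single priority check. This Python model is illustrative only: the function name and flat argument list are hypothetical, and checking EX/MEM first is one way to express the "more recent result wins" rule rather than a transcription of the slide’s expression.

```python
def forward_select(ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd,
                   id_ex_src):
    """Return the mux select ('00', '10', or '01') for one ALU input.

    id_ex_src is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB.
    """
    # EX/MEM hazard: the previous instruction produces the operand.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return "10"
    # MEM/WB hazard: reached only when EX/MEM does not supply a newer value.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_src:
        return "01"
    return "00"  # no hazard: use the value read from the register file


if __name__ == "__main__":
    # add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4 with the third add in EX:
    # both older adds target $1, and the newer (EX/MEM) result must win.
    print(forward_select(True, 1, True, 1, 1))
```

The earlier buggy versions correspond to dropping one of the guards: without the RegWrite test a sw or beq can trigger forwarding, and without the `!= 0` test a write to R0 can.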
51 Datapath with Forwarding Hardware
[Figure: pipelined MIPS datapath with a Forward Unit added]
52 Datapath with Forwarding Hardware
[Figure: same datapath with the signals IF/ID.RegisterRs, IF/ID.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd labeled]
53 Forwarding with Load-use Data Hazards
Instruction order: lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; xor r4,r1,r5; or r8,r1,r9
[Figure: pipeline diagram in which the instructions after the lw are shown again one cycle later; the inserted slot is labeled “flush”]
54 Load-use Hazard Detection Unit
Need a hazard detection unit in the ID stage that inserts a stall between the load and its use
ID hazard detection:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
or (ID/EX.RegisterRt == IF/ID.RegisterRt))) stall the pipeline
The first line tests whether the instruction is a load; the next two lines check whether the destination register of the load in the EX stage matches either source register of the instruction in the ID stage
After this 1-cycle stall, the forwarding logic can handle the remaining data hazards
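The detection condition maps directly onto a small predicate. The sketch below is a behavioral illustration with a hypothetical function name; register numbers are plain integers.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


if __name__ == "__main__":
    # lw r1,100(r2) in EX (destination Rt = 1), sub r4,r1,r5 in ID (Rs = 1, Rt = 5).
    print(load_use_stall(True, 1, 1, 5))
    # Same register pattern, but the EX instruction is not a load: forwarding suffices.
    print(load_use_stall(False, 1, 1, 5))
```

When the predicate is true, the PC and IF/ID register hold their values for a cycle and the control fields entering ID/EX are zeroed, which is the stall mechanism described on the next slide.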
55 Stall Hardware
In addition to the hazard detection unit, we have to implement the stall
Prevent the IF and ID stage instructions from making progress down the pipeline, by preventing the PC register and the IF/ID pipeline register from changing
–Hazard detection unit controls the writing of the PC and IF/ID registers
The instructions in the back half of the pipeline, starting with the EX stage, must be flushed (execute nop)
–Must deassert the control signals (setting them to 0) in the EX, MEM, and WB control fields of the ID/EX pipeline register
–Hazard detection unit controls the multiplexor that chooses between the real control values and 0’s
–Assume that 0’s are benign values in the datapath: nothing changes
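56 Adding the Hazard Hardware
[Figure: pipelined datapath with a Hazard Unit added, driving a 0/1 multiplexor on the control signals]
57 Adding the Hazard Hardware
[Figure: same datapath with the Hazard Unit inputs ID/EX.RegisterRt and ID/EX.MemRead labeled, and 0 as the alternative multiplexor input]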
58 Memory-to-Memory Copies
Instruction order: lw $1,10($2); sw $1,10($3)
For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input.
–Would need to add a Forward Unit to the memory access stage
–Should avoid stalling on such a load
59 Control Hazards
When the flow of instruction addresses is not what the pipeline expects; incurred by change of flow instructions
–Conditional branches (beq, bne)
–Unconditional branches (j)
Possible solutions
–Stall
–Move decision point earlier in the pipeline
–Predict
–Delay decision (requires compiler support)
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
60 Datapath Branch and Jump Hardware
[Figure: pipelined datapath with Forward Unit, before the branch and jump hardware is drawn in]
61 Datapath Branch and Jump Hardware
[Figure: same datapath with branch and jump hardware added: shift-left-2 units, a branch target adder, the Branch and PCSrc signals, the Jump signal, and PC+4[31-28]]
62 Jumps Incur One Stall
[Figure: j, then stall, then lw and and, each drawn as IM, Reg, ALU, DM, Reg]
Jumps are not decoded until ID, so one stall is needed
Fortunately, jumps are very infrequent: only 2% of the SPECint instruction mix
63 Review: Branches Incur Three Stalls
[Figure: beq, then stall, then lw and and]
Can fix a branch hazard by waiting (stall), but this affects throughput
64 Moving Branch Decisions Earlier in Pipe
Move the branch decision hardware back to the EX stage
–Reduces the number of stall cycles to two
–Adds an AND gate and a 2x1 mux to the EX timing path
Add hardware to compute the branch target address and evaluate the branch decision in the ID stage
–Reduces the number of stall cycles to one (like with jumps)
–Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
–Comparing the registers can’t be done until after the RegFile read, so comparing and updating the PC adds a comparator, an AND gate, and a 3x1 mux to the ID timing path
–Need forwarding hardware in the ID stage
For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution
65 Early Branch Forwarding Issues
Bypass of source operands from the EX/MEM
if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRs)) ForwardC = 1
if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRt)) ForwardD = 1
–Forwards the result from the second previous instr. to either input of the Compare
MEM/WB “forwarding” is taken care of by the normal RegFile write before read operation
If the instruction immediately before the branch produces one of the branch compare source operands, then a stall will be required, since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation
66 Supporting ID Stage Branches
[Figure: pipelined datapath with the branch hardware moved to the ID stage: a Compare unit after the RegFile with its own Forward Unit, a branch target adder, the Hazard Unit, the IF.Flush signal, and the EX-stage Forward Unit]
67 Branch Prediction
Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
1. Predict not taken: always predict branches will not be taken and continue to fetch from the sequential instruction stream; only when a branch is taken does the pipeline stall
–If taken, flush the instructions in the pipeline after the branch
  in IF, ID, and EX if branch logic is in MEM: three stalls
  in IF if branch logic is in ID: one stall
–ensure that those flushed instructions haven’t changed machine state; this is automatic in the MIPS pipeline, since machine state changing operations are at the tail end of the pipeline (MemWrite or RegWrite)
–restart the pipeline at the branch destination
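The cost of predict-not-taken can be estimated with a one-line CPI model. This is a back-of-the-envelope sketch: the branch frequency and taken fraction below are illustrative assumptions, not figures from the slides, and all other stall sources are ignored.

```python
def cpi_predict_not_taken(branch_freq, taken_frac, flush_cycles, base_cpi=1.0):
    """CPI when only taken branches pay the flush penalty."""
    return base_cpi + branch_freq * taken_frac * flush_cycles


if __name__ == "__main__":
    # Hypothetical mix: 20% branches, half of them taken.
    for where, penalty in (("MEM", 3), ("ID", 1)):
        cpi = cpi_predict_not_taken(0.20, 0.5, penalty)
        print(f"branch resolved in {where}: CPI = {cpi:.2f}")
```

Under these assumed numbers, moving the branch decision from MEM to ID cuts the branch overhead from 0.3 to 0.1 cycles per instruction.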
68 Flushing with Misprediction (Not Taken)
Instruction order (address, instruction): 4 beq $1,$2,2; 8 sub $4,$1,$5
To flush the IF stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)
69 Flushing with Misprediction (Not Taken)
Instruction order (address, instruction): 4 beq $1,$2,2; 8 sub $4,$1,$5 (flush); 16 and $6,$1,$7; 20 or $8,$1,$9
[Figure: pipeline diagram of the four instructions, each drawn as IM, Reg, ALU, DM, Reg]
To flush the IF stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)
70 Branch Prediction, cont’d
Resolve branch hazards by statically assuming a given outcome and proceeding
2. Predict taken: always predict branches will be taken
–Predict taken always incurs a stall (if branch destination hardware has been moved to the ID stage)
As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance
With more hardware, it is possible to try to predict branch behavior dynamically during program execution
3. Dynamic branch prediction: predict branches at run-time using run-time information
71 Dynamic Branch Prediction
A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was executed
–The bit may predict incorrectly (it may be from a different branch with the same low order PC bits, or may be a wrong prediction for this branch), but this doesn’t affect correctness, just performance
–If the prediction is wrong, flush the incorrect instructions in the pipeline, restart the pipeline with the right instructions, and invert the prediction bit
The BHT predicts when a branch is taken, but does not tell where it’s taken to!
–A branch target buffer (BTB) in the IF stage can cache the branch target address (or even the branch target instruction) so that a stall can be avoided
72 1-bit Prediction Accuracy
A 1-bit predictor in a loop is incorrect twice when not taken
For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time
–Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code
1. First time through the loop, the predictor mispredicts the branch, since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1)
2. As long as the branch is taken (looping), the prediction is correct
3. Exiting the loop, the predictor again mispredicts the branch, since this time the branch is not taken, falling out of the loop; invert prediction bit (predict_bit = 0)
Loop code: Loop: 1st loop instr; 2nd loop instr; ...; last loop instr; bne $1,$2,Loop; fall out instr
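The 80% figure can be reproduced with a short simulation. This is a minimal sketch of a single 1-bit predictor entry (no table indexing or aliasing); the function name is hypothetical.

```python
def one_bit_accuracy(outcomes, predict_bit=False):
    """Fraction of correct predictions for a 1-bit (last-outcome) predictor."""
    correct = 0
    for taken in outcomes:
        if predict_bit == taken:
            correct += 1
        predict_bit = taken  # remember only the most recent outcome
    return correct / len(outcomes)


if __name__ == "__main__":
    # Ten executions of the loop branch: taken 9 times, then not taken on exit.
    loop_branch = [True] * 9 + [False]
    print(one_bit_accuracy(loop_branch))  # 0.8
```

The two misses are the first iteration (bit still says not taken) and the loop exit, matching steps 1 and 3 above.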
73 2-bit Predictors
A 2-bit scheme can give 90% accuracy, since a prediction must be wrong twice before the prediction bit is changed.
[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states, with transitions labeled Taken and Not taken]
Loop code: Loop: 1st loop instr; 2nd loop instr; ...; last loop instr; bne $1,$2,Loop; fall out instr
74 2-bit Predictors
[Figure: same state diagram annotated for the loop: right on 1st iteration, right 9 times, wrong on loop fall out]
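A similar simulation shows where the 90% comes from. The sketch below uses a 2-bit saturating counter, which is a common variant and an assumption here: the slide’s state diagram may wire its transitions slightly differently, though both behave the same on this loop pattern. The starting state (strongly taken) is also an assumption.

```python
def two_bit_accuracy(outcomes, state=3):
    """Accuracy of a 2-bit saturating counter; states 2 and 3 predict taken."""
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]
    # Run the loop twice: the single miss per pass is the loop exit.
    print(two_bit_accuracy(loop_branch * 2))  # 0.9
```

Unlike the 1-bit scheme, the loop exit only weakens the prediction, so the first iteration of the next pass is still predicted taken.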
75 Delayed Decision
First, move the branch decision hardware and target address calculation to the ID pipeline stage
A delayed branch always executes the next sequential instruction; the branch takes effect after that next instruction
–MIPS software moves an instruction that is not affected by the branch (a safe instruction) to immediately after the branch, thereby hiding the branch delay
As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed.
–Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
–Growth in available transistors has made dynamic approaches relatively cheaper
76 Scheduling Branch Delay Slots
A. From before branch: add $1,$2,$3; if $2=0 then [delay slot] becomes if $2=0 then [add $1,$2,$3]
B. From branch target: add $1,$2,$3; if $1=0 then [delay slot] becomes add $1,$2,$3; if $1=0 then [sub $4,$5,$6]
C. From fall through: add $1,$2,$3; if $1=0 then [delay slot]; sub $4,$5,$6 becomes add $1,$2,$3; if $1=0 then [sub $4,$5,$6]
A is the best choice: it fills the delay slot & reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
In B and C, it must be okay to execute sub when the branch fails
77 In Conclusion
Data dependencies in pipelines are often solved by forwarding
–Need to be sure prior instructions will write, destination matches source, and no earlier instruction has priority
Need forwarding hardware every place where forwarding is possible; stall if a stage needs to wait for a result
–EX stage, MEM stage for store, ID stage for early branch
Loads require a stall, since the EX and MEM stages overlap
–Branches may require a stall too
Control hazards are improved via delayed branch/jump in the ISA, static prediction for branches, and dynamic prediction for branches
–If predicting, the hard part of the design is recovering from misprediction
78 Summary: Designing a Pipelined Processor
Go back and examine your data path and control diagram
Associate resources with states
–Be sure there are no structural hazards: one use / clock cycle
Add pipeline registers between stages to balance clock cycle
–Amdahl’s Law suggests splitting the longest stage
Resolve all data and control dependencies
–If backwards in time in pipeline drawing to registers => data hazard: forward or stall to resolve them
–If backwards in time in pipeline drawing to PC => control hazard: we’ll see next time
A 5-stage pipeline with reads early in the same stage and writes later in the same stage avoids WAR/WAW hazards
Assert control in the appropriate stage
Develop test instruction sequences likely to uncover pipeline bugs (If you don’t test it, it won’t work)
79 Acknowledgements These slides contain material from courses: –UCB CS152 –Stanford EE108B