1 Stalls and flushes  So far, we have discussed data hazards that can occur in pipelined CPUs if some instructions depend upon others that are still executing.

Slides:



Advertisements
Similar presentations
Morgan Kaufmann Publishers The Processor
Advertisements

ECE 445 – Computer Organization
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Pipelined Processor.
Part 2 - Data Hazards and Forwarding 3/24/04++
Review: MIPS Pipeline Data and Control Paths
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
Chapter Six Enhancing Performance with Pipelining
1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.
 The actual result $1 - $3 is computed in clock cycle 3, before it’s needed in cycles 4 and 5  We forward that value to later instructions, to prevent.
Lecture 28: Chapter 4 Today’s topic –Data Hazards –Forwarding 1.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.
Pipeline Data Hazards: Detection and Circumvention Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly.
Automobile Manufacturing 1. Build frame. 60 min. 2. Add engine. 50 min. 3. Build body. 80 min. 4. Paint. 40 min. 5. Finish.45 min. 275 min. Latency: Time.
Pipelined Datapath and Control
CPE432 Chapter 4B.1Dr. W. Abu-Sufah, UJ Chapter 4B: The Processor, Part B-2 Read Section 4.7 Adapted from Slides by Prof. Mary Jane Irwin, Penn State University.
Chapter 4 CSF 2009 The processor: Pipelining. Performance Issues Longest delay determines clock period – Critical path: load instruction – Instruction.
Basic Pipelining & MIPS Pipelining Chapter 6 [Computer Organization and Design, © 2007 Patterson (UCB) & Hennessy (Stanford), & Slides Adapted from: Mary.
CMPE 421 Parallel Computer Architecture
CMPE 421 Parallel Computer Architecture Part 2: Hardware Solution: Forwarding.
CS 1104 Help Session IV Five Issues in Pipelining Colin Tan, S
CECS 440 Pipelining.1(c) 2014 – R. W. Allison [slides adapted from D. Patterson slides with additional credits to M.J. Irwin]
1 (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,
2/15/02CSE Data Hazzards Data Hazards in the Pipelined Implementation.
CSE431 L07 Overcoming Data Hazards.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 07: Overcoming Data Hazards Mary Jane Irwin (
Computing Systems Pipelining: enhancing performance.
CSIE30300 Computer Architecture Unit 05: Overcoming Data Hazards Hsin-Chou Chi [Adapted from material by and
CDA 3101 Summer 2003 Introduction to Computer Organization Pipeline Control And Pipeline Hazards 17 July 2003.
CMPE 421 Parallel Computer Architecture Part 3: Hardware Solution: Control Hazard and Prediction.
CPE432 Chapter 4B.1Dr. W. Abu-Sufah, UJ Chapter 4B: The Processor, Part B-1 Read Sections 4.7 Adapted from Slides by Prof. Mary Jane Irwin, Penn State.
LECTURE 9 Pipeline Hazards. PIPELINED DATAPATH AND CONTROL In the previous lecture, we finalized the pipelined datapath for instruction sequences which.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
Spr 2016, Mar 9... ELEC / Lecture 7 1 ELEC / Computer Architecture and Design Spring 2016 Pipeline Control and Performance.
CSE 340 Computer Architecture Spring 2016 Overcoming Data Hazards.
Computer Organization
Stalling delays the entire pipeline
Note how everything goes left to right, except …
CDA 3101 Spring 2016 Introduction to Computer Organization
IT 251 Computer Organization and Architecture
Single Clock Datapath With Control
Appendix C Pipeline implementation
ECS 154B Computer Architecture II Spring 2009
ECS 154B Computer Architecture II Spring 2009
ECE232: Hardware Organization and Design
Forwarding Now, we’ll introduce some problems that data hazards can cause for our pipelined processor, and show how to handle them with forwarding.
Chapter 4 The Processor Part 3
Review: MIPS Pipeline Data and Control Paths
Morgan Kaufmann Publishers The Processor
Chapter 6 Enhancing Performance with Pipelining
Morgan Kaufmann Publishers The Processor
Morgan Kaufmann Publishers The Processor
SOLUTIONS CHAPTER 4.
Pipelining review.
Single-cycle datapath, slightly rearranged
Computer Organization CS224
A pipeline diagram Clock cycle lw $t0, 4($sp) IF ID
Pipelining in more detail
The Processor Lecture 3.6: Control Hazards
The Processor Lecture 3.5: Data Hazards
Pipeline Control unit (highly abstracted)
Pipelining (II).
Introduction to Computer Organization and Architecture
Stalls and flushes Last time, we discussed data hazards that can occur in pipelined CPUs if some instructions depend upon others that are still executing.
©2003 Craig Zilles (derived from slides by Howard Huang)
Pipelined datapath and control
Need to stall for one cycle.
ELEC / Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.
Presentation transcript:

1 Stalls and flushes  So far, we have discussed data hazards that can occur in pipelined CPUs if some instructions depend upon others that are still executing. —Many hazards can be resolved by forwarding data from the pipeline registers, instead of waiting for the writeback stage. —The pipeline continues running at full speed, with one instruction beginning on every clock cycle.  Now, we’ll see some real limitations of pipelining. —Forwarding may not work for data hazards from load instructions. —Branches affect the instruction fetch for the next clock cycle.  In both of these cases we may need to slow down, or stall, the pipeline.

2 Data hazard review  A data hazard arises if one instruction needs data that isn’t ready yet. —Below, the AND and OR both need to read register $2. —But $2 isn’t updated by SUB until the fifth clock cycle.  Dependency arrows that point backwards indicate hazards. DM Reg IM DM Reg IM DM Reg IM sub$2, $1, $3 and$12, $2, $5 or$13, $6, $2 Clock cycle

3 Forwarding  The desired value ($1 - $3) has actually already been computed—it just hasn’t been written to the registers yet.  Forwarding allows other instructions to read ALU results directly from the pipeline registers, without going through the register file. DM Reg IM DM Reg IM DM Reg IM sub$2, $1, $3 and$12, $2, $5 or$13, $6, $2 Clock cycle

4 What about loads?  Imagine if the first instruction in the example was LW instead of SUB. —How does this change the data hazard? DM Reg IM DM Reg IM lw$2, 20($3) and$12, $2, $5 Clock cycle

5 What about loads?  Imagine if the first instruction in the example was LW instead of SUB. —The load data doesn’t come from memory until the end of cycle 4. —But the AND needs that value at the beginning of the same cycle!  This is a “true” data hazard—the data is not available when we need it. DM Reg IM DM Reg IM lw$2, 20($3) and$12, $2, $5 Clock cycle

6 Stalling  The easiest solution is to stall the pipeline.  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble.  Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU. DM Reg IM DM Reg IM lw$2, 20($3) and$12, $2, $5 Clock cycle

7 Stalling and forwarding  Without forwarding, we’d have to stall for two cycles to wait for the LW instruction’s writeback stage.  In general, you can always stall to avoid hazards—but dependencies are very common in real code, and stalling often can reduce performance by a significant amount. DM Reg IM DM Reg IM lw$2, 20($3) and$12, $2, $5 Clock cycle

8 Stalling delays the entire pipeline  If we delay the second instruction, we’ll have to delay the third one too. —Why? DM Reg IM DM Reg IM DMReg IM lw$2, 20($3) and$12, $2, $5 or$13, $12, $2 Clock cycle

9 Stalling delays the entire pipeline  If we delay the second instruction, we’ll have to delay the third one too. —This is necessary to make forwarding work between AND and OR. —It also prevents problems such as two instructions trying to write to the same register in the same cycle. DM Reg IM DM Reg IM DMReg IM lw$2, 20($3) and$12, $2, $5 or$13, $12, $2 Clock cycle

10  One way to implement a stall is to force the two instructions after LW to pause and remain in their ID and IF stages for one extra cycle.  This is easily accomplished. —Don’t update the PC, so the current IF stage is repeated. —Don’t update the IF/ID register, so the ID stage is also repeated. Reg Implementing stalls DM Reg IM RegIM lw$2, 20($3) and$12, $2, $5 or$13, $12, $2 DMReg IM DM Reg Clock cycle

11  But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?  Those units aren’t used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s. Reg What about EXE, MEM, WB DM Reg IM RegIM lw$2, 20($3) and$12, $2, $5 or$13, $12, $2 DMReg IM DM Reg Clock cycle

12 Stall = Nop conversion  The effect of a load stall is to insert an empty or nop instruction into the pipeline DM Reg IM RegIM lw$2, 20($3) and-> nop and$12, $2, $5 or$13, $12, $2 DMReg IM DMReg Clock cycle DM Reg

13 Detecting stalls  Detecting stall is much like detecting data hazards.  Recall the format of hazard detection equations: if (EX/MEM.RegWrite = 1 and EX/MEM.RegisterRd = ID/EX.RegisterRs) then Bypass Rs from EX/MEM stage latch DM Reg IM DM Reg IM sub$2, $1, $3 and$12, $2, $5 id/exif/id ex/mem mem\wb id/exif/id ex/mem mem\wb

14 Detecting Stalls, cont.  When should stalls be detected? Reg DM Reg IM RegIM lw$2, 20($3) and$12, $2, $5 DM Reg id/exif/id ex/mem mem\wb id/ex if/id ex/mem mem\wb if/id  What is the stall condition? if ( ) then stall

15 Detecting stalls  We can detect a load hazard between the current instruction in its ID stage and the previous instruction in the EX stage just like we detected data hazards.  A hazard occurs if the previous instruction was LW... ID/EX.MemRead = 1...and the LW destination is one of the current source registers. ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt  The complete test for stalling is the conjunction of these two conditions. if (ID/EX.MemRead = 1 and (ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt)) then stall

16 Adding hazard detection to the CPU 0 1 Addr Instruction memory Instr Address Write data Data memory Read data 1010 PC Extend ALUSrc Result Zero ALU Instr [15 - 0] RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers Rd Rt 0101 IF/ID ID/EX EX/MEM MEM/WB EX M WB Control M WB Rs Forwarding Unit EX/MEM.RegisterRd MEM/WB.RegisterRd Hazard Unit

17 Adding hazard detection to the CPU

18 The hazard detection unit  The hazard detection unit’s inputs are as follows. —IF/ID.RegisterRs and IF/ID.RegisterRt, the source registers for the current instruction. —ID/EX.MemRead and ID/EX.RegisterRt, to determine if the previous instruction is LW and, if so, which register it will write to.  By inspecting these values, the detection unit generates three outputs. —Two new control signals PCWrite and IF/ID Write, which determine whether the pipeline stalls or continues. —A mux select for a new multiplexer, which forces control signals for the current EX and future MEM/WB stages to 0 in case of a stall.

19 Generalizing Forwarding/Stalling  What if data memory access was so slow, we wanted to pipeline it over 2 cycles?  How many bypass inputs would the muxes in EXE have?  Which instructions in the following require stalling and/or bypassing? lwr13, 0(r11) addr7, r8, r9 addr15, r7, r13 Clock cycle DM RegIM Reg

20 Branches in the original pipelined datapath Read address Instruction memory Instruction [31-0] Address Write data Data memory Read data MemWrite MemRead 1010 MemToReg 4 Shift left 2 PCPC Add 1010 PCSrc Sign extend ALUSrc Result Zero ALU ALUOp Instr [15 - 0] RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers RegWrite Add Instr [ ] Instr [ ] IF/ID ID/EX EX/MEM MEM/WB EX M WB Control M WB When are they resolved?

21 Branches  Most of the work for a branch computation is done in the EX stage. —The branch target address is computed. —The source registers are compared by the ALU, and the Zero flag is set or cleared accordingly.  Thus, the branch decision cannot be made until the end of the EX stage. —But we need to know which instruction to fetch next, in order to keep the pipeline running! —This leads to what’s called a control hazard. DM Reg IM beq$2, $3, Label ? ? ? IM Clock cycle

22 Stalling is one solution  Again, stalling is always one possible solution.  Here we just stall until cycle 4, after we do make the branch decision. DM Reg IM beq$2, $3, Label ? ? ? DMReg IM Clock cycle IM

23 Branch prediction  Another approach is to guess whether or not the branch is taken. —In terms of hardware, it’s easier to assume the branch is not taken. —This way we just increment the PC and continue execution, as for normal instructions.  If we’re correct, then there is no problem and the pipeline keeps going at full speed. DM Reg IM beq$2, $3, Label next instruction 1 next instruction 2 DMReg IM Clock cycle DMReg IM

24 Branch misprediction  If our guess is wrong, then we would have already started executing two instructions incorrectly. We’ll have to discard, or flush, those instructions and begin executing the right ones from the branch target address, Label. DM Reg IM beq$2, $3, Label next instruction 1 next instruction 2 Label:... RegIM Clock cycle IM DMReg IM flush

25 Performance gains and losses  Overall, branch prediction is worth it. —Mispredicting a branch means that two clock cycles are wasted. —But if our predictions are even just occasionally correct, then this is preferable to stalling and wasting two cycles for every branch.  All modern CPUs use branch prediction. —Accurate predictions are important for optimal performance. —Most CPUs predict branches dynamically—statistics are kept at run- time to determine the likelihood of a branch being taken.  The pipeline structure also has a big impact on branch prediction. —A longer pipeline may require more instructions to be flushed for a misprediction, resulting in more wasted time and lower performance. —We must also be careful that instructions do not modify registers or memory before they get flushed.

26 Implementing branches  We can actually decide the branch a little earlier, in ID instead of EX. —Our sample instruction set has only a BEQ. —We can add a small comparison circuit to the ID stage, after the source registers are read.  Then we would only need to flush one instruction on a misprediction. DM Reg IM beq$2, $3, Label next instruction 1 Label:... IM Clock cycle DMReg IM flush

27 Implementing flushes  We must flush one instruction (in its IF stage) if the previous instruction is BEQ and its two source registers are equal.  We can flush an instruction from the IF stage by replacing it in the IF/ID pipeline register with a harmless nop instruction. —MIPS uses sll $0, $0, 0 as the nop instruction. —This happens to have a binary encoding of all 0s:  Flushing introduces a bubble into the pipeline, which represents the one- cycle delay in taking the branch.  The IF.Flush control signal shown on the next page implements this idea, but no details are shown in the diagram.

28 Branching without forwarding and load stalls 0 1 Addr Instruction memory Instr Address Write data Data memory Read data 1010 Extend ALUSrc Result Zero ALU RegDst Read register 1 Read register 2 Write register Write data Read data 2 Read data 1 Registers Rd Rt 0101 IF/ID ID/EX EX/MEM MEM/WB EX M WB Control M WB = Add Shift left 2 4 PCPC 1010 PCSrc IF.Flush The other stuff just won’t fit!

29 Timing  If no prediction: IF ID EX MEM WB IF IF ID EX MEM WB --- lost 1 cycle  If prediction: —If Correct IF ID EX MEM WB IF ID EX MEM WB -- no cycle lost —If Misprediction: IF ID EX MEM WB IF0 IF1 ID EX MEM WB cycle lost

30 Summary  Three kinds of hazards conspire to make pipelining difficult.  Structural hazards result from not having enough hardware available to execute multiple instructions simultaneously. —These are avoided by adding more functional units (e.g., more adders or memories) or by redesigning the pipeline stages.  Data hazards can occur when instructions need to access registers that haven’t been updated yet. —Hazards from R-type instructions can be avoided with forwarding. —Loads can result in a “true” hazard, which must stall the pipeline.  Control hazards arise when the CPU cannot determine which instruction to fetch next. —We can minimize delays by doing branch tests earlier in the pipeline. —We can also take a chance and predict the branch direction, to make the most of a bad situation.