John Lazzaro (www.cs.berkeley.edu/~lazzaro)

Slides:

Advertisements

Similar presentations

Pipeline Control Hazards: Detection and Circumvention Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly.

Advertisements

Pipeline Hazards CS365 Lecture 10. D. Barbara Pipeline Hazards CS465 2 Review  Pipelined CPU  Overlapped execution of multiple instructions  Each on.

ECE 445 – Computer Organization

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Pipelined Processor.

Review: MIPS Pipeline Data and Control Paths

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 18 - Pipelined.

CS 152 L12 RAW Hazards, Interrrupts (1)Fall 2004 © UC Regents CS152 – Computer Architecture and Engineering Lecture 12 – Pipeline Wrap up: Control Hazards,

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Chapter 6 Pipelining to Increase Effective Computer Speed.

 The actual result $1 - $3 is computed in clock cycle 3, before it’s needed in cycles 4 and 5  We forward that value to later instructions, to prevent.

Computer Organization Lecture Set – 06 Chapter 6 Huei-Yung Lin.

Control Hazards.1 Review: Datapath with Data Hazard Control Read Address Instruction Memory Add PC 4 Write Data Read Addr 1 Read Addr 2 Write Addr Register.

1 Stalls and flushes  So far, we have discussed data hazards that can occur in pipelined CPUs if some instructions depend upon others that are still executing.

Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.

Chapter 4B: The Processor, Part B. Review: Why Pipeline? For Performance! I n s t r. O r d e r Time (clock cycles) Inst 0 Inst 1 Inst 2 Inst 4 Inst 3.

Pipeline Data Hazards: Detection and Circumvention Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly.

Pipelined Datapath and Control

CPE432 Chapter 4B.1Dr. W. Abu-Sufah, UJ Chapter 4B: The Processor, Part B-2 Read Section 4.7 Adapted from Slides by Prof. Mary Jane Irwin, Penn State University.

Chapter 4 CSF 2009 The processor: Pipelining. Performance Issues Longest delay determines clock period – Critical path: load instruction – Instruction.

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

Basic Pipelining & MIPS Pipelining Chapter 6 [Computer Organization and Design, © 2007 Patterson (UCB) & Hennessy (Stanford), & Slides Adapted from: Mary.

CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.

Computer Architecture and Design – ELEN 350 Part 8 [Some slides adapted from M. Irwin, D. Paterson. D. Garcia and others]

CMPE 421 Parallel Computer Architecture Part 2: Hardware Solution: Forwarding.

CECS 440 Pipelining.1(c) 2014 – R. W. Allison [slides adapted from D. Patterson slides with additional credits to M.J. Irwin]

CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and

1 (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,

CSE431 L07 Overcoming Data Hazards.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 07: Overcoming Data Hazards Mary Jane Irwin (

Computing Systems Pipelining: enhancing performance.

1/24/ :00 PM 1 of 86 Pipelining Chapter 6. 1/24/ :00 PM 2 of 86 Overview of Pipelining Pipelining is an implementation technique in which.

CSIE30300 Computer Architecture Unit 05: Overcoming Data Hazards Hsin-Chou Chi [Adapted from material by and

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

CMPE 421 Parallel Computer Architecture Part 3: Hardware Solution: Control Hazard and Prediction.

CSIE30300 Computer Architecture Unit 06: Containing Control Hazards

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

Designing a Pipelined Processor

CPE432 Chapter 4B.1Dr. W. Abu-Sufah, UJ Chapter 4B: The Processor, Part B-1 Read Sections 4.7 Adapted from Slides by Prof. Mary Jane Irwin, Penn State.

LECTURE 9 Pipeline Hazards. PIPELINED DATAPATH AND CONTROL In the previous lecture, we finalized the pipelined datapath for instruction sequences which.

Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.

1 Lecture 8 Pipeline Hazard Peng Liu

CSE 340 Computer Architecture Spring 2016 Overcoming Data Hazards.

Stalling delays the entire pipeline

Note how everything goes left to right, except …

Lecture 08: Control Hazards

Appendix C Pipeline implementation

Chapter 4 The Processor Part 4

ECS 154B Computer Architecture II Spring 2009

ECS 154B Computer Architecture II Spring 2009

ECE232: Hardware Organization and Design

Processor Design: Pipeline Handling Hazards

Forwarding Now, we’ll introduce some problems that data hazards can cause for our pipelined processor, and show how to handle them with forwarding.

Chapter 4 The Processor Part 3

Review: MIPS Pipeline Data and Control Paths

Morgan Kaufmann Publishers The Processor

Morgan Kaufmann Publishers The Processor

EI 209 Computer Organization Fall 2017

Morgan Kaufmann Publishers Enhancing Performance with Pipelining

Computer Organization CS224

The Processor Lecture 3.6: Control Hazards

The Processor Lecture 3.5: Data Hazards

CSC3050 – Computer Architecture

Pipelining (II).

Morgan Kaufmann Publishers The Processor

©2003 Craig Zilles (derived from slides by Howard Huang)

Presentation transcript:

CS152 – Computer Architecture and Engineering Fall 2004 Lecture 11: Pipeline Control and Hazards John Lazzaro (www.cs.berkeley.edu/~lazzaro) Dave Patterson (www.cs.berkeley.edu/~patterson) [Adapted from Mary Jane Irwin’s slides www.cse.psu.edu/~cg431 ] Other handouts To handout next time

Recap last lecture All recent processors use pipelining Pipelining doesn’t help latency of single task, it helps throughput or bandwidth of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number of pipe stages Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Must detect and resolve hazards Stalling negatively affects throughput Today: pipeline control, including hazards

Pipeline registers written every clock cycle (like PC) Control Settings EX Stage MEM Stage WB Stage Reg Dst ALU Op1 ALU Op0 ALU Src Brch Mem Read Mem Write Reg Write Mem toReg R 1 lw sw X beq Pipeline registers written every clock cycle (like PC) Control same for all instructions in IF and ID stages: fetch instruction, increment PC

MIPS Pipeline Data and Control Paths 1 PCSrc ID/EX EX/MEM Control IF/ID Add MEM/WB Branch Add 4 RegWrite Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 Write Data Write Data 1 How many bits wide is each pipeline register? PC – 32 bits IF/ID – 64 bits ID/EX – 9 + 32x4 + 10 = 147 EX/MEM – 5 + 1 + 32x3 + 5 = 107 MEM/WB – 2 + 32x2 + 5 = 71 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 1 RegDst

Review: One Way to “Fix” a Data Hazard Can fix data hazard by waiting – stall – but affects throughput ALU IM Reg DM add r1,r2,r3 I n s t r. O r d e stall stall sub r4,r1,r5 and r6,r7,r1 ALU IM Reg DM

Review: Another Way to “Fix” a Data Hazard Can fix data hazard by forwarding results as soon as they are available to where they are needed. ALU IM Reg DM add r1,r2,r3 I n s t r. O r d e ALU IM Reg DM sub r4,r1,r5 ALU IM Reg DM and r6,r7,r1 ALU IM Reg DM Notice that for now we are showing the forwarded data coming out of the ALU. After looking at the problem more closely we will see that it really is supplied by the pipeline register EX/MEM and will depict it as such. or r8,r1,r1 ALU IM Reg DM sw r4,100(r1)

Data Forwarding (aka Bypassing) Any data dependence line that goes backwards in time EX stage generating R-type ALU results or effective address calculation MEM stage generating lw results Forward by taking the inputs to the ALU from any pipeline register rather than just ID/EX by adding multiplexors to the inputs of the ALU so can pass Rd back to either (or both) of the EX’s stage Rs and Rt ALU inputs 00: normal input (ID/EX pipeline registers) 10: forward from previous instr (EX/MEM pipeline registers) 01: forward from instr 2 back (MEM/WB pipeline registers) adding the proper control hardware With forwarding can run at full speed even in the presence of data dependencies

Data Forwarding Control Conditions (1/4) Forwards the result from the previous instr. to either input of the ALU. EX/MEM hazard: if (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10 “RegisterRd” is number of register to be written (RD or RT) “RegisterRs” is number of RS register “RegisterRt” is number of RT register “ForwardA, ForwardB” controls forwarding muxes if (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 MEM/WB hazard: if (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01 Forwards the result from the second previous instr. to either input of the ALU. What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)

Data Forwarding Control Conditions (2/4) EX/MEM hazard: if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10 and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10 MEM/WB hazard: if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01 and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01 Forwards the result from the previous instr. to either input of the ALU provided it writes. Forwards the result from the second previous instr. to either input of the ALU provided it writes. What’s wrong with this hazard control? (When might it forward when it shouldn’t?) (Which sequences would reveal this bug?)

Data Forwarding Control Conditions (3/4) EX/MEM hazard: if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10 and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10 MEM/WB hazard: if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01 and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01 Forwards the result from the previous instr. to either input of the ALU provided it writes and != R0. Forwards the result from the second previous instr. to either input of the ALU provided it writes and != R0. What’s wrong with this hazard control?

Administrivia Start Lab 3; Plan due Thursday evening; meet with TA Friday Reading Chapter 6, sections 6.5 to 6.9 for this week Midterm Tue Oct 12 5:30 - 8:30 in 101 Morgan (you asked for it) Northwest corner of campus, near Arch and Hearst Midterm review Sunday Oct 10, 7 PM, 306 Soda Bring 1 page, handwritten notes, both sides Nothing electronic: no calculators, cell phones, pagers, … Meet at LaVal’s Northside afterwards for Pizza

Yet Another Complication! Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded? More recent result! I n s t r. O r d e ALU IM Reg DM add $1,$1,$2 add $1,$1,$3 ALU IM Reg DM ALU IM Reg DM add $1,$1,$4

Corrected Data Forwarding Control Conditions MEM/WB hazard: if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs) and (EX/MEM.RegisterRd != ID/EX.RegisterRs || ~ EX/MEM.RegWrite)) ForwardA = 01 and (MEM/WB.RegisterRd == ID/EX.RegisterRt) and (EX/MEM.RegisterRd != ID/EX.RegisterRt || ~ EX/MEM.RegWrite))) ForwardB = 01 Forward if this instruction writes AND its not writing R0 AND this dest reg == source AND in between instr either dest. reg. doesn’t match OR it doesn’t write reg.

Datapath with Forwarding Hardware PCSrc 1 ID/EX EX/MEM Control IF/ID Add MEM/WB Branch Add 4 Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 Read Address PC Read Data Address 1 Write Addr ALU Read Data 2 1 Write Data Write Data For class handout ALU cntrl 16 32 Sign Extend 1 Forward Unit

Datapath with Forwarding Hardware PCSrc Read Address Instruction Memory Add PC 4 1 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU Shift left 2 Data IF/ID Sign Extend ID/EX EX/MEM MEM/WB Control cntrl Branch Forward Unit For lecture. How many bits wide is each pipeline register now? ID/EX – 9 + 32x4 + 10 = 147 + 10 = 157 Control line inputs to Forward Unit EX/MEM.RegWrite and MEM/WB.RegWrite not shown on diagram EX/MEM.RegisterRd MEM/WB.RegisterRd IF/ID.RegisterRs IF/ID.RegisterRt

Forwarding with Load-use Data Hazards ALU IM Reg DM lw r1,100(r2) I n s t r. O r d e flush ALU IM Reg DM sub r4,r1,r5 ALU IM Reg DM sub r4,r1,r5 and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 ALU IM Reg DM and r6,r1,r7 xor r4,r1,r5 or r8, r1, r9 The one case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. ALU IM Reg DM ALU IM Reg DM

Load-use Hazard Detection Unit Need a hazard detection unit in the ID stage that inserts a stall between the load and its use ID Hazard Detection if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline The first line tests to see if the instruction is a load; the next two lines check to see if the destination register of the load in the EX stage matches either source registers of the instruction in the ID stage After this 1-cycle stall, the forwarding logic can handle the remaining data hazards

Stall Hardware In addition to the hazard detection unit, we have to implement the stall Prevent the IF and ID stage instructions from making progress down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing Hazard detection unit controls the writing of the PC and IF/ID registers The instructions in the back half of the pipeline starting with the EX stage must be flushed (execute noop) Must deassert the control signals (setting them to 0) in the EX, MEM, and WB control fields of the ID/EX pipeline register. Hazard detection unit controls the multiplexor that chooses between the real control values and 0’s. Assume that 0’s are benign values in datapath: nothing changes

Adding the Hazard Hardware Read Address Instruction Memory Add PC 4 1 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU Shift left 2 Data IF/ID Sign Extend ID/EX EX/MEM MEM/WB Control cntrl Branch PCSrc Forward Unit Hazard Unit 1 For class handout

Adding the Hazard Hardware Read Address Instruction Memory Add PC 4 1 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU Shift left 2 Data IF/ID Sign Extend ID/EX EX/MEM MEM/WB Control cntrl Branch PCSrc Forward Unit ID/EX.MemRead Hazard Unit ID/EX.RegisterRt 1 For lecture In reality, only the signals RegWrite and MemWrite need to be 0, the other control signals can be don’t cares. Another consideration is energy – where clock gating is called for.

Memory-to-Memory Copies For loads immediately followed by stores (memory-to-memory copies) can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input. Would need to add a Forward Unit to the memory access stage Should avoid stalling on such a load I n s t r. O r d e ALU IM Reg DM lw $1,10($2) However, if $1 was used to compute the effective address it would be a load-use data hazard and would require a stall insertion between the lw and sw ALU IM Reg DM sw $1,10($3)

Where else need to forward? What about this code? addu $r5, … sw $r5, … Or addu $r5, … subu $r6, … sw $r5, … Need forwarding path Elaboration on page 412 shows how if have separate mux for immediate vs. forwarded value, can use EX stage forwarding control to pass proper result to EX/MEM register for sw to store

Control Hazards When the flow of instruction addresses is not what the pipeline expects; incurred by change of flow instructions Conditional branches (beq, bne) Unconditional branches (j) Possible solutions Stall Move decision point earlier in the pipeline Predict Delay decision (requires compiler support) Control hazards occur less frequently than data hazards; there is nothing as effective against control hazards as forwarding is for data hazards

Datapath Branch and Jump Hardware ID/EX Read Address Instruction Memory Add PC 4 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU 1 Data IF/ID Sign Extend EX/MEM MEM/WB Control cntrl Forward Unit For class handout

Datapath Branch and Jump Hardware 1 Shift left 2 Jump PC+4[31-28] 1 Branch PCSrc Shift left 2 Add ID/EX Read Address Instruction Memory Add PC 4 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU 1 Data IF/ID Sign Extend EX/MEM MEM/WB Control cntrl Forward Unit For lecture

Jumps Incur One Stall Jumps not decoded until ID, so one stall is needed ALU IM Reg DM j I n s t r. O r d e stall lw Fortunately, jumps are very infrequent – less that 2% of the SPECint instructions and x% of the SPECfp ones. and Fortunately, jumps are very infrequent – only 2% of the SPECint instruction mix

Review: Branches Incur Three Stalls ALU IM Reg DM beq I n s t r. O r d e Can fix branch hazard by waiting – stall – but affects throughput stall stall stall Another “solution” is to put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline. That would reduce the number of stalls to only one. A third approach is to prediction to handle branches, e.g., always predict that branches will be untaken. When right, the pipeline proceeds at full speed. When wrong, have to stall (and make sure nothing completes – changes machine state – that shouldn’t have). Will talk about these options in more detail in next,next lecture. lw ALU IM Reg DM and

Moving Branch Decisions Earlier in Pipe Move the branch decision hardware back to the EX stage Reduces the number of stall cycles to two Adds an and gate and a 2x1 mux to the EX timing path Add hardware to compute the branch target address and evaluate the branch decision to the ID stage Reduces the number of stall cycles to one (like with jumps) Computing branch target address can be done in parallel with RegFile read (done for all instructions – only used when needed) Comparing the registers can’t be done until after RegFile read, so comparing and updating the PC adds a comparator, an and gate, and a 3x1 mux to the ID timing path Need forwarding hardware in ID stage For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution Want a small branch penalty. Need more forwarding and hazard detection hardware for second option (one stall implementation) since a branch depended on a result still in the pipeline (that is one of the source operands for the comparison logic) must be forwarded from the EX/MEM or MEM/WB pipeline latches.

Early Branch Forwarding Issues Bypass of source operands from the EX/MEM if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = IF/ID.RegisterRs)) ForwardC = 1 and (EX/MEM.RegisterRd = IF/ID.RegisterRt)) ForwardD = 1 Forwards the result from the second previous instr. to either input of the Compare MEM/WB “forwarding” is taken care of by the normal RegFile write before read operation If the instruction immediately before the branch produces one of the branch compare source operands, then a stall will be required since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation

Supporting ID Stage Branches PCSrc Branch 1 Hazard Unit ID/EX IF.Flush EX/MEM 1 IF/ID Control Add MEM/WB 4 Shift left 2 Add Compare Read Addr 1 Instruction Memory Data Memory RegFile Read Addr 2 Read Address PC Read Data 1 Read Data 1 Write Addr ALU Address ReadData 2 1 Write Data Write Data Book claims that you have to forward from the MEM/WB pipeline latch, but with RegFile write before read, I don’t think that is the case!! ALU cntrl 16 Sign Extend 32 Forward Unit 1 Forward Unit

Branch Prediction Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome Predict not taken – always predict branches will not be taken, continue to fetch from the sequential instruction stream, only when branch is taken does the pipeline stall If taken, flush instructions in the pipeline after the branch in IF, ID, and EX if branch logic in MEM – three stalls in IF if branch logic in ID – one stall ensure that those flushed instructions haven’t changed machine state– automatic in the MIPS pipeline since machine state changing operations are at the tail end of the pipeline (MemWrite or RegWrite) restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken) ALU IM Reg DM I n s t r. O r d e 4 beq $1,$2,2 ALU IM Reg DM 8 sub $4,$1,$5 For class handout To flush the IF stage instruction, add a IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)

Flushing with Misprediction (Not Taken) ALU IM Reg DM I n s t r. O r d e 4 beq $1,$2,2 flush ALU IM Reg DM 8 sub $4,$1,$5 16 and $6,$1,$7 20 or r8,$1,$9 ALU IM Reg DM For lecture Note branch address is PC-relative branch to 4+4+2*4 = 16 To flush the IF stage instruction, add a IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)

Branch Prediction, con’t Resolve branch hazards by statically assuming a given outcome and proceeding Predict taken – always predict branches will be taken Predict taken always incurs a stall (if branch destination hardware has been moved to the ID stage) As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance With more hardware, possible to try to predict branch behavior dynamically during program execution Dynamic branch prediction – predict branches at run- time using run-time information Predict taken always incurs one stall at least – assuming the branch destination address hardware has been moved up to the ID stage. So predict not taken is easier since sequential instruction address can be computed in the IF stage.

Dynamic Branch Prediction A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was execute Bit may predict incorrectly (may be from a different branch with the same low order PC bits, or may be a wrong prediction for this branch) but the doesn’t affect correctness, just performance If the prediction is wrong, flush the incorrect instructions in pipeline, restart the pipeline with the right instructions, and invert the prediction bit The BHT predicts when a branch is taken, but does not tell where its taken to! A branch target buffer (BTB) in the IF stage can cache the branch target address (or !even! the branch target instruction) so that a stall can be avoided 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% 4096 about as good as infinite table, but 4096 is a lot of hardware Ask class why would want to store branch target instruction in the BTB

1-bit Prediction Accuracy 1-bit predictor in loop is incorrect twice when not taken Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1) As long as branch is taken (looping), prediction is correct Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken falling out of the loop; invert prediction bit (predict_bit = 0) Loop: 1st loop instr 2nd loop instr . last loop instr bne $1,$2,Loop fall out instr For 10 times through the loop we have a 80% prediction accuracy for a branch that is taken 90% of the time

2-bit Predictors A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed. Loop: 1st loop instr 2nd loop instr . last loop instr bne $1,$2,Loop fall out instr Taken Not taken Predict Taken Predict Taken Taken Not taken Taken For class handout Not taken Predict Not Taken Predict Not Taken Taken Not taken

2-bit Predictors A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed right 9 times Loop: 1st loop instr 2nd loop instr . last loop instr bne $1,$2,Loop fall out instr wrong on loop fall out Taken Not taken 1 Predict Taken 1 Predict Taken Taken right on 1st iteration Not taken Taken For lecture Not taken Predict Not Taken Predict Not Taken Taken Not taken

Delayed Decision First, move the branch decision hardware and target address calculation to the ID pipeline stage A delayed branch always executes the next sequential instruction – the branch takes effect after that next instruction MIPS software moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction) thereby hiding the branch delay As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot. Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches relatively cheaper No processor uses delayed branches of more than 1 cycl. For longer branch delays, hardware-base branch prediction is used.

Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through add $1,$2,$3 if $2=0 then add $1,$2,$3 if $1=0 then sub $4,$5,$6 delay slot delay slot add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes if $2=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 add $1,$2,$3 if $1=0 then sub $4,$5,$6 Limitations on delayed-branch scheduling come from 1) restrictions on the instructions that can be moved/copied into the delay slot and 2) limited ability to predict at compile time whether a branch is likely to be taken or not. In B and C, the use of $1 prevents the add instruction from being moved to the delay slot In B the sub may need to be copied because it could be reached by another path. B is preferred when the branch is taken with high probability (such as loop branches A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails

Brain storm on bugs (if time permits) Depending on branch solution (move to ID, delayed, static prediction, dynamic prediction), where are bugs likely to hide? … How can you write tests to uncover these likely bugs? Once it passes a test, don’t need to run it again?

In Conclusion Data dependencies in pipelines often solved by forwarding Need to be sure prior instructions will write, destination matches source, and no earlier instruction has priority Need forwarding hardware every place where can forward, stall if stage needs to wait for result EX stage, MEM stage for store, ID stage for early branch Loads require stall since overlap EX and MEM stages Branches may require stall too Control hazards improved via delayed branch/jump in ISA, static prediction for branches, dynamic prediction for branches If predict, hard part of design is recovering from misprediction