1 Lecture 5 Branch Prediction (2.3) and Scoreboarding (A.7)



2 Why do we want to predict branches?
MIPS-based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle.
– Delayed branch
Modern processors and next-generation designs – multiple instructions issued per cycle, so more branch hazard cycles are incurred.
– Cost of a branch misfetch goes up
– Pentium Pro – 3 instructions issued per cycle, 12+ cycle misfetch penalty
=> HUGE penalty for a misfetched path following a branch

3 Branch Prediction
Easiest (static prediction)
– Always taken, always not taken
– Opcode/profile based
– Displacement based (forward not taken, backward taken)
– Compiler directed (branch likely, branch not likely)
Next easiest – 1-bit predictor – remember the last taken/not-taken outcome per branch
– Use a branch-prediction buffer or branch-history table with 1-bit entries
– Use part of the PC (low-order bits) to index the buffer/table – Why?
– Multiple branches may share the same bit
– Invert the bit if the prediction is wrong
– Backward branches for loops will be mispredicted twice
EX: If a loop branch is taken 9 times in a row and not taken once, what is the prediction accuracy?
Ans: Mispredictions on the first and last iterations => 80% prediction accuracy, although the branch is taken 90% of the time.
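The loop example above can be checked with a minimal sketch of a single 1-bit predictor entry (the function name and the initial "not taken" state are assumptions for the sketch):

```python
def one_bit_accuracy(pattern, init=False):
    """Simulate one 1-bit predictor entry over a taken/not-taken pattern."""
    pred, correct = init, 0
    for taken in pattern:
        correct += (pred == taken)
        pred = taken              # remember only the last outcome
    return correct / len(pattern)

# Loop branch: taken 9 times, then not taken once, repeated (steady state).
pattern = ([True] * 9 + [False]) * 2
print(one_bit_accuracy(pattern))  # 0.8 – misses on the first and last iteration
```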

4 2-bit Branch Prediction
Has 4 states instead of 2, allowing more information about tendencies
A prediction must miss twice before it is changed
Good for backward branches of loops
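A sketch of the 2-bit scheme as a saturating counter (states 0–3, where 2 and 3 predict taken; the initial "strongly taken" state is an assumption). On the same loop pattern as before, only the loop-exit branch is mispredicted:

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken."""
    def __init__(self, state=3):          # start at strongly taken (assumption)
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one state toward the actual outcome; a prediction must be
        # wrong twice in a row before the predicted direction flips.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
correct = 0
for t in outcomes:
    correct += (p.predict() == t)
    p.update(t)
print(correct / len(outcomes))  # 0.9 – only the loop exit mispredicts
```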

5 Branch History Table (BHT)
Has limited size: 2 bits × N entries (e.g., 4K entries)
Uses low-order bits of the branch PC to choose the entry
(Figures typically plot the misprediction rate rather than the prediction rate)
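The indexing can be sketched as follows (the 4K-entry size is the slide's example; 4-byte instruction alignment is an assumption). Because only the low-order PC bits are used, distant branches can alias to the same entry:

```python
BHT_ENTRIES = 4096  # example table size from the slide

def bht_index(pc):
    # Drop the 2 alignment bits, then keep log2(BHT_ENTRIES) low-order bits.
    return (pc >> 2) & (BHT_ENTRIES - 1)

# Two branches 16 KB apart share the same entry (aliasing):
print(bht_index(0x1000) == bht_index(0x1000 + 4 * BHT_ENTRIES))  # True
```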

6 Observations
Prediction accuracy ranges from 99% to 82%, i.e., a misprediction rate of 1% to 18%
Misprediction for integer programs (gcc, espresso, eqntott, li) is substantially higher than for FP programs (nasa7, matrix300, tomcatv, doduc, spice, fpppp)
Branch penalty involves both the misprediction rate and the branch frequency, and is higher for integer benchmarks
Prediction accuracy improves with buffer size, but not beyond 4K entries (Fig. C.20)

Copyright © 2011, Elsevier Inc. All rights Reserved. Figure C.20 Prediction accuracy of a 4096-entry 2-bit prediction buffer versus an infinite buffer for the SPEC89 benchmarks. Although these data are for an older version of a subset of the SPEC benchmarks, the results would be comparable for newer versions with perhaps as many as 8K entries needed to match an infinite 2-bit predictor.

8 Correlating or Two-level Predictors
Correlating branch predictors also look at other branches for clues. Consider the following example:
if (aa == 2)   // branch b1
  aa = 0;
if (bb == 2)   // branch b2
  bb = 0;
if (aa != bb) {  // branch b3 – clearly depends on the results of b1 and b2
  ...
If b1 is not taken and b2 is not taken, b3 will be taken.
If the prediction of b3 is based only on its own history (instead of also those of b1 and b2), it is not an accurate prediction.
(1,2) predictor – uses the history of 1 branch and a 2-bit predictor
(m,n) predictor – uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.

9 Correlating Branch Predictor
If we use 2 branches as history, there are 4 possibilities (T-T, T-NT, NT-NT, NT-T). For each possibility we need a predictor (1-bit or 2-bit), and this repeats for every branch.
(2,2) branch prediction
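A minimal (2,2) predictor sketch: a 2-bit global history register selects one of four 2-bit counters per table entry. The class name, table size, and initial counter values are assumptions for the illustration, not from the lecture:

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history pick one of 4 counters."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.history = 0                                # last 2 global outcomes
        self.table = [[1] * 4 for _ in range(entries)]  # 4 counters per entry

    def _row(self, pc):
        return self.table[(pc >> 2) % self.entries]     # low-order PC bits

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2         # counter >= 2 => taken

    def update(self, pc, taken):
        row = self._row(pc)
        row[self.history] = (min(3, row[self.history] + 1) if taken
                             else max(0, row[self.history] - 1))
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

For branch b3 in the earlier example, the counter consulted depends on the outcomes of the two preceding branches, so the NT-NT case can learn "taken" independently of the other three history cases.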

10 Performance of Correlating Branch Prediction
With 1024 entries, a (2,2) predictor performs better than a 4096-entry non-correlating 2-bit predictor.
It even outperforms a 2-bit predictor with an infinite number of entries.

11 Tournament Predictor
Adaptively combine local and global predictors
– Ex.: two mispredictions before changing from local to global or vice versa
Misprediction rates of the tournament predictor: Fig. 2.8
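The "two mispredictions before switching" rule can be sketched as a 2-bit chooser counter per branch (names and the initial state are illustrative assumptions):

```python
class TournamentChooser:
    """2-bit chooser: states 2,3 trust the global predictor; 0,1 the local."""
    def __init__(self):
        self.state = 2                      # weakly trust global (assumption)

    def choose(self, local_pred, global_pred):
        return global_pred if self.state >= 2 else local_pred

    def update(self, local_pred, global_pred, taken):
        # Only move when the two predictors disagree, toward the one
        # that was right; two wrong calls in a row flip the choice.
        if local_pred != global_pred:
            if global_pred == taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)
```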

12 Branch Target Buffer (BTB)
The BTB is a cache that contains the predicted PC value rather than whether the branch will be taken (e.g., a loop's target address).
Is the current instruction a branch? The BTB provides the answer before the current instruction is decoded, and therefore enables fetching to begin right after the IF stage.
What is the branch target? The BTB provides the branch target if the prediction is a taken direct branch (for not-taken branches the target is simply PC+4).

13 BTB (Fig. 2.22)

14 BTB Operations (Fig.)
BTB hit, prediction taken → 0 cycle delay
BTB hit, misprediction → ≥ 2 cycle penalty – correct the BTB
BTB miss, branch → ≥ 1 cycle penalty (detected at the ID stage and entered into the BTB)

15 BTB Performance
Two things can go wrong:
– BTB miss (misfetch)
– Mispredicted branch (mispredict)
Suppose, for branches, a BTB hit rate of 85% and prediction accuracy of 90%, with a misfetch penalty of 2 cycles and a mispredict penalty of 5 cycles. What is the average branch penalty?
2 × 15% + 5 × (85% × 10%) = 0.725 cycles
The BTB and the branch-prediction table can be used together to perform better prediction.
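The slide's arithmetic, spelled out (all numbers are the slide's example values):

```python
hit_rate, accuracy = 0.85, 0.90
misfetch_penalty, mispredict_penalty = 2, 5   # cycles

# A misfetch occurs on a BTB miss; a mispredict on a BTB hit whose
# prediction turns out to be wrong.
avg_penalty = (misfetch_penalty * (1 - hit_rate)
               + mispredict_penalty * hit_rate * (1 - accuracy))
print(avg_penalty)  # 0.725 cycles per branch on average
```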

16 Integrated Instruction Fetch Unit
Separate IF from the pipeline and integrate it with the following components. The pipeline then consists of Issue, Read, EX, and WB stages (scoreboarding), or Issue, EX, and WB stages (Tomasulo).
1. Integrated branch prediction – the branch predictor is part of the IFU.
2. Instruction prefetch – fetch instructions from instruction memory ahead of the PC with the help of the branch predictor and store them in a prefetch buffer.
3. Instruction memory access and buffering – keep filling the instruction queue independent of execution => decoupled execution?

17 Branch Prediction Summary
The better we predict, the higher the penalty we might incur when a prediction is wrong
2-bit predictors capture tendencies well
Correlating predictors improve accuracy, particularly when combined with 2-bit predictors
Accurate branch prediction does no good if we don't know there was a branch to predict
The BTB identifies branches in the IF stage
The BTB combined with a branch-prediction table identifies the branches to predict, and predicts them well

18 Instruction Level Parallelism

19 How to obtain CPI < 1?
Issue more than one instruction per cycle
The compiler needs to do a good job of scheduling code (rearranging the code sequence) – statically scheduled
Fetch up to n instructions as an issue packet if the issue width is n
Check hazards during the issue stage (including decode)
– Issue checks are too complex to perform in one clock cycle
– The issue stage is split and pipelined
– Needs to check hazards within a packet, between two packets, and among the current and all earlier instructions in execution
In effect an n-fold pipeline with complex issue logic and a large set of bypass paths.

Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
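The within-packet hazard check mentioned above can be sketched as follows (the representation of an instruction as a destination plus a list of sources is an assumption for the illustration):

```python
def packet_hazard_free(packet):
    """packet: list of (dest, sources) tuples in program order.

    Returns True only if no instruction reads a register written by an
    earlier instruction in the same issue packet (no intra-packet RAW).
    """
    written = set()
    for dest, sources in packet:
        if any(src in written for src in sources):
            return False          # RAW dependence inside the packet
        written.add(dest)
    return True

# The second instruction reads F0, which the first one writes:
print(packet_hazard_free([("F0", ["F2", "F4"]),
                          ("F6", ["F0", "F8"])]))  # False
```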

20 HW Schemes: Instruction Parallelism
The compiler (static instruction scheduling) can avoid some pipeline hazards, e.g., by filling the branch delay slot. Why do it in HW at run time?
– Works when dependences can't be known at compile time
– WAW can only be detected at run time
– The compiler is simpler
– Code for one machine runs well on another
Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
– Enables out-of-order execution => out-of-order completion
– But both structural and data hazards are checked in MIPS
– ADDD is stalled at ID, so SUBD cannot even proceed to ID.

21 HW Schemes: Instruction Parallelism
Out-of-order execution divides the ID stage:
1. Issue – decode instructions, check for structural hazards; issue in order if the functional unit is free and there is no WAW hazard.
2. Read operands (RO) – wait until there are no (RAW) data hazards, then read operands.
– ADDD would stall at RO, and SUBD could proceed with no stalls.
Scoreboarding allows an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.

IF ISSUE RO EX1 … EXm       WB – if no WAR
IF ISSUE RO EX1 …… EXn      WB?
IF ISSUE RO EX1 ……… EXp     …

22 Scoreboard Implications
Out-of-order completion => WAR and WAW hazards
Solution for WAR
– CDC 6600: stall the Write to allow the Reads to take place; read registers only during the Read Operands stage.
For WAW, must detect the hazard: stall in the Issue stage until the other instruction completes
Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
The scoreboard replaces ID with 2 stages (Issue and RO)
The scoreboard keeps track of dependencies and the state of operations
– Monitors every change in the hardware.
– Determines when to read operands, when an instruction can execute, and when it can write back.
– Hazard detection and resolution are centralized.

23 Four Stages of Scoreboard Control
1. Issue – decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands – wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

24 Four Stages of Scoreboard Control
3. Execution – operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result – finish execution (WB)
Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, it writes the result; if there is a WAR hazard, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.
The CDC 6600 has 1 integer unit, 2 FP multipliers, 1 FP divide unit, and 1 FP add unit. See Fig. A.50.

25 Scoreboard Example, Cycle 7
I3 is stalled at read operands because I2 isn't complete.

26 Three Parts of the Scoreboard
1. Instruction status – which of the 4 steps the instruction is in
2. Functional unit status – indicates the state of the functional unit (FU); 9 fields for each functional unit:
Busy – indicates whether the unit is busy or not
Op – operation to perform in the unit (e.g., + or –)
Fi – destination register
Fj, Fk – source-register numbers
Qj, Qk – functional units producing source registers Fj, Fk
Rj, Rk – flags indicating when Fj, Fk are ready and not yet read; set to No after the operands are read
3. Register result status – indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
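The bookkeeping above can be sketched as data structures plus the two checks from the Issue and Read Operands stages. Field names follow the slide; the helper functions and the simplified control logic are assumptions for the sketch, not the CDC 6600's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class FUStatus:
    busy: bool = False
    op: str = ""
    Fi: str = ""          # destination register
    Fj: str = ""          # source registers
    Fk: str = ""
    Qj: str = ""          # FU producing Fj ("" = none pending)
    Qk: str = ""          # FU producing Fk
    Rj: bool = False      # Fj ready and not yet read
    Rk: bool = False      # Fk ready and not yet read

# Register result status: register name -> FU that will write it.
register_result = {}

def can_issue(fu: FUStatus, dest: str) -> bool:
    # Stall on a structural hazard (unit busy) or a WAW hazard
    # (another active instruction already targets dest).
    return not fu.busy and dest not in register_result

def can_read_operands(fu: FUStatus) -> bool:
    # RAW hazards are resolved once both source-ready flags are set.
    return fu.Rj and fu.Rk
```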
