1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14)



2 1-Bit Prediction
- For each branch, keep track of what happened last time and use that outcome as the prediction
- What are the prediction accuracies for branches 1 and 2 below?

    while (1) {
        for (i=0; i<10; i++) { branch-1 … }
        for (j=0; j<20; j++) { branch-2 … }
    }
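
A minimal sketch of the question above, assuming each branch is the loop-closing branch of its `for` loop: it executes once per iteration and is taken on every iteration except the last. It also assumes the two branches do not share a predictor entry. The helper names are illustrative and not from the lecture.

```python
def loop_branch_outcomes(iterations, repeats):
    """Outcomes of a loop-closing branch: taken on all iterations but the last."""
    return ([True] * (iterations - 1) + [False]) * repeats


def one_bit_accuracy(outcomes, last_outcome=False):
    """Predict that the branch repeats whatever it did last time."""
    correct = 0
    for taken in outcomes:
        if taken == last_outcome:
            correct += 1
        last_outcome = taken  # 1-bit state: remember only the last outcome
    return correct / len(outcomes)


# The outer while(1) loop re-enters each inner loop many times.
acc_branch_1 = one_bit_accuracy(loop_branch_outcomes(10, repeats=1000))
acc_branch_2 = one_bit_accuracy(loop_branch_outcomes(20, repeats=1000))
print(acc_branch_1, acc_branch_2)
```

Under this model the predictor is wrong twice per pass through a loop (on the first iteration and on the exit), so the accuracies come out to 8/10 and 18/20.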

3 2-Bit Prediction
- For each branch, maintain a 2-bit saturating counter:
  - if the branch is taken: counter = min(3, counter+1)
  - if the branch is not taken: counter = max(0, counter-1)
- If (counter >= 2), predict taken, else predict not taken
- Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
- Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
- Can be easily extended to N bits (in most processors, N=2)
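
The counter rules above can be sketched as follows. The class and function names are illustrative. The test pattern assumes a loop-closing branch of a 10-iteration loop (taken nine times, then not taken), and the counter is assumed to start at 3.

```python
class TwoBitCounter:
    """2-bit saturating counter: values 0..3, predict taken when >= 2."""

    def __init__(self, value=3):
        self.value = value

    def predict(self):
        return self.value >= 2

    def update(self, taken):
        if taken:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


def accuracy(outcomes, counter):
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


# Assumed workload: the 10-iteration loop branch, re-entered 1000 times.
outcomes = ([True] * 9 + [False]) * 1000
acc = accuracy(outcomes, TwoBitCounter(value=3))
print(acc)
```

A single not-taken outcome at loop exit only drops the counter from 3 to 2, so the next entry into the loop is still predicted taken. The only misprediction per pass is the exit itself.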

4 Correlating Predictors
- Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters) – captures the recent “common case” for each branch
- Can we take advantage of additional information?
  - If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
  - If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?
- Hence, build correlating predictors

5 Local/Global Predictors
- Instead of maintaining a counter for each branch to capture the common case,
  - Maintain a counter for each branch and surrounding pattern
  - If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
  - If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor

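6 Global Predictor
- A single register keeps track of recent history for all branches
- Figure: bits of the branch PC and of the global history register (the 8-bit and 6-bit fields in the figure) together form a 14-bit index into a table of 16K entries of 2-bit saturating counters
- Also referred to as a two-level predictor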
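
A minimal sketch of a global two-level predictor along these lines. The split of the 14 index bits (6 from the PC, 8 from the history register) is an assumption, since the transcript does not say which field is which. Dropping the two low PC bits assumes 4-byte instructions. All names are illustrative.

```python
class GlobalPredictor:
    def __init__(self, pc_bits=6, history_bits=8):
        self.pc_mask = (1 << pc_bits) - 1
        self.history_bits = history_bits
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # one shift register shared by all branches
        # 2-bit counters, initialised to 1 (weakly not-taken).
        self.counters = [1] * (1 << (pc_bits + history_bits))

    def _index(self, pc):
        pc_part = (pc >> 2) & self.pc_mask  # assumed 4-byte instructions
        return (pc_part << self.history_bits) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# A strictly alternating branch defeats a plain 2-bit counter,
# but each history value here maps to its own counter.
predictor = GlobalPredictor()
pc = 0x400
late_correct = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(pc) == taken:
        late_correct += 1
    predictor.update(pc, taken)
print(len(predictor.counters), late_correct)
```

With these defaults the table has 2^14 = 16K counters. Once the history register has filled, the alternating branch is predicted correctly on every one of the last 100 outcomes.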

7 Local Predictor
- Use 6 bits of the branch PC to index into the local history table: 64 entries, each a 14-bit history for a single branch
- The 14-bit history indexes into the next level: a table of 16K entries of 2-bit saturating counters
- Also a two-level predictor, one that uses only local histories at the first level
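
A minimal sketch of the two-level local predictor described above, using the slide's sizes (64 histories of 14 bits, 16K counters). The class name, the initial counter value, and the choice to drop the two low PC bits (assuming 4-byte instructions) are illustrative assumptions.

```python
class LocalPredictor:
    def __init__(self, pc_bits=6, history_bits=14):
        self.pc_mask = (1 << pc_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.histories = [0] * (1 << pc_bits)      # level 1: per-branch history
        self.counters = [1] * (1 << history_bits)  # level 2: 2-bit counters

    def _entry(self, pc):
        return (pc >> 2) & self.pc_mask  # assumed 4-byte instructions

    def predict(self, pc):
        history = self.histories[self._entry(pc)]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        entry = self._entry(pc)
        history = self.histories[entry]
        if taken:
            self.counters[history] = min(3, self.counters[history] + 1)
        else:
            self.counters[history] = max(0, self.counters[history] - 1)
        self.histories[entry] = ((history << 1) | int(taken)) & self.history_mask


# A branch that repeats the pattern 1,1,1,1,0 (as in "went 01111, expect 0").
predictor = LocalPredictor()
pc = 0x1000
pattern = [True, True, True, True, False] * 60
late_correct = 0
for i, taken in enumerate(pattern):
    if i >= 200 and predictor.predict(pc) == taken:
        late_correct += 1
    predictor.update(pc, taken)
print(late_correct)
```

Because the 14-bit history is longer than the 5-outcome period, each history value is always followed by the same outcome, so after warm-up the last 100 outcomes are all predicted correctly.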

8 Tournament Predictors
- A local predictor might work well for some branches or programs, while a global predictor might work well for others
- Provide one of each and maintain another predictor to identify which predictor is best for each branch
- Figure: the branch PC indexes a table of 2-bit saturating counters (the tournament predictor), which drives a MUX that selects between the local predictor and the global predictor
- Alpha 21264:
  - Local predictor: 1K entries in level-1, 1K entries in level-2
  - Global predictor: 4K entries, 12-bit global history
  - Tournament predictor: 4K entries
  - Total capacity: ?
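
One way the selecting predictor could work is sketched below. This is an assumption-laden illustration, not the Alpha 21264 design: the chooser is indexed by PC bits as in the slide's figure, and its counter moves only when the two component predictors disagree. The class and method names are hypothetical.

```python
class Chooser:
    """Table of 2-bit counters: 0-1 selects local, 2-3 selects global."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries  # start weakly favouring local

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumed 4-byte instructions

    def select(self, pc, local_pred, global_pred):
        use_global = self.counters[self._index(pc)] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return  # both right or both wrong: nothing to learn
        i = self._index(pc)
        if global_pred == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


chooser = Chooser()
pc = 0x1000
first_choice = chooser.select(pc, local_pred=False, global_pred=True)
for _ in range(3):
    # The global predictor is right and the local one is wrong.
    chooser.update(pc, local_pred=False, global_pred=True, taken=True)
later_choice = chooser.select(pc, local_pred=False, global_pred=True)
print(first_choice, later_choice)
```

The chooser initially passes through the local prediction and switches to the global prediction once the global predictor has been right where the local one was wrong.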

9 Predictor Comparison
- Note that predictors of equal capacity must be compared
- Sizes of each level have to be selected to optimize prediction accuracy
- Influencing factors: degree of interference between branches, whether the program is likely to benefit from local or global history

10 Branch Target Prediction
- In addition to predicting the branch direction, we must also predict the branch target address
- Branch PC indexes into a predictor table; indirect branches might be problematic
- Most common indirect branch: return from a procedure – can be easily handled with a stack of return addresses
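
The stack of return addresses can be sketched as below. The depth, the 4-byte instruction size, and the overwrite-oldest policy on overflow are assumptions chosen for illustration. The names are hypothetical.

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, instr_size=4):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # full: drop the oldest entry (assumed policy)
        self.stack.append(call_pc + instr_size)

    def predict_return(self):
        """Pop the predicted target for a return; None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)  # outer call
ras.on_call(0x200)  # nested call
inner_target = ras.predict_return()
outer_target = ras.predict_return()
empty_target = ras.predict_return()
print(hex(inner_target), hex(outer_target), empty_target)
```

Nested calls return in last-in, first-out order, which is why a small stack predicts return targets well even though the same return instruction jumps to many different addresses.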

11 Multiple Instruction Issue
- The out-of-order processor implementation can be easily extended to have multiple instructions in each pipeline stage
- Increased complexity (lower clock speed!):
  - more reads and writes per cycle to register map table
  - more read and write ports in issue queue
  - more tags being broadcast to issue queue every cycle
  - higher complexity for bypassing/forwarding among FUs
  - more register read and write ports
  - more ports in the LSQ
  - more ports in the data cache
  - more ports in the ROB

12 ILP Limits
- The perfect processor:
  - Infinite registers (no WAW or WAR hazards)
  - Perfect branch direction and target prediction
  - Perfect memory disambiguation
  - Perfect instruction and data caches
  - Single-cycle latencies for all ALUs
  - Infinite ROB size (window of in-flight instructions)
  - No limit on number of instructions in each pipeline stage
- The last instruction may be scheduled in the first cycle
- The only constraint is a true dependence (register or memory RAW hazards)
- (With value prediction, how would the perfect processor behave?)
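
A minimal sketch of how the perfect processor would schedule a short instruction sequence, assuming single-cycle latencies and no constraint other than RAW dependences. The dependence list is a made-up example, and the function name is illustrative.

```python
def dataflow_schedule(deps):
    """deps[i] lists the earlier instructions whose results instruction i reads.

    Returns the cycle in which each instruction executes when only true
    (RAW) dependences constrain the schedule and every op takes one cycle.
    """
    cycle = []
    for producers in deps:
        ready_after = max((cycle[p] for p in producers), default=0)
        cycle.append(ready_after + 1)
    return cycle


# Hypothetical 6-instruction program, in program order.
deps = [
    [],   # i0
    [0],  # i1 reads i0
    [1],  # i2 reads i1
    [],   # i3
    [3],  # i4 reads i3
    [],   # i5: independent, although it is last in program order
]
cycles = dataflow_schedule(deps)
ilp = len(deps) / max(cycles)
print(cycles, ilp)
```

Here the last instruction executes in cycle 1 alongside the first, and the achievable ILP is the instruction count divided by the length of the longest dependence chain.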

13 Infinite Window Size and Issue Rate

14 Effect of Window Size
- Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.
- We will use a window size of 2K instrs and a max issue rate of 64 for subsequent experiments

15 Imperfect Branch Prediction
- Note: no branch mispredict penalty; branch mispredict restricts window size
- Assume a large tournament predictor for subsequent experiments

16 Effect of Name Dependences
- More registers → fewer WAR and WAW constraints (usually register file size goes hand in hand with in-flight window size)
- 256 int and fp registers for subsequent experiments

17 Memory Dependences

18 Limits of ILP – Summary
- Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size
- We have not yet examined the effect of branch mispredict penalty and imperfect caching
- All of the studied factors have relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation
- Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution

19 Pentium III (P6 Microarchitecture) Case Study
- 14-stage pipeline: 8 for fetch/decode/dispatch, 3+ for o-o-o, 3 for commit → branch mispredict penalty of cycles
- Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations
- Each x86 instruction gets converted into RISC-like micro-ops – on average, one CISC instr → 1.37 micro-ops
- Three instructions in each pipeline stage → 3 instructions can simultaneously leave the pipeline → ideal CPµI (cycles per micro-op) = 0.33 → ideal CPI = 0.45
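
The ideal-CPI arithmetic on this slide can be checked with a few lines. The constants come from the slide; the variable names are illustrative.

```python
MICRO_OPS_PER_CYCLE = 3         # micro-ops that can leave the pipeline per cycle
MICRO_OPS_PER_X86_INSTR = 1.37  # average expansion of one x86 instruction

cycles_per_micro_op = 1 / MICRO_OPS_PER_CYCLE
ideal_cpi = cycles_per_micro_op * MICRO_OPS_PER_X86_INSTR

# Using the rounded 0.33 figure, as the slide appears to do.
ideal_cpi_from_rounded = round(cycles_per_micro_op, 2) * MICRO_OPS_PER_X86_INSTR

print(f"{cycles_per_micro_op:.2f} {ideal_cpi:.3f} {ideal_cpi_from_rounded:.3f}")
```

The unrounded product is about 0.457 cycles per x86 instruction. Multiplying the rounded 0.33 by 1.37 gives about 0.452, which matches the 0.45 quoted on the slide.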

20 Branch Prediction
- 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate
- For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)
- Mispredict penalty is cycles

21 Where is Time Lost?
- Branch mispredict stalls
- Cache miss stalls (dominated by L1D misses)
- Instruction fetch stalls (these happen often because subsequent stages are stalled, and occasionally because of an I-cache miss)

22 CPI Performance
- Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but then recover with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP)
- Overlap of different stalls → CPI is not the sum of individual stalls
- IPC is also an attractive metric
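
Since IPC is the reciprocal of CPI, the measured averages on this slide convert directly. This is a small sketch using the slide's numbers, with illustrative names.

```python
def ipc_from_cpi(cpi):
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cpi


measured_cpi = {"Int": 1.15, "FP": 2.0}  # averages quoted on the slide
ipc = {suite: ipc_from_cpi(cpi) for suite, cpi in measured_cpi.items()}

for suite, value in ipc.items():
    print(f"{suite}: CPI {measured_cpi[suite]:.2f} -> IPC {value:.2f}")
```

This gives an IPC of roughly 0.87 for the integer programs and 0.5 for the floating-point programs.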
