CS 7810 Lecture 6: The Impact of Delay on the Design of Branch Predictors
D.A. Jimenez, S.W. Keckler, C. Lin
Proceedings of MICRO-33, 2000
Bimodal Predictor
- 14 bits of the branch PC index into a table of 16K entries of 2-bit saturating counters
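As a concrete illustration, here is a minimal Python sketch of a bimodal predictor with the sizes on this slide (14 index bits, 16K 2-bit counters); the class and method names are my own.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, index_bits=14):                 # 2^14 = 16K entries
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)        # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2      # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The same predict/update interface is reused in the sketches on the next slides.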
Global Predictor
- A single register keeps track of recent history for all branches
- The branch PC (8 bits) and the global history (6 bits) together index a table of 16K entries of 2-bit saturating counters
- Also referred to as a two-level predictor
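A minimal sketch of this global two-level scheme, assuming (as the slide's labels suggest) that 8 PC bits are concatenated with a 6-bit global history register to form the 14-bit index; the bit split and names are assumptions.

```python
class GlobalPredictor:
    """One global history register shared by all branches; PC bits plus the
    history index a single table of 2-bit saturating counters."""

    def __init__(self, pc_bits=8, hist_bits=6):        # 2^(8+6) = 16K entries
        self.pc_bits, self.hist_bits = pc_bits, hist_bits
        self.history = 0
        self.counters = [1] * (1 << (pc_bits + hist_bits))

    def _index(self, pc):
        return ((pc & ((1 << self.pc_bits) - 1)) << self.hist_bits) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        # shift the outcome into the single, shared history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.hist_bits) - 1)
```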
Local Predictor
- Use 6 bits of the branch PC to index into a local history table of 64 entries, each holding a 14-bit history for a single branch
- The 14-bit history indexes into the next level: a table of 16K entries of 2-bit saturating counters
- A two-level predictor that only uses local histories at the first level
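A matching sketch of the local two-level predictor: 64 per-branch 14-bit histories at the first level, 16K counters at the second (sizes from the slide; the code structure is an assumption).

```python
class LocalPredictor:
    """First level: per-branch history table. Second level: shared counter
    table indexed by the selected history."""

    def __init__(self, pc_bits=6, hist_bits=14):       # 64 histories, 16K counters
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.histories = [0] * (1 << pc_bits)
        self.counters = [1] * (1 << hist_bits)

    def predict(self, pc):
        return self.counters[self.histories[pc & self.pc_mask]] >= 2

    def update(self, pc, taken):
        h = pc & self.pc_mask
        i = self.histories[h]
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        # shift the outcome into this branch's private history
        self.histories[h] = ((self.histories[h] << 1) | int(taken)) & self.hist_mask
```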
Tournament Predictors
- A local predictor might work well for some branches or programs, while a global predictor might work well for others
- Provide one of each and maintain another predictor (a table of 2-bit saturating counters indexed by the branch PC) to identify which predictor is best for each branch; a mux selects between the local and global predictions
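A sketch of the chooser mechanism: a table of 2-bit counters, indexed by the branch PC, records which component has been more accurate for that branch. It works with any two components exposing the predict/update interface of the sketches above; the chooser size is an assumption.

```python
class TournamentPredictor:
    """Chooser table of 2-bit counters: >=2 means 'trust the global component'."""

    def __init__(self, local, glob, chooser_bits=12):
        self.local, self.glob = local, glob
        self.mask = (1 << chooser_bits) - 1
        self.chooser = [2] * (1 << chooser_bits)

    def predict(self, pc):
        if self.chooser[pc & self.mask] >= 2:
            return self.glob.predict(pc)
        return self.local.predict(pc)

    def update(self, pc, taken):
        local_ok = self.local.predict(pc) == taken
        global_ok = self.glob.predict(pc) == taken
        i = pc & self.mask
        if global_ok and not local_ok:                 # reward the global component
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif local_ok and not global_ok:               # reward the local component
            self.chooser[i] = max(0, self.chooser[i] - 1)
        self.local.update(pc, taken)
        self.glob.update(pc, taken)
```

Usage: TournamentPredictor(LocalPredictor(), GlobalPredictor()) gives the structure on this slide.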
Terminology
- GAG: global history indexes into a global array of saturating counters
- PAG: per-address history indexes into a global array of saturating counters
- GAP: global history indexes into each PC's private array of counters (gselect)
- PAP: per-address history indexes into each PC's private array of counters
Prediction Accuracy vs. IPC
Fig. 1 – IPC saturates at around 1.28, assuming single-cycle predictions
- A 2KB predictor takes two cycles to access – multi-cycle predictors can't yield IPC > 1.0 (reduced fetch bandwidth)
- However, note that a single-cycle predictor is within 10% of optimal IPC (might not be true for more aggressive out-of-order processors)
Long Latency Predictions
- Total branch latency: C = d + (r × p), where d = predictor delay = 1 cycle, r = misprediction rate = 0.04, p = misprediction penalty = 20 cycles
- Always better to reduce d than r
- Note that correctly predicted branches are often not on the program critical path
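To make the comparison concrete, plug the slide's numbers into the formula:
C = 1 + 0.04 × 20 = 1.8 cycles per branch.
Adding one cycle of predictor delay raises C by a full 1.0 cycle, while halving the misprediction rate to 0.02 saves only 0.04 × 20 − 0.02 × 20 = 0.4 cycles, so at these parameters the delay term dominates.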
Branch Frequency
- Branches are not as frequent as we might think – on average, one every 6 instructions, and 61% of the time there is at least 1 cycle of separation between consecutive branches
- Branches can be treated differently, based on whether or not they can tolerate prediction latency
Branch Predictor Cache
- The cache (a small 1-cycle PHT with tags) holds a subset of the 3-cycle PHT and is indexed by the XOR of the branch address and history
- On a hit, the cached prediction is used; on a miss, the ABP provides the prediction
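A minimal sketch of the caching idea (my own illustration of the mechanism described above, not the paper's exact design): a small tagged table, indexed by PC XOR history, caches predictions from the slow PHT; on a miss the 1-cycle ABP's prediction is used, and the slow PHT's answer is installed when it arrives. The index/tag sizes are assumptions.

```python
class PredictionCache:
    """Tagged 1-cycle cache of a slower 3-cycle PHT."""

    def __init__(self, index_bits=10, tag_bits=8):      # sizes are assumptions
        self.index_mask = (1 << index_bits) - 1
        self.tag_shift = index_bits
        self.tag_mask = (1 << tag_bits) - 1
        self.tags = [None] * (1 << index_bits)
        self.preds = [False] * (1 << index_bits)

    def _index_and_tag(self, pc, history):
        key = pc ^ history                               # XOR of address and history
        return key & self.index_mask, (key >> self.tag_shift) & self.tag_mask

    def lookup(self, pc, history, abp_prediction):
        idx, tag = self._index_and_tag(pc, history)
        if self.tags[idx] == tag:
            return self.preds[idx]                       # hit: cached 1-cycle prediction
        return abp_prediction                            # miss: fall back to the ABP

    def fill(self, pc, history, pht_prediction):
        """Install the 3-cycle PHT's prediction once it becomes available."""
        idx, tag = self._index_and_tag(pc, history)
        self.tags[idx] = tag
        self.preds[idx] = pht_prediction
```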
Cascading Lookahead Prediction
- Use the current PC to predict where the next branch will go – initiate the look-up before you actually see that branch
- Use predictors with different latencies – when you do see the branch, use the best prediction that is available by then (see the sketch below)
- You can use a good (slow) prediction about 60% of the time and a poor (fast) prediction about 40% of the time
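A small sketch of the selection step, under the assumption that the lookahead lookup was launched some number of cycles before the branch is seen; the function and parameter names are mine.

```python
def cascading_select(elapsed_cycles, fast_pred, slow_pred, slow_latency=3):
    """Pick the best prediction that has finished by the time the branch is seen.

    fast_pred comes from the 1-cycle predictor, slow_pred from the slower,
    more accurate one; slow_latency is its access time in cycles.
    """
    if elapsed_cycles >= slow_latency:
        return slow_pred        # the slow table finished in time: use it
    return fast_pred            # otherwise settle for the quick prediction
```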
Overriding Branch Predictor
- Use a quick-and-dirty prediction first
- When you get the slow-and-clean prediction and it disagrees, initiate recovery action
- If prediction rates are 92% and 97%, 5% of all branches see a 2-cycle mispredict penalty and 3% see a 20-cycle penalty
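Working out the numbers on this slide: the average branch penalty is roughly 0.05 × 2 + 0.03 × 20 = 0.7 cycles, versus 0.08 × 20 = 1.6 cycles for the 92% fast predictor alone – most of the slow predictor's accuracy is recovered while keeping single-cycle fetch.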
Combining the Predictors?
- Look ahead into a number of predictors
- When you see a branch (after 3 cycles), use the prediction from your cache (in case of a hit) or the prediction from the regular 3-cycle predictor (in case of a miss)
- When you see the super-duper 5-cycle prediction, let it override any previous incorrect prediction
Latencies
Table: for 100nm and 35nm technologies – ABP delay, ABP entries, PHTC entries, PHT delay, and PHT entries
Results (Fig. 8)
- The cache doesn't seem to help at all (IPC of 1.1!); it is very surprising that the ABP and PHT have matching predictions most of the time
- For the cascading predictor, the slow predictor is used 45% of the time and it gives a better prediction than the 1-cycle predictor 5.5% of the time
- The overriding predictor disagrees 16.5% of the time and yields an IPC of 1.2 – hmmm…
Alpha Predictor
Figure: the PC indexes a local history table that feeds a local PHT; the global history indexes a global-predictor PHT and a chooser PHT that selects between the two predictions (table sizes labeled in the figure: 512 entries, 128 entries, 3200 bits, 128 entries)
Alpha (EV8)
- 352 Kbits! 2-cycle access time – the 4 predictor arrays are accessed in parallel – overrides the line prediction
- Large mispredict penalty on a deep pipeline – 8-wide processor with many instructions in flight
Predictor Sizes
All tables are indexed using combinations of history and PC
- Prediction table: BIM 16K, G0/G1/Meta 64K each
- Hysteresis table: BIM 16K, G0 32K, G1 64K, Meta 32K
- History length: different for each table
2Bc-gskew
Figure: the branch address indexes BIM; the address combined with the global history indexes G0, G1, and Meta; BIM, G0, and G1 form a majority vote, and Meta selects between BIM and the vote to produce the final prediction
Rules
- On a correct prediction:
  - if all agree, no update
  - if they disagree, strengthen the correct preds and the chooser
- On a misprediction:
  - update the chooser and recompute the prediction
  - on a correct (recomputed) prediction, strengthen the correct preds
  - on a misprediction, update all preds
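Putting the last two slides together, here is a minimal Python sketch of the 2Bc-gskew prediction and update policy. The table size, the hash functions, and the single-bank index width are assumptions for illustration; the real EV8 design splits prediction and hysteresis bits and uses carefully chosen, different index functions per bank.

```python
def _inc(c): return min(3, c + 1)
def _dec(c): return max(0, c - 1)

class TwoBcGskew:
    def __init__(self, index_bits=12):
        n = 1 << index_bits
        self.mask = n - 1
        self.bim  = [1] * n      # bimodal bank, indexed by address only
        self.g0   = [1] * n      # skewed global bank 0, address + history
        self.g1   = [1] * n      # skewed global bank 1, address + history
        self.meta = [1] * n      # chooser: >=2 means trust the gskew vote
        self.history = 0

    def _indices(self, pc):
        h = self.history
        return (pc & self.mask,                           # BIM
                (pc ^ h) & self.mask,                     # G0 (assumed hash)
                (pc ^ (h >> 1) ^ (pc << 1)) & self.mask,  # G1 (assumed skewed hash)
                (pc ^ h) & self.mask)                     # Meta (assumed hash)

    def _components(self, pc):
        ib, i0, i1, im = self._indices(pc)
        bim = self.bim[ib] >= 2
        g0, g1 = self.g0[i0] >= 2, self.g1[i1] >= 2
        vote = (bim + g0 + g1) >= 2                       # majority of BIM, G0, G1
        use_vote = self.meta[im] >= 2
        return ib, i0, i1, im, bim, g0, g1, vote, use_vote

    def predict(self, pc):
        *_, bim, g0, g1, vote, use_vote = self._components(pc)
        return vote if use_vote else bim

    def update(self, pc, taken):
        ib, i0, i1, im, bim, g0, g1, vote, use_vote = self._components(pc)
        final = vote if use_vote else bim
        banks = [(self.bim, ib, bim), (self.g0, i0, g0), (self.g1, i1, g1)]

        def strengthen_correct():
            for table, idx, pred in banks:
                if pred == taken:
                    table[idx] = _inc(table[idx]) if taken else _dec(table[idx])

        def update_all():
            for table, idx, _ in banks:
                table[idx] = _inc(table[idx]) if taken else _dec(table[idx])

        if final == taken:
            # correct prediction: no update if all banks agree, otherwise
            # strengthen the correct banks and the chooser
            if not (bim == g0 == g1):
                strengthen_correct()
                if bim != vote:                           # chooser had a real choice
                    self.meta[im] = _inc(self.meta[im]) if use_vote else _dec(self.meta[im])
        else:
            # misprediction: if the two components disagreed, move the chooser
            # toward the other one, then recompute the prediction from it
            if bim != vote:
                self.meta[im] = _dec(self.meta[im]) if use_vote else _inc(self.meta[im])
            other = bim if use_vote else vote
            if other == taken:
                strengthen_correct()                      # the other side was right
            else:
                update_all()                              # both wrong: update all banks

        self.history = ((self.history << 1) | int(taken)) & self.mask
```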
Design Choices
- A local predictor was avoided because up to 16 predictions are needed in a cycle and it is hard to maintain speculative local histories
- You have no control over local histories – you would need a 16-ported PHT
- Since the global history is common to all 16 predictions, you can control indexing into the PHT
- They advocate the use of larger overriding predictors for future technologies