Trace Substitution
Hans Vandierendonck, Hans Logie, Koen De Bosschere
Ghent University
Euro-Par 2003, Klagenfurt

August 27, 2003, Euro-Par

Instruction Fetch
Wide-issue superscalar processors need to fetch multiple branches per cycle
– IPC=8 implies fetching ~16 instructions/cycle and predicting ~3 branches/cycle
– Multi-ported instruction cache?
Trace cache:
– Packs fetch groups in a trace
– Trace tagged with PC, path, next fetch PC
– Multiple branch predictor (MBP) predicts branch directions

The Trace Cache
[Figure: fetch unit block diagram. Blocks: instruction cache, trace cache, MBP, MUX, fill unit. Signal labels: fetch address, pred. path, hit/miss, select, hit, pred. trace, pred. insn, next address, instructions. Annotation: only executed paths!]

Overview
Observation
– Trace cache misses are (sometimes) branch mispredictions
Trace Substitution
– How to make use of it
Evaluation
– Is it worth it?
Conclusion

Observation
Multiple branch predictor affects trace cache:
– Non-perfect branch predictors reduce the trace cache hit rate
– FIPA (fetched instructions per fetch unit access) correlates better with TC hit rate than with MBP accuracy
[Figure: configuration notes. TC: 16K traces, 4-way set-assoc, path associativity. MGAg, Mgshare: 12-bit history. repeat: 8Kbit hybrid, accessed 3x.]

TC Misses Are a Tell-Tale for MBP Misses
Trace cache misses coincide with branch mispredictions, e.g.:
– 16K-entry trace cache, 12-bit MGAg (high accuracy, low coverage):
    84.9% of TC misses are also MBP misses
    37.6% of MBP misses are also TC misses
– 256-entry trace cache, 12-bit MGAg (low accuracy, higher coverage):
    25.1% of TC misses are also MBP misses
    55.9% of MBP misses are also TC misses
This work: use TC misses to detect MBP misses and fix them
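
The two percentages quoted per configuration can be read as the accuracy and coverage of "TC miss" used as a detector for "MBP miss". The sketch below shows one way to compute them from a per-access log. The `FetchAccess` record, the `detector_quality` function and the sample log are hypothetical names and made-up data for illustration, not the authors' tooling or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchAccess:
    """One fetch unit access (hypothetical record, not from the slides)."""
    tc_miss: bool   # the trace cache missed on this access
    mbp_miss: bool  # the multiple branch predictor mispredicted


def detector_quality(accesses):
    """Return (accuracy, coverage) of 'TC miss' as a detector of 'MBP miss'.

    accuracy = fraction of TC misses that are also MBP misses
    coverage = fraction of MBP misses that are also TC misses
    """
    tc_misses = sum(a.tc_miss for a in accesses)
    mbp_misses = sum(a.mbp_miss for a in accesses)
    both = sum(a.tc_miss and a.mbp_miss for a in accesses)
    accuracy = both / tc_misses if tc_misses else 0.0
    coverage = both / mbp_misses if mbp_misses else 0.0
    return accuracy, coverage


# Made-up log: 4 TC misses, 6 MBP misses, 3 accesses with both.
sample_log = (
    [FetchAccess(True, True)] * 3
    + [FetchAccess(True, False)] * 1
    + [FetchAccess(False, True)] * 3
    + [FetchAccess(False, False)] * 5
)
accuracy, coverage = detector_quality(sample_log)
```

With this definition, a large trace cache that rarely misses gives few but reliable detections, while a small one misses often for capacity reasons, which lowers accuracy and raises coverage.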

Trace Substitution
Assumption: TC miss implies MBP miss
– Correlation between branches implies that some paths never occur
– TC stores only those paths that do occur
If the predicted path is wrong …
– Fetch a different trace
– Override MBP with MRU trace starting at fetch PC
    Detect MRU trace from LRU bits stored in TC
    No trace substitution applied if it does not exist
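
The substitution rule can be sketched as a lookup in one set of a path-associative trace cache. This is a minimal software model under stated assumptions, not the hardware described in the talk: `Trace`, `TraceCacheSet`, `fill` and `fetch` are hypothetical names, recency is modelled as list order instead of LRU bits, paths are matched in full, and a substitution does not update recency (the slides do not say whether it should).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Trace:
    start_pc: int        # fetch address the trace starts at
    path: tuple          # branch outcomes inside the trace, e.g. (True, False)
    instructions: list   # payload, opaque to this sketch


@dataclass
class TraceCacheSet:
    """One set of a path-associative trace cache (illustrative model)."""
    ways: int = 4
    traces: list = field(default_factory=list)  # index 0 is the MRU trace

    def _touch(self, trace):
        # Move the trace to the MRU position.
        self.traces.remove(trace)
        self.traces.insert(0, trace)

    def fill(self, trace):
        # The fill unit inserts executed paths only; evict the LRU trace if full.
        for t in self.traces:
            if t.start_pc == trace.start_pc and t.path == trace.path:
                self._touch(t)
                return
        if len(self.traces) == self.ways:
            self.traces.pop()
        self.traces.insert(0, trace)

    def fetch(self, pc, predicted_path):
        # Normal hit: a trace matches both the fetch PC and the MBP's path.
        for t in self.traces:
            if t.start_pc == pc and t.path == predicted_path:
                self._touch(t)
                return t, "hit"
        # TC miss: assume the MBP is wrong and override it with the MRU
        # trace starting at the fetch PC (the list is scanned MRU first).
        for t in self.traces:
            if t.start_pc == pc:
                return t, "substituted"
        # No trace starts at this PC: no substitution is applied.
        return None, "miss"


tc_set = TraceCacheSet(ways=2)
tc_set.fill(Trace(0x100, (True, True), ["a"]))
tc_set.fill(Trace(0x100, (True, False), ["b"]))
trace, kind = tc_set.fetch(0x100, (False, False))  # path never executed
```

In the example, the predicted path `(False, False)` was never stored, so the lookup returns the most recently filled trace for that PC and reports it as substituted. On a plain miss the fetch unit would fall back to the instruction cache.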

Implementation
[Figure: the fetch unit block diagram from "The Trace Cache", with additional "MRU" and "MRU hit" labels. Blocks: instruction cache, trace cache, MBP, MUX, fill unit. Signal labels: fetch address, pred. path, hit/miss, select, hit, pred. trace, pred. insn, next address, instructions.]

Evaluation Setup
Benchmarks
– SPECint95 (except compress, go), reference inputs
– 500 million instructions from start of program
– Compiled for Alpha ISA, Compaq C compiler, -O4
Fetch Unit
– TC: 1 trace = 16 instructions, 3 cond. branches, trace ends at system call, indirect jump
– TC: 4-way set-assoc., path associativity
– MBP: MGAg, varying history length
– Instruction cache: 32K, 2-way, 32-byte blocks, LRU
Metric
– FIPA = fetched instructions per fetch unit access

Evaluation (1)
Observations:
– Gap between MGAg and a perfect predictor increases with TC size
– 20-40% of gap filled with trace substitution
– Substitution applies only on a TC miss, thus the performance increase drops with TC size
[Figure: configuration notes. TC: 4-way set-associative. MGAg: 12-bit history.]

Evaluation (2)
Observations:
– Trace substitution compensates for a poor branch predictor
– No history with substitution ~ MGAg with 10-bit history
– Improvement drops with more accurate predictor
[Figure: configuration notes. TC: 256 traces, 4-way.]

Accuracy vs. Usage
Definitions:
– Usage = substitutions per fetch unit access
– Accuracy = fraction of substitutions that are correct
Note
– Accuracy limited because correct-path trace is not always present!
[Figure: configuration notes. TC: 256 traces, 4-way.]
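
The two definitions can be turned into a small calculation. The sketch below assumes a hypothetical per-access log in which each entry is `None` (no substitution on that access), `True` (substitution fetched the correct-path trace) or `False` (substitution fetched a wrong trace). The function name and the encoding are illustrative, not taken from the talk.

```python
def substitution_stats(outcomes):
    """Return (usage, accuracy) for a sequence of per-access outcomes.

    usage    = substitutions per fetch unit access
    accuracy = fraction of substitutions that were correct
    """
    outcomes = list(outcomes)
    accesses = len(outcomes)
    substitutions = sum(1 for o in outcomes if o is not None)
    correct = sum(1 for o in outcomes if o is True)
    usage = substitutions / accesses if accesses else 0.0
    accuracy = correct / substitutions if substitutions else 0.0
    return usage, accuracy


# Made-up log: 8 accesses, 3 substitutions, 2 of them correct.
usage, accuracy = substitution_stats(
    [None, True, False, None, True, None, None, None]
)
```

A substitution counts as wrong both when the MBP was right after all and when the correct-path trace is simply absent from the trace cache, which is why accuracy has an upper limit below 100%.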

Conclusion
Proposed trace substitution
– TC miss flags MBP miss
    Not always correct, not all MBP misses found
    Fetch MRU trace instead: cheap implementation
Results in
– Consistent performance improvement
    No history+substitution ~ MGAg with 10-bit history
    In other cases: 0.2 instructions/access or same performance as with 16 times smaller MBP
    Most effective when MBP or TC is small