Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps José A. Joao *‡ Onur Mutlu ‡* Hyesoon Kim § Rishi Agarwal.

Slides:



Advertisements
Similar presentations
Bimode Cascading: Adaptive Rehashing for ITTAGE Indirect Branch Predictor Y.Ishii, K.Kuroyanagi, T.Sawada, M.Inaba, and K.Hiraki.
Advertisements

Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *
Wish Branches Combining Conditional Branching and Predication for Adaptive Predicated Execution The University of Texas at Austin *Oregon Microarchitecture.
Dynamic Branch Prediction
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.
Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.
Wrong Path Events and Their Application to Early Misprediction Detection and Recovery David N. Armstrong Hyesoon Kim Onur Mutlu Yale N. Patt University.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong.
CPE 731 Advanced Computer Architecture ILP: Part II – Branch Prediction Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith Proceedings of MICRO-29.
UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,
1 Improving Branch Prediction by Dynamic Dataflow-based Identification of Correlation Branches from a Larger Global History CSE 340 Project Presentation.
EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
Wish Branches A Review of “Wish Branches: Enabling Adaptive and Aggressive Predicated Execution” Russell Dodd - October 24, 2006.
EECC722 - Shaaban #1 Lec # 10 Fall Conventional & Block-based Trace Caches In high performance superscalar processors the instruction fetch.
Branch Target Buffers BPB: Tag + Prediction
Address-Value Delta (AVD) Prediction Onur Mutlu Hyesoon Kim Yale N. Patt.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
Spring 2003CSE P5481 Control Hazard Review The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction.
1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.
EECC722 - Shaaban #1 Lec # 9 Fall Conventional & Block-based Trace Caches In high performance superscalar processors the instruction fetch.
Neural Methods for Dynamic Branch Prediction Daniel A. Jiménez Department of Computer Science Rutgers University.
CB E D F G Frequently executed path Not frequently executed path Hard to predict path A C E B H Insert select-µops (φ-nodes SSA) Diverge Branch CFM point.
Diverge-Merge Processor (DMP) Hyesoon Kim José A. Joao Onur Mutlu* Yale N. Patt HPS Research Group *Microsoft Research University of Texas at Austin.
Evaluation of Dynamic Branch Prediction Schemes in a MIPS Pipeline Debajit Bhattacharya Ali JavadiAbhari ELE 475 Final Project 9 th May, 2012.
Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs José A. Joao * M. Aater Suleman * Onur Mutlu ‡ Yale N. Patt * * HPS Research.
Flexible Reference-Counting-Based Hardware Acceleration for Garbage Collection José A. Joao * Onur Mutlu ‡ Yale N. Patt * * HPS Research Group University.
Revisiting Load Value Speculation:
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
Analysis of Branch Predictors
ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.
1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.
MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,
Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.
Diverge-Merge Processor (DMP) Hyesoon Kim José A. Joao Onur Mutlu* Yale N. Patt HPS Research Group *Microsoft Research University of Texas at Austin.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
Trace Substitution Hans Vandierendonck, Hans Logie, Koen De Bosschere Ghent University EuroPar 2003, Klagenfurt.
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.
CS 6290 Branch Prediction. Control Dependencies Branches are very frequent –Approx. 20% of all instructions Can not wait until we know where it goes –Long.
2D-Profiling Detecting Input-Dependent Branches with a Single Input Data Set Hyesoon Kim M. Aater Suleman Onur Mutlu Yale N. Patt HPS Research Group The.
Copyright 2016 Csaba Andras MoritzECE668 Power Aware Branching.1 Few slides adapted from Patterson, et al © UCB and Morgan Kaufmann Csaba Andras Moritz.
On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.
Prophet/Critic Hybrid Branch Prediction B B B
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
Samira Khan University of Virginia April 12, 2016
Computer Architecture Lecture 10: Branch Prediction II
Computer Architecture: Branch Prediction (II) and Predicated Execution
Prof. Hsien-Hsin Sean Lee
CS203 – Advanced Computer Architecture
CS5100 Advanced Computer Architecture Advanced Branch Prediction
Samira Khan University of Virginia Nov 13, 2017
CMSC 611: Advanced Computer Architecture
15-740/ Computer Architecture Lecture 25: Control Flow II
Milad Hashemi, Onur Mutlu, Yale N. Patt
EE 382N Guest Lecture Wish Branches
Address-Value Delta (AVD) Prediction
Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones Unconditional branches.
15-740/ Computer Architecture Lecture 24: Control Flow
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
Lecture 10: Branch Prediction and Instruction Delivery
José A. Joao* Onur Mutlu‡ Yale N. Patt*
Serene Banerjee, Lizy K. John, Brian L. Evans
15-740/ Computer Architecture Lecture 14: Prefetching
rePLay: A Hardware Framework for Dynamic Optimization
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
Presentation transcript:

Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps José A. Joao *‡ Onur Mutlu ‡* Hyesoon Kim § Rishi Agarwal †‡ Yale N. Patt * * HPS Research Group University of Texas at Austin ‡ Computer Architecture Group Microsoft Research § College of Computing Georgia Institute of Technology † Dept. of Computer Science and Eng. IIT Kanpur

Motivation 2 Polymorphism is a key feature of Object-Oriented Languages  Allows modular, extensible, and flexible software design Object-Oriented Languages include virtual functions to support polymorphism  Dynamically dispatched function calls based on object type Virtual functions are usually implemented using indirect jump/call instructions in the ISA Other programming constructs are also implemented with indirect jumps/calls: switch statements, jump tables, interface calls Indirect jumps are becoming more frequent with modern languages

Example from DaCapo fop (Java) public int mvalue() { if (!bIsComputed) computeValue(); return millipoints; } Length.class: Length LinearCombinationLength PercentLength MixedLength protected void computeValue() { // … setComputedValue(result); } LinearCombinationLength.class: protected void computeValue() { // … setComputedValue(result1); } PercentLength.class: LinearCombinationLength PercentLength LinearCombinationLength PercentLength LinearCombinationLength PercentLength This indirect call is hard to predict 3 protected void computeValue() {} LinearCombinationLength PercentLength

Predicting Direct Branches vs. Indirect Jumps 4 TARG A+1 A T N  A  ? Conditional (Direct) Branch Indirect Jump  Indirect jumps: Multiple target addresses  More difficult to predict than conditional (direct) branches Can degrade performance br.cond TARGET R1 = MEM[R2] branch R1

The Problem 5 Most processors predict using the BTB: target of indirect jump = target in previous execution  Stores only one target per jump (already done for conditional branches)  Inaccurate Indirect jumps usually switch between multiple targets ~50% of indirect jumps are mispredicted Most history-based indirect jump target predictors add large hardware resources for multiple targets

Indirect Jump Mispredictions 6 Data from Intel Core2 Duo processor 41% of mispredictions due to Indirect Jumps

BC F E H Frequently executed path Not frequently executed path A C G B I Insert select-µops (φ-nodes in SSA) DIP-jump CFM point A I Hard to predict p1 p2 nop Dynamic Indirect Jump Predication (DIP) 7 D call R1 G return F TARGET 3TARGET 1TARGET 2 p1 p2 nop

Dynamic Predication of Indirect Jumps The compiler uses control-flow analysis and profiling to identify  DIP-jumps: highly-mispredicted indirect jumps  Control-flow merge (CFM) points The microarchitecture decides when and what to predicate dynamically  Dynamic target selection 8

Dynamic Target Selection Last target hash hash2 Most-freq target 2 nd most-freq target PC Target Selection Table BTB Branch Target Buffer Most-freq target 2 nd most-freq target To Fetch Three frequency counters per entry Associated targets in the BTB Three frequency counters per entry Associated targets in the BTB 3.6KB

Additional DIP Entry/Exit Policies 10 Single predominant target in the TST  TST has more accurate information  Override the target prediction Nested low confidence DIP-jumps  Exit dynamic predication for the earlier jump and re-enter for the later one Return instructions inside switch statements  Merging address varies with calling site  Return CFM points

Methodology Dynamic profiling tool for DIP-jump and CFM point selection Cycle-accurate x86 simulator:  Processor configuration 64KB perceptron predictor 4K-entry, 4-way BTB (baseline indirect jump predictor) Minimum 30-cycle branch misprediction penalty 8-wide, 512-entry instruction window 300-cycle minimum memory latency 2KB 12-bit history enhanced JRS confidence estimator 32 predicate registers, 1 CFM register  Also less aggressive processor (in paper) Benchmarks: DaCapo suite (Java), matlab, m5, perl  Also evaluated SPEC CPU 2000 and

Indirect Jump Predictors Tagged Target Cache Predictor (TTC) [P. Chang et al., ISCA 97]  4-way set associative fully-tagged target table  Our version does not store easy-to-predict indirect jumps Cascaded Predictor [Driesen and Hölzle, MICRO 98, Euro-Par 99]  Hybrid predictor with tables of increasing complexity  3-stage predictor performs best Virtual Program Counter (VPC) Predictor [Kim et al., ISCA 07]  Predicts indirect jumps using the conditional branch predictor  Stores multiple targets on the BTB, as our target selection logic does 12

Performance, Power, and Energy % 24.8% 2.3% 45.5%

DIP vs. Indirect Jump Predictors 14

Outcome of Executed Indirect Jumps 15 DIP used BTB correct

Additional Evaluation (in paper) 16 Static vs. dynamic target selection policies DIP with more than 2 targets  2 dynamic targets is best DIP on top of a baseline with TTC, VPC or Cascaded predictors Sensitivity to:  Processor configuration  BTB size  TST size and structure More benchmarks (SPEC CPU 2000 and 2006)

Conclusion 17 Object-oriented languages use more indirect jumps  Indirect jumps are hard to predict and have already become an important performance limiter We propose DIP, a cooperative hardware-software technique  Improves performance by 37.8%  Reduces energy by 24.8%  Provides better performance and energy-efficiency than three indirect jump predictors  Incurs low hardware cost (3.6KB) if dynamic predication is already used for conditional branches  Can be an enabler encouraging developers to use object-oriented programming

Thank You! Questions?