Revisiting Load Value Speculation:

Presentation transcript:

Revisiting Load Value Speculation: an Approach to Mainstream Processors
Lois Orosa and Rodolfo Azevedo, University of Campinas
lois.orosa@ic.unicamp.br

Projects
- Load Value Speculation
- Frequent Value Locality
- On-chip photonics (Jorge González, PhD candidate)

Introduction to Value Speculation (I)
- Proposed in the 1990s.
- Improves ILP by breaking true data dependencies (RAW).
- Speculation on all instructions:
  - The prediction is written to the output register.
  - Predictors are indexed by PC (at fetch time).
- The proposals were very complex for their time (many changes to the OoO engine).
- Recently, Perais and Seznec revisited the topic [HPCA'13][ISCA'14][HPCA'15]:
  - They propose simplifications to the implementation.
  - They propose new predictors.

Introduction to Value Speculation (II)
- Confidence counters [Perais'13] (per instruction) increase precision:
  - Speculate only when confidence is high.
  - Fewer mispredictions, but lower coverage.
  - Increment when the prediction is correct; reset on a misprediction.
- Precision:
  - If the misprediction penalty is low, the system can tolerate low precision.
  - If the misprediction penalty is high, precision must be high (99% or more).
- The prediction has to be available before dispatch time:
  - Available cycles: from fetch to dispatch.
  - The predictor delay is not critical.
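
The confidence-counter rule on this slide (increment on a correct prediction, reset on a misprediction, speculate only when confidence is high) can be sketched as follows. This is a minimal illustration, not the scheme from [Perais'13]: the class name, the saturation behaviour and the threshold value are assumptions, since the slides give no concrete numbers.

```python
class ConfidenceCounter:
    """Per-instruction confidence counter (hypothetical sketch)."""

    def __init__(self, threshold=3):
        # Assumed threshold; the slides do not state one.
        self.threshold = threshold
        self.count = 0

    def should_speculate(self):
        # Speculate only when confidence is high.
        return self.count >= self.threshold

    def update(self, prediction_correct):
        if prediction_correct:
            # Increment on a correct prediction, saturating at the threshold.
            self.count = min(self.count + 1, self.threshold)
        else:
            # Reset on a misprediction.
            self.count = 0


# Usage: record the speculation decision taken before each outcome is known.
counter = ConfidenceCounter(threshold=3)
outcomes = [True, True, True, False, True]
decisions = []
for correct in outcomes:
    decisions.append(counter.should_speculate())
    counter.update(correct)
```

The counter only allows speculation after three consecutive correct predictions, and a single misprediction sends it back to zero. That is the precision/coverage trade-off named on the slide.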

Introduction to Value Speculation (III)
- Validation:
  - At execution time (OoO changes, small misprediction penalty).
  - At commit time (no OoO changes, higher misprediction penalty).
- Recovering from a misprediction:
  - Selective reissue: faster, more complex (validation at execution time).
  - Pipeline squashing: slower, simpler.
- Two main problems:
  - Register port pressure: new extra ports (extra writes for predictions, extra reads for validation and predictor updates).
  - Back-to-back predictions: predictors may depend on previous values.

Contributions
- Analysis of the potential of value speculation in a narrow processor for different types of instructions.
- Reduced complexity in narrow-issue processors by speculating only on load instructions.
- AV predictor: a two-phase value predictor that predicts addresses.
- XLStride predictor: a multilevel stride predictor.

Baseline Processor & Benchmarks
- Baseline: a real narrow-issue processor.
- ZSIM simulator: Westmere OoO x86-64, 4-issue, 2-level branch predictor.
- 128-entry ROB, 32-entry load queue, 32-entry store queue.
- L1I and L1D: 32KB, 4-way, LRU, 4-cycle latency.
- L2 cache: 256KB, 8-way, LRU, 12-cycle latency.
- Pipeline squashing, validation at commit.
- Benchmarks: Splash2, Parsec, SPEC2000, SPEC2006.

Potential of Value Speculation (I)
- Six categories of instructions.
- Loads are 25% of all dynamic micro-instructions.
- High-latency micro-instructions (more than 5 cycles) are not representative (included in "Other").

Potential of Value Speculation (III)
- Oracle predictor (no mispredictions).
- Value speculation in each category of instruction.
- Loads have almost the same potential as speculating on all instructions.
- Loads (25%) have more potential gains than all the other instructions together (75%).
[Figures: series LOADS, ALL, NOTLOADS]

Advantages of Speculating Only on Loads in a Narrow Processor
- Value speculation in narrow-issue processors:
  - Reduced back-to-back prediction: fewer in-flight instructions.
  - Approach to mainstream processors.
  - Reduced misprediction penalty (smaller pipeline).
- Speculation on ¼ of the instructions (loads), with almost the same potential gains:
  - Reduced register port pressure.
  - Reduced back-to-back prediction.
- Confidence counters are still needed to increase precision:
  - "mcf" minimum precision: 76.7%
  - "tonto" minimum precision: 99.6%

State-of-the-Art Predictors
- Last Value Predictor (LVP)
- Stride predictor {1,2,3,4,5}; variants: 2D-Stride
- FCM
- VTAGE
- DVTAGE, DFCM
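
As a rough illustration of the two simplest predictors in this list, the sketch below implements a last value predictor and a basic stride predictor, both indexed by PC. It is a simplification under stated assumptions: the tables are unbounded Python dictionaries rather than fixed-size hardware tables, there are no confidence counters, and the class and function names are made up for this example.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self):
        self.table = {}  # pc -> last value (unbounded; a simplification)

    def predict(self, pc):
        return self.table.get(pc)  # None means "no prediction"

    def update(self, pc, value):
        self.table[pc] = value


class StridePredictor:
    """Predicts last value + last observed stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_value, stride = entry
        return last_value + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        stride = value - entry[0] if entry is not None else 0
        self.table[pc] = (value, stride)


def count_hits(predictor, pc, values):
    """Count correct predictions over a value stream of one instruction."""
    hits = 0
    for value in values:
        if predictor.predict(pc) == value:
            hits += 1
        predictor.update(pc, value)
    return hits


# Usage: a strided stream (10, 20, 30, ...) and a constant stream at one PC.
strided = [10, 20, 30, 40, 50]
constant = [7, 7, 7, 7]
lvp_on_strided = count_hits(LastValuePredictor(), 0x400, strided)
stride_on_strided = count_hits(StridePredictor(), 0x400, strided)
lvp_on_constant = count_hits(LastValuePredictor(), 0x400, constant)
```

The stride predictor needs two observations before it knows the stride, so it catches the last three values of the strided stream, while the last value predictor catches none of them.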

XLStride Predictor
- It detects strides between consecutive values and also between alternating values.
  - Examples: {2,1,1,4,4,3,6,6,7,8}, {4,0,4,9,4,1,4}
- It can have several levels:
  - X histories, each containing stride information about the last X occurrences of the instruction.
  - It requires X^2 strides plus the last value.
  - 16-bit strides.
  - X predictions: selection by confidence counters.
- We implemented a 2LStride predictor (good performance/cost ratio).
- Example: 2LStride, 1-bit confidence counter.
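
The idea of tracking strides between consecutive values and between alternating values can be illustrated with the sketch below. It is not the authors' 2LStride design: the slides do not describe the table layout, so this example assumes one stride, one last value and one 1-bit confidence flag per interleaved sub-stream, and it simply takes the first confident level. All names are hypothetical.

```python
class LagStride:
    """Stride predictor over values `lag` occurrences apart (illustrative)."""

    def __init__(self, lag):
        self.lag = lag
        # One (last_value, stride, confident) entry per interleaved sub-stream.
        self.entries = [None] * lag
        self.pos = 0  # sub-stream that the next occurrence belongs to

    def predict(self):
        entry = self.entries[self.pos]
        if entry is None or not entry[2]:
            return None  # not confident: no prediction
        return entry[0] + entry[1]

    def update(self, value):
        entry = self.entries[self.pos]
        if entry is None:
            self.entries[self.pos] = (value, 0, False)
        else:
            last_value, stride, _ = entry
            # 1-bit confidence: set on a hit, cleared on a miss.
            hit = (last_value + stride == value)
            self.entries[self.pos] = (value, value - last_value, hit)
        self.pos = (self.pos + 1) % self.lag


class TwoLevelStrideSketch:
    """Combines a consecutive-value level and an alternating-value level."""

    def __init__(self):
        self.levels = [LagStride(1), LagStride(2)]

    def predict(self):
        # Selection by confidence: use the first level that is confident.
        for level in self.levels:
            prediction = level.predict()
            if prediction is not None:
                return prediction
        return None

    def update(self, value):
        for level in self.levels:
            level.update(value)


# Usage: the second example stream from the slide.
predictor = TwoLevelStrideSketch()
predictions = []
for value in [4, 0, 4, 9, 4, 1, 4]:
    predictions.append(predictor.predict())
    predictor.update(value)
```

On this stream the consecutive-value level never becomes confident, but the alternating level learns that every other value is 4 and predicts the last two of them.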

AV Predictor
- Some benchmarks exhibit patterns in the addresses, not in the values.
- Address Table (AT): indexed by PC; result: predicted address.
  - Implemented with a state-of-the-art predictor.
- Value Table (VT): indexed by predicted address; result: predicted value.
  - Implemented with a last value predictor.
  - The VT is also updated on stores.
- It detects patterns in the addresses: results are totally different from those of traditional predictors.
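
The two-phase lookup (PC to predicted address, then predicted address to predicted value) can be sketched as below. Several details are assumptions made for illustration: the address table uses a plain stride predictor as a stand-in for the unspecified state-of-the-art predictor, both tables are unbounded dictionaries, confidence counters are omitted, and the class and method names are hypothetical.

```python
class AVPredictorSketch:
    """Two-phase address/value predictor (illustrative sketch)."""

    def __init__(self):
        self.address_table = {}  # pc -> (last_address, stride)
        self.value_table = {}    # address -> last value seen at that address

    def predict(self, pc):
        # Phase 1: predict the load address from the PC.
        entry = self.address_table.get(pc)
        if entry is None:
            return None
        predicted_address = entry[0] + entry[1]
        # Phase 2: look up the last value seen at the predicted address.
        return self.value_table.get(predicted_address)

    def on_load(self, pc, address, value):
        entry = self.address_table.get(pc)
        stride = address - entry[0] if entry is not None else 0
        self.address_table[pc] = (address, stride)
        self.value_table[address] = value

    def on_store(self, address, value):
        # The value table is also updated on stores.
        self.value_table[address] = value


# Usage: stores fill an array with values that have no stride pattern,
# then one load instruction walks the array with an 8-byte address stride.
av = AVPredictorSketch()
data = [13, 7, 42, 5, 99]
for i, value in enumerate(data):
    av.on_store(0x1000 + 8 * i, value)

predictions = []
for i, value in enumerate(data):
    predictions.append(av.predict(0x400))
    av.on_load(0x400, 0x1000 + 8 * i, value)
```

The values themselves are irregular, so a last value or stride predictor on the values would fail here. Once the address stride is learned, the value table supplies the correct value for the last three loads.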

Evaluation
- Load speculation.
- 7 state-of-the-art predictors.
- 2LStride predictor.
- 3 AV predictors.
- Several hybrid predictors.
- Uses half of the entries of the state-of-the-art predictors [Perais and Seznec, HPCA'13].

Results (I)
- Individual results.
- Hybrid results: always better than the best of the single predictors.
[Figure label: best of the single predictors]

Results (II)
- Multicore experiments with 24 cores.
- Goal: check the influence of shared memory on precision.
- Precision on the value table => no changes in shared memory by remote processors.

Conclusions
- We simulate a real processor (Intel Westmere) to bring value prediction closer to general-purpose (narrow-issue) processors.
- Speculating on loads has a better cost/benefit ratio than speculating on all instructions (in narrow processors).
- We propose the XLStride predictor: it detects more complex stride patterns.
- We propose the AV predictor: complementary to traditional predictors, ideal for hybrid predictors.
- Speed-up of up to 33% (average 10%).
- Shared memory in multicore processors barely affects the precision of the predictors.

Thank you!
Lois Orosa, lois.orosa@ic.unicamp.br

Potential of Value Speculation (II)