BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control

1/25 June 28th, 2006
BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
Patrick Akl and Andreas Moshovos
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto

2/25 What Happens on a Branch Misprediction?
Execution timeline: predict a branch outcome, execute down the predicted path, misprediction discovered, recover processor state, redirect fetch, resume execution on the correct path.
We wish to make the recovery fast.

3/25 State-of-the-art recovery
Existing mechanisms
– Reorder buffer based: slow
– Instantaneous checkpoints: faster
Problem: can’t have enough checkpoints
State-of-the-art solution: checkpoint prediction
– Allocate the few checkpoints judiciously
Another degree of freedom: speculation control
– Sometimes deeper speculation = higher recovery cost
– Can hurt performance
– Throttle speculation

4/25 BranchTap Results / Benefits
No additional checkpoints are needed
Dynamically adapts to application behavior
Improves performance for most programs
– Misprediction performance penalty reduced by 28% on average
BranchTap comes “for free”
– Very simple to implement
– Better than more accurate checkpoint predictors

5/25 Outline
Background
BranchTap
Methodology and Results
Summary

6/25 State Recovery Example: Register Alias Table
[Figure: the RAT maps each architectural register to a physical register; labels: # arch. regs, Lg(# arch. regs); entries shown: p1 p2 p3 p4 p5 p4]
Original Code
A add r1, r2, 100
B breq r1, E
C sub r1, r2, r2
Renamed Code
A add p4, p2, 100
B breq p4, E
C sub p5, p2, p2
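The renaming step in this example can be sketched in a few lines of Python. This is only an illustration, not the authors' hardware: the `rename` helper, the initial mapping (r1 to p1, r2 to p2, r3 to p3) and the free-list order are assumptions chosen so that the output matches the renamed code on the slide.

```python
def rename(instrs, rat, free_list):
    """Rename (op, dest, srcs) tuples in program order, updating rat in place."""
    renamed = []
    for op, dest, srcs in instrs:
        # Sources read the current mapping before the destination is remapped.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            # Allocate a fresh physical register and record it in the RAT.
            new_dest = free_list.pop(0)
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Assumed initial state (hypothetical, picked to reproduce the slide).
rat = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = ["p4", "p5", "p6"]
program = [
    ("add", "r1", ["r2", 100]),    # A
    ("breq", None, ["r1", "E"]),   # B
    ("sub", "r1", ["r2", "r2"]),   # C
]
renamed = rename(program, rat, free_list)
```

After the three instructions, `rat["r1"]` points at p5, which is the speculative state that has to be undone if branch B turns out to be mispredicted.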

7/25 ROB: Slow, Fine-Grain Recovery
[Figure: Reorder Buffer in program order with branches marked B; RAT marked INVALID]
1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order
Each entry contains
1. Architectural destination register
2. Its previous RAT map
Too slow: recovery latency proportional to number of instructions to squash
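The three recovery steps above can be sketched as a backwards walk over the reorder buffer. This is a minimal model under stated assumptions: `RobEntry`, `rob_recover` and the example register names are hypothetical, and the returned count of walked entries is only a stand-in for recovery latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    arch_dest: Optional[str]   # architectural destination register, if any
    prev_phys: Optional[str]   # RAT mapping that this instruction overwrote


def rob_recover(rob, rat, branch_index):
    """Squash every entry younger than rob[branch_index], newest first.

    Each squashed entry restores the RAT mapping it overwrote. The number of
    entries walked is returned as a rough proxy for recovery latency.
    """
    walked = 0
    while len(rob) > branch_index + 1:
        entry = rob.pop()                      # newest instruction first
        if entry.arch_dest is not None:
            rat[entry.arch_dest] = entry.prev_phys
        walked += 1
    return walked


# RAT state after renaming four instructions (assumed example values).
rat = {"r1": "p5", "r2": "p6"}
rob = [
    RobEntry("r1", "p1"),   # add r1 (p1 -> p4)
    RobEntry(None, None),   # mispredicted branch
    RobEntry("r1", "p4"),   # sub r1 (p4 -> p5)
    RobEntry("r2", "p2"),   # add r2 (p2 -> p6)
]
walked = rob_recover(rob, rat, branch_index=1)
```

The loop runs once per squashed instruction, which is the proportional cost the slide calls too slow.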

8/25 Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: Reorder Buffer in program order with branches marked B; RAT marked INVALID; a checkpoint at the mispredicted branch]
1. Misprediction discovered
Branch w/ GC: Recovery is “Instantaneous”
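For contrast with the ROB walk, checkpoint-based recovery can be modelled as taking a whole-RAT snapshot at a branch and copying it back on a misprediction. The class and method names below are hypothetical, and a Python dict copy only stands in for the hardware snapshot; the sketch also shows why a small checkpoint budget leaves some branches uncovered.

```python
class CheckpointedRat:
    """RAT with a small, fixed number of global checkpoints (illustrative)."""

    def __init__(self, mapping, max_checkpoints):
        self.mapping = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}   # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails when no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.mapping)
        return True

    def restore(self, branch_id):
        """Recover in one step, independent of how many instructions follow."""
        self.mapping = dict(self.checkpoints[branch_id])
        # Release this checkpoint and any younger ones (dicts keep insertion order).
        ids = list(self.checkpoints)
        for stale in ids[ids.index(branch_id):]:
            del self.checkpoints[stale]


crat = CheckpointedRat({"r1": "p1", "r2": "p2"}, max_checkpoints=1)
covered_first = crat.take_checkpoint("B1")    # True: a checkpoint was free
crat.mapping["r1"] = "p5"                     # speculative rename past B1
covered_second = crat.take_checkpoint("B2")   # False: budget exhausted
crat.restore("B1")
```

A branch such as B2 that fails to get a checkpoint would have to fall back to the slower ROB-based recovery.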

9/25 Impact of More Checkpoints
More checkpoints?
– Power hungry structure
– Increased delay
Only a few checkpoints can practically be implemented
– Cannot always cover all branches
[Figure: RAT concept (architectural register to physical register, with checkpoints) next to the actual implementation (working copy plus checkpoints)]

10/25 Intelligent Checkpointing
State of the art solution
– Checkpoint allocation: allocate checkpoints at hard-to-predict branches
– Checkpoint management: release checkpoints as soon as they are no longer needed
Use few checkpoints efficiently

11/25 Conventional Mechanisms: Recovery Scenarios
Misspeculation on a branch w/ a GC: direct recovery (fast)
Misspeculation on a branch w/o a GC: indirect recovery (slow)
With intelligent checkpointing: 30% indirect recoveries -> 75% of performance loss
[Figure: two ROBs with branches marked B and checkpoints, labeled Fast Recovery and Slow Recovery]

12/25 Outline
Background
BranchTap
Methodology and Results
Summary

13/25 BranchTap Motivation
[Figure: two ROB diagrams with branches marked B, a checkpoint and a low confidence branch. No Wait Scenario: misprediction discovered, with the recovery cost marked. Wait Scenario: shown for comparison.]
Sometimes, it is better to wait if no checkpoint is available

14/25 BranchTap Concept
Key idea: stall when speculation is likely to deteriorate performance
– Count the number of low confidence branches w/o a checkpoint
– If it exceeds a threshold, stall
Threshold selection
– Fixed: varies greatly across programs; can deteriorate performance significantly
– Adaptive: robust performance
Minimize recovery cost while conserving good speculation opportunities
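The counting rule above can be sketched as a single counter compared against a threshold. This is a minimal model with a fixed threshold only; the class name, the event hooks and the choice to decrement when a branch resolves are assumptions, and the adaptive threshold policy is not reproduced here because the transcript does not describe it.

```python
class SpeculationThrottle:
    """Stall when too many unprotected low-confidence branches are in flight."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0   # low-confidence branches w/o a checkpoint

    def on_branch_dispatch(self, low_confidence, has_checkpoint):
        # Only branches that are both risky and unprotected are counted.
        if low_confidence and not has_checkpoint:
            self.unprotected += 1

    def on_branch_resolve(self, low_confidence, has_checkpoint):
        # Assumed bookkeeping: a resolved branch no longer contributes.
        if low_confidence and not has_checkpoint and self.unprotected > 0:
            self.unprotected -= 1

    def should_stall(self):
        # "If it exceeds a threshold, stall"
        return self.unprotected > self.threshold


throttle = SpeculationThrottle(threshold=1)
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=True)    # not counted
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
stall_before = throttle.should_stall()
throttle.on_branch_resolve(low_confidence=True, has_checkpoint=False)
stall_after = throttle.should_stall()
```

One counter and one comparison are all the sketch needs, which is in the spirit of the "few counters and comparators" claim made in the summary.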

15/25 Threshold Adaptation Policy
BranchTap adapts across and within applications

16/25 Outline
Background
BranchTap
Methodology and Results
Summary

17/25 Results Overview
Performance w/o Checkpoints
– BranchTap improves even with just an ROB
Performance w/ 4 Checkpoints
– BranchTap improves over conventional recovery methods
Performance w/ Larger Checkpoint Predictors
– BranchTap offers better performance than a 64x larger predictor

18/25 Methodology
Simulator based on SimpleScalar
24 SPEC CPU 2000 benchmarks
Reference inputs
Processor configurations
– 8-way OoO core
– Up to 1K in-flight instructions
– 1K-entry confidence table for low confidence branch identification
1B committed instructions after skipping 100B

19/25 “Perfect Checkpointing” Configuration
A checkpoint is auto-magically taken at all mispredicted branches
– All recoveries are fast
We report the “deterioration relative to perfect checkpointing”
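One plausible reading of this metric is the fractional performance lost against the perfect-checkpointing baseline. The formula below is an assumption, since the slides do not define the metric; the function name and the idea of feeding it IPC values are hypothetical.

```python
def relative_deterioration(perfect_perf, measured_perf):
    """Fraction of performance lost relative to perfect checkpointing.

    Assumed definition: (perfect - measured) / perfect, where both arguments
    are a higher-is-better performance number such as IPC.
    """
    return (perfect_perf - measured_perf) / perfect_perf


# Made-up numbers, purely to show the arithmetic.
example = relative_deterioration(perfect_perf=2.0, measured_perf=1.5)
```

Under this reading, 0 means the configuration matches perfect checkpointing and larger values mean more performance is lost to slow recoveries.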

20/25 Performance with No Checkpoints
[Figure: deterioration relative to “perfect checkpointing”; annotation: -39% deterioration; an arrow marks the “better” direction]
BranchTap improves over conventional mechanisms
Adaptation leads to robust performance improvements

21/25 Performance Evaluation with 4 Checkpoints
[Figure: deterioration relative to “perfect checkpointing”; annotation: -28% deterioration; an arrow marks the “better” direction]
BranchTap with 4 checkpoints is better than 6 checkpoints alone

22/25 BranchTap vs. Larger Checkpoint Predictors
BranchTap with a 1K-entry confidence table and 4 GCs:
– Higher performance than a 64K-entry confidence table with 4 GCs
– Lower complexity, virtually comes “for free”
[Figure: deterioration vs. confidence table size, with BranchTap marked; an arrow marks the “better” direction]

23/25 Outline
Background
BranchTap
Methodology and Results
Summary

24/25 Summary
Performance with 4 (no) checkpoints
– ~28 (39) % of misprediction penalty removed
– BranchTap is robust: up to 6 (13) % better and at most 1.2 (0.1) % worse than conventional mechanisms
BranchTap is very simple to implement
– Few counters and comparators
BranchTap is better than other alternatives
– BT + 1K predictor better than a 64K predictor alone
– BT + 4 GCs better than 6 GCs alone

25/25 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
Patrick Akl and Andreas Moshovos
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
{pakl,