Patrick Akl and Andreas Moshovos AENAO Research Group

Presentation transcript:

BranchTap Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control Patrick Akl and Andreas Moshovos AENAO Research Group Department of Electrical and Computer Engineering University of Toronto

What Happens on a Branch Misprediction?
Execution timeline: predict a branch outcome, follow the predicted path, discover the misprediction, recover processor state, redirect fetch, and resume execution on the correct path.
We wish to make the recovery fast.

State-of-the-art recovery
- Existing mechanisms
  - Reorder buffer based: slow
  - Instantaneous checkpoints: faster
  - Problem: can't have enough checkpoints
- State-of-the-art solution: checkpoint prediction
  - Allocate the few checkpoints judiciously
- Another degree of freedom: speculation control
  - Sometimes deeper speculation = higher recovery cost
  - Can hurt performance
  - Throttle speculation

BranchTap Results / Benefits
- No additional checkpoints are needed
- Dynamically adapts to application behavior
- Improves performance for most programs
  - Misprediction performance penalty reduced by 28% on average
- BranchTap comes "for free"
  - Very simple to implement
  - Better than more accurate checkpoint predictors

Outline
- Background
- BranchTap
- Methodology and Results
- Summary

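State Recovery Example: Register Alias Table
The RAT has one entry per architectural register (# arch. regs entries, indexed with lg(# arch. regs) bits). Each entry holds the physical register currently mapped to that architectural register.

Original code:
    A  add  r1, r2, 100
    B  breq r1, E
    C  sub  r1, r2, r2

Renamed code:
    A  add  p4, p2, 100
    B  breq p4, E
    C  sub  p5, p2, p2

[Figure: RAT contents while the code is renamed; the entries shown include p1, p2, p3, p4, and p5.]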
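The renaming step in this example can be sketched in a few lines of Python. This is an illustrative model, not the authors' hardware: the initial mappings (r1 to p1, r2 to p2), the free-list order, and the tuple encoding of instructions are all assumptions chosen so that the output matches the renamed code on the slide.

```python
def rename(code, rat, free_list):
    """Rename (op, dest, srcs) tuples in program order.

    rat maps architectural register -> physical register (assumed layout).
    free_list holds unused physical registers, oldest first (assumed policy).
    """
    renamed = []
    for op, dest, srcs in code:
        # Sources read the current mapping; immediates and labels pass through.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            # Every write gets a fresh physical register and updates the RAT.
            new_dest = free_list.pop(0)
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Hypothetical starting state; the slide does not state the initial mappings.
rat = {"r1": "p1", "r2": "p2"}
free_list = ["p4", "p5"]
code = [
    ("add", "r1", ["r2", 100]),    # A
    ("breq", None, ["r1", "E"]),   # B
    ("sub", "r1", ["r2", "r2"]),   # C
]
renamed = rename(code, rat, free_list)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After C is renamed the RAT maps r1 to p5, so if branch B turns out to be mispredicted, the mapping of r1 has to be rolled back to p4. That rollback is the state recovery problem the rest of the talk addresses.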

ROB: Slow, Fine-Grain Recovery
Each reorder buffer entry contains:
- Architectural destination register
- Its previous RAT map
Recovery steps:
1. Misprediction discovered (the RAT is now invalid)
2. Locate newest instruction
3. Undo RAT updates in reverse order
Too slow: recovery latency is proportional to the number of instructions to squash.
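A minimal sketch of this ROB walk, assuming each ROB entry is modeled as an `(arch_dest, prev_phys)` pair and the ROB as a Python list in program order (both are modeling assumptions, not details from the slides):

```python
def rob_recover(rob, rat, branch_index):
    """Squash every entry younger than rob[branch_index].

    Walks from the newest instruction back toward the branch, restoring the
    previous RAT mapping recorded in each entry. Returns the number of undo
    steps, which grows with the number of squashed instructions.
    """
    steps = 0
    while len(rob) - 1 > branch_index:
        arch_dest, prev_phys = rob.pop()  # newest entry first
        if arch_dest is not None:
            rat[arch_dest] = prev_phys    # undo this instruction's RAT update
        steps += 1
    return steps


# Hypothetical state after renaming A (add), B (breq), C (sub):
rob = [("r1", "p1"), (None, None), ("r1", "p4")]
rat = {"r1": "p5", "r2": "p2"}
steps = rob_recover(rob, rat, branch_index=1)  # B was mispredicted
print(steps, rat)
```

The returned step count makes the slide's point concrete: the loop runs once per squashed instruction, so the recovery latency scales with how far the processor has speculated past the branch.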

Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: reorder buffer in program order with checkpoints taken at some branches (B); a misprediction is discovered and the RAT is invalid.]
For a branch with a global checkpoint (GC), recovery is "instantaneous".
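Checkpoint-based recovery can be sketched as a RAT that keeps a small number of full snapshots. The class name, the dictionary-based RAT, and the policy of refusing a checkpoint when none is free are assumptions made for illustration only.

```python
class CheckpointedRAT:
    """RAT with a bounded pool of global checkpoints (illustrative model)."""

    def __init__(self, mapping, max_checkpoints):
        self.rat = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}  # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails if no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.rat)
        return True

    def recover(self, branch_id):
        """Restore the snapshot in one step, regardless of squash depth."""
        ids = list(self.checkpoints)
        pos = ids.index(branch_id)
        self.rat = dict(self.checkpoints[branch_id])
        # The branch's checkpoint and all younger ones are no longer needed.
        for stale in ids[pos:]:
            del self.checkpoints[stale]


crat = CheckpointedRAT({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
took_b = crat.take_checkpoint("B")       # checkpoint at branch B
crat.rat["r1"] = "p5"                    # C renames r1 after the branch
took_d = crat.take_checkpoint("D")       # pool exhausted: D is not covered
crat.recover("B")
print(took_b, took_d, crat.rat)
```

The restore is a single copy no matter how many instructions are squashed, which is why the slide calls it "instantaneous". The bounded pool shows the limitation discussed next in the talk: branch D gets no checkpoint.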

Impact of More Checkpoints
[Figure: concept vs. actual implementation of a RAT with a working copy and checkpoints, mapping architectural registers to physical registers.]
- More checkpoints? Power-hungry structure, increased delay
- Only a few checkpoints can practically be implemented
- Cannot always cover all branches

Intelligent Checkpointing
State-of-the-art solution:
- Checkpoint allocation: allocate checkpoints at hard-to-predict branches
- Checkpoint management: release checkpoints as soon as they are no longer needed
Use few checkpoints efficiently.

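Conventional Mechanisms: Recovery Scenarios
- Misspeculation on a branch w/ a GC: direct recovery (fast recovery from the checkpoint)
- Misspeculation on a branch w/o a GC: indirect recovery (slow recovery through the ROB)
- With intelligent checkpointing: 30% indirect recoveries lead to 75% of the performance loss
[Figure: two ROB timelines, one showing fast recovery at a checkpointed branch and one showing slow recovery at a branch without a checkpoint.]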

BranchTap Motivation
[Figure: ROB timelines for a low-confidence branch with no checkpoint available, comparing the recovery cost of a "no wait" scenario with that of a "wait" scenario when the misprediction is discovered.]
Sometimes, it is better to wait if no checkpoint is available.

BranchTap Concept
- Key idea: stall when speculation is likely to deteriorate performance
  - Count the number of low-confidence branches w/o a checkpoint
  - If it exceeds a threshold, stall
- Threshold selection
  - Fixed: varies greatly across programs; can deteriorate performance significantly
  - Adaptive: robust performance
- Minimize recovery cost while conserving good speculation opportunities
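The counting rule can be sketched as follows. This models only the fixed-threshold variant, because the slides do not spell out the adaptation policy; the class and method names are hypothetical, and a real implementation would be a hardware counter and comparator, not software.

```python
class SpeculationThrottle:
    """Counts in-flight low-confidence branches that have no checkpoint."""

    def __init__(self, threshold):
        self.threshold = threshold  # fixed here; BranchTap adapts it at run time
        self.uncovered = 0

    def on_branch_dispatch(self, low_confidence, has_checkpoint):
        """Returns True if this branch was counted as uncovered."""
        if low_confidence and not has_checkpoint:
            self.uncovered += 1
            return True
        return False

    def on_branch_resolve(self, was_counted):
        """Call when a branch resolves, passing back the dispatch result."""
        if was_counted:
            self.uncovered -= 1

    def should_stall(self):
        # "If it exceeds a threshold, stall"
        return self.uncovered > self.threshold


throttle = SpeculationThrottle(threshold=1)
b1 = throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
b2 = throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=True)
b3 = throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
stalled = throttle.should_stall()   # two uncovered branches exceed threshold 1
throttle.on_branch_resolve(b1)
resumed = not throttle.should_stall()
print(stalled, resumed)
```

Branches that received a checkpoint never count toward the threshold, since a misprediction on them recovers quickly anyway. Only branches that would force a slow ROB walk throttle further speculation.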

Threshold Adaptation Policy BranchTap adapts across and within applications

Results Overview
- Performance w/o checkpoints: BranchTap improves even with just an ROB
- Performance w/ 4 checkpoints: BranchTap improves over conventional recovery methods
- Performance w/ larger checkpoint predictors: BranchTap offers better performance than a 64x larger predictor

Methodology
- Simulator based on SimpleScalar
- 24 SPEC CPU 2000 benchmarks, reference inputs
- Processor configuration
  - 8-way OoO core
  - Up to 1K in-flight instructions
  - 1K-entry confidence table for low-confidence branch identification
- 1B committed instructions after skipping 100B

"Perfect Checkpointing" Configuration
- A checkpoint is automagically taken at all mispredicted branches
- All recoveries are fast
- We report the "deterioration relative to perfect checkpointing"
We compare BranchTap against the obvious solution of unrestricted speculation, and we normalize our performance results to the "Perfect Checkpointing" configuration, in which every misprediction is assumed to be automagically checkpointed.
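The slides do not give a formula for this metric. One plausible reading, stated here purely as an assumption, is the fractional performance loss against the perfect-checkpointing configuration, with the "penalty removed" figures being the relative drop in that loss. The IPC values below are made-up inputs for illustration and are not results from the talk.

```python
def deterioration(perf_config, perf_perfect):
    """Fractional slowdown vs. perfect checkpointing (assumed definition)."""
    return (perf_perfect - perf_config) / perf_perfect


def penalty_removed(det_baseline, det_new):
    """Fraction of the baseline deterioration that the new scheme removes."""
    return 1.0 - det_new / det_baseline


# Hypothetical IPC numbers, chosen only to exercise the arithmetic.
ipc_perfect = 2.0
ipc_conventional = 1.8
ipc_throttled = 1.9

det_conv = deterioration(ipc_conventional, ipc_perfect)
det_thr = deterioration(ipc_throttled, ipc_perfect)
removed = penalty_removed(det_conv, det_thr)
print(round(det_conv, 3), round(det_thr, 3), round(removed, 3))
```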

Performance with No Checkpoints
[Figure: deterioration relative to "perfect checkpointing", with a "better" direction marker; annotation: -39% deterioration.]
- BranchTap improves over conventional mechanisms
- Adaptation leads to robust performance improvements

Performance Evaluation with 4 Checkpoints
[Figure: deterioration relative to "perfect checkpointing", with a "better" direction marker; annotation: -28% deterioration.]
- BranchTap with 4 checkpoints is better than 6 checkpoints alone

BranchTap vs. Larger Checkpoint Predictors
- BranchTap with a 1K-entry confidence table and 4 GCs:
  - Higher performance than a 64K-entry confidence table with 4 GCs
  - Lower complexity, virtually comes "for free"
[Figure: deterioration as a function of confidence table size, with and without BranchTap, with a "better" direction marker.]

Summary
- Performance with 4 (no) checkpoints: ~28% (39%) of misprediction penalty removed
- BranchTap is robust: up to 6% (13%) better and at most 1.2% (0.1%) worse than conventional mechanisms
- BranchTap is very simple to implement: few counters and comparators
- BranchTap is better than other alternatives
  - BT + 1K predictor better than a 64K predictor alone
  - BT + 4 GCs better than 6 GCs alone

BranchTap Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control Patrick Akl and Andreas Moshovos AENAO Research Group Department of Electrical and Computer Engineering University of Toronto {pakl, moshovos}@eecg.toronto.edu