PACT-18 :: Sep 15, 2009 CPROB: Checkpoint Processing with Opportunistic Minimal Recovery Andrew Hilton, Neeraj Eswaran, Amir Roth University of Pennsylvania.

Presentation transcript:

[ 2 ] CPROB in a Nutshell (Sorry, O’Reilly)
  Physical register file constrains out-of-order window
    Area and power intensive, latency complicates the scheduler
  CPR (Checkpoint Processing and Recovery) [Akkary+03]
    + Aggressive, execution-driven register reclamation
    – Checkpoint overhead: recovery only to pre-created checkpoints
  CPROB: hybrid register reclamation scheme
    CPR + opportunistic checkpoint overhead elimination
    Opportunistic = dynamically adapts to register demands
    + Outperforms both CPR and conventional reclamation
    + Simple low-overhead implementation

[ 3 ] Outline
  Introduction
  CPR review
  The “checkpoint overhead” problem
  CPROB
  Evaluation
  Related Work
  Conclusion

[ 4 ] Conventional Register Reclamation
  Running example
    7 instructions (A–G), 2 branches (C & E), 3 arch regs (r1–r3)
  Conventional register reclamation (i.e., ROB)
    Commit-driven reclamation: over-written register freed at commit
    Needs 8 physical registers for this “window”

  PC  Raw Insn           Renamed Insn       RenameMap (r1 r2 r3)  OverWritten (OW)
      (initial)                             p1 p2 p3
  A:  ld [r3] => r3      ld [p3] => p4      p1 p2 p4              p3
  B:  sub r1, 4 => r2    sub p1, 4 => p5    p1 p5 p4              p2
  C:  brz r3, Q          brz p4, Q          p1 p5 p4              -
  D:  ld [r2] => r2      ld [p5] => p6      p1 p6 p4              p5
  E:  brz r2, T          brz p6, T          p1 p6 p4              -
  F:  add r1, 8 => r3    add p1, 8 => p7    p1 p6 p7              p4
  G:  ld [r3] => r2      ld [p7] => p8      p1 p8 p7              p6
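
The running example's rename table can be reproduced with a short simulation. This is a minimal Python sketch, not the authors' code: the tuple encoding of the instructions and the `rename` helper are assumptions made for illustration. Only the instruction sequence and the initial map (r1, r2, r3 mapped to p1, p2, p3) come from the slide.

```python
# Running example from the slide, encoded as (pc, dest arch reg or None, source arch regs).
# This encoding is an assumption made for illustration.
PROGRAM = [
    ("A", "r3", ["r3"]),   # ld  [r3] => r3
    ("B", "r2", ["r1"]),   # sub r1, 4 => r2
    ("C", None, ["r3"]),   # brz r3, Q
    ("D", "r2", ["r2"]),   # ld  [r2] => r2
    ("E", None, ["r2"]),   # brz r2, T
    ("F", "r3", ["r1"]),   # add r1, 8 => r3
    ("G", "r2", ["r3"]),   # ld  [r3] => r2
]
INITIAL_MAP = {"r1": "p1", "r2": "p2", "r3": "p3"}


def rename(program, initial_map):
    """Rename in program order, recording the overwritten (OW) register per instruction."""
    rename_map = dict(initial_map)
    next_preg = len(initial_map) + 1          # first unused physical register number
    rows = []
    for pc, dest, srcs in program:
        renamed_srcs = [rename_map[s] for s in srcs]   # read sources before updating the map
        new_preg = overwritten = None
        if dest is not None:
            overwritten = rename_map[dest]    # previous mapping, needed to undo this rename
            new_preg = f"p{next_preg}"
            next_preg += 1
            rename_map[dest] = new_preg
        rows.append({"pc": pc, "srcs": renamed_srcs, "dest": new_preg,
                     "ow": overwritten, "map": dict(rename_map)})
    return rows


rows = rename(PROGRAM, INITIAL_MAP)
# With commit-driven reclamation and nothing committed yet, every register stays allocated.
regs_in_window = set(INITIAL_MAP.values()) | {r["dest"] for r in rows if r["dest"]}
for r in rows:
    print(r["pc"], r["srcs"], "=>", r["dest"], "| map:", r["map"], "| OW:", r["ow"])
print("physical registers needed:", len(regs_in_window))
```

The OW column is what a ROB keeps per entry: restoring it in reverse order undoes one rename at a time, which is the minimal recovery the later slides contrast with CPR.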

[ 5 ] CPR Register Reclamation
  CPR (Checkpoint Processing & Recovery)
    Execution-driven reclamation: sources + dest “freed” at execute
    Needs only 7 physical registers for this window
      Sources + dests of un-executed insns
      RenameMap
      Pre-created checkpoints
  [Figure: running example A–G under CPR. Un-executed insns: A (ld [p3] => p4) and C (brz p4, Q), holding p3 and p4. RenameMap: p1 p8 p7. Checkpoints: Chk0 = p1 p2 p3, Chk1 = p1 p6 p4. p5 is free.]
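
The count of 7 registers follows from taking the union of everything that still references a register. The sketch below is illustrative only: the `holders` dictionary is a hypothetical encoding of the slide's figure (RenameMap, two checkpoints, and the two un-executed instructions A and C), not a structure from the paper.

```python
# Hypothetical encoding of the figure: each holder and the physical registers it references.
ALL_PREGS = {f"p{i}" for i in range(1, 9)}
holders = {
    "RenameMap":                   {"p1", "p8", "p7"},
    "Chk1":                        {"p1", "p6", "p4"},
    "Chk0":                        {"p1", "p2", "p3"},
    "un-executed A: ld [p3] => p4": {"p3", "p4"},
    "un-executed C: brz p4, Q":     {"p4"},
}

# Execution-driven reclamation: a register is live only while some holder references it.
in_use = set().union(*holders.values())
free = ALL_PREGS - in_use

print("registers in use:", sorted(in_use))
print("free registers:", sorted(free))
```

Under these assumptions p5 is the only register nobody references, which is why the window fits in 7 registers instead of 8.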

[ 6 ] CPR Checkpoint Overhead
  What if branch C mis-predicts?
    Can’t recover to D … p5 (appears in D’s RenameMap) already freed!
    – Must recover to A (checkpoint) and re-execute A–C
  This penalty is called checkpoint overhead
    Squash & re-execute insns older than un-checkpointed mis-spec
    No such penalty in ROB which performs minimal recovery
  [Figure: running example A–G with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4; p5 is free, so a mis-prediction at C restarts from Chk0 and re-executes A, B, C.]
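
Checkpoint overhead can be stated as a small function: the instructions older than the mis-predicted branch that get squashed because recovery restarts at the nearest older checkpoint. This is a hedged sketch. The checkpoint placement (Chk0 before A, Chk1 at branch E) is an assumption inferred from the checkpoint contents on the slides, and `checkpoint_overhead` is a hypothetical helper.

```python
ORDER = ["A", "B", "C", "D", "E", "F", "G"]   # program order of the running example
CHECKPOINT_AT = ["A", "E"]                    # assumed placement of Chk0 and Chk1


def checkpoint_overhead(mispredicted_pc):
    """Return the insns older than the mis-predicted branch that CPR must re-execute."""
    branch_idx = ORDER.index(mispredicted_pc)
    # CPR can only restart from the nearest checkpoint at or before the branch.
    restart_idx = max(ORDER.index(pc) for pc in CHECKPOINT_AT
                      if ORDER.index(pc) <= branch_idx)
    return ORDER[restart_idx:branch_idx]


print("C mis-predicts, overhead:", checkpoint_overhead("C"))
print("E mis-predicts, overhead:", checkpoint_overhead("E"))
```

A ROB would return an empty list in both cases, since it can always undo exactly the instructions younger than the branch.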

[ 7 ] The Two Faces of CPR
  SpecFP: high bpred accuracy + need large window
    Reclamation trumps overhead → average speedups
    Some pathologies, e.g., galgel
  SpecINT: low bpred accuracy
    Overhead dominates → average slowdown

[ 8 ] Answer != More Checkpoints
  More checkpoints reduce overhead … but only a little
  – Sometimes hurt performance (tie up more registers)
  – Also, checkpoints are not cheap

[ 9 ] But CPR is Great for SMT … Right?
  + SMT needs more registers …
  + And reduces branch mis-prediction penalty …
  – But actually makes checkpoint overhead worse!
    Distance from mis-predicted branch to older checkpoint has nothing to do with speculation depth
    Threads share checkpoints (more un-checkpointed branches)

[ 10 ] Outline
  Introduction
  CPR
  CPROB
    Basic idea (very simple)
    Some policies
    Implementation
  Evaluation
  Related Work
  Conclusion

[ 11 ] CPROB: The Key Idea
  CPR + hold recovery (OW) registers opportunistically
  Recovery registers (p5) available → no checkpoint overhead
    Recover to younger checkpoint, then walk backwards serially
  Recovery registers (p5) not available → overhead, but still correct
    Recover to older checkpoint a la CPR
  Opportunistically = can release recovery registers at any time!
  [Figure: running example A–G with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4; recovery register p5 is held.]
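
The two recovery paths can be sketched for a mis-prediction at branch C. This is an illustrative sketch under stated assumptions, not the hardware algorithm: `cprob_recover` and the `undo_log` list are hypothetical names, and Chk1 is assumed to sit at branch E so that only D lies between C and the younger checkpoint.

```python
CHK0 = {"r1": "p1", "r2": "p2", "r3": "p3"}   # older checkpoint
CHK1 = {"r1": "p1", "r2": "p6", "r3": "p4"}   # younger checkpoint
# (arch dest, OW register) for insns between branch C and Chk1, in program order.
# Only D (ld [p5] => p6) renames a register there; it overwrote r2's mapping p5.
UNDO_AFTER_C = [("r2", "p5")]


def cprob_recover(younger_chk, undo_log, held, older_chk):
    """Try minimal recovery; fall back to the older checkpoint if a register was released."""
    rename_map = dict(younger_chk)
    for arch_dest, ow in reversed(undo_log):      # walk backwards serially, youngest first
        if ow not in held:
            # Recovery register already victimized: behave exactly like CPR.
            return dict(older_chk), "checkpoint overhead"
        rename_map[arch_dest] = ow                # undo one rename
    return rename_map, "minimal recovery"


print(cprob_recover(CHK1, UNDO_AFTER_C, held={"p5"}, older_chk=CHK0))
print(cprob_recover(CHK1, UNDO_AFTER_C, held=set(), older_chk=CHK0))
```

Because the fallback is always available, releasing a recovery register can only cost performance, never correctness, which is what makes the scheme opportunistic.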

[ 12 ] Good Time Part I
  When is a good time to release recovery registers?
  Don’t grab in first place: no branches since older checkpoint
    “Tail” checkpoint doesn’t grab p4 & p6, Chk1 didn’t grab p3 & p2
  Spontaneously: all branches in a checkpoint have executed
    Chk1: branch C executes → release p5
  [Figure: running example A–G with Chk0 = p1 p2 p3 and Chk1 = p1 p6 p4; recovery register p5 is held.]

[ 13 ] Good Time Part II
  Also victimize when rename needs registers to continue
    Chances are good un-executed branches are right
    Otherwise they would have been assigned checkpoints
  CPROB reclamation policy adapts dynamically
    Branch mis-predictions tend to cluster [Heil+98]
    Recent mis-prediction → window empty, no need to victimize
      Hold recovery registers to “protect” upcoming branches
    No recent mis-prediction → window full, need to victimize
      Probably in a region of high-confidence branches
  Most mis-predicted branches resolve quickly after dispatch
    Chance of victimization in this “window” is small

[ 14 ] Does CPROB Need a Giant ROB?
  CPROB tries to support a large window
    Needs a large ROB to hold all insns, right? No
  CPROB uses ROB for opportunistic recovery, not commit
    Only insns whose recovery registers are held need ROB entries
    Can victimize ROB space & recovery registers together
    Policy “victimize oldest checkpoint” meshes well with this
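
The three release policies from the preceding slides (don't grab, release spontaneously, victimize the oldest checkpoint) can be sketched as a small bookkeeping class. Everything here is hypothetical: `RecoveryRegisterPool`, its method names, and the checkpoint label "Chk2" with register "p9" are illustrative assumptions, not part of the presented design.

```python
from collections import OrderedDict


class RecoveryRegisterPool:
    """Tracks which overwritten (OW) registers are held for recovery, per checkpoint."""

    def __init__(self):
        # checkpoint name -> set of held recovery registers; insertion order = age
        self.held = OrderedDict()

    def on_overwrite(self, checkpoint, ow_reg, branch_since_older_checkpoint):
        """Return True if ow_reg is held for recovery, False if it is never grabbed."""
        if not branch_since_older_checkpoint:
            return False                      # policy 1: don't grab in the first place
        self.held.setdefault(checkpoint, set()).add(ow_reg)
        return True

    def on_all_branches_executed(self, checkpoint):
        """Policy 2: spontaneous release once every branch in the checkpoint has executed."""
        return self.held.pop(checkpoint, set())

    def victimize_oldest(self):
        """Policy 3: rename is out of registers, so give up the oldest checkpoint's set."""
        if not self.held:
            return set()
        _, regs = self.held.popitem(last=False)
        return regs


pool = RecoveryRegisterPool()
pool.on_overwrite("Chk1", "p3", branch_since_older_checkpoint=False)  # A: not grabbed
pool.on_overwrite("Chk1", "p2", branch_since_older_checkpoint=False)  # B: not grabbed
pool.on_overwrite("Chk1", "p5", branch_since_older_checkpoint=True)   # D: follows branch C
print("released when C executes:", pool.on_all_branches_executed("Chk1"))
```

Victimizing whole checkpoints oldest-first also lets the matching ROB entries be dropped as a block, which is the coupling the slide above describes.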

[ 15 ] Implementation
  How is CPROB register reclamation implemented?
    When/how are registers added to the free list?
  First: how is CPR register reclamation implemented?
    Not using a circular queue free list enqueued at commit …
    Using register reference counting [Roth08]

[ 16 ] CPR Register Reference Counting
  Reference counts implemented as bit-matrix
    One column per physical register
    One row per entity that can hold physical register
      Issue queue entry, checkpoint, RenameMap
    Columns OR’ed together to form bitvector-style free list
    Registers allocated using encoders
  [Figure: bit-matrix for the running example, with rows IQ, IQ, IQ, Chk0, Chk1, RMap and a Free vector; one column per physical register.]
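
A software model of the bit-matrix makes the free-list derivation concrete. This is a minimal sketch, assuming 8 physical registers and the holder contents from the running example; the row names, the `row`, `free_vector`, and `allocate` helpers, and the lowest-index allocation order are illustrative choices, not details from the slides.

```python
NUM_PREGS = 8   # p1..p8 in the running example


def row(*pregs):
    """Build one matrix row: bit (n-1) is set if the holder references register pn."""
    bits = 0
    for p in pregs:
        bits |= 1 << (p - 1)
    return bits


# One row per entity that can hold a register (contents taken from the CPR example).
matrix = {
    "IQ[A]": row(3, 4),      # un-executed ld [p3] => p4
    "IQ[C]": row(4),         # un-executed brz p4, Q
    "Chk0":  row(1, 2, 3),
    "Chk1":  row(1, 6, 4),
    "RMap":  row(1, 8, 7),
}


def free_vector(matrix):
    """OR each column across all rows; a register is free when its column is all zero."""
    in_use = 0
    for bits in matrix.values():
        in_use |= bits
    return ~in_use & ((1 << NUM_PREGS) - 1)


def allocate(matrix):
    """Model a priority encoder: pick the lowest-numbered free register, or None."""
    free = free_vector(matrix)
    if free == 0:
        return None
    return (free & -free).bit_length()


print("free vector:", format(free_vector(matrix), "08b"))
print("next register to allocate: p%s" % allocate(matrix))
```

Releasing a holder is just clearing its row, so no explicit counters or commit-time enqueue are needed.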

[ 17 ] CPROB Extensions
  Add recovery-register matrix rows
    One for each checkpoint
    One for RenameMap (“tail” checkpoint)
  CPROB rows can be cleared at any time
    CPR rows cleared according to strict CPR rules (for correctness)
  [Figure: the CPR bit-matrix (rows IQ, IQ, IQ, Chk0, Chk1, RMap, Free) extended with recovery rows Rec, Rec, RRec; p5 is set in a recovery row.]
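
The extension can be modeled by adding a second group of rows that the free-list OR also covers but that may be cleared whenever rename runs dry. This is a sketch under assumptions: the row names `Rec[Chk0]`, `Rec[Chk1]`, and `RRec`, and the choice to place p5 in Chk1's recovery row, are inferred from the slides rather than stated by them.

```python
NUM_PREGS = 8


def row(*pregs):
    """Bit (n-1) is set if the holder references register pn."""
    bits = 0
    for p in pregs:
        bits |= 1 << (p - 1)
    return bits


# Rows governed by strict CPR rules (cleared only when CPR says it is safe).
cpr_rows = {"IQ[A]": row(3, 4), "IQ[C]": row(4),
            "Chk0": row(1, 2, 3), "Chk1": row(1, 6, 4), "RMap": row(1, 8, 7)}
# CPROB recovery rows: one per checkpoint plus one for the RenameMap ("tail").
rec_rows = {"Rec[Chk0]": row(), "Rec[Chk1]": row(5), "RRec": row()}


def free_vector(cpr_rows, rec_rows):
    """A register is free only if no CPR row and no recovery row references it."""
    in_use = 0
    for bits in list(cpr_rows.values()) + list(rec_rows.values()):
        in_use |= bits
    return ~in_use & ((1 << NUM_PREGS) - 1)


def victimize(rec_rows, name):
    """Clear a recovery row; always legal, it only gives up minimal recovery."""
    rec_rows[name] = 0


print("free before:", format(free_vector(cpr_rows, rec_rows), "08b"))
victimize(rec_rows, "Rec[Chk1]")
print("free after: ", format(free_vector(cpr_rows, rec_rows), "08b"))
```

Keeping the two row groups separate is what lets the CPR rows stay authoritative for correctness while the recovery rows act purely as a performance hint.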

[ 18 ] Outline
  Introduction
  CPR
  CPROB
  Evaluation
    CPROB
    CPROB-SMT
  Related Work
  Conclusion

[ 19 ] Methodology
  Benchmarks
    SPEC2000 compiled using -O4
    For SMT
      Characterized as ILP, Branch-, Latency-, or bandWidth-bound
      2-thread workloads using FIESTA methodology [Hilton+09]
  Cycle-level simulation
    4-way superscalar out-of-order, 17-stage pipeline, 1 or 2 threads
    256 ROB, 32/32 INT/FP issue queue, 128/128 INT/FP phys-regs
    8 checkpoints for CPR
    48 Kbyte 3-table PPM branch predictor, 16K confidence pred
    32 Kbyte I$/D$, 2 Mbyte 20-cycle L2, 400-cycle memory

[ 20 ] CPROB vs. CPR vs. ROB
  Reduces checkpoint overhead significantly (4% → 1%)
    Remaining: miss-dependent mis-predicted branches
  Fixes CPR’s performance pathologies relative to ROB
  Outperforms both CPR and ROB in (almost) every case

[ 21 ] CPROB is Energy Efficient
  Rough argument (see paper for details) but here goes …
  Energy efficient = relative-to-ROB ED² < 1 [Martin+01]
  Dynamic energy consumption ~ dynamic instruction execution count
  CPR: FP: / = 0.95, INT: / = 1.04
  CPROB: FP: / = 0.90, INT: / = 0.98
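
The energy argument can be written out as a formula. This is a sketch of the reasoning as the slide states it, with assumed symbols: E is dynamic energy, D is execution time (delay), and N is the dynamic count of executed instructions, each with a ROB subscript for the baseline. The approximation step uses the slide's premise that dynamic energy tracks executed instruction count.

```latex
\[
\frac{E\,D^{2}}{E_{\mathrm{ROB}}\,D_{\mathrm{ROB}}^{2}}
  \;=\; \frac{E}{E_{\mathrm{ROB}}}\left(\frac{D}{D_{\mathrm{ROB}}}\right)^{2}
  \;\approx\; \frac{N}{N_{\mathrm{ROB}}}\left(\frac{D}{D_{\mathrm{ROB}}}\right)^{2}
  \;<\; 1
\]
```

Since the delay ratio is the inverse of speedup over ROB, the same quantity is the instruction-count ratio divided by the squared speedup. A scheme that re-executes more instructions can therefore still be energy efficient if it is fast enough.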

[ 22 ] Register Usage: Spec Average
  Physical registers are expensive: vary from 256 to 2K
  ROB: steady benefits to more registers
  CPR: roughly constant performance
    + Better than ROB at low registers (reclamation dominates)
    – Worse with more registers (checkpoint overhead dominates)
  CPROB: few registers → does CPR, many registers → does ROB
    Adaptive → better than CPR and ROB at all points

[ 23 ] Register Usage: SpecINT Gap
  Same behavior in individual benchmarks
    Some phases need many registers
    Some phases need minimal recovery

[ 24 ] Checkpoint Usage: Spec Average
  Checkpoints are also expensive: vary from 2 to 16
  CPR: quite sensitive (needs 4 to break even with ROB)
  CPROB: removes CPR’s sensitivity to checkpoint count
    Makes CPR viable with 2 checkpoints

[ 25 ] CPROB-SMT
  + CPROB fixes Bx pairings in SMT
    Branch-bound program paired with something else
    Remaining pathologies (LW & WW) due to D$ thrashing
  + Also relieves checkpoint pressure
  See paper for other results
    Sensitivity, energy model details, area analysis, etc.

[ 26 ] Related Works
  Other aggressive register schemes
    Early register release [Ergin+04], Cherry [Martinez+02]
  ROB based large window [Cristal+04, Pericas+06]
    CPROB not relevant here
  Control Independence [Cher+01, Chou+99, Gandhi+04, Rotenberg+99]
    Orthogonal, CPROB potentially synergistic with TCI [AlZawawi+07]
  TurboROB [Akl+08]
    Accelerates serial recovery
    Compatible (maybe synergistic) with CPROB

[ 27 ] Also Related: FIESTA
  FIESTA: workloads for multi-program experiments
    Fixed Instruction with Equal STAndalone runtimes
    Pre-select application samples for equal standalone runtimes
    Run same samples consistently in every experiment
  + Fixed workloads → direct comparison with no result skew
    Plain, unambiguous speedup metrics
  + Minimal load imbalance by construction
    Remaining load imbalance is “un-fairness”
  Hilton et al. “FIESTA”, MoBS workshop
  Consider using it in your multi-program experiments

[ 28 ] Conclusions
  Physical register file: critical out-of-order core resource
    Limits window size (especially for SMT)
  CPR: execution-driven reclamation scheme
    + Much better scalability (good for SMT)
    – Checkpoint overhead (surprise, even worse in SMT)
    Some pathologies relative to ROB commit-driven reclamation
  CPROB: opportunistic hybrid register reclamation
    Holds recovery registers to eliminate checkpoint overhead
    Adaptively victimizes them when rename needs more
    + Eliminates CPR’s pathologies, outperforms both CPR and ROB

[ 29 ]

[ 30 ] CPROB: ROB size?
  Vary ROB size from 32 to 256 entries
  CPROB only needs entries for full performance
    Degrades gracefully to 32
  ROB needs at least 128 entries

[ 31 ] CFPROB
  CPR base for CFP (Continual Flow Pipelines) [Srinivasan+04]
    Unblocks issue queue & registers under LLC misses
  CFPROB: CFP on top of CPROB
    CPROB baseline fixes performance pathologies
    Small ROB = minimal recovery for miss-independent branches
    LLC-miss-dependent branch mis-predictions are rare