PACT-18 :: Sep 15, 2009
CPROB: Checkpoint Processing with Opportunistic Minimal Recovery
Andrew Hilton, Neeraj Eswaran, Amir Roth
University of Pennsylvania
[ 2 ] CPROB in a Nutshell (Sorry, O’Reilly)
Physical register file constrains out-of-order window
  Area and power intensive, latency complicates the scheduler
CPR (Checkpoint Processing and Recovery) [Akkary+03]
  + Aggressive, execution-driven register reclamation
  – Checkpoint overhead: recovery only to pre-created checkpoints
CPROB: hybrid register reclamation scheme
  CPR + opportunistic checkpoint overhead elimination
  Opportunistic = dynamically adapts to register demands
  + Outperforms both CPR and conventional reclamation
  + Simple low-overhead implementation
[ 3 ] Outline
Introduction
  CPR review
  The “checkpoint overhead” problem
CPROB
Evaluation
Related Work
Conclusion
[ 4 ] Conventional Register Reclamation
Running example
  7 instructions (A–G), 2 branches (C & E), 3 arch regs (r1–r3)
Conventional register reclamation (i.e., ROB)
  Commit-driven reclamation: over-written register freed at commit
  Needs 8 physical registers for this “window”

PC  Raw Insn           Renamed Insn       RenameMap (r1 r2 r3)  OverWritten (OW)
A:  ld  [r3]  => r3    ld  [p3]  => p4    p1 p2 p3              p3
B:  sub r1, 4 => r2    sub p1, 4 => p5    p1 p2 p4              p2
C:  brz r3, Q          brz p4, Q          p1 p5 p4              -
D:  ld  [r2]  => r2    ld  [p5]  => p6    p1 p5 p4              p5
E:  brz r2, T          brz p6, T          p1 p6 p4              -
F:  add r1, 8 => r3    add p1, 8 => p7    p1 p6 p4              p4
G:  ld  [r3]  => r2    ld  [p7]  => p8    p1 p6 p7              p6
                                          p1 p8 p7
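The renaming in the running example can be replayed mechanically. The sketch below is illustrative only: the tuple encoding of the instructions and the names `PROGRAM` and `rename` are assumptions, not part of the talk. It records the RenameMap seen by each instruction and the register it overwrites, then counts the registers live when nothing has committed yet.

```python
# Running example as (pc, source arch regs, dest arch reg); dest is None for branches.
PROGRAM = [
    ("A", ["r3"], "r3"),  # ld  [r3]  => r3
    ("B", ["r1"], "r2"),  # sub r1, 4 => r2
    ("C", ["r3"], None),  # brz r3, Q
    ("D", ["r2"], "r2"),  # ld  [r2]  => r2
    ("E", ["r2"], None),  # brz r2, T
    ("F", ["r1"], "r3"),  # add r1, 8 => r3
    ("G", ["r3"], "r2"),  # ld  [r3]  => r2
]


def rename(program):
    """Rename in program order.

    Returns (rows, final_map); each row is
    (pc, phys_sources, phys_dest, map_before_insn, overwritten_reg).
    """
    rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}
    next_preg = 4
    rows = []
    for pc, srcs, dest in program:
        map_before = dict(rename_map)
        phys_srcs = [rename_map[s] for s in srcs]
        overwritten = phys_dest = None
        if dest is not None:
            overwritten = rename_map[dest]  # conventional scheme frees this at commit
            phys_dest = f"p{next_preg}"
            next_preg += 1
            rename_map[dest] = phys_dest
        rows.append((pc, phys_srcs, phys_dest, map_before, overwritten))
    return rows, rename_map


rows, final_map = rename(PROGRAM)
# Nothing has committed, so no overwritten register has been reclaimed yet.
in_use = {"p1", "p2", "p3"} | {row[2] for row in rows if row[2]}
for pc, _, _, rmap, ow in rows:
    print(pc, rmap["r1"], rmap["r2"], rmap["r3"], ow or "-")
print("registers needed:", len(in_use))
```

With commit-driven reclamation every destination stays allocated until its overwriting instruction commits, which is why the count reaches 8 for this window.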
[ 5 ] CPR Register Reclamation
CPR (Checkpoint Processing & Recovery)
  Execution-driven reclamation: sources + dest “freed” at execute
  Needs only 7 physical registers for this window
    Sources + dests of un-executed insns: A (ld [p3] => p4) and C (brz p4, Q) hold p3, p4
    RenameMap: p1 p8 p7
    Pre-created checkpoints: Chk0 = p1 p2 p3, Chk1 = p1 p6 p4
    p5 is free
[Figure: running example (A–G) with the register holders above highlighted]
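The count of 7 follows from taking the union of everything that still references a register. A minimal sketch, assuming the state shown on the slide (A and C un-executed, two checkpoints); the helper `regs` and the variable names are illustrative only.

```python
def regs(text):
    """Parse a space-separated register list into a set."""
    return set(text.split())


rename_map = regs("p1 p8 p7")
checkpoints = {"Chk0": regs("p1 p2 p3"), "Chk1": regs("p1 p6 p4")}
# Un-executed instructions keep their sources and destinations alive.
unexecuted = {"A": regs("p3 p4"),  # ld [p3] => p4
              "C": regs("p4")}     # brz p4, Q

all_regs = {f"p{i}" for i in range(1, 9)}
held = set(rename_map)
for holder in list(checkpoints.values()) + list(unexecuted.values()):
    held |= holder
free = all_regs - held

print(sorted(held), sorted(free))
```

Nothing references p5 once D has executed, so it is the one register CPR reclaims early in this window.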
[ 6 ] CPR Checkpoint Overhead
What if branch C mis-predicts?
  Can’t recover to D … p5 (appears in D’s RenameMap) already freed!
  – Must recover to A (checkpoint) and re-execute A–C
This penalty is called checkpoint overhead
  Squash & re-execute insns older than un-checkpointed mis-spec
  No such penalty in ROB, which performs minimal recovery
[Figure: running example; Chk0 = p1 p2 p3 (at A), Chk1 = p1 p6 p4; p5 is free; A–C are re-executed]
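The penalty can be made concrete by asking which instructions run again after a mis-prediction. This is a sketch under stated assumptions: Chk0 sits at A (as the slide says), while placing Chk1 at E is inferred from Chk1 = p1 p6 p4 matching the RenameMap at E and is not stated in the talk. The function names are hypothetical.

```python
ORDER = ["A", "B", "C", "D", "E", "F", "G"]
# Chk0 at A is from the slide; Chk1 at E is an assumption (see lead-in).
CHECKPOINTS = {"Chk0": "A", "Chk1": "E"}


def cpr_reexecuted(mispredicted):
    """Instructions CPR runs again: from the nearest checkpoint at or before
    the mis-predicted branch, up to and including the branch."""
    idx = ORDER.index(mispredicted)
    starts = [ORDER.index(pc) for pc in CHECKPOINTS.values()
              if ORDER.index(pc) <= idx]
    return ORDER[max(starts): idx + 1]


def checkpoint_overhead(mispredicted):
    """Instructions older than the branch that are squashed and re-executed.
    Minimal (ROB-style) recovery would make this list empty."""
    return cpr_reexecuted(mispredicted)[:-1]


print(cpr_reexecuted("C"), checkpoint_overhead("C"))
print(cpr_reexecuted("E"), checkpoint_overhead("E"))
```

Branch C has no checkpoint of its own, so A and B are pure overhead; a branch that owns a checkpoint pays none.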
[ 7 ] The Two Faces of CPR
SpecFP: high bpred accuracy + need large window
  Reclamation trumps overhead → average speedups
  Some pathologies, e.g., galgel
SpecINT: low bpred accuracy
  Overhead dominates → average slowdown
[ 8 ] Answer != More Checkpoints
More checkpoints reduce overhead … but only a little
  – Sometimes hurt performance (tie up more registers)
  – Also, checkpoints are not cheap
[ 9 ] But CPR is Great for SMT … Right?
+ SMT needs more registers …
+ And reduces branch mis-prediction penalty …
– But actually makes checkpoint overhead worse!
  Distance from mis-predicted branch to older checkpoint has nothing to do with speculation depth
  Threads share checkpoints (more un-checkpointed branches)
[ 10 ] Outline
Introduction
CPR
CPROB
  Basic idea (very simple)
  Some policies
  Implementation
Evaluation
Related Work
Conclusion
[ 11 ] CPROB: The Key Idea
CPR + hold recovery (OW) registers opportunistically
Recovery registers (p5) available → no checkpoint overhead
  Recover to younger checkpoint, then walk backwards serially
Recovery registers (p5) not available → overhead, but still correct
  Recover to older checkpoint a la CPR
Opportunistically = can release recovery registers at any time!
[Figure: running example with p5 held as a recovery register; Chk0 = p1 p2 p3, Chk1 = p1 p6 p4]
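The two recovery paths can be sketched as a single function. This is an illustrative model, not the hardware mechanism: `cprob_recover`, the undo-log format, and the `held` set are assumptions made for the example. It restores the younger checkpoint's map and undoes the intervening writes youngest-first, falling back when a needed recovery register is gone.

```python
def cprob_recover(younger_chk_map, undo_log, held):
    """Rebuild the RenameMap just after a mis-predicted branch.

    younger_chk_map: map saved by the nearest checkpoint younger than the branch.
    undo_log: youngest-first (arch_reg, overwritten_preg) pairs for the
              instructions between the branch and that checkpoint.
    held: recovery registers that have not been released.
    Returns the recovered map, or None to request CPR-style recovery
    from the older checkpoint.
    """
    if any(ow not in held for _, ow in undo_log):
        return None  # a recovery register was released: still correct, but pays overhead
    recovered = dict(younger_chk_map)
    for arch_reg, overwritten in undo_log:
        recovered[arch_reg] = overwritten  # walk backwards, one insn at a time
    return recovered


chk1 = {"r1": "p1", "r2": "p6", "r3": "p4"}
undo_d = [("r2", "p5")]  # D (ld [p5] => p6) overwrote p5

with_p5 = cprob_recover(chk1, undo_d, held={"p5"})
without_p5 = cprob_recover(chk1, undo_d, held=set())
print(with_p5, without_p5)
```

When p5 is still held, the result is D's RenameMap (p1 p5 p4) and nothing older than branch C is re-executed.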
[ 12 ] Good Time Part I
When is a good time to release recovery registers?
  Don’t grab in first place: no branches since older checkpoint
    “Tail” checkpoint doesn’t grab p4 & p6, Chk1 didn’t grab p3 & p2
  Spontaneously: all branches in a checkpoint have executed
    Chk1: branch C executes → release p5
[Figure: running example with p5 held as a recovery register; Chk0 = p1 p2 p3, Chk1 = p1 p6 p4]
[ 13 ] Good Time Part II
Also victimize when rename needs registers to continue
  Chances are good un-executed branches are right
  Otherwise they would have been assigned checkpoints
CPROB reclamation policy adapts dynamically
  Branch mis-predictions tend to cluster [Heil+98]
  Recent mis-prediction → window empty, no need to victimize
    Hold recovery registers to “protect” upcoming branches
  No recent mis-prediction → window full, need to victimize
    Probably in a region of high-confidence branches
  Most mis-predicted branches resolve quickly after dispatch
    Chance of victimization in this “window” is small
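The three release rules (never grab, spontaneous release, victimization) can be sketched as bookkeeping over per-checkpoint sets. Everything here is a hypothetical model: the class `RecoveryHolds`, its method names, and the use of an un-executed-branch counter are assumptions, and a released register becomes truly free only if no CPR holder still references it.

```python
from collections import OrderedDict


class RecoveryHolds:
    """Tracks which overwritten registers are held for minimal recovery."""

    def __init__(self):
        self.held = OrderedDict()  # checkpoint label -> recovery regs, oldest first
        self.pending = {}          # checkpoint label -> un-executed branch count

    def open_checkpoint(self, chk):
        self.held[chk] = set()
        self.pending[chk] = 0

    def branch_renamed(self, chk):
        self.pending[chk] += 1

    def overwrite(self, chk, reg):
        """Hold reg only if an un-checkpointed branch could still need it."""
        if self.pending[chk] == 0:
            return False  # don't grab in the first place
        self.held[chk].add(reg)
        return True

    def branch_executed(self, chk):
        """Spontaneous release once every branch in the checkpoint has executed."""
        self.pending[chk] -= 1
        if self.pending[chk] == 0:
            released, self.held[chk] = self.held[chk], set()
            return released
        return set()

    def victimize(self):
        """Rename is out of registers: give up the oldest checkpoint's holds."""
        for chk, regs in self.held.items():
            if regs:
                self.held[chk] = set()
                return chk, regs
        return None, set()


# Running example, interval labeled Chk1 as on the "Good Time Part I" slide.
h = RecoveryHolds()
h.open_checkpoint("Chk1")
grabbed = [h.overwrite("Chk1", "p3"), h.overwrite("Chk1", "p2")]  # A, B
h.branch_renamed("Chk1")                                          # C
grabbed.append(h.overwrite("Chk1", "p5"))                         # D
victim = h.victimize()
print(grabbed, victim)
```

Oldest-first victimization is the policy the talk names; the sketch does not model the ROB entries that would be released alongside the registers.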
[ 14 ] Does CPROB Need a Giant ROB?
CPROB tries to support a large window
  Needs a large ROB to hold all insns, right? No
CPROB uses ROB for opportunistic recovery, not commit
  Only insns whose recovery registers are held need ROB entries
  Can victimize ROB space & recovery registers together
  Policy “victimize oldest checkpoint” meshes well with this
[ 15 ] Implementation
How is CPROB register reclamation implemented?
  When/how are registers added to the free list?
First: how is CPR register reclamation implemented?
  Not using a circular queue free list enqueued at commit …
  Using register reference counting [Roth08]
[ 16 ] CPR Register Reference Counting
Reference counts implemented as bit-matrix
  One column per physical register
  One row per entity that can hold physical register
    Issue queue entry, checkpoint, RenameMap
  Columns OR’ed together to form bitvector-style free list
  Registers allocated using encoders
[Figure: bit-matrix for the running example; rows IQ, IQ, IQ, Chk0 (p1 p2 p3), Chk1 (p1 p6 p4), RMap (p1 p8 p7), plus the resulting Free vector]
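A software analogue of the bit-matrix helps show how the free list falls out of an OR. In this sketch each row is a Python integer used as a bitvector (bit i-1 stands for register p_i); the row labels, the helper names, and the lowest-index-first allocation order are assumptions made for illustration.

```python
NUM_PREGS = 8


def mask(*pregs):
    """Row bitvector with one bit set per referenced register (p1 -> bit 0)."""
    m = 0
    for p in pregs:
        m |= 1 << (p - 1)
    return m


rows = {
    "IQ0": mask(3, 4),      # A: ld [p3] => p4, not yet executed
    "IQ1": mask(4),         # C: brz p4, Q, not yet executed
    "Chk0": mask(1, 2, 3),
    "Chk1": mask(1, 6, 4),
    "RMap": mask(1, 8, 7),
}


def free_vector(rows):
    """A register is free when no row has its column bit set."""
    used = 0
    for row in rows.values():
        used |= row
    return ~used & ((1 << NUM_PREGS) - 1)


def free_regs(rows):
    free = free_vector(rows)
    return [f"p{i + 1}" for i in range(NUM_PREGS) if free >> i & 1]


def allocate(rows):
    """Priority-encoder stand-in: index of the lowest-numbered free register."""
    free = free_vector(rows)
    return (free & -free).bit_length() if free else None


before = free_regs(rows)
first = allocate(rows)
rows["IQ0"] = 0   # A executes: its issue-queue row is cleared
del rows["Chk0"]  # oldest checkpoint is released
after = free_regs(rows)
print(before, first, after)
```

Clearing a row is the only action needed to drop a reference, which is what makes it cheap to add further kinds of holders.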
[ 17 ] CPROB Extensions
Add recovery-register matrix rows
  One for each checkpoint
  One for RenameMap (“tail” checkpoint)
CPROB rows can be cleared at any time
CPR rows cleared according to strict CPR rules (for correctness)
[Figure: bit-matrix for the running example with added Rec rows; p5 is set in a recovery row]
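The extension can be modeled by adding a second group of rows that feed the same free-list computation. A minimal sketch using sets instead of bitvectors; the row labels (`Rec(Chk1)` and so on) and the function `free_list` are hypothetical, and p5 is placed in Chk1's recovery row following the earlier "release p5" example.

```python
ALL_REGS = {f"p{i}" for i in range(1, 9)}

# CPR rows: cleared only under the strict CPR rules.
cpr_rows = {
    "IQ(A)": {"p3", "p4"},
    "IQ(C)": {"p4"},
    "Chk0": {"p1", "p2", "p3"},
    "Chk1": {"p1", "p6", "p4"},
    "RMap": {"p1", "p8", "p7"},
}
# CPROB recovery rows: one per checkpoint plus one for the RenameMap ("tail").
rec_rows = {"Rec(Chk0)": set(), "Rec(Chk1)": {"p5"}, "Rec(RMap)": set()}


def free_list(cpr_rows, rec_rows):
    """A register is free only if neither kind of row references it."""
    used = set().union(*cpr_rows.values(), *rec_rows.values())
    return ALL_REGS - used


while_held = free_list(cpr_rows, rec_rows)
rec_rows["Rec(Chk1)"].clear()  # legal at any time: costs performance, not correctness
after_release = free_list(cpr_rows, rec_rows)
print(sorted(while_held), sorted(after_release))
```

Because recovery rows never affect what CPR itself may reclaim, clearing one can only return registers to the free list.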
[ 18 ] Outline
Introduction
CPR
CPROB
Evaluation
  CPROB
  CPROB-SMT
Related Work
Conclusion
[ 19 ] Methodology
Benchmarks
  SPEC2000 compiled using -O4
  For SMT
    Characterized as ILP, Branch-, Latency-, or bandWidth-bound
    2-thread workloads using FIESTA methodology [Hilton+09]
Cycle-level simulation
  4-way superscalar out-of-order, 17-stage pipeline, 1 or 2 threads
  256 ROB, 32/32 INT/FP issue queue, 128/128 INT/FP phys-regs
  8 checkpoints for CPR
  48 Kbyte 3-table PPM branch predictor, 16K confidence pred
  32 Kbyte I$/D$, 2 Mbyte 20-cycle L2, 400-cycle memory
[ 20 ] CPROB vs. CPR vs. ROB
Reduces checkpoint overhead significantly (4% → 1%)
  Remaining: miss-dependent mis-predicted branches
Fixes CPR’s performance pathologies relative to ROB
Outperforms both CPR and ROB in (almost) every case
[ 21 ] CPROB is Energy Efficient
Rough argument (see paper for details) but here goes …
  Energy efficient = relative-to-ROB ED² < 1 [Martin+01]
  Dynamic energy consumption ~ dynamic instruction execution count
CPR: FP: / = 0.95, INT: / = 1.04
CPROB: FP: / = 0.90, INT: / = 0.98
[ 22 ] Register Usage: Spec Average
Physical registers are expensive: vary from 256 to 2K
ROB: steady benefits to more registers
CPR: roughly constant performance
  + Better than ROB at low registers (reclamation dominates)
  – Worse with more registers (checkpoint overhead dominates)
CPROB: few registers → does CPR, many registers → does ROB
  Adaptive → better than CPR and ROB at all points
[ 23 ] Register Usage: SpecINT Gap
Same behavior in individual benchmarks
  Some phases need many registers
  Some phases need minimal recovery
[ 24 ] Checkpoint Usage: Spec Average
Checkpoints are also expensive: vary from 2 to 16
CPR: quite sensitive (needs 4 to break even with ROB)
CPROB: removes CPR’s sensitivity to checkpoint count
  Makes CPR viable with 2 checkpoints
[ 25 ] CPROB-SMT
+ CPROB fixes Bx pairings in SMT
  Branch-bound program paired with something else
  Remaining pathologies (LW & WW) due to D$ thrashing
+ Also relieves checkpoint pressure
See paper for other results
  Sensitivity, energy model details, area analysis, etc.
[ 26 ] Related Work
Other aggressive register schemes
  Early register release [Ergin+04], Cherry [Martinez+02]
ROB-based large window [Cristal+04, Pericas+06]
  CPROB not relevant here
Control Independence [Cher+01, Chou+99, Gandhi+04, Rotenberg+99]
  Orthogonal, CPROB potentially synergistic with TCI [AlZawawi+07]
TurboROB [Akl+08]
  Accelerates serial recovery
  Compatible (maybe synergistic) with CPROB
[ 27 ] Also Related: FIESTA
FIESTA: workloads for multi-program experiments
  Fixed Instruction with Equal STAndalone runtimes
  Pre-select application samples for equal standalone runtimes
  Run same samples consistently in every experiment
+ Fixed workloads → direct comparison with no result skew
  Plain, unambiguous speedup metrics
+ Minimal load imbalance by construction
  Remaining load imbalance is “un-fairness”
Hilton et al. “FIESTA”, MoBS workshop
Consider using it in your multi-program experiments
[ 28 ] Conclusions
Physical register file: critical out-of-order core resource
  Limits window size (especially for SMT)
CPR: execution-driven reclamation scheme
  + Much better scalability (good for SMT)
  – Checkpoint overhead (surprise, even worse in SMT)
  Some pathologies relative to ROB commit-driven reclamation
CPROB: opportunistic hybrid register reclamation
  Holds recovery registers to eliminate checkpoint overhead
  Adaptively victimizes them when rename needs more
  + Eliminates CPR’s pathologies, outperforms both CPR and ROB
[ 30 ] CPROB: ROB size?
Vary ROB size from 32 to 256 entries
CPROB only needs entries for full performance
  Degrades gracefully to 32
ROB needs at least 128 entries
[ 31 ] CFPROB
CPR base for CFP (Continual Flow Pipelines) [Srinivasan+04]
  Unblocks issue queue & registers under LLC misses
CFPROB: CFP on top of CPROB
  CPROB baseline fixes performance pathologies
  Small ROB = minimal recovery for miss-independent branches
  LLC-miss-dependent branch mis-predictions are rare