ISCA-36 :: June 23, 2009
Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton,
[ 2 ] Brief Overview
Dynamically scheduled superscalar processors
  Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
Latency-tolerant processors
  CPR/CFP [Akkary03, Srinivasan04]
  DKIP, FMC [Pericas06, Pericas07]
Scalable load & store queues for latency-tolerant processors
  SA-LQ/HSQ [Akkary03]
  SRL [Gandhi05]
  ELSQ [Pericas08]
Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
Decoupled Store Completion & Silent Deterministic Replay
[ 3 ] Outline
Background
  CPR/CFP
  SVW/SQIP
  The granularity mismatch problem
DSC/SDR
Evaluation
[ 4 ] CPR/CFP
Latency-tolerant: scale key window structures under LL$ miss
  Issue queue, regfile, load & store queues
CFP (Continual Flow Pipeline) [Srinivasan04]
  Scale issue queue & regfile by “slicing out” miss-dependent insns
CPR (Checkpoint Processing & Recovery) [Akkary03]
  Scale regfile by limiting recovery to pre-created checkpoints
  + Aggressive reclamation of non-checkpoint registers
  – Unintended consequence? Checkpoint-granularity “bulk commit”
SA-LQ (Set-Associative Load Queue) [Akkary03]
HSQ (Hierarchical Store Queue) [Akkary03]
[ 5 ] Baseline Performance (& Area)
ASSOC (baseline): 64/48 entry fully-associative load/store queues
8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  – Load queue: area is fine, poor performance (set conflicts)
  – Store queue: performance is fine, area inefficient (large CAM)
[ 6 ] SQIP
SQIP (Store Queue Index Prediction) [Sha05]
  Scales store queue/buffer by eliminating associative search
  Load predicts store queue position of forwarding store
  Load indexes store queue at this position
[Figure: instruction stream, older to younger, with dispatch and commit pointers: A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br; addresses shown: [?] [x20] [x10]]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
  Stores named by monotonically increasing sequence numbers
  Low-order bits are store queue/buffer positions
  Global SSNs track dispatch, commit, (store) completion
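The SSN naming and indexed forwarding described on this slide can be sketched as follows. This is a minimal illustration, not the authors' design: the `StoreQueue` class, its method names, and the power-of-two size requirement are assumptions made for the example, and the index predictor itself is left out (the predicted SSN is simply passed in).

```python
class StoreQueue:
    """Toy indexed store queue (hypothetical model); size must be a power of two."""

    def __init__(self, size=256):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.entries = [None] * size   # each entry: (ssn, addr, value)
        self.ssn_dispatch = 0          # SSN of the youngest dispatched store
        self.ssn_complete = 0          # SSN of the youngest store written to D$

    def index_of(self, ssn):
        # Low-order SSN bits select the store queue position.
        return ssn & (self.size - 1)

    def dispatch_store(self, addr, value):
        assert self.ssn_dispatch - self.ssn_complete < self.size, "store queue full"
        self.ssn_dispatch += 1
        ssn = self.ssn_dispatch
        self.entries[self.index_of(ssn)] = (ssn, addr, value)
        return ssn

    def complete_oldest(self):
        # The oldest store writes the data cache; its slot becomes reusable.
        assert self.ssn_complete < self.ssn_dispatch, "no store to complete"
        self.ssn_complete += 1

    def indexed_forward(self, predicted_ssn, load_addr):
        """Return the forwarded value, or None if the load should read the cache."""
        # A predicted store that already completed lives in the cache, not the queue.
        if not (self.ssn_complete < predicted_ssn <= self.ssn_dispatch):
            return None
        ssn, addr, value = self.entries[self.index_of(predicted_ssn)]
        # One indexed read plus an address compare; no associative search.
        return value if (ssn == predicted_ssn and addr == load_addr) else None
```

A load whose predicted store has the wrong address simply falls back to the cache here; catching the cases where that fallback is wrong is the job of load verification (SVW).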
[ 7 ] SVW
Store Vulnerability Window (SVW) [Roth05]
  Scales load queue by eliminating associative search
  Load verification by in-order re-execution prior to commit
  Highly filtered: <1% of loads actually re-execute
SSBF (SSN Bloom Filter)
  Address-indexed SSBF tracks [addr, SSN] of committed stores
  Loads check SSBF, re-execute if possibly incorrect
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses, complete and verify/commit pointers, and SSBF contents]
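The SSBF filter test can be sketched as below. This is a simplified, untagged model under stated assumptions: the `SSBF` class, the 8-byte indexing granularity, and the `ssn_nvul` parameter name (the SSN at or below which the load is known not to be vulnerable, for example the SSN of the store it forwarded from) are illustrative choices, not details taken from the slides.

```python
class SSBF:
    """Toy untagged, address-indexed SSN Bloom filter (hypothetical model)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries     # 0 means no store has been recorded

    def _index(self, addr):
        # Assumption: 8-byte granularity, simple modulo hash.
        return (addr >> 3) % self.entries

    def store_commit(self, addr, ssn):
        # Stores update the filter in SSN order, so a plain overwrite keeps
        # the youngest SSN for each slot.
        self.table[self._index(addr)] = ssn

    def must_reexecute(self, addr, ssn_nvul):
        # Re-execute only if a store younger than ssn_nvul may have written
        # this address. Aliasing can cause false positives, never false negatives.
        return self.table[self._index(addr)] > ssn_nvul
```

Because the table is untagged, two addresses that hash to the same slot share an SSN, which produces an occasional unnecessary re-execution but never a missed one.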
[ 8 ] SVW–NAÏVE
SVW: 512-entry indexed load queue, 256-entry store queue
  – Slowdowns over 8SA-LQ (mesa, wupwise)
  – Some slowdowns even over ASSOC (bzip2, vortex)
Why? Not forwarding mis-predictions … store-load serialization
  Load Y can’t verify until older store X completes to D$
[ 9 ] Store-Load Serialization: ROB
SVW/SQIP example: SSBF verification “hole”
  Load R forwards from store P, vulnerable to younger stores
  – No SSBF entry for address [x10], so R must replay
  Can’t search store buffer, so wait until older stores are in D$
In a ROB processor … P will complete (and usually quickly)
In a CPR processor …
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses [x10] [x20] [x18] [x10], complete and verify/commit pointers, and SSBF contents before and after P completes]
[ 10 ] Store-Load Serialization: CPR
P will complete … unless it’s in the same checkpoint as R
Deadlock: load R can’t verify and store P can’t complete
Resolve: squash (ouch); on re-execute, create checkpoint before R
  P and R will be in separate checkpoints
Better: learn and create checkpoints before future instances of R
  This is SVW–TRAIN
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses [x10] [x20] [x18] [x10], verify, complete, and commit pointers, and SSBF contents]
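The deadlock and the SVW–TRAIN fix can be illustrated with a toy model. Everything here is a sketch under assumptions: `CheckpointTrainer`, `assign_checkpoints`, and `deadlocked` are hypothetical names, the predictor is an unbounded set of load PCs, and the limited number of hardware checkpoints is ignored.

```python
class CheckpointTrainer:
    """Hypothetical checkpoint-creation predictor keyed by load PC."""

    def __init__(self):
        self.serializing_loads = set()

    def train(self, load_pc):
        # Called after a squash caused by a store and a load sharing a checkpoint.
        self.serializing_loads.add(load_pc)

    def wants_checkpoint(self, load_pc):
        return load_pc in self.serializing_loads


def assign_checkpoints(stream, trainer):
    """Map each (pc, kind) in program order to a checkpoint id."""
    ckpt, out = 0, []
    for pc, kind in stream:
        if kind == "Ld" and trainer.wants_checkpoint(pc):
            ckpt += 1                  # a new checkpoint starts at this load
        out.append(ckpt)
    return out


def deadlocked(ckpts, store_idx, load_idx):
    # With checkpoint-granularity commit, the store cannot complete until its
    # checkpoint commits, and that checkpoint cannot commit until the load verifies.
    return ckpts[store_idx] == ckpts[load_idx]
```

On the slide's stream, P and R share a checkpoint on the first pass; after training on R, a checkpoint is created before R and the pair is separated. Each trained load costs a checkpoint, which is the over-checkpointing problem SVW–TRAIN runs into.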
[ 11 ] SVW–TRAIN
+ Better than SVW–NAÏVE
– But worse in some cases (art, mcf, vpr)
  Over-checkpointing holds too many registers
  Checkpoint may not be available for branches
[ 12 ] What About Set-Associative SSBFs?
+ Higher associativity helps (reduces hole frequency) but …
– We’re replacing store queue associativity with SSBF associativity
  Trying to avoid things like this
Want a better solution…
[ 13 ] DSC (Decoupled Store Completion)
No fundamental reason we cannot complete stores: all older instructions have completed
What’s stopping us? Definition of commit & architected state
  CPR: commit = oldest register checkpoint (checkpoint granularity)
  ROB: commit = SVW-verify (instruction granularity)
Restore ROB definition
  Allow stores to complete past oldest checkpoint
  This is DSC (Decoupled Store Completion)
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses [x10] [x20] [x18] [x10] and verify, complete, and commit pointers under CPR and under DSC]
[ 14 ] DSC: What About Mis-Speculations?
DSC: Architected state younger than oldest checkpoint
What about mis-speculation (e.g., branch T mis-predicted)?
  Can only recover to checkpoint
  Squash committed instructions?
  Squash stores visible to other processors? etc.
How do we recover architected state?
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses [x10] [x20] [x18] [x10], verify/commit and complete pointers, and mis-predicted branch T]
[ 15 ] Silent Deterministic Replay (SDR)
Reconstruct architected state on demand
  Squash to oldest checkpoint and replay …
  Deterministically: re-produce committed values
  Silently: without generating coherence events
How? Discard committed stores at rename (already in SB or D$)
How? Read load values from load queue
  Avoid WAR hazards with younger stores
  Same thread (e.g., load B and store Q) or different thread (coherence)
[Figure: instruction stream A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses [x10] [x20] [x18] [x10] and verify/commit and complete pointers]
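The replay rules on this slide can be sketched as a per-instruction decision. This is an illustrative model only: `replay_action`, the `commit_seq` boundary, and the dictionary standing in for the load queue are assumptions for the example, not structures named in the slides.

```python
def replay_action(seq, kind, commit_seq, load_queue):
    """Decide how to handle one instruction when replaying from the oldest checkpoint.

    seq        -- program-order position of the instruction
    kind       -- "St", "Ld", or anything else (ALU op, branch)
    commit_seq -- instructions with seq < commit_seq had already committed
    load_queue -- maps seq of each committed load to the value it committed with
    """
    if seq >= commit_seq:
        return "execute"               # never committed: ordinary speculative execution
    if kind == "St":
        return "discard"               # already in store buffer or D$; no second write
    if kind == "Ld":
        # Reuse the committed value: deterministic, and no cache access that
        # could observe a younger store or another thread's write.
        return ("load_queue", load_queue[seq])
    return "execute"                   # recomputed from replayed inputs
```

Reading committed load values from the load queue is what sidesteps the WAR hazard: a replayed load B never sees the value a younger committed store Q has since written to the same address.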
[ 16 ] Outline
Background
DSC/SDR (yes, that was it)
Evaluation
  Performance
  Performance-area trade-offs
[ 17 ] Performance Methodology
Workloads
  SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
Cycle-level simulator configuration
  4-way superscalar out-of-order CPR/CFP processor
  8 checkpoints, 32/32 INT/FP issue queue entries
  32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
  400 cycle memory, 4Byte/cycle memory bus
[ 18 ] SVW+DSC/SDR
+ Outperforms SVW–Naïve and SVW–Train
+ Outperforms 8SA-LQ on average (by a lot)
– Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
  These are due to forwarding mis-speculation
[ 19 ] Smaller, Less-Associative SSBFs
Does DSC/SDR make set-associative SSBFs unnecessary?
You can bet your associativity on it
[ 20 ] Fewer Checkpoints
DSC/SDR reduce need for large numbers of checkpoints
  Don’t need checkpoints to serialize store/load pairs
  Efficient use of D$ bandwidth even with widely spaced checkpoints
Good: checkpoints are expensive
[ 21 ] … And Less Area
Area methodology
  CACTI-4 [Tarjan04], 45nm
  Sum areas for load/store queues (SSBF & predictor too if needed)
  E.g., 512-entry 8SA-LQ / 256-entry HSQ: 6.6% speedup, 0.91 mm²
High-performance/low-area
[ 22 ] How Performance/Area Was Won
+ SVW load queue: big performance gain (no conflicts) & small area loss
+ SQIP store queue: small performance loss & big area gain (no CAM)
  Big SVW performance gain offsets small SQIP performance loss
  Big SQIP area gain offsets small SVW area loss
+ DSC/SDR: big performance gain & small area gain
[ 23 ] DSC/SDR Performance/Area
DSC/SDR improve SVW/SQIP IPC and reduce its area
No new structures, just new ways of using existing structures
+ No SSBF checkpoints
+ No checkpoint-creation predictor
+ More tolerant to reduction in checkpoints, SSBF size
[ 24 ] Pareto Analysis
SVW/SQIP+DSC/SDR dominates all other designs
  SVW/SQIP are low area (no CAMs)
  DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)
[ 25 ] Related Work
SRL (Store Redo Log) [Gandhi05]
  Replaces large associative store queue with FIFO buffer + forwarding cache
  Expands store queue only under LL$ misses; under-performs HSQ
Unordered late-binding load/store queues [Sethumadhavan08]
  Entries only for executed loads and stores
  Poor match for centralized latency-tolerant processors
Cherry [Martinez02]
  “Post retirement” checkpoints
  No large load/store queues, but may benefit from DSC/SDR
Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]
[ 26 ] Conclusions
Checkpoint granularity …
  + … register management: good
  – … store commit: somewhat painful
DSC/SDR: the good parts of the checkpoint world
  Checkpoint granularity registers + instruction granularity stores
  Key 1: disassociate commit from oldest register checkpoint
  Key 2: reconstruct architected state silently on demand
    Committed load values available in load queue
+ Allow checkpoint processor to use SVW/SQIP load/store queues
  Performance and area advantages
+ Simplify multi-processor operation for checkpoint processors