Decoupled Store Completion / Silent Deterministic Replay:
Enabling Scalable Data Memory for CPR/CFP Processors

Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton, amir}@cis.upenn.edu

ISCA-36 :: June 23, 2009
Brief Overview

- Dynamically scheduled superscalar processors
  - Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04], DKIP/FMC [Pericas06, Pericas07]
  - Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
- Scalable load & store queues for latency-tolerant processors
  - SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]
  - Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
- This talk: Decoupled Store Completion & Silent Deterministic Replay
Outline

- Background
  - CPR/CFP
  - SVW/SQIP
  - The granularity mismatch problem
- DSC/SDR
- Evaluation
CPR/CFP

- Latency-tolerant: scale key window structures under LL$ miss
  - Issue queue, regfile, load & store queues
- CFP (Continual Flow Pipeline) [Srinivasan04]
  - Scales issue queue & regfile by “slicing out” miss-dependent insns
- CPR (Checkpoint Processing & Recovery) [Akkary03]
  - Scales regfile by limiting recovery to pre-created checkpoints
  - Aggressive reclamation of non-checkpoint registers (sketch below)
  - Unintended consequence? checkpoint-granularity “bulk commit”
- SA-LQ (Set-Associative Load Queue) [Akkary03]
- HSQ (Hierarchical Store Queue) [Akkary03]
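To make the register-reclamation point concrete, here is a minimal Python sketch of counter-based reclamation in the CPR style: checkpoints and in-flight readers both hold references to a physical register, and the register is freed the moment its count reaches zero. All names (RegFile, add_ref, etc.) are illustrative assumptions, not taken from the paper or from CPR itself.

    # Hedged sketch: CPR-style aggressive reclamation via reference counts.
    # A physical register frees as soon as no checkpoint or reader needs it.
    class RegFile:
        def __init__(self, num_regs):
            self.refcount = [0] * num_regs
            self.free_list = list(range(num_regs))

        def allocate(self):
            preg = self.free_list.pop()
            self.refcount[preg] = 1        # reference from the renamer
            return preg

        def add_ref(self, preg):           # checkpoint creation or new reader
            self.refcount[preg] += 1

        def release(self, preg):           # reader done, or checkpoint freed
            self.refcount[preg] -= 1
            if self.refcount[preg] == 0:   # no checkpoint holds it: reclaim
                self.free_list.append(preg)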
Baseline Performance (& Area)

- ASSOC (baseline): 64/48-entry fully-associative load/store queues
- 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  - Load queue: area is fine, but performance is poor (set conflicts)
  - Store queue: performance is fine, but area is inefficient (large CAM)
SQIP

[Diagram: instruction stream (older → younger) A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses; stores carry SSNs <4>, <8>, <9>; global SSNs: dispatch = 9, commit = 4]

- Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
  - Stores named by monotonically increasing sequence numbers
  - Low-order bits are store queue/buffer positions
  - Global SSNs track dispatch, commit, (store) completion
- SQIP (Store Queue Index Prediction) [Sha05]
  - Scales store queue/buffer by eliminating associative search
  - @dispatch: load predicts store queue position of forwarding store
  - @execute: load indexes store queue at this position (sketch below)
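As a rough illustration of the bullets above, here is a minimal Python sketch of SSN naming plus SQIP-style indexed forwarding. The class and method names (StoreQueue, dispatch_store, forward) are ours, not the papers'; real hardware also tracks commit/completion SSNs and trains the index predictor.

    SQ_SIZE = 256  # store queue entries; low-order SSN bits pick a slot

    class StoreQueue:
        def __init__(self):
            self.slots = [None] * SQ_SIZE  # each slot holds (ssn, addr, data)
            self.ssn_dispatch = 0          # SSN of youngest dispatched store

        def dispatch_store(self, addr, data):
            self.ssn_dispatch += 1
            ssn = self.ssn_dispatch
            self.slots[ssn % SQ_SIZE] = (ssn, addr, data)  # indexed by SSN bits
            return ssn

        def forward(self, predicted_ssn, load_addr):
            # SQIP: the load indexes the one slot its predictor named;
            # there is no associative (CAM) search over the whole queue.
            entry = self.slots[predicted_ssn % SQ_SIZE]
            if entry is not None:
                ssn, addr, data = entry
                if ssn == predicted_ssn and addr == load_addr:
                    return data
            return None  # mis-prediction: train predictor, verify via SVW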
SVW

[Diagram: same instruction stream; the SSBF (SSN Bloom Filter) maps hashed addresses to the SSNs of committed stores, e.g., x?8 → <8>, x?0 → <9>; loads check it at verify/commit]

- Store Vulnerability Window (SVW) [Roth05]
  - Scales load queue by eliminating associative search
  - Load verification by in-order re-execution prior to commit
  - Highly filtered: <1% of loads actually re-execute
- Address-indexed SSBF tracks [addr, SSN] of committed stores
  - @commit: loads check SSBF, re-execute if possibly incorrect (sketch below)
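The commit-time SSBF check can be sketched in a few lines of Python. This assumes a direct-mapped SSBF; the table size, the address hash, and all names (SSBF, on_store_commit, load_must_reexecute) are our illustrative choices, not the paper's.

    NUM_SETS = 1024

    class SSBF:
        def __init__(self):
            self.last_ssn = [0] * NUM_SETS  # SSN of last committed store, per set

        def index(self, addr):
            return (addr >> 2) % NUM_SETS   # illustrative address hash

        def on_store_commit(self, addr, ssn):
            self.last_ssn[self.index(addr)] = ssn

        def load_must_reexecute(self, addr, ssn_forward):
            # A load is vulnerable to stores younger than the one it forwarded
            # from (its "store vulnerability window"). If a committed store to
            # a possibly matching address falls in that window, conservatively
            # re-execute the load before letting it commit.
            return self.last_ssn[self.index(addr)] > ssn_forward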
SVW–NAÏVE

- SVW: 512-entry indexed load queue, 256-entry store queue
- Slowdowns over 8SA-LQ (mesa, wupwise)
- Some slowdowns even over ASSOC (bzip2, vortex)
- Why? Not forwarding mis-predictions … store-load serialization
  - Load Y can't verify until older store X completes to D$
Store-Load Serialization: ROB

[Diagram: load R forwarded from store <4>; younger stores P <8> and Q <9> still in flight; SSBF sets x?8 → <8>, x?0 → <9>; completion pointer at <3>]

- SVW/SQIP example: an SSBF verification “hole”
  - Load R forwards from store <4> → vulnerable to stores <5>–<9>
  - No SSBF entry for address [x10] → must replay
  - Can't search store buffer → wait until stores <5>–<8> are in D$
- In a ROB processor … <8> (P) will complete (and usually quickly)
- In a CPR processor …
Store-Load Serialization: CPR

[Diagram: same example under CPR; store P <8> and load R sit in the same checkpoint, verify waits at R while completion waits at <3>]

- P will complete … unless it's in the same checkpoint as R
- Deadlock: load R can't verify → store P can't complete
- Resolve: squash (ouch); on re-execute, create a checkpoint before R
  - P and R will then be in separate checkpoints
- Better: learn, and create checkpoints before future instances of R
  - This is SVW–TRAIN (sketch below)
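The deadlock condition and the SVW–TRAIN fix can be sketched as below, under the simplifying assumption that every instruction knows which checkpoint it belongs to. Insn, train_table, and the function names are hypothetical scaffolding, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class Insn:
        pc: int    # static instruction address
        ckpt: int  # id of the checkpoint this instruction belongs to

    train_table = set()  # PCs of loads that should open a fresh checkpoint

    def at_rename(load_pc, create_checkpoint):
        # SVW-TRAIN: pre-create a checkpoint before a load that deadlocked
        # before, so the load and its serializing store land in different
        # checkpoints next time.
        if load_pc in train_table:
            create_checkpoint()

    def at_verify(load, blocking_store):
        # The load can't verify until blocking_store drains to the D$; under
        # bulk commit the store can't drain until its whole checkpoint
        # (containing the unverified load) commits. Same checkpoint means a
        # circular wait.
        if load.ckpt == blocking_store.ckpt:
            train_table.add(load.pc)  # learn for future instances
            return "squash"           # costly one-time resolution
        return "wait"                 # store will complete on its own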
SVW–TRAIN

- Better than SVW–NAÏVE
- But worse in some cases (art, mcf, vpr)
  - Over-checkpointing holds too many registers
  - Checkpoint may not be available for branches
What About Set-Associative SSBFs?

- Higher associativity helps (reduces hole frequency), but …
- We're replacing store queue associativity with SSBF associativity
  - Exactly the kind of structure we were trying to avoid
- Want a better solution …
DSC (Decoupled Store Completion)

[Diagram: completion pointer at <3>, verify/commit past <9>; stores <4>–<9> wait only on the oldest checkpoint's bulk commit]

- No fundamental reason we cannot complete stores <4>–<9>
  - All older instructions have completed
- What's stopping us? the definition of commit & architected state
  - CPR: commit = oldest register checkpoint (checkpoint granularity)
  - ROB: commit = SVW-verify (instruction granularity)
- Restore the ROB definition
  - Allow stores to complete past the oldest checkpoint
  - This is DSC (Decoupled Store Completion) (sketch below)
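The contrast between the two commit definitions reduces to two one-line predicates. This is a hedged Python sketch under our own abstraction of two global SSN pointers (ssn_oldest_ckpt_commit, ssn_all_older_verified); the paper's hardware tracks these conditions differently.

    def store_may_complete_cpr(store_ssn, ssn_oldest_ckpt_commit):
        # CPR bulk commit: a store drains to the D$ only when the checkpoint
        # containing it becomes the oldest and commits as a unit.
        return store_ssn <= ssn_oldest_ckpt_commit

    def store_may_complete_dsc(store_ssn, ssn_all_older_verified):
        # DSC restores the ROB-style rule: a store may drain as soon as every
        # older instruction has (SVW-)verified, even if the oldest register
        # checkpoint is still live behind it.
        return store_ssn <= ssn_all_older_verified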
DSC: What About Mis-Speculations?

[Diagram: stores completed through <8>, verify/commit at <9>; branch T mis-predicts after younger stores have already completed]

- DSC: architected state is younger than the oldest checkpoint
- What about mis-speculation (e.g., branch T mis-predicted)?
  - Can only recover to a checkpoint
  - Squash committed instructions? Squash stores visible to other processors? etc.
- How do we recover architected state?
SDR (Silent Deterministic Replay)

[Diagram: squash to oldest checkpoint, then replay A … T forward; loads re-read recorded values, committed stores are discarded]

- Reconstruct architected state on demand
- Squash to oldest checkpoint and replay …
  - Deterministically: re-produce committed values
  - Silently: without generating coherence events
- How? discard committed stores at rename (already in SB or D$)
- How? read load values from the load queue (sketch below)
  - Avoids WAR hazards with younger stores
    - Same thread (e.g., load B vs. store Q) or different thread (coherence)
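The replay loop below is a minimal, hedged sketch of SDR in Python: committed stores are discarded, committed loads re-read recorded values from the load queue, and everything else recomputes. ReplayInsn and its fields are our invention for illustration, not the paper's structures.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class ReplayInsn:
        op: str                      # "store", "load", or "alu"
        dst: Optional[str] = None    # destination register name
        seq: int = 0                 # instruction sequence number
        compute: Optional[Callable[[dict], int]] = None  # ALU semantics

    def sdr_replay(insns, load_queue_values):
        # Silently reconstruct architected state from the oldest checkpoint.
        # Deterministic: committed loads re-read their recorded values from
        # the load queue, never the D$, avoiding WAR hazards with younger
        # same-thread or remote stores. Silent: committed stores are dropped
        # at rename (their values are already in the store buffer or D$),
        # so replay generates no coherence events.
        regs = {}
        for insn in insns:
            if insn.op == "store":
                continue                                  # discard silently
            elif insn.op == "load":
                regs[insn.dst] = load_queue_values[insn.seq]
            else:                                         # ALU op recomputes
                regs[insn.dst] = insn.compute(regs)
        return regs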
Outline

- Background
- DSC/SDR (yes, that was it)
- Evaluation
  - Performance
  - Performance-area trade-offs
Performance Methodology

- Workloads
  - SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
- Cycle-level simulator configuration
  - 4-way superscalar out-of-order CPR/CFP processor
  - 8 checkpoints, 32/32 INT/FP issue queue entries
  - 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
  - 400-cycle memory latency, 4Byte/cycle memory bus
SVW+DSC/SDR

- Outperforms SVW–NAÏVE and SVW–TRAIN
- Outperforms 8SA-LQ on average (by a lot)
- Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
  - These are due to forwarding mis-speculations
Smaller, Less-Associative SSBFs

- Does DSC/SDR make set-associative SSBFs unnecessary?
- You can bet your associativity on it
Fewer Checkpoints

- DSC/SDR reduces the need for large numbers of checkpoints
  - No need for checkpoints to serialize store/load pairs
  - Efficient use of D$ bandwidth even with widely spaced checkpoints
- Good: checkpoints are expensive
… And Less Area

- Area methodology
  - CACTI-4 [Tarjan04], 45nm
  - Sum the areas of the load/store queues (plus SSBF & predictor, if present)
  - E.g., 512-entry 8SA-LQ / 256-entry HSQ
- High-performance/low-area design point: 6.6% speedup at 0.91mm²
How Performance/Area Was Won

- SVW load queue: big performance gain (no conflicts) & small area loss
- SQIP store queue: small performance loss & big area gain (no CAM)
  - Big SVW performance gain offsets small SQIP performance loss
  - Big SQIP area gain offsets small SVW area loss
- DSC/SDR: big performance gain & small area gain
DSC/SDR Performance/Area

- DSC/SDR improve SVW/SQIP IPC and reduce its area
- No new structures, just new ways of using existing ones
  - No SSBF checkpoints
  - No checkpoint-creation predictor
- More tolerant of reductions in checkpoint count and SSBF size
Pareto Analysis

- SVW/SQIP+DSC/SDR dominates all other designs
  - SVW/SQIP is low-area (no CAMs)
  - DSC/SDR needed to match IPC of a fully-associative load queue (FA-LQ)
Related Work

- SRL (Store Redo Log) [Gandhi05]
  - Large associative store queue → FIFO buffer + forwarding cache
  - Expands store queue only under LL$ misses → under-performs HSQ
- Unordered late-binding load/store queues [Sethumadhavan08]
  - Entries only for executed loads and stores
  - Poor match for centralized latency-tolerant processors
- Cherry [Martinez02]
  - “Post-retirement” checkpoints
  - No large load/store queues, but may benefit from DSC/SDR
- Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]
Conclusions

- Checkpoint granularity …
  - … for register management: good
  - … for store commit: somewhat painful
- DSC/SDR: keep the good parts of the checkpoint world
  - Checkpoint-granularity registers + instruction-granularity stores
  - Key 1: disassociate commit from the oldest register checkpoint
  - Key 2: reconstruct architected state silently, on demand
    - Committed load values are available in the load queue
- Lets a checkpoint processor use SVW/SQIP load/store queues
  - Performance and area advantages
- Simplifies multi-processor operation for checkpoint processors