Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors. Andrew Hilton and Amir Roth, University of Pennsylvania.


1 Decoupled Store Completion / Silent Deterministic Replay: Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth, University of Pennsylvania
ISCA-36 :: June 23, 2009

2 Decoupled Store Completion & Silent Deterministic Replay
Brief overview:
Dynamically scheduled superscalar processors
Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04]
Scalable load & store queues: SVW/SQIP [Roth05, Sha05]; DKIP, FMC [Pericas06, Pericas07]
Scalable load & store queues for latency-tolerant processors: SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]
The problem: granularity mismatch, checkpoint (CPR) vs. instruction (SVW/SQIP)
This talk: Decoupled Store Completion & Silent Deterministic Replay

3 Outline
Background: CPR/CFP, SVW/SQIP, the granularity mismatch problem
DSC/SDR
Evaluation

4 CPR/CFP
Latency-tolerant: scale the key window structures under an LL$ miss (issue queue, regfile, load & store queues)
CFP (Continual Flow Pipeline) [Srinivasan04]: scales the issue queue & regfile by "slicing out" miss-dependent insns
CPR (Checkpoint Processing & Recovery) [Akkary03]: scales the regfile by limiting recovery to pre-created checkpoints, with aggressive reclamation of non-checkpoint registers
Unintended consequence: checkpoint-granularity "bulk commit"
Load & store queues: SA-LQ (Set-Associative Load Queue) and HSQ (Hierarchical Store Queue) [Akkary03]
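The "slicing out" idea above can be sketched in a few lines (an illustration with invented names, not the paper's mechanism in detail): instructions whose sources depend on an outstanding miss are drained into a slice buffer, and their destinations are poisoned so the whole dependent slice follows them out of the issue queue.

```python
# Illustrative sketch of CFP-style slice-out. Each instruction is
# (dst_reg, src_regs); poisoned_regs are destinations of outstanding misses.

def drain_miss_slice(issue_queue, poisoned_regs):
    """Split the issue queue into independent insns (kept) and the
    miss-dependent slice (deferred until the miss returns)."""
    kept, slice_buf = [], []
    poisoned = set(poisoned_regs)
    for dst, srcs in issue_queue:
        if poisoned & set(srcs):
            slice_buf.append((dst, srcs))
            poisoned.add(dst)     # dependents of the slice are poisoned too
        else:
            kept.append((dst, srcs))
    return kept, slice_buf

# r0 is produced by an LL$-missing load; r1 and r2 transitively depend on it.
iq = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r4"])]
kept, deferred = drain_miss_slice(iq, poisoned_regs=["r0"])
assert kept == [("r3", ["r4"])]
assert deferred == [("r1", ["r0"]), ("r2", ["r1"])]
```

Independent work (here, the producer of r3) keeps flowing under the miss, which is what lets the window scale.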

5 Baseline Performance (& Area)
ASSOC (baseline): 64/48-entry fully-associative load/store queues
8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
Load queue: area is fine, but performance is poor (set conflicts)
Store queue: performance is fine, but area is inefficient (large CAM)

6 SQIP
[Figure: instruction stream A:St B:Ld P:St Q:St R:Ld S:+ T:Br, older to younger, with addresses [x10]/[x20] and SSNs; dispatch at <ssn=9>, commit at <ssn=4>]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
Stores are named by monotonically increasing sequence numbers
Low-order bits are store queue/buffer positions
Global SSNs track dispatch, commit, and (store) completion
SQIP (Store Queue Index Prediction) [Sha05]
Scales the store queue/buffer by eliminating associative search
@dispatch: the load predicts the store queue position of its forwarding store
@execute: the load indexes the store queue at this position
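A minimal sketch of the SSN and SQIP ideas above (the structure and names are mine, not the paper's RTL): SSNs increase monotonically, the low-order bits select a store queue slot, and a load carrying a predicted SSN indexes the queue directly instead of searching it associatively.

```python
# Hypothetical sketch of SSN naming + SQIP-style indexed forwarding.

SQ_SIZE = 8  # store queue entries (a power of two in a real design)

class StoreQueue:
    def __init__(self):
        self.entries = [None] * SQ_SIZE   # each entry: (ssn, addr, value)
        self.ssn_dispatch = 0             # SSN of youngest dispatched store

    def dispatch_store(self, addr, value):
        self.ssn_dispatch += 1
        ssn = self.ssn_dispatch
        self.entries[ssn % SQ_SIZE] = (ssn, addr, value)  # low bits = slot
        return ssn

    def forward(self, predicted_ssn, load_addr):
        """Index by the predicted position; check the entry really is the
        predicted store and matches the load's address."""
        entry = self.entries[predicted_ssn % SQ_SIZE]
        if entry and entry[0] == predicted_ssn and entry[1] == load_addr:
            return entry[2]   # forwarded value
        return None           # mis-prediction: fall back to D$ + SVW check

sq = StoreQueue()
s1 = sq.dispatch_store(0x10, 42)
s2 = sq.dispatch_store(0x20, 99)
assert sq.forward(s1, 0x10) == 42     # correct prediction forwards the value
assert sq.forward(s1, 0x18) is None   # address mismatch: mis-prediction
```

A mis-predicted index is detected by the comparison, not silently wrong, which is why the design can tolerate an imperfect predictor.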

7 SVW
[Figure: the same instruction stream with the SSBF (SSN Bloom Filter) tracking committed stores, e.g. x?0 → <9>, x?8 → <8>]
Store Vulnerability Window (SVW) [Roth05]
Scales the load queue by eliminating associative search
Load verification by in-order re-execution prior to commit
Highly filtered: <1% of loads actually re-execute
Address-indexed SSBF tracks [addr, SSN] of committed stores
@commit: loads check the SSBF and re-execute if possibly incorrect
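The SVW filter above can be sketched as follows (a simplified model with invented names): the SSBF maps a hashed address to the SSN of the last committed store to that address, and a load re-executes only if a store it could not have seen has since committed to its address.

```python
# Sketch of an SVW commit-time check. Like any Bloom-filter-style
# structure, aliased addresses can force a spurious (but safe) replay.

SSBF_SETS = 16

class SVW:
    def __init__(self):
        self.ssbf = [0] * SSBF_SETS   # per set: SSN of last committed store

    def _index(self, addr):
        return addr % SSBF_SETS       # simple hash; real designs use low bits

    def commit_store(self, ssn, addr):
        self.ssbf[self._index(addr)] = ssn

    def load_must_replay(self, addr, sampled_ssn):
        """sampled_ssn: SSN of the store the load forwarded from (or the
        committed-store SSN when the load executed). A younger committed
        store to a matching set means the load is possibly incorrect."""
        return self.ssbf[self._index(addr)] > sampled_ssn

svw = SVW()
svw.commit_store(4, 0x10)
assert not svw.load_must_replay(0x10, sampled_ssn=4)  # saw store <4>: safe
svw.commit_store(8, 0x10)
assert svw.load_must_replay(0x10, sampled_ssn=4)      # store <8> unseen: replay
```

Because the check is conservative, the filter never misses a real violation; it only occasionally replays a load that was actually fine, which is why fewer than 1% of loads re-execute.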

8 SVW–NAIVE
SVW: 512-entry indexed load queue, 256-entry store queue
Slowdowns over 8SA-LQ (mesa, wupwise); some slowdowns even over ASSOC (bzip2, vortex)
Why? Not forwarding mis-predictions … store-load serialization
Load Y can't verify until older store X completes to the D$

9 Store-Load Serialization: ROB
[Figure: the running example showing an SSBF verification "hole"]
Load R forwards from store <4>, so it is vulnerable to stores <5>–<9>
No SSBF entry for address [x10], so it must replay
It can't search the store buffer, so it waits until stores <5>–<8> are in the D$
In a ROB processor, <8> (P) will complete (and usually quickly)
In a CPR processor …
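The waiting condition in this example reduces to a one-line predicate (a sketch in my own notation): a load flagged for SVW re-execution cannot search the store buffer, so its replay must wait until every store in its vulnerability window has drained to the D$.

```python
# Sketch of the store-load serialization condition.

def replay_can_issue(vulnerable_upto_ssn, completed_ssn):
    """The load re-executes against the D$ only once all stores up to the
    end of its vulnerability window have completed (completed_ssn is the
    youngest store SSN written to the D$)."""
    return completed_ssn >= vulnerable_upto_ssn

# Load R is vulnerable through store <8>, but only stores up to <3> have
# completed, so the replay (and everything behind it) must wait:
assert not replay_can_issue(8, 3)
assert replay_can_issue(8, 8)   # store P <8> reached the D$: replay may issue
```

In a ROB machine the completed-store pointer advances quickly; the next slide shows why it can stall indefinitely under CPR.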

10 Store-Load Serialization: CPR
[Figure: the same example under CPR]
P will complete … unless it's in the same checkpoint as R
Deadlock: load R can't verify, so store P can't complete
Resolve: squash (ouch) and, on re-execute, create a checkpoint before R, so that P and R land in separate checkpoints
Better: learn, and create checkpoints before future instances of R
This is SVW–TRAIN

11 SVW–TRAIN
Better than SVW–NAÏVE, but worse in some cases (art, mcf, vpr)
Over-checkpointing holds too many registers
A checkpoint may not be available for branches

12 What About Set-Associative SSBFs?
Higher associativity helps (it reduces hole frequency), but …
… we would just be replacing store queue associativity with SSBF associativity, which is exactly what we are trying to avoid
We want a better solution …

13 DSC (Decoupled Store Completion)
[Figure: the running example with stores completing past the oldest checkpoint]
There is no fundamental reason we cannot complete stores <4>–<9>: all older instructions have completed
What's stopping us? The definition of commit & architected state
CPR: commit = oldest register checkpoint (checkpoint granularity)
ROB: commit = SVW-verify (instruction granularity)
Restore the ROB definition: allow stores to complete past the oldest checkpoint
This is DSC (Decoupled Store Completion)
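The decoupling above amounts to changing which pointer bounds store completion (a sketch with invented names, not the authors' implementation): track the SVW-verify pointer separately from the oldest-checkpoint pointer, and let stores drain to the D$ up to the verify pointer instead of waiting for whole-checkpoint bulk commit.

```python
# Sketch of DSC's change to the store-completion limit.

def stores_completable(verified_upto, oldest_ckpt_ssn, dsc_enabled):
    """Return the youngest store SSN allowed to complete to the D$."""
    if dsc_enabled:
        return verified_upto                      # instruction granularity
    return min(verified_upto, oldest_ckpt_ssn)    # checkpoint granularity

# Stores through <9> have SVW-verified, but the oldest checkpoint still
# pins commit at SSN 4:
assert stores_completable(9, 4, dsc_enabled=False) == 4  # CPR stalls at <4>
assert stores_completable(9, 4, dsc_enabled=True) == 9   # DSC drains to <9>
```

With DSC the serialization deadlock disappears, because store completion no longer waits on the checkpoint containing the stalled load.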

14 DSC: What About Mis-Speculations?
[Figure: mis-predicted branch T younger than the oldest checkpoint but older than completed store <8>]
DSC: architected state is now younger than the oldest checkpoint
What about mis-speculation (e.g., branch T mis-predicted)? We can only recover to a checkpoint
Squash committed instructions? Squash stores already visible to other processors?
How do we recover architected state?

15 SDR (Silent Deterministic Replay)
[Figure: the running example replayed from the oldest checkpoint]
Reconstruct architected state on demand: squash to the oldest checkpoint and replay …
Deterministically: re-produce committed values
Silently: without generating coherence events
How? Discard committed stores at rename (their values are already in the SB or D$)
How? Read load values from the load queue
This avoids WAR hazards with younger stores, from the same thread (e.g., B and Q) or a different thread (coherence)
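The two replay rules on this slide can be sketched as a per-instruction dispatch (names invented): committed stores are dropped because their values already reached the store buffer or D$, committed loads read their recorded values from the load queue instead of the cache, and everything else recomputes the same result from the same inputs.

```python
# Sketch of SDR replay rules for committed instructions.

def replay_instruction(insn, load_queue):
    kind = insn["kind"]
    if kind == "store":
        return None                          # silent: no new memory write
    if kind == "load":
        return load_queue[insn["lq_index"]]  # deterministic: recorded value
    return insn["compute"]()                 # ALU ops recompute identically

lq = {0: 42}   # load queue entry 0 recorded the value 42 at first execution
trace = [
    {"kind": "store"},
    {"kind": "load", "lq_index": 0},
    {"kind": "alu", "compute": lambda: 42 + 1},
]
values = [replay_instruction(i, lq) for i in trace]
assert values == [None, 42, 43]
```

Because no replayed instruction touches memory, the replay is invisible to other processors even if a younger store (same thread or coherence) has since overwritten a replayed load's address.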

16 Outline
Background
DSC/SDR (yes, that was it)
Evaluation: performance, performance-area trade-offs

17 Performance Methodology
Workloads: SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
Cycle-level simulator configuration:
4-way superscalar out-of-order CPR/CFP processor
8 checkpoints, 32/32 INT/FP issue queue entries
32KByte D$, 15-cycle 2MByte L2, eight 8-entry stream prefetchers
400-cycle memory, 4Byte/cycle memory bus

18 SVW+DSC/SDR
Outperforms SVW–NAÏVE and SVW–TRAIN
Outperforms 8SA-LQ on average (by a lot)
Occasional slight slowdowns relative to 8SA-LQ (eon, vortex), due to forwarding mis-speculation

19 Smaller, Less-Associative SSBFs
Does DSC/SDR make set-associative SSBFs unnecessary?
You can bet your associativity on it

20 Fewer Checkpoints
DSC/SDR reduces the need for large numbers of checkpoints
Checkpoints are no longer needed to serialize store/load pairs
Efficient use of D$ bandwidth even with widely spaced checkpoints
Good: checkpoints are expensive

21 … And Less Area
Area methodology: CACTI-4 [Tarjan04], 45nm
Sum the areas of the load/store queues (plus the SSBF & predictor where needed), e.g., 512-entry 8SA-LQ / 256-entry HSQ
High performance at low area: 6.6% speedup at 0.91mm²

22 How Performance/Area Was Won
SVW load queue: big performance gain (no conflicts) & small area loss
SQIP store queue: small performance loss & big area gain (no CAM)
The big SVW performance gain offsets the small SQIP performance loss
The big SQIP area gain offsets the small SVW area loss
DSC/SDR: big performance gain & small area gain

23 DSC/SDR Performance/Area
DSC/SDR improves SVW/SQIP IPC and reduces its area
No new structures, just new ways of using existing ones: no SSBF checkpoints, no checkpoint-creation predictor
More tolerant to reductions in checkpoints and SSBF size

24 Pareto Analysis
SVW/SQIP+DSC/SDR dominates all other designs
SVW/SQIP is low-area (no CAMs)
DSC/SDR is needed to match the IPC of a fully-associative load queue (FA-LQ)

25 Related Work
SRL (Store Redo Log) [Gandhi05]: replaces the large associative store queue with a FIFO buffer + forwarding cache; expands the store queue only under LL$ misses, so it under-performs HSQ
Unordered late-binding load/store queues [Sethumadhavan08]: entries only for executed loads and stores; a poor match for centralized latency-tolerant processors
Cherry [Martinez02]: "post-retirement" checkpoints; no large load/store queues, but may benefit from DSC/SDR
Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]

26 Conclusions
Checkpoint granularity for register management: good; for store commit: somewhat painful
DSC/SDR keeps the good parts of the checkpoint world: checkpoint-granularity registers + instruction-granularity stores
Key 1: disassociate commit from the oldest register checkpoint
Key 2: reconstruct architected state silently on demand (committed load values are available in the load queue)
Allows a checkpoint processor to use SVW/SQIP load/store queues, with performance and area advantages
Simplifies multi-processor operation for checkpoint processors


