Decoupled Store Completion / Silent Deterministic Replay:
Enabling Scalable Data Memory for CPR/CFP Processors

Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton, amir}@cis.upenn.edu

ISCA-36 :: June 23, 2009
Brief Overview

- Dynamically scheduled superscalar processors
  - Latency-tolerant processors: CPR/CFP [Akkary03, Srinivasan04], DKIP/FMC [Pericas06, Pericas07]
  - Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
- Scalable load & store queues for latency-tolerant processors
  - SA-LQ/HSQ [Akkary03], SRL [Gandhi05], ELSQ [Pericas08]
  - Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
- This talk: Decoupled Store Completion & Silent Deterministic Replay
Outline

- Background
  - CPR/CFP
  - SVW/SQIP
  - The granularity mismatch problem
- DSC/SDR
- Evaluation
CPR/CFP

- Latency-tolerant: scale key window structures under LL$ miss
  - Issue queue, regfile, load & store queues
- CFP (Continual Flow Pipeline) [Srinivasan04]
  - Scales issue queue & regfile by “slicing out” miss-dependent insns
- CPR (Checkpoint Processing & Recovery) [Akkary03]
  - Scales regfile by limiting recovery to pre-created checkpoints
  - Aggressive reclamation of non-checkpoint registers (sketch below)
  - Unintended consequence? checkpoint-granularity “bulk commit”
- SA-LQ (Set-Associative Load Queue) [Akkary03]
- HSQ (Hierarchical Store Queue) [Akkary03]
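To make the register-reclamation point concrete, here is a minimal Python sketch of counter-based reclamation in the CPR style: checkpoints and in-flight readers both hold references to a physical register, and the register is freed the moment its count reaches zero. All names (RegFile, add_ref, etc.) are illustrative assumptions, not taken from the paper or from CPR itself.

    # Hedged sketch: CPR-style aggressive reclamation via reference counts.
    # A physical register frees as soon as no checkpoint or reader needs it.
    class RegFile:
        def __init__(self, num_regs):
            self.refcount = [0] * num_regs
            self.free_list = list(range(num_regs))

        def allocate(self):
            preg = self.free_list.pop()
            self.refcount[preg] = 1        # reference from the renamer
            return preg

        def add_ref(self, preg):           # checkpoint creation or new reader
            self.refcount[preg] += 1

        def release(self, preg):           # reader done, or checkpoint freed
            self.refcount[preg] -= 1
            if self.refcount[preg] == 0:   # no checkpoint holds it: reclaim
                self.free_list.append(preg)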
Baseline Performance (& Area)

- ASSOC (baseline): 64/48-entry fully-associative load/store queues
- 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
  - Load queue: area is fine, but performance is poor (set conflicts)
  - Store queue: performance is fine, but area is inefficient (large CAM)
SQIP

[Diagram: instruction stream (older → younger) A:St, B:Ld, P:St, Q:St, R:Ld, …, S:+, T:Br with addresses; stores carry SSNs <4>, <8>, <9>; global SSNs: dispatch = 9, commit = 4]

- Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
  - Stores named by monotonically increasing sequence numbers
  - Low-order bits are store queue/buffer positions
  - Global SSNs track dispatch, commit, (store) completion
- SQIP (Store Queue Index Prediction) [Sha05]
  - Scales store queue/buffer by eliminating associative search
  - @dispatch: load predicts store queue position of forwarding store
  - @execute: load indexes store queue at this position (sketch below)
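As a rough illustration of the bullets above, here is a minimal Python sketch of SSN naming plus SQIP-style indexed forwarding. The class and method names (StoreQueue, dispatch_store, forward) are ours, not the papers'; real hardware also tracks commit/completion SSNs and trains the index predictor.

    SQ_SIZE = 256  # store queue entries; low-order SSN bits pick a slot

    class StoreQueue:
        def __init__(self):
            self.slots = [None] * SQ_SIZE  # each slot holds (ssn, addr, data)
            self.ssn_dispatch = 0          # SSN of youngest dispatched store

        def dispatch_store(self, addr, data):
            self.ssn_dispatch += 1
            ssn = self.ssn_dispatch
            self.slots[ssn % SQ_SIZE] = (ssn, addr, data)  # indexed by SSN bits
            return ssn

        def forward(self, predicted_ssn, load_addr):
            # SQIP: the load indexes the one slot its predictor named;
            # there is no associative (CAM) search over the whole queue.
            entry = self.slots[predicted_ssn % SQ_SIZE]
            if entry is not None:
                ssn, addr, data = entry
                if ssn == predicted_ssn and addr == load_addr:
                    return data
            return None  # mis-prediction: train predictor, verify via SVW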
SVW

[Diagram: same instruction stream; the SSBF (SSN Bloom Filter) maps hashed addresses to the SSNs of committed stores, e.g., x?8 → <8>, x?0 → <9>; loads check it at verify/commit]

- Store Vulnerability Window (SVW) [Roth05]
  - Scales load queue by eliminating associative search
  - Load verification by in-order re-execution prior to commit
  - Highly filtered: <1% of loads actually re-execute
- Address-indexed SSBF tracks [addr, SSN] of committed stores
  - @commit: loads check SSBF, re-execute if possibly incorrect (sketch below)
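The commit-time SSBF check can be sketched in a few lines of Python. This assumes a direct-mapped SSBF; the table size, the address hash, and all names (SSBF, on_store_commit, load_must_reexecute) are our illustrative choices, not the paper's.

    NUM_SETS = 1024

    class SSBF:
        def __init__(self):
            self.last_ssn = [0] * NUM_SETS  # SSN of last committed store, per set

        def index(self, addr):
            return (addr >> 2) % NUM_SETS   # illustrative address hash

        def on_store_commit(self, addr, ssn):
            self.last_ssn[self.index(addr)] = ssn

        def load_must_reexecute(self, addr, ssn_forward):
            # A load is vulnerable to stores younger than the one it forwarded
            # from (its "store vulnerability window"). If a committed store to
            # a possibly matching address falls in that window, conservatively
            # re-execute the load before letting it commit.
            return self.last_ssn[self.index(addr)] > ssn_forward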
SVW–NAÏVE

- SVW: 512-entry indexed load queue, 256-entry store queue
- Slowdowns over 8SA-LQ (mesa, wupwise)
- Some slowdowns even over ASSOC (bzip2, vortex)
- Why? Not forwarding mis-predictions … store-load serialization
  - Load Y can't verify until older store X completes to D$
Store-Load Serialization: ROB

[Diagram: load R forwarded from store <4>; younger stores P <8> and Q <9> still in flight; SSBF sets x?8 → <8>, x?0 → <9>; completion pointer at <3>]

- SVW/SQIP example: an SSBF verification “hole”
  - Load R forwards from store <4> → vulnerable to stores <5>–<9>
  - No SSBF entry for address [x10] → must replay
  - Can't search store buffer → wait until stores <5>–<8> are in D$
- In a ROB processor … <8> (P) will complete (and usually quickly)
- In a CPR processor …
Store-Load Serialization: CPR

[Diagram: same example under CPR; store P <8> and load R sit in the same checkpoint, verify waits at R while completion waits at <3>]

- P will complete … unless it's in the same checkpoint as R
- Deadlock: load R can't verify → store P can't complete
- Resolve: squash (ouch); on re-execute, create a checkpoint before R
  - P and R will then be in separate checkpoints
- Better: learn, and create checkpoints before future instances of R
  - This is SVW–TRAIN (sketch below)
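The deadlock condition and the SVW–TRAIN fix can be sketched as below, under the simplifying assumption that every instruction knows which checkpoint it belongs to. Insn, train_table, and the function names are hypothetical scaffolding, not the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class Insn:
        pc: int    # static instruction address
        ckpt: int  # id of the checkpoint this instruction belongs to

    train_table = set()  # PCs of loads that should open a fresh checkpoint

    def at_rename(load_pc, create_checkpoint):
        # SVW-TRAIN: pre-create a checkpoint before a load that deadlocked
        # before, so the load and its serializing store land in different
        # checkpoints next time.
        if load_pc in train_table:
            create_checkpoint()

    def at_verify(load, blocking_store):
        # The load can't verify until blocking_store drains to the D$; under
        # bulk commit the store can't drain until its whole checkpoint
        # (containing the unverified load) commits. Same checkpoint means a
        # circular wait.
        if load.ckpt == blocking_store.ckpt:
            train_table.add(load.pc)  # learn for future instances
            return "squash"           # costly one-time resolution
        return "wait"                 # store will complete on its own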
SVW–TRAIN

- Better than SVW–NAÏVE
- But worse in some cases (art, mcf, vpr)
  - Over-checkpointing holds too many registers
  - Checkpoint may not be available for branches
What About Set-Associative SSBFs?

- Higher associativity helps (reduces hole frequency), but …
- We're replacing store queue associativity with SSBF associativity
  - Exactly the kind of structure we were trying to avoid
- Want a better solution …
DSC (Decoupled Store Completion)

[Diagram: completion pointer at <3>, verify/commit past <9>; stores <4>–<9> wait only on the oldest checkpoint's bulk commit]

- No fundamental reason we cannot complete stores <4>–<9>
  - All older instructions have completed
- What's stopping us? the definition of commit & architected state
  - CPR: commit = oldest register checkpoint (checkpoint granularity)
  - ROB: commit = SVW-verify (instruction granularity)
- Restore the ROB definition
  - Allow stores to complete past the oldest checkpoint
  - This is DSC (Decoupled Store Completion) (sketch below)
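The contrast between the two commit definitions reduces to two one-line predicates. This is a hedged Python sketch under our own abstraction of two global SSN pointers (ssn_oldest_ckpt_commit, ssn_all_older_verified); the paper's hardware tracks these conditions differently.

    def store_may_complete_cpr(store_ssn, ssn_oldest_ckpt_commit):
        # CPR bulk commit: a store drains to the D$ only when the checkpoint
        # containing it becomes the oldest and commits as a unit.
        return store_ssn <= ssn_oldest_ckpt_commit

    def store_may_complete_dsc(store_ssn, ssn_all_older_verified):
        # DSC restores the ROB-style rule: a store may drain as soon as every
        # older instruction has (SVW-)verified, even if the oldest register
        # checkpoint is still live behind it.
        return store_ssn <= ssn_all_older_verified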
DSC: What About Mis-Speculations?

[Diagram: stores completed through <8>, verify/commit at <9>; branch T mis-predicts after younger stores have already completed]

- DSC: architected state is younger than the oldest checkpoint
- What about mis-speculation (e.g., branch T mis-predicted)?
  - Can only recover to a checkpoint
  - Squash committed instructions? Squash stores visible to other processors? etc.
- How do we recover architected state?
SDR (Silent Deterministic Replay)

[Diagram: squash to oldest checkpoint, then replay A … T forward; loads re-read recorded values, committed stores are discarded]

- Reconstruct architected state on demand
- Squash to oldest checkpoint and replay …
  - Deterministically: re-produce committed values
  - Silently: without generating coherence events
- How? discard committed stores at rename (already in SB or D$)
- How? read load values from the load queue (sketch below)
  - Avoids WAR hazards with younger stores
    - Same thread (e.g., load B vs. store Q) or different thread (coherence)
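The replay loop below is a minimal, hedged sketch of SDR in Python: committed stores are discarded, committed loads re-read recorded values from the load queue, and everything else recomputes. ReplayInsn and its fields are our invention for illustration, not the paper's structures.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class ReplayInsn:
        op: str                      # "store", "load", or "alu"
        dst: Optional[str] = None    # destination register name
        seq: int = 0                 # instruction sequence number
        compute: Optional[Callable[[dict], int]] = None  # ALU semantics

    def sdr_replay(insns, load_queue_values):
        # Silently reconstruct architected state from the oldest checkpoint.
        # Deterministic: committed loads re-read their recorded values from
        # the load queue, never the D$, avoiding WAR hazards with younger
        # same-thread or remote stores. Silent: committed stores are dropped
        # at rename (their values are already in the store buffer or D$),
        # so replay generates no coherence events.
        regs = {}
        for insn in insns:
            if insn.op == "store":
                continue                                  # discard silently
            elif insn.op == "load":
                regs[insn.dst] = load_queue_values[insn.seq]
            else:                                         # ALU op recomputes
                regs[insn.dst] = insn.compute(regs)
        return regs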
Outline

- Background
- DSC/SDR (yes, that was it)
- Evaluation
  - Performance
  - Performance-area trade-offs
Performance Methodology

- Workloads
  - SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
- Cycle-level simulator configuration
  - 4-way superscalar out-of-order CPR/CFP processor
  - 8 checkpoints, 32/32 INT/FP issue queue entries
  - 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
  - 400-cycle memory latency, 4Byte/cycle memory bus
SVW+DSC/SDR

- Outperforms SVW–NAÏVE and SVW–TRAIN
- Outperforms 8SA-LQ on average (by a lot)
- Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
  - These are due to forwarding mis-speculations
Smaller, Less-Associative SSBFs

- Does DSC/SDR make set-associative SSBFs unnecessary?
- You can bet your associativity on it
Fewer Checkpoints

- DSC/SDR reduces the need for large numbers of checkpoints
  - No need for checkpoints to serialize store/load pairs
  - Efficient use of D$ bandwidth even with widely spaced checkpoints
- Good: checkpoints are expensive
… And Less Area

- Area methodology
  - CACTI-4 [Tarjan04], 45nm
  - Sum the areas of the load/store queues (plus SSBF & predictor, if present)
  - E.g., 512-entry 8SA-LQ / 256-entry HSQ
- High-performance/low-area design point: 6.6% speedup at 0.91mm²
How Performance/Area Was Won

- SVW load queue: big performance gain (no conflicts) & small area loss
- SQIP store queue: small performance loss & big area gain (no CAM)
  - Big SVW performance gain offsets small SQIP performance loss
  - Big SQIP area gain offsets small SVW area loss
- DSC/SDR: big performance gain & small area gain
DSC/SDR Performance/Area

- DSC/SDR improve SVW/SQIP IPC and reduce its area
- No new structures, just new ways of using existing ones
  - No SSBF checkpoints
  - No checkpoint-creation predictor
- More tolerant of reductions in checkpoint count and SSBF size
Pareto Analysis

- SVW/SQIP+DSC/SDR dominates all other designs
  - SVW/SQIP is low-area (no CAMs)
  - DSC/SDR needed to match IPC of a fully-associative load queue (FA-LQ)
Related Work

- SRL (Store Redo Log) [Gandhi05]
  - Large associative store queue → FIFO buffer + forwarding cache
  - Expands store queue only under LL$ misses → under-performs HSQ
- Unordered late-binding load/store queues [Sethumadhavan08]
  - Entries only for executed loads and stores
  - Poor match for centralized latency-tolerant processors
- Cherry [Martinez02]
  - “Post-retirement” checkpoints
  - No large load/store queues, but may benefit from DSC/SDR
- Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]
Conclusions

- Checkpoint granularity …
  - … for register management: good
  - … for store commit: somewhat painful
- DSC/SDR: keep the good parts of the checkpoint world
  - Checkpoint-granularity registers + instruction-granularity stores
  - Key 1: disassociate commit from the oldest register checkpoint
  - Key 2: reconstruct architected state silently, on demand
    - Committed load values are available in the load queue
- Lets a checkpoint processor use SVW/SQIP load/store queues
  - Performance and area advantages
- Simplifies multi-processor operation for checkpoint processors