CS 7810 Lecture 10 Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt

Presentation transcript:

CS 7810 Lecture 10 Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt Proceedings of HPCA-9 February 2003

In-Flight Windows
[Figure: reorder buffer and 128-entry physical register file. The oldest instruction (#1, mapped to p33) is a load that misses in the cache for 300 cycles; instructions #2 through #96 (p34 to p128) have completed but cannot retire behind it, while p1 to p32 hold the committed architectural state.]

In-Flight Windows (cont.)
[Figure: same as above, plus a second load (#97), also a 300-cycle cache miss, waiting outside the full window.]
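To make the two figures concrete, here is a small toy model (my own sketch; the instruction mix and the unbounded retire width are simplifying assumptions) of a window limited to 96 in-flight instructions, as in the figure (128 physical registers minus 32 architectural). The 300-cycle miss at the head blocks retirement, the window fills, and the second miss (#97) cannot issue until the first one returns:

```cpp
// Toy model of a finite instruction window blocked by a long-latency load.
// All numbers (window size, miss latency, instruction mix) are illustrative.
#include <cstdio>
#include <deque>

struct Instr { int id; int done_cycle; };  // cycle at which the result is ready

int main() {
    const int WINDOW = 96;      // 128 physical registers - 32 architectural
    const int MISS_LAT = 300;
    std::deque<Instr> window;
    int cycle = 0, fetched = 0, retired = 0;
    int second_miss_issue = -1;

    while (retired < 200) {
        // Retire in order: the head must be complete before younger instrs leave.
        // (Retire width is unbounded here for simplicity.)
        while (!window.empty() && window.front().done_cycle <= cycle) {
            window.pop_front();
            ++retired;
        }
        // Fetch/rename one instruction per cycle while the window has room.
        if ((int)window.size() < WINDOW) {
            ++fetched;
            int lat = 1;
            if (fetched == 1 || fetched == 97) lat = MISS_LAT;  // the two cache-miss loads
            if (fetched == 97) second_miss_issue = cycle;       // when miss #2 finally starts
            window.push_back({fetched, cycle + lat});
        }
        ++cycle;
    }
    printf("second miss issued at cycle %d; total cycles for 200 instrs: %d\n",
           second_miss_issue, cycle);
    return 0;
}
```

In this toy the second miss issues around cycle 300 instead of around cycle 96, so the two misses are serialized rather than overlapped; recovering that lost memory-level parallelism is the point of runahead.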

Memory Bottlenecks
- 128-entry window, real L2: 0.77 IPC
- 128-entry window, perfect L2: … IPC
- …-entry window, real L2: … IPC
- …-entry window, perfect L2: … IPC
- …-entry window, real L2, runahead: 0.94 IPC

Runahead
[Figure: pipeline with trace cache, current rename table, issue queue, 128-entry register file, 32-entry checkpointed register file, retired rename table, ROB, FUs, L1 D-cache, and a runahead cache.]
When the oldest instruction is a cache miss, behave as if it caused a context switch:
- checkpoint the committed registers, rename table, return address stack, and branch history register
- assume a bogus value and start a new thread
- this thread cannot modify program state, but it can prefetch

Runahead (cont.)
When the cache miss returns, copy the registers and the mapping back and start executing from that ld/st instruction:
- the cost of copying state back and forth is not trivial
- many instructions get executed twice
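The two slides above describe the enter/exit sequence. Below is a minimal, self-contained sketch of that sequence; the structures and field names are my assumptions, not the paper's design:

```cpp
// Sketch of entering and exiting runahead mode; names and layout are illustrative.
#include <array>
#include <cstdint>
#include <cstdio>

struct Checkpoint {
    std::array<uint64_t, 32> arch_regs{};   // committed architectural registers
    std::array<int, 32>      rename_map{};  // retired rename table
    uint64_t branch_history = 0;            // global branch-history register
    uint64_t restart_pc     = 0;            // PC of the load that caused the miss
    // (a real design also checkpoints the return address stack)
};

struct Core {
    std::array<uint64_t, 32> arch_regs{};
    std::array<int, 32>      rename_map{};
    std::array<bool, 32>     inv{};         // INV bit per architectural register
    uint64_t branch_history = 0;
    uint64_t pc             = 0;
    bool     in_runahead    = false;
    Checkpoint ckpt;

    // Oldest instruction is a cache miss: checkpoint committed state, mark the
    // load's destination invalid, and keep executing purely for prefetching.
    void enter_runahead(uint64_t load_pc, int load_dest) {
        ckpt = {arch_regs, rename_map, branch_history, load_pc};
        inv[load_dest] = true;   // bogus value, flagged INV
        in_runahead    = true;
    }

    // The miss returned: discard all runahead results, copy the checkpoint back,
    // and restart from the blocking load. The two costs noted on the slide are
    // visible here: the state copy itself, and re-executing everything fetched
    // since that load.
    void exit_runahead() {
        arch_regs      = ckpt.arch_regs;
        rename_map     = ckpt.rename_map;
        branch_history = ckpt.branch_history;
        pc             = ckpt.restart_pc;
        inv.fill(false);
        in_runahead    = false;
    }
};

int main() {
    Core c;
    c.enter_runahead(/*load_pc=*/0x400100, /*load_dest=*/5);
    printf("runahead=%d\n", c.in_runahead);
    c.exit_runahead();
    printf("runahead=%d restart pc=0x%llx\n",
           c.in_runahead, (unsigned long long)c.pc);
}
```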

Runahead (cont.)
Note that some values are missing (invalid):
- do not bother to execute instructions that have invalid inputs
- this accelerates the runahead thread and still generates accurate prefetches
- unknown store addresses are ignored
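A minimal sketch of INV-bit propagation (a toy ALU-only ISA with names of my own choosing): instructions whose sources are invalid are skipped and simply mark their destination invalid, while miss-independent instructions execute normally and can still produce accurate prefetch addresses:

```cpp
// Invalid-value (INV) propagation during runahead, on a toy register file.
#include <array>
#include <cstdint>
#include <cstdio>

struct Regs {
    std::array<uint64_t, 32> val{};
    std::array<bool, 32>     inv{};   // true = value is bogus/unknown
};

struct Op { int dst, src1, src2; };

// Execute one ALU op in runahead mode: if any source is INV, do no real work and
// just mark the destination INV; otherwise compute normally.
void runahead_execute(Regs& r, const Op& op) {
    if (r.inv[op.src1] || r.inv[op.src2]) {
        r.inv[op.dst] = true;                 // propagate INV
        return;
    }
    r.val[op.dst] = r.val[op.src1] + r.val[op.src2];
    r.inv[op.dst] = false;
}

int main() {
    Regs r;
    r.inv[5] = true;                          // r5 is the missing load's result
    runahead_execute(r, {3, 5, 7});           // r3 = r5 + r7 -> r3 becomes INV
    runahead_execute(r, {4, 6, 7});           // independent of the miss -> valid
    printf("r3 inv=%d, r4 inv=%d\n", (int)r.inv[3], (int)r.inv[4]);
}
```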

Runahead (cont.)
Runahead instructions write to registers (as before), but runahead stores write to the runahead cache:
- the runahead cache and L1D are accessed in parallel
- if a block gets evicted from the runahead cache, its data is lost
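A sketch of the runahead cache idea under simplifying assumptions of my own (a hash map in place of the small set-associative structure; the capacity and eviction policy are invented): runahead stores write only here, later runahead loads can forward from it, and evicted data is simply lost, as the slide notes:

```cpp
// Toy runahead cache: runahead stores go here (never to L1D), and runahead loads
// check it in parallel with L1D so store-to-load forwarding still works.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct RunaheadCache {
    struct Line { uint64_t data; bool inv; };      // INV set if the stored value was bogus
    std::unordered_map<uint64_t, Line> lines;      // addr -> line (small structure in real HW)
    static constexpr std::size_t CAPACITY = 512;   // illustrative size

    void store(uint64_t addr, uint64_t data, bool inv) {
        if (lines.size() >= CAPACITY && !lines.count(addr)) {
            lines.erase(lines.begin());            // crude eviction: the data is lost
        }
        lines[addr] = {data, inv};
    }

    // Returns true on a runahead-cache hit; otherwise the core uses whatever the
    // parallel L1D/memory access returns.
    bool load(uint64_t addr, uint64_t& data, bool& inv) const {
        auto it = lines.find(addr);
        if (it == lines.end()) return false;
        data = it->second.data;
        inv  = it->second.inv;
        return true;
    }
};

int main() {
    RunaheadCache rc;
    rc.store(0x1000, 42, /*inv=*/false);           // runahead store
    uint64_t d; bool inv;
    if (rc.load(0x1000, d, inv))                   // later runahead load to same address
        printf("forwarded %llu (inv=%d)\n", (unsigned long long)d, (int)inv);
}
```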

Runahead (cont.)
- the branch predictor gets accessed/updated twice
- branch mispredicts cannot be resolved if the branch has an invalid input
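A small sketch (the interface names are my own) of why mispredicts cannot be repaired when the branch condition depends on the missing value: the predicted direction is all runahead has to go on:

```cpp
// Branch handling during runahead: branches with INV inputs cannot be verified.
#include <cstdio>

struct Branch { bool src_inv; bool actual_taken; };

// Returns the direction runahead follows; 'redirect' says whether a misprediction
// was detected and the runahead front-end should be steered.
bool resolve_in_runahead(const Branch& br, bool predicted_taken, bool& redirect) {
    if (br.src_inv) {            // condition depends on the missing load
        redirect = false;        // cannot verify: trust the prediction
        return predicted_taken;
    }
    redirect = (br.actual_taken != predicted_taken);  // normal resolution
    return br.actual_taken;
}

int main() {
    bool redirect;
    bool dir = resolve_in_runahead({/*src_inv=*/true, /*actual_taken=*/false},
                                   /*predicted_taken=*/true, redirect);
    printf("followed taken=%d, redirect=%d\n", (int)dir, (int)redirect);
}
```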

Another Form of Runahead
[Figure: a primary thread and a runahead thread, with occasional state copy and re-start between them.]

Methodology
- 80 benchmarks, 147 memory-bound code sequences of 30M instructions each: SPEC, Web, Media, Server, workstation, productivity
- Pentium 4 hardware prefetcher: eight stream buffers that stay 256 bytes ahead
- also evaluate a "future baseline" with twice as many resources
- perfect memory disambiguation, 500-cycle memory access
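The simulation parameters from this slide, collected as a configuration struct (only the values stated on the slide; everything else about the simulator is unspecified, and the field names are mine):

```cpp
// Methodology parameters as a config sketch; values come from the slide above.
#include <cstdio>

struct SimConfig {
    int  num_benchmarks          = 80;          // SPEC, Web, Media, Server, workstation, productivity
    int  num_code_sequences      = 147;         // memory-bound traces
    int  instructions_per_trace  = 30'000'000;
    int  stream_buffers          = 8;           // Pentium 4-style hardware prefetcher
    int  prefetch_distance_bytes = 256;         // buffers stay 256 bytes ahead
    int  memory_latency_cycles   = 500;
    bool perfect_memory_disambig = true;
    bool future_baseline         = false;       // 2x resources when true
};

int main() {
    SimConfig cfg;
    printf("memory latency: %d cycles, stream buffers: %d\n",
           cfg.memory_latency_cycles, cfg.stream_buffers);
}
```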

Results
- runahead improves performance by 22%
- synergistic interaction between prefetch and runahead: is the stream buffer not keeping up?

Other Results
- runahead with a 128-entry window does as well as a 384-entry window
- a better front-end improves the benefits from runahead
- on average, 431 useful instructions per runahead period and 280 after a mispredict
- without the runahead cache, only half the improvement is observed

Unanswered Questions
- how many re-executions? how many invalid instructions?
- how much wasted power? (re-executions, double writes to checkpoints)
- how many accesses to hash tables, pointers, and branch-dependent data?

Alternative Approaches
- does runahead lead to excessive power and verification complexity?
- better stride prefetchers or stream buffers?
- is this the best way to support a large in-flight window (register file, issue queue, ROB)?

Next Week’s Paper “Delaying Physical Register Allocation Through Virtual-Physical Registers”, T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals, Proceedings of MICRO-32, November 1999
