CS 7810 Lecture 10
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt
Proceedings of HPCA-9, February 2003
In-Flight Windows
[Figure: reorder buffer holding instructions #1 to #96 with destination registers p33 to p128 (registers 1-32 shown separately), several entries marked c, next to the physical register file. Instruction #4 is a load that misses in the cache (300 cycles). In the second step of the slide, instruction #97 is also a load that misses in the cache (300 cycles).]
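A back-of-the-envelope sketch of why the window in this figure becomes the bottleneck. The 96 in-flight instructions and the 300-cycle miss come from the slide; the dispatch width of 4 and the simplification that the blocked load sits at the head of the window are assumptions made only for illustration.

```python
def stall_cycles(window_entries, miss_latency, dispatch_width):
    """Cycles the front end sits idle while a load at the window head misses.

    Simplified model (an assumption, not the paper's simulator): the load
    holds one entry, the remaining entries fill at dispatch_width
    instructions per cycle, and nothing retires until the miss returns.
    """
    free_entries = window_entries - 1
    # Ceiling division: cycles until the window is completely full.
    cycles_to_fill = -(-free_entries // dispatch_width)
    return max(0, miss_latency - cycles_to_fill)


# 96 entries and a 300-cycle miss as on the slide; width 4 is assumed.
print(stall_cycles(96, 300, 4))  # 276
```

Under these assumptions the window fills in 24 cycles and the machine then waits for the rest of the miss, which is the idle time runahead tries to put to use.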
Memory Bottlenecks
128-entry window, real L2: 0.77 IPC
128-entry window, perfect L2
entry window, real L2
entry window, perfect L2
entry window, real L2, runahead: 0.94
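The two IPC values that survive on this slide are enough for a quick speedup check. This is a minimal sketch, assuming 0.77 is the baseline being compared against and 0.94 is the runahead configuration; the function name is illustrative.

```python
def speedup_percent(baseline_ipc, improved_ipc):
    """Relative IPC improvement over the baseline, in percent."""
    return (improved_ipc / baseline_ipc - 1.0) * 100.0


print(round(speedup_percent(0.77, 0.94)))  # 22
```

Under that assumption the result is in line with the 22% improvement quoted on the Results slide.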
Runahead
[Figure: pipeline block diagram with Trace Cache, Current Rename, IssueQ, Regfile (128), Checkpointed Regfile (32), Retired Rename, ROB, FUs, L1 D, Runahead Cache]
When the oldest instruction is a cache miss, behave like it causes a context-switch:
– checkpoint the committed registers, rename table, return address stack, and branch history register
– assume a bogus value and start a new thread
– this thread cannot modify program state, but can prefetch
Runahead
When the cache miss returns, copy the registers and the mapping and start executing from that ld/st instruction
– cost of copying back and forth is not trivial
– many instructions get executed twice
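The entry and exit steps can be pictured as a checkpoint and restore around the runahead period. The sketch below is a software analogy under stated assumptions, not the hardware mechanism: the class and field names (ArchState, Core, enter_runahead, exit_runahead) are hypothetical, and the state is copied wholesale where real hardware copies registers and mappings.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ArchState:
    """State the slides list as checkpointed (field names are illustrative)."""
    registers: dict = field(default_factory=dict)
    rename_table: dict = field(default_factory=dict)
    return_address_stack: list = field(default_factory=list)
    branch_history: int = 0
    pc: int = 0


class Core:
    def __init__(self, state):
        self.state = state
        self.checkpoint = None
        self.in_runahead = False

    def enter_runahead(self, miss_pc):
        # Save the committed state; execution will resume at the missing ld/st.
        self.checkpoint = copy.deepcopy(self.state)
        self.checkpoint.pc = miss_pc
        self.in_runahead = True

    def exit_runahead(self):
        # Discard everything done in runahead mode and restart from the checkpoint.
        self.state = self.checkpoint
        self.checkpoint = None
        self.in_runahead = False


core = Core(ArchState(registers={"r1": 7}, pc=0x40))
core.enter_runahead(miss_pc=0x40)
core.state.registers["r1"] = 999  # speculative write during runahead
core.state.pc = 0x80              # runahead has moved past the load
core.exit_runahead()
print(core.state.registers["r1"], hex(core.state.pc))  # 7 0x40
```

Everything between the two calls is thrown away and re-executed after the restore, which is the "executed twice" cost noted on the slide.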
Runahead
Note that some values are missing:
– Do not bother to execute instrs that have invalid inputs
– Accelerates the thread and generates accurate prefetches
– Unknown store addresses are ignored
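The handling of missing values can be sketched as an "invalid" marker that propagates through dependent instructions. This is an illustration under assumptions: the three-operation mini-ISA, the INV sentinel, and the dictionary standing in for the cache are all hypothetical and are not taken from the paper.

```python
INV = object()  # marker for a value that is unknown during runahead


def runahead(instrs, regs, cache):
    """Run instrs in runahead mode; return (registers, prefetch addresses).

    instrs is a list of (op, dst, src1, src2) tuples in a hypothetical
    mini-ISA: 'add' (dst = src1 + src2), 'load' (dst = mem[src1]) and
    'store' (mem[src1] = src2).
    """
    regs = dict(regs)  # runahead must not touch the real register state
    prefetches = []
    for op, dst, src1, src2 in instrs:
        if op == "add":
            a, b = regs[src1], regs[src2]
            # An invalid input makes the result invalid; nothing is computed.
            regs[dst] = INV if (a is INV or b is INV) else a + b
        elif op == "load":
            addr = regs[src1]
            if addr is INV:
                regs[dst] = INV          # address unknown: skip the load
            elif addr in cache:
                regs[dst] = cache[addr]  # hit: valid data
            else:
                prefetches.append(addr)  # miss: start the fetch early
                regs[dst] = INV
        elif op == "store":
            if regs[src1] is INV:
                continue                 # unknown store address is ignored
    return regs, prefetches


# r0 is the destination of the load that triggered runahead, so it is INV.
regs = {"r0": INV, "r1": 0x1000, "r2": 8}
cache = {0x1000: 42}
program = [
    ("add", "r3", "r0", "r2"),    # depends on r0: becomes INV
    ("add", "r4", "r1", "r2"),    # independent: 0x1008
    ("load", "r5", "r4", None),   # valid address, misses: prefetch
    ("load", "r6", "r3", None),   # INV address: skipped
    ("load", "r7", "r1", None),   # hits in the cache
    ("store", None, "r3", "r7"),  # INV address: ignored
]
out, prefetches = runahead(program, regs, cache)
print([hex(a) for a in prefetches])  # ['0x1008']
```

Only the load whose address does not depend on the missing value produces a prefetch, which is why the prefetches are accurate.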
Runahead
Runahead instrs write to registers (as before), but runahead stores write to the runahead cache:
– Runahead cache and L1D are accessed in parallel
– If a block gets evicted out of runahead cache, data is lost
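A small sketch of the runahead cache behaviour described here. Several details are assumptions made for illustration and are not stated on the slide: the oldest-first replacement policy, word-granularity addresses instead of blocks, and giving the runahead cache priority over the L1D when both hit. The class and function names are hypothetical.

```python
from collections import OrderedDict


class RunaheadCache:
    """Tiny buffer for data written by runahead stores."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def store(self, addr, value):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
        elif len(self.blocks) >= self.capacity:
            # Evict the oldest entry; its data is simply lost.
            self.blocks.popitem(last=False)
        self.blocks[addr] = value

    def lookup(self, addr):
        """Return (hit, value)."""
        if addr in self.blocks:
            return True, self.blocks[addr]
        return False, None


def runahead_load(addr, runahead_cache, l1d):
    # Hardware probes both structures in parallel; here the runahead
    # cache is assumed to win when both hit.
    hit, value = runahead_cache.lookup(addr)
    if hit:
        return value
    return l1d.get(addr)


l1d = {0x10: 1, 0x20: 2, 0x30: 3}
rc = RunaheadCache(capacity=2)
rc.store(0x10, 100)
rc.store(0x20, 200)
print(runahead_load(0x10, rc, l1d))  # 100
rc.store(0x30, 300)                  # evicts the entry for 0x10
print(runahead_load(0x10, rc, l1d))  # 1
```

After the eviction the later load sees the stale L1D value, so instructions that depend on it may compute wrong addresses. That is tolerable only because runahead results are discarded anyway.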
Runahead
The branch predictor gets accessed/updated twice
Cannot resolve branch mispredicts if the branch has an invalid input
Another Form of Runahead
[Figure: a Primary Thread and a Runahead Thread, with occasional state copy and re-start]
Methodology
80 benchmarks
– 147 code sequences (that are memory-bound)
– each 30M instructions
– SPEC, Web, Media, Server, workstation, productivity
Pentium 4 hardware prefetcher
– eight stream buffers that stay 256 bytes ahead
Also evaluate a “future baseline” with twice as many resources
Perfect memory disambiguation, 500-cycle memory access
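To make "eight stream buffers that stay 256 bytes ahead" concrete, here is a rough sketch of a stream prefetcher. Only the two numbers come from the slide. The 64-byte line size, the ascending-only streams, the oldest-first replacement, and the class itself are assumptions, and this does not model the actual Pentium 4 prefetcher.

```python
LINE = 64        # assumed cache-line size in bytes (not on the slide)
AHEAD = 256      # prefetch distance in bytes (from the slide)
NUM_STREAMS = 8  # number of stream buffers (from the slide)


class StreamPrefetcher:
    def __init__(self):
        # Each entry is the next line address a stream expects to see,
        # ordered from least to most recently used.
        self.streams = []

    def access(self, addr):
        """Record a demand miss; return the line addresses to prefetch."""
        line = addr - addr % LINE
        if line in self.streams:
            # The stream advanced by one line: top it up by one line so it
            # stays AHEAD bytes in front of the demand access.
            self.streams.remove(line)
            self.streams.append(line + LINE)
            return [line + AHEAD]
        # No stream matches: allocate one, replacing the oldest if needed.
        if len(self.streams) == NUM_STREAMS:
            self.streams.pop(0)
        self.streams.append(line + LINE)
        return [line + k * LINE for k in range(1, AHEAD // LINE + 1)]


p = StreamPrefetcher()
print([hex(a) for a in p.access(0x1000)])  # ['0x1040', '0x1080', '0x10c0', '0x1100']
print([hex(a) for a in p.access(0x1040)])  # ['0x1140']
```

A prefetcher of this kind only helps with sequential access patterns, which is one reason runahead can still find misses it does not cover.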
Results
Runahead improves performance by 22%
Synergistic interaction between prefetch & runahead
– is the stream buffer not keeping up?
Other Results
Runahead with a 128-entry window does as well as a 384-entry window
A better front-end improves benefits from runahead
On average, 431 useful instructions per runahead and 280 after a mispredict
Without the runahead cache, only half the improvement is observed
Unanswered Questions
How many re-execs? How many invalid instrs?
How much wasted power?
– re-execs, double writes to checkpoints
How many accesses to hash tables, pointers, and branch-dependent data?
Alternative Approaches
Does runahead lead to excessive power and verification complexity?
Better stride prefetchers or stream buffers?
Is this the best way to support a large in-flight window (register file, issueq, ROB)?
Next Week’s Paper
“Delaying Physical Register Allocation Through Virtual-Physical Registers”, T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals, Proceedings of MICRO-32, November 1999