Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi, Onur Mutlu, Yale N. Patt
UT Austin/Google, ETH Zürich, UT Austin
October 19th, 2016
Outline
- Overview of Runahead
- Runahead Limitations
- Continuous Runahead Dependence Chains
- Continuous Runahead Engine
- Continuous Runahead Evaluation
- Conclusions
Runahead Execution Overview
- Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
- The core checkpoints architectural state
- The result of the memory operation that caused the stall is marked as poisoned in the physical register file
- The core continues to fetch and execute instructions
- Operations are discarded instead of retired
- The goal is to generate new independent cache misses (see the sketch below)
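A minimal sketch of this mechanism, assuming a simplified core model with hypothetical checkpoint/fetch/cache hooks (none of these method names come from the paper); poison bits stand in for the poisoned physical registers:

```python
# Sketch of runahead mode: on a stall caused by a long-latency miss, the
# core checkpoints state, poisons the miss's destination register, and keeps
# fetching/executing only to generate new independent cache misses.
# All runahead results are discarded, never retired.

def enter_runahead(core, stalling_load):
    checkpoint = core.checkpoint_architectural_state()
    poisoned = {stalling_load.dest_reg}          # miss result is invalid

    while not core.miss_returned(stalling_load):
        inst = core.fetch_next()
        if any(src in poisoned for src in inst.src_regs):
            poisoned.add(inst.dest_reg)          # poison propagates to dependents
        elif inst.is_load() and core.cache_lookup(inst.address()) is None:
            core.issue_prefetch(inst.address())  # new independent miss: prefetch it
            poisoned.add(inst.dest_reg)          # its data is not yet available
        else:
            inst.execute()                       # valid result, but discarded later

    core.restore(checkpoint)                     # resume normal execution
```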
Traditional Runahead Accuracy
[Chart: prefetch accuracy of traditional runahead]
Runahead is 95% accurate.
Traditional Runahead Prefetch Coverage
[Chart: prefetch coverage of traditional runahead]
Runahead has only 13% prefetch coverage.
Traditional Runahead Performance Gain
[Chart: performance of runahead vs. a runahead oracle]
Runahead yields a 12% performance gain; a Runahead Oracle yields an 85% performance gain.
Traditional Runahead Interval Length
[Chart: distribution of runahead interval lengths]
Runahead intervals are short, which leads to the low performance gain.
Continuous Runahead Challenges
- Which instructions should execute during Continuous Runahead? Dynamically target the dependence chains that lead to critical cache misses
- What hardware should execute Continuous Runahead?
- How long should chains pre-execute?
Dependence Chains
A dependence chain is the sequence of operations that generates the address of a load that misses in the cache:
  LD [R3] -> R5
  ADD R4, R5 -> R9
  LD [R6] -> R8   (Cache Miss)
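To make the example concrete, here is a sketch of how such a chain can be recovered: start from the load that missed and walk backwards through the instruction stream, collecting the producer of each still-unresolved source register. The Instruction objects and their fields are illustrative, not the exact hardware algorithm:

```python
# Sketch: extract the dependence chain that produced a miss's address.
# Walk backwards from the missing load, tracking which source registers
# still need a producer, and collect those producers into the chain.

def extract_chain(instructions, miss_index):
    chain = [instructions[miss_index]]
    needed = set(instructions[miss_index].src_regs)   # registers to resolve
    for inst in reversed(instructions[:miss_index]):
        if inst.dest_reg in needed:
            chain.append(inst)
            needed.discard(inst.dest_reg)
            needed.update(inst.src_regs)              # now resolve its sources
    return list(reversed(chain))                      # oldest-first chain
```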
Dependence Chain Selection Policies
We experiment with three policies to determine the best one for Continuous Runahead (sketched below):
- PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
- Maximum Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
- Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
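The three policies can be viewed as selection functions over per-PC statistics. The stats dictionary and its fields below are illustrative stand-ins for the hardware tables, not the paper's implementation:

```python
# Illustrative comparison of the three chain-selection policies.
# stats maps a load PC -> {"misses": int, "stalls": int, "chain": <last chain>}

def pc_based(stats, blocking_pc):
    # Chain from the PC currently blocking retirement
    return stats[blocking_pc]["chain"]

def maximum_misses(stats):
    # Chain from the PC with the most misses across the application
    pc = max(stats, key=lambda p: stats[p]["misses"])
    return stats[pc]["chain"]

def stall_policy(stats):
    # Chain from the PC with the most full-window stalls
    pc = max(stats, key=lambda p: stats[p]["stalls"])
    return stats[pc]["chain"]
```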
[Chart: performance of the three dependence chain selection policies; the Stall Policy performs best]
Why Does the Stall Policy Work?
[Chart: cumulative fraction of full-window stalls covered by the top stalling PCs]
19 PCs cover 90% of all stalls.
Constrained Dependence Chain Storage
[Chart: performance vs. number of stored dependence chains]
Storing 1 chain provides 95% of the performance.
Continuous Runahead Chain Generation
Maintain two structures:
- A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
- The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
- Increment the counter of the PC that caused the stall
- Generate a dependence chain for the PC that has caused the most stalls
(A sketch of this bookkeeping follows below.)
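A minimal sketch of that bookkeeping, assuming a hypothetical on-stall hook and a chain-generation callback; the oldest-entry eviction policy is an illustrative choice, not taken from the paper:

```python
from collections import OrderedDict

# Sketch of Continuous Runahead chain-generation bookkeeping: a 32-entry
# cache of stall-causing PCs with counters, plus the last dependence chain
# generated for the PC with the most full-window stalls.

class ChainGenerator:
    MAX_PCS = 32

    def __init__(self):
        self.stall_counts = OrderedDict()  # PC -> number of full-window stalls
        self.stored_chain = None           # chain for the top stalling PC

    def on_full_window_stall(self, pc, generate_chain):
        # 32-entry cache: evict the oldest entry when full (illustrative policy)
        if pc not in self.stall_counts and len(self.stall_counts) >= self.MAX_PCS:
            self.stall_counts.popitem(last=False)
        self.stall_counts[pc] = self.stall_counts.get(pc, 0) + 1

        # Regenerate the stored chain if this PC now has the most stalls
        top_pc = max(self.stall_counts, key=self.stall_counts.get)
        if pc == top_pc:
            self.stored_chain = generate_chain(pc)
```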
Runahead for Longer Intervals
[Chart: performance as dependence chains are pre-executed for longer intervals]
CRE Microarchitecture
- No front end
- No register renaming hardware
- 32 physical registers
- 2-wide
- No floating point or vector pipeline
- 4 kB data cache
Dependence Chain Generation
The chain is walked backwards from the miss, one operation per cycle; each core physical register in the chain is remapped to a CRE physical register through a Register Remapping Table:

  Cycle 1: LD [P2] -> P8      ⇒  LD [E1] -> E0       Search list: P2
  Cycle 2: SHIFT P3 -> P2     ⇒  SHIFT E2 -> E1      Search list: P3
  Cycle 3: ADD P9 + P1 -> P3  ⇒  ADD E4 + E3 -> E2   Search list: P9, P1
  Cycle 4: SHIFT P1 -> P9     ⇒  SHIFT E3 -> E4      Search list: P1
  Cycle 5: ADD P7 + 1 -> P1   ⇒  ADD E5 + 1 -> E3    Search list: P7

Register Remapping Table: P7 → E5, P1 → E3, P8 → E0, P2 → E1, P3 → E2, P9 → E4
A MAP operation fills the chain's live-in register (here E5) with its architectural source value (e.g., EAX/EBX/ECX) before the chain executes.
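A sketch of the remapping step from the example above: as the chain is walked backwards (newest operation first), each core physical register gets a fresh CRE physical register from a remapping table. The op objects and their with_regs helper are hypothetical, and the exact allocation order of source registers is illustrative:

```python
# Sketch: remap a backwards-walked chain from core physical registers (P*)
# to CRE physical registers (E*) using a register remapping table.

def remap_chain(chain_newest_first):
    remap = {}        # core physical reg -> CRE physical reg
    next_e = 0        # next free CRE register number (E0, E1, ...)

    def cre_reg(p_reg):
        nonlocal next_e
        if p_reg not in remap:
            remap[p_reg] = f"E{next_e}"
            next_e += 1
        return remap[p_reg]

    remapped = []
    for op in chain_newest_first:               # e.g. LD [P2] -> P8 first
        dest = cre_reg(op.dest_reg)             # P8 -> E0, then P2 -> E1, ...
        srcs = [cre_reg(s) for s in op.src_regs]
        remapped.append(op.with_regs(dest=dest, srcs=srcs))
    return remapped, remap                      # live-ins are filled by MAP ops
```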
Interval Length
[Chart: performance sensitivity to the Continuous Runahead interval length]
System Configuration

Core (Single-Core/Quad-Core):
- 4-wide issue
- 256-entry reorder buffer
- 92-entry reservation station

Caches:
- 32 KB, 8-way set-associative L1 I/D-caches
- 1 MB, 8-way set-associative shared last-level cache per core
- Non-uniform memory access latency

DDR3 System:
- 256-entry memory queue
- Batch scheduling

Prefetchers:
- Stream, Global History Buffer (GHB)
- Feedback-directed prefetching: dynamic degree 1-32

CRE Compute:
- 2-wide issue
- 1 Continuous Runahead issue context with a 32-entry buffer and 32-entry physical register file
- 4 kB data cache
Single-Core Performance
[Chart: single-core speedup]
21% single-core performance increase over the prior state of the art.
Single-Core Performance + Prefetching
[Chart: single-core speedup combined with prefetchers]
Continuous Runahead increases performance both over and in conjunction with prefetching.
Independent Miss Coverage
[Chart: fraction of independent misses covered]
70% prefetch coverage.
Bandwidth Overhead
[Chart: memory bandwidth consumed]
Low bandwidth overhead.
Multi-Core Performance
[Chart: quad-core weighted speedup]
43% weighted speedup increase.
Multi-Core Performance + Prefetching
[Chart: quad-core weighted speedup with prefetching]
13% weighted speedup gain over GHB prefetching.
Multi-Core Energy Evaluation
[Chart: energy consumption]
22% energy reduction.
Conclusions
- Runahead prefetch coverage is limited by the duration of each runahead interval
- To remove this constraint, we introduce the notion of Continuous Runahead
- We dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
- We migrate these dependence chains to the CRE, where they are executed continuously in a loop
- Continuous Runahead greatly increases prefetch coverage
- It increases single-core performance by 34.4% and multi-core performance by 43.3%
- It is synergistic with various types of prefetching