Slide 1: Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)
October 19, 2016
Slide 2: Continuous Runahead Outline
- Overview of Runahead
- Runahead Limitations
- Continuous Runahead Dependence Chains
- Continuous Runahead Engine
- Continuous Runahead Evaluation
- Conclusions
Slide 4: Runahead Execution Overview
- Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
- The core checkpoints architectural state
- The result of the memory operation that caused the stall is marked as poisoned in the physical register file
- The core continues to fetch and execute instructions; operations are discarded instead of retired
- The goal is to generate new, independent cache misses (a minimal sketch follows)
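The mechanism can be pictured with a toy model. The Python sketch below is purely illustrative, not the paper's implementation: the RunaheadCore class, the instruction encoding, and the memory model are all assumptions. On a cache miss it records a prefetch and poisons the destination register so dependent operations are skipped; at the end it restores the checkpoint, so nothing retires.

```python
POISON = object()  # sentinel marking an invalid (poisoned) register value

class RunaheadCore:
    """Toy core: on a cache miss it poisons the destination register and
    keeps executing to expose more independent misses (hypothetical model)."""

    def __init__(self, regs, memory, cached):
        self.regs = dict(regs)     # architectural register file
        self.memory = memory       # addr -> value
        self.cached = set(cached)  # addresses currently in the cache

    def runahead(self, program):
        checkpoint = dict(self.regs)   # checkpoint taken when runahead begins
        prefetches = []
        for op, *args in program:
            if op == "LD":                           # LD [addr_reg] -> dst
                addr_reg, dst = args
                addr = self.regs[addr_reg]
                if addr is POISON:
                    self.regs[dst] = POISON          # poison propagates
                elif addr in self.cached:
                    self.regs[dst] = self.memory[addr]
                else:
                    prefetches.append(addr)          # new independent miss
                    self.regs[dst] = POISON
            elif op == "ADD":                        # ADD a, b -> dst
                a, b, dst = args
                va, vb = self.regs[a], self.regs[b]
                self.regs[dst] = POISON if POISON in (va, vb) else va + vb
        self.regs = checkpoint   # runahead results are discarded, not retired
        return prefetches

# Two loads to uncached addresses: both become prefetches in one interval.
core = RunaheadCore({"R1": 100, "R2": 200}, {100: 7, 200: 9}, cached=[])
print(core.runahead([("LD", "R1", "R3"), ("LD", "R2", "R4")]))  # [100, 200]
```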
Slides 5-6: Traditional Runahead Accuracy
[Figure] Runahead is 95% accurate.
Slides 7-8: Traditional Runahead Prefetch Coverage
[Figure] Runahead has only 13% prefetch coverage.
Slides 9-10: Traditional Runahead Performance Gain
[Figure] Runahead yields a 12% performance gain; a runahead oracle yields an 85% performance gain.
Slides 11-12: Traditional Runahead Interval Length
[Figure] Runahead intervals are short, which limits the performance gain.
Slide 13: Continuous Runahead Challenges
- Which instructions should be used during Continuous Runahead? Dynamically target the dependence chains that lead to critical cache misses.
- What hardware should be used for Continuous Runahead?
- How long should chains pre-execute?
Slide 14: Dependence Chains
Example chain leading to a cache miss:
LD [R3] -> R5
ADD R4, R5 -> R9
LD [R6] -> R8  (cache miss)
Slide 15: Dependence Chain Selection Policies
Three policies are evaluated to determine which works best for Continuous Runahead (see the sketch after this list):
- PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
- Maximum-Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
- Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
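A compact sketch of the three policies as selection functions over per-PC statistics. The ChainStats record and its field names are assumptions made for illustration, not hardware structures from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class ChainStats:
    pc: int
    misses: int = 0               # LLC misses generated by this PC
    full_window_stalls: int = 0   # full-window stalls caused by this PC
    chain: list = field(default_factory=list)  # last recorded chain

def pc_based(stats, blocking_pc):
    """Chain of the PC currently blocking retirement."""
    return stats[blocking_pc].chain

def max_misses(stats):
    """Chain of the PC with the most misses application-wide."""
    return max(stats.values(), key=lambda s: s.misses).chain

def stall_policy(stats):
    """Chain of the PC with the most full-window stalls."""
    return max(stats.values(), key=lambda s: s.full_window_stalls).chain
```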
Slide 16: Dependence Chain Selection Policies
[Figure: performance of the three policies; the Stall policy performs best]
Slides 17-18: Why Does the Stall Policy Work?
[Figure] 19 PCs cover 90% of all stalls.
Slides 19-20: Constrained Dependence Chain Storage
[Figure] Storing a single chain provides 95% of the performance.
Slide 21: Continuous Runahead Chain Generation
Maintain two structures (sketched below):
- A 32-entry cache of PCs that tracks the operations that frequently cause the pipeline to stall
- The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
- Increment the counter of the PC that caused the stall
- Generate a dependence chain for the PC that has caused the most stalls
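A software sketch of these two structures, with the table size taken from the slide. The class name, the eviction choice (evict the least-stalling PC), and the extract_chain stub are illustrative assumptions.

```python
def extract_chain(pc, rob):
    # Placeholder for the backward dataflow walk (sketched at Slide 24);
    # here we just pull the stalling PC's operations out of the ROB.
    return [insn for insn_pc, insn in rob if insn_pc == pc]

class StallTracker:
    """Hypothetical model of the two chain-generation structures:
    a 32-entry PC stall cache plus the last chain for the hottest PC."""
    MAX_PCS = 32   # 32-entry cache of stalling PCs (size from the slide)

    def __init__(self):
        self.stall_counts = {}   # pc -> full-window stall count
        self.hot_chain = None    # last chain for the PC with the most stalls

    def on_full_window_stall(self, pc, rob):
        if pc not in self.stall_counts and len(self.stall_counts) >= self.MAX_PCS:
            coldest = min(self.stall_counts, key=self.stall_counts.get)
            del self.stall_counts[coldest]   # evict the least-stalling PC
        self.stall_counts[pc] = self.stall_counts.get(pc, 0) + 1

        # Keep the stored chain in sync with whichever PC now leads.
        leader = max(self.stall_counts, key=self.stall_counts.get)
        self.hot_chain = extract_chain(leader, rob)
```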
Slide 22: Runahead for Longer Intervals
[Figure]
Slide 23: CRE Microarchitecture
- No front-end
- No register renaming hardware
- 32 physical registers
- 2-wide issue
- No floating-point or vector pipeline
- 4 kB data cache
These parameters are gathered into a configuration record below.
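The slide's parameters as a configuration record; the class and field names are illustrative assumptions, not a real simulator API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CREConfig:
    issue_width: int = 2              # 2-wide issue
    physical_registers: int = 32      # no renaming: chains arrive pre-renamed
    data_cache_bytes: int = 4 * 1024  # 4 kB data cache
    has_front_end: bool = False       # chains are injected, not fetched/decoded
    has_fp_vector_pipe: bool = False  # no floating-point or vector pipeline
```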
Slide 24: Dependence Chain Generation
[Figure: a five-operation chain walked backward through the ROB over cycles 1-5 and remapped into CRE registers]
Starting from the stalling load, source physical registers are pushed onto a search list (P2, then P3, then P9 and P1, then P7) and the producing operations are collected. A register remapping table, indexed by architectural registers (EAX, EBX, ECX), renames each core physical register to a CRE physical register and records the first CRE physical register assigned to each; live-in values enter the chain through MAP operations (MAP E3 -> E5 in the figure). The example chain and its remapping:

Core chain             Remapped CRE chain
ADD P7 -> P1           ADD E5 -> E3
SHIFT P1 -> P9         SHIFT E3 -> E4
ADD P9 + P1 -> P3      ADD E4 + E3 -> E2
SHIFT P3 -> P2         SHIFT E2 -> E1
LD [P2] -> P8          LD [E1] -> E0

A sketch of this walk follows.
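A minimal sketch of the extraction step: a backward walk through a simplified reorder buffer, driven by a search list of source registers, followed by renaming into fresh CRE registers. The ROB encoding and the function name are assumptions, and the E-register numbering is assigned fresh rather than matching the figure.

```python
def extract_and_remap(rob, miss_index):
    """rob: list of (dest_reg, src_regs, opcode); miss_index: stalling load."""
    chain = [miss_index]
    search = list(rob[miss_index][1])      # source regs of the stalling load
    for i in range(miss_index - 1, -1, -1):
        dest, srcs, _ = rob[i]
        if dest in search:                 # found a producer: add to the chain
            search.remove(dest)
            search.extend(srcs)
            chain.append(i)
    chain.reverse()

    # Rename each core physical register to a fresh CRE register E0, E1, ...
    remap, next_e = {}, 0
    def rename(reg):
        nonlocal next_e
        if reg not in remap:
            remap[reg] = f"E{next_e}"
            next_e += 1
        return remap[reg]

    remapped = []
    for i in chain:
        dest, srcs, op = rob[i]
        remapped.append((rename(dest), [rename(s) for s in srcs], op))
    return remapped

# The slide's example chain (live-in sources omitted for simplicity):
rob = [("P1", [], "ADD"), ("P9", ["P1"], "SHIFT"),
       ("P3", ["P9", "P1"], "ADD"), ("P2", ["P3"], "SHIFT"),
       ("P8", ["P2"], "LD")]
print(extract_and_remap(rob, miss_index=4))
```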
Slide 26: Interval Length
[Figure]
Slide 27: System Configuration
Single-Core/Quad-Core:
- 4-wide issue
- 256-entry reorder buffer
- 92-entry reservation station
Caches:
- 32 KB, 8-way set-associative L1 I/D caches
- 1 MB, 8-way set-associative shared last-level cache per core
Memory:
- Non-uniform memory access latency, DDR3 system
- 256-entry memory queue, batch scheduling
Prefetchers:
- Stream, Global History Buffer (GHB)
- Feedback-directed prefetching: dynamic degree 1-32
CRE Compute:
- 2-wide issue
- 1 Continuous Runahead issue context with a 32-entry buffer and a 32-entry physical register file
- 4 kB data cache
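For completeness, the baseline parameters in the same illustrative record style as the CRE sketch above; field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemConfig:
    cores: int = 1                         # single-core, or 4 for quad-core
    issue_width: int = 4
    rob_entries: int = 256
    rs_entries: int = 92
    l1_bytes: int = 32 * 1024              # 8-way, split I/D
    llc_bytes_per_core: int = 1024 * 1024  # 8-way, shared
    memory_queue_entries: int = 256        # DDR3, batch scheduling
    prefetch_degree_max: int = 32          # feedback-directed, degree 1-32

quad = SystemConfig(cores=4)
```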
Slides 28-29: Single-Core Performance
[Figure] 21% single-core performance increase over the prior state of the art.
Slides 30-31: Single-Core Performance + Prefetching
[Figure] Continuous Runahead increases performance both over and in conjunction with prefetching.
Slides 32-33: Independent Miss Coverage
[Figure] 70% prefetch coverage.
Slides 34-35: Bandwidth Overhead
[Figure] Continuous Runahead has low bandwidth overhead.
Slides 36-37: Multi-Core Performance
[Figure] 43% weighted speedup increase.
Slides 38-39: Multi-Core Performance + Prefetching
[Figure] 13% weighted speedup gain over GHB prefetching.
Slides 40-41: Multi-Core Energy Evaluation
[Figure] 22% energy reduction.
Slide 42: Conclusions
- Runahead prefetch coverage is limited by the duration of each runahead interval
- To remove this constraint, we introduce the notion of Continuous Runahead
- We dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that frequently cause the pipeline to stall
- We migrate these dependence chains to the CRE, where they are executed continuously in a loop
Slide 43: Conclusions
- Continuous Runahead greatly increases prefetch coverage
- Increases single-core performance by 34.4%
- Increases multi-core performance by 43.3%
- Synergistic with various types of prefetching
Slide 44: Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Closing slide, repeating the title and authors: Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin), October 19, 2016.