Slide 1: Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)
October 19, 2016
Slide 2: Continuous Runahead Outline
- Overview of Runahead
- Runahead Limitations
- Continuous Runahead Dependence Chains
- Continuous Runahead Engine
- Continuous Runahead Evaluation
- Conclusions
Slide 4: Runahead Execution Overview
- Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
- The core checkpoints architectural state
- The result of the memory operation that caused the stall is marked as poisoned in the physical register file
- The core continues to fetch and execute instructions; operations are discarded instead of retired
- The goal is to generate new, independent cache misses (a minimal sketch follows)
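The mechanism can be pictured with a toy model. The Python sketch below is purely illustrative, not the paper's implementation: the RunaheadCore class, the instruction encoding, and the memory model are all assumptions. On a cache miss it records a prefetch and poisons the destination register so dependent operations are skipped; at the end it restores the checkpoint, so nothing retires.

```python
POISON = object()  # sentinel marking an invalid (poisoned) register value

class RunaheadCore:
    """Toy core: on a cache miss it poisons the destination register and
    keeps executing to expose more independent misses (hypothetical model)."""

    def __init__(self, regs, memory, cached):
        self.regs = dict(regs)     # architectural register file
        self.memory = memory       # addr -> value
        self.cached = set(cached)  # addresses currently in the cache

    def runahead(self, program):
        checkpoint = dict(self.regs)   # checkpoint taken when runahead begins
        prefetches = []
        for op, *args in program:
            if op == "LD":                           # LD [addr_reg] -> dst
                addr_reg, dst = args
                addr = self.regs[addr_reg]
                if addr is POISON:
                    self.regs[dst] = POISON          # poison propagates
                elif addr in self.cached:
                    self.regs[dst] = self.memory[addr]
                else:
                    prefetches.append(addr)          # new independent miss
                    self.regs[dst] = POISON
            elif op == "ADD":                        # ADD a, b -> dst
                a, b, dst = args
                va, vb = self.regs[a], self.regs[b]
                self.regs[dst] = POISON if POISON in (va, vb) else va + vb
        self.regs = checkpoint   # runahead results are discarded, not retired
        return prefetches

# Two loads to uncached addresses: both become prefetches in one interval.
core = RunaheadCore({"R1": 100, "R2": 200}, {100: 7, 200: 9}, cached=[])
print(core.runahead([("LD", "R1", "R3"), ("LD", "R2", "R4")]))  # [100, 200]
```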
Slides 5-6: Traditional Runahead Accuracy
[Figure] Runahead is 95% accurate.
Slides 7-8: Traditional Runahead Prefetch Coverage
[Figure] Runahead has only 13% prefetch coverage.
Slides 9-10: Traditional Runahead Performance Gain
[Figure] Runahead yields a 12% performance gain; a runahead oracle yields an 85% performance gain.
Slides 11-12: Traditional Runahead Interval Length
[Figure] Runahead intervals are short, which limits the performance gain.
Slide 13: Continuous Runahead Challenges
- Which instructions should be used during Continuous Runahead? Dynamically target the dependence chains that lead to critical cache misses.
- What hardware should be used for Continuous Runahead?
- How long should chains pre-execute?
Slide 14: Dependence Chains
Example chain leading to a cache miss:
LD [R3] -> R5
ADD R4, R5 -> R9
LD [R6] -> R8  (cache miss)
Slide 15: Dependence Chain Selection Policies
Three policies are evaluated to determine which works best for Continuous Runahead (see the sketch after this list):
- PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
- Maximum-Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
- Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
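A compact sketch of the three policies as selection functions over per-PC statistics. The ChainStats record and its field names are assumptions made for illustration, not hardware structures from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class ChainStats:
    pc: int
    misses: int = 0               # LLC misses generated by this PC
    full_window_stalls: int = 0   # full-window stalls caused by this PC
    chain: list = field(default_factory=list)  # last recorded chain

def pc_based(stats, blocking_pc):
    """Chain of the PC currently blocking retirement."""
    return stats[blocking_pc].chain

def max_misses(stats):
    """Chain of the PC with the most misses application-wide."""
    return max(stats.values(), key=lambda s: s.misses).chain

def stall_policy(stats):
    """Chain of the PC with the most full-window stalls."""
    return max(stats.values(), key=lambda s: s.full_window_stalls).chain
```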
Slide 16: Dependence Chain Selection Policies
[Figure: performance of the three policies; the Stall policy performs best]
Slides 17-18: Why Does the Stall Policy Work?
[Figure] 19 PCs cover 90% of all stalls.
Slides 19-20: Constrained Dependence Chain Storage
[Figure] Storing a single chain provides 95% of the performance.
Slide 21: Continuous Runahead Chain Generation
Maintain two structures (sketched below):
- A 32-entry cache of PCs that tracks the operations that frequently cause the pipeline to stall
- The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
- Increment the counter of the PC that caused the stall
- Generate a dependence chain for the PC that has caused the most stalls
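A software sketch of these two structures, with the table size taken from the slide. The class name, the eviction choice (evict the least-stalling PC), and the extract_chain stub are illustrative assumptions.

```python
def extract_chain(pc, rob):
    # Placeholder for the backward dataflow walk (sketched at Slide 24);
    # here we just pull the stalling PC's operations out of the ROB.
    return [insn for insn_pc, insn in rob if insn_pc == pc]

class StallTracker:
    """Hypothetical model of the two chain-generation structures:
    a 32-entry PC stall cache plus the last chain for the hottest PC."""
    MAX_PCS = 32   # 32-entry cache of stalling PCs (size from the slide)

    def __init__(self):
        self.stall_counts = {}   # pc -> full-window stall count
        self.hot_chain = None    # last chain for the PC with the most stalls

    def on_full_window_stall(self, pc, rob):
        if pc not in self.stall_counts and len(self.stall_counts) >= self.MAX_PCS:
            coldest = min(self.stall_counts, key=self.stall_counts.get)
            del self.stall_counts[coldest]   # evict the least-stalling PC
        self.stall_counts[pc] = self.stall_counts.get(pc, 0) + 1

        # Keep the stored chain in sync with whichever PC now leads.
        leader = max(self.stall_counts, key=self.stall_counts.get)
        self.hot_chain = extract_chain(leader, rob)
```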
Slide 22: Runahead for Longer Intervals
[Figure]
Slide 23: CRE Microarchitecture
- No front-end
- No register renaming hardware
- 32 physical registers
- 2-wide issue
- No floating-point or vector pipeline
- 4 kB data cache
These parameters are gathered into a configuration record below.
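The slide's parameters as a configuration record; the class and field names are illustrative assumptions, not a real simulator API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CREConfig:
    issue_width: int = 2              # 2-wide issue
    physical_registers: int = 32      # no renaming: chains arrive pre-renamed
    data_cache_bytes: int = 4 * 1024  # 4 kB data cache
    has_front_end: bool = False       # chains are injected, not fetched/decoded
    has_fp_vector_pipe: bool = False  # no floating-point or vector pipeline
```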
Slide 24: Dependence Chain Generation
[Figure: a five-operation chain walked backward through the ROB over cycles 1-5 and remapped into CRE registers]
Starting from the stalling load, source physical registers are pushed onto a search list (P2, then P3, then P9 and P1, then P7) and the producing operations are collected. A register remapping table, indexed by architectural registers (EAX, EBX, ECX), renames each core physical register to a CRE physical register and records the first CRE physical register assigned to each; live-in values enter the chain through MAP operations (MAP E3 -> E5 in the figure). The example chain and its remapping:

Core chain             Remapped CRE chain
ADD P7 -> P1           ADD E5 -> E3
SHIFT P1 -> P9         SHIFT E3 -> E4
ADD P9 + P1 -> P3      ADD E4 + E3 -> E2
SHIFT P3 -> P2         SHIFT E2 -> E1
LD [P2] -> P8          LD [E1] -> E0

A sketch of this walk follows.
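A minimal sketch of the extraction step: a backward walk through a simplified reorder buffer, driven by a search list of source registers, followed by renaming into fresh CRE registers. The ROB encoding and the function name are assumptions, and the E-register numbering is assigned fresh rather than matching the figure.

```python
def extract_and_remap(rob, miss_index):
    """rob: list of (dest_reg, src_regs, opcode); miss_index: stalling load."""
    chain = [miss_index]
    search = list(rob[miss_index][1])      # source regs of the stalling load
    for i in range(miss_index - 1, -1, -1):
        dest, srcs, _ = rob[i]
        if dest in search:                 # found a producer: add to the chain
            search.remove(dest)
            search.extend(srcs)
            chain.append(i)
    chain.reverse()

    # Rename each core physical register to a fresh CRE register E0, E1, ...
    remap, next_e = {}, 0
    def rename(reg):
        nonlocal next_e
        if reg not in remap:
            remap[reg] = f"E{next_e}"
            next_e += 1
        return remap[reg]

    remapped = []
    for i in chain:
        dest, srcs, op = rob[i]
        remapped.append((rename(dest), [rename(s) for s in srcs], op))
    return remapped

# The slide's example chain (live-in sources omitted for simplicity):
rob = [("P1", [], "ADD"), ("P9", ["P1"], "SHIFT"),
       ("P3", ["P9", "P1"], "ADD"), ("P2", ["P3"], "SHIFT"),
       ("P8", ["P2"], "LD")]
print(extract_and_remap(rob, miss_index=4))
```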
Slide 26: Interval Length
[Figure]
Slide 27: System Configuration
Single-Core/Quad-Core:
- 4-wide issue
- 256-entry reorder buffer
- 92-entry reservation station
Caches:
- 32 KB, 8-way set-associative L1 I/D caches
- 1 MB, 8-way set-associative shared last-level cache per core
Memory:
- Non-uniform memory access latency, DDR3 system
- 256-entry memory queue, batch scheduling
Prefetchers:
- Stream, Global History Buffer (GHB)
- Feedback-directed prefetching: dynamic degree 1-32
CRE Compute:
- 2-wide issue
- 1 Continuous Runahead issue context with a 32-entry buffer and a 32-entry physical register file
- 4 kB data cache
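For completeness, the baseline parameters in the same illustrative record style as the CRE sketch above; field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemConfig:
    cores: int = 1                         # single-core, or 4 for quad-core
    issue_width: int = 4
    rob_entries: int = 256
    rs_entries: int = 92
    l1_bytes: int = 32 * 1024              # 8-way, split I/D
    llc_bytes_per_core: int = 1024 * 1024  # 8-way, shared
    memory_queue_entries: int = 256        # DDR3, batch scheduling
    prefetch_degree_max: int = 32          # feedback-directed, degree 1-32

quad = SystemConfig(cores=4)
```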
Slides 28-29: Single-Core Performance
[Figure] 21% single-core performance increase over the prior state of the art.
Slides 30-31: Single-Core Performance + Prefetching
[Figure] Continuous Runahead increases performance both over and in conjunction with prefetching.
Slides 32-33: Independent Miss Coverage
[Figure] 70% prefetch coverage.
Slides 34-35: Bandwidth Overhead
[Figure] Continuous Runahead has low bandwidth overhead.
Slides 36-37: Multi-Core Performance
[Figure] 43% weighted speedup increase.
Slides 38-39: Multi-Core Performance + Prefetching
[Figure] 13% weighted speedup gain over GHB prefetching.
Slides 40-41: Multi-Core Energy Evaluation
[Figure] 22% energy reduction.
Slide 42: Conclusions
- Runahead prefetch coverage is limited by the duration of each runahead interval
- To remove this constraint, we introduce the notion of Continuous Runahead
- We dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that frequently cause the pipeline to stall
- We migrate these dependence chains to the CRE, where they are executed continuously in a loop
Slide 43: Conclusions
- Continuous Runahead greatly increases prefetch coverage
- Increases single-core performance by 34.4%
- Increases multi-core performance by 43.3%
- Synergistic with various types of prefetching
Slide 44: Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Closing slide, repeating the title and authors: Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin), October 19, 2016.