Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi, Onur Mutlu, Yale N. Patt
UT Austin/Google, ETH Zürich, UT Austin
October 19th, 2016
Outline
- Overview of Runahead
- Runahead Limitations
- Continuous Runahead Dependence Chains
- Continuous Runahead Engine
- Continuous Runahead Evaluation
- Conclusions
Runahead Execution Overview
- Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
- The core checkpoints architectural state
- The result of the memory operation that caused the stall is marked as poisoned in the physical register file
- The core continues to fetch and execute instructions
- Operations are discarded instead of retired
- The goal is to generate new independent cache misses (see the sketch below)
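A minimal sketch of this mechanism, assuming a simplified core model with hypothetical checkpoint/fetch/cache hooks (none of these method names come from the paper); poison bits stand in for the poisoned physical registers:

```python
# Sketch of runahead mode: on a stall caused by a long-latency miss, the
# core checkpoints state, poisons the miss's destination register, and keeps
# fetching/executing only to generate new independent cache misses.
# All runahead results are discarded, never retired.

def enter_runahead(core, stalling_load):
    checkpoint = core.checkpoint_architectural_state()
    poisoned = {stalling_load.dest_reg}          # miss result is invalid

    while not core.miss_returned(stalling_load):
        inst = core.fetch_next()
        if any(src in poisoned for src in inst.src_regs):
            poisoned.add(inst.dest_reg)          # poison propagates to dependents
        elif inst.is_load() and core.cache_lookup(inst.address()) is None:
            core.issue_prefetch(inst.address())  # new independent miss: prefetch it
            poisoned.add(inst.dest_reg)          # its data is not yet available
        else:
            inst.execute()                       # valid result, but discarded later

    core.restore(checkpoint)                     # resume normal execution
```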
Traditional Runahead Accuracy
[Chart: prefetch accuracy of traditional runahead]
Runahead is 95% accurate.
Traditional Runahead Prefetch Coverage
[Chart: prefetch coverage of traditional runahead]
Runahead has only 13% prefetch coverage.
Traditional Runahead Performance Gain
[Chart: performance of runahead vs. a runahead oracle]
Runahead yields a 12% performance gain; a Runahead Oracle yields an 85% performance gain.
Traditional Runahead Interval Length
[Chart: distribution of runahead interval lengths]
Runahead intervals are short, which leads to the low performance gain.
Continuous Runahead Challenges
- Which instructions should execute during Continuous Runahead? Dynamically target the dependence chains that lead to critical cache misses
- What hardware should execute Continuous Runahead?
- How long should chains pre-execute?
Dependence Chains
A dependence chain is the sequence of operations that generates the address of a load that misses in the cache:
  LD [R3] -> R5
  ADD R4, R5 -> R9
  LD [R6] -> R8   (Cache Miss)
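To make the example concrete, here is a sketch of how such a chain can be recovered: start from the load that missed and walk backwards through the instruction stream, collecting the producer of each still-unresolved source register. The Instruction objects and their fields are illustrative, not the exact hardware algorithm:

```python
# Sketch: extract the dependence chain that produced a miss's address.
# Walk backwards from the missing load, tracking which source registers
# still need a producer, and collect those producers into the chain.

def extract_chain(instructions, miss_index):
    chain = [instructions[miss_index]]
    needed = set(instructions[miss_index].src_regs)   # registers to resolve
    for inst in reversed(instructions[:miss_index]):
        if inst.dest_reg in needed:
            chain.append(inst)
            needed.discard(inst.dest_reg)
            needed.update(inst.src_regs)              # now resolve its sources
    return list(reversed(chain))                      # oldest-first chain
```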
Dependence Chain Selection Policies
We experiment with three policies to determine the best one for Continuous Runahead (sketched below):
- PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
- Maximum Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
- Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
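The three policies can be viewed as selection functions over per-PC statistics. The stats dictionary and its fields below are illustrative stand-ins for the hardware tables, not the paper's implementation:

```python
# Illustrative comparison of the three chain-selection policies.
# stats maps a load PC -> {"misses": int, "stalls": int, "chain": <last chain>}

def pc_based(stats, blocking_pc):
    # Chain from the PC currently blocking retirement
    return stats[blocking_pc]["chain"]

def maximum_misses(stats):
    # Chain from the PC with the most misses across the application
    pc = max(stats, key=lambda p: stats[p]["misses"])
    return stats[pc]["chain"]

def stall_policy(stats):
    # Chain from the PC with the most full-window stalls
    pc = max(stats, key=lambda p: stats[p]["stalls"])
    return stats[pc]["chain"]
```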
[Chart: performance of the three dependence chain selection policies; the Stall Policy performs best]
Why Does the Stall Policy Work?
[Chart: cumulative fraction of full-window stalls covered by the top stalling PCs]
19 PCs cover 90% of all stalls.
Constrained Dependence Chain Storage
[Chart: performance vs. number of stored dependence chains]
Storing 1 chain provides 95% of the performance.
Continuous Runahead Chain Generation
Maintain two structures:
- A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
- The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
- Increment the counter of the PC that caused the stall
- Generate a dependence chain for the PC that has caused the most stalls
(A sketch of this bookkeeping follows below.)
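A minimal sketch of that bookkeeping, assuming a hypothetical on-stall hook and a chain-generation callback; the oldest-entry eviction policy is an illustrative choice, not taken from the paper:

```python
from collections import OrderedDict

# Sketch of Continuous Runahead chain-generation bookkeeping: a 32-entry
# cache of stall-causing PCs with counters, plus the last dependence chain
# generated for the PC with the most full-window stalls.

class ChainGenerator:
    MAX_PCS = 32

    def __init__(self):
        self.stall_counts = OrderedDict()  # PC -> number of full-window stalls
        self.stored_chain = None           # chain for the top stalling PC

    def on_full_window_stall(self, pc, generate_chain):
        # 32-entry cache: evict the oldest entry when full (illustrative policy)
        if pc not in self.stall_counts and len(self.stall_counts) >= self.MAX_PCS:
            self.stall_counts.popitem(last=False)
        self.stall_counts[pc] = self.stall_counts.get(pc, 0) + 1

        # Regenerate the stored chain if this PC now has the most stalls
        top_pc = max(self.stall_counts, key=self.stall_counts.get)
        if pc == top_pc:
            self.stored_chain = generate_chain(pc)
```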
Runahead for Longer Intervals
[Chart: performance as dependence chains are pre-executed for longer intervals]
CRE Microarchitecture
- No front end
- No register renaming hardware
- 32 physical registers
- 2-wide
- No floating point or vector pipeline
- 4 kB data cache
Dependence Chain Generation
The chain is walked backwards from the miss, one operation per cycle; each core physical register in the chain is remapped to a CRE physical register through a Register Remapping Table:

  Cycle 1: LD [P2] -> P8      ⇒  LD [E1] -> E0       Search list: P2
  Cycle 2: SHIFT P3 -> P2     ⇒  SHIFT E2 -> E1      Search list: P3
  Cycle 3: ADD P9 + P1 -> P3  ⇒  ADD E4 + E3 -> E2   Search list: P9, P1
  Cycle 4: SHIFT P1 -> P9     ⇒  SHIFT E3 -> E4      Search list: P1
  Cycle 5: ADD P7 + 1 -> P1   ⇒  ADD E5 + 1 -> E3    Search list: P7

Register Remapping Table: P7 → E5, P1 → E3, P8 → E0, P2 → E1, P3 → E2, P9 → E4
A MAP operation fills the chain's live-in register (here E5) with its architectural source value (e.g., EAX/EBX/ECX) before the chain executes.
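A sketch of the remapping step from the example above: as the chain is walked backwards (newest operation first), each core physical register gets a fresh CRE physical register from a remapping table. The op objects and their with_regs helper are hypothetical, and the exact allocation order of source registers is illustrative:

```python
# Sketch: remap a backwards-walked chain from core physical registers (P*)
# to CRE physical registers (E*) using a register remapping table.

def remap_chain(chain_newest_first):
    remap = {}        # core physical reg -> CRE physical reg
    next_e = 0        # next free CRE register number (E0, E1, ...)

    def cre_reg(p_reg):
        nonlocal next_e
        if p_reg not in remap:
            remap[p_reg] = f"E{next_e}"
            next_e += 1
        return remap[p_reg]

    remapped = []
    for op in chain_newest_first:               # e.g. LD [P2] -> P8 first
        dest = cre_reg(op.dest_reg)             # P8 -> E0, then P2 -> E1, ...
        srcs = [cre_reg(s) for s in op.src_regs]
        remapped.append(op.with_regs(dest=dest, srcs=srcs))
    return remapped, remap                      # live-ins are filled by MAP ops
```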
Interval Length
[Chart: performance sensitivity to the Continuous Runahead interval length]
System Configuration

Core (Single-Core/Quad-Core):
- 4-wide issue
- 256-entry reorder buffer
- 92-entry reservation station

Caches:
- 32 KB, 8-way set-associative L1 I/D-caches
- 1 MB, 8-way set-associative shared last-level cache per core
- Non-uniform memory access latency

DDR3 System:
- 256-entry memory queue
- Batch scheduling

Prefetchers:
- Stream, Global History Buffer (GHB)
- Feedback-directed prefetching: dynamic degree 1-32

CRE Compute:
- 2-wide issue
- 1 Continuous Runahead issue context with a 32-entry buffer and 32-entry physical register file
- 4 kB data cache
Single-Core Performance
[Chart: single-core speedup]
21% single-core performance increase over the prior state of the art.
Single-Core Performance + Prefetching
[Chart: single-core speedup combined with prefetchers]
Continuous Runahead increases performance both over and in conjunction with prefetching.
Independent Miss Coverage
[Chart: fraction of independent misses covered]
70% prefetch coverage.
Bandwidth Overhead
[Chart: memory bandwidth consumed]
Low bandwidth overhead.
Multi-Core Performance
[Chart: quad-core weighted speedup]
43% weighted speedup increase.
Multi-Core Performance + Prefetching
[Chart: quad-core weighted speedup with prefetching]
13% weighted speedup gain over GHB prefetching.
Multi-Core Energy Evaluation
[Chart: energy consumption]
22% energy reduction.
Conclusions
- Runahead prefetch coverage is limited by the duration of each runahead interval
- To remove this constraint, we introduce the notion of Continuous Runahead
- We dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
- We migrate these dependence chains to the CRE, where they are executed continuously in a loop
- Continuous Runahead greatly increases prefetch coverage
- It increases single-core performance by 34.4% and multi-core performance by 43.3%
- It is synergistic with various types of prefetching