Enhancing Signature Path Prefetching with Perceptron Prefetch Filtering
Eshan Bhatia1, Gino Chacon1, Elvira Teran2, Paul V. Gratz1, Daniel A. Jiménez3
1Texas A&M University  2Texas A&M International University  3Texas A&M University / Barcelona Supercomputing Center

Introduction
Design space:
- Standalone L1D, L2C, and LLC prefetchers
- Distribution of the hardware budget across the three prefetchers
- Interaction among the prefetchers
- Control over placing the incoming prefetch line (L1D vs. L2C vs. LLC)

Key Ideas
- Aggressive L2C prefetching
  - Signature Path Prefetcher (SPP) [Kim, MICRO '16]
  - Perceptron-based Prefetch Filtering (PPF) [Bhatia, ISCA '19]
- Optimizing prefetch queue sharing
  - Page-based resource sharing
- Minimal LLC prefetching
  - Lack of information at the LLC
  - LLC is a shared resource among cores
- Coordination between levels
  - Minimizing the impact of noisy prefetches on lower-level prefetchers

Page-Based Resource Sharing
- The Prefetch Queue (PQ) has a limited number of entries: a valuable resource for L1D / L2C
- Aggressive (but still accurate) prefetching takes the current page deep into the speculation path
  - Blocks PQ resources for other pages
  - Timing disparity between multiple pages with interleaved accesses
- Efficient resource utilization:
  - Track the number of distinct pages in the last few memory accesses
  - Divide the PQ resources over those pages
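The sharing policy above can be sketched as follows. This is a minimal illustration, not the submission's actual hardware: the class name, window size, and page-size parameters are all assumptions for the sake of the example.

```python
from collections import deque

class PagePQAllocator:
    """Sketch: divide a fixed prefetch-queue budget evenly over the
    distinct pages seen in the last few memory accesses, so one deeply
    speculated page cannot monopolize the queue."""

    def __init__(self, pq_size=16, window=8):
        self.pq_size = pq_size
        self.recent = deque(maxlen=window)  # sliding window of page IDs

    def record_access(self, addr, page_bits=12):
        # Track which page (assumed 4 KB here) each access touched.
        self.recent.append(addr >> page_bits)

    def budget_per_page(self):
        # Split PQ entries across the distinct pages in the window.
        pages = len(set(self.recent)) or 1
        return self.pq_size // pages
```

With two distinct pages in the recent window, each page gets half of a 16-entry PQ.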

L1D Prefetcher: Next-N-Lines
- Fetches N consecutive lines with respect to the current demand address
- N is determined by PQ resource availability
- Page-level throttling:
  - Tracks the per-page access pattern over the last two accesses
  - Scores each page as +1-delta friendly or averse
  - Throttles prefetching for averse pages
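A simplified sketch of the next-N-lines scheme with page-level throttling follows. The class name, line/page sizes, and the throttle-to-one-line policy are illustrative assumptions; the real design also scales N with PQ availability, which is omitted here.

```python
class NextNLinePrefetcher:
    """Sketch of an L1D next-N-lines prefetcher with page-level
    throttling: a page whose last delta between accesses was +1 line
    is scored 'friendly'; averse pages get a reduced degree."""

    LINE_BITS, PAGE_BITS = 6, 12  # 64 B lines, 4 KB pages (assumed)

    def __init__(self, n=4):
        self.n = n
        self.last_line = {}  # page -> line index of previous access
        self.friendly = {}   # page -> True if last delta was +1

    def access(self, addr):
        page = addr >> self.PAGE_BITS
        line = addr >> self.LINE_BITS
        prev = self.last_line.get(page)
        if prev is not None:
            self.friendly[page] = (line - prev == 1)
        self.last_line[page] = line
        # Throttle averse pages down to a single next-line prefetch.
        degree = self.n if self.friendly.get(page, True) else 1
        return [(line + i) << self.LINE_BITS for i in range(1, degree + 1)]
```

A streaming page keeps the full degree N; a page whose last delta was not +1 collapses to one line.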

L2C Underlying Prefetcher: SPP
- Lookahead prefetcher:
  - Uses the previous prefetch suggestion to trigger new speculation
  - Recursively iterates, compounding the confidence along the path
  - Stops when the confidence falls below a certain threshold
- The threshold (a hyperparameter) indicates aggressiveness:
  - Lower threshold -> more aggressive -> more coverage -> less accuracy
  - A pre-defined trade-off between coverage and accuracy
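The lookahead loop can be sketched as below. This is a heavily simplified stand-in: in real SPP the table is a signature-indexed pattern table with saturating counters, while here `tables` is assumed to map a signature directly to its best delta, that delta's confidence, and the next signature.

```python
def lookahead_prefetch(signature, tables, path_conf=1.0, threshold=0.25):
    """Sketch of SPP-style lookahead: each speculation step multiplies
    the path confidence by the chosen delta's confidence, and
    speculation stops once the product falls below the threshold.
    A lower threshold means deeper, more aggressive speculation."""
    prefetches = []
    while signature in tables:
        delta, conf, next_sig = tables[signature]
        path_conf *= conf              # compound confidence along the path
        if path_conf < threshold:
            break                      # confidence too low: stop speculating
        prefetches.append(delta)
        signature = next_sig           # recurse on the speculated signature
    return prefetches
```

Raising the threshold cuts the speculation path short, trading coverage for accuracy, which is exactly the knob that Enhanced SPP pushes to its aggressive extreme.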

Enhanced SPP
- Decouples coverage and accuracy concerns
- SPP is enhanced to its most aggressive extreme
  - Helps capture complex memory access patterns
  - Increases coverage
- Perceptron-based filtering (PPF) takes care of accuracy

Hashed Perceptron Model
- Uses feature values to index into distinct tables
  - Example features: PC, memory address, etc.
- Prediction: lookup, summation, threshold
  - Each feature value xi indexes into the table holding its weight Wi
- Learning occurs when the ground truth is known:
  - Positive outcome: increment each feature's partial prediction weight
  - Negative outcome: decrement each feature's partial prediction weight
- No multiplication, no division, no complex back-propagation
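The lookup/sum/threshold flow and the +1/-1 update can be written out in a few lines. This is a generic hashed-perceptron sketch under assumed parameters (table size, hash, threshold), not PPF's exact table organization; the 5-bit saturating-counter clamp matches the range described on the PPF architecture slide.

```python
class HashedPerceptron:
    """Minimal hashed-perceptron sketch: each feature hashes into its
    own weight table; the prediction is the weight sum compared against
    a threshold; training adds or subtracts 1 per indexed weight."""

    def __init__(self, num_features, table_size=64, threshold=0):
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.size = table_size
        self.threshold = threshold

    def _index(self, i, feature):
        # Stand-in hash; real designs use cheap XOR/fold hashing.
        return hash((i, feature)) % self.size

    def predict(self, features):
        total = sum(t[self._index(i, f)]
                    for i, (t, f) in enumerate(zip(self.tables, features)))
        return total >= self.threshold

    def train(self, features, positive, wmax=15, wmin=-16):
        # 5-bit up-down saturating counters: clamp to [-16, 15].
        step = 1 if positive else -1
        for i, (t, f) in enumerate(zip(self.tables, features)):
            idx = self._index(i, f)
            t[idx] = max(wmin, min(wmax, t[idx] + step))
```

Note there is no multiply anywhere: inference is a handful of table reads and an adder tree, which is what makes the model hardware-friendly.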

PPF Architecture
- Baseline prefetcher: SPP, modified for high coverage
- Perceptron weight tables:
  - Tables of 5-bit up-down saturating counters
  - One table per feature; variable depth, independent indexing
- Prefetch and Reject Tables:
  - Record prefetches for future training

PPF Design
- Prefetch suggestions are tested using PPF
- The outcome and indexing metadata are recorded in the Prefetch / Reject Table
- On subsequent feedback about a prior prefetch, the same perceptron weights are re-indexed and updated by +1 / -1
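The filtering and feedback flow can be sketched as follows. The scoring function is left abstract (a hashed perceptron would plug in there), and the table structures are simplified dictionaries rather than the bounded set-associative tables of the real design.

```python
class PrefetchFilter:
    """Sketch of the PPF flow: each SPP suggestion is scored; accepted
    prefetches are logged in the Prefetch Table, rejected ones in the
    Reject Table, together with the features used, so that later
    ground-truth feedback can retrain the same weights by +1/-1."""

    def __init__(self, score_fn, threshold=0):
        self.score_fn = score_fn
        self.threshold = threshold
        self.prefetch_table = {}  # addr -> features (issued prefetches)
        self.reject_table = {}    # addr -> features (filtered out)

    def filter(self, addr, features):
        # Test the suggestion and record it for future training.
        accepted = self.score_fn(features) >= self.threshold
        table = self.prefetch_table if accepted else self.reject_table
        table[addr] = features
        return accepted

    def feedback(self, addr, useful, update_fn):
        # On a demand hit (or useless eviction), look up the recorded
        # features and train with the ground-truth outcome.
        for table in (self.prefetch_table, self.reject_table):
            if addr in table:
                update_fn(table.pop(addr), useful)
                return True
        return False
```

Recording rejected suggestions too is what lets PPF learn from false negatives: a rejected address that is later demanded trains the weights upward.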

Putting the Pieces Together
Single-core configuration:
- L1D: enhanced Next-N-Lines
- L2C: PPF with SPP
  - Triggered on all accesses to L2C
  - Can place prefetches in L2C or the LLC
- LLC: next-line prefetcher
  - Triggered on demand accesses and only on the last prefetch from L1D reaching the LLC
  - Uses the metadata communication path between the prefetchers
- Overhead: 49.94 KB
Multi-core configuration:
- L1D: no prefetching
- L2C: PPF with SPP
  - Triggered on all accesses to L2C
  - Can place prefetches in L2C or the LLC
- LLC: SPP (without PPF)
  - Separate tables for each core
  - Modified to be less aggressive than the original SPP (the LLC is a shared resource)
- Overhead: 62.83 KB

Results
Speedup reported over no prefetching:
- Single-core: 40.4%
- Multi-core: 20.3%

Future Work
- Better baseline prefetchers for PPF
- Interaction between the prefetchers
- Metadata communication path between the levels

Thank you!

Backup Slides

L2C Underlying Prefetcher: SPP