1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.

Presentation transcript:

1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University

2 Motivation
- Can't always expect high prefetch accuracy and timeliness
- Potential can be lost when these are low
- Adaptive schemes adjust aggressiveness based on effectiveness
- Adaptation and selectiveness are as important as address prediction

3 Our Scheme – Hybrid Adaptive Prefetcher (HAP)
- Start with good address prediction – a stride/sequential hybrid
  - The sequential prefetching scheme requires no warmup
  - The stride prefetcher is more robust
- Issue prefetches selectively
- Incorporate a published adaptive prefetch method: Feedback Directed Prefetching (Srinath et al., HPCA 2007)
- Improve it with bandwidth adaptation

4 Related Work – Feedback Directed Prefetching (HPCA 2007)
- Prefetcher aggressiveness defined by prefetch distance and degree
- Aggressiveness adjusted dynamically based on three feedback metrics:
  - Percentage of useful prefetches
  - Percentage of late prefetches
  - Percentage of prefetches that cause demand misses (cache pollution)

5 Differences between FDP and our scheme
- Use both L1 and L2 prefetching; the scheme is modified to support L1/L2
- Use a hybrid stride/sequential prefetching scheme
- A bandwidth-based feedback metric is proposed
- No cache pollution metric

6 Stride/Sequential Prefetching Scheme – Training the Stride Prefetcher
- Use a PC-indexed stride prediction scheme; each Stride Prediction Table entry holds the last address, the stride, and a count
  1. Compute the new stride from the stored last address and the current address
  2. Store the computed stride
  3. Increment the count if the stride is unchanged; reset it otherwise
- An entry is trained once its count is above a threshold value
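
Below is a minimal C++ sketch of the training step just described, modeling the table in software; the map-based table, field names, and the kTrainThreshold value are illustrative assumptions, not figures from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// One Stride Prediction Table entry: last address seen, last computed
// stride, and a confidence count of consecutive unchanged strides.
// (A real entry would also carry a valid bit; omitted for brevity.)
struct StrideEntry {
    uint64_t lastAddr = 0;
    int64_t  stride   = 0;
    int      count    = 0;
};

constexpr int kTrainThreshold = 2;  // assumption: trained after 2 repeats

// Update the PC-indexed table on a demand access (steps 1-3 above).
void trainStride(std::unordered_map<uint64_t, StrideEntry>& table,
                 uint64_t pc, uint64_t addr) {
    StrideEntry& e = table[pc];
    int64_t newStride =
        static_cast<int64_t>(addr - e.lastAddr);   // 1. compute new stride
    if (newStride == e.stride) {
        ++e.count;                                 // 3. unchanged: bump count
    } else {
        e.stride = newStride;                      // 2. store computed stride
        e.count  = 0;                              //    changed: reset count
    }
    e.lastAddr = addr;
}

// An entry is "trained" once its count passes the threshold.
bool isTrained(const StrideEntry& e) { return e.count >= kTrainThreshold; }
```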

7 Stride/Sequential Prefetching Scheme – Issuing Prefetches
- Check the stride table on a demand miss or on a hit to a prefetched line
  - Issue stride prefetches based on degree and distance
- Fall back to sequential prefetches if there is no valid/trained stride entry and the previous line is present in the cache
  - Issue sequential prefetches based on degree
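
Continuing the sketch, the hybrid issue decision might look as follows; issuePrefetch(), cacheHas(), and kLineSize are hypothetical hooks into a cache model, and StrideEntry/isTrained come from the training sketch above.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kLineSize = 64;     // assumption: 64 B cache lines

// Hypothetical hooks into the cache model (declarations only).
void issuePrefetch(uint64_t lineAddr); // enqueue a prefetch request
bool cacheHas(uint64_t lineAddr);      // is this line resident?

// Trigger: a demand miss or a hit to a prefetched line.
void onTriggerAccess(std::unordered_map<uint64_t, StrideEntry>& table,
                     uint64_t pc, uint64_t addr,
                     int degree, int distance) {
    auto it = table.find(pc);
    if (it != table.end() && isTrained(it->second) && it->second.stride != 0) {
        // Trained stride entry: 'degree' prefetches starting 'distance' ahead.
        for (int i = 0; i < degree; ++i)
            issuePrefetch(addr + (distance + i) * it->second.stride);
    } else if (cacheHas(addr - kLineSize)) {
        // No valid/trained stride entry, but the previous line is resident:
        // fall back to sequential next-line prefetching, 'degree' deep.
        for (int i = 1; i <= degree; ++i)
            issuePrefetch(addr + i * kLineSize);
    }
}
```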

8 Adjusting Aggressiveness with Feedback Metrics
- Prefetch Accuracy – percentage of prefetches used by a demand request
- Prefetch Lateness – percentage of accurate prefetches that are late
- Bandwidth Contention – percentage of clock cycles during which cache bandwidth is above a threshold
- Evaluate each metric separately for L1 and L2
- Evaluate periodically, after a fixed number of cycles; adjust aggressiveness if justified
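
These three metrics reduce to ratios of simple interval counters. A hedged sketch, with illustrative names, of the counters kept per cache level:

```cpp
#include <cstdint>

// Interval counters, maintained separately for L1 and L2.
struct FeedbackCounters {
    uint64_t issued = 0;       // prefetches issued this interval
    uint64_t used = 0;         // prefetches consumed by a demand request
    uint64_t late = 0;         // useful prefetches that arrived too late
    uint64_t busyCycles = 0;   // cycles with bandwidth demand above threshold
    uint64_t totalCycles = 0;  // cycles in the interval
};

// Accuracy: fraction of issued prefetches used by a demand request.
double accuracy(const FeedbackCounters& c)
{ return c.issued ? double(c.used) / double(c.issued) : 0.0; }

// Lateness: fraction of *useful* prefetches that were late.
double lateness(const FeedbackCounters& c)
{ return c.used ? double(c.late) / double(c.used) : 0.0; }

// Contention: fraction of cycles the bandwidth threshold was exceeded.
double contention(const FeedbackCounters& c)
{ return c.totalCycles ? double(c.busyCycles) / double(c.totalCycles) : 0.0; }
```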

9 Storage-Efficient Miss Status Holding Registers (MSHRs)
- Used to track all in-flight/in-queue memory requests at both cache levels
- MSHR entry:
  1. An entry is allocated for each outstanding L1 and/or L2 request, with its valid bit set
  2. A two-bit cache-level field indicates L1 only, L2 only, or combined L1/L2
  3. Two prefetch bits indicate prefetch requests
  4. Concurrent L1 and L2 requests to the same line share the same MSHR entry
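
A hedged sketch of such an entry as a packed C++ struct; the layout follows the four annotations above, while the exact field widths and encoding are assumptions.

```cpp
#include <cstdint>

// Cache-level encoding for field (2) below.
enum CacheLevel : uint64_t { L1Only = 0, L2Only = 1, CombinedL1L2 = 2 };

struct MshrEntry {
    uint64_t lineAddr   : 58;  // block address of the outstanding request
    uint64_t valid      : 1;   // (1) set while an L1 and/or L2 request is outstanding
    uint64_t cacheLevel : 2;   // (2) L1 only, L2 only, or combined L1/L2
    uint64_t l1Prefetch : 1;   // (3) the L1 request was a prefetch
    uint64_t l2Prefetch : 1;   //     the L2 request was a prefetch
    // (4) Concurrent L1 and L2 requests to the same line share one entry:
    //     cacheLevel is widened to CombinedL1L2 instead of allocating twice.
};
```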

10 Implementing Feedback Metrics
- Prefetch Accuracy
  - A prefetch bit is set for each prefetched line brought into the cache
  - A bit is set in the MSHR for in-flight/in-queue prefetched lines
  - Increment the accurate count if a demand request finds a set bit
  - Reset the bit after incrementing
  - Accuracy is based on the percentage of total prefetches issued
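
A hedged sketch of this bookkeeping, reusing MshrEntry and FeedbackCounters from the sketches above; LineMeta and the calling convention are illustrative.

```cpp
// Per-line metadata in the cache: one prefetch bit, set when a prefetched
// line is installed, cleared on first demand use so it is counted once.
struct LineMeta { bool prefetched = false; };

// Called on every demand access; 'line' is the cache hit (or nullptr),
// 'mshr' the matching MSHR entry (or nullptr). c.issued is incremented
// elsewhere, whenever a prefetch is actually sent.
void countAccurate(LineMeta* line, MshrEntry* mshr, FeedbackCounters& c) {
    if (line && line->prefetched) {
        ++c.used;                  // demand hit on a prefetched line
        line->prefetched = false;  // reset bit after incrementing
    } else if (mshr && mshr->valid && (mshr->l1Prefetch || mshr->l2Prefetch)) {
        ++c.used;                  // demand found an in-flight prefetch
        // (the MSHR bit is cleared by the lateness check, next slide)
    }
    // Interval end: accuracy = used / issued.
}
```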

11 Implementing Feedback Metrics
- Prefetch Lateness
  - Prefetch bit(s) set in the MSHR for a prefetched in-flight/in-queue line
  - On a demand miss, a late prefetch is detected if a valid MSHR entry exists for the miss and the prefetch bit for the corresponding cache level is set
  - Reset the bit after incrementing the late count
  - Lateness is based on the percentage of useful prefetches
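
A corresponding sketch of the late-prefetch check, again reusing MshrEntry and FeedbackCounters; the per-level prefetch-bit selection is the essential point.

```cpp
// Called on a demand miss at a given cache level. A valid MSHR entry whose
// prefetch bit for that level is still set means the prefetch was issued
// but has not completed: it is useful, yet late.
void countLate(MshrEntry* mshr, bool atL1, FeedbackCounters& c) {
    if (!mshr || !mshr->valid) return;   // ordinary miss, no in-flight request
    if (atL1 ? mshr->l1Prefetch : mshr->l2Prefetch) {
        ++c.late;                        // late prefetch detected
        if (atL1) mshr->l1Prefetch = 0;  // reset bit after incrementing
        else      mshr->l2Prefetch = 0;
    }
    // Interval end: lateness = late / used.
}
```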

12 Implementing Feedback Metrics
- Bandwidth Contention (1) – measurement
  - Use the MSHR to monitor the total outstanding L1 and L2 requests every cycle
  - Increment a counter for every cycle in which the total is above a threshold
  - The contention rate is based on the percentage of total cycles
- Bandwidth Contention (2) – throttling
  - Prefetches are not issued while outstanding requests are above a threshold
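
A hedged sketch of both mechanisms, reusing FeedbackCounters; both threshold values are assumed for illustration, not taken from the paper.

```cpp
constexpr int kBusyThreshold  = 12;  // assumed: occupancy that marks a cycle "busy"
constexpr int kIssueThreshold = 14;  // assumed: occupancy that gates new prefetches

// Mechanism (1): sample MSHR occupancy each cycle for the contention metric.
void sampleCycle(int outstandingReqs, FeedbackCounters& c) {
    ++c.totalCycles;
    if (outstandingReqs > kBusyThreshold)
        ++c.busyCycles;    // contention = busyCycles / totalCycles
}

// Mechanism (2): suppress new prefetches while the MSHR is saturated.
bool mayIssuePrefetch(int outstandingReqs) {
    return outstandingReqs <= kIssueThreshold;
}
```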

13 Adjusting Aggressiveness
- Evaluate the metrics at fixed intervals
- Determine whether each metric is high or low based on a threshold
- Aggressiveness may be adjusted according to the criteria in the slide's "Aggressiveness Policy" table
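
Since the "Aggressiveness Policy" table itself is not reproduced in the transcript, the following is only an illustrative sketch in the spirit of FDP-style throttling; the thresholds and step rules are assumptions.

```cpp
// Five illustrative levels; each maps to a (distance, degree) pair elsewhere.
enum Aggressiveness {
    VeryConservative, Conservative, Middle, Aggressive, VeryAggressive
};

// Interval-end decision. Thresholds and rules are assumed, not the paper's
// actual "Aggressiveness Policy" table.
Aggressiveness adjust(Aggressiveness a,
                      double acc, double late, double bw) {
    const double kAccHigh = 0.75, kLateHigh = 0.10, kBwHigh = 0.50;  // assumed
    bool stepUp   = acc >= kAccHigh && late >= kLateHigh && bw < kBwHigh;
    bool stepDown = acc <  kAccHigh || bw >= kBwHigh;
    if (stepUp && a != VeryAggressive)              // accurate but late, with
        return static_cast<Aggressiveness>(a + 1);  // bandwidth to spare: step up
    if (stepDown && a != VeryConservative)          // inaccurate or bandwidth-
        return static_cast<Aggressiveness>(a - 1);  // bound: step down
    return a;                                       // otherwise hold steady
}
```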

14 Prefetcher Aggressiveness Levels
- Aggressiveness is adjusted in increments of one level
- The levels range from Very Conservative, through Middle Aggressiveness, to Very Aggressive

15 Experimental Evaluation – Setup
- Evaluate 15 SPEC CPU 2006 benchmarks using the CMPSim simulator
- Evaluate three competition configurations:
  - Config 1 – 2048 KB L2 cache, unlimited bandwidth
  - Config 2 – 2048 KB L2 cache, limited bandwidth
  - Config 3 – 512 KB L2 cache, limited bandwidth
- The limited-bandwidth configs allow one L1 issue per cycle and one L2 issue per 10 cycles

16 Experimental Evaluation – Setup
- Compare our scheme, the Hybrid Adaptive Prefetcher (HAP), to four configurations:
  - No prefetching
  - Middle-aggressiveness stride
  - Very aggressive stride
  - Modified Feedback Directed Prefetcher
    - Uses both L1 and L2 prefetching
    - Does not use a cache pollution metric

17 Results – Expectations
- Very aggressive stride will do better on some benchmarks and worse on others
- Adaptive schemes will perform at least as well as non-adaptive ones
- Unlimited-bandwidth and large-cache configurations benefit aggressive schemes

18 Results – Bandwidth Unlimited, 2 MB L2 Config
- HAP outperforms the other prefetchers on all benchmarks except lbm
- The performance benefit is 11% on average compared to mid-aggressive stride, and 46% versus no prefetching

19 Results – Bandwidth Limited, 2 MB L2 Config
- HAP is best on average; aggressive stride performs best on three benchmarks (mcf, lbm, and soplex)
- The performance benefit is 9% on average compared to mid-aggressive stride, and 45% versus no prefetching

20 Results – Bandwidth Limited, 512 KB L2 Config
- Results are similar to Config 2
- The performance benefit is 8% on average compared to mid-aggressive stride, and 44% versus no prefetching

21 Results (All Benchmarks) – Bandwidth Limited, 2 MB L2 Config
- The additional benchmarks are mostly unaffected by prefetching
- Across all benchmarks, the performance benefit is 6% on average compared to mid-aggressive stride, and 29% versus no prefetching

22 Conclusions
- A well-designed, adaptive prefetching scheme is very effective
- Very aggressive stride works best for some benchmarks
- A cache pollution metric may improve results

23 THANK YOU QUESTIONS?