TOWARDS BANDWIDTH-EFFICIENT PREFETCHING WITH SLIM AMPM — June 13th, 2015, DPC-2 Workshop, ISCA-42, Portland, OR — Vinson Young, Ajit Krisshna


EXECUTIVE SUMMARY
Prefetching less can be beneficial
–Prefetching consumes additional bandwidth
–Prefetching causes cache pollution
Slim AMPM: a case study in reducing bandwidth and pollution
–3 techniques for efficient bandwidth consumption
–1 technique for reducing cache pollution
Method: co-design the best prefetchers with bandwidth in mind
Result: Slim AMPM achieves nearly 4% speedup over AMPM, across multiple configurations

OUTLINE
Introduction to bandwidth and cache pollution
3 Bandwidth Optimizations
1 Cache Pollution Optimization
Slim AMPM
Results
Summary

PREFETCH USES ON-CHIP BANDWIDTH
Prefetching causes MSHR contention with reads
Prefetches can fill the MSHR, stalling the pipeline
[Diagram: an L2 MSHR occupied mostly by prefetches (P) alongside demands (D); a new demand cannot continue if the MSHR is full]
Prefetching should take bandwidth into account

OPTIMIZE FOR BANDWIDTH
1. Co-design of regular and irregular prefetching
–Optimize AMPM + DCPT for bandwidth
2. Bandwidth throttling / smoothing
–Optimize the number of prefetches sent
3. Dynamic L2 or L3 prefetch
–Optimize whether to prefetch into L2 or L3
Frugal prefetching reduces bandwidth overhead

HYBRID PREFETCHER FOR COVERAGE
Idea:
–Query more prefetchers for higher coverage
Method: combine a regular and an irregular prefetcher
–Regular pattern prefetcher (AMPM)
–Irregular pattern prefetcher (DCPT)
Benefit:
–High coverage across a wide set of workloads

PREFETCHING REGULAR PATTERNS
Regular access patterns
–Stream, stride, Address Map Pattern Matching
Address Map Pattern Matching (AMPM)
–Per-page “access map”
–Finds streams based on previously accessed lines
–Tracks hot pages
[Diagram: access map for Page 0, with A = lines previously accessed, Cur = the line just accessed, P = the line prefetched]
Use AMPM to prefetch regular accesses
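For concreteness, here is a minimal C++ sketch of the matching rule behind AMPM: for each candidate stride d, if the lines at cur-d and cur-2d were both accessed, the line at cur+d becomes a prefetch candidate. The map size, the names, and the positive-stride-only search are illustrative assumptions, not the authors' implementation.

    #include <bitset>
    #include <cstdint>
    #include <vector>

    // Illustrative AMPM-style state: one "access map" bit per cache line of
    // a tracked page (64 lines assumed, e.g. a 4KB page of 64B lines).
    constexpr int kLinesPerPage = 64;

    struct AccessMap {
        uint64_t page_id = 0;
        std::bitset<kLinesPerPage> accessed;  // lines touched so far
    };

    // On an access to line `cur`, test each candidate stride d: if lines
    // cur-d and cur-2d were both accessed, the +d pattern is live, so line
    // cur+d becomes a prefetch candidate. A full implementation would also
    // test negative strides symmetrically.
    std::vector<int> ampm_candidates(const AccessMap& map, int cur, int max_stride) {
        std::vector<int> out;
        for (int d = 1; d <= max_stride; ++d) {
            int next = cur + d;
            if (cur - 2 * d >= 0 && next < kLinesPerPage &&
                map.accessed[cur - d] && map.accessed[cur - 2 * d] &&
                !map.accessed[next]) {
                out.push_back(next);
            }
        }
        return out;
    }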

PREFETCHING IRREGULAR PATTERNS
PC-based prefetching
–Irregular Stream Buffer (prefetches temporal patterns)
–PC / delta-correlating (prefetches PC-based patterns)
–Delta-Correlating Prediction Tables (DCPT)
DCPT entry: PC | delta | delta | delta | … | last access
DCPT example
–Address: …
–Deltas: …
–Prefetch: 3140 …
Use DCPT to prefetch irregular access patterns
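Since the slide's worked example did not survive transcription, here is a hedged C++ sketch of DCPT-style delta correlation (after Grannæs et al.): keep a short per-PC history of address deltas, search it for an earlier occurrence of the two most recent deltas, and replay the deltas that followed the match. The entry layout and 9-delta depth mirror the slides; the matching details and the dcpt_predict name are assumptions.

    #include <cstdint>
    #include <deque>
    #include <vector>

    // Illustrative DCPT entry: PC | delta | ... | delta | last access.
    struct DcptEntry {
        uint64_t pc = 0;
        std::deque<int64_t> deltas;  // most recent delta at the back
        uint64_t last_line = 0;      // last line address seen by this PC
        static constexpr size_t kMaxDeltas = 9;
    };

    // Record the new delta, then look for an earlier occurrence of the two
    // most recent deltas; if found, replay the deltas that followed the
    // match to generate prefetch candidates.
    std::vector<uint64_t> dcpt_predict(DcptEntry& e, uint64_t line) {
        std::vector<uint64_t> prefetches;
        e.deltas.push_back((int64_t)line - (int64_t)e.last_line);
        e.last_line = line;
        if (e.deltas.size() > DcptEntry::kMaxDeltas) e.deltas.pop_front();
        size_t n = e.deltas.size();
        if (n < 4) return prefetches;         // need a pair plus some history
        int64_t d1 = e.deltas[n - 2], d2 = e.deltas[n - 1];
        for (size_t i = 0; i + 3 < n; ++i) {  // strictly earlier matches only
            if (e.deltas[i] == d1 && e.deltas[i + 1] == d2) {
                uint64_t addr = line;
                for (size_t j = i + 2; j < n; ++j) {  // replay what followed
                    addr += e.deltas[j];
                    prefetches.push_back(addr);
                }
                break;
            }
        }
        return prefetches;
    }

For a hypothetical delta history of 1, 2, 1, 2, the most recent pair (1, 2) matches at the start of the history, so the sketch replays +1 and +2 and prefetches the current line +1 and +3.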

AMPM + DCPT WASTES BANDWIDTH
AMPM covers regular patterns; DCPT covers irregular patterns
But using both simultaneously wastes bandwidth
Must combine AMPM + DCPT in a bandwidth-efficient manner

1. CO-DESIGN AMPM + DCPT
Switch between AMPM and DCPT rather than running both
[Flowchart: start → update prefetch parameters → “Can DCPT prefetch?” → yes: DCPT issues the prefetch; no: AMPM issues the prefetch]
Query DCPT first, then AMPM, to reduce AMPM over-prefetching
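A minimal sketch of this arbitration, with the two predictors reduced to stub declarations whose simplified signatures (both returning full line addresses) are assumptions standing in for the earlier sketches:

    #include <cstdint>
    #include <vector>

    // Stubs standing in for the two predictors sketched above.
    std::vector<uint64_t> dcpt_predict(uint64_t pc, uint64_t line);  // irregular
    std::vector<uint64_t> ampm_predict(uint64_t line);               // regular

    // The co-design flow: update state, ask "can DCPT prefetch?", and let
    // AMPM issue only when DCPT has nothing, so the regular prefetcher does
    // not over-prefetch on streams the irregular predictor already covers.
    std::vector<uint64_t> hybrid_predict(uint64_t pc, uint64_t line) {
        std::vector<uint64_t> p = dcpt_predict(pc, line);  // yes: DCPT issues
        if (!p.empty()) return p;
        return ampm_predict(line);                         // no: AMPM issues
    }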

HYBRID AMPM + DCPT PERFORMANCE
Hybrid AMPM + DCPT improves performance by 0.5%

OPTIMIZE FOR BANDWIDTH
1. Co-design of regular and irregular prefetching
–Optimize AMPM + DCPT for bandwidth
2. Bandwidth throttling / smoothing
–Optimize the number of prefetches sent
3. Dynamic L2 or L3 prefetch
–Optimize whether to prefetch into L2 or L3

WHY BANDWIDTH THROTTLE / SMOOTH
Idea: limit the number of prefetches to reduce bandwidth consumption
Reducing prefetches reduces L2 MSHR stalls
[Diagram: an L2 MSHR with fewer prefetches (P) leaves entries free for demands (D), so a new demand does not stall]
Reduce bandwidth overhead by limiting prefetches

2. BANDWIDTH THROTTLE / SMOOTH
AMPM:
–Reduce candidate strides to 1, 2, 3, 4
–Reduce max AMPM prefetches to 2
Benefits:
–Reduces bandwidth consumption
–Smooths bursty bandwidth
Slim down AMPM to reduce bandwidth consumption
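A sketch of how these caps might be applied: the stride and prefetch limits come from the slide, while the MSHR-headroom check and its threshold are illustrative assumptions motivated by the earlier MSHR-contention slide.

    #include <cstdint>
    #include <vector>

    constexpr int kMaxStride = 4;      // candidate strides limited to 1..4,
                                       // e.g. ampm_candidates(map, cur, kMaxStride)
    constexpr int kMaxPrefetches = 2;  // max AMPM prefetches per access

    // Cap the prefetches issued for one triggering access: dropping the tail
    // reduces total bandwidth and smooths bursts. Skipping issue when the
    // MSHR is nearly full keeps headroom for demand misses (the threshold
    // of 2 free entries is an assumption).
    std::vector<uint64_t> throttle(std::vector<uint64_t> candidates,
                                   int mshr_free_entries) {
        if (mshr_free_entries < 2) return {};
        if (candidates.size() > static_cast<size_t>(kMaxPrefetches))
            candidates.resize(kMaxPrefetches);
        return candidates;
    }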

BANDWIDTH SMOOTHING RESULTS
Smoothing bandwidth gives a 1.1% speedup

OPTIMIZE FOR BANDWIDTH
1. Co-design of regular and irregular prefetching
–Optimize AMPM + DCPT for bandwidth
2. Bandwidth throttling / smoothing
–Optimize the number of prefetches sent
3. Dynamic L2 or L3 prefetch
–Optimize whether to prefetch into L2 or L3

PREFETCH TO L3 CAN REDUCE L2 LOAD
Idea: reduce load on the L2 by prefetching more into the L3
Offload prefetches to the L3 when the L2 is in use
[Diagram: moving prefetches (P) from the L2 MSHR to the L3 MSHR frees L2 MSHR entries for demands (D), so demands no longer stall]
Prefetch to L3 to reduce L2 MSHR stalls

3. L3 PREFETCH IMPLEMENTATION
AMPM:
–Prefetch into L3 by default
–Prefetch into L2 only when the L2 is not in use, i.e. when the L2 has high MPKI and a low hit rate
Benefits:
–Reduced L2 load
–Opportunistically uses the L2 when it is free
Prefetch into L3, and into L2 when the L2 is not in use
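A minimal sketch of the fill-level decision: the slide gives the policy ("high MPKI, low hit rate") but not the cutoffs, so both thresholds below are assumptions.

    enum class FillLevel { L2, L3 };

    // Prefetch into the L3 by default; promote into the L2 only when the L2
    // is "not in use" for demand data, which the slide identifies as high
    // MPKI plus a low hit rate. Both thresholds are illustrative.
    FillLevel choose_fill_level(double l2_mpki, double l2_hit_rate) {
        constexpr double kHighMpki = 20.0;    // assumed cutoff
        constexpr double kLowHitRate = 0.10;  // assumed cutoff
        bool l2_not_in_use = l2_mpki > kHighMpki && l2_hit_rate < kLowHitRate;
        return l2_not_in_use ? FillLevel::L2 : FillLevel::L3;
    }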

L3 PREFETCH RESULTS
L2/L3 prefetching improves performance by 0.5%

OUTLINE
Introduction to bandwidth and cache pollution
3 Bandwidth Optimizations
1 Cache Pollution Optimization
Slim AMPM
Results
Summary

PREFETCH CAUSES CACHE POLLUTION
Prefetching into a cache evicts a previous entry
Reduce pollution with:
–fewer wasted prefetches
–more accurate access maps
[Example: Demand 0x00, Demand 0x01, Prefetch 0x10, Demand 0x00 — the prefetch can evict a still-useful entry]
Prefetchers should take cache pollution into account

POLLUTION VIA STALE ACCESS MAP
[Diagram: two failure modes of a stale access map. Over-prefetch: a previously accessed line is no longer in the cache, so a bad prefetch is sent (wasted prefetch). Under-prefetch: stale accesses hide the live pattern, so a good prefetch is not sent.]
Prefetcher is less effective with stale maps

STALE MAPS REDUCE PERFORMANCE
More access maps do not always improve prefetching
Tracking too many pages can cause maps to go stale
AMPM “access maps” should be adjusted

REFRESH “ACCESS MAPS” → LOWER POLLUTION
Idea: refresh access maps periodically and dynamically
Benefits:
–“Access maps” stay up-to-date for informed prefetches
–“Access maps” refresh dynamically to fit the workload
Dynamically refresh to reduce stale “access maps”

REFRESH IMPLEMENTATION
AMPM:
–“Access maps” use random eviction 1% of the time
–Dynamically reduce the number of “access maps” for workloads that miss in the “access maps” often
Benefits:
–Periodic refresh
–Fewer “access maps” → quicker adaptation to the workload
Dynamically refresh “access maps”
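A sketch of both refresh mechanisms under stated assumptions: the 1%-random / 99%-LRU victim selection matches the slides, while the table shape, the RNG, and the shrink trigger are illustrative.

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Illustrative table of AMPM access maps, kept in LRU order
    // (index 0 = least recently used).
    struct MapTable {
        std::vector<uint64_t> page_ids;

        // Replacement: LRU 99% of the time, a random map ~1% of the time,
        // so stale maps get flushed even if they are never "least used".
        size_t pick_victim() const {
            if (!page_ids.empty() && std::rand() % 100 == 0)
                return std::rand() % page_ids.size();  // periodic refresh
            return 0;                                  // normal LRU victim
        }

        // Shrink the table for workloads that miss in the access maps often,
        // so the remaining maps track the current working set more tightly
        // (the trigger and sizes are assumptions).
        void maybe_shrink(double map_miss_rate) {
            if (map_miss_rate > 0.5 && page_ids.size() > 64)
                page_ids.resize(page_ids.size() / 2);
        }
    };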

“ACCESS MAP” REFRESH PERFORMANCE
“Access map” refresh improves performance by 0.9%

OUTLINE
Introduction to bandwidth and cache pollution
3 Bandwidth Optimizations
1 Cache Pollution Optimization
Slim AMPM
Results
Summary

SLIM AMPM
Bandwidth optimizations:
–Hybrid AMPM + DCPT
–Bandwidth throttle / smoothing
–Dynamic L2 or L3 prefetch
Cache pollution optimization:
–“Access map” refresh
Bandwidth-efficient Slim AMPM for improved performance

PARAMETERS
Slim AMPM parameters tuned for bandwidth:

Slim AMPM | Parameter          | Configuration
AMPM      | # pages            | 512, dynamically decreased
AMPM      | Replacement        | LRU 99%, random 1%
AMPM      | Candidate prefetch | -4 to 4
AMPM      | Max prefetches     | 2, or 1 for low BW
AMPM      | Prefetch to L2/LLC | Dynamic, based on L2 hit %
DCPT      | Entries            | 200
DCPT      | Number of deltas   | 9
DCPT      | Prefetch to L2/LLC | Dynamic, based on L2
DCPT      | Max prefetches     | 4, or 3 for low BW

Read the paper!

OUTLINE
Introduction to bandwidth and cache pollution
3 Bandwidth Optimizations
1 Cache Pollution Optimization
Slim AMPM
Results
Summary

TESTING FRAMEWORK
DPC-2 setup: L1 (16KB) / L2 (128KB) / L3 (1MB) / main memory
Benchmarks: 8 high-MPKI workloads from SPEC
Verified on multiple configurations and benchmarks:

Configuration | L3 size | Memory BW
Base          | 1MB     | 12.8 GB/s
Small LLC     | 256KB   | 12.8 GB/s
Low BW        | 1MB     | 3.2 GB/s
Random        | 1MB     | 12.8 GB/s

% SPEEDUP OVER AMPM
Slim AMPM achieves a 3.3% speedup over AMPM

% SPEEDUP OVER NO PREFETCH
Slim AMPM achieves a 19.3% speedup over no prefetching
[Chart comparing AMPM and Slim AMPM]

OUTLINE
Introduction to bandwidth and cache pollution
3 Bandwidth Optimizations
1 Cache Pollution Optimization
Slim AMPM
Results
Summary

EXECUTIVE SUMMARY
Prefetching less can be beneficial
–Prefetching consumes additional bandwidth
–Prefetching causes cache pollution
Slim AMPM: a case study in reducing bandwidth and pollution
–3 techniques for efficient bandwidth consumption
–1 technique for reducing cache pollution
Method: co-design the best prefetchers with bandwidth in mind
Result: Slim AMPM achieves nearly 4% speedup over AMPM, across multiple configurations

THANK YOU