Slide 1: MadCache: A PC-aware Cache Insertion Policy
Andrew Nere, Mitch Hayenga, and Mikko Lipasti
PHARM Research Group, University of Wisconsin – Madison
June 20, 2010

Slide 2: Executive Summary
Problem: Changing hardware and workloads encourage investigation of cache replacement/insertion policy designs
Proposal: MadCache uses PC history to choose the cache insertion policy
– Last-level cache granularity
– Individual PC granularity
Performance improvements over LRU
– 2.5% IPC improvement (single-threaded)
– 4.5% weighted speedup and 6% throughput improvement (multithreaded)

Slide 3: Motivation
Importance of investigating cache insertion policies
– Direct effect on performance
– LRU has dominated hardware designs for many years
– Changing workloads and levels of caches
Shared last-level cache
– Cache behavior now depends on multiple running applications
– One streaming thread can ruin the cache for everyone

Slide 4: Previous Work
Dynamic insertion policies
– DIP – Qureshi et al. – ISCA '07
  Dueling sets select the best of multiple policies
  Bimodal Insertion Policy (BIP) offers thrash protection (see the sketch below)
– TADIP – Jaleel et al. – PACT '08
  Awareness of other threads' workloads
Utilizing program counter information
– PCs exhibit a useful amount of predictable behavior
– Dead-block prediction and prefetching – ISCA '01
– PC-based load miss prediction – MICRO '95
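Since DIP underpins MadCache's default-policy selection, a minimal sketch of BIP's insertion decision may help. The 1/32 probability and all names here are illustrative assumptions, not taken from the slides:

```c
/* Sketch of the Bimodal Insertion Policy (BIP) from DIP: most incoming
 * lines are inserted at the LRU position, and only a small fraction
 * (assumed 1/32 here) at MRU. Streaming fills therefore age out quickly
 * instead of thrashing the whole cache. */
#include <stdlib.h>

#define BIP_EPSILON 32  /* insert at MRU roughly 1 in 32 fills (assumed) */

typedef enum { INSERT_LRU_POS, INSERT_MRU_POS } insert_pos_t;

insert_pos_t bip_insert_position(void) {
    /* rand() stands in for whatever pseudo-random source hardware uses */
    return (rand() % BIP_EPSILON == 0) ? INSERT_MRU_POS : INSERT_LRU_POS;
}
```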

Slide 5: MadCache Proposal
Problem: With changing hardware and workloads, caches are subject to suboptimal insertion policies
Solution: Use PC information to create a better policy
– Adaptive default cache insertion policy
– Track PCs to determine the policy at a finer grain than DIP
– Filter out streaming PCs
Introducing MadCache!

Slide 6: MadCache Design
Tracker sets
– Sample the behavior of the cache
– Enter the PCs into the PC-Predictor table
– Determine the default policy of the cache
  Uses set dueling (Qureshi et al. – ISCA '07; sketched below)
  LRU vs. Bypassing Bimodal Insertion Policy (BBIP)
Follower sets
– Majority of the last-level cache
– Typically follow the default policy
– Can override the default cache policy (via the PC-Predictor table)
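A minimal sketch of the set-dueling selector implied here, assuming a 10-bit saturating PSEL counter; the slides only say set dueling picks between LRU and BBIP, so the width and names are assumptions:

```c
/* Set dueling: misses in the LRU tracker sets vote for BBIP, misses in
 * the BBIP tracker sets vote for LRU, and the follower sets adopt
 * whichever policy the saturating counter currently favors. */
#include <stdbool.h>
#include <stdint.h>

#define PSEL_MAX 1023  /* 10-bit saturating counter; width is an assumption */

static uint16_t psel = PSEL_MAX / 2;  /* start undecided */

/* Called on a miss to a tracker (dueling) set. */
void duel_on_tracker_miss(bool in_lru_tracker_set) {
    if (in_lru_tracker_set) {
        if (psel < PSEL_MAX) psel++;   /* LRU missed: a vote for BBIP */
    } else {
        if (psel > 0) psel--;          /* BBIP missed: a vote for LRU */
    }
}

/* Follower sets consult the winner as the cache's default policy. */
bool default_policy_is_bbip(void) {
    return psel > PSEL_MAX / 2;
}
```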

Slide 7: Tracker and Follower Sets
[Figure: the last-level cache split into BBIP tracker sets, LRU tracker sets, and follower sets; each tracker line carries a reuse bit and an index into the PC-Predictor table]
Tracker set overhead (see the struct sketch below)
– 1 bit to indicate whether the line was accessed again
– 10/11 bits to index the PC-Predictor table
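The per-line tracker overhead listed above maps naturally onto a small metadata record; the field names in this sketch are assumptions:

```c
/* Per-line metadata held only in the tracker sets: one reuse bit plus a
 * pointer back into the PC-Predictor table, so an eviction can credit or
 * blame the PC that inserted the line. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     reused;        /* 1 bit: set on any re-access after insertion */
    uint16_t pcpred_index;  /* 10/11 bits: PC-Predictor entry of inserter */
} tracker_line_meta_t;
```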

Slide 8: MadCache Design
PC-Predictor table
– Stores PCs that have accessed the tracker sets
– Tracks behavior history using a counter
  Decrement if an address is reused many times in the LLC
  Increment if a line is evicted without ever being reused
– Per-PC default policy override
  LRU (default) plus BBIP override
  BBIP (default) plus LRU override

Slide 9: PC-Predictor Table
[Figure: PC-Predictor table entries tagged with policy + PC MSBs, each holding a 6-bit counter; the lookup path runs from the missing PC to a hit/miss decision against a 9-bit index]
– In parallel with a cache miss, the PC plus the current policy index the PC-Predictor
– On a table hit, follow that PC's override policy
– On a table miss, follow the global default policy (see the sketch below)
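Putting the last two slides together, here is a hedged sketch of the PC-Predictor's training and miss-path lookup. The table size, hash, override threshold, and the omission of the policy + PC(MSB) tag check are all simplifying assumptions:

```c
/* PC-Predictor sketch: tracker sets train a per-PC saturating counter
 * (decrement on LLC reuse, increment on eviction without reuse); on a
 * miss, PC + current default policy index the table, and a hit follows
 * the per-PC override while a miss follows the global default. */
#include <stdbool.h>
#include <stdint.h>

#define PCPRED_ENTRIES  512  /* assumed from the figure's 9-bit index */
#define CTR_MAX         63   /* 6-bit saturating counter */
#define OVERRIDE_THRESH 32   /* assumed midpoint for the override decision */

typedef enum { POLICY_LRU, POLICY_BBIP } policy_t;

typedef struct {
    bool    valid;
    uint8_t counter;  /* high value: this PC's lines are rarely reused */
} pc_entry_t;         /* a real entry is also tagged with policy + PC MSBs */

static pc_entry_t pcpred[PCPRED_ENTRIES];

static unsigned pcpred_index(uint64_t pc, policy_t pol) {
    /* assumed hash: fold the policy bit into the PC-derived index */
    return (unsigned)((pc ^ ((uint64_t)pol << 8)) % PCPRED_ENTRIES);
}

/* Training, driven by the tracker sets */
void on_tracker_reuse(uint64_t pc, policy_t pol) {
    pc_entry_t *e = &pcpred[pcpred_index(pc, pol)];
    if (e->valid && e->counter > 0) e->counter--;   /* line proved useful */
}

void on_tracker_evict_unused(uint64_t pc, policy_t pol) {
    pc_entry_t *e = &pcpred[pcpred_index(pc, pol)];
    e->valid = true;
    if (e->counter < CTR_MAX) e->counter++;         /* likely streaming */
}

/* Miss path: pick the insertion policy for this fill, in parallel with
 * the LLC miss itself. */
policy_t insertion_policy(uint64_t pc, policy_t default_policy) {
    const pc_entry_t *e = &pcpred[pcpred_index(pc, default_policy)];
    if (e->valid)  /* table hit: follow this PC's override */
        return (e->counter > OVERRIDE_THRESH) ? POLICY_BBIP : POLICY_LRU;
    return default_policy;  /* table miss: follow the global default */
}
```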

Slide 10: Multi-Threaded MadCache
Thread-aware MadCache
– Similar structures as single-threaded MadCache
– Tracks behavior based on the current policy of other threads
Multithreaded MadCache extensions
– Separate tracker sets for each thread
  Each thread still tracks LRU and BBIP
– PC-Predictor table
  Extended number of entries
  Indexed by thread ID, policy, and PC
– Set dueling per thread (sketched below)
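Per-thread set dueling is a small extension of the single-threaded selector sketched earlier; this version assumes 4 threads and the same 10-bit counter width:

```c
/* Per-thread set dueling: each thread has its own PSEL counter trained
 * by its own LRU and BBIP tracker sets, so each thread gets its own
 * default insertion policy. Widths are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 4
#define PSEL_MAX    1023  /* assumed 10-bit saturating counter */

static uint16_t psel[NUM_THREADS];  /* one selector per thread */

void duel_on_tracker_miss(unsigned tid, bool in_lru_tracker_set) {
    if (in_lru_tracker_set) {
        if (psel[tid] < PSEL_MAX) psel[tid]++;  /* vote for BBIP */
    } else {
        if (psel[tid] > 0) psel[tid]--;         /* vote for LRU */
    }
}

bool default_is_bbip(unsigned tid) {
    return psel[tid] > PSEL_MAX / 2;
}
```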

Slide 11: Multi-threaded MadCache
[Figure: the last-level cache with per-thread tracker sets (TID-0 BBIP trackers, TID-0 LRU trackers, other threads' trackers) plus follower sets; the PC-Predictor table uses a 10-bit index formed from TID + policy + PC(MSB), with a 6-bit counter per entry]
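A small sketch of the extended indexing shown in the figure, assuming the 10-bit index concatenates a 2-bit thread ID, the thread's current default-policy bit, and 7 PC bits; the exact bit selection is a guess:

```c
/* 10-bit multithreaded PC-Predictor index: bits [9:8] = thread ID
 * (4 threads), bit [7] = current per-thread default policy,
 * bits [6:0] = PC bits. All field placements are assumptions. */
#include <stdint.h>

unsigned mt_pcpred_index(unsigned tid, unsigned policy_bit, uint64_t pc) {
    unsigned pc_bits = (unsigned)(pc >> 2) & 0x7F;  /* 7 PC bits, assumed */
    return ((tid & 0x3u) << 8) | ((policy_bit & 0x1u) << 7) | pc_bits;
}
```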

Slide 12: MadCache – Example Application
Deep Packet Inspection [1]
– Large match tables (1MB+) commonly used for DFA/XFA regular expression matching
– Incoming byte stream from packets causes different table traversals
  Table exhibits reuse between packets
  Packets are mostly streaming (backtracking is implementation dependent)
[1] Evaluating GPUs for Network Packet Signature Matching – ISPASS '09

Slide 13: MadCache – Example Application
– Packets are mostly streaming
– Frequently accessed match-table contents are held in L1/L2
– Less frequently accessed elements stay in the LLC/memory
[Figure: match table referenced by the current processing element as packets stream past]

Slide 14: MadCache – Example Application
DIP
– Would favor the BIP policy due to packet-data streaming
– LLC becomes a mixture of match-table and useless packet data
MadCache
– Would identify PCs associated with the match table as useful
– LLC populated almost entirely by the match table
[Figure: DIP LLC vs. MadCache LLC occupancy, showing packet data vs. table data]

Slide 15: Experimentation
Processor: 8-stage, 4-wide pipeline
Instruction window size: 128 entries
Branch predictor: perfect
L1 inst. cache: 32KB, 64B linesize, 4-way SA, LRU, 1-cycle hit
L1 data cache: 32KB, 64B linesize, 8-way SA, LRU, 1-cycle hit
L2 cache: 32KB, 64B linesize, 8-way SA, LRU, 10-cycle hit
L3 cache (1 thread): 1MB, 64B linesize, 30-cycle hit
L3 cache (4 threads): 4MB, 64B linesize, 30-cycle hit
Main memory: 200 cycles
– 15 benchmarks from SPEC CPU2006
– 15 workload mixes for multithreaded experiments
– 200-million-cycle simulations

Slide 16: Results – Single-threaded
IPC normalized to LRU
– 2.5% improvement across the benchmarks tested
– Slight improvement over DIP

Slide 17: Results – Multithreaded
Throughput normalized to LRU
– 6% improvement across the mixes tested
– DIP performs similarly to LRU

Slide 18: Results
Weighted speedup normalized to LRU
– 4.5% improvement across the benchmarks tested
– DIP performs similarly to LRU

Slide 19: Future Work
MadderCache?
– Optimize the size of structures
  PC-Predictor table size
  Replace the CAM with a hashed PC & tag
– Detailed analysis of benchmarks with MadCache
– Extend PC predictions
  Currently don't take sharers into account

Slide 20: Conclusions
Cache behavior is still evolving
– Changing cache levels, sharing, and workloads
MadCache insertion policy uses PC information
– PCs exhibit a useful amount of predictable behavior
MadCache performance
– 2.5% IPC improvement for single-threaded
– 4.5% weighted speedup and 6% throughput improvement for 4 threads
– Sized to the competition bit budget
  Preliminary investigations show little impact from reducing the structures

Slide 21: Questions?