CARP: Compression-Aware Replacement Policies

Tyler Huberty, Rui Cai, Gennady Pekhimenko

Executive Summary
- Traditional cache replacement and insertion policies mainly focus on block reuse
- In a compressed cache, size is an additional dimension
- Problem: How can we maximize cache performance utilizing both block reuse and size?
- Observation: The block most likely to be reused soon may no longer be the best block to keep in the cache
- Key Idea: Use compressed block size in making cache replacement and insertion decisions
- Solution: We propose three mechanisms: Min-LRU, Min-Eviction, and Global Min-Eviction

Background
- Promising cache compression mechanisms proposed recently include Base-Delta-Immediate Compression [Pekhimenko+, PACT'12] and Frequent Pattern Compression [Alameldeen+, ISCA'04]
- To our knowledge, no prior work has considered replacement policies that take compressibility into account

Motivation
(Charts: distributions of compressed block sizes under BDI and FPC)
- Different workloads have varying distributions of compressed block sizes
- Use this variability in size to improve replacement decisions

Application Groupings
- Not compressible
- Homogeneous compressible: little variation in compressed size
- Heterogeneous compressible: variation in compressed sizes

Mechanisms
- Min-LRU
- Min-Eviction
- Global Min-Eviction

Min-LRU
- Observation: in a compressed cache, LRU may evict more blocks than necessary to fit the incoming block
- Key Idea: evict only the minimum number of LRU blocks

Example (figure): Set 0 holds a mix of 4x and x compressed blocks. On an insertion, naïve LRU evicts LRU blocks without regard to size, while Min-LRU evicts only the minimum number of LRU blocks needed to make room. A minimal code sketch follows.
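
To make the mechanism concrete, here is a minimal sketch of Min-LRU insertion for one set, assuming each block's compressed size is tracked alongside its tag; the structure and names are our illustration, not the authors' implementation:

```cpp
#include <cstdint>
#include <iostream>
#include <list>

struct Block {
    std::uint64_t tag;
    int size;  // compressed size in bytes
};

// One cache set, LRU-ordered: front = MRU, back = LRU.
struct CompressedSet {
    std::list<Block> blocks;
    int capacity = 0;  // bytes of data storage in the set
    int used = 0;

    // Min-LRU: evict only as many LRU blocks as needed to fit `incoming`.
    void insertMinLRU(const Block& incoming) {
        while (used + incoming.size > capacity && !blocks.empty()) {
            used -= blocks.back().size;  // evict from the LRU end
            blocks.pop_back();
        }
        blocks.push_front(incoming);     // insert at the MRU position
        used += incoming.size;
    }
};

int main() {
    CompressedSet set;
    set.capacity = 64;            // illustrative set size
    set.insertMinLRU({0x1, 16});
    set.insertMinLRU({0x2, 16});
    set.insertMinLRU({0x3, 16});
    set.insertMinLRU({0x4, 32});  // evicts only block 0x1, the LRU block
    std::cout << set.blocks.size() << " blocks, " << set.used << " bytes used\n";
    // prints: 3 blocks, 64 bytes used
}
```

A conventional (uncompressed) LRU would have evicted a whole way per insertion; here the loop stops as soon as the freed compressed bytes cover the incoming block.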

Min-Eviction
- Observation: keeping multiple compressible blocks with less reuse may be more valuable than keeping a single incompressible block with higher reuse
- Key Idea: assign every block a value based on reuse and compressibility; on replacement, evict the set of blocks with the least value

Example (figure): Set 0 holds blocks {70%, 2x}, {50%, x}, {50%, x}, written as {probability of reuse, compressed size}, with Value Function = probability of reuse / compressed size. Computing values gives 35, 50, and 50, so on an insertion the blocks are sorted by value and the {70%, 2x} block is evicted first despite having the highest reuse probability. A code sketch follows.
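
The following sketch, using the slide's example numbers, shows one plausible realization of Min-Eviction; the linear sort-and-scan and all names are our assumptions, not the authors' hardware design:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

struct Block {
    double reuseProb;  // estimated probability of near-term reuse (e.g. RRIP-based)
    int size;          // compressed size
    double value() const { return reuseProb / size; }
};

// Min-Eviction: evict the least-valuable set of blocks to fit `incoming`.
void insertMinEviction(std::vector<Block>& set, int capacity, const Block& incoming) {
    // Sort ascending by value so the least valuable blocks sit at the front.
    std::sort(set.begin(), set.end(),
              [](const Block& a, const Block& b) { return a.value() < b.value(); });
    int used = 0;
    for (const Block& b : set) used += b.size;
    while (used + incoming.size > capacity && !set.empty()) {
        used -= set.front().size;
        set.erase(set.begin());  // evict the current lowest-value block
    }
    set.push_back(incoming);
}

int main() {
    // The example slide's set: {70%, 2x}, {50%, x}, {50%, x}, with x = 1 unit.
    std::vector<Block> set = {{0.70, 2}, {0.50, 1}, {0.50, 1}};
    insertMinEviction(set, /*capacity=*/4, /*incoming=*/{0.60, 2});
    for (const Block& b : set)
        std::cout << "{" << b.reuseProb << ", " << b.size << "} ";
    std::cout << "\n";  // the {0.70, 2} block (value 0.35) was evicted
}
```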

Min-Eviction Value Function
- f(block reuse, block size)
- Monotonically increasing with respect to block reuse (r)
- Monotonically decreasing with respect to block size (s)
- Probability-of-reuse predictor: an RRIP [Jaleel+, ISCA'10] derivative
- Example: f(r, s) = r / s, visualized on the r/s plane (written out below)
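
Written out, treating x as one size unit (the partial-derivative reading of "monotonically increasing/decreasing" is our paraphrase of the slide):

```latex
% Constraints on the value function, and the example form from the slides:
\frac{\partial f}{\partial r} > 0, \qquad
\frac{\partial f}{\partial s} < 0, \qquad
\text{e.g. } f(r, s) = \frac{r}{s}.
% With the example slide's blocks:
f(0.70,\, 2) = 0.35 \;<\; f(0.50,\, 1) = 0.50 .
```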

Min-Eviction Weaknesses
- Value function potentially difficult to implement in hardware
- Probability of near-term reuse is hard to estimate (use an RRIP-like mechanism [Jaleel+, ISCA'10])
- Value inversion

Decoupled Cache
- "The V-Way Cache: Demand Based Associativity via Global Replacement", Qureshi, Thompson, and Patt, ISCA 2005

Global Replacement Policies
- Global Min-Eviction: consider the next 64 valid entries, starting from a global pointer (PTR), and run Min-Eviction over that window (sketch below)
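
A sketch of the global variant over a decoupled (V-Way-style) data store follows; the window scan, pointer advancement, and all structure names are our illustrative assumptions:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct Entry {
    bool valid;
    double reuseProb;  // estimated probability of near-term reuse
    int size;          // compressed size
    double value() const { return reuseProb / size; }
};

struct GlobalDataStore {
    std::vector<Entry> entries;  // decoupled data store, shared across sets
    std::size_t ptr = 0;         // rotating start position for each scan

    // Free at least `bytesNeeded` by evicting the least-valuable entries
    // among the next (up to) 64 valid entries after `ptr`.
    void globalMinEviction(int bytesNeeded) {
        std::vector<std::size_t> window;
        for (std::size_t i = 0; i < entries.size() && window.size() < 64; ++i) {
            std::size_t idx = (ptr + i) % entries.size();
            if (entries[idx].valid) window.push_back(idx);
        }
        int freed = 0;
        while (freed < bytesNeeded && !window.empty()) {
            std::size_t best = 0;  // position in `window` of the lowest-value entry
            for (std::size_t j = 1; j < window.size(); ++j)
                if (entries[window[j]].value() < entries[window[best]].value())
                    best = j;
            freed += entries[window[best]].size;
            entries[window[best]].valid = false;  // evict
            window.erase(window.begin() + best);
        }
        ptr = (ptr + 64) % entries.size();  // start the next scan further along
    }
};

int main() {
    GlobalDataStore store;
    store.entries = {{true, 0.70, 2}, {true, 0.50, 1},
                     {true, 0.50, 1}, {true, 0.90, 4}};
    store.globalMinEviction(2);  // evicts {0.90, 4}: lowest value, 0.225
    int valid = 0;
    for (const Entry& e : store.entries) valid += e.valid;
    std::cout << valid << " entries remain valid\n";  // prints 3
}
```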

Methodology
- Simulator: x86 event-driven simulator based on Simics [Magnusson+, Computer'02]
- Workloads: SPEC2006 benchmarks; 1-2 core simulations of 1 billion representative instructions
- System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]; 4 GHz x86 in-order core; 2MB L2; simple memory model (300-cycle latency for row misses)

Single-Core Results
- Min-Eviction outperforms RRIP and LRU

Homogeneous-Homogeneous
(Chart: speedups with geometric mean across workload pairs)
- In a shared cache, some Homogeneous-Homogeneous workload pairs act like a Heterogeneous workload

Heterogeneous-Homogeneous
- Min-Eviction shows varied improvement over LRU on Heterogeneous-Homogeneous workloads, 3.5% overall

Heterogeneous-Heterogeneous
- Min-Eviction outperforms LRU by 6.4% on average on Heterogeneous-Heterogeneous workloads

Next Ideas (CARP ver. next)
- Insertion policy
- LCP integration
- Consider compressibility in prefetch decisions
- Replace the value function? A new curve, new inputs, dynamic weights on inputs, a heterogeneous value function
- Bucketed or staged decision making (e.g., TCM-like)

Further Investigations
- Classifying applications: standard deviation, sensitivity; block size distribution of the entire data store
- Study of reuse predictors (D-RRIP, EAF, etc.); ideal?

Backup Slides

Spatial Footprint Predictor [Kumar & Wilkerson, ISCA'98]
- Predicts which neighboring words are useful to prefetch
- Idea: a "variable cache line size" cache
  - Spatial locality varies across applications, and the cache line size choice affects performance
  - Use a Spatial Footprint Predictor to predict the optimal line size at fine grain
  - The compressed cache structure supports variable line sizes
  - Variable cache line size → size-aware replacement policy
- Also consider: Spatial Memory Streaming [Somogyi+, ISCA'06]

Multi-Core Results: Overall
(Chart: speedups of 4.6%, 1.9%, and 0.8% across workload categories)
- The best speedup is exhibited in workloads with the largest variability in size: 1.9% overall with 2 cores

Next Steps
- Analyze Dynamic History-Based Partitioning results for further insight
- Continue simulating and improving Min-Value Eviction
- Improve the value function: try to eliminate the division (see the sketch below)
- Explore fairness
  - Idea: add application awareness to the value function (i.e., f(r, s, a))
  - Idea: enforce application partitions in Dynamic History-Based Partitioning
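
Since the slides mention eliminating the division, one standard trick (our suggestion, not from the slides) is to compare values r/s by cross-multiplication, which needs only integer multiplies:

```cpp
// Division-free comparison of block values r/s, using cross-multiplication:
// r1/s1 < r2/s2  <=>  r1*s2 < r2*s1, valid whenever s1, s2 > 0.
#include <cstdint>
#include <iostream>

struct Block {
    std::uint32_t reuse;  // reuse estimate as an integer (e.g. a scaled RRIP counter)
    std::uint32_t size;   // compressed size in bytes, > 0
};

// True if a is less valuable than b, without dividing.
bool lessValuable(const Block& a, const Block& b) {
    return static_cast<std::uint64_t>(a.reuse) * b.size <
           static_cast<std::uint64_t>(b.reuse) * a.size;
}

int main() {
    Block hot{70, 2};    // the slides' {70%, 2x} block
    Block cold{50, 1};   // the slides' {50%, x} block
    // 70*1 < 50*2, so the larger block is less valuable despite higher reuse.
    std::cout << std::boolalpha << lessValuable(hot, cold) << "\n";  // true
}
```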

References
- Belady. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal, 1966.
- Hallnor et al. A Fully Associative Software-Managed Cache Design. ISCA 2000.
- Jaleel et al. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). ISCA 2010.
- Pekhimenko et al. Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. PACT 2012.
- Qureshi et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. MICRO 2006.
- Seshadri et al. The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing. PACT 2012.