Achieving Non-Inclusive Cache Performance with Inclusive Caches
Temporal Locality Aware (TLA) Cache Management Policies
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer
In International Symposium on Microarchitecture (MICRO), December 2010
Presented by: Yingying Tian

High Performing Cache Hierarchy in CMPs
Cache hierarchy: multiple interacting caches on chip, trading off cache latency against hit rate.
Chip multiprocessors (CMPs) widen the gap between processor and memory speeds.
Goal: an efficient and high performing cache hierarchy.

Key Issue: Inclusion or Not?
Tradeoff: effective size of the cache hierarchy vs. simplicity of cache coherence.
*Some materials are taken from the original presentation slides.

Inclusive Caches
Pro: simplify cache coherence.
Con: waste cache capacity (usable capacity = size of the LLC).
Con: the inclusion property forces invalidation of blocks that still have high temporal locality in the core caches (the back-invalidate problem), costing hundreds of cycles of memory access penalty.

Back-Invalidate Problem
Inclusion property: all higher-level caches must be a subset of the last-level cache (LLC).
Back-invalidation: when a block is evicted from the LLC, inclusion is enforced by invalidating that block in every cache in the hierarchy -- an inclusion victim.
Small caches filter temporal locality → inclusion victims can still have high temporal locality -- hot inclusion victims.

Back-Invalidate Problem (Cont.)
Consider the following access pattern in a 2-level inclusive cache hierarchy: ... a, b, a, c, a, d, a, e, a, f ...
[Slide animation: the L1 and L2 contents after each reference, ordered MRU to LRU.]
Reference 'e' misses in the LLC and evicts 'a' from the entire hierarchy, even though 'a' has high temporal locality in L1. The next reference to 'a' therefore misses.
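To make the failure concrete, here is a minimal, hypothetical sketch (my own illustration, not code from the paper) of a 2-level inclusive hierarchy with LRU replacement, where evicting the LLC's LRU block back-invalidates it from L1:

```python
from collections import OrderedDict

class InclusiveHierarchy:
    """Toy 2-level inclusive hierarchy with LRU replacement (illustration only)."""

    def __init__(self, l1_size=2, l2_size=4):
        self.l1 = OrderedDict()   # first key = LRU, last key = MRU
        self.l2 = OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, block):
        if block in self.l1:              # L1 hit: the LLC's LRU state is NOT updated
            self.l1.move_to_end(block)
            return "L1 hit"
        if block in self.l2:              # LLC hit: refill L1, refresh LLC LRU state
            self.l2.move_to_end(block)
            self._fill_l1(block)
            return "LLC hit"
        if len(self.l2) >= self.l2_size:  # LLC miss: evict the LLC's LRU block...
            victim, _ = self.l2.popitem(last=False)
            self.l1.pop(victim, None)     # ...and back-invalidate it from L1 (inclusion victim)
        self.l2[block] = True
        self._fill_l1(block)
        return "miss"

    def _fill_l1(self, block):
        if len(self.l1) >= self.l1_size:
            self.l1.popitem(last=False)
        self.l1[block] = True

h = InclusiveHierarchy()
for ref in ["a", "b", "a", "c", "a", "d", "a", "e", "a"]:
    print(ref, h.access(ref))
# 'a' keeps hitting in L1, so the LLC never sees it and it ages to LRU there;
# reference 'e' evicts 'a' from the whole hierarchy, and the final access to
# 'a' misses even though 'a' was hot in L1.
```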

Back-Invalidate Problem (Cont.)
Intel Core i7: 1:8 core-cache-to-LLC size ratio, inclusive LLC.
AMD Phenom II: 1:4 ratio, non-inclusive LLC.

Temporal Locality Aware Cache Management Policies
Goal: implement an efficient and high performing cache hierarchy by eliminating hot inclusion victims, thereby improving inclusive cache performance.

Outline
Background and motivation
Problem description
Temporal Locality Aware (TLA) Cache Management Policy Suite
Evaluation
Conclusion

Three Temporal Locality Aware (TLA) cache management policies:
Temporal Locality Hints (TLH)
Early Core Invalidation (ECI)
Query Based Selection (QBS)

Temporal Locality Hints (TLH)
TLH conveys the temporal locality of hot blocks in the core caches: on each core-cache hit, a hint is sent to the LLC to update that block's replacement state in the LLC.
Pro: significantly reduces the number of inclusion victims.
Con: the number of requests to the LLC is extremely large and does not scale with increasing core count (even with filter optimizations).
TLH therefore serves as a limit study; a sketch follows below.
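A minimal sketch of the TLH idea, extending the toy InclusiveHierarchy above (the hint mechanism is paraphrased; the class and method names are my own, not the paper's):

```python
class TLHHierarchy(InclusiveHierarchy):
    """Toy Temporal Locality Hints: each L1 hit also sends a hint that
    refreshes the block's replacement state in the LLC."""

    def access(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)
            self.l2.move_to_end(block)  # the hint: promote the block in the LLC too
            return "L1 hit (hint sent)"
        return super().access(block)

h = TLHHierarchy()
for ref in ["a", "b", "a", "c", "a", "d", "a", "e", "a"]:
    print(ref, h.access(ref))
# With hints, 'a' stays at MRU in the LLC, so 'e' evicts a cold block instead.
# The cost is one extra LLC message per core-cache hit -- the scaling problem
# noted above.
```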

Early Core Invalidation (ECI)
ECI derives the temporal locality of a block before it becomes LRU in the LLC.
On each LLC miss, the LLC chooses the block at the LRU-1 position and invalidates it in the core caches while keeping it in the LLC. By observing the core's subsequent requests, the LLC derives the block's temporal locality.

Early Core Invalidation (ECI) (Cont.)
An early-invalidated block is called an ECI block.
If the ECI block is hot in some core cache, it is re-requested by that core → L1 miss but LLC hit → moved back to MRU in the LLC to preserve its temporal locality.
If the ECI block is not hot (not re-requested, or re-requested only after a long time) → evicted from the LLC on the next LLC miss to the corresponding set.
Pro: a lower-traffic solution (the number of LLC misses is much smaller than the number of core-cache hits).
Con: low-accuracy prediction of whether the ECI block is hot in the core caches -- what if the ECI block is hot, but not hot enough to be re-requested in time?
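A sketch of ECI on the same toy model (again my own paraphrase; a real implementation operates per set with hardware invalidation messages the toy omits):

```python
class ECIHierarchy(InclusiveHierarchy):
    """Toy Early Core Invalidation: on each LLC miss, the next eviction
    candidate (the block at LRU-1) is invalidated in the core cache early,
    while staying in the LLC. A prompt re-request becomes an LLC hit that
    promotes it back to MRU; otherwise it is evicted as usual."""

    def access(self, block):
        missed = block not in self.l1 and block not in self.l2
        result = super().access(block)
        if missed and len(self.l2) >= 2:
            eci_block = list(self.l2)[1]      # LRU-1 position (index 0 is the LRU block)
            if eci_block != block:
                self.l1.pop(eci_block, None)  # early invalidate in the core cache only
        return result
```

If the early-invalidated block is re-referenced, the base class's LLC-hit path already moves it back to MRU, which is exactly the locality signal ECI relies on.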

Query Based Selection (QBS)
QBS infers the temporal locality of a block in the LLC by querying the core caches on each LLC miss.
The LLC selects a replacement candidate and queries the core caches to learn whether the block is present in any of them. Only blocks not present in any core cache are replaced.
If the candidate is present in some core cache, the LLC updates its replacement state to MRU, then selects and queries another candidate.

Query Based Selection (QBS) (Cont.)
The QBS victim selection process is hidden by the memory latency of the miss.
The cache controller can limit the number of queries issued per LLC miss; in the experiments, sending 2 queries is sufficient to achieve the performance benefits.
QBS performs similarly to a non-inclusive cache hierarchy.
The on-chip communication overhead can be extremely large. [Not evaluated in the paper.]
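A sketch of QBS on the same toy model (my own paraphrase; "querying" here reduces to a membership test on the single core cache):

```python
class QBSHierarchy(InclusiveHierarchy):
    """Toy Query Based Selection: before evicting an LLC candidate, query the
    core cache; candidates still resident there are promoted to MRU and
    skipped, up to a small query budget."""

    MAX_QUERIES = 2  # the talk reports that 2 queries suffice

    def access(self, block):
        if block in self.l1 or block in self.l2:
            return super().access(block)
        if len(self.l2) >= self.l2_size:              # LLC miss with a full LLC
            for _ in range(self.MAX_QUERIES):
                candidate = next(iter(self.l2))       # current LRU block
                if candidate not in self.l1:          # query: resident in a core cache?
                    break
                self.l2.move_to_end(candidate)        # resident: promote to MRU, retry
            victim = next(iter(self.l2))
            del self.l2[victim]
            self.l1.pop(victim, None)  # back-invalidates only if the query budget ran out
        self.l2[block] = True
        self._fill_l1(block)
        return "miss"
```

On the slide's pattern, the query for LRU candidate 'a' finds it in L1, promotes it to MRU, and evicts a non-resident block instead, so 'a' survives just as it would under non-inclusion.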

An Example
[Slide animation replaying the access pattern ... a, b, a, c, a, d, a, e, a, f, a, ... under the TLA policies.]

Experimental Methodology
CMP$im: an x86 simulator.
Baseline: 2-core CMP with a 3-level inclusive cache hierarchy.
L1 I/D: 4-way, 32KB, 64B block size, 1-cycle access latency.
L2: 8-way, 256KB, 64B block size, 10-cycle access latency, non-inclusive.
L3 (LLC): shared, 16-way, 2MB, 24-cycle access latency, enforces inclusion.
Main memory: 150-cycle access latency.
Benchmarks: 15 benchmarks selected from the SPEC CPU2006 suite based on program behavior (core-cache fitting, LLC fitting, LLC thrashing; 5 benchmarks of each).
Total workloads: 105 2-core workloads (15 choose 2).

Performance
[Bar chart of performance improvements; the visible data labels are 5.2%, 6.1%, 3.4%, 6.1%, 6.6%, and 6.1%.]

Performance (Cont.)
QBS performs similarly to non-inclusive caches for all cache ratios.

Performance (Cont.)
Scalability of QBS in 2-core, 4-core, and 8-core CMPs (1:4 cache size ratio).

Conclusion
Temporal Locality Aware cache management retains the benefits of inclusion while minimizing the back-invalidate problem.
A TLA-managed inclusive cache achieves the performance of a non-inclusive cache.
Thanks! Questions?