Achieving Non-Inclusive Cache Performance with Inclusive Caches

Presentation transcript:

Achieving Non-Inclusive Cache Performance with Inclusive Caches Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer Intel Corporation, VSSAD Aamer.Jaleel@intel.com IEEE/ACM International Symposium on Microarchitecture (MICRO’2010)

Motivation Factors making caching important: CPU speed >> memory speed, and Chip Multi-Processors (CMPs). Goal: a high performing LLC. A mature field such as caching still has significant importance today because memory speeds continue to lag behind processor speeds. Additionally, the multi-core era poses significant challenges for cache management. To deal with the long latency to memory, many of us try to design a high performing Last Level Cache (LLC), and a significant portion of prior art has approached this problem through better LLC management policies. However, as I will show, that is NOT enough: we must take a holistic approach and start designing a high performing cache hierarchy.

Focus Of This Talk Is to Design a High Performing Cache Hierarchy The goal is no longer just a high performing LLC, but a high performing cache hierarchy: the L1, L2, and Last Level Cache (LLC) managed together.

Cache Hierarchy 101: Kinds of Cache Hierarchies So that we are all using the same terminology, here is a quick overview of the different kinds of cache hierarchies, illustrated with a two-level hierarchy. At one end we have the inclusive hierarchy, where the contents of the L1 cache are required to be duplicated in the LLC: misses to memory fill both levels of the hierarchy, and an eviction from the LLC also evicts that line from the L1 (if present) via a back-invalidate. Inclusive Hierarchy: L1 is a subset of the LLC.

Cache Hierarchy 101: Kinds of Cache Hierarchies At the other end is the exclusive hierarchy, where it is guaranteed that no duplication exists: the contents of the L1 are NOT duplicated in the LLC. In this case the LLC acts as a victim cache; fills first go into the L1, and evictions from the L1 are filled into the LLC. Exclusive Hierarchy: L1 is NOT in the LLC.

Cache Hierarchy 101: Kinds of Cache Hierarchies As a tradeoff between the two, we have the non-inclusive hierarchy, where there are no requirements about duplication. Non-inclusive hierarchies are built by simply not sending a back-invalidate when evicting from the LLC. Non-Inclusive Hierarchy: L1 is not necessarily a subset of the LLC.

Cache Hierarchy 101: Kinds of Cache Hierarchies In a nutshell, each of these hierarchies has different tradeoffs. Inclusive caches: (+) simplify cache coherence, (−) waste cache capacity, (−) back-invalidates limit performance. Non-inclusive caches: (+) do not waste cache capacity, (−) complicate cache coherence, (−) need extra hardware for snoop filtering. Total capacity: inclusive = LLC; non-inclusive >= LLC and <= (L1 + LLC); exclusive = L1 + LLC. Back-invalidates: inclusive yes; non-inclusive no; exclusive no. Coherence: with inclusion the LLC acts as a directory; with non-inclusion or exclusion an LLC miss must snoop ALL L1 caches (or use a snoop filter).
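To make the three fill/evict flows concrete, here is a minimal sketch in Python (not the paper's simulator). Each cache is modeled as a single fully-associative LRU set; the capacities, the access() interface, and the single-set layout are illustrative assumptions of this sketch, not the hardware.

```python
from collections import OrderedDict

class Cache:
    """A single fully-associative LRU set (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()              # ordered LRU (front) to MRU (back)

    def touch(self, addr):                      # promote to MRU on a hit
        self.lines.move_to_end(addr)

    def insert(self, addr):                     # fill; return the evicted victim (or None)
        victim = None
        if len(self.lines) >= self.capacity:
            victim, _ = self.lines.popitem(last=False)
        self.lines[addr] = True
        return victim

def access(addr, l1, llc, policy):
    """Service one core request under 'inclusive', 'non-inclusive', or 'exclusive'."""
    if addr in l1.lines:                        # L1 hit: nothing else to do
        l1.touch(addr)
        return
    if addr in llc.lines:                       # LLC hit
        if policy == "exclusive":
            del llc.lines[addr]                 # exclusive: the line moves up, no copy stays
        else:
            llc.touch(addr)
    else:                                       # miss to memory
        if policy != "exclusive":               # exclusive fills bypass the LLC
            llc_victim = llc.insert(addr)
            if policy == "inclusive" and llc_victim is not None:
                l1.lines.pop(llc_victim, None)  # back-invalidate to preserve inclusion
    l1_victim = l1.insert(addr)
    if policy == "exclusive" and l1_victim is not None:
        llc.insert(l1_victim)                   # LLC acts as a victim cache for L1 evictions

l1, llc = Cache(2), Cache(4)
for a in "abcab":
    access(a, l1, llc, "inclusive")
print(list(l1.lines), list(llc.lines))          # L1 stays a subset of the LLC
```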

Performance of Non-Inclusive and Exclusive LLCs The choice of a cache hierarchy is a function of the size of the core caches. To illustrate this, let us look at different ratios of core cache to LLC size. We assume a 2-core CMP with a 3-level hierarchy: 32KB L1, 256KB L2 (the mid-level cache, MLC), and an LLC sized according to the ratio. The x-axis sweeps four different ratios of MLC size to LLC size, and the y-axis shows performance relative to the respective baseline inclusive cache; at one end the LLC is 2X the MLC size, at the other it is 8X. Intel is at a 1:8 ratio and builds inclusive caches; AMD is at 1:4 and builds non-inclusive caches. At 1:4 there is more benefit from non-inclusion than at 1:8. Enforcing inclusion is bad when the LLC is not significantly larger than the MLC. Why do non-inclusive (NI) and exclusive LLCs perform better? They make use of extra cache capacity by avoiding duplication, and they avoid problems with harmful back-invalidates. Which of these two reasons limits the performance of inclusion?

Back-Invalidate Problem with Inclusive Caches Inclusion victims: lines evicted from the core caches due to an LLC eviction. Small caches filter temporal locality: small-cache hits do not update the LLC LRU, so “hot” small-cache lines become LRU in the LLC. Example reference pattern: … a, b, a, c, a, d, a, e, a, f … The repeated hits to ‘a’ are serviced by the L1 and never update the LLC, so ‘a’ ages to LRU in the LLC; reference ‘e’ then misses and evicts ‘a’ from the hierarchy, and the next reference to ‘a’ misses. Filtered temporal locality → lines become LRU in the LLC → hierarchy eviction. Now, some of you might be thinking that this problem has been studied in the past and is not significant. I agree; however, such studies have been done in the context of single-core processors. As I will show you now, this problem is actually exacerbated on CMPs.
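The inclusion-victim scenario on this slide can be reproduced with a few lines of Python. This is a standalone toy model (2-entry L1, 4-entry LLC, single LRU set; all sizes are assumptions of the sketch): the hot line 'a' keeps hitting in the L1, never refreshes its LLC LRU state, and is eventually back-invalidated out of the whole hierarchy by the miss to 'e'.

```python
from collections import OrderedDict

def lru_insert(cache, capacity, addr):
    """Insert addr into an LRU-ordered dict, returning the evicted victim (or None)."""
    victim = None
    if addr not in cache and len(cache) >= capacity:
        victim, _ = cache.popitem(last=False)      # evict the LRU entry
    cache[addr] = True
    cache.move_to_end(addr)
    return victim

l1, llc = OrderedDict(), OrderedDict()             # 2-entry L1, 4-entry LLC
inclusion_victims = []
for addr in "abacadaeaf":
    if addr in l1:
        l1.move_to_end(addr)                       # L1 hit: LLC replacement state untouched
        continue
    victim = lru_insert(llc, 4, addr)              # LLC lookup/fill on the miss path
    if victim is not None and victim in l1:
        inclusion_victims.append(victim)           # back-invalidate a line the core still uses
        del l1[victim]
    lru_insert(l1, 2, addr)
print("inclusion victims:", inclusion_victims)     # -> ['a']: evicted while still hot in the L1
```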

Inclusion Problem Exacerbated on CMPs! CMPs allow applications with varying demands for memory to co-execute. Types of applications: Core Cache Fitting (CCF) apps, whose working set fits in the core caches; LLC Fitting (LLCF) apps, whose working set fits in the LLC; and LLC Thrashing (LLCT) apps, whose working set is larger than the LLC.

Inclusion Problem Exacerbated on CMPs! Consider what happens when a CCF and an LLCF application co-execute…

Inclusion Problem Exacerbated on CMPs! When a CCF and an LLCF application co-execute, the CCF app is serviced from the L2 cache and rarely from the LLC, so the replacement state of the CCF app's lines becomes LRU at the LLC.

Inclusion Problem Exacerbated on CMPs! Because CCF apps are serviced from the L2 cache and rarely from the LLC, their replacement state becomes LRU at the LLC. The LLCF app then replaces the CCF working set in the LLC, and inclusion mandates removing the CCF working set from the entire hierarchy. As a result, the performance of CCF apps significantly degrades in the presence of LLCF/LLCT apps.

Eliminate “Inclusion Victims” Using Temporal Locality Hints Baseline policies only update replacement state at the level of the hit. Proposal: convey the temporal locality observed in the small caches to the LLC. Temporal Locality Hints (TLH): non-data requests sent to update LLC replacement state. On a core request that hits in the L1, the L1 updates its own LRU state and also sends a hint down the hierarchy that updates the LLC's LRU state (sketched below).
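A minimal sketch of the hint path, assuming the same OrderedDict-as-LRU-set model as the earlier sketches; the function name and the dict-based cache model are illustrative, not the paper's hardware interface.

```python
from collections import OrderedDict

def on_l1_hit(addr, l1, llc, send_tlh=True):
    """L1 hit path. Baseline: only the L1 promotes the line to MRU.
    With a Temporal Locality Hint (TLH), a non-data message also promotes
    the line in the LLC, so hot core-cache lines stop aging toward LRU there.
    Bandwidth cost: one hint per L1 hit (why the paper seeks cheaper options)."""
    l1.move_to_end(addr)                 # promote to MRU in the L1
    if send_tlh and addr in llc:
        llc.move_to_end(addr)            # TLH: update LLC replacement state only, no data moves

l1  = OrderedDict.fromkeys("ba")         # LRU..MRU: b, a
llc = OrderedDict.fromkeys("abcd")       # 'a' is currently the LLC's LRU victim candidate
on_l1_hit("a", l1, llc)
print(next(iter(llc)))                   # -> 'b': the hot line 'a' is no longer the LLC victim
```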

Conveying ALL Temporal Locality to the LLC This is an important finding: if we can ensure that the LLC is made aware of the temporal locality of lines, we can significantly bridge the performance gap between inclusive and non-inclusive caches. (Our studies do not model TLH bandwidth.) The bulk of non-inclusive cache performance comes from avoiding back-invalidates, NOT from capacity! Inclusive LLC management must be temporal locality aware.

Performance of L1 Temporal Locality Hints For 2T workloads on a 1:4 hierarchy, L1 hints improve performance by 5.2% over baseline inclusion, bridging 85% of the gap between inclusion and non-inclusion (6.1%). Limitations of L1 hints: very high bandwidth, since the number of messages equals the number of L1 hits. (Our studies do not model TLH bandwidth.) We need a low-bandwidth alternative to L1 temporal locality hints.

Improving Inclusive Cache Performance Eliminating back-invalidates (i.e., building non-inclusive caches) increases coherence complexity. Goal: retain the benefits of inclusion yet avoid inclusion victims. Solution: Temporal Locality Aware (TLA) cache management, which ensures the LLC DOES NOT evict “hot” lines from the core caches. To do so, we must identify LLC lines that have high temporal locality in the core caches.

Early Core Invalidate (ECI) Main idea: derive temporal locality by removing a line early from the core caches. Early Core Invalidate (ECI): on an LLC miss, send an early invalidate for the next victim in the same set. If the line is “hot”, the core will re-request it and “rescue” it from the LLC, and the rescue updates the LLC replacement state as a side effect.
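A hedged sketch of the ECI flow on the LLC miss path, again using the single-set OrderedDict model; the function names, the fully-associative layout, and the way the “next victim” is picked are assumptions of this sketch rather than the paper's exact hardware.

```python
from collections import OrderedDict

def llc_miss_with_eci(miss_addr, l1, llc, llc_ways=4):
    """On an LLC miss: evict the normal victim (with its inclusive back-invalidate),
    fill the missing line, then send an Early Core Invalidate for the *next* LRU
    victim. If the core still needs that line, its later 'rescue' request will hit
    the LLC and refresh the line's replacement state; otherwise the ECI was harmless."""
    if len(llc) >= llc_ways:
        victim, _ = llc.popitem(last=False)       # normal LLC victim
        l1.pop(victim, None)                      # inclusive back-invalidate
    llc[miss_addr] = True                         # fill at MRU
    next_victim = next(iter(llc))                 # current LRU line = next victim
    if next_victim in l1:
        del l1[next_victim]                       # early core invalidate
        return next_victim
    return None

def core_access(addr, l1, llc, l1_ways=2):
    if addr in l1:
        l1.move_to_end(addr)                      # L1 hit
        return
    if addr in llc:
        llc.move_to_end(addr)                     # a "rescue" lands here and updates LLC LRU
    else:
        llc_miss_with_eci(addr, l1, llc)
    if len(l1) >= l1_ways:
        l1.popitem(last=False)                    # L1 fill may evict its own LRU line
    l1[addr] = True

# Demo: the miss to 'e' early-invalidates 'b' (the LLC's next victim) from the L1;
# the later access to 'b' becomes a rescue that promotes 'b' to MRU in the LLC.
l1  = OrderedDict.fromkeys("ab")
llc = OrderedDict.fromkeys("abcd")
core_access("e", l1, llc)
core_access("b", l1, llc)
print(list(llc))                                  # -> ['c', 'd', 'e', 'b']: 'b' is no longer the victim
```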

Performance of Early Core Invalidate (ECI) For 2T workloads on a 1:4 hierarchy, ECI improves performance by 3.4% over baseline inclusion, bridging 55% of the gap between inclusion and non-inclusion (6.1%). Pros: no hardware overhead and low bandwidth, since the number of messages equals the number of LLC misses. Limitations: the rescue window is short, because the rescue must occur BEFORE the next miss to the set. We can still do better than ECI…

Query Based Selection (QBS) Main idea: replace lines that are NOT resident in the core caches. Query Based Selection (QBS): on a miss, the LLC sends a back-invalidate request for its victim candidate, and the core REJECTS the back-invalidate if the line is resident in its caches.

Query Based Selection (QBS) If the core rejects the back-invalidate, the LLC updates that line to MRU and repeats the back-invalidate process with the next victim candidate until a core ACCEPTS the request (or a timeout is reached).
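A minimal sketch of QBS victim selection, assuming the same OrderedDict-based LRU sets; the two-try bound matches the paper's observation that two tries suffice, but the function interface and the timeout fallback are illustrative assumptions of this sketch.

```python
from collections import OrderedDict

def qbs_select_victim(llc, core_caches, max_tries=2):
    """Walk the LLC set from LRU. If any core still holds the candidate, the
    back-invalidate is rejected: the candidate is promoted to MRU and the
    next-LRU line is tried. On a timeout, fall back to a forced
    (inclusion-preserving) eviction."""
    for _ in range(max_tries):
        candidate = next(iter(llc))                  # current LRU line
        if not any(candidate in cache for cache in core_caches):
            llc.popitem(last=False)                  # accepted: no core holds it, safe to evict
            return candidate
        llc.move_to_end(candidate)                   # rejected: promote to MRU, try the next line
    victim, _ = llc.popitem(last=False)              # timeout: forced eviction
    for cache in core_caches:
        cache.pop(victim, None)                      # back-invalidate to preserve inclusion
    return victim

# Example: the LLC's LRU line 'a' is still hot in a core cache, so QBS
# promotes 'a' and evicts 'b' instead, avoiding an inclusion victim.
llc = OrderedDict.fromkeys("abcd")                   # LRU .. MRU = a, b, c, d
l1  = OrderedDict.fromkeys("ad")
print(qbs_select_victim(llc, [l1]))                  # -> 'b'
print(list(llc))                                     # -> ['c', 'd', 'a']
```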

Performance of Query Based Selection (QBS) For 2T workloads on a 1:4 hierarchy, QBS improves performance by 6.6% over baseline inclusion, compared to 6.1% for non-inclusion: QBS outperforms non-inclusion. Pros: no hardware overhead and low bandwidth, since the number of messages equals the number of LLC misses. Studies show that a maximum of two tries is sufficient for victim selection.

Summary of TLA Cache Management (2-core CMP) TLA management eliminates most of the problems associated with back-invalidates; all that remains is the capacity difference, and the extra capacity of non-inclusion is not worth the added coherence complexity. QBS performs similarly to non-inclusive caches for all cache ratios.

QBS Scalability (2-core, 4-core, 8-core CMPs) For 2T, 4T, and 8T workloads on a 1:4 hierarchy, QBS scales with the number of cores and performs similarly to non-inclusive caches.

Summary Problem: the inclusive cache problem becomes WORSE on CMPs, e.g., when core cache fitting apps run alongside LLC fitting or LLC thrashing apps. Conventional wisdom: the primary benefit of a non-inclusive cache comes from its higher capacity. We show: the primary benefit is NOT capacity but avoiding back-invalidates. Proposal: Temporal Locality Aware cache management, which retains the benefits of inclusion while minimizing the back-invalidate problem. A TLA-managed inclusive cache achieves the performance of a non-inclusive cache.

Q&A

Cache Hierarchy 101: Kinds of Cache Hierarchies Next is the non-inclusive hierarchy, where the L1 is not necessarily a subset of the LLC. Non-inclusive hierarchies are built by simply not sending a back-invalidate when evicting from the LLC. Non-Inclusive Hierarchy: L1 is not necessarily a subset of the LLC.

Eliminate “Inclusion Victims” in Inclusive Caches Using Temporal Locality Hints Baseline policies only update replacement state at the level of the hit. Proposal: convey the temporal locality observed in the small caches to the larger caches. Temporal Locality Hints: non-data requests sent to update LLC replacement state. A core request that hits in the L1 updates the L1 LRU and sends a hint (TLH) that updates the LLC LRU; likewise, an L2 hit updates the L2 LRU and sends a hint that updates the LLC LRU.

Performance of L2 Temporal Locality Hints For 2T workloads on a 1:4 hierarchy, L2 hints improve performance by 2.8% over baseline inclusion, bridging less than 50% of the gap between inclusion and non-inclusion (6.1%). Limitations of L2 hints: not as good as L1 hints, and still high bandwidth, since the number of messages equals the number of L2 hits. (Our studies do not model TLH bandwidth.) We need a low-bandwidth alternative to L1 temporal locality hints.

QBS Scalability (4-core and 8-core CMPs) For 4T and 8T workloads on a 1:4 hierarchy, QBS scales with the number of cores and performs similarly to non-inclusive caches.