Achieving Non-Inclusive Cache Performance with Inclusive Caches
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
IEEE/ACM International Symposium on Microarchitecture (MICRO’2010)
Motivation
Factors making caching important:
  CPU speed >> memory speed
  Chip Multi-Processors (CMPs)
Goal: High performing LLC
[Figure: two cores, each with iL1, dL1 and L2, above the Last Level Cache (LLC)]
Caching is a mature field, yet it still has significant importance today because memory speeds continue to lag behind processor speeds. The multi-core era also poses significant challenges for cache management. To deal with the long latency to memory, many of us try to design a high performing LLC, and a significant portion of prior art has attacked this problem with better LLC management policies. However, as I will show you, that is NOT enough. We must take a holistic approach and start designing a high performing cache hierarchy.
Focus Of This Talk Is to Design a High Performing Cache Hierarchy
Factors making caching important:
  CPU speed >> memory speed
  Chip Multi-Processors (CMPs)
Goal: High performing cache hierarchy (not only a high performing LLC)
[Figure: iL1, dL1 and L2 above the Last Level Cache (LLC)]
The focus of this talk is to design a high performing cache hierarchy.
Cache Hierarchy 101: Kinds of Cache Hierarchies
Inclusive Hierarchy: L1 subset of LLC
[Figure: core request, L1, LLC and memory, with fill, evict, victim and BackInval arrows]
So that we are all using the same terminology, I would like to give a quick overview of the different kinds of cache hierarchies. For illustration purposes, I will use a two-level hierarchy. At one end, we have the inclusive hierarchy, where the contents of the L1 cache are required to be duplicated in the LLC. Misses to memory fill both levels of the hierarchy. An eviction from the LLC also evicts THAT line from the L1 (if present).
Cache Hierarchy 101: Kinds of Cache Hierarchies
Inclusive Hierarchy: L1 subset of LLC
Exclusive Hierarchy: L1 is NOT in LLC
[Figure: the inclusive and exclusive hierarchies side by side, each showing core request, L1, LLC and memory with fill, evict, victim and BackInval arrows]
At the other end is the exclusive hierarchy, where it is guaranteed that no duplication exists. The contents of the L1 are NOT duplicated in the LLC. In this case, the LLC acts as a victim cache: fills first go into the L1, and lines evicted from the L1 are filled into the LLC.
Cache Hierarchy 101: Kinds of Cache Hierarchies
Inclusive Hierarchy: L1 subset of LLC
Non-Inclusive Hierarchy: L1 not subset of LLC
Exclusive Hierarchy: L1 is NOT in LLC
[Figure: the three hierarchies side by side, each showing core request, L1, LLC and memory with fill, evict, victim and BackInval arrows]
As a tradeoff between the two, we have the non-inclusive hierarchy, where there are no requirements about duplication. Non-inclusive hierarchies are built by simply not sending a back-invalidate when evicting from the LLC.
Cache Hierarchy 101: Kinds of Cache Hierarchies
[Figure: the three hierarchies side by side, each showing core request, L1, LLC and memory with fill, evict, victim and BackInval arrows]
Each of these hierarchies has different tradeoffs.
                   Inclusive               Non-Inclusive               Exclusive
Relation           L1 subset of LLC        L1 not subset of LLC        L1 is NOT in LLC
Total capacity     LLC                     >= LLC and <= (L1 + LLC)    L1 + LLC
Back-invalidate    YES                     NO                          NO
Coherence          LLC acts as directory   LLC miss snoops ALL L1$     LLC miss snoops ALL L1$
                                           (or use snoop filter)       (or use snoop filter)
IN A NUTSHELL
Inclusive Caches:
  (+) simplify cache coherence
  (−) waste cache capacity
  (−) back-invalidates limit performance
Non-Inclusive Caches:
  (+) do not waste cache capacity
  (−) complicate cache coherence
  (−) extra hardware for snoop filtering
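
The three hierarchies differ only in where a missing line is filled and in whether an LLC eviction reaches into the L1. The following is a minimal Python sketch of that difference, not the simulator used in the talk. It assumes a single fully associative set per level with true LRU, and hypothetical sizes (2-line L1, 4-line LLC); the names `access` and `run` are illustrative.

```python
from collections import OrderedDict

def access(l1, llc, addr, policy, l1_size=2, llc_size=4):
    """Apply one reference to a two-level hierarchy.

    l1 and llc are OrderedDicts ordered from LRU (first) to MRU (last).
    policy is "inclusive", "non_inclusive" or "exclusive".
    """
    if addr in l1:
        l1.move_to_end(addr)           # L1 hit: only L1 replacement state changes
        return "L1 hit"
    if addr in llc:
        result = "LLC hit"
        if policy == "exclusive":
            del llc[addr]              # line moves up, no duplicate is left behind
        else:
            llc.move_to_end(addr)
    else:
        result = "miss"
        if policy != "exclusive":      # inclusive and non-inclusive fill both levels
            if len(llc) >= llc_size:
                victim, _ = llc.popitem(last=False)
                if policy == "inclusive":
                    l1.pop(victim, None)   # back-invalidate the LLC victim
            llc[addr] = True
    if len(l1) >= l1_size:
        l1_victim, _ = l1.popitem(last=False)
        if policy == "exclusive":      # LLC acts as a victim cache for the L1
            if len(llc) >= llc_size:
                llc.popitem(last=False)
            llc[l1_victim] = True
    l1[addr] = True
    return result

def run(policy, trace):
    """Replay a trace and return the final (L1, LLC) contents, LRU first."""
    l1, llc = OrderedDict(), OrderedDict()
    for addr in trace:
        access(l1, llc, addr, policy)
    return list(l1), list(llc)

if __name__ == "__main__":
    for policy in ("inclusive", "non_inclusive", "exclusive"):
        print(policy, run(policy, "abacadae"))
```

Under these assumptions, the inclusive run always keeps the L1 contents inside the LLC, the non-inclusive run can leave a line in the L1 that the LLC has already evicted, and the exclusive run holds more distinct lines than the LLC alone could.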
Performance of Non-Inclusive and Exclusive LLCs
[Figure: performance relative to baseline inclusion for four MLC:LLC size ratios, with the AMD and INTEL design points marked; 2-core CMP with 32KB L1, 256KB L2, LLC sized by ratio]
Enforcing inclusion is bad when the LLC is not significantly larger than the MLC
Why do Non-Inclusive (NI) and Exclusive LLCs perform better?
  They make use of extra cache capacity by avoiding duplication
  They avoid the problems caused by harmful back-invalidates
Which Of the Above Two Reasons Limits Performance of Inclusion?
The choice of a cache hierarchy is a function of the size of the core caches. To illustrate this, let us take a look at different ratios of core cache to LLC sizes. We assume a 2-core CMP with a 3-level hierarchy with a 32KB L1 and a 256KB L2. The x-axis shows four different ratios of MLC size to LLC size. The y-axis shows performance relative to the respective baseline inclusive cache. For example, here the LLC is 2X the MLC size, and here the LLC is 8X the MLC size. Intel is at a 1:8 ratio and builds inclusive caches. AMD is at 1:4 and builds non-inclusive caches. At 1:4 there is more benefit from non-inclusion than at 1:8.
Back-Invalidate Problem with Inclusive Caches
Inclusion Victims: lines evicted from core caches due to an LLC eviction
Small caches filter temporal locality
  Small cache hits do not update LLC LRU state
  “Hot” small cache lines become LRU in the LLC
Example Reference Pattern: … a, b, a, c, a, d, a, e, a, f…
[Figure: L1 and L2 contents, ordered MRU to LRU, after each reference in the pattern]
  Filtered temporal locality: hot lines become LRU in the LLC
  Reference ‘e’ misses and evicts ‘a’ from the hierarchy (hierarchy eviction)
  The next reference to ‘a’ misses
Now, some of you might be thinking that this problem has been studied in the past and that it is not significant. I agree. However, such studies were done in the context of single-core processors. As I will show you now, this problem is actually exacerbated on CMPs.
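
The reference pattern on this slide can be replayed in a few lines of Python. This is a sketch under stated assumptions rather than the talk's simulator: one fully associative set per level, true LRU, a 2-line L1 and a 4-line inclusive L2 acting as the LLC (sizes chosen to match the slide's example). `run_inclusive` is a hypothetical helper name.

```python
from collections import OrderedDict

def run_inclusive(trace, l1_size=2, llc_size=4):
    """Replay a trace on an inclusive two-level LRU hierarchy.

    Returns (memory_misses, inclusion_victims): the references that went to
    memory, and the lines removed from L1 by a back-invalidate.
    """
    l1, llc = OrderedDict(), OrderedDict()    # ordered LRU (first) to MRU (last)
    memory_misses, inclusion_victims = [], []
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)              # L1 hit: the LLC never sees it
            continue
        if addr in llc:
            llc.move_to_end(addr)
        else:
            memory_misses.append(addr)
            if len(llc) >= llc_size:
                victim, _ = llc.popitem(last=False)
                if l1.pop(victim, None) is not None:   # back-invalidate hit in L1
                    inclusion_victims.append(victim)
            llc[addr] = True
        if len(l1) >= l1_size:
            l1.popitem(last=False)            # L1 victim stays in the LLC
        l1[addr] = True
    return memory_misses, inclusion_victims

if __name__ == "__main__":
    misses, victims = run_inclusive("abacadaeaf")
    print("memory misses:    ", misses)
    print("inclusion victims:", victims)
```

With these sizes, `a` is the most frequently used line in the L1 yet sits at the LRU position in the LLC, so the miss on `e` removes it from the whole hierarchy and the following reference to `a` goes to memory.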
Inclusion Problem Exacerbated on CMPs!
[Figure: two cores with iL1, dL1 and L2 sharing the LLC, labeled with CCF, LLCF and LLCT applications]
CMPs allow applications with varying demands for memory to co-execute.
Types of Applications:
  Core Cache Fitting (CCF) Apps: working set fits in the core caches
  LLC Fitting (LLCF) Apps: working set fits in the LLC
  LLC Thrashing (LLCT) Apps: working set is larger than the LLC
Inclusion Problem Exacerbated on CMPs!
[Figure: a CCF application and an LLCF application on two cores with iL1, dL1 and L2, sharing the LLC]
When a CCF and an LLCF application co-execute:
  CCF apps are serviced from the L2 cache and rarely from the LLC
  Replacement state of CCF apps becomes LRU at the LLC
Inclusion Problem Exacerbated on CMPs!
[Figure: a CCF application and an LLCF application on two cores with iL1, dL1 and L2, sharing the LLC]
  CCF apps are serviced from the L2 cache and rarely from the LLC
  Replacement state of CCF apps becomes LRU at the LLC
  The LLCF app replaces the CCF working set in the LLC
  Inclusion mandates removing the CCF working set from the entire hierarchy
Performance of CCF apps significantly degrades in the presence of LLCF/LLCT apps
Eliminate “Inclusion Victims” Using Temporal Hints
Baseline policies only update replacement state at the level of the hit
Proposal: convey temporal locality in small caches to the LLC
Temporal Locality Hints (TLH): non-data requests sent to update LLC replacement state
[Figure: a core request hits in L1 and updates L1 LRU state; a TLH updates LRU state at the LLC]
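
A minimal sketch of the hint mechanism, assuming the same toy setup as the slide's earlier example (one fully associative set per level, true LRU, 2-line L1, 4-line inclusive LLC). The function name `simulate` and its `hints` flag are illustrative; with `hints=True`, every L1 hit also sends a non-data message that moves the line to MRU in the LLC.

```python
from collections import OrderedDict

def simulate(trace, hints, l1_size=2, llc_size=4):
    """Inclusive two-level LRU hierarchy, optionally with L1 temporal locality hints."""
    l1, llc = OrderedDict(), OrderedDict()    # ordered LRU (first) to MRU (last)
    stats = {"memory_misses": 0, "hints_sent": 0, "inclusion_victims": 0}
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            if hints:                         # TLH: update LLC LRU state, no data moved
                llc.move_to_end(addr)
                stats["hints_sent"] += 1
            continue
        if addr in llc:
            llc.move_to_end(addr)
        else:
            stats["memory_misses"] += 1
            if len(llc) >= llc_size:
                victim, _ = llc.popitem(last=False)
                if l1.pop(victim, None) is not None:   # back-invalidate hit in L1
                    stats["inclusion_victims"] += 1
            llc[addr] = True
        if len(l1) >= l1_size:
            l1.popitem(last=False)
        l1[addr] = True
    return stats

if __name__ == "__main__":
    trace = "abacadaeaf"
    print("baseline:", simulate(trace, hints=False))
    print("with TLH:", simulate(trace, hints=True))
```

On the pattern a, b, a, c, a, d, a, e, a, f the hinted run no longer evicts `a` from the hierarchy, at the cost of one message per L1 hit.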
Conveying ALL Temporal Locality to the LLC
[Figure: performance relative to baseline inclusion; labels: Baseline System, Baseline Inclusion]
*Our studies do not model TLH BW
Bulk of non-inclusive cache performance is from avoiding back-invalidates, NOT capacity!
Inclusive LLC management must be Temporal Locality Aware
This is an important finding. If we can ensure that the LLC is made aware of the temporal locality of lines, we can significantly bridge the performance gap between inclusive and non-inclusive caches.
Performance of L1 Temporal Locality Hints
L1 hints bridge 85% of the gap between inclusion & non-inclusion
Limitations of L1 hints: very high BW (num messages = num L1 hits)
[Figure: 2T workloads on a 1:4 hierarchy, relative to baseline inclusion; labeled values 5.2% and 6.1%]
*Our studies do not model TLH BW
Need Low Bandwidth Alternative to L1 Temporal Locality Hints
Make sure you point out the s-curve for non-inclusive and ends…
Improving Inclusive Cache Performance
Eliminate back-invalidates (i.e. build non-inclusive caches)
  Increases coherence complexity
Goal: Retain the benefits of inclusion yet avoid inclusion victims
Solution: Temporal Locality Aware (TLA) Cache Management
  Ensure the LLC DOES NOT evict “hot” lines from core caches
  Must identify LLC lines that have high temporal locality in core caches
Early Core Invalidate (ECI)
Main Idea: Derive temporal locality by removing a line early from the core caches
Early Core Invalidate (ECI):
  Send an early invalidate for the next victim in the same set
  If the line is “hot”, it will be “rescued” from the LLC
  The “rescue” updates LLC replacement state as a side effect
[Figure: miss flow through L1, L2, L3 and memory, showing the back-invalidate for the victim and the early core invalidate for the next victim; L3 set labels: d c b a e, MRU, LRU]
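
A minimal sketch of ECI on the same toy hierarchy used for the reference pattern a, b, a, c, a, d, a, e, a, f (one fully associative set per level, true LRU, 2-line L1, 4-line inclusive LLC). Two details are assumptions of this sketch and not statements from the talk: the early invalidate is only sent once the set is full, so that a "next victim" is well defined, and `simulate_eci` with its `pending` set is hypothetical bookkeeping used to count rescues.

```python
from collections import OrderedDict

def simulate_eci(trace, eci=True, l1_size=2, llc_size=4):
    """Inclusive two-level LRU hierarchy with optional Early Core Invalidate."""
    l1, llc = OrderedDict(), OrderedDict()    # ordered LRU (first) to MRU (last)
    pending = set()                           # lines removed from L1 by an early invalidate
    stats = {"memory_misses": 0, "early_invalidates": 0,
             "rescues": 0, "inclusion_victims": 0}
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            continue
        if addr in llc:
            llc.move_to_end(addr)             # the rescue updates LLC LRU state
            if addr in pending:
                pending.discard(addr)
                stats["rescues"] += 1
        else:
            stats["memory_misses"] += 1
            if len(llc) >= llc_size:
                victim, _ = llc.popitem(last=False)
                pending.discard(victim)
                if l1.pop(victim, None) is not None:   # ordinary back-invalidate
                    stats["inclusion_victims"] += 1
            llc[addr] = True
            if eci and len(llc) >= llc_size:
                next_victim = next(iter(llc)) # current LRU line of the set
                stats["early_invalidates"] += 1
                if l1.pop(next_victim, None) is not None:
                    pending.add(next_victim)  # still in the LLC, may be rescued
        if len(l1) >= l1_size:
            l1.popitem(last=False)
        l1[addr] = True
    return stats

if __name__ == "__main__":
    print("baseline:", simulate_eci("abacadaeaf", eci=False))
    print("with ECI:", simulate_eci("abacadaeaf", eci=True))
```

In this run the early invalidate removes `a` from the L1 while it is still in the LLC, the next reference to `a` hits in the LLC and moves it to MRU, and the later miss on `e` evicts a cold line instead. The rescue only works if it happens before the next miss to the same set.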
Performance of Early Core Invalidate (ECI)
ECI bridges 55% of the gap between inclusion & non-inclusion
Pros: No HW overhead, low BW (num messages = num LLC misses)
Limitations: Short time to rescue. The rescue must occur BEFORE the next miss to the set
[Figure: 2T workloads on a 1:4 hierarchy, relative to baseline inclusion; labeled values 3.4% and 6.1%]
We Can Still Do Better Than ECI…
Query Based Selection (QBS)
Main Idea: Replace lines that are NOT resident in core caches
Query Based Selection (QBS):
  LLC sends a back-inval request
  Core rejects the back-inval if the line is resident in core caches
[Figure: miss flow through L1, L2, L3 and memory; the core answers the back-invalidate request with REJECT; L3 set labels: e d c b a, MRU, LRU]
Query Based Selection (QBS)
Main Idea: Replace lines that are NOT resident in core caches
Query Based Selection (QBS):
  LLC sends a back-inval request
  Core rejects the back-inval if the line is resident in core caches
  If the core rejects, the line is updated to MRU in the LLC
  LLC repeats the back-inval process until a core accepts the request (or timeout)
[Figure: miss flow through L1, L2, L3 and memory; the core answers the back-invalidate request with ACCEPT; L3 set labels: a e d c b, MRU, LRU]
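
A minimal sketch of QBS victim selection on the same toy hierarchy (one fully associative set per level, true LRU, 2-line L1, 4-line inclusive LLC). The name `simulate_qbs` and the `max_queries` parameter are illustrative. How the timeout is resolved is an assumption of this sketch: after `max_queries` rejections the current LRU line is evicted and back-invalidated anyway, so `max_queries=0` reproduces the baseline inclusive policy.

```python
from collections import OrderedDict

def simulate_qbs(trace, max_queries=2, l1_size=2, llc_size=4):
    """Inclusive two-level LRU hierarchy with Query Based Selection of LLC victims."""
    l1, llc = OrderedDict(), OrderedDict()    # ordered LRU (first) to MRU (last)
    stats = {"memory_misses": 0, "queries": 0, "rejects": 0, "inclusion_victims": 0}
    for addr in trace:
        if addr in l1:
            l1.move_to_end(addr)
            continue
        if addr in llc:
            llc.move_to_end(addr)
        else:
            stats["memory_misses"] += 1
            if len(llc) >= llc_size:
                for _ in range(max_queries):
                    candidate = next(iter(llc))      # current LRU line
                    stats["queries"] += 1
                    if candidate not in l1:
                        break                        # core accepts the back-inval
                    stats["rejects"] += 1
                    llc.move_to_end(candidate)       # rejected: promote to MRU
                # Evict the accepted candidate, or force an eviction on timeout.
                victim, _ = llc.popitem(last=False)
                if l1.pop(victim, None) is not None:
                    stats["inclusion_victims"] += 1
            llc[addr] = True
        if len(l1) >= l1_size:
            l1.popitem(last=False)
        l1[addr] = True
    return stats

if __name__ == "__main__":
    print("baseline:", simulate_qbs("abacadaeaf", max_queries=0))
    print("with QBS:", simulate_qbs("abacadaeaf", max_queries=2))
```

On the pattern a, b, a, c, a, d, a, e, a, f the miss on `e` first queries `a`, which the core rejects because it is resident in the L1, and then evicts `b`, so `a` stays in the hierarchy.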
Performance of Query Based Selection (QBS)
QBS outperforms non-inclusion
Pros: No HW overhead, low BW (num messages = num LLC misses)
Studies show a maximum of two tries is sufficient for victim selection
[Figure: 2T workloads on a 1:4 hierarchy, relative to baseline inclusion; labeled values 6.6% and 6.1%]
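
The "gap bridged" figures quoted on the performance slides can be checked with a short calculation. This sketch assumes that the pair of percentages on each slide is the average gain over baseline inclusion of the technique and of the non-inclusive cache (6.1%) respectively; that reading is inferred from the slides, and `gap_bridged` is a hypothetical helper.

```python
NON_INCLUSIVE_GAIN = 6.1   # percent over baseline inclusion (assumed reading of the slides)

# Technique gains as labeled on the 2T, 1:4 hierarchy slides.
TECHNIQUE_GAIN = {"L1 hints": 5.2, "ECI": 3.4, "QBS": 6.6}

def gap_bridged(gain, target=NON_INCLUSIVE_GAIN):
    """Fraction of the inclusion to non-inclusion gap recovered by a technique."""
    return gain / target

if __name__ == "__main__":
    for name, gain in TECHNIQUE_GAIN.items():
        print(f"{name}: {gap_bridged(gain):.1%} of the gap")
```

Under that reading, a value above 100% for QBS corresponds to the claim that QBS outperforms non-inclusion.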
Summary of TLA Cache Management (2-core CMP)
[Figure: performance relative to baseline inclusion across cache ratios]
QBS performs similar to non-inclusive caches for all cache ratios
Make sure you say that we have gotten rid of most problems associated with back-invals. Now all that is left is capacity, and there isn’t significant benefit from capacity for the complexity tradeoff.
QBS Scalability (2-core, 4-core, 8-core CMPs)
[Figure: 2T, 4T and 8T workloads on a 1:4 hierarchy, relative to baseline inclusion]
QBS scales with the number of cores and performs similar to non-inclusive caches
Summary
Problem: the inclusive cache problem becomes WORSE on CMPs
  E.g. Core Cache Fitting + LLC Fitting/Thrashing
Conventional Wisdom: the primary benefit of a non-inclusive cache is higher capacity
We show: the primary benefit is NOT capacity but avoiding back-invalidates
Proposal: Temporal Locality Aware Cache Management
  Retains the benefit of inclusion while minimizing the back-invalidate problem
  TLA managed inclusive cache = performance of non-inclusive cache
Cache Hierarchy 101: Kinds of Cache Hierarchies
Inclusive Hierarchy: L1 subset of LLC
Non-Inclusive Hierarchy: L1 not subset of LLC
[Figure: the inclusive and non-inclusive hierarchies side by side, each showing core request, L1, LLC and memory with fill, evict, victim and BackInval arrows]
Next is a non-inclusive hierarchy, where the L1 is not necessarily a subset of the LLC. Non-inclusive hierarchies are built by simply not sending a back-invalidate when evicting from the LLC.
Eliminate “Inclusion Victims” in Inclusive Caches Using Temporal Locality Hints
Baseline policies only update replacement state at the level of the hit
Proposal: convey temporal locality in small caches to large caches
Temporal Locality Hints (TLH): non-data requests sent to update LLC replacement state
[Figure: two core requests, one hitting in L1 and one hitting in L2; each updates LRU state at the level of the hit and sends a TLH that updates LRU state at the LLC]
Performance of L2 Temporal Locality Hints
L2 hints bridge <50% of the gap between inclusion & non-inclusion
Limitations of L2 hints:
  Not as good as L1 hints
  High BW (num messages = num L2 hits)
[Figure: 2T workloads on a 1:4 hierarchy, relative to baseline inclusion; labeled values 2.8% and 6.1%]
*Our studies do not model TLH BW
Need Low Bandwidth Alternative to L1 Temporal Locality Hints
QBS Scalability (4-core and 8-core CMPs)
[Figure: 4T and 8T workloads on a 1:4 hierarchy, relative to baseline inclusion]
QBS scales with the number of cores and performs similar to non-inclusive caches