1 Utility-Based Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
Written by Moinuddin K. Qureshi and Yale N. Patt
Presented by Rubao Lee, 10/16/2008
International Symposium on Microarchitecture (MICRO) 2006

2 Introduction
[Diagram: several cores, each with its own IC and DC, connected to one shared cache]
• Applications compete for the shared cache
• Partitioning policies are critical for high performance
• Set partitioning and way partitioning

3 Way Partitioning
[Diagram: several cores, each with its own IC and DC, share a cache organized as X sets (Set 1 … Set X) of Y ways each (Way 1 … Way Y)]

4 Paper Overview
Goal, Decision, Information
Metrics? Evaluation? How? Who? When? What information? How? Who? When? Accuracy/Overhead? What decision? Accuracy/Overhead? Last Effect?
Utility-Based Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

5 Existing Cache Partitioning Policies
• Equal (half-and-half): performance isolation, no adaptation
• LRU: demand based
• Demand is not benefit! For example, a streaming application can access a large number of unique cache blocks without reuse.
Partition the cache based on how much the app is likely to benefit from the cache rather than its demand for the cache.

6 Utility reflects benefit
Applications have different memory access patterns: (1) some benefit significantly from more cache; (2) some do not; (3) some need only a fixed amount of cache.
An important concept: an application’s utility of the cache resource

7 Utility Curves of SPEC benchmarks
Utility $U_a^b$ = Misses with $a$ ways − Misses with $b$ ways
[Figure: three panels (Low Utility, High Utility, Saturating Utility); x-axis: num ways from a 16-way 1MB L2; y-axis: misses per 1000 instructions]

8 Motivating Example
[Figure: misses per 1000 instructions (MPKI) vs. num ways from a 16-way 1MB L2 for equake and vpr, with the LRU and UTIL allocations marked]
Improve performance by giving more cache to the application that benefits more from cache

9 Outline
• Introduction and Motivation
• Utility-Based Cache Partitioning
• Evaluation
• Scalable Partitioning Algorithm
• Related Work and Summary

10 Framework for UCP
Three components:
• Utility Monitors (UMON) per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
[Diagram: Core1 and Core2, each with I$ and D$, share the L2 cache in front of main memory; UMON1 and UMON2 feed the PA]

11 How to monitor Utility
Monitoring the utility information of an app requires a mechanism that tracks the number of misses for every possible number of ways.
For a 16-way cache, we can use 16 tag directories, each having the same number of sets as the shared cache, but each having a different number of ways from 1 to 16. (Note: no data lines)
[Diagram: tag directories with 1, 2, 3, …, 16 ways]

12 LRU as a stack algorithm
We Just Need One Tag Directory
1: A 4-way set-associative cache is shown in (a).
2: Each set has four counters, one for each recency position from MRU to LRU.
3: On a hit, the counter corresponding to the hit position is incremented.
4: The counters represent the number of misses saved by each recency position.

13 UMON-local vs UMON-global
[Diagram: Sets A to D under UMON-local and under UMON-global. Legend: tag entry in the Auxiliary Tag Directory (ATD); hit counter for a recency position, from MRU to LRU]
The capability of cache partitioning on a per-set basis is lost!

14 Utility Monitors (UMON)
• For each core, simulate the LRU policy using an ATD
• Hit counters in the ATD count hits per recency position
• LRU is a stack algorithm: hit counts → utility. E.g. hits(2 ways) = H0 + H1
[Diagram: the MTD and the ATD each hold Sets A to H; the ATD feeds hit counters H0 (MRU), H1, H2, …, H15 (LRU)]
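The stack property on this slide can be illustrated with a small software model. This is only a sketch of one ATD set, not the hardware design: the function names `umon_hit_counters` and `hits_with_ways` and the access trace are made up for illustration.

```python
def umon_hit_counters(accesses, num_ways):
    """Simulate one LRU-managed set and count hits per recency position.

    accesses: sequence of tags that map to this set (illustrative input).
    Returns (hits, misses); hits[i] counts hits at stack position i,
    where 0 is MRU and num_ways - 1 is LRU.
    """
    stack = []                    # stack[0] is the MRU tag
    hits = [0] * num_ways
    misses = 0
    for tag in accesses:
        if tag in stack:
            pos = stack.index(tag)
            hits[pos] += 1        # hit at recency position `pos`
            stack.pop(pos)
        else:
            misses += 1
            if len(stack) == num_ways:
                stack.pop()       # evict the LRU tag
        stack.insert(0, tag)      # the accessed tag becomes MRU
    return hits, misses


def hits_with_ways(hits, ways):
    """Stack property: hits with `ways` ways = H0 + ... + H(ways-1)."""
    return sum(hits[:ways])


hits, misses = umon_hit_counters(["A", "B", "A", "C", "B", "A"], num_ways=4)
print(hits, misses, hits_with_ways(hits, 2))
```

One pass over the trace with a 4-way directory is enough to read off the hit count for 1, 2, 3, or 4 ways, which is why a single tag directory suffices.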

15 Dynamic Set Sampling (DSS)
• Extra tags incur hardware and power overhead
• DSS reduces overhead [Qureshi+ ISCA’06]
• 32 sets sufficient (analytical bounds)
• Storage < 2kB/UMON
[Diagram: the MTD holds Sets A to H; the full ATD also holds Sets A to H with hit counters H0 (MRU), H1, H2, …, H15 (LRU); the UMON (DSS) ATD holds only the sampled Sets B, E, and G]

16 Partitioning algorithm
• Evaluate all possible partitions and select the best
• With $a$ ways to core1 and $(16-a)$ ways to core2:
  $\text{Hits}_{\text{core1}} = H_0 + H_1 + \dots + H_{a-1}$ (from UMON1)
  $\text{Hits}_{\text{core2}} = H_0 + H_1 + \dots + H_{16-a-1}$ (from UMON2)
• Select $a$ that maximizes $(\text{Hits}_{\text{core1}} + \text{Hits}_{\text{core2}})$
• Partitioning done once every 5 million cycles
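The exhaustive search for two cores can be sketched as follows. The function name, the counter values, and the choice to give each core at least one way are assumptions for illustration; the slide only specifies the sums being maximized.

```python
def best_two_core_partition(h1, h2, total_ways=16):
    """Try every split of `total_ways` between two cores.

    h1, h2: per-recency-position hit counters from UMON1 and UMON2
    (index 0 = MRU). Assumption: each core gets at least one way.
    Returns (ways for core1, combined hits).
    """
    best_a, best_hits = None, -1
    for a in range(1, total_ways):
        hits = sum(h1[:a]) + sum(h2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, best_hits


# Illustrative counters: core1 keeps gaining hits with every extra way,
# core2 gets all of its hits from its first way.
h1 = [10] * 16
h2 = [50] + [0] * 15
print(best_two_core_partition(h1, h2))
```

With these made-up counters the search gives 15 ways to core1 and 1 way to core2, since extra ways beyond the first add nothing for core2.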

17 Way Partitioning
Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04]
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app
   • If ways_occupied < ways_given: the victim is the LRU line from the other app
   • Otherwise: the victim is the LRU line from the miss-causing app
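The victim-selection rule above can be written as a short function. This is a sketch under stated assumptions: the set is full, every core has been given at least one way, and the data layout (a list of `(core_id, tag)` pairs ordered from MRU to LRU) and the name `choose_victim` are invented for illustration.

```python
def choose_victim(set_lines, miss_core, ways_given):
    """Return the index of the line to evict from one full cache set.

    set_lines: list of (core_id, tag), ordered MRU first and LRU last.
    miss_core: id of the core whose access missed.
    ways_given: dict mapping core id to its allocated number of ways.
    """
    ways_occupied = sum(1 for core, _ in set_lines if core == miss_core)
    if ways_occupied < ways_given[miss_core]:
        # Under quota: take a way from another app.
        candidates = [i for i, (core, _) in enumerate(set_lines)
                      if core != miss_core]
    else:
        # At or over quota: replace one of the app's own lines.
        candidates = [i for i, (core, _) in enumerate(set_lines)
                      if core == miss_core]
    return candidates[-1]         # the LRU-most candidate


lines = [(0, "a"), (1, "b"), (1, "c"), (0, "d")]
print(choose_victim(lines, miss_core=0, ways_given={0: 3, 1: 1}))
print(choose_victim(lines, miss_core=0, ways_given={0: 2, 1: 2}))
```

In the first call core 0 holds two lines but is entitled to three, so it evicts core 1's least recently used line; in the second it is already at its quota and evicts its own.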

19 Methodology
Configuration:
• Two cores: 8-wide, 128-entry window, private L1s
• L2: shared, unified, 1MB, 16-way, LRU-based
• Memory: 400 cycles, 32 banks
Benchmarks: two-threaded workloads divided into 5 categories. Used 20 workloads (four from each type).
[Figure: categories placed on a scale of weighted speedup for the baseline, from 1.0 to 2.0]

20 Metrics
Three metrics for performance:
1. Weighted Speedup (default metric)
   • $\text{perf} = IPC_1/SingleIPC_1 + IPC_2/SingleIPC_2$
   • correlates with reduction in execution time
2. Throughput
   • $\text{perf} = IPC_1 + IPC_2$
   • can be unfair to a low-IPC application
3. Hmean-fairness
   • $\text{perf} = \text{hmean}(IPC_1/SingleIPC_1,\ IPC_2/SingleIPC_2)$
   • balances fairness and performance
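The three metrics are simple enough to compute directly. The sketch below follows the formulas on this slide; the function names and the IPC values are made up for illustration.

```python
from statistics import harmonic_mean


def weighted_speedup(ipc, single_ipc):
    """Sum of each app's IPC relative to its IPC when running alone."""
    return sum(i / s for i, s in zip(ipc, single_ipc))


def throughput(ipc):
    """Plain sum of IPCs."""
    return sum(ipc)


def hmean_fairness(ipc, single_ipc):
    """Harmonic mean of the per-app relative IPCs."""
    return harmonic_mean([i / s for i, s in zip(ipc, single_ipc)])


# Illustrative numbers: both apps run at half of their stand-alone IPC.
ipc = [1.0, 0.5]
single_ipc = [2.0, 1.0]
print(weighted_speedup(ipc, single_ipc),
      throughput(ipc),
      hmean_fairness(ipc, single_ipc))
```

Throughput rewards the high-IPC application simply for having a larger IPC, while the two normalized metrics treat both slowdowns equally.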

21 Results for weighted speedup UCP improves average weighted speedup by 11%

22 Results for throughput UCP improves average throughput by 17%

23 Results for hmean-fairness UCP improves average hmean-fairness by 11%

24 Effect of Number of Sampled Sets
Dynamic Set Sampling (DSS) reduces overhead, not benefits
[Figure: results with 8 sets, 16 sets, 32 sets, and all sets]

25 Hardware Overhead of UCP L2 Cache: 1MB, 16-way, 64B cache lines

27 Scalability issues
• Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways)
• Possible partitions increase exponentially with cores
• For a 32-way cache, possible partitions:
  • 4 cores → 6545
  • 8 cores → 15.4 million
• Problem is NP-hard → need a scalable partitioning algorithm
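The partition counts quoted here match the standard "stars and bars" count of ways to split identical ways among cores when a core may receive zero ways. That counting model is an inference from the numbers, not something the slide states; the function name is hypothetical.

```python
from math import comb


def num_partitions(ways, cores):
    """Number of ways to split `ways` identical ways among `cores` cores,
    allowing a core to receive zero ways (stars and bars)."""
    return comb(ways + cores - 1, cores - 1)


print(num_partitions(32, 4))   # 4 cores sharing a 32-way cache
print(num_partitions(32, 8))   # 8 cores sharing a 32-way cache
print(num_partitions(16, 2))   # two cores: about the number of ways
```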

28 Greedy Algorithm [Stone+ ToC ’92]
• GA allocates 1 block to the app that has the max utility for one block. Repeat till all blocks are allocated
• Optimal partitioning when utility curves are convex
• Pathological behavior for non-convex curves
[Figure: x-axis: num ways from a 32-way 2MB L2; y-axis: misses per 100 instructions]

29 Greedy Algorithm [Stone+ ToC ’92]

30 Problem with Greedy Algorithm
In each iteration, the utility for 1 block: U(A) = 10 misses, U(B) = 0 misses
All blocks are assigned to A, even if B has the same miss reduction with fewer blocks
Problem: GA considers the benefit only from the immediate block. Hence it fails to exploit huge gains that lie further ahead
[Figure: misses vs. blocks assigned for A and B]
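The failure mode can be reproduced with a small sketch of the greedy allocator. The miss curves are invented to match the slide's description (A saves 10 misses per block; B saves nothing until it has 3 blocks, then saves 80), and the function names are hypothetical.

```python
def one_block_utility(curve, have):
    """Misses saved by going from `have` to `have + 1` blocks."""
    if have + 1 >= len(curve):
        return 0
    return curve[have] - curve[have + 1]


def greedy_partition(miss_curves, total_blocks):
    """Give one block at a time to the app with the largest one-block utility.

    miss_curves: dict mapping app name to a list where entry k is the
    number of misses with k blocks.
    """
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_blocks):
        winner = max(
            alloc,
            key=lambda app: one_block_utility(miss_curves[app], alloc[app]),
        )
        alloc[winner] += 1
    return alloc


# Illustrative curves for 0..8 blocks.
curves = {
    "A": [100 - 10 * k for k in range(9)],   # convex: 10 misses per block
    "B": [80, 80, 80, 0, 0, 0, 0, 0, 0],     # non-convex: cliff at 3 blocks
}
alloc = greedy_partition(curves, total_blocks=8)
print(alloc, sum(curves[app][n] for app, n in alloc.items()))
```

Because B's next single block never looks useful, A receives all 8 blocks and the combined miss count stays at 100.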

31 Lookahead Algorithm
• Marginal Utility (MU) = utility per cache resource: $MU_a^b = U_a^b/(b-a)$
• GA considers MU for 1 block. LA considers MU for all possible allocations
• Select the app that has the max value for MU. Allocate it as many blocks as required to get max MU
• Repeat till all blocks are assigned

32 Lookahead Algorithm

33 Lookahead Algorithm (example)
Time complexity ≈ $\text{ways}^2/2$ (512 ops for 32 ways)
Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks → B gets 3 blocks
Next five iterations: MU(A) = 10/1 block, MU(B) = 0 → A gets 1 block
Result: A gets 5 blocks and B gets 3 blocks (Optimal)
[Figure: misses vs. blocks assigned]
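The iterations in this example can be traced with a simplified sketch of the lookahead allocator. The miss curves are invented to match the slide's numbers, the function names are hypothetical, and handing any leftover blocks to an arbitrary app when nobody benefits is an assumption of this sketch.

```python
def max_marginal_utility(curve, have, balance):
    """Best MU over all extensions of 1..balance blocks.

    Returns (best MU, number of blocks needed to reach it).
    """
    best_mu, best_n = 0.0, 0
    for n in range(1, balance + 1):
        if have + n >= len(curve):
            break
        mu = (curve[have] - curve[have + n]) / n
        if mu > best_mu:
            best_mu, best_n = mu, n
    return best_mu, best_n


def lookahead_partition(miss_curves, total_blocks):
    """Repeatedly give the app with the highest MU the blocks it needs."""
    alloc = {app: 0 for app in miss_curves}
    balance = total_blocks
    while balance > 0:
        best = {app: max_marginal_utility(miss_curves[app], alloc[app], balance)
                for app in alloc}
        winner = max(best, key=lambda app: best[app][0])
        _, blocks = best[winner]
        if blocks == 0:
            # Nobody benefits from more cache (assumption: give the
            # remainder to the first app in iteration order).
            alloc[winner] += balance
            break
        alloc[winner] += blocks
        balance -= blocks
    return alloc


curves = {
    "A": [100 - 10 * k for k in range(9)],   # 10 misses saved per block
    "B": [80, 80, 80, 0, 0, 0, 0, 0, 0],     # 80 misses saved at 3 blocks
}
alloc = lookahead_partition(curves, total_blocks=8)
print(alloc, sum(curves[app][n] for app, n in alloc.items()))
```

The first iteration sees MU(B) = 80/3 beating MU(A) = 10, so B receives 3 blocks at once, and A then collects the remaining 5 one at a time.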

34 Results for partitioning algorithms
Four cores sharing a 2MB 32-way L2
[Figure: LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) compared on Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), and Mix4 (mcf-art-eqk-wupw)]
LA performs similarly to EvalAll, with low time complexity

36 Related work
[Chart: performance (low to high) vs. overhead (low to high)]
• Suh+ [HPCA’02]: Perf += 4%, Storage += 32B/core
• Zhou+ [ASPLOS’04]: Perf += 11%, Storage += 64kB/core
• UCP: Perf += 11%, Storage += 2kB/core
UCP is both high-performance and low-overhead

37 Summary
• CMP and shared caches are common
• Partition shared caches based on utility, not demand
• UMON estimates utility at runtime with low overhead
• UCP improves performance:
  • Weighted speedup by 11%
  • Throughput by 17%
  • Hmean-fairness by 11%
• Lookahead algorithm is scalable to many cores sharing a highly associative cache

38 Questions

39 DSS Bounds with Analytical Model
Us = sampled mean (num ways allocated by DSS)
Ug = global mean (num ways allocated by Global)
P = P(Us within 1 way of Ug)
By Chebyshev's inequality: $P \ge 1 - \text{variance}/n$, where $n$ = number of sampled sets
In general, variance ≤ 3
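As a quick arithmetic check of this bound, taking the worst-case variance of 3 quoted on the slide and the sampled-set counts used in the evaluation:

```latex
P \ge 1 - \frac{\text{variance}}{n} \ge 1 - \frac{3}{n}:
\qquad n = 8:\ 1 - \tfrac{3}{8} = 0.625,
\qquad n = 16:\ 1 - \tfrac{3}{16} = 0.8125,
\qquad n = 32:\ 1 - \tfrac{3}{32} = 0.90625
```

Under that assumption, 32 sampled sets place the sampled allocation within one way of the global allocation with probability of at least about 0.9.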

40 Galgel – concave utility
[Figure: utility curves for galgel, twolf, and parser]
