Scavenger: A New Last Level Cache Architecture with Global Block Priority
Arkaprava Basu, IIT Kanpur
Nevin Kirman, Cornell
Mainak Chaudhuri, IIT Kanpur
Meyrem Kirman, Cornell
Jose F. Martinez, Cornell
Talk in one slide
Observation #1: a large number of blocks miss repeatedly in the last-level cache
Observation #2: the number of evictions between an eviction-reuse pair is too large to be captured by a conventional fully associative victim file
How to exploit this temporal behavior with such a “large period”?
– Our solution prioritizes blocks evicted from the last-level cache by their miss frequencies
– The top k most frequently missing blocks are scavenged and retained in a fast k-entry victim file
Sketch
Observations and hypothesis
Scavenger overview (Contributions)
Scavenger architecture
– Frequency estimator
– Priority queue
– Victim file
Simulation environment
Simulation results
Related work
Summary
Observations and hypothesis (chart): breakdown of ROB stall cycles (%) by how many times the stalling block address repeats in the last-level cache miss stream (1, 2-9, 10-99, 100-999, >= 1000 occurrences), shown for 512 KB 8-way and 1 MB 8-way caches across the SPEC 2000 applications.
Observations and hypothesis (chart): distribution of the number of evictions between an eviction-reuse pair; the annotations contrast the reach of a conventionally sized victim cache (“too small”) with the capacity one would wish for, which is too large for a fully associative structure (“Wish, but too large (FA?)”).
Observations and hypothesis
Block addresses repeat in the miss address stream of the last-level cache
Repeating block addresses in the miss stream cause significant ROB stall
Hypothesis: identifying and retaining the most frequently missing blocks in a victim file should be beneficial, but ...
The number of evictions between an eviction-reuse pair is very large
– The temporal behavior happens at too large a scale to be captured by any reasonably sized fully associative victim file
Sketch: next, Scavenger overview (Contributions)
Scavenger overview (Contributions)
Functional requirements
– Determine the frequency of occurrence of an evicted block address in the miss stream seen so far
– Determine (preferably in O(1) time) the minimum frequency among the top k most frequently missing blocks; if the frequency of the current block is greater than or equal to this minimum, replace the minimum, insert the new frequency, and recompute the minimum quickly
– Allocate a new block in the victim file by replacing the minimum-frequency block, irrespective of the addresses of these blocks
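A minimal software sketch of these three requirements follows, assuming an exact counter in place of the Bloom-filter estimator (bumped, for simplicity, at eviction time) and Python's heapq in place of the pipelined hardware heap; the class and method names are illustrative, not the paper's interfaces, and corner cases such as an address already resident in the victim file are ignored. The heap root gives the minimum priority in O(1) and a replacement costs O(log k).

import heapq
from collections import defaultdict

class ScavengerSketch:
    def __init__(self, k):
        self.k = k                             # victim-file capacity in blocks
        self.miss_count = defaultdict(int)     # stand-in for the Bloom-filter frequency estimator
        self.heap = []                         # min-heap of (frequency, block address)
        self.victim_file = {}                  # block address -> data

    def on_l2_eviction(self, addr, data):
        # Requirement 1: track how often this address has appeared in the miss stream so far
        self.miss_count[addr] += 1
        freq = self.miss_count[addr]
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (freq, addr))
            self.victim_file[addr] = data
        elif freq >= self.heap[0][0]:          # Requirement 2: O(1) read of the current minimum
            # Requirement 3: replace the minimum-frequency block, whatever its address
            _, min_addr = heapq.heapreplace(self.heap, (freq, addr))
            self.victim_file.pop(min_addr, None)
            self.victim_file[addr] = data
        # otherwise the evicted block is not scavenged and simply leaves the cache hierarchy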
Scavenger overview (L2 eviction): the evicted block address probes the Bloom filter, which returns its estimated miss frequency; this frequency is compared with the minimum priority at the root of the min-heap, and if it is greater than or equal to that minimum, the minimum entry is replaced and the evicted block is allocated in the victim file; evictions that are not scavenged proceed to the memory controller.
Scavenger overview (L1 miss): the L1 miss address looks up the L2 tag array and the victim file; on a victim file hit, the block is de-allocated from the victim file and moved into the main L2 cache, while a miss in both structures is sent to the memory controller.
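Continuing the same style of sketch, the hit path below assumes dict-based stand-ins for the L2 cache and victim file, and a hypothetical heap_priority map in place of the per-entry back pointer into the heap; the real structures are described in the architecture section.

def on_l1_miss(addr, l2_cache, victim_file, heap_priority):
    """Service an L1 miss; return the block data if found on-chip, else None (go to memory)."""
    if addr in l2_cache:                  # ordinary L2 hit
        return l2_cache[addr]
    if addr in victim_file:               # victim-file hit
        data = victim_file.pop(addr)      # de-allocate the VF entry ...
        l2_cache[addr] = data             # ... and move the block into the main L2 cache
        heap_priority[addr] = 0           # zero the entry's heap priority via its back pointer
        return data
    return None                           # miss in both: request goes to the memory controller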
Sketch: next, Scavenger architecture (frequency estimator, priority queue, victim file)
Miss frequency estimator (figure): the block address is split into bit fields ([14:0], [18:9], [24:19], [22:15], [25:23]) that index five Bloom-filter counter tables (BF0 through BF4); the frequency estimate is the minimum of the counts read from the tables.
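A hedged, count-min-style sketch of such an estimator: one counter table per address field, updated on every tracked miss and read by taking the minimum across tables, so aliasing can only inflate an estimate. The field boundaries follow the slide; the update policy and counter widths are illustrative assumptions, not the paper's exact configuration.

class MissFrequencyEstimator:
    # (start bit, width) pairs taken from the slide: [14:0], [18:9], [24:19], [22:15], [25:23]
    FIELDS = ((0, 15), (9, 10), (19, 6), (15, 8), (23, 3))

    def __init__(self):
        self.tables = [[0] * (1 << width) for _, width in self.FIELDS]

    def _slots(self, block_addr):
        for (start, width), table in zip(self.FIELDS, self.tables):
            yield table, (block_addr >> start) & ((1 << width) - 1)

    def update(self, block_addr):
        for table, idx in self._slots(block_addr):
            table[idx] += 1               # hardware would use small saturating counters

    def estimate(self, block_addr):
        return min(table[idx] for table, idx in self._slots(block_addr))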
Priority queue (figure): a min-heap laid out as an array of nodes (T1 through T15 in the example), each holding a priority and a pointer (VPTR) to its victim-file tag entry; for node i, the left child sits at index i << 1 and the right child at (i << 1) | 1, so the minimum priority is always at the root (T1).
Pipelined min-heap
Both insertion and de-allocation require O(log k) steps for a k-entry heap
– Each step involves read, comparison, and write operations; step latency: r+c+w cycles
– A latency of (r+c+w)log(k) cycles is too high to cope with bursty cache misses
– Both insertion and de-allocation must be pipelined
– We unify insertion and de-allocation into a single pipelined operation called replacement
De-allocation is the same as a zero insertion
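A sequential software sketch of one replacement, using the child-index formulas and example heap contents from the preceding figure: the new priority overwrites the root (the current minimum) and percolates down level by level. The hardware version pipelines the per-level read/compare/write steps so successive replacements can follow one another down the heap, and, per the slide, a de-allocation reuses the same datapath with a priority of zero; the heap here holds bare priorities and omits the VPTR field.

def heap_replace(heap, new_priority, k):
    """heap: 1-indexed list of priorities (heap[0] unused); k: number of valid entries."""
    i = 1
    heap[i] = new_priority                 # overwrite the minimum at the root
    while True:
        left, right = i << 1, (i << 1) | 1 # child indices as in the heap figure
        smallest = i
        if left <= k and heap[left] < heap[smallest]:
            smallest = left
        if right <= k and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:                  # the new priority has found its level
            return
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest                       # continue percolating down

# Example with the top of the heap from the figure: 20 displaces the old minimum (5).
h = [None, 5, 6, 6, 13, 11, 9, 15]
heap_replace(h, 20, 7)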
Pipelined heap replacement (worked example, four slides): a new entry with priority 20 enters at the root, displacing the old minimum of 5, and percolates down one level per pipeline stage; each stage performs a read (R), compare (C), and write (W) on its level of the heap, so a following replacement can enter the pipeline before the previous one has reached the leaves.
Victim file
Functional requirements
– Should be able to replace a block of minimum priority with a block of higher or equal priority, irrespective of addresses (fully associative functionality)
– Should offer fast lookup (a conventional fully associative array won't do)
– On a hit, should de-allocate the block and move it to the main L2 cache (different from conventional victim caches)
Victim file organization
Tag array
– Direct-mapped hash table with collisions (i.e., conflicts) resolved by chaining
– Each tag entry contains an upstream (toward head) and a downstream (toward tail) pointer, and a head (H) and a tail (T) bit
– A victim file lookup at address A walks the tag list sequentially, starting at the direct-mapped index of A
Each tag lookup has a latency equal to that of a direct-mapped cache of the same size
– A replacement delinks the replaced tag from its list and links it into the list of the new tag
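A minimal sketch of this chained, direct-mapped tag store, assuming a power-of-two number of entries and 64-byte blocks; Python lists stand in for the pointer-linked chains, so the upstream/downstream pointers, head/tail bits, and O(1) delinking on replacement are not modeled.

BLOCK_OFFSET_BITS = 6                     # 64-byte blocks

class VictimFileTags:
    def __init__(self, k):
        self.k = k                        # number of tag entries (assumed a power of two)
        self.chains = [[] for _ in range(k)]

    def _index(self, addr):
        return (addr >> BLOCK_OFFSET_BITS) & (self.k - 1)   # direct-mapped home index

    def lookup(self, addr):
        tag = addr >> BLOCK_OFFSET_BITS
        for entry in self.chains[self._index(addr)]:        # walk the chain; common case is one step
            if entry == tag:
                return True
        return False

    def insert(self, addr):
        self.chains[self._index(addr)].append(addr >> BLOCK_OFFSET_BITS)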
Victim file lookup (figure): the request address A is hashed to its direct-mapped index, (A >> BO) & (k-1), in the k-entry tag array, and the chain is walked from head toward tail until a hit or an invalid entry terminates the search; on a hit, the matching entry's heap node is given zero priority, which requires each victim file entry to keep a back pointer into the heap.
Sketch: next, Simulation environment
Simulation environment
Single-stream evaluation in this paper
Configurations differ only in the L2 cache architecture
Common attributes (more in the paper)
– 4 GHz, 4-4-6 pipe, 128-entry ROB, 160-entry integer/FP register files
– L1 caches: 32 KB/4-way/64B/LRU/0.75 ns
– L2 cache miss latency (load-to-use): 121 ns
– 16-stream stride prefetcher between the L2 cache and memory with a maximum stride of 256B
– Applications: 1 billion representative dynamic instructions from sixteen SPEC 2000 applications (results discussed for nine memory-bound applications; the rest are in the paper)
Simulation environment
L2 cache configurations
– Baseline: 1 MB/8-way/64B/LRU/2.25 ns/15.54 mm²
– Scavenger: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB VF (8192 entries x 64 B/entry)/0.5 ns, 0.75 ns + auxiliary data structures (8192-entry priority queue, BFs, pointer RAMs)/0.5 ns; 16.75 mm² total
– 16-way: 1 MB/16-way/64B/LRU/2.75 ns/26.4 mm²
– 512KB-FA-VC: 512 KB/8-way/64B/LRU/2 ns conventional L2 cache + 512 KB/FA/64B/Random/3.5 ns conventional VC
Sketch: next, Simulation results
Victim file characteristics
Number of tag accesses per L1 cache miss request
– Mean below 1.5 for 14 applications
– Mode (the common case) is one for 15 applications, which therefore enjoy direct-mapped latency
– More than 90% of requests require at most three accesses for 15 applications
Performance (Speedup) (chart, higher is better): speedups over the 1 MB 8-way baseline on the nine memory-bound applications (wupwise, swim, applu, vpr, art, mcf, equake, ammp, twolf); averages over (these nine, all sixteen) applications are (1.01, 1.00) for 16-way, (1.01, 1.01) for 512KB-FA-VC, and (1.14, 1.08) for Scavenger, with a best case of 1.63 for Scavenger.
Performance (L2 cache misses) (chart, lower is better): misses normalized to the 1 MB 8-way baseline on the nine memory-bound applications; averages over (these nine, all sixteen) applications are (0.98, 0.98) for 16-way, (0.94, 0.96) for 512KB-FA-VC, and (0.85, 0.90) for Scavenger.
Sketch: next, Related work and Summary
L2 cache misses in recent proposals (chart, lower is better, normalized to the baseline, nine memory-bound applications): DIP [ISCA'07] averages 0.84 and beats Scavenger in art and mcf only; V-way [ISCA'05] averages 0.87 and beats Scavenger only in ammp; Scavenger averages 0.84 with improvement across the board, its main bottleneck being the Bloom filters.
Summary of Scavenger
A last-level cache architecture with algorithms to discover global block priority
Divides the storage into a conventional set-associative cache and a large, fast VF offering the functionality of a fully associative VF without using any CAM
Insertion into the VF is controlled by a priority queue backed by a cache block miss frequency estimator
Offers an IPC improvement of up to 63%, and 8% on average, for a set of sixteen SPEC 2000 applications
Scavenger: A New Last Level Cache Architecture with Global Block Priority
Arkaprava Basu, IIT Kanpur
Nevin Kirman, Cornell
Mainak Chaudhuri, IIT Kanpur
Meyrem Kirman, Cornell
Jose F. Martinez, Cornell
THANK YOU!