1
Multi-Lookahead Offset Prefetching
Mehran Shakerinava, Mohammad Bakhshalipour, Pejman Lotfi-Kamran, Hamid Sarbazi-Azad. Presenter: Farid Samandi (Stony Brook University). The Third Data Prefetching Championship (DPC3), in conjunction with ISCA 2019.
2
Offset Prefetching
Maintains a set of prefetching offsets. Adds these offsets to memory access addresses to produce prefetch targets. Periodically updates the prefetching offsets based on observed memory access patterns; offset prefetchers differ in how they implement this part.
I'll start with a brief introduction to offset prefetching in general, and later I'll demonstrate how MLOP works in particular. Bullet #3: Offset prefetching assumes that, most of the time, access patterns in successive periods (sometimes called rounds or epochs) will be similar. Offset prefetchers differ in the mechanism they employ for selecting their set of prefetching offsets; the goal is to identify offsets that offer timely predictions and high miss coverage. A sketch of the basic mechanism follows below.
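To make the first two bullets concrete, here is a minimal C++ sketch of a generic offset prefetcher. It assumes a 64-byte block size; OffsetPrefetcher, on_access, and issue_prefetch are illustrative names rather than MLOP's actual interface, and the offset-update step is deliberately left out, since that is exactly where offset prefetchers differ.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint64_t kBlockSize = 64;  // cache-block size in bytes (assumed)

class OffsetPrefetcher {
public:
    explicit OffsetPrefetcher(std::vector<int64_t> offsets)
        : offsets_(std::move(offsets)) {}

    // On each access, add every offset (in block units) to the accessed
    // block number to produce a prefetch target.
    void on_access(uint64_t addr) {
        uint64_t block = addr / kBlockSize;
        for (int64_t off : offsets_)
            issue_prefetch((block + off) * kBlockSize);
    }

private:
    std::vector<int64_t> offsets_;  // current set of prefetching offsets

    void issue_prefetch(uint64_t target) {
        // Stand-in for the cache's prefetch port.
        std::printf("prefetch 0x%llx\n", (unsigned long long)target);
    }
};

int main() {
    OffsetPrefetcher pf({+3, +6});  // the offsets from the example on the next slide
    pf.on_access(0x1000);           // block 0x40 accessed: prefetch blocks 0x43 and 0x46
}
```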
3
Example Prefetching Offsets = {+3, +6}
4
MLOP Offset Selection
MLOP scores candidate offsets based on their coverage potential. There is an offset score vector associated with each “lookahead level.” Lookahead level N counts a miss as covered only if it occurs at least N timesteps after the access that would have predicted it. Select the highest-scoring offset from each lookahead level, ignoring repeated offsets; a sketch of this selection step follows below.
Bullet #3: Each cache miss or prefetch hit can be considered a “timestep” in the context of prefetching. We’ll see an example of how offset scores are calculated soon…
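A minimal sketch of the selection step, assuming one score vector per lookahead level; select_offsets and its signature are illustrative, and the score thresholds described later in the talk are omitted here.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Returns the chosen offsets in lookahead-level order (level 1 first).
// scores[level][i] is the score of candidate_offsets[i] at that level.
std::vector<int> select_offsets(const std::vector<std::vector<int>>& scores,
                                const std::vector<int>& candidate_offsets) {
    std::vector<int> chosen;
    std::set<int> seen;
    for (const auto& level_scores : scores) {
        auto best = std::max_element(level_scores.begin(), level_scores.end());
        int offset = candidate_offsets[best - level_scores.begin()];
        if (seen.insert(offset).second)  // skip offsets already chosen at a smaller level
            chosen.push_back(offset);
    }
    return chosen;
}
```

Because the result preserves level order, offsets obtained from smaller lookahead levels come first, which is the prefetch order MLOP uses.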
5
Example Prefetching Offsets = (+3, +6)
In this example, offsets +1 through +8 have been evaluated and the lookahead levels are 1, 2, 3, and 4. The highest-scoring offset is selected from each level, resulting in the prefetching offset set {+3, +6}. Note that even though there are 4 lookahead levels, only 2 offsets were selected. This tends to happen when the lookahead levels are close together, as they are here (1, 2, 3, 4), so the number of lookahead levels is usually considerably larger than the prefetch degree (unless the lookahead levels are far apart, like 1, 9, 17, …). The set of prefetching offsets is actually an ordered set in MLOP: offsets obtained from smaller lookahead levels are prefetched first, hence (+3, +6). A worked run of the selection sketch follows below.
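A worked usage of select_offsets() from the previous sketch, with hypothetical scores (the slide's actual numbers are not reproduced here) chosen so that levels 1 and 2 peak at +3 and levels 3 and 4 peak at +6, mirroring the example: four lookahead levels, but only two distinct offsets.

```cpp
#include <cassert>
#include <vector>
// Assumes select_offsets() from the previous sketch.

int main() {
    std::vector<int> candidates = {+1, +2, +3, +4, +5, +6, +7, +8};
    std::vector<std::vector<int>> scores = {
        {2, 3, 9, 4, 1, 5, 2, 1},  // level 1: best score is at +3
        {1, 2, 8, 3, 2, 6, 1, 0},  // level 2: +3 again (repeat, dropped)
        {0, 1, 4, 2, 3, 9, 2, 1},  // level 3: best score is at +6
        {0, 0, 3, 1, 2, 8, 3, 1},  // level 4: +6 again (repeat, dropped)
    };
    std::vector<int> chosen = select_offsets(scores, candidates);
    assert((chosen == std::vector<int>{+3, +6}));  // ordered: (+3, +6)
}
```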
6
Access Map Table
MLOP uses an Access Map Table (AMT) to track the state of memory blocks. Each entry is a vector of states pertaining to a contiguous segment of memory. We’ve added a queue to each entry that stores the M - 1 most recent accesses, where M is the highest lookahead level.
Bullet #1: States can be {INIT (not accessed or prefetched), ACCESS, PREFETCH L1/L2/L3}; we’ll assume states are only ACCESS/INIT for demonstration. Bullet #2: The AMT is a set-associative structure. Because MLOP sits in L1 and the L1 cache is small, only a few blocks will have states other than INIT, and thus a small AMT (<1 KB) is usually enough to track the state of all blocks, unless access patterns are extremely sparse. Bullet #3: A queue is simply implemented as a shift register in hardware. You’ll see why we’ve added the queues in the next few slides; a sketch of one entry follows below.
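A sketch of one AMT entry under the settings used later in the talk (64-block segments, M = 16); the struct layout and names are illustrative, and a std::deque stands in for what would be a shift register in hardware.

```cpp
#include <array>
#include <cstdint>
#include <deque>

// Block states tracked by the AMT; the PREFETCH_* states record which
// level a prefetch was filled into.
enum class BlockState : uint8_t { INIT, ACCESS, PREFETCH_L1, PREFETCH_L2, PREFETCH_L3 };

constexpr int kBlocksPerSegment = 64;  // one 4 KB segment of 64-byte blocks
constexpr int kMaxLookahead = 16;      // M, the highest lookahead level

struct AMTEntry {
    uint64_t tag = 0;                                 // segment (page) number
    std::array<BlockState, kBlocksPerSegment> map{};  // per-block state, all INIT initially
    std::deque<int> recent;                           // the M - 1 most recently accessed block indexes

    void record_access(int block_index) {
        recent.push_front(block_index);               // newest access at the front
        if ((int)recent.size() > kMaxLookahead - 1)
            recent.pop_back();                        // shift the oldest index out
        map[block_index] = BlockState::ACCESS;
    }
};
```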
7
Example We’ll focus on one AMT entry in this example.
Blocks 0, 1, 3, 4, and 6 are accessed in order. Un-marking the block indexes stored in the queue, one at a time starting from the most recent, gives us a recent history of the access map with length M (the highest lookahead level); a sketch of this reconstruction follows below.
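A sketch of that reconstruction, assuming the AMTEntry from the previous sketch. It is meant to run when a new access arrives, before record_access() marks it, so maps[0] is the current map and each un-marking steps the map back by one timestep.

```cpp
#include <array>
#include <vector>
// Assumes BlockState, kBlocksPerSegment, and AMTEntry from the previous sketch.

using AccessMap = std::array<BlockState, kBlocksPerSegment>;

// Rebuilds the recent access-map history of one entry: with a full queue
// of M - 1 indexes, the returned history has length M, where maps[k] is
// the access map as it was k timesteps ago.
std::vector<AccessMap> history(const AMTEntry& e) {
    std::vector<AccessMap> maps;
    AccessMap map = e.map;
    maps.push_back(map);              // the map as of now
    for (int idx : e.recent) {        // most recent access first
        map[idx] = BlockState::INIT;  // un-mark it: one timestep earlier
        maps.push_back(map);
    }
    return maps;
}
```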
8
MLOP Score Update
We’ll use our access-map history to update all of the score vectors: the first map (the original access map without any un-marking) updates lookahead level 1, the second map updates lookahead level 2, and so on. Note that the score vector has been reversed.
Assume that block number 7 has just been accessed. We shift the maps so that the recently accessed block is aligned with offset 0 (where it would have been if it were present); in this case, we shift the maps one block to the right. Observe that now, every map block is aligned with the offset that could have successfully predicted the recent access, and the score for each offset that is aligned with an accessed block is incremented.
This update is performed a fixed number of times; then the prefetching offsets are selected, the scores are reset, and the whole process repeats. A sketch of the update, written with index arithmetic in place of the shifts, follows below.
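A sketch of the per-access score update, assuming history() and the constants from the earlier sketches. Rather than literally shifting and reversing arrays as the slide's animation does, it computes the same alignment arithmetically: a block b marked in maps[k] would have predicted the current access via offset (accessed_block - b), at least k + 1 timesteps in advance, so that offset's score at lookahead level k + 1 is incremented.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>
// Assumes BlockState, kBlocksPerSegment, AccessMap, AMTEntry, and
// history() from the previous sketches.

constexpr int kNumLevels = 16;  // lookahead levels 1..16
constexpr int kMaxOffset = 63;  // candidate offsets in [-63, +63]

// scores[k][offset + kMaxOffset] is the score of `offset` at lookahead level k + 1.
int scores[kNumLevels][2 * kMaxOffset + 1] = {};

// Called when accessed_block in entry's segment is accessed, before the
// access is recorded into the entry.
void update_scores(const AMTEntry& entry, int accessed_block) {
    std::vector<AccessMap> maps = history(entry);
    int levels = std::min<int>(kNumLevels, (int)maps.size());
    for (int k = 0; k < levels; ++k) {
        for (int b = 0; b < kBlocksPerSegment; ++b) {
            if (maps[k][b] != BlockState::ACCESS) continue;
            int offset = accessed_block - b;  // the offset that would have predicted this access
            if (offset != 0 && std::abs(offset) <= kMaxOffset)
                scores[k][offset + kMaxOffset]++;
        }
    }
}
```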
9
MLOP’s DPC3 Settings
Sits in L1 and prefetches into L1 and L2. A simple next-line prefetcher sits in L2 and L3. The Access Map Table divides memory into 4 KB segments (64 blocks, i.e., one page). Evaluates all possible offsets ([-63, +63]). Uses lookahead levels 1 to 16.
Bullet #1: We observed that the highest performance was achieved in this setting. Bullet #2: The prefetchers at L2 and L3 observe highly irregular access patterns because of the prefetcher in L1; therefore, we decided to use a simple static prefetcher such as a next-line prefetcher. Bullet #4: Because it’s a competition, we made no attempt to optimize this part. Otherwise, one could evaluate only a subset of these offsets; for example, the Best-Offset prefetching paper suggests evaluating offsets whose prime factorization consists only of 2, 3, and 5 (primes no greater than 5). Bullet #5: The prefetch degree is much less than 16; on average, fewer than 5 offsets are selected each round. The rest are repeated offsets or offsets that don’t have a high enough score (the next slide explains the score-thresholding mechanism).
10
MLOP’s DPC3 Settings
Updates scores 500 times in each round. Selected offsets with a score above 200 prefetch into L1, and selected offsets with a score above 150 (but not above 200) prefetch into L2. Does not prefetch into L3. Large Access Map Table.
Bullet #3: We observed no performance benefit from prefetching into L3. Bullet #4: As mentioned earlier, a small Access Map Table also achieves high performance, but since this is a competition, we made no attempt to optimize storage. Even a very large AMT (~8.5 KB, large compared with the L1 cache) is much smaller than DPC3’s storage limit (64 KB). A sketch of the score-threshold routing follows below.
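A small sketch of the score-threshold routing just described: after each 500-update round, a selected offset's score decides which level its prefetches are filled into. FillLevel and route_by_score are illustrative names, not from the MLOP source.

```cpp
enum class FillLevel { NONE, L2, L1 };

constexpr int kUpdatesPerRound = 500;  // score updates per round
constexpr int kL1Threshold = 200;      // score > 200: prefetch into L1
constexpr int kL2Threshold = 150;      // 150 < score <= 200: prefetch into L2

FillLevel route_by_score(int score) {
    if (score > kL1Threshold) return FillLevel::L1;
    if (score > kL2Threshold) return FillLevel::L2;
    return FillLevel::NONE;  // score too low: the offset is dropped this round
}
```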
11
Performance Evaluation
We compared MLOP with the Best Offset Prefetcher (BOP), the Aggregate Stride Prefetcher (ASP), and the Signature Path Prefetcher (SPP). All prefetchers sit in L1, and MLOP uses no next-line prefetchers in this comparison. On these SPEC2017 workloads, MLOP achieves the highest IPC; after MLOP, SPP performs best. MLOP achieves a 39% performance improvement on average on single-core workloads, which is 5% higher than SPP. On multi-core workloads, MLOP achieves an average performance improvement of 21%, which is 3% higher than SPP. With next-line prefetchers added at L2 and L3 (not shown here), MLOP achieves a 40% performance improvement on single-core workloads and a 23% improvement on multi-core workloads.
12
Thank you!