Multi-Lookahead Offset Prefetching

Multi-Lookahead Offset Prefetching
Mehran Shakerinava, Mohammad Bakhshalipour, Pejman Lotfi-Kamran, Hamid Sarbazi-Azad
Presenter: Farid Samandi (Stony Brook University)
The Third Data Prefetching Championship (DPC3), in conjunction with ISCA 2019

Offset Prefetching
- Maintains a set of prefetching offsets.
- Adds offsets to memory access addresses to produce prefetching targets.
- Periodically updates the prefetching offsets based on observed memory access patterns. Offset prefetchers differ in how they implement this part.

I'll start with a brief introduction to offset prefetching in general, and later demonstrate how MLOP works in particular. Bullet #3: offset prefetching assumes that, most of the time, access patterns in successive periods (sometimes called rounds or epochs) will be similar. Offset prefetchers differ in the mechanism they employ for selecting their set of prefetching offsets; the goal is to identify offsets that offer timely predictions and high miss coverage.

Example
Prefetching Offsets = {+3, +6}: each access to block X triggers prefetches for blocks X+3 and X+6.
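To make the mechanism from the previous two slides concrete, here is a minimal C++ sketch of the basic offset-prefetching action, using the {+3, +6} offsets from this example. The names (OffsetPrefetcher, on_access, issue_prefetch) are illustrative assumptions, not the DPC3 simulator interface.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// A generic offset prefetcher: on every access, each currently selected
// offset is added to the accessed block number to form a prefetch target.
struct OffsetPrefetcher {
    std::vector<int64_t> offsets;                   // refreshed every round/epoch
    std::function<void(uint64_t)> issue_prefetch;   // provided by the memory system

    void on_access(uint64_t block_number) {
        for (int64_t off : offsets)
            issue_prefetch(block_number + off);     // prefetch target = access + offset
    }
};

int main() {
    OffsetPrefetcher pf;
    pf.offsets = {+3, +6};
    pf.issue_prefetch = [](uint64_t b) { std::cout << "prefetch block " << b << "\n"; };
    pf.on_access(10);   // issues prefetches for blocks 13 and 16
}
```

What distinguishes individual offset prefetchers, including MLOP, is how the offsets list is refilled at the end of each round, which the next slides describe.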

MLOP Offset Selection
- MLOP scores candidate offsets based on their coverage potential.
- There is an offset score vector associated with each "lookahead level."
- Lookahead level N counts a covered miss only if it occurs at least N timesteps after the access that would have triggered the prefetch.
- The highest-scoring offset is selected from each lookahead level; repeated offsets are ignored (see the sketch after the notes).

Bullet #3: each cache miss or prefetch hit can be considered a "timestep" in the context of prefetching. We'll see an example of how offset scores are calculated soon.
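Below is a minimal sketch of this selection step, assuming the per-level score vectors for the round have already been filled. The names and signature (select_offsets, candidate_offsets) are illustrative assumptions, not the authors' code.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// One score vector per lookahead level, with one entry per candidate offset.
// Keep the best offset of each level, skipping offsets already chosen by a
// lower level. The result is ordered: offsets obtained from smaller lookahead
// levels come first.
std::vector<int> select_offsets(const std::vector<std::vector<int>>& scores,
                                const std::vector<int>& candidate_offsets) {
    std::vector<int> selected;
    std::set<int> seen;
    for (const auto& level_scores : scores) {
        auto best = std::max_element(level_scores.begin(), level_scores.end());
        int offset = candidate_offsets[best - level_scores.begin()];
        if (seen.insert(offset).second)   // ignore repeated offsets
            selected.push_back(offset);
    }
    return selected;
}
```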

Example
In this example, offsets +1 through +8 have been evaluated and the lookahead levels are 1, 2, 3, 4. The highest-scoring offset is selected from each level, resulting in the prefetching offset set {+3, +6}. Note that even though there are 4 lookahead levels, only 2 offsets were selected. This tends to happen when the lookahead levels are close together, as they are here (1, 2, 3, 4), so the number of lookahead levels is usually quite a bit larger than the prefetch degree (unless the lookahead levels are far apart, like 1, 9, 17, ...). The set of prefetching offsets is actually an ordered set in MLOP: offsets obtained from smaller lookahead levels are prefetched first, so the result is written (+3, +6).

Access Map Table
- MLOP uses an Access Map Table (AMT) to track the state of memory blocks.
- Each entry is a vector of states pertaining to a contiguous segment of memory.
- We've added a queue to each entry that stores the M - 1 most recent accesses, where M is the highest lookahead level (see the sketch after the notes).

Bullet #1: states can be {INIT (not accessed or prefetched), ACCESS, PREFETCH L1/L2/L3}; we'll assume states are only INIT/ACCESS for demonstration. Bullet #2: the AMT is a set-associative structure. Because MLOP sits in L1 and the L1 cache is small, only a few blocks will have states other than INIT, and thus a small AMT (<1 KB) is usually enough to track the state of all blocks, unless access patterns are extremely sparse. Bullet #3: a queue is simply implemented as a shift register in hardware. You'll see why we've added these queues in the next few slides.
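Here is a minimal sketch of one Access Map Table entry as described above. Types and field names are illustrative assumptions; the real table is a set-associative hardware structure and the queue is a shift register. The value M = 16 comes from the DPC3 configuration described later in the talk.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

enum class BlockState : uint8_t { INIT, ACCESS, PREFETCH_L1, PREFETCH_L2, PREFETCH_L3 };

struct AMTEntry {
    static constexpr int BLOCKS_PER_SEGMENT = 64;   // one contiguous segment of memory
    static constexpr int MAX_LOOKAHEAD = 16;        // M, the highest lookahead level

    uint64_t segment_tag = 0;                        // which segment this entry tracks
    std::vector<BlockState> state =
        std::vector<BlockState>(BLOCKS_PER_SEGMENT, BlockState::INIT);
    std::deque<int> recent_accesses;                 // the M-1 most recent block indexes, newest first

    void record_access(int block_index) {
        state[block_index] = BlockState::ACCESS;
        recent_accesses.push_front(block_index);
        if (static_cast<int>(recent_accesses.size()) > MAX_LOOKAHEAD - 1)
            recent_accesses.pop_back();              // keep only the M-1 most recent accesses
    }
};
```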

Example
We'll focus on one AMT entry in this example. Blocks 0, 1, 3, 4, 6 are accessed in that order. Unmarking the block indexes stored in the queue, one at a time in order, gives us a recent history of the access map with length M (the highest lookahead level).

MLOP Score Update
We use the access map history to update all score vectors: the first map (the original access map, without any unmarking) updates lookahead level 1, the second map updates lookahead level 2, and so on. Note that the score vector has been reversed in the slide's figure. Assume that block number 7 has just been accessed. We shift the maps so that the recently accessed block is aligned with offset 0 (where it would have been if it were present); in this case, we shift the maps one block to the right. Observe that now every map block is aligned with the offset that could have successfully predicted the recent access, and the score of each offset that is aligned with an accessed block is incremented. This update is performed a fixed number of times; then the prefetching offsets are selected, the scores are reset, and the whole process repeats.
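The following sketch puts the previous two slides together: on each access it walks the lookahead levels, rebuilding the unmarked history maps from the entry's queue, and credits every offset that points from a still-marked block to the block just accessed. All names, the exclusion of offset 0, and the exact update order are assumptions for illustration, not the submitted source.

```cpp
#include <deque>
#include <vector>

constexpr int BLOCKS_PER_SEGMENT = 64;
constexpr int MAX_LOOKAHEAD = 16;   // M

// scores[level - 1][offset + 63] holds the score of 'offset' at that lookahead
// level; the caller sizes it as MAX_LOOKAHEAD x 127 (offsets -63..+63).
using ScoreVectors = std::vector<std::vector<int>>;

void update_scores(ScoreVectors& scores,
                   std::vector<bool> accessed,     // per-block access map (taken by copy)
                   const std::deque<int>& recent,  // most recent block indexes, newest first
                   int accessed_block) {           // index of the block just accessed
    for (int level = 1; level <= MAX_LOOKAHEAD; ++level) {
        // Level 1 uses the original map; each higher level unmarks one more recent access.
        if (level > 1 && level - 2 < static_cast<int>(recent.size()))
            accessed[recent[level - 2]] = false;
        for (int off = -63; off <= 63; ++off) {
            if (off == 0) continue;                // offset 0 prefetches nothing new (assumption)
            int src = accessed_block - off;        // block that offset 'off' would have fired from
            if (src >= 0 && src < BLOCKS_PER_SEGMENT && accessed[src])
                ++scores[level - 1][off + 63];     // 'off' could have predicted this access
        }
    }
}
```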

MLOP's DPC3 Settings
- Sits in L1 and prefetches into L1 and L2.
- A simple next-line prefetcher sits in L2 and L3.
- The Access Map Table divides memory into 4 KB segments (64 blocks, i.e., one page).
- Evaluates all possible offsets ([-63, +63]).
- Uses lookahead levels 1 to 16 (see the constants sketch below).

Bullet #1: we observed that the highest performance was achieved with this placement. Bullet #2: prefetchers at L2 and L3 observe highly irregular access patterns because of the prefetcher in L1, so we decided to use a simple static prefetcher such as a next-line prefetcher there. Bullet #4: because this is a competition, we made no attempt to optimize this part; otherwise, one could evaluate only a subset of these offsets. For example, the Best-Offset prefetcher paper suggests evaluating offsets whose prime factorization consists only of 2, 3, or 5 (primes no greater than 5). Bullet #5: the prefetch degree is much less than 16; on average, fewer than 5 offsets are selected each round. The rest are repeated offsets or offsets that don't have a high enough score (the next slide explains the score-thresholding mechanism).
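Collected as constants, the configuration above looks roughly like this (a sketch with illustrative names; the 64-byte block size is the usual ChampSim assumption, not stated on the slide):

```cpp
// MLOP's DPC3 configuration, as listed on this slide.
constexpr int BLOCK_SIZE_BYTES    = 64;          // cache block size (assumed, standard in ChampSim)
constexpr int SEGMENT_SIZE_BYTES  = 4096;        // one AMT segment = 4 KB = one page
constexpr int BLOCKS_PER_SEGMENT  = SEGMENT_SIZE_BYTES / BLOCK_SIZE_BYTES;   // 64 blocks
constexpr int MIN_OFFSET          = -63;         // all possible in-segment offsets
constexpr int MAX_OFFSET          = +63;         //   are evaluated
constexpr int MAX_LOOKAHEAD_LEVEL = 16;          // lookahead levels 1 to 16
```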

MLOP's DPC3 Settings (continued)
- Updates scores 500 times in each round.
- Selected offsets with a score above 200 prefetch into L1; selected offsets with a score above 150 (but not above 200) prefetch into L2 (see the sketch below).
- Does not prefetch into L3.
- Uses a large Access Map Table.

Bullet #3: we observed no performance benefit from prefetching into L3. Bullet #4: as mentioned earlier, a small Access Map Table also achieves high performance, but since this is a competition, we made no attempt to optimize storage. Even a very large (compared with the L1 cache) AMT (~8.5 KB) is much smaller than DPC3's storage limit (64 KB).
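A minimal sketch of the score thresholding described above. The thresholds and round length are from the slide; the names and the exact decision structure are illustrative assumptions.

```cpp
constexpr int UPDATES_PER_ROUND  = 500;   // score updates per round
constexpr int L1_SCORE_THRESHOLD = 200;   // above this, prefetch into L1
constexpr int L2_SCORE_THRESHOLD = 150;   // above this (but not above 200), prefetch into L2

enum class FillLevel { NONE, L2, L1 };    // MLOP does not prefetch into L3

// Decide where a selected offset's prefetches are placed, given its end-of-round score.
FillLevel fill_level_for(int score) {
    if (score > L1_SCORE_THRESHOLD) return FillLevel::L1;
    if (score > L2_SCORE_THRESHOLD) return FillLevel::L2;
    return FillLevel::NONE;               // score too low: the offset is not used this round
}
```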

Performance Evaluation
We compared MLOP with the Best-Offset Prefetcher (BOP), the Aggregate Stride Prefetcher (ASP), and the Signature Path Prefetcher (SPP). All prefetchers sit in L1, and MLOP uses no next-line prefetchers in this comparison. On these SPEC 2017 workloads, MLOP achieves the highest IPC, with SPP performing best among the others. MLOP achieves a 39% average performance improvement on single-core workloads, which is 5% higher than SPP, and a 21% average improvement on multi-core workloads, which is 3% higher than SPP. With next-line prefetchers added at L2 and L3 (not shown here), MLOP achieves a 40% improvement on single-core workloads and a 23% improvement on multi-core workloads.

Thank you!