1 Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a.

Slides:



Advertisements
Similar presentations
Dead Block Replacement and Bypass with a Sampling Predictor Daniel A. Jiménez Department of Computer Science The University of Texas at San Antonio.
Advertisements

D. Tam, R. Azimi, L. Soares, M. Stumm, University of Toronto Appeared in ASPLOS XIV (2009) Reading Group by Theo 1.
1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.
1 Lecture 17: Large Cache Design Papers: Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06 Co-Operative Caching.
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
1 Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines Moinuddin K. Qureshi M. Aater Suleman Yale N. Patt HPCA 2007.
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
CS7810 Prefetching Seth Pugsley. Predicting the Future Where have we seen prediction before? – Does it always work? Prefetching is prediction – Predict.
1 Lecture 9: Large Cache Design II Topics: Cache partitioning and replacement policies.
Cache Replacement Policy Using Map-based Adaptive Insertion Yasuo Ishii 1,2, Mary Inaba 1, and Kei Hiraki 1 1 The University of Tokyo 2 NEC Corporation.
Improving Cache Performance by Exploiting Read-Write Disparity
CS Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi Proceedings.
1 Lecture 11: Large Cache Design IV Topics: prefetch, dead blocks, cache networks.
CSCE 212 Chapter 7 Memory Hierarchy Instructor: Jason D. Bakos.
1 Lecture 12: Large Cache Design Papers (papers from last class and…): Co-Operative Caching for Chip Multiprocessors, Chang and Sohi, ISCA’06 Victim Replication,
Glenn Reinman, Brad Calder, Department of Computer Science and Engineering, University of California San Diego and Todd Austin Department of Electrical.
1 Lecture 12: Cache Innovations Today: cache access basics and innovations (Sections )
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
1 PATH: Page Access Tracking Hardware to Improve Memory Management Reza Azimi, Livio Soares, Michael Stumm, Tom Walsh, and Angela Demke Brown University.
1 Lecture 8: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.
1 Lecture 11: Large Cache Design Topics: large cache basics and… An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al.,
1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)
An Intelligent Cache System with Hardware Prefetching for High Performance Jung-Hoon Lee; Seh-woong Jeong; Shin-Dug Kim; Weems, C.C. IEEE Transactions.
CS 7810 Lecture 9 Effective Hardware-Based Data Prefetching for High-Performance Processors T-F. Chen and J-L. Baer IEEE Transactions on Computers, 44(5)
Predictor-Directed Stream Buffers Timothy Sherwood Suleyman Sair Brad Calder.
Prefetching Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen and Mark Hill Updated by Mikko Lipasti.
Computer Architecture Lecture 26 Fasih ur Rehman.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
Sampling Dead Block Prediction for Last-Level Caches
MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,
Lecture Topics: 11/24 Sharing Pages Demand Paging (and alternative) Page Replacement –optimal algorithm –implementable algorithms.
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)
The Evicted-Address Filter
Scavenger: A New Last Level Cache Architecture with Global Block Priority Arkaprava Basu, IIT Kanpur Nevin Kirman, Cornell Mainak Chaudhuri, IIT Kanpur.
1 Lecture 12: Large Cache Design Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.
Virtual Memory Questions answered in this lecture: How to run process when not enough physical memory? When should a page be moved from disk to memory?
1 Lecture 8: Virtual Memory Operating System Fall 2006.
Cache Miss-Aware Dynamic Stack Allocation Authors: S. Jang. et al. Conference: International Symposium on Circuits and Systems (ISCAS), 2007 Presenter:
1 Appendix C. Review of Memory Hierarchy Introduction Cache ABCs Cache Performance Write policy Virtual Memory and TLB.
Instruction Prefetching Smruti R. Sarangi. Contents  Motivation for Prefetching  Simple Schemes  Recent Work  Proactive Instruction Fetching  Return.
- 세부 1 - 이종 클라우드 플랫폼 데이터 관리 브로커 연구 및 개발 Network and Computing Lab.
Cache Replacement Policy Based on Expected Hit Count
Improving Cache Performance using Victim Tag Stores
CRC-2, ISCA 2017 Toronto, Canada June 25, 2017
Lecture: Large Caches, Virtual Memory
Associativity in Caches Lecture 25
Less is More: Leveraging Belady’s Algorithm with Demand-based Learning
Javier Díaz1, Pablo Ibáñez1, Teresa Monreal2,
Lecture: Cache Hierarchies
Lecture 14: Large Cache Design III
CS222/CS122C: Principles of Data Management Lecture #3 Heap Files, Page Formats, Buffer Manager Instructor: Chen Li.
18742 Parallel Computer Architecture Caching in Multi-core Systems
Lecture: Cache Hierarchies
Lecture 13: Large Cache Design I
Lecture 12: Cache Innovations
TLC: A Tag-less Cache for reducing dynamic first level Cache Energy
Database Applications (15-415) DBMS Internals: Part III Lecture 14, February 27, 2018 Mohammad Hammoud.
Lecture: Cache Innovations, Virtual Memory
Lecture 15: Large Cache Design III
CARP: Compression-Aware Replacement Policies
Lecture 14: Large Cache Design II
15-740/ Computer Architecture Lecture 14: Prefetching
Lecture: Cache Hierarchies
COMP755 Advanced Operating Systems
Sarah Diesburg Operating Systems CS 3430
10/18: Lecture Topics Using spatial locality
Sarah Diesburg Operating Systems COP 4610
Presentation transcript:

1 Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a 9% higher miss rate than true LRU

2 Overview

3 Set Partitioning Can also partition sets among cores by assigning page colors to each core Needs little hardware support, but must adapt to dynamic arrival/exit of tasks

4 LIN Qureshi et al., ISCA’06 Memory level parallelism (MLP): number of misses that simultaneously access memory; high MLP  miss is less expensive Replacement decision is a linear combination of recency and MLP experienced when fetching that block MLP is estimated by tracking the number of outstanding requests in the MSHR while waiting in the MSHR Can also use set dueling to decide between LRU and LIN

5 Pseudo-LIFO Chaudhuri, MICRO’09 A fill stack is a FIFO that tracks the order in which blocks entered the set Most hits are serviced while a block is near the top of the stack; there is usually a knee-point beyond which blocks stop yielding hits Evict the highest block above the knee that has not yet serviced a hit in its current fill stack position Allows some blocks to probabilistically escape past the knee and get retained for distant reuse

6 Scavenger Basu et al., MICRO’07 Half the cache is used as a victim cache to retain blocks that will likely be used in the distant future Counting bloom filters to track a block’s potential for reuse and make replacement decisions in the victim cache Complex indexing and search in the victim cache Another recent paper (NuCache, HPCA’11) places blocks in a large FIFO victim file if they were fetched by delinquent PCs and the block has a short re-use distance

7 V-Way Cache Qureshi et al., ISCA’05 Meant to reduce load imbalance among sets and compute a better global replacement decision Tag store: every set has twice as many ways Data store: no correspondence with tag store; need forward and reverse pointers In most cases, can replace any block; every block has a 2b saturating counter that is incremented on every access; scan blocks (and decrement) until a zero counter is found; continue scan on next replacement

8 ZCache Sanchez and Kozyrakis, MICRO’10 Skewed associative cache: each way has a different indexing function (in essence, W direct-mapped caches) When block A is brought in, it could replace one of four (say) blocks B, C, D, E; but B could be made to reside in one of three other locations (currently occupied by F, G, H); and F could be moved to one of three other locations We thus get a tree of replacement options and we can pick LRU among these options Every replacement requires multiple tag look-ups and data block copies; worthwhile if you’re reducing off-chip accesses

9 Temporal Memory Streaming Wenisch et al., ISCA’05 When a thread incurs a series of misses to blocks (PQRS, not necessarily contiguous), other threads are likely to incur a similar series of misses Each thread maintains its miss log in a circular buffer in memory; the directory for P also keeps track of a pointer to multiple log entries of P When a thread has a miss on P, it contacts the directory and the directory provides the log pointers; the thread receives multiple streams and starts prefetching Log access and prefetches are off the critical path

10 Spatial Memory Streaming Somogyi et al., ISCA’06 Threads often enter a new region (page) and touch a few arbitrary blocks in that region A predictor is indexed with the PC of the first access to that region and the offset of the first access; the predictor returns a bit vector indicating the blocks accessed within that region Can even prefetch for regions that have not been touched before!

11 Feedback Directed Prefetching Srinath et al., HPCA’07 A stream prefetcher has two parameters: P: prefetch distance: how far ahead of the start do we prefetch N: prefetch degree: how much do we advance the start when there is a hit in the stream Can vary these two parameters based on pref effectiveness Accuracy: a bit tracks if a prefetched block was touched Timeliness: was the block touched while in the MSHR? Pollution: track recent evictions (Bloom filter) and see if they are re-touched; also guides insertion policy

12 Dead Block Prediction Can keep track of the number of accesses to a line during its previous residence; the block is deemed to be dead after that many accesses Kharbutli, Solihin, IEEE TOC’08 To reduce noise, an access can be considered as a block’s move to the MRU position Liu et al., MICRO 2008 Earlier DBPs used a trace of PCs to capture when a block has completed its use DBP is used for energy savings, replacement policies, and cache bypassing

13 Distill Cache Qureshi, HPCA 2007 Half the ways are traditional (LOC); when a block is evicted from the LOC, only the touched words are stored in a word-organized cache that has many narrow ways Incurs a fair bit of complexity (more tags for the WOC, collection of word touches in L1s, blocks with holes, etc.) Does not need a predictor; actions are based on the block’s behavior during current residence Useless word identification is orthogonal to cache compression

14 Title Bullet

15 Title Bullet