Cache Replacement Policies

Slides:

Advertisements

Similar presentations

Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr. , Joel Emer

Advertisements

1 Lecture 13: Cache and Virtual Memroy Review Cache optimization approaches, cache miss classification, Adapted from UCB CS252 S01.

1 Cache and Caching David Sands CS 147 Spring 08 Dr. Sin-Min Lee.

1 Lecture 9: Large Cache Design II Topics: Cache partitioning and replacement policies.

Caches Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H 5.1, 5.2 (except writes)

CS2100 Computer Organisation Cache II (AY2014/2015) Semester 2.

Improving Cache Performance by Exploiting Read-Write Disparity

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache II Steve Ko Computer Sciences and Engineering University at Buffalo.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Caches P & H Chapter 5.1, 5.2 (except writes)

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Caches P & H Chapter 5.1, 5.2 (except writes)

ECE/CS 552: Cache Performance Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on notes by Mark Hill Updated by.

Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.

Multilevel Memory Caches Prof. Sirer CS 316 Cornell University.

CSI 400/500 Operating Systems Spring 2009 Lecture #9 – Paging and Segmentation in Virtual Memory Monday, March 2 nd and Wednesday, March 4 th, 2009.

1 Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a.

3-1: Memory Hierarchy Prof. Mikko H. Lipasti University of Wisconsin-Madison Companion Lecture Notes for Modern Processor Design: Fundamentals of Superscalar.

Cache intro CSE 471 Autumn 011 Principle of Locality: Memory Hierarchies Text and data are not accessed randomly Temporal locality –Recently accessed items.

Lecture 19: Virtual Memory

Chapter Twelve Memory Organization

Multilevel Memory Caches Prof. Sirer CS 316 Cornell University.

Chapter 21 Virtual Memoey: Policies Chien-Chung Shen CIS, UD

CS 3410, Spring 2014 Computer Science Cornell University See P&H Chapter: , 5.8, 5.15.

10/18: Lecture topics Memory Hierarchy –Why it works: Locality –Levels in the hierarchy Cache access –Mapping strategies Cache performance Replacement.

ECE/CS 552: Cache Memory Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on set created by Mark Hill Updated.

3-May-2006cse cache © DW Johnson and University of Washington1 Cache Memory CSE 410, Spring 2006 Computer Systems

Memory Hierarchy CS 282 – KAUST – Spring 2010 Slides by: Mikko Lipasti Muhamed Mudawar.

Operating Systems CMPSC 473 Virtual Memory Management (3) November – Lecture 20 Instructor: Bhuvan Urgaonkar.

Improving Cache Performance by Exploiting Read-Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez.

Sampling Dead Block Prediction for Last-Level Caches

MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,

Demand Paged Virtual Memory Andy Wang Operating Systems COP 4610 / CGS 5765.

Exploiting Compressed Block Size as an Indicator of Future Reuse

International Symposium on Computer Architecture ( ISCA – 2010 )

Clock-Pro: An Effective Replacement in OS Kernel Xiaodong Zhang College of William and Mary.

CS2100 Computer Organisation Cache II (AY2015/6) Semester 1.

Computer Organization CS224 Fall 2012 Lessons 45 & 46.

The Evicted-Address Filter

CS6290 Caches. Locality and Caches Data Locality –Temporal: if data item needed now, it is likely to be needed again in near future –Spatial: if data.

1 Lecture 12: Large Cache Design Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.

Transforming Policies into Mechanisms with Infokernel Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Nathan C. Burnett, Timothy E. Denehy, Thomas J.

Lecture 20 Last lecture: Today’s lecture: Types of memory

ECE 752 Review: Modern Processors © Prof. Mikko Lipasti Lecture notes based in part on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim.

1 Appendix C. Review of Memory Hierarchy Introduction Cache ABCs Cache Performance Write policy Virtual Memory and TLB.

LECTURE 12 Virtual Memory. VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a “cache”

Silberschatz, Galvin and Gagne ©2011 Operating System Concepts Essentials – 8 th Edition Chapter 9: Virtual Memory.

Fall 2010 Computer Architecture Lecture 6: Caching Basics Prof. Onur Mutlu Carnegie Mellon University.

Improving Cache Performance using Victim Tag Stores

CRC-2, ISCA 2017 Toronto, Canada June 25, 2017

Computer Architecture Lecture 18: Caches, Caches, Caches

Cache Performance Samira Khan March 28, 2017.

18-447: Computer Architecture Lecture 23: Caches

Multilevel Memories (Improving performance using alittle “cash”)

Less is More: Leveraging Belady’s Algorithm with Demand-based Learning

Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.

ECE/CS 552: Virtual Memory

18742 Parallel Computer Architecture Caching in Multi-core Systems

Cache Memory Presentation I

Memory Data Flow ECE/CS 752 Fall 2017

EECE.4810/EECE.5730 Operating Systems

ECE/CS 552: Cache Design © Prof. Mikko Lipasti

Cache Replacement Policies

International Symposium on Computer Architecture ( ISCA – 2010 )

Demand Paged Virtual Memory

Lecture 15: Large Cache Design III

CARP: Compression-Aware Replacement Policies

Adapted from slides by Sally McKee Cornell University

Overheads for Computers as Components 2nd ed.

Lecture 14: Large Cache Design II

CSC3050 – Computer Architecture

Operating Systems CMPSC 473

Presentation transcript:

Cache Replacement Policies Prof. Mikko H. Lipasti University of Wisconsin-Madison ECE/CS 752 Spring 2012

Cache Design: Four Key Issues These are: Placement Where can a block of memory go? Identification How do I find a block of memory? Replacement How do I make space for new blocks? Write Policy How do I propagate changes? Consider these for caches Usually SRAM Also apply to main memory, disks © 2005 Mikko Lipasti

Placement Memory Type Placement Comments Registers Anywhere; Int, FP, SPR Compiler/programmer manages Cache (SRAM) Fixed in H/W Direct-mapped, set-associative, fully-associative DRAM Anywhere O/S manages Disk © 2005 Mikko Lipasti

Placement Address Range Map address to finite capacity Direct-mapped Block Size Address SRAM Cache Address Range Exceeds cache capacity Map address to finite capacity Called a hash Usually just masks high-order bits Direct-mapped Block can only exist in one location Hash collisions cause problems Index Hash Offset Data Out 32-bit Address Index Offset © 2005 Mikko Lipasti

Identification Fully-associative Identification Expensive! Tag Address SRAM Cache ?= Hit Fully-associative Block can exist anywhere No more hash collisions Identification How do I know I have the right block? Called a tag check Must store address tags Compare against address Expensive! Tag & comparator per block Hash Tag Check Offset Data Out 32-bit Address Tag Offset © 2005 Mikko Lipasti

Placement Set-associative Identification Block can be in a locations Address SRAM Cache Index Set-associative Block can be in a locations Hash collisions: a still OK Identification Still perform tag check However, only a in parallel Index Hash a Tags a Data Blocks ?= ?= Tag ?= ?= Offset Offset 32-bit Address Tag Index Data Out © 2005 Mikko Lipasti

Replacement Cache has finite size Analogy: desktop full? Same idea: What do we do when it is full? Analogy: desktop full? Move books to bookshelf to make room Bookshelf full? Move least-used to library Etc. Same idea: Move blocks to next level of cache © 2005 Mikko Lipasti

Cache Miss Rates: 3 C’s [Hill] Compulsory miss or Cold miss First-ever reference to a given block of memory Measure: number of misses in an infinite cache model Capacity Working set exceeds cache capacity Useful blocks (with future references) displaced Good replacement policy is crucial! Measure: additional misses in a fully-associative cache Conflict Placement restrictions (not fully-associative) cause useful blocks to be displaced Think of as capacity within set Measure: additional misses in cache of interest

Replacement How do we choose victim? Many policies are possible Verbs: Victimize, evict, replace, cast out Many policies are possible FIFO (first-in-first-out) LRU (least recently used), pseudo-LRU LFU (least frequently used) NMRU (not most recently used) NRU Pseudo-random (yes, really!) Optimal Etc © 2005 Mikko Lipasti

Optimal Replacement Policy? [Belady, IBM Systems Journal, 1966] Evict block with longest reuse distance i.e. next reference to block is farthest in future Requires knowledge of the future! Can’t build it, but can model it with trace Process trace in reverse [Sugumar&Abraham] describe how to do this in one pass over the trace with some lookahead (Cheetah simulator) Useful, since it reveals opportunity (X,A,B,C,D,X): LRU 4-way SA $, 2nd X will miss © 2005 Mikko Lipasti

Least-Recently Used For a=2, LRU is equivalent to NMRU Single bit per set indicates LRU/MRU Set/clear on each access For a>2, LRU is difficult/expensive Timestamps? How many bits? Must find min timestamp on each eviction Sorted list? Re-sort on every access? List overhead: log2(a) bits /block Shift register implementation © Shen, Lipasti

Practical Pseudo-LRU Rather than true LRU, use binary tree J Older F 1 C 1 B X 1 Y Newer A 1 Z Rather than true LRU, use binary tree Each node records which half is older/newer Update nodes on each reference Follow older pointers to find LRU victim

Practical Pseudo-LRU In Action J Y X Z B C F A J F 011: PLRU Block B is here C B X 110: MRU block is here Y A Z Partial Order Encoded in Tree: B C F A J Y X Z Z < A Y < X B < C J < F A > X C < F A > F

Practical Pseudo-LRU Refs: J,Y,X,Z,B,C,F,A Older F 1 011: PLRU Block B is here C 1 B X 1 110: MRU block is here Y Newer A 1 Z Binary tree encodes PLRU partial order At each level point to LRU half of subtree Each access: flip nodes along path to block Eviction: follow LRU path Overhead: (a-1)/a bits per block

True LRU Shortcomings Streaming data/scans: x0, x1, …, xn Effectively no temporal reuse Thrashing: reuse distance > a Temporal reuse exists but LRU fails All blocks march from MRU to LRU Other conflicting blocks are pushed out For n>a no blocks remain after scan/thrash Incur many conflict misses after scan ends Pseudo-LRU sometimes helps a little bit

Segmented or Protected LRU [I/O: Karedla, Love, Wherry, IEEE Computer 27(3), 1994] [Cache: Wilkerson, Wade, US Patent 6393525, 1999] Partition LRU list into filter and reuse lists On insert, block goes into filter list On reuse (hit), block promoted into reuse list Provides scan & some thrash resistance Blocks without reuse get evicted quickly Blocks with reuse are protected from scan/thrash blocks No storage overhead, but LRU update slightly more complicated

Protected LRU: LIP Simplified variant of this idea: LIP Qureshi et al. ISCA 2007 Insert new blocks into LRU position, not MRU position Filter list of size 1, reuse list of size (a-1) Do this adaptively: DIP Use set dueling to decide LIP vs. LRU 1 (or a few) set uses LIP vs. 1 that uses LRU Compare hit rate for sets Set policy for all other sets to match best set

Not Recently Used (NRU) Keep NRU state in 1 bit/block Bit is set to 0 when installed (assume reuse) Bit is set to 0 when referenced (reuse observed) Evictions favor NRU=1 blocks If all blocks are NRU=0 Eviction forces all blocks in set to NRU=1 Picks one as victim (can be pseudo-random, or rotating, or fixed left-to-right) Simple, similar to virtual memory clock algorithm Provides some scan and thrash resistance Relies on “randomizing” evictions rather than strict LRU order Used by Intel Itanium, Sparc T2 © Shen, Lipasti

RRIP [Jaleel et al. ISCA 2010] Re-reference Interval Prediction Extends NRU to multiple bits Start in the middle, promote on hit, demote over time Can predict near-immediate, intermediate, and distant re-reference Low overhead: 2 bits/block Static and dynamic variants (like LIP/DIP) Set dueling © Shen, Lipasti

Least Frequently Used Counter per block, incremented on reference Evictions choose lowest count Logic not trivial (a2 comparison/sort) Storage overhead 1 bit per block: same as NRU How many bits are helpful? © Shen, Lipasti

Pitfall: Cache Filtering Effect Upper level caches (L1, L2) hide reference stream from lower level caches Blocks with “no reuse” @ LLC could be very hot (never evicted from L1/L2) Evicting from LLC often causes L1/L2 eviction (due to inclusion) Could hurt performance even if LLC miss rate improves © 2005 Mikko Lipasti

Cache Replacement Championship Held at ISCA 2010 http://www.jilp.org/jwac-1 Several variants, improvements Simulation infrastructure Implementations for all entries © Shen, Lipasti

Recap Replacement policies affect capacity and conflict misses Policies covered: Belady’s optimal replacement Least-recently used (LRU) Practical pseudo-LRU (tree LRU) Protected LRU LIP/DIP variant Set dueling to dynamically select policy Not-recently-used (NRU) or clock algorithm RRIP (re-reference interval prediction) Least frequently used (LFU) Contest results

References S. Bansal and D. S. Modha. “CAR: Clock with Adaptive Replacement”, In FAST, 2004. A. Basu et al. “Scavenger: A New Last Level Cache Architecture with Global Block Priority”. In Micro-40, 2007. L. A. Belady. A study of replacement algorithms for a virtual-storage computer. In IBM Systems journal, pages 78–101, 1966. M. Chaudhuri. “Pseudo-LIFO: The Foundation of a New Family of Replacement Policies for Last-level Caches”. In Micro, 2009. F. J. Corbat´o, “A paging experiment with the multics system,” In Honor of P. M. Morse, pp. 217–228, MIT Press, 1969. A. Jaleel, et al. “Adaptive Insertion Policies for Managing Shared Caches”. In PACT, 2008. Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr. , Joel Emer, “High Performance Cache Replacement Using Re-Reference Interval Prediction “, In ISCA, 2010. S. Jiang and X. Zhang, “LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance,” in Proc. ACM SIGMETRICS Conf., 2002. T. Johnson and D. Shasha, “2Q: A low overhead high performance buffer management replacement algorithm,” in VLDB Conf., 1994. S. Kaxiras et al. Cache decay: exploiting generational behavior to reduce cache leakage power. In ISCA-28, 2001. A. Lai, C. Fide, and B. Falsafi. Dead-block prediction & dead-block correlating prefetchers. In ISCA-28, 2001 D. Lee et al. “LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE Trans.Computers, vol. 50, no. 12, pp. 1352–1360, 2001.

References W. Lin et al. “Predicting last-touch references under optimal replacement.” Technical Report CSE-TR-447-02, U. of Michigan, 2002. H. Liu et al. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency.” In Micro-41, 2008. G. Loh. “Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy”. In Micro, 2009. C.-K. Luk et al. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190–200, 2005. N. Megiddo and D. S. Modha, “ARC: A self-tuning, low overhead replacement cache,” in FAST, 2003. E. J. O’Neil et al. “The LRU-K page replacement algorithm for database disk buffering,” in Proc. ACM SIGMOD Conf., pp. 297–306, 1993. M. Qureshi, A. Jaleel, Y. Patt, S. Steely, J. Emer. “Adaptive Insertion Policies for High Performance Caching”. In ISCA-34, 2007. K. Rajan and G. Ramaswamy. “Emulating Optimal Replacement with a Shepherd Cache”. In Micro-40, 2007. J. T. Robinson and M. V. Devarakonda, “Data cache management using frequency-based replacement,” in SIGMETRICS Conf, 1990. R. Sugumar and S. Abraham, “Efficient simulation of caches under optimal replacement with applications to miss characterization,” in SIGMETRICS, 1993. Y. Xie, G. Loh. “PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches.” In ISCA-36, 2009 Y. Zhou and J. F. Philbin, “The multi-queue replacement algorithm for second level buffer caches,” in USENIX Annual Tech. Conf, 2001.