LIRS: An Efficient Replacement Policy to Improve Buffer Cache Performance. Song Jiang (College of William and Mary) and Xiaodong Zhang (College of William and Mary; National Science Foundation)

The Problem of LRU Replacement: inability to cope with weak access locality. File scanning: one-time accessed blocks are not replaced in a timely manner; Loop-like accesses: the blocks to be accessed soonest can unfortunately be replaced; Accesses with distinct frequencies: frequently accessed blocks can unfortunately be replaced.

Why Does LRU Fail Sometimes? A recently used block will not necessarily be used again soon. LRU cannot deal with a working set larger than the available cache size.
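The second point can be seen with a few lines of simulation. This is a minimal sketch (my own code, not from the paper; the 5-block cache and 6-block loop are arbitrary illustrative sizes) showing LRU getting zero hits once a loop is even one block larger than the cache:

```python
# Minimal LRU simulation (illustrative only): a 6-block loop against a
# 5-block cache yields no hits, because each block is evicted just
# before it is needed again.
from collections import OrderedDict

def lru_hits(accesses, cache_size):
    cache, hits = OrderedDict(), 0
    for b in accesses:
        if b in cache:
            hits += 1
            cache.move_to_end(b)           # refresh recency on a hit
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used block
            cache[b] = True
    return hits

print(lru_hits(list(range(6)) * 10, cache_size=5))   # -> 0
```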

LRU Merits Simplicity: affordable implementation Adaptability: responsive to access pattern changes

Our Objectives: Significant efforts have been made to improve LRU, but they are either case by case or incur high runtime overhead. Our objectives: address the limits of LRU fundamentally, and retain the low overhead and adaptability merits of LRU.

Outline Related Work The LIRS Algorithm LIRS Implementation Using LRU Stack Performance Evaluation Sensitivity and Overhead Analysis Conclusions

Related Work Aided by user-level hints Detection and adaptation of access regularities Tracing and utilizing deeper history information

User-level Hints: Application-controlled file caching [Cao et al, USENIX’94]; Application-informed prefetching and caching [Patterson et al, SOSP’96]. These approaches rely on users’ understanding of data access patterns.

Detection and Adaptation of Regularities SEQ: sequential access pattern detection [Glass et al, Sigmetrics’97] EELRU: on-line analysis of aggregate recency distributions of referenced blocks [Smaragdakis et al, Sigmetrics’97] DEAR: detection of multiple block reference patterns [Choi et al, USENIX’99] AFC: Application/File-level Characterization [Choi et al, Sigmetrics’00] UBM: Unified Buffer Management [Kim et al, OSDI’00] Case-by-case oriented approaches

Tracing and Utilizing Access History: LRFU: combine LRU and LFU [Lee et al, Sigmetrics’99]; LRU-K: replacement decision based on the time of the Kth-to-last reference [O'Neil et al, Sigmod’93]; 2Q: use two queues to quickly remove cold blocks [Johnson et al, VLDB’94]. These have either high implementation cost or workload-dependent performance.

Outline Related Work The LIRS Algorithm LIRS Implementation Using LRU Stack Performance Evaluation Sensitivity and Overhead Analysis Conclusions

Observation of Data Flow in LRU Stack: Blocks are ordered by recency in the LRU stack; blocks enter from the stack top and leave from its bottom. A block evicted from the bottom of the stack should have been evicted much earlier!

Inter-Reference Recency (IRR): IRR of a block: number of other unique blocks accessed between two consecutive references to the block. Recency: number of other unique blocks accessed from the last reference to the current time. Example: in the access sequence 1 2 3 4 3 1 5 6 5, block 1 has IRR = 3 and R = 2.
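A minimal sketch of these two definitions on the example sequence above (the helper name and code are mine, not the paper's):

```python
# IRR: number of other unique blocks between the block's last two references.
# Recency (R): number of other unique blocks since the block's last reference.
def irr_and_recency(accesses, block):
    refs = [i for i, b in enumerate(accesses) if b == block]
    irr = (len({b for b in accesses[refs[-2] + 1:refs[-1]] if b != block})
           if len(refs) >= 2 else None)
    recency = len({b for b in accesses[refs[-1] + 1:] if b != block})
    return irr, recency

print(irr_and_recency([1, 2, 3, 4, 3, 1, 5, 6, 5], 1))   # -> (3, 2)
```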

Principles of Our Replacement: If a block’s IRR is high, its next IRR is likely to be high again. We select the blocks with high IRRs for replacement. Once the IRR is out of date, we rely on the recency. LIRS: Low Inter-reference Recency Set replacement policy. We keep the blocks with low IRRs in the cache.

Basic LIRS Idea: Keep LIR Blocks in Cache. Blocks are classified as Low IRR (LIR) blocks or High IRR (HIR) blocks. The LIR block set has size Llirs, the HIR block set is given Lhirs cache blocks, and the cache size is L = Llirs + Lhirs.

An Example for LIRS: Llirs = 2, Lhirs = 1. LIR block set = {A, B}, HIR block set = {C, D, E}.

Mapping to Cache (Llirs = 2, Lhirs = 1): the LIR blocks A and B are resident; of the HIR blocks, only E occupies the single HIR cache block.

Which Block Is Replaced? Replace an HIR block. When D is referenced at time 10, the resident HIR block (E) is replaced!

How Is the LIR Set Updated? The recency of the LIR blocks is used.

After D is referenced at time 10: E is replaced and D enters the LIR set.

If the reference at time 10 is to C instead: E is replaced, but C cannot enter the LIR set.

The Power of LIRS Replacement: capability to cope with weak access locality. File scanning: one-time accessed blocks will be replaced in a timely manner; Loop-like accesses: blocks to be accessed soonest will NOT be replaced; Accesses with distinct frequencies: frequently accessed blocks will NOT be replaced.

Outline Related Work The LIRS Algorithm LIRS Implementation Using LRU Stack Performance Evaluation Sensitivity and Overhead Analysis Conclusions

LIRS Efficiency: O(1). This efficiency is achieved by our LIRS stack: the LIRS stack is the LRU stack with the additional property that the LIR block with recency Rmax sits at its bottom (Rmax is the maximum recency of the LIR blocks, against which the new IRR of an HIR block is compared).
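The bottom-of-stack invariant is maintained by stack pruning. A minimal sketch of that step (my own naming, assuming the stack is a deque whose left end models the stack bottom):

```python
from collections import deque

def prune(S, status):
    """Pop HIR entries off the bottom of stack S until a LIR block is
    exposed, so the bottom always holds the LIR block with recency Rmax."""
    while S and status.get(S[0]) != 'LIR':
        S.popleft()                     # left end is the stack bottom here
```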

Differences between LRU and LIRS Stacks (example: cache size L = 5, Llir = 3, Lhir = 2). The stack size of LRU is decided by the cache size and is fixed; the stack size of LIRS is decided by the LIR block with Rmax recency and varies. The LRU stack holds only resident blocks; the LIRS stack holds any blocks whose recencies are no more than Rmax. The LRU stack does not distinguish “hot” and “cold” blocks in it; the LIRS stack distinguishes LIR and HIR blocks in it and dynamically maintains their statuses.

How Does the LIRS Stack Help? Compare the new IRR of an HIR block against Rmax, the maximum recency of the LIR blocks: blocks in the LIRS stack ==> IRR < Rmax; other blocks ==> IRR > Rmax.

LIRS Operations (example: cache size L = 5, Llir = 3, Lhir = 2; the LIRS stack S holds LIR and HIR blocks, and stack Q holds the resident HIR blocks). Initialization: all referenced blocks are given LIR status until the LIR block set is full; resident HIR blocks are placed in stack Q. Three cases then arise: accessing a LIR block (a hit), accessing a resident HIR block (a hit), and accessing a non-resident HIR block (a miss); see the structural sketch below.
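Below is a structural sketch of the two structures named on this slide. The class and field names (LIRSCache, hir_fraction, and so on) are mine, not the paper's; the access cases themselves are sketched after the case-by-case walk-through.

```python
from collections import deque

class LIRSCache:
    """Hedged sketch of the LIRS bookkeeping: stack S orders blocks by
    recency (LIR blocks plus recently seen HIR blocks, resident or not),
    while queue Q holds only the resident HIR blocks."""
    def __init__(self, cache_size, hir_fraction=0.01):
        self.Lhirs = max(1, int(cache_size * hir_fraction))  # resident HIR blocks
        self.Llirs = cache_size - self.Lhirs                  # LIR blocks
        self.S = deque()        # LIRS stack; right end is the stack top
        self.Q = deque()        # resident HIR blocks, FIFO
        self.status = {}        # block id -> 'LIR' or 'HIR'
        self.resident = set()   # blocks currently held in the cache
```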

Access an LIR Block (a Hit): the figure steps through accesses to blocks 4 and 8, showing stacks S and Q before and after each access (cache size L = 5, Llir = 3, Lhir = 2).

Access a Resident HIR Block (a Hit): the figure steps through accesses to blocks 3 and 5, showing stacks S and Q before and after each access (cache size L = 5, Llir = 3, Lhir = 2).

Access a Non-Resident HIR Block (a Miss): the figure shows the access to block 7 and the resulting stacks S and Q (cache size L = 5, Llir = 3, Lhir = 2).

Access a Non-Resident HIR Block (a Miss), continued: the figure continues with accesses to blocks 9 and 5 (cache size L = 5, Llir = 3, Lhir = 2). A combined sketch of all three cases follows below.
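Pulling the three cases just illustrated together, the sketch below re-opens the LIRSCache class from the "LIRS Operations" slide and adds one branch per case plus the demotion and pruning helpers. All names are mine, and this is a hedged reading of the algorithm rather than the authors' code.

```python
class LIRSCache(LIRSCache):                     # extend the earlier sketch
    def access(self, b):
        hit = b in self.resident
        if self.status.get(b) == 'LIR':         # case 1: LIR hit
            self.S.remove(b); self.S.append(b)  # move to the stack top
            self._prune()
        elif hit:                               # case 2: resident HIR hit
            if b in self.S:                     # new IRR < Rmax: promote to LIR
                self.S.remove(b); self.S.append(b)
                self.status[b] = 'LIR'
                self.Q.remove(b)
                self._demote_bottom_lir()       # keep the LIR set size fixed
            else:                               # only in Q: stays HIR
                self.Q.remove(b); self.Q.append(b)
                self.S.append(b)
        else:                                   # case 3: non-resident miss
            if len(self.resident) >= self.Llirs + self.Lhirs:
                self.resident.discard(self.Q.popleft())   # evict front of Q
            self.resident.add(b)
            n_lir = sum(1 for s in self.status.values() if s == 'LIR')
            if b in self.S:                     # seen recently: promote to LIR
                self.S.remove(b); self.S.append(b)
                self.status[b] = 'LIR'
                self._demote_bottom_lir()
            elif n_lir < self.Llirs:            # warm-up: fill the LIR set first
                self.status[b] = 'LIR'
                self.S.append(b)
            else:                               # cold block: resident HIR
                self.status[b] = 'HIR'
                self.S.append(b); self.Q.append(b)
        return hit

    def _demote_bottom_lir(self):
        bottom = self.S.popleft()               # LIR block with recency Rmax
        self.status[bottom] = 'HIR'
        self.Q.append(bottom)                   # still resident, now HIR
        self._prune()

    def _prune(self):                           # keep a LIR block at the bottom
        while self.S and self.status.get(self.S[0]) != 'LIR':
            self.S.popleft()
```

As a small usage check, `c = LIRSCache(5); hits = sum(c.access(b) for b in list(range(6)) * 10)` replays the same 6-block loop used in the earlier LRU sketch; here the LIR blocks survive the loop and keep hitting, unlike under LRU.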

Outline Related Work The LIRS Algorithm LIRS Implementation Using LRU Stack Performance Evaluation Sensitivity and Overhead Analysis Conclusions

Workload Traces: cpp is a GNU C compiler pre-processor trace. cs is an interactive C source program examination tool trace. glimpse is a text information retrieval utility trace. postgres is a trace of join queries among four relations in a relational database system. sprite is from the Sprite network file system. multi1 is obtained by executing two workloads, cs and cpp, together. multi2 is obtained by executing three workloads, cs, cpp, and postgres, together.

Representative Access Patterns: Looping references: all blocks are accessed repeatedly with a regular interval. Temporally-clustered references: blocks accessed more recently are the ones more likely to be accessed again soon. Probabilistic references: each block has a stationary reference probability, and all blocks are accessed independently with the associated probabilities.
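For experimentation, synthetic traces with these three patterns could be generated roughly as follows. This is a hedged sketch; the generators and their parameters are mine and are not the traces used in the evaluation.

```python
import random

def looping(n_blocks, n_rounds):
    # every block revisited at a fixed interval equal to the loop length
    return list(range(n_blocks)) * n_rounds

def temporally_clustered(n_blocks, length):
    # recently used blocks are more likely to be used again soon
    recent, trace = list(range(n_blocks)), []
    for _ in range(length):
        weights = [1.0 / (rank + 1) for rank in range(n_blocks)]
        b = random.choices(recent, weights=weights, k=1)[0]
        recent.remove(b); recent.insert(0, b)   # b becomes the most recent
        trace.append(b)
    return trace

def probabilistic(n_blocks, length):
    # stationary, independent per-block reference probabilities (Zipf-like here)
    weights = [1.0 / (i + 1) for i in range(n_blocks)]
    return random.choices(range(n_blocks), weights=weights, k=length)
```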

Cache Partition: 1% of the cache size is for HIR blocks and 99% is for LIR blocks. Performance is not sensitive to this partition.

Looping Pattern: cs (Time-space map)

Looping Pattern: cs (Hit Rates)

Looping Pattern: postgres (Time-space map)

Looping Pattern: postgres (Hit Rates)

Looping Pattern: postgres (Hit Rates)

Probabilistic Pattern: cpp (Time-space map)

Probabilistic Pattern: cpp (Hit Rates)

Temporally-Clustered Pattern: sprite (Time-space map)

Temporally-Clustered Pattern: sprite (Hit Rates)

Mixed Pattern: multi1 (Time-space map)

Mixed Pattern: multi1 (Hit Rates)

Mixed Pattern: multi2 (Time-space map)

Mixed Pattern: multi2 (Hit Rates)

Outline Related Work The LIRS Algorithm LIRS Implementation Using LRU Stack Performance Evaluation Sensitivity and Overhead Analysis Conclusions

Sensitivity to the Change of Lhirs

Sensitivity to the Change of Lhirs

LIRS with Limited Stack Sizes

LIRS with Limited Stack Sizes

Conclusions: LIRS effectively uses deeper access history without explicit regularity detection or high-cost operations. It outperforms existing replacement policies. Its implementation is as simple as LRU's. It is applicable to virtual memory and database buffer management.