ECE7995 Caching and Prefetching Techniques in Computer Systems Lecture 8: Buffer Cache in Main Memory (IV)

Quantifying Locality with the LRU Stack: Blocks are ordered by their recencies; blocks enter from the stack top and leave from its bottom. [Figure: an LRU stack with positions labeled Recency = 1 and Recency = 2.]

LRU Stack: Blocks are ordered by recency in the LRU stack; blocks enter from the stack top and leave from its bottom. Inter-Reference Recency (IRR): the number of other distinct blocks accessed between two consecutive references to the block. [Figure: a block re-referenced at Recency = 2, giving IRR = 2; after the access its recency is 0.]
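To make the two metrics concrete, the following minimal Python sketch (my illustration, not part of the slides; the function name and trace are hypothetical) computes recency and IRR over a reference stream:

```python
def recency_and_irr(trace):
    """Report each block's recency and IRR over a reference stream.
    Recency: distinct blocks accessed since the block's last reference.
    IRR: the same count, taken between its last two references."""
    stack = []  # LRU stack: most recently used block at index 0
    for block in trace:
        if block in stack:
            recency = stack.index(block)  # stack depth = recency
            stack.remove(block)
            # At the moment of re-reference, the recency equals the IRR
            # of the pair of references just completed.
            print(f"access {block}: recency = {recency}, IRR = {recency}")
        else:
            print(f"access {block}: first reference, IRR = infinite")
        stack.insert(0, block)  # the block re-enters at the stack top

recency_and_irr(list("ABCAB"))  # second access to A: recency = IRR = 2
```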

Locality Strength: [Figure: IRR (reuse distance in blocks) vs. virtual time (reference stream) for the MULTI2 workload, with the cache size marked.] LRU is good for “absolutely” strong locality but bad for relatively weak locality.

LRU’s Inability with Weak Locality: Memory scanning (one-time accesses): infinite IRR, weak locality; such blocks should not be cached at all; yet LRU does not replace them in a timely way (they stay cached until their recency exceeds the cache size).

LRU’s Inability with Weak Locality: Loop-like accesses (repeated accesses with a fixed interval): the IRR equals the interval; if the interval is larger than the cache size, there are no hits, and the blocks to be accessed soonest can, unfortunately, be replaced.
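A small LRU simulation (my illustration, not from the slides; `lru_hits` is a hypothetical name) makes this pathology concrete: a cyclic trace over one more block than the cache holds gets zero hits under LRU.

```python
from collections import OrderedDict

def lru_hits(trace, cache_size):
    """Count hits for a simple LRU cache over a reference trace."""
    cache = OrderedDict()  # keys in LRU order: oldest first
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # refresh recency
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits

# A loop over 5 blocks with a 4-block cache: every access misses,
# because the block needed next is always the one just evicted.
print(lru_hits(list("ABCDE") * 10, cache_size=4))  # -> 0
```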

LRU’s Inability with Weak Locality: Accesses with distinct frequencies: the recencies of frequently accessed blocks become large because of references to infrequently accessed blocks; frequently accessed blocks could, unfortunately, be replaced.

Looking for Blocks with Strong Locality: [Figure: IRR (reuse distance in blocks) vs. virtual time (reference stream) for MULTI2, with the cache size marked; the highlighted region covers the 1000 blocks with the strongest locality.]

Challenges: Address the limitations of LRU fundamentally, while retaining LRU’s merits of low overhead and adaptability. Simplicity: an affordable implementation. Adaptability: responsiveness to access pattern changes.

Principle of the LIRS Replacement: If a block’s IRR is high, its next IRR is likely to be high again. We therefore keep the set of blocks with low IRRs in cache, and select blocks with high IRRs for replacement. LIRS: Low IRR Set replacement algorithm.

Requirements on the Low IRR Block Set (LIRS): the set size should be the cache size; the set consists of the blocks with the strongest locality (the lowest IRRs); the set is dynamically kept up to date.

Low IRR Block Set: Blocks are classified as Low IRR (LIR) blocks and High IRR (HIR) blocks. The LIR block set has size L_lirs and the HIR block set has size L_hirs, where the cache size L = L_lirs + L_hirs. [Figure: mapping of the LIR and HIR block sets onto the physical cache.]

An Example for LIRS: L_lirs = 2, L_hirs = 1; LIR block set = {A, B}, HIR block set = {C, D, E}.

Mapping to Cache Block Sets: [Figure: the LIR block set {A, B} occupies the L_lirs = 2 part of the physical cache; of the HIR block set {C, D, E}, only E is resident in the L_hirs = 1 part.]

Replace HIR Blocks: Which block is replaced? When D is referenced at time 10, the resident HIR block (E) is replaced!

How Is the LIR Set Updated? The recency of the LIR blocks is used.

After D is Referenced at Time 10: E is replaced and D enters the LIR set. [Figure: the updated block sets, with blocks B and D shown.]

If the Reference Is to C at Time 10: E is replaced, but C cannot enter the LIR set.

How LIRS Handles References with Weak Locality: Memory scanning (one-time accesses): infinite IRR; not included in the LIR block set; replaced in a timely manner.

How LIRS Handles References with Weak Locality: Loop-like accesses: the IRRs of all blocks are the same; once a block becomes an LIR block, it can keep its status; any cached block can contribute a hit in one loop of accesses.

How LIRS Handles References with Weak Locality: Accesses with distinct frequencies: frequently accessed blocks have smaller IRRs than infrequently accessed blocks; frequently accessed blocks are LIR blocks, so they are always cached and get hits.

Making LIRS O(1) Efficient: The decision requires comparing IRR_HIR (the new IRR of a referenced HIR block) against Rmax (the maximum recency of the LIR blocks). This efficiency is achieved by our LIRS stack: an LRU stack that keeps the LIR block with recency Rmax at its bottom.

Differences between the LRU and LIRS Stacks: [Figure: an LRU stack beside a LIRS stack, with L_lirs = 3 and L_hirs = 2; legend: resident block, LIR block, HIR block.] The LRU stack’s size is decided by the cache size and is fixed; the LIRS stack’s size is decided by Rmax and varies. The LRU stack holds only resident blocks; the LIRS stack holds any blocks whose recencies are no more than Rmax. The LRU stack does not distinguish “hot” and “cold” blocks in it; the LIRS stack distinguishes LIR and HIR blocks in it, and dynamically maintains their statuses.

How Does the LIRS Stack Help? A block in the LIRS stack was referenced within the last Rmax (the maximum recency of the LIR blocks) accesses, so on a new reference its IRR_HIR (new IRR of the HIR block) is less than Rmax; for blocks not in the stack, the new IRR is greater than Rmax.
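A minimal sketch of this idea, under my own naming (`LIRSStack` and `prune` are hypothetical): membership in the stack is the O(1) test for IRR_HIR < Rmax, and pruning maintains the invariant that the bottom block is an LIR block.

```python
from collections import deque

class LIRSStack:
    """Hypothetical sketch of the LIRS stack S."""

    def __init__(self):
        self.s = deque()     # index 0 = stack bottom, -1 = stack top
        self.status = {}     # block -> 'LIR' or 'HIR'

    def irr_below_rmax(self, block):
        # A block still in S was referenced within the last Rmax
        # accesses, so a new reference to it has IRR < Rmax.
        return block in self.s

    def prune(self):
        # Pop HIR blocks off the bottom until an LIR block is exposed;
        # that bottom LIR block is the one whose recency equals Rmax.
        while self.s and self.status.get(self.s[0]) != 'LIR':
            self.s.popleft()
```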

LIRS Operations: LIRS maintains the LIRS stack S and a stack Q of resident HIR blocks. [Figure: stack S and stack Q; cache size L = 5, L_lirs = 3, L_hirs = 2; legend: resident in cache, LIR block, HIR block.] Initialization: all referenced blocks are given LIR status until the LIR block set is full; resident HIR blocks are placed in stack Q.

Access an LIR Block (a Hit): [Figures across three slides: the accessed LIR block is moved to the top of stack S; if it was the bottom of S, stack pruning removes the HIR blocks now exposed at the bottom. Cache size L = 5, L_lirs = 3, L_hirs = 2.]

Access a Resident HIR Block (a Hit): [Figures across four slides: the block is moved to the top of stack S; if it was already in S, it is promoted to LIR status, the LIR block at the bottom of S is demoted to HIR and moved to the end of stack Q, and S is pruned; if it was not in S, it keeps its HIR status and is moved to the end of Q. Cache size L = 5, L_lirs = 3, L_hirs = 2.]

Access a Non-Resident HIR Block (a Miss): [Figures across three slides: the resident HIR block at the front of stack Q is evicted to free a buffer; the missed block is loaded and placed at the top of stack S; if it was still in S, it is promoted to LIR and the bottom LIR block of S is demoted to HIR and moved to the end of Q; otherwise it becomes a resident HIR block at the end of Q. Cache size L = 5, L_lirs = 3, L_hirs = 2.]
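Putting the three cases together, here is a compact, runnable reconstruction of the LIRS operations sketched above (my own code under the slides’ assumptions; class and method names are hypothetical, and details may differ from the authors’ implementation).

```python
from collections import OrderedDict

class LIRS:
    """Sketch of the LIRS operations above; names are hypothetical."""

    def __init__(self, l_lirs, l_hirs):
        self.l_lirs, self.l_hirs = l_lirs, l_hirs
        self.s = OrderedDict()    # LIRS stack S: first key = bottom, last = top
        self.q = OrderedDict()    # resident HIR blocks: first key = front of Q
        self.lir = set()          # blocks currently holding LIR status
        self.resident = set()     # blocks currently cached

    def _prune(self):
        # Pop HIR blocks off the bottom of S until an LIR block
        # (the one defining Rmax) sits at the bottom.
        while self.s and next(iter(self.s)) not in self.lir:
            self.s.popitem(last=False)

    def _promote(self, b):
        # b switches from HIR to LIR; the LIR block at the bottom of S
        # is demoted to HIR, moved to the end of Q, and S is pruned.
        self.lir.add(b)
        self.q.pop(b, None)                  # b is no longer in Q
        bottom, _ = self.s.popitem(last=False)
        self.lir.discard(bottom)
        self.q[bottom] = True                # demoted block: resident HIR
        self._prune()

    def access(self, b):
        hit = b in self.resident
        if b in self.lir:                    # case 1: LIR hit
            was_bottom = next(iter(self.s)) == b
            self.s.move_to_end(b)            # move b to the top of S
            if was_bottom:
                self._prune()
        elif hit:                            # case 2: resident HIR hit
            if b in self.s:                  # new IRR < Rmax: promote
                self.s.move_to_end(b)
                self._promote(b)
            else:                            # stays HIR
                self.s[b] = True             # re-enters S at the top
                self.q.move_to_end(b)        # refreshed at the end of Q
        else:                                # case 3: miss
            if len(self.resident) >= self.l_lirs + self.l_hirs:
                victim, _ = self.q.popitem(last=False)  # evict front of Q
                self.resident.discard(victim)           # may linger in S
            self.resident.add(b)
            if len(self.lir) < self.l_lirs:  # initialization phase
                self.lir.add(b)
                self.s[b] = True
            elif b in self.s:                # returned while still in S
                self.s.move_to_end(b)
                self._promote(b)
            else:                            # new or long-absent block
                self.s[b] = True
                self.q[b] = True             # resident HIR block
        return hit

# The looping trace that defeated LRU above now scores a hit on every
# LIR block in each pass (27 hits over ten passes, versus 0 for LRU):
cache = LIRS(l_lirs=3, l_hirs=1)
print(sum(cache.access(b) for b in list("ABCDE") * 10))
```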

Workload Traces: postgres is a trace of join queries among four relations in a relational database system; sprite is from the Sprite network file system; multi2 is obtained by executing three workloads, cs, cpp, and postgres, together.

Cache Partition: 1% of the cache size is for HIR blocks; 99% of the cache size is for LIR blocks. Performance is not sensitive to the exact partition.
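As a concrete reading of this rule (a sketch; the floor of one HIR buffer for very small caches is my assumption):

```python
def partition(cache_size_blocks):
    """Split a cache into LIR and HIR portions per the 1%/99% rule."""
    l_hirs = max(1, cache_size_blocks // 100)   # at least one HIR buffer
    l_lirs = cache_size_blocks - l_hirs
    return l_lirs, l_hirs

print(partition(500))  # -> (495, 5)
```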

Looping Pattern: postgres (Access Map) [Figure: logical block number vs. virtual time (reference stream).]

Looping Pattern: postgres (IRR Map) [Figure: IRR (reuse distance in blocks) vs. virtual time (reference stream), comparing LRU and LIRS.]

Looping Pattern: postgres (Hit Rates)

Temporally-Clustered Pattern: sprite (Access Map) [Figure: logical block number vs. virtual time (reference stream).]

Temporally-Clustered Pattern: sprite (IRR Map) [Figure: IRR (reuse distance in blocks) vs. virtual time (reference stream), comparing LRU and LIRS.]

Temporally-Clustered Pattern: sprite (Hit Ratio)

Mixed Pattern: multi2 (Access Map) [Figure: logical block number vs. virtual time (reference stream).]

Mixed Pattern: multi2 (IRR Map) [Figure: IRR (reuse distance in blocks) vs. virtual time (reference stream), comparing LIRS and LRU.]

Mixed Pattern: multi2 (Hit Ratio)

Summary: LIRS uses both IRR (or reuse distance) and recency for its replacement decisions, while 2Q uses only reuse distance. LIRS adapts to locality changes when deciding which blocks have small IRRs; 2Q uses a fixed threshold when looking for blocks with small reuse distances. Both LIRS and 2Q have time overhead as low as LRU’s; their space overheads are acceptably larger.