The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms, by Ali R. Butt, Chris Gniady, and Y. Charlie Hu, SIGMETRICS 2005.


Slide 1: The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
by Ali R. Butt, Chris Gniady, and Y. Charlie Hu, SIGMETRICS 2005
Course: CSCI 780 – Advanced Topics on Caching Techniques in Computer and Distributed Systems
Presenter: Chuan Yue

Slide 2: Outline
– The Buffer Cache
– Linux Kernel Prefetching
– Adapted Buffer Cache Replacement Algorithms
– Simulation Results
– Conclusions
– Discussions

Slide 3: Buffer Cache in Main Memory
Two kinds of I/O operations (a user-space sketch of the two paths follows):
– Direct access: read()/write() use a block-based buffer cache
– Memory-mapped I/O: shares the page cache with the virtual memory system
This naturally leads to two separate buffers. Problems:
– Double buffering
– Inconsistencies
(Figure: read()/write() I/O goes through the buffer cache; memory-mapped I/O and virtual memory go through the page cache; both are backed by the disk.)
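The two access paths on this slide can be exercised directly from user space. Below is a minimal Python sketch, assuming a file data.bin of at least 4 KB exists (Unix-only; the file name is illustrative):

```python
import mmap
import os

fd = os.open("data.bin", os.O_RDONLY)

# Path 1: direct access; read() is served through the (block-based) buffer cache.
head = os.read(fd, 4096)

# Path 2: memory-mapped I/O; the mapping is served from the page cache
# shared with the virtual memory system.
mm = mmap.mmap(fd, 4096, prot=mmap.PROT_READ)
head_mapped = mm[:4096]

assert head == head_mapped   # same bytes, reached via two different kernel paths
mm.close()
os.close(fd)
```

On a system without a unified cache, the same file data can end up resident twice, once per cache, which is exactly the double-buffering problem listed above.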

Slide 4: Unification of Buffer Cache and Page Cache
A unified buffer cache uses the same page cache to store:
– Virtual memory pages
– Memory-mapped pages
– Ordinary file system I/O
Issues:
– Complex interactions between the file system and VM
(Figure: read()/write() I/O, virtual memory, and memory-mapped I/O all go through a single unified buffer cache backed by the disk.)

Slide 5: Buffer Cache Management
Designing effective buffer cache replacement algorithms is a fundamental challenge in improving system performance:
– Traditional file I/O system
– Virtual memory system
Various buffer cache replacement algorithms:
– LRU replacement is widely used
– LRU is unable to cope with access patterns that have weak locality
– Other well-known algorithms that utilize recency information: LRU-2, 2Q, LIRS, LRFU, MQ, ARC

Slide 6: Prefetching
Prefetching is another highly effective technique for improving I/O performance. The main motivation for prefetching is to overlap computation with I/O and thus reduce the exposed latency of I/O.
Various prefetching techniques:
– Prefetching using user-inserted hints about I/O access patterns (an example hint call follows this slide)
  Drawback: places a burden on the programmer
– File-system kernel-driven prefetching in modern operating systems
  Synchronous read-ahead to amortize seek cost
  Asynchronous prefetching after detecting sequential access patterns
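As one concrete instance of the hint-based approach, POSIX exposes posix_fadvise(), which lets a program declare its access pattern so the kernel can prefetch more aggressively. This is a minimal illustration of the hint idea, not the specific hint interfaces studied in the prefetching literature; data.bin is an assumed file name:

```python
import os

fd = os.open("data.bin", os.O_RDONLY)   # assumes this file exists

# User-inserted hint: we will read this file sequentially, so the kernel
# may enlarge its read-ahead (available on Linux/Unix, Python 3.3+).
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

data = os.read(fd, 1 << 20)             # sequential read benefits from read-ahead
os.close(fd)
```

The kernel-driven approach on this slide needs no such calls: the kernel infers sequential access on its own, which is what the next slides describe for Linux.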

Slide 7: The Impact of Kernel Prefetching on Buffer Cache Replacement Algorithm Performance
The close interactions between caching and prefetching:
– Prefetching file blocks into the cache can be harmful (P. Cao et al., 1995)
– Both the replacement policy and prefetching affect the buffer cache hit ratio
– The hit ratio, prefetching, and clustering determine the I/O disk traffic
– The I/O disk traffic determines file system performance
Almost all proposed buffer cache replacement algorithms did not take kernel-driven prefetching into account.
The work in this paper:
– Shows the potential performance impact of kernel prefetching on buffer cache replacement algorithms
– Presents simulation results for 8 adapted replacement algorithms

Slide 8: Kernel Components on the Path from File System Operations to the Disk
(Figure only.)

Slide 9: Kernel Prefetching in Linux
Prefetching is based on the pattern of accesses to the file:
– Prefetching is only considered for read accesses
– Beneficial for sequential accesses to a file
Read-ahead group and read-ahead window
Synchronous prefetching and asynchronous prefetching (see the sketch below)
(Figure: the read-ahead group and read-ahead window, and how a new group and window are formed as accesses advance.)
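To make the grow-and-reset behavior concrete, here is a toy Python model of the policy this slide describes. It is a sketch under stated assumptions, not the kernel's actual read-ahead code: the doubling rule, the 32-page cap, and all names (MAX_WINDOW, ReadAhead, on_read) are illustrative.

```python
MAX_WINDOW = 32  # assumed cap on the read-ahead window, in pages

class ReadAhead:
    def __init__(self):
        self.window = 0          # pages to prefetch on the next trigger
        self.next_expected = 0   # page we expect if access is sequential

    def on_read(self, page):
        if page == self.next_expected and self.window > 0:
            # Sequential access detected: grow the window and prefetch
            # asynchronously, overlapping disk reads with computation.
            self.window = min(self.window * 2, MAX_WINDOW)
            kind = "async"
        else:
            # First access or a seek: synchronous read-ahead with a small
            # window to amortize the seek cost.
            self.window = 1
            kind = "sync"
        self.next_expected = page + 1
        prefetched = list(range(page + 1, page + 1 + self.window))
        return kind, prefetched

ra = ReadAhead()
for p in [0, 1, 2, 3, 100]:      # a sequential run, then a seek
    print(p, ra.on_read(p))      # window grows 1, 2, 4, 8, then resets to 1
```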

Slide 10: Belady's Algorithm Can Be Non-optimal Given Kernel Prefetching
Access sequence: a c e g i k m o a b c d e f g h i j k l m n o p
Without prefetching: Belady's algorithm, 16 cache misses; LRU, 23 cache misses
With prefetching: Belady's algorithm, 8 cache misses; LRU, 6 cache misses
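The no-prefetching numbers can be checked with a few lines of Python. The sketch below assumes an 8-block cache (the slide does not state the size, but 8 reproduces its miss counts); belady_misses and lru_misses are illustrative helpers, not code from the paper. Reproducing the with-prefetching numbers would additionally require the kernel read-ahead model.

```python
def belady_misses(seq, size):
    """Belady/OPT: on a miss with a full cache, evict the resident block
    whose next use lies farthest in the future (or never occurs)."""
    cache, misses = set(), 0
    for i, blk in enumerate(seq):
        if blk in cache:
            continue
        misses += 1
        if len(cache) >= size:
            future = seq[i + 1:]
            victim = max(cache, key=lambda b: future.index(b)
                         if b in future else float("inf"))
            cache.remove(victim)
        cache.add(blk)
    return misses

def lru_misses(seq, size):
    """LRU: evict the least recently used block."""
    cache, misses = [], 0            # list ordered LRU -> MRU
    for blk in seq:
        if blk in cache:
            cache.remove(blk)        # hit: refresh recency
        else:
            misses += 1
            if len(cache) >= size:
                cache.pop(0)         # evict the LRU end
        cache.append(blk)
    return misses

seq = list("acegikmo") + list("abcdefghijklmnop")
print(belady_misses(seq, 8), lru_misses(seq, 8))   # prints: 16 23
```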

Slide 11: Prefetching Has Been Ignored in Algorithm Design
Caching algorithms have been proposed and studied without considering prefetching:
– OPT
– LRU
– LRU-K [SIGMOD 1993]
– 2Q [VLDB 1994]
– LRFU [TC 2001]
– MQ [USENIX 2001]
– LIRS [SIGMETRICS 2002]
– ARC [FAST 2003]
Changes to OPT, LRU, 2Q, and LIRS will be explained.

Slide 12: OPT
OPT is based on Belady's cache replacement algorithm:
– Off-line; has knowledge of future references
In the presence of Linux kernel prefetching:
– Prefetched blocks are assumed to be accessed most recently and are inserted into the cache according to the original OPT algorithm
– But OPT is given the added capability to immediately detect wrong prefetches, i.e., prefetched blocks that will never be accessed on demand, or will be accessed further in the future than all other blocks in the cache
– Wrongly prefetched blocks become immediate candidates for removal (see the sketch below)
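A minimal Python sketch of the wrong-prefetch test just described. The function name and structure are illustrative assumptions, not the paper's simulator code; future stands for the remaining demand-access sequence that the off-line OPT is allowed to see.

```python
def wrong_prefetches(cache, prefetched, future):
    """Return the prefetched blocks that adapted OPT treats as immediate
    eviction candidates: blocks never demanded again, or demanded later
    than every block already resident in the cache."""
    def next_use(blk):
        return future.index(blk) if blk in future else float("inf")
    # Farthest next demand access among currently cached blocks.
    horizon = max((next_use(b) for b in cache), default=float("-inf"))
    return [p for p in prefetched
            if next_use(p) == float("inf") or next_use(p) > horizon]

# Example: 'x' is never demanded again, so it is a wrong prefetch.
print(wrong_prefetches({"a", "b"}, ["c", "x"], ["c", "a", "b"]))  # ['x']
```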

Slide 13: LRU
LRU is the most widely used replacement policy. In the presence of kernel prefetching, the adapted LRU works as follows:
– On each access, the kernel determines the number of blocks that need to be prefetched
– Prefetched blocks are inserted at the MRU locations just like regular blocks (see the sketch below)
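A small Python sketch of this adaptation, assuming the simplest possible interface: one demand block plus the list of blocks the kernel chose to prefetch. The class and method names are illustrative, not from the paper.

```python
from collections import OrderedDict

class PrefetchLRU:
    """LRU cache in which kernel-prefetched blocks enter at the MRU end,
    exactly as on-demand blocks do (the adaptation described above)."""
    def __init__(self, size):
        self.size = size
        self.cache = OrderedDict()   # keys ordered LRU -> MRU

    def _insert(self, blk):
        if blk in self.cache:
            self.cache.move_to_end(blk)      # refresh to MRU
            return
        if len(self.cache) >= self.size:
            self.cache.popitem(last=False)   # evict the LRU block
        self.cache[blk] = True

    def access(self, blk, prefetched=()):
        hit = blk in self.cache
        self._insert(blk)                    # demand block goes to MRU
        for p in prefetched:                 # prefetched blocks likewise
            self._insert(p)
        return hit
```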

Slide 14: 2Q
Three buffers and the algorithm:
– A1in queue: all missed blocks are initially placed here
– A1out queue: when blocks are replaced from the A1in queue in FIFO order, their addresses are temporarily placed here
– Am queue: when a block is re-referenced while its address is in the A1out queue, it is promoted to the Am queue
(Figure: example block sequence 10, 11, 12, 13, 14, 11, 12, 22 flowing through A1in, A1out (addresses only), and Am.)

Slide 15: 2Q – With Adaptation (in the presence of kernel prefetching)
Prefetched blocks are treated as on-demand blocks:
– A prefetched block is initially placed into the A1in queue
– On a subsequent on-demand access, the block stays in the A1in queue
– If the prefetched block is evicted from the A1in queue before any on-demand access, it is simply discarded, as opposed to being moved into the A1out queue
– If a block currently in the A1out queue is prefetched, it is promoted into the Am queue as if it were accessed on demand
(Figure: example demand and prefetch block sequence 10, 11, 12, 11, 13, 14, 11, 22 flowing through A1in, A1out (addresses only), and Am.)
A combined sketch of the base algorithm and this adaptation follows.
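The sketch below puts slides 14 and 15 together in Python. It is a simplified 2Q: the original 2Q paper's queue-size tuning (the Kin/Kout thresholds) is reduced here to assumed fixed fractions of the cache, and all names are illustrative.

```python
from collections import OrderedDict, deque

class TwoQPrefetch:
    """Simplified 2Q with the prefetch adaptation from the slides.
    Assumed sizing: A1in = 25% and A1out = 50% of the cache size."""
    def __init__(self, size):
        self.a1in_size = max(1, size // 4)
        self.a1out_size = max(1, size // 2)
        self.am_size = size - self.a1in_size
        self.a1in = OrderedDict()    # resident blocks, FIFO; value: prefetch-only flag
        self.a1out = deque()         # addresses only (ghosts) evicted from A1in
        self.am = OrderedDict()      # resident blocks, LRU order

    def _evict_a1in(self):
        blk, prefetch_only = self.a1in.popitem(last=False)
        if not prefetch_only:            # block was demanded at least once:
            self.a1out.append(blk)       # remember its address in A1out
            if len(self.a1out) > self.a1out_size:
                self.a1out.popleft()
        # A block that was only ever prefetched is simply discarded.

    def _touch(self, blk, prefetch):
        if blk in self.am:               # Am hit: LRU refresh
            self.am.move_to_end(blk)
        elif blk in self.a1in:           # A1in hit: stays in A1in,
            if not prefetch:             # and a demand access clears
                self.a1in[blk] = False   # the prefetch-only flag
        elif blk in self.a1out:          # A1out hit (even by a prefetch):
            self.a1out.remove(blk)       # promote into Am
            if len(self.am) >= self.am_size:
                self.am.popitem(last=False)
            self.am[blk] = True
        else:                            # cold miss: enter A1in
            if len(self.a1in) >= self.a1in_size:
                self._evict_a1in()
            self.a1in[blk] = prefetch    # True if prefetch-only so far

    def access(self, blk, prefetched=()):
        hit = blk in self.a1in or blk in self.am
        self._touch(blk, prefetch=False)
        for p in prefetched:
            self._touch(p, prefetch=True)
        return hit
```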

Slide 16: LIRS
LIRS dynamically and responsively maintains the LIR block set and the HIR block set, and keeps the LIR block set in the cache. In the presence of kernel prefetching, the adapted LIRS works as follows:
– Prefetched blocks are not inserted into the LIRS stack S; they are only inserted into the HIR stack Q
– If a prefetched block did not have an existing entry in LIRS stack S, the first on-demand access to the block causes it to be inserted at the top of LIRS stack S as a HIR block
– If a prefetched block exists in LIRS stack S, the first on-demand access to the block is treated as a LIR block access

Slide 17: Performance Evaluation
Trace collection:
– Interception of I/O system calls (using a modified Linux strace utility)
– Collects I/O access type, time, file identifier (inode), and I/O size
Timing-accurate trace simulator:
– Detailed implementation of kernel prefetching and clustering
– Interfaces with the DiskSim simulator to simulate I/O time
– Implements OPT, LRU, LRU-2, LRFU, LIRS, MQ, 2Q, and ARC
Metrics:
– Hit ratio
– Aggregated synchronous and asynchronous disk I/O requests
– Actual running time

Slide 18: Applications and Trace Statistics
Concurrent applications: Multi1 = cscope + gcc; Multi2 = cscope + gcc + viewperf; Multi3 = glimpse + TPC-H. (The per-application trace statistics table is not reproduced in the transcript.)

Slide 19: Hit Ratio Results for cscope
– Kernel prefetching has a significant impact on the hit ratio
– The improvement differs across algorithms
– Prefetching can result in significant changes in the relative performance of replacement algorithms

Slide 20: Disk Request Results for cscope
– The clustering of I/O requests in the presence of prefetching results in a significant reduction in the number of disk requests
– The effect is complex and closely tied to the file access patterns

Slide 21: Execution Time Results for cscope
– A reduction in the number of disk requests due to kernel prefetching does not necessarily translate into a reduction in execution time

Slide 22: Results for the Other Three Sequential-Access Applications
Glimpse:
– Also benefits from prefetching
– The changes in the relative behavior of the different algorithms observed in cscope with prefetching are also observed in glimpse
Viewperf:
– Benefits the most from prefetching
– The behavior of the different cache replacement algorithms is similar to that observed in cscope
Gcc:
– Many accesses are to small files, leaving little opportunity for prefetching
– All three performance metrics are almost identical with and without prefetching

Slide 23: Hit Ratio Results for tpc-h
– Prefetching provides little improvement in the hit ratio for the random access pattern

Slide 24: Disk Request Results for tpc-h
– Most prefetched blocks are never accessed, and as a result the number of disk requests is doubled

Slide 25: Execution Time Results for tpc-h
– The significant increase in the number of I/Os translates into a significant increase in execution time

Slide 26: Results for Concurrent Applications
Multi1 (cscope, gcc):
– Similar to the results for cscope
Multi2 (cscope, gcc, viewperf):
– Similar to Multi1; however, prefetching does not improve the execution time because viewperf is CPU-bound
Multi3 (glimpse, TPC-H):
– Similar to the results for tpc-h

Slide 27: Number and Size of Synchronous and Asynchronous Disk I/Os in cscope at a 128 MB Cache Size
– The total number of disk requests with prefetching is at least 30% lower than without prefetching for all schemes except OPT
– Most of the reduction in disk requests comes from issuing asynchronous disk requests, which can be overlapped with CPU time

Slide 28: Conclusions
In this work, the authors:
– Proposed prefetching implementations for different replacement algorithms
– Built a timing simulator to evaluate their relative performance
The paper shows:
– Prefetching impacts the hit ratio, disk requests, and execution time
– Comparing hit ratios alone is insufficient
– Kernel prefetching can narrow the performance gap between different replacement algorithms
– Kernel prefetching can also change the relative performance benefits of different replacement algorithms
Future buffer caching research should:
– Take prefetching and I/O clustering into consideration
– Simulate execution time

Slide 29: Discussions (1)
Good points:
– No new algorithm, but the paper is the first to simulate and compare the impact of kernel prefetching on well-known buffer cache replacement algorithms
– The results are not very surprising; we can guess the general results for sequential and random workloads, but this paper is the first to report them
Bad points:
– The simulation is based only on I/O traces. It would be better if VM-trace-based results were also presented.
– The simulation results for concurrent applications are not analyzed in detail (in the paper itself).
– It would be better if the unification of the buffer cache and page cache in many OSes were considered, and if the competition between process page accesses and file cache page accesses were simulated and analyzed.

Slide 30: Discussions (2)
Some questions:
– Regarding Belady's anomaly:
  In the LIRS paper, Belady's anomaly appears in 2Q and ARC for the glimpse workload.
  In this paper, without prefetching, the simulation results did not show Belady's anomaly; with prefetching, Belady's anomaly appears in ARC for the glimpse workload.
  Why the differences? LRU is free of Belady's anomaly; what about the other algorithms?
– Regarding the simulations:
  Is there any relationship between the cache sizes selected in the simulation and the real environment where the traces were collected?
  Is performance under thrashing conditions still worth simulating?

Slide 31: References
– A Study of Integrated Prefetching and Caching Strategies, P. Cao et al., ACM SIGMETRICS, 1995.
– Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance, S. Jiang and X. Zhang, IEEE Transactions on Computers, Vol. 54, No. 9, September 2005.
– CLOCK-Pro: An Effective Improvement of the CLOCK Replacement, S. Jiang, F. Chen, and X. Zhang, Proceedings of the 2005 USENIX Annual Technical Conference (USENIX '05).
– "Page Replacement in Linux 2.4 Memory Management," Rik van Riel, Proceedings of the 2001 USENIX Technical Conference, FREENIX track.
– Towards an O(1) VM: Making Linux Virtual Memory Management Scale Towards Large Amounts of Physical Memory, Rik van Riel, Proceedings of the Linux Symposium, July 2003.
– Journal File Systems in Linux, June 21st, 2005.
– The Buffer Cache, June 21st, 2005.
– The Performance Impact of Kernel Prefetching on Buffer Cache Replacement, Chris Gniady et al. (Purdue University), ACM SIGMETRICS 2005 presentation slides.
– More on File System (lecture notes, June 22nd, 2005).

Slide 32: Thank you!