Informed Prefetching and Caching
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka

Contribution
Two basic functions of a file system:
- Management of disk accesses
- Management of main-memory file buffers
Approach: use hints from I/O-intensive applications to prefetch aggressively enough to eliminate I/O stall time while maximizing buffer availability for caching.
Key question: how to allocate cache buffers dynamically among competing hinting and non-hinting applications for the greatest performance benefit:
- Balance caching against prefetching
- Distribute cache buffers among competing applications

Motivation
- Growing storage parallelism and CPU performance make application performance increasingly dependent on I/O.
- Larger caches raise cache-hit ratios, but I/O-intensive applications benefit little:
  - Amount of data processed >> file cache size
  - Locality is poor or limited; accesses are frequently non-sequential
  - I/O stall time is a large fraction of total execution time
  - Yet access patterns are largely predictable
- How can I/O workloads be improved to take full advantage of the hardware that already exists?

ASAP: the four virtues of I/O workloads
- Avoidance: not a scalable solution to the I/O bottleneck
- Sequentiality: scales for writes but not for reads
- Asynchrony: scalable through write buffering; scaling for reads depends on prefetching aggressiveness
- Parallelism: scalable for explicitly parallel I/O requests; for serial workloads, scalable parallelism is achieved by scaling the number of asynchronous requests
Asynchrony eliminates write latency, and parallelism provides throughput. No existing technique scalably relieves the I/O bottleneck for reads; that calls for aggressive prefetching.

Prefetching
Aggressive prefetching can do for reads what write buffering already does for writes.

Hints
- Historical information: e.g., the LRU cache replacement algorithm
- Sequential readahead: prefetching up to 64 blocks ahead when the system detects long sequential runs
- Disclosure: hints based on advance knowledge of the application's own accesses
  - A mechanism for portable I/O optimizations
  - Provides evidence for a policy decision rather than dictating one
  - Conforms to the software engineering principle of modularity
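The slides do not show the hint interface itself. As a rough illustration of disclosure (the application reports what it will read, not what the system should do), here is a hypothetical C sketch; the names tip_hint_t and tipio_disclose_reads are invented for this example and are not the paper's API.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical disclosure-hint descriptor in the spirit of TIP. */
typedef struct tip_hint {
    int    fd;      /* open file the future reads will target   */
    off_t  offset;  /* starting byte offset of the future read  */
    size_t length;  /* number of bytes that will be read        */
} tip_hint_t;

/* Stub standing in for the real kernel call: the application only
 * discloses what it will read, in order; it issues no prefetch or
 * caching commands, leaving those policy decisions to the system. */
static int tipio_disclose_reads(const tip_hint_t *hints, size_t nhints)
{
    (void)hints;
    (void)nhints;
    return 0;
}

int main(void)
{
    /* e.g., a search tool about to scan two whole files front to back */
    tip_hint_t hints[] = {
        { /* fd */ 3, /* offset */ 0, /* length */ 8192 * 100 },
        { /* fd */ 4, /* offset */ 0, /* length */ 8192 * 250 },
    };
    if (tipio_disclose_reads(hints, sizeof hints / sizeof hints[0]) != 0)
        fprintf(stderr, "hint disclosure failed (hints are optional)\n");
    return 0;
}
```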

Informed Prefetching
- System: TIP-1, implemented in OSF/1, which already includes two I/O optimizations of its own
- Applications: 5 I/O-intensive benchmarks, single-threaded, with data fetched from the file system
- Hardware: DEC 3000/500 workstation, 128 MB RAM, 5 KZTSA fast SCSI-2 adapters, each hosting three HP2247 1 GB disks; 12 MB (1536 x 8 KB) file cache
- Stripe unit: 64 KB; cluster prefetch: 5 prefetches; disk scheduler: SCAN within the striper
- (Figure: TIP-1 buffer cache, showing 512 buffers (1/3 of the cache), unread hinted prefetch buffers, the LRU queue, and a count_unread_buffers counter decremented as hinted blocks are read.)

Agrep
- Search: Agrep woodworking 224_newsgroup_msg (358 disk blocks)
- The files are read from beginning to end

Agrep (cont’d)
Total elapsed time for the 4 searches is reduced by up to 84%.

Postgres
- Join of two relations
- Outer relation: 20,000 unindexed tuples (3.2 MB)
- Inner relation: 200,000 tuples (32 MB), indexed (5 MB index)
- Output: about 4,000 tuples, written sequentially

Postgres (cont’d)
Elapsed time is reduced by up to 55%.

MCHF Davidson algorithm
- MCHF: a suite of computational-chemistry programs used for atomic-structure calculations
- Davidson algorithm: an element of MCHF that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk
- Matrix size: 17 MB
- The algorithm repeatedly accesses the same large file sequentially

MCHF Davidson algorithm (cont’d)

Hints disclose only sequential access in one large file, so OSF/1’s aggressive readahead performs better than TIP-1 here. Neither OSF/1 nor informed prefetching alone uses the 12 MB of cache buffers well: because the file is reread from the start on every pass, the LRU replacement algorithm flushes all of the blocks before any of them are reused.

Informed caching
- Goal: allocate cache buffers to minimize application elapsed time
- Approach: estimate the impact on execution time of alternative buffer allocations and then choose the best allocation
- Three broad uses for each buffer:
  - Caching recently used data in the traditional LRU queue
  - Prefetching data according to hints
  - Caching data that a predictor indicates will be reused in the future

Three uses of cache buffers
It is difficult to estimate the performance impact of alternative allocations at a global level.

Cost-benefit analysis
- System model: from which the various cost and benefit estimates are derived
- Derivations: a cost or benefit estimate for each component
- Comparison: how to compare the estimates at a global level to find the globally least valuable buffer and the globally most beneficial consumer

System assumptions
- A modern OS with a file buffer cache, running on a uniprocessor with enough memory to make a substantial number of cache buffers available
- The workload emphasizes read-intensive applications
- Every application I/O access requests a single file block that can be read in a single disk access, and requests are not too bursty
- System parameters are constant
- There is enough disk parallelism that there is no congestion

System model
Elapsed time is modeled as the number of I/O requests times the average time per request:
    T = N_I/O × (T_app + T_I/O)
where N_I/O is the number of I/O requests, T_app is the average application CPU time between requests, and T_I/O is the average time to service an I/O request. The service time includes the system overhead of allocating a buffer, queuing the request at the drive, and servicing the interrupt when the I/O completes.
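As a worked example of this model, a minimal sketch follows; the parameter values are assumptions chosen only to make the arithmetic concrete, not measurements from the paper, and the hit/miss blend is a simple modeling choice for this sketch.

```c
#include <stdio.h>

int main(void) {
    /* Assumed example parameters (illustrative only). */
    double n_io     = 100000;  /* number of I/O requests                          */
    double t_app    = 200e-6;  /* avg application CPU time between requests (s)   */
    double t_hit    = 250e-6;  /* avg service time when the block is cached (s)   */
    double t_driver = 500e-6;  /* overhead: buffer allocation, queuing, interrupt */
    double t_disk   = 15e-3;   /* avg physical disk access time (s)               */
    double hit_rate = 0.6;     /* fraction of requests that hit in the cache      */

    /* Average time to service one I/O request: hits cost t_hit,
       misses additionally pay the driver overhead and the disk access. */
    double t_io = hit_rate * t_hit
                + (1.0 - hit_rate) * (t_hit + t_driver + t_disk);

    /* Elapsed time = (# I/O requests) x (app CPU time + I/O service time). */
    double t_elapsed = n_io * (t_app + t_io);
    printf("modeled elapsed time: %.1f s\n", t_elapsed);  /* ~665 s here */
    return 0;
}
```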

Cost of deallocating LRU buffer
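The figure for this slide is not preserved. One plausible form of the estimate it illustrates, assuming h(n) denotes the observed hit ratio at the n-th position of the LRU queue and T_hit, T_miss the hit and miss service times from the system model, is that giving up the n-th LRU buffer costs, per unhinted access, roughly:

```latex
\Delta T_{\mathrm{LRU}}(n) \;\approx\; h(n)\,\bigl(T_{\mathrm{miss}} - T_{\mathrm{hit}}\bigr)
```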

The benefit of prefetching
- Prefetching a block can mask some of the latency of a disk read; the full disk access time is therefore an upper bound on the benefit of prefetching a block.
- If the prefetch can be delayed and still complete before the block is needed, we consider there to be no benefit from starting the prefetch now.
- (Figure: a prefetch issued x requests before the block is consumed.) There is no benefit from prefetching further ahead than the prefetch horizon, P.

The prefetch horizon
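The horizon figure is not preserved. Under the common reading of the model above (a prefetch issued x accesses in advance has roughly x × (T_app + T_hit + T_driver) time to complete before the block is consumed, where T_driver is the per-request driver overhead), a minimal sketch of the horizon computation, reusing the same illustrative assumed parameters:

```c
#include <math.h>
#include <stdio.h>

/* Smallest prefetch depth at which a disk read can always complete
   before the block is consumed, under the simple model above. */
static int prefetch_horizon(double t_app, double t_hit,
                            double t_driver, double t_disk) {
    return (int)ceil(t_disk / (t_app + t_hit + t_driver));
}

int main(void) {
    /* Illustrative assumed values (seconds), not the paper's measurements. */
    double t_app = 200e-6, t_hit = 250e-6, t_driver = 500e-6, t_disk = 15e-3;
    printf("prefetch horizon: %d accesses\n",
           prefetch_horizon(t_app, t_hit, t_driver, t_disk));  /* 16 here */
    return 0;
}
```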

Comparison of LRU cost to prefetching benefit
- The shared resource is cache buffers, so estimates are expressed in a common currency: time per access is converted into time per buffer.
- The conversion uses the rate of hinted accesses and the rate of unhinted demand accesses.
- A buffer should be reallocated from the LRU cache for prefetching when the prefetching benefit, in this common currency, exceeds the cost of giving up the LRU buffer.

The cost of flushing a hinted block
When should we flush a hinted block? (Figure: timeline from the flush to the hinted access, y accesses away; the block must be prefetched back no later than P accesses before it is used, so the buffer is free for at most y - P accesses.) The cost of flushing is the extra disk read needed to prefetch the block back before its hinted access.
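As a hedged reconstruction only (the slide's cost expression is not preserved): counting just the extra driver overhead T_driver of the added disk read, and assuming the prefetch back is issued early enough to avoid stall, a plausible per-access form of the cost of ejecting a block hinted for use y accesses in the future (P-hat being the prefetch horizon) is:

```latex
\Delta T_{\mathrm{eject}}(y) \;\approx\; \frac{T_{\mathrm{driver}}}{y - \hat{P}}, \qquad y > \hat{P}
```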

Putting it all together: global min-max
The three estimates above (LRU cost, prefetching benefit, and the cost of flushing a hinted block) answer two questions:
- Which block should be replaced when a buffer is needed for prefetching or to service a demand request? The globally least valuable block in the cache.
- Should a cache buffer be used to prefetch data now? Prefetch if the expected benefit is greater than the expected cost of flushing or stealing the least valuable block.
There are separate estimators for the LRU cache and for each independent stream of hints.

Value estimators
- LRU cache estimator: a block's value depends on its position, the i-th position of the LRU queue
- Hint estimators: one per hint stream, valuing each hinted block
- Global value = max(value_LRU, value_hint)
- Globally least valuable block = the block with the minimum global value
- Together, a global min-max valuation of blocks
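The slides give only the max/min rule; the sketch below uses an assumed, illustrative data layout (not the paper's) to show how the rule could be applied: each block carries an LRU-derived value and a hint-derived value, its global value is the larger of the two, and the replacement victim is the block with the smallest global value.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative block descriptor: values are "time saved per buffer-access"
   as estimated by the LRU estimator and by the block's hint stream (if any). */
typedef struct {
    int    id;
    double value_lru;   /* 0 if the block is not in the LRU queue        */
    double value_hint;  /* 0 if no outstanding hint references the block */
} block_t;

static double global_value(const block_t *b) {
    /* A block is worth whichever of its two roles values it more. */
    return b->value_lru > b->value_hint ? b->value_lru : b->value_hint;
}

/* Pick the replacement victim: the globally least valuable block. */
static const block_t *least_valuable(const block_t *blocks, size_t n) {
    const block_t *victim = &blocks[0];
    for (size_t i = 1; i < n; i++)
        if (global_value(&blocks[i]) < global_value(victim))
            victim = &blocks[i];
    return victim;
}

int main(void) {
    block_t cache[] = {
        { 1, 0.8,  0.0  },  /* near the head of the LRU queue             */
        { 2, 0.1,  0.0  },  /* near the tail of the LRU queue             */
        { 3, 0.0,  0.5  },  /* hinted for reuse fairly soon               */
        { 4, 0.0,  0.05 },  /* hinted, but its reuse is far in the future */
    };
    printf("evict block %d\n", least_valuable(cache, 4)->id);  /* block 4 */
    return 0;
}
```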

Informed caching example: MRU
The informed cache manager discovers MRU caching without being specifically coded to implement this policy: when a file larger than the cache is read sequentially again and again, the block just read is the one whose next hinted access is furthest in the future, so it is valued least and replaced first.

Implementation of informed caching and prefetch

Implementation of informed caching and prefetch (cont’d)

Performance improvement by informed caching

Balance contention

Future work
- Richer hint languages to disclose future accesses
- Strategies for dealing with imprecise but still useful hints
- Cost-benefit model adapted to non-uniform bandwidths
- Extensibility, e.g., a VM estimator to track VM pages