Cache Replacement Algorithms with Nonuniform Miss Costs. Jeong, J. and Dubois, M. IEEE Transactions on Computers, Volume 55, Issue 4, Pages 353-365, April 2006.



2/21 Abstract
Cache replacement algorithms originally developed in the context of uniprocessors executing one instruction at a time implicitly assume that all cache misses have the same cost. However, in modern systems, some cache misses are more expensive than others. The cost may be latency, penalty, power consumption, bandwidth consumption, or any other ad hoc numeric property attached to a miss. We call the class of replacement algorithms designed to minimize a nonuniform miss cost function "cost-sensitive replacement algorithms." In this paper, we first introduce and analyze an optimum cost-sensitive replacement algorithm (CSOPT) in the context of multiple nonuniform miss costs. CSOPT can significantly improve the cost function over OPT (the replacement algorithm minimizing miss count) in large regions of the design space. Although CSOPT is an offline and unrealizable replacement policy, it serves as a lower bound on the achievable cost by realistic cost-sensitive replacement algorithms. Using the practical example of latency cost in CC-NUMA multiprocessors, we demonstrate that there is a lot of room left to improve current replacement algorithms in many situations beyond the promise of OPT. Next, we introduce three practical extensions of LRU inspired by CSOPT and we compare their performance to LRU, OPT and CSOPT. Finally, as a practical application, we evaluate these realizable cost-sensitive replacement algorithms in the context of the second-level caches of a CC-NUMA multiprocessor with superscalar processors, using the miss latency as the cost function. By applying simple replacement policies sensitive to the latency of misses, we can improve the execution time of some parallel applications by up to 18 percent.

3/21 What's the Problem
• Cache replacement algorithms widely used in modern systems aim to reduce the aggregate miss count and assume that miss costs are uniform
• However, the uniform-cost assumption has lost its validity in multiprocessor systems
  – The cost of a miss mapping to a remote memory is higher than the cost of a miss mapping to a local memory
• This motivates the exploration of replacement policies that minimize the aggregate cost under multiple nonuniform miss costs instead of the miss count

4/21 Introduction
• A cache replacement algorithm that reaches the minimum aggregate cost under multiple miss costs: the Cost-Sensitive OPTimal replacement algorithm (CSOPT)
  – CSOPT is an extension of OPT, the classical replacement algorithm minimizing miss count
  – CSOPT and OPT are identical under a uniform miss cost
• However, CSOPT and OPT require knowledge of future memory accesses and are unrealizable
  – Thus, we also introduce three realistic online cost-sensitive replacement algorithms by extending the LRU algorithm
• With multiple miss costs, CSOPT does not always replace the block selected by OPT
  – CSOPT considers the option of keeping the block victimized by OPT in the cache until its next reference

5/21 Replacement Algorithm Optimizing the Miss Count
• OPTimal Replacement Algorithm (OPT)
  – Assumes a uniform miss cost and minimizes the miss count
  – The victim block selected by OPT is the block whose next reference is farthest away in the future among all blocks in the cache
• OPT can be implemented with a priority list
  – Consider a trace of block addresses X = x_1, x_2, x_3, …, x_L
  – The forward distance to a block a at time t, w_t(a): defined as the position t' in the trace such that x_t' is the first reference to block a after time t (i.e., the next reference time of block a); if block a is never referenced again after time t, w_t(a) is set to L+1
  – The priority list at time t, P_t: blocks ordered by their forward distances right before the reference x_t is performed
  – Initially, P_1 contains null blocks whose forward distances are set to L+1; P is updated before each reference. When a replacement is required, the victim is the block whose forward distance is largest in P, i.e., the block at the bottom of P (see the sketch below)
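A minimal Python sketch of OPT simulation over a trace, for a single fully associative set. The function names and the toy trace are illustrative, not from the paper; a real implementation would precompute forward distances instead of rescanning the trace on every miss.

```python
def forward_distance(trace, t, block):
    """Position of the first reference to `block` after time t, or L+1 if none."""
    for tp in range(t + 1, len(trace)):
        if trace[tp] == block:
            return tp
    return len(trace) + 1

def simulate_opt(trace, capacity):
    """Miss count under OPT for a fully associative cache of `capacity` blocks."""
    cache, misses = set(), 0
    for t, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Victim: the cached block whose next reference is farthest away
            victim = max(cache, key=lambda b: forward_distance(trace, t, b))
            cache.remove(victim)
        cache.add(x)
    return misses

print(simulate_opt(list("abcadbeabc"), 3))  # 6 misses on this toy trace
```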

6/21 Optimum Replacement Policy With Multiple Miss Costs
• Cost-Sensitive OPTimal replacement algorithm (CSOPT)
  – Let c(x_t) be the miss cost of the memory reference with block address x_t at time t, for a trace of memory references X = x_1, x_2, x_3, …, x_L
  – The problem is to find a replacement algorithm such that the aggregate cost of the trace, C(X) = Σ c(x_t) (summed over the references that miss), is minimized
• Basic implementation of CSOPT
  – Expand all possible replacement sequences in a search tree and pick the sequence with the least cost at the end
  – Every reference adds one level of depth to the search tree, and every miss offers s possible blocks to replace, where s is the set size in blocks
  – This procedure is extremely complex and infeasible (see the sketch below)
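A naive recursive sketch of the exhaustive search that the basic CSOPT formulation implies, for a single fully associative set; it is exponential in the number of misses and is shown only to make the branching factor concrete. All names and the toy cost function are illustrative assumptions.

```python
CAPACITY = 2

def csopt_naive(trace, cache, t, cost_of):
    """Minimum aggregate miss cost from time t onward, branching on every
    possible victim at every miss (exponential search tree)."""
    if t == len(trace):
        return 0
    x = trace[t]
    if x in cache:  # hit: no cost, no decision
        return csopt_naive(trace, cache, t + 1, cost_of)
    miss = cost_of(x)
    if len(cache) < CAPACITY:  # cold fill: no victim needed
        return miss + csopt_naive(trace, cache | {x}, t + 1, cost_of)
    # Branch on all s possible victims and keep the cheapest outcome
    return miss + min(
        csopt_naive(trace, (cache - {v}) | {x}, t + 1, cost_of)
        for v in cache
    )

# Block "a" is twice as expensive to miss on as any other block
print(csopt_naive(list("abcab"), frozenset(), 0, lambda b: 2 if b == "a" else 1))  # 5
```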

7/21 Exploiting OPT to Cut the Branch Factor
• The basic idea
  – If all cache blocks have the same miss cost for their next reference, the victim can be selected by invoking OPT
• Improvement of CSOPT
  – Consider the next miss cost at time t, f_t(i): the miss cost to bring p_t(i) back into the cache at its next reference, if p_t(i) is replaced at time t
  – If f_t(s) ≤ f_t(i) for every i < s, invoke OPT to select p_t(s) as the victim
  – If f_t(i) < f_t(s) for some i < s, reservations can be made for more than one block at a time, with up to s-1 reservation options; this is the only situation in which the search branches, with two replacement options (see the sketch below):
    (1) Pursuing OPT: still select p_t(s) as the victim
    (2) Reserving p_t(s) until its next reference by replacing one of the lower-cost blocks
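A small sketch of this branching rule, assuming the priority list is represented with OPT's victim last; `next_cost` stands in for f_t. It only enumerates the victims a CSOPT search node would branch on.

```python
def branch_victims(priority_list, next_cost):
    """Victim candidates at a miss. priority_list is ordered by forward
    distance; its last entry p_t(s) is OPT's victim. next_cost(b) is f_t:
    the cost to bring b back at its next reference if it is evicted now."""
    opt_victim = priority_list[-1]
    cheaper = [b for b in priority_list[:-1]
               if next_cost(b) < next_cost(opt_victim)]
    if not cheaper:
        return [opt_victim]           # no branching: plain OPT step
    return [opt_victim] + cheaper     # branch: pursue OPT or reserve p_t(s)

# Evicting "d" would cost 4 at its next reference; "a" and "b" only 1
print(branch_victims(["a", "b", "d"], {"a": 1, "b": 1, "d": 4}.get))
# ['d', 'a', 'b'] -> one OPT branch plus two reservation options
```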

8/21 Illustration of CSOPT
• Consider two static miss costs: assume c(d) = r, where r > 1, and the miss cost of all other blocks is 1
  – The block at the bottom of P_4 has the highest cost, so consider the option to reserve d by replacing b
  – From t=4 to t=11, we reserve block d and apply OPT to the remaining two blocks
  – The reservation (RV) branch releases the hold on block d at t=11, where it is first accessed after t=4

9/21 The Final Replacement Option Made by CSOPT
• Following the above illustration
  – At t=14, the cache states under OPT and RV are identical, so we compare the costs of the two branches: C_OPT(x_5,…,x_14) = r and C_RV(x_5,…,x_14) = 4
  – If r > 4, RV yields the lower cost and block b is replaced at t=4
  – If r = 4, both options lead to the optimal cost
  – If r < 4, OPT yields the lower cost and block d is replaced at t=4

10/21 Comparison Between OPT and CSOPT
• Experimental methodology
  – Used the simulation trace of one processor in a multiprocessor system
  – Used a simplified cost model: a miss mapping to local memory is assigned a cost of 1; a miss mapping to remote memory is assigned a cost of r, the cost ratio of accessing remote versus local memory
• Data placement policy in main memory
  – Random placement: blocks are placed locally or remotely in a random fashion
  – First-touch policy (the practical situation): if a processor is the first one to access a block, the block is mapped locally; otherwise it is mapped remotely
• Cache configuration
  – Caches are 4-way, 16KB, with 64-byte blocks

11/21 Evaluation of Relative Cost Saving
• The relative cost saving of CSOPT over OPT is calculated as (C_OPT − C_CSOPT) / C_OPT, where the aggregate cost under replacement algorithm R is C_R = M_loc^R + r · M_rem^R
  – M_loc^R denotes the number of local misses using replacement algorithm R (OPT or CSOPT)
  – M_rem^R denotes the number of remote misses using replacement algorithm R (OPT or CSOPT)
  – r denotes the cost ratio (under the two static costs)
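A one-liner making the arithmetic concrete; the miss counts below are made up for illustration.

```python
def relative_cost_saving(m_loc_opt, m_rem_opt, m_loc_cs, m_rem_cs, r):
    """Relative cost saving of CSOPT over OPT under the two static costs:
    local misses cost 1, remote misses cost r."""
    cost_opt = m_loc_opt + r * m_rem_opt
    cost_csopt = m_loc_cs + r * m_rem_cs
    return (cost_opt - cost_csopt) / cost_opt

# CSOPT may trade a few extra local misses for fewer remote misses
print(relative_cost_saving(1000, 500, 1100, 400, r=4))  # 0.1
```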

12/21 Relative Cost Saving with Random Cost Mapping
• The relative cost savings increase with r
  – The curve for r = ∞ shows the upper bound of the cost savings
• As HAF (high-cost access fraction) varies from 0 to 1
  – The relative cost saving shows a peak between HAF = 0.1 and HAF = 0.5; it is easier to benefit from CSOPT when HAF < 0.5
• The relative cost savings of CSOPT over OPT are significant and consistent across all benchmarks

13/21 Relative Cost Saving with First-Touch Cost Mapping
• Relative cost savings of CSOPT over OPT with first-touch versus random cost mapping at the same HAF
  – The cost savings achieved under first-touch cost mapping are consistently smaller than under random cost mapping
  – For LU especially, the cost savings under the first-touch policy are very poor

14/21 Online Cost-Sensitive Replacement Policies
• The rationale of cost-sensitive LRU replacement algorithms
  – Let c[i] be the predicted miss cost of the block occupying the ith position from the top of the LRU stack in a set of size s: c[1] is the predicted miss cost of the MRU block, c[s] that of the LRU block
  – If c[s] ≤ c[i] for every i < s, replace the LRU block
  – If c[i] < c[s] for some i < s, reserve the LRU block: victimize the block closest to the LRU position whose predicted miss cost is lower than that of the reserved block (see the sketch below)
  – A reservation is terminated by depreciating the predicted miss cost of the reserved LRU block
• We explore three strategies to depreciate the predicted miss cost
  – Basic Cost-Sensitive LRU Algorithm (BCL)
  – Dynamic Cost-Sensitive LRU Algorithm (DCL)
  – Adaptive Cost-Sensitive LRU Algorithm (ACL)
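A minimal sketch of this victim-selection rule shared by all three algorithms; `cost` stands in for the per-block predicted miss cost.

```python
def select_victim(lru_stack, cost):
    """Cost-sensitive LRU victim selection. lru_stack[0] is the MRU block,
    lru_stack[-1] the LRU block; cost[b] is the predicted miss cost of b."""
    lru = lru_stack[-1]
    if all(cost[b] >= cost[lru] for b in lru_stack[:-1]):
        return lru  # LRU block is no more expensive than the rest: plain LRU
    # Reserve the LRU block: victimize the block closest to the LRU position
    # whose predicted miss cost is lower than the reserved block's
    for b in reversed(lru_stack[:-1]):
        if cost[b] < cost[lru]:
            return b

print(select_victim(["a", "b", "c", "d"], {"a": 1, "b": 1, "c": 5, "d": 4}))  # "b"
```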

15/21 Basic Cost-Sensitive LRU Algorithm (BCL)
• The Basic Cost-Sensitive LRU Algorithm (BCL) depreciates the cost of a reserved LRU block whenever a non-LRU block is victimized in its place
• BCL algorithm in an s-way set-associative cache
  – Whenever a block takes the LRU position, Acost is loaded with c[s] (the predicted miss cost of the new LRU block)
  – To select a victim, BCL searches for a block in the LRU stack such that c[i] < Acost and i is closest to the LRU position
  – If BCL finds one, it reserves the LRU block by replacing the ith block
  – Otherwise, the LRU block is replaced
  – When Acost reaches zero, the reserved LRU block becomes the prime candidate for replacement (see the sketch below)
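A sketch of BCL's selection with the depreciating Acost. The slides do not spell out the depreciation amount; subtracting the victim's predicted cost is an assumption here, flagged in the comments.

```python
class BCLSet:
    """One cache set under BCL. stack[0] is MRU, stack[-1] is LRU;
    cost[b] is the predicted miss cost of block b."""

    def __init__(self, blocks, cost):
        self.stack = list(blocks)
        self.cost = dict(cost)
        self.acost = self.cost[self.stack[-1]]  # loaded when a block becomes LRU

    def victim(self):
        """Select a victim; depreciate Acost on every reservation."""
        lru = self.stack[-1]
        # Search from the LRU end for a block cheaper than the remaining Acost
        for b in reversed(self.stack[:-1]):
            if self.cost[b] < self.acost:
                self.acost -= self.cost[b]  # assumed depreciation step
                return b                    # LRU block stays reserved
        return lru  # Acost exhausted or no cheaper block: replace the LRU block

s = BCLSet(["a", "b", "c", "d"], {"a": 1, "b": 1, "c": 1, "d": 4})
# Selection only (no eviction), to show Acost wearing down to zero:
print([s.victim() for _ in range(5)])  # ['c', 'c', 'c', 'd', 'd']
```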

16/21 Dynamic Cost-Sensitive LRU Algorithm (DCL)
• BCL's weakness is that it assumes LRU provides a perfect estimate of forward distances
• To correct for this weakness, the Dynamic Cost-Sensitive LRU Algorithm (DCL) depreciates the Acost of a reserved LRU block only when a non-LRU block victimized in its place is actually accessed before the reserved LRU block
• To do this, DCL keeps a record of every replaced non-LRU block in an Extended Tag Directory (ETD)
  – ETD entries are attached to each set; initially all entries are invalid
  – When a non-LRU block is replaced, an ETD entry is allocated and its valid bit is set
  – On a miss in the cache that hits in the ETD, the cost is depreciated and the matching ETD entry is invalidated
  – When an access hits on the LRU block, all ETD entries are invalidated (see the sketch below)
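A sketch of the ETD bookkeeping DCL adds on top of BCL; the depreciation amount is again an assumption (the re-accessed block's predicted cost).

```python
class ETD:
    """Extended Tag Directory for one set: remembers non-LRU blocks that
    were victimized while the LRU block was reserved."""

    def __init__(self):
        self.entries = {}  # tag -> predicted miss cost of the evicted block

    def on_non_lru_victim(self, tag, cost):
        self.entries[tag] = cost  # allocate an entry (valid bit set)

    def on_lru_hit(self):
        self.entries.clear()  # the reservation paid off: invalidate all entries

    def depreciation_on_miss(self, tag):
        """Amount to subtract from Acost when a miss hits in the ETD: the
        victimized block came back before the reserved LRU block, so the
        reservation was (at least partly) a mistake."""
        return self.entries.pop(tag, 0)  # 0 if the miss is not in the ETD

etd = ETD()
etd.on_non_lru_victim("b", cost=1)
acost = 4
acost -= etd.depreciation_on_miss("b")  # "b" returned first: Acost -> 3
```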

17/21 Adaptive Cost-Sensitive LRU Algorithm (ACL)
• The rationale behind the Adaptive Cost-Sensitive LRU Algorithm (ACL): successes and failures of LRU-block reservations are clustered in time
  – Thus, ACL implements an adaptive reservation-activation scheme
• ACL automaton in each set (see the sketch below)
  – A counter enables and disables reservations; initially the counter is set to zero, disabling all reservations
  – The counter increases or decreases when a reservation succeeds or fails
  – While reservations are disabled, an LRU block enters the ETD on replacement if the other blocks in the set have lower cost
  – On a miss in the cache that hits in the ETD, all ETD entries are invalidated and reservations are enabled
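A sketch of the per-set ACL automaton; the enable threshold and saturation bound are illustrative assumptions, as the slides give only the counter's qualitative behavior.

```python
class ACLAutomaton:
    """Per-set on/off switch for reservations, driven by a small counter."""

    def __init__(self, saturate=3):
        self.counter = 0          # starts at zero: reservations disabled
        self.saturate = saturate  # assumed saturation bound

    def reservations_enabled(self):
        return self.counter > 0

    def on_reservation_success(self):
        self.counter = min(self.counter + 1, self.saturate)

    def on_reservation_failure(self):
        self.counter = max(self.counter - 1, 0)

    def on_etd_hit_while_disabled(self):
        # A tracked LRU victim came back: a reservation would have paid off,
        # so re-enable reservations (the caller also invalidates the ETD)
        self.counter = 1
```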

18/21 Evaluation Approach and Setup
• To estimate the impact of the replacement policy on performance, we need to simulate the architecture in detail
  – Costs associated with misses are multiple and dynamic
  – Data is placed in main memory using the first-touch policy
  – The three online cost-sensitive replacement algorithms are implemented in the second-level cache
• Target system configuration (table not reproduced in this transcript)

19/21 Reduction of Execution Time Over LRU
• The improvement in execution time by DCL is significant
  – Compared to BCL, DCL yields reliable and significant execution-time improvements in every situation
• The performance of ACL is often slightly lower than DCL's
  – However, ACL gives more reliable performance across applications
• In LU, ACL improves on DCL
  – This is because ACL effectively filters unnecessary reservations in LU

20/21 Conclusions
• Proposed a new optimum cache replacement policy, CSOPT, which minimizes the aggregate cost under multiple miss costs rather than the miss count
  – To make the search through a huge search tree feasible, CSOPT exploits the rationale behind OPT and block reservation
• Although CSOPT is unrealizable in real systems, it gives useful hints and guidelines for improving existing cache replacement algorithms
  – We have demonstrated significant performance benefits by developing three practical algorithms: BCL, DCL and ACL
• The application domain of our algorithms is very broad
  – They are applicable to the management of various kinds of storage involving multiple nonuniform cost functions

21/21 Appendix – CSOPT Algorithm
• On a cache miss, the prime candidate for replacement is the block whose forward distance is largest among the blocks that are not reserved
• Initially, P has only one active node, containing null blocks whose forward distances are set to L+1 and whose cost is zero
• On a branch, the two replacement options are:
  (1) Pursuing OPT: still select p_t(s) as the victim
  (2) Reserving p_t(s) until its next reference by replacing one of the lower-cost blocks