Cache Replacement Algorithms with Nonuniform Miss Costs
Jeong, J. and Dubois, M.
IEEE Transactions on Computers, Volume 55, Issue 4, Pages 353-365, April 2006
On seminar book: 152-164
2/21 Abstract Cache replacement algorithms originally developed in the context of uniprocessors executing one instruction at a time implicitly assume that all cache misses have the same cost. However, in modern systems, some cache misses are more expensive than others. The cost may be latency, penalty, power consumption, bandwidth consumption, or any other ad hoc numeric property attached to a miss. We call the class of replacement algorithms designed to minimize a nonuniform miss cost function "cost-sensitive replacement algorithms." In this paper, we first introduce and analyze an optimum cost-sensitive replacement algorithm (CSOPT) in the context of multiple nonuniform miss costs. CSOPT can significantly improve the cost function over OPT (the replacement algorithm minimizing miss count) in large regions of the design space. Although CSOPT is an offline and unrealizable replacement policy, it serves as a lower bound on the achievable cost by realistic cost-sensitive replacement algorithms. Using the practical example of latency cost in CC-NUMA multiprocessors, we demonstrate that there is a lot of room left to improve current replacement algorithms in many situations beyond the promise of OPT. Next, we introduce three practical extensions of LRU inspired by CSOPT and we compare their performance to LRU, OPT and CSOPT. Finally, as a practical application, we evaluate these realizable cost-sensitive replacement algorithms in the context of the second-level caches of a CC-NUMA multiprocessor with superscalar processors, using the miss latency as the cost function. By applying simple replacement policies sensitive to the latency of misses, we can improve the execution time of some parallel applications by up to 18 percent.
3/21 What's the Problem
- Cache replacement algorithms widely used in modern systems aim to reduce the aggregate miss count and assume that miss costs are uniform
- However, the uniform-cost assumption has lost its validity in multiprocessor systems
  - The cost of a miss mapping to a remote memory is higher than the cost of a miss mapping to a local memory
- This motivates the exploration of replacement policies that minimize an aggregate cost under multiple nonuniform miss costs, instead of the miss count
4/21 Introduction
- Cost-Sensitive OPTimal replacement algorithm (CSOPT): a cache replacement algorithm reaching the minimum aggregate cost under multiple miss costs
  - CSOPT is an extension of OPT, the classical replacement algorithm minimizing miss count
  - CSOPT and OPT are identical under a uniform miss cost
  - With multiple miss costs, CSOPT does not always replace the block selected by OPT: it also considers keeping the block victimized by OPT in the cache until its next reference
- However, CSOPT and OPT require knowledge of future memory accesses and are unrealizable
  - Thus, we also introduce 3 realistic online cost-sensitive replacement algorithms by extending the LRU algorithm
5/21 Replacement Algorithm Optimizing the Miss Count
- OPTimal Replacement Algorithm (OPT)
  - Assumes a uniform miss cost and minimizes the miss count
  - The victim block selected by OPT is the block whose next reference is farthest away in the future among all blocks in the cache
- OPT can be implemented with a priority list
  - Consider a trace of block addresses X = x_1, x_2, x_3, ..., x_L
  - The forward distance to a block a at time t, w_t(a), is defined as the position t' in the trace x_{t+1}, ..., x_{t'}, where x_{t'} is the first reference to block a after time t (i.e., the next reference time of block a)
    - If block a is never referenced again after time t, w_t(a) is set to L+1
  - The priority list at time t, P_t, orders the cached blocks by their forward distances right before the reference x_t is performed
    - Initially, P_1 contains null blocks whose forward distances are set to L+1; P is updated before each reference
- When a replacement is required, the victim is the block whose forward distance is largest in P, i.e., the block at the bottom of P
6/21 Optimum Replacement Policy with Multiple Miss Costs
- Cost-Sensitive OPTimal replacement algorithm (CSOPT)
  - Let c(x_t) be the miss cost of the memory reference with block address x_t at time t, for a trace of memory references X = x_1, x_2, x_3, ..., x_L
  - The problem is to find a replacement algorithm such that the aggregate cost of the trace, C(X) = Σ c(x_t) summed over the misses, is minimized
- Basic implementation of CSOPT
  - Expands all possible replacement sequences in a search tree, then picks the sequence with the least cost at the end
  - Adds one level of depth to the search tree on every reference, with s possible blocks to replace, where s is the set size in blocks
  - This procedure is extremely complex and infeasible
7/21 Exploiting OPT to Cut the Branch Factor
- The basic idea: if all cache blocks have the same miss cost for their next reference, the victim can be selected by invoking OPT
- Improvement of CSOPT
  - Consider the next miss cost at time t, f_t(i): the miss cost to bring p_t(i) back into the cache at its next reference, if p_t(i) is replaced at time t
  - If f_t(s) ≤ f_t(i) for every i < s: invoke OPT to select p_t(s) as the victim
  - If f_t(i) < f_t(s) for some i < s: this is the only situation in which the search branches, with 2 replacement options
    (1) Pursuing OPT: still select p_t(s) as the victim
    (2) Reserving p_t(s) until its next reference by replacing one of the lower-cost blocks
  - Reservations can be made for more than one block at a time, with up to s-1 reservation options
8/21 Illustration of CSOPT
- Consider 2 static miss costs and assume c(d) = r, where r > 1, while the miss cost of all other blocks is 1
- The block at the bottom of P_4 (block d) has the highest cost, so we consider the option of reserving d by replacing b
- From t=4 to t=11, we reserve block d and apply OPT to the remaining 2 blocks
- RV (the reservation branch) releases the hold on block d at t=11, where d is first accessed after t=4
9/21 The Final Replacement Decision Made by CSOPT
- Following the above illustration: at t=14, the cache states under OPT and RV are identical, so we compare the costs of the two branches
  - C_OPT(x_5, ..., x_14) = r and C_RV(x_5, ..., x_14) = 4
- If r > 4, RV yields the lower cost and block b is replaced at t=4
- If r = 4, both options lead to the optimal cost
- If r < 4, OPT yields the lower cost and block d is replaced at t=4
10/21 Comparison Between OPT and CSOPT
- Experimental methodology
  - Used the simulated trace of one processor in a multiprocessor system
  - Used a simplified cost model: a miss mapping to local memory is assigned a cost of 1, and a miss mapping to remote memory is assigned a cost of r, the cost ratio of accessing remote versus local memory
- Data placement policy in main memory
  - Random placement: blocks are placed locally or remotely in a random fashion
  - First-touch policy (the practical situation): if a processor is the first one to access a block, the block is mapped locally; otherwise it is mapped remotely
- Cache settings: caches are 4-way, 16KB, with 64-byte blocks
11/21 Evaluation of Relative Cost Saving
- The relative cost saving of CSOPT over OPT is calculated as (C_OPT − C_CSOPT) / C_OPT, where the aggregate cost is C_R = M_loc^R + r · M_rem^R
  - M_loc^R denotes the number of local misses using replacement algorithm R (OPT or CSOPT)
  - M_rem^R denotes the number of remote misses using replacement algorithm R (OPT or CSOPT)
  - r denotes the cost ratio (under the two static costs)
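Under the two-static-cost model of the previous slide, the saving can be computed directly. A minimal sketch; normalizing by C_OPT is the natural reading of "relative cost saving of CSOPT over OPT", and the miss counts below are illustrative numbers, not results from the paper:

```python
def aggregate_cost(m_loc, m_rem, r):
    """C_R = M_loc^R + r * M_rem^R: local misses cost 1, remote misses cost r."""
    return m_loc + r * m_rem

def relative_saving(opt_misses, csopt_misses, r):
    """(C_OPT - C_CSOPT) / C_OPT for (local, remote) miss-count pairs."""
    c_opt = aggregate_cost(*opt_misses, r)
    c_csopt = aggregate_cost(*csopt_misses, r)
    return (c_opt - c_csopt) / c_opt
```

CSOPT typically trades extra low-cost (local) misses for fewer high-cost (remote) ones: with r = 4, going from (100 local, 50 remote) misses under OPT to (120, 30) under CSOPT saves (300 − 240) / 300 = 20 percent of the aggregate cost despite the higher total miss count.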
12/21 Relative Cost Saving with Random Cost Mapping
- The relative cost savings increase with r; the curve for r = ∞ shows the upper bound of the cost savings
- As HAF (high-cost access fraction) varies from 0 to 1, the relative cost saving increases up to a peak between HAF = 0.1 and HAF = 0.5
  - It is easier to benefit from CSOPT when HAF < 0.5
- The relative cost savings of CSOPT over OPT are significant and consistent across all benchmarks
13/21 Relative Cost Saving with First-Touch Cost Mapping
- Compares relative cost savings of CSOPT over OPT under first-touch and under random cost mapping at the same HAF
- The cost savings achieved under first-touch cost mapping are consistently smaller than under random cost mapping
  - Especially for LU, the cost savings under the first-touch policy are very poor
14/21 Online Cost-Sensitive Replacement Policies
- The rationale of cost-sensitive LRU replacement algorithms
  - Let c[i] be the predicted miss cost of the block which occupies the ith position from the top of the LRU stack in a set of size s
    - c[1] is the predicted miss cost of the MRU block; c[s] is the predicted miss cost of the LRU block
  - If c[s] ≤ c[i] for every i < s: replace the LRU block, LRU(s), as the victim
  - If c[i] < c[s] for some i < s: reserve the LRU block LRU(s)
    - While a reservation is active, victimize the first block in the LRU stack whose predicted miss cost is lower than the predicted miss cost of the reserved block
- A reservation is terminated by depreciating the predicted miss cost of the reserved LRU block; we explore 3 strategies to depreciate the predicted miss cost:
  - Basic Cost-Sensitive LRU Algorithm (BCL)
  - Dynamic Cost-Sensitive LRU Algorithm (DCL)
  - Adaptive Cost-Sensitive LRU Algorithm (ACL)
15/21 Basic Cost-Sensitive LRU Algorithm (BCL)
- BCL depreciates the cost of a reserved LRU block whenever a non-LRU block is victimized in its place
- BCL algorithm in an s-way set-associative cache
  - To select a victim, BCL searches the LRU stack for a block such that c[i] < Acost and i is closest to the LRU position
  - If BCL finds one, it reserves the LRU block by replacing the ith block; otherwise, the LRU block is replaced
  - When Acost reaches zero, the reserved LRU block becomes the prime candidate for replacement
  - Whenever a block takes the LRU position, Acost is loaded with c[s] (the predicted miss cost of the new LRU block)
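A sketch of BCL's victim selection for one set. The depreciation amount (subtracting the victim's predicted cost from Acost) is an assumption for illustration; the slide only says Acost is depreciated on each non-LRU victimization. Position 0 here is MRU and position s-1 is LRU:

```python
def bcl_select(costs, acost):
    """costs[i]: predicted miss cost of the block at stack position i
    (0 = MRU, len(costs)-1 = LRU). Returns (victim_index, new_acost)."""
    s = len(costs)
    # Search upward from the position closest to LRU for a cheaper block.
    for i in range(s - 2, -1, -1):
        if costs[i] < acost:
            # Reserve the LRU block by replacing block i; depreciate Acost
            # by the victim's cost (assumed depreciation amount).
            return i, acost - costs[i]
    # No cheaper block found: replace the LRU block itself. Acost is
    # reloaded with the new LRU block's cost once the stack shifts.
    return s - 1, acost
```

Once repeated depreciation drives Acost below every other block's cost, the loop finds no candidate and the reserved LRU block itself is evicted, matching the "prime candidate when Acost reaches zero" behavior above.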
16/21 Dynamic Cost-Sensitive LRU Algorithm (DCL)
- BCL's weakness is that it assumes LRU provides a perfect estimate of forward distances
- To correct this weakness, the Dynamic Cost-Sensitive LRU Algorithm (DCL) depreciates the Acost of a reserved LRU block only when a non-LRU block victimized in its place is actually accessed before the reserved LRU block
- To do this, DCL keeps a record of every replaced non-LRU block in the Extended Tag Directory (ETD)
  - ETD entries are attached to each set; initially, all entries are invalid
  - When a non-LRU block is replaced, an ETD entry is allocated and its valid bit is set
  - On a miss in the cache that hits in the ETD, the cost is depreciated and the matching ETD entry is invalidated
  - When an access hits on the LRU block, all ETD entries are invalidated
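The ETD bookkeeping can be modeled as below. Only the depreciation logic is shown; the replacement machinery and the depreciation amount (here, the recorded cost of the re-referenced victim) are simplifying assumptions for illustration:

```python
class DCLSetState:
    """Sketch of the ETD bookkeeping for one cache set under DCL."""

    def __init__(self, acost):
        self.acost = acost  # remaining predicted cost of the reserved LRU block
        self.etd = {}       # tag -> predicted cost of an evicted non-LRU block

    def on_non_lru_eviction(self, tag, cost):
        """A non-LRU block was victimized in place of the reserved block:
        record it in the ETD (allocate an entry, set its valid bit)."""
        self.etd[tag] = cost

    def on_miss(self, tag):
        """Depreciate only if this miss hits in the ETD, i.e. the reservation
        demonstrably caused an extra miss; invalidate the matching entry."""
        if tag in self.etd:
            self.acost = max(0, self.acost - self.etd.pop(tag))

    def on_reserved_hit(self):
        """The reserved LRU block was re-referenced: the reservation paid
        off, so all ETD entries are invalidated."""
        self.etd.clear()
```

The key difference from BCL is visible in `on_miss`: evicting a non-LRU block costs the reservation nothing unless that block actually comes back before the reserved one.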
17/21 Adaptive Cost-Sensitive LRU Algorithm (ACL)
- The rationale behind ACL: reservation successes and failures of the LRU block are clustered in time
- Thus, ACL implements an adaptive reservation-activation scheme
- ACL automaton in each set
  - A counter is associated with each set to enable and disable reservations
  - Initially, the counter is set to zero, disabling all reservations
  - The counter is incremented on a reservation success and decremented on a reservation failure
  - When reservations are disabled, an LRU block still enters the ETD on replacement if other blocks in the set have lower costs
  - On a miss in the cache that hits in the ETD, all ETD entries are invalidated and reservations are re-enabled
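The per-set automaton might be sketched as a saturating counter. The saturation bound, the enable condition (counter > 0), and the re-arm value on an ETD hit are illustrative assumptions, since the slide only describes the counter's direction of movement:

```python
class ACLGate:
    """Sketch of ACL's per-set reservation enable/disable automaton."""

    def __init__(self, saturation=3):
        self.count = 0                # starts at zero: reservations disabled
        self.saturation = saturation  # assumed saturation bound

    def reservations_enabled(self):
        return self.count > 0

    def on_success(self):
        """Reserved block re-referenced in time: count up, saturating."""
        self.count = min(self.saturation, self.count + 1)

    def on_failure(self):
        """Reservation failed (cost depreciated away): count down to zero."""
        self.count = max(0, self.count - 1)

    def on_etd_hit_while_disabled(self):
        """ETD hit while disabled: re-enable reservations (assumed re-arm)."""
        self.count = 1
```

Because successes and failures are clustered in time, the counter stays high during phases where reservations pay off and pinned at zero during phases where they do not, filtering unnecessary reservations.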
18/21 Evaluation Approach and Setup
- To estimate the impact of the replacement policy on performance, we need to simulate the architecture in detail
  - The costs associated with misses are multiple and dynamic
- Data is placed in main memory using the first-touch policy
- The 3 online cost-sensitive replacement algorithms are implemented in the second-level cache
- Target system configuration
19/21 Reduction of Execution Time over LRU
- The improvement in execution time by DCL is significant
  - Compared to BCL, DCL yields reliable and significant improvements in execution time in every situation
- The performance of ACL is often slightly lower than DCL's, but ACL gives more reliable performance across applications
  - In LU, ACL improves on DCL because ACL effectively filters unnecessary reservations in LU
20/21 Conclusions
- Proposed a new optimum cache replacement policy, CSOPT, which minimizes an aggregate cost under multiple miss costs rather than the miss count
  - To prune the huge search tree involved, CSOPT exploits the rationale behind OPT together with block reservation
- Although CSOPT is unrealizable in real systems, it gives useful hints and guidelines for improving existing cache replacement algorithms
  - We demonstrated significant performance benefits by developing 3 practical algorithms, called BCL, DCL and ACL
- The application domain of our algorithms is very broad: they are applicable to the management of various kinds of storage involving multiple nonuniform cost functions
21/21 Appendix – CSOPT Algorithm
- On a cache miss, the prime candidate for replacement is the block whose forward distance is the largest among the blocks that are not reserved
- Initially, P has only one active node, which contains null blocks whose forward distances are set to L+1 and whose cost is zero
- At each branch point, the two replacement options are:
  (1) Pursuing OPT: still select p_t(s) as the victim
  (2) Reserving p_t(s) until its next reference by replacing one of the lower-cost blocks