Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo
Proceedings of the International Symposium on Microarchitecture (MICRO'09), Dec. 2009
Presented 2015/7/14
Abstract

Efficient memory hierarchy design is critical due to the increasing gap between processor and memory speed. One source of inefficiency in current caches is the non-uniform distribution of memory accesses across the cache sets. As a consequence, while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working sets have fewer lines than the set. This paper presents a technique that balances the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This technique, called the Set Balancing Cache (SBC), achieved an average 13% miss-rate reduction on ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.
What's the Problem

- Memory accesses are non-uniformly distributed across the cache sets
  - Some cache sets have working sets larger than the associativity of the cache (frequent conflict misses, large local miss rate)
  - Other sets may be underutilized (low local miss rate, but wasted lines)
- An intuitive answer is to increase the associativity
  - But this impacts access latency and power consumption, since more tags have to be read and compared

[Figure: a 2-way cache where one set suffers conflict misses while another set is underutilized]
Related Works

Efficient cache architectures for performance improvement:
- Balance the non-uniform distribution of accesses across cache sets
  - Place blocks in a second associated line [1, 3, 19]: increases the possible places where a memory block can be placed
  - Retain the lines to be replaced in underutilized frames of a direct-mapped cache [12]: attempts to identify underutilized cache frames
  - Allow different cache sets to have different numbers of lines according to program demand [14]
- Selective victim caching [2]: only the blocks most frequently missed in the conventional part of the cache in the recent past are retained in the victim cache, to improve the miss-rate reduction; a block highly missed in the past has a higher probability of imposing more misses in the future
- Novel replacement policy [13]: dynamically changes the insertion policy between marking the most recently inserted lines in a set as MRU or LRU, since more than half the lines installed in the cache are never reused before getting evicted

This paper: displaces lines from sets with high local miss rates to sets with underutilized lines, providing better performance at less hardware cost.
Set Balancing Cache

- Shift lines from sets with high local miss rates to sets with underutilized lines
- Goal: balance the saturation level among sets and avoid misses
  - Move lines from highly saturated sets to lightly saturated ones

[Figure: on a conflict miss, the evicted line is displaced from the highly saturated set to an underutilized set]
The Involved Techniques

- A mechanism to measure the saturation level of each cache set
  - A saturation counter per set: incremented on a cache miss, decremented on a cache hit
- The association algorithm: decides which set(s) are to be associated in the displacement
- The displacement algorithm: decides when to displace lines to an associated set
- The search algorithm: how to find a displaced line
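The saturation-counter mechanism above can be sketched in a few lines of Python (a minimal software model, not the hardware implementation; the range 0 to 2K-1 follows the displacement condition on the next slide, and saturating/clamping behavior at both ends is an assumption):

```python
K = 8                 # cache associativity
SAT_MAX = 2 * K - 1   # counters saturate at 2K-1

def update_saturation(counter: int, hit: bool) -> int:
    """Decrement the set's saturation counter on a hit,
    increment it on a miss, clamped to [0, 2K-1]."""
    if hit:
        return max(counter - 1, 0)
    return min(counter + 1, SAT_MAX)
```

A set whose counter sits near 2K-1 misses far more often than it hits, which is the signal the displacement algorithm acts on.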
Static Set Balancing Cache

- The association algorithm: each cache set is statically associated with another specific set, obtained by complementing the most significant bit of the index ("00" <-> "10", "01" <-> "11")
- The displacement algorithm (requires two conditions):
  - A line is evicted from a set whose saturation counter is at the maximum value 2K-1 (K: associativity)
  - The saturation counter of the receiver set is below the "displacement limit" K
- The search algorithm:
  - Every cache line has an additional displaced bit (d bit): d = 0 means the line is native to the set; d = 1 means it was displaced from its associated set
  - Every set has an additional second-search bit (sc bit): sc = 1 means the associated set holds displaced lines
  - 1st search: tag equality and d = 0; if not found and sc = 1, a 2nd search is performed in the associated set: tag equality and d = 1
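The static association rule (complement the most significant bit of the set index) can be sketched directly; a minimal model, with the pairing for a 4-set cache matching the "00" <-> "10", "01" <-> "11" example on the slide:

```python
def associated_set(index: int, index_bits: int) -> int:
    """Static SBC association: complement the most significant
    bit of the set index, e.g. "00" <-> "10" and "01" <-> "11"."""
    return index ^ (1 << (index_bits - 1))
```

Because the rule is an involution (applying it twice returns the original index), every set belongs to exactly one fixed pair, which is what makes the static scheme so cheap: no table is needed to record who is associated with whom.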
Example: Static SBC Operation

- Presume a 2-way cache with 4 sets; the upper tag in each set is the most recently used; the saturation counters operate in the range 0 to 3
- 1st ref: miss and sc = 0, so no second search, but the saturation counter reaches the maximum value "3"; the evicted line (tag 10010) is displaced from set 0 to set 2
- 2nd ref: miss but sc = 1, so a second search is performed in set 2; the tag is found with d = 1: a second hit
Dynamic Set Balancing Cache (1)

- Goal: associate a set whose saturation counter reaches the maximum value with the set that has the smallest saturation level
- The association algorithm: identify the least saturated set using the Destination Set Selector (DSS)
  - The DSS keeps the data related to the least saturated sets
  - When the saturation counter of a free set is updated, the index of this set is compared with the indexes of all the valid DSS entries
    - On a hit, the corresponding entry is updated with the new saturation level
    - On a miss, if the new saturation level is smaller than the maximum level in the DSS, the set index and its saturation level are stored in the DSS entry pointed to by "max"
  - The entry pointed to by "min" holds the minimum saturation level in the DSS, i.e. the index of the best set for association
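The DSS update policy described above can be modeled in software (an illustrative sketch, not the hardware design; the slide leaves some details implicit, so treating empty slots as having the worst possible level, and the class and method names, are assumptions):

```python
class DSS:
    """Sketch of a 4-entry Destination Set Selector: tracks the
    least saturated sets seen so far as (set_index, level) pairs."""

    def __init__(self, entries: int = 4):
        self.table = [None] * entries   # None marks an invalid entry

    def update(self, set_index: int, level: int) -> None:
        # Hit: an entry for this set exists, refresh its level.
        for i, e in enumerate(self.table):
            if e is not None and e[0] == set_index:
                self.table[i] = (set_index, level)
                return
        # Miss: overwrite the "max" entry (the worst tracked level,
        # treating empty slots as worst) if the new level is smaller.
        worst = max(range(len(self.table)),
                    key=lambda i: float('inf') if self.table[i] is None
                    else self.table[i][1])
        if self.table[worst] is None or level < self.table[worst][1]:
            self.table[worst] = (set_index, level)

    def best_candidate(self):
        # The "min" entry: the least saturated set currently tracked.
        valid = [e for e in self.table if e is not None]
        return min(valid, key=lambda e: e[1])[0] if valid else None
```

With only four entries the min/max bookkeeping is trivial to do in hardware with a handful of comparators, which is why the DSS stays cheap.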
Dynamic Set Balancing Cache (2)

- The association algorithm (cont.): a DSS entry is invalidated in two scenarios
  - Scenario 1: when its saturation level reaches the "displacement limit"
  - Scenario 2: when the entry pointed to by "min" is used for an association
- Association Table (AT): one entry per set, recording the association relation
  - AT(i).index: index of the set associated with set i (initially, each entry stores the set's own index)
  - AT(i).s/d: 1 if the set is associated, 0 if not associated yet
Dynamic Set Balancing Cache (3)

- The displacement algorithm: just as in SSBC, displacement occurs when a line is evicted from a set whose saturation counter is at the maximum value 2K-1 (K: associativity)
- The search algorithm: just as in SSBC, every cache line has an additional displaced bit (d bit)
  - 1st search: tag equality and d = 0
  - If not found and AT(i).s/d = 0: the access results in a miss
  - If not found and AT(i).s/d = 1: 2nd search in set AT(i).index, for tag equality and d = 1
  - On a second hit, the LRU line is moved from the source set to the destination set to replace the hit line (the LRU line of the destination set is evicted)
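The two-step dynamic search can be sketched as follows (an illustrative software model: `cache[set]` as a list of `(tag, d_bit)` lines and `at[i]` as an `(assoc_index, s_d)` pair are assumed representations, and the LRU swap performed on a second hit is omitted for brevity):

```python
def dsbc_lookup(cache, at, index, tag):
    """Dynamic SBC search: first look for a native line (d = 0) in
    the indexed set; if the set is associated, also look for a
    displaced line (d = 1) in the associated set."""
    # 1st search: native lines in the indexed set.
    for t, d in cache[index]:
        if t == tag and d == 0:
            return ('first_hit', index)
    assoc, s_d = at[index]
    if s_d == 0:                 # not associated yet: a miss
        return ('miss', index)
    # 2nd search: displaced lines in the associated set.
    for t, d in cache[assoc]:
        if t == tag and d == 1:
            return ('second_hit', assoc)
    return ('miss', index)
```

Note that an unassociated set never pays for a second search, so the common-case hit latency is unchanged with respect to a conventional cache.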
Example: Dynamic SBC Operation

- Presume a 2-way cache with 4 sets; the upper tag in each set is the most recently used; the saturation counters operate in the range 0 to 3
- 1st ref: miss and AT(i).s/d = 0, so the set is not associated yet, but the saturation counter reaches the maximum value "3"; the evicted line (tag 10010) is displaced from set 0 to set 3 (assume the DSS provides set 3 as the candidate)
- 2nd ref: miss but AT(i).s/d = 1, so a second search is performed in set AT(0).index = 3; the tag is found with d = 1: a second hit
Experimental Setup

- SESC simulator [15] with two baseline configurations
  - Two-level on-chip cache: L2 (unified) 2MB/8-way/64B/LRU; both SSBC and DSBC are applied to the L2 cache
  - Three-level on-chip cache: L2 (unified) 256KB/8-way/64B/LRU and L3 (unified) 2MB/16-way/64B/LRU; both SSBC and DSBC are applied to the two lower-level caches (L2 and L3)
- Destination Set Selector used in dynamic SBC: four entries
- Benchmarks: 10 representative benchmarks of the SPEC CPU2006 suite
Performance Comparison: Miss, Hit, and Secondary Hit Rates

- Both SBCs keep the same ratio of first-access hits as the standard cache, but they turn misses into secondary hits
- DSBC achieves better results than SSBC, as expected

[Figure: miss rates (smaller is better) for (a) the L2 cache in the two-level configuration, (b) the L2 cache in the three-level configuration, and (c) the L3 cache in the three-level configuration]
Performance Comparison: Average Access Time Improvement (Reduction)

- Two-level configuration: 4% and 8% reduction in L2 for SSBC and DSBC, respectively
- Three-level configuration: 3% and 6% reduction in L2, and 10% and 12% reduction in L3, for SSBC and DSBC, respectively
- DSBC achieves a better average access time than SSBC
- A small 1% slowdown appears in some cases where the baseline miss rate is small

[Figure: average access time for the L2 cache in the two-level configuration, and for the L2 and L3 caches in the three-level configuration]
Performance Comparison: IPC Improvement

- In the two-level configuration, SBCs have a positive effect on performance; however, two kinds of benchmarks get no benefit:
  1. those whose baseline miss rate is small (e.g. 444.namd or 445.gobmk)
  2. those with few accesses to L2 (e.g. 458.sjeng)
- In the three-level configuration, the improvement is larger and applies to all benchmarks
- SBCs achieve better results than doubling the associativity (of L2 in the two-level configuration; of L2 and L3 in the three-level configuration)
Cost of SBC: Storage Requirement and Energy

- Storage requirement of SBC (based on a 2MB/8-way/64B/LRU cache and 42-bit addresses):
  - a d bit per line to identify displaced lines, plus a saturation counter per set
  - SSBC additionally requires an sc bit per set: 7KB total (0.31% overhead)
  - DSBC instead uses one Association Table entry per set, plus a 4-entry Destination Set Selector to choose the best set for association: 13KB total (0.58% overhead)
- Energy overhead: less than 1% for SBC, versus 79% for the baseline with double associativity
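The quoted storage figures can be reproduced with a back-of-the-envelope calculation (a sketch: the exact bit widths assumed for the AT and DSS entries follow from the 4096-set geometry, and the baseline is counted as data plus tag bits only):

```python
# Baseline geometry: 2MB / 8-way / 64B lines, 42-bit addresses.
lines = (2 * 1024 * 1024) // 64          # 32768 lines
sets = lines // 8                        # 4096 sets -> 12 index bits
tag_bits = 42 - 12 - 6                   # 24-bit tag per line
base_bits = lines * (64 * 8 + tag_bits)  # data + tag storage

d_bits = lines        # one displaced (d) bit per line
sat_bits = sets * 4   # saturation counter per set: 0..2K-1 = 0..15

ssbc_bits = d_bits + sat_bits + sets               # + one sc bit per set
at_bits = sets * (12 + 1)                          # AT: index + s/d bit
dss_bits = 4 * (1 + 12 + 4)                        # 4 x (valid + index + level)
dsbc_bits = d_bits + sat_bits + at_bits + dss_bits

print(ssbc_bits / 8 / 1024, ssbc_bits / base_bits)  # ~6.5 KB, ~0.30%
print(dsbc_bits / 8 / 1024, dsbc_bits / base_bits)  # ~12.5 KB, ~0.58%
```

The results land at roughly 6.5KB (0.30%) and 12.5KB (0.58%), consistent with the 7KB/0.31% and 13KB/0.58% figures on the slide; the small difference is plausibly rounding or extra state (e.g. LRU bits) counted by the authors.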
Cost of SBC: Area

- The area overhead of SBC is less than 1% for both SSBC and DSBC, versus more than 3% for the baseline with double associativity
- SBC offers more performance but requires less energy and area than doubling the associativity
Conclusions

- This paper proposed the Set Balancing Cache (SBC), which balances the saturation level among sets and avoids misses by displacing lines from cache sets with high saturation levels to underutilized sets
- Two designs were presented:
  - Static Set Balancing Cache: displacement between pre-established pairs of sets
  - Dynamic Set Balancing Cache: tries to associate each highly saturated set with the least saturated set available
- Experimental results show that SBC achieves average miss reductions of 9.2% and 12.8% for SSBC and DSBC, respectively
Comment for This Paper

- The paper extensively reviews emerging cache-architecture techniques for performance improvement
- However, the hardware implementation for inserting the less saturated cache sets into the Destination Set Selector (DSS) is not clearly described: the paper only explains the table used by the DSS, not how suitable candidates are chosen for the limited (4-entry) table