1
ECE7995 Caching and Prefetching Techniques in Computer Systems
Lecture 8: Buffer Cache in Main Memory (IV)
2
Quantifying Locality with LRU Stack
Blocks are ordered by their recencies; blocks enter from the stack top and leave from its bottom. (Figure: an LRU stack over an example access stream, with block 3's recency changing between 1 and 2.)
First, let's see how LRU quantifies locality using the LRU stack. The LRU stack is the data structure used for its implementation: blocks are ordered by their recency, where the recency of a block is the interval from its last reference to the current time, and blocks enter from the stack top and leave from its bottom. LRU uses recency to quantify locality. In the example, observe block 3: over this access stream its recency changes from 1 to 2 and then to 0. Recency keeps changing over time, so an ephemeral recency value is a poor way to quantify the locality of a block. For example, if a block is regularly accessed after every four other distinct blocks, its locality strength is stable, yet its recency constantly cycles from 0 to 4. If we used recency to quantify locality, which of the five recency values should describe the locality strength? We argue that not every recency value, but the recency at which the block is accessed, should be used to quantify locality. We define this recency as the IRR. Looking at the access stream, it can also be defined as the number of other distinct blocks accessed between two consecutive references to the block.
3
LRU Stack: blocks are ordered by recency in the LRU stack; blocks enter from the stack top and leave from its bottom.
Inter-Reference Recency (IRR): the number of other distinct blocks accessed between two consecutive references to a block. (Figure: in the example stream, block 3 is re-referenced with IRR = 2 while its recency changes from 2 to 0.)
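To make recency and IRR concrete, here is a minimal sketch (my own illustration; the trace is a hypothetical fragment of the figure's stream) that computes the IRR of every reference in a block access stream:

```python
# Recency of a block: number of other distinct blocks referenced since
# its last access.  IRR: the recency observed at the moment the block
# is re-referenced.
def irr_of_references(stream):
    """Return (time, block, irr) per reference; irr is None (infinite)
    on a block's first reference."""
    last_pos, out = {}, []
    for t, block in enumerate(stream):
        if block in last_pos:
            # other distinct blocks between two consecutive references
            between = stream[last_pos[block] + 1 : t]
            out.append((t, block, len(set(between))))
        else:
            out.append((t, block, None))
        last_pos[block] = t
    return out

print(irr_of_references([3, 4, 4, 5, 3]))
# block 3 at t=4 has IRR = 2: blocks 4 and 5 were accessed in between
```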
4
Locality Strength: MULTI2 (IRR Map)
(Figure: IRR map of MULTI2; x axis is virtual time in the reference stream, y axis is IRR, i.e., re-use distance in blocks; a horizontal line marks a cache size of 1000 blocks.)
LRU is good for "absolutely" strong locality, bad for relatively weak locality.
Here is the IRR map of the access stream for the previously shown workload. The x axis is still virtual time; the y axis is the IRR, the recency value at which a reference occurs, or its re-use distance in the access stream. Each point represents the IRR of one reference. When we use IRR to quantify locality, the lower a point sits, the stronger its locality. The IRR map makes LRU's performance easy to interpret: say the cache size is 1000 blocks; under LRU, exactly the references below the cache-size line are hits. For a workload with weak locality, where most references lie in the upper area, LRU has a low hit ratio; in the extreme case where all references are above the line, the LRU hit ratio would be zero.
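Since a reference is an LRU hit in a cache of C blocks exactly when its IRR is less than C, the IRR map directly yields LRU's hit ratio. A minimal sketch of this relation (trace and function name are my own, not from the slides):

```python
# LRU's stack property: a reference is a hit in a C-block cache iff
# its IRR (re-use distance in distinct blocks) is less than C, i.e.
# iff its point on the IRR map falls below the cache-size line.
def lru_hit_ratio(stream, cache_size):
    last_pos, hits = {}, 0
    for t, block in enumerate(stream):
        if block in last_pos:
            irr = len(set(stream[last_pos[block] + 1 : t]))
            hits += irr < cache_size
        last_pos[block] = t
    return hits / len(stream)

# A loop over 1001 distinct blocks with a 1000-block cache: every IRR
# is 1000, no point falls below the line, so LRU's hit ratio is zero.
print(lru_hit_ratio(list(range(1001)) * 3, 1000))  # -> 0.0
```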
5
LRU’s Inability with Weak Locality
Memory scanning (one-time accesses): infinite IRR, weak locality; such blocks should not be cached at all, yet LRU does not replace them in a timely way (they stay cached until their recency exceeds the cache size).
Now let me show some commonly observed access patterns. Because of their weak locality, LRU cannot handle them properly. In memory scanning, blocks are accessed only once. They have infinite IRR, showing very weak locality, so they should not be cached at all. But under LRU they are cached until their recency exceeds the cache size, which is a waste of buffers.
6
LRU’s Inability with Weak Locality
Loop-like accesses (repeated accesses with a fixed interval): the IRR is the same as the interval; if the interval is larger than the cache size, there are no hits at all, and the blocks to be accessed soonest can be the very ones replaced.
In loop-like accesses, blocks are repeatedly accessed with a fixed interval, so the IRRs of all blocks are the same and equal to the interval. If the interval is larger than the cache size, LRU gets no hits at all, and the blocks to be accessed soonest may unfortunately be replaced, as the simulation below illustrates.
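A hypothetical simulation of plain LRU on such a loop (block names and sizes are made up for illustration):

```python
# Simulating plain LRU on a loop of 6 blocks with a 5-block cache:
# each miss evicts exactly the block that will be needed soonest.
from collections import OrderedDict

def simulate_lru(stream, cache_size):
    cache, hits = OrderedDict(), 0
    for block in stream:
        if block in cache:
            cache.move_to_end(block)        # becomes most recently used
            hits += 1
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return hits

print(simulate_lru(list(range(6)) * 10, 5))  # -> 0 hits
```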
7
LRU’s Inability with Weak Locality
Accesses with distinct frequencies: the recencies of frequently accessed blocks become large because of interleaved references to infrequently accessed blocks; frequently accessed blocks can therefore unfortunately be replaced.
For accesses of blocks with distinct frequencies, the recencies of frequently accessed blocks grow large because their references interleave with references to infrequently accessed blocks, so LRU may unfortunately replace the frequently accessed blocks.
8
Looking for Blocks with Strong Locality
(Figure: MULTI2 IRR map with an adaptive curve covering the 1000 blocks with the strongest locality; x axis is virtual time in the reference stream, y axis is IRR in blocks; a horizontal line marks the cache size.)
Because LRU does not quantify locality strength correctly, it can cache many blocks whose locality is actually weak and leave the cache seriously under-utilized. We use IRR to quantify locality strength, with small IRR values meaning strong locality. The next challenge is: can we dynamically maintain a set of strong-locality blocks whose size equals the cache size, so that exactly these blocks are stored in cache? Suppose a curve covers the references with the strongest locality, and the number of blocks involved in the references below the curve equals the cache size; these blocks are cached. The curve adapts to the current access pattern: when locality becomes weak and more reference points move to the upper area, the curve climbs; when locality becomes strong, it slips down. So at any time the 1000 blocks with the strongest locality are identified for caching, keeping the cache fully utilized. Because every reference covered by the LRU line is also covered by the new curve, a replacement algorithm caching the blocks under the curve will theoretically produce a hit ratio no lower than LRU's, and can do much better on weak-locality access patterns. The LIRS algorithm realizes this idea.
9
Challenges
Address the limitations of LRU fundamentally, while retaining LRU's merits of low overhead and adaptability: simplicity (an affordable implementation) and adaptability (responsiveness to access pattern changes).
I am not the first to observe LRU's limitations or to address them, but I want to address them fundamentally, not just target specific access patterns case by case. Nor do I want to sacrifice LRU's merits while fixing its problems: the algorithm should stay just as simple and responsive.
10
Principle of the LIRS Replacement
If a block's IRR is high, its next IRR is likely to be high again, so we select the blocks with high IRRs for replacement: they are the blocks most likely to be replaced by LRU anyway before being referenced again. Note that a replaced block may still have been accessed recently, i.e., have a small recency. We name the algorithm LIRS, the Low IRR Set replacement algorithm: we keep the set of blocks with low IRRs in cache.
11
Requirements on Low IRR Block Set (LIRS)
The set size should be the cache size, so all its blocks can be cached. The set should consist of the blocks with the strongest locality (the lowest IRRs), so the cache is fully utilized. The set must be kept up to date dynamically, so the cached blocks are always the ones with currently strong locality; this also means that when IRRs of different blocks are compared, only current IRR values are used.
12
Low IRR Block Set
Low IRR (LIR) blocks and high IRR (HIR) blocks form two sets: the LIR block set and the HIR block set. (Figure: the two block sets mapped onto the physical cache.) The cache, whose size in blocks is L, is divided into a major partition of size Llirs and a minor partition of size Lhirs, with cache size L = Llirs + Lhirs. The LIR block set also has size Llirs and is held in the major partition, so all LIR blocks are cached and all references to LIR blocks are hits. The minor partition stores HIR blocks, so many HIR blocks may not reside in the cache.
13
An Example for LIRS (Llirs = 2, Lhirs = 1)
(Table: for virtual times 1 through 9, an "X" marks which of blocks A through E is accessed; for example, block A is accessed at times 1, 6, and 8. The last two columns list each block's recency and IRR at time 10.)
With Llirs = 2 and Lhirs = 1, at time 10 the LIRS algorithm keeps the two blocks with the lowest IRRs in the LIR set: LIR block set = {A, B}; the remaining blocks form the HIR block set = {C, D, E}. Because E is the most recently used HIR block, E is resident.
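The same classification can be computed mechanically. In the sketch below the recency/IRR values at time 10 are assumed for illustration (the slide only spells out A's access times); INF marks an infinite IRR:

```python
# Hypothetical recency/IRR table at time 10 (values assumed, chosen to
# be consistent with A being accessed at times 1, 6 and 8).
INF = float("inf")
state = {  # block: (recency, IRR)
    "A": (1, 1), "B": (3, 1), "C": (4, INF), "D": (2, 3), "E": (0, INF),
}
LLIRS = 2  # size of the LIR set

# LIR blocks are the LLIRS blocks with the lowest IRRs.
ranked = sorted(state, key=lambda b: state[b][1])
lir, hir = ranked[:LLIRS], ranked[LLIRS:]
print(sorted(lir), sorted(hir))  # -> ['A', 'B'] ['C', 'D', 'E']
```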
14
Mapping to Cache (Llirs = 2, Lhirs = 1)
(Figure: the block sets mapped onto the physical cache; LIR blocks are shown in red and HIR blocks in blue.) All LIR blocks, A and B, are cached; E is the only resident HIR block.
15
Which Block Is Replaced? Replace HIR Blocks!
Block D is referenced at time 10 and misses, so a free buffer is needed. Which block should be replaced? An HIR block: because E is the only resident HIR block, E is replaced and D is loaded in. Note that LRU would instead replace B, due to its largest recency.
16
How Is the LIR Set Updated? The Recency of LIR Blocks Is Used
When block D is accessed, it generates a new IRR of 2. Is this new IRR small enough for D to get into the LIR set, the set of low-IRR blocks? We compare the HIR block's new IRR with the recencies of the LIR blocks. Why not with the IRRs of the LIR blocks? Two reasons: (1) those IRRs are out of date, and we only use current values; (2) the recency of a LIR block is part of its coming IRR and can be used to predict the current IRR. If there is a LIR block whose recency is greater than the new IRR, the newly accessed HIR block becomes a LIR block and that LIR block becomes an HIR block: their statuses are exchanged. Here, the recency of LIR block B is 3, greater than the new IRR of 2, so the statuses are exchanged, as the check below sketches.
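A minimal sketch of the exchange test (the function name is mine): an accessed HIR block is promoted exactly when its new IRR is smaller than the recency of some LIR block, i.e., smaller than Rmax:

```python
def should_promote(new_irr, lir_recencies):
    # Rmax is the recency of the "coldest" LIR block.
    return new_irr < max(lir_recencies)

# Block D: new IRR = 2; LIR blocks A and B have recencies 1 and 3.
print(should_promote(2, [1, 3]))  # -> True: D and B exchange statuses
```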
17
After D Is Referenced at Time 10
E is replaced, and D enters the LIR set: now D is a LIR block and B is an HIR block.
18
If the Reference at Time 10 Is to C Instead
If block C, whose recency is 4, is accessed at virtual time 10, the new IRR is 4. No LIR block has a recency greater than 4, so no statuses are exchanged: E is replaced, but C cannot enter the LIR set and remains an HIR block.
19
LIRS on References with Weak Locality
Memory scanning (one-time accesses): infinite IRR; the blocks are never included in the LIR block set and are replaced in a timely way.
Now let's see how LIRS handles the weak-locality cases mentioned earlier. In memory scanning, the references have infinite IRR, so the blocks are not included in the LIR block set and are replaced promptly.
20
LIRS on References with Weak Locality
Loop-like accesses: the IRRs of all blocks are the same; once a block becomes a LIR block, it keeps its status, and any cached block contributes a hit in each pass of the loop.
21
LIRS on References with Weak Locality
Accesses with distinct frequencies: frequently accessed blocks have smaller IRRs than infrequently accessed ones, so the frequently accessed blocks are LIR blocks; they stay cached and keep getting hits.
22
Making LIRS O(1) Efficient
IRR_HIR (the new IRR of the accessed HIR block) vs. Rmax (the maximum recency of LIR blocks).
To make the LIRS algorithm efficient, with O(1) time complexity, the key is comparing the new IRR and Rmax efficiently. This is achieved by the LIRS stack: an LRU stack in which all LIR blocks are kept and whose bottom block is a LIR block. LRU stack + LIR block with Rmax recency at its bottom ==> LIRS stack.
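One plausible realization of the O(1) operations, using an ordered hash map as the stack (this structuring is my own; the slides do not prescribe an implementation):

```python
from collections import OrderedDict

class LirsStack:
    """LRU stack with O(1) membership, move-to-top and bottom removal.
    The algorithm keeps the invariant that the bottom entry is LIR."""
    def __init__(self):
        self.s = OrderedDict()        # block -> "LIR"/"HIR", bottom first

    def __contains__(self, block):    # in the stack <=> IRR < Rmax
        return block in self.s

    def push_top(self, block, status):
        self.s.pop(block, None)       # move to top if already present
        self.s[block] = status

    def bottom(self):                 # the LIR block with recency Rmax
        return next(iter(self.s), None)

    def pop_bottom(self):
        return self.s.popitem(last=False)
```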
23
Differences between LRU and LIRS Stacks
(Figure: LRU stack vs. LIRS stack; cache size L = 5, Llirs = 3, Lhirs = 2; legend marks resident blocks, LIR blocks, and HIR blocks.)
The LRU stack's size is decided by the cache size and is fixed; the LIRS stack's size is decided by Rmax and varies. The LRU stack holds only resident blocks; the LIRS stack holds any block whose recency is no more than Rmax. The LRU stack does not distinguish "hot" and "cold" blocks; the LIRS stack distinguishes LIR and HIR blocks and dynamically maintains their statuses. These differences help reveal the advantages of LIRS over LRU.
24
How does LIRS Stack Help?
Rmax (maximum recency of LIR blocks); IRR_HIR (new IRR of the accessed HIR block).
Blocks in the LIRS stack ==> IRR < Rmax; other blocks ==> IRR > Rmax.
With the help of the LIRS stack, the seemingly expensive comparison is much simplified: because the LIR block with recency Rmax is kept at the stack bottom, an HIR block accessed while in the LIRS stack has a new IRR less than Rmax and becomes a LIR block; otherwise it remains an HIR block.
25
LIRS Operations
(Figure: LIRS stack S and resident-HIR stack Q; cache size L = 5, Llirs = 3, Lhirs = 2; legend marks resident blocks, LIR blocks, and HIR blocks.)
Initialization: all referenced blocks are given LIR status until the LIR block set is full; resident HIR blocks are placed in a small stack called Q. We also define an operation on the LIRS stack called "stack pruning", which removes HIR blocks from the bottom of the stack until a LIR block sits at the bottom (a sketch follows). An example now shows the LIRS operations when accessing various kinds of blocks.
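A sketch of stack pruning over that representation (assuming the same bottom-first OrderedDict layout as the previous snippet):

```python
from collections import OrderedDict

def prune(s: OrderedDict):
    """Drop HIR entries from the bottom of stack S until a LIR block
    sits at the bottom, restoring the Rmax invariant."""
    while s and next(iter(s.values())) != "LIR":
        s.popitem(last=False)
```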
26
Access an LIR Block (a Hit)
(Figure: stacks S and Q; cache size L = 5, Llirs = 3, Lhirs = 2.)
First, the operation for accessing a LIR block: block 4 is accessed; move it to the top of stack S, just as LRU does on the LRU stack.
27
Access an LIR Block (a Hit)
(Figure: stacks S and Q after block 4 moves to the top.)
Then block 8 is accessed; move it to the top of S. Now HIR blocks sit at the bottom, so do stack pruning to make sure a LIR block is at the bottom.
28
Access an LIR block (a Hit)
(Figure: stacks S and Q after stack pruning.)
29
Access a Resident HIR Block (a Hit)
(Figure: stacks S and Q.)
This is the case of accessing a resident HIR block. Block 3 is accessed: move it to the top of S. It also becomes a LIR block, because it was already in stack S, so its new IRR is less than Rmax. Accordingly, LIR block 1 is demoted to an HIR block and enters stack Q; then do stack pruning.
30
Access a Resident HIR Block (a Hit)
(Figure: stacks S and Q during the status exchange.)
31
Access a Resident HIR Block (a Hit)
(Figure: stacks S and Q after the exchange; block 1 is now a resident HIR block in Q.)
32
Access a Resident HIR Block (a Hit)
(Figure: stacks S and Q.)
Then block 5 is accessed; it remains an HIR block because it was not in stack S.
33
Access a Non-Resident HIR block (a Miss)
(Figure: stacks S and Q.)
This is the case of accessing a non-resident HIR block. Block 7 is accessed: a miss, so a free buffer is needed, and the block at the bottom of stack Q is replaced. Block 7 is moved to the top of stack S as an HIR block and also to the top of stack Q.
34
Access a Non-Resident HIR block (a Miss)
(Figure: stacks S and Q.)
Block 9 is accessed. This time the bottom block of stack Q, block 5, is replaced and leaves stack Q, but it remains in the LIRS stack as a non-resident HIR block.
35
Access a Non-Resident HIR block (a Miss)
(Figure: stacks S and Q.)
Then block 5 is accessed again; it becomes a LIR block because it was still in stack S. The overhead of the stack operations in LIRS is thus almost as low as that of LRU.
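Putting the three cases together, here is a compact, self-contained restatement of the walkthrough (my own consolidation of the slides' rules, not the authors' code; the O(n) deque removals would be O(1) linked-list unlinks in a real implementation):

```python
from collections import OrderedDict, deque

class LIRS:
    def __init__(self, cache_size, lhirs=1):
        self.llirs = cache_size - lhirs  # buffers for LIR blocks
        self.lhirs = lhirs               # buffers for resident HIR blocks
        self.s = OrderedDict()           # block -> "LIR"/"HIR", bottom first
        self.q = deque()                 # resident HIR blocks, front = oldest
        self.resident = set()
        self.lir_count = 0

    def _prune(self):                    # keep a LIR block at S's bottom
        while self.s and next(iter(self.s.values())) == "HIR":
            self.s.popitem(last=False)

    def _top(self, block, status):       # (re)insert block at the top of S
        self.s.pop(block, None)
        self.s[block] = status

    def access(self, x):
        if self.s.get(x) == "LIR":                   # hit on a LIR block
            self._top(x, "LIR")
            self._prune()                            # x may have been the bottom
            return True
        if x in self.resident:                       # hit on a resident HIR block
            if x in self.s:                          # in S => new IRR < Rmax
                self.q.remove(x)
                self._top(x, "LIR")                  # promote x ...
                bottom, _ = self.s.popitem(last=False)
                self.q.append(bottom)                # ... demote the bottom LIR block
                self._prune()
            else:                                    # not in S: keep HIR status
                self._top(x, "HIR")
                self.q.remove(x)
                self.q.append(x)
            return True
        # miss: x is not resident
        if self.lir_count < self.llirs:              # warm-up: fill the LIR set
            self._top(x, "LIR")
            self.lir_count += 1
        else:
            if len(self.q) >= self.lhirs:            # free a buffer: evict the
                victim = self.q.popleft()            # oldest resident HIR block
                self.resident.discard(victim)        # (it may stay in S, non-resident)
            if x in self.s:                          # non-resident HIR in S: promote
                self._top(x, "LIR")
                bottom, _ = self.s.popitem(last=False)
                self.q.append(bottom)
                self._prune()
            else:
                self._top(x, "HIR")
                self.q.append(x)
        self.resident.add(x)
        return False

# On the loop pattern that defeats LRU, LIRS keeps the LIR blocks hot:
lirs = LIRS(cache_size=5, lhirs=1)
hits = sum(lirs.access(b) for b in list(range(6)) * 10)
print(hits)  # -> 36; plain LRU scores 0 on this trace
```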
36
Workload Traces
postgres is a trace of join queries among four relations in a relational database system; sprite is from the Sprite network file system; multi2 is obtained by executing three workloads, cs, cpp, and postgres, together.
We conducted trace-driven simulations to test LIRS performance on a large number of traces, both synthetic and real-life, with various access patterns. Here are representative results for three real-life traces. Among them, postgres is a trace of join queries among four relations in a relational database system; sprite is from the Sprite network file system; multi2 is obtained by executing three workloads together: cs, a C source program examination tool, cpp, a GNU C compiler pre-processor, and postgres.
37
Cache Partition
1% of the cache size is for HIR blocks; 99% of the cache size is for LIR blocks. Performance is not sensitive to the partition.
We partition the cache this way because we want most of the cache space to serve the blocks with strong locality; assigning more cache space to HIR blocks makes LIRS behave more like LRU. Our parameter sensitivity study also shows that LIRS performance is not sensitive to the partition over a large range. Comparing the hit ratios of LIRS with other representative replacement algorithms, we find that LIRS consistently provides the best performance. A small sizing sketch follows; then the results on three typical workloads.
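A trivial sizing sketch of that split (the helper name is mine):

```python
def partition(cache_size):
    lhirs = max(1, cache_size // 100)  # ~1% for resident HIR blocks
    return cache_size - lhirs, lhirs   # (Llirs, Lhirs)

print(partition(500))  # -> (495, 5)
```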
38
Looping Pattern: postgres (Access Map)
(Figure: access map for postgres; x axis is virtual time in the reference stream, y axis is logical block number.) This looping-pattern workload has several distinct looping intervals; a locality scope of about 400 blocks is visible.
39
Looping Pattern: postgres (IRR Map)
(Figure: IRR map for postgres; x axis is virtual time, y axis is IRR, i.e., re-use distance in blocks; the LRU stack size and the varying LIRS stack size are drawn for a 500-block cache.) You can see the locality scope of about 400 blocks. While LRU covers only the references below its fixed stack-size line, LIRS adaptively changes its stack size to cover the blocks with currently strong locality: when locality weakens, i.e., more references fall in the high-IRR area, the LIRS stack size goes up; otherwise it goes down. This adaptation aims to identify and cache the 500 blocks with the currently strongest locality.
40
Looping Pattern: postgres (Hit Rates)
This graph shows the hit ratios of LIRS and the other algorithms mentioned before, including OPT, an off-line optimal replacement algorithm that relies on knowledge of future accesses for its decisions. The x axis is cache size; the y axis is hit ratio. The LIRS hit ratio is very close to optimal and much better than the others. LRU hit ratios are very low until the cache reaches 400 blocks, the size of one of the workload's locality scopes: LRU is not effective until that working set fully resides in the cache.
41
Temporally-Clustered Pattern: sprite (Access Map)
(Figure: access map for sprite; x axis is virtual time, y axis is logical block number.) sprite has a temporally-clustered access pattern, in which the blocks accessed more recently are the ones more likely to be accessed in the near future. This is an LRU-friendly pattern, so the behavior of all the other algorithms should be similar to that of LRU, with hit ratios close to LRU's.
42
Temporally-Clustered Pattern: sprite (IRR Map)
(Figure: IRR map for sprite; x axis is virtual time, y axis is IRR in blocks; LRU and LIRS stack sizes are drawn for 500 blocks.) You can see why this is an LRU-friendly workload: almost all references fall in the low-IRR area. With this access pattern there is no need for LIRS to significantly grow its stack to cover the 500 blocks with the strongest locality.
43
Temporally-Clustered Pattern: sprite (Hit Ratio)
This is the easy case for all replacement algorithms: every curve is close to LRU's, and all achieve high hit ratios.
44
Mixed Pattern: multi2 (Access Map)
(Figure: access map for multi2; x axis is virtual time, y axis is logical block number.) This is the access map for multi2, shown at the beginning of the talk; it has a mix of temporally-clustered, looping, and memory-scanning access patterns.
45
Mixed Pattern: multi2 (IRR Map)
(Figure: IRR map for multi2; x axis is virtual time, y axis is IRR in blocks; LRU and LIRS stack sizes are drawn for 1000 blocks.) This is the IRR map shown earlier. LIRS indeed realizes the initial idea here.
46
Mixed Pattern: multi2 (Hit Ratio)
LIRS has the best performance among the on-line algorithms. On the impact of LIRS: the paper on this work, published at the SIGMETRICS 2002 conference, has 13 citations; researchers at IBM Almaden asked for my simulator; a Red Hat Linux architect expressed interest in implementing LIRS in their kernel; and MSR people read the paper and contacted me for an internship, during which I developed an approximate LIRS to improve Windows VM management.
47
Summary
LIRS uses both IRR (or reuse distance) and recency for its replacement decisions; 2Q uses only reuse distance. LIRS adapts to locality changes when deciding which blocks have small IRRs; 2Q uses a fixed threshold when looking for blocks with small reuse distances. Both LIRS and 2Q have low time overhead (as low as LRU's); their space overheads are acceptably larger.