Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing
Mainak Chaudhuri, IIT Kanpur


Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing
Mainak Chaudhuri, IIT Kanpur

Balanced $ (IIT, Kanpur) Talk in one slide
– Closed-addressed hashing is used in traditional cache designs, with a fixed collision chain length (known as associativity)
– Clustering of physical addresses to a few hot sets is a well-known phenomenon
– Non-uniform set utilization leads to a high volume of conflict misses
– First proposal of a fully dynamic scheme to re-balance sets by migrating blocks from “hot regions” to “cooler regions”

Balanced $ (IIT, Kanpur) Sketch
 Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Observation#1

Balanced $ (IIT, Kanpur) Observation#2, 3

Balanced $ (IIT, Kanpur) Sketch
Observations
 Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Design detail
Overview
– The basic idea is to migrate evicted blocks to sets with smaller fill count
Involves the following sub-problems
– Identify a good receiver set quickly
– Locate migrated blocks efficiently
– Offer dynamic control of hit/miss critical path
Optimizations worth exploring
– Selective migration (not all blocks are important)
– Bound migrations from a particular set
– Retain migrated blocks (the difficult part)

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Destination of migration
Associate a saturating counter C(s) with each set s and a global counter G
– Increment C(s) on a refill into s
– When C(s) reaches a value equal to the associativity, increment G
– When G reaches a value equal to the number of sets, reset G and C(s) for all s
– Size C(s) so that it can count up to k times the associativity (we set k to 4)

Balanced $ (IIT, Kanpur) Destination of migration
Divide the sets into clusters and associate a saturating counter D(u) with each cluster u
– Increment D(u) whenever C(s) is incremented for some s in u
– Reset D(u) when all C(s) are reset
– Have a comparator tree to compute the minimum among all D(u) whenever an increment takes place (scalable?)
– Have a second comparator tree to compute the minimum among all C(s) within the minimum cluster u found by the first tree; the set t with this minimum is the target of migration, provided C(s) > C(t) for source set s
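To make the two-level selection concrete, here is a minimal C++ software model of the counters and the receiver-set search. All names (SetBalancer, pick_target, cluster_size) are illustrative assumptions; in hardware, the two linear scans below correspond to the two comparator trees described on the slide.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Software model of the fill counters C(s), cluster counters D(u), and the
// global counter G, plus the two-level minimum search for a receiver set.
class SetBalancer {
    std::vector<uint32_t> C;   // per-set fill counters C(s)
    std::vector<uint32_t> D;   // per-cluster counters D(u)
    uint32_t G = 0;            // global counter
    size_t cluster_size;
    uint32_t assoc, k;

public:
    SetBalancer(size_t sets, size_t cluster, uint32_t a, uint32_t kk = 4)
        : C(sets, 0), D(sets / cluster, 0),
          cluster_size(cluster), assoc(a), k(kk) {}

    void on_refill(size_t s) {
        if (C[s] < k * assoc) {        // C(s) saturates at k * associativity
            ++C[s];
            ++D[s / cluster_size];     // mirror the increment at cluster level
        }
        if (C[s] == assoc) ++G;        // C(s) just reached the associativity
        if (G == C.size()) {           // G reached the number of sets
            G = 0;
            std::fill(C.begin(), C.end(), 0);
            std::fill(D.begin(), D.end(), 0);
        }
    }

    // Returns the migration target t for source set s, or -1 if no set is
    // cooler than s (the scheme requires C(s) > C(t)).
    long pick_target(size_t s) const {
        size_t u = 0;                  // first comparator tree: coolest cluster
        for (size_t i = 1; i < D.size(); ++i)
            if (D[i] < D[u]) u = i;
        size_t t = u * cluster_size;   // second tree: coolest set within it
        for (size_t i = t + 1; i < (u + 1) * cluster_size; ++i)
            if (C[i] < C[t]) t = i;
        return (C[s] > C[t]) ? static_cast<long>(t) : -1;
    }
};
```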

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Locating migrated blocks
The migrated tags are duplicated in a migration tag cache (MTC)
– MTC is organized as a direct-mapped table
– Each entry has a tag, a target set index, a forward pointer to an MTC entry, a backward pointer to an MTC entry, a head bit, and a tail bit
– Starting at an index of the MTC, one can follow the forward pointers in a linked list until the tail bit is encountered
– One tag list in the MTC corresponds to the migrated tags from a particular parent set in the main cache
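The entry layout is easier to see in code. A minimal C++ rendering follows; the field widths are assumptions (the slide fixes only which fields exist), and the parent_set field is motivated by the replacement slides further below.

```cpp
#include <cstdint>

// One entry of the migration tag cache (MTC). Field widths are assumptions.
struct MTCEntry {
    uint32_t tag;         // migrated tag (later extended with way bits)
    uint16_t target_set;  // TS(m): set the block was migrated to
    uint16_t parent_set;  // PS(m): set the block came from (see replacement slides)
    uint16_t fwd;         // FPTR(m): next entry in this parent set's list
    uint16_t bwd;         // BPTR(m): previous entry (needed for delinking)
    bool     head;        // first entry of the list
    bool     tail;        // last entry of the list
};
```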

Balanced $ (IIT, Kanpur) Locating migrated blocks
Tag lookup protocol
– With each set s in the main cache, a head pointer H(s) into the MTC is maintained; H(s) points to the index of the MTC where the list of migrated tags belonging to set s begins
– The main cache is looked up first as usual
– On a miss, H(s) is read out and an MTC walk is initiated at index H(s)
– Note that on reset, the MTC is organized as a free list; a new migration from set s allocates an MTC entry, links it at the head of the list starting at H(s), and updates H(s)
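A hedged sketch of the miss-path walk, reusing the MTCEntry type from the sketch above; the kInvalid encoding for an empty list is an assumption.

```cpp
#include <cstdint>
#include <vector>

constexpr uint16_t kInvalid = 0xFFFF;  // assumed "empty list" encoding for H(s)

// Walk the MTC list of parent set s looking for `tag` after a main-cache
// miss. Returns the MTC index on a hit, -1 on a genuine miss.
int mtc_lookup(const std::vector<MTCEntry>& mtc,
               const std::vector<uint16_t>& H,  // per-set head pointers H(s)
               size_t s, uint32_t tag) {
    if (H[s] == kInvalid) return -1;       // no migrated tags from this set
    uint16_t i = H[s];
    while (true) {
        if (mtc[i].tag == tag) return i;   // MTC hit: block is in target_set
        if (mtc[i].tail) return -1;        // end of list reached
        i = mtc[i].fwd;                    // follow the forward pointer
    }
}
```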

Balanced $ (IIT, Kanpur) Locating migrated blocks
Tag lookup protocol
– On an MTC hit, the block is swapped with the LRU block in the parent set to improve future hit latency (behaves like a folded victim cache)
– It is necessary to avoid false hits: after such swaps, the same set may contain the same tag multiple times
– Each tag is extended by log(A) bits, where A is the associativity; the target way of a migrated tag is stored along with the tag

Balanced $ (IIT, Kanpur) Locating migrated blocks
Replacement of migrated blocks
– A migrated block may get replaced due to primary or secondary replacements
– A block evicted by a primary replacement is migrated again to a different target set; this case is easy to handle because it requires only an MTC entry modification
– But to get to the MTC entry, one needs to maintain a direct MTC entry pointer MEP(t) with each migrated tag t in the main cache

Balanced $ (IIT, Kanpur) Locating migrated blocks
Replacement of migrated blocks
– A secondary migrated block replacement evicts the block from the cache
– This requires delinking the tag from its list
– Efficient delinking is possible only in doubly-linked lists and this is why we need a backward pointer with each MTC entry
– Also, this may need updating the H(s) field in the parent set s
– To be able to get to the parent set, each MTC entry needs to store the parent set index
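The delinking itself is ordinary doubly-linked-list surgery plus the H(s) fix-up. A sketch under the same assumed encoding as the lookup above, reusing MTCEntry and kInvalid; free-list recycling of the freed entry is elided.

```cpp
#include <cstdint>
#include <vector>

// Remove MTC entry i from its list when a secondary replacement evicts the
// migrated block. Uses the parent_set field PS(m) to reach H(s).
void mtc_delink(std::vector<MTCEntry>& mtc,
                std::vector<uint16_t>& H, uint16_t i) {
    MTCEntry& m = mtc[i];
    if (m.head && m.tail) {
        H[m.parent_set] = kInvalid;     // the list becomes empty
    } else if (m.head) {
        H[m.parent_set] = m.fwd;        // update H(s) in the parent set
        mtc[m.fwd].head = true;
    } else if (m.tail) {
        mtc[m.bwd].tail = true;         // the previous entry becomes the tail
    } else {
        mtc[m.bwd].fwd = m.fwd;         // splice over the middle entry
        mtc[m.fwd].bwd = m.bwd;
    }
    // The freed entry would now be returned to the MTC free list.
}
```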

Balanced $ (IIT, Kanpur) Locating migrated blocks
Summary of structures added till now
– Per set s: one saturating counter C(s), one head pointer H(s) and VALID(H(s))
– Per tag t: MTC entry pointer MEP(t) and VALID(MEP(t)), extra way bits W(t)
– Per MTC entry m: migrated tag MT(m) including the extra way bits, target set index TS(m), parent set index PS(m), forward pointer FPTR(m), backward pointer BPTR(m), head/tail bits HT(m)
– Per set cluster u: saturating counter D(u)
– A global saturating counter
– Two comparator trees
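Rendered as C++, the per-structure bookkeeping from this list reads as follows; widths are again assumptions, and the comparator trees have no software analogue.

```cpp
#include <cstdint>

struct PerSetState {       // one instance per main-cache set s
    uint32_t C;            // saturating fill counter C(s)
    uint16_t H;            // head pointer H(s) into the MTC
    bool     H_valid;      // VALID(H(s))
};

struct PerTagState {       // one instance per main-cache tag t
    uint16_t MEP;          // MTC entry pointer MEP(t)
    bool     MEP_valid;    // VALID(MEP(t))
    uint8_t  W;            // extra way bits W(t), log2(A) wide
};

struct PerClusterState {   // one instance per set cluster u
    uint32_t D;            // saturating counter D(u)
};
// Plus the global saturating counter G and the MTC entries (MTCEntry above).
```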

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Hit/Miss critical path
Reducing the MTC walk latency
– Proposal #1: Make the MTC dual-ported so that a list can be walked from both ends (a win-win situation); halves the hit as well as the miss path
– Add a tail pointer T(s) to each set (along with H(s)) so that the tail of a list can be accessed directly
– Proposal #2: Maintain a summary of the migrated tags from a set s in a small filter F(s) attached to s
– Query F(s) first before walking the MTC; a negative response from F(s) means the tag is definitely not in the MTC; optimizes the miss path only
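Proposal #1 in code: with a dual-ported MTC, the walk consumes the list from both ends in lockstep, so a list of length n is covered in about n/2 steps. A sketch with the assumed tail pointers T(s), building on mtc_lookup above.

```cpp
// Dual-ported walk: probe from H(s) and T(s) simultaneously; each loop
// iteration models one cycle in which both ports return an entry.
int mtc_lookup_dual(const std::vector<MTCEntry>& mtc,
                    const std::vector<uint16_t>& H,
                    const std::vector<uint16_t>& T,  // tail pointers T(s)
                    size_t s, uint32_t tag) {
    if (H[s] == kInvalid) return -1;
    uint16_t f = H[s], b = T[s];
    while (true) {
        if (mtc[f].tag == tag) return f;
        if (mtc[b].tag == tag) return b;
        if (f == b || mtc[f].fwd == b) return -1;  // the pointers have met
        f = mtc[f].fwd;                            // advance toward the middle
        b = mtc[b].bwd;
    }
}
```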

Balanced $ (IIT, Kanpur) Hit/Miss critical path
Reducing the MTC walk latency
– We experimented with a simple design of a 60-bit F(s) with great success
– Divide the 60 bits into nine segments: each of the lower eight segments is seven bits wide and the upper segment is four bits wide
– When a tag t is queried, the lower three bits of t identify one of the lower eight segments of F(s)
– Let the contents of the identified segment be f[6:0] and the contents of the upper segment be g[3:0]

Balanced $ (IIT, Kanpur) Hit/Miss critical path
Reducing the MTC walk latency
– The filter says “yes” if and only if (f[6:0] AND t[9:3]) == t[9:3] and (g[3:0] AND t[13:10]) == t[13:10]
– A newly migrated tag t is hashed into F(s) by ORing t[9:3] into the identified segment and ORing t[13:10] into the upper segment
– F(s) is not updated when a migrated tag is removed (not possible to update)
– On a false positive from F(s), all the migrated tags for the set s have to be visited anyway; at this time F(s) is cleared and rebuilt
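This filter is small enough to write out in full. A C++ sketch that follows the segment layout above; only the packing of the nine segments into one 64-bit word is an assumption.

```cpp
#include <cstdint>

// Per-set 60-bit migration filter F(s): eight 7-bit lower segments in bits
// [55:0] and the 4-bit upper segment in bits [59:56] (packing assumed).
struct MigrationFilter {
    uint64_t bits = 0;

    // Hash a newly migrated tag t into F(s).
    void insert(uint32_t t) {
        unsigned seg = t & 0x7;                           // t[2:0] picks a segment
        bits |= uint64_t((t >> 3) & 0x7F) << (7 * seg);   // OR t[9:3] into it
        bits |= uint64_t((t >> 10) & 0xF) << 56;          // OR t[13:10] into upper
    }

    // Returns false only if tag t is definitely not among the migrated tags.
    bool maybe_present(uint32_t t) const {
        unsigned seg = t & 0x7;
        uint64_t f = (bits >> (7 * seg)) & 0x7F;          // f[6:0]
        uint64_t g = (bits >> 56) & 0xF;                  // g[3:0]
        return (f & ((t >> 3) & 0x7F)) == ((t >> 3) & 0x7F) &&
               (g & ((t >> 10) & 0xF)) == ((t >> 10) & 0xF);
    }

    void clear() { bits = 0; }  // cleared and rebuilt after a false positive
};
```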

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Selective migration
Not all blocks are important
– Unnecessary migrations waste energy and may hurt performance by using up MTC space
– Ideally, we want to migrate the most frequently missing blocks
– Usually, these blocks are associated with the hot sets
– The idea, therefore, should be to identify the hot sets and migrate only the blocks evicted from the hot sets

Balanced $ (IIT, Kanpur) Selective migration
Identifying hot sets
– Associate a saturating counter R(s) with each set s to count the number of external refills to the set
– Whenever some R(s) reaches its maximum value, all R(s) are reset (leader-decides rule)
– Maintain the total refill count across all sets in a register TRC and the maximum refill count across all sets in another register MaxRC; let the average refill count be ARC = TRC >> log(|S|)
– Definition: a set s is hot if and only if R(s) > ARC + ((MaxRC - ARC) >> delta)
– Delta is dynamically incremented
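A sketch of the hot-set test under these definitions; the saturation and reset machinery for the leader-decides rule is elided, and all names are illustrative.

```cpp
#include <cstdint>
#include <vector>

struct HotSetDetector {
    std::vector<uint32_t> R;   // per-set external-refill counters R(s)
    uint64_t TRC = 0;          // total refill count across all sets
    uint32_t MaxRC = 0;        // maximum refill count across all sets
    unsigned log2_sets;        // log(|S|)
    unsigned delta;            // dynamically incremented elsewhere

    HotSetDetector(unsigned lg, unsigned d)
        : R(size_t(1) << lg, 0), log2_sets(lg), delta(d) {}

    void on_refill(size_t s) {
        ++R[s]; ++TRC;
        if (R[s] > MaxRC) MaxRC = R[s];
        // Leader-decides rule: when some R(s) saturates, all R(s) are reset
        // (saturation handling elided in this sketch).
    }

    bool is_hot(size_t s) const {
        uint64_t ARC = TRC >> log2_sets;              // average refill count
        return R[s] > ARC + ((MaxRC - ARC) >> delta);  // MaxRC >= ARC always
    }
};
```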

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Throttling migration
If a set becomes very hot, it may start migrating a large number of blocks
– While this may appear desirable, the monotonically increasing expected MTC walk cost soon outweighs the benefits
– We impose a limit on the length of the migrated tag list belonging to a particular set
– However, a static limit may not work; so the limit is dynamically increased by monitoring the volume of migrations rejected due to too short a length limit
– Each set s now maintains a list length register LLR(s)
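A per-set sketch of the throttle; the initial limit and the adjustment threshold below are assumptions, since the slide only states that the limit grows when too many migrations are rejected.

```cpp
#include <cstdint>

struct MigrationThrottle {       // one instance per set s
    uint32_t list_len = 0;       // current migrated-tag list length
    uint32_t llr = 4;            // LLR(s): dynamic limit (initial value assumed)
    uint32_t rejected = 0;       // rejections since the last limit increase

    // Called when set s wants to migrate an evicted block.
    bool try_migrate() {
        if (list_len >= llr) {
            if (++rejected >= 16) {  // adjustment threshold assumed
                ++llr;               // relax the limit under rejection pressure
                rejected = 0;
            }
            return false;            // migration rejected
        }
        ++list_len;
        return true;                 // migration allowed
    }

    void on_delink() { if (list_len) --list_len; }  // list shrank by one
};
```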

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Retaining migrated blocks
The number of misses between two misses to the same block is often very high
– This points to the danger of losing the migrated blocks before they get reused
– We need a replacement policy that gives lower replacement priority to the migrated blocks, because these are the blocks we really want to retain
– Classify the sets into high-hit and low-hit sets
– For high-hit sets, continue with the baseline policy (LRU in our case)
– For low-hit sets, consider the non-migrated blocks for eviction before the migrated ones

Balanced $ (IIT, Kanpur) Retaining migrated blocks
Associate a hit counter HC(s) with each set s
– Reset HC(s) when the refill counter is reset
– Count a hit on a migrated block as a hit in the parent set
– Classify a set as low-hit if and only if HC(s) ≤ h·R(s) and R(s) > r, for constants h > 1 and r < associativity
– We fix h to 4 and r to 1/8th of the associativity
More research is needed on better retention schemes
– This is going to play a big role
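Put together, the classification and the retention-aware victim choice might look as follows; the LRU bookkeeping is simplified to an age field, and everything not fixed by the slides (names, types) is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Low-hit test with the stated constants: h = 4, r = associativity / 8.
bool is_low_hit(uint32_t HC, uint32_t R, uint32_t assoc) {
    return HC <= 4 * R && R > assoc / 8;
}

struct Way { bool migrated_here; uint32_t lru_age; };  // larger age = older

// Pick a victim way. High-hit sets use plain LRU; low-hit sets first try to
// evict a non-migrated block, falling back to migrated blocks only if the
// whole set consists of them.
int pick_victim(const std::vector<Way>& set, bool low_hit) {
    for (int pass = 0; pass < 2; ++pass) {
        int victim = -1;
        for (int w = 0; w < int(set.size()); ++w) {
            if (low_hit && pass == 0 && set[w].migrated_here)
                continue;                       // defer migrated blocks
            if (victim < 0 || set[w].lru_age > set[victim].lru_age)
                victim = w;                     // LRU among eligible ways
        }
        if (victim >= 0) return victim;
    }
    return 0;  // unreachable for a non-empty set
}
```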

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
 Scaling to CMPs
Simulation results
Summary

Balanced $ (IIT, Kanpur) Scaling to CMPs
Assume that the CMP caches will be banked
– All the policies can be applied to each bank or a subset of close-by banks independently
– No cross-bank (or cross-switch) migration
– Use cross-bank migration only for proximity enhancement (more detail in the second talk)
– The entire design scales seamlessly to larger caches
In our simulations, we assume that a pair of banks share a switch on a ring and cross-bank migration is allowed only within a pair

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
 Simulation results
Summary

Balanced $ (IIT, Kanpur) Simulation results
Single-threaded and multi-threaded applications
Single-threaded runs are done on 2 MB 16-way L2 caches
Multi-threaded runs are done on 8 cores sharing a 4 MB 16-way L2 cache
– Each core has private L1 caches
The MTC is sized to hold half the tags compared to the main cache
Space overhead of about 56 KB per 1 MB bank

Balanced $ (IIT, Kanpur) Simulation results
(Results presented as charts across several slides; the figures are not part of the transcript)

Balanced $ (IIT, Kanpur) Sketch
Observations
Design detail
– Destination of migration
– Locating migrated blocks
– Hit/Miss critical path
– Selective migration
– Throttling migration
– Retaining migrated blocks
Scaling to CMPs
Simulation results
 Summary

Balanced $ (IIT, Kanpur) Summary
– Huge potential for improving performance and saving energy with slightly over 5% extra storage
– Logic simplifications need to be explored further

Cooling the Hot Sets: Improved Space Utilization in Large Caches via Dynamic Set Balancing
Mainak Chaudhuri, IIT Kanpur
THANK YOU!