Lecture 17: Large Cache Design


1 Lecture 17: Large Cache Design
Papers:
• Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06
• Co-Operative Caching for Chip Multiprocessors, Chang and Sohi, ISCA’06
• Victim Replication, Zhang and Asanovic, ISCA’05

2 Page Coloring Example
[Figure: eight processor (P) and cache (C) pairs]

3 Cho and Jin, MICRO’06
• Page coloring to improve proximity of data and computation
• Flexible software policies
• Has the benefits of S-NUCA (each address has a unique location and no search is required)
• Has the benefits of D-NUCA (page re-mapping can help migrate data, although at a page granularity)
• Easily extends to multi-core and can easily mimic the behavior of private caches
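The page-coloring idea on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the page size, the slice count, the `ColorAllocator` class and the "try the nearest color number next" fallback are all assumptions made for the example.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)
NUM_SLICES = 16    # one L2 slice per core (assumed)


def slice_of(phys_addr: int) -> int:
    """Home L2 slice (the page's color): low bits of the physical page number."""
    return (phys_addr // PAGE_SIZE) % NUM_SLICES


class ColorAllocator:
    """Hypothetical OS page allocator with one free list per color."""

    def __init__(self, num_phys_pages: int):
        self.free = {color: [] for color in range(NUM_SLICES)}
        for ppn in range(num_phys_pages):
            self.free[ppn % NUM_SLICES].append(ppn)

    def alloc_near(self, core: int) -> int:
        """Return a physical page number, preferring the requester's own slice.

        Falling back to neighbouring color numbers is a stand-in for a real
        distance metric on the on-chip network.
        """
        for dist in range(NUM_SLICES):
            for color in ((core + dist) % NUM_SLICES, (core - dist) % NUM_SLICES):
                if self.free[color]:
                    return self.free[color].pop()
        raise MemoryError("no free physical pages")
```

Because the slice is a pure function of the physical address, no search is needed (the S-NUCA property); changing which physical page backs a virtual page moves the data to another slice (the D-NUCA-like property, at page granularity).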

4 Victim Replication
• Large shared L2 cache (each core has a local slice)
• On an L1 eviction, place the victim in the local L2 slice (if there are unused lines)
• The replication does not impact correctness, as this core is still in the sharer list and will receive invalidations
• On an L1 miss, the local L2 slice is checked before forwarding the request to the correct slice
[Figure: eight processor (P) and cache (C) pairs]
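The eviction and miss paths above can be sketched as follows. This is a simplified model under stated assumptions: `L2Slice` and its methods are hypothetical names, the home slice is taken to be the block address modulo the tile count, and removing the replica from the slice on a replica hit is an assumption of this sketch rather than something the slide states.

```python
class L2Slice:
    """One tile's slice of the shared L2 (hypothetical model)."""

    def __init__(self, tile_id: int, num_tiles: int, capacity: int):
        self.tile_id = tile_id
        self.num_tiles = num_tiles
        self.capacity = capacity   # total lines in this slice
        self.home_lines = set()    # blocks whose home is this slice
        self.replicas = set()      # L1 victims whose home is another slice

    def home_of(self, block: int) -> int:
        return block % self.num_tiles  # assumed address-to-slice mapping

    def has_unused_line(self) -> bool:
        return len(self.home_lines) + len(self.replicas) < self.capacity

    def on_l1_eviction(self, block: int) -> bool:
        """Keep a local replica of the victim if a line is free."""
        if self.home_of(block) == self.tile_id:
            return False           # the home copy is already local
        if block in self.replicas:
            return True
        if self.has_unused_line():
            self.replicas.add(block)
            return True
        return False

    def on_l1_miss(self, block: int):
        """Check the local slice first, then fall back to the home slice."""
        if block in self.replicas:
            self.replicas.discard(block)   # block moves back up to L1 (assumed)
            return ("replica", self.tile_id)
        return ("home", self.home_of(block))

    def on_invalidate(self, block: int) -> None:
        """The core is still a sharer, so invalidations reach the replica."""
        self.replicas.discard(block)
```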

5 Static and Dynamic NUCA
Static NUCA (S-NUCA)
• The address index bits determine where the block is placed
• Page coloring can help here as well to improve locality
Dynamic NUCA (D-NUCA)
• Blocks are allowed to move between banks
• The block can be anywhere: need some search mechanism
• Each core can maintain a partial tag structure so they have an idea of where the data might be (complex!)
• Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
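The difference between the two lookups can be made concrete with a small sketch. The block size, bank count and function names are assumptions for illustration; the D-NUCA search shown is the serial variant only.

```python
BLOCK_BYTES = 64   # assumed block size
NUM_BANKS = 16     # assumed number of L2 banks


def snuca_bank(addr: int) -> int:
    """S-NUCA: the index bits of the address fix the bank, so no search."""
    return (addr // BLOCK_BYTES) % NUM_BANKS


def dnuca_search(block: int, banks, candidate_banks):
    """D-NUCA, serial search: probe candidate banks in order.

    banks is a list of sets of resident block numbers. Returns
    (bank_found_or_None, number_of_banks_probed).
    """
    for probes, bank in enumerate(candidate_banks, start=1):
        if block in banks[bank]:
            return bank, probes
    return None, len(candidate_banks)
```

The probe count is what makes D-NUCA lookups expensive: a block that has migrated far from the first candidate bank costs several probes, and a miss costs a probe of every candidate.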

6 Private L2s
Arguments for private L2s:
• Lower latency for L2 hits
• Fewer ways have to be looked up for L2 hits
• Performance isolation (little interference from other threads)
• Can be turned off easily (since L2 does not have directory info)
• Fewer requests on the on-chip network
Primary disadvantage:
• More off-chip accesses because of higher miss rates

7 Coherence among L2s
• On an L2 miss, can broadcast request to all L2s and off-chip controller (snooping-based coherence for few cores)
• On an L2 miss, contact a directory that replicates tags for all L2 caches and handles the request appropriately (directory-based coherence for many cores)
[Figure: eight processor (P) and cache (C) pairs with a central directory (D)]

8 The Directory Structure
• For 64-byte blocks and 1 MB L2 caches, the overhead is ~432 KB
• Note the complexities in maintaining presence vectors, non-inclusion for L1 and L2
• Note that clean evictions must also inform the central directory
• Need not inform directory about L1-L2 swaps (the directory is imprecise about whether the block will be found in L1 or L2)

9 Co-operation I
• Cache-to-cache sharing
• On an L2 miss, the directory is contacted and the request is forwarded to and serviced by another cache
• If silent evictions were allowed, some of these forwards would fail
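A minimal sketch of why the directory must hear about every eviction. The `Directory` class, its method names and the choice of the lowest-numbered holder as supplier are assumptions for illustration, not the paper's design.

```python
class Directory:
    """Hypothetical central directory tracking which caches hold each block."""

    def __init__(self):
        self.holders = {}   # block -> set of cache ids

    def on_evict(self, cache: int, block: int) -> None:
        """Called for every eviction, clean ones included."""
        self.holders.get(block, set()).discard(cache)

    def on_miss(self, requester: int, block: int):
        """Forward to another cache if one holds the block, else go off-chip."""
        others = self.holders.get(block, set()) - {requester}
        if others:
            source = ("cache", min(others))   # supplier choice is arbitrary here
        else:
            source = ("memory", None)
        self.holders.setdefault(block, set()).add(requester)
        return source
```

If a cache dropped a clean block without calling `on_evict`, the directory would still list it as a holder and a later forward to that cache would fail.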

10 Co-operation II
• Every block keeps track of whether it is a singlet or replicate; this requires notifications from the central directory every time a block changes modes
• While replacing a block, replicates are preferred (with a given probability)
• When a singlet block is evicted, the directory is contacted and the directory then forwards this block to another randomly selected cache (weighted probabilities to prefer nearby caches or no cache at all) (hopefully, the forwarded block will replace another replicate)
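The two probabilistic choices on this slide can be sketched as below. The function names, the representation of a cache set as an LRU-ordered list, and the weight table are assumptions for illustration; the actual probabilities used in the paper are not given here.

```python
import random


def pick_victim(cache_set, replicate_bias: float, rng: random.Random) -> int:
    """Choose the index to evict from one cache set.

    cache_set is a list of (block, is_singlet) pairs in LRU order, index 0
    being least recently used. With probability replicate_bias the LRU
    replicate is evicted; otherwise plain LRU is used.
    """
    replicates = [i for i, (_, is_singlet) in enumerate(cache_set) if not is_singlet]
    if replicates and rng.random() < replicate_bias:
        return replicates[0]
    return 0


def pick_spill_target(weights: dict, rng: random.Random):
    """Choose where an evicted singlet goes.

    weights maps a target cache id to its weight (larger for nearby caches);
    the key None stands for "no cache at all", i.e. drop the block.
    """
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets])[0]
```

Setting `replicate_bias` to 0.0 gives ordinary LRU replacement, while 1.0 always sacrifices a replicate when one exists in the set.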

11 Co-operation III
• An evicted block is given a Recirculation Count (RC) of N and pushed to another cache. This block is placed as the LRU block in its new cache, and every eviction decrements the RC before forwarding (this paper uses N=1)
• Essentially, a block has one more chance to linger in the cache: it will stick around if it is reused before the new cache experiences capacity pressure
• This is an attempt to approximate a global LRU policy among all 32 ways of aggregate L2 cache
• Overheads per L2 cache block: one bit to indicate “once spilled and not reused” and one bit for “singlet” info
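The recirculation-count rule can be sketched as a small state machine. The `Block` class and function names are hypothetical, and resetting the count on reuse is an assumption drawn from the "once spilled and not reused" bit mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

N = 1   # recirculation count used in the paper, per the slide


@dataclass
class Block:
    addr: int
    rc: Optional[int] = None   # None: not currently marked as spilled


def on_evict(block: Block) -> bool:
    """Return True if the block is forwarded to a peer cache, False if dropped."""
    if block.rc is None:
        block.rc = N        # first eviction: grant N more chances
    else:
        block.rc -= 1       # already spilled once: use up a chance
    return block.rc > 0


def on_reuse(block: Block) -> None:
    """A hit on a spilled block clears its spilled state (assumed)."""
    block.rc = None
```

With N=1 the count needs only the single "spilled" bit: a block is forwarded on its first eviction and dropped on the next one unless it was reused in between.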

12 Results

13 Results
