Lecture 13: Large Cache Design I


Lecture 13: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers

Multi-Core Cache Organizations
[Figure: eight cores (P), each with a private L1 cache (C), connected by an on-chip network]
- Private L1 caches
- Shared L2 cache, but physically distributed
- Scalable network
- Directory-based coherence between L1s

Multi-Core Cache Organizations
[Figure: eight cores (P) with private caches (C) and a directory (D) on the on-chip network]
- Private L1 caches
- Private L2 caches
- Scalable network
- Directory-based coherence between L2s (through a separate directory)

Shared vs. Private
- SHR: No replication of blocks
- SHR: Dynamic allocation of space among cores
- SHR: Low latency for shared data in the LLC (no indirection through a directory)
- SHR: No interconnect traffic or tag replication to maintain directories
- PVT: More isolation and better quality-of-service
- PVT: Lower average wire traversal on LLC hits
- PVT: Lower contention when accessing some shared data
- PVT: No need for software support to maintain data proximity

Innovations for Private Caches: Cooperation
Cooperative Caching, Chang and Sohi, ISCA'06
[Figure: eight cores with private caches and a central directory (D)]
- Prioritize replicated blocks for eviction with a given probability; the directory must track and communicate each block's "replica" status
- "Singlet" blocks (the only cached copy on chip) are sent to a sibling cache upon eviction (probabilistic one-chance forwarding); the block is placed in the LRU position of the sibling
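The replica-biased victim choice can be sketched as follows. This is a minimal illustration, not the paper's exact mechanism; the function name, block representation, and the 0.75 bias value are all assumptions:

```python
# Hypothetical sketch of cooperative-caching victim selection.
import random

def choose_victim(blocks, replica_bias=0.75, rng=random):
    """Prefer evicting a replicated block with a given probability;
    otherwise fall back to plain LRU (blocks ordered MRU -> LRU)."""
    replicas = [b for b in blocks if b["is_replica"]]
    if replicas and rng.random() < replica_bias:
        return replicas[-1]          # least-recently-used replica
    return blocks[-1]                # plain LRU victim
```

With a bias of 1.0 a replica is always chosen when one exists; with 0.0 the policy degenerates to plain LRU.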

Dynamic Spill-Receive
Dynamic Spill-Receive, Qureshi, HPCA'09
- Instead of forcing a block upon a sibling, designate caches as Spillers and Receivers; all cooperation is between Spillers and Receivers
- Every cache designates a few of its sets as Spillers and a few of its sets as Receivers (each cache picks different sets for this profiling)
- Each private cache independently tracks the global miss rate on its S/R sets (either by watching the bus or at the directory)
- The sets with the winning policy determine the policy for the rest of that private cache, an approach referred to as set-dueling
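A minimal sketch of the set-dueling mechanism, assuming a saturating policy-selection counter updated on misses in the leader sets. All names, set counts, and counter widths here are illustrative, not taken from the paper:

```python
# Hypothetical sketch of set-dueling for Dynamic Spill-Receive.

class SetDueler:
    """Decide the spill-vs-receive policy for one private cache."""

    def __init__(self, num_leader_sets=32, max_psel=1023):
        # First group of sets always spills, second always receives;
        # a real design would scatter the leader sets across the cache.
        self.spill_leaders = set(range(num_leader_sets))
        self.recv_leaders = set(range(num_leader_sets, 2 * num_leader_sets))
        self.max_psel = max_psel
        self.psel = max_psel // 2        # saturating policy-selection counter

    def on_miss(self, set_index):
        # A miss in a spiller-leader set is evidence against spilling...
        if set_index in self.spill_leaders:
            self.psel = min(self.psel + 1, self.max_psel)
        # ...a miss in a receiver-leader set is evidence against receiving.
        elif set_index in self.recv_leaders:
            self.psel = max(self.psel - 1, 0)

    def policy(self, set_index):
        # Leader sets keep their fixed policy; follower sets adopt the winner.
        if set_index in self.spill_leaders:
            return "spill"
        if set_index in self.recv_leaders:
            return "receive"
        return "receive" if self.psel > self.max_psel // 2 else "spill"
```

The follower sets simply adopt whichever leader policy is currently missing less.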

Innovations for Shared Caches: NUCA
Issues to be addressed for Non-Uniform Cache Access:
- Mapping
- Migration
- Search
- Replication
[Figure: CPU surrounded by cache banks at varying distances]

Static and Dynamic NUCA
- Static NUCA (S-NUCA)
  - The address index bits determine where the block is placed; sets are distributed across banks
  - Page coloring can help here to improve locality
- Dynamic NUCA (D-NUCA)
  - Ways are distributed across banks
  - Blocks are allowed to move between banks: need some search mechanism
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Alternatively, every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
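The partial-tag search idea can be sketched as below; the 8-bit partial tag, the data structures, and the fallback to a full search are assumptions for illustration:

```python
# Hypothetical sketch of a D-NUCA lookup filtered by a per-core
# partial-tag structure (one set of partial tags per bank).

def dnuca_lookup(addr, partial_tags, banks):
    """Probe only banks whose partial tags might match; a partial-tag
    hit can be a false positive, so the full tag is always checked."""
    ptag = (addr >> 6) & 0xFF                     # 8 low tag/index bits
    candidates = [b for b in range(len(banks))
                  if ptag in partial_tags[b]]
    # On a filter miss, fall back to searching every bank.
    for b in candidates or range(len(banks)):
        if addr in banks[b]:                      # full-tag check
            return b
    return None                                   # miss in all banks
```

The filter trades a little per-core storage for skipping most bank probes on a hit.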

Beckmann and Wood, MICRO'04
[Figure: NUCA bank latencies range from 13-17 cycles for banks near the controller to 65 cycles for the farthest banks]
- Data must be placed close to the center-of-gravity of requests

Alternative Layout
From Huh et al., ICS'05:
- The paper also introduces the notion of sharing degree
- A bank can be shared by any number of cores between N = 1 and 16
- Will need support for L2 coherence as well

Victim Replication, Zhang & Asanovic, ISCA'05
- Large shared L2 cache (each core has a local slice)
- On an L1 eviction, place the victim in the local L2 slice (if there are unused lines)
- The replication does not impact correctness, as this core is still in the sharer list and will receive invalidations
- On an L1 miss, the local L2 slice is checked before forwarding the request to the correct slice
[Figure: eight cores, each with a private L1 and a local slice of the shared L2]
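The eviction-side decision can be sketched as follows, assuming a simple per-set list of lines; the names and structures are illustrative, not the paper's:

```python
# Hypothetical sketch of the victim-replication choice on an L1 eviction.

def on_l1_eviction(local_l2_slice, victim):
    """Place the L1 victim in the local L2 slice only if an invalid
    (unused) line is available; never displace home data for a replica."""
    set_lines = local_l2_slice[victim["set_index"]]
    for i, line in enumerate(set_lines):
        if not line["valid"]:
            set_lines[i] = {"valid": True, "tag": victim["tag"],
                            "replica": True}
            return True      # replica cached locally
    return False             # no free line; the home slice keeps the only copy
```

Because the replica is an extra copy, dropping it (or invalidating it on a coherence message) is always safe.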

Page Coloring
[Figure: physical address bit fields from two perspectives]
- Cache view: Tag | Set Index | Block offset; with set-interleaving, the bank number comes from the low set-index bits
- OS view: Physical page number | Page offset; with page-to-bank mapping, the bank number comes from the low bits of the physical page number, which the OS controls when it assigns a frame
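The two bank-numbering schemes can be contrasted in code; the field widths (64 B blocks, 16 banks, 4 KB pages) are assumptions for illustration:

```python
# Illustrative address decompositions; all widths are assumed.
BLOCK_OFFSET_BITS = 6     # 64-byte blocks
BANK_BITS = 4             # 16 banks
PAGE_OFFSET_BITS = 12     # 4 KB pages

def bank_by_set_interleaving(paddr):
    # Bank bits sit just above the block offset: consecutive blocks
    # stripe across banks, and the OS has no say in placement.
    return (paddr >> BLOCK_OFFSET_BITS) & ((1 << BANK_BITS) - 1)

def bank_by_page_color(paddr):
    # Bank bits are the low bits of the physical page number: the OS
    # picks the bank for a whole page when it assigns a physical frame.
    return (paddr >> PAGE_OFFSET_BITS) & ((1 << BANK_BITS) - 1)
```

Under set-interleaving, adjacent blocks land in different banks; under page coloring, every block of a page lands in the page's bank.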

Cho and Jin, MICRO'06
- Page coloring to improve proximity of data and computation
- Flexible software policies
- Has the benefits of S-NUCA (each address has a unique location and no search is required)
- Has the benefits of D-NUCA (page re-mapping can help migrate data, although at a page granularity)
- Easily extends to multi-core and can easily mimic the behavior of private caches

Page Coloring Example
[Figure: eight cores with cache slices; pages are colored so they map to slices near the requesting cores]
- Awasthi et al., HPCA'09 propose a mechanism for hardware-based re-coloring of pages without requiring copies in DRAM memory
- They also formalize the cost functions that determine the optimal home for a page

R-NUCA, Hardavellas et al., ISCA'09
- A page is categorized as "shared instruction", "private data", or "shared data"; the TLB tracks this and prevents access of a different kind
- Depending on the page type, the indexing function into the shared cache is different:
  - "Private data" only looks up the local bank
  - "Shared instruction" looks up a region of 4 banks
  - "Shared data" looks up all the banks
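A sketch of class-dependent indexing in this spirit; the bank count, cluster layout, and interleaving rule are assumptions, not the paper's exact scheme:

```python
# Hypothetical sketch of R-NUCA-style class-dependent indexing.
NUM_BANKS = 16

def rnuca_lookup_banks(page_class, core_id, addr):
    """Return the bank(s) a request probes, given the page's class."""
    if page_class == "private_data":
        return [core_id % NUM_BANKS]            # local bank only
    if page_class == "shared_instruction":
        base = (core_id // 4) * 4               # 4-bank cluster near the core
        return [base + ((addr >> 6) & 3)]       # one bank within the cluster
    # Shared data interleaves across all banks.
    return [(addr >> 6) % NUM_BANKS]
```

Because each class has a fixed indexing function, no search is needed: a block of a given class has exactly one possible home for a given core.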

Rotational Interleaving
- Can allow for arbitrary group sizes and a numbering that distributes load
[Figure: 4x4 grid of tiles with 2-bit IDs]
  00 01 10 11
  10 11 00 01
  00 01 10 11
  10 11 00 01
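One numbering that reproduces the grid above for a 4x4 tile array: shift the IDs by two positions each row, so every aligned 2x2 group of neighboring tiles contains all four IDs. The rotation rule here is inferred from the slide's pattern, not taken from a specific design:

```python
# Hypothetical rotational-interleaving numbering for a 4x4 tile grid.

def tile_id(row, col, cluster_size=4):
    # Rotating the numbering by 2 per row makes every 2x2 neighborhood
    # contain each ID exactly once, spreading load across neighbors.
    return (col + 2 * row) % cluster_size
```

Every tile can then serve one quarter of each neighboring cluster's accesses, balancing load without a lookup table.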
