Lecture 13: Large Cache Design I


Lecture 13: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers

Multi-Core Cache Organizations
[Figure: eight cores (P), each with a private L1 cache (C), connected by an on-chip network]
- Private L1 caches
- Shared L2 cache, but physically distributed
- Scalable network
- Directory-based coherence between L1s

Multi-Core Cache Organizations
[Figure: eight cores (P) with private caches (C) and a directory (D) on the on-chip network]
- Private L1 caches
- Private L2 caches
- Scalable network
- Directory-based coherence between L2s (through a separate directory)

Shared vs. Private
- SHR: No replication of blocks
- SHR: Dynamic allocation of space among cores
- SHR: Low latency for shared data in the LLC (no indirection through a directory)
- SHR: No interconnect traffic or tag replication to maintain directories
- PVT: More isolation and better quality-of-service
- PVT: Lower average wire traversal on LLC hits
- PVT: Lower contention when accessing some shared data
- PVT: No need for software support to maintain data proximity

Innovations for Private Caches: Cooperation
Cooperative Caching, Chang and Sohi, ISCA'06
[Figure: eight cores with private caches and a central directory (D)]
- Prioritize replicated blocks for eviction with a given probability; the directory must track and communicate each block's "replica" status
- "Singlet" blocks (the only cached copy on chip) are sent to a sibling cache upon eviction (probabilistic one-chance forwarding); the block is placed in the LRU position of the sibling
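The replica-biased victim choice can be sketched as follows. This is a minimal illustration, not the paper's exact mechanism; the function name, block representation, and the 0.75 bias value are all assumptions:

```python
# Hypothetical sketch of cooperative-caching victim selection.
import random

def choose_victim(blocks, replica_bias=0.75, rng=random):
    """Prefer evicting a replicated block with a given probability;
    otherwise fall back to plain LRU (blocks ordered MRU -> LRU)."""
    replicas = [b for b in blocks if b["is_replica"]]
    if replicas and rng.random() < replica_bias:
        return replicas[-1]          # least-recently-used replica
    return blocks[-1]                # plain LRU victim
```

With a bias of 1.0 a replica is always chosen when one exists; with 0.0 the policy degenerates to plain LRU.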

Dynamic Spill-Receive
Dynamic Spill-Receive, Qureshi, HPCA'09
- Instead of forcing a block upon a sibling, designate caches as Spillers and Receivers; all cooperation is between Spillers and Receivers
- Every cache designates a few of its sets as Spillers and a few of its sets as Receivers (each cache picks different sets for this profiling)
- Each private cache independently tracks the global miss rate on its S/R sets (either by watching the bus or at the directory)
- The sets with the winning policy determine the policy for the rest of that private cache, an approach referred to as set-dueling
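A minimal sketch of the set-dueling mechanism, assuming a saturating policy-selection counter updated on misses in the leader sets. All names, set counts, and counter widths here are illustrative, not taken from the paper:

```python
# Hypothetical sketch of set-dueling for Dynamic Spill-Receive.

class SetDueler:
    """Decide the spill-vs-receive policy for one private cache."""

    def __init__(self, num_leader_sets=32, max_psel=1023):
        # First group of sets always spills, second always receives;
        # a real design would scatter the leader sets across the cache.
        self.spill_leaders = set(range(num_leader_sets))
        self.recv_leaders = set(range(num_leader_sets, 2 * num_leader_sets))
        self.max_psel = max_psel
        self.psel = max_psel // 2        # saturating policy-selection counter

    def on_miss(self, set_index):
        # A miss in a spiller-leader set is evidence against spilling...
        if set_index in self.spill_leaders:
            self.psel = min(self.psel + 1, self.max_psel)
        # ...a miss in a receiver-leader set is evidence against receiving.
        elif set_index in self.recv_leaders:
            self.psel = max(self.psel - 1, 0)

    def policy(self, set_index):
        # Leader sets keep their fixed policy; follower sets adopt the winner.
        if set_index in self.spill_leaders:
            return "spill"
        if set_index in self.recv_leaders:
            return "receive"
        return "receive" if self.psel > self.max_psel // 2 else "spill"
```

The follower sets simply adopt whichever leader policy is currently missing less.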

Innovations for Shared Caches: NUCA
Issues to be addressed for Non-Uniform Cache Access:
- Mapping
- Migration
- Search
- Replication
[Figure: CPU surrounded by cache banks at varying distances]

Static and Dynamic NUCA
- Static NUCA (S-NUCA)
  - The address index bits determine where the block is placed; sets are distributed across banks
  - Page coloring can help here to improve locality
- Dynamic NUCA (D-NUCA)
  - Ways are distributed across banks
  - Blocks are allowed to move between banks: need some search mechanism
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Alternatively, every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
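The partial-tag search idea can be sketched as below; the 8-bit partial tag, the data structures, and the fallback to a full search are assumptions for illustration:

```python
# Hypothetical sketch of a D-NUCA lookup filtered by a per-core
# partial-tag structure (one set of partial tags per bank).

def dnuca_lookup(addr, partial_tags, banks):
    """Probe only banks whose partial tags might match; a partial-tag
    hit can be a false positive, so the full tag is always checked."""
    ptag = (addr >> 6) & 0xFF                     # 8 low tag/index bits
    candidates = [b for b in range(len(banks))
                  if ptag in partial_tags[b]]
    # On a filter miss, fall back to searching every bank.
    for b in candidates or range(len(banks)):
        if addr in banks[b]:                      # full-tag check
            return b
    return None                                   # miss in all banks
```

The filter trades a little per-core storage for skipping most bank probes on a hit.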

Beckmann and Wood, MICRO'04
[Figure: NUCA bank latencies range from 13-17 cycles for banks near the controller to 65 cycles for the farthest banks]
- Data must be placed close to the center-of-gravity of requests

Alternative Layout
From Huh et al., ICS'05:
- The paper also introduces the notion of sharing degree
- A bank can be shared by any number of cores between N = 1 and 16
- Will need support for L2 coherence as well

Victim Replication, Zhang & Asanovic, ISCA'05
- Large shared L2 cache (each core has a local slice)
- On an L1 eviction, place the victim in the local L2 slice (if there are unused lines)
- The replication does not impact correctness, as this core is still in the sharer list and will receive invalidations
- On an L1 miss, the local L2 slice is checked before forwarding the request to the correct slice
[Figure: eight cores, each with a private L1 and a local slice of the shared L2]
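The eviction-side decision can be sketched as follows, assuming a simple per-set list of lines; the names and structures are illustrative, not the paper's:

```python
# Hypothetical sketch of the victim-replication choice on an L1 eviction.

def on_l1_eviction(local_l2_slice, victim):
    """Place the L1 victim in the local L2 slice only if an invalid
    (unused) line is available; never displace home data for a replica."""
    set_lines = local_l2_slice[victim["set_index"]]
    for i, line in enumerate(set_lines):
        if not line["valid"]:
            set_lines[i] = {"valid": True, "tag": victim["tag"],
                            "replica": True}
            return True      # replica cached locally
    return False             # no free line; the home slice keeps the only copy
```

Because the replica is an extra copy, dropping it (or invalidating it on a coherence message) is always safe.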

Page Coloring
[Figure: physical address bit fields from two perspectives]
- Cache view: Tag | Set Index | Block offset; with set-interleaving, the bank number comes from the low set-index bits
- OS view: Physical page number | Page offset; with page-to-bank mapping, the bank number comes from the low bits of the physical page number, which the OS controls when it assigns a frame
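The two bank-numbering schemes can be contrasted in code; the field widths (64 B blocks, 16 banks, 4 KB pages) are assumptions for illustration:

```python
# Illustrative address decompositions; all widths are assumed.
BLOCK_OFFSET_BITS = 6     # 64-byte blocks
BANK_BITS = 4             # 16 banks
PAGE_OFFSET_BITS = 12     # 4 KB pages

def bank_by_set_interleaving(paddr):
    # Bank bits sit just above the block offset: consecutive blocks
    # stripe across banks, and the OS has no say in placement.
    return (paddr >> BLOCK_OFFSET_BITS) & ((1 << BANK_BITS) - 1)

def bank_by_page_color(paddr):
    # Bank bits are the low bits of the physical page number: the OS
    # picks the bank for a whole page when it assigns a physical frame.
    return (paddr >> PAGE_OFFSET_BITS) & ((1 << BANK_BITS) - 1)
```

Under set-interleaving, adjacent blocks land in different banks; under page coloring, every block of a page lands in the page's bank.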

Cho and Jin, MICRO'06
- Page coloring to improve proximity of data and computation
- Flexible software policies
- Has the benefits of S-NUCA (each address has a unique location and no search is required)
- Has the benefits of D-NUCA (page re-mapping can help migrate data, although at a page granularity)
- Easily extends to multi-core and can easily mimic the behavior of private caches

Page Coloring Example
[Figure: eight cores with cache slices; pages are colored so they map to slices near the requesting cores]
- Awasthi et al., HPCA'09 propose a mechanism for hardware-based re-coloring of pages without requiring copies in DRAM memory
- They also formalize the cost functions that determine the optimal home for a page

R-NUCA, Hardavellas et al., ISCA'09
- A page is categorized as "shared instruction", "private data", or "shared data"; the TLB tracks this and prevents access of a different kind
- Depending on the page type, the indexing function into the shared cache is different:
  - "Private data" only looks up the local bank
  - "Shared instruction" looks up a region of 4 banks
  - "Shared data" looks up all the banks
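A sketch of class-dependent indexing in this spirit; the bank count, cluster layout, and interleaving rule are assumptions, not the paper's exact scheme:

```python
# Hypothetical sketch of R-NUCA-style class-dependent indexing.
NUM_BANKS = 16

def rnuca_lookup_banks(page_class, core_id, addr):
    """Return the bank(s) a request probes, given the page's class."""
    if page_class == "private_data":
        return [core_id % NUM_BANKS]            # local bank only
    if page_class == "shared_instruction":
        base = (core_id // 4) * 4               # 4-bank cluster near the core
        return [base + ((addr >> 6) & 3)]       # one bank within the cluster
    # Shared data interleaves across all banks.
    return [(addr >> 6) % NUM_BANKS]
```

Because each class has a fixed indexing function, no search is needed: a block of a given class has exactly one possible home for a given core.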

Rotational Interleaving
- Can allow for arbitrary group sizes and a numbering that distributes load
[Figure: 4x4 grid of tiles with 2-bit IDs]
  00 01 10 11
  10 11 00 01
  00 01 10 11
  10 11 00 01
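One numbering that reproduces the grid above for a 4x4 tile array: shift the IDs by two positions each row, so every aligned 2x2 group of neighboring tiles contains all four IDs. The rotation rule here is inferred from the slide's pattern, not taken from a specific design:

```python
# Hypothetical rotational-interleaving numbering for a 4x4 tile grid.

def tile_id(row, col, cluster_size=4):
    # Rotating the numbering by 2 per row makes every 2x2 neighborhood
    # contain each ID exactly once, spreading load across neighbors.
    return (col + 2 * row) % cluster_size
```

Every tile can then serve one quarter of each neighboring cluster's accesses, balancing load without a lookup table.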
