1 Lecture 15: Large Cache Design
Topics: innovations for multi-megabyte cache hierarchies
Reminders: Assignment 5 posted
2 Intel Montecito Cache
Two cores, each with a private 12 MB L3 cache and 1 MB L2
Naffziger et al., Journal of Solid-State Circuits, 2006
3 Intel 80-Core Prototype – Polaris
Prototype chip with an entire die of SRAM cache stacked upon the cores
4 Example Intel Studies
L3 cache sizes up to 32 MB
[Figure: four cores, each with private L1 and L2 caches, connected over an interconnect to a shared L3, a memory interface, and an IO interface]
From Zhao et al., CMP-MSI Workshop 2007
5 Shared Vs. Private Caches in Multi-Core
What are the pros/cons to a shared L2 cache?
[Figure: four cores (P1-P4) with private L1 caches, shown once with a single shared L2 and once with private per-core L2s]
6 Shared Vs. Private Caches in Multi-Core
Advantages of a shared cache:
  Space is dynamically allocated among cores
  No wastage of space because of replication
  Potentially faster cache coherence (and easier to locate data on a miss)
Advantages of a private cache:
  Small L2 → faster access time
  Private bus to L2 → less contention
7 UCA and NUCA
The small-sized caches so far have all been uniform cache access (UCA): the latency for any access is a constant, no matter where the data is found
For a large multi-megabyte cache, it is expensive to limit access time by the worst-case delay: hence, non-uniform cache architecture (NUCA)
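A minimal sketch (not from the slides) of the distinction, assuming a hypothetical 4x4 grid of cache banks and illustrative cycle counts: a UCA design charges every access the worst-case delay, while a NUCA design charges a latency that grows with the distance to the bank holding the block.

```python
# Illustrative only: grid size and cycle counts are assumptions, not lecture figures.

GRID = 4                 # 4x4 grid of cache banks
BANK_ACCESS = 5          # cycles to access a single bank
HOP_LATENCY = 2          # cycles per hop on the on-chip network
CONTROLLER = (0, 0)      # cache controller sits at one corner

def hops(bank):
    """Manhattan distance from the controller to a bank (row, col)."""
    return abs(bank[0] - CONTROLLER[0]) + abs(bank[1] - CONTROLLER[1])

def nuca_latency(bank):
    """Non-uniform: latency depends on which bank holds the block (round trip)."""
    return BANK_ACCESS + 2 * HOP_LATENCY * hops(bank)

def uca_latency():
    """Uniform: every access pays the worst-case (farthest-bank) delay."""
    farthest = (GRID - 1, GRID - 1)
    return nuca_latency(farthest)

if __name__ == "__main__":
    for bank in [(0, 0), (1, 2), (3, 3)]:
        print(f"bank {bank}: NUCA {nuca_latency(bank)} cycles, UCA {uca_latency()} cycles")
```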
8 Large NUCA
[Figure: CPU attached to a large cache divided into many banks]
Issues to be addressed for Non-Uniform Cache Access:
  Mapping
  Migration
  Search
  Replication
9 Static and Dynamic NUCA
Static NUCA (S-NUCA)
  The address index bits determine where the block is placed
  Page coloring can help here as well to improve locality
Dynamic NUCA (D-NUCA)
  Blocks are allowed to move between banks
  The block can be anywhere: need some search mechanism
    Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
    Every possible bank is looked up and the search propagates (either in series or in parallel) (complex!)
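A minimal sketch of the two mapping policies, assuming a hypothetical cache with 64-byte blocks, 16 banks, and an 8-bit partial tag per entry; the bit positions and widths are illustrative, not taken from the slides.

```python
# Illustrative parameters (assumptions, not from the lecture):
BLOCK_OFFSET_BITS = 6      # 64-byte blocks
NUM_BANKS = 16             # 16 NUCA banks
BANK_BITS = 4              # log2(NUM_BANKS)
PARTIAL_TAG_BITS = 8       # width of per-core partial tags (D-NUCA)

def snuca_bank(addr):
    """S-NUCA: a fixed slice of the address index bits picks the bank,
    so a block can live in exactly one place."""
    return (addr >> BLOCK_OFFSET_BITS) & (NUM_BANKS - 1)

class DNucaPartialTags:
    """D-NUCA hint structure: a core remembers short partial tags per bank;
    a match suggests (but does not guarantee) which banks to search first."""
    def __init__(self):
        self.tags = [set() for _ in range(NUM_BANKS)]

    def _partial_tag(self, addr):
        return (addr >> (BLOCK_OFFSET_BITS + BANK_BITS)) & ((1 << PARTIAL_TAG_BITS) - 1)

    def record(self, addr, bank):
        self.tags[bank].add(self._partial_tag(addr))

    def candidate_banks(self, addr):
        """Banks whose partial tags match; fall back to searching every bank."""
        ptag = self._partial_tag(addr)
        hits = [b for b in range(NUM_BANKS) if ptag in self.tags[b]]
        return hits if hits else list(range(NUM_BANKS))

if __name__ == "__main__":
    addr = 0x12345640
    print("S-NUCA bank:", snuca_bank(addr))
    hints = DNucaPartialTags()
    hints.record(addr, bank=7)          # block migrated to bank 7
    print("D-NUCA search order:", hints.candidate_banks(addr))
```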
10 Where Should Data be Placed?
[Figure: eight CPUs (CPU 0 through CPU 7), each with private L1I and L1D caches]
11 Example Organization
[Figure: banked cache layout with access latencies ranging from 13-17 cycles for nearby banks to 65 cycles for distant banks]
Data must be placed close to the center-of-gravity of requests
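A minimal sketch of the center-of-gravity idea, assuming hypothetical core coordinates and per-core access counts: the preferred bank is the one closest to the access-weighted centroid of the requesting cores.

```python
# Illustrative only: core/bank coordinates and access counts are assumptions.

def center_of_gravity(accesses):
    """accesses: {(x, y) core position: number of requests to the block}.
    Returns the access-weighted centroid of the requesters."""
    total = sum(accesses.values())
    cx = sum(x * n for (x, _), n in accesses.items()) / total
    cy = sum(y * n for (_, y), n in accesses.items()) / total
    return cx, cy

def best_bank(accesses, banks):
    """Pick the bank closest (Manhattan distance) to the centroid."""
    cx, cy = center_of_gravity(accesses)
    return min(banks, key=lambda b: abs(b[0] - cx) + abs(b[1] - cy))

if __name__ == "__main__":
    # Two cores on the left issue most of the requests, one on the right a few.
    accesses = {(0, 0): 40, (0, 3): 30, (3, 2): 10}
    banks = [(x, y) for x in range(4) for y in range(4)]
    print("place block in bank", best_bank(accesses, banks))
```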
12 Examples: Frequency of Accesses
[Figure: per-bank access heat maps; darker shading indicates more accesses]
OLTP (on-line transaction processing)
Ocean (scientific code)
13 Organizations in 3D
Possibly some inter-mingling of cache and cores?
[Figure: alternative arrangements of cache (C) and processor (P) tiles across stacked dies]