CS 7810 Lecture 17
Managing Wire Delay in Large CMP Caches
B. Beckmann and D. Wood, Proceedings of MICRO-37, December 2004
Cache Design
[Figure: basic cache organization -- address decoders feed a tag array and a data array; sense amps read each array, a comparator checks the tag, and a mux+driver selects the data.]
Capacity vs. Latency
  8 KB   : 1 cycle
  32 KB  : 2 cycles
  128 KB : 3 cycles
Access latency grows with capacity, which motivates non-uniform cache designs.
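The capacity/latency numbers above can be turned into a toy model (my own illustration, not from the slides): if a large cache is split into banks with the per-size latencies shown, the average access latency depends on which banks serve the hits.

```python
# Toy model: average access latency for a banked cache whose banks have
# the slide's latencies (8KB/1cy, 32KB/2cy, 128KB/3cy). The hit-fraction
# split is an assumed example, not data from the lecture.

def avg_latency(hit_fractions, bank_latencies):
    """Average latency given the fraction of hits served by each bank."""
    assert abs(sum(hit_fractions) - 1.0) < 1e-9
    return sum(f * l for f, l in zip(hit_fractions, bank_latencies))

# If 70% of hits land in the near 1-cycle bank, the average is roughly
# 1.4 cycles, vs. a flat 3 cycles if every access paid the big-array cost.
print(avg_latency([0.7, 0.2, 0.1], [1, 2, 3]))
```

This is the basic argument for NUCA on the following slides: keep hot blocks in the fast, near banks so the average stays close to the small-cache latency.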
Large L2 Caches
Issues to be addressed for Non-Uniform Cache Access (NUCA):
 - Mapping: which banks may hold a given block
 - Searching: how to locate a block among the banks
 - Movement: when and where to migrate blocks
Dynamic NUCA
 - Frequently accessed blocks are moved closer to the CPU, reducing average latency.
 - Partial (6-bit) tags are maintained close to the CPU; a partial-tag lookup identifies the potential location of a block or quickly signals a miss.
 - Without partial tags, every possible location would have to be searched, either serially or in parallel.
 - What changes if you optimize for power instead of latency?
DNUCA – CMP
 - Latency: 13-17 cycles to the closest banks, 65 cycles to the farthest
 - Allocation: static, based on the block's address
 - Migration: r.l -> r.i -> r.c -> m.c -> m.i -> m.l
 - Search: multicast to the 6 closest banks; on a miss there, multicast to the remaining 10
 - Problem: false misses while a block is in transit during migration
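The two-phase search policy above can be sketched as a simple probe count (assumed bank grouping for illustration; not the paper's exact protocol): probe the 6 closest banks first, and fall back to the remaining 10 only on a first-wave miss.

```python
# Sketch of two-phase multicast search: count how many banks are probed
# before the block is found. Bank numbering is an assumed example.

def search_cost(holding_bank, first_wave, second_wave):
    """Return the number of banks probed to locate holding_bank."""
    probes = len(first_wave)
    if holding_bank in first_wave:
        return probes
    # First wave missed: multicast to the remaining banks as well.
    probes += len(second_wave)
    return probes

first = list(range(6))        # the 6 banks closest to the requesting core
second = list(range(6, 16))   # the other 10 banks
print(search_cost(2, first, second))   # 6  -- found in the first wave
print(search_cost(12, first, second))  # 16 -- needed both multicasts
```

This shows why migration helps search as well as latency: the more often hot blocks sit in the first-wave banks, the fewer probes each lookup costs.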
Alternative Layout
From Huh et al., ICS '05.
Block Sharing
Hit Distribution
Block Migration Results
While block migration reduces the average access distance, it complicates search.
CMP-TLC (Transmission Line Caches)
Pros:
 - Fast wires enable uniform low-latency access
Cons:
 - Low-bandwidth interconnect
 - High implementation cost
 - More latency/complexity at the L2 interface
Stride Prefetching
Prefetching algorithm: detect at least 4 uniform-stride accesses, then allocate an entry in the stream buffer.
The stream buffer has 8 entries; each stream stays 6 (L1) or 25 (L2) accesses ahead of the demand stream.
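The detector described above can be sketched like this (a simplified single-stream illustration; the real hardware tracks multiple streams across its 8 buffer entries):

```python
# Sketch of a stride detector: after 4 uniform-stride accesses it starts
# issuing prefetches that run `depth` accesses ahead (6 for L1, 25 for L2
# per the slide). Single stream only, for clarity.

class StrideDetector:
    TRAIN = 4  # uniform-stride accesses required before prefetching

    def __init__(self, depth):
        self.depth = depth      # how many accesses the stream stays ahead
        self.last = None
        self.stride = None
        self.count = 0

    def access(self, addr):
        """Return a prefetch address once the stream is trained, else None."""
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride:
                self.count += 1
            else:
                self.stride, self.count = stride, 1
        self.last = addr
        if self.count >= self.TRAIN:
            return addr + self.stride * self.depth
        return None

d = StrideDetector(depth=6)
for a in [100, 104, 108, 112, 116]:
    pf = d.access(a)
print(pf)  # 140: 116 + 4*6, the stream running 6 accesses ahead
```

Running deep (25 ahead for L2) hides the long wire latency of far banks, which is why prefetching combines well with the NUCA techniques on the next slide.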
Combination of Techniques