1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)

Slides:



Advertisements
Similar presentations
Lecture 19: Cache Basics Today’s topics: Out-of-order execution
Advertisements

1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache III Steve Ko Computer Sciences and Engineering University at Buffalo.
1 Adapted from UCB CS252 S01, Revised by Zhao Zhang in IASTATE CPRE 585, 2004 Lecture 14: Hardware Approaches for Cache Optimizations Cache performance.
1 Lecture 16: Large Cache Design Papers: An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS’02 Distance Associativity.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 (and Appendix B) Memory Hierarchy Design Computer Architecture A Quantitative Approach,
1 Lecture 20: Cache Hierarchies, Virtual Memory Today’s topics:  Cache hierarchies  Virtual memory Reminder:  Assignment 8 will be posted soon (due.
Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.
CS Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi Proceedings.
1 Lecture 12: Cache Innovations Today: cache access basics and innovations (Sections )
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
1 Lecture 18: Large Caches, Multiprocessors Today: NUCA caches, multiprocessors (Sections ) Reminder: assignment 5 due Thursday (don’t procrastinate!)
1 Lecture 15: Virtual Memory and Large Caches Today: TLB design and large cache design basics (Sections )
1 Lecture 16: Large Cache Innovations Today: Large cache design and other cache innovations Midterm scores  91-80: 17 students  79-75: 14 students 
1 Lecture 8: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.
Lecture 17: Virtual Memory, Large Caches
1 Lecture 13: Cache Innovations Today: cache access basics and innovations, DRAM (Sections )
EECS 470 Cache Systems Lecture 13 Coverage: Chapter 5.
1 Lecture 11: Large Cache Design Topics: large cache basics and… An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al.,
1 Lecture 15: Large Cache Design Topics: innovations for multi-mega-byte cache hierarchies Reminders:  Assignment 5 posted.
1 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value is stored as a charge.
1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)
Cache intro CSE 471 Autumn 011 Principle of Locality: Memory Hierarchies Text and data are not accessed randomly Temporal locality –Recently accessed items.
CS 7810 Lecture 9 Effective Hardware-Based Data Prefetching for High-Performance Processors T-F. Chen and J-L. Baer IEEE Transactions on Computers, 44(5)
1 Lecture 16: Cache Innovations / Case Studies Topics: prefetching, blocking, processor case studies (Section 5.2)
Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy.
The Memory Hierarchy 21/05/2009Lecture 32_CA&O_Engr Umbreen Sabir.
10/18: Lecture topics Memory Hierarchy –Why it works: Locality –Levels in the hierarchy Cache access –Mapping strategies Cache performance Replacement.
CSE 378 Cache Performance1 Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache /
CS1104 – Computer Organization PART 2: Computer Architecture Lecture 10 Memory Hierarchy.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
1 Lecture 13: Cache, TLB, VM Today: large caches, virtual memory, TLB (Sections 2.4, B.4, B.5)
1 Lecture 14: DRAM Main Memory Systems Today: cache/TLB wrap-up, DRAM basics (Section 2.3)
Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.
Fundamentals of Parallel Computer Architecture - Chapter 61 Chapter 6 Introduction to Memory Hierarchy Organization Yan Solihin Copyright.
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
Improving Cache Performance Four categories of optimisation: –Reduce miss rate –Reduce miss penalty –Reduce miss rate or miss penalty using parallelism.
M E M O R Y. Computer Performance It depends in large measure on the interface between processor and memory. CPI (or IPC) is affected CPI = Cycles per.
Nov. 15, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 8: Memory Hierarchy Design * Jeremy R. Johnson Wed. Nov. 15, 2000 *This lecture.
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache And Pefetch Buffers Norman P. Jouppi Presenter:Shrinivas Narayani.
1 Lecture: Virtual Memory Topics: virtual memory, TLB/cache access (Sections 2.2)
SOFTENG 363 Computer Architecture Cache John Morris ECE/CS, The University of Auckland Iolanthe I at 13 knots on Cockburn Sound, WA.
COMPSYS 304 Computer Architecture Cache John Morris Electrical & Computer Enginering/ Computer Science, The University of Auckland Iolanthe at 13 knots.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory
Lecture: Large Caches, Virtual Memory
Multilevel Memories (Improving performance using alittle “cash”)
Lecture: Cache Hierarchies
Lecture: Large Caches, Virtual Memory
Lecture: Cache Hierarchies
Lecture 23: Cache, Memory, Security
Lecture 13: Large Cache Design I
Lecture 21: Memory Hierarchy
Lecture 21: Memory Hierarchy
Lecture 12: Cache Innovations
Lecture 23: Cache, Memory, Virtual Memory
Lecture 17: Case Studies Topics: case studies for virtual memory and cache hierarchies (Sections )
Lecture 22: Cache Hierarchies, Memory
Lecture: Cache Innovations, Virtual Memory
Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP
Lecture 15: Memory Design
Lecture 22: Cache Hierarchies, Memory
Lecture 11: Cache Hierarchies
Lecture: Cache Hierarchies
Lecture 21: Memory Hierarchy
Lecture 23: Virtual Memory, Multiprocessors
Lecture 13: Cache Basics Topics: terminology, cache organization (Sections )
Lecture 14: Cache Performance
Presentation transcript:

1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)

2 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed – the misses for an infinite cache Capacity misses: happens because the program touched many other words before re-touching the same word – the misses for a fully-associative cache Conflict misses: happens because two words map to the same location in the cache – the misses generated while moving from a fully-associative to a direct-mapped cache Sidenote: can a fully-associative cache have more misses than a direct-mapped cache of the same size?

3 Reducing Miss Rate Large block size – reduces compulsory misses, reduces miss penalty in case of spatial locality – increases traffic between different levels, space waste, and conflict misses Large cache – reduces capacity/conflict misses – access time penalty High associativity – reduces conflict misses – rule of thumb: 2-way cache of capacity N/2 has the same miss rate as 1-way cache of capacity N – more energy

4 More Cache Basics L1 caches are split as instruction and data; L2 and L3 are unified The L1/L2 hierarchy can be inclusive, exclusive, or non-inclusive On a write, you can do write-allocate or write-no-allocate On a write, you can do writeback or write-through; write-back reduces traffic, write-through simplifies coherence Reads get higher priority; writes are usually buffered L1 does parallel tag/data access; L2/L3 does serial tag/data

5 Techniques to Reduce Cache Misses Victim caches Better replacement policies – pseudo-LRU, NRU, DRRIP Cache compression

6 Victim Caches A direct-mapped cache suffers from misses because multiple pieces of data map to the same location The processor often tries to access data that it recently discarded – all discards are placed in a small victim cache (4 or 8 entries) – the victim cache is checked before going to L2 Can be viewed as additional associativity for a few sets that tend to have the most conflicts

7 Replacement Policies Pseudo-LRU: maintain a tree and keep track of which side of the tree was touched more recently; simple bit ops NRU: every block in a set has a bit; the bit is made zero when the block is touched; if all are zero, make all one; a block with bit set to 1 is evicted DRRIP: use multiple (say, 3) NRU bits; incoming blocks are set to a high number (say 6), so they are close to being evicted; similar to placing an incoming block near the head of the LRU list instead of near the tail

8 Tolerating Miss Penalty Out of order execution: can do other useful work while waiting for the miss – can have multiple cache misses -- cache controller has to keep track of multiple outstanding misses (non-blocking cache) Hardware and software prefetching into prefetch buffers – aggressive prefetching can increase contention for buses

9 Stream Buffers Simplest form of prefetch: on every miss, bring in multiple cache lines When you read the top of the queue, bring in the next line L1 Stream buffer Sequential lines

10 Stride-Based Prefetching For each load, keep track of the last address accessed by the load and a possibly consistent stride FSM detects consistent stride and issues prefetches init trans steady no-pred incorrect correct incorrect (update stride) correct incorrect (update stride) incorrect (update stride) tagprev_addrstridestate PC

11 Prefetching Hardware prefetching can be employed for any of the cache levels It can introduce cache pollution – prefetched data is often placed in a separate prefetch buffer to avoid pollution – this buffer must be looked up in parallel with the cache access Aggressive prefetching increases “coverage”, but leads to a reduction in “accuracy”  wasted memory bandwidth Prefetches must be timely: they must be issued sufficiently in advance to hide the latency, but not too early (to avoid pollution and eviction before use)

12 Intel Montecito Cache Two cores, each with a private 12 MB L3 cache and 1 MB L2 Naffziger et al., Journal of Solid-State Circuits, 2006

13 Intel 80-Core Prototype – Polaris Prototype chip with an entire die of SRAM cache stacked upon the cores

14 Example Intel Studies L3 Cache sizes up to 32 MB C L1 C L2 C L1 C L2 L3 Memory interface C L1 C L2 C L1 C L2 Interconnect IO interface From Zhao et al., CMP-MSI Workshop 2007

15 Shared Vs. Private Caches in Multi-Core What are the pros/cons to a shared L2 cache? P4P3P2P1 L1 L2 P4P3P2P1 L1 L2

16 Shared Vs. Private Caches in Multi-Core Advantages of a shared cache:  Space is dynamically allocated among cores  No waste of space because of replication  Potentially faster cache coherence (and easier to locate data on a miss) Advantages of a private cache:  small L2  faster access time  private bus to L2  less contention

17 UCA and NUCA The small-sized caches so far have all been uniform cache access: the latency for any access is a constant, no matter where data is found For a large multi-megabyte cache, it is expensive to limit access time by the worst case delay: hence, non-uniform cache architecture

18 Large NUCA CPU Issues to be addressed for Non-Uniform Cache Access: Mapping Migration Search Replication

Core 0 L1 D$ L1 I$ L2 $ Core 1 L1 D$ L1 I$ L2 $ Core 2 L1 D$ L1 I$ L2 $ Core 3 L1 D$ L1 I$ L2 $ Core 4 L1 D$ L1 I$ L2 $ Core 5 L1 D$ L1 I$ L2 $ Core 6 L1 D$ L1 I$ L2 $ Core 7 L1 D$ L1 I$ L2 $ Memory Controller for off-chip access A single tile composed of a core, L1 caches, and a bank (slice) of the shared L2 cache The cache controller forwards address requests to the appropriate L2 bank and handles coherence operations Shared NUCA Cache

20 Title Bullet