18-740 Computer Architecture, Fall 2010
Lecture 6: Caching Basics
Prof. Onur Mutlu, Carnegie Mellon University

Readings
Required:
- Hennessy and Patterson, Appendix C.2 and C.3.
- Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.
- Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
Recommended:
- Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.

Memory Latency
- We would like to have a small CPI (cycles per instruction)
- But it takes 100s of CPU cycles to access main memory
- How do we bridge the gap?
  - Put all data into registers?
[Figure: CPU with register file (RF) accessing main memory (DRAM) directly]

Why Not A Lot of Registers?
- Have a large number of architectural registers?
  - Increases the access time of the register file: cannot make memories both large and fast
  - Increases the bits used in the instruction format for register IDs: 1024 registers in a 3-address machine means 10 bits per register ID, i.e., 30 bits just to specify the IDs
- Multi-level register files
  - The CRAY-1 had this: a small, fast register file connected to a large, slower file
  - Movement between the files and memory is explicitly managed by code
  - Explicit management is not simple: it is not easy to figure out which data is frequently used
- Cache: automatic management of data

Memory Hierarchy
- Fundamental tradeoff
  - Fast memory: small
  - Large memory: slow
- Idea: memory hierarchy
- Latency, cost, size, bandwidth
[Figure: hierarchy from register file (RF) to cache to main memory (DRAM) to hard disk]

Caching in a Pipelined Design
- The cache needs to be tightly integrated into the pipeline
  - Ideally, accessed in 1 cycle so that dependent operations do not stall
- High-frequency pipeline: cannot make the cache large
  - But we want a large cache AND a pipelined design
- Idea: cache hierarchy
[Figure: CPU with register file (RF) and a Level 1 cache, backed by a Level 2 cache and main memory (DRAM)]

Caching Basics: Temporal Locality
- Idea: store recently accessed data in automatically managed fast memory (called cache)
- Anticipation: the data will be accessed again soon
- Temporal locality principle
  - Recently accessed data will be accessed again in the near future
  - This is what Maurice Wilkes had in mind: Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
  - "The use is discussed of a fast core memory of, say ... words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."
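As an illustration of temporal locality (this example is not from the slides; the sizes are arbitrary): the few histogram counters below are reused on every iteration, so after their first access they keep hitting in the cache.

    #include <stdio.h>

    #define BUCKETS 16

    int main(void)
    {
        int hist[BUCKETS] = {0};
        /* The same 16 counters are touched a million times: once they are
         * in the cache, almost every later access is a hit. */
        for (int i = 0; i < 1000000; i++)
            hist[i % BUCKETS]++;
        printf("%d\n", hist[0]);
        return 0;
    }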

Caching Basics: Spatial Locality
- Idea: store addresses adjacent to the recently accessed one in automatically managed fast memory
  - Logically divide memory into equal-size blocks
  - Fetch the accessed block into the cache in its entirety
- Anticipation: nearby data will be accessed soon
- Spatial locality principle
  - Nearby data in memory will be accessed in the near future
  - E.g., sequential instruction access, array traversal (see the traversal example below)
  - This is what the IBM 360/85 implemented: a 16 Kbyte cache with 64-byte blocks
  - Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.
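A short example of spatial locality from array traversal (not from the slides; the array size is arbitrary). The row-major loop below touches consecutive addresses, so every byte of each fetched cache block is used; swapping the two loops would stride through memory and waste most of each block.

    #include <stdio.h>

    #define N 1024

    static int a[N][N];

    int main(void)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];   /* consecutive addresses: good spatial locality */
        printf("%ld\n", sum);
        return 0;
    }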

The Bookshelf Analogy
- Book in your hand
- Desk
- Bookshelf
- Boxes at home
- Boxes in storage
- Recently used books tend to stay on the desk
  - Comp Arch books, books for classes you are currently taking
  - Until the desk gets full
- Adjacent books on the shelf are needed around the same time
  - If I have organized/categorized my books well on the shelf

Caching Basics
- When data is referenced
  - HIT: if in cache, use the cached data instead of accessing memory
  - MISS: if not in cache, bring the block into the cache
    - May have to kick something else out to do it
- Important decisions
  - Placement: where and how to place/find a block in the cache?
  - Replacement: what data to remove to make room in the cache?
  - Write policy: what do we do about writes?
  - Instructions/data
- Block (line): unit of storage in the cache
  - "Block" vs. "line": IBM vs. DEC terminology

Cache Abstraction and Metrics
- Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
- Average memory access time (AMAT) = (hit-rate * hit-latency) + (miss-rate * miss-latency)
- Aside: Can reducing AMAT reduce performance?
[Figure: an address is looked up in the tag store ("is the address in the cache?") and the data store; the outputs are hit/miss and the data]
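A small worked example of the AMAT formula; the hit rate and latencies are illustrative numbers, not values from the lecture.

    #include <stdio.h>

    /* AMAT = hit-rate * hit-latency + miss-rate * miss-latency,
     * where miss-latency is the total latency of a miss. */
    static double amat(double hit_rate, double hit_latency, double miss_latency)
    {
        double miss_rate = 1.0 - hit_rate;
        return hit_rate * hit_latency + miss_rate * miss_latency;
    }

    int main(void)
    {
        /* e.g., 2-cycle hit, 100-cycle miss, 90% hit rate: 0.9*2 + 0.1*100 = 11.8 */
        printf("AMAT = %.1f cycles\n", amat(0.90, 2.0, 100.0));
        return 0;
    }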

Placement and Access
- Assume byte-addressable memory: 256 bytes, 8-byte blocks, i.e., 32 blocks in memory
- Assume cache: 64 bytes, 8 blocks
  - Direct-mapped: a block can go to only one location
  - Addresses with the same index contend for the same location, causing conflict misses
[Figure: the address is split into tag (2 bits), index (3 bits), and byte-in-block (3 bits); the index selects one tag-store entry (valid bit + tag) and one data-store block; a tag comparison (=?) produces hit/miss, and a MUX selects the byte in the block]
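A minimal sketch of how an address splits into tag, index, and byte-in-block for the toy cache above (8-byte blocks, 8 cache blocks, 8-bit addresses); the helper names are mine, not from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 3   /* 8-byte blocks  */
    #define INDEX_BITS  3   /* 8 cache blocks */

    static uint32_t offset_of(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    static uint32_t index_of(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static uint32_t tag_of(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

    int main(void)
    {
        /* Byte addresses 0, 64, 128, 192 all map to index 0 with different tags,
         * so in the direct-mapped cache they contend for the same location. */
        for (uint32_t addr = 0; addr < 256; addr += 64)
            printf("addr %3u -> tag %u, index %u, offset %u\n",
                   addr, tag_of(addr), index_of(addr), offset_of(addr));
        return 0;
    }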

Set Associativity
- Addresses 0 and 8 always conflict in the direct-mapped cache
- Instead of having one column of 8 blocks, have 2 columns of 4 blocks each (a 2-way set)
[Figure: the address is split into tag (3 bits), index (2 bits), and byte-in-block (3 bits); the index selects a SET: two (valid bit, tag) pairs in the tag store and two blocks in the data store; both tags are compared (=?) in parallel, logic produces hit/miss, and a MUX selects the block and byte]
- Associative memory within the set
  -- More complex, slower access, larger tag store
  + Accommodates conflicts better (fewer conflict misses)
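A sketch of the lookup path for the 2-way organization above (4 sets of 2 blocks, 8-byte blocks). The data structures and helper names are hypothetical, intended only to show the parallel tag comparisons within a set.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 3
    #define SETS        4
    #define WAYS        2

    struct line { bool valid; uint32_t tag; uint8_t data[1 << OFFSET_BITS]; };
    static struct line cache[SETS][WAYS];

    static bool lookup(uint32_t addr, uint8_t *byte_out)
    {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) % SETS;   /* which set           */
        uint32_t tag    = (addr >> OFFSET_BITS) / SETS;   /* remaining high bits */

        for (int way = 0; way < WAYS; way++) {            /* compare both tags   */
            struct line *l = &cache[index][way];
            if (l->valid && l->tag == tag) {              /* HIT                 */
                *byte_out = l->data[offset];
                return true;
            }
        }
        return false;                                     /* MISS: go to the next level */
    }

    int main(void)
    {
        uint8_t b;
        cache[0][1] = (struct line){ .valid = true, .tag = 1 };  /* pretend block 4 is cached */
        printf("addr 32: %s\n", lookup(32, &b) ? "hit" : "miss");
        printf("addr 40: %s\n", lookup(40, &b) ? "hit" : "miss");
        return 0;
    }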

Higher Associativity
- 4-way set associative
  -- More tag comparators and a wider data mux; larger tags
  + Likelihood of conflict misses is even lower
[Figure: a 4-way tag store and data store; four parallel tag comparisons (=?) feed hit/miss logic, and a MUX selects the byte in the block]

Full Associativity
- Fully associative cache
  - A block can be placed in any cache location
[Figure: every tag in the tag store is compared (=?) against the address in parallel; logic produces hit/miss, and a MUX selects the byte in the block]

Associativity
- How many blocks can map to the same index (or set)?
- Larger associativity
  - Lower miss rate, less variation among programs
  - Diminishing returns
- Smaller associativity
  - Lower cost
  - Faster hit time: especially important for L1 caches
[Figure: hit rate vs. associativity curve, rising quickly at first and then flattening out]

Set-Associative Caches (I)
- Diminishing returns in hit rate from higher associativity
- Longer access time with higher associativity
- Which block in the set to replace on a cache miss?
  - Any invalid block first
  - If all are valid, consult the replacement policy:
    - Random
    - FIFO
    - Least recently used (how to implement? see the sketch below)
    - Not most recently used
    - Least frequently used?
    - Least costly to re-fetch? (Why would memory accesses have different costs?)
    - Hybrid replacement policies
    - Optimal replacement policy?
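A minimal sketch of one way to implement true LRU for a single 4-way set, using an age counter per way (0 = most recently used). This is an illustration, not how any particular processor does it; real designs often use per-set ordering bits or pseudo-LRU instead.

    #include <stdint.h>

    #define WAYS 4

    struct way { int valid; uint32_t tag; uint8_t age; };  /* age 0 = most recently used */

    /* Update ages after way `used` is accessed: every way younger than it
     * ages by one, and it becomes the most recently used way. */
    static void touch(struct way set[WAYS], int used)
    {
        for (int w = 0; w < WAYS; w++)
            if (set[w].valid && set[w].age < set[used].age)
                set[w].age++;
        set[used].age = 0;
    }

    /* Pick a victim: any invalid block first, otherwise the oldest (LRU) block. */
    static int victim(const struct way set[WAYS])
    {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid)
                return w;
        int oldest = 0;
        for (int w = 1; w < WAYS; w++)
            if (set[w].age > set[oldest].age)
                oldest = w;
        return oldest;
    }

    /* Install a new tag into way `w` and mark it most recently used. */
    static void install(struct way set[WAYS], int w, uint32_t tag)
    {
        set[w].valid = 1;
        set[w].tag   = tag;
        set[w].age   = WAYS - 1;   /* treat as oldest so touch() ages the others */
        touch(set, w);
    }

    int main(void)
    {
        struct way set[WAYS] = {0};
        for (uint32_t t = 0; t < 6; t++)     /* fill the set, then start evicting LRU ways */
            install(set, victim(set), t);
        return 0;
    }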

Replacement Policy
- LRU vs. Random
  - Set thrashing: when the "program working set" in a set is larger than the set associativity
  - 4-way cache with cyclic references to A, B, C, D, E: 0% hit rate with the LRU policy (the simulation sketch below reproduces this)
  - Random replacement is better when thrashing occurs
- In practice:
  - Depends on the workload
  - Average hit rates of LRU and Random are similar
- Hybrid of LRU and Random
  - How to choose between the two? Set sampling
  - See Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
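A small simulation of the thrashing example above: one 4-way set under true LRU, accessed with the cyclic pattern A B C D E. The block about to be reused is always the one that was just evicted, so the hit rate is 0%.

    #include <stdio.h>

    #define WAYS 4

    int main(void)
    {
        char set[WAYS];              /* set[0] = MRU ... set[WAYS-1] = LRU */
        int filled = 0, hits = 0, accesses = 0;
        const char refs[] = "ABCDE";

        for (int round = 0; round < 20; round++) {
            for (int r = 0; refs[r] != '\0'; r++, accesses++) {
                char blk = refs[r];
                int hit_at = -1;
                for (int i = 0; i < filled; i++)
                    if (set[i] == blk) { hit_at = i; break; }
                if (hit_at >= 0) {
                    hits++;                                   /* hit: move block to MRU position */
                    for (int i = hit_at; i > 0; i--) set[i] = set[i - 1];
                } else {
                    if (filled < WAYS) filled++;              /* miss: evict the LRU block */
                    for (int i = filled - 1; i > 0; i--) set[i] = set[i - 1];
                }
                set[0] = blk;
            }
        }
        printf("LRU hit rate: %d/%d\n", hits, accesses);      /* prints 0/100 */
        return 0;
    }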

Set-Associative Caches (II)
- Belady's OPT
  - Replace the block that is going to be referenced furthest in the future by the program
  - Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Systems Journal, 1966.
  - How do we implement this? Simulate?
- Is this optimal for minimizing miss rate?
- Is this optimal for minimizing execution time?
  - No. Cache miss latency/cost varies from block to block!
  - Two reasons: remote vs. local caches, and miss overlapping
  - Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
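A minimal offline sketch of Belady's OPT victim choice: given the entire future reference trace, evict the resident block whose next use is furthest away (or that is never used again). This only works in simulation, since real hardware cannot see the future; the function and parameter names are mine.

    /* resident[]: block IDs currently in the set (nres of them)
     * trace[]:    the full reference trace (ntrace block IDs)
     * now:        index of the next reference in the trace
     * returns the position in resident[] of the block to evict */
    static int opt_victim(const int resident[], int nres,
                          const int trace[], int ntrace, int now)
    {
        int victim = 0, best_dist = -1;
        for (int i = 0; i < nres; i++) {
            int dist = ntrace - now;             /* "never referenced again" */
            for (int t = now; t < ntrace; t++)
                if (trace[t] == resident[i]) { dist = t - now; break; }
            if (dist > best_dist) { best_dist = dist; victim = i; }
        }
        return victim;
    }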

Handling Writes (Stores)
- When do we write the modified data in a cache to the next level?
  - Write-through: at the time the write happens
  - Write-back: when the block is evicted
- Write-back
  -- Need a bit in the tag store indicating the block is "modified"
  + Can consolidate multiple writes to the same block before eviction
    - Potentially saves bandwidth between cache levels and saves energy
- Write-through
  + Simpler
  + All levels are up to date. Consistency: simpler cache coherence because there is no need to check lower-level caches
  -- More bandwidth intensive
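A sketch of the two write policies for a single hypothetical cache line; write_next_level() is a stand-in for the next level of the hierarchy, not a real API.

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

    /* Stand-in for sending data to the next cache level / memory. */
    static void write_next_level(uint32_t addr, const uint8_t *data, int len)
    {
        (void)addr; (void)data; (void)len;
    }

    /* Write-through: the next level is updated at the time of the write. */
    static void store_write_through(struct line *l, uint32_t addr, int off, uint8_t v)
    {
        l->data[off] = v;
        write_next_level(addr, &v, 1);          /* all levels stay up to date */
    }

    /* Write-back: only mark the block modified; the write reaches the next
     * level when the block is evicted, so multiple writes are consolidated. */
    static void store_write_back(struct line *l, int off, uint8_t v)
    {
        l->data[off] = v;
        l->dirty = true;                        /* the "modified" bit in the tag store */
    }

    static void evict_write_back(struct line *l, uint32_t block_addr)
    {
        if (l->valid && l->dirty)
            write_next_level(block_addr, l->data, sizeof l->data);
        l->valid = l->dirty = false;
    }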

Handling Writes (Stores)
- Do we allocate a cache block on a write miss?
  - Allocate on write miss: yes
  - No-allocate on write miss: no
- Allocate on write miss
  + Can consolidate writes instead of writing each of them individually to the next level
  -- Requires (?) transfer of the whole cache block
  + Simpler because write misses can be treated the same way as read misses
- No-allocate
  + Conserves cache space if the locality of writes is low
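A sketch of the write-miss decision under the two allocation policies; the structure and helpers are hypothetical stubs meant only to show the control flow.

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

    static struct line *lookup(uint32_t addr)       { (void)addr; return 0; }  /* stub: always miss */
    static struct line *allocate_block(uint32_t a)  { static struct line l; (void)a; return &l; }
    static void write_around(uint32_t a, uint8_t v) { (void)a; (void)v; }      /* straight to next level */

    static void store_byte(uint32_t addr, uint8_t v, bool write_allocate)
    {
        struct line *l = lookup(addr);
        if (l) {                            /* write hit: update the cached copy */
            l->data[addr & 63] = v;
            l->dirty = true;
        } else if (write_allocate) {        /* allocate: handle like a read miss,          */
            l = allocate_block(addr);       /* transfer the whole block, then write into it */
            l->data[addr & 63] = v;
            l->dirty = true;
        } else {
            write_around(addr, v);          /* no-allocate: the cache is left untouched */
        }
    }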

Inclusion vs. Exclusion
- Inclusive caches
  - Every block existing in the first level also exists in the next level
  - When fetching a block, place it in all cache levels
  - Tradeoffs:
    -- Leads to duplication of data in the hierarchy: less efficient
    -- Maintaining inclusion takes effort (forced evictions)
    + But makes cache coherence in multiprocessors easier: need to track other processors' accesses only in the highest-level cache
- Exclusive caches
  - The blocks contained in the cache levels are mutually exclusive
  - When evicting a block, do you write it back to the next level?
    + More efficient utilization of cache space
    + (Potentially) more flexibility in replacement/placement
    -- More blocks/levels to keep track of to ensure cache coherence; takes effort
- Non-inclusive caches
  - No guarantees of inclusion or exclusion: simpler design
  - Most Intel processors

Maintaining Inclusion and Exclusion
- When does maintaining inclusion take effort?
  - L1 block size < L2 block size
  - L1 associativity > L2 associativity
  - Prefetching into L2
  - When a block is evicted from L2, need to evict all corresponding subblocks from L1 -> keep 1 bit per subblock in L2 (see the back-invalidation sketch below)
  - When a block is inserted, make sure all higher levels also have it
- When does maintaining exclusion take effort?
  - L1 block size != L2 block size
  - Prefetching into any cache level
  - When a block is inserted into any level, ensure it is not in any other
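A sketch of the forced evictions ("back-invalidation") needed to keep L1 included in L2 when the L2 block is larger than the L1 block; the sizes and helper names are illustrative only.

    #include <stdint.h>

    #define L1_BLOCK  64
    #define L2_BLOCK 256                       /* 4 L1 subblocks per L2 block */

    static void l1_invalidate(uint32_t addr)      { (void)addr; } /* stand-in */
    static void l2_writeback_if_dirty(uint32_t a) { (void)a;    } /* stand-in */

    /* Evicting an L2 block: to maintain inclusion, every L1 subblock covered
     * by this block must also be evicted, otherwise L1 would hold data that
     * no longer exists in L2. A per-subblock presence bit in L2 can limit
     * these invalidations to subblocks actually present in L1. */
    static void l2_evict(uint32_t l2_block_addr)
    {
        for (uint32_t off = 0; off < L2_BLOCK; off += L1_BLOCK)
            l1_invalidate(l2_block_addr + off);
        l2_writeback_if_dirty(l2_block_addr);
    }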