
Memory Hierarchies
Adapted from slides by Sally McKee, Cornell University
Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005

SRAM vs. DRAM
SRAM (static random-access memory):
- Faster than DRAM; 2-10 ns access time
- Each storage cell is larger, so smaller capacity for the same area
DRAM (dynamic random-access memory):
- Each storage cell is tiny (charge stored on a capacitor); 2 Gb chips available today
- 50-70 ns access time
- Leaky: data must be periodically refreshed (what happens on a read?)
For comparison, CPU clock periods are roughly 0.2-2 ns (5 GHz-500 MHz).

Terminology
Temporal locality: if memory location X is accessed, it is more likely to be re-accessed in the near future than some random location Y. Caches exploit temporal locality by keeping recently referenced data in the cache.
Spatial locality: if memory location X is accessed, locations near X are more likely to be accessed in the near future than some random location Y. Caches exploit spatial locality by allocating a whole cache line of data, including data near the referenced location.
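The payoff of spatial locality can be illustrated with a toy simulation; the cache parameters below (two lines of 16-byte blocks, LRU replacement) are made up for the example, not taken from the slides.

```python
from collections import OrderedDict

def simulate(addresses, block_size=16, num_lines=2):
    """Count misses for a tiny fully associative LRU cache (illustrative parameters)."""
    cache = OrderedDict()          # block number -> None, ordered by recency
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)        # hit: update LRU order
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)   # evict the least recently used line
            cache[block] = None
    return misses

sequential = list(range(64))                                  # walk each block end to end
strided = [b * 16 + i for i in range(16) for b in range(4)]   # jump between blocks
print(simulate(sequential))  # 4 misses: one per block, spatial locality pays off
print(simulate(strided))     # 64 misses: every access lands in an evicted block
```

The sequential scan touches each 16-byte block sixteen times in a row, so only the first touch misses; the strided scan cycles through four blocks with only two lines, so every access misses.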

Cache Design 101
Memory pyramid (rough numbers; mileage may vary for the latest hardware):
- Registers: 100s of bytes, part of the pipeline
- L1 cache (several KB): 1-3 cycle access
- L2 cache (1/2-32 MB): 6-15 cycle access; L3 is becoming more common (sometimes VERY large)
- Memory (128 MB - a few GB): 50-300 cycle access
- Disk (many GB): millions of cycles per access!
Caches are USUALLY made of SRAM.

Cache Design Issues
Block placement: where can a block be placed in the higher (faster) memory level?
- Fully associative: anywhere
- Direct-mapped: exactly one place
- Set-associative: one of a small number of places
Block identification: how does the processor find a block if it is in the higher level?
Block replacement: which block should be evicted from the higher level to make room for a new block?
Write strategy: are lower levels updated when a block in the higher level is written?
- Write-through: yes
- Write-back: no; update the lower level only when the block is evicted from the higher level
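Block identification is pure address arithmetic: the offset selects a byte within the block, the index selects a set, and the tag disambiguates which memory block occupies that set. The helper below is an illustrative sketch, not taken from the slides.

```python
def split_address(addr, block_size, num_sets):
    """Split an address into (tag, set index, block offset), the fields a
    cache uses to locate a block. Parameters here are illustrative."""
    offset = addr % block_size
    index = (addr // block_size) % num_sets
    tag = addr // (block_size * num_sets)
    return tag, index, offset

# Direct-mapped cache with 4 sets of 2-byte blocks: address 0b01011 (= 11)
# splits into tag 0b01, index 0b01, offset 0b1.
print(split_address(11, block_size=2, num_sets=4))  # (1, 1, 1)
```

For a fully associative cache, num_sets is 1 and the whole block number is the tag; for a direct-mapped cache, each set holds exactly one line.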

A Simple Fully Associative Cache
Configuration: 2 cache lines, 2-byte blocks, a 3-bit tag field and a valid bit per line. Memory holds 16 bytes (values 100, 110, 120, ..., 250 at addresses 0-15), so addresses are 4 bits: a 3-bit tag plus a 1-bit block offset. The reference stream is:
Ld R1 <- M[1], Ld R2 <- M[5], Ld R3 <- M[1], Ld R3 <- M[4], Ld R2 <- M[0]
Walking through the accesses with LRU replacement:
1. Ld R1 <- M[1] (addr 0001): miss; load block {M[0]=100, M[1]=110}; R1 = 110. (Misses: 1, Hits: 0)
2. Ld R2 <- M[5] (addr 0101): miss; load block {M[4]=140, M[5]=150}; R2 = 150. (Misses: 2, Hits: 0)
3. Ld R3 <- M[1] (addr 0001): hit; R3 = 110. (Misses: 2, Hits: 1)
4. Ld R3 <- M[4] (addr 0100): hit; R3 = 140. (Misses: 2, Hits: 2)
5. Ld R2 <- M[0] (addr 0000): hit; R2 = 100. (Misses: 2, Hits: 3)
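The walkthrough above can be reproduced by a few lines of simulation; this is a minimal sketch of a fully associative LRU cache, not a hardware-accurate model.

```python
def run_trace(trace, block_size=2, num_lines=2):
    """Fully associative LRU cache simulator; returns (misses, hits)."""
    lines = []                       # resident block numbers, LRU first
    misses = hits = 0
    for addr in trace:
        block = addr // block_size   # in a fully associative cache this is the tag
        if block in lines:
            hits += 1
            lines.remove(block)      # will re-append as most recently used
        else:
            misses += 1
            if len(lines) == num_lines:
                lines.pop(0)         # evict the least recently used line
        lines.append(block)
    return misses, hits

print(run_trace([1, 5, 1, 4, 0]))  # (2, 3), matching the slide's final tally
```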

Block Size
How do we decide on the block size? Simulate lots of different block sizes and see which one gives the best performance. Most systems use a block size between 32 bytes and 128 bytes.
Longer blocks reduce overhead by:
- Reducing the number of tags
- Reducing the size of each tag
But beyond some block size, you bring in too much data that you do not use: cache pollution.

Write Strategy
Where should you write the result of a store?
If that memory location is in the cache: send it to the cache. Should we also send it to memory right away (write-through policy), or wait until we kick the block out (write-back policy)?
If it is not in the cache: allocate the line and put it in the cache (write-allocate policy), or write it directly to memory without allocation (no-write-allocate policy)?

Handling Stores (Write-Through)
Same cache (2 lines, 2-byte blocks, LRU), now with a write-through, write-allocate policy. Memory initially holds 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225 at addresses 0-15. The reference stream is:
Ld R1 <- M[1], Ld R2 <- M[7], St R2 -> M[0], St R1 -> M[5], Ld R2 <- M[10]
1. Ld R1 <- M[1]: miss; load block {M[0]=78, M[1]=29}; R1 = 29. (Misses: 1, Hits: 0)
2. Ld R2 <- M[7]: miss; load block {M[6]=162, M[7]=173}; R2 = 173. (Misses: 2, Hits: 0)
3. St R2 -> M[0]: hit; M[0]=173 is updated in the cache and, because of write-through, in memory immediately. (Misses: 2, Hits: 1)
4. St R1 -> M[5]: miss; write-allocate fetches block {M[4]=71, M[5]=150}, evicting the LRU block; M[5]=29 is written to the cache and through to memory. (Misses: 3, Hits: 1)
5. Ld R2 <- M[10]: miss; load block {M[10]=33, M[11]=28}, evicting the LRU block; R2 = 33. (Misses: 4, Hits: 1)

How Many Memory References?
Each miss reads a block (only two bytes in this cache), and each store writes a byte through to memory. For the sequence above: total reads = 8 bytes (4 misses x 2 bytes), total writes = 2 bytes (2 stores x 1 byte). But caches generally miss less than 20% of the time, and usually have much lower miss rates . . . it depends on both the cache and the application!
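The traffic tally above can be checked mechanically; this sketch assumes the same write-through, write-allocate LRU cache as the walkthrough.

```python
def write_through_traffic(trace, block_size=2, num_lines=2):
    """Bytes moved between a write-through, write-allocate LRU cache and
    memory; returns (bytes_read, bytes_written). Illustrative sketch."""
    lines = []                            # resident blocks, LRU first
    bytes_read = bytes_written = 0
    for op, addr in trace:
        block = addr // block_size
        if block in lines:
            lines.remove(block)
        else:
            bytes_read += block_size      # miss: fetch the whole block
            if len(lines) == num_lines:
                lines.pop(0)              # evict; write-through has nothing to flush
        lines.append(block)
        if op == "st":
            bytes_written += 1            # every store goes through to memory

    return bytes_read, bytes_written

trace = [("ld", 1), ("ld", 7), ("st", 0), ("st", 5), ("ld", 10)]
print(write_through_traffic(trace))  # (8, 2): eight bytes read, two written
```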

Write-Through vs. Write-Back
Can we also design the cache NOT to write all stores immediately to memory? Yes: keep the most current copy in the cache, and update memory only when that data is evicted (the write-back policy).
Do we need to write back all evicted lines? No, only blocks that have been stored into (written). Keep a "dirty bit" per line: reset it when the line is allocated, set it when the block is written. If a block is dirty when evicted, write its data back into memory.

Handling Stores (Write-Back)
Same cache, memory contents, and reference stream, but now with a dirty bit per line and a write-back, write-allocate policy:
1. Ld R1 <- M[1]: miss; load block {M[0]=78, M[1]=29}; R1 = 29. (Misses: 1, Hits: 0)
2. Ld R2 <- M[7]: miss; load block {M[6]=162, M[7]=173}; R2 = 173. (Misses: 2, Hits: 0)
3. St R2 -> M[0]: hit; M[0]=173 is updated in the cache only, and the line's dirty bit is set; memory is not touched. (Misses: 2, Hits: 1)
4. St R1 -> M[5]: miss; allocate block {M[4]=71, M[5]=150}, evicting the LRU block (clean, so no write-back); M[5]=29 in the cache, dirty bit set. (Misses: 3, Hits: 1)
5. Ld R2 <- M[10]: miss; the evicted LRU block is dirty, so {M[0]=173, M[1]=29} is written back to memory first; then load {M[10]=33, M[11]=28}; R2 = 33. (Misses: 4, Hits: 1)

How Many Memory References?
Each miss reads a block (two bytes in this cache), and each evicted dirty cache line writes a block back. For the sequence above: total reads = 8 bytes, total writes = 4 bytes once the remaining dirty line is finally evicted. So which should you choose, write-back or write-through?
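The write-back tally can be checked the same way; this sketch counts a final flush of still-resident dirty lines, matching the slide's "after final eviction" accounting.

```python
def write_back_traffic(trace, block_size=2, num_lines=2):
    """Bytes moved for a write-back, write-allocate LRU cache; returns
    (bytes_read, bytes_written), including a final flush of dirty lines."""
    lines = []                                # entries [block, dirty], LRU first
    bytes_read = bytes_written = 0
    for op, addr in trace:
        block = addr // block_size
        entry = next((e for e in lines if e[0] == block), None)
        if entry:
            lines.remove(entry)               # will re-append as most recent
        else:
            bytes_read += block_size          # miss: fetch the whole block
            if len(lines) == num_lines:
                victim = lines.pop(0)
                if victim[1]:                 # dirty victim: write it back
                    bytes_written += block_size
            entry = [block, False]
        if op == "st":
            entry[1] = True                   # mark the line dirty
        lines.append(entry)
    bytes_written += sum(block_size for e in lines if e[1])  # final flush
    return bytes_read, bytes_written

trace = [("ld", 1), ("ld", 7), ("st", 0), ("st", 5), ("ld", 10)]
print(write_back_traffic(trace))  # (8, 4): same reads, writes deferred to evictions
```

On this short trace write-back writes more bytes than write-through, but each store that hits a dirty line costs nothing extra, so write-back wins when stores cluster on the same blocks.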

Direct-Mapped Cache
Example: a 5-bit memory address (01011) and a cache with four lines of 2-byte blocks. The address splits into a 1-bit block offset, a 2-bit line index, and a 2-bit tag. The line index selects exactly one cache line; the stored tag identifies which memory block currently occupies it.
Miss classification:
- Compulsory miss: first reference to a memory block
- Capacity miss: the working set doesn't fit in the cache
- Conflict miss: the working set maps to the same cache line

Two-Way Set-Associative Cache
Example: a 5-bit memory address (01101) and a cache with two sets of two lines each (2-byte blocks). The block offset is unchanged, the set index shrinks to 1 bit, and the tag grows to 3 bits; a block may reside in either line of its set.
Rule of thumb: increasing associativity decreases conflict misses. A 2-way set-associative cache has about the same hit rate as a direct-mapped cache twice the size.

Sources of cache misses Cold misses: the first time processor accesses a line, there will be a cache miss also known as compulsory misses Capacity misses: if number of distinct cache lines accessed between two references to the same line is greater than the capacity of the cache and the second reference is a miss, it is called a capacity miss Conflict misses: misses causes by evictions of line because of associativity conflicts cannot occur in fully associative caches Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005

Programming for caches How do we reduce the number of cache misses? How do we reduce cold misses? How do we reduce capacity misses? How do we reduce conflict misses? How do we reduce the impact of cache misses on overall performance? Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005

Effects of Varying Cache Parameters
Total cache size = block size x number of sets x associativity. Increasing it:
- Positives: should decrease the miss rate
- Negatives: may increase hit time; probably increases area requirements (how are these related?)

Effects of Varying Cache Parameters
Bigger block size:
- Positives: exploits spatial locality, reducing compulsory misses; reduces tag overhead (bits); reduces transfer overhead (address setup, burst data mode)
- Negatives: fewer blocks for a given size, increasing conflict misses; increases miss transfer time (multi-cycle transfers); wastes bandwidth on non-spatial data

Effects of Varying Cache Parameters
Increasing associativity:
- Positives: reduces conflict misses; low-associativity caches can have pathological behavior (very high miss rates)
- Negatives: increases hit time; more hardware requirements (comparators, muxes, bigger tags); improvements taper off past 4- or 8-way; Belady's anomaly (eventually more associativity = lower performance!)

Effects of Varying Cache Parameters Replacement strategy: (for associative caches) How is the evicted line chosen? LRU: intuitive; difficult to implement with high associativity; worst case performance can occur (N+1 element array) Random: Pseudo-random easy to implement; performance close to LRU for high associativity; usually avoids pathological behavior Optimal: replace block that has its next reference farthest in the future; Belady replacement; hard to implement  Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005

Other Cache Design Decisions
Write policy: how to deal with write misses?
- Write-through / no-allocate. Total traffic: (read misses x block size) + writes. Common for L1 caches backed by an L2 (especially on-chip).
- Write-back / write-allocate. Needs a dirty bit to determine whether cached data differs from memory. Total traffic: (read misses + write misses) x block size + (dirty-block evictions x block size). Common for L2 caches (memory-bandwidth limited).
- Variation: write-validate, i.e., write-allocate without fetch-on-write. Needs a sub-block cache with valid bits for each word/byte.

Other Cache Design Decisions
Write buffering:
- Delay writes until bandwidth is available: put them in a FIFO buffer, and stall on a write only if the buffer is full
- Use the bandwidth for reads first, since reads have the latency problem
- Important for write-through caches, where write traffic is frequent
Write-back buffer:
- Holds evicted (dirty) lines for write-back caches
- Gives reads priority on the L2 or memory bus
- Usually only needs a small buffer

Prefetching
We already prefetch in one sense: loading an entire line on a miss assumes spatial locality. Extending the idea:
- Next-line prefetch: on a miss, bring in the next block in memory as well; very good for the instruction cache (why?)
- Software prefetch: loads to R0 have no data dependency, so they can be used purely to warm the cache
- Aggressive/speculative prefetch is useful for L2, but problematic for L1
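Next-line prefetch on a sequential (instruction-stream-like) access pattern can be illustrated with a deliberately simplified model; capacity effects are ignored here, so this shows only the mechanism, not realistic miss rates.

```python
def misses_with_prefetch(trace, block_size=4, prefetch=False):
    """Miss count for an unbounded cache with optional next-line prefetch.
    Deliberately ignores capacity and eviction; illustrative only."""
    resident = set()
    misses = 0
    for addr in trace:
        block = addr // block_size
        if block not in resident:
            misses += 1
            resident.add(block)
            if prefetch:
                resident.add(block + 1)   # also fetch the next block
    return misses

sequential = list(range(64))              # a straight-line instruction stream
print(misses_with_prefetch(sequential, prefetch=False))  # 16: one per block
print(misses_with_prefetch(sequential, prefetch=True))   # 8: odd blocks arrive early
```

This is why next-line prefetch suits the instruction cache: fetch is overwhelmingly sequential, so the prefetched block is almost always used.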

Calculating the Effects of Latency
Does a cache miss reduce performance? It depends: are critical instructions waiting for the result?

Calculating the Effects of Latency
It also depends on whether critical resources are held up:
- Blocking cache: when a miss occurs, all later references to the cache must wait; this is a resource conflict
- Non-blocking cache: allows later references to access the cache while the miss is being processed; generally there is some limit to how many outstanding misses can be bypassed