Memory Hierarchies
Adapted from slides by Sally McKee, Cornell University
Copyright Gary S. Tyson 2003, Copyright Sally A. McKee 2005
SRAM vs. DRAM
SRAM (static random access memory)
Faster than DRAM
Each storage cell is larger, so smaller capacity for the same area
2-10 ns access time
DRAM (dynamic random access memory)
Each storage cell is tiny (capacitance on a wire)
Can get 2 Gb chips today
50-70 ns access time
Leaky: data must be periodically refreshed
What happens on a read?
For comparison, CPU clock periods are ~0.2-2 ns (5 GHz-500 MHz)
Terminology
Temporal locality: if memory location X is accessed, it is more likely to be re-accessed in the near future than some random location Y
Caches exploit temporal locality by placing a referenced memory element into the cache
Spatial locality: if memory location X is accessed, locations near X are more likely to be accessed in the near future than some random location Y
Caches exploit spatial locality by allocating a whole cache line of data (including data near the referenced location) — see the loop sketch below
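As a rough illustration (not from the original slides), the C loops below show both kinds of locality; the function name, array, and sizes are made up for the example.

#include <stddef.h>

/* Sums an array twice to illustrate locality.
 * Spatial locality: a[i] and a[i+1] usually sit in the same cache line,
 * so the sequential sweep hits in the cache most of the time.
 * Temporal locality: the accumulator `sum` and loop counter are reused
 * every iteration, and the second sweep re-touches data that may still
 * be cached from the first. */
long sum_twice(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)   /* first sweep: spatial locality */
        sum += a[i];
    for (size_t i = 0; i < n; i++)   /* second sweep: temporal locality if the array fits in cache */
        sum += a[i];
    return sum;
}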
Cache Design 101
Memory pyramid (rough numbers: mileage may vary for the latest/greatest):
Registers: 100s of bytes, part of the pipeline
L1 cache (several KB): 1-3 cycle access
L2 cache (0.5-32 MB): 6-15 cycle access (an L3 is becoming more common, sometimes VERY LARGE)
Memory (128 MB - a few GB): cycle access
Disk (many GB): millions of cycles access!
Caches are USUALLY made of SRAM
Cache design issues
Block placement: where can a block be placed in the higher memory level?
Fully associative: anywhere
Direct mapped: exactly one place
Set associative: some small number of places
Block identification: how does the processor find the block if it is present in the higher memory level? (see the address sketch after this list)
Block replacement: which block should be evicted from the higher level to make room for a new block?
Write strategy: are lower levels updated when a block in the higher level is written?
Write-through: yes
Write-back: no; update the lower level only when the block is evicted from the higher level
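As a hedged sketch of block identification (not from the slides), the helper below splits an address into tag, set index, and block offset; the names and the power-of-two assumption are mine.

#include <stdint.h>

/* Split an address into (tag, index, offset) for a cache with
 * `block_bytes` bytes per block and `num_sets` sets.
 * Assumes both are powers of two. */
struct addr_parts { uint64_t tag, index, offset; };

static struct addr_parts split_address(uint64_t addr,
                                       uint64_t block_bytes,
                                       uint64_t num_sets)
{
    struct addr_parts p;
    p.offset = addr % block_bytes;              /* byte within the block        */
    p.index  = (addr / block_bytes) % num_sets; /* which set to look in         */
    p.tag    = addr / block_bytes / num_sets;   /* compared against stored tags */
    return p;
}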
A Simple Fully Associative Cache
[Figure: processor, cache, and memory. The cache has 2 cache lines, a 3-bit tag field, and 2-byte blocks; each line has a valid bit, a tag, and data. The processor will issue five loads (Ld R1, Ld R2, Ld R3, Ld R3, Ld R2) into registers R0-R3; memory holds the values 100-250 at addresses 0-15. Question: how many address bits?]
[Figure: 1st access, before — the cache is empty (both lines invalid); no misses or hits yet]
[Figure: 1st access, address 0001 — miss; the 2-byte block holding 100 and 110 is brought into the first line, and the block offset selects 110 for the load; Misses: 1]
[Figure: 2nd access, before — the first line still holds the block with 100 and 110; Misses: 1]
[Figure: 2nd access, address 0101 — miss; the block holding 140 and 150 (tag 2) is brought into the second line, and the load returns 150; Misses: 2]
[Figure: 3rd access, address 0001 — both lines are now valid (blocks 100/110 and 140/150); Misses: 2]
[Figure: 3rd access result — hit in the first line; the load returns 110; the miss count stays at 2]
[Figure: 4th access, address 0100 — both tags are checked; Misses: 2 so far]
[Figure: 4th access result — hit in the second line; the load returns 140; Misses: 2]
[Figure: 5th access, address 0000; Misses: 2 so far]
[Figure: 5th access result — hit in the first line; the load returns 100; final tally: 2 misses and 3 hits over the five accesses]
Block size
How do we decide on the block size? Simulate lots of different block sizes and see which one gives the best performance
Most systems use a block size between 32 bytes and 128 bytes
Longer blocks reduce the overhead by
Reducing the number of tags
Reducing the size of each tag
But beyond some block size you bring in too much data that you do not use: cache pollution
(A quick tag-overhead calculation follows.)
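To make the tag-overhead point concrete, here is a rough back-of-the-envelope calculation in C (my own illustration, not from the slides; the 32 KB cache and 32-bit addresses are assumed values): for a fixed cache size, doubling the block size halves the number of tags and shaves a bit off each tag.

#include <stdio.h>

/* Integer log2 for powers of two. */
static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    const unsigned cache_bytes = 32 * 1024;  /* assumed: 32 KB direct-mapped cache */
    const unsigned addr_bits   = 32;         /* assumed: 32-bit addresses          */
    for (unsigned block_bytes = 16; block_bytes <= 256; block_bytes *= 2) {
        unsigned num_blocks = cache_bytes / block_bytes;
        unsigned tag_bits   = addr_bits - log2i(num_blocks) - log2i(block_bytes);
        printf("block %3u B: %5u tags x %2u bits = %6u tag bits total\n",
               block_bytes, num_blocks, tag_bits, num_blocks * tag_bits);
    }
    return 0;
}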
Write strategy
Where should you write the result of a store?
If that memory location is in the cache:
Send it to the cache
Should we also send it to memory right away? (write-through policy)
Or wait until we kick the block out? (write-back policy)
If it is not in the cache:
Allocate the line, i.e. put it in the cache? (write-allocate policy)
Or write it directly to memory without allocating? (no-write-allocate policy)
Handling Stores (Write-Through)
[Figure: write-through example, initial state — cache empty, write-allocate policy assumed; memory addresses 0-15 hold 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225; the reference sequence is Ld R1, Ld R2, St R2, St R1, Ld R2 M[10]; Misses: 0, Hits: 0]
[Figure: write-through, REF 1, before — cache still empty; Misses: 0]
[Figure: write-through, REF 1 — the load misses; the block holding 78 and 29 is allocated in the first line and the load returns 29; Misses: 1]
[Figure: write-through, REF 2, before — first line holds 78 and 29; Misses: 1]
[Figure: write-through, REF 2 — the load misses; the block holding 162 and 173 (tag 3) is allocated in the second line and the load returns 173; Misses: 2]
[Figure: write-through, REF 3, before — both lines valid; Misses: 2]
[Figure: write-through, REF 3 — the store hits in the cache; because the policy is write-through, the stored value 173 is written to both the cache line and memory (location 0 now reads 173); Misses: 2]
[Figure: write-through, REF 4, before — memory location 0 already updated to 173; Misses: 2]
[Figure: write-through, REF 4 — the store misses; with write-allocate the target block (71 and 150) is brought into the cache, the stored value 29 updates the cached copy, and write-through updates memory as well; Misses: 3]
[Figure: write-through, REF 5, before — cache and memory reflect both stores; Misses: 3]
[Figure: write-through, REF 5 — Ld R2 M[10] misses; the block holding 33 and 28 (tag 5) is allocated, replacing a line, and the load returns 33; final tally: Misses: 4]
How Many Memory References?
Each miss reads a block (only two bytes in this cache)
Each store writes a byte
Total reads: eight bytes (4 misses x 2-byte blocks)
Total writes: two bytes (one byte per store, written through)
But caches generally miss < 20%, usually with much lower miss rates
It depends on both the cache and the application!
Write-Through vs. Write-Back
Can we also design the cache NOT to write all stores immediately to memory?
Keep the most current copy in the cache, and update memory only when that data is evicted (write-back policy)
Do we need to write back all evicted lines?
No, only blocks that have been stored into (written)
Keep a "dirty bit": reset it when the line is allocated, set it when the block is written
If a block is "dirty" when evicted, write its data back into memory
(A dirty-bit bookkeeping sketch follows.)
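The sketch below is my own illustration (not from the slides) of the dirty-bit bookkeeping for one cache line under a write-back, write-allocate policy; the tiny 16-byte backing store and the helper names are assumptions made to keep the example self-contained.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 2                 /* matches the 2-byte blocks in the example */

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

static uint8_t main_memory[16];       /* toy backing store, as in the slides */

static void mem_write_block(uint32_t tag, const uint8_t *src) {
    for (int i = 0; i < BLOCK_BYTES; i++) main_memory[tag * BLOCK_BYTES + i] = src[i];
}
static void mem_read_block(uint32_t tag, uint8_t *dst) {
    for (int i = 0; i < BLOCK_BYTES; i++) dst[i] = main_memory[tag * BLOCK_BYTES + i];
}

/* Replace `l` with the block identified by `new_tag`, writing the old
 * contents back to memory first if and only if the line is dirty. */
static void refill(struct line *l, uint32_t new_tag)
{
    if (l->valid && l->dirty)
        mem_write_block(l->tag, l->data);   /* write-back on eviction    */
    mem_read_block(new_tag, l->data);       /* fetch the new block       */
    l->tag = new_tag;
    l->valid = true;
    l->dirty = false;                       /* reset dirty on allocation */
}

/* A store under write-allocate: refill on a miss, then mark the line dirty. */
static void store_byte(struct line *l, uint32_t tag, unsigned offset, uint8_t v)
{
    if (!l->valid || l->tag != tag)
        refill(l, tag);
    l->data[offset] = v;
    l->dirty = true;                        /* memory is now stale       */
}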
Handling Stores (Write-Back)
[Figure: write-back example, initial state — same memory contents and reference sequence as the write-through example, but each cache line now also has a dirty bit (d); Misses: 0, Hits: 0]
[Figure: write-back, REF 1, before — cache empty; Misses: 0]
[Figure: write-back, REF 1 — the load misses; the block holding 78 and 29 is allocated (clean) and the load returns 29; Misses: 1]
[Figure: write-back, REF 2, before — first line holds 78 and 29; Misses: 1]
[Figure: write-back, REF 2 — the load misses; the block holding 162 and 173 (tag 3) is allocated and the load returns 173; Misses: 2]
[Figure: write-back, REF 3, before — both lines valid and clean; Misses: 2]
[Figure: write-back, REF 3 — the store hits; the cached copy is updated to 173 and the line's dirty bit is set, but memory is NOT updated (location 0 still reads 78); Misses: 2]
[Figure: write-back, REF 4, before — first line dirty; memory unchanged; Misses: 2]
[Figure: write-back, REF 4 — the store misses; the target block is brought into the cache, updated, and marked dirty; no memory write happens yet; Misses: 3]
[Figure: write-back, REF 5, before — both lines dirty; memory still holds the original values; Misses: 3]
[Figure: write-back, REF 5 — Ld R2 M[10] misses; the dirty victim line must be written back to memory before the new block is fetched; Misses: 4]
[Figure: write-back, REF 5, final state — one line now holds the block with 33 and 28 (tag 5, clean) and the load returns 33; the other line is still dirty; final tally: Misses: 4]
How many memory references?
Each miss reads a block: two bytes in this cache
Each evicted dirty cache line writes back a whole block
Total reads: eight bytes (4 misses x 2-byte blocks)
Total writes: four bytes, counting the final evictions (2 dirty blocks x 2 bytes)
So which do you choose, write-back or write-through?
Direct-Mapped Cache
[Figure: a direct-mapped cache looked up with memory address 01011. The address is split into a 2-bit tag, a 2-bit line index, and a 1-bit block offset; each line has a valid bit, a dirty bit, a tag, and a 2-byte data block.]
Compulsory miss: first reference to a memory block
Capacity miss: the working set doesn't fit in the cache
Conflict miss: the working set maps to the same cache line
(A lookup sketch using this 2/2/1 split follows.)
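Here is a rough sketch (mine, not from the slides) of a lookup using the slide's 5-bit addresses: 1 offset bit, 2 index bits, 2 tag bits; the structure and function names are illustrative only.

#include <stdbool.h>
#include <stdint.h>

/* 4 lines, 2 bytes per line: matches the 2-bit index / 1-bit offset split. */
struct dm_line { bool valid; uint8_t tag; uint8_t data[2]; };
static struct dm_line dm_cache[4];

/* Returns true on a hit and writes the requested byte into *out. */
static bool dm_lookup(uint8_t addr5, uint8_t *out)
{
    uint8_t offset = addr5 & 0x1;         /* bit 0: byte within the block */
    uint8_t index  = (addr5 >> 1) & 0x3;  /* bits 2-1: which cache line   */
    uint8_t tag    = (addr5 >> 3) & 0x3;  /* bits 4-3: tag to compare     */

    struct dm_line *l = &dm_cache[index];
    if (l->valid && l->tag == tag) {      /* hit: line valid, tag matches */
        *out = l->data[offset];
        return true;
    }
    return false;                         /* miss: caller must refill     */
}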
Two-Way Set Associative Cache
[Figure: a two-way set-associative cache looked up with memory address 01101. The block offset is unchanged (1 bit), the set index shrinks to 1 bit, and the tag grows to 3 bits; the selected set holds two lines, and both tags are compared.]
Rule of thumb: increasing associativity decreases conflict misses. A 2-way associative cache has about the same hit rate as a direct-mapped cache twice the size.
(A lookup sketch follows.)
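A matching sketch (again my own illustration) of the two-way lookup with the 3-bit tag, 1-bit set index, and 1-bit offset split from the slide; the LRU update on a hit is omitted.

#include <stdbool.h>
#include <stdint.h>

struct sa_line { bool valid; uint8_t tag; uint8_t data[2]; };
static struct sa_line sa_sets[2][2];     /* 2 sets x 2 ways = 4 lines, as before */

static bool sa_lookup(uint8_t addr5, uint8_t *out)
{
    uint8_t offset = addr5 & 0x1;        /* bit 0: byte within the block */
    uint8_t set    = (addr5 >> 1) & 0x1; /* bit 1: 1-bit set index       */
    uint8_t tag    = (addr5 >> 2) & 0x7; /* bits 4-2: 3-bit tag          */

    for (int way = 0; way < 2; way++) {  /* check both ways in the set   */
        struct sa_line *l = &sa_sets[set][way];
        if (l->valid && l->tag == tag) {
            *out = l->data[offset];
            return true;                 /* hit; LRU update omitted      */
        }
    }
    return false;                        /* miss in both ways            */
}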
Sources of cache misses
Cold misses: the first time the processor accesses a line, there will be a cache miss; also known as compulsory misses
Capacity misses: if the number of distinct cache lines accessed between two references to the same line is greater than the capacity of the cache, and the second reference misses, it is called a capacity miss
Conflict misses: misses caused by lines being evicted because of associativity conflicts; these cannot occur in fully associative caches (see the example below)
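As an illustration of conflict misses (my own example, not from the slides): in a direct-mapped cache, two arrays placed a multiple of the cache size apart map every pair a[i], b[i] to the same line, so they keep evicting each other even though the cache is far from full. The 32 KB cache size is an assumed value.

#include <stdlib.h>

#define CACHE_SIZE (32 * 1024)   /* assumed 32 KB direct-mapped cache */

/* a[i] and b[i] are placed exactly one cache size apart, so in a
 * direct-mapped cache they share the same index bits and keep evicting
 * each other (conflict misses), even though only part of the cache is in use.
 * Caller must keep n <= CACHE_SIZE / sizeof(int). */
long dot_with_conflicts(size_t n)
{
    char *buf = calloc(2, CACHE_SIZE);
    if (!buf) return 0;
    int *a = (int *)buf;                 /* offset 0 within the buffer         */
    int *b = (int *)(buf + CACHE_SIZE);  /* offset CACHE_SIZE: same index bits */

    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * b[i];        /* a[i] and b[i] conflict every step  */
    free(buf);
    return sum;
}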
Programming for caches
How do we reduce the number of cache misses?
How do we reduce cold misses?
How do we reduce capacity misses?
How do we reduce conflict misses?
How do we reduce the impact of cache misses on overall performance?
Effects of Varying Cache Parameters
Total cache size = block size x number of sets x associativity
Positives: should decrease the miss rate
Negatives: may increase hit time; probably increases area requirements (how are these related?)
Effects of Varying Cache Parameters
Bigger block size
Positives:
Exploits spatial locality; reduces compulsory misses
Reduces tag overhead (bits)
Reduces transfer overhead (address, burst data mode)
Negatives:
Fewer blocks for a given size; increases conflict misses
Increases miss transfer time (multi-cycle transfers)
Wastes bandwidth for non-spatial data
Effects of Varying Cache Parameters
Increasing associativity
Positives:
Reduces conflict misses
Low-associativity caches can have pathological behavior (very high miss rates)
Negatives:
Increased hit time
More hardware (comparators, muxes, bigger tags)
Improvements diminish past 4- or 8-way
Belady's anomaly (eventually more associativity = lower performance!)
Effects of Varying Cache Parameters
Replacement strategy (for associative caches): how is the evicted line chosen?
LRU: intuitive; difficult to implement with high associativity; worst-case behavior can occur (e.g. cycling through an (N+1)-element working set in an N-way set)
Random: pseudo-random is easy to implement; performance is close to LRU for high associativity; usually avoids pathological behavior
Optimal: replace the block whose next reference is farthest in the future (Belady replacement); hard to implement
(An LRU bookkeeping sketch follows.)
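A minimal sketch of LRU bookkeeping with per-line age counters (my illustration; real hardware often uses cheaper pseudo-LRU bits instead):

#include <stdint.h>

#define WAYS 4   /* assumed 4-way set for the example */

/* Per-set LRU state: age[w] == 0 means most recently used,
 * age[w] == WAYS-1 means least recently used (the victim). */
struct lru_set { uint8_t age[WAYS]; };

/* Record a hit or refill on `way`: every line younger than it ages by one. */
static void lru_touch(struct lru_set *s, int way)
{
    uint8_t old = s->age[way];
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < old)
            s->age[w]++;
    s->age[way] = 0;
}

/* Pick the victim: the way with the largest age. */
static int lru_victim(const struct lru_set *s)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->age[w] > s->age[victim])
            victim = w;
    return victim;
}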
Other Cache Design Decisions
Write Policy: how do we deal with write misses?
Write-through / no-allocate
Total traffic? read misses x block size + writes
Common for L1 caches backed by an L2 (especially on-chip)
Write-back / write-allocate
Needs a dirty bit to determine whether cache data differs from memory
Total traffic? (read misses + write misses) x block size + dirty-block evictions x block size
Common for L2 caches (memory-bandwidth limited)
Variation: write-validate
Write-allocate without fetch-on-write
Needs a sub-block cache with valid bits for each word/byte
(The two traffic formulas are written out in the helper below.)
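The two traffic formulas above, written out as small C helpers (my own framing; the counts would come from whatever simulator or counters you have):

/* Bytes of traffic between this cache and the next level, per the formulas
 * above. Miss and eviction arguments are event counts; `writes` is the
 * store traffic written through (in bytes); block_bytes is the line size. */
static unsigned long wt_no_alloc_traffic(unsigned long read_misses,
                                         unsigned long writes,
                                         unsigned long block_bytes)
{
    /* Write-through / no-allocate: read misses fetch a block, every write goes through. */
    return read_misses * block_bytes + writes;
}

static unsigned long wb_alloc_traffic(unsigned long read_misses,
                                      unsigned long write_misses,
                                      unsigned long dirty_evictions,
                                      unsigned long block_bytes)
{
    /* Write-back / write-allocate: all misses fetch a block, each dirty
     * eviction writes one back. */
    return (read_misses + write_misses) * block_bytes
         + dirty_evictions * block_bytes;
}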
Other Cache Design Decisions
Write Buffering
Delay writes until bandwidth is available
Put them in a FIFO buffer
Only stall on a write if the buffer is full
Use bandwidth for reads first (since they have latency problems)
Important for write-through caches, since write traffic is frequent
Write-back buffer
Holds evicted (dirty) lines for write-back caches
Gives reads priority on the L2 or memory bus
Usually only needs a small buffer
Prefetching
We already prefetch in a sense: loading an entire line assumes spatial locality
Extend this idea:
Next-line prefetch: on a miss, bring in the next block in memory as well
Very good for the I-cache (why?)
Software prefetch: loads to R0 have no data dependency (see the sketch below)
Aggressive/speculative prefetch is useful for L2
Speculative prefetch is problematic for L1
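On the software-prefetch point, a hedged example in C using the GCC/Clang `__builtin_prefetch` builtin (the 16-element prefetch distance is a made-up value that would normally be tuned per machine):

#include <stddef.h>

#define PREFETCH_DIST 16   /* assumed prefetch distance, in elements */

/* Sum an array while asking the hardware to start fetching data we will
 * need a few iterations from now, overlapping the miss latency with work. */
long sum_with_prefetch(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1); /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}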
Calculating the Effects of Latency
Does a cache miss reduce performance? It depends on whether critical instructions are waiting for the result.
Calculating the Effects of Latency
It also depends on whether critical resources are held up:
Blocking cache: when a miss occurs, all later references to the cache must wait; this is a resource conflict
Non-blocking cache: allows later references to access the cache while the miss is being processed; generally there is some limit to how many outstanding misses can be bypassed