Outline
–Cache writes
–DRAM configurations
–Performance
–Associative caches
–Multi-level caches
[Figure: Direct-mapped cache worked example – block size = 4 words, word size = 4 bytes. Each address splits into Tag | Index | Block Offset | Byte Offset fields. The four-reference stream gives H, M, M, H, leaving the cache holding M[64-79], M[16-31], M[32-47], M[48-63].]
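The Tag / Index / Block Offset / Byte Offset split used in the example above can be sketched in code. This is a minimal sketch, not from the slides; the 4-line cache size is an assumption read off the figure's four rows.

```python
# Sketch (assumed parameters): splitting a byte address into the fields a
# direct-mapped cache uses. The slide's cache has 4-word blocks and 4-byte
# words; the 4-line cache size is a hypothetical chosen to match the figure.

WORD_SIZE = 4          # bytes per word
BLOCK_SIZE = 4         # words per block -> 16-byte blocks
NUM_LINES = 4          # assumption: four rows in the figure

BLOCK_BYTES = WORD_SIZE * BLOCK_SIZE   # 16

def split_address(addr):
    """Return (tag, index, block_offset, byte_offset) for a byte address."""
    byte_offset = addr % WORD_SIZE                   # which byte in the word
    block_offset = (addr // WORD_SIZE) % BLOCK_SIZE  # which word in the block
    index = (addr // BLOCK_BYTES) % NUM_LINES        # which cache line
    tag = addr // (BLOCK_BYTES * NUM_LINES)          # identifies the block
    return tag, index, block_offset, byte_offset

# Address 68 lies in block M[64-79], which maps to index 0:
print(split_address(68))   # -> (1, 0, 1, 0)
```

Note how M[64-79] lands at index 0 and M[16-31] at index 1, matching the final cache contents in the example.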
Cache writes
There are multiple copies of the data lying around: L1 cache, L2 cache, DRAM.
–Do we write to all of them?
–Do we wait for the write to complete before the processor can proceed?
Do we write to all of them?
Write-through – write to all levels of the hierarchy.
Write-back – write to the lower level only when the cache line gets evicted.
–Write-back creates inconsistent data: different values for the same item in cache and DRAM.
–Inconsistent data in the highest cache level is referred to as dirty; if all the copies match, they are clean; the old copy below is stale.
[Figure: Write-through – sw $3, 0($5) propagates through L1, the L2 cache, and DRAM.]
[Figure: Write-back – sw $3, 0($5) updates only L1; lower levels are updated when the line is evicted.]
Write-through vs write-back
Which performs the write faster?
–Write-back – it writes only the L1 cache.
Which has faster evictions from a cache?
–Write-through – no write of data involved; just overwrite the tag.
Which causes more bus traffic?
–Write-through – DRAM is written on every store; write-back writes only on eviction.
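A toy simulation (my own, not from the slides) makes the bus-traffic difference concrete. It assumes a tiny direct-mapped, write-allocate cache; the parameters and the store stream are illustrative.

```python
# Hedged sketch: counting lower-level (DRAM) writes under the two policies
# for a toy direct-mapped, write-allocate cache. Parameters are illustrative.

NUM_LINES = 4
BLOCK_BYTES = 16

def dram_writes(stores, write_back):
    """Return how many DRAM writes a sequence of store addresses causes."""
    lines = {}          # index -> (block number, dirty?)
    writes = 0
    for addr in stores:
        block = addr // BLOCK_BYTES
        index = block % NUM_LINES
        if write_back:
            old = lines.get(index)
            if old is not None and old[0] != block and old[1]:
                writes += 1              # evicting a dirty line writes DRAM
            lines[index] = (block, True) # store hits/allocates, marks dirty
        else:
            writes += 1                  # write-through: every store hits DRAM
    return writes

stores = [0, 4, 8, 0, 64, 0]   # repeated stores to block 0, one conflict
print(dram_writes(stores, write_back=False))  # 6: one DRAM write per store
print(dram_writes(stores, write_back=True))   # 2: only the dirty evictions
```

The gap widens as stores cluster on hot blocks, which is exactly why write-back causes less bus traffic.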
Does the processor wait for the write?
Write buffer – an intermediate queue for pending writes.
–Any loads must check the write buffer in parallel with the cache access.
–Buffer values are more recent than cache values.
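The write-buffer behavior described above can be sketched as follows. The helper names are hypothetical, and a real buffer checks the cache in parallel in hardware; here the check is sequential for clarity.

```python
# Sketch of the write-buffer idea: stores are queued so the CPU need not
# stall, and loads must consult the buffer (newest entry first) because
# buffered values are more recent than the cache's copies.
from collections import deque

write_buffer = deque()          # pending (address, value) writes
cache = {}                      # toy cache: address -> value

def store(addr, value):
    write_buffer.append((addr, value))   # CPU proceeds without waiting

def load(addr):
    # The newest buffered write to this address wins over the cache copy.
    for a, v in reversed(write_buffer):
        if a == addr:
            return v
    return cache.get(addr)

def drain_one():
    # The memory system retires one buffered write when the bus is free.
    if write_buffer:
        a, v = write_buffer.popleft()
        cache[a] = v

cache[100] = 7
store(100, 42)
print(load(100))   # 42: the buffered value, not the stale cache value
drain_one()
print(load(100))   # 42: now from the cache itself
```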
Outline Cache writes DRAM configurations Performance Associative caches
Challenge
DRAM is designed for density, not speed.
DRAM is slower than the bus.
We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. Widening anything increases the cost considerably.
Narrow configuration: CPU – Cache – Bus – DRAM (one word wide)
Given:
–1 clock cycle to send the request
–15 cycles / word DRAM latency
–1 cycle / word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/word × 8 words + 1 cycle/word × 8 words = 129 cycles
Wide configuration: CPU – Cache – Bus – DRAM (two words wide)
Given:
–1 clock cycle to send the request
–15 cycles / 2 words DRAM latency
–1 cycle / 2 words bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words × 8 words + 1 cycle/2 words × 8 words = 1 + 60 + 4 = 65 cycles
Interleaved configuration: CPU – Cache – Bus – two DRAM banks
Given:
–1 clock cycle to send the request
–15 cycles / word DRAM latency (the two banks overlap their accesses)
–1 cycle / word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words × 8 words + 1 cycle/word × 8 words = 1 + 60 + 8 = 69 cycles
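The three configurations differ only in how many words move per DRAM access and how many banks overlap their latencies. A hedged sketch of the arithmetic (the parameter names `width` and `banks` are mine):

```python
# Sketch of the three miss-penalty calculations. Assumptions follow the
# slides: 1 request cycle, 15-cycle DRAM access, 1-cycle bus transfer,
# 8-word cache blocks. `width` = words per DRAM access / bus transfer,
# `banks` = number of interleaved DRAM banks.

def miss_penalty(block_words, width=1, banks=1):
    request = 1
    # Interleaved banks overlap their 15-cycle accesses, so the DRAM
    # latency is paid once per (banks * width) words.
    dram = 15 * block_words // (banks * width)
    bus = 1 * block_words // width      # the bus itself is not interleaved
    return request + dram + bus

print(miss_penalty(8))                   # narrow:      1 + 120 + 8 = 129
print(miss_penalty(8, width=2))          # wide:        1 + 60  + 4 = 65
print(miss_penalty(8, banks=2))          # interleaved: 1 + 60  + 8 = 69
```

Interleaving buys most of the wide configuration's DRAM overlap without paying for a wider bus, which is why it lands at 69 rather than 65 cycles.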
Recent DRAM trends
–Fewer, bigger DRAMs
–New bus protocols (RAMBUS)
–Small DRAM caches (page mode)
–SDRAM (synchronous DRAM): one request and a length nets several consecutive responses
Performance
Execution time = (CPU cycles + memory-stall cycles) × clock cycle time
Memory-stall cycles
= (accesses / program) × (misses / access) × (cycles / miss)
= (memory accesses / program) × miss rate × miss penalty
= (instructions / program) × (misses / instruction) × (cycles / miss)
= (instructions / program) × (misses / instruction) × miss penalty
Example 1
–instruction cache miss rate: 2%
–data cache miss rate: 3%
–miss penalty: 50 cycles
–ld/st instructions are 25% of instructions
–CPI with a perfect cache is 2.3
How much faster is the computer with a perfect cache?
Example 1
misses / instr = Iacc/instr × Imr + Dacc/instr × Dmr
= 1 × .02 + .25 × .03 = .02 + .0075 = .0275
Memory cycles = I × .0275 × 50 = 1.375 I
ExecT = (CPU CPI × I + MemCycles) × Clk
= (2.3 × I + 1.375 × I) × Clk = 3.675 IC
speedup = 3.675 IC / 2.3 IC = 1.6
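The arithmetic in Example 1 can be checked mechanically with the slide's numbers:

```python
# Checking Example 1's arithmetic (all values taken from the slides).
i_mr, d_mr = 0.02, 0.03      # instruction / data cache miss rates
ld_st = 0.25                 # loads + stores per instruction
penalty = 50                 # cycles per miss
base_cpi = 2.3               # CPI with a perfect cache

# Every instruction makes one I-fetch; 25% also make a data access.
misses_per_instr = 1 * i_mr + ld_st * d_mr        # 0.0275
stall_cpi = misses_per_instr * penalty            # 1.375
real_cpi = base_cpi + stall_cpi                   # 3.675
print(round(real_cpi / base_cpi, 2))              # perfect-cache speedup: 1.6
```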
Example 2
Double the clock rate from Example 1. What is the ideal speedup when taking the memory system into account?
How long is the miss penalty now? 100 cycles – the DRAM time is unchanged, but each cycle is half as long.
Memory cycles = I × .0275 × 100 = 2.75 I
Exec = (2.3 × I + 2.75 × I) × clk = 5.05 I × (C/2) = 2.525 IC
speedup = old / new = 3.675 IC / 2.525 IC = 1.5
[Figure: Direct-mapped cache worked example – block size = 2 words, word size = 4 bytes; address fields Tag | Index | Block Offset | Byte Offset. All four references in the stream miss (M, M, M, M): conflicting addresses map to the same index and repeatedly evict each other.]
Problem Conflicting addresses cause high miss rates
Solution
Relax the direct mapping: allow each address to map into 2 or 4 locations (a set).
Cache configurations
–Direct-mapped: each address maps to exactly one block.
–2-way set associative: each set has two blocks, and each address maps to one set.
–Fully associative: all addresses map to the same (single) set.
Each entry holds Valid | Tag | Data; a set is the group of blocks searched in parallel.
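How the tag/index split shifts with associativity can be sketched as follows. The total size (8 blocks of 16 bytes) is an illustrative assumption, not from the slides.

```python
# Hedged sketch: for a fixed total cache size, raising the associativity
# shrinks the number of sets, so fewer index bits and more tag bits.
# The 8-block, 16-byte-block cache here is a hypothetical example.

TOTAL_BLOCKS = 8
BLOCK_BYTES = 16

def fields(addr, ways):
    """Return (tag, set_index) for a byte address in a `ways`-way cache."""
    num_sets = TOTAL_BLOCKS // ways     # fully associative -> a single set
    block = addr // BLOCK_BYTES
    return block // num_sets, block % num_sets

addr = 200                 # block number 12
print(fields(addr, 1))     # direct-mapped:     tag 1,  set 4
print(fields(addr, 2))     # 2-way:             tag 3,  set 0
print(fields(addr, 8))     # fully associative: tag 12, set 0
```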
[Figure: 2-way set associative cache worked example – block size = 2 words, word size = 4 bytes; address fields Tag | Index | Block Offset | Byte Offset, where the index now selects a set of two blocks. The four-reference stream gives M, H, H, H: after the first miss, the conflicting blocks coexist in the two ways of the set.]
[Figure: 2-way set associative implementation – the index selects a set, both ways' tags are compared against the address tag in parallel, a MUX picks the matching way's data, and the block offset selects the word; Hit? is asserted when a valid tag matches.]
Performance implications
–Increasing associativity increases the hit rate.
–Increasing associativity increases the access time.
–Increasing associativity has no effect on the miss penalty.
[Figure: worked example running the same reference stream through a direct-mapped cache and a 2-way associative cache, tallying the miss rate of each.]
Which block to replace?
–The block that entered the cache first: FIFO (First In, First Out)
–The block that has gone longest since it was used: LRU (Least Recently Used)
–A random block
Replacement algorithms
–LRU and FIFO are conceptually simple, but difficult to implement at high associativity.
–LRU and FIFO must be approximated with high associativity.
–Random is sometimes better than approximated LRU/FIFO.
–Tradeoff between accuracy and implementation cost.
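The three policies can be sketched for a single set. This is a toy model of my own (a Python list standing in for the hardware's ordering bits), not how real caches implement replacement:

```python
# Toy replacement-policy sketch for one cache set of `ways` blocks.
# The list is kept oldest-first for FIFO and least-recently-used-first
# for LRU; "random" evicts an arbitrary victim.
import random

def access(set_blocks, tag, policy, ways=4):
    """Simulate one reference to this set; return True on a hit."""
    if tag in set_blocks:
        if policy == "LRU":
            set_blocks.remove(tag)      # move to the most-recent position
            set_blocks.append(tag)
        return True                     # FIFO ignores re-use order
    if len(set_blocks) == ways:         # set full: pick a victim
        if policy == "random":
            set_blocks.pop(random.randrange(ways))
        else:                           # LRU and FIFO both evict the head
            set_blocks.pop(0)
    set_blocks.append(tag)
    return False

s = []
for t in [1, 2, 3, 4, 1, 5]:            # 4-way set; 5 forces an eviction
    access(s, t, "LRU")
print(s)   # [3, 4, 1, 5] -- block 2 was least recently used
```

Under FIFO the same stream would instead evict block 1, because the re-reference to it does not refresh its position; that divergence is the whole difference between the two policies.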
[Figure: CPU – L1 – L2 cache – DRAM, with L2 and DRAM bracketed as "memory".] From the L1 cache's perspective, everything below it is memory: L1's miss penalty contains the access of L2, and possibly the access of DRAM!
Multi-level caches
–Base CPI 1.0, 500 MHz clock
–Main memory access: 100 cycles; L2 access: 10 cycles
–L1 miss rate: 5% per instruction; with L2, only 2% of instructions go to DRAM
What is the speedup with the L2 cache? (There is a typo in the book for this example!)
Multi-level caches
CPI = 1 + memory stalls / instruction
CPI_old = 1 + 5% misses/instr × 100 cycles/miss = 1 + 5 = 6 cycles/instr
CPI_new = 1 + L2% × L2 penalty + Mem% × Mem penalty
= 1 + 5% × 10 + 2% × 100 = 3.5
= 1 + (5−2)% × 10 + 2% × (10+100) = 3.5 (equivalent accounting)
Speedup = 6 / 3.5 = 1.7
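The example's arithmetic, including the equivalence of the two accountings, checks out mechanically (values from the slides):

```python
# Checking the multi-level cache example (all values from the slides).
base_cpi = 1.0
l1_miss = 0.05        # L1 misses per instruction
to_dram = 0.02        # fraction of instructions that still go to DRAM
l2_time, dram_time = 10, 100

cpi_old = base_cpi + l1_miss * dram_time                       # 6.0: no L2
cpi_new = base_cpi + l1_miss * l2_time + to_dram * dram_time   # 3.5

# Equivalent accounting: (5-2)% served by L2 alone, 2% pay L2 + DRAM.
alt = base_cpi + (l1_miss - to_dram) * l2_time + to_dram * (l2_time + dram_time)
assert abs(cpi_new - alt) < 1e-9

print(round(cpi_old / cpi_new, 2))   # speedup with L2: 1.71 (~1.7)
```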
DO GROUPWORK NOW
Summary
Direct-mapped
–simple
–fast access time
–marginal hit rate
Variable block size
–still simple
–fast access time
–higher hit rate by exploiting spatial locality
Summary
Associative caches
–increase the access time
–increase the hit rate
–associativity above 8 has little to no gain
Multi-level caches
–increase the worst-case miss penalty (time is spent accessing another cache level first)
–reduce the average miss penalty (many misses are caught and handled quickly by L2)