Peng Liu liupeng@zju.edu.cn Lecture 11 Cache
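Associative Cache Example The location of a memory block whose address is 12 in a cache with eight blocks varies for direct-mapped, set-associative, and fully associative placement. In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 modulo 4) = 0; the memory block could be in either element of the set. In fully associative placement, memory block 12 can appear in any of the eight cache blocks.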
Associative Cache Example An eight-block cache configured as direct mapped, two-way set associative, four-way set associative, and fully associative. The total size of the cache in blocks is equal to the number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity decreases the number of sets while increasing the number of elements per set. With eight blocks, an eight-way set-associative cache is the same as a fully associative cache.
Associativity Example Compare 4-block caches: direct mapped, 2-way set associative, fully associative. Assume there are small caches, each consisting of four one-word blocks. Block access sequence: 0, 8, 0, 6, 8.
Direct mapped: (0 modulo 4) = 0, (6 modulo 4) = 2, (8 modulo 4) = 0.
Block address  Cache index  Hit/miss  Block 0  Block 1  Block 2  Block 3
0              0            miss      Mem[0]
8              0            miss      Mem[8]
0              0            miss      Mem[0]
6              2            miss      Mem[0]            Mem[6]
8              0            miss      Mem[8]            Mem[6]
All five accesses miss.
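Associativity Example (same access sequence 0, 8, 0, 6, 8; least recently used replacement within a set is assumed)
2-way set associative (two sets, so every block in the sequence maps to set 0):
Block address  Cache index  Hit/miss  Set 0           Set 1
0              0            miss      Mem[0]
8              0            miss      Mem[0]  Mem[8]
0              0            hit       Mem[0]  Mem[8]
6              0            miss      Mem[0]  Mem[6]
8              0            miss      Mem[8]  Mem[6]
Four misses.
Fully associative:
Block address  Hit/miss  Cache content after access
0              miss      Mem[0]
8              miss      Mem[0]  Mem[8]
0              hit       Mem[0]  Mem[8]
6              miss      Mem[0]  Mem[8]  Mem[6]
8              hit       Mem[0]  Mem[8]  Mem[6]
Three misses.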
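The miss counts for the access sequence 0, 8, 0, 6, 8 can be checked with a small simulation. This is a minimal sketch, assuming LRU replacement within each set and one-word blocks; the function name `count_misses` is made up for illustration.

```python
from collections import OrderedDict

def count_misses(block_addresses, num_blocks, ways):
    """Count misses in a cache of `num_blocks` blocks with LRU replacement (assumed)."""
    num_sets = num_blocks // ways
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # set is full: evict the LRU block
            lines[block] = True
    return misses

sequence = [0, 8, 0, 6, 8]
direct_mapped = count_misses(sequence, num_blocks=4, ways=1)
two_way = count_misses(sequence, num_blocks=4, ways=2)
fully_associative = count_misses(sequence, num_blocks=4, ways=4)
print(direct_mapped, two_way, fully_associative)
```

Direct mapped is just the 1-way case and fully associative is the case where the number of ways equals the number of blocks, so one routine covers all three organizations.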
Set Associative Cache Organization The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor.
Tag & Index with Set-Associative Caches Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag or the index? The m least significant bits are the byte select within the block.
Basic idea: The cache contains 2^n / 2^m = 2^(n-m) blocks. Each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks. Cache index: the (n-m-a) bits after the byte select. The same index is used with all cache ways. The remaining upper address bits are the tag.
Observation: For a fixed cache size, the length of the tags increases with the associativity. Associative caches incur more overhead for tags.
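The bit split described on this slide can be sketched in a few lines. The function name `split_address` and the example address and cache sizes are assumptions chosen for illustration; they are not from the slides.

```python
def split_address(addr, n, m, a):
    """Split a byte address for a 2**n-byte cache with 2**m-byte blocks and 2**a ways."""
    index_bits = n - m - a
    offset = addr & ((1 << m) - 1)                 # m least significant bits
    index = (addr >> m) & ((1 << index_bits) - 1)  # next (n-m-a) bits
    tag = addr >> (m + index_bits)                 # everything above the index
    return tag, index, offset

# Hypothetical 1 KB cache (n=10) with 16-byte blocks (m=4).
two_way_fields = split_address(0x1234, n=10, m=4, a=1)   # 5 index bits
four_way_fields = split_address(0x1234, n=10, m=4, a=2)  # 4 index bits
print(two_way_fields, four_way_fields)
```

Doubling the associativity moves one bit from the index into the tag, which is the tag overhead noted in the observation.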
Placement Policy [Figure: memory blocks numbered 0 to 31 and an 8-block cache drawn three ways: fully associative, 2-way set associative (sets 0 to 3), and direct mapped (blocks 0 to 7).]
Block 12 can be placed:
  Fully associative: anywhere
  (2-way) set associative: anywhere in set 0 (12 mod 4)
  Direct mapped: only into block 4 (12 mod 8)
Simplest scheme is to extract bits from the block number to determine the set. More sophisticated schemes will hash the block number: why could that be good/bad?
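Direct-Mapped Cache [Figure: the address is split into a t-bit tag, a k-bit index, and a b-bit block offset. The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block. The stored tag is compared with the address tag to produce HIT, and the offset selects the data word or byte.]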
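2-Way Set-Associative Cache [Figure: the address is split into a t-bit tag, a k-bit index, and a b-bit block offset. The index selects one line (V, tag, data block) in each of the two ways; both stored tags are compared with the address tag in parallel, and a match in either way produces HIT and selects the data word or byte.] Compare latency to direct mapped case?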
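Fully Associative Cache [Figure: the address is split into a t-bit tag and a b-bit block offset, with no index. The address tag is compared with the stored tag of every line in parallel; a match produces HIT and the offset selects the data word or byte.]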
Replacement Methods Which line do you replace on a miss? Direct mapped: easy, you have only one choice; replace the line at the index you need. N-way set associative: need to choose which way to replace. Random (choose one at random). Least Recently Used (LRU) (the one used least recently). True LRU is often difficult to calculate, so people use approximations, which evict a line that has not been used recently but is not necessarily the least recently used.
Replacement Policy In an associative cache, which block from a set should be evicted when the set becomes full? Random. Least Recently Used (LRU): LRU cache state must be updated on every access; true implementation only feasible for small sets (2-way); pseudo-LRU binary tree often used for 4-8 way. First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches. Not Least Recently Used (NLRU): FIFO with exception for most recently used block or blocks; NLRU used in Alpha TLBs. This is a second-order effect. Why? Replacement only happens on misses.
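The pseudo-LRU binary tree mentioned above can be sketched for one 4-way set. This is a minimal sketch of one common encoding (three bits, each naming the half to evict from next); the class name `TreePLRU4` and the bit convention are assumptions, and real designs differ in detail.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for a single 4-way set (illustrative encoding)."""

    def __init__(self):
        # bits[0]: root, bits[1]: ways 0/1, bits[2]: ways 2/3.
        # Each bit names the half (0 = left, 1 = right) to evict from next.
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1              # next victim comes from ways 2/3
            self.bits[1] = 1 - way        # within ways 0/1, prefer the other way
        else:
            self.bits[0] = 0              # next victim comes from ways 0/1
            self.bits[2] = 1 - (way - 2)  # within ways 2/3, prefer the other way

    def victim(self):
        """Follow the bits from the root to the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]

plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
victim_after_four = plru.victim()  # way 0, which matches true LRU here
plru.touch(0)
victim_after_five = plru.victim()  # way 2, although true LRU would pick way 1
print(victim_after_four, victim_after_five)
```

The second result shows the approximation: three bits per set cannot record the full access order, so the victim is a not-recently-used way rather than always the least recently used one.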
Block Size and Spatial Locality Block is the unit of transfer between the cache and memory. [Figure: a 4-word block (b=4) holding Tag, Word0, Word1, Word2, Word3.] Split CPU address: block address (32-b bits), offset (b bits). 2^b = block size, a.k.a. line size (in bytes). Larger block size has distinct hardware advantages: less tag overhead; exploit fast burst transfers from DRAM; exploit fast burst transfers over wide busses. What are the disadvantages of increasing block size? Larger block size will reduce compulsory misses (first miss to a block). Larger blocks may increase conflict misses since the number of blocks is smaller (fewer blocks => more conflicts). Can waste bandwidth.
CPU-Cache Interaction (5-stage pipeline) [Figure: 5-stage pipeline datapath. The PC addresses the primary instruction cache and the ALU output addresses the primary data cache; each cache produces a hit? signal, and the data cache has addr, wdata, rdata, and write-enable (we) ports.] Stall entire CPU on data cache miss. Cache refill data comes from lower levels of the memory hierarchy through memory control.
Improving Cache Performance Average memory access time = Hit time + Miss rate x Miss penalty. To improve performance: reduce the hit time, reduce the miss rate, reduce the miss penalty. What is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages: the biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]
Serial-versus-Parallel Cache and Memory Access a is the HIT RATIO: fraction of references found in the cache. 1 - a is the MISS RATIO: remaining references.
[Figure: Processor, CACHE, and Main Memory connected by Addr and Data paths, drawn once for serial search and once for parallel search.]
Average access time for serial search: tcache + (1 - a) tmem
Average access time for parallel search: a tcache + (1 - a) tmem
Savings are usually small, since tmem >> tcache and the hit ratio a is high. High bandwidth required for memory path. Complexity of handling parallel paths can slow tcache.
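To see why the savings from parallel search are usually small, the two formulas can be evaluated side by side. The hit ratio and latencies below are hypothetical values picked for illustration, not figures from the slides.

```python
def serial_access_time(hit_ratio, t_cache, t_mem):
    """Cache is searched first; memory is accessed only after a miss."""
    return t_cache + (1 - hit_ratio) * t_mem

def parallel_access_time(hit_ratio, t_cache, t_mem):
    """Cache and memory are accessed at the same time."""
    return hit_ratio * t_cache + (1 - hit_ratio) * t_mem

# Assumed values: 95% hit ratio, 1-cycle cache, 100-cycle memory.
serial = serial_access_time(0.95, t_cache=1, t_mem=100)
parallel = parallel_access_time(0.95, t_cache=1, t_mem=100)
saving = serial - parallel   # equals (1 - hit_ratio) * t_cache
print(serial, parallel, saving)
```

The difference between the two is exactly (1 - a) tcache, which is tiny when the hit ratio is high and the cache is fast.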
Causes for Cache Misses Compulsory: first-reference to a block a.k.a. cold start misses - misses that would occur even with infinite cache Capacity: cache is too small to hold all data needed by the program - misses that would occur even under perfect replacement policy Conflict: misses that occur because of collisions due to block-placement strategy - misses that would not occur with full associativity
Effect of Cache Parameters on Performance Larger cache size: reduces capacity and conflict misses; hit time will increase. Higher associativity: reduces conflict misses; may increase hit time. Larger block size: spatial locality reduces compulsory misses and capacity (reload) misses; fewer blocks may increase the conflict miss rate; larger blocks may increase the miss penalty. Requested block first.
Multilevel Caches A memory cannot be large and fast, so use increasing sizes of cache at each level: CPU, L1$, L2$, DRAM.
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction (MPI) = misses in cache / number of instructions
MPI makes it easier to compute overall performance.
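The difference between local and global miss rate is easiest to see with numbers. The access and miss counts below are hypothetical, and the helper name `miss_rates` is made up for this sketch.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses    # local and global coincide for L1
    l2_local = l2_misses / l1_misses      # L2 is only accessed on L1 misses
    l2_global = l2_misses / cpu_accesses  # fraction of all CPU accesses that miss in L2
    return l1_rate, l2_local, l2_global

# Assumed counts: 1000 CPU memory accesses, 40 L1 misses, 20 L2 misses.
l1_rate, l2_local, l2_global = miss_rates(1000, 40, 20)
print(l1_rate, l2_local, l2_global)
```

An L2 local miss rate of 50% looks poor in isolation, yet only 2% of CPU accesses go to DRAM; the global rate is the product of the local rates along the path.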
Multilevel Caches Primary (L1) caches attached to CPU Small, but fast Focusing on hit time rather than hit rate Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Unified instruction and data Focusing on hit rate rather than hit time Main memory services L2 cache misses Some high-end systems include L3 cache
A Typical Memory Hierarchy [Figure: CPU with register file (RF), L1 instruction cache and L1 data cache, unified L2 cache, and multiple memory banks.] Multiported register file (part of CPU). Split instruction & data primary caches (on-chip SRAM). Large unified secondary cache (on-chip SRAM). Multiple interleaved memory banks (off-chip DRAM). Implementation close to the CPU looks like a Harvard machine.
What About Writes? Where do we put the data we want to write? In the cache? In main memory? In both? Caches have different policies for this question Most systems store the data in the cache (why?) Some also store the data in memory as well (why?) Interesting observation Processor does not need to “wait” until the store completes
Cache Write Policies: Major Options Write-through (write data go to cache and memory) Main memory is updated on each cache write Replacing a cache entry is simple (just overwrite new block) Memory write causes significant delay if pipeline must stall Write-back (write data only goes to the cache) Only the cache entry is updated on each cache write so main memory and cache data are inconsistent Add “dirty” bit to the cache entry to indicate whether the data in the cache entry must be committed to memory Replacing a cache entry requires writing the data back to memory before replacing the entry if it is “dirty”
Write Policy Trade-offs Write-through: misses are simpler and cheaper (no write-back to memory); easier to implement; requires buffering to be practical; uses a lot of bandwidth to the next level of memory. Write-back: writes are fast on a hit; multiple writes within a block require only one "writeback" later; efficient block transfer on write back to memory at eviction. (Eviction: replacement of a block in the cache.)
Write Policy Choices Cache hit: write through: write both cache & memory generally higher traffic but simplifies cache coherence write back: write cache only (memory is written only when the entry is evicted) a dirty bit per block can further reduce the traffic Cache miss: no write allocate: only write to main memory write allocate (aka fetch on write): fetch into cache Common combinations: write through and no write allocate write back with write allocate
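The traffic difference between the two common combinations can be illustrated by counting writes that reach memory. This is a minimal sketch, assuming a direct-mapped cache, one store per list entry, and a final flush of dirty blocks; the function name `memory_writes` and the store sequence are made up for illustration. A write-through write carries one word while a write-back carries a whole block, so the counts are transactions, not bytes.

```python
def memory_writes(stores, num_blocks, policy):
    """Count write transactions sent to memory for a list of stored block addresses."""
    if policy == "write-through":          # paired with no-write-allocate
        return len(stores)                 # every store also updates memory
    if policy != "write-back":             # paired with write-allocate
        raise ValueError("unknown policy: " + policy)
    tags = [None] * num_blocks             # block currently held by each line
    dirty = [False] * num_blocks
    writes = 0
    for block in stores:
        line = block % num_blocks
        if tags[line] != block:            # write miss: allocate the line
            if dirty[line]:
                writes += 1                # write back the evicted dirty block
            tags[line] = block
        dirty[line] = True                 # the store only updates the cache
    return writes + sum(dirty)             # flush remaining dirty blocks at the end

stores = [0, 0, 0, 4, 0, 1, 1]
write_through_count = memory_writes(stores, num_blocks=4, policy="write-through")
write_back_count = memory_writes(stores, num_blocks=4, policy="write-back")
print(write_through_count, write_back_count)
```

Repeated stores to the same block are absorbed by the dirty bit under write-back, which is where the traffic reduction comes from.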
Write Buffer to Reduce Read Miss Penalty [Figure: CPU with register file (RF), data cache, write buffer, and unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.] Processor is not stalled on writes, and read misses can go ahead of writes to main memory. Problem: the write buffer may hold the updated value of a location needed by a read miss. Simple scheme: on a read miss, wait for the write buffer to go empty. Faster scheme: check write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, else return the value in the write buffer. Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.
Write Buffers for Write-Through Caches [Figure: Processor, Cache, Write Buffer, Lower Level Memory.] The write buffer holds data awaiting write-through to lower level memory. Q. Why a write buffer? A. So the CPU doesn't stall. Q. Why a buffer, why not just one register? A. Bursts of writes are common. Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer.
Avoiding the Stalls for Write-Through Use write buffer between cache and memory Processor writes data into the cache and the write buffer Memory controller slowly “drains” buffer to memory Write buffer: a first-in-first-out buffer (FIFO) Typically holds a small number of writes Can absorb small bursts as long as the long term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM
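The FIFO write buffer and the read-after-write check can be sketched as a small class. This is an illustrative model only: the class name `WriteBuffer`, its methods, and the use of a dict as "memory" are all assumptions, and timing is not modeled.

```python
from collections import deque

class WriteBuffer:
    """FIFO of pending (address, value) writes between cache and memory."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()             # oldest write at the left

    def push(self, address, value):
        """Queue a store; return False if the buffer is full (CPU must stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, value))
        return True

    def drain_one(self, memory):
        """Memory controller retires the oldest pending write."""
        if self.entries:
            address, value = self.entries.popleft()
            memory[address] = value

    def read(self, address, memory):
        """Read miss: forward the newest matching buffered value, else read memory."""
        for buffered_address, value in reversed(self.entries):
            if buffered_address == address:
                return value
        return memory.get(address)

memory = {0x10: 1}
buffer = WriteBuffer(capacity=2)
accepted_first = buffer.push(0x10, 7)
forwarded = buffer.read(0x10, memory)   # 7 from the buffer, not the stale 1
accepted_second = buffer.push(0x14, 9)
accepted_third = buffer.push(0x18, 3)   # False: buffer full, CPU would stall
buffer.drain_one(memory)                # memory[0x10] becomes 7
```

Checking the buffer on a read is the "faster scheme"; the simple alternative is to call `drain_one` until the buffer is empty before servicing the read.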
Cache Write Policy: Allocation Options What happens on a cache write that misses? It's actually two subquestions. Do you allocate space in the cache for the address? Write-allocate vs. no-write-allocate. Actions: select a cache entry, evict old contents, update tags. Do you fetch the rest of the block contents from memory? Of interest if you do write-allocate. Remember that a store updates up to 1 word from a wider block. Fetch-on-miss vs. no-fetch-on-miss. For no-fetch-on-miss, the cache must remember which words are valid: use fine-grain valid bits in each cache line.
Typical Choices Write-back caches: write-allocate, fetch-on-miss. Write-through caches: write-allocate, fetch-on-miss; write-allocate, no-fetch-on-miss; no-write-allocate, write-around. Modern HW supports multiple policies, selected by the OS at some coarse granularity. Which program patterns match each policy?
Splitting Caches Most processors have separate caches for instructions & data Often noted $I and $D Advantages Extra access port Can customize to specific access patterns Low hit time Disadvantages Capacity utilization Miss rate
Cache Design: Datapath + Control Most design errors come from incorrect specification of state machine behavior! Common bugs: stalls, block replacement, write buffer. [Figure: a control state machine driving the tag and block arrays, with Addr, Din, and Dout connections to the CPU on one side and to lower level memory on the other.]
Cache Controller Example cache characteristics: direct-mapped, write-back, write allocate. Block size: 4 words (16 bytes). Cache size: 16KB (1024 blocks). 32-bit byte addresses. Valid bit and dirty bit per block. Blocking cache: CPU waits until access is complete.
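The address field widths implied by these characteristics follow directly from the sizes. The helper `field_widths` below is a hypothetical name used only for this sketch; it assumes power-of-two sizes.

```python
def field_widths(cache_bytes, block_bytes, address_bits=32, ways=1):
    """Return (tag bits, index bits, offset bits) for power-of-two cache sizes."""
    offset_bits = block_bytes.bit_length() - 1           # log2(block size)
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = num_sets.bit_length() - 1               # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 16KB direct-mapped cache with 16-byte blocks and 32-bit byte addresses.
tag_bits, index_bits, offset_bits = field_widths(16 * 1024, 16)
print(tag_bits, index_bits, offset_bits)
```

With 1024 blocks of 16 bytes, the address splits into an 18-bit tag, a 10-bit index, and a 4-bit block offset.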
Signals between the Processor and the Cache
Finite-state Machine Controllers Use an FSM to sequence control steps. Set of states, transition on each clock edge. State values are binary encoded. Current state stored in a register. Next state = fn (current state, current inputs). Control output signals = fo (current state). Finite-state machine controllers are typically implemented using a block of combinational logic and a register to hold the current state.
Cache Controller FSM Four states of the simple controller. Idle state: waiting for a valid read or write request from the processor. Compare Tag state: testing if hit or miss. If hit, set Cache Ready after the read or write -> Idle state. If miss, update the cache tag; if dirty -> Write-Back state, else -> Allocate state. Write-Back state: writing the 128-bit block to memory; waiting for the ready signal from memory -> Allocate state. Allocate state: fetching the new block from memory.
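The four states can be written down as a next-state function. This is a sketch under stated assumptions: the signal names are invented, and the transition out of Allocate (back to Compare Tag once memory is ready) is not given on the slide and is assumed here.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    COMPARE_TAG = auto()
    WRITE_BACK = auto()
    ALLOCATE = auto()

def next_state(state, cpu_request=False, hit=False, dirty=False, memory_ready=False):
    """Next state = fn(current state, current inputs)."""
    if state is State.IDLE:
        return State.COMPARE_TAG if cpu_request else State.IDLE
    if state is State.COMPARE_TAG:
        if hit:
            return State.IDLE                      # cache ready, request done
        return State.WRITE_BACK if dirty else State.ALLOCATE
    if state is State.WRITE_BACK:
        return State.ALLOCATE if memory_ready else State.WRITE_BACK
    # ALLOCATE: assumed to retry the tag compare once the new block has arrived.
    return State.COMPARE_TAG if memory_ready else State.ALLOCATE

# Walk one miss on a dirty block: request, miss, write-back, refill, retry as hit.
inputs = [
    dict(cpu_request=True),
    dict(hit=False, dirty=True),
    dict(memory_ready=False),
    dict(memory_ready=True),
    dict(memory_ready=True),
    dict(hit=True),
]
state = State.IDLE
trace = []
for signals in inputs:
    state = next_state(state, **signals)
    trace.append(state.name)
print(trace)
```

In hardware the same function is the block of combinational logic, and `state` is the register that holds the current state.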
Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Example cache block read 1 bus cycle for address transfer 15 bus cycles per DRAM access 1 bus cycle per data transfer For 4-word block, 1-word-wide DRAM Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
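The miss penalty arithmetic on this slide generalizes to other block sizes. The function name `miss_penalty` is made up for this sketch; the default cycle counts are the ones given in the example.

```python
def miss_penalty(words_per_block, address_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Bus cycles to read one block from a 1-word-wide DRAM, one word at a time."""
    return address_cycles + words_per_block * (dram_cycles + transfer_cycles)

penalty = miss_penalty(4)                # 1 + 4*15 + 4*1
bandwidth = (4 * 4) / penalty            # 16 bytes per block / bus cycles
print(penalty, round(bandwidth, 2))
```

Every word pays the full DRAM access time in this organization, which is why the per-cycle bandwidth stays low regardless of block size.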
Measuring Performance Memory system is important for performance. Cache access time often determines the overall system clock cycle time since it is often the slowest pipeline stage. Memory stalls are a large contributor to CPI. Stalls are due to instructions & data, reading & writing. Stalls include both cache miss stalls and write buffer stalls.
Memory system & performance:
CPU Time = (CPU Cycles + Memory Stall Cycles) * Cycle Time
MemStallCycles = Read Stall Cycles + Write Stall Cycles
CPI = CPIpipe + AvgMemStallCycles
CPIpipe = 1 + HazardStallCycles
Memory Performance Read stalls are fairly easy to understand:
Read Stall Cycles = Reads/Prog * ReadMissRate * ReadMissPenalty
Write stalls depend upon the write policy.
Write-through: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty) + Write Buffer Stalls
Write-back: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty)
"Write miss penalty" can be complex: it can be partially hidden if the processor can continue executing, and it can include extra time to write back a value we are evicting.
Worst-Case Simplicity Assume that write and read misses cause the same delay In a single-level cache system MissPenalty = latency of DRAM In a multi-level cache system MissPenalty is the latency of L2 cache etc Calculate by considering MissRateL2, MissPenaltyL2 etc Watch out: global vs local miss rate for L2
Simple Cache Performance Example Consider the following: miss rate for instruction access is 5%; miss rate for data access is 8%; data references per instruction are 0.4; CPI with perfect cache is 2; read and write miss penalty is 20 cycles, including possible write buffer stalls. What is the performance of this machine relative to one without misses? Always start by considering execution times (IC*CPI*CCT). But IC and CCT are the same here, so focus on CPI. (IC: Instruction Count. CCT: Clock Cycle Time.)
Performance Solution Find the CPI for the base system without misses:
CPI no misses = CPIperfect = 2
Find the CPI for the system with misses:
Misses/inst = I Cache Misses + D Cache Misses = 0.05 + (0.08*0.4) = 0.082
Memory Stall Cycles = Misses/inst * MissPenalty = 0.082*20 = 1.64 cycles/inst
CPI with misses = CPIperfect + Memory Stall Cycles = 2 + 1.64 = 3.64
Compare the performance:
CPI with misses / CPI no misses = 3.64 / 2 = 1.82, so the machine without misses is 1.82 times faster.
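The same calculation as a short script, so the inputs can be varied. The function name `cpi_with_misses` is made up for this sketch; the numbers are the ones from the example.

```python
def cpi_with_misses(cpi_perfect, inst_miss_rate, data_miss_rate,
                    data_refs_per_inst, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    misses_per_inst = inst_miss_rate + data_miss_rate * data_refs_per_inst
    return cpi_perfect + misses_per_inst * miss_penalty

cpi_perfect = 2
cpi = cpi_with_misses(cpi_perfect, 0.05, 0.08, 0.4, 20)
slowdown = cpi / cpi_perfect   # IC and CCT cancel, so the CPI ratio is the time ratio
print(round(cpi, 2), round(slowdown, 2))
```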
Another Cache Problem Given the following data: base CPI of 1.5; 1 instruction reference per instruction fetch; 0.27 loads/instruction; 0.13 stores/instruction; a 64KB cache with 4-word block size has a miss rate of 1.7%; memory access time = 4 cycles + #words/block. Suppose the cache uses a write-through, write-around write strategy without a write buffer. How much faster would the machine be with a perfect write buffer?
CPUtime = Instruction Count * (CPIbase + CPImemory) * ClockCycleTime
Performance is proportional to CPI = 1.5 + CPImemory
No Write Buffer [Figure: CPU, Cache, Lower Level Memory.]
CPI memory = reads/inst. * miss rate * read miss penalty + writes/inst. * write penalty
read miss penalty = 4 cycles + 4 words * 1 cycle/word = 8 cycles
write penalty = 4 cycles + 1 word * 1 cycle/word = 5 cycles
CPI memory = (1 instruction fetch + 0.27 loads)/inst. * 0.017 * 8 cycles + (0.13 stores)/inst. * 5 cycles
CPI memory = 0.17 cycles/inst. + 0.65 cycles/inst. = 0.82 cycles/inst.
CPI overall = 1.5 cycles/inst. + 0.82 cycles/inst. = 2.32 cycles/inst.
Perfect Write Buffer [Figure: CPU, Cache, write buffer (Wbuff), Lower Level Memory.]
CPI memory = reads/inst. * miss rate * 8 cycle read miss penalty + writes/inst. * (1 - miss rate) * 1 cycle hit penalty
A hit penalty is required because on hits we must access the cache tags during the MEM cycle to determine a hit, then stall the processor for a cycle to update the hit cache block.
CPI memory = 0.17 cycles/inst. + (0.13 stores)/inst. * (1 - 0.017) * 1 cycle
CPI memory = 0.17 cycles/inst. + 0.13 cycles/inst. = 0.30 cycles/inst.
CPI overall = 1.5 cycles/inst. + 0.30 cycles/inst. = 1.80 cycles/inst.
Perfect Write Buffer + Cache Write Buffer [Figure: CPU, Cache with cache write buffer (CWB), write buffer (WBuff), Lower Level Memory.]
CPI memory = reads/inst. * miss rate * 8 cycle read miss penalty
Avoid a hit penalty on writes by adding a one-entry write buffer to the cache itself: write the last store hit to the data array during the next store's MEM cycle. Hazard: on loads, must check the CWB along with the cache!
CPI memory = 0.17 cycles/inst.
CPI overall = 1.5 cycles/inst. + 0.17 cycles/inst. = 1.67 cycles/inst.
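The three configurations of this problem can be compared in one short script, which also answers the original question of how much faster the machine is with a perfect write buffer. The variable names are made up for this sketch; all numeric inputs come from the problem statement.

```python
BASE_CPI = 1.5
LOADS_PER_INST = 0.27
STORES_PER_INST = 0.13
MISS_RATE = 0.017
READ_MISS_PENALTY = 4 + 4   # 4 cycles + 4 words/block at 1 cycle/word
WRITE_PENALTY = 4 + 1       # 4 cycles + 1 word at 1 cycle/word
HIT_PENALTY = 1             # extra cycle to update a hit block

# Reads are instruction fetches plus loads; only read misses stall.
read_stalls = (1 + LOADS_PER_INST) * MISS_RATE * READ_MISS_PENALTY

cpi_no_buffer = BASE_CPI + read_stalls + STORES_PER_INST * WRITE_PENALTY
cpi_write_buffer = BASE_CPI + read_stalls + STORES_PER_INST * (1 - MISS_RATE) * HIT_PENALTY
cpi_cache_write_buffer = BASE_CPI + read_stalls

speedup = cpi_no_buffer / cpi_write_buffer
print(round(cpi_no_buffer, 2), round(cpi_write_buffer, 2),
      round(cpi_cache_write_buffer, 2), round(speedup, 2))
```

Since instruction count and clock cycle time are unchanged, the CPI ratio is the speedup: the perfect write buffer makes the machine about 1.29 times faster than the design with no write buffer.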
Acknowledgements These slides contain material from courses: UCB CS152 Stanford EE108B