ENGS 116 Lecture 12: Caches
Vincent H. Berk
Wednesday, October 29th, 2008
Reading for Friday: Sections C.1 – C.3
Article for Friday: Jouppi
Reading for Monday: Sections C.4 – C.7
Who Cares about the Memory Hierarchy?
So far we have discussed only the processor: CPU cost/performance, ISA, pipelined execution, ILP.
– 1980: no cache in microprocessors
– 1995: 2-level cache, 60% of transistors on the Alpha
– 2008: IBM experimenting with main memory on the die
[Figure: the CPU–DRAM performance gap, growing over time]
The Motivation for Caches
Motivation:
– Large memories (DRAM) are slow
– Small memories (SRAM) are fast
Make the average access time small by servicing most accesses from a small, fast memory.
Reduce the bandwidth required of the large memory.
[Figure: Processor → Cache → Main Memory; the cache and main memory together form the memory system]
Principle of Locality of Reference
Programs do not access their data or code all at once or with equal probability.
– Rule of thumb: a program spends 90% of its execution time in only 10% of its code.
Programs access a small portion of the address space at any one time.
Programs tend to reuse data and instructions that they have recently used.
Implication of locality: we can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past.
Memory System
[Figure: the illusion presented to the processor is a single large, fast memory; the reality is a hierarchy of memories between the processor and main memory]
General Principles
Locality
– Temporal locality: a referenced item is likely to be referenced again soon
– Spatial locality: items near a referenced item are likely to be referenced soon
Locality + "smaller hardware is faster" ⇒ memory hierarchy
– Levels: each level is smaller, faster, and more expensive per byte than the level below
– Inclusive: data found in the top level is also found in the levels below
Definitions
– Upper level is the one closer to the processor
– Block: the minimum, address-aligned unit that fits in the cache
– Address = block frame address + block offset
– Hit time: time to access the upper level, including the time to determine hit or miss
Cache Measures
Hit rate: fraction of accesses found in that level
– Usually so high that we talk about the miss rate instead
– Miss-rate fallacy: the miss rate alone does not determine average memory performance; the miss penalty matters just as much
Average memory-access time (AMAT) = Hit time + Miss rate × Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including the time to deliver the data and restart the CPU
– Access time: time to reach the lower level = f(lower-level latency)
– Transfer time: time to transfer the block = f(bandwidth between upper & lower levels, block size)
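To make the formula concrete, here is a minimal C sketch; the hit time, miss rate, and miss penalty values below are invented for illustration:

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty (all times in cycles) */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Assumed values: 1-cycle hit, 5% miss rate, 100-cycle penalty */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0)); /* 6.0 */
        return 0;
    }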
Block Size vs. Cache Measures
Increasing the block size generally increases the miss penalty: the transfer time grows on top of the fixed access time. The miss rate first falls (spatial locality) and then rises again (fewer blocks fit in the cache), so the average memory access time has a minimum at some intermediate block size.
[Figure: three plots vs. block size — miss penalty (access time + transfer time) rising, miss rate U-shaped, average memory access time U-shaped]
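The U-shape of the average-access-time curve can be reproduced with a toy model in C; every constant below is invented for illustration, not a measurement:

    #include <stdio.h>

    /* Toy model of the block-size tradeoff:
     *  - miss penalty = fixed latency + block_size / bus width
     *  - miss rate falls with block size (spatial locality) but rises
     *    again when large blocks crowd out useful data                */
    int main(void)
    {
        const double latency = 80.0;          /* cycles to lower level */
        const double bytes_per_cycle = 8.0;   /* assumed bus width     */
        for (int block = 16; block <= 256; block *= 2) {
            double penalty   = latency + block / bytes_per_cycle;
            double miss_rate = 0.4 / block + 0.0001 * block; /* toy curve */
            double am        = 1.0 + miss_rate * penalty;
            printf("block %3d B: miss rate %.4f, penalty %5.1f, AMAT %.2f\n",
                   block, miss_rate, penalty, am);  /* minimum near 64 B */
        }
        return 0;
    }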
Key Points of Memory Hierarchy
We need methods to give the illusion of a large, fast memory.
Programs exhibit both temporal locality and spatial locality:
– Keep more recently accessed data closer to the processor
– Keep multiple contiguous words together in memory blocks
Use smaller, faster memory close to the processor: hits are processed quickly; misses require access to larger, slower memory.
If the hit rate is high, the memory hierarchy has an access time close to that of the highest (fastest) level and a size equal to that of the lowest (largest) level.
Implications for CPU
Fast hit check, since every memory access needs this check
– The hit is the common case
Unpredictable memory access time
– 10s of clock cycles: wait
– 1000s of clock cycles (operating system):
  » Interrupt, switch, and do something else
  » Lightweight alternative: multithreaded execution
Four Memory Hierarchy Questions
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1: Where Can a Block Be Placed in the Cache?
Example: block 12 placed in an 8-block cache under three schemes:
– Fully associative: block 12 can go anywhere
– Direct mapped: block 12 can go only into frame 4 (12 mod 8)
– 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4)
Set-associative mapping: set = block number modulo number of sets
The sketch below reproduces this arithmetic.
[Figure: the 8-block cache drawn three ways — fully associative, direct mapped, and 2-way set associative (sets 0–3) — with memory block 12 highlighted]
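A short C sketch of the slide's example (block 12, 8-block cache):

    #include <stdio.h>

    int main(void)
    {
        const int block = 12, cache_blocks = 8;

        /* Direct mapped: exactly one frame */
        printf("direct mapped: frame %d\n", block % cache_blocks);   /* 4 */

        /* 2-way set associative: one set of two frames */
        int sets = cache_blocks / 2;
        int set  = block % sets;                                     /* 0 */
        printf("2-way: set %d (frames %d and %d)\n", set, 2*set, 2*set + 1);

        /* Fully associative: any frame at all */
        printf("fully associative: any frame 0..%d\n", cache_blocks - 1);
        return 0;
    }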
Direct Mapped Cache
Each memory location is mapped to exactly one location in the cache.
The cache location is assigned based on the block's address in memory:
Mapping: (address of block) mod (number of blocks in cache)
Associative Caches
Fully associative: a block can go anywhere in the cache.
N-way set associative: a block can go in any one of the N locations in its set.
Q2: How Is a Block Found If It Is in the Cache?
The block address is divided into Tag | Index | Block offset.
A tag is stored with each block and compared on every access.
– No need to check the index or block offset: they are implied by the block's location
Increasing associativity shrinks the index and expands the tag:
– Fully associative: no index
– Direct mapped: large index
Examples (worked out in the sketch below):
– 512-byte cache, 4-way set associative, 16-byte blocks, byte addressable
– 8-KB cache, 2-way set associative, 32-byte blocks, byte addressable
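A small C sketch that works out the field widths for both examples, assuming 32-bit byte addresses (the slide does not specify the address width):

    #include <stdio.h>

    static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    static void fields(int cache_bytes, int ways, int block_bytes)
    {
        int sets   = cache_bytes / (ways * block_bytes);
        int offset = log2i(block_bytes);
        int index  = log2i(sets);
        int tag    = 32 - index - offset;   /* assumed 32-bit address */
        printf("%5d-byte, %d-way, %2d-byte blocks: offset %d, index %d, tag %d bits\n",
               cache_bytes, ways, block_bytes, offset, index, tag);
    }

    int main(void)
    {
        fields(512,  4, 16);   /*   8 sets: offset 4, index 3, tag 25 */
        fields(8192, 2, 32);   /* 128 sets: offset 5, index 7, tag 20 */
        return 0;
    }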
Q3: Which Block Should Be Replaced on a Miss?
Easy for direct mapped: there is only one candidate.
Set associative or fully associative:
– Random (typically used at large associativities)
– LRU (smaller associativities)
– FIFO (large associativities)
[Table: miss rates for LRU, Random, and FIFO at 2-way and 4-way associativity, for cache sizes 16 KB, 64 KB, and 256 KB — values not recovered]
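For reference, a minimal LRU sketch in C for one 4-way set, using an age counter per way; this is an illustrative implementation, not code from the lecture:

    #include <stdio.h>

    #define WAYS 4

    struct set { unsigned age[WAYS]; };   /* accesses since last use */

    static void touch(struct set *s, int way)      /* on every access */
    {
        for (int w = 0; w < WAYS; w++)
            s->age[w]++;
        s->age[way] = 0;                           /* most recently used */
    }

    static int lru_victim(const struct set *s)     /* on a miss */
    {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;                             /* oldest way */
    }

    int main(void)
    {
        struct set s = {{0}};
        touch(&s, 2); touch(&s, 0); touch(&s, 2);
        printf("victim: way %d\n", lru_victim(&s)); /* way 1 (untouched) */
        return 0;
    }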
Q4: What Happens on a Write?
Write through: the information is written both to the block in the cache and to the block in lower-level memory.
Write back: the information is written only to the block in the cache; the modified block is written to main memory only when it is replaced.
– Requires a dirty bit: is the block clean or dirty?
Pros and cons of each:
– WT: read misses never cause writes to memory on a replacement (blocks are always clean)
– WB: repeated writes to a block cost only one write to memory
WT is always combined with a write buffer so the CPU need not wait for lower-level memory.
WB can also use a write buffer to hold the evicted dirty block, giving the read miss precedence.
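A C sketch contrasting the two policies on a write hit and on eviction; cache_write, buffer_write, and memory_write are hypothetical helpers assumed for illustration:

    #include <stdbool.h>

    struct line { bool dirty; /* plus tag, data, ... */ };

    /* Hypothetical helpers, assumed for this sketch: */
    void cache_write(struct line *ln, unsigned addr, unsigned data);
    void buffer_write(unsigned addr, unsigned data);
    void memory_write(unsigned block_addr, const struct line *ln);

    /* Write through: update the cache and, via the write buffer, memory. */
    void write_hit_wt(struct line *ln, unsigned addr, unsigned data)
    {
        cache_write(ln, addr, data);
        buffer_write(addr, data);      /* CPU does not wait for memory */
    }

    /* Write back: update only the cache and mark the line dirty. */
    void write_hit_wb(struct line *ln, unsigned addr, unsigned data)
    {
        cache_write(ln, addr, data);
        ln->dirty = true;
    }

    /* On replacement, a write-back cache writes the block out only if dirty. */
    void evict_wb(struct line *ln, unsigned block_addr)
    {
        if (ln->dirty)
            memory_write(block_addr, ln);
        ln->dirty = false;
    }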
Example: Data Cache (Direct Mapped)
8-KB direct-mapped cache with 32-byte blocks: index = 8 bits, since 256 blocks = 8192 / (32 bytes per block × 1 way).
[Figure: CPU address split into tag, 8-bit index, and block offset; the index selects one of 256 (valid, tag, data) entries; a comparator (=?) checks the stored tag against the address tag, qualified by the valid bit; a 4:1 mux selects the requested word from the block; a write buffer sits between the cache and lower-level memory]
ENGS 116 Lecture way Set Associative, Address to Select Word CPU address Data Data in out Lower Level Memory Write buffer Block address Block offset Index Tag 2:1 mux selects data Two sets of address tags and data RAM Use address bits to select correct RAM 2:1 MU X =? Data Valid Tag
Structural Hazard: Separate Instruction and Data Caches?
[Table: misses per 1000 instructions for instruction caches, data caches, and a unified cache, at sizes 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, and 256 KB — values not recovered]
Access mix: instructions 74%, data 26%.
Cache Performance
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
(The CPU execution clock cycles include the cache hit time.)
Cache Performance
CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
Misses per instruction = Memory accesses per instruction × Miss rate, so equivalently:
CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
These formulas are conceptual only: modern out-of-order processors hide much of the miss latency through parallelism.
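Plugging assumed values into the formula, as a C sketch (all numbers are illustrative, not from the lecture):

    #include <stdio.h>

    /* CPU time = IC * (CPI_exec + accesses/instr * miss rate * penalty)
     *            * cycle time                                           */
    int main(void)
    {
        double ic            = 1e9;    /* instructions executed        */
        double cpi_exec      = 1.0;    /* base CPI, no memory stalls   */
        double acc_per_instr = 1.3;    /* 1 fetch + 0.3 data refs      */
        double miss_rate     = 0.02;
        double miss_penalty  = 100.0;  /* cycles                       */
        double cycle_time    = 1e-9;   /* 1 GHz clock                  */

        double cpi = cpi_exec + acc_per_instr * miss_rate * miss_penalty;
        printf("effective CPI = %.2f, CPU time = %.3f s\n",
               cpi, ic * cpi * cycle_time);   /* CPI 3.60, 3.600 s */
        return 0;
    }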
Summary of Cache Basics
– Associativity
– Block size (cache line size)
– Write back / write through, write buffers, dirty bits
– AMAT as a basic performance measure
Larger block size decreases the miss rate but can increase the miss penalty.
Main-memory bandwidth can be increased to transfer cache blocks more efficiently.
The memory system can have a significant impact on program execution time; memory stalls can exceed 100 cycles.
Faster processors ⇒ memory stalls are relatively more costly.