1 Computer Architecture Cache Memory
2 Today is brought to you by cache
What do we want?
– Fast access to data from memory
– Large memory size
– Acceptable memory system cost
Where do we get it?
– Interpose a smaller but faster memory, holding recently accessed data, between the datapath and main memory
3 Cache
Cache = to conceal or store, as in the earth; hide in a secret place; n. a place for hiding or storing provisions, equipment, etc., also the things stored or hidden [F. cacher, to hide]
Cache sounds like cash.
Programs usually exhibit locality:
– temporal locality: if an item is referenced, with high probability it will be referenced again
– spatial locality: if an item is referenced, the items near it have a high probability of being referenced
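A minimal sketch of both kinds of locality in an ordinary loop (the array and variable names here are illustrative, not from the slides):

```python
# Summing an array exhibits both kinds of locality.
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]
# Temporal locality: the loop body's instructions and the variable 'total'
# are referenced again on every iteration.
# Spatial locality: data[0], data[1], data[2], ... are adjacent in memory,
# so each fetched block supplies several upcoming references.
```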
4 Learning Objectives
1. Know the principle of cache implementation
2. Know the difference between direct, partial set associative and fully associative caches, and how they work
3. Know the terms: cache hit, cache miss, word size, block size, row or set size, cache rows, cache tag, cache index, direct mapped cache, partial set associative cache and fully associative cache
5 Consider this: it is like caching information
You are in the library gathering books for an assignment.
1) The well-selected books you have gathered probably contain material that you had not expected but will likely use
2) You do not collect ALL the books from the library to your desk
3) It is quicker to access information from the book on your desk than to go to the stacks again
This is like the use of cache principles in computing.
6 Cache Principle
On a simply configured CPU-memory system with no cache, fetch and store times depend on the memory access speed, and in general, for a given technology, access time increases with memory size. A cache speeds up memory transfers by exploiting a proximity principle: machine instruction and data accesses are often "near" the previous and following accesses. By keeping recent transactions in a fast-access memory, with a separate transfer process between main memory and the cache, the effective memory access time is reduced, with consequent performance gains.
7 Why is caching effective in computing?
Spatial locality arises from:
– sequential access to program instructions
– data structures, arrays
Temporal locality arises from:
– loops, which re-execute the same instructions and reuse the same data
Memory cost and speed:
– SRAM: 5-25 ns, $100-$250/MByte
– DRAM: 60-120 ns, $5-$10/MByte
– Magnetic disk: 10-20 ms, $0.10-$0.20/MByte
8 Memory access time and cost
[Figure: access time plotted against cost per MByte for the memory technologies listed on the previous slide]
9 Practical usage of memory types
It is advantageous to build a hierarchy of memories:
– fastest and most expensive: small and close to the processor
– slowest and least expensive: large and further from the processor
10 Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: the hierarchy from the processor (control, datapath, registers) through on-chip cache, second-level cache (SRAM) and main memory (DRAM) out to secondary and tertiary storage (disk); speeds run from about 1 ns at the registers through 10 ns and 100 ns at the caches to 10,000,000 ns (10s of ms) at disk and 10,000,000,000 ns (10s of seconds) at tertiary storage, while sizes run from 100s of bytes through KBytes, MBytes and GBytes up to TBytes]
11 The Art of Memory System Design
[Figure: Processor <-> $ (cache) <-> MEM]
A workload or benchmark program presents the memory system with a reference stream <op, addr>, <op, addr>, ..., where op is i-fetch, read or write.
Goal: optimize the memory system organization to minimize the average memory access time for typical workloads.
12 Notation for accessing data and instructions in memory
Define a BLOCK as the minimum-size unit of information transferred between two adjacent levels of the memory hierarchy.
When a word of data is required, the whole block containing that word is transferred. There is a high probability that the next word required is also in the block, hence the next word is obtained from FAST memory rather than SLOW memory (see the sketch below).
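A minimal sketch of the block arithmetic (the 8-byte block size is an assumption for illustration): neighbouring addresses fall in the same block, so one transfer serves several accesses.

```python
BLOCK_SIZE = 8  # bytes per block (assumed for this illustration)

def block_number(address: int) -> int:
    """Block containing a byte address: integer division by the block size."""
    return address // BLOCK_SIZE

# Addresses 40..47 all live in block 5, so after block 5 is transferred
# once, all eight bytes come from the fast memory.
print([block_number(a) for a in range(40, 48)])  # [5, 5, 5, 5, 5, 5, 5, 5]
```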
13 Hits and misses
Define a hit as the event when data requested by the processor is available in some block of the highest level of the memory hierarchy. A miss is the other case.
Hit rate is a measure of success in accessing a cache.
14 More notation
Hit rate, miss rate, hit time, and miss penalty (the time to fetch from slow memory).
Memory systems are critical to good performance.
15 Basics of caches
How do we determine whether the data is in the cache? If it is, how is it found?
We only have information on:
– the address of the data
– how the cache is organized
Direct mapped cache:
– the data can only be at one specific place
16 The data address is used to organize the cache storage strategy
– Word: organized by byte bits
– Block: organized by bits denoting the word
– Location in cache: indexed by row
– Tag: identification of a block in a cache row
Address bit fields, from most significant to least significant: Tag | Index | Block | Byte
17 Example
A 24-bit address with 8-byte blocks and 2048 blocks gives a cache of 2048 x 8 = 16,384 bytes (16 KB). The byte offset within a block needs 3 bits (2^3 = 8), the index needs 11 bits (2^11 = 2048), and the tag takes the remaining 24 - 11 - 3 = 10 bits.
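A minimal sketch of that decomposition in code (the function and field names are mine, not the slides'):

```python
BYTE_BITS = 3     # 8 bytes per block  -> 3 byte-offset bits
INDEX_BITS = 11   # 2048 blocks        -> 11 index bits
TAG_BITS = 24 - INDEX_BITS - BYTE_BITS   # = 10 tag bits

def split_address(addr: int):
    """Split a 24-bit address into (tag, index, byte offset)."""
    byte = addr & ((1 << BYTE_BITS) - 1)
    index = (addr >> BYTE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BYTE_BITS + INDEX_BITS)
    return tag, index, byte

print(split_address(0x123456))  # (72, 1674, 6), i.e. tag 0x48, index 0x68A, byte 6
```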
18 Bit fields for a 4-byte word in a 32-bit address with 2^b words per block
Field        Address bits   Usage
Word field   0 : 3          address bits within the word being accessed
Block field  4 : 4+b-1      identifies the word within the block; the field may be empty
Set field    no bits        (none in this organization)
Tag field    4+b : 31       unique identifier for the block on its row
19 Example of direct mapped cache
The example shows address entries that map to the same location in the cache, for one byte per word, one word per block and one block per row.
[Figure: the address bit fields (Tag | Index | Block | Byte) and an 8-entry cache indexed by row; data is mapped by address modulo 8]
20 Contents of a direct mapped cache
DATA == the cached block
TAG == the most significant bits of the cached block's address, which distinguish the block in that cache row from the other blocks that map to the same row
VALID == a flag bit indicating that the cache content is valid
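Putting the three fields together, here is a minimal direct-mapped lookup in Python (row count, memory model and names are assumptions for illustration):

```python
NUM_ROWS = 8                     # assumed: 8 rows, one 1-byte word per block

valid = [False] * NUM_ROWS       # VALID bits
tag   = [0] * NUM_ROWS           # TAG per row
data  = [0] * NUM_ROWS           # DATA (the cached block)

def access(address: int, memory: list):
    """Read one byte; return (hit, value)."""
    index = address % NUM_ROWS           # row selected by the index bits
    addr_tag = address // NUM_ROWS       # remaining upper bits form the tag
    if valid[index] and tag[index] == addr_tag:
        return True, data[index]         # hit: tag matches and entry is valid
    # Miss: fetch the block from slow memory and install it in the row.
    valid[index], tag[index], data[index] = True, addr_tag, memory[address]
    return False, data[index]

memory = list(range(256))
print(access(13, memory))   # (False, 13) -> compulsory miss
print(access(13, memory))   # (True, 13)  -> hit
print(access(21, memory))   # (False, 21) -> 21 and 13 share row 5: conflict
```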
21 Direct cache
Separate the address into fields:
– byte offset within the word
– index for the row of the cache
– tag identifier of the block
For a 32-bit address, a cache of 2^n words with one 4-byte word per block has:
#rows = 2^n
tag bits per row = 32 - n - 2 = 30 - n
#bits/row = 32 (data) + (30 - n) (tag) + 1 (valid) = 63 - n
total storage = 2^n x (63 - n) bits
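A quick check of this accounting in code (a sketch; it simply restates the slide's data + tag + valid arithmetic):

```python
def direct_cache_bits(n: int, address_bits: int = 32) -> int:
    """Total storage bits of a direct-mapped cache of 2^n one-word (4-byte) blocks."""
    data_bits = 32                       # one 4-byte word per row
    tag_bits = address_bits - n - 2      # 2 bits of byte offset within the word
    valid_bit = 1
    return (2 ** n) * (data_bits + tag_bits + valid_bit)

# e.g. n = 10: 1K words of data needs 2^10 * (63 - 10) = 54,272 bits in total
print(direct_cache_bits(10))  # 54272
```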
22 Reading: Hits and Misses
A hit requires no special handling: the data is available.
Instruction fetch cache miss:
– Stall the pipeline, apply the PC to memory, and fetch the block. Re-fetch the instruction when the miss has been serviced.
– The same applies to a data fetch.
23 Multi-word Blocks
[Figure: a direct-mapped cache with four-word (128-bit) blocks; the 32-bit address splits into a 16-bit tag, a 12-bit index (4K entries), a block offset and a byte offset; each entry holds a valid bit, a 16-bit tag and 128 bits of data, the tag comparison produces Hit, and a mux uses the block offset to select the requested word]
24 Miss Rates Vs Block Size
[Figure: miss rate plotted against block size for several cache sizes; miss rate falls as blocks grow, then rises again once blocks become too large a fraction of the cache]
25 Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
– Larger block size means a larger miss penalty: it takes longer to fill the block
– If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: as block size grows, miss penalty rises steadily; miss rate first falls (exploiting spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore dips and then climbs as the increased miss penalty and miss rate take over]
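A small numeric sketch of the slide's formula (the hit time and miss penalty values are assumed for illustration):

```python
def average_access_time(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    # The slide's formula: hits cost hit_time, misses cost miss_penalty.
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

# Assumed numbers: 1-cycle hit, 40-cycle miss penalty.
for miss_rate in (0.02, 0.05, 0.10):
    print(f"miss rate {miss_rate:.2f} -> {average_access_time(1, miss_rate, 40):.2f} cycles")
# miss rate 0.02 -> 1.78 cycles
# miss rate 0.05 -> 2.95 cycles
# miss rate 0.10 -> 4.90 cycles
```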
26 Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address splits into Cache Tag (bits 31..10, ex: 0x50), Cache Index (bits 9..5, ex: 0x01) and Byte Select (bits 4..0, ex: 0x00); the tag is stored as part of the cache "state" alongside a valid bit, and each of the 32 rows holds a 32-byte block (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023)]
27 Extreme Example: single big line
Cache Size = 4 bytes, Block Size = 4 bytes
– Only ONE entry in the cache
If an item is accessed, it is likely to be accessed again soon
– But it is unlikely to be accessed again immediately!!!
– The next access will therefore likely be a miss again
– The cache continually loads data but discards (forces out) it before it is used again
– Worst nightmare of a cache designer: the Ping Pong Effect (see the sketch below)
Conflict misses are misses caused by:
– Different memory locations mapped to the same cache index
– Solution 1: make the cache size bigger
– Solution 2: multiple entries for the same cache index
[Figure: the single entry - a valid bit, a cache tag and one 4-byte block (Byte 0 ... Byte 3)]
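A hedged sketch of the ping-pong effect, reusing the toy direct-mapped lookup from slide 20 (sizes and names are assumed): two addresses that share a row evict each other on every access.

```python
NUM_ROWS = 8
valid, tag = [False] * NUM_ROWS, [0] * NUM_ROWS

def probe(address: int) -> bool:
    """Return True on a hit; on a miss, install the block and return False."""
    index, addr_tag = address % NUM_ROWS, address // NUM_ROWS
    hit = valid[index] and tag[index] == addr_tag
    if not hit:
        valid[index], tag[index] = True, addr_tag   # evict whatever was there
    return hit

# Addresses 5 and 13 both map to row 5: alternating accesses never hit.
print([probe(a) for a in (5, 13, 5, 13, 5, 13)])
# [False, False, False, False, False, False] -> every access is a conflict miss
```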
28 Another Extreme Example: Fully Associative
Fully associative cache, N blocks of 32 bytes each:
– Forget about the Cache Index
– Compare the Cache Tags of all cache entries in parallel
– Example: with 32-byte blocks, we need N 27-bit comparators
By definition: Conflict Misses = 0 for a fully associative cache
[Figure: the address splits into a 27-bit Cache Tag and a Byte Select (ex: 0x01); every entry's 27-bit tag and valid bit are compared against the address tag in parallel, each row holding a 32-byte block (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...)]
29 A Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index
– N direct mapped caches operating in parallel
Example: two-way set associative cache
– The Cache Index selects a "set" from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag comparison result
[Figure: two banks of (Valid, Cache Tag, Cache Data) rows; the Cache Index selects Cache Block 0 from each bank, two comparators check the address tag against both stored tags, the results are ORed into Hit, and Sel1/Sel0 drive a mux that passes the hitting bank's Cache Block]
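A minimal two-way set-associative lookup in Python (set count, replacement policy and names are assumptions for illustration):

```python
NUM_SETS, WAYS = 4, 2
valid = [[False] * WAYS for _ in range(NUM_SETS)]
tags  = [[0] * WAYS for _ in range(NUM_SETS)]
lru   = [0] * NUM_SETS   # which way to replace next (trivial LRU for 2 ways)

def probe(address: int) -> bool:
    index, addr_tag = address % NUM_SETS, address // NUM_SETS
    for way in range(WAYS):                      # both tags compared "in parallel"
        if valid[index][way] and tags[index][way] == addr_tag:
            lru[index] = 1 - way                 # the other way is now LRU
            return True                          # hit: select this way's block
    way = lru[index]                             # miss: replace the LRU way
    valid[index][way], tags[index][way] = True, addr_tag
    lru[index] = 1 - way
    return False

# Two addresses sharing a set no longer ping-pong: both stay resident.
print([probe(a) for a in (3, 7, 3, 7)])  # [False, False, True, True]
```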
30 Disadvantage of Set Associative Cache
N-way set associative cache versus direct mapped cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data arrives AFTER the Hit/Miss decision and set selection
In a direct mapped cache, the cache block is available BEFORE Hit/Miss:
– It is possible to assume a hit and continue, recovering later on a miss
[Figure: the same two-way set associative datapath as on the previous slide]
31 Three Cs of Caches
1. Compulsory misses: cache misses caused by the first access to a block that has never been in the cache (also known as cold-start misses)
2. Capacity misses: cache misses caused when the cache cannot contain all the blocks needed during execution of a program; they occur because blocks are replaced and later retrieved
3. Conflict misses: cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that would be eliminated by a fully associative cache of the same size. These are also called collision misses.
32 A Summary on Sources of Cache Misses
Compulsory (cold start or process migration; first reference): first access to a block
– "Cold" fact of life: not a whole lot you can do about it
– Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
Conflict (collision): multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
Capacity: the cache cannot contain all the blocks accessed by the program
– Solution: increase cache size
Invalidation: another process (e.g., I/O) updates memory
33 Summary
The Principle of Locality:
– A program is likely to access a relatively small portion of the address space at any instant of time.
– Temporal Locality: locality in time
– Spatial Locality: locality in space
Three major categories of cache misses:
– Compulsory misses: sad facts of life; example: cold start misses
– Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping pong effect!
– Capacity misses: increase cache size
Cache design space:
– total size, block size, associativity
– replacement policy
– write-hit policy (write-through, write-back)
– write-miss policy
34 Cache design parameters
Design change            Effect on miss rate                            Possible negative performance effect
Increase block size      decreases miss rate due to compulsory misses   may increase miss penalty
Increase cache size      decreases capacity misses                      may increase access time
Increase associativity   decreases miss rate due to conflict misses     may increase access time