CS61C L31 Caches I (1) Garcia 2005 © UCB Peer Instruction Answer
A. Mem hierarchies were invented before (UNIVAC I wasn’t delivered ‘til 1951)
B. If you know your computer’s cache size, you can often make your code run faster.
C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor.
ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
Answers:
A. “We are…forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less accessible.” – von Neumann, 1946
B. Certainly! That’s called “tuning”.
C. “Most recent” items => temporal locality
Peer Instruction Answer
1. All caches take advantage of spatial locality. FALSE: Block size = 1, no spatial!
2. All caches take advantage of temporal locality. TRUE: That’s the idea of caches; we’ll need it again soon.
3. On a read, the return value will depend on what is in the cache. FALSE: It better not! If it’s there, use it. Otherwise, get it from memory.
ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
CS61C L32 Caches II (3) Garcia, 2005 © UCB Review: Direct-Mapped Cache. Cache location 0 can be occupied by data from memory locations 0, 4, 8, ... With 4 blocks, that is any memory location that is a multiple of 4. [Figure: memory (addresses 0 to F) mapped by cache index onto a 4 byte direct mapped cache]
Caching Terminology. When we try to read memory, 3 things can happen:
1. cache hit: the cache block is valid and contains the proper address, so read the desired word
2. cache miss: nothing is in the cache at the appropriate block, so fetch from memory
3. cache miss, block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (the cache always holds a copy)
Cache access (Fig 7.7). 1 word = 4 bytes, byte addressing. [Figure labels: the part of the cache that actually stores data; cache block size:]
Issues with Direct-Mapped. Since multiple memory addresses map to the same cache index, how do we tell which one is in there? What if we have a block size > 1 byte? Answer: divide the memory address into three fields:
ttttttttttttttttt iiiiiiiiii oooo
Tag: to check if we have the correct block
Index: to select the block
Byte offset: within the block
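The three-field split can be sketched with shifts and masks. This is a minimal illustration, assuming a 10-bit index and a 4-bit offset (the widths drawn as "iiiiiiiiii" and "oooo" on the slide); the function name `split_address` and the sample addresses are hypothetical.

```python
OFFSET_BITS = 4   # assumed from the "oooo" field on the slide
INDEX_BITS = 10   # assumed from the "iiiiiiiiii" field on the slide


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, index, offset) for a byte address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # lowest bits: byte within block
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # middle bits: which cache row
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # remaining high bits
    return tag, index, offset


# Hypothetical addresses, chosen only for illustration.
for addr in (0x0014, 0x8014):
    tag, index, offset = split_address(addr)
    print(f"addr={addr:#06x} tag={tag} index={index} offset={offset}")
```

Both sample addresses land on the same index with different tags, which is exactly the situation the tag field exists to resolve.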
Use of spatial locality. The previous cache design takes advantage of temporal locality. To use spatial locality in the cache design, make a cache block larger than 1 word: on a cache miss, we fetch multiple adjacent words at once.
4-word cache (Fig 7.10) 4-word block addr.
Advantage of multiple-word block (spatial locality). Ex. access words with byte addresses 16, 24, 20
1-word block cache: 16 - cache miss; 24 - cache miss; 20 - cache miss
4-word block cache: 16 - cache miss, load 4-word block from memory; 24 - cache hit; 20 - cache hit
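The comparison above can be reproduced with a small direct-mapped simulation. This is a sketch under assumptions: the cache holds 4 blocks (the slide does not say how many), words are 4 bytes, and `classify_accesses` is a hypothetical helper name.

```python
def classify_accesses(byte_addresses, words_per_block, num_blocks=4, bytes_per_word=4):
    """Return 'hit' or 'miss' for each access to a direct-mapped cache."""
    block_bytes = words_per_block * bytes_per_word
    cache = [None] * num_blocks          # block address held at each index, None = empty
    outcomes = []
    for addr in byte_addresses:
        block_addr = addr // block_bytes  # which memory block the byte belongs to
        index = block_addr % num_blocks   # direct-mapped placement
        if cache[index] == block_addr:
            outcomes.append("hit")
        else:
            cache[index] = block_addr     # load the whole block on a miss
            outcomes.append("miss")
    return outcomes


accesses = [16, 24, 20]
print("1-word blocks:", classify_accesses(accesses, words_per_block=1))
print("4-word blocks:", classify_accesses(accesses, words_per_block=4))
```

With 4-word (16-byte) blocks, byte addresses 16, 24 and 20 all fall in the same block, so only the first access misses.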
Multiple-word cache: write miss addr. 1-word data 01 Reload 4-word block 1-word data miss
Accessing data in a direct mapped cache Ex.: 16KB of data, direct-mapped, 4 word blocks Read 4 addresses 1.0x x C 3.0x x Address (hex) Value of Word Memory C a b c d C e f g h C i j k l...
16 KB Direct Mapped Cache, 16B blocks. Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid). [Cache table columns: Index, Valid, Tag, data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f]
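The field widths for this cache follow from its size and block size. The sketch below assumes 32-bit byte addresses and power-of-two sizes; neither is stated on the slide, and `field_widths` is a hypothetical helper.

```python
def field_widths(cache_bytes, block_bytes, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = num_blocks.bit_length() - 1    # log2 of the number of rows
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


tag_bits, index_bits, offset_bits = field_widths(16 * 1024, 16)
print(tag_bits, index_bits, offset_bits)
```

For 16 KB of data in 16-byte blocks there are 1024 rows, so the index needs 10 bits and the offset 4 bits; the remaining bits form the tag.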
1. Read 0x (address split into tag field, index field, offset)
So we read block 1...
No valid data...
So load that data into cache, setting tag, valid... (block 1 now holds a b c d)
Read from cache at offset, return word b
2. Read 0x C = 0…
Index is Valid...
Index valid, Tag Matches...
Index Valid, Tag Matches, return d...
3. Read 0x = 0…
So read block 3...
No valid data...
Load that cache block, return word f... (block 3 now holds e f g h)
4. Read 0x = 0…
So read Cache Block 1, Data is Valid...
Cache Block 1 Tag does not match (0 != 2)...
Miss, so replace block 1 with new data & tag... (block 1 now holds i j k l)
And return word j...
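The four reads can be replayed in a small simulator. The addresses and memory layout below are assumptions, chosen only so that the sequence reproduces the outcomes in the walk-through (miss, hit, miss, miss with replacement, returning b, d, f, j); the class and constant names are hypothetical.

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16-byte blocks, 1024 rows (16 KB of data)
BYTES_PER_WORD = 4

# Assumed memory contents: block base address -> the four words stored there.
MEMORY = {
    0x0010: ["a", "b", "c", "d"],
    0x0030: ["e", "f", "g", "h"],
    0x8010: ["i", "j", "k", "l"],
}


class DirectMappedCache:
    def __init__(self):
        # index -> (tag, words); a missing index plays the role of valid bit = 0
        self.rows = {}

    def read(self, addr):
        offset = addr & ((1 << OFFSET_BITS) - 1)
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        row = self.rows.get(index)
        if row is not None and row[0] == tag:
            outcome = "hit"
        else:
            outcome = "miss" if row is None else "miss with replacement"
            self.rows[index] = (tag, MEMORY[addr - offset])  # load the whole block
        words = self.rows[index][1]
        return outcome, words[offset // BYTES_PER_WORD]


cache = DirectMappedCache()
trace = [cache.read(addr) for addr in (0x0014, 0x001C, 0x0034, 0x8014)]
for outcome, word in trace:
    print(outcome, word)
```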
Do an example yourself. What happens? Choose from: Cache: Hit, Miss, Miss w. replace. Values returned: a, b, c, d, e, ..., k, l. Read address 0x ? Read address 0x c ? [Cache state: index 1 holds i j k l; index 3 holds e f g h, valid = 1, tag = 0]
Advantage of multiple-word block (spatial locality). Comparison of miss rate:
Program   Block size in words   Instruction miss rate   Data miss rate
gcc       1                     6.1%                    2.1%
gcc       4                     2.0%                    1.7%
spice     1                     1.2%                    1.3%
spice     4                     0.3%                    0.6%
Why is the improvement in instruction miss rate so significant? Instruction references have better spatial locality.
Miss rate vs. block size. Why? The number of blocks decreases!
Short conclusion. Direct mapped cache: map a memory word to a cache block; valid bit, tag field. Cache read: hit, read miss, miss penalty. Cache write: write-through, write-back, write miss penalty. Multi-word cache (uses spatial locality).
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache Virtual memory Make memory system fast
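Cache performance: how does the cache affect system performance?
$$\text{CPU time} = (\text{CPU execution clock cycles} + \text{Memory-stall clock cycles}) \times \text{Clock cycle time}$$
CPU execution clock cycles cover cache hits; memory-stall clock cycles come from cache misses.
$$\text{Memory-stall cycles} = \text{Read-stall cycles} + \text{Write-stall cycles}$$
$$\text{Read-stall cycles} = \frac{\text{Reads}}{\text{Program}} \times \text{Read miss rate} \times \text{Read miss penalty}$$
Assume the read and write miss penalties are the same:
$$\text{Memory-stall cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$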
Ex. Calculate cache performance. CPI = 2 without any memory stalls. For gcc: instruction cache miss rate = 2%, data cache miss rate = 4%, miss penalty = 40 cycles, and 36% of instructions are lw/sw.
Sol: Set instruction count = I
$$\text{Instruction miss cycles} = I \times 2\% \times 40 = 0.8I$$
$$\text{Data miss cycles} = I \times 36\% \times 4\% \times 40 = 0.58I$$
$$\text{Memory-stall cycles} = 0.8I + 0.58I = 1.38I$$
$$\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{2I + 1.38I}{2I} = 1.69$$
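Why is memory the bottleneck for system performance? In the previous example, suppose we make the processor faster and change the CPI from 2 to 1. Memory-stall cycles remain the same = 1.38I
$$\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{I + 1.38I}{I} = 2.38$$
Percentage of time spent in memory stalls: $1.38 / 3.38 = 41\%$ (CPI = 2) becomes $1.38 / 2.38 = 58\%$ (CPI = 1).
The faster the CPU (lower CPI or higher clock rate), the larger the share of system performance that memory accounts for.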
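The gcc numbers for both base CPIs can be checked with a few lines of arithmetic. This is a sketch using only the rates given in the example; the function name is hypothetical, and the slide rounds 1.376 stall cycles per instruction to 1.38.

```python
def stall_cycles_per_instruction(inst_miss_rate, data_miss_rate,
                                 mem_ops_per_inst, miss_penalty):
    """Memory-stall cycles per instruction (instruction fetches plus lw/sw)."""
    inst_stalls = inst_miss_rate * miss_penalty
    data_stalls = mem_ops_per_inst * data_miss_rate * miss_penalty
    return inst_stalls + data_stalls


# 2% instruction miss rate, 4% data miss rate, 36% lw/sw, 40-cycle penalty
stalls = stall_cycles_per_instruction(0.02, 0.04, 0.36, 40)

results = {}
for base_cpi in (2, 1):
    slowdown = (base_cpi + stalls) / base_cpi      # vs. a perfect cache
    stall_share = stalls / (base_cpi + stalls)     # fraction of time stalled
    results[base_cpi] = (slowdown, stall_share)
    print(f"CPI={base_cpi}: slowdown={slowdown:.2f}, stall share={stall_share:.0%}")
```

Halving the base CPI leaves the stall cycles untouched, so the stall share of total time grows.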
Outline Introduction Basics of caches Measuring cache performance Set associative cache (reduce miss rate) Multilevel cache Virtual memory Make memory system fast
How to improve cache performance? Larger cache and set associative cache: reduce the cache miss rate (a new placement rule other than direct mapping). Multi-level cache: reduce the cache miss penalty.
$$\text{Memory-stall cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$
Flexible placement of blocks. Recall the direct mapped cache: one address -> one block in cache. Alternative: one address -> more than one possible block in cache.
Fully associative cache. A memory block can be placed in any block in the cache. Disadvantage: must search all entries in the cache for a match, using parallel comparators.
Set-associative cache. Between direct mapped and fully associative: a memory block can be placed in any block of one set in the cache. Disadvantage: must search all entries in the set for a match (parallel comparators). The set is chosen by (block address) modulo (number of sets in cache). Ex. 12 modulo 4 = 0
Example: 4-way set-associative cache Parallel comparators
All schemes can be viewed as cases of set associativity. Ex. 8-block cache
Example: set-associative caches (p. 499) A cache with 4 blocks Load data with block addresses 0,8,0,6,8 one-way set-associative cache (direct mapped) 5 misses
Example: set-associative caches. 2-way set-associative cache: 4 misses. 4-way set-associative cache: 3 misses.
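The three miss counts can be checked with one simulator in which direct mapped is the 1-way case. This sketch assumes LRU replacement within a set, which these slides do not state for this example; `count_misses_lru` is a hypothetical helper.

```python
from collections import OrderedDict


def count_misses_lru(block_addresses, num_blocks, ways):
    """Count misses for an N-way set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    # One OrderedDict per set, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]     # (block address) modulo (number of sets)
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses


sequence = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses_lru(sequence, num_blocks=4, ways=ways)} misses")
```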
CS61C L34 Caches IV (48) Garcia © UCB Block Replacement Policy (1/2). Direct-Mapped Cache: index completely specifies which position a block can go in on a miss. N-Way Set Assoc: index specifies a set, but the block can occupy any position within the set on a miss. Fully Associative: block can be written into any position. Question: if we have the choice, where should we write an incoming block?
Block Replacement Policy (2/2) If there are any locations with valid bit off (empty), then usually write the new block into the first one. If all possible locations already have a valid block, we must pick a replacement policy: rule by which we determine which block gets “cached out” on a miss.
Block Replacement Policy: LRU. LRU (Least Recently Used). Idea: cache out the block which has been accessed (read or write) least recently. Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy. Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this.
Block Replacement Example We have a 2-way set associative cache with a four word total capacity and one word blocks. We perform the following word accesses (ignore bytes for this problem): 0, 2, 0, 1, 4, 0, 2, 3, 5, 4 How many hits and how many misses will there be for the LRU block replacement policy?
Block Replacement Example: LRU. Addresses 0, 2, 0, 1, 4, 0, ...
0: miss, bring into set 0 (loc 0)
2: miss, bring into set 0 (loc 1)
0: hit
1: miss, bring into set 1 (loc 0)
4: miss, bring into set 0 (loc 1, replace 2)
0: hit
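The full ten-access sequence can be traced the same way. This is a minimal sketch of the 2-way, two-set, one-word-block cache from the example; `lru_trace` is a hypothetical name, and each set is kept as a list ordered from least to most recently used.

```python
def lru_trace(word_addresses, num_sets=2, ways=2):
    """Return 'hit' or 'miss' per access for a set-associative LRU cache."""
    sets = [[] for _ in range(num_sets)]   # each list: LRU first, MRU last
    outcomes = []
    for addr in word_addresses:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.remove(addr)             # will be re-appended as most recent
            outcomes.append("hit")
        else:
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used word
            outcomes.append("miss")
        lines.append(addr)
    return outcomes


outcomes = lru_trace([0, 2, 0, 1, 4, 0, 2, 3, 5, 4])
hits = outcomes.count("hit")
misses = outcomes.count("miss")
print(outcomes)
print(f"{hits} hits, {misses} misses")
```

The first six outcomes match the steps listed above; the remaining four accesses (2, 3, 5, 4) follow from the same rule.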
Short conclusion. Higher degree of associativity: lower miss rate, but more hardware cost to search.
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache (reduce miss penalty) Virtual memory Make memory system fast
Multi-level cache. Goal: reduce miss penalty. Primary cache (L1) and secondary cache (L2): on a cache hit, data comes from the cache; on an L1 cache miss, go to L2; on an L2 cache miss, go to main memory.
Example: Performance of multilevel cache. CPI = 1 without cache misses, clock rate = 500 MHz. Primary cache: miss rate = 5%. Secondary cache: miss rate = 2%, access time = 20 ns. Main memory: access time = 200 ns. Total CPI = Base CPI + memory-stall CPI, where base CPI = 1 and memory-stall CPI = ?
Example: Performance of multilevel cache (cont.). Total CPI = Base CPI + memory-stall CPI, with base CPI = 1.
Access to main memory = 200 ns x 500M clocks/sec = 100 clocks
Access to L2 cache = 20 ns x 500M clocks/sec = 10 clocks
Two-level cache:
$$\text{Total CPI} = 1 + \text{L1 miss penalty} + \text{L2 miss penalty} = 1 + 5\% \times 10 + 2\% \times 100 = 3.5$$
One-level cache:
$$\text{Total CPI} = 1 + 5\% \times 100 = 6$$
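The same arithmetic as a short script. This sketch follows the slide in applying both miss rates per instruction; the constant and function names are hypothetical.

```python
def penalty_cycles(access_time_ns, clock_mhz):
    """Convert an access time in ns to clock cycles at the given clock rate."""
    return access_time_ns * clock_mhz / 1000


CLOCK_MHZ = 500
BASE_CPI = 1.0
L1_MISS_RATE = 0.05   # misses that go to the secondary cache
L2_MISS_RATE = 0.02   # misses that go on to main memory (applied per instruction, as on the slide)

l2_penalty = penalty_cycles(20, CLOCK_MHZ)     # secondary cache access
mem_penalty = penalty_cycles(200, CLOCK_MHZ)   # main memory access

one_level_cpi = BASE_CPI + L1_MISS_RATE * mem_penalty
two_level_cpi = BASE_CPI + L1_MISS_RATE * l2_penalty + L2_MISS_RATE * mem_penalty

print(f"one-level CPI: {one_level_cpi:.1f}")
print(f"two-level CPI: {two_level_cpi:.1f}")
print(f"speedup from adding L2: {one_level_cpi / two_level_cpi:.2f}x")
```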
Cache Things to Remember. Caches are NOT mandatory: the processor performs arithmetic, memory stores data, and caches simply make data transfers go faster. Caches speed things up due to temporal locality: store data used recently. Block size > 1 word => spatial locality speedup: store words next to the ones used recently. Cache design choices: size of cache (speed vs. capacity); N-way set assoc: choice of N (direct-mapped and fully-associative are just special cases for N).