Cache - Optimization
Performance

- CPU time = (CPU execution cycles + memory-stall cycles) × cycle time
- Memory-stall cycles = read-stall cycles + write-stall cycles
  - Read-stall cycles = #reads × read miss rate × read miss penalty
  - Write-stall cycles = (#writes × write miss rate × write miss penalty) + write buffer stalls (assuming a write-through cache)
- Two assumptions simplify the calculation:
  - In most write-through caches, the read and write miss penalties are the same
  - Write buffer stalls are negligible
- With these, memory-stall cycles = #memory accesses × miss rate × miss penalty = #misses × miss penalty
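The simplified stall model above can be sketched directly in code. This is a minimal illustration; the function names and the example numbers (one million accesses, a 3% miss rate, a 100-cycle penalty) are made up for demonstration.

```python
# Sketch of the simplified memory-stall model from the slide.
# Rates are fractions; times are in cycles unless noted.

def memory_stall_cycles(mem_accesses, miss_rate, miss_penalty):
    """Simplified model: read/write penalties equal, write-buffer stalls ~0."""
    return mem_accesses * miss_rate * miss_penalty

def cpu_time(exec_cycles, stall_cycles, cycle_time):
    """CPU time in seconds, given a cycle time in seconds."""
    return (exec_cycles + stall_cycles) * cycle_time

# Illustrative numbers: 1M memory accesses, 3% miss rate, 100-cycle penalty.
stalls = memory_stall_cycles(1_000_000, 0.03, 100)
print(stalls)  # → 3000000.0 stall cycles
```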
Cache Performance Example
- SPECInt2000 miss rates: 2% (I-cache), 4% (D-cache)
- CPI = 2 (without memory stalls)
- Miss penalty = 100 cycles
- Memory accesses = 36% of instructions
- Speedup of a perfect cache over the above:
  - Let I be the number of instructions
  - CPU time (cycles) = I × 2 CPI + memory-stall cycles
  - Memory-stall cycles = I × 2% miss rate × 100 cycles + I × 36% × 4% miss rate × 100 cycles = 3.44 × I
  - Thus, CPU time = 2 × I + 3.44 × I = 5.44 × I
  - Perfect cache: CPU time (cycles) = I × 2 CPI + 0 cycles = 2 × I
  - Speedup = 5.44 / 2 = 2.72
- A perfect cache would make the system almost three times faster!
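The arithmetic in this example can be checked with a few lines of code. This is a sketch of the slide's calculation only; I is kept symbolic by setting it to 1.0, so all quantities are per instruction.

```python
# Reproducing the slide's example: 2% I-cache and 4% D-cache miss rates,
# 100-cycle penalty, base CPI of 2, memory accesses = 36% of instructions.
I = 1.0                 # instruction count, symbolic (per-instruction view)
cpi_base = 2.0

stalls = I * 0.02 * 100 + I * 0.36 * 0.04 * 100   # I-cache + D-cache stalls
cpu_real = cpi_base * I + stalls                   # 5.44 * I
cpu_perfect = cpi_base * I                         # 2 * I

print(stalls)                   # ≈ 3.44
print(cpu_real / cpu_perfect)   # ≈ 2.72 speedup with a perfect cache
```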
Performance on Increased Clock Rate
- Doubling the clock rate in the previous example:
  - Cycle time decreases to half
  - But the miss penalty increases from 100 to 200 cycles
- Speedup of the fast-clock system:
  - CPU time of the previous (slow) system, with C the slow system's cycle time: CPU time_slow = 5.44 × I × C (seconds)
  - CPU time of the fast-clock system: CPU time (cycles) = I × 2 CPI + memory-stall cycles
  - Memory-stall cycles = I × 2% miss rate × 200 cycles + I × 36% × 4% miss rate × 200 cycles = 6.88 × I
  - CPU time_fast = (2 × I + 6.88 × I) × (C / 2) = 4.44 × I × C (seconds)
  - Speedup = 5.44 / 4.44 = 1.23
- Not twice as fast as the slow-clock system!
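The same per-instruction sketch verifies that doubling the clock does not double performance. I and C are kept symbolic by setting them to 1.0.

```python
# Doubling the clock: cycle time halves, but the miss penalty (in cycles)
# doubles from 100 to 200 because memory speed is unchanged.
I, C = 1.0, 1.0        # instruction count and slow-clock cycle time, symbolic

stalls_fast = I * 0.02 * 200 + I * 0.36 * 0.04 * 200   # 6.88 * I cycles
t_slow = (2 * I + 3.44 * I) * C                        # 5.44 * I * C seconds
t_fast = (2 * I + stalls_fast) * (C / 2)               # 4.44 * I * C seconds

print(t_slow / t_fast)  # ≈ 1.23, far from the 2x the clock rate suggests
```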
Improving Cache Performance
- Increase cache performance by decreasing the average access time
- Average access time = hit time + miss rate × miss penalty
- Reduce each component of the average access time:
  - Hit time
  - Miss rate
  - Miss penalty
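The average access time formula is a one-liner; a minimal sketch with illustrative numbers (1-cycle hit, 5% miss rate, 100-cycle penalty, all assumed for the example):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hits, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # → 6.0 cycles on average
```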
Associativity: Flexible Block Placement

- Figure: placing memory block 12 in an 8-block cache under three schemes
  - Fully associative: the block can go in any cache block
  - Direct mapped: cache block = 12 mod 8 = 4
  - 2-way set associative: set = 12 mod 4 = 0
Set-Associative Cache
- Multiple blocks in a set
  - Each memory block first maps to a unique set
  - Within that set, it can be placed in any cache block
- N-way set associative cache: each set consists of N blocks
- Mapping a block to the cache:
  - Select a set: set index = (block number) modulo (#sets in the cache)
  - Parallel tag matching: the tags of all blocks in the selected set are compared in parallel
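The set-selection rule can be shown in a couple of lines. The helper name is my own; the example reuses block 12 and the 8-block cache from the placement figure above.

```python
# Set selection for an N-way set associative cache:
# an 8-block cache has 8 sets when direct mapped (N=1), 4 sets when 2-way,
# and 1 set when fully associative (N=8).
def set_index(block_number, num_sets):
    return block_number % num_sets

print(set_index(12, 8))   # direct mapped: block 12 -> set 4
print(set_index(12, 4))   # 2-way: block 12 -> set 0
print(set_index(12, 1))   # fully associative: only set 0 exists
```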
Four-way Set Associative Cache
- More hardware increases cost: parallel tag comparators and a multiplexor
- The multiplexor increases hit time
- Increased associativity:
  - Reduces conflicts, and thus miss rates
  - Requires a more complex MUX, which increases hit time
Multiple-Word Blocks In a Set
- Figure: a two-way set associative cache with 32-byte cache blocks
  - 32 sets (set 0 ... set 31), each holding two cache blocks with a valid bit, a tag, and the block data
  - Address layout: 22-bit tag | 5-bit set index | 5-bit block offset
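The address split in the figure (22-bit tag, 5-bit set index, 5-bit block offset) can be decoded with shifts and masks. The function name and the example address are my own, chosen only to illustrate the bit layout.

```python
# Address decomposition for the figure's layout:
# 22-bit tag | 5-bit set index | 5-bit block offset (32 sets, 32-byte blocks).
OFFSET_BITS, INDEX_BITS = 5, 5

def decompose(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # low 5 bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # next 5 bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # remaining bits
    return tag, index, offset

# 0x423 = 0b100_00001_00011: tag 1, set 1, byte offset 3 within the block.
print(decompose(0x00000423))  # → (1, 1, 3)
```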
Accessing a Word: Tag Matching and Word Selection

- (a) The valid bit must be set for the matching block in the selected set
- (b) The tag bits in one of the cache blocks must match the tag bits in the address
- (c) If (a) and (b) hold, it is a cache hit, and the block offset selects a word through a multiplexer
- Address layout in the figure: 22-bit tag | 5-bit set index | 5-bit block offset
Block Replacement

- Easy for direct mapped: a single candidate to replace
- Set associative or fully associative:
  - Random is good enough with large caches
  - LRU is better with small, low-associativity caches

  Miss rates (LRU / Random):

  Size      2-way            4-way            8-way
  16 KB     5.18% / 5.69%    4.67% / 5.29%    4.39% / 4.96%
  64 KB     1.88% / 2.01%    1.54% / 1.66%    1.39% / 1.53%
  256 KB    1.15% / 1.17%    1.13% / 1.12%    (not shown)
Types of Cache Misses

- Compulsory (or cold start, first reference) misses
  - First access to an address
  - Occur even with an infinitely large cache
- Capacity misses
  - The cache is not large enough to hold the working set
  - e.g., a block is replaced from the cache and later accessed again
  - These are the misses that remain in a fully associative cache of the same size
- Conflict (or collision, interference) misses
  - Due to not enough associativity
  - e.g., two addresses map to the same block in a direct-mapped cache
- Coherence misses (only in multiprocessor caches)
  - Caused by the cache coherence protocol
Example: Cache Associativity
- Four one-word blocks per cache; cache size = 16 bytes (= 4 × 4 bytes)
- Compare three caches:
  - A direct-mapped cache
  - A 2-way set associative cache
  - A fully associative cache
- Find the number of misses for the reference sequence: 0, 8, 0, 6, 8
Example: Direct Mapped Cache
- Block placement (direct mapping): 0 mod 4 = 0, 6 mod 4 = 2, 8 mod 4 = 0
- Cache simulation:
  - Access 0: miss (block 0 ← M[0])
  - Access 8: miss (block 0 ← M[8], evicts M[0])
  - Access 0: miss (block 0 ← M[0], evicts M[8])
  - Access 6: miss (block 2 ← M[6])
  - Access 8: miss (block 0 ← M[8], evicts M[0])
- Number of cache misses = 5 (3 compulsory and 2 conflict misses)
Example: 2-way Set Associative Cache
- Two sets with two cache blocks each: 0 mod 2 = set 0, 6 mod 2 = set 0, 8 mod 2 = set 0
- Replacement policy within a set: least recently used (LRU)
- Cache simulation:
  - Access 0: miss (set 0 = {M[0]})
  - Access 8: miss (set 0 = {M[0], M[8]})
  - Access 0: hit
  - Access 6: miss, evicts M[8] (set 0 = {M[0], M[6]})
  - Access 8: miss, evicts M[0] (set 0 = {M[6], M[8]})
- Number of cache misses = 4 (3 compulsory and 1 conflict miss)
Example: Fully Associative Cache
- Four cache blocks in a single set (one set holds all blocks)
- Cache simulation:
  - Access 0: miss ({M[0]})
  - Access 8: miss ({M[0], M[8]})
  - Access 0: hit
  - Access 6: miss ({M[0], M[8], M[6]})
  - Access 8: hit
- Number of cache misses = 3, only the compulsory misses in this case
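All three examples can be reproduced with one small LRU cache simulator. This is an assumed implementation written for these slides, not code from the lecture; it tracks only block numbers, since hit/miss behavior does not depend on the data.

```python
# LRU cache simulator for the three examples above.
# refs is a sequence of block numbers; the cache has num_blocks blocks
# organized as (num_blocks // assoc) sets of assoc blocks each.
def count_misses(refs, num_blocks, assoc):
    num_sets = num_blocks // assoc
    sets = [[] for _ in range(num_sets)]   # each set ordered LRU -> MRU
    misses = 0
    for block in refs:
        s = sets[block % num_sets]         # set selection by modulo
        if block in s:
            s.remove(block)                # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == assoc:
                s.pop(0)                   # evict the least recently used
        s.append(block)                    # mark as most recently used
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(refs, 4, 1))   # direct mapped: 5 misses
print(count_misses(refs, 4, 2))   # 2-way set associative: 4 misses
print(count_misses(refs, 4, 4))   # fully associative: 3 misses
```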
Cache Parameters

- Cache size, block size, and associativity affect cache performance (hit time and miss rate), so appropriate values must be selected
- Larger cache
  - The obvious way to reduce capacity misses
  - But higher cost and longer hit time
- Larger block size
  - Spatial locality reduces compulsory misses (prefetch effect)
  - But for a fixed cache size, larger blocks increase conflict misses
- Higher associativity
  - Reduces conflict misses
  - But larger tag-matching logic increases hit time
Block Size vs. Miss Rate

- An appropriate block size is needed
- Figure: miss rate vs. block size, for fixed cache size and associativity
  - Larger blocks reduce compulsory misses
  - Very large blocks increase conflict misses
  - Popular block-size choices lie in between
Miss Rates from 3 Types of Misses

- Associativity matters: higher associativity reduces miss rates for caches of the same size
- Figure: SPEC2000 miss rate broken down by type (conflict, capacity, compulsory) for 1-way, 2-way, 4-way, and 8-way caches
Multilevel Caches

- Reduce miss penalty: a second-level cache reduces the miss penalty of the first-level cache
- Optimize each cache for different constraints; exploit cost/capacity trade-offs at each level
- L1 cache (first-level cache, primary cache)
  - Provides fast access time (1~3 CPU cycles)
  - Built on the same chip as the processor
  - 8~64 KB, direct-mapped to 4-way set associative
- L2 cache (second-level cache, secondary cache)
  - Provides low miss penalty (20~30 CPU cycles)
  - Can be on the same chip or off-chip in a separate set of SRAMs
  - 256 KB~4 MB, 4~16-way set associative
- e.g., the Pentium 4 has its L1 and L2 caches on the same chip
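The benefit of an L2 cache can be sketched with the average-access-time formula applied level by level. All numbers here are illustrative assumptions in the ranges the slide gives (1-cycle L1 hit, 25-cycle L2 hit), not measurements.

```python
# Two-level AMAT sketch with assumed parameters:
# L1: 1-cycle hit, 5% miss rate; L2: 25-cycle hit, 25% local miss rate;
# main memory penalty: 100 cycles.
l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate = 25, 0.25
mem_penalty = 100

# An L1 miss costs the L2 access, plus memory on an L2 miss.
amat_with_l2 = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)
amat_without_l2 = l1_hit + l1_miss_rate * mem_penalty

print(amat_with_l2)     # ≈ 3.5 cycles
print(amat_without_l2)  # ≈ 6.0 cycles: the L2 cuts the effective penalty
```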
Summary

- Performance with caches: reduce the average access time (= hit time + miss rate × miss penalty)
- Reduce hit time
  - Small cache with low associativity (direct-mapped cache)
- Reduce miss rate
  - Capacity misses: use a larger cache (could increase hit time)
  - Compulsory misses: use larger blocks (could increase conflict misses)
  - Conflict misses: use higher associativity (could increase hit time)
  - Select appropriate cache size, block size, and associativity
- Reduce miss penalty
  - A multilevel cache reduces the miss penalty of the primary cache