Lecture 13 Cache Storage System


1 Lecture 13 Cache Storage System
CS510 Computer Architectures

2 4 Questions for Cache Storage Designers
Q1: Where can a block be placed in the cache? (Block placement)
Q2: How is a block found if it is in the cache? (Block identification)
Q3: Which block should be replaced on a cache miss? (Block replacement)
Q4: What happens on a cache write? (Write strategy)

3 Example: Alpha 21064 Data Cache
[Figure: Alpha 21064 direct-mapped data cache. The 34-bit address splits into a 21-bit tag, an 8-bit index (256 blocks = 8192 bytes / (32 bytes x 1)), and a 5-bit block offset. Each of the 256 blocks holds a valid bit <1>, a tag <21>, and data <256>. If the valid bit is 1, the stored tag is compared with the address tag; a 4:1 MUX selects the word; a hit sends the load signal to the CPU, and writes go through a write buffer to the lower-level memory.]

4 Example: Alpha 21064 Data Cache
READ takes four steps:
[1] The CPU sends the 34-bit address to the cache for tag comparison
[2] The index selects the tag to be compared with the tag in the address from the CPU
[3] The tags are compared, provided the valid bit in the directory is set
[4] If the tags match, signal the CPU to load the data
These four steps take 2 clock cycles; instructions issued in these 2 cycles must be stalled.
WRITE: the first three steps, plus the data write.
[Figure: the same Alpha 21064 data cache datapath as on the previous slide.]
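To make the address breakdown concrete, here is a minimal C sketch of how the 34-bit address splits into the tag, index, and block-offset fields above (the address value is hypothetical; this models only the field extraction, not the cache itself):

#include <stdint.h>
#include <stdio.h>

/* Field widths from the 21064 D-cache slides: 34-bit address,
   32-byte blocks (5 offset bits), 256 blocks (8 index bits), 21 tag bits. */
#define OFFSET_BITS 5
#define INDEX_BITS  8

int main(void) {
    uint64_t addr = 0x2A54C7ULL;   /* hypothetical 34-bit address */
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}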

5 CS510 Computer Architectures
Writes in the Alpha 21064 (write-through cache): a write does not complete to memory in 2 clock cycles. Instead, the data is written into the write buffer within 2 clock cycles, and from there it must be stored to memory while the CPU continues working.
No-write merging vs. write merging in the write buffer:
[Figure: without merging, 4 writes to addresses 100, 104, 108, and 112 occupy 4 buffer entries; with merging, the 4 writes share a single entry. Each entry holds 4 sequential words with valid bits, so the buffer can hold 16 sequential writes in all.]
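A minimal sketch of the merging rule in C, assuming 4 entries of four 32-bit words each as in the figure (entry layout and names are illustrative, not the 21064's actual buffer):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define ENTRIES 4
#define WORDS   4

struct wb_entry { uint32_t base; bool valid[WORDS]; uint32_t data[WORDS]; bool used; };
static struct wb_entry wb[ENTRIES];

/* Returns true if the write was buffered (merged or placed in a free entry). */
bool wb_write(uint32_t addr, uint32_t value) {
    uint32_t base = addr & ~(uint32_t)(WORDS * 4 - 1);  /* align to 16 bytes */
    unsigned word = (addr >> 2) & (WORDS - 1);
    for (int i = 0; i < ENTRIES; i++)                   /* try to merge first */
        if (wb[i].used && wb[i].base == base) {
            wb[i].data[word] = value; wb[i].valid[word] = true; return true;
        }
    for (int i = 0; i < ENTRIES; i++)                   /* else take a free entry */
        if (!wb[i].used) {
            memset(&wb[i], 0, sizeof wb[i]);
            wb[i].used = true; wb[i].base = base;
            wb[i].data[word] = value; wb[i].valid[word] = true; return true;
        }
    return false;                                       /* buffer full: CPU must stall */
}

With this policy, the figure's four writes to 100, 104, 108, and 112 merge into a single entry instead of filling all four.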

6 Structural Hazard: Split Cache or Unified Cache?
Miss rates for separate instruction and data caches vs. a unified cache (direct mapped, 32-byte blocks, SPEC92, DECstation 5000; instruction references are about 75% of all references):

Size     Instruction Cache   Data Cache   Unified Cache
1 KB     3.06%               24.61%       13.34%
2 KB     2.26%               20.57%        9.78%
4 KB     1.78%               15.94%        7.24%
8 KB     1.10%               10.19%        4.57%
16 KB    0.64%                6.47%        2.87%
32 KB    0.39%                4.82%        1.99%
64 KB    0.15%                3.77%        1.35%
128 KB   0.02%                2.88%        0.95%

7 Example: Miss Rate and Average Access Time
Compare a 16 KB I-cache plus a 16 KB D-cache vs. a 32 KB unified cache.
Assume a hit takes 1 clock cycle and a miss takes 50 clock cycles.
A load or store hit takes 1 extra clock cycle on the unified cache, since there is only one cache port to satisfy two simultaneous requests (1 instruction fetch and 1 load or store).
Integer program mix: loads 26%, stores 9%.
Answer:
instruction accesses: 100% / (100% + 26% + 9%) = 75%
data accesses: (26% + 9%) / (100% + 26% + 9%) = 25%
overall miss rate for the split caches: (75% x 0.64%) + (25% x 6.47%) = 2.10%
The unified cache has a slightly lower miss rate of 1.99%.
(continued on the next slide)

8 Example: Miss Rate and Average Access Time
Average access time = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)

16 KB split caches:
average access time_split = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 0.990 + 1.059 = 2.05

32 KB unified cache (data accesses pay the extra stall cycle, since the single port is busy with the instruction fetch):
average access time_unified = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 1.496 + 0.749 = 2.24

The split caches in this example have a better average access time than the single-ported unified cache, even though their effective miss rate is higher.
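The arithmetic can be checked with a few lines of C (the numbers are the ones from this example):

#include <stdio.h>

int main(void) {
    double penalty = 50.0;  /* miss penalty in clock cycles */
    /* Split: 75% instruction refs (0.64% misses), 25% data refs (6.47%). */
    double split = 0.75 * (1 + 0.0064 * penalty)
                 + 0.25 * (1 + 0.0647 * penalty);
    /* Unified: 1.99% miss rate, plus 1 extra hit cycle on data accesses. */
    double unified = 0.75 * (1 + 0.0199 * penalty)
                   + 0.25 * (2 + 0.0199 * penalty);
    printf("split = %.3f, unified = %.3f\n", split, unified);
    /* prints split = 2.049, unified = 2.245 -- the slide's 2.05 and 2.24 */
    return 0;
}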

9 CS510 Computer Architectures
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Combining reads and writes into a single miss rate and penalty:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

10 CS510 Computer Architectures
Cache Performance
CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
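As a sketch, plugging the direct-mapped numbers from the next slide's example into this formula (with the miss penalty expressed in cycles: 70 ns / 2 ns = 35):

#include <stdio.h>

int main(void) {
    double ic = 1e9;             /* hypothetical instruction count */
    double cpi_exec = 2.0;       /* CPI with a perfect cache */
    double accesses = 1.3;       /* memory accesses per instruction */
    double miss_rate = 0.014, penalty = 35.0, cycle_ns = 2.0;
    double cpu_s = ic * (cpi_exec + accesses * miss_rate * penalty)
                      * cycle_ns * 1e-9;
    printf("CPU time = %.2f s\n", cpu_s);  /* 5.27 s, i.e. 5.27 x IC ns */
    return 0;
}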

11 CS510 Computer Architectures
Cache Performance
What is the impact of two different cache organizations (direct-mapped vs. set-associative) on the performance of a CPU?
Assume:
CPI with a perfect cache = 2.0; clock cycle time (= cache access time) = 2 ns
1.3 memory references per instruction
64 KB cache, 32-byte blocks
one cache is direct mapped, the other is two-way set associative
the CPU clock cycle time must be stretched 1.10 times to accommodate the selection MUX of the set-associative cache (2 ns x 1.1)
cache miss penalty = 70 ns for either cache
miss rates: direct mapped 1.4%, two-way set associative 1.0%

12 2-way Set Associative, Address to Select Word
Two sets of address tags and data RAM; address bits select the correct data RAM.
[Figure: 2-way set-associative cache, address used to select the word. Block address = 22-bit tag + 7-bit index, plus a 5-bit block offset. The index selects one block from each set (Set 0 and Set 1); each block holds valid <1>, tag <22>, and data <64>. Two comparators check the tags, a 2:1 MUX picks the hitting set's data, and a write buffer connects to the lower-level memory.]

13 CS510 Computer Architectures
Cache Performance
Answer
Average access time = Hit time + Miss rate x Miss penalty
Average access time_direct = 2 + (0.014 x 70) = 2.98 ns
Average access time_2-way = 2 x 1.10 + (0.010 x 70) = 2.90 ns
CPU performance:
CPU time = IC x (CPI_exec x Clock cycle time + Memory references per instruction x Miss rate x Miss penalty)
CPU time_direct = IC x (2.0 x 2 + (1.3 x 0.014 x 70)) = 5.27 x IC ns
CPU time_2-way = IC x (2.0 x 2 x 1.10 + (1.3 x 0.010 x 70)) = 5.31 x IC ns
Although its average access time is worse, the direct-mapped cache leads to slightly better average performance.

14 Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty (ns or clock cycles)
Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.


17 Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

18 Cache Performance Improvement Reducing Miss Rate

19 CS510 Computer Architectures
Reducing Misses
Classifying misses: the 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (Misses in an infinite cache)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of this size)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set-associative cache of this size)

20 CS510 Computer Architectures
[Figure: 3Cs absolute miss rate vs. cache size (1 KB to 128 KB). Miss rate per type runs from 0 to 0.14; curves for 1-way (DM), 2-way, 4-way, and 8-way show the conflict component shrinking with associativity, on top of the capacity and compulsory components.]

21 2:1 Cache Rule on Set Associativity
2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
[Figure: the same 3Cs miss-rate plot (cache sizes 1 KB to 128 KB, 1-way through 8-way, with capacity, compulsory, and conflict components) illustrating the rule.]

22 3Cs Relative Miss Rate (Scaled To Direct-Mapped Miss Ratio)
[Figure: 3Cs relative miss rate, scaled to the direct-mapped miss ratio (0% to 100%), for cache sizes 1 KB to 128 KB and 1-way through 8-way associativity, broken into capacity, compulsory, and conflict components.]

23 CS510 Computer Architectures
How can we reduce misses?
Change the block size? Which of the 3Cs is affected?
Change the associativity? Which of the 3Cs is affected?
Change the compiler? Which of the 3Cs is affected?

24 1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache capacities of 1K, 4K, 16K, 64K, and 256K.]
For larger caches, the miss rate falls as the block size increases. But the block size should not be large relative to the cache capacity; that will instead hurt the miss rate.

25 2. Reduce Misses via Higher Associativity
2:1 cache rule: miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
Beware: execution time is the only final measure!
Will the clock cycle time increase?
Hill [1988] suggested hit times increase about 10% for TTL or ECL board-level (external) caches and about 2% for CMOS internal caches, for 2-way vs. 1-way.

26 Example: AMAT vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the direct-mapped CCT; 32-byte blocks.
[Table: average memory access time by cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity.]
For small caches, AMAT is improved by more associativity; for large caches, AMAT is not improved significantly by more associativity.

27 3. Reducing Misses via Victim Cache
How do we combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer, the victim cache, that holds data discarded from the cache and is checked alongside it.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses of a 4 KB direct-mapped data cache.
[Figure: direct-mapped cache with tag comparator, plus a small fully associative victim cache between it and the lower-level memory; a write buffer sits on the path to memory.]
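A toy C model of the lookup-and-swap behavior (tags only; the 4 KB direct-mapped geometry matches the slide, but field widths and names are illustrative, not Jouppi's exact design):

#include <stdint.h>
#include <stdbool.h>

#define SETS 128      /* 4 KB direct mapped = 128 sets x 32-byte blocks */
#define VICTIMS 4

static struct { bool valid; uint32_t tag; } cache[SETS];
static struct { bool valid; uint32_t tag; uint32_t index; } victim[VICTIMS];
static int victim_next;   /* simple FIFO replacement in the victim cache */

bool cache_access(uint32_t addr) {
    uint32_t index = (addr >> 5) & (SETS - 1);   /* 5 offset bits, 7 index bits */
    uint32_t tag = addr >> 12;
    if (cache[index].valid && cache[index].tag == tag)
        return true;                             /* ordinary fast hit */
    for (int i = 0; i < VICTIMS; i++)            /* probe the victim cache */
        if (victim[i].valid && victim[i].index == index && victim[i].tag == tag) {
            /* swap: the promoted block goes back into the main cache */
            uint32_t t = cache[index].tag; bool v = cache[index].valid;
            cache[index].tag = tag; cache[index].valid = true;
            victim[i].tag = t; victim[i].valid = v;
            return true;                         /* slower hit, but no memory access */
        }
    /* miss: the displaced block is saved in the victim cache, not discarded */
    victim[victim_next].valid = cache[index].valid;
    victim[victim_next].tag = cache[index].tag;
    victim[victim_next].index = index;
    victim_next = (victim_next + 1) % VICTIMS;
    cache[index].valid = true; cache[index].tag = tag;
    return false;
}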

28 4. Reducing Misses via Pseudo-Associativity
How do we combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss in the direct-mapped lookup, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (a slow hit). A lookup thus costs one of: Hit time, Pseudo-hit time, or Miss penalty.
Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles; better for caches not tied directly to the processor. One common second-probe convention is sketched below.
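A common way to pick the second ("pseudo") set is to invert the most significant index bit; a one-line C sketch, assuming that convention (the slide itself does not fix one):

/* On a miss in the primary set, probe the set whose top index bit is flipped. */
unsigned pseudo_set(unsigned index, unsigned index_bits) {
    return index ^ (1u << (index_bits - 1));   /* invert the MSB of the index */
}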

29 5. Reducing Misses by HW Prefetching
Prefetch into the cache or into an external buffer.
E.g., instruction prefetching: the Alpha fetches 2 consecutive blocks on a miss; the extra prefetched block is placed in the instruction stream buffer, which is checked on a miss.
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 stream buffers caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches
Prefetching relies on extra memory bandwidth that can be used without penalty.

30 6. Reducing Misses by Compiler Prefetching of Data
Prefetch instructions request the data in advance.
Data prefetch comes in two forms:
Register prefetch: load the value into a register (HP PA-RISC loads)
Cache prefetch: load the data into the cache only (MIPS IV, PowerPC, SPARC v9)
These special prefetch instructions cannot cause faults; a nonfaulting (nonbinding) prefetch does not change register or memory contents and cannot cause exceptions: a form of speculative execution. (A faulting prefetch, by contrast, is an ordinary load, which can raise exceptions.)
Prefetching makes sense only if a nonblocking (lockup-free) cache is used, i.e., the processor can proceed while waiting for the prefetched data.
Issuing prefetch instructions takes time: is the cost of issuing prefetches < the savings from reduced misses?

31 CS510 Computer Architectures
Big Picture
The challenge in designing memory hierarchies is that every change that potentially improves the miss rate can also negatively affect overall performance. This combination of positive and negative effects is what makes the design of a memory hierarchy challenging.

Design change        Effect on miss rate                                Possible negative performance effect
Increase size        Decreases capacity misses                          May increase access time
Increase assoc       Decreases miss rate due to conflict misses         May increase access time
Increase blk size    Decreases miss rate for a wide range of blk sizes  May increase miss penalty

32 Compiler Prefetching of Data
a and b are 8 bytes long -> 2 array elements per (16-byte) block

for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];

Analysis
Spatial locality exists in a but not in b.
Misses due to a: a miss on every other a[i][j] (2 elements per block):
a[0][0], a[0][1], ..., a[0][99], a[1][0], a[1][1], ..., a[1][99], a[2][0], ..., a[2][99]
Total: 3 x 100 / 2 = 150 misses
Misses due to b (ignoring potential conflict misses): the access stream when i = 0 is
b[0][0] b[1][0] b[1][0] b[2][0] b[2][0] b[3][0] ... b[99][0] b[99][0] b[100][0]
giving 100 misses on b[j+1][0], plus 1 miss on b[j][0] (when j = 0); every access to b hits when i = 1 and i = 2.
Total: 101 misses
Overall total: 251 misses

33 Compiler Prefetching of Data
Assume the miss penalty is so large that we must prefetch at least seven iterations in advance.
Split the i-loop into i = 0 (a j-loop that also fetches all of b) and the remaining iterations:

for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
    }

The first loop prefetches b[7][0], b[8][0], ..., b[100][0] and a[0][7], a[0][8], ..., a[0][99]; the first seven iterations still miss on a (ceil(7/2) = 4 misses) and on b (7 misses):
a[0][0] a[0][1] a[0][2] a[0][3] a[0][4] a[0][5] a[0][6] ...
b[0][0] b[1][0] b[1][0] b[2][0] ... b[5][0] b[6][0] b[6][0] b[7][0] ...
The second loop prefetches a[1][7], ..., a[1][99], ..., a[2][99], leaving ceil(7/2) x 2 misses on a.
Total misses: ceil(7/2) x 3 = 12 misses on a, plus 7 misses on b, = 19 misses (vs. 251 without prefetching).
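For reference, the same loops can be written in compilable C with GCC/Clang's __builtin_prefetch (the slide's prefetch() is generic pseudo-code). The array padding below is an assumption needed to keep the prefetch addresses legal in C; the slide's nonfaulting prefetch has no such constraint:

/* Padded beyond the slide's 3x100 and 101x3 shapes so &a[0][j+7] and
   &b[j+7][0] stay in bounds for all j < 100. */
double a[3][107], b[108][3];

void compute_with_prefetch(void) {
    for (int j = 0; j < 100; j++) {
        __builtin_prefetch(&b[j + 7][0]);      /* fetch b seven iterations ahead */
        __builtin_prefetch(&a[0][j + 7]);      /* likewise for a's first row */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            __builtin_prefetch(&a[i][j + 7]);  /* all of b now hits in the cache */
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}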

34 7. Reducing Misses by Compiler Optimizations
Instructions
Reorder procedures in memory so as to reduce misses
Use profiling to look at conflicts
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks
Data
Merging arrays: improve spatial locality with a single array of compound elements instead of 2 arrays
Loop interchange: change the nesting of loops to access data in the order it is stored in memory
Loop fusion: combine 2 independent loops that have the same looping structure and overlapping variables
Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows

35 Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

This reduces conflicts between val and key.

36 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Before, the processing sequence x[0][0], x[1][0], ..., x[n][0], x[0][1], x[1][1], ..., x[n][1], ... has no spatial locality. After interchange, the sequence x[0][0], x[0][1], ..., x[0][m], x[1][0], x[1][1], ..., x[1][m], ... has spatial locality: sequential accesses instead of striding through memory every 100 words.

37 CS510 Computer Architectures
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before: 2 misses per access to a and c. After: 1 miss per access.

38 CS510 Computer Architectures
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

The two inner loops:
read all N x N elements of z[][]
read N elements of 1 row of y[][] repeatedly
write N elements of 1 row of x[][]
Capacity misses are a function of N and the cache size: if the cache can store all 3 N x N matrices, there are no capacity misses (conflict misses are still possible).
Idea: compute on B x B sub-matrices.
[Figure: access patterns of x (by i, j), y (by i, k), and z (by k, j) for the unblocked loops.]

39 CS510 Computer Architectures
Blocking Example
Blocking is useful to avoid capacity misses when dealing with large arrays.
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2.
B is called the blocking factor (B = 3 in the figure).
Conflict misses too? (see the next slide)
[Figure: blocked access patterns of x, y, and z.]

40 Reducing Conflict Misses by Blocking
Blocking reduces capacity misses; conflict misses can also be reduced, by increasing associativity or by choosing a blocking factor smaller than the capacity alone would allow.
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of 48, despite both fitting in the cache.
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct-mapped cache, showing the impact of conflict misses on the blocking factor in caches that are not fully associative.]

41 Summary: Compiler Optimization to Reduce Cache Misses
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]

42 CS510 Computer Architectures
Summary
3 Cs: Compulsory, Capacity, and Conflict misses
Reducing the miss rate:
1. Reduce misses via larger block size
2. Reduce misses via higher associativity
3. Reduce misses via a victim cache
4. Reduce misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce misses by compiler optimizations
Remember the danger of concentrating on just one parameter when evaluating performance.


44 Cache Performance Improvement Reducing Miss Penalty

45 Reducing Miss Penalty: 1. Read Priority over Write on Miss
Write-through caches with write buffers create RAW conflicts between main-memory reads on cache misses and the buffered writes.
If we simply wait for the write buffer to empty, the read miss penalty may increase by 50% (old MIPS 1000).
Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue (a sketch of this check follows the slide).
Write back? Consider a read miss replacing a dirty block:
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
The CPU stalls less, since it restarts as soon as the read is done.
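A minimal sketch of the check, assuming a simple 4-entry buffer of single-word writes (names and layout are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4
static struct { bool valid; uint32_t addr; uint32_t data; } wbuf[WB_ENTRIES];

/* On a read miss, scan the write buffer; on a match, forward the buffered
   data and skip waiting for the buffer to drain. */
bool read_miss_check(uint32_t addr, uint32_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            *out = wbuf[i].data;   /* RAW hazard resolved by forwarding */
            return true;
        }
    return false;                  /* no conflict: the memory read may proceed */
}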

46 Reducing Miss Penalty: 2. Subblock Placement
Don't have to load the full block on a miss.
Keep valid bits per sub-block to indicate which parts are valid.
(Originally invented to reduce tag storage.)
[Figure: blocks with tags 100, 300, 200, and 204 and per-sub-block valid bits, showing partially valid blocks.]

47 Reducing Miss Penalty: 3. Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution while the remainder of the block loads.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
Generally useful only for large blocks. Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart actually helps.

48 CS510 Computer Architectures
Reducing Miss Penalty: 4. Non-blocking Caches to Reduce Stalls on Misses
A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss.
"Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
This significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.

49 Value of Hit under Miss for SPEC
[Figure: average memory access time ratio for hit-under-i-misses (0->1, 1->2, 2->64) relative to the blocking base cache, across 18 SPEC92 benchmarks (eqntott through spice2g6 and ora); 8 KB direct-mapped data cache, 32-byte blocks, 16-cycle miss penalty. On average the ratio falls to 0.26 for the floating-point programs and to 0.19 for the integer programs.]

50 Reducing Miss Penalty: 5. Second Level Cache
[Figure: memory hierarchy with the CPU, L1 cache, L2 cache, and main memory.]

51 Reducing Miss Penalty: 5. Second Level Cache
L2 Equations
AMAT = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 x Miss penalty_L2
AMAT = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)

52 Second Level Cache: An Example
Out of 1000 memory references, suppose 40 miss in L1 and 20 miss in L2. Then:
the miss rate (either local or global) for L1: 40/1000 = 4%
the local miss rate for L2: 20/40 = 50%
the global miss rate for L2: 20/1000 = 2%
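The same arithmetic in a few lines of C:

#include <stdio.h>

int main(void) {
    double refs = 1000.0, l1_misses = 40.0, l2_misses = 20.0;
    printf("L1 miss rate        = %.0f%%\n", 100 * l1_misses / refs);      /* 4% */
    printf("L2 local miss rate  = %.0f%%\n", 100 * l2_misses / l1_misses); /* 50% */
    printf("L2 global miss rate = %.0f%%\n", 100 * l2_misses / refs);      /* 2% */
    return 0;
}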

53 Comparing Local and Global Miss Rates
[Figure: local vs. global L2 miss rates on linear and log scales, for a 32 KB first-level cache and increasing second-level cache size.]
The global miss rate comes close to the single-level cache rate, provided L2 >> L1.
Don't use the local miss rate: it is a function of the L1 miss rate; use the L2 global miss rate instead.
L2 is not tied to the CPU clock cycle, so it affects only the L1 miss penalty, not the CPU clock cycle.
In L2 design, two questions must be considered: will it lower AMAT, and how much will it cost?
Generally aim for fast hit times and fewer misses; since L2 hits are few, the target is reducing the miss rate.

54 Reducing Misses: Which Apply to L2 Cache?
Reducing the miss rate:
1. Reduce misses via larger block size
2. Reduce conflict misses via higher associativity
3. Reduce conflict misses via a victim cache
4. Reduce conflict misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce capacity/conflict misses by compiler optimizations

55 The Impact of L2 Cache Associativity on the Miss Penalty
Does set associativity make more sense for L2 caches?
Assume two-way set associativity increases the L2 hit time by 10% of a clock cycle (0.1 clock cycles), and:
hit time_L2 for direct mapped = 10 cycles
local miss rate_L2 for direct mapped = 25%
local miss rate_L2 for 2-way SA = 20%
miss penalty_L2 = 50 cycles
Then the first-level cache miss penalty (Miss penalty_L1 = Hit time_L2 + Miss rate_L2 x Miss penalty_L2) is:
Miss penalty 1-way_L1 = 10 + 25% x 50 = 22.5 clock cycles
Miss penalty 2-way_L1 = 10.1 + 20% x 50 = 20.1 clock cycles
Worst case (hit times must be an integral number of clocks, so 10.1 rounds up to 11):
Miss penalty 2-way_L1 = 11 + 20% x 50 = 21.0 clock cycles

56 L2 Cache Block Size and AMAT
[Figure: relative CPU execution time (1.0 to 2.0) vs. L2 block size for a 512 KB L2 cache: 1.36, 1.28, 1.27, 1.34, 1.54, and 1.95 for blocks of 16, 32, 64, 128, 256, and 512 bytes.]
In a small L2 cache, conflict misses increase as the block size grows; but L2 capacity is large here, and 512 KB is a popular size.
Memory bus = 32 bits, with a long memory access time: 1 clock to send the address, 6 clocks to access the data, then 1 word per clock.

57 CS510 Computer Architectures
L2 Cache Block Size
Multilevel inclusion property: all data in the first-level cache are always in the second-level cache.
Consistency between I/O and the caches (or between caches in multiprocessors) can then be determined by checking the L2 cache alone.
Drawback: different block sizes between caches at different levels.
Usually, a smaller cache has smaller blocks, so L1 and L2 block sizes differ.
To still maintain the multilevel inclusion property, a miss requires complex invalidations and a non-blocking secondary cache.

58 Reducing Miss Penalty Summary
Five techniques:
read priority over write on miss
sub-block placement
early restart and critical word first on miss
non-blocking caches (hit under miss)
second-level caches
The techniques can be applied recursively to multilevel caches.
The danger is that the time to access DRAM will increase with multiple levels in between.


60 Cache Performance Improvement Fast Hit Time

61 Fast Hit Times: 1. Small and Simple Caches
Most of the cache hit time is spent reading the tag memory, using the INDEX part of the address, and comparing.
Smaller and simpler hardware is faster:
Smaller cache: an on-chip cache, or on-chip tags with off-chip data
Simpler cache: a direct-mapped cache allows the data to be transmitted while the tag is checked
The Alpha has an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache.

62 Fast Hit Times: 2. Fast Hits by Avoiding Address Translation
Send the virtual address to the cache? This is called a virtually addressed cache, or just virtual cache, as opposed to a physical cache.
Make the common case fast: the common case is a hit, so eliminate the virtual-to-physical translation on hits.
Problems:
Every time a process is switched, the cache logically must be flushed; otherwise we get false hits on the same virtual addresses. The cost is the time to flush plus the compulsory misses from the empty cache.
Aliases (sometimes called synonyms): the OS, as well as user programs, can map two different virtual addresses to the same physical address.
I/O uses physical addresses and must interact with the (virtually addressed) cache.

63 Solutions to Cache Flush and Alias
Solutions to aliases:
HW anti-aliasing: guarantee that every cache block has a unique physical address.
SW: guarantee that the lower n bits of all aliases are the same, with n large enough to cover the index field; for a direct-mapped cache the aliases then map to a unique block. Page coloring makes the least significant several bits of the physical and virtual page addresses identical.
Solution to cache flushing:
Add a process identifier (PID) to the address tag, so the tag identifies the process as well as the address within the process: there cannot be a hit on the wrong process.

64 Conventional Virtually Addressed Cache
[Figure: CPU -> TLB -> cache -> memory; the virtual address (VA) is translated first, and the cache is accessed with the physical address (PA).]
1. Translate the virtual address into a physical address.
2. Then access the cache with the physical address.
Hit time is slow.

65 Alternative Virtually Addressed Cache
[Figure: CPU -> cache (virtual tags) -> TLB -> memory; the cache is accessed directly with the virtual address.]
1. Access the cache with the virtual address.
2. Only on a miss, translate the virtual address for the memory access.
Problems: synonyms, and a higher miss penalty.

66 Cache Access and VA Translation Overlapping
[Figure: the virtual address = virtual page address + page offset; the cache is indexed with the page offset while the TLB translates the virtual page address; on a miss, the L2 cache is accessed with the physical address.]
1. Access the cache and translate the virtual address simultaneously.
2. On a miss, access the L2 cache with the physical address.
This requires the cache index to remain invariant across translation, i.e., the index must come from the physical (page-offset) part of the address.

67 2. Avoiding Translation: Process ID impact
[Figure: miss rate (0% to 20%) vs. cache size (2 KB to 1024 KB) for a direct-mapped cache with 16-byte blocks, Ultrix running on a VAX. Green: uniprocess, no process switches (lowest). Red: multiprocess, process switches using PIDs instead of flushing. Blue: multiprocess, process switches that simply flush the cache (highest).]

68 2. Avoiding Translation: Index with Physical Portion of Address
If the INDEX is in the physical part of the address, the tag access can start in parallel with address translation, and the comparison is then made against the physical tag.
[Figure: address = page address (the address tag) + page offset (index + block offset); address translation overlaps the tag access.]
This limits the cache size to the page size times the associativity: what if we want bigger caches using the same trick?
Higher associativity enlarges the cache without growing the index.
Page coloring: the OS makes the last few bits of the virtual and physical page addresses identical, so the index may safely extend those few bits beyond the page offset. A sketch of the underlying constraint follows the slide.
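The basic constraint is a single inequality; a sketch in C, with illustrative parameter names:

#include <stdbool.h>

/* The index and block-offset bits must fit inside the page offset,
   i.e. the bytes indexed per way must not exceed the page size;
   otherwise the cache cannot be indexed in parallel with translation
   without tricks like page coloring. */
bool can_index_during_translation(unsigned cache_bytes, unsigned assoc,
                                  unsigned page_bytes) {
    return cache_bytes / assoc <= page_bytes;
}
/* e.g. an 8 KB direct-mapped cache with 8 KB pages qualifies; a 16 KB
   cache would need 2-way associativity, or page coloring, to qualify. */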

69 Fast Hit times: 3. Fast Hit Times via Pipelined Writes
Pipeline the tag check and the cache update as separate stages: the current write does its tag check while the previous write updates the cache array.
Only writes flow through this pipeline; it is empty during a miss.
The delayed write buffer must be checked on reads; either complete the pending write first or read the data from the buffer.
[Figure: datapath with a tag comparator for the current write, a delayed write buffer holding the previous write, a MUX into the data array, and the write buffer to lower-level memory.]

70 Fast Hit times: 4. Fast Writes on Misses via Small Sub-blocks
If most writes are 1 word, the sub-block size is 1 word, and the cache is write-through, then always write the sub-block data and the tag immediately. The three cases:
Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit on again.
Tag match and valid bit not set: the tag match means this is the proper block; writing the data into the sub-block makes it appropriate to turn the valid bit on.
Tag mismatch: this is a miss, and the write modifies the data portion of the block. As this is a write-through cache, however, no harm is done; memory still has an up-to-date copy of the old value. Only the tag (to the address of the write) and the valid bits of the other sub-blocks need to be changed, because the valid bit for this sub-block has already been set.
This doesn't work with write back, due to the last case.

71 Cache Optimization Summary
Technique                               MR   MP   HT   Complexity
Larger block size                       +    -
Higher associativity                    +         -
Victim caches                           +
Pseudo-associative caches               +
HW prefetching of instr/data            +
Compiler-controlled prefetching         +
Compiler techniques to reduce misses    +
Read priority over write on miss             +
Sub-block placement                          +    +
Early restart & critical word first          +
Non-blocking caches                          +
Second-level caches                          +
Small & simple caches                   -         +
Avoiding address translation                      +    2
Pipelining writes for fast write hits             +

(+ improves the factor, - hurts it)

72 What is the Impact of What You’ve Learned About Caches?
[Figure: relative performance (log scale, 1 to 1000) of CPU vs. DRAM, 1980 to 2000; the CPU curve pulls far away from DRAM.]
1980: Speed = f(no. of operations)
1995: pipelined execution and fast clock rates, out-of-order completion, superscalar instruction issue
1995: Speed = f(non-cached memory accesses)
What does this mean for compilers? Operating systems? Algorithms? Data structures?

73 CS510 Computer Architectures
Cross-Cutting Issues
Parallel execution vs. cache locality: we want operations far apart to find independent work, yet we want reuse of data accesses to avoid misses.
I/O and consistency of data between cache and memory: caches mean multiple copies of data. Consistency by HW or by SW? Where should I/O connect to the computer?

74 CS510 Computer Architectures
Alpha 21064
Separate instruction and data TLBs and caches
TLBs are fully associative; TLB updates are done in SW (the Privileged Architecture Library, PALcode)
Caches are 8 KB, direct mapped
Critical 8 bytes first
Prefetch instruction stream buffer
2 MB L2 cache, direct mapped (off-chip)
256-bit path to main memory, 4 x 64-bit modules

75 Alpha Memory Performance: Miss Rates
[Figure: miss rates on a log scale (0.01% to 100%) for the 8 KB instruction cache, 8 KB data cache, and 2 MB L2 cache, across AlphaSort, TPC-B (db1 and db2), and SPEC92 programs (Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d, Spice, Nasa7, Su2cor).]

76 CS510 Computer Architectures
Alpha CPI Components
Primary limitations of performance: instruction stalls, chiefly branch mispredicts; "other" covers compute time plus register and structural conflicts.
[Figure: CPI (0 to 4.5) broken into L2, I$, D$, I-stall, and other components, for the commercial workloads (AlphaSort, TPC-B) and the SPEC92 programs.]

77 CS510 Computer Architectures
Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)
The miss rate varies widely across programs:
[Figure: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for the instruction and data caches of tomcatv, gcc, and espresso.]
Is the 4 KB data cache miss rate 8%, 12%, or 28%?
Is the 1 KB instruction cache miss rate 0%, 3%, or 10%?
Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10% D$ miss rate.

78 Pitfall: Simulating too Small an Address Trace
[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (1 to 12 billion) for SOR (FORTRAN), Tree (Scheme), Multi (a multiprogrammed workload), and TV (Pascal); the curves keep shifting for billions of instructions, so short traces are misleading.]

