ECE 463/563, Microprocessor Architecture, Fall 2018
Reducing miss rate: cache dimensions, prefetching, loop transformations
Prof. Eric Rotenberg
Reduce Miss Rate
- Cache size, associativity, block size
- Prefetching: hardware, software
- Transform the program to increase locality
Cache size
- Increasing cache size decreases miss rate but increases hit time
- Miss rate asymptotically approaches just the compulsory miss rate: at some point, the cache becomes large enough to eliminate capacity and conflict misses (and this point tends to be reached sooner with higher associativity)
- In the diminishing-returns region, a small decrease in miss rate may not justify (1) a big increase in hit time and (2) taking chip area away from other units

[Figure: miss rate vs. log(cache size), flattening into a "diminishing returns" region at large sizes]
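A quick way to see when this trade-off stops paying off is the standard average memory access time (AMAT) equation; the miss rates and latencies below are hypothetical:

    AMAT = hit time + miss rate x miss penalty

    smaller cache: 1 cycle  + 0.040 x 100 cycles = 5.0 cycles
    doubled cache: 2 cycles + 0.032 x 100 cycles = 5.2 cycles

Here the doubled cache wins on miss rate but loses on AMAT, because its slower hit time is paid on every access.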
Associativity
- Increase associativity (for a fixed cache size)
- In general, for the same cache size, increasing associativity tends to decrease miss rate (it decreases conflict misses)
- It may increase hit time, for the same cache size; energy per access must also be considered

[Figure: miss rate vs. log(associativity), showing diminishing returns; 4-way or 8-way set-associative is almost equivalent to fully-associative in many cases]
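As a concrete (hypothetical) illustration of why higher associativity leaves capacity unchanged but affects the access path, this small sketch computes the organization of a fixed-size cache at several associativities; the 32 KB size and 64 B block are assumptions:

    #include <stdio.h>

    int main(void) {
        const unsigned size_bytes  = 32 * 1024;  /* fixed 32 KB cache (hypothetical) */
        const unsigned block_bytes = 64;         /* 64 B blocks (hypothetical) */

        for (unsigned ways = 1; ways <= 16; ways *= 2) {
            unsigned sets = size_bytes / (block_bytes * ways);
            /* Capacity is identical in every row; higher associativity just
               trades sets for ways, i.e., more tags compared in parallel on
               each access (more energy, possibly longer hit time). */
            printf("%2u-way: %4u sets, %2u tag comparisons per access\n",
                   ways, sets, ways);
        }
        return 0;
    }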
Block size
- Increase block size (for a fixed cache size)
- Miss rate may decrease, up to a point, due to exploiting more spatial locality
- Miss rate may increase past that point, due to cache pollution
- For a fixed cache size, a side effect of larger blocks is having fewer total blocks in the cache. It's a trade-off between hits on consecutive bytes (fewer, larger blocks) and hits on non-consecutive bytes (more, smaller blocks). At some point, you exhaust all the spatial locality, and increasing block size further only takes cache space away from useful bytes in other blocks.
- A secondary drawback of a larger block is that it increases miss penalty (more bytes to fetch)

[Figure: miss rate vs. block size, U-shaped; the rise at large block sizes is labeled "cache pollution"]
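To make the "fewer total blocks" side effect concrete (the sizes are hypothetical):

    32 KB cache, 32 B blocks:  32768 / 32  = 1024 blocks
    32 KB cache, 256 B blocks: 32768 / 256 =  128 blocks

An 8x larger block leaves one-eighth as many slots for non-contiguous data, and each miss now fetches 8x as many bytes.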
Prefetching
- Idea: get it before you need it
- Prefetching can be implemented in hardware, software (e.g., by the compiler), or both
Hardware prefetching
- General idea:
  - An autonomous hardware prefetcher sits alongside the cache
  - It predicts which blocks may be accessed in the future
  - It prefetches these predicted blocks
- Simplest hardware prefetchers: stride prefetchers
  - +1 prefetch (stride = 1): fetch the missing block and the next sequential block. Works well for streams with high sequential locality, e.g., instruction caches.
  - +n prefetch (stride = n): observe that memory is being accessed every n blocks, so prefetch block +n. Example of code with this behavior (with four array elements per block, as laid out below, the loop touches every second block, so n = 2):

        for (i = 1; i < MAX; i += 8)
            a[i] = b[i];

    Memory layout of b[ ]:

        block X:   b[0]  b[1]  b[2]  b[3]
        block X+1: b[4]  b[5]  b[6]  b[7]
        block X+2: b[8]  b[9]  b[10] b[11]
        block X+3: b[12] b[13] b[14] b[15]
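A minimal C sketch of the stride idea, under stated assumptions (the structure, field names, and the issue_prefetch() hook are illustrative, not the slide's hardware): on each miss it compares the current stride, in units of blocks, against the previous one, and prefetches ahead once the same stride repeats.

    /* Toy stride prefetcher: watches the sequence of missed block numbers. */
    typedef struct {
        long last_block;   /* block number of the previous miss */
        long last_stride;  /* stride between the previous two misses */
    } stride_pf;

    /* Hypothetical hook into the memory hierarchy. */
    static void issue_prefetch(long block) { (void)block; /* stub */ }

    static void on_cache_miss(stride_pf *p, long block) {
        long stride = block - p->last_block;
        if (stride != 0 && stride == p->last_stride)
            issue_prefetch(block + stride);  /* stride = 1 gives +1 prefetch;
                                                the a[i] = b[i] loop above
                                                gives stride = 2 */
        p->last_stride = stride;
        p->last_block  = block;
    }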
Software prefetching
- Needs a "prefetch" instruction
  - Its sole purpose is to calculate an address and access the cache with that address. If it hits, nothing happens. If it misses, the processor does not stall; the only thing that happens is that the cache fetches the memory block.
- Like a load instruction, except:
  - It does not delay the processor on a miss
  - It does not change the processor's architectural state in any way:
    - It has no destination register
    - It doesn't cause exceptions (we'll learn about exceptions later)
- In other words, the sole purpose of a "prefetch" instruction is to tell the cache to fetch the specified block if it doesn't already have it
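As one concrete example of such an instruction: GCC and Clang expose it through the __builtin_prefetch() builtin (the loop, names, and prefetch distance below are illustrative assumptions):

    #include <stddef.h>

    void scale(double *x, size_t n, double c) {
        const size_t k = 16;                 /* prefetch distance (hypothetical) */
        for (size_t i = 0; i < n; i++) {
            if (i + k < n)                   /* don't prefetch past the array */
                __builtin_prefetch(&x[i + k], 1, 3);  /* rw = write, high locality */
            x[i] = c * x[i];
        }
    }

The builtin compiles to the target's prefetch instruction where one exists (and is dropped otherwise), matching the slide's description: no destination register, no exceptions, no stall.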
Software prefetching (cont.)
- The compiler predicts which accesses are likely to cause misses
- The compiler inserts prefetch instructions far enough ahead to prevent those accesses from missing
- The misses still occur, but they occur in advance: the prefetches miss, but the accesses targeted by the prefetches do not (if everything works as planned)

  Original loop:

    for (i = 0; i < 100; i++)
        x[i] = c * x[i];

  With prefetching:

    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

  where k depends on (1) the miss penalty and (2) the time it takes to execute an iteration assuming hits
Software prefetching (cont.)
    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

  where k depends on (1) the miss penalty and (2) the time it takes to execute an iteration assuming hits

[Figure: timeline of iterations. The CPU is currently in iteration i; the prefetch of x[i+k] issued now finishes just before iteration i+k, so the miss penalty (the time to service a miss) is overlapped with k iterations' worth of execution time, assuming cache hits. In this example, k = 11.]
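The slide's k = 11 falls out of simple arithmetic; with hypothetical latencies consistent with the picture:

    k = ceil(miss penalty / execution time per iteration, assuming hits)
      = ceil(110 cycles / 10 cycles)
      = 11

The prefetch issued during iteration i is then serviced just in time for iteration i+11's access to x[i+11].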
Potential issues with prefetching
- Cache pollution
  - Inaccurate prefetches bring in useless blocks, displacing useful ones
  - Must be careful not to increase the miss rate
  - Solution: prefetch blocks into a "stream buffer" or "candidate cache", and transfer a block into the main cache only when it is actually referenced by the program (see the sketch after this list)
- Bandwidth hog
  - Inaccurate prefetches waste bandwidth throughout the memory hierarchy
  - Must be careful that prefetch misses (prefetch traffic) do not delay demand misses (legitimate traffic)
  - Solutions:
    - Be selective: balance removing as many misses as possible against minimizing useless prefetches
    - Request queues throughout the memory hierarchy should prioritize demand misses over prefetch misses
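A minimal sketch of the stream-buffer idea, under stated assumptions (the depth, names, and FIFO policy are illustrative): prefetched blocks wait in a small FIFO beside the cache, and only a block the program actually touches is promoted into the main cache, so useless prefetches never pollute it.

    #define SB_DEPTH 4

    typedef struct {
        int valid;
        unsigned long block;          /* prefetched block number */
    } sb_entry;

    typedef struct {
        sb_entry q[SB_DEPTH];
        int head;                     /* oldest entry */
        unsigned long next;           /* next sequential block to prefetch */
    } stream_buf;

    static void sb_restart(stream_buf *sb, unsigned long start) {
        for (int i = 0; i < SB_DEPTH; i++) {
            sb->q[i].valid = 1;
            sb->q[i].block = start + i;   /* issue prefetches for the new stream */
        }
        sb->head = 0;
        sb->next = start + SB_DEPTH;
    }

    /* Called on a main-cache miss. Returns 1 if the stream buffer supplies
       the block (which is then moved into the main cache), 0 otherwise. */
    static int sb_on_miss(stream_buf *sb, unsigned long block) {
        sb_entry *h = &sb->q[sb->head];
        if (h->valid && h->block == block) {
            h->block = sb->next++;                /* refill the freed slot */
            sb->head = (sb->head + 1) % SB_DEPTH;
            return 1;                             /* promote block to cache */
        }
        sb_restart(sb, block + 1);                /* wrong stream: start over */
        return 0;                                 /* service miss from memory */
    }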
Transform program to increase locality
- Increase spatial locality
  - Explicitly place items that are accessed close in time close to each other
- Increase temporal locality
  - Transform the computation to increase the number of times items are reused before being replaced in the cache
- Examples:
  - Loop interchange
  - Loop fusion
  - Loop tiling (also called loop blocking)
- Feel free to explore loop interchange and loop fusion on your own (a small interchange example follows); we'll only cover tiling since it is quite relevant
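Since loop interchange won't be covered in lecture, here is a brief hedged sketch of it (the array name and sizes are illustrative): C stores x[i][j] in row-major order, so making j the inner loop walks memory sequentially.

    #define N 1024
    double x[N][N];

    /* Poor spatial locality: consecutive accesses are N doubles apart. */
    void before_interchange(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] *= 2.0;
    }

    /* After interchange: consecutive accesses fall in the same cache block. */
    void after_interchange(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] *= 2.0;
    }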
Tiling
- Idea: access "regions" of arrays instead of the whole array
- The problem with the original code below: each iteration of k scans the entire x[ ][ ], but all of x[i][j] can't fit in the cache, so nothing survives to be reused

    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                ... reference x[i][j] ...

- Tiled version: the tiling factor T is selected so that a T-by-T tile of x[i][j] fits in the cache

    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T)
            for (k = 0; k < N; k++)
                for (i = ii; i < min(ii+T, N); i++)
                    for (j = jj; j < min(jj+T, N); j++)
                        ... reference x[i][j] ...

[Figure: memory layout of x[ ][ ] shown in 2D, divided into T-by-T tiles numbered 1 through 4; the following slides animate ii, jj, and k to show one tile at a time (e.g., ii = 0, jj = 0: i = 0…T, j = 0…T) remaining in the cache across all N iterations of k]
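The slide's code references x[i][j] abstractly; below is a self-contained, runnable version of the same transformation (the reduction into acc[k], and the value T = 64, giving a 32 KB tile of doubles, are illustrative assumptions):

    #include <stddef.h>

    #define N 1024
    #define T 64                    /* 64 * 64 * 8 B = 32 KB tile (hypothetical) */

    static double x[N][N];
    static double acc[N];

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* Untiled: every iteration of k streams over all of x[][], so each
       element is evicted long before it is reused. */
    void untiled(void) {
        for (size_t k = 0; k < N; k++)
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++)
                    acc[k] += x[i][j];
    }

    /* Tiled: one T-by-T tile stays cache-resident while the k loop
       reuses it N times; then the code moves to the next tile. */
    void tiled(void) {
        for (size_t ii = 0; ii < N; ii += T)
            for (size_t jj = 0; jj < N; jj += T)
                for (size_t k = 0; k < N; k++)
                    for (size_t i = ii; i < min_sz(ii + T, N); i++)
                        for (size_t j = jj; j < min_sz(jj + T, N); j++)
                            acc[k] += x[i][j];
    }

Both functions compute the same result; only the order of the accesses, and hence the cache behavior, differs.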