Reuse distance as a metric for cache behavior - pdcs2001 [1] Characterization and Optimization of Cache Behavior. Kristof Beyls, Yijun Yu, Erik D’Hollander. Electronics and Information Systems, Ghent University, Belgium
Reuse distance as a metric for cache behavior - pdcs2001 [2] Presentation overview Part 1: Reuse distance as a metric for cache behavior –How accurate is reuse distance in predicting cache behavior? –How effective are compilers in removing cache misses? Part 2: Program-centric visualization of data locality –Reuse distance-based visualization –Pattern-based visualization Part 3: Cache remapping: eliminating cache conflicts in tiled codes
Reuse distance as a metric for cache behavior - pdcs2001 [3] Part 1: Introduction to the Reuse Distance (PDCS01)
Reuse distance as a metric for cache behavior - pdcs2001 [4] Overview reuse distance 1.Introduction 2.Reuse distance ↔ cache behavior 3.Effect of compiler optimization 4.Capacity miss reduction techniques 5.Summary
Reuse distance as a metric for cache behavior - pdcs2001 [6] 1. Introduction The gap between processor and memory speed widens exponentially fast –Typical: 1 memory access = 100 processor cycles Caches can deliver data more quickly, but have limited capacity Reuse distance is a metric for a program's cache performance
Reuse distance as a metric for cache behavior - pdcs2001 [7] Overview reuse distance 1.Introduction 2.Reuse distance ↔ cache behavior 3.Effect of compiler optimization 4.Capacity miss reduction techniques 5.Summary
Reuse distance as a metric for cache behavior - pdcs2001 [8] 2.a Reuse distance Definition: The reuse distance of a memory access is the number of unique addresses referenced since the last access to the requested data.
address:   A  B  C  A  B  B  A  C
distance:  ∞  ∞  ∞  2  2  0  1  2
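The definition can be checked against the example trace with a few lines of C. This is a minimal sketch, not the instrumentation used in the talk: the function name `reuse_distances`, the `REUSE_INF` sentinel and the use of one character per address are assumptions made for illustration, and the nested loops are only suitable for short traces.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel for an infinite reuse distance (first access to an address). */
#define REUSE_INF (-1L)

/*
 * Fill dist[i] with the reuse distance of trace[i]: the number of distinct
 * addresses referenced between trace[i] and the previous access to the same
 * address, or REUSE_INF if the address was never accessed before.
 * Each character of trace stands for one address.
 */
void reuse_distances(const char *trace, size_t n, long *dist)
{
    assert(trace != NULL && dist != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Walk backwards until the previous access to the same address.
         * Afterwards, start is the first index after that access,
         * or 0 if there is no previous access. */
        size_t start = i;
        while (start > 0 && trace[start - 1] != trace[i])
            start--;
        if (start == 0) {
            dist[i] = REUSE_INF;
            continue;
        }
        /* Count the distinct addresses in trace[start .. i-1]. */
        long count = 0;
        for (size_t j = start; j < i; j++) {
            int seen = 0;
            for (size_t k = start; k < j; k++)
                if (trace[k] == trace[j]) { seen = 1; break; }
            if (!seen)
                count++;
        }
        dist[i] = count;
    }
}
```

Calling `reuse_distances("ABCABBAC", 8, d)` reproduces the distances listed on the slide, with `REUSE_INF` standing in for ∞.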
Reuse distance as a metric for cache behavior - pdcs2001 [9] 2.b Reuse distance and fully associative cache Property: In a fully associative LRU cache with n cache lines, a reference will hit if the reuse distance d < n. Corollary: In any cache with n lines, a cache miss with reuse distance d is:
d < n: conflict miss
n ≤ d < ∞: capacity miss
d = ∞: cold miss
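The corollary translates directly into a small classification routine. The sketch below assumes the reuse distance is measured in cache lines and that an infinite distance is encoded as -1; the names `classify_miss`, `miss_kind` and `REUSE_INF` are hypothetical and introduced only for this example. It classifies an access that is already known to be a miss.

```c
#include <assert.h>

#define REUSE_INF (-1L)   /* assumed encoding of an infinite reuse distance */

enum miss_kind { MISS_CONFLICT, MISS_CAPACITY, MISS_COLD };

/*
 * Classify a cache miss by its reuse distance d (in cache lines)
 * for a cache with n lines, following the corollary:
 *   d = infinity      -> cold miss
 *   n <= d < infinity -> capacity miss
 *   d < n             -> conflict miss
 */
enum miss_kind classify_miss(long d, long n)
{
    assert(n > 0 && d >= REUSE_INF);
    if (d == REUSE_INF)
        return MISS_COLD;
    if (d >= n)
        return MISS_CAPACITY;
    return MISS_CONFLICT;
}
```

For a 32-line cache, a miss at distance 31 is classified as a conflict miss and a miss at distance 32 as a capacity miss.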
Reuse distance as a metric for cache behavior - pdcs2001 [10] 2.c Reuse Distance Distribution Spec95fp
Reuse distance as a metric for cache behavior - pdcs2001 [11] 2.d Classifying cache misses for SPEC95fp [Figure labels: Cache size; Conflict; Capacity]
Reuse distance as a metric for cache behavior - pdcs2001 [12] 2.e Reuse distance vs. hit probability
Reuse distance as a metric for cache behavior - pdcs2001 [13] Overview reuse distance 1.Introduction 2.Reuse distance ↔ cache behavior 3.Effect of compiler optimization 4.Capacity miss reduction techniques 5.Summary
Reuse distance as a metric for cache behavior - pdcs2001 [14] 3.a Reuse distance after optimization [Figure labels: Conflict; Capacity]
Reuse distance as a metric for cache behavior - pdcs2001 [15] 3.b Effect of compiler optimization SGIpro compiler for Itanium: 30% of conflict misses are removed, but only 1% of capacity misses are removed. Conclusion: much work remains to be done to remove the most important kind of cache misses: capacity misses.
Reuse distance as a metric for cache behavior - pdcs2001 [16] Overview reuse distance 1.Introduction 2.Reuse distance ↔ cache behavior 3.Effect of compiler optimization 4.Capacity miss reduction techniques 5.Summary
Reuse distance as a metric for cache behavior - pdcs2001 [17] 4. Capacity miss reduction 1. Hardware level –Increasing cache size (CS): reuse distance must be smaller than cache size 2. Compiler level –Loop tiling –Loop fusion 3. Algorithmic level
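As an illustration of a compiler-level technique from this list, the following sketch shows loop fusion on a toy kernel. The kernel and the function names are assumptions chosen for this example and do not come from the talk. In the unfused version each `b[i]` is reused only after the rest of `a` and `b` has been traversed, so its reuse distance grows with `n`; after fusion the reuse follows immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate loops: b[i] is written in the first loop and read again
 * in the second, after the remaining elements have been touched. */
void scale_then_add_unfused(const double *a, double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++)
        c[i] = b[i] + 1.0;
}

/* Fused loop: b[i] is reused right after it is produced,
 * so its reuse distance no longer depends on n. */
void scale_then_add_fused(const double *a, double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];
        c[i] = b[i] + 1.0;
    }
}
```

Both functions compute the same values; only the order of the memory accesses, and with it the reuse distance of `b[i]`, changes.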
Reuse distance as a metric for cache behavior - pdcs2001 [18] 4.a Algorithmic level The programmer has a better understanding of the global program structure. The programmer can change the algorithm so that long-distance reuses decrease. Visualization of the long reuse distances can help the programmer to identify poor data locality in the code.
Reuse distance as a metric for cache behavior - pdcs2001 [19] Overview reuse distance 1.Introduction 2.Reuse distance ↔ cache behavior 3.Effect of compiler optimization 4.Capacity miss reduction techniques 5.Summary
Reuse distance as a metric for cache behavior - pdcs2001 [20] 5. What did we learn? Reuse distance predicts cache behavior accurately, even for direct-mapped caches. Compiler optimizations for eliminating capacity misses are currently not powerful enough: a broad overview of the code is needed, and the programmer has that overview. Reuse distance visualization can help the programmer to identify regions with poor locality.
Reuse distance as a metric for cache behavior - pdcs2001 [21] Part 2: Program-Centric Visualization of Data Locality (IV2001)
Reuse distance as a metric for cache behavior - pdcs2001 [22] Background A program uses the cache transparently. The data locality is obscure to the programmer without visualization. Visualization by cache lines: What happens in the cache? e.g. CVT, Rivet. Visualization by references: What happens to the program? - Reuse distance-based visualization - Pattern-based visualization
Reuse distance as a metric for cache behavior - pdcs2001 [23] Reuse distance-based Visualization What is the source of the capacity misses in my code?
Reuse distance as a metric for cache behavior - pdcs2001 [24] Instrumentation Programs are instrumented to obtain the following data: –The reuse distance distribution for every static array reference. It indicates the cache behavior of the reference. –For every static reference, the previous accesses to the same data are stored. This shows the programmer where the previous accesses occurred for long-distance reuses.
Reuse distance as a metric for cache behavior - pdcs2001 [25] Main view on SWIM A cache of 2^18 cache lines = 2^18 * 64 bytes = 16 Mbyte is needed to capture the reuse. [Figure labels: 18; H(i,j); (0,0,1); 4; (1,0,-15); HIT; MISS]
Reuse distance as a metric for cache behavior - pdcs2001 [26] How to optimize locality in SWIM
    DO ncycle=1,itmax
      do 10 i
        do 10 j
    10    s1(i,j)
      s2
      do 20 i
        do 20 j
    20    s3(i,j)
      ...
    ENDDO
Reuse between iterations of the outer loop (1,0,-15):
- tiling is needed
- but: non-perfectly nested loop
- so: tiling needs to be extended
- e.g. Song&Li, PLDI99: extending tiling for this kind of program structure: 52% speedup, because of increased temporal reuse
Reuse distance as a metric for cache behavior - pdcs2001 [27] Reuse-distance based visualization enables: The tool shows the place and the reason for poor data locality in the program. Visualization of reuse distances in the code enables recognizing code patterns leading to capacity misses. Compiler writer can use the tool to think about new effective program optimizations for data locality.
Reuse distance as a metric for cache behavior - pdcs2001 [28] Pattern-based Visualization Show me the cache behavior of my program!
Reuse distance as a metric for cache behavior - pdcs2001 [29] Views of the cache behavior Cache misses –Cache hits and misses –Compulsory, capacity and conflict misses –Horizontal wrap-around of millions of pixels makes periodic patterns visible Histograms –Reuse distances –Cache misses Distribution over arrays and line numbers
Reuse distance as a metric for cache behavior - pdcs2001 [30] Experiments: Matrix multiplication; Tiled matrix multiplication; TOMCATV (SPECfp95); FFT
Reuse distance as a metric for cache behavior - pdcs2001 [31] Matrix multiplication
    #define N 40
    for (i=0;i<N;i++)
      for (j=0;j<N;j++) {
        c[i][j] = 0;
        for (k=0;k<N;k++)
          c[i][j] += a[i][k] * b[k][j];
      }
Reuse distance as a metric for cache behavior - pdcs2001 [32] MM Cache miss patterns
Reuse distance as a metric for cache behavior - pdcs2001 [33] MM: Histogram of reuse distance [Figure labels: Capacity miss; Cold miss]
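Reuse distance as a metric for cache behavior - pdcs2001 [34] Tiled matrix multiplication
    #define N 40
    #define B 5
    /* min is not a standard C function; defined here so the fragment compiles */
    #define min(x,y) ((x) < (y) ? (x) : (y))
    double a[N][N],b[N][N],c[N][N];
    for (i_0=0;i_0<N; i_0+=B)
      for (j_0=0;j_0<N; j_0+=B) {
        /* clear the current tile of c */
        for (i=i_0; i<min(i_0+B,N); i++)
          for (j=j_0; j<min(j_0+B,N); j++)
            c[i][j]=0.0;
        /* accumulate the tile products */
        for (k_0=0;k_0<N; k_0+=B)
          for (i=i_0; i<min(i_0+B,N); i++)
            for (j=j_0; j<min(j_0+B,N); j++)
              for (k=k_0; k<min(k_0+B,N); k++)
                c[i][j] += a[i][k] * b[k][j];
      }
Tile size constraint: B*B + B + 1 < 1024/8 = 128, so 1 < B < 11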
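The tile size constraint can be evaluated mechanically. The helper below is a small sketch under the assumption that the constraint is exactly B*B + B + 1 < capacity, with the capacity given in array elements (1024 bytes / 8 bytes per double = 128); the name `max_tile_size` is hypothetical.

```c
#include <assert.h>

/*
 * Largest tile size B for which B*B + B + 1 < capacity_elems,
 * where capacity_elems is the cache capacity in array elements.
 * Returns 0 if no positive B satisfies the inequality.
 */
int max_tile_size(int capacity_elems)
{
    assert(capacity_elems >= 0);
    int b = 0;
    /* Grow b while the next candidate still satisfies the constraint. */
    while ((b + 1) * (b + 1) + (b + 1) + 1 < capacity_elems)
        b++;
    return b;
}
```

With a capacity of 128 doubles the search stops at B = 10 (111 < 128, whereas B = 11 gives 133), and the tile size B = 5 used in the code lies well inside that range.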
Reuse distance as a metric for cache behavior - pdcs2001 [35] TMM: Fewer capacity misses, more conflict misses
Reuse distance as a metric for cache behavior - pdcs2001 [36] TMM: Fewer long reuse distances [Figure labels: Capacity miss; Cold miss]
Reuse distance as a metric for cache behavior - pdcs2001 [37] TOMCATV [Figure: histogram]
Reuse distance as a metric for cache behavior - pdcs2001 [38] TOMCATV: Histogram with arrays [Figure labels: Capacity miss; Cold miss]
Reuse distance as a metric for cache behavior - pdcs2001 [39] Top 20 Hits TOMCATV: Conflict misses with arrays
Reuse distance as a metric for cache behavior - pdcs2001 [40] TOMCATV: Conflict misses with linenos
Reuse distance as a metric for cache behavior - pdcs2001 [41] TOMCATV: Back to the source code
    for (j = 2; j < n; ++j) {
      for (i = 2; i < n; ++i) {
        …
    217   pyy = x[i + (j + 1) * ] - x[i + j * ] * 2. + x[i + (j - 1) * ];
    219   qyy = y[i + (j + 1) * ] - y[i + j * ] * 2. + y[i + (j - 1) * ];
    221   pxy = x[i (j + 1) * ] - x[i (j- 1) * ] - x[i (j + 1) * ] + x[ i (j - 1) * ];
    224   qxy = y[i (j + 1) * ] - y[i (j- 1) * ] - y[i (j + 1) * ] + y[ i (j - 1) * ];
        …
    }
524 (i + (j + 1) * 513 – 514)*8/32 = (i + j * 513 – 514)*8/32 mod (1024/32)
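The mapping formula on this slide says that two elements one column apart land in the same line of a 1024-byte direct-mapped cache with 32-byte lines. The sketch below recomputes that mapping under simplifying assumptions that are not in the source: the array is taken to start at byte address 0 and the constant index offset (the 514 in the formula) is ignored. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES 1024u                      /* direct-mapped cache size */
#define LINE_BYTES  32u                        /* cache line size */
#define NUM_SETS    (CACHE_BYTES / LINE_BYTES) /* 32 cache lines */
#define ELEM_BYTES  8u                         /* one double */

/* Cache line (set) that element elem_index of a double array maps to,
 * assuming the array starts at byte address 0. */
unsigned cache_set(size_t elem_index)
{
    return (unsigned)((elem_index * ELEM_BYTES / LINE_BYTES) % NUM_SETS);
}

/* Count the positions i in [0, n) for which x[i + (j+1)*dim] and
 * x[i + j*dim] map to the same cache line and therefore evict each other. */
size_t count_column_conflicts(size_t dim, size_t j, size_t n)
{
    assert(dim > 0);
    size_t conflicts = 0;
    for (size_t i = 0; i < n; i++)
        if (cache_set(i + (j + 1) * dim) == cache_set(i + j * dim))
            conflicts++;
    return conflicts;
}
```

Under these assumptions a leading dimension of 513 makes three out of every four neighbouring pairs collide (513 * 8 bytes is 8 bytes more than a multiple of 1024), while 524 shifts the second element by three cache lines and removes the collisions.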
Reuse distance as a metric for cache behavior - pdcs2001 [42] TOMCATV: After array alignment Changing the array size from 513 to 524 leads to about 50% speedup of TOMCATV on the real input set on a Pentium III.
Reuse distance as a metric for cache behavior - pdcs2001 [43] FFT
Reuse distance as a metric for cache behavior - pdcs2001 [44] FFT: After array padding
Reuse distance as a metric for cache behavior - pdcs2001 [45] Pattern-based visualization: Pattern-based visualization of the cache behavior during program execution guides the programmer in detecting the bottlenecks for cache optimization. Cache-optimizing program transformations, such as loop tiling, array padding and data alignment, can be verified with the visualization.
Reuse distance as a metric for cache behavior - pdcs2001 [46] Cache Remapping to Improve the Performance of Tiled Algorithms (Euro-Par 2000, J.UCS Oct. 2000)
Reuse distance as a metric for cache behavior - pdcs2001 [47] Overview Cache Remapping Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [48] Overview Cache Remapping Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [49] Loop Tiling
Matrix multiplication:
    for i:=1 to N do
      for k:=1 to N do
        for j:=1 to N do
          C[i,j]+=A[i,k]*B[k,j];
If cache size < N*N, no temporal reuse.
Tiled matrix multiplication:
    for ii:=1 to N by B1 do
      for kk:=1 to N by B2 do
        for jj:=1 to N by B3 do
          for i:=ii to min(ii+B1-1,N) do
            for k:=kk to min(kk+B2-1,N) do
              for j:=jj to min(jj+B3-1,N) do
                C[i,j]+=A[i,k]*B[k,j];
Possible cache reuse if cache size >= B2*B3. [Figure labels: Td; Ti]
Reuse distance as a metric for cache behavior - pdcs2001 [50] Tiling Reduces Capacity Misses Tiling reduces capacity misses by shortening the reuse distances
Reuse distance as a metric for cache behavior - pdcs2001 [51] Tiling is not a Panacea Capacity misses are reduced; cold and conflict misses remain. Conflict misses depend strongly on the matrix dimension.
Reuse distance as a metric for cache behavior - pdcs2001 [52] Overview Cache Remapping Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [53] Threaded Pipeline [Figure labels: Remap thread; Computation thread; time] The computation thread performs the calculations. The remap thread places the data used by the computation thread in the cache. After the data has been processed, it is copied back to memory.
Reuse distance as a metric for cache behavior - pdcs2001 [54] Cache Remapping [Figure labels: Processor; Cache; Main Memory; Computation thread; Remap thread; P1; P2; P3] P1: scalars P2: current tile P3: future tile
Reuse distance as a metric for cache behavior - pdcs2001 [55] Cache Shadow [Figure labels: Cache; Address Space; Processor; LD_C3_C3; Cache Shadow; LD_C1]
Reuse distance as a metric for cache behavior - pdcs2001 [56] Tile Size Selection Condition: 2 data tiles + scalars must fit in cache. Many tile sizes qualify (B1,B2,B3) –Choose the tile size for which is maximal. –The choice shortens the remap thread w.r.t. the computation thread.
Reuse distance as a metric for cache behavior - pdcs2001 [57] Processor Requirements Cache bypass (e.g. cache hints). Multiple instructions can be executed in parallel (e.g. superscalar processor). The processor must not stall on a main memory access (current superscalar processors stall: the instruction window is too small).
Reuse distance as a metric for cache behavior - pdcs2001 [58] Parallel Execution On SMT processors: not too difficult. On EPIC processors: possible via static thread scheduling: –Many functional units, not all of them are used in every processor cycle. –Instructions from the remap and computation thread can be executed in parallel. –Instructions from both threads are interwoven by loop transformations.
Reuse distance as a metric for cache behavior - pdcs2001 [59] Overview Cache Remapping Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [60] Remap Code
    remap(double *x, double *y) {
      FLD_C3_C3 r1,y
      FST_C1 x,r1
    }
    remapTileX(iter,X,p,II,JJ) {
      i1 = decoalesce1(iter);
      i2 = decoalesce2(iter);
      remap(p+i1*B2+i2, X[i1+II,i2+JJ]);
    }
    remapTileX(X,p,II,JJ) {
      for i1=1 to B1 do
        for i2=1 to B2 do
          remap(p+i1*B2+i2, X[i1+II,i2+JJ]);
    }
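The relation between the loop version and the per-iteration version of `remapTileX` can be sketched in portable C. Several things here are assumptions made for illustration and are not given on the slide: indices are 0-based, `X` is a row-major matrix with leading dimension `ld`, `decoalesce1` and `decoalesce2` are implemented as integer division and remainder by `B2`, the tile dimensions are fixed small constants, and the cache-hinted load and store of `remap` are replaced by a plain copy.

```c
#include <assert.h>
#include <stddef.h>

#define B1 2   /* tile rows (illustrative value) */
#define B2 3   /* tile columns (illustrative value) */

/* Recover the two tile indices from a single coalesced iteration number
 * (assumed implementation: row-major order over a B1 x B2 tile). */
size_t decoalesce1(size_t iter) { return iter / B2; }
size_t decoalesce2(size_t iter) { return iter % B2; }

/* One step of the remap work: copy a single tile element into buffer p.
 * A real implementation would use cache-bypassing loads and stores. */
void remap_tile_step(size_t iter, const double *X, size_t ld,
                     double *p, size_t II, size_t JJ)
{
    assert(iter < (size_t)B1 * B2);
    size_t i1 = decoalesce1(iter);
    size_t i2 = decoalesce2(iter);
    p[i1 * B2 + i2] = X[(i1 + II) * ld + (i2 + JJ)];
}

/* The original doubly nested loop, copying the whole tile at once. */
void remap_tile_loop(const double *X, size_t ld,
                     double *p, size_t II, size_t JJ)
{
    for (size_t i1 = 0; i1 < B1; i1++)
        for (size_t i2 = 0; i2 < B2; i2++)
            p[i1 * B2 + i2] = X[(i1 + II) * ld + (i2 + JJ)];
}
```

Calling `remap_tile_step` for `iter` = 0 .. B1*B2-1 fills the buffer exactly as `remap_tile_loop` does, which is what allows the individual steps to be spread over the iterations of another loop.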
Reuse distance as a metric for cache behavior - pdcs2001 [61] Weaving Threads
    swap(p2,p3);
    iter=0;
    for i = II to II+Q1 do
      for j = JJ to JJ+B2-1 do
        for k = KK to KK+B3-1 do      (unroll r times)
          H(i,j,k,p2)
          ...
          H(i,j,k+r-1,p2)
          remapA(iter++,A,p3,II,KK)
    iter=0
    for i = II+Q1 to II+Q1+Q2 do
      for j = JJ to JJ+B2-1 do
        for k = KK to KK+B3-1 do
          ...
          remapB(iter++,B,p3,KK,JJ)
Condition: Q1*B2*(B3/r) >= iterA
Reuse distance as a metric for cache behavior - pdcs2001 [62] Overview Cache Remapping Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [63] Compilation & Simulation Trimaran contains an optimizing compiler and a simulator for EPIC architectures. Cache remapping is compared with published techniques for alleviating conflict misses in tiled algorithms.
Reuse distance as a metric for cache behavior - pdcs2001 [64] Results
Reuse distance as a metric for cache behavior - pdcs2001 [65] Results
Reuse distance as a metric for cache behavior - pdcs2001 [66] Overview Cache Remapping Introduction Tiling Cache remapping –Concept –Code transformation for EPIC Evaluation Summary
Reuse distance as a metric for cache behavior - pdcs2001 [67] Summary Cache remapping can remove all cache conflicts in tiled codes. Preloading the data hides cold and capacity misses. Implementation in a compiler and extension to more irregular codes is planned for the future.
Reuse distance as a metric for cache behavior - pdcs2001 [68] Overall Conclusions Reuse distances need to be shortened to reduce the capacity misses. It is best done at the compiler level. Visualization helps in devising new program transformations to reduce capacity misses. After capacity misses have been minimized, the conflict misses need to be removed. Cache remapping is a compile-time approach to do it.
Reuse distance as a metric for cache behavior - pdcs2001 [69] The End Time for questions
Reuse distance as a metric for cache behavior - pdcs2001 [70] 3. Visual pattern recognition Study the cache behavior patterns of simple programs, simulated on a direct-mapped cache with 1024 bytes cache size and 32 bytes line size. 3.1 Sequential accesses –(a) in a single loop –(b) in the outer loop –(c) in the inner loop 3.2 Large stride accesses –(a) One-dimensional array –(b) Two-dimensional array
Reuse distance as a metric for cache behavior - pdcs2001 [71] 3.1a Seq. accesses in single loop A compulsory miss occurs every 4 accesses to the 8-byte array elements. double a[16384]; for (i=0; i<16384; i++) a[i] = 1.0;
Reuse distance as a metric for cache behavior - pdcs2001 [72] 3.1b Seq. accesses in outer loop A compulsory miss occurs every 4 distinct accesses to the 8-byte array elements, i.e., every 4*3=12 references in the program. double a[16384]; for (i=0; i<16384; i++) for (j=0; j<3; j++) a[i] = 1.0;
Reuse distance as a metric for cache behavior - pdcs2001 [73] 3.1c Seq. accesses in inner loop A compulsory miss occurs every 4 accesses in the first iteration of the outer loop, then a capacity miss every 4 accesses in the remaining 3 iterations of the outer loop. double a[16384]; for (i=0; i<4; i++) for (j=0; j<16384; j++) a[j] = 1.0;
Reuse distance as a metric for cache behavior - pdcs2001 [74] 3.2a Large stride access 1D array Every reference misses! A cold miss or a capacity miss occurs on average every 4 accesses; all the rest are conflict misses. for (i=0; i<16384; i++) a[i] = a[i+128];
Reuse distance as a metric for cache behavior - pdcs2001 [75] 3.2b Large stride access 2D array The same pattern as the one-dimensional large-stride accesses: a[i][j] and a[i+1][j] lie one row (128 doubles = 1024 bytes) apart in the row-major array, so they map to the same cache line. double a[128][128]; for (i=0; i<127; i++) for (j=0; j<128; j++) a[i][j] = a[i+1][j];
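The miss patterns described for these small programs can be reproduced with a few lines of simulation. The following is a minimal sketch of a direct-mapped cache with 1024 bytes and 32-byte lines, not the simulator used for the talk. It assumes the array starts at byte address 0, that each element is an 8-byte double, and that the statement `a[i] = a[i+stride]` reads before it writes; all names are hypothetical.

```c
#include <assert.h>

#define CACHE_BYTES 1024
#define LINE_BYTES  32
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)   /* 32 lines */
#define ELEM_BYTES  8                            /* one double */

struct dm_cache {
    long tag[NUM_LINES];   /* memory line held by each cache line, -1 if empty */
    long misses;
};

void dm_init(struct dm_cache *c)
{
    for (int s = 0; s < NUM_LINES; s++)
        c->tag[s] = -1;
    c->misses = 0;
}

/* Access one byte address; count a miss if its line is not resident. */
void dm_access(struct dm_cache *c, long byte_addr)
{
    assert(byte_addr >= 0);
    long line = byte_addr / LINE_BYTES;
    int set = (int)(line % NUM_LINES);
    if (c->tag[set] != line) {
        c->tag[set] = line;
        c->misses++;
    }
}

/* 3.1a / 3.1c: sweep over a[0..n-1] the given number of times. */
long misses_sweeps(long n, int sweeps)
{
    struct dm_cache c;
    dm_init(&c);
    for (int s = 0; s < sweeps; s++)
        for (long i = 0; i < n; i++)
            dm_access(&c, i * ELEM_BYTES);
    return c.misses;
}

/* 3.2a: a[i] = a[i+stride], i.e. read a[i+stride], then write a[i]. */
long misses_strided_copy(long n, long stride)
{
    struct dm_cache c;
    dm_init(&c);
    for (long i = 0; i < n; i++) {
        dm_access(&c, (i + stride) * ELEM_BYTES);
        dm_access(&c, i * ELEM_BYTES);
    }
    return c.misses;
}
```

With n = 16384, a single sweep misses on every fourth access, four sweeps miss four times as often because the 128-Kbyte array never fits in the cache, and a stride of 128 elements (exactly the cache size) makes every one of the 2*n references miss. Choosing a stride that is not a multiple of the cache size, as array padding does, brings the miss rate back to one in four.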