Cache Simulations and Application Performance Christopher Kerr Philip Mucci Jeff Brown Los Alamos, Sandia National Laboratories
Goal To optimize large, numerically intensive applications with poor cache utilization. By taking advantage of the memory hierarchy and we can often achieve the greatest performance improvements for our time.
Philosophy By simulating the cache hierarchy, we wish to understand how the application’s data maps to a specific cache architecture. In addition, we wish to understand the application’s reference pattern and the relationship to the mapping. Performance improvements can be obtained from this information algorithmically.
Cache Simulator Consists of: –Instrumentation assistant (Perl) –Header files –Run-time library to be linked with the application (C) Works with: –C, C++, Fortran 77, Fortran 90
How it works Cache simulator is called on memory (array) references Cache simulator reads a configuration file containing an architectural description of the memory hierarchy for multiple machines. Environment variables enable different options
How it works (cont) Each call to the simulator provides as input: –Address of the reference –Size of the datum being accessed –Symbolic name consisting of the name, file and line number
Instrumentation (before) subroutine kji(A, ii, jj, lda, B, kk, ldb, C, ldc) dimension A(lda,lda), B(ldb,ldb), C(ldc,ldc) do k = 1, kk do j = 1, jj do i = 1, ii A(i,j) = A(i,j) + B(i,k) * C(k,j) enddo return end
Instrumentation (after). do i = 1, ii call cache_sim(A(i,j),KIND(A(i,j)), & 'A(i,j):7:stdin\0') call cache_sim(A(i,j),KIND(A(i,j)), & 'A(i,j):7:stdin\0') call cache_sim(B(i,k),KIND(B(i,k)), & 'B(i,k):7:stdin\0') call cache_sim(C(k,j),KIND(C(k,j)), & 'C(k,j):7:stdin\0') A(i,j) = A(i,j) + B(i,k) * C(k,j) enddo.
Output Summary Misses by name Misses by address Conflict matrix Address trace
Summary Machine 1: test-machine Cache level 1: size 32kB, line size 32B, associativity 2, 1024 lines total Total mem accesses: Total cache misses: Total cache hits: Total hit rate: Num split accesses: 0.00 Cold cache misses: Real misses: Real hit rate: 94.41
Misses by name Name Trace: miss rate for references >= 0 percent. Percentage Real misses Line:File:Reference X(i,j):26:stencil.F X(i,j-1):26:stencil.F R(i,j):26:stencil.F X(i+1,j-1):26:stencil.F X(i+1,j+1):26:stencil.F AN(i,j):26:stencil.F ANE(i,j):17:stencil.F AN(i,j):15:stencil.F ANE(i,j-1):26:stencil.F ANE(i,j):26:stencil.F.
Misses by address Address Trace: miss rate for references >= 0 percent. Percentage Real misses Address Line:File:Reference x32d60 R(i,j):13:stencil.F x32d80 R(i,j):13:stencil.F x32da0 R(i,j):13:stencil.F x32dc0 R(i,j):13:stencil.F x32de0 R(i,j):13:stencil.F x32e00 R(i,j):13:stencil.F.
Conflict matrix Each axis represents the different arrays X axis is replacer, Y is replacee Elements are the number of replacements of one array element with another Goal is to algorithmically determine optimal layout, placement, padding and blocking. This is a minimization problem.
Conflict matrix (cont) Num Name 0 A(i,j):7:kji.f 1 B(i,k):7:kji.f 2 C(k,j):7:kji.f
Address dump Symbolic name Virtual address Cache line Goal is to use this for replay.
Future Handle nonblocking caches, replacement policies, write strategies and buffering. Add output file with starting address and extents for each array. Facility for replay of the simulator using the address dump, and data regarding padding, blocking and alignment. This will eliminates the need for additional runs.
Future (cont) Categorize cold misses for repeatedly accessed data items. Provide cost metrics to analyze approximate performance loss due to poor locality. Full Lex/Yacc based parser. Perl/Tk GUI for finer control of instrumentation.
Future (cont) MPI, Thread aware Reduction in run-time requirements MUT integration Tools to compare data sets