1
Cache Simulations and Application Performance
Christopher Kerr (kerr@gfdl.gov)
Philip Mucci (mucci@cs.utk.edu)
Jeff Brown (jeffb@lanl.gov)
Los Alamos and Sandia National Laboratories
2
Goal
To optimize large, numerically intensive applications with poor cache utilization. By taking advantage of the memory hierarchy, we can often achieve the greatest performance improvement for the time invested.
3
Philosophy
By simulating the cache hierarchy, we wish to understand how the application's data maps to a specific cache architecture. In addition, we wish to understand the application's reference pattern and its relationship to that mapping. Performance improvements can then be obtained from this information algorithmically.
4
Cache Simulator
Consists of:
– Instrumentation assistant (Perl)
– Header files
– Run-time library to be linked with the application (C)
Works with:
– C, C++, Fortran 77, Fortran 90
5
How it works
The cache simulator is called on memory (array) references.
The simulator reads a configuration file containing an architectural description of the memory hierarchy for multiple machines (a hypothetical example is sketched below).
Environment variables enable different options.
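The configuration format itself is not shown in these slides; purely as a hypothetical illustration, an entry describing the cache used in the summary slide later on might look something like:

    machine test-machine
      cache level 1
        size          32768     # bytes (32kB)
        line_size     32        # bytes
        associativity 2

The keywords and layout above are invented, not the tool's actual syntax; the point is only that each machine entry describes every level of the hierarchy by its size, line size and associativity.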
6
How it works (cont)
Each call to the simulator provides as input:
– Address of the reference
– Size of the datum being accessed
– Symbolic name consisting of the reference name, line number and file
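Slide 4 states that the run-time library is written in C, but its source is not shown here. The following is a minimal sketch of what an entry point with this interface could look like, assuming a single-level, LRU, set-associative cache with the parameters from the summary slide; the per-name miss accounting and the Fortran-to-C calling glue are omitted, and none of this is the tool's actual code.

    #include <stdio.h>

    #define CACHE_SIZE  (32 * 1024)               /* bytes, as in the summary slide */
    #define LINE_SIZE   32                        /* bytes                          */
    #define ASSOC       2
    #define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * ASSOC))

    typedef struct {
        unsigned long tag[ASSOC];                 /* tags, most recently used first */
        int           valid[ASSOC];
    } set_t;

    static set_t         sets[NUM_SETS];
    static unsigned long hits, misses;

    /* One simulated reference: addr and size identify the datum, name carries
     * the "reference:line:file" label the reports use to attribute misses.   */
    void cache_sim(void *addr, int size, const char *name)
    {
        unsigned long line = (unsigned long)addr / LINE_SIZE;
        unsigned long set  = line % NUM_SETS;
        unsigned long tag  = line / NUM_SETS;
        set_t *s = &sets[set];
        int i;

        (void)size;   /* a full simulator would use size to detect split accesses */
        (void)name;   /* ...and name to build the per-reference miss tables       */

        for (i = 0; i < ASSOC; i++) {
            if (s->valid[i] && s->tag[i] == tag) {   /* hit: promote to MRU slot */
                for (; i > 0; i--) {
                    s->tag[i]   = s->tag[i - 1];
                    s->valid[i] = s->valid[i - 1];
                }
                s->tag[0]   = tag;
                s->valid[0] = 1;
                hits++;
                return;
            }
        }

        /* miss: evict the LRU way and install the new line as MRU */
        for (i = ASSOC - 1; i > 0; i--) {
            s->tag[i]   = s->tag[i - 1];
            s->valid[i] = s->valid[i - 1];
        }
        s->tag[0]   = tag;
        s->valid[0] = 1;
        misses++;
    }

    /* Print a summary in the spirit of the report on slide 10. */
    void cache_sim_summary(void)
    {
        unsigned long total = hits + misses;
        printf("Total mem accesses: %lu  misses: %lu  hit rate: %.2f\n",
               total, misses, total ? 100.0 * hits / total : 0.0);
    }

Called from the instrumented Fortran on the next slides, the same routine would receive its arguments by reference; that interoperability detail is left out of the sketch.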
7
Instrumentation (before)

      subroutine kji(A, ii, jj, lda, B, kk, ldb, C, ldc)
      dimension A(lda,lda), B(ldb,ldb), C(ldc,ldc)
      do k = 1, kk
        do j = 1, jj
          do i = 1, ii
            A(i,j) = A(i,j) + B(i,k) * C(k,j)
          enddo
        enddo
      enddo
      return
      end
8
Instrumentation (after)

      ...
      do i = 1, ii
        call cache_sim(A(i,j), KIND(A(i,j)), &
                       'A(i,j):7:stdin\0')
        call cache_sim(A(i,j), KIND(A(i,j)), &
                       'A(i,j):7:stdin\0')
        call cache_sim(B(i,k), KIND(B(i,k)), &
                       'B(i,k):7:stdin\0')
        call cache_sim(C(k,j), KIND(C(k,j)), &
                       'C(k,j):7:stdin\0')
        A(i,j) = A(i,j) + B(i,k) * C(k,j)
      enddo
      ...
9
Output
– Summary
– Misses by name
– Misses by address
– Conflict matrix
– Address trace
10
Summary

Machine 1: test-machine
Cache level 1: size 32kB, line size 32B, associativity 2, 1024 lines total
--------------------------------
Total mem accesses:  166492.00
Total cache misses:   10276.00
Total cache hits:    156216.00
Total hit rate:          93.83
--------------------------------
Num split accesses:       0.00
Cold cache misses:     1024.00
Real misses:           9252.00
Real hit rate:           94.41
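These figures are internally consistent: the total hit rate is 156216 / 166492 ≈ 93.83%, real misses are total misses minus cold (first-touch) misses, 10276 − 1024 = 9252, and the real hit rate appears to also discount cold misses from the access count, 1 − 9252 / (166492 − 1024) ≈ 94.41%.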
11
Misses by name

Name Trace: miss rate for references >= 0 percent.
Percentage  Real misses  Line:File:Reference
      0.01        1.000  X(i,j):26:stencil.F
      0.01        1.000  X(i,j-1):26:stencil.F
      9.08      840.000  R(i,j):26:stencil.F
      0.12       11.000  X(i+1,j-1):26:stencil.F
      9.08      840.000  X(i+1,j+1):26:stencil.F
      9.08      840.000  AN(i,j):26:stencil.F
      7.61      704.000  ANE(i,j):17:stencil.F
      7.61      704.000  AN(i,j):15:stencil.F
      0.13       12.000  ANE(i,j-1):26:stencil.F
      9.08      840.000  ANE(i,j):26:stencil.F
      ...
12
Misses by address

Address Trace: miss rate for references >= 0 percent.
Percentage  Real misses  Address  Line:File:Reference
      0.01        1.000  0x32d60  R(i,j):13:stencil.F
      0.01        1.000  0x32d80  R(i,j):13:stencil.F
      0.01        1.000  0x32da0  R(i,j):13:stencil.F
      0.01        1.000  0x32dc0  R(i,j):13:stencil.F
      0.01        1.000  0x32de0  R(i,j):13:stencil.F
      0.01        1.000  0x32e00  R(i,j):13:stencil.F
      ...
13
Conflict matrix
Each axis represents the different arrays: the X axis is the replacer, the Y axis is the replacee.
Elements are the number of replacements of one array's elements by another's.
The goal is to algorithmically determine optimal layout, placement, padding and blocking. This is a minimization problem.
14
Conflict matrix (cont)

       0    1    2
0    100   50   20
1     20   10   10
2     50    0    0

Num  Name
0    A(i,j):7:kji.f
1    B(i,k):7:kji.f
2    C(k,j):7:kji.f
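As a purely illustrative example of the kind of transformation such a matrix is meant to drive (the code below is not produced by the tool): if A, B and C evict one another heavily in a power-of-two-sized cache, padding the leading dimension shifts their rows onto different cache sets.

    #define N   256
    #define PAD 4        /* one 32-byte cache line of doubles */

    /* Without PAD, each array is an exact multiple of the 32 KB cache size, so
     * if the three happen to be laid out back to back, A(i,j), B(i,j) and
     * C(i,j) map to the same set and keep evicting one another; the extra,
     * never-referenced PAD elements per row break that alignment.            */
    double A[N][N + PAD];
    double B[N][N + PAD];
    double C[N][N + PAD];

The equivalent change in the Fortran of slide 7 would be to over-allocate the leading dimension lda; the conflict matrix is what tells the tool which arrays are worth padding.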
15
Address dump
– Symbolic name
– Virtual address
– Cache line
Goal is to use this for replay.
16
Future
Handle non-blocking caches, replacement policies, write strategies and buffering.
Add an output file with the starting address and extents of each array.
Facility for replay of the simulator using the address dump, together with data regarding padding, blocking and alignment. This will eliminate the need for additional runs.
17
Future (cont)
Categorize cold misses for repeatedly accessed data items.
Provide cost metrics to analyze the approximate performance loss due to poor locality.
Full Lex/Yacc-based parser.
Perl/Tk GUI for finer control of instrumentation.
18
Future (cont)
MPI and thread aware
Reduction in run-time requirements
MUT integration
Tools to compare data sets