Reuse Distance as a Metric for Cache Behavior (PDCS 2001): Characterization and Optimization of Cache Behavior. Kristof Beyls, Yijun Yu, Erik D’Hollander.


Characterization and Optimization of Cache Behavior. Kristof Beyls, Yijun Yu, Erik D’Hollander. Electronics and Information Systems, Ghent University, Belgium.

Presentation overview
Part 1: Reuse distance as a metric for cache behavior
– How accurate is reuse distance in predicting cache behavior?
– How effective are compilers in removing cache misses?
Part 2: Program-centric visualization of data locality
– Reuse distance-based visualization
– Pattern-based visualization
Part 3: Cache remapping: eliminating cache conflicts in tiled codes

Part 1: Introduction to the Reuse Distance (PDCS01)

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

1. Introduction
The gap between processor and memory speed widens exponentially fast – typically, 1 memory access costs 100 processor cycles.
Caches can deliver data more quickly, but have limited capacity.
Reuse distance is a metric for a program's cache performance.

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

2.a Reuse distance
Definition: the reuse distance of a memory access is the number of unique addresses referenced since the last access to the requested data.

address:  A B C A B B A C
distance: ∞ ∞ ∞ 2 2 0 1 2
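As a sanity check on this definition, here is a small Python sketch (not from the talk; the quadratic scan is purely illustrative, real profilers use an LRU-stack or tree algorithm) that reproduces the table above:

```python
def reuse_distances(trace):
    """Reuse distance of every access: the number of distinct
    addresses referenced since the previous access to the same
    address, or infinity for a first access."""
    last_pos = {}                      # address -> index of previous access
    out = []
    for i, addr in enumerate(trace):
        if addr not in last_pos:
            out.append(float("inf"))   # cold: address never seen before
        else:
            # distinct addresses strictly between the two accesses
            out.append(len(set(trace[last_pos[addr] + 1 : i])))
        last_pos[addr] = i
    return out

print(reuse_distances("ABCABBAC"))
# [inf, inf, inf, 2, 2, 0, 1, 2], matching the table
```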

2.b Reuse distance and fully associative cache
Property: in a fully associative LRU cache with n cache lines, a reference hits if its reuse distance d < n.
Corollary: in any cache with n lines, a cache miss with reuse distance d is:
d < n → a conflict miss
n ≤ d < ∞ → a capacity miss
d = ∞ → a cold miss
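The corollary translates directly into a classifier; a minimal sketch (mine, not the authors'):

```python
def classify_miss(d, n):
    """Kind of a cache miss with reuse distance d in a cache of n lines.
    In a fully associative LRU cache an access with d < n always hits,
    so if such an access misses in some other cache, the miss can only
    be due to a mapping conflict."""
    if d == float("inf"):
        return "cold"
    return "conflict" if d < n else "capacity"

# e.g. in a 32-line cache:
print(classify_miss(float("inf"), 32))  # -> cold
print(classify_miss(10, 32))            # -> conflict
print(classify_miss(100, 32))           # -> capacity
```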

2.c Reuse Distance Distribution Spec95fp

2.d Classifying cache misses for SPEC95fp (figure: conflict and capacity misses as a function of cache size)

2.e Reuse distance vs. hit probability

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

3.a Reuse distance after optimization (figure: conflict and capacity misses)

3.b Effect of compiler optimization
With the SGIpro compiler for Itanium, 30% of conflict misses are removed, but only 1% of capacity misses. Conclusion: much work remains to be done to remove the most important kind of cache misses: capacity misses.

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

4. Capacity miss reduction
1. Hardware level – increasing the cache size; the reuse distance must be smaller than the cache size.
2. Compiler level – loop tiling, loop fusion.
3. Algorithmic level.
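To make the compiler-level item concrete, a toy illustration (not from the slides, with hypothetical arrays a, b, c) of how loop fusion shortens reuse distances; the traces list (array, index) pairs:

```python
N = 8  # toy size; imagine N far larger than the cache

# two separate loops: the reuse of a[i] waits until the whole first loop is done
unfused = []
for i in range(N):
    unfused += [("a", i), ("b", i)]        # b[i] = f(a[i])
for i in range(N):
    unfused += [("a", i), ("c", i)]        # c[i] = g(a[i])

# fused loop: the reuse of a[i] follows almost immediately
fused = []
for i in range(N):
    fused += [("a", i), ("b", i), ("a", i), ("c", i)]

def second_a0_distance(trace):
    """Reuse distance of the second access to a[0]."""
    first = trace.index(("a", 0))
    second = trace.index(("a", 0), first + 1)
    return len(set(trace[first + 1 : second]))

print(second_a0_distance(unfused))  # 15, i.e. 2*N - 1: grows with N
print(second_a0_distance(fused))    # 1, independent of N
```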

4.a Algorithmic level
The programmer has a better understanding of the global program structure, and can change the algorithm so that long-distance reuses decrease. Visualization of the long reuse distances can help the programmer identify poor data locality in the code.

Overview reuse distance: 1. Introduction; 2. Reuse distance ↔ cache behavior; 3. Effect of compiler optimization; 4. Capacity miss reduction techniques; 5. Summary

5. What did we learn?
Reuse distance predicts cache behavior accurately, even for direct-mapped caches.
Compiler optimizations for eliminating capacity misses are currently not powerful enough: a broad overview of the code is needed, and the programmer has that overview. Reuse distance visualization can help the programmer identify regions with poor locality.

Part 2: Program-Centric Visualization of Data Locality (IV2001)

Background
A program uses the cache transparently; without visualization, data locality is obscure to the programmer.
By cache lines: what happens in the cache? (e.g. CVT, Rivet)
By references: what happens to the program?
– Reuse distance-based visualization
– Pattern-based visualization

Reuse distance-based visualization: what is the source of the capacity misses in my code?

Instrumentation
Programs are instrumented to obtain the following data:
– The reuse distance distribution for every static array reference; it indicates the cache behavior of the reference.
– For every static reference, the previous accesses to the same data are stored; this shows the programmer where the previous accesses occurred for long-distance reuses.

Main view on SWIM (figure): the reuse of H(i,j) at dependence distance (1,0,-15) misses, while the reuse at (0,0,1) hits; 2^18 cache lines = 2^18 * 64 bytes = a 16 Mbyte cache would be needed to capture the long reuse.

How to optimize locality in SWIM

DO ncycle=1,itmax
   do 10 i
   do 10 j
10    s1(i,j)
      s2
   do 20 i
   do 20 j
20    s3(i,j)
   ...
ENDDO

Reuse between iterations of the outer loop (1,0,-15):
– tiling is needed
– but: non-perfectly nested loop
– so: tiling needs to be extended
– e.g. Song & Li (PLDI'99) extend tiling to this kind of program structure: 52% speedup, because of increased temporal reuse

Reuse distance-based visualization enables:
– The tool shows the place of, and the reason for, poor data locality in the program.
– Visualizing reuse distances in the code enables recognizing code patterns that lead to capacity misses.
– A compiler writer can use the tool to devise new, effective program optimizations for data locality.

Pattern-based visualization: show me the cache behavior of my program!

Views of the cache behavior
Cache misses:
– cache hits and misses
– compulsory, capacity and conflict misses
– wrapping millions of pixels around horizontally makes periodic patterns visible
Histograms:
– reuse distances
– cache misses
– distribution over arrays and line numbers

Experiments: matrix multiplication; tiled matrix multiplication; TOMCATV (SPECfp95); FFT.

Matrix multiplication

#define N 40
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    c[i][j] = 0;
    for (k=0; k<N; k++)
      c[i][j] += a[i][k] * b[k][j];
  }

MM: cache miss patterns

MM: histogram of reuse distances (legend: capacity miss, cold miss)

Tiled matrix multiplication (tile size B: B*B + B + 1 < 1024/8, i.e. 1 < B < 12)

#define N 40
#define B 5
double a[N][N], b[N][N], c[N][N];

for (i_0=0; i_0<N; i_0+=B)
  for (j_0=0; j_0<N; j_0+=B) {
    for (i=i_0; i<min(i_0+B,N); i++)
      for (j=j_0; j<min(j_0+B,N); j++)
        c[i][j] = 0.0;
    for (k_0=0; k_0<N; k_0+=B)
      for (i=i_0; i<min(i_0+B,N); i++)
        for (j=j_0; j<min(j_0+B,N); j++)
          for (k=k_0; k<min(k_0+B,N); k++)
            c[i][j] += a[i][k] * b[k][j];
  }

TMM: fewer capacity misses, more conflict misses

TMM: fewer long reuse distances (legend: capacity miss, cold miss)

TOMCATV (histogram)

TOMCATV: histogram per array (legend: capacity miss, cold miss)

TOMCATV: conflict misses per array (top 20 hits)

TOMCATV: conflict misses per line number

TOMCATV: back to the source code

for (j = 2; j < n; ++j) {
  for (i = 2; i < n; ++i) {
    ...
    pyy = x[i + (j + 1) * 513 - 514] - x[i + j * 513 - 514] * 2. + x[i + (j - 1) * 513 - 514];
    qyy = y[i + (j + 1) * 513 - 514] - y[i + j * 513 - 514] * 2. + y[i + (j - 1) * 513 - 514];
    pxy = x[i + 1 + (j + 1) * 513 - 514] - x[i + 1 + (j - 1) * 513 - 514] - x[i - 1 + (j + 1) * 513 - 514] + x[i - 1 + (j - 1) * 513 - 514];
    qxy = y[i + 1 + (j + 1) * 513 - 514] - y[i + 1 + (j - 1) * 513 - 514] - y[i - 1 + (j + 1) * 513 - 514] + y[i - 1 + (j - 1) * 513 - 514];
    ...
  }
}

The references conflict because (i + (j + 1) * 513 - 514) * 8 / 32 ≡ (i + j * 513 - 514) * 8 / 32 (mod 1024/32).

TOMCATV: after array alignment
Changing the array size from 513 to 524 leads to about a 50% speedup of TOMCATV on the real input set on a Pentium III.
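The arithmetic behind this padding can be checked directly; a sketch (mine), assuming the 1024-byte direct-mapped cache with 32-byte lines implied by the slide's conflict formula:

```python
LINE = 32            # bytes per cache line
SETS = 1024 // LINE  # 1024-byte direct-mapped cache -> 32 sets

def cache_set(i, j, dim):
    """Direct-mapped set of the 8-byte element x[i + j*dim], array based at 0."""
    return (((i + j * dim) * 8) // LINE) % SETS

# with leading dimension 513, the three stencil rows j-1, j, j+1 collide;
# padding the dimension to 524 spreads them over different sets
print([cache_set(0, j, 513) for j in (0, 1, 2)])  # [0, 0, 0]
print([cache_set(0, j, 524) for j in (0, 1, 2)])  # [0, 3, 6]
```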

FFT

FFT: after array padding

Pattern-based visualization:
– Visualizing the cache behavior of a program execution guides the programmer to the bottleneck for cache optimization.
– Cache-optimizing program transformations, such as loop tiling, array padding and data alignment, can be verified with the visualization.

Cache Remapping to Improve the Performance of Tiled Algorithms (Euro-Par 2000, J.UCS Oct. 2000)

Overview cache remapping: tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Overview cache remapping: tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Loop tiling

Matrix multiplication:
for i := 1 to N do
  for k := 1 to N do
    for j := 1 to N do
      C[i,j] += A[i,k]*B[k,j];
If cache size < N*N, there is no temporal reuse.

Tiled:
for ii := 1 to N by B1 do
  for kk := 1 to N by B2 do
    for jj := 1 to N by B3 do
      for i := ii to min(ii+B1-1,N) do
        for k := kk to min(kk+B2-1,N) do
          for j := jj to min(jj+B3-1,N) do
            C[i,j] += A[i,k]*B[k,j];
Cache reuse is possible if cache size >= B2*B3.

Tiling reduces capacity misses, by shortening the reuse distances.

Tiling is not a panacea: capacity misses are reduced, but cold and conflict misses remain. Conflict misses depend strongly on the matrix dimension.

Overview cache remapping: tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Threaded pipeline (diagram: remap thread and computation thread over time)
– The computation thread performs the calculations.
– The remap thread places the data used by the computation thread in the cache; after the data has been processed, it is copied back to memory.

Cache remapping (diagram: processor, cache, main memory; computation thread and remap thread). The cache is partitioned into P1: scalars, P2: current tile, P3: future tile.

Cache shadow (diagram: the processor accesses the cache address space through a cache shadow; LD_C3_C3 and LD_C1 are loads with cache hints)

Tile size selection
Condition: 2 data tiles + scalars must fit in the cache. Many tile sizes (B1, B2, B3) qualify:
– choose the tile size for which … is maximal;
– the choice shortens the remap thread w.r.t. the computation thread.

Processor requirements
– Cache bypass (e.g. cache hints).
– Multiple instructions can be executed in parallel (e.g. a superscalar processor).
– The processor must not stall on a main memory access (current superscalar processors do stall: the instruction window is too small).

Parallel execution
On SMT processors this is not too difficult. On EPIC processors it is possible via static thread scheduling:
– There are many functional units, and not all of them are used in every processor cycle.
– Instructions from the remap and computation threads can be executed in parallel.
– Instructions from both threads are interwoven by loop transformations.

Overview cache remapping: tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Remap code

remap(double *x, double *y) {
  FLD_C3_C3 r1, y   // load y with cache hint C3 (bypass)
  FST_C1 x, r1      // store to x with cache hint C1
}

remapTileX(X, p, II, JJ) {
  for i1 = 1 to B1 do
    for i2 = 1 to B2 do
      remap(p + i1*B2 + i2, X[i1+II, i2+JJ]);
}

// after loop coalescing, one iteration per call:
remapTileX(iter, X, p, II, JJ) {
  i1 = decoalesce1(iter);
  i2 = decoalesce2(iter);
  remap(p + i1*B2 + i2, X[i1+II, i2+JJ]);
}

Weaving threads (unroll r times; Q1*B2*(B3/r) >= iterA)

swap(p2, p3);
iter = 0;
for i = II to II+Q1 do
  for j = JJ to JJ+B2-1 do
    for k = KK to KK+B3-1 do
      H(i,j,k,p2) ... H(i,j,k+r-1,p2)
      remapA(iter++, A, p3, II, KK)
iter = 0;
for i = II+Q1 to II+Q1+Q2 do
  for j = JJ to JJ+B2-1 do
    for k = KK to KK+B3-1 do
      ...
      remapB(iter++, B, p3, KK, JJ)

Overview cache remapping: tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Compilation & simulation
Trimaran contains an optimizing compiler and a simulator for EPIC architectures. Cache remapping was compared with published techniques to alleviate conflict misses in tiled algorithms.

Results

Results

Overview cache remapping: introduction; tiling; cache remapping (concept, code transformation for EPIC); evaluation; summary

Summary
Cache remapping can remove all cache conflicts in tiled codes. Preloading the data hides cold and capacity misses. Implementation in a compiler and extension to more irregular codes are planned for the future.

Overall conclusions
Reuse distances need to be shortened to reduce capacity misses; this is best done at the compiler level. Visualization helps in devising new program transformations that reduce capacity misses. After capacity misses have been minimized, the conflict misses need to be removed; cache remapping is a compile-time approach to do so.

The End. Time for questions.

3. Visual pattern recognition
Study the cache behavior patterns of simple programs, simulated on a direct-mapped cache with 1024 bytes cache size and 32-byte lines.
3.1 Sequential accesses: (a) in a single loop; (b) in the outer loop; (c) in the inner loop.
3.2 Large-stride accesses: (a) one-dimensional array; (b) two-dimensional array.

3.1a Sequential accesses in a single loop

double a[16384];
for (i=0; i<16384; i++)
  a[i] = 1.0;

Every 4 accesses to the 8-byte array elements, a compulsory miss occurs (a 32-byte line holds 4 elements).
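A minimal direct-mapped simulator (my sketch, using the slides' 1024-byte cache with 32-byte lines) confirms the one-miss-per-four-accesses pattern:

```python
LINE, SETS = 32, 1024 // 32   # 32-byte lines, 1024-byte direct-mapped cache

def count_misses(byte_addresses):
    """Count misses of a byte-address trace on the simulated cache."""
    tags = [None] * SETS
    misses = 0
    for a in byte_addresses:
        line = a // LINE
        s = line % SETS
        if tags[s] != line:   # a different line is resident in this set
            misses += 1
            tags[s] = line
    return misses

# a[i] = 1.0 over 16384 doubles at byte addresses 0, 8, 16, ...
print(count_misses(i * 8 for i in range(16384)))  # 4096 = 16384 / 4
```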

3.1b Sequential accesses in the outer loop

double a[16384];
for (i=0; i<16384; i++)
  for (j=0; j<3; j++)
    a[i] = 1.0;

A compulsory miss occurs every 4 distinct accesses to the 8-byte array elements, i.e., every 4*3 = 12 references in the program.

3.1c Sequential accesses in the inner loop

double a[16384];
for (i=0; i<4; i++)
  for (j=0; j<16384; j++)
    a[j] = 1.0;

A compulsory miss occurs every 4 accesses in the first iteration of the outer loop; then a capacity miss occurs every 4 accesses in the remaining 3 iterations (the 128-Kbyte array far exceeds the 1024-byte cache).
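Splitting the misses into cold (first touch of a line) versus the rest reproduces the counts claimed above; again a sketch on the simulated 1024-byte cache with 32-byte lines:

```python
LINE, SETS = 32, 1024 // 32

tags = [None] * SETS
seen = set()                        # lines touched at least once
cold = capacity = 0
for i in range(4):                  # outer loop
    for j in range(16384):          # inner sweep over the array
        line = (j * 8) // LINE
        s = line % SETS
        if tags[s] != line:
            if line in seen:
                # the array's 4096 lines vastly exceed the 32-line cache,
                # so these re-reference misses are capacity misses
                capacity += 1
            else:
                cold += 1
                seen.add(line)
            tags[s] = line
print(cold, capacity)  # 4096 cold misses, 12288 capacity misses
```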

3.2a Large-stride access to a 1D array

for (i=0; i<16384; i++)
  a[i] = a[i+128];

Every reference misses! A cold or capacity miss occurs on average every 4 accesses; all the rest are conflict misses.
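The thrashing is easy to reproduce: the load and the store are 128 * 8 = 1024 bytes apart, exactly the cache size, so they map to the same set and evict each other. A sketch (mine) on the same simulated cache:

```python
LINE, SETS = 32, 1024 // 32   # 1024-byte direct-mapped cache, 32-byte lines

tags = [None] * SETS
misses = total = 0
for i in range(16384):
    for a in ((i + 128) * 8, i * 8):   # load a[i+128], then store a[i]
        line = a // LINE
        s = line % SETS
        total += 1
        if tags[s] != line:            # the two lines keep evicting each other
            misses += 1
            tags[s] = line
print(misses, "of", total, "accesses miss")  # 32768 of 32768
```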

3.2b Large-stride access to a 2D array

double a[128][128];
for (i=0; i<128; i++)
  for (j=0; j<128; j++)
    a[i][j] = a[i+1][j];

The same pattern as the one-dimensional large-stride accesses: a[i][j] and a[i+1][j] are one row apart (128 * 8 = 1024 bytes, the cache size), so the two references in each iteration conflict.