Harini Ramaprasad, Frank Mueller North Carolina State University Center for Embedded Systems Research Bounding Worst-Case Data Cache Behavior by Analytically Deriving Cache Reference Patterns
2 Motivation Timing Analysis — Calculation of Worst Case Execution Times (WCET) of tasks — Required for scheduling of real-time tasks – Schedulability theory requires a-priori knowledge of WCET — Estimates need to be safe — Static Timing Analysis – an efficient method to calculate WCET of a program!
3 Motivation Static Timing Analysis — Traverse all paths in program — Calculate a safe estimate for WCET — Unpredictability introduced due to: –Data-dependent control flow –Pointer accesses –Modern architectural features – branch predictors, instruction caches, data caches,… Data caches should be exploited dilemma: speedup vs. unpredictability
4 Static Timing Analyzer Framework to calculate WCET of a task Instruction cache well analyzed No good data cache analyzer Static Timing Analysis framework
5 Existing Solutions for D$ Analysis Trace based simulation Static Cache simulation [White et al ] Data flow analysis [Li et al ] Cache locking [Lisper et al , Decotigny et al ] Analytical techniques — Characterize data cache behavior [Ghosh et al , Fraguela et al , Chatterjee et al ] All have shortcomings
6 Cache Miss Equations (CMEs) Characterizes data cache behavior statically Used for loop-nest oriented code Produces a set of linear equations to relate — Iteration space — Cache parameters — Memory references Solutions give potential miss points
7 More about CMEs Types of CMEs — Cold miss equations –Capture misses on first access to memory line — Replacement miss equations –Capture interference between two references Solving CMEs — Direct solutions not practical – high complexity — Complexity reduced in implementations
8 Terminology Iteration point — Represents an iteration of a loop-nest — Set of all iteration points – iteration space Reuse vectors — Represent data reuse across different iteration points — Self/group reuse and temporal/spatial reuse [Figure: iteration space for matrix multiplication code; labels (0, 1, 0) and r = (0, 0, 1)]
9 Assumptions in CME Framework Loop-nest oriented code Perfectly nested loops Loop bounds — Known at compile time — Affine functions of loop induction variables No data-dependent conditionals
10 CME Implementation Framework used – Coyote [Vera et al.] Outputs – estimate of # misses for every reference in loop nest Works only for perfectly nested, rectangular loops Results are slightly pessimistic No data dependent conditionals are allowed
11 Arbitrary Loop Nests Existing approach — Convert loop-nests into sequential loop-nests (equal depth) –Move all references in a loop-nest to innermost level –Introduce conditionals to ensure correctness — Disadvantages –Reuse representation needs modification –Representation across iteration spaces required –CME analysis needs some modification
12 Our Approach: Forced Loop Fusion “Forced” loop fusion — Produce a single loop-nest — Concatenate iteration spaces of sequential loops — Introduce conditionals based on loop induction variables to maintain correctness Advantages — No change to reuse representation — Original CME framework reused
13 Example of Forced Loop Fusion
Original loop nest:
    for (i = 1; i <= 10; i++) {
        for (j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7;
    }
    for (l = 1; l <= 10; l++)
        for (m = 1; m <= 5; m++)
            D[l][m] = 13;
Transformed loop nest:
    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }
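The equivalence of the two forms can be checked mechanically. The following is a minimal sketch, not part of the original slides: it reads the second statement of the original nest as D[i][k] = A[i][k] + 7 (which is what the fused form implies), starts the first guard at j >= 1, and assumes zero-initialized 21x21 integer arrays. The function names and array extents are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define ROWS 21  /* assumed extents: large enough for 1-based indices up to 20 */
#define COLS 21

/* Original form: two sequential loop nests, the first one imperfectly nested. */
static void run_original(int A[ROWS][COLS], int D[ROWS][COLS])
{
    for (int i = 1; i <= 10; i++) {
        for (int j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (int k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7;
    }
    for (int l = 1; l <= 10; l++)
        for (int m = 1; m <= 5; m++)
            D[l][m] = 13;
}

/* Fused form: one perfect nest over the concatenated iteration space,
 * with guards on the induction variables selecting the active statement. */
static void run_fused(int A[ROWS][COLS], int D[ROWS][COLS])
{
    for (int i = 1; i <= 20; i++)
        for (int j = 1; j <= 20; j++) {
            if (i >= 1 && i <= 10 && j >= 1 && j <= 5)
                A[i][j] = 19;
            if (i >= 1 && i <= 10 && j >= 6 && j <= 15)
                D[i][j - 5] = A[i][j - 5] + 7;
            if (i >= 11 && i <= 20 && j >= 16 && j <= 20)
                D[i - 10][j - 15] = 13;
        }
}

/* Returns 1 if both forms leave A and D in the same final state, else 0. */
int fusion_preserves_result(void)
{
    int A1[ROWS][COLS], D1[ROWS][COLS], A2[ROWS][COLS], D2[ROWS][COLS];

    assert(ROWS > 20 && COLS > 15);  /* index ranges used above must fit */
    memset(A1, 0, sizeof A1);
    memset(D1, 0, sizeof D1);
    memset(A2, 0, sizeof A2);
    memset(D2, 0, sizeof D2);

    run_original(A1, D1);
    run_fused(A2, D2);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(D1, D2, sizeof D1) == 0;
}
```

Calling fusion_preserves_result() compares only the final array contents. It says nothing about cache behavior, which is what the CME analysis is then applied to.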
14 More Conceptual Enhancements Non-rectangular loops — Insert conditional based on loop induction variable — Treat conditional similar to those due to loop fusion Data dependent conditionals — Provide upper bound on number of misses
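The treatment of non-rectangular loops can be illustrated with a triangular nest. This is a small sketch under assumed bounds, not taken from the slides: the inner bound j <= i is replaced by the rectangular bound j <= n plus a guard on the induction variables, so the nest becomes rectangular while the same iteration points execute the body. Function names are hypothetical.

```c
#include <assert.h>

/* Non-rectangular (triangular) nest: the inner bound depends on i.
 * The counter stands in for the loop body. */
int count_triangular(int n)
{
    assert(n >= 0);
    int executed = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= i; j++)
            executed++;
    return executed;
}

/* Rectangular nest with a guard on the induction variables.
 * The body runs at exactly the same (i, j) points as above. */
int count_rectangular_guarded(int n)
{
    assert(n >= 0);
    int executed = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (j <= i)
                executed++;
    return executed;
}
```

For n = 10 both functions count 55 body executions, although the guarded form visits 100 iteration points.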
15 Pessimism in CMEs Does not analyze all iteration points — Trades off accuracy (pessimism) for speed of analysis Does not exploit certain kinds of reuse
16 Deriving Exact Cache Patterns Analyze all iteration points — Acceptable overhead – single pass, static Reanalyze “compulsory misses” — Eliminate pessimistic misses Verify correctness of “hits” — Necessitated by conditionals introduced while fusing loops
17 Illustrative Example
Cache configuration — 1KB cache, direct mapped, 32-byte line size
Information on variables
Reference | Dim LB & UB | Base address | Size of element
A | 1..10, | |
B | 1..10, | |
18 Sample Program
    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }
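To make the cache configuration of the illustrative example concrete (1KB, direct mapped, 32-byte lines, hence 32 cache lines), here is a minimal direct-mapped cache model. It is only a sketch of how a hit/miss pattern arises when every access is examined, not the analyzer described in the talk. The base address and element size passed to the sweep helper are assumptions, since the variable table did not survive, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { CACHE_SIZE = 1024, LINE_SIZE = 32, NUM_LINES = CACHE_SIZE / LINE_SIZE };

typedef struct {
    int      valid[NUM_LINES];
    uint32_t line[NUM_LINES];   /* memory line number currently held */
} dm_cache;

void cache_reset(dm_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Returns 1 on a miss and 0 on a hit; on a miss the line is loaded. */
int cache_access(dm_cache *c, uint32_t addr)
{
    uint32_t mem_line = addr / LINE_SIZE;
    uint32_t set = mem_line % NUM_LINES;

    if (c->valid[set] && c->line[set] == mem_line)
        return 0;
    c->valid[set] = 1;
    c->line[set] = mem_line;
    return 1;
}

/* Counts misses for a sequential sweep over n elements starting at base
 * (base and elem_size are assumed values supplied by the caller). */
int count_sweep_misses(uint32_t base, uint32_t elem_size, int n)
{
    dm_cache c;
    int misses = 0;

    assert(elem_size > 0 && n >= 0);
    cache_reset(&c);
    for (int k = 0; k < n; k++)
        misses += cache_access(&c, base + (uint32_t)k * elem_size);
    return misses;
}
```

With an assumed base address of 0 and 4-byte elements, sweeping 50 elements touches memory lines 0 through 6, so count_sweep_misses(0, 4, 50) returns 7. Two addresses that are 1024 bytes apart map to the same cache line and evict each other.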
19 Analysis Results Original CME Framework vs. Our Framework Reference Coyote output Output of our framework 150MMMMM M = 6 misses MMMMM M M = 7 misses 3100 MMMMMMMMMM M...M M = 13 misses = 0 misses
20 New Timing Analyzer Framework Extension to Coyote
21 Implications to the TA Miss count (n) given for each reference All n misses clustered at the beginning — Avoids handling of complex miss/hit pattern More iterations required to reach steady state — # iters could exceed upper bound of innermost loop’s induction variable — Propagation to outer loop level required — Requires extra handling in TA
22 TA Example
Sample Program:
    for (i = 1; i <= 10; i++)
        for (j = 1; j <= 10; j++)
            A[i][j] = 19;
D$ Analyzer Output: Number of misses for A[i][j] = 13
Let Miss time = time of j loop considering A[i][j] as a miss
Let Hit time = time of j loop considering A[i][j] as a hit
Value of i | #iters with Miss time for j loop | #iters with Hit time for j loop
1 | 10 | 0
2 | 3 | 7
3..10 | 0 | 10
Miss pattern over the iterations: MMMMMMMMMM | MMM……. | ……(continues)
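The clustering of all n misses at the start, with spill-over into later outer-loop iterations, can be sketched as a small helper. This is an illustration under the assumptions stated on the slides (all misses charged to the earliest iterations), not the timing analyzer's actual code. The function names and the 1-based outer iteration index are hypothetical.

```c
#include <assert.h>

/* Number of inner-loop iterations charged with "Miss time" during the
 * given outer iteration (1-based), when all n_misses are clustered at
 * the beginning of the loop nest's execution. */
int miss_iters_for_outer(int n_misses, int inner_trip, int outer_iter)
{
    assert(n_misses >= 0 && inner_trip > 0 && outer_iter >= 1);

    /* Inner iterations already executed by earlier outer iterations. */
    int before = (outer_iter - 1) * inner_trip;
    int remaining = n_misses - before;

    if (remaining <= 0)
        return 0;
    return remaining < inner_trip ? remaining : inner_trip;
}

/* The remaining inner iterations of that outer iteration get "Hit time". */
int hit_iters_for_outer(int n_misses, int inner_trip, int outer_iter)
{
    return inner_trip - miss_iters_for_outer(n_misses, inner_trip, outer_iter);
}
```

For the sample program (13 misses, 10 inner iterations), this yields 10 miss iterations for i = 1, 3 for i = 2, and none from i = 3 onward.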
23 Experimental Results Original CME vs. our framework vs. trace-driven sim., 4KB cache
Table columns: Benchmark | CME Framework (Misses, Hits) | Our Framework (Misses, Hits) | Simulator (Misses, Hits)
Benchmarks: convolution, dot product, fir, lms, matrix, nrealupdates, simple-srt-test, loop test
24 Experimental Results Timing Analyzer results [cycles] for various cache categorizations “cold miss” category is unsafe “always miss” category is overly pessimistic Use of our output (n misses) gives safe & often tighter estimates
Table columns: Benchmark; categorizations Always Miss, First N Misses, Cold misses; cache sizes 256B, 1KB, 4KB
Benchmarks: convolution, dot product, fir, lms, matrix, nrealupdates, simple-srt, loop test
25 Conclusions & Future Work Contributions: 1.Exact data cache reference patterns 2.Wider applicability Allows arbitrary loop nests using “Forced” loop fusion Allows non-rectangular loops Allows certain data-dependent conditionals 3.Integration of outputs with static timing analyzer framework Future work: — Explore larger benchmarks — Test framework with set-associative caches — Extend idea to L2 caches
26 Thank you! Questions?
27 Cache Miss Equations Loop bounds and array subscript expressions — Affine combinations of loop induction variables –An affine equation is a non-homogeneous linear equation –Non-homogeneous: isolated constants are allowed CMEs – linear Diophantine equations — Diophantine equation – only integer solutions allowed A linear Diophantine equation is of the form: — ax + by = c — where a, b and c are integer constants and x and y are variables
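What an integer-only solution of ax + by = c means can be shown with the textbook extended Euclidean algorithm. The sketch below is a generic illustration, not the method Coyote uses to solve CMEs, and the function names are hypothetical. An integer solution exists exactly when gcd(a, b) divides c.

```c
#include <assert.h>

/* Extended Euclid: returns g = gcd(a, b) (up to sign) and sets x, y
 * such that a*x + b*y == g. */
static long ext_gcd(long a, long b, long *x, long *y)
{
    if (b == 0) {
        *x = 1;
        *y = 0;
        return a;
    }
    long x1, y1;
    long g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

/* Finds one integer solution of a*x + b*y == c.
 * Returns 1 and fills x, y if a solution exists, otherwise returns 0. */
int solve_linear_diophantine(long a, long b, long c, long *x, long *y)
{
    assert(a != 0 || b != 0);

    long g = ext_gcd(a, b, x, y);
    if (c % g != 0)
        return 0;          /* gcd(a, b) does not divide c: no integer solution */
    *x *= c / g;
    *y *= c / g;
    return 1;
}
```

For example, 6x + 4y = 10 has integer solutions (the routine returns one of them), while 6x + 4y = 7 has none because gcd(6, 4) = 2 does not divide 7.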
28 Cache Miss Equations Cold Miss Equations — Memory line of Ra(i) ≠ Memory line of Ra’(p) –i and p are iteration points and Ra and Ra’ are references Replacement Miss Equations – Intuition — Cache set for Ra = Cache set for Rb — Mem addr of Ra = Mem addr of Rb + n × cache size + offset within line size range
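The intuition behind both kinds of equation can be written as two predicates on concrete addresses. This is a hedged sketch for a direct-mapped cache only. The struct, the function names and the address-based formulation are assumptions made for illustration, whereas the real equations are stated over iteration points and affine subscripts.

```c
#include <assert.h>
#include <stdint.h>

/* Direct-mapped cache parameters (assumed representation). */
typedef struct {
    uint32_t cache_size;  /* bytes */
    uint32_t line_size;   /* bytes */
} cache_cfg;

static uint32_t mem_line(cache_cfg c, uint32_t addr)
{
    return addr / c.line_size;
}

static uint32_t set_of(cache_cfg c, uint32_t addr)
{
    return mem_line(c, addr) % (c.cache_size / c.line_size);
}

/* Reuse is possible only if both accesses touch the same memory line;
 * otherwise the later access cannot benefit from the earlier one. */
int same_memory_line(cache_cfg c, uint32_t a, uint32_t b)
{
    assert(c.line_size > 0);
    return mem_line(c, a) == mem_line(c, b);
}

/* Interference: the two addresses map to the same cache set but lie in
 * different memory lines, i.e. they differ by a nonzero multiple of the
 * cache size plus an offset that stays within one line. */
int may_replace(cache_cfg c, uint32_t a, uint32_t b)
{
    assert(c.line_size > 0 && c.cache_size % c.line_size == 0);
    return set_of(c, a) == set_of(c, b) && !same_memory_line(c, a, b);
}
```

With a 1KB cache and 32-byte lines, addresses 0 and 1024 interfere, while addresses 0 and 16 share a memory line and addresses 0 and 32 fall into different sets.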