Harini Ramaprasad, Frank Mueller North Carolina State University Center for Embedded Systems Research Bounding Worst-Case Data Cache Behavior by Analytically.


1 Bounding Worst-Case Data Cache Behavior by Analytically Deriving Cache Reference Patterns
Harini Ramaprasad, Frank Mueller
North Carolina State University, Center for Embedded Systems Research

2 Motivation
Timing Analysis
- Calculation of Worst-Case Execution Times (WCET) of tasks
- Required for scheduling of real-time tasks
  - Schedulability theory requires a priori knowledge of WCET
- Estimates need to be safe
- Static Timing Analysis: an efficient method to calculate the WCET of a program

3 Motivation
Static Timing Analysis
- Traverse all paths in program
- Calculate a safe estimate for WCET
- Unpredictability introduced due to:
  - Data-dependent control flow
  - Pointer accesses
  - Modern architectural features: branch predictors, instruction caches, data caches, ...
Data caches should be exploited → dilemma: speedup vs. unpredictability

4 Static Timing Analyzer Framework
- Framework to calculate WCET of a task
- Instruction cache well analyzed
- No good data cache analyzer
[Figure: Static Timing Analysis framework]

5 Existing Solutions for D$ Analysis
- Trace-based simulation
- Static cache simulation [White et al. - 1997]
- Data flow analysis [Li et al. - 1996]
- Cache locking [Lisper et al. - 2003, Decotigny et al. - 2002]
- Analytical techniques
  - Characterize data cache behavior [Ghosh et al. - 1999, Fraguela et al. - 1999, Chatterjee et al. - 2001]
→ All have shortcomings

6 Cache Miss Equations (CMEs)
- Characterize data cache behavior statically
- Used for loop-nest oriented code
- Produce a set of linear equations that relate
  - Iteration space
  - Cache parameters
  - Memory references
- Solutions give potential miss points

7 More about CMEs
Types of CMEs
- Cold miss equations
  - Capture misses on first access to a memory line
- Replacement miss equations
  - Capture interference between two references
Solving CMEs
- Direct solutions not practical: high complexity
- Complexity reduced in implementations

8 Terminology
Iteration point
- Represents an iteration of a loop nest
- Set of all iteration points: iteration space
Reuse vectors
- Represent data reuse across different iteration points
- Self/group reuse and temporal/spatial reuse
[Figure: Iteration space for matrix multiplication code, with labels (0, 1, 0) and r = (0, 0, 1)]

9 Assumptions in CME Framework
- Loop-nest oriented code
- Perfectly nested loops
- Loop bounds
  - Known at compile time
  - Affine functions of loop induction variables
- No data-dependent conditionals

10 CME Implementation
- Framework used: Coyote [Vera et al.]
- Outputs: estimate of # misses for every reference in loop nest
- Works only for perfectly nested, rectangular loops
- Results are slightly pessimistic
- No data-dependent conditionals are allowed

11 Arbitrary Loop Nests
Existing approach
- Loop nests → sequential loop nests (equal depth)
  - Move all references in a loop nest to innermost level
  - Introduce conditionals to ensure correctness
- Disadvantages
  - Reuse representation needs modification
  - Representation across iteration spaces required
  - CME analysis needs some modification

12 Our Approach: Forced Loop Fusion
"Forced" loop fusion
- Produce a single loop nest
- Concatenate iteration spaces of sequential loops
- Introduce conditionals based on loop induction variables to maintain correctness
Advantages
- No change to reuse representation
- Original CME framework reused

13 Example of Forced Loop Fusion

Original loop nest:

    for (i = 1; i <= 10; i++) {
        for (j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7;
    }
    for (l = 1; l <= 10; l++)
        for (m = 1; m <= 5; m++)
            D[l][m] = 13;

Transformed loop nest:

    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            /* first inner loop of the first nest */
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            /* second inner loop of the first nest, shifted by 5 in j */
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            /* second nest, shifted by 10 in i and 15 in j */
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }
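The equivalence claimed by the fusion example can be checked with a small harness. The following is a minimal sketch, not the authors' tool: it assumes the read in the second inner loop is `A[i][k]` (as the transformed loop's `A[i][j-5]` implies), that both arrays start zeroed, and that index 0 is simply unused. The names `run_original`, `run_fused`, and `fusion_preserves_result` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define ROWS 11 /* indices 1..10 are used; index 0 is unused padding */
#define COLS 11

/* Original program: one nest with two sequential inner loops,
 * followed by a second, separate nest. */
static void run_original(int A[ROWS][COLS], int D[ROWS][COLS])
{
    int i, j, k, l, m;
    for (i = 1; i <= 10; i++) {
        for (j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7; /* assumed subscript, see lead-in */
    }
    for (l = 1; l <= 10; l++)
        for (m = 1; m <= 5; m++)
            D[l][m] = 13;
}

/* Fused program: a single 20 x 20 nest whose guards select the
 * sub-range of the concatenated iteration space for each statement. */
static void run_fused(int A[ROWS][COLS], int D[ROWS][COLS])
{
    int i, j;
    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if (i >= 1 && i <= 10 && j >= 1 && j <= 5)
                A[i][j] = 19;
            if (i >= 1 && i <= 10 && j >= 6 && j <= 15)
                D[i][j - 5] = A[i][j - 5] + 7;
            if (i >= 11 && i <= 20 && j >= 16 && j <= 20)
                D[i - 10][j - 15] = 13;
        }
}

/* Returns 1 when both versions leave A and D in the same final state. */
int fusion_preserves_result(void)
{
    int A1[ROWS][COLS], D1[ROWS][COLS], A2[ROWS][COLS], D2[ROWS][COLS];
    memset(A1, 0, sizeof A1);
    memset(D1, 0, sizeof D1);
    memset(A2, 0, sizeof A2);
    memset(D2, 0, sizeof D2);
    run_original(A1, D1);
    run_fused(A2, D2);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(D1, D2, sizeof D1) == 0;
}
```

Calling `fusion_preserves_result()` from a test driver compares the final array contents only. It does not check that the two versions touch memory in the same order, which is the property the cache analysis actually depends on.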

14 More Conceptual Enhancements
Non-rectangular loops
- Insert conditional based on loop induction variable
- Treat conditional similarly to those due to loop fusion
Data-dependent conditionals
- Provide upper bound on number of misses
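The handling of non-rectangular loops can be illustrated with a triangular nest. This is a hedged sketch under assumed names (`sum_triangular`, `sum_rectangular_guarded`) with an arbitrary loop body. The slides do not give a concrete example, so the shape of the guard is an illustration of the idea rather than the authors' exact transformation.

```c
#include <assert.h>

/* Non-rectangular nest: the inner bound depends on the outer variable. */
long sum_triangular(int n)
{
    assert(n >= 0);
    long sum = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= i; j++)
            sum += (long)i * j;
    return sum;
}

/* Rectangular nest with a guard on the induction variables.
 * The guard restricts execution to the original triangular space. */
long sum_rectangular_guarded(int n)
{
    assert(n >= 0);
    long sum = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (j <= i)
                sum += (long)i * j;
    return sum;
}
```

Both functions visit the same (i, j) pairs in the same order, so an analysis that already understands guards introduced by loop fusion can treat `j <= i` the same way.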

15 Pessimism in CMEs
- Does not analyze all iteration points
  - Trades off accuracy (pessimism) for speed of analysis
- Does not exploit certain kinds of reuse

16 Deriving Exact Cache Patterns
- Analyze all iteration points
  - Acceptable overhead: single pass, static
- Reanalyze "compulsory misses"
  - Eliminate pessimistic misses
- Verify correctness of "hits"
  - Necessitated by conditionals introduced while fusing loops

17 Illustrative Example
Cache configuration
- 1KB cache, direct mapped, 32-byte line size
Information on variables

Reference   Dim LB & UB      Base address   Size of element
A           1..10, 1..10     151944         4
B           1..10, 1..10     153000         4
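For the cache configuration above, the mapping from an array element to a memory line and a cache set can be sketched as follows. Several things here are assumptions rather than facts from the slides: row-major layout, the base address being the address of element [1][1], and the table being read as base 151944 for A and 153000 for B with 4-byte elements. The function names are hypothetical.

```c
#include <assert.h>

enum {
    CACHE_SIZE = 1024,                  /* 1KB, direct mapped */
    LINE_SIZE  = 32,                    /* bytes per cache line */
    NUM_SETS   = CACHE_SIZE / LINE_SIZE,
    LOWER      = 1,                     /* lower bound of both dimensions */
    UPPER      = 10,                    /* upper bound of both dimensions */
    EXTENT     = UPPER - LOWER + 1,
    ELEM_SIZE  = 4                      /* bytes per element */
};

/* Byte address of element [i][j], assuming row-major layout and that
 * `base` is the address of element [LOWER][LOWER]. */
unsigned long element_address(unsigned long base, int i, int j)
{
    assert(i >= LOWER && i <= UPPER && j >= LOWER && j <= UPPER);
    return base + (unsigned long)((i - LOWER) * EXTENT + (j - LOWER)) * ELEM_SIZE;
}

/* Index of the memory line that contains the address. */
unsigned long memory_line(unsigned long addr)
{
    return addr / LINE_SIZE;
}

/* Cache set the address maps to in a direct-mapped cache. */
unsigned long cache_set(unsigned long addr)
{
    return memory_line(addr) % NUM_SETS;
}
```

Under these assumptions, A[1][1] falls in memory line 4748 and cache set 12, and A[1][7] is the first element of the next memory line. This is the kind of boundary that determines where spatial reuse ends.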

18 Sample Program

    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }

19 Analysis Results
Original CME framework (Coyote) vs. our framework

Reference 1: Coyote output = 50; output of our framework = 6 misses
MMMMM.......M.....................................

Reference 2: Coyote output = 50; output of our framework = 7 misses
.....MMMMM................M......................M..................................................

Reference 3: Coyote output = 100; output of our framework = 13 misses
MMMMMMMMMM............M...M......................M..................................................

Reference 4: Coyote output = 50; output of our framework = 0 misses
..................................................

20 New Timing Analyzer Framework
- Extension to Coyote

21 Implications for the TA
- Miss count (n) given for each reference
- All n misses clustered at the beginning
  - Avoids handling of complex miss/hit pattern
- More iterations required to reach steady state
  - # iters could exceed upper bound of innermost loop's induction variable
  - Propagation to outer loop level required
  - Requires extra handling in TA

22 TA Example
Sample program:

    for (i = 1; i <= 10; i++)
        for (j = 1; j <= 10; j++)
            A[i][j] = 19;

D$ analyzer output: number of misses for A[i][j] = 13

Let
- Miss time = time of j loop considering A[i][j] as a miss
- Hit time = time of j loop considering A[i][j] as a hit

Value of i   #iters with Miss time for j loop   #iters with Hit time for j loop
1            10                                 0
2            3                                  7
3..10        0                                  10

Pattern assumed by the TA: MMMMMMMMMM | MMM....... | ...... (continues)
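The table in this example follows from placing all n misses at the start of the iteration space and carrying any overflow into the next outer iteration. Here is a minimal sketch of that bookkeeping, assuming a two-level nest and per-iteration miss and hit times. The type `inner_split` and both function names are hypothetical and are not taken from the timing analyzer.

```c
#include <assert.h>

/* How one outer iteration's inner loop splits into miss-time and
 * hit-time iterations. */
typedef struct {
    int miss_iters;
    int hit_iters;
} inner_split;

/* Split for outer iteration `outer` (1-based) when the first n_misses
 * iterations of the whole nest are charged as misses. */
inner_split split_for_outer_iteration(int n_misses, int inner_iters, int outer)
{
    assert(n_misses >= 0 && inner_iters > 0 && outer >= 1);
    long consumed = (long)(outer - 1) * inner_iters; /* charged earlier */
    long remaining = (long)n_misses - consumed;
    if (remaining < 0)
        remaining = 0;
    if (remaining > inner_iters)
        remaining = inner_iters; /* the rest propagates to later outer iterations */
    inner_split s = { (int)remaining, inner_iters - (int)remaining };
    return s;
}

/* Total time of the nest, given assumed per-iteration costs. */
long nest_time(int n_misses, int outer_iters, int inner_iters,
               long miss_time, long hit_time)
{
    long total = 0;
    for (int outer = 1; outer <= outer_iters; outer++) {
        inner_split s = split_for_outer_iteration(n_misses, inner_iters, outer);
        total += s.miss_iters * miss_time + s.hit_iters * hit_time;
    }
    return total;
}
```

With n = 13 and an inner bound of 10, the split is (10, 0) for i = 1, (3, 7) for i = 2, and (0, 10) from i = 3 onward, matching the table.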

23 Experimental Results
Original CME vs. our framework vs. trace-driven simulation, 4KB cache
Columns: Benchmark | CME Framework (Misses, Hits) | Our Framework (Misses, Hits) | Simulator (Misses, Hits)

convolution       400 2637426374
dot product       803517
fir               59911922657326573
lms               12079449271071271071
matrix            14600 77940 0 394561384562
nrealupdates      12002400521148501150
simple-srt-test   145998614296861429686
loop test         391612617426174

24 Experimental Results
Timing Analyzer results [cycles] for various cache categorizations
→ "cold miss" category is unsafe
→ "always miss" category is overly pessimistic
→ Use of our output (n misses) gives safe and often tighter estimates
Column headers: Benchmark | Always Miss | First N Misses | Cold misses | 256B Cache | 1KB Cache | 4KB Cache

convolution    879154815051
dot product    530480 460
fir            1279772877097
lms            185441265411814
matrix         19616872538523785055850548
nrealupdates   2333812938126581185811838
simple-srt     668894377434372034
loop test      64824742

25 Conclusions & Future Work
Contributions:
1. Exact data cache reference patterns
2. Wider applicability
   - Allows arbitrary loop nests using "forced" loop fusion
   - Allows non-rectangular loops
   - Allows certain data-dependent conditionals
3. Integration of outputs with static timing analyzer framework
Future work:
- Explore larger benchmarks
- Test framework with set-associative caches
- Extend idea to L2 caches

26 Thank you! Questions?

27 Cache Miss Equations
Loop bounds and array subscript expressions
- Affine combinations of loop induction variables
  - An affine equation is a non-homogeneous linear equation
  - Non-homogeneous → isolated constants allowed
CMEs: linear Diophantine equations
- Diophantine equation: only integer solutions allowed
A linear Diophantine equation is of the form:
- ax + by = c
- where a, b and c are constants and x and y are variables
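To make the two-variable form ax + by = c concrete, one particular integer solution can be found with the extended Euclidean algorithm. This is a generic textbook sketch, not the solver used in the CME framework, which (as noted earlier in the talk) avoids direct solution because of its complexity. The function names are hypothetical.

```c
#include <assert.h>

/* Extended Euclid: returns g = gcd(a, b) (possibly negative for
 * negative inputs) and sets *x, *y so that a*(*x) + b*(*y) == g. */
long ext_gcd(long a, long b, long *x, long *y)
{
    if (b == 0) {
        *x = 1;
        *y = 0;
        return a;
    }
    long x1, y1;
    long g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

/* Finds one integer solution of a*x + b*y = c.
 * Returns 1 and fills *x, *y if a solution exists, 0 otherwise. */
int solve_linear_diophantine(long a, long b, long c, long *x, long *y)
{
    assert(a != 0 || b != 0);
    long x0, y0;
    long g = ext_gcd(a, b, &x0, &y0);
    if (c % g != 0)
        return 0; /* c must be a multiple of gcd(a, b) */
    *x = x0 * (c / g);
    *y = y0 * (c / g);
    return 1;
}
```

A solution exists exactly when gcd(a, b) divides c. For example, 6x + 4y = 10 is solvable while 6x + 4y = 7 is not. A full CME solver would additionally restrict solutions to the bounded iteration space, which this sketch does not attempt.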

28 Cache Miss Equations
Cold Miss Equations
- Memory line of Ra(i) ≠ memory line of Ra'(p)
  - i and p are iteration points and Ra and Ra' are references
Replacement Miss Equations: intuition
- Cache set for Ra = cache set for Rb
- Mem addr of Ra = mem addr of Rb + n × cache size + line size range
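The intuition behind the replacement miss equations can be written as a predicate on two byte addresses. This is a sketch for a direct-mapped cache only. The function name `may_replace` and its parameters are hypothetical, and the check operates on concrete addresses rather than on the symbolic equations the CME framework generates.

```c
#include <assert.h>
#include <stdbool.h>

/* True when two addresses lie in different memory lines that map to
 * the same set of a direct-mapped cache, so an access to one can
 * evict the line holding the other. */
bool may_replace(unsigned long addr_a, unsigned long addr_b,
                 unsigned long cache_size, unsigned long line_size)
{
    assert(line_size > 0 && cache_size >= line_size &&
           cache_size % line_size == 0);
    unsigned long num_sets = cache_size / line_size;
    unsigned long line_a = addr_a / line_size;
    unsigned long line_b = addr_b / line_size;
    return line_a != line_b && (line_a % num_sets) == (line_b % num_sets);
}
```

Two addresses exactly one cache size apart always satisfy the predicate, which corresponds to n = 1 in the address relation above. Addresses within the same line never do.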

