Static Analysis of Parameterized Loop Nests for Energy Efficient Use of Data Caches P.D’Alberto, A.Nicolau, A.Veidenbaum, R.Gupta University of California.


1 Static Analysis of Parameterized Loop Nests for Energy Efficient Use of Data Caches P.D’Alberto, A.Nicolau, A.Veidenbaum, R.Gupta University of California at Irvine Information and Computer Science Department Center for Embedded Computer Systems

2 COLP01 - CECS ICS UCI Talk Organization
Motivation
Related work
Parameterized Static Analysis
Implementation and Experimental Results
Conclusions

3 Motivation
Caches are important in modern systems
– A key performance determinant
– An important part of the overall energy equation:
» large and growing size and associativity, multi-port access
– Increasingly critical for data-intensive applications
‘Adaptive’ caches can improve performance
– By changing line and/or fetch size based on runtime behavior
– Varying associativity, etc.
– “Optimal” is application specific

4 Adaptive Memory System
Problem: How to control adaptive cache line size to maximize performance and minimize energy consumption?
Approach: use a compiler to generate code directing the adaptation at run time
Issues in such a compilation approach
– What application characteristics do we need to measure?
– When can we statically determine an optimal line size?
– How can we “generate” code based on the static analysis?

5 Rewards of Line Size Adaptation
Miss rate is reduced, resulting in:
– Fewer fetches from next-level cache or memory
– Less transfer traffic
– Fewer processor stalls
– The code is, of course, left unchanged
What is the effect on energy dissipation?

6 Energy Effects
To ensure a total energy reduction, we need to
– reduce memory accesses and memory traffic
The choice of the line size affects both memory accesses and memory traffic
– Minimizing either one alone will not give minimal energy dissipation
A tradeoff is possible and practical at compile time when both are “quantified” accurately and quickly

7 Related Work on Cache Optimizations
Two primary approaches to determine Memory Accesses and Cache Misses
– Profiling and Static Analysis
Profiling
1. Given a set of inputs (training set)
2. The number of interferences is determined
» Simulation, hardware counters, ...
3. Compiler selects a line size and uses it to annotate the code
4. The annotated code is then used in actual execution
– The problem is that often there is no single optimum line size across different runs of the same code (it is input dependent)

8 Related Work (cont.)
Static Analysis
– Based on loop nests and Cache Miss Equations (CMEs)
1. Representation of cache misses by inequalities
2. Each iteration is checked against the inequalities
» Yes/no miss
3. The number of iterations/misses is added up
» The loop bounds are needed at this point
Note:
– The complexity of the analysis is proportional to the number of iterations
– This work forms the basis of the parameterized analysis
» Because it characterizes the cause of misses
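
To make the cost of per-iteration checking concrete, here is a minimal sketch that visits every iteration of a two-reference loop nest (reads of A[i][j+upb] and B[i][j], 4-byte row-major elements) and counts misses in a direct-mapped cache. It is a plain cache simulation standing in for the per-iteration check, not the CME formulation itself. The function name `count_misses` and all of its parameters are illustrative assumptions, not part of the talk.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not from the slides): counts direct-mapped cache
   misses for the loop nest
       for (i = 0; i < upb; i++)
           for (j = 0; j < upb; j++)
               ... A[i][j+upb] ... B[i][j] ...
   assuming 4-byte elements, row-major arrays with max_dim columns, and
   one read of each reference per iteration (A first, then B). */
long count_misses(long a_offset, long b_offset, long max_dim, long upb,
                  long line_size, long cache_size)
{
    assert(line_size > 0 && cache_size > 0 && cache_size % line_size == 0);
    long num_lines = cache_size / line_size;
    long *resident = malloc((size_t)num_lines * sizeof *resident);
    assert(resident != NULL);
    for (long s = 0; s < num_lines; s++)
        resident[s] = -1;               /* -1 marks an empty cache line */

    long misses = 0;
    for (long i = 0; i < upb; i++) {
        for (long j = 0; j < upb; j++) {
            long addr[2];
            addr[0] = a_offset + 4 * (max_dim * i + j + upb);  /* A[i][j+upb] */
            addr[1] = b_offset + 4 * (max_dim * i + j);        /* B[i][j]     */
            for (int k = 0; k < 2; k++) {
                long line = addr[k] / line_size;   /* memory line number   */
                long slot = line % num_lines;      /* direct-mapped target */
                if (resident[slot] != line) {      /* cold or conflict miss */
                    misses++;
                    resident[slot] = line;
                }
            }
        }
    }
    free(resident);
    return misses;
}
```

The work grows with upb^2 and cannot start until upb is known, which is the limitation the parameterized analysis targets. With the offsets used in the talk's worked example (64 and 8256, MAX = 10, an 8192-byte cache) and upb = 2, this sketch counts 6 misses for a 16-byte line and 4 for an 8-byte line.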

9 Parameterized Loop Nests Analysis for Direct Mapped Data Cache
1. Every memory reference is symbolically equated with every possible source of interference
2. Symbolic solution of the equation is sought
– Is there a solution?
– Existence condition for a solution
3. Trade-off between spatial reuse and interference
– If there is a solution, how can we estimate the number of solutions without counting them?
– Bounding the misses by interference
4. Misses = Contribution from each reference

10 Interference and Reuse, explanation
An interference equation represents the set of iterations in the loop nest where two references interfere
– Where there is a miss
Existence conditions for solutions of the equation:
– Find at least one iteration for which the equation is satisfied
Bounding the misses due to interference
– If there is a solution, we propose the “interference density” to bound the ratio of the iterations that are solutions to the total number of iterations

11 Interference Density, explanation
The interference density is a straightforward quantity to determine
– Function of the coefficients of the interference equation
It is independent of the loop nest and the definition domain
It is a good upper bound on the cache miss ratio due to an interference equation
– (but only when it is known whether there is interference)

12 Example
    int A[MAX][MAX];   /* In memory starting at address Aoffset */
    int B[MAX][MAX];   /* In memory starting at address Boffset */
    /* Size of int = 4 Bytes, Row major layout */
    int empty(int upb) {                 /* One parameter */
        int i, j, x = 0;
        for (i = 0; i < upb; i++)        /* loop bounds affine function of upb */
            for (j = 0; j < upb; j++)
                x += (A[i][j+upb] + 1)   /* Memory reference affine function of */
                     * B[i][j];          /* index variables i, j and parameter upb */
        return x;
    }

13 Our Approach
    int A[MAX][MAX];
    int B[MAX][MAX];
    void empty(int upb) { x += (A[i][j+upb]+1)*B[i][j]; }
1) Interference Equation between A[][] and B[][]:
    Aoffset + MAX*4*i + (j+upb)*4 + n*(m*L) + l = Boffset + MAX*4*i + j*4, with n > 0 and |l| < L
For example, with Aoffset = 64, Boffset = 8256, MAX = 10 and m*L = 8192, the equation becomes:
    4*upb + n*8192 + l = 8192
2) Symbolic Solution: D = 4*upb mod 8192; if L > D there is interference (i.e. L = D annotation)
3) Otherwise Interference Density: min(1, 4/L + (L-D)/L)
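
The quantities on this slide can be evaluated directly once upb and L are given. The sketch below computes D = 4*upb mod 8192, the test L > D, and the density expression min(1, 4/L + (L-D)/L). The function names are hypothetical. The slide does not spell out how the "otherwise" in step 3 relates to the L > D test, so the density function simply evaluates the expression on the range 0 <= D <= L, where it is non-negative; that restriction is an assumption of this sketch.

```c
#include <assert.h>

/* D = 4*upb mod cache_bytes, where cache_bytes is m*L (8192 on the slide). */
long interference_offset(long upb, long cache_bytes)
{
    assert(upb >= 0 && cache_bytes > 0);
    return (4 * upb) % cache_bytes;
}

/* Slide condition: the two references interfere when L > D. */
int has_interference(long d, long line_size)
{
    return line_size > d;
}

/* Density expression from the slide: min(1, 4/L + (L-D)/L).
   Assumption: only evaluated for 0 <= D <= L. */
double interference_density(long d, long line_size)
{
    assert(line_size > 0 && d >= 0 && d <= line_size);
    double density = 4.0 / (double)line_size
                   + (double)(line_size - d) / (double)line_size;
    return density < 1.0 ? density : 1.0;
}
```

For upb = 2 this gives D = 8: a 16-byte line satisfies L > D with a density bound of 0.75, while an 8-byte line does not satisfy L > D.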

14 Our Approach (cont.)
    void empty(int upb) {
        ...
        for (i = 0; i < upb; i++)
            for (j = 0; j < upb; j++) {
                ...
            }
    }
Iteration points: upb^2
4) Misses: interference density * number of iteration points:
    2 * (4/L + min(1, 1/(upb mod L))) * upb^2
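
As a rough illustration, the estimate on this slide can be evaluated for concrete values of upb and L. The sketch reads the garbled slide text as 2 * (4/L + min(1, 1/(upb mod L))) * upb^2, which is a reconstruction, and treats the case upb mod L = 0 as a ratio of 1; both choices are assumptions. The function name `estimated_misses` is hypothetical.

```c
#include <assert.h>

/* Evaluates 2 * (4/L + min(1, 1/(upb mod L))) * upb^2 (reconstructed form).
   Assumption: when upb mod L == 0 the ratio is taken to be 1. */
double estimated_misses(long upb, long line_size)
{
    assert(upb >= 0 && line_size > 0);
    long r = upb % line_size;
    double ratio = (r == 0) ? 1.0 : 1.0 / (double)r;
    if (ratio > 1.0)
        ratio = 1.0;                          /* the min(1, ...) cap */
    double density = 4.0 / (double)line_size + ratio;
    return 2.0 * density * (double)upb * (double)upb;
}
```

Because the expression depends only on upb and L, it can be evaluated at run time in constant time once upb is known, without enumerating the upb^2 iterations.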

15 Implementation of STAMINA
STAMINA is a 3-step phase in the ARMR compiler
1. Step I: Code Analysis
– Input:
» Code and the sequence of memory references in the inner loops
– Output:
» Loop bounds information: expression of the bounds
» Reference information: index computations, reuse information

16 Implementation of STAMINA (cont.)
Step II: Interference Equation Generation
– Input: Step I output and
» The cache size and array layout
» Sequence of memory references
– Output:
» Set of equations and domains of validity
Step III: Interference Estimation
– Input: Step II output
– Output:
» Interference density and symbolic enumeration of the iterations in the loop nest

17 Example: SWIM
Swim: a loop-based code from SPEC 2000
– It cannot be analyzed using CMEs because of unknown loop bounds (supplied as input at run time)
STAMINA analysis takes 1 minute per loop nest
– 4 main loop nests
– The execution of swim takes less than 1 hr
We verified the analysis using the Shade cache simulator
The analysis results:
– For the reference set, Swim has no interference:
» Independent of the line size
– Larger line size = better performance
– Shorter line size = better energy

18 Examples: “Self Interference”
“Self Interference” is an artificial example
It consists of 6 loop nests with only one memory reference each
– Self interference
The choice of line size sharply affects the overall miss ratio and energy consumption
– Adaptive = optimal line size for each loop nest
Adaptive line size yields
– Optimum energy consumption
– 1/3 of the total miss ratio

19 Example “Self Interference”
Adaptive = each loop nest has a different and optimal line size

20 Example: Matrix Multiply
ijk-Matrix Multiply
STAMINA takes about 2 min for untiled and 8 hrs for tiled
The tiled MM cannot be analyzed by CMEs
Comparison with the “exact version”

21 Conclusions
Architectural adaptation presents an opportunity to maximize performance based on application and data needs
Energy consumption can be optimized within the same framework
Compiler analysis and its integration with the runtime system are needed to achieve this
– This work enables an optimum tradeoff between conflict and reuse based on static analysis of nested loops
– The result is a possible trade-off between energy and performance, or energy optimization
The model is validated by the experimental results

22 Future Work
Find efficient techniques for symbolic solution of the interference equation
Make the analysis easier to apply to benchmarks (friendlier “applicability”)

23 Thank you

24 Energy Implications
[Figure labels: Energy for data access; Data traffic towards L2; Energy for address access; Address activations: reads+writes; Data miss ratio; Memory Accesses]

25 Data Traffic and Address Activations

26 Case I, Interference, first two iterations
Line size L = 16B
Cache size C = m*L
upb = 2; A[0][0] and B[0][0] are n*m lines apart
Next iteration would reuse the same line of A[i][j+upb]
Every 4 accesses 2 misses
[Figure: cache of m lines of size L and memory, showing A[0][0+upb], A[0][1+upb], B[0][0], B[0][1]]

27 Case II, Cache Line Adaptation
Line size L’ = L/2 = 8B
Cache size C = 2m*(L/2)
– Cache size constant
upb = 2; A[0][0] and B[0][0] are 2nm+1 lines apart
No interference
Every 4 accesses 2 misses
Half of the traffic
[Figure: cache of 2m lines of size L’ and memory, showing A[0][0+upb], A[0][1+upb], B[0][0], B[0][1]]
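
The effect of halving the line size can be checked with a small direct-mapped placement sketch. It assumes the offsets from the talk's worked example (Aoffset = 64, Boffset = 8256, an 8192-byte cache), which these two slides do not restate, so it is not meant to reproduce the line distances drawn in the figures. It also assumes that every miss transfers exactly one full line. All function names are hypothetical.

```c
#include <assert.h>

/* Direct-mapped placement: cache line that a byte address maps to. */
long cache_slot(long addr, long line_size, long cache_size)
{
    assert(addr >= 0 && line_size > 0 && cache_size % line_size == 0);
    return (addr / line_size) % (cache_size / line_size);
}

/* Two addresses conflict when they lie in different memory lines
   that map to the same cache line. */
int conflicts(long a, long b, long line_size, long cache_size)
{
    return a / line_size != b / line_size &&
           cache_slot(a, line_size, cache_size) ==
           cache_slot(b, line_size, cache_size);
}

/* Bytes fetched from the next level, assuming one full line per miss. */
long traffic_bytes(long misses, long line_size)
{
    return misses * line_size;
}
```

With upb = 2, A[0][0+upb] sits at byte 72 and B[0][0] at byte 8256. Under these assumptions `conflicts(72, 8256, 16, 8192)` is true and `conflicts(72, 8256, 8, 8192)` is false, and the same number of misses moves half as many bytes with the 8-byte line.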

