
Static Analysis of Parameterized Loop Nests for Energy Efficient Use of Data Caches
P. D'Alberto, A. Nicolau, A. Veidenbaum, R. Gupta
University of California at Irvine, Information and Computer Science Department, Center for Embedded Computer Systems

Talk Organization
Motivation
Related work
Parameterized Static Analysis
Implementation and Experimental Results
Conclusions

Motivation
Caches are important in modern systems
– A key performance determinant
– An important part of the overall energy equation:
» large and growing size and associativity, multi-port access
– Increasingly critical for data-intensive applications
‘Adaptive’ caches can improve performance
– By changing line and/or fetch size based on runtime behavior
– Varying associativity, etc.
– “Optimal” is application specific

Adaptive Memory System
Problem: how to control adaptive cache line size to maximize performance and minimize energy consumption?
Approach: use a compiler to generate code directing the adaptation at run time
Issues in such a compilation approach:
– What application characteristics do we need to measure?
– When can we statically determine an optimal line size?
– How can we “generate” code based on the static analysis?

Rewards of Line Size Adaptation
Miss rate is reduced, resulting in:
– Fewer fetches from next-level cache or memory
– Less transfer traffic
– Fewer processor stalls
– The code is, of course, left unchanged
What is the effect on energy dissipation?

Energy Effects
To ensure a total energy reduction we need to
– reduce both memory accesses and memory traffic
The choice of the line size affects both memory accesses and memory traffic
– Minimizing either one alone will not give minimal energy dissipation
A tradeoff is possible and practical at compile time, when both quantities are “quantified” accurately and quickly

Related Work on Cache Optimizations
Two primary approaches to determining memory accesses and cache misses:
– Profiling and Static Analysis
Profiling
1. Given a set of inputs (training set),
2. the number of interferences is determined
» by simulation, hardware counters, …
3. The compiler selects a line size and uses it to annotate the code
4. The annotated code is then used in actual execution
– The problem is that often there is not a single optimum line size for different runs of the same code (input dependent)

Related Work … cont
Static Analysis
– Based on loop nests and Cache Miss Equations (CMEs)
1. Cache misses are represented by inequalities
2. Each iteration is checked to verify the inequalities
» yes/no miss
3. The number of iterations/misses is added up
» The loop bounds are needed at this point
Note:
– The complexity of the analysis is proportional to the number of iterations
– This work forms the basis of the parameterized analysis
» because it characterizes the cause of misses
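As a schematic illustration of the cost noted above (this is not the CME formulation itself; is_miss() is a placeholder predicate invented for the sketch), per-iteration counting enumerates every point of the nest, so its running time grows with the loop bounds and the bounds must be known constants:

#include <stdio.h>

/* Placeholder miss predicate: a real analyzer would evaluate the
   cache-miss inequalities for the references at iteration (i, j). */
static int is_miss(int i, int j) {
    return (i + j) % 7 == 0;
}

int main(void) {
    int N = 1000;                       /* loop bounds must be known here   */
    long misses = 0;
    for (int i = 0; i < N; i++)         /* one predicate test per iteration */
        for (int j = 0; j < N; j++)
            misses += is_miss(i, j);
    printf("counted %ld misses over %ld iterations\n", misses, (long)N * N);
    return 0;
}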

Parameterized Loop Nest Analysis for a Direct-Mapped Data Cache
1. Every memory reference is symbolically equated with every possible source of interference
2. A symbolic solution of the equation is sought
– Is there a solution?
– Existence condition for a solution
3. Trade-off between spatial reuse and interference
– If there is a solution, how can we estimate the number of solutions without counting them?
– Bounding the misses by interference
4. Misses = contribution from each reference

Interference and Reuse, explanation
An interference equation represents the set of iterations in the loop nest where two references interfere
– i.e., where there is a miss
Existence conditions for solutions of the equation:
– Find at least one iteration for which the equation is satisfied
Bounding the misses due to interference:
– If there is a solution, we propose the “interference density” to bound the ratio of the iteration solutions over the total number of iterations

Interference Density, explanation
The interference density is a straightforward quantity to determine
– It is a function of the coefficients of the interference equation
It is independent of the loop nest and the definition domain
It is a good upper bound on the cache miss ratio due to the interference equation
– (but only once it is known whether there is interference)

Example

int A[MAX][MAX];   /* In memory starting at address Aoffset */
int B[MAX][MAX];   /* In memory starting at address Boffset */
/* Size of int = 4 bytes, row-major layout */

int empty(int upb) {            /* One parameter */
  int x = 0, i, j;
  for (i = 0; i < upb; i++)     /* Loop bounds: affine functions of upb */
    for (j = 0; j < upb; j++)
      x += (A[i][j+upb] + 1)    /* Memory references: affine functions of */
           * B[i][j];           /* index variables i, j and parameter upb */
  return x;
}

Our Approach

int A[MAX][MAX];
int B[MAX][MAX];
int empty(int upb) { ... x += (A[i][j+upb] + 1) * B[i][j]; ... }

1) Interference equation between A[][] and B[][]:
   Aoffset + MAX*4*i + (j+upb)*4 + n*(mL) + l = Boffset + MAX*4*i + j*4,  with n > 0 and |l| < L
   For example, with Aoffset = 64, Boffset = 8256, MAX = 10 and mL = 8192, the equation becomes:
   4*upb + n*8192 + l = 8192
2) Symbolic solution: D = 4*upb mod 8192; if L > D there is interference (i.e., the L = D annotation)
3) Otherwise, interference density: min(1, 4/L + (L-D)/L)
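A minimal sketch of steps 1–2, assuming the constants on this slide (Aoffset = 64, Boffset = 8256, MAX = 10, 4-byte ints, a direct-mapped cache of m*L = 8192 bytes); the function names are illustrative, not part of STAMINA. It evaluates the symbolic existence condition (D = 4*upb mod 8192, interference if L > D) and cross-checks it by testing whether A[i][j+upb] and B[i][j] ever map to the same cache line:

#include <stdio.h>

enum { AOFF = 64, BOFF = 8256, MAXDIM = 10, CACHE = 8192 };

static int symbolic_interference(int upb, int L) {
    int D = (4 * upb) % CACHE;        /* slide step 2: D = 4*upb mod 8192 */
    return L > D;                     /* existence condition              */
}

static int mapped_interference(int upb, int L) {
    int sets = CACHE / L;             /* number of lines in the cache     */
    for (int i = 0; i < upb; i++)
        for (int j = 0; j < upb; j++) {
            long a = AOFF + (long)(i * MAXDIM + j + upb) * 4;  /* &A[i][j+upb] */
            long b = BOFF + (long)(i * MAXDIM + j) * 4;        /* &B[i][j]     */
            if ((a / L) % sets == (b / L) % sets) return 1;    /* same line    */
        }
    return 0;
}

int main(void) {
    int upb = 2;                      /* runtime parameter of empty()     */
    for (int L = 4; L <= 64; L *= 2)
        printf("L=%2d  symbolic=%d  exhaustive=%d\n",
               L, symbolic_interference(upb, L), mapped_interference(upb, L));
    return 0;
}

For upb = 2 the two checks agree: line sizes of 16 bytes and larger conflict, while 8 bytes and smaller do not.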

Our Approach … cont

int empty(int upb) {
  ...
  for (i = 0; i < upb; i++)
    for (j = 0; j < upb; j++) {
      ...
    }
}

Iteration points: upb^2
4) Misses = interference density * number of iteration points
          = 2 * (4/L + min(1, 1/(upb mod L))) * upb^2
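A minimal sketch that evaluates this miss bound literally for a few line sizes; the parenthesization, the leading factor of 2, and the handling of upb mod L = 0 are assumptions about how to read the slide, not taken from the tool:

#include <stdio.h>

/* Illustrative evaluation of: misses ~= 2 * (4/L + min(1, 1/(upb mod L))) * upb^2 */
static double miss_bound(int upb, int L) {
    int r = upb % L;                            /* "upb mod L" from the slide        */
    double density = 4.0 / L;                   /* spatial/cold-miss term            */
    density += (r == 0) ? 1.0 : 1.0 / r;        /* min(1, 1/r); r = 0 capped at 1
                                                   (assumption, to avoid dividing by 0) */
    return 2.0 * density * (double)upb * upb;   /* times the upb^2 iteration points  */
}

int main(void) {
    int upb = 100;
    for (int L = 8; L <= 64; L *= 2)
        printf("upb=%d  L=%2d  bound on misses = %.0f  (iteration points = %d)\n",
               upb, L, miss_bound(upb, L), upb * upb);
    return 0;
}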

Implementation of STAMINA
STAMINA is a 3-step phase in the ARMR compiler
1. Step I: Code Analysis
– Input:
» the code and the sequence of memory references in the inner loops
– Output:
» loop bounds information: expressions of the bounds
» reference information: index computations and reuse information

Implementation of STAMINA, cont.
Step II: Interference Equation Generation
– Input: Step I output and
» the cache size and array layout
» the sequence of memory references
– Output:
» a set of equations and their domains of validity
Step III: Interference Estimation
– Input: Step II output
– Output:
» interference density and a symbolic enumeration of the iterations in the loop nest

Example: SWIM
Swim: a loop-based code from SPEC 2000
– It cannot be analyzed using CMEs because of unknown loop bounds (introduced as input at run time)
STAMINA analysis takes 1 minute per loop nest
– 4 main loop nests
– The execution of swim takes less than 1 hr
We verified the analysis using the Shade cache simulator
The analysis results:
– For the reference set, swim has no interference:
» independent of the line size
– Larger line size = better performance
– Shorter line size = better energy

Examples: “Self Interference”
“Self Interference” is an artificial example
It consists of 6 loop nests with only one memory reference each
– Self interference
The choice of line size sharply affects the overall miss ratio and energy consumption
– Adaptive = optimal line size for each loop nest
Adaptive line size yields
– Optimum energy consumption
– 1/3 of the total miss ratio

Example “Self Interference”
Adaptive = each loop nest has a different and optimal line size

Example: Matrix Multiply
ijk-Matrix Multiply
STAMINA takes about 2 min for the untiled version and 8 hrs for the tiled one
The tiled MM cannot be analyzed by CMEs
Comparison with the “exact version”

Conclusions
Architectural adaptation presents an opportunity to maximize performance based on application and data needs
Energy consumption can be optimized within the same framework
Compiler analysis and its integration with the runtime system are needed to achieve this
– This work enables an optimum tradeoff between conflict and reuse based on static analysis of nested loops
– The result is a possible trade-off between energy and performance, or energy optimization
The model is validated by the experimental results

Future Work
Find efficient techniques for the symbolic solution of the interference equation
Improve the applicability of the analysis to a broader range of benchmarks

Thank you

Energy Implications
[Figure: energy for data access vs. data traffic towards L2; energy for address access vs. address activations (reads + writes); data miss ratio; memory accesses]
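A purely hypothetical first-order sketch of how these quantities could combine, with invented per-event energy constants, only to illustrate that total energy depends on both address activations (memory accesses) and data traffic towards L2:

#include <stdio.h>

int main(void) {
    /* Invented per-event energies (arbitrary units, illustrative only). */
    const double E_ADDR = 1.0;      /* one address activation (read or write) */
    const double E_DATA = 0.2;      /* moving one byte to/from L2             */

    long   accesses   = 2L * 100 * 100;   /* e.g. two references, upb = 100   */
    double miss_ratio = 0.10;             /* assumed data miss ratio          */
    int    line_size  = 16;               /* bytes transferred per miss       */

    double traffic = accesses * miss_ratio * line_size;   /* bytes to/from L2 */
    double energy  = accesses * E_ADDR + traffic * E_DATA;

    printf("traffic = %.0f bytes, energy = %.1f (arbitrary units)\n",
           traffic, energy);
    return 0;
}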

Data Traffic and Address Activations [figure]

Case I, Interference, first two iterations
Line size L = 16B; cache size C = m*L
[Figure: with upb = 2, A[0][0] and B[0][0] are n*m lines apart; the next iteration would reuse the same line of A[i][j+upb]; every 4 accesses cause 2 misses; shows A[0][0+upb], A[0][1+upb], B[0][0], B[0][1] in cache and in memory]

Case II, Cache Line Adaptation
Line size L' = L/2 = 8B; cache size C = 2m*(L/2)
– Cache size constant
[Figure: with upb = 2, A[0][0] and B[0][0] are 2n*m + 1 lines apart; no interference; every 4 accesses cause 2 misses, with half of the traffic]
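To make Cases I and II concrete, here is a small, hypothetical direct-mapped cache model (names and structure are illustrative, not part of the paper's tooling) that replays the A/B access pattern of the earlier example with the assumed layout (Aoffset = 64, Boffset = 8256, MAX = 10, 4-byte ints, 8192-byte cache) for line sizes of 16 and 8 bytes:

#include <stdio.h>

#define CACHE_BYTES 8192
#define MAXDIM      10

static long tags[CACHE_BYTES];     /* one slot per cache line; -1 = empty */
static int  misses;

static void cache_reset(int nlines) {
    for (int s = 0; s < nlines; s++) tags[s] = -1;
    misses = 0;
}

/* One load in a direct-mapped cache with CACHE_BYTES/L lines of L bytes. */
static void cache_access(long addr, int L) {
    int  nlines = CACHE_BYTES / L;
    long line   = addr / L;
    int  set    = (int)(line % nlines);
    if (tags[set] != line) { misses++; tags[set] = line; }
}

/* Replay the loop nest of empty(): A[i][j+upb] and B[i][j] per iteration. */
static int run(int L, int upb) {
    cache_reset(CACHE_BYTES / L);
    for (int i = 0; i < upb; i++)
        for (int j = 0; j < upb; j++) {
            cache_access(64   + (long)(i * MAXDIM + j + upb) * 4, L); /* A */
            cache_access(8256 + (long)(i * MAXDIM + j) * 4,       L); /* B */
        }
    return misses;
}

int main(void) {
    int upb = 2, accesses = 2 * upb * upb;
    printf("Case I  (L = 16B): %d misses / %d accesses\n", run(16, upb), accesses);
    printf("Case II (L =  8B): %d misses / %d accesses\n", run(8,  upb), accesses);
    return 0;
}

With upb = 2 the 16-byte line shows the conflict between A[i][j+upb] and B[i][j], while halving the line size removes the interference and each miss also transfers half as many bytes.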