Harini Ramaprasad, Frank Mueller North Carolina State University Center for Embedded Systems Research Bounding Worst-Case Data Cache Behavior by Analytically.

Presentation transcript:

Harini Ramaprasad, Frank Mueller North Carolina State University Center for Embedded Systems Research Bounding Worst-Case Data Cache Behavior by Analytically Deriving Cache Reference Patterns

2 Motivation
Timing Analysis
- Calculation of worst-case execution times (WCETs) of tasks
- Required for scheduling of real-time tasks
  - Schedulability theory requires a priori knowledge of WCET
- Estimates need to be safe
- Static timing analysis: an efficient method to calculate the WCET of a program

3 Motivation
Static Timing Analysis
- Traverse all paths in the program
- Calculate a safe estimate for WCET
- Unpredictability is introduced by:
  - Data-dependent control flow
  - Pointer accesses
  - Modern architectural features: branch predictors, instruction caches, data caches, ...
Data caches should be exploited → dilemma: speedup vs. unpredictability

4 Static Timing Analyzer
- Framework to calculate the WCET of a task
- Instruction cache is well analyzed
- No good data cache analyzer
[Figure: Static Timing Analysis framework]

5 Existing Solutions for D$ Analysis
- Trace-based simulation
- Static cache simulation [White et al.]
- Data flow analysis [Li et al.]
- Cache locking [Lisper et al., Decotigny et al.]
- Analytical techniques
  - Characterize data cache behavior [Ghosh et al., Fraguela et al., Chatterjee et al.]
→ All have shortcomings

6 Cache Miss Equations (CMEs)
- Characterize data cache behavior statically
- Used for loop-nest oriented code
- Produce a set of linear equations that relate
  - Iteration space
  - Cache parameters
  - Memory references
- Solutions give potential miss points

7 More about CMEs
Types of CMEs
- Cold miss equations
  - Capture misses on the first access to a memory line
- Replacement miss equations
  - Capture interference between two references
Solving CMEs
- Direct solutions are not practical: high complexity
- Complexity is reduced in implementations

8 Terminology
Iteration point
- Represents one iteration of a loop nest
- The set of all iteration points is the iteration space
Reuse vectors
- Represent data reuse across different iteration points
- Self/group reuse and temporal/spatial reuse
[Figure: iteration space for matrix multiplication code; labels: (0, 1, 0) and r = (0, 0, 1)]

9 Assumptions in CME Framework
- Loop-nest oriented code
- Perfectly nested loops
- Loop bounds
  - Known at compile time
  - Affine functions of loop induction variables
- No data-dependent conditionals

10 CME Implementation
- Framework used: Coyote [Vera et al.]
- Output: estimate of the number of misses for every reference in the loop nest
- Works only for perfectly nested, rectangular loops
- Results are slightly pessimistic
- No data-dependent conditionals are allowed

11 Arbitrary Loop Nests
Existing approach
- Loop nests → sequential loop nests (equal depth)
  - Move all references in a loop nest to the innermost level
  - Introduce conditionals to ensure correctness
- Disadvantages
  - Reuse representation needs modification
  - Representation across iteration spaces is required
  - CME analysis needs some modification

12 Our Approach: Forced Loop Fusion
"Forced" loop fusion
- Produce a single loop nest
- Concatenate the iteration spaces of sequential loops
- Introduce conditionals based on loop induction variables to maintain correctness
Advantages
- No change to the reuse representation
- Original CME framework is reused

13 Example of Forced Loop Fusion

Original loop nest:

    for (i = 1; i <= 10; i++) {
        for (j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7;
    }
    for (l = 1; l <= 10; l++)
        for (m = 1; m <= 5; m++)
            D[l][m] = 13;

Transformed loop nest:

    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }
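The equivalence of the two loop nests can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names and the 11x11 zero-initialized arrays are assumptions, and the second inner loop reads A[i][k] so that it matches the A[i][j-5] reference of the fused version.

```c
#include <string.h>

#define N 11  /* arrays are indexed 1..10, so allocate 11 per dimension (assumption) */

/* Original code: two sequential loop nests, the first imperfectly nested. */
static void run_original(int A[N][N], int D[N][N]) {
    for (int i = 1; i <= 10; i++) {
        for (int j = 1; j <= 5; j++)
            A[i][j] = 19;
        for (int k = 1; k <= 10; k++)
            D[i][k] = A[i][k] + 7;
    }
    for (int l = 1; l <= 10; l++)
        for (int m = 1; m <= 5; m++)
            D[l][m] = 13;
}

/* Fused code: one 20x20 nest, with guards selecting the original statement. */
static void run_fused(int A[N][N], int D[N][N]) {
    for (int i = 1; i <= 20; i++)
        for (int j = 1; j <= 20; j++) {
            if (i >= 1 && i <= 10 && j >= 1 && j <= 5)
                A[i][j] = 19;
            if (i >= 1 && i <= 10 && j >= 6 && j <= 15)
                D[i][j - 5] = A[i][j - 5] + 7;
            if (i >= 11 && i <= 20 && j >= 16 && j <= 20)
                D[i - 10][j - 15] = 13;
        }
}

/* Returns 1 if both versions leave A and D in the same final state. */
static int fusion_preserves_results(void) {
    int A1[N][N], D1[N][N], A2[N][N], D2[N][N];
    memset(A1, 0, sizeof A1); memset(D1, 0, sizeof D1);
    memset(A2, 0, sizeof A2); memset(D2, 0, sizeof D2);
    run_original(A1, D1);
    run_fused(A2, D2);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(D1, D2, sizeof D1) == 0;
}
```

Calling fusion_preserves_results() compares only the final array contents; it says nothing about cache behavior, which is what the CME analysis addresses.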

14 More Conceptual Enhancements
Non-rectangular loops
- Insert a conditional based on the loop induction variable
- Treat the conditional like those introduced by loop fusion
Data-dependent conditionals
- Provide an upper bound on the number of misses
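As an illustration of the non-rectangular case, the sketch below rewrites a triangular loop as a rectangular one guarded by a conditional on the induction variables. The triangular loop, the function names, and the parameter n are hypothetical and chosen only for this example.

```c
/* Non-rectangular (triangular) nest: the inner bound depends on i.
   Returns how many times the loop body executes. */
static int triangular_count(int n) {
    int executed = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= i; j++)
            executed++;          /* stands in for the loop body */
    return executed;
}

/* Rectangular nest with a guard on the induction variables.
   Every (i, j) point is visited, but the body runs only where j <= i. */
static int guarded_rectangular_count(int n) {
    int executed = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (j <= i)
                executed++;      /* same body, now under a conditional */
    return executed;
}
```

Both functions execute the body the same number of times; the guarded version has a rectangular iteration space, which is the shape the slides say the CME implementation requires.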

15 Pessimism in CMEs
- Does not analyze all iteration points
  - Trades off accuracy (pessimism) for speed of analysis
- Does not exploit certain kinds of reuse

16 Deriving Exact Cache Patterns
- Analyze all iteration points
  - Acceptable overhead: single pass, static
- Reanalyze "compulsory misses"
  - Eliminate pessimistic misses
- Verify correctness of "hits"
  - Necessitated by the conditionals introduced while fusing loops

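17 Illustrative Example
Cache configuration
- 1 KB cache, direct mapped, 32-byte line size
Information on variables
[Table: Reference | Dim LB & UB | Base address | Size of element. Rows for A (1..10, ...) and B (1..10, ...); remaining cell values not preserved in the transcript]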
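For the cache configuration on this slide, the mapping from a byte address to a memory line and a cache set can be sketched as follows. The helper names and the row-major array layout are assumptions for illustration; the slide's base addresses and element sizes were not preserved, so they are left as parameters.

```c
#include <stdint.h>

/* Cache parameters from the slide: 1 KB, direct mapped, 32-byte lines. */
enum { CACHE_SIZE = 1024, LINE_SIZE = 32, NUM_SETS = CACHE_SIZE / LINE_SIZE };

/* Memory line that contains a given byte address. */
static uint32_t memory_line(uint32_t addr) {
    return addr / LINE_SIZE;
}

/* Cache set of an address; in a direct-mapped cache each set holds one line. */
static uint32_t cache_set(uint32_t addr) {
    return memory_line(addr) % NUM_SETS;
}

/* Byte address of X[i][j] for a row-major array (layout is an assumption). */
static uint32_t element_addr(uint32_t base, uint32_t i, uint32_t j,
                             uint32_t ncols, uint32_t elem_size) {
    return base + (i * ncols + j) * elem_size;
}
```

With these parameters there are 32 sets, so two addresses that are exactly 1024 bytes apart always fall into the same set.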

18 Sample Program

    for (i = 1; i <= 20; i++)
        for (j = 1; j <= 20; j++) {
            if ((i >= 1) && (i <= 10) && (j >= 1) && (j <= 5))
                A[i][j] = 19;
            if ((i >= 1) && (i <= 10) && (j >= 6) && (j <= 15))
                D[i][j-5] = A[i][j-5] + 7;
            if ((i >= 11) && (i <= 20) && (j >= 16) && (j <= 20))
                D[i-10][j-15] = 13;
        }
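To see what a per-iteration-point hit/miss classification of this program looks like, the sketch below runs a small direct-mapped cache simulation over the fused loop nest. This is a trace-style simulation, not the analytical CME-based method of the slides. The base addresses, the 4-byte element size, the 11x11 row-major layout, the write-allocate policy, and the read-before-write order in the second statement are all assumptions, so the counts it produces are not the slides' results.

```c
#include <stdint.h>

enum { CACHE_SIZE = 1024, LINE_SIZE = 32, NUM_SETS = CACHE_SIZE / LINE_SIZE };
enum { DIM = 11, ELEM_SIZE = 4 };            /* assumed array shape and element size */
enum { BASE_A = 0x1000, BASE_D = 0x2000 };   /* assumed base addresses */

typedef struct { long hits; long misses; } Counts;

static long tags[NUM_SETS];  /* memory line cached in each set, -1 if empty */

/* Record one access to a direct-mapped, write-allocate cache. */
static void access_mem(uint32_t addr, Counts *c) {
    long line = (long)(addr / LINE_SIZE);
    int set = (int)(line % NUM_SETS);
    if (tags[set] == line) {
        c->hits++;
    } else {
        c->misses++;
        tags[set] = line;    /* replace whatever the set held before */
    }
}

static uint32_t addr_of(uint32_t base, int i, int j) {
    return base + (uint32_t)(i * DIM + j) * ELEM_SIZE;
}

/* Walk every iteration point of the fused nest and classify each reference. */
static Counts simulate_fused_nest(void) {
    Counts c = {0, 0};
    for (int s = 0; s < NUM_SETS; s++)
        tags[s] = -1;
    for (int i = 1; i <= 20; i++)
        for (int j = 1; j <= 20; j++) {
            if (i >= 1 && i <= 10 && j >= 1 && j <= 5)
                access_mem(addr_of(BASE_A, i, j), &c);            /* write A[i][j] */
            if (i >= 1 && i <= 10 && j >= 6 && j <= 15) {
                access_mem(addr_of(BASE_A, i, j - 5), &c);        /* read A[i][j-5] */
                access_mem(addr_of(BASE_D, i, j - 5), &c);        /* write D[i][j-5] */
            }
            if (i >= 11 && i <= 20 && j >= 16 && j <= 20)
                access_mem(addr_of(BASE_D, i - 10, j - 15), &c);  /* write D[i-10][j-15] */
        }
    return c;
}
```

The four references together perform 300 accesses (50 + 100 + 100 + 50), so hits and misses always sum to 300 regardless of the assumed addresses.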

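19 Analysis Results
Original CME Framework vs. Our Framework
[Table: Reference | Coyote output | Output of our framework. Recoverable entries are miss patterns totaling 6 misses, 7 misses, 13 misses, and 0 misses; the assignment of these entries to references and columns is not preserved in the transcript]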

20 New Timing Analyzer Framework Extension to Coyote

21 Implications for the TA
- A miss count (n) is given for each reference
- All n misses are clustered at the beginning
  - Avoids handling a complex miss/hit pattern
- More iterations are required to reach steady state
  - The number of iterations could exceed the upper bound of the innermost loop's induction variable
  - Propagation to the outer loop level is required
  - Requires extra handling in the TA

22 TA Example

Sample program:

    for (i = 1; i <= 10; i++)
        for (j = 1; j <= 10; j++)
            A[i][j] = 19;

D$ analyzer output: number of misses for A[i][j] = 13

Let
- Miss time = time of the j loop considering A[i][j] as a miss
- Hit time = time of the j loop considering A[i][j] as a hit

Miss pattern, one row per value of i:

    MMMMMMMMMM
    MMM.......
    .......... (continues)

[Table: Value of i | #iters with Miss time for j loop | #iters with Hit time for j loop; cell values not preserved in the transcript]
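The clustering of the first n misses across outer-loop iterations can be sketched as below. This is an illustrative reading of the slide, not the timing analyzer's actual implementation: the type and function names are hypothetical, the outer index is 0-based, and miss_time and hit_time are treated as per-iteration costs of the j loop, which is an assumption.

```c
typedef struct { int miss_iters; int hit_iters; } InnerLoopSplit;

/* For one execution of the inner loop, return how many of its iterations
   are charged Miss time and how many Hit time, given that the first
   n_misses iterations of the whole nest are misses.
   outer_index is 0-based (i = 1 on the slide corresponds to 0). */
static InnerLoopSplit split_for_outer_iter(int n_misses, int inner_trip,
                                           int outer_index) {
    int remaining = n_misses - outer_index * inner_trip;
    if (remaining < 0) remaining = 0;
    if (remaining > inner_trip) remaining = inner_trip;
    InnerLoopSplit s = { remaining, inner_trip - remaining };
    return s;
}

/* Total cost of the nest under the "first n misses" categorization. */
static long loop_nest_cycles(int n_misses, int outer_trip, int inner_trip,
                             long miss_time, long hit_time) {
    long total = 0;
    for (int i = 0; i < outer_trip; i++) {
        InnerLoopSplit s = split_for_outer_iter(n_misses, inner_trip, i);
        total += s.miss_iters * miss_time + s.hit_iters * hit_time;
    }
    return total;
}
```

For the slide's example (13 misses, inner bound 10), the first outer iteration is charged 10 miss iterations, the second 3 miss and 7 hit iterations, and every later one 10 hit iterations, which is the propagation to the outer loop level that the previous slide mentions.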

23 Experimental Results
Original CME vs. our framework vs. trace-driven simulation, 4 KB cache
[Table: Benchmark | CME Framework (Misses, Hits) | Our Framework (Misses, Hits) | Simulator (Misses, Hits). Benchmarks: convolution, dot product, fir, lms, matrix, nrealupdates, simple-srt-test, loop test; cell values not preserved in the transcript]

24 Experimental Results
Timing analyzer results [cycles] for various cache categorizations
- The "cold miss" category is unsafe
- The "always miss" category is overly pessimistic
- Use of our output (n misses) gives safe and often tighter estimates
[Table: Benchmark | Always Miss | First N Misses | Cold misses, for 256 B, 1 KB, and 4 KB caches. Benchmarks: convolution, dot product, fir, lms, matrix, nrealupdates, simple-srt, loop test; cell values not preserved in the transcript]

25 Conclusions & Future Work
Contributions:
1. Exact data cache reference patterns
2. Wider applicability
   - Allows arbitrary loop nests using "forced" loop fusion
   - Allows non-rectangular loops
   - Allows certain data-dependent conditionals
3. Integration of outputs with the static timing analyzer framework
Future work:
- Explore larger benchmarks
- Test the framework with set-associative caches
- Extend the idea to L2 caches

26 Thank you! Questions?

27 Cache Miss Equations
Loop bounds and array subscript expressions
- Affine combinations of loop induction variables
  - An affine equation is a non-homogeneous linear equation
  - Non-homogeneous → isolated constants are allowed
CMEs are linear Diophantine equations
- Diophantine equation: only integer solutions are allowed
- A linear Diophantine equation is of the form ax + by = c, where a, b, and c are constants and x and y are variables
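To make the Diophantine form concrete, the sketch below finds one integer solution of ax + by = c with the extended Euclidean algorithm. It illustrates the equation type only; it is not how Coyote or the authors' framework solves CMEs, and the function names are hypothetical.

```c
/* Extended Euclid: returns g = gcd(a, b) (possibly negative for negative
   inputs) and sets *x, *y so that a * (*x) + b * (*y) == g. */
static long ext_gcd(long a, long b, long *x, long *y) {
    if (b == 0) {
        *x = 1;
        *y = 0;
        return a;
    }
    long x1, y1;
    long g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

/* Finds one integer solution of a*x + b*y = c.
   Returns 1 and fills *x, *y if a solution exists, 0 otherwise. */
static int solve_linear_diophantine(long a, long b, long c, long *x, long *y) {
    if (a == 0 && b == 0) {
        *x = 0;
        *y = 0;
        return c == 0;
    }
    long g = ext_gcd(a, b, x, y);
    if (c % g != 0)
        return 0;            /* integer solutions exist only if gcd divides c */
    *x *= c / g;
    *y *= c / g;
    return 1;
}
```

For example, 6x + 4y = 10 has integer solutions because gcd(6, 4) = 2 divides 10, while 6x + 4y = 7 has none.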

28 Cache Miss Equations
Cold miss equations
- Memory line of Ra(i) [...] Memory line of Ra'(p)
  - i and p are iteration points; Ra and Ra' are references
Replacement miss equations (intuition)
- Cache set for Ra = cache set for Rb
- Mem addr of Ra = Mem addr of Rb + n × cache size + line size range
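The replacement-miss intuition can be checked for two concrete addresses in a direct-mapped cache. The sketch below assumes the 1 KB cache with 32-byte lines from the illustrative example; the function names are hypothetical. Two references can evict each other when they map to the same set but to different memory lines, that is, when their line indices differ by a nonzero multiple n of the number of sets.

```c
#include <stdint.h>

enum { CACHE_SIZE = 1024, LINE_SIZE = 32, NUM_SETS = CACHE_SIZE / LINE_SIZE };

/* True if both addresses fall into the same memory line. */
static int same_memory_line(uint32_t a, uint32_t b) {
    return a / LINE_SIZE == b / LINE_SIZE;
}

/* True if both addresses map to the same set of the direct-mapped cache. */
static int same_cache_set(uint32_t a, uint32_t b) {
    return (a / LINE_SIZE) % NUM_SETS == (b / LINE_SIZE) % NUM_SETS;
}

/* True if an access to b can replace the line that holds a:
   same set, but a different memory line. */
static int may_replace(uint32_t a, uint32_t b) {
    return same_cache_set(a, b) && !same_memory_line(a, b);
}
```

Addresses 0 and 1024 + 31 conflict (n = 1, offset inside the line), whereas 0 and 16 share a line and 0 and 32 map to different sets.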