Kristof Beyls, Erik D’Hollander, Frederik Vandeputte ICCS 2005 – May 23 RDVIS: A Tool That Visualizes the Causes of Low Locality and Hints Program Optimizations.

Overview
1. Motivation: the cache bottleneck
2. Some theoretical background: reuses
3. View 1: cache-missing reuses
4. View 2: reuse pair clusters corresponding to program optimizations
5. Experimental results
6. Implementation details
7. Conclusion

1. Motivation
Many programs incur a large cache bottleneck, mainly caused by poor locality (temporal or spatial). Temporal locality is hard to optimize automatically in a compiler. Therefore, the programmer needs help to pinpoint the sources of low temporal locality.

2. Theoretical background
Stream of memory accesses:
    accesses:     a     b     c     a     a     b
    references:   r1    r1    r2    r1    r1    r1
    basic block:  bb1   bb1   bb2   bb1   bb1   bb1
Reuses / reuse distance
Reference pair / reference pair histogram
Basic block vector (BBV) of intermediately executed code
Cache miss ⇔ reuse distance ≥ cache size

2. Theoretical background
Stream of memory accesses:
    accesses:     a     b     c     a     a     b
    references:   r1    r1    r2    r1    r1    r1
    basic block:  bb1   bb1   bb2   bb1   bb1   bb1
Reuses / reuse distance
Reference pair / reference pair histogram
Basic block vector (BBV) of intermediately executed code
The reference pair r1–r1 spans a reuse of a; its reuse distance is the number of distinct addresses accessed in between (here {b, c}, so distance 2).
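The reuse-distance definition above can be sketched in a few lines of Python (an O(n²) toy for illustration only; RDVIS itself measures reuse distances on instrumented compiled C programs):

```python
def reuse_distances(accesses):
    """For each access, the number of distinct addresses touched since the
    previous access to the same address (None marks a first use, i.e. an
    infinite reuse distance)."""
    distances = []
    for i, addr in enumerate(accesses):
        # Find the previous access to the same address, scanning backwards.
        prev = None
        for j in range(i - 1, -1, -1):
            if accesses[j] == addr:
                prev = j
                break
        if prev is None:
            distances.append(None)  # cold: no earlier access to this address
        else:
            # Distinct addresses strictly between use and reuse.
            distances.append(len(set(accesses[prev + 1:i])))
    return distances

# The slide's example stream "a b c a a b":
print(reuse_distances(list("abcaab")))  # → [None, None, None, 2, 0, 2]
```

The second reuse of a (index 3) sees {b, c} in between, giving distance 2, which matches the slide's annotation; a cache miss occurs exactly when such a distance is at least the cache size (for a fully associative LRU cache).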

3. RDVIS by example: matrix multiplication
Reuses of a[i*N+k] occur at distance 2^9; reuses of b[k*N+j] at distance 2^17. How can the reuses of b[k*N+j] be brought closer together? What separates the reuses — i.e., what code is executed between them?

3. RDVIS by example: matrix multiplication
The reuses occur between iterations of the i-loop. Solution: bring the iterations of the i-loop inwards.
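The suggested transformation ("bring the i-loop inwards") can be sketched in Python for brevity, keeping the slides' C-style flat indexing a[i*N+k]; the matrix size and tile size here are illustrative assumptions, not values from the slides:

```python
N = 8      # illustrative matrix size (assumption)
TILE = 2   # illustrative tile size for the i-loop (assumption)

def matmul_ijk(a, b):
    # Original loop order: a given b[k*N+j] is only reused on the next
    # i-iteration, i.e. after a full sweep of the j- and k-loops.
    c = [0.0] * (N * N)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                c[i * N + j] += a[i * N + k] * b[k * N + j]
    return c

def matmul_i_inward(a, b):
    # Tile the i-loop and move it inwards: each b[k*N+j] is now reused
    # on consecutive inner iterations, shrinking its reuse distance.
    c = [0.0] * (N * N)
    for ii in range(0, N, TILE):
        for j in range(N):
            for k in range(N):
                for i in range(ii, min(ii + TILE, N)):
                    c[i * N + j] += a[i * N + k] * b[k * N + j]
    return c
```

Both versions sum exactly the same products, so the results agree; only the order of the accesses (and hence the reuse distances of b) changes.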

3. RDVIS by example: matrix multiplication
Next to optimize: the reuses of a[i*N+k].

3. Matrix multiplication: final result
(Histogram regions: L1 cache, L2 cache, main memory.)
Execution time on a Pentium 4: original 0.740 s, optimized 0.223 s. Speedup: 3.3.

4. Cluster analysis
In more complex programs there can be many arrows, and many of them can often be optimized by the same program transformation. Key idea: "When the same code is executed between use and reuse, probably the same program transformation is needed."

4. Cluster analysis by example: equake
Many different arrows contribute to long-distance reuse.

2 (bis). Theoretical background
Stream of memory accesses:
    accesses:     a     b     c     a     a     b
    references:   r1    r1    r2    r1    r1    r1
    basic block:  bb1   bb1   bb2   bb1   bb1   bb1
Reuses / reuse distance
Reference pair / reference pair histogram
Basic block vector (BBV) of the intermediately executed code of a reference pair:
    BBV(reference pair r1–r1)
    Basic block:            bb1    bb2
    % exec. betw. reuses:   66 %   33 %
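The clustering idea can be sketched as distance-based grouping of these BBVs; the Euclidean metric, the threshold, and the greedy single-link scheme below are assumptions standing in for whatever clustering RDVIS actually uses:

```python
import math

def bbv_distance(u, v):
    # Euclidean distance between two basic-block vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_pairs(bbvs, threshold=0.2):
    """Greedily group reference pairs whose intermediately executed code
    looks alike; each such cluster is a candidate for a single program
    transformation. `bbvs` maps a pair name to its BBV."""
    clusters = []
    for name, vec in bbvs.items():
        for cl in clusters:
            # Join the first cluster containing a sufficiently close member.
            if any(bbv_distance(vec, bbvs[m]) < threshold for m in cl):
                cl.append(name)
                break
        else:
            clusters.append([name])  # no close cluster: start a new one
    return clusters

# Hypothetical BBVs over two basic blocks (bb1, bb2):
bbvs = {"r1-r1": [0.34, 0.66], "r2-r2": [0.30, 0.70], "r3-r3": [0.90, 0.10]}
print(cluster_pairs(bbvs))  # → [['r1-r1', 'r2-r2'], ['r3-r3']]
```

Here r1-r1 and r2-r2 execute nearly the same code between their reuses, so they land in one cluster and would be shown as one group of arrows.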

4. Cluster Analysis by example: equake LOOP FUSION!

5. Experimental Results

6. Some implementation details
Instrumentation added to GCC 4:
– Exact source location info is added to all abstract syntax tree nodes.
– Source location info is added in the language-specific front-end (currently only C; Fortran support is being added).
– Instrumentation occurs in the language-independent middle-end. It:
    – inserts a function call for each memory reference,
    – inserts a function call at the beginning of each basic block,
    – writes out source location info for memory references and basic blocks.
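A minimal sketch of the runtime side that such inserted calls could feed, in Python for brevity (the class and method names are hypothetical illustrations, not GCC's or RDVIS's actual API):

```python
class TraceRecorder:
    """Collects the access stream that the reuse-distance analysis needs:
    one event per memory reference, tagged with the current basic block."""

    def __init__(self):
        self.events = []        # (address, reference id, basic block id)
        self.current_bb = None

    def enter_basic_block(self, bb_id):
        # Would be called by the call inserted at the start of each basic block.
        self.current_bb = bb_id

    def record_access(self, address, ref_id):
        # Would be called by the call inserted for each memory reference.
        self.events.append((address, ref_id, self.current_bb))

# Hypothetical replay of the slides' example stream:
rec = TraceRecorder()
rec.enter_basic_block("bb1")
rec.record_access(0xA, "r1")
rec.record_access(0xB, "r1")
rec.enter_basic_block("bb2")
rec.record_access(0xC, "r2")
```

From such a trace, the reuse distances and the BBVs of intermediately executed code from section 2 can both be computed offline.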

7. Conclusion
The visualization indicates reuses at a long distance, and the code that is executed between those reuses. Clustering the intermediately executed code groups reference pairs that can be optimized by the same program transformation. Give RDVIS a try:

QUESTIONS?

MCF

AMMP