Performance Analysis and Optimization through Run-time Simulation and Statistics
Philip J. Mucci
University of Tennessee



Motivation
Tuning real DOD and DOE applications!
– Performance on most codes is low.
– Overall efficiency is poor because single-node performance is poor.
– Codes show good scalability because of the above and faster interconnects.
– The expertise is not there, nor should it be.

Description
Use data available at run time to improve compilation and optimization technology.
Empirically determine how well the code maps to the underlying architecture.
Bottlenecks can be identified and possibly corrected by an explicit set of rules and transformations.

Information not being used
Hardware statistics gathered through simulation or monitoring can identify the problem. (sample listing)
– Cache and branching behavior
– Cycle/Load/Store/FLOP counts
– Bottleneck determination
– Reference pattern
– Dynamic memory placement

Problem Areas
Efficient use of the memory hierarchy
Register re-use
Aliasing
Inlining
Demotion
Algorithms (iterative vs. direct)

Solutions
Understanding (tutorials, reference material)
Tools
Preprocessors
Compilers
Manpower

Increasing Cache Performance
How do we make better use of the memory hierarchy?
For computer scientists, it's not that hard.
We need the right tools.
How much can we automate?
Through available tools and source analysis we can usually narrow a problem down to the function.

Cache Simulation
Instrumentation of routines
Run of the executable
Analysis and correlation with source code!
Old idea, new implementation.

Cache Simulation
Hardware independence
Information on:
– Locality
– Placement
– Reference pattern and reuse
– Line usage
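The idea of simulating the cache on each memory reference can be sketched in a few lines. The following is only an illustration, not the tool described in the slides: it assumes a direct-mapped cache, and the class name, the 64-line by 32-byte geometry, and the address trace are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Toy direct-mapped cache (hypothetical, for illustration only)."""
    num_lines: int = 64      # assumed number of cache lines
    line_size: int = 32      # assumed line size in bytes
    references: int = 0
    misses: int = 0
    cold_misses: int = 0
    resident: dict = field(default_factory=dict)  # slot index -> resident line number
    seen: set = field(default_factory=set)        # every line ever loaded

    def access(self, address: int) -> bool:
        """Simulate one memory reference; return True on a hit."""
        self.references += 1
        line = address // self.line_size
        slot = line % self.num_lines
        if self.resident.get(slot) == line:
            return True
        self.misses += 1
        if line not in self.seen:
            self.cold_misses += 1   # first time this line is ever loaded
            self.seen.add(line)
        self.resident[slot] = line
        return False


cache = DirectMappedCache()
# Two sequential sweeps over 1024 doubles (8 bytes each) starting at address 0.
for _ in range(2):
    for i in range(1024):
        cache.access(i * 8)
print(cache.references, cache.misses, cache.cold_misses)
```

Because the simulated geometry is just a pair of parameters, the same trace can be replayed against different cache shapes, which is one way to read the "hardware independence" point.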

Locality
Spatial and Temporal
– misses/memory reference
– misses/re-use
Conflict vs. Capacity
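One common way to separate conflict misses from capacity misses is to replay the same references through a fully associative LRU cache of equal capacity: a non-cold miss that the fully associative cache would have hit is counted as a conflict miss, otherwise as a capacity miss. The sketch below assumes that classification and a direct-mapped cache under study; whether the tool in the slides does it this way is not stated, and the function name and tiny geometry are hypothetical.

```python
from collections import OrderedDict


def classify_misses(addresses, num_lines=4, line_size=32):
    """Return (cold, capacity, conflict) miss counts for a direct-mapped cache."""
    direct = {}            # slot -> resident line (the cache being studied)
    lru = OrderedDict()    # fully associative LRU cache of the same capacity
    seen = set()           # lines referenced at least once
    cold = capacity = conflict = 0
    for address in addresses:
        line = address // line_size
        slot = line % num_lines
        hit_direct = direct.get(slot) == line
        hit_lru = line in lru
        # Update the fully associative reference cache.
        if hit_lru:
            lru.move_to_end(line)
        else:
            lru[line] = True
            if len(lru) > num_lines:
                lru.popitem(last=False)   # evict the least recently used line
        direct[slot] = line
        if not hit_direct:
            if line not in seen:
                cold += 1
            elif hit_lru:
                conflict += 1   # full associativity would have kept the line
            else:
                capacity += 1   # even full associativity would have missed
        seen.add(line)
    return cold, capacity, conflict


# Two lines that map to the same slot, referenced alternately.
ping_pong = classify_misses([0, 128, 0, 128, 0, 128])
# Two sweeps over eight lines, twice the cache capacity.
sweep = classify_misses([line * 32 for line in range(8)] * 2)
print(ping_pong, sweep)
```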

Placement
Padding can be very important.
It is not always possible to determine during the static analysis phase.
The reference pattern can affect padding.
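A small simulation shows why padding can matter. The sketch assumes a direct-mapped cache of 64 lines of 32 bytes and two arrays that are each exactly one cache size long, so that a[i] and b[i] map to the same slot; the base addresses and helper names are invented for the example.

```python
def count_misses(addresses, num_lines=64, line_size=32):
    """Count misses in a toy direct-mapped cache (assumed geometry)."""
    resident = {}
    misses = 0
    for address in addresses:
        line = address // line_size
        slot = line % num_lines
        if resident.get(slot) != line:
            misses += 1
            resident[slot] = line
    return misses


def dot_trace(base_a, base_b, n, elem=8):
    """Address trace for: for i in range(n): s += a[i] * b[i]."""
    for i in range(n):
        yield base_a + i * elem
        yield base_b + i * elem


CACHE_BYTES = 64 * 32   # 2048 bytes
N = 256                 # 256 doubles = 2048 bytes, exactly one cache size

# b placed directly after a: a[i] and b[i] always collide.
unpadded = count_misses(dot_trace(0, CACHE_BYTES, N))
# b shifted by one cache line of padding: the collisions disappear.
padded = count_misses(dot_trace(0, CACHE_BYTES + 32, N))
print(unpadded, padded)
```

The right amount of padding depends on the order in which the arrays are referenced, which is why a trace from a real run can tell more than static analysis alone.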

Reference Pattern
Again, not always possible to determine during static analysis.
Even harder to analyze when dealing with pseudo-optimized code.
Examples: stencils, sparse solvers, etc.

Reuse
Blocking is critical to applications where there is re-use.
We need to identify re-use potential in order to spot the areas where blocking and register allocation should be focused.
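The effect of blocking can be illustrated with a simulated matrix transpose. This is a sketch under stated assumptions: a fully associative LRU cache of 16 lines of 64 bytes, 32 x 32 matrices of 8-byte elements stored row-major and back to back, and 8 x 8 tiles. None of these values come from the slides.

```python
from collections import OrderedDict


def lru_misses(addresses, num_lines=16, line_size=64):
    """Count misses in a fully associative LRU cache (assumed geometry)."""
    cache = OrderedDict()
    misses = 0
    for address in addresses:
        line = address // line_size
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)   # evict the least recently used line
    return misses


def transpose_trace(n, tile, elem=8):
    """Addresses touched by b[j][i] = a[i][j], visited tile by tile."""
    base_a, base_b = 0, n * n * elem
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    yield base_a + (i * n + j) * elem   # read a[i][j]
                    yield base_b + (j * n + i) * elem   # write b[j][i]


N = 32
naive = lru_misses(transpose_trace(N, tile=N))     # a single tile means no blocking
blocked = lru_misses(transpose_trace(N, tile=8))   # 8 x 8 tiles
print(naive, blocked)
```

In the unblocked order the column-wise walk over b touches more lines than the cache holds before any of them is revisited, so the re-use is lost; with tiles, each line of b is revisited while it is still resident.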

Source Code Mapping
Most cache tools are hard to use and hard to relate to the source code.
This tool simulates the cache(s) on each memory reference and thus can easily correlate the data with the source.
Instrumentation is at the source level, not object code.

Statistics
Global, per file, per statement, per reference
References, misses, cold misses, re-used references
Conflict/Re-use matrix
– M(A,B) = x means that an element of A ejected an element of B from the cache x times, counting only ejections where that element of A had been in the cache before.
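One possible reading of the conflict/re-use matrix is sketched below: on each miss, if the incoming line has been in the cache before and it displaces a resident line, the entry for (incoming array, displaced array) is incremented. This is an interpretation of the definition, not the tool's actual bookkeeping; the direct-mapped geometry, the per-line granularity, and the (array name, address) trace format are assumptions.

```python
from collections import defaultdict


def conflict_matrix(trace, num_lines=64, line_size=32):
    """trace yields (array_name, address) pairs; returns {(A, B): count}."""
    resident = {}               # slot -> (array_name, line)
    seen = set()                # lines that have been in the cache before
    matrix = defaultdict(int)
    for name, address in trace:
        line = address // line_size
        slot = line % num_lines
        occupant = resident.get(slot)
        if occupant is not None and occupant[1] == line:
            continue            # hit: nothing is ejected
        if occupant is not None and line in seen:
            # A previously cached line of `name` ejects a line of occupant[0].
            matrix[(name, occupant[0])] += 1
        seen.add(line)
        resident[slot] = (name, line)
    return dict(matrix)


# Arrays a and b start exactly one cache size apart, so they share slot 0.
trace = [("a", 0), ("b", 2048)] * 3
m = conflict_matrix(trace)
print(m)
```

Large symmetric entries such as M(a,b) and M(b,a) point at two arrays evicting each other's re-usable data, which is the situation where padding or blocking is worth trying.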

Development status
GUI for selective instrumentation
Real parsers (F90, C, C++)
Better report generation

Implementation
Simulator written in C
Instrumentation in Perl
GUI in Java
Report generator in Perl

Relevance
Why shouldn't this technology be part of a feedback loop?
– Compile with instrumentation
– Run
– Recompile with information from the run
– Watch input sensitivity issues.

Integration
Identifying and correcting poor cache behavior can be made explicit and part of a compiler (ideally a source-to-source transformer or preprocessor).
The simulator can stand alone for detailed analysis and optimization by CS folks.
Our knowledge and expertise are made available through the tools.

Hardware Counters
Virtually every available processor has hardware counters.
The interfaces and documentation are poor or non-existent.
Hardware differs greatly, as do the semantics.
Useful for measurement, analysis, optimization, modeling and benchmarking.

Performance Data Standard
Standardize an API to obtain hardware performance counters
Standardize the definitions of what those counters mean
API is lightweight and portable

Performance Data Standard
Target platforms
– R10K, R12K
– P2SC, PowerPC 604e, Power3
– Sun Ultra 2/3
– Intel PII, Katmai, Merced
– Alpha 21164, 21264

Performance Data Standard
Motivation
– Portable performance tools
– Optimization through feedback
– Developers wanting simple and accurate timing and statistics
– Modeling, evaluation

Performance Data Standard
Small number of useful measurement points
– Timing (cycles, microseconds)
– I/D cache misses, invalidations
– Branch mispredictions
– Load, store, FLOP, instruction counts
– I/D TLB misses
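A handful of raw counts like these is enough to derive the usual rates. The sketch below only shows the arithmetic; the dictionary keys are hypothetical names rather than part of any counter interface, and the sample readings are invented for illustration.

```python
def derived_metrics(counts):
    """Turn raw counter readings into rates (key names are hypothetical)."""
    memory_refs = counts["loads"] + counts["stores"]
    return {
        # FLOPs per microsecond is numerically the same as MFLOP/s.
        "mflops": counts["flops"] / counts["microseconds"],
        "instructions_per_cycle": counts["instructions"] / counts["cycles"],
        "dcache_misses_per_reference": counts["dcache_misses"] / memory_refs,
    }


# Made-up readings, not measurements from any real machine.
sample = {
    "cycles": 2_000_000,
    "microseconds": 10_000,
    "instructions": 1_000_000,
    "flops": 500_000,
    "loads": 300_000,
    "stores": 100_000,
    "dcache_misses": 20_000,
}
metrics = derived_metrics(sample)
print(metrics)
```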

Performance Data Standard
API
Efficient counter multiplexing
Thread safety
Functions for
– start, stop, reset, get, accumulate, query, control
Use the best available vendor-supported interface or API
Possible pairing with DAIS, Dyninst for naming
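To make the listed operations concrete, here is a purely hypothetical software stand-in. It is not the proposed API: the class name, method signatures, and event names are invented, `record` merely plays the role of the hardware, `accumulate` is assumed to add the current counts into caller-owned totals and then zero the counters, and `control` is omitted because the slides do not say what it covers.

```python
class CounterSet:
    """Hypothetical counter interface; a software mock, not a real API."""

    def __init__(self, events):
        self._counts = {event: 0 for event in events}
        self._running = False

    def query(self, event):
        """Report whether an event is available in this set."""
        return event in self._counts

    def start(self):
        self._running = True

    def stop(self):
        self._running = False

    def reset(self):
        for event in self._counts:
            self._counts[event] = 0

    def get(self):
        """Return a snapshot of the current counts."""
        return dict(self._counts)

    def accumulate(self, totals):
        """Add current counts into totals, then zero the counters (assumed semantics)."""
        for event, value in self._counts.items():
            totals[event] = totals.get(event, 0) + value
        self.reset()

    def record(self, event, amount=1):
        """Stand-in for the hardware: counts only while the set is running."""
        if self._running and event in self._counts:
            self._counts[event] += amount


counters = CounterSet(["loads", "flops"])
totals = {}
for _ in range(3):
    counters.start()
    counters.record("loads", 10)
    counters.record("flops", 4)
    counters.stop()
    counters.record("loads", 99)   # ignored: the counters are stopped
    counters.accumulate(totals)
print(totals, counters.get())
```

Separating `get` from `accumulate` lets a tool either sample running counters or sum them across repeated regions without doing the bookkeeping itself.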

Development status
Research on the hardware and interfaces available on the various machines
Compilation of findings, web page and mailing list
API specification to appear mid-August for discussion
Vendors are lurking

Deliverables
API for O2K, T3E, SP
Portable prof implementation

People
Shirley Browne (UT)
Jeff Brown (LANL)
Jeff Durachta (IBM, LANL)
Christopher Kerr (IBM, LANL)
George Ho (UT)
Kevin London (UT)
Philip Mucci (UT, Sandia)

Rice/UTK Collaboration
[Diagram: Rice, UT, DOE, DOD. Labels: Low level support, Optimization Technology, Tools, Apps]

Deliverables
F90, C++ preprocessor with feedback
Cache tool
Analysis and optimization of poor codes
Performance API