A Software Performance Monitoring Tool
Daniele Francesco Kruse, March 2010

Summary
1. Performance Monitoring
2. Performance Counters
3. Perfmon2
4. Cycle Accounting Analysis
5. Problems
6. Overall Analysis
7. Symbol Level Analysis
8. Module Level Analysis
9. Modular Symbol Level Analysis
10. What can we do with counters?
11. The 3-step empirical optimization procedure
12. Conclusions
13. What's next?

Performance Monitoring
DEF: the action of collecting information related to how an application or system performs
HOW: obtain micro-architectural level information from hardware performance counters
WHY: to identify bottlenecks, and possibly remove them, in order to improve application performance

Performance Counters
All recent processor architectures include a processor-specific PMU
The Performance Monitoring Unit contains several performance counters
Performance counters are able to count micro-architectural events from many hardware sources (CPU pipeline, caches, bus, etc.)
In the Intel Core microarchitecture: 3 fixed counters and 2 programmable counters

Perfmon2
A generic API to access the PMU (libpfm)
Developed by Stéphane Eranian
Portable across all new processor micro-architectures
Supports system-wide and per-thread monitoring
Supports counting and sampling
Software stack (from the slide diagram): in user space, pfmon and other libpfm-based applications sit on top of libpfm; below them, the Linux kernel provides the generic and architectural perfmon layers, which program the PMU in the CPU hardware
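To make the counting workflow concrete, here is a minimal sketch of turning an event name into a programmed counter. It assumes the later libpfm4-style helper pfm_get_perf_event_encoding() together with the kernel perf_event interface; the setup described in the talk (libpfm3 plus the perfmon2 kernel patch) uses different calls, so treat this only as an illustration of the idea.

    // Minimal counting sketch (libpfm4-style names assumed, not the perfmon2 API from the talk).
    #include <perfmon/pfmlib.h>
    #include <perfmon/pfmlib_perf_event.h>
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        if (pfm_initialize() != PFM_SUCCESS) return 1;

        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        // Translate the symbolic event name into a raw encoding for this CPU's PMU.
        int ret = pfm_get_perf_event_encoding("UNHALTED_CORE_CYCLES", PFM_PLM3,
                                              &attr, nullptr, nullptr);
        if (ret != PFM_SUCCESS) { std::fprintf(stderr, "%s\n", pfm_strerror(ret)); return 1; }
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */, -1, -1, 0);
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        // ... workload to measure ...
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        std::printf("UNHALTED_CORE_CYCLES: %lld\n", cycles);
        close(fd);
        return 0;
    }

Compile against libpfm (e.g. -lpfm); perf_event_open has no glibc wrapper, hence the raw syscall.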

Cycle Accounting Analysis
Total Cycles (application total execution time) divide into:
- Issuing μops: retiring μops (useful work) and not retiring μops (useless work)
- Not issuing μops: stalled cycles (no work), attributed to sources such as store-fwd, L2 miss, L2 hit, LCP and L1 TLB miss
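In symbols (my notation, not the speaker's), the decomposition amounts to:

\[
C_{\text{total}} = C_{\text{retiring}} + C_{\text{not retiring}} + C_{\text{stalled}},
\qquad
C_{\text{stalled}} \approx \sum_{e} \text{penalty}_e \times \text{count}_e
\]

where e runs over the stall sources listed above and penalty_e is the typical cost in cycles of one such event on the given microarchitecture; those approximate penalties are what introduce the 5% - 20% errors discussed on the next slide.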

Problems
The stall decomposition approach implies a simplification: other kinds of stalls also occur
The performance impact of stalls may carry different degrees of error (5% - 20%)
Is the whole the sum of its parts? Usually not: some events may overlap, or their impact may be over-estimated
Values read from performance counters are good approximations of the real values (1% - 3%)
Nevertheless, although quantities may be over- or under-estimated, they allow a good qualitative analysis

The 4-way Performance Monitoring
1. Overall Analysis
2. Symbol Level Analysis
3. Module Level Analysis
4. Modular Symbol Level Analysis
These four approaches combine two scopes (overall, via pfmon, vs. modular) with two modes (counting vs. sampling)

Overall Analysis
Uses pfmon and is based on the Cycle Accounting Analysis
Good for showing overall performance and for checking improvements
Good for identifying general software problems
Good for comparing different versions of the code
NOT enough for finding inefficient parts of the software or bad programming practices

Overall Analysis (results figure)

Symbol Level Analysis
Uses the sampling capabilities of pfmon
Good for identifying general bad programming practices
Can identify problems in functions that are frequently used
Shows the functions that use most of the execution cycles and the functions that spend a lot of time doing nothing (stalling)
NOT good for finding specific problems in the code

Symbol Level Analysis: two ranked tables, Total Cycles and Stalled Cycles (columns: counts, %self, symbol; numeric values not reproduced here)
Total Cycles, top symbols: _int_malloc, __GI___libc_malloc, __cfree, ROOT::Math::SMatrix::operator=, __ieee754_exp, ROOT::Math::SMatrix::operator=, do_lookup_x, ROOT::Math::SMatrix::operator=, __ieee754_log, __atan, ROOT::Math::SMatrix::operator=, _int_free, G__defined_typename, strcmp, TList::FindLink, G__defined_tagname
Stalled Cycles, top symbols: _int_malloc, do_lookup_x, __GI___libc_malloc, __ieee754_exp, strcmp, __cfree, __atan, __ieee754_log, TList::FindLink, _int_free, std::basic_string::find, computeFullJacobian, malloc_consolidate, operator new, ROOT::Math::SMatrix::operator=, makeAtomStep (33.84%)

Module Level Analysis
Uses the Perfmon2 interface (libpfm) directly
Analyses each CMSSW module separately
Allows the identification of "troubled" modules through a sortable HTML table
Gives instruction statistics and produces detailed graphs to make analysis easier
It requires 21 identical cmsRun executions (no multiple sets of events are used, which gives more accurate results), but these can be run in parallel, so with 7 cores the wall time is roughly that of 3 runs
Code outside modules (the framework) is not monitored
DEMO

Module Level Analysis - Results Snapshot (figure)

Single Module Graphs (figure)

Modular Symbol Level Analysis - Overview
Uses the Perfmon2 interface (libpfm) directly and analyses each CMSSW module separately
Sampling periods are specific to each event in order to obtain reasonable measurements
The list of modules is an HTML table sortable by the number of samples of UNHALTED_CORE_CYCLES
For each module the complete set of usual events (Cycle Accounting Analysis and others) is sampled
Results for each module are presented in separate HTML pages, in tables sorted by decreasing sample count
DEMO
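The difference between sampling and counting comes down to a few fields in the event configuration. The sketch below shows it with the stock Linux perf_event interface rather than the perfmon2/libpfm calls the tool actually uses; the event and period are placeholder values.

    // Sampling sketch: instead of reading one total at the end, the PMU raises an
    // interrupt every sample_period events and records the instruction pointer.
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;  // stand-in for UNHALTED_CORE_CYCLES
        attr.sample_period = 2000000;            // one sample every 2M cycles (placeholder)
        attr.sample_type = PERF_SAMPLE_IP;       // record the instruction pointer
        attr.exclude_kernel = 1;
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */, -1, -1, 0);
        if (fd < 0) { std::perror("perf_event_open"); return 1; }
        // Samples are delivered through an mmap'ed ring buffer (not shown); resolving
        // each recorded IP to a symbol and counting samples per symbol is what builds
        // tables like the ones on the following slides.
        close(fd);
        return 0;
    }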

The List of Modules (figure)

Table Example of a Module (figure)

An empirical study: what can we do with counters?
Question: is all this useful?
Answer: we don't know, but we shall see
There is a lack of papers and literature on the subject
An empirical study is underway to find out:
1. A relationship between counter results and coding practices
2. A practical procedure for using counter results to optimize a program
A procedure has already been developed and will be tested
The trial study will be conducted on Gaudi together with Karol Kruzelecki (PH-LBC group)

The 3-step empirical optimization procedure
We start from counter results and choose one algorithm to work on using the Improvement Margin and the iFactor. We then apply the following procedure:
1. Change to a more efficient algorithm and vectorize it
2. Remove stall sources (L1 & L2 misses, store-fwd, etc.)
3. Remove misprediction sources (branches, calls, etc.)
Example progression for MyAlg (one measurement after each step):
start: Total Cycles: 1000, Total Instructions: 1000 → CPI: 1.00
after step 1: Total Cycles: 300, Total Instructions: 250 → CPI: 1.20
after step 2: Total Cycles: 250, Total Instructions: 250 → CPI: 1.00
after step 3: Total Cycles: 230, Total Instructions: 250 → CPI: 0.92
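For reference, CPI here is simply total cycles divided by total instructions:

\[
\mathrm{CPI} = \frac{\text{Total Cycles}}{\text{Total Instructions}}:
\qquad \frac{1000}{1000} = 1.00,\quad \frac{300}{250} = 1.20,\quad \frac{250}{250} = 1.00,\quad \frac{230}{250} = 0.92
\]

A lower CPI means more instructions retired per cycle, but as the example shows the absolute cycle count is what ultimately matters: step 1 raises CPI from 1.00 to 1.20 while still cutting total cycles from 1000 to 300.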

PfmCodeAnalyser: a new tool for fast monitoring
It is unreasonable (and useless) to run a complete analysis for every change in the code
Often we are interested in only a small part of the code and in one single event
Solution: a fast, precise and light "singleton" class called PfmCodeAnalyser
How to use it:

    #include   // the PfmCodeAnalyser header (path not shown on the slide)
    PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED").start();
    // code to monitor
    PfmCodeAnalyser::Instance().stop();
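The implementation is not shown in the talk. Purely to illustrate the start/stop singleton pattern described above, a hypothetical sketch could look like the following; the class name suffix, the hard-coded event and the use of the Linux perf_event interface (instead of the perfmon2/libpfm calls the real class uses) are all assumptions of mine:

    // Hypothetical illustration only, not the actual PfmCodeAnalyser implementation.
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>
    #include <string>

    class PfmCodeAnalyserSketch {
    public:
        // The event name is only honoured on the first call, mirroring the usage
        // Instance("EVENT").start(); ... Instance().stop(); from the slide.
        static PfmCodeAnalyserSketch& Instance(const std::string& event = "INSTRUCTIONS_RETIRED") {
            static PfmCodeAnalyserSketch instance(event);
            return instance;
        }
        void start() {
            ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
        }
        void stop() {
            ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
            long long count = 0;
            read(fd_, &count, sizeof(count));
            std::printf("events counted between start() and stop(): %lld\n", count);
        }
    private:
        explicit PfmCodeAnalyserSketch(const std::string& /*event*/) {
            perf_event_attr attr;
            std::memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // stand-in for INSTRUCTIONS_RETIRED
            attr.disabled = 1;
            attr.exclude_kernel = 1;
            fd_ = syscall(SYS_perf_event_open, &attr, 0 /* this thread */, -1, -1, 0);
            if (fd_ < 0) std::perror("perf_event_open");
        }
        int fd_ = -1;
    };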

From Core to Nehalem
All the work described was done for the Intel Core microarchitecture
The port to Nehalem was completed yesterday and will be validated and tested soon
It involves a new Cycle Accounting model, new performance counters, different latencies and stall impacts, etc.
Core i7, Xeon 5500 series and NUMA: core counters (4+3) and uncore counters (8+1)
CMSSW has been profiled successfully on my Nehalem machine and the results can be found here

Conclusions
1. A consistent set of events has been monitored across the 4 different analysis approaches, both in CMSSW and in Gaudi (Cycle Accounting Analysis)
2. An empirical counter-based optimization approach is being studied and tested
3. A new monitoring tool has been developed for quick performance monitoring: PfmCodeAnalyser
4. A new Cycle Accounting Analysis model for Nehalem processors is currently being validated on CMSSW
5. The report and background of the work done for CMSSW is available at

Questions?