A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010
Summary
1. Performance Monitoring
2. Performance Counters
3. Perfmon2
4. Cycle Accounting Analysis
5. Problems
6. Overall Analysis
7. Symbol Level Analysis
8. Module Level Analysis
9. Modular Symbol Level Analysis
10. What can we do with counters?
11. The 3-step empirical optimization procedure
12. Conclusions
13. What’s next?
Performance Monitoring
DEF: The action of collecting information related to how an application or system performs.
HOW: Obtain micro-architectural level information from hardware performance counters.
WHY: To identify bottlenecks, and possibly remove them in order to improve application performance.
Performance Counters
All recent processor architectures include a processor-specific Performance Monitoring Unit (PMU).
The PMU contains several performance counters.
Performance counters can count micro-architectural events from many hardware sources (CPU pipeline, caches, bus, etc.).
In the Intel Core Microarchitecture: 3 fixed counters and 2 programmable counters.
Perfmon2
A generic API to access the PMU (libpfm).
Developed by Stéphane Eranian.
Portable across all new processor micro-architectures.
Supports system-wide and per-thread monitoring.
Supports counting and sampling.
[Figure: Perfmon2 software stack. User space: pfmon and other libpfm-based apps, on top of libpfm. Linux kernel: generic perfmon and architectural perfmon. CPU hardware: PMU.]
Cycle Accounting Analysis
Total Cycles (application total execution time)
    Issuing μops
        Retiring μops (useful work)
        Not retiring μops (useless work)
    Not issuing μops: stalled (no work)
        Store-Fwd
        L2 miss
        L2 hit
        LCP
        L1 TLB miss
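The top of the decomposition above (total cycles split into issuing and stalled cycles, and issuing cycles split into retiring and not retiring) can be sketched as simple arithmetic. This is a minimal illustration only: the struct, field and function names are hypothetical, the inputs are assumed to be already-derived cycle counts rather than raw PMU event values, and total_cycles is assumed to be non-zero.

```cpp
#include <cstdint>

// Hypothetical, already-derived cycle counts for one run. The field names
// are illustrative and are not PMU event names.
struct CycleCounts {
    std::uint64_t total_cycles;     // all cycles of the application
    std::uint64_t stalled_cycles;   // cycles in which no uops were issued
    std::uint64_t retiring_cycles;  // issuing cycles whose uops were retired
};

// Fractions of total cycles for the three leaf categories.
struct CycleBreakdown {
    double stalled;       // no work
    double not_retiring;  // useless work
    double retiring;      // useful work
};

// total = issuing + stalled, issuing = retiring + not retiring.
// Assumes total_cycles > 0 and that the counts are mutually consistent.
CycleBreakdown account(const CycleCounts& c) {
    const double total = static_cast<double>(c.total_cycles);
    const std::uint64_t issuing = c.total_cycles - c.stalled_cycles;
    const std::uint64_t not_retiring = issuing - c.retiring_cycles;
    return {c.stalled_cycles / total, not_retiring / total,
            c.retiring_cycles / total};
}
```

By construction the three fractions add up to 1. The per-cause stall terms (Store-Fwd, L2 miss, and so on) are not modelled here, because they are estimates that may overlap.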
Problems
The stall decomposition approach is a simplification: other kinds of stalls also occur.
The estimated performance impact of each kind of stall carries a different degree of error (5% - 20%).
Is the whole the sum of its parts? Usually, NO. Some events may overlap, or their impact may be over-estimated.
Values read from performance counters are good approximations of the real values (1% - 3%).
Nevertheless, although quantities may be over- or under-estimated, they allow a good qualitative analysis.
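The 4-way Performance Monitoring
1. Overall Analysis
2. Symbol Level Analysis
3. Module Level Analysis
4. Modular Symbol Level Analysis

            Overall (pfmon)            Modular
Counting    1. Overall Analysis        3. Module Level Analysis
Sampling    2. Symbol Level Analysis   4. Modular Symbol Level Analysis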
Overall Analysis
Uses pfmon and is based on the Cycle Accounting Analysis.
Good for showing overall performance and for checking improvements.
Good for identifying general software problems.
Good for comparing different versions of the code.
NOT enough for:
    finding inefficient parts of the software
    finding bad programming practices
[Figure: Overall Analysis]
Symbol Level Analysis
Uses the sampling capabilities of pfmon.
Good for identifying general bad programming practices.
Can identify problems in functions that are frequently used.
Shows the functions that use most of the execution cycles and the functions that spend a lot of time doing nothing (stalling).
NOT good for finding specific problems in the code.
Symbol Level Analysis: Stalled Cycles and Total Cycles
[Two tables, each with columns: counts, %self, symbol. Only the symbol columns and one value survive.]
First table, symbols in row order:
    _int_malloc
    __GI___libc_malloc
    __cfree
    ROOT::Math::SMatrix::operator=
    __ieee754_exp
    ROOT::Math::SMatrix::operator=
    do_lookup_x
    ROOT::Math::SMatrix::operator=
    __ieee754_log
    __atan
    ROOT::Math::SMatrix::operator=
    _int_free
    G__defined_typename
    strcmp
    TList::FindLink
    G__defined_tagname
Second table, symbols in row order:
    _int_malloc
    do_lookup_x
    __GI___libc_malloc
    __ieee754_exp
    strcmp
    __cfree
    __atan
    __ieee754_log
    TList::FindLink
    _int_free
    std::basic_string::find
    computeFullJacobian
    malloc_consolidate
    operator new
    ROOT::Math::SMatrix::operator=
    makeAtomStep (33.84%)
Module Level Analysis
Uses the Perfmon2 interface (libpfm) directly.
Analyses each CMSSW module separately.
Allows the identification of “troubled” modules through a sortable HTML table.
Gives instruction statistics and produces detailed graphs to make analysis easier.
Requires 21 identical cmsRun executions (no multiple sets of events are used, which gives more accurate results); the runs can be parallelized, so with 7 cores the total time is about that of 3 runs.
Code outside modules (the framework) is not monitored.
DEMO
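The "21 runs on 7 cores take about the time of 3 runs" estimate is a ceiling division, assuming every run takes roughly the same time and each run occupies one core. A minimal sketch follows; the function name is hypothetical and is not part of any tool described here.

```cpp
// Number of sequential rounds needed to complete `runs` identical jobs
// when `cores` of them can execute at the same time (ceiling division).
// Assumes cores > 0 and that all runs have about the same duration.
unsigned rounds_needed(unsigned runs, unsigned cores) {
    return (runs + cores - 1) / cores;
}
```

With 21 runs and 7 cores this gives 3 rounds. With 8 cores it still gives 3, since the last round is only partly filled.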
[Figure: Module Level Analysis, results snapshot]
[Figure: Single Module Graphs]
Modular Symbol Level Analysis - Overview
Uses the Perfmon2 interface (libpfm) directly and analyses each CMSSW module separately.
Sampling periods are specific to each event in order to obtain reasonable measurements.
The list of modules is an HTML table sortable by the number of samples of UNHALTED_CORE_CYCLES.
For each module the complete set of usual events (Cycle Accounting Analysis and others) is sampled.
The results for each module are presented on separate HTML pages, in tables sorted by decreasing sample count.
DEMO
[Figure: The List of Modules]
[Figure: Table Example of a Module]
An empirical study: What can we do with counters?
Question: Is all this useful?
Answer: We don’t know yet, but we shall see.
There is a lack of papers and literature on the subject.
An empirical study is underway to find out:
1. A relationship between counter results and coding practices
2. A practical procedure to use counter results to optimize a program
A procedure has already been developed and will be tested.
The trial study will be conducted on Gaudi together with Karol Kruzelecki (PH-LBC group).
The 3-step empirical optimization procedure
We start from the counter results and choose one algorithm to work on, using the Improvement Margin and the iFactor. We then apply the following procedure:
Start
    MyAlg: Total Cycles: 1000, Total Instructions: 1000 → CPI: 1.00
1. Change to a more efficient algorithm and vectorize it
    MyAlg: Total Cycles: 300, Total Instructions: 250 → CPI: 1.20
2. Remove stall sources (L1 & L2 misses, store-fwd, etc.)
    MyAlg: Total Cycles: 250, Total Instructions: 250 → CPI: 1.00
3. Remove misprediction sources (branches, calls, etc.)
    MyAlg: Total Cycles: 230, Total Instructions: 250 → CPI: 0.92
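The CPI figures quoted for MyAlg are total cycles divided by total instructions. The sketch below recomputes them from the cycle and instruction counts given above. The struct, the stage labels and the function names are illustrative assumptions, not part of the tool.

```cpp
#include <string>
#include <vector>

// One (cycles, instructions) measurement of the example algorithm.
// The stage labels are illustrative.
struct Measurement {
    std::string stage;
    double cycles;
    double instructions;
};

// Cycles per instruction; assumes instructions > 0.
double cpi(const Measurement& m) { return m.cycles / m.instructions; }

// Speed-up in total cycles relative to a baseline measurement.
double speedup(const Measurement& baseline, const Measurement& m) {
    return baseline.cycles / m.cycles;
}

// The MyAlg cycle and instruction counts quoted for the procedure.
std::vector<Measurement> myalg_measurements() {
    return {
        {"start", 1000.0, 1000.0},
        {"after step 1", 300.0, 250.0},
        {"after step 2", 250.0, 250.0},
        {"after step 3", 230.0, 250.0},
    };
}
```

Note that CPI rises from 1.00 to 1.20 after the first step even though total cycles drop from 1000 to 300: CPI on its own is not the optimization target, total cycles are.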
PfmCodeAnalyser: a new tool for fast monitoring
It is unreasonable (and useless) to run a complete analysis for every change in the code.
Often we are interested in only a small part of the code and in one single event.
Solution: a fast, precise and light “singleton” class called PfmCodeAnalyser.
How to use it:
    #include
    PfmCodeAnalyser::Instance("INSTRUCTIONS_RETIRED").start();
    // code to monitor
    PfmCodeAnalyser::Instance().stop();
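To illustrate the calling pattern only, here is a hypothetical stand-in with the same Instance(...).start() / Instance().stop() shape. It is not the real PfmCodeAnalyser: the class name MockCodeAnalyser and all of its members are assumptions made for this sketch, and it measures wall-clock time with std::chrono instead of reading PMU counters through libpfm.

```cpp
#include <chrono>
#include <string>

// Hypothetical stand-in, NOT the real PfmCodeAnalyser. It accumulates
// wall-clock time between start() and stop() instead of PMU event counts.
class MockCodeAnalyser {
    using Clock = std::chrono::steady_clock;

public:
    // In this mock, the event name is only recorded on the first call;
    // later calls return the same instance and ignore the argument.
    static MockCodeAnalyser& Instance(const std::string& event = "") {
        static MockCodeAnalyser instance(event);
        return instance;
    }

    void start() {
        begin_ = Clock::now();
        running_ = true;
    }

    void stop() {
        if (!running_) return;  // ignore stop() without a matching start()
        total_ += Clock::now() - begin_;
        ++sections_;
        running_ = false;
    }

    const std::string& event() const { return event_; }
    unsigned sections() const { return sections_; }
    long long total_ns() const {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(total_).count();
    }

    MockCodeAnalyser(const MockCodeAnalyser&) = delete;
    MockCodeAnalyser& operator=(const MockCodeAnalyser&) = delete;

private:
    explicit MockCodeAnalyser(const std::string& event) : event_(event) {}

    std::string event_;
    Clock::time_point begin_{};
    Clock::duration total_{};
    unsigned sections_ = 0;
    bool running_ = false;
};
```

A function-local static keeps the singleton initialization thread-safe in C++11 and later. The accumulation itself is not synchronized, so this mock is meant for single-threaded use.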
From Core to Nehalem
All the work described was done for the Intel Core Microarchitecture.
The porting to Nehalem was completed yesterday and will be validated and tested soon.
The port involves a new Cycle Accounting model, new performance counters, different latencies and stall impacts, etc.
Core i7, Xeon 5500 series and NUMA: core counters (4 programmable + 3 fixed) and uncore counters (8 programmable + 1 fixed).
CMSSW has been profiled successfully on my Nehalem machine and the results can be found here.
Conclusions
1. A consistent set of events has been monitored across the 4 different analysis approaches both in CMSSW and in Gaudi (Cycle Accounting Analysis)
2. An empirical counter-based optimization approach is being studied and tested
3. A new monitoring tool has been developed for quick performance monitoring: PfmCodeAnalyser
4. A new Cycle Accounting Analysis model for Nehalem processors is currently being validated on CMSSW
5. The report and background of the work done for CMSSW is available at
Questions?