Profiling Tools
By Vitaly Kroivets, for the Software Design Seminar

Contents
Introduction
The software optimization process; optimization traps and pitfalls
Benchmarks
Performance tools overview
Optimizing compilers
System performance monitors
Profiling tools: GNU gprof, Intel VTune, Valgrind
What it means to use the system efficiently

The Problem
PC speed has increased 500-fold since 1981, but today's software is more complex and still hungry for more resources.
How do we run faster on the same hardware and OS architecture?
Highly optimized applications can run tens of times faster than poorly written ones.
Efficient algorithms and well-designed implementations lead to high-performance applications.

The Software Optimization Process
Create a benchmark
Find hotspots
Investigate causes
Modify the application
Retest using the benchmark
Hotspots are areas in your code that take a long time to execute.

Extreme Optimization Pitfalls
"A large application's performance cannot be improved before it runs."
"Build the application, then see what machine it runs on."
"It runs great on my computer…"
Debug versus release builds
"Performance requires assembly-language programming."
"Code the features first, then optimize if there is time left over."

Key Point: software optimization does not begin where coding ends. It is an ongoing process that starts at the design stage and continues all the way through development.

The Benchmark
A benchmark is a program used to:
Objectively evaluate the performance of an application
Provide repeatable application behavior for use with performance analysis tools
Industry-standard benchmarks: TPC-C, 3D WinBench, and the SPEC suites (http://www.specbench.org/) for enterprise services, graphics/applications, HPC/OMP, Java, client/server, mail servers, network file systems and web servers

Attributes of a Good Benchmark
Repeatable (consistent measurements)
Remember background system tasks and caching issues
The "incoming fax" problem: report the minimum performance number
Representative: executes a typical code path and mimics how the customer uses the application
Poor benchmarks: using QA tests

Benchmark Attributes (cont.)
Easy to run
Verifiable: the benchmark itself needs QA!
Measures elapsed time rather than some other number
Use the benchmark to test functionality, since algorithmic tricks used to gain performance may break the application…

How to Find Performance Bottlenecks
Determine how system resources, such as memory and processor, are being utilized, to identify system-level bottlenecks
Measure the execution time of each module and function in your application
Determine how the various modules running on your system affect each other's performance
Identify the most time-consuming function calls and call sequences within your application
Determine how your application executes at the processor level, to identify microarchitecture-level performance problems

Performance Tools Overview
Timing mechanisms: a stopwatch, the UNIX time tool
Optimizing compilers (the easy way)
System load monitors: vmstat, iostat, perfmon.exe, the VTune counter monitor
Software profilers: gprof, VTune, the Visual C++ Profiler, IBM Quantify
Memory debuggers/profilers: Valgrind, IBM Purify, Parasoft Insure++

Using Optimizing Compilers
Always build the application with compiler optimization settings enabled before using it with performance tools
Understanding and using all the features of an optimizing compiler is required to get maximum performance with the least effort

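As a rough, hypothetical illustration of how much build flags matter, a simple numeric loop like the one below often runs several times faster when built with optimization enabled. The GCC flags in the comments are common choices for this kind of experiment, not the specific combinations evaluated later in the deck.

// flags_demo.cc: toy workload for comparing optimization levels.
// Assumed build commands (any optimizing compiler has equivalents):
//   g++ -O0 -o flags_demo flags_demo.cc                 (unoptimized baseline)
//   g++ -O2 -march=native -o flags_demo flags_demo.cc   (typical optimized build)
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(10000000, 1.5);
    double sum = 0.0;
    // A tight loop like this benefits heavily from -O2 (and from vectorization).
    for (int pass = 0; pass < 10; ++pass)
        for (double x : v)
            sum += x;
    std::printf("sum = %f\n", sum);  // print the result so the loop is not optimized away
    return 0;
}
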
Optimizing Compiler: Choosing an Optimization Flag Combination

The Optimizing Compiler's Effect

Optimizing Compilers: Conclusions
Some processor-specific options still do not appear to be a major factor in producing fast code
More optimizations do not guarantee faster code
Different algorithms are most effective with different optimizations
Idea: use statistics gathered by the profiler as input to the compiler/linker

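GCC exposes exactly this idea as profile-guided optimization. The sketch below shows the usual two-pass build, assuming a GCC-style toolchain; -fprofile-generate and -fprofile-use are GCC's flag names, and other compilers have equivalents.

// pgo_demo.cc: minimal sketch of profile-guided optimization with GCC.
// Pass 1: build an instrumented binary and run it on representative work:
//   g++ -O2 -fprofile-generate -o pgo_demo pgo_demo.cc
//   ./pgo_demo            (writes *.gcda profile data)
// Pass 2: rebuild, letting the compiler use the gathered statistics:
//   g++ -O2 -fprofile-use -o pgo_demo pgo_demo.cc
#include <cstdio>

// The profile tells the compiler which branch directions and call sites are hot,
// so it can lay out code and inline more aggressively along the hot paths.
int classify(int x) { return x % 97 == 0 ? 1 : 0; }

int main() {
    long hits = 0;
    for (int i = 0; i < 50000000; ++i)
        hits += classify(i);
    std::printf("hits = %ld\n", hits);
    return 0;
}
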
Windows Performance Monitor
A sampling "profiler": it uses the OS timer interrupt to wake up and record the values of software counters (disk reads, free memory, and so on)
Maximum resolution: 1 second
Cannot identify the piece of code that caused an event to occur
Good for finding system-level issues
UNIX equivalents: vmstat, iostat, xos, top, oprofile, etc.

Performance Monitor Counters

Profilers
A profiler may show the time elapsed in each function and its descendants, the number of calls, and (for some profilers) the call graph
Profilers use either instrumentation or sampling to identify performance issues

Sampling vs. Instrumentation
Overhead: sampling is typically about 1%; instrumentation is high, sometimes 500%!
System-wide profiling: sampling profiles all applications, drivers and OS functions; instrumentation covers just the application and instrumented DLLs
Detecting unexpected events: sampling can detect other programs using OS resources; instrumentation cannot
Setup: sampling needs none; instrumentation requires automatic insertion of data-collection stubs
Data collected: sampling gathers counters and processor and OS state; instrumentation gathers the call graph, call times and the critical path
Data granularity: sampling works at the assembly-instruction level, with source lines; instrumentation works at the function level, sometimes statements
Detecting algorithmic issues: sampling cannot (it is limited to processes and threads); instrumentation can, since it sees the algorithm and which call path is expensive

Profiling Tools Covered
GNU gprof: old, buggy and inaccurate
Intel VTune: $700, unstable
Valgrind: not really a profiler…

GNU gprof
An instrumenting profiler available on every UNIX-like system

Using gprof
Compile and link your program with profiling enabled:
cc -g -c myprog.c utils.c -pg
cc -o myprog myprog.o utils.o -pg
Execute the program to generate a profile data file. It runs normally (but more slowly) and writes the profile data into a file called gmon.out just before exiting; the program should exit normally, via exit() or by returning from main().
Run gprof to analyze the profile data:
gprof myprog gmon.out

Example Program

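The example source appears in the deck only as a screenshot. As a stand-in, here is a small hypothetical program with a similar shape, a few functions of very different cost, that produces a readable flat profile and call graph when built with -pg as above. The names g() and doit() echo the call-graph slides that follow; they are not taken from the original listing.

// example.cc: hypothetical stand-in for the deck's example program.
// Assumed build and run, mirroring the previous slide:
//   g++ -g -pg example.cc -o example
//   ./example              (writes gmon.out on exit)
//   gprof example gmon.out
#include <cstdio>

volatile double sink;            // keeps the compiler from deleting the work

void doit(int n) {               // the expensive leaf function
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += i * 0.5;
    sink = s;
}

void f() { doit(500000); }

void g() { doit(2000000); }      // g() only calls doit(), as in the call-graph slides

int main() {
    for (int i = 0; i < 100; ++i) { f(); g(); }
    std::puts("done");
    return 0;                    // normal exit flushes gmon.out
}
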
Understanding the Flat Profile
The flat profile shows the total amount of time your program spent executing each function.
If a function was not compiled for profiling and did not run long enough to show up in the program-counter histogram, it is indistinguishable from a function that was never called.

Flat Profile: %time
The percentage of the total execution time your program spent in this function. These should all add up to 100%.

Flat Profile: Cumulative seconds
The cumulative total number of seconds spent in this function, plus the time spent in all the functions above it in the listing.

Flat Profile: Self seconds
The number of seconds accounted for by this function alone.

Flat Profile: Calls
The number of times the function was invoked.

Flat Profile: Self seconds per call
The average number of seconds per call spent in this function alone.

Flat Profile: Total seconds per call
The average number of seconds per call spent in this function and its descendants.

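Putting the columns together, a flat profile looks roughly like this. The layout is gprof's; the functions and numbers are hypothetical, invented for a small three-function program, and are not taken from the original slides.

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.0       0.80     0.80      200     4.00     4.00  doit(int)
 12.0       0.92     0.12      100     1.20     5.20  f()
  8.0       1.00     0.08      100     0.80     4.80  g()
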
Call Graph: the call tree of the program
Current function: g(). Called by: main(). Descendants: doit().

Call Graph: understanding each line
For the current function, g(), the line shows:
The unique index of this function
The percentage of the total time spent in this function and its children
The total amount of time spent in this function
The total time propagated into this function by its children
The number of times it was called

Call Graph: the parents' numbers
For each parent of g(), the line shows:
The time that was propagated from the function's children into this parent
The time that was propagated directly from the function into this parent
The number of times this parent called the function '/' the total number of times the function was called

Call Graph: the children's numbers
For each child of g(), the line shows:
The amount of time that was propagated from the child's children to the function
The amount of time that was propagated directly from the child into the function
The number of times this function called the child '/' the total number of times this child was called

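A single call-graph entry therefore looks roughly like the sample below (gprof's layout; the index numbers and times are hypothetical, chosen to be consistent with the flat-profile sample above). The middle line is the function itself, the line above it is a parent, and the line below it is a child.

index  % time    self  children    called      name
                 0.08     0.40      100/100        main [1]
[3]      48.0    0.08     0.40      100         g() [3]
                 0.40     0.00      100/200        doit(int) [4]
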
How gprof Works
gprof instruments the program to count calls, then watches it run, sampling the program counter every 0.01 s.
Statistical inaccuracy: a fast function may receive 0 or 1 samples, so the run should be long compared with the sampling period. Several gmon.out files can be combined into a single report (gprof -s).
The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth, because samples of the program counter are taken at fixed intervals of run time.
The number-of-calls figures are derived by counting, not sampling; they are completely accurate and will not vary from run to run if your program is deterministic.
Profiling with inlining and other optimizations needs care.

VTune Performance Analyzer
To squeeze every bit of power out of the Intel architecture!

VTune Modes/Features
Time- and event-based, system-wide sampling gives developers the most accurate representation of their software's actual performance, with negligible overhead
Call graph profiling gives developers a pictorial view of program flow, to quickly identify critical functions and call sequences
Counter monitor lets developers track system activity at runtime, which helps them identify system-level performance issues

Sampling Mode
Monitors all active software on the system, including your application, the OS, JIT-compiled Java class files, Microsoft .NET files, 16-bit applications, 32-bit applications, and device drivers
Application performance is not impacted during data collection

Sampling Mode Benefits
Low-overhead, system-wide profiling helps you identify which modules and functions consume the most time, giving you a detailed look at the operating system and your application
Profiling to find hotspots: find the modules, functions, lines of source code and assembly instructions that consume the most time
Low overhead: the overhead incurred by sampling is typically about one percent
No need to instrument code: you do not need to make any changes to your code to profile with sampling

How Does Sampling Work?
Sampling interrupts the processor after a certain number of events and records the execution information in a buffer. When the buffer is full, the information is copied to a file and the program resumes. In this way the VTune analyzer maintains very low overhead (about one percent) while sampling.
Time-based sampling collects samples of active instruction addresses at regular time intervals (1 ms by default).
Event-based sampling collects samples of active instruction addresses after a specified number of processor events.
After the program finishes, the samples are mapped to modules and stored in a database within the analyzer.

Starting the Sampling Wizard
The hardware prevents sampling many counters simultaneously.
Unsupported CPU? Ha-ha-ha…

EBS: Choosing Events

Events Counted by VTune
Basic events: clock cycles, retired instructions
Instruction execution: instruction decode, issue and execution, data and control speculation, and memory operations
Cycle accounting events: stall cycle breakdowns
Branch events: branch prediction
Memory hierarchy: instruction prefetch, instruction and data caches
System events: operating system monitors, instruction and data TLBs
About 130 different events in the Pentium 4 architecture!

Sampling…

Viewing Sampling Results
Process view: all the processes that ran on the system during data collection
Thread view: the threads that ran within the processes you select in Process view
Module view: the modules that ran within the selected processes and threads
Hotspot view: the functions within the modules you select in Module view

Different Events Collected: Module View
A system-wide look at the software running on the system, with our program among the modules. CPI (cycles per instruction) is a good average indication.

Hotspot Graph
Each bar represents one of the functions of our program. Clicking on a hotspot bar makes VTune display the source code view.

Source View
The Test_if function.

Annotated Source View (% of module)
Shows how much time is spent on each source line. Check this "for" loop: 10% of the CPU time is spent in a few statements!

VTune Tuning Assistant
In a few clicks we have reached the performance problem. Now, how do we solve it?
The Tuning Assistant highlights performance problems and gives the approximate time lost to each one
Its database contains performance metrics based on Intel's experience of tuning hundreds of applications
It analyzes the data gathered from our application and generates tuning recommendations for each hotspot, giving the user an idea of what might be done to fix the problem

Tuning Assistant Report

Hotspot Assistant Report: Penalties

Hotspot Assistant Report

Call Graph Mode
Provides a pictorial view of program flow, to quickly identify critical functions and call sequences
Call graph profiling reveals:
The structure of your program at the function level
The number of times a function is called from a particular location
The time spent in each function
The functions on the critical path

Call Graph Screenshot
The critical path is displayed as red lines: the call sequence in the application that took the most time to execute. The screenshot also shows the function summary pane and the switch to call-list view.

Call Graph (cont.)
Wait time: how much time was spent waiting for an event to occur
Additional information is available by hovering the mouse over the functions

Jump to Source View

Call Graph: Call-List View
Caller functions are the functions that called the focus function
Callee functions are the functions that are called by the focus function

Counter Monitor
Use the Counter Monitor feature of the VTune analyzer to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects. With the VTune analyzer you can:
Monitor selected counters in performance objects
Correlate performance counter data with data collected by other features of the VTune analyzer, such as sampling
Trigger the collection of counter data on events other than a periodic timer

Counter Monitor

Getting Help
Context-sensitive help and an online help repository

VTune Summary
Pros: lets you get the best possible performance out of the Intel architecture
Cons: extreme tuning requires a deep understanding of processor and OS internals

Valgrind
A multi-purpose profiling tool for Linux/x86

The Valgrind Toolkit
Memcheck is a memory debugger; it detects memory-management problems
Cachegrind is a cache profiler; it performs a detailed simulation of the I1, D1 and L2 caches in your CPU
Massif is a heap profiler; it performs detailed heap profiling by taking regular snapshots of a program's heap
Helgrind is a thread debugger; it finds data races in multithreaded programs

Memcheck Features
When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted.
Memcheck can detect:
Use of uninitialised memory
Reading/writing memory after it has been freed
Reading/writing off the end of malloc'd blocks
Reading/writing inappropriate areas on the stack
Memory leaks, where pointers to malloc'd blocks are lost forever
Passing of uninitialised and/or unaddressable memory to system calls
Mismatched use of malloc/new/new[] versus free/delete/delete[]
Overlapping src and dst pointers in memcpy() and related functions
Some misuses of the POSIX pthreads API

Memcheck Example
The example program (shown as a screenshot) contains: use of a non-initialized value, use of free on memory allocated by new, an access to unallocated memory, and a memory leak.

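The original listing is only a screenshot; below is a hypothetical stand-in (not the original code) containing the same four kinds of bug, so the reports on the following slides have something concrete to map onto.

// a.cc: hypothetical stand-in for the Memcheck example program.
#include <cstdio>
#include <cstdlib>

int main() {
    int uninit;                       // 1) use of a non-initialized value
    if (uninit > 0)
        std::puts("positive");

    int* p = new int[10];
    std::free(p);                     // 2) memory allocated with new[] released with free

    int* q = (int*)std::malloc(10 * sizeof(int));
    q[10] = 42;                       // 3) access just past the end of the allocated block

    int* leak = new int[100];         // 4) memory leak: the pointer is lost at exit
    (void)leak;
    return 0;
}
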
Memcheck Example (cont.)
Compile the program with the -g flag:
g++ -g a.cc -o a.out
Run valgrind on the executable (a.out) and capture the log:
valgrind --tool=memcheck --leak-check=yes ./a.out > log 2>&1
View the log and debug the leaks.

Memcheck Report

Memcheck Report (cont.)
Leaks detected, each with the stack trace of the allocation that was lost.

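For orientation, Memcheck's report looks roughly like the following. The addresses, byte counts and line numbers are hypothetical; the general shape of the output is what matters.

==1234== Invalid write of size 4
==1234==    at 0x80484F2: main (a.cc:14)
==1234==  Address 0x4456050 is 0 bytes after a block of size 40 alloc'd
==1234==
==1234== LEAK SUMMARY:
==1234==    definitely lost: 400 bytes in 1 blocks
==1234==    possibly lost:   0 bytes in 0 blocks
==1234==    still reachable: 0 bytes in 0 blocks
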
Cachegrind
Detailed cache profiling can be very useful for improving the performance of a program.
On a modern x86 machine, an L1 miss costs around 10 cycles, and an L2 miss can cost as much as 200 cycles.
Cachegrind performs a detailed simulation of the I1, D1 and L2 caches in your CPU and can accurately pinpoint the sources of cache misses in your code.
It reports the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries.
Cachegrind runs programs about 20-100x slower than normal.

How to Run Cachegrind
Put valgrind --tool=cachegrind in front of the normal command-line invocation, for example:
valgrind --tool=cachegrind ls -l
When the program finishes, Cachegrind prints summary cache statistics. It also collects line-by-line information in a file, cachegrind.out.<pid>.
Run cg_annotate to get an annotated source file (here 7618 is the PID and a.cc the source file):
cg_annotate --7618 a.cc > a.cc.annotated

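As a hypothetical workload where the counters are easy to interpret, the column-major traversal below touches a new cache line on almost every access, so Cachegrind attributes far more D1 and L2 data-read misses to it than to the row-major loop.

// cache_demo.cc: traversal order vs. data-cache misses.
// Assumed run:  g++ -g -O1 cache_demo.cc -o cache_demo
//               valgrind --tool=cachegrind ./cache_demo
#include <cstdio>
#include <vector>

const int N = 2048;

int main() {
    std::vector<double> m(N * N, 1.0);   // an N x N matrix, row-major in memory
    double sum = 0.0;

    // Row-major: consecutive accesses share cache lines, so D1 misses are rare.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += m[i * N + j];

    // Column-major: a stride of N doubles, so nearly every access misses.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += m[i * N + j];

    std::printf("sum = %f\n", sum);
    return 0;
}
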
Cachegrind Summary Output: Instruction Cache Performance
I-cache reads (instructions executed), I1 cache read misses, L2-cache instruction read misses

Cachegrind Summary Output: Data Cache Read Performance
D-cache reads (memory reads), D1 cache read misses, L2-cache data read misses

Cachegrind Summary Output: Data Cache Write Performance
D-cache writes (memory writes), D1 cache write misses, L2-cache data write misses

Cachegrind Accuracy
Valgrind's cache profiling has a number of shortcomings:
It doesn't account for kernel activity; the effect of system calls on the cache contents is ignored
It doesn't account for other processes' activity (although this is probably desirable when considering a single program)
It doesn't account for virtual-to-physical address mappings, so the simulation is not a true representation of what is happening in the cache

The Massif Tool
Massif is a heap profiler: it measures how much heap memory a program uses. It gives information about heap blocks, heap administration blocks and stack sizes.
It helps reduce the amount of memory a program uses; a smaller program interacts better with the caches and avoids paging.
It can also detect "leaks" that aren't detected by traditional leak-checkers such as Memcheck, because the memory isn't ever actually lost (a pointer to it remains) but it is no longer in use.

Executing Massif
Run valgrind --tool=massif prog
It produces a summary, a graph picture, and a report. The summary looks like this:
Total spacetime: 2,258,106 ms.B
Heap: 24.0%
Heap admin: 2.2%
Stack(s): 73.7%
"Heap" is the number of words allocated on the heap via malloc(), new and new[]. Spacetime is space (in bytes) multiplied by time (in milliseconds).

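A hypothetical program of the kind Massif catches but a leak-checker does not: the buffer below stays reachable through a live pointer for the whole run even though it stopped being useful early on, so Memcheck reports no leak, while Massif's graph shows the heap staying large for the entire execution.

// massif_demo.cc: reachable-but-unused heap memory.
// Assumed run:  g++ -g massif_demo.cc -o massif_demo
//               valgrind --tool=massif ./massif_demo
#include <cstdio>
#include <vector>

int main() {
    // A large buffer that is only needed briefly...
    std::vector<double>* big = new std::vector<double>(5000000, 1.0);
    double first = (*big)[0];

    // ...but stays alive (a pointer remains), so it is never reported as a leak,
    // yet it inflates the heap profile for the rest of the run.
    long work = 0;
    for (int i = 0; i < 100000000; ++i)
        work += i % 7;

    std::printf("first=%f work=%ld\n", first, work);
    delete big;                       // released only at the very end
    return 0;
}
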
Spacetime Graphs

Spacetime Graph (cont.)
Each band represents a single line of source code; it is the height of a band that matters.
Triangles on the x-axis show each point at which a memory census was taken. They are not necessarily evenly spread: Massif only takes a census when memory is allocated or de-allocated.
The time on the x-axis is wall-clock time, which is not ideal because you can get different graphs for different executions of the same program, due to random OS delays.

Text/HTML Report Example
The report contains a lot of extra information about heap allocations that you don't see in the graph; it shows the places in the program where most memory was allocated.

Valgrind: How It Works
Valgrind is compiled into a shared object, valgrind.so. The valgrind shell script sets the LD_PRELOAD environment variable to point to valgrind.so, which causes the .so to be loaded as an extra library into any subsequently executed dynamically-linked ELF binary.
The dynamic linker allows each .so in the process image to have an initialization function that runs before main(), and a finalization function that runs after main() exits.
When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked inside valgrind.so until the end of the run.
System calls are intercepted; signal handlers are monitored.

Valgrind Summary
Valgrind will save you hours of debugging time, and it can help speed up your programs
Valgrind runs on x86/Linux and works with programs written in any language
Valgrind is actively maintained and can be used together with other tools (gdb)
Valgrind is easy to use: it relies on dynamic binary translation, so there is no need to modify, recompile or re-link applications; just prefix the command line with valgrind and everything works
Valgrind is not a toy: it is used by large projects (25 million lines of code)
Valgrind is free

Other Tools
Tools not covered in this presentation: IBM Purify, Parasoft Insure++, KCachegrind, OProfile, GCC's and glibc's debugging hooks

Writing Fast Programs
Select the right algorithm and implement it efficiently
Detect hotspots using a profiler and fix them
Understanding the target system's architecture is often required, for example its cache structure
Use platform-specific compiler extensions: memory prefetching, cache-control instructions, branch prediction hints, SIMD instructions
Write multithreaded applications (Hyper-Threading Technology)

CPU Architecture (Pentium 4)
Pipeline stages: instruction fetch, instruction decode with branch prediction, instruction pool, execution units, retirement, with the memory subsystem alongside. Execution is out of order!

Instruction Execution
A dispatch unit takes instructions from the instruction pool and feeds the execution units: integer, memory store, memory load, and floating point.

Keeping the CPU Busy
Processors are limited by data dependencies and by the speed of instructions, so keep data dependencies low
A good blend of instructions keeps all execution units busy at the same time
Waiting for memory with nothing else to execute is the most common reason for slow applications
Goals: ready instructions, a good mix of instructions, and predictable branches
Remove branches where possible; reduce the randomness of branches; avoid function pointers and jump tables (see the sketch below)

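A small, hypothetical illustration of the branch advice: counting values above a threshold with a data-dependent if is at the mercy of the branch predictor when the data is random, while the arithmetic form below typically compiles to branch-free code (a conditional move or setcc) and runs at a steady rate. Compilers can sometimes make this transformation themselves, so measure before and after.

// branch_demo.cc: branchy vs. branch-free counting.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<int> data(1 << 22);
    for (int& x : data) x = std::rand() % 256;   // random data -> unpredictable branch

    long count_branchy = 0, count_branchless = 0;

    // Branchy version: on random data, each mispredict costs many cycles.
    for (int x : data)
        if (x >= 128)
            ++count_branchy;

    // Branch-free version: the comparison result (0 or 1) is simply added.
    for (int x : data)
        count_branchless += (x >= 128);

    std::printf("%ld %ld\n", count_branchy, count_branchless);
    return 0;
}
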
Memory Overview (Pentium 4)
L1 cache (data only): 8 KB, plus an Execution Trace Cache that stores up to 12K decoded micro-ops
L2 Advanced Transfer Cache (data + instructions): 256 KB, about 3 times slower than L1
L3: 4 MB cache (optional)
Main RAM (usually 64 MB … 4 GB), about 10 times slower than L1

Fixing Memory Problems
Use less memory, to reduce compulsory cache misses
Increase cache efficiency: place items used at the same time near each other (see the sketch below)
Read sooner, with prefetching
Write memory faster without using the cache
Avoid conflicts
Avoid capacity issues
Give the CPU more work: execute non-dependent instructions while waiting

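One hedged illustration of placing items used together near each other: if a loop only touches one field of a large record, splitting that hot field into its own array means every cache line fetched is full of useful data. The structures below are invented for the example.

// layout_demo.cc: array-of-structs vs. a separate "hot" array.
#include <cstdio>
#include <vector>

struct Particle {               // one hot field plus a large cold payload
    double x;
    double cold[15];            // rarely used data that shares the cache lines
};

int main() {
    const int N = 1000000;
    std::vector<Particle> aos(N);
    std::vector<double> hot_x(N);

    double s1 = 0.0, s2 = 0.0;

    // Array of structs: each Particle is 128 bytes, so summing x drags in mostly cold data.
    for (int i = 0; i < N; ++i) s1 += aos[i].x;

    // Hot array: consecutive x values share cache lines, so there are far fewer misses.
    for (int i = 0; i < N; ++i) s2 += hot_x[i];

    std::printf("%f %f\n", s1, s2);
    return 0;
}
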
References
SPEC website: http://www.specbench.org
The Software Optimization Cookbook: High-Performance Recipes for the Intel Architecture, by Richard Gerber
GCC optimization flags: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Valgrind homepage: http://valgrind.kde.org
An Evolutionary Analysis of GNU C Optimizations: Using Natural Selection to Investigate Software Complexities, by Scott Robert Ladd
Intel VTune Performance Analyzer webpage: http://www.intel.com/software/products/vtune/
gprof manual: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html

Questions?