Published by Everett Weatherford. Modified over 10 years ago.
1
SE-292 High Performance Computing
Profiling and Performance
R. Govindarajan
govind@serc
2
Performance Measurement and Tuning
Tools to help you measure the performance of a program
  Determining program execution time
    % time a.out
    real 0m0.019s
    user 0m0.014s
    sys 0m0.002s
  Gives elapsed (real) time, user CPU time, and system CPU time
Tools to identify the important parts of your program for performance improvement
  Concentrate optimization efforts on those parts
3
Amdahl's Law
Which part of the program to optimize?
Amdahl's Law: speedup is limited by the part of the program that does not benefit from the optimization
  In other words, Sp ≤ 1/s, where s is the fraction of execution time that does not benefit
Implies: concentrate on the part of the program where the most time is spent!
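The bound stated on this slide can be checked numerically. The sketch below assumes the usual form of Amdahl's Law, in which a fraction s of the run time is untouched and the remaining 1 - s runs k times faster; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```python
def amdahl_speedup(serial_fraction: float, factor: float) -> float:
    """Overall speedup when the improvable part runs `factor` times faster.

    serial_fraction is s, the share of run time that does not benefit.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / factor)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound 1/s, approached as `factor` grows without bound."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # With s = 0.1, even a very large factor cannot exceed a speedup of 10.
    for k in (2, 10, 100, 1_000_000):
        print(k, round(amdahl_speedup(0.1, k), 3))
```

Speeding up a part that accounts for little of the run time leaves s large, so the achievable speedup stays small. This is why the slide recommends optimizing where the most time is spent.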
4
Timing
Timing: measuring the time spent in specific parts of your program
  Examples of "parts": functions, loops, …
Recall: different kinds of time can be measured (real/wall-clock/elapsed vs. virtual/CPU)
1. Decide which time you are interested in measuring, and at what granularity
2. Find out what mechanisms are available and their granularity of measurement
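As a rough illustration of the two steps above, the sketch below times one "part" (a function call) with both a wall-clock timer and a CPU-time timer, and queries the resolution each clock reports. It uses Python's standard `time` module rather than the C interfaces the course discusses; `time_section` and `busy_sum` are hypothetical names chosen for this example.

```python
import time


def time_section(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # real / wall-clock / elapsed time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def busy_sum(n):
    """A CPU-bound 'part' of a program: a simple loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    # Step 2: find out what granularity each mechanism claims to offer.
    for clock in ("perf_counter", "process_time"):
        print(clock, "resolution:", time.get_clock_info(clock).resolution)

    # Step 1: choose which time matters. A sleeping call uses wall time
    # but almost no CPU time; a loop uses both.
    for label, call in (("loop", lambda: busy_sum(200_000)),
                        ("sleep", lambda: time.sleep(0.05))):
        _, wall, cpu = time_section(call)
        print(f"{label}: wall={wall:.4f}s cpu={cpu:.4f}s")
```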
5
Timing Mechanisms
gettimeofday
  Real time in seconds and microseconds since 00:00 UTC on 1/1/1970
  Q: Overflow of a 32-bit seconds value?
getrusage
times system call
High-resolution timers
  Example: gethrtime
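The overflow question on this slide can be worked out directly. The sketch below assumes the seconds field is a signed 32-bit integer counting from the 1970 epoch, an assumption the slide does not spell out. It also shows a seconds/microseconds split in the style of gettimeofday, using Python's standard library instead of the C call; the helper names are illustrative.

```python
import time
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value of a signed 32-bit seconds counter


def last_representable_time():
    """Latest instant a signed 32-bit count of seconds since the epoch can hold."""
    return EPOCH + timedelta(seconds=MAX_INT32)


def now_as_timeval():
    """Current real time as (seconds, microseconds), like a struct timeval."""
    return divmod(time.time_ns() // 1000, 1_000_000)


if __name__ == "__main__":
    print(last_representable_time().isoformat())  # 2038-01-19T03:14:07+00:00
    print(now_as_timeval())
```

Under that assumption the counter runs out a little over 68 years after the epoch, in January 2038.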
6
Profiling
Profiler: a tool that helps you identify the "important" parts of your program on which to concentrate your optimization efforts
Profile: breakup (of execution time) across different parts of the program
Can be done by adding statements to your program (instrumentation), so that during execution data is gathered, output, and possibly processed later
Automation: a profiling tool adds those instructions to your program for you
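A minimal sketch of hand-written instrumentation follows, assuming function-level granularity. A decorator plays the role of the added statements: it gathers a call count and accumulated CPU time per function, and `report` processes the data afterwards. The names `instrument`, `report`, `foo`, and `bar` are made up for this example, and this is not how prof or any other specific tool is implemented.

```python
import time
from collections import defaultdict
from functools import wraps

call_counts = defaultdict(int)     # execution counts per function
cpu_seconds = defaultdict(float)   # accumulated CPU time per function


def instrument(func):
    """Wrap func so that every call updates the profile data."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        start = time.process_time()
        try:
            return func(*args, **kwargs)
        finally:
            cpu_seconds[func.__name__] += time.process_time() - start
    return wrapper


@instrument
def foo(n):
    return sum(range(n))


@instrument
def bar(n):
    return [foo(n) for _ in range(3)]


def report():
    """Return one line per function, most expensive first."""
    names = sorted(cpu_seconds, key=cpu_seconds.get, reverse=True)
    return [f"{name:10s} {call_counts[name]:6d} calls {cpu_seconds[name]:.6f} s"
            for name in names]
```

Calling `bar(100000)` and then printing the lines from `report()` gives a small function-level profile. Note that the time recorded for `bar` includes the time spent in `foo`, and that the timer calls add overhead of their own.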
7
Profiling Mechanisms
Levels of granularity typically supported
  Function level
  Statement level
  Basic block level: a basic block is a sequence of contiguous instructions in a program with a single entry point (the first instruction in the basic block) and a single exit point (the last instruction in the basic block)
Two kinds of profile data
  Execution time
  Execution counts
We will look at examples of profiling mechanisms at the function and basic block level
8
Prof: UNIX Function Level Profiling
Usage
  % cc -p program.c    / generates instrumented a.out
  % a.out              / execution; instrumentation generates data and mon.out
  % prof               / processing of profile data
Output gives a function-by-function breakup of execution time
Useful in identifying which functions to concentrate optimization efforts on
9
Output:
%Time  Seconds  CumSecs  #Calls  Name
 56.8     0.50     0.50    1000  _baz
 27.4     0.24     0.74    1000  _bar
 15.9     0.14     0.88     500  _foo
 …
  0.0     0.00     0.88       1  _main
  0.0     0.00     0.88       3  _strcpy
10
Prof: How it Works
Instrumentation does three things
1. At entry of each function: increment an execution count for that function
2. At program entry: call the profil system call to collect execution times
3. At program exit: write profile data to an output file that can later be processed by prof
profil(): execution time profiler
  Generates an execution time histogram: execution time in each function
11
Profil: What it Does
One of the parameters in the call to profil is a buffer
  Used as an array of counters initialized to 0
  Array elements are associated with contiguous regions of program text
During execution, the PC value is sampled (once every clock tick, default: 10 msec), triggered by a timer interrupt
  The corresponding buffer element is incremented
  Later associated with a function; a time weight of 10 msec per sample is used to estimate CPU times
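The sampling scheme described on this slide can be mimicked with a small simulation. It is only a sketch: PC samples are supplied as plain integers instead of being taken on a timer interrupt, and the bucket size, addresses, and symbol table are invented for illustration. It also assumes each bucket is charged to the function that contains the bucket's first address, which may differ from what a real prof does when a bucket spans two functions.

```python
import bisect

TICK_SECONDS = 0.010  # time weight per sample (10 msec, as on the slide)


def make_histogram(text_start, text_end, bucket_bytes):
    """Array of counters, one per contiguous region of program text."""
    buckets = (text_end - text_start + bucket_bytes - 1) // bucket_bytes
    return [0] * buckets


def record_sample(hist, pc, text_start, bucket_bytes):
    """What happens on each tick: bump the counter covering the sampled PC."""
    index = (pc - text_start) // bucket_bytes
    if 0 <= index < len(hist):
        hist[index] += 1


def attribute_to_functions(hist, text_start, bucket_bytes, symbols):
    """Post-processing: turn counters into estimated CPU seconds per function.

    symbols is a list of (start_address, name) sorted by start_address.
    """
    starts = [addr for addr, _ in symbols]
    seconds = {name: 0.0 for _, name in symbols}
    for i, count in enumerate(hist):
        if count == 0:
            continue
        bucket_addr = text_start + i * bucket_bytes
        j = bisect.bisect_right(starts, bucket_addr) - 1
        if j >= 0:
            seconds[symbols[j][1]] += count * TICK_SECONDS
    return seconds


if __name__ == "__main__":
    symbols = [(0x1000, "foo"), (0x1040, "bar")]   # hypothetical layout
    hist = make_histogram(0x1000, 0x1080, 16)
    for pc in (0x1004, 0x1010, 0x1044):            # hypothetical samples
        record_sample(hist, pc, 0x1000, 16)
    print(attribute_to_functions(hist, 0x1000, 16, symbols))
```

Each sample stands for a whole tick, so a function that runs only briefly between ticks may receive no time at all, while one that happens to be running at every tick is overcharged.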
12
Using prof
From how it works, we understand that
  Granularity is at best 10 msec
  The generated profile could differ across multiple runs of a program with the same input!
  It could be completely wrong: a particular function might just happen to be running each time the timer interrupt occurs
Some usage guidelines
  Run under light load conditions
  Run a few times and see if results vary a lot
Note that function execution counts are exact, while execution times are estimates
13
Pixie: Basic Block Level Profiling
Available on MIPS, Alpha machines
Usage
  % cc program.c     / a.out
  % pixie a.out      / instrumented a.out.pixie
  % a.out.pixie      / profile output file
  % prof             / report on profile data
Output is based on basic block level execution counts
Useful for all kinds of things
14
What is a Basic Block?
A section of program that does not cross any conditional branches, loop boundaries or other transfers of control
A sequence of instructions with a single entry point, single exit point, and no internal branches
A sequence of program statements that contains no labels and no branches
A basic block can only be executed completely and in sequence
15
Pixie: How it Works
1. Identification of basic blocks
  Q: How can basic blocks be identified?
  Pixie uses heuristics where necessary
2. Instrumentation
  Increment a counter for the basic block
  On program entry and exit: initialization of data structures; writing profile output file
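One textbook answer to the question above is the "leader" method: the first instruction, every branch target, and every instruction that follows a branch each start a new basic block. The sketch below applies it to a toy instruction list. The instruction encoding is invented for this example, and nothing here is claimed to be how pixie itself identifies blocks in a real binary, where indirect jumps make the problem harder (hence the heuristics).

```python
def find_basic_blocks(instructions):
    """Split a toy program into basic blocks.

    instructions is a list of (op, target) pairs, where op is one of
    "op" (ordinary instruction), "br" (conditional branch), "jmp"
    (unconditional jump) or "ret", and target is the index of the branch
    target or None. Returns half-open (start, end) index ranges.
    """
    n = len(instructions)
    if n == 0:
        return []
    leaders = {0}                      # the first instruction starts a block
    for i, (op, target) in enumerate(instructions):
        if op in ("br", "jmp", "ret"):
            if target is not None:
                leaders.add(target)    # a branch target starts a block
            if i + 1 < n:
                leaders.add(i + 1)     # so does the instruction after a branch
    ordered = sorted(leaders)
    return list(zip(ordered, ordered[1:] + [n]))


if __name__ == "__main__":
    program = [
        ("op", None),   # 0
        ("op", None),   # 1  loop head, target of the jump at 4
        ("br", 5),      # 2  conditional exit from the loop
        ("op", None),   # 3
        ("jmp", 1),     # 4  back edge
        ("ret", None),  # 5
    ]
    print(find_basic_blocks(program))  # [(0, 1), (1, 3), (3, 5), (5, 6)]
```

With the blocks known, instrumentation in the style the slide describes would place one counter increment at the start of each range.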
16
How intrusive are these mechanisms?
Issue: does the instrumented program behave enough like the original program?
  If not, the generated profile might misdirect program optimization efforts
Pixie: instrumented executable can be several times the size of the original
  Does not matter; basic block execution counts are accurate
Prof: gathers more than just execution counts
  Instrumentation is not very large
17
Performance Tuning Tools
Performance counters provided in hardware
  Event-based or sampled counters
  Measure various events (e.g., CPU cycles, L1 cache misses, TLB misses, loads, instruction count, …)
  Counters may be accessible at user level or kernel level
  Accessible through the command line (user level) or through performance tools!
    % perfex executable [arguments]
    Accesses MIPS R10000 counters
18
Other Tools: VTune
Use sampling to gain an accurate representation of your software's actual performance, with negligible overhead.
  Gather CPU snapshots to identify problems such as cache misses.
  No special builds or instrumentation are required.
Produce a picture of program flow to quickly identify critical functions and call sequences using call graph profiling.
  Gain a high-level, algorithmic view of program execution.
19
Other Tools: Pin
Uses dynamic instrumentation
  Does not need source code, recompilation, or post-linking
Programmable instrumentation: provides rich APIs to write your own instrumentation tools (called Pintools) in C/C++
Instrumentation is done on the executable (binary) and can be attached statically or dynamically
Launch and instrument an application
  $ pin -t pintool -- application
  pin: instrumentation engine (provided)
  pintool: instrumentation tool (write your own or provided)
20
Assignment #2 (contd.)
5. Use any of the performance tuning tools to measure various performance metrics (cache misses, execution time, etc.) and reason about the performance of the different versions of the matrix multiplication program. (Due: Oct. 14, 2010)