1
Profile Guided Optimizations in Visual C++ 2005
Andrew Pardoe, Phoenix Team (C++ Optimizer)
2
What do optimizers do?

int setArray(int a, int *array) {
    int x;
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}

The compiler knows nothing about the value of 'a'.
The compiler knows nothing about the array's alignment.
The compiler doesn't look at all the source files together.
The compiler doesn't know how the program will execute.
3
What is PGO (pronounced "PoGO")?

A "profile" details a program's behavior in a specific scenario.
Profile-guided optimizations use the profile to guide the optimizer for that given scenario.
PGO tells the optimizer which areas of the application were most frequently executed.
This information lets the optimizer be more selective in optimizing the program.
PGO has its own set of optimizations as well as improving traditional optimizations.
4
Example of a PGO win

Compiler optimizations make assumptions based on static analysis and standard heuristics.
For example, we assume that a loop executes multiple times:

for (p = list; *p; p = p->next) {
    p->f = sqrt(F);
}

The optimizer would hoist the loop-invariant call to sqrt(F) out of the loop:

tmp = sqrt(F);
for (p = list; *p; p = p->next) {
    p->f = tmp;
}

If the profile shows that p is usually null (the loop body rarely executes), we will not hoist the call.
5
How is PGO used?

[Diagram: source code is compiled into an instrumented binary containing PGO probes; running the instrumented binary on the scenarios produces a profile; the source code and the profile are then used to build the optimized binary.]
6
How is PGO used?

PGO is built on top of Link-Time Code Generation (LTCG).
You must link the object files twice: once for the instrumented build, once for the optimized build.
Can be used on almost all native code: exe, dll, lib; COM/MFC; Windows services.
Cannot be used on system or managed code: no drivers or kernel-mode code, and no code compiled with /clr.
Incorrect scenarios could cause worse optimizations!
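A minimal sketch of the two links, assuming an appname.pgd database (the full command lines appear in the summary slide):

cl /c /O2 /GL *.cpp
link /ltcg:pgi /pgd:appname.pgd *.obj
(run the instrumented binary on representative scenarios)
link /ltcg:pgo /pgd:appname.pgd *.obj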
7
PGO profile gathering

Two major themes of PGO profile gathering:
Identify "hot paths" in program execution and optimize to make these paths perform well; likewise, identify "cold paths" to separate cold code (or dead code) from hot code.
Identify "typical" values, such as switch values, loop induction variables, and targets of indirect calls, and optimize the code for those values.
8
PGO main optimizations: inlining

Improved inlining heuristics: inline based on the frequency of the call, not the function size or the depth of the call stack.
"Hot" call sites: inline aggressively.
"Cold" call sites: only inline if there are other optimization opportunities (such as folding).
"Dead" call sites: only inline the trivial cases.
9
PGO main optimizations: inlining

Speculative inlining is used for virtual call speculation.
Indirect calls are profiled to find typical targets.
An indirect call heavily biased toward certain target(s) can be multi-versioned: the new sequence contains direct call(s) to the typical target(s), which can then be inlined.
Partial inlining: only inline the portions of the callee we actually execute. If the cold code is reached, call the non-inlined function.
10
PGO main optimizations: code size

The choice of favoring size versus speed is made on a per-function basis.
Program execution should be dominated by functions optimized for speed, and less-frequently used functions should be small.
PGO computes a dynamic instruction count for each profiled function, taking inlining effects into account, and sorts the functions in descending order by count.
Functions in the upper 99% of the total dynamic instruction count are optimized for speed; the others are optimized for size.
In large applications (Vista, SQL Server) most functions are optimized for size.
11
PGO main optimizations: locality

Reorder the code to "fall through" wherever possible.
Intra-function layout reorders basic blocks so that the major trace falls through whenever possible.
Inter-function layout tries to place frequent caller-callee pairs near one another in the image.
Extract "dead" code from the .text section and put it in a remote section of the image.
Dead code can be entire functions that are never called, or basic blocks inside a function.
The penalty for being wrong is very large, so the profile must be accurate!
12
What code benefits most?

C++ programs: many virtual calls can be inlined once the target is determined through profiling.
Large applications where both size and speed are important.
Code with frequent branches that are difficult to predict at compile time.
Code which can be separated by profiling into "hot" and "cold" blocks to help instruction cache locality.
Code for which you know the typical usage patterns and can produce accurate profiling scenarios.
13
Scenario 1

Customer compiles with /O2 and gets pretty good performance, but wants to take advantage of advanced optimizations like LTCG and PGO.
Code is tested by the dev team throughout the development cycle using unit and bug-regression tests.
Customer has done performance measurements of the code, but has no automated tests to measure performance; they believe it can improve.
Is this customer ready to try PGO? Probably not.
14
Scenario 2

Customer has well-defined performance goals and tests set up to measure performance.
Customer knows typical usage patterns for the application.
Application is being built with LTCG.
Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations.
Is this customer ready to use PGO? Maybe…
15
Scenario 3

Customer has well-defined performance goals and tests set up to measure performance.
Customer knows typical usage patterns for the application.
Application is being built with LTCG.
Application spends most of its time in branches and calls.
Application is fairly large and makes use of inheritance.
Is this customer ready to use PGO? Definitely.
16
Scenario 4

Customer has a build lab and wants to enable PGO in nightly builds, but profiling every night seems too expensive.
Solution: PGO Incremental Update (PGU).
Avoid running profile scenarios at every build: PGU uses "stale" profile data.
You can check in profile data and refresh it weekly.
PGU restricts optimizations: functions which have changed will not be optimized.
The effects of localized changes are usually negligible.
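A minimal sketch of such a nightly build, reusing last week's profile data via the incremental-update link (appname.pgd is a placeholder name):

link /ltcg:pgupdate /pgd:appname.pgd *.obj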
17
PGO sweeper

Some scenarios are difficult to collect profile data for:
The profile scenario may not begin and end with application launch and shutdown.
Some components cannot write a file.
Some components cannot link to the PGO runtime DLL.
PGO sweeper collects profile data from running instrumented processes.
It allows you to close the currently open .pgc file and create a new one without exiting the instrumented binary.
You get one .pgc file per run or sweep; you can delete any .pgc files you do not want reflected in your scenario.
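A minimal sketch, assuming an instrumented binary named app.inst.exe and a chosen output name for the .pgc file:

pgosweep app.inst.exe app-scenario1.pgc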
18
PGO Manager

PGO Manager (pgomgr) adds profile data from one or more .pgc files into the .pgd file.
The .pgd file is the main profile database.
This lets you merge profiles from multiple scenarios (.pgc) for a single codebase into one profile database (.pgd).
PGO Manager also lets you generate reports from the .pgd file to check that your scenarios "feel right" in the code.
Information in the reports includes:
Module count, function count, arc and value count.
Static (all) instruction count, dynamic (hot) instruction count.
Basic block count, average basic block size.
Function entry count.
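A minimal sketch of merging scenario data and printing a report (file names are placeholders):

pgomgr /merge app-scenario1.pgc appname.pgd
pgomgr /merge app-scenario2.pgc appname.pgd
pgomgr /summary appname.pgd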
19
How much performance does PGO get?

Performance gain is architecture- and application-specific.
IA64 sees the biggest gains; x64 benefits more than x86.
Large applications benefit more than small ones: SQL Server saw over 30% gains through PGO.
Many parts of Windows use PGO to balance size vs. speed.
If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win.
Once your testing is in place, integrating PGO into your build process should be easy.
20
Performance gains over LTCG
21
Call-graph profiling

Given this call graph, determine which code paths are hot and which are cold.
[Call-graph diagram with functions foo, bat, bar, and baz.]
22
Call-graph profiling continued

Measure the frequency of calls.
[Call-graph diagram with the edges between foo, bat, bar, and baz annotated with call frequencies such as 100, 75, 50, 20, 15, and 10.]
23
Call-graph profiling after inlining

Inline functions based on the call profile.
The highest-frequency calls are (bar, baz) and (bat, bar).
[Call-graph diagram showing the graph after those call sites have been inlined.]
24
Reordering basic blocks

Change code layout to improve instruction cache locality.
[Diagram: an execution profile in which the path A→C→D runs 100 times and A→B only 10 times; the default layout A, B, C, D is reordered to A, C, B, D so the hot path falls through.]
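A hypothetical illustration of the kind of branchy code this helps. With a profile showing the error branch is cold, the common path can be laid out as straight-line fall-through code:

#include <cstdio>

// Hypothetical example: the profile shows status is almost always zero, so
// PGO can lay out the common path as fall-through code and move the
// error-handling block away from the hot instruction stream.
int process(int status, int value) {
    if (status != 0) {                 // cold in the profile
        std::printf("error %d\n", status);
        return -1;
    }
    return value * 2;                  // hot path falls through
}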
25
Speculative inlining of virtual calls

Profiling shows that the dynamic type of object A in function Func was almost always Foo (and almost never Bar).

class Base { … virtual void call(); };
class Foo : public Base { … void call(); };
class Bar : public Base { … void call(); };

Before PGO:

void Func(Base *A) {
    …
    while (true) {
        …
        A->call();              // virtual dispatch
        …
    }
}

After PGO:

void Func(Base *A) {
    …
    while (true) {
        …
        if (type(A) == Foo) {
            // inlined body of Foo::call()
        } else {
            A->call();          // virtual dispatch
        }
        …
    }
}
26
Partial inlining

Profiling shows that condition Cond favors the left branch over the right branch.
[Flow-graph diagram: Basic Block 1 ends in condition Cond; the left successor is hot code, the right successor is cold code; both rejoin at more code.]
27
Partial inlining concluded

We can inline the hot path and not the cold path, and we can make different decisions at each call site!
[Flow-graph diagram: the hot-code block is inlined at the call site while the cold-code block remains behind a call to the out-of-line function.]
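A hypothetical shape of a callee that benefits: the cheap, common path can be inlined at each call site while the rarely-taken path stays behind a call to the original function.

#include <cstdio>

// Hypothetical callee with a hot portion and a cold portion. With profile
// data, PGO can inline just the hot early-out path at a call site and reach
// the cold out-of-range handling through a call to the non-inlined function.
int clamp_index(int i, int n) {
    if (i >= 0 && i < n)            // hot path in the profile
        return i;
    // cold path: out-of-range handling
    std::fprintf(stderr, "index %d out of range [0, %d)\n", i, n);
    return i < 0 ? 0 : n - 1;
}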
28
Using PGO (in more detail)

[Diagram: compile the source code with /GL and optimizations to produce object files; link the object files with /LTCG:PGI to produce the instrumented binary and a .PGD file; run the instrumented binary on the scenarios to produce .PGC file(s); then link the object files, .PGC files, and .PGD file with /LTCG:PGO to produce the optimized binary.]
30
PGO tips

The scenarios used to generate the profile data should be real-world scenarios; they are NOT an attempt at code coverage.
Training with scenarios that are not representative of real-world use can result in code that performs worse than if PGO were not used.
Name the optimized binary something different from the instrumented binary, for example app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything again.
To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
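For example, a sketch of clearing accumulated counts before re-training (appname.pgd is a placeholder):

pgomgr /clear appname.pgd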
31
PGO tips

If you have two scenarios that run for different amounts of time but would like them to be weighted equally, you can use the weight switch of pgomgr (/merge:n) on the .PGC files to adjust them.
You can use the speed switch to change the speed/size thresholds.
You can control the inlining threshold with a switch, but use it with care: the values from 0-100 aren't linear.
Integrate PGO into your build process and update scenarios frequently for the most consistent results and the best performance increases.
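For example, a sketch that counts a short scenario's data three times when merging it into the database (file names are placeholders):

pgomgr /merge:3 short-scenario.pgc appname.pgd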
32
In summary

Using PGO is very easy, with four simple steps:
CL to parse the source files: cl /c /O2 /GL *.cpp
LINK /PGI to generate the instrumented image: link /ltcg:pgi /pgd:appname.pgd *.obj *.lib (this also generates the .PGD file, the PGO database)
Run your program on representative scenarios; this generates .PGC files (PGO profile data)
LINK /PGO to generate the optimized image, implicitly using the generated .PGC files: link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
33
More information

Matt Pietrek's Under the Hood column from May 2002 has a fantastic explanation of LTCG internals.
There are multiple articles on PGO located on MSDN; the links are long, so just search for PGO on MSDN.
Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN.
Improvements are coming in the new VC++ backend, based on the Phoenix optimization framework; profiling is a major scenario for the Phoenix-based optimizer.
There will be a talk on Phoenix later today.