Visual C++ Optimizations Jonathan Caves Principal Software Engineer Visual C++ Microsoft Corporation
How can your application run faster? ► Maximize optimization for each file. ► Whole Program Optimization (WPO) goes beyond individual files. ► Profile Guided Optimization (PGO) specializes optimizations specifically for your application. ► New Floating Point Model. ► OpenMP ► 64bit Code Generation.
Maximum Optimization for Each File ► Compiler optimizes each source code file to get the best runtime performance The only type of optimization available in Visual C++ 6 ► Visual C added better optimization algorithms Specialized support for newer processors such as Pentium 4 Improved speed and better precision of floating point operations New optimization techniques like loop unrolling ► Typical expectation for performance after rebuild 10-20% improvement from Visual C++ 6 to Visual C 20-30% improvement from Visual C++ 6 to Visual C
Whole Program Optimization ► Typically Visual C++ optimizes programs by generating code for each object file separately ► Introducing whole program optimization First introduced with Visual C and has since improved Compiler and linker are set with new options (/GL and /LTCG) Compiler has freedom to do additional optimizations ► Cross-module inlining ► Custom calling conventions Visual C supports this on all platforms Whole program optimization is widely used for Microsoft products such as SQL Server ► Typically expect significant performance improvement About 30% improvement from Visual C to Visual C
Profile Guided Optimization ► Static analysis leaves many open optimization questions for the compiler, leading to conservative optimizations ► Visual C++ programs can be tuned for expected user scenarios by collecting information from the running application ► Introducing profile guided optimization Optimizes code based on how customers actually use the program Runs optimizations at link time, like whole program optimization Available in Visual Studio 2005 Widely adopted at Microsoft
    if (p != NULL) { /* Perform action with p */ } else { /* Error code */ }
Is it common for p to be NULL? If it is not common for p to be NULL, the error code should be grouped with other infrequently used code
PGO: Instrumentation ► We instrument with “probes” inserted into the code ► Two main types of probes Value probes ► Used to construct histogram of values Count (simple/entry) probes ► Used to count number of times a path is taken ► We try to insert the minimum number of probes to get full coverage Minimizes the cost of instrumentation
PGO Optimizations ► Switch expansion ► Better inlining decisions ► Cold code separation ► Virtual call speculation ► Partial inlining
Profile Guided Optimization ► Compile Source with /GL & Optimizations On (e.g. /O2) to produce Object files ► Link Object files with /LTCG:PGI to produce the Instrumented Image ► Run Scenarios on the Instrumented Image to Output Profile data ► Link Object files and Profile data with /LTCG:PGO to produce the Optimized Image
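As a rough sketch of this workflow, the comments in the small program below walk through the build steps using the switches named on the slide. The file names (app.cpp, app.obj, app.exe) and the functions classify and run_scenario are hypothetical, and the exact command lines are an assumption that may differ between Visual C++ versions.

```cpp
// Hypothetical build steps (file names are placeholders):
//   1. cl /c /O2 /GL app.cpp      -> app.obj
//   2. link /LTCG:PGI app.obj     -> instrumented app.exe
//   3. run app.exe on representative scenarios -> profile data
//   4. link /LTCG:PGO app.obj     -> optimized app.exe

// A function with one common and one rare branch, so that a
// training run has something to measure.
int classify(int value) {
    if (value >= 0) {
        return value % 7;   // expected to be the common case
    }
    return -1;              // expected to be rare
}

// Stand-in for a training scenario: feeds mostly non-negative
// inputs, with a negative one every 100th iteration.
int run_scenario(int iterations) {
    int total = 0;
    for (int i = 0; i < iterations; ++i) {
        int input = (i % 100 == 0) ? -i - 1 : i;
        total += classify(input);
    }
    return total;
}
```

Calling run_scenario from main in the instrumented build would play the role of the "Scenarios" step; the quality of the optimized image depends on how representative those scenarios are.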
PGO: Inlining Sample ► Profile Guided uses call graph path profiling. [Figure: call graph with nodes foo, bat, bar, baz and a]
PGO: Inlining Sample (Cont) ► Profile Guided uses call graph path profiling. [Figure: the same call graph of foo, bat, bar, baz and a, annotated with call counts]
PGO – Inlining Sample (cont) ► Inlining decisions are made at each call site. [Figure: the call graph of foo, bat, bar, baz and a after inlining, with the remaining call counts]
PGO – Switch Expansion ► Most frequent values are pulled out.
Before:
    // 90% of the time i == 10
    switch (i) { case 1: … case 2: … case 3: … default: … }
After:
    if (i == 10) goto default;
    switch (i) { case 1: … case 2: … case 3: … default: … }
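The transformation can be imitated by hand. The sketch below is only an illustration: the function names and return values are hypothetical, and the "profile" (i is usually 10) is assumed rather than measured.

```cpp
// Hypothetical handler: the original switch.
int handle(int i) {
    switch (i) {
    case 1:  return 100;
    case 2:  return 200;
    case 3:  return 300;
    default: return -1;
    }
}

// Hand-expanded form: the value assumed to be most frequent is
// tested first and goes straight to the default result, so the
// common case never reaches the switch dispatch.
int handle_expanded(int i) {
    if (i == 10) {
        return -1;   // same result as the default case
    }
    switch (i) {
    case 1:  return 100;
    case 2:  return 200;
    case 3:  return 300;
    default: return -1;
    }
}
```

Both functions return the same result for every input; only the order of the tests differs.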
PGO – Code Separation ► Basic blocks are ordered so that most frequent path falls through. [Figure: control flow graph of basic blocks A, B, C and D, shown in the Default layout and in the Optimized layout]
PGO – Virtual Call Speculation ► The profiles showed that the object A in function Func was almost always of type Foo
    class Base { … virtual void call(); };
    class Foo : Base { … void call(); };
    class Bar : Base { … void call(); };
Before:
    void Func(Base *A) { … while (true) { … A->call(); … } }
After:
    void Func(Base *A) { … while (true) { … if (type(A) == Foo) { /* inline of A->call(); */ } else A->call(); … } }
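A hand-written approximation of this speculation is sketched below. It assumes the profile says the object is almost always a Foo. The typeid test is a stand-in chosen for portability; a compiler would likely use a cheaper check. The return values and the iteration count are added only so the two versions can be compared.

```cpp
#include <typeinfo>

struct Base {
    virtual ~Base() {}
    virtual int call() { return 0; }
};
struct Foo : Base {
    int call() override { return 1; }
};
struct Bar : Base {
    int call() override { return 2; }
};

// Plain version: a virtual dispatch on every iteration.
int func(Base* a, int iterations) {
    int sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += a->call();
    }
    return sum;
}

// Speculated version: guard on the type assumed to be common.
int func_speculated(Base* a, int iterations) {
    int sum = 0;
    for (int i = 0; i < iterations; ++i) {
        if (typeid(*a) == typeid(Foo)) {
            // Qualified call: resolved at compile time, so the
            // compiler is free to inline it.
            sum += static_cast<Foo*>(a)->Foo::call();
        } else {
            sum += a->call();   // fallback keeps other types correct
        }
    }
    return sum;
}
```

The fallback branch is what makes the speculation safe: a Bar passed to func_speculated still goes through normal virtual dispatch.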
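PGO – Partial Inlining [Figure: flow graph of a function: Basic Block 1, then Cond, which branches to Cold Code and Hot Code, followed by More Code]
PGO – Partial Inlining (cont) ► Hot path is inlined, but NOT the cold [Figure: the same flow graph (Basic Block 1, Cond, Cold Code, Hot Code, More Code) with the hot path inlined]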
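Done by hand, partial inlining might look like the sketch below: the condition and the hot code are copied into the caller, while the cold code remains an ordinary call. All names are hypothetical, and which branch is hot is an assumption standing in for profile data.

```cpp
// Cold part: stands in for longer, rarely executed code.
int process_cold(int value) {
    return -value;
}

// Original callee: a condition selecting a hot or a cold path.
int process(int value) {
    if (value >= 0) {
        return value * 2;        // hot path
    }
    return process_cold(value);  // cold path
}

// Caller without inlining: one call per element.
int sum_with_calls(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += process(values[i]);
    }
    return total;
}

// Caller with the hot path inlined by hand; the cold path is
// still reached through a call.
int sum_partially_inlined(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        int v = values[i];
        if (v >= 0) {
            total += v * 2;
        } else {
            total += process_cold(v);
        }
    }
    return total;
}
```

The caller grows only by the size of the hot code, which is the point of inlining part of a function rather than all of it.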
Demo Optimizing applications with Visual C++
New Floating Point Model ► /Op made your code run slow No intermediate switch ► New Floating Point Model /fp:fast /fp:precise (default) /fp:strict /fp:except
/fp:precise ► The default floating point switch ► Performance and Precision ► IEEE Conformant ► Round to the appropriate precision At assignments, casts and function calls
/fp:fast ► When performance matters most ► You know your application does simple floating point operations ► What can /fp:fast do? Association Distribution Factoring inverse Scalar reduction Copy propagation And others …
/fp:except ► Reliable floating point exceptions ► Thrown and not thrown when expected Faults and traps, when reliable, should occur at the line that causes the exception FWAITs on x86 might be added ► Cannot be used with /fp:fast or in managed code
/fp:strict ► The strictest FP option Turns off contractions Assumes floating point control word can change or that the user will examine flags ► /fp:except is implied ► Low double digit percent slowdown versus /fp:fast
What is the output?
    #include <stdio.h>
    int main() {
        double x, y, z;
        double sum;
        x = 1e20;
        y = -1e20;
        z = 10.0;
        sum = x + y + z;
        printf("sum=%f\n", sum);
        return 0;
    }
/fp:fast /O2 = 0.000
/fp:strict /O2 = 10.0
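One plausible explanation for the two results is reassociation: under /fp:fast the compiler may regroup the additions. The sketch below makes both groupings explicit. The function names are hypothetical, and the volatile temporaries are there only to force each intermediate result to be rounded to double.

```cpp
// Left-to-right grouping, as the expression is written: (x + y) + z.
double sum_left_to_right(double x, double y, double z) {
    volatile double t = x + y;   // rounded to double before the next add
    return t + z;
}

// Reassociated grouping: x + (y + z).
double sum_reassociated(double x, double y, double z) {
    volatile double t = y + z;   // z can be lost here if y is huge
    return x + t;
}
```

With x = 1e20, y = -1e20 and z = 10.0 in IEEE double arithmetic, the first function returns 10 and the second returns 0, because 10 is far smaller than the spacing between adjacent doubles near 1e20 and is absorbed when added to y. Whether a given compiler actually regroups this expression under /fp:fast is not something the sketch can show.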
OpenMP A specification for writing multithreaded programs It consists of a set of simple #pragmas and runtime routines Makes it very easy to parallelize loop-based code Helps with load balancing, synchronization, etc… In Visual Studio, only available in C++
OpenMP Parallelization ► Can parallelize loops and straight-line code ► Includes synchronization constructs [Figure: the range of i, from first = 1 up to 1000, divided into four sub-ranges]
    void test(int first, int last) {
        #pragma omp parallel for
        for (int i = first; i <= last; ++i) {
            a[i] = b[i] + c[i];
        }
    }
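A self-contained variant of the slide's loop is sketched below. The slide leaves a, b and c undeclared, so passing them as std::vector<int> parameters is an assumption, as is the name add_range. The pragma takes effect only when OpenMP is enabled in the compiler (the /openmp switch in Visual C++); otherwise it is ignored and the loop runs serially with the same result.

```cpp
#include <vector>

// Adds b and c element-wise into a over the inclusive index
// range [first, last]. Iterations are independent of each other,
// which is what makes the loop safe to split across threads.
void add_range(std::vector<int>& a,
               const std::vector<int>& b,
               const std::vector<int>& c,
               int first, int last) {
    #pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

For the range on the slide, the vectors would need at least 1001 elements so that index 1000 is valid.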
64bit Compilers ► 64bit Compiler Cross Tools Compiler is 32bit but resulting image is 64bit ► 64bit Compiler Native Tools Compiler and resulting image are 64bit binaries. ► All previous optimizations apply for 64bit as well.
Understanding of Your Source Code ► Visual Studio Team System 2005 provides tools that help you understand defects and behavior of your source code ► Static code analysis Finds defects in source code at build time ► Profiler Determines where application spends time ► Code coverage Verifies that code paths are used as expected
Static Code Analysis ► Static code analysis helps developers find defects in code (/analyze) Reports code defects Warns about possible security vulnerabilities Suggests ways to improve performance Identifies possible design issues Enforces best practices ► Warns about defects and displays path to a problem
    void vulnerable(char* p) {
        wchar_t buf[16];
        int ret;
        ret = MultiByteToWideChar(CP_ACP, 0, p, -1, buf, sizeof(buf));
        printf("%d\n", ret);
    }
Do you see the buffer overrun? This caused Code Red.
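The overrun comes from units: the last argument of MultiByteToWideChar is the size of the output buffer in characters, but sizeof(buf) is its size in bytes. The portable sketch below shows the difference without calling the Windows API; the function names are hypothetical.

```cpp
#include <cstddef>

// Size of the buffer in bytes: what sizeof(buf) yields.
std::size_t buffer_size_in_bytes() {
    wchar_t buf[16];
    return sizeof(buf);
}

// Capacity of the buffer in wide characters: the value the call
// on the slide should have passed.
std::size_t buffer_size_in_chars() {
    wchar_t buf[16];
    return sizeof(buf) / sizeof(buf[0]);
}
```

On Windows, where wchar_t is 2 bytes, the first function returns 32 and the second 16, so the call on the slide tells the API that the buffer holds twice as many characters as it really does.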
Static Code Analysis
    char *name = new char[10];
    if(x < n) return ERR_CODE;
    delete name;
[Figure: diagram relating the source code, the Intermediate Representation, the .EXE, and Code Analysis]
Defect Detection [Categories: Defects, Security, Design, Policy, Performance]
    char *name = new char[10];
    if(x < n) return ERR_CODE;
    delete name;
Potential Memory Leak!
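One way to remove the leak, sketched here under assumptions, is to let an object own the memory so that every return path releases it. The function name use_name and the value of ERR_CODE are hypothetical; the slide defines neither.

```cpp
#include <vector>

const int ERR_CODE = -1;   // hypothetical value, not given on the slide

int use_name(int x, int n) {
    // The vector frees its storage automatically on every return
    // path, including the early return below.
    std::vector<char> name(10);
    if (x < n) {
        return ERR_CODE;
    }
    name[0] = '\0';
    return 0;
}
```

Using a container also avoids the mismatch in the slide's snippet between new char[10] and a non-array delete.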
Security Defect Detection [Categories: Defects, Security, Design, Policy, Performance]
    class Buffer {
        char buffer[10];
    public:
        void* Fill(int value, int fillCount) {
            while (--fillCount) buffer[fillCount] = value;
            *buffer = value;
            return buffer;
        }
    };
Integer Overflow Error
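A possible repair is to validate fillCount before touching the buffer. The sketch below is one option among several; rejecting out-of-range counts by returning a null pointer is a design choice made for the example, not something the slide prescribes.

```cpp
class Buffer {
    char buffer[10];
public:
    // Fills the first fillCount bytes with value and returns the
    // buffer, or returns a null pointer if fillCount is outside
    // the range 1..sizeof(buffer).
    void* Fill(int value, int fillCount) {
        if (fillCount < 1 ||
            fillCount > static_cast<int>(sizeof(buffer))) {
            return nullptr;
        }
        for (int i = 0; i < fillCount; ++i) {
            buffer[i] = static_cast<char>(value);
        }
        return buffer;
    }
};
```

Counting upward from zero also removes the pre-decrement loop, which in the slide's version never terminates normally when fillCount is zero or negative.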
Profiler ► Examine performance for entire application or for its specific parts ► Helps to find runtime bottlenecks of programs Option of collecting information via sampling or instrumentation Collect up to 15 performance counters ► Significantly better than profiler in Visual C++ 6
Resources ► Visual C++ Dev Center This is the place to go for all our news and whitepapers Also VC2005 specific forums at ► Myself