Published by Samson Lang. Modified over 9 years ago.
1
Tips and Tricks: Visual C++ 2005 Optimization Best Practices
TLNL04
Kang Su Gatlin
Program Manager, Visual C++
Microsoft Corporation
2
6 Tips/Best Practices To Help Any C++ Dev Write Faster Code

Managed + Unmanaged
  1. Pick the right level of optimization
  2. Add instant parallelism
Unmanaged
  3. Disambiguate memory
  4. Use intrinsics
Managed
  5. Avoid double thunks
  6. Speed app startup time
3
1. Pick the Right Level Of Optimization

Builds from the Lab
- If at all possible, use Profile-Guided Optimization
  - Only available for unmanaged code
  - More on this on the next slide
- If not, use Whole Program Optimization (/GL)
  - Available for managed and unmanaged code
- After that we recommend
  - /O2 (optimize for speed) for hot functions/files
  - /O1 (optimize for size) for the rest
- Other switches to use for maximum speed
  - /Gy
  - /OPT:REF,ICF (good size win on 64-bit)
  - /fp:fast
  - /arch:SSE2 (will not work on downlevel architectures)

Debug Symbols Are NOT Only for Debug Builds
- Executable size and codegen are NOT affected by generating debug symbols
- It’s all in the PDB file
- Always building debug symbols will make life easier
- Make sure you use /OPT:REF,ICF, don’t use /ZI, and use /INCREMENTAL:NO
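The slide applies /O2 and /O1 per file on the command line. As a hedged sketch of a per-function variant that the talk does not mention, MSVC's `#pragma optimize` can favor speed for a single hot function inside a file that is otherwise built for size. The function `dot` is hypothetical, and the pragma is guarded so that other compilers simply skip it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hot function: favor speed here even if the file is built with /O1.
#if defined(_MSC_VER)
#pragma optimize("t", on)   // "t" = favor fast code for the functions that follow
#endif
double dot(const double* x, const double* y, std::size_t n) {
    assert(x && y);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
#if defined(_MSC_VER)
#pragma optimize("", on)    // restore the optimization settings from the command line
#endif
```

Code that follows the second pragma is compiled with whatever /O switch the file was given, so the cold remainder of the file can stay on /O1.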
4
Next-Gen Optimizations Today

Profile Guided Optimization
- The next level beyond Whole Program Optimization
- Static compilers can’t answer everything
- We get 20-50% improvement on large server applications that we ship
- Current support is unmanaged only

    if (a < b)
        foo();
    else
        baz();

Should we inline foo()?

    for (i = 0; i < count; ++i)
        bar();

Should we unroll this loop?
5
Profile Guided Optimization

1. Compile with /GL: Source -> Object files
2. Link with /LTCG:PGI: Object files -> Instrumented image + PGD file
3. Run scenarios on the instrumented image: Scenarios -> Output profile data
4. Link with /LTCG:PGO: Object files + Profile data -> Optimized image

There is throughput impact
6
What PGO Does And Does Not Do

PGO does
- Optimizations galore
  - Speed/size determination
  - Switch expansion
  - Better inlining decisions
  - Function/basic block layout
  - Virtual call speculation
  - Partial inlining
- Optimize within a single image
- Merge and weight multiple scenarios

PGO does not
- Probe assembly language (inline or otherwise)
- Optimize across DLLs
- Optimize data layout
7
PGO Compilation in Visual C++ 2005
8
2. Add Instant Parallelism

Just add OpenMP pragmas!
- OpenMP is a popular API for multithreaded programs
  - Born from the HPC community
- It consists of a set of simple #pragmas and runtime routines
- Most valuable when parallelizing large loops with no loop-carried dependencies
- Visual C++ 2005 implements the full OpenMP 2.0 standard
  - Full unmanaged and /clr managed support
- See the PDC issue of MSDN Magazine for an article on OpenMP
9
OpenMP Parallelization

    void test(int first, int last) {
        #pragma omp parallel for
        for (int i = first; i <= last; ++i) {
            a[i] = b[i] * c[i];
        }
    }

Each iteration is independent; order of execution does not matter

Serial version:

    if (x < 0)
        a = foo(x);
    else
        a = x + 5;
    b = bat(y);
    c = baz(x + y);
    j = a*b+c;

With OpenMP sections:

    #pragma omp parallel sections
    {
        #pragma omp section
        if (x < 0)
            a = foo(x);
        else
            a = x + 5;
        #pragma omp section
        b = bat(y);
        #pragma omp section
        c = baz(x + y);
    }
    j = a*b+c;

Assignments to ‘a’, ‘b’, and ‘c’ are independent
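The two patterns on this slide can be put into a compilable form roughly as follows. This is a minimal sketch, not the code from the talk: `multiply`, `combine`, and the bodies of `foo`, `bat`, and `baz` are hypothetical stand-ins. Build with /openmp (Visual C++) or -fopenmp (GCC, Clang); without the switch the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the functions named on the slide.
int foo(int x) { return -x; }
int bat(int y) { return 2 * y; }
int baz(int z) { return z; }

// Each iteration writes only a[i], so the iterations are independent.
void multiply(const std::vector<double>& b, const std::vector<double>& c,
              std::vector<double>& a) {
    assert(a.size() == b.size() && b.size() == c.size());
    const int n = static_cast<int>(a.size());  // OpenMP 2.0 needs a signed loop index
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * c[i];
}

// Three independent assignments, each placed in its own section.
int combine(int x, int y) {
    int a = 0, b = 0, c = 0;  // declared outside the region, so shared by default
    #pragma omp parallel sections
    {
        #pragma omp section
        { a = (x < 0) ? foo(x) : x + 5; }
        #pragma omp section
        { b = bat(y); }
        #pragma omp section
        { c = baz(x + y); }
    }
    return a * b + c;  // runs after the implicit barrier that ends the sections
}
```

The final expression is only safe because the sections construct ends with an implicit barrier, so all three assignments have completed before `a * b + c` is evaluated.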
10
OpenMP Case Study

Panorama Factory by Smoky City Design
- Top-rated image stitching application
- Added multithreading with OpenMP in Visual C++ 2005 Beta2
- Used 102 instances of #pragma omp *

Extremely impressive results…
- Stitching together several large images
- Dual processor, dual core x64 machine
12
3. Disambiguate Memory

Programmer knows a and b never overlap

    void copy8(int * a, int * b) {
        a[0] = b[0];
        a[1] = b[1];
        a[2] = b[0];
        a[3] = b[1];
        a[4] = b[0];
        a[5] = b[1];
        a[6] = b[0];
        a[7] = b[1];
    }

The compiler cannot assume that, so it reloads b[0] and b[1] after every store (ecx = a, eax = b):

    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+4], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+8], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+12], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+16], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+20], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+24], edx
    mov eax, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+28], eax
13
Aliasing And Memory Disambiguation

- Aliasing is when the same object can be reached through more than one name or pointer
- If the compiler can NOT prove that an object does not alias, then it MUST assume it can

How can we address some of these problems?
1. Avoid taking the address of an object.
2. Avoid taking the address of a function.
3. Avoid using global variables. Statics are preferable.
4. Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.
14
__restrict – A compiler hint

Programmer knows a and b don’t overlap

    void copy8(int * __restrict a, int * b) {
        a[0] = b[0];
        a[1] = b[1];
        a[2] = b[0];
        a[3] = b[1];
        a[4] = b[0];
        a[5] = b[1];
        a[6] = b[0];
        a[7] = b[1];
    }

With the hint, b[0] and b[1] are loaded once and reused (eax = a, edx = b):

    mov ecx, DWORD PTR [edx]
    mov edx, DWORD PTR [edx+4]
    mov DWORD PTR [eax], ecx
    mov DWORD PTR [eax+4], edx
    mov DWORD PTR [eax+8], ecx
    mov DWORD PTR [eax+12], edx
    mov DWORD PTR [eax+16], ecx
    mov DWORD PTR [eax+20], edx
    mov DWORD PTR [eax+24], ecx
    mov DWORD PTR [eax+28], edx
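The same hint applies to loops, where a value read through one pointer would otherwise have to be reloaded after every store through another. The sketch below is illustrative only: `scale` and its parameters are hypothetical names, and it relies on `__restrict` being accepted as a keyword by Visual C++, GCC, and Clang.

```cpp
#include <cassert>

// The caller promises that out, in, and factor never overlap, so the compiler
// may keep *factor in a register instead of reloading it after each store.
void scale(float* __restrict out, const float* __restrict in,
           const float* __restrict factor, int n) {
    assert(out && in && factor);
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * *factor;
}
```

As with copy8, this is a promise made by the programmer: calling `scale` with overlapping buffers is undefined behavior, and the compiler will not diagnose it.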
15
__declspec(restrict)

- Tells the compiler that the function returns an unaliased pointer
- Only applicable to functions
- This is a promise the programmer makes to the compiler
  - If this promise is violated, the compiler may generate bad code
- The CRT uses this decoration, e.g., malloc, calloc, etc…

    __declspec(restrict) void *malloc(size_t size);
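A user-defined allocator can carry the same decoration as the CRT functions. This is a minimal sketch under stated assumptions: the macro `RETURNS_UNALIASED` and the function `make_zeroed_ints` are hypothetical names, and the macro expands to nothing on compilers other than Visual C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(_MSC_VER)
#define RETURNS_UNALIASED __declspec(restrict)
#else
#define RETURNS_UNALIASED  // no-op elsewhere; the promise is MSVC-specific here
#endif

// The returned block is freshly allocated, so no other pointer in the
// program can refer to it yet. That is exactly the promise being made.
RETURNS_UNALIASED int* make_zeroed_ints(std::size_t count) {
    assert(count > 0);
    return static_cast<int*>(std::calloc(count, sizeof(int)));
}
```

A function that hands out pointers into a shared pool or cache must not use the decoration, because its results can alias pointers the caller already holds.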
16
__declspec(noalias)

- Tells the compiler that the function is a semi-pure function
  - Only references locals, arguments, and first-level indirections of arguments
- This is a promise the programmer makes to the compiler
  - If this promise is violated, the compiler may generate bad code

    __declspec(noalias) bool isElement(Tree *t, Element e);
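A function that meets the semi-pure requirement reads or writes nothing beyond its locals, its arguments, and the memory its pointer arguments point at directly. The sketch below assumes hypothetical names (`NOALIAS`, `sum`) and compiles the decoration away on compilers other than Visual C++.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#define NOALIAS __declspec(noalias)
#else
#define NOALIAS  // no-op elsewhere
#endif

// Reads only values[0..n-1], a first-level indirection of an argument,
// and touches no global state.
NOALIAS int sum(const int* values, int n) {
    assert(values && n >= 0);
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += values[i];
    return total;
}
```

Because the call cannot modify globals, the optimizer is free to keep global values cached in registers across a call to `sum`.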
17
4. Use Intrinsics

- Simply represented as functions to the programmer
  - _mm_load_pd(double const*);
- Compilers understand these as primitives
- Allows the user to get right at the hardware w/o using asm
- Almost anything you can do in assembly
  - interlock, memory fences, cache control, SIMD
- The key to things such as vectorization and lock-free programming
- You can use intrinsics in a file compiled /clr, but the function(s) will be compiled as unmanaged
- Intrinsics are consumed by PGO and our optimizer
  - Inline asm is not
- Documentation for intrinsics is much better in Visual C++ 2005
  - [Visual Studio 8]\VC\include\intrin.h
18
Matrix Addition With Intrinsics

    void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
        for (int i = 0; i < a.m_rows; ++i)
            for (int j = 0; j < a.m_cols; j++)
                c[i][j] = a[i][j] + b[i][j];
    }

    #include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps

    // Processes four floats per iteration. _mm_load_ps and _mm_store_ps need
    // 16-byte-aligned addresses, and the loop needs m_cols to be a multiple of 4.
    void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
        __m128 aSIMD, bSIMD, cSIMD;
        for (int i = 0; i < a.m_rows; ++i)
            for (int j = 0; j < a.m_cols; j += 4) {
                aSIMD = _mm_load_ps(&a[i][j]);
                bSIMD = _mm_load_ps(&b[i][j]);
                cSIMD = _mm_add_ps(aSIMD, bSIMD);
                _mm_store_ps(&c[i][j], cSIMD);
            }
    }
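When alignment or a length that is a multiple of four cannot be guaranteed, the usual variation is to use unaligned loads and finish with a scalar loop. The following is a sketch of that variation on flat arrays rather than the slide's Matrix type; `add_floats` and the `HAVE_SSE` macro are hypothetical names, and the code falls back to plain scalar addition when SSE is not available.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// c[i] = a[i] + b[i] for any length n and any alignment.
void add_floats(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
#if HAVE_SSE
    // Main loop: four floats per iteration with unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail: the last n % 4 elements, or everything without SSE.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The unaligned forms trade some speed on older processors for safety; the aligned `_mm_load_ps` used on the slide remains the better choice when the data layout is under the programmer's control.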
19
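Spin-Lock With Intrinsics

    #include <intrin.h>    // _InterlockedCompareExchange
    #include <windows.h>   // Sleep

    // Atomically change lock from 0 (free) to 1 (held); yield and retry while it is held.
    void EnterSpinLock(volatile long &lock) {
        while (_InterlockedCompareExchange(&lock, 1, 0) != 0)
            Sleep(0);
    }

    // Release the lock by storing 0.
    void ExitSpinLock(volatile long &lock) {
        lock = 0;
    }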
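For readers on a current toolchain, the same compare-and-swap idea can be expressed portably with `std::atomic`. This swaps in a different mechanism from the one on the slide: `std::atomic` arrived with C++11 and is not available in Visual C++ 2005, and the class name `SpinLock` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
public:
    // Try once to move the state from 0 (free) to 1 (held).
    bool try_lock() {
        long expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    // Spin until the lock is acquired, yielding the time slice between attempts
    // (the role Sleep(0) plays on the slide).
    void lock() {
        while (!try_lock())
            std::this_thread::yield();
    }

    // Release the lock; the release store publishes writes made while holding it.
    void unlock() {
        assert(state_.load(std::memory_order_relaxed) == 1);
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<long> state_{0};
};
```

The acquire/release pair gives the same ordering guarantee that the interlocked intrinsic and the volatile store provide in the original: everything written inside the critical section is visible to the next thread that takes the lock.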
20
5. Avoid Double-Thunks

- Thunks are functions used to transition from managed to unmanaged (and vice-versa)

Diagram: managed code calls UnmanagedFunc(); the call passes through a managed-to-unmanaged thunk to reach UnmanagedFunc() { … } in unmanaged code.

- Thunks are a part of life… but sometimes we can have Double Thunks…
21
Double Thunking

- From managed to managed only
- Indirect calls
  - Function pointers and virtual functions
  - Is the callee a managed or an unmanaged entry point?
- __declspec(dllexport)
  - No current mechanism to export functions as managed entry points

Diagram: managed code calls ManagedFunc(); the call passes through a managed-to-unmanaged thunk and then an unmanaged-to-managed thunk before it reaches ManagedFunc() { … }.
22
How To Fix Double Thunking

- Indirect functions (including virtual functions)
  - Compile with /clr:pure
  - Use __clrcall
- __declspec(dllexport)
  - Wrap functions in a managed class, and then #using the object file
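For the function-pointer case, marking both the pointer type and the target with __clrcall tells the compiler that the callee has only a managed entry point, so an indirect call can go straight to it. The sketch below is an assumption-laden illustration rather than the demo from the talk: `MANAGED_CALL`, `Transform`, `twice`, and `apply` are hypothetical names, and the macro expands to __clrcall only when the file is compiled with /clr (detected through `_M_CEE`).

```cpp
#include <cassert>

#if defined(_M_CEE)
#define MANAGED_CALL __clrcall   // managed-only entry point under /clr
#else
#define MANAGED_CALL             // plain native build: no calling-convention change
#endif

// Pointer type that can only refer to functions with a managed entry point.
typedef int (MANAGED_CALL *Transform)(int);

int MANAGED_CALL twice(int v) { return 2 * v; }

// Indirect call through the pointer; with __clrcall no thunk pair is needed.
int apply(Transform f, int v) {
    assert(f != 0);
    return f(v);
}
```

The trade-off is that a __clrcall function can no longer be called from unmanaged code, so the decoration fits functions that are known to have managed callers only.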
23
Using __clrcall To Improve Performance
24
6. Speed App Startup Time

- No one likes to wait for an app to start up
- There is still some time associated with loading the CLR
- In some apps you may have non-CLR paths
- Only load the CLR when you need to
  - Use DelayLoading technology in the linker
  - If the EXE is compiled /clr then we will always load the CLR
25
Delay Loading The CLR
26
Summary Of Best Practices

Managed + Unmanaged
  1. Use PGO for unmanaged and WPO for managed…
  2. OpenMP can ease multithreaded development.
Unmanaged
  3. Make it easier for the compiler to track pointers.
  4. Intrinsics give the ability to get to the metal.
Managed
  5. Know where your double thunks are and fix them.
  6. Delay load the CLR to improve startup.

Large and ongoing investment in managed and unmanaged C++ code
27
Resources

- Visual C++ Dev Center
  - http://msdn.microsoft.com/visualc
  - This is the place to go for all our news and whitepapers
- Myself
  - kanggatl@microsoft.com
  - http://blogs.msdn.com/kangsu
- Must See Talks
  - TLN309 C++: Future Directions in Language Innovation with Herb Sutter (Friday 10:30am)
28
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.