1 Tips and Tricks: Visual C++ 2005 Optimization Best Practices Kang Su Gatlin TLNL04 Program Manager Visual C++ Microsoft Corporation.

Presentation transcript:

1 Tips and Tricks: Visual C++ 2005 Optimization Best Practices Kang Su Gatlin TLNL04 Program Manager Visual C++ Microsoft Corporation

2 6 Tips/Best Practices To Help Any C++ Dev Write Faster Code
Managed + Unmanaged: 1. Pick the right level of optimization 2. Add instant parallelism
Unmanaged: 3. Disambiguate memory 4. Use intrinsics
Managed: 5. Avoid double thunks 6. Speed app startup time

3 1. Pick the Right Level Of Optimization
Builds from the Lab: If at all possible, use Profile-Guided Optimization (only available unmanaged; more on this on the next slide). If not, use Whole Program Optimization (/GL), which is available managed and unmanaged. After that we recommend /O2 (optimize for speed) for hot functions/files and /O1 (optimize for size) for the rest. Other switches to use for maximum speed: /Gy, /OPT:REF,ICF (good size win on 64-bit), /fp:fast, /arch:SSE2 (will not work on downlevel architectures).
Debug Symbols Are NOT Only for Debug Builds: Executable size and codegen are NOT affected by this; it's all in the PDB file. Always building debug symbols will make life easier. Make sure you use /OPT:REF,ICF, don't use /ZI, and use /INCREMENTAL:NO.

4 Next-Gen Optimizations Today
Profile Guided Optimization: the next level beyond Whole Program Optimization. Static compilers can't answer everything. We get 20-50% improvement on large server applications that we ship. Current support is unmanaged only.
  if (a < b)
      foo();
  else
      baz();
  for (i = 0; i < count; ++i)
      bar();
Should we inline foo()? Should we unroll this loop?

5 Profile Guided Optimization
  Source -> compile with /GL -> object files
  Object files -> link with /LTCG:PGI -> instrumented image + PGD file
  Instrumented image + scenarios -> output profile data
  Object files + profile data -> link with /LTCG:PGO -> optimized image
There is throughput impact.

6 What PGO Does And Does Not Do PGO does Optimizations galore Speed/Size Determination Switch expansion Better inlining decisions Function/basic block layout Virtual call speculation Partial inlining Optimize within a single image Merging and weighting of multiple scenarios PGO does not No probing assembly language (inline or otherwise) No optimizations across DLLs No data layout optimization
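To make "virtual call speculation" concrete, here is a hand-written sketch of the idea, not what the compiler actually emits. The `Shape`, `Triangle`, and `Square` types and the function `sides_speculated` are hypothetical, and the sketch assumes a profile showed that nearly every object reaching this call site is a `Square`.

```cpp
#include <typeinfo>

// Hypothetical class hierarchy used only for this illustration.
struct Shape {
    virtual ~Shape() {}
    virtual int sides() const = 0;
};
struct Triangle : Shape {
    int sides() const override { return 3; }
};
struct Square : Shape {
    int sides() const override { return 4; }
};

// Hand-written equivalent of a speculated virtual call, assuming the
// profile says the dynamic type here is almost always Square.
int sides_speculated(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        // Hot path: the qualified call is resolved at compile time,
        // so it can be inlined.
        return static_cast<const Square&>(s).Square::sides();
    }
    // Cold path: ordinary virtual dispatch for every other type.
    return s.sides();
}
```

The result is the same on both paths; only the cost of the common case changes, which is why profile data is needed to decide which type to guess.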

7 PGO Compilation in Visual C++ 2005

8 2. Add Instant Parallelism
Just add OpenMP pragmas! OpenMP is a popular API for multithreaded programs, born from the HPC community. It consists of a set of simple #pragmas and runtime routines. Most valuable for parallelizing large loops with no loop-carried dependencies. Visual C++ 2005 implements the full OpenMP 2.0 standard, with full unmanaged and /clr managed support. See the PDC issue of MSDN Magazine for an article on OpenMP.

9 OpenMP Parallelization
Parallel loop:
  void test(int first, int last) {
      #pragma omp parallel for
      for (int i = first; i <= last; ++i) {
          a[i] = b[i] * c[i];
      }
  }
Each iteration is independent; order of execution does not matter.
Serial code:
  if (x < 0)
      a = foo(x);
  else
      a = x + 5;
  b = bat(y);
  c = baz(x + y);
  j = a*b+c;
Parallel sections:
  #pragma omp parallel sections
  {
      #pragma omp section
      if (x < 0)
          a = foo(x);
      else
          a = x + 5;
      #pragma omp section
      b = bat(y);
      #pragma omp section
      c = baz(x + y);
  }
  j = a*b+c;
Assignments to 'a', 'b', and 'c' are independent.
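A minimal, self-contained sketch of the parallel-loop pattern from this slide. The function name `multiply_arrays` and the use of `std::vector` are assumptions made for illustration, not part of the talk. If the compiler is not run with an OpenMP switch (`/openmp` in Visual C++), the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Element-wise product: a[i] = b[i] * c[i].
// Each iteration writes a different element, so iterations are
// independent and may run in any order on any thread.
std::vector<double> multiply_arrays(const std::vector<double>& b,
                                    const std::vector<double>& c) {
    const std::size_t count = b.size() < c.size() ? b.size() : c.size();
    const int n = static_cast<int>(count);
    std::vector<double> a(count);

    // OpenMP 2.0 requires a signed integer loop index.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
    }
    return a;
}
```

Passing two vectors of equal length returns their element-wise product; if the lengths differ, only the common prefix is processed.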

10 OpenMP Case Study
Panorama Factory by Smoky City Design: top-rated image stitching application. Added multithreading with OpenMP in Visual C++ 2005 Beta 2. Used 102 instances of #pragma omp *. Extremely impressive results… Stitching together several large images on a dual-processor, dual-core x64 machine.

11

12 3. Disambiguate Memory
Programmer knows a and b never overlap, but the compiler cannot prove it, so b[0] and b[1] are reloaded after every store.
  void copy8(int * a, int * b) {
      a[0] = b[0];
      a[1] = b[1];
      a[2] = b[0];
      a[3] = b[1];
      a[4] = b[0];
      a[5] = b[1];
      a[6] = b[0];
      a[7] = b[1];
  }
Generated code (ecx = a, eax = b):
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+4], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+8], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+12], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+16], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+20], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+24], edx
  mov eax, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+28], eax

13 Aliasing And Memory Disambiguation Aliasing is when one object can be used as an alias to another object If compiler can NOT prove that an object does not alias then it MUST assume it can How can we address some of these problems? 1. Avoid taking address of an object. 2. Avoid taking address of a function. 3. Avoid using global variables. Statics are preferable. 4. Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.

14 __restrict – A compiler hint
Programmer knows a and b don't overlap.
  void copy8(int * __restrict a, int * b) {
      a[0] = b[0];
      a[1] = b[1];
      a[2] = b[0];
      a[3] = b[1];
      a[4] = b[0];
      a[5] = b[1];
      a[6] = b[0];
      a[7] = b[1];
  }
Generated code (eax = a, edx = b):
  mov ecx, DWORD PTR [edx]
  mov edx, DWORD PTR [edx+4]
  mov DWORD PTR [eax], ecx
  mov DWORD PTR [eax+4], edx
  mov DWORD PTR [eax+8], ecx
  mov DWORD PTR [eax+12], edx
  mov DWORD PTR [eax+16], ecx
  mov DWORD PTR [eax+20], edx
  mov DWORD PTR [eax+24], ecx
  mov DWORD PTR [eax+28], edx
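A compilable sketch of the same idea, under the assumption that both pointers are marked. The name `copy8_restrict` and the `const` on the source are illustrative choices, not from the slide. The `__restrict` keyword is accepted by Visual C++, GCC, and Clang.

```cpp
// Variant of the slide's copy8. __restrict is the programmer's promise
// that dst and src never overlap.
void copy8_restrict(int* __restrict dst, const int* __restrict src) {
    // Because no store through dst can change src[0] or src[1],
    // the compiler is free to load each of them once and reuse the
    // value for all four stores.
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[0];
    dst[3] = src[1];
    dst[4] = src[0];
    dst[5] = src[1];
    dst[6] = src[0];
    dst[7] = src[1];
}
```

Calling this with overlapping buffers breaks the promise, and the result is then undefined, so the keyword belongs only where the caller contract guarantees separation.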

15 __declspec(restrict)
Tells the compiler that the function returns an unaliased pointer. Only applicable to functions. This is a promise the programmer makes to the compiler; if this promise is violated, the compiler may generate bad code. The CRT uses this decoration, e.g., malloc, calloc, etc…
  __declspec(restrict) void *malloc(int size);

16 __declspec(noalias) Tells the compiler that the function is a semi-pure function Only references locals, arguments, and first-level indirections of arguments This is a promise the programmer makes to the compiler If this promise is violated the compiler may generate bad code __declspec(noalias) void isElement(Tree *t, Element e);
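A small sketch of how the two decorations from the last two slides might be applied. Both are Visual C++ extensions, so the macros `RETURNS_UNALIASED` and `NO_ALIAS` below are hypothetical wrappers that expand to nothing on other compilers; `make_buffer` and `sum` are likewise invented for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical portability macros: the decorations exist only in
// Visual C++, so other compilers see plain functions.
#if defined(_MSC_VER)
#define RETURNS_UNALIASED __declspec(restrict)
#define NO_ALIAS __declspec(noalias)
#else
#define RETURNS_UNALIASED
#define NO_ALIAS
#endif

// Promise: the returned pointer does not alias any other live pointer
// (it comes straight from calloc).
RETURNS_UNALIASED int* make_buffer(std::size_t count) {
    return static_cast<int*>(std::calloc(count, sizeof(int)));
}

// Promise: touches only locals, its arguments, and memory reached
// through one level of indirection from them. No globals are used.
NO_ALIAS int sum(const int* values, std::size_t count) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

The caller owns the buffer and releases it with `std::free`. As the slides stress, the compiler trusts these promises without checking them.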

17 4. Use Intrinsics
Simply represented as functions to the programmer, e.g. _mm_load_pd(double const*); compilers understand these as primitives. Allows the user to get right at the hardware without using asm. Covers almost anything you can do in assembly: interlocked operations, memory fences, cache control, SIMD. The key to things such as vectorization and lock-free programming. You can use intrinsics in a file compiled /clr, but the function(s) will be compiled as unmanaged. Intrinsics are consumed by PGO and our optimizer; inline asm is not. Documentation for intrinsics is much better in Visual C++ 2005; see [Visual Studio 8]\VC\include\intrin.h

18 Matrix Addition With Intrinsics
Scalar version:
  void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
      for (int i = 0; i < a.m_rows; ++i)
          for (int j = 0; j < a.m_cols; j++)
              c[i][j] = a[i][j] + b[i][j];
  }
SSE version (assumes m_cols is a multiple of 4 and rows are 16-byte aligned):
  #include <xmmintrin.h>
  void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
      __m128 aSIMD, bSIMD, cSIMD;
      for (int i = 0; i < a.m_rows; ++i)
          for (int j = 0; j < a.m_cols; j += 4) {
              aSIMD = _mm_load_ps(&a[i][j]);
              bSIMD = _mm_load_ps(&b[i][j]);
              cSIMD = _mm_add_ps(aSIMD, bSIMD);
              _mm_store_ps(&c[i][j], cSIMD);
          }
  }
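The slide's `Matrix` type is not shown, so here is a self-contained sketch of the same SSE pattern over flat `float` arrays. The function name `add_floats_sse` is an assumption, and the sketch deliberately swaps in the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` plus a scalar tail loop so that it works for any length and alignment; the slide's `_mm_load_ps`/`_mm_store_ps` require 16-byte-aligned addresses. It targets x86/x64 only.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Computes c[i] = a[i] + b[i] for n floats, four at a time.
void add_floats_sse(const float* a, const float* b, float* c,
                    std::size_t n) {
    std::size_t i = 0;

    // Vector loop: one 128-bit add handles four floats.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    // Scalar tail: the 0 to 3 elements left over.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

When alignment is guaranteed, for example by allocating rows with `_mm_malloc(bytes, 16)`, the aligned load and store from the slide can be used instead.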

19 Spin-Lock With Intrinsics
  #include <intrin.h>
  #include <windows.h>
  void EnterSpinLock(volatile long &lock) {
      // Atomically set lock to 1 if it is 0; retry while another thread holds it.
      while (_InterlockedCompareExchange(&lock, 1, 0) != 0)
          Sleep(0);
  }
  void ExitSpinLock(volatile long &lock) {
      lock = 0;
  }
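The slide's version depends on Windows headers. As a portable sketch of the same compare-and-swap loop, the code below swaps in standard C++11 facilities: `std::atomic<long>::compare_exchange_strong` stands in for `_InterlockedCompareExchange`, and `std::this_thread::yield` stands in for `Sleep(0)`. The function names are illustrative, not from the talk.

```cpp
#include <atomic>
#include <thread>

// One attempt to move the lock from 0 (free) to 1 (held).
// Returns true if this caller now owns the lock.
bool try_enter_spin_lock(std::atomic<long>& lock) {
    long expected = 0;
    return lock.compare_exchange_strong(expected, 1);
}

// Spin until the lock is acquired, yielding the time slice
// between attempts.
void enter_spin_lock(std::atomic<long>& lock) {
    while (!try_enter_spin_lock(lock)) {
        std::this_thread::yield();
    }
}

// Release the lock so another thread's compare-exchange can succeed.
void exit_spin_lock(std::atomic<long>& lock) {
    lock.store(0);
}
```

A caller brackets its critical section with `enter_spin_lock(lock)` and `exit_spin_lock(lock)` on a shared `std::atomic<long>` initialized to 0.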

20 5. Avoid Double-Thunks
Thunks are functions used to transition from managed to unmanaged code (and vice versa).
  Managed code: UnmanagedFunc(); -> managed-to-unmanaged thunk -> Unmanaged code: UnmanagedFunc() { … }
Thunks are a part of life… but sometimes we can have double thunks…

21 Double Thunking
Happens from managed to managed only.
Indirect calls (function pointers and virtual functions): is the callee a managed or an unmanaged entry point?
__declspec(dllexport): no current mechanism to export functions as managed entry points.
  Managed code: ManagedFunc(); -> managed-to-unmanaged thunk -> unmanaged-to-managed thunk -> ManagedFunc() { … }

22 How To Fix Double Thunking
Indirect functions (including virtual funcs): compile with /clr:pure, or use __clrcall.
__declspec(dllexport): wrap functions in a managed class, and then #using the object file.

23 Using __clrcall To Improve Performance

24 6. Speed App Startup Time No one likes to wait for an app to start-up There is still some time associated with loading CLR In some apps you may have non-CLR paths Only load the CLR when you need to Use DelayLoading technology in the linker If the EXE is compiled /clr then we will always load the CLR

25 Delay Loading The CLR

26 Summary Of Best Practices Managed + Unmanaged 1. Use PGO for unmanaged and WPO for managed… 2. OpenMP can ease multithreaded development. Unmanaged 3. Make it easier for the compiler to track pointers. 4. Intrinsics give the ability to get to the metal. Managed 5. Know where your double thunks are and fix. 6. Delay load the CLR to improve startup. Large and ongoing investment in managed and unmanaged C++ code

27 Resources Visual C++ Dev Center This is the place to go for all our news and whitepapers Myself Must See Talks TLN309 C++: Future Directions in Language Innovation with Herb Sutter (Friday 10:30am)

28 © 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.