PROFILE GUIDED OPTIMIZATION (POGO) ANKIT ASTHANA, PROGRAM MANAGER

Presentation transcript:

PROFILE GUIDED OPTIMIZATION (POGO) ANKIT ASTHANA, PROGRAM MANAGER

INDEX
- History
- What is Profile Guided Optimization (POGO)?
- POGO Build Process
- Steps to do POGO (Demo)
- POGO under the hood
- POGO case studies
- Questions

HISTORY The POGO that ships in Visual Studio started as a joint venture between the Visual C++ and Microsoft Research groups in the late 90s. POGO initially focused only on the Itanium platform. For almost an entire decade, even within Microsoft, only a few components were POGO'ized. POGO first shipped in 2005 on all pro-plus SKUs. Today POGO is a KEY optimization which provides a significant performance boost to a plethora of Microsoft products. ~ In a nutshell, POGO is a major constituent of the DNA of many Microsoft products ~

HISTORY ~ In a nutshell, POGO is a major constituent of the DNA of many Microsoft products ~ [Slide graphic: Microsoft products grouped into browsers, business analytics, and productivity software.] Directly or indirectly, you have used products which ship with POGO technology!

What is Profile Guided Optimization (POGO)? Really? No! But how many people here have used POGO?

What is Profile Guided Optimization (POGO)? Static analysis of the code leaves many open questions for the compiler:

if (a < b) foo(); else baz();            // How often is a < b?
for (i = 0; i < count; ++i) bar();       // What is the typical value of count?
switch (i) { case 1: ... case 2: ... }   // What is the typical value of i?
for (i = 0; i < count; ++i) (*p)(x, y);  // What is the typical value of pointer p?

What is Profile Guided Optimization (POGO)? PGO (Profile Guided Optimization) is a compiler optimization that leverages profile data, collected by running important or performance-centric user scenarios, to build an optimized version of the application. PGO has a significant advantage over traditional static optimizations because it is based on how the application is likely to behave in a production environment. This lets the optimizer optimize for speed on hot code paths (common user scenarios) and optimize for size on cold code paths (uncommon user scenarios), generating faster and smaller code and yielding significant performance gains. PGO can be used on traditional desktop applications and is currently supported on the x86 and x64 platforms. The mantra behind PGO is 'Faster and Smaller Code'.

POGO Build Process: INSTRUMENT, TRAIN, OPTIMIZE ~ Three steps to perform Profile Guided Optimization ~

POGO Build Process

TRIVIA: Does anyone know what (1), (2) and (3) in the build-process diagram do?

POGO Build Process (1) /GL: This flag tells the compiler to defer code generation until you link your program; at link time the linker calls back into the compiler to finish compilation. If you compile all your sources this way, the compiler optimizes your program as a whole rather than one source file at a time. Although /GL enables a plethora of optimizations, one major advantage is that with Link Time Code Generation we can inline functions from one source file (foo.obj) into callers defined in another source file (bar.obj).
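As a concrete illustration of that last point (the file and function names here are made up for this sketch, not taken from the talk):

// util.cpp -- helper defined in its own translation unit
int add_one(int x) { return x + 1; }

// main.cpp -- caller in a different translation unit
int add_one(int x);                  // only a declaration is visible here
int main() { return add_one(41); }

With per-file code generation the call in main.cpp remains an opaque external call, because add_one's body lives in util.obj. With /GL on both files and link-time code generation, the back end sees both bodies at link time and can inline add_one into main.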

POGO Build Process (2), (3) /LTCG: The linker invokes link-time code generation if it is passed a module that was compiled by using /GL. If you do not explicitly specify /LTCG when you pass /GL or MSIL modules to the linker, the linker eventually detects this and restarts the link using /LTCG; explicitly specify /LTCG when you pass /GL or MSIL modules to the linker for the fastest possible build performance. /LTCG:PGI specifies that the linker outputs a .pgd file in preparation for instrumented test runs of the application. /LTCG:PGO specifies that the linker uses the profile data created by running the instrumented binary to create an optimized image.
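Put together, a minimal PGO build looks roughly like the following command sequence. This is a sketch, assuming a hypothetical two-file project (util.cpp, main.cpp) producing app.exe; exact invocations vary by Visual Studio version.

rem 1) compile for whole-program optimization
cl /c /O2 /GL util.cpp main.cpp
rem 2) instrumented link; this also produces app.pgd next to the output
link /LTCG:PGI /OUT:app.exe util.obj main.obj
rem 3) run the training scenarios; each run writes an app!N.pgc file
app.exe
rem 4) re-link, letting the linker merge the .pgc data and optimize with it
link /LTCG:PGO /OUT:app.exe util.obj main.obj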

STEPS to do POGO (DEMO) TRIVIA: Does anyone know what an N-body simulation is all about?

STEPS to do POGO (DEMO) NBODY sample application: Speaking plainly, an N-body simulation is a simulation of a system of particles, usually under the influence of physical forces such as gravity.
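The demo source is not reproduced here, but the core of a naive N-body simulation looks roughly like the sketch below (not the actual sample; an O(n^2) all-pairs gravity step with a simple Euler update). Tight, branch- and loop-heavy code like this is exactly what the training runs profile.

#include <cmath>
#include <vector>

struct Body { double x, y, z, vx, vy, vz, mass; };

// One naive O(n^2) step: accumulate gravitational acceleration from
// every other body, then integrate velocities and positions.
void step(std::vector<Body>& bodies, double dt) {
    const double G = 6.674e-11;
    for (auto& a : bodies) {
        double ax = 0, ay = 0, az = 0;
        for (const auto& b : bodies) {
            if (&a == &b) continue;
            double dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
            double d2 = dx * dx + dy * dy + dz * dz + 1e-9;  // softening term
            double inv = 1.0 / std::sqrt(d2);
            double f = G * b.mass * inv * inv * inv;
            ax += f * dx; ay += f * dy; az += f * dz;
        }
        a.vx += ax * dt; a.vy += ay * dt; a.vz += az * dt;
    }
    for (auto& a : bodies) { a.x += a.vx * dt; a.y += a.vy * dt; a.z += a.vz * dt; }
}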

POGO Under the hood! Remember this?

if (a < b) foo(); else baz();            // How often is a < b?
for (i = 0; i < count; ++i) bar();       // What is the typical value of count?
switch (i) { case 1: ... case 2: ... }   // What is the typical value of i?
for (i = 0; i < count; ++i) (*p)(x, y);  // What is the typical value of pointer p?

POGO Under the hood - Instrument Phase The binary is instrumented with "probes" inserted into the code. There are two kinds of probes: 1. Count (simple/entry) probes, used to count the number of times a path is taken (e.g. function entry/exit). 2. Value probes, used to construct a histogram of values (e.g. a switch value or an indirect-call target address). To simplify the correlation process, some optimizations, such as the inliner, are turned off. The instrumented build is 1.5x to 2x slower than an optimized build. Side effects: an instrumented build of the application and an empty .pgd file.
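Conceptually the probes behave like the counters and histograms in the sketch below. This is only an illustration of the idea, not the literal instrumentation the compiler emits.

#include <cstdint>
#include <map>

// Hypothetical probe state for one instrumented function.
static uint64_t entry_count;                 // entry (count) probe
static uint64_t then_count, else_count;      // simple probes, one per branch arm
static std::map<int, uint64_t> value_hist;   // value probe: histogram of switch values

void foo(int i, int a, int b) {
    ++entry_count;                        // how many times is foo entered?
    if (a < b) { ++then_count; /* ... */ }
    else       { ++else_count; /* ... */ }
    ++value_hist[i];                      // which values does i actually take?
    switch (i) { case 1: /* ... */ break; case 2: /* ... */ break; default: break; }
}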

POGO Under the hood - Instrument Phase [Slide diagram: a function Foo containing a condition and a switch, annotated with an entry probe, simple probes 1 and 2, and a value probe; all probes feed a single data set.]

POGO Under the hood - Training Phase Run your training scenarios. During this phase the user runs the instrumented version of the application and exercises only the common, performance-centric user scenarios. Exercising these training scenarios results in the creation of (.pgc) files which contain the training data for each scenario. For example, for modern applications a common performance scenario is application startup; training this scenario results in the creation of appname!#.pgc files (where appname is the name of the running application and # is 1 + the number of appname!#.pgc files already in the directory). Side effects: a bunch of .pgc files.
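In command terms this is nothing more than running the instrumented binary once per scenario (continuing the hypothetical app.exe example from the build sketch above; the scenario switches are invented for illustration):

rem each run of the instrumented binary writes app!1.pgc, app!2.pgc, ...
app.exe --startup-scenario
app.exe --open-large-file-scenario
rem if the PGO tools are on the PATH, pgosweep can optionally flush counts
rem from a still-running process into a .pgc without exiting it:
rem   pgosweep app.exe app!3.pgc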

POGO Under the hood - Optimize Phase
- Full and partial inlining
- Function layout
- Speed and size decision
- Basic block layout
- Code separation
- Virtual call speculation
- Switch expansion
- Data separation
- Loop unrolling

POGO Under the hood - Optimize Phase CALL GRAPH PATH PROFILING The behavior of a function on one call path may be drastically different from its behavior on another. Call-path-specific information results in better inlining and optimization decisions. Let us take an example (next slide).

POGO Under the hood - Optimize Phase EXAMPLE: CALL GRAPH PATH PROFILING Assign path numbers bottom-up; the number of paths out of a function = the sum of its callees' path counts + 1. [Slide diagram: a call graph with nodes Start, A, B, C, D and Foo, where A calls B, C and D, and B, C and D each call Foo.] Path 1: Foo; Path 2: B; Path 3: B-Foo; Path 4: C; Path 5: C-Foo; Path 6: D; Path 7: D-Foo; Path 8: A; Path 9: A-B; Path 10: A-B-Foo; Path 11: A-C; Path 12: A-C-Foo; Path 13: A-D; Path 14: A-D-Foo. There are 7 paths that reach Foo.
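The formula is easy to check by hand here: paths(Foo) = 1; paths(B) = paths(C) = paths(D) = paths(Foo) + 1 = 2; paths(A) = 2 + 2 + 2 + 1 = 7; and 1 + 2 + 2 + 2 + 7 = 14 paths in total. A minimal sketch of the same bottom-up count (the CallGraph type is hypothetical, and an acyclic call graph is assumed):

#include <map>
#include <string>
#include <vector>

// Hypothetical representation: each function maps to the callees at its call sites.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Paths out of f = 1 (stopping at f) + sum over call sites of paths out of the callee.
int numPaths(const CallGraph& g, const std::string& f) {
    int paths = 1;
    auto it = g.find(f);
    if (it != g.end())
        for (const std::string& callee : it->second)
            paths += numPaths(g, callee);
    return paths;
}

// Matching the slide: with g = {{"A",{"B","C","D"}}, {"B",{"Foo"}}, {"C",{"Foo"}}, {"D",{"Foo"}}}
// numPaths(g, "Foo") == 1, numPaths(g, "B") == 2, numPaths(g, "A") == 7.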

POGO Under the hood - Optimize Phase INLINING [Slide diagram: a call graph over the functions foo, bat, goo, bar and baz.]

POGO Under the hood - Optimize Phase INLINING POGO uses call graph path profiling. [Slide diagram: the same call graph annotated with per-path call counts on each edge.]

POGO Under the hood - Optimize Phase INLINING Inlining decisions are made at each call site. [Slide diagram: the call graph after call-site-specific inlining decisions.] Call-site-specific, profile-directed inlining minimizes the code bloat due to inlining while still gaining performance where needed.

POGO Under the hood - Optimize Phase INLINE HEURISTICS The POGO inlining decision is made before layout, the speed/size decision, and all other optimizations.

POGO Under the hood - Optimize Phase SPEED AND SIZE The decision is based on the post-inliner dynamic instruction count: code segments with a higher dynamic instruction count are optimized for SPEED, and code segments with a lower dynamic instruction count are optimized for SIZE. [Slide diagram: the call graph of foo, bat, goo, bar and baz partitioned into speed-optimized and size-optimized segments.]

POGO Under the hood - Optimize Phase BLOCK LAYOUT Basic blocks are ordered so that the most frequent path falls through. [Slide diagram: a control-flow graph over blocks A, B, C and D shown in its default layout and in the optimized layout.] This gives better instruction cache locality.

POGO Under the hood - Optimize Phase LIVE AND PGO-DEAD CODE SEPARATION Dead functions/blocks are placed in a special section. [Slide diagram: the same control-flow graph in default and optimized layouts, with the scenario-dead block moved out of the hot code.] To minimize working set and improve code locality, code that is scenario-dead can be moved out of the way.

POGO Under the hood - Optimize Phase FUNCTION LAYOUT Function layout is based on the post-inliner, post-code-separation call graph and profile data. Only functions/segments in the live section are laid out; POGO-dead blocks are not included. The overall strategy is "closest is best": functions that are strongly connected are put together. A call is considered to achieve page locality if the callee is located in the same page.

POGO Under the hood - Optimize Phase EXAMPLE: FUNCTION LAYOUT [Slide diagram: a call graph over functions A, B, C, D and E with edge counts, and the candidate function orderings considered for layout.] In general, >70% page locality is achieved regardless of the component size.

POGO Under the hood - Optimize Phase SWITCH EXPANSION There are many ways to expand switches: linear search, jump table, binary search, etc. POGO collects the value of the switch expression, and the most frequent values are pulled out. For example, if the profile shows that i == 10 about 90% of the time:

// before
switch (i) { case 1: ... case 2: ... case 3: ... default: ... }

// after
if (i == 10) goto default;
switch (i) { case 1: ... case 2: ... case 3: ... default: ... }
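Written out as compilable source-to-source C++, the effect looks like the sketch below (the handlers are hypothetical; the real transformation happens inside the compiler back end, not in the source):

// Hypothetical handlers; only the call pattern matters for the example.
int handle1()            { return 1; }
int handle2()            { return 2; }
int handle3()            { return 3; }
int handleDefault(int i) { return i; }

// Before PGO: i == 10 falls into the default case even though the
// profile says it occurs about 90% of the time.
int dispatch(int i) {
    switch (i) {
        case 1:  return handle1();
        case 2:  return handle2();
        case 3:  return handle3();
        default: return handleDefault(i);
    }
}

// After switch expansion: the hot value is peeled out and tested first,
// so the common case bypasses the switch dispatch entirely.
int dispatch_pgo(int i) {
    if (i == 10) return handleDefault(i);
    switch (i) {
        case 1:  return handle1();
        case 2:  return handle2();
        case 3:  return handle3();
        default: return handleDefault(i);
    }
}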

POGO Under the hood - Optimize Phase VIRTUAL CALL SPECULATION

class Base { ... virtual void call(); };
class Foo : Base { ... void call(); };
class Bar : Base { ... void call(); };

// before
void Bar(Base *A) { ... while (true) { ... A->call(); ... } }

// after: the profile shows that the type of object A in function Bar is almost always Foo
void Bar(Base *A) { ... while (true) { ... if (type(A) == Foo) { /* inlined A->call() */ } else A->call(); ... } }
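A self-contained, compilable version of the same idea is sketched below. The class names are illustrative, and dynamic_cast merely stands in for the compiler's own cheap type test on the hot object.

struct Base {
    virtual void call() = 0;
    virtual ~Base() = default;
};
struct Foo : Base { void call() override { /* hot implementation */ } };
struct Bar : Base { void call() override { /* rare implementation */ } };

// Before PGO: every iteration pays for an indirect (virtual) call.
void runLoop(Base* a) {
    for (int i = 0; i < 1000000; ++i)
        a->call();
}

// After virtual call speculation: the profile says *a is almost always a Foo,
// so a guarded direct call is emitted and can be inlined.
void runLoop_pgo(Base* a) {
    Foo* likelyFoo = dynamic_cast<Foo*>(a);   // one type check, hoisted out of the loop
    for (int i = 0; i < 1000000; ++i) {
        if (likelyFoo)
            likelyFoo->Foo::call();           // direct, inlinable call
        else
            a->call();                        // fall back to the virtual call
    }
}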

POGO Under the hood - Optimize Phase During this phase the application is rebuilt one last time to generate the optimized version of the application. Behind the scenes, the (.pgc) training-data files are merged into the empty program database file (.pgd) created in the instrument phase. The compiler back end then uses this program database to make more intelligent optimization decisions, generating a highly optimized version of the application. Side effect: an optimized version of the application!

POGO CASE STUDIES: SPEC2K application size and speed

Metric                    Gobmk   Sjeng   Gcc     Perl    Povray
% of live functions       54%     62%     47%     39%     47%
% of speed functions      18%     2.9%    5%      2%      4.2%
% of inlined edge counts  50%     53%     25%     79%     65%
% of page locality        97%     75%     85%     98%     80%
% of speed gain           8.5%    6.6%    14.9%   36.9%   7.9%

[The slide also reported, for small, medium and large applications: LTCG size (MB), POGO size (MB), live section size, number of functions, number of LTCG inlines, and number of POGO inlines.]

QUESTIONS? ANKIT ASTHANA