Performance Optimizations in Dyninst


Performance Optimizations in Dyninst
Andrew Bernat, Matthew LeGendre

Instrumentation is Complicated

  User perspective: “Insert some new code here, here, and here.”
  Dyninst’s perspective:
    Relocation – Move code to make space for instrumentation
    Infrastructure – Save/restore machine state
    Instrumentation – Generate user-provided code

Sources of Overhead

  Relocation
    Extra jumps
    Unnecessary emulation
    Traps
  Infrastructure
    Extra register saves
    Tramp guards
  Instrumentation
    Inefficient register usage
    Poor code generation
  Optimizations
    Inlining instrumentation
    Compiler optimizations of generated code
  Overhead: 665% -> 32%

History

  Enable fast (and frequent) insertion and removal of code
    “Linked list” model
    Insert/remove by patching branches
  Model has evolved over time
    Long-lived instrumentation (particularly with static rewriter)
    Focus on speed of execution instead of speed of insertion

Outlined Instrumentation

  [Diagram: three columns labeled Original Code, Relocated Code, and Instrumentation/Infrastructure. Branches lead from the original code to relocated functions and relocated blocks, and from there to basetramps and their minitramps.]

Outlined System

  Fast insertion and removal
    Simple to update
    Original serves as a “handle”
  Reduced code relocation
    Block or instruction
  Hard to optimize
    New code can be inserted without warning
  Poor code locality

Partial Inlining

  [Diagram: three columns labeled Original Code, Relocated Code, and Instrumentation & Infrastructure. Branches lead from the original code to relocated functions and relocated blocks; each basetramp and its minitramps are shown together.]

Full Inlining

  [Diagram: two columns labeled Original Code and Relocated Code & Instrumentation. Each original function branches to its relocated function, whose relocated blocks hold the instrumentation inline. One of the remaining branches is marked with a question mark.]

Branch Reduction

  Inlining removed three levels of branching
    Function to block to basetramp to minitramp
  One level is left
    Original function to relocated copy
  Can we remove this branch as well?
    Identify and rewrite calls to relocated functions
    Regenerate whenever target is moved
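
The call-rewriting idea on this slide can be sketched as a lookup from original entry points to relocated ones. This is only an illustration under assumed data structures: `retarget_calls`, the two dictionaries, and all addresses are hypothetical and are not Dyninst's internal representation.

```python
def retarget_calls(call_sites, relocated):
    """Point each call at the relocated copy of its target, if one exists.

    call_sites: dict mapping call-site address -> current target address.
    relocated:  dict mapping a function's current entry -> its new entry.
    Calls to functions that were not moved are left unchanged.
    """
    return {site: relocated.get(target, target)
            for site, target in call_sites.items()}


# Hypothetical addresses, for illustration only.
calls = {0x1000: 0x2000, 0x1010: 0x3000}
moved = {0x2000: 0x9000}   # only the function at 0x2000 was relocated
calls = retarget_calls(calls, moved)
print({hex(site): hex(target) for site, target in calls.items()})
```

If a relocated function is moved again, the same pass has to be rerun with the new mapping, which corresponds to the "regenerate whenever target is moved" requirement.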

Optimizing BaseTramps and MiniTramps

  DyninstAPI contains a built-in compiler
    Converts ASTs to machine code
    Used for BaseTramps and MiniTramps
    Designed to be cross-platform (x86, x86_64, ppc32, ppc64, IA-64, SPARC)
  Build new optimizations into compiler
    Some optimizations from classic compilers
    Some optimizations are instrumentation specific

Optimizing Code Generation

  Register Saves
    pusha
    pushf
  Stack frame (Setup)
    push %ebp
    mov %esp,%ebp
    sub $128,%esp
  Trampoline Guard (Check)
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
  Instrumentation
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 0x805a494,%ebx
    mov 4(%ebp),%eax
    add %eax,%ebx
    mov %ebx,0x805a494
  Trampoline Guard (Restore)
    mov $0x1,(%eax)
  Stack frame (Clean)
    done: leave
  Register Restores
    popf
    popa

  Problems identified:
    Saving too many registers
    Stack frame unnecessary
    Tramp guards unnecessary
    Extraneous register usage
    “Virtual” registers unnecessary
    Inefficient instrumentation
    Recalculating old value

Register Saves

  Before:
    pusha
    pushf
  After:
    push %eax
    lahf
  Calculate live registers at instrumentation point
  Calculate registers used by instrumentation
  Save intersection
  Use more efficient flag saves
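
The "save intersection" rule can be sketched as a plain set intersection. This is a minimal illustration, not Dyninst's implementation; `registers_to_save` and the example register sets are hypothetical.

```python
def registers_to_save(live_at_point, used_by_instrumentation):
    """Return the registers that must be saved around the instrumentation.

    A register needs saving only if the application still needs its value
    (it is live at the instrumentation point) and the instrumentation
    overwrites it.
    """
    return sorted(set(live_at_point) & set(used_by_instrumentation))


# Hypothetical example: three registers are live, the snippet clobbers two.
live = {"eax", "ecx", "esi"}
used = {"eax", "ebx"}
print(registers_to_save(live, used))
```

Here only eax is both live and clobbered, so a single push replaces the full pusha. ebx is overwritten but dead, and ecx and esi are live but untouched.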

Virtual Registers

  Instrumentation, before:
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 4(%ebp),%eax
  Instrumentation, after:
    mov $1,%eax
  “Virtual Registers” were stack slots on x86
    Load from virtual register to eax
    Operate on eax
    Store from eax to virtual register
  Now use a real register allocation algorithm, with spilling
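
The slides do not say which allocation algorithm Dyninst adopted. As an illustration of "real registers first, stack slots only when needed", here is a simplified linear-scan sketch; the function `allocate`, the interval format, and the value names are all assumptions made for this example.

```python
def allocate(intervals, registers):
    """Assign each value a register, or a stack slot if none is free.

    intervals: dict name -> (start, end), inclusive instruction indices.
    registers: list of available register names.
    Returns a dict name -> location string.
    Simplification: when no register is free, the new value is spilled;
    a full linear scan would consider spilling the longest-lived value.
    """
    free = list(registers)
    active = []       # (end, name) for values currently held in a register
    location = {}
    next_slot = 0
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers whose values are dead before this one starts.
        for old_end, old_name in list(active):
            if old_end < start:
                active.remove((old_end, old_name))
                free.append(location[old_name])
        if free:
            location[name] = free.pop(0)
            active.append((end, name))
        else:
            # No register left: spill this value to a stack slot.
            location[name] = f"stack[{next_slot}]"
            next_slot += 1
    return location


# Hypothetical live ranges with only two registers available.
ranges = {"a": (0, 3), "b": (1, 2), "c": (2, 5), "d": (4, 6)}
print(allocate(ranges, ["eax", "ebx"]))
```

Only `c` ends up in a stack slot, because both registers are occupied when it becomes live. Under the old virtual-register scheme every value would have gone through the stack.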

AST to Machine Code Compilation

  AST: 0x805a494 = 0x805a494 + 1
  One instruction per AST node:
    mov $1,%eax
    mov 0x805a494,%ebx
    add %eax,%ebx
    mov $0x805a494,%ecx
    mov %ebx,(%ecx)
  Optimized:
    incl 0x805a494
  Each AST node is converted to an instruction
    Not optimal on CISC systems
  Recognize sequences of ASTs, emit optimized code
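
A sketch of the pattern-recognition step: the code generator checks for the shape "x = x + 1" before falling back to node-by-node emission. The tuple encoding of the AST and the function `emit` are hypothetical and chosen only for this example; Dyninst's AST classes are different.

```python
def emit(ast):
    """Emit AT&T-syntax x86 for a tiny AST, recognizing `x = x + 1`."""
    op, dest, value = ast
    if op != "assign" or dest[0] != "addr":
        raise ValueError("this sketch handles only assignments to an address")
    addr = dest[1]
    # Pattern: addr = addr + 1 becomes one read-modify-write instruction.
    if value == ("plus", ("addr", addr), ("const", 1)):
        return [f"incl {addr:#x}"]
    # Fallback: one instruction per AST node, as in the unoptimized listing.
    if value[0] == "plus" and value[1][0] == "addr" and value[2][0] == "const":
        return [
            f"mov ${value[2][1]},%eax",
            f"mov {value[1][1]:#x},%ebx",
            "add %eax,%ebx",
            f"mov ${addr:#x},%ecx",
            "mov %ebx,(%ecx)",
        ]
    raise ValueError("unsupported AST shape")


counter = ("addr", 0x805a494)
increment = ("assign", counter, ("plus", counter, ("const", 1)))
print(emit(increment))
```

The recognized pattern relies on x86 being able to operate directly on memory, which is why the node-by-node strategy is described as not optimal on CISC systems.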

Optional Infrastructure

  Tramp Guard
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
  Stack Frame
    push %ebp
    mov %esp,%ebp
    sub $0x32,%esp
    ...
  FP Saves
    mov %esp,%eax
    sub $512,%esp
    and $0xfffffff0,%esp
    fxsave (%esp)
    push %eax
  Stack Shift
    lea 0x128(%rsp),%rsp
  Some tramp infrastructure is not always required. E.g.,
    Stack frame only needed for register spilling
    Tramp guard only needed for function calls
  Save only necessary infrastructure
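
The two rules stated on this slide can be written as a small selection function. This sketch covers only the stack frame and the tramp guard, since the slide gives no conditions for FP saves or the stack shift; `select_infrastructure` and its parameters are hypothetical names.

```python
def select_infrastructure(needs_spill_slots, makes_function_call):
    """Pick the trampoline pieces a snippet needs, per the two stated rules."""
    pieces = []
    if needs_spill_slots:
        pieces.append("stack frame")   # the frame exists only to hold spills
    if makes_function_call:
        pieces.append("tramp guard")   # the guard only matters around calls
    return pieces


# A counter increment neither spills nor calls a function.
print(select_infrastructure(needs_spill_slots=False, makes_function_call=False))
```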

Fixed Point Code Generation

  Optimizations may be interlinked. E.g.,
    Removing code may leave registers unused
    Removing unused registers eliminates saves
    Eliminating saves removes stack access
    Removing stack accesses may eliminate stack shift
  Typical code generation requires 2 passes
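
The chain of dependent optimizations can be modeled as repeating a generation pass until nothing changes. The following is a toy model under assumed names (`run_to_fixed_point`, `step`, and the three-field state are hypothetical); its pass count is a property of the toy and says nothing about how many passes Dyninst needs.

```python
def run_to_fixed_point(step, state, max_passes=10):
    """Apply `step` until the state stops changing; return (state, passes)."""
    for passes in range(1, max_passes + 1):
        new_state = step(state)
        if new_state == state:
            return state, passes
        state = new_state
    raise RuntimeError("code generation did not converge")


def step(state):
    """One pass over a toy trampoline description.

    state is (used_regs, saved_regs, stack_shift). Each rule reads the
    state left by the previous pass, so a change made by one rule becomes
    visible to the other rule only on the next pass.
    """
    used_regs, saved_regs, stack_shift = state
    new_shift = stack_shift and bool(saved_regs)  # shift protects stack writes
    new_saved = saved_regs & used_regs            # save only registers in use
    return (used_regs, new_saved, new_shift)


# The instrumentation no longer uses any register, but the conservative
# starting point still saves two of them and shifts the stack.
start = (frozenset(), frozenset({"eax", "ebx"}), True)
final, passes = run_to_fixed_point(step, start)
print(final, passes)
```

The first pass drops the saves, the second drops the stack shift that only the saves required, and a final pass confirms that nothing else changes.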

Optimizing Code Generation

  [Before/after comparison. The "before" column repeats the unoptimized trampoline listing with its sections: Register Saves, Stack frame (Setup), Trampoline Guard (Check), Instrumentation, Trampoline Guard (Restore), Stack frame (Clean), Register Restores. In the "after" column the instrumentation is replaced by:]
    incl 0x805a494

Results

  Basic block instrumentation on ‘go’ from SPEC2000

  Instrumented run time (base: 12.25s)
             Old              Optimizations    Optimizations + Inlining
    Dynamic  70.96s (479%)    25.18s (105%)    NA
    Static   93.72s (665%)    24.10s (97%)     16.21s (32%)

  Instrumentation time
             Original         Optimizations    Optimizations + Inlining
    Dynamic  17.43s           3.22s            NA
    Static   2.77s            2.12s            4.81s
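
The percentages in the run-time table read as overhead relative to the 12.25s base, that is (instrumented - base) / base. A short check of the static row, assuming that interpretation; `overhead_percent` is a hypothetical helper, and because the slide's figures were presumably computed from unrounded times, recomputing from the rounded times shown can differ by a point.

```python
def overhead_percent(instrumented_s, base_s):
    """Slowdown relative to the uninstrumented run, as a percentage."""
    return (instrumented_s - base_s) / base_s * 100


BASE = 12.25  # uninstrumented run time in seconds, from the results table
for label, seconds in [("old", 93.72),
                       ("optimizations", 24.10),
                       ("optimizations + inlining", 16.21)]:
    print(f"static, {label}: {overhead_percent(seconds, BASE):.0f}%")
```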

Conclusions

  Optimizations in DyninstAPI instrumentation
    Inline instrumentation levels
    Generate more efficient code
  Significant performance gains
    Instrumentation code runs faster
    More time spent generating instrumentation

Questions?