Download presentation
Presentation is loading. Please wait.
Published byDelphia Pearson Modified over 9 years ago
1
“Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 “Dynamo: A Transparent Dynamic Optimization System ” V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000 Presented by: David Carrillo CS 295: Run-time Techniques for Efficient and Reliable Program Execution. Guoqing Xu
2
Motivation Recent developments decrease effectiveness of compiler techniques Software shipped as collection of DLLs not single executable Object oriented languages use delayed binding Dynamic code generation System vendors have to rely on Software Vendors to avail powerful static compiler optimizations Still trend is to offload complexity from hardware to compiler CISC vs. RISC Superscalar vs. VLIW (EPIC)
3
Motivation (contd.) Legacy binaries may not take advantages of modern compilers/chip innovations Two opposite developments Static compiler taking on increased performance burden Obstacles to static compiler analysis increasing Choices 1. Go for complex specialized compilers 2. Analyze code dynamically and do optimizations (Run-time optimization) Dynamo’s Choice
4
Overview Dynamo is a dynamic “native-to-native” optimization system Input an executing native instruction stream No involvement of Software Vendor Transparent operation Input instruction stream to Dynamo may be Statically created binary Dynamically generated binary (i.e. JIT, emulator) Present work focuses on static binary case
5
Overview (contd.) Dynamo works like a Virtual Machine Can observe execution behavior Tries to identify hot instruction sequences (traces) After identification of hot traces Generates optimized fragments for those traces Stored into fragment cache (a software code cache) Links successive traces (trace tree). Subsequent occurrences of a hot trace Stops interpreting Executes corresponding fragment from “fragment cache” Upon end of fragment, Dynamo resumes interpreting code
6
Overview (contd.) Overheads associated Interpreting code is very slow Identification of hot traces and keeping track Generation of optimized fragments Advantages of approach Working set of application gradually materializes into cache Optimized using run-time information (expected to be much faster than original code) Beneficial if Performance benefits from reuse of fragments offset overheads Assumes that major working set of application is small
7
Dynamo: System Architecture
8
Identification of Hot Traces Uses the MRET (Most Recently Executed Trail) scheme Interpreter associates counters with possible start-of-trace points Increments whenever these points executed Tries to do minimal profiling Assumes that following instruction are also hot Estimates the end-of-trace point using some conditions When count exceeds a threshold Stops interpreting Records instruction sequence till end-of-trace condition This gives the hot trace
9
Trace Optimization and Fragment Generation Selected hot trace converted to intermediate representation(IR) IR close to underlying machine instruction set Performs optimizations on the IR Removes unconditional branches Other optimizations on branch instructions For fragment generation Trace might be split into multiple fragments due to branches Register allocator tries to preserve original mappings Stubs are inserted for exit from fragment Fragment linking done to couple contiguous fragments (highest performance gain)
10
The Trace A BC D E F G HI J call return * Speculative Trace Construction A B* C D E F* G H* I J Back to the interpreter From the interpreter
11
Effects of Fragment Construction Hot path is linearized Eliminates branches Creates superblock Applies local optimization Cold-path branches remain Targets are stubs Send control to interpreter Path includes call & return Jumps not branches Interprocedural effects Indirect branches Speculate on target Fall back on hash table of branch targets A B* C D E F* G H* I J Back to the interpreter From the interpreter
12
Sources of Improvement Many small things contribute to make Dynamo profitable Linearization eliminates branches Improves TLB & I-cache behavior 2 passes of local optimization Redundancy elimination, copy propagation, constant folding, simple strength reduction, loop unrolling, loop invariant code motion, redundant load removal Keep in mind that “local” includes interprocedural in Dynamo Focuses on Optimization hard to do on compilation, at least without profile information. Aims to not compete with static compiler. Nevertheless, Engineering detail makes a difference
13
Fragment Linking A BC D E F G HI J call return What happens if another path becomes hot? (Say ABDGIJE) * Speculative Trace Construction Block B meets “start of trace” condition (exit from fragment) A B* C D E F* G H* I J Back to the interpreter From the interpreter
14
Fragment Linking When counter reaches hot: Builds a fragment A B* C D E F* G H* I J Back to the interpreter From the interpreter
15
Fragment Linking When counter reaches hot: Builds a fragment A B* C D E F* G H* I J Back to the interpreter From the interpreter B D E F* G H* I J A*
16
Fragment Linking When counter reaches hot: Builds a fragment Links exit A B to new fragment Links exit E A to old fragment A C D E F* G H* I J Back to the interpreter From the interpreter B D E F* G H* I J
17
Fragment Linking When counter reaches hot: Builds a fragment Links exit A B to new fragment Links exit E A to old fragment What if B* held redundant op? Have L IVE on entry to B Can test L IVE sets for both exit from A & entry to B May show op is dead … Link-time RE A C D E F* G H* I J Back to the interpreter From the interpreter B D E F* G H* I J
18
Fragment Cache Management Need to flush out cold entries regularly Mechanism for replacement can’t take too much overhead Schemes such as LRU might not be feasible Dynamo’s approach Looks for sharp increase in fragment creation range Indicates significant change in program working set Complete fragment cache flush triggered
19
An example justifying this assumption
20
Signal handling in Dynamo Signals/interrupts could be of two types Asynchronous Synchronous Asynchronous signals queued at arrival time and processed at end of current fragment’s execution Synchronous signal arrival on optimized code can create problems Can’t be postponed Might assume original program context which no longer exists due to optimizations
21
Signal handling in Dynamo (contd.) Conservative Optimizations Constant propagation, constant folding, copy propagation, etc Allow precise signal context to be reconstructed Aggressive Optimizations Dead code removal, code sinking, loop invariant code motion, etc. Can’t guarantee that synchronous signals can be handled properly
22
Results Figure. Speedup of +O2 optimized PA-8000 binaries running on Dynamo, relative to the identical binaries running standalone. The contributions from dynamo inlining due to trace selection, conservative trace optimization and aggressive trace optimization are shown. Dynamo bails out to direct native execution on go and vortex.
23
Results Dynamo runs faster than native code in the average case Overhead averages 1.5% (Spec95 on HP PA-8000) Most significant source of speedup is trace-selection. Most of the overhead is spent in trace selection Interpret, bump target counters, test against threshold Optimization & code generation have minor cost
24
Results Performance gain reduces V.S. highly optimized binaries.
25
Explanation of results Gains due to trace selection due to Selecting a trace and forming fragments out of it Partial procedure inlining and improved layout in cache Large improvements in Perl, compress, li and m88skim Have stable working sets No performance improvements on go, ijpeg and vortex Startup time is non-negligible fraction of the total runtime in these programs Don’t run long enough to recoup Dynamo’s startup overhead Lack of a stable working set.
26
Remarks Dynamo project Started in 1996 Tries to achieve dynamic performance improvement Novel native-to-native online optimization aproach Complements the static compiler Very influential paper 279 tracked in ACM Digital Library
27
Remarks Starting up Single call to library routine It copies the stack, creates space for interpreter’s state, and begins Dynamo’s execution This model simplifies initialization of program and grants Dynamo local access to binary memory space. Counter management 1st time at branch target allocates & initializes a counter 2nd & subsequent times bumps that counter Optimization & code generation recycles counter’s space With cheap breakpoint mechanism, it could spare the use of an interpreter and use breakpoint to invoke Dynamo (similar to a debugger).
28
Caveats Runs on PA-8000 (introduced on November 1995) Very simple instructions, easier to write optimizer than CISC (recall Itanium Fiasco) This chip lacks most of hardware optimization features common on modern chips. Low performance gain on already highly optimized binaries Poor handling of interruptions may cause problems in I/O intensive scenarios. Experiments does not test this scenario Posterior documents from the authors market Dynamo as an interpreter for emulators
29
References Dynamo: A Transparent Dynamic Optimization System Vasanth Bala, Evelyn Duesterwald and Sanjeev Banerjia Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation ARS Technica report on Dynamo http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html Slides based on previous slides of: Dynamic Optimization as typified by the Dynamo System. Keith D. Cooper & Linda Torczon. COMP 512, Rice University. Mayank Agarwal. CS 812 : High Level Design and Modeling Student Presentation
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.