Dynamic Optimization as typified by the Dynamo System. See “Dynamo: A Transparent Dynamic Optimization System”, V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000.

Presentation transcript:

Slide 1: Dynamic Optimization as typified by the Dynamo System
See “Dynamo: A Transparent Dynamic Optimization System”, V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000.
COMP 512, Rice University, Spring 2011.
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Slide 2: What is “Dynamic Optimization”? And why should it be profitable?
Run-time adaptation of the code to reflect run-time knowledge
  Constant values, variable alignments, code in DLLs
  Hot paths (dynamic, as opposed to averaged over the entire run)
If the code is performing poorly …
  Correct for a poor choice of optimizations or strategy
  Correct for poor data layout
If the code is performing well …
  Straighten branches
  Perform local optimization based on late-bound information
Dynamo is an excellent example of these effects.

Slide 3: The Dynamo System
Runs on PA-RISC, emulating PA-RISC
Focuses on opportunities that arise at run time
Provides transparent operation
  No user intervention, except to start Dynamo
  Interprets cold code; optimizes & runs hot code
  Speedup on hot code must cover its overhead
Key notions
  Fragment cache to hold hot paths
  Threshold for the cold → hot transition
  Branch straightening

Slide 4: How does it work?
[Flowchart of Dynamo’s main loop: interpret until a taken branch; look up the branch target in the fragment cache; on a hit, jump to the cached fragment; on a miss, if the target meets the start-of-trace condition, bump the target’s counter; once the counter exceeds the hot threshold, interpret with code generation until the end-of-trace condition holds, then create & optimize the new fragment, emit it, link it, & recycle the counter. Supporting components: the fragment cache and a signal handler.]

Slide 5: How does it work?
[Same flowchart as slide 4.]
Step 1: Interpret cold code until the code takes a branch.
Is the target in the fragment cache?
  Yes → jump to the fragment
  No → decide whether to start a new fragment
Cold code is interpreted (slow); hot code is optimized & run (fast).
Hot must make up for cold, if the system is to be profitable …
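A minimal C sketch of this dispatch step, with invented names and a direct-mapped table standing in for the fragment cache’s lookup structure (the slide does not specify this level of detail):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define CACHE_SLOTS 1024

typedef void (*fragment_fn)(void);     /* a compiled hot trace */

/* Direct-mapped table standing in for the fragment cache's lookup side. */
static struct { uint64_t target; fragment_fn code; } cache[CACHE_SLOTS];

/* Probe the fragment cache; NULL means the branch target is still cold. */
static fragment_fn lookup_fragment(uint64_t target) {
    size_t slot = (size_t)(target % CACHE_SLOTS);
    return (cache[slot].code && cache[slot].target == target)
               ? cache[slot].code : NULL;
}

/* One turn of the main loop: 'target' is the destination of the taken
 * branch that the interpreter just reached.                              */
static void dispatch(uint64_t target) {
    fragment_fn frag = lookup_fragment(target);
    if (frag) {
        frag();        /* hit: run the optimized native fragment          */
    } else {
        /* miss: keep interpreting; slide 6 shows how the target may      */
        /* start a new trace                                              */
        printf("cold target %#llx: continue interpreting\n",
               (unsigned long long)target);
    }
}

int main(void) {
    dispatch(0x1000);  /* nothing cached yet, so this path stays cold     */
    return 0;
}
```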

Slide 6: How does it work?
[Same flowchart as slide 4.]
If the “start of trace” condition holds, bump the trace’s counter.
If the counter > some threshold value (50), move into the code generation & optimization phase to convert the code into an optimized fragment in the fragment cache; otherwise, go back to interpreting.
The counter forces a trace to be hot before Dynamo spends effort to improve it.
Start of trace
  Target of a backward branch (loop header)
  Fragment cache exit branch
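A sketch of the cold → hot transition, using the threshold of 50 quoted on the slide; the table layout and names are assumptions, not Dynamo’s actual data structures:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define HOT_THRESHOLD 50   /* the threshold value quoted on the slide     */
#define NHEADS 4096

/* One counter per potential trace head (a target of a backward branch or
 * a fragment-cache exit), kept in a small direct-mapped table.           */
static struct { uint64_t pc; unsigned hits; bool live; } heads[NHEADS];

/* Bump the counter for a start-of-trace candidate; return true when the
 * path has just become hot and should be compiled on its next execution. */
static bool bump_and_test(uint64_t pc) {
    unsigned slot = (unsigned)(pc % NHEADS);
    if (!heads[slot].live || heads[slot].pc != pc) {   /* first visit     */
        heads[slot].pc = pc;
        heads[slot].hits = 0;
        heads[slot].live = true;
    }
    return ++heads[slot].hits > HOT_THRESHOLD;
}

int main(void) {
    for (int i = 0; i < 60; i++) {
        if (bump_and_test(0x2000)) {
            printf("pc 0x2000 became hot on visit %d\n", i + 1);
            break;
        }
    }
    return 0;
}
```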

Slide 7: How does it work?
[Same flowchart as slide 4.]
To build an optimized fragment:
  Interpret each operation & generate the code
  Encounter a branch? If the end-of-trace condition holds, create & optimize the new fragment; emit the fragment, link it to other fragments, & free the counter
  Otherwise, keep interpreting …
End of trace
  Taken backward branch (bottom of a loop)
  Fragment cache entry label
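A sketch of trace recording under the end-of-trace rule above; the interpreter is reduced to a placeholder stub and the op representation is invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACE_OPS 256

/* One interpreted operation: the pc it came from and the pc that follows
 * it (the branch target when the op is a taken branch).                  */
typedef struct { uint64_t pc; uint64_t next; } op_t;

typedef struct { op_t ops[MAX_TRACE_OPS]; size_t nops; } trace_t;

/* Placeholder stubs; the real system decodes and emulates PA-RISC here.  */
static op_t interpret_one(uint64_t pc)     { op_t o = { pc, pc + 4 }; return o; }
static bool in_fragment_cache(uint64_t pc) { (void)pc; return false; }

/* Once a trace head turns hot, interpret the path one more time, emitting
 * code for each op, and stop when the end-of-trace condition holds.      */
static void record_trace(uint64_t start, trace_t *t) {
    uint64_t pc = start;
    t->nops = 0;
    while (t->nops < MAX_TRACE_OPS) {
        op_t op = interpret_one(pc);         /* execute & generate code   */
        t->ops[t->nops++] = op;
        /* end of trace: a taken backward branch (the bottom of a loop),
         * or reaching code that already lives in the fragment cache      */
        if (op.next <= pc || in_fragment_cache(op.next))
            break;
        pc = op.next;
    }
    /* the recorded trace would now be optimized, emitted, and linked     */
}

int main(void) {
    trace_t t;
    record_trace(0x3000, &t);
    return t.nops ? 0 : 1;
}
```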

Slide 8: How does it work?
[Same flowchart as slide 4.]
Moving the code fragment into the cache: each end-of-fragment branch jumps to a stub that jumps back to the interpreter …

Slide 9: Overhead
Dynamo runs faster than native code; overhead averages 1.5% (Spec95 on an HP PA-8000).
Most of the overhead is spent in trace selection
  Interpret, bump target counters, test against the threshold
  Optimization & code generation have minor cost
Dynamo makes back its overhead on fast execution of hot code
  Hot code executes often
  Dynamo must be able to improve it
The trace selection method lowers overhead via speculation
  Only profiles selected blocks (targets of backward branches)
  Once a counter > threshold, it compiles until “end of trace”
  No profiling on internal branches in the trace
  End of trace relies on static properties

Slide 10: Effects of Fragment Construction
The profile mechanism identifies A as the start of a hot path.
After threshold trips through A, the next path is compiled.
The speculative construction method adds C, D, G, I, J, & E.
The run-time compiler builds a fragment for ACDGIJE, with exits for B, H, & F.
[Figure: example CFG with blocks A through J, including a call & return pair.]

Slide 11: Effects of Fragment Construction
The hot path is linearized
  Eliminates branches
  Creates a superblock
  Applies local optimization
Cold-path branches remain
  Their targets are stubs
  They send control to the interpreter
The path includes the call & return
  Jumps, not branches
  Interprocedural effects
Indirect branches
  Speculate on the target
  Fall back on a hash table of branch targets
[Figure: the linearized fragment A C D G I J E, with exit stubs B*, F*, H* leading back to the interpreter.]
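To make the superblock shape concrete, here is an illustrative C rendering of the linearized path ACDGIJE: the on-trace direction of every branch falls through, and the off-trace direction exits to a stub that returns control to the interpreter. The exits to B, H, and F come from the slide, but the exact blocks at which they occur are assumed:

```c
#include <stdbool.h>
#include <stdio.h>

/* Exit stub: in Dynamo this is a small piece of code that saves state and
 * transfers control back to the interpreter at the named block.          */
static void back_to_interpreter(const char *block) {
    printf("fragment exit: resume interpreting at block %s\n", block);
}

/* Invented stand-ins for the branch conditions along the hot path.       */
static bool stays_on_trace_at_A(void) { return true; }
static bool stays_on_trace_at_G(void) { return true; }
static bool stays_on_trace_at_J(void) { return true; }

/* The superblock for the hot path A-C-D-G-I-J-E.                         */
static void fragment_ACDGIJE(void) {
    /* A */ if (!stays_on_trace_at_A()) { back_to_interpreter("B"); return; }
    /* C */ /* ... straight-line code for C ...                           */
    /* D */ /* ... straight-line code for D ...                           */
    /* G */ if (!stays_on_trace_at_G()) { back_to_interpreter("H"); return; }
    /* I */ /* ... the call becomes a jump; the trace crosses the call    */
    /* J */ if (!stays_on_trace_at_J()) { back_to_interpreter("F"); return; }
    /* E */ /* ... code for E; the trace's ending branch loops back to A  */
}

int main(void) {
    fragment_ACDGIJE();   /* with the stand-in conditions, stays on trace */
    return 0;
}
```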

Slide 12: Sources of Improvement
Many small things contribute to make Dynamo profitable.
Linearization eliminates branches and improves TLB & I-cache behavior.
Two passes of local optimization (a small before/after illustration follows)
  Redundancy elimination, copy propagation, constant folding, simple strength reduction, loop unrolling, loop-invariant code motion, redundant load removal (spill code?)
Keep in mind that “local” includes interprocedural in Dynamo.
Engineering detail makes a difference.
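A small before/after pair, in C rather than PA-RISC code, showing the kind of cleanup the local passes can do once two blocks sit on one straight-line trace (copy propagation plus removal of a load that is redundant on the trace):

```c
#include <stdio.h>

struct rec { int count; };

/* As the trace is first laid down: the two loads of p->count came from
 * different basic blocks, so a purely local optimizer could not combine
 * them until linearization put them on one straight-line path.           */
static int trace_as_recorded(const struct rec *p, int x) {
    int a = p->count;   /* load from what was block A                     */
    int t = a;          /* copy                                           */
    int b = p->count;   /* load from what was block D: redundant on trace */
    return t + b + x;
}

/* After copy propagation and redundant-load removal on the superblock.   */
static int trace_after_local_opt(const struct rec *p, int x) {
    int a = p->count;
    return a + a + x;
}

int main(void) {
    struct rec r = { 21 };
    printf("%d %d\n", trace_as_recorded(&r, 0), trace_after_local_opt(&r, 0));
    return 0;
}
```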

Slide 13: Redundant Computations
Some on-trace redundancies are easily detected.
Suppose the trace defines r3. The definition may be partially dead:
  Live on the exit but not on the trace → move it to the exit stub
  Live on the trace but not on the exit → move it below the exit †
This implies that we have LIVE information for the code
  Collect LIVE sets during optimization & code generation
  Store the LIVE set for each fragment entry & exit
  Allows interfragment optimization
† Can we know this?
  Only if the exit is to a fragment rather than to the interpreter.
  Otherwise, we must assume that the definition is LIVE on each exit.
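A minimal sketch of the liveness test behind the footnote: a definition can be sunk below an exit only when the exit links to another fragment whose entry LIVE set shows the register dead. The structure and names are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t liveset_t;        /* one bit per machine register        */

/* What the sketch assumes is recorded for each fragment exit.            */
typedef struct {
    bool      links_to_fragment;   /* does the exit lead to compiled code? */
    liveset_t live_on_entry;       /* LIVE set at the target's entry       */
} frag_exit_t;

/* A definition of register r that is not read later on the trace can be
 * moved below this exit only if r is provably dead along the exit path.
 * If the exit still goes to the interpreter, we must be conservative and
 * treat every register as live there (the slide's footnote).             */
static bool can_sink_below_exit(const frag_exit_t *e, unsigned r) {
    if (!e->links_to_fragment)
        return false;
    return (e->live_on_entry & ((liveset_t)1 << r)) == 0;
}

int main(void) {
    frag_exit_t to_interp = { false, 0 };
    frag_exit_t to_frag   = { true,  1u << 5 };   /* only r5 live there   */
    return (can_sink_below_exit(&to_interp, 3) == false &&
            can_sink_below_exit(&to_frag, 3)   == true) ? 0 : 1;
}
```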

Slide 14: Fragment Linking
What happens if another path becomes hot? (Say ABDGIJE.*)
Block B meets the “start of trace” condition (it is an exit from the fragment).
* Speculative trace construction
[Figures: the original CFG (blocks A through J, with call & return) and the existing fragment A C D G I J E with exit stubs B*, F*, H* leading back to the interpreter.]

Slide 15: Fragment Linking
When the counter reaches the hot threshold, Dynamo builds a fragment.
[Figure: the existing fragment A C D G I J E with exit stubs B*, F*, H*.]

Slide 16: Fragment Linking
When the counter reaches the hot threshold, Dynamo builds a fragment.
[Figure: the original fragment A C D G I J E plus the new fragment B D G I J E, with exit stubs A*, F*, H*.]

Slide 17: Fragment Linking
When the counter reaches the hot threshold, Dynamo:
  Builds a fragment
  Links the exit A → B to the new fragment
  Links the exit E → A to the old fragment
[Figure: the two fragments, now linked directly to each other; the F* and H* stubs still return to the interpreter.]

Slide 18: Fragment Linking
When the counter reaches the hot threshold, Dynamo:
  Builds a fragment
  Links the exit A → B to the new fragment
  Links the exit E → A to the old fragment
What if the B* stub held a redundant op?
  We have LIVE on entry to B
  We can test the LIVE sets for both the exit from A & the entry to B
  They may show that the op is dead … link-time redundancy elimination
[Figure: the two linked fragments, as on slide 17.]
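A sketch of the linking step in a function-pointer model (the real system patches the branch in the exit stub itself); fragment and function names are invented:

```c
#include <stddef.h>
#include <stdio.h>

typedef void (*code_t)(void);

/* A compiled fragment: its entry point plus one patchable slot per exit.
 * In the real system the "slot" is the branch in the exit stub itself.   */
typedef struct {
    code_t entry;
    code_t exit_targets[4];
    size_t nexits;
} fragment_t;

static void interpreter_entry(void) { puts("back in the interpreter"); }
static void old_fragment_body(void) { puts("running fragment ACDGIJE"); }
static void new_fragment_body(void) { puts("running fragment BDGIJE");  }

/* Linking: redirect exit 'i' of 'from' so control flows straight into
 * 'to' without a round trip through the interpreter.                     */
static void link_exit(fragment_t *from, size_t i, const fragment_t *to) {
    if (i < from->nexits)
        from->exit_targets[i] = to->entry;
}

int main(void) {
    fragment_t old_frag = { old_fragment_body,
                            { interpreter_entry, interpreter_entry,
                              interpreter_entry }, 3 };  /* B*, F*, H*    */
    fragment_t new_frag = { new_fragment_body,
                            { interpreter_entry, interpreter_entry,
                              interpreter_entry }, 3 };  /* A*, F*, H*    */

    link_exit(&old_frag, 0, &new_frag);  /* exit A → B now hits new_frag  */
    link_exit(&new_frag, 0, &old_frag);  /* exit E → A now hits old_frag  */

    old_frag.exit_targets[0]();          /* demonstrates the linked jump  */
    return 0;
}
```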

Slide 19: Results
They measured performance on the Spec95 codes.
[Figure: performance graphic from the Ars Technica report on Dynamo.]

Slide 20: Fragment Cache Management
For the examples in the paper (Spec95), the cache was big enough
  Flushed the cache when fragment creation got hot
  Worked well enough
What about real programs?
  Microsoft Word produces a huge fragment cache
  Loses some of the I-cache & TLB benefits
  Does not trigger replacement early enough
New research is needed on fragment cache management
  Algorithms must be dirt cheap & very effective
  Work at Harvard on this problem (Kim Hazelwood)
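One plausible shape for the flush heuristic the slide describes, flushing the whole cache when the rate of fragment creation spikes; the rate limit and interval are invented numbers, not values from the paper:

```c
#include <stdbool.h>
#include <time.h>

#define CREATION_RATE_LIMIT 100    /* new fragments per interval (assumed) */
#define INTERVAL_SECONDS    1      /* sampling interval (assumed)          */

static unsigned creations_this_interval;
static time_t   interval_start;

/* Call this each time a fragment is created; returns true when the whole
 * fragment cache should be flushed and trace selection started afresh,
 * because the program's hot working set appears to have changed.          */
static bool note_creation_and_check_flush(void) {
    time_t now = time(NULL);
    if (now - interval_start >= INTERVAL_SECONDS) {
        interval_start = now;
        creations_this_interval = 0;
    }
    return ++creations_this_interval > CREATION_RATE_LIMIT;
}

int main(void) {
    bool flush = false;
    for (int i = 0; i < 200 && !flush; i++)
        flush = note_creation_and_check_flush();
    return flush ? 0 : 1;
}
```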

Slide 21: Details
Starting up
  A single call to a library routine
  It copies the stack, creates space for the interpreter’s state, and begins Dynamo’s execution
Counter management
  The 1st time at a branch target allocates & initializes a counter
  The 2nd & subsequent times bump that counter
  Optimization & code generation recycles the counter’s space
With a cheap breakpoint mechanism, could the cold code run natively rather than being interpreted?
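A sketch of the counter lifecycle described above (allocate on first visit, bump on later visits, recycle once the trace has been compiled); the table layout is an assumption:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NCOUNTERS 4096

/* Invented layout: one small record per branch target being profiled.    */
typedef struct { uint64_t pc; unsigned hits; bool in_use; } counter_t;

static counter_t counters[NCOUNTERS];

/* First visit to a branch target allocates & initializes its counter;
 * later visits just return the existing one so the caller can bump it.   */
static counter_t *counter_for(uint64_t pc) {
    counter_t *c = &counters[pc % NCOUNTERS];
    if (!c->in_use || c->pc != pc) {
        c->pc = pc;
        c->hits = 0;
        c->in_use = true;
    }
    return c;
}

/* Once the trace has been compiled into a fragment, the counter's space
 * is recycled for some other branch target.                               */
static void recycle_counter(counter_t *c) {
    memset(c, 0, sizeof *c);
}

int main(void) {
    counter_t *c = counter_for(0x4000);
    c->hits++;                 /* second and later visits bump the counter */
    recycle_counter(c);        /* after optimization & code generation     */
    return 0;
}
```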

Slide 22: Summary
They built it. It works pretty well on benchmarks. With some tuning, it should work well in practice.
Principles:
  Do well on hot paths
  Run slowly on cold paths
  Win from locality & (now) local optimization
Postscript (8 years later): Dynamo was an influential system, in that it sparked a line of research in both academia and industry and led to a reasonably large body of literature on similar techniques. (The paper has a huge citation count for this area.)