Adaptive Optimization with On-Stack Replacement Stephen J. Fink IBM T.J. Watson Research Center Feng Qian (presenter) Sable Research Group, McGill University

Motivation
- Modern VMs use adaptive recompilation strategies.
- The VM replaces an entry in the dispatch table with newly compiled code.
- Switching to the new code can therefore only happen at the next invocation.
- On-stack replacement (OSR) allows the transition to happen in the middle of a method's execution.
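For intuition, here is a minimal, hypothetical Java program (class and method names are illustrative) where recompilation without OSR never pays off: the hot method is entered exactly once, so a faster version installed in the dispatch table is never reached.

    // Hypothetical example: sum() is entered once and loops for its whole
    // lifetime. A recompiled version installed in the dispatch table only
    // takes effect on the *next* call, which never comes; OSR can switch
    // the running activation instead.
    class Hot {
        static long sum(long n) {
            long y = 0;
            for (long i = 0; i < n; i++) {
                y += i;                               // hot loop body
            }
            return y;
        }
        public static void main(String[] args) {
            System.out.println(sum(4_000_000_000L)); // single long-running call
        }
    }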

What is On-Stack Replacement?
- Transfer execution from compiled code m1 to compiled code m2, even while m1 runs on some thread's stack.
[Diagram: a stack frame for m1, with its PC, is replaced by a stack frame for m2.]

Why On-Stack Replacement (OSR)?
- Debugging optimized code via dynamic de-optimization [SELF-93]
- Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001]
- Promotion of long-running activations [SELF-93]
- Safe invalidation for speculative optimization [HotSpot, SELF-91]

Related Work
- Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94]
- HotSpot server compiler [JVM'01]
- Partial method compilation [OOPSLA'01]

OSR Challenges
- Engineering complexity
  - How to minimize disruption to the VM code base?
  - How to constrain optimizations?
- Policies for applying OSR
  - How to make rational decisions about when to apply OSR?
- Effectiveness
  - How does OSR improve/constrain dataflow optimizations?
  - How effective are online OSR-based optimizations?

Outline
- Motivation (covered)
- OSR Mechanism (next)
- Applications
- Experimental Results
- Conclusion

OSR Mechanism Overview
- Extract compiler-independent state from a suspended activation of m1.
- Generate specialized code m2 for the suspended activation.
- Compile and transfer execution to the new code m2.
[Diagram: the frame for m1 is captured as compiler-independent state, from which a new frame for m2 is built.]

JVM Scope Descriptor
- Compiler-independent state of a running activation
- Based on the Java Virtual Machine architecture
- Five components:
  1) Thread running the activation
  2) Reference to the activation's stack frame
  3) Program counter (as a bytecode index)
  4) Value of each local variable
  5) Value of each operand-stack location
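A hypothetical Java sketch of these five components; the class and field names are illustrative, not the actual Jikes RVM data structures:

    // Illustrative model of a JVM scope descriptor; names and types are
    // assumptions for exposition, not the Jikes RVM API.
    final class JvmScopeDescriptor {
        final Thread thread;       // 1) thread running the activation
        final long framePointer;   // 2) reference to the activation's stack frame
        final int bytecodeIndex;   // 3) program counter, as a bytecode index
        final Object[] locals;     // 4) value of each local variable
        final Object[] stackSlots; // 5) value of each operand-stack location

        JvmScopeDescriptor(Thread thread, long framePointer, int bytecodeIndex,
                           Object[] locals, Object[] stackSlots) {
            this.thread = thread;
            this.framePointer = framePointer;
            this.bytecodeIndex = bytecodeIndex;
            this.locals = locals;
            this.stackSlots = stackSlots;
        }
    }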

JVM Scope Descriptor Example
Suspend after 50 loop iterations (i = 50).

Source:

    class C {
        static int sum(int c) {
            int y = 0;
            for (int i = 0; i < c; i++) {
                y += i;
            }
            return y;
        }
    }

Bytecode:

    0  iconst_0
    1  istore_1
    2  iconst_0
    3  istore_2
    4  goto 14
    7  iload_1
    8  iload_2
    9  iadd
    10 istore_1
    11 iinc 2, 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

JVM Scope Descriptor:

    Running thread:    MainThread
    Frame pointer:     0xSomeAddress
    Program counter:   16
    Local variables:   L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack expressions: S0 = 50; S1 = 100

Extracting the JVM Scope Descriptor
- Trivial from the interpreter.
- Optimizing compiler:
  - Insert OSR point (safe-point) instructions in the initial IR.
  - An OSR point uses the stack and local state needed to recover the scope descriptor.
  - An OSR point is treated as a call; it transfers control to the exit block.
  - Aggregate OSR points into an OSR map when generating machine instructions.
[Diagram: step 1 - the frame for m1 yields compiler-independent state.]

Specialized Code Generation
- Prepend a specialized prologue to the original bytecode.
- The prologue will:
  - Save JVM scope descriptor values into local variables
  - Push JVM scope descriptor values onto the stack
  - Jump to the desired program counter
[Diagram: step 2 - compiler-independent state yields specialized code m2.]

Transition Example

JVM Scope Descriptor:

    Running thread:    MainThread
    Frame pointer:     0xSomeAddress
    Program counter:   16
    Local variables:   L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack expressions: S0 = 50; S1 = 100

Specialized bytecode (prologue, followed by the original bytecode):

    ldc 100        // restore L0 (c)
    istore_0
    ldc 1225       // restore L1 (y)
    istore_1
    ldc 50         // restore L2 (i)
    istore_2
    ldc 50         // push S0
    ldc 100        // push S1
    goto 16        // resume at the suspended program counter
    0  iconst_0    // original bytecode, unchanged,
    ...            // from index 0 through
    16 if_icmplt 7
    ...
    20 ireturn
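A runnable Java-level intuition for what the specialized code computes; the real mechanism works at the bytecode level, where goto can target index 16 directly, and the values below are the ones from the descriptor above:

    // Locals are re-seeded from the scope descriptor, then execution
    // resumes at the loop test; the result is sum(100) = 4950, exactly
    // as if the activation had never been interrupted.
    class TransitionSketch {
        static int sum_osr() {
            int c = 100;             // restore L0
            int y = 1225;            // restore L1 (sum of 0..49)
            int i = 50;              // restore L2
            for (; i < c; i++) {     // resume at the loop test (bytecode 16)
                y += i;
            }
            return y;
        }
        public static void main(String[] args) {
            System.out.println(sum_osr()); // prints 4950
        }
    }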

Transfer Execution to the New Code
- Compile m2 as a normal method.
- The system unwinds the stack frame of m1.
- Reschedule the thread to execute m2.
- By construction, executing the specialized m2 sets up the target stack frame and continues execution.
[Diagram: step 3 - the new frame for m2 replaces m1's frame.]

Recovering from Inlining
- Suppose the optimizer inlines A -> B -> C.
[Diagram: the single frame for the inlined code yields three JVM scope descriptors, one each for A, B, and C; specialized methods A', B', and C' are generated, and the one frame is replaced by separate frames for A', B', and C'.]

Inlining Example

Original code:

    void foo() {
        bar();
    A: ...
    }
    void bar() {
        ...
    B: ...
    }

Specialized code:

    foo_prime() {
        <specialized foo prologue>
        call bar_prime();
        goto A;
        ...
        bar();
    A: ...
    }
    bar_prime() {
        <specialized bar prologue>
        goto B;
        ...
    B: ...
    }

Suspended at B: inside the inlined foo -> bar. Wipe the stack back to the caller C and call foo_prime; its prologue calls bar_prime, which resumes at B.
[Diagram: the single inlined frame is replaced by separate frames for foo' and bar' above C's frame.]
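A compilable sketch of the recovery idea under the naming above; the prologue contents are elided and every name and value here is illustrative:

    // foo' re-invokes bar' so that two separate frames replace the one
    // frame of the inlined code; each prologue would restore that
    // method's locals from its own scope descriptor.
    class InlineRecovery {
        static int barPrime() {
            // <specialized bar prologue would run here>
            return 42;           // resume at label B and finish bar
        }
        static int fooPrime() {
            // <specialized foo prologue would run here>
            int r = barPrime();  // rebuild bar's frame by calling bar'
            return r;            // continue at label A with bar's result
        }
    }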

Implementation Details
The target compiler is unmodified, except for:
- New pseudo-bytecodes:
  - Load literals (to avoid inserting new constants into the constant pool)
  - Load an address/bytecode index: a JSR return address on the stack
- Fixing bytecode indices for GC maps, exception tables, and line number tables

Pros and Cons
Advantages:
- Mostly compiler-independent
- Avoids multiple entry points in compiled code
- The target compiler can exploit run-time constants
Disadvantage:
- Must compile the target method twice (once for the transition, once for the next invocation)

Outline
- Motivation (covered)
- OSR Mechanism (covered)
- Applications (next)
- Experimental Results
- Conclusion

Two OSR Applications
- Promotion (see the paper for details)
  - Recompile a long-running activation
- Deferred compilation
  - Don't compile uncommon paths
  - Saves compile time

    if (foo is currently final)
        x = 1;          // speculatively inlined body of foo()
    else
        trap/OSR;       // uncommon path, left uncompiled (instead of x = foo())
    return x;
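A hypothetical, compilable Java rendering of that guard; in reality the guard is a VM-managed test or patch point, not a field, and all names here are illustrative:

    // The common path is compiled eagerly; the uncommon path stands in
    // for a trap that triggers OSR-based deoptimization.
    class GuardSketch {
        static volatile boolean fooCurrentlyFinal = true; // VM-patched in reality
        static int inlinedFoo() { return 1; }             // speculatively inlined body
        static int osrTrap() {                            // placeholder for trap/OSR
            throw new UnsupportedOperationException("deoptimize via OSR");
        }
        static int caller() {
            int x = fooCurrentlyFinal ? inlinedFoo() : osrTrap();
            return x;
        }
    }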

Deferred Compilation
- What's "infrequent"?
  - Static heuristics
  - Profile data
- The adaptive recompilation decision is modified to consider OSR factors.
[Presenter note, Feng Qian: Class initialization is called by a class loader; when do we need OSR for it?]

Outline
- Motivation (covered)
- OSR Mechanism (covered)
- Applications (covered)
- Experimental Results (next)
- Conclusion

Online Experiments
- Eager (default): no deferred compilation
- OSR/static: deferred compilation for CHA-based inlining only
- OSR/edge counts: deferred compilation with online profile data and CHA-based inlining

Adaptive System Performance [chart; higher is better]

Adaptive System Performance [chart; higher is better]

OSR Activities (SPECjvm98, size 100, first run)

    Benchmark   Promotions  Invalidations
    compress        3            6
    jess            0            0
    db              0            1
    javac           0           10
    mpegaudio       0            1
    mtrt            0            5
    jack            0            1
    total           3           24

Outline
- Motivation (covered)
- OSR Mechanism (covered)
- Applications (covered)
- Experimental Results (covered)
- Conclusion (next)

Summary
- A new on-stack replacement mechanism
- Online profile-directed deferred compilation
- An evaluation of OSR applications in Jikes RVM

Conclusion
Should a VM implement OSR?
+ Can be done with minimal intrusion into the code base
- Modest gains from deferred compilation
- No benefit for class-hierarchy-based inlining
+ Debugging with dynamic de-optimization is valuable
- TODO: more advanced speculative optimizations
The implementation is publicly available in Jikes RVM under the CPL: Linux/x86, Linux/PPC, and AIX/PPC.

Backup Slides

Compile Rate (Offline Profile) [chart]

Compile Rate (Offline Profile) [chart]

Machine Code Size (Offline Profile) [chart]

Machine Code Size (Offline Profile) [chart]

Code Quality (Offline Profile) [chart]

Code Quality (Offline Profile) [chart; higher is better]

Jikes RVM Analytic Recompilation Model
- Define:
  - cur: current optimization level for method m
  - T_j: expected future execution time at level j
  - C_j: compilation cost at optimization level j
- Choose the j > cur that minimizes T_j + C_j.
- If T_j + C_j < T_cur, recompile at level j.
- Assumptions:
  - A method will execute for twice its current duration.
  - Compilation cost and speedup are based on offline averages.
  - Sample data determines how long a method has executed.
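A minimal sketch of that choice as code; the estimates below are hypothetical numbers, not the controller's actual tables, which derive T_j and C_j from samples and offline averages:

    // Returns the level minimizing T[j] + C[j] over j > cur,
    // or cur itself if no recompilation beats T_cur.
    class RecompileModel {
        static int choose(int cur, double tCur, double[] T, double[] C) {
            int best = cur;
            double bestCost = tCur;                // cost of staying at cur
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + C[j] < bestCost) {      // recompile at j, then run
                    bestCost = T[j] + C[j];
                    best = j;
                }
            }
            return best;                           // best == cur: do nothing
        }
        public static void main(String[] args) {
            double[] T = {80, 40, 25, 20};  // hypothetical future run times
            double[] C = {0, 10, 30, 70};   // hypothetical compile costs
            System.out.println(choose(0, 80, T, C)); // prints 1 (40+10=50 is cheapest)
        }
    }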

Jikes RVM OSR Promotion Model
- Given: an outdated activation A of method m
- Define:
  - L: last optimization level of any compiled version of m
  - cur: current optimization level of activation A
  - T_cur: expected future execution time of A at level cur
  - C_L: compilation cost for method m at optimization level L
  - T_L: expected future execution time of A at level L
- If T_L + C_L < T_cur, specialize A at level L.
- Assumption: an outdated activation will execute for twice its current duration.

Jikes RVM Recompilation Model with Profile-Driven Deferred Compilation
- Define:
  - cur: current optimization level for method m
  - T_j: expected future execution time at level j
  - C_j: compilation cost at optimization level j
  - P: percentage of code in m that profile data indicates was reached
- Choose the j > cur that minimizes T_j + P*C_j.
- If T_j + P*C_j < T_cur, recompile at level j.
- Assumptions:
  - A method will execute for twice its current duration.
  - Compilation cost and speedup are based on offline averages.
  - Sample data determines how long a method has executed.
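The only change from the base model is the P factor on compile cost. A tiny sketch with hypothetical numbers shows how deferral can tip the decision:

    // Recompiling becomes worthwhile when the profile says only a
    // fraction P of the method needs to be compiled eagerly.
    class DeferredModel {
        static boolean worthIt(double tJ, double cJ, double p, double tCur) {
            return tJ + p * cJ < tCur;   // recompile at level j?
        }
        public static void main(String[] args) {
            // Hypothetical: T_j = 40, C_j = 30, P = 0.5, T_cur = 60.
            // 40 + 0.5*30 = 55 < 60, so recompile (without P: 70 > 60, don't).
            System.out.println(worthIt(40, 30, 0.5, 60)); // prints true
        }
    }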

Offline Profile Experiments
- Collect "perfect" profile data offline.
- Mark any block never reached as "uncommon".
- Defer compilation of "uncommon" blocks.
- Four configurations:
  - Ideal: deferred-compilation trap keeps no state live
  - Ideal-OSR: deferred-compilation trap is a valid OSR point
  - Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is a valid OSR point
  - Eager (default): no deferred compilation

Compile Rate (Offline Profile) [chart]

Machine Code Size (Offline Profile) [chart]

Code Quality (Offline Profile) [chart]

OSR Challenges (revisited)
- Engineering complexity
  - How to minimize disruption to the VM code base?
  - How to constrain optimizations?
- Policies for applying OSR
  - How to make rational decisions about when to apply OSR?
- Effectiveness
  - How does OSR improve/constrain dataflow optimizations?
  - How effective are online OSR-based optimizations?

Recompilation Activities (first run)
[Table: methods recompiled at levels O0, O1, O2, and in total, with OSR vs. without OSR, for compress, jess, db, javac, mpegaudio, mtrt, jack, and overall; the per-cell counts were not preserved in this transcript.]

Summary of Study (1)
- Engineering complexity
  - How to minimize disruption to the VM code base?
    -> Compiler-independent specialized source code manages the transition transparently.
  - How to constrain optimizations?
    -> Model OSR points like calls in standard transformations.
- Policies for applying OSR
  - How to make rational decisions about applying OSR?
    -> Simple modifications to the cost-benefit analytic model.

Summary of Study (2)
- Effectiveness (for an implementation of online profile-directed deferred compilation)
  - How does OSR improve/constrain dataflow optimizations?
    -> Small ideal benefit from eliminating dataflow merges ( %)
    -> Negligible benefit when constraining optimization for potential invalidation
    -> Negligible benefit for just CHA-based inlining: patch points + splitting + pre-existence are good enough
  - How effective are online OSR-based optimizations?
    -> Average performance improvement of 2.6% on the first run of SPECjvm98 (s=100)
    -> Individual benchmarks range from +8% to -4%
    -> Negligible impact on steady-state performance (best of 10 iterations)
    -> The adaptive recompilation model is relatively insensitive; it compiles 4% more methods

Experimental Details
- SPECjvm98, size 100
- Jikes RVM
  - FastAdaptiveSemispace configuration
  - One virtual processor
  - 500 MB heap
  - Separate VM instance for each benchmark
- IBM RS/6000 Model F80
  - Six 500 MHz PowerPC 630s
  - AIX
  - 4 GB memory

Specialized Code Generation
- Generate specialized m2 that sets up the new stack frame and continues execution, preserving semantics.
- Express the transition to the new stack frame in source code (bytecode).
[Diagram: step 2 - compiler-independent state yields specialized code m2.]

Deferred Compilation
- Don't compile "infrequent" blocks.

Eager compilation:

    if (foo is currently final)
        x = 1;          // speculatively inlined body of foo()
    else
        x = foo();
    return x;

With deferred compilation:

    if (foo is currently final)
        x = 1;
    else
        trap/OSR;       // uncommon path left uncompiled
    return x;

Experimental Results
- Online profile-directed deferred compilation
- Evaluation:
  - How much do OSR points improve optimization by eliminating merges?
  - How much do OSR points constrain optimization?
  - How effective is online profile-directed deferred compilation?

Adaptive System Performance [chart]

Online Experiments
- Before optimizing, collect intraprocedural edge counters.
- Defer compilation of blocks that profile data says were not reached.
- If a deferred block is reached:
  - Trigger OSR and deoptimize
  - Invalidate the compiled code
- Modify the analytic recompilation model:
  - Promotion from baseline to optimized code
  - Compile-time cost estimate adjusted according to profile data
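A small sketch of how the profile factor P used by the modified cost model could be derived from such counters; the array representation is an assumption for illustration:

    // P is approximated as the fraction of basic blocks the profile
    // says were executed at least once before recompilation.
    class ProfileFactor {
        static double reachedFraction(int[] blockCounts) {
            int reached = 0;
            for (int c : blockCounts) {
                if (c > 0) reached++;    // block was reached at least once
            }
            return (double) reached / blockCounts.length;
        }
        public static void main(String[] args) {
            System.out.println(reachedFraction(new int[] {5, 0, 12, 0})); // 0.5
        }
    }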