Partial Method Compilation using Dynamic Profile Information. John Whaley, Stanford University. October 17, 2001.

Outline: Background and Overview; Dynamic Compilation System; Partial Method Compilation Technique; Optimizations; Experimental Results; Related Work; Conclusion.

Dynamic Compilation: We want code performance comparable to static compilation techniques. However, we also want to avoid long startup delays and slow responsiveness. The dynamic compiler should be fast AND good.

Traditional approach: An interpreter plus an optimizing compiler. Switch from the interpreter to the optimizing compiler via some heuristic. Problem: the interpreter is too slow! (10x to 100x)

Another approach: A simple compiler plus an optimizing compiler (Jalapeño, JUDO, Microsoft). Switch from the simple to the optimizing compiler via some heuristic. Problems: code from the simple compiler is still too slow (30% to 100% slower than optimized code), and there are memory footprint problems (Suganuma et al., OOPSLA'01).

Yet another approach: Multi-level compilation (Jalapeño, HotSpot). Use multiple compiled versions to slowly "accelerate" into optimized execution. Problem: this simply increases the delay before the program runs at full speed!

Problem with compilation: Compilation takes time proportional to the amount of code being compiled, and many optimizations are superlinear in the size of the code. Compiling large amounts of code is the cause of undesirably long compilation times.

Methods can be large: All of these techniques operate at method boundaries. Methods can be large, especially after inlining. Cutting inlining too much hurts performance considerably (Arnold et al., Dynamo'00). Even when being frugal about inlining, methods can still become very large.

Methods are poor boundaries: Method boundaries do not correspond very well to the code that would most benefit from optimization. Even "hot" methods typically contain some code that is rarely or never executed.

Example: SpecJVM db (the while loop is the hot loop; the error-handling code is rare):

    void read_db(String fn) {
        int n = 0, act = 0, b;
        byte buffer[] = null;
        try {
            FileInputStream sif = new FileInputStream(fn);
            buffer = new byte[n];
            while ((b = sif.read(buffer, act, n - act)) > 0) {   // hot loop
                act = act + b;
            }
            sif.close();
            if (act != n) {
                /* lots of error handling code, rare */
            }
        } catch (IOException ioe) {
            /* lots of error handling code, rare */
        }
    }

Example: SpecJVM db (continued): The same method contains lots of rare code! Aside from the hot loop, nearly everything in read_db (the error-handling paths) is rarely or never executed.

Hot "regions", not methods: The regions that are important to compile have nothing to do with the method boundaries. Using a method granularity causes the compiler to waste time optimizing large pieces of code that do not matter.

Overview of our technique: Increase the precision of selective compilation to operate at a sub-method granularity. 1. Collect basic-block-level profile data for hot methods. 2. Recompile using the profile data, replacing rare code entry points with branches into the interpreter.

Overview of our technique: It takes advantage of the well-known fact that a large amount of code is rarely or never executed. It is simple to understand and implement, yet highly effective, and it has the beneficial secondary effect of improving optimization opportunities on the common paths.

Overview of Dynamic Compilation System

Stage 1: interpreted code. Stage 2: compiled code (entered when the execution count reaches t1). Stage 3: fully optimized code (entered when the execution count reaches t2).
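A minimal sketch of how such counter-driven stage promotion might look (the class name, field names, and the T2 value below are illustrative assumptions, not taken from the paper's system):

    // Sketch of counter-driven stage transitions (Stage 1 -> 2 -> 3).
    // Class name, field names, and the T2 value are illustrative.
    class MethodProfile {
        static final int T1 = 2000;    // interpreted -> compiled (value from the talk)
        static final int T2 = 20000;   // compiled -> fully optimized (illustrative value)

        int executionCount;
        int stage = 1;                 // 1 = interpreted, 2 = compiled, 3 = optimized

        void onEntry() {
            executionCount++;
            if (stage == 1 && executionCount >= T1) {
                stage = 2;             // trigger baseline compilation with block profiling
            } else if (stage == 2 && executionCount >= T2) {
                stage = 3;             // trigger optimized recompilation using the profile
            }
        }
    }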

Identifying rare code: Simple technique: any basic block executed during Stage 2 is said to be hot. This effectively ignores initialization. Add instrumentation to the targets of conditional forward branches. Better techniques exist, but using this one we saw no performance degradation. Enabling and disabling profiling is implicitly handled by the stage transitions.
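One way to picture the profile this instrumentation produces (a sketch; the class and method names are hypothetical):

    // Sketch: a Stage-2 basic-block profile. Instrumentation at the target of
    // each conditional forward branch sets a flag; blocks whose flag was never
    // set during Stage 2 are treated as rare at recompilation time.
    class BlockProfile {
        private final boolean[] executed;   // one flag per basic block

        BlockProfile(int numBlocks) {
            executed = new boolean[numBlocks];
        }

        // Called by the instrumentation stub for a given block.
        void recordExecuted(int blockId) {
            executed[blockId] = true;
        }

        boolean isRare(int blockId) {
            return !executed[blockId];
        }
    }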

Method-at-a-time strategy [chart: % of basic blocks vs. execution threshold]

[chart: actual basic blocks executed, as % of basic blocks, vs. execution threshold]

Partial method compilation technique

Technique, step 1: Based on profile data, determine the set of rare blocks. Use code coverage information from the first compiled version.

Technique, step 2: Perform live variable analysis. Determine the set of live variables at rare block entry points (e.g., live: x, y, z).
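For reference, a minimal sketch of the standard backward liveness fixpoint that would yield those sets (the Block representation here is a stand-in, not the system's IR):

    import java.util.*;

    // Sketch of backward live-variable analysis over basic blocks.
    // liveIn(b) = use(b) U (liveOut(b) \ def(b)); liveOut(b) = union of liveIn(succ).
    class Liveness {
        static class Block {
            Set<String> use = new HashSet<>();   // variables read before any write
            Set<String> def = new HashSet<>();   // variables written in the block
            List<Block> succs = new ArrayList<>();
            Set<String> liveIn = new HashSet<>();
            Set<String> liveOut = new HashSet<>();
        }

        static void solve(List<Block> blocks) {
            boolean changed = true;
            while (changed) {
                changed = false;
                for (Block b : blocks) {
                    Set<String> out = new HashSet<>();
                    for (Block s : b.succs) out.addAll(s.liveIn);
                    Set<String> in = new HashSet<>(out);
                    in.removeAll(b.def);
                    in.addAll(b.use);
                    if (!in.equals(b.liveIn) || !out.equals(b.liveOut)) {
                        b.liveIn = in;
                        b.liveOut = out;
                        changed = true;
                    }
                }
            }
        }
    }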

Technique, step 3: Redirect the control flow edges that targeted rare blocks so that they transfer to the interpreter, and remove the rare blocks.

Technique, step 4: Perform compilation normally. Analyses treat the interpreter transfer point as an unanalyzable method call.

Technique, step 5: Record a map for each interpreter transfer point. During code generation, generate a map that specifies the location, in registers or memory, of each of the live variables. Maps are typically < 100 bytes (e.g., for live variables x, y, z: x at sp - 4, y in R1, z at sp - 8).
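Such a map might be represented roughly as follows (a sketch; the types and names are illustrative, not the actual VM data structures):

    import java.util.*;

    // Sketch of a per-transfer-point map from live variables to machine locations.
    class TransferPointMap {
        enum Kind { REGISTER, STACK_SLOT }

        static class Location {
            final Kind kind;
            final int index;          // register number or stack offset
            Location(Kind kind, int index) {
                this.kind = kind;
                this.index = index;
            }
        }

        final int bytecodeIndex;      // where to resume in the interpreter
        final Map<String, Location> liveVars = new LinkedHashMap<>();

        TransferPointMap(int bytecodeIndex) {
            this.bytecodeIndex = bytecodeIndex;
        }
    }

For the slide's example, the map would bind x to stack slot sp - 4, y to register R1, and z to stack slot sp - 8.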

Optimizations

Partial dead code elimination: Modified dead code elimination to treat rare blocks specially. Move computation that is only live on a rare path into the rare block, saving computation in the common case.

Partial dead code elimination: An optimistic approach on SSA form. Mark all instructions that compute essential values, recursively; then eliminate all non-essential instructions.
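The core mark-and-sweep pass might look like this (a sketch over a stand-in instruction type, not the system's actual IR):

    import java.util.*;

    // Sketch of optimistic dead code elimination on SSA form: start from
    // instructions with visible effects, transitively mark the defining
    // instructions of their operands as essential, then delete everything unmarked.
    class DeadCodeElim {
        static class Instruction {
            boolean hasSideEffect;                            // call, store, branch, return, ...
            List<Instruction> operands = new ArrayList<>();   // defining instructions of used values
            boolean essential;
        }

        static void run(List<Instruction> code) {
            Deque<Instruction> worklist = new ArrayDeque<>();
            for (Instruction i : code) {
                if (i.hasSideEffect) {
                    i.essential = true;
                    worklist.push(i);
                }
            }
            while (!worklist.isEmpty()) {
                for (Instruction op : worklist.pop().operands) {
                    if (!op.essential) {
                        op.essential = true;
                        worklist.push(op);
                    }
                }
            }
            code.removeIf(i -> !i.essential);                 // eliminate non-essential instructions
        }
    }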

Partial dead code elimination: Calculate the necessary code, ignoring all rare blocks. For each rare block, calculate the instructions that are necessary for that rare block but not necessary in non-rare blocks. If these instructions are recomputable at the entry of the rare block, they can be safely copied there.

Partial dead code example (before):

    x = 0;
    if (rare branch 1) {
        ...
        z = x + y;
        ...
    }
    if (rare branch 2) {
        ...
        a = x + z;
        ...
    }

Partial dead code example (after): the computation x = 0 has been copied into the rare blocks, so the common path no longer performs it.

    if (rare branch 1) {
        x = 0;
        ...
        z = x + y;
        ...
    }
    if (rare branch 2) {
        x = 0;
        ...
        a = x + z;
        ...
    }

Pointer and escape analysis: Treating an entrance to the rare path as a method call is a conservative assumption. This typically does not matter, because there are no merges back into the common path. However, the conservativeness hurts pointer and escape analysis, because a single unanalyzed call kills all information.

Pointer and escape analysis: Stack-allocate objects that don't escape in the common blocks, and eliminate synchronization on objects that don't escape the common blocks. If a branch to a rare block is taken: copy stack-allocated objects to the heap and update pointers, and reapply the eliminated synchronizations (see the sketch after the next slide).

Copying from stack to heap [diagram: the stack-allocated object is copied to a heap object, and pointers are rewritten to refer to the heap copy]
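In spirit, the fix-up performed when a rare branch is taken looks like the following (a rough sketch in source-level terms; the real system operates on raw frame memory, and all names here are illustrative):

    // Sketch of undoing common-path stack allocation on entry to a rare block:
    // copy the scalar-replaced fields into a real heap object, rewrite the
    // reference, and reapply the synchronization that was eliminated.
    class Point {
        int x, y;
    }

    class RarePathFixup {
        // Fields kept in locals/registers because the object did not escape
        // on the common path.
        int pointX, pointY;
        Point heapPoint;                   // reference used on the rare path

        void onRareBranchTaken() {
            Point p = new Point();         // allocate the heap copy
            p.x = pointX;                  // copy current field values
            p.y = pointY;
            heapPoint = p;                 // update pointers to the heap copy
            synchronized (heapPoint) {     // reapply an eliminated synchronization
                // ... continue on the rare path ...
            }
        }
    }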

Reconstructing interpreter state: We use a runtime "glue" routine. It constructs a set of interpreter stack frames, initialized with their corresponding method and bytecode pointers; iterates through each location pair in the map, copying the value at each location to its corresponding position in the interpreter stack frame; and then branches into the interpreter, where execution continues.
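A rough sketch of that glue routine (every class here is an illustrative stand-in for VM internals, not the actual joeq or IBM implementation):

    import java.util.*;

    // Sketch of the deoptimization "glue": rebuild an interpreter frame from
    // the transfer-point map, then hand control to the interpreter.
    class Deoptimizer {
        static class InterpreterFrame {
            final String method;                       // method to resume
            final int bytecodeIndex;                   // bytecode pointer to resume at
            final Map<String, Long> locals = new HashMap<>();
            InterpreterFrame(String method, int bytecodeIndex) {
                this.method = method;
                this.bytecodeIndex = bytecodeIndex;
            }
        }

        // Reads a register or stack slot of the compiled frame; stubbed out here.
        static long readLocation(String location, long framePointer) {
            return 0L;
        }

        static InterpreterFrame buildFrame(String method, int bytecodeIndex,
                                           Map<String, String> liveVarLocations,
                                           long framePointer) {
            InterpreterFrame frame = new InterpreterFrame(method, bytecodeIndex);
            // Copy each live variable from its recorded location into the
            // corresponding interpreter local.
            for (Map.Entry<String, String> e : liveVarLocations.entrySet()) {
                frame.locals.put(e.getKey(), readLocation(e.getValue(), framePointer));
            }
            return frame;   // the VM would now branch into the interpreter with this frame
        }
    }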

Experimental Results

Experimental Methodology: Fully implemented in a proprietary system; unfortunately, we cannot publish those numbers! Proof-of-concept implementation in the joeq virtual machine; unfortunately, joeq does not perform significant optimizations!

Experimental Methodology: Also implemented as an offline step, using refactored class files. We use offline profile information to split methods into "hot" and "cold" parts, and then rely on the virtual machine's default method-at-a-time strategy. This provides a reasonable approximation of the effectiveness of the technique. It can also be used as a standalone optimizer, and is available under LGPL as part of the joeq release.

Experimental Methodology: IBM JDK 1.3cx on RedHat Linux 7.1; Pentium MHz, 512 MB RAM. Thresholds: t1 = 2000, t2 = . Benchmarks: SpecJVM, SwingSet, Linpack, JavaLex, JavaCup.

Run time improvement [chart; first bar: original, second bar: PMC, third bar: PMC + my opts; blue portion: optimized execution]

Related Work, dynamic techniques: Dynamo (Bala et al., PLDI'00); Self (Chambers et al., OOPSLA'91); HotSpot (JVM'01); IBM JDK (Ishizaki et al., OOPSLA'00).

Related Work, static techniques: Trace scheduling (Fisher, 1981); superblock scheduling (IMPACT compiler); partial redundancy elimination with cost-benefit analysis (Horspool, 1997); optimal compilation unit shapes (Bruening, FDDO'00); profile-guided code placement strategies.

Conclusion: The partial method compilation technique is simple to implement, yet very effective. Compile times are reduced drastically, and overall run times improved by an average of 10% and by up to 32%. The system is available under LGPL at: