Trace Fragment Selection within Method-based JVMs. Duane Merrill and Kim Hazelwood. VEE '08.

2 Overview
Would trace fragment dispatch benefit VMs with JITs?
– Fragment dispatch as a feedback-directed optimization
Why?
– Improve VM performance via better instruction layout
Overview:
– Motivation
– New scheme for trace selection
– Viability in JikesRVM: evaluate opportunities for code improvement and evaluate trace selection overhead

3 Traditional VM Adaptive Code Generation
Phase 1: Interpreter
– Compilation shape: source instruction
– Dispatch shape: corresponding MC instruction(s)
Phase 2: JIT method compilation
– Compilation shape: source method
– Dispatch shape: corresponding MC code array
Phase 3: More advanced JIT compilation
– Update class/TOC dispatch tables, perform OSR
[Diagram: the three phases; a "machine code trace fragment" dispatch shape also appears in the figure.]

4 SDT/DBI/Embedded VM Adaptive Code Generation
Phase 1: Interpreter
– Compilation shape: source instruction
– Dispatch shape: corresponding MC instruction(s)
Phase 2: JIT method compilation
– Compilation shape: source method
– Dispatch shape: machine code trace fragment
Phase 3: More advanced JIT compilation
– Update class/TOC dispatch tables, perform OSR
[Diagram: same phase figure as the previous slide, with the trace fragment dispatch shape featured.]

5 Proposed VM Adaptive Code Generation
Phase 1: Interpreter
– Compilation shape: source instruction
– Dispatch shape: corresponding MC instruction(s)
Phase 2: JIT method compilation
– Compilation shape: source method
– Dispatch shape(s): corresponding MC code array & machine code trace fragment
Phase 3: More advanced JIT compilation
– Update class/TOC dispatch tables, perform OSR

6 Trace Fragment Dispatch
Trace:
– A specific sequence of instructions observed at runtime
– May span branches, procedure calls and returns, and a potentially arbitrary number of instructions
Trace fragment:
– A finite, linear sequence of machine code instructions
– Single-entry, multiple-exit (viz. a superblock)
– Cached and linked
[Diagram: CFG of foo() (blocks A, B, C, D, E) and bar() (blocks M, N, O, P); trace fragment A-B-D-M-O-P-E with side exits to C and to N]
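As a sketch of the definition above, a single-entry, multiple-exit trace fragment can be modeled as one linear block sequence plus a side-exit table; the class and field names here are illustrative, not from the paper:

```python
# Illustrative model of a trace fragment (superblock): a single entry,
# a linear body of basic blocks, and side exits that leave the trace.
class TraceFragment:
    def __init__(self, entry, blocks):
        self.entry = entry          # single entry point (block label)
        self.blocks = list(blocks)  # linear sequence of basic blocks
        self.exits = {}             # block -> off-trace target (side exit)
        self.links = {}             # block -> linked fragment, if patched

    def add_exit(self, block, off_trace_target):
        self.exits[block] = off_trace_target

    def link(self, block, fragment):
        # Fragment linking: patch a side exit to jump into another
        # cached fragment instead of returning to method code.
        self.links[block] = fragment

# The example trace from the slides: A-B-D-M-O-P-E spanning foo() and
# bar(), with side exits to C and to N.
frag = TraceFragment("A", ["A", "B", "D", "M", "O", "P", "E"])
frag.add_exit("B", "C")
frag.add_exit("O", "N")
```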

7 Trace Fragment Dispatch: The Good
Location, location, location:
– "Inlining-like": context sensitive, partial
– Spatial locality provides most of the achieved speedup
Simple, low-cost "local" optimizations:
– Redundancy elimination
Nimbly adjusts to changing behavior:
– Efficient
– Lots of early exits? Discard the fragment and re-trace
[Diagram: trace A-B-D-M-O-P-E through foo() and bar(), with side exits to C and to N]

8 Trace Fragment Dispatch: The Bad
Lacks optimization power:
– Data flow analysis
– Code motion & loop optimizations
Code expansion:
– Tail duplication
– Exponential growth (if all paths are maintained indefinitely)
[Diagram: trace A-B-D-M-O-P-E with side exits to C and to N; layered traces C-D-M-O-P-E and N-P-E illustrate path growth]


11 Supplement Method Dispatch with Trace Dispatch
Why?
– Improve VM performance via better instruction layout
– Easily-disposable fragments reflect current program behavior
How?
– JIT compiler inserts instrumentation into method code arrays:
  – Monitor potential "hot trace headers"
  – Record control flow
– VM runtime assembles & patches trace fragments:
  – Blocks "scavenged" from compiled code arrays
  – Conditionals adjusted for proper fallthroughs
  – Method code arrays patched to transfer control to fragments
  – New fragments linked to existing fragments
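The "monitor hot trace headers, then record control flow" step can be sketched as follows; the hotness threshold and the callback shape are illustrative assumptions, not values from the paper:

```python
# Sketch of trace-header monitoring: count executions of candidate
# headers, and start recording a trace once one becomes hot.
HOT_THRESHOLD = 50   # hypothetical hotness threshold

counters = {}        # candidate trace header -> execution count
recording = None     # list of blocks while a trace is being recorded

def on_block_executed(block, is_trace_head):
    global recording
    if recording is not None:
        recording.append(block)      # record control flow along the trace
        return
    if is_trace_head:
        counters[block] = counters.get(block, 0) + 1
        if counters[block] >= HOT_THRESHOLD:
            recording = [block]      # header is hot: start recording here

def finish_recording():
    # The VM runtime would now assemble the fragment from these blocks
    # and patch the method code array to dispatch to it.
    global recording
    trace, recording = recording, None
    return trace
```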

12 Easy Fragment Management
Improved trace selection:
– JIT identifies trace starting locations
– VM determines trace stopping locations
"Friendly" encoding of instructions:
– Patch spots built in
– Avoids pesky PC-relative jumps (e.g., switch statements)
Knowledge of language implementation features:
– Calling conventions
– Stack layout
– Virtual method dispatch tables

13 Efficient Fragment Management
"Mixed-mode" scheme:
– Execution in both method code arrays & trace fragments
  – Both share the same register allocation
– Control flows off-trace into method code arrays
  – Fewer trace fragments
  – Manageable code expansion
– JVM control is already built into yield points
– Disposable trace fragments
  – No need to redo expensive analysis as behavior changes

14 Our Work: Trace Fragment Selection
1. Develop a new trace selection methodology
– Leverage JIT global analysis and the VM runtime
2. Implement trace selection in JikesRVM and evaluate viability
– Do recorded traces indicate room for code improvement?
– Do the traces exhibit good characteristics?
– Is instrumentation overhead reasonable?

15 Improved Trace Selection: Starting Locations
1. Loop header locations
– Identified by JIT loop analysis
– More accurate than the "target of backward branch" heuristic
2. "Early exit" blocks
– Allow trace fragments to be "layered"
3. Method prologue
– Catches recursive execution
[Diagram: trace A-B-D-M-O-P-E through foo() and bar(), with side exits to C and to N]
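The loop-header category above can be approximated with a back-edge search over the CFG; this is a sketch under the assumption of a reducible CFG, with the JIT's loop analysis stood in for by depth-first back-edge detection (early-exit blocks, category 2, would be added by the VM at runtime and are not computable statically here):

```python
# Sketch: find trace starting locations in a CFG given as
# {block: [successor blocks]}, with 'entry' as the method prologue.
def loop_headers(cfg, entry):
    # A successor already on the DFS stack is the target of a back
    # edge, i.e. a natural-loop header (for reducible CFGs).
    headers, on_stack, visited = set(), set(), set()
    def dfs(b):
        visited.add(b)
        on_stack.add(b)
        for s in cfg.get(b, []):
            if s in on_stack:
                headers.add(s)
            elif s not in visited:
                dfs(s)
        on_stack.discard(b)
    dfs(entry)
    return headers

def starting_locations(cfg, entry):
    # Category 1 (loop headers) plus category 3 (the method prologue);
    # category 2, early-exit blocks, is discovered at runtime.
    return {entry} | loop_headers(cfg, entry)
```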


17 Improved Trace Selection: Starting Locations (method prologue example)
[Diagram: foo() with blocks A, B, C, D; trace A-B-D starting at the prologue, with exits to C and to the epilogue]

18 Improved Trace Selection: Stopping Criteria
1. Cycle: returned to the loop header
2. Abutted: arrived at another loop header
3. Length-limited (unusual): 128 basic blocks encountered
4. Rejoined (unusual): returned to a basic block already in the trace
5. Exited (unusual): exited the method without meeting the above conditions (identifiable by stack height)
[Diagram: trace A-B-D-M-O-P-E with side exits to C and to N, and a layered trace N-P-E linking back to A]
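The five criteria above amount to a per-block check applied while recording; a minimal sketch, with the 128-block limit taken from the slide and the function shape assumed for illustration:

```python
# Sketch of the five stopping criteria, applied to each block reached
# while a trace is being recorded. Returns the stop reason, or None to
# keep recording. Checked in slide order, so returning to the trace's
# own header counts as a cycle even though it is also a loop header.
MAX_BLOCKS = 128

def stop_reason(trace, next_block, loop_headers,
                entry_stack_height, current_stack_height):
    if next_block == trace[0]:
        return "cycle"            # 1. returned to the starting loop header
    if next_block in loop_headers:
        return "abutted"          # 2. arrived at another loop header
    if len(trace) >= MAX_BLOCKS:
        return "length-limited"   # 3. 128 basic blocks encountered
    if next_block in trace:
        return "rejoined"         # 4. returned to a block already in trace
    if current_stack_height < entry_stack_height:
        return "exited"           # 5. left the method (seen via stack height)
    return None                   # keep recording
```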


20 JIT-Inserted Instrumentation
(a) Assembly of the original method code block
(b) Assembly of the code block to be used for tracing
Low-fidelity instrumentation: loop-header counters (TRACE_HEAD stubs plus trampolines around blocks A-D; block A is the loop header)
High-fidelity instrumentation: records paths through blocks (per-block INSTRUM stubs plus trampolines in the duplicated blocks A'-D')
[Diagram: block layouts for both instrumentation fidelities]
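The two fidelities can be sketched as a state machine per method: cheap loop-header counters first, then a switch to the path-recording copy of the code block once a header turns hot. The class, the threshold, and the switch-over point are illustrative assumptions:

```python
# Sketch of two-level instrumentation: low fidelity counts only loop
# headers; once a header is hot, execution "patches over" to the
# high-fidelity copy that records the path through every block.
class MethodInstrumentation:
    def __init__(self, loop_headers, hot=10):
        self.counts = {h: 0 for h in loop_headers}
        self.hot = hot
        self.high_fidelity = False
        self.path = []

    def execute_block(self, block):
        if self.high_fidelity:
            self.path.append(block)        # record paths through blocks
        elif block in self.counts:
            self.counts[block] += 1        # low fidelity: header counters
            if self.counts[block] >= self.hot:
                self.high_fidelity = True  # switch to instrumented copy
```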


24 Improvement Opportunity
[Diagram: static code layout of foo() (blocks A, B, D, E, C) and bar() (blocks M, N, P, O) versus the hot trace through them]

25 Improvement Opportunity
[Diagram: the same block layout placed in a 1 GB virtual address space, from 0x5B0480C6 (low) to 0x9BFE8D1F (high)]

26 Trace Layouts in Address Space (_227_mtrt)
[Figure: recorded traces plotted across the 1 GB virtual address space, 0x5B0480C6 (low) to 0x9BFE8D1F (high)]

27 Improvement Opportunity
[Diagram: the same block layout, annotated with "gap" transitions versus "fallthrough" transitions]

28 Trace Continuity
DaCapo & SPECjvm98 benchmarks:
– One third of traces are necessarily fragmented (inter-procedural)
– Most intra-procedural traces are non-contiguous

29 Transitions between basic blocks
– Appropriate fallthrough block 80% of the time
– 15% misprediction rate for local control flow
– 20% of all transitions could benefit from trace fragment dispatch

Distance              Transition gaps
0-64 B (cache line)   34.7%
65 B - 4 KB (page)    40.7%
4 KB+                 24.6%
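The distance buckets in the table above can be reproduced with a simple classifier; the bucket edges follow the table, while the function name and the example addresses are illustrative:

```python
# Sketch: classify the gap between a block and its transition target
# into the slide's three distance buckets (cache line, page, far).
def gap_bucket(src_addr, dst_addr):
    gap = abs(dst_addr - src_addr)
    if gap <= 64:
        return "cache line (0-64 B)"
    if gap <= 4096:
        return "page (65 B - 4 KB)"
    return "far (4 KB+)"
```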

30 Trace Characteristics
– Cycle and abutted traces make up the majority
– Few length-limited or rejoined traces
– A surprisingly large number of exited traces (sporadic loops)

31 Instrumentation Overhead (Startup)
– One-iteration tests (40x)
– Mixed slowdown results: 7.4% (jython), -6.5% (_227_mtrt)
– Average startup overhead: 1.7%

32 Instrumentation Overhead (Steady State)
– 40-iteration tests (8x)
– Average steady-state overhead: 1.7%

33 Summary
Envision trace fragment dispatch as a feedback-directed optimization:
– Locality optimizations not addressed by the JIT compiler
– Adapt to changing behavior without recompilation
More accurate trace selection:
– Enabled by co-location with the JIT and VM runtime
Evaluated opportunity and cost:
– 20% of basic block transitions do not use the sequential fallthrough
– 25% of taken branches/calls transfer control flow to locations outside the VM page
– Minimal startup and maintenance overhead for trace selection

34 Questions?


37 Normalized Trace Layouts (_227_mtrt)
[Figure: normalized layouts of recorded traces]