Online partial evaluation of bytecodes (3)

Run-time Optimization
–Overview of Specialization
–The DyC System
–The Dynamo System
–Run-time Specialization for ML

Specialization
[Diagram: a Program takes Input1 and Input2 and produces Output. A Specializer, given Program and Input1, produces Program_Input1; running Program_Input1 on Input2 yields the same Output.]
Program_Input1 is a specialized program. The Specializer performs specialization of Program with respect to Input1.
Example from the slide: specializing add with respect to the input 1 yields succ (add 1 2 = 3, and succ 2 = 3).

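The add/succ example above, rendered as a minimal C sketch (the function names come from the slide; the code itself is only illustrative):

    #include <stdio.h>

    /* The general program: two dynamic inputs. */
    int add(int x, int y) { return x + y; }

    /* The result of specializing add with respect to x = 1:
       the static input has been folded into the code. */
    int succ(int y) { return y + 1; }

    int main(void) {
        printf("%d %d\n", add(1, 2), succ(2));  /* both print 3 */
        return 0;
    }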
Flow-Chart Language
A program over inputs n and m with entry point (init), computing m^n:

    init: result := 1; goto test;
    test: if n < 1 then end else loop;
    loop: result := result * m;
          n := n - 1;
          goto test;
    end:  return result;

Suppose n is static with value 2 while m remains dynamic. The Specializer, given the program and n = 2, produces the specialized program Program_2:

Flow-Chart Language
Program_2 (only m remains as input, entry point (init)):

    init:  goto test0;
    test0: goto loop0;
    loop0: result := 1 * m; goto test1;
    test1: goto loop1;
    loop1: result := result * m; goto test2;
    test2: goto end;
    end:   return result;

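The same specialization sketched in C (illustrative code, not from the slides): once n = 2 is known, the loop of power unrolls completely, mirroring Program_2 above.

    #include <stdio.h>

    /* General program: both n and m are dynamic inputs. */
    int power(int n, int m) {
        int result = 1;
        while (n >= 1) { result = result * m; n = n - 1; }
        return result;
    }

    /* Specialized with respect to n = 2: the test and loop control
       have been evaluated away. */
    int power_2(int m) {
        int result = 1 * m;
        result = result * m;
        return result;
    }

    int main(void) {
        printf("%d %d\n", power(2, 5), power_2(5));  /* both print 25 */
        return 0;
    }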
What is Inside the Specializer?
A specializer is a program processor, originally seen as a static source-to-source transformation.
Two families of specializers:
–Online specialization (one pass)
–Offline specialization (two passes)

[Diagram: Online specialization, the Specializer takes Program and Input1 directly and produces Program_Input1 (the specialized program), which is run on Input2 to produce Output. Offline specialization, an Analysis phase first consumes Program together with the knowledge that "it's Input1 that will be given first" and produces an Annotated Program; the Specializer then specializes the Annotated Program with respect to Input1.]

DyC’s Run-time Specialization
It is specialization that occurs at run time; we define run time by the availability of the input.
The optimization entails an obvious cost/performance trade-off.
The typically expected situation for a pair of variables:
–The static variable is constant and becomes available earlier, and
–The dynamic variable becomes available later and is not constant.
The DyC system is a run-time specialization system with offline specialization of FCL.

[Diagram: the same two pipelines, with the Specializer now running at run time. Online: the Run-time Specializer takes Program and Input1 and produces Program_Input1, which is run on Input2. Offline: Analysis produces the Annotated Program ahead of time; the Run-time Specializer then specializes it with respect to Input1 at run time.]

[Diagram: in plain offline specialization, Analysis annotates the Program ("it's Input1 that will be given first") and a Run-time Specializer consumes the Annotated Program and Input1. In DyC, a compiler generator (cogen) instead produces a Generating Extension: a version of the specializer specialized with respect to the annotated program, i.e. a custom specializer that, given Input1 at run time, directly produces Program_Input1.]

‘It’s Input1 that will be given first’
The annotation make_static(n) marks n as the static input:

    make_static(n)
    (init) init: result := 1; goto test;
           test: if n < 1 then end else loop;
           loop: result := result * m;
                 n := n - 1;
                 goto test;
           end:  return result;

At run time, the generating extension specializes the program with respect to the current value of n if the result is not already in the cache. Successive calls fill the cache with specialized versions (e.g. Program_3 and Program_4, then Program_2); when the same n comes back, the cached version is reused.

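A minimal, self-contained sketch in C of this cache-and-specialize behaviour (the cache layout, Specialized type, and make_static_n are illustrative assumptions, not DyC's actual API):

    #include <stdio.h>

    /* Stand-in for a specialized program: a real generating extension
       would emit residual code with the loop over n unrolled. */
    typedef struct { int n; } Specialized;

    static int run_specialized(Specialized s, int m) {
        int result = 1;
        while (s.n >= 1) { result *= m; s.n--; }  /* placeholder for unrolled code */
        return result;
    }

    /* Hypothetical cache keyed by the static input n. */
    static Specialized cache[64];
    static int cached[64];

    static Specialized make_static_n(int n) {
        int free_slot = -1;
        for (int i = 0; i < 64; i++) {
            if (cached[i] && cache[i].n == n) return cache[i];  /* cache hit: reuse */
            if (!cached[i] && free_slot < 0) free_slot = i;
        }
        Specialized s = { n };                                  /* cache miss: specialize */
        if (free_slot >= 0) { cache[free_slot] = s; cached[free_slot] = 1; }
        return s;
    }

    int main(void) {
        printf("%d\n", run_specialized(make_static_n(2), 5));  /* specializes: 25 */
        printf("%d\n", run_specialized(make_static_n(2), 7));  /* cached version: 49 */
        return 0;
    }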
Annotation-directed Run-time Optimization
The underlying language is a version of the Flow-Chart Language.
Annotations help avoid non-termination and unneeded specialization:
–Eager: aggressive, speculative specialization
–Lazy: demand-driven specialization
Annotations also guide the cache policy:
–CacheAllUnchecked variable: disposable code
–CacheOne variable: only the current version is cached

[Diagram: the DyC pipeline. Annotated C (with make_static(...)) goes through Analysis to an Annotated Intermediate Representation; cogen turns it into a Generating Extension (a custom specializer); at run time, the generating extension consumes Input1 and produces Program_Input1 as native code, which is run on Input2 to produce Output.]

The Dynamo System
A six-year-long project with transparent operation (via a custom crt0.o). Dynamo is a PA-8000 code interpreter.
Assumption: “Most of the time is spent in a small portion of the code.”
Performance opportunities:
–Redundancies that cross program boundaries
–Cache utilization
It interprets until a trace is detected, for which it generates a fragment.

Most Recently Executed Tail
A trace is delimited by start-of-trace and end-of-trace conditions.
Start-of-trace conditions:
–Target of a backward-taken branch (a loop header)
–Taken branches exiting fragment code
End-of-trace conditions:
–Backward-taken branches (only loops whose header is the start-of-trace are allowed to appear in the trace)
–Taken branches to a fragment code entry
Each branch is associated with a counter.

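A sketch in C of the two trace-delimiting conditions (the Branch representation, field names, and fragment-cache queries are illustrative assumptions, not Dynamo's interface):

    #include <stdio.h>

    typedef struct {
        unsigned long pc, target;   /* branch address and its target */
        int taken;                  /* was the branch taken? */
    } Branch;

    /* Stubs standing in for fragment-cache queries. */
    static int is_fragment_exit(unsigned long pc)      { (void)pc;     return 0; }
    static int is_fragment_entry(unsigned long target) { (void)target; return 0; }

    /* Start of trace: target of a backward-taken branch (a loop header),
       or a taken branch leaving fragment code. */
    static int start_of_trace(const Branch *b) {
        return b->taken && (b->target < b->pc || is_fragment_exit(b->pc));
    }

    /* End of trace: a backward-taken branch, or a taken branch back
       to the entry of an existing fragment. */
    static int end_of_trace(const Branch *b) {
        return b->taken && (b->target < b->pc || is_fragment_entry(b->target));
    }

    int main(void) {
        Branch b = { 0x2000, 0x1000, 1 };                        /* backward-taken */
        printf("%d %d\n", start_of_trace(&b), end_of_trace(&b)); /* 1 1 */
        return 0;
    }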
[Diagram: Dynamo's control loop over the native instruction stream.]
Interpret until a taken branch, then look up the branch target in the fragment cache:
–On a hit: context switch and jump to the top of the fragment in the cache, executing native code until a fragment exit.
–On a miss: increment the counter associated with the branch target address. If the counter value exceeds the hot threshold and the start-of-trace condition holds, interpret and generate code until the end-of-trace condition is met; the optimizer then builds a new fragment from the intermediate representation, optimizes it, emits it into the fragment cache, links it with the other fragments, and recycles the associated counter. Otherwise, resume interpreting.

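The same control loop as a self-contained C sketch; every helper below is a stub standing in for Dynamo's real machinery, with names assumed for illustration:

    #include <stdio.h>

    typedef struct { unsigned long entry; } Fragment;
    typedef struct { unsigned long start; } Trace;

    enum { HOT_THRESHOLD = 50 };

    /* --- Stubs standing in for interpretation, the fragment cache, and codegen --- */
    static unsigned long interpret_until_taken_branch(unsigned long pc) { return pc + 4; }
    static Fragment *fragment_cache_lookup(unsigned long pc) { (void)pc; return NULL; }
    static unsigned increment_counter(unsigned long pc) { (void)pc; static unsigned c; return ++c; }
    static int start_of_trace_at(unsigned long pc) { (void)pc; return 1; }
    static Trace record_until_end_of_trace(unsigned long pc) { Trace t = { pc }; return t; }
    static Fragment *optimize_and_emit(Trace t) { static Fragment f; f.entry = t.start; return &f; }
    static void link_and_recycle_counter(Fragment *f, unsigned long pc) { (void)f; (void)pc; }
    static unsigned long execute_fragment(Fragment *f) { return f->entry + 4; }

    /* --- The control loop from the diagram above --- */
    static void dynamo_loop(unsigned long pc, int steps) {
        while (steps-- > 0) {
            pc = interpret_until_taken_branch(pc);          /* interpret until taken branch */
            Fragment *f = fragment_cache_lookup(pc);
            if (f) { pc = execute_fragment(f); continue; }  /* hit: jump into fragment */
            if (increment_counter(pc) > HOT_THRESHOLD && start_of_trace_at(pc)) {
                Trace t = record_until_end_of_trace(pc);    /* interpret + codegen */
                Fragment *nf = optimize_and_emit(t);        /* optimize, emit into cache */
                link_and_recycle_counter(nf, pc);
                pc = execute_fragment(nf);
            }
        }
    }

    int main(void) { dynamo_loop(0x1000, 200); puts("done"); return 0; }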
Dynamo: Notes
The whole overhead is less than 1.5%, and the contribution of the optimization part is negligible.
The average overall speedup is about 9%.
Dynamic branches are treated in a lazy way:
–we may loop back to the start-of-trace;
–if the target is in the cache, we are done;
–if not, we return to the interpreter.
The optimizer actually performs a partial evaluation similar to DyC's.

Run-time Specialization for ML
The ML virtual machine features partial application by means of closures; we may see currying as an annotation that guides run-time specialization. By means of a pe annotation, we would like to perform on-demand specialization:
–merge list1 list2
–(pe merge list1) list2
In the context of a virtual machine, specialization is a bytecode-to-bytecode transformation.

[Diagram: Online Specialization + JIT. At run time, the pe annotation triggers the Run-time Specializer, which takes the Program (ML bytecode) and Input1 and produces Program_Input1, still as ML bytecode; the JIT then compiles Program_Input1 to native code, which is run on Input2 to produce Output.]

[Diagram: the online strategy for ML (pe annotation, run-time specializer, then JIT) shown side by side with DyC's offline strategy (analysis, annotated program with make_static(...), and cogen producing a generating extension, i.e. a custom specializer, applied to Input1 at run time).]

[Diagram: the same comparison, now labelled. The online strategy (pe) relies on standard compilation, preserves portability, and its specializer is reusable; the offline strategy (DyC) involves non-standard compilation, which bears on portability.]

Interpreter:

    for (;;) {
        switch (*code_ptr) {
        case ACC: code_ptr++; accu = stack[*code_ptr++]; break;
        case POP: code_ptr++; stack += *code_ptr++; break;
        /* ... */
        }
    }

Threaded interpreter (GCC computed gotos; each handler dispatches directly to the next):

    void *array[] = { &&lbl_ACC, &&lbl_POP, /* ... */ };
    goto *array[*code_ptr];
    lbl_ACC: code_ptr++; accu = stack[*code_ptr++]; goto *array[*code_ptr];
    lbl_POP: code_ptr++; stack += *code_ptr++; goto *array[*code_ptr];
    /* ... */

The progression is: interpreter, threaded interpreter, just-in-time compilation.

Just-in-time compilation is the natural step following threaded code. It also involves inlining of the virtual machine, implemented via GCC's asm statement. Each bytecode maps to a native code template:

    ACC arg      = 0x8B 0x44 0x24 0x4*arg                      (movl 4*arg(%esp,1), %eax)
    POP arg      = 0x59 0x59 ...                               (popl %ecx; popl %ecx; ...)
    CONSTINT arg = 0xB9 0x00 0x00 0x00 0xarg 0xD1 0xE0 0x40    (movl $arg, %eax; shl %eax; inc %eax)

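A minimal, self-contained sketch of template-based code emission in the same spirit (x86/x86-64 Linux assumed; the two-byte template here, mov $imm, %eax followed by ret, is chosen for simplicity and is not one of the slide's templates; systems enforcing W^X may refuse the writable+executable mapping):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* Template for "return a constant": B8 imm32 (mov $imm,%eax), C3 (ret). */
        unsigned char code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
        int imm = 42;
        memcpy(code + 1, &imm, 4);                       /* patch the argument in */

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memcpy(buf, code, sizeof code);

        int (*fn)(void) = (int (*)(void))buf;            /* jump into emitted code */
        printf("%d\n", fn());                            /* prints 42 */
        return 0;
    }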
Specialization Algorithm
Implemented as an interpreter/[JIT-]compiler. Performs aggressively:
–Constant propagation
–Unfolding of [recursive] function calls
Manipulates symbolic values:
–known (static) and unknown (dynamic) stack entries

Mixed Computation
[Diagram: the Specializer takes a Subject Program and a Context (the abstract state of stack slots 0: and 1:) and produces a Specialized Program. Subject program: ACC 1; PUSH; ACC 1; ADDINT. Depending on which slots are static, the result ranges from fully interpreted code (CONSTINT 7) to residual code such as ACC 0; PUSH; CONSTINT 3; ADDINT, or ACC 0; PUSH; CONSTINT 4; PUSH; ACC 1; ADDINT; POP 1, or the unchanged ACC 1; PUSH; ACC 1; ADDINT.]
The specializer interprets static expressions and residualizes dynamic expressions.

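A sketch in C of how the ADDINT case might mix interpretation with residualization (the Sym type, emit, and the stack discipline are illustrative assumptions, not the actual ML specializer):

    #include <stdio.h>

    /* A symbolic stack entry: a known (static) integer or unknown (dynamic). */
    typedef struct { int is_static; int value; } Sym;

    static Sym sym_stack[256];
    static int sp = 0;

    static void push(Sym s) { sym_stack[sp++] = s; }
    static Sym pop(void)    { return sym_stack[--sp]; }

    /* Residual code is just printed here; the real specializer emits bytecode. */
    static void emit(const char *insn)            { printf("%s\n", insn); }
    static void emit_arg(const char *insn, int a) { printf("%s %d\n", insn, a); }

    /* The ADDINT case of the specializer's dispatch loop. */
    static void spec_addint(void) {
        Sym a = pop(), b = pop();
        if (a.is_static && b.is_static) {
            push((Sym){ 1, a.value + b.value });   /* interpret the addition away */
        } else {
            /* Residualize, lifting any static operand to a constant first. */
            if (a.is_static) emit_arg("CONSTINT", a.value);
            if (b.is_static) emit_arg("CONSTINT", b.value);
            emit("ADDINT");
            push((Sym){ 0, 0 });
        }
    }

    int main(void) {
        push((Sym){ 1, 3 });   /* static 3  */
        push((Sym){ 0, 0 });   /* dynamic   */
        spec_addint();         /* residualizes: CONSTINT 3 / ADDINT */
        return 0;
    }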
Program Point Specialization
We define a program point as either:
–the entry of the program,
–the else-branch of a dynamic conditional branch, or
–the application of a recursive function.
We define a context as the state of the stack when we enter a program point.
Each specialized program point is associated with its residual code.
We maintain a cache of recursive function applications, each together with its context.

Algorithm
Input: a program point and a context.
Output: the corresponding specialized program.

    processed := {}
    pending   := {(pp0, cx0)}
    while pending != {} do
        (pp, cx) := one element of pending
        if (pp, cx) notin processed then
            specialize pp w.r.t. cx
            processed := processed + {(pp, cx)}
        pending := pending - {(pp, cx)}
    arrange processed

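A compact, self-contained rendering of this worklist in C (the pair encoding, the fixed set sizes, and the specialize stub are illustrative assumptions):

    #include <stdio.h>

    /* A (program point, context) pair, both encoded as small integers here. */
    typedef struct { int pp, cx; } Pair;

    #define MAX 256
    static Pair processed[MAX]; static int n_processed = 0;
    static Pair pending[MAX];   static int n_pending = 0;

    static int in_processed(Pair p) {
        for (int i = 0; i < n_processed; i++)
            if (processed[i].pp == p.pp && processed[i].cx == p.cx) return 1;
        return 0;
    }

    static void add_pending(Pair p) { if (n_pending < MAX) pending[n_pending++] = p; }

    /* Stub: specializing a program point may discover successor points. */
    static void specialize(Pair p) {
        printf("specialize pp=%d w.r.t. cx=%d\n", p.pp, p.cx);
        if (p.pp < 3) add_pending((Pair){ p.pp + 1, p.cx });  /* toy successor */
    }

    int main(void) {
        add_pending((Pair){ 0, 0 });             /* (pp0, cx0) */
        while (n_pending > 0) {
            Pair p = pending[--n_pending];       /* one element of pending */
            if (!in_processed(p)) {
                specialize(p);
                processed[n_processed++] = p;
            }
        }
        /* "arrange processed": order the residual blocks for emission. */
        return 0;
    }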
Notes
Non-termination is avoided by means of the cache of recursive function applications.
A BRANCHEND bytecode is added to make lifting easier in a dynamic context.
Branches with side effects are ruled dynamic.
The actual work is performed by the specialize function.

Conclusion
We have seen two recent run-time optimizing systems (PLDI’99 and PLDI’00).
DyC performs accurate optimizations thanks to:
–advanced specialization
–programmer annotations
Dynamo is faster than native code execution:
–despite the interpretive overhead
–while staying completely transparent
In a similar vein, we would like to:
–speed up ML interpretation
–while retaining portability