

How to select superinstructions for Ruby
ZAKIROV Salikh*, CHIBA Shigeru*, and SHIBAYAMA Etsuya**
* Tokyo Institute of Technology, dept. of Mathematical and Computing Sciences
** Tokyo University, Information Technology Center

Ruby
Dynamic language
Becoming popular recently
Numeric benchmarks 100-1000 times slower than equivalent program in C
[Chart: numeric benchmarks marked in red]

Interpreter optimization efforts
Many techniques to optimize interpreters have been proposed
  – Threaded interpretation
  – Stack top caching
  – Pipelining
  – Superinstructions (focus of this presentation)
Superinstructions
  – Merge code of operations executed in sequence

Superinstructions (contrived example)
Original operations:
    PUSH: // put argument on stack
        stack[sp++] = *pc++;
        goto **pc++;
    ADD: // add two topmost values on stack
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;
Dispatch eliminated:
    PUSH_ADD: // add to stack top
        stack[sp++] = *pc++;
        //goto **pc++;
        sp--;
        stack[sp-1] += stack[sp];
        goto **pc++;
Optimizations applied:
    PUSH_ADD: // add to stack top
        stack[sp-1] += *pc++;
        goto **pc++;
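The PUSH/ADD example can be tried out in a toy interpreter. The following is a minimal sketch, not code from Ruby or from the presentation: the opcode names, the `run` function and the dispatch counter are hypothetical, and a portable `switch` loop is assumed in place of the slide's `goto **pc++`, which relies on the GNU C "labels as values" extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack VM (not taken from a real interpreter). */
enum { OP_PUSH, OP_ADD, OP_PUSH_ADD, OP_HALT };
enum { STACK_MAX = 64 };

/* Runs `code` until OP_HALT and returns the value on top of the stack.
   The number of dispatches performed is stored in *dispatches. */
long run(const long *code, size_t *dispatches)
{
    long stack[STACK_MAX];
    size_t sp = 0;          /* index of the next free stack slot */
    const long *pc = code;  /* next opcode or inline operand */

    *dispatches = 0;
    for (;;) {
        ++*dispatches;      /* one dispatch per executed VM operation */
        switch (*pc++) {
        case OP_PUSH:       /* put argument on stack */
            assert(sp < STACK_MAX);
            stack[sp++] = *pc++;
            break;
        case OP_ADD:        /* add two topmost values on stack */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_PUSH_ADD:   /* superinstruction: add argument to stack top */
            assert(sp >= 1);
            stack[sp - 1] += *pc++;
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}

/* 1 + 2 + 3 with plain instructions: 6 dispatches including OP_HALT. */
const long plain[] = { OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_PUSH, 3, OP_ADD, OP_HALT };
/* Same program with each PUSH/ADD pair fused: 4 dispatches including OP_HALT. */
const long fused[] = { OP_PUSH, 1, OP_PUSH_ADD, 2, OP_PUSH_ADD, 3, OP_HALT };
```

Calling `run(plain, &n)` and `run(fused, &n)` gives the same result, while the fused program goes through the dispatch loop fewer times and never stores the intermediate operand on the stack.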

Superinstructions (effects)
1. Reduce dispatch overhead
   a. Eliminate some jumps
   b. Provide more context for the indirect branch predictor by replicating indirect jump instructions
2. Allow more optimizations within a VM op

Good for reducing dispatch overhead
Superinstructions help when:
  – VM operations are small (~10 hwop/vmop)
  – Dispatch overhead is high (~50%)
Examples of successful use in prior research
  – ANSI C interpreter: 2-3 times improvement (Proebsting 1995)
  – Ocaml: more than 50% improvement (Piumarta 1998)
  – Forth: 20-80% improvement (Ertl 2003)

Superinstructions help when:
  – VM operations are small (~10 hwop/vmop)
  – Dispatch overhead is high (~50%)
BUT Ruby does not fit well
Hardware profiling data on Intel Core 2 Duo
  – hardware ops per VM op
  – Only 1-3% misprediction overhead on interpreter dispatch

Superinstructions for Ruby
We experimentally evaluated the effect of “naive” superinstructions on Ruby
  – Superinstructions are selected statically
  – Combinations of length 2 that occur frequently in the training run are selected as superinstructions
  – Training run uses the same benchmark
  – Superinstructions are constructed by concatenating C source code, with C compiler optimizations applied
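The static selection step, picking the most frequent length-2 combinations from a training run, can be sketched as follows. This is an illustration under assumptions, not the tooling used in the study: the trace is assumed to be a flat array of small integer opcodes, `NUM_OPS`, `struct pair` and `select_pairs` are hypothetical names, and basic-block boundaries (which a real selector has to respect) are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OPS 4   /* hypothetical number of distinct opcodes in this sketch */

struct pair { int first, second; unsigned long count; };

/* Counts how often each ordered pair of adjacent opcodes occurs in `trace`
   and writes up to `want` of the most frequent pairs to `out`, most frequent
   first. Pairs that never occur are skipped. Returns the number written. */
size_t select_pairs(const int *trace, size_t len, struct pair *out, size_t want)
{
    unsigned long freq[NUM_OPS][NUM_OPS] = {{0}};
    size_t i, n = 0;

    /* Pass 1: build the pair frequency table from the training trace. */
    for (i = 0; i + 1 < len; i++) {
        assert(trace[i] >= 0 && trace[i] < NUM_OPS);
        assert(trace[i + 1] >= 0 && trace[i + 1] < NUM_OPS);
        freq[trace[i]][trace[i + 1]]++;
    }

    /* Pass 2: repeatedly take the most frequent remaining pair.
       Ties go to the pair found first in opcode order. */
    while (n < want) {
        int a, b, best_a = -1, best_b = -1;
        unsigned long best = 0;

        for (a = 0; a < NUM_OPS; a++)
            for (b = 0; b < NUM_OPS; b++)
                if (freq[a][b] > best) {
                    best = freq[a][b];
                    best_a = a;
                    best_b = b;
                }
        if (best == 0)
            break;                  /* no pairs left */
        out[n].first = best_a;
        out[n].second = best_b;
        out[n].count = best;
        freq[best_a][best_b] = 0;   /* do not select the same pair twice */
        n++;
    }
    return n;
}
```

Varying `want` corresponds to the "number of superinstructions used" parameter in the experiments that follow.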

Naive superinstructions effect on Ruby
[Chart, 4 benchmarks: normalized execution time vs. number of superinstructions used]
Limited benefit
Unpredictable effects

Branch mispredictions
[Chart, 2 benchmarks (mandelbrot and spectral_norm): normalized execution time vs. number of superinstructions used]

Branch mispredictions, reordered
[Chart, 2 benchmarks (mandelbrot and spectral_norm): normalized execution time vs. number of superinstructions used, reordered by execution time]

So why is Ruby slow?
Profile of numeric benchmarks
  – Garbage collection takes significant time
  – Boxed floating point values dominate allocation

Floating point value boxing
Typical Ruby 1.9 VM operation:
    OPT_PLUS:
        VALUE a = *(sp-2);
        VALUE b = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
            sp--;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
        } else {
            CALL(1/*argnum*/, PLUS, a);
        }
        goto **pc++;
New “box” object is allocated on each operation

Proposal: use superinstructions for boxing optimization
2 operations per allocation instead of 1
    OPT_MULT_OPT_PLUS:
        VALUE a = *(sp-3);
        VALUE b = *(sp-2);
        VALUE c = *(sp-1);
        /* ... */
        if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
            sp -= 2;
            *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
        } else {
            CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
            CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
        }
        goto **pc++;
Boxing of intermediate result eliminated
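The allocation saving can be made visible with a small stand-alone model. This sketch does not use the Ruby VM's types or macros: `struct box`, `new_float`, `mult`, `plus`, `mult_plus` and the `allocations` counter are hypothetical stand-ins, type checks and the fallback method call are omitted, and boxes are never freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical boxed float, standing in for a heap-allocated Float object. */
struct box { double value; };

unsigned long allocations = 0;  /* number of boxes created so far */

/* Allocates a new box holding v, as NEW_FLOAT does in the slide's pseudocode. */
struct box *new_float(double v)
{
    struct box *b = malloc(sizeof *b);
    assert(b != NULL);
    b->value = v;
    allocations++;
    return b;
}

/* Separate VM operations: each one boxes its own result. */
struct box *mult(const struct box *x, const struct box *y)
{
    return new_float(x->value * y->value);
}

struct box *plus(const struct box *x, const struct box *y)
{
    return new_float(x->value + y->value);
}

/* Superinstruction for a + b*c: the product stays an unboxed double,
   so only the final result is allocated. */
struct box *mult_plus(const struct box *a, const struct box *b, const struct box *c)
{
    return new_float(a->value + b->value * c->value);
}
```

Evaluating `plus(a, mult(b, c))` increases `allocations` by two, while `mult_plus(a, b, c)` increases it by one for the same value, which is the "2 operations per allocation" effect.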

Implementation
VM operations that handle floating point values directly:
  – opt_plus
  – opt_minus
  – opt_mult
  – opt_div
  – opt_mod
We implemented all 25 combinations of length 2
  – Based on Ruby
  – Using existing Ruby infrastructure for superinstructions with some modifications

Limitations
Coding style-sensitive
Not applicable to other types (e.g. Fixnum, Bignum, String)
  – Fixnum is already unboxed
  – Bignum and String cannot be unboxed
Sequences of 3 arithmetic instructions or longer virtually non-existent
  – No occurrences in the benchmarks

Evaluation
Methodology
  – median time of 30 runs
Reduction in allocation

Results
Up to 22% benefit on numeric benchmarks
No slowdown on other benchmarks

Example: mandelbrot tweak
    ITER.times do
    -   tr = zrzr - zizi + cr
    +   tr = cr + (zrzr - zizi)
    -   ti = 2.0*zr*zi + ci
    +   ti = ci + 2.0*zr*zi
Slight modification produces 20% difference in performance
  – 4 of 9 arithmetic instructions get merged into 2 superinstructions
  – 24% reduction in float allocation
[Chart: normalized execution time]
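Why the reordering matters can be seen from the order in which a stack VM executes each expression. The sketch below is a simplified model, not actual Ruby bytecode: it assumes plain left-to-right postfix compilation, uses a single hypothetical `LOAD` opcode for anything that pushes an operand, and leaves out the assignment. `count_mergeable` is a hypothetical helper that counts places where a length-2 arithmetic superinstruction would fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes: LOAD stands for any instruction that pushes an operand. */
enum { LOAD, MINUS, PLUS, MULT };

static int is_arith(int op)
{
    return op == MINUS || op == PLUS || op == MULT;
}

/* Counts non-overlapping pairs of adjacent arithmetic instructions,
   i.e. the places where a length-2 superinstruction could be used. */
size_t count_mergeable(const int *code, size_t len)
{
    size_t i = 0, pairs = 0;

    while (i + 1 < len) {
        if (is_arith(code[i]) && is_arith(code[i + 1])) {
            pairs++;
            i += 2;     /* both instructions are consumed by the superinstruction */
        } else {
            i++;
        }
    }
    return pairs;
}

/* zrzr - zizi + cr      ->  zrzr zizi - cr +      (operators separated by a load) */
const int tr_before[] = { LOAD, LOAD, MINUS, LOAD, PLUS };
/* cr + (zrzr - zizi)    ->  cr zrzr zizi - +      (operators adjacent) */
const int tr_after[]  = { LOAD, LOAD, LOAD, MINUS, PLUS };
/* 2.0*zr*zi + ci        ->  2.0 zr * zi * ci +    (no adjacent operators) */
const int ti_before[] = { LOAD, LOAD, MULT, LOAD, MULT, LOAD, PLUS };
/* ci + 2.0*zr*zi        ->  ci 2.0 zr * zi * +    (last * and + adjacent) */
const int ti_after[]  = { LOAD, LOAD, LOAD, MULT, LOAD, MULT, PLUS };
```

In this model each rewritten line gains one mergeable pair, so the two lines together account for 4 arithmetic instructions merged into 2 superinstructions.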

Discussion of alternative approaches
Faster GC would improve performance as well
  – Superinstructions still apply, but with reduced benefit
Type inference
  – Would allow specializing expressions and eliminating boxing
  – Interoperability with dynamic code is an issue
Dynamic specialization
  – Topic for further research

Related work: Tagged values
Use lower bits of pointers to trigger alternative handling
Embed floating point value into higher bits
Limited to 64-bit platforms, as Ruby uses double precision 64-bit floating point arithmetic
  – Our approach has the same effect on 32-bit and 64-bit platforms
Allows eliminating the majority of boxed floats
Provides 28-35% benefit (on the same benchmarks)
* Sasada 2008

Related work: Lazy boxing
Java-like language with generics over value types
Boxing needed to avoid duplication of template instantiation code for primitive types
Lazy optimization works by allocating boxed objects in the stack frame, and moving to heap as needed
Relies on static compiler analysis for escape path detection, and runtime checks
* Owen 2004

Related work: Superinstructions
Superinstructions used for code compression
  – ANSI C hybrid compiler-interpreter
  – Trimedia code compression system
Superinstructions chosen statically to minimize code size
Superinstructions used to reduce dispatch overhead
  – Forth, Ocaml
Superinstructions chosen dynamically
* Piumarta 1998
* Proebsting 1995
* Hoogerbrugge 1999
* Ertl 2003

Conclusion
Naive approach to superinstructions does not produce substantial benefit for Ruby
Floating point value boxing overhead is a problem for Ruby
Superinstructions provide some help (up to 22%)
Future work
Eliminate float boxing further
  – Specializing computation loop