Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters. Marc Berndl. Research supported by IBM CAS, NSERC, CITO.

Presentation transcript:

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
Marc Berndl, Benjamin Vitale, Mathew Zaleski, Angela Demke Brown
Research supported by IBM CAS, NSERC, CITO

Interpreter performance
- Why not just JIT?
- High-performance JVMs still interpret
- People use interpreted languages that don't yet have JITs
- They still want performance!
- 30-40% of execution time is due to branch misprediction
- Our technique eliminates 95% of branch mispredictions

Overview
- ✓ Motivation
- Background: The Context Problem
- Existing Solutions
- Our Approach
- Inlining
- Results

A Tale of Two Machines
[Diagram: a virtual machine (interpreter) beside a real machine (CPU). Virtual machine labels: virtual program, load, loaded program, bytecode bodies, execution cycle. Real machine labels: pipeline, execution cycle, predictors for return address, wayness (conditional) and target address (indirect).]

Interpreter
[Diagram: the loaded program (internal representation) and the bytecode bodies, driven by the execution cycle: fetch, dispatch, load parms, execute.]

Running Java Example

Java source:
    void foo(){
        int i=1;
        do{ i+=i; } while(i<64);
    }

Java bytecode (produced by the javac compiler):
    0: iconst_1
    1: istore_1
    2: iload_1
    3: iload_1
    4: iadd
    5: istore_1
    6: iload_1
    7: bipush 64
    9: if_icmplt 2
    12: return

Switched Interpreter

Virtual program (internal representation, indexed by vPC):
    … iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

Switched body implementation:
    while(1){
        opcode = *vPC++;
        switch(opcode){
            case iload_1: .. break;
            case iadd: .. break;
        }
    }

‣ Simple, portable and extremely slow

Switched Interpreter (continued)
    while(1){
        opcode = *vPC++;
        switch(opcode){
            case iload_1: .. break;
            case iadd: .. break;
            //and many more..
        }
    }

‣ Slow: burdened by switch and loop overhead
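The switched loop on this slide can be made concrete for the running example. The following is a minimal sketch, not the presenters' interpreter: the opcode numbering, the one-byte branch operand, the fixed-size operand stack, and returning local 1 so the result can be observed are all assumptions made for illustration. It starts from `int i=1` as in the Java source, hence `iconst_1`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch (not the real JVM encoding). */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* foo() from the running example. The branch operand is a one-byte offset
   relative to the if_icmplt opcode (assumption; the JVM uses two bytes). */
static const int8_t foo_code[] = {
    OP_ICONST_1, OP_ISTORE_1,                     /* int i = 1;      */
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1, /* i += i;         */
    OP_ILOAD_1, OP_BIPUSH, 64,                    /* push i, push 64 */
    OP_IF_ICMPLT, -7,                             /* while (i < 64)  */
    OP_RETURN
};

/* Switch dispatch: every virtual instruction goes through the same
   loop, the same switch and the same dispatch branch. */
static int run_switched(const int8_t *code) {
    int32_t stack[8];
    int32_t *sp = stack;        /* points at the next free slot */
    int32_t local1 = 0;
    const int8_t *vpc = code;

    for (;;) {
        const int8_t *insn = vpc;   /* address of the current opcode */
        switch (*vpc++) {
        case OP_ICONST_1: *sp++ = 1;            break;
        case OP_ISTORE_1: local1 = *--sp;       break;
        case OP_ILOAD_1:  *sp++ = local1;       break;
        case OP_IADD:     sp--; sp[-1] += sp[0]; break;
        case OP_BIPUSH:   *sp++ = *vpc++;       break;
        case OP_IF_ICMPLT: {
            int32_t b = *--sp;
            int32_t a = *--sp;
            int8_t off = *vpc++;
            if (a < b) vpc = insn + off;        /* taken: back to the loop head */
            break;
        }
        case OP_RETURN:
            return local1;   /* foo() is void; local 1 is returned for inspection */
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Calling `run_switched(foo_code)` walks the loop until `i` reaches 64 and returns that value.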

"Threading" Dispatch

Virtual program:
    0: iconst_1
    1: istore_1
    2: iload_1
    3: iload_1
    4: iadd
    5: istore_1
    6: iload_1
    7: bipush 64
    9: if_icmplt 2
    12: return

Bodies:
    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

‣ No switch overhead. Dispatch is a data-driven indirect branch: execution of the virtual program "threads" through the bodies (as in needle & thread)

Context Problem

Virtual program:
    0: iconst_1
    1: istore_1
    2: iload_1
    3: iload_1
    4: iadd
    5: istore_1
    6: iload_1
    7: bipush 64
    9: if_icmplt 2
    12: return

Bodies:
    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

[Diagram: the dispatch branches feed the indirect branch predictor (micro-arch)]
‣ Data-driven indirect branches are hard to predict

Direct Threaded Interpreter

Virtual program:
    … iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

DTT - Direct Threading Table (indexed by vPC):
    &&iload_1
    &&iadd
    &&istore_1
    &&iload_1
    &&bipush
    64
    &&if_icmplt
    -7

C implementation of each body:
    iload_1: .. goto *vPC++;
    iadd: .. goto *vPC++;
    istore: .. goto *vPC++;

‣ Target of computed goto is data-driven
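A direct-threaded version of the running example can be sketched with the labels-as-values extension (`&&label` and `goto *ptr`) supported by GCC and Clang; it is not standard C. This is an illustration under assumptions, not the SableVM implementation: operands are stored inline in the table as pointer-sized integers, the `DISPATCH` macro name is hypothetical, and local 1 is returned only so the result can be observed.

```c
#include <assert.h>
#include <stdint.h>

static int run_direct_threaded(void) {
    /* DTT: one slot per opcode or operand. Opcodes are body addresses;
       operands are integers stored in pointer-sized slots (assumption). */
    void *dtt[] = {
        &&iconst_1, &&istore_1,
        &&iload_1, &&iload_1, &&iadd, &&istore_1,
        &&iload_1, &&bipush, (void *)(intptr_t)64,
        &&if_icmplt, (void *)(intptr_t)-7,   /* offset relative to the if_icmplt slot */
        &&do_return
    };
    int32_t stack[8];
    int32_t *sp = stack;        /* points at the next free slot */
    int32_t local1 = 0;
    void **vpc = dtt;

    assert(sizeof dtt / sizeof dtt[0] == 12);

/* Every body ends with its own copy of this indirect branch. */
#define DISPATCH() goto *(*vpc++)

    DISPATCH();

iconst_1:  *sp++ = 1;                         DISPATCH();
istore_1:  local1 = *--sp;                    DISPATCH();
iload_1:   *sp++ = local1;                    DISPATCH();
iadd:      sp--; sp[-1] += sp[0];             DISPATCH();
bipush:    *sp++ = (int32_t)(intptr_t)*vpc++; DISPATCH();
if_icmplt: {
    void **insn = vpc - 1;                    /* slot holding &&if_icmplt */
    int32_t b = *--sp;
    int32_t a = *--sp;
    intptr_t off = (intptr_t)*vpc++;
    if (a < b) vpc = insn + off;              /* taken: back to the loop head */
    DISPATCH();
}
do_return:
    return local1;   /* foo() is void; local 1 is returned for inspection */

#undef DISPATCH
}
```

The target of each `goto *` comes from the table, which is why the slide calls it data-driven: the same dispatch branch in `iload_1` jumps to a different body depending on where the virtual program currently is.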

Context Problem

Virtual program (DTT, indexed by vpc):
    ... iload iadd istore ...

Direct threaded bytecode bodies (native code):
    iload: goto *vpc
    iadd: goto *vpc
    istore: goto *vpc
    ...

[Diagram: the dispatch branches feed the indirect branch predictor (micro-arch)]
‣ The pc of the dispatch branch is insufficient context for prediction

Context Problem

DTT - Direct Threading Table (indexed by vPC):
    &&iload_1
    &&iadd
    &&istore_1
    &&iload_1
    &&bipush
    64
    &&if_icmplt
    -7

Body:
    iload_1: .. goto *vPC++;

[Diagram: the dispatch branch in iload_1 and the indirect branch predictors, labelled iload_1, iadd, bipush]
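The effect can be counted with a toy model. The sketch below assumes a very simple predictor that remembers one last-seen target per dispatch branch (that is, one per body); real indirect branch predictors are more elaborate, so the numbers only illustrate the mechanism. All names are hypothetical, and operands are omitted because only the sequence of bodies matters here.

```c
#include <assert.h>

/* One id per bytecode body; under direct threading each body ends in
   its own dispatch branch. */
enum { ICONST_1, ISTORE_1, ILOAD_1, IADD, BIPUSH, IF_ICMPLT, RETURN, NBODIES };

typedef struct {
    int dispatches;     /* dispatch branches executed          */
    int misses;         /* predicted target != actual target   */
    int iload_misses;   /* misses charged to iload_1's branch  */
} BtbStats;

static BtbStats simulate_last_target_btb(void) {
    /* The running example as a sequence of bodies (operands dropped). */
    static const int prog[] = {
        ICONST_1, ISTORE_1,
        ILOAD_1, ILOAD_1, IADD, ISTORE_1,
        ILOAD_1, BIPUSH, IF_ICMPLT,
        RETURN
    };
    const int loop_head = 2;        /* index of the first iload_1 in the loop */
    int last_target[NBODIES];
    BtbStats s = {0, 0, 0};
    int pc = 0;
    int i = 1;                      /* value of the Java variable i */

    for (int b = 0; b < NBODIES; b++)
        last_target[b] = -1;        /* no prediction yet: first use misses */

    while (prog[pc] != RETURN) {
        int body = prog[pc];
        int next = pc + 1;
        if (body == IADD) i += i;                        /* effect of i += i  */
        if (body == IF_ICMPLT && i < 64) next = loop_head; /* taken branch    */

        int target = prog[next];    /* body this dispatch branch jumps to */
        assert(target >= 0 && target < NBODIES);

        s.dispatches++;
        if (last_target[body] != target) {
            s.misses++;
            if (body == ILOAD_1) s.iload_misses++;
        }
        last_target[body] = target; /* predictor is keyed by branch pc only */
        pc = next;
    }
    return s;
}
```

Under these assumptions the loop runs six times and 24 of the 44 dispatches mispredict, 18 of them in `iload_1`, whose single dispatch branch is followed in turn by `iload_1`, `iadd` and `bipush`. The remaining six are first-use misses plus the loop exit.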

Existing Solutions
- Piumarta & Riccardi: bodies replicated
- Ertl & Gregg: bodies and dispatch replicated
[Diagram: a body ending in GOTO *PC with an unknown target (????); superinstructions and replication, shown as replicated copies of iload_1, each ending in its own goto *pc]
‣ Limited to relocatable virtual instructions

Overview
- ✓ Motivation
- ✓ Background: The Context Problem
- ✓ Existing Solutions
- Our Approach
- Inlining
- Results

Key Observation
- Virtual and native control flow are similar:
- Linear or straight-line code
- Conditional branches
- Calls and returns
- Indirect branches
- Hardware has predictors for each type
- Direct threading uses an indirect branch for everything!
‣ Solution: Leverage hardware predictors

Key Observation
- Virtual and native control flow have the same branch types:
- Linear (not really a branch)
- Conditional
- Calls and returns
- Indirect
- Hardware has predictors for each type
‣ Solution: Leverage hardware predictors

Essence of our Solution

Virtual program:
    … iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT - Context Threading Table (generated code):
    call iload_1
    call iadd
    call istore_1
    call iload_1
    ..

Bytecode bodies (ret terminated):
    iload_1: .. ret;
    iadd: .. ret;

[Diagram: the calls and returns use the return branch predictor stack]
‣ Package bodies as subroutines and call them

Subroutine Threading

Virtual program:
    … iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 …

CTT (code generated at load time):
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt

Bytecode bodies (ret terminated):
    iload_1: … ret;
    iadd: … ret;

Virtual branch body:
    if_icmplt: … goto *vPC++;

The DTT contains addresses in the CTT (indexed by vPC).
‣ Virtual branch instructions are dispatched as before

Virtual Branches

Context Threading Table:
    …
    target: …
    call …
    call iload_1
    call if_icmplt
    …

Virtual branch body:
    if(icmplt) pc=target; goto *vPC++;

[Diagram: DTT entry (labelled 5), indexed by vPC, pointing at target in the CTT]
‣ Context problem remains for virtual branches

The Context Threading Table
- A sequence of generated call instructions
- Good alignment of virtual and hardware control flow for straight-line code
‣ Can virtual branches go into the CTT?
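To make "a sequence of generated call instructions" concrete, here is a sketch of how such a table could be filled for x86, where a near call is the byte 0xE8 followed by a 32-bit little-endian displacement measured from the end of the instruction. The slides do not show the actual generator, so the function names, the choice of x86, and the plain integer addresses are assumptions; the sketch only writes bytes into a buffer and never executes them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write `call rel32` at buf[off]. `base` is the address the buffer will
   have at run time, `target` the address of the body to call.
   Returns the offset just past the emitted instruction. */
static size_t emit_call(uint8_t *buf, size_t off, uint64_t base, uint64_t target) {
    uint64_t next = base + off + 5;            /* address after this call */
    int64_t rel = (int64_t)(target - next);
    assert(rel >= INT32_MIN && rel <= INT32_MAX);  /* must fit in rel32 */

    uint32_t r = (uint32_t)(int32_t)rel;
    buf[off]     = 0xE8;                       /* x86 near call opcode */
    buf[off + 1] = (uint8_t)(r & 0xFF);        /* displacement, little-endian */
    buf[off + 2] = (uint8_t)((r >> 8) & 0xFF);
    buf[off + 3] = (uint8_t)((r >> 16) & 0xFF);
    buf[off + 4] = (uint8_t)((r >> 24) & 0xFF);
    return off + 5;
}

/* Emit one call per virtual instruction, in program order.
   bodies[k] is the (hypothetical) address of the k-th instruction's body. */
static size_t build_ctt(uint8_t *buf, uint64_t base,
                        const uint64_t *bodies, size_t n) {
    size_t off = 0;
    for (size_t k = 0; k < n; k++)
        off = emit_call(buf, off, base, bodies[k]);
    return off;                                /* bytes written */
}
```

For example, with the table at 0x1000 and bodies at 0x2000 and 0x3000, the first call is encoded as E8 FB 0F 00 00, since 0x2000 - 0x1005 = 0xFFB. A real generator would also need executable memory and, on other architectures such as PPC, a different instruction encoding.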

Specialized Branch Inlining

Branch inlined into the CTT:
    …
    target: …
    call …
    call iload_1
    if(icmplt) goto target
    …

[Diagram: DTT entry (labelled 5), indexed by vPC, pointing at target in the CTT]
The conditional branch predictor is now mobilized.
‣ Inlining conditional branches provides context
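The shape of the generated code can be imitated in plain C. In this sketch, offered as an analogy rather than the generated code itself, each body is a small function (the "ret terminated" subroutine), the table for the running example is a function that calls them in program order, and the virtual conditional branch becomes a native `if ... goto`. The function names and the global operand stack are assumptions, and an optimizing C compiler may inline the bodies, which real load-time code generation would not do by itself.

```c
#include <assert.h>
#include <stdint.h>

/* Interpreter state shared by the bodies (hypothetical layout). */
static int32_t stack[8];
static int32_t *sp = stack;     /* points at the next free slot */
static int32_t local1;

/* Bytecode bodies packaged as subroutines. */
static void iconst_1(void)    { *sp++ = 1; }
static void istore_1(void)    { local1 = *--sp; }
static void iload_1(void)     { *sp++ = local1; }
static void iadd(void)        { sp--; sp[-1] += sp[0]; }
static void bipush(int8_t v)  { *sp++ = v; }

/* Comparison part of if_icmplt; the branch itself lives in the table. */
static int icmplt(void) {
    int32_t b = *--sp;
    int32_t a = *--sp;
    return a < b;
}

/* C analogue of the CTT for foo(): direct calls for straight-line code,
   and the virtual branch inlined as a native conditional branch. */
static int ctt_foo(void) {
    sp = stack;
    local1 = 0;

    iconst_1();
    istore_1();
target:
    iload_1();
    iload_1();
    iadd();
    istore_1();
    iload_1();
    bipush(64);
    if (icmplt()) goto target;   /* predicted by the conditional predictor */

    assert(sp == stack);         /* operand stack is balanced at return */
    return local1;               /* foo() is void; returned for inspection */
}
```

Here every call site and the one conditional branch has its own native address, so the hardware's return-address and conditional predictors see the same control flow as the virtual program.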

Tiny Inlining
- Context Threading is a dispatch technique
- But, we inline branches
- Some non-branching bodies are very small
- Why not inline those?
‣ Inline all tiny linear bodies into the CTT

What can go in the CTT?
- Calls to bodies
- Inlined bodies
- Mixed-mode virtual machine?
‣ Performance?

Overview
- ✓ Motivation
- ✓ Background: The Context Problem
- ✓ Existing Solutions
- ✓ Our Approach
- ✓ Inlining
- Results

Experimental Setup
- Two virtual machines on two hardware architectures
- VM: Java/SableVM, OCaml interpreter
- Compare against direct threaded SableVM
- SableVM distro uses selective inlining
- Arch: P4, PPC
- Measured: branch misprediction, execution time
‣ Is our technique effective and general?

Mispredicted Taken Branches
[Chart: mispredicted taken branches, normalized to direct threading; SableVM/Java, Pentium 4]
‣ 95% of mispredictions eliminated on average

Execution Time
[Chart: execution time, normalized to direct threading; Pentium 4]
‣ 27% average reduction in execution time

Execution Time (geomean)
[Chart: geometric mean of execution time, normalized to direct threading]
‣ Our technique is effective and general

Conclusions
- Context Problem: branch mispredictions due to mismatch between native and virtual control flow
- Solution: Generate control flow code into the Context Threading Table
- Results: eliminate 95% of branch mispredictions; reduce execution time by 30-40%
‣ Recent work (after CGO 2005) follows

What about Scripting Languages?
- Recently ported context threading to TCL
- 10x cycles executed per bytecode dispatched
- Much lower dispatch overhead
- Speedup due to subroutine threading: approx. 5%
- TCL conference 2005
[Chart: cycles per virtual instruction]