Improving Java performance using Dynamic Method Migration on FPGAs

E. Lattanzi (1), A. Gayasen (2), M. Kandemir (2), V. Narayanan (2), L. Benini (3), and A. Bogliolo (1)
(1) STI, University of Urbino, 61029 Urbino, Italy
(2) DCSE, Penn State University, University Park, PA 16802
(3) DEIS, University of Bologna, 40136 Bologna, Italy

Outline
- Motivations and contribution
- Previous work
- The proposed approach
  - System architecture
  - Dynamic method migration
  - Communication and synchronization issues
- Experimental results
- Conclusions

Motivations
- In 2007 Java will be the dominant terminal platform in the wireless sector (over 450 million handsets will support Java).
- The use of interpreters to implement the JVM in embedded devices makes Java execution performance a limiting factor for real-time applications.

Our Contribution
- We propose and analyze a complete run-time environment based on a microprocessor coupled with an FPGA coprocessor that supports efficient shared-memory communication.
- We focus on speeding up Java applications by executing computation-intensive code segments on the reconfigurable hardware.

Java optimization strategies
- JIT ("Just-In-Time") compilers: dynamically translate bytecode to the machine's native code
- Java hardware accelerators: execute Java code natively (aJile's JemCore, Sun's PicoJava, Arm's Jazelle, Nazomi's JSTAR, etc.)

Java and reconfigurable hardware: related work
- Fleishmann et al., 1999. The execution of computation-intensive methods is committed to an FPGA directly coupled with the CPU. Communication is based on the Java Native Interface (JNI), which introduces sizeable data-transfer overhead.
- Serra et al., 2002. Reconfigurable hardware is used to execute single Java bytecodes. The fine-grained interaction between hardware and software raises communication issues that can limit the effectiveness of the solution.

Java run-time environment: overview
[Diagram labels: Java method, pre-compiled libraries, dynamic translation, interpreter, JIT, configuration byte-stream, compiled code, processor, FPGA]

System architecture
[Diagram labels: CPU, FPGA, shared bus, main memory, shared data memory, shared configuration memory]

Dynamic method migration
- The JVM controls the migration:
  - collects usage statistics on method utilization
  - implements a dynamic policy to select which methods are mapped to hardware
  - triggers hardware mapping
  - handles run-time switching between software and hardware execution
- The heat of each method drives the mapping:
  - the heat is obtained by counting the fetched bytecodes belonging to a method each time the method is executed
  - the hottest method is the first candidate for hardware mapping
- Method mapping requirements:
  - the method must be hardware mappable (i.e., either synthesizable or pre-synthesized)
  - all objects used by the method must be allocated in shared memory
  - the method must be non-recursive
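
The heat-driven selection policy can be illustrated with a minimal Java sketch. It is not code from the modified KVM: the class names `MethodInfo` and `HeatProfiler`, the method names, and the idea that the three mapping requirements are available as precomputed flags are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical description of a method; the three flags mirror the mapping
// requirements listed on the slide and are assumed to be known to the JVM.
final class MethodInfo {
    final String name;
    final boolean hardwareMappable;      // synthesizable or pre-synthesized
    final boolean usesOnlySharedObjects; // all objects live in shared memory
    final boolean recursive;

    MethodInfo(String name, boolean hardwareMappable,
               boolean usesOnlySharedObjects, boolean recursive) {
        this.name = name;
        this.hardwareMappable = hardwareMappable;
        this.usesOnlySharedObjects = usesOnlySharedObjects;
        this.recursive = recursive;
    }

    boolean eligibleForHardware() {
        return hardwareMappable && usesOnlySharedObjects && !recursive;
    }
}

// Hypothetical profiler: accumulates heat per method and picks a candidate.
final class HeatProfiler {
    private final Map<MethodInfo, Long> heat = new HashMap<>();

    // Called after each execution of a method with the number of bytecodes
    // fetched during that execution.
    void recordExecution(MethodInfo method, long fetchedBytecodes) {
        heat.merge(method, fetchedBytecodes, Long::sum);
    }

    long heatOf(MethodInfo method) {
        return heat.getOrDefault(method, 0L);
    }

    // The hottest method that satisfies all mapping requirements, if any.
    Optional<MethodInfo> nextCandidate() {
        return heat.entrySet().stream()
                .filter(e -> e.getKey().eligibleForHardware())
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

In this sketch a recursive method is never returned by `nextCandidate()`, however hot it is, because it fails the eligibility check before the heat comparison.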

Timing diagram of method migration

Coprocessor interface
The interface between the JVM and a hardware-mapped method must:
- grant access to shared objects
- pass input parameters
- return output parameters

Shared memory: reducing communication overhead
- The JVM can use both the main heap, allocated in main memory, and a shared heap, allocated in the shared memory.
- Shared objects are allocated directly on the shared heap when a "new" opcode is encountered.
- Shared objects are made accessible to the FPGA by providing pointers to their positions in the shared heap.
- Input and output objects are allocated in the shared heap, while primitive-type parameters (e.g., integer, double) are passed directly to the FPGA by writing to specific memory-mapped registers.
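
A small Java model may help make the two parameter-passing paths concrete. Everything here is an assumption for illustration: the shared memory is modeled as a `ByteBuffer`, allocation is a simple bump allocator, a "pointer" is a byte offset into that buffer, and `SharedHeap` and `CoprocessorRegisters` are hypothetical names rather than structures from the actual system.

```java
import java.nio.ByteBuffer;

// Hypothetical shared heap: objects are laid out in a buffer visible to both
// the CPU and the FPGA, and are identified by their byte offset.
final class SharedHeap {
    private final ByteBuffer memory;
    private int next = 0; // bump-allocation cursor

    SharedHeap(int sizeBytes) {
        memory = ByteBuffer.allocate(sizeBytes);
    }

    // Stores an int array as [length, elements...] and returns its offset,
    // which plays the role of the pointer handed to the FPGA.
    int allocateIntArray(int[] values) {
        int needed = Integer.BYTES * (values.length + 1);
        if (next + needed > memory.capacity()) {
            throw new IllegalStateException("shared heap exhausted");
        }
        int pointer = next;
        memory.putInt(pointer, values.length);
        for (int i = 0; i < values.length; i++) {
            memory.putInt(pointer + Integer.BYTES * (i + 1), values[i]);
        }
        next += needed;
        return pointer;
    }

    int lengthAt(int pointer) {
        return memory.getInt(pointer);
    }

    int readInt(int pointer, int index) {
        return memory.getInt(pointer + Integer.BYTES * (index + 1));
    }
}

// Hypothetical memory-mapped register file used for primitive parameters
// and for object pointers.
final class CoprocessorRegisters {
    private final int[] registers = new int[8];

    void write(int index, int value) {
        registers[index] = value;
    }

    int read(int index) {
        return registers[index];
    }
}
```

With this model, an object argument costs one register write (its offset) regardless of its size, which is the point of allocating shared objects directly on the shared heap instead of copying them.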

FPGA/CPU synchronization
- Synchronization is based on mutually exclusive access to the shared memory.
- When all input parameters have been provided to the FPGA, the JVM grants shared-memory control to the FPGA and enables hardware computation.
- The FPGA returns shared-memory control to the CPU as soon as it completes execution.
- During FPGA computation the CPU keeps executing in parallel until it needs to access the shared memory (e.g., to get the results back from the FPGA).
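
The ownership hand-off can be sketched in Java with a thread standing in for the FPGA. This is only a software analogy under stated assumptions: `SharedMemoryArbiter`, `MigrationDemo`, and the use of Java monitors are illustrative choices, whereas the real system arbitrates access in hardware.

```java
// Hypothetical arbiter: exactly one side owns the shared memory at a time.
final class SharedMemoryArbiter {
    enum Owner { CPU, FPGA }

    private Owner owner = Owner.CPU;

    // CPU side: hand the shared memory to the FPGA once inputs are written.
    synchronized void grantToFpga() {
        if (owner != Owner.CPU) {
            throw new IllegalStateException("CPU does not own the shared memory");
        }
        owner = Owner.FPGA;
        notifyAll();
    }

    // FPGA side: give the shared memory back when the computation is done.
    synchronized void returnToCpu() {
        if (owner != Owner.FPGA) {
            throw new IllegalStateException("FPGA does not own the shared memory");
        }
        owner = Owner.CPU;
        notifyAll();
    }

    // CPU side: block only when the shared memory is actually needed.
    synchronized void awaitCpuOwnership() {
        boolean interrupted = false;
        while (owner != Owner.CPU) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized Owner owner() {
        return owner;
    }
}

final class MigrationDemo {
    // shared[1] and shared[2] are inputs; the "FPGA" writes their sum to shared[0].
    static int runOnCoprocessor(int[] shared) {
        SharedMemoryArbiter arbiter = new SharedMemoryArbiter();
        arbiter.grantToFpga();

        Thread fpga = new Thread(() -> {
            shared[0] = shared[1] + shared[2];
            arbiter.returnToCpu();
        });
        fpga.start();

        // The CPU keeps working on data outside the shared memory.
        long unrelated = 0;
        for (int i = 0; i < 1000; i++) {
            unrelated += i;
        }

        // The CPU stalls here only if the FPGA has not finished yet.
        arbiter.awaitCpuOwnership();
        return shared[0];
    }
}
```

The CPU blocks in `awaitCpuOwnership()` rather than right after `grantToFpga()`, which mirrors the slide's point that the CPU continues in parallel until it touches the shared memory.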

Platform implementation
- We built a full-system simulation environment on top of Virtutech Simics:
  - system-level instruction-set simulator
  - hardware control (PLI)
  - run-time statistics
- Simulated machine:
  - complete system based on Pentium II pro
  - Linux RedHat 6.0 (kernel 2.2.18)
  - Java KVM (Java Kilo Virtual Machine)

FPGA modeling and parameter characterization
- The bytecode of the method to be mapped in hardware is used directly as the functional specification for the hardware device.
- A stack-oriented Java processor is encapsulated in a Simics module representing the reconfigurable device.
- Hardware performance and configuration time were modeled by means of three parameters:
  - configuration-cycles-per-bytecode
  - execution-cycles-per-bytecode
  - shared-memory-access-time
- Parameters were characterized by performing real experiments on a Xilinx Virtex2 FPGA.
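
The slides name the three parameters but do not say how they are combined. The Java sketch below assumes the simplest linear combination, with the shared-memory access time expressed in cycles; the class `HardwareCostModel`, its methods, and the break-even test are hypothetical and are not the authors' simulator model.

```java
// Hypothetical linear cost model built from the three named parameters.
final class HardwareCostModel {
    private final long configCyclesPerBytecode;  // configuration-cycles-per-bytecode
    private final long execCyclesPerBytecode;    // execution-cycles-per-bytecode
    private final long sharedMemoryAccessCycles; // shared-memory-access-time, in cycles (assumed unit)

    HardwareCostModel(long configCyclesPerBytecode,
                      long execCyclesPerBytecode,
                      long sharedMemoryAccessCycles) {
        this.configCyclesPerBytecode = configCyclesPerBytecode;
        this.execCyclesPerBytecode = execCyclesPerBytecode;
        this.sharedMemoryAccessCycles = sharedMemoryAccessCycles;
    }

    // Assumed: configuration time grows linearly with the static method size.
    long configurationCycles(long methodSizeInBytecodes) {
        return configCyclesPerBytecode * methodSizeInBytecodes;
    }

    // Assumed: execution time is the sum of per-bytecode and per-access costs.
    long executionCycles(long executedBytecodes, long sharedMemoryAccesses) {
        return execCyclesPerBytecode * executedBytecodes
                + sharedMemoryAccessCycles * sharedMemoryAccesses;
    }

    // Assumed break-even test: migrating pays off when the software cost
    // exceeds the one-time configuration plus the hardware execution cost.
    boolean worthMigrating(long softwareCycles, long methodSizeInBytecodes,
                           long executedBytecodes, long sharedMemoryAccesses) {
        long hardwareCycles = configurationCycles(methodSizeInBytecodes)
                + executionCycles(executedBytecodes, sharedMemoryAccesses);
        return softwareCycles > hardwareCycles;
    }
}
```

For example, with made-up values of 100 configuration cycles per bytecode, 2 execution cycles per bytecode, and 10 cycles per shared-memory access, a 50-bytecode method that executes 1000 bytecodes and makes 20 shared accesses costs 5000 + 2200 = 7200 cycles in hardware under this model.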

Experimental results: speedup

Sensitivity analysis: changing CPU frequency

Sensitivity analysis: simulation parameters

Conclusions
- We proposed a coprocessor-based architecture for speeding up Java execution by means of dynamic method migration on an FPGA.
- Our platform reduces communication overhead through dedicated hardware support (shared memory and non-blocking run-time configuration) and through a modified Java run-time support system.
- A Xilinx Virtex2 FPGA was used to characterize the simulation parameters.
- Experimental results, based on pessimistic assumptions, show that the proposed architecture provides an average speedup of 35% on benchmark execution time.