Improving Java Performance Using Dynamic Method Migration on FPGAs
E. Lattanzi (1), A. Gayasen (2), M. Kandemir (2), V. Narayanan (2), L. Benini (3), and A. Bogliolo (1)
(1) STI, University of Urbino, 61029 Urbino, Italy
(2) DCSE, Penn State University, 16802 University Park, PA
(3) DEIS, University of Bologna, 40136 Bologna, Italy
Outline
- Motivations and contribution
- Previous work
- The proposed approach
- System architecture
- Dynamic method migration
- Communication and synchronization issues
- Experimental results
- Conclusions
Motivations
- In 2007 Java will be the dominant terminal platform in the wireless sector (over 450 million handsets will support Java).
- The use of interpreters to implement the JVM in embedded devices makes Java execution performance a limiting factor for real-time applications.
Our Contribution
- We propose and analyze a complete run-time environment based on a microprocessor coupled with an FPGA coprocessor supporting efficient shared-memory communication.
- We focus on enhancing the speed of Java applications by executing computation-intensive code segments on the reconfigurable hardware.
Java optimization strategies
- JIT ("Just-In-Time") compiler: dynamically translates bytecode to the machine's native code
- Java hardware accelerators: execute Java code natively (aJile's JemCore, Sun's picoJava, ARM's Jazelle, Nazomi's JSTAR, etc.)
Java and reconfigurable hardware: related work
- Fleischmann et al. 1999. The execution of computation-intensive methods is delegated to an FPGA directly coupled with the CPU. Communication is based on the Java Native Interface (JNI), which introduces sizeable data-transfer overhead.
- Serra et al. 2002. Reconfigurable hardware is used to execute single Java bytecodes. The fine-grained interaction between hardware and software raises communication issues that can limit the effectiveness of the solution.
Java run-time environment: overview
[Diagram; block labels: Java Method, Pre-compiled libraries, Dynamic Translation, Interpreter, JIT, Configuration byte-stream, Compiled Code, Processor, FPGA]
System architecture
[Diagram; block labels: CPU, FPGA, Shared bus, Main memory, Shared data memory, Shared configuration memory]
Dynamic method migration
The JVM controls the migration:
- Collects method usage statistics
- Implements a dynamic policy to select which methods are to be mapped in hardware
- Triggers hardware mapping
- Handles run-time switching between software and hardware execution
The heat of each method drives mapping:
- The heat is obtained by counting the number of fetched bytecodes belonging to a method each time the method is executed
- The hottest method is the first candidate for hardware mapping
Method mapping requirements:
- The method must be hardware mappable (i.e., either synthesizable or pre-synthesized)
- All the objects used by the method must be allocated in shared memory
- The method must be non-recursive
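The heat-driven selection above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code of the modified KVM: the class `HeatProfiler`, the record `MethodInfo`, and all their fields and methods are hypothetical names. The sketch also assumes that methods failing any of the three mapping requirements are skipped when looking for the hottest candidate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical bookkeeping for a heat-driven mapping policy (illustrative only). */
class HeatProfiler {
    /** Per-method record; every field name here is an assumption. */
    static final class MethodInfo {
        final String name;
        final boolean hardwareMappable;      // synthesizable or pre-synthesized
        final boolean usesOnlySharedObjects; // all objects live in shared memory
        final boolean recursive;
        long heat;                           // bytecodes fetched so far

        MethodInfo(String name, boolean hardwareMappable,
                   boolean usesOnlySharedObjects, boolean recursive) {
            this.name = name;
            this.hardwareMappable = hardwareMappable;
            this.usesOnlySharedObjects = usesOnlySharedObjects;
            this.recursive = recursive;
        }

        /** True when all three mapping requirements hold. */
        boolean eligible() {
            return hardwareMappable && usesOnlySharedObjects && !recursive;
        }
    }

    private final Map<String, MethodInfo> methods = new HashMap<>();

    void register(MethodInfo m) {
        methods.put(m.name, m);
    }

    /** Called after each execution with the number of bytecodes fetched for it. */
    void recordExecution(String name, long fetchedBytecodes) {
        MethodInfo m = methods.get(name);
        if (m != null) {
            m.heat += fetchedBytecodes;
        }
    }

    /** Returns the hottest eligible method, or empty if none has executed yet. */
    Optional<MethodInfo> hottestCandidate() {
        MethodInfo best = null;
        for (MethodInfo m : methods.values()) {
            if (m.eligible() && m.heat > 0 && (best == null || m.heat > best.heat)) {
                best = m;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

In this sketch a recursive method can accumulate heat but is never returned as a candidate, which mirrors the non-recursion requirement.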
Timing diagram of method migration
Coprocessor interface
The interface between the JVM and a HW-mapped method must:
- grant access to shared objects
- pass input parameters
- return output parameters
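One way to picture these three duties is as a small Java interface. The sketch below is hypothetical: the names `CoprocessorInterface`, `shareObject`, `setInput`, `start`, and `getOutput` are not from the original system, and `SumArrayStub` is a pure software stand-in for the FPGA that sums an integer array stored in a simulated shared memory, included only so the example can run.

```java
/** Hypothetical JVM-side view of a hardware-mapped method. */
interface CoprocessorInterface {
    /** Grants access to a shared object by passing its offset in shared memory. */
    void shareObject(int slot, int sharedOffset);

    /** Passes a primitive input parameter. */
    void setInput(int index, int value);

    /** Enables the computation. */
    void start();

    /** Returns the output parameter once the computation has finished. */
    int getOutput();
}

/** Software stand-in for the FPGA: sums an int array held in shared memory. */
class SumArrayStub implements CoprocessorInterface {
    private final int[] sharedMemory; // simulated shared data memory
    private int arrayOffset;          // slot 0: where the array starts
    private int length;               // input 0: number of elements
    private int result;

    SumArrayStub(int[] sharedMemory) {
        this.sharedMemory = sharedMemory;
    }

    @Override
    public void shareObject(int slot, int sharedOffset) {
        if (slot != 0) throw new IllegalArgumentException("only slot 0 is used");
        arrayOffset = sharedOffset;
    }

    @Override
    public void setInput(int index, int value) {
        if (index != 0) throw new IllegalArgumentException("only input 0 is used");
        length = value;
    }

    @Override
    public void start() {
        int sum = 0;
        for (int i = 0; i < length; i++) {
            sum += sharedMemory[arrayOffset + i];
        }
        result = sum;
    }

    @Override
    public int getOutput() {
        return result;
    }
}
```

A caller would share the array's offset, pass its length, call `start()`, and then read the sum with `getOutput()`.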
Shared memory: reducing communication overhead
- The JVM can use either the main heap, allocated in main memory, or a shared heap, allocated in the shared memory
- Shared objects are allocated directly on the shared heap when a "new" opcode is encountered
- Shared objects are made accessible to the FPGA by providing pointers to their positions in the shared heap
- Input and output objects are allocated in the shared heap, while primitive-type parameters (e.g., integer, double) are passed directly to the FPGA by writing to specific memory-mapped registers
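The two-heap allocation scheme can be illustrated with a simple bump allocator. This is a sketch under assumptions, not the KVM allocator: `TwoHeapAllocator` and `Ref` are hypothetical names, heap sizes are counted in abstract words, and the decision of whether an object is shared is passed in by the caller as a flag because the slides do not say how the JVM makes that decision.

```java
/** Hypothetical allocator choosing between a main heap and a shared heap. */
class TwoHeapAllocator {
    /** Reference to an allocated object: which heap it is in and at what offset. */
    static final class Ref {
        final boolean shared;
        final int offset; // for shared objects, the pointer handed to the FPGA

        Ref(boolean shared, int offset) {
            this.shared = shared;
            this.offset = offset;
        }
    }

    private final int mainCapacity;
    private final int sharedCapacity;
    private int mainTop = 0;
    private int sharedTop = 0;

    TwoHeapAllocator(int mainWords, int sharedWords) {
        this.mainCapacity = mainWords;
        this.sharedCapacity = sharedWords;
    }

    /** Models a "new" opcode: bump-allocates in the heap selected by the flag. */
    Ref allocate(int words, boolean shared) {
        if (words <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        if (shared) {
            if (sharedTop + words > sharedCapacity) {
                throw new IllegalStateException("shared heap full");
            }
            Ref r = new Ref(true, sharedTop);
            sharedTop += words;
            return r;
        }
        if (mainTop + words > mainCapacity) {
            throw new IllegalStateException("main heap full");
        }
        Ref r = new Ref(false, mainTop);
        mainTop += words;
        return r;
    }
}
```

Because a shared object is placed in the shared heap from the start, nothing has to be copied when the method later runs in hardware; only the offset is handed over.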
FPGA/CPU synchronization
- Synchronization is based on mutually exclusive access to the shared memory.
- When all input parameters have been provided to the FPGA, the JVM grants shared-memory control to the FPGA and enables hardware computation.
- The FPGA returns shared-memory control to the CPU as soon as it completes execution.
- During FPGA computation the CPU keeps executing in parallel until it needs to access the shared memory (e.g., to get the results back from the FPGA).
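The ownership hand-off can be mimicked in software with two Java threads and a monitor. The sketch below is an analogy only: in the real platform the arbitration is done in hardware, and the names `SharedMemoryArbiter`, `Owner`, and `MigrationDemo` are hypothetical. A thread plays the FPGA and sums the inputs, while the calling thread plays the CPU.

```java
/** Hypothetical software model of mutually exclusive shared-memory ownership. */
class SharedMemoryArbiter {
    enum Owner { CPU, FPGA }

    private Owner owner = Owner.CPU;

    /** The JVM hands the shared memory to the FPGA and enables computation. */
    synchronized void grantToFpga() {
        owner = Owner.FPGA;
        notifyAll();
    }

    /** The FPGA returns the shared memory when it has finished. */
    synchronized void releaseToCpu() {
        owner = Owner.CPU;
        notifyAll();
    }

    /** Blocks the caller until the given side owns the shared memory. */
    synchronized void awaitOwnership(Owner who) throws InterruptedException {
        while (owner != who) {
            wait();
        }
    }
}

class MigrationDemo {
    /** Sums the input on a simulated FPGA thread and returns the result. */
    static int run(int[] input) {
        final SharedMemoryArbiter arbiter = new SharedMemoryArbiter();
        final int[] shared = new int[input.length + 1]; // last word holds the result
        System.arraycopy(input, 0, shared, 0, input.length);

        Thread fpga = new Thread(() -> {
            try {
                arbiter.awaitOwnership(SharedMemoryArbiter.Owner.FPGA);
                int sum = 0;
                for (int i = 0; i < shared.length - 1; i++) {
                    sum += shared[i];
                }
                shared[shared.length - 1] = sum;
                arbiter.releaseToCpu();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fpga.start();

        arbiter.grantToFpga(); // inputs are in place: enable the hardware
        long privateWork = 0;  // CPU keeps running on data outside shared memory
        for (int i = 0; i < 1000; i++) {
            privateWork += i;
        }
        try {
            // The CPU blocks only when it needs the shared memory again.
            arbiter.awaitOwnership(SharedMemoryArbiter.Owner.CPU);
            fpga.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for FPGA", e);
        }
        return shared[shared.length - 1];
    }
}
```

The loop between `grantToFpga()` and `awaitOwnership(...)` stands for the work the CPU can overlap with the hardware computation.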
Platform implementation
- We built a full-system simulation environment on top of Virtutech Simics: system-level instruction-set simulator, hardware control (PLI), run-time statistics
- Simulated machine: complete system based on Pentium II pro, Linux RedHat 6.0 (kernel 2.2.18), Java KVM (Java Kilo Virtual Machine)
FPGA modeling and parameter characterization
- The bytecode of the method to be mapped in hardware is used directly as the functional specification for the hardware device
- A stack-oriented Java processor is encapsulated in a Simics module representing the reconfigurable device
- Hardware performance and configuration time were modeled by means of three parameters: configuration-cycles-per-bytecode, execution-cycles-per-bytecode, and shared-memory-access-time
- Parameters were characterized by performing real experiments on a Xilinx Virtex2 FPGA
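To make the role of the three parameters concrete, the following sketch combines them into a simple linear cost model. The slides name the parameters but do not give the formulas, so the linear form, the class name `HwCostModel`, the method names, and the choice of expressing the shared-memory access time in cycles per access are all assumptions. The numeric values in the usage note are placeholders, not measured Virtex2 figures.

```java
/** Hypothetical linear cost model built from the three simulation parameters. */
class HwCostModel {
    final long configCyclesPerBytecode; // configuration-cycles-per-bytecode
    final long execCyclesPerBytecode;   // execution-cycles-per-bytecode
    final long sharedMemAccessCycles;   // shared-memory-access-time, in cycles (assumed unit)

    HwCostModel(long configCyclesPerBytecode, long execCyclesPerBytecode,
                long sharedMemAccessCycles) {
        this.configCyclesPerBytecode = configCyclesPerBytecode;
        this.execCyclesPerBytecode = execCyclesPerBytecode;
        this.sharedMemAccessCycles = sharedMemAccessCycles;
    }

    /** Assumed: configuration time grows linearly with the method's static size. */
    long configurationCycles(long methodSizeInBytecodes) {
        return configCyclesPerBytecode * methodSizeInBytecodes;
    }

    /** Assumed: execution time is per-bytecode cost plus shared-memory accesses. */
    long executionCycles(long executedBytecodes, long sharedMemAccesses) {
        return execCyclesPerBytecode * executedBytecodes
                + sharedMemAccessCycles * sharedMemAccesses;
    }
}
```

For example, with placeholder values `new HwCostModel(10, 2, 5)`, configuring a 100-bytecode method would cost 1000 cycles in this model, and one invocation executing 50 bytecodes with 4 shared-memory accesses would cost 120 cycles.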
Experimental results: speedup
Sensitivity analysis: changing CPU frequency
Sensitivity analysis: simulation parameters
Conclusions
- We proposed a coprocessor-based architecture for speeding up Java execution by means of dynamic method migration on an FPGA
- Our platform reduces communication overhead through dedicated hardware support (shared memory and non-blocking run-time configuration) and through a modified Java run-time support system
- A Xilinx Virtex2 FPGA was used to characterize the simulation parameters
- Experimental results, based on pessimistic assumptions, show that the proposed architecture provides an average speedup of 35% on benchmark execution time