Just-in-Time Compilation for FPGA Processor Cores This work was supported in part by the National Science Foundation (CNS1016792) and by the Semiconductor.

Presentation transcript:

Just-in-Time Compilation for FPGA Processor Cores
Andrew Becker (1), Scott Sirowy (2), Frank Vahid
Department of Computer Science and Engineering, University of California, Riverside
{abecker | ssirowy |
(1) Now at EPFL. (2) Now at ESRI.
This work was supported in part by the National Science Foundation (CNS1016792) and by the Semiconductor Research Corporation (GRC ).

Slide 2 of 20: Motivation
- SystemC is a useful capture language: concurrency, structure, timing
- Simulation is typical, but in-system I/O is often useful (switches/LEDs, cameras/displays)
- Design/synthesis to FPGA may take hours or days and require advanced tools
Figure labels: Simulation; In-system I/O

Slide 3 of 20: Background
- Want rapid design iteration with in-system I/O
- Compile design description; avoid design/synthesis
- Previously: hybrid approach using SystemC bytecode (Portable SystemC-on-a-chip, Sirowy [CODES+ISSS ’09])
Figure labels: SystemC Code; Compiler; Bytecode; Simulator (no in-system I/O); Design/synthesis (time-consuming); …
SystemC code:
    class CLK_GEN : public sc_module { sc_in clock; … CLK_GEN(){ …
Bytecode:
    process(clock)
    READ $1 dataRdy
    BGT $1 $0 Start
    J Done
    Start: ADDI $2 $2 1
    ADDI $3 $0 7
    …

Slide 4 of 20: Background
- Emulate bytecode in an engine on the FPGA
- Fast compilation
- Bytecode is also portable (FPGA-device independent)
- Portable SystemC-on-a-chip, Sirowy [CODES+ISSS ’09]
Figure labels: Compiler; Bytecode; FPGA; Emulation Engine; In-system I/O
SystemC code:
    class CLK_GEN : public sc_module { sc_in clock; … CLK_GEN(){ …
Bytecode:
    process(clock)
    READ $1 dataRdy
    BGT $1 $0 Start
    J Done
    Start: ADDI $2 $2 1
    ADDI $3 $0 7
    …
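
To make "emulate bytecode" concrete, here is a minimal interpreter sketch for the bytecode fragment shown on the slide. The instruction semantics are assumptions inferred from the mnemonics (READ loads a signal into a register, BGT branches if the first register is greater than the second, J jumps, ADDI adds an immediate), and the trailing `Done:` label is added for illustration because the slide elides the rest of the process. This is not the authors' emulator, which the slides describe as C code on a soft-core.

```python
# Hypothetical bytecode, modeled on the slide fragment; "Done:" is an added label.
SOURCE = """
READ $1 dataRdy
BGT $1 $0 Start
J Done
Start: ADDI $2 $2 1
ADDI $3 $0 7
Done:
"""


def assemble(lines):
    """Split source lines into instruction tuples and a label -> index map."""
    program, labels = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ":" in line:
            label, line = line.split(":", 1)
            labels[label.strip()] = len(program)
            line = line.strip()
            if not line:
                continue  # a label on its own line marks the next index
        program.append(tuple(line.split()))
    return program, labels


def interpret(program, labels, signals):
    """Decode and execute one instruction at a time (assumed semantics)."""
    regs = {"$0": 0}  # $0 is assumed to hold zero
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "READ":      # READ reg signal
            regs[args[0]] = signals[args[1]]
            pc += 1
        elif op == "BGT":     # BGT a b label: branch if a > b
            a, b, target = args
            pc = labels[target] if regs.get(a, 0) > regs.get(b, 0) else pc + 1
        elif op == "J":       # J label
            pc = labels[args[0]]
        elif op == "ADDI":    # ADDI dst src imm
            dst, src, imm = args
            regs[dst] = regs.get(src, 0) + int(imm)
            pc += 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs


program, labels = assemble(SOURCE.splitlines())
ready = interpret(program, labels, {"dataRdy": 1})
idle = interpret(program, labels, {"dataRdy": 0})
```

Every executed instruction pays for the opcode dispatch and operand decoding in the loop body, which is the per-instruction overhead the later slides set out to reduce.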

Slide 5 of 20: Emulation Engine
- Discrete event simulator
- C code on a processor (currently MicroBlaze soft-core; could be hard-core)
- Support circuits for architectural features, peripheral I/O
Block diagram labels: Processor Core; Event Kernel; Instruction Mem.; Read Signal Memory; Write Signal Memory; Peripheral Bus; UART; LEDs; Buttons; Frame Buffer
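
As a rough illustration of what a discrete event simulator does, the sketch below implements a generic delta-cycle kernel: signal writes are queued, applied together in an update phase, and any process sensitive to a changed signal is then run. The class and method names (`EventKernel`, `add_process`, and so on) are hypothetical, and the scheme is the textbook one rather than the event kernel from the slides.

```python
from collections import deque


class EventKernel:
    """Generic delta-cycle kernel (illustrative; not the engine from the slides)."""

    def __init__(self):
        self.signals = {}       # signal name -> current value
        self.sensitive = {}     # signal name -> processes woken when it changes
        self.pending = deque()  # queued (signal, value) writes

    def add_signal(self, name, value=0):
        self.signals[name] = value
        self.sensitive[name] = []

    def add_process(self, proc, sensitivity):
        for name in sensitivity:
            self.sensitive[name].append(proc)

    def read(self, name):
        return self.signals[name]

    def write(self, name, value):
        # Writes take effect in the next update phase, not immediately.
        self.pending.append((name, value))

    def run(self, max_deltas=100):
        deltas = 0
        while self.pending and deltas < max_deltas:
            # Update phase: apply queued writes and collect woken processes.
            woken = []
            while self.pending:
                name, value = self.pending.popleft()
                if self.signals[name] != value:
                    self.signals[name] = value
                    for proc in self.sensitive[name]:
                        if proc not in woken:
                            woken.append(proc)
            # Evaluate phase: run each woken process once.
            for proc in woken:
                proc(self)
            deltas += 1
        return deltas


def led_follows_button(kernel):
    kernel.write("led", kernel.read("button"))


def count_led_changes(kernel):
    kernel.write("toggles", kernel.read("toggles") + 1)


kernel = EventKernel()
for name in ("button", "led", "toggles"):
    kernel.add_signal(name)
kernel.add_process(led_follows_button, ["button"])
kernel.add_process(count_led_changes, ["led"])
kernel.write("button", 1)  # stand-in for a peripheral input event
deltas = kernel.run()
```

In the engine on the slide, the processes would be bytecode routines and the signal values would live in the read and write signal memories; here plain Python callables and a dictionary stand in for both.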

Slide 6 of 20: Caveat Emptor
- Emulation is slow
- On a soft-core, it is even slower than PC simulation
- Won't meet many real-time constraints

Slide 7 of 20: This work: speed up the emulator
- First analyzed emulator performance

Slide 8 of 20: Low-Hanging Fruit
- 69% of time spent emulating bytecode
- Two strategies to reduce this:
  - Reduce each instruction’s emulation time
  - Reduce instruction memory latency

Slide 9 of 20: First Step
- Reduce instruction emulation time
- Optimize event kernel?
Block diagram labels: Processor Core; Event Kernel; Instruction Mem.; Read Signal Memory; Write Signal Memory; Peripheral Bus; UART; LEDs; Buttons; Frame Buffer

Slide 10 of 20: First Step
- Reduce instruction emulation time
- Optimize event kernel?
- Just-in-time (JIT) compile bytecode to native processor code, done transparently by the event kernel
Block diagram labels: Processor Core; Event Kernel; Instruction Mem.; Read Signal Memory; Write Signal Memory; Peripheral Bus; UART; LEDs; Buttons; Frame Buffer

Slide 11 of 20: Just-in-Time Compilation of Bytecode
- Implemented SystemC-bytecode to MicroBlaze JIT compiler
- 3x speedup; still portable
- Tunable delay/jitter
- Still want more speed
Figure labels: Bytecode; JIT; Machine Code; Emulation Engine; Event Kernel
Bytecode:
    process(clock)
    READ $1 dataRdy
    BGT $1 $0 Start
    J Done
    Start: ADDI $2 $2 1
    ADDI $3 $0 7
    …
Machine code:
    IMM 0xDEAD
    LWI $11 $0 0xBEEF
    BGTI $11 Start
    BRAI Done
    Start: …
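
The idea of compiling once and then running the compiled form can be sketched as follows. Each bytecode instruction is translated a single time into a Python closure with its operands already decoded, and the result is cached per process. The closures only stand in for native MicroBlaze code, the instruction semantics are the same assumptions inferred from the mnemonics, and `JitEngine`, `compile_process`, and the `Done` label are hypothetical names; the actual JIT compiler is not shown in the slides.

```python
# Hypothetical pre-assembled form of the slide's bytecode fragment.
PROGRAM = [
    ("READ", "$1", "dataRdy"),
    ("BGT", "$1", "$0", "Start"),
    ("J", "Done"),
    ("ADDI", "$2", "$2", "1"),
    ("ADDI", "$3", "$0", "7"),
]
LABELS = {"Start": 3, "Done": 5}  # "Done" is assumed to mark the end


def compile_process(program, labels):
    """Translate each instruction once into a closure returning the next pc."""

    def make(pc, instr):
        op, *args = instr
        if op == "READ":
            reg, sig = args

            def step(regs, signals):
                regs[reg] = signals[sig]
                return pc + 1
        elif op == "BGT":
            a, b, target = args
            taken = labels[target]  # label resolved at compile time

            def step(regs, signals):
                return taken if regs.get(a, 0) > regs.get(b, 0) else pc + 1
        elif op == "J":
            target = labels[args[0]]

            def step(regs, signals):
                return target
        elif op == "ADDI":
            dst, src, imm = args[0], args[1], int(args[2])

            def step(regs, signals):
                regs[dst] = regs.get(src, 0) + imm
                return pc + 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
        return step

    return [make(pc, instr) for pc, instr in enumerate(program)]


class JitEngine:
    """Compiles a process on first use, then reuses the cached translation."""

    def __init__(self):
        self.cache = {}
        self.compilations = 0

    def run(self, name, program, labels, signals):
        code = self.cache.get(name)
        if code is None:
            code = self.cache[name] = compile_process(program, labels)
            self.compilations += 1
        regs, pc = {"$0": 0}, 0
        while pc < len(code):
            pc = code[pc](regs, signals)  # no opcode dispatch at run time
        return regs


engine = JitEngine()
first = engine.run("clk_gen", PROGRAM, LABELS, {"dataRdy": 1})
second = engine.run("clk_gen", PROGRAM, LABELS, {"dataRdy": 0})
```

The caller sees the same interface before and after compilation, which mirrors the slide's point that the translation is done transparently by the event kernel and the bytecode stays portable.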

Slide 12 of 20: Further Improvement
- Reduce instruction memory latency
- Add a dedicated small, fast memory for JIT code on a fast, local bus
- Unique JIT possibility due to FPGA configurability

Slide 13 of 20: Architecture Changes
Block diagram labels: Emulation Engine; Processor Core; Local Memory Bus; JIT Mem.; Instr. Mem.; Read Signal Memory; Write Signal Memory; Peripheral Bus; UART; LEDs; Buttons; Frame Buffer

Slide 14 of 20: Even Further Improvement
- 23% of time spent maintaining signal queue
- What can be done?
- Optimize signal queue maintenance code?

Slide 15 of 20: Common Denominator
- FPGA offers configurability
- Engine designer can make tradeoffs
- Trade hardware resources for speed
Figure labels: FPGA; Emulation Engine; Extra Resources

Slide 16 of 20: Common Denominator
- FPGA offers configurability
- Engine designer can make tradeoffs
- Trade hardware resources for speed
- Add another soft-core?
Figure labels: FPGA; Emulation Engine; Extra Resources

Slide 17 of 20: Even Further Improvement
- 23% of time spent maintaining signal queue
- What can be done?
- Optimize signal queue maintenance code?
- Offload the job to a coprocessor
- Again, a unique JIT option due to FPGA configurability

Slide 18 of 20: Architecture Changes
Block diagram labels: Emulation Engine; Processor Core; Local Memory Bus; JIT Mem.; Signal Queue; Emulation Memory Controller; Instr. Mem.; Read Signal Memory; Write Signal Memory; Peripheral Bus; UART; LEDs; Buttons; Frame Buffer

Slide 19 of 20: Experimental Results

Slide 20 of 20: Conclusions
- Approach: rapid design iteration with in-system I/O
- Uses:
  - Education (typically loose timing constraints)
  - System prototypes that can tolerate real-time slowdown (e.g., slow frame rate)
- Portable and flexible
- Engine design sets speed, not compiler or CAD flow
- This work: 15x speedup via normal JIT (3x) + FPGA-specific JIT (5x)
- But still orders of magnitude slower than design/synthesis
- Future work: bytecode accelerators, JIT synthesis