Online SystemC Emulation Acceleration

Presentation transcript:

Online SystemC Emulation Acceleration
Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering, University of California, Riverside
{ssirowy,chuang,vahid}@cs.ucr.edu
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Office of Naval Research

Introduction: Prototyping Circuits and Systems
SystemC
  C++ based
  Creation, instantiation, and connection of components
  Precisely timed communication and execution among concurrently executing components (processes)
  Supports both "software" and "hardware" constructs and semantics
[Figure: Edge Detector connected to a Memory Controller through address, data, and go signals. Inputs s1 s2 s3 s4 s6 s7 s8 s9 feed a network of adders and subtractors, followed by a MIN with 255 that yields the Pixel Value. Label: Capture in HDL]
Given the task of designing a particular system or circuit, edge detection in this case, a designer may wish to proceed with a prototype in one of many ways. This might include implementing the design in C or C++, or implementing it as a circuit using VHDL or Verilog. An alternative approach is to use SystemC. SystemC is a set of freely available libraries built on top of the C++ language that allow a designer to capture parallel-executing, spatial applications. SystemC supports a precisely timed communication model that gives a designer control of spatial connectivity and timing at a fine granularity. Further, SystemC allows a designer to capture both software and hardware constructs in one unified description. Because SystemC was primarily intended for PC-based execution, a designer wishing to use SystemC as an early prototyping language is limited to simulation on a PC. Ideally, though, a designer would be able to execute their SystemC description in-system, fully interacting with input and output. A designer could go about this in one of many ways, but perhaps the easiest way is to use in-system emulation.
    class EDGE_DETECTOR : public sc_module {
      //signal declarations
      …
      EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;
        SC_METHOD(getPixel);
        sensitive << clock.pos();

Introduction: Prototyping Circuits and Systems
In-System Emulation
  Quickly-obtained simulation interaction with real I/O
  Prior to time-consuming mapping and synthesis
  But slower
[Figure: the Edge Detector and Memory Controller diagram and the EDGE_DETECTOR SystemC listing from the previous slide, with the labels Capture in HDL and Emulation]
In-system emulation has a number of benefits, including interaction with real input and output, and no initial need for time-consuming, difficult-to-use synthesis and mapping tools. The tradeoff is some performance degradation, but in an initial prototyping scenario such degradation may be tolerable.

SystemC Emulation Engine
Earlier created [CODES 2009]
Emulation Engine Kernel (Event Kernel and Virtual Machine)
  Virtual Machine
  Discrete Event Kernel
  Peripheral Access and Hooks
Real I/O Peripherals
  Representative of many systems
[Figure: Emulation Engine block diagram. Main Processor, Instruction Memory, SystemC bytecode, Input Memory, Output Memory, Read Signal Memory, Write Signal Memory, USB Download Interface. I/O Peripherals: USB Interface, UART, Buttons, LEDs]

Emulation Engine Acceleration
For some SystemC applications, emulation can be slow
  An Edge Detection circuit required ~10 minutes to process a 320x240 image*
* on a 100 MHz/SRAM MicroBlaze SystemC Emulation Engine implementation
[Figure: Emulation Engine block diagram. Main Processor, Instruction Memory, SystemC bytecode, Input Memory, Output Memory, Read Signal Memory, Write Signal Memory, USB Interface, UART, Buttons, LEDs]
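To put the edge-detection figure in perspective, the per-pixel cost can be estimated from the numbers on the slide. The sketch below is only a back-of-the-envelope calculation: it assumes "~10 minutes" means exactly 600 seconds and that the processor runs at a constant 100 MHz, and the function name is hypothetical.

```cpp
#include <cassert>

// Rough estimate of processor cycles spent per pixel.
// Assumption: the run time is exact and the clock rate is constant.
long long cyclesPerPixel(long long seconds, long long clockHz,
                         long long width, long long height) {
    assert(width > 0 && height > 0);
    return seconds * clockHz / (width * height);
}
```

With 600 s, 100,000,000 Hz, and a 320x240 image (76,800 pixels), this gives 781,250 cycles, or about 7.8 ms, per pixel.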

Emulation Engine Acceleration
If available, use platform FPGA to create bytecode accelerators
  Execute SystemC bytecode natively
FPGA accelerators can speed up emulation by over 100X*
* on a 100 MHz MicroBlaze SystemC Emulation Engine implementation
[Figure: Emulation Engine block diagram with Accelerator 1, Accelerator 2, and Accelerator 3 added]
For the common situation where the SystemC emulation engine executes next to available FPGA resources, we developed the concept of a SystemC bytecode accelerator. A SystemC bytecode accelerator executes SystemC bytecode as its native instruction set, greatly reducing the time required to execute one (or more) SystemC bytecode instructions. An additional advantage is that an emulation architecture can support multiple SystemC bytecode accelerators, enabling parallel execution of multiple SystemC processes in a given circuit. This behavior is more reflective of the actual circuit and achieves greater performance. Accelerators show great promise, yet due to area constraints we are limited to a finite number of them. This raises the question of how best to use the accelerators on a given platform so that performance increases. Can we do some clever scheduling of the SystemC circuit? Can we statically analyze our circuit to optimally load and unload accelerators? It turns out we can't.

Emulation Engine Acceleration Management
Emulation platforms have dynamically changing inputs
  Compared to simulation platforms
  Prevents a static analysis of system to improve performance
[Figure: Image Processing System (Edge, Blur, Sharp, Sketch, Emboss, Dalton, and Cartoon filters) feeding differing Process Queues into the Emulation Engine, which has Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
Because a SystemC application is now running in an emulation environment, it is likely that multiple users could alter the execution of certain processes within a given SystemC application. Considering a simple image processing system, one user's input might bias certain filters to always execute, another user might always use one filter, etc. Because of this, statically analyzing the process queue affords little in terms of optimizing how best t

Emulation Engine Acceleration Management
Emulate on microprocessor, or accelerate using a bytecode accelerator?
  Available Accelerators
  Accelerator Loading Overhead
  Communication Overhead
[Figure: Process Queue (Edge, Blur, …, ?) arriving at the Emulation Engine with Accelerator 1, Accelerator 2, and Accelerator 3. Chart of Total Execution Time for three options: Emulate every process on microprocessor, Accelerate Every Process (including FPGA Communication and Loading Overhead), and Dynamically Manage Accelerators]
We really have only two choices: execute the SystemC bytecode on the microprocessor, or execute it using an accelerator. Naively, the choice is obvious: accelerate every process all the time, without considering emulating the process on the microprocessor.

Emulation Engine Acceleration Management
Online Decision (decision based only on past and current inputs)
History Table
  Columns: Process, # of uses, Currently on Accelerator
  Processes: Edge, Emboss, Mean, Blur, Radial
  # of uses: 8, 4, 3, 5, 2
  Currently on Accelerator: Yes / No
[Figure: Process Queue (?, Edge, Blur, Blur, Edge, Edge, …) arriving at the Emulation Engine, which decides whether to accelerate each process on Accelerator 1, Accelerator 2, or Accelerator 3 of the FPGA]
Our solution involved an online algorithm, first developed by Huang for use with partially reconfigurable regions on an FPGA. An online algorithm decides at runtime which action to take, based only on current and past inputs. The algorithm maintains a history table to help decide whether loading a process onto an accelerator is advisable.

Online Accelerator Management
Process Queue table (columns: Process, Emulation (ms), Accelerator (ms), Loading Time), values as shown on the slide: p1 80 10 70; p2 60 20; p3 25 30
Available Acceleration Engines: Accelerator 1, Accelerator 2
Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.
[Figure: timelines over Time (ms) for uP Only and Statically Preloaded, running p1, p2, and p3]

Online Accelerator Management
[Figure: timelines over Time (ms) for uP Only, Statically Preloaded, Greedy Online Schedule, and Better Online Schedule on Accelerator 1 and Accelerator 2, with Loading Time segments labeled 3 -> 2, 2 -> 1, 1 -> 3, 3 -> 2, and 3 -> 1]

Aggregate Gain Solution: AG Table
                 p1    p2    p3
Emulation_only   200   100   50
Emulation+Acc    10    20    25
Gain             190   80    25
Gain = Emulation only - (Emulation + Accelerator)
Maintain a gain table for each process in the SystemC circuit: ag(i) = ag(i) + gain(i)
Fading process for temporal locality: ag(i) = ag(i) * f
How to define fading factor f?

Q = <p1, p1, p3, p2, p2, p1, p3>, no fading
Event   ag(1)   ag(2)   ag(3)
p1      190
p1      380
p3      380             25
p2      380     80      25
p2      380     160     25
p1      570     160     25
p3      570     160     50

Q = <p1, p1, p3, p2, p2, p1, p3>, F = 0.5
Event   ag(1)   ag(2)   ag(3)
p1      190
p1      285 (190*F + 190)
p3      142             25
p2      71      80      12
p2      35      120     6
p1      207     60      3
p3      103     30      26
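The two update rules on this slide can be sketched in a few lines of C++. This is an illustration reconstructed from the slide's numbers, not the authors' implementation: the names AgTable, onEvent, and runQueue are hypothetical, and two details are assumptions chosen to match the F = 0.5 trace (every entry fades on every queue event before the running process is credited, and values are truncated to integers).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Aggregate-gain table with fading (hypothetical names).
struct AgTable {
    std::vector<int> gain;  // gain(i) = emulation_only - (emulation + accelerator)
    std::vector<int> ag;    // aggregate gain ag(i), initially 0
    double f;               // fading factor; 1.0 disables fading

    AgTable(std::vector<int> gains, double fading)
        : gain(std::move(gains)), ag(gain.size(), 0), f(fading) {}

    // Called when process i is taken from the process queue.
    void onEvent(std::size_t i) {
        assert(i < ag.size());
        // Assumption: all entries fade on each event, truncated to int.
        for (int& a : ag) a = static_cast<int>(a * f);
        ag[i] += gain[i];  // ag(i) = ag(i) + gain(i)
    }
};

// Replays a whole process queue and returns the final ag values.
std::vector<int> runQueue(AgTable table, const std::vector<std::size_t>& queue) {
    for (std::size_t p : queue) table.onEvent(p);
    return table.ag;
}
```

Replaying Q = <p1, p1, p3, p2, p2, p1, p3> (indices 0, 0, 2, 1, 1, 0, 2) with gains 190, 80, 25 ends at 570, 160, 50 when f = 1.0 and at 103, 30, 26 when f = 0.5, which agrees with the slide's final values.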

AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
  Load: ag(i) > Overhead
                 Edge   Blur
Emulation_only   100    200
Emulation+Acc    20     10
Gain             80     190
[Figure: Process Queue (Edge, Blur, Blur, Edge, Edge, …) at the Emulation Engine with Accelerator 1 on the FPGA; ag(Edge) = 80, ag(Blur) not yet set]

AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
  Load: ag(i) > Overhead
  Replace: ag(i) > Overhead + ag(j) (j is the Acc. to be replaced)
  Wait: ag(i) > Overhead + ag(j) + wait_time(j)
                 Edge   Blur
Emulation_only   100    200
Emulation+Acc    20     10
Gain             80     190
[Figure: Process Queue (Blur, Blur, Edge, Edge, …) at the Emulation Engine with Edge loaded on Accelerator 1; ag(Edge) = 80, ag(Blur) = 190]
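The three policies can be read as one decision function. The sketch below is a hedged interpretation rather than the presented implementation: it assumes Load applies to a free accelerator, Replace to an occupied but idle one, and Wait to one that is still executing its current process. The type and function names are hypothetical, and the communication overhead is passed in as a number because the slide does not define Dλ.

```cpp
#include <cassert>

enum class Action { Emulate, Load, Replace, Wait };

// State of the accelerator under consideration (assumed classification).
enum class Slot { Free, OccupiedIdle, OccupiedBusy };

struct Candidate {
    int ag;            // aggregate gain ag(i) of the process about to run
    int loadingTime;   // LT
    int commOverhead;  // CO, supplied by the caller
};

// victimAg is ag(j) of the process currently on the accelerator;
// victimWait is wait_time(j). Both are ignored for a free slot.
Action decide(const Candidate& c, Slot slot, int victimAg, int victimWait) {
    assert(c.loadingTime >= 0 && c.commOverhead >= 0);
    const int overhead = c.loadingTime + c.commOverhead;  // Overhead = LT + CO
    switch (slot) {
        case Slot::Free:
            return c.ag > overhead ? Action::Load : Action::Emulate;
        case Slot::OccupiedIdle:
            return c.ag > overhead + victimAg ? Action::Replace : Action::Emulate;
        case Slot::OccupiedBusy:
            return c.ag > overhead + victimAg + victimWait ? Action::Wait
                                                           : Action::Emulate;
    }
    return Action::Emulate;  // not reached; silences compiler warnings
}
```

For example, with an illustrative overhead of 70 ms, Edge (ag = 80) is loaded onto a free accelerator, and Blur (ag = 190) replaces an idle Edge because 190 > 70 + 80. If Edge were still running with 50 ms left, Blur would be emulated instead.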

Comparison Solutions
Base Emulation: Emulating every process on main processor
Infinite Accelerators: Accelerating every process w/o loading overhead
[Figure: Emulation Engine block diagram with Accelerator 1, Accelerator 2, Accelerator 3, …, Accelerator n on the FPGA]

Comparison Solutions
Static preloaded: Each accelerator is statically assigned a process to execute when on the event queue and never changes
Greedy: Always assign the current process on the event queue to an accelerator
[Figure: Emulation Engine block diagram with Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
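The four comparison solutions can be contrasted with a small sequential cost model. This is a minimal sketch under stated assumptions: processes run one at a time, communication overhead is ignored, Greedy starts with empty accelerators, and Greedy evicts in round-robin order (the slides do not say which accelerator it replaces). All names and timing values are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Proc { int emu; int acc; int load; };  // times in ms (illustrative)
using Queue = std::vector<std::size_t>;       // process indices in arrival order

// Base Emulation: every process runs on the main processor.
int baseEmulation(const std::vector<Proc>& p, const Queue& q) {
    int t = 0;
    for (std::size_t i : q) t += p[i].emu;
    return t;
}

// Infinite Accelerators: every process accelerated, no loading overhead.
int infiniteAccelerators(const std::vector<Proc>& p, const Queue& q) {
    int t = 0;
    for (std::size_t i : q) t += p[i].acc;
    return t;
}

// Static preloaded: only the preloaded processes are ever accelerated.
int staticPreloaded(const std::vector<Proc>& p, const Queue& q, const Queue& preloaded) {
    int t = 0;
    for (std::size_t i : q) {
        const bool onAcc =
            std::find(preloaded.begin(), preloaded.end(), i) != preloaded.end();
        t += onAcc ? p[i].acc : p[i].emu;
    }
    return t;
}

// Greedy: always accelerate; on a miss, pay the loading time first.
int greedy(const std::vector<Proc>& p, const Queue& q, std::size_t slots) {
    assert(slots > 0);
    Queue loaded;            // process currently held by each accelerator
    std::size_t victim = 0;  // round-robin eviction pointer (assumption)
    int t = 0;
    for (std::size_t i : q) {
        if (std::find(loaded.begin(), loaded.end(), i) == loaded.end()) {
            t += p[i].load;
            if (loaded.size() < slots) {
                loaded.push_back(i);
            } else {
                loaded[victim] = i;
                victim = (victim + 1) % slots;
            }
        }
        t += p[i].acc;
    }
    return t;
}
```

With three processes timed at {80, 10, 70}, {60, 20, 70}, and {50, 25, 70} ms, the queue <p1, p2, p3, p1, p2, p3>, and two accelerators, the model gives 380 ms for base emulation, 110 ms for infinite accelerators, 160 ms for static preloading of p1 and p2, and 530 ms for Greedy, which reloads on every event and ends up slower than the microprocessor alone.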

Experiments and Results
Greedy slower than microprocessor-only emulation because of high reconfiguration cost
Static preloading gives 1.5X improvement
AG performs 9X better than microprocessor-only emulation
[Figure: Execution Time (ms) for Microprocessor-Only, Statically Preloaded, Greedy, AG, and Infinite Accelerators. (a) Virtex 4 ML403: 1 Accelerator; (b) Virtex 5 vlx110t: 3 Accelerators]

Further Performance Improvements: Bypassing the Emulation Kernel
[Figure: Sample Application with processes p1 through p6, and the Emulation Engine block diagram with Accelerator 1, Accelerator 2, Accelerator 3, …, Accelerator n on the FPGA; p2 and p1 are shown on the accelerators]

Further Performance Improvements: Bypassing the Emulation Kernel
3 costly bus accesses for just one signal!
The effect is exacerbated with more complex examples

Bypassing the Emulation Kernel
[Figure: two Acceleration Engines on the System Bus. Each contains a Signal Cache, a Core Acceleration Engine (Register File, RISC Datapath, Local Memory), Bus, Start, Load Logic, and a Kernel Bypass Configuration]
The direct connections between the core acceleration engine and the adjacent signal cache allow the two acceleration engines to communicate without using a shared bus memory
When configured properly, signals the main datapath to communicate with the signal cache and not the system bus
For a limited number of signals, allows single-cycle reading and writing of signals

Kernel Bypass: Experiments and Results
Heavy one-way communication results in larger kernel bypass speedups
Base platform is running AG Heuristic with 3 accelerators
Kernel Bypass improves Online Emulation by 11-12% on average
[Figure: Execution Time (ms), Without Kernel Bypass vs. With Kernel Bypass]

Conclusions
SystemC Emulation can be improved by dynamically managing the SystemC bytecode accelerators
  Improved emulation performance by over 9X compared to software-only emulation
Bypassing the emulation kernel results in additional 20% performance improvement
Just completed MicroBlaze JIT Compilation as another speedup technique