Online SystemC Emulation Acceleration

1 Online SystemC Emulation Acceleration
Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering, University of California, Riverside
{ssirowy,chuang,
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Office of Naval Research.

2 Introduction: Prototyping Circuits and Systems
[Figure: edge detector datapath (pixel inputs s1 to s4 and s6 to s9, adders and subtractors, MIN against 255, output Pixel Value) connected to a memory controller through the address, data, and go signals. Label: Capture in HDL.]
SystemC
C++ based
Creation, instantiation, and connection of components
Precisely timed communication and execution among concurrently executing components (processes)
Supports both “software” and “hardware” constructs and semantics
Given the task of designing a particular system or circuit, edge detection in this case, a designer may prototype in one of many ways: implementing the design in C or C++, or implementing it as a circuit in VHDL or Verilog. (animation) An alternative approach is to use SystemC. SystemC is a set of freely available libraries built on top of the C++ language that allow a designer to capture parallel-executing, spatial applications. SystemC supports a precisely timed communication model that gives a designer spatial connectivity and timing at a fine granularity. Further, SystemC allows a designer to capture both software and hardware constructs in one unified description. Because SystemC was primarily intended for PC-based execution, a designer wishing to use SystemC as an early prototyping language is limited to simulation on a PC. Ideally, though, a designer would be able to execute a SystemC description in-system, fully interacting with input and output. A designer could go about this in one of many ways, but perhaps the easiest is to use in-system emulation.
Code fragment on the slide:
    class EDGE_DETECTOR : public sc_module {
      // signal declarations
      EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;
        SC_METHOD(getPixel);
        sensitive << clock.pos();

3 Introduction: Prototyping Circuits and Systems
In-System Emulation
Quickly obtained simulation interaction with real I/O
Prior to time-consuming mapping and synthesis
But slower
In-system emulation has a number of benefits, including interaction with real input and output and no initial need for time-consuming, difficult-to-use synthesis and mapping tools. The tradeoff, of course, is some performance degradation, but in an initial prototyping scenario such degradation may be tolerable.
[Figure: the edge detector, memory controller, and EDGE_DETECTOR SystemC fragment from the previous slide, with the paths labeled Capture in HDL and Emulation.]

4 SystemC Emulation Engine
Earlier created [CODES 2009]
Real I/O peripherals, representative of many systems
Emulation engine kernel: virtual machine, discrete event kernel, peripheral access and hooks
[Figure: emulation engine block diagram. A main processor (event kernel and virtual machine) with instruction memory holding SystemC bytecode, input and output memories, read and write signal memories, and a USB download interface; I/O peripherals: UART, USB interface, buttons, LEDs.]

5 Emulation Engine Acceleration
For some SystemC applications, emulation can be slow
An edge detection circuit required ~10 minutes to process a 320x240 image *
* on a 100 MHz/SRAM Microblaze SystemC Emulation Engine implementation
[Figure: emulation engine block diagram: main processor, instruction memory (SystemC bytecode), input and output memories, read and write signal memories, UART, USB interface, buttons, LEDs.]

6 Emulation Engine Acceleration
For some SystemC applications, emulation can be slow
An edge detection circuit required ~10 minutes to process a 320x240 image *
If available, use the platform FPGA to create bytecode accelerators
Execute SystemC bytecode natively
FPGA accelerators can speed up emulation by over 100X
* on a 100 MHz Microblaze SystemC Emulation Engine implementation
For the common situation where the SystemC emulation engine executes next to some available FPGA resources, we developed the concept of a SystemC bytecode accelerator. A SystemC bytecode accelerator executes SystemC bytecode as its native instruction set, greatly reducing the time required to execute one (or more) SystemC bytecode instructions. An additional advantage is that an emulation architecture can support multiple SystemC bytecode accelerators, enabling parallel execution of multiple SystemC processes in a given circuit. This behavior is more reflective of the actual circuit behavior and achieves greater performance. Accelerators show great promise, yet due to area constraints we are limited to a finite number of them. That limit raises the question of how best to use the accelerators on a given platform so that performance increases. Can we do some clever scheduling of the SystemC circuit? Can we statically analyze our circuit to optimally load and unload accelerators to improve performance? Turns out we can’t…
[Figure: the emulation engine block diagram with Accelerator 1, Accelerator 2, and Accelerator 3 added.]

7 Emulation Engine Acceleration Management
[Figure: image processing system with Edge, Blur, Sharp, Sketch, Emboss, Dalton, and Cartoon filters; a process queue holding a changing mix of those filters; the emulation engine (main processor, memories, UART, buttons, LEDs) beside an FPGA with Accelerator 1, Accelerator 2, and Accelerator 3.]
Emulation platforms have dynamically changing inputs, compared to simulation platforms
Prevents a static analysis of the system to improve performance
Because a SystemC application is now running in an emulation environment, it is likely that multiple users could alter the execution of certain processes within a given SystemC application. Considering a simple image processing system, one user’s input might bias certain filters to always execute, another user might always use one filter, etc. Because of this, statically analyzing the process queue affords little in terms of optimizing how best t

8 Emulation Engine Acceleration Management
Emulate on the microprocessor, or accelerate using a bytecode accelerator?
Factors: available accelerators, accelerator loading overhead, communication overhead
We really have only one of two choices: execute the SystemC bytecode on the microprocessor, or execute it using an accelerator. Naively, the choice is obvious: accelerate every process all the time, without considering emulating the process on the microprocessor.
[Figure: total execution time for three strategies: emulate every process on the microprocessor; accelerate every process, which adds FPGA communication and loading overhead; dynamically manage accelerators. Shown beside the image processing system, whose process queue holds Edge, Blur, and an undecided entry, and the emulation engine with Accelerator 1, Accelerator 2, and Accelerator 3.]

9 Emulation Engine Acceleration Management
Online decision (decision based only on past and current inputs)
Our solution involved an online algorithm, first developed by Huang for use with partially reconfigurable regions on an FPGA. An online algorithm decides at runtime which action to take, based only on current and past inputs. The algorithm maintains a history table to help decide whether loading a process onto an accelerator is advisable.
[Figure: history table with columns Process, # of uses, Currently on Accelerator, and Accelerate; listed entries: Edge Emboss Mean Blur Radial; use counts: 8 4 3 5 2; flags: Yes, No. Shown beside the image processing system, whose process queue holds Edge and Blur entries, and the emulation engine with an FPGA containing Accelerator 1, Accelerator 2, and Accelerator 3.]

10 Online Accelerator Management
Process queue timing table. Columns: Emulation (ms), Accelerator (ms), Loading Time. Rows: p1 80 10 70; p2 60 20; p3 25 30.
Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.
[Figure: timelines in ms for the uP Only and Statically Preloaded schedules of p1, p2, and p3 on the available acceleration engines, Accelerator 1 and Accelerator 2.]

11 Online Accelerator Management
Process queue timing table. Columns: Emulation (ms), Accelerator (ms), Loading Time. Rows: p1 80 10 70; p2 60 20; p3 25 30.
[Figure: timelines in ms for four schedules on Accelerator 1 and Accelerator 2: uP Only, Statically Preloaded, Greedy Online Schedule, and Better Online Schedule. The online schedules mark loading time for the accelerator swaps 3 -> 2, 2 -> 1, 1 -> 3, 3 -> 2, and 3 -> 1.]

12 Aggregate Gain Solution: AG Table
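[Figure: bar chart of Emulation_only and Emulation+Acc for p1, p2, and p3, with the difference marked as Gain.]
Gain = Emulation_only - (Emulation + Accelerator)
Maintain a gain table for each process in the SystemC circuit: ag(i) = ag(i) + gain(i)
Fading process for temporal locality: ag(i) = ag(i) * f
How to define the fading factor f?
Without fading, Q = <p1, p1, p3, p2, p2, p1, p3>:
    step  process  ag(1)  ag(2)  ag(3)
    1     p1       190    -      -
    2     p1       380    -      -
    3     p3       380    -      25
    4     p2       380    80     25
    5     p2       380    160    25
    6     p1       570    160    25
    7     p3       570    160    50
With fading, F = 0.5, Q = <p1, p1, p3, p2, p2, p1, p3> (for example, the second p1 gives ag(1) = 190*F + 190 = 285):
    step  process  ag(1)  ag(2)  ag(3)
    1     p1       190    -      -
    2     p1       285    -      -
    3     p3       142    -      25
    4     p2       71     80     12
    5     p2       35     120    6
    6     p1       207    60     3
    7     p3       103    30     26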
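The two update rules on this slide can be replayed in a few lines of C++. This is only a sketch: the AgTable struct and the replay helper are hypothetical names, and fading every entry once per dequeued process is an assumption chosen because it reproduces the slide's numbers. The per-process gains of 190, 80, and 25 are read off the slide's values.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Aggregate-gain (AG) table with fading, following the slide's rules:
//   fade: ag(i) = ag(i) * f        (assumed: applied to every entry, once per step)
//   gain: ag(p) = ag(p) + gain(p)  (for the process p that was just dequeued)
// Names are illustrative, not taken from the authors' implementation.
struct AgTable {
    std::vector<double> gain;  // per-process gain from acceleration
    std::vector<double> ag;    // running aggregate gain per process
    double fade;               // fading factor f in (0, 1]; 1.0 disables fading

    AgTable(std::vector<double> gains, double f)
        : gain(std::move(gains)), ag(gain.size(), 0.0), fade(f) {
        assert(f > 0.0 && f <= 1.0);
    }

    // Record that process p appeared on the process queue.
    void observe(std::size_t p) {
        assert(p < ag.size());
        for (double& a : ag) a *= fade;  // older history counts for less
        ag[p] += gain[p];
    }
};

// Replays a whole process queue and returns the final AG values.
std::vector<double> replay(const std::vector<double>& gains, double f,
                           const std::vector<std::size_t>& queue) {
    AgTable table(gains, f);
    for (std::size_t p : queue) table.observe(p);
    return table.ag;
}
```

Replaying Q = <p1, p1, p3, p2, p2, p1, p3> with f = 1.0 ends at 570, 160, and 50. With f = 0.5 it ends at 103.90625, 30, and 26.5625, which the slide shows truncated to 103, 30, and 26.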

13 AG: Overheads And Replacement Policy
Loading time (LT): accelerator loading time(i)
Communication overhead (CO): (Dλ’ - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
[Figure: emulation engine with a process queue of Edge and Blur entries, an AG table with ag(Edge) and ag(Blur) (value shown: 80), and an FPGA with Accelerator 1; bar chart of Emulation_only, Emulation+Acc, and Gain for Edge and Blur.]

14 AG: Overheads And Replacement Policy
Loading time (LT): accelerator loading time(i)
Communication overhead (CO): (Dλ’ - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
Replace: ag(i) > Overhead + ag(j), where j is the accelerator to be replaced
Wait: ag(i) > Overhead + ag(j) + wait_time(j)
[Figure: emulation engine with a process queue of Blur and Edge entries, an AG table with ag(Edge) and ag(Blur) (values shown: 80, 80, 190), and an FPGA whose Accelerator 1 holds Edge; bar chart of Emulation_only, Emulation+Acc, and Gain for Edge and Blur.]
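The three policies compare the same aggregate gain against increasingly large thresholds, which a small decision function makes explicit. The sketch below is not the authors' implementation: the Action enum, the decide function, and the mapping of each policy to an accelerator state (free slot, occupied but idle, occupied and busy) are assumptions. The two delay terms are passed in as plain numbers because the transcript does not define Dλ’ and Dλ.

```cpp
#include <cassert>

// Possible outcomes for the process at the head of the queue (hypothetical names).
enum class Action { Emulate, Load, Replace, Wait };

// Overhead = LT + CO, with CO = (D_lambda_prime - D_lambda) * runtime(i),
// exactly as written on the slide.
double overhead(double load_time, double d_lambda_prime, double d_lambda,
                double runtime) {
    assert(load_time >= 0.0 && runtime >= 0.0);
    return load_time + (d_lambda_prime - d_lambda) * runtime;
}

// ag_i        aggregate gain of the candidate process i
// ovh         overhead of moving i onto an accelerator
// free_slot   true if some accelerator is empty
// ag_j        aggregate gain of the process j that would be evicted
// j_busy      true if j is still executing on its accelerator
// wait_time_j time until j finishes
Action decide(double ag_i, double ovh, bool free_slot, double ag_j, bool j_busy,
              double wait_time_j) {
    if (free_slot) {
        // Load: ag(i) > Overhead
        return ag_i > ovh ? Action::Load : Action::Emulate;
    }
    if (!j_busy) {
        // Replace: ag(i) > Overhead + ag(j)
        return ag_i > ovh + ag_j ? Action::Replace : Action::Emulate;
    }
    // Wait: ag(i) > Overhead + ag(j) + wait_time(j)
    return ag_i > ovh + ag_j + wait_time_j ? Action::Wait : Action::Emulate;
}
```

Falling back to emulation on the microprocessor whenever a threshold is not met is also an assumption. It matches the deck's framing that each process is either emulated or accelerated.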

15 Comparison Solutions
Base Emulation: Emulating every process on the main processor
Infinite Accelerators: Accelerating every process w/o loading overhead
[Figure: emulation engine (main processor, input, output, and instruction memories, read and write signal memories, UART, USB interface, buttons, LEDs) beside an FPGA with Accelerator 1, Accelerator 2, Accelerator 3, through Accelerator n.]

16 Comparison Solutions
Base Emulation: Emulating every process on the main processor
Infinite Accelerators: Accelerating every process w/o loading overhead
Static preloaded: Each accelerator is statically assigned a process to execute when on the event queue and never changes
Greedy: Always assign the current process on the event queue to an accelerator,
[Figure: emulation engine (main processor, input, output, and instruction memories, read and write signal memories, UART, USB interface, buttons, LEDs) beside an FPGA with Accelerator 1, Accelerator 2, and Accelerator 3.]

17 Experiments and Results
Greedy is slower than microprocessor-only emulation because of high reconfiguration cost
Static preloading gives a 1.5X improvement
AG performs 9X better than microprocessor-only emulation
[Figure: execution time (ms) for Microprocessor-Only, Statically Preloaded, Greedy, AG, and Infinite Accelerators. Panels: (a) Virtex 4 ML403: 1 accelerator; (b) Virtex 5 vlx110t: 3 accelerators.]

18 Further Performance Improvements: Bypassing the Emulation Kernel
[Figure: emulation engine (main processor, input, output, and instruction memories, read and write signal memories, UART, USB interface, buttons, LEDs) beside an FPGA with Accelerator 1, Accelerator 2, Accelerator 3, through Accelerator n, running a sample application of processes p1 to p6; p2 and p1 are shown on the accelerators.]

21 Further Performance Improvements: Bypassing the Emulation Kernel
3 costly bus accesses for just one signal!
The effect is exacerbated with more complex examples
[Figure: emulation engine (main processor, input, output, and instruction memories, read and write signal memories, UART, USB interface, buttons, LEDs) beside an FPGA with Accelerator 1, Accelerator 2, Accelerator 3, through Accelerator n, running a sample application of processes p1 to p6; p2 and p1 are shown on the accelerators.]

22 Bypassing the Emulation Kernel
[Figure: two acceleration engines on the system bus. Each contains a signal cache, a kernel bypass configuration block, and a core acceleration engine made of a register file, bus, start, and load logic, a RISC datapath, and local memory.]
The direct connections between the core acceleration engine and the adjacent signal cache allow the two acceleration engines to communicate without using a shared bus memory
Kernel bypass configuration: when configured properly, signals the main datapath to communicate with the signal cache and not the system bus
For a limited number of signals, allows single-cycle reading and writing of signals

23 Kernel Bypass- Experiments and Results
Heavy one-way communication results in larger kernel bypass speedups
Base platform is running the AG heuristic with 3 accelerators
Kernel bypass improves online emulation by 11-12% on average
[Figure: execution time (ms) without kernel bypass and with kernel bypass.]

24 Conclusions
SystemC emulation can be improved by dynamically managing the SystemC bytecode accelerators
Improved emulation performance by over 9X compared to software-only emulation
Bypassing the emulation kernel results in an additional 20% performance improvement
Just completed Microblaze JIT compilation as another speedup technique

