Scott Sirowy, Chen Huang, and Frank Vahid † Department of Computer Science and Engineering University of California, Riverside {ssirowy,chuang,

Presentation transcript:

Dynamic Acceleration Management of SystemC Emulation
Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering, University of California, Riverside
{ssirowy,chuang,
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Office of Naval Research.

2/17 Introduction: Prototyping Circuits and Systems
SystemC is C++ based and supports:
Creation, instantiation, and connection of components
Precisely timed communication and execution among concurrently executing components
Both “software” and “hardware” constructs and semantics
Capture in HDL (excerpt): class EDGE_DETECTOR : public sc_module { //signal declarations … EDGE_DETECTOR() { SC_method(mainComp); sensitive << dataReady; SC_method(getPixel); sensitive << clock.pos();
[Figure: Edge Detector and Memory Controller block diagram; labels include state names (s1, s2, ...), go, MIN, data, address, and Pixel Value.]

3/17 Introduction: Prototyping Circuits and Systems
In-system emulation of the design captured in HDL (the same EDGE_DETECTOR excerpt as on slide 2):
Quickly obtained, simulation-like interaction with real I/O
Available prior to time-consuming mapping and synthesis
But emulation is slower

4/17 SystemC Emulation Engine
The platform consists of an event kernel and virtual machine, peripherals, a USB interface, and real I/O peripherals. It is representative of many systems.
Emulation Engine Kernel (runs on the Main Processor): Virtual Machine, Discrete Event Kernel, Peripheral Access and Hooks
Memories: Instruction Memory, Read Signal Memory, Write Signal Memory, Input Memory, Output Memory
I/O peripherals: USB Download Interface, UART, Buttons, LEDs

5/17 Emulation Engine Acceleration
For some SystemC applications, emulation can be slow.
An edge detection circuit required ~10 minutes to process a 320x240 image on a 100 MHz/SRAM MicroBlaze implementation of the SystemC Emulation Engine.
[Figure: Emulation Engine block diagram. SystemC bytecode is downloaded through the USB Interface; Main Processor with Instruction Memory, Read Signal Memory, Write Signal Memory, Input Memory, and Output Memory; UART, Buttons, LEDs.]

6/17 Emulation Engine Acceleration
For some SystemC applications, emulation can be slow: an edge detection circuit required ~10 minutes to process a 320x240 image on a 100 MHz MicroBlaze implementation of the SystemC Emulation Engine.
If available, use the platform FPGA to create bytecode accelerators that execute SystemC bytecode natively.
FPGA accelerators speed up emulation by over 100X.
[Figure: the Emulation Engine block diagram extended with an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]

7/17 Emulation Engine Acceleration Management
Question: for each process on the event queue (Edge, Blur, Edge, Blur, … in an image processing system), emulate it in software, or accelerate it using a bytecode accelerator?
The decision depends on the available accelerators, the accelerator loading overhead, and the total execution time.
[Figure: total execution time for three strategies: emulate every process in software; accelerate every process, which adds communication and loading overhead; dynamically manage accelerators.]
[Figure: Emulation Engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]

8/17 Emulation Engine Acceleration Management
Online decision: for the event queue Blur, Edge, Blur, …, the engine decides at run time what to accelerate (in the example: accelerate Edge).
[Table: History Table with columns Process, # of uses, and In Accelerator (Yes/No); process entries: Edge Emboss Mean Blur Radial.]
[Figure: Emulation Engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]

9/17 Dynamic Accelerator Management
[Table: for each process p1, p2, p3: Emulation (ms), Emulation + Accelerator (ms), Loading Time (ms).]
Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.
[Figure: event queue of p1, p2, and p3 events; available acceleration engines Accelerator 1 and Accelerator 2; timeline (ms) of the statically preloaded schedule.]

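10/17 Dynamic Accelerator Management
[Table: for each process p1, p2, p3: Emulation (ms), Emulation + Accelerator (ms), Loading Time (ms).]
[Figure: event queue of p1, p2, and p3 events; available acceleration engines Accelerator 1 and Accelerator 2; timelines (ms) comparing the statically preloaded schedule, the greedy schedule, and a better schedule, with loading time marked at each accelerator reload (labels: 2 -> 1, 3 -> 1, 3 -> 2, 1 -> 3, 3 -> 2).]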

11/17 Aggregate Gain Solution: AG Table
Gain = Emulation only - (Emulation + Accelerator)
Maintain a gain table with one entry for each process in the SystemC circuit: ag(i) = ag(i) + gain(i)
Fading process for temporal locality: ag(i) = ag(i) * f
How should the fading factor f be defined?
[Example: table of Emulation_only, Emulation+Acc, and Gain for p1, p2, p3; for each event on the queue Q, the entries ag(1), ag(2), ag(3) are multiplied by F and a gain is added (190 in the example).]
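The AG table update can be sketched in plain C++ as follows. This is a minimal illustration, not the authors' implementation: the names `ProcessTiming`, `AgTable`, and `on_event` are hypothetical, the order of operations (fade every entry, then credit the gain of the process that just ran) is an assumption, and so is the range 0 to 1 for the fading factor f.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Per-process timings in ms. All names in this sketch are hypothetical.
struct ProcessTiming {
    double emulation_only;      // process emulated on the main processor
    double emulation_plus_acc;  // process run on a bytecode accelerator
};

// Gain = Emulation only - (Emulation + Accelerator), as defined on the slide.
inline double gain(const ProcessTiming& t) {
    return t.emulation_only - t.emulation_plus_acc;
}

class AgTable {
public:
    AgTable(std::vector<ProcessTiming> timings, double fading_factor)
        : timings_(std::move(timings)),
          ag_(timings_.size(), 0.0),
          f_(fading_factor) {
        assert(f_ >= 0.0 && f_ <= 1.0);  // assumed range for f
    }

    // Called when process i comes off the event queue.
    // Assumption: fade every entry first (ag = ag * f), then credit
    // the gain of process i (ag(i) = ag(i) + gain(i)).
    void on_event(std::size_t i) {
        assert(i < ag_.size());
        for (double& a : ag_) a *= f_;
        ag_[i] += gain(timings_[i]);
    }

    double ag(std::size_t i) const { return ag_.at(i); }

private:
    std::vector<ProcessTiming> timings_;
    std::vector<double> ag_;
    double f_;
};
```

With made-up timings of 200 ms emulated and 10 ms accelerated for one process, a single event gives it an aggregate gain of 190; each later event for a different process scales that entry by f, so processes that stop appearing on the event queue gradually lose their claim on an accelerator.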

12/17 AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ’ - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
[Example: event queue Blur, Edge, Blur, … with a single accelerator (Accelerator 1) on the FPGA; table of Emulation_only, Emulation+Acc, and Gain for Edge and Blur; AG table values ag(Edge) = 80, ag(Blur) = 0.]

13/17 AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ’ - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
Replace: ag(i) > Overhead + ag(j) (j is the process in the accelerator to be replaced)
Wait: ag(i) > Overhead + ag(j) + wait_time(j)
[Example: event queue Blur, Edge, … with a single accelerator (Accelerator 1) on the FPGA; table of Emulation_only, Emulation+Acc, and Gain for Edge and Blur; AG table entries ag(Edge) and ag(Blur).]
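The three policy inequalities can be written as a single decision function. The sketch below is a hedged reading of the slide, not the authors' code: the types `Candidate` and `AcceleratorSlot` and the function `decide` are hypothetical, the communication overhead CO is passed in as a number rather than computed from Dλ’ and Dλ, and mapping each inequality to a slot state (empty means Load, occupied and idle means Replace, occupied and busy means Wait) is an assumption.

```cpp
#include <cassert>

// Possible outcomes for a process i taken from the event queue.
enum class Decision { Emulate, Load, Replace, Wait };

// Process i. Names are hypothetical.
struct Candidate {
    double ag;             // aggregate gain ag(i)
    double loading_time;   // LT: accelerator loading time(i)
    double comm_overhead;  // CO, supplied by the caller
};

// One accelerator, possibly holding process j.
struct AcceleratorSlot {
    bool occupied;       // a process j is currently loaded
    bool busy;           // process j is still executing
    double resident_ag;  // ag(j)
    double wait_time;    // wait_time(j)
};

Decision decide(const Candidate& c, const AcceleratorSlot& s) {
    assert(c.loading_time >= 0.0 && c.comm_overhead >= 0.0);
    const double overhead = c.loading_time + c.comm_overhead;  // Overhead = LT + CO

    if (!s.occupied) {
        // Load: ag(i) > Overhead
        return c.ag > overhead ? Decision::Load : Decision::Emulate;
    }
    if (!s.busy) {
        // Replace: ag(i) > Overhead + ag(j)
        return c.ag > overhead + s.resident_ag ? Decision::Replace
                                               : Decision::Emulate;
    }
    // Wait: ag(i) > Overhead + ag(j) + wait_time(j)
    return c.ag > overhead + s.resident_ag + s.wait_time ? Decision::Wait
                                                         : Decision::Emulate;
}
```

Falling back to software emulation whenever an inequality fails is also an assumption; the slide lists only the conditions under which the accelerator is used.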

14/17 Comparison Solutions
Base Emulation: Emulating every process on the main processor
Infinite Accelerators: Accelerating every process without loading overhead
[Figure: Emulation Engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, Accelerator 3, … Accelerator n.]

15/17 Comparison Solutions
Base Emulation: Emulating every process on the main processor
Infinite Accelerators: Accelerating every process without loading overhead
Static preloaded: Each accelerator is statically assigned a process, which it executes whenever that process is on the event queue; the assignment never changes
Greedy: Always assign the current process on the event queue to an accelerator.
[Figure: Emulation Engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]
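To make the four comparison solutions concrete, the sketch below computes the total execution time of each over one event sequence. It is an illustrative model only: it assumes a single accelerator, sequential execution, no communication overhead, and an initially empty accelerator for the greedy case. The `Proc` struct and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Timings in ms for one process. Names are hypothetical.
struct Proc {
    double emu;   // emulation on the main processor
    double acc;   // execution on an accelerator
    double load;  // accelerator loading time
};

using Events = std::vector<std::size_t>;  // process ids in event-queue order

// Base Emulation: every process runs on the main processor.
double base_emulation(const std::vector<Proc>& p, const Events& ev) {
    double total = 0.0;
    for (std::size_t e : ev) total += p.at(e).emu;
    return total;
}

// Infinite Accelerators: every process is accelerated, no loading overhead.
double infinite_accelerators(const std::vector<Proc>& p, const Events& ev) {
    double total = 0.0;
    for (std::size_t e : ev) total += p.at(e).acc;
    return total;
}

// Static preloaded (one accelerator): only `preloaded` is ever accelerated.
double static_preloaded(const std::vector<Proc>& p, const Events& ev,
                        std::size_t preloaded) {
    assert(preloaded < p.size());
    double total = 0.0;
    for (std::size_t e : ev) total += (e == preloaded) ? p.at(e).acc : p.at(e).emu;
    return total;
}

// Greedy (one accelerator): always accelerate the current process,
// paying the loading time whenever it is not already resident.
double greedy(const std::vector<Proc>& p, const Events& ev) {
    double total = 0.0;
    bool has_resident = false;
    std::size_t resident = 0;
    for (std::size_t e : ev) {
        if (!has_resident || resident != e) {
            total += p.at(e).load;
            resident = e;
            has_resident = true;
        }
        total += p.at(e).acc;
    }
    return total;
}
```

For two made-up processes with (emu, acc, load) of (100, 10, 50) and (80, 20, 50) and the event sequence 0, 0, 1, 1, the totals are 360 ms for base emulation, 60 ms for infinite accelerators, 180 ms for static preloading of process 0, and 160 ms for greedy. These numbers are not from the slides; they only show how loading overhead separates the strategies.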

16/17 Experiments and Results
[Figure: results for the Base emulator, Greedy, Infinite Accelerators, Statically Preloaded, and AG solutions on two platforms: (a) Virtex 4 ML403, 1 accelerator; (b) Virtex2Pro, 3 accelerators.]
The Aggregate Gain algorithm is on average:
3.8X faster than statically preloading accelerators
1.3X faster than greedily assigning accelerators

17/17 Conclusions
SystemC emulation can be improved by dynamically managing the SystemC bytecode accelerators.
Applied the online Aggregate Gain algorithm to the SystemC emulation framework:
Improves emulation performance by 14X compared to emulating all of the SystemC on a base software emulation kernel
3.8X performance improvement over statically preloading the accelerator engines