Published by Sharon Fowler. Modified over 8 years ago.
1
Dynamic Acceleration Management of SystemC Emulation
Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering, University of California, Riverside
{ssirowy, chuang, vahid}@cs.ucr.edu
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and the Office of Naval Research.
2
2/17 Introduction: Prototyping Circuits and Systems
Capture in HDL (SystemC excerpt):
    class EDGE_DETECTOR : public sc_module {
        // signal declarations
        …
        EDGE_DETECTOR() {
            SC_METHOD(mainComp); sensitive << dataReady;
            SC_METHOD(getPixel); sensitive << clock.pos();
        }
    };
SystemC
- C++ based
- Creation, instantiation, and connection of components
- Precisely timed communication and execution among concurrently executing components
- Supports both "software" and "hardware" constructs and semantics
[Figure: Edge Detector circuit connected to a Memory Controller (go, data, address); pixel inputs labeled s1 to s9 feed adders, subtractors, and a MIN with 255 to produce the Pixel Value]
3
3/17 Introduction: Prototyping Circuits and Systems
Capture in HDL (SystemC excerpt):
    class EDGE_DETECTOR : public sc_module {
        // signal declarations
        …
        EDGE_DETECTOR() {
            SC_METHOD(mainComp); sensitive << dataReady;
            SC_METHOD(getPixel); sensitive << clock.pos();
        }
    };
In-System Emulation
- Quickly obtained simulation interaction with real I/O
- Prior to time-consuming mapping and synthesis
- But slower
[Figure: flow from Capture in HDL to Emulation]
4
4/17 SystemC Emulation Engine
[Figure: emulation engine platform, representative of many systems]
- Main Processor running the Emulation Engine Kernel: Virtual Machine, Discrete Event Kernel, Peripheral Access and Hooks
- Memories: Instruction Memory, Read Signal Memory, Write Signal Memory, Input Memory, Output Memory
- USB Download Interface
- Real I/O Peripherals: UART, Buttons, LEDs
5
5/17 Emulation Engine Acceleration
- For some SystemC applications, emulation can be slow
- An Edge Detection circuit required ~10 minutes to process a 320x240 image*
* on a 100 MHz/SRAM MicroBlaze SystemC Emulation Engine implementation
[Figure: SystemC bytecode downloaded over the USB Interface to the Emulation Engine (Main Processor; Instruction, Read Signal, Write Signal, Input, and Output Memories; UART, Buttons, LEDs)]
6
6/17 Emulation Engine Acceleration
- For some SystemC applications, emulation can be slow
- An Edge Detection circuit required ~10 minutes to process a 320x240 image*
- If available, use the platform FPGA to create bytecode accelerators
- Accelerators execute SystemC bytecode natively
- Accelerators speed up emulation by over 100X
* on a 100 MHz MicroBlaze SystemC Emulation Engine implementation
[Figure: the Emulation Engine from the previous slide extended with Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
7
7/17 Emulation Engine Acceleration Management
Image Processing System, event queue: Edge, Blur, Edge, Blur, …
Question for each process: emulate in software, or accelerate using a bytecode accelerator?
Considerations: available accelerators, accelerator loading overhead
[Figure: total execution time for three options: emulate every process in software; accelerate every process (with communication and loading overhead); dynamically manage accelerators]
[Figure: Emulation Engine (Main Processor; Instruction, Read Signal, Write Signal, Input, and Output Memories; UART, Buttons, LEDs; USB Interface) with Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
8
8/17 Emulation Engine Acceleration Management
Image Processing System, event queue: Blur, Edge, Blur, …
Online Decision: Accelerate Edge
History Table
    Process:         Edge, Emboss, Mean, Blur, Radial
    # of uses:       84, 35, 28, 43, 52
    In Accelerator:  Yes, No, Yes, No
[Figure: Emulation Engine (Main Processor; Instruction, Read Signal, Write Signal, Input, and Output Memories; UART, Buttons, LEDs; USB Interface) with Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
9
9/17 Dynamic Accelerator Management
    Process   Emulation (ms)   Emulation + Accelerator (ms)   Loading Time (ms)
    p1        80               10                             70
    p2        60               20                             60
    p3        70               25                             30
Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.
Available Acceleration Engines: Accelerator 1, Accelerator 2
[Figure: event queue of p1, p2, and p3 events, and a timeline (ms) of the statically preloaded schedule]
10
10/17 Dynamic Accelerator Management
    Process   Emulation (ms)   Emulation + Accelerator (ms)   Loading Time (ms)
    p1        80               10                             70
    p2        60               20                             60
    p3        70               25                             30
Available Acceleration Engines: Accelerator 1, Accelerator 2
[Figure: event queue of p1, p2, and p3 events, and timelines (ms) comparing the statically preloaded schedule, the greedy schedule, and a better schedule, with loading-time intervals marked; swap labels: 2 -> 1, 3 -> 1, 3 -> 2, 1 -> 3, 3 -> 2]
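The trade-off between static preloading and greedy assignment can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the authors' scheduler: the costs are read from the table on this slide (p1: 80/10/70, p2: 60/20/60, p3: 70/25/30 ms), processes run one at a time, the names `Cost`, `kCost`, `static_preloaded`, and `greedy` are hypothetical, and the greedy policy's choice of which accelerator to overwrite (least recently used) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-process costs in ms, read from the slide's table (index 0 = p1).
struct Cost { int emulate; int accelerated; int load; };
const std::array<Cost, 3> kCost = {{{80, 10, 70}, {60, 20, 60}, {70, 25, 30}}};

// Static preloading: the two accelerators keep p1 and p2 for the whole run,
// so p3 is always emulated in software.
int static_preloaded(const std::vector<std::size_t>& queue) {
    int total = 0;
    for (std::size_t p : queue)
        total += (p == 0 || p == 1) ? kCost[p].accelerated : kCost[p].emulate;
    return total;
}

// Greedy: always run the current process on an accelerator, loading it first
// if it is not resident. Evicting the least recently used slot is an
// assumption made for this sketch.
int greedy(const std::vector<std::size_t>& queue) {
    std::size_t slot[2] = {0, 1};  // initial contents: p1 and p2
    int last_use[2] = {-2, -1};    // smaller value = less recently used
    int total = 0;
    for (std::size_t t = 0; t < queue.size(); ++t) {
        std::size_t p = queue[t];
        std::size_t s = (slot[0] == p) ? 0 : (slot[1] == p) ? 1 : 2;
        if (s == 2) {              // miss: overwrite a slot and pay loading time
            s = (last_use[0] <= last_use[1]) ? 0 : 1;
            slot[s] = p;
            total += kCost[p].load;
        }
        total += kCost[p].accelerated;
        last_use[s] = static_cast<int>(t);
    }
    return total;
}
```

For the hypothetical queue p1, p2, p3, p1, p2, p3, `static_preloaded` returns 200 ms and `greedy` returns 300 ms, because every greedy miss pays the loading time before it gains anything.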
11
11/17 Aggregate Gain Solution: AG Table
Gain = Emulation only - (Emulation + Accelerator)
Maintain a gain table for each process in the SystemC circuit: ag(i) = ag(i) + gain(i)
Fading process for temporal locality: ag(i) = ag(i) * f
How to define the fading factor f?

                     p1    p2    p3
    Emulation_only   200   100   50
    Emulation+Acc    10    20    25
    Gain             190   80    25

AG table after each event in the queue Q, without fading:
    ag(1)   ag(2)   ag(3)
    190     0
    380     0
    380     0       25
    380     80      25
    380     160     25
    570     160     25
    570     160     50

AG table after each event in the queue Q, with F = 0.5 (for example, 190*F + 190 = 285):
    ag(1)   ag(2)   ag(3)
    190     0
    285     0
    142     0       25
    71      80      12
    35      120     6
    207     60      3
    103     30      26
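The table updates on this slide can be sketched in C++ as follows. This is a minimal sketch, not the authors' implementation: the names `AgTable`, `on_event`, and `replay` are hypothetical, and three details are assumptions inferred from the numbers on the slide rather than stated there: every entry is faded on each event, the faded value is truncated to an integer, and the gain of the process that just ran is added afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Aggregate-gain table with fading (sketch).
struct AgTable {
    std::vector<int> gain;  // gain(i) = emulation-only time - accelerated time
    std::vector<int> ag;    // aggregate gain per process, starts at 0
    double f;               // fading factor; 1.0 disables fading

    AgTable(std::vector<int> gains, double fading)
        : gain(std::move(gains)), ag(gain.size(), 0), f(fading) {}

    // Record that process i (0-based) was taken from the event queue.
    void on_event(std::size_t i) {
        // Fade every entry; truncation to int is an assumption of this sketch.
        for (int& a : ag) a = static_cast<int>(a * f);
        ag[i] += gain[i];  // ag(i) = ag(i) + gain(i)
    }
};

// Replays an event sequence and returns the final table contents.
std::vector<int> replay(const std::vector<int>& gains, double f,
                        const std::vector<std::size_t>& events) {
    AgTable table(gains, f);
    for (std::size_t e : events) table.on_event(e);
    return table.ag;
}
```

With gains {190, 80, 25} and the event order p1, p1, p3, p2, p2, p1, p3 (an order inferred from the table rows), `replay` yields 570, 160, 50 for f = 1.0 and 103, 30, 26 for f = 0.5, which matches the last row of each trace on the slide.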
12
12/17 AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
- Load: ag(i) > Overhead

                     Edge   Blur
    Emulation_only   100    200
    Emulation+Acc    20     10
    Gain             80     190

AG table:
    ag(Edge)   ag(Blur)
    80         0
[Figure: event queue (Blur, Edge, Blur, …) and the Emulation Engine with a single Accelerator 1 on the FPGA holding Edge]
13
13/17 AG: Overheads And Replacement Policy
Loading time (LT): Accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
- Load: ag(i) > Overhead
- Replace: ag(i) > Overhead + ag(j) (j is the process in the accelerator to be replaced)
- Wait: ag(i) > Overhead + ag(j) + wait_time(j)

                     Edge   Blur
    Emulation_only   100    200
    Emulation+Acc    20     10
    Gain             80     190

AG table:
    ag(Edge)   ag(Blur)
    80         0
    80         190
[Figure: event queue (Blur, Edge, …) and the Emulation Engine with a single Accelerator 1 on the FPGA; labels Edge and Blur]
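The three policy tests can be sketched in C++ as a single decision function. This is a hedged sketch, not the authors' code: `Action`, `Candidate`, `overhead`, and `decide` are hypothetical names, and the mapping of each inequality to an accelerator state (Load for an empty accelerator, Replace for an idle occupied one, Wait for a busy one) is an assumption about how the policies are applied.

```cpp
#include <cassert>

enum class Action { Emulate, Load, Replace, Wait };

// State of the candidate accelerator as seen by the manager (hypothetical).
struct Candidate {
    bool empty;        // no process loaded yet
    bool busy;         // currently executing its loaded process j
    int  ag_j;         // aggregate gain of the loaded process j (0 if empty)
    int  wait_time_j;  // time until j finishes (0 if idle)
};

// Overhead = LT + CO, as on the slide. CO is passed in precomputed because
// the slide does not define the terms of (Dλ' - Dλ) * runtime(i).
int overhead(int loading_time, int communication_overhead) {
    return loading_time + communication_overhead;
}

// Decide what to do with process i, whose aggregate gain is ag_i.
Action decide(int ag_i, int ovh, const Candidate& c) {
    if (c.empty)   // Load: ag(i) > Overhead
        return ag_i > ovh ? Action::Load : Action::Emulate;
    if (!c.busy)   // Replace: ag(i) > Overhead + ag(j)
        return ag_i > ovh + c.ag_j ? Action::Replace : Action::Emulate;
    // Wait: ag(i) > Overhead + ag(j) + wait_time(j)
    return ag_i > ovh + c.ag_j + c.wait_time_j ? Action::Wait : Action::Emulate;
}
```

Using ag(Blur) = 190 and ag(Edge) = 80 from the slide and a hypothetical overhead of 60, `decide(190, 60, Candidate{false, false, 80, 0})` selects Replace because 190 > 60 + 80, while the same request against a busy accelerator with 70 of remaining wait time falls back to Emulate because 190 is not greater than 210.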
14
14/17 Comparison Solutions
- Base Emulation: Emulating every process on the main processor
- Infinite Accelerators: Accelerating every process without loading overhead
[Figure: Emulation Engine (Main Processor; Instruction, Read Signal, Write Signal, Input, and Output Memories; UART, Buttons, LEDs; USB Interface) with Accelerator 1, Accelerator 2, Accelerator 3, … Accelerator n on the FPGA]
15
15/17 Comparison Solutions
- Base Emulation: Emulating every process on the main processor
- Infinite Accelerators: Accelerating every process without loading overhead
- Static preloaded: Each accelerator is statically assigned a process to execute when on the event queue and never changes
- Greedy: Always assign the current process on the event queue to an accelerator
[Figure: Emulation Engine (Main Processor; Instruction, Read Signal, Write Signal, Input, and Output Memories; UART, Buttons, LEDs; USB Interface) with Accelerator 1, Accelerator 2, and Accelerator 3 on the FPGA]
16
16/17 Experiments and Results
[Figure: bar charts for (a) Virtex 4 ML403: 1 Accelerator and (b) Virtex2Pro: 3 Accelerators, comparing Base emulator, Greedy, Infinite Accelerators, Statically Preloaded, and AG; data labels as extracted: 622 651 617 397 389 428 617 651 622]
Aggregate Gain Algorithm, on average:
- 3.8X faster than statically preloading accelerators
- 1.3X faster than greedily assigning accelerators
17
17/17 Conclusions
- SystemC emulation can be improved by dynamically managing the SystemC bytecode accelerators
- Applied the online Aggregate Gain Algorithm to the SystemC emulation framework
- Improves emulation performance by 14X compared to emulating all of the SystemC on a base software emulation kernel
- 3.8X performance improvement over statically preloading the accelerator engines