Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.

1 Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University Speaker: 陳雋中

2 Outline
Introduction
System Architecture
Tool Overview
Experiments
Conclusion

3 Introduction (1/3)
Drawbacks of current dynamic optimizations
–Currently limited to software optimizations
Alternatively, we could perform hw/sw partitioning
–Achieves large speedups (2x to 10x is common)
–However, this cannot presently be done dynamically
[Figure labels: Sw, Hw, Profiler, Critical Regions, Processor, ASIC/FPGA]

4 Introduction (2/3)
Ideally, we would perform hardware/software partitioning dynamically
–Most partitioning approaches have complex tool flows
–Achieves better results than software optimizations (>2x speedup, energy savings)
Appropriate architecture required
–Requires a processor and configurable logic

5 Introduction (3/3)
Binary-level hw/sw partitioning
–Binary is profiled and hardware candidates are determined
–Binary is updated to use hardware
Many advantages over source-level partitioning
–Supports any language or software compiler (no change in tools)
–Better performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure labels: Binary, Profiling, Decompilation, Hw Exploration, Behavioral Synthesis, Binary Updater, Netlist, Updated Binary, Processor, FPGA]

6 Dynamic Hw/Sw Partitioning [Figure, animation frame: chip with Microprocessors, Memory, Configurable Logic, and Dynamic Partitioning Module; SW instruction shown: add]

7 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; SW instruction shown: beq]

8 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; add shown at both SW and Dynamic Partitioning Module]

9 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; beq shown at both SW and Dynamic Partitioning Module]

10 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; Dynamic Partitioning Module labeled Frequent Loops, SW]

11 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; Dynamic Partitioning Module labeled Frequent Loops, SW and HW]

12 Dynamic Hw/Sw Partitioning [Figure, animation frame: same chip; Frequent Loops shown in Configurable Logic]

13 Dynamic Partitioning Module (1/2)
Dynamic partitioning module executes partitioning tools on chip
–Profiler, partitioning compiler, synthesis, place & route
[Figure labels: SW Binary, Profiler, Partitioning Compiler, SW Source, HW, Synthesis, Place & Route; chip with Memory, Dynamic Partitioning Module, Configurable Logic, Microprocessors]

14 Dynamic Partitioning Module (2/2)
Dynamically detects frequent loops and then reimplements them in hardware running on the configurable logic
Architectural components
–Profiler
–Additional processor and memory (but SoCs may have dozens anyway; alternatively, we could share the main processor)
[Figure labels: Profiler, Partitioning Co-Processor, Memory]

15 Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
–R0_Input stores data from memory
–R1_InOut stores temporary data and data to write back to memory
Fabric
–Supports combinational logic
–Implies the loop body must be implemented in a single cycle (temporary restriction)
[Figure labels: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]
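The register arrangement on this slide can be illustrated with a small software model. This is only a sketch under assumptions, not the authors' hardware: `run_fabric`, `body`, and the list-backed memory are hypothetical names chosen for illustration. Only the DMA, R0_Input, R1_InOut, and the single-cycle combinational loop body come from the slide.

```python
from typing import Callable, List


def run_fabric(memory: List[int], start: int, count: int,
               body: Callable[[int, int], int], r1_init: int = 0) -> int:
    """Model one hardware loop (hypothetical helper).

    The DMA streams `count` words from `memory` into R0_Input, and the
    combinational `body` updates R1_InOut once per cycle.
    """
    r1_inout = r1_init
    for offset in range(count):               # one iteration == one fabric cycle
        r0_input = memory[start + offset]     # DMA read with incrementing address
        r1_inout = body(r0_input, r1_inout)   # purely combinational loop body
    return r1_inout


# Example: an accumulation loop summing eight words.
data = [3, 1, 4, 1, 5, 9, 2, 6]
total = run_fabric(data, start=0, count=8, body=lambda r0, r1: r0 + r1)
```

Because the body is a pure function of the two registers, it matches the slide's restriction that each loop iteration completes in a single cycle.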

16 Tool Overview
[Figure labels: Binary, Loop Profiling, Small Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, HW, Binary Modification, Updated Binary]
Tool flow slightly different from standard partitioning flow
–Decompilation
–Binary modification

17 Loop Profiling
Non-intrusive profiler
–Monitors instruction bus
Very little overhead
–Small cache (~16 entries) and 2,300 logic gates
Less than 1% power overhead
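A rough software model of such a profiler is sketched below. The slide states only that the profiler watches the instruction bus and uses a cache of about 16 entries. Treating a backward branch as a loop iteration, the least-frequent eviction policy, and the names `LoopProfiler`, `observe`, and `frequent_loops` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class LoopProfiler:
    """Hypothetical model of a frequent-loop cache fed by the instruction bus."""

    def __init__(self, entries: int = 16) -> None:
        self.entries = entries
        self.counts: Dict[int, int] = {}   # loop head address -> iteration count

    def observe(self, pc: int, next_pc: int) -> None:
        """Record one fetched address and the address fetched right after it."""
        if next_pc >= pc:                  # fall-through or forward jump: ignore
            return
        if next_pc not in self.counts and len(self.counts) >= self.entries:
            # Assumed replacement policy: evict the least frequently seen loop.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[next_pc] = self.counts.get(next_pc, 0) + 1

    def frequent_loops(self, top: int = 1) -> List[Tuple[int, int]]:
        """Return the `top` most frequent loop heads with their counts."""
        ranked = sorted(self.counts.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top]


# Example trace: the backward branch at 0x10C to 0x100 is taken 8 times.
profiler = LoopProfiler()
for _ in range(8):
    for pc in (0x100, 0x104, 0x108):
        profiler.observe(pc, pc + 4)
    profiler.observe(0x10C, 0x100)
hottest = profiler.frequent_loops()
```

The model never touches the program being profiled, which mirrors the non-intrusive property claimed on the slide.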

18 Decompilation
Decompilation recovers high-level information

19 DMA Configuration
Maps memory accesses to our DMA architecture
–Reads/writes
–Increment/decrement address updates
–Single/block request modes
Optimizes the data-flow graph (DFG) for DMA
–Removes address calculations
–Removes loop counters/exit conditions
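One way to picture this mapping is a record holding the three choices listed on the slide, inferred from the addresses a loop touches. The sketch below is an assumption-heavy illustration: `DmaConfig`, `configure_dma`, the word-addressed list, and the rule that a unit-stride walk becomes a block request are all hypothetical, not taken from the original tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DmaConfig:
    base: int     # first word address touched
    count: int    # number of transfers
    write: bool   # False = read, True = write
    step: int     # +1 = increment, -1 = decrement (in words)
    block: bool   # True = block request, False = single request


def configure_dma(addresses: List[int], write: bool = False) -> DmaConfig:
    """Infer a DMA configuration from the word addresses a loop accesses.

    Raises ValueError for patterns that are not a unit-stride walk.
    """
    if not addresses:
        raise ValueError("loop performs no memory accesses")
    if len(addresses) == 1:
        return DmaConfig(addresses[0], 1, write, step=1, block=False)
    steps = {b - a for a, b in zip(addresses, addresses[1:])}
    if steps not in ({1}, {-1}):
        raise ValueError("unsupported access pattern")
    # Assumption: a unit-stride walk is issued as one block request.
    return DmaConfig(addresses[0], len(addresses), write, step=steps.pop(), block=True)


# Example: a loop reading eight consecutive words starting at address 0.
cfg = configure_dma([0, 1, 2, 3, 4, 5, 6, 7])
```

Once the DMA generates the addresses and counts the transfers, the address arithmetic and loop counter no longer need to appear in the hardware data path, which is the optimization the slide describes.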

20 Bitfile Creation
Combines the placed-and-routed hardware description with the DMA configuration into a bitfile
–Used to initialize the configurable logic
[Figure labels: HW Netlist, DMA Configuration, Bitfile Creation, Bitfile, DMA, R0_Input, Configurable Logic Fabric, R1_InOut]

21 Binary Modification
Updates the application binary in order to utilize the new hardware
–Loop replaced with jump to hw initialization code
Original binary:
loop:
    Load r2, 0(r1)
    Add r1, r1, 1
    Add r3, r3, r2
    Blt r1, 8, loop
after_loop:
    …..
Updated binary:
loop:
    Jump hw_init
    ..
after_loop:
    …..
hw_init:
    1. Initialize HW registers
    2. Enable HW
    3. Shut down processor (woken up by HW interrupt)
    4. Store any results
    5. Jump to after_loop
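The patching step can be mimicked on a list of instruction strings. This is a toy sketch, not the original binary updater: `patch_loop`, the string representation of instructions, the `;` comment lines in the stub, and the `Halt` instruction in the example are hypothetical. It overwrites only the first instruction of the loop so that no other instruction changes position, which is an assumption about how the real tool behaves.

```python
from typing import List


def patch_loop(program: List[str], loop_label: str, exit_label: str) -> List[str]:
    """Return a patched copy of `program` (hypothetical helper).

    The first instruction after `loop_label` becomes a jump to a hardware
    initialization stub, which is appended after the existing code.
    """
    patched = list(program)
    first = patched.index(f"{loop_label}:") + 1   # first instruction of the loop
    patched[first] = "Jump hw_init"
    stub = [
        "hw_init:",
        "; 1. initialize HW registers",
        "; 2. enable HW",
        "; 3. shut down processor (woken up by HW interrupt)",
        "; 4. store any results",
        f"Jump {exit_label}",
    ]
    return patched + stub


program = [
    "loop:",
    "Load r2, 0(r1)",
    "Add r1, r1, 1",
    "Add r3, r3, r2",
    "Blt r1, 8, loop",
    "after_loop:",
    "Halt",               # assumed terminator so execution never falls into the stub
]
patched = patch_loop(program, "loop", "after_loop")
```

The remaining loop instructions stay in place but become unreachable, since control now leaves through `hw_init` and returns at `after_loop`.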

22 Tool Statistics
Executed on SimpleScalar
–Similar to a MIPS instruction set
–Used 60 MHz clock
Statistics
–Total run time of only 1.09 seconds
–Requires less than ½ megabyte of RAM
–Code size much smaller than standard synthesis tools

23 Experiments
Benchmark Information
–Powerstone (Brev, g3fax1&2)
–NetBench (url)
–Logic minimization kernel (logmin)
Statistics
–55% of total time spent in loops that are moved to hardware
–Ideal speedup of 2.8
–These loops were only 2.4% of the size of the original application

24 Experiments
Results
–Achieved average speedup of 2.6, close to ideal 2.8
–Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved
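The relationship between loop time, loop speedup, and overall speedup follows Amdahl's law, sketched below with the averaged figures from the slides (55% of time in loops, 20X faster loops). The function names are hypothetical. Plugging in the averages gives roughly 2.09 overall and an upper bound of roughly 2.22, both lower than the reported 2.6 and 2.8. A plausible reading, stated here as an assumption, is that the slides average per-benchmark speedups; since the law is convex in the loop fraction, that average is at least as large as the speedup computed from the average fraction.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: `loop_fraction` of run time becomes `loop_speedup` times faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def ideal_speedup(loop_fraction: float) -> float:
    """Upper bound: the loops take zero time once moved to hardware."""
    return 1.0 / (1.0 - loop_fraction)


with_hw = overall_speedup(0.55, 20.0)   # roughly 2.09
bound = ideal_speedup(0.55)             # roughly 2.22
```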

25 Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
Achieved average speedup of 2.6
–Very close to ideal speedup of 2.8
Future work
–More complex configurable logic fabric: sequential logic and increased inputs/outputs; support for larger hardware regions, not just simple loops; improved algorithms (especially place and route)
–Handle more complex memory access patterns

