Published by Arlene Baker. Modified over 8 years ago.
1
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture
Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid
Nice Sophia Antipolis University
Speaker: 陳雋中
2
Outline
Introduction
System Architecture
Tool Overview
Experiments
Conclusion
3
Introduction (1/3)
Drawbacks of current dynamic optimizations
  – Currently limited to software optimizations
Alternatively, we could perform hw/sw partitioning
  – Achieves large speedups (2x to 10x is common)
  – However, hw/sw partitioning cannot presently be done dynamically
[Figure: a profiler identifies the critical regions of the software (Sw); the software is split between a processor (Sw) and an ASIC/FPGA (Hw).]
4
Introduction (2/3)
Ideally, we would perform hardware/software partitioning dynamically
  – Most partitioning approaches have complex tool flows
  – Achieves better results than software optimizations
      – >2x speedup, energy savings
Appropriate architecture required
  – Requires a processor and configurable logic
5
Introduction (3/3)
Binary-level hw/sw partitioning
  – Binary is profiled and hardware candidates are determined
  – Binary is updated to use hardware
Many advantages over source-level partitioning
  – Supports any language or software compiler
      – No change in tools
  – Better performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure: binary-level partitioning flow with stages Profiling, Decompilation, Hw Exploration, Behavioral Synthesis, and Binary Updater; a Binary goes in, and a Netlist for the FPGA and an Updated Binary for the Processor come out.]
6
Dynamic Hw/Sw Partitioning
[Figure: system with memory, several microprocessors, configurable logic, and a dynamic partitioning module; the software (SW) is shown executing an add instruction.]
7
Dynamic Hw/Sw Partitioning
[Figure: same system; the software (SW) is shown executing a beq instruction.]
8
Dynamic Hw/Sw Partitioning
[Figure: same system; the add instruction is also shown at the dynamic partitioning module.]
9
Dynamic Hw/Sw Partitioning
[Figure: same system; the beq instruction is also shown at the dynamic partitioning module.]
10
Dynamic Hw/Sw Partitioning
[Figure: same system; the dynamic partitioning module marks the frequent loops in the software (SW).]
11
Dynamic Hw/Sw Partitioning
[Figure: same system; the dynamic partitioning module produces a hardware (HW) version of the frequent loops.]
12
Dynamic Hw/Sw Partitioning
[Figure: same system; the frequent loops are now shown in the configurable logic.]
13
Dynamic Partitioning Module (1/2)
Dynamic partitioning module executes partitioning tools on chip
  – Profiler, partitioning compiler, synthesis, place & route
[Figure: tool chain of Profiler, Partitioning Compiler, Synthesis, and Place & Route inside the dynamic partitioning module, shown next to the memory, microprocessors, and configurable logic.]
14
Dynamic Partitioning Module (2/2)
Dynamically detects frequent loops and then reimplements them in hardware running on the configurable logic
Architectural components
  – Profiler
  – Additional processor and memory
      – But SoCs may have dozens anyway
      – Alternatively, we could share the main processor
[Figure: Profiler, Memory, and Partitioning Co-Processor.]
15
Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
  – R0_Input stores data from memory
  – R1_InOut stores temporary data and data to write back to memory
Fabric
  – Supports combinational logic only
  – Implies that a loop body must be implemented in a single cycle (temporary restriction)
[Figure: DMA, R0_Input, R1_InOut, and the configurable logic fabric.]
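To make the single-cycle restriction concrete, here is a minimal behavioral sketch in Python. It is an illustration, not the authors' implementation: the function name `run_hw_loop`, its parameters, and the idea of passing the loop body as a pure function of the two registers are all assumptions. Only the register names R0_Input and R1_InOut come from the slide.

```python
from typing import Callable, List


def run_hw_loop(memory: List[int], start: int, count: int,
                body: Callable[[int, int], int], r1_init: int = 0) -> int:
    """Model one hardware loop: one combinational evaluation per cycle."""
    r1_inout = r1_init                        # R1_InOut: temporary / write-back data
    for offset in range(count):               # the DMA, not the fabric, walks memory
        r0_input = memory[start + offset]     # R0_Input: word fetched by the DMA
        r1_inout = body(r0_input, r1_inout)   # fabric: pure combinational function
    return r1_inout


# Hypothetical data; the body accumulates a sum, one element per cycle.
data = [3, 1, 4, 1, 5, 9, 2, 6]
total = run_hw_loop(data, start=0, count=8, body=lambda r0, r1: r0 + r1)
print(total)  # 31
```

Because `body` has no state of its own, everything it computes must fit in one pass through combinational logic, which is the restriction the slide describes.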
16
Tool Overview
Tool flow slightly different from standard partitioning flow
  – Decompilation
  – Binary modification
[Figure: tool flow from Binary to HW and Updated Binary, with stages Loop Profiling (yielding small, frequent loops), Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, and Binary Modification.]
17
Loop Profiling
Non-intrusive profiler
  – Monitors instruction bus
Very little overhead
  – Small cache (~16 entries) and 2,300 logic gates
      – Less than 1% power overhead
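The slide gives only the profiler's size, not its algorithm. The Python sketch below shows one plausible way a 16-entry cache could find frequent loops by watching instruction addresses. The detection rule (count taken short backward branches), the eviction policy (drop the coldest entry), and all names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class LoopProfiler:
    """Toy frequent-loop detector fed with (pc, next_pc) address pairs."""

    def __init__(self, entries: int = 16, max_loop_bytes: int = 256) -> None:
        self.entries = entries                # cache size (~16 on the slide)
        self.max_loop_bytes = max_loop_bytes  # hypothetical "small loop" limit
        self.counts: Dict[int, int] = {}      # loop start address -> iterations

    def observe(self, pc: int, next_pc: int) -> None:
        # A taken short backward branch is treated as one loop iteration.
        distance = pc - next_pc
        if not (0 < distance <= self.max_loop_bytes):
            return
        if next_pc not in self.counts and len(self.counts) >= self.entries:
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]          # evict the least-frequent loop
        self.counts[next_pc] = self.counts.get(next_pc, 0) + 1

    def hottest(self, n: int = 1) -> List[Tuple[int, int]]:
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


# Hypothetical trace: a 4-instruction loop at 0x100 whose branch is taken 7 times.
profiler = LoopProfiler()
for _ in range(7):
    for pc in (0x100, 0x104, 0x108):
        profiler.observe(pc, pc + 4)          # sequential fetches are ignored
    profiler.observe(0x10C, 0x100)            # backward branch to the loop start
print(profiler.hottest())  # [(256, 7)]
```

The profiler only reads addresses that already appear on the instruction bus, so the monitored program needs no instrumentation, which matches the "non-intrusive" claim.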
18
Decompilation
Decompilation recovers high-level information
19
DMA Configuration
Maps memory accesses to our DMA architecture
  – Reads/writes
  – Increment/decrement address updates
  – Single/block request modes
Optimizes DFG for DMA
  – Removes address calculations
  – Removes loop counters/exit conditions
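The options listed above can be pictured as a small configuration record plus a pruning pass over the data-flow graph (DFG). The Python sketch below is illustrative only: `DmaConfig`, `DfgNode`, the role tags, and `strip_dma_handled_nodes` are hypothetical names, and the real tool's data structures are not described on the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Direction(Enum):
    READ = "read"
    WRITE = "write"


class Mode(Enum):
    SINGLE = "single"
    BLOCK = "block"


@dataclass(frozen=True)
class DmaConfig:
    direction: Direction
    base_address: int
    step: int    # +1 for increment, -1 for decrement address updates
    count: int   # number of transfers; stands in for the loop counter
    mode: Mode

    def addresses(self) -> List[int]:
        """Addresses the DMA would generate, in order."""
        return [self.base_address + i * self.step for i in range(self.count)]


@dataclass(frozen=True)
class DfgNode:
    name: str
    role: str    # hypothetical tags: "address", "counter", "exit", "compute"


def strip_dma_handled_nodes(dfg: List[DfgNode]) -> List[DfgNode]:
    """Drop the nodes whose work the DMA takes over."""
    handled = {"address", "counter", "exit"}
    return [node for node in dfg if node.role not in handled]


# An 8-element summing loop; the load itself becomes the DMA read.
cfg = DmaConfig(Direction.READ, base_address=0, step=1, count=8, mode=Mode.BLOCK)
dfg = [
    DfgNode("add r1, r1, 1", "address"),
    DfgNode("add r3, r3, r2", "compute"),
    DfgNode("blt r1, 8, loop", "exit"),
]
remaining = strip_dma_handled_nodes(dfg)
print(cfg.addresses())               # [0, 1, 2, 3, 4, 5, 6, 7]
print([n.name for n in remaining])   # ['add r3, r3, r2']
```

After pruning, only the accumulate operation is left for the fabric, which is what makes a single-cycle combinational loop body feasible.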
20
Bitfile Creation
Combines the placed-and-routed hardware description with the DMA configuration into a bitfile
  – Used to initialize the configurable logic
[Figure: the HW Netlist and the DMA Configuration feed Bitfile Creation, which produces the Bitfile for the DMA, R0_Input, R1_InOut, and the configurable logic fabric.]
21
Binary Modification
Updates the application binary in order to utilize the new hardware
  – Loop replaced with jump to hw initialization code

Original binary:

    loop:
        Load r2, 0(r1)
        Add  r1, r1, 1
        Add  r3, r3, r2
        Blt  r1, 8, loop
    after_loop:
        …..

Updated binary:

    loop:
        Jump hw_init
        ..
    after_loop:
        …..

    hw_init:
        1. Initialize HW registers
        2. Enable HW
        3. Shut down processor (woken up by HW interrupt)
        4. Store any results
        5. Jump to after_loop
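The patching step can be sketched in Python on a toy program held as a list of instruction strings. This is an assumption-laden illustration: `patch_loop` is a hypothetical helper, list indices stand in for addresses, and padding the rest of the loop with `nop` is a choice made here so later addresses do not move. The slide shows only that the loop starts with a jump to `hw_init`.

```python
from typing import List


def patch_loop(program: List[str], loop_start: int, loop_end: int) -> List[str]:
    """Return a copy where program[loop_start:loop_end] jumps to hw_init."""
    if not 0 <= loop_start < loop_end <= len(program):
        raise ValueError("invalid loop range")
    patched = list(program)                # leave the original binary untouched
    patched[loop_start] = "jump hw_init"   # first loop slot redirects to the stub
    for i in range(loop_start + 1, loop_end):
        patched[i] = "nop"                 # keep the code after the loop in place
    return patched


# The loop from the slide, followed by one hypothetical after_loop instruction.
program = [
    "load r2, 0(r1)",
    "add r1, r1, 1",
    "add r3, r3, r2",
    "blt r1, 8, loop",
    "store r3, 0(r4)",
]
patched = patch_loop(program, loop_start=0, loop_end=4)
print(patched)
```

Keeping the program length unchanged means the `hw_init` stub can return with a plain jump to the original `after_loop` address.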
22
Tool Statistics
Executed on SimpleScalar
  – Similar to a MIPS instruction set
  – Used 60 MHz clock
Statistics
  – Total run time of only 1.09 seconds
  – Requires less than ½ megabyte of RAM
  – Code size much smaller than standard synthesis tools
23
Experiments
Benchmark information
  – Powerstone (Brev, g3fax1&2)
  – NetBench (url)
  – Logic minimization kernel (logmin)
Statistics
  – 55% of total time spent in loops that are moved to hardware
  – Ideal speedup of 2.8
  – These loops were only 2.4% of the size of the original application
24
Experiments
Results
  – Achieved average speedup of 2.6, close to ideal 2.8
  – Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved
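As a reading aid, the relation between loop time and overall speedup can be sketched with Amdahl's law in Python. The function name and the choice to plug in the deck's aggregate figures (55% of time in loops, loops 20X faster) are assumptions made here. The aggregate figures do not reproduce the reported 2.6 and 2.8, which would be expected if those are averages of per-benchmark speedups; the slides do not say how they were averaged.

```python
def overall_speedup(hw_fraction: float, hw_speedup: float) -> float:
    """Amdahl's law: overall speedup when hw_fraction of run time is accelerated."""
    if not 0.0 <= hw_fraction <= 1.0:
        raise ValueError("hw_fraction must be between 0 and 1")
    if hw_speedup <= 0.0:
        raise ValueError("hw_speedup must be positive")
    remaining = (1.0 - hw_fraction) + hw_fraction / hw_speedup
    return float("inf") if remaining == 0.0 else 1.0 / remaining


# Aggregate figures from the deck: 55% of time in loops, loops 20X faster.
with_hw = overall_speedup(0.55, 20.0)
ideal = overall_speedup(0.55, float("inf"))  # loops take zero time
print(round(with_hw, 2), round(ideal, 2))    # 2.09 2.22
```

Since 1 / (1 - f) is convex in f, averaging per-benchmark ideal speedups can only give a value at or above the 2.22 computed from the average loop fraction, so a reported ideal of 2.8 is consistent with this reading.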
25
Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
Achieved average speedup of 2.6
  – Very close to ideal speedup of 2.8
Future work
  – More complex configurable logic fabric
      – Sequential logic and increased inputs/outputs
      – Support larger hardware regions, not just simple loops
      – Improved algorithms (especially place and route)
  – Handle more complex memory access patterns