Dynamic Hardware/Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid
Department of Computer Science and Engineering, University of California, Riverside


1 Dynamic Hardware/Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

2 Introduction
Dynamic optimizations an increasing trend
  – Examples
    Dynamo
      – Dynamic software optimizations
    Transmeta Crusoe
      – Dynamic code morphing
    Just In Time Compilation
      – Interpreted languages
Advantages
  – Transparent optimizations
    No designer effort
    No tool restrictions
  – Adapts to actual usage

3 Introduction
Drawbacks of current dynamic optimizations
  – Currently limited to software optimizations
    Limited speedup (1.1x to 1.3x common)
Alternatively, we could perform hw/sw partitioning
  – Achieves large speedups (2x to 10x common)
  – However, dynamic hw/sw partitioning is not presently possible
[Figure labels: Sw, Hw, Profiler, Critical Regions, Processor, ASIC/FPGA]

4 Introduction
Ideally, we would perform hardware/software partitioning dynamically
  – Transparent partitioning
    Supports all sw languages/tools
    Most partitioning approaches have complex tool flows
  – Achieves better results than software optimizations
    >2x speedup, energy savings
  – Adapts to actual usage
Appropriate architecture required
  – Requires a processor and configurable logic

5 Introduction
Microprocessor/FPGA single-chip platforms make partitioning more attractive
  – More efficient communication, smaller size
    Higher performance, low power
Examples
  – Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
Makes dynamic hw/sw partitioning more feasible
  – However, partitioning must be performed at binary level
[Figure labels: FPGA, Processor, 1990s, 2003]

6 Introduction
Binary-level hw/sw partitioning
  – Binary is profiled and hardware candidates are determined
  – Regions to be partitioned are decompiled into a control/data flow graph (CDFG)
  – CDFG is synthesized to hardware
  – Binary is updated to use hardware
Many advantages over source-level partitioning
  – Supports any language or software compiler
    No change in tools
  – Better software size and performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure labels: Binary, Profiling, Hw Exploration, Decompilation, Behavioral Synthesis, Binary Updater, Netlist, Updated Binary, Processor, FPGA]

7–13 Dynamic Hw/Sw Partitioning (animation sequence)
[Figure: system with memory, dynamic partitioning module, configurable logic, and several microprocessors. Labels added in each frame:
  7: SW, add
  8: SW, beq
  9: SW, add; Dynamic Partitioning Module, add
  10: SW, beq; Dynamic Partitioning Module, beq
  11: SW; Dynamic Partitioning Module, Frequent Loops
  12: SW; Dynamic Partitioning Module, Frequent Loops; HW, Frequent Loops
  13: SW, Frequent Loops; Configurable Logic, Frequent Loops]

14 Dynamic Partitioning Module
Dynamic partitioning module executes partitioning tools on chip
  – Profiler, partitioning compiler, synthesis, place & route
[Figure labels: SW Source, SW Binary, Profiler, Partitioning Compiler, Synthesis, Place & Route, HW; Memory, Dynamic Partitioning Module, Configurable Logic, Microprocessors]

15 Dynamic Partitioning Module
Synthesis and place & route tools all moved on-chip
  – These tools typically execute on powerful workstations
  – Most people will cringe at the idea of moving these tools on-chip
However, dynamic partitioning deals with small regions of code
  – Typically, small innermost loops
Therefore, we can develop lean tools that work specifically for these small loops
  – Lean tools make on-chip execution possible
Area overhead becoming less critical due to Moore's Law

16 System Architecture
Microprocessors
  – MIPS (may be many)
On-chip memory
Configurable logic
Dynamic partitioning module
[Figure labels: Memory, Dynamic Partitioning Module, Configurable Logic, Microprocessors]

17 Dynamic Partitioning Module
Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
Architectural components
  – Profiler
  – Additional processor and memory
    But SOCs may have dozens anyway
    Alternatively, we could share main processor
[Figure labels: Memory, Profiler, Partitioning Co-Processor]

18 Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
  – R0_Input stores data from memory
  – R1_InOut stores temporary data & data to write back to memory
Fabric
  – Supports combinational logic
  – Implies loops must have body implemented in single cycle (temporary restriction)
[Figure labels: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]

19 Configurable Logic Fabric
Fabric
  – 3-input 2-output LUTs surrounded by switch matrices
Switch Matrix
  – Connect wire to same channel on different side
LUT
  – 3-input (8 word) 2-output SRAM
[Figure labels: Configurable Logic Fabric with LUTs and switch matrices (SM); Switch Matrix with channels 0 to 3 on each side; LUT with Inputs, SRAM (8x2), Outputs]

20 Tool Overview
[Figure labels: Binary, Loop Profiling, Small Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, HW, Binary Modification, Updated Binary]
Tool flow slightly different from standard partitioning flow
  – Decompilation
  – Binary modification

21 Loop Profiling
Non-intrusive profiler
  – Monitors instruction bus
Very little overhead
  – Small cache (~16 entries) and 2,300 logic gates
Less than 1% power overhead
[Figure labels: Microprocessor, Frequent Loop Cache, Frequent Loop Cache Controller, ++, rd/wr, addr, data, sbb, saturation, To L1 Memory]
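
A rough software model of such a profiler is sketched below. It assumes the figure's `sbb` label denotes a short-backward-branch signal and that `++` and `saturation` refer to saturating counters; the class name, the counter width, and the eviction policy are illustrative assumptions, not details taken from the slides.

```python
class FrequentLoopCache:
    """Toy model of a small loop-frequency cache (all names are hypothetical)."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.counts = {}  # branch target address -> saturating counter

    def observe_backward_branch(self, target_addr):
        """Record one taken backward branch to `target_addr` (a loop iteration)."""
        if target_addr not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed replacement policy: evict the least frequent entry.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[target_addr] = 0
        # Saturating increment: the counter never wraps around.
        self.counts[target_addr] = min(self.counts[target_addr] + 1, self.max_count)

    def most_frequent(self, n=1):
        """Return the `n` loop addresses with the highest counts."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]


# Usage: feed the model the targets of backward branches seen on the bus.
cache = FrequentLoopCache()
for addr in [0x400, 0x400, 0x400, 0x480, 0x400]:
    cache.observe_backward_branch(addr)
hottest = cache.most_frequent(1)
```

Because the model only observes addresses, it mirrors the non-intrusive property claimed on the slide: the profiled program is never modified.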

22 Decompilation
Decompilation recovers high-level information
Creates optimized CDFG
  – All instruction-set inefficiencies are removed
Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
  – [Greg Stitt, Frank Vahid, ICCAD 2002]

23 DMA Configuration
Maps memory accesses to our DMA architecture
  – Reads/writes
  – Increment/decrement address updates
  – Single/block request modes
Optimizes DFG for DMA
  – Removes address calculations
  – Removes loop counters/exit conditions
[Figure labels: r1, 1, +, Read, r2, r3; Memory Read, Increment Address, Block Request; DMA Read, +, r2, r3]

24 Register Transfer Synthesis
Maps DFG operations to hw library components
  – Adders, Comparators, Multiplexors, Shifters
Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions
    r4[0] = r1[0] xor r2[0],  carry[0] = r1[0] and r2[0]
    r4[1] = (r1[1] xor r2[1]) xor carry[0],  carry[1] = …
    …
[Figure labels: r1, r2, +, r4, 32-bit adder; r38, <, r5, 32-bit comparator]
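
The per-bit expressions above are those of a ripple-carry adder. As a minimal sketch (the function name and the default width are assumptions for illustration), the same expansion can be checked in software using only xor, and, and or:

```python
def ripple_carry_add(a, b, width=32):
    """Add two unsigned integers bit by bit, as a ripple-carry adder would."""
    result = 0
    carry = 0  # carry into bit 0 is zero, so bit 0 reduces to a xor b
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        sum_i = a_i ^ b_i ^ carry                    # r4[i] = (r1[i] xor r2[i]) xor carry[i-1]
        carry = (a_i & b_i) | (carry & (a_i ^ b_i))  # carry[i]
        result |= sum_i << i
    return result  # the final carry-out is dropped, as in a fixed-width register


total = ripple_carry_add(123456, 654321)
```

Each loop iteration corresponds to one pair of Boolean equations that RT synthesis would emit for an adder component.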

25 Logic Synthesis
Optimizes Boolean equations from RT synthesis
  – Large opportunity for logic minimization due to use of immediate values in the binary
Simple on-chip 2-level logic minimization method
  – Lysecky/Vahid DAC’03, session 20.4 (9:45 Wed)
Example: r2 = r1 + 4
  Before minimization:
    r2[0] = r1[0] xor 0 xor 0
    r2[1] = r1[1] xor 0 xor carry[0]
    r2[2] = r1[2] xor 1 xor carry[1]
    r2[3] = r1[3] xor 0 xor carry[2]
    …
  After minimization:
    r2[0] = r1[0]
    r2[1] = r1[1] xor carry[0]
    r2[2] = r1[2]’ xor carry[1]
    r2[3] = r1[3] xor carry[2]
    …
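
The simplification in this example comes from folding the constant bits of the immediate into the xor expressions. The sketch below shows only that constant folding; it is not the 2-level minimization method cited on the slide, and the function names and data representation are hypothetical.

```python
def simplify_xor(terms):
    """Fold constant 0/1 terms out of an xor of terms.

    `terms` mixes signal names (str) and constants (0 or 1).
    Returns (signals, inverted): the remaining signals, and whether the
    result must be complemented (an odd number of 1 constants).
    """
    signals = [t for t in terms if isinstance(t, str)]
    inverted = sum(t for t in terms if not isinstance(t, str)) % 2 == 1
    return signals, inverted


def format_xor(signals, inverted):
    """Render the simplified xor, marking a complement with a trailing quote."""
    if not signals:
        return "1" if inverted else "0"
    first = signals[0] + ("'" if inverted else "")
    return " xor ".join([first] + signals[1:])


# Bit 2 of r1 + 4: the immediate contributes a constant 1.
bit2 = format_xor(*simplify_xor(["r1[2]", 1, "carry[1]"]))
```

Xor with 0 leaves a signal unchanged and xor with 1 complements it, which is why every constant term disappears from the equations.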

26 Technology Mapping
Maps logic operations to 3-input, 2-output LUTs
  1. Traverse logic network and combine nodes to determine single-output LUTs
  2. Combine nodes to form two-output LUTs
[Figure: 3-input, 2-output LUTs]
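
To make the target of the mapping concrete: a 3-input, 2-output LUT is an 8-word by 2-bit table indexed by its inputs. One adder bit slice (sum and carry-out from two operand bits and a carry-in) has exactly three inputs and two outputs, so it fits in a single such LUT. The sketch below builds that table; the helper names are assumptions, and whether the authors' mapper packs adders this way is not stated on the slides.

```python
def build_lut(func):
    """Build the 8-word, 2-bit table for a 3-input, 2-output Boolean function."""
    table = []
    for addr in range(8):
        a, b, c = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        table.append(func(a, b, c))
    return table


def full_adder(a, b, cin):
    """Return (sum, carry_out) for one adder bit slice."""
    return (a ^ b ^ cin, (a & b) | (cin & (a ^ b)))


def lut_read(table, a, b, c):
    """Look up both outputs for one input combination."""
    return table[(a << 2) | (b << 1) | c]


adder_lut = build_lut(full_adder)
sum_bit, carry_out = lut_read(adder_lut, 1, 1, 0)
```

Combining two single-output nodes that share the same three inputs into one table, as done here for sum and carry, is the kind of merge that step 2 on the slide describes.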

27 Placement
Nodes along critical path are placed in single horizontal row
Build dependencies between remaining nodes and placed nodes
  – Use dependencies to place remaining nodes
    Either above or below placed nodes
[Figure: placed LUTs]
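
The first placement step can be sketched as a longest-path search over the LUT dependency graph, followed by assigning those nodes to consecutive columns of one row. This sketch assumes unit delay per node and an acyclic graph; the function names and the coordinate convention are hypothetical, and the handling of the remaining nodes is left out.

```python
def critical_path(deps):
    """Return the longest chain of nodes in a DAG.

    `deps` maps each node to the list of nodes it depends on.
    """
    memo = {}

    def longest_to(node):
        # Longest chain ending at `node`, computed once per node.
        if node not in memo:
            best = []
            for pred in deps.get(node, []):
                candidate = longest_to(pred)
                if len(candidate) > len(best):
                    best = candidate
            memo[node] = best + [node]
        return memo[node]

    nodes = set(deps) | {p for preds in deps.values() for p in preds}
    return max((longest_to(n) for n in sorted(nodes)), key=len)


def place_critical_row(deps, row=0):
    """Place critical-path nodes left to right in a single row: node -> (column, row)."""
    return {node: (col, row) for col, node in enumerate(critical_path(deps))}


# Hypothetical netlist: c depends on a and b, b on a, d on c, e on a.
netlist = {"c": ["a", "b"], "d": ["c"], "b": ["a"], "e": ["a"]}
placement = place_critical_row(netlist)
```

In this example node `e` is not on the critical path and would be placed above or below the row in the second step.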

28 Routing
Greedy algorithm
  1. At each switch matrix, choose direction to route
  2. Continue to route until reaching switch matrix that is already in use
  3. Backtrack to previous switch matrix, and try another direction
Place and route most complex task; currently working on improvements
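
The three steps can be sketched as a depth-first search over a grid of switch matrices. The grid model, the "closest to the goal first" ordering, and all names below are assumptions made for illustration; the slides do not say how the direction is chosen.

```python
def route(start, goal, width, height, in_use):
    """Greedy routing with backtracking between two switch matrices.

    Coordinates are (x, y) tuples; `in_use` is a set of switch matrices
    already claimed by other nets. Returns the path as a list, or None.
    """
    path = [start]
    visited = {start}

    def step(pos):
        if pos == goal:
            return True
        x, y = pos
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        # Step 1: choose a direction, trying the one nearest the goal first.
        neighbours.sort(key=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]))
        for nxt in neighbours:
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            # Step 2: stop at a switch matrix that is already in use.
            if nxt in in_use or nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()  # Step 3: backtrack and try another direction.
        return False

    return path if step(start) else None


# Route around one occupied switch matrix on a 3x3 grid.
detour = route((0, 0), (2, 0), 3, 3, in_use={(1, 0)})
```

The search is simple and needs little memory, at the cost of sometimes producing longer routes than a shortest-path router would.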

29 Bitfile Creation
Combines placed-and-routed hardware description with DMA configuration into bitfile
  – Used to initialize the configurable logic
[Figure labels: HW Netlist, DMA Configuration, Bitfile Creation, Bitfile; DMA, R0_Input, Configurable Logic Fabric, R1_InOut]

30 Binary Modification
Updates the application binary in order to utilize the new hardware
  – Loop replaced with jump to hw initialization code
  – Wisconsin Architectural Research Tool Set (WARTS)
    EEL (Executable Editing Library)
  – We assume memory is RAM or programmable ROM
Original binary:
    loop:
      Load r2, 0(r1)
      Add r1, r1, 1
      Add r3, r3, r2
      Blt r1, 8, loop
    after_loop:
      …..
Updated binary:
    loop:
      Jump hw_init
      ..
    after_loop:
      …..
    hw_init:
      1. Initialize HW registers
      2. Enable HW
      3. Shutdown processor
         Woken up by HW interrupt
      4. Store any results
      5. Jump to after_loop
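
The patch itself is small: only the first instruction of the loop changes. The sketch below models the binary as a list of (label, instruction) pairs and does not use the EEL API; the representation and the function name are hypothetical.

```python
def patch_loop(program, loop_label, hw_init_label="hw_init"):
    """Return a copy of `program` whose loop entry jumps to the hw init code.

    `program` is a list of (label, instruction) pairs; label is None for
    unlabeled instructions. The original list is left untouched.
    """
    patched = list(program)
    for i, (label, _instr) in enumerate(patched):
        if label == loop_label:
            patched[i] = (label, f"Jump {hw_init_label}")
            return patched
    raise ValueError(f"label {loop_label!r} not found")


original = [
    ("loop", "Load r2, 0(r1)"),
    (None, "Add r1, r1, 1"),
    (None, "Add r3, r3, r2"),
    (None, "Blt r1, 8, loop"),
    ("after_loop", "..."),
]
updated = patch_loop(original, "loop")
```

Overwriting one instruction in place keeps every other address in the binary unchanged, which is consistent with the slide's assumption that the program memory is writable.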

31 Tool Statistics
Executed on SimpleScalar
  – Similar to a MIPS instruction set
  – Used 60 MHz clock (like Triscend A7 device)
Statistics
  – Total run time of only 1.09 seconds
  – Requires less than ½ megabyte of RAM
  – Code size much smaller than standard synthesis tools

32 Experiments
Benchmark Information
  – Powerstone (Brev, g3fax1&2)
  – NetBench (url)
  – Logic minimization kernel (logmin)
Statistics
  – 55% of total time spent in loops that are moved to hardware
  – Ideal speedup of 2.8
  – These loops were only 2.4% of the size of the original application

33 Experiments
Results
  – Achieved average speedup of 2.6, close to ideal 2.8
  – Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved
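
The relationship between loop speedup and overall speedup follows Amdahl's law. The sketch below assumes that "ideal" means the hardware loops take zero time; the function names are hypothetical and the 50% figure in the usage lines is an illustrative input, not a measurement. The numbers on the slides are reported as averages over the benchmarks, so evaluating the formula at the average loop fraction need not reproduce them.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: speedup when `loop_fraction` of run time runs `loop_speedup` times faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def ideal_speedup(loop_fraction):
    """Upper bound on speedup: the accelerated loops take no time at all."""
    return 1.0 / (1.0 - loop_fraction)


# Illustrative input: half of the run time is spent in loops moved to hardware.
bound = ideal_speedup(0.5)
achieved = overall_speedup(0.5, 20)
```

When the hardware loops are much faster than software, the achieved speedup approaches the bound, which is the pattern the slide reports.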

34 Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
  – Completely transparent
  – Designers get performance/energy benefits of hw/sw partitioning by simply writing software
  – Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
Achieved average speedup of 2.6
  – Very close to ideal speedup of 2.8
Future work
  – More complex configurable logic fabric
    Designed in close conjunction with on-chip CAD tools
    Sequential logic and increased inputs/outputs
    Support larger hardware regions, not just simple loops
    Improved algorithms (especially place and route)
  – Handle more complex memory access patterns

