A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering.


Presentation transcript:

1 A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. Roman Lysecky, Frank Vahid*, Department of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems at UC Irvine. This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

2 Introduction: Dynamic Software Optimization. Dynamic optimizations are increasingly common: Dynamo (dynamic software optimizations); Transmeta Crusoe and Efficeon (dynamic code morphing); Just-In-Time (JIT) compilation (interpreted languages). Advantages: transparent optimizations; no designer effort; no tool restrictions; adapts to actual usage. Drawbacks: currently limited to software optimizations; limited speedup (1.1x to 1.3x common).

3 Introduction: Hardware/Software Partitioning. [Figure: a profiler identifies the critical regions of the software; those regions move to hardware on an ASIC/FPGA while the remaining software stays on the processor.] Benefits: speedups of 2x to 10x typical; speedups of 800x possible; far more potential than dynamic SW optimizations (1.2x); energy reductions of 25% to 95% typical.

4 Introduction: Traditional Hardware/Software Partitioning. [Figure (flow): SW, profiling, CAD tools; outputs a SW binary for the processor and a netlist for the ASIC/FPGA.] Requires specialized CAD tools and non-standard partitioning compilers.

5 Introduction: Binary Hardware/Software Partitioning. Binary partitioning [Stitt/Vahid ICCAD'02] [Banerjee DATE'03]: partition the application starting from the SW binary; can be desktop based. Advantages: use any standard compiler; supports any language; supports multiple sources from multiple languages; supports assembly/object code; supports legacy code. Disadvantage: loses some high-level information, so there may be some loss of quality. [Figure (flow): SW, standard compiler, binary, profiling, CAD tools; outputs a modified binary for the processor and a netlist for the ASIC/FPGA.]

6 Introduction: Dynamic Hardware/Software Partitioning. Dynamic HW/SW partitioning: embed the HW/SW partitioning CAD tools on-chip; feasible in the era of billion-transistor chips. Advantages: does not require any special compilers; completely transparent; brings the benefits of HW/SW partitioning to all SW designers. Complements other approaches: desktop CAD is best from a purely technical perspective, while dynamic partitioning opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD. [Figure (flow): SW, standard compiler, binary; on chip: profiling, CAD, FPGA, processor.]

7 Introduction: Warp Processors. [Figure: MIPS/ARM processor with I$ and D$, configurable logic, profiler, and dynamic partitioning module (DPM).] Steps: (1) initially execute the application in software only; (2) profile the application to determine critical regions; (3) partition critical regions to hardware; (4) program the configurable logic and update the software binary; (5) the partitioned application executes faster with lower energy consumption.
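The profiling step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the profiler counts taken backward branches and treats the most frequent branch target as the critical loop; the slides do not describe the profiling mechanism, and `find_critical_loop` and the trace format are hypothetical.

```python
from collections import Counter


def find_critical_loop(branch_trace):
    """Return (target_address, count) for the most frequent backward branch.

    branch_trace is an iterable of (branch_address, target_address) pairs for
    taken branches. A branch whose target is at or before its own address is
    treated as a loop back-edge (an assumption of this sketch).
    """
    counts = Counter(
        target for source, target in branch_trace if target <= source
    )
    if not counts:
        return None  # no loops observed
    return counts.most_common(1)[0]


# Hypothetical trace: a hot loop at 0x100, a colder loop at 0x200,
# and a few forward branches that are ignored.
trace = [(0x120, 0x100)] * 50 + [(0x240, 0x200)] * 5 + [(0x300, 0x340)] * 3
hot_loop = find_critical_loop(trace)
```

Here `hot_loop` identifies the loop starting at 0x100 as the candidate to partition to hardware.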

8 Warp Processors: Requirements & Tools. Warp processor architecture and tools: basic configurable logic architecture; efficient profiling architecture; on-chip CAD tools for HW/SW partitioning (decompilation, synthesis, technology mapping, placement and routing). [Figure (tool flow): binary, decompilation, RT & logic synthesis, technology mapping, placement & routing; outputs: binary and HW. Block diagram: ARM with I$, D$, configurable logic, profiler, DPM.] Develop a new Warp Configurable Logic Architecture (WCLA).

9 Warp Configurable Logic Architecture: Requirements. Robustness: capable of supporting a large set of applications. Simplicity: existing FPGAs are too complex for warp processors, because the design goals of FPGAs are much different. Design the configurable fabric by analyzing architectural features as to their impact on the on-chip CAD tools. Listed goals: fast execution; very low data memory; produce reasonable hardware circuits; efficient interface to memory.

10 Warp Configurable Logic Architecture: data address generators (DADG) and loop control hardware (LCH). Found in most digital signal processors. Provide fast loop execution. Support memory accesses with regular access patterns. Synthesis of an FSM is not required for many critical loops. Configurable logic fabric inputs provide alternative control of loop execution. [Figure: WCLA containing DADG & LCH, configurable logic fabric, Reg0 to Reg2, and a 32-bit MAC, alongside the ARM with I$, D$, profiler, and DPM.]
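A regular access pattern of the kind a data address generator handles can be sketched in software as a base address plus a fixed stride. This is only a behavioral illustration; the function names, the dictionary used as memory, and the parameters `base`, `stride`, and `count` are assumptions, not details from the slides.

```python
def address_stream(base, stride, count):
    """Yield `count` addresses starting at `base`, stepping by `stride` bytes."""
    for i in range(count):
        yield base + i * stride


def run_loop(memory, base, stride, count, body):
    """Loop control: fetch the value at each generated address and apply `body`."""
    return [body(memory[addr]) for addr in address_stream(base, stride, count)]


# Hypothetical data: eight 4-byte words starting at address 0x1000.
memory = {0x1000 + 4 * i: i for i in range(8)}
doubled = run_loop(memory, base=0x1000, stride=4, count=8, body=lambda x: 2 * x)
```

Because the addresses and the iteration count follow directly from three parameters, no separate state machine is needed to sequence such a loop, which is the point the slide makes about FSM synthesis.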

11 Warp Configurable Logic Architecture: integrated 32-bit multiplier-accumulator (MAC). Multiplications are frequently found within critical loops, often in the form of a multiply-accumulate operation. Fast, single-cycle multipliers are large and require many interconnections. [Figure: WCLA block diagram with DADG & LCH, configurable logic fabric, Reg0 to Reg2, and 32-bit MAC, alongside the ARM with I$, D$, profiler, and DPM.]
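The multiply-accumulate pattern mentioned here is the inner step of loops such as a dot product. The sketch below models it in software; truncating the accumulator to 32 bits with unsigned wraparound is an assumption of this example, since the slides state only that the MAC is 32 bits wide.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits (assumed unsigned wraparound)


def mac32(acc, a, b):
    """One multiply-accumulate step: acc + a * b, truncated to 32 bits."""
    return (acc + a * b) & MASK32


def dot_product(xs, ys):
    """Dot product built from repeated multiply-accumulate steps."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac32(acc, a, b)
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])
```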

12 Warp Configurable Logic Architecture: Configurable Logic Fabric. Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs). Each CLB is directly connected to an SM. Switch matrix connections: four short wires connect adjacent SMs; four long wires connect every other SM. [Figure: grid of SMs and CLBs forming the configurable logic fabric, next to the DADG, LCH, and 32-bit MAC.]

13 Warp Configurable Logic Architecture: Combinational Logic Block Design. Several studies have analyzed the impact of LUT and CLB size on overall design area and delay. LUTs with 5 to 6 inputs result in the best performance; LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999, Singh, et al. 1992]. CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]. [Figure: CLB with LUT inputs a to f, outputs o1 to o4, and connections to adjacent CLBs.]

14 Warp Configurable Logic Architecture: Combinational Logic Block Design. Incorporates two 3-input, 2-output LUTs, corresponding to four 3-input LUTs. Allows for good-quality circuits while reducing the complexity of the on-chip CAD tools. Provides routing resources between adjacent CLBs to support carry chains. Comparison: FPGAs favor flexibility (large CLBs, various internal routing resources); the WCLA favors simplicity (limited internal routing, reduced technology mapping complexity). [Figure: CLB with LUT inputs a to f, outputs o1 to o4, and connections to adjacent CLBs.]
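One way to see why a 3-input, 2-output LUT pairs naturally with carry chains: a full adder has exactly three inputs (two operand bits and a carry-in) and two outputs (sum and carry-out). The sketch below models such a LUT as an 8-entry table and chains it into a ripple-carry adder. The class `Lut3x2` and the mapping of adders onto the LUTs are illustrative assumptions; the slides do not describe how carry chains are mapped.

```python
class Lut3x2:
    """3-input, 2-output lookup table: 8 entries, each a pair of output bits."""

    def __init__(self, table):
        if len(table) != 8:
            raise ValueError("a 3-input LUT needs exactly 8 entries")
        self.table = list(table)

    def eval(self, a, b, c):
        # The three input bits form the index into the table.
        return self.table[(a << 2) | (b << 1) | c]


# Configure the LUT as a full adder: outputs are (sum, carry_out).
FULL_ADDER = Lut3x2(
    [(a ^ b ^ c, (a & b) | (a & c) | (b & c))
     for a in (0, 1) for b in (0, 1) for c in (0, 1)]
)


def ripple_add(x, y, width):
    """Add two `width`-bit values by chaining one full-adder LUT per bit."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = FULL_ADDER.eval((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the final carry-out
```

For example, `ripple_add(5, 3, 4)` passes the carry from each bit position to the next, which is the role the routing between adjacent CLBs would play in hardware.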

15 Warp Configurable Logic Architecture: Switch Matrix. Each SM is connected using eight channels per side: four short channels (0 to 3) and four long channels (0L to 3L). Routes connect wires from different sides using the same channel. Each short channel is associated with a single long channel. Wires are routed using a single pair of channels through the configurable logic fabric. Comparison: FPGAs favor flexibility (large routing resources, requiring complex routing algorithms); the WCLA favors simplicity (allowing the design of a less complex routing algorithm). [Figure: switch matrix with channels 0 to 3 and 0L to 3L on each of its four sides.]
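The channel restriction can be expressed as a simple connectivity rule. The sketch below assumes that a short channel and its associated long channel share an index (so `2` pairs with `2L`) and that a route may switch between the two members of a pair inside a switch matrix; the side names and the function names are hypothetical, and the slides do not spell out the exact switching options.

```python
SIDES = ("north", "east", "south", "west")
CHANNELS = ("0", "1", "2", "3", "0L", "1L", "2L", "3L")


def channel_pair(channel):
    """Index of the short/long pair a channel belongs to ('2' and '2L' -> 2)."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    return int(channel.rstrip("L"))


def can_connect(side_a, chan_a, side_b, chan_b):
    """True if the switch matrix may join the two wires under the assumed rule.

    Wires must enter on different sides and belong to the same channel pair.
    """
    if side_a not in SIDES or side_b not in SIDES:
        raise ValueError("unknown side")
    return side_a != side_b and channel_pair(chan_a) == channel_pair(chan_b)
```

Under this rule a router only has to pick one channel pair per net rather than search over arbitrary channel changes, which is consistent with the simplicity argument on the slide.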

16 Results: Benchmarks. Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone. On average, 53% of total software execution time was spent executing a single critical loop (more speedup is possible if more loops are considered). On average, critical loops comprised only 1% of total program size.
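A rough bound follows from the 53% figure via Amdahl's law. This is an illustrative calculation, not a result from the slides; the symbols $f$ (fraction of execution time spent in the accelerated loop) and $s$ (speedup of that loop in hardware) are introduced here.

```latex
S(f, s) = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad
\lim_{s \to \infty} S(0.53, s) = \frac{1}{1 - 0.53} = \frac{1}{0.47} \approx 2.13
```

Because 53% is an average over the benchmarks, the bound for any individual benchmark may be higher or lower.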

17 Results: Experimental Setup. Warp processor: 75 MHz ARM7 processor; configurable logic fabric with a fixed frequency of 60 MHz; dynamic partitioning CAD tools used to map the critical region to hardware, executed on an ARM7 processor and active for roughly 10 seconds to perform partitioning. Traditional HW/SW partitioning: 75 MHz ARM7 processor; Xilinx Virtex-E FPGA (executing at maximum possible speed); software manually partitioned using VHDL; VHDL synthesized using Xilinx ISE 4.1 on a desktop. [Figure: ARM7 with I$, D$, WCLA, profiler, and DPM, compared with ARM7 with I$, D$, and a Xilinx Virtex-E FPGA.]

18 Results: Performance Speedup. Average speedup of 2.1, vs. 2.2 for the Virtex-E. [Chart omitted; one labeled value: 4.1.]

19 Results: Energy Reduction. Average energy reduction of 33%, vs. 36% for the Xilinx Virtex-E. [Chart omitted; one labeled value: 74%.]

20 Context: UCR's Research on Configurable SoCs. Self-tuning, self-configuring, mass-produced ICs. [Figure: MIPS/ARM processor with I$, D$, WCLA, profiler, dynamic partitioning module (DPM), and cache tuner.] Research areas: efficient on-chip profiling [Gordon-Ross, Vahid]; configurable cache [Zhang, Vahid, Najjar]; self-tuning cache [Zhang, Vahid, Lysecky]; binary decompilation, loop unrolling, alias analysis [Stitt, Vahid]; lean on-chip CAD tools [Lysecky, Vahid, Tan].

21 Conclusions & Future Work. Warp configurable logic fabric: supports a wide range of embedded systems applications; designed specifically to allow development of lean on-chip CAD tools; provides excellent results (average speedup of 2.1, average energy reduction of 33%); much better than dynamic software optimizations. Only one loop was partitioned, so more speedup is possible. More recent examples since the DATE publication show 10x speedups; working towards examples with 100x speedups. Future work: partitioning multiple software loops to hardware; synthesizing finite state machines (FSMs); improved synthesis, technology mapping, and place and route.

