1
Dynamic FPGA Routing for Just-in-Time Compilation
Roman Lysecky (a), Frank Vahid (a*), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering, University of California, Riverside
(b) Department of Electrical Engineering, University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.
2
Introduction: Just-in-Time Compilation Has Become Commonplace
- Modern Pentium processors
  - Dynamically translate instructions onto an underlying RISC architecture
- Transmeta Crusoe and Efficeon
  - Dynamic code morphing
  - Translate x86 instructions to an underlying VLIW processor
- Interpreted languages
  - Distribute SW as processor-independent bytecode/source
  - SW typically executed on a virtual machine
  - JIT compile bytecode to the processor’s native instructions
  - Java, Python, etc.
[Figure labels: SW, Profiling, Standard Compiler, SW Binary, Processor, JIT Recompile]
3
Introduction: Just-in-Time Compilation Also Performs Optimization
- Dynamic optimizations are increasingly common
  - Dynamically recompile the binary during execution
  - Dynamo [Bala, et al., 2000]: dynamic software optimizations
    - Identify frequently executed code segments (hotpaths)
    - Recompile with higher optimization
  - BOA [Gschwind, et al., 2000]: dynamic optimizer for PowerPC
- Advantages
  - Transparent optimizations
    - No designer effort
    - No tool restrictions
  - Adapts to actual usage
  - Speedups of up to 20%-30% (1.3X)
- JIT compilation operates on software binaries
4
Introduction: But Today’s Binaries Are More than Just Software
[Figure: a software-only flow (SW, Profiling, Standard Compiler, SW Binary, Processor) beside a combined flow (SW and HW, Profiling, Compiler/Synthesis, Binary, FPGA plus Processor)]
5
Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as a standard binary for a microprocessor
    - Portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
[Figure labels: VHDL/Verilog, Profiling, Standard CAD Tools, Std. HW Binary, JIT FPGA Comp., FPGA]
6
Introduction: One Use of JIT FPGA Compilation
[Figure: a cable TV company ships a feature upgrade as one SW binary to ARM7, ARM9, ARM10, and ARM11 processors]
7
Introduction: One Use of JIT FPGA Compilation
[Figure: the cable TV feature upgrade now includes hardware. The SW binary still targets ARM7, ARM9, ARM10, and ARM11, but the processor-plus-FPGA platforms (FPGA 1 to FPGA 4) each need their own HW netlist (Netlist1 to Netlist4), built from separate HW descriptions (HW1 to HW4)]
8
Introduction: One Use of JIT FPGA Compilation
[Figure: with JIT FPGA compilation, the cable TV feature upgrade ships as one SW binary plus one HW binary; JIT FPGA Comp. targets the processor-plus-FPGA platforms FPGA 1 to FPGA 4]
9
Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute application in software only (µP, I$, D$)
2. Profile application to determine critical regions (Profiler)
3. Partition critical regions to hardware (Dynamic Part. Module, DPM)
4. Program configurable logic and update software binary (Warp Config. Logic Architecture)
5. Partitioned application executes faster with lower energy consumption
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
10
Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: ARM with I$, D$, WCLA, Profiler, and DPM. DPM tool flow labels: Binary, Decompilation, Partitioning, RT Synthesis, Std. HW Binary, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), HW Bitstream, Binary Updater, Updated Binary]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02
11
Introduction: All that CAD on-chip?
- CAD people may first think Just-in-Time FPGA compilation is “absurd”
- CAD tools are extremely complex
  - Require long execution times on powerful desktop workstations
  - Require very large memory resources
  - Usually require GBytes of hard drive space
  - Cost of a complete CAD tools package can exceed $1 million
- All that CAD on-chip?
[Figure: CAD flow stages with run times: Log. Syn. (1 min), Tech. Map (1 min), Place (1-2 mins), Route (2-30 mins); memory labels: 50 MB, 60 MB, 10 MB, 10 MB]
12
Simultaneous FPGA/CAD Design
- Careful simultaneous design of the configurable logic fabric and the CAD tools
- Analyze architectural features for their impact on on-chip Just-in-Time CAD tools
  - Fast execution time
  - Very low data memory
  - Produce reasonable (good) hardware circuits
13
Simultaneous FPGA/CAD Design: Configurable Logic Fabric
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
  - Four short wires connect adjacent SMs
  - Four long wires connect every other SM together
[Figure: grid of SM and CLB blocks]
Lysecky/Vahid, DATE’04
14
Simultaneous FPGA/CAD Design: Combinational Logic Block Design
- Incorporate two 3-input 2-output LUTs
  - Corresponds to four 3-input LUTs
  - Allows for good quality circuits while reducing on-chip CAD tool complexity
- Provide routing resources between adjacent CLBs to support carry chains
[Figure: CLB with LUT inputs a to f, outputs o1 to o4, and connections to adjacent CLBs]
Lysecky/Vahid, DATE’04
15
Simultaneous FPGA/CAD Design: Switch Matrix
- SM connected using eight channels per side
  - Four short channels
  - Four long channels
- Routes wires from different sides using the same channel
  - Each short channel is associated with a single long channel
  - Wires are routed using a single pair of channels through the configurable logic fabric
[Figure: switch matrix with short channels 0 to 3 and long channels 0L to 3L on each side]
Lysecky/Vahid, DATE’04
16
FPGA Routing
- Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Typically use a form of maze routing [Lee, 1961]
  - Routes each net using Dijkstra’s shortest path algorithm
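To make the maze-routing idea concrete, here is a minimal sketch of routing one two-terminal net with Dijkstra's algorithm. The uniform grid, the `blocked` set, and the function name `maze_route` are assumptions for illustration only; they are not taken from the slides or from any particular router.

```python
import heapq

def maze_route(width, height, blocked, source, sink):
    """Return a shortest list of grid cells from source to sink, or None.

    Hypothetical model: every step between neighboring cells costs 1 and
    cells in `blocked` cannot be used.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == sink:
            break
        if d > dist[cell]:
            continue  # stale heap entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height) or nxt in blocked:
                continue
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(heap, (nd, nxt))
    if sink not in dist:
        return None  # no legal path exists
    # Walk the predecessor links back from the sink.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

With unit step costs this behaves like Lee's wavefront expansion; a real router would replace the constant cost with per-resource costs.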
17
FPGA Routing: Pathfinder [Ebeling, et al., 1995]
- Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of routing resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources based on the amount of overuse
    - Rip up all routes and reroute all nets
[Figure: routing resources annotated with the value 1; resources annotated with 2 are marked as congestion]
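The negotiated-congestion loop described on this slide can be sketched as follows. This is a simplified illustration, not Pathfinder's published cost function: the node-based graph, the formula `(1 + history) * (1 + overuse)`, and the names `negotiated_routing` and `shortest_path` are all assumptions made for the example.

```python
import heapq

def shortest_path(adj, cost, src, dst):
    """Dijkstra over nodes, where entering node v costs cost(v)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def negotiated_routing(adj, capacity, nets, max_iters=20):
    """Route all nets, raising the cost of overused nodes between iterations."""
    history = {n: 0.0 for n in adj}
    for _ in range(max_iters):
        usage = {n: 0 for n in adj}  # rip up every route
        routes = {}
        for name, (src, dst) in nets.items():
            def cost(n):
                # Overuse is allowed but penalized (assumed cost model).
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + over)
            path = shortest_path(adj, cost, src, dst)
            if path is None:
                return None
            routes[name] = path
            for n in path:
                usage[n] += 1
        overused = [n for n in adj if usage[n] > capacity[n]]
        if not overused:
            return routes  # legal routing found
        for n in overused:
            history[n] += usage[n] - capacity[n]
    return None
```

In a small example where two nets compete for one node and only one of them has a detour, the growing history cost eventually pushes that net onto the detour while the other keeps the contested node.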
18
FPGA Routing: VPR - Versatile Place and Route [Betz, et al., 1997]
- Uses a modified Pathfinder algorithm
  - Increases performance over the original Pathfinder algorithm
- Routability-driven routing
  - Goal: use the fewest tracks possible
- Timing-driven routing
  - Goal: optimize circuit speed
[Figure: flow from Routing Resource Graph to Route; if the routing is congested (illegal), Rip-up and route again; otherwise Done]
19
JIT FPGA Routing: Riverside On-Chip Router (ROCR)
- Represent routing nets between CLBs as routing between SMs
- Resource graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of an edge equals the number of wires within the channel
- Requires much less memory than VPR as the resource graph is much smaller
[Figure: SM resource graph with an edge labeled 0/4]
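As an illustration of how small such a resource graph can be, the sketch below stores one `[used, capacity]` entry per channel between adjacent SMs. It models only neighbor-to-neighbor channels with a capacity of four wires; the long wires, the dictionary layout, and the names `build_sm_graph` and `use_edge` are assumptions for this example rather than ROCR's actual data structures.

```python
def build_sm_graph(rows, cols, wires_per_channel=4):
    """Map each undirected SM-to-SM edge to a [used, capacity] pair."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # channel to the SM on the right
                edges[frozenset({(r, c), (r, c + 1)})] = [0, wires_per_channel]
            if r + 1 < rows:  # channel to the SM below
                edges[frozenset({(r, c), (r + 1, c)})] = [0, wires_per_channel]
    return edges

def use_edge(edges, a, b):
    """Claim one wire on the channel between SMs a and b.

    Returns False once the channel is used beyond its capacity.
    """
    entry = edges[frozenset({a, b})]
    entry[0] += 1
    return entry[0] <= entry[1]
```

A 3x3 array of SMs needs only 12 such entries, whereas a graph with one node per individual wire grows with the number of wires in every channel.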
20
JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing
- Based on VPR’s routability-driven router
  - Utilizes a similar cost model consisting of base, historical congestion, and current congestion costs
- Routes nets between SMs using a greedy, depth-first routing algorithm
  - Faster than VPR’s traditional breadth-first routing method
  - Requires addition of an adjustment cost to direct ROCR to re-route illegal nets using a different initial routing path
- Ignores illegal routing within SMs
- If congestion exists, rip up and re-route only the illegal routes
  - Reduces computation time during successive routing iterations
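The slide does not spell out the search itself, so the following is only one plausible reading of "greedy, depth-first" routing: from each SM, try the neighbor with the lowest edge cost plus estimated remaining distance first, and backtrack on dead ends. The helper names (`greedy_dfs_route`, `grid_neighbors`, `manhattan`) and the ranking rule are assumptions, not ROCR's implementation.

```python
def grid_neighbors(cols, rows):
    """Return a function listing the in-bounds neighbors of an SM (x, y)."""
    def neighbors(node):
        x, y = node
        cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [(a, b) for a, b in cand if 0 <= a < cols and 0 <= b < rows]
    return neighbors

def manhattan(a, b):
    """Estimated remaining distance between two SMs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def greedy_dfs_route(neighbors, edge_cost, src, dst, estimate=manhattan):
    """Depth-first search that always expands the cheapest-looking neighbor."""
    visited = {src}
    path = [src]

    def extend(node):
        if node == dst:
            return True
        ranked = sorted(neighbors(node),
                        key=lambda n: edge_cost(node, n) + estimate(n, dst))
        for nxt in ranked:
            if nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if extend(nxt):
                return True
            path.pop()  # dead end: backtrack and try the next neighbor
        return False

    return path if extend(src) else None
```

Unlike a breadth-first expansion, this commits to the first promising direction, so it touches far fewer nodes but may return a longer path. Raising `edge_cost` on a congested first edge, in the spirit of the adjustment cost mentioned above, steers the search toward a different initial direction.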
21
JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing
- Assign specific channels to each route
- Construct routing conflict graph
  - Routes conflict if assigning the same channel results in an illegal routing within any SM
- Use Brelaz’s greedy vertex coloring algorithm [Brelaz, 1979]
- If illegal routes exist, rip up illegal routes and repeat global routing
[Figure: switch matrix with channels 0 to 3 and 0L to 3L carrying routes R1 to R4, and the corresponding conflict graph]
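Channel assignment as graph coloring can be sketched with Brelaz's heuristic, commonly known as DSATUR: repeatedly pick the uncolored route whose neighbors already use the most distinct channels, and give it the lowest free channel. The conflict graph below is made up for illustration, and the function name `assign_channels` and the handling of uncolorable routes are assumptions.

```python
def assign_channels(conflicts, num_channels):
    """Color the routing conflict graph with at most num_channels colors.

    conflicts maps each route to the set of routes it conflicts with.
    Returns (channel, illegal): the channel chosen per route, and the
    routes that could not be given a channel.
    """
    channel = {}
    illegal = []
    uncolored = set(conflicts)

    def saturation(route):
        # Number of distinct channels already used by conflicting routes.
        return len({channel[n] for n in conflicts[route] if n in channel})

    while uncolored:
        # Highest saturation first; break ties by number of conflicts.
        route = max(sorted(uncolored),
                    key=lambda r: (saturation(r), len(conflicts[r])))
        used = {channel[n] for n in conflicts[route] if n in channel}
        free = [c for c in range(num_channels) if c not in used]
        if free:
            channel[route] = free[0]
        else:
            illegal.append(route)  # would be ripped up and globally re-routed
        uncolored.remove(route)
    return channel, illegal
```

For example, with the hypothetical conflicts `{"R1": {"R2", "R3", "R4"}, "R2": {"R1", "R3"}, "R3": {"R1", "R2"}, "R4": {"R1"}}`, three channels are enough for a legal assignment, while two channels leave one route of the R1, R2, R3 triangle illegal.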
22
Experiments: Memory Usage
- VPR requires over 50 MB of memory in the worst case, with an average of over 20 MB
- ROCR requires at most 3.6 MB
  - 13X less than VPR on average
23
Experiments: Algorithm Performance
- ROCR is on average 10X faster than VPR (TD, timing-driven)
  - Up to 21X faster for ex5p
24
Experiments: Critical Path Results
- 32% longer critical path than VPR (TD, timing-driven)
- But 10% shorter critical path than VPR (RD, routability-driven)
25
Experiments: Wire Segments
- 10% more wire segments than VPR (TD/RD)
26
Conclusions
- Developed Riverside On-Chip Router (ROCR)
  - Fast, lean on-chip router for JIT FPGA compilation
    - Order of magnitude less memory required
    - On average 10X faster than VPR’s faster routing algorithm
  - Produces acceptable circuit quality
    - Uses only 10% more routing resources
    - Critical path 10% shorter than VPR’s routability-driven router
- JIT FPGA compilation
  - Enables development of a standard HW binary
  - Brings portability of SW design to HW designers
  - Presently requires custom FPGA fabric
- Future work: overhead of mapping the simple fabric onto a commercial fabric?