Dynamic FPGA Routing for Just-in-Time Compilation

Roman Lysecky (a), Frank Vahid (a,*), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Just-in-Time Compilation Has Become Commonplace
- Modern Pentium processors
  - Dynamically translate instructions onto an underlying RISC architecture
- Transmeta Crusoe & Efficeon
  - Dynamic code morphing
  - Translate x86 instructions to an underlying VLIW processor
- Interpreted languages
  - Distribute SW as processor-independent bytecode/source
  - SW typically executed on a virtual machine
  - JIT compile bytecode to the processor’s native instructions
  - Java, Python, etc.
[Figure: SW goes through profiling and a standard compiler to a SW binary; the processor JIT-recompiles that binary]

Introduction: Just-in-Time Compilation Also Performs Optimization
- Dynamic optimizations are increasingly common
  - Dynamically recompile binary during execution
  - Dynamo [Bala, et al., 2000] - dynamic software optimizations
    - Identify frequently executed code segments (hotpaths)
    - Recompile with higher optimization
  - BOA [Gschwind, et al., 2000] - dynamic optimizer for PowerPC
- Advantages
  - Transparent optimizations
    - No designer effort
    - No tool restrictions
  - Adapts to actual usage
  - Speedups of up to 20%-30%
- Limitation: JIT compilation operates on software binaries

Introduction: But Today’s Binaries Are More than Just Software
[Figure: two flows side by side. SW only: profiling and a standard compiler produce a SW binary for the processors. SW plus HW: profiling and compiler/synthesis produce a binary for processors paired with FPGAs]

Introduction: One Use of JIT FPGA Compilation
[Figure, three builds: a cable TV company distributes a feature upgrade to devices based on ARM7, ARM9, ARM10, and ARM11 processors.
 (1) Software only: one SW binary serves all four processors.
 (2) Each processor paired with its own FPGA (FPGA 1-4): every device needs a separate HW netlist (Netlist1-4) alongside the SW binary.
 (3) JIT FPGA compilation on each device: one SW binary and one HW binary serve all four processor/FPGA pairs]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
[Figure: µP with I$ and D$, Profiler, Warp Config. Logic Architecture, and Dynamic Part. Module (DPM)]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: ARM with I$ and D$, Profiler, WCLA, and DPM. DPM stages shown: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation (Logic Synthesis, Tech. Mapping/Packing, Placement, Routing), and Binary Updater. Inputs and outputs shown: binary, std. HW binary, updated binary, HW bitstream]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: All that CAD On-Chip?
- CAD people may first think Just-in-Time FPGA compilation is “absurd”
- CAD tools are extremely complex
  - Require long execution times on powerful desktop workstations
  - Require very large memory resources
  - Usually require GBytes of hard drive space
  - Cost of a complete CAD tools package can exceed $1 million
- All that CAD on-chip?

Stage      Time       Memory
Log. Syn.  1 min      10 MB
Tech. Map  1 min      10 MB
Place      1-2 mins   50 MB
Route      2-30 mins  60 MB

Simultaneous FPGA/CAD Design
- Careful simultaneous design of configurable logic fabric and CAD tools
- Analyze architectural features as to their impact on on-chip Just-in-Time CAD tools
  - Fast execution time
  - Very low data memory
  - Produce reasonable (good) hardware circuits

Simultaneous FPGA/CAD Design: Configurable Logic Fabric
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
  - Four short wires connect adjacent SMs
  - Four long wires connect every other SM together
[Figure: array of CLBs and SMs]
Lysecky/Vahid, DATE’04

Simultaneous FPGA/CAD Design: Combinational Logic Block Design
- Incorporate two 3-input 2-output LUTs
  - Corresponds to four 3-input LUTs
  - Allows for good-quality circuits while reducing on-chip CAD tool complexity
- Provide routing resources between adjacent CLBs to support carry chains
[Figure: CLB with two LUTs, inputs a-f, outputs o1-o4, and connections to the adjacent CLB]
Lysecky/Vahid, DATE’04

Simultaneous FPGA/CAD Design: Switch Matrix
- SMs connected using eight channels per side
  - Four short channels
  - Four long channels
- Routes wires from different sides using the same channel
  - Each short channel is associated with a single long channel
  - Wires are routed using a single pair of channels through the configurable logic fabric
[Figure: switch matrix with short channels labeled 0-3 and long channels labeled 0L-3L]
Lysecky/Vahid, DATE’04

FPGA Routing
- Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Typically use a form of maze routing [Lee, 1961]
  - Routes each net using Dijkstra’s shortest path algorithm
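
The shortest-path search behind maze routing can be sketched as follows. This is a minimal illustration, not the routers discussed in these slides. It assumes a plain rectangular grid of routing nodes, a two-terminal net, and a hypothetical `cost(node)` callback that returns the cost of entering a node. The names `route_net`, `grid_w`, and `grid_h` are made up for this example.

```python
import heapq


def route_net(grid_w, grid_h, source, sink, cost):
    """Dijkstra shortest path between two nodes of a grid_w x grid_h grid.

    `cost(node)` is a hypothetical callback giving the cost of entering
    `node`; the source itself is free. Returns the path as a list of
    (x, y) nodes, or None if the sink cannot be reached.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist[node]:
            continue  # stale heap entry
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < grid_w and 0 <= ny < grid_h):
                continue
            nd = d + cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if sink not in dist:
        return None
    # Walk the predecessor links back from the sink.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

With a uniform cost the search returns a Manhattan-shortest path; raising the cost of a node makes the route detour around it, which is the hook that congestion-aware routers build on.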

FPGA Routing: Pathfinder [Ebeling, et al., 1995]
- Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of routing resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources based on the amount of overuse
    - Rip up all routes and reroute all nets
[Figure: example routing with resource costs and a congested resource]
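
The negotiated-congestion loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Pathfinder cost function: the node cost `(1 + history) * (1 + pres_gain * overuse)`, the parameters `hist_gain` and `pres_gain`, and the demo graph are all chosen for this example. Nets are two-terminal and the graph is assumed to be connected.

```python
import heapq


def make_adj(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def shortest_path(adj, src, dst, node_cost):
    """Dijkstra over a connected graph; cost is paid on entering a node."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue
        for v in adj[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def pathfinder(adj, nets, capacity, max_iters=20, hist_gain=1.0, pres_gain=1.0):
    """Return (routes, iterations) or (None, max_iters) if still congested."""
    history = {n: 0.0 for n in adj}
    for iteration in range(1, max_iters + 1):
        usage = {n: 0 for n in adj}  # rip up all routes
        routes = {}
        for name, (src, dst) in nets.items():
            def node_cost(n):
                # Overuse that would result if this net also took node n.
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + pres_gain * over)
            routes[name] = shortest_path(adj, src, dst, node_cost)
            for n in routes[name]:
                usage[n] += 1
        overused = [n for n in adj if usage[n] > capacity[n]]
        if not overused:
            return routes, iteration  # legal routing found
        for n in overused:
            history[n] += hist_gain * (usage[n] - capacity[n])
    return None, max_iters


# Demo: two nets compete for the shared node M; each has a longer detour.
DEMO_ADJ = make_adj([
    ("A", "M"), ("M", "C"), ("B", "M"), ("M", "D"),
    ("A", "X1"), ("X1", "X2"), ("X2", "X3"), ("X3", "C"),
    ("B", "Y1"), ("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "D"),
])
DEMO_NETS = {"n1": ("A", "C"), "n2": ("B", "D")}
DEMO_CAPACITY = {n: 1 for n in DEMO_ADJ}
```

In the demo, both nets pick the shared node `M` in the first iteration. The history cost added to `M` afterwards pushes the second net onto its detour in the second iteration, so `pathfinder(DEMO_ADJ, DEMO_NETS, DEMO_CAPACITY)` returns a legal routing after two iterations.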

FPGA Routing: VPR – Versatile Place and Route [Betz, et al., 1997]
- Uses modified Pathfinder algorithm
  - Increased performance over original Pathfinder algorithm
- Routability-driven routing
  - Goal: use fewest tracks possible
- Timing-driven routing
  - Goal: optimize circuit speed
[Figure: flow over the routing resource graph: Route; if congestion exists (illegal), Rip-up and route again; otherwise Done]

JIT FPGA Routing: Riverside On-Chip Router (ROCR)
- Represent routing nets between CLBs as routing between SMs
- Resource graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of an edge is equal to the number of wires within the channel
- Requires much less memory than VPR, as the resource graph is much smaller
[Figure: resource graph of SM nodes; an edge is labeled 0/4]
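
A resource graph of this shape can be sketched in a few lines. The layout below is an assumption made for illustration: SMs sit on a `rows` x `cols` grid, short channels join adjacent SMs, long channels join SMs two positions apart, and each channel carries four wires as on the fabric slide. The function name `build_sm_graph` and the `[used, capacity]` edge record are hypothetical, not ROCR's actual data structure.

```python
def build_sm_graph(rows, cols, short_wires=4, long_wires=4):
    """Return {(sm_a, sm_b, kind): [used, capacity]} for an SM grid.

    Nodes are SM coordinates (row, col). Each edge is one channel between
    two SMs, and its capacity is the number of wires in that channel.
    """
    edges = {}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                near = (r + dr, c + dc)
                if near[0] < rows and near[1] < cols:
                    edges[((r, c), near, "short")] = [0, short_wires]
                far = (r + 2 * dr, c + 2 * dc)
                if far[0] < rows and far[1] < cols:
                    edges[((r, c), far, "long")] = [0, long_wires]
    return edges
```

For a 4 x 4 grid this yields 16 nodes and 40 edges (24 short, 16 long), each tracked with a single usage counter. Grouping the wires of a channel into one edge, rather than giving each wire its own node, is what keeps the graph small.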

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing
- Based on VPR’s routability-driven router
  - Utilizes a similar cost model consisting of base, historical congestion, and current congestion costs
- Routes nets between SMs using a greedy, depth-first routing algorithm
  - Faster than VPR’s traditional breadth-first routing method
  - Requires the addition of an adjustment cost to direct ROCR to re-route illegal nets using a different initial routing path
  - Ignores illegal routing within SMs
- If congestion exists, rip up and re-route only the illegal routes
  - Reduces computation time during successive routing iterations
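
One way to picture a greedy, depth-first global router is the sketch below. The slides do not give ROCR's exact cost formula or search order, so everything here is an assumption for illustration: the cost is `base + history + overuse`, neighbours are ranked by that cost plus the Manhattan distance to the sink, only short (adjacent) channels are modelled, and the adjustment cost is left out.

```python
def make_edge_cost(usage, history, capacity, base=1.0):
    """Hypothetical cost model: base + historical + current congestion."""
    def edge_cost(a, b):
        key = frozenset((a, b))
        over = max(0, usage.get(key, 0) + 1 - capacity)
        return base + history.get(key, 0.0) + over
    return edge_cost


def greedy_dfs_route(rows, cols, src, dst, edge_cost):
    """Depth-first search that always tries the cheapest-looking SM first.

    Returns a list of SM coordinates from src to dst, or None. The first
    path found is returned; it is not guaranteed to be the cheapest.
    """
    visited = {src}
    path = [src]

    def neighbours(node):
        r, c = node
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                yield (r2, c2)

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def visit(node):
        if node == dst:
            return True
        candidates = sorted(
            (n for n in neighbours(node) if n not in visited),
            key=lambda n: edge_cost(node, n) + manhattan(n, dst),
        )
        for n in candidates:
            visited.add(n)
            path.append(n)
            if visit(n):
                return True
            path.pop()  # dead end: backtrack
        return False

    return path if visit(src) else None
```

Because the search commits to the first complete path, it touches far fewer nodes than a breadth-first wavefront. The trade-off is that a rerouted net tends to start along the same edge unless its cost changes, which is the situation the adjustment cost on the slide addresses.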

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing
- Assign specific channels to each route
- Construct routing conflict graph
  - Routes conflict if assigning the same channel results in an illegal routing within any SM
- Use Brelaz’s greedy vertex coloring algorithm [Brelaz, 1979]
- If illegal routes exist, rip up illegal routes and repeat global routing
[Figure: routes R1-R4 through a switch matrix with channels 0-3 and 0L-3L, and the corresponding routing conflict graph]
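
Channel assignment by vertex coloring can be sketched with Brelaz's saturation-degree heuristic (often called DSATUR). The sketch assumes the conflict graph is given as a dictionary from each route to the set of routes it conflicts with, and that colors are channel indices `0 .. num_channels - 1`. The function name, the tie-breaking rule, and the example conflict graph in the note below are illustrative, not taken from the slides.

```python
def assign_channels(conflicts, num_channels):
    """Color the routing conflict graph with Brelaz's greedy heuristic.

    Returns (assignment, illegal): `assignment` maps route -> channel
    index, and `illegal` lists routes for which no channel was free.
    """
    assignment = {}
    illegal = []
    uncolored = set(conflicts)
    while uncolored:
        def priority(route):
            # Saturation: distinct channels already used by neighbours.
            saturation = len(
                {assignment[n] for n in conflicts[route] if n in assignment}
            )
            # Highest saturation first, then highest degree, then name.
            return (-saturation, -len(conflicts[route]), route)

        route = min(uncolored, key=priority)
        uncolored.remove(route)
        used = {assignment[n] for n in conflicts[route] if n in assignment}
        free = [ch for ch in range(num_channels) if ch not in used]
        if free:
            assignment[route] = free[0]  # lowest free channel
        else:
            illegal.append(route)  # to be ripped up and globally rerouted
    return assignment, illegal
```

For a hypothetical conflict graph in which R1, R2, and R3 conflict pairwise and R4 conflicts only with R3, four channels are enough to give every route a channel. With only two channels, one route of the triangle ends up in `illegal`, which corresponds to the rip-up and repeat-global-routing step on the slide.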

Experiments: Memory Usage
- VPR requires over 50 MB of memory in the worst case, with an average of over 20 MB
- ROCR requires at most 3.6 MB
  - 13X less than VPR on average

Experiments: Algorithm Performance
- ROCR is on average 10X faster than VPR’s timing-driven router (TD)
  - Up to 21X faster for ex5p

Experiments: Critical Path Results
- 32% longer critical path than VPR’s timing-driven router (TD)
- But 10% shorter critical path than VPR’s routability-driven router (RD)

Experiments: Wire Segments
- 10% more wire segments than VPR (TD/RD)

Conclusions
- Developed Riverside On-Chip Router (ROCR)
  - Fast, lean on-chip router for JIT FPGA compilation
    - Order of magnitude less memory required
    - On average 10X faster than VPR’s faster routing algorithm
  - Produces acceptable circuit quality
    - Uses only 10% more routing resources
    - Critical path 10% shorter than VPR’s routability-driven router
- JIT FPGA compilation
  - Enables development of a standard HW binary
  - Brings portability of SW design to HW designers
  - Presently requires custom FPGA fabric
    - Future work: overhead of mapping the simple fabric onto a commercial fabric?
