Dynamic FPGA Routing for Just-in-Time Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer Science and Engineering.

Presentation transcript:

Dynamic FPGA Routing for Just-in-Time Compilation
Roman Lysecky (a), Frank Vahid (a, *), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering, (b) Department of Electrical Engineering, University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

2 Introduction: Just-in-Time Compilation has Become Commonplace
- Modern Pentium processors
  - Dynamically translate instructions onto underlying RISC architecture
- Transmeta Crusoe & Efficeon
  - Dynamic code morphing
  - Translate x86 instructions to underlying VLIW processor
- Interpreted languages
  - Distribute SW as processor-independent bytecode/source
  - SW typically executed on a virtual machine
  - JIT compile bytecode to processor's native instructions
  - Java, Python, etc.
[Figure: SW is profiled and compiled by a standard compiler into a SW binary; the processor JIT-recompiles the binary]

3 Introduction: Just-in-Time Compilation also Performs Optimization
- Dynamic optimizations are increasingly common
  - Dynamically recompile binary during execution
  - Dynamo [Bala, et al., 2000]: dynamic software optimizations
    - Identify frequently executed code segments (hotpaths)
    - Recompile with higher optimization
  - BOA [Gschwind, et al., 2000]: dynamic optimizer for PowerPC
- Advantages
  - Transparent optimizations
    - No designer effort
    - No tool restrictions
  - Adapts to actual usage
  - Speedups of up to 20%-30%
- JIT compilation operates on software binaries

4 Introduction: But Today's Binaries are More than just Software
[Figure: two flows side by side. Software only: SW, profiling, standard compiler, SW binary, processor. Software plus hardware: SW and HW, profiling, compiler/synthesis, binary, processor with FPGA.]

5 Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor
    - Portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
[Figure: VHDL/Verilog passes through profiling and standard CAD tools to produce a standard HW binary, which an on-chip JIT FPGA compiler maps onto each target FPGA]

6 Introduction: One Use of JIT FPGA Compilation
[Figure: a cable TV company ships a feature upgrade as SW; a single SW binary serves devices with ARM7, ARM9, ARM10, and ARM11 processors]

7 Introduction: One Use of JIT FPGA Compilation
[Figure: the feature upgrade now also includes HW for devices with Processor/FPGA 1 through Processor/FPGA 4; four separate designs (HW1 to HW4) are needed, and each device receives the SW binary plus its own HW netlist (Netlist1 to Netlist4)]

8 Introduction: One Use of JIT FPGA Compilation
[Figure: with JIT FPGA compilation, one SW binary plus one HW binary serves all four processor/FPGA devices]

9 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute application in software only (µP, I$, D$)
2. Profile application to determine critical regions (Profiler)
3. Partition critical regions to hardware (Dynamic Part. Module, DPM)
4. Program configurable logic & update software binary (Warp Config. Logic Architecture)
5. Partitioned application executes faster with lower energy consumption
Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03; Stitt/Vahid, ICCAD'02

10 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor (ARM, I$, D$, WCLA, Profiler, DPM) and the DPM tool flow. Stages shown: decompilation, partitioning, RT synthesis, JIT FPGA compilation (logic synthesis, tech. mapping/packing, placement, routing), and binary updater. Inputs and outputs shown: binary, standard HW binary, HW bitstream, updated binary.]
Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03; Stitt/Vahid, ICCAD'02

11 Introduction: All that CAD on-chip?
- CAD people may first think Just-in-Time FPGA compilation is "absurd"
  - CAD tools are extremely complex
  - Require long execution times on powerful desktop workstations
  - Require very large memory resources
  - Usually require GBytes of hard drive space
  - Cost of a complete CAD tools package can exceed $1 million
- All that CAD on-chip?
[Figure: desktop CAD flow with run times per stage: Log. Syn. 1 min, Tech. Map 1 min, Place 1-2 mins, Route 2-30 mins; memory labels of 50 MB, 60 MB, 10 MB, and 10 MB]

12 Simultaneous FPGA/CAD Design
- Careful simultaneous design of configurable logic fabric and CAD tools
- Analyze architectural features as to their impact on on-chip Just-in-Time CAD tools
  - Fast execution time
  - Very low data memory
  - Produce reasonable (good) hardware circuits

13 Simultaneous FPGA/CAD Design: Configurable Logic Fabric
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
  - Four short wires connect adjacent SMs
  - Four long wires connect every other SM together
[Figure: grid of alternating SM and CLB blocks]
Lysecky/Vahid, DATE'04

14 Simultaneous FPGA/CAD Design: Combinational Logic Block Design
- Incorporate two 3-input 2-output LUTs
  - Corresponds to four 3-input LUTs
  - Allows for good quality circuits while reducing on-chip CAD tool complexity
- Provide routing resources between adjacent CLBs to support carry chains
[Figure: CLB with LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs]
Lysecky/Vahid, DATE'04
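To make the "3-input 2-output LUT" idea concrete, here is a minimal sketch that models such a LUT as an eight-word memory with two bits per word. The class name `Lut3x2` and the full-adder example are illustrative assumptions, not taken from the slides; the full adder is chosen only because it shows why two outputs per LUT suit carry chains.

```python
class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT: 8 words of 2 bits."""

    def __init__(self, f0, f1):
        # Precompute one 2-bit word for every input combination (a, b, c).
        self.words = []
        for addr in range(8):
            a, b, c = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
            self.words.append((f0(a, b, c) & 1, f1(a, b, c) & 1))

    def read(self, a, b, c):
        # The three inputs form the memory address; the word holds both outputs.
        return self.words[(a << 2) | (b << 1) | c]


# Illustrative configuration: a full adder (sum, carry-out) in a single LUT.
full_adder = Lut3x2(
    lambda a, b, c: a ^ b ^ c,
    lambda a, b, c: (a & b) | (a & c) | (b & c),
)
```

Because both outputs share the same three inputs, this LUT stores the same information as two ordinary 3-input LUTs, which is consistent with the slide's count of four 3-input LUTs per CLB.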

15 Simultaneous FPGA/CAD Design: Switch Matrix
- SM connected using eight channels per side
  - Four short channels
  - Four long channels
- Routes wires from different sides using the same channel
  - Each short channel is associated with a single long channel
  - Wires are routed using a single pair of channels through the configurable logic fabric
[Figure: switch matrix with short channels 0-3 and long channels 0L-3L on each side]
Lysecky/Vahid, DATE'04

16 FPGA Routing
- Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Typically use a form of maze routing [Lee, 1961]
  - Routes each net using Dijkstra's shortest path algorithm
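As a rough illustration of maze routing with Dijkstra's algorithm, the sketch below routes one source to one sink over a small grid. The function names, the grid model, and the unit node cost are assumptions made for the example; a real router works on the device's routing resources and handles nets with several sinks.

```python
import heapq


def shortest_route(neighbors, cost, source, sink):
    """Dijkstra's algorithm; returns the node list from source to sink, or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            path = [node]
            while node in prev:          # walk predecessors back to the source
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue                     # stale heap entry
        for nxt in neighbors(node):
            nd = d + cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None


def grid_neighbors(width, height, blocked=frozenset()):
    """Hypothetical routing area: a grid whose blocked cells cannot be used."""
    def neighbors(node):
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in blocked:
                yield (nx, ny)
    return neighbors


blocked = {(1, 0), (1, 1)}
path = shortest_route(grid_neighbors(4, 4, blocked), lambda n: 1, (0, 0), (3, 0))
```

Here the route has to detour around the two blocked cells, so it is longer than the straight-line distance between the endpoints.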

17 FPGA Routing
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of routing resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources based on the amount of overuse
    - Rip up all routes and reroute all nets
[Figure: two nets sharing a routing resource, marked as congestion]
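The negotiated-congestion loop described above can be sketched as follows. This is a simplified illustration, not the Pathfinder implementation: the node cost has the general shape (base + history) x present-congestion penalty, but the base cost of 1, the growth of `pres_fac`, the two-pin nets, and all names are assumptions made for the example.

```python
import heapq


def route_net(graph, node_cost, source, sink):
    """Shortest path (Dijkstra) under the current congestion-aware node costs."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue
        for nxt in graph[node]:
            nd = d + node_cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    return None


def negotiated_congestion(graph, capacity, nets, max_iters=30):
    history = {n: 0.0 for n in graph}        # accumulated congestion cost
    for iteration in range(max_iters):
        usage = {n: 0 for n in graph}        # rip up all routes
        routes = []
        pres_fac = 0.5 * (iteration + 1)     # assumed penalty schedule
        for source, sink in nets:            # reroute all nets
            def node_cost(n):
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + pres_fac * over)
            path = route_net(graph, node_cost, source, sink)
            if path is None:
                return None
            for n in path:
                usage[n] += 1
            routes.append(path)
        overused = [n for n in graph if usage[n] > capacity[n]]
        if not overused:
            return routes                    # legal routing found
        for n in overused:                   # raise cost by the amount of overuse
            history[n] += usage[n] - capacity[n]
    return None


# Hypothetical example: both nets prefer node "m", which has capacity 1.
graph = {"a1": ["m"], "a2": ["m", "x"], "m": ["b1", "b2"],
         "x": ["y"], "y": ["b2"], "b1": [], "b2": []}
capacity = {n: 1 for n in graph}
routes = negotiated_congestion(graph, capacity, [("a1", "b1"), ("a2", "b2")])
```

In this example the first iteration overuses `m`; after its history cost rises, the second net takes the longer detour through `x` and `y`, and the routing becomes legal.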

18 FPGA Routing
- VPR: Versatile Place and Route [Betz, et al., 1997]
  - Uses modified Pathfinder algorithm
    - Increases performance over original Pathfinder algorithm
  - Routability-driven routing
    - Goal: use fewest tracks possible
  - Timing-driven routing
    - Goal: optimize circuit speed
[Figure: flowchart. Build the routing resource graph, route, check for congestion (illegal routing); if yes, rip up and route again; if no, done.]

19 JIT FPGA Routing: Riverside On-Chip Router (ROCR)
- Represent routing nets between CLBs as routing between SMs
- Resource graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of an edge is equal to the number of wires within the channel
- Requires much less memory than VPR, as the resource graph is much smaller
[Figure: resource graph of SM nodes with edges labeled 0/4]
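A resource graph of this kind might be represented as in the sketch below: one node per SM and one capacity-annotated edge per channel. The names are hypothetical, and treating "long" channels as links between SMs two positions apart in a row or column is an assumption based on the earlier fabric description, not a detail confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """One edge of the resource graph: a bundle of wires between two SMs."""
    capacity: int
    used: int = 0

    def label(self):
        return f"{self.used}/{self.capacity}"

    @property
    def overused(self):
        return self.used > self.capacity


def build_sm_graph(rows, cols, wires=4):
    """Return {(sm_a, sm_b, kind): Channel} for a rows x cols array of SMs."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            # Assumed spacing: short channels reach the adjacent SM,
            # long channels reach the SM two positions away.
            for step, kind in ((1, "short"), (2, "long")):
                if c + step < cols:
                    edges[((r, c), (r, c + step), kind)] = Channel(wires)
                if r + step < rows:
                    edges[((r, c), (r + step, c), kind)] = Channel(wires)
    return edges


fabric = build_sm_graph(4, 4)
```

For a 4 x 4 array this gives 16 nodes and a few dozen edges, each carrying only a usage count and a capacity, which illustrates why a graph at SM granularity stays small.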

20 JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing
- Based on VPR's routability-driven router
  - Utilizes similar cost model consisting of base, historical congestion, and current congestion costs
- Routes nets between SMs using a greedy, depth-first routing algorithm
  - Faster than VPR's traditional breadth-first routing method
  - Requires addition of an adjustment cost to direct ROCR to re-route illegal nets using a different initial routing path
- Ignores illegal routing within SMs
- If congestion exists, rip up and re-route only the illegal routes
  - Reduces computation time during successive routing iterations
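The sketch below is one possible reading of "greedy, depth-first routing with selective rip-up", not ROCR's actual code. It assumes a grid of SMs with one channel per adjacent pair, a cost of (1 + history) x (1 + overuse) per channel, and a Manhattan-distance tie-break; the adjustment cost mentioned on the slide is omitted. All names are hypothetical.

```python
def edge(a, b):
    return (a, b) if a <= b else (b, a)


def greedy_dfs_route(neighbors, step_cost, source, sink):
    """Depth-first search that always tries the cheapest unvisited neighbor first."""
    path, visited = [source], {source}

    def extend(node):
        if node == sink:
            return True
        for nxt in sorted(neighbors(node), key=lambda n: step_cost(node, n)):
            if nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if extend(nxt):
                return True
            path.pop()                   # dead end: backtrack
        return False

    return path if extend(source) else None


def route_nets(neighbors, capacity, nets, max_iters=20):
    usage, history, routes = {}, {}, {}
    pending = list(nets)                 # names of nets still to be (re)routed
    for _ in range(max_iters):
        for name in pending:
            src, dst = nets[name]

            def step_cost(a, b):
                e = edge(a, b)
                over = max(0, usage.get(e, 0) + 1 - capacity)
                toward = abs(b[0] - dst[0]) + abs(b[1] - dst[1])
                return ((1 + history.get(e, 0)) * (1 + over), toward)

            path = greedy_dfs_route(neighbors, step_cost, src, dst)
            if path is None:
                return None
            routes[name] = path
            for e in map(edge, path, path[1:]):
                usage[e] = usage.get(e, 0) + 1
        overused = {e for e, u in usage.items() if u > capacity}
        if not overused:
            return routes
        for e in overused:
            history[e] = history.get(e, 0) + 1
        # Rip up only the routes that cross an overused channel.
        pending = [n for n, p in routes.items()
                   if any(edge(a, b) in overused for a, b in zip(p, p[1:]))]
        for n in pending:
            for e in map(edge, routes[n], routes[n][1:]):
                usage[e] -= 1
    return None


def grid(width, height):
    def neighbors(node):
        x, y = node
        for nx, ny in ((x + 1, y), (x, y + 1), (x - 1, y), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                yield (nx, ny)
    return neighbors


nets = {"A": ((0, 0), (1, 1)), "B": ((1, 0), (1, 1))}
routes = route_nets(grid(2, 2), 1, nets)
```

The greedy search does not guarantee shortest paths, so the history cost is what steers a ripped-up net onto a different path in later iterations.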

21 JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing
- Assign specific channels to each route
- Construct routing conflict graph
  - Routes conflict if assigning the same channel results in an illegal routing within any SM
- Use Brelaz's greedy vertex coloring algorithm [Brelaz, 1979]
- If illegal routes exist, rip up illegal routes and repeat global routing
[Figure: switch matrix with channels 0-3 and 0L-3L carrying routes R1-R4, and the corresponding conflict graph]
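Channel assignment by greedy vertex coloring can be sketched in the style of Brelaz's DSATUR heuristic: repeatedly pick the route whose neighbors already use the most distinct channels, then give it the lowest free channel. The conflict graph below is invented for illustration (the slide's figure does not survive in the transcript), and the function name and tie-breaking rule are assumptions.

```python
def assign_channels(conflicts, num_channels):
    """Greedy DSATUR-style coloring.

    conflicts: dict mapping each route to the set of routes it conflicts with.
    Returns (channel_of_route, illegal_routes).
    """
    channel = {}
    illegal = []
    remaining = sorted(conflicts)            # sorted for deterministic tie-breaks

    def priority(route):
        # Saturation: number of distinct channels already taken by neighbors;
        # ties are broken by the number of conflicts (degree).
        saturation = len({channel[n] for n in conflicts[route] if n in channel})
        return (saturation, len(conflicts[route]))

    while remaining:
        route = max(remaining, key=priority)
        remaining.remove(route)
        taken = {channel[n] for n in conflicts[route] if n in channel}
        free = [c for c in range(num_channels) if c not in taken]
        if free:
            channel[route] = free[0]
        else:
            illegal.append(route)            # to be ripped up and re-routed globally
    return channel, illegal


# Hypothetical conflict graph: R1, R2, R3 conflict pairwise; R4 conflicts with R3.
conflicts = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2", "R4"},
    "R4": {"R3"},
}
legal, none_illegal = assign_channels(conflicts, 4)
partial, ripped = assign_channels(conflicts, 2)
```

With four channels every route receives a channel; with only two, one route of the three-way conflict is left unassigned, which corresponds to the slide's "rip up illegal routes and repeat global routing" step.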

22 Experiments: Memory Usage
- VPR requires over 50 MB of memory, with an average of over 20 MB
- ROCR requires at most 3.6 MB
  - 13X less than VPR on average

23 Experiments: Algorithm Performance
- ROCR is on average 10X faster than VPR (TD)
  - Up to 21X faster for ex5p

24 Experiments: Critical Path Results
- ROCR produces a 32% longer critical path than timing-driven VPR (TD)
- But a 10% shorter critical path than routability-driven VPR (RD)

25 Experiments: Wire Segments
- ROCR uses 10% more wire segments than VPR (TD/RD)

26 Conclusions
- Developed Riverside On-Chip Router (ROCR)
  - Fast, lean on-chip router for JIT FPGA compilation
    - Order of magnitude less memory required
    - On average 10X faster than VPR's faster routing algorithm
  - Produces acceptable circuit quality
    - Uses only 10% more routing resources
    - Critical path 10% shorter than VPR's routability-driven router
- JIT FPGA compilation
  - Enables development of a standard HW binary
  - Brings portability of SW design to HW designers
  - Presently requires custom FPGA fabric
    - Future work: overhead of mapping simple fabric onto commercial fabric?