Dynamic FPGA Routing for Just-in-Time Compilation

Dynamic FPGA Routing for Just-in-Time Compilation. Roman Lysecky (a), Frank Vahid (a)*, Sheldon X.-D. Tan (b). (a) Department of Computer Science and Engineering, (b) Department of Electrical Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems at UC Irvine. This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction: Just-in-Time Compilation Has Become Commonplace. Modern Pentium processors dynamically translate instructions onto an underlying RISC architecture. Transmeta Crusoe and Efficeon use dynamic code morphing to translate x86 instructions to an underlying VLIW processor. Interpreted languages (Java, Python, etc.) distribute software as processor-independent bytecode or source; the software is typically executed on a virtual machine, which JIT compiles the bytecode to the processor’s native instructions. [Figure: software passes through profiling and a standard compiler to a software binary, which is JIT recompiled for the processor.]

Introduction: Just-in-Time Compilation Also Performs Optimization. Dynamic optimizations are increasingly common: the binary is dynamically recompiled during execution. Dynamo [Bala, et al., 2000] performs dynamic software optimizations by identifying frequently executed code segments (hot paths) and recompiling them with higher optimization. BOA [Gschwind, et al., 2000] is a dynamic optimizer for PowerPC. Advantages: transparent optimizations, no designer effort, no tool restrictions, adaptation to actual usage, and speedups of up to 20%-30% (1.3X). JIT compilation operates on software binaries.

Introduction: But Today’s Binaries Are More than Just Software. [Figure: a standard compiler with profiling produces a software binary for a processor, while a compiler/synthesis flow with profiling produces a binary containing both SW and HW for a processor paired with an FPGA.]

Introduction: One Use of JIT FPGA Compilation. [Figure: a cable TV company distributes a feature upgrade as a single software binary to devices built around ARM7, ARM9, ARM10, and ARM11 processors.]

Introduction: One Use of JIT FPGA Compilation. [Figure: when each device pairs its processor (ARM7, ARM9, ARM10, ARM11) with a different FPGA (FPGA 1 to FPGA 4), the cable TV company's feature upgrade needs the software binary plus a separate HW netlist (Netlist1 to Netlist4) for each device.]

Introduction: One Use of JIT FPGA Compilation. [Figure: with JIT FPGA compilation, the cable TV company distributes one software binary and one HW binary for the feature upgrade; each device (ARM7 with FPGA 1, ARM9 with FPGA 2, ARM10 with FPGA 3, ARM11 with FPGA 4) runs JIT FPGA compilation locally.]

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning). (1) Initially execute the application in software only. (2) Profile the application to determine critical regions. (3) Partition critical regions to hardware. (4) Program the configurable logic and update the software binary. (5) The partitioned application executes faster with lower energy consumption. [Figure: microprocessor (µP) with I$ and D$, profiler, Dynamic Partitioning Module (DPM), and Warp Configurable Logic Architecture.] Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning). [Figure: an ARM processor with I$ and D$, a profiler, the WCLA, and the DPM. The DPM performs JIT FPGA compilation on the binary: decompilation, binary partitioning, RT synthesis, logic synthesis, technology mapping/packing, placement, routing, and a binary updater, producing an updated binary and a HW bitstream.] Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

Introduction: All that CAD On-Chip? CAD people may first think just-in-time FPGA compilation is “absurd”. CAD tools are extremely complex: they require long execution times on powerful desktop workstations, require very large memory resources, and usually require gigabytes of hard drive space. The cost of a complete CAD tool package can exceed $1 million. [Figure: typical run times of about 1 min for logic synthesis, 1 min for technology mapping, 1-2 mins for placement, and 2-30 mins for routing, with memory figures of 10 MB, 10 MB, 50 MB, and 60 MB.]

Simultaneous FPGA/CAD Design. Careful simultaneous design of the configurable logic fabric and the CAD tools. Analyze architectural features for their impact on on-chip just-in-time CAD tools. Goals: fast execution time, very low data memory, and reasonable (good) hardware circuits.

Simultaneous FPGA/CAD Design: Configurable Logic Fabric. An array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs). Each CLB is directly connected to an SM. Switch matrix connections: four short wires connect adjacent SMs, and four long wires connect every other SM. [Figure: grid of CLBs and SMs.] Lysecky/Vahid, DATE’04

Simultaneous FPGA/CAD Design: Combinational Logic Block Design. Each CLB incorporates two 3-input, 2-output LUTs, which corresponds to four 3-input LUTs. This allows good-quality circuits while reducing the complexity of the on-chip CAD tools. Routing resources between adjacent CLBs support carry chains. [Figure: CLB with LUT inputs a-f, outputs o1-o4, and a connection to the adjacent CLB.] Lysecky/Vahid, DATE’04

Simultaneous FPGA/CAD Design: Switch Matrix. Each SM is connected using eight channels per side: four short channels (0, 1, 2, 3) and four long channels (0L, 1L, 2L, 3L). The SM routes a wire from one side to a different side using the same channel. Each short channel is associated with a single long channel, so a wire is routed through the configurable logic fabric using a single pair of channels. Lysecky/Vahid, DATE’04

FPGA Routing. Find a path within the FPGA to connect the source and sinks of each net in the hardware circuit. Routers typically use a form of maze routing [Lee, 1961], routing each net with Dijkstra’s shortest path algorithm.
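
As an illustration of maze routing with Dijkstra's algorithm, the following is a minimal sketch on a rectangular grid with unit step cost. The grid model, the `blocked` set, and the function name `maze_route` are assumptions made for this example; they are not taken from the presentation.

```python
import heapq

def maze_route(width, height, source, sink, blocked=frozenset()):
    """Return a shortest grid path from source to sink, or None if unreachable.

    Hypothetical model: every grid cell is a routing resource, each step costs 1,
    and cells in `blocked` cannot be used.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist[node]:
            continue  # stale heap entry
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height) or nxt in blocked:
                continue
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if sink not in dist:
        return None
    # Walk the predecessor chain back from the sink.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

if __name__ == "__main__":
    print(maze_route(4, 4, (0, 0), (3, 0), blocked={(1, 0), (1, 1)}))
```

With unit costs this reduces to a breadth-first wave expansion; a router that weights resources differently would replace the constant step cost with a per-resource cost.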

FPGA Routing: Pathfinder [Ebeling, et al., 1995]. Introduced negotiated congestion. During each routing iteration, nets are routed using the shortest path, and overuse (congestion) of routing resources is allowed. If congestion exists (an illegal routing), the cost of congested resources is updated based on the amount of overuse, then all routes are ripped up and all nets are rerouted. [Figure: two nets, 1 and 2, competing for a congested resource.]
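
The negotiated-congestion loop described above can be sketched as follows. This is a simplified illustration, not the published Pathfinder implementation: the node-based graph, the cost formula `(1 + history) * (1 + pres_fac * overuse)`, and the names `pathfinder`, `pres_fac`, and `hist_fac` are all assumptions chosen for this example.

```python
import heapq

def shortest_path(graph, cost, src, dst):
    """Dijkstra over node costs; assumes dst is reachable from src."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def pathfinder(graph, nets, capacity=1, max_iters=20, pres_fac=1.0, hist_fac=1.0):
    """Route all nets, allow overuse, raise costs of overused nodes, and repeat."""
    hist = {n: 0.0 for n in graph}
    for iteration in range(1, max_iters + 1):
        occ = {n: 0 for n in graph}  # rip up all routes
        routes = {}

        def cost(n):
            over = max(0, occ[n] + 1 - capacity)  # overuse if this net also takes n
            return (1.0 + hist[n]) * (1.0 + pres_fac * over)

        for name, (src, dst) in nets.items():
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                occ[n] += 1
        overused = [n for n in graph if occ[n] > capacity]
        if not overused:
            return routes, iteration  # legal routing found
        for n in overused:
            hist[n] += hist_fac * (occ[n] - capacity)  # remember past congestion
    return None, max_iters

if __name__ == "__main__":
    # Both nets prefer node M; only net1 has a detour (A1, A2).
    graph = {
        "s1": ["M", "A1"], "s2": ["M"], "M": ["s1", "s2", "t1", "t2"],
        "A1": ["s1", "A2"], "A2": ["A1", "t1"], "t1": ["M", "A2"], "t2": ["M"],
    }
    print(pathfinder(graph, {"net1": ("s1", "t1"), "net2": ("s2", "t2")}))
```

In the small example, both nets initially claim the shared node; the accumulated history cost then pushes the net that has an alternative onto its detour, which is the negotiation the slide refers to.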

FPGA Routing: VPR, Versatile Place and Route [Betz, et al., 1997]. Uses a modified Pathfinder algorithm that improves performance over the original Pathfinder algorithm. Routability-driven routing aims to use the fewest tracks possible; timing-driven routing aims to optimize circuit speed. VPR routes on a routing resource graph. [Figure: flowchart: route; if congestion exists (the routing is illegal), rip up and route again; otherwise done.]

JIT FPGA Routing: Riverside On-Chip Router (ROCR). Represents routing nets between CLBs as routing between SMs. In the resource graph, nodes correspond to SMs and edges correspond to channels between SMs; the capacity of an edge equals the number of wires within the channel. This requires much less memory than VPR because the resource graph is much smaller. [Figure: SMs joined by edges, each labeled 0/4.]
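
A resource graph of this shape might be represented as below. This is only a sketch: the `Channel` class, the `build_sm_graph` function, and the reading of "long wires connect every other SM" as links between SMs two positions apart are assumptions for illustration, not details confirmed by the presentation.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    """One resource-graph edge: a bundle of wires between two SMs."""
    capacity: int
    used: int = 0

    def overuse(self):
        return max(0, self.used - self.capacity)

def build_sm_graph(cols, rows, short_wires=4, long_wires=4):
    """Return {(sm_a, sm_b, kind): Channel} for a cols x rows array of SMs.

    Assumption: short channels join adjacent SMs (distance 1) and long
    channels join SMs two positions apart (distance 2), both horizontally
    and vertically.
    """
    edges = {}
    for x in range(cols):
        for y in range(rows):
            for dx, dy in ((1, 0), (0, 1)):
                for step, wires, kind in ((1, short_wires, "short"),
                                          (2, long_wires, "long")):
                    nx, ny = x + dx * step, y + dy * step
                    if nx < cols and ny < rows:
                        edges[((x, y), (nx, ny), kind)] = Channel(wires)
    return edges

if __name__ == "__main__":
    g = build_sm_graph(4, 4)
    print(len(g), "edges for 16 SMs")
```

The graph stores one edge per channel rather than one node per wire and switch, so its size grows with the number of SMs rather than with the total number of wires, which is consistent with the memory savings the slide claims.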

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Global Routing. Based on VPR’s routability-driven router. Uses a similar cost model consisting of base, historical congestion, and current congestion costs. Routes nets between SMs using a greedy, depth-first routing algorithm, which is faster than VPR’s traditional breadth-first routing method but requires the addition of an adjustment cost to direct ROCR to re-route illegal nets using a different initial routing path. Global routing ignores illegal routing within SMs. If congestion exists, ROCR rips up and re-routes only the illegal routes, which reduces computation time during successive routing iterations.
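
The slide gives no pseudocode for ROCR, so the following is only a rough sketch of the ideas it names: a greedy depth-first search between SMs, a cost built from base, historical, and current congestion terms, and rip-up limited to illegal routes. The cost formula, the Manhattan-distance ranking, and all function names are assumptions; the adjustment cost mentioned on the slide is not modeled here.

```python
def edge_key(a, b):
    return (a, b) if a <= b else (b, a)

def path_edges(path):
    return [edge_key(path[i], path[i + 1]) for i in range(len(path) - 1)]

def neighbors(node, cols, rows):
    x, y = node
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < cols and 0 <= ny < rows:
            yield (nx, ny)

def greedy_dfs_route(src, dst, cols, rows, edge_cost):
    """Depth-first search that tries the cheapest, closest neighbor first."""
    path, visited = [src], {src}

    def rank(cur, nxt):
        remaining = abs(nxt[0] - dst[0]) + abs(nxt[1] - dst[1])
        return edge_cost(edge_key(cur, nxt)) + remaining

    def dfs(cur):
        if cur == dst:
            return True
        for nxt in sorted(neighbors(cur, cols, rows), key=lambda n: rank(cur, n)):
            if nxt in visited:
                continue
            visited.add(nxt)
            path.append(nxt)
            if dfs(nxt):
                return True
            path.pop()  # dead end: backtrack
        return False

    return path if dfs(src) else None

def global_route(nets, cols, rows, capacity, max_iters=10):
    hist, usage, routes = {}, {}, {}

    def edge_cost(e):
        over = max(0, usage.get(e, 0) + 1 - capacity)  # current congestion
        return (1 + hist.get(e, 0)) * (1 + over)       # base 1, scaled by history

    pending = list(nets)
    for _ in range(max_iters):
        for name in pending:
            src, dst = nets[name]
            routes[name] = greedy_dfs_route(src, dst, cols, rows, edge_cost)
            for e in path_edges(routes[name]):
                usage[e] = usage.get(e, 0) + 1
        overused = {e for e, u in usage.items() if u > capacity}
        if not overused:
            return routes
        for e in overused:
            hist[e] = hist.get(e, 0) + usage[e] - capacity
        # Rip up only the routes that touch an overused channel.
        pending = [n for n in routes if overused & set(path_edges(routes[n]))]
        for n in pending:
            for e in path_edges(routes[n]):
                usage[e] -= 1
    return None

if __name__ == "__main__":
    nets = {"a": ((0, 0), (2, 0)), "b": ((0, 0), (2, 0))}
    print(global_route(nets, cols=3, rows=2, capacity=1))
```

Because the search commits to the first promising neighbor instead of expanding a full wavefront, it visits far fewer nodes per net, at the cost of sometimes returning a longer path than a breadth-first search would.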

JIT FPGA Routing: Riverside On-Chip Router (ROCR) - Detailed Routing. Assigns specific channels to each route. Constructs a routing conflict graph in which two routes conflict if assigning them the same channel results in an illegal routing within any SM. Uses Brelaz’s greedy vertex coloring algorithm [Brelaz, 1979]. If illegal routes exist, ROCR rips up the illegal routes and repeats global routing. [Figure: conflict graph over routes R1-R4 and an SM with channels 0, 0L, 1, 1L, 2, 2L, 3, 3L.]
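
Channel assignment by Brelaz's saturation-degree heuristic (often called DSATUR) can be sketched as follows. The conflict graph in the example is made up for illustration, since the slide's figure did not survive extraction, and the function name `assign_channels` is hypothetical. Colors 0 to 3 stand for the four short/long channel pairs.

```python
def assign_channels(conflicts, num_channels):
    """Color the conflict graph with Brelaz's greedy heuristic.

    conflicts: {route: set of routes that cannot share a channel with it}
    Returns {route: channel index, or None if no channel is free}.
    """
    channel = {}

    def neighbor_channels(route):
        return {channel[n] for n in conflicts[route]
                if channel.get(n) is not None}

    uncolored = set(conflicts)
    while uncolored:
        # Pick the route with the most distinct channels among its neighbors
        # (saturation), breaking ties by degree, then by name.
        route = max(sorted(uncolored),
                    key=lambda r: (len(neighbor_channels(r)), len(conflicts[r])))
        used = neighbor_channels(route)
        channel[route] = next(
            (c for c in range(num_channels) if c not in used), None)
        uncolored.remove(route)
    return channel

if __name__ == "__main__":
    # Hypothetical conflicts: R1, R2, R3 conflict pairwise; R4 conflicts with R3.
    conflicts = {
        "R1": {"R2", "R3"},
        "R2": {"R1", "R3"},
        "R3": {"R1", "R2", "R4"},
        "R4": {"R3"},
    }
    print(assign_channels(conflicts, num_channels=4))
```

Routes left with `None` correspond to the illegal routes that the slide says are ripped up before global routing is repeated.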

Experiments: Memory Usage. VPR requires over 50 MB of memory in the worst case, with an average of over 20 MB. ROCR requires at most 3.6 MB, 13X less than VPR on average.

Experiments: Algorithm Performance. ROCR is on average 10X faster than VPR’s timing-driven router (TD), and up to 21X faster for the ex5p benchmark.

Experiments: Critical Path Results. ROCR’s critical path is 32% longer than that of VPR (TD), but 10% shorter than that of VPR’s routability-driven router (RD).

Experiments: Wire Segments. ROCR uses 10% more wire segments than VPR (TD/RD).

Conclusions. Developed the Riverside On-Chip Router (ROCR), a fast, lean on-chip router for JIT FPGA compilation. It requires an order of magnitude less memory and is on average 10X faster than VPR’s faster routing algorithm. It produces acceptable circuit quality: it uses only 10% more routing resources, and its critical path is 10% shorter than that of VPR’s routability-driven router. JIT FPGA compilation enables development of a standard HW binary and brings the portability of SW design to HW designers, but presently requires a custom FPGA fabric. Future work: what is the overhead of mapping the simple fabric onto a commercial fabric?