

A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation
Roman Lysecky (a), Frank Vahid (a*), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
* Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx.

2/22 Introduction: Standard Binary - Separating Function and Architecture
[Diagram: SW -> profiling / standard compiler -> x86 binary]
- Software binaries of the past
  - Binary reflected the specific language of the underlying architecture: limited portability
- Current “standard binary”
  - Concept: separate function from detailed architecture
  - Develop new architectures for existing applications
  - Trend towards dynamic translation and optimization

3/22 Introduction: But Today’s Binaries Are More than Just Software
[Diagram: SW -> profiling / standard compiler -> SW binary -> processor; SW + HW -> profiling / compiler / synthesis -> binaries for Processor1, Processor2, and Processor3, each paired with an FPGA]

4/22 Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor: portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
[Diagram: VHDL/Verilog -> profiling / standard CAD tools -> std. HW binary -> JIT FPGA compilers, each targeting a different FPGA]

5/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: a cable TV company distributes a feature upgrade as one SW binary to processors ARM7, ARM9, ARM10, and ARM11]

6/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: the same feature upgrade with hardware added. Each platform (Processor + FPGA 1 to Processor + FPGA 4) needs its own SW binary plus HW netlist (Netlist1 to Netlist4), built from separate designs HW1 to HW4]

7/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: with JIT FPGA compilation, one SW binary plus one HW binary is distributed to Processor + FPGA 1 through Processor + FPGA 4]

8/22 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Diagram: µP with I$ and D$, profiler, dynamic partitioning module (DPM), and FPGA]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. Partitioned application executes faster with lower energy consumption
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

9/22 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Diagram: µP with I$ and D$, profiler, DPM (CAD), and FPGA. Stages shown inside the DPM: decompilation, partitioning, RT synthesis, JIT FPGA compilation (logic synthesis, technology mapping/packing, placement, routing), and binary updater. Data shown: binary, std. HW binary, HW bitstream, updated binary]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

10/22 Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation
- Existing FPGAs require extremely complex CAD tools
  - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  - Require long execution times and very large memory usage
  - Not suitable for dynamic on-chip execution
[Figure: tool flow run times: logic synthesis 1 min, technology mapping 1 min, placement 1-2 mins, routing 2-30 mins. Memory labels shown: 50 MB, 60 MB, 10 MB, 10 MB]

11/22 JIT FPGA Compilation: CAD-Oriented FPGA
- Solution: develop a custom CAD-oriented FPGA
  - Careful simultaneous design of FPGA and CAD
  - FPGA features evaluated for impact on CAD
  - Enables development of fast, lean JIT FPGA compilation tools
[Figure: JIT FPGA compilation stages (logic synthesis, technology mapping/packing, placement, routing). Labels shown: 1 s, < 1 s, 0.5 MB, 1 MB, < 1 s, 1 MB, 10 s, 3.6 MB]
Lysecky/Vahid, DATE’04

12/22 CAD-Oriented FPGA: Simple Configurable Logic Fabric (CLF)
[Diagram: array of alternating SMs and CLBs]
- Hundreds of existing commercial and research FPGA fabrics
  - Most designed to balance circuit density and speed
- Analyzed FPGA features to determine their impact on CAD
  - Designed our CLF in conjunction with JIT FPGA compilation tools
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  - Each CLB is directly connected to an SM
  - Along with the SM design, this allows for the design of a lean JIT router
Lysecky/Vahid, DATE’04

13/22 Simple Configurable Logic Fabric: Combinational Logic Block
[Diagram: CLB built from LUTs with inputs a-f and outputs o1-o4, with connections to adjacent CLBs]
- Incorporates two 3-input 2-output LUTs
  - Equivalent to four 3-input LUTs with fixed internal routing
  - Allows for good quality circuit while reducing JIT technology mapping complexity
- Provides routing resources between adjacent CLBs to support carry chains
  - Reduces number of nets we need to route
- FPGAs: Flexibility/Density: large CLBs, various internal routing resources
- SCLF: Simplicity: limited internal routing, reduces on-chip CAD complexity
Lysecky/Vahid, DATE’04
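The LUT bullets on this slide can be illustrated with a minimal sketch of a 3-input, 2-output LUT modeled as an 8-entry memory. The class name `Lut3x2` and the full-adder contents are illustrative assumptions, not taken from the slides; the full adder is chosen only because its carry output relates to the carry-chain bullet.

```python
from itertools import product


class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT: 8 entries of 2 bits each."""

    def __init__(self, func):
        # Fill the table by evaluating func on all 8 input combinations,
        # in address order 000, 001, ..., 111.
        self.table = [func(a, b, c) for a, b, c in product((0, 1), repeat=3)]

    def read(self, a, b, c):
        # The three inputs form the memory address; the stored pair is the output.
        return self.table[(a << 2) | (b << 1) | c]


def full_adder(a, b, cin):
    """Example function to store in the LUT: returns (sum, carry_out)."""
    total = a + b + cin
    return (total & 1, total >> 1)


lut = Lut3x2(full_adder)
print(lut.read(1, 1, 0))  # sum and carry-out for 1 + 1 + 0
```

Two such LUTs per CLB give four output bits from six inputs, which matches the a-f inputs and o1-o4 outputs named in the diagram, although the fixed internal routing between them is not modeled here.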

14/22 Simple Configurable Logic Fabric: Switch Matrix
[Diagram: switch matrix with short channels 0-3 and long channels 0L-3L on each side]
- All nets are routed using only a single pair of channels throughout the configurable logic fabric
  - Each short channel is associated with a single long channel
- Designed for fast, lean JIT FPGA routing
- FPGAs: Flexibility/Speed: large routing resources, various routing options
- SCLF: Simplicity: allows for design of a fast, lean routing algorithm
Lysecky/Vahid, DATE’04

15/22 JIT FPGA Compilation: Routing
[Diagram: routing example with a congested resource]
- FPGA routing
  - Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
  - Allows overuse (congestion) of resources
  - If congestion exists (illegal routing): update cost of congested resources, rip up all routes, and reroute all nets
- VPR [Betz, et al., 1997]
  - Provides various improvements over Pathfinder
  - Routability-driven: use fewest tracks possible
  - Timing-driven: optimize circuit speed
- Many of these techniques are used in commercial FPGA CAD tools
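The negotiated-congestion loop described on this slide can be sketched as follows. This is a simplified illustration, not the Pathfinder or VPR implementation: it assumes two-terminal nets, a connected graph, capacities on nodes, and a made-up cost function `(1 + history) * (1 + overuse)`. All function and node names are hypothetical.

```python
import heapq


def shortest_path(adj, cost, src, dst):
    """Dijkstra over per-node costs; assumes dst is reachable from src."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiated_routing(adj, capacity, nets, max_iters=50):
    """Route all nets, allowing overuse, until no resource is congested."""
    history = {n: 0.0 for n in adj}      # cost accumulated from past congestion
    for _ in range(max_iters):
        usage = {n: 0 for n in adj}      # rip up all routes
        routes = {}
        for name, (src, dst) in nets.items():
            def cost(n):
                # Overused nodes cost more but stay usable (assumed cost model).
                over = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + over)
            path = shortest_path(adj, cost, src, dst)
            for n in path:
                usage[n] += 1
            routes[name] = path
        congested = [n for n in adj if usage[n] > capacity[n]]
        if not congested:
            return routes                # legal routing
        for n in congested:
            history[n] += 1.0            # update cost of congested resources
    return None


# Two nets compete for node "m"; net2 also has a longer detour d1-d2-d3.
edges = [("s1", "m"), ("m", "t1"), ("s2", "m"), ("m", "t2"),
         ("s2", "d1"), ("d1", "d2"), ("d2", "d3"), ("d3", "t2")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)
capacity = {n: 1 for n in adj}
nets = {"net1": ("s1", "t1"), "net2": ("s2", "t2")}
routes = negotiated_routing(adj, capacity, nets)
print(routes)
```

In this toy graph both nets use `m` in the first iteration, which overuses it. After its history cost is raised, `net2` moves to the detour while `net1`, which has no alternative, keeps `m`.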

16/22 JIT FPGA Compilation: ROCR - Riverside On-Chip Router
[Diagram: array of SMs and CLBs next to its routing resource graph, with edges labeled by usage/capacity, e.g. 0/4]
[Flow: Route -> illegal? -> yes: Rip-up, then Route again; no: Done!]
- Resource graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of an edge is equal to the number of wires within the channel
  - Requires much less memory as the resource graph is smaller
Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04
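A minimal sketch of the resource graph described on this slide, with SMs as nodes and channels as capacity-limited edges. It assumes a rectangular grid of SMs where each SM connects only to its horizontal and vertical neighbors; that topology, and the names `Channel`, `build_resource_graph`, and `is_illegal`, are assumptions for illustration and not ROCR's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Edge of the resource graph: one routing channel between two SMs."""
    capacity: int   # number of wires within the channel
    used: int = 0   # wires currently occupied by routed nets

    def label(self):
        # Same usage/capacity form as the "0/4" labels in the slide's diagram.
        return f"{self.used}/{self.capacity}"


def build_resource_graph(rows, cols, channel_width):
    """Return a dict mapping each pair of adjacent SMs to its Channel."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges[((r, c), (r, c + 1))] = Channel(channel_width)
            if r + 1 < rows:
                edges[((r, c), (r + 1, c))] = Channel(channel_width)
    return edges


def is_illegal(edges):
    """A routing is illegal when any channel carries more nets than wires."""
    return any(ch.used > ch.capacity for ch in edges.values())


graph = build_resource_graph(3, 3, channel_width=4)
print(len(graph), graph[((0, 0), (0, 1))].label())
```

Under these assumptions the graph holds one node per SM and one edge per channel regardless of channel width, which is the sense in which such a graph stays small: widening a channel changes only a capacity value, not the number of edges.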

17/22 Scalability of On-Chip Routing: Experimental Setup
[Diagram: array of alternating SMs and CLBs]
- 100x100 configurable logic fabric array
  - Routing channel width of 34
  - Large enough to support all HW circuits
- 123 MCNC benchmark circuits
  - Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
- Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
- Routed each HW benchmark circuit using:
  - VPR’s timing-driven router
  - VPR’s fast timing-driven router (-fast option)
  - Riverside On-Chip Router (ROCR)

18/22 Scalability of On-Chip Routing: Memory Usage
- VPR requires over 100 MB of memory on average
- ROCR requires at most 8.3 MB
- VPR requires 18X more memory than ROCR on average

19/22 Scalability of On-Chip Routing: Algorithm Performance
- ROCR is over 40X faster than VPR for small HW circuits
- ROCR is 2X-3X faster than VPR for large HW circuits

20/22 Scalability of On-Chip Routing: Critical Path
Chart annotations for ROCR:
- 19% longer critical path than VPR
- 2.6% shorter than VPR (Fast)
- 30%/27% longer critical path than VPR/VPR (Fast)

21/22 Scalability of On-Chip Routing: Wire Segments
- ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits

22/22 Conclusions and Future Work
Conclusions
- Demonstrated that ROCR scales well as circuit size increases
  - On average 2X faster than VPR’s fast timing-driven router
  - Requires 18X less memory than VPR
- Produces good circuit quality
  - Critical path 27% longer than VPR (Fast) on average
  - 2.6% shorter critical path for the largest HW circuit
  - Requires on average 5% fewer wire segments
Future Work
- Current project: a major microprocessor vendor is fabricating our custom FPGA
- Improvements to Riverside On-Chip Router (ROCR)
  - Improve ROCR’s performance for large HW circuits
  - Incorporating timing information to achieve
  - Analyze the scalability of ROCR as circuit size approaches FPGA capacity
- JIT FPGA compilation
  - Development of standard HW binary
  - Support more complex FPGA architectures