Intermediate Fabrics: Virtual FPGA Architectures for Circuit Portability and Fast Placement and Routing on FPGAs James Coole PhD student, University of.

Presentation transcript:

Intermediate Fabrics: Virtual FPGA Architectures for Circuit Portability and Fast Placement and Routing on FPGAs
James Coole, PhD student, University of Florida
Dr. Greg Stitt, Assistant Professor of ECE, University of Florida
CODES+ISSS '10

Introduction
- Problem: lengthy, increasing FPGA place & route (PAR) times are a design bottleneck
- Previous work: fabrics specialized for fast PAR [Lysecky04] [Beck05] [Vahid08]

Introduction
- Ideally we want the advantages of fast PAR with the flexibility and availability of COTS FPGAs
- Approach: virtualize a specialized architecture on a COTS FPGA

Approach
- Definition
  - Intermediate fabric (IF): a PAR-specialized reconfigurable architecture implemented on top of COTS FPGAs
  - Serves as a virtualization layer between netlist/circuit and FPGA
- Motivations
  - Orders of magnitude PAR speedups are possible for coarse-grain architectures
    - Reduction in problem size compared to FPGA PAR (e.g. multipliers not mapped to LUTs)
  - Portability of IF configuration between any FPGAs implementing the same IF
    - Enables portable 3rd party PAR tools
  - Enables small embedded PAR tools for run-time construction of datapaths, e.g. dynamic binary translation [Stitt07] [Beck05] on COTS devices
- Challenge: virtualization overhead

Example Circuit with floating-point operations

Previous Work
- Dynamic FPGA routing and JIT compilation [Lysecky04][05]
  - 3x PAR speedup
  - Requires specialized device architecture
- Coarse grain reconfigurable device architectures [Becker01] [Ebeling96] […]
  - Faster PAR because of reduced problem size compared to FPGAs
  - Domain specific, not as flexible as fine-grain FPGAs
- Wires on Demand [Athanas07]
  - Fast PAR by routing between pre-PARed modules
  - Could be complementary, with IFs being used for PAR of modules
- Quku [Shukla06]
  - Coarse-grained array of ALUs implemented on FPGA
  - Essentially one instance of an IF
  - IFs also address PAR execution time and portability

IF Architecture
- Implemented in multiple planes: groups of resources with similar responsibilities and a purpose-specialized interconnect
  - Stream plane: includes interfaces to off-chip memories and support for buffering
  - Control plane: resources for implementing control, such as state machines
  - Data plane: resources for computation and data steering *
- Overhead: logic utilization and device area required to support fabric configuration
  - Slice/LUT overhead primarily due to interconnect of data plane
  - Flip-flops due to configuration bits and interconnect pipelining
* primary source of overhead

Data Plane
- Explored architectures with 2D island topology (FPGA-like)
  - Computational units (CUs): implement mathematical or logical operations found in netlists (e.g. multiplication, addition)
    - Operations included depend on the applications targeted by a specific fabric
  - Tracks: multi-bit wires used to carry signals over short distances
  - Connection boxes: bring routed signals in and out of CUs by connecting to tracks
  - Switch boxes: route signals around the fabric by bridging tracks
    - Currently use planar topology
- Resources virtualized by implementation as RTL
  - Configuration set by shifting a stream of bits into a chain of configuration flip-flops
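The configuration mechanism described on this slide (a bit stream shifted into a chain of configuration flip-flops) can be illustrated with a small behavioral model. This is only a sketch under assumptions: the class `ConfigChain`, the helpers `stream_for` and `field`, and the 2-bit/3-bit field layout are hypothetical names chosen for illustration and are not taken from the authors' RTL.

```python
from typing import List


class ConfigChain:
    """Behavioral model of a chain of configuration flip-flops (hypothetical)."""

    def __init__(self, length: int) -> None:
        self.bits: List[int] = [0] * length

    def shift_in(self, bit: int) -> None:
        # The new bit enters at position 0; every stored bit moves one
        # position down the chain and the last bit falls off the end.
        self.bits = [bit & 1] + self.bits[:-1]

    def load(self, stream: List[int]) -> None:
        # One bit is shifted in per (modelled) clock cycle.
        for bit in stream:
            self.shift_in(bit)


def stream_for(desired: List[int]) -> List[int]:
    """Order bits so that, after a full load, chain.bits == desired.

    The first bit shifted in ends up deepest in the chain, so the stream
    is simply the desired contents in reverse.
    """
    return list(reversed(desired))


def field(chain: ConfigChain, offset: int, width: int) -> int:
    """Read a multi-bit configuration field (MSB first) from the chain."""
    value = 0
    for bit in chain.bits[offset:offset + width]:
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    # Assumed layout: a 2-bit MUX select followed by a 3-bit delay setting.
    chain = ConfigChain(5)
    chain.load(stream_for([1, 0, 0, 1, 1]))
    print(chain.bits, field(chain, 0, 2), field(chain, 2, 3))
```

In this model each fabric resource would read its setting from a fixed slice of the chain, which is why the number of configuration bits shows up as flip-flop overhead.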

Implementation of Interconnect
- Bidirectional tracks implemented as signals for all potential sources, selected down to a single sink by a MUX
  - PAR determines the actual source and configures the MUX
  - MUXes are the biggest contribution to area overhead of IFs
- Interconnect is pipelined to maximize clock rate of deeply pipelined netlists
  - Configurable-length shift registers on CU inputs used to realign routes
  - Prevents combinational loops in IF RTL
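Two ideas from this slide lend themselves to a short sketch: a track modelled as a MUX over its potential sources, and shift registers on CU inputs that realign routes with different pipeline latencies. The function names (`select_bits`, `track_value`, `realign_delays`) are hypothetical, and the realignment rule shown (pad every input up to the latest arrival) is an assumption about how such balancing is typically done, not a statement of the authors' implementation.

```python
from typing import Dict, List


def select_bits(num_sources: int) -> int:
    """Configuration bits needed for a MUX choosing among num_sources drivers."""
    if num_sources < 1:
        raise ValueError("a track needs at least one potential source")
    # ceil(log2(n)) computed with integers only; a single source needs no bits.
    return (num_sources - 1).bit_length()


def track_value(sources: List[int], select: int) -> int:
    """Value seen by the sink: PAR picks exactly one source via the select field."""
    return sources[select]


def realign_delays(arrival_cycles: Dict[str, int]) -> Dict[str, int]:
    """Extra delay (shift-register length) per CU input so all inputs line up.

    arrival_cycles maps an input name to the pipeline latency of its route.
    Every input is padded up to the latest arrival (assumed balancing rule).
    """
    latest = max(arrival_cycles.values())
    return {name: latest - cycles for name, cycles in arrival_cycles.items()}


if __name__ == "__main__":
    print(select_bits(4))                      # bits for a 4-source track
    print(track_value([10, 20, 30, 40], 2))    # sink sees source 2
    print(realign_delays({"a": 3, "b": 5}))    # input a needs 2 extra cycles
```

Every extra potential source widens the MUX and its select field, which is consistent with the slide's point that MUXes dominate the area overhead.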

Optimizations
- Because the FPGA can implement multiple different IFs, individual IFs can be specialized to particular application domains
- Optimization strategy minimizes overhead by removing or reducing impact of interconnect resources
- Global properties:
  - Track density: number of tracks per channel
  - Connection box flexibility: how adjacent CUs connect to each connection box
- Specialization techniques:
  - Wide channels: only increase capacity for individual channels
  - Long tracks: tracks that hop over switch boxes in a channel
  - Jump tracks: long tracks that leave their channel to connect different parts of a fabric

Tool Flow
- Intermediate fabrics are created using the device (FPGA) tool flow
  - IFs stored by the system as a fabric specification with a bitstream to configure the FPGA
  - Multiple IFs may be stored in a library to enable the system to handle many applications
- During execution, IF tools load the bitstream for a compatible IF onto the FPGA
- IF tools technology-map netlist nodes to CUs and to control and stream plane elements
  - Should be ~1:1
- IF tools PAR the netlist on the IF
  - Placement based on VPR [Betz97] simulated annealing (SA) placement
  - Routing based on Pathfinder [McMurchie95] negotiated congestion routing
- PAR produces an IF bitstream to configure the circuit on the hosted IF
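To make the placement step concrete, the following is a toy simulated-annealing placer on a grid of CU slots. It is a minimal sketch of the general SA technique, not VPR and not the authors' tool: the cost function (total Manhattan distance of two-pin nets), the cooling schedule, and all names (`anneal_place`, `wirelength`, `_move`) are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]
Net = Tuple[str, str]


def wirelength(place: Dict[str, Pos], nets: List[Net]) -> int:
    """Total Manhattan distance over all two-pin nets (assumed cost function)."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)


def _move(place: Dict[str, Pos], occupant: Dict[Pos, str],
          a: str, b: Optional[str], src: Pos, dst: Pos) -> None:
    """Move node a from src to dst; if dst held node b, move b to src."""
    place[a] = dst
    occupant[dst] = a
    if b is None:
        occupant.pop(src, None)
    else:
        place[b] = src
        occupant[src] = b


def anneal_place(nodes: List[str], nets: List[Net], width: int, height: int,
                 seed: int = 0) -> Tuple[Dict[str, Pos], int]:
    """Return the best placement found and its cost."""
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    if len(nodes) > len(slots):
        raise ValueError("fabric has fewer CU slots than the netlist has nodes")
    rng.shuffle(slots)
    place = dict(zip(nodes, slots))            # random initial placement
    occupant = {pos: node for node, pos in place.items()}
    cost = wirelength(place, nets)
    best, best_cost = dict(place), cost
    temp = 5.0                                 # assumed starting temperature
    while temp > 0.01:
        for _ in range(50):                    # assumed moves per temperature
            a = rng.choice(nodes)
            src, dst = place[a], rng.choice(slots)
            if src == dst:
                continue
            b = occupant.get(dst)
            _move(place, occupant, a, b, src, dst)
            new_cost = wirelength(place, nets)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(place), cost
            else:
                _move(place, occupant, a, b, dst, src)   # undo the move
        temp *= 0.95                           # assumed geometric cooling
    return best, best_cost


if __name__ == "__main__":
    names = ["n0", "n1", "n2", "n3", "n4", "n5"]
    chain = [(names[i], names[i + 1]) for i in range(5)]
    placement, cost = anneal_place(names, chain, width=3, height=3)
    print(cost, placement)
```

Because a coarse-grain fabric has only tens to hundreds of CU slots rather than many thousands of LUTs, the search space for a loop like this is small, which matches the slide deck's argument for fast PAR.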

Experimental Setup
- Explored tradeoffs of area overhead and ability to route netlists (routability)
- Developed tool to automate creating RTL for intermediate fabrics
  - Island-style data planes with user-definable CU logic
  - Parameters for CU distribution, interconnect density, and optimizations (track density, track length, etc.)
  - IFs synthesized using Synplicity Synplify Pro 2009.03 and Xilinx ISE 10.1
- Developed random acyclic netlist generator to assess routability for common circuit structures
  - Used to test routing a large number of random netlists on the fabric
  - Routability: fraction of population that routes successfully on the fabric
    - Higher precision metric and not biased by selection of netlists
    - Decreases with size of fabric, so can’t compare between fabric sizes
- Execution times compared against ISE 10.1 for Xilinx V4LX200s on Quad-Core 2.67GHz Core i7 Xeon workstation
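The routability metric defined on this slide (the fraction of a random netlist population that routes successfully) can be sketched as follows. The generator below is a hypothetical stand-in for the authors' random acyclic netlist generator: it guarantees acyclicity only by letting each node draw its inputs from lower-numbered nodes. The `routes` callback stands for a real router and is likewise an assumption.

```python
import random
from typing import Callable, List, Tuple

Netlist = List[Tuple[int, int]]   # edges (src, dst); src < dst keeps it acyclic


def random_acyclic_netlist(num_nodes: int, max_fanin: int,
                           rng: random.Random) -> Netlist:
    """Random DAG: every node except node 0 reads from 1..max_fanin earlier nodes."""
    edges: Netlist = []
    for dst in range(1, num_nodes):
        fanin = rng.randint(1, min(max_fanin, dst))
        for src in rng.sample(range(dst), fanin):
            edges.append((src, dst))
    return edges


def routability(population: List[Netlist],
                routes: Callable[[Netlist], bool]) -> float:
    """Fraction of the population for which the router reports success."""
    if not population:
        raise ValueError("population must not be empty")
    return sum(1 for netlist in population if routes(netlist)) / len(population)


if __name__ == "__main__":
    rng = random.Random(1)
    population = [random_acyclic_netlist(12, 2, rng) for _ in range(200)]

    # Placeholder router (assumption): succeed when the netlist needs no
    # more than 18 connections. A real study would run PAR on the fabric.
    def toy_router(netlist: Netlist) -> bool:
        return len(netlist) <= 18

    print(f"routability = {routability(population, toy_router):.2f}")
```

Measuring over a large random population rather than a handful of hand-picked circuits is what the slide means by a metric that is not biased by the selection of netlists.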

Results: Case Studies
1) Evaluated PAR speedup for a number of example netlists
2) Evaluated area/routability tradeoffs by creating IFs optimized for each netlist
- Baseline IFs: high routability, general-purpose interconnect
  - Minimum size required to place netlist
  - 4 tracks per channel
  - No long tracks or other optimizations
- Specialized IFs: minimized overhead by removing/customizing interconnect
  - Minimized tracks per channel, while still routing netlist
  - Randomly explored combinations of long tracks and wide channels
- CUs included in IF were matched to requirements of netlist
  - For fixed-point netlists, CUs were combination adders/multipliers mapped to Xilinx DSP48s
  - For single-precision netlists, CUs were a mixture of Xilinx FP Cores distributed evenly
  - Tracks set to CU bit width (16 or 32)

Case Studies: PAR Speedup
Columns (in order): IF PAR, FPGA PAR, PAR Speedup, Area Overhead, Clock Overhead
Matrix multiply: 0.6s, 6min 06s, 602x, 13%, -11%
FIR: 4min 36s, 454x, 29%, 31%
N-body: 0.5s, 3min 42s, 491x, 10%
Accumulate: 0.1s, 0min 30s, 323x, 5%, 25%
Normalize: 0.2s, 6min 44s, 1726x, 14%, 18%
Bilinear: 0.3s, 8min 48s, 1784x, 27%
Floyd-Steinberg: 5min 37s, 2407x
avg. floating point: 5min 09s, 1112x, 19%
Thresholding: 1.4s, 0min 33s, 24x, 42%
Sobel: 2min 28s, 500x, 6%, 24%
Gaussian Blur: 3.3s, 3min 19s, 60x
Max Filter: 1min 16s, 444x, 4%, 23%
Mean Filter 7x7: 8.9s, 5min 03s, 34x, 26%, 22%
avg. 16b fixed point: 1.3s, 1min 49s, 275x, 9%
- PAR speedup avg. of 275x for fixed-point, 1112x for floating-point netlists
  - ~1s PAR
  - Speedup increases with complexity of CUs
- FPGA PAR times don’t include memory interfaces (FPGA circuit IO pins)
  - Underestimates PAR speedup for many systems (e.g. +10-20 min on GiDEL ProcStar-III)
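The speedup column is the ratio of FPGA PAR time to IF PAR time. As a worked check, the sketch below recomputes it for three rows whose two times both survive in the transcript (Thresholding, Gaussian Blur, Mean Filter 7x7). The helper names `seconds` and `speedup` are hypothetical; the time strings are copied from the table.

```python
import re


def seconds(text: str) -> float:
    """Parse times written like '3min 19s' or '1.4s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)min\s*)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def speedup(fpga_par: str, if_par: str) -> float:
    """PAR speedup = FPGA PAR time / IF PAR time."""
    return seconds(fpga_par) / seconds(if_par)


if __name__ == "__main__":
    rows = {
        "Thresholding": ("0min 33s", "1.4s"),     # table reports 24x
        "Gaussian Blur": ("3min 19s", "3.3s"),    # table reports 60x
        "Mean Filter 7x7": ("5min 03s", "8.9s"),  # table reports 34x
    }
    for name, (fpga, fabric) in rows.items():
        print(f"{name}: {speedup(fpga, fabric):.1f}x")
```

Rows with sub-second IF times reproduce less exactly (for Matrix multiply, 366 s / 0.6 s = 610 against the reported 602x), presumably because the IF times shown are rounded to one decimal place.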

Case Studies: Overhead
Columns (in order): PAR Speedup, Area Overhead, Clock Overhead, Routability (Specialized)
Matrix multiply: 602x, 13%, -11%, 100%
FIR: 454x, 29%, 31%, 99%
N-body: 491x, 10%
Accumulate: 323x, 5%, 25%
Normalize: 1726x, 14%, 18%, 60%
Bilinear: 1784x, 27%, 97%
Floyd-Steinberg: 2407x
avg. floating point: 1112x, 19%, 94%
Thresholding: 24x, 42%
Sobel: 500x, 6%, 24%
Gaussian Blur: 60x, 58%
Max Filter: 444x, 4%, 23%, 98%
Mean Filter 7x7: 34x, 26%, 22%, 59%
avg. 16b fixed point: 275x, 9%, 90%
- Specialized fabrics required avg. 9-14% more area than circuit on FPGA
  - Overhead for unspecialized: 16-23% (48% savings)
  - Routability: 91% for specialized, 100% for unspecialized (9% reduction)
- Fabrics reduced netlist clock 19% (to ~190MHz) compared to circuit on FPGA
  - FPGA circuit implementation pipelined same as IF circuits

Newer Case Studies
- Experiments on Novo-G for larger circuits
Column groups: Place and Route Times, Performance, Area Utilization
Columns (in order): IF, Quartus 9.1, Speedup, Clk, FPGA Clk Overhead, Speedup IF, Speedup FPGA, Perf. Overhead, LUT, REG, DSP
Conv 3x3: 0.9s, 14min 48s, 943, 150 MHz, 17%, 3.2, 3.5, 9%, 13%, 14%, 1%
Conv 4x4: 1.5s, 15min 06s, 613, 148 MHz, 16%, 5.9, 6.3, 6%, 2%
Conv 5x5: 2.1s, 15min 33s, 447, 146 MHz, 15%, 8.0, 8.5, 4%
Conv 6x6: 3.0s, 15min 41s, 312, 151 MHz, 18%, 11.1, 11.9, 7%, 5%
Conv 7x7: 4.0s, 16min 19s, 243, 139 MHz, 11%, 14.7, 15.5, 8%
Conv 8x8: 5.3s, 16min 08s, 184, 18.8, 20.0
Sobel: 4.2s, 14min 56s, 214, 154 MHz, 19%, 0.53, 0.58
SAD 8x8: 16min 51s, 190, 143 MHz, 18.6, 19.5, 0%
Conv 5x5 (float): 1.7s, 25min 28s, 919, 23%, 5.2, 5.5, 21%, 29%
Sobel (float): 18min 58s, 759, 144 MHz, 0.32, 0.36, 3%
SAD 5x5 (float): 0.6s, 30min 43s, 2880, 140 MHz, 5.3, 25%, 38%
Average: 2.7s, 18min 14s, 700, 8.3, 8.8, 124 MHz, 57%, 58%
IF (float): 114 MHz, 59%, 61%
- 700x average PAR speedup
- 7% performance overhead
- Area requirements = ~60% of device

Results: General Purpose Fabrics
3) Evaluate interconnect structures for general-purpose use
- Compared routability for general-purpose interconnect
  - No application-specific interconnect optimizations
  - Comparisons for max-sized netlists (100% of CUs) and random-sized netlists
  - CUs were 16-bit combination adders/multipliers
- Connection box connectivity: ~20% decrease in area overhead by using low connectivity
  - For low track densities, however, high connectivity significantly improves routability

General Purpose Fabrics
- For the pipelined datapath circuits we tested, more than 3 tracks/channel provides only small gains in routability; 2-3 tracks/channel provides a reasonable tradeoff
- Overhead is 37% for a 96 CU fabric with 2 tracks/channel
  - Routability: 97%; 79% for max-size netlists
  - Provides access to all DSP48s on V4LX200
- 225 CU fabric (16b add/mult) fit on V4LX200
  - 129 CUs in LUTs, 96 in DSPs

Summary and Future Work
- Introduced Intermediate Fabrics: virtual coarse-grain reconfigurable architectures implemented on top of FPGAs
  - Demonstrated average 554x PAR speedup across 12 case studies of pipelined datapath circuits, with feasible area and clock overhead
  - Enables small, portable PAR tools by abstracting complexity of underlying device
- Main limitation is area overhead introduced by virtual routing resources
  - Demonstrated that for a reasonably large fabric of 96 DSP units, the virtualization overhead required ~1/3 of a Virtex 4 LX200, with high routability (97%)
  - Future work involves implementing interconnect directly using the device’s routing resources, with potential to significantly reduce overhead
- Presented techniques to reduce overhead by specializing the fabric interconnect to particular domains
  - Demonstrated average reduction in overhead of 48%, with 91% routability
  - Future work involves methodologies for developing libraries of domain-specialized IFs, and algorithms for efficiently searching libraries of IFs