FPGA Logic Cluster Design
Dr. Philip Brisk
Department of Computer Science and Engineering, University of California, Riverside
CS 223

How Much Logic Should Go in an FPGA Logic Block?
Vaughn Betz, Jonathan Rose
IEEE Design & Test of Computers 15(1): (1998)

Three Questions
How many inputs should the FPGA routing provide to a cluster of LUTs? (I)
  – Routing flexibility vs. area
As the number of LUTs in a logic cluster changes, how should the FPGA’s routing architecture change? (Fc)
How many LUTs should be included in a cluster? (N)

Experimental Methodology
20 MCNC Benchmarks
  – Well-established
  – A bit old, even by 1998 standards
  – Sadly, still in use
4-LUT Architecture
Fs = 3
  – Vary other parameters to see what works best

Area Model
Count the number of min-width transistors required to implement a benchmark circuit in an FPGA architecture
Normalized Area = (Num min-width transistors used) / (Num BLEs used)
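
The normalized area metric above can be written as a one-line function. This is a minimal sketch: the function name and the example counts are illustrative assumptions, not values from the paper.

```python
def normalized_area(min_width_transistors: float, bles_used: int) -> float:
    """Area per BLE, in units of min-width transistors."""
    if bles_used <= 0:
        raise ValueError("bles_used must be positive")
    return min_width_transistors / bles_used


# Placeholder counts for illustration only; not results from the paper.
print(normalized_area(1_200_000, 800))  # 1500.0
```

Dividing by the number of BLEs used lets benchmarks of different sizes be compared on the same axis.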

How many cluster inputs do we need?
Logic utilization reaches nearly 100% when I is 50-60% of the total number of BLE inputs
  – BLEs can be packed together to share common inputs
  – Locally generated outputs can be re-used
  – Works because the packing algorithm was effective!
Figure: input sharing and output re-use within a logic cluster

Visual Depiction
I = ~0.6KN is pretty good
Use the feedbacks!
Fanout

The Packer was Effective!
It packed BLEs together to share common inputs
It re-used locally generated outputs via the feedbacks
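
Input sharing and output re-use can be made concrete by counting how many distinct nets actually have to enter a cluster. The sketch below assumes a fully connected cluster in which any locally generated output can feed any BLE input; the net names and the 3-BLE cluster are hypothetical.

```python
def external_inputs(cluster):
    """Nets that must enter through cluster input pins.

    cluster: list of (output_net, [input_nets]) tuples, one per BLE.
    Shared inputs are counted once; nets produced inside the cluster
    are assumed to arrive over the local feedback paths instead.
    """
    produced = {out for out, _ in cluster}
    used = set()
    for _, inputs in cluster:
        used.update(inputs)
    return used - produced


# Hypothetical cluster of three 4-input BLEs.
cluster = [
    ("x", ["a", "b", "c", "d"]),
    ("y", ["a", "b", "x", "e"]),  # shares a, b; re-uses x
    ("z", ["x", "y", "c", "f"]),  # shares c; re-uses x, y
]
needed = external_inputs(cluster)
total_pins = sum(len(inputs) for _, inputs in cluster)
print(sorted(needed), len(needed), total_pins)
# ['a', 'b', 'c', 'd', 'e', 'f'] 6 12
```

In this toy case the three BLEs expose 12 input pins but need only 6 cluster inputs, which is the effect the packer exploits.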

Cluster inputs vs. Cluster size
I is approximately 2N + 2
N = 1: the BLE uses 3.5 of its 4 inputs on average
N = 16: the BLEs use 19.7 of their 64 inputs on average

Commercial FPGAs
Altera Flex 8000 FPGA uses a cluster of size N=8 with I=24
  – Results suggest reducing I to 18 to save area
Xilinx 5200 FPGA uses a cluster of size N=4 with I=16
  – Results suggest reducing I to 10 to save area
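
The suggested reductions follow directly from the I = 2N + 2 approximation for 4-LUT clusters. A short check, using only the (N, I) pairs quoted on the slide; the function name is an assumption made for this sketch.

```python
def suggested_cluster_inputs(n: int) -> int:
    """I = 2N + 2, the approximation reported for 4-LUT clusters."""
    return 2 * n + 2


# (N, I) pairs as quoted on the slide.
devices = {"Altera Flex 8000": (8, 24), "Xilinx 5200": (4, 16)}

for name, (n, i_actual) in devices.items():
    i_new = suggested_cluster_inputs(n)
    print(f"{name}: I = {i_actual} -> {i_new} (saves {i_actual - i_new} inputs)")
# Altera Flex 8000: I = 24 -> 18 (saves 6 inputs)
# Xilinx 5200: I = 16 -> 10 (saves 6 inputs)
```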

Routing Flexibility vs. Cluster Size
Set Fc = W/N
  – Each routing track is driven by one LUT output pin in the cluster
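
One way to picture Fc = W/N is to split the W tracks of a channel evenly among the N output pins, so that every track has exactly one driver from the cluster. The contiguous assignment below is a hypothetical pattern chosen for illustration; the slides do not specify which tracks each pin connects to.

```python
def output_pin_tracks(w: int, n: int) -> dict[int, list[int]]:
    """Give each of the N output pins Fc = W/N tracks, with no overlap."""
    if w <= 0 or n <= 0 or w % n != 0:
        raise ValueError("this sketch assumes W is a positive multiple of N")
    fc = w // n
    # Hypothetical pattern: pin p drives tracks p*Fc .. (p+1)*Fc - 1.
    return {pin: list(range(pin * fc, (pin + 1) * fc)) for pin in range(n)}


tracks = output_pin_tracks(w=12, n=4)
print(tracks)
# {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
```

As N grows with W fixed, each pin connects to fewer tracks, which is how the routing flexibility per pin shrinks with cluster size.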

Area Efficiency vs. Cluster Size
I is set to achieve 98% logic utilization
N=2 BLEs introduces intra-cluster routing
Reduce routing between logic blocks
Area efficiency rapidly degrades beyond this point

Conclusions
I = 2N + 2 for N < 16
  – Slow, linear growth
Reduce Fc
  – Works because LUT inputs are equivalent
Cluster area efficiency is within 10% for 1 < N < 8
Large clusters reduce the size of the placement problem and increase FPGA speed

The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density
Elias Ahmed, Jonathan Rose
IEEE Transactions on VLSI Systems 12(3): (2004)

Contributions
Vary LUT size (K) from 2 to 7
Vary cluster size (N) from 1 to 10 LUTs
  – Experimentally determine the number of cluster inputs (I) as a function of K and N
  – Clustering small LUTs (K=2,3) produces good area results, but bad performance (~2x worse)
  – LUTs of size (K=4,5,6), clusters of size (N=3…10) yield the best area-delay product

CAD Flow

Inputs Required for 98% Area Utilization
I = ½K(N+1)
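
The formula I = ½K(N+1) generalizes the earlier 4-LUT result: substituting K = 4 gives 2(N+1) = 2N + 2. A minimal sketch that checks this; the function name is an assumption, and the result is left unrounded because the slides do not say how fractional values are handled.

```python
def cluster_inputs(k: int, n: int) -> float:
    """I = K(N + 1) / 2; may be fractional when K is odd and N is even."""
    return k * (n + 1) / 2


# For 4-LUTs the formula reduces to the earlier 2N + 2 rule.
for n in (1, 4, 8, 10):
    assert cluster_inputs(4, n) == 2 * n + 2

print(cluster_inputs(6, 10))           # 33.0
print(cluster_inputs(4, 8) / (4 * 8))  # 0.5625
```

The second printed value is the fraction of all BLE input pins (K·N) that needs a cluster input for K = 4, N = 8, which falls in the 50-60% range reported in the first study.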

Total Area
LUT sizes of K = 4,5 are the most area efficient for all cluster sizes
Reduction in total area as cluster size increases from 1 to 3 for all LUT sizes
As clusters are made larger (N > 4) there is little impact on total FPGA area
Intra-cluster routing area is 25-35% of the total area

Total Intra-cluster Routing Area
The rate of increase in cluster size far outweighs the rate of decrease in the number of clusters, hence the upward trend

#Clusters and Area/Cluster vs. K 25-35% N = 1 BLE per Cluster

LUT area vs. Intra-cluster Mux Area
Intra-cluster routing area is % of logic cluster area
LUT area dominates

Intra-cluster Routing Area as a Function of LUT Size
Total intra-cluster routing area decreases near-linearly from K = 3 to 7

Total Intra-cluster Routing Area
The product of these two curves gives the total inter-cluster routing area.
Routing area decreases linearly with LUT size
Increasing LUT sizes decreases the number of clusters used faster than the rate of increase in routing area per cluster
Depends on good CAD tools

Critical Path Delay vs. LUT Size
Increasing both N and K has a positive effect
Benefits saturate as N and K get large
As N and K increase
  – LUT delay and the delay through a single cluster increase
  – The number of LUTs and clusters in series on the critical path decreases
  – Global routing delay is reduced

Intra-cluster Delay vs. LUT Size
Intra-cluster delay decreases as K increases
  – Reduction in number of BLE levels on critical path
Intra-cluster delay increases as N increases
  – Larger intra-cluster muxes are slower
  – The delay through these muxes is still much smaller than global routing delay

BLE Delay vs. K
BLE delay increases linearly as K increases (intuitive)
Number of BLEs on the critical path decreases quadratically as K increases
Fewer, but larger, BLEs

Global Routing Delay vs. K
As K increases
  – Fewer LUTs on the critical path
  – Fewer global routing links
As N increases
  – More opportunities to use faster intra-cluster routing

Critical Path Delay (K = 4)
K remains constant
  – No reduction in number of BLEs on critical path
N increases
  – BLE and intra-cluster routing delay increase
  – More logic implemented internally within clusters
  – Can use faster intra-cluster routing instead of global routing

Critical Path Delay vs. LUT Size (Recap)
Increasing N beyond 3 has minimal effects
Limited effectiveness of clustering
  – Architectural weakness?
  – Semi-effective CAD tools?

Number of Logic Clusters on Critical Path
The number of logic levels decreases with increasing N and K
For a given K, most of the reduction is from N = 1 to 3
  – The majority of the critical path delay was reduced in this range
Increasing N is less effective when K is large

BLE Fanout vs. LUT Size
Smaller LUTs respond better to increasing N because each LUT has a relatively small fanout
  – Adding an extra BLE to the cluster guarantees some reduction in the number of logic levels
Larger LUTs have larger average fanout
  – Harder to ensure that increasing N will result in fewer cluster levels on the critical path

Area-Delay Product
Large delays
  – Many BLEs on critical path
  – Slightly larger area requirement
Large area cost for K=7 outweighs marginal delay improvement
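
Choosing an architecture by area-delay product amounts to minimizing area × delay over the (K, N) design points. The sketch below assumes results are available as a dictionary keyed by (K, N); the function name and the numbers are placeholders chosen for illustration, not measurements from the paper.

```python
def best_area_delay(
    results: dict[tuple[int, int], tuple[float, float]],
) -> tuple[int, int]:
    """Return the (K, N) point with the smallest area * delay product."""
    return min(results, key=lambda kn: results[kn][0] * results[kn][1])


# Placeholder (area, delay) values in arbitrary normalized units.
results = {
    (2, 4): (1.0, 2.0),  # small LUTs: long critical path
    (4, 4): (1.0, 1.2),
    (7, 4): (1.6, 1.0),  # large LUTs: fastest, but costly in area
}
print(best_area_delay(results))  # (4, 4)
```

With these placeholder values the K = 7 point has the lowest delay but loses on the product because its extra area outweighs the delay gain, which is the trade-off the slide describes.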

Caveats
Quality of CAD tools
Mix of benchmark circuits
Limited exploration of routing parameter design space
  – Parameters were derived from N = K = 4

Best Overall Results and Summary
To achieve 98% LUT utilization, set I = ½K(N+1)
Small LUT sizes are not area efficient and have poor performance characteristics
Future challenges
  – Reduce number of BLEs on critical path without resorting to larger LUTs
  – Reduce intra-cluster routing delays