1 Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits Andy Gean Ye University of Toronto.

Presentation transcript:

1 Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits Andy Gean Ye University of Toronto

2 Motivation: Datapath Regularity Larger FPGAs –Larger applications on FPGAs –More datapath logic in larger applications –Datapath logic is highly regular In custom ASICs, regularity is routinely utilized to increase logic density Can regularity also be utilized to improve the logic density of FPGAs?

3 Previous Work Datapath-FPGA (DP-FPGA) study [cher96] –Yes, datapath regularity can be utilized to reduce FPGA area by as much as 50% –Based on a partially specified FPGA architecture Major simplifying assumptions –All transistors are minimum width –Datapaths are completely regular –No inefficiency from the CAD tools

4 This Work – An In-depth Study on Datapath Regularity Designed a new datapath-oriented FPGA architecture –With detailed architectural specifications –With correctly sized transistors Utilized realistic datapath benchmarks –From the Pico-java processor from SUN Created a complete set of CAD tools to support the new architecture –Taking CAD inefficiency into account

5 Multi-bit FPGA (MB-FPGA) Architected to utilize datapath regularity to generate area savings Architectural features –Capture regularity using special logic blocks called super-clusters –Increase logic density through configuration-memory-sharing (conf. mem. shar.) routing resources

6 MB-FPGA – Overview [Figure labels: L = Logic Block; S = Switch Block; Routing Channels; Conf. Mem. Shar. Routing Tracks; Conventional Routing Tracks]

7 MB-FPGA – Logic Block [Figure labels: Cluster 1 to Cluster 4; Cluster = Bit-Slice; Basic Logic Element (BLE) with LUT, DFF, MUX, and M; Local Routing Network (LRN)]

8 Capturing Datapath Regularity [Figure labels: BLE; Bit-Slice 1, Bit-Slice 2, Bit-Slice 3]

9 MB-FPGA – Routing Architecture [Figure labels: Switch Block; Logic Block; Cluster; M; Conf. Mem. Shar. Routing; Conventional Routing]

10 Utilizing Datapath Regularity to Save Area [Figure labels: L; M; Conf. Mem. Shar. Tracks]

11 Area Estimation Using Correct Transistor Sizing Based on the fully specified MB-FPGA architecture Detailed Assumptions –SRAM transistors are min. width –Tri-state buffers are 5x min. width –75% FPGA area is routing area Simplified Assumptions –Datapaths are completely regular (all conf. mem. shar. tracks) –No inefficiency from the CAD tools

12 Area Estimation Using Correct Transistor Sizing Datapath regularity can only be used to reduce the MB-FPGA area by 25% Down from the 50% area savings prediction of the DP-FPGA study [cher96]

13 Benchmark Regularity Fifteen benchmark circuits –From the Pico-java processor –Implemented on the MB-FPGA Measurements after synthesis –Logic regularity –Net regularity

14 Logic Regularity Classify LUTs and DFFs into two types –Irregular type LUTs and DFFs that do not belong to any 4-bit wide datapath components –Regular type LUTs or DFFs that belong to a 4-bit wide datapath component More regular type of LUTs and DFFs –More regular nets –Greater area savings

15 A Datapath Component A Datapath Component – A Group of 4 identical LUTs or DFFs [Figure labels: S1, S2, S3, S4; Identical LUTs or DFFs]
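The classification on slides 14 and 15 can be sketched in Python as follows. This is only an illustration: the `(component_id, function)` representation and the name `logic_regularity` are assumptions made here, not part of the MB-FPGA tools. An element counts as regular when it belongs to a group of exactly four identical LUTs or DFFs.

```python
from collections import defaultdict

WIDTH = 4  # datapath component width used on the slides


def logic_regularity(elements):
    """Return the fraction of LUTs/DFFs that sit in 4-bit datapath components.

    elements: list of (component_id, function) pairs. component_id is None
    for elements that belong to no group (hypothetical representation).
    """
    groups = defaultdict(list)
    for component_id, function in elements:
        if component_id is not None:
            groups[component_id].append(function)

    regular = 0
    for functions in groups.values():
        # A datapath component is a group of WIDTH identical LUTs or DFFs.
        if len(functions) == WIDTH and len(set(functions)) == 1:
            regular += WIDTH
    return regular / len(elements) if elements else 0.0


# Made-up example: one valid 4-bit component, two loose elements,
# and one two-element group that is too small to qualify.
example = [("add0", "xor3")] * 4 + [
    (None, "and2"), (None, "or2"), ("bad", "mux"), ("bad", "and2"),
]
print(logic_regularity(example))  # 0.5
```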

16 Logic Regularity [Table; cell values missing. Columns: #LUT + #DFF; #LUT + #DFF in Datapath Components; % LUT & DFF in Datapath Components. Rows: dcu_dpath, ex_dpath, icu_dpath, imdr_dpath, pipe_dpath, smu_dpath, ucode_dat, ucode_reg, code_seq_dp, exponent_dp, incmod, mantissa_dp, multmod_dp, prils_dp, rsadd_dp, Total]

17 Net Regularity Classify two-terminal connections in each circuit into three types –Regular 4-bit wide buses –Regular 4-bit wide control group –Irregular Two-terminal connections do not belong to either a bus or a control group

18 Definition – Net Regularity [Figure labels: A 4-bit wide bus; A 4-bit wide control group; bit-slices S1, S2, S3, S4] Note: Only 4-bit wide buses can be used to increase the area efficiency of MB-FPGA through conf. mem. shar. routing tracks
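A minimal sketch of the three-way classification on slides 17 and 18, under assumptions that the slides do not spell out: a connection is written as `(src_component, src_slice, dst_component, dst_slice)`, a bus is four connections joining matching bit-slices of two components, and a control group is one source driving all four bit-slices of a component. The function name and tuple layout are hypothetical.

```python
from collections import Counter, defaultdict

SLICES = frozenset(range(4))  # bit-slices 0..3 of a 4-bit datapath component


def classify_connections(connections):
    """Label each two-terminal connection as 'bus', 'control', or 'irregular'."""
    by_pair = defaultdict(set)    # (src_comp, dst_comp) -> slices joined slice-to-slice
    by_source = defaultdict(set)  # (src_comp, src_slice, dst_comp) -> slices driven
    for sc, ss, dc, ds in connections:
        if ss == ds:
            by_pair[(sc, dc)].add(ss)
        by_source[(sc, ss, dc)].add(ds)

    labels = []
    for sc, ss, dc, ds in connections:
        if ss == ds and by_pair[(sc, dc)] == SLICES:
            labels.append("bus")
        elif by_source[(sc, ss, dc)] == SLICES:
            labels.append("control")
        else:
            labels.append("irregular")
    return labels


# Made-up example connections.
example = (
    [("A", s, "B", s) for s in range(4)]        # bus from component A to B
    + [("ctl", 0, "B", s) for s in range(4)]    # one source driving all slices of B
    + [("x", 0, "y", 2)]                        # neither pattern
)
counts = Counter(classify_connections(example))
print(counts["bus"], counts["control"], counts["irregular"])  # 4 4 1
```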

19 Net Regularity
Circuit | Total Two-Terminal Connections | % of Two-Terminal Connections in 4-Bit Wide Buses | % of Two-Terminal Connections in Fan-Out 4 Groups
dcu_dpath | 2232 | 49% | 43%
ex_dpath | 6547 | 52% | 39%
icu_dpath | 8047 | 47% | 36%
imdr_dpath | 3100 | 50% | 36%
pipe_dpath | 1049 | 48% | 42%
smu_dpath | 1167 | 48% | 25%
ucode_dat | 3143 | 52% | 41%
ucode_reg | 194 | 72% | 21%
code_seq_dp | 799 | 58% | 18%
exponent_dp | 1362 | 32% | 23%
incmod | 2013 | 42% | 33%
mantissa_dp | 2533 | 47% | 36%
multmod_dp | 3380 | 39% | 25%
prils_dp | 864 | 41% | 32%
rsadd_dp | 722 | 52% | 27%
Total | [missing] | [missing] | 35%

20 Area Estimation Based on Correct Net Regularity Assumptions –SRAM transistors are min. width –Tri-state buffers are 5x min. width –75% FPGA area is routing area –50% of routing tracks are conf. mem. shar. –No inefficiency from the CAD tools Result –Datapath regularity can be utilized to reduce FPGA area by 12% (again down from 25%)
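One way to relate the 25% and 12% figures is sketched below. It assumes, and this is not stated on the slides, that the area saving scales linearly with the fraction of routing tracks that share configuration memory. The function name is hypothetical.

```python
def scaled_area_saving(full_saving, shared_fraction):
    """Linear-scaling assumption: saving is proportional to the shared-track fraction."""
    return full_saving * shared_fraction


# 25% saving when all tracks share configuration memory, 50% of tracks shared.
print(scaled_area_saving(0.25, 0.50))  # 0.125, close to the 12% on the slide
```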

21 Datapath-oriented CAD Flow – Overview [Flow diagram stages: Enhanced Module Compaction Synthesis; Coarse-grain Node Graph Packing; Multi-bit FPGA Placement; Coarse-grain Resource Routing]

22 Can Regularity Be Utilized to Improve Logic Density? To achieve best area –What should be the best number of clusters per logic block? –What should be the best number of conf. mem. shar. routing tracks per routing channel? What is the performance of this datapath-oriented FPGA?

23 Experiments Fifteen benchmark circuits –From the Pico-java processor –Implemented on the MB-FPGA Experiments –Granularity (the number of clusters per logic block) vs. Area –% conf. mem. shar. tracks vs. area –% conf. mem. shar. tracks vs. performance

24 Granularity Vs. Area Explored a 2-D architectural space –First vary granularity –For each granularity: vary % of conf. mem. shar. routing tracks per routing channel For each architecture, find the average area required to implement the benchmark circuits Plot best area for each granularity
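The two-dimensional sweep described on this slide might look like the sketch below. The callback `area_of` stands in for a complete CAD run and is hypothetical, as is the toy area model in the example, which is invented only so that the code runs and is not measured data.

```python
def best_area_per_granularity(granularities, cms_fractions, benchmarks, area_of):
    """For each granularity, return (best average area, CMS fraction achieving it).

    area_of(benchmark, granularity, cms_fraction) is a hypothetical callback
    that returns the area needed to implement one benchmark.
    """
    best = {}
    for g in granularities:
        candidates = []
        for f in cms_fractions:
            avg = sum(area_of(b, g, f) for b in benchmarks) / len(benchmarks)
            candidates.append((avg, f))
        best[g] = min(candidates)  # smallest average area for this granularity
    return best


def toy_area(base, granularity, cms_fraction):
    # Invented model, not measured data.
    return base * (1 + 0.05 * abs(granularity - 4) + abs(cms_fraction - 0.45))


best = best_area_per_granularity([2, 4, 8], [0.0, 0.25, 0.5, 0.75], [100, 200], toy_area)
best_granularity = min(best, key=lambda g: best[g][0])
print(best_granularity, best[best_granularity][1])
```

The points to plot are then `best[g][0]` against `g`, one per granularity.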

25 Granularity Vs. Area

26 % C.M.S. Tracks Vs. Area Assume four clusters per logic block for the MB-FPGA For each circuit –Set a fixed number of conf. mem. shar. tracks –Search for minimum number of additional conv. tracks Classify into eight percentile ranges Use the minimum area obtainable for each circuit to calculate average area
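The per-circuit search for the minimum number of additional conventional tracks can be sketched as a binary search. The slides do not say which search strategy was used, so binary search is an assumption here, as is the monotonicity of routability and the hypothetical `is_routable` callback that stands in for a routing attempt.

```python
def min_conventional_tracks(cms_tracks, is_routable, upper=1024):
    """Smallest conventional track count that routes, for a fixed CMS track count.

    Assumes routability is monotone: if W conventional tracks suffice,
    W + 1 tracks also suffice.
    """
    if not is_routable(cms_tracks, upper):
        raise ValueError("circuit does not route even at the upper bound")
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if is_routable(cms_tracks, mid):
            hi = mid        # mid works, try fewer tracks
        else:
            lo = mid + 1    # mid fails, need more tracks
    return lo


def toy_routable(cms, conv):
    # Invented stand-in: the circuit routes once 40 tracks are available in total.
    return cms + conv >= 40


print(min_conventional_tracks(16, toy_routable))  # 24
```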

27 % C.M.S. Tracks Vs. Area Also implement the same benchmarks on a comparable conventional FPGA MB-FPGA area is normalized against the conventional FPGA area

28 % C.M.S. Tracks Vs. Area [Plot: Normalized Avg. Area vs. % Conf. Mem. Shar. Tracks; annotation: 10%]

29 Area (40% - 50% Tracks Are C.M.S.) [Table; cell values missing. Columns: Conventional FPGA Area (10e5); Datapath-oriented FPGA Area (10e5); Datapath-oriented FPGA Area (Normalized). Rows: icu_dpath, ex_dpath, multmod_dp, imdr_dpath, ucode_dat, mantissa_dp, dcu_dpath, incmod, exponent_dp, smu_dpath, pipe_dpath, prils_dp, code_seq_dp, rsadd_dp, ucode_reg, Avg. Area]

30 Performance (Crit. Path Delay) Assume carry network delay equal to local routing network delay –Over-estimated carry delay –Results are pessimistic Normalized average crit. path delay over 15 benchmark circuits with respect to conventional FPGA

31 % C.M.S. Tracks Vs. Crit. Path [Plot: Normalized Avg. Delay vs. % C.M.S. tracks]

32 Crit. Path Delay (40%- 50% Tracks Are CMS) [Table; cell values missing. Columns: Conv. FPGA Crit. Path Delay (ns); D.P. FPGA Crit. Path Delay (ns); D.P. FPGA Crit. Path Delay (Normalized). Rows: code_seq_dp, dcu_dpath, ex_dpath, exponent_dp, icu_dpath, imdr_dpath, incmod, mantissa_dp, multmod_dp, pipe_dpath, prils_dp, rsadd_dp, smu_dpath, ucode_dat, ucode_reg, Avg. Area]

33 Conclusions Investigated the question –Can regularity be effectively utilized to improve logic density? Presented –A datapath-oriented FPGA architecture Fully specified to the level of transistor sizing –An analysis on datapath regularity –A brief description of the CAD flow for the architecture

34 Conclusions Detailed architectural specification and CAD implementation are very important Best MB-FPGA architecture –Granularity = 4 –40%-50% of tracks are C.M.S. Architectural Results –10% smaller in area than conv. FPGA –Much less than the 50% area savings prediction [cher96] –Has a 10% performance penalty

35 Discussions Under what circumstances will MB-FPGA be more area efficient? –Applications with more buses than our benchmarks –Wider datapath applications –Larger than 1x min. width transistors in SRAM cells –Smaller than 5x min. width transistors in tri-state buffers –SRAM reduction is more important than area reduction

36 Future Work Architecture –Sharing configuration memory in logic –Improve performance CAD tools –Proper modeling of carry network delay –Improve performance –Power modeling

37 Detailed Datapath-oriented CAD Implementation Issues Andy Gean Ye University of Toronto

38 Datapath-oriented CAD Flow – Overview [Flow diagram stages: Enhanced Module Compaction Synthesis; Coarse-grain Node Graph Packing; Multi-bit FPGA Placement; Coarse-grain Resource Routing]

39 Input to CAD Flow Netlists of datapath components in Verilog or VHDL From a pre-defined library –Arithmetic operators –Logic operators –Multiplexers Datapath regularity of the input is preserved throughout the CAD flow

40 An Example Input Datapath Circuit [Figure labels: blocks + and mux; signals a0 to a3, b0 to b3, c0 to c3, d0 to d3, s0 to s3, sel, c in, c out]

41 Synthesis Synopsys FPGA compiler has 38% area inflation when instructed to preserve datapath regularity Two major causes of area inflation –Duplicated logic across bit-slices –Bit-slices are too small Augmented FPGA compiler with new algorithms –Reduced the area inflation to 3%

42 Packing Based on the T-VPACK [betz99] algorithm Like T-VPACK – timing driven New feature – ability to preserve datapath regularity

43 After Synthesis and Packing [Figure labels: BLE; signals a0 to a3, b0 to b3, c0 to c3, d0 to d3, s0 to s3, sel, c in, c out; sel bus]

44 Placement and Routing Based on the VPR tools [betz99] –Placer: simulated annealing [kirk83] –Router: congestion-negotiation-based PathFinder [ebel95] New feature of the placer –Ability to move individual clusters if they do not contain datapath logic –Moves the entire logic block if it contains datapath logic, to preserve datapath regularity
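The new placer feature can be illustrated with a sketch of move generation only; the annealing schedule and accept/reject step are left out. The block representation and both function names are assumptions, and treating a swap as a whole-block swap whenever either block holds datapath logic is one possible reading of the slide.

```python
import random


def propose_swap(blocks, rng):
    """Pick a swap that keeps datapath logic blocks intact.

    blocks: list of dicts {"datapath": bool, "clusters": list}. The list index
    is the block location (hypothetical representation).
    Returns ("block", i, j) or ("cluster", (i, ci), (j, cj)).
    """
    i, j = rng.sample(range(len(blocks)), 2)
    if blocks[i]["datapath"] or blocks[j]["datapath"]:
        return ("block", i, j)  # move the whole logic block
    ci = rng.randrange(len(blocks[i]["clusters"]))
    cj = rng.randrange(len(blocks[j]["clusters"]))
    return ("cluster", (i, ci), (j, cj))  # move individual clusters


def apply_swap(blocks, move):
    kind, a, b = move
    if kind == "block":
        blocks[a], blocks[b] = blocks[b], blocks[a]
    else:
        (i, ci), (j, cj) = a, b
        blocks[i]["clusters"][ci], blocks[j]["clusters"][cj] = (
            blocks[j]["clusters"][cj], blocks[i]["clusters"][ci])


# Made-up placement: one datapath block and two blocks of unrelated clusters.
blocks = [
    {"datapath": True, "clusters": ["a0", "a1", "a2", "a3"]},
    {"datapath": False, "clusters": ["x", "y", "z", "w"]},
    {"datapath": False, "clusters": ["p", "q", "r", "s"]},
]
rng = random.Random(0)
for _ in range(200):
    apply_swap(blocks, propose_swap(blocks, rng))

datapath_blocks = [b for b in blocks if b["datapath"]]
print(datapath_blocks[0]["clusters"])  # ['a0', 'a1', 'a2', 'a3']
```

However many moves are applied, the bit-slices of the datapath block stay together and in order, while the other clusters are free to mix.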

45 Router Contains a new set of expansion cost functions –Designed to ease the task of comparing the cost of using conv. tracks against the cost of using conf. mem. shar. tracks –Composed of delay and congestion metrics (similar to the conventional expansion cost)

46 Overall Routing Flow Route Buses Route Non-bus Signals Update Cost Functions

47 Routing Buses Route entire buses through conf. mem. shar. routing tracks Route the first bit through conv. routing tracks – test for delay and congestion Compare expansion costs Select the option with the lowest expansion cost

48 Routing Non-bus Signals Consider the options of routing the signal through conv. as well as conf. mem. shar. tracks Compare the expansion cost Select the option with the lowest expansion cost
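The selection rule on slides 45 to 48 can be sketched as below. The cost formula, its `criticality` weighting, and every name here are assumptions for illustration; the slides only say that the expansion cost combines delay and congestion and that the cheaper option is chosen. The numbers in the example are made up.

```python
def expansion_cost(delay, congestion, criticality=0.5):
    """Hypothetical cost that blends delay and congestion."""
    return criticality * delay + (1.0 - criticality) * congestion


def choose_track_type(options, criticality=0.5):
    """options: dict such as {"conv": (delay, congestion), "cms": (delay, congestion)}."""
    return min(options, key=lambda kind: expansion_cost(*options[kind], criticality))


def route_iteration(buses, signals, options_for):
    """Route buses first, then non-bus signals. Cost updates between
    iterations are omitted from this sketch."""
    return {net: choose_track_type(options_for(net))
            for net in list(buses) + list(signals)}


# Made-up (delay, congestion) estimates for one bus and one non-bus signal.
estimates = {
    "bus0": {"conv": (2.0, 1.0), "cms": (1.0, 1.5)},
    "n1": {"conv": (1.0, 0.5), "cms": (1.0, 2.0)},
}
choices = route_iteration(["bus0"], ["n1"], estimates.__getitem__)
print(choices)  # {'bus0': 'cms', 'n1': 'conv'}
```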