A Novel FPGA Logic Block for Improved Arithmetic Performance

Presentation transcript:

A Novel FPGA Logic Block for Improved Arithmetic Performance. Hadi Parandeh-Afshar, Philip Brisk, Paolo Ienne. 16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008.

FPGA vs. ASIC. [Table: ASIC vs. FPGA compared on performance, area utilization, power consumption, flexibility, and time-to-market.] There is a performance gap between FPGAs and ASICs [Kuon and Rose, FPGA 2006 and TCAD 2007]. Arithmetic circuits exacerbate the disparities. Focus: compressor trees. 1/16

Compressor Trees. A compressor tree is a circuit that sums k > 2 integer values using a carry-save representation [Wallace 1966, Dadda 1967]. Uses: parallel multipliers and many video/signal processing circuits (FIR filters, H.264/AVC video coding, 3G wireless base station channel cards). Flowgraph transformations expose compressor trees [Verma and Ienne, ICCAD 2004]; they are generally applicable to arithmetic circuits and merge disparate add and mul operations to form compressor trees. 2/16
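To make the carry-save idea concrete, here is a minimal Python sketch, not taken from the slides. It assumes non-negative integers and uses word-wide 3:2 compression (a row of full adders); the function names `carry_save_step` and `compressor_tree_sum` are illustrative only.

```python
def carry_save_step(a, b, c):
    """3:2 compression on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c                                # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved up one rank
    return s, carry


def compressor_tree_sum(values):
    """Sum k non-negative integers; carries propagate only in the final add."""
    operands = list(values)
    while len(operands) > 2:
        a, b, c = operands[:3]
        # Three operands become two, so each pass shrinks the list by one.
        operands = operands[3:] + list(carry_save_step(a, b, c))
    return sum(operands)                         # final carry-propagate addition


print(compressor_tree_sum([3, 5, 7, 9, 11]))
```

Each `carry_save_step` has constant depth regardless of word width, which is why the tree defers the single slow carry-propagate addition to the very end.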

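Circuit Transformation. [Figure: ADPCM dataflow graph over step, delta, and vpdiff, with nodes labeled >>, &, =, SEL, and +, shown before and after the transformation; afterwards the additions are collected into one compressor tree (∑) followed by a final adder.] [Verma and Ienne, ICCAD 2004] 3/16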

Compressor Tree Synthesis. ASIC synthesis: ripple-carry addition, carry-save representation, ternary addition, full/half adder (FA/HA) trees [Wallace 1966] [Dadda 1967] [Stelling et al. TComp 1998], and m:n counters [Verma and Ienne, DATE 2007]. m:n counters are generalized full/half adders: they count the number of input bits set to 1, so the output is a value in the range [0, m]. FPGA synthesis: ternary addition on the Stratix II/III carry chain, with LUTs in shared arithmetic mode. Problems noted: mapping poorly onto LUTs and poor flexibility in mapping. Drawbacks: routing delays, and the carry chains cannot be used. 4/16

The Altera Stratix II/III ALM: Shared Arithmetic Mode. [Figure: at rank r, one 3-LUT produces sum_r and another produces carry_r; the same pair is repeated at rank r+1 (sum_{r+1}, carry_{r+1}); the adder results go to the ALM outputs.] 5/16
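The figure on this slide can be read as a ternary adder: per rank, two LUTs compress three input bits into a sum bit and a carry bit, and the carry chain then adds sum_r to the LUT carry from rank r-1. The Python sketch below is a behavioral model under that reading, not a description of the actual ALM netlist; `ternary_add` and its variable names are hypothetical.

```python
def ternary_add(a, b, c, width):
    """Bit-level model of a + b + c for non-negative width-bit operands."""
    def bit(x, r):
        return (x >> r) & 1

    result = 0
    chain_carry = 0       # carry travelling along the dedicated carry chain
    prev_lut_carry = 0    # carry_{r-1}, produced by the LUT of the rank below
    for r in range(width + 2):                    # two extra ranks hold the overflow
        x, y, z = bit(a, r), bit(b, r), bit(c, r)
        lut_sum = x ^ y ^ z                       # "sum" 3-LUT
        lut_carry = (x & y) | (x & z) | (y & z)   # "carry" 3-LUT
        total = lut_sum + prev_lut_carry + chain_carry  # full adder on the chain
        result |= (total & 1) << r
        chain_carry = total >> 1
        prev_lut_carry = lut_carry
    return result


print(ternary_add(5, 6, 7, width=3))
```

Note that only `chain_carry` ripples from rank to rank; the LUT outputs at every rank are computed independently of one another.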

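Generalized Parallel Counters (GPCs). An extension of m:n counters in which input bits can have different ranks. A GPC is written (k_{n-1}, …, k_1, k_0; S), where k_i is the number of input bits of rank i (weight 2^i). Number of input bits: M = k_{n-1} + … + k_1 + k_0. Number of output bits: S. Examples: (2, 3; 3) takes two bits of rank 1 and three bits of rank 0, so its output range is [0, 7] and S = 3; (0, 4; 3) is the 4:3 counter. 6/16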
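The GPC notation can be checked with a few lines of Python. This is an illustrative sketch only: `gpc_shape` and `gpc_eval` are made-up names, and it assumes S is the minimum number of bits needed to hold the largest possible output.

```python
def gpc_shape(ks):
    """ks = (k_{n-1}, ..., k_1, k_0). Returns (M, S)."""
    ranks = range(len(ks) - 1, -1, -1)
    max_value = sum(k << r for k, r in zip(ks, ranks))  # all input bits set to 1
    return sum(ks), max_value.bit_length()


def gpc_eval(ks, columns):
    """columns[i] holds the k bits of the i-th entry of ks (highest rank first)."""
    n = len(ks)
    total = 0
    for i, (k, column) in enumerate(zip(ks, columns)):
        if len(column) != k:
            raise ValueError("column size does not match the GPC description")
        total += sum(column) << (n - 1 - i)   # count the ones, weight by 2^rank
    return total


print(gpc_shape((2, 3)))                      # the (2, 3; 3) GPC
print(gpc_shape((0, 4)))                      # the 4:3 counter
print(gpc_eval((2, 3), [[1, 1], [1, 1, 1]]))  # all five inputs set
```

An m:n counter is the special case with a single rank, for example `gpc_shape((6,))` for the 6:3 counter.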

Compressor Tree Synthesis on FPGAs via GPC Mapping. Software synthesis heuristic/ILP [Parandeh-Afshar et al. ASPDAC 2008, DATE 2008]. Faster than ternary adder trees or DSP blocks on Stratix II/III and Xilinx Virtex-5 FPGAs. GPCs with M = 6 inputs and S = 3 or 4 outputs were mapped onto 6-LUTs. This approach is unable to exploit the carry chain, except for the final add. Contribution: a new carry chain that we can use! 7/16

The 6:2 Compressor: an Alternative to the 6:3 Counter and 6-input GPC. 6:3 counter: all inputs have rank 0; outputs at ranks 0, 1, and 2. 6-input GPC: input ranks may vary; outputs at ranks 0, 1, and 2. 6:2 compressor: all inputs have rank 0; outputs at ranks 0 and 1, with carry-ins c_{in,0}, c_{in,1} and carry-outs c_{out,0}, c_{out,1} (the figure labels the carry connections with ranks 0, 1, and 2). 8/16

Why are 6:2 compressors more effective than 6:3 counters? With 6:3 counters, the steady state is 3 bits per column. With 6:2 compressors, the steady state is 2 bits per column. 11/16

6:2 Compressors Form a Carry Chain. Each 6:2 compressor is a logic cell. Carry chains between adjacent cells bypass local routing. This is not an over-glorified ripple-carry structure. 9/16

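6:2 Compressors: Microarchitecture. [Figure: FA and HA blocks connecting the rank-0 inputs and the carry-ins c_{in,0}, c_{in,1} to the sum outputs and the carry-outs c_{out,0}, c_{out,1}.] There is no combinational path from the carry-in bits to the carry-out bits, so this is not ripple-carry. 10/16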

Similarities Between Shared Arithmetic Mode and the 6:2 Compressor. [Figure: side-by-side comparison. 6:2 compressor: FA and HA blocks with rank-0 inputs, sum outputs, c_{in,0}, c_{in,1}, c_{out,0}, c_{out,1}. ALM in shared arithmetic mode: ALM inputs feed LUTs and FAs that drive the ALM outputs.] 12/16

Proposed Logic Cell: 2 Designs 13/16

Experimental Methodology. Platform: VPR, modeling an island-style FPGA with Altera-like ALMs and LABs (4 ALMs per LAB to reduce complexity). Four mapping algorithms: 3-ADD, ternary adder trees; GPC, GPC mapping [Parandeh-Afshar et al. ASPDAC 2008]; 6:2, mapping using 6:2 compressors only; 6:2 + GPC, the best of both worlds. 14/16

Experimental Results. 3-ADD has the smallest area in all cases. GPC has the largest area in all cases. No uniform trends. GPC does not use carry chains; the others do! 6:2 + GPC is the best in all cases. 15/16

Conclusion. Compressor trees are an important class of arithmetic circuits. Previous work: GPC mapping outperforms 3-ADD but cannot use the carry chain. Contribution: a new carry chain that configures the Altera Stratix II/III ALM as a 6:2 compressor (1 HA, 2 FA, 2 muxes, plus wires). Best results combine GPC mapping with 6:2 compressors: average speedup of 1.41x over 3-ADD, with an average increase in ALM usage of 1.19x over 3-ADD. 16/16