AMIN FARMAHININ-FARAHANI CHARLES TSEN KATHERINE COMPTON FPGA Implementation of a 64-bit BID-Based Decimal Floating Point Adder/Subtractor.


OUTLINE: Introduction and Overview; Baseline Implementation; FPGA-based Optimizations (Multiplier, Constant Tables, Multiplexers); Results; Conclusion

Introduction: The value 0.1 cannot be represented exactly in binary floating point (BFP); even the closest single-precision value is only an approximation. FPGAs are a potential way to add hardware-based decimal floating point (DFP) engines to existing compute clusters, accelerating DFP calculations without replacing the computing infrastructure. This was the first presentation of a BID-based DFP adder for FPGAs. The basic idea in this paper was to take an adder implemented in HDL for standard cells and improve it for the Xilinx Virtex 5.
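The representation problem can be checked directly. The sketch below is not from the slides; it uses only Python's standard `struct` and `decimal` modules to round 0.1 to IEEE 754 single precision and show the value that is actually stored. The helper name `nearest_single` is hypothetical.

```python
import struct
from decimal import Decimal


def nearest_single(x: float) -> Decimal:
    """Round x to IEEE 754 single precision and return the stored value exactly."""
    # Packing as "<f" rounds to single precision; unpacking widens it back
    # to a Python float without changing the value.
    (as_single,) = struct.unpack("<f", struct.pack("<f", x))
    # Decimal(float) converts the binary value exactly, with no rounding.
    return Decimal(as_single)


stored = nearest_single(0.1)
print(stored)                    # the exact single-precision neighbour of 0.1
print(stored == Decimal("0.1"))  # False: 0.1 has no exact BFP encoding

# In decimal arithmetic the same constant is exact.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```

The first comparison fails because the binary format can only hold a nearby fraction with a power-of-two denominator, which is the motivation for decimal formats such as BID.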

Intro: 3 Rounding Scenarios. These are important to note because the scenario determines the number of clock cycles required. Case 1: the exponent of A does not equal the exponent of B, and the intermediate significand is no larger than the chosen rounder size. Case 2: the exponents are equal (Aexp = Bexp). Case 3: the intermediate significand is too large for the rounder.
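The three cases can be sketched as a small classifier. This is an illustration under stated assumptions, not the authors' logic: operands are taken as BID-style pairs (integer coefficient, decimal exponent), the "intermediate significand" is assumed to be the sum of the coefficients after the operand with the larger exponent has been scaled by a power of ten, and `rounder_bits` is a hypothetical parameter standing in for the chosen rounder size.

```python
def rounding_case(a_coeff: int, a_exp: int,
                  b_coeff: int, b_exp: int,
                  rounder_bits: int = 64) -> int:
    """Return 1, 2 or 3 following the case numbering on the slide.

    Assumptions (not from the slides): coefficients are non-negative
    integers, and the intermediate significand is the sum of the
    aligned coefficients.
    """
    if a_exp == b_exp:
        return 2  # Case 2: exponents already match, no alignment needed

    # Align to the smaller exponent by scaling the other coefficient.
    shift = abs(a_exp - b_exp)
    if a_exp > b_exp:
        intermediate = a_coeff * 10 ** shift + b_coeff
    else:
        intermediate = a_coeff + b_coeff * 10 ** shift

    # Case 1 if the result fits the rounder, Case 3 otherwise.
    return 1 if intermediate.bit_length() <= rounder_bits else 3


print(rounding_case(15, 0, 25, 0))                 # equal exponents
print(rounding_case(1, 2, 5, 0))                   # small alignment
print(rounding_case(9_999_999_999_999_999, 20, 1, 0))  # large alignment
```

Raising `rounder_bits` in this model moves inputs from Case 3 to Case 1, which mirrors the size trade-off discussed for the multiplier.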

Baseline Implementation: The original HDL was synthesized for a Xilinx Virtex 5. The rounder block is the largest component. The multiplier used for alignment and rounding takes 12 DSP48E blocks. Several 64-bit 2:1 muxes use LUT resources inefficiently. There are several constant tables that could be optimized.

Rounder Block: Three tables inside the rounder block are candidates for optimization, along with the four multiplexers mentioned on the previous slide. CoreGen multipliers are slower and use more DSP48E blocks than the improved multipliers because they add the partial products in DSP blocks rather than in LUTs. Another option is to adjust the size of the multiplier (i.e., increase it so that Case 3 inputs become Case 1).
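The partial-product structure behind that trade-off can be sketched in software. The example below is only a model, assuming 64-bit unsigned operands and chunk widths of 24 and 17 bits (the unsigned range of a 25x18 signed DSP multiplier); these widths and the names `split` and `tiled_multiply` are assumptions, not the configuration used in the paper. Each small product corresponds to one hardware multiplier, and the final sum is the step that can be placed either in DSP blocks or in LUT-based adders.

```python
def split(value: int, width: int, chunk: int) -> list[int]:
    """Split a non-negative integer into little-endian chunks of `chunk` bits."""
    mask = (1 << chunk) - 1
    return [(value >> shift) & mask for shift in range(0, width, chunk)]


def tiled_multiply(a: int, b: int,
                   a_width: int = 64, b_width: int = 64,
                   a_chunk: int = 24, b_chunk: int = 17) -> tuple[int, int]:
    """Multiply via small partial products; return (product, partial count)."""
    partials = []
    for i, a_part in enumerate(split(a, a_width, a_chunk)):
        for j, b_part in enumerate(split(b, b_width, b_chunk)):
            # One small multiplier per (i, j) pair, weighted by its position.
            partials.append((a_part * b_part) << (i * a_chunk + j * b_chunk))
    # Summing the shifted partial products is the adder-tree step.
    return sum(partials), len(partials)


a, b = 0xFEDC_BA98_7654_3210, 0x0123_4567_89AB_CDEF
product, count = tiled_multiply(a, b)
print(product == a * b, count)
```

Changing the chunk widths or operand widths in this model changes the number of small multipliers, which is the kind of sizing decision the slide refers to.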

Decimal Digit Counter Synthesis Results. Table columns: Design, LUTs, FFs, BRAMs, Period (ns). Rows: Baseline, Merged BRAM, LUT Based. (The numeric values are missing from this transcript.) Two of the lookup tables, originally stored in two separate BRAMs, can be merged into a single BRAM. The other option is to implement the whole counter using LUTs. The merged-BRAM version was chosen: the timing difference here does not affect the overall timing of the adder, so area is more important. The other tables were implemented as LUTs because storing them in BRAM was not an efficient use of resources.
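The slides do not describe the counter's internals, so the following is a generic table-driven sketch of how the number of decimal digits in a binary significand can be found, not the authors' design. It assumes values of at most 64 bits; `DIGITS_BY_BITS`, `POW10`, and `decimal_digits` are hypothetical names. The two constant tables play the role that BRAM- or LUT-based tables would play in hardware.

```python
# Table 1: decimal digit count of the smallest value with n significant bits.
DIGITS_BY_BITS = [1] + [len(str(1 << (n - 1))) for n in range(1, 65)]

# Table 2: powers of ten used for the single correction compare.
POW10 = [10 ** k for k in range(20)]


def decimal_digits(x: int) -> int:
    """Number of decimal digits of x, for 0 <= x < 2**64."""
    if x == 0:
        return 1
    # The leading-one position gives an estimate that is at most one too low,
    # because an n-bit value is less than twice the smallest n-bit value.
    digits = DIGITS_BY_BITS[x.bit_length()]
    if x >= POW10[digits]:
        digits += 1
    return digits


for value in (9, 10, 999, 1000, 2 ** 64 - 1):
    print(value, decimal_digits(value))
```

Merging tables in this model would mean storing the estimate and the comparison constant in one entry indexed by bit length, which is one way to read the merged-BRAM option.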

Multiplexers. Table columns: 64-bit 2-to-1 MUX, LUTs, DSP48Es, Delay (ns). Rows: LUT-Based, Combined LUT, DSP-Based, DSP-and-LUT. (The numeric values are missing from this transcript.) The LUT-Based row is the default LUT implementation without combination. If LUTs are combined, routing congestion lowers the frequency of the resulting design.

Control Signals: The baseline implementation had mostly active-low control signals and an asynchronous reset. The optimized design uses active-high control signals and a single synchronous reset. This change also reduces the resources used.

Overall Results: The larger multiplier has a slight frequency penalty compared to the smaller multipliers, but moves more input combinations from Case 3 to Case 1. Therefore, the best multiplier size depends on the characteristics of the applications that use it. If multiple BID adders are implemented on a single FPGA, the DSP48E blocks are the limiting resource: a Virtex 5 can fit at most five BID adders with a pipelined small multiplier, but up to sixteen BID adders that use the multi-cycle multiplier.

This is because the multi-cycle multipliers use far fewer DSP48E blocks than the pipelined multipliers, which makes them a good choice when many parallel DFP units are needed. Choosing the larger multiplier degrades BID adder frequency by only approximately 2-3 MHz, while reducing the number of input combinations that incur the worst-case latency.
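The DSP-limited adder count is a simple integer division, sketched below. The slides give the resulting counts (five and sixteen) but not the device's DSP48E total or the per-adder usage of the optimized designs, so the numbers in this example are illustrative assumptions chosen only to show the arithmetic; `max_adders` is a hypothetical helper.

```python
def max_adders(total_dsp48e: int, dsp48e_per_adder: int) -> int:
    """How many adders fit when DSP48E blocks are the limiting resource."""
    if dsp48e_per_adder <= 0:
        raise ValueError("each adder must use at least one DSP48E block")
    return total_dsp48e // dsp48e_per_adder


# Illustrative assumptions, NOT figures taken from the slides:
TOTAL_DSP48E = 64                                  # assumed device budget
PER_ADDER = {"pipelined": 12, "multi-cycle": 4}    # assumed usage per adder

for name, blocks in PER_ADDER.items():
    print(name, max_adders(TOTAL_DSP48E, blocks))
```

With real post-synthesis utilization figures in place of the assumed values, the same division gives the number of parallel DFP units a given part can host.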