AMIN FARMAHININ-FARAHANI, CHARLES TSEN, KATHERINE COMPTON. FPGA Implementation of a 64-bit BID-Based Decimal Floating Point Adder/Subtractor
OUTLINE: Introduction and Overview; Baseline Implementation; FPGA-based Optimizations; Multiplier; Constant Tables; Multiplexers; Results; Conclusion
Introduction It is difficult to represent 0.1 in binary floating point (BFP): it has no exact binary representation (the closest single-precision value is 0.100000001490116119384765625). FPGAs are a potential way to add hardware-based decimal floating-point (DFP) engines to existing compute clusters, accelerating DFP calculations without replacing the computing infrastructure. This was the first presentation of a BID-based DFP adder for FPGAs. The basic idea of the paper was to take an adder implemented in HDL for standard cells and improve it for the Xilinx Virtex 5.
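To make the 0.1 example concrete, here is a small Python sketch (not from the slides) that rounds 0.1 to IEEE 754 single precision and shows the exact value that ends up stored. The helper name `nearest_float32` is illustrative only.

```python
import struct
from decimal import Decimal

def nearest_float32(x: float) -> Decimal:
    """Return the exact decimal value of x after rounding to single precision."""
    # Packing as a 32-bit float rounds to the nearest representable value;
    # unpacking gives that value back as a Python float.
    (as_f32,) = struct.unpack("<f", struct.pack("<f", x))
    # Decimal(float) converts the binary value exactly, with no extra rounding.
    return Decimal(as_f32)

stored = nearest_float32(0.1)
print(stored)                    # exact value held in the 32-bit float
print(stored == Decimal("0.1"))  # False: binary cannot hold 0.1 exactly
# Decimal arithmetic keeps 0.1 exact, which is the motivation for DFP hardware.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```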
Intro: 3 Rounding Scenarios The scenario matters because it changes the number of clock cycles required. Case 1: the exponents of A and B differ, and the intermediate significand is no larger than the chosen rounder size. Case 2: the exponents of A and B are equal (Aexp = Bexp). Case 3: the intermediate significand is too large for the rounder.
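The three cases can be sketched in Python as a simple classifier. This is an illustration under stated assumptions, not the authors' HDL: operands are modelled as (integer significand, decimal exponent) pairs, alignment is assumed to scale the operand with the larger exponent by a power of ten, and the rounder size is passed in as a hypothetical bit width `rounder_bits`.

```python
def rounding_case(a_sig: int, a_exp: int, b_sig: int, b_exp: int,
                  rounder_bits: int) -> int:
    """Classify an addition as Case 1, 2, or 3 (rounder_bits is an assumed parameter)."""
    if a_exp == b_exp:
        # Case 2: equal exponents, so the significands add without alignment.
        return 2
    # Align: scale the operand with the larger exponent by a power of ten
    # so both significands share the smaller exponent.
    diff = abs(a_exp - b_exp)
    if a_exp > b_exp:
        a_sig *= 10 ** diff
    else:
        b_sig *= 10 ** diff
    intermediate = a_sig + b_sig
    # Case 1 if the intermediate significand fits the rounder, otherwise Case 3.
    return 1 if intermediate.bit_length() <= rounder_bits else 3

print(rounding_case(1, 0, 1, 0, rounder_bits=64))   # equal exponents
print(rounding_case(1, 2, 5, 0, rounder_bits=64))   # small exponent gap
print(rounding_case(1, 30, 5, 0, rounder_bits=64))  # large exponent gap
```

Raising `rounder_bits` in this model moves inputs from Case 3 to Case 1, which is the trade-off behind choosing a larger multiplier.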
Baseline Implementation The original HDL was synthesized for a Xilinx Virtex 5. The rounder block is the largest component. The multiplier used for alignment and rounding takes 12 DSP48E blocks. Several 64-bit 2:1 muxes use LUT resources inefficiently. Several constant tables could be optimized.
Rounder Block Three tables inside the rounder block can be optimized, along with the four multiplexers mentioned on the previous slide. CoreGen multipliers are slower and use more DSP48E blocks than the improved multipliers, because they add partial products in DSP blocks instead of LUTs. Another option is to adjust the size of the multiplier (i.e., increase it so that Case 3 inputs become Case 1).
Decimal Digit Counter Synthesis Results [Table: Design, LUTs, FFs, BRAMs, Period (ns); rows: Baseline, Merged BRAM, LUT-Based] Two of the lookup tables, originally held in two separate BRAMs, can be merged into one BRAM. The other option is to implement the whole counter in LUTs. The Merged BRAM design was chosen: the time savings here do not affect the overall timing of the adder, so space is more important. The other tables were implemented in LUTs because storing them in BRAM was not an efficient use of resources.
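As an illustration of a table-driven decimal digit counter with merged tables, the Python sketch below uses one table indexed by the position of the leading one bit. Each entry packs two values that could otherwise live in separate tables: a digit-count estimate and the power of ten used to correct it. The slides do not say what the real tables contain, so the table layout and the names `MERGED_TABLE` and `count_decimal_digits` are assumptions.

```python
WIDTH = 64  # width of the binary significand being measured

def _digits(n: int) -> int:
    """Reference digit count, used only to build the table."""
    return len(str(n))

# Entry b covers values whose bit length is b. Each entry merges two values:
# (digits of the smallest b-bit number, the next power of ten above it).
MERGED_TABLE = [(1, 10)] + [
    (_digits(1 << (b - 1)), 10 ** _digits(1 << (b - 1)))
    for b in range(1, WIDTH + 1)
]

def count_decimal_digits(x: int) -> int:
    """Count decimal digits of x with one table lookup and one comparison."""
    if not 0 <= x < (1 << WIDTH):
        raise ValueError("x must fit in WIDTH bits")
    estimate, threshold = MERGED_TABLE[x.bit_length()]
    # A b-bit value has at most one more digit than the smallest b-bit value,
    # so a single comparison against the threshold corrects the estimate.
    return estimate + (1 if x >= threshold else 0)

print(count_decimal_digits(9), count_decimal_digits(10))
print(count_decimal_digits((1 << 64) - 1))
```

Fetching both values from one entry mirrors the idea of reading a single merged memory word instead of two separate ones.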
Multiplexers [Table: 64-bit 2-to-1 MUX, LUTs, DSP48Es, Delay (ns); rows: LUT-Based, Combined LUT, DSP-Based, DSP-and-LUT] The LUT-Based design uses the default LUT-based implementation without combining LUTs. If LUTs are combined, routing congestion lowers the frequency of the result.
Control Signals The baseline implementation used mostly active-low control signals and an asynchronous reset. The optimized design uses active-high control signals and a single synchronous reset. This change also reduces the resources used.
Overall Results The larger multiplier has a slight frequency penalty compared to the smaller multipliers, but it moves more input combinations from Case 3 to Case 1. The best multiplier size therefore depends on the characteristics of the applications that use it. If multiple BID adders are implemented on a single FPGA, the DSP48E blocks are the limiting resource: a Virtex 5 can fit at most five BID adders with a pipelined small multiplier, but up to sixteen BID adders that use the multi-cycle multiplier.
This is because the multi-cycle multipliers use far fewer DSP48E blocks than the pipelined multipliers, making them a good choice for many parallel DFP units. The larger multiplier degrades BID adder frequency by only approximately 2-3 MHz, but reduces the number of input combinations that incur the worst-case latency.
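The claim that the best multiplier size depends on the application can be explored with a simple throughput model. The Python sketch below is a rough model rather than a result from the paper: it assumes one addition is processed at a time and that each rounding case has a fixed cycle count. Every number in the usage lines (clock rates, case mixes, cycle counts) is a hypothetical placeholder chosen only to exercise the function.

```python
def expected_throughput(freq_mhz: float, case_probs: dict, case_cycles: dict) -> float:
    """Millions of additions per second, given a mix of rounding cases.

    case_probs maps case number -> fraction of inputs in that case.
    case_cycles maps case number -> clock cycles that case takes.
    """
    if abs(sum(case_probs.values()) - 1.0) > 1e-9:
        raise ValueError("case probabilities must sum to 1")
    # Average cycles per addition, weighted by how often each case occurs.
    avg_cycles = sum(p * case_cycles[c] for c, p in case_probs.items())
    return freq_mhz / avg_cycles

# Hypothetical placeholders, NOT measurements from the paper.
hypothetical_cycles = {1: 4, 2: 3, 3: 8}
small = expected_throughput(150.0, {1: 0.50, 2: 0.30, 3: 0.20}, hypothetical_cycles)
large = expected_throughput(147.0, {1: 0.65, 2: 0.30, 3: 0.05}, hypothetical_cycles)
print(f"small multiplier: {small:.1f} M adds/s, large multiplier: {large:.1f} M adds/s")
```

With a workload that rarely hits Case 3, the slower clock of the larger multiplier would buy little, so the comparison should be rerun with the case mix of the target application.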