A Novel FPGA Logic Block for Improved Arithmetic Performance


1 A Novel FPGA Logic Block for Improved Arithmetic Performance
Hadi Parandeh-Afshar, Philip Brisk, Paolo Ienne. 16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008

2 FPGA vs. ASIC
Comparison criteria (ASIC vs. FPGA): Performance, Area Utilization, Power Consumption, Flexibility, Time-to-Market
Performance gap between FPGAs and ASICs [Kuon and Rose, FPGA 2006 and TCAD 2007]
Arithmetic circuits exacerbate the disparities
Focus on compressor trees
1/16

3 Compressor Trees
A circuit that sums k > 2 integer values
Carry-save representation [Wallace 1966, Dadda 1967]
Parallel multipliers
Many video/signal processing circuits: FIR filters, H.264/AVC video coding, 3G wireless base station channel cards
Flowgraph transformations to expose compressor trees [Verma and Ienne, ICCAD 2004]: generally applicable to arithmetic circuits; merge disparate add and mul operations to form compressor trees
2/16
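The carry-save idea on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names are hypothetical, the operands are non-negative Python integers, and the reduction order is a simple queue rather than the Wallace or Dadda schedule. Each 3:2 step replaces three operands with a sum word and a carry word without propagating carries; a single ordinary addition finishes the job.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (sum_word, carry_word) with a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                            # per-bit sum, no carry propagation
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved up one rank
    return sum_word, carry_word


def compressor_tree_sum(values):
    """Reduce k > 2 non-negative integers to two operands, then do one final addition."""
    operands = list(values)
    while len(operands) > 2:
        a, b, c = operands[:3]
        sum_word, carry_word = carry_save_add(a, b, c)
        # Three operands are replaced by two, so the list shrinks by one each pass.
        operands = operands[3:] + [sum_word, carry_word]
    return sum(operands)  # the only carry-propagate addition


if __name__ == "__main__":
    print(compressor_tree_sum([3, 5, 7, 9, 11]))
```

Only the final `sum` needs a carry-propagate adder; every earlier step works on all bit positions independently.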

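4 Circuit Transformation
[Figure: ADPCM dataflow graph (signals step, delta, vpdiff; operators >>, &, =, SEL, +) shown before and after the transformation; in the transformed version the additions are merged into a single Compressor Tree] [Verma and Ienne, ICCAD 2004] 3/16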

5 Compressor Tree Synthesis
ASIC Synthesis: Ripple-carry addition; Carry-save representation; Ternary addition; Full/Half Adder (FA/HA) Trees; m:n counters
FPGA Synthesis: Stratix II/III carry chain; LUTs (shared arithmetic mode); Ternary addition
Map poorly onto LUTs; poor flexibility in mapping
[Wallace 1966] [Dadda 1967] [Stelling et al., TComp 1998]
m:n counters: Generalized Full/Half Adders that count the number of input bits set to 1; the output is a value in the range [0, m] [Verma and Ienne, DATE 2007]
Drawbacks: routing delays; can't use carry chains
[Figure labels: LUTs, LUTs, Carry-chain]
4/16
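The m:n counter described on this slide can be modeled in a few lines. This is a behavioral sketch, not a hardware description: the function name `mn_counter` is hypothetical, and the output width n is assumed to be the smallest number of bits that can represent every value in [0, m].

```python
def mn_counter(bits):
    """Count the input bits set to 1 and return the count as n bits, least significant first."""
    m = len(bits)
    n = m.bit_length()  # smallest width that can hold any value in [0, m]
    count = sum(bits)   # number of inputs equal to 1
    return [(count >> i) & 1 for i in range(n)]


if __name__ == "__main__":
    print(mn_counter([1, 1, 1]))           # 3:2 counter, i.e. a full adder
    print(mn_counter([1, 1, 0, 1, 1, 1]))  # 6:3 counter
```

With three inputs this behaves like a full adder, and with two inputs like a half adder, which is why the slide calls counters generalized Full/Half Adders.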

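6 The Altera Stratix II/III ALM: Shared Arithmetic Mode
[Figure: two adjacent bit positions of the ALM. At rank r, one 3-LUT produces sum_r and another 3-LUT produces carry_r; at rank r+1, two 3-LUTs produce sum_{r+1} and carry_{r+1}; the results go to the ALM outputs] 5/16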

7 Generalized Parallel Counters (GPCs)
Extension to m:n counters
Input bits can have different ranks, i.e., (k_{n-1}, …, k_1, k_0; S), where k_i is the number of input bits of weight 2^i
Examples: (2, 3; 3), with output range [0, 7], so S = 3; (0, 4; 3), which is a 4:3 counter
Number of input bits: M = k_{n-1} + … + k_1 + k_0
Number of output bits: S
6/16
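The GPC notation can be checked with a short calculation. The sketch below uses hypothetical helper names and assumes, as the (2, 3; 3) example suggests, that S is the number of bits needed to hold the largest possible weighted sum. The tuple is written as on the slide, with the highest rank first.

```python
def gpc_output_bits(ks):
    """Return S for a GPC (k_{n-1}, ..., k_0): the bits needed for its largest output value."""
    # reversed(ks) lists k_0 first, so `rank` is the exponent of the weight 2**rank.
    max_value = sum(k << rank for rank, k in enumerate(reversed(ks)))
    return max_value.bit_length()


def gpc_value(ks, inputs_by_rank):
    """Weighted count of set bits; inputs_by_rank follows the same highest-rank-first order."""
    total = 0
    for rank, (k, bits) in enumerate(zip(reversed(ks), reversed(inputs_by_rank))):
        if len(bits) != k:
            raise ValueError(f"rank {rank} expects {k} bits, got {len(bits)}")
        total += sum(bits) << rank
    return total


if __name__ == "__main__":
    print(gpc_output_bits((2, 3)))                   # the (2, 3; 3) GPC
    print(gpc_output_bits((0, 4)))                   # the (0, 4; 3) GPC, a 4:3 counter
    print(gpc_value((2, 3), [[1, 1], [1, 1, 1]]))    # all five inputs set
```

For (2, 3; 3) the largest output is 2·2 + 3·1 = 7, which needs three bits; for (0, 4; 3) it is 4, which also needs three bits.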

8 Compressor Tree Synthesis on FPGAs via GPC Mapping
Software synthesis heuristic/ILP [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
Faster than ternary adder trees or DSP blocks
Stratix II/III and Xilinx Virtex-5 FPGAs
GPCs with M = 6 inputs and S = 3 or 4 outputs were mapped onto 6-LUTs
Unable to exploit the carry chain, except for the final add
Contribution: A new carry chain that we can use!
7/16

9 The 6:2 Compressor: an Alternative to the 6:3 Counter and 6-input GPC
6:3 Counter: all inputs have rank 0; output ranks 0, 1, 2
6-input GPC: input ranks may vary; output ranks 0, 1, 2
6:2 Compressor: all inputs have rank 0; output ranks 0, 1; carry-in bits cin,0 and cin,1; carry-out bits cout,0 and cout,1 (the figure labels the carry bits with ranks 0, 1, and 2)
8/16

10 Why are 6:2 compressors more effective than 6:3 counters?
6:3 counters: steady state of 3 bits per column
6:2 compressors: steady state of 2 bits per column
11/16
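The steady-state figures on this slide follow from counting where each cell's outputs land. The sketch below is a simplified model with a hypothetical function name: it assumes one cell per column, each consuming six rank-0 bits, and it assumes that a 6:3 counter writes outputs at ranks 0, 1, and 2 while a 6:2 compressor writes outputs at ranks 0 and 1, with its carry bits staying on the carry chain and therefore not counted.

```python
def output_bits_per_column(num_columns, output_ranks):
    """Count the output bits landing in each column when every column has one cell."""
    heights = [0] * (num_columns + max(output_ranks))
    for column in range(num_columns):
        for rank in output_ranks:
            # A cell in `column` emits one bit into the column `rank` positions higher.
            heights[column + rank] += 1
    return heights


if __name__ == "__main__":
    print(output_bits_per_column(8, (0, 1, 2)))  # 6:3 counters
    print(output_bits_per_column(8, (0, 1)))     # 6:2 compressors
```

Away from the edges, every column receives three bits from the counters but only two from the compressors, so a compressor stage leaves exactly the two operands that a final adder needs.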

11 6:2 Compressors Form a Carry Chain
Each 6:2 compressor is a logic cell. Carry chains between adjacent cells bypass local routing. This is not an over-glorified ripple-carry structure. 9/16

12 6:2 Compressors: Microarchitecture
[Figure: 6:2 compressor built from FA and HA blocks, with rank-0 inputs, sum outputs, carry-ins cin,0 and cin,1, and carry-outs cout,0 and cout,1] No combinational path from carry-in to carry-out bits. This is not ripple-carry. 10/16

13 Similarities Between Shared Arithmetic Mode and the 6:2 Compressor
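[Figure, left: 6:2 Compressor built from FA and HA blocks, with rank-0 inputs, sum outputs, carry-ins cin,0 and cin,1, and carry-outs cout,0 and cout,1. Figure, right: ALM (Shared Arithmetic Mode), with ALM inputs feeding LUTs and FA blocks that drive the ALM outputs] 12/16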

14 Proposed Logic Cell: 2 Designs
13/16

15 Experimental Methodology
Platform: VPR
Modeled island-style FPGA with Altera-like ALMs and LABs; 4 ALMs per LAB to reduce complexity
4 Mapping Algorithms:
3-ADD: Ternary adder trees
GPC: GPC mapping [Parandeh-Afshar et al., ASPDAC 2008]
6:2: Mapping using 6:2 compressors only
6:2 + GPC: The best of both worlds
14/16

16 Experimental Results
3-ADD has the smallest area in all cases
GPC has the largest area in all cases
No uniform trends
GPC does not use carry chains; the others do!
6:2 + GPC is the best in all cases
15/16

17 Conclusion
Compressor trees are an important class of arithmetic circuits
Previous work: GPC mapping outperforms 3-ADD, but cannot use the carry chain
Contribution: a new carry chain that configures the Altera Stratix II/III ALM as a 6:2 compressor (1 HA, 2 FA, 2 muxes, plus wires)
Best results combine GPC mapping with 6:2 compressors
Average speedup: 1.41x over 3-ADD
Average increase in ALM usage: 1.19x over 3-ADD
16/16

