Published by Dorthy Kelly. Modified over 9 years ago.
1
A Flexible DSP Block to Enhance FPGA Arithmetic Performance
Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulous, Philip Brisk, Yusuf Leblebici, Paolo Ienne
LAP, EPFL; LSM, EPFL; UCR
École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR)
2
Motivation and contribution
New DSP block for high-performance FPGAs.
Increased flexibility: a bypassable partial-product generator (PPG).
A programmable compressor tree.
Goal: enhance FPGA arithmetic performance.
3
Motivation and contribution
Data flow transformations automatically expose compressor trees [Verma et al., TCAD 08].
Fused multiply-add operations cannot use current DSP blocks in a single cycle.
DSP blocks cannot accelerate multi-operand addition.
[Figure: a dataflow graph (operands E1, E2, M1, M2, S1, S2; sign, xor, neg, not and and logic) shown (a) before and (b) after the data flow transformation]
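To illustrate the idea behind the transformation, here is a minimal Python sketch, assuming unsigned operands: a multiply-add a x b + c is rewritten as one multi-operand addition of the multiplier's partial-product rows plus the addend, which is exactly the kind of sum a compressor tree evaluates. The function names are hypothetical, and this is not the algorithm of Verma et al.

```python
def partial_products(a, b, width):
    """Unsigned shift-and-AND partial products of a * b, one row per bit of b (b < 2**width)."""
    return [a << i if (b >> i) & 1 else 0 for i in range(width)]


def fused_multiply_add(a, b, c, width):
    """Evaluate a * b + c as ONE multi-operand sum of the partial products and c."""
    rows = partial_products(a, b, width) + [c]
    # In hardware, all rows would enter one compressor tree followed by a single
    # carry-propagate adder, instead of a multiplier CPA followed by a second adder.
    return sum(rows)


print(fused_multiply_add(13, 11, 7, width=4))  # 13 * 11 + 7
```

The point of the rewrite is that the addend c costs one extra row in the compressor tree rather than a second carry-propagate addition.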
4
Outline
Related work
Limitations
DSP block architecture
Experimental methodology
Results
Conclusions
5
FPGA commentary
Logic cells with dedicated addition circuitry and fast carry chains
Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]
IP cores [Xilinx, Altera]
FP cores [Beauchamp et al., TVLSI 08]
DSP blocks [Altera Stratix III-IV]
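For context on compressor tree synthesis on 6-LUT FPGAs, here is a small Python sketch of a (6;3) counter, one of the generalized parallel counters (GPCs) such tools build from: it adds six bits of the same weight and emits their 3-bit count. Because each output bit is a Boolean function of the same six inputs, each fits in one 6-LUT. The LUT-mask encoding and function names are assumptions for illustration.

```python
def gpc63_lut_masks():
    """Return the 64-bit truth table (LUT mask) of each of the three output bits."""
    masks = [0, 0, 0]
    for idx in range(64):                  # idx encodes the six LUT inputs
        count = bin(idx).count("1")        # number of ones among the six inputs
        for bit in range(3):
            if (count >> bit) & 1:
                masks[bit] |= 1 << idx     # this input pattern sets output bit `bit`
    return masks


def lut6(mask, inputs):
    """Look up a 6-input LUT: inputs[i] is the i-th select bit."""
    idx = sum(b << i for i, b in enumerate(inputs))
    return (mask >> idx) & 1


def gpc63(inputs, masks):
    """(6;3) counter built from three 6-LUTs: returns the count of ones in `inputs`."""
    return sum(lut6(m, inputs) << k for k, m in enumerate(masks))


masks = gpc63_lut_masks()
print(gpc63([1, 0, 1, 1, 0, 1], masks))   # four ones in, binary 100 out
```

A GPC generalizes this by also taking bits from neighboring columns; the single-column counter is just the simplest case.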
7
Field Programmable Compressor Tree (FPCT)
User-configurable multi-operand adder: a compressor tree plus a bypassable carry-propagate adder (CPA).
Each CSlice has 16 input bits and 6 output bits, with 15 carry-in and 15 carry-out bits linking adjacent CSlices. With 8 CSlices, the FPCT has 128 = 8 x 16 input bits and 48 = 8 x 6 output bits.
Previous work established that the FPCT accelerates multi-input addition operations; a 1.6x speedup was observed [Cevrero et al., FPGA 08, TRETS 09].
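The FPCT's structure, a compressor tree followed by a bypassable CPA, can be modeled in a few lines. The Python sketch below assumes generic 3:2 carry-save adders as the compression stage; the real CSlice internals differ, and the names are hypothetical. With `bypass_cpa=True` it returns the (at most two) carry-save rows instead of the final sum.

```python
def carry_save_add(x, y, z):
    """3:2 compressor applied to whole words: returns (s, c) with x + y + z == s + c."""
    s = x ^ y ^ z                              # per-column sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-column carries, moved up one column
    return s, c


def multi_operand_add(operands, bypass_cpa=False):
    """Compressor tree (3:2 reduction) followed by an optional final CPA."""
    rows = list(operands)
    # Compression stage: reduce three rows to two until at most two rows remain.
    while len(rows) > 2:
        reduced = []
        while len(rows) >= 3:
            s, c = carry_save_add(rows.pop(), rows.pop(), rows.pop())
            reduced += [s, c]
        rows = reduced + rows                  # 0 to 2 leftover rows pass through
    if bypass_cpa:
        return rows                            # result stays in carry-save form
    return sum(rows)                           # final carry-propagate addition


print(multi_operand_add([3, 5, 7, 11, 13]))
```

Keeping the CPA bypassable lets a result stay in carry-save form when it feeds another compression stage.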
9
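FPCT limitations: PPG in soft logic; low input utilization for multipliers
9x9-bit signed multiplier [Baugh-Wooley]: a soft-logic 9x9-bit PPG (81 LUTs) drives 82 wires into one FPCT, which produces the 18-bit output.
64% input utilization: the 82 wires occupy 82 of the FPCT's 128 input bits (82/128 ≈ 64%).
[Figure: the partial-product bits mapped onto the FPCT, with labels C0 to C6]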
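As a reference for what the soft-logic PPG computes, here is a Python sketch of the textbook modified Baugh-Wooley scheme for an n x n two's-complement multiplier (n = 9 by default). It generates n² = 81 partial-product bits (an AND, or a NAND for terms that pair exactly one sign bit), matching the count of 81 LUTs, plus two constant 1s at columns n and 2n - 1. The exact bit arrangement fed to the FPCT in the deck may differ; the function names are hypothetical.

```python
def baugh_wooley_bits(a, b, n=9):
    """(weight, bit) pairs of the modified Baugh-Wooley partial products.

    a and b are n-bit two's-complement values given as Python ints in
    [-2**(n-1), 2**(n-1)).
    """
    abits = [(a >> i) & 1 for i in range(n)]   # Python >> is arithmetic, so this
    bbits = [(b >> j) & 1 for j in range(n)]   # yields two's-complement bits
    pp = []
    for i in range(n):
        for j in range(n):
            bit = abits[i] & bbits[j]
            # Exactly one sign bit involved: use the complemented (NAND) term.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            pp.append((i + j, bit))
    # Constant correction ones at columns n and 2n - 1.
    pp += [(n, 1), (2 * n - 1, 1)]
    return pp


def baugh_wooley_multiply(a, b, n=9):
    """Sum the partial products and read the 2n-bit result as a signed value."""
    total = sum(bit << w for w, bit in baugh_wooley_bits(a, b, n))
    total %= 1 << (2 * n)                      # keep a 2n-bit result
    if total >= 1 << (2 * n - 1):              # reinterpret the MSB as the sign
        total -= 1 << (2 * n)
    return total


print(baugh_wooley_multiply(-256, 255))
```

The complemented terms and constant 1s replace the sign-extension rows a naive signed multiplier would need, which keeps every row a plain bit vector that a compressor tree can sum.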
10
DSP block architecture
FPCT (8 CSlices): 128 input bits, 48 output bits
12
DSP block architecture
Two ½-FPCTs (4 CSlices each).
Two 9x9 signed PPGs, one of which (PPG*) is modified to support larger multipliers.
Hard compression circuits 'A' and 'B'.
Efficient synthesis of large multipliers.
[Figure: PPG and PPG* feeding the two ½-FPCTs through the compression circuits, labeled 'Fixed Logic (A)' and 'Logic (B)'; CSlices labeled C1 to C4]
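One way to see why larger multipliers call for a modified PPG: an 18 x 18 signed product can be assembled from four 9 x 9 sub-products, but only the high-by-high term is signed x signed; the cross terms are signed x unsigned and the low-by-low term is unsigned x unsigned. The Python sketch below shows this general decomposition. It is an illustration of the technique, not the deck's stated design for PPG*, and the function names are hypothetical.

```python
def split(x, k=9):
    """Split x into a signed high part and an unsigned low k-bit part: x == hi * 2**k + lo."""
    lo = x & ((1 << k) - 1)    # unsigned low k bits
    hi = x >> k                # arithmetic shift keeps the sign
    return hi, lo


def multiply_18x18(a, b, k=9):
    """Signed 2k x 2k product assembled from four k x k sub-products."""
    ah, al = split(a, k)
    bh, bl = split(b, k)
    pp_hh = ah * bh            # signed x signed
    pp_hl = ah * bl            # signed x unsigned
    pp_lh = al * bh            # unsigned x signed
    pp_ll = al * bl            # unsigned x unsigned
    # The shifted sub-products form one multi-operand addition.
    return (pp_hh << (2 * k)) + ((pp_hl + pp_lh) << k) + pp_ll


print(multiply_18x18(-131072, 131071))
```

Because a 9-bit unsigned low part needs 10 bits as a signed number, a plain 9x9 signed PPG cannot produce the cross terms directly.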
13
DSP block architecture
Same organization: two ½-FPCTs (4 CSlices each), two 9x9 signed PPGs (PPG and the modified PPG*), and hard compression circuits 'A' and 'B' for efficient synthesis of large multipliers.
Only 8% larger than a traditional FPCT in 90nm CMOS (ARTISAN cell library, TSMC process).
16
Experimental methodology
To assess the DSP block's performance, we used Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]:
1. Define a preplaced soft IP core, F*, with the same area and I/O as our DSP block.
2. Replace our DSP block with F*.
3. Map the benchmark on Stratix II.
4. Extract the F* delay.
5. Estimate the proposed DSP block delay with an ASIC design flow (90nm CMOS).
6. For each proposed DSP block in the circuit, subtract the delay of F* and add the proposed DSP block delay.
[Figure: input pins, F* instances replaced by the new DSP blocks, output pins]
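The final VEB step is simple delay arithmetic. Below is a Python sketch, assuming each timing path's measured delay includes the full F* delay once per DSP block it traverses; the class and function names are hypothetical. Because the substitution changes path delays unevenly, the maximum is recomputed over all paths rather than only on the original critical path.

```python
from dataclasses import dataclass


@dataclass
class TimingPath:
    measured_delay_ns: float   # path delay reported with the F* placeholders in place
    dsp_blocks_on_path: int    # number of F* instances the path traverses


def adjusted_critical_path(paths, f_star_delay_ns, new_dsp_delay_ns):
    """Swap each F* delay for the ASIC-estimated DSP block delay, then take the max."""
    return max(
        p.measured_delay_ns
        - p.dsp_blocks_on_path * f_star_delay_ns
        + p.dsp_blocks_on_path * new_dsp_delay_ns
        for p in paths
    )


paths = [TimingPath(10.0, 1), TimingPath(9.0, 2), TimingPath(8.5, 0)]
print(adjusted_critical_path(paths, f_star_delay_ns=3.0, new_dsp_delay_ns=2.0))
```

With a slower estimated block (for example 5.0 ns instead of 2.0 ns), the path through two DSP blocks overtakes the original critical path, which is why every path is re-evaluated.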
17
Results: Critical Path Delay
[Chart: critical path delay (ns) for Ternary, GPC [Parandeh-Afshar et al., ASPDAC 08], Stratix II DSP block, FPCT w/ soft PPG, and the proposed DSP block]
18
Results: Normalized Area (to Stratix II DSP block area)
[Chart: area of the Stratix II DSP block, FPCT w/ soft PPG, and the proposed DSP block, normalized to the Stratix II DSP block area]
19
Conclusion
A new DSP block is proposed that:
accelerates multiplication and multi-operand addition;
offers more flexibility;
is competitive with the Stratix II DSP block;
is intended to replace the compressor tree in existing DSP blocks;
adds only 8% area overhead with respect to the original FPCT.