Presentation is loading. Please wait.

Presentation is loading. Please wait.

Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.

Similar presentations


Presentation on theme: "Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient."— Presentation transcript:

1 Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient Synthesis of Compressor Trees on FPGAs

2 January 22, 20082 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

3 January 22, 20083 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

4 January 22, 20084 FPGA vs. ASIC Performance Area Utilization Power Consumption Flexibility Time-to-Market ASICFPGA √ √ √ √ √

5 January 22, 20085 FPGA Arithmetic Features Poor Performance for Arithmetic Operations Compared to ASIC IP Cores High Routing Costs Limited Flexibility; 18-bit Adder/Multiplier Full Adder Implemented in CLB Structure Fast Carry-Chain (Xilinx and Altera) Reduces Routing Delay Cannot Use Compressor Trees to Add k>2 Values Wallace/Dadda/3-Greedy

6 January 22, 20086 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

7 January 22, 20087 Motivation: Compressor Trees Partial product reduction in parallel multiplication Wallace and Dadda in the 1960s Multi-input addition occurs in many multimedia and signal processing H.264/AVC Variable Block Size Motion Estimation FIR Filters 3G Wireless Base Station Channel Cards Flow graph transformations expose opportunities to use compresor trees in high-level synthesis [Verma and Ienne, ICCAD 04]

8 January 22, 20088 Flow Graph Transformation step3 >> & delta7 & 4 SEL = 0 + + step1 >> & 2 = 0 SEL + step2 >> & 1 = 0 vpdiff step3 >> = delta1 & 0 step2 >> SEL 0 = delta2 & 0 step1 >> SEL 0 = delta4 & 0 step0 >> SEL 0 vpdiff ∑ + Compressor Tree ADPCM

9 January 22, 20089 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

10 January 22, 200810 Counters m n m:n counter n = log 2 (m+1) Count #of Input Bits Set to 1 Output # as a Binary Value Counters You Know 2:2 – Half Adder 3:2 – Full Adder (Carry-Save Adder) The correct building block for computing sums of k>2 numbers Counters do not map well onto LUTs or carry chains

11 January 22, 200811 Generalized Parallel Counters (GPCs) Sum bits having different ranks m:n counter: all bits have rank 0, i.e.: 2 0 = 1 Representation: (K n-2, K n-1, …, K 0 ; S) K i – number of input bits of rank i S – number of output bits (0, 4; 3) – typical 4:3 counter (2, 3; 3) – maximum value: 2*2 1 + 4*2 0 = 12 Range [0, 12] requires S = 4 output bits Examples using dot notation (3, 3; 4) GPC (5, 5; 4) GPC

12 January 22, 200812 GPC Implementation For ASICs Basic gates, e.g. AND, XOR Built from m:n counters, e.g., just like a compressor tree FPGA Implementation K-input GPC maps nicely onto K-LUTs One logic level required K = 6 for Xilinx Virtex-5 and Altera Stratix II and III Three 6-LUTs for 6-input, 3-output GPC Four 6-LUTs for 6-input, 4-output GPC

13 January 22, 200813 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

14 January 22, 200814 Definitions Primitive GPCs: Satisfies given I/O Constraints 12-primitive GPCs for 6 inputs, 3 outputs Including (1, 3; 3), (2, 3; 3) Covering GPCs Functionality cannot be implemented by other GPCs, given the I/O constraints e.g., (2, 3; 3) GPC can implement a (1, 3; 3) GPC  Set a rank-1 input bit to 0

15 January 22, 200815 Definitions Unreasonable GPCs: Single bit in rank-0 column (3, 1; 3) GPC  rank-0 output bit = rank-0 input bit No reduction in bits (1, 2; 3) GPC 3 input bits: Output value in range [0, 4] 3 output bits

16 January 22, 200816 Definitions Compression Ratio (CR): # Input Bits / # Output Bits (3, 3; 4) GPC CR = 6/4 = 1.5 (2, 3; 3) GPC CR = 5/3 = 1.67 Using GPCs with large CR tends to reduce the number of bits to sum at the next logic level # logic levels = # LUTs on critical path in an FPGA

17 January 22, 200817 Input: Columns of bits to sum Example: 3-tap FIR filter Each FIR filter is different, depending on constants used 0 rank

18 January 22, 200818 Mapping Heuristic map_algorithm(Integer : M, Integer : N, Array of Integers : columns ) { step1: find_covering_GPCs( ); step2: find_primitive_GPCs( ); step3: order_primitive_GPCs( ); Repeat { step4: Repeat { col_indx = find_highest_column( ); find_next_GPC (col_indx); remove_covered_dots( ); } until all dots are covered or no reasonable GPC is found step5: connect_GPCs_IOs( ); step6: generate_next_stage_dots( ); } until three rows of dots remains; } step7: generate_final_cpa( columns ) Virtex-5 and Stratix II & III support ternary addition Attack the tallest column first (greedy approach)

19 January 22, 200819 4 3 1 Example 2 Map to ternary adder

20 January 22, 200820 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

21 January 22, 200821 Experimental Methodology Altera Stratix-II 90nm CMOS Technology Implementations of multi-input addition ADD – Ternary adder tree State of the art for FPGAs 3GD – 3-greedy algorithm (3:2 and 2:2 counters) [Stelling et al., TCOMP 98] 2 and 3-input counters do not map well onto 6-LUTs! GPCs – Heuristic described here

22 January 22, 200822 Experimental Results (Delay) 27% on average GPC is faster than ADD

23 January 22, 200823 Experimental results (Area) 5% increase in ALMs usage for GPC compared to ADD

24 January 22, 200824 Are DSP/MAC Blocks Useful? No! On average, delay using DSP/MAC blocks was more than 2x worse than 3GD

25 January 22, 200825 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

26 January 22, 200826 Conclusion Conventional wisdom has held that adder trees outperform compressor trees on FPGAs Ternary adder trees were a major selling point of the Altera Stratix II architecture This led to their inclusion in Xilinx Virtex-5 devices Conventional wisdom is wrong! GPCs map nicely onto LUTs Compressor trees on FPGAs, are faster than adder trees when built from GPCs


Download ppt "Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient."

Similar presentations


Ads by Google