Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.

Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient Synthesis of Compressor Trees on FPGAs

January 22, 20082 Outline State of the Art: FPGAs Motivation Generalized Parallel Counters Mapping Heuristic Experimental Results Conclusion

January 22, 20084 FPGA vs. ASIC Performance Area Utilization Power Consumption Flexibility Time-to-Market ASICFPGA √ √ √ √ √

January 22, 20085 FPGA Arithmetic Features Poor Performance for Arithmetic Operations Compared to ASIC IP Cores High Routing Costs Limited Flexibility; 18-bit Adder/Multiplier Full Adder Implemented in CLB Structure Fast Carry-Chain (Xilinx and Altera) Reduces Routing Delay Cannot Use Compressor Trees to Add k>2 Values Wallace/Dadda/3-Greedy

January 22, 20087 Motivation: Compressor Trees Partial product reduction in parallel multiplication Wallace and Dadda in the 1960s Multi-input addition occurs in many multimedia and signal processing H.264/AVC Variable Block Size Motion Estimation FIR Filters 3G Wireless Base Station Channel Cards Flow graph transformations expose opportunities to use compresor trees in high-level synthesis [Verma and Ienne, ICCAD 04]

January 22, 20088 Flow Graph Transformation step3 >> & delta7 & 4 SEL = 0 + + step1 >> & 2 = 0 SEL + step2 >> & 1 = 0 vpdiff step3 >> = delta1 & 0 step2 >> SEL 0 = delta2 & 0 step1 >> SEL 0 = delta4 & 0 step0 >> SEL 0 vpdiff ∑ + Compressor Tree ADPCM

January 22, 200810 Counters m n m:n counter n = log 2 (m+1) Count #of Input Bits Set to 1 Output # as a Binary Value Counters You Know 2:2 – Half Adder 3:2 – Full Adder (Carry-Save Adder) The correct building block for computing sums of k>2 numbers Counters do not map well onto LUTs or carry chains

January 22, 200811 Generalized Parallel Counters (GPCs) Sum bits having different ranks m:n counter: all bits have rank 0, i.e.: 2 0 = 1 Representation: (K n-2, K n-1, …, K 0 ; S) K i – number of input bits of rank i S – number of output bits (0, 4; 3) – typical 4:3 counter (2, 3; 3) – maximum value: 2*2 1 + 4*2 0 = 12 Range [0, 12] requires S = 4 output bits Examples using dot notation (3, 3; 4) GPC (5, 5; 4) GPC

January 22, 200812 GPC Implementation For ASICs Basic gates, e.g. AND, XOR Built from m:n counters, e.g., just like a compressor tree FPGA Implementation K-input GPC maps nicely onto K-LUTs One logic level required K = 6 for Xilinx Virtex-5 and Altera Stratix II and III Three 6-LUTs for 6-input, 3-output GPC Four 6-LUTs for 6-input, 4-output GPC

January 22, 200814 Definitions Primitive GPCs: Satisfies given I/O Constraints 12-primitive GPCs for 6 inputs, 3 outputs Including (1, 3; 3), (2, 3; 3) Covering GPCs Functionality cannot be implemented by other GPCs, given the I/O constraints e.g., (2, 3; 3) GPC can implement a (1, 3; 3) GPC  Set a rank-1 input bit to 0

January 22, 200815 Definitions Unreasonable GPCs: Single bit in rank-0 column (3, 1; 3) GPC  rank-0 output bit = rank-0 input bit No reduction in bits (1, 2; 3) GPC 3 input bits: Output value in range [0, 4] 3 output bits

January 22, 200816 Definitions Compression Ratio (CR): # Input Bits / # Output Bits (3, 3; 4) GPC CR = 6/4 = 1.5 (2, 3; 3) GPC CR = 5/3 = 1.67 Using GPCs with large CR tends to reduce the number of bits to sum at the next logic level # logic levels = # LUTs on critical path in an FPGA

January 22, 200817 Input: Columns of bits to sum Example: 3-tap FIR filter Each FIR filter is different, depending on constants used 0 rank

January 22, 200818 Mapping Heuristic map_algorithm(Integer : M, Integer : N, Array of Integers : columns ) { step1: find_covering_GPCs( ); step2: find_primitive_GPCs( ); step3: order_primitive_GPCs( ); Repeat { step4: Repeat { col_indx = find_highest_column( ); find_next_GPC (col_indx); remove_covered_dots( ); } until all dots are covered or no reasonable GPC is found step5: connect_GPCs_IOs( ); step6: generate_next_stage_dots( ); } until three rows of dots remains; } step7: generate_final_cpa( columns ) Virtex-5 and Stratix II & III support ternary addition Attack the tallest column first (greedy approach)

January 22, 200819 4 3 1 Example 2 Map to ternary adder

January 22, 200821 Experimental Methodology Altera Stratix-II 90nm CMOS Technology Implementations of multi-input addition ADD – Ternary adder tree State of the art for FPGAs 3GD – 3-greedy algorithm (3:2 and 2:2 counters) [Stelling et al., TCOMP 98] 2 and 3-input counters do not map well onto 6-LUTs! GPCs – Heuristic described here

January 22, 200822 Experimental Results (Delay) 27% on average GPC is faster than ADD

January 22, 200823 Experimental results (Area) 5% increase in ALMs usage for GPC compared to ADD

January 22, 200824 Are DSP/MAC Blocks Useful? No! On average, delay using DSP/MAC blocks was more than 2x worse than 3GD

January 22, 200826 Conclusion Conventional wisdom has held that adder trees outperform compressor trees on FPGAs Ternary adder trees were a major selling point of the Altera Stratix II architecture This led to their inclusion in Xilinx Virtex-5 devices Conventional wisdom is wrong! GPCs map nicely onto LUTs Compressor trees on FPGAs, are faster than adder trees when built from GPCs

Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.

Similar presentations

Presentation on theme: "Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.

Similar presentations

Presentation on theme: "Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient."— Presentation transcript:

Similar presentations

About project

Feedback