Efficient Synthesis of Compressor Trees on FPGAs
Philip Brisk (2), Paolo Ienne (2), Hadi Parandeh-Afshar (1,2)
1: University of Tehran, ECE Department
2: EPFL, School of Computer and Communication Sciences

January 22, Outline
  State of the Art: FPGAs
  Motivation
  Generalized Parallel Counters
  Mapping Heuristic
  Experimental Results
  Conclusion

FPGA vs. ASIC
  Performance: ASIC √
  Area Utilization: ASIC √
  Power Consumption: ASIC √
  Flexibility: FPGA √
  Time-to-Market: FPGA √

FPGA Arithmetic Features
  Poor performance for arithmetic operations compared to ASIC
  IP cores
    High routing costs
    Limited flexibility; 18-bit adder/multiplier
  Full adder implemented in CLB structure
    Fast carry chain (Xilinx and Altera)
    Reduces routing delay
    Cannot use compressor trees to add k > 2 values
      Wallace/Dadda/3-greedy

Motivation: Compressor Trees
  Partial product reduction in parallel multiplication
    Wallace and Dadda in the 1960s
  Multi-input addition occurs in many multimedia and signal processing applications
    H.264/AVC variable block size motion estimation
    FIR filters
    3G wireless base station channel cards
  Flow graph transformations expose opportunities to use compressor trees in high-level synthesis [Verma and Ienne, ICCAD 04]
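
Flow Graph Transformation
  [Figure: ADPCM flow graph before and after transformation. Before: vpdiff is accumulated through a chain of shifts (>>) of step, bit tests (&) on delta, selects (SEL), and two-input additions (+). After: the selected terms step >> 3, step >> 2, step >> 1, and step >> 0 feed a single multi-input sum (∑), implemented as a compressor tree, that produces vpdiff.]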
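
The effect of the transformation can be sketched in a few lines of Python. This is only an illustration: it assumes the flow graph is the familiar ADPCM step update, where vpdiff starts at step >> 3 and conditionally accumulates step, step >> 1, and step >> 2 according to bits 4, 2, and 1 of delta. The function names are hypothetical.

```python
def vpdiff_sequential(step: int, delta: int) -> int:
    """Original form: a chain of dependent two-input additions."""
    vpdiff = step >> 3
    if delta & 4:
        vpdiff += step
    if delta & 2:
        vpdiff += step >> 1
    if delta & 1:
        vpdiff += step >> 2
    return vpdiff


def vpdiff_multi_input(step: int, delta: int) -> int:
    """Transformed form: each select yields an operand (or 0), then one 4-input sum."""
    operands = [
        step >> 3,
        (step >> 2) if delta & 1 else 0,
        (step >> 1) if delta & 2 else 0,
        (step >> 0) if delta & 4 else 0,
    ]
    # In hardware this single sum is the candidate for a compressor tree.
    return sum(operands)
```

Both functions compute the same value; the second form exposes four independent operands that can be summed at once instead of through three dependent adders.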

Counters
  m:n counter: m input bits, n output bits, n = ⌈log₂(m + 1)⌉
    Counts the number of input bits set to 1
    Outputs that number as a binary value
  Counters you know
    2:2 counter: half adder
    3:2 counter: full adder (carry-save adder)
  The correct building block for computing sums of k > 2 numbers
  Counters do not map well onto LUTs or carry chains
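
As a behavioral illustration of the definition above, the following sketch models an m:n counter in Python. The function name and the LSB-first output order are choices made for this example only.

```python
from typing import List


def parallel_counter(bits: List[int]) -> List[int]:
    """Behavioral model of an m:n counter.

    Takes m input bits and returns the number of 1s as n output bits,
    least significant bit first, with n = ceil(log2(m + 1)).
    """
    m = len(bits)
    n = m.bit_length()  # equals ceil(log2(m + 1)) for m >= 0
    count = sum(bits)
    return [(count >> i) & 1 for i in range(n)]
```

With three inputs this behaves as a full adder: the first output is the sum bit and the second is the carry.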

Generalized Parallel Counters (GPCs)
  Sum bits having different ranks
    m:n counter: all bits have rank 0, i.e., weight 2^0 = 1
  Representation: (K_{n-1}, K_{n-2}, …, K_0; S)
    K_i: number of input bits of rank i
    S: number of output bits
  (0, 4; 3): typical 4:3 counter
  (2, 3; 3): maximum value: 2* *2^0 = 12
    Range [0, 12] requires S = 4 output bits
  Examples using dot notation
    (3, 3; 4) GPC
    (5, 5; 4) GPC
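
The tuple notation can be made concrete with a short sketch. It assumes the columns are listed from the highest rank down to rank 0, as in the representation above; the helper names are hypothetical.

```python
from typing import Sequence


def gpc_max_value(k: Sequence[int]) -> int:
    """Largest value a GPC can output: every input bit set to 1."""
    n = len(k)
    # k[pos] input bits have rank n - 1 - pos, i.e. weight 2^(n - 1 - pos).
    return sum(count << (n - 1 - pos) for pos, count in enumerate(k))


def gpc_output_bits(k: Sequence[int]) -> int:
    """Number of output bits S needed to represent the range [0, max value]."""
    return gpc_max_value(k).bit_length()


def gpc_evaluate(k: Sequence[int], inputs: Sequence[Sequence[int]]) -> int:
    """Weighted sum of the input bits; inputs[pos] holds the bits of column k[pos]."""
    n = len(k)
    total = 0
    for pos, (count, bits) in enumerate(zip(k, inputs)):
        if len(bits) != count:
            raise ValueError("column size does not match the GPC shape")
        total += sum(bits) << (n - 1 - pos)
    return total
```

For the two dot-notation examples, gpc_output_bits((3, 3)) and gpc_output_bits((5, 5)) both give 4, matching the S in (3, 3; 4) and (5, 5; 4).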

GPC Implementation
  For ASICs
    Basic gates, e.g., AND, XOR
    Built from m:n counters, just like a compressor tree
  FPGA implementation
    A K-input GPC maps nicely onto K-LUTs
    One logic level required
    K = 6 for Xilinx Virtex-5 and Altera Stratix II and III
    Three 6-LUTs for a 6-input, 3-output GPC
    Four 6-LUTs for a 6-input, 4-output GPC
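
To see why one LUT per output bit suffices, the sketch below tabulates each output bit of a GPC over all input combinations. It is a behavioral illustration, not a vendor flow: the function name is hypothetical, and the mapping of truth-table address bits to inputs (highest-rank column first) is an arbitrary convention chosen here, not a device INIT format.

```python
from typing import List, Sequence


def gpc_lut_masks(k: Sequence[int]) -> List[int]:
    """Return one truth-table mask per output bit (LSB first).

    Bit `addr` of a mask is that output's value when the GPC inputs
    equal the binary digits of `addr`.
    """
    n = len(k)
    # One weight per input bit; k[pos] inputs have weight 2^(n - 1 - pos).
    weights = [1 << (n - 1 - pos) for pos, count in enumerate(k) for _ in range(count)]
    num_outputs = sum(weights).bit_length()
    masks = [0] * num_outputs
    for addr in range(1 << len(weights)):
        value = sum(w for i, w in enumerate(weights) if (addr >> i) & 1)
        for j in range(num_outputs):
            if (value >> j) & 1:
                masks[j] |= 1 << addr
    return masks
```

A (3, 3; 4) GPC yields four 64-entry tables, i.e. four 6-LUTs, and a 6-input single-column GPC yields three, in line with the LUT counts on the slide.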

Definitions
  Primitive GPCs: satisfy the given I/O constraints
    12 primitive GPCs for 6 inputs, 3 outputs
    Including (1, 3; 3) and (2, 3; 3)
  Covering GPCs: functionality cannot be implemented by any other GPC, given the I/O constraints
    e.g., a (2, 3; 3) GPC can implement a (1, 3; 3) GPC
    Set a rank-1 input bit to 0, so (1, 3; 3) is not a covering GPC
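
The covering relation can be sketched as a column-wise comparison: one GPC can implement another if it has at least as many inputs in every rank, since the surplus inputs can be tied to 0. The sketch below assumes GPCs with the same number of output bits, written highest rank first; the function names are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]  # (K_{n-1}, ..., K_0)


def covers(a: Shape, b: Shape) -> bool:
    """True if GPC `a` can implement GPC `b` by tying unused inputs to 0."""
    n = max(len(a), len(b))
    # Align both shapes at rank 0 by padding the shorter one with empty high ranks.
    a = (0,) * (n - len(a)) + tuple(a)
    b = (0,) * (n - len(b)) + tuple(b)
    return all(x >= y for x, y in zip(a, b))


def covering_set(gpcs: List[Shape]) -> List[Shape]:
    """Keep only the GPCs that no other GPC in the list can implement."""
    return [g for g in gpcs if not any(h != g and covers(h, g) for h in gpcs)]
```

Applied to a list containing (1, 3) and (2, 3), only (2, 3) survives, which mirrors the example on the slide.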

Definitions
  Unreasonable GPCs
    Single bit in rank-0 column
      (3, 1; 3) GPC: rank-0 output bit = rank-0 input bit
    No reduction in bits
      (1, 2; 3) GPC: 3 input bits, output value in range [0, 4], 3 output bits

Definitions
  Compression Ratio (CR): # input bits / # output bits
    (3, 3; 4) GPC: CR = 6/4 = 1.5
    (2, 3; 3) GPC: CR = 5/3 ≈ 1.67
  Using GPCs with a large CR tends to reduce the number of bits to sum at the next logic level
    # logic levels = # LUTs on the critical path in an FPGA
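
The two definitions above, unreasonable GPCs and compression ratio, translate directly into small predicates. The sketch below is illustrative; the function names are hypothetical, and sorting a library by decreasing CR is an assumption about how GPCs might be prioritized, not something the slides specify.

```python
from fractions import Fraction
from typing import Sequence


def compression_ratio(k: Sequence[int], s: int) -> Fraction:
    """CR = number of input bits / number of output bits."""
    return Fraction(sum(k), s)


def is_unreasonable(k: Sequence[int], s: int) -> bool:
    """A GPC is unreasonable if its rank-0 column holds a single bit
    (that bit just passes through) or if it does not reduce the bit count."""
    return k[-1] == 1 or sum(k) <= s


# Example: rank a small, hypothetical library by decreasing compression ratio.
library = [((3, 3), 4), ((2, 3), 3), ((0, 6), 3)]
ranked = sorted(library, key=lambda g: compression_ratio(*g), reverse=True)
```

Using exact fractions avoids rounding when two ratios are close.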

Input: Columns of Bits to Sum
  Example: 3-tap FIR filter
  Each FIR filter is different, depending on the constants used
  [Figure: dot diagram of the input columns, indexed by rank starting at rank 0]

Mapping Heuristic
  map_algorithm(Integer : M, Integer : N, Array of Integers : columns)
  {
    step1: find_covering_GPCs( );
    step2: find_primitive_GPCs( );
    step3: order_primitive_GPCs( );
    Repeat {
      step4: Repeat {
        col_indx = find_highest_column( );
        find_next_GPC(col_indx);
        remove_covered_dots( );
      } until all dots are covered or no reasonable GPC is found
      step5: connect_GPCs_IOs( );
      step6: generate_next_stage_dots( );
    } until at most three rows of dots remain;
    step7: generate_final_cpa(columns);
  }
  Attack the tallest column first (greedy approach)
  Stop at three rows: Virtex-5 and Stratix II & III support ternary addition
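
A much simplified, runnable sketch of the greedy loop is given below. It works on dot counts per column only and is not the authors' implementation: the GPC library is a small hypothetical one, the choice among GPCs is made by the largest net reduction in dots (an assumption), and the wiring and final adder generation (steps 5 and 7) are omitted.

```python
from typing import List, Tuple

GPC = Tuple[Tuple[int, ...], int]  # ((K_{n-1}, ..., K_0), S), as on the slides

# Hypothetical library of GPCs with at most 6 inputs (an assumption for this sketch).
LIBRARY: List[GPC] = [((0, 6), 3), ((1, 5), 3), ((2, 3), 3), ((3, 3), 4)]


def compress_stage(columns: List[int]) -> Tuple[List[int], List[Tuple[int, GPC]]]:
    """Cover one stage of dots with GPCs; return next-stage columns and placements."""
    remaining = list(columns)          # uncovered dots per rank, rank 0 first
    next_cols = [0] * (len(columns) + 4)
    placed: List[Tuple[int, GPC]] = []
    while True:
        # Greedy choice: attack the tallest column first.
        col = max(range(len(remaining)), key=remaining.__getitem__)
        best = None
        for gpc in LIBRARY:
            k, s = gpc
            # Dots this GPC would cover if its rank-0 inputs sit on `col`.
            taken = [min(cnt, remaining[col + r]) if col + r < len(remaining) else 0
                     for r, cnt in enumerate(reversed(k))]
            gain = sum(taken) - s      # net reduction in dots
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, gpc, taken)
        if best is None:               # no GPC reduces the dot count any further
            break
        _, gpc, taken = best
        for r, t in enumerate(taken):  # remove the covered dots
            if t:
                remaining[col + r] -= t
        for j in range(gpc[1]):        # each output bit becomes a next-stage dot
            next_cols[col + j] += 1
        placed.append((col, gpc))
    for i, left in enumerate(remaining):   # uncovered dots pass through unchanged
        next_cols[i] += left
    while len(next_cols) > 1 and next_cols[-1] == 0:
        next_cols.pop()
    return next_cols, placed


def map_compressor_tree(columns: List[int], max_rows: int = 3):
    """Repeat stages until at most `max_rows` rows remain for the final adder."""
    stages = []
    while max(columns) > max_rows:
        columns, placed = compress_stage(columns)
        stages.append(placed)
    return columns, stages
```

A call such as map_compressor_tree([8, 8, 8, 8]) returns the column heights left for the final ternary adder together with the GPCs placed in each stage. The loop terminates because any column taller than three dots can be reduced by the single-column 6-input GPC in this library.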

Example 2
  Map to ternary adder

Experimental Methodology
  Altera Stratix II
    90 nm CMOS technology
  Implementations of multi-input addition
    ADD: ternary adder tree
      State of the art for FPGAs
    3GD: 3-greedy algorithm (3:2 and 2:2 counters) [Stelling et al., TCOMP 98]
      2- and 3-input counters do not map well onto 6-LUTs!
    GPCs: heuristic described here

Experimental Results (Delay)
  On average, GPC is 27% faster than ADD

Experimental Results (Area)
  5% increase in ALM usage for GPC compared to ADD

Are DSP/MAC Blocks Useful?
  No!
  On average, delay using DSP/MAC blocks was more than 2x worse than 3GD

Conclusion
  Conventional wisdom has held that adder trees outperform compressor trees on FPGAs
    Ternary adder trees were a major selling point of the Altera Stratix II architecture
    This led to their inclusion in Xilinx Virtex-5 devices
  Conventional wisdom is wrong!
    GPCs map nicely onto LUTs
    Compressor trees on FPGAs are faster than adder trees when built from GPCs