Efficient Synthesis of Compressor Trees on FPGAs
Philip Brisk (2), Paolo Ienne (2), Hadi Parandeh-Afshar (1, 2)
1: University of Tehran, ECE Department
2: EPFL, School of Computer and Communication Sciences

January 22, Outline
- State of the Art: FPGAs
- Motivation
- Generalized Parallel Counters
- Mapping Heuristic
- Experimental Results
- Conclusion

FPGA vs. ASIC
- Columns: ASIC, FPGA
- Rows (one √ each): Performance, Area Utilization, Power Consumption, Flexibility, Time-to-Market

FPGA Arithmetic Features
- Poor performance for arithmetic operations compared to ASICs
- IP cores
  - High routing costs
  - Limited flexibility; 18-bit adder/multiplier
- Full adder implemented in the CLB structure
- Fast carry chain (Xilinx and Altera)
  - Reduces routing delay
  - Cannot use compressor trees (Wallace/Dadda/3-greedy) to add k > 2 values

Motivation: Compressor Trees
- Partial product reduction in parallel multiplication
  - Wallace and Dadda in the 1960s
- Multi-input addition occurs in many multimedia and signal processing applications
  - H.264/AVC variable block size motion estimation
  - FIR filters
  - 3G wireless base station channel cards
- Flow graph transformations expose opportunities to use compressor trees in high-level synthesis [Verma and Ienne, ICCAD 04]

Flow Graph Transformation
[Figure: ADPCM example. A flow graph of shift (>>), mask (&), compare (=), select (SEL), and add (+) operations that computes vpdiff from step and delta, next to its transformed version in which the additions are merged into a compressor tree (∑).]

Counters
- m:n counter, n = ⌈log2(m + 1)⌉
  - Counts the number of input bits set to 1
  - Outputs the count as a binary value
- Counters you know
  - 2:2 – half adder
  - 3:2 – full adder (carry-save adder)
- The correct building block for computing sums of k > 2 numbers
- Counters do not map well onto LUTs or carry chains
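The counter definition above can be illustrated with a minimal Python sketch. The function names `counter_output_bits` and `counter` are hypothetical, chosen here for illustration; bits are modelled as the integers 0 and 1.

```python
def counter_output_bits(m: int) -> int:
    """Output width n of an m:n counter.

    For m >= 1, m.bit_length() equals ceil(log2(m + 1)), the number of
    bits needed to represent every count in the range [0, m].
    """
    return m.bit_length()


def counter(bits):
    """Count the input bits set to 1 and return the count as output
    bits, least significant bit first."""
    n = counter_output_bits(len(bits))
    total = sum(bits)
    return [(total >> i) & 1 for i in range(n)]


# A full adder is a 3:2 counter: three input bits, two output bits.
print(counter([1, 1, 1]))  # sum bit first, then carry bit
```

With two inputs the same function behaves as a half adder (2:2 counter).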

Generalized Parallel Counters (GPCs)
- Sum bits having different ranks
  - m:n counter: all bits have rank 0, i.e., weight 2^0 = 1
- Representation: (K_{n-1}, K_{n-2}, …, K_0; S)
  - K_i – number of input bits of rank i
  - S – number of output bits
- (0, 4; 3) – typical 4:3 counter
- (2, 3; 3) – two bits of rank 1, three bits of rank 0, three output bits
- Maximum value: the sum of K_i * 2^i over all ranks; a maximum of 12 gives the range [0, 12], which requires S = 4 output bits
- Examples using dot notation: (3, 3; 4) GPC, (5, 5; 4) GPC
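As a rough illustration of the notation, the sketch below computes the maximum value and the required output width of a GPC and evaluates one on concrete input bits. The function names are hypothetical; `ks` lists the input counts from the highest rank down to rank 0, mirroring the (K_{n-1}, …, K_0; S) notation.

```python
def gpc_max_value(ks):
    """Largest sum a GPC can produce: every input bit set to 1."""
    top = len(ks) - 1
    return sum(k << (top - pos) for pos, k in enumerate(ks))


def gpc_output_bits(ks):
    """Smallest S that can represent the range [0, gpc_max_value(ks)]."""
    return max(1, gpc_max_value(ks).bit_length())


def gpc_eval(ks, columns):
    """Weighted sum of the input bits.

    columns[pos] holds the bits whose rank matches ks[pos], so the
    first column is the highest rank.
    """
    top = len(ks) - 1
    assert all(len(col) == k for col, k in zip(columns, ks))
    return sum(sum(col) << (top - pos) for pos, col in enumerate(columns))


print(gpc_max_value((3, 3)), gpc_output_bits((3, 3)))  # (3, 3; 4) GPC
print(gpc_eval((2, 3), [[1, 0], [1, 1, 1]]))           # 1*2 + 3*1
```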

GPC Implementation
- For ASICs
  - Basic gates, e.g., AND, XOR
  - Built from m:n counters, just like a compressor tree
- FPGA implementation
  - K-input GPC maps nicely onto K-LUTs
  - One logic level required
  - K = 6 for Xilinx Virtex-5 and Altera Stratix II and III
  - Three 6-LUTs for a 6-input, 3-output GPC
  - Four 6-LUTs for a 6-input, 4-output GPC
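One way to see why a K-input GPC needs one K-LUT per output bit is to tabulate each output bit over all 2^K input combinations. The sketch below does this in Python; `gpc_lut_tables` is a hypothetical helper, and the input ordering inside the truth table is an arbitrary choice made for illustration, not a vendor LUT format.

```python
from itertools import product


def gpc_lut_tables(ks, s):
    """Return s truth tables, one per output bit (LSB first).

    ks lists input counts from the highest rank down to rank 0. Each
    table has 2**K entries, where K = sum(ks) is the number of inputs.
    """
    top = len(ks) - 1
    # Weight of every individual input bit, highest rank first.
    weights = [1 << (top - pos) for pos, k in enumerate(ks) for _ in range(k)]
    tables = [[] for _ in range(s)]
    for inputs in product((0, 1), repeat=len(weights)):
        value = sum(w * b for w, b in zip(weights, inputs))
        for bit in range(s):
            tables[bit].append((value >> bit) & 1)
    return tables


tables = gpc_lut_tables((3, 3), 4)    # 6 inputs, 4 outputs
print(len(tables), len(tables[0]))    # four tables of 64 entries each
```

Each table has 64 entries for a 6-input GPC, which is the capacity of a single 6-LUT, so the (3, 3; 4) GPC takes four LUTs and a single logic level.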

Definitions
- Primitive GPCs: satisfy the given I/O constraints
  - 12 primitive GPCs for 6 inputs, 3 outputs
  - Including (1, 3; 3), (2, 3; 3)
- Covering GPCs: functionality cannot be implemented by other GPCs, given the I/O constraints
  - e.g., a (2, 3; 3) GPC can implement a (1, 3; 3) GPC → set a rank-1 input bit to 0
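The "can implement" relation in the example above can be sketched as a simple comparison of input counts per rank. This is an assumed simplification for illustration: it only considers tying unused inputs to 0, and the names `can_implement` and `covering_gpcs` are hypothetical. Each GPC is written as `(ks, s)` with `ks` from the highest rank down to rank 0.

```python
def can_implement(big, small):
    """True if `big` realises `small` by tying its unused inputs to 0."""
    big_ks, big_s = big
    small_ks, small_s = small
    if len(small_ks) > len(big_ks) or small_s > big_s:
        return False
    # Align both tuples on rank 0 by padding with empty high ranks.
    padded = (0,) * (len(big_ks) - len(small_ks)) + tuple(small_ks)
    return all(s <= b for s, b in zip(padded, big_ks))


def covering_gpcs(gpcs):
    """Keep only the GPCs that no other GPC in the list can implement."""
    return [g for g in gpcs
            if not any(h != g and can_implement(h, g) for h in gpcs)]


library = [((2, 3), 3), ((1, 3), 3)]
print(covering_gpcs(library))  # the (1, 3; 3) GPC is dropped
```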

Definitions
- Unreasonable GPCs
  - Single bit in the rank-0 column
    - (3, 1; 3) GPC → rank-0 output bit = rank-0 input bit
  - No reduction in bits
    - (1, 2; 3) GPC: 3 input bits, output value in range [0, 4], 3 output bits
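The two criteria above translate directly into a small predicate. This is a sketch under the assumption that these are the only two tests; `is_reasonable` is a hypothetical name, and `ks` lists input counts from the highest rank down to rank 0.

```python
def is_reasonable(ks, s):
    """Reject GPCs that do no useful compression.

    ks: input counts from the highest rank down to rank 0.
    s:  number of output bits.
    """
    # A lone rank-0 input simply passes through to the rank-0 output.
    single_rank0_bit = ks[-1] == 1
    # As many outputs as inputs: the number of bits is not reduced.
    no_reduction = sum(ks) <= s
    return not (single_rank0_bit or no_reduction)


print(is_reasonable((3, 1), 3))  # single bit in the rank-0 column
print(is_reasonable((1, 2), 3))  # 3 inputs, 3 outputs
print(is_reasonable((2, 3), 3))  # 5 inputs, 3 outputs
```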

Definitions
- Compression ratio (CR): # input bits / # output bits
  - (3, 3; 4) GPC: CR = 6/4 = 1.5
  - (2, 3; 3) GPC: CR = 5/3 ≈ 1.67
- Using GPCs with large CR tends to reduce the number of bits to sum at the next logic level
  - # logic levels = # LUTs on the critical path in an FPGA
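The compression ratio is easy to compute exactly with rational arithmetic. The sketch below also sorts a GPC list by decreasing CR; using CR as the ordering criterion is an assumption made here for illustration, and both function names are hypothetical.

```python
from fractions import Fraction


def compression_ratio(ks, s):
    """CR = number of input bits / number of output bits."""
    return Fraction(sum(ks), s)


def rank_by_cr(gpcs):
    """Sort (ks, s) pairs so the highest compression ratio comes first."""
    return sorted(gpcs, key=lambda g: compression_ratio(*g), reverse=True)


print(float(compression_ratio((3, 3), 4)))            # 6/4
print(round(float(compression_ratio((2, 3), 3)), 2))  # 5/3
print(rank_by_cr([((3, 3), 4), ((2, 3), 3)]))
```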

Input: Columns of Bits to Sum
- Example: 3-tap FIR filter
- Each FIR filter is different, depending on the constants used
[Figure: dot diagram of the input columns, indexed by rank starting at 0.]

Mapping Heuristic

map_algorithm(Integer : M, Integer : N, Array of Integers : columns)
{
    step1: find_covering_GPCs( );
    step2: find_primitive_GPCs( );
    step3: order_primitive_GPCs( );
    Repeat {
        step4: Repeat {
            col_indx = find_highest_column( );
            find_next_GPC(col_indx);
            remove_covered_dots( );
        } until all dots are covered or no reasonable GPC is found
        step5: connect_GPCs_IOs( );
        step6: generate_next_stage_dots( );
    } until three rows of dots remain;
    step7: generate_final_cpa(columns);
}

- Virtex-5 and Stratix II & III support ternary addition
- Attack the tallest column first (greedy approach)
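The greedy loop of steps 4 to 6 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the GPC library `GPC_LIBRARY` is a hypothetical hand-picked set, a GPC is chosen by the largest reduction in dots rather than by a precomputed ordering, and the final ternary addition (step 7) is left out. Each column holds actual bit values so that the total can be checked.

```python
# Hypothetical library of (ks, s); ks runs from the highest rank to rank 0.
GPC_LIBRARY = [((6,), 3), ((1, 5), 3), ((2, 3), 3), ((3, 3), 4), ((3,), 2)]


def compress(columns, library=GPC_LIBRARY, target_height=3):
    """Reduce columns of bits (columns[r] = bits of rank r) until no
    column holds more than target_height bits. Returns the new columns
    and the number of GPC stages (logic levels) used."""
    columns = [list(c) for c in columns]
    stages = 0
    while max(len(c) for c in columns) > target_height:
        remaining = [list(c) for c in columns]
        next_stage = [[] for _ in range(len(columns) + 4)]
        placed = 0
        while True:
            # Attack the tallest column that still has uncovered dots.
            base = max(range(len(remaining)), key=lambda r: len(remaining[r]))
            best = None
            for ks, s in library:
                # GPC rank j takes up to ks[-1 - j] dots from column base + j.
                take = [min(k, len(remaining[base + j]))
                        if base + j < len(remaining) else 0
                        for j, k in enumerate(reversed(ks))]
                gain = sum(take) - s  # dots removed minus dots produced
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, take, s)
            if best is None:
                break  # no GPC reduces the dot count any further
            _, take, s = best
            value = 0
            for j, count in enumerate(take):
                for _ in range(count):
                    value += remaining[base + j].pop() << j
            # The GPC outputs become dots of the next stage.
            for bit in range(s):
                next_stage[base + bit].append((value >> bit) & 1)
            placed += 1
        # Uncovered dots pass through unchanged.
        for r, col in enumerate(remaining):
            next_stage[r].extend(col)
        if placed == 0:
            break
        while len(next_stage) > 1 and not next_stage[-1]:
            next_stage.pop()
        columns = next_stage
        stages += 1
    return columns, stages


cols = [[1] * 5, [1, 0] * 4 + [1], [1] * 12, [0, 1, 1, 0, 1, 1, 1], [1] * 4]
out, stages = compress(cols)
print(stages, [len(c) for c in out])
```

The remaining rows, at most three, would then go to a ternary carry-propagate adder. Because every GPC replaces its covered dots with the binary encoding of their weighted sum, the weighted total of all bits is the same before and after compression.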

Example 2
- Map to ternary adder

Experimental Methodology
- Altera Stratix II, 90 nm CMOS technology
- Implementations of multi-input addition
  - ADD – ternary adder tree; state of the art for FPGAs
  - 3GD – 3-greedy algorithm (3:2 and 2:2 counters) [Stelling et al., TCOMP 98]
    - 2- and 3-input counters do not map well onto 6-LUTs!
  - GPCs – heuristic described here

Experimental Results (Delay)
- GPC is 27% faster than ADD on average

Experimental Results (Area)
- 5% increase in ALM usage for GPC compared to ADD

Are DSP/MAC Blocks Useful?
- No! On average, delay using DSP/MAC blocks was more than 2x worse than 3GD

Conclusion
- Conventional wisdom has held that adder trees outperform compressor trees on FPGAs
  - Ternary adder trees were a major selling point of the Altera Stratix II architecture
  - This led to their inclusion in Xilinx Virtex-5 devices
- Conventional wisdom is wrong!
  - GPCs map nicely onto LUTs
  - Compressor trees on FPGAs are faster than adder trees when built from GPCs