Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA Alessandro Cevrero 1,2 Panagiotis.

Presentation transcript:

Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA
Alessandro Cevrero 1,2, Panagiotis Athanasopoulos 1,2, Hadi Parandeh-Afshar 2, Paolo Ienne 2, Yusuf Leblebici 1, Ajay K. Verma 2, Philip Brisk 2, Frank K. Gurkaynak
ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008

Motivation and Contribution. Goal: improve FPGA performance for arithmetic circuits. Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]: a programmable IP core to accelerate compressor trees, used in a hybrid FPGA/FPCA device. Contributions: a completely new FPCA architecture; reduced routing delay; more flexibility and better mapping; a simplified integration process.

FPGA Commentary. Logic cells have dedicated addition circuitry and fast carry chains, with support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]. Parallel accumulation uses adder trees, whereas ASIC designers use compressor trees. Compressor tree synthesis on FPGAs via generalized parallel counter (GPC) mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008] is faster than ternary adder trees. IP cores: DSP48, BlockRAM, etc. [Xilinx, Altera]; FP cores [Beauchamp et al., TVLSI 2008]. Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007].
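As a rough illustration of the counters that GPC mapping builds on, the sketch below models a generalized parallel counter in Python. The function name `gpc` and the input layout (one list of bits per column weight) are assumptions made for this example; they are not taken from the cited papers.

```python
from typing import List, Sequence


def gpc(columns: Sequence[Sequence[int]]) -> List[int]:
    """Behavioral model of a generalized parallel counter (GPC).

    columns[i] holds the input bits of weight 2**i (hypothetical layout).
    Returns the output bits of the weighted sum, least significant first.
    """
    # Weighted sum of all input bits.
    total = sum(sum(bits) << i for i, bits in enumerate(columns))
    # Largest value the inputs can reach; this fixes the output width.
    max_total = sum(len(bits) << i for i, bits in enumerate(columns))
    width = max(1, max_total.bit_length())
    return [(total >> k) & 1 for k in range(width)]


# A (3;2) counter is a full adder: three bits in, a 2-bit count out.
full_adder_out = gpc([[1, 1, 1]])
# A (6;3) counter: six bits of equal weight, a 3-bit count out.
six_three_out = gpc([[1, 1, 0, 1, 1, 1]])
```

A compressor tree chains many such counters so that no carry has to ripple until the final addition.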

Methodology and Solution
1. Transform the circuit to merge disparate addition and multiplication operations and expose compressor trees [Verma and Ienne, ICCAD 2004].
2. Synthesize the compressor tree onto the FPCA [Brisk et al., DAC 2007].
3. Map everything else onto the traditional FPGA (standard approach).
4. Integrate FPGA+FPCA onto the same die (ongoing research at EPFL).
FPCA: programmable compressor tree.
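To make the term "compressor tree" concrete, here is a minimal Python sketch of multi-operand addition with carry-save (3:2) compression followed by one final carry-propagate addition. It is a generic textbook construction rather than the FPCA's actual structure, and the function names are chosen only for this example. Operands are assumed to be non-negative integers.

```python
from typing import Iterable, Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 compression: three operands in, a sum word and a carry word out."""
    s = a ^ b ^ c                                 # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # carries move one column left
    return s, carry


def compressor_tree_sum(operands: Iterable[int]) -> int:
    """Reduce many operands to two with 3:2 compressors, then add once."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        # Compress each complete group of three operands into two.
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        # Operands left over (zero, one or two) pass to the next level.
        nxt.extend(ops[len(ops) - len(ops) % 3:])
        ops = nxt
    # Final carry-propagate addition (CPA) of the remaining words.
    return sum(ops)


tree_result = compressor_tree_sum([3, 5, 7, 9, 11])
```

Only the last step propagates carries across the full word width, which is why compressor trees tend to be faster than trees of ordinary adders.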

Previous Work. Initial FPCA architecture [Brisk et al., DAC 2007]. Routing network delay: the performance bottleneck. Poor area utilization: many resources unused; large counters implement the functionality of smaller counters. "Pitch matching" problem: FPCA routing channels must align with FPGA routing channels, which leads to unnecessarily large counters.

Recurring Patterns in Compressor Tree Synthesis. New FPCA architecture: the Counter Slice (CSlice). Each CSlice compresses one column at a time and propagates carry bits to neighboring CSlices. This eliminates the FPGA-style routing network: there is no routing delay between counters, and the pitch matching problem disappears.
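The column-at-a-time idea can be sketched behaviorally in Python. This is a loose model under stated assumptions, not the CSlice circuit: each column taller than two bits is counted by a single counter of unlimited input size, and the count's output bits are handed to the columns of matching weight. The names `compress_stage`, `compress`, and `column_value` are hypothetical.

```python
from typing import List, Tuple

Columns = List[List[int]]  # columns[i] holds the bits of weight 2**i


def column_value(columns: Columns) -> int:
    """Integer value represented by a set of bit columns."""
    return sum(sum(bits) << i for i, bits in enumerate(columns))


def compress_stage(columns: Columns) -> Columns:
    """One stage: count every column taller than two bits."""
    out: Columns = []

    def put(pos: int, bit: int) -> None:
        while len(out) <= pos:
            out.append([])
        out[pos].append(bit)

    for i, bits in enumerate(columns):
        if len(bits) <= 2:
            for b in bits:          # short columns pass through unchanged
                put(i, b)
            continue
        count = sum(bits)
        # The count's k-th output bit has weight 2**(i + k), so it is
        # carried to the k-th neighboring column.
        for k in range(len(bits).bit_length()):
            put(i + k, (count >> k) & 1)
    return out


def compress(columns: Columns) -> Tuple[Columns, int]:
    """Repeat stages until every column holds at most two bits."""
    stages = 0
    while any(len(bits) > 2 for bits in columns):
        columns = compress_stage(columns)
        stages += 1
    return columns, stages


example_columns = [[1, 1, 1, 1], [1, 0, 1], [1, 1, 1, 1, 1]]
reduced, num_stages = compress(example_columns)
```

Once every column holds at most two bits, a single carry-propagate adder finishes the sum.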

FPCA v2.0 Area Utilization. Figure labels: CSlice Architecture; Configurable GPC.

FPCA V2.0 Mapping Heuristic. FPCA synthesis heuristic: map columns of input bits onto the FPCA; minimize the height of the compressor tree; avoid vertical configurations when possible. Figure labels: Multi-FPCA Configurations (Horizontal, Vertical); Routing Delay.
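The slide does not spell out the heuristic itself, so the following Python sketch only illustrates the quantity it tries to minimize: the height (number of stages) of a compressor tree, computed from column heights alone. The grouping rule and the parameter `max_inputs` (the largest counter available per column) are assumptions for this example, not the authors' algorithm.

```python
from typing import Dict, List


def next_heights(heights: List[int], max_inputs: int) -> List[int]:
    """Column heights after one stage of counters with limited input size."""
    out: Dict[int, int] = {}
    for i, h in enumerate(heights):
        remaining = h
        while remaining > 0:
            group = min(remaining, max_inputs)
            remaining -= group
            if group <= 2:
                # Too few bits to compress: pass them through.
                out[i] = out.get(i, 0) + group
            else:
                # A counter with `group` inputs emits one bit per output
                # weight, spread over this column and its neighbors.
                for k in range(group.bit_length()):
                    out[i + k] = out.get(i + k, 0) + 1
    width = max(out) + 1 if out else 0
    return [out.get(i, 0) for i in range(width)]


def tree_height(heights: List[int], max_inputs: int) -> int:
    """Stages needed until every column holds at most two bits."""
    if max_inputs < 3:
        raise ValueError("counters need at least three inputs to compress")
    stages = 0
    while any(h > 2 for h in heights):
        heights = next_heights(heights, max_inputs)
        stages += 1
    return stages


height_small_counters = tree_height([9], max_inputs=3)
height_large_counters = tree_height([9], max_inputs=16)
```

In this toy model, a nine-bit column needs two stages with 3-input counters but only one with a 16-input counter, which shows how larger counters can reduce tree height.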

CSlice Synthesis. CSlice V2.0: rank-3 with 16 input bits per CSlice; 90nm Artisan standard cell library. Table (values missing from this transcript): area [µm²], delay [ns], and CPA delay [ns] for rank-1, rank-2, and rank-3 CSlices. FPCA synthesis: rank-3 CSlices used in experiments; 8 CSlices per FPCA, similar to the dimensions of a DSP block in current FPGAs, which simplifies the integration process. DFFs store the configuration bitstream. Semi-custom design; standard cells are predominant.

FPCA Delay Extraction. Methodology: each FPCA instance is replaced with an F* instance (same I/O); the delay between F* instances is extracted and combined with the combinational delay extracted for the FPCA. Define a pre-placed soft IP core, F*, with the same dimensions and I/O as the FPCA. Replace all sum operations with F*, map onto a Stratix II FPGA, and extract the critical path delay. Map the compressor tree onto the FPCA (configuration DFF values are set to constants; not optimized) and measure its critical path delay. For each compressor tree in the circuit, subtract the delay of F* and add the FPCA delay. Figure labels: Input Pins, Output Pins, SUM, F*, FPCA.
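The final correction step (subtract the F* delay, add the FPCA delay) is simple arithmetic, sketched below in Python. The function name, the argument layout, and all delay values are hypothetical and serve only to show the calculation; the sketch also assumes that only compressor trees lying on the critical path are corrected.

```python
from typing import Sequence


def corrected_critical_path(
    path_delay_with_fstar: float,
    fstar_delays: Sequence[float],
    fpca_delays: Sequence[float],
) -> float:
    """Estimate the hybrid FPGA/FPCA critical path delay (all values in ns).

    path_delay_with_fstar: critical path of the FPGA mapping in which each
        compressor tree is replaced by the placeholder core F*.
    fstar_delays / fpca_delays: per compressor tree on that path, the delay
        of the F* placeholder and of the tree mapped onto the FPCA.
    """
    if len(fstar_delays) != len(fpca_delays):
        raise ValueError("need one F* delay and one FPCA delay per tree")
    delay = path_delay_with_fstar
    for fstar, fpca in zip(fstar_delays, fpca_delays):
        delay = delay - fstar + fpca   # swap placeholder delay for FPCA delay
    return delay


# Hypothetical numbers, chosen only to illustrate the arithmetic.
estimated_delay = corrected_critical_path(12.0, [2.5, 1.5], [1.0, 0.75])
```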

Experimental Results. Comparison: GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008] versus FPCA mapping (6 FPCAs per device). Speedup of FPCA mapping over GPC mapping: 2.40x maximum, 1.60x average.

Conclusion. New FPCA architecture: hardwired connections between counters; counters of multiple sizes organized into CSlices; carry chains between CSlices. Average/maximum speedups of 1.60x/2.40x compared to GPC mapping. Future Work. Add pipeline registers to the FPCA: this increases latency but also clock frequency and throughput. Demonstrator chip taped out in October 2007 and returned from the foundry in January 2008; PCBs ready next week. Measure power consumption, clock frequency, I/O interface, etc.

Demonstrator Chip