Architecture-Specific Packing for Virtex-5 FPGAs

Architecture-Specific Packing for Virtex-5 FPGAs Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal February 25th, 2008

Overview Virtex-5 6-LUT Packing Virtex-5 DSP and Block RAM Packing Results Summary

Simplified FPGA Logic Element
[Figure: a 4-LUT with inputs A1-A4 and output O4, paired with a flip-flop (FF).]

Simplified FPGA Logic Block
[Figure: 4-LUT/FF logic elements connected to the general interconnect.]

Virtex-5 Logic Block
[Figure: a CLB containing two SLICEs of 6-LUT/FF pairs, each SLICE connected to the general interconnect.]

Dual-Output 6-LUT
[Figure: a 6-LUT with inputs A1-A6 and two outputs, O6 and O5.]

Dual-Output 6-LUT Usage

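Dual-Output Packing
[Figure: two small functions, X Logic (inputs x, y) and Y Logic (inputs a, b). Placed in separate 6-LUTs, the number of 6-LUTs used is 2. Packed into one dual-output 6-LUT (A6 tied to VCC, one function on O6 and the other on the 5-LUT output O5), the number of 6-LUTs used is 1.]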
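The packing condition in the figure can be sketched as a simple feasibility check. In dual-output mode the two 5-LUTs share the A1-A5 inputs, so two functions fit in one 6-LUT only if together they use at most five distinct signals. The function name `can_share_lut6` and the set-of-names netlist representation are assumptions made for illustration, not part of the original slides.

```python
def can_share_lut6(inputs_x, inputs_y, max_shared_inputs=5):
    """Return True if two functions can use the O6 and O5 outputs of one 6-LUT.

    Both outputs see the same A1-A5 pins (A6 is tied high), so the union of
    the two input sets must fit in five pins. Hypothetical helper.
    """
    distinct_inputs = set(inputs_x) | set(inputs_y)
    return len(distinct_inputs) <= max_shared_inputs


# The slide's example: X depends on (x, y) and Y depends on (a, b).
print(can_share_lut6({"x", "y"}, {"a", "b"}))            # four distinct inputs
# Two 3-input functions with no shared signals need six pins.
print(can_share_lut6({"a", "b", "c"}, {"d", "e", "f"}))
# Shared signals count once, so these two 4-input functions still fit.
print(can_share_lut6({"a", "b", "c", "d"}, {"b", "c", "d", "e"}))
```

This only checks whether a merge is legal; whether it is worthwhile is a timing question, which is what the placer-based approach addresses.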

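Virtex-5 LUT/FF Pair
[Figure: a 6-LUT with O6 and O5 outputs together with the carry logic (CY, XOR, CIN), the F7 mux, the AX input, the flip-flop (FF), and the A, AMUX and AQ pins.]

Dual-Output Packing Tradeoff
[Figure: two 6-LUTs with their O5 and O6 outputs, the AX input, the F7 mux and the flip-flop (FF).]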

Dual-Output Packing in Placer
Goal: reduce area without a performance hit.
Packing can be done pre-placement, but it will be sub-optimal without delay estimates.
Use the delay estimates available during placement to make good decisions on when to merge two LUTs.
Approach: allow the second 5-LUT to be used when the performance impact is small, and incorporate LUT packing into the placer’s cost function.

Placer Cost Function
Previous cost function: Cost = a * W + b * T
W: wirelength cost
T: timing performance cost
Extend the cost function with two new terms, one based on 6-LUT utilization (L) and one based on SLICE utilization (S):
Cost = a * W + b * T + c * L + d * S
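A minimal sketch of how the extended cost function could steer a merge decision. The slides do not give the weights or the exact form of L and S, so every number below, the `Weights` class, and the idea of comparing the cost before and after a candidate merge are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    a: float  # wirelength weight
    b: float  # timing weight
    c: float  # 6-LUT utilization weight
    d: float  # SLICE utilization weight


def placer_cost(w, W, T, L=0.0, S=0.0):
    """Cost = a*W + b*T + c*L + d*S, as on the slide."""
    return w.a * W + w.b * T + w.c * L + w.d * S


# Hypothetical move: merging two LUTs lengthens wires and hurts timing a
# little (W and T go up) but frees one 6-LUT (L goes down).
before = dict(W=100.0, T=50.0, L=10.0, S=5.0)
after = dict(W=102.0, T=51.0, L=9.0, S=5.0)

area_light = Weights(a=1.0, b=1.0, c=2.0, d=2.0)
area_heavy = Weights(a=1.0, b=1.0, c=4.0, d=2.0)

for name, w in (("area_light", area_light), ("area_heavy", area_heavy)):
    delta = placer_cost(w, **after) - placer_cost(w, **before)
    print(name, "delta cost:", delta, "accept" if delta < 0 else "reject")
```

With the lighter area weight the move costs more than it saves and is rejected; raising c makes the same move attractive, which is the tradeoff the two new terms expose.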

6-LUT Utilization Term
L is computed over all the used 6-LUT slots.

SLICE Utilization Term
S is computed over all the available SLICEs.
Let: N_i = number of used 5-LUTs in SLICE i (at most 8)
$S = \sum_{i=0}^{m} S_i$

Performance Recovery
It is helpful to prohibit packing in certain cases for performance reasons: other used elements in a SLICE may block the “good” path from the O5 output to the external interconnect.

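Performance Recovery: XOR
[Figure: SLICE datapath showing LUT6 (O5, O6), the carry logic (CY, XOR, CIN), F7, AX, the flip-flop (FF), and the A, AMUX and AQ pins.]

Performance Recovery: F7
[Figure: the same SLICE datapath, highlighting the F7 mux.]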

6-LUT Reduction: 5.5%

SLICE Reduction: 10.23%

Performance Results: 3.3% performance degradation

Overview Virtex-5 6-LUT Packing Virtex-5 DSP and Block RAM Packing Summary

New Type of Packing Problem
Traditionally, packing is considered to be a problem of just LUTs and flops.
However, Virtex-5 contains large IP blocks that present their own packing problem.

Virtex-5 Block RAMs
A 36 Kbit block RAM tile can store:
a) a single 36 Kb RAM
b) two independent 18 Kb RAMs
Block RAM has a configurable “aspect ratio”: an 18 Kb RAM can be configured as 16K x 1, 8K x 2, 2K x 9, or 1K x 18.
Tools decide which independent 18 Kb block RAMs to locate in which tile.
[Figure: a tile holding one 36 Kb RAM, and a tile holding two 18 Kb RAMs.]
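A short worked example of the aspect ratios and of tile occupancy. The depth-by-width pairs are the ones listed on the slide; the helper names and the tile-counting rule (two 18 Kb RAMs per tile, one 36 Kb RAM per tile, ideal pairing) are a simplified sketch, not the tools' actual algorithm.

```python
import math

# (depth, width) configurations of an 18 Kb RAM listed on the slide.
ASPECT_RATIOS_18KB = [(16 * 1024, 1), (8 * 1024, 2), (2 * 1024, 9), (1024, 18)]


def capacity_bits(depth, width):
    """Number of bits addressed by a depth x width configuration."""
    return depth * width


def tiles_needed(num_18kb, num_36kb):
    """Lower bound on 36 Kb tiles, assuming every 18 Kb RAM can be paired."""
    return num_36kb + math.ceil(num_18kb / 2)


for depth, width in ASPECT_RATIOS_18KB:
    bits = capacity_bits(depth, width)
    print(f"{depth} x {width}: {bits} bits ({bits // 1024} Kb)")

print("tiles for three 18 Kb RAMs and one 36 Kb RAM:", tiles_needed(3, 1))
```

The x1 and x2 shapes address 16 Kb because they do not use the parity bits, while the x9 and x18 shapes use the full 18 Kb. The pairing assumed by `tiles_needed` is exactly the choice the packer has to make well.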

Virtex-5 DSP48E Block
A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
Multiple DSP48Es can be chained together through the PCIN and PCOUT ports to form more complex functions.
[Figure: DSP48E block diagram with inputs A (25-bit), B (18-bit) and C (48-bit), a 25x18 multiplier, an ALU, optional pipeline registers and routing logic, pattern detect, the P output, and the 48-bit PCIN/PCOUT cascade ports.]
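A behavioral sketch of the multiply-accumulate chain, assuming signed two's-complement operands at the widths shown in the figure (25-bit A, 18-bit B, 48-bit accumulator). It ignores the pipeline registers, the C input and the other ALU modes; `mac_stage` and `cascade` are illustrative names rather than anything defined by the device.

```python
def to_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def mac_stage(a, b, pcin=0):
    """One stage: P = A * B + PCIN, with A 25-bit, B 18-bit, P 48-bit."""
    product = to_signed(a, 25) * to_signed(b, 18)
    return to_signed(product + pcin, 48)


def cascade(operand_pairs):
    """Chain stages so each PCOUT feeds the next stage's PCIN."""
    p = 0
    for a, b in operand_pairs:
        p = mac_stage(a, b, p)
    return p


# 3*4 + 5*6 + (-2)*7 = 28
print(cascade([(3, 4), (5, 6), (-2, 7)]))
```

Each call to `mac_stage` stands in for one DSP48E, so the length of the list passed to `cascade` corresponds to the number of blocks chained in a column.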

Block RAM and DSP Floorplan
Block RAM and DSP48E tiles are organized in columns.
[Figure: a column of block RAM tiles alongside a column of DSP48E blocks, with the DSP48Es grouped into Virtex-5 DSP tiles.]

Block RAM/DSP Packing
Problem: placer algorithms are heuristic and sometimes do not find an optimal block RAM packing.
Goal: leverage preferred block RAM packing patterns to achieve high performance.
Target area: DSP designs, which make heavy use of block RAMs and DSP blocks.

DSP Block RAM Designs
The most common DSP application is the finite impulse response (FIR) filter.
FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs.

FIR Filter
A finite impulse response (FIR) filter is a digital filter that takes a weighted sum of the signals in a delay line.
An N-tap filter can be expressed as:
$y[n] = c_0 x[n] + c_1 x[n-1] + \dots + c_{N-1} x[n-N+1]$
Where:
y[n] is the output of the filter at time n
x[n] is the data input “signal” at time n
c_i is the i-th coefficient
Each coefficient/data product in the sum is referred to as a “tap”.
DSP units are used for the multiply and accumulate.
Block RAMs are used to store the data and coefficients.
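The N-tap equation can be written directly as a reference model. This is a plain software sketch of the arithmetic, assuming samples before time 0 are zero; it says nothing about how the taps are mapped to DSP48Es or block RAMs.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum over i of coeffs[i] * samples[n - i].

    Samples before the start of the input are treated as zero.
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, c in enumerate(coeffs):  # one iteration per tap
            if n - i >= 0:
                acc += c * samples[n - i]
        outputs.append(acc)
    return outputs


# A 2-tap filter, matching the tap count of the use cases that follow.
print(fir_filter([1, 2], [1, 1, 1]))        # [1, 3, 3]
# Feeding an impulse returns the coefficients themselves.
print(fir_filter([3, 5, 7], [1, 0, 0, 0]))  # [3, 5, 7, 0]
```

The inner loop body is one multiply-accumulate, which is why each tap maps naturally onto one DSP block fed by a data RAM and a coefficient RAM.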

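FIR Designs – Use Case 1
2-tap FIR filter involving small block RAMs.
[Figure: data RAMs (RAMD0, RAMD1) and coefficient RAMs (RAMC0, RAMC1), each an 18 Kb block RAM, feed the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1). The DSPs are chained through PCOUT/PCIN between the data input and data output, and the 18 Kb block RAMs are grouped into 36 Kb block RAM tiles.]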

Packing for Use Case 1
Packing both 18k block RAMs into a block RAM tile permits a natural alignment between the DSP and block RAMs.
The tile operates as two independent 18 Kb block RAMs.
[Figure: a column of block RAM tiles next to a column of Virtex-5 DSP tiles, each block RAM tile aligned with a pair of DSP48Es. High performance!]

FIR Designs – Use Case 2
2-tap FIR filter involving larger block RAMs.
[Figure: data RAMs (RAMD0, RAMD1) and coefficient RAMs (RAMC0, RAMC1), drawn with 18 Kb and 36 Kb block RAMs, feed the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1), which are chained through PCOUT/PCIN.]

Packing for Use Case 2
Two block RAM columns feed one DSP column.
Again, this provides a natural alignment between the DSP and block RAMs.
[Figure: two columns of block RAM tiles on either side of a column of DSP48Es grouped into Virtex-5 DSP tiles.]

Block RAM Chains
Use Case: 18k block RAMs whose data input and output pins are connected together (e.g. a FIFO).
Algorithm: look for such chains and pack them together into a single block RAM tile.
Special Case: 18k block RAMs separated by registers.
[Figure: two 18 Kb block RAMs, RAM0 (dia, doa, addra) and RAM1 (dib, dob, addrb), chained from data output to data input, with the final output labeled out.]
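One way the chain search could look, as a sketch only. The slide states just the idea (find chains, pack them into one tile); the dictionary netlist format, the greedy head-to-tail pairing and the name `find_ram_pairs` are assumptions, and the sketch handles neither cyclic connections nor the register-separated special case.

```python
def find_ram_pairs(connections):
    """Pair chained 18 Kb RAMs so that each pair can share one 36 Kb tile.

    `connections` maps a RAM name to the RAM whose data input it drives
    (hypothetical netlist format; each RAM drives at most one other RAM).
    """
    driven = set(connections.values())
    heads = [ram for ram in connections if ram not in driven]
    pairs, used = [], set()
    for head in sorted(heads):
        ram = head
        while ram in connections:       # walk the chain from its head
            nxt = connections[ram]
            if ram not in used and nxt not in used:
                pairs.append((ram, nxt))
                used.update((ram, nxt))
            ram = nxt
    return pairs


# A four-deep chain, e.g. a FIFO built from 18 Kb RAMs.
chain = {"RAM0": "RAM1", "RAM1": "RAM2", "RAM2": "RAM3"}
print(find_ram_pairs(chain))  # [('RAM0', 'RAM1'), ('RAM2', 'RAM3')]
```

With an odd-length chain the last RAM stays unpaired and remains available for pairing with an unrelated 18 Kb RAM.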

Block RAM/DSP Packing Results
Circuit     Perf. RAM Packing (MHz)   Perf. Baseline (MHz)   Percent Improvement
Circuit 1   500                       400                    25%
Circuit 2   450                       365                    23%
Circuit 3   -                         470                    6%
Circuit 4   425                       435                    -2%
Circuit 5   215                       200                    8%
Geomean     -                         359                    11%
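The table's arithmetic can be checked with a few lines. The transcript gives only one frequency for Circuit 3 (470) and for the Geomean row (359); the sketch below assumes both belong to the baseline column, an assumption supported by the fact that the geometric mean of the five baseline values then comes out at 359. The missing packed values are left out rather than guessed.

```python
import math

# Frequencies in MHz as listed in the results table.
packed = {"Circuit 1": 500, "Circuit 2": 450, "Circuit 4": 425, "Circuit 5": 215}
# Assumption: the lone 470 for Circuit 3 is the baseline figure.
baseline = {"Circuit 1": 400, "Circuit 2": 365, "Circuit 3": 470,
            "Circuit 4": 435, "Circuit 5": 200}


def percent_improvement(new_mhz, old_mhz):
    """Relative change of new_mhz over old_mhz, in percent."""
    return 100.0 * (new_mhz - old_mhz) / old_mhz


def geomean(values):
    """Geometric mean computed via the mean of the logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, mhz in packed.items():
    print(name, f"{percent_improvement(mhz, baseline[name]):+.1f}%")

print("baseline geomean (MHz):", round(geomean(baseline.values())))
```

The per-circuit percentages match the table to within rounding.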

Summary
Described two architecture-specific packing approaches for a 65nm commercial FPGA, the Xilinx Virtex-5:
Dual-output LUT packing in placement: achieves 10.2% SLICE reduction and 5.5% LUT reduction.
Packing for DSPs and block RAMs: achieves 11% performance improvement.

Questions