A Survey of Logic Block Architectures For Digital Signal Processing Applications.


Presentation Outline: Considerations in logic block design (computation requirements; why inefficiencies?). Representative logic block architectures (proposed; commercial). Conclusions: what is suitable where?

Why DSP? The Context: DSP is representative of the computationally intensive class of applications, being both datapath-oriented and arithmetic-oriented. FPGAs are increasingly used for DSP: multimedia signal processing, communications, and much more. DSP is also a vehicle for studying the issues in reconfigurable fabric design for compute-intensive applications: what is involved in making a fabric that can accelerate multimedia reconfigurable computing?

Elements of a Reconfigurable Architecture: Logic block/processing element (differing grains: fine, coarse, ALUs). Routing. Dynamic reconfiguration.

So what’s wrong with the typical FPGA? It is meant to be general purpose, which lowers risk, but it is too flexible. Result: an efficiency gap, with higher implementation cost, larger delay, and larger power consumption than ASICs. Performance vs. flexibility tradeoff: postponing mapping and silicon re-use.

Solution? See how FPGAs are used. FPGAs are being used for “classes” of applications: encryption, DSP, multimedia, etc. Here lies the key: design FPGAs for a class of applications. Application domain characterization leads to application domain tuning.

Domain Specialization: COMPUTATION defines ARCHITECTURE. Are the target application characteristics known beforehand? If yes: 1. Characterize the application domain. 2. Determine a balance between flexibility and efficiency. 3. Tune the architecture accordingly.

Categorizing the “Computation”: Control: random logic implementation. Datapath: processing of multi-bit data. Conflicting requirements?

Datapath Element Requirements: Operates on word slices or bit slices. Produces multi-bit outputs. Requires many smaller elements to produce each bit output, i.e. multiple small LUTs.

Control Logic Requirements: Produces a single output from many single-bit inputs. Benefits from large-grain LUTs, because the number of logic levels is reduced.
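The effect of LUT size on logic depth can be illustrated with a rough count. The sketch below assumes a function that decomposes into a balanced tree (a wide AND or OR, for example); arbitrary functions may need more levels. The function name `lut_levels` is hypothetical and not taken from the slides.

```python
def lut_levels(num_inputs: int, k: int) -> int:
    """Count the levels of K-input LUTs in a balanced reduction tree.

    Assumption: the function is tree-decomposable (e.g. a wide AND/OR),
    so each level reduces the signal count by a factor of up to k.
    """
    if num_inputs < 1 or k < 2:
        raise ValueError("need at least one input and LUTs with k >= 2")
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // k)  # ceiling division: LUTs used at this level
        levels += 1
    return levels


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(f"64 inputs, {k}-input LUTs: {lut_levels(64, k)} levels")
```

Under this assumption a 64-input reduction needs six levels of 2-input LUTs but only two levels of 8-input LUTs, which is the kind of saving the slide refers to.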

Logic Block Design Considerations: “How much” of “what kinds” of computations to support? Tradeoff: generality vs. specialization.

How much of what? Applications benchmarking.

So what do we have to support? Datapath functionality, in particular arithmetic, is dominant in DSP. The datapath functions have different bit-widths. DSP designs make heavy use of multiplexers of various sizes, so efficient mapping of multiplexers should be supported. DSP functions do contain random logic, and the amount varies per design. Some DSP designs use wide Boolean functions.

DSP Building Blocks: Some techniques widely used to achieve area- and speed-efficient DSP implementations. Bit-serial computation: routing-efficient; bit-level pipelining increases throughput even more. Digit-serial computation: combines the area efficiency of bit-serial with the time efficiency of bit-parallel.

Classes of DSP-optimized FPGA Architectures: 1. Architectures with dedicated DSP logic (homogeneous; heterogeneous; globally homogeneous, locally heterogeneous). 2. Architectures of coarser granularity. 3. Architectures with DSP-specific improvements (e.g. carry chains, input sharing, CBS).

Some Representative Architectures

Bit-Serial FPGA with SR LUT: The bit-serial paradigm suits the existing FPGA, so why not optimize the FPGA for it? The logic block supports efficient implementation of bit-serial datapaths and bit-level pipelining. LUTs can be used for combinational logic as well as for shift registers.

A Bit-Serial Adder: a bit-serial adder that processes two bits at a time (figures: interface; block diagram).
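Since the adder figure did not survive in the transcript, the following is a minimal behavioural sketch of the general bit-serial idea, not the slide's circuit: one full adder plus a carry flip-flop, consuming one bit of each operand per clock cycle, least significant bit first. The function name and the software loop standing in for the clock are assumptions.

```python
def bit_serial_add(a: int, b: int, width: int) -> int:
    """Add two unsigned width-bit numbers one bit pair per cycle.

    Returns the sum modulo 2**width, as the final carry is discarded.
    """
    carry = 0   # models the carry flip-flop, cleared before the first cycle
    result = 0
    for cycle in range(width):
        a_bit = (a >> cycle) & 1
        b_bit = (b >> cycle) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << cycle  # serial output collected into a word
    return result


if __name__ == "__main__":
    print(bit_serial_add(5, 3, 4))
```

The hardware cost is a single full adder regardless of word width; the price is `width` clock cycles per addition.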

A Bit-Serial Multiplier Cell

The Proposed Bit-Serial Logic Block Architecture: Four 4-input LUTs and 6 flip-flops. The two multiplexers in front of the LUTs are targeted mainly at carry-save operations, which are frequently used in bit-serial computations. There are 18 signal inputs and 6 signal outputs, plus a clock input. Feedback inputs c2, c3, c4, c5 can be connected to GND, to VDD, or to one of the 4 outputs d0, d1, d2, d3. Therefore, each LUT can implement any 4-input function controlled by inputs a0, a1, a2, a3 or b0, b1, b2, b3. Programmable switches connected to inputs a4 and b4 control the functionality of the four multiplexers at the outputs of the LUTs. As a result, 2 LUTs can implement any 5-input function. The final outputs d0, d1, d2, d3 can be either the direct outputs from the multiplexers or the outputs from the flip-flops. All bit-serial operators use the outputs from the flip-flops, so the attached programmable switches are not needed for them; they are present only so that logic functions other than bit-serial datapath circuits can be implemented. Two flip-flops are added (inputs c0 and c1) to implement shift registers, which are frequently used in bit-serial operations.
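The claim that two 4-input LUTs plus an output multiplexer can implement any 5-input function follows from Shannon expansion: one LUT holds the function with the fifth input fixed at 0, the other with it fixed at 1, and the fifth input drives the multiplexer select. The sketch below illustrates this in software; all names (`make_lut4`, `lut4_read`, `build_5input`) are hypothetical and the model ignores the block's actual wiring.

```python
from itertools import product


def make_lut4(func):
    """Build a 16-entry truth table for a 4-input Boolean function."""
    return [func(*bits) for bits in product((0, 1), repeat=4)]


def lut4_read(table, x0, x1, x2, x3):
    """Look up one entry; x0 is the most significant address bit."""
    return table[(x0 << 3) | (x1 << 2) | (x2 << 1) | x3]


def build_5input(func5):
    """Map a 5-input function onto two 4-LUTs and a 2:1 multiplexer."""
    lut_sel0 = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 0))
    lut_sel1 = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 1))

    def evaluate(a, b, c, d, sel):
        out0 = lut4_read(lut_sel0, a, b, c, d)
        out1 = lut4_read(lut_sel1, a, b, c, d)
        return out1 if sel else out0  # the multiplexer after the LUT pair

    return evaluate


if __name__ == "__main__":
    majority5 = lambda a, b, c, d, e: int(a + b + c + d + e >= 3)
    mapped = build_5input(majority5)
    ok = all(mapped(*bits) == majority5(*bits)
             for bits in product((0, 1), repeat=5))
    print("majority-of-5 mapped correctly:", ok)
```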

The Modified LUT Implementing a Shift Register

Performance Results

Digit-Serial Logic Block Architecture: Digit-serial architectures process one digit (N=4 bits) at a time. They offer area efficiency similar to bit-serial architectures and time efficiency close to bit-parallel architectures. N=4 bits can serve as an optimal granularity for processing larger digit sizes (N=8, 16, etc.).

Digit-Serial Building Blocks: a digit-serial adder; a digit-serial unsigned multiplier.
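As a rough behavioural illustration of digit-serial addition (the slide's figure is not in the transcript, so this is not its circuit), the sketch below adds two operands one N-bit digit per cycle, least significant digit first, carrying between digits. The function name and parameters are assumptions.

```python
def digit_serial_add(a: int, b: int, width: int, digit_bits: int = 4) -> int:
    """Add two unsigned width-bit numbers one digit per cycle.

    Returns the sum modulo 2**width. The number of cycles is
    width // digit_bits, compared with width cycles for bit-serial.
    """
    if width % digit_bits:
        raise ValueError("width must be a multiple of digit_bits")
    mask = (1 << digit_bits) - 1
    carry = 0   # carry register between digit cycles
    result = 0
    for cycle in range(width // digit_bits):
        shift = cycle * digit_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift   # digit emitted this cycle
        carry = total >> digit_bits         # 0 or 1, held for the next cycle
    return result


if __name__ == "__main__":
    print(hex(digit_serial_add(0x12F, 0x0D1, 12)))
```

With N=4, a 16-bit addition takes 4 cycles using a 4-bit adder, which sits between the 16 cycles of a bit-serial adder and the single cycle of a 16-bit parallel adder.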

Digit-Serial Building Blocks: a pipelined digit-serial unsigned multiplier for Y=8 bits.
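One common way to organise a digit-serial unsigned multiplication is sketched below, under the assumption that operand X arrives one digit per cycle (least significant first) while Y is available in parallel. It models only the arithmetic, not the pipelining or the stage structure of the slide's design, and the function name is hypothetical.

```python
def digit_serial_multiply(x: int, y: int, x_width: int, digit_bits: int = 4) -> int:
    """Multiply unsigned x by unsigned y, consuming one digit of x per cycle."""
    if x_width % digit_bits:
        raise ValueError("x_width must be a multiple of digit_bits")
    mask = (1 << digit_bits) - 1
    cycles = x_width // digit_bits
    acc = 0       # running partial sum held inside the multiplier
    product = 0
    for cycle in range(cycles):
        x_digit = (x >> (cycle * digit_bits)) & mask
        acc += x_digit * y                            # digit-by-word partial product
        product |= (acc & mask) << (cycle * digit_bits)  # emit one result digit
        acc >>= digit_bits                            # keep the upper part for later
    product |= acc << (cycles * digit_bits)           # flush the remaining high bits
    return product


if __name__ == "__main__":
    print(digit_serial_multiply(0xB7, 0x5C, 8))
```

Each cycle produces one finished digit of the product, so the low digits can be passed downstream before the multiplication has completed.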

Digit-Serial Signed Multiplier Blocks: middle stages module; first stage module; last stage module.

Signed Digit-Serial Multiplier: a pipelined digit-serial signed Booth multiplier with Y=8.

Proposed Digit-Serial Logic Block

Detailed Structure of Digit-Serial Logic Block

The Basic Logic Module (LM): the structure of the LM; table of functions implemented.

Examples of Implementations: N=4 unsigned multiplier; N=4 signed multiplier; two N=2 multipliers; bit-level pipelined.

Area Comparison with Xilinx 4000 Series

Mixed-Grain Logic Block Architecture: Exploits the adder inverting property. Efficiently implements both datapath and random logic in the same logic block design.

Adder Inverting Property: full adder and equations showing the inverting property; an optimal structure derived from the property.
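The slide's equations are not in the transcript, but the inverting property of a full adder is the standard one: inverting all three inputs inverts both the sum and the carry output. The short check below confirms this exhaustively; the function names are illustrative only.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int):
    """Return (sum, carry_out) of a one-bit full adder."""
    s = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return s, cout


def inverting_property_holds() -> bool:
    """Check that FA(~a, ~b, ~cin) == (~sum, ~cout) for all 8 input cases."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, cout_inv) != (1 - s, 1 - cout):
            return False
    return True


if __name__ == "__main__":
    print("inverting property holds:", inverting_property_holds())
```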

LUT Bits Utilization in Datapath and Logic Modes

Structure of a Single Slice

Complete Logic Block

Modified ALU-Like Functionality

Comparison Results

Comparison Results (Cont…)

Comparison Results (cont…)

Coarser ALU-Like Architectures

CHESS Architecture

CHESS ALU-Based Logic Block

Structure of a Switch Box

Comparison Results

Computation Field Programmable Architecture: A heterogeneous architecture with clusters of datapath logic blocks. Separate LUT-based logic blocks support random logic mapping. The basic logic block is called a Partial Adder Subtraction Multiplier (PASM) module.

PASM Logic Block of CFPA

Cluster of PASM Logic Blocks

Comparison Results

Some Industry Architecture Designs

Altera APEX II Logic Element

Altera MAX II Logic Element

LE Configuration in Arithmetic Mode

LE in Random Logic Implementation

Altera Stratix Logic Element

Altera Stratix II Architecture

Stratix II Adaptive Logic Module

Stratix II ALM in Arithmetic Mode

Various Configurations in an ALM of Stratix II

Multiplier Resources in Stratix II

Structure of a DSP Block in Stratix II

XILINX Virtex II Pro Architecture

Basic Logic Element of Virtex II Pro

Dedicated Multipliers in Virtex II Pro

Processor-Programmable Logic Coupled Architecture

PiCoGA Architecture Coupled with a VLIW processor

PiCoGA Logic Block

Conclusions: Traditional general-purpose FPGAs are inefficient for datapath mapping. Logic blocks with DSP-specific enhancements seem a promising solution. Coarse-grained logic can achieve better application mapping for datapaths but sacrifices flexibility. Dedicated blocks (multipliers) increase performance but also increase cost significantly.

Conclusions: PDSPs with embedded FPGAs can achieve a good balance between performance and power consumption. So, which approach is the best? No single best exists.

Suitability of Approaches: Highly computationally intensive applications with large amounts of parallelism can use platform FPGAs, where large resources are often required and power consumption is not an issue. Here cost/function will be lowest.

Suitability of Approaches: Coprocessors based on field-programmable logic can benefit from coarse-grained blocks, where most control functions are implemented by the PDSP itself.

Suitability of Approaches: Higher flexibility and lower cost can be achieved with logic blocks with DSP-specific enhancements that retain the flexibility to implement control logic in an efficient manner.