Simulation of Fracturable LUTs


Simulation of Fracturable LUTs Tim Pifer

Presentation Overview
Altera ALM from Stratix II
Stratix V architecture
Current VPR method for fracturable LUTs
WireMap for technology mapping
AAPack for packing

Altera Adaptive Logic Module
Traditional 4-LUTs provide the best area-delay product.
Larger LUTs: shorter critical path, absorb more logic.
Costs of larger LUTs: larger LUT mask, more input muxing.
The ALM reduces critical path depth by 20% while improving area.
Source: Improving FPGA Performance and Area Using an Adaptive Logic Module. Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy Lee, Henry Kim and Rahul Saini.

Motivation from other architectures
BLE5: 15% fewer LUTs, 25% shorter unit delay
BLE6: 22% fewer LUTs, 36% shorter unit delay
BLE7: 28% fewer LUTs, 46% shorter unit delay
K=6: 25% 6LUT
Example design: K=4, 100 LUTs; K=6, 78 LUTs: 23 6LUT, 32 5LUT, 17 4LUT, 9 3LUT, 13 2LUT
Stratix, Virtex-II: 30:1 input mux
The relative contribution of routing area and interconnect delay increases with each generation of fabrication.

Simple Example: 6LUT from 4 BLEs
Larger area: 19 inputs, 4 registers.
For a 6-LUT, the 4-LUTs have identical inputs but separate input muxes.
3 of the 4 registers and outputs are wasted.

Improved Example: 6,2 Fracturable LE
8 inputs, 2 outputs, 2 registers
One 6-LUT, or two 5-LUTs with input sharing, or two independent 4-LUTs
Comparable in area with two BLE4s
Functionally closer to two BLE5 logic elements

Final Version
Composed of 3-LUTs.
Added d2 muxed output; c1 or GND and c2 or VCC are muxed; mux removed from d1; swap muxes controlled by R and T.
Two 6-LUTs can share an ALM when they share 4 inputs and have an identical LUT mask: 4:1 muxes with common data and different select lines. Up to 12% pairs of 6-LUTs.
R=0, T=1, S=1 implements two muxed 5-LUTs with 7 inputs:
F1 = fn(a1,a2,b1,b2,d1)
F2 = fn(a1,a2,b2,c2,d1)
Out = mux(F1,F2,c1)
Roughly area-neutral with BLE4, with a 36% decrease in logic depth.
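
The 7-input mode on this slide can be sketched in a few lines of Python. This is only an illustrative model, not Altera's circuit: the LUT bit ordering, the mux polarity (c1 = 0 selects F1), and the two example functions are all assumptions introduced here. It checks that two 5-LUT masks plus a 2:1 mux reproduce a 7-input function whose halves use the input sets given for F1 and F2.

```python
from itertools import product

def lut(mask, inputs):
    # mask: list of 2**K bits; inputs[0] is the least significant index bit.
    index = sum(bit << i for i, bit in enumerate(inputs))
    return mask[index]

def make_mask(fn, k):
    # Tabulate fn over all 2**k input combinations, in the bit order lut() uses.
    return [fn(*[(i >> j) & 1 for j in range(k)]) for i in range(2 ** k)]

def alm_7input(mask_f1, mask_f2, a1, a2, b1, b2, c1, c2, d1):
    f1 = lut(mask_f1, (a1, a2, b1, b2, d1))  # F1 = fn(a1,a2,b1,b2,d1)
    f2 = lut(mask_f2, (a1, a2, b2, c2, d1))  # F2 = fn(a1,a2,b2,c2,d1)
    return f2 if c1 else f1                  # Out = mux(F1,F2,c1); polarity assumed

# Hypothetical 5-input functions, chosen only to exercise the structure.
def f1_fn(a1, a2, b1, b2, d1):
    return (a1 & b1) | (a2 & b2 & d1)

def f2_fn(a1, a2, b2, c2, d1):
    return a1 ^ a2 ^ b2 ^ c2 ^ d1

MASK_F1 = make_mask(f1_fn, 5)
MASK_F2 = make_mask(f2_fn, 5)

def reference(a1, a2, b1, b2, c1, c2, d1):
    # The 7-input function the ALM is meant to implement.
    return f2_fn(a1, a2, b2, c2, d1) if c1 else f1_fn(a1, a2, b1, b2, d1)

# Exhaustive check over all 128 input combinations.
ALL_MATCH = all(
    alm_7input(MASK_F1, MASK_F2, *bits) == reference(*bits)
    for bits in product((0, 1), repeat=7)
)
```

The two 5-LUTs share a1, a2, b2 and d1, which matches the "two 5-LUTs sharing 4 inputs" description; b1, c2 and the select c1 bring the total to 7 inputs.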

How do we set RSTU for a 6LUT?
RST = 110.
Setting U to zero allows a second output with d2.

8:1 mux implementation
An 8:1 mux fits in 2 ALMs (4 ALUTs) using 7-input functions.
The second ALM computes the output:
F1 = fn(s0,s1,d3,y0,y1)
F2 = fn(s0,s1,d7,y0,y1)
with the final mux controlled by s2.
5 BLE4s vs. 2 ALMs: saves one BLE4.
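
One decomposition consistent with the F1 and F2 signatures above can be checked exhaustively. The slide does not say what y0 and y1 compute, so the choice below is an assumption: the first ALM produces y0 and y1 as 3:1 selections over d0..d2 and d4..d6, and the second ALM substitutes d3 or d7 when s1 s0 = 11.

```python
from itertools import product

def mux8_reference(d, s0, s1, s2):
    # Plain 8:1 mux: select d[s2 s1 s0].
    return d[s0 | (s1 << 1) | (s2 << 2)]

def first_alm(d, s0, s1):
    # Assumed first ALM: two 5-input functions sharing s0 and s1.
    # Each selects one of three data inputs; s1 s0 = 11 is a don't-care (0 here).
    sel = s0 | (s1 << 1)
    y0 = d[sel] if sel < 3 else 0      # fn(s0, s1, d0, d1, d2)
    y1 = d[4 + sel] if sel < 3 else 0  # fn(s0, s1, d4, d5, d6)
    return y0, y1

def second_alm(s0, s1, s2, d3, d7, y0, y1):
    # 7-input function: two 5-LUTs over (s0, s1, d3 or d7, y0, y1), muxed by s2.
    last = s0 & s1
    f1 = d3 if last else y0  # F1 = fn(s0,s1,d3,y0,y1)
    f2 = d7 if last else y1  # F2 = fn(s0,s1,d7,y0,y1)
    return f2 if s2 else f1

def mux8_two_alms(d, s0, s1, s2):
    y0, y1 = first_alm(d, s0, s1)
    return second_alm(s0, s1, s2, d[3], d[7], y0, y1)

# Exhaustive check over 8 data bits and 3 select bits (2048 cases).
ALL_MATCH = all(
    mux8_two_alms(bits[:8], *bits[8:]) == mux8_reference(bits[:8], *bits[8:])
    for bits in product((0, 1), repeat=11)
)
```

Under this assumption the first ALM sees 8 distinct inputs (s0, s1 and six data bits) and the second sees 7, which fits the input counts quoted for the ALM.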

Stratix V
The ALM can become two 4-LUTs, with eight inputs for both ALUTs.
Backward-compatible with 4-LUT architectures.
Source: Logic Array Blocks and Adaptive Logic Modules in Stratix V Devices.

Stratix V Comment on adders and carry chains

Normal Modes

Normal LUT mode
Single 6-LUT mode; the other inputs are used for registers.

Extended LUT mode
A 7-input function: a 2-to-1 multiplexer fed by two 5-LUTs sharing 4 inputs.
Useful for if-else statements.

Why 6-LUTs: DES Example
DES has 8 S-boxes (substitution tables).
Each S-box has 6 inputs and 4 outputs.
Each output: one 6-LUT versus six 4-LUTs.
35-45% less area.
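
The LUT counts behind this example follow from simple arithmetic, sketched below. The table contents are a placeholder invented for illustration, not a real DES S-box; only the shape (6 inputs, 4 outputs, 8 S-boxes) and the "six 4-LUTs per output" figure come from the slide.

```python
SBOXES = 8
INPUT_BITS = 6
OUTPUT_BITS = 4
LUT4_PER_OUTPUT = 6  # figure quoted on the slide

# Placeholder 64-entry table of 4-bit values (NOT an actual DES S-box).
TABLE = [(7 * i + 3) % 16 for i in range(2 ** INPUT_BITS)]

def output_masks(table, output_bits):
    # One 64-bit LUT mask per output bit: mask[i] is that bit of table[i].
    return [[(entry >> bit) & 1 for entry in table] for bit in range(output_bits)]

MASKS = output_masks(TABLE, OUTPUT_BITS)

# Each output bit is an arbitrary 6-input function, so it fits one 6-LUT.
LUT6_TOTAL = SBOXES * OUTPUT_BITS
LUT4_TOTAL = SBOXES * OUTPUT_BITS * LUT4_PER_OUTPUT

# Sanity check: the four masks together reproduce the table.
REBUILT = [
    sum(MASKS[bit][i] << bit for bit in range(OUTPUT_BITS))
    for i in range(len(TABLE))
]
```

With these numbers the S-boxes need 32 6-LUTs or 192 4-LUTs. LUT count alone does not give the area saving, since a 6-LUT is larger than a 4-LUT.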

How would we alter technology mapping to best support FLUTs?

Technology Mapping
One 4-LUT and two 6-LUTs require 3 ALMs.
The same logic could use four 5-LUTs, requiring 2 ALMs at the same logic depth.
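
The ALM counts on this slide can be reproduced with a deliberately simplified packing model. The model is an assumption made for illustration: a 6-LUT occupies a whole ALM, any two LUTs with 5 or fewer inputs share one ALM, and the real input-sharing limits of the ALM are ignored.

```python
import math

def alms_needed(lut_sizes):
    # Simplified model (assumption): 6-LUTs take a full ALM,
    # smaller LUTs pair up two per ALM regardless of shared inputs.
    if any(k < 1 or k > 6 for k in lut_sizes):
        raise ValueError("only 1- to 6-input LUTs are modeled")
    full = sum(1 for k in lut_sizes if k == 6)
    small = sum(1 for k in lut_sizes if k <= 5)
    return full + math.ceil(small / 2)

MAPPING_A = [4, 6, 6]     # one 4-LUT and two 6-LUTs
MAPPING_B = [5, 5, 5, 5]  # four 5-LUTs

ALMS_A = alms_needed(MAPPING_A)
ALMS_B = alms_needed(MAPPING_B)
```

Mapping B uses more LUTs but fewer ALMs, which is the motivation for making the mapper aware of how LUTs pack.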

Balancing Technology Mapping
Must maintain optimal critical path depth while producing a more packable LUT distribution.
Avoid 6-LUTs when they do not help delay.
8:1 muxes are identified separately and mapped to 7-input functions.
7% of ALMs are 7-input functions.

Results: Performance
80 designs tested
130 nm process
Minimum chip size used
SPICE models for delay

Results: Area

Stratix vs. Stratix II

Conclusions
Benefits of 6-LUTs without underutilization.
Larger LUT costs: LUT-mask size, input and output muxing, FFs.
The 6-LUT is fracturable into 5-LUTs, with area comparable to 2 BLE4s.
Supports 7-input functions and 6-input pairs.
Technology mapping support is needed for best results.
6,2 ALM vs. 4 BLE: 15% better performance and 12% smaller area on average.

How do we need to alter VPR to support FLUTs?

AAPack and WireMap
ABC with WireMap: technology mapping to primitives.
AAPack: capable of packing complex logic blocks based on logic primitives.

WireMap
Reduces the percentage of 6-LUTs.
Does not increase logic depth or total LUT count.
Source: WireMap: FPGA Technology Mapping for Improved Routability. Stephen Jang, Billy Chan, Kevin Chung, Alan Mishchenko.

AAPack Overview
Current tools can't support the complexity of logic blocks.
New logic block description language: depicts complex interconnects, hierarchy, and modes of operation.
Can pack complex blocks.
Area driven: area is compared to the theoretical minimum.
Verilog input for large benchmarks.
Source: Architecture Description and Packing for Logic Blocks with Hierarchy, Modes and Complex Interconnect. Jason Luu, Jason Anderson, and Jonathan Rose.

Example: Virtex-6 Logic Block
Tools don't support Stratix IV or Virtex-6.
Virtex-6: complex soft logic blocks, hard memories, multipliers.

What AAPack does
Can describe: complex logic blocks with arbitrary internal routing structures; variable memory configurations (4Kx8, 8Kx4, or 16Kx2).
Area-driven packing.
Inputs: user design, architectural description.

Complex Block Description Language
Expressive: The language should be capable of describing a wide range of complex blocks.
Simple: The language constructs should match closely with an FPGA architect's existing knowledge and intuition.
Concise: The language should permit complex blocks to be described as concisely as possible.

Physical blocks
Specified in XML.
Hierarchy: blocks contain other blocks and existing primitives.
Inputs, outputs, and clocks with pin numbers.

Primitives
Common primitives are handled in the language.
LUT inputs can be reordered; a memory address cannot.

Intra-Block Interconnect
complete: crossbar switch (internal programmable signal)
direct: direct wire connection, no programmability
mux: multiplexed connection, single-bit or bus (programmable signal)

Modes of Operation
Mutually exclusive functionality.
Represent FPGA structures being used in different ways.
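
To make the hierarchy, interconnect, and mode ideas concrete, the sketch below parses a small XML block description with Python's standard library. The tag and attribute names (pb_type, mode, interconnect, complete, direct, num_pb, num_pins) are written from memory of VTR-style architecture files and should be treated as illustrative assumptions; the fracturable block itself is a made-up example, not an architecture from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical fracturable LUT block with two mutually exclusive modes.
ARCH = """
<pb_type name="flut" num_pb="1">
  <input name="in" num_pins="8"/>
  <output name="out" num_pins="2"/>
  <mode name="single_lut6">
    <pb_type name="lut6" num_pb="1">
      <input name="in" num_pins="6"/>
      <output name="out" num_pins="1"/>
    </pb_type>
    <interconnect>
      <direct name="in6" input="flut.in[5:0]" output="lut6.in"/>
      <direct name="out6" input="lut6.out" output="flut.out[0]"/>
    </interconnect>
  </mode>
  <mode name="dual_lut5">
    <pb_type name="lut5" num_pb="2">
      <input name="in" num_pins="5"/>
      <output name="out" num_pins="1"/>
    </pb_type>
    <interconnect>
      <complete name="xbar" input="flut.in" output="lut5[1:0].in"/>
      <direct name="out5" input="lut5[1:0].out" output="flut.out"/>
    </interconnect>
  </mode>
</pb_type>
"""

def summarize(pb):
    # Collect pin counts, then the child blocks and interconnect kinds per mode.
    pins = {
        kind: sum(int(p.get("num_pins")) for p in pb.findall(kind))
        for kind in ("input", "output", "clock")
    }
    modes = {}
    for mode in pb.findall("mode"):
        modes[mode.get("name")] = {
            "children": [(c.get("name"), int(c.get("num_pb")))
                         for c in mode.findall("pb_type")],
            "interconnect": sorted(ic.tag for ic in mode.find("interconnect")),
        }
    return {"name": pb.get("name"), "pins": pins, "modes": modes}

SUMMARY = summarize(ET.fromstring(ARCH))
```

Each mode lists its own children and interconnect, which is how one physical block can act as either a single 6-LUT or two 5-LUTs.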

Packing Algorithm
Input: technology-mapped netlist, XML architecture.
Output: packed complex blocks.
Greedy algorithm, similar to other packing methods.
Repeat until all blocks are packed:
Select a seed block s.
Create a new complex block B for s.
Pack additional blocks into B: choose a compatible block c, and pack c into B if valid.
Add B to the packed list.
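
The greedy loop above can be sketched as follows. This is a toy model, not AAPack itself: the legality check is reduced to a capacity limit, the seed is the block with the most nets, and candidates are ranked by the number of nets shared with the current cluster. The block and net names are made up for the example.

```python
def is_legal(cluster, candidate, capacity):
    # Stand-in for the real location and routability checks (assumption).
    return len(cluster) < capacity

def greedy_pack(blocks, capacity):
    # blocks: dict mapping block name -> set of net names it touches.
    unpacked = dict(blocks)
    packed = []
    while unpacked:
        # Seed: the unpacked block with the most nets (ties broken by name).
        seed = max(sorted(unpacked), key=lambda b: len(unpacked[b]))
        cluster = [seed]
        cluster_nets = set(unpacked.pop(seed))
        rejected = set()
        while True:
            candidates = [b for b in sorted(unpacked) if b not in rejected]
            if not candidates:
                break
            # Candidate: the block sharing the most nets with the cluster.
            c = max(candidates, key=lambda b: len(unpacked[b] & cluster_nets))
            if is_legal(cluster, c, capacity):
                cluster.append(c)
                cluster_nets |= unpacked.pop(c)
            else:
                rejected.add(c)
        packed.append(cluster)
    return packed

NETLIST = {
    "a": {"n1", "n2", "n3"},
    "b": {"n1", "n2"},
    "c": {"n3"},
    "d": {"n4"},
}
RESULT = greedy_pack(NETLIST, capacity=2)
```

Here a is the seed, b joins it because it shares two nets, and c and d are rejected for that cluster once it is full, so they form the second cluster.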

Selecting Netlist and Complex Blocks
Seed: choose the block with the most nets attached.
Candidates are selected based on the affinity in equation 1.
Affinity = shared nets and connections, divided by the number of pins the new block would add.
Connections is a measure of how likely the new block is to need external connections.
Alpha is set to 0.9.
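
Equation 1 itself is not reproduced in the transcript, so the form below is a guess at its shape rather than the published formula: alpha weights the shared-net term, (1 - alpha) weights the connections term, and the sum is divided by the pins the candidate would add. The function and parameter names are hypothetical.

```python
ALPHA = 0.9  # value stated on the slide

def affinity(shared_nets, connections, pins_added, alpha=ALPHA):
    # Assumed form: weighted sum of the two terms over the added pin count.
    # Which term alpha multiplies is an assumption, not taken from the source.
    if pins_added <= 0:
        raise ValueError("pins_added must be positive in this sketch")
    return (alpha * shared_nets + (1 - alpha) * connections) / pins_added

# Hypothetical candidates: (shared nets, connections, pins added).
CANDIDATES = {
    "c1": (3, 1, 2),
    "c2": (3, 1, 5),
    "c3": (1, 4, 2),
}
BEST = max(sorted(CANDIDATES), key=lambda name: affinity(*CANDIDATES[name]))
```

With these made-up numbers c1 wins: it shares as many nets as c2 but adds fewer pins, and it shares more nets than c3.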

Legality: Location
When attempting to pack, the packer chooses a location and verifies routing.
It traverses the complex block as a tree, ordered smallest to largest from right to left.
The tree is traversed right to left to ensure the smallest resource consumption.
It then attempts to pack the other nodes in the sub-tree, for example finding a flip-flop for a LUT.
30 packs on the sub-tree.

Legality: Routability
Initially, check whether packing would exceed the external pin count.
Then, generate the routing graph for the complex block.
Assume any output can connect to any input of a complex block (switchbox architecture).
Apply PathFinder.

Memory
Primitives are technology mapped with a single-bit width.
A 256x8 memory is mapped as eight 256x1-bit memories.
Primitives are mapped to the same component if their bus signals are identical.
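
The regrouping rule for memory slices can be sketched as a grouping by shared bus signals. The record layout (name, addr, we, clk) is a hypothetical representation chosen for this example; the talk only says that slices with identical bus signals map to the same component.

```python
from collections import defaultdict

def group_slices(slices):
    # Slices whose address bus and control signals match land in one group.
    groups = defaultdict(list)
    for s in slices:
        key = (tuple(s["addr"]), s["we"], s["clk"])
        groups[key].append(s["name"])
    return list(groups.values())

ADDR_A = [f"a{i}" for i in range(8)]  # 8 address bits -> 256 words
ADDR_B = [f"b{i}" for i in range(8)]

# A 256x8 memory, technology mapped as eight 256x1 slices.
SLICES = [
    {"name": f"mem_a_bit{i}", "addr": ADDR_A, "we": "we_a", "clk": "clk"}
    for i in range(8)
]
# One slice of an unrelated memory with a different address bus.
SLICES.append({"name": "mem_b_bit0", "addr": ADDR_B, "we": "we_b", "clk": "clk"})

GROUPS = group_slices(SLICES)
```

The eight slices of the first memory come back as one group that a packer could place into a single 256x8 component, while the unrelated slice stays separate.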

Limitations
No support for timing in this implementation.
A primitive can map to only one complex logic block: for example, flip-flops can be used only in a LUT complex block and cannot also be present in multiplier complex blocks.

What are some faults of this packing method?

Experiments
Verilog benchmarks: soft processors, image processors.

Fracturable LUTs
Hierarchy: CLBs, BLEs, FLUTs.
CLBs: 8 fully connected BLEs; FI x N inputs, no pin sharing.
BLEs: 1 FLUT, 2 flip-flops, 2 outputs.
FLUTs: 2 modes (6-LUT, dual 5-LUTs); variable number of inputs; in dual mode, input sharing depends on the number of inputs.

FLUT Evaluation
Compare the achieved area with the lower bound.
Lower bound: the number of complex blocks needed to contain the primitives, without routing considerations.
Efficiency: the ratio between the achieved number of logic blocks and this lower bound.
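
A small calculation illustrates the metric. Two points are assumptions made here: the lower bound is taken as the primitive count divided by block capacity, rounded up, and efficiency is expressed as lower bound over achieved so that 1.0 means the bound was met (the slide does not state the direction of the ratio).

```python
import math

def lower_bound_blocks(num_primitives, primitives_per_block):
    # Blocks needed if only capacity mattered (no routing constraints).
    return math.ceil(num_primitives / primitives_per_block)

def efficiency(achieved_blocks, num_primitives, primitives_per_block):
    # Assumed direction: 1.0 is ideal, lower values mean wasted blocks.
    bound = lower_bound_blocks(num_primitives, primitives_per_block)
    return bound / achieved_blocks

# Hypothetical design: 100 primitives, 8 per block, packed into 16 blocks.
BOUND = lower_bound_blocks(100, 8)
EFF = efficiency(16, 100, 8)
```

For the made-up design, the bound is 13 blocks, so packing into 16 blocks gives an efficiency of about 0.81.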

Efficiency Results
Number of inputs FI varied from 5 to 10.
Geometric average across 5 benchmarks.
FI = 5 means all inputs are shared; FI = 10 means no inputs are shared.
FI of 6 or 7 achieves tolerable efficiency.
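
The effect of FI on dual 5-LUT mode can be expressed as a simple feasibility test. This is a sketch under the assumption that the only constraint is the number of distinct inputs: two functions of at most 5 inputs fit one FLUT when the union of their inputs is no larger than FI. Signal names are made up.

```python
def fits_dual_mode(inputs_a, inputs_b, fi):
    # Both functions must fit a 5-LUT, and together use at most fi pins.
    a, b = set(inputs_a), set(inputs_b)
    return len(a) <= 5 and len(b) <= 5 and len(a | b) <= fi

F = ["x0", "x1", "x2", "x3", "x4"]
G_DISJOINT = ["y0", "y1", "y2", "y3", "y4"]  # shares nothing with F
G_SHARE3 = ["x0", "x1", "x2", "y0", "y1"]    # shares 3 inputs with F
G_SAME = list(F)                             # shares all 5 inputs with F

# Smallest FI in 5..10 at which each pair can share a FLUT.
MIN_FI = {
    name: min(fi for fi in range(5, 11) if fits_dual_mode(F, g, fi))
    for name, g in (("disjoint", G_DISJOINT), ("share3", G_SHARE3), ("same", G_SAME))
}
```

Two 5-input functions need 10 - FI shared inputs to pair, which is why FI = 5 forces all inputs to be shared and FI = 10 requires none.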

Logic blocks and channel width versus number of inputs
The number of blocks decreases as the number of inputs rises to 7.
Channel width first increases with the number of inputs, because more routing goes to each block.
It then decreases after 7: full efficiency is reached, so routing is easier.

Memory
Varied the number of bits and the maximum width.
Best utilization: smallest size, maximum width.

CLB consumption with memory size
Smaller memories: more logic due to muxes.
Best results: multiple memory sizes.

Conclusions
New language can describe complex architectures using hierarchy, modes, and arbitrary interconnects.
Packing algorithm for this architecture.
Verified on large benchmarks.
Needs timing-driven packing.

How can we get additional improvement from technology mapping?

Academic FLUTs
Soft logic, 4 architectures: K = 6, M = 5, 6, 7, 8.
M5: dual-output 6-LUT of a Xilinx Virtex-5.
M8: Stratix II ALM.
Source: Exploring FPGA Technology Mapping for Fracturable LUT Minimization. David Dickin, Lesley Shannon.

BLE
1 FLUT, 2 registers, 8 inputs, 4 outputs.

LUT Balancing Experiments
WireMap with no LUT balancing.
WireMap with LUT balancing: increase the cost of LUT5 and LUT6 from 1.0 to 2.5 in 0.1 increments.
Smaller LUT weights unchanged.
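
The weight sweep can be illustrated with a toy cost comparison. The two candidate covers and the cost function are assumptions for this sketch and are not WireMap's internals: each LUT contributes its weight, smaller LUTs keep weight 1.0, and only the 6-LUT weight is swept here.

```python
# Sweep values from the slide: 1.0 to 2.5 in steps of 0.1 (16 points).
WEIGHTS = [round(1.0 + 0.1 * i, 1) for i in range(16)]

def cover_cost(lut_sizes, w5=1.0, w6=1.0):
    # Weighted LUT count: LUT5 and LUT6 use their weights, smaller LUTs cost 1.0.
    return sum(w6 if k == 6 else w5 if k == 5 else 1.0 for k in lut_sizes)

COVER_A = [6, 6, 4]     # hypothetical cover using two 6-LUTs
COVER_B = [5, 5, 5, 5]  # hypothetical cover using only 5-LUTs

def preferred(w6):
    # The mapper is assumed to pick the cheaper cover; ties keep cover A.
    cost_a = cover_cost(COVER_A, w6=w6)
    cost_b = cover_cost(COVER_B, w6=w6)
    return "B" if cost_b < cost_a else "A"

# First 6-LUT weight in the sweep at which the 5-LUT cover becomes cheaper.
CROSSOVER = next(w for w in WEIGHTS if preferred(w) == "B")
```

At weight 1.0 the cover with two 6-LUTs is cheaper because it has fewer LUTs; raising the 6-LUT weight eventually tips the choice toward the more packable 5-LUT cover.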

Varying 6LUT weight

Varying 5LUT and 6LUT weight

Different Architectures

Clock Frequency

FLUT Reduction by Architecture