Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations
Ameer Abdelhadi, Guy Lemieux
The University of British Columbia
© 2011 Guy Lemieux

Modern FPGAs are HUGE!
Altera Stratix IV “4SGX530” device (45nm)
  – 200k 6-LUTs in one die
FPGAs keep getting bigger
  – Moore’s Law
      2x every months
      28nm devices announced (Stratix V, Virtex 7)
  – More than Moore
      Xilinx 1,200k 6-LUTs using stacked silicon

Configuration Bits
Growing bitstream storage & load time
  – Altera Stratix IV (4SGX, largest device)
      4 seconds to boot up using serial EEPROM
      35 seconds to configure using USB
      MByte bitstream
Multiple configurations per FPGA
  – Even more bitstream storage

Bitstream Compression
Obvious solution: bitstream compression
  – Many methods in research
      Architecture-aware: switch blocks (e.g., avoid 5:1 muxes), logic blocks (e.g., ULMs, coarse-grained), readback, frame reordering, wildcard techniques
      Architecture-agnostic: general compression (e.g., Huffman, LZ, arithmetic), “don’t care”
  – Not yet applied in practice
Can we do better?

Main Idea
Bitstream Reduction
  – Throw away redundant bits
      Do not store them in configuration EEPROM
      Do not transmit them
  – Regenerate missing bits (in the FPGA)
      Done on-the-fly, during bitstream load
      Does not reduce # SRAM bits inside FPGA
This is not bitstream compression
  – Can still compress after reduction

Main Idea
Where to find redundant bits?
  – LUT configuration bits (this paper)
  – Interconnect (future work)
E.g., 4-input LUT:
  – 16 config bits (2^k)
  – 65,536 configurations (2^(2^k))
  – 222 unique (npn-equivalent) 4-input functions
  – Theoretically, 8 config bits are sufficient (log2(222) ≈ 7.8)

Equivalence Classes
What is npn-equivalent?
  – First n = input negation (inversion), p = input permutation, second n = output negation
E.g., 4-input LUT
  – 4! = 24 input permutations for the same function
  – 2^k = 16 input inversions
  – 2^1 = 2 output inversions
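The class count quoted on the previous slide can be checked by brute force. The sketch below is not from the presentation: it assumes a LUT truth table stored as an integer whose bit x is the output for input address x, and the function name `npn_class_count` is hypothetical. It groups all k-input functions into orbits under input permutation, input negation, and output negation.

```python
from itertools import permutations

def npn_class_count(k):
    """Count npn-equivalence classes of k-input functions by orbit enumeration."""
    n = 1 << k                 # truth-table size: 2^k bits
    all_ones = (1 << n) - 1    # XOR mask that negates the output
    # For every (input permutation, input negation) pair, precompute which
    # original address each truth-table address reads from.
    address_maps = []
    for perm in permutations(range(k)):
        for neg in range(n):
            amap = []
            for x in range(n):
                y = 0
                for j in range(k):
                    y |= ((x >> perm[j]) & 1) << j
                amap.append(y ^ neg)
            address_maps.append(amap)
    seen = set()
    classes = 0
    for f in range(1 << n):
        if f in seen:
            continue
        classes += 1           # f is a representative of a new class
        for amap in address_maps:
            g = 0
            for x in range(n):
                g |= ((f >> amap[x]) & 1) << x
            seen.add(g)              # same function up to input n and p
            seen.add(g ^ all_ones)   # and with the output negated
    return classes

if __name__ == "__main__":
    for k in (2, 3):
        print(k, npn_class_count(k))
```

Calling `npn_class_count(4)` takes a second or two in plain Python and reproduces the 222 classes cited for 4-input LUTs.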

Example: LUT input ordering (figures)

Example 2-LUT
Bit e0
  -- removed from bitstream
  -- regenerated at load-time
  -- storage bit still exists in LUT
Regeneration circuit
  -- shared by all LUTs
  -- regenerates missing bits
Problem: changing order of inputs → reorders LUT contents → changes value of bit e0
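The problem statement can be made concrete with a small sketch. It assumes a truth table stored as an integer in which pin j supplies bit j of the address; `reorder_lut` and the example function are illustrative choices, not taken from the slides.

```python
def reorder_lut(table, k, new_pin_of):
    """Return the truth table after moving old input pin i to pin new_pin_of[i]."""
    new_table = 0
    for new_addr in range(1 << k):
        # Rebuild the address the original table would have seen.
        old_addr = 0
        for i in range(k):
            old_addr |= ((new_addr >> new_pin_of[i]) & 1) << i
        new_table |= ((table >> old_addr) & 1) << new_addr
    return new_table

# f(a, b) = a AND NOT b, with a on pin 0 and b on pin 1.
# Address = a + 2*b, so only address 1 (a=1, b=0) stores a 1.
original = 0b0010
swapped = reorder_lut(original, 2, [1, 0])   # a moves to pin 1, b to pin 0
print(format(original, "04b"), "->", format(swapped, "04b"))
```

Swapping the two pins exchanges the two middle table bits, so any bit that is dropped from the bitstream must be chosen with the final input order in mind. Symmetric functions such as AND are unaffected by the swap.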

Configuration Chaining
Problem: changing order of inputs → reorders LUT contents → changes value of bit e0
Solution:
1. Copy value of bit e0 to be saved
2. Encode value of bit e0 in the next LUT

General Approach
Produce a LUT chain
Input ordering of next LUT generates removed bits for current LUT
Approach
1. Enumerate # of input permutations of LUT
   k=2 → 2! = 2
   k=3 → 3! = 6
   k=4 → 4! = 24
2. Define number of bits (p_k) to be replaced, p_k = ⌊log2(k!)⌋
   k=2 → ⌊log2(2)⌋ = 1
   k=3 → ⌊log2(6)⌋ = 2
   k=4 → ⌊log2(24)⌋ = 4
3. For each LUT
   a. Copy P = p_k bits from the current LUT
   b. Re-order next LUT inputs according to value of P
   c. Adjust next LUT configuration bits to match new input ordering
   d. Remove p_k configuration bits from current LUT
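Steps a to d can be sketched end to end in software. Several details here are assumptions rather than the authors' design: each LUT input is identified by a distinct integer source ID, the input ordering is read back as a permutation rank (Lehmer code) of those IDs, and the removed bits are the low p bits of the table. All function names are hypothetical.

```python
from math import factorial

def rank_of(srcs):
    """Permutation rank of distinct IDs, computed from pairwise comparisons."""
    k = len(srcs)
    rank = 0
    for i in range(k):
        smaller_later = sum(srcs[j] < srcs[i] for j in range(i + 1, k))
        rank += smaller_later * factorial(k - 1 - i)
    return rank

def order_for(srcs, value):
    """Arrange the IDs so that rank_of(result) == value (0 <= value < k!)."""
    pool = sorted(srcs)
    ordered = []
    for remaining in range(len(pool), 0, -1):
        digit, value = divmod(value, factorial(remaining - 1))
        ordered.append(pool.pop(digit))
    return ordered

def reorder(table, srcs, new_srcs):
    """Permute table bits so the LUT computes the same function of its sources."""
    k = len(srcs)
    new_pin = [new_srcs.index(s) for s in srcs]
    new_table = 0
    for new_addr in range(1 << k):
        old_addr = 0
        for i in range(k):
            old_addr |= ((new_addr >> new_pin[i]) & 1) << i
        new_table |= ((table >> old_addr) & 1) << new_addr
    return new_table

def reduce_chain(luts, p):
    """luts: list of (table, srcs). Drops the low p bits of every LUT but the last."""
    luts = [(t, list(s)) for t, s in luts]
    stored = []
    for i in range(len(luts)):
        table, srcs = luts[i]
        if i + 1 < len(luts):
            hidden = table & ((1 << p) - 1)              # a. copy P
            nxt_table, nxt_srcs = luts[i + 1]
            new_srcs = order_for(nxt_srcs, hidden)       # b. reorder next inputs
            luts[i + 1] = (reorder(nxt_table, nxt_srcs, new_srcs), new_srcs)  # c.
            table >>= p                                  # d. remove p bits
        stored.append((table, srcs))
    return stored

def regenerate(stored, p):
    """Rebuild full tables at load time from the next LUT's input ordering."""
    full = []
    for i, (bits, srcs) in enumerate(stored):
        if i + 1 < len(stored):
            bits = (bits << p) | rank_of(stored[i + 1][1])
        full.append((bits, srcs))
    return full
```

In this sketch the last LUT in the chain has no successor and keeps all of its bits, and the first LUT's input order carries no information. Because 2^p_k ≤ k!, every p_k-bit value maps to a valid ordering.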

LUT Savings
k    # LUT bits    # bits saved = log2(k!)    % saved
2    4             1                          25%
(remaining rows: only the "%" signs survive in the transcript)
How complex is bit regeneration?
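The columns of this table follow from k alone, so they can be recomputed. The sketch below assumes the rounded-down values given under General Approach, that is ⌊log2(k!)⌋ bits saved out of 2^k. The range k = 2 to 8 is chosen to match the regeneration circuits shown later and is not necessarily the range on the original slide.

```python
from math import factorial

def savings(k):
    """Return (LUT bits, bits saved, percent saved) for a k-input LUT."""
    lut_bits = 1 << k
    saved = factorial(k).bit_length() - 1   # exact floor(log2(k!))
    return lut_bits, saved, 100.0 * saved / lut_bits

for k in range(2, 9):
    bits, saved, pct = savings(k)
    print(f"k={k}: {bits:3d} LUT bits, {saved:2d} saved, {pct:.2f}% saved")
```

The saving stays at 25% up to k = 4 and falls off for larger LUTs, since k! grows more slowly than 2^(2^k).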

General Structure (figure)

Regeneration Circuits (k=3,4,5)
k=3: 3 x 6b comparators, 2b (half) adder. Saves 2 bits (25%)
k=4: 6 x 6b comparators, 3b (full) adder, 2b (half) adder. Saves 4 bits (25%)
k=5: 10 x 6b comparators, 2 x 3b (full) adder, 3 x 2b (half) adder. Saves 6 bits (18.75%)
Note: area cost is only once per device (shared across all LUTs in entire FPGA)
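The comparator counts 3, 6 and 10 equal the number of input pairs, k(k-1)/2. One reading consistent with that, offered here as an assumption and not as the authors' circuit, is that each comparator tests one pair of 6-bit input identifiers and the adders sum the comparator outputs into Lehmer-code digits. The names below are hypothetical.

```python
from math import comb

def comparator_count(k):
    """One comparator per unordered pair of LUT inputs."""
    return comb(k, 2)

def lehmer_digits(srcs):
    """Digit i counts how many later inputs carry a smaller ID than input i.

    Digit i is a sum of (k - 1 - i) one-bit comparator outputs, which is
    the kind of job a full adder (3 inputs) or half adder (2 inputs) does.
    """
    k = len(srcs)
    return [sum(srcs[j] < srcs[i] for j in range(i + 1, k)) for i in range(k - 1)]

for k in (3, 4, 5):
    print(k, comparator_count(k))
print(lehmer_digits([7, 1, 4]))
```

For k = 4 the digits sum 3, 2 and 1 comparator outputs, which lines up with the one full adder and one half adder listed for that case. How the digits are packed into the p_k regenerated bits is not stated in the transcript.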

Regeneration Circuits (k=6, k=7, k=8) (figures)

Regeneration Circuits
All designs are freely available for download.

Cost of Regeneration – Summary
Table columns: k | # LUT bits | # bits saved = log2(k!) | # compares (6-bit compare) | # full adders | # half adders | # 2-input gates
Note: area cost is only once per device (shared across all LUTs in entire FPGA)

Implementation
65nm standard cell implementation
  – Use minimum-size drive strength (not timing critical)
Regeneration circuits are very small
  – Shared across entire device
Table: k = … | Area (μm²) … | Transistors …,058 1,576 2,208

Future Work
k    # LUT bits    # bits saved = log2(k!)    % saved
2    4             1                          25%
(remaining rows: only the "%" signs survive in the transcript)
Can we reach 25% savings with 5+ inputs?
E.g., combine two 4-LUTs to make one 5-LUT

Thank you!