Coarse and Fine Grain Programmable Overlay Architectures for FPGAs

Presentation transcript:

Coarse and Fine Grain Programmable Overlay Architectures for FPGAs Alex Brant Advisor: Guy Lemieux University of British Columbia

Outline Motivation Contributions Prior Work ZUMA FPGA Overlay CARBON-Razor Overlay Summary

Motivation - 1 FPGA Overlays FPGA designs that can be further programmed by the user What are the benefits? Ease of use (simpler languages, tools, etc.) Optimized for particular problem domains Open access to architecture & CAD User-configured logic added to fixed FPGA bitstream Dynamic reconfiguration on any device Portability between vendors and devices

Motivation - 2 Fine Grain Overlay – ZUMA FPGA-like architecture Compatible with VTR CAD tools “Virtual” FPGA for portability of designs Open source for research and applications Implements fine grain part of MALIBU architecture Generic implementation has high area overhead Overcome by utilizing low level FPGA resources, implementing more efficient structures

Motivation - 3 Coarse Grain Overlay – CARBON Array of time-multiplexed ALUs Fast compile High density Efficient mapping of word oriented circuits Implements coarse grain part of MALIBU Time-multiplexing limits overall performance Performance gained using overclocking with error tolerance (CARBON-Razor)

Contributions Area efficient implementation of fine grain routing and logic with LUTRAMs Area efficient 2-stage local routing network and configuration controller Extension of Razor error tolerance from pipelined processors to 2D processing arrays Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

Publications ZUMA: An Open FPGA Overlay Architecture, Alexander Brant and Guy G.F. Lemieux (FCCM 2012) Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew, Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012) CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)

Outline Motivation Contributions Prior Work ZUMA FPGA Overlay CARBON-Razor Overlay Summary

FPGA Architecture Implements any logic function

MALIBU Architecture Hybrid coarse/fine grain FPGA Time-multiplexed ALU (CG) combined with FPGA cluster CG passes data to neighbors through memories

MALIBU Hybrid FPGA CGs are run on fast system clock (e.g. > 1GHz) System clock / Schedule length = User clock rate Advantages: Greater density from time-multiplexing Ability to trade-off between area and speed Compiles up to 300x faster than normal FPGA Better performance for word-oriented circuits
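The relation "System clock / Schedule length = User clock rate" can be sketched in a few lines. The function name and the example numbers (a 1 GHz system clock and a 25-slot schedule) are illustrative assumptions, not values from the slides.

```python
def user_clock_hz(system_clock_hz, schedule_length):
    """User clock rate = system clock / schedule length (time-multiplexed CG)."""
    if schedule_length <= 0:
        raise ValueError("schedule length must be a positive number of timeslots")
    return system_clock_hz / schedule_length


# Hypothetical example: 1 GHz system clock, 25 timeslots per user cycle.
rate = user_clock_hz(1.0e9, 25)
print(f"{rate / 1e6:.1f} MHz")  # prints "40.0 MHz"
```

A longer schedule packs more operations onto each ALU (higher density) at the cost of a lower user clock, which is the area/speed trade-off the slide mentions.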

Razor Timing Error Tolerance Works with feed-forward pipeline circuits Detects timing errors by capturing data a second time with a delayed clock Tolerates errors by stalling pipeline one cycle

Razor Timing Error Example Data captured in main FF Fraction of cycle later, data captured by shadow latch Main FF and Shadow latch are compared If different, shadow data loaded to main FF, pipeline is stalled If not, pipelining proceeds normally
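The capture, compare, and stall steps of the Razor example can be modelled behaviourally. This is a minimal sketch, not the circuit from the talk: the `RazorStage` class, the `arrival` time model, and the numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RazorStage:
    period: float        # main clock period (arbitrary time units)
    shadow_delay: float  # extra time before the shadow latch samples

    def capture(self, old_value, new_value, arrival):
        """Model one cycle. `arrival` is when new_value becomes stable,
        measured from the start of the cycle. Returns (main FF value, error)."""
        # Main FF samples at the clock edge; late data leaves the old value.
        main = new_value if arrival <= self.period else old_value
        # Shadow latch samples a fraction of a cycle later.
        shadow = new_value if arrival <= self.period + self.shadow_delay else old_value
        error = main != shadow
        if error:
            main = shadow  # shadow data loaded to main FF; pipeline stalls
        return main, error


def total_cycles(stage, arrivals):
    """Cycles needed to push one value per entry through the stage,
    charging one extra stall cycle for each detected error."""
    cycles, value = 0, 0
    for i, arrival in enumerate(arrivals):
        value, error = stage.capture(value, i + 1, arrival)
        cycles += 2 if error else 1
    return cycles


stage = RazorStage(period=1.0, shadow_delay=0.3)
print(stage.capture(0, 1, 0.9))  # on time: no error
print(stage.capture(0, 1, 1.2))  # late but within shadow window: corrected
print(total_cycles(stage, [0.9, 1.2, 0.8]))
```

In this model, data arriving later than `period + shadow_delay` is missed by both the main FF and the shadow latch, so no error is flagged; the scheme only corrects errors when the shadow delay is long enough.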

Outline Motivation Contributions Prior Work ZUMA FPGA Overlay CARBON-Razor Overlay Summary

ZUMA Overlay Island style FPGA architecture, implemented on an FPGA Initially implemented in generic Verilog High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT) Area efficiency improvements: Implementation of routing and logic with FPGA LUTRAMs Design of efficient 2-stage local interconnect

ZUMA Layout One tile of ZUMA Architecture

Details - LUTRAM Reprogrammable LUTRAM in Xilinx and Altera Devices

Details – LUTRAM Multiplexer LUTRAM can implement larger MUXs than a normal LUT and needs no extra configuration memory [Figure: 6-LUT configured as a 4-to-1 MUX; 6-LUT in RAM mode configured as a 6-to-1 MUX]
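One way to read the LUTRAM multiplexer slide: a fixed-function 6-LUT spends two inputs on select lines, leaving four data inputs, whereas a LUT whose contents can be rewritten can hold the selection in its truth table, freeing all six inputs for data. The sketch below illustrates that interpretation; the function names are hypothetical and this is not the ZUMA source code.

```python
def passthrough_table(selected, k=6):
    """Truth table (packed into a 2**k-bit integer) that makes a k-input LUT
    output the value of input number `selected`."""
    table = 0
    for addr in range(2 ** k):
        if (addr >> selected) & 1:  # bit `selected` of the address is that input
            table |= 1 << addr
    return table


def lut_eval(table, inputs):
    """Evaluate a LUT: the input bits form the address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> addr) & 1


# "Configure" the MUX to route input 4 by writing the LUT contents.
table = passthrough_table(4)
print(lut_eval(table, [0, 0, 0, 0, 1, 0]))  # follows input 4
print(lut_eval(table, [1, 1, 1, 1, 0, 1]))  # other inputs are ignored
```

Reprogramming the route means rewriting the 64 table bits, so the LUT's own storage doubles as the configuration memory.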

Details – Local Routing Crossbar Two-Stage (I+N) x (k*N) crossbar used in ZUMA Logic Cluster

Results Both Xilinx and Altera versions implemented Our generic version is 125-150 LUTs per eLUT Area overhead as low as 40 Host LUTs per eLUT with improvements Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters

Outline Motivation Contributions Prior Work ZUMA FPGA Overlay CARBON-Razor Overlay Summary

CARBON Overlay FPGA implementation of MALIBU CG Modifications to support FPGA block RAMs Critical Path is Memory to ALU to Memory

CARBON-Razor Razor is applied to the CARBON overlay Error tolerance on memory to memory critical path How to do it: Shadow registers → apply to CARBON memories CARBON schedule → 1-3 extra timeslots for error recovery Stall propagation → extend from 1D pipeline (Razor) to 2D array (CARBON)

CARBON-Razor Memory Shadow register paired with RAM Stratix memory mode allows read-back of previously written data

2D Error Propagation Can’t propagate errors to entire chip fast enough We can propagate it one tile per cycle Error propagation logic can then combine multiple errors into one stall region

2D Error Propagation Example Error at tile at cycle 0 Each cycle, stall propagates to nearest neighbors [Figure: tile grid animated over four cycles; each tile is labeled with the cycle (1 to 4) in which the stall reaches it]

Stall Propagation Logic When an error is detected at a CG: Instruction schedule stalls Memories in CG load from shadow register Any writes from neighbor captured in shadow register Next cycle: Schedule resumes Neighbor’s write performed from shadow register 4 neighbors stall, unless they stalled last cycle Stall region continues in expanding diamond shaped wave
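The expanding diamond-shaped stall wave can be reproduced with a small grid simulation. This is a simplified sketch under stated assumptions: a single error, a rectangular array, and each tile stalling exactly once; the function name and grid size are illustrative, and the shadow-register handling of neighbor writes is not modelled.

```python
def stall_cycles(rows, cols, error_tile):
    """Return {(row, col): cycle} giving the cycle in which each tile stalls
    after a timing error at `error_tile` in cycle 0."""
    stalled_at = {error_tile: 0}
    frontier = [error_tile]
    cycle = 0
    while frontier:
        cycle += 1
        next_frontier = []
        for r, c in frontier:
            # The stall reaches the four nearest neighbors one cycle later.
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < rows and 0 <= nc < cols
                # Assumption: a tile that has already stalled does not stall again.
                if inside and (nr, nc) not in stalled_at:
                    stalled_at[(nr, nc)] = cycle
                    next_frontier.append((nr, nc))
        frontier = next_frontier
    return stalled_at


# Hypothetical 5x5 array with the error in the centre tile.
wave = stall_cycles(5, 5, (2, 2))
for r in range(5):
    print(" ".join(str(wave[(r, c)]) for c in range(5)))
```

Under these assumptions each tile stalls at a cycle equal to its Manhattan distance from the error tile, which is what produces the diamond shape, and no signal ever has to travel more than one tile per cycle.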

CARBON Schedule Extension We add 1-3 cycles of slack to schedule Allows margin of safety Speedup determined by difference in FMAX and schedule length If no hard deadline is needed (e.g. when used as compute accelerator), average extension of schedule can be used to find speedup
$\text{Speedup} = \dfrac{FMAX_{Razor} \times SL_{Base}}{FMAX_{Base} \times SL_{Razor}}$
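The speedup ratio, (FMAX-Razor × SL-Base) / (FMAX-Base × SL-Razor), is easy to evaluate numerically. In the sketch below the two clock frequencies are hypothetical; the schedule lengths assume a 24-slot baseline extended by 2 slots.

```python
def razor_speedup(fmax_razor, fmax_base, sl_base, sl_razor):
    """Speedup = (FMAX_Razor * SL_Base) / (FMAX_Base * SL_Razor).

    A value above 1.0 means the overclocked design wins even after paying
    for the extra error-recovery timeslots in its schedule."""
    return (fmax_razor * sl_base) / (fmax_base * sl_razor)


# Hypothetical clocks: 200 MHz baseline, 240 MHz with Razor.
# Schedule grows from 24 to 26 timeslots (2 extra cycles of slack).
s = razor_speedup(fmax_razor=240e6, fmax_base=200e6, sl_base=24, sl_razor=26)
print(f"speedup: {(s - 1) * 100:.1f}%")
```

When there is no hard deadline, `sl_razor` can be a non-integer average (baseline length plus the mean number of stall cycles actually taken) rather than the worst-case extension.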

Results Performance compared between CARBON and CARBON-Razor for 4 benchmarks Maximum performance found by pushing clock speed and shadow register delay Average increases to 14% with no hard deadline
Benchmark     SL   Extra Cycles   Speedup
Random Ops    24   2              11%
Wang          28   1              6%
Mean(256)     67                  20%
PR            29                  3%
Average                           13%

Contributions Area efficient implementation of FPGA routing and logic with LUTRAMs Area efficient 2-stage local routing network and configuration controller Extension of Razor error tolerance from pipelined processors to 2D processing arrays Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

Summary Fine Grain Overlay – ZUMA FPGA-like architecture, compatible with VTR CAD tools High area overhead implementing fine grain structures Overcome by utilizing FPGA resources, implementing alternate structures Area reduced to 40 host LUTs per eLUT, 3x improvement Coarse Grain Overlay – CARBON Fast compile, efficient mapping of word oriented circuits Time-multiplexing decreases overall performance Performance gained using overclocking with error tolerance Speedup of 13% on average compared to baseline design

Thank you

ZUMA Config Controller

LUTRAM Crossbar

CARBON Razor Timing Shadow register latches correct data if delay is sufficient

CARBON-Razor Stall Logic

CARBON-Razor Test