Published by Magnus Wheeler. Modified over 9 years ago.
1
Coarse and Fine Grain Programmable Overlay Architectures for FPGAs
Alex Brant
Advisor: Guy Lemieux
University of British Columbia
2
Outline
  Motivation
  Contributions
  Prior Work
  ZUMA FPGA Overlay
  CARBON-Razor Overlay
  Summary
3
Motivation - 1
FPGA Overlays
  FPGA designs that can be further programmed by the user
What are the benefits?
  Ease of use (simpler languages, tools, etc.)
  Optimized for particular problem domains
  Open access to architecture & CAD
  User-configured logic added to fixed FPGA bitstream
  Dynamic reconfiguration on any device
  Portability between vendors and devices
4
Motivation - 2
Fine Grain Overlay – ZUMA
  FPGA-like architecture
  Compatible with VTR CAD tools
  “Virtual” FPGA for portability of designs
  Open source for research and applications
  Implements fine grain part of MALIBU architecture
Generic implementation has high area overhead
  Overcome by utilizing low level FPGA resources, implementing more efficient structures
5
Motivation - 3
Coarse Grain Overlay – CARBON
  Array of time-multiplexed ALUs
  Fast compile
  High density
  Efficient mapping of word oriented circuits
  Implements coarse grain part of MALIBU
Time-multiplexing limits overall performance
  Performance gained using overclocking with error tolerance (CARBON-Razor)
6
Contributions
  Area efficient implementation of fine grain routing and logic with LUTRAMs
  Area efficient 2-stage local routing network and configuration controller
  Extension of Razor error tolerance from pipelined processors to 2D processing arrays
  Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
7
Publications
  ZUMA: An Open FPGA Overlay Architecture. Alexander Brant and Guy G.F. Lemieux (FCCM 2012)
  Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew. Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012)
  CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)
9
FPGA Architecture
  Implements any logic function
10
MALIBU Architecture
  Hybrid coarse/fine grain FPGA
  Time-multiplexed ALU (CG) combined with FPGA cluster
  CG passes data to neighbors through memories
11
MALIBU Hybrid FPGA
  CGs are run on fast system clock (e.g. > 1 GHz)
  System clock / Schedule length = User clock rate
Advantages:
  Greater density from time-multiplexing
  Ability to trade off between area and speed
  Compiles up to 300x faster than normal FPGA
  Better performance for word-oriented circuits
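The clock relation on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the 1 GHz system clock and schedule length of 25 are assumed values chosen for illustration, not figures from the presentation.

```python
def user_clock_hz(system_clock_hz: float, schedule_length: int) -> float:
    """User clock rate = system clock / schedule length.

    schedule_length is the number of system-clock timeslots that make up
    one user clock cycle on the time-multiplexed CG.
    """
    if schedule_length < 1:
        raise ValueError("schedule length must be at least 1")
    return system_clock_hz / schedule_length


# Assumed values, for illustration only.
rate = user_clock_hz(1e9, 25)
print(f"user clock: {rate / 1e6:.0f} MHz")
```

A longer schedule packs more operations onto each ALU (higher density) at the cost of a lower user clock rate, which is the area/speed trade-off listed under the advantages.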
12
Razor Timing Error Tolerance
  Works with feed-forward pipeline circuits
  Detects timing errors by capturing data a second time with a delayed clock
  Tolerates errors by stalling pipeline one cycle
13
Razor Timing Error Example
  Data captured in main FF
  Fraction of cycle later, data captured by shadow latch
  Main FF and shadow latch are compared
  If different, shadow data loaded to main FF, pipeline is stalled
  If not, pipelining proceeds normally
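The capture-compare-recover sequence in this example can be sketched as a small behavioural model in Python. All names below are hypothetical, and timing is reduced to a single data arrival time in arbitrary units; it is an illustration of the idea, not the Razor circuit itself.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    value: int   # value held in the main FF after any correction
    error: bool  # True if main FF and shadow latch disagreed
    stall: bool  # True if the pipeline must stall one cycle


def razor_capture(new_value: int, stale_value: int, arrival: float,
                  period: float, shadow_delay: float) -> RazorResult:
    """Model one capture. Times are in arbitrary units (an assumption)."""
    if arrival > period + shadow_delay:
        raise ValueError("data also misses the shadow latch; not recoverable")
    # Main FF samples at the clock edge: it sees the new value only if
    # the data arrived within the clock period.
    main_ff = new_value if arrival <= period else stale_value
    # Shadow latch samples a fraction of a cycle later, so it sees the
    # new value whenever the check above passed.
    shadow = new_value
    if main_ff != shadow:
        # Mismatch: reload main FF from the shadow latch and stall.
        return RazorResult(value=shadow, error=True, stall=True)
    return RazorResult(value=main_ff, error=False, stall=False)


on_time = razor_capture(7, 3, arrival=0.9, period=1.0, shadow_delay=0.3)
late = razor_capture(7, 3, arrival=1.2, period=1.0, shadow_delay=0.3)
print(on_time)
print(late)
```

In the on-time case the two captures agree and the pipeline proceeds; in the late case the main FF holds the stale value, the comparison fails, and the shadow data is loaded with a one-cycle stall.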
19
ZUMA Overlay
  Island style FPGA architecture, implemented on an FPGA
  Initially implemented in generic Verilog
  High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT)
Area efficiency improvements:
  Implementation of routing and logic with FPGA LUTRAMs
  Design of efficient 2-stage local interconnect
20
ZUMA Layout
  One tile of ZUMA Architecture
21
Details – LUTRAM
  Reprogrammable LUTRAM in Xilinx and Altera Devices
22
Details – LUTRAM Multiplexer
  LUTRAM can implement larger MUXs than a normal LUT and needs no extra configuration memory
  6-LUT configured as a 6-to-1 MUX in RAM mode
  6-LUT configured as a 4-to-1 MUX otherwise
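One way to read this slide: in RAM mode the LUT contents are rewritable, so the choice of which input to pass can be stored in the truth table itself, leaving all six inputs free for data. A fixed LUT has to spend two of its six inputs on select lines that are driven from separate configuration bits. The Python sketch below models both cases; the function names are hypothetical and are not taken from the ZUMA source.

```python
def lut_lookup(table: list, inputs: list) -> int:
    """Read a k-input LUT: the input bits form the address into the table."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


def passthrough_table(k: int, selected: int) -> list:
    """Truth table that makes a k-input LUT output its `selected` input.

    Rewriting this table is what 'reconfigures' the LUTRAM-based mux.
    """
    return [(address >> selected) & 1 for address in range(2 ** k)]


def fixed_mux4_table() -> list:
    """Fixed 6-LUT as a 4-to-1 mux: inputs 0-3 are data, 4-5 are select."""
    table = []
    for address in range(64):
        select = (address >> 4) & 0b11
        table.append((address >> select) & 1)
    return table


inputs = [0, 1, 0, 0, 1, 0]

# RAM mode: all six inputs are data; the table contents pick input 4.
ram_mux = passthrough_table(6, selected=4)
print(lut_lookup(ram_mux, inputs))

# Fixed mode: inputs[4:6] act as select lines (value 1 here), so the
# output is data input 1.
print(lut_lookup(fixed_mux4_table(), inputs))
```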
23
Details – Local Routing Crossbar
Two-Stage (I+N) x (k*N) crossbar used in ZUMA Logic Cluster
24
Results
  Both Xilinx and Altera versions implemented
  Our generic version is LUTs per eLUT
  Area overhead as low as 40 host LUTs per eLUT with improvements
  Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters
26
CARBON Overlay
  FPGA implementation of MALIBU CG
  Modifications to support FPGA block RAMs
  Critical path is memory to ALU to memory
27
CARBON-Razor
Razor is applied to the CARBON overlay
  Error tolerance on memory to memory critical path
How to do it:
  Shadow registers applied to CARBON memories
  CARBON schedule extended by 1-3 extra timeslots for error recovery
  Stall propagation extended from 1D pipeline (Razor) to 2D array (CARBON)
28
CARBON-Razor Memory
  Shadow register paired with RAM
  Stratix memory mode allows read-back of previously written data
29
2D Error Propagation
  Can’t propagate errors to entire chip fast enough
  We can propagate an error one tile per cycle
  Error propagation logic can then combine multiple errors into one stall region
30
2D Error Propagation Example
  Error at tile at cycle 0
  Each cycle, stall propagates to nearest neighbors
  [Figure, built up over slides 30 to 35: array of tiles, each labeled with the cycle (1 to 4) in which the stall reaches it]
36
Stall Propagation Logic
When an error is detected at a CG:
  Instruction schedule stalls
  Memories in CG load from shadow register
  Any writes from neighbor captured in shadow register
Next cycle:
  Schedule resumes
  Neighbor’s write performed from shadow register
  4 neighbors stall, unless they stalled last cycle
Stall region continues in expanding diamond shaped wave
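The expanding diamond can be reproduced with a short simulation. This is a simplified sketch in Python under stated assumptions: a single error, a rectangular array of tiles, and each tile stalling exactly once; the function name and grid size are hypothetical.

```python
def stall_wave(rows: int, cols: int, origin: tuple, cycles: int) -> list:
    """Return a grid giving the cycle in which each tile stalls.

    Tiles the wave has not reached after `cycles` cycles hold None.
    """
    stalled_at = [[None] * cols for _ in range(rows)]
    r0, c0 = origin
    stalled_at[r0][c0] = 0  # error detected here at cycle 0
    frontier = [(r0, c0)]
    for cycle in range(1, cycles + 1):
        next_frontier = []
        for r, c in frontier:
            # Each stalled tile passes the stall to its 4 nearest neighbors,
            # skipping tiles that have already stalled.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and stalled_at[nr][nc] is None:
                    stalled_at[nr][nc] = cycle
                    next_frontier.append((nr, nc))
        frontier = next_frontier
    return stalled_at


grid = stall_wave(5, 5, origin=(2, 2), cycles=4)
for row in grid:
    print(" ".join("." if v is None else str(v) for v in row))
```

Under these assumptions each tile's stall cycle equals its Manhattan distance from the error tile, which is why the stall region forms a diamond.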
37
CARBON Schedule Extension
  We add 1-3 cycles of slack to schedule
  Allows margin of safety
  Speedup determined by difference in FMAX and schedule length
  If no hard deadline is needed (e.g. when used as compute accelerator), average extension of schedule can be used to find speedup

$$\text{Speedup} = \frac{F_{\text{MAX-Razor}} \times SL_{\text{Base}}}{F_{\text{MAX-Base}} \times SL_{\text{Razor}}}$$
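The speedup ratio can be evaluated directly. In the Python sketch below the function name is hypothetical and the two FMAX values are assumed purely for illustration; only the formula itself comes from the slide. The extended schedule length is taken to be the base schedule length plus the added slack cycles.

```python
def razor_speedup(fmax_razor: float, fmax_base: float,
                  sl_base: int, sl_razor: int) -> float:
    """Speedup = (FMAX-Razor * SL-Base) / (FMAX-Base * SL-Razor)."""
    return (fmax_razor * sl_base) / (fmax_base * sl_razor)


# Assumed clock rates, for illustration only (not measured values).
fmax_base = 200e6
fmax_razor = 250e6

sl_base = 24           # base schedule length in timeslots
sl_razor = sl_base + 2  # two cycles of slack added for error recovery

speedup = razor_speedup(fmax_razor, fmax_base, sl_base, sl_razor)
print(f"speedup: {(speedup - 1) * 100:.1f}%")
```

The overclocking gain is partly paid back by the longer schedule, so a net speedup requires the FMAX ratio to exceed the schedule-length ratio.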
38
Results
  Performance compared between CARBON and CARBON-Razor for 4 benchmarks
  Maximum performance found by pushing clock speed and shadow register delay
  Average increases to 14% with no hard deadline

  Benchmark     SL    Extra Cycles    Speedup
  Random Ops    24    2               11%
  Wang          28    1               6%
  Mean(256)     67                    20%
  PR            29                    3%
  Average                             13%
39
Contributions
  Area efficient implementation of FPGA routing and logic with LUTRAMs
  Area efficient 2-stage local routing network and configuration controller
  Extension of Razor error tolerance from pipelined processors to 2D processing arrays
  Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
40
Summary
Fine Grain Overlay – ZUMA
  FPGA-like architecture, compatible with VTR CAD tools
  High area overhead implementing fine grain structures
  Overcome by utilizing FPGA resources, implementing alternate structures
  Area reduced to 40 host LUTs per eLUT, 3x improvement
Coarse Grain Overlay – CARBON
  Fast compile, efficient mapping of word oriented circuits
  Time-multiplexing decreases overall performance
  Performance gained using overclocking with error tolerance
  Speedup of 13% on average compared to baseline design
41
Thank you
42
ZUMA Config Controller
43
LUTRAM Crossbar
44
CARBON-Razor Timing
  Shadow register latches correct data if delay is sufficient
45
CARBON-Razor Stall Logic
46
CARBON-Razor Test