1
Simulation of Fracturable LUTs
Tim Pifer
2
Presentation Overview
Altera ALM from Stratix II
Stratix V architecture
Current VPR method for fracturable LUTs
WireMap for technology mapping
AAPack for packing
3
Altera Adaptive Logic Module
Traditional 4-LUTs provide the best area-delay product
Larger LUTs:
  Shorter critical path
  Absorb more logic
  Larger LUT mask
  More input muxing
Reduce critical path depth by 20%, improving area
"Improving FPGA Performance and Area Using an Adaptive Logic Module", Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy Lee, Henry Kim and Rahul Saini
4
Motivation from other architectures
BLE5: 15% fewer LUTs, 25% shorter unit delay
BLE6: 22% fewer LUTs, 36% shorter unit delay
BLE7: 28% fewer LUTs, 46% shorter unit delay
K=6: 25% 6-LUTs
Example design:
  K=4: 100 LUTs
  K=6: 78 LUTs: 23 6-LUT, 32 5-LUT, 17 4-LUT, 9 3-LUT, 13 2-LUT
Stratix, Virtex-II: 30:1 input mux
Relative contribution of routing area and interconnect delay increases with each generation of fabrication
5
Simple Example: 6LUT from 4BLEs
Larger area
19 inputs, 4 registers
For a 6-LUT, the 4-LUTs have identical inputs but separate input muxes
3 of 4 registers and outputs wasted
6
Improved Example: 6,2 Fracturable LE
8 inputs, 2 outputs, 2 registers
1 6-LUT
2 5-LUTs with input sharing
2 independent 4-LUTs
Comparable in area to two BLE4s
Functionally closer to two BLE5 logic elements
7
Final Version
Composed of 3-LUTs
Added d2 muxed output
c1 or GND, c2 or VCC muxed
Remove mux from d1
Swap muxes controlled by R and T
Two 6-LUTs share 4 inputs, identical LUT mask
  4:1 muxes: common data, different select lines
  Up to 12% pairs of 6-LUTs
R=0, T=1, S=1 implements 2 muxed 5-LUTs with 7 inputs:
  F1 = fn(a1,a2,b1,b2,d1)
  F2 = fn(a1,a2,b2,c2,d1)
  Out = mux(F1,F2,c1)
Roughly area-neutral with BLE4 and 36% decrease in logic depth
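The 7-input mode (two 5-LUTs muxed by c1) can be modelled behaviourally with a short Python sketch. The `lut5` helper, the bit ordering of the truth-table masks, and the choice that c1 = 1 selects F2 are assumptions made for illustration; the slide only gives the three equations.

```python
def lut5(mask, inputs):
    """Evaluate a 5-input LUT.

    mask   -- 32-bit truth table (bit i holds the output for input index i)
    inputs -- five 0/1 values; the first is the most significant index bit
              (this ordering is an assumption of the sketch)
    """
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return (mask >> index) & 1


def alm_7input(mask1, mask2, a1, a2, b1, b2, c1, c2, d1):
    """Out = mux(F1, F2, c1) with the input assignment from the slide."""
    f1 = lut5(mask1, (a1, a2, b1, b2, d1))  # F1 = fn(a1,a2,b1,b2,d1)
    f2 = lut5(mask2, (a1, a2, b2, c2, d1))  # F2 = fn(a1,a2,b2,c2,d1)
    return f2 if c1 else f1                 # assumed polarity: c1 = 1 picks F2
```

Only seven distinct signals reach the block because F1 and F2 share a1, a2, b2 and d1, which is what lets the structure act as a restricted 7-input function.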
8
How do we set RSTU for a 6LUT?
RST = 110
U set to zero allows for a second output with d2
9
8:1 mux implementation
8:1 mux in 2 ALMs (4 ALUTs) using 7-input functions
Second ALM computes the output:
  F1 = fn(s0,s1,d3,y0,y1)
  F2 = fn(s0,s1,d7,y0,y1)
  Mux controlled by s2
5 BLE4 vs. 2 ALMs, saves one BLE4
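One way to see that the second-ALM equations are sufficient is to check a candidate decomposition exhaustively. In the Python sketch below, the split of work in the first ALM (y0 from d0..d2, y1 from d4..d6, both steered by s0 and s1) is an assumption; the slide only specifies the inputs of F1 and F2 and that s2 controls the final mux.

```python
from itertools import product


def mux8_reference(d, s0, s1, s2):
    """Plain 8:1 mux: select d[s2 s1 s0]."""
    return d[(s2 << 2) | (s1 << 1) | s0]


def mux8_two_alms(d, s0, s1, s2):
    sel = (s1 << 1) | s0
    # First ALM (assumed split): two 5-input functions sharing s0 and s1.
    y0 = d[sel] if sel < 3 else 0      # fn(s0,s1,d0,d1,d2); sel == 3 is a don't-care
    y1 = d[4 + sel] if sel < 3 else 0  # fn(s0,s1,d4,d5,d6); sel == 3 is a don't-care
    # Second ALM: 7-input function of s0, s1, s2, d3, d7, y0, y1.
    f1 = d[3] if sel == 3 else y0      # F1 = fn(s0,s1,d3,y0,y1)
    f2 = d[7] if sel == 3 else y1      # F2 = fn(s0,s1,d7,y0,y1)
    return f2 if s2 else f1            # mux controlled by s2


def decomposition_is_correct():
    """Compare both models over all 2**11 input combinations."""
    for bits in product((0, 1), repeat=11):
        d, (s0, s1, s2) = bits[:8], bits[8:]
        if mux8_two_alms(d, s0, s1, s2) != mux8_reference(d, s0, s1, s2):
            return False
    return True
```

Under this split the first ALM sees eight inputs (s0, s1 and six data bits) and the second sees seven, which matches the input counts quoted on the slides.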
10
Stratix V
ALM can become 2 4-LUTs
Eight inputs for both ALUTs
Backward-compatible with 4-LUT architectures
"Logic Array Blocks and Adaptive Logic Modules in Stratix V Devices"
11
Stratix V
Comment on adders and carry chains
12
Normal Modes
13
Normal LUT mode
Single 6-LUT mode; other inputs used for registers
14
Extended LUT mode
7-input function
2-to-1 multiplexer selecting between two 5-LUTs that share 4 inputs
Useful for if-else statements
15
Why 6-LUTs: DES Example
DES: 8 S-boxes (substitution tables)
Each S-box has 6 inputs, 4 outputs
Each output: 1 6-LUT vs. 6 4-LUTs
35-45% less area
16
How would we alter technology mapping to best support FLUTs?
17
Technology Mapping
1 4-LUT and 2 6-LUTs require 3 ALMs
Could use 4 5-LUTs requiring 2 ALMs with the same logic depth
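The ALM counts in this example follow from a very simple occupancy model, sketched here in Python. The model is an optimistic assumption: every 6-LUT takes a whole ALM and any two LUTs of five or fewer inputs are assumed to pair up, ignoring the input-sharing limits that a real packer would have to check.

```python
import math


def alms_needed(lut_sizes):
    """Estimate ALMs for a list of LUT sizes under the simplified model.

    A 6-LUT occupies one ALM; LUTs with <= 5 inputs are packed two per ALM
    (assumes input-sharing constraints are always satisfiable).
    """
    if any(k < 1 or k > 6 for k in lut_sizes):
        raise ValueError("this sketch only handles LUT sizes 1..6")
    big = sum(1 for k in lut_sizes if k == 6)
    small = sum(1 for k in lut_sizes if k <= 5)
    return big + math.ceil(small / 2)


# The two mappings from the slide:
mapping_a = [4, 6, 6]     # 1 4-LUT and 2 6-LUTs
mapping_b = [5, 5, 5, 5]  # 4 5-LUTs
```

With this model `alms_needed(mapping_a)` is 3 and `alms_needed(mapping_b)` is 2, matching the slide.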
18
Balancing Technology Mapping
Must maintain optimal critical path depth
More packable LUT distribution
Avoid 6-LUTs when they do not help delay
8:1 muxes identified separately and mapped to 7-input functions
7% of ALMs are 7-input functions
19
Results: Performance
80 designs tested
130 nm process
Minimum chip size used
SPICE models for delay
20
Results: Area
21
Stratix vs. Stratix II
22
Conclusions
Benefits of 6-LUTs without underutilization
Larger LUT costs:
  LUT-mask size
  Input and output muxing
  FFs
6-LUT is fracturable into 5-LUTs, with area comparable to 2 BLE4s
7-input functions and 6-input pairs
Technology mapping support is needed for best results
6,2 ALM vs. BLE4: 15% better performance, 12% smaller area on average
23
How do we need to alter VPR to support FLUTs?
24
AAPack and WireMap
ABC with WireMap: technology mapping to primitives
AAPack: capable of packing complex logic blocks based on logic primitives
25
WireMap
Reduces the percentage of 6-LUTs
Does not increase:
  logic depth
  total LUT count
"WireMap: FPGA Technology Mapping for Improved Routability", Stephen Jang, Billy Chan, Kevin Chung, Alan Mishchenko
26
AAPack Overview
Current tools can’t support the complexity of logic blocks
New logic block description language:
  Depicts complex interconnect
  Hierarchy
  Modes of operation
Can pack complex blocks
Area-driven
Area is compared to the theoretical minimum
Verilog input for large benchmarks
"Architecture Description and Packing for Logic Blocks with Hierarchy, Modes and Complex Interconnect", Jason Luu, Jason Anderson, and Jonathan Rose
27
Example: Virtex-6 Logic Block
Tools don’t support Stratix IV or Virtex-6
Virtex-6:
  complex soft logic blocks
  hard memories
  multipliers
28
What AAPack does
Can describe:
  complex logic blocks with arbitrary internal routing structures
  variable memory configurations: 4Kx8, 8Kx4, or 16Kx2
Area-driven packing inputs:
  user design
  architectural description
29
Complex Block Description Language
Expressive: The language should be capable of describing a wide range of complex blocks.
Simple: The language constructs should match closely with an FPGA architect’s existing knowledge and intuition.
Concise: The language should permit complex blocks to be described as concisely as possible.
30
Physical blocks
Specified in XML
Hierarchy: other blocks and existing primitives
Inputs, outputs, and clocks with pin numbers
31
Primitives
Common primitives are handled in the language
LUT inputs can be reordered; memory address inputs cannot
32
Intra-Block Interconnect
complete: crossbar switch; internal programmable signal
direct: direct wire connection; no programmability
mux: multiplexed connection, single-bit or bus; programmable signal
33
Modes of Operation
Mutually exclusive functionality
Represent FPGA structures being used in different ways
34
Packing Algorithm
Input: technology-mapped netlist, XML architecture
Output: packed complex blocks
Greedy algorithm similar to other packing methods:
  While not all blocks are packed:
    Select and pack a seed block s
    Open a new complex block B for s
    Pack additional blocks into B:
      Choose a compatible block c
      Pack c into B if valid
    Add B to the packed list
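The greedy loop can be sketched in a few lines of Python. This is a toy model, not AAPack itself: the netlist is a hypothetical dict from block name to the set of nets it touches, candidates are ranked by shared-net count only, and the `capacity` and `max_pins` limits stand in for the real location and routability checks.

```python
def greedy_pack(netlist, capacity, max_pins):
    """Greedy seed-and-grow clustering.

    netlist  -- dict: block name -> set of net names (hypothetical format)
    capacity -- max netlist blocks per complex block (stand-in for legality)
    max_pins -- max distinct nets per complex block (crude pin limit)
    Returns a list of clusters, each a list of block names.
    """
    unpacked = dict(netlist)
    packed = []
    while unpacked:
        # Seed: the unpacked block with the most nets (ties broken by name).
        seed = max(sorted(unpacked), key=lambda b: len(unpacked[b]))
        cluster, nets = [seed], set(unpacked.pop(seed))
        while len(cluster) < capacity:
            # Compatible candidates share at least one net with the cluster
            # and keep the cluster within the pin limit.
            valid = [b for b in sorted(unpacked)
                     if unpacked[b] & nets
                     and len(nets | unpacked[b]) <= max_pins]
            if not valid:
                break
            best = max(valid, key=lambda b: len(unpacked[b] & nets))
            nets |= unpacked.pop(best)
            cluster.append(best)
        packed.append(cluster)
    return packed


example = {
    "lut_a": {"n1", "n2", "n3"},
    "ff_a": {"n3", "n4"},
    "lut_b": {"n5", "n6"},
    "ff_b": {"n6", "n7"},
}
clusters = greedy_pack(example, capacity=2, max_pins=8)
```

In the example each LUT ends up clustered with the flip-flop it drives, because those are the only blocks that share a net.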
35
Selecting Netlist and Complex Blocks
Choose the block with the most nets attached as the seed
Candidates are selected based on the affinity in equation 1
Affinity = shared nets and connections, divided by the number of pins the new block would add
Connections is a measure of how likely the new block is to need external connections
Alpha is set to 0.9
36
Legality: Location
Attempting to pack:
  Chooses a location
  Verifies routing
Traverses the complex block as a tree
  Ordered smallest to largest, right to left
  Traversed right to left to ensure smallest resource consumption
Attempts to pack the other nodes in the sub-tree: find a flip-flop for a LUT
30 packs on the sub-tree
37
Legality: Routability
Initially, check whether packing would exceed the external pin count
Then, generate a routing graph for the complex block
Assume any output can connect to any input of a complex block (switchbox architecture)
Apply PathFinder
38
Memory
Primitives are technology mapped with a single-bit width
A 256x8 memory is mapped as eight 256x1 memories
Primitives are mapped to the same component if their bus signals are identical
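The regrouping of single-bit memory slices can be illustrated with a small Python sketch. The tuple format `(name, address_bus, control)` and the use of the address bus plus a control signal as the grouping key are assumptions for illustration; the slide only says that slices with identical bus signals map to the same component.

```python
from collections import defaultdict


def group_memory_slices(slices):
    """Group 1-bit-wide memory primitives that share identical bus signals.

    slices -- iterable of (name, address_bus, control) tuples, where
              address_bus is a sequence of net names (hypothetical format)
    Returns a dict: (address bus, control) -> list of slice names.
    """
    groups = defaultdict(list)
    for name, address_bus, control in slices:
        groups[(tuple(address_bus), control)].append(name)
    return dict(groups)


# A 256x8 memory arrives as eight 256x1 slices on the same 8-bit address bus.
addr = [f"addr[{i}]" for i in range(8)]
slices = [(f"mem_bit{i}", addr, "we") for i in range(8)]
slices.append(("other_mem_bit0", [f"addr2[{i}]" for i in range(8)], "we"))
groups = group_memory_slices(slices)
```

Here the eight `mem_bit*` slices land in one group and can be packed into a single physical memory, while `other_mem_bit0` stays separate because its address bus differs.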
39
Limitations
No support for timing in this implementation
A primitive can map to only one type of complex logic block
  e.g. flip-flops can only be used in a LUT complex block; they cannot also be present in multiplier complex blocks
40
What are some faults of this packing method?
41
Experiments
Verilog benchmarks
Soft processors
Image processors
42
Fracturable LUTs
CLBs:
  Fully connected BLEs
  FI x N: no pin sharing
  8 BLEs
BLEs:
  1 FLUT
  2 flip-flops
  2 outputs
FLUTs:
  2 modes: 6-LUT, dual 5-LUTs
  Variable number of inputs
  Dual-mode input sharing depends on the number of inputs
43
FLUT Evaluation
Compare achieved area with the lower bound
Lower bound: number of complex blocks needed to contain the primitives, without routing considerations
Efficiency: ratio of the achieved number of logic blocks to this value
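As a worked illustration of this metric, the Python sketch below computes a lower bound and an efficiency figure. Two things are assumptions: the lower bound is modelled as a simple ceiling of primitives over block capacity, and efficiency is expressed as lower bound divided by achieved blocks so that 1.0 means a perfect pack (the slide does not say which way round the ratio is taken).

```python
import math


def lower_bound_blocks(num_primitives, primitives_per_block):
    """Blocks needed if only capacity mattered (no routing constraints)."""
    return math.ceil(num_primitives / primitives_per_block)


def packing_efficiency(achieved_blocks, num_primitives, primitives_per_block):
    """Lower bound / achieved blocks; 1.0 is ideal (assumed orientation)."""
    bound = lower_bound_blocks(num_primitives, primitives_per_block)
    return bound / achieved_blocks


# Hypothetical numbers: 100 primitives, 16 per block, packer used 8 blocks.
bound = lower_bound_blocks(100, 16)
eff = packing_efficiency(8, 100, 16)
```

With these made-up numbers the bound is 7 blocks and the efficiency is 0.875.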
44
Efficiency Results
Number of inputs FI varied from 5 to 10
Geometric average across 5 benchmarks
FI = 5 means all inputs are shared; FI = 10 means no inputs are shared
FI = 6 or 7 achieves tolerable efficiency
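The relationship between FI and input sharing in dual 5-LUT mode can be made concrete with a short Python sketch. The function names are hypothetical, and the pairing test (each function uses at most five inputs and their union fits in FI pins) is a simplified reading of the slide, not the packer's actual legality check.

```python
def shared_inputs_required(fi):
    """Inputs two full 5-LUTs must share to fit in a FLUT with FI pins."""
    if not 5 <= fi <= 10:
        raise ValueError("FI is varied from 5 to 10 in these experiments")
    return 2 * 5 - fi


def can_share_flut(inputs_a, inputs_b, fi):
    """True if two LUT functions fit the dual 5-LUT mode of an FI-input FLUT."""
    a, b = set(inputs_a), set(inputs_b)
    return len(a) <= 5 and len(b) <= 5 and len(a | b) <= fi


f = {"a", "b", "c", "d", "e"}
g = {"a", "b", "c", "x", "y"}  # shares three inputs with f
```

For example, `f` and `g` together need seven distinct pins, so they fit when FI is 7 but not when FI is 6.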
45
Logic blocks and Channel width with number of inputs
Number of blocks decreases until FI = 7
Channel width first increases with the number of inputs (more routing to each block), then decreases after FI = 7 (full efficiency, so easier routing)
46
Memory
Varied number of bits and maximum width
Best utilization: smallest size, maximum width
47
CLB consumption with memory size
Smaller memories: more logic due to muxes
Best results: multiple memory sizes
48
Conclusions
New language can describe complex architectures using:
  Hierarchy
  Modes
  Arbitrary interconnect
Packing algorithm for this architecture
Verified on large benchmarks
Needs timing-driven packing
49
How can we get additional improvement from technology mapping?
50
Academic FLUTs
Soft logic
4 architectures: K = 6, M = 5, 6, 7, 8
M5: dual-output 6-LUT of a Xilinx Virtex-5
M8: Stratix II ALM
"Exploring FPGA Technology Mapping for Fracturable LUT Minimization", David Dickin, Lesley Shannon
51
BLE
BLE:
  1 FLUT
  2 registers
  8 inputs
  4 outputs
52
LUT Balancing Experiments
WireMap: no LUT balancing
WireMap: with LUT balancing
  Increase the cost of LUT5 and LUT6 from 1.0 to 2.5 in 0.1 increments
  Smaller LUT weights unchanged
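The weight sweep used in the balancing experiments can be enumerated with a few lines of Python. This sketch only generates the cost tables; how the costs are passed to the mapper is not shown on the slide, and the baseline cost of 1.0 for LUTs of four or fewer inputs is an assumption.

```python
def balancing_weights():
    """Weights 1.0, 1.1, ..., 2.5 (built from integer tenths to avoid float drift)."""
    return [tenths / 10 for tenths in range(10, 26)]


def lut_costs(weight56, k_max=6):
    """Per-size LUT cost table: LUT5 and LUT6 get the swept weight,
    smaller LUTs keep an assumed baseline of 1.0."""
    return {k: (weight56 if k >= 5 else 1.0) for k in range(2, k_max + 1)}


sweep = [lut_costs(w) for w in balancing_weights()]
```

The sweep contains 16 cost tables, one per weight setting, with the first equal to the unbalanced case where every LUT size costs 1.0.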
53
Varying 6LUT weight
54
Varying 5LUT and 6LUT weight
55
Different Architectures
56
Clock Frequency
57
FLUT Reduction by Architecture