Download presentation
Presentation is loading. Please wait.
Published byKerry Montgomery Modified over 9 years ago
1
A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead Heng Tan and Ronald F. DeMara University of Central Florida
2
Agenda Introduction Previous work on reducing Partial Reconfiguration overhead Develop a multi-step physical area management strategy Experimental case studies: 4 simple circuits: serial, parallel, block multiplier resource,high fanout carry chain LUT multiplier 2 large circuits: SECDED, MD5 step function (new) Conclusion and future work
3
Motivation For partial reconfiguration operations, especially those running on a SOC configuration, storage space and reconfiguration speed are crucial. Find way to optimize them...
4
Previous Work Architecture-level [Ganesan2000] Pipelining overlap execution of one temporal partition with reconfiguration of another Logical level [Raghuraman2005] Minimization of frames relating the number of frames that need to be downloaded into FPGAs to the number of minterms of a specially constructed logic function Hardware level [Compton2002][Hauck1999] Requires dedicated HW component Compression or defragmentation performed externally at runtime Physical-level resource management? Not specifically addressed for partial reconfiguration modules in literature But can be done … and … also cascaded with above techniques for multiplicative benefit
5
Module Based Flow Including Fixed modules and PR modules Using Bus Macro Suitable for potential full automation Primary PR technique to study Basic concept and capability for PR proposed by Xilinx
6
Column-Level Configuration Format For Virtex II/Pro series Including IOB, IOI, CLB, GCLK, BlockRAM, and BlockRAM Interconnect Labeling with 32-bit addresses composed of a Block Address (BA), a Major Address (MJA), a Minor Address (MNA), and a byte number Only contents of CLB column are studied in this research
7
Bitstream Compression CLB columns program the configurable logic blocks, routing, and most interconnect resources. Each CLB column contains 2 columns of slices. 22 frames are utilized within the bitstream for each column of slices, describing the logic and routing information respectively. Each frame occupies 424 bytes. In bitstream of PR modules, a compression technique is already used by Xilinx to represent the unused CLB frame. For unused CLB frames, only 10 bytes are used instead of 424 to describe empty contents.
8
Optimization Strategy Primary goal: minimize the number of columns of slices utilized, including routing resources Secondary goal: do not incur increase in propagation delay after area optimization Physical area management procedure: 1.Region Allocation: define size and boundary 2.Pin Assignment: top/bottom edge preferred 3.Column Alignment: fill column of slices at a time 4.Choke-Point Elimination: resolve high fanout cases 5.Repeat
9
Case Study The hardware platform is Xilinx Virtex II Pro VP7 device. Module-based partial reconfiguration flow is adopted to generate the partial reconfiguration bitstream. The Xilinx ISE 6.3 is used to support the module based flow. The area constraints are entered directly into User Constrain File (.ucf ) before map and routing. Four representative small cases and two larger size cases studies are tested for the strategy. Similar or identical external pin arrangement. Hypothesis: larger circuits can achieve more savings
10
4-LUT Design Optimization (parallel) Simple 4-LUT elements Parallel logic path with direct input from external signals LUTs feed outputs straight though to flip flops Best strategy is by locating them in a single column close to the external pins
11
Shifter Optimization (serial) All logic elements are cascaded in contiguous string of CLBs Attributes of this serial circuit functionality will have best arrangement from input to output in single column serially
12
Block Multiplier Optimization Block multiplier resource involved Requires balancing the routing between two paths. One path is between the block multiplier and the LUTs. The other path is from the LUTs to the external pins. Decreased savings compared to use of LUT- only circuits
13
LUT Multiplier Optimization High fan-out occurs because of the carry chains Single column style is not optimal Arrange related LUTs around each other using adjacent columns
14
Larger Case Studies SECDED as full PR module and MD5 step functions were developed as PR module vs. SHA family During the optimization process, not every slice has been specifically placed because of the large number of resources involved Only the slices on the critical path are constrained. These are comparatively larger modules, increased bitstream savings of 33% and 30% are achieved
15
Benchmark Results Module Name # of LUT. # of FF # of block Multiplier # of Slices Original File Size (bytes) Original Max. Delay (ns) Optimized File Size (bytes) Optimized Max. Delay (ns) Area Saving 4 LUTs41601264K1.37155K1.34714% Shifter12401387K1.37763K1.36728% Block Multiplier 82511788K1.34666K1.34625% LUT Multiplier 22 0 96K1.36768K1.34629% SECDED934107489K1.35560K1.35533% MD52921280168120K1.38084K1.32230%
16
Area Optimization Results A physical resource area management strategy is proposed to minimize the reconfiguration overhead. Experiments show that up to one-third of size reduction can be achieved for partial reconfiguration modules. The maximum propagation delay has also been decreased slightly in most cases (not increased). On the other hand, the larger the module is, the more complicated and time consuming the process becomes. Providing autonomous area optimization capabilities is future work for integrating into our Multi-Layer Runtime Reconfiguration Architecture FGPA framework
17
Multi-Layer Runtime Reconfiguration Arch. (MRRA) Resource name Number of Available Number of Used Utilization IOBs3968521% Slices4928180536% BRAM244454% TBUFs246435214% PPC40511100% BUFGMUXs4125%
18
Current Work: Direct Bitstream Management Change one-bit full adder to a one-bit full subtracter Both have three one-bit inputs and two one-bit outputs. Both used 2 LUTs with identical logic interconnections between LUTs and I/O signals. Only difference between them is actually only one truth table stored inside one LUT, changing from 0xE8 to 0x8E Combined MD5 / SHA-1 Step Function – Area Utilization Original Algorithm Module Based Frame Based SHA-1 192(slices)65(slices) 32(slices) MD5 881(slices)168(slices) 32(slices) Combined 1068(slices)324(slices) 32(slices)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.