Partial Region and Bitstream Cost Models for Hardware Multitasking on Partially Reconfigurable FPGAs


Presentation transcript:

Partial Region and Bitstream Cost Models for Hardware Multitasking on Partially Reconfigurable FPGAs
Aurelio Morales-Villanueva and Ann Gordon-Ross+
Department of Electrical and Computer Engineering
University of Florida, Gainesville, Florida, USA
+ Also affiliated with the NSF Center for High-Performance Reconfigurable Computing
This work was supported by National Science Foundation (NSF) grants EEC and IIP , and Programa de Ciencia y Tecnología (FINCyT) under contract FINCyT-BDE

2 of 20 Introduction
Field-programmable gate arrays (FPGAs)
  – Programmable devices with a large amount of resources
      – Resources connected by a complex, configurable routing network
  – Logic resources: CLBs (LUTs, flip-flops)
  – Special resources: BRAMs, DSPs, hard-core microprocessors
Reconfiguration on FPGAs
  – Benefits system designers and functionality
      – Run-time hardware adaptation via resource time multiplexing
      – Reduced area/power requirements
      – Two types of reconfiguration: full and partial

3 of 20 Full Reconfiguration
Used for initializing the entire FPGA
  – Entire FPGA configured with a full bitstream and a fixed hardware task set
  – Reconfiguration halts all tasks (i.e., the entire system)
  – Lengthy switching time if the task set changes
Execution and state of all tasks are lost during full reconfiguration!
[Figure labels: Configuration Port; Full bitstream 1; Full bitstream 2; HW tasks A1, B1, C1; HW tasks A2, B2, C2]

4 of 20 Partial Reconfiguration (PR)
Current FPGAs support PR
  – Enables efficient hardware multitasking
  – FPGA area and power reduction, faster configuration, etc.
Effectively leveraging PR on FPGAs
  – Challenging for system designers
  – Early design decisions affect overall PR system performance
  – Inappropriate decisions severely degrade PR system performance
      – Potentially worse than a non-PR system
PR divides the FPGA fabric into two regions
  – Static region: fixed functionality, never reconfigured after initial configuration at startup
  – Reconfigurable region: multiple PR regions (PRRs)
      – PRRs execute PR modules (PRMs), i.e., hardware tasks
[Figure labels: Static region (Embedded processor, ICAP, Mem Controller); Reconfig. region (Module A, Module B, Module C, Module D)]

5 of 20 Partial vs. Full Reconfiguration
Benefits of partial over full reconfiguration
  – Increased flexibility
  – Increased task throughput/performance
  – Reduced FPGA area requirements
  – Reduced power consumption
Dynamic, on-the-fly PR of individual PRRs
  – No execution interruption of the static region or other PRRs!
Uses partial bitstreams
  – Smaller than a full bitstream, so reconfiguration time is shorter
  – *May* require a bitstream for each PRM-to-PRR mapping
[Figure labels: Function; Time; Power On; Configuration Overhead; Static region operation; Reconfiguration Overhead]

6 of 20 System Designer Challenges
PR partitioning design space is exponentially large
  – Fine-grained to coarse-grained partitioning
      – From simple operations to the entire application as a single PRM
  – Designers can only evaluate a subset of these designs
Critical design decisions are made in early system design
  – Design partitioning?
  – PRM-to-PRR mapping?
  – PRR size/organization? How big?
  – Resource utilization vs. PRR size/organization
Need analytical or simulated cost models
  – Evaluate design decisions' impact on PRR size/organization and partial bitstream sizes
  – Cost models avoid the lengthy PR design flow
[Figure labels: Static region; PRM1, PRM2, PRM3, PRM4 mapped either to a single PRR (PRR 1) or to two PRRs (PRR 1, PRR 2)]
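One way to get a feel for how quickly the PRM-to-PRR mapping space grows: if each PRM is assigned to exactly one PRR and the PRRs are treated as interchangeable, the number of possible groupings of n PRMs is the Bell number B(n). This counting model is an assumption made here for illustration only; the slides do not state it, and it ignores PRR sizing and placement choices, which enlarge the space further. A minimal Python sketch (the function name `bell_number` is hypothetical):

```python
def bell_number(n: int) -> int:
    """Number of ways to group n PRMs into unlabeled, non-empty PRRs.

    Computed with the Bell triangle: each new row starts with the last
    entry of the previous row, and every further entry is the sum of its
    left neighbour and the entry above that neighbour.
    """
    row = [1]  # row 0 of the Bell triangle; row[0] is B(0)
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for prms in (4, 8, 10):
        print(f"{prms} PRMs -> {bell_number(prms)} possible groupings")
```

Under this assumption, the four PRMs drawn on the slide already allow 15 groupings, and ten PRMs allow more than 100,000, which is why a designer can only implement a small subset.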

7 of 20 Motivations
Prior work on PR cost models
  – Only provided partial methods for evaluating design tradeoffs
Manual PRR floorplanning process in the PR design flow
  – Avoid oversized PRRs
  – Avoid ill-suited PRR organizations
  – Goal: high resource utilization per PRR
  – Benefits:
      – Smaller partial bitstreams
      – Faster reconfiguration times
      – Efficient area utilization in the FPGA
GOAL: High-level cost models for system designers
  – Evaluation of design decisions early in the design process
      – The cost models must provide sufficiently accurate evaluations
      – Reduces design space exploration time, compared to a full system implementation that yields the same information

8 of 20 Contributions
Two high-level cost models for design decision evaluation
  – Based on synthesis report results generated by Xilinx tools
  – PRR size/organization cost model
      – Compares PRRs with different resources and FPGA fabric locations
  – Partial bitstream size cost model
      – Partial bitstream size derivation based on PRR size/organization
Benefits of our cost models
  – Early estimation of PRR size/organization and partial bitstream size
      – Increases the resource utilization in PRRs
  – Generally portable across different Xilinx FPGA families
      – Device-specific characteristics' values used in cost model formulas
  – Do not require executing the entire PR design flow
  – Significantly decrease the design exploration time
      – Increasing system designer productivity

9 of 20 PRR Size/Organization Cost Model

10 of 20 PRR Size/Organization Cost Model Parameters
Based on Xilinx synthesis report results

Parameter     Description
LUT_FF_req    LUT-FF pairs required in PRM
LUT_req       Slice LUTs required in PRM
LUT_CLB       LUTs per CLB
FF_CLB        FFs per CLB
CLB_req       CLBs required in PRM
FF_req        FFs required in PRM
W_CLB         CLB columns in PRR
H_CLB         CLB rows in PRR
CLB_col       CLBs in a column (per row)
DSP_req       DSPs required in PRM
W_DSP         DSP columns in PRR
H_DSP         DSP rows in PRR
DSP_col       DSPs in a column (per row)
BRAM_req      BRAMs required in PRM
W_BRAM        BRAM columns in PRR
H_BRAM        BRAM rows in PRR
BRAM_col      BRAMs in a column (per row)
CLB_avail     CLBs available in PRR
FF_avail      FFs available in PRR
DSP_avail     DSPs available in PRR
BRAM_avail    BRAMs available in PRR
H             Number of rows in the PRR
W             Number of columns in the PRR
PRR_size      Size of PRR

Specific values in the PRR size/organization cost model for the Virtex-4, -5, and -6 device families

Parameter     Virtex-4    Virtex-5    Virtex-6
CLB_col       [values missing in transcript]
DSP_col       4           8           16
BRAM_col      4           4           8
LUT_CLB       8           8           8
FF_CLB        8           8           16

11 of 20 Derivation of the PRR Size/Organization
PRR size/organization depends on the specific FPGA selected
Derivation steps
  – Select an FPGA for the PR system
  – Generate a synthesis report for each PRM
  – Extract the resources required for PRMs that map to the same PRR
  – Derive the PRR's resources for maximum resource utilization (BRAMs, DSPs, flip-flops)
  – PRR height (number of rows): $H = H_{CLB} = H_{DSP} = H_{BRAM}$
  – PRR width (number of columns): $W = W_{CLB} + W_{DSP} + W_{BRAM}$
      – CLB columns ($W_{CLB}$)
      – DSP columns ($W_{DSP}$)
      – BRAM columns ($W_{BRAM}$)
  – Total PRR size: $PRR_{size} = H \times W$
Example synthesis report excerpt
  Selected Device: 5vlx110tff
  Slice Logic Utilization:
    # of Slice Registers: 1592 of %
    # of Slice LUTs: 1527 of %
    # used as Logic: 1527 of %
  Slice Logic Distribution:
    # of LUT Flip Flop pairs used: 2619
    # with an unused Flip Flop: 1027 of %
    # with an unused LUT: 1092 of %
    # of fully used LUT-FF pairs: 500 of %
    # of unique control sets: 45
  IO Utilization:
    # of IOs: 38
    # of bonded IOBs: 38 of 640 5%
  Specific Feature Utilization:
    # of Block RAM/FIFO: 4 of 148 2%
    # using Block RAM only: 4
    # of BUFG/BUFGCTRLs: 3 of 32 9%
    # of DSP48Es: 4 of 64 6%
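The derivation above can be sketched in Python. The transcript does not preserve the model's exact formulas, so the rounding rules here are assumptions: CLB_req is taken as the LUT-FF pair count divided by LUT_CLB, rounded up, and each column count is the requirement divided by (resources per column per row x H), rounded up. The Virtex-5 values DSP_col = 8, BRAM_col = 4, and LUT_CLB = 8 come from the parameter slide; CLB_col = 20 is an assumed value (the Virtex-5 clock-region height), because the transcript lost that table row. All function and class names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceColumns:
    clb_col: int   # CLBs in a column, per PRR row (assumed, see lead-in)
    dsp_col: int   # DSPs in a column, per PRR row
    bram_col: int  # BRAMs in a column, per PRR row
    lut_clb: int   # LUTs per CLB


def columns_needed(required: int, per_column: int, rows: int) -> int:
    """Smallest column count whose capacity over `rows` rows covers `required`."""
    return math.ceil(required / (per_column * rows))


def prr_organization(dev: DeviceColumns, lut_ff_req: int, dsp_req: int,
                     bram_req: int, h: int) -> dict:
    """Sketch of H, W, and PRR size for one candidate PRR height `h`."""
    clb_req = math.ceil(lut_ff_req / dev.lut_clb)  # assumed CLB_req rule
    w_clb = columns_needed(clb_req, dev.clb_col, h)
    w_dsp = columns_needed(dsp_req, dev.dsp_col, h)
    w_bram = columns_needed(bram_req, dev.bram_col, h)
    w = w_clb + w_dsp + w_bram                     # W = W_CLB + W_DSP + W_BRAM
    clb_avail = h * w_clb * dev.clb_col
    return {
        "CLB_req": clb_req,
        "W_CLB": w_clb, "W_DSP": w_dsp, "W_BRAM": w_bram,
        "W": w, "PRR_size": h * w,                 # PRR_size = H x W
        "RU_CLB": clb_req / clb_avail if clb_avail else 0.0,
    }


# Inputs taken from the synthesis report excerpt: 2619 LUT-FF pairs,
# 4 DSP48Es, 4 Block RAMs, evaluated for a single-row PRR (H = 1).
VIRTEX5 = DeviceColumns(clb_col=20, dsp_col=8, bram_col=4, lut_clb=8)
result = prr_organization(VIRTEX5, lut_ff_req=2619, dsp_req=4, bram_req=4, h=1)

if __name__ == "__main__":
    print(result)
```

Sweeping `h` over the feasible row counts and keeping the organization with the highest utilization would be one way to use such a function; how the authors count BRAMs (for example 18 Kb versus 36 Kb blocks) may differ from this sketch.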

12 of 20 Partial Bitstream Size Cost Model

13 of 20 Partial Bitstream Structure
Partial bitstream structure is similar across device families
  – Initial words (IW)
      – Synchronization of the bitstream with the configuration port (e.g., ICAP)
  – Configuration words per PRR row (NCW_row)
      – Access to CLBs, DSPs, BRAMs, and CLB flip-flop initialization
  – BRAM data words per PRR row (NDW_BRAM)
      – BRAM initialization
  – Final words (FW)
      – Release the ICAP, allowing other PRRs to be configured

14 of 20 Partial Bitstream Size Cost Model Parameters

Parameter     Description
IW            Number of initial words
FW            Number of final words
FAR_FDRI      FAR/FDRI initialization words per row
NCW_row       Configuration words in a PRR row
NDW_BRAM      BRAM initialization words in a PRR row
NCF_CLB       CLB configuration frames in a PRR row
NCF_DSP       DSP configuration frames in a PRR row
NCF_BRAM      BRAM configuration frames in a PRR row
CF_CLB        Configuration frames per CLB column
CF_DSP        Configuration frames per DSP column
CF_BRAM       Configuration frames per BRAM column
DF_BRAM       Initialization frames per BRAM column
FR_size       Frame size in words
Bytes_word    Number of bytes per word
H             Number of rows in the PRR
S_bitstream   Size of partial bitstream in bytes

Specific values in the partial bitstream size cost model for the Virtex-4, -5, and -6 device families

Parameter     Virtex-4    Virtex-5    Virtex-6
CF_CLB        2236 [column boundaries lost in transcript]
CF_DSP        2128 [column boundaries lost in transcript]
CF_BRAM       [values missing in transcript]
DF_BRAM       [values missing in transcript]
FR_size       [values missing in transcript]
IW            [values missing in transcript]
FW            [values missing in transcript]
FAR_FDRI      5           5           5
Bytes_word    4           4           4

15 of 20 Partial Bitstream Size Derivation
Partial bitstream size in bytes (H is the number of PRR rows):
  $S_{bitstream} = (IW + H \times (NCW_{row} + NDW_{BRAM}) + FW) \times Bytes_{word}$
Configuration words per PRR row ($FR_{size}$ is the frame size in words):
  $NCW_{row} = FAR\_FDRI + (NCF_{CLB} + NCF_{DSP} + NCF_{BRAM} + 1) \times FR_{size}$
CLB configuration frames per PRR row:
  $NCF_{CLB} = W_{CLB} \times CF_{CLB}$
DSP configuration frames per PRR row:
  $NCF_{DSP} = W_{DSP} \times CF_{DSP}$
BRAM configuration frames per PRR row:
  $NCF_{BRAM} = W_{BRAM} \times CF_{BRAM}$
BRAM initialization words per PRR row:
  $NDW_{BRAM} = FAR\_FDRI + (W_{BRAM} \times DF_{BRAM} + 1) \times FR_{size}$
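The formulas on this slide translate directly into a few lines of Python. The sketch below follows them literally, including the NDW_BRAM term even when W_BRAM is 0; whether the authors drop that term for PRRs without BRAM columns is not stated in the transcript. FAR_FDRI = 5 and Bytes_word = 4 come from the parameter slide. Every other device value in `PLACEHOLDER` is a round number chosen purely for illustration, because the transcript lost the real table entries; the names `DeviceParams`, `PLACEHOLDER`, and `partial_bitstream_bytes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceParams:
    cf_clb: int      # configuration frames per CLB column
    cf_dsp: int      # configuration frames per DSP column
    cf_bram: int     # configuration frames per BRAM column
    df_bram: int     # initialization frames per BRAM column
    fr_size: int     # frame size in words
    iw: int          # number of initial words
    fw: int          # number of final words
    far_fdri: int    # FAR/FDRI initialization words per row
    bytes_word: int  # bytes per word


def partial_bitstream_bytes(p: DeviceParams, h: int, w_clb: int,
                            w_dsp: int, w_bram: int) -> int:
    """Partial bitstream size in bytes for a PRR with h rows."""
    ncf_clb = w_clb * p.cf_clb      # NCF_CLB  = W_CLB  x CF_CLB
    ncf_dsp = w_dsp * p.cf_dsp      # NCF_DSP  = W_DSP  x CF_DSP
    ncf_bram = w_bram * p.cf_bram   # NCF_BRAM = W_BRAM x CF_BRAM
    ncw_row = p.far_fdri + (ncf_clb + ncf_dsp + ncf_bram + 1) * p.fr_size
    ndw_bram = p.far_fdri + (w_bram * p.df_bram + 1) * p.fr_size
    return (p.iw + h * (ncw_row + ndw_bram) + p.fw) * p.bytes_word


# Placeholder device: NOT real Virtex values, except far_fdri and bytes_word.
PLACEHOLDER = DeviceParams(cf_clb=10, cf_dsp=10, cf_bram=10, df_bram=10,
                           fr_size=10, iw=10, fw=10, far_fdri=5, bytes_word=4)

example_size = partial_bitstream_bytes(PLACEHOLDER, h=2, w_clb=3, w_dsp=1, w_bram=1)

if __name__ == "__main__":
    print(example_size)  # 5120 with the placeholder values above
```

With the placeholder device, each row contributes NCW_row = 515 and NDW_BRAM = 115 words, so the two-row PRR needs 10 + 2 x 630 + 10 = 1280 words, or 5120 bytes. Substituting real per-family constants from the device configuration documentation would give estimates comparable to the reported bitstream sizes.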

Experimental Results

17 of 20 PRR Size/Organization Cost Model Evaluation
FPGA devices: Virtex-5 LX110T and Virtex-6 LX75T
  – Different sizes/architectures to evaluate different resource organizations
Experimental PRMs: MIPS, FIR, and SDRAM
  – PRM complexity and resource usage similar to prior works
Synthesis report results using Xilinx ISE 12.4 tools
Resource utilization (RU)
  – RUs per resource type are maximum for the selected PRR size/organization

PRM (device)       H    W_CLB   W_DSP   W_BRAM   RU_CLB   RU_DSP   RU_BRAM
FIR (Virtex-5)     5    2       1       0        82%      80%      0%
FIR (Virtex-6)     1    5       2       0        92%      84%      0%
MIPS (Virtex-5)    1    17      1       2        97%      50%      75%
MIPS (Virtex-6)    1    11      1       1        92%      25%      75%

Executing the entire flow vs. using our cost model
  – Average RU_CLB is 15% higher (due to tool optimizations)
  – RU_DSP and RU_BRAM are the same

18 of 20
Execution times, in minutes (m) and seconds (s)

                   Virtex-5 LX110T               Virtex-6 LX75T
Process            FIR       MIPS      SDRAM     FIR       MIPS      SDRAM
Synthesis          4m 25s    4m 15s    3m 20s    4m        4m 50s    4m 23s
Implementation     5m 35s    5m 15s    2m 55s    4m 15s    5m 50s    4m 30s

  – Synthesis: includes derivation of PRR size/organization and bitstream size (cost model = 1m 30s on avg., which is 35% of synthesis time)
  – Implementation: place and route execution times

Partial bitstream sizes (in bytes), based on PRR sizes/organizations per PRM

PRM      Virtex-5 LX110T    Virtex-6 LX75T
FIR      83,440             77,340
MIPS     157,672            189,140
SDRAM    18,416             24,204

  – Obtained without executing the entire PR design flow
  – Bitstream sizes are 9% larger on average vs. executing the entire flow
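As a quick check of the timing arithmetic on this slide, the short Python sketch below averages the six synthesis and six implementation times and relates the quoted 1m 30s cost-model time to the average synthesis time. The helper names are hypothetical, and the sketch only restates numbers given on the slide.

```python
def to_seconds(text: str) -> int:
    """Convert a duration such as '4m 25s' or '4m' to seconds."""
    total = 0
    for part in text.split():
        if part.endswith("m"):
            total += 60 * int(part[:-1])
        elif part.endswith("s"):
            total += int(part[:-1])
    return total


def mean_seconds(times: list[str]) -> float:
    """Average of a list of 'Xm Ys' durations, in seconds."""
    return sum(to_seconds(t) for t in times) / len(times)


# Order: Virtex-5 FIR, MIPS, SDRAM, then Virtex-6 FIR, MIPS, SDRAM.
SYNTHESIS = ["4m 25s", "4m 15s", "3m 20s", "4m", "4m 50s", "4m 23s"]
IMPLEMENTATION = ["5m 35s", "5m 15s", "2m 55s", "4m 15s", "5m 50s", "4m 30s"]
COST_MODEL_SECONDS = to_seconds("1m 30s")

avg_synthesis = mean_seconds(SYNTHESIS)
avg_implementation = mean_seconds(IMPLEMENTATION)
share_of_synthesis = COST_MODEL_SECONDS / avg_synthesis

if __name__ == "__main__":
    print(f"average synthesis:      {avg_synthesis:.1f} s")
    print(f"average implementation: {avg_implementation:.1f} s")
    print(f"cost model / synthesis: {share_of_synthesis:.1%}")
```

The averages come to roughly 4m 12s for synthesis and 4m 43s for implementation, and the cost model's 1m 30s is about 36% of the average synthesis time, in line with the roughly 35% quoted on the slide.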

19 of 20 Conclusions
Introduced two high-level cost models
  – Early estimation of design tradeoffs for PR system design space exploration
  – PRR size/organization cost model
      – Smallest PRRs that maximize shared PRM resource utilization
  – Partial bitstream size cost model
      – Bitstream size derivation based on PRR size/organization
  – Cost models generally portable across FPGA device families
  – Improved system designer productivity
      – Use of cost models without executing the entire PR design flow
Future work
  – Introduce cost models as part of the PR design flow
      – Integration with Xilinx tools in the PRR floorplanning process

20 of 20 Questions?