Improved Resource Sharing for FPGA DSP Blocks


Bajaj Ronak, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Suhaib A Fahmy, School of Engineering, University of Warwick, UK
1 Sep 2016

Xilinx DSP48E1 Primitive
- Three sub-blocks: pre-adder, multiplier, ALU
- Up to four pipeline stages
- Supports dynamic programmability: functionality can be changed every clock cycle
- 17-bit configuration input
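The idea of changing the block's function every clock cycle can be sketched with a toy behavioural model. This is only an illustration: the `Mode` fields and the `dsp_cycle` function are hypothetical names, pipelining is ignored, and the real 17-bit configuration encoding is not reproduced. The only hardware detail assumed is that the ALU result is 48 bits wide.

```python
from dataclasses import dataclass

MASK48 = (1 << 48) - 1  # the ALU output is modelled as a 48-bit value


@dataclass(frozen=True)
class Mode:
    """Hypothetical per-cycle configuration (NOT the real DSP48E1 encoding)."""
    use_preadder: bool    # route A through the pre-adder (A + D)
    use_multiplier: bool  # multiply the pre-adder output by B
    alu_op: str           # "add" (M + C), "sub" (C - M) or "pass" (M)


def dsp_cycle(mode: Mode, a: int, b: int, c: int, d: int) -> int:
    """Evaluate one cycle of the toy pre-adder -> multiplier -> ALU datapath."""
    x = a + d if mode.use_preadder else a
    m = x * b if mode.use_multiplier else x
    if mode.alu_op == "add":
        p = m + c
    elif mode.alu_op == "sub":
        p = c - m
    elif mode.alu_op == "pass":
        p = m
    else:
        raise ValueError(f"unknown alu_op: {mode.alu_op}")
    return p & MASK48  # wrap to 48 bits (two's complement)


if __name__ == "__main__":
    # The same block computes a different function on consecutive cycles.
    schedule = [
        Mode(use_preadder=True, use_multiplier=True, alu_op="add"),    # (a + d) * b + c
        Mode(use_preadder=False, use_multiplier=True, alu_op="pass"),  # a * b
    ]
    for cycle, mode in enumerate(schedule):
        print(cycle, dsp_cycle(mode, a=2, b=3, c=4, d=5))
```

Feeding a different `Mode` on each cycle is what lets one physical block stand in for several different operations of a design.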

Xilinx DSP48E1 Primitive
- iDEA soft processor: exploits dynamic programmability to build a small, fast (400 MHz+) soft processor [FPT 2012, TRETS 2014]
- FPGA overlays: exploit dynamic programmability in flexible processing elements, making fast, area-efficient overlays [FCCM 2015, HEART 2015, DATE 2016, FCCM 2016]

Resource Sharing
- Hard blocks such as the DSP48E1 are typically a constrained resource, so resource sharing should be applied where possible
- Traditional resource sharing:
  - Operations scheduled in non-overlapping time slots are mapped onto a shared set of hardware blocks
  - Input and output muxes are controlled by a state machine
- Major disadvantages:
  - Increased schedule length
  - High initiation interval (II) due to multi-cycle DSP blocks
  - The structure of the design's dataflow graph (DFG) limits the best achievable II, and thus throughput

Improved Resource Sharing
- Proposed scheduling and implementation technique for II-driven resource sharing
- Splits operations across multiple banks of DSP blocks such that each bank meets the targeted II
- Opens up the design space between a fully unconstrained implementation and traditional resource sharing
- Yields significant resource savings compared with resource-unconstrained implementations
- Exploits the dynamic programmability of the DSP block to map different sets of operations onto the same DSP block primitive

Illustrative Example
- Maximum number of DSP blocks in a schedule time = 3 (due to data dependencies)
- Best II achievable = 16

                   #DSP=1   #DSP=2   #DSP=3
Schedule length      62       32       22
II                   56       26       16
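As a rough reading of these numbers: a design accepts a new input set every II cycles, so at a fixed clock its throughput scales as 1/II. The short sketch below applies that relation to the three traditional-sharing design points listed above. The names `TRADITIONAL`, `relative_throughput` and `report` are illustrative, and an identical clock frequency across all design points is an assumption.

```python
# Design points copied from the illustrative example:
# number of DSP blocks -> (schedule length in cycles, initiation interval).
TRADITIONAL = {1: (62, 56), 2: (32, 26), 3: (22, 16)}


def relative_throughput(ii: int, baseline_ii: int) -> float:
    """Throughput relative to a baseline, assuming the same clock for both."""
    return baseline_ii / ii


def report(points: dict[int, tuple[int, int]]) -> list[tuple[int, int, int, float]]:
    """Return (n_dsp, schedule_length, ii, gain vs. the smallest design)."""
    baseline_ii = points[min(points)][1]
    rows = []
    for n_dsp in sorted(points):
        sched_len, ii = points[n_dsp]
        rows.append((n_dsp, sched_len, ii, relative_throughput(ii, baseline_ii)))
    return rows


if __name__ == "__main__":
    for n_dsp, sched_len, ii, gain in report(TRADITIONAL):
        print(f"#DSP={n_dsp}: length={sched_len}, II={ii}, throughput x{gain:.2f}")
```

Going from one to three DSP blocks improves throughput by 56/16 = 3.5×, but a fourth block would not help, because data dependencies never keep more than three blocks busy in the same time step.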

Illustrative Example
- Proposed approach uses more DSP blocks for better II

                II=16   II=11   II=6   II=1
DSP blocks        3       4      6     12

Schedule lengths shown: 22, 32
[Figure: schedule for II = 6]
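The trade-off in this example can be made explicit by comparing each design point against the best traditional-sharing point (II = 16 with 3 DSP blocks). The sketch below is only arithmetic on the numbers above; `PROPOSED` and `tradeoff` are illustrative names, and the same clock frequency for every design point is an assumption. These ratios apply to the illustrative example only, not to benchmark results.

```python
# Design points from the illustrative example: II -> DSP blocks used.
PROPOSED = {16: 3, 11: 4, 6: 6, 1: 12}

# Best point reachable with traditional resource sharing in this example.
BASE_II, BASE_DSP = 16, 3


def tradeoff(ii: int, dsp: int, base_ii: int = BASE_II, base_dsp: int = BASE_DSP):
    """Return (throughput gain, DSP increase, gain per unit of DSP increase)."""
    throughput_gain = base_ii / ii   # throughput scales as 1/II
    dsp_increase = dsp / base_dsp
    return throughput_gain, dsp_increase, throughput_gain / dsp_increase


if __name__ == "__main__":
    for ii, dsp in PROPOSED.items():
        gain, cost, efficiency = tradeoff(ii, dsp)
        print(f"II={ii:2d}: throughput x{gain:.2f}, DSP x{cost:.2f}, "
              f"gain/cost={efficiency:.2f}")
```

In this example every step to a lower II buys more throughput than it costs in DSP blocks, so the gain-to-cost ratio stays above 1 across the whole range.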

Results
- Throughput gain with increase in DSP block usage
- The increase in DSP blocks is measured from the best throughput achievable using traditional resource sharing (TRS)
- 1.8× throughput improvement with a 1.4× increase in DSP blocks for an II of 11
- For an II of 6, throughput improvements of up to 8× at a cost of a 3× increase in DSP blocks
- All of these design points are inaccessible using the traditional approach

Thank You