Presenter: Darshika G. Perera Assistant Professor

Presentation transcript:

An Efficient Embedded Multi-Ported Memory Architecture for Next-Generation FPGAs
Presenter: Darshika G. Perera, Assistant Professor, Department of Electrical & Computer Engineering, University of Colorado at Colorado Springs
Email: darshika.perera@uccs.edu
Website: http://eas.uccs.edu/~dperera/
Authors: S. Navid Shahrouzi & Darshika G. Perera
Copyright 2017 Darshika G. Perera

Overview
- Introduction and motivation
- Existing research on multi-ported memory designs on FPGAs
- Our embedded multi-ported memory architecture
- Experimental results and analysis
- Conclusion and future work

Introduction and Motivation
- Significance of FPGAs for embedded computing
  - Higher level of flexibility than ASICs
  - Higher performance than software on microprocessors
- Utilizing FPGAs for real-time compute/data-intensive applications
  - Data mining, machine learning, image processing

Introduction and Motivation
- To achieve high speed-performance
  - Leverage fine/coarse-grain parallelism, data parallelism, pipelining
- To execute computations in parallel
  - Simultaneously read/write multiple data/results from/to memory
- Existing dual-port BRAMs on FPGAs
  - Insufficient for real-time compute/data-intensive applications

Existing Works on Multi-Ported Memory Designs on FPGAs
- Conventional methods
  - Replication, banking, and multi-pumping
- Multi-ported memories in conjunction with soft processor cores
  - VLIW, multithreaded, application-specific, MicroBlaze, Nios processors
- Most recent works
  - LVT-based, XOR-based, I-LVT, switched-ports
  - Use techniques to provide an arbitrary number of R/W ports
    - Add extra logic and routing to the design
    - Increase design complexity and cost
- Sheer design complexity
  - Hinders employment with next-generation FPGAs

Our Embedded Multi-Ported Memory Architecture
- Objective: to provide a simplified memory architecture with an arbitrary number of R/W ports
- To simplify the design
  - Only the read data from the BRAMs are processed
    - Using intermediate combinatorial logic
  - All other signals are forwarded directly to the BRAMs
    - Write data and R/W addresses
    - Without any intermediate logic between modules

Top-Level Architecture of Our mW/nR Multi-Ported Memory

Decision Making Module (DMM)
- DMM finds the last written data among the m BRAMs
  - During a read operation
- Internal architecture: combinatorial
  - Depends on the number of write ports
- For n read ports → n DMMs operate in parallel
- Functionalities
  - Checks the counter value of each BRAM in a column
    - The data with the highest counter value is the last written data
  - Extracts this last written data value
  - Forwards this data value to the read data output port
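The selection step described on this slide can be sketched in software as follows. This is only a behavioural illustration, not the hardware design: the function name `dmm_select` and the `(counter, data)` tuple layout are assumptions made for the example, and the real DMM is combinatorial logic rather than a loop.

```python
from typing import Sequence, Tuple


def dmm_select(column: Sequence[Tuple[int, int]]) -> int:
    """Return the last written data among the m BRAMs of one column.

    `column` holds one (counter, data) pair per BRAM, all read at the
    same address. The pair with the highest counter is the newest write.
    """
    # Compare counter values only; the data field plays no part in the choice.
    _, data = max(column, key=lambda entry: entry[0])
    return data


# Two write ports (m = 2): the second BRAM holds the newer write.
print(hex(dmm_select([(3, 0xAA), (7, 0xBB)])))
```

With n read ports, n such selections would run side by side, one per column.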

Counter
- Integrate a counter value into the input data
  - Prior to storing in BRAM
- Our BRAMs are configured
  - To have an r-bit word size
- Values p & q are variables
  - Depend on the requirements of a given application

Counter
- Reaches max. count after 2^p - 2 clock cycles
  - Operating time = overflow time (Tov)
  - Based on the number of counter bits
- At max. count, the multi-ported memory
  - Cannot perform write operations
  - Can perform read operations
- Memory has to recover to restore write operations
  - Write all the data to recovery memory
  - Reset the multi-ported memory
  - Write all the data back to the multi-ported memory from recovery memory
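A small behavioural model may help make the write, read, overflow, and recovery behaviour concrete. It is a sketch under several assumptions that are not stated on the slides: the class and method names are hypothetical, the BRAMs are modelled as an m x n grid (row i fed by write port i, column j read by read port j), the counter advances once per write call rather than once per clock cycle, and the write budget is taken as 2^p - 2.

```python
class MultiPortedMemoryModel:
    """Behavioural sketch of an mW/nR memory built from an m x n BRAM grid."""

    def __init__(self, m, n, depth, p_bits):
        self.m, self.n, self.depth = m, n, depth
        # Assumption: the counter budget mirrors the 2^p - 2 figure on the slide.
        self.max_count = (1 << p_bits) - 2
        self.reset()

    def reset(self):
        self.counter = 0
        # bram[i][j][addr] holds a (counter, data) pair.
        self.bram = [[[(0, 0)] * self.depth for _ in range(self.n)]
                     for _ in range(self.m)]

    def can_write(self):
        return self.counter < self.max_count

    def write(self, port, addr, data):
        if not self.can_write():
            raise RuntimeError("counter at maximum: recovery needed before writing")
        self.counter += 1
        # Write data and address go straight to every BRAM in the port's row.
        for j in range(self.n):
            self.bram[port][j][addr] = (self.counter, data)

    def read(self, port, addr):
        # Only read data is post-processed: pick the entry with the highest counter.
        column = [self.bram[i][port][addr] for i in range(self.m)]
        return max(column, key=lambda entry: entry[0])[1]

    def recover(self):
        # 1) copy the latest data out, 2) reset, 3) write it back.
        recovery_memory = [self.read(0, a) for a in range(self.depth)]
        self.reset()
        for addr, data in enumerate(recovery_memory):
            self.write(0, addr, data)


mem = MultiPortedMemoryModel(m=2, n=2, depth=4, p_bits=3)
mem.write(0, 1, 10)
mem.write(1, 1, 20)      # newer write to the same address through the other port
print(mem.read(0, 1))    # both read ports see the newer value
```

Reads keep working once the counter is exhausted because they never touch the counter; only `write` checks it, which matches the read-only behaviour at maximum count described above.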

Recovery vs. Operating Time/Mode
- mW/nR multi-ported memory with 2^s depth and a p-bit counter
- Example:
  - Operating time = Tov = 183 minutes
    - For a 100 MHz system clock and a 40-bit counter (p = 40)
  - Recovery time = Tr = 163 microseconds
    - For a 32-bit, 8K-deep 2W/2R multi-ported memory
- A higher ratio (Tov/Tr) leads to a better multi-ported memory design
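The operating-time figure in this example can be checked with a few lines of arithmetic. The sketch assumes the overflow count is 2^p - 2 clock cycles (a reading of the flattened exponent on the counter slide, chosen because it reproduces the 183-minute figure) and takes the 163-microsecond recovery time directly from the slide rather than deriving it.

```python
CLOCK_HZ = 100e6        # system clock frequency from the example
P_BITS = 40             # counter width from the example
T_R_SECONDS = 163e-6    # recovery time quoted for the 32-bit 8K 2W/2R memory

# Assumption: the counter reaches its maximum after 2^p - 2 clock cycles.
overflow_cycles = 2**P_BITS - 2
t_ov_seconds = overflow_cycles / CLOCK_HZ
t_ov_minutes = t_ov_seconds / 60

ratio = t_ov_seconds / T_R_SECONDS

print(f"Tov = {t_ov_minutes:.1f} minutes")
print(f"Tov/Tr = {ratio:.3e}")
```

The operating time comes out at a little over 183 minutes, so the memory spends roughly seven orders of magnitude more time operating than recovering in this configuration.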

Internal Architecture: BRAM Distribution of 2W/2R Multi-Ported Memory

Internal Architecture of DMM for 2W/2R Multi-Ported Memory

How does it reduce the complexity?
- Read data outputs of the BRAMs go to the DMM
  - The DMM selects and forwards the most recently written data to the read data output port
- Remaining port signals (W/R addresses and write data) are forwarded directly to the BRAMs
  - Without any extra logic between modules

Experimental Results & Analysis
- To evaluate feasibility and efficiency
  - Compared against the most recent designs: LVT-based and XOR-based
  - On a Virtex-6 XC6VHX380T FPGA, for fair comparison purposes
- To synthesize and implement our multi-ported memory
  - Xilinx ISE 14.7
- To verify results and functionalities of the designs
  - ModelSim SE and Xilinx ISim

Experimental Results & Analysis
- On 3 memory configurations: 2W/4R, 4W/8R, and 8W/16R
  - With varying memory depths (that fit on the chip)
  - Word size of the memory is constant (32-bit)
  - 40-bit counter
- Registered all the signals to ensure accurate timing
- Obtained
  - Maximum frequency (Fmax)
  - Total occupied slices on chip
  - Number of 36 Kbit BRAMs used
  - Ratio between operating time and recovery time

For 2W/4R Multi-Ported Memory

Depth   Fmax (MHz)    Slices   BRAMs   Ratio Tov/Tr
2       385.802469    63       8       3.14146E+11
4       379.362671    60               1.57073E+11
8                                      78536544841
16      405.515004    61               39268272421
32                                     19634136210
64                                     9817068105
128                                    4908534053
256                                    2454267026
512                                    1227133513
1K      355.492357                     613566756.6
2K      349.65035     77               306783378.3
4K      290.10734     89               153391689.1
8K      257.400257    310              76695844.57

Observations
- For 2W/4R, 4W/8R, and 8W/16R
  - Maximum memory depths are 8K, 8K, and 2K, respectively
  - For 8W/16R, the required BRAMs do not fit on the chip for depths > 2K
- For 2W/4R multi-ported memory
  - Ratio (Tov/Tr) decreases with increasing memory depth
- For the same memory depth
  - Ratio increases with increasing number of ports
- A higher ratio leads to a better multi-ported memory design, due to lower recovery time
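The depth trend can be checked against the 2W/4R ratios reported in the results table. The sketch below uses four of those values and tests whether ratio times depth stays constant, which would mean the ratio is inversely proportional to depth (with a fixed 40-bit counter, that in turn suggests recovery time grows linearly with depth). The variable names are illustrative only.

```python
# Tov/Tr values for 2W/4R, copied from the results table (depth -> ratio).
ratios = {
    2: 3.14146e11,
    16: 39268272421,
    512: 1227133513,
    8192: 76695844.57,
}

# If the ratio is inversely proportional to depth, these products match.
products = {depth: ratio * depth for depth, ratio in ratios.items()}
spread = max(products.values()) / min(products.values()) - 1

for depth, product in products.items():
    print(f"depth {depth:>5}: ratio x depth = {product:.6e}")
print(f"relative spread: {spread:.1e}")
```

The products agree to within the rounding of the tabulated values, so each doubling of depth halves the ratio.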

Observations
- For memory depths > 512
  - BRAM usage increases with increasing number of ports and with increasing memory depth
- For memory depths < 512
  - Constant BRAM usage
- Expectation: maximum frequency would decrease with increasing number of ports and with increasing memory depth
  - Holds across port counts at the same memory depth
  - For the same number of ports, maximum frequency results are inconsistent with increasing memory depth
    - Potentially due to how CAD tools place and route designs on the FPGA
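The 512-word threshold is consistent with a simple capacity estimate. The sketch below rests on assumptions that the slides do not state outright: one BRAM per (write port, read port) pair, and each 36 Kbit BRAM configured as 512 words of 72 bits (32 data bits plus the 40-bit counter). The function name `estimate_brams` is hypothetical.

```python
import math

BRAM_BITS = 36 * 1024    # one 36 Kbit block RAM
WORD_BITS = 32 + 40      # 32-bit data word plus 40-bit counter
ROWS_PER_BRAM = BRAM_BITS // WORD_BITS   # 512 words per BRAM


def estimate_brams(m, n, depth, rows_per_bram=ROWS_PER_BRAM):
    """Estimate the BRAM count for an mW/nR memory of the given depth.

    Assumes an m x n grid of logical memories, each needing enough
    BRAMs stacked in depth to hold `depth` words.
    """
    return m * n * math.ceil(depth / rows_per_bram)


for depth in (2, 512, 1024, 8192):
    print(f"2W/4R, depth {depth:>4}: {estimate_brams(2, 4, depth)} BRAMs")
```

Under these assumptions the estimate gives 8 BRAMs for 2W/4R at depth 2, matching the table, stays flat up to 512 words, and then doubles with each doubling of depth.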

Further Hardware Optimizations
- As proof-of-concept work
  - Optimized the internal architecture of the DMM only for 2 write ports
    - For the 2W/4R memory configuration
- No further hardware optimizations were attempted, and no optimization techniques (via Xilinx ISE tools) were enabled
  - During synthesis and implementation of our designs

Comparison With Existing Works
- BRAM usage vs. memory depth, for 2W/4R
  - The LVT-based design has the lowest BRAM usage
  - Our proposed design has the highest BRAM usage

Comparison With Existing Works
- Occupied slices vs. memory depth, for 2W/4R
  - Our proposed design and the XOR-based design have lower slice usage than the LVT-based design

Comparison With Existing Works
- Maximum frequency vs. memory depth, for 2W/4R
  - As memory depth increases, our design has a higher maximum frequency than the other memory designs

Results & Analysis
- Our proposed memory design is superior to existing memory designs
  - In terms of slice usage and maximum frequency
  - Though with higher BRAM usage
- Due to the optimized internal architecture of the DMM
  - With 2 write ports
- Thus, our design is more suitable for 2W/nR configurations

Conclusion
- Introduced a novel and efficient multi-ported memory architecture
- FPGA manufacturers could employ our memory architecture
  - To accelerate real-time compute/data-intensive applications
- FPGA manufacturers could further enhance our architecture in their next-generation FPGAs by
  - Integrating fast DMMs into BRAMs as configurable hard logic
  - Providing a fast configurable interconnect structure
  - Integrating counters into BRAMs to reduce the routing complexity
- Thus, significantly reducing the logic and routing delays of next-generation FPGAs
- Lower design complexity
  - Our design would enable seamless integration with existing FPGA-based CAD tools at minimal design cost

Future Work
- Investigate techniques to further optimize the internal architecture of our memory
  - Area and speed
- Optimize the internal architecture of the DMM for any number of write (mW) ports
- Investigate techniques to perform the recovery mode and reduce its time
  - Out of scope of this paper

Questions?