Published by Dora McDaniel. Modified over 6 years ago.
An Efficient Embedded Multi-Ported Memory Architecture for Next-Generation FPGAs
Presenter: Darshika G. Perera, Assistant Professor
Department of Electrical & Computer Engineering, University of Colorado at Colorado Springs
Website:
Authors: S. Navid Shahrouzi & Darshika G. Perera
Copyright 2017 Darshika G. Perera
Overview
- Introduction and motivation
- Existing research on multi-ported memory designs on FPGAs
- Our embedded multi-ported memory architecture
- Experimental results and analysis
- Conclusion and future work
Introduction and Motivation
- Significance of FPGAs for embedded computing
  - Higher level of flexibility than ASICs
  - Higher performance than software on microprocessors
- Utilizing FPGAs for real-time compute/data-intensive applications
  - Data mining, machine learning, image processing
Introduction and Motivation
- To achieve high speed-performance
  - Leverage fine/coarse-grain parallelism, data parallelism, pipelining
- To execute computations in parallel
  - Simultaneously read/write multiple data/results from/to memory
- Existing dual-port BRAMs on FPGAs
  - Insufficient for real-time compute/data-intensive applications
Existing Works on Multi-Ported Memory Designs on FPGAs
- Conventional methods
  - Replication, banking, and multi-pumping
- Multi-ported memories in conjunction with soft processor cores
  - VLIW, multithreaded, application-specific, MicroBlaze, Nios processors
- Most recent works
  - LVT-based, XOR-based, I-LVT, switched-ports
  - Use techniques to provide an arbitrary number of R/W ports
  - Add extra logic and routing to the design
  - Increase design complexity and cost
- Sheer design complexity
  - Hinders employment with next-generation FPGAs
Our Embedded Multi-Ported Memory Architecture
- Objective: to provide a simplified memory architecture with an arbitrary number of R/W ports
- To simplify the design
  - Only the read data from the BRAMs are processed
    - Using intermediate combinatorial logic
  - All other signals (write data and R/W addresses) are directly forwarded to the BRAMs
    - Without incorporating any intermediate logic between modules
Top-Level Architecture of Our mW/nR Multi-Ported Memory
Decision Making Module (DMM)
- The DMM finds the last written data among the m BRAMs
  - During a read operation
- Internal architecture: combinatorial
  - Depends on the number of write ports
- For n read ports
  - n DMMs operate in parallel
- Functionalities
  - Checks the counter values of each BRAM in a column
    - The data with the highest counter value is the last written data
  - Extracts this last written data value
  - Forwards this data value to the read data output port
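The selection step described above can be sketched in software as follows. This is a minimal behavioural illustration, not the authors' hardware design: the `Word` tuple, its field names, and the functions `dmm_select` and `read_all_ports` are hypothetical names introduced here, and the sketch assumes each BRAM output already carries its counter value next to the data.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    counter: int  # counter value stored with the data (hypothetical field name)
    data: int     # payload that was written


def dmm_select(column: Sequence[Word]) -> int:
    """Return the data whose stored counter value is highest (last written)."""
    if not column:
        raise ValueError("a DMM needs at least one BRAM output")
    latest = max(column, key=lambda word: word.counter)
    return latest.data


def read_all_ports(columns: Sequence[Sequence[Word]]) -> List[int]:
    """One DMM per read port; a loop here, parallel logic in hardware."""
    return [dmm_select(column) for column in columns]
```

For example, `dmm_select([Word(3, 0xAA), Word(7, 0xBB)])` returns `0xBB`, because the second word carries the higher counter value.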
Counter
- Integrate a counter value into the input data
  - Prior to storing it in a BRAM
- Our BRAMs are configured
  - To have an r-bit word size
- Values p and q are variables
  - They depend on the requirements of a given application
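One possible reading of the word layout is sketched below. It assumes that p is the counter width, q is the data width, r = p + q, and the counter occupies the upper bits; the slide text does not confirm this layout, and `pack_word` and `unpack_word` are hypothetical helper names.

```python
from typing import Tuple


def pack_word(counter: int, data: int, p: int, q: int) -> int:
    """Build an r-bit word (r = p + q, assumed) with the counter in the upper bits."""
    if not 0 <= counter < (1 << p):
        raise ValueError("counter does not fit in p bits")
    if not 0 <= data < (1 << q):
        raise ValueError("data does not fit in q bits")
    return (counter << q) | data


def unpack_word(word: int, p: int, q: int) -> Tuple[int, int]:
    """Split an r-bit word back into (counter, data)."""
    data = word & ((1 << q) - 1)
    counter = (word >> q) & ((1 << p) - 1)
    return counter, data
```

Under this assumption, the 32-bit data words and 40-bit counter used in the experiments would give r = 72 bits per stored word.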
Counter
- Reaches max. count after 2^p - 2 clock cycles
  - Operating time = overflow time (Tov)
  - Based on the number of counter bits
- At max. count, the multi-ported memory
  - Cannot perform write operations
  - Can perform read operations
- The memory has to recover to restore write operations
  - Write all the data to the recovery memory
  - Reset the multi-ported memory
  - Write all the data back to the multi-ported memory from the recovery memory
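The lock-out and recovery sequence above might be modelled roughly as follows. This is a toy single-array sketch under stated assumptions: the counter is taken to advance once per clock cycle, the max count is read as 2^p - 2, and the class and method names (`CounterGuardedMemory`, `tick`, `recover`) are hypothetical. It tracks only data, not the per-word counter values a real write-back would store.

```python
class CounterGuardedMemory:
    """Toy model of write lock-out at max count followed by recovery."""

    def __init__(self, depth: int, p: int) -> None:
        self.depth = depth
        self.max_count = (1 << p) - 2  # assumed reading of the max count
        self.counter = 0
        self.cells = [0] * depth

    @property
    def write_locked(self) -> bool:
        return self.counter >= self.max_count

    def tick(self) -> None:
        """Advance one clock cycle; the counter saturates at max count."""
        if not self.write_locked:
            self.counter += 1

    def write(self, addr: int, data: int) -> bool:
        """Return False when the write is refused because of max count."""
        if self.write_locked:
            return False
        self.cells[addr] = data
        return True

    def read(self, addr: int) -> int:
        """Reads remain possible at max count."""
        return self.cells[addr]

    def recover(self) -> None:
        recovery_memory = list(self.cells)   # 1. write all data to recovery memory
        self.cells = [0] * self.depth        # 2. reset the multi-ported memory
        self.counter = 0
        for addr, data in enumerate(recovery_memory):
            self.cells[addr] = data          # 3. write all data back
```

After `recover()` the stored data is unchanged and writes are accepted again, which is the behaviour the slide describes for leaving recovery mode.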
Recovery vs. Operating Time/Mode
- mW/nR multi-ported memory with 2^s depth and a p-bit counter
- Example: operating time = Tov = 183 minutes
  - For a system clock frequency of 100 MHz and a 40-bit counter (p = 40)
- Recovery time = Tr = 163 microseconds
  - For a 32-bit, 8K-deep 2W/2R multi-ported memory
- A higher ratio (Tov/Tr) leads to a better multi-ported memory design
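The two example figures can be reproduced with a short calculation. The overflow time follows from the max count of 2^p - 2 clock cycles; the recovery formula is an assumption made here (one clock cycle per word to copy the data out plus one per word to copy it back), chosen because it matches the quoted 163 microseconds for the 8K-deep 2W/2R case. It does not model any dependence on the number of ports. The function names are hypothetical.

```python
def overflow_time_s(p: int, f_clk_hz: float) -> float:
    """Operating time Tov: cycles until a p-bit counter reaches max count."""
    return (2 ** p - 2) / f_clk_hz


def recovery_time_s(depth: int, f_clk_hz: float) -> float:
    """Assumed recovery cost: one cycle per word out, one per word back."""
    return 2 * depth / f_clk_hz


f_clk = 100e6                          # 100 MHz system clock
tov = overflow_time_s(40, f_clk)       # 40-bit counter
tr = recovery_time_s(8 * 1024, f_clk)  # 8K-deep memory

print(f"Tov = {tov / 60:.1f} min")
print(f"Tr  = {tr * 1e6:.2f} us")
print(f"Tov/Tr = {tov / tr:.2e}")
```

With these inputs the sketch gives an operating time of about 183.3 minutes and a recovery time of 163.84 microseconds, in line with the 183 minutes and 163 microseconds quoted above, and a ratio of roughly 6.7 x 10^7.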
Internal Architecture: BRAM Distribution of 2W/2R Multi-Ported Memory
Internal Architecture of DMM for 2W/2R Multi-Ported Memory
How does it reduce the complexity?
- Read data outputs of the BRAMs go to the DMM
  - The DMM selects and forwards the most recently written data to the read data output port
- Remaining port signals (W/R addresses and write data) are directly forwarded to the BRAMs
  - Without incorporating any extra logic between modules
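A behavioural sketch of the data path described above is given below. It assumes an m x n grid of banks in which write port w writes to every bank in row w and read port r reads every bank in column r, with a free-running cycle counter used as the stored counter value. The grid layout, the class name `MultiPortedMemory`, and its methods are assumptions for illustration, not the authors' implementation; simultaneous writes to the same address from two ports are not modelled.

```python
from typing import List, Tuple


class MultiPortedMemory:
    """Behavioural sketch of an mW/nR memory built from an m x n grid of banks."""

    def __init__(self, m: int, n: int, depth: int) -> None:
        self.m, self.n, self.depth = m, n, depth
        self.cycle = 0
        # banks[w][r][addr] = (counter, data); row w belongs to write port w,
        # column r belongs to read port r (assumed layout).
        self.banks = [[[(0, 0)] * depth for _ in range(n)] for _ in range(m)]

    def clock(self, writes: List[Tuple[int, int, int]]) -> None:
        """Apply one cycle of writes given as (write_port, addr, data)."""
        self.cycle += 1
        for port, addr, data in writes:
            # Address and data go straight to every bank in the port's row.
            for r in range(self.n):
                self.banks[port][r][addr] = (self.cycle, data)

    def read(self, read_port: int, addr: int) -> int:
        """Only the read data passes through selection logic (the DMM)."""
        column = [self.banks[w][read_port][addr] for w in range(self.m)]
        _, data = max(column, key=lambda word: word[0])
        return data
```

For instance, after `mem = MultiPortedMemory(2, 4, 8)`, `mem.clock([(0, 3, 111)])` and `mem.clock([(1, 3, 333)])`, every read port returns 333 for address 3, because the second write carries the higher counter value.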
Experimental Results & Analysis
- To evaluate feasibility and efficiency
  - Compared with the most recent designs: LVT-based and XOR-based
  - On a Virtex-6 XC6VHX380T FPGA, for fair comparison purposes
- To synthesize and implement our multi-ported memory
  - Xilinx ISE 14.7
- To verify the results and functionalities of the designs
  - ModelSim SE and Xilinx ISim
Experimental Results & Analysis
- On 3 memory configurations: 2W/4R, 4W/8R, and 8W/16R
  - With varying memory depths (that fit on the chip)
  - Word size of the memory is constant (32-bit)
  - 40-bit counter
- Registered all the signals to ensure accurate timing
- Obtained
  - Maximum frequency (Fmax)
  - Total occupied slices on chip
  - Number of 36 Kbit BRAMs used
  - Ratio between operating time and recovery time
For 2W/4R Multi-Ported Memory
Depth  Fmax (MHz)  Slices  BRAMs  Ratio (Tov/Tr)
2 63 8 E+11 4 60 E+11 16 61 32 64 128 256 512 1K 2K 77 4K 89 8K 310
Observations
- For 2W/4R, 4W/8R, and 8W/16R
  - Maximum memory depths are 8K, 8K, and 2K, respectively
  - For 8W/16R, the BRAMs do not fit on the chip for depths > 2K
- For 2W/4R multi-ported memory
  - Ratio (Tov/Tr) decreases with increasing memory depth
- For the same memory depth
  - Ratio increases with increasing number of ports
- A higher ratio leads to a better multi-ported memory design, due to lower recovery time
Observations
- For memory depths > 512
  - BRAM usage increases with increasing number of ports and with increasing memory depth
- For memory depths < 512
  - Constant BRAM usage
- Expectation: maximum frequency would decrease with increasing number of ports and with increasing memory depth
  - True for the same memory depth
  - For the same number of ports, maximum frequency results are inconsistent with increasing memory depth
  - Potentially due to how CAD tools place and route designs on the FPGA
Further Hardware Optimizations
- As proof-of-concept work
  - Optimized the internal architecture of the DMM only for 2 write ports
    - For the 2W/4R memory configuration
- No further hardware optimizations were attempted, and no optimization techniques (via Xilinx ISE tools) were enabled
  - During synthesis and implementation of our designs
Comparison With Existing Works
- BRAM usage vs. memory depth, for 2W/4R
  - The LVT-based design has the lowest BRAM usage
  - Our proposed design has the highest BRAM usage
Comparison With Existing Works
- Occupied slices vs. memory depth, for 2W/4R
  - Our proposed design and the XOR-based design have lower slice usage than the LVT-based design
Comparison With Existing Works
- Maximum frequency vs. memory depth, for 2W/4R
  - As memory depths increase, our design has a higher maximum frequency than the other memory designs
Results & Analysis
- Our proposed memory design is superior to existing memory designs
  - In terms of slice usage and maximum frequency
  - Though with higher BRAM usage
- Due to the optimized internal architecture of the DMM with 2 write ports
  - Thus, our design is more suitable for 2W/nR configurations
Conclusion
- Introduced a novel and efficient multi-ported memory architecture
- FPGA manufacturers could employ our memory architecture
  - To accelerate real-time compute/data-intensive applications
  - To further enhance our architecture in their next-generation FPGAs by:
    - Integrating fast DMMs into BRAMs as configurable hard logic
    - Providing a fast configurable interconnect structure
    - Integrating counters into BRAMs to reduce the routing complexity
  - Thus, significantly reducing the logic and routing delays of next-generation FPGAs and lowering design complexity
- Our design would enable seamless integration into the existing FPGA-based CAD tools with minimal design cost
Future Work
- Investigate techniques to further optimize the internal architecture of our memory
  - Area and speed
- Optimize the internal architecture of the DMM for any number of write (mW) ports
- Investigate techniques to perform the recovery mode and reduce its time
  - Out of scope of this paper
Questions?