Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto Computer Engineering Research Group February 22, 2010.

Slides:



Advertisements
Similar presentations
Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.
Advertisements

Reconfigurable Computing (EN2911X, Fall07) Lecture 04: Programmable Logic Technology (2/3) Prof. Sherief Reda Division of Engineering, Brown University.
NetFPGA Project: 4-Port Layer 2/3 Switch Ankur Singla Gene Juknevicius
1/1/ / faculty of Electrical Engineering eindhoven university of technology Architectures of Digital Information Systems Part 1: Interrupts and DMA dr.ir.
Cache Memory Locality of reference: It is observed that when a program refers to memory, the access to memory for data as well as code are confined to.
Architecture Design Methodology. 2 The effects of architecture design on metrics:  Area (cost)  Performance  Power Target market:  A set of application.
Software and Hardware Circular Buffer Operations First presented in ENCM There are 3 earlier lectures that are useful for midterm review. M. R.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,
Multiprocessing Memory Management
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
Chapter 16 Control Unit Implemntation. A Basic Computer Model.
Chapter 1 and 2 Computer System and Operating System Overview
L1 CTT May 14, 2002 New Scheme for Tracking Marvin Johnson.
GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.
Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.
Module I Overview of Computer Architecture and Organization.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Penn ESE370 Fall DeHon 1 ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems Day 30: November 12, 2014 Memory Core: Part.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Philip Brisk 2 Paolo Ienne 2 Hadi Parandeh-Afshar 1,2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences Efficient.
GBT Interface Card for a Linux Computer Carson Teale 1.
Reconfigurable Computing - Verifying Circuits Performance! John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn.
Hardware Implementation of a Memetic Algorithm for VLSI Circuit Layout Stephen Coe MSc Engineering Candidate Advisors: Dr. Shawki Areibi Dr. Medhat Moussa.
Modular SRAM-based Binary Content-Addressable Memories Ameer M.S. Abdelhadi and Guy G.F. Lemieux Department of Electrical and Computer Engineering University.
IT253: Computer Organization
A Fast Hardware Approach for Approximate, Efficient Logarithm and Anti-logarithm Computation Suganth Paul Nikhil Jayakumar Sunil P. Khatri Department of.
Quadratic Programming Solver for Image Deblurring Engine Rahul Rithe, Michael Price Massachusetts Institute of Technology.
Buffer-On-Board Memory System 1 Name: Aurangozeb ISCA 2012.
Array Synthesis in SystemC Hardware Compilation Authors: J. Ditmar and S. McKeever Oxford University Computing Laboratory, UK Conference: Field Programmable.
Penn ESE370 Fall DeHon 1 ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems Day 27: November 14, 2011 Memory Core.
PROCStar III Performance Charactarization Instructor : Ina Rivkin Performed by: Idan Steinberg Evgeni Riaboy Semestrial Project Winter 2010.
OSes: 11. FS Impl. 1 Operating Systems v Objectives –discuss file storage and access on secondary storage (a hard disk) Certificate Program in Software.
Power-Aware RAM Processing for FPGAs December 9, 2005 Power-aware RAM Processing for FPGA Embedded Memory Blocks Russell Tessier University of Massachusetts.
J. Greg Nash ICNC 2014 High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations J. Greg.
1 Multi-ported Memories for FPGAs via XOR Eric LaForest, Ming Liu, Emma Rapati, and Greg Steffan ECE, University of Toronto.
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
Performed by:Yulia Turovski Lior Bar Lev Instructor: Mony Orbach המעבדה למערכות ספרתיות מהירות High speed digital systems laboratory הטכניון - מכון טכנולוגי.
Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning Charles Eric LaForest J. Gregory Steffan University of Toronto ICFPT 2013.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
L/O/G/O Input Output Chapter 4 CS.216 Computer Architecture and Organization.
Implementing and Optimizing a Direct Digital Frequency Synthesizer on FPGA Jung Seob LEE Xiangning YANG.
Tools - LogiBLOX - Chapter 5 slide 1 FPGA Tools Course The LogiBLOX GUI and the Core Generator LogiBLOX L BX.
ICC Module 3 Lesson 1 – Computer Architecture 1 / 12 © 2015 Ph. Janson Information, Computing & Communication Computer Architecture Clip 6 – Logic parallelism.
© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.
Copyright © 2004, Dillon Engineering Inc. All Rights Reserved. An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs  Architecture optimized.
ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.
Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.
Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab Project performed by: Naor Huri Idan Shmuel.
Company LOGO Project Characterization Spring 2008/9 Performed by: Alexander PavlovDavid Domb Supervisor: Mony Orbach GPS/INS Computing System.
A New Class of High Performance FFTs Dr. J. Greg Nash Centar ( High Performance Embedded Computing (HPEC) Workshop.
Review CS File Systems - Partitions What is a hard disk partition?
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
Chapter 7 Memory Management Eighth Edition William Stallings Operating Systems: Internals and Design Principles.
Implementing JPEG Encoder for FPGA ECE 734 PROJECT Deepak Agarwal.
Resource Sharing in LegUp. Resource Sharing in High Level Synthesis Resource Sharing is a well-known technique in HLS to reduce circuit area by sharing.
Buffering Techniques Greg Stitt ECE Department University of Florida.
A Multi-Ported Memory Compiler Utilizing True Dual- port BRAMs Ameer Abdelhadi and Guy Lemieux Department of Electrical and Computer Engineering University.
1 Comparing FPGA vs. Custom CMOS and the Impact on Processor Microarchitecture Henry Wong Vaughn Betz, Jonathan Rose.
Memory Management Damian Gordon.
A Case for Standard-Cell Based RAMs in Highly-Ported Superscalar Processor Structures Sungkwan Ku, Elliott Forbes, Rangeen Basu Roy Chowdhury, Eric Rotenberg.
Presenter: Darshika G. Perera Assistant Professor
Class Exercise 1B.
ESE532: System-on-a-Chip Architecture
Altera Stratix II FPGA Architecture
Presentation on FPGA Technology of
SLP1 design Christos Gentsos 9/4/2014.
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
XC4000E Series Xilinx XC4000 Series Architecture 8/98
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto Computer Engineering Research Group February 22, 2010

2 Parallelism in FPGAs Larger SoCs on FPGAs → Parallel Systems Parallel systems on FPGAs will need:  Queueing  Data sharing  Communication  Synchronization Boils down to:  FIFOs  Register files We can do all these with multi-ported memories

3 Multi-Ported Memory X X X X Existing workarounds are ad-hoc, “roll-your-own”, and have limited parallelism.

4 Conventional Approaches

5 2W/2R Multi-Ported Memory Doesn't exist on FPGAs Altera used to have one (Mercury)‏

6 Stratix III Building Blocks M9K (eg: 32 x 256)‏ M144K (eg: 32 x 4098)‏ Adaptive Logic Modules Registers LUTs Adders Block RAMs Flexible, but slow Fast, but inflexible

7 2W/2R Pure-ALM Scales very poorly with memory depth

8 1W/nR Replication Multiple read ports Only one write port

9 mW/nR Banking Multiple write ports Fragmented data

10 mW/nR “Multipumping” Multiple read/write ports No fragmentation Divides clock speed Read/write ordering

11 Block RAMs: Simple Dual Port Read Write

12 Block RAMs: True Dual Port R / W

13 “Pure Multipumping” Read as banked memory (multiple reads)‏

14 “Pure Multipumping” Write as replicated memory (avoids fragmentation)‏

15 Methodology Generate design variations over space  Vary # of ports, depth, type of memories 1W/2R to 8W/16R 2 to 256 elements deep Pure-ALM, M9K, MLAB, Multipumped  Wrap in testbench for timing and correctness Target Quartus 9.0 to Stratix III  No synthesis optimizations for speed or area  Standard P&R effort (speed, avg. over 10 runs) Measure area as Total Equivalent Area  Expresses area in a single unit (ALMs)‏

16 Conventional Multi-Porting Performance

17 1W/2R Pure-ALM Area vs. Speed NiosII/f 290 MHz 500 ALMs Smaller Faster Too big and slow!

18 1W/2R Replicated vs. Pure-ALM

19 1W/2R “Pure Multipumping”

20 LVT-Based Multi-Ported Memories

21 LVT-Based Memory

22 LVT-Based Memory Begin with one block RAM

23 LVT-Based Memory Replicate for two read ports

24 LVT-Based Memory Bank for two write ports

25 LVT-Based Memory Select bank to read from

26 LVT-Based Memory Add bank lookup table

27 LVT-Based Memory

28 Live Value Table Operation

29 LVT Operation 2W/2R, 4-deep

30 W0W0 LVT Operation W0W0 W1W1 R0R0 R1R1 Live Value Table Write Addresses Read Addresses

31 W0W0 LVT Operation: Write W0W0 W1W1 R0R0 R1R Records which write port last updated a location

32 W0W0 LVT Operation: Read W0W0 W1W1 R0R0 R1R Steers read port to correct memory bank

33 LVT Implementation LVT remains practical because it is very narrow

34 LVT Operation Small Pure-ALM memorycontrolling larger block RAMs

35 Advantages of LVTs LVTs add a layer of indirection  Everything operates in parallel  Makes banked memory behave as consistent unit LVTs are narrow  Word width = log 2 (# of write ports) < 4 bits typically  Pure-ALM, but practical size and speed

36 LVT Performance

37 2W/4R Pure-ALM

38 2W/4R LVT-based vs. Pure-ALM 84% smaller 43% faster 412 MHz to 375 MHz

39 2W/4R Multipumping Must be careful about read/write ordering!

40 Multipumping Performance

41 2W/4R Multipumping

42 2W/4R Multipumping Pure Multipumping (279 MHz)‏

43 4W/8R Multipumping Worsens as # of ports increases

44 2W/4R Multipumping 28% smaller on average 193 MHz to 174 MHz 54% slower on average

45 Conclusions Pure multipumped memories are better for memories with few ports or low speed. LVT-based memories are faster and smaller than Pure-ALM memories. LVT-based memories are faster than pure multipumping, but at a cost in area.

46 Future Work Pure multipumping for LVT-based memories  Build banks with 2W/4R pure multipumping blocks  Possible further area improvement Relaxing the read/write order for multipumping  Allows multiplexing the write ports  Leaves designer to watch for WAR violations

47 Thank You