1 TputCache: High-Frequency, Multi-Way Cache for High-Throughput FPGA Applications
Aaron Severance, University of British Columbia
Advised by Guy Lemieux

2 Our Problem
- We use overlays for data processing: partially/fully fixed processing elements (virtual CGRAs, soft vector processors)
- Memory: large register files/scratchpad in the overlay provide low-latency, local data
- Trivial case (large DMA): burst to/from DDR
- Non-trivial case?

3 Scatter/Gather
- Data-dependent store/load (see the C model below):
    vscatter adr_ptr, idx_vect, data_vect
      for i in 1..N
        adr_ptr[idx_vect[i]] <= data_vect[i]
- Random narrow (32-bit) accesses waste bandwidth on DDR interfaces
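
As a point of reference, a minimal C model of the scatter semantics above (vscatter follows the slide's mnemonic; vgather is an assumed analogue not shown on the slide). Each 32-bit access lands at a data-dependent address, so consecutive accesses are non-contiguous, which is exactly the pattern that defeats DDR burst transfers.

    #include <stddef.h>
    #include <stdint.h>

    /* Scatter: store each element at a runtime-chosen index. */
    void vscatter(uint32_t *adr_ptr, const uint32_t *idx_vect,
                  const uint32_t *data_vect, size_t n) {
        for (size_t i = 0; i < n; i++)
            adr_ptr[idx_vect[i]] = data_vect[i];   /* random narrow write */
    }

    /* Gather: load each element from a runtime-chosen index. */
    void vgather(uint32_t *dst, const uint32_t *adr_ptr,
                 const uint32_t *idx_vect, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = adr_ptr[idx_vect[i]];         /* random narrow read */
    }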

4 If Data Fits on the FPGA…
- BRAMs with an interconnect network
- General network: not customized per application; shared (all masters to all slaves)
- Memory-mapped BRAM
- Double-pump (2x clk) if possible; banking/LVT/etc. for further ports (see the sketch below)
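
A small sketch of the banking idea from the last bullet, with assumed parameters (NUM_BANKS is hypothetical): low address bits select the bank so that masters touching different banks can proceed in the same cycle, while the remaining bits select the word within a bank.

    #include <stdint.h>

    #define NUM_BANKS 4u   /* assumed: a power of two */

    /* Low bits of the word address pick the bank... */
    static inline uint32_t bank_of(uint32_t word_addr) {
        return word_addr & (NUM_BANKS - 1);
    }

    /* ...and the remaining bits pick the word inside that bank. */
    static inline uint32_t offset_in_bank(uint32_t word_addr) {
        return word_addr / NUM_BANKS;
    }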

5 Example BRAM system

6 But if data doesn’t fit… (oversimplified)

7 So Let's Use a Cache
- But a throughput-focused cache
- Low-latency data held in local memories
- Amortize latency over multiple accesses
- Focus on bandwidth

8 Replace On-Chip Memory, or Augment the Memory Controller?
- Data fits on-chip: want BRAM-like speed and bandwidth, with low overhead compared to shared BRAM
- Data doesn't fit on-chip: use 'leftover' BRAMs for performance

9 TputCache Design Goals
- Fmax near BRAM Fmax
- Fully pipelined
- Support multiple outstanding misses
- Write coalescing
- Associativity

10 TputCache Architecture
- Replay-based architecture: reinsert misses back into the pipeline (see the model below)
- Separate line fill/evict logic runs in the background
- Token FIFO completes requests in order
- No MSHRs for tracking misses; fewer muxes (only a single replay-request mux)
- 6-stage pipeline -> 6 outstanding misses
- Good performance with a high hit rate: the common case is fast
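
A simplified software model of the replay idea (sizes, fill latency, and the direct-mapped organization are assumptions for illustration; the real design is multi-way with random replacement). A miss is not parked in an MSHR: the request is fed back into the pipeline input and retried until the background evict/fill logic has installed the line. Completions stay in issue order, as the token FIFO enforces in hardware.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES    4u   /* assumed, direct-mapped for brevity */
    #define FILL_LATENCY 3u   /* assumed cycles for a background fill */

    typedef struct { bool valid; uint32_t tag; } line_t;

    static line_t   cache[NUM_LINES];
    static bool     fill_busy;
    static uint32_t fill_addr, fill_left;

    static bool hit(uint32_t a) {
        line_t *l = &cache[a % NUM_LINES];
        return l->valid && l->tag == a / NUM_LINES;
    }

    /* Background evict/fill logic: installs the line after a delay. */
    static void fill_tick(void) {
        if (fill_busy && --fill_left == 0) {
            line_t *l = &cache[fill_addr % NUM_LINES];
            l->valid  = true;
            l->tag    = fill_addr / NUM_LINES;
            fill_busy = false;
        }
    }

    int main(void) {
        const uint32_t reqs[] = {1, 5, 1};   /* 5 conflicts with 1 */
        for (unsigned i = 0; i < 3; i++) {
            unsigned replays = 0;
            while (!hit(reqs[i])) {          /* miss: replay the request */
                if (!fill_busy) {
                    fill_busy = true;
                    fill_addr = reqs[i];
                    fill_left = FILL_LATENCY;
                }
                fill_tick();                 /* fill proceeds in parallel */
                replays++;
            }
            printf("req %u: addr %u hit after %u replays\n",
                   i, reqs[i], replays);
        }
        return 0;
    }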

11 TputCache Architecture

12 Cache Hit

13 Cache Miss

14 Evict/Fill Logic

15 Area & Fmax Results
- Reaches 253 MHz, compared to a 270 MHz BRAM Fmax, on Cyclone IV
- 423 MHz, compared to a 490 MHz BRAM Fmax, on Stratix IV
- Minor degradation with increasing size and associativity
- 13% to 35% extra BRAM usage for tags and queues

16 Benchmark Setup
- TputCache: 128 kB, 4-way, 32-byte lines
- MXP soft vector processor: 16 lanes, 128 kB scratchpad memory
- Scatter/Gather memory unit: indexed loads/stores per lane
- Double-pumping port adapters: TputCache runs at 2x the frequency of MXP

17 MXP Soft Vector Processor

18 Histogram
- Instantiate a number of Virtual Processors (VPs) mapped across lanes
- Each VP histograms part of the image
- A final pass sums the VP partial histograms (sketched below)
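
A minimal C sketch of the benchmark's structure (VP count and bin count are assumptions): each VP builds a private partial histogram over its slice of the image, then a final pass reduces the partials into one result.

    #include <stdint.h>
    #include <string.h>

    #define NUM_VPS 16u   /* assumed: one VP per lane */
    #define BINS    256u  /* 8-bit pixels */

    void histogram(const uint8_t *img, uint32_t npix, uint32_t hist[BINS]) {
        uint32_t partial[NUM_VPS][BINS];
        memset(partial, 0, sizeof partial);

        /* Each VP histograms a contiguous slice of the image. */
        for (uint32_t vp = 0; vp < NUM_VPS; vp++) {
            uint32_t lo = vp * npix / NUM_VPS;
            uint32_t hi = (vp + 1) * npix / NUM_VPS;
            for (uint32_t i = lo; i < hi; i++)
                partial[vp][img[i]]++;
        }

        /* Final pass: sum the VP partial histograms. */
        memset(hist, 0, BINS * sizeof *hist);
        for (uint32_t vp = 0; vp < NUM_VPS; vp++)
            for (uint32_t b = 0; b < BINS; b++)
                hist[b] += partial[vp][b];
    }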

19 Hough Transform
- Convert an image to 2D Hough space (angle, radius)
- Each vector element calculates the radius for a given angle and adds the pixel value to a counter (see the sketch below)
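
A minimal C sketch of the per-pixel work (the angle/radius parameterization is an assumption): for each angle, compute the radius r = x*cos(theta) + y*sin(theta) and accumulate the pixel value into the (angle, radius) counter. The accumulator updates are data-dependent scatters, the access pattern TputCache targets.

    #include <math.h>
    #include <stdint.h>

    #define PI     3.14159265358979323846
    #define ANGLES 180u
    #define RADII  512u   /* assumed: covers the image diagonal */

    void hough_accumulate(uint32_t acc[ANGLES][RADII],
                          uint32_t x, uint32_t y, uint8_t pixel) {
        for (uint32_t a = 0; a < ANGLES; a++) {
            double theta = a * PI / ANGLES;
            int r = (int)lround(x * cos(theta) + y * sin(theta));
            if (r >= 0 && r < (int)RADII)
                acc[a][r] += pixel;   /* scatter into Hough space */
        }
    }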

20 Motion Compensation
- Load a block from the reference image and interpolate (sketched below)
- The block is offset by a small amount from its location in the current image
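
A minimal C sketch of the access pattern (block size, half-pel motion vectors, and truncating rounding are assumptions): fetch a block offset by a motion vector from the reference image and bilinearly interpolate at half-pel positions. The reads walk a small 2D window rather than a long contiguous burst.

    #include <stdint.h>

    #define BLK 8u  /* assumed block size */

    /* Assumes non-negative motion vectors and a padded reference image
     * so the +1 neighbor reads stay in bounds. */
    void motion_comp(uint8_t out[BLK][BLK], const uint8_t *ref,
                     uint32_t stride,
                     uint32_t x, uint32_t y,  /* block origin in current image */
                     uint32_t mvx2, uint32_t mvy2) /* motion vector, half pels */
    {
        uint32_t bx = x + mvx2 / 2, by = y + mvy2 / 2;
        uint32_t fx = mvx2 & 1, fy = mvy2 & 1;   /* half-pel fractions */

        for (uint32_t j = 0; j < BLK; j++)
            for (uint32_t i = 0; i < BLK; i++) {
                const uint8_t *p = ref + (by + j) * stride + (bx + i);
                /* Bilinear average of the up-to-4 neighboring pels. */
                uint32_t s = p[0] + fx * p[1] + fy * p[stride]
                           + fx * fy * p[stride + 1];
                out[j][i] = (uint8_t)(s / ((1 + fx) * (1 + fy)));
            }
    }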

21 Future Work
- More ports, needed for scalability:
  - Share the evict/fill BRAM port with a 2nd request
  - Banking (sharing the same evict/fill logic)
  - Multiported BRAM designs
- Write cache:
  - Currently allocate-on-write
  - Track the dirty state of bytes in BRAMs (9th bit)
- Non-blocking behavior:
  - Multiple token FIFOs (one per requestor)?

22 FAQ
- Coherency: envisioned as the only/last-level cache; coherency support is future work
- Replay loops/problems: mitigated by random replacement + associativity
- Power: expected to be not great…

23 Conclusions
- TputCache: an alternative to shared BRAM
- Low overhead (13%-35% extra BRAM)
- Nearly as high Fmax (253 MHz vs. 270 MHz)
- More flexible than shared BRAM: performance degrades gradually; cache behavior instead of manual filling

24 Questions? Thank you