TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS. Aaron Severance, University of British Columbia. Advised by Guy Lemieux.

1 TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS. Aaron Severance, University of British Columbia. Advised by Guy Lemieux.

2 Our Problem. We use overlays for data processing: partially or fully fixed processing elements (virtual CGRAs, soft vector processors). Memory: large register files or scratchpads in the overlay hold low-latency, local data. Trivial case (large DMA): burst to/from DDR. Non-trivial case?

3 Scatter/Gather. Data-dependent store/load:
    vscatter adr_ptr, idx_vect, data_vect
    for i in 1..N
        adr_ptr[idx_vect[i]] <= data_vect[i]
Random narrow (32-bit) accesses waste bandwidth on DDR interfaces.
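The scatter semantics on this slide can be written out as a minimal Python sketch. The word-addressed `memory` list, the integer `base` standing in for `adr_ptr`, and the `vgather` helper (the slide only names `vscatter`) are illustrative assumptions, not part of the original design.

```python
def vscatter(memory, base, idx_vect, data_vect):
    """Data-dependent store: memory[base + idx_vect[i]] = data_vect[i]."""
    for idx, value in zip(idx_vect, data_vect):
        memory[base + idx] = value


def vgather(memory, base, idx_vect):
    """Data-dependent load (hypothetical counterpart of vscatter)."""
    return [memory[base + idx] for idx in idx_vect]


# Each element lands at an address chosen by the index vector, so
# consecutive elements need not touch consecutive words.
memory = [0] * 16
vscatter(memory, 4, [3, 0, 7], [10, 20, 30])
gathered = vgather(memory, 4, [7, 3, 0])
```

Because every element may touch a different line of external memory, a burst-oriented DDR interface ends up fetching far more data than the 32 bits each access actually uses.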

4 If Data Fits on the FPGA… BRAMs with an interconnect network. General network: not customized per application; shared, so all masters connect to all slaves. Memory-mapped BRAM: double-pump (2x clk) if possible; banking, LVT, etc. for further ports.

5 Example BRAM system

6 But if data doesn’t fit… (oversimplified)

7 So Let's Use a Cache. But a throughput-focused cache: low-latency data is held in local memories; amortize latency over multiple accesses; focus on bandwidth.

8 Replace on-chip memory or augment memory controller? If data fits on-chip: want BRAM-like speed and bandwidth, with low overhead compared to shared BRAM. If data doesn’t fit on-chip: use ‘leftover’ BRAMs for performance.

9 TputCache Design Goals. Fmax near BRAM Fmax; fully pipelined; support multiple outstanding misses; write coalescing; associativity.

10 TputCache Architecture. Replay-based architecture: reinsert misses back into the pipeline; separate line fill/evict logic runs in the background; token FIFO for completing requests in order. No MSHRs for tracking misses; fewer muxes (only a single replay request mux). 6-stage pipeline -> 6 outstanding misses. Good performance with a high hit rate: make the common case fast.
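The replay idea can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions and not the actual RTL: one request is examined per step, there is no eviction, and the `fill_latency` parameter and all function and variable names are hypothetical.

```python
from collections import deque


def run_replay_cache(addresses, line_bytes=32, fill_latency=3):
    """Return (addresses in completion order, number of replays)."""
    cached_lines = set()       # lines present in the cache (no eviction modelled)
    pending_fills = {}         # line -> steps left until the background fill lands
    pipeline = deque(enumerate(addresses))      # (token, address) in issue order
    token_fifo = deque(range(len(addresses)))   # tokens in issue order
    done = {}                  # token -> address, set once the access hits
    completed = []
    replays = 0
    while pipeline:
        # Background fill logic advances independently of the request pipeline.
        for line in list(pending_fills):
            pending_fills[line] -= 1
            if pending_fills[line] == 0:
                del pending_fills[line]
                cached_lines.add(line)
        token, addr = pipeline.popleft()
        line = addr // line_bytes
        if line in cached_lines:
            done[token] = addr                      # hit: request is finished
        else:
            pending_fills.setdefault(line, fill_latency)
            pipeline.append((token, addr))          # miss: reinsert (replay)
            replays += 1
        # The token FIFO releases results strictly in issue order.
        while token_fifo and token_fifo[0] in done:
            completed.append(done.pop(token_fifo.popleft()))
    return completed, replays


requests = [0, 4, 64, 8]
completed, replays = run_replay_cache(requests)
```

In this model the request for address 8 hits before the earlier requests have finished, but the token FIFO holds its result back until they complete. No per-miss bookkeeping structure is needed, because a missing request simply keeps circulating until its line arrives.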

11 TputCache Architecture

12 Cache Hit

13 Cache Miss

14 Evict/Fill Logic

15 Area & Fmax Results. Reaches 253MHz compared to 270MHz BRAM fmax on Cyclone IV, and 423MHz compared to 490MHz BRAM fmax on Stratix IV. Minor degradation with increasing size and associativity. 13% to 35% extra BRAM usage for tags and queues.
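As a quick check on how close the cache gets to raw BRAM speed, the ratios can be computed directly from the figures on this slide. Only the four frequencies come from the slide; the variable names are illustrative.

```python
# (TputCache fmax, BRAM fmax) in MHz, as reported on the slide.
results_mhz = {"Cyclone IV": (253, 270), "Stratix IV": (423, 490)}

# Fraction of the BRAM fmax that the cache reaches on each device.
ratios = {device: cache / bram for device, (cache, bram) in results_mhz.items()}

for device, ratio in ratios.items():
    print(f"{device}: {ratio:.1%} of BRAM fmax")
```

This works out to roughly 94% of BRAM fmax on Cyclone IV and roughly 86% on Stratix IV.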

16 Benchmark Setup. TputCache: 128kB, 4-way, 32-byte lines. MXP soft vector processor: 16 lanes, 128kB scratchpad memory. Scatter/gather memory unit: indexed loads/stores per lane. Double-pumping port adapters: TputCache runs at 2x the frequency of MXP.
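The benchmark cache geometry implies a particular address split. The sketch below assumes 1 kB = 1024 bytes and the conventional set-associative layout (low bits select the byte within a line, the next bits select the set); the slides do not state how TputCache actually indexes its sets, so the mapping is an assumption.

```python
CACHE_BYTES = 128 * 1024   # 128kB cache
WAYS = 4                   # 4-way set associative
LINE_BYTES = 32            # 32-byte lines

NUM_LINES = CACHE_BYTES // LINE_BYTES        # total cache lines
NUM_SETS = NUM_LINES // WAYS                 # lines are grouped into sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1       # log2 of the set count


def split_address(addr):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions the cache holds 4096 lines in 1024 sets, using 5 offset bits and 10 index bits, with the remaining address bits stored as the tag.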

17 MXP Soft Vector Processor

18 Histogram. Instantiate a number of Virtual Processors (VPs) mapped across lanes. Each VP histograms part of the image. A final pass sums the VP partial histograms.
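A scalar Python sketch of this scheme follows. The interleaved assignment of pixels to VPs, the 256-bin default, and the function name are assumptions made for illustration; the slide only says that each VP handles part of the image.

```python
def vp_histogram(pixels, num_vps=4, num_bins=256):
    """Histogram with one private partial histogram per virtual processor."""
    partials = [[0] * num_bins for _ in range(num_vps)]
    for i, pixel in enumerate(pixels):
        # Assumed mapping: pixel i belongs to VP (i mod num_vps).
        # The bin update is a data-dependent read-modify-write.
        partials[i % num_vps][pixel] += 1
    # Final pass: sum the partial histograms bin by bin.
    return [sum(column) for column in zip(*partials)]


histogram = vp_histogram([0, 1, 1, 255, 1, 0])
```

Keeping a private histogram per VP avoids two VPs updating the same counter at once, at the cost of the final reduction pass.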

19 Hough Transform. Convert an image to 2D Hough space (angle, radius). Each vector element calculates the radius for a given angle and adds the pixel value to a counter.
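The accumulation described here can be sketched in scalar Python, with the inner loop over angles standing in for the vector elements. The number of angles, the rounding of the radius, and the accumulator layout are illustrative assumptions and are not taken from the benchmark.

```python
import math


def hough_transform(image, num_angles=180):
    """Accumulate pixel values into (angle, radius) Hough space."""
    height, width = len(image), len(image[0])
    max_radius = math.ceil(math.hypot(width, height))
    # Radii can be negative, so each row is offset by max_radius.
    accumulator = [[0] * (2 * max_radius + 1) for _ in range(num_angles)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel == 0:
                continue  # adding zero would not change any counter
            for a in range(num_angles):   # one vector element per angle
                theta = math.pi * a / num_angles
                radius = round(x * math.cos(theta) + y * math.sin(theta))
                # Data-dependent address: this is the scatter access.
                accumulator[a][radius + max_radius] += pixel
    return accumulator, max_radius


image = [[0, 0, 5],
         [0, 0, 0],
         [0, 0, 0]]
accumulator, max_radius = hough_transform(image)
```

The counter address depends on the computed radius, which is why this benchmark exercises the indexed load/store path.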

20 Motion Compensation. Load a block from the reference image and interpolate. The block is offset by a small amount from its location in the current image.
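A minimal sketch of the access pattern: fetch a block from the reference image at a small offset and interpolate it. The slide does not say which interpolation filter the benchmark uses, so the horizontal half-pixel average below, along with the function name and parameters, is purely an assumption.

```python
def motion_compensate(reference, x, y, dx, dy, size=4):
    """Fetch a size x size block at (x + dx, y + dy) and interpolate it."""
    block = []
    for row in range(size):
        src = reference[y + dy + row]
        start = x + dx
        # Assumed filter: rounded average of each pixel and its right neighbor.
        block.append([(src[start + col] + src[start + col + 1] + 1) // 2
                      for col in range(size)])
    return block


# Synthetic 8x8 reference image whose pixel value encodes its position.
reference = [[col + 10 * row for col in range(8)] for row in range(8)]
block = motion_compensate(reference, x=2, y=2, dx=1, dy=-1, size=2)
```

Since the offset differs from block to block, the block start address is data dependent even though the pixels within each block row are contiguous.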

21 Future Work. More ports needed for scalability: share the evict/fill BRAM port with a 2nd request; banking (sharing the same evict/fill logic); multiported BRAM designs. Write cache: allocate on write currently; track dirty state of bytes in the BRAMs' 9th bit. Non-blocking behavior: multiple token FIFOs (one per requestor)?

22 FAQ. Coherency: envisioned as the only cache or the LLC; future work. Replay loops/problems: random replacement + associativity. Power: expected to be not great…

23 Conclusions. TputCache: an alternative to shared BRAM. Low overhead (13%-35% extra BRAM). Nearly as high fmax (253MHz vs 270MHz). More flexible than shared BRAM: performance degrades gradually; cache behavior instead of manual filling.

24 Questions? Thank you

