Published by Nicholas Mills. Modified over 8 years ago.
1
Mohamed ABDELFATTAH, Andrew BITAR, Vaughn BETZ
2
FPGAs are big!
- Design big systems
- High on-chip communication
[Figure: Module 1, Module 2, Module 3, Module 4]
3
FPGAs are big!
- Design big systems
- High on-chip communication
Design a bus for each set of connections:
- Wide links (100s of bits) built from single-bit programmable interconnect
- Tough timing constraints: physical distance affects frequency
- Somewhat unpredictable frequency, so buses must be re-pipelined
- Buses are unique to the application, and inefficient
[Figure: Module 1, Module 2, Module 3, Module 4 connected by Bus 1, Bus 2, Bus 3]
5
Embedded NoC
- System-level interconnect is used in most designs: harden it!
- NoC = complete interconnect: data transport, switching, buffering
- Built from links and routers; data travels as packets made of flits
- More efficient than soft buses, with huge bandwidth
- IEEE Micro 2014: smaller and lower power than a bus connecting 3 modules to DDR3
[Figure: Module 1, Module 2, Module 3, Module 4 attached to the embedded NoC]
6
NoC can transport 22.5 GB/s on one link
- Easy for designers
- No segmentation of data
- DDR3 max = 17 GB/s
- Run NoC at fixed, high frequency

Link Width   # Nodes   Area %   Freq.
150 bits     16        1.3 %    1.2 GHz

22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits
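The bandwidth figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the convention 1 GB = 10^9 bytes are assumptions, not taken from the slides.

```python
def link_bandwidth_gbytes(freq_hz, width_bits):
    """Raw link bandwidth in GB/s, assuming 1 GB = 1e9 bytes."""
    return freq_hz * width_bits / 8 / 1e9

# NoC side: 150-bit link at 1.2 GHz
noc_link = link_bandwidth_gbytes(1.2e9, 150)
# FPGA fabric side: 600-bit interface at 300 MHz
fabric_side = link_bandwidth_gbytes(300e6, 600)
# DDR3 maximum quoted on the slide
ddr3_max = 17.0

print(noc_link, fabric_side, noc_link > ddr3_max)
```

Both sides carry the same 22.5 GB/s, which is why a single NoC link can absorb a full-rate DDR3 stream without the designer splitting the data.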
8
How do we adapt Embedded NoCs to FPGAs?
1. FabricPort: NoC-to-FPGA interface
2. NoC with FPGA design styles
3. Case studies: JPEG & Ethernet switch
[Easy-to-use]
9
FabricPort = on-ramp to the NoC
Photo from: http://www.aaroads.com
10
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz

Width          # flits
0-150 bits     1
150-300 bits   2
300-450 bits   3
450-600 bits   4
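The width-to-flit table amounts to a ceiling division by the 150-bit flit width. The sketch below assumes that boundary widths (150, 300, 450) fall in the lower row and that widths start at 1 bit; the slide's ranges overlap at those boundaries, so this is an interpretation. The function name is hypothetical.

```python
import math

FLIT_WIDTH_BITS = 150   # fixed NoC link width from the slide
MAX_WIDTH_BITS = 600    # widest module interface from the slide

def flits_needed(width_bits):
    """Number of 150-bit flits needed to carry one module-side word."""
    if not 0 < width_bits <= MAX_WIDTH_BITS:
        raise ValueError("width must be between 1 and 600 bits")
    return math.ceil(width_bits / FLIT_WIDTH_BITS)

for width in (1, 150, 151, 300, 450, 600):
    print(width, flits_needed(width))
```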
11
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Time-domain multiplexing:
- Divide width by 4
- Multiply frequency by 4
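Time-domain multiplexing can be pictured as slicing one wide word into four narrow flits that are sent in four fast cycles. The sketch below models words as Python integers; sending the least-significant slice first is an assumption, and the function names are hypothetical.

```python
FLIT_BITS = 150   # NoC link width from the slide
RATIO = 4         # width divided by 4, frequency multiplied by 4

def tdm_split(word):
    """Slice a 600-bit word into four 150-bit flits (low bits first, by assumption)."""
    mask = (1 << FLIT_BITS) - 1
    return [(word >> (FLIT_BITS * i)) & mask for i in range(RATIO)]

def tdm_join(flits):
    """Inverse operation: rebuild the wide word from its four flits."""
    word = 0
    for i, flit in enumerate(flits):
        word |= flit << (FLIT_BITS * i)
    return word

word = (1 << 599) | 0xABCDEF
flits = tdm_split(word)
print(len(flits), tdm_join(flits) == word)
```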
12
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Time-domain multiplexing:
- Divide width by 4
- Multiply frequency by 4
Asynchronous FIFO:
- Cross into NoC clock
- No restriction on module frequency
13
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Time-domain multiplexing:
- Divide width by 4
- Multiply frequency by 4
Asynchronous FIFO:
- Cross into NoC clock
- No restriction on module frequency
NoC Writer:
- Track available buffers in NoC router
- Forward flits to NoC
- Backpressure
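The NoC Writer's job (track free router buffers, forward flits, apply backpressure) is the usual credit-based flow control pattern. The class below is a behavioral sketch only; the class name, method names, and the two-slot buffer are hypothetical and not taken from the slides.

```python
class NocWriter:
    """Behavioral model of credit-based flow control towards one router input."""

    def __init__(self, router_buffer_slots):
        # One credit per free buffer slot in the downstream router.
        self.credits = router_buffer_slots

    def can_send(self):
        return self.credits > 0

    def send(self, flit, link):
        """Forward a flit if a credit is available; otherwise apply backpressure."""
        if not self.can_send():
            return False          # caller must hold the flit and retry
        self.credits -= 1
        link.append(flit)
        return True

    def credit_return(self):
        """Called when the router frees a buffer slot."""
        self.credits += 1

link = []
writer = NocWriter(router_buffer_slots=2)
results = [writer.send(f, link) for f in ("f0", "f1", "f2")]
writer.credit_return()
retry = writer.send("f2", link)
print(results, retry, link)
```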
14
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Virtual Channels:
- Multiple buffers
- Logical separation of data
[Figure: packet on VC1, packet on VC2, interleaved flits from 2 packets]
16
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Virtual Channels:
- Multiple buffers
- Logical separation of data
Asynchronous FIFO:
- Reads packet from one VC at a time
- Supports priorities
[Figure: packet on VC1, packet on VC2, interleaved flits from 2 packets]
17
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
Virtual Channels:
- Multiple buffers
- Logical separation of data
Asynchronous FIFO:
- Reads packet from one VC at a time
- Supports priorities
TD Demultiplexing:
- Output 1 packet at a time
- Easy for designers
[Figure: packet on VC1, packet on VC2, interleaved flits from 2 packets]
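The output path (sort interleaved flits into per-VC buffers, then hand the module one whole packet at a time) can be sketched as follows. The flit layout `(vc, payload, is_tail)` is a hypothetical encoding, and emitting packets in the order their tail flits arrive is a simplification; the slides mention priorities between VCs, which this sketch does not model.

```python
from collections import defaultdict

def deinterleave(flits):
    """Group interleaved flits by VC and emit each packet whole once its tail arrives."""
    buffers = defaultdict(list)   # one buffer per virtual channel
    packets = []
    for vc, payload, is_tail in flits:
        buffers[vc].append(payload)
        if is_tail:
            packets.append((vc, buffers.pop(vc)))
    return packets

stream = [
    (1, "a1", False),
    (2, "b1", False),
    (1, "a2", False),
    (2, "b2", True),
    (1, "a3", True),
]
packets = deinterleave(stream)
print(packets)
```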
18
FPGA Module side: ready/valid, any* width (0-600 bits), frequency 100-400 MHz
Embedded NoC side: credits, fixed width (150 bits), frequency 1.2 GHz
combine_data mode:
- Packets from VC1: upper bits
- Packets from VC2: lower bits
[Figure: packet on VC1, packet on VC2]
19
How do we adapt Embedded NoCs to FPGAs?
1. FabricPort: NoC-to-FPGA interface
2. NoC with FPGA design styles
3. Case studies: JPEG & Ethernet switch
20
Rules for using an NoC with FPGA design styles
Photo from: http://www.dailymail.co.uk
21
FPGA Designs:
- Mostly streaming
- Cannot tolerate reordering
  - Hardware expensive and difficult
Multiprocessors:
- Memory mapped
- Packets arrive out-of-order
  - Fine for cache lines
  - Processors have re-order buffers
RULE: All packets with same src/dst must take same NoC path
[Figure: packets 1 and 2]
22
FPGA Designs:
- Mostly streaming
- Cannot tolerate reordering
  - Hardware expensive and difficult
Multiprocessors:
- Memory mapped
- Packets arrive out-of-order
  - Fine for cache lines
  - Processors have re-order buffers
RULE: All packets with same src/dst must take same NoC path
RULE: All packets with same src/dst must take same VC
[Figure: packets 1 and 2, one buffer marked FULL]
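One common way to guarantee that every packet between the same source and destination follows the same path is deterministic routing. The sketch below uses dimension-ordered (XY) routing on a 2D mesh; the mesh topology, the coordinate scheme, and the choice of XY routing are assumptions for illustration, since the slides do not name the routing algorithm.

```python
def xy_route(src, dst):
    """Deterministic path from src to dst: travel along X first, then along Y."""
    (x, y), (dst_x, dst_y) = src, dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path

first = xy_route((0, 0), (2, 1))
second = xy_route((0, 0), (2, 1))
print(first, first == second)
```

Because the path depends only on the source and destination, two packets of the same stream can never overtake each other by taking different routes.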
24
RULE: Each message type must take a different VC
- FabricPort output must support that
[Figure: packets 0, 1, 2]
25
Design Styles:
- Latency-insensitive
- Latency-sensitive
26
Latency-Insensitive
- Modules are stallable
- Stall-wrappers
- Ready/valid signals
- Tolerate latency on links
[Figure: modules A, B, C connected by data and ready signals]
28
Latency-Insensitive
- Modules are stallable
- Stall-wrappers
- Ready/valid signals
- Tolerate latency on links
Longer path, lower frequency:
- Insert pipeline regs
- Doesn’t affect functionality
[Figure: modules A, B, C connected by data and ready signals]
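The claim that pipeline registers do not affect functionality in a latency-insensitive design can be illustrated with a small cycle-by-cycle model: the consumer stalls on some cycles, and the link holds a configurable number of registers. This is a simplified behavioral sketch, not RTL; the function name and the stall pattern are hypothetical, the pattern must contain at least one ready cycle, and `None` is reserved to mark an empty register.

```python
def run_link(data, num_regs, ready_pattern):
    """Send data over a stallable link with num_regs pipeline registers.

    Returns the received sequence and the number of cycles taken.
    """
    stages = [None] * num_regs        # None marks an empty register
    pending = list(data)
    received = []
    cycle = 0
    while len(received) < len(data):
        ready = ready_pattern[cycle % len(ready_pattern)]
        if num_regs == 0:
            # Direct connection: transfer only when the consumer is ready.
            if ready and pending:
                received.append(pending.pop(0))
        else:
            # Consumer takes the last register's value when ready.
            if ready and stages[-1] is not None:
                received.append(stages[-1])
                stages[-1] = None
            # Each value advances only into an empty register (stall otherwise).
            for i in range(num_regs - 1, 0, -1):
                if stages[i] is None and stages[i - 1] is not None:
                    stages[i], stages[i - 1] = stages[i - 1], None
            if stages[0] is None and pending:
                stages[0] = pending.pop(0)
        cycle += 1
    return received, cycle

data = [1, 2, 3, 4, 5]
stalls = [True, False, False]         # consumer is ready one cycle in three
direct, direct_cycles = run_link(data, 0, stalls)
piped, piped_cycles = run_link(data, 3, stalls)
print(direct == piped, direct_cycles, piped_cycles)
```

The received sequence is identical with and without registers; only the cycle count grows, which is exactly the latency that a latency-insensitive design tolerates.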
29
Latency-Insensitive
- Modules are stallable
- Stall-wrappers
- Ready/valid signals
- Tolerate latency on links
NoC for latency-insensitive design:
- Pipelined
- Timing closure: yes
- Little/no physical info (?)
[Figure: modules A, B, C connected by data and ready signals]
30
Latency-Sensitive
- Not stallable
- No ready/valid signals
- Latency must remain constant to be functional
Permapaths (reserved paths):
- Never stall
- Fixed latency
[Figure: modules A, B, C connected by data signals]
31
How do we adapt Embedded NoCs to FPGAs?
1. FabricPort: NoC-to-FPGA interface
2. NoC with FPGA design styles
3. Case studies: JPEG & Ethernet switch
32
Simulate any RTL design with a flexible NoC simulator
Booksim:
- NoC simulator
- Cycle-accurate
- Widely-used
- Define traffic pattern
RTL (Verilog):
- Hardware designs
- FabricPort
- NoC looks exactly like a hardware module
Connection between the two: Sockets, SV DPI
www.eecg.utoronto.ca/~mohamed/rtl2booksim
33
Latency-sensitive [Permapaths]
- Parallelize 10-40 times
  - Data width 130-520 bits
- Important for data center applications
- Compare NoC version to non-NoC version
[Figure: parallel DCT, QNR, RLE pipelines; inputs Strobe and Pixel in, outputs valid and Pixel out]
[Timing: Strobe = 0 1 0 0 0 0 0 … 0; Pixel = X X P1 P2 P3 P4 P5 … P64]
Photo from: http://www.fbpapa.com
34
[Figure: placement of DCT, QNR, RLE. Plot series: Original [close]]
36
[Figure: placement of QNR, RLE, DCT with the critical path marked. Plot series: Original [close], Original [far]; annotation: 130 MHz = 50%]
37
[Figure: placement of QNR, RLE, DCT. Plot series: Original [close], Original [far], NoC [average]]
38
Frequency is good, and (much) more predictable with NoC
- 5% drop in frequency from best case
- 80% improvement from worst case
- ~6% variability regardless of module placement
- No drop in frequency for largest design
[Figure: placement of QNR, RLE, DCT. Plot series: Original [close], Original [far], NoC [average]]
39
NoC doesn’t produce routing hotspots; it reduces them
NoC reduces utilization of scarce long FPGA wires
[Figure labels: Max 40%, Max 100% [long wires]]
40
Important design: Ethernet Switch
Communications and networking are the largest FPGA customers
High-bandwidth Ethernet switches:
- ASIC land
  - 10/25/40/100 Gb/s per port switches, e.g. 32x32
- Best previous FPGA switch:
  - 16x16, 10 Gb/s per port, total 160 Gb/s
  - FPGAs have transceiver BW more than 1000 Gb/s
Photo from: http://wisegeek.org
41
Embedded NoC is the switch (16x16 switch)
Use RTL2Booksim to measure latency/throughput
42
Previous work (10 Gb/s) vs. NoC switch (10 Gb/s):
- Approx. half the latency
- Saturates at higher injection: handles more bursty traffic
43
Previous work (10 Gb/s) vs. NoC switch (10 Gb/s, 25 Gb/s, 40 Gb/s):
- 5X more switching than best: 819 Gb/s vs 160 Gb/s
- 3X smaller area
- ~ Half the latency
- Reconfigurable speed, radix, protocol
  - augment pkt processing
44
How do we adapt Embedded NoCs to FPGAs?
1. FabricPort: NoC-to-FPGA interface
   - Module (any freq./width) to Embedded NoC (1.2 GHz, 150 bits)
   - Designer-friendly
   - Leverage VCs, credits
2. NoC with FPGA design styles
   - Rules for NoC design on FPGAs: packet ordering and dependencies
   - Support FPGA design styles: latency-insensitive design, latency-sensitive design
3. Case studies: JPEG & Ethernet switch
   - Parallel JPEG: frequency predictable and good; reduces routing hotspots and long wire utilization
   - Ethernet switch: 5X more switching, 3X less area, reconfigurable
Compelling case for Embedded NoCs
45
www.eecg.utoronto.ca/~mohamed