Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ
2 FPGAs are big! Design big systems with high on-chip communication between modules (Module 1, Module 2, Module 3, Module 4).
3 Design a bus for each set of connections (Bus 1, Bus 2, Bus 3). Wide links are built from single-bit programmable interconnect: 100s of bits, tough timing constraints. Physical distance affects frequency, so frequency is somewhat unpredictable and buses must be re-pipelined. Buses are unique to the application, and inefficient.
5 Embedded NoC. System-level interconnect is used in most designs, so harden it! NoC = complete interconnect: data transport, switching, buffering, built from links and routers. Data travels as packets made of flits. More efficient than soft buses, with huge bandwidth. IEEE Micro 2014: smaller and less power than a bus connecting 3 modules to DDR3.
6 NoC can transport 22.5 GB/s on one link; DDR3 max = 17 GB/s. Easy for designers: no segmentation of data. Run NoC at fixed, high frequency. NoC parameters: link width 150 bits; # nodes 16; area 1.3%; frequency 1.2 GHz. 22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits.
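The bandwidth identity on this slide can be checked with a few lines of arithmetic. This is only a sketch: the helper name `link_bandwidth_gbytes` is hypothetical, and 1 GB is taken as 10^9 bytes.

```python
def link_bandwidth_gbytes(freq_hz: float, width_bits: int) -> float:
    """Raw link bandwidth in GB/s, taking 1 GB = 1e9 bytes."""
    return freq_hz * width_bits / 8 / 1e9


# NoC side: narrow and fast (values from the slide).
noc_side = link_bandwidth_gbytes(1.2e9, 150)
# FPGA fabric side: four times wider, four times slower (values from the slide).
fabric_side = link_bandwidth_gbytes(300e6, 600)
# DDR3 peak quoted on the slide, in GB/s.
DDR3_MAX_GBYTES = 17.0

print(f"NoC link:    {noc_side} GB/s")
print(f"Fabric side: {fabric_side} GB/s")
print(f"One link exceeds DDR3 max: {noc_side > DDR3_MAX_GBYTES}")
```

Both sides carry the same 22.5 GB/s, which is why a 600-bit module interface at 300 MHz can be served by one 150-bit link at 1.2 GHz.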
8 How do we adapt Embedded NoCs to FPGAs? 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch [Easy-to-use]
9 FabricPort = on-ramp to the NoC
10 FabricPort input, from FPGA module to embedded NoC. Module side: ready/valid, any* width, frequency in MHz. NoC side: credits, fixed width 150 bits, frequency 1.2 GHz. A table maps the module-side width to # flits, up to 4.
11 Time-domain multiplexing: divide width by 4, multiply frequency by 4.
12 Asynchronous FIFO: cross into the NoC clock; no restriction on module frequency.
13 NoC Writer: track available buffers in the NoC router; forward flits to the NoC; apply backpressure.
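The input path described above (split a wide word into 150-bit flits, then forward them only while the router has buffer space) can be sketched as follows. This is a minimal illustration, not the FabricPort implementation: the names `to_flits`, `from_flits`, and `NocWriter`, the most-significant-flit-first ordering, and the credit count are all assumptions.

```python
from typing import List

FLIT_BITS = 150  # fixed NoC link width from the slides


def to_flits(word: int, width_bits: int) -> List[int]:
    """Split a width_bits-wide word into 150-bit flits (assumed MSB-first)."""
    n_flits = -(-width_bits // FLIT_BITS)  # ceiling division
    mask = (1 << FLIT_BITS) - 1
    return [(word >> (FLIT_BITS * i)) & mask for i in reversed(range(n_flits))]


def from_flits(flits: List[int]) -> int:
    """Inverse of to_flits: reassemble the original word."""
    word = 0
    for flit in flits:
        word = (word << FLIT_BITS) | flit
    return word


class NocWriter:
    """Credit counter: one credit per free buffer slot in the NoC router."""

    def __init__(self, credits: int) -> None:
        self.credits = credits
        self.sent: List[int] = []

    def ready(self) -> bool:
        # Backpressure: not ready when the router has no free buffers.
        return self.credits > 0

    def send(self, flit: int) -> bool:
        if not self.ready():
            return False
        self.credits -= 1
        self.sent.append(flit)
        return True

    def credit_return(self) -> None:
        # The router freed a buffer slot.
        self.credits += 1


word = (1 << 599) | 0xABCDEF          # an arbitrary 600-bit value
flits = to_flits(word, 600)
writer = NocWriter(credits=2)          # hypothetical buffer depth
accepted = [writer.send(f) for f in flits]
print(len(flits), accepted)
```

With two credits, the writer accepts the first two flits and stalls the remaining ones until `credit_return()` is called, which is the backpressure behaviour the slide names.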
14 FabricPort output, from embedded NoC to FPGA module. NoC side: credits, fixed width 150 bits, frequency 1.2 GHz. Module side: ready/valid, any* width, frequency in MHz. Virtual channels: multiple buffers, logical separation of data. Flits from 2 packets (a packet on VC1 and a packet on VC2) arrive interleaved.
16 Asynchronous FIFO: reads a packet from one VC at a time; supports priorities.
17 TD demultiplexing: output 1 packet at a time, which is easy for designers.
18 combine_data mode: packets from VC1 on the upper bits, packets from VC2 on the lower bits.
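The output-side behaviour can be sketched in a few lines: separate interleaved flits by virtual channel, then either deliver one packet at a time or place the two VCs side by side as in combine_data mode. The function names, the `(vc, payload)` flit representation, and the 300-bit half width are assumptions made for illustration; the slides do not specify them.

```python
from typing import Dict, List, Tuple

Flit = Tuple[int, int]  # (virtual channel id, payload): hypothetical encoding


def deinterleave(flits: List[Flit]) -> Dict[int, List[int]]:
    """Group interleaved flits into one ordered list per virtual channel."""
    per_vc: Dict[int, List[int]] = {}
    for vc, payload in flits:
        per_vc.setdefault(vc, []).append(payload)
    return per_vc


def combine_data(vc1_word: int, vc2_word: int, half_bits: int = 300) -> int:
    """Place the VC1 word on the upper bits and the VC2 word on the lower bits."""
    assert vc2_word < (1 << half_bits), "VC2 word must fit in the lower half"
    return (vc1_word << half_bits) | vc2_word


# Two packets whose flits arrive interleaved on the NoC link.
arrivals = [(1, 0xA0), (2, 0xB0), (1, 0xA1), (2, 0xB1)]
packets = deinterleave(arrivals)
combined = combine_data(0x1, 0x2)
print(packets)
print(hex(combined >> 300), hex(combined & ((1 << 300) - 1)))
```

Keeping a separate buffer per VC is what lets the port hand a module one whole packet at a time even though the link interleaves flits.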
19 How do we adapt Embedded NoCs to FPGAs? 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch
20 Rules for using an NoC with FPGA design styles
21 Multiprocessors: memory mapped; packets arrive out of order, which is fine for cache lines, and processors have re-order buffers. FPGA designs: mostly streaming; cannot tolerate reordering, because the hardware is expensive and difficult. RULE: All packets with same src/dst must take same NoC path.
22 RULE: All packets with same src/dst must take same VC.
24 RULE: Each message type must take a different VC. FabricPort output must support that.
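One way to satisfy these rules is to make both the path and the VC a pure function of the packet's fields, so that repeated packets can never diverge. The sketch below assumes a 4x4 mesh with dimension-ordered (XY) routing and two message types; the slides state only that the NoC has 16 nodes, so the topology, the routing function, and the names `xy_route` and `VC_FOR_TYPE` are illustrative assumptions.

```python
from typing import List

MESH_COLS = 4  # assumption: 16 nodes arranged as a 4x4 mesh


def xy_route(src: int, dst: int, cols: int = MESH_COLS) -> List[int]:
    """Deterministic XY routing: move along X first, then along Y."""
    x, y = src % cols, src // cols
    dst_x, dst_y = dst % cols, dst // cols
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * cols + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * cols + x)
    return path


# Hypothetical message types, each pinned to its own virtual channel.
VC_FOR_TYPE = {"request": 0, "reply": 1}


def vc_for(msg_type: str) -> int:
    """The VC depends only on the message type, never on network load."""
    return VC_FOR_TYPE[msg_type]


print(xy_route(0, 15))
print(vc_for("request"), vc_for("reply"))
```

Because neither function looks at congestion or buffer occupancy, two packets of the same type between the same pair of nodes always share a path and a VC, so they cannot overtake each other.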
25 Design styles: latency-insensitive and latency-sensitive
26 Latency-insensitive design: modules (A, B, C) are stallable and sit inside stall-wrappers. Ready/valid signals travel with the data, so the design tolerates latency on links.
28 Longer path, lower frequency: insert pipeline registers. Doesn't affect functionality.
29 NoC for latency-insensitive designs: pipelined; timing closure: yes; little/no physical info (?).
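The claim that extra pipeline registers do not affect the functionality of a latency-insensitive design can be illustrated with a small cycle-level model. This is a sketch under simplifying assumptions: each register is a one-entry buffer, the consumer stalls at random, and the names `run_pipeline`, `stall_prob`, and `seed` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def run_pipeline(items: List[int], stages: int,
                 stall_prob: float = 0.3, seed: int = 0) -> Tuple[List[int], int]:
    """Push items through `stages` registers (stages >= 1) to a consumer
    that randomly stalls. Returns the received items and the cycle count."""
    rng = random.Random(seed)
    regs: List[Optional[int]] = [None] * stages
    pending = list(items)
    received: List[int] = []
    cycles = 0
    while len(received) < len(items):
        cycles += 1
        # Consumer side: accept data only when ready and data is valid.
        consumer_ready = rng.random() >= stall_prob
        if consumer_ready and regs[-1] is not None:
            received.append(regs[-1])
            regs[-1] = None
        # Each register advances only if the next one is empty (ready).
        for i in range(stages - 1, 0, -1):
            if regs[i] is None and regs[i - 1] is not None:
                regs[i] = regs[i - 1]
                regs[i - 1] = None
        # Producer side: inject a new item when the first register is free.
        if regs[0] is None and pending:
            regs[0] = pending.pop(0)
    return received, cycles


data = list(range(20))
for depth in (1, 2, 5, 9):
    out, cycles = run_pipeline(data, depth)
    print(depth, out == data, cycles)
```

Whatever the pipeline depth, the consumer sees the same items in the same order; only the cycle count changes, which is why such designs tolerate the variable latency of an NoC link.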
30 Latency-sensitive design: modules (A, B, C) are not stallable; no ready/valid signals; latency must remain constant for the design to be functional. Permapaths: reserved paths that never stall and have fixed latency.
31 How do we adapt Embedded NoCs to FPGAs? 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch
32 Simulate any RTL design with a flexible NoC simulator. Booksim: NoC simulator, cycle-accurate, widely used; define traffic pattern. RTL (Verilog): hardware designs and FabricPort; the NoC looks exactly like a hardware module. The two sides are connected through sockets and the SystemVerilog DPI.
33 Case study: parallel JPEG. Latency-sensitive [Permapaths]. Pipeline stages: DCT, QNR, RLE. Parallelize 10 to 40 times: data width 130 to 520 bits. Important for data center applications. Compare NoC version to non-NoC version. [Waveform: Strobe, Pixel in (P1 … P64), valid, Pixel out.]
34 [Figure: placement of the DCT, QNR, and RLE modules. Original [close].]
36 Original [far]: critical path. 130 MHz = 50%.
37 NoC [average], shown alongside Original [close] and Original [far].
38 Frequency is good, and (much) more predictable with NoC: 5% drop in frequency from best case; 80% improvement from worst case; ~6% variability regardless of module placement; no drop in frequency for largest design.
39 NoC doesn't produce routing hotspots; it reduces them. NoC reduces utilization of scarce long FPGA wires. [Figure: long-wire utilization, labeled Max 40% and Max 100%.]
40 High-bandwidth Ethernet switches. Communications and networking are the largest FPGA customers; important design: Ethernet switch. ASIC land: 10/25/40/100 Gb/s per port switches, e.g. 32x32. Best previous FPGA switch: 16x16, 10 Gb/s per port, total 160 Gb/s. FPGAs have transceiver BW of more than 1000 Gb/s.
41 Embedded NoC is the switch (16x16 switch). Use RTL2Booksim to measure latency/throughput.
42 NoC switch vs. previous work, both at 10 Gb/s: approx. half the latency; saturates at higher injection, so it handles more bursty traffic.
43 NoC switch at 10, 25, and 40 Gb/s vs. previous work at 10 Gb/s: 3X smaller area; 5X more switching than best (819 Gb/s vs 160 Gb/s); ~half the latency. Reconfigurable speed, radix, protocol; augment pkt processing.
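The "5X more switching" figure follows directly from the two totals on the slides. A quick check, with hypothetical variable names and with the per-port figure derived under the assumption that the NoC switch keeps the 16x16 radix stated earlier:

```python
PORTS = 16                      # 16x16 switch (from the slides)
PREV_PER_PORT_GBPS = 10         # best previous FPGA switch, per port
NOC_TOTAL_GBPS = 819            # NoC switch total (from the slides)

prev_total_gbps = PORTS * PREV_PER_PORT_GBPS   # aggregate of previous work
ratio = NOC_TOTAL_GBPS / prev_total_gbps       # improvement factor
noc_per_port_gbps = NOC_TOTAL_GBPS / PORTS     # assumes the same 16 ports

print(prev_total_gbps, round(ratio, 2), round(noc_per_port_gbps, 1))
```

The ratio comes out slightly above 5, consistent with the rounded claim on the slide.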
44 How do we adapt Embedded NoCs to FPGAs? Compelling case for Embedded NoCs.
1. FabricPort: NoC-to-FPGA interface. Module (any freq./width) to embedded NoC (1.2 GHz, 150 bits). Designer-friendly. Leverages VCs, credits.
2. NoC with FPGA design styles. Rules for NoC design on FPGAs: packet ordering and dependencies. Supports FPGA design styles: latency-insensitive design, latency-sensitive design.
3. Case studies: JPEG & Ethernet switch. Parallel JPEG: frequency predictable and good; reduces routing hotspots and long wire utilization. Ethernet switch: 5X more switching, 3X less area, reconfigurable.