Mohamed ABDELFATTAH Vaughn BETZ
Outline
1. Why NoCs on FPGAs?
   – Motivation
   – Previous Work
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
1. Why NoCs on FPGAs?
FPGA components:
– Logic Blocks
– Interconnect: Switch Blocks, Wires
– Hard Blocks: Memory, Multiplier, Processor
– Hard Interfaces: DDR/PCIe..
Interconnect still the same
Clock frequencies shown: 1600 MHz, 200 MHz, 800 MHz
Hard interfaces shown: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet
Problems with the current interconnect:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect
Barcelona vs. Los Angeles (Source: Google Earth)
Keep the “roads”, but add “freeways”.
Figure labels: Hard Blocks, Logic Cluster
NoC = Routers + Links
– Router forwards data packet
– Router moves data to local interconnect
NoC properties that address the five problems listed earlier:
– High bandwidth endpoints known
– Pre-design NoC to requirements
– NoC links are “re-usable”
– Latency-tolerant communication
– NoC abstraction favors modularity
Implementation options:
– Soft Logic (LUTs, ..)
– Hard Logic (unchangeable)
– Mixed Soft/Hard
Soft NoC (Configurability):
– Build as needed out of LUTs
– Tailor to application
– Slower, bigger
Hard NoC (Efficiency):
– Must build the whole thing
– Must be general enough for any application
– Faster, smaller
Investigate the hard vs. soft tradeoff for NoCs (area/delay)
Previous Work
FPGA-tuned Soft NoCs:
– LiPar (2005), NoCeM (2008), Connect (2012)
Hard NoCs:
– Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs
Applications that leverage NoCs:
– Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing
Our Contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric
Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
   – NoC Architecture
   – Methodology
   – Soft NoC design
   – Results
   – Area/Speed Efficiency Gap
3. Integrating hard NoCs with FPGA
2. Hard/Soft Efficiency
NoC = Routers + Links
State-of-the-art router architecture from Stanford:
1. The NoC community has excelled at building routers: we just use one
2. To meet FPGA bandwidth requirements: high-performance router
3. A complex router includes a superset of the NoC components that may be used: more complete analysis
Split router into 5 components
Router components and the parameters that determine their size:
– Input Modules: Multi-Queue Buffer = Memory + Control Logic (Port Width, Buffer Depth, Number of VCs)
– Crossbar: Multiplexers = Logic + crowded interconnect (Port Width, Number of Ports)
– Output Modules: Retiming Register = Registers + little control logic (Port Width, Number of VCs)
– Allocators: Arbiters = Logic + Registers (Number of Ports, Number of VCs)
5 Components: Input Module, Crossbar, VC Allocator, SW Allocator, Output Module
4 Parameters: Port Width, Number of Ports, Number of VCs, Buffer Depth
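The component/parameter breakdown above can be written down as a small data structure. This is only an illustrative sketch: the names `RouterParams`, `COMPONENT_PARAMS` and `components_affected_by` are hypothetical, and applying the slide's single "Allocators" parameter list to both the VC and SW allocators is an assumption. The example values are the "typical router" quoted later in the talk, with "10 buffer words" read as the buffer depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterParams:
    """The four router parameters varied in the study (names are hypothetical)."""
    port_width: int    # bits per port
    num_ports: int     # number of router ports
    num_vcs: int       # virtual channels
    buffer_depth: int  # buffer words


# Parameters that each of the five components depends on, as listed on the slides.
# Assumption: both allocators share the slide's single "Allocators" parameter list.
COMPONENT_PARAMS = {
    "input_module": ("port_width", "buffer_depth", "num_vcs"),
    "crossbar": ("port_width", "num_ports"),
    "vc_allocator": ("num_ports", "num_vcs"),
    "sw_allocator": ("num_ports", "num_vcs"),
    "output_module": ("port_width", "num_vcs"),
}


def components_affected_by(param):
    """Return the sorted names of the components whose size depends on `param`."""
    return sorted(name for name, params in COMPONENT_PARAMS.items() if param in params)


# The "typical router" configuration from the talk.
typical = RouterParams(port_width=32, num_ports=5, num_vcs=2, buffer_depth=10)
```

For example, `components_affected_by("buffer_depth")` returns only the input module, which matches buffer depth being a property of the multi-queue buffer alone.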
Methodology
– Post-routing FPGA (soft) area and delay
– Post-synthesis ASIC (hard) area and delay
– Both in TSMC 65 nm technology (Stratix III)
– Verify results against previous FPGA:ASIC comparison by Kuon and Rose
– Hard/soft efficiency measured per router component
Input module memory (soft)
– Relatively small memories
– Critical component in router design
– 3 options for implementation on FPGA:
   Registers: one per LUT, 0.77 Kbit/mm²
   LUTRAM: 640 bits, 23 Kbit/mm²
   Block RAM: 9 Kbits, [...] Kbit/mm²
– Area of each implementation option plotted for Width = 32 bits (plot annotation: another logic cluster used)
– A BRAM that is only 16% utilized is more area efficient than a fully used LUTRAM (valid for Stratix III)
– LUTRAM could win for some points in other FPGAs
– Use BRAM for FPGA (soft) implementation
Soft NoC design
– High port count inefficient in soft (ratios shown: 24X – 94X, 60X – 170X)
– Width scales better (ratios shown: 26X – 17X, 72X)
– Buffer depth is free on FPGAs when using BRAM (until the BRAM fills up)
– Use BRAM for FPGA (soft) implementation
Design recommendations are based on FPGA silicon area and supported by delay measurements
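The claim that buffer depth is free when using BRAM can be illustrated with a little arithmetic. The sketch below makes several assumptions that are not in the talk: "9 Kbits" is taken as 9 * 1024 bits, buffer depth is taken as words per VC, and real BRAM width/depth mode restrictions are ignored. The function names are hypothetical.

```python
BRAM_BITS = 9 * 1024  # assumption: "9 Kbits" per block RAM means 9216 bits


def brams_needed(port_width, buffer_depth, num_vcs=1):
    """Block RAMs needed for one input buffer, ignoring BRAM mode restrictions."""
    bits = port_width * buffer_depth * num_vcs
    return -(-bits // BRAM_BITS)  # ceiling division


def max_free_depth(port_width, num_vcs=1):
    """Deepest buffer (words per VC) that still fits in a single block RAM."""
    return BRAM_BITS // (port_width * num_vcs)


# A 32-bit, 2-VC buffer costs one BRAM whether it holds 10 or 144 words per VC.
shallow = brams_needed(32, 10, num_vcs=2)
deep = brams_needed(32, max_free_depth(32, num_vcs=2), num_vcs=2)
```

Under these assumptions both `shallow` and `deep` are 1, so the extra depth costs no additional silicon until the block RAM is full.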
Hard vs. soft area ratio per router component (legend: Memory = Logic + Registers)

Router Component    Mean Area Ratio    LUT:REG
Input Module        17                 --
Crossbar            85                 --
VC Allocator        48                 8:1
Switch Allocator    56                 20:1
Output Module       39                 0.6:1
Router              30
Hard vs. soft delay ratio per router component

Router Component    Mean Delay Ratio
Input Module        2.9
Crossbar            4.4
VC Allocator        3.9
Switch Allocator    3.3
Output Module       3.4
Router              3.6
Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
   – Hard NoC + FPGA
   – Wiring
   – Conclusion
   – Future Work
3. Hard NoC with FPGA

Router Component    Area Ratio    Delay Ratio
Input Module        17            2.9
Crossbar            85            4.4
VC Allocator        48            3.9
Switch Allocator    56            3.3
Output Module       39            3.4
Router              30            3.6

Figure annotations: % Total Area, Critical Path (40%, 10%)
Results suggest hardening Crossbar and Allocators: a mixed hard/soft implementation
For a typical router: 5 ports, 32 bits wide, 2 VCs, 10 buffer words

         Soft            Hard              Mixed
Area     4.1 mm² (1X)    0.14 mm² (30X)    2.3 mm² (1.8X)
Speed    150 MHz (1X)    810 MHz (5X)      390 MHz (2.5X)

Mixed: not worth hardening
Open questions:
– How to connect hard and soft?
– How efficient is mixed/hard after doing that?
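The gap figures for the typical router follow directly from the absolute area and speed numbers. The sketch below recomputes them; the dictionary layout and the function name `gap` are hypothetical, and the inputs are the rounded values printed on the slide.

```python
# Area (mm^2) and maximum frequency (MHz) for the typical router, from the slide.
soft = {"area_mm2": 4.1, "fmax_mhz": 150}
hard = {"area_mm2": 0.14, "fmax_mhz": 810}
mixed = {"area_mm2": 2.3, "fmax_mhz": 390}


def gap(baseline, other):
    """Return (area reduction, speedup) of `other` relative to `baseline`."""
    area_ratio = baseline["area_mm2"] / other["area_mm2"]
    speed_ratio = other["fmax_mhz"] / baseline["fmax_mhz"]
    return area_ratio, speed_ratio


hard_area, hard_speed = gap(soft, hard)     # roughly 29.3X smaller, 5.4X faster
mixed_area, mixed_speed = gap(soft, mixed)  # roughly 1.8X smaller, 2.6X faster
```

The recomputed ratios differ slightly from the 30X, 5X and 2.5X printed on the slide, presumably because the slide's absolute values and ratios were rounded separately. Either way, the mixed router recovers only a small part of the area gap, which is consistent with the "not worth hardening" note.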
Embedding the hard router in the FPGA (figure labels: FPGA, Router, Logic clusters)
– Same I/O mux structure as a logic block, 9X the area
– Conventional FPGA interconnect between routers (730 MHz)
– Assumed a mesh, but can form any topology
Typical router:

         Soft            Hard              Hard (+ interconnect)
Area     4.1 mm² (1X)    0.14 mm² (30X)    0.18 mm² = 9 LABs (22X)
Speed    150 MHz (1X)    810 MHz (5X)      730 MHz (4.7X)

64-node NoC on Stratix V:

          Soft            Hard (+ interconnect)
Area      ~12,500 LABs    576 LABs
% LABs    33%             1.6%
% FPGA    12%             0.6%

Hard NoC + Soft Interconnect is very compelling
– Provides 47 GB/s peak bisection bandwidth
– Very cheap: less than the cost of 3 soft nodes
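The 64-node figures can be cross-checked with a short calculation. This is a sketch under assumptions that the slides do not state explicitly: the mesh is taken to be 8 x 8, links are 32 bits wide and run at 730 MHz, and the bisection counts the 8 cut links in both directions. The function name `bisection_gb_per_s` is hypothetical.

```python
NODES = 64
LABS_PER_HARD_ROUTER = 9   # 0.18 mm^2 = 9 LABs, from the table
SOFT_NOC_LABS = 12_500     # approximate soft NoC size, from the table

hard_noc_labs = NODES * LABS_PER_HARD_ROUTER           # total LABs for the hard NoC
soft_labs_per_router = SOFT_NOC_LABS / NODES           # roughly 195 LABs each
soft_nodes_equivalent = hard_noc_labs / soft_labs_per_router


def bisection_gb_per_s(mesh_side, width_bits, freq_mhz):
    """Peak bisection bandwidth of a square mesh, counting both link directions."""
    links = 2 * mesh_side               # mesh_side cut links, two directions each
    bytes_per_cycle = width_bits / 8
    return links * bytes_per_cycle * freq_mhz * 1e6 / 1e9


# Assumed 8 x 8 mesh with 32-bit links at 730 MHz.
peak = bisection_gb_per_s(mesh_side=8, width_bits=32, freq_mhz=730)
```

Under these assumptions the hard NoC totals 576 LABs, which is slightly less than three soft routers, and the peak bisection bandwidth is about 46.7 GB/s, in line with the 47 GB/s quoted on the slide.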
Conclusion
1. Why NoCs on FPGAs?
   – Big city needs freeways to handle traffic
   – Solve communication problems for a large/heterogeneous FPGA: Timing Closure, Interconnect Scaling, Modular Design
2. Hard/soft efficiency gap
   – A hard NoC is on average 30X smaller and 3.6X faster than soft
   – Crossbars and allocators worst, input buffer best
   – An efficient soft NoC: uses BRAMs, large width, low port count, deep buffers
   – Mixed implementation does not make sense
3. Integrating hard NoCs with FPGA
   – Integrated fully hard NoC with FPGA fabric (for NoC links)
   – 22X area improvement over soft
   – Reaches max. FPGA frequency (4.7X faster than soft)
   – 64-node NoC = 0.6% of total FPGA area (Stratix V)
Future Work
– Power analysis
– More hardening:
   – Dedicated inter-router links (hard wires)
   – Clock domain crossing hardware
– How do traffic hotspots (DDR/PCIe) influence NoC design?
– Latency-insensitive design methodology that uses the NoC
– CAD tool changes for a NoC-based FPGA