The Case for Embedding Networks-on-Chip in FPGA Architectures

The Case for Embedding Networks-on-Chip in FPGA Architectures
Vaughn Betz, University of Toronto
With special thanks to Mohamed Abdelfattah, Andrew Bitar and Kevin Murray

Overview
- Why do we need a new system-level interconnect?
- Why an embedded NoC? How does it work? How efficient is it?
- Future trends and an embedded NoC: data center FPGAs, silicon interposers, registered routing, kernels → massively parallel accelerators

Why Do We Need a System Level Interconnect? And Why a NoC?

Challenge: Large Systems
- High-bandwidth hard blocks (e.g. PCIe) & compute modules
- High on-chip communication

Today: Design a Soft Bus per Set of Connections
- Wide links (100s of bits) built from single-bit programmable interconnect
- Muxing/arbitration/buffers on wide datapaths (100s of muxes) → big
- Tough timing constraints: physical distance affects frequency
- Somewhat unpredictable frequency → re-pipeline
- Buses are unique to the application → design time
→ System-level buses are costly

[Speaker notes] Wide links: FPGAs are good at doing things in parallel, which compensates for their relatively low operating frequency, so wide functional units need wide connections to transport the high-bandwidth data. This only gets worse as designs need more data and more bandwidth: the buses get harder to design and timing closure gets worse. Timing: physical distance affects timing, since the farther you go, the more switches and wire segments you use, and a designer only knows that timing information after placement and routing. That takes a long time, and you must then go back and redesign the interconnect to meet constraints. Area: later results highlight how big these buses are, and multiple clock domains require area-expensive asynchronous FIFOs. Do we really need all that flexibility in the system-level interconnect? No. Is there a better way? That is the question this research attempts to answer, and we think an embedded NoC is the way to go.

Hard Bus?
Dedicated (hard) wires with hard muxing and arbitration, connecting the modules over fixed buses.
- System-level interconnect in most designs?
- Costly in area & power?
- Usable by many designs?

Hard Bus?
- System-level interconnect in most designs?
- Costly in area & power?
- Usable by many designs? ✗ Not reusable: too design specific

Needed: A More General System-Level Interconnect
- Move data between arbitrary end-points
- Area efficient
- High bandwidth: match on-chip & I/O bandwidths
→ Network-on-Chip (NoC)

Embedded NoC
- NoC = complete interconnect: data transport, switching, buffering
- Routers connected by links move data between modules and hard blocks (PCIe, etc.)
- Control information is added to the data to form a packet, and the packet is split into flits for transport

[Speaker notes] We think the right answer is an embedded NoC. When FPGA vendors realized that some functions were common enough, they hardened them: hardened DSP blocks are much more efficient than soft DSPs, block RAM was needed to build efficient memories, and the same is true for I/O transceivers and memory controllers. A similar case can be made for interconnect: applications on FPGAs are large and modular, so inter-module (system-level) communication is significant in almost every design. Take DDR memory: every sizable FPGA application accesses off-chip memory, which immediately means wide buses with tough timing constraints and multiple clock domains, and for that piece of an application our work shows an embedded NoC can be quite a bit more efficient. Why harden an NoC rather than a bus? For a hardened bus it is unclear where the endpoints should go, and wherever they are placed it will not be flexible enough, whereas NoCs are the most distributed type of on-chip interconnect. NoCs are commonly used to connect cores in multi-core and multiprocessor chips, and we are not trying to reinvent the wheel: we want to use those same NoCs. To transport data, we add some control information to form a packet, and that packet is split into flits that are transmitted to the specified destination. The main goal is to adapt the NoC to conventional FPGA design: we are not trying to change how FPGA designers work, just to make it easier.
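As a rough illustration of the packet/flit terminology above, here is a minimal Python sketch (a software model, not the talk's hardware). It assumes the 150-bit flit width quoted later in the presentation, and the header fields (dest, vc, head/tail) are illustrative placeholders rather than the real packet format.

```python
# Minimal packetization sketch: split a data word into 150-bit flits.
FLIT_BITS = 150  # flit width assumed from the later FabricPort slides

def packetize(data_bits: int, payload: int, dest: int, vc: int):
    """Split a payload of `data_bits` bits into head/body/tail flits."""
    n_flits = max(1, -(-data_bits // FLIT_BITS))   # ceiling division
    flits = []
    for i in range(n_flits):
        chunk = (payload >> (i * FLIT_BITS)) & ((1 << FLIT_BITS) - 1)
        flits.append({
            "head": i == 0,             # head flit carries the routing info
            "tail": i == n_flits - 1,   # tail flit closes the packet
            "dest": dest if i == 0 else None,
            "vc": vc,
            "payload": chunk,
        })
    return flits

# Example: a 600-bit word becomes a 4-flit packet.
print(len(packetize(600, payload=0x1234, dest=7, vc=0)))  # -> 4
```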

Embedded NoC
- NoC = complete interconnect: data transport, switching, buffering
- Moves data between arbitrary end points? ✓

Embedded NoC Architecture How Do We Build It?

Routers, Links and Fabric Ports
- No hard boundaries → build compute modules of any size in the fabric
- Fabric interface: flexible interface to compute modules

Router Full featured virtual channel router [D. Becker, Stanford PhD, 2012]

Must We Harden the Router?
Tested: 32-bit wide ports, 2 VCs, 10-flit deep buffers; 65 nm TSMC standard cells vs. 65 nm Stratix III.
        Soft            Hard
Area    4.1 mm² (1X)    0.14 mm² (30X)
Speed   166 MHz (1X)    943 MHz (5.7X)
Hard: 170X throughput per area!
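A quick back-of-envelope check of the 170X figure, assuming throughput scales with clock frequency for the same 32-bit datapath:

```python
# Throughput/area for the hard router improves by (speed ratio) x (area ratio).
soft_area, hard_area = 4.1, 0.14        # mm^2
soft_fmax, hard_fmax = 166e6, 943e6     # Hz

area_ratio = soft_area / hard_area      # ~29X smaller
speed_ratio = hard_fmax / soft_fmax     # ~5.7X faster
print(round(area_ratio, 1), round(speed_ratio, 1), round(area_ratio * speed_ratio))
# 29.3 5.7 166  -> roughly the quoted 170X throughput per unit area
```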

Harden the Routers?
- FPGA-optimized soft routers? [CONNECT, Papamichael & Hoe, FPGA 2012] and [Split/Merge, Huan & DeHon, FPT 2012]: ~2-3X throughput/area improvement with a reduced feature set
- [Hoplite, Kapre & Gray, FPL 2015]: larger improvement with very reduced features / guarantees
- Not enough to close the 170X gap with hard, and we want ease of use (full featured) → hard routers

Fabric Interface
- 200 MHz module, 900 MHz router?
- Configurable time-domain mux/demux: match bandwidth
- Asynchronous FIFO: cross clock domains
→ Full NoC bandwidth, without clock restrictions on modules

Hard Routers / Soft Links
- Hard router embedded among the logic clusters, with the same I/O mux structure as a logic block: 9X the logic block area
- Conventional FPGA interconnect between routers

Hard Routers / Soft Links
- 730 MHz
- Soft links span ~1/9th of the FPGA vertically (~2.5 mm), or ~1/5th (~4.5 mm) with faster, fewer wires (C12)

Hard Routers / Soft Links
- Assumed a mesh → can form any topology

Hard Routers / Hard Links
- Muxes on the router-fabric interface only: 7X the logic block area
- Dedicated interconnect between routers → faster / fixed

Hard Routers / Hard Links
- 900 MHz
- Dedicated interconnect (hard links) spans ~9 mm at 1.1 V or ~7 mm at 0.9 V

Hard NoCs
         Soft            Hard (+ Soft Links)        Hard (+ Hard Links)
Area     4.1 mm² (1X)    0.18 mm² = 9 LABs (22X)    0.15 mm² = 7 LABs (27X)
Speed    166 MHz (1X)    730 MHz (4.4X)             943 MHz (5.7X)
Power    --              9X less                    11X - 15X less

2. Area Efficient?
64-node, 32-bit wide NoC on Stratix V: very cheap, less than the cost of 3 soft nodes.
          Soft            Hard (+ Soft Links)    Hard (+ Hard Links)
Area      ~12,500 LABs    576 LABs               448 LABs
% LABs    33%             1.6%                   1.3%
% FPGA    12%             0.6%                   0.45%
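A small consistency check of this table against the per-router LAB-equivalents quoted on the earlier slide (9 LABs with soft links, 7 LABs with hard links):

```python
# The 64-node totals follow directly from the per-router LAB-equivalents.
nodes = 64
print(nodes * 9)              # 576 LABs  (hard routers + soft links)
print(nodes * 7)              # 448 LABs  (hard routers + hard links)
print(round(12500 / nodes))   # ~195 LABs per node for the soft NoC
```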

Power Efficient?
- Compare to the best-case FPGA interconnect: a point-to-point link the length of one NoC link
- 200 MHz, 64-bit width, 0.9 V, 1 VC
→ Hard and mixed NoCs are power efficient

3. Match I/O Bandwidths?
- 32-bit wide NoC @ 28 nm, 1.2 GHz → 4.8 GB/s per link
- Too low for easy I/O use!

3. Match I/O Bandwidths?
- Need higher-bandwidth links: 150 bits wide @ 1.2 GHz → 22.5 GB/s per link, enough to carry a full I/O interface's bandwidth on one link
- Want to keep cost low, since it is much easier to justify adding to an FPGA if cheap (e.g. Stratix I spent 2% of die size on DSP blocks; in the first generation most customers did not use them, but a 2% cost was OK)
- Reduce the number of nodes: 64 → 16
→ 1.3% of core area for a large Stratix V FPGA
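The per-link bandwidth numbers on these two slides follow directly from width times frequency; a one-line helper makes the arithmetic explicit:

```python
# Link bandwidth arithmetic behind the two bandwidth-matching slides above.
def link_bandwidth_GBps(width_bits: int, freq_GHz: float) -> float:
    return width_bits / 8 * freq_GHz   # bytes per cycle x GHz = GB/s

print(link_bandwidth_GBps(32, 1.2))    # 4.8 GB/s  - too low for easy I/O use
print(link_bandwidth_GBps(150, 1.2))   # 22.5 GB/s - carries a full I/O interface
```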

NoC Usage & Application Efficiency Studies How Do We Use It?

FabricPort In
- FPGA module side: any* width (0-600 bits), 100-400 MHz, ready/valid handshake
- Embedded NoC side: fixed 150-bit width, 1.2 GHz, credit-based flow control
- Width → # flits: 0-150 bits → 1, 150-300 bits → 2, 300-450 bits → 3, 450-600 bits → 4

FabricPort In: time-domain multiplexing divides the width by 4 and multiplies the frequency by 4.

FabricPort In: an asynchronous FIFO crosses into the NoC clock domain, so there is no restriction on the module frequency.

FabricPort In: the NoC writer tracks available buffers in the NoC router, forwards flits to the NoC, and applies backpressure (ready = 0) when no buffers are available.
→ The input interface is flexible & easy for designers, and needs little soft logic.
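To tie the FabricPort In slides together, here is a behavioral sketch in Python (a software model only, not the actual interface hardware). The width-to-flit table and the 4x time-domain multiplexing factor come from the slides; the FIFO depth and credit count are arbitrary choices for illustration.

```python
# Behavioral sketch of the FabricPort In: a module word of up to 600 bits is
# time-domain multiplexed into 150-bit flits, crosses an asynchronous FIFO
# into the 1.2 GHz NoC domain, and the NoC writer only forwards flits while
# the router still has credits (free buffers).
from collections import deque

FLIT_BITS, MAX_WIDTH = 150, 600

def flits_per_word(width_bits: int) -> int:
    """Width-to-flit mapping from the slide's table (1..4 flits)."""
    assert 0 < width_bits <= MAX_WIDTH
    return -(-width_bits // FLIT_BITS)   # ceiling division

class FabricPortIn:
    def __init__(self, credits: int):
        self.afifo = deque()       # stands in for the asynchronous FIFO
        self.credits = credits     # free buffer slots in the NoC router

    def ready(self) -> bool:       # backpressure to the module (ready/valid)
        return len(self.afifo) < 8 # arbitrary FIFO depth for the sketch

    def write_word(self, word_bits: int) -> bool:
        if not self.ready():
            return False
        # TDM: divide the wide word into up-to-4 narrow flits at 4x the clock.
        self.afifo.extend(range(flits_per_word(word_bits)))
        return True

    def noc_cycle(self):
        # NoC writer: forward one flit per NoC cycle while credits remain.
        if self.afifo and self.credits > 0:
            self.afifo.popleft()
            self.credits -= 1

port = FabricPortIn(credits=10)
port.write_word(450)          # a 450-bit word -> 3 flits
for _ in range(3):
    port.noc_cycle()
print(port.credits)           # 7 credits left
```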

Designer Use
- The NoC has non-zero, usually variable latency
- Use it on latency-insensitive channels: stallable modules connected by data/valid/ready handshakes
- With restrictions, usable for fixed-latency communication: pre-establish and reserve paths ("Permapaths")
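A toy Python model of a latency-insensitive (ready/valid) channel between two stallable modules. It is not the presentation's implementation, but it shows why a variable-latency interconnect such as the NoC does not affect functional correctness as long as both sides obey the handshake.

```python
# Producer only sends when the channel is ready; consumer drains whenever it
# can. Random stalls stand in for variable NoC latency.
from collections import deque
import random

class Channel:
    def __init__(self, capacity=4):
        self.fifo, self.capacity = deque(), capacity
    def ready(self): return len(self.fifo) < self.capacity
    def push(self, x): self.fifo.append(x)          # valid & ready
    def pop(self): return self.fifo.popleft() if self.fifo else None

ch = Channel()
sent, received = list(range(8)), []
pending = deque(sent)
while len(received) < len(sent):
    if pending and ch.ready() and random.random() < 0.7:  # variable latency
        ch.push(pending.popleft())
    if random.random() < 0.5:                             # consumer stalls too
        x = ch.pop()
        if x is not None:
            received.append(x)
print(received == sent)   # True: order and values preserved despite stalls
```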

How Common Are Latency-Insensitive Channels?
- Connections to I/O (DDRx, PCIe, ...): variable latency
- Between HLS kernels: OpenCL channels / pipes, Bluespec SV, ...
- A common design style between larger modules, and any module can be converted to use it [Carloni et al, TCAD, 2001]
→ Widely used at the system level, and use is likely to increase

Packet Ordering
- Multiprocessors: memory mapped; packets can arrive out-of-order, which is fine for cache lines because processors have re-order buffers
- FPGA designs: mostly streaming; they cannot tolerate reordering, and re-ordering hardware is expensive and difficult
- RULE 1: all packets with the same src/dst must take the same NoC path
- RULE 2: all packets with the same src/dst must use the same VC
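The talk does not say which routing algorithm the NoC uses; as one concrete example, deterministic dimension-ordered (XY) routing on a mesh satisfies Rule 1 by construction, since every packet between a given source and destination follows exactly the same hops:

```python
# Deterministic XY routing: route all the way in X first, then in Y.
def xy_route(src, dst):
    """Return the list of hops from src to dst on a mesh (X first, then Y)."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Every packet from (0, 0) to (3, 2) follows exactly the same hops.
print(xy_route((0, 0), (3, 2)))
```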

Application Efficiency Studies How Efficient Is It?

1. Qsys vs. NoC
- Qsys: build a logical bus from the fabric
- NoC: 16 nodes, hard routers & links

Area Comparison: only 1/8 of the hard NoC bandwidth is used, yet it already takes less area for most systems.

Power Comparison: the hard NoC saves power for even the simplest systems.

2. Ethernet Switch
- FPGAs with transceivers commonly manipulate / switch packets, e.g. a 16x16 Ethernet switch @ 10 Gb/s per channel
- The NoC is the crossbar, plus buffering, distributed arbitration & back-pressure
- The fabric inspects packet headers, performs more buffering, ...
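A rough bandwidth sanity check for this use case, reusing the 22.5 GB/s per-link figure quoted earlier (this assumes the same NoC link configuration is used here):

```python
# 16x16 switch at 10 Gb/s per channel vs. the quoted NoC link bandwidth.
channels, channel_Gbps = 16, 10
noc_link_GBps = 22.5

print(channels * channel_Gbps)            # 160 Gb/s aggregate through the switch
print(noc_link_GBps * 8)                  # 180 Gb/s per NoC link
print(noc_link_GBps * 8 >= channel_Gbps)  # one link easily carries a 10G channel
```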

Ethernet Switch Efficiency
- 14X more efficient!
- Latest FPGAs: ~2 Tb/s of transceiver bandwidth → need good switches

3. Parallel JPEG (Latency Sensitive)
- NoC makes performance more predictable
- NoC doesn't produce wiring hotspots & saves long wires
(Chart: max 100% [long wires] vs. max 40%.)

Future Trends and Embedded NoCs Speculation Ahead!

1. Embedded NoCs and the Datacenter

Datacenter Accelerators
- Microsoft Catapult: shell & role to ease design
- Shell: 23% of a Stratix V FPGA [Putnam et al, ISCA 2014]

Datacenter “Shell”: Bus Overhead
- Buses to I/Os in shell & role
- Divided into two parts to ease compilation (shell portion locked down)

Datacenter “Shell”: Swapping Accelerators
- Partial reconfiguration of the role only → swap an accelerator without taking down the system
- Shell buses must be overengineered for the most demanding accelerator
- Two separate compiles → lose some optimization of the bus

More Swappable Accelerators
- Allows more virtualization
- But shell complexity increases → less efficient
- Wasteful for one big accelerator

Shell with an Embedded NoC
- Efficient for more cases (small or big accelerators)
- Data is brought into the accelerator, not just to the edge with a locked bus

2. Interposer-Based FPGAs

Xilinx: Larger Fabric with Interposers
- Create a larger FPGA with interposers
- 10,000 connections between dice (23% of normal routing)
- Routability is good if > 20% of normal wiring crosses the interposer [Nasiri et al, TVLSI, to appear]
Figure: Xilinx, SSI Technology White Paper, 2012

Interposer Scaling
- Concerns about how well microbumps will scale: will interposer routing bandwidth remain > 20% of within-die bandwidth?
- An embedded NoC naturally multiplies routing bandwidth (higher clock rate on the NoC wires crossing the interposer)
Figure: Xilinx, SSI Technology White Paper, 2012

Altera: Heterogeneous Interposers
- Custom wiring interface to each unique die: PCIe/transceiver, high-bandwidth memory
- NoC: standardize the interface, allow TDM-ing of wires
- Extends system-level interconnect beyond one die
Figure: Mike Hutton, Altera Stratix 10, FPL 2015

3. Registered Routing

Registered Routing
- Stratix 10 includes a pulse latch in each routing driver, enabling deeper interconnect pipelining
- Does this obviate the need for a new system-level interconnect? I don't think so
- It makes it easier to run wires faster, but it is still not: switching, buffering and arbitration (a complete interconnect); pre-timing-closed; an abstraction to compose & re-configure systems
- It pushes more designers to latency-tolerant techniques, which helps match the main NoC programming model

4. Kernels → Massively Parallel Accelerators: Crossbars for Design Composition

MapReduce and FPGAs [Ghasemi & Chow, MASc thesis, 2015]
- Write map & reduce kernels; use the Spark infrastructure to distribute data & kernels across many CPUs
- Do the same for FPGAs? Between chips → the network; within a chip → soft logic
- Consumes lots of soft logic and limits the routable design to ~30% utilization!

Can We Remove the Crossbar?
- Not without breaking the MapReduce/Spark abstraction! The automatic partitioning / routing / merging of data is what makes Spark easy to program
- Need a crossbar to match the abstraction and make composability easy
- NoC: an efficient, distributed crossbar that lets us compose kernels efficiently
- The crossbar abstraction can be used within chips (NoC) and between chips (datacenter network)

Wrap Up

Wrap Up
- Adding NoCs to FPGAs enhances the efficiency of system-level interconnect and enables new abstractions (crossbar composability, easily-swappable accelerators)
- The NoC abstraction can cross interposer boundaries → interesting multi-die systems
- My belief: special-purpose box → datacenter, and ASIC-like flow → composable flow; embedded NoCs help make this happen

Future Work
- CAD system for embedded NoCs:
  - Automatically create lightweight soft logic (a translator) to connect to the fabric port, according to the designer's specified intent
  - Choose the best router to connect each compute module
  - Choose when to use the NoC vs. soft links
- Then map more applications using the CAD