Augmenting FPGAs with Embedded Networks-on-Chip
Mohamed Abdelfattah and Vaughn Betz

2 Outline: 1. Why NoCs on FPGAs? 2. Embedded NoCs 3. Area & Power Analysis 4. Comparison Against P2P/Buses

3 1. Why NoCs on FPGAs? FPGA interconnect: Logic Blocks, Switch Blocks, Wires.

4 1. Why NoCs on FPGAs? Logic Blocks, Switch Blocks, Wires. Hard Blocks: Memory, Multiplier, Processor.

5 1. Why NoCs on FPGAs? Logic Blocks, Switch Blocks, Wires. Hard Blocks: Memory, Multiplier, Processor. Hard Interfaces: DDR/PCIe… (1600 MHz / 800 MHz / 200 MHz) — but the interconnect is still the same.

6 1. Why NoCs on FPGAs? DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet (1600 MHz / 800 MHz / 200 MHz). Problems: 1. Bandwidth requirements for hard logic/interfaces. 2. Timing closure.

7 1. Why NoCs on FPGAs? DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet. Problems: 1. Bandwidth requirements for hard logic/interfaces. 2. Timing closure. 3. High interconnect utilization (huge CAD problem, slow compilation, power/area utilization). 4. Wire speed not scaling (delay is interconnect-dominated).

Barcelona vs. Los Angeles: keep the “roads” (logic clusters), but add “freeways” (hard blocks). Source: Google Earth.

9 1. Why NoCs on FPGAs? DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet. Problems: 1. Bandwidth requirements for hard logic/interfaces. 2. Timing closure. 3. High interconnect utilization (huge CAD problem, slow compilation, power/area utilization). 4. Wire speed not scaling (delay is interconnect-dominated). Add an NoC of routers and links: a router forwards data packets and moves data onto the local interconnect.

10 1. Why NoCs on FPGAs? DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet. Problems: 1. Bandwidth requirements for hard logic/interfaces. 2. Timing closure. 3. High interconnect utilization (huge CAD problem, slow compilation, power/area utilization). 4. Wire speed not scaling (delay is interconnect-dominated). 5. Abstraction favors modularity (parallel compilation, partial reconfiguration, multi-chip interconnect). How the NoC helps: ✓ pre-design the NoC to requirements; ✓ NoC links are “re-usable”; ✓ the NoC is heavily “pipelined”; ✓ the NoC abstraction favors modularity; ✓ high-bandwidth endpoints are known.

11 1. Why NoCs on FPGAs? DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet. Problems: 1. Bandwidth requirements for hard logic/interfaces. 2. Timing closure. 3. High interconnect utilization (huge CAD problem, slow compilation, power/area utilization). 4. Wire speed not scaling (delay is interconnect-dominated). 5. Abstraction favors modularity (parallel compilation, partial reconfiguration, multi-chip interconnect). ✓ Latency-tolerant communication; ✓ the NoC abstraction favors modularity. NoCs can simplify FPGA design — but does the NoC abstraction come at a high area/power cost? How do we integrate NoCs in FPGAs? How do embedded NoCs compare to current interconnects?

12 Outline: 1. Why NoCs on FPGAs? 2. Embedded NoCs (Mixed NoCs, Hard NoCs) 3. Area & Power Analysis 4. Comparison Against P2P/Buses

2. Embedded NoCs. “Soft” NoC = Soft Routers + Soft Links. “Mixed” NoC = Hard Routers + Soft Links. “Hard” NoC = Hard Routers + Hard Links.

14 Evaluation flow: Soft → FPGA CAD tools; Hard → ASIC CAD tools (Design Compiler) for area, speed, and power, with toggle rates from gate-level simulation; Mixed → HSPICE.

2. Embedded NoCs. “Mixed” NoC = Hard Routers + Soft Links: hard routers embedded among the logic blocks, connected through the programmable “soft” interconnect. Baseline router parameters: Width, VCs, Ports, Buffer/VC.
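The baseline router is parameterized by flit width, virtual channels (VCs), ports, and buffer depth per VC. A minimal sketch of how these parameters determine the router's input-buffer storage — the parameter values below are illustrative assumptions, not the deck's actual configuration:

```python
def router_buffer_bits(width_bits, num_vcs, num_ports, flits_per_vc):
    """Total input-buffer storage for a virtual-channel router:
    each input port holds num_vcs queues of flits_per_vc flits,
    each flit being width_bits wide."""
    return num_ports * num_vcs * flits_per_vc * width_bits

# Hypothetical example: 32-bit flits, 2 VCs, 5 ports (mesh router), 8 flits/VC
print(router_buffer_bits(32, 2, 5, 8))  # 2560 bits of buffering
```

Buffering like this dominates router area, which is why hardening the router pays off so heavily compared to building the same buffers out of soft logic.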

16 2. Embedded NoCs. “Mixed” NoC = Hard Routers + Soft Links.

17 2. Embedded NoCs. Special feature of the mixed NoC: configurable topology — a mesh is assumed, but the soft links can form any topology.

18 2. Embedded NoCs. “Hard” NoC = Hard Routers + Hard Links: routers connected by dedicated “hard” interconnect rather than the programmable “soft” interconnect.

19 2. Embedded NoCs. “Hard” NoC = Hard Routers + Hard Links.

20 2. Embedded NoCs. “Hard” NoC = Hard Routers + Hard Links. Special feature: low-V mode — dropping the supply from 1.1 V to 0.9 V saves 33% dynamic power and is only ~15% slower.
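The 33% figure follows directly from the quadratic dependence of dynamic power on supply voltage (P ∝ C·V²·f), taking voltage scaling alone. A quick check:

```python
# Dynamic power ~ C * V^2 * f; at fixed capacitance and switching
# activity, the saving from voltage scaling alone is 1 - (V_new/V_old)^2.
def dynamic_power_savings(v_old, v_new):
    return 1 - (v_new / v_old) ** 2

savings = dynamic_power_savings(1.1, 0.9)
print(f"{savings:.0%}")  # 33% -- matches the slide's low-V saving
```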

21 2. Embedded NoCs. Bridging the NoC and the FPGA fabric requires width adaptation, frequency adaptation, voltage adaptation, and a bus protocol (e.g., AXI).

22 Outline: 1. Why NoCs on FPGAs? 2. Embedded NoCs 3. Area & Power Analysis (soft vs. mixed vs. hard; system area/power) 4. Comparison Against P2P/Buses

23 3. Area/Power Analysis. We use a state-of-the-art router architecture from Stanford: 1. The NoC community has excelled at building on-chip routers — we just use one. 2. A high-performance router meets FPGA bandwidth requirements. 3. Complex functionality such as virtual channels could be useful, e.g., for assigning traffic priority.

24 3. Area/Power Analysis. Hard router vs. soft router: 9X smaller, 2.4X faster, 1.4X lower power. Hard links vs. soft links: 30X smaller, 6X faster, 14X lower power.

25 3. Area/Power Analysis. Average gap of mixed/hard NoCs vs. soft: area 20X–23X smaller, speed 5X–6X faster, power 9X–11X less (15X with low-V).

26 3. Area/Power Analysis. 64-node NoC [65 nm] on Stratix III — Area: soft ~12,500 LBs (33% of the FPGA); mixed 576 LBs (~1.5% of the FPGA); hard 448 LBs. Speed: soft 166 MHz; mixed/hard 730–940 MHz. Bisection bandwidth: soft ~10 GB/s; mixed/hard ~50 GB/s.

27 3. Area/Power Analysis. 64-node NoC [65 nm] on Stratix III — Area: soft ~12,500 LBs (33% of the FPGA); mixed 576 LBs (~1.5% of the FPGA); hard (low-V) 448 LBs. Speed: soft 166 MHz; mixed/hard 730–940 MHz. Bisection bandwidth: soft ~10 GB/s; mixed/hard ~50 GB/s. The embedded NoC provides ~50 GB/s peak bisection bandwidth very cheaply — less than the cost of 3 soft nodes.
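The slide's numbers are self-consistent: 33% vs. ~1.5% of chip area reproduces the roughly 20X–23X area gap, and 448 LB-equivalents for the whole hard NoC really is under the cost of three soft nodes (~12,500 LBs spread over 64 nodes). A quick check using only the figures quoted above:

```python
soft_lbs, hard_lbs, nodes = 12_500, 448, 64

# Area gap: soft NoC at 33% of the FPGA vs. embedded NoC at ~1.5%
soft_frac, hard_frac = 0.33, 0.015
print(soft_frac / hard_frac)          # 22.0 -- within the quoted 20X-23X

# One soft node costs ~195 LBs; the entire hard NoC is ~2.3 soft nodes
lbs_per_soft_node = soft_lbs / nodes
print(hard_lbs / lbs_per_soft_node)   # < 3, i.e. "less than 3 soft nodes"
```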

28 3. Area/Power Analysis. Typical dynamic power of the largest Stratix-III device: 17.4 W. How much of it goes to system-level communication at 250 GB/s total bandwidth? Soft NoC: 123%.

29 3. Area/Power Analysis. Of the 17.4 W typical FPGA dynamic power, 250 GB/s of total bandwidth costs: soft NoC 123%, mixed NoC 15%.

30 3. Area/Power Analysis. Of the 17.4 W typical FPGA dynamic power, 250 GB/s of total bandwidth costs: soft NoC 123%, mixed NoC 15%, hard NoC 11%.

31 3. Area/Power Analysis. Of the 17.4 W typical FPGA dynamic power, 250 GB/s of total bandwidth costs: soft NoC 123%, mixed NoC 15%, hard NoC 11%, hard NoC (low-V) 7%.
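Translating those percentages into absolute power against the 17.4 W typical dynamic budget makes the gap concrete — the soft NoC alone would exceed the whole chip's dynamic power:

```python
typical_dynamic_w = 17.4  # largest Stratix-III device, from the slide
noc_share = {"soft": 1.23, "mixed": 0.15, "hard": 0.11, "hard_low_v": 0.07}

# Absolute NoC power for 250 GB/s of system-level communication
watts = {k: round(v * typical_dynamic_w, 2) for k, v in noc_share.items()}
print(watts)  # soft ~21.4 W (infeasible), hard low-V ~1.2 W
```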

3. Area/Power Analysis. Example traffic: DDR3 → Module 1 and PCIe → Module 2 (17 GB/s), crossing the whole chip! At the full theoretical aggregate bandwidth of 126 GB/s, the NoC needs only 3.5% of the power budget.
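That 3.5% figure implies a very low power cost per unit of bandwidth moved. A rough estimate, assuming the same 17.4 W dynamic budget quoted earlier:

```python
budget_w = 17.4          # typical FPGA dynamic power (earlier slide)
noc_frac = 0.035         # NoC share of the power budget
aggregate_gbps = 126     # full theoretical aggregate bandwidth, GB/s

mw_per_gbps = noc_frac * budget_w / aggregate_gbps * 1000
print(f"{mw_per_gbps:.1f} mW per GB/s")  # roughly 5 mW per GB/s across the chip
```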

33 Outline: 1. Why NoCs on FPGAs? 2. Embedded NoCs 3. Area & Power Analysis 4. Comparison Against P2P/Buses (point-to-point links, Qsys buses)

34 4. Comparison. Interconnect styles: point-to-point links and broadcast — interconnect is just wires; multiple masters, and multiple masters with multiple slaves (1..n into a mux + arbiter) — interconnect is wires + logic; or an NoC. We compare the “wires” interconnect to NoCs.

35 4. Comparison. Hard and mixed NoCs are area/power efficient: the length of one NoC link is ~1% area overhead on Stratix V, with power on par with the simplest FPGA interconnect running at 200 MHz — while being high-performance and packet-switched.

36 4. Comparison. Qsys bus: build a logical bus from the fabric. Embedded NoC: 16 nodes, hard routers & links.

37 4. Comparison. Steps to close timing using Qsys when the endpoints are close together on the FPGA.

38 4. Comparison. Steps to close timing using Qsys when the endpoints are far apart on the FPGA.

39 4. Comparison. Steps to close timing using Qsys when the endpoints are far apart on the FPGA — timing closure can be simplified with an embedded NoC.

40 4. Comparison

41 4. Comparison

42 4. Comparison. The entire NoC is smaller than a bus connecting 3 modules!

43 4. Comparison. Even with only 1/8 of the hard NoC's bandwidth used, it already takes less area than buses for most systems.

44 4. Comparison. The hard NoC saves power for even the simplest systems.

Conclusions — 1. Why NoCs on FPGAs? A big city needs freeways to handle its traffic. 2. Embedded NoCs: mixed & hard. 3. Area & Power Analysis: area 20–23X, speed 5–6X, power 9–15X; area budget for 64 nodes ~1%; power budget for 100 GB/s: 3–7%. 4. Comparison Against P2P/Buses: raw efficiency close to the simplest P2P links; the NoC is more efficient with lower design effort.

46 eecg.utoronto.ca/~mohamed/noc_designer.html

47 eecg.utoronto.ca/~mohamed/noc_designer.html

2. Embedded NoCs. How to connect a 200 MHz 128-bit module to a 900 MHz 32-bit router? A configurable time-domain mux/demux matches bandwidth; an asynchronous FIFO crosses clock domains — giving the full NoC bandwidth without clock restrictions on modules.
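The time-domain mux/demux works because the two sides carry nearly the same bandwidth, and the width ratio sets the (de)serialization factor. A quick check of the numbers above:

```python
module_gbps = 200e6 * 128 / 8 / 1e9   # 200 MHz x 128-bit = 3.2 GB/s
router_gbps = 900e6 * 32 / 8 / 1e9    # 900 MHz x 32-bit  = 3.6 GB/s

# The router link slightly outpaces the module, so a 128-bit -> 32-bit
# 4:1 time-domain mux can carry the module's full bandwidth.
print(module_gbps, router_gbps, 128 // 32)
```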

49 1. Why NoCs on FPGAs? FPGA speedups (vs. GPU, vs. CPU): Maxeler — geoscience (14x, 70x), financial analysis (5x, 163x); Altera OpenCL — video compression (3x, 114x), information filtering (5.5x).

50 1. Why NoCs on FPGAs?

51 1. Why NoCs on FPGAs?

52 1. Why NoCs on FPGAs? NoC