Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication.

Slides:



Advertisements
Similar presentations
Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Hard/soft efficiency gap Integrating hard NoCs with FPGA
Advertisements

Augmenting FPGAs with Embedded Networks-on-Chip
Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Power Analysis
A Novel 3D Layer-Multiplexed On-Chip Network
Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.
CSC 450/550 Part 3: The Medium Access Control Sublayer More Contents on the Engineering Side of Ethernet.
Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design N. Vinay Krishnan EE249 Class Presentation.
Module R R RRR R RRRRR RR R R R R Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip Zvika Guz, Isask ’ har Walter, Evgeny Bolotin, Israel.
NETWORK ON CHIP ROUTER Students : Itzik Ben - shushan Jonathan Silber Instructor : Isaschar Walter Final presentation part A Winter 2006.
Network based System on Chip Part A Performed by: Medvedev Alexey Supervisor: Walter Isaschar (Zigmond) Winter-Spring 2006.
1 Lecture 13: Interconnection Networks Topics: flow control, router pipelines, case studies.
Predictive Load Balancing Reconfigurable Computing Group.
1 Evgeny Bolotin – ClubNet Nov 2003 Network on Chip (NoC) Evgeny Bolotin Supervisors: Israel Cidon, Ran Ginosar and Avinoam Kolodny ClubNet - November.
Architecture and Routing for NoC-based FPGA Israel Cidon* *joint work with Roman Gindin and Idit Keidar.
Issues in System-Level Direct Networks Jason D. Bakos.
1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007.
Network-on-Chip: Communication Synthesis Department of Computer Science Texas A&M University.
1 Networking Basics: A Review Carey Williamson iCORE Professor Department of Computer Science University of Calgary.
Storage area network and System area network (SAN)
Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis Comparison Against P2P/Buses 4 4.
Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.
1 A survey on Reconfigurable Computing for Signal Processing Applications Anne Pratoomtong Spring2002.
For more notes and topics visit: eITnotes.com.
Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.
High Performance Embedded Computing © 2007 Elsevier Lecture 16: Interconnection Networks Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.
Interconnect Networks
On-Chip Networks and Testing
Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.
Hosting Virtual Networks on Commodity Hardware VINI Summer Camp.
High-Performance Networks for Dataflow Architectures Pravin Bhat Andrew Putnam.
Elastic-Buffer Flow-Control for On-Chip Networks
Networks-on-Chips (NoCs) Basics
1 Lecture 7: Interconnection Network Part I: Basic Definitions Part II: Message Passing Multicomputers.
Networks and Protocols CE Week 5b. WAN’s, Frame Relay, DSL, Cable.
High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.
LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement Gwangsun Kim Computer Science Department Korea Advanced Institute of.
High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.
George Michelogiannakis William J. Dally Stanford University Router Designs for Elastic- Buffer On-Chip Networks.
CS 8501 Networks-on-Chip (NoCs) Lukasz Szafaryn 15 FEB 10.
Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation.
Veronica Eyo Sharvari Joshi. System on chip Overview Transition from Ad hoc System On Chip design to Platform based design Partitioning the communication.
Lecture 13: Logic Emulation October 25, 2004 ECE 697F Reconfigurable Computing Lecture 13 Logic Emulation.
Lecture 16: Router Design
Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.
The Case for Embedding Networks-on-Chip in FPGA Architectures
SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems Abelardo Jara-Berrocal, Ann Gordon-Ross NSF.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220.
Trigger Hardware Development Modular Trigger Processing Architecture Matt Stettler, Magnus Hansen CERN Costas Foudas, Greg Iles, John Jones Imperial College.
1 Switching and Forwarding Sections Connecting More Than Two Hosts Multi-access link: Ethernet, wireless –Single physical link, shared by multiple.
Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE The NoX Router Mitchell Hayenga Mikko Lipasti.
Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.
Network On Chip Cache Coherency Final presentation – Part A Students: Zemer Tzach Kalifon Ethan Kalifon Ethan Instructor: Walter Isaschar Instructor: Walter.
A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI Computation Lab ECE Department, UC Davis.
Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.
Flow Control Ben Abdallah Abderazek The University of Aizu
On-time Network On-Chip: Analysis and Architecture CS252 Project Presentation Dai Bui.
Network-on-Chip Paradigm Erman Doğan. OUTLINE SoC Communication Basics  Bus Architecture  Pros, Cons and Alternatives NoC  Why NoC?  Components 
Graciela Perera Department of Computer Science and Information Systems Slide 1 of 18 INTRODUCTION NETWORKING CONCEPTS AND ADMINISTRATION CSIS 3723 Graciela.
Mohamed Abdelfattah Vaughn Betz
The network-on-chip protocol
Dynamic connection system
Lecture 23: Interconnection Networks
OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel
Israel Cidon, Ran Ginosar and Avinoam Kolodny
Architecture of Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models Lecture 11 6/19/2006 Dr Steve Hunter.
Introduction to Scalable Interconnection Network Design
On-time Network On-chip
Network-on-Chip Programmable Platform in Versal™ ACAP Architecture
Multiprocessors and Multi-computers
Presentation transcript:

Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ

2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication

3 Bus 1 Bus 2 Bus 3 Module 4 Module 1 Module 3 Module 2 FPGAs are big! Design big systems High on-chip communication Wide links from single bit- programmable interconnect Somewhat unpredictable frequency  re-pipeline 100s of bits Tough timing constraints Physical distance affects frequency Buses are unique to the application, and.. inefficient Design a bus for each set of connections

4 FPGAs are big! Design big systems High on-chip communication Design a bus for each set of connections Wide links from single bit- programmable interconnect Somewhat unpredictable frequency  re-pipeline Module 4 Module 1 Module 3 Module 2 Buses are unique to the application, and.. inefficient

5 FPGAs are big! Design big systems High on-chip communication Module 4 Module 1 Module 3 Module 2 Embedded NoC Data Packet flit Data Sys-level interconnect used in most designs  harden! NoC=complete interconnect Data transport Switching Buffering NoC=complete interconnect Data transport Switching Buffering Links Routers IEEE Micro 2014: Smaller and less power than bus connecting 3 modules to DDR3 More efficient than soft buses and huge bandwidth

Module 6 NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency Link Width# Nodes 150 bits16 Area %Freq. 1.3 %1.2 GHz 22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits22.5 GB/s

Module 7 NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency NoC can transport 22.5 GB/s on one link Easy for designers No segmentation of data DDR3 max = 17 GB/s Run NoC at fixed, high frequency Link Width# Nodes 150 bits16 Area %Freq. 1.3 %1.2 GHz 22.5 GB/s = 1.2 GHz x 150 bits = 300 MHz x 600 bits22.5 GB/s

How do we adapt Embedded NoCs to FPGAs? 8 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch [Easy-to-use]

9 NoC FabricPort = On-ramp Photo from:

10 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Width # flits bits bits bits bits 4

11 3 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Time-domain multiplexing: Divide width by 4 Multiply frequency by 4

12 3 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Asynchronous FIFO: Cross into NoC clock No restriction on module frequency Time-domain multiplexing: Divide width by 4 Multiply frequency by 4

13 3 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Asynchronous FIFO: Cross into NoC clock No restriction on module frequency Time-domain multiplexing: Divide width by 4 Multiply frequency by 4 NoC Writer: Track available buffers in NoC Router Forward flits to NoC Backpressure 2 10

14 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC 3 Virtual Channels: Multiple buffers Logical separation of data VC2 VC Packet on VC1 Packet on VC2 Interleaved flits from 2 packets

15 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Packet on VC Virtual Channels: Multiple buffers Logical separation of data Packet on VC1 Interleaved flits from 2 packets

16 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Packet on VC Asynchronous FIFO: Reads packet from one VC at a time Supports Priorities Virtual Channels: Multiple buffers Logical separation of data Packet on VC Interleaved flits from 2 packets

17 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC Packet on VC Asynchronous FIFO: Reads packet from one VC at a time Supports Priorities Virtual Channels: Multiple buffers Logical separation of data Packet on VC1 TD Demultiplexing: Output 1 packet at a time  easy for designers ? Interleaved flits from 2 packets

18 Ready/Valid Any* width bits Any* width bits Frequency 1.2 GHz Frequency 1.2 GHz Credits Fixed Width 150 bits Fixed Width 150 bits Frequency MHz Frequency MHz FPGA Module Embedded NoC 2121 Packet on VC Packet on VC1 combine_data mode: Packets from VC1  upper bits Packets from VC2  lower bits

19 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch How do we adapt Embedded NoCs to FPGAs?

20 NoC Rules for using an NoC with FPGA design styles Photo from:

FPGA Designs Mostly streaming Cannot tolerate reordering – Hardware expensive and difficult 21 Multiprocessors Memory mapped Packets arrive out-of-order – Fine for cache lines – Processors have re-order buffers 1 2 All packets with same src/dst must take same NoC path RULE

FPGA Designs Mostly streaming Cannot tolerate reordering – Hardware expensive and difficult 22 Multiprocessors Memory mapped Packets arrive out-of-order – Fine for cache lines – Processors have re-order buffers 1 2 All packets with same src/dst must take same NoC path RULE FULL All packets with same src/dst must take same VC RULE

24 Each message type must take a different VC  FabricPort output must support that RULE

25 Design Styles Latency- insensitive Latency- sensitive

Stall-wrappers 26 Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals A B C Tolerate latency on links

27 ABC Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals Tolerate latency on links

28 ABC Longer path, lower frequency  Insert pipeline regs  Doesn’t affect functionality Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals Tolerate latency on links AB C data Ready signals

29 ABC NoC for Latency-insensitive: Pipelined Timing closure: yes Little/no physical info (?) NoC for Latency-insensitive: Pipelined Timing closure: yes Little/no physical info (?) Stall-wrappers Latency-Insensitive Modules are stallable Ready/valid signals AB C data Ready signals Tolerate latency on links

30 Latency-Sensitive Not stallable No Ready/valid signals Latency must remain constant to be functional Permapaths A BC Reserved Paths Never stall Fixed latency Reserved Paths Never stall Fixed latency AB C data

31 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles 3. Case studies: JPEG & Ethernet switch How do we adapt Embedded NoCs to FPGAs?

32 Booksim - NoC simulator - Cycle-accurate - Widely-used - NoC simulator - Cycle-accurate - Widely-used Define traffic pattern RTL (Verilog) - Hardware designs - FabricPort - NoC looks exactly like a hardware module - Hardware designs - FabricPort - NoC looks exactly like a hardware module Sockets SV DPI Simulate any RTL design with a flexible NoC Simulator

Latency-sensitive [ Permapaths ] Parallelize 10 – 40 times – Data width 130 – 520 bits Important for data center applications Compare NoC version to non-NoC version 33 DCTQNRRLE Strobe Pixel in valid Pixel out DCTQNRRLE DCTQNRRLE DCTQNRRLE Strobe …0 PixelXXP1P2P3P4P5…P64 Photo from:

34 DCT QNR RLE Original [close]

35 DCT QNR RLE Original [close]

36 QNR RLE DCT Critical path Original [far] 130 MHz =50% Original [close]

37 QNR RLE DCT Original [far] NoC [average] Original [close]

38 QNR RLE DCT Original [far] NoC [average] 5% drop in frequency from best case 80% improvement from worst case ~6% variability regardless of module placement Frequency is good, and (much) more predictable with NoC Original [close] No drop in frequency for largest design

39 NoC doesn’t produce routing hotspots – it reduces them NoC reduces utilization of scarce long FPGA wires Max 40% Max 100% [long wires]

Communications and networking are the largest FPGA customers Important design: Ethernet Switch 40 High Bandwidth Ethernet switches  ASIC land – 10/25/40/100 Gb/s per port switches, e.g. 32x32 Best previous FPGA switch: – 16x16, 10 Gb/s per port, total 160 Gb/s – FPGAs have transceiver BW more than 1000 Gb/s Photo from:

Embedded NoC is the switch (16X16 switch) Use RTL2Booksim to measure latency/throughput 41

42 Previous work 10 Gb/s NoC Switch 10 Gb/s Approx. Half the latency Approx. Half the latency Saturates at higher injection: Handles more bursty traffic Saturates at higher injection: Handles more bursty traffic

Previous work 10 Gb/s NoC Switch 10 Gb/s 25 Gb/s 3X smaller area 40 Gb/s 5X more switching than best: 819 Gb/s vs 160 Gb/s ~ Half the latency Reconfigurable speed, radix, protocol – augment pkt processing

44 How do we adapt Embedded NoCs to FPGAs? 1. FabricPort: NoC-to-FPGA interface 2. NoC with FPGA design styles Module (any freq./width)  Embedded NoC (1.2 GHz, 150 bits) Designer-friendly Leverage VCs, credits Rules for NoC design on FPGAs: packet ordering and dependencies Support FPGA Design Styles: Latency-insensitive design Latency-sensitive design Parallel JPEG: Frequency predictable and good, Reduces routing hotspots and long wire utilization Ethernet Switch: 5X more switching, 3X less area, reconfigurable 3. Case studies: JPEG & Ethernet switch Compelling case for Embedded NoCs

45