Mohamed Abdelfattah Vaughn Betz


LYNX: CAD for FPGA-based Networks-on-Chip
Mohamed Abdelfattah, Vaughn Betz

System-level Interconnect: An example FPGA design with modules A, B, C, and D communicating with a DDRx controller, PCIe transceivers, and a 100G Ethernet controller. A system-interconnection tool (e.g. Qsys) generates soft buses, such as a memory bus to the DDRx controller, to implement this communication.

Embedded NoCs: An embedded NoC on the FPGA consists of routers, links, FabricPorts into the fabric, and direct I/O links to hard interfaces (DDRx controllers, PCIe transceivers, 100G Ethernet controller). It is a general-purpose system interconnect that implements system communication, eases timing closure to the I/Os, and is more efficient than soft buses. But is it easy to use?


NoC Communication: Is the embedded NoC easy to use? For the example design, the designer must decide which router each module connects to, which FabricPort mode to use, how to packetize data, and how to manage traffic.

LYNX CAD Flow: Given a design and an NoC architecture, LYNX automatically connects the design to produce an NoC-based system. It satisfies correctness constraints (e.g. ordering) and optimizes performance (throughput and latency).

Outline
1. LYNX CAD Flow: How can we automate the use of NoCs?
2. Transaction Communication: How to handle request-reply communication on an NoC?
3. Comparison: LYNX + NoC compared to Qsys + Bus

Part 1. LYNX CAD Flow: How can we automate the use of NoCs?

CAD Flow: Start from a description of the application's communication, and automatically connect the application using the NoC.

CAD Flow: Classify each connection as streaming or transaction.

CAD Flow: Cluster feedback loops using Tarjan's algorithm to avoid stalls; intra-cluster connections are made directly.
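The clustering step can be sketched with a plain Tarjan strongly-connected-components pass: modules that lie on a feedback loop end up in the same component, i.e. the same cluster. This is a toy Python sketch; the module names and the connection graph are hypothetical, not from the talk:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: return the strongly connected components of a
    directed connection graph {node: [successors]}. Modules on a
    feedback loop land in the same component (cluster)."""
    index, low = {}, {}          # discovery order / lowest reachable index
    stack, on_stack = [], set()
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:   # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# Hypothetical design: A -> B -> C -> A is a feedback loop; D hangs off C.
design = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
clusters = tarjan_scc(design)   # A, B, C cluster together; D stands alone
```

In a real flow the clustered modules would then be connected directly in soft logic, bypassing the NoC inside the cluster.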

CAD Flow: Map modules and clusters to suitable locations on the NoC using simulated annealing, maximizing throughput and minimizing latency.
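The mapping step can be illustrated with a toy annealer over a 4x4 router mesh. The cost function here (bandwidth-weighted hop count) is only a stand-in for LYNX's real throughput/latency objective, and all module names, bandwidths, and annealing parameters are illustrative:

```python
import math
import random

def hops(a, b, cols=4):
    # Manhattan distance between routers a and b on a 4x4 mesh (ids 0..15)
    return abs(a % cols - b % cols) + abs(a // cols - b // cols)

def anneal(modules, connections, routers=16, steps=20000, seed=1):
    """Toy simulated-annealing placer: one module per router, minimizing
    bandwidth-weighted hop count (a stand-in for the real objective)."""
    rng = random.Random(seed)
    place = dict(zip(modules, rng.sample(range(routers), len(modules))))
    cost = lambda p: sum(bw * hops(p[s], p[d]) for s, d, bw in connections)
    cur, temp = cost(place), 10.0
    for _ in range(steps):
        m = rng.choice(modules)
        r_new, r_old = rng.randrange(routers), place[m]
        other = next((x for x, r in place.items() if r == r_new), None)
        place[m] = r_new                      # move m (swap if occupied)
        if other is not None and other != m:
            place[other] = r_old
        new = cost(place)
        if new > cur and rng.random() >= math.exp((cur - new) / temp):
            place[m] = r_old                  # reject: undo the move/swap
            if other is not None and other != m:
                place[other] = r_new
        else:
            cur = new                         # accept
        temp = max(temp * 0.9995, 1e-3)       # cool down
    return place, cur

mods = ["A", "B", "C", "D"]
conns = [("A", "B", 8), ("B", "C", 8), ("C", "D", 1)]  # (src, dst, bandwidth)
placement, final_cost = anneal(mods, conns)
```

The heavily-communicating pairs (A-B, B-C) should end up on adjacent routers; the swap move keeps the one-module-per-router constraint intact.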

NoC Mapping: The example NoC has 16 routers, 150-bit links, 4 VCs, and a TDM factor of 4. Routers and links run at 1.2 GHz with 150-bit flits; each FabricPort presents a 600-bit interface to the fabric at ~300 MHz.

NoC Mapping: A 600-bit module occupies an entire FabricPort, so it has 16 candidate locations, one per router.

NoC Mapping: A 150-bit module needs only one of a FabricPort's four 150-bit TDM slots, so it has 16 x 4 = 64 candidate locations.
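The two location counts follow from simple arithmetic over the NoC parameters. A small sketch; only the 600-bit and 150-bit endpoints come from the slides, and the contiguous-slot assumption for intermediate widths is ours:

```python
import math

def candidate_locations(module_width_bits, routers=16, link_width_bits=150, tdm=4):
    """Count candidate NoC placements for a module, given the slide's
    parameters: 16 routers, 150-bit links, 4 TDM slots per FabricPort."""
    slots_needed = math.ceil(module_width_bits / link_width_bits)
    if slots_needed > tdm:
        return 0   # wider than one FabricPort: no single-router placement
    # Assumption (ours, not from the slides): a module occupies contiguous
    # TDM slots, so each router offers (tdm - slots_needed + 1) start slots.
    return routers * (tdm - slots_needed + 1)

assert candidate_locations(600) == 16   # full FabricPort: one spot per router
assert candidate_locations(150) == 64   # one TDM slot: 16 routers x 4 slots
```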


CAD Flow: Generate soft-logic wrappers between each module and its router: translators that packetize data (simple) and traffic managers (complex).

CAD Flow: Analyze throughput and latency on the NoC, and estimate the system frequency.

LYNX is open-source, written in Java, and available at eecg.utoronto.ca/~mohamed/lynx. It supports the features of commercial bus-based tools, such as streaming/transaction communication and uneven arbitration; the challenge is that the NoC is a distributed interconnect.

Part 2. Transaction Communication: How to handle request-reply communication on an NoC?

Streaming Communication: For point-to-point streaming connections, LYNX automatically generates translators that convert a data/valid/ready interface into packets stamped with a destination router (e.g. 5) and a VC (e.g. 2), preserving ready back-pressure.
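The translator's packetization can be modeled in a few lines. This is a toy sketch, not LYNX's actual wrapper: the Packet fields mirror the slide's dest/vc example, while max_flits is an assumed packet-length limit we introduce for illustration:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dest: int        # destination router
    vc: int          # virtual channel
    payload: list    # data words (flits)

def packetize(words, dest, vc, max_flits=4):
    """Wrap raw stream words into packets stamped with the destination
    router and VC, splitting at an assumed max_flits packet length."""
    packets = []
    for i in range(0, len(words), max_flits):
        packets.append(Packet(dest=dest, vc=vc, payload=words[i:i + max_flits]))
    return packets

# The slide's example: send a stream to router 5 on VC 2.
pkts = packetize(list(range(10)), dest=5, vc=2)
```

In hardware the ready/valid back-pressure would throttle this loop; here the whole stream is packetized at once for clarity.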

Transaction Communication

Transaction Communication: Each request carries a return address. At the slave, a response unit, a simple FIFO, buffers the return-address information (return router and return VC) so the reply can be routed back to the requester.
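The response unit is just a FIFO of return addresses. A minimal sketch, assuming replies leave the slave in request order (the class and method names are ours):

```python
from collections import deque

class ResponseUnit:
    """Slave-side response unit: a simple FIFO that saves each request's
    return address (return router, return VC) so the reply can be sent
    back the way the request came."""
    def __init__(self):
        self.returns = deque()

    def on_request(self, return_router, return_vc):
        self.returns.append((return_router, return_vc))

    def on_reply_ready(self, reply_data):
        # Replies leave in request order, so the oldest entry applies.
        router, vc = self.returns.popleft()
        return {"dest": router, "vc": vc, "data": reply_data}

ru = ResponseUnit()
ru.on_request(return_router=3, return_vc=1)
ru.on_request(return_router=7, return_vc=0)
first = ru.on_reply_ready("data0")   # routed back to router 3, VC 1
```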

Transaction Communication: At the master, a traffic manager decides when requests can be issued. Two problems must be solved: (1) traffic build-up in multiple-master systems, and (2) ordering in multiple-slave systems.

Multiple-Master Systems: Multiple masters share a single slave, e.g. a memory.

Multiple-Master Systems: The interconnect between the masters and the slave provides buffering and switching.


Multiple-Master Systems, Traffic Build-Up: Requests accumulate in the interconnect, using much of the buffering and increasing request-reply round-trip latency. This is catastrophic for NoCs, where buffering is shared.


Multiple-Master Systems, Credits Traffic Manager: Each master is given a credits traffic manager (TM) that stalls new requests until replies come back. The number of outstanding requests never exceeds the number of credits, which prevents traffic build-up in the NoC.

Credits Traffic Manager: With (for example) 3 credits, a fourth request stalls until a reply returns; the number of outstanding requests equals the number of credits, preventing traffic build-up in the NoC.
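The credit mechanism can be sketched as a counter: issue decrements, reply increments, and a zero count stalls the master. A toy model (names are ours, and the 3-credit figure is the slide's example):

```python
class CreditsTM:
    """Credits traffic manager: a master may have at most `credits`
    outstanding requests; further requests stall until a reply returns
    a credit, preventing traffic build-up in the NoC."""
    def __init__(self, credits=3):
        self.credits = credits

    def try_issue(self):
        if self.credits == 0:
            return False          # stall: no credits left
        self.credits -= 1
        return True

    def on_reply(self):
        self.credits += 1         # each reply returns one credit

tm = CreditsTM(credits=3)
issued = [tm.try_issue() for _ in range(4)]   # the 4th request stalls
tm.on_reply()                                  # a reply frees a credit
```

Because stalled requests wait at the master instead of inside the NoC, the shared router buffers stay free for traffic that can actually make progress.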

Latency: The credits TM drastically improves round-trip latency compared to running without it, and it reduces NoC contention.

Ordering in Multiple-Slave Systems: A master issues request 1 and request 2 to two different slaves. If reply 2 arrives before reply 1, there is a data-ordering hazard: the interconnect must guarantee correct ordering.

1. Stall Traffic Manager: Qsys uses a stall traffic manager: a request to a different slave is stalled until the outstanding reply returns. Problem: latency increases and throughput drops.

2. VC Traffic Manager: Leverage virtual channels and reorder at the master. Request 1 goes out on VC1 and request 2 on VC2, and replies are buffered in the NoC's existing VC buffers, so no area is added; throughput increases and latency drops. Throughput is limited by the number of VCs.

3. ROB Traffic Manager: A reorder-buffer (ROB) traffic manager buffers replies in a RAM (BRAM) instantiated in FPGA soft logic. It can reorder more replies than the VC TM, giving higher throughput, but it costs more area.
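The reordering logic behind the ROB TM can be sketched as a tagged buffer: requests get sequential tags, and out-of-order replies are held until they can be delivered in request order. A toy model (the class and its interface are ours; the BRAM is modeled as a dict):

```python
class ReorderBuffer:
    """ROB sketch: requests get sequential tags; replies may arrive out
    of order and are buffered (BRAM on the FPGA, a dict here) until
    they can be delivered to the master in request order."""
    def __init__(self):
        self.next_tag = 0       # tag assigned to the next request
        self.head = 0           # tag of the next reply to deliver
        self.buffer = {}        # out-of-order replies, keyed by tag

    def issue(self):
        tag = self.next_tag
        self.next_tag += 1
        return tag

    def receive(self, tag, data):
        """Accept a (possibly out-of-order) reply; return all replies
        that are now deliverable in order."""
        self.buffer[tag] = data
        delivered = []
        while self.head in self.buffer:
            delivered.append(self.buffer.pop(self.head))
            self.head += 1
        return delivered

rob = ReorderBuffer()
t1, t2 = rob.issue(), rob.issue()
assert rob.receive(t2, "reply2") == []                    # held: reply1 missing
assert rob.receive(t1, "reply1") == ["reply1", "reply2"]  # both now in order
```

The deeper this buffer, the more outstanding requests can be in flight, which is why the ROB TM out-performs the VC TM at the cost of BRAM area.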

Three Traffic Managers for Ordering, Performance: Depending on the traffic, either the VC or the ROB TM performs best, and both perform much better than Qsys.

Part 3. Comparison: LYNX + NoC compared to Qsys + Bus.

Frequency: System frequency is ~1.5X higher with the embedded NoC (LYNX NoC) than with 150-bit Qsys multi-master, crossbar, and multi-slave systems.

Area: A 32x32 Qsys crossbar (150 bits) is larger than the largest FPGA, while the embedded NoC is only ~2% of FPGA area.

Summary
1. LYNX CAD Flow: CAD flow steps to automatically connect a design to the NoC.
2. Transaction Communication: Traffic build-up in multiple-master systems; ordering in multiple-slave systems.
3. Area/Frequency Comparison: ~1.5X higher system frequency and up to 78X less area.

Future Work, "Mimic" Benchmarking: A standard way to compare interconnects using application graphs rather than complete apps, with traffic generators in place of modules, covering features such as ordering, uneven arbitration, and broadcast across LYNX, Hoplite, and Qsys. Example results (feed-forward streaming / external-memory transactions):
LYNX: 100 GB/s / 10 GB/s
Hoplite: 75 GB/s / 12 GB/s
Qsys: 25 GB/s / 7 GB/s

Thank You!

Three Traffic Managers for Ordering, Area: The ROB TM takes about twice the area of the VC and stall TMs.

Future Work: "Mimic" benchmarking: application graphs to evaluate and compare different interconnects (LYNX, GENIE, Qsys) on features such as transaction ordering, uneven arbitration, and broadcast. Also: the NoC in the context of tomorrow's FPGAs: high-level synthesis, virtualization, and partial reconfiguration. Example results (feed-forward streaming / external-memory transactions):
LYNX: 100 GB/s / 10 GB/s
GENIE: 75 GB/s / 12 GB/s
Qsys: 25 GB/s / 7 GB/s

LYNX NoC round-trip latency is lower than that of the Altera Qsys bus.