PRESENTED BY: PRIYANK GUPTA 04/02/2012 Generic Low Latency NoC Router Architecture for FPGA Computing Systems & A Complete Network on Chip Emulation Framework.

Presentation transcript:

Introduction 2
- Moore’s law is pushing towards more complex SoCs.
- Network on Chip (NoC) is one of the technologies that enable us to keep up with the law.
- Designing and evaluating NoCs has been a challenge:
  - Complexity of design
  - Inaccurate network/component models
  - Simulation/CAD tools
- FPGAs provide a platform where the above concerns can be addressed and real-life data can be measured.
- Present on-chip FPGA resources can be used as the underlying interconnect fabric.

NoC Fundamentals 3
Block diagram of a router.

NoC Fundamentals 4

Router Architecture 5
Simplified block diagram of a router. Flits arriving over the input channels are stored in buffers associated with each input. A set of allocators assigns buffers on the next node and channel bandwidth to pending flits. When a flit has been allocated the resources it needs, it is forwarded by the crossbar switch to an output channel.

Crossbar Example 6
A 4×5 crossbar switch implemented with five 4:1 multiplexers. Each multiplexer selects the input to be connected to the corresponding output.
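As a rough illustration of the multiplexer view of a crossbar, the sketch below models an N-input, M-output crossbar as M independent N:1 selections. The function name `crossbar` and the use of `None` for an idle output are assumptions made for this example, not part of the presented design.

```python
from typing import List, Optional, Sequence


def crossbar(inputs: Sequence[int],
             select: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Model an N x M crossbar as M independent N:1 multiplexers.

    select[j] is the index of the input routed to output j,
    or None if output j is idle (an assumption of this sketch).
    """
    outputs: List[Optional[int]] = []
    for sel in select:
        if sel is None:
            outputs.append(None)  # this multiplexer drives nothing
        elif 0 <= sel < len(inputs):
            outputs.append(inputs[sel])  # the N:1 multiplexer picks one input
        else:
            raise ValueError(f"select index {sel} is out of range")
    return outputs


# A 4x5 crossbar: four inputs, five output multiplexers.
inputs = [0xA, 0xB, 0xC, 0xD]
select = [2, 0, None, 3, 1]
result = crossbar(inputs, select)
```

Each entry of `select` plays the role of one multiplexer's control signal, so a 4×5 switch needs five such entries.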

Crossbar Symbol 7

Data Packet Example 8
Packet format for our simple network. Time, in cycles, is shown in the vertical direction, while the 18 signals of a channel are shown in the horizontal direction. The two leftmost signals contain the phit type, while the 16 remaining signals contain either a destination address or data, or are unused in the case of a null phit.
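The 18-signal layout can be sketched as a 2-bit type field followed by a 16-bit payload. The numeric type codes below (`NULL`, `HEAD`, `PAYLOAD`) and the helper names are hypothetical; the slide does not give the actual encoding.

```python
from enum import IntEnum
from typing import Tuple


class PhitType(IntEnum):
    # Hypothetical encoding, chosen only for illustration.
    NULL = 0
    HEAD = 1
    PAYLOAD = 2


PAYLOAD_BITS = 16  # destination address or data
TYPE_BITS = 2      # the two leftmost signals


def pack_phit(ptype: PhitType, payload: int = 0) -> int:
    """Pack a phit into an 18-bit word: [type(2) | payload(16)]."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 16 bits")
    return (int(ptype) << PAYLOAD_BITS) | payload


def unpack_phit(word: int) -> Tuple[PhitType, int]:
    """Split an 18-bit word back into (type, payload)."""
    if not 0 <= word < (1 << (TYPE_BITS + PAYLOAD_BITS)):
        raise ValueError("word must fit in 18 bits")
    return PhitType(word >> PAYLOAD_BITS), word & ((1 << PAYLOAD_BITS) - 1)


head = pack_phit(PhitType.HEAD, 0x0005)  # head phit carrying destination 5
```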

Data Flow Control 9
Units of resource allocation. Messages are divided into packets for allocation of control state. Each packet includes routing information (RI) and a sequence number (SN). Packets are further divided into flits for allocation of buffer capacity and channel bandwidth. Flits include no routing or sequencing information beyond that carried in the packet, but may include a virtual-channel identifier (VCID) to record the assignment of packets to control state.

Data Flow Algorithms 10
- Deterministic: the algorithm always chooses the same path between x and y, even if multiple options exist.
  - Simple to implement
  - Does a poor job of balancing load
- Oblivious: the algorithm decides on a path between x and y without using any information about the network’s state. Deterministic algorithms are a subset.
- Adaptive: the algorithm adapts to the state of the network. The information used includes node status, channel load, queue length, etc.
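The three classes can be contrasted with a small sketch on a 2-D mesh. The specific algorithms chosen here (XY routing, a random choice between XY and YX, and a minimal adaptive scheme driven by a per-link load table) are illustrative assumptions, as are all function names; the slide does not name particular algorithms.

```python
import random
from typing import Dict, List, Tuple

Node = Tuple[int, int]


def _walk(src: Node, dst: Node, order: Tuple[int, int]) -> List[Node]:
    """Follow a minimal path, correcting one dimension at a time in `order`."""
    cur = list(src)
    path = [src]
    for dim in order:
        while cur[dim] != dst[dim]:
            cur[dim] += 1 if dst[dim] > cur[dim] else -1
            path.append((cur[0], cur[1]))
    return path


def deterministic_xy(src: Node, dst: Node) -> List[Node]:
    """Deterministic: always the same path (X first, then Y)."""
    return _walk(src, dst, (0, 1))


def oblivious_xy_or_yx(src: Node, dst: Node, rng: random.Random) -> List[Node]:
    """Oblivious: picks XY or YX at random, ignoring network state."""
    return _walk(src, dst, rng.choice([(0, 1), (1, 0)]))


def adaptive_minimal(src: Node, dst: Node,
                     load: Dict[Tuple[Node, Node], int]) -> List[Node]:
    """Adaptive: at each hop, take the productive link with the lowest load."""
    path = [src]
    cur = src
    while cur != dst:
        candidates = []
        for dim in (0, 1):
            if cur[dim] != dst[dim]:
                step = 1 if dst[dim] > cur[dim] else -1
                nxt = (cur[0] + step, cur[1]) if dim == 0 else (cur[0], cur[1] + step)
                candidates.append(nxt)
        here = cur
        cur = min(candidates, key=lambda n: load.get((here, n), 0))
        path.append(cur)
    return path


xy_path = deterministic_xy((0, 0), (2, 1))
# A congested link out of (0, 0) in the X direction steers the adaptive route.
adaptive_path = adaptive_minimal((0, 0), (2, 1), {((0, 0), (1, 0)): 5})
```

With an empty load table the adaptive sketch falls back to the XY path, which shows how the deterministic case sits inside the other two.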

Routing Mechanics 11
- The mechanism used to implement any routing algorithm.
- Table-based routing: the set of paths for each pair of nodes is stored in a table, which is indexed by source and destination node. Only the portion of the table needed on a particular node has to be stored on that node.
- Algorithmic routing: the routing relation is computed using a network-specific algorithm. More efficient in terms of both area and speed.
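A minimal sketch of the two mechanisms, assuming a bidirectional ring of `n` nodes with shortest-direction routing. The ring, the tie-break rule and the function names are assumptions for illustration only. The per-node table holds just the entries that node needs (one next hop per destination), while the algorithmic version computes the same answer on demand.

```python
from typing import Dict


def ring_next_hop(cur: int, dst: int, n: int) -> int:
    """Algorithmic routing: compute the next hop on an n-node ring."""
    if cur == dst:
        return cur  # already at the destination
    forward = (dst - cur) % n  # hops needed in the increasing direction
    # Ties are broken towards the increasing direction (an arbitrary choice).
    return (cur + 1) % n if forward <= n - forward else (cur - 1) % n


def build_node_table(cur: int, n: int) -> Dict[int, int]:
    """Table-based routing: the slice of the routing table stored at `cur`."""
    return {dst: ring_next_hop(cur, dst, n) for dst in range(n)}


table_at_0 = build_node_table(0, 8)
```

The table costs one entry per destination at every node, whereas the algorithmic form needs only a subtraction and a comparison, which is the area and speed advantage the slide refers to.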

Flow Control 12
- Determines how network resources such as channel bandwidth, buffer capacity and control state are allocated to the data traversing the network.
- Bufferless flow control
  - Simple to implement
  - Packets are dropped or misrouted if a resource is not available
- Buffered flow control
  - Packets are stored temporarily in a buffer, so they are less likely to be dropped
  - Additional hardware cost for implementation

Network Topologies 13
Torus and mesh networks: (a) a torus network (4-ary 2-cube) includes the connection from node 3 to node 0 in both dimensions, but (b) a mesh network (4-ary 2-mesh) omits this connection.
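The difference between the two topologies comes down to whether coordinates wrap around. The sketch below lists the neighbors of a node in a k-ary n-dimensional network; the function name `neighbors` and the `wrap` flag are assumptions of this example.

```python
from typing import List, Tuple


def neighbors(node: Tuple[int, ...], k: int, wrap: bool) -> List[Tuple[int, ...]]:
    """Return the neighbors of `node` in a k-ary n-cube (wrap=True, torus)
    or a k-ary n-mesh (wrap=False). Assumes k >= 3 so that the two
    wraparound neighbors in a dimension are distinct."""
    result = []
    for dim in range(len(node)):
        for step in (-1, 1):
            coord = node[dim] + step
            if wrap:
                coord %= k  # torus: node k-1 connects back to node 0
            elif not 0 <= coord < k:
                continue    # mesh: no link past the edge
            result.append(node[:dim] + (coord,) + node[dim + 1:])
    return result


torus_nb = neighbors((3, 0), k=4, wrap=True)
mesh_nb = neighbors((3, 0), k=4, wrap=False)
```

A corner node such as (3, 0) keeps all four links in the torus but only two in the mesh.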

FPGA Emulation Framework 14
- Emulation environment developed to explore, evaluate and compare a variety of NoC solutions.
- Current FPGA implementations are limited in flexibility and do not allow full testing of NoC implementations.
- Cycle-accurate simulation using a combination of hardware and software modules.
- Added flexibility due to the modular approach of the architecture.
- [...] times faster than an HDL simulator.

NoC Emulation Architecture 15

NoC Emulation Architecture 16
- Xilinx Virtex 2 Pro V20 with an embedded PowerPC: the processor provides the much-needed flexibility for the emulation process.
- Monitor Module: responsible for the interface between the host PC and the FPGA board. Also streams out data generated by the various tests.
- Programmable NoC Platform: responsible for the traffic generators and receptors. Also holds the user-defined interconnection set between the switches and the network.

Data Flow 17
The Traffic Generator (TG) shown above can provide an image of the network congestion at each moment of the emulation. Each time a flit is not acknowledged by its receptor (i.e. a switch or TR) and has to be resent, a counter readable by the processor is incremented.

NoC Emulation Flow 18

NoC Emulation Flow 19
- No re-synthesis or re-mapping of the system is needed, thanks to the HW-SW structure.
- The processor is able to initialize parameters in hardware.
- The emulation flow is categorized as:
  - Stochastic Emulation Flow: implemented at the hardware level only. Configuration is done at the compilation level.
  - Trace-based Emulation Flow: the entire NoC trace is loaded via software located in RAM. The processor streams the data into the emulated NoC and collects information on latency and congestion.

Results – Run Time 20
For the same number of packets, the total delivery time in burst mode is higher than for uniform traffic. This is because the probability of collisions between packets is significantly higher in burst mode.

Results – Congestion Rate 21
The plots indicate that, in burst mode, the congestion rate does not increase linearly with the number of delivered packets.

Results – Avg. Latency 22
The average packet latency reaches a congestion limit, which is the limit of the NoC in terms of latency.

Generic Low Latency Router - Motivation 23
- The new generation of FPGAs comprises millions of LUTs and will contain many parallel soft processor cores plus glue/extra logic.
- Use of traditional interconnect schemes will lead to underutilization.
- Future designs are expected to be at a higher level than the traditional gate level. Functionality will be implemented through the programmability of such cores.
- The increased complexity of FPGAs will make an RTL-based design flow inefficient.

Proposed Solution 24
- Network on Chip can provide a flexible, scalable and reliable communication solution for such large and complex systems.
- NoC provides the ability to change bandwidth and add processing elements. Cost is linear in this case, whereas traditional crossbar interconnects scale exponentially.
- FPGAs contain significant global and local routing resources, which can be used to construct the interconnect fabric and implement the routing algorithm.

Prior Work 25
- Many routers have been designed for NoC FPGA implementation:
  - Circuit-switched router: the head flit charts out the path and the body follows. It has a long circuit setup latency and low bandwidth utilization, but once the path is set up, QoS is guaranteed and data latency is lower.
  - Time-multiplexed router: precomputed communication pattern. Less flexible. Works well when the communication load is 100%, but performance drops significantly when the load is < 40%.
  - Packet-switched router: negotiates network resources dynamically at run time. Flexible and scalable with low resource utilization, but with high latency (about 8 clock cycles per hop).
- FPGAs are primarily used for prototyping and evaluating latency, throughput, cost and power.

Generic Low Latency NoC Router - Overview 26
- Reconfigurable wormhole router for packet-switched NoC designs.
- Low routing latency, low complexity and high buffer utilization.
- Designed to be scalable, flexible and reliable for a variety of FPGA platforms and network configurations.
- 1-D ring, 2-D mesh and 3-D cube network topologies were used to measure the feasibility of design and implementation on the FPGA.
- Latency of 2 cycles per hop.

Wormhole Router Block Diagram 27
- Three main parts: flow control, router components and pipeline control.
- Flow control: wormhole flow control.
- Components include input and output ports, an arbiter to arbitrate between multiple requests, and an FSM to maintain the state of the output port.
- Pipeline control is instrumental in achieving low per-hop latency and parallel computation.

Wormhole Flow Control 28
(a) The header arrives at the node while the virtual channel is in the idle state (I) and the desired upper (U) output channel is busy, allocated to the lower (L) input.
(b) The header is buffered and the virtual channel is in the waiting state (W), while the first body flit arrives.

Wormhole Flow Control 29
(c) The header and first body flit are buffered, while the virtual channel is still in the waiting state. In this state, the input channel is blocked. The second body flit cannot be transmitted, since it cannot acquire a flit buffer.
(d) The output virtual channel becomes available and is allocated to this packet. The state moves to active (A) and the head is transmitted to the next node.

Wormhole Flow Control 30
(e) The body flits follow.
(f) The body flits follow.

Wormhole Flow Control 31
(g) The tail flit is transmitted and frees the virtual channel, returning it to the idle state.
(h) A time-space diagram showing this process.
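The idle, waiting and active states in steps (a) to (g) can be sketched as a small state machine. The class `InputVC`, the two-flit buffer depth and the flit labels "H", "B" and "T" are assumptions for this example; they follow the figure captions rather than the router's actual RTL.

```python
from enum import Enum
from typing import List, Optional


class VCState(Enum):
    IDLE = "I"
    WAITING = "W"
    ACTIVE = "A"


class InputVC:
    """One input virtual channel with a small flit buffer (depth is assumed)."""

    def __init__(self, depth: int = 2) -> None:
        self.state = VCState.IDLE
        self.depth = depth
        self.buffer: List[str] = []

    def accept(self, flit: str) -> bool:
        """Try to buffer an arriving flit; False means the input is blocked."""
        if len(self.buffer) >= self.depth:
            return False
        self.buffer.append(flit)
        if flit == "H" and self.state is VCState.IDLE:
            self.state = VCState.WAITING  # header waits for an output channel
        return True

    def step(self, output_free: bool) -> Optional[str]:
        """Advance one cycle and return the flit forwarded, if any."""
        if self.state is VCState.WAITING and output_free:
            self.state = VCState.ACTIVE  # output channel allocated to this packet
        if self.state is VCState.ACTIVE and self.buffer:
            flit = self.buffer.pop(0)
            if flit == "T":
                self.state = VCState.IDLE  # the tail frees the channel
            return flit
        return None


vc = InputVC(depth=2)
trace = []
trace.append(vc.accept("H"))   # (a)/(b) header buffered, state W
trace.append(vc.step(False))   # output still busy
trace.append(vc.accept("B"))   # (c) first body flit buffered
trace.append(vc.accept("B"))   # (c) second body flit blocked: buffer full
trace.append(vc.step(True))    # (d) output allocated, header sent
trace.append(vc.accept("B"))
trace.append(vc.step(True))    # (e) body follows
trace.append(vc.accept("T"))
trace.append(vc.step(True))    # (f) body follows
trace.append(vc.step(True))    # (g) tail sent, back to idle
```

Once the channel is active, the sketch keeps forwarding regardless of `output_free`, reflecting that the output stays allocated to the packet until the tail leaves.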

Packet Format 32
- Packet length is not fixed.
- The format is defined by the network topology.
- The Output Channel (OC) field stores the output channel used by the packet.

Input Port 33
- Single-entry flit buffer uses dual-port memory.
- Dimension-order routing.
- Routing computation is decoupled from arbitration.
- Head and tail pointers are used to evaluate whether a flit is present in the buffer.

Output FSM 34
- Maintains the state of the output port.
- The active state indicates that an output port has been matched with a downstream input port.
- Departure of the tail flit puts the router in the wait state.
- Once all flits leave the downstream routers, the router goes into the idle state.

Router Pipeline Organization 35
a. Clock 1: The destination address and output channel are latched in. The flit is written into the flit buffer. During this period, the arbitration result and look-ahead routing are computed.
b. Clock 2: The crossbar control signal is latched in and the flit is read from the granted port.

Timing of Second Pipeline Stage 36
T = Tco + Tlut + Trot + Tsu
- Tco is the clock-to-register (or memory) output delay
- Tlut is the delay of the LUT cells (multiplexer logic)
- Trot is the delay due to programmable wire routing
- Tsu is the setup time of the device
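To make the formula concrete, the sketch below adds the four delays and converts the resulting period into a maximum clock rate. The delay values are hypothetical placeholders, not measurements from the presented router, and the function names are invented for this example.

```python
def min_clock_period_ns(t_co: float, t_lut: float, t_rot: float, t_su: float) -> float:
    """T = Tco + Tlut + Trot + Tsu, with all delays in nanoseconds."""
    return t_co + t_lut + t_rot + t_su


def max_clock_mhz(period_ns: float) -> float:
    """Highest clock rate (MHz) that still fits the stage into one cycle."""
    return 1000.0 / period_ns


# Hypothetical delays in ns, for illustration only.
period = min_clock_period_ns(t_co=0.5, t_lut=1.0, t_rot=2.0, t_su=0.5)
f_max = max_clock_mhz(period)
```

With these placeholder numbers the routing delay Trot is half of the period, which illustrates why programmable wire routing tends to matter so much for the second stage on an FPGA.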

Pipeline Diagram ASIC vs FPGA Router 37

Network Topologies Used 38

Credit Based Routing 39
- The upstream router keeps a count of the free buffers in the downstream router.
- The credit count is decremented every time a buffer is consumed, and incremented again when the downstream router frees a buffer and returns a credit.
- If the count is zero, all downstream buffers are full.
- Data is transferred only if the credit count > 0.
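The credit scheme can be sketched as a counter held by the upstream router. The class and method names are assumptions for this example, and the credit-return step is the standard counterpart of the decrement described on the slide.

```python
class CreditCounter:
    """Upstream view of the free buffers in one downstream input port."""

    def __init__(self, downstream_buffers: int) -> None:
        if downstream_buffers <= 0:
            raise ValueError("downstream router needs at least one buffer")
        self.capacity = downstream_buffers
        self.credits = downstream_buffers  # all buffers start free

    def can_send(self) -> bool:
        """Data may be transferred only while the credit count is > 0."""
        return self.credits > 0

    def send_flit(self) -> None:
        """Consume one downstream buffer."""
        if not self.can_send():
            raise RuntimeError("no credits: all downstream buffers are full")
        self.credits -= 1

    def return_credit(self) -> None:
        """Called when the downstream router frees a buffer."""
        if self.credits >= self.capacity:
            raise RuntimeError("more credits returned than buffers exist")
        self.credits += 1


link = CreditCounter(downstream_buffers=2)
link.send_flit()
link.send_flit()
blocked = not link.can_send()  # both downstream buffers are now in use
link.return_credit()           # downstream forwarded one flit
```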

Results – Resource Utilization 40
Logic cost vs. router radix (32-bit data-path width)

Results – Resource Utilization 41
Logic cost vs. data-path width

Results - Timing 42
(a) Maximum clock rate vs. router radix
(b) Two important critical paths within a router

Results - Power 43
(a) Static and dynamic power (mW) vs. router radix
(b) Normalized per-packet power consumption vs. radix

Packet Generator and Receiver 44
- Due to pin limitations, packets must be generated using on-chip logic within the FPGA rather than by external sources.
- Each node of the NoC system has a packet generator and receiver attached.

Results – Resource Utilization per Configuration 45
Resource utilization of different network configurations

Results – Resource Utilization per Configuration 46

Summary 47
- A highly scalable router that is easily used across different network topologies.
- Low hop-by-hop propagation delay using a packet-switched NoC router.
- Analysis of the router in terms of scalability, hardware cost, operation speed and power dissipation.
- The real-world feasibility of such a router architecture has been demonstrated, and its use within an FPGA platform provides a very robust and cost-effective solution.