Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar

Outline
Motivation: why Network-on-Chip (NoC).
Comparison to packet networks: similarities, differences, design constraints.
Topology and routing/switching techniques for NoC: mesh, fat-tree, honeycomb; greedy, deflection, wormhole, virtual channels.
Then the paper itself: Design of a Low-Latency Virtual-Channel Router.

Why NoC
The billion-transistor era has arrived, and several such SoCs are in the pipeline; interconnection is critical. A generic interconnection architecture ensures reduced design time, IP reuse, and a predictable backend (versus ad-hoc wiring). Bus-based interconnects were sufficient until now, but no longer: a shared bus is slow (it arbitrates between several requesters), more components increase loading so speed drops further, and ad-hoc routing of wires results in backend complications, lower performance, and higher power consumption.

Why NoC (cont.)
Recently Dally proposed the idea "Route packets, not wires", as in data networks: point-to-point communication, since point-to-point links are faster. Create a chip-wide network (like a regular IP WAN) with a router at every node and links connecting all the routers; messages are encapsulated in packets, which are routed. Challenges: topologies, routing protocols, and network/router designs with a small footprint and low latency.

Some more motivations
The need to insert repeaters into long wires allows us to add the switching needed to implement a network at little additional cost. A network makes efficient use of critical global wiring resources by sharing them across different senders and receivers. It also simplifies the overall design: design a single router and replicate it in both dimensions.

A typical NoC node
Layered design of reconfigurable micronetworks, exploiting methods and tools used for general networks. Micronetworks are based on the ISO/OSI model; the NoC architecture consists of Physical, Data link, and Network layers.
End-to-end reliable transport is implemented in the cores.
Network layer: multi-hop route setup, packet addressing, etc.
Data link layer: contention issues, reliability issues, grouping of physical-layer bits, e.g. into "flits".

A NoC topology
Cores communicate with each other using the NoC. The NoC consists of routers (R) and network interfaces (NI). An NI is linked to a router by non-pipelined wires, and one or more cores connect to an NI.

Other NoC topologies
Fat tree, mesh.

Routing protocols
We will only consider the mesh topology. The objective is to find a path from a source to a destination.
Greedy algorithms (deterministic): choose the shortest path (e.g. X-Y routing).
Adaptive routing: if there is congestion, choose an alternative path (e.g. deflection routing).
Is adaptive better than greedy? Not really, when only local information is used. Adaptive routing can also result in livelock.
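
As a concrete illustration of the deterministic option, here is a minimal sketch of X-Y (dimension-ordered) routing on a mesh; the coordinate convention and port names are assumptions for illustration, not from the paper.

```python
# Minimal sketch of deterministic X-Y (dimension-ordered) routing on a mesh.
# Coordinates and port names are illustrative assumptions.
def xy_route(cur, dst):
    """Return the output port a flit takes at node cur=(x, y) toward dst=(x, y):
    route fully in the X dimension first, then in Y."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: eject to the attached core

# Example: from (1, 1) to (3, 0), the flit moves EAST twice and then SOUTH once.
assert xy_route((1, 1), (3, 0)) == "EAST"
assert xy_route((3, 1), (3, 0)) == "SOUTH"
```

Because every packet between a given source and destination takes the same path, X-Y routing is deadlock-free on a mesh without extra virtual channels, at the cost of ignoring congestion.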

Switching techniques
Circuit switching: a control message is sent from source to destination and a path is reserved; communication then starts, and the path is released when communication is complete.
Store-and-forward (packet switching): each switch waits for the full packet to arrive before sending it to the next switch.
Cut-through and wormhole routing: the switch examines the header, decides where to send the message, and starts forwarding it immediately. In wormhole routing, when the head of a message is blocked, the message stays strung out over the network, potentially blocking other messages (each switch need only buffer the piece of the packet in flight between switches). Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (which needs a buffer large enough to hold the largest packet).
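
To make the latency difference concrete, here is a back-of-the-envelope comparison of store-and-forward versus wormhole/cut-through latency under zero contention; the parameter values (hop count, flits per packet, one cycle per flit and per router) are illustrative assumptions.

```python
# Idealised, contention-free latency model; all parameter values are assumptions.
def store_and_forward_latency(hops, packet_flits, cycles_per_flit=1, router_delay=1):
    # Each router must receive the entire packet before forwarding it.
    return hops * (packet_flits * cycles_per_flit + router_delay)

def wormhole_latency(hops, packet_flits, cycles_per_flit=1, router_delay=1):
    # The header is switched as soon as it is decoded; body flits stream behind it.
    return hops * (router_delay + cycles_per_flit) + (packet_flits - 1) * cycles_per_flit

print(store_and_forward_latency(hops=4, packet_flits=8))  # 36 cycles
print(wormhole_latency(hops=4, packet_flits=8))           # 15 cycles
```

This pipelined transmission is why wormhole and virtual cut-through routing dominate on chip, where per-hop buffering of whole packets is expensive.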

Wormhole Routing – a good fit for NoC
Wormhole routing is good for NoC: low latency and low buffering requirements. It suffers from deadlock, however [example from Ni and McKinley, IEEE Computer, vol. 26, no. 2, 1993].

Adding Virtual Channels
With virtual channels, deadlock can be avoided: move messages and replies on different channels, so there is never a loop on a single channel.
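
As a minimal sketch of this rule, one can statically map each message class to its own virtual channel; the class names and VC indices below are purely illustrative.

```python
# Requests and replies travel on disjoint virtual channels, so a reply can never
# wait behind a request and the circular wait that causes deadlock is broken.
VC_OF_CLASS = {"REQUEST": 0, "REPLY": 1}    # illustrative class names / VC indices

def vc_for(message_class):
    return VC_OF_CLASS[message_class]
```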

Designing Virtual Channel Routers
Design constraints in NoC: minimize latency, minimize buffering, minimal footprint. On chip we can exploit a far greater number of pins and wires, so we may use wide data and flow-control wires.
Objective: design routers with minimal latency; this also results in smaller buffers. This paper presents the design of a low-latency router with a cycle time of 12 FO4 and single-cycle routing/switching.

A Virtual Channel Router

Designing Virtual Channel Routers
Every VC of every input port has buffers to hold arriving flits; arriving flits are placed into the buffers of the corresponding VC.
The routing logic assigns the set of outgoing VCs on which a flit may go. The VC allocator arbitrates between competing input VCs and allocates output VCs.
The switch allocator matches successful input ports (with allocated VCs) to output ports; flits at input VCs that receive grants are passed to the output VCs.

Routing Logic
Three possibilities: return a single VC, return a set of VCs on a single port, or return any set of VCs.
Look-ahead routing: routing is performed at the previous router. It is a good fit for X-Y deterministic (non-adaptive) routing; an SGI routing chip first implemented it.
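
As a rough illustration of look-ahead routing, the sketch below computes the next router's output port while the flit crosses the current router, reusing the hypothetical xy_route helper from the earlier sketch; node coordinates and port names remain illustrative assumptions.

```python
# Minimal sketch of look-ahead routing (assumes the xy_route helper defined earlier).
# The header carries the output port the *current* router should use, computed one
# hop upstream; while the flit crosses this router, we compute the port that the
# *next* router will use and write it back into the header.
def next_node(cur, port):
    x, y = cur
    return {"EAST": (x + 1, y), "WEST": (x - 1, y),
            "NORTH": (x, y + 1), "SOUTH": (x, y - 1), "LOCAL": (x, y)}[port]

def lookahead_step(cur, port_in_header, dst):
    nxt = next_node(cur, port_in_header)      # where the flit is heading this hop
    return nxt, xy_route(nxt, dst)            # pre-computed port for router nxt
```

This removes the routing computation from the per-hop critical path, which is why it pairs so well with deterministic X-Y routing.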

VC Allocation
The complexity of VC allocation depends on the routing range.
Routing returns a single VC: needs a P×V-input arbiter for every outgoing VC.
Routing returns multiple VCs at a single port: an additional V:1 arbiter at every input VC reduces the potential outgoing VCs to one.
Routing returns any set of VCs: needs two cascaded P×V-input arbiters.
We consider the multiple-VCs-at-a-single-port case.
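
As a quick sizing check of the first case (routing returns a single output VC), the arithmetic below assumes a 5-port mesh router with 4 VCs per port; the numbers are illustrative, not from the paper.

```python
# Arbiter sizing for the "routing returns a single VC" case (illustrative numbers).
P, V = 5, 4                      # assumed: 5 router ports, 4 VCs per port
inputs_per_arbiter = P * V       # every outgoing VC arbitrates among all P*V input VCs
num_arbiters = P * V             # one such arbiter per outgoing VC
print(inputs_per_arbiter, num_arbiters)   # -> 20 20: twenty 20-input arbiters
```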

VC Allocation Logic
At every outgoing VC, the following logic is needed.

Switch Allocation
Individual flits at the input VCs arbitrate for access to the crossbar ports. Arbitration can be performed in two stages.
First stage: a VC among the V possible VCs at every input port is selected (a V:1 arbiter at every input port).
Second stage: the winning VC at every input port is matched to an output port (a P:1 arbiter at every output port).
This scheme doesn't guarantee a maximal/maximum/good matching, but it is simple to implement.
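
The sketch below models this two-stage separable switch allocator with simple round-robin arbiters; the data structures and rotating-priority policy are assumptions for illustration, not the paper's implementation.

```python
# Two-stage separable switch allocation (behavioural sketch, illustrative only).
def rr_arbiter(requests, last):
    """Round-robin arbiter: grant one asserted request, rotating priority past `last`."""
    n = len(requests)
    for i in range(1, n + 1):
        idx = (last + i) % n
        if requests[idx]:
            return idx
    return None

def switch_allocate(want, v_ptr, p_ptr):
    """want[p][v] = output port requested by input port p, VC v (or None).
    Stage 1: a V:1 arbiter per input port picks one candidate VC.
    Stage 2: a P:1 arbiter per output port picks one winning input port."""
    P = len(want)
    stage1 = [rr_arbiter([w is not None for w in want[p]], v_ptr[p]) for p in range(P)]
    grants = {}
    for out in range(P):
        contend = [stage1[p] is not None and want[p][stage1[p]] == out for p in range(P)]
        winner = rr_arbiter(contend, p_ptr[out])
        if winner is not None:
            grants[winner] = out   # input `winner` crosses to output `out` this cycle
    return grants
```

Because the two stages arbitrate independently, a stage-1 winner can lose in stage 2 even though another VC at the same input could have won, which is exactly why the matching is not guaranteed to be maximal.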

Switch Allocation

Issues
VC allocation and switch allocation are serialized, so either a flit takes two clock cycles to get through, or the clock speed must be low.
Solution: speculative switch allocation.

Speculative Switch Allocation
Dally proposed speculative switch allocation: perform switch and VC allocation in parallel, assuming that a VC participating in switch allocation will be granted an output VC. If it is not, the cycle is wasted.
An even better idea is to perform speculative and non-speculative switch allocation in parallel, with non-speculative allocation taking higher priority. (Non-speculative allocation is done for input VCs which have already been allocated an output VC.)
Under light load, speculation will mostly work; under heavy load, non-speculative allocation will mostly work. Either way, the delay is mostly one cycle.
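
A tiny sketch of the priority rule just described: per output port, non-speculative winners beat speculative ones, and a speculative grant only counts if its parallel VC allocation succeeded. The names and per-port framing are assumptions for illustration.

```python
# Per-output-port grant selection (illustrative, not the paper's exact logic).
def select_grant(nonspec_winner, spec_winner, spec_vc_alloc_ok):
    """nonspec_winner / spec_winner: input VC granted by each allocator (or None)."""
    if nonspec_winner is not None:
        return nonspec_winner            # already holds an output VC: always preferred
    if spec_winner is not None and spec_vc_alloc_ok:
        return spec_winner               # speculation paid off: use the crossbar now
    return None                          # speculation failed: the cycle is wasted
```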

Further Enhancement
Is it possible to have zero-cycle VC/switch allocation? Yes, most of the time. That's what this paper is about!

Idea 1: Free Virtual Channel Queue
Keep a queue of free VCs at every outgoing port, along with a bit mask with one set bit. The stage of VC allocation where an output VC is selected can then be removed.
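
A minimal sketch of a per-output free-VC queue, assuming a FIFO of VC identifiers; the interface is an illustration of the idea, not the paper's circuit (which, per the slide, also keeps a one-hot bit mask).

```python
# Per-output-port queue of free virtual channels (behavioural sketch).
from collections import deque

class FreeVCQueue:
    def __init__(self, num_vcs):
        self.free = deque(range(num_vcs))     # all VCs start out free

    def allocate(self):
        # The VC at the head of the queue is handed to the next packet,
        # so no arbitration among output VCs is needed.
        return self.free.popleft() if self.free else None

    def release(self, vc):
        # A VC is returned when the tail flit of the packet holding it departs.
        self.free.append(vc)
```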

Idea 2: Pre-computing arbitration decisions
If you somehow knew the arbitration results before the flits actually arrive and fight for the VCs and the switch, and I mean every arbitration decision (VC allocation, switch allocation, etc.), then the router could be made to run in zero cycles: arriving flits are routed/switched in the same clock cycle in which they arrive. The clock speed can also be quite good, since the data path and control path are no longer in series. That is Idea 2.

Some preliminaries before going into detail
Tree arbiters: implement large arbiters using a tree of small arbiters.
Matrix arbiters: a fair and fast arbiter implementation.

VC allocation using a Tree Arbiter

A Matrix Arbiter
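
Since the figure is not reproduced here, the sketch below gives a behavioural model of a matrix arbiter: w[i][j] records whether requester i currently beats requester j, and the winner is demoted to lowest priority (least-recently-served fairness). This is a generic textbook formulation, assumed for illustration.

```python
# Behavioural sketch of a matrix arbiter (generic formulation, illustrative only).
class MatrixArbiter:
    def __init__(self, n):
        self.n = n
        # w[i][j] == 1 means requester i currently has priority over requester j.
        self.w = [[1 if i < j else 0 for j in range(n)] for i in range(n)]

    def grant(self, req):
        for i in range(self.n):
            # i wins if it requests and beats every other active requester.
            if req[i] and all(self.w[i][j] or not req[j]
                              for j in range(self.n) if j != i):
                for j in range(self.n):
                    if j != i:               # demote the winner below everyone else
                        self.w[i][j] = 0
                        self.w[j][i] = 1
                return i
        return None

arb = MatrixArbiter(4)
print(arb.grant([1, 0, 1, 0]))   # -> 0 (initially highest priority)
print(arb.grant([1, 0, 1, 0]))   # -> 2 (requester 0 was demoted after winning)
```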

Pre-computing arbitration decisions
An alternative arbiter design: generate the grant enables one cycle in advance and latch them; the grants are then the product of the latched enables and the requests, so grants are generated in the same clock in which the requests arrive, provided at least one request remains. However, when no request remains, it is difficult to generate the grant enables.

Generating grant enables
Safe environment: only one request may arrive in a cycle, so it is safe to assert all grant enables, and the grant can still be generated in the same cycle.
Unsafe environment: multiple requests may arrive in the same cycle. We can still assert all grant enables, but must abort when multiple requests arrive in the same cycle.
All first-stage V:1 arbiters operate in a safe environment; the P:1 arbiters, however, do not.

Generating grant enables
Even in unsafe environments, assert all grant enables; we may need to abort when multiple requests arrive. Note that after an abort, a correct arbitration is ensured in the next cycle.
Why will it work? Because in a lightly loaded network, multiple requests for the same VC/port rarely arrive (few aborts), and in a heavily loaded network flits remain buffered, so non-speculative arbitration (higher priority) happens most of the time and there are again few aborts.
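
To make the abort mechanism concrete, here is a behavioural sketch of an arbiter that optimistically asserts every grant enable and detects a conflict when more than one grant fires; the recovery policy (latching a single enable for the next cycle) is an assumption used to illustrate why correct arbitration is restored one cycle later.

```python
# Optimistic grant enables with conflict abort (illustrative sketch, not the paper's RTL).
class OptimisticArbiter:
    def __init__(self, n):
        self.n = n
        self.enable = [1] * n          # pre-computed grant enables, all asserted

    def cycle(self, req):
        """Returns (winner, aborted). Grants are simply enable AND request."""
        grants = [e & r for e, r in zip(self.enable, req)]
        if sum(grants) <= 1:
            self.enable = [1] * self.n                     # stay optimistic
            return (grants.index(1) if 1 in grants else None), False
        # Conflict: several grants fired at once, so the crossbar traversal is
        # aborted. Latch a one-hot enable so that next cycle only one grant can fire.
        keep = grants.index(1)                             # e.g. fixed-priority choice
        self.enable = [1 if i == keep else 0 for i in range(self.n)]
        return None, True
```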

I will skip the design details now, since they are intricate and complex, and jump to the critical-path analysis.

Analysis of critical path
The critical path generates the VC/switch grants from the pre-computed grant enables. The crossbar traversal is aborted once invalid grants are detected; in case of an abort, the correct control signals are ensured in the next cycle.

Final design
The control-path critical delay is 12 FO4; until now, the best design had a 20 FO4 delay. They sampled a NoC-based ASIC using this idea last week; it runs at several-GHz speeds. Note that the fast cycle time is made possible by running VC allocation and switch allocation in parallel; speculation must be used, or the delay would be higher (one more cycle).

Simulation results

If (doubts) Then Ask; Else Thank you; Goto Discussion; End if;