Networks on Chip: Router Microarchitecture & Network Topologies

Slides:



Advertisements
Similar presentations
Spatial Computation Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS Mihai.
Advertisements

Variations of the Turing Machine
1 EE384Y: Packet Switch Architectures Part II Load-balanced Switch (Borrowed from Isaac Keslassys Defense Talk) Nick McKeown Professor of Electrical Engineering.
EE384y: Packet Switch Architectures
1 UNIT I (Contd..) High-Speed LANs. 2 Introduction Fast Ethernet and Gigabit Ethernet Fast Ethernet and Gigabit Ethernet Fibre Channel Fibre Channel High-speed.
1
Cognitive Radio Communications and Networks: Principles and Practice By A. M. Wyglinski, M. Nekovee, Y. T. Hou (Elsevier, December 2009) 1 Chapter 12 Cross-Layer.
Processes and Operating Systems
1 Hyades Command Routing Message flow and data translation.
Chapter 6 File Systems 6.1 Files 6.2 Directories
1 Chapter 12 File Management Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design Principles,
Multipath Routing for Video Delivery over Bandwidth-Limited Networks S.-H. Gary Chan Jiancong Chen Department of Computer Science Hong Kong University.
Break Time Remaining 10:00.
Jongsok Choi M.A.Sc Candidate, University of Toronto.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) The Black Widow High Radix Clos Network S. Scott, D.Abts, J. Kim, and W.
Mohamed Hauter CMPE 259 – Sensor Networks UCSC 1.
Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally Stanford.
Turing Machines.
Database Performance Tuning and Query Optimization
PP Test Review Sections 6-1 to 6-6
TCP/IP Protocol Suite 1 Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 2 The OSI Model and the TCP/IP.
Chapter 3 Logic Gates.
1 COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED. On the Capacity of Wireless CSMA/CA Multihop Networks Rafael Laufer and Leonard Kleinrock Bell.
Outline Minimum Spanning Tree Maximal Flow Algorithm LP formulation 1.
Association Rule Mining
Chapter 20 Network Layer: Internet Protocol
Computer vision: models, learning and inference
Chapter 6 File Systems 6.1 Files 6.2 Directories
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 Introduction to Network Layer Lesson 09 NETS2150/2850 School of Information Technologies.
1 Processes and Threads Chapter Processes 2.2 Threads 2.3 Interprocess communication 2.4 Classical IPC problems 2.5 Scheduling.
1 Let’s Recapitulate. 2 Regular Languages DFAs NFAs Regular Expressions Regular Grammars.
Clock will move after 1 minute
Chapter 11 Creating Framed Layouts Principles of Web Design, 4 th Edition.
Mani Srivastava UCLA - EE Department Room: 6731-H Boelter Hall Tel: WWW: Copyright 2003.
Select a time to count down from the clock above
Princess Sumaya University
A Novel 3D Layer-Multiplexed On-Chip Network
Flattened Butterfly Topology for On-Chip Networks John Kim, James Balfour, and William J. Dally Presented by Jun Pang.
Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks ______________________________ John Kim, William J. Dally &Dennis Abts Presented.
Allocator Implementations for Network-on-Chip Routers Daniel U. Becker and William J. Dally Concurrent VLSI Architecture Group Stanford University.
1 Lecture 23: Interconnection Networks Paper: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton.
MINIMISING DYNAMIC POWER CONSUMPTION IN ON-CHIP NETWORKS Robert Mullins Computer Architecture Group Computer Laboratory University of Cambridge, UK.
Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
1 Lecture 25: Interconnection Networks Topics: flow control, router microarchitecture Final exam:  Dec 4 th 9am – 10:40am  ~15-20% on pre-midterm  post-midterm:
1 Lecture 26: Interconnection Networks Topics: flow control, router microarchitecture.
1 Indirect Adaptive Routing on Large Scale Interconnection Networks Nan Jiang, William J. Dally Computer System Laboratory Stanford University John Kim.
Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar.
Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.
High-Performance Networks for Dataflow Architectures Pravin Bhat Andrew Putnam.
Elastic-Buffer Flow-Control for On-Chip Networks
Networks-on-Chips (NoCs) Basics
QoS Support in High-Speed, Wormhole Routing Networks Mario Gerla, B. Kannan, Bruce Kwan, Prasasth Palanti,Simon Walton.
George Michelogiannakis William J. Dally Stanford University Router Designs for Elastic- Buffer On-Chip Networks.
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Switch Microarchitecture Basics.
Efficient Microarchitecture for Network-on-Chip Routers
Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha Pipelined Router Shubu Mukherjee*, Federico Silla !, Peter Bannon $, Joel.
Virtual-Channel Flow Control William J. Dally
1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.
The network-on-chip protocol
Lecture 23: Interconnection Networks
Pablo Abad, Pablo Prieto, Valentin Puente, Jose-Angel Gregorio
Exploring Concentration and Channel Slicing in On-chip Network Router
OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel
NoC Switch: Basic Design Principles &
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti
Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar.
CS 6290 Many-core & Interconnect
Lecture 25: Interconnection Networks
Presentation transcript:

Networks on Chip: Router Microarchitecture & Network Topologies Daniel U. Becker, James Chen, Nan Jiang, Prof. William J. Dally Concurrent VLSI Architecture Group Stanford University

Outline Introduction Router Microarchitecture Network Topologies Q&A Open Source Router RTL Q&A 10/13/09 NoC: Router Microarchitecture & Network Topologies

Why Networks-on-Chip? (1) Clock frequency scaling and ILP exploitation have hit power wall Single-threaded performance is leveling off Transistor budgets still growing Further performance increases rely on exploiting parallelism Key problems: Scalability Energy efficiency Design complexity (source: Wikipedia) 10/13/09 NoC: Router Microarchitecture & Network Topologies

Why Networks-on-Chip? (2) Bus-based solutions don’t scale Contention, electrical characteristics, timing, … Really want point-to-point links Full connectivity is too expensive Area, power, delay, … Ad-hoc wiring is too expensive Design, verification, … Need efficient, scalable communication fabric on chip: Network! Building blocks (circuits, microarchitecture) Topologies Routing & flow control schemes 10/13/09 NoC: Router Microarchitecture & Network Topologies

NoCs vs. Long-Haul Networks Networks are a mature field Extensive body of prior research Internet, wireless, interconnection networks, … NoCs have similar building blocks & design choices So why not just leverage those results? Well, we do, but… NoCs are subject to very different constraints These constraints lead to very different design tradeoffs 10/13/09 NoC: Router Microarchitecture & Network Topologies

The On-Chip Environment (1) Wires are cheap Favors wider interfaces and more channels Buffers are expensive Provide “just enough” buffering (e.g. credit round trip) Minimize occupancy & turnaround time Efficient flow control required Power budget is limiting factor for current chip designs Minimize power required for moving things around Maximize power available for doing actual work 10/13/09 NoC: Router Microarchitecture & Network Topologies

The On-Chip Environment (2) Semiconductor technology requires planar layout Favors regular, low-dimensional topologies Strict cycle time constraints Favors simple routing algorithms, flow control mechanisms No complex arithmetic or large lookup tables Minimize amount of state required for adaptive routing No need to support online reconfigurability Topology is essentially static Possibly some dynamic reconfiguration for power management But must be able to isolate faulty cores, e.g. at boot time 10/13/09 NoC: Router Microarchitecture & Network Topologies

Traffic Characteristics (1) Message-oriented Few message types with short, uniform message sizes Highly latency-sensitive E.g. memory, cache coherence traffic in CMPs Requires shallow router pipelines, low-diameter topologies Bursty Can’t just optimize for average network load! Message loss usually not acceptable Message may contain only instance of its payload data End-to-end reliability is expensive 10/13/09 NoC: Router Microarchitecture & Network Topologies

Traffic Characteristics (2) Assumptions for this talk: Memory load/store traffic Split transaction protocol Single-packet messages Single-phit flits Transact. Type Message Type Packet Size Read Request 1 flit Reply 1+n flits Write 10/13/09 NoC: Router Microarchitecture & Network Topologies

Router Microarchitecture Topology, routing and flow control set the high-level framework for network cost and performance Channel width, connectivity, path diversity, hop count, … Router microarchitecture determines the cost of each hop Directly impacts latency, throughput and power consumption Comprises variety of different aspects: Pipeline organization Implementation of routing logic, flow control & allocators Buffer management Power management, reliability & fault tolerance 10/13/09 NoC: Router Microarchitecture & Network Topologies

Virtual Channels Share channel capacity between multiple data streams Interleave flits from different packets Provide dedicated buffer space for each virtual channel Decouple channels from buffers “The Swiss Army Knife for Interconnection Networks” Prevent deadlocks Reduce head-of-line blocking Also useful for providing QoS 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Dally’87] Using VCs for Deadlock Prevention Protocol deadlock Circular dependencies between messages at network edge Solution: Partition range of VCs into different message classes Routing deadlock Circular dependencies between resources within network Partition range of VCs into different resource classes Restrict transitions between resource classes to impose partial order on resource acquisition {packet classes} = {message classes} × {resource classes} 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Dally’90] Using VCs for Flow Control Coupling between channels and buffers causes head-of-line blocking Adds false dependencies between packets Limits channel utilization Increases latency Even with VCs for deadlock prevention, still applies to packets in same class Solution: Assign multiple VCs to each packet class 10/13/09 NoC: Router Microarchitecture & Network Topologies

VC Router Pipeline Per packet Per flit Route Computation (RC) Determine candidate output port(s) and VC(s) Can be precomputed at upstream router (lookahead routing) Virtual Channel Allocation (VA) Assign available output VCs to waiting packets at input VCs Switch Allocation (SA) Assign switch time slots to buffered flits Switch Traversal (ST) Send flits through crossbar switch to appropriate output Per packet Per flit 10/13/09 NoC: Router Microarchitecture & Network Topologies

Allocator Comparison Allocators represent key pieces of router control logic Affect both cost/complexity and performance Evaluate & compare several representative VC and switch allocator implementations Investigate scaling behavior with router radix, number of VCs, and other key parameters To appear in SC‘09 10/13/09 NoC: Router Microarchitecture & Network Topologies

Allocation Basics Arbitration: Allocation: Matching: Multiple requestors Single resource Request + grant vectors Allocation: Multiple equivalent resources Request + grant matrices Matching: Each grant must satisfy a request Each requester gets at most one grant Each resource is granted at most once 10/13/09 NoC: Router Microarchitecture & Network Topologies

Separable Allocators Matchings have at most one grant per row and per column Implement via to two phases of arbitration Column-wise and row-wise Perform in either order Arbiters in each stage are fully independent Fast and cheap But bad choices in first phase can prevent second stage from generating a good matching! Input-first: Output-first: 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Tamir’93] Wavefront Allocators Avoid separate phases … and bad decisions in first Generate better matchings But delay scales linearly Also difficult to pipeline Principle of operation: Pick initial diagonal Grant all requests on diagonal Never conflict! For each grant, delete requests in same row, column Repeat for next diagonal 10/13/09 NoC: Router Microarchitecture & Network Topologies

Wavefront Allocator Timing Originally conceived as full- custom design Tiled design True delay scales linearly Signal wraparound creates combinational loops Effectively broken at priority diagonal But static timing analysis cannot infer that Synthesized designs must be modified to avoid loops! 10/13/09 NoC: Router Microarchitecture & Network Topologies

Loop-Free Wavefront Allocators I/O Transformation: Replication: 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Hurt’99] Diagonal Propagation Allocator Unrolled matrix avoids combinational loops Sliding priority window activates sub-matrix cells But static timing analysis again sees false paths! Actual delay is ~n Reported delay is ~(2n-1) Hurts synthesized designs Another design that avoids combinational loops is […] 10/13/09 NoC: Router Microarchitecture & Network Topologies

[J. Chen] Domino Logic Wavefront Allocator Wavefront allocator tends to be on the critical path for high- radix routers Full-custom dual-rail domino implementation achieves 2-3x reduction in latency over synthesized implementation Exploits the one-hot nature of priority signal Optimize token propagation path Single domino gate per cell on critical path Maximum pull-down depth of two 10/13/09 NoC: Router Microarchitecture & Network Topologies

VC Allocation Before packets can proceed through router, need to acquire ownership of VC at downstream router VC allocator matches unassigned input VCs with output VCs that are not currently in use P×V requestors (input VCs), P×V resources (output VCs) VC is acquired by head flit, inherited by body & tail flits 10/13/09 NoC: Router Microarchitecture & Network Topologies

VC Allocator Implementations Not shown: Masking logic for busy VCs 10/13/09 NoC: Router Microarchitecture & Network Topologies

Sparse VC Allocation Any-to-any flexibility in VC allocator is unnecessary Different use cases for VCs restrict possible transitions: Message class never changes Resource classes are traversed in order VCs within a packet class are functionally equivalent Requests apply to all VCs assigned to target class Can take advantage of these properties to reduce VC allocator complexity! Property Max. Savings Delay 41% Area 90% Power 83% 10/13/09 NoC: Router Microarchitecture & Network Topologies

VC Allocator Performance Type of VC allocator has little impact on performance Each packet only gerenates a single request Each input VC only has small set of possible destination VCs Mesh 4 VCs FBFly8 VCs 10/13/09 NoC: Router Microarchitecture & Network Topologies

VC Allocator Cost Wavefront cost scales badly Synthesis fails for larger FBFly configurations But cost is reasonable for small mesh configurations 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocation Traversal of the router requires access to the crossbar All VCs at a given input port share one crossbar input Switch allocator matches ready-to-go flits with crossbar time slots Allocation performed on a cycle-by-cycle basis P×V requestors (input VCs), P resources (output ports) At most one flit at each input port can be granted 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocator Implementations Not shown: Masking logic for credit and VC state check 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocators – 1 VC / Class Normalized to cycle time Mesh FBFly 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocators – 2 VCs / Class 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocators – 4 VCs / Class 10/13/09 NoC: Router Microarchitecture & Network Topologies

Switch Allocator Cost 10/13/09 Separable input-first is fastest and has least cost Separable output-first has higher delay & cost but similar matching performance Matrix arbiters cost much more than round-robin, but delay advantage is small Wavefront provides best matching, but at the cost of increased delay, area and power 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Peh’01] Speculative Switch Allocation Perform switch allocation in parallel with VC allocation Speculate that the latter will be successful If so, saves a pipeline stage, otherwise try again Reduces zero-load latency, but adds complexity Prioritize non-speculative requests Avoid performance degradation due to misspeculation Usually implemented through secondary switch allocator Simpler/faster/cheaper than true multi-priority allocator But need to prioritize non-speculative grants 10/13/09 NoC: Router Microarchitecture & Network Topologies

Speculative Grant Masking Conventional approach kills speculative grants upon conflict with non-speculative ones Speculation matters primarily at low load, so can pessimistically kill using non- speculative requests instead Sacrifice some speculation opportunities for lower delay But zero-load latency remains virtually unaffected! 10/13/09 NoC: Router Microarchitecture & Network Topologies

Speculation – 1 VC / Class Normalized to cycle time Mesh FBFly 10/13/09 NoC: Router Microarchitecture & Network Topologies

Speculation – 2 VCs / Class 10/13/09 NoC: Router Microarchitecture & Network Topologies

Speculation – 4 VCs / Class 10/13/09 NoC: Router Microarchitecture & Network Topologies

Speculation Cost Pessimistic masking reduces delay hit due to speculation Slight increase in area Slight power increase in small configurations only 10/13/09 NoC: Router Microarchitecture & Network Topologies

Allocator Study Conclusions Network performance largely insensitive to VC allocator For reasonable packet sizes, VC allocator is lightly loaded With light load, all variants produce near-ideal matchings Favors use of simplest / fastest variant (sep/if) Sparse VC allocation can greatly reduce delay & cost For switch allocation, wavefront allocator produces better matchings at the cost of slightly higher delay Best choice depends on application goals & constraints Pessimistic speculation narrows delay gap to non-speculative implementation, but can slightly increase area & power 10/13/09 NoC: Router Microarchitecture & Network Topologies

Dual-Path Switch Allocation (1) Significant fraction of cycle is used for peripheral logic VC state & credit checking Input VC selection / combination of requests For buffered flits, much of this can be precomputed Leave most of cycle for actual allocation logic In some cases need to be pessimistic! For newly arriving flits, precomputation not possible But at most one such flit per input port per cycle! So can simplify allocation for these flits 10/13/09 NoC: Router Microarchitecture & Network Topologies

Dual-Path Switch Allocation (2) Idea: Provide separate, optimized logic paths for newly arriving and buffered flits New flits always try to bypass buffer (fast path) If no other VC at input port is active, send requests to fast-path output arbiter First part of cycle available for credit checking, etc. Also write to buffer in case bypassing fails For buffered flits, precompute control signals (slow path) Preselect next VC to go (round-robin between active VCs) Check for credit and eligibility one cycle ahead Almost entire cycle available for slow-path allocation Merge grants from both paths Prioritize slow-path grants to avoid starvation 10/13/09 NoC: Router Microarchitecture & Network Topologies

Network Topologies (1) Physical organization of the network Number of nodes per router Connectivity between routers Directly affects all key network parameters Channel length, router complexity, latency, throughput, … These parameters translate into performance and power 10/13/09 NoC: Router Microarchitecture & Network Topologies

Network Topologies (2) 2D mesh Concentrated mesh Node 2D mesh One router per node Connected with 4 neighbors Concentrated mesh Several nodes share a router Reduces the network diameter Fat tree / folded clos Number of channels between levels is constant Router Source: N. Jiang 10/13/09 NoC: Router Microarchitecture & Network Topologies

[Kim’07] Flattened Butterfly 4-ary 2-flatfly 2-ary 4-flatfly 10/13/09 NoC: Router Microarchitecture & Network Topologies

Flattened Butterfly Benefits High-radix routers reduce network diameter Routers are fully connected within each dimension Minimal paths require one inter-router hop per dimension Folding adds path diversity Can now traverse dimensions in any order Highly scalable with network size 10/13/09 NoC: Router Microarchitecture & Network Topologies

[N. Jiang] Topology Comparison (1) Evaluate implementation cost for various topologies Focus on cost of routers Area and power from P&R Unless indicated otherwise, use parameters on the right Missing points for kn router because it failed to meet cycle time Parameter Value Channel width 64 bits Packet injection rate 2% Request size 32 bits Reply size 128 bits Network size 64 nodes Operating frequency 200 MHz 10/13/09 NoC: Router Microarchitecture & Network Topologies

Topology Comparison (2) 10/13/09 NoC: Router Microarchitecture & Network Topologies

Topology Comparison (3) 10/13/09 NoC: Router Microarchitecture & Network Topologies

Topology Study Conclusions Meshes are most commonly used topology in NoCs, but they are not very efficient High average hop count Flattened Butterfly provides more bandwidth per watt at the cost of increased complexity Non-tiled layout, more complex routers Best choice for particular application depends on design goals and parameters 10/13/09 NoC: Router Microarchitecture & Network Topologies

Router RTL (1) Analytical models are becoming increasingly inaccurate Do not properly model wire delay Crucial in submicron processes! Derived from idealized, full-custom circuit designs But much of the timing-critical control logic is synthesized Useful for developing intuition & high-level models But detailed evaluations require a more accurate model Goal: Provide flexible template to generate customized RTL-level router implementation 10/13/09 NoC: Router Microarchitecture & Network Topologies

Router RTL (2) Developed complete VC router implementation in RTL Highly parameterized Topologies, VC & flit buffer configurations, allocators, routing logic, and various other implementation details Fully synthesizable Verilog-2001 BSD license allows for virtually unrestricted use Live source code repository & bug tracker http://nocs.stanford.edu/ 10/13/09 NoC: Router Microarchitecture & Network Topologies

Thank You! Questions? 10/13/09 NoC: Router Microarchitecture & Network Topologies