NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization


NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization
Tunis, December 2015
Instructor: Davide Bertozzi
Email: davide.bertozzi@unife.it

Acknowledgement
Many slides have been taken or adapted from Prof. Giorgos Dimitrakopoulos, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece. NoCS 2012 tutorial: "Switch design: A unified view of microarchitecture and circuits".

Switch Building Blocks

Wormhole switch operation
- The operations can fit in a single cycle or be pipelined; pipelining requires extra registers in the control path.
- Body/tail flits inherit the decisions taken by the head flit, yet they cannot bypass RC and SA: there is simply nothing for them to do in those stages.
- Operation latency is an issue! In a single-cycle switch, the head flit is anyway the one that determines the critical timing path; body/tail flits would have slack.

Look-ahead routing
- Routing computation is based only on the packet's destination, so it can be performed in switch A and used in switch B.
- Look-ahead routing computation (LRC): does it really need to be a separate pipeline stage?

Look-ahead routing optimization
- The LRC can be performed in parallel with SA.
- LRC must complete before the ST stage of the same switch: the head flit needs to embed the output-port requests for the next switch before leaving.

Look-ahead routing details The head flit of each packet carries the output port requests for the next switch together with the destination address
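As an illustration of the look-ahead scheme above, here is a minimal sketch assuming xy (dimension-order) routing on a 2D mesh; function names and the coordinate convention are illustrative, not from the slides:

```python
# Minimal look-ahead route computation (LRC) sketch, assuming xy
# (dimension-order) routing on a 2D mesh.

def xy_route(cur, dst):
    """Output port to take at node `cur` for destination `dst`
    under xy routing: correct x first, then y."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx: return "E"
    if dx < cx: return "W"
    if dy > cy: return "N"
    if dy < cy: return "S"
    return "LOCAL"

def next_node(cur, port):
    cx, cy = cur
    return {"E": (cx + 1, cy), "W": (cx - 1, cy),
            "N": (cx, cy + 1), "S": (cx, cy - 1), "LOCAL": cur}[port]

def lookahead_route(cur, dst):
    """At switch A, compute the output port the head flit will request
    at the *next* switch B, so B can skip its RC stage."""
    port_here = xy_route(cur, dst)   # decision already taken at A
    nxt = next_node(cur, port_here)
    return xy_route(nxt, dst)        # embedded in the head flit
```

Because `xy_route` depends only on the current node and the destination, switch A can evaluate it on B's behalf, which is exactly why LRC needs no information that the head flit doesn't already carry.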

Low-latency organizations
- Baseline: SA precedes ST (no speculation).
- SA decoupled from ST: predict or speculate the arbiter's decisions. Trick: crossbar control does not come from the switch allocator. When the prediction is wrong, replay all the tasks (same as the baseline).
- Do the tasks in different phases (circuit switching): arbitration and routing at the setup phase; contention-free ST in the transmit phase.
- Bypass switches: reduce latency under certain criteria; when bypass is not enabled, same as the baseline.
[Pipeline diagrams: baseline (LRC, SA, ST, LT); prediction/speculation with ST overlapping SA; circuit switching with a setup phase (LRC, SA) and a transmit phase (ST, LT)]

Prediction criteria (ST in parallel with SA)
- Target: it is likely that a packet coming from the east (if any) will go to the west, because of xy routing in a 2D mesh.
- During idle time, the predictor pre-sets the I/O connection to the West output port through the crossbar multiplexer.
- At runtime the prediction accuracy is verified; on a mis-prediction, the arbiter's decision takes over.
[Figure: East input pre-connected to the West output through the crossbar; per-output arbiters detect mis-predictions]

Speculation (ST in parallel with SA)
- At the beginning of the cycle, requests are fed both to the allocator's arbiters and to fast speculation logic.
- Mux control signals are set on the fly by the speculation logic.
- At the end of the cycle, the arbiter computation results are compared with the outcome of the speculation logic; on mis-speculation, the wrongly forwarded flit is masked and the transfer is replayed.

Prediction-based ST: Hit
- Assumption: RC is a pipeline stage (no LRC).
- Idle state: output port X+ is selected and reserved for input X+; the crossbar is pre-set.
- 1st cycle: the incoming flit is transferred to X+ without waiting for RC and SA; in parallel, RC is performed and confirms the prediction was correct.
- Outcome: SA, ST and RC were effectively performed in parallel!

Prediction-based ST: Miss
- Idle state: output port X+ is selected and reserved.
- 1st cycle: the incoming flit is transferred to X+ without RC and SA; in parallel, RC is performed and reveals the prediction is wrong (X- is correct). A kill signal to X+ is asserted, so the dead flit never propagates.
- 2nd/3rd cycle: the dead flit is removed; the tasks are replayed as in the baseline case (RC is already done, so move on with SA).
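The hit/miss behavior of the two slides above can be condensed into a small sketch. The straight-through predictor and the miss penalty of three cycles are illustrative assumptions matching the example, not the only possible design:

```python
# Sketch of prediction-based switch traversal: during idle cycles the
# predictor reserves a straight-through crossbar path (e.g., input X+
# to output X+), betting on dimension-order traffic.

def predict_output(input_port):
    """Straight-through prediction: a flit entering from one direction
    most likely continues in the same dimension under xy routing."""
    straight = {"X+": "X+", "X-": "X-", "Y+": "Y+", "Y-": "Y-"}
    return straight.get(input_port)

def traverse(input_port, actual_output):
    """ST starts immediately on the predicted path while RC runs in
    parallel; a mismatch kills the dead flit and replays SA and ST."""
    predicted = predict_output(input_port)
    if predicted == actual_output:
        return {"hit": True, "kill": False, "latency": 1}
    return {"hit": False, "kill": True, "latency": 3}  # replay penalty
```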

Speculative ST
- Assumption: RC is already done (LRC).
- Speculation criterion: assume contention doesn't happen! If correct, the flit is transferred directly to the output port without waiting for SA.
- In case of contention, SA was performed anyway, so ST proceeds accordingly in the next cycle: one cycle is wasted, and an abort event must be generated and managed.
[Waveform: ports 0 and 1 request the same output; flit A wins, flit B's speculative traversal is aborted and retried after SA]

Efficient recovery from mis-speculation: XOR-based recovery
- Assume contention never happens; if correct, the flit is transferred directly to the output port.
- If not, bitwise-XOR all the competing flits and send the encoded result on the link. At the same time, arbitrate and mask (set to 0) the winning input; repeat on the next cycle.
- The encoded outputs (due to contention) are resolved at the receiver; decoding can be done at the output port of the switch too.
[Waveform: without contention flits pass through unchanged; with contention, A^B is sent first and resolved downstream]

XOR-based recovery
- Works upon a simple XOR property: (A^B^C) ^ (B^C) = A. The receiver is always able to decode by XOR-ing two sequential coded values.
- Performs similarly to speculative switches: only head-flit collisions matter.
- Maintains the previous router's arbitration order.
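The XOR property the slide relies on can be demonstrated directly; integers stand in for flit payloads, and the scenario (A, B, C competing, A winning arbitration first) is illustrative:

```python
# XOR-based recovery demo: on contention the switch sends the XOR of
# all competing flits, masks the arbitration winner, and repeats. The
# receiver recovers the winner of each round by XOR-ing two
# consecutive coded values, since (A ^ B ^ C) ^ (B ^ C) = A.

def xor_output(competing_flits):
    """Encoded link value for one cycle: XOR of all competing flits."""
    out = 0
    for f in competing_flits:
        out ^= f
    return out

A, B, C = 0b1010, 0b0110, 0b0011
cycle1 = xor_output([A, B, C])  # A^B^C sent; A wins and is masked
cycle2 = xor_output([B, C])     # B^C sent on the next cycle
decoded_A = cycle1 ^ cycle2     # receiver recovers A
assert decoded_A == A
```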

Bypassing intermediate nodes
- Virtual bypassing paths: bypassed switches are traversed in 1 cycle instead of the full 3-cycle pipeline.
- Switch bypassing criteria: frequently used paths; packets continually moving along the same dimension.
- Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns: not generic enough.
[Figure: SRC-to-DST path where bypassed intermediate switches take 1 cycle instead of 3]

Speculation-free low-latency switches
- Prediction and speculation drawbacks: on a mis-prediction (or mis-speculation) the tasks must be replayed, so latency is not always saved; it depends on network conditions.
- Merged Switch Allocation and Traversal (SAT): latency is always saved, with no speculation; the delay of SAT is smaller than SA and ST in series.

How can we increase throughput?
The green flow is blocked until the red one passes the switch, leaving the physical channel idle. The solution is to have separate buffers for each flow.

Virtual Channels
- Decouple output-port allocation from next-hop buffer allocation.
- Contention is present on the output links (crossbar output ports) and on the input ports of the crossbar; it is resolved by time-sharing the resources.
- Words of two packets can be mixed on the same channel: the words are on different virtual channels, so a virtual-channel identifier must be in place.
- Separate buffers at the end of the link guarantee no blocking between the packets.

Virtual Channels
- Virtual-channel support does not mean extra links: VCs act as extra street lanes, and traffic on each lane is time-shared on a common channel.
- Provide dedicated buffer space for each virtual channel: decouple channels from buffers and interleave flits from different packets.
- "The Swiss Army Knife for Interconnection Networks": reduce head-of-line blocking, prevent deadlocks, provide QoS, fault tolerance, ...

Datapath of a VC-based switch
- Separate buffer and separate flow-control signals (credits/stalls) for each VC.
- The radix of the crossbar may stay the same (or may increase); a higher number of input ports increases propagation delay through the crossbar.
- Input VCs may share a common input port of the crossbar; alternatively, crossbars can be replicated.
- On each cycle, at most one VC will receive a new word.

Per-packet operation of a VC-based switch
- A switch connects input VCs to output VCs.
- Routing computation (RC) determines the output port; it can be shared among the VCs of an input port, and it may restrict the usable output VCs (e.g., based on message type or destination ID).
- An input VC must first allocate an output VC; allocation is performed by a VC allocator (VA).
- RC and VA are done per packet, on the head flits, and inherited by the remaining flits of the packet.

Per-flit operation of a VC-based switch
- Flits with an allocated output VC fight for an output port; the port is allocated by the switch allocator.
- This entails 2 levels of arbitration: at the input port (the VCs of the same input share a common crossbar input port, so each input has multiple requests, equal to the number of input VCs) and at the output port.
- The flit leaves the switch provided credits are available downstream; credits are counted per output VC.
- Unfortunate case: the VC and the port are allocated to an input VC, but no credits are available.
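The per-output-VC credit bookkeeping described above can be sketched as follows; the class name and interface are illustrative:

```python
# Minimal per-output-VC credit counter: a flit may leave the switch
# only if the downstream buffer of its output VC has credits. Credits
# are decremented on send and returned when the downstream router
# frees a buffer slot.

class CreditCounter:
    def __init__(self, num_vcs, depth):
        self.credits = [depth] * num_vcs  # one counter per output VC

    def can_send(self, vc):
        return self.credits[vc] > 0

    def send(self, vc):
        # The "unfortunate case" above: port and VC granted, no credits.
        assert self.can_send(vc), "no credits: flit must retry SA"
        self.credits[vc] -= 1

    def credit_return(self, vc):
        self.credits[vc] += 1
```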

Switch allocation
- All VCs at a given input port share one crossbar input port.
- The switch allocator matches ready-to-go flits with crossbar time slots; allocation is performed on a cycle-by-cycle basis.
- N×V requests (input VCs), N resources (output ports): at most one flit at each input port can be granted, and at most one flit at each output port can be sampled.
- Other options need more crossbar ports (input-output speedup).

Switch allocation example
- One request (arc) for each input VC; in this example with 2 VCs per input, at most 2 arcs leave each input of the bipartite graph, i.e., at most 2 requests per row of the request matrix.
- The allocation is a matching problem: each grant must satisfy a request, each requester gets at most one grant, and each resource is granted at most once.
[Figure: bipartite request graph between inputs and outputs, and the equivalent request matrix]

Separable allocation
- Matchings have at most one grant per row and per column.
- Two phases of arbitration, column-wise and row-wise, performed in either order (input-first or output-first).
- The arbiters in each stage are independent, but the outcome of each one affects the quality of the overall match: fast and cheap, yet bad choices in the first phase can prevent the second stage from generating a good matching.
- Multiple iterations are required for a good match; iterative scheduling converges to a maximal schedule, but that is unaffordable for high-speed networks.

Input-first allocation
Implementation of input-first allocation (row-wise arbitration first).

Output-first allocation
Implementation of output-first allocation (column-wise arbitration first).
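The separable scheme can be sketched in a few lines for the input-first order. This is a hedged illustration using fixed-priority arbiters; real routers typically use round-robin or matrix arbiters:

```python
# Separable input-first switch allocation sketch. requests[i] is the
# set of output ports wanted by the ready VCs at input i.

def arbiter(candidates):
    """Fixed-priority arbiter: grant the lowest-numbered request."""
    return min(candidates) if candidates else None

def input_first_allocate(requests, num_outputs):
    # Stage 1 (per input): pick at most one output request per input.
    picked = {i: arbiter(outs) for i, outs in requests.items() if outs}
    # Stage 2 (per output): among inputs that picked this output,
    # grant exactly one.
    grants = {}
    for out in range(num_outputs):
        contenders = [i for i, o in picked.items() if o == out]
        winner = arbiter(contenders)
        if winner is not None:
            grants[winner] = out
    return grants  # input -> granted output: a valid matching
```

With requests {0: {0, 1}, 1: {0}, 2: {1}}, input 0 picks output 0 in stage 1 and beats input 1 in stage 2, so input 1 gets nothing, even though granting input 0 output 1 instead would have produced a larger matching. That is exactly the "bad first-phase choice" problem of separable allocation.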

Centralized allocator: wavefront allocation
- Pick an initial diagonal and grant all requests on that diagonal; they can never conflict, since cells on a diagonal share no row or column.
- For each grant, delete the requests in the same row and column.
- Repeat for the next diagonal.
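The diagonal sweep above can be sketched as follows; this is a functional illustration, not a hardware-accurate wavefront circuit (which evaluates the diagonals combinationally in one cycle):

```python
# Wavefront allocation sketch over an N x N request matrix. Requests
# on the same (wrapped) diagonal never share a row or a column, so a
# whole diagonal can be granted at once; granted rows and columns are
# then excluded from later diagonals.

def wavefront_allocate(req, start_diag=0):
    n = len(req)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for d in range(n):
        diag = (start_diag + d) % n
        for i in range(n):
            j = (i + diag) % n  # cells (i, (i + diag) mod n)
            if req[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False  # delete requests in this row
                col_free[j] = False  # and in this column
    return grants
```

Rotating `start_diag` from cycle to cycle is the usual way to keep the allocator fair.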

Switch allocation for adaptive routing
- Input VCs can request more than one output port: the set of Admissible Output Ports (AOP). This adds an extra selection step (not arbitration); selection mostly tries to load-balance the traffic.
- Input-first allocation: for each input VC, select one request from the AOP; arbitrate locally per input and select one input VC; arbitrate globally per output and select one VC among all competing inputs.
- Output-first allocation: send all requests of each input VC's AOP to the outputs; arbitrate globally per output and grant one request; arbitrate locally per input and grant one input VC; for this input VC, select one of the possibly multiple grants from the AOP set.

VC allocation
- Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels).
- Before a packet can proceed through the router, it needs to claim ownership of a VC buffer at the next router: the VC is acquired by the head flit and inherited by the body and tail flits.
- The VC allocator assigns waiting packets at the inputs to output VC buffers that are not currently in use: N×V inputs (input VCs), N×V outputs (output VCs).
- Once assigned, the VC is used for the entire packet's duration in the switch.

VC allocation example
- An input VC may request any of the VCs of a given output port; in case of adaptive routing, an input VC may request VCs from different output ports.
- There are no port constraints as in switch allocators.
- Allocation can be either separable (2 arbitration steps) or centralized: at most one grant per input VC, at most one grant per output VC.
[Figure: request and grant bipartite graphs between input VCs and output VCs]

Input-output VC mapping
- Any-to-any flexibility in the VC allocator is unnecessary: partition the set of VCs to restrict legal requests.
- Different use cases for VCs restrict the possible transitions: the message class never changes, and VCs within a packet class are functionally equivalent.
- These properties can be exploited to reduce VC allocator complexity!

Single-cycle VA or pipelined organization
- Head flits see longer latency than body/tail flits: RC and VA decisions are taken for head flits and inherited by the rest of the packet, while every flit fights for SA.
- Can we parallelize SA and VA?

The order of VC and switch allocation
- VA first, SA follows: only packets with an allocated output VC fight for SA.
- VA and SA can be performed concurrently: speculate that waiting packets will successfully acquire a VC, and prioritize non-speculative requests over speculative ones for SA.
- Speculation holds only for the head flits (the body/tail flits always know their output VC).

  VA   | SA   | Description
  Win  | Win  | Everything OK! Leave the switch.
  Win  | Lose | Allocated a VC; retry SA (no longer speculative: high priority next cycle).
  Lose | Win  | Does not know the output VC, but was allocated the output port (grant lost, output idle); retry both VA and SA.
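The priority rule (non-speculative requests before speculative ones) can be sketched with a simple fixed-priority allocator; the interface and tie-breaking are illustrative:

```python
# Switch allocation with speculation sketch: flits that already hold
# an output VC (non-speculative) are considered before head flits
# still waiting on VA (speculative), so speculation can never steal a
# slot from a flit that is certain to use it.

def allocate_with_speculation(nonspec, spec):
    """nonspec/spec map input port -> requested output port."""
    grants, taken = {}, set()
    for pool in (nonspec, spec):     # non-speculative pool first
        for inp in sorted(pool):     # fixed-priority tie-breaking
            out = pool[inp]
            if inp not in grants and out not in taken:
                grants[inp] = out
                taken.add(out)
    return grants
```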

Free list of VCs per output
- A VC can be assigned non-speculatively after SA: a free list of output VCs exists at each output, and the flit that was granted access to an output receives the first free VC before leaving the switch.
- If no VC is available, the output-port allocation slot is missed and the flit retries switch allocation.
- VCs are not unnecessarily occupied by flits that don't win SA.
- Further optimization is feasible: flits are allowed to compete in SA for a target port only if there are empty VCs at that output port.