
1 NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization
Tunis, December 2015
Instructor: Davide Bertozzi

2 Acknowledgement
Many slides have been taken or adapted from Prof. Giorgos Dimitrakopoulos, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece. NoCS 2012 Tutorial: "Switch design: A unified view of microarchitecture and circuits".

3 Switch Building Blocks

4 Wormhole switch operation
The operations can fit in the same cycle or they can be pipelined; extra registers are then needed in the control path. Body/tail flits inherit the decisions taken by the head flit, yet they cannot bypass the RC and SA stages: simply, there is nothing for them to do there. Operation latency is an issue! For single-cycle switches, the head flit is in any case the one that determines the critical timing path, so body/tail flits would have slack. A toy timeline is sketched below.
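The following toy model is a minimal sketch assuming a three-stage RC/SA/ST pipeline; the stage split, function name, and printed timing are illustrative assumptions, not taken from the slides. It shows how body/tail flits occupy the RC and SA stages without doing any work in them:

    # Toy timeline for a 3-stage pipelined wormhole switch (RC -> SA -> ST).
    # Body/tail flits flow through the RC and SA stages too, but perform no
    # work there; they simply inherit the head flit's route and grant.

    def pipeline_timeline(num_flits):
        for i in range(num_flits):
            kind = "head" if i == 0 else "body/tail"
            # flit i occupies stage s (0=RC, 1=SA, 2=ST) during cycle i + s
            slots = ", ".join(f"cycle {i + s}: {st}"
                              for s, st in enumerate(("RC", "SA", "ST")))
            work = "does RC+SA" if i == 0 else "idle in RC/SA"
            print(f"flit {i} ({kind}): {slots} -- {work}")

    pipeline_timeline(3)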

5 Look-ahead routing Routing computation is based only on packet’s destination Can be performed in switch A and used in switch B Look-ahead routing computation (LRC) Does it really need to be a separate pipeline stage?

6 Look-ahead routing optimization
The LRC can be performed in parallel with SA, but it must complete before the ST stage of the same switch: the head flit needs to embed the output port requests for the next switch before leaving.

7 Look-ahead routing details
The head flit of each packet carries the output port requests for the next switch together with the destination address
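As an illustration, here is a minimal sketch of look-ahead XY routing on a 2D mesh. The coordinate scheme (y growing to the north), port names, and function names are assumptions made for the example, not taken from the slides:

    # Look-ahead XY routing sketch: at the current switch the output port is
    # already known (it was computed upstream); the LRC computes the port the
    # *next* switch will use and embeds it in the head flit, so the next
    # switch can start switch allocation immediately.

    def xy_route(cur, dst):
        """Classic dimension-ordered XY routing: correct x first, then y."""
        (cx, cy), (dx, dy) = cur, dst
        if dx > cx: return "EAST"
        if dx < cx: return "WEST"
        if dy > cy: return "NORTH"
        if dy < cy: return "SOUTH"
        return "LOCAL"

    def lookahead(cur, dst):
        """Return (port used here, port pre-computed for the next switch)."""
        here = xy_route(cur, dst)
        step = {"EAST": (1, 0), "WEST": (-1, 0),
                "NORTH": (0, 1), "SOUTH": (0, -1), "LOCAL": (0, 0)}[here]
        nxt = (cur[0] + step[0], cur[1] + step[1])
        return here, xy_route(nxt, dst)

    # Head flit leaving (0,0) for (2,1): it exits EAST here, and the request
    # for EAST at switch (1,0) travels inside the head flit.
    print(lookahead((0, 0), (2, 1)))   # ('EAST', 'EAST')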

8 Low-latency organizations
Baseline: SA precedes ST (no speculation).
Predict or speculate the arbiter's decisions: SA is decoupled from ST, and the trick is that crossbar control does not come from the switch allocator; when the prediction is wrong, all the tasks are replayed (same as the baseline).
Do the tasks in different phases (circuit switching): arbitration and routing in the setup phase, then contention-free ST in the transmit phase.
Bypass switches: reduce latency when certain criteria hold; when bypass is not enabled, same as the baseline.
[Pipeline diagrams: baseline LRC|SA|ST|LT; prediction/speculation with SA in parallel with ST; circuit-switched setup (LRC, SA) followed by transmit (ST, LT).]

9 Prediction-based ST: ST in parallel with SA
Target: a switch using LRC. Prediction criterion: because of XY routing in a 2D mesh, it is likely that a packet coming from the east (if any) will go to the west. During idle time the predictor pre-sets the I/O connection of the West output port through the crossbar multiplexer; at runtime the prediction accuracy is verified, and a mis-prediction falls back to the arbiters.
[Figure: per-output arbiters (East, West, South, ...) with an input packet from East pre-connected to the West output.]
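A minimal sketch of such a straight-through predictor follows; the prediction table and function names are illustrative assumptions:

    # "Straight-through" predictor for a 2D-mesh switch: with XY routing, a
    # packet entering from EAST most often continues WEST, so during idle
    # cycles the predictor pre-sets that crossbar connection.

    STRAIGHT = {"EAST": "WEST", "WEST": "EAST",
                "NORTH": "SOUTH", "SOUTH": "NORTH"}   # assumed policy

    def predict_output(input_port):
        return STRAIGHT.get(input_port, "LOCAL")

    def on_flit_arrival(input_port, actual_output):
        if predict_output(input_port) == actual_output:
            return "hit: flit already crossed the pre-set crossbar path"
        return "miss: kill the dead flit and replay the tasks as baseline"

    print(on_flit_arrival("EAST", "WEST"))   # hit
    print(on_flit_arrival("EAST", "NORTH"))  # miss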

10 Speculative ST: ST in parallel with SA
Target: a switch using LRC. At the beginning of the cycle, requests are fed both to the allocator arbiters and to fast speculation logic; the crossbar mux control signals are set on-the-fly by the speculation logic. At the end of the cycle, the arbiters' computation results are compared with the outcome of the speculation logic; on a mis-speculation the speculatively selected input is masked, and the next step is a regular switch traversal.
[Figure: arbiter and speculation logic racing within one cycle over the East and Local requests.]

11 Prediction-based ST: Hit
Assumption: RC is a pipeline stage (no LRC), and the crossbar is reserved by the predictor. Idle state: output port X+ is selected and reserved for input X+. 1st cycle: the incoming flit is transferred to X+ without RC and SA, while RC is performed in parallel; the prediction turns out to be correct. Outcome: RC, SA and ST were effectively performed in parallel!
[Figure: predictor and arbiter driving the crossbar; input buffers X+, X-, Y+, Y-.]

12 Prediction-based ST: Miss
Idle state: output port X+ is selected and reserved. 1st cycle: the incoming flit is transferred to X+ without RC and SA, while RC is performed in parallel; the prediction turns out to be wrong (X- is correct), so the kill signal to X+ is asserted and the transferred flit becomes a dead flit. 2nd/3rd cycle: the dead flit is removed, and allocation moves on with SA. On a miss, the tasks are replayed as in the baseline case: RC has already been done, so SA follows.
[Figure: predictor and arbiter driving the crossbar; KILL asserted toward the X+ output.]
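A cycle-level sketch of the hit/miss behavior, under the same assumptions (the port names and log strings are illustrative):

    # Cycle 1: the incoming flit traverses the pre-reserved crossbar path
    # while RC runs in parallel. On a hit, RC/SA/ST effectively happened in
    # one cycle. On a miss, KILL invalidates the dead flit downstream and
    # the switch falls back to the baseline: RC is done, SA comes next.

    def traverse(predicted_port, real_port):
        log = [f"cycle 1: flit speculatively sent to {predicted_port}; "
               "RC runs in parallel"]
        if real_port == predicted_port:
            log.append("cycle 1: RC confirms the prediction -> hit, done")
        else:
            log.append(f"cycle 1: RC says {real_port} -> miss, "
                       f"assert KILL toward {predicted_port}")
            log.append("cycle 2+: dead flit removed; proceed with SA")
        return log

    for line in traverse("X+", "X-"):
        print(line)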

13 Speculative ST
Assumption: RC is already done (LRC). Speculation criterion: assume contention does not happen! If the assumption is correct, the flit is transferred directly to the output port without waiting for SA. In case of contention, SA was performed anyway, so ST proceeds accordingly afterwards; a cycle is wasted, and an abort event must be generated and managed.
[Waveform: flits A and B contending for port 0; in the contention cycle only SA completes, and the winner traverses on the following cycle.]

14 Efficient recovery from mis-speculation: XOR-based recovery
Assume contention never happens. If the assumption is correct, the flit is transferred directly to the output port. If not, bitwise-XOR all the competing flits and send the encoded result to the link; at the same time, arbitrate and mask (set to 0) the winning input, and repeat on the next cycle. In the case of contention, the encoded outputs are resolved at the receiver; the decoding can be done at the output port of the switch too.
[Waveform: without contention, A and B leave ports 0 and 1 directly; with contention, the link carries A and then B^A.]

15 XOR-based recovery
Works upon a simple XOR property: (A^B^C) ^ (B^C) = A. The receiver is always able to decode by XORing two sequential coded values. The scheme performs similarly to speculative switches, only head-flit collisions matter, and it maintains the previous router's arbitration order.
[Figure: coded flit buffer holding A^B^C, then B^C, then C, decoded back to A, B, C.]
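A minimal end-to-end sketch of the encode/decode rule; the flit values and the winner-selection order (first pending flit wins each cycle) are illustrative assumptions:

    # XOR-based recovery: on contention the switch sends the XOR of all
    # competing flits, masks out the arbitration winner, and repeats.
    # The receiver recovers each flit by XORing two sequential coded values:
    # (A^B^C) ^ (B^C) = A, then (B^C) ^ C = B, and C arrives last.

    def sender_stream(flits):
        """Yield the coded value put on the link each cycle."""
        pending = list(flits)
        while pending:
            coded = 0
            for f in pending:
                coded ^= f
            yield coded
            pending.pop(0)   # model arbitration: one winner masked per cycle

    def receiver_decode(coded_values):
        """XOR each received value with the next one to recover the flits."""
        out = []
        for i, v in enumerate(coded_values):
            nxt = coded_values[i + 1] if i + 1 < len(coded_values) else 0
            out.append(v ^ nxt)
        return out

    A, B, C = 0xA5, 0x3C, 0x0F
    coded = list(sender_stream([A, B, C]))
    assert receiver_decode(coded) == [A, B, C]
    print([hex(v) for v in coded], "->",
          [hex(v) for v in receiver_decode(coded)])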

16 Bypassing intermediate nodes
Virtual bypassing paths: on the way from SRC to DST, bypassed intermediate switches are traversed in 1 cycle instead of 3. Switch bypassing criteria: frequently used paths, or packets continually moving along the same dimension. Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns, so they are not generic enough.

17 Speculation-free low-latency switches
Prediction and speculation drawbacks: on a mis-prediction (mis-speculation) the tasks must be replayed, so latency is not always saved; it depends on network conditions. Merged Switch Allocation and Traversal (SAT): latency is always saved, there is no speculation, and the delay of the merged SAT stage is smaller than SA and ST in series.

18 How can we increase throughput?
The green flow is blocked until the red one passes the switch, and the physical channel is left idle. The solution is to have separate buffers for each flow.

19 Virtual Channels
Virtual channels decouple output port allocation from next-hop buffer allocation. Contention is present on the output links (crossbar output ports) and on the input ports of the crossbar, and it is resolved by time-sharing these resources. Words of two packets are mixed on the same channel, but they travel on different virtual channels, so a virtual channel identifier must be in place. Separate buffers at the end of the link guarantee no blocking between the packets.

20 Virtual Channels Virtual-channel support does not mean extra links
Virtual channels act as extra street lanes: traffic on each lane is time-shared on a common physical channel. They provide dedicated buffer space for each virtual channel, decouple channels from buffers, and interleave flits from different packets. "The Swiss Army Knife for Interconnection Networks": VCs reduce head-of-line blocking, prevent deadlocks, and provide QoS, fault-tolerance, and more.

21 Datapath of a VC-based switch
Each VC has a separate buffer and separate flow control signals (credits/stalls). The radix of the crossbar may stay the same or may increase; a higher number of input ports increases the propagation delay through the crossbar. Input VCs may share a common input port of the crossbar, or alternatively the crossbar can be replicated. On each cycle at most one VC will receive a new word.
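A minimal data-structure sketch of such an input port; the class name, VC count, and buffer depth are illustrative assumptions:

    # VC-based input port: each VC has its own FIFO and its own credit
    # counter (simplified here to one counter per input VC, standing for the
    # downstream VC it maps to), but all VCs of the port share a single
    # crossbar input: at most one flit per cycle leaves the port.

    from collections import deque

    class InputPort:
        def __init__(self, num_vcs, credits_per_vc):
            self.vc_fifo = [deque() for _ in range(num_vcs)]
            self.credits = [credits_per_vc] * num_vcs

        def ready_vcs(self):
            """VCs that have a flit to send and downstream buffer space."""
            return [v for v, q in enumerate(self.vc_fifo)
                    if q and self.credits[v] > 0]

    port = InputPort(num_vcs=2, credits_per_vc=4)
    port.vc_fifo[0].append("head0")
    port.vc_fifo[1].append("head1")
    print(port.ready_vcs())   # [0, 1] -> the shared crossbar port picks one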

22 Per-packet operation of a VC-based switch
A switch connects input VCs to output VCs. Routing computation (RC) determines the output port; it can be shared among the VCs of an input port, and it may restrict the usable output VCs (e.g., based on message type or destination ID). An input VC should first allocate an output VC; this allocation is performed by the VC allocator (VA). RC and VA are done per packet on the head flits, and their decisions are inherited by the remaining flits of the packet.

23 Per-flit operation of a VC-based switch
Flits with an allocated output VC fight for an output port, which is allocated by the switch allocator. This entails two levels of arbitration, at the input port and at the output port: the VCs of the same input share a common input port of the crossbar, so each input has multiple requests (equal to the number of input VCs). The flit leaves the switch provided credits are available downstream; credits are counted per output VC. Unfortunate case: a VC and a port are allocated to an input VC, but no credits are available (see the credit-counter sketch below).
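A sketch of per-output-VC credit counting; the class name and buffer depth are illustrative assumptions:

    # Credit-based flow control: a flit may leave only if the downstream
    # buffer for its output VC has space. Credits are decremented on send
    # and returned when the downstream switch frees a slot.

    class CreditCounter:
        def __init__(self, depth):
            self.credits = depth

        def can_send(self):
            return self.credits > 0

        def on_send(self):
            assert self.credits > 0, "flit sent without credit"
            self.credits -= 1

        def on_credit_return(self):
            self.credits += 1

    cc = CreditCounter(depth=2)
    cc.on_send(); cc.on_send()
    print(cc.can_send())   # False: port may be granted, but the flit retries
    cc.on_credit_return()
    print(cc.can_send())   # True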

24 Switch allocation
All VCs at a given input port share one crossbar input port. The switch allocator matches ready-to-go flits with crossbar time slots, and the allocation is performed on a cycle-by-cycle basis: N×V requests (input VCs) compete for N resources (output ports). At most one flit at each input port can be granted, and at most one flit at each output port can be sampled. Other options need more crossbar ports (input/output speedup).

25 Switch allocation example
The requests can be drawn as a bipartite graph from inputs to outputs, or equivalently as a request matrix with one row per input and one column per output. There is one request (arc) for each input VC: in an example with 2 VCs per input, at most 2 arcs leave each input, i.e., at most 2 requests appear per row of the request matrix. The allocation is a matching problem: each grant must satisfy a request, each requester gets at most one grant, and each resource is granted at most once.

26 Separable allocation Matchings have at most one grant per row and per column Two phases of arbitration Column-wise and row-wise Perform in either order Arbiters in each stage are independent But the outcome of each one affects the quality of the overall match Fast and cheap Bad choices in first phase can prevent second stage from generating a good matching Multiple iterations required for a good match Iterative scheduling converges to a maximal schedule Unaffordable for high-speed networks Input-first: Output-first:

27 Input-first allocation
Implementation: input-first allocation performs the row-wise (per-input) arbitration first, followed by the per-output arbitration.

28 Output-first allocation
Implementation: output-first allocation performs the column-wise (per-output) arbitration first, followed by the per-input arbitration.

29 Centralized allocator
Wavefront allocation: pick an initial diagonal and grant all requests on it; the cells of one diagonal can never conflict, since they share no row or column. For each grant, delete the remaining requests in the same row and column, then repeat for the next diagonal.
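A minimal sketch of a wavefront allocator over an N x N request matrix; the fixed starting diagonal is a simplification (hardware rotates it for fairness):

    # Wavefront allocation: cells (i, (i+d) % n) of diagonal d never share a
    # row or column, so all requests on a diagonal are granted in parallel;
    # granting removes the rest of the row and column, and the wave moves on.

    def wavefront(req):
        n = len(req)
        row_free = [True] * n
        col_free = [True] * n
        grants = []
        for d in range(n):               # starting diagonal could rotate
            for i in range(n):
                j = (i + d) % n
                if req[i][j] and row_free[i] and col_free[j]:
                    grants.append((i, j))
                    row_free[i] = col_free[j] = False
        return grants

    req = [[1, 1, 0],
           [0, 1, 0],
           [1, 0, 1]]
    print(wavefront(req))   # [(0, 0), (1, 1), (2, 2)]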

30 Switch allocation for adaptive routing
Input VCs can request more than one output port: the set of Admissible Output Ports (AOP). This adds an extra selection step (not arbitration); the selection mostly tries to load-balance the traffic (see the sketch below).
Input-first allocation: for each input VC, select one request of the AOP; arbitrate locally per input and select one input VC; arbitrate globally per output and select one VC among all competing inputs.
Output-first allocation: send all requests of the AOP of each input VC to the outputs; arbitrate globally per output and grant one request; arbitrate locally per input and grant an input VC; for this input VC, select one out of the possibly multiple grants of its AOP set.
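A minimal sketch of the selection step in the input-first order; the load metric (buffer occupancy per output) and names are illustrative assumptions:

    # Selection, not arbitration: before the usual two arbitration levels,
    # pick ONE port from the AOP of each input VC, here the least-loaded
    # admissible output, to balance traffic.

    def select_from_aop(aop, output_load):
        return min(aop, key=lambda port: output_load[port])

    output_load = {0: 5, 1: 1, 2: 3}   # assumed occupancy per output
    print(select_from_aop({0, 1, 2}, output_load))   # 1
    # the selected port then enters the normal input/output arbitration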

31 VC allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels). Before a packet can proceed through the router, it needs to claim ownership of a VC buffer at the next router: the VC is acquired by the head flit and inherited by the body and tail flits. The VC allocator assigns waiting packets at the inputs to output VC buffers that are not currently in use; there are N×V inputs (input VCs) and N×V outputs (output VCs). Once assigned, the VC is used for the packet's entire duration in the switch.

32 VC allocation example
An input VC may require any of the VCs of a given output port; in the case of adaptive routing, an input VC may even require VCs from different output ports. There are no port constraints as in switch allocators. The allocation can be either separable (two arbitration steps) or centralized, with at most one grant per input VC and at most one grant per output VC.
[Figure: request and grant bipartite graphs between input VCs (In#0-In#2) and output VCs (Out#0-Out#2), numbered 1-5 on each side.]

33 Input – output VC mapping
Any-to-any flexibility in the VC allocator is unnecessary: the set of VCs can be partitioned to restrict the legal requests. Different use cases for VCs restrict the possible transitions: the message class never changes, and VCs within a packet class are functionally equivalent. These properties can be exploited to reduce VC allocator complexity!

34 Single cycle VA or pipelined organization
Head flits see a longer latency than body/tail flits: RC and VA decisions are taken for the head flit and inherited by the rest of the packet, while every flit fights for SA. Can we parallelize SA and VA?

35 The order of VC and switch allocation
VA first, SA follows: only packets with an allocated output VC fight for SA. Alternatively, VA and SA can be performed concurrently: speculate that waiting packets will successfully acquire a VC, and prioritize non-speculative requests over speculative ones for SA. Speculation concerns only the head flits; the body/tail flits always know their output VC.

VA    SA    Outcome
Win   Win   Everything OK! The flit leaves the switch.
Win   Lose  A VC has been allocated; retry SA (no longer speculative, so high priority next cycle).
Lose  Win   The output VC is unknown, so the allocated output port is wasted (grant lost, output idle); retry both VA and SA.

36 Free list of VCs per output
A VC can be assigned non-speculatively after SA: a free list of output VCs exists at each output, and the flit that was granted access to this output receives the first free VC before leaving the switch. If no VC is available, the output port allocation slot is missed and the flit retries switch allocation. This way, VCs are not unnecessarily occupied by flits that do not win SA. A further optimization is feasible: flits are allowed to compete in SA for a target port only if there are free VCs at that output port.
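A minimal sketch of such a per-output free list; the class name and VC count are illustrative assumptions:

    # Per-output free list of VCs: the VC is assigned *after* the flit wins
    # switch allocation, so VCs are never held by flits that lost SA. If the
    # list is empty the allocation slot is wasted and the flit retries.

    from collections import deque

    class OutputVCFreeList:
        def __init__(self, num_vcs):
            self.free = deque(range(num_vcs))

        def try_assign(self):
            """Called for the SA winner; returns a VC id or None (retry)."""
            return self.free.popleft() if self.free else None

        def release(self, vc):
            """Called when the packet's tail flit frees the VC downstream."""
            self.free.append(vc)

    fl = OutputVCFreeList(num_vcs=2)
    print(fl.try_assign())   # 0
    print(fl.try_assign())   # 1
    print(fl.try_assign())   # None -> slot missed, flit retries SA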

