NoC Switch: Basic Design Principles & Intra-Switch Performance Optimization
Tunis, December 2015
Instructor: Davide Bertozzi
Email: davide.bertozzi@unife.it
Acknowledgement Many slides have been taken or adapted from Prof. Giorgos Dimitrakopoulos Electrical and Computer Engineering Democritus University of Thrace (DUTH), Greece NoCS 2012 Tutorial: Switch design: A unified view of microarchitecture and circuits
Switch Building Blocks
Wormhole switch operation The operations can fit in the same cycle or they can be pipelined. Extra registers are needed in the control path. Body/tail flits inherit the decisions taken by the head flit, yet they cannot bypass RC and SA; there is simply nothing for them to do in those stages. Operation latency is an issue! For single-cycle switches, the head flit is in any case the one that determines the critical timing path; body/tail flits would have slack.
Look-ahead routing Routing computation is based only on packet’s destination Can be performed in switch A and used in switch B Look-ahead routing computation (LRC) Does it really need to be a separate pipeline stage?
Look-ahead routing optimization The LRC can be performed in parallel to SA LRC should be completed before the ST stage in the same switch The head flit needs to embed the output port requests for the next switch before leaving
Look-ahead routing details The head flit of each packet carries the output port requests for the next switch together with the destination address
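The look-ahead idea above can be made concrete with a short sketch. This is an illustrative model (function and port names are ours, not from the slides) assuming dimension-order XY routing in a 2D mesh: switch A computes not only its own output port but also the port the packet will request at the next switch B, and the head flit carries that request along with the destination.

```python
# Hypothetical sketch of look-ahead routing computation (LRC) under
# XY dimension-order routing in a 2D mesh. Names are illustrative.

def xy_output_port(cur, dst):
    """Output port a packet takes at node `cur` toward `dst` (XY routing)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return 'E' if dx > cx else 'W'   # route the X dimension first
    if dy != cy:
        return 'N' if dy > cy else 'S'
    return 'LOCAL'                       # arrived at destination

def next_node(cur, port):
    x, y = cur
    return {'E': (x + 1, y), 'W': (x - 1, y),
            'N': (x, y + 1), 'S': (x, y - 1), 'LOCAL': (x, y)}[port]

def lookahead_route(cur, dst):
    """Compute at switch A both the local port and the port the packet
    will request at the downstream switch B, so RC at B can overlap
    with SA instead of adding a pipeline stage."""
    port_here = xy_output_port(cur, dst)
    downstream = next_node(cur, port_here)
    return port_here, xy_output_port(downstream, dst)
```

Since the computation depends only on the packet's destination and the current position, it can run one hop ahead, exactly as the slides describe.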
Low-latency organizations
Baseline: SA precedes ST, no speculation (LRC | SA | ST | LT).
SA decoupled from ST: predict or speculate the arbiter's decisions; the trick is that crossbar control does not come from the switch allocator. When the prediction is wrong, replay all the tasks, same as baseline (LRC | SA+ST | LT).
Do in different phases: circuit switching, with arbitration and routing at the setup phase (LRC | SA) and contention-free ST at the transmit phase (ST | LT).
Bypass switches: reduce latency under certain criteria; when bypass is not enabled, same as baseline.
Prediction: ST in parallel with SA
Prediction criterion: it is likely that a packet coming from the east (if any) will go to the west, because of XY routing in a 2D mesh. During idle time, the predictor pre-sets the input-to-output connection of the West output port through the crossbar multiplexer. At runtime the prediction accuracy is verified; a mis-prediction falls back to the output arbiters.
Speculation: ST in parallel with SA
At the beginning of the cycle, requests are fed both to the allocator arbiters and to fast speculation logic; crossbar mux control signals are set on-the-fly by the speculation logic. At the end of the cycle, the arbiters' computation results are compared with the outcome of the speculation logic; on mis-speculation the transfer is masked and replayed.
Prediction-based ST: Hit
Assumption: RC is a pipeline stage (no LRC).
Idle state: output port X+ is selected and reserved for input X+; the crossbar is reserved.
1st cycle: the incoming flit is transferred to X+ without RC and SA; in parallel, RC is performed. The prediction was correct!
Outcome: SA, ST and RC were actually performed in parallel!
Prediction-based ST: Miss
Idle state: output port X+ is selected and reserved.
1st cycle: the incoming flit is transferred to X+ without RC and SA; in parallel, RC is performed. The prediction is wrong (X- is correct), so the kill signal to X+ is asserted and a dead flit results.
2nd/3rd cycle: the dead flit is removed; move on with SA. On a miss, tasks are replayed as in the baseline case (at least RC was already done, so now move on to SA).
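The hit/miss behavior of the two slides above can be summarized in a tiny behavioral model. This is only an illustrative sketch (the cycle counts follow the slides' narrative: one cycle on a hit, roughly three when the dead flit must be killed and SA replayed; all names are ours):

```python
# Toy model of prediction-based switch traversal: the flit is forwarded
# on the pre-reserved port immediately, while RC runs in parallel and
# either confirms the prediction or kills the speculatively sent flit.

def predicted_traversal(reserved_port, correct_port):
    if reserved_port == correct_port:
        # Hit: RC, SA and ST effectively overlap in a single cycle.
        return {'hit': True, 'kill': False, 'latency_cycles': 1}
    # Miss: the dead flit on reserved_port is killed, and the switch
    # replays the remaining tasks (SA, then ST) as in the baseline.
    return {'hit': False, 'kill': True, 'latency_cycles': 3}
```

The model makes the trade-off explicit: prediction pays off only when the hit rate is high enough that the saved cycles outweigh the replay penalty.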
Speculative ST
Assumption: RC is already done (LRC).
Speculation criterion: assume contention doesn't happen! If correct, the flit is transferred directly to the output port without waiting for SA. In case of contention, only SA was done in that cycle; move on with ST accordingly. The costs are a wasted cycle in the event of contention, plus the generation and management of an abort event.
Efficient recovery from mis-speculation: XOR-based recovery
Assume contention never happens. If correct, the flit is transferred directly to the output port. If not, bitwise-XOR all the competing flits and send the encoded result on the link; at the same time, arbitrate and mask (set to 0) the winning input, then repeat on the next cycle. In the case of contention, the encoded outputs are resolved at the receiver (this can also be done at the output port of the switch).
XOR-based recovery
Works upon a simple XOR property: (A^B^C) ^ (B^C) = A. The receiver is always able to decode by XORing two sequential values. Performs similarly to speculative switches: only head-flit collisions matter, and the previous router's arbitration order is maintained.
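The decoding property is easy to demonstrate end-to-end. In this minimal sketch (flit values are arbitrary byte strings of our choosing), three flits A, B, C collide; the link carries A^B^C, then B^C (A was masked after winning arbitration), then C, and the receiver recovers each flit by XORing consecutive link values:

```python
# Demonstration of the XOR-recovery property: when flits collide, the
# link carries the XOR of the losers each cycle, and XORing two
# sequential link values yields the flit that was masked in between.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

A, B, C = b'flitA', b'flitB', b'flitC'

# Cycle 1: all three collide -> A^B^C; A wins and is masked.
# Cycle 2: B and C collide   -> B^C;   B wins and is masked.
# Cycle 3: only C remains    -> C.
link = [xor(xor(A, B), C), xor(B, C), C]

decoded = []
for i, word in enumerate(link):
    nxt = link[i + 1] if i + 1 < len(link) else bytes(len(word))
    decoded.append(xor(word, nxt))   # (A^B^C)^(B^C) = A, (B^C)^C = B, C^0 = C

assert decoded == [A, B, C]
```

Note that the decode order equals the arbitration order at the sender, which is why the previous router's arbitration order is preserved.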
Bypassing intermediate nodes
Virtual bypassing paths: bypassed intermediate switches on the SRC-to-DST path cost 1 cycle instead of 3.
Switch bypassing criteria: frequently used paths; packets continually moving along the same dimension.
Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns, so they are not generic enough.
Speculation-free low-latency switches
Prediction and speculation drawbacks: on a mis-prediction (or mis-speculation) the tasks must be replayed, so latency is not always saved; it depends on network conditions.
Merged Switch Allocation and Traversal (SAT): latency is always saved, with no speculation, and the delay of SAT is smaller than that of SA and ST in series.
How can we increase throughput? The green flow is blocked until the red one passes the switch, leaving the physical channel idle. The solution is to have separate buffers for each flow.
Virtual Channels Decouple output port allocation from next-hop buffer allocation Contention present on: Output links (crossbar output port) Input port of the crossbar Contention is resolved by time sharing the resources Mixing words of two packets on the same channel The words are on different virtual channels A virtual channel identifier should be in place Separate buffers at the end of the link guarantee no blocking between the packets
Virtual Channels Virtual-channel support does not mean extra links They act as extra street lanes Traffic on each lane is time shared on a common channel Provide dedicated buffer space for each virtual channel Decouple channels from buffers Interleave flits from different packets “The Swiss Army Knife for Interconnection Networks” Reduce head-of-line blocking Prevent deadlocks Provide QoS, fault-tolerance, …
Datapath of a VC-based switch Separate buffer for each VC Separate flow control signals (credits/stalls) for each VC The radix of the crossbar may stay the same (or may increase) A higher number of input ports increases propagation delay through the crossbar Input VCs may share a common input port of the crossbar Alternatively, crossbars can be replicated On each cycle at most one VC will receive a new word
Per-packet operation of a VC-based switch A switch connects input VCs to output VCs Routing computation (RC) determines the output port Can be shared among VCs of an input port May restrict the usable output VCs (e.g., based on msg type or dst ID) An input VC should first allocate an output VC Allocation is performed by a VC allocator (VA) RC and VA are done per packet on the head flits and inherited by the remaining flits of the packet
Per-flit operation of a VC-based switch Flits with an allocated output VC fight for an output port Output port allocated by switch allocator This entails 2 levels of arbitration At input port At output port The VCs of the same input share a common input port of the crossbar Each input has multiple requests (equal to the #input VCs) The flit leaves the switch provided credits are available downstream Credits are counted per output VC Unfortunate case: VC & port are allocated to an input VC, but no credits available
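The per-output-VC credit bookkeeping mentioned above can be sketched in a few lines. This is an illustrative model (class and method names are ours): sending a flit consumes one credit for that output VC, and a credit returned by the downstream switch restores it; a flit with an allocated VC and port must still retry if the counter is at zero.

```python
# Illustrative per-output-VC credit counter: a flit may leave the switch
# only if the downstream buffer of its output VC has space.

class CreditCounter:
    def __init__(self, num_vcs, buffer_depth):
        # One counter per output VC, initialized to the downstream depth.
        self.credits = [buffer_depth] * num_vcs

    def can_send(self, vc):
        return self.credits[vc] > 0

    def send(self, vc):
        # The "unfortunate case" from the slide: port and VC allocated,
        # but no credits -> the flit must retry switch allocation.
        assert self.can_send(vc), "no credits: flit must retry SA"
        self.credits[vc] -= 1

    def credit_return(self, vc):
        # Downstream freed a buffer slot for this VC.
        self.credits[vc] += 1
```

Credits are tracked per output VC precisely because the downstream buffers are private per VC, so one congested VC cannot consume another VC's space.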
Switch allocation All VCs at a given input port share one crossbar input port Switch allocator matches ready-to-go flits with crossbar time slots Allocation performed on a cycle-by-cycle basis N×V requests (input VCs), N resources (output ports) At most one flit at each input port can be granted At most one flit at each output port can be sampled Other options need more crossbar ports (input-output speedup)
Switch allocation example
Example with 2 VCs per input: the requests form a bipartite graph from inputs to outputs, or equivalently a request matrix with one request (arc) per input VC. With 2 VCs per input, at most 2 arcs leave each input, i.e., at most 2 requests per row of the request matrix.
The allocation is a matching problem: each grant must satisfy a request, each requester gets at most one grant, and each resource is granted at most once.
Separable allocation
Matchings have at most one grant per row and per column. Two phases of arbitration, column-wise and row-wise, performed in either order (input-first or output-first). Arbiters in each stage are independent, but the outcome of each one affects the quality of the overall match. Fast and cheap, but bad choices in the first phase can prevent the second stage from generating a good matching. Multiple iterations are required for a good match: iterative scheduling converges to a maximal schedule, which is unaffordable for high-speed networks.
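A single iteration of a separable input-first allocator can be sketched as follows. This is an illustrative model (function names, round-robin pointers, and the request-matrix encoding are our assumptions): stage 1 lets each input row pick one requested output, stage 2 lets each output column pick one of the inputs that chose it.

```python
# Sketch of a separable input-first switch allocator (one iteration)
# built from simple round-robin arbiters. req[i][o] is True when some
# VC at input i requests output o.

def rr_arbiter(requests, last):
    """Round-robin arbiter: grant the first requester after index `last`."""
    n = len(requests)
    for k in range(1, n + 1):
        i = (last + k) % n
        if requests[i]:
            return i
    return None

def separable_input_first(req, in_ptr, out_ptr):
    n_in, n_out = len(req), len(req[0])
    # Stage 1 (row-wise): each input picks one of its requested outputs.
    picked = [rr_arbiter(req[i], in_ptr[i]) for i in range(n_in)]
    # Stage 2 (column-wise): each output picks one input that chose it.
    grants = {}
    for o in range(n_out):
        contenders = [picked[i] == o for i in range(n_in)]
        winner = rr_arbiter(contenders, out_ptr[o])
        if winner is not None:
            grants[winner] = o
    return grants   # {input: output}, at most one grant per row and column
```

The sketch also exposes the weakness named on the slide: if stage 1 makes a bad pick, stage 2 may leave an output idle even though a better matching existed, which is why multiple iterations (or a centralized allocator) improve quality.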
Input-first allocation: implementation (row-wise arbitration first)
Output-first allocation: implementation (column-wise arbitration first)
Centralized allocator: wavefront allocation Pick an initial diagonal and grant all requests on that diagonal; they can never conflict. For each grant, delete the requests in the same row and column, then repeat for the next diagonal.
Switch allocation for adaptive routing Input VCs can request more than one output port Called the set of Admissible Output Ports (AOP) This adds an extra selection step (not arbitration) Selection mostly tries to load-balance the traffic Input-first allocation For each input VC select one request of the AOP Arbitrate locally per input and select one input VC Arbitrate globally per output and select one VC from all competing inputs Output-first allocation Send all requests of the AOP of each input VC to the outputs Arbitrate globally per output and grant one request Arbitrate locally per input and grant an input VC For this input VC select one out of the possibly multiple grants of the AOP set
VC allocation Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels) Before packets can proceed through the router, they need to claim ownership of a VC buffer at the next router The VC is acquired by the head flit and inherited by body & tail flits The VC allocator assigns waiting packets at the inputs to output VC buffers that are not currently in use N×V inputs (input VCs), N×V outputs (output VCs) Once assigned, the VC is used for the entire packet’s duration in the switch
VC allocation example
An input VC may require any of the VCs of a given output port; in the case of adaptive routing, an input VC may require VCs from different output ports. There are no port constraints as in switch allocators. Allocation can be either separable (2 arbitration steps) or centralized: at most one grant per input VC, and at most one grant per output VC.
Input – output VC mapping Any-to-any flexibility in VC allocator is unnecessary Partition set of VCs to restrict legal requests Different use cases for VCs restrict possible transitions: Message class never changes VCs within a packet class are functionally equivalent Can take advantage of these properties to reduce VC allocator complexity!
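The restriction argued for above can be expressed as a request mask. In this illustrative sketch (the two message classes and the VC-to-class map are assumptions of ours, not from the slides), an input VC may only request output VCs of its own class, since the message class never changes and VCs within a class are functionally equivalent:

```python
# Illustrative partitioned VC mapping: restricting legal requests to the
# input VC's own message class shrinks the VC allocator's request matrix.

# Assumed example: 4 VCs split into a 'req' class and a 'resp' class.
VC_CLASS = {0: 'req', 1: 'req', 2: 'resp', 3: 'resp'}

def legal_output_vcs(input_vc):
    """Output VCs that an input VC is allowed to request."""
    cls = VC_CLASS[input_vc]
    return [vc for vc, c in VC_CLASS.items() if c == cls]
```

With the mask in place, the any-to-any N×V by N×V allocator collapses into independent, smaller allocators, one per message class.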
Single-cycle VA or pipelined organization Head flits see longer latency than body/tail flits RC and VA decisions are taken for head flits and inherited by the rest of the packet Every flit fights for SA Can we parallelize SA and VA?
The order of VC and switch allocation
VA first, SA follows: only packets with an allocated output VC fight for SA.
VA and SA can be performed concurrently: speculate that waiting packets will successfully acquire a VC, and prioritize non-speculative requests over speculative ones for SA. Speculation holds only for head flits (body/tail flits always know their output VC).
Outcomes for a head flit:
Win VA, win SA: everything OK!! Leave the switch.
Win VA, lose SA: a VC is allocated; retry SA (no longer speculative, so high priority next cycle).
Lose VA, win SA: the flit does not know its output VC, so the allocated output port goes unused (grant lost, output idle); retry both VA and SA.
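The priority rule for concurrent VA/SA can be sketched per output. This is only an illustrative model (names and the fixed-priority tie-break are ours): non-speculative requests, i.e., flits that already hold an output VC, always beat speculative ones, and a speculative grant is only useful if VA also succeeds in the same cycle.

```python
# Illustrative per-output grant logic when VA and SA run concurrently:
# non-speculative SA requests outrank speculative ones.

def grant_output(nonspec_reqs, spec_reqs):
    """Return (winning input, is_speculative) for one output port.

    nonspec_reqs / spec_reqs: lists of requesting input IDs. Selection
    within a class uses fixed priority here; real designs use RR arbiters.
    """
    if nonspec_reqs:
        return nonspec_reqs[0], False    # non-speculative request wins
    if spec_reqs:
        # Speculative grant: valid only if this flit also wins VA this
        # cycle; otherwise the grant is lost and the output stays idle.
        return spec_reqs[0], True
    return None, False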
Free list of VCs per output
A VC can be assigned non-speculatively after SA: a free list of output VCs exists at each output, and the flit that was granted access to this output receives the first free VC before leaving the switch. If no VC is available, the output-port allocation slot is missed and the flit retries switch allocation. VCs are not unnecessarily occupied by flits that don’t win SA. A further optimization is feasible: flits are allowed to compete in SA for a target port only if there are empty VCs at that output port.