Allocator Implementations for Network-on-Chip Routers
Daniel U. Becker and William J. Dally
Concurrent VLSI Architecture Group, Stanford University

Overview
Allocators have major impact on router performance
  – Zero-load latency, throughput under load, cycle time
On-chip environment imposes stringent constraints
  – Cycle time, power, no iterative / multi-cycle allocators
Main Contributions:
  – RTL-based performance & cost evaluation of virtual channel and switch allocators for NoC routers
  – Sparse VC allocation scheme reduces delay, area & power
  – Pessimistic speculation scheme minimizes delay penalty
11/18/09, Allocator Implementations for NoC Routers

Separable Allocators
Implement allocation as two phases
  – Local arbitration at each input
  – Global arbitration at each output
Pros:
  – Straightforward implementation
  – Delay scales logarithmically
Cons:
  – Arbiters within each phase are independent
  – Bad choice in first phase can limit matching
[Figure: input-first and output-first arbitration between inputs and outputs]

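
The two-phase structure of a separable allocator can be sketched in a few lines of Python. This is a minimal behavioral model, not the RTL evaluated in the talk: it assumes fixed-priority arbiters (real routers typically use round-robin or matrix arbiters), and the names `fixed_priority_arbiter` and `separable_input_first` are hypothetical. The example request matrix illustrates the "bad choice in first phase" drawback: both inputs pick output 0 in the first phase, so only one grant is issued even though a matching of size two exists.

```python
def fixed_priority_arbiter(requests):
    """Return the index of the first asserted request, or None if there is none.

    Assumption: fixed priority is used only to keep the sketch short.
    """
    for index, asserted in enumerate(requests):
        if asserted:
            return index
    return None


def separable_input_first(req):
    """Input-first separable allocation.

    req[i][o] is True if input i requests output o.
    Returns a grant matrix of the same shape.
    """
    n_in = len(req)
    n_out = len(req[0])

    # Phase 1: each input independently picks one of the outputs it requests.
    choice = [fixed_priority_arbiter(row) for row in req]

    # Phase 2: each output independently picks one of the inputs that chose it.
    grant = [[False] * n_out for _ in range(n_in)]
    for o in range(n_out):
        contenders = [choice[i] == o for i in range(n_in)]
        winner = fixed_priority_arbiter(contenders)
        if winner is not None:
            grant[winner][o] = True
    return grant


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
requests = [[True, True],
            [True, False]]
grants = separable_input_first(requests)
num_grants = sum(sum(row) for row in grants)
```

Here `num_grants` is 1: input 0 wins output 0 and input 1 is left idle, although granting input 0 to output 1 and input 1 to output 0 would have served both.
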
Wavefront Allocator [Tamir’93]
Consider inputs and outputs together
  – Grant requests on diagonal, kill conflicts
  – Repeat for other diagonals
Pros:
  – Tends to generate better matchings
  – Tiled design facilitates full-custom implementation
Cons:
  – Delay scales linearly
  – Original design has (false) combinational loops
[Figure: request matrix, inputs vs. outputs]

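
The diagonal sweep described above can be modeled sequentially as follows. This is a software sketch under stated assumptions, not the tiled combinational circuit: it assumes a square request matrix, uses wrapped diagonals (cells with `(i + o) mod n` equal to the diagonal index), and takes a hypothetical `start` parameter for the priority diagonal. Because each wrapped diagonal touches every row and column exactly once, grants within a diagonal never conflict with each other.

```python
def wavefront(req, start=0):
    """Sequential model of wavefront allocation on a square n x n request matrix.

    req[i][o] is True if input i requests output o.
    start selects the diagonal that is considered first (an assumption of
    this sketch; it stands in for the rotating priority diagonal).
    """
    n = len(req)
    grant = [[False] * n for _ in range(n)]
    row_free = [True] * n   # input i has not been granted yet
    col_free = [True] * n   # output o has not been granted yet

    for step in range(n):
        diag = (start + step) % n
        for i in range(n):
            o = (diag - i) % n  # cell (i, o) lies on wrapped diagonal `diag`
            if req[i][o] and row_free[i] and col_free[o]:
                grant[i][o] = True
                # "Kill" all remaining requests in this row and column.
                row_free[i] = False
                col_free[o] = False
    return grant


requests = [[True, True],
            [True, False]]
grants_from_diag0 = wavefront(requests, start=0)
grants_from_diag1 = wavefront(requests, start=1)
```

With `start=0` the sweep grants only input 0 to output 0; with `start=1` it grants input 0 to output 1 and input 1 to output 0. The result is always a maximal matching (no further grant can be added), but which maximal matching is found depends on the starting diagonal.
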
Evaluation Methodology
Analytical models useful for developing intuition
But becoming increasingly inaccurate
  – Wire delay impact, synthesized vs. full-custom logic, …
Use two-pronged evaluation approach:
  – Delay & cost via detailed RTL-based evaluation
      Synthesized using Synopsys Design Compiler in topo mode
      Commercial 45nm low-power, worst case
  – Network-level performance via simulation
      Cycle-oriented interconnection network simulator
      64-node networks: 2D mesh & 2D flattened butterfly
      Request-reply traffic, synthetic traffic patterns

Virtual Channel Allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
Before packets can proceed through router, need to claim ownership of VC buffer at next router
VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
  – P×V inputs (input VCs), P×V outputs (output VCs)
  – Once assigned, VC is used for entire packet’s duration

Sparse VC Allocation (1)
VCs are used for variety of purposes:
  – Deadlock avoidance
      Break cyclic dependencies
      Routing deadlock (within network)
      Protocol deadlock (at network boundary)
  – Flow control
      Decouple buffers and channels to avoid head-of-line blocking
Idea: Partition set of VCs to restrict legal requests
  – Significantly reduces VC allocator logic complexity
  – Delay/area/power savings of up to 41%/90%/83%

Sparse VC Allocation (2)
[Figure: legal requests from input VCs (IVC) to output VCs (OVC) for three VC partitionings; message classes REQ and REP, resource classes NM and MIN; per-input-VC labels P×8, P×4 and P×2 requests]
  – 8 VCs: 64 requests
  – 2×4 VCs: 32 requests
  – 2×2×2 VCs: 24 requests

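
The request totals on this slide (64, 32, 24) can be reproduced by counting legal (input VC, output VC) pairs for a single output port. The sketch below is illustrative only and rests on two assumptions that are not spelled out on the slide: a packet never changes message class (REQ vs. REP), and it may keep its resource class or move from the non-minimal class (index 0 here) to the minimal class (index 1), but never back. The function name `count_legal_requests` and its parameters are hypothetical.

```python
from itertools import product


def count_legal_requests(message_classes, resource_classes, vcs_per_class):
    """Count legal (input VC, output VC) pairs at a single output port.

    Each VC is identified by (message class, resource class, index in class).
    Assumed legality rule: same message class, and the resource class may
    stay the same or advance to a higher index, never go back.
    """
    vcs = list(product(range(message_classes),
                       range(resource_classes),
                       range(vcs_per_class)))
    count = 0
    for (m_in, r_in, _), (m_out, r_out, _) in product(vcs, vcs):
        if m_in == m_out and r_out >= r_in:
            count += 1
    return count


unpartitioned = count_legal_requests(1, 1, 8)   # 8 VCs
by_message = count_legal_requests(2, 1, 4)      # 2x4 VCs
by_both = count_legal_requests(2, 2, 2)         # 2x2x2 VCs
```

Under these assumptions the three calls yield 64, 32 and 24, matching the slide. Every pair that is ruled out is a request the allocator no longer has to arbitrate, which is where the logic savings come from.
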
VC Allocator Performance
[Figure: FBfly, 2×2×2 VCs]

VC Allocator Delay
[Figure]

VC Allocator Cost
[Figure]

Switch Allocation
Flits require crossbar access to traverse router
VCs at each input port share crossbar input
Switch allocator generates crossbar schedule
  – Allocation performed on cycle-by-cycle basis
  – P×V inputs (input VCs), P outputs (output ports)
  – At most one VC per input can be granted in each cycle
Speculative allocation reduces zero-load latency
  – Start switch allocation before VC allocation completes

Pessimistic Speculation (1)
Conventional approach:
  – Separate allocators for spec. and non-spec. requests
  – Non-spec. grants mask conflicting spec. grants
  – Conflict detection is on critical path
At low load, most requests are granted
Idea: Assume all requests will be granted
  – Mask spec. grants with non-spec. requests
  – Overlap conflict detection and allocation
  – Sacrifice speculation accuracy for lower delay
  – But preserve zero-load latency improvement

Pessimistic Speculation (2)
[Figure: block diagram with non-spec. requests, spec. requests, non-spec. allocator, spec. allocator, conflict detection, mask, non-spec. grants and spec. grants]

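
The difference between the conventional and the pessimistic masking rule can be shown with a small set-based model. This is a sketch under assumptions, not the evaluated implementation: requests and grants are modeled as sets of (input port, output port) pairs, a conflict is taken to mean sharing an input or an output port, and the helper name `mask_speculative` is hypothetical. The only difference between the two schemes here is which set does the masking: non-speculative grants (known only after allocation) or non-speculative requests (known before allocation, so the check can overlap with it).

```python
def mask_speculative(spec_grants, nonspec):
    """Drop speculative grants that share an input or output port with `nonspec`.

    spec_grants and nonspec are sets of (input_port, output_port) pairs.
    Pass non-speculative *grants* for the conventional rule, or
    non-speculative *requests* for the pessimistic rule.
    """
    busy_inputs = {i for i, _ in nonspec}
    busy_outputs = {o for _, o in nonspec}
    return {(i, o) for (i, o) in spec_grants
            if i not in busy_inputs and o not in busy_outputs}


# Inputs 0 and 1 both issue non-speculative requests for output 0;
# the non-speculative allocator grants input 0, so input 1 loses.
nonspec_requests = {(0, 0), (1, 0)}
nonspec_grants = {(0, 0)}
# The speculative allocator has granted input 1 access to output 1.
spec_grants = {(1, 1)}

conventional = mask_speculative(spec_grants, nonspec_grants)
pessimistic = mask_speculative(spec_grants, nonspec_requests)

# With no non-speculative traffic, both rules keep every speculative grant.
idle_conventional = mask_speculative(spec_grants, set())
idle_pessimistic = mask_speculative(spec_grants, set())
```

In the loaded case the conventional rule keeps the speculative grant `(1, 1)`, while the pessimistic rule discards it because input 1 had a non-speculative request that was assumed to succeed. In the idle case both rules behave identically, which is why the zero-load latency benefit is preserved.
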
Switch Allocator Performance (1)
[Figure: Mesh, 2×1×1 VCs]

Switch Allocator Performance (2)
[Figure: FBfly, 2×2×4 VCs; annotation: >20%]

Switch Allocator Delay
[Figure]

Switch Allocator Cost
[Figure]

Speculation Performance (1)
[Figure: Mesh, 2×1×1 VCs]

Speculation Performance (2)
[Figure: FBfly, 2×2×4 VCs]

Speculation Implementation
[Figure]

Conclusions
Network-level performance is largely insensitive to VC allocator implementation
  – Light effective load facilitates near-ideal matchings
Sparse VC allocation can greatly reduce delay & cost
  – Partition set of VCs based on functionality
  – Restrict possible requests allocator must handle
For switch allocation, wavefront allocator produces better matchings but increases delay & cost
  – Difference increases with number of ports, VCs
Pessimistic speculation reduces switch allocator delay
  – Trade for some performance degradation near saturation