
1 Elastic-Buffer Flow-Control for On-Chip Networks
George Michelogiannakis, James Balfour, William J. Dally. Computer Systems Laboratory, Stanford University

2 Introduction
Elastic-buffer (EB) flow control uses the channels themselves as distributed FIFOs, so input buffers at routers are not needed.
Compared to VC routers, EB networks can provide 12% more throughput per unit power at equal zero-load latency, and reduce router cycle time by 18%.

3 Outline
Building elastic-buffered channels (by using what is already there)
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation

4 The Idea
Use the network channels as distributed FIFOs, and use that storage instead of input buffers at routers, removing the input buffers' area and power costs. (Figure: a pipelined channel, and the same channel used as a FIFO.)
Removing the buffers (and hence the VCs) lowers the utilization of the channels, but we make up for that by increasing the datapath width. The result is a more power-efficient network: the power reduction is traded for datapath width.

5 Building an Elastic Buffer
To build an EB in a pipelined channel implemented with master-slave flip-flops (FFs), use the master and slave latches as independent storage slots by driving their enables independently. (Figure: a master-slave FF and the resulting two-slot elastic buffer.)

6 How Elastic Buffer Channels Work
Elastic buffers use a ready/valid handshake between adjacent EBs.
Ready: the downstream EB has at least one free storage slot.
Valid: the upstream EB is non-empty (driving valid data).
A flit advances when both ready and valid are asserted at a clock edge. (Figure: a cycle-by-cycle example of flits advancing through the channel.) A behavioral sketch of this handshake follows.
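The following is a minimal behavioral sketch (not from the paper) of that handshake, modeling each EB as a two-slot FIFO; the class and function names are illustrative only.

```python
# Minimal behavioral model of a chain of two-slot elastic buffers (EBs).
# Hypothetical sketch for illustration; names and structure are not from the paper.
from collections import deque

class ElasticBuffer:
    def __init__(self, slots=2):        # two slots: the master and slave latches
        self.slots = slots
        self.fifo = deque()

    def ready(self):                     # at least one free storage slot
        return len(self.fifo) < self.slots

    def valid(self):                     # non-empty: driving valid data
        return len(self.fifo) > 0

def clock_edge(chain, inject=None):
    """Advance flits by one cycle: a flit moves when the upstream EB is valid
    and the downstream EB is ready. Transfers are evaluated downstream-first so
    a slot freed this cycle can be refilled in the same cycle; this is a
    simplification of the real hardware timing."""
    for i in reversed(range(len(chain) - 1)):
        up, down = chain[i], chain[i + 1]
        if up.valid() and down.ready():
            down.fifo.append(up.fifo.popleft())
    if inject is not None and chain[0].ready():
        chain[0].fifo.append(inject)

# Example: push a four-flit packet (head, body, body, tail) through a 4-stage channel.
channel = [ElasticBuffer() for _ in range(4)]
for cycle, flit in enumerate(["H", "B", "B", "T", None, None, None]):
    clock_edge(channel, flit)
    print(cycle, [list(eb.fifo) for eb in channel])
```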

7 Control Logic Area Overhead
Control logic is implemented as a four-state FSM with 10 gates (using Gray-coded states) and 2 FFs. The cost is amortized over the channel width: for example, the control logic increases the area of a 64-bit channel by 5% (assuming 1 FF costs 8 gates of area; the overhead is 11% if 1 FF is counted as 2 gates). The minimum wire delay in our technology is 150 ps/mm, and the master latch enable signal is partly generated in the second half of the previous clock cycle. A quick gate-count check of the overhead figures follows.
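As a sanity check on those percentages, here is the arithmetic under the stated assumptions (10 gates plus 2 state FFs of control per EB, and one FF per bit for the 64-bit data register). This is a back-of-the-envelope reconstruction, not the authors' area model.

```python
# Rough gate-count check of the quoted control-logic overheads.
# Assumptions: control = 10 gates + 2 FFs; the 64-bit data register costs 1 FF per bit.
def overhead(ff_gates):
    control = 10 + 2 * ff_gates        # FSM gates plus the 2 state FFs
    data = 64 * ff_gates               # 64-bit channel register
    return control / data

print(f"{overhead(8):.0%}")   # 1 FF = 8 gates of area  -> ~5%
print(f"{overhead(2):.0%}")   # 1 FF = 2 gates of area  -> ~11%
```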

8 Outline
Building elastic-buffered channels
Router microarchitecture (use EB flow control through the router)
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation

9 Use EB Flow-Control Through the Router
(Figure: a VC input-buffered router next to the EB router. In the VC router, the line in the input buffer separates the VCs.)
EB routers have no input buffers or virtual channels: the input buffer is replaced by an input EB, and the intermediate register is kept for pipelining only.
There are no VC or switch allocators; each output has its own arbiter instead. Switch arbitration is done one cycle in advance, without credits, so the output-port EB has three slots to cover for that early arbitration.
The ready/valid handshake facilitates flit movement inside the router, and look-ahead routing remains applicable to EB networks.

10 Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance (how to provide isolation without VCs)
Load-sensing for adaptive routing
Evaluation

11 Deadlock Avoidance: Duplicate Channels
With no input buffers there are no virtual channels, so deadlock must be handled differently. Three types of deadlock are possible; the first two, protocol deadlock and cyclic flit dependencies in the network, are solved by duplicate physical channels (the third, interleaving deadlock, is covered on the next slide).
Protocol deadlock: the destination may require a reply to be consumed before it can handle more requests, but the reply may be stuck behind a request in the FIFO channels.
Cyclic flit dependency: cycles must be prevented both within and across traffic classes, where the classes are defined by the duplicate channels.

12 Deadlock Avoidance: No Interleaving
Interleaving deadlock: new head flits require destination registers, occupied destination registers are freed only when tail flits arrive, and those tail flits cannot bypass the new head flits. Solution: disallow packet interleaving (a sketch follows).
Increasing the number of destination registers makes the deadlock harder to trigger but does not solve it. One could, for instance, bound the number of head flits an output may send without their tails, but simply disallowing interleaving solves the problem without degrading performance; it can even help with unequal packet lengths, starvation, and allocator inefficiencies. Interleaving deadlock also appears in wormhole networks.
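A minimal sketch of how interleaving can be disallowed, assuming per-output arbitration that holds a grant until the packet's tail flit has passed. The round-robin policy and all names here are assumptions for illustration, not the paper's RTL.

```python
# Per-output arbiter that grants an output at packet granularity, so flits of
# different packets are never interleaved on one output channel.
class OutputArbiter:
    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.last = 0          # round-robin pointer (round-robin is an assumption)
        self.owner = None      # input currently holding this output, if any

    def arbitrate(self, requests, head, tail):
        """requests[i]: input i has a flit destined for this output.
        head[i] / tail[i]: that flit is a head / tail flit.
        Returns the granted input index, or None."""
        if self.owner is not None:
            # The output is locked to one packet until its tail flit passes.
            grant = self.owner if requests[self.owner] else None
            if grant is not None and tail[grant]:
                self.owner = None              # packet finished, release the output
            return grant
        # No owner: pick the next requesting input that offers a head flit.
        for off in range(1, self.num_inputs + 1):
            i = (self.last + off) % self.num_inputs
            if requests[i] and head[i]:
                self.last = i
                if not tail[i]:                # single-flit packets release immediately
                    self.owner = i
                return i
        return None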

13 Duplicating Channels Between Routers
Duplicate channels with neckdown: small improvement (still one switch port), large cost.
Duplicate channels with duplicate switch ports: excessive cost, because switch cost grows quadratically with the number of ports.

14 Dividing Into Sub-Networks More Efficient
Dividing into sub-networks doubles the bandwidth at double the cost. However, when the datapath is narrowed to normalize for throughput or power, sub-networks become more beneficial, again because of the quadratic switch cost (a rough cost comparison follows).
This holds only up to a certain number of sub-networks; above that, the network interfaces become too expensive and control cost dominates. Within that range, dividing into sub-networks is the more beneficial choice.
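A back-of-the-envelope comparison under an assumed quadratic crossbar cost model (cost growing with the square of ports times datapath width). The numbers and the cost function are illustrative, not the paper's area model.

```python
# Why sub-networks beat duplicated switch ports under a quadratic crossbar cost.
def crossbar_cost(ports, width):
    return (ports * width) ** 2          # assumed cost model

P, W = 5, 64                             # hypothetical mesh router, 64-bit datapath
base        = crossbar_cost(P, W)        # single network
dup_ports   = crossbar_cost(2 * P, W)    # duplicate channels with duplicate switch ports
subnets     = 2 * crossbar_cost(P, W)    # two sub-networks, 2x total bandwidth
subnets_nrm = 2 * crossbar_cost(P, W // 2)   # two sub-networks narrowed to the original bandwidth

print(dup_ports / base)    # 4.0x the crossbar cost for 2x bandwidth
print(subnets / base)      # 2.0x the crossbar cost for 2x bandwidth
print(subnets_nrm / base)  # 0.5x the crossbar cost at the original bandwidth
```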

15 Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing (propose a load metric for EB networks)
Evaluation

16 Output Channel Occupancy Load Metric
Flit-buffered networks use the credit count as a load metric; EB networks instead measure the occupancy of a certain segment of the output channel (shown in red in the figure). The occupancy is decremented when flits leave that segment, and incremented by a packet's length when the routing decision is made, so packets see other routing decisions made in the same cycle.
We experimented with a few metrics. Without incrementing when the decision is made, all inputs regarded the same output as unloaded. It is important that flits still waiting for an output are counted in the metric: many inputs wanting to send to the same output then see each other, and therefore see the congestion. A sketch of this bookkeeping follows.
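A sketch of that bookkeeping, assuming a simple per-output counter; the names and structure are illustrative, not the authors' implementation.

```python
# Output-channel occupancy metric for adaptive routing in EB networks:
# charge a packet's full length as soon as a routing decision picks an output,
# and decrement as flits leave the measured channel segment, so competing
# inputs see each other's decisions immediately.
class OutputLoad:
    def __init__(self):
        self.occupancy = 0

    def on_route_decision(self, packet_length_flits):
        self.occupancy += packet_length_flits   # increment by the packet's length

    def on_flit_leaves_segment(self):
        self.occupancy -= 1                      # a flit left the measured segment

def pick_output(candidates, loads):
    """Adaptive choice among the permissible outputs: least occupancy first."""
    return min(candidates, key=lambda out: loads[out].occupancy)
```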

17 Outline
Building elastic-buffered channels
Router microarchitecture
Deadlock avoidance
Load-sensing for adaptive routing
Evaluation (compare throughput, power, area, latency, cycle time)

18 Evaluation Methodology
We used a modified version of booksim, with area/power estimations from a 65 nm library and input buffers modeled as SRAM cells. The VC baseline uses the throughput/power-optimal number of VCs and buffer depth; the EB network uses two sub-networks, request and reply.
Results are averaged over a set of 6 traffic patterns, with a constant packet size (512 bits), sweeping the channel width from 28 to 192 bits. Low-swing channels are modeled at 0.3 of the full-swing repeated-wire traversal power, and we assumed a 2 GHz clock, about 20 FO4.
We try to relate three factors (area, power, performance) and use the channel width to sweep them.
@trafficPatterns = ("uniform", "randperm", "shuffle", "bitcomp", "tornado", "neighbor");

19 Throughput-Power Gains in 2D Mesh
EB network improvement: at the same power, 10% increased throughput; at the same throughput, 12% reduced power.
How to read the Pareto curve: it is generated by sweeping the datapath width, with one width per sample point, and each design point pairs the maximum throughput for that width with the power at that throughput. The EB network trades its power reduction for an increased datapath width. A sketch of how such a frontier is extracted from the sweep follows.
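A sketch of extracting the Pareto frontier from such a width sweep; the sample numbers are made up and the helper function is an assumption, not the authors' scripts.

```python
# Keep the Pareto-optimal (power, throughput) design points from a channel-width sweep.
def pareto_frontier(points):
    """points: list of (power, throughput) design points, one per channel width.
    Keep points not dominated by any other point (no other point has
    lower-or-equal power and higher-or-equal throughput)."""
    frontier = []
    for power, thpt in sorted(points, key=lambda pt: (pt[0], -pt[1])):
        if not frontier or thpt > frontier[-1][1]:
            frontier.append((power, thpt))
    return frontier

# Hypothetical sweep: (power, saturation throughput) for increasing channel widths.
sweep = [(1.0, 0.30), (1.4, 0.45), (1.8, 0.52), (2.0, 0.50), (2.3, 0.55)]
print(pareto_frontier(sweep))   # the (2.0, 0.50) point is dominated and dropped
```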

20 Throughput-Area Gains in 2D Mesh
EB networks show a 2% improvement: at the same area, 2% more performance; at the same performance, 2% less area. This is less impressive than the power result because the SRAM input buffers are small, so removing them yields only small area gains, while widening the datapath increases the crossbar cost quadratically, and the crossbar is already the majority of the area.

21 Latency-Throughput in 2D Mesh
The latency-throughput curve for the mesh shows the same behaviour, and the zero-load latency of the EB and VC networks is equal.

22 Power Breakdown: No Input Buffer Power
VC buffer (VCB) power is 25% of the total. This is for a single physical mesh network. (Note: explain why an input-buffer read component still appears for the EB networks.)

23 Area Breakdown: No Input Buffer Area
This breakdown is for same-width routers, for a single 4x4 mesh. The EB networks show more area because they need more bandwidth (wires) to cover for the loss of buffers. Channel area is double that of full-swing channels. The routing algorithm (DOR) does not matter for area.

24 Router RTL Implementation
No buffers, VCs, allocators, or credits in the EB router. The VC router uses look-ahead routing and FF arrays for its buffers, with 2 VCs of 8 slots each. Both are 5x5 mesh routers with DOR and a 64-bit datapath, implemented in 45 nm low-power CMOS at the worst-case corner.

Aspect       VC router   EB router   Savings
Area (μm²)   63,515      14,730      77%
Clock (ns)   3.3         2.7         18%
Power (mW)   2.59        0.12        95%

The Pareto results do not take into account the reduced cycle time shown here. The EB router greatly reduces area and power; the only hit is a small decrease in peak throughput for equal-width datapaths. The power gains come from the FF buffers and the allocators, which are not modelled in the Pareto (booksim) area and power models, and no channels are modeled here.
Synopsys DC and Cadence Encounter were used, with 24% flit injection at each input. Look-ahead routing was used for the VC router but not for the EB router, and FFs were used for the VC buffers. Pins were placed to assume a realistic topology layout, and realistic I/O timing and load/driving constraints were assumed. The critical stage for both routers was the first pipeline stage. Overall: a 77% decrease in area, an 18% decrease in cycle time, and a 95% decrease in power.

25 Conclusions EB flow-control uses channels as distributed FIFOs
It removes input buffers from routers and uses duplicate physical channels instead of VCs. It increases throughput per unit power by up to 12% for low-swing channels (8% for full-swing); the gain depends on what fraction of the overall cost the input buffers constitute. It also reduces router cycle time by 18%. The flow-control choice depends on design parameters and priorities: area, power, latency, cycle time, etc.

26 Thanks for your attention
Questions?

