A Novel 3D Layer-Multiplexed On-Chip Network


1 A Novel 3D Layer-Multiplexed On-Chip Network
Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering University of California, San Diego

2 Networks-on-Chip
Chip multiprocessors (CMPs) are increasingly popular. 2D-mesh networks are often used as the on-chip fabric.
[Figure: the Tilera Tile64 and Intel 80-core chips, with I/O areas and a single tile marked; dimension labels 12.64 mm, 21.72 mm, 1.5 mm, 2.0 mm.]
Chip multiprocessors are becoming increasingly popular. With increasing core counts, the interconnect fabric connecting the processing elements within a chip has become an important component with a significant impact on both the performance and the power consumed by the chip. On-chip networks connect the processing elements within a single chip. They provide a modular and scalable communication fabric for multi-core and many-core architectures. 2D mesh networks are often used as the on-chip communication fabric.

3 3D Integrated Circuits
Two or more active device layers, connected by through-silicon vias, with short inter-layer distances. Benefits: reduced chip footprint; reduced wire delays; high inter-layer bandwidth; heterogeneous system integration.
[Figure: device layer 1 and device layer 2 connected by a through-silicon via.]
Another technology that is emerging quickly is 3D integrated circuits, which use vertical integration: instead of a single device layer in a 2D plane, multiple device layers are stacked on top of each other. This has several benefits. The chip footprint is reduced. Wire delays are reduced, because a smaller footprint shortens the horizontal global interconnects, and the vertical interconnects between components on different layers have very low delay because inter-layer distances are very small. Inter-layer bandwidth is high because the wires running through the layers can be densely packed in two dimensions. 3D technology also opens up opportunities for heterogeneous system integration.

4 Natural Progression: 3D Mesh for 3D CMPs
Since 3D technology is so promising, the natural way for the architecture community to take advantage of 3D integration is to extend 2D CMPs to three dimensions. The simplest way to extend a 2D mesh topology to a 3D layout is to add two extra ports for vertical communication to each router and rearrange the tiles in the form of a 3D mesh.
[Figure: a 2D mesh and a 3D mesh.]
What routing algorithms should be used for 3D mesh networks?

5 Outline
Oblivious routing on a 3D mesh
Layer-multiplexed 3D architecture
Evaluation

6 Oblivious Routing Objectives
Maximize throughput: distribute traffic evenly over the network links; maximize worst-case throughput, since traffic is application dependent. Minimize hop count: minimize routing delay between source and destination; reduce power.
Next, we take another look at the objectives of the routing algorithm. Its main task is to maximize throughput by distributing traffic evenly over all network links. Since we are concerned with routing algorithms for general-purpose processors, the application that will run on the processor is unknown, and so the traffic, which is application dependent, is also unknown. We therefore try to maximize worst-case throughput, that is, the throughput under the most adversarial traffic. A second objective, which was part of the constraints in my original problem statement, is to minimize the number of intermediate router hops between the source and the destination. This has a two-fold benefit: it reduces delay, and it reduces power, since each intermediate router hop consumes power.

7 Routing Algorithms for 3D Mesh Networks
Valiant routing (VAL): optimal worst-case throughput, poor latency. Dimension-ordered routing (DOR): minimal latency, poor worst-case throughput. O1TURN routing: minimal latency, poor worst-case throughput. Ideal routing algorithm: minimal latency, maximum worst-case throughput.
[Figure: average hop count (normalized to minimal) versus worst-case throughput (fraction of network capacity, ticks at 0.25 and 0.5), with points for VAL (hop count 2), IDEAL (hop count 1), DOR, and O1TURN.]
In this graph, we have the two routing objectives along the X and Y axes, and we plot current routing algorithms on this 2D plane. First we see what an ideal routing algorithm should look like.

8 Randomized Partially-Minimal Routing (RPM)
[Figure: a packet path from the source to a random intermediate layer, across that layer, and on to the destination.]
Phase-1Z: route from the source to the intermediate layer. XY or YX routing on the intermediate layer. Phase-2Z: route from the intermediate layer to the destination.
Ensuring that the projected traffic in each 2D plane is admissible requires only a very simple step: load-balance equally across all 2D layers.

9 Main Idea
Load-balance uniformly across the vertical layers. Two phases of vertical routing. Minimal XY/YX routing used on each layer.
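The three steps of RPM (Phase-1Z, XY or YX routing on the intermediate layer, Phase-2Z) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `rpm_route`, the `(x, y, z)` tuple encoding, and the fair coin flip between XY and YX order are all assumptions made for the example.

```python
import random

def rpm_route(src, dst, num_layers, rng=random):
    """Return the nodes visited by one packet under a simple RPM sketch.

    Nodes are (x, y, z) tuples. The name, the encoding, and the 50/50
    choice between XY and YX order are illustrative assumptions.
    """
    cur = list(src)
    path = [tuple(cur)]

    def walk(axis, target):
        # Move one hop at a time along a single axis until it matches target.
        step = 1 if target > cur[axis] else -1
        while cur[axis] != target:
            cur[axis] += step
            path.append(tuple(cur))

    mid_layer = rng.randrange(num_layers)   # random intermediate layer
    walk(2, mid_layer)                      # Phase-1Z: source to intermediate layer
    order = (0, 1) if rng.random() < 0.5 else (1, 0)
    for axis in order:                      # minimal XY or YX routing on that layer
        walk(axis, dst[axis])
    walk(2, dst[2])                         # Phase-2Z: intermediate layer to destination
    return path

if __name__ == "__main__":
    for node in rpm_route((0, 0, 0), (2, 3, 0), num_layers=4):
        print(node)
```

Because the intermediate layer is drawn uniformly, each 2D layer carries an equal share of the horizontal traffic over time, which is the load-balancing step the slide describes.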

10 Routing Algorithms for 3D Mesh Networks
Randomized Partially-Minimal routing (RPM): near-optimal worst-case throughput, low latency.
[Figure: average hop count (normalized to minimal) versus worst-case throughput (fraction of network capacity, ticks at 0.25 and 0.5), with points for VAL (hop count 2), RPM (hop count 1.1), IDEAL (hop count 1), DOR, and O1TURN.]
In this graph, we have the two routing objectives along the X and Y axes, and we plot current routing algorithms on this 2D plane. First we see what an ideal routing algorithm should look like.

11 RPM has Near-optimal Worst-case Throughput
RPM is optimal for even radix and within 1/k^2 of optimal for odd radix.

12 Performance of RPM: Average-case Throughput
The message we would like to convey here is that RPM is the best known oblivious routing algorithm for 3D mesh networks

13 Outline
Oblivious routing on a 3D mesh
Layer-multiplexed (LM) 3D architecture
Evaluation
Now we will shift our focus from the routing algorithm to an optimized 3D network architecture that implements RPM routing effectively.

14 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distance between adjacent tiles on a 2D plane (~1500 μm). Vertical interconnects implemented using through-silicon vias (TSVs) have very low delay.
[Figure: a 50 μm TSV compared with the 1500 μm distance between adjacent tiles.]

15 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distance between adjacent tiles on a 2D plane (~1500 μm). Vertical wires using through-silicon vias (TSVs) have very low delay. Vertical bandwidth is abundant, as TSVs can be densely packed in 2D with a small via pitch (~4 μm).
[Figure: TSVs at a 4 μm pitch.]

16 Unique Features of 3D ICs
Inter-layer distances are very small (~50 μm), an order of magnitude lower than the distance between adjacent tiles on a 2D plane (~1500 μm). Vertical wires using through-silicon vias (TSVs) have very low delay. Vertical wiring is abundant, as TSVs can be packed in 2D with a small via pitch (~4 μm). The number of device layers is likely to remain small (4-5 layers) due to thermal and manufacturing issues.

17 RPM on a 3D Mesh
[Figure: a packet path from the source to a random intermediate layer, across that layer, and on to the destination.]
Phase-1Z: route from the source to the intermediate layer. XY or YX routing on the intermediate layer. Phase-2Z: route from the intermediate layer to the destination.
Now let's take a look at how RPM routes a packet on a conventional 3D mesh. Let's say that the source and destination nodes are on the bottom layer and that the intermediate layer, chosen at random, is the top layer.

18 Proposed Layer-Multiplexed Architecture
[Figure: processors P1 to P4 on four layers, with X, Y, and Z axes; Phase-1Z from the source to a random intermediate layer, XY or YX routing on that layer, and Phase-2Z from the intermediate layer to the destination.]
RPM routing adapted to the LM architecture: RPM-LM.
The layer-multiplexed architecture we propose replaces hop-by-hop vertical communication with demultiplexing and multiplexing structures. The first step in RPM routing is to demultiplex packets from each processor to a randomly chosen intermediate layer. Since the inter-layer distances are very short, we use a vertical demultiplexing switch that spans all 4 layers. This switch connects the processors on the 4 layers to the injection queues of the routers on all 4 layers, so a processor on any layer can inject a packet into a router on any other layer in a single hop. Once a packet is injected into a router on one of the layers, routing on that layer is the same as routing on a 2D mesh, so we need only 5-port routers for routing on a 2D plane instead of large 7-port routers. Finally, when a packet reaches the (X,Y) coordinates of the destination, it is ejected directly to a packet ejection multiplexer at the destination processor. These multiplexers multiplex the packets arriving from different layers at each processor. RPM on the LM architecture is called RPM-LM.

19 Power and Area Savings
5x5 crossbar in LM vs. 7x7 crossbar in a 3D mesh.
Layer-multiplexed architecture: decouple vertical routing from horizontal routing; restrict vertical routing to packet injection and packet ejection.
[Figure: a conventional 3D mesh beside the LM architecture, showing processors P1 to P4 with the packet injection demultiplexer and the packet ejection multiplexer.]
Now let's take a look at the advantages of the LM architecture over a conventional 3D mesh. First, the LM architecture uses only 5-port routers, compared to the 7-port routers used in a 3D mesh. Since power and area increase quadratically with the number of ports, 7-port routers are almost twice as expensive as 5-port routers in terms of power and area. In the LM architecture we decouple vertical routing from routing on each horizontal plane. By doing this, we are able to reorganize the 7-port routers of a 3D mesh into 5-port routers for routing on each 2D plane, integrated with demultiplexing and multiplexing structures for routing between layers. We also restrict vertical routing to the packet injection and packet ejection stages. Since RPM requires two phases of vertical routing, these two phases map naturally onto the packet injection and ejection stages.
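The "almost twice as expensive" figure follows directly from the quadratic scaling stated in the notes. A small check, assuming a cost model in which crossbar power and area are simply proportional to the square of the port count (the function name `crossbar_cost` is made up for this example, and buffers and other router components are ignored):

```python
def crossbar_cost(ports):
    # Assumed cost model: power and area proportional to ports squared.
    return ports ** 2

ratio = crossbar_cost(7) / crossbar_cost(5)
print(f"7-port vs. 5-port crossbar cost ratio: {ratio:.2f}")  # prints 1.96
```

The ratio 49/25 = 1.96 is a per-crossbar figure only. The overall savings reported later in the talk are smaller because the LM architecture adds demultiplexing and multiplexing structures.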

20 Single Hop Vertical Communication
Single-hop vertical routing is more power efficient than one-layer-per-hop routing. It leverages the short inter-layer distances in 3D ICs and better utilizes the available vertical bandwidth.
The next advantage of LM over a 3D mesh is that it allows single-hop vertical communication instead of one-layer-per-hop routing. As it turns out, when inter-layer distances are very short, single-hop routing using the demultiplexing and multiplexing stages is more power efficient than one-layer-per-hop routing, where power is dissipated at every intermediate router. Single-hop routing is made possible by the opportunities available in 3D ICs.
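To get a feel for how many vertical router hops one-layer-per-hop routing costs, the sketch below averages the vertical hop count of RPM on a conventional 3D mesh. It assumes that the source layer, the intermediate layer, and the destination layer are each uniformly distributed, which is an assumption of this example rather than something stated in the talk. It counts hops only and does not model power.

```python
from itertools import product

def mean_vertical_hops_3d_mesh(num_layers):
    """Average of |zs - zi| + |zi - zd| over all layer triples.

    Assumes source, intermediate, and destination layers are uniform
    and independent (an illustrative assumption).
    """
    layers = range(num_layers)
    total = sum(abs(zs - zi) + abs(zi - zd)
                for zs, zi, zd in product(layers, repeat=3))
    return total / num_layers ** 3

if __name__ == "__main__":
    print(mean_vertical_hops_3d_mesh(4))  # prints 2.5
```

Under these assumptions a packet on a 4-layer 3D mesh crosses 2.5 vertical links on average, each ending at a router, whereas in the LM architecture vertical movement is confined to one pass through the injection demultiplexer and one pass through the ejection multiplexer.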

21 Packet Injection Demultiplexer
[Figure: packet injection demultiplexer for processors P1 to P4, with route selection/load balancing, flit counters, VC allocation, and switch arbitration; outputs go to the injection ports of the routers on layers 1 to 4, and credits come in from those injection ports.]
The packet injection demultiplexer is basically a vertical switch, and its architecture is very similar to that of a 2D router. The route selection stage selects the intermediate layer to which a packet is routed. What is different here is that route selection must balance traffic uniformly across the 4 layers. It does so with the help of flit counters, which keep track of the number of flits each processor sends to each layer. The routing decision is based on the counter values, so that over time traffic is distributed uniformly across the layers. There is also some added flexibility: technically, any output port can be used, provided we load-balance over time, so route selection can also try to reduce contention at the output ports.
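One way the counter-based route selection could work is sketched below. The talk only says that the decision is based on the flit counters and that contention may also be considered, so the specific policy here (send to the least-loaded layer among those not currently contended) and the names `LayerSelector`, `select`, and `busy` are assumptions for illustration.

```python
class LayerSelector:
    """Per-processor route selection using flit counters (hypothetical sketch)."""

    def __init__(self, num_layers=4):
        # One counter per layer: flits this processor has sent to that layer.
        self.flits_sent = [0] * num_layers

    def select(self, packet_flits, busy=()):
        """Pick a layer for a packet and update its counter.

        `busy` lists output ports to avoid if possible; this models the
        freedom to reduce contention as long as the load balances over time.
        """
        all_layers = range(len(self.flits_sent))
        candidates = [l for l in all_layers if l not in busy] or list(all_layers)
        # Assumed policy: the candidate layer with the fewest flits so far.
        layer = min(candidates, key=lambda l: self.flits_sent[l])
        self.flits_sent[layer] += packet_flits
        return layer

if __name__ == "__main__":
    selector = LayerSelector(num_layers=4)
    print([selector.select(5) for _ in range(8)])
    print(selector.flits_sent)
```

With equal-sized packets and no contention, this policy cycles through the layers, so every counter grows at the same rate.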

22 Packet Ejection Multiplexer
[Figure: packet ejection multiplexers for processors P1 to P4. For P1, four queues (L1-P1 from the router on layer 1, and L2-P1, L3-P1, L4-P1 for packets from layers 2 to 4) feed an arbiter, with a VCID signal and credits out for each queue; the multiplexer for P4 has the same structure with queues L1-P4 to L4-P4.]
The job of the packet ejection multiplexers is to multiplex packets arriving from different layers to the destination processor. In a 4-layer architecture, each multiplexer has 4 queues to receive packets from the routers on each of the 4 layers. When a packet is ejected from a layer router, its destination field is used to direct it to the right ejection multiplexer. Each multiplexer can then independently choose which flit to forward to the destination processor every cycle.
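A simplified model of one ejection multiplexer is sketched below: one queue per layer and an arbiter that forwards at most one flit per cycle. The talk does not specify the arbitration policy, so round-robin is assumed here, and credits and virtual channels are left out. The class and method names are hypothetical.

```python
from collections import deque

class EjectionMux:
    """One packet ejection multiplexer with a queue per layer (sketch)."""

    def __init__(self, num_layers=4):
        self.queues = [deque() for _ in range(num_layers)]
        self.next_layer = 0  # round-robin pointer (assumed arbitration policy)

    def enqueue(self, layer, flit):
        # A router on `layer` ejects a flit destined for this processor.
        self.queues[layer].append(flit)

    def cycle(self):
        """Forward at most one flit to the processor; None if all queues are empty."""
        n = len(self.queues)
        for offset in range(n):
            layer = (self.next_layer + offset) % n
            if self.queues[layer]:
                self.next_layer = (layer + 1) % n
                return self.queues[layer].popleft()
        return None

if __name__ == "__main__":
    mux = EjectionMux()
    mux.enqueue(0, "a0")
    mux.enqueue(0, "a1")
    mux.enqueue(2, "c0")
    print([mux.cycle() for _ in range(4)])
```

Each processor has its own multiplexer, so the arbiters run independently, matching the note that each multiplexer chooses its own flit every cycle.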

23 Outline
Oblivious routing on a 3D mesh
Layer-multiplexed 3D architecture
Evaluation: Power and Area; Performance
Having described the details of the LM architecture, let's move on to its evaluation. First I will compare the power and area of the LM architecture with those of a 3D mesh.

24 Power and Area Evaluation
Orion 2.0 models used for router power and area estimation. 65 nm process at 1 V and 1 GHz. Buffers: 4 VCs/port and 5 flits/VC for routers; 5 flits/port for the packet injection demultiplexer; 5 flits/port for each packet ejection multiplexer.

25 Power Comparison
3D mesh: one 7-port router per tile.
LM: one packet injection demultiplexer for every 4 tiles; one packet ejection multiplexer per tile.

26 Power Evaluation 27% power reduction

27 Area Evaluation 26.5% area reduction

28 Outline
Oblivious routing on a 3D mesh
Layer-multiplexed 3D architecture
Evaluation: Power and Area; Performance

29 RPM on a 3D mesh vs. RPM-LM
Worst-case throughput: RPM-LM achieves the same (near-optimal) worst-case throughput as RPM.
Average-case throughput: RPM-LM outperforms RPM on the symmetric topology because the LM architecture offers higher vertical bandwidth than a 3D mesh.

30 Flit-Level Simulation
Ideal throughput evaluation assumes: ideal single-cycle routers; infinite buffers; no contention in switches and no flow control.
Flit-level simulation: PopNet network simulator; 5-stage router pipeline; credit-based flow control; 8 virtual channels, each 5 flits deep; multi-flit packets injected into the network (5 flits/packet).
Until now, the throughput results presented were for an ideal scenario in which we assumed ideal single-cycle routers, infinite buffers, and no contention in switches. To get a more realistic insight into the performance of RPM and RPM-LM, we need to account for the non-idealities present in practical implementations. For this purpose we used a cycle-accurate flit-level simulator.

31 Flit-Level Simulation (cont’d)
Network configurations simulated: 4 x 4 x 4 mesh; 8 x 8 x 4 mesh.
Four different traffic traces used:
Uniform traffic
Transpose traffic: (x,y,z) → (y,z,x)
Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
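The three deterministic patterns above can be written as destination functions. The sketch below assumes a cubic k x k x k mesh with coordinates in 0..k-1; how the patterns are adapted to the asymmetric 8 x 8 x 4 mesh is not stated on the slide and is not modeled here. The function names are chosen for this example.

```python
from itertools import product

def transpose(node, k):
    # (x, y, z) -> (y, z, x); k is unused but kept for a uniform signature.
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    # (x, y, z) -> (k-x-1, k-y-1, k-z-1)
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_worst_case(node, k):
    # (x, y, z) -> (k-z-1, k-y-1, k-x-1)
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def is_permutation(pattern, k):
    # A permutation pattern gives every node exactly one source.
    nodes = list(product(range(k), repeat=3))
    return sorted(pattern(n, k) for n in nodes) == nodes

if __name__ == "__main__":
    k = 4
    for pattern in (transpose, complement, dor_worst_case):
        print(pattern.__name__, pattern((1, 2, 3), k), is_permutation(pattern, k))
```

All three are permutation patterns: each node sends to exactly one destination and receives from exactly one source, which is what makes them useful as adversarial test cases alongside uniform traffic.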

32 Uniform Traffic 8x8x4 Mesh
Asymmetric 8x8x4 mesh

33 Transpose Traffic 8x8x4 Mesh

34 Worst-case Traffic for DOR 8x8x4 Mesh

35 Summary of Contributions
Proposed a 3D layer-multiplexed architecture, which is an optimization of a 3D mesh. It exploits the optimality of RPM together with the high vertical bandwidth enabled by 3D technology. The LM architecture consumes 27% less power and occupies 26% less area than a 3D mesh. RPM-LM has comparable (marginally better) performance to RPM on a 3D mesh.

36 Thank you!!

