1
Physical constraints (1/2)
One of the physical constraints on implementing an interconnection network is the available wiring area. The bisection width is the minimum number of wires that must be cut when the network is divided into two equal sets of nodes. Even the failure of a single link can destroy the deadlock-freedom properties. A second constraint is the number of I/Os available per router, referred to as the node size.
Distributed Processing Theory 2 (No. 9)
2
Physical constraints (2/2)
Network         Bisection width   Node size
k-ary n-cube    2Wk^(n-1)         2Wn
Binary n-cube   NW/2              nW
n-D mesh        Wk^(n-1)
Omega net.      NW                2tW
Channels are W bits wide; N is the number of nodes (N = 2^n for the binary n-cube); the Omega network uses t×t switches.
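The table entries can be evaluated for concrete parameters. The following is a minimal sketch, not part of the original slides: the function names `bisection_width` and `node_size` are hypothetical, and the binary n-cube bisection is taken as N·W/2 with N = 2^n, which is an assumption about how that table entry should be read. `node_size` covers only the rows for which the table lists a value.

```python
def bisection_width(network, W, k=None, n=None, N=None):
    """Bisection width in wires for the networks in the table (W = channel width in bits)."""
    if network == "k-ary n-cube":
        return 2 * W * k ** (n - 1)
    if network == "binary n-cube":
        return (2 ** n) * W // 2      # assumed reading: N*W/2 with N = 2^n nodes
    if network == "n-D mesh":
        return W * k ** (n - 1)
    if network == "omega":
        return N * W
    raise ValueError(f"unknown network: {network}")


def node_size(network, W, n=None, t=None):
    """Node size (I/Os per router) for the rows of the table that list one."""
    if network == "k-ary n-cube":
        return 2 * W * n
    if network == "binary n-cube":
        return n * W
    if network == "omega":
        return 2 * t * W              # t x t switches
    raise ValueError(f"no node size listed for: {network}")


if __name__ == "__main__":
    # Example: 8-ary 2-cube (8x8 torus) versus an 8x8 mesh, 16-bit channels.
    print(bisection_width("k-ary n-cube", W=16, k=8, n=2))
    print(bisection_width("n-D mesh", W=16, k=8, n=2))
```

With the same k, n and W, the torus has twice the bisection width of the mesh because of its wraparound channels.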
3
Bisection width
4
Hardware cost and speed model
Chien proposed a cost and speed model for wormhole routers in order to compare their complexity and performance [1993]. Using a canonical router model, he gave the gate count and delay of each router component. Intrarouter delay can be parametrized by the number of ports on the crossbar switch and the routing freedom, together with technology-dependent constants.
5
Canonical Router model
[Figure: canonical router model. Labels: input channels, output channels, injection channel, ejection channel, LC, switch, routing and arbitration.]
LC: Link Controller
6
Buffer partitioning
[Figure: router with virtual channel controllers. Labels: input channels, output channels, injection channel, ejection channel, LC, VC, switch, routing and arbitration.]
LC: Link Controller, VC: virtual channel controller
7
Alternative partitioning
[Figure: router with multiplexers around the switch. Labels: input channels, output channels, injection channel, ejection channel, LC, VC, mux, switch, routing and arbitration.]
LC: Link Controller, VC: virtual channel controller
8
Circular buffer
[Figure: circular buffer holding entries a to h, with head and tail pointers.]
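A circular buffer of this kind can be sketched in a few lines. This is an illustrative sketch only: the class name `CircularFlitBuffer` and its methods are hypothetical, and the slide does not specify how full and empty states are detected (a separate count is assumed here).

```python
class CircularFlitBuffer:
    """Fixed-size FIFO using head and tail indices that wrap around."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest stored flit
        self.tail = 0    # index where the next flit will be written
        self.count = 0   # number of flits currently stored (assumed bookkeeping)

    def push(self, flit):
        if self.count == len(self.slots):
            raise OverflowError("buffer full")
        self.slots[self.tail] = flit
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        flit = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)  # wrap around
        self.count -= 1
        return flit


if __name__ == "__main__":
    buf = CircularFlitBuffer(4)
    for flit in "abcd":
        buf.push(flit)
    print(buf.pop())   # oldest flit leaves first
    buf.push("e")      # reuses the slot freed at index 0
    print(buf.head, buf.tail)
```

No data is moved when a flit is removed; only the head index advances, which is why this layout suits hardware flit queues.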
9
A parametrization
Module             Parameter      Gate count   Delay
Crossbar           P (#ports)     O(P^2)       c0 + c1 × log P
Flow control       -              O(1)         c2
Address decoder    -              O(1)         c3
Routing decision   F (freedom)    O(F^2)       c4 + c5 × log F
Header selection   F (freedom)    O(log F)     c6 + c7 × log F
VC                 V (#VCs)       O(V)         c8 + c9 × log V
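The delay column can be evaluated once values for the constants are chosen. The sketch below is illustrative: the function name `module_delays` is hypothetical, the constants passed in are placeholders rather than values from the original model, and a base-2 logarithm is assumed since the slide does not state the base.

```python
import math


def module_delays(P, F, V, c):
    """Per-module delay from the table.

    P: number of crossbar ports, F: routing freedom, V: number of virtual
    channels, c: sequence of ten technology-dependent constants c0..c9
    (placeholder values, not taken from the source).
    """
    return {
        "crossbar": c[0] + c[1] * math.log2(P),
        "flow control": c[2],
        "address decoder": c[3],
        "routing decision": c[4] + c[5] * math.log2(F),
        "header selection": c[6] + c[7] * math.log2(F),
        "vc controller": c[8] + c[9] * math.log2(V),
    }


if __name__ == "__main__":
    constants = [0.4, 0.6, 2.2, 2.7, 4.7, 1.2, 1.9, 1.2, 1.2, 0.6]  # arbitrary
    for module, delay in module_delays(P=4, F=2, V=4, c=constants).items():
        print(f"{module}: {delay:.2f}")
```

Only the crossbar, routing decision, header selection and VC terms grow with the router's parameters; the flow control and address decoder delays are constant.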
10
Pipelined routing
Stages per flit (timing diagram plotted against cycle number):
Head flit     RC VA SA ST
Body flit 1   SA ST
Body flit 2   SA ST
Tail flit     SA ST
RC: Routing computation, VA: Virtual channel allocation, SA: Switch allocation, ST: Switch traversal
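The timing of such a pipeline can be tabulated with a short sketch. It assumes one stage per cycle, no stalls, and that each following flit performs switch allocation one cycle after the flit ahead of it; these timing assumptions and the function name `pipeline_schedule` are illustrative and not stated on the slide.

```python
def pipeline_schedule(num_body_flits=2):
    """Return {flit name: {stage: cycle}} for one packet without stalls."""
    # The head flit passes through all four stages in cycles 1 to 4.
    schedule = {"Head flit": {"RC": 1, "VA": 2, "SA": 3, "ST": 4}}
    followers = [f"Body flit {i + 1}" for i in range(num_body_flits)] + ["Tail flit"]
    # Body and tail flits skip RC and VA and trail the head by one cycle each.
    for offset, name in enumerate(followers, start=1):
        schedule[name] = {"SA": 3 + offset, "ST": 4 + offset}
    return schedule


if __name__ == "__main__":
    for flit, stages in pipeline_schedule().items():
        print(flit, stages)
```

Under these assumptions a packet with two body flits leaves the router in cycle 7, four cycles for the head plus one cycle for each following flit.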
11
Pipeline stalls
Pipeline stalls occur if a given pipeline stage cannot be completed in the current cycle. Stalls may occur at any stage, for example when no output port (virtual channel) is available or when the input buffer is empty. The latency of an interconnection network is directly related to pipeline depth.
12
Virtual-channel allocation stall
[Figure: pipeline timing diagram against cycle number, with rows labeled "Head flit 1", "Tail flit 2", "Body flit 1" and "Tail flit 1" and stages RC, VA, SA and ST.]
Head flit 1 is not able to allocate a virtual channel until cycle 5, so allocation is retried until then.
13
Physical channel (1/2)
Half-duplex channels have the disadvantage that both sides of the link must arbitrate for use of the link. Unidirectional channels double the average distance traveled by a message in tori. In bidirectional full-duplex channels, channel widths are reduced, i.e. approximately halved, because bandwidth is statically allocated to each direction.
14
Physical channel (2/2)
Lower dimensionality and communication traffic below network saturation tend to favor half-duplex channels. Cost considerations encourage low-cost packaging, which also favors half-duplex channels and the use of command/data encodings to reduce the overall pin count. Large systems with a higher number of dimensions (where wire delays dominate), higher clock speeds, communication-intensive applications and pipelined links favor full-duplex channels.
15
A bidirectional half-duplex channel
[Figure: two routers (both labeled R1) connected by a data line and a control line.]
Fair arbitration requires status information (ownership) to be transmitted across the channel to indicate that data is available for transmission.
16
A unidirectional channel
[Figure: two routers (both labeled R1) connected by a data line and a control line.]
Arbitration overhead is avoided, so links can generally be run faster.
17
A bidirectional full-duplex channel
[Figure: two routers (both labeled R1) connected by two pairs of data and control lines.]
When data is being transmitted in only one direction across the channel, 50% of the pin bandwidth is unused.
18
A demand-driven mutual exclusion ring
[Figure: ring of virtual channel arbiters (VC ARB) and link control blocks, with ready and ack signals.]
A signal representing the privilege to drive the physical channel circulates around the mutual exclusion ring.
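The circulating privilege can be modeled as a token that stops at the next arbiter with a pending request. The sketch below is a software analogy under assumptions: the slide does not give the arbitration details, so the round-robin order and the function name `next_owner` are hypothetical.

```python
def next_owner(token_pos, requests):
    """Pass the token around the ring, starting after its current position.

    token_pos: index of the arbiter that currently holds the privilege.
    requests:  list of booleans, True where a virtual channel has data ready.
    Returns the index of the next arbiter to drive the channel, or None if
    nobody is requesting (the token then stays where it is).
    """
    n = len(requests)
    for step in range(1, n + 1):
        idx = (token_pos + step) % n
        if requests[idx]:
            return idx
    return None


if __name__ == "__main__":
    requests = [True, False, True, False]
    owner = 0
    for _ in range(4):
        owner = next_owner(owner, requests)
        print("channel driven by arbiter", owner)
```

Because the search starts after the current holder, an arbiter gets the channel again only once every other requester has been passed, which keeps access fair.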
19
Available buffer space
Assume that the propagation delay is P ns per unit length and that the wire in the channel is L units long. When the receiver requests the sender to stop transmitting, there may be LPB phits in transit on a channel running at B Gphits/s. A further LPB phits may be placed on the channel while the flow control signal propagates to the sender. If the flow control operation latency is F ns, the sender places another FB phits on the channel. Thus the receiver must have at least 2LPB + FB phits of buffer space available.
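As a worked example of the expression 2LPB + FB, the following sketch evaluates it for sample values. The function name `required_buffer_phits` and the numbers used are illustrative assumptions, not figures from the slides.

```python
def required_buffer_phits(L, P, B, F):
    """Buffer space, in phits, needed to absorb data sent before the sender stops.

    L: wire length in units
    P: propagation delay in ns per unit length
    B: channel rate in Gphits/s (equivalently phits per ns)
    F: flow control operation latency in ns
    """
    in_flight = L * P * B            # phits already on the wire
    during_stop_signal = L * P * B   # phits sent while the stop signal travels back
    during_flow_control = F * B      # phits sent while flow control is processed
    return in_flight + during_stop_signal + during_flow_control


if __name__ == "__main__":
    # Example values: 10 units of wire, 0.5 ns per unit, 2 Gphits/s, 4 ns latency.
    print(required_buffer_phits(L=10, P=0.5, B=2, F=4))
```

The wire-dependent term is counted twice because the channel keeps filling for one propagation delay in each direction before the sender reacts.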
20
Simultaneous bidirectional signaling
It allows simultaneous signaling between two routers across a single signal line (full-duplex bidirectional communication). A logic 1 (0) is transmitted as a positive (negative) current. The signal on the line is the superposition of the two signals transmitted from both sides of the channel. Each transmitter generates a reference signal, which is subtracted from the superimposed signal to recover the received signal.
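The subtraction step can be illustrated with an idealized numeric model. This sketch assumes unit currents of +1 and -1, a noiseless line, and a reference equal to the router's own transmitted value; the function names are hypothetical.

```python
def encode(bit):
    """Logic 1 is sent as a positive current, logic 0 as a negative one."""
    return 1 if bit else -1


def line_signal(bit_a, bit_b):
    """The line carries the superposition of both transmitters' signals."""
    return encode(bit_a) + encode(bit_b)


def receive(line, own_bit):
    """Subtract the local reference (own transmitted signal) to get the remote bit."""
    remote = line - encode(own_bit)
    return 1 if remote > 0 else 0


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            line = line_signal(a, b)
            # Router A recovers b and router B recovers a from the same line value.
            print(a, b, line, receive(line, a), receive(line, b))
```

The line takes one of three levels (-2, 0, +2), and each side can resolve the ambiguous middle level only because it knows what it sent itself.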
21
Separate crossbar
[Figure: router with two crossbars. Labels: From node, To node, X+, X-, Y+, Y-, To Yin, Yin, AD, FC, crossbar.]
22
Planar-adaptive router
[Figure: planar-adaptive router. Labels: vc, mux, crossbar, L1, L2, L3, L4.]
23
Cray T3D router
[Figure: Cray T3D router. Labels: X+, X-, Y+, Y-, Z+, Z-, NI, xbar.]
24
Message processing datapath on T3D
[Figure: message processing datapath. Labels: data translation buffer, processor, addressing, messaging support, local memory, msg queue control, addressing and routing tag lookup, network interface, input buf., output buf.]
25
Intel Cavallino router
[Figure: Intel Cavallino router. Ports X+, X-, Y+, Y-, Z+ and Z- around a crossbar, each shown with virtual channels VC0 to VC3.]
26
SGI SPIDER chip
[Figure: SGI SPIDER chip. Labels: I/F, Link I/F, crossbar, VCs, Msg cntl.]
It is the first router to compute the route one step ahead.
27
Routing control for PCS
[Figure: routing control unit. Labels: Input VC, Routing header, Channel mappings, decode, History store, Decision unit, Inc/Dec banks, Modified header, Output VC.]
28
Pipelined circuit switching (PCS)
Routing protocols based on PCS are more complex to implement than those of wormhole-switched routers, because they must support backtracking. Backtracking must use history information: when a routing header is backtracking, its history mask must be retrieved from the history store.
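The role of the history store can be illustrated with a small path search. This is a simplified sketch under assumptions, not the protocol from the slides: the header tries output channels in a fixed order, each node keeps a bit mask of the channels already tried, and nodes already visited are not re-entered. The function name `pcs_route` and the data layout are hypothetical.

```python
def pcs_route(links, src, dst, faulty=frozenset()):
    """Search for a path from src to dst, backtracking around faulty links.

    links:  dict mapping node -> list of neighbor nodes (one per output channel)
    faulty: set of (node, neighbor) pairs that cannot be used
    Returns the list of nodes on the reserved path, or None if none is found.
    """
    history = {src: 0}   # history store: node -> mask of output channels tried
    path = [src]
    while path:
        node = path[-1]
        if node == dst:
            return path
        mask = history[node]             # retrieve the mask, also after a backtrack
        for ch, nxt in enumerate(links[node]):
            if mask & (1 << ch):
                continue                 # channel already tried from this node
            mask |= 1 << ch
            history[node] = mask         # record the attempt
            if (node, nxt) in faulty or nxt in history:
                continue                 # unusable link or node already visited
            history[nxt] = 0
            path.append(nxt)             # header advances
            break
        else:
            path.pop()                   # every channel tried: header backtracks
    return None


if __name__ == "__main__":
    # 2x2 mesh with nodes 0..3; the link from node 1 to node 3 is faulty.
    mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(pcs_route(mesh, 0, 3, faulty={(1, 3)}))
```

In the example the header first advances to node 1, finds no usable channel there, backtracks to node 0, and uses the mask stored for node 0 to choose the channel it has not yet tried.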
29
Buffered wormhole (IBM SP family)
[Figure: buffered wormhole router. Labels: FC, deserializer, serializer, FIFO, central queue, routing, bypass crossbar; datapath widths 8 and 64.]
Packets are buffered in the central queue when they fail to win access to the bypass crossbar (output port).