MESSAGE ROUTING SCHEMES IN A HYPERCUBE MACHINE S. Raghupathy, M. R. Leuze, and S. R. Schach Presented by: Syed Md. Shakir
What are Interconnection Networks and why do we need them? One way for processors to communicate data is to use a shared memory and shared variables. However, this is unrealistic for large numbers of processors. A more realistic assumption is that each processor has its own private memory and that data communication takes place using message passing via an interconnection network. The interconnection network plays a central role in determining the overall performance of a multicomputer system. If the network cannot provide adequate performance for a particular application, nodes will frequently be forced to wait for data to arrive.
Parallel Computers Large-scale parallel computers are potential candidates for providing very high computational power. These systems are usually organized as an ensemble of nodes, each with its own processor, local memory, and other supporting devices. The nodes are interconnected using a variety of topologies that can be classified into two broad categories: direct and indirect.
Direct Networks In direct networks, each node has a point-to-point or direct connection to some of the other nodes, called neighboring nodes; examples of direct network topologies include hypercube, mesh, and tree.
Indirect Networks In indirect networks, the nodes are connected to other nodes or a shared memory through one or more switching elements. Examples of indirect networks include crossbar, bus, and multistage interconnection networks. (Figure: multistage interconnection network)
Indirect Network: Crossbar
Communication Latency The communication latency of direct networks depends on several factors, including switching, routing, flow control, and topology. Several switching techniques have been proposed for direct networks. Wormhole switching has emerged as a popular technique and has been used in both commercial and experimental systems. Wormhole switching can be employed in both direct and indirect networks. It is widely used in contemporary multicomputers because of its low latency and its small buffer requirements at the nodes.
cont... The mesh is an asymmetrical topology in which the node degree depends on its location. Interprocessor communication performance depends on the location of source and destination. The torus and hypercube are symmetrical topologies in which the degree of a node is the same irrespective of its location in the network. Thus, unlike the mesh, all the nodes in tori and hypercubes are identical in connectivity.
Routing in Parallel Computers Parallel computers are modeled by directed graphs All interconnections between processors (nodes) occur in synchronous steps Each link can carry at most one unit message (packet) in one step During a step, a node can send at most one packet to each of its neighbors Each node is uniquely identified by a number between 1 and N
Switching Techniques In most multicomputer systems, a message enters the network from a source node and is switched or routed towards its destination through a series of intermediate nodes. Four types of switching techniques are usually used for this purpose: circuit switching, packet switching, virtual cut-through switching, and wormhole switching.
Circuit Switching In circuit switching, a dedicated path is established between the source and the destination before data transfer begins. Once the data transfer is initiated, the message is never blocked. Because the channels forming the path are reserved exclusively, buffering of data is not required. On the other hand, establishing the path incurs significant overhead, and during the data-transmission phase all channels along the path remain reserved for the entire duration of the message transfer. Circuit switching thus degrades performance and is no longer used in commercial multicomputer systems.
Packet Switching In packet switching, a message is divided into packets that are independently routed towards the destination. The destination address is encoded in the header of each packet. The entire packet is stored at every intermediate node and then forwarded to the next node in its path. The main advantage of packet switching is that a channel is occupied only while a packet is actually being transferred over it.
Packet Switching cont... Each packet carries its own routing information, so alternative paths can be selected upon encountering network congestion or faulty nodes. The major drawbacks of packet switching: since the packet is stored entirely at each intermediate node, the time to transmit a packet from source to destination is directly proportional to the number of hops in the path, and each intermediate node needs buffer space to hold at least one packet.
Virtual Cut Through In order to reduce the time spent storing packets at each node, Kermani and Kleinrock introduced a technique called virtual cut-through. With this technique, while being routed toward its destination, a message is stored at an intermediate node only if the next channel it requires is occupied by another packet. As a result, the distance between the source and destination has little effect on communication latency.
cont... In the extreme case, when a message encounters blocking at all the intermediate nodes, the virtual cut-through technique reduces to packet switching. The disadvantage of the virtual cut-through technique is its implementation cost: each node must provide sufficient buffer space for all the messages passing through it, and because multiple messages may be blocked at any node, a very large buffer space is required at each node. This constraint limits the use of the virtual cut-through technique.
Wormhole Switching Wormhole switching is a variant of the virtual cut-through technique that avoids the need for large buffer spaces. In wormhole switching, a packet is transmitted between the nodes in units of flits, the smallest units of a message on which flow control can be performed. The header flit(s) of a message contains all the necessary routing information and all the other flits contain the data elements. The flits of the message are transmitted through the network in a pipelined fashion.
cont... Since only the header flit(s) has the routing information, all the trailing flits follow the header flit(s) contiguously. Flits of two different messages cannot be interleaved at any intermediate node. Successive flits in a packet are pipelined asynchronously in hardware using a handshaking protocol. When the header flit is blocked, all the trailing flits remain in the buffers at the intermediate nodes along the path.
Wormhole Switching (Figure: message format and routing in wormhole switching — a message is divided into packets and packets into flits; H: Header Flit, D: Data Flit; panels (a) and (b).)
Advantages of Wormhole Switching The main advantage of wormhole switching derives from the pipelined message flow: transmission latency is nearly insensitive to the distance between the source and destination. Moreover, since the message moves flit by flit across the network, each node needs to store only one flit. Some implementations, however, require storage of multiple flits at each node to improve routing performance. The reduction of buffer requirements at each node has a major effect on the cost and size of multicomputer systems.
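To make the distance argument concrete, the following is a standard first-order latency comparison (a textbook model, not taken from this paper; L is the message length, L_f the flit size, B the channel bandwidth, and D the number of hops, all illustrative symbols):

```latex
% Store-and-forward packet switching: the whole packet is buffered at every hop
T_{\mathrm{SF}} \approx D \cdot \frac{L}{B}

% Wormhole (cut-through) switching: only the pipeline fill time depends on D
T_{\mathrm{WH}} \approx D \cdot \frac{L_f}{B} + \frac{L}{B}
```

Since a flit is much smaller than a message (L_f << L), the term that grows with D is small, which is why wormhole latency is nearly independent of distance in the absence of contention.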
Disadvantages of Wormhole Switching The main disadvantage of wormhole switching comes from the fact that only the header flit has the routing information. If the header flit cannot advance in the network due to resource contention, all the trailing flits are also blocked along the path and these blocked messages can block other messages. This chained blocking can also lead to deadlock where messages wait for each other in a cycle and hence no message can advance any further.
cont... Prevention of deadlock is one of the main issues in wormhole switching, and is usually accomplished by a suitable choice of routing function that selectively prohibits messages from taking all the available paths, thus preventing cycles in the network. Selection of a routing algorithm is thus a major issue in wormhole-switched networks.
Hypercube Network An n-dimensional hypercube network: Number of nodes: N = 2^n. Degree: n. The node i with address (i1, i2, …, in) ∈ {0, 1}^n and the node j with address (j1, j2, …, jn) ∈ {0, 1}^n are connected if the Hamming distance between (i1, i2, …, in) and (j1, j2, …, jn) is 1.
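As a side illustration (not part of the paper), two node addresses are hypercube neighbors exactly when their bitwise XOR has a single set bit. A minimal Python sketch:

```python
def is_neighbor(i, j):
    """Nodes i and j of a hypercube are adjacent iff their addresses
    differ in exactly one bit (Hamming distance 1)."""
    diff = i ^ j                                     # bits where the addresses differ
    return diff != 0 and (diff & (diff - 1)) == 0    # exactly one bit set

assert is_neighbor(0b010, 0b011)       # differ only in the lowest bit
assert not is_neighbor(0b010, 0b001)   # differ in two bits
```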
Hypercube Topology
4D Hypercube A k-dimensional hypercube is formed by combining two (k-1)-dimensional hypercubes and connecting corresponding nodes, i.e. hypercubes are recursive. Each node is connected to k other nodes, i.e. each node has degree k.
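A small Python sketch of this recursive construction (illustrative only; the edge-list representation is an assumption):

```python
def hypercube_edges(k):
    """Edge list of a k-dimensional hypercube, built recursively from
    two (k-1)-cubes whose corresponding nodes are joined."""
    if k == 0:
        return []                          # a 0-cube is a single node, no edges
    sub = hypercube_edges(k - 1)
    offset = 1 << (k - 1)                  # addresses of the second copy start here
    edges = list(sub)                                       # first (k-1)-cube
    edges += [(u + offset, v + offset) for (u, v) in sub]   # second (k-1)-cube
    edges += [(i, i + offset) for i in range(offset)]       # join corresponding nodes
    return edges

# A 3-cube has 8 nodes and 12 edges:
assert len(hypercube_edges(3)) == 12
```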
Static routing in Hypercube Given a source node Ns and a destination node Nd, the addresses of the 2^n processors can be represented using n bits. If the address of the current node differs from the address of Nd in bit position i, then the next node on the route from Ns to Nd is the node whose address is obtained from the current address by flipping bit i; that is to say, the message is routed in dimension i. The algorithm continues in this way until the message arrives at node Nd.
Static routing Algorithm: Given a destination address d and an intermediate node i: Compare the bits of d with the address of i from left to right. Identify the first bit position at which these two addresses differ. Route the packet to the neighbor n(i) such that i and n(i) differ only in this bit position.
Static Routing Algorithm Example: Source: (0, 0, 0, 0, 0, 0) Destination: (1, 0, 1, 0, 1, 1) Route: (0, 0, 0, 0, 0, 0) → (1, 0, 0, 0, 0, 0) → (1, 0, 1, 0, 0, 0) → (1, 0, 1, 0, 1, 0) → (1, 0, 1, 0, 1, 1)
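A minimal Python sketch of this static, dimension-order routing rule, reproducing the example above (the integer bit-level encoding of node addresses is an assumption made for illustration):

```python
def static_route(src, dst):
    """Return the list of node addresses visited when routing from src to dst
    in a hypercube, always correcting the leftmost differing bit."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst
        bit = diff.bit_length() - 1   # leftmost bit position where cur and dst differ
        cur ^= 1 << bit               # route in that dimension
        path.append(cur)
    return path

# Example from the slide: 000000 -> 100000 -> 101000 -> 101010 -> 101011
print([format(v, "06b") for v in static_route(0b000000, 0b101011)])
```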
Advantages and Disadvantages Advantages: No overhead for calculating new routes; the same CPU cycles can be used for other computational purposes. Disadvantage: Blocking is a common consequence.
Dynamic routing Dynamic routing allows every message to select the (locally) optimal route under the current circumstances. If a link is blocked, an attempt is made to pass the message through another link. This gives better utilization of the network. It uses only local knowledge.
Dynamic routing Allows a message to be routed from Ns to Nd along a path that depends on the current circumstances, i.e. the (locally) optimal route. The price is the overhead of implementing dynamic routing: at each node, calculations have to be performed to determine the next node to which the message should be routed, and links have to be tested to see which ones are free.
Advantages And Disadvantages Advantage: Blocking is not a major problem. Disadvantage: the overhead of implementing dynamic routing. At each node, calculations have to be performed to determine the next node to which the message should be routed, and links have to be tested to see which ones are free. The size of the overhead varies from hypercube to hypercube. In some machines, the additional work can be done in hardware in parallel with other operations; in other machines, it must be done in software, using machine cycles that could otherwise be used for productive computing.
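A rough Python sketch of the dynamic next-hop choice described above (illustrative only; the link_free predicate and the order in which dimensions are scanned are assumptions, not details from the paper):

```python
def dynamic_next_hop(cur, dst, n, link_free):
    """Try each dimension that moves the message closer to dst and return the
    first neighbor reachable over a free link, or None if all such links are busy."""
    for bit in reversed(range(n)):        # scan dimensions, here most significant first
        if (cur ^ dst) & (1 << bit):      # flipping this bit moves the message closer
            neighbor = cur ^ (1 << bit)
            if link_free(cur, neighbor):  # only use a link that is not occupied
                return neighbor
    return None                           # blocked: every useful link is in use
```

With static routing the message would simply block if its single predetermined link were busy; here it falls back to any other free link that still reduces the Hamming distance to the destination.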
Prioritization If a number of messages are waiting to use a link, one method of choosing which message to transmit is first-in, first-out (FIFO), the method used in commercial hypercubes. The paper also considers alternative prioritization schemes, such as LIFO and giving priority to the message with the maximum number of remaining hops.
Other Prioritization Schemes Since the processes form a DAG, each process can be assigned a sequence number such that every message is sent to a process with a higher sequence number than that of the process that generated it. The sequence number of the generating process can then be used to prioritize messages.
Message Format
The Prioritization Scheme
The Simulator A simulator was constructed to investigate the routing strategies. Each message carries a header that contains information such as the source and destination nodes, as well as information needed when the order of transmission of messages is determined by prioritization, such as the sequence number, time generated, arrival time at the current node, and number of hops that still have to be traversed.
Execution Cycle Of The Simulator The simulator has three phases: message generation, message ordering, and message routing.
Message Generation Phase In this phase each active process is checked to see if it has received all the messages it requires. If so, the messages it is to transmit are generated, and placed in the message buffer. The process then terminates. After all possible messages have been generated, the simulator enters the message ordering phase.
Message Ordering Phase After entering the message ordering phase the messages in each buffer are ordered according to the prioritization scheme currently being evaluated. In the case of equal priorities, ties are broken randomly. Finally, the message routing cycle commences.
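A sketch of what the ordering phase might look like in Python (the message fields and the particular priority keys are illustrative assumptions; the paper does not describe the simulator at this level of detail):

```python
import random

def order_buffer(buffer, scheme):
    """Order the messages waiting in one node's buffer according to the chosen
    prioritization scheme, breaking ties between equal priorities randomly."""
    keys = {
        "fifo":        lambda m: m["arrival_time"],     # earliest arrival first
        "lifo":        lambda m: -m["arrival_time"],    # latest arrival first
        "seq_number":  lambda m: m["sequence_number"],  # lowest sequence number first
        "fewest_hops": lambda m: m["hops_remaining"],   # fewest remaining hops first
    }
    key = keys[scheme]
    # a random secondary key breaks ties between messages of equal priority
    return sorted(buffer, key=lambda m: (key(m), random.random()))
```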
Message Routing Phase Each message is fetched from the message buffer and an attempt is made to transmit it to a neighboring node. If static routing is being used and the predetermined link is in use, then that particular message is blocked. When dynamic routing is used, an attempt is made to transmit the message over the first unused link that will move it closer to its destination.
Results Dynamic routing performs better than static routing, but the improvement factor varies depending on the prioritization scheme. At best, the improvement is by a factor of two. The best results occur when priority is given to messages with the lowest sequence number. Results almost as good are obtained when priority is given to messages with the fewest number of hops, either in the original message path or remaining to be traversed.
Results Continued... Messages with the lowest sequence numbers are essentially those transmitted earliest in the computation sequence. Giving priority to such messages speeds up the rate at which processes can begin transmitting, and hence speeds up the computation as a whole. Traffic congestion in the hypercube is decreased by giving priority to messages with the fewest number of hops, thereby allowing the longer messages to proceed with less blocking than would otherwise be the case. By giving priority to messages with fewer choices, the overall amount of blocking is decreased.
One Bidirectional Link Between Nodes
Two Unidirectional Links Between Nodes
Percentage Improvement When Two Unidirectional Lines Are Used
Observations From The Graphs Above Having two unidirectional links improves throughput over one bidirectional link. Note: the improvement depends on the prioritization scheme. The percentage improvement is rarely more than fifteen per cent, and is usually much smaller. This effect may be caused by the fact that the problem graph is a DAG, thereby imposing a directionality on the flow of messages.
Conclusions The throughput of a certain class of problems on a hypercube can be increased by up to a factor of two through the use of dynamic rather than static routing algorithms, and also by prioritizing the messages. It is likely that different prioritization schemes would yield improved throughput for other classes of problems.
Questions?