Pipelined Broadcast on Ethernet Switched Clusters Pitch Patarasuk, Ahmad Faraj, Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306
Broadcast communication (MPI_Bcast). [Figure: before the broadcast, one node holds the message ABCD; after it, each of n0, n1, n2, and n3 holds ABCD.] Let T(msize) = time to send a message of size msize. Then Broadcast(msize) >= T(msize).
Ethernet Switched Cluster. [Figure: cluster topology with switch labels.]
Problem statement: how to efficiently realize the broadcast operation with large message sizes on Ethernet switched clusters. Pipelined broadcast can achieve near-optimal results (T(msize) time for broadcasting a message of size msize). This requires finding a contention-free broadcast tree and finding a good segment size.
Traditional broadcast algorithms: linear tree, flat tree. [Figure: linear tree and flat tree.] Time = (P-1) x T(msize), where P is the number of nodes.
Binary tree, k-ary tree. Time (binary tree) = 2 x (log_2(P+1) - 1) x T(msize)
Binomial tree. Time = log_2(P) x T(msize)
Scatter/Allgather. [Figure: before, one node holds ABCD; after the scatter, each of n0, n1, n2, and n3 holds one of the pieces A, B, C, D; after the allgather, every node holds ABCD.] Time = 2 x T(msize)
Time complexity for large messages
Algorithm            Time
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log_2(P+1) - 1) x T(msize), approx. 2 x log_2(P) x T(msize)
Binomial tree        log_2(P) x T(msize)
Scatter/allgather    2 x T(msize)
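The cost formulas for the traditional algorithms can be evaluated side by side with a small Python sketch. It is illustrative only: `traditional_times` is a hypothetical helper name, and `t_msize` stands for T(msize) in whatever time unit is convenient.

```python
import math


def traditional_times(P, t_msize):
    """Evaluate the large-message cost formulas for P nodes.

    t_msize is T(msize), the time to send one message of size msize.
    """
    return {
        "linear tree": (P - 1) * t_msize,
        "flat tree": (P - 1) * t_msize,
        "binary tree": 2 * (math.log2(P + 1) - 1) * t_msize,
        "binomial tree": math.log2(P) * t_msize,
        "scatter/allgather": 2 * t_msize,
    }


if __name__ == "__main__":
    # Print the estimates for 16 nodes, in units of T(msize).
    for name, t in traditional_times(16, 1.0).items():
        print(f"{name:20s} {t:.2f} x T(msize)")
```

Only scatter/allgather is independent of P in this model; the other four grow with the number of nodes.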
Pipelined Broadcast Algorithm. Linear pipeline: 0 -> 1 -> 2 -> 3
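Performance of pipelined broadcast
Assume there is no network contention and that a message of size msize is broken into X segments of size msize/X.
H: tree height; D: the number of children per node.
Size of a pipeline stage: D x T(msize/X)
Total time: (X + H - 1) x (D x T(msize/X))
When X is much greater than H and the per-segment overhead is small, X x T(msize/X) is approximately T(msize), so the total time is approximately D x T(msize):
Linear tree: H = P, D = 1, total time approx. T(msize)
Binary tree: H = log_2(P), D = 2, total time approx. 2 x T(msize)
k-ary tree: H = log_k(P), D = k, total time approx. k x T(msize); in general not as efficient as the binary tree.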
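The total-time formula can be checked numerically. The sketch below is a minimal illustration under one added assumption that is not on the slide: sending a segment of size msize/X costs exactly T(msize)/X, i.e. there is no per-segment startup cost. `pipelined_time` is a hypothetical helper name.

```python
import math


def pipelined_time(t_msize, X, H, D):
    """Total time (X + H - 1) * D * T(msize / X).

    Assumption for this sketch: T(msize / X) = T(msize) / X.
    """
    t_segment = t_msize / X
    return (X + H - 1) * D * t_segment


if __name__ == "__main__":
    P = 16
    for X in (1, 10, 100, 1000):
        linear = pipelined_time(1.0, X, H=P, D=1)
        binary = pipelined_time(1.0, X, H=math.log2(P), D=2)
        print(f"X={X:5d}  linear={linear:.3f}  binary={binary:.3f}  (units of T(msize))")
```

As X grows, the linear pipeline tends toward T(msize) and the binary pipeline toward 2 x T(msize). With a real per-segment startup cost, X cannot be made arbitrarily large, which is why the segment size has to be chosen.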
Time complexity for large messages
Algorithm             Time
Pipelined (linear)    T(msize)
Pipelined (binary)    2 x T(msize)
Pipelined (k-ary)     k x T(msize)
Binomial tree         log_2(P) x T(msize)
Scatter/allgather     2 x T(msize)
Pipelined broadcast How to find a contention-free broadcast tree? How to select the best segment size?
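Example of network contention: binary tree. [Figure: machines n0, n1, n2, n3 and machines n4, n5, n6, n7 are attached to different switches.] There is link contention caused by communications (1 -> 4), (2 -> 5), (2 -> 6), and (3 -> 7).
Example of network contention: linear tree. [Figure: machines n0, n1, n4, n5 and machines n2, n3, n6, n7 are attached to different switches.] The linear tree 0 -> 1 -> 2 -> 3 -> … -> 7 has contention caused by communications (1 -> 2) and (5 -> 6).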
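The two contention examples can be checked mechanically. The following Python sketch is illustrative only and rests on several assumptions: the two switches are joined by a single link, links are full duplex (each direction is counted separately), and a node sends to its children one at a time, so the first hop out of a sender is never counted. The binary tree used here (children of node i are 2i+1 and 2i+2) is assumed because it matches the contending pairs listed above. `two_switch_topology`, `route`, and `contended_links` are hypothetical helper names.

```python
from collections import defaultdict, deque


def two_switch_topology(left, right):
    """Build an undirected graph: switch s0 -- switch s1, machines attached."""
    adj = defaultdict(set)

    def connect(a, b):
        adj[a].add(b)
        adj[b].add(a)

    connect("s0", "s1")
    for n in left:
        connect("s0", n)
    for n in right:
        connect("s1", n)
    return adj


def route(adj, src, dst):
    """Return the directed links on the unique path src -> dst (BFS)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    hops, node = [], dst
    while parent[node] is not None:
        hops.append((parent[node], node))
        node = parent[node]
    return hops[::-1]


def contended_links(adj, tree_edges):
    """Map each directed link used by more than one communication to its users."""
    users = defaultdict(list)
    for src, dst in tree_edges:
        # Skip the first hop: a sender serialises its own transmissions.
        for link in route(adj, src, dst)[1:]:
            users[link].append((src, dst))
    return {link: comms for link, comms in users.items() if len(comms) > 1}


if __name__ == "__main__":
    binary = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
    print(contended_links(two_switch_topology([0, 1, 2, 3], [4, 5, 6, 7]), binary))
    linear = [(i, i + 1) for i in range(7)]
    print(contended_links(two_switch_topology([0, 1, 4, 5], [2, 3, 6, 7]), linear))
```

In both cases the only shared link is the one from s0 to s1, which is where the communications named in the examples collide.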
Algorithm for constructing a contention-free linear tree
Step 1: Traverse all switches using depth-first search (DFS), and number the switches S0, S1, S2, and so on in the order in which the DFS first reaches them.
Step 2: The linear tree consists of all machines attached to switch S0, followed by all machines attached to S1, then S2, and so on.
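The two steps can be sketched in Python as follows. This is a minimal illustration, not the authors' routine generator: the topology is given as a switch adjacency list plus a list of machines per switch, the DFS is assumed to start at the switch of the broadcast root, and the inter-switch links in the example are an assumption. `contention_free_linear_tree` is a hypothetical name.

```python
def contention_free_linear_tree(switch_links, machines, root_switch):
    """Return (DFS order of switches, linear order of machines).

    switch_links: dict mapping a switch to its neighbouring switches.
    machines:     dict mapping a switch to the machines attached to it.
    """
    order = []
    visited = set()

    def dfs(sw):
        # Step 1: record switches in the order DFS first reaches them.
        visited.add(sw)
        order.append(sw)
        for nxt in switch_links.get(sw, []):
            if nxt not in visited:
                dfs(nxt)

    dfs(root_switch)

    # Step 2: concatenate the machines switch by switch, in DFS order.
    chain = []
    for sw in order:
        chain.extend(machines.get(sw, []))
    return order, chain


# Example with four switches; the links A-B, B-C, B-D are assumed.
switch_links = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
machines = {
    "A": [0, 1, 4, 5],
    "B": [2, 3, 6, 7],
    "C": [8, 9, 10, 11],
    "D": [12, 13, 14, 15],
}
order, chain = contention_free_linear_tree(switch_links, machines, "A")
```

Because machines on the same switch are consecutive in the chain, the chain enters and leaves each switch's DFS subtree once, so each inter-switch link is crossed at most once in each direction.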
Example of a contention-free linear tree. [Figure: switch S0 connects n0, n1, n4, n5; switch S1 connects n2, n3, n6, n7; switch S2 connects n8, n9, n10, n11; switch S3 connects n12, n13, n14, n15.] Linear tree: n0 -> n1 -> n4 -> n5 -> n2 -> n3 -> n6 -> n7 -> n8 -> n9 -> … -> n15
Algorithm for constructing a contention-free binary tree. Start with a contention-free linear tree. Recursively divide the tree into 2 sub-trees, making sure that there cannot be contention. The sub-trees are chosen such that the height of the whole tree is minimal.
Binary tree height. The performance of binary pipelined broadcast depends on the height of the binary tree. Even though a contention-free binary tree may not be a complete binary tree, its height is not much greater than that of a complete binary tree.
Average tree heights for 20 randomly generated topologies
Evaluation
Contention-free pipelined algorithms: routines are generated from topology information and are based on MPICH point-to-point primitives. Trees: linear tree, binary tree, 3-ary tree.
Targets for comparison: MPICH (binomial tree, scatter/allgather); LAM (flat tree, binomial tree); topology-unaware pipelined linear and binary algorithms.
Evaluation
Performance of different pipelined trees (topology 1)
Comparing pipelined broadcast with other schemes
Topology unaware and contention-free pipelined broadcast
Segment size for pipelined broadcast
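One way to reason about segment size is to sweep candidate sizes through the pipelined cost formula. The sketch below is not the measured data from the talk. It assumes a cost model T(m) = alpha + beta x m (a fixed startup cost plus a per-byte cost), which is an assumption of this example, and the values of `alpha` and `beta` are hypothetical. `pipelined_time` and `best_segment_size` are hypothetical helper names.

```python
import math


def pipelined_time(msize, seg_size, H, D, alpha, beta):
    """(X + H - 1) * D * T(seg_size), with assumed T(m) = alpha + beta * m."""
    X = math.ceil(msize / seg_size)  # number of segments
    return (X + H - 1) * D * (alpha + beta * seg_size)


def best_segment_size(msize, H, D, alpha, beta, candidates):
    """Return the candidate segment size with the smallest modelled time."""
    return min(candidates,
               key=lambda s: pipelined_time(msize, s, H, D, alpha, beta))


if __name__ == "__main__":
    msize = 1 << 20                     # 1 MiB message
    alpha, beta = 1e-4, 1e-7            # hypothetical startup (s) and per-byte (s) costs
    sizes = [1 << k for k in range(8, 21)]
    for s in sizes:
        t = pipelined_time(msize, s, H=16, D=1, alpha=alpha, beta=beta)
        print(f"segment {s:8d} B  ->  {t:.4f} s")
    print("best:", best_segment_size(msize, 16, 1, alpha, beta, sizes))
```

Under this model, very small segments pay the startup cost too many times and very large segments lose the pipelining benefit, so the modelled time has a minimum in between.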
Conclusions. Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages. The linear pipeline has a completion time roughly equal to T(msize). The binary pipelined broadcast is best for medium messages. A contention-free broadcast tree is necessary for pipelined algorithms. A good segment size for pipelined broadcast is not difficult to find.
Questions?