Oblivious Routing Schemes in Extended Generalized Fat Tree Networks
Germán Rodríguez, Cyriel Minkenberg, Ramon Beivide, Ronald P. Luijten, Jesus Labarta, Mateo Valero
HPI-DC'09 (in conjunction with CLUSTER'09), New Orleans, 2009
2 Summary
● We describe well-known regular modulo-based routing algorithms for k-ary n-trees.
● We extend and analyze these algorithms for a broader class of networks: XGFTs, including cost-effective variants of k-ary n-trees.
● We derive combinatorial results showing that the two main variants of modulo-based algorithms perform equally well for a random distribution of traffic.
● We identify two intrinsic flaws of oblivious modulo-based algorithms and propose a variant that improves on both.
3 Outline
● XGFT topologies: k-ary n-trees and more cost-effective variants
● Routing (state of the art)
  ● Random
  ● Modulo-radix variants: Source-mod-k and Destination-mod-k
● Experimental environment
● Analysis of modulo-radix algorithms
● Proposal: Random NCA up/down
● Evaluation
● Results
● Conclusion
4 Extended Generalized Fat Trees I
● XGFT($h$; $m_1, \dots, m_h$; $w_1, \dots, w_h$)
● Superclass of multi-trees
  ● k-ary n-trees [Petrini97]
  ● Slimmed trees [Navaridas07]
● $h$ = height = number of levels - 1
  ● Levels are numbered 0 through $h$
  ● Level 0: compute nodes
  ● Levels 1 … $h$: switch nodes
● $m_i$ = number of children per node at level $i$, $0 < i \le h$
● $w_i$ = number of parents per node at level $i-1$, $0 < i \le h$
● Number of level-0 nodes = $\prod_{i=1}^{h} m_i$
● Number of level-$h$ nodes = $\prod_{i=1}^{h} w_i$
● Nearest Common Ancestors (NCA), Least Common Ancestors (LCA) or "roots" of a pair of nodes (s, d): the set of inner nodes at the lowermost level that are ancestors of both s and d.
[Figures: XGFT(3; 3,2,2; 2,2,3) with node labels; 4-ary 2-tree; XGFT(3; 4,4,4; 1,4,1), a slimmed tree]
5 Extended Generalized Fat Trees II
[Figures: XGFT(3; 3,2,2; 2,2,3) with node labels; 4-ary 2-tree; XGFT(3; 4,4,4; 1,4,1), a slimmed tree]
● Number of nodes at level $i$, $0 < i < h$: $\prod_{j=1}^{i} w_j \cdot \prod_{j=i+1}^{h} m_j$
● Each node can be labeled with an $h$-tuple of digits $M_i$ and $W_i$, with $0 \le M_i < m_i$ and $0 \le W_i < w_i$, which in combination with the level number $i$ uniquely determines a node in the whole network (first the W's, then the M's)
● Equivalent variations of the labeling scheme have been proposed [Lin04, Gomez07]
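The level sizes follow directly from the XGFT parameters: a level-$i$ node is identified by one choice among $w_1, \dots, w_i$ parents and one among $m_{i+1}, \dots, m_h$ children, so the count is the product of those factors. The function name `xgft_level_sizes` and the list-based parameter encoding below are illustrative assumptions, not taken from the slides; the sketch only checks the counts against the XGFT(3; 3,2,2; 2,2,3) example.

```python
from math import prod


def xgft_level_sizes(h, m, w):
    """Return the number of nodes at each level 0..h of XGFT(h; m; w).

    Assumed encoding: m[i-1] is the number of children per node at level i,
    w[i-1] is the number of parents per node at level i-1 (1 <= i <= h).
    """
    if len(m) != h or len(w) != h:
        raise ValueError("m and w must each have exactly h entries")
    # Level i has w_1 * ... * w_i * m_{i+1} * ... * m_h nodes
    # (prod of an empty slice is 1, which covers levels 0 and h).
    return [prod(w[:i]) * prod(m[i:]) for i in range(h + 1)]


# The example topology from the slides: 12 compute nodes, 8 + 8 + 12 switches.
print(xgft_level_sizes(3, [3, 2, 2], [2, 2, 3]))  # [12, 8, 8, 12]
```

The same function reproduces the two special cases on the slide: level 0 gives the product of all $m_i$, and level $h$ gives the product of all $w_i$.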
6 XGFTs and Contention
● XGFTs provide multiple paths for every pair of nodes:
  ● Their number is proportional to the "number of parents" ($w_i$) parameters up to the Nearest/Least Common Ancestors of source s and destination d.
  ● Increasing the number of parents increases the cost.
● k-ary n-trees provide full bisection bandwidth and set a well-known trade-off between cost and performance
● Slimmed trees (with $w_i \le k$) become more important as the number of nodes increases
● Our analysis and proposal work better for slimmed trees than previous algorithms.
7 Related Work: Routing schemes
● Main oblivious routing schemes for fat trees
  ● Random selection of upward paths [Valiant81][Greenberg85]
  ● Source-modulo assignment of upward links [Leiserson92][Ohrin95][Kariniemi06]
  ● Destination-modulo assignment of upward links [Lin04][Gomez07][Johnson08]
● Pattern-aware (used in this work)
  ● Colored heuristic [Rodriguez09]
  ● We use it as a baseline for comparison
8 Random Routing I
● The assignment of links to reach an NCA is totally random
● Idea: a random distribution should spread the probability of contention evenly
● At each step, choose a random parent until an NCA is reached
● Then follow the unique deterministic path down
[Figure: example route between Node 1 and Node 10]
9 Regular Routings (s mod k, d mod k)
● "Self-routing" approach
● At each step, choose the parent by applying a modulo-k operation to a node label
● Difference between the variants: either the source label (source mod k) or the destination label (destination mod k) is used, and only on the way up the tree
[Figure: example routes for Node 10 and Dest 26, with each up-port chosen as "mod 3" of the label, shown for source mod k and for destination mod k]
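A minimal sketch of the up-port selection, assuming the common formulation in which the j-th upward choice uses the j-th base-k digit of the label, that is, `(label // k**j) % k`. The slides only state that a modulo-k operation is applied at each step, so the digit-extraction rule, the function names and the `hops` parameter (number of upward choices needed to reach an NCA) are assumptions made for illustration.

```python
def up_port(label, step, k):
    """Up-port for the step-th upward choice (0-based): the step-th base-k digit.

    Assumption: this digit-based rule is one common way to write "mod k at
    each step"; the slides do not spell out the exact formula.
    """
    return (label // k**step) % k


def up_ports(src, dst, k, hops, scheme):
    """Sequence of up-ports taken from src towards an NCA of (src, dst)."""
    if scheme == "s-mod-k":
        label = src          # the source label steers the upward path
    elif scheme == "d-mod-k":
        label = dst          # the destination label steers the upward path
    else:
        raise ValueError("scheme must be 's-mod-k' or 'd-mod-k'")
    return [up_port(label, step, k) for step in range(hops)]


# Node 10 sending to node 26 in a radix-3 tree, two upward choices.
print(up_ports(10, 26, 3, 2, "s-mod-k"))  # [1, 0]
print(up_ports(10, 26, 3, 2, "d-mod-k"))  # [2, 2]
```

In both variants the downward path from the NCA to the destination is unique, so only the upward choices need to be computed.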
10 Combinatorial Analysis of Modulo-based Algorithms
An interesting question arises: is either of the two variants (source or destination) of the modulo-based algorithms intrinsically better?
● Number of permutations routed by s-mod-k and by d-mod-k
  ● The same; why?
  ● Idea: every permutation P has an inverse, Inverse(P); if P has c conflicts with s-mod-k, Inverse(P) has c conflicts with d-mod-k (details in the paper)
● Number of general patterns (not permutations) routed by s-mod-k and by d-mod-k
  ● The same; why?
  ● Idea: decompose the pattern into all possible permutations
  ● Compute the maximum c over all these permutations for s-mod-k
  ● Invert the decomposed permutations and apply the previous result: the union of the inverted permutations has the same maximum c for d-mod-k
  ● More details in the paper
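The inverse-permutation argument can be checked by brute force on a toy network. The sketch below assumes a two-level tree with K leaf switches and K roots (K*K leaves) and measures "conflicts" as the largest number of flows sharing one link between the two switch levels; both the toy topology and this conflict metric are simplifying assumptions, and the paper's exact definition may differ.

```python
from collections import Counter
from itertools import permutations

K = 2  # toy radix: K*K leaves, K leaf switches, K roots


def max_link_load(perm, scheme):
    """Largest number of flows sharing one up or down link (assumed metric)."""
    load = Counter()
    for s, d in enumerate(perm):
        if s // K == d // K:  # same leaf switch: the flow never reaches a root
            continue
        root = (s if scheme == "s-mod-k" else d) % K
        load[("up", s // K, root)] += 1
        load[("down", root, d // K)] += 1
    return max(load.values(), default=0)


def inverse(perm):
    """Inverse permutation: if perm sends s to d, the inverse sends d to s."""
    inv = [0] * len(perm)
    for s, d in enumerate(perm):
        inv[d] = s
    return tuple(inv)


all_perms = list(permutations(range(K * K)))

# P under s-mod-k and Inverse(P) under d-mod-k see the same worst-case load.
pairwise_equal = all(
    max_link_load(p, "s-mod-k") == max_link_load(inverse(p), "d-mod-k")
    for p in all_perms
)

# Consequently both variants have the same distribution over all permutations.
hist_s = Counter(max_link_load(p, "s-mod-k") for p in all_perms)
hist_d = Counter(max_link_load(p, "d-mod-k") for p in all_perms)
print(pairwise_equal, hist_s == hist_d)  # True True
```

The reason is visible in the code: reversing every flow swaps the roles of up and down links while keeping the same root, so the link loads of P under one variant reappear, mirrored, for Inverse(P) under the other.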
11 Experimental Setup
● Collection of application traces and pattern extraction
● Co-simulation approach [Minkenberg09]:
  ● Dimemas replays the MPI activity of the trace of an application
  ● Venus simulates the transmission of the messages with a detailed model of the network
[Figure: co-simulation tool chain. Components shown: execution of an application producing traces; Dimemas simulator (config file: links, bandwidth, #buses, latency, eager/rendez-vous, etc.); Venus simulator (config file: adapter and switch parameters, BW, link delay) with routes, mapping and topology inputs; map2ned, routereader, Myrinet's route and map files; traffic generator; ServerMod/ClientMod; statistics and traces; visualization, analysis and validation (Paraver).]
12 Applications
● WRF
  ● 256 processors
  ● Each process has 2 outstanding sends, to the destinations 16 nodes above and below it (except the first and the last 16 processes)
● CG
  ● 128 processors
13 Results: WRF, progressive tree slimming
● Removing a single switch degrades the performance by a factor of 2
● Removing 7 more middle switches has no impact for 3 routing schemes
● Regular modulo routings work very well (as well as the baseline), while Random does not.
14 Modulo-based Algorithms look good
● A word about contention:
  ● Two main types: endpoint contention and network fabric contention
  ● Endpoint contention arises because a node is performing multiple outstanding sends or receives and has fewer adapters than it needs.
  ● Network fabric contention arises because there are not enough network resources or the routing algorithm is not using them adequately.
● Modulo-based routing algorithms use node labels to go up the tree, concentrating the endpoint contention of each particular node on a specific NCA
  ● S-mod-k uses the source label: endpoint contention at the source is concentrated
  ● D-mod-k uses the destination label: endpoint contention at the destination is concentrated
However, modulo-based algorithms do not always work well...
15 Results: CG ●Oblivious routings cannot achieve the best performance ●It’s a pathological case for modulo-based oblivious algorithms ●Random routing does not achieve good performance ●The oblivious strategies do not match the baseline
16 Results: CG Communication Pattern ●Colored ●All phases take the same time ●Destination Mod K ●Non-local phase takes 8 times longer?
17 Results: CG, communication pattern congruent with the modulo algorithm
● Why do oblivious algorithms work badly with CG?
● Only one phase in CG is non-local in our experiment:
  ● Each source sends to: destination = (source / 2) * 16 + (source mod 2), with integer division
● Modulo-based routing algorithms in radix-16 networks:
  ● OutputPort(destination) = ((source / 2) * 16 + (source mod 2)) mod 16 = source mod 2, i.e. 0 or 1
  ● The 16 outgoing communications are mapped to either port 0 or port 1
  ● 8 to each: 8 contending communications per port
  ● 14 unused ports in the switch…
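The port collapse can be reproduced with a few lines of arithmetic. The sketch assumes integer division in the destination formula and looks only at the 16 sources with ranks 0 to 15, taken here as the sources attached to one radix-16 switch; the helper names are hypothetical.

```python
from collections import Counter

RADIX = 16  # switch radix, as on the slide


def cg_destination(source):
    """Destination of the non-local CG phase (integer division assumed)."""
    return (source // 2) * 16 + (source % 2)


def d_mod_k_port(destination, k=RADIX):
    """Up-port chosen by destination-mod-k at the first switch."""
    return destination % k


# Assumption: ranks 0..15 share one first-level switch.
ports = Counter(d_mod_k_port(cg_destination(s)) for s in range(RADIX))
print(dict(ports))  # {0: 8, 1: 8}
```

Because the term `(source // 2) * 16` is a multiple of the radix, it vanishes under the modulo and only `source % 2` survives, which is why 14 of the 16 up-ports stay idle.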
18 Proposal: Random NCA up/down
● Oblivious algorithms: what does d-mod-k or s-mod-k do?
  ● It makes certain "roots" responsible for routing a collection of sources or destinations.
  ● The distribution of roots is even for a k-ary n-tree, but not for slimmed trees.
  ● It tries to concentrate endpoint contention either on the path up to the root (source mod k) or down from the root (destination mod k).
● We can relabel the nodes and apply modulo-based algorithms to the new source or destination labels, which defines two families of algorithms:
  ● Random NCA up (using source labels)
  ● Random NCA down (using destination labels)
● Idea: each root is responsible for concentrating the endpoint contention of a number of leaf nodes. An even distribution of leaf nodes to roots should lead to good performance.
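One possible reading of the relabeling idea is sketched below: draw a random one-to-one relabeling of the leaf nodes once, then run the modulo rule on the new labels. The slides do not give the relabeling procedure, so the seeded shuffle, the digit-based modulo rule and the names `random_relabeling` and `nca_up_ports` are all assumptions for illustration on a plain k-ary tree, not the authors' implementation.

```python
import random
from collections import Counter


def random_relabeling(num_leaves, seed):
    """Random one-to-one relabeling: labels[node] is the new label of node."""
    labels = list(range(num_leaves))
    random.Random(seed).shuffle(labels)  # seeded, so routes are reproducible
    return labels


def nca_up_ports(src, dst, k, hops, labels, variant):
    """Up-ports from the relabeled source ("up") or destination ("down")."""
    if variant not in ("up", "down"):
        raise ValueError("variant must be 'up' or 'down'")
    label = labels[src] if variant == "up" else labels[dst]
    # Assumed modulo rule: the j-th upward choice uses the j-th base-k digit.
    return [(label // k**j) % k for j in range(hops)]


# 16 leaves, radix 4: every first up-port still serves exactly 4 sources,
# but which sources share a port no longer depends on the original labels.
labels = random_relabeling(16, seed=1)
first_port = Counter(nca_up_ports(s, 0, 4, 1, labels, "up")[0] for s in range(16))
print(sorted(first_port.items()))  # [(0, 4), (1, 4), (2, 4), (3, 4)]
```

Because the relabeling is a permutation, the number of leaf nodes per port stays balanced, while regular patterns such as the CG phase are no longer aligned with the modulo operation.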
19 A word on the results plots
● In each of the graphs there is a data point for:
  ● Source-mod-k (triangle up, centered)
  ● Destination-mod-k (triangle down, centered)
● And three boxes (minimum, 1st quartile, median, 3rd quartile and maximum) for:
  ● Random
  ● Random NCA up
  ● Random NCA down
● Note that the results of the random algorithms are based on 20 to 60 experiments with different seeds; when the variance in performance is not noticeable, the whole "box" collapses into a single horizontal line
20 Results: WRF
● Random-NCA-up and Random-NCA-down are almost as good as S-mod-k and D-mod-k
21 Results: CG
● Random-NCA-up and Random-NCA-down are midway between the modulo-based algorithms (S-mod-k and D-mod-k) and the baseline.
22 Routes per NCA
● Distribution of routes per NCA for several routing schemes
● X axis is the NCA number
● Left: non-slimmed topology
  ● Small variance of routes per NCA for each routing scheme and across ports
● Right: slimmed topology
  ● Source and destination modulo-based algorithms show a huge difference in the routes assigned per NCA
  ● Random and the proposed family of random assignment of NCAs exhibit less variance across NCAs
23 Conclusions
● There are no fundamental differences in performance for typical communication patterns between source and destination modulo-based algorithms
● Modulo-based algorithms present an intrinsic flaw for slimmed trees
  ● An unbalanced distribution of routes per NCA can lead to increased network contention
● A hybrid approach (randomly selecting NCAs that become "endpoint-contention" concentrators) helps and could be used as a better oblivious approach for both non-slimmed and slimmed networks.
24 THANKS HPIDC’09
25 Q & A
28 Routing in XGFTs
● Selecting a link upwards further limits the choice of links at the upper levels.
● In pink: the switches that can be visited after selecting the first leftmost parent of level 1 and the second leftmost link up of level
29 XGFTs I
● Superclass of fat tree topologies:
  ● XGFT($h$; $m_1, \dots, m_h$; $w_1, \dots, w_h$)
  ● $h$ is the height of the tree.
  ● $m_i$ is the number of children per node at level $i$.
  ● $w_i$ is the number of parents per node at level $i-1$.
[Figures: XGFT(1;4,1), XGFT(1;4,2), XGFT(1;4,3), XGFT(1;4,4); 4-ary tree; 4-ary 1-tree]
30 Random Routing I
● The assignment of links to reach an NCA is totally random
● Idea: a random distribution should spread the probability of contention evenly
● Drawback I: suboptimal link assignment for a given pattern
31 Random Routing II
● Drawback II
  ● Even a single conflict halves performance
[Figure: two link assignments for 3 pairs of nodes, one with 2 conflicts and one using 6 links with no conflicts]
32 Coupled effects
[Diagram relating Topology, Routing, Communication Pattern, Mapping, Contention, Performance and Results]