Decision Trees and MPI Collective Algorithm Selection Problem
Jelena Pješivac-Grbović, Graham E. Fagg, Thara Angskun, George Bosilca, and Jack J. Dongarra, IPDPS (IEEE International Parallel & Distributed Processing Symposium)
Reporter: Yu Tang Liu
Outline
◦ Abstract
◦ Introduction
◦ C4.5 Decision Tree Algorithm
◦ Experimental Results and Analysis
◦ Conclusion
Abstract
Selecting the close-to-optimal collective algorithm based on the parameters of the collective call at run time is an important step in achieving good performance of MPI applications. This work explores the applicability of C4.5 decision trees to the MPI collective algorithm selection problem.
Introduction
The performance of MPI collective operations depends on
◦ Total number of nodes involved in communication
◦ System and network characteristics
◦ Size of the data being transferred
◦ Current load
◦ The operation that is being performed
◦ The segment size used for operation pipelining
The goal is to select the best possible algorithm and segment size combination for every instance of a collective operation.
Introduction
Process of tuning a system:
1. Detailed profiling of the system, possibly combined with communication modeling.
2. Analyzing the collected data and generating a decision function.
3. At run time, the decision function selects the close-to-optimal method (combination of algorithm and segment size) for a particular collective instance.
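In practice, such a decision function can be emitted as plain C source and compiled into the MPI library. The sketch below is a hypothetical illustration of step 3; the algorithm IDs, thresholds, segment sizes, and function signature are assumptions for illustration, not the functions generated in the paper.

    #include <stddef.h>

    /* Hypothetical run-time decision function for Broadcast.
       Algorithm IDs, thresholds, and segment sizes are illustrative only. */
    typedef struct { int algorithm; size_t segment_size; } bcast_method_t;

    bcast_method_t bcast_decision(int comm_size, size_t msg_size) {
        bcast_method_t m;
        if (msg_size < 2048) {          /* small messages: binomial tree, no segmentation */
            m.algorithm = 2;  m.segment_size = 0;
        } else if (comm_size <= 8) {    /* few nodes: binary tree, 8 KB segments */
            m.algorithm = 3;  m.segment_size = 8192;
        } else {                        /* otherwise: pipeline, 16 KB segments */
            m.algorithm = 1;  m.segment_size = 16384;
        }
        return m;
    }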
C4.5 Decision Tree Algorithm
Decision tree example (figure)
C4.5 Decision Tree Algorithm
In the decision tree, each internal node corresponds to a non-categorical attribute and each arc to a possible value of that attribute. A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf. Each node should branch on the non-categorical attribute that is most informative among the attributes not yet considered on the path from the root.
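"Most informative" is quantified with the standard entropy-based criteria (these are the textbook ID3/C4.5 formulas, added here for completeness). ID3 uses information gain; C4.5 uses the gain ratio, which normalizes gain by the entropy of the split itself:

    Entropy(S) = -\sum_c p_c \log_2 p_c
    Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
    GainRatio(S, A) = Gain(S, A) \Big/ \left( -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \right)

where p_c is the fraction of records in S belonging to class c, and S_v is the subset of S with value v for attribute A.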
C4.5 Decision Tree Algorithm
Requirements for applying the C4.5 algorithm:
◦ Attribute-value description
◦ Predefined classes
◦ Discrete classes
◦ Sufficient data
◦ “Logical” classification models
C4.5 Decision Tree Algorithm
Additional parameters that affect the resulting decision tree:
◦ Weight
◦ Confidence level
◦ Attribute grouping
◦ Windowing
C4.5 Decision Tree Algorithm
◦ ID3 algorithm
◦ C4.5 algorithm = ID3 algorithm + extensions such as handling of continuous attributes, missing attribute values, and tree pruning
Experimental Results and Analysis
C4.5 decision tree for Alltoall on the Nano cluster (figure)
Experimental Results and Analysis
Barrier is a collective operation used to synchronize a group of nodes. It guarantees that by the end of the operation, all processes involved in the barrier have at least entered the barrier.
◦ In the flat-tree/linear algorithm, all nodes report to a preselected root; once every node has reported, the root sends a releasing message to all participants.
◦ In the double-ring algorithm, a zero-byte message is sent from a preselected root circularly to the right. A node can leave the barrier only after it receives the message for the second time.
◦ The Bruck algorithm requires ⌈log₂ P⌉ communication steps, where P is the number of nodes. At step k, node r receives a zero-byte message from node (r − 2^k) and sends a zero-byte message to node (r + 2^k), with wrap-around.
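A minimal MPI sketch of this distance-doubling pattern (a sketch of the dissemination-style structure described above, not the library's actual implementation; tags and error handling are simplified):

    #include <mpi.h>

    /* Each step doubles the exchange distance; zero-byte messages only. */
    void dissemination_barrier(MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        for (int dist = 1; dist < size; dist <<= 1) {
            int to   = (rank + dist) % size;         /* node r + 2^k */
            int from = (rank - dist + size) % size;  /* node r - 2^k */
            MPI_Sendrecv(NULL, 0, MPI_BYTE, to, 0,
                         NULL, 0, MPI_BYTE, from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }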
Experimental Results and Analysis
Alltoall is used to exchange data among all processes in a group. The operation is equivalent to all processes executing the scatter operation on their local buffer.
◦ In the linear algorithm, at step i the i-th node sends a message to all other nodes. The (i+1)-th node is able to proceed and start sending as soon as it receives the complete message from the i-th node. We allow for segmentation of the messages being sent.
◦ In the pairwise exchange algorithm, at step i, node r sends a message to node (r + i) and receives a message from node (r − i), with wrap-around. We do not segment messages in this algorithm.
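A minimal sketch of the pairwise exchange pattern, assuming each peer's data is a contiguous block of `block` bytes (a sketch under these assumptions, not the library's implementation):

    #include <mpi.h>
    #include <string.h>

    void pairwise_alltoall(const char *sendbuf, char *recvbuf,
                           int block, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        /* step 0: local copy of this rank's own block */
        memcpy(recvbuf + (size_t)rank * block,
               sendbuf + (size_t)rank * block, block);
        for (int i = 1; i < size; i++) {
            int to   = (rank + i) % size;         /* node r + i */
            int from = (rank - i + size) % size;  /* node r - i */
            MPI_Sendrecv(sendbuf + (size_t)to * block,   block, MPI_BYTE, to,   0,
                         recvbuf + (size_t)from * block, block, MPI_BYTE, from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }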
Experimental Results and Analysis
The Broadcast operation transmits an identical message from the root process to all processes of the group. At the end of the call, the contents of the root’s communication buffer are copied to all other processes.
◦ In the flat-tree/linear algorithm, the root node sends an individual message to every participating node.
◦ In the pipeline algorithm, messages are propagated from the root left to right in a linear fashion.
◦ In the binomial and binary tree algorithms, messages traverse the tree starting at the root and going toward the leaf nodes through intermediate nodes.
◦ In the split-binary tree algorithm, the original message is split into two parts; the “left” half of the message is sent down the left half of the binary tree, and the “right” half is sent down the right half of the tree. In the final phase of the algorithm, every node exchanges messages with its “pair” from the opposite side of the binary tree.
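A sketch of the binomial tree traversal listed above, assuming root rank 0 and no message segmentation (segmentation would pipeline this loop over message chunks):

    #include <mpi.h>

    /* Binomial-tree broadcast: receive from the parent, then forward
       to children at decreasing power-of-two distances. */
    void binomial_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm) {
        int rank, size, mask = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        /* receive phase: wait for the message from the parent */
        while (mask < size) {
            if (rank & mask) {
                MPI_Recv(buf, count, dtype, rank - mask, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
            mask <<= 1;
        }
        /* send phase: forward down the binomial tree */
        mask >>= 1;
        while (mask > 0) {
            if (rank + mask < size)
                MPI_Send(buf, count, dtype, rank + mask, 0, comm);
            mask >>= 1;
        }
    }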
Experimental Results and Analysis
The Reduce operation combines elements provided in the input buffer of each process within a group using the specified operation, and returns the combined value in the output buffer of the root process. Available algorithms:
◦ flat-tree/linear
◦ pipeline
◦ binomial tree
◦ binary tree
◦ k-chain tree
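A sketch of the flat-tree/linear variant, assuming a contiguous datatype (the accumulation order is simplified here, which matters for non-commutative operations):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Every non-root process sends to the root; the root accumulates. */
    void flat_tree_reduce(const void *sendbuf, void *recvbuf, int count,
                          MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank != root) {
            MPI_Send((void *)sendbuf, count, dtype, root, 0, comm);
            return;
        }
        MPI_Aint lb, extent;
        MPI_Type_get_extent(dtype, &lb, &extent);
        memcpy(recvbuf, sendbuf, (size_t)count * extent);  /* root's own contribution */
        void *tmp = malloc((size_t)count * extent);
        for (int r = 0; r < size; r++) {
            if (r == root) continue;
            MPI_Recv(tmp, count, dtype, r, 0, comm, MPI_STATUS_IGNORE);
            MPI_Reduce_local(tmp, recvbuf, count, dtype, op);  /* recvbuf = tmp op recvbuf */
        }
        free(tmp);
    }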
Experimental Results and Analysis
Broadcast decision tree statistics corresponding to the data presented in the previous figure.
Experimental Results and Analysis
Performance penalty of Broadcast decision trees corresponding to the data presented in the previous figure and table.
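Performance penalty here refers to the relative slowdown of the method selected by the decision tree compared to the experimentally optimal method. A plausible formalization (the paper's exact definition may differ in detail):

    penalty = \frac{T_{tree} - T_{optimal}}{T_{optimal}} \times 100\%

where T_{tree} is the duration of the collective using the tree-selected method and T_{optimal} is the duration using the best method measured for that collective instance.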
Experimental Results and Analysis
Statistics for combined Broadcast and Reduce decision trees corresponding to the data presented in the previous figure.
Experimental Results and Analysis Mean performance penalty of the combined decision tree for each of the collectives.
Experimental Results and Analysis
Segment of the combined Broadcast and Reduce decision tree generated with ‘-m 40 -c 25’ (C4.5 options: a minimum weight of 40 cases per branch and a 25% pruning confidence level).
Conclusion
A C4.5 decision tree can be used to generate a reasonably small and very accurate decision function: the mean performance penalty on existing performance data was within the measurement error for all trees we considered. The combined trees were also able to produce decision functions with less than a 2.5% relative performance penalty for both collectives. This indicates that it is possible to use information about one MPI collective operation to generate a reasonably good decision function for another collective.