Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
ISPDC'07, Hagenberg, July 6, 2007
Outline
● Motivation and Goals
● Background
● The 'Square-Corner' Partitioning
● MPI Experiments / Results
● Conclusion / Future Work
Motivation
● Partitioning algorithms for parallel computing designed for an arbitrary number of nodes n do not always produce optimal partitionings on a small number of nodes.
● We previously presented a new 'Square-Corner' partitioning strategy for computing matrix products on two clusters, which has two advantages over existing strategies:
  – Reducing the total volume of communication
  – Overlapping communication and computation
Goal
● Our goal is to determine whether the Square-Corner partitioning is applicable as the top-level partitioning of a hierarchical partitioning algorithm for parallel computations on three clusters.
● To do so we model three clusters with three processors: a controllable, tunable environment.
● A top-level partitioning across clusters treats clusters as aggregates, i.e. as individual nodes.
● After the top-level partitioning, clusters can perform local partitioning according to their local architecture.
Goal
The three-processor model is viable because, just as with clusters, local communications are often an order of magnitude faster than inter-processor/inter-cluster communications. We assume perfect computational load balance.
Background
Rectangular Partitionings for Matrix Multiplication
[Figure: rectangular partitionings for two and three nodes, with the Total Volume of Inter-Processor Communication (TVC) given for each case. Each node 'owns' the correspondingly shaded partitions; arrows represent the data movement necessary for each node to compute its partial product.]
Square-Corner Partitioning
[Figure: the Square-Corner partitioning for two and three nodes.]
The Square-Corner Partitioning: Reducing the total volume of communication (TVC)
Half-Perimeters and the Lower Bound
The TVC is proportional to the sum of all partition perimeters. For simplicity we use the sum of the half-perimeters (P). The lower bound (L) of P is attained when all partitions are square: a partition of area a has half-perimeter at least 2√a, so L = 2 Σᵢ √aᵢ.
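As an illustration (not from the slides), the lower bound is easy to evaluate numerically. The sketch below computes L = 2 Σᵢ √aᵢ for partitions whose areas are proportional to the node speeds, and compares it with the sum of half-perimeters of a simple one-dimensional row-strip rectangular partitioning of the unit square; the strip layout is an assumed baseline for comparison, not the paper's optimal rectangular partitioning.

```python
import math

def lower_bound(areas):
    """Lower bound L on the sum of half-perimeters: attained when
    every partition is a square, each contributing 2*sqrt(area)."""
    return 2 * sum(math.sqrt(a) for a in areas)

def row_strip_half_perimeters(areas):
    """Sum of half-perimeters when the unit square is cut into
    horizontal strips with heights equal to the (normalised) areas.
    Each strip has width 1 and height a_i, so half-perimeter 1 + a_i."""
    return sum(1 + a for a in areas)

# Relative speeds 80:10:10, normalised so the areas sum to 1.
speeds = [80, 10, 10]
total = sum(speeds)
areas = [s / total for s in speeds]

L = lower_bound(areas)
P_strips = row_strip_half_perimeters(areas)
print(f"L = {L:.3f}, row-strip P = {P_strips:.3f}")
```

For the 80:10:10 ratio, L ≈ 3.054 while the row-strip partitioning gives P = 4, showing how far a purely one-dimensional rectangular layout sits above the lower bound.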
Rectangular Partitioning
[Figure: rectangular partitionings and their sums of half-perimeters for two and three nodes.]
Square-Corner Partitioning
[Figure: Square-Corner partitionings and their sums of half-perimeters for two and three nodes.]
Restriction
In the Square-Corner partitioning, the corner squares cannot 'overlap', which imposes a restriction on the relative node speeds: with the square areas proportional to the speeds, the two side lengths must sum to at most N, i.e. √S2 + √S3 ≤ √(S1+S2+S3). With S2 = S3 this reduces to S1 ≥ 2·S2, i.e. power ratios of 2:1:1 or greater.
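The restriction can be checked numerically. Assuming, as in the partitioning's construction, that each corner square's area is proportional to its node's relative speed, the side of Node i's square in an N×N matrix is N·√(Sᵢ/(S1+S2+S3)), and the two squares fit without overlapping only if their sides sum to at most N. The function names below are illustrative, not from the paper.

```python
import math

def square_sides(n, s1, s2, s3):
    """Side lengths of the corner squares assigned to Nodes 2 and 3,
    with areas proportional to their relative speeds."""
    total = s1 + s2 + s3
    return (n * math.sqrt(s2 / total), n * math.sqrt(s3 / total))

def satisfies_restriction(n, s1, s2, s3):
    """True if the two corner squares fit into the N x N matrix
    without overlapping, i.e. their sides sum to at most N."""
    side2, side3 = square_sides(n, s1, s2, s3)
    return side2 + side3 <= n

print(satisfies_restriction(5000, 80, 10, 10))  # well inside the restriction
print(satisfies_restriction(5000, 34, 33, 33))  # the squares would overlap
```

At the boundary ratio 2:1:1 (e.g. speeds 50:25:25) each square has side N/2, so the two squares exactly fill the diagonal and the restriction holds with equality.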
Network Topology
With three nodes we have a choice of network topology: fully connected or linear array (non-wraparound). We must investigate each topology independently.
Fully Connected Network
When is the Square-Corner sum of half-perimeters (and therefore the TVC) less than that of the Rectangular on a fully connected network?
[Figure: the hatched area violates the stated restriction; the striped area is where the Square-Corner partitioning has a lower TVC, at ratios of about 8:1:1 and greater.]
Linear Array Network
[Figure: the Rectangular and Square-Corner partitionings on a linear array; Node 1 (the fastest node) is in the middle of the array.]
Linear Array Network
When is the Square-Corner TVC less than that of the Rectangular on the linear array network when Node 1 is the middle node? For all power ratios, subject to the stated restriction.
The Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping Communication and Computation
A sub-partition of Node 1's C partition is immediately calculable: no communications are necessary to compute C1 = A1 × B1.
[Figure: Node 1's partitions.]
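The overlap can be sketched with threads standing in for MPI processes: Node 1 starts computing C1 = A1 × B1 immediately, while the data exchange for the remaining sub-partitions proceeds in the background. This is an illustrative single-machine sketch using Python threads and plain lists, not the authors' MPI implementation.

```python
import threading

def matmul(a, b):
    """Plain triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def communicate(inbox):
    """Stand-in for the MPI exchange of the other nodes' partitions."""
    inbox["remote_data"] = "partitions from Nodes 2 and 3"

# Node 1's immediately calculable sub-partitions A1 and B1 (toy data).
a1 = [[1, 2], [3, 4]]
b1 = [[5, 6], [7, 8]]

inbox = {}
comm = threading.Thread(target=communicate, args=(inbox,))
comm.start()                 # communication proceeds in the background...
c1 = matmul(a1, b1)          # ...while C1 = A1 x B1 is computed locally
comm.join()                  # wait for the exchange before the remaining products

print(c1)  # [[19, 22], [43, 50]]
```

The point of the design is that the C1 computation is entirely local, so its cost can hide (part of) the communication time instead of adding to it.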
Square-Corner Partitioning: Overlapping Communication and Computation
[Figure: the overlappable sub-partitions in the Square-Corner partitioning.]
Results
MPI experiments: matrix-matrix multiplication
● N = 5000
● Bandwidth = 100 Mb/s
● Three identical nodes with CPU-limiting software (to achieve different power ratios)
● Node speed ratio expressed as S1:S2:S3; recall S1 ≥ S2 ≥ S3
● For simplicity, S2 = S3 and S1+S2+S3 = 100
Linear Array: Communication Time
The lower total volume of communication translates to lower communication times. Average reduction in communication time: 40%.
Linear Array: Execution Time
Lower communication times result in lower execution times. Overlapping further reduces execution time.
Fully Connected: Communication Time
The lower total volume of communication translates to lower communication times at the expected ratios (above 8:1:1).
Fully Connected: Execution Time
Overlapping further reduces execution time, and also broadens the range of ratios where the Square-Corner partitioning is faster from >8:1:1 to >3:1:1.
Results
● Similar results were observed for different bandwidth values and power ratios (including when S2 ≠ S3).
Conclusions
● We successfully applied the Square-Corner partitioning to three nodes.
● The Square-Corner partitioning approaches the theoretical lower bound for the TVC, unlike existing rectangular partitionings.
● On a fully connected network, the Square-Corner partitioning reduces the TVC and the communication time when the power ratio is ~8:1:1 or greater.
● On the linear array network it results in a lower TVC and communication time for all power ratios (subject to the restriction: greater than 2:1:1).
● These lower communication times directly reduce execution times.
● The possibility of overlapping communication and computation brings further reductions in execution time.
● Overlapping broadens the ratio range where the Square-Corner partitioning outperforms the Rectangular from >8:1:1 to >3:1:1.
● The Square-Corner partitioning is viable for experimentation as the top-level partitioning of a hierarchical partitioning algorithm on three clusters.
Future Work
● Apply the three-node algorithm to three clusters
● Investigate optimal overlapping of communications and computations
● Investigate 4+ nodes/clusters
Acknowledgements
This work was supported by: