Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters Brett A. Becker and Alexey Lastovetsky Heterogeneous Computing Laboratory.


1 Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters Brett A. Becker and Alexey Lastovetsky Heterogeneous Computing Laboratory School of Computer Science and Informatics University College Dublin _______________________________________________________ ISPDC’07 Hagenberg July 6, 2007

2 Outline ● Motivation and Goals ● Background ● The ‘Square-Corner’ Partitioning ● MPI Experiments / Results ● Conclusion / Future Work

3 Motivation ● Partitioning algorithms for parallel computing designed for n nodes result in partitionings which are not always optimal on a small number of nodes ● We previously presented a new ‘Square-Corner’ partitioning strategy for computing matrix products with two clusters which has two advantages over existing strategies: Reducing the Total Volume of Communication Overlapping Communication and Computation Motivation and Goals

4 Goal ● Our goal is to determine if the Square-Corner partitioning is applicable as the top-level partitioning of a hierarchical partitioning algorithm for parallel computations on three clusters. ● To do so we model three clusters with three processors. ● Controllable, tunable environment ● A top-level partitioning across clusters would treat clusters as aggregates, or individual nodes ● After the top-level partitioning, clusters can perform local partitioning according to local architecture Motivation and Goals

5 Goal The three-processor model is viable because, just as with clusters, local communications are often an order of magnitude faster than inter-processor/inter-cluster communications. We assume perfect computational load balance. Motivation and Goals

6 Background

7 Rectangular Partitionings for Matrix Multiplication [Figures: two-node and three-node rectangular partitionings with the Total Volume of Inter-Processor Communication (TVC) for each. Arrows represent the data movement necessary for each node to compute its partial product; each node ‘owns’ the correspondingly shaded partitions.] Background

8 Square-Corner Partitioning [Figures: the Square-Corner partitioning for two nodes and three nodes.] Background

9 The Square-Corner Partitioning: Reducing the total volume of communication (TVC)

10 Half-Perimeters and the Lower Bound TVC is proportional to the sum of all partition perimeters. For simplicity we use the Sum of the Half-Perimeters (P). The Lower Bound (L) of P is attained when all partitions are square: Reducing the Total Volume of Communication
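As an illustrative sketch (my own, not code from the talk): for a unit square partitioned into rectangles with areas a_i, the sum of half-perimeters P is minimized when every rectangle is a square with half-perimeter 2√a_i, giving the lower bound L = 2·Σ√a_i. The partition below is a hypothetical example.

```python
import math

def half_perimeter_sum(rects):
    """Sum of half-perimeters P for rectangles given as (height, width)."""
    return sum(h + w for h, w in rects)

def lower_bound(areas):
    """Lower bound L on P: each partition square, half-perimeter 2*sqrt(area)."""
    return 2 * sum(math.sqrt(a) for a in areas)

# Example: unit square split into one 1.0 x 0.5 column and two 0.5 x 0.5 squares.
rects = [(1.0, 0.5), (0.5, 0.5), (0.5, 0.5)]
areas = [h * w for h, w in rects]     # [0.5, 0.25, 0.25]
P = half_perimeter_sum(rects)         # 3.5
L = lower_bound(areas)                # 2 + sqrt(2) ~ 3.414, so P > L here
```

Only the two equal-area partitions are squares, so P sits strictly above the bound; making all three partitions square is impossible for these areas inside a unit square.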

11 Rectangular Partitioning [Figures: two-node and three-node rectangular partitionings.] Reducing the Total Volume of Communication

12 Square-Corner Partitioning [Figures: two-node and three-node Square-Corner partitionings.] Reducing the Total Volume of Communication
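To make the TVC comparison between the Rectangular and Square-Corner partitionings concrete, here is a small owner-computes counting sketch (my own illustration, not code from the talk): the owner of C[i][j] needs all of row i of A and column j of B, and the TVC is every element a node needs but does not own. The sizes below are hypothetical round numbers, not balanced to a common speed ratio.

```python
def tvc(owner, n):
    """Total volume of communication for C = A*B when A, B, and C share the
    ownership map owner[i][j] -> node id (owner-computes rule)."""
    total = 0
    for p in {owner[i][j] for i in range(n) for j in range(n)}:
        rows = {i for i in range(n) for j in range(n) if owner[i][j] == p}
        cols = {j for i in range(n) for j in range(n) if owner[i][j] == p}
        # elements of A (needed rows) and B (needed columns) owned elsewhere
        total += sum(1 for i in rows for k in range(n) if owner[i][k] != p)
        total += sum(1 for k in range(n) for j in cols if owner[k][j] != p)
    return total

n, s2, s3 = 10, 3, 3
# Square-Corner: node 2 a square in the top-right corner, node 3 bottom-left.
sc = [[2 if i < s2 and j >= n - s2 else
       3 if i >= n - s3 and j < s3 else 1 for j in range(n)]
      for i in range(n)]
# Rectangular: node 1 a full-height column, nodes 2 and 3 stacked beside it.
w, h2 = 5, 5
rect = [[1 if j < w else (2 if i < h2 else 3) for j in range(n)]
        for i in range(n)]

print(tvc(sc, n), tvc(rect, n))  # 120 150: square-corner TVC = 2n(s2+s3) here
```

For the corner-square layout the count collapses to 2n(s2+s3), which is why shrinking the corner squares (a more dominant node 1) drives the Square-Corner TVC down.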

13 Restriction In the Square-Corner partitioning, squares cannot ‘overlap’, which imposes the following restriction on the relative node speeds: Reducing the Total Volume of Communication
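A minimal sketch of the non-overlap condition (my own reading, consistent with the 2:1:1 boundary quoted in the conclusions): if each square's side is s_i = N·√(S_i/(S1+S2+S3)) so that areas match relative speeds, the two corner squares fit without overlapping iff √(S2/ΣS) + √(S3/ΣS) ≤ 1.

```python
import math

def satisfies_restriction(s1, s2, s3):
    """Squares for nodes 2 and 3 sit in opposite corners and must not
    overlap: side_2 + side_3 <= N, with side_i = N * sqrt(S_i / total)."""
    total = s1 + s2 + s3
    return math.sqrt(s2 / total) + math.sqrt(s3 / total) <= 1 + 1e-12

# With S2 = S3, ratio 2:1:1 is exactly the boundary; 1:1:1 violates it.
print(satisfies_restriction(2, 1, 1), satisfies_restriction(1, 1, 1))
```

This matches the later remark that, on the linear array, the comparison holds for all ratios greater than 2:1:1 "due to restriction".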

14 Network Topology With three nodes we have a choice of network topology: fully connected or linear array (non-wraparound). We must investigate each topology independently. Reducing the Total Volume of Communication

15 Fully Connected Network When is the Square-Corner sum of half-perimeters (and therefore the TVC) less than that of the Rectangular on a fully connected network? The hatched area violates the stated restriction. The striped area is where the Square-Corner has a lower TVC: at ratios of about 8:1:1 and greater. Reducing the Total Volume of Communication

16 Linear Array Network [Figures: Rectangular and Square-Corner partitionings on the linear array; Node 1 (the fastest node) is in the middle of the array.] Reducing the Total Volume of Communication

17 Linear Array Network When is the Square-Corner TVC less than that of the Rectangular on the Linear Array Network when Node 1 is the middle node? Answer: for all power ratios, subject to the stated restriction. Reducing the Total Volume of Communication
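One way to see why the Square-Corner layout suits the linear array (a sketch under my reading of the layout, with hypothetical sizes): when the restriction holds, the rows of A and columns of B that node 2 needs never intersect node 3's corner square, and vice versa, so the two end nodes exchange nothing and all traffic stays on the two links to the middle node.

```python
def cross_traffic(n, s2, s3):
    """Elements exchanged directly between nodes 2 and 3 in the Square-Corner
    layout: node 2 (top-right, side s2) needs A rows 0..s2-1 and B columns
    n-s2..n-1; node 3 owns the s3 x s3 bottom-left corner of both A and B
    (rows n-s3.., columns 0..s3-1). Symmetrically for node 3's needs."""
    def overlap(r1, c1, r2, c2):
        # area of intersection of two half-open index rectangles
        return max(0, min(r1[1], r2[1]) - max(r1[0], r2[0])) * \
               max(0, min(c1[1], c2[1]) - max(c1[0], c2[0]))
    t = 0
    t += overlap((0, s2), (0, n), (n - s3, n), (0, s3))      # node 2's A rows
    t += overlap((0, n), (n - s2, n), (n - s3, n), (0, s3))  # node 2's B cols
    t += overlap((n - s3, n), (0, n), (0, s2), (n - s2, n))  # node 3's A rows
    t += overlap((0, n), (0, s3), (0, s2), (n - s2, n))      # node 3's B cols
    return t

print(cross_traffic(10, 3, 3))  # 0: restriction s2 + s3 <= n holds
print(cross_traffic(10, 6, 6))  # positive: the squares would overlap
```

With zero end-to-end traffic, nothing has to be relayed through the middle node, which is why the comparison favors Square-Corner for every admissible ratio on this topology.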

18 The Square-Corner Partitioning: Overlapping Communication and Computation

19 Overlapping Communication and Computation A sub-partition of Node 1’s C partition is immediately calculable. No communications are necessary to compute C1 = A1 × B1. [Figure: Node 1’s partitions.] Overlapping Communication and Computation
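The claim can be checked with a small counting sketch (my own, with hypothetical sizes): the interior block of C with rows s2..n−s3−1 and columns s3..n−s2−1 touches only rows of A and columns of B that node 1 owns in full, so node 1 can compute it before any communication arrives.

```python
def locally_computable(n, s2, s3):
    """Count C elements node 1 can compute with zero communication.
    C[i][j] needs all of A row i and B column j; node 1 owns everything
    except node 2's top-right s2-square and node 3's bottom-left s3-square."""
    count = 0
    for i in range(n):
        for j in range(n):
            owner1 = not (i < s2 and j >= n - s2) and \
                     not (i >= n - s3 and j < s3)
            row_clear = s2 <= i < n - s3      # A row i entirely node 1's
            col_clear = s3 <= j < n - s2      # B column j entirely node 1's
            if owner1 and row_clear and col_clear:
                count += 1
    return count

print(locally_computable(10, 3, 3))  # (10 - 3 - 3)^2 = 16 elements of C
```

The immediately computable region shrinks as the corner squares grow, so the overlap benefit is largest exactly where node 1 dominates the ratio.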

20 Square-Corner Partitioning [Figure: overlapping communication and computation in the Square-Corner partitioning.] Overlapping Communication and Computation

21 Results MPI experiments: matrix-matrix multiplication, N = 5000, bandwidth = 100 Mb/s. Three identical nodes with CPU-limiting software (to achieve different power ratios). Node speed ratio expressed as S1:S2:S3; recall S1 ≥ S2 ≥ S3. For simplicity, S2 = S3 and S1 + S2 + S3 = 100. Results

22 Linear Array Communication Time Lower Total Volume of Communication translates to lower Communication Times. Average reduction in Communication Time = 40%. Results

23 Linear Array Execution Time Lower Communication Times result in lower Execution Times. Overlapping further reduces Execution Time. Results

24 Fully-Connected Communication Time Lower Total Volume of Communication translates to lower Communication Times at the expected ratios (above 8:1:1). Results

25 Fully Connected Execution Time Overlapping further reduces execution time. Overlapping also broadens the range of ratios where the Square-Corner partitioning is faster, from >8:1:1 to >3:1:1. Results

26 Results ● Similar results were observed for different bandwidth values and power ratios (including when S2 ≠ S3). Results

27 Conclusions ● We successfully applied the Square-Corner Partitioning to three nodes. ● The Square-Corner Partitioning approaches the theoretical lower bound for the TVC, unlike existing rectangular partitionings. ● For a fully connected network, the Square-Corner Partitioning reduces the TVC and the communication time when the power ratio is ~8:1:1 or greater. ● For the linear array network it results in a lower TVC and communication time for all power ratios (due to the restriction, greater than 2:1:1). ● These lower communication times directly influence execution times. _______________________________________________________ Conclusions

28 ● The possibility of overlapping communication and computation brings further reductions in execution time. ● Overlapping broadens the ratio range where the Square-Corner partitioning outperforms the Rectangular from >8:1:1 to >3:1:1. ● The Square-Corner Partitioning is viable for experimentation as the top-level partitioning of a hierarchical partitioning algorithm on three clusters. _______________________________________________________ Conclusions

29 Future Work ● Apply the Three-Node Algorithm to Three Clusters ● Investigate an optimal Overlapping of Communications and Computations ● Investigate 4+ nodes/clusters _______________________________________________________ Future Work

30 Acknowledgements This work was supported by:

