Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
ISPDC'07, Hagenberg, July 6, 2007
Outline
● Motivation and Goals
● Background
● The 'Square-Corner' Partitioning
● MPI Experiments / Results
● Conclusion / Future Work
Motivation
● Partitioning algorithms for parallel computing designed for an arbitrary number of nodes n do not always produce optimal partitionings on a small number of nodes.
● We previously presented a new 'Square-Corner' partitioning strategy for computing matrix products on two clusters, which has two advantages over existing strategies:
  – Reducing the total volume of communication
  – Overlapping communication and computation
Goal
● Our goal is to determine whether the Square-Corner partitioning is applicable as the top-level partitioning of a hierarchical partitioning algorithm for parallel computations on three clusters.
● To do so we model three clusters with three processors: a controllable, tunable environment.
● A top-level partitioning across clusters treats clusters as aggregates, i.e. as individual nodes.
● After the top-level partitioning, clusters can perform local partitioning according to their local architecture.
Goal
The three-processor model is viable because, just as with clusters, local communications are often an order of magnitude faster than inter-processor/inter-cluster communications. We assume perfect computational load balance.
Background
Rectangular Partitionings for Matrix Multiplication
[Figure: rectangular partitionings for two and three nodes, with the Total Volume of Inter-Processor Communication (TVC) given for each case. Each node 'owns' the correspondingly shaded partitions; arrows represent the data movement necessary for each node to compute its partial product.]
Square-Corner Partitioning
[Figure: the Square-Corner partitioning for two and three nodes.]
The Square-Corner Partitioning: Reducing the total volume of communication (TVC)
Half-Perimeters and the Lower Bound
The TVC is proportional to the sum of all partition perimeters. For simplicity we use the sum of the half-perimeters (P). The lower bound (L) of P is attained when all partitions are square: a partition of area a has half-perimeter at least 2√a, so L = 2 Σᵢ √aᵢ.
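As an illustration (not from the slides), the lower bound is easy to evaluate numerically. The sketch below computes L = 2 Σᵢ √aᵢ for partitions whose areas are proportional to the node speeds, and compares it with the sum of half-perimeters of a simple one-dimensional row-strip rectangular partitioning of the unit square; the strip layout is an assumed baseline for comparison, not the paper's optimal rectangular partitioning.

```python
import math

def lower_bound(areas):
    """Lower bound L on the sum of half-perimeters: attained when
    every partition is a square, each contributing 2*sqrt(area)."""
    return 2 * sum(math.sqrt(a) for a in areas)

def row_strip_half_perimeters(areas):
    """Sum of half-perimeters when the unit square is cut into
    horizontal strips with heights equal to the (normalised) areas.
    Each strip has width 1 and height a_i, so half-perimeter 1 + a_i."""
    return sum(1 + a for a in areas)

# Relative speeds 80:10:10, normalised so the areas sum to 1.
speeds = [80, 10, 10]
total = sum(speeds)
areas = [s / total for s in speeds]

L = lower_bound(areas)
P_strips = row_strip_half_perimeters(areas)
print(f"L = {L:.3f}, row-strip P = {P_strips:.3f}")
```

For the 80:10:10 ratio, L ≈ 3.054 while the row-strip partitioning gives P = 4, showing how far a purely one-dimensional rectangular layout sits above the lower bound.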
Rectangular Partitioning
[Figure: rectangular partitionings and their sums of half-perimeters for two and three nodes.]
Square-Corner Partitioning
[Figure: Square-Corner partitionings and their sums of half-perimeters for two and three nodes.]
Restriction
In the Square-Corner partitioning, the corner squares cannot 'overlap', which imposes a restriction on the relative node speeds: with the square areas proportional to the speeds, the two side lengths must sum to at most N, i.e. √S2 + √S3 ≤ √(S1+S2+S3). With S2 = S3 this reduces to S1 ≥ 2·S2, i.e. power ratios of 2:1:1 or greater.
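The restriction can be checked numerically. Assuming, as in the partitioning's construction, that each corner square's area is proportional to its node's relative speed, the side of Node i's square in an N×N matrix is N·√(Sᵢ/(S1+S2+S3)), and the two squares fit without overlapping only if their sides sum to at most N. The function names below are illustrative, not from the paper.

```python
import math

def square_sides(n, s1, s2, s3):
    """Side lengths of the corner squares assigned to Nodes 2 and 3,
    with areas proportional to their relative speeds."""
    total = s1 + s2 + s3
    return (n * math.sqrt(s2 / total), n * math.sqrt(s3 / total))

def satisfies_restriction(n, s1, s2, s3):
    """True if the two corner squares fit into the N x N matrix
    without overlapping, i.e. their sides sum to at most N."""
    side2, side3 = square_sides(n, s1, s2, s3)
    return side2 + side3 <= n

print(satisfies_restriction(5000, 80, 10, 10))  # well inside the restriction
print(satisfies_restriction(5000, 34, 33, 33))  # the squares would overlap
```

At the boundary ratio 2:1:1 (e.g. speeds 50:25:25) each square has side N/2, so the two squares exactly fill the diagonal and the restriction holds with equality.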
Network Topology
With three nodes we have a choice of network topology: fully connected or linear array (non-wraparound). We must investigate each topology independently.
Fully Connected Network
When is the Square-Corner sum of half-perimeters (and therefore the TVC) less than that of the Rectangular on a fully connected network?
[Figure: the hatched area violates the stated restriction; the striped area is where the Square-Corner partitioning has a lower TVC, at ratios of about 8:1:1 and greater.]
Linear Array Network
[Figure: the Rectangular and Square-Corner partitionings on a linear array; Node 1 (the fastest node) is in the middle of the array.]
Linear Array Network
When is the Square-Corner TVC less than that of the Rectangular on the linear array network when Node 1 is the middle node? For all power ratios, subject to the stated restriction.
The Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping Communication and Computation
A sub-partition of Node 1's C partition is immediately calculable: no communications are necessary to compute C1 = A1 × B1.
[Figure: Node 1's partitions.]
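The overlap can be sketched with threads standing in for MPI processes: Node 1 starts computing C1 = A1 × B1 immediately, while the data exchange for the remaining sub-partitions proceeds in the background. This is an illustrative single-machine sketch using Python threads and plain lists, not the authors' MPI implementation.

```python
import threading

def matmul(a, b):
    """Plain triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def communicate(inbox):
    """Stand-in for the MPI exchange of the other nodes' partitions."""
    inbox["remote_data"] = "partitions from Nodes 2 and 3"

# Node 1's immediately calculable sub-partitions A1 and B1 (toy data).
a1 = [[1, 2], [3, 4]]
b1 = [[5, 6], [7, 8]]

inbox = {}
comm = threading.Thread(target=communicate, args=(inbox,))
comm.start()                 # communication proceeds in the background...
c1 = matmul(a1, b1)          # ...while C1 = A1 x B1 is computed locally
comm.join()                  # wait for the exchange before the remaining products

print(c1)  # [[19, 22], [43, 50]]
```

The point of the design is that the C1 computation is entirely local, so its cost can hide (part of) the communication time instead of adding to it.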
Square-Corner Partitioning: Overlapping Communication and Computation
[Figure: the overlappable sub-partitions in the Square-Corner partitioning.]
Results
MPI experiments: matrix-matrix multiplication
● N = 5000
● Bandwidth = 100 Mb/s
● Three identical nodes with CPU-limiting software (to achieve different power ratios)
● Node speed ratio expressed as S1:S2:S3; recall S1 ≥ S2 ≥ S3
● For simplicity, S2 = S3 and S1+S2+S3 = 100
Linear Array: Communication Time
The lower total volume of communication translates to lower communication times. Average reduction in communication time: 40%.
Linear Array: Execution Time
Lower communication times result in lower execution times. Overlapping further reduces execution time.
Fully Connected: Communication Time
The lower total volume of communication translates to lower communication times at the expected ratios (above 8:1:1).
Fully Connected: Execution Time
Overlapping further reduces execution time, and also broadens the range of ratios where the Square-Corner partitioning is faster from >8:1:1 to >3:1:1.
Results
● Similar results were observed for different bandwidth values and power ratios (including when S2 ≠ S3).
Conclusions
● We successfully applied the Square-Corner partitioning to three nodes.
● The Square-Corner partitioning approaches the theoretical lower bound for the TVC, unlike existing rectangular partitionings.
● On a fully connected network, the Square-Corner partitioning reduces the TVC and the communication time when the power ratio is ~8:1:1 or greater.
● On the linear array network it results in a lower TVC and communication time for all power ratios (subject to the restriction: greater than 2:1:1).
● These lower communication times directly reduce execution times.
● The possibility of overlapping communication and computation brings further reductions in execution time.
● Overlapping broadens the ratio range where the Square-Corner partitioning outperforms the Rectangular from >8:1:1 to >3:1:1.
● The Square-Corner partitioning is viable for experimentation as the top-level partitioning of a hierarchical partitioning algorithm on three clusters.
Future Work
● Apply the three-node algorithm to three clusters
● Investigate optimal overlapping of communications and computations
● Investigate 4+ nodes/clusters
Acknowledgements
This work was supported by: