1
Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
HeteroPar’06, Barcelona, Sept. 28, 2006
2
Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning: Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work
3
Motivation and Goals
● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.
● We seek to lower the Total Volume of Communication by using a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner partitioning is a viable technique for deployment on 2 interconnected clusters.
4
Background: Straight-Line Partitioning
● The Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S).
● The Lower Bound (L) of S is reached when all partitions are square.
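As a rough illustration of the two quantities on this slide, the sketch below computes the sum of half-perimeters S for a set of rectangles, and the lower bound L obtained by replacing each rectangle with a square of the same area. The function names and the unit-square example are assumptions made for illustration and do not come from the slides.

```python
import math

def sum_half_perimeters(rects):
    """rects is a list of (width, height) pairs; returns S."""
    return sum(w + h for w, h in rects)

def lower_bound(rects):
    """A rectangle of area a has half-perimeter >= 2*sqrt(a),
    with equality only when it is a square."""
    return sum(2.0 * math.sqrt(w * h) for w, h in rects)

# Hypothetical example: a unit square cut by one straight line
# into two rectangles of area 0.8 and 0.2.
straight = [(0.8, 1.0), (0.2, 1.0)]
S = sum_half_perimeters(straight)
L = lower_bound(straight)
print(f"S = {S:.3f}, L = {L:.3f}")
```

In this example S exceeds L because neither rectangle is a square.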
5
Straight-Line Partitioning
Figure: average and minimum values for two million randomly generated areas.
From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol. 12, No. 10, pp. 1033-1051.
6
Background: Straight-Line Partitioning, 2 Processors
The Straight-Line Partitioning cannot meet the lower bound, L.
7
Background: Straight-Line Partitioning, 2 Processors
Total Volume of Inter-Processor Communication (TVC) = N²
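One way to see where the N² figure can come from is to count, element by element, what each processor must receive. This is a minimal sketch under assumptions not stated on the slide: A, B and C are all split into the same two column blocks, and the owner of C[i][j] computes it. The function name is hypothetical.

```python
def straight_line_tvc(n, w1):
    """Count the elements of A that must cross the link when processor 1
    owns the first w1 columns of A, B and C and processor 2 owns the rest.

    Assumption: C[i][j] = sum_k A[i][k] * B[k][j] is computed by the owner
    of C[i][j]. Each processor then needs every row of A in full, but only
    its own columns of B, so only elements of A are exchanged.
    """
    if not 0 < w1 < n:
        raise ValueError("both processors must own at least one column")

    def owner(col):
        return 1 if col < w1 else 2

    received = 0
    for p in (1, 2):
        for i in range(n):
            for k in range(n):
                if owner(k) != p:
                    received += 1  # A[i][k] lives on the other processor
    return received

# The count is n*n regardless of where the line is drawn.
print(straight_line_tvc(8, 2), straight_line_tvc(8, 5))
```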
8
Introduction: Square-Corner Partitioning
9
Square-Corner Partitioning
The Square-Corner Partitioning can meet the lower bound, L.
10
Square-Corner Partitioning
Figure: average and minimum values for 2 million randomly generated areas, Power Ratio > 3:1.
Adapted from: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol. 12, No. 10, pp. 1033-1051.
11
Square-Corner Partitioning: Minimizing the TVC
Theorem:
● The Total Volume of Communication is minimized when the slower processor’s partition is a square.
● The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the Processor Power Ratio is > 3:1.
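The 3:1 threshold can be checked with a small calculation. The sketch below rests on assumptions the slides do not spell out: the slower processor owns an s × s square in one corner of A, B and C, partition areas are proportional to processor speed, and the owner of C[i][j] computes it. Under those assumptions the square-corner volume works out to 2·N·s. That expression is derived here, not quoted from the slides. All function names are hypothetical.

```python
import math

def tvc_straight_line(n):
    # Volume stated on the straight-line slide.
    return n * n

def tvc_square_corner(n, s):
    # The slower processor needs the rest of its s rows of A and of its
    # s columns of B: 2*s*(n - s). The faster processor needs the square
    # blocks of A and B held by the slower one: 2*s*s.
    return 2 * s * (n - s) + 2 * s * s  # simplifies to 2*n*s

def square_side(n, ratio):
    """Side of the slower processor's square when areas are proportional
    to speed; ratio is (fast speed) / (slow speed)."""
    return n * math.sqrt(1.0 / (ratio + 1.0))

n = 1000
for ratio in (1, 3, 5, 10):
    s = square_side(n, ratio)
    print(ratio, round(tvc_square_corner(n, s)), tvc_straight_line(n))
```

At a ratio of exactly 3:1 the square has side N/2 and the two volumes coincide. Above that ratio the square-corner volume is the smaller of the two.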
12
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s
● Lower TVC, Lower Communication Time, Lower Execution Time
● Average Reduction in Communication Time = 45%
● Average Reduction in Execution Time = 14%
13
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s
● Lower TVC, Lower Communication Time, Lower Execution Time
● Average Reduction in Communication Time = 44%
● Average Reduction in Execution Time = 10%
14
Square-Corner Partitioning Overlapping Communication and Computation A sub-partition of Processor 1’s C Partition is Immediately Calculable
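To make "immediately calculable" concrete, the sketch below enumerates the entries of C that Processor 1 could compute before any data arrives. It assumes Processor 2 owns an s × s square in the bottom-right corner of A, B and C, and that Processor 1 owns everything else. That layout and the function name are illustrative assumptions, not details given on the slide.

```python
def immediately_calculable(n, s):
    """Return the (i, j) positions of C owned by processor 1 whose inputs
    A[i][k] and B[k][j] (for all k) are also all owned by processor 1."""
    def owner(i, j):
        # Processor 2 owns the bottom-right s x s corner (assumed layout).
        return 2 if i >= n - s and j >= n - s else 1

    ready = []
    for i in range(n):
        for j in range(n):
            if owner(i, j) != 1:
                continue
            if all(owner(i, k) == 1 and owner(k, j) == 1 for k in range(n)):
                ready.append((i, j))
    return ready

ready = immediately_calculable(6, 2)
print(len(ready))  # size of the block that needs no communication
```

Under these assumptions the ready entries form the (N − s) × (N − s) block that shares no row or column with the square, so Processor 1 can work on it while the exchange is in flight.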
15
Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than the Straight-Line algorithm.
● Overlapping → 39% faster than the Straight-Line algorithm.

Algorithm                        Execution Time   Speedup
Straight-Line                    83 s             0.94
Square-Corner (No Overlapping)   69 s             1.13
Square-Corner (Overlapping)      51 s             1.53
Sequential                       78 s             N/A

MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
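The speedup and "faster than" figures on this slide follow from the execution times. The short check below assumes speedup means sequential time divided by parallel time, and that "X% faster" means the relative reduction in execution time against Straight-Line. Both definitions are inferred from the numbers, not stated on the slide.

```python
# Execution times in seconds, copied from the slide.
times = {
    "Straight-Line": 83.0,
    "Square-Corner (No Overlapping)": 69.0,
    "Square-Corner (Overlapping)": 51.0,
}
sequential = 78.0

def speedup(t):
    """Sequential time divided by parallel time (assumed definition)."""
    return sequential / t

def percent_faster(t, baseline):
    """Relative reduction in execution time versus a baseline, in percent."""
    return 100.0 * (baseline - t) / baseline

for name, t in times.items():
    print(f"{name}: speedup {speedup(t):.2f}, "
          f"{percent_faster(t, times['Straight-Line']):.0f}% faster than Straight-Line")
```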
16
Square-Corner Partitioning Two Cluster Architecture Total of 20 Homogeneous Nodes in 2 Clusters
17
Square-Corner Partitioning: Two Clusters

Algorithm       Execution Time   Speedup
Straight-Line   123 s            1.04
Square-Corner   115 s            1.11
Sequential      128 s            N/A

MM Multiplication, N=9000, Bandwidth=100Mb/s
All machines are homogeneous. One cluster of 4, one cluster of 16.
18
Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication, provided the processor power ratio is > 3:1.
● Overlapping Communication and Computation can bring further reductions in Execution Time.
● The Square-Corner Partitioning is viable on Two Clusters.
19
Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors.
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters
20
Acknowledgements This work was supported by: