Matrix Multiplication on Two Interconnected Processors Brett A. Becker and Alexey Lastovetsky Heterogeneous Computing Laboratory School of Computer Science.

Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory
School of Computer Science and Informatics
University College Dublin
HeteroPar’06, Barcelona, Sept. 28, 2006

Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning
  - Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work

Motivation and Goals
● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.
● We seek to lower the Total Volume of Communication by using a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner Partitioning is a viable technique for deployment on 2 interconnected clusters.

Background: Straight-Line Partitioning
The Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S) of the partitions.
The Lower Bound (L) of S is attained when all partitions are square.

Straight-Line Partitioning From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051. Average and Minimum values of for two million randomly generated areas

Background: Straight-Line Partitioning, 2 Processors
The Straight-Line Partitioning cannot meet the lower bound, L.

Background: Straight-Line Partitioning, 2 Processors
Total Volume of Inter-Processor Communication (TVC) = N^2
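One way to see where the N^2 figure can come from is sketched below. It is a reconstruction, not taken from the slides, and it assumes that A, B and C are all split by the same vertical line, with processor P1 holding the left w columns and P2 the remaining N - w columns (w is a hypothetical symbol introduced here). To compute its columns of C, each processor needs all of A but only its own columns of B, so only the missing part of A has to be received.

```latex
\begin{align*}
\text{elements of } A \text{ received by P1} &= N\,(N - w) \\
\text{elements of } A \text{ received by P2} &= N\,w \\
\mathrm{TVC}_{\text{straight-line}} &= N\,(N - w) + N\,w = N^{2}
\end{align*}
```

Under these assumptions the total does not depend on w, so moving the dividing line to balance the load does not reduce the communication volume.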

Introduction: Square-Corner Partitioning

Square-Corner Partitioning
The Square-Corner Partitioning can meet the lower bound, L.

Square-Corner Partitioning
Average and Minimum values of for 2 million randomly generated areas
Power Ratio > 3:1
Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.

Square-Corner Partitioning: Minimizing the TVC
The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is > 3:1.
Theorem: The Total Volume of Communication is minimized when the slower processor’s partition is a square.
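The 3:1 threshold can be checked numerically with a small brute-force count. The sketch below is not the authors' code. It assumes that A, B and C share one partitioning, that computing C[i, j] needs row i of A and column j of B, and that the TVC is the number of matrix elements a processor needs but does not own. The names `tvc`, `straight_line` and `square_corner`, and the choice of the bottom-right corner for the square, are hypothetical.

```python
def straight_line(w):
    """Owner map: processor 0 holds columns [0, w), processor 1 the rest."""
    return lambda i, j: 0 if j < w else 1


def square_corner(n, s):
    """Owner map: processor 1 (the slower one) holds an s x s square in the
    bottom-right corner, processor 0 holds everything else."""
    return lambda i, j: 1 if (i >= n - s and j >= n - s) else 0


def tvc(n, owner):
    """Count elements of A and B that each processor needs but does not own."""
    total = 0
    for p in (0, 1):
        cells = [(i, j) for i in range(n) for j in range(n) if owner(i, j) == p]
        rows = {i for i, _ in cells}   # rows of A needed by processor p
        cols = {j for _, j in cells}   # columns of B needed by processor p
        total += sum(1 for i in rows for k in range(n) if owner(i, k) != p)
        total += sum(1 for j in cols for k in range(n) if owner(k, j) != p)
    return total


if __name__ == "__main__":
    n = 12
    print("straight-line:", tvc(n, straight_line(4)))
    for s in (3, 6, 8):
        ratio = (n * n - s * s) / (s * s)   # area ratio, fast : slow
        print(f"square-corner s={s} ratio={ratio:.2f}:1 ->", tvc(n, square_corner(n, s)))
```

Under these assumptions the square-corner count works out to 2·N·s for a square of side s, which falls below N^2 exactly when s < N/2, that is, when the slower processor holds less than a quarter of the matrix and the area ratio exceeds 3:1.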

Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Communication Time = 45%
Average Reduction in Execution Time = 14%

Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Communication Time = 44%
Average Reduction in Execution Time = 10%

Square-Corner Partitioning: Overlapping Communication and Computation
A sub-partition of Processor 1’s C partition is immediately calculable.
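The idea can be illustrated with a small ownership check. This is a sketch under assumptions not stated on the slide: Processor 1 is taken to be the processor holding everything outside the square, the square of side s is placed in the bottom-right corner, and all function names are hypothetical. A block of C is "immediately calculable" when every row of A and every column of B it needs is already held locally.

```python
def owned_by_p1(n, s, i, j):
    """True if element (i, j) lies outside Processor 2's corner square."""
    return not (i >= n - s and j >= n - s)


def immediately_calculable(n, s):
    """Row and column ranges of the C block that Processor 1 can compute
    before receiving anything (assumed layout: square in bottom-right)."""
    return range(0, n - s), range(0, n - s)


def needs_no_remote_data(n, s, rows, cols):
    """C[i, j] needs all of row i of A and all of column j of B."""
    a_local = all(owned_by_p1(n, s, i, k) for i in rows for k in range(n))
    b_local = all(owned_by_p1(n, s, k, j) for j in cols for k in range(n))
    return a_local and b_local


if __name__ == "__main__":
    n, s = 10, 3
    rows, cols = immediately_calculable(n, s)
    print(len(rows) * len(cols), "elements of C need no communication:",
          needs_no_remote_data(n, s, rows, cols))
```

Processor 1 can start on this (N - s) x (N - s) block while the remaining data is still in transit, which is what makes overlapping communication and computation possible in this layout.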

Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than Straight-Line algorithm.
● Overlapping → 39% faster than Straight-Line algorithm.

Algorithm                        Execution Time   Speedup
Straight-Line                    83s              0.94
Square-Corner (No Overlapping)   69s              1.13
Square-Corner (Overlapping)      51s              1.53
Sequential                       78s              N/A

MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
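The figures on this slide are mutually consistent if speedup is read as sequential time divided by parallel time, and "faster than Straight-Line" as the relative reduction in execution time. That reading is an assumption, as are the helper names below; the timings are the ones reported on the slide.

```python
SEQUENTIAL_S = 78.0  # sequential execution time from the slide, in seconds

TIMES_S = {
    "Straight-Line": 83.0,
    "Square-Corner (No Overlapping)": 69.0,
    "Square-Corner (Overlapping)": 51.0,
}


def speedup(parallel_s, sequential_s=SEQUENTIAL_S):
    """Speedup relative to the sequential run."""
    return sequential_s / parallel_s


def reduction_vs(baseline_s, time_s):
    """Fractional reduction in execution time relative to a baseline."""
    return (baseline_s - time_s) / baseline_s


if __name__ == "__main__":
    baseline = TIMES_S["Straight-Line"]
    for name, t in TIMES_S.items():
        print(f"{name}: speedup {speedup(t):.2f}, "
              f"{100 * reduction_vs(baseline, t):.0f}% less time than Straight-Line")
```

Note that the Straight-Line speedup is below 1: with these parameters the straight-line run is slower than the sequential one.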

Square-Corner Partitioning: Two-Cluster Architecture
Total of 20 Homogeneous Nodes in 2 Clusters

Square-Corner Partitioning: Two Clusters

Algorithm       Execution Time   Speedup
Straight-Line   123s             1.04
Square-Corner   115s             1.11
Sequential      128s             N/A

MM Multiplication, N=9000, Bandwidth=100Mb/s
All machines are homogeneous. One cluster of 4, one cluster of 16.

Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication provided the processor power ratio is > 3:1
● The possibility of Overlapping Communication and Computation can bring further reductions in Execution Time
● The Square-Corner Partitioning is viable on Two Clusters

Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters

Acknowledgements This work was supported by:
