A Performance Model of Many-to-One Collective Communications for Parallel Computing Alexey Lastovetsky, Maureen O’Flynn UCD School of Computer Science and Informatics Belfield, Dublin 4, Ireland alexey.lastovetsky@ucd.ie, maureen.oflynn@ucd.ie 23 April 2019
Objectives Goal: prediction of the execution time of MPI collective communications on a heterogeneous cluster based on a switched Ethernet Background: performance models of single point-to-point, simultaneous independent point-to-point, one-to-many communications Observation: a significant increase in the execution time of many-to-one communication for medium-sized messages on all platforms and MPI implementations Problem: to model many-to-one communications for medium-sized messages
Performance models for point-to-point and one-to-many communications point-to-point: - execution time - message size - fixed delays - variable delays - transmission rate one-to-many:
Many-to-one collective communications: non-linear and non-deterministic escalations 0.25 0.2 Seconds 0.04 .00001 0 10 20 30 40 50 60 70 80 90 Message size in KB
Parameters of many-to-one model for medium-sized messages M1 MC M2 message size where escalations begin escalations stop occuring message size from which escalations occur with 100% certainty probabilities of escalations
Probability of escalation Discrete constant levels of escalation of values of 40, 200 and 250 times Probability of escalation to level is found
Many-to-one model for small messages
Many-to-one model for large messages
Multi-spectral satellite application A typical real-time satellite imaging application (512x512 bytes) A sequence of raw data images divided into partitions for parallel processing by a cluster M2 Mc M1 n1 n2
Redesigning Application Calculate the number of sub-partitions m of a partition of the medium size M so that: Replace MPI_Gather with sequence of MPI_Gather for smaller messages
Conclusion Results previously undocumented non-linear non-deterministic behaviour for medium messages is analysed many-to-one model is built on the empirical data and point-to-point model parallel application is redesigned in accordance with many-to-one model The work was supported by Science Foundation Ireland.