1
Optimizing MPI collectives for SMP clusters
Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics
2
Project: Optimizing Performance of Open-Source MPI Implementations for Linux on POWER Processor Clusters. Goal: increasing the efficiency of parallel applications that run on POWER clusters under Linux and are developed with open-source MPI implementations.
3
Analyzing MPI implementations
The main targets are the collective operations, because they are the most time-consuming procedures in MPI (Rolf Rabenseifner, Automatic MPI Counter Profiling, 42nd CUG Conference).
4
Performance evaluation model
5
Alltoall Ring Algorithm
[Diagram: the ring algorithm on four processes; steps 1-3, with each process sending one data block per step]
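As a concrete reference, here is a minimal C sketch of the ring alltoall (the function name and MPI_BYTE buffer layout are illustrative, not taken from the project's code): at step k, rank i sends its block destined for rank (i+k) mod p and receives its own block from rank (i-k+p) mod p.

```c
#include <mpi.h>
#include <string.h>

/* Ring alltoall sketch: each process owns p blocks of `blksize` bytes in
 * sendbuf; block j is destined for rank j.  At step k every rank i sends
 * block (i+k)%p to rank (i+k)%p and receives its own block from rank
 * (i-k+p)%p.  Buffer names and the MPI_BYTE layout are illustrative. */
void ring_alltoall(const char *sendbuf, char *recvbuf,
                   int blksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* the local block needs no communication */
    memcpy(recvbuf + rank * blksize, sendbuf + rank * blksize, blksize);

    for (int k = 1; k < p; k++) {
        int dst = (rank + k) % p;        /* who gets our block this step  */
        int src = (rank - k + p) % p;    /* whose block arrives this step */
        MPI_Sendrecv(sendbuf + dst * blksize, blksize, MPI_BYTE, dst, 0,
                     recvbuf + src * blksize, blksize, MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```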
6
Hockney model: T_transfer = α + β·n
The Hockney model estimates the cost of a message transfer using the following parameters: α – latency (the time to prepare data for transfer); β – the time to transfer one byte of data between two processors (i.e., 1/β is the network bandwidth); n – message size in bytes.
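A tiny helper makes the model concrete; the numbers in the comment are the Myrinet figures measured later in this deck, and the function name is mine:

```c
/* Hockney model: estimated transfer time for an n-byte message.
 * Example (Myrinet figures from this deck): alpha = 4e-5, beta = 2.6e-8,
 * n = 1e6 bytes gives T = 4e-5 + 2.6e-2, roughly 0.026 seconds. */
double hockney_time(double alpha, double beta, double n)
{
    return alpha + beta * n;
}
```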
7
Cost of alltoall Ring algorithm
T = (p−1)·α + ((p−1)/p)·n·β + C_node + C_link + C_switch, where: α – latency (or startup time) per message, independent of message size; β – transfer time per byte; n – the number of bytes transferred; C_node – node contention overhead (when more than one node tries to send large messages to the same node); C_link – link contention overhead (when more than one communication uses the same links in the network); C_switch – switch contention overhead (when the amount of data passing through the switch exceeds the switch's capacity). The first two terms follow from the algorithm's p−1 steps, each transferring n/p bytes per process. This model works well for clusters with single-processor nodes.
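The contention-free part of the formula, written out in C as a quick check (the function name is mine; the contention terms are left out because the deck treats them as empirically measured overheads):

```c
/* Estimated ring-alltoall time under the extended Hockney model,
 * ignoring the contention terms C_node, C_link and C_switch. */
double ring_alltoall_time(int p, double n, double alpha, double beta)
{
    return (p - 1) * alpha + ((double)(p - 1) / p) * n * beta;
}
```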
8
SMP cluster. Currently only two levels of the cluster architecture are considered: data transfer inside an SMP node over shared memory, and data transfer between SMP nodes over the network. [Diagram: two SMP nodes, each with four CPUs and local RAM, connected by a network]
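For illustration, the two levels can be exposed as MPI communicators. The sketch below uses MPI-3's MPI_Comm_split_type, which postdates this 2005 work; implementations of that era had to detect node boundaries by other means:

```c
#include <mpi.h>

/* Build the two levels the deck describes: `node` groups the ranks that
 * share memory on one SMP node; `leaders` connects one representative
 * (node-local rank 0) of every node. */
void split_two_levels(MPI_Comm world, MPI_Comm *node, MPI_Comm *leaders)
{
    int wrank, nrank;
    MPI_Comm_rank(world, &wrank);
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, wrank,
                        MPI_INFO_NULL, node);
    MPI_Comm_rank(*node, &nrank);
    /* rank 0 of each node joins the leader communicator */
    MPI_Comm_split(world, nrank == 0 ? 0 : MPI_UNDEFINED, wrank, leaders);
}
```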
9
Challenges
Variable cost of point-to-point operations,
Increased role of the placement of processes on network hosts,
Ineffective use of shared memory in some implementations
10
Point-to-point operations
11
Applying Hockney model. Shared memory vs. Network…
POWER5 shared memory: α_sh_mem = 7×10⁻⁶, β_sh_mem = 8.4×10⁻¹⁰. Myrinet network: α_network = 4×10⁻⁵, β_network = 2.6×10⁻⁸.
12
Applying Hockney model. Shared memory vs. Network
P-III Xeon shared memory: α_sh_mem = 1.3×10⁻⁵, β_sh_mem = 8.3×10⁻⁹. Gigabit Ethernet network: α_network = 5.9×10⁻⁵, β_network = 1.9×10⁻⁸.
13
Applying Hockney model. Simultaneous transfers over network
Gigabit Ethernet:
pairs       1         2         3         4
α_network   5.88E-05  7.18E-05  8.94E-05  10.3E-05
β_network   1.93E-08  3.30E-08  4.52E-08  5.74E-08
14
Applying Hockney model. Simultaneous transfers over shared memory
P-III Xeon shared memory:
data flows  2         4         6         8
α_sh_mem    1.3E-05   1.4E-05   –         –
β_sh_mem    0.83E-08  1.28E-08  1.96E-08  2.56E-08
15
Collective operations and processes placement
16
Bcast operation. Binomial tree algorithm
[Diagram: binomial tree broadcast; each node labeled k for the k-th process, with the 1st-, 2nd- and 3rd-step transfers marked]
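A minimal C sketch of this broadcast, following the classic binomial-tree formulation (root fixed at rank 0 for brevity; this is not the project's actual code):

```c
#include <mpi.h>

/* Binomial-tree broadcast from rank 0, as in the diagram: in each round
 * every rank that already holds the data forwards it `mask` ranks ahead. */
void binomial_bcast(void *buf, int n, MPI_Comm comm)
{
    int rank, p, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* receive phase: wait for the message from our tree parent */
    for (mask = 1; mask < p; mask <<= 1) {
        if (rank & mask) {
            MPI_Recv(buf, n, MPI_BYTE, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
    }
    /* send phase: forward to our tree children */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (rank + mask < p)
            MPI_Send(buf, n, MPI_BYTE, rank + mask, 0, comm);
    }
}
```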
17
Bcast binomial tree algorithm. Two-level cluster architecture…
Processes within an SMP node can interact through shared memory; processes running on different nodes must use the network for data transfer. [Diagram: processes spread over Node 1, Node 2 and Node 3]
18
Bcast binomial tree algorithm. Two-level cluster architecture…
Standard process numbering. [Diagram: binomial tree with the 1st-, 2nd- and 3rd-step transfers crossing node boundaries] On each step, data is sent over the network.
19
Bcast binomial tree algorithm. Two-level cluster architecture…
More efficient process numbering. [Diagram: binomial tree with the 1st-, 2nd- and 3rd-step transfers] On the 3rd step, data is transferred only over shared memory.
20
Bcast binomial tree algorithm. Two-level cluster architecture…
Optimized algorithm: first use the binomial tree algorithm to deliver the message to all network nodes (1st stage – transfer over the network), then use the binomial tree algorithm to deliver the message to all processes on every SMP node (2nd stage – transfer over shared memory), as the sketch below shows. [Diagram: two-stage binomial tree]
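With the node and leader communicators from the earlier sketch, the two-stage scheme reduces to two nested broadcasts (a simplification assuming the global root is rank 0 of both communicators):

```c
#include <mpi.h>

/* Two-stage broadcast: stage 1 moves the message between node leaders
 * over the network, stage 2 fans it out inside each SMP node over
 * shared memory.  `node` and `leaders` come from split_two_levels()
 * above; non-leader ranks hold MPI_COMM_NULL in `leaders`. */
void two_level_bcast(void *buf, int n, MPI_Comm node, MPI_Comm leaders)
{
    if (leaders != MPI_COMM_NULL)          /* stage 1: across nodes  */
        MPI_Bcast(buf, n, MPI_BYTE, 0, leaders);
    MPI_Bcast(buf, n, MPI_BYTE, 0, node);  /* stage 2: within a node */
}
```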
21
Bcast binomial tree algorithm. Two-level cluster architecture
Test results: 26% acceleration
22
Bcast operation
Existing algorithms: binomial tree algorithm, scatter-gather algorithm, scatter-ring algorithm
23
Bcast operation. Different processes placement…
[Diagram: two process-placement topologies; the same processes 1-7 distributed differently across the network nodes]
24
Bcast operation. Performance of the different algorithms
25
Estimating the cost of collective operations
26
Estimating the cost of collective communication algorithm…
Assumptions: all cluster hosts are identical; network connections between cluster hosts are symmetric
27
Estimating the cost of collective communication algorithm…
Input data: costs of point-to-point operations over the network as a function of the number of simultaneous transfers; costs of point-to-point operations over shared memory as a function of the number of simultaneous transfers
28
Estimating the cost of collective communication algorithm…
Calculate the number of steps. For each step, determine: which processes take part in transfers, which resources are used by each transfer, and the cost of each transfer. The cost of the algorithm is taken as the sum, over all steps, of the maximum transfer cost at each step (see the sketch below).
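A compact sketch of this cost model (the struct layout and function names are mine; each transfer's α/β are assumed to be looked up from the measured tables above according to the resource used and the concurrency level):

```c
#include <stddef.h>

/* Cost model from the deck: an algorithm is a sequence of steps, each a
 * set of concurrent transfers; the algorithm's cost is the sum over
 * steps of the most expensive transfer in that step. */
struct transfer { double alpha, beta; size_t nbytes; };

double step_cost(const struct transfer *t, int ntransfers)
{
    double worst = 0.0;
    for (int i = 0; i < ntransfers; i++) {
        double c = t[i].alpha + t[i].beta * (double)t[i].nbytes;
        if (c > worst) worst = c;
    }
    return worst;
}

double algorithm_cost(const struct transfer *const *steps,
                      const int *ntransfers, int nsteps)
{
    double total = 0.0;
    for (int s = 0; s < nsteps; s++)
        total += step_cost(steps[s], ntransfers[s]);
    return total;
}
```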
29
Estimating the cost of collective communication algorithm…
30
Estimating the cost of collective communication algorithm
31
Effective use of shared memory
32
Using shared memory. Standard algorithms
When the same data must be transferred from one process to several others, the data is sent to each process successively by separate operations, and a separate shared "memory window" is used for each pair of communicating processes. [Diagram: one SMP node with four CPUs and pairwise windows in RAM]
33
Using shared memory. Binomial tree Bcast algorithm
Operation cost (in shared memory transfers): T_Bcast = (p−1) · (α_sh_mem + β_sh_mem · n), where p is the number of processes and n is the message size; all p−1 point-to-point copies pass through memory. [Diagram: binomial tree inside one node, with step 1 and step 2 transfers marked]
34
Using shared memory. Optimized algorithms
When the same data must be transferred from one process to several others, it is written only once: a single shared memory window (or a set of windows, one per process) is used for the transfer, and every receiver reads the data from it. [Diagram: one SMP node with a single shared window in RAM]
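A sketch of the single-window idea expressed with MPI-3 shared windows, which postdate this work (the original implementation used its own shared-memory segments):

```c
#include <mpi.h>
#include <string.h>

/* Single-window broadcast inside one SMP node: the root copies the
 * message into the shared window once, and every other rank on the
 * node reads it directly.  `node` is a node-local communicator. */
void shm_window_bcast(char *buf, int n, MPI_Comm node)
{
    int rank, disp;
    char *win_mem;
    MPI_Aint sz;
    MPI_Win win;

    MPI_Comm_rank(node, &rank);
    MPI_Win_allocate_shared(rank == 0 ? n : 0, 1, MPI_INFO_NULL,
                            node, &win_mem, &win);
    /* everyone obtains a direct pointer to the root's segment */
    MPI_Win_shared_query(win, 0, &sz, &disp, &win_mem);

    MPI_Win_fence(0, win);
    if (rank == 0)
        memcpy(win_mem, buf, n);   /* one write into shared memory */
    MPI_Win_fence(0, win);
    if (rank != 0)
        memcpy(buf, win_mem, n);   /* concurrent reads by the rest */
    MPI_Win_free(&win);
}
```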
35
Using shared memory. Optimized Bcast algorithm
Operation cost (in shared memory transfers): T_Bcast = (p/2) · (α_sh_mem + β_sh_mem · n), where p is the number of processes and n is the message size. [Diagram: step 1 transfers inside one node]
36
Using shared memory. Comparing algorithms performance…
Theoretical estimate: 33% faster (e.g., for p = 4: p−1 = 3 transfers versus p/2 = 2)
37
Using shared memory. Comparing algorithms performance…
Test results: 31% faster
38
Summary
An effective implementation should take into account: the variable cost of point-to-point operations, the placement of processes on network hosts, and the relative costs of the existing algorithms. An effective implementation should also use the hardware resources as fully as possible.
39
Optimized bcast algorithm. Estimated performance
40
Optimized bcast algorithm. Experimental data
41
Publications
SCICOMP 11, Edinburgh, Scotland, 2005
European Power.org Community Conference, Barcelona, 2005
JSCC Power.org technical seminar, Moscow, 2005
Microsoft Technologies in Programming Theory and Practice, UNN, 2005
42
Research group
Gergel V.P., professor; Grishagin V.A., associate professor; Belov S.A., associate professor; Linev A.V.; Gergel A.V.; Grishagin A.V.; Kurylev A.L.; Senin A.V.
This work is partly supported by the IBM Faculty Awards for Innovation Program
43
Contacts
603950, Nizhni Novgorod, Gagarina av., 23, Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics
Tel: +7 (8312)
44
Thank you for your attention
Questions, remarks, comments