On Optimizing Collective Communication
UT/Texas Advanced Computing Center and UT/Computer Science
Avi Purkayastha, Ernie Chan, Marcel Heinrich, Robert van de Geijn
ScicomP 10, August 9-13, Austin, TX
Outline
– Model of Parallel Computation
– Collective Communications
– Algorithms
– Performance Results
– Conclusions and Future work
Model of Parallel Computation
Target Architectures
– distributed-memory parallel architectures
Indexing
– p nodes, indexed 0 … p-1
– each node has one computational processor
[Figure: the nodes, indexed 0 … 8, are often logically viewed as a linear array]
Model of Parallel Computation
Logically Fully Connected
– a node can send directly to any other node
Communicating Between Nodes
– a node can simultaneously receive and send
Network Conflicts
– a conflict occurs when a message must be sent over a path that is already completely occupied by another message
Model of Parallel Computation
Cost of Communication
– sending a message of length n between any two nodes costs α + nβ
– α is the startup cost (latency)
– β is the transmission cost (bandwidth)
Cost of Computation
– the cost to perform one arithmetic operation is γ
– reduction operations: sum, prod, min, max
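To make the model concrete, the sketch below (not from the slides) encodes α, β, and γ as C constants and evaluates the two cost formulas; the numeric values are hypothetical placeholders, not measurements of any machine.

```c
#include <stdio.h>

/* Hypothetical machine parameters (placeholders, not measurements).
   gamma_ is named with an underscore to avoid the legacy gamma()
   function declared by some math.h headers. */
static const double alpha  = 1.0e-5;  /* startup cost per message, seconds   */
static const double beta   = 1.0e-9;  /* transmission cost per item, seconds */
static const double gamma_ = 5.0e-10; /* cost per arithmetic operation       */

/* Time to send a message of n items between any two nodes: alpha + n*beta. */
static double send_cost(long n) { return alpha + (double)n * beta; }

/* Time to combine two local vectors of n items (sum/prod/min/max): n*gamma. */
static double reduce_cost(long n) { return (double)n * gamma_; }

int main(void) {
    for (long n = 1; n <= 1000000; n *= 1000)
        printf("n = %7ld: send = %.3e s, reduce = %.3e s\n",
               n, send_cost(n), reduce_cost(n));
    return 0;
}
```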
Outline
– Model of Parallel Computation
– Collective Communications
– Algorithms
– Performance Results
– Conclusions and Future work
Collective Communications
– Broadcast
– Reduce(-to-one)
– Scatter
– Gather
– Allgather
– Reduce-scatter
– Allreduce
Lower Bounds (Latency)
In one communication step, the number of nodes the data has reached can at most double, so each operation requires at least ⌈log₂(p)⌉ startups:
– Broadcast: ⌈log₂(p)⌉ α
– Reduce(-to-one): ⌈log₂(p)⌉ α
– Scatter/Gather: ⌈log₂(p)⌉ α
– Allgather: ⌈log₂(p)⌉ α
– Reduce-scatter: ⌈log₂(p)⌉ α
– Allreduce: ⌈log₂(p)⌉ α
Lower Bounds (Bandwidth)
– Broadcast: nβ
– Reduce(-to-one): nβ
– Scatter/Gather: ((p-1)/p) nβ
– Allgather: ((p-1)/p) nβ
– Reduce-scatter: ((p-1)/p) nβ
– Allreduce: 2 ((p-1)/p) nβ
Outline
– Model of Parallel Computation
– Collective Communications
– Algorithms
– Performance Results
– Conclusions and Future work
Motivating Example
We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation
– Short-vector case
– Long-vector case
– Hybrid algorithms
Short-vector case
Primary concern:
– algorithms must have low latency cost
Secondary concerns:
– algorithms must work for an arbitrary number of nodes (in particular, not just for power-of-two numbers of nodes)
– algorithms should avoid network conflicts (not absolutely necessary, but nice if possible)
Minimum-Spanning Tree based algorithms
We will show how the following building blocks
– broadcast/reduce
– scatter/gather
can be implemented using minimum spanning trees embedded in the logical linear array, while attaining
– minimal latency
– implementation for arbitrary numbers of nodes
– no network conflicts
General principles
– the message starts on one node (the root)
– divide the logical linear array in half
– send the message to the half of the network that does not contain the root
– continue recursively in each of the two halves
General principles
The demonstrated technique directly applies to
– broadcast
– scatter
The technique can be applied in reverse to
– reduce
– gather
General principles
This technique can be used to implement the following building blocks
– broadcast/reduce
– scatter/gather
using a minimum spanning tree embedded in the logical linear array, while attaining
– minimal latency
– implementation for arbitrary numbers of nodes
– no network conflicts? Yes, on linear arrays
(A sketch of the MST broadcast follows.)
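Below is a minimal sketch of the recursive-halving MST broadcast just described, written against standard MPI point-to-point calls. It is not the authors' library code; the function name mst_bcast and the choice of the first node of the other half as its new root are illustrative.

```c
#include <mpi.h>

/* MST broadcast over the node range [left, right) of comm, where `root`
   (a rank inside the range) holds the message. Each call splits the range
   in half, sends the message across the split once, and recurses, giving
   ceil(log2(p)) steps with no network conflicts on a linear array. */
void mst_bcast(void *buf, int count, MPI_Datatype type,
               int root, int left, int right, MPI_Comm comm)
{
    if (right - left <= 1) return;            /* a single node: done */

    int me, mid = left + (right - left) / 2;  /* split [left, right) */
    MPI_Comm_rank(comm, &me);

    /* New root for the half that does not contain the current root. */
    int dest = (root < mid) ? mid : left;

    if (me == root)
        MPI_Send(buf, count, type, dest, 0, comm);
    else if (me == dest)
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);

    /* Continue recursively in the half that contains this node. */
    if (me < mid)
        mst_bcast(buf, count, type, (root < mid) ? root : dest, left, mid, comm);
    else
        mst_bcast(buf, count, type, (root < mid) ? dest : root, mid, right, comm);
}
```

Calling mst_bcast(buf, n, MPI_DOUBLE, root, 0, p, MPI_COMM_WORLD) broadcasts from root in ⌈log₂(p)⌉ steps; reversing the sends and receives yields the corresponding reduce, and forwarding only the half of the buffer destined for the other side at each split turns it into the scatter.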
Reduce-scatter (short vector)
– Step 1: Reduce (to one node)
– Step 2: Scatter
[Figure: Reduce. Before, each node holds a full input vector; after, the root holds the element-wise sum.]
Cost of Minimum-Spanning Tree Reduce
– number of steps: ⌈log₂(p)⌉
– cost per step: α + nβ + nγ
– total: ⌈log₂(p)⌉ (α + nβ + nγ)
Notice: attains the lower bound for the latency component
[Figure: Scatter. Before, the root holds the full vector; after, each node holds its own n/p piece.]
Cost of Minimum-Spanning Tree Scatter
Assumption: power-of-two number of nodes
– number of steps: log₂(p)
– cost of step k: α + (n / 2^k) β
– total: log₂(p) α + ((p-1)/p) nβ
Notice: attains the lower bounds for the latency and bandwidth components
Cost of Reduce/Scatter Reduce-scatter
Assumption: power-of-two number of nodes
– reduce: log₂(p) (α + nβ + nγ)
– scatter: log₂(p) α + ((p-1)/p) nβ
– total: 2 log₂(p) α + (log₂(p) + (p-1)/p) nβ + log₂(p) nγ
Notice: does not attain the lower bound for the latency or bandwidth components
(A sketch of this reduce/scatter composition follows.)
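The composition just costed can be sketched directly with the two standard MPI collectives (a good MPI implementation uses MST algorithms for both on short vectors). This is illustrative code, not the authors' library; it assumes a sum reduction on doubles and that n is divisible by p.

```c
#include <mpi.h>
#include <stdlib.h>

/* Reduce-scatter as an MST reduce to node 0 followed by an MST scatter.
   recvbuf must hold n/p doubles on every node. */
void reduce_then_scatter(const double *sendbuf, double *recvbuf,
                         int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Scratch space for the full reduced vector, needed only at the root. */
    double *tmp = (rank == 0) ? malloc(n * sizeof(double)) : NULL;

    /* Step 1: reduce all contributions to node 0. */
    MPI_Reduce(sendbuf, tmp, n, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* Step 2: scatter the reduced vector, n/p items to each node. */
    MPI_Scatter(tmp, n / p, MPI_DOUBLE, recvbuf, n / p, MPI_DOUBLE, 0, comm);

    free(tmp);   /* free(NULL) is a no-op on the non-root nodes */
}
```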
Recap: Reduce, Scatter, Broadcast, Gather, Allreduce, Reduce-scatter, Allgather
A building block approach to library implementation
– Short-vector case
– Long-vector case
– Hybrid algorithms
Long-vector case
Primary concerns:
– algorithms must have low cost due to vector length
– algorithms must avoid network conflicts
Secondary concerns:
– algorithms must work for an arbitrary number of nodes (in particular, not just for power-of-two numbers of nodes)
Long-vector building blocks
We will show how the following building blocks
– allgather/reduce-scatter
can be implemented using "bucket" algorithms, while attaining
– minimal cost due to the length of the vectors
– implementation for arbitrary numbers of nodes
– no network conflicts
A logical ring can be embedded in a physical linear array
General principles
This technique can be used to implement the following building blocks
– allgather/reduce-scatter
Subvectors of data are passed around the ring at each step, like buckets, until all data has been collected (allgather) or reduced (reduce-scatter), attaining
– minimal cost due to the length of the vectors
– implementation for arbitrary numbers of nodes
– no network conflicts
Reduce-scatter (bucket algorithm)
[Figure: before, each node holds a full input vector; after, node i holds the i-th block of the element-wise sum.]
Cost of Bucket Reduce-scatter
– number of steps: p - 1
– cost per step: α + (n/p) β + (n/p) γ
– total: (p - 1) α + ((p-1)/p) nβ + ((p-1)/p) nγ
Notice: attains the lower bounds for the bandwidth and computation components
(A sketch follows.)
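Here is a minimal sketch of the bucket reduce-scatter just costed, using a logical ring: at each of the p-1 steps a node passes one partially reduced block of n/p items to its right neighbor and folds in the block arriving from its left neighbor. Illustrative code, not the authors' library; it assumes a sum reduction on doubles and n divisible by p.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Bucket (ring) reduce-scatter: vec holds n = p*nb doubles and is used as
   workspace; out receives this node's fully reduced block of nb items. */
void bucket_reduce_scatter(double *vec, double *out, int n, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int nb = n / p;                       /* block ("bucket") size */
    int right = (me + 1) % p;             /* ring neighbors        */
    int left  = (me - 1 + p) % p;
    double *in = malloc(nb * sizeof(double));

    for (int step = 0; step < p - 1; step++) {
        int sblk = (me - step + p) % p;          /* block passed on */
        int rblk = (me - step - 1 + 2 * p) % p;  /* block arriving  */

        /* Simultaneous send and receive, as the model allows. */
        MPI_Sendrecv(vec + sblk * nb, nb, MPI_DOUBLE, right, 0,
                     in, nb, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);

        for (int j = 0; j < nb; j++)             /* fold into local copy */
            vec[rblk * nb + j] += in[j];
    }

    /* After p-1 steps, this node owns the reduced block (me + 1) % p. */
    memcpy(out, vec + ((me + 1) % p) * nb, nb * sizeof(double));
    free(in);
}
```

Each step costs α + (n/p)β + (n/p)γ; passing fully owned blocks around the same ring without the reduction loop gives the bucket allgather.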
Recap: Reduce-scatter, Scatter, Allgather, Gather, Allreduce, Reduce, Broadcast
A building block approach to library implementation
– Short-vector case
– Long-vector case
– Hybrid algorithms
Hybrid algorithms (intermediate-length case)
Algorithms must balance latency cost, cost due to vector length, and network conflicts
General principles
View the p nodes as a 2-dimensional mesh
– p = r × c, with row and column dimensions
Perform the operation in each dimension
– e.g., a reduce-scatter within columns, followed by a reduce-scatter within rows
General principles
– many different combinations of short- and long-vector algorithms are possible
– generally, try to reduce the vector length passed to the next dimension
– more than 2 dimensions can be used
Example: 2D Scatter/Allgather Broadcast
– Step 1: scatter in columns
– Step 2: scatter in rows
– Step 3: allgather in rows
– Step 4: allgather in columns
(A sketch of the four phases follows.)
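Here is a minimal sketch of the four phases, expressed with standard MPI collectives over row and column communicators built by MPI_Comm_split. It is illustrative, not the authors' library code: the name bcast_2d is hypothetical, the root is node (0,0), and n is assumed divisible by r and by r*c.

```c
#include <mpi.h>
#include <stdlib.h>

/* 2D scatter/allgather broadcast of n doubles from node (0,0) on an
   r x c mesh (p = r*c nodes, row-major ranks in comm). */
void bcast_2d(double *buf, int n, int r, int c, MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);
    int myrow = me / c, mycol = me % c;

    MPI_Comm row, col;
    MPI_Comm_split(comm, myrow, mycol, &row);  /* my row,    ranked by column */
    MPI_Comm_split(comm, mycol, myrow, &col);  /* my column, ranked by row    */

    int nr  = n / r;       /* block per row after the column scatter */
    int nrc = nr / c;      /* block per node after the row scatter   */
    double *piece = malloc(nr * sizeof(double));
    double *mine  = malloc(nrc * sizeof(double));

    /* 1. Scatter in columns: only the root's column holds data, so only
          its column communicator participates. */
    if (mycol == 0)
        MPI_Scatter(buf, nr, MPI_DOUBLE, piece, nr, MPI_DOUBLE, 0, col);

    /* 2. Scatter in rows: the column-0 node (rank 0 of each row comm)
          spreads its n/r block across the row. */
    MPI_Scatter(piece, nrc, MPI_DOUBLE, mine, nrc, MPI_DOUBLE, 0, row);

    /* 3. Allgather in rows: every node recovers its row's n/r block. */
    MPI_Allgather(mine, nrc, MPI_DOUBLE, piece, nrc, MPI_DOUBLE, row);

    /* 4. Allgather in columns: every node recovers all n items. */
    MPI_Allgather(piece, nr, MPI_DOUBLE, buf, nr, MPI_DOUBLE, col);

    free(mine); free(piece);
    MPI_Comm_free(&row); MPI_Comm_free(&col);
}
```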
Cost of 2D Scatter/Allgather Broadcast
Assumption: power-of-two mesh dimensions r and c
– scatter in columns: log₂(r) α + ((r-1)/r) nβ
– scatter in rows: log₂(c) α + ((c-1)/c)(n/r) β
– allgather in rows: (c-1) α + ((c-1)/c)(n/r) β
– allgather in columns: (r-1) α + ((r-1)/r) nβ
Cost comparison
Option 1:
– MST broadcast in columns
– MST broadcast in rows
Option 2:
– scatter in columns
– MST broadcast in rows
– allgather in columns
Option 3:
– scatter in columns
– scatter in rows
– allgather in rows
– allgather in columns
(A numerical comparison under the cost model follows.)
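The trade-off can be examined numerically. The sketch below (not from the slides) estimates each option's cost from the per-building-block formulas derived earlier; the machine constants are hypothetical, computation (γ) terms are omitted, and power-of-two mesh dimensions are assumed.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical machine parameters (placeholders, not measurements). */
static const double alpha = 1.0e-5, beta = 1.0e-9;

static double lg(double x) { return ceil(log2(x)); }

/* Option 1: MST broadcast in columns, then MST broadcast in rows. */
static double option1(double r, double c, double n) {
    return lg(r) * (alpha + n * beta) + lg(c) * (alpha + n * beta);
}

/* Option 2: scatter in columns, MST broadcast in rows, allgather in columns. */
static double option2(double r, double c, double n) {
    return lg(r) * alpha + (r - 1) / r * n * beta        /* scatter   */
         + lg(c) * (alpha + n / r * beta)                /* broadcast */
         + (r - 1) * alpha + (r - 1) / r * n * beta;     /* allgather */
}

/* Option 3: scatter in columns and rows, allgather in rows and columns. */
static double option3(double r, double c, double n) {
    return lg(r) * alpha + (r - 1) / r * n * beta
         + lg(c) * alpha + (c - 1) / c * (n / r) * beta
         + (c - 1) * alpha + (c - 1) / c * (n / r) * beta
         + (r - 1) * alpha + (r - 1) / r * n * beta;
}

int main(void) {  /* crossover: option 1 wins for short n, option 3 for long */
    for (double n = 1e2; n <= 1e8; n *= 100)
        printf("n = %8.0f: opt1 = %.3e, opt2 = %.3e, opt3 = %.3e\n",
               n, option1(16, 16, n), option2(16, 16, n), option3(16, 16, n));
    return 0;
}
```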
Outline
– Model of Parallel Computation
– Collective Communications
– Algorithms
– Performance Results
– Conclusions and Future work
Testbed Architecture
Cray-Dell PowerEdge Linux Cluster (Lonestar)
– Texas Advanced Computing Center, J. J. Pickle Research Campus, The University of Texas at Austin
– 856 3.06 GHz Xeon processors
– 410 Dell dual-processor PowerEdge 1750 compute nodes
– 2 GB of memory per compute node
– Myrinet-2000 switch fabric
– mpich-gm library 1.2.5..12a
Performance Results
Log-log graphs
– show performance trends better than linear-linear graphs
One processor per node was used
Graph legend:
– 'send-min' is the minimum time to send between nodes
– 'MPI' is the MPICH implementation
– 'short' is the minimum-spanning-tree algorithm
– 'long' is the bucket algorithm
– 'hybrid' is a 3-dimensional hybrid algorithm (the two higher dimensions use the 'long' algorithm; the lowest dimension uses the 'short' algorithm)
Conclusions and Future work
– We have a working prototype of an optimal collective communication library, using optimal algorithms for short, intermediate, and long vector lengths.
– We need heuristics for the cross-over points between the short, hybrid, and long vector algorithms, independent of the architecture.
– We need to complete this approach for ALL MPI data types and operations.
– We need to generalize the approach, or develop heuristics, for large SMP nodes or small n-way node clusters.