1 On Optimizing Collective Communication Avi Purkayastha, Ernie Chan, Marcel Heinrich, Robert van de Geijn UT/Texas Advanced Computing Center and UT/Computer Science ScicomP 10, August 9-13, Austin, TX

2 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

3 Model of Parallel Computation Target Architectures –distributed memory parallel architectures Indexing –p nodes –indexed 0 … p – 1 –each node has one computational processor

4 [Figure: nodes 0 through 8, often logically viewed as a linear array]

5 Model of Parallel Computation Logically Fully Connected –a node can send directly to any other node Communicating Between Nodes –a node can simultaneously receive and send Network Conflicts –occur when a message is sent over a path that is already occupied by another message

6 Model of Parallel Computation Cost of Communication –sending a message of length n between any two nodes costs α + nβ –α is the startup cost (latency) –β is the per-item transmission cost (bandwidth) Cost of Computation –cost to perform an arithmetic operation is γ –reduction operations: sum, prod, min, max
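To make the model concrete, here is a minimal Python sketch of these two costs (the function names are mine, not from the talk):

```python
def send_cost(n, alpha, beta):
    # Time to send a message of n items between any two nodes:
    # a fixed startup term plus a per-item transmission term.
    return alpha + n * beta

def reduce_cost(n, gamma):
    # Time to combine two length-n vectors element-wise
    # (sum, prod, min, or max): one operation per item.
    return n * gamma
```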

7 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

8 Collective Communications Broadcast Reduce(-to-one) Scatter Gather Allgather Reduce-scatter Allreduce

9 Lower Bounds (Latency) Broadcast Reduce(-to-one) Scatter/Gather Allgather Reduce-scatter Allreduce

10 Lower Bounds (Bandwidth) Broadcast Reduce(-to-one) Scatter/Gather Allgather Reduce-scatter Allreduce
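The per-operation bound formulas on these two slides were images and did not survive extraction. As a reconstruction (the standard α-β lower bounds from the collective-communication literature, stated here as an assumption rather than a quote), with the latency bound first and the bandwidth bound second:

```latex
\begin{array}{lll}
\text{Broadcast, Reduce:} & \lceil \log_2 p \rceil\,\alpha, & n\beta \\
\text{Scatter, Gather:}   & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Allgather:}         & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Reduce-scatter:}    & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Allreduce:}         & \lceil \log_2 p \rceil\,\alpha, & 2\,\frac{p-1}{p}\,n\beta \\
\end{array}
```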

11 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

12 Motivating Example We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.

13 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

14 Short-vector case Primary concern: –algorithms must have low latency cost Secondary concerns: –algorithms must work for an arbitrary number of nodes in particular, not just for power-of-two numbers of nodes –algorithms should avoid network conflicts not absolutely necessary, but nice if possible

15 Minimum-Spanning Tree based algorithms We will show how the following building blocks: –broadcast/reduce –scatter/gather can be implemented using minimum spanning trees embedded in the logical linear array while attaining –minimal latency –implementation for arbitrary numbers of nodes –no network conflicts

16 General principles message starts on one processor

17 General principles divide logical linear array in half

18 General principles send the message to the half of the network that does not contain the current node (root) that holds the message

19 General principles (the send completes: a node in the other half now also holds the message)

20 General principles continue recursively in each of the two halves

21 General principles The demonstrated technique directly applies to –broadcast –scatter The technique can be applied in reverse to –reduce –gather
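A minimal Python sketch of this recursive halving for broadcast, under assumptions of my own (point-to-point send/recv primitives and a contiguous list of node indices; this is not the authors' code):

```python
def mst_bcast(nodes, root, me, msg, send, recv):
    """Broadcast msg over the contiguous slice `nodes` of the
    linear array, starting from `root`. `me` is this node's index;
    `send`/`recv` are hypothetical point-to-point primitives."""
    if len(nodes) == 1:
        return msg
    mid = len(nodes) // 2
    left, right = nodes[:mid], nodes[mid:]
    # Send to the half that does not contain the current root.
    if root in left:
        src_half, dst_half, dst_root = left, right, right[0]
    else:
        src_half, dst_half, dst_root = right, left, left[0]
    if me == root:
        send(dst_root, msg)        # one message crosses between halves
    elif me == dst_root:
        msg = recv(root)
    # Recurse independently in each half; on a linear array the two
    # halves use disjoint links, so there are no network conflicts.
    if me in src_half:
        return mst_bcast(src_half, root, me, msg, send, recv)
    return mst_bcast(dst_half, dst_root, me, msg, send, recv)
```

Scatter follows the same pattern but sends only the half of the data destined for the other half; reduce and gather run the schedule in reverse.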

22 General principles This technique can be used to implement the following building blocks: –broadcast/reduce –scatter/gather using a minimum spanning tree embedded in the logical linear array while attaining –minimal latency –implementation for arbitrary numbers of nodes –no network conflicts? Yes, on linear arrays

23 Reduce-scatter (short vector)

24 Reduce-scatter (short vector) Step 1: Reduce

25 Reduce-scatter (short vector) Step 2: Scatter

26 Reduce Before: each node holds a full-length vector; the result to be computed is their element-wise sum. [Figure]

27 Reduce After: the root node holds the complete element-wise sum. [Figure]

28-33 [Animation: the minimum-spanning-tree reduce, step by step on the linear array]

34 Cost of Minimum-Spanning Tree Reduce –number of steps –cost per step

35 Cost of Minimum-Spanning Tree Reduce –number of steps –cost per step Notice: attains lower bound for latency component
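The table entries themselves were images; a reconstruction under the α-β-γ model above (the standard MST-reduce analysis, so an assumption rather than a verbatim copy):

```latex
\underbrace{\lceil \log_2 p \rceil}_{\text{number of steps}}
\times
\underbrace{\left(\alpha + n\beta + n\gamma\right)}_{\text{cost per step}}
= \lceil \log_2 p \rceil \left(\alpha + n\beta + n\gamma\right)
```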

36 Scatter Before: the root holds the full vector. After: each node holds its own piece. [Figure]

37-42 [Animation: the minimum-spanning-tree scatter, halving the data sent at each step]

43 Cost of Minimum-Spanning Tree Scatter Assumption: power-of-two number of nodes

44 Cost of Minimum-Spanning Tree Scatter Assumption: power-of-two number of nodes Notice: attains lower bound for latency and bandwidth components
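The cost formula was an image; reconstructing it (standard MST-scatter analysis, an assumption on my part): each of the log₂ p steps forwards half of the data remaining at the sender, giving a geometric sum:

```latex
\sum_{k=1}^{\log_2 p} \left( \alpha + \frac{n}{2^k}\,\beta \right)
= \log_2(p)\,\alpha + \frac{p-1}{p}\,n\beta
```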

45 Cost of Reduce/Scatter Reduce-scatter Assumption: power-of-two number of nodes –reduce –scatter

46 Cost of Reduce/Scatter Reduce-scatter Assumption: power-of-two number of nodes –reduce –scatter Notice: does not attain lower bound for latency or bandwidth components
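Again reconstructing the image formula (an assumption): adding the two building-block costs above gives

```latex
\underbrace{\log_2(p)\left(\alpha + n\beta + n\gamma\right)}_{\text{reduce}}
+ \underbrace{\log_2(p)\,\alpha + \frac{p-1}{p}\,n\beta}_{\text{scatter}}
= 2\log_2(p)\,\alpha + \left(\log_2(p) + \frac{p-1}{p}\right) n\beta + \log_2(p)\,n\gamma
```

Both the latency term (2 log₂(p) α) and the bandwidth term (roughly log₂(p) nβ) exceed their lower bounds, which motivates the long-vector algorithms that follow.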

47 Recap –Reduce –Scatter –Broadcast –Gather –Allreduce –Reduce-scatter –Allgather

48 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

49 Long-vector case Primary concerns: –algorithms must have low cost due to vector length –algorithms must avoid network conflicts Secondary concerns: –algorithms must work for an arbitrary number of nodes in particular, not just for power-of-two numbers of nodes

50 Long-vector building blocks We will show how the following building blocks: –allgather/reduce-scatter can be implemented using “bucket” algorithms while attaining –minimal cost due to length of vectors –implementation for arbitrary numbers of nodes –no network conflicts

51 A logical ring can be embedded in a physical linear array

52 General principles This technique can be used to implement the following building blocks: –allgather/reduce-scatter Subvectors of data are passed around the ring at each step, like a bucket brigade, until all data is collected, attaining –minimal cost due to length of vectors –implementation for arbitrary numbers of nodes –no network conflicts
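A minimal Python sketch of the ring schedule for allgather (hypothetical send/recv primitives, assumed able to overlap since the model lets a node send and receive simultaneously; not the authors' code):

```python
def bucket_allgather(p, me, my_piece, send, recv):
    """Ring allgather: after p - 1 steps every node holds all p
    pieces. Each step passes one piece of size n/p to the right
    neighbor while receiving another from the left."""
    pieces = [None] * p
    pieces[me] = my_piece
    left, right = (me - 1) % p, (me + 1) % p
    k = me                      # index of the piece forwarded next
    for _ in range(p - 1):
        send(right, pieces[k])  # overlapped with the receive below
        k = (k - 1) % p
        pieces[k] = recv(left)
    return pieces
```

The reduce-scatter runs the same schedule with the data flow reversed, combining each incoming subvector into a running partial sum (γ-cost arithmetic) instead of storing it.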

53 Reduce-scatter Before: each node holds a full-length vector; the vectors are to be summed element-wise. [Figure]

54 Reduce-scatter After: each node holds one distinct piece of the element-wise sum. [Figure]

55-64 [Animation: bucket reduce-scatter; partial sums travel around the ring, one subvector per step]

65 Cost of Bucket Reduce-scatter –number of steps –cost per step

66 Cost of Bucket Reduce-scatter –number of steps –cost per step Notice: attains lower bound for bandwidth and computation components
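Reconstructing the image formula (standard bucket analysis, an assumption):

```latex
\underbrace{(p-1)}_{\text{number of steps}}
\times
\underbrace{\left(\alpha + \frac{n}{p}\,\beta + \frac{n}{p}\,\gamma\right)}_{\text{cost per step}}
= (p-1)\,\alpha + \frac{p-1}{p}\,n\beta + \frac{p-1}{p}\,n\gamma
```

The bandwidth and computation terms match the lower bounds, but the latency term grows linearly in p, which is why this algorithm suits long vectors only.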

67 Recap –Reduce-scatter –Scatter –Allgather –Gather –Allreduce –Reduce –Broadcast

68 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

69 Hybrid algorithms (Intermediate length case) Algorithms must balance latency, cost due to vector length, and network conflicts

70 General principles View p nodes as a 2-dimensional mesh –p = r x c, row and column dimensions Perform operation in each dimension –reduce-scatter within columns, followed by reduce-scatter within rows
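A small sketch of the 2D view (the row-major rank mapping is my assumption; any consistent mapping works):

```python
def mesh_coords(rank, r, c):
    # View node `rank` in 0 .. r*c - 1 as entry (row, col) of an
    # r x c mesh, so collectives can run per-column, then per-row.
    assert 0 <= rank < r * c
    return rank // c, rank % c

def column_members(col, r, c):
    # All ranks in one column: the group for the first-phase collective.
    return [row * c + col for row in range(r)]
```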

71 General principles Many different combinations of short- and long-vector algorithms Generally try to reduce the vector length used at the next dimension Can use more than 2 dimensions

72 Example: 2D Scatter/Allgather Broadcast

73 Example: 2D Scatter/Allgather Broadcast Scatter in columns

74 Example: 2D Scatter/Allgather Broadcast Scatter in rows

75 Example: 2D Scatter/Allgather Broadcast Allgather in rows

76 Example: 2D Scatter/Allgather Broadcast Allgather in columns
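The four phases compose as in this Python sketch (`scatter` and `allgather` stand for the building blocks above applied within a node group; the names and signatures are hypothetical, not a real MPI API):

```python
def scatter_allgather_bcast(msg, col_group, row_group, scatter, allgather):
    """2D broadcast from building blocks. Phase 1 scatters only in
    the root's column; phases 2-4 then run concurrently in every
    row and every column of the mesh."""
    piece = scatter(col_group, msg)      # 1. scatter in columns
    piece = scatter(row_group, piece)    # 2. scatter in rows
    piece = allgather(row_group, piece)  # 3. allgather in rows
    return allgather(col_group, piece)   # 4. allgather in columns
```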

77-88 [Animation: the four phases of the 2D scatter/allgather broadcast on the mesh]

89 Cost of 2D Scatter/Allgather Broadcast
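The cost formula on this slide was an image. Assuming MST scatters and bucket allgathers on an r x c mesh (my reconstruction, not verbatim from the slide), the four phases cost approximately:

```latex
\underbrace{\log_2(r)\,\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{scatter in columns}}
+ \underbrace{\log_2(c)\,\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\,\beta}_{\text{scatter in rows}}
+ \underbrace{(c-1)\,\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\,\beta}_{\text{allgather in rows}}
+ \underbrace{(r-1)\,\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{allgather in columns}}
```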

90 Cost comparison Option 1: –MST Broadcast in column –MST Broadcast in rows Option 2: –Scatter in column –MST Broadcast in rows –Allgather in columns Option 3: –Scatter in column –Scatter in rows –Allgather in rows –Allgather in columns
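A sketch comparing the three options under the model (the helper names and the choice of MST/bucket building blocks for each phase are my assumptions):

```python
import math

def mst_bcast_cost(p, n, a, b):
    # MST broadcast: ceil(log2 p) steps, full vector each step.
    return math.ceil(math.log2(p)) * (a + n * b)

def mst_scatter_cost(p, n, a, b):
    # MST scatter: log2(p) startups, (p-1)/p of the data moved.
    return math.log2(p) * a + (p - 1) / p * n * b

def bucket_allgather_cost(p, n, a, b):
    # Bucket allgather: p - 1 steps of n/p items each.
    return (p - 1) * a + (p - 1) / p * n * b

def option1(r, c, n, a, b):  # MST broadcast in columns, then rows
    return mst_bcast_cost(r, n, a, b) + mst_bcast_cost(c, n, a, b)

def option2(r, c, n, a, b):  # scatter / broadcast / allgather
    return (mst_scatter_cost(r, n, a, b)
            + mst_bcast_cost(c, n / r, a, b)
            + bucket_allgather_cost(r, n, a, b))

def option3(r, c, n, a, b):  # full 2D scatter/allgather broadcast
    return (mst_scatter_cost(r, n, a, b)
            + mst_scatter_cost(c, n / r, a, b)
            + bucket_allgather_cost(c, n / r, a, b)
            + bucket_allgather_cost(r, n, a, b))
```

Which option wins depends on n, r, c, α, and β: option 1 has the fewest startups, option 3 the least bandwidth, and the crossover points between them drive the hybrid choice.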

91 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

92 Testbed Architecture Cray-Dell PowerEdge Linux Cluster –Lonestar, Texas Advanced Computing Center –J. J. Pickle Research Campus, The University of Texas at Austin –856 3.06 GHz Xeon processors –410 Dell dual-processor PowerEdge 1750 compute nodes –2 GB of memory per compute node –Myrinet-2000 switch fabric –mpich-gm library 1.2.5..12a

93 Performance Results Log-log graphs –show performance better than linear-linear graphs Used one processor per node Graphs –'send-min' is the minimum time to send between nodes –'MPI' is the MPICH implementation –'short' is the minimum-spanning-tree algorithm –'long' is the bucket algorithm –'hybrid' is a 3-dimensional hybrid algorithm: the two higher dimensions use the 'long' algorithm, the lowest dimension uses the 'short' algorithm

94-97 [Performance graphs: time versus vector length on log-log axes, comparing send-min, MPI, short, long, and hybrid]

98 Conclusions and Future work Have a working prototype for an optimal collective communication library, using optimal algorithms for short, intermediate and long vector lengths. Need to obtain heuristics for cross-over points between short, hybrid and long vector algorithms, independent of architecture. Need to complete this approach for ALL MPI data types and operations. Need to generalize approach or have heuristics for large SMP nodes or small n-way node clusters.

