Published by Johnathan McLaughlin. Modified over 9 years ago.
1
Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan
2
Authors: Ernie Chan and Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin; William Gropp and Rajeev Thakur, Mathematics and Computer Science Division, Argonne National Laboratory
3
Outline: Testbed Architecture; Model of Parallel Computation; Sending Simultaneously; Collective Communication; Generalized Algorithms; Performance Results; Conclusion
4
Testbed Architecture: IBM Blue Gene/L. 3D torus point-to-point interconnect network. One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes. Special feature to send simultaneously: use multiple calls to MPI_Isend.
6
Model of Parallel Computation: Target Architectures. Distributed-memory parallel architectures. Indexing: p computational nodes, indexed 0 … p - 1. Logically fully connected: a node can send directly to any other node.
7
Model of Parallel Computation: Topology. N-dimensional torus. [Figure: 16 nodes labeled 0 to 15]
8
Model of Parallel Computation Old Model of Communicating Between Nodes Unidirectional sending or receiving
9
Model of Parallel Computation Old Model of Communicating Between Nodes Simultaneous sending and receiving
10
Model of Parallel Computation Old Model of Communicating Between Nodes Bidirectional exchange
11
Model of Parallel Computation Communicating Between Nodes A node can send or receive with 2N other nodes simultaneously along its 2N different links
12
Model of Parallel Computation Communicating Between Nodes Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
13
Model of Parallel Computation: Cost of Communication. Sending a message costs α + nβ, where α is the startup time (latency), n is the number of bytes to communicate, and β is the transmission time per byte (the inverse of the bandwidth).
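The α + nβ model can be tried out numerically. The sketch below is only illustrative: the function name `send_cost` and the values of `ALPHA` and `BETA` are assumptions chosen for the example, not measurements from Blue Gene/L.

```python
def send_cost(n_bytes, alpha, beta):
    """Time to send one n-byte message: startup cost plus per-byte cost."""
    return alpha + n_bytes * beta


# Hypothetical machine parameters, chosen only for illustration.
ALPHA = 5e-6  # startup time (latency) in seconds per message
BETA = 1e-8   # transmission time in seconds per byte

small = send_cost(8, ALPHA, BETA)                # 8-byte message
large = send_cost(4 * 1024 * 1024, ALPHA, BETA)  # 4 MB message

# Fraction of the total time that is spent on startup.
latency_share_small = ALPHA / small
latency_share_large = ALPHA / large

print(f"8 B : {small:.3e} s, latency share {latency_share_small:.1%}")
print(f"4 MB: {large:.3e} s, latency share {latency_share_large:.3%}")
```

Under these assumed parameters the short message is dominated by α and the long message by nβ, which is why the deck later treats short-vector and long-vector algorithms separately.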
15
Sending Simultaneously Old Cost of Communication with Sends to Multiple Nodes Cost to send to m separate nodes (α + nβ) m
16
Sending Simultaneously New Cost of Communication with Simultaneous Sends (α + nβ) m can be replaced with (α + nβ) + (α + nβ) (m - 1)
17
Sending Simultaneously: New Cost of Communication with Simultaneous Sends. (α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1)τ, where the first term is the cost of one send and the second term is the cost of the extra sends.
18
Sending Simultaneously: New Cost of Communication with Simultaneous Sends. (α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1)τ, where the first term is the cost of one send, the second term is the cost of the extra sends, and 0 ≤ τ ≤ 1.
19
Sending Simultaneously: Benchmarking Sending Simultaneously. Logarithmic-logarithmic timing graphs. Midplane: 512 nodes. Sending simultaneously with 1 to 6 neighbors. Message sizes from 8 bytes to 4 MB.
20
Sending Simultaneously
22
Cost of Communication with Simultaneous Sends (α + nβ) (1 + (m - 1) τ)
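A small sketch of the cost (α + nβ)(1 + (m - 1)τ) shows how τ interpolates between fully serialized and fully overlapped sends. The function name and the numeric parameters are assumptions made for illustration only.

```python
def simultaneous_send_cost(n_bytes, m, alpha, beta, tau):
    """Cost of sending n bytes to each of m nodes at once.

    tau = 1 means the extra sends are fully serialized;
    tau = 0 means the extra sends are free.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return (alpha + n_bytes * beta) * (1 + (m - 1) * tau)


# Hypothetical parameters, for illustration only.
ALPHA, BETA, N_BYTES, M = 5e-6, 1e-8, 1024, 6

one_send = ALPHA + N_BYTES * BETA
serialized = simultaneous_send_cost(N_BYTES, M, ALPHA, BETA, tau=1.0)
overlapped = simultaneous_send_cost(N_BYTES, M, ALPHA, BETA, tau=0.0)
partial = simultaneous_send_cost(N_BYTES, M, ALPHA, BETA, tau=0.25)

print(serialized / one_send, overlapped / one_send, partial / one_send)
```

With τ = 1 the expression reduces to the old cost (α + nβ) m, and with τ = 0 it reduces to the cost of a single send.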
23
Sending Simultaneously
26
Collective Communication Broadcast (Bcast) Motivating example Before After
27
Collective Communication Scatter Before After
28
Collective Communication Allgather Before After
29
Collective Communication Broadcast Can be implemented as a Scatter followed by an Allgather
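The decomposition of a broadcast into a scatter followed by an allgather can be checked with a sequential simulation. This is a sketch of the data movement only, with no real message passing; the function names are hypothetical and node 0 is assumed to be the root.

```python
def scatter(data, p):
    """Split the root's vector into p nearly equal pieces; piece i goes to node i."""
    base, extra = divmod(len(data), p)
    pieces, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        pieces.append(data[start:start + size])
        start += size
    return pieces


def allgather(pieces):
    """Every node ends up with the concatenation of all pieces."""
    full = [x for piece in pieces for x in piece]
    return [list(full) for _ in pieces]


def bcast_via_scatter_allgather(data, p):
    """Broadcast expressed as a scatter followed by an allgather."""
    return allgather(scatter(data, p))


vector = list(range(10))
result = bcast_via_scatter_allgather(vector, p=4)
print(all(node_copy == vector for node_copy in result))
```

After the two phases every simulated node holds the full vector, which is exactly the postcondition of a broadcast.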
30
Scatter Allgather
31
Collective Communication: Lower Bounds on Latency. Broadcast: $\log_{2N+1}(p)\,\alpha$. Scatter: $\log_{2N+1}(p)\,\alpha$. Allgather: $\log_{2N+1}(p)\,\alpha$.
32
Collective Communication: Lower Bounds on Bandwidth. Broadcast: $\frac{n\beta}{2N}$. Scatter: $\frac{p-1}{p}\,\frac{n\beta}{2N}$. Allgather: $\frac{p-1}{p}\,\frac{n\beta}{2N}$.
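The lower bounds can be evaluated for a concrete machine size. The sketch below assumes the latency bound is rounded up to a whole number of steps (the slides state it as a plain logarithm); that rounding and all function names are assumptions for illustration.

```python
def min_steps(p, N):
    """Smallest s with (2N+1)**s >= p.

    Rationale: if every informed node can reach 2N new nodes per step,
    the number of informed nodes grows by at most a factor of 2N+1.
    """
    steps, reached = 0, 1
    while reached < p:
        reached *= 2 * N + 1
        steps += 1
    return steps


def latency_lower_bound(p, N, alpha):
    return min_steps(p, N) * alpha


def bcast_bandwidth_lower_bound(n_bytes, N, beta):
    """All n bytes must enter a node over at most 2N links."""
    return n_bytes * beta / (2 * N)


def scatter_allgather_bandwidth_lower_bound(n_bytes, p, N, beta):
    """Only the (p-1)/p share that a node does not already hold must move."""
    return (p - 1) / p * n_bytes * beta / (2 * N)


# One Blue Gene/L midplane: 512 nodes on a 3D torus, so N = 3.
print(min_steps(512, 3))
```

For a 512-node midplane with N = 3 this gives four steps, since 7^3 = 343 is less than 512 and 7^4 = 2401 is not.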
34
Generalized Algorithms. Short-vector algorithms: Minimum-Spanning Tree. Long-vector algorithms: Bucket Algorithm.
35
Generalized Algorithms Minimum-Spanning Tree
36
Generalized Algorithms: Minimum-Spanning Tree. Recursively divide the network of nodes in half. Cost of MST Bcast: $\log_2(p)\,(\alpha + n\beta)$. What if a node can send to N nodes simultaneously?
37
Generalized Algorithms Minimum-Spanning Tree Divide p nodes into N+1 partitions
38
Generalized Algorithms: Minimum-Spanning Tree. Disjoint partitions on an N-dimensional mesh. [Figure: 16 nodes labeled 0 to 15]
39
Generalized Algorithms: Minimum-Spanning Tree. Divide dimensions by a decrementing counter from N+1. [Figure: 16 nodes labeled 0 to 15]
40
Generalized Algorithms: Minimum-Spanning Tree. Now divide into 2N+1 partitions. [Figure: 16 nodes labeled 0 to 15]
41
Generalized Algorithms: Minimum-Spanning Tree. Cost of new generalized MST Bcast: $\log_{2N+1}(p)\,(\alpha + n\beta)$. Attains lower bound for latency!
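The step count of the generalized MST broadcast can be illustrated with a simplified simulation. The sketch assumes a logically fully connected machine with nodes in a single index range, and it ignores how partitions map onto torus dimensions and links, so it is not the authors' implementation. Each node holding the data splits its range into k partitions (k = 2 for the classic MST, k = 2N+1 for the generalized version) and sends to one node in every partition other than its own.

```python
def mst_bcast(p, k, root=0):
    """Simulate a k-way MST broadcast over nodes 0..p-1.

    Returns (number of steps, set of nodes that received the data).
    """
    if k < 2 or not 0 <= root < p:
        raise ValueError("need k >= 2 and 0 <= root < p")
    has_data = {root}
    # Each task (owner, lo, hi): owner holds the data and is
    # responsible for delivering it to the nodes in [lo, hi).
    tasks = [(root, 0, p)]
    steps = 0
    while any(hi - lo > 1 for _, lo, hi in tasks):
        next_tasks = []
        for owner, lo, hi in tasks:
            size = hi - lo
            if size == 1:
                next_tasks.append((owner, lo, hi))
                continue
            parts = min(k, size)
            base, extra = divmod(size, parts)
            start = lo
            for i in range(parts):
                end = start + base + (1 if i < extra else 0)
                if start <= owner < end:
                    # The owner keeps its own partition.
                    next_tasks.append((owner, start, end))
                else:
                    # The owner sends to the first node of this partition.
                    has_data.add(start)
                    next_tasks.append((start, start, end))
                start = end
        tasks = next_tasks
        steps += 1
    return steps, has_data


for k in (2, 5, 7):
    print(k, mst_bcast(512, k)[0])
```

In this simplified setting, 512 nodes need 9 steps with k = 2 and 4 steps with k = 7 (N = 3), matching the rounded-up values of log_2(p) and log_{2N+1}(p).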
42
Generalized Algorithms: Minimum-Spanning Tree. MST Scatter: only send data that must reside in that partition at each step. Cost of new generalized MST Scatter: $\log_{2N+1}(p)\,\alpha + \frac{p-1}{p}\,\frac{n\beta}{2N}$. Attains lower bound for latency and bandwidth!
43
Generalized Algorithms Bucket Algorithm
44
Generalized Algorithms: Bucket Algorithm. Send messages of size n/p at each step. Cost of Bucket Allgather: $(p-1)\,\alpha + \frac{p-1}{p}\,n\beta$. What if a node can send to N nodes simultaneously?
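The one-dimensional bucket allgather can be sketched as a ring simulation. The code assumes a ring of p nodes in which each node forwards one piece of size n/p to its right neighbor per step; the function names are hypothetical, and the simulation is sequential rather than message-passing.

```python
def bucket_allgather(pieces):
    """Simulate a ring (bucket) allgather; node i starts with pieces[i]."""
    p = len(pieces)
    # buckets[i][j] is node i's copy of piece j, or None if not yet received.
    buckets = [[pieces[j] if j == i else None for j in range(p)] for i in range(p)]
    for step in range(p - 1):
        # In step s, node i forwards piece (i - s) mod p to node (i + 1) mod p.
        # All sends are collected first so that the step happens "at once".
        sends = [((i + 1) % p, (i - step) % p, buckets[i][(i - step) % p])
                 for i in range(p)]
        for dest, j, value in sends:
            buckets[dest][j] = value
    return buckets


def bucket_allgather_cost(n_bytes, p, alpha, beta):
    """(p - 1) steps, each sending a message of n/p bytes."""
    return (p - 1) * (alpha + n_bytes / p * beta)


gathered = bucket_allgather(["a", "b", "c", "d"])
print(gathered[0])
```

The cost function is the same expression as (p - 1)α + ((p - 1)/p) nβ, written as p - 1 steps of one n/p-byte send each.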
45
Generalized Algorithms: Bucket Algorithm. Collect data around N buckets simultaneously. [Figure: 16 nodes labeled 0 to 15]
46
Generalized Algorithms: Bucket Algorithm. Cannot send to N neighbors at each step. [Figure: 16 nodes labeled 0 to 15]
47
Generalized Algorithms: Bucket Algorithm. Assume collecting data in buckets is free in all but one dimension. D is an ordered N-tuple representing the number of nodes in each dimension of the torus: $|D| = N$ and $\prod_{i=1}^{N} D_i = p$.
48
Generalized Algorithms: Bucket Algorithm. Cost of the new generalized Bucket Allgather: $(d - N)\,\alpha + \frac{D_j - 1}{D_j}\,n\beta$, where $d = \sum_{i=1}^{N} D_i$ and $D_j \ge D_i \;\forall\, i \ne j$.
49
Generalized Algorithms: Bucket Algorithm. New generalized Bcast derived from MST Scatter followed by Bucket Allgather. Cost of new long-vector Bcast: $(\log_{2N+1}(p) + d - N)\,\alpha + \left(\frac{p-1}{2Np} + \frac{D_j - 1}{D_j}\right) n\beta$.
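The long-vector broadcast cost is the sum of the generalized MST Scatter cost and the generalized Bucket Allgather cost, which can be checked numerically. The sketch below assumes the torus dimensions are given as a tuple D, uses a real-valued logarithm as written on the slides, and picks hypothetical values for α and β; requires Python 3.8 or later for `math.prod`.

```python
import math


def mst_scatter_cost(n_bytes, dims, alpha, beta):
    """log_{2N+1}(p) * alpha + ((p - 1) / p) * n * beta / (2N)."""
    N, p = len(dims), math.prod(dims)
    return math.log(p, 2 * N + 1) * alpha + (p - 1) / p * n_bytes * beta / (2 * N)


def bucket_allgather_cost(n_bytes, dims, alpha, beta):
    """(d - N) * alpha + ((D_j - 1) / D_j) * n * beta, with D_j the largest dimension."""
    N, d, D_j = len(dims), sum(dims), max(dims)
    return (d - N) * alpha + (D_j - 1) / D_j * n_bytes * beta


def long_vector_bcast_cost(n_bytes, dims, alpha, beta):
    """Combined formula for MST Scatter followed by Bucket Allgather."""
    N, p, d, D_j = len(dims), math.prod(dims), sum(dims), max(dims)
    latency = (math.log(p, 2 * N + 1) + d - N) * alpha
    bandwidth = ((p - 1) / (2 * N * p) + (D_j - 1) / D_j) * n_bytes * beta
    return latency + bandwidth


# One 8 x 8 x 8 midplane, 4 MB message, hypothetical alpha and beta.
DIMS, N_BYTES, ALPHA, BETA = (8, 8, 8), 4 * 1024 * 1024, 5e-6, 1e-8

total = long_vector_bcast_cost(N_BYTES, DIMS, ALPHA, BETA)
parts = (mst_scatter_cost(N_BYTES, DIMS, ALPHA, BETA)
         + bucket_allgather_cost(N_BYTES, DIMS, ALPHA, BETA))
print(total, parts)
```

The two printed values agree up to floating-point rounding, which confirms that the combined formula is term by term the sum of its two phases.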
51
Performance Results: Logarithmic-logarithmic timing graphs. Collective communication operations: Broadcast, Scatter, Allgather. Algorithms: MST, Bucket. Message sizes from 8 bytes to 4 MB.
52
Performance Results Single point-to-point communication
53
Performance Results my-bcast-MST
54
Performance Results
60
IBM Blue Gene/L supports the functionality of sending simultaneously. Benchmarking along with model checking verifies this claim. New generalized algorithms show clear performance gains.
61
Conclusion: Future Directions. Room for optimization to reduce implementation overhead. What if not using MPI_COMM_WORLD? Possible new algorithm for Bucket Algorithm. Questions? echan@cs.utexas.edu