Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan, Robert van de Geijn (Department of Computer Sciences, The University of Texas at Austin); William Gropp, Rajeev Thakur (Mathematics and Computer Science Division, Argonne National Laboratory)
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Testbed Architecture IBM Blue Gene/L: 3D torus point-to-point interconnect network. One rack = 1024 dual-processor nodes (two 8 × 8 × 8 midplanes). Special feature to send simultaneously via multiple calls to MPI_Isend.
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Model of Parallel Computation Target Architectures: distributed-memory parallel architectures. Indexing: p computational nodes, indexed 0 … p − 1. Logically fully connected: a node can send directly to any other node.
Model of Parallel Computation Topology N-dimensional torus
Model of Parallel Computation Old Model of Communicating Between Nodes Unidirectional sending or receiving
Model of Parallel Computation Old Model of Communicating Between Nodes Simultaneous sending and receiving
Model of Parallel Computation Old Model of Communicating Between Nodes Bidirectional exchange
Model of Parallel Computation Communicating Between Nodes A node can send or receive with 2N other nodes simultaneously along its 2N different links
Model of Parallel Computation Communicating Between Nodes Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
Model of Parallel Computation Cost of Communication: α + nβ, where α is the startup time (latency), n is the number of bytes communicated, and β is the per-byte transmission time (inverse bandwidth).
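The α + nβ model above can be sketched as a small helper; the numeric parameter values here are hypothetical, chosen only to illustrate that latency dominates short messages and bandwidth dominates long ones:

```python
# Hypothetical parameter values for illustration only (not measured on Blue Gene/L).
ALPHA = 2.0e-6   # startup time (latency) per message, seconds
BETA = 4.0e-10   # per-byte transmission time (inverse bandwidth), seconds

def send_cost(n_bytes, alpha=ALPHA, beta=BETA):
    """Cost of sending one n-byte message under the alpha-beta model."""
    return alpha + n_bytes * beta

# Latency dominates for short messages, bandwidth for long ones.
assert send_cost(8) < 2 * ALPHA
assert send_cost(4 * 2**20) > 100 * ALPHA
```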
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Sending Simultaneously Old Cost of Communication with Sends to Multiple Nodes Cost to send to m separate nodes (α + nβ) m
Sending Simultaneously New Cost of Communication with Simultaneous Sends: (α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m − 1) τ, where the first term is the cost of one send, the second term is the cost of the extra sends, and 0 ≤ τ ≤ 1.
Sending Simultaneously Benchmarking Sending Simultaneously: log–log timing graphs. Midplane: 512 nodes. Sending simultaneously with 1–6 neighbors. Message sizes: 8 bytes – 4 MB.
Sending Simultaneously [Log–log timing graphs: simultaneous sends to 1–6 neighbors, 8 bytes – 4 MB]
Cost of Communication with Simultaneous Sends (α + nβ) (1 + (m - 1) τ)
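The factored cost above can be expressed directly; τ = 1 recovers the old sequential cost (α + nβ) m, and τ = 0 makes the extra sends free. The parameter values below are hypothetical:

```python
def simul_send_cost(m, n_bytes, alpha, beta, tau):
    """Cost of sending an n-byte message to m nodes at once:
    (alpha + n*beta) * (1 + (m - 1) * tau), with 0 <= tau <= 1."""
    assert 0.0 <= tau <= 1.0
    return (alpha + n_bytes * beta) * (1 + (m - 1) * tau)

one = simul_send_cost(1, 1024, 1e-6, 1e-9, 0.5)
# tau = 1: no overlap, m sends cost as much as m sequential sends.
assert simul_send_cost(6, 1024, 1e-6, 1e-9, 1.0) == 6 * one
# tau = 0: the extra sends are completely free.
assert simul_send_cost(6, 1024, 1e-6, 1e-9, 0.0) == one
```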
Sending Simultaneously [Log–log timing graphs: measured times for simultaneous sends]
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Collective Communication Broadcast (Bcast) Motivating example Before After
Collective Communication Scatter Before After
Collective Communication Allgather Before After
Collective Communication Broadcast Can be implemented as a Scatter followed by an Allgather
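The Scatter-then-Allgather composition can be checked with a small pure-Python simulation (no MPI) of p nodes; `bcast_via_scatter_allgather` is a name invented here for illustration:

```python
# Pure-Python simulation of broadcasting by composing scatter and allgather.
def bcast_via_scatter_allgather(root_data, p):
    assert len(root_data) % p == 0
    chunk = len(root_data) // p
    # Scatter: node i ends up with only the i-th chunk of the root's data.
    local = [root_data[i * chunk:(i + 1) * chunk] for i in range(p)]
    # Allgather: every node collects every chunk, completing the broadcast.
    gathered = [sum(local, []) for _ in range(p)]
    return gathered

result = bcast_via_scatter_allgather(list(range(8)), 4)
assert all(node == list(range(8)) for node in result)
```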
[Diagram: Scatter followed by Allgather]
Collective Communication Lower Bounds: Latency — Broadcast: log_{2N+1}(p) α; Scatter: log_{2N+1}(p) α; Allgather: log_{2N+1}(p) α
Collective Communication Lower Bounds: Bandwidth — Broadcast: nβ / (2N); Scatter: ((p − 1)/p) · nβ / (2N); Allgather: ((p − 1)/p) · nβ / (2N)
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Generalized Algorithms Short-Vector Algorithms Minimum-Spanning Tree Long-Vector Algorithms Bucket Algorithm
Generalized Algorithms Minimum-Spanning Tree
Generalized Algorithms Minimum-Spanning Tree Recursively divide the network of nodes in half. Cost of MST Bcast: log_2(p) (α + nβ). What if a node can send to N nodes simultaneously?
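The recursive-halving step count can be simulated directly: every halving doubles the set of nodes holding the data, so the broadcast finishes in ⌈log_2(p)⌉ steps. This is an illustrative sketch, not the authors' implementation:

```python
import math

def mst_bcast_steps(p):
    """Number of communication steps for MST broadcast with
    recursive halving: one send per halving, so ceil(log2(p))."""
    steps = 0
    have_data = 1           # nodes currently holding the message
    while have_data < p:
        have_data *= 2      # every holder sends to one new node
        steps += 1
    return steps

assert mst_bcast_steps(512) == 9
assert mst_bcast_steps(512) == math.ceil(math.log2(512))
```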
Generalized Algorithms Minimum-Spanning Tree Divide p nodes into N+1 partitions
Generalized Algorithms Minimum-Spanning Tree Disjoint partitions on an N-dimensional mesh
Generalized Algorithms Minimum-Spanning Tree Divide dimensions by a decrementing counter from N
Generalized Algorithms Minimum-Spanning Tree Now divide into 2N+1 partitions
Generalized Algorithms Minimum-Spanning Tree Cost of new generalized MST Bcast: log_{2N+1}(p) (α + nβ). Attains the lower bound for latency!
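With 2N simultaneous sends per step, the set of data holders grows by a factor of 2N + 1 instead of 2, which is where the log_{2N+1}(p) latency term comes from. A minimal sketch of the step count:

```python
def generalized_mst_bcast_steps(p, N):
    """Steps for the generalized MST broadcast: each holder sends to
    2N partitions per step, so holders grow by a factor of 2N + 1."""
    steps, have_data = 0, 1
    while have_data < p:
        have_data *= 2 * N + 1
        steps += 1
    return steps

# On a 3D torus (N = 3), 512 nodes need only ceil(log_7(512)) = 4 steps
# instead of the ceil(log_2(512)) = 9 steps of the binary MST broadcast.
assert generalized_mst_bcast_steps(512, 3) == 4
```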
Generalized Algorithms Minimum-Spanning Tree MST Scatter: only send the data that must reside in that partition at each step. Cost of new generalized MST Scatter: log_{2N+1}(p) α + ((p − 1)/p) · nβ / (2N). Attains the lower bounds for both latency and bandwidth!
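The "only send what must reside in that partition" rule is what makes MST Scatter bandwidth-optimal: at each split the sender forwards just the half of its remaining data destined for the other side. A pure-Python simulation of the binary case (names invented here for illustration):

```python
def mst_scatter(root_data, p):
    """Recursive-halving scatter: at each split the sender forwards
    only the portion of its data destined for the other partition."""
    def helper(lo, hi, data, out):
        if hi - lo == 1:
            out[lo] = data          # leaf: this node keeps its piece
            return
        mid = (lo + hi) // 2
        k = (mid - lo) * (len(data) // (hi - lo))
        helper(lo, mid, data[:k], out)   # keep the first part locally
        helper(mid, hi, data[k:], out)   # forward the rest across the cut

    out = [None] * p
    helper(0, p, root_data, out)
    return out

pieces = mst_scatter(list(range(8)), 4)
assert pieces == [[0, 1], [2, 3], [4, 5], [6, 7]]
```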
Generalized Algorithms Bucket Algorithm
Generalized Algorithms Bucket Algorithm Send messages of size n/p at each step. Cost of Bucket Allgather: (p − 1) α + ((p − 1)/p) nβ. What if a node can send to N nodes simultaneously?
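The bucket (ring) allgather can be simulated in pure Python: in each of the p − 1 steps every node forwards the piece it most recently received to its right neighbor, so each step costs α + (n/p)β. This sketch only models the data movement, not timing:

```python
def bucket_allgather(pieces):
    """Ring allgather: in each of p-1 steps every node passes the
    piece it received most recently to its right neighbor."""
    p = len(pieces)
    known = [{i: pieces[i]} for i in range(p)]  # node -> pieces it holds
    send = list(range(p))       # index of the piece each node sends next
    for _ in range(p - 1):
        # Snapshot sends before updating, so all transfers happen "at once".
        incoming = [(i, send[(i - 1) % p]) for i in range(p)]
        for i, idx in incoming:
            known[i][idx] = pieces[idx]
            send[i] = idx       # pass the new piece along next step
    return known

result = bucket_allgather([[0], [1], [2], [3]])
assert all(sorted(k) == [0, 1, 2, 3] for k in result)
```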
Generalized Algorithms Bucket Algorithm Collect data around N buckets simultaneously
Generalized Algorithms Bucket Algorithm Cannot send to N neighbors at each step
Generalized Algorithms Bucket Algorithm Assume collecting data in buckets is free in all but one dimension. D is an N-tuple representing the number of nodes in each dimension of the torus: |D| = N and ∏_{i=1}^{N} D_i = p.
Generalized Algorithms Bucket Algorithm Cost of the new generalized Bucket Allgather: (d − N) α + ((D_j − 1)/D_j) nβ, where d = Σ_{i=1}^{N} D_i and D_j ≥ D_i for all i ≠ j.
Generalized Algorithms Bucket Algorithm New generalized Bcast derived from an MST Scatter followed by a Bucket Allgather. Cost of new long-vector Bcast: (log_{2N+1}(p) + d − N) α + ((p − 1)/(2Np) + (D_j − 1)/D_j) nβ.
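The trade-off behind the hybrid broadcast can be seen numerically even in the classic one-dimensional case (without the 2N-link factor): MST broadcast wins for short vectors, while scatter + allgather wins for long ones. Parameter values here are hypothetical:

```python
import math

def mst_bcast_cost(p, n, alpha, beta):
    """Classic MST broadcast cost: ceil(log2(p)) * (alpha + n*beta)."""
    return math.ceil(math.log2(p)) * (alpha + n * beta)

def scatter_allgather_bcast_cost(p, n, alpha, beta):
    """Classic 1D scatter + ring-allgather broadcast cost:
    (ceil(log2(p)) + p - 1) * alpha + 2 * ((p - 1)/p) * n * beta."""
    return (math.ceil(math.log2(p)) + p - 1) * alpha \
        + 2 * (p - 1) / p * n * beta

alpha, beta, p = 1e-6, 1e-9, 512   # hypothetical machine parameters
# Short vectors: latency-bound, MST broadcast is cheaper.
assert mst_bcast_cost(p, 8, alpha, beta) < \
    scatter_allgather_bcast_cost(p, 8, alpha, beta)
# Long vectors: bandwidth-bound, scatter + allgather is cheaper.
n = 4 * 2**20
assert scatter_allgather_bcast_cost(p, n, alpha, beta) < \
    mst_bcast_cost(p, n, alpha, beta)
```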
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Performance Results Log–log timing graphs. Collective communication operations: Broadcast, Scatter, Allgather. Algorithms: MST, Bucket. Message sizes: 8 bytes – 4 MB.
Performance Results [Timing graph: single point-to-point communication]
Performance Results [Timing graph: my-bcast-MST]
Performance Results [Additional timing graphs]
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
Conclusion IBM Blue Gene/L supports the functionality of sending simultaneously. Benchmarking, checked against the cost model, verifies this claim. The new generalized algorithms show clear performance gains.
Conclusion Future Directions Room for optimization to reduce implementation overhead. What if not using MPI_COMM_WORLD? A possible new variant of the Bucket Algorithm.