1 On Optimizing Collective Communication Avi Purkayastha, Ernie Chan, Marcel Heinrich, Robert van de Geijn UT/Texas Advanced Computing Center and UT/Computer Science ScicomP 10, August 9-13, Austin, TX

2 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

3 Model of Parallel Computation Target Architectures –distributed memory parallel architectures Indexing –p nodes –indexed 0 … p – 1 –each node has one computational processor

4 [Figure: nodes 0 through 8, often logically viewed as a linear array]

5 Model of Parallel Computation Logically Fully Connected –a node can send directly to any other node Communicating Between Nodes –a node can simultaneously receive and send Network Conflicts –occur when a message is sent over a path that is already occupied by another message

6 Model of Parallel Computation Cost of Communication –sending a message of length n between any two nodes costs α + nβ –α is the startup cost (latency) –β is the per-item transmission cost (bandwidth) Cost of Computation –cost to perform an arithmetic operation is γ –reduction operations: sum, prod, min, max
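To make the model concrete, here is a minimal Python sketch of these two costs (the function names are mine, not from the talk):

```python
def send_cost(n, alpha, beta):
    # Time to send a message of n items between any two nodes:
    # a fixed startup term plus a per-item transmission term.
    return alpha + n * beta

def reduce_cost(n, gamma):
    # Time to combine two length-n vectors element-wise
    # (sum, prod, min, or max): one operation per item.
    return n * gamma
```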

7 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

8 Collective Communications Broadcast Reduce(-to-one) Scatter Gather Allgather Reduce-scatter Allreduce

9 Lower Bounds (Latency) Broadcast Reduce(-to-one) Scatter/Gather Allgather Reduce-scatter Allreduce

10 Lower Bounds (Bandwidth) Broadcast Reduce(-to-one) Scatter/Gather Allgather Reduce-scatter Allreduce
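The per-operation bound formulas on these two slides were images and did not survive extraction. As a reconstruction (the standard α-β lower bounds from the collective-communication literature, stated here as an assumption rather than a quote), with the latency bound first and the bandwidth bound second:

```latex
\begin{array}{lll}
\text{Broadcast, Reduce:} & \lceil \log_2 p \rceil\,\alpha, & n\beta \\
\text{Scatter, Gather:}   & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Allgather:}         & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Reduce-scatter:}    & \lceil \log_2 p \rceil\,\alpha, & \frac{p-1}{p}\,n\beta \\
\text{Allreduce:}         & \lceil \log_2 p \rceil\,\alpha, & 2\,\frac{p-1}{p}\,n\beta \\
\end{array}
```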

11 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

12 Motivating Example We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.

13 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

14 Short-vector case Primary concern: –algorithms must have low latency cost Secondary concerns: –algorithms must work for an arbitrary number of nodes in particular, not just for power-of-two numbers of nodes –algorithms should avoid network conflicts not absolutely necessary, but nice if possible

15 Minimum-Spanning Tree based algorithms We will show how the following building blocks: –broadcast/reduce –scatter/gather can be implemented using minimum spanning trees embedded in the logical linear array while attaining –minimal latency –implementation for arbitrary numbers of nodes –no network conflicts

16 General principles message starts on one processor

17 General principles divide logical linear array in half

18 General principles send the message to the half of the network that does not contain the current node (root) that holds the message

19 General principles (the send completes: a node in the other half now also holds the message)

20 General principles continue recursively in each of the two halves

21 General principles The demonstrated technique directly applies to –broadcast –scatter The technique can be applied in reverse to –reduce –gather
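A minimal Python sketch of this recursive halving for broadcast, under assumptions of my own (point-to-point send/recv primitives and a contiguous list of node indices; this is not the authors' code):

```python
def mst_bcast(nodes, root, me, msg, send, recv):
    """Broadcast msg over the contiguous slice `nodes` of the
    linear array, starting from `root`. `me` is this node's index;
    `send`/`recv` are hypothetical point-to-point primitives."""
    if len(nodes) == 1:
        return msg
    mid = len(nodes) // 2
    left, right = nodes[:mid], nodes[mid:]
    # Send to the half that does not contain the current root.
    if root in left:
        src_half, dst_half, dst_root = left, right, right[0]
    else:
        src_half, dst_half, dst_root = right, left, left[0]
    if me == root:
        send(dst_root, msg)        # one message crosses between halves
    elif me == dst_root:
        msg = recv(root)
    # Recurse independently in each half; on a linear array the two
    # halves use disjoint links, so there are no network conflicts.
    if me in src_half:
        return mst_bcast(src_half, root, me, msg, send, recv)
    return mst_bcast(dst_half, dst_root, me, msg, send, recv)
```

Scatter follows the same pattern but sends only the half of the data destined for the other half; reduce and gather run the schedule in reverse.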

22 General principles This technique can be used to implement the following building blocks: –broadcast/reduce –scatter/gather using a minimum spanning tree embedded in the logical linear array while attaining –minimal latency –implementation for arbitrary numbers of nodes –no network conflicts? Yes, on linear arrays

23 Reduce-scatter (short vector)

24 Reduce-scatter (short vector) Step 1: Reduce

25 Reduce-scatter (short vector) Step 2: Scatter

26 Reduce Before: each node holds a full-length vector; the result to be computed is their element-wise sum. [Figure]

27 Reduce After: the root node holds the complete element-wise sum. [Figure]

28-33 [Animation: the minimum-spanning-tree reduce, step by step on the linear array]

34 Cost of Minimum-Spanning Tree Reduce –number of steps –cost per step

35 Cost of Minimum-Spanning Tree Reduce –number of steps –cost per step Notice: attains lower bound for latency component
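The table entries themselves were images; a reconstruction under the α-β-γ model above (the standard MST-reduce analysis, so an assumption rather than a verbatim copy):

```latex
\underbrace{\lceil \log_2 p \rceil}_{\text{number of steps}}
\times
\underbrace{\left(\alpha + n\beta + n\gamma\right)}_{\text{cost per step}}
= \lceil \log_2 p \rceil \left(\alpha + n\beta + n\gamma\right)
```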

36 Scatter Before: the root holds the full vector. After: each node holds its own piece. [Figure]

37-42 [Animation: the minimum-spanning-tree scatter, halving the data sent at each step]

43 Cost of Minimum-Spanning Tree Scatter Assumption: power-of-two number of nodes

44 Cost of Minimum-Spanning Tree Scatter Assumption: power-of-two number of nodes Notice: attains lower bound for latency and bandwidth components
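The cost formula was an image; reconstructing it (standard MST-scatter analysis, an assumption on my part): each of the log₂ p steps forwards half of the data remaining at the sender, giving a geometric sum:

```latex
\sum_{k=1}^{\log_2 p} \left( \alpha + \frac{n}{2^k}\,\beta \right)
= \log_2(p)\,\alpha + \frac{p-1}{p}\,n\beta
```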

45 Cost of Reduce/Scatter Reduce-scatter Assumption: power-of-two number of nodes –reduce –scatter

46 Cost of Reduce/Scatter Reduce-scatter Assumption: power-of-two number of nodes –reduce –scatter Notice: does not attain lower bound for latency or bandwidth components
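Again reconstructing the image formula (an assumption): adding the two building-block costs above gives

```latex
\underbrace{\log_2(p)\left(\alpha + n\beta + n\gamma\right)}_{\text{reduce}}
+ \underbrace{\log_2(p)\,\alpha + \frac{p-1}{p}\,n\beta}_{\text{scatter}}
= 2\log_2(p)\,\alpha + \left(\log_2(p) + \frac{p-1}{p}\right) n\beta + \log_2(p)\,n\gamma
```

Both the latency term (2 log₂(p) α) and the bandwidth term (roughly log₂(p) nβ) exceed their lower bounds, which motivates the long-vector algorithms that follow.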

47 Recap –Reduce –Scatter –Broadcast –Gather –Allreduce –Reduce-scatter –Allgather

48 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

49 Long-vector case Primary concerns: –algorithms must have low cost due to vector length –algorithms must avoid network conflicts Secondary concerns: –algorithms must work for an arbitrary number of nodes in particular, not just for power-of-two numbers of nodes

50 Long-vector building blocks We will show how the following building blocks: –allgather/reduce-scatter can be implemented using “bucket” algorithms while attaining –minimal cost due to length of vectors –implementation for arbitrary numbers of nodes –no network conflicts

51 A logical ring can be embedded in a physical linear array

52 General principles This technique can be used to implement the following building blocks: –allgather/reduce-scatter Subvectors of data are passed around the ring at each step, like a bucket brigade, until all data is collected, attaining –minimal cost due to length of vectors –implementation for arbitrary numbers of nodes –no network conflicts
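A minimal Python sketch of the ring schedule for allgather (hypothetical send/recv primitives, assumed able to overlap since the model lets a node send and receive simultaneously; not the authors' code):

```python
def bucket_allgather(p, me, my_piece, send, recv):
    """Ring allgather: after p - 1 steps every node holds all p
    pieces. Each step passes one piece of size n/p to the right
    neighbor while receiving another from the left."""
    pieces = [None] * p
    pieces[me] = my_piece
    left, right = (me - 1) % p, (me + 1) % p
    k = me                      # index of the piece forwarded next
    for _ in range(p - 1):
        send(right, pieces[k])  # overlapped with the receive below
        k = (k - 1) % p
        pieces[k] = recv(left)
    return pieces
```

The reduce-scatter runs the same schedule with the data flow reversed, combining each incoming subvector into a running partial sum (γ-cost arithmetic) instead of storing it.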

53 Reduce-scatter Before: each node holds a full-length vector; the vectors are to be summed element-wise. [Figure]

54 Reduce-scatter After: each node holds one distinct piece of the element-wise sum. [Figure]

55-64 [Animation: bucket reduce-scatter; partial sums travel around the ring, one subvector per step]

65 Cost of Bucket Reduce-scatter –number of steps –cost per step

66 Cost of Bucket Reduce-scatter –number of steps –cost per step Notice: attains lower bound for bandwidth and computation components
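Reconstructing the image formula (standard bucket analysis, an assumption):

```latex
\underbrace{(p-1)}_{\text{number of steps}}
\times
\underbrace{\left(\alpha + \frac{n}{p}\,\beta + \frac{n}{p}\,\gamma\right)}_{\text{cost per step}}
= (p-1)\,\alpha + \frac{p-1}{p}\,n\beta + \frac{p-1}{p}\,n\gamma
```

The bandwidth and computation terms match the lower bounds, but the latency term grows linearly in p, which is why this algorithm suits long vectors only.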

67 Recap –Reduce-scatter –Scatter –Allgather –Gather –Allreduce –Reduce –Broadcast

68 A building block approach to library implementation Short-vector case Long-vector case Hybrid algorithms

69 Hybrid algorithms (Intermediate length case) Algorithms must balance latency, cost due to vector length, and network conflicts

70 General principles View p nodes as a 2-dimensional mesh –p = r x c, row and column dimensions Perform operation in each dimension –reduce-scatter within columns, followed by reduce-scatter within rows
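A small sketch of the 2D view (the row-major rank mapping is my assumption; any consistent mapping works):

```python
def mesh_coords(rank, r, c):
    # View node `rank` in 0 .. r*c - 1 as entry (row, col) of an
    # r x c mesh, so collectives can run per-column, then per-row.
    assert 0 <= rank < r * c
    return rank // c, rank % c

def column_members(col, r, c):
    # All ranks in one column: the group for the first-phase collective.
    return [row * c + col for row in range(r)]
```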

71 General principles Many different combinations of short- and long-vector algorithms Generally try to reduce the vector length used at the next dimension Can use more than 2 dimensions

72 Example: 2D Scatter/Allgather Broadcast

73 Example: 2D Scatter/Allgather Broadcast Scatter in columns

74 Example: 2D Scatter/Allgather Broadcast Scatter in rows

75 Example: 2D Scatter/Allgather Broadcast Allgather in rows

76 Example: 2D Scatter/Allgather Broadcast Allgather in columns
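The four phases compose as in this Python sketch (`scatter` and `allgather` stand for the building blocks above applied within a node group; the names and signatures are hypothetical, not a real MPI API):

```python
def scatter_allgather_bcast(msg, col_group, row_group, scatter, allgather):
    """2D broadcast from building blocks. Phase 1 scatters only in
    the root's column; phases 2-4 then run concurrently in every
    row and every column of the mesh."""
    piece = scatter(col_group, msg)      # 1. scatter in columns
    piece = scatter(row_group, piece)    # 2. scatter in rows
    piece = allgather(row_group, piece)  # 3. allgather in rows
    return allgather(col_group, piece)   # 4. allgather in columns
```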

77-88 [Animation: the four phases of the 2D scatter/allgather broadcast on the mesh]

89 Cost of 2D Scatter/Allgather Broadcast
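The cost formula on this slide was an image. Assuming MST scatters and bucket allgathers on an r x c mesh (my reconstruction, not verbatim from the slide), the four phases cost approximately:

```latex
\underbrace{\log_2(r)\,\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{scatter in columns}}
+ \underbrace{\log_2(c)\,\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\,\beta}_{\text{scatter in rows}}
+ \underbrace{(c-1)\,\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\,\beta}_{\text{allgather in rows}}
+ \underbrace{(r-1)\,\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{allgather in columns}}
```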

90 Cost comparison Option 1: –MST Broadcast in column –MST Broadcast in rows Option 2: –Scatter in column –MST Broadcast in rows –Allgather in columns Option 3: –Scatter in column –Scatter in rows –Allgather in rows –Allgather in columns
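A sketch comparing the three options under the model (the helper names and the choice of MST/bucket building blocks for each phase are my assumptions):

```python
import math

def mst_bcast_cost(p, n, a, b):
    # MST broadcast: ceil(log2 p) steps, full vector each step.
    return math.ceil(math.log2(p)) * (a + n * b)

def mst_scatter_cost(p, n, a, b):
    # MST scatter: log2(p) startups, (p-1)/p of the data moved.
    return math.log2(p) * a + (p - 1) / p * n * b

def bucket_allgather_cost(p, n, a, b):
    # Bucket allgather: p - 1 steps of n/p items each.
    return (p - 1) * a + (p - 1) / p * n * b

def option1(r, c, n, a, b):  # MST broadcast in columns, then rows
    return mst_bcast_cost(r, n, a, b) + mst_bcast_cost(c, n, a, b)

def option2(r, c, n, a, b):  # scatter / broadcast / allgather
    return (mst_scatter_cost(r, n, a, b)
            + mst_bcast_cost(c, n / r, a, b)
            + bucket_allgather_cost(r, n, a, b))

def option3(r, c, n, a, b):  # full 2D scatter/allgather broadcast
    return (mst_scatter_cost(r, n, a, b)
            + mst_scatter_cost(c, n / r, a, b)
            + bucket_allgather_cost(c, n / r, a, b)
            + bucket_allgather_cost(r, n, a, b))
```

Which option wins depends on n, r, c, α, and β: option 1 has the fewest startups, option 3 the least bandwidth, and the crossover points between them drive the hybrid choice.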

91 Outline Model of Parallel Computation Collective Communications Algorithms Performance Results Conclusions and Future work

92 Testbed Architecture Cray-Dell PowerEdge Linux Cluster –Lonestar, Texas Advanced Computing Center –J. J. Pickle Research Campus, The University of Texas at Austin –856 3.06 GHz Xeon processors –410 Dell dual-processor PowerEdge 1750 compute nodes –2 GB of memory per compute node –Myrinet-2000 switch fabric –mpich-gm library 1.2.5..12a

93 Performance Results Log-log graphs –show performance better than linear-linear graphs Used one processor per node Graphs –'send-min' is the minimum time to send between nodes –'MPI' is the MPICH implementation –'short' is the minimum-spanning-tree algorithm –'long' is the bucket algorithm –'hybrid' is a 3-dimensional hybrid algorithm: the two higher dimensions use the 'long' algorithm, the lowest dimension uses the 'short' algorithm

94-97 [Performance graphs: time versus vector length on log-log axes, comparing send-min, MPI, short, long, and hybrid]

98 Conclusions and Future work Have a working prototype for an optimal collective communication library, using optimal algorithms for short, intermediate and long vector lengths. Need to obtain heuristics for cross-over points between short, hybrid and long vector algorithms, independent of architecture. Need to complete this approach for ALL MPI data types and operations. Need to generalize approach or have heuristics for large SMP nodes or small n-way node clusters.

