Collective Communication Implementations
Sathish Vadhiyar

Binomial Tree
Definition (Binomial Tree): The binomial tree of order k >= 0 with root R is the tree B_k defined as follows:
1. If k = 0, B_k = {R}, i.e., the binomial tree of order zero consists of a single node, R.
2. If k > 0, B_k = {R, B_0, B_1, ..., B_(k-1)}, i.e., the binomial tree of order k > 0 comprises the root R and k binomial subtrees B_0 through B_(k-1).
Properties:
- B_k contains 2^k nodes.
- The height of B_k is k.
- The number of nodes at level l in B_k, where 0 <= l <= k, is the binomial coefficient C(k, l).

Binomial Trees
[Figure: binomial trees B0 through B3.]

Binomial Trees
[Figure: binomial tree B4, whose subtrees are B0 through B3.]

Binomial Trees
- Broadcast, scatter, and gather are usually implemented with a binomial tree.
- This takes log P communication steps instead of the 2(log P - 1) steps of a binary tree.
[Figure: the binomial tree from the previous slide used as the broadcast tree.]
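As a concrete illustration of the log P structure, here is a minimal sketch of a binomial-tree broadcast built from MPI point-to-point calls, using the usual relative-rank/bitmask formulation. The function name is ours and this is not the actual MPICH code.

```c
#include <mpi.h>

/* Sketch of a binomial-tree broadcast over point-to-point messages
   (relative ranks, root folded to 0). Illustrative only, not the MPICH source. */
static void bcast_binomial(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int relrank = (rank - root + p) % p;

    /* Receive from the parent: clearing my lowest set bit gives its relative rank. */
    int mask = 1;
    while (mask < p) {
        if (relrank & mask) {
            int parent = (relrank - mask + root) % p;
            MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children at decreasing distances: mask/2, mask/4, ... */
    for (mask >>= 1; mask > 0; mask >>= 1) {
        if (relrank + mask < p) {
            int child = (relrank + mask + root) % p;
            MPI_Send(buf, count, type, child, 0, comm);
        }
    }
}
```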

Barrier Algorithms
Butterfly barrier (Eugene Brooks II)
- In round k, process i synchronizes pairwise with process i XOR 2^k.
- If P is not a power of 2, existing processes stand in for the missing ones.
- Worst case: 2 log P pairwise synchronizations per processor.
[Figure: butterfly barrier communication pattern, stages 0-4.]

Barrier Algorithms
Dissemination barrier (Hensgen, Finkel, and Manber)
- In round k, process i signals process (i + 2^k) mod P.
- No pairwise synchronization.
- At most ceil(log2 P) rounds on the critical path, irrespective of P.
[Figure: dissemination barrier communication pattern, stages 0-3.]
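A minimal sketch of the dissemination pattern is given below. The helper name is hypothetical; each round ships a one-byte token and the loop runs ceil(log2 P) rounds.

```c
#include <mpi.h>

/* Dissemination barrier sketch: in round k, rank i signals (i + 2^k) mod P
   and waits on (i - 2^k + P) mod P. Illustrative, not production code. */
static void barrier_dissemination(MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char sendtok = 0, recvtok;
    for (int dist = 1; dist < p; dist *= 2) {
        int to   = (rank + dist) % p;
        int from = (rank - dist + p) % p;
        /* The blocking receive enforces that nobody leaves round k early. */
        MPI_Sendrecv(&sendtok, 1, MPI_CHAR, to,   0,
                     &recvtok, 1, MPI_CHAR, from, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```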

Barrier Algorithms
Tournament barrier (Hensgen, Finkel, and Manber)
- In the first round, each pair of nodes (players) synchronizes (plays a game); the receiver is considered the winner of the game.
- In the second round, the winners of the first round synchronize (play games); the receiver in the second round advances to the third round.
- This continues until a single winner is left in the tournament.
- The single winner then broadcasts a message to all other nodes.
- At each round k, process j receives a message from process i, where i = j - 2^k.
[Figure: tournament barrier rounds and the final broadcast.]

Barrier Algorithms
MPICH barrier (pairwise exchange with recursive doubling)
- Same as the butterfly barrier when the number of nodes is a power of 2.
- Otherwise, find the largest power of 2 not exceeding the size, i.e., m = 2^n.
- The last "surfeit" nodes, where surfeit = size - m, initially send messages to the first surfeit nodes.
- The first m nodes then perform the butterfly barrier.
- Finally, the first surfeit nodes send messages to the last surfeit nodes.
[Figure: barrier stages, including the first and last steps for the surplus nodes.]
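A rough sketch of these steps, including the non-power-of-two pre- and post-phases, is shown below. The helper name is ours and this is not the MPICH source.

```c
#include <mpi.h>

/* Pairwise exchange with recursive doubling, plus the check-in/release
   steps for the surplus ranks when p is not a power of two. */
static void barrier_pairwise(MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int m = 1;
    while (m * 2 <= p) m *= 2;            /* largest power of two <= p */
    int surfeit = p - m;
    char s = 0, r;

    if (rank >= m)                        /* surplus ranks check in first */
        MPI_Send(&s, 1, MPI_CHAR, rank - m, 0, comm);
    else if (rank < surfeit)
        MPI_Recv(&r, 1, MPI_CHAR, rank + m, 0, comm, MPI_STATUS_IGNORE);

    if (rank < m) {                       /* butterfly among the first m ranks */
        for (int dist = 1; dist < m; dist *= 2) {
            int partner = rank ^ dist;
            MPI_Sendrecv(&s, 1, MPI_CHAR, partner, 0,
                         &r, 1, MPI_CHAR, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    if (rank < surfeit)                   /* release the surplus ranks */
        MPI_Send(&s, 1, MPI_CHAR, rank + m, 0, comm);
    else if (rank >= m)
        MPI_Recv(&r, 1, MPI_CHAR, rank - m, 0, comm, MPI_STATUS_IGNORE);
}
```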

Allgather Implementation
In general, optimized allxxx operations depend on the hardware topology, network contention, etc.
Circular (ring) allgather:
- Each process receives from its left neighbor and sends to its right neighbor.
- P - 1 steps.
[Figure: ring allgather on 8 processes; after each stage every process holds one more of the blocks A0-A7.]
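A sketch of the ring pattern is shown below: in every step each rank forwards the block it received in the previous step. The helper name and the choice of doubles are illustrative.

```c
#include <mpi.h>
#include <string.h>

/* Ring allgather sketch: each rank contributes `blk` doubles; afterwards
   `out` holds p*blk doubles on every rank. Not the MPICH source. */
static void allgather_ring(const double *myblock, double *out, int blk, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;

    memcpy(out + rank * blk, myblock, blk * sizeof(double));
    for (int step = 0; step < p - 1; step++) {
        int sendidx = (rank - step + p) % p;         /* block forwarded this step */
        int recvidx = (rank - step - 1 + 2 * p) % p; /* block arriving from the left */
        MPI_Sendrecv(out + sendidx * blk, blk, MPI_DOUBLE, right, 0,
                     out + recvidx * blk, blk, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```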

Bruck's Allgather
- Communication pattern similar to the dissemination barrier.
- ceil(log P) steps.
[Figure: Bruck allgather on 6 processes with blocks A0-A5; each process's buffer starts at its own block, so a final local shift restores rank order.]
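The following is a sketch of the Bruck-style doubling pattern for equal-size blocks, with the final local rotation back into rank order. The function name is hypothetical and this is a simplified illustration, not the MPICH source.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Bruck-style allgather with equal blocks of `blocklen` bytes. */
static void allgather_bruck(const void *myblock, void *out, int blocklen, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *tmp = malloc((size_t)p * blocklen);
    memcpy(tmp, myblock, blocklen);                 /* own block goes first */

    int have = 1;                                   /* blocks currently held */
    for (int dist = 1; dist < p; dist *= 2) {
        int count = (have < p - have) ? have : p - have;   /* blocks shipped this step */
        int dst = (rank - dist + p) % p;
        int src = (rank + dist) % p;
        MPI_Sendrecv(tmp, count * blocklen, MPI_BYTE, dst, 0,
                     tmp + (size_t)have * blocklen, count * blocklen, MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);
        have += count;
    }

    /* tmp[j] holds the block of rank (rank + j) % p; rotate into rank order. */
    for (int j = 0; j < p; j++)
        memcpy((char *)out + (size_t)((rank + j) % p) * blocklen,
               tmp + (size_t)j * blocklen, blocklen);
    free(tmp);
}
```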

AlltoAll
The naive implementation:
  for all processes i, in order:
    if i != my process, send to i and receive from i
MPICH implementation: similar to the naive one, but does not go in rank order. At step i:
  dest = (my_proc + i) mod P
  src  = (my_proc - i + P) mod P
  send to dest and receive from src
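A sketch of this shifted ordering is given below, with step 0 handled as a local copy. The helper name and the use of fixed-size double blocks are assumptions for the illustration.

```c
#include <mpi.h>
#include <string.h>

/* Shifted alltoall sketch: at step i, send to (rank + i) mod P and
   receive from (rank - i + P) mod P. Blocks of `blk` doubles. */
static void alltoall_shifted(const double *sendbuf, double *recvbuf, int blk, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* step 0: the block destined for myself is a local copy */
    memcpy(recvbuf + rank * blk, sendbuf + rank * blk, blk * sizeof(double));

    for (int i = 1; i < p; i++) {
        int dst = (rank + i) % p;
        int src = (rank - i + p) % p;
        MPI_Sendrecv(sendbuf + dst * blk, blk, MPI_DOUBLE, dst, 0,
                     recvbuf + src * blk, blk, MPI_DOUBLE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```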

AlltoAll Implementation
Circular alltoall:
- For each step k in {1, ..., P-1}, process i sends to (i + k) mod P and receives from (i - k + P) mod P.
[Figure: circular alltoall stages on 8 processes holding blocks A0-A7 through H0-H7.]

Reduce and AllReduce
- Reduce and allreduce can be implemented with tree algorithms, e.g., a binary tree.
- But in tree-based algorithms, some processors are not involved in the computation.
- Rolf Rabenseifner (HLRS, Stuttgart) proposed algorithms for reduce and allreduce that avoid this.

Rabenseifner algorithm (source: HLRS)
- The algorithm is explained with an example of 13 nodes, numbered 0, 1, 2, ..., 12. The sendbuf contents are a, b, c, ..., m.
- Each buffer is notated ABCDEFGH; e.g., 'C' is the third 1/8 of the buffer.
- size := number of nodes in the communicator
- 2^n := the largest power of 2 that is <= size
- r := size - 2^n
- Example: size = 13, n = 3, r = 5.

Rabenseifner algorithm - Steps (source: HLRS)
- Compute n and r.
- If myrank < 2*r, split the buffer into ABCD and EFGH.
  - Even myrank: send buffer EFGH to myrank+1, receive buffer ABCD from myrank+1, compute op for ABCD, receive result EFGH.
  - Odd myrank: send buffer ABCD to myrank-1, receive buffer EFGH from myrank-1, compute op for EFGH, send result EFGH.
Result:
  node:  0    2    4    6    8    10  11  12
  value: a+b  c+d  e+f  g+h  i+j  k   l   m

Rabenseifner algorithm - Steps (source: HLRS)
- If (myrank is even && myrank < 2*r) || (myrank >= 2*r), the process continues, with:
  - define NEWRANK(old) := (old < 2*r ? old/2 : old - r)
  - define OLDRANK(new) := (new < r ? new*2 : new + r)
Result:
  old: 0    2    4    6    8    10  11  12
  new: 0    1    2    3    4    5   6   7
  val: a+b  c+d  e+f  g+h  i+j  k   l   m
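The two mappings can be written directly as C macros (a small sketch; `r` is assumed to be in scope as defined on the previous slide):

```c
/* Rank folding for Rabenseifner's algorithm, with r = size - 2^n.
   For size = 13, n = 3, r = 5: old ranks 0,2,4,6,8,10,11,12 map to new 0..7. */
#define NEWRANK(old) ((old) < 2 * r ? (old) / 2 : (old) - r)
#define OLDRANK(new) ((new) < r     ? (new) * 2 : (new) + r)
```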

Rabenseifner algorithm - Steps (source: HLRS)
5.1 Split the buffer (ABCDEFGH) in the middle; the lower half (ABCD) is computed on even (new) ranks, the upper half (EFGH) on odd (new) ranks.
  - Exchange: ABCD from 1 to 0, from 3 to 2, from 5 to 4, and from 7 to 6; EFGH from 0 to 1, from 2 to 3, from 4 to 5, and from 6 to 7.
  - Compute op in each node on its half.
Result:
  node 0: (a+b)+(c+d) for ABCD
  node 1: (a+b)+(c+d) for EFGH
  node 2: (e+f)+(g+h) for ABCD
  node 3: (e+f)+(g+h) for EFGH
  node 4: (i+j)+ k    for ABCD
  node 5: (i+j)+ k    for EFGH
  node 6:  l + m      for ABCD
  node 7:  l + m      for EFGH

Rabenseifner algorithm - Steps (source: HLRS)
5.2 Same again, with double the distance and half of the remaining buffer.
Result:
  node 0: [(a+b)+(c+d)] + [(e+f)+(g+h)] for AB
  node 1: [(a+b)+(c+d)] + [(e+f)+(g+h)] for EF
  node 2: [(a+b)+(c+d)] + [(e+f)+(g+h)] for CD
  node 3: [(a+b)+(c+d)] + [(e+f)+(g+h)] for GH
  node 4: [(i+j)+ k ] + [ l + m ]       for AB
  node 5: [(i+j)+ k ] + [ l + m ]       for EF
  node 6: [(i+j)+ k ] + [ l + m ]       for CD
  node 7: [(i+j)+ k ] + [ l + m ]       for GH

Rabenseifner algorithm - Steps (source: HLRS)
5.3 Same again, with double the distance and half of the remaining buffer.
Result: every node now holds { [(a+b)+(c+d)] + [(e+f)+(g+h)] } + { [(i+j)+k] + [l+m] } for its slice:
  node 0: A, node 1: E, node 2: C, node 3: G, node 4: B, node 5: F, node 6: D, node 7: H
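The recursive vector halving of steps 5.1-5.3 can be sketched as the loop below, here for a commutative sum on doubles. The function and parameter names are ours: pof2 is 2^n, r is the fold parameter from the earlier slides, nelem is assumed divisible by pof2, and tmp is scratch space of nelem/2 doubles. This is an illustration, not MPICH code, and a non-commutative op would need the halves combined in the right order.

```c
#include <mpi.h>

/* Recursive vector halving over the 2^n "new" ranks. On exit,
   buf[offset .. offset+len-1] is this rank's fully reduced 1/pof2 slice. */
static void halving_phase(double *buf, double *tmp, int nelem,
                          int newrank, int pof2, int r, MPI_Comm comm)
{
    int len = nelem, offset = 0;
    for (int dist = 1; dist < pof2; dist *= 2) {
        len /= 2;
        int pnew = newrank ^ dist;                       /* partner's new rank */
        int partner = (pnew < r) ? pnew * 2 : pnew + r;  /* map back to old rank */
        int keep_low = (newrank & dist) == 0;            /* keep lower or upper half? */
        double *send = keep_low ? buf + offset + len : buf + offset;
        double *keep = keep_low ? buf + offset       : buf + offset + len;
        MPI_Sendrecv(send, len, MPI_DOUBLE, partner, 0,
                     tmp,  len, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < len; i++)                    /* reduce the kept half */
            keep[i] += tmp[i];
        if (!keep_low)
            offset += len;
    }
}
```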

Rabenseifner algorithm - Steps (for reduce) (source: HLRS)
6. The last step is a gather.
- The gather uses an algorithm similar to the tournament algorithm: in each round, one process gathers from another.
- P/2 players take part in the initial step, and the distance is halved in every step until it reaches 1.
  node 0: recv B => AB
  node 1: recv F => EF
  node 2: recv D => CD
  node 3: recv H => GH
  node 4: send B
  node 5: send F
  node 6: send D
  node 7: send H
  node 0: recv CD => ABCD
  node 1: recv GH => EFGH
  node 2: send CD
  node 3: send GH
  node 0: recv EFGH => ABCDEFGH

Rabenseifner algorithm - Steps (for allreduce) (source: HLRS)
6. Similar to reduce, but:
- Instead of a one-sided gather, both processes exchange their parts.
- Both players move on to the next round.
- Finally, the result is transferred from the even to the odd nodes in the old ranks.
  After round 1: node 0: AB, node 1: EF, node 2: CD, node 3: GH, node 4: AB, node 5: EF, node 6: CD, node 7: GH
  After round 2: node 0: ABCD, node 1: EFGH, node 2: ABCD, node 3: EFGH, node 4: ABCD, node 5: EFGH, node 6: ABCD, node 7: EFGH
  After round 3: nodes 0-7 all hold ABCDEFGH

Reduce-Scatter for commutative operations: Recursive halving algorithm
- Recursive doubling: in the first step, communication is with the neighboring process; in each subsequent step the communication distance doubles.
- Recursive halving is the reverse of recursive doubling:
  - In the first step, a process communicates with the process P/2 away: it sends the data needed by the other half, receives the data needed by its own half, and performs the operation.
  - In the next step the distance is P/4, and so on.
- log P steps.

Reduce-Scatter for non-commutative operations: Recursive doubling algorithm
- In the first step, all data except the part needed for its own result is exchanged with the neighboring process.
- In the next step, (n - 2n/p) data (all except the part needed by itself and by the process it communicated with in the previous step) is exchanged with the process distance 2 away.
- In the third step, (n - 4n/p) data is exchanged with the process distance 4 away, and so on.
- log P steps.

General Notes on Optimizing Collectives
- There are two components in the cost of a collective communication: latency and bandwidth.
- Latency (α): the startup cost, i.e., the time until the first byte is delivered.
- Bandwidth term (β): the per-byte transfer time at which the collective proceeds after the first byte.
- Cost of a communication: α + nβ for an n-byte message.
- Latency is critical for small message sizes, bandwidth for large message sizes.

Example - Broadcast
Binomial broadcast:
- log p steps, with n data communicated at each step
- cost = log p (α + nβ)
Scatter and allgather:
- Divide the message into p segments.
- Scatter the p segments to the p processes using a binomial scatter: log p α + (n/p)(p-1) β
- Collect the scattered data at all processes using a ring allgather: (p-1) α + (n/p)(p-1) β
- cost = (log p + p - 1) α + 2(n/p)(p-1) β
Hence: binomial broadcast for short messages, and scatter + allgather for long messages.
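A minimal sketch of the long-message variant, expressed directly in terms of MPI_Scatter and MPI_Allgather, is shown below. The function name is ours, count is assumed divisible by p, and the library is of course free to choose its own internal algorithms for the two calls; the point is only the decomposition.

```c
#include <mpi.h>
#include <stdlib.h>

/* Broadcast of `count` doubles as scatter of p segments + allgather. */
static void bcast_scatter_allgather(double *buf, int count, int root, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int seg = count / p;                 /* assumes p divides count */

    double *piece = malloc(seg * sizeof *piece);

    /* scatter: segment i of the root's buffer goes to rank i */
    MPI_Scatter(buf, seg, MPI_DOUBLE, piece, seg, MPI_DOUBLE, root, comm);

    /* allgather: every rank reassembles the full buffer from the p segments */
    MPI_Allgather(piece, seg, MPI_DOUBLE, buf, seg, MPI_DOUBLE, comm);

    free(piece);
}
```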

MPICH Algorithms
Allgather:
- Bruck algorithm (a variation of dissemination): < 80 KB and non-power-of-two process counts
- Recursive doubling: < 512 KB, power-of-two process counts
- Ring: > 512 KB, and 80-512 KB for any number of processes
Broadcast:
- Binomial: < 12 KB
- Binomial scatter + ring allgather: > 512 KB
Alltoall:
- Bruck's algorithm: < 256 bytes
- Post all irecvs and isends: medium-size messages, 256 bytes - 32 KB
- Pairwise exchange: long messages and power-of-two process counts; p-1 steps, where in each step k process i exchanges data with (i xor k)
- For non-power-of-two counts: an algorithm in which, in each step k, process i sends data to (i+k) and receives from (i-k)
A sketch of this kind of size-based selection is given below.
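To make the selection logic concrete, here is a small sketch using the allgather thresholds listed above. The enum and function names are hypothetical and this is not the actual MPICH decision code.

```c
#include <stddef.h>

/* Size-based allgather algorithm selection (thresholds as on the slide). */
typedef enum { ALLGATHER_BRUCK, ALLGATHER_RECURSIVE_DOUBLING, ALLGATHER_RING } allgather_alg;

static allgather_alg choose_allgather(size_t total_bytes, int p)
{
    int power_of_two = (p & (p - 1)) == 0;
    if (!power_of_two && total_bytes < 80 * 1024)
        return ALLGATHER_BRUCK;                 /* short, non-power-of-two */
    if (power_of_two && total_bytes < 512 * 1024)
        return ALLGATHER_RECURSIVE_DOUBLING;    /* < 512 KB, power of two */
    return ALLGATHER_RING;                      /* long messages, or 80-512 KB otherwise */
}
```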

MPICH Algorithms
Reduce-scatter:
- For commutative operations: recursive halving (< 512 KB), pairwise exchange (> 512 KB; p-1 steps; rank+i at step i)
- For non-commutative operations: recursive doubling (< 512 bytes), pairwise exchange (> 512 bytes)
Reduce:
- For pre-defined operations: binomial algorithm (< 2 KB), Rabenseifner (> 2 KB)
- For user-defined operations: binomial algorithm
AllReduce:
- For pre-defined operations: recursive doubling (short messages), Rabenseifner (long messages)
- For user-defined operations: recursive doubling

References
- Thakur et al. Optimization of Collective Communication Operations in MPICH. IJHPCA, 2005.
- Thakur et al. Improving the Performance of Collective Operations in MPICH. EuroPVM/MPI, 2003.