1 Basic Communication Operations Carl Tropper Department of Computer Science

2 Building Blocks in Parallel Programs
Common interactions between processes in parallel programs: broadcast, reduction, prefix sum, ....
Study their behavior on standard networks: linear arrays, rings, meshes, hypercubes.
Derive expressions for their time complexity.
–Assumptions: a node can send/receive only one message per link at a time, can send on one link while receiving on another, cut-through routing, bidirectional links.
–Message transfer time on a link: t_s + t_w m, i.e. start-up time plus per-word transfer time for an m-word message.

3 Broadcast and Reduction
One-to-all broadcast; all-to-one reduction.
p processes have m words apiece; an associative operation (sum, product, max, min) is applied to each word.
Both are used in Gaussian elimination, shortest-path algorithms, and inner products of vectors.

4 One-to-all Broadcast on a Ring or Linear Array
Naïve approach: the source sends p-1 separate messages, one to each other process. Inefficient.
Recursive doubling (below): the source first sends to the node farthest away (half the ring), then each holder forwards half the remaining distance, .... log p steps.
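As a rough sketch, the recursive-doubling pattern can be simulated in a few lines of Python (the function name and return values are my own choices, not from the slides):

```python
def one_to_all_broadcast(p, source=0):
    """Recursive-doubling broadcast on a p-node ring (p a power of two).

    In step k, every node that already holds the message forwards it to
    the node p / 2^k positions away, so the set of holders doubles."""
    has_msg = {source}
    dist = p // 2
    steps = 0
    while dist >= 1:
        for node in list(has_msg):
            has_msg.add((node + dist) % p)
        dist //= 2
        steps += 1
    return steps, has_msg
```

On 8 nodes this reaches everyone in 3 = log 8 steps, matching the slide.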

5 Reduction on a Linear Array
Odd nodes send to the preceding even nodes (1 to 0, 3 to 2, 5 to 4, 7 to 6); then 2 sends to 0 and 6 sends to 4 concurrently; finally 4 sends to 0.
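A minimal Python sketch of this reduction, assuming sum as the associative operation (the function name is illustrative):

```python
def all_to_one_reduction(values):
    """Sum-reduction to node 0 on a linear array (len(values) a power of two).

    Mirror image of recursive-doubling broadcast: first the odd nodes send
    to the preceding even nodes, then the sending distance doubles."""
    p = len(values)
    vals = list(values)
    dist = 1
    while dist < p:
        # nodes at odd multiples of `dist` send their partial sum leftward
        for node in range(dist, p, 2 * dist):
            vals[node - dist] += vals[node]
        dist *= 2
    return vals[0]
```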

6 Matrix-vector Multiplication on an n × n Mesh
Broadcast the input vector to each row; each row then performs a reduction.

7 Broadcast on a Mesh
Builds on the linear-array algorithm: first a one-to-all broadcast from the source (0) along its row (to nodes 4, 8, 12), then one-to-all broadcasts within the columns.

8 Broadcast on a Hypercube
The source first sends across the highest dimension (msb); then source and destination both send across the next lower dimension; ...; finally, every node that holds the message sends across the lowest dimension.
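The dimension-by-dimension doubling can be sketched with XOR arithmetic on node labels (a simulation sketch, not slide code):

```python
def hypercube_broadcast(d, source=0):
    """One-to-all broadcast on a d-dimensional hypercube.

    Step k crosses dimension d-1-k (msb first); every current holder sends
    to its neighbor across that dimension, so the set of holders doubles
    and the broadcast finishes in d = log2(p) steps."""
    has_msg = {source}
    for bit in reversed(range(d)):
        for node in list(has_msg):
            has_msg.add(node ^ (1 << bit))
    return has_msg
```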

9 Broadcast on a Tree
Same idea as on the hypercube. Grey circles are switches. Start by sending from 0 to 4; the rest is the same as the hypercube algorithm.

13 All-to-all Broadcast on a Linear Array/Ring
Idea: keep every node busy in a circular transfer; in each step, each node forwards the message it received in the previous step to its neighbor.
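The circular transfer can be sketched as follows (a simulation under the assumption that all messages travel in the same direction; names are my own):

```python
def all_to_all_broadcast_ring(p):
    """All-to-all broadcast on a p-node ring.

    For p-1 steps every node forwards, to its successor, the message it
    received in the previous step, so every link is busy in every step."""
    known = [{i} for i in range(p)]       # messages node i has seen so far
    in_flight = list(range(p))            # message each node forwards next
    for _ in range(p - 1):
        incoming = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            known[i].add(incoming[i])
        in_flight = incoming
    return known
```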

14 All to all broadcast on ring

16 All-to-all Broadcast
Mesh: each row performs an all-to-all broadcast; then nodes perform an all-to-all broadcast of the concatenated messages within their columns.
Hypercube: pairs of nodes exchange their accumulated messages in each dimension in turn. Requires log p steps.

19 All to all broadcast on a mesh

21 All-to-all Broadcast on a Hypercube
A hypercube of dimension d is composed of two hypercubes of dimension d-1.
Start with the one-dimensional subcubes: adjacent nodes exchange messages.
Then adjacent nodes in the two-dimensional subcubes exchange their accumulated messages.
In general, adjacent nodes in each next-higher-dimensional subcube exchange accumulated messages.
Requires log p steps.
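The pairwise exchange per dimension can be sketched like this (simulation only; the accumulated "message" is modeled as a set of node labels):

```python
def all_to_all_broadcast_hypercube(d):
    """All-to-all broadcast on a 2^d-node hypercube.

    In step i, partners across dimension i exchange everything they have
    accumulated so far, so the message size doubles in each step."""
    p = 1 << d
    buf = {node: {node} for node in range(p)}
    for bit in range(d):
        snapshot = {node: set(s) for node, s in buf.items()}
        for node in range(p):
            buf[node] |= snapshot[node ^ (1 << bit)]
    return buf
```

After d = log p steps every node holds all p messages.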

22 All to all broadcast on a hypercube

23 All-to-all Broadcast Time
Ring or linear array: T = (t_s + t_w m)(p-1).
Mesh:
–1st phase: √p simultaneous all-to-all broadcasts among √p nodes each, taking time (t_s + t_w m)(√p - 1).
–2nd phase: each message now has size m√p, so the phase takes time (t_s + t_w m√p)(√p - 1).
–Total: T = 2 t_s(√p - 1) + t_w m(p-1).
Hypercube: the message size in step i is 2^(i-1) m, so it takes t_s + 2^(i-1) t_w m for a pair of nodes to send and receive messages, and
T = Σ_{i=1}^{log p} (t_s + 2^(i-1) t_w m) = t_s log p + t_w m(p-1).
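The three cost expressions are easy to encode directly (function names are my own; the formulas are the slide's):

```python
import math

def bcast_time_ring(ts, tw, m, p):
    """All-to-all broadcast time on a ring: (t_s + t_w m)(p - 1)."""
    return (ts + tw * m) * (p - 1)

def bcast_time_mesh(ts, tw, m, p):
    """All-to-all broadcast time on a sqrt(p) x sqrt(p) mesh:
    2 t_s (sqrt(p) - 1) + t_w m (p - 1)."""
    sp = math.isqrt(p)
    return 2 * ts * (sp - 1) + tw * m * (p - 1)

def bcast_time_hypercube(ts, tw, m, p):
    """All-to-all broadcast time on a hypercube: t_s log p + t_w m (p - 1)."""
    return ts * math.log2(p) + tw * m * (p - 1)
```

With t_s = 0 all three agree, which is exactly the observation on the next slide.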

24 Observation on All-to-all Broadcast Time
Neglecting start-up time, the send time is the same for all 3 architectures: t_w m(p-1).
This is also a lower bound on the communication time for all 3 architectures, since every node must receive m(p-1) words.

26 All-Reduce Operation
All nodes start with a buffer of size m; an associative operation is performed across all buffers, and all nodes get the same result.
Semantically equivalent to an all-to-one reduction followed by a one-to-all broadcast.
All-reduce with a one-word buffer implements barrier synchronization for message-passing machines: no node can finish the reduction before all nodes have contributed to it.
Implement all-reduce using all-to-all broadcast, but add message contents instead of concatenating messages.
On a hypercube, T = (t_s + t_w m) log p for the log p steps, because the message size does not double in each dimension.
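Replacing concatenation with addition in the hypercube all-to-all broadcast gives the all-reduce; a sketch assuming sum as the operation:

```python
def all_reduce_sum(values):
    """All-reduce (sum) on a hypercube with len(values) = 2^d nodes.

    Same exchange pattern as all-to-all broadcast, but partners add
    message contents instead of concatenating, so messages stay size m."""
    p = len(values)
    d = p.bit_length() - 1
    vals = list(values)
    for bit in range(d):
        snapshot = vals[:]
        for node in range(p):
            vals[node] += snapshot[node ^ (1 << bit)]
    return vals           # every node ends up holding the same total
```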

27 Prefix Sum Operation
Prefix sums (scans) are all the partial sums s_k of p numbers n_0, ..., n_{p-1}, one number living on each node. Node k starts out with n_k and winds up with s_k.
Modify all-to-all broadcast: each node only adds in the partial sums coming from nodes with smaller labels.

28 Prefix Sum on Hypercube

29 Prefix sum on hypercube
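The hypercube prefix sum shown above can be sketched by carrying two quantities per node: the node's own result and a buffer holding the sum over its current subcube (a simulation sketch; names are my own):

```python
def prefix_sum_hypercube(values):
    """Prefix sums on a 2^d-node hypercube.

    In each dimension, partners exchange their subcube-sum buffers; the
    received buffer always enters a node's own buffer, but enters its
    result only when the partner has a smaller label."""
    p = len(values)
    d = p.bit_length() - 1
    result = list(values)
    buf = list(values)
    for bit in range(d):
        recv = [buf[node ^ (1 << bit)] for node in range(p)]
        for node in range(p):
            buf[node] += recv[node]
            if (node ^ (1 << bit)) < node:
                result[node] += recv[node]
    return result
```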

30 Scatter and Gather
Scatter (one-to-all personalized communication): the source sends a unique message to each destination (vs. broadcast, where every destination gets the same message).
–Hypercube: follow the one-to-all broadcast pattern. At each step, a node transfers half of its messages, those destined for the other subcube, to its neighbor across the current dimension, keeping the other half. log p steps.
Gather: one node collects a unique message from every other node.
–Hypercube: odd-numbered nodes send their buffers to the even-numbered neighbors in the other (lower-dimensional) subcube. Continue....
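The scatter's halving of the load per step can be sketched as below (a simulation under the assumption that messages are indexed by destination label; names are illustrative):

```python
def scatter_hypercube(messages, d):
    """Scatter 2^d distinct messages from node 0 on a d-dim hypercube.

    Going msb-first, each holder passes the half of its load destined for
    the other subcube to its neighbor across the current dimension."""
    p = 1 << d
    hold = {0: dict(enumerate(messages))}     # node -> {destination: message}
    for bit in reversed(range(d)):
        for node in list(hold):
            partner = node ^ (1 << bit)
            # messages whose destination lies in the partner's subcube
            moving = {dst: msg for dst, msg in hold[node].items()
                      if dst & (1 << bit) != node & (1 << bit)}
            hold[partner] = moving
            for dst in moving:
                del hold[node][dst]
    return hold     # each node ends holding exactly its own message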

31 Scatter operation on hypercube

32 All-to-all Personalized Communication
Each node sends a distinct message to every other node. Used in FFT, matrix transpose, sample sort.
Matrix transpose: A^T[i,j] = A[j,i], 0 ≤ i, j < n. Put row i, {(i,0), (i,1), ..., (i,n-1)}, on processor P_i. In the transpose, (i,0) goes to P_0, (i,1) goes to P_1, ..., (i,n-1) goes to P_{n-1}.
Every processor sends a distinct element to every other processor!
All-to-all personalized communication on ring, mesh, hypercube.

33
Ring: all nodes send messages in the same direction. Each node sends p-1 messages of size m to its neighbor; nodes extract the message(s) addressed to them and forward the rest.
Mesh: √p × √p mesh. Each node groups its messages by the column of their destination; all-to-all personalized communication within each row; upon arrival, messages are sorted by destination row; all-to-all personalized communication within each column.
Hypercube: a p-node hypercube has p/2 links in each dimension, connecting two subcubes of p/2 nodes each. In each step, every node exchanges p/2 of its messages across the current dimension.

36 All-to-all Personalized Communication: Optimal Algorithm on a Hypercube
Nodes choose their exchange partners so as not to suffer link congestion.
In step j, node i exchanges messages with node i XOR j: in the first step all nodes differing in the lsb exchange messages, ..., in the last step all nodes differing in the msb exchange messages.
E-cube routing in the hypercube implements this. The (Hamming) distance between nodes i and j is the number of non-zero bits in i XOR j: sort the links corresponding to the non-zero bits in ascending order and route the message along those links.
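Both the pairing rule and E-cube routing are a few lines of bit arithmetic (a sketch; function names are my own):

```python
def exchange_schedule(d):
    """Step j (1 <= j < p) pairs node i with node i XOR j, so every node
    is exchanging in every step and each step is a perfect matching."""
    p = 1 << d
    return [[i ^ j for i in range(p)] for j in range(1, p)]

def ecube_route(src, dst):
    """E-cube route: cross the differing dimensions in ascending order."""
    path = [src]
    node = src
    diff = src ^ dst
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit
            path.append(node)
        diff >>= 1
        bit += 1
    return path
```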

37 Optimal algorithm picture

38 Optimal algorithm: T = (t_s + t_w m)(p-1)

39 Circular q-Shift
Node i sends its packet to node (i + q) mod p.
Useful in string and image pattern matching, matrix computations, ....
Example: a 5-shift on a 4 × 4 mesh. Every node first moves q mod √p = 1 step to the right, then ⌊q/√p⌋ = 1 step upward. The left column shifts up one step before the general upshift, to make up for the wraparound effect in the first step.

40 Mesh Circular Shift

41 Circular Shift on a Hypercube
Map the linear array onto the hypercube: map node i to node j, where j is the d-bit reflected Gray code of i.
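The reflected Gray code has a closed form, i XOR (i >> 1), which guarantees that consecutive array positions (including the wraparound) land on hypercube nodes that differ in exactly one bit:

```python
def gray(i):
    """d-bit reflected Gray code of i: neighbors in the linear array map
    to hypercube nodes at Hamming distance 1."""
    return i ^ (i >> 1)
```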

44 What happens if we split messages into packets?
One-to-all broadcast becomes a scatter operation followed by an all-to-all broadcast.
–Scatter: t_s log p + t_w (m/p)(p-1) on a hypercube.
–All-to-all broadcast of messages of size m/p: t_s log p + t_w (m/p)(p-1) on a hypercube.
–T ≈ 2 (t_s log p + t_w m) on a hypercube.
Bottom line: double the start-up cost, but the t_w cost is reduced by a factor of (log p)/2.
All-to-one reduction is the dual of one-to-all broadcast, so it becomes an all-to-all reduction followed by a gather.
All-reduce: combine the above two.

