Slide 1: Lecture 6 Objectives
Communication complexity analysis
Collective operations
–Reduction
–Binomial trees
–Gather and scatter operations
Review: communication analysis of Floyd's algorithm
Slide 2: Parallel Reduction Evolution
Slide 3: Binomial Trees (a binomial tree is a subgraph of the hypercube)
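The subgraph claim can be checked directly: in the usual binomial-tree construction, each non-root node's parent differs from it in exactly one bit, which is precisely the condition for a hypercube edge. A minimal sketch (the function names `parent` and `hamming` are my own, not from the lecture):

```python
def parent(i):
    """Parent of node i in a binomial tree rooted at 0:
    clear the lowest set bit of i."""
    return i & (i - 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

d = 4                      # hypercube dimension: 2**d nodes
nodes = range(1, 2 ** d)   # every node except the root has a parent
# every binomial-tree edge is also a hypercube edge
assert all(hamming(i, parent(i)) == 1 for i in nodes)
```

Because clearing a single set bit changes exactly one bit, the check holds for any dimension d.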
Slide 4: Finding Global Sum. The 16 starting values, one per process:
 4   2   0   7
-3   5  -6  -3
 8   1   2   3
-4   4   6  -1
Slide 5: Finding Global Sum. After the first exchange, 8 partial sums:
 1   7  -6   4
 4   5   8   2
Slide 6: Finding Global Sum. After the second exchange, 4 partial sums:
 8  -2
 9  10
Slide 7: Finding Global Sum. After the third exchange, 2 partial sums: 17 and 8.
Slide 8: Finding Global Sum. The final exchange yields the global sum, 25; the communication pattern is a binomial tree.
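The four reduction steps above can be simulated sequentially. A minimal sketch (no actual message passing; partners are paired 2·step apart rather than in exactly the layout drawn on the slides, but the number of steps and the result are the same):

```python
def binomial_tree_sum(values):
    """Simulate a binomial-tree reduction of len(values) processes.

    values[i] is the local value on process i; the process count
    must be a power of two.  At each step, process i receives from
    process i + step, so half the processes drop out per step and
    process 0 holds the total after log2(p) steps.
    """
    partial = list(values)
    p = len(partial)
    step = 1
    while step < p:
        for i in range(0, p, 2 * step):
            partial[i] += partial[i + step]   # i + step sends to i
        step *= 2
    return partial[0]

# the 16 values from the slides; the global sum is 25
values = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
assert binomial_tree_sum(values) == 25 == sum(values)
```

With 16 processes the reduction finishes in log2(16) = 4 communication steps, matching the four slides.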
Slide 9: Agglomeration
Slide 10: (figure: each agglomerated group of values is reduced to a local sum)
Slide 11: Gather
Slide 12: All-gather
Slide 13: Complete Graph for All-gather
Slide 14: Hypercube for All-gather
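The hypercube all-gather can be sketched as a sequential simulation: in dimension k, every process exchanges everything it currently holds with its neighbor across that dimension, so each process's data doubles per step and all p items reach every process in log2(p) steps. (The function name is my own; this is not MPI code.)

```python
def hypercube_allgather(items):
    """Simulate all-gather on a hypercube of p = 2**d processes.

    items[i] is the single item process i starts with; returns the
    set of items each process holds at the end.
    """
    p = len(items)
    data = [{x} for x in items]
    step = 1
    while step < p:
        new = [set(s) for s in data]
        for i in range(p):
            new[i] |= data[i ^ step]   # pairwise exchange across one dimension
        data = new
        step <<= 1
    return data

result = hypercube_allgather(list("abcdefgh"))
# after log2(8) = 3 steps, every process holds all eight items
assert all(s == set("abcdefgh") for s in result)
```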
Slide 15: Analysis of Communication
λ (lambda) is the latency: the message delay, i.e. the overhead to send one message.
β (beta) is the bandwidth: the number of data items that can be transferred per unit time.
Sending a message with n data items therefore costs λ + n/β.
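The cost model on this slide is a one-liner; a minimal sketch using the lecture's example parameters (λ = 250 μs, β = 10^7 items/s; the function name is my own):

```python
def message_time(lam, beta, n):
    """Time to send one message carrying n data items:
    latency lam plus n / beta transmission time."""
    return lam + n / beta

# 250 us latency, 10**7 items/s bandwidth, 1000-item message
t = message_time(250e-6, 1e7, 1000)
assert abs(t - 350e-6) < 1e-12   # 250 us latency + 100 us transmission
```

Note that for short messages the latency term dominates, which is why the collective algorithms below try to minimize the number of messages, not just the volume.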
Slide 16: Communication Time for All-Gather
Hypercube: ⌈log p⌉ · λ + n(p-1)/(pβ)  (log p message start-ups; each process receives the other n(p-1)/p items in total)
Complete graph: (p-1) · (λ + n/(pβ))  (p-1 exchanges of one n/p-item block each)
Slide 17: Adding Data Input
Slide 18: Scatter
Slide 19: Scatter in log p Steps
Step 0: one process holds all eight blocks: 1 2 3 4 5 6 7 8
Step 1: it sends half to a partner: 1 2 3 4 | 5 6 7 8
Step 2: each holder again sends half: 1 2 | 3 4 | 5 6 | 7 8
Step 3: after log 8 = 3 steps, every process holds exactly one block.
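The halving pattern above can be simulated directly; a minimal sequential sketch (no MPI; the function name is my own, and partners are taken 2^k positions apart, which may differ from the exact node labels drawn on the slide):

```python
def scatter_log_steps(items, p):
    """Simulate scatter in log2(p) steps; p must be a power of two
    and len(items) == p.

    Process 0 starts with all p blocks.  At each step, every process
    that holds data sends the upper half of its blocks to a partner,
    so after log2(p) steps process i holds exactly block i.
    """
    data = [[] for _ in range(p)]
    data[0] = list(items)
    step = p // 2
    while step >= 1:
        for i in range(0, p, 2 * step):
            half = len(data[i]) // 2
            data[i + step] = data[i][half:]   # send upper half away
            data[i] = data[i][:half]          # keep lower half
        step //= 2
    return data

result = scatter_log_steps([1, 2, 3, 4, 5, 6, 7, 8], 8)
assert result == [[1], [2], [3], [4], [5], [6], [7], [8]]
```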
Slide 20: Communication Time for Scatter
Hypercube: ⌈log p⌉ · λ + n(p-1)/(pβ)  (log p steps; in total n(p-1)/p items leave the root)
Complete graph: (p-1) · (λ + n/(pβ))
Slide 21: Recall Parallel Floyd's Computational Complexity
Innermost loop has complexity Θ(n)
Middle loop executed at most ⌈n/p⌉ times
Outer loop executed n times
Overall computational complexity: Θ(n³/p)
Slide 22: Floyd's Communication Complexity
No communication in inner loop
No communication in middle loop
Broadcast in outer loop: complexity Θ(n log p)
Executed n times, so total communication complexity is Θ(n² log p)
Slide 23: Execution Time Expression (1)
T1 = n · ⌈n/p⌉ · n · χ + n · ⌈log p⌉ · (λ + 4n/β)
The first product is iterations of the outer loop (n) × iterations of the middle loop (⌈n/p⌉) × iterations of the inner loop (n) × cell update time (χ). The second is iterations of the outer loop (n) × messages per broadcast (⌈log p⌉) × message-passing time (λ + 4n/β, at 4 bytes per matrix element).
Slide 24: Accounting for Computation/Communication Overlap
Note that after the first broadcast, all the wait times overlap the computation time of process 0.
Slide 25: Execution Time Expression (2)
The computation term is unchanged: n · ⌈n/p⌉ · n · χ. With overlap, the communication term keeps only what cannot be hidden: the per-message latency, n · ⌈log p⌉ · λ, plus the message-transmission time 4n/β for the messages that are not overlapped by computation.
Slide 26: Predicted vs. Actual Performance (χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000)

Processes | Predicted (sec) | Actual (sec)
        1 |           25.54 |
        2 |           13.02 |        13.89
        3 |            9.01 |         9.60
        4 |            6.89 |         7.29
        5 |            5.86 |         5.99
        6 |            5.01 |         5.16
        7 |            4.40 |         4.50
        8 |            3.94 |         3.98
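The predicted column can be approximated from the slide's parameters. The exact expression on the slide did not survive extraction, so the formula below is a reconstruction consistent with the term labels of Expression (2): the computation term plus the unhidden broadcast latency; the transmission term is assumed overlapped and omitted. Variable names are my own.

```python
from math import ceil, log2

# parameters from the slide
chi  = 25.5e-9   # cell update time (s)
lam  = 250e-6    # message latency (s)
n    = 1000      # matrix order

def predicted(p):
    """Reconstructed model for parallel Floyd's algorithm:
    n * ceil(n/p) * n * chi computation, plus n broadcasts paying
    only ceil(log2 p) message latencies each (transmission time is
    assumed hidden by computation/communication overlap)."""
    comp = n * ceil(n / p) * n * chi
    comm = n * ceil(log2(p)) * lam
    return comp + comm

# reproduces the slide's predicted column to within a few hundredths
table = {1: 25.54, 2: 13.02, 3: 9.01, 4: 6.89,
         5: 5.86, 6: 5.01, 7: 4.40, 8: 3.94}
assert all(abs(predicted(p) - t) < 0.05 for p, t in table.items())
```

The model tracks the measured times closely for larger p, where communication overlap matters most; the small residual gap at low p suggests the slide's χ is itself a rounded measured value.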
Slide 27: Summary
Two matrix decompositions
–Rowwise block striped
–Columnwise block striped
Blocking send/receive functions
–MPI_Send
–MPI_Recv
Overlapping communication with computation