Lecture 6 Objectives
Communication Complexity Analysis
Collective Operations
–Reduction
–Binomial Trees
–Gather and Scatter Operations
Review: Communication Analysis of Floyd's Algorithm
Parallel Reduction Evolution
Binomial Trees
Subgraph of a hypercube
Finding Global Sum
[Figure: successive pairwise-addition steps across the processes; the final global sum, 25, ends up at the root of a binomial tree]
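This pattern can be written out directly with point-to-point messages. A minimal sketch (not from the slides), assuming p is a power of two; each process's value (id + 1) is a made-up stand-in for real data:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int value = id + 1;          /* this process's local value (placeholder) */

    /* In each step the low half of every group receives from the high
       half, so the sum converges on process 0 in ceil(log2 p) steps. */
    for (int half = p / 2; half > 0; half /= 2) {
        if (id < half) {
            int partner_value;
            MPI_Recv(&partner_value, 1, MPI_INT, id + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value += partner_value;
        } else if (id < 2 * half) {
            MPI_Send(&value, 1, MPI_INT, id - half, 0, MPI_COMM_WORLD);
            break;               /* this process has passed on its sum */
        }
    }
    if (id == 0) printf("Global sum = %d\n", value);
    MPI_Finalize();
    return 0;
}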
Agglomeration
[Figure: binomial-tree tasks agglomerated onto processes; each process computes a local sum before communicating]
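Once the tasks are agglomerated so that each process reduces its own block first, the cross-process combination is exactly what MPI_Reduce provides. A minimal sketch with placeholder local data:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* agglomeration: each process first sums its own assigned values */
    int local[4] = { id, id + 1, id + 2, id + 3 };   /* placeholder data */
    int local_sum = 0, global_sum;
    for (int i = 0; i < 4; i++) local_sum += local[i];

    /* one collective call combines the partial sums (implementations
       typically use a tree like the one on the previous slides) */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (id == 0) printf("Global sum = %d\n", global_sum);
    MPI_Finalize();
    return 0;
}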
Gather
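A minimal sketch of a gather using the MPI collective; the per-process item and buffer size are illustrative:

#include <mpi.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int item = id * 10;            /* one local item per process */
    int all[128];                  /* root's buffer; assumes p <= 128 */

    /* concatenate every process's item on process 0, in rank order */
    MPI_Gather(&item, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}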
All-gather
Complete Graph for All-gather
Hypercube for All-gather
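The hypercube all-gather can also be written out by hand: in each of the ⌈log p⌉ steps a process swaps its entire current buffer with the partner across one hypercube dimension, doubling the data it holds. A sketch assuming p is a power of two (at most 64) and made-up item values; in practice a single MPI_Allgather call does the same job:

#include <mpi.h>

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int buf[64];             /* holds up to p items; assumes p <= 64 */
    buf[0] = id * id;        /* this process's single item (placeholder) */
    int count = 1;           /* items currently held */

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = id ^ mask;       /* neighbor across dimension */
        MPI_Sendrecv(buf, count, MPI_INT, partner, 0,
                     buf + count, count, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        count *= 2;          /* data held doubles each step */
    }
    /* every process now holds all p items, though not in rank order */
    MPI_Finalize();
    return 0;
}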
Analysis of Communication
λ (lambda): latency = message delay = the overhead to send one message
β (beta): bandwidth = the number of data items that can be sent per unit time
Sending a message with n data items therefore costs λ + n/β
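For example, with the parameter values used on the performance slide at the end of this lecture (λ = 250 μsec, β = 10 MB/sec, treating data items as bytes), a message carrying 1000 four-byte integers costs about
λ + n/β = 250 μsec + 4000/10 μsec = 250 + 400 = 650 μsec
so at this message size latency and transmission time are comparable.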
Communication Time for All-Gather
Hypercube: ⌈log p⌉ steps, with the data doubling each step:
Σ_{i=1}^{⌈log p⌉} (λ + 2^(i−1) n/(pβ)) = λ⌈log p⌉ + n(p−1)/(pβ)
Complete graph: each process exchanges its n/p items with the other p−1 processes:
(p−1)(λ + n/(pβ))
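Plugging in the same illustrative values (λ = 250 μsec, β = 10 MB/sec, p = 8, n = 4000 bytes total, i.e. 1000 four-byte items):
Hypercube: 3(250) + 4000(7/8)/10 = 750 + 350 = 1100 μsec
Complete graph: 7(250 + 4000/(8·10)) = 7(250 + 50) = 2100 μsec
The hypercube wins here because it pays the latency λ only ⌈log p⌉ times rather than p−1 times.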
Adding Data Input
Scatter
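One MPI collective call distributes a distinct block of the root's array to each process. A minimal sketch with illustrative sizes (N divisible by p):

#include <mpi.h>

#define N 64            /* total items; assumes N divisible by p */

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int all[N], mine[N];
    if (id == 0)                        /* root fills the full array */
        for (int i = 0; i < N; i++) all[i] = i;

    /* block i of size N/p goes to process i */
    MPI_Scatter(all, N / p, MPI_INT, mine, N / p, MPI_INT, 0,
                MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}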
Scatter in log p Steps
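The log p-step algorithm can be written by hand: at each step every process that already holds data passes the upper half of its block across the current hypercube dimension, so the number of holders doubles. A sketch assuming p is a power of two and N divisible by p:

#include <mpi.h>

#define N 64   /* total items; assumes N divisible by p */

int main(int argc, char *argv[]) {
    int id, p, count = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int buf[N];
    if (id == 0) {                      /* root starts with everything */
        for (int i = 0; i < N; i++) buf[i] = i;
        count = N;
    }

    for (int mask = p / 2; mask > 0; mask /= 2) {
        if (id % (2 * mask) == 0) {            /* holder: send upper half */
            MPI_Send(buf + count / 2, count / 2, MPI_INT,
                     id + mask, 0, MPI_COMM_WORLD);
            count /= 2;
        } else if (id % (2 * mask) == mask) {  /* receives its block now */
            MPI_Status status;
            MPI_Recv(buf, N, MPI_INT, id - mask, 0, MPI_COMM_WORLD,
                     &status);
            MPI_Get_count(&status, MPI_INT, &count);
        }
    }
    /* each process now holds its N/p contiguous items in buf[0..count-1] */
    MPI_Finalize();
    return 0;
}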
Communication Time for Scatter
Hypercube: ⌈log p⌉ steps, with the message size halving each step:
λ⌈log p⌉ + n(p−1)/(pβ)
Complete graph: the root sends a separate block of n/p items to each of the other p−1 processes:
(p−1)(λ + n/(pβ))
Recall Parallel Floyd's Computational Complexity
Innermost loop has complexity Θ(n)
Middle loop executed at most ⌈n/p⌉ times
Outer loop executed n times
Overall computational complexity: Θ(n³/p)
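These counts correspond to the inner two loops of the rowwise block-striped update. A sketch (names illustrative) of one outer-loop iteration's computation:

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Inner two loops for one iteration k of the outer loop.
   a: this process's local_rows x n block of the distance matrix,
   local_rows <= ceil(n/p); row_k: broadcast copy of row k.
   Work per call: Theta(n * n/p). */
void floyd_update(int n, int local_rows, int k,
                  int a[local_rows][n], const int row_k[n])
{
    for (int i = 0; i < local_rows; i++)   /* middle loop: <= ceil(n/p) */
        for (int j = 0; j < n; j++)        /* inner loop: Theta(n) */
            a[i][j] = MIN(a[i][j], a[i][k] + row_k[j]);
}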
Floyd’s Communication Complexity No communication in inner loop No communication in middle loop Broadcast in outer loop — complexity is Executed n times
Execution Time Expression (1)
n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)
Computation term: n = iterations of outer loop, ⌈n/p⌉ = iterations of middle loop, n = iterations of inner loop, χ = cell update time
Communication term: n = iterations of outer loop, ⌈log p⌉ = messages per broadcast, λ = message-passing latency, 4n/β = transmission time (4 bytes per matrix element)
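As a sanity check, evaluating expression (1) with the parameter values from the performance slide below (χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000) at p = 8:
Computation: n ⌈n/p⌉ n χ = 1000 · 125 · 1000 · 25.5 nsec ≈ 3.19 sec
Communication: n ⌈log p⌉ (λ + 4n/β) = 1000 · 3 · (250 + 400) μsec = 1.95 sec
Total ≈ 5.14 sec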
Accounting for Computation/Communication Overlap
Note that after the 1st broadcast, all the wait times overlap the computation time of Process 0.
Execution Time Expression (2)
n ⌈n/p⌉ n χ + ⌈log p⌉ λ + n ⌈log p⌉ 4n/β
Computation term: n = iterations of outer loop, ⌈n/p⌉ = iterations of middle loop, n = iterations of inner loop, χ = cell update time
Communication term: only the first broadcast's message-passing latency (⌈log p⌉ λ) is paid in full; the later waits overlap computation, leaving the message transmission time n ⌈log p⌉ 4n/β
Predicted vs. Actual Performance
χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000
[Table: execution time (sec), predicted vs. actual, by number of processes]
Summary
Two matrix decompositions
–Rowwise block striped
–Columnwise block striped
Blocking send/receive functions
–MPI_Send
–MPI_Recv
Overlapping communications with computations