2a.1 Evaluating Parallel Programs Cluster Computing, UNC-Charlotte, B. Wilkinson.


1 2a.1 Evaluating Parallel Programs Cluster Computing, UNC-Charlotte, B. Wilkinson.

2 2a.2 Sequential execution time, $t_s$: Estimate by counting the computational steps of the best sequential algorithm. Parallel execution time, $t_p$: In addition to the number of computational steps, $t_{comp}$, we need to estimate the communication overhead, $t_{comm}$: $t_p = t_{comp} + t_{comm}$

3 2a.3 Computational Time Count the number of computational steps. When more than one process executes simultaneously, count the computational steps of the most complex process. Generally this is a function of n and p, i.e. $t_{comp} = f(n, p)$. The computation time is often broken down into parts, so that $t_{comp} = t_{comp1} + t_{comp2} + t_{comp3} + \dots$ The analysis is usually done assuming that all processors are the same and operate at the same speed.

4 2a.4 Communication Time Many factors, including network structure. As a first approximation, use $t_{comm} = t_{startup} + n\,t_{data}$. $t_{startup}$ is the startup time, essentially the time to send a message with no data; it is assumed to be constant. $t_{data}$ is the transmission time to send one data word, also assumed constant, and there are n data words.
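
The first-approximation model above can be written as a one-line function. This is a minimal sketch; the name `comm_time` and its parameter names are illustrative and do not come from the slides.

```c
/* Idealized time to send one message of n data words:
 *   t_comm = t_startup + n * t_data
 * All times are in units of one computational step, as the slides assume.
 * The function name and parameter names are hypothetical. */
double comm_time(double t_startup, double t_data, unsigned long n)
{
    return t_startup + (double)n * t_data;
}
```

For instance, with a startup time of 100 steps and 1 step per word, a 50-word message costs `comm_time(100.0, 1.0, 50)`, i.e. 150 steps, of which two thirds is startup.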

5 2a.5 Idealized Communication Time [Figure: communication time plotted against number of data items (n), with the startup time marked.] The equation for communication time ignores the fact that the source and destination may not be directly linked in a real system, so the message may pass through intermediate nodes. It also assumes that the overhead of including information other than data in the packet is constant and can be counted as part of the startup time.

6 2a.6 Final communication time, $t_{comm}$ The sum of the communication times of all sequential messages from one process, i.e. $t_{comm} = t_{comm1} + t_{comm2} + t_{comm3} + \dots$ The communication patterns of all processes are assumed to be the same and to take place together, so only one process need be considered. Both $t_{startup}$ and $t_{data}$ are measured in units of one computational step, so that $t_{comp}$ and $t_{comm}$ can be added together to obtain the parallel execution time, $t_p$.

7 2a.7 Communication Time of Broadcast/Gather If a broadcast is done over a single shared wire, as in Ethernet, the time complexity is O(1) for a single data item and O(w) for w data items. If a binary tree is used as the underlying network structure and a 1-to-N fan-out broadcast is used, what is the communication cost of sending w data items to p final destinations (leaf nodes)? –We assume that the left and right children receive the message from their parent sequentially. However, at each level, different parent nodes send out their messages at the same time.

8 2a.8 1-to-N fan-out Broadcast $t_{comm} = 2(\log p)(t_{startup} + w\,t_{data})$ The cost depends on the number of levels and the number of nodes at each level. This result is for a binary tree with p final destinations at the leaf level.
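
The broadcast formula can be evaluated numerically as follows. This sketch assumes the logarithm is base 2 (one level per doubling of the number of leaves) and rounds the level count up when p is not a power of two; both are assumptions, since the slide only writes "log p". The function names are illustrative.

```c
/* Number of levels below the root of a binary tree with p leaves,
 * i.e. ceil(log2(p)), computed with integers to avoid needing libm.
 * Assumes p is small enough that doubling does not overflow. */
unsigned tree_levels(unsigned long p)
{
    unsigned levels = 0;
    unsigned long reach = 1;      /* leaves reachable with `levels` levels */
    while (reach < p) {
        reach *= 2;
        levels++;
    }
    return levels;
}

/* t_comm = 2 * (log p) * (t_startup + w * t_data)
 * The factor 2 comes from each parent sending to its two children
 * one after the other; parents on the same level send in parallel. */
double tree_broadcast_time(unsigned long p, unsigned long w,
                           double t_startup, double t_data)
{
    return 2.0 * tree_levels(p) * (t_startup + (double)w * t_data);
}
```

With p = 8 leaves, w = 4 data items, a startup time of 10 and 1 step per item, the tree has 3 levels and the estimate is 2 × 3 × 14 = 84 steps.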

9 2a.9 Benchmark Factors With $t_s$, $t_{comp}$, and $t_{comm}$, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation. Both are functions of the number of processors, p, and the number of data elements, n.
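
The formulas for the two factors did not survive in this transcript. Assuming the conventional definitions, which are consistent with $t_p = t_{comp} + t_{comm}$ from slide 2a.2, they would read:

```latex
\begin{align*}
\text{Speedup factor} &= \frac{t_s}{t_p} = \frac{t_s}{t_{comp} + t_{comm}} \\[4pt]
\text{Computation/communication ratio} &= \frac{t_{comp}}{t_{comm}}
\end{align*}
```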

10 2a.10 These factors give an indication of the scalability of the parallel solution with increasing number of processors and problem size. The computation/communication ratio highlights the effect of communication as the problem size and system size increase. We want computation, rather than communication, to be the dominant factor: then, as n increases, communication can be ignored and adding more processors can improve performance.

11 2a.11 Example Adding n numbers using two computers, each adding n/2 numbers. The numbers are initially held in one computer (Computer 1). Computer 1 sends n/2 numbers to Computer 2: $t_{comm} = t_{startup} + (n/2)\,t_{data}$. Both computers add up their n/2 numbers: $t_{comp} = n/2$. Computer 2 sends its result back: $t_{comm} = t_{startup} + t_{data}$. Computer 1 adds the partial sums: $t_{comp} = 1$.

12 2a.12 Overall $t_{comm} = 2t_{startup} + (n/2 + 1)\,t_{data} = O(n)$ and $t_{comp} = n/2 + 1 = O(n)$, so the computation/communication ratio = O(1).
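
To see why an O(1) ratio is unfavorable, the totals for the two-computer example can be evaluated for concrete values. The struct and function names below are illustrative, and the parameter values in the usage note are made up for the sake of the arithmetic.

```c
/* Totals for adding n numbers on two computers (slides 2a.11 and 2a.12).
 * Times are in units of one computational step. Names are hypothetical. */
struct two_computer_cost {
    double t_comm;   /* 2*t_startup + (n/2 + 1)*t_data */
    double t_comp;   /* n/2 + 1 */
    double ratio;    /* t_comp / t_comm */
};

struct two_computer_cost add_on_two_computers(unsigned long n,
                                              double t_startup,
                                              double t_data)
{
    struct two_computer_cost c;
    double half = (double)n / 2.0;

    /* Send n/2 numbers out, then send one partial sum back. */
    c.t_comm = (t_startup + half * t_data) + (t_startup + t_data);
    /* Add n/2 numbers in parallel, then add the two partial sums. */
    c.t_comp = half + 1.0;
    c.ratio  = c.t_comp / c.t_comm;
    return c;
}
```

With n = 1000, a startup time of 100 and 1 step per number, this gives t_comm = 701 and t_comp = 501, a ratio of about 0.71. Because both terms grow linearly in n, the ratio stays below 1 however large n becomes, so communication is never negligible.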

13 2a.13 Another problem Computation time complexity = $O(n^2)$. Communication time complexity = $O(n)$. Computation/communication ratio = $O(n)$.

14 2a.14 Cost Cost = (execution time) $\times$ (number of processors). Cost of sequential computation = $t_s$. Cost of parallel computation = $t_p \times p$. Cost-optimal algorithm: one in which the cost of the parallel computation is proportional to the cost of the sequential computation, i.e. Cost = $t_p \times p = k \times t_s$, where k is a constant.

15 2a.15 Example Suppose $t_s = O(n \log n)$ for the best sequential program, where n = number of data items and p = number of processors. The parallel algorithm is cost-optimal if $t_p = O(n \log n)/p = O((n/p) \log n)$. It is not cost-optimal if $t_p = O(n^2/p)$. A parallel algorithm is cost-optimal if the parallel time complexity times the number of processors equals the sequential time complexity.
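
Multiplying each parallel time by p makes the two cases explicit. This is a worked restatement of the slide's example, not an additional result:

```latex
\begin{align*}
t_p = O\!\left(\tfrac{n}{p}\log n\right)
  &\;\Rightarrow\; t_p \times p = O(n \log n) = O(t_s)
  &&\text{(cost-optimal)} \\[4pt]
t_p = O\!\left(\tfrac{n^2}{p}\right)
  &\;\Rightarrow\; t_p \times p = O(n^2),\ \text{which grows faster than } O(n \log n)
  &&\text{(not cost-optimal)}
\end{align*}
```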

16 2a.16 Evaluating programs Measuring the execution time Time-complexity analysis gives insight into a parallel algorithm and is useful for comparing different algorithms, but we also want to know how the algorithm actually performs on a real system. We can measure the elapsed time in seconds between two points in the code. –Use system calls such as clock(), time(), or gettimeofday(), or the MPI routine MPI_Wtime(). –Example: L1: time(&t1); ... L2: time(&t2); elapsed_time = difftime(t2, t1);
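
Because time() counts whole seconds, short code regions usually need a finer clock. The sketch below uses gettimeofday(), one of the calls the slide lists, to time a region with microsecond resolution. The helper names `elapsed_seconds` and `time_region` are illustrative, and `fn` stands in for whatever code lies between L1 and L2.

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference end - start in seconds, at microsecond resolution. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Time a single call of fn, a hypothetical stand-in for the code
 * being measured between the two timing points. */
double time_region(void (*fn)(void))
{
    struct timeval t1, t2;

    gettimeofday(&t1, NULL);   /* L1 */
    fn();
    gettimeofday(&t2, NULL);   /* L2 */
    return elapsed_seconds(&t1, &t2);
}
```

gettimeofday() reports wall-clock time, so a measurement can be disturbed if the system clock is adjusted during the run. In an MPI program, MPI_Wtime() returns a double in seconds and fills the same role.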

17 2a.17 Communication Time by the Ping-Pong Method The point-to-point communication time of a specific system can be found using the ping-pong method. One process, p0, sends a message to another process, say p1. Immediately upon receiving the message, p1 sends it back to p0. The round-trip time is divided by two to obtain an estimate of the one-way communication time. For example, at p0: time(&t1); send(&x, p1); recv(&x, p1); time(&t2); elapsed_time = 0.5 * difftime(t2, t1);
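
The send() and recv() calls on the slide are pseudocode. As a runnable illustration of the same idea, the sketch below runs a ping-pong between a parent process (p0) and a forked child (p1) over a POSIX socket pair on one machine. This is a stand-in for real message passing, not a measurement of a cluster network: in an MPI program the same pattern would use MPI_Send, MPI_Recv and MPI_Wtime. All function names here are illustrative. Averaging over many round trips is an addition to the slide's single exchange, made so that the total is well above the clock resolution.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t k = write(fd, buf, len);
        if (k <= 0) return -1;
        buf += k;
        len -= (size_t)k;
    }
    return 0;
}

/* Read exactly len bytes; returns 0 on success, -1 on error or EOF. */
static int read_all(int fd, char *buf, size_t len)
{
    while (len > 0) {
        ssize_t k = read(fd, buf, len);
        if (k <= 0) return -1;
        buf += k;
        len -= (size_t)k;
    }
    return 0;
}

/* Estimated one-way time in seconds for a message of nbytes,
 * averaged over reps round trips. Returns -1.0 on failure. */
double pingpong_one_way(size_t nbytes, int reps)
{
    int sv[2], i, ok = 1;
    struct timeval t1, t2;
    double elapsed;
    pid_t pid;
    char *x;

    if (nbytes == 0 || reps <= 0) return -1.0;
    x = calloc(nbytes, 1);
    if (x == NULL) return -1.0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) { free(x); return -1.0; }

    pid = fork();
    if (pid < 0) { close(sv[0]); close(sv[1]); free(x); return -1.0; }

    if (pid == 0) {                    /* p1: echo each message straight back */
        close(sv[0]);
        for (i = 0; i < reps; i++)
            if (read_all(sv[1], x, nbytes) != 0 ||
                write_all(sv[1], x, nbytes) != 0)
                _exit(1);
        _exit(0);
    }

    close(sv[1]);                      /* p0: send, wait for the echo, repeat */
    gettimeofday(&t1, NULL);
    for (i = 0; i < reps && ok; i++)
        ok = write_all(sv[0], x, nbytes) == 0 &&
             read_all(sv[0], x, nbytes) == 0;
    gettimeofday(&t2, NULL);

    close(sv[0]);
    waitpid(pid, NULL, 0);
    free(x);
    if (!ok) return -1.0;

    elapsed = (double)(t2.tv_sec - t1.tv_sec)
            + (double)(t2.tv_usec - t1.tv_usec) / 1e6;
    return 0.5 * elapsed / reps;       /* half of the mean round-trip time */
}
```

Calling `pingpong_one_way(64, 1000)` and `pingpong_one_way(65536, 1000)` and comparing the results gives a rough feel for the startup and per-byte terms of the communication model on the local machine.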

18 2a.18 Profiling A profile of a program is a histogram or graph showing the time spent in different parts of the program, or the number of times certain source statements are executed. It can help to identify hot spots, places in the program that are visited many times during execution. These places should be optimized first.

19 2a.19 Program Profile Histogram [Figure: histogram of a program profile; horizontal axis: statement number or region of program.]

