Analytical Modeling Of Parallel Programs


Analytical Modeling Of Parallel Programs. Dagoberto A. R. Justo, PPGMAp, UFRGS.

Introduction Problem: How can we model the behavior of a parallel program in order to predict its execution time, using the size of the problem, the number of nodes/processors, and the communication network parameters ts (message startup time) and tw (per-word transfer time)? Clearly, we must consider both the algorithm and the architecture. Issues: a serial program is characterized by its total execution time, which typically depends mainly on the size of its input. A parallel program raises several new issues: the execution time of the program, and the speedup relative to the algorithm running serially. However, is that speedup measured against the best serial algorithm, or against a serialization of the parallel algorithm being used?

Outline Of This Topic Sources of overhead in a parallel program Performance metrics for parallel systems The effect of granularity on performance Scalability of parallel systems Minimum execution time and minimum cost-optimal execution time Asymptotic analysis of parallel programs Other scalability metrics

Typical Parallel Execution Profile Essential/Excess Computation Interprocessor Communication Idling

Sources Of Overhead A profile illustrates the kinds of activities in a program execution. Essential computation: the same computations a serial program performs. The remaining activities are overheads (time spent not directly computing what is needed), such as: interprocess interaction, idling, and excess computation. These are all activities the serial program does not perform. An efficient parallel program attempts to reduce these overheads to zero, but of course this is not always possible.

Interprocess Interaction and Idling Interprocess interaction is usually the most significant overhead; it can sometimes be reduced by performing redundant computation. Idling is caused by: load imbalance, synchronization (waiting for collaborating processes to reach the synchronization point), and serial computation that cannot be avoided.

Excess Computation The fastest known serial algorithm may not be easy to parallelize, especially for a large number of processes. A different algorithm may be necessary, one that is not optimal when implemented serially, so such programs may perform more work. It may also be faster for all processes to compute common intermediate results than to compute them once and broadcast them to all processes that need them. What we are thus faced with is how to measure the performance of a parallel program to tell whether it is worth using: How does performance compare with a serial implementation? How does performance scale when adding more processes? How does performance scale with increasing problem size?

Performance Metrics Serial and parallel runtime: TS is the wall-clock time from start to completion of the best serial program (on the same type of processor used by the parallel program); TP is the wall-clock time from the moment parallel processing starts until the last process completes. Parallel cost: pTP. Total parallel overhead: TO = pTP – TS. Speedup S (how well is the parallel program performing?) is the ratio of the execution time of the best serial program to the parallel execution time: S = TS / TP. S is expected to be near p: S = p is called linear speedup; S < p is the most common case, and generally 80% of p is very good; S > p is called superlinear speedup.

Speedup and Efficiency Speedup S (how well is the parallel program performing?) is the ratio of the execution time of the best serial program to the parallel execution time: S = TS / TP. S is expected to be near p: S = p is called linear speedup; S < p is the most common case, and generally 80% of p is very good; S > p is called superlinear speedup. Efficiency is the ratio of the speedup to the number of processors used: E = S/p. E = 1 is ideal; E > 0.8 is generally good.
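To make these definitions concrete, here is a minimal Python sketch (not from the original slides) that computes speedup and efficiency from measured serial and parallel wall-clock times; the timing numbers are hypothetical.

def speedup(t_serial, t_parallel):
    """Speedup S = TS / TP."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Efficiency E = S / p."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical measurements: best serial time 120 s, parallel time 18 s on 8 processors.
ts, tp, p = 120.0, 18.0, 8
print(f"S = {speedup(ts, tp):.2f}, E = {efficiency(ts, tp, p):.2f}")
# S = 6.67, E = 0.83 -- above the 80% rule of thumb, so a good result.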

Example: Summing n Numbers in Parallel With n Processors Using input decomposition, place each number on a different processor. The sum is performed in log n phases. Assume n is a power of 2 and that the processors are arranged in a linear array numbered from 0. Phase 1: each odd-numbered processor sends xi to the processor on its left, and each even-numbered processor adds the two numbers: Si = xi + xi+1. Phase 2: every second of these processors sends Si to the processor two positions to its left, which adds the partial sum it holds to the received partial sum. Continuing for log n phases, process 0 ends up with the sum.

A Picture Of The Parallel Sum Process [Figure: 16 processors, numbered 0-15, summing 16 values] (a) Initial data distribution and the first communication phase (b) Second communication step (c) Third communication step (d) Fourth communication step (e) Accumulation of the sum at processor 0 after the final communication

Modelling execution time tC : the time to add two numbers. Serial time: TS = (n-1) tC = Θ(n). Parallel time TP: each of the log n phases has computation time tC and communication time (ts + tw), so TP = tC log n + (ts + tw) log n = Θ(log n). The speedup is: S = TS / TP = (tC n)/((tC + ts + tw) log n) = (tC /(tC + ts + tw)) (n / log n) = Θ(n / log n). The overhead (with p = n) is: TO = pTP – TS = n((ts + tw) log n + tC(log n – 1)) + tC = Θ(n log n). The overhead is large -- why? This parallel algorithm does considerably more total work than the serial algorithm.
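The closed-form model above is easy to evaluate numerically. The sketch below is illustrative only; the unit costs tC, ts, and tw are made-up values.

import math

def sum_model(n, tc=1.0, ts=10.0, tw=2.0):
    """Model of the n-processor parallel sum: TS = (n-1)*tc, TP = (tc+ts+tw)*log2(n)."""
    t_serial = (n - 1) * tc
    t_parallel = (tc + ts + tw) * math.log2(n)
    return t_serial, t_parallel, t_serial / t_parallel

for n in (64, 1024, 2**20):
    ts_, tp_, s = sum_model(n)
    print(f"n={n:8d}  TS={ts_:12.0f}  TP={tp_:8.1f}  S={s:10.1f}")
# Speedup grows like n/log n, far below the n processors used -- the Theta(n log n) overhead dominates.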

Misleading Speedups Faster than one deserves: consider a parallel implementation of bubble sort, called odd-even sort, that sorts 10^5 elements on 4 processors in 40 seconds. The serial version of bubble sort takes 150 seconds, but serial quicksort on the same set of elements takes 30 seconds. The misleading speedup is TS / TP = 150/40 = 3.75; the correct speedup is TS / TP = 30/40 = 0.75. The required comparison is between quicksort, the fastest serial algorithm, and the particular parallel sort algorithm. On the other hand, parallel code can legitimately run faster than expected, for example because of cache effects or because of exploratory decomposition.

Misleading Speedups Continued Cache effects: consider a problem of size W words (large, but small enough to fit in the memory of a single processor). Suppose that on a single processor an 80% cache hit ratio is observed for this problem, the cache latency to the CPU is 2 ns, and the DRAM latency is 100 ns. The average access time per data item is then 0.8*2 + 0.2*100 = 21.6 ns. Assume the program performs 1 floating-point operation per memory access. Then the performance is 1000/21.6 MFLOPS, or 46.3 MFLOPS.

Cache Effects Continued Consider solving the same problem on two processors by decomposing the data so that two sub-problems of size W/2 are solved. The amount of data per processor is smaller, so we might expect the cache hit ratio to be higher, say 90%. Of the remaining 10% of the accesses, assume 8% go to the processor's own memory and 2% go to the other processor's memory. Suppose the latencies are 2 ns, 100 ns, and 400 ns respectively (400 ns for access to the DRAM of the other processor's memory). The average access time per data item is then 0.9*2 + 0.08*100 + 0.02*400 = 17.8 ns per processor. Assume, as before, 1 floating-point operation per memory access. Then the performance is 1000/17.8 MFLOPS, or 56.2 MFLOPS per processor, for a total rate of 112.4 MFLOPS on the 2 processors. The speedup is then 112.4/46.3 = 2.43, faster than we deserve.
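A small Python sketch of this arithmetic (the hit ratios and latencies are the illustrative values from the slide, not measurements):

def mflops(hit_fractions, latencies_ns, flops_per_access=1.0):
    """Average memory access time (ns) -> MFLOPS, assuming one access per flop."""
    avg_ns = sum(f * t for f, t in zip(hit_fractions, latencies_ns))
    return flops_per_access * 1000.0 / avg_ns

serial = mflops([0.8, 0.2], [2.0, 100.0])                   # one processor
per_proc = mflops([0.9, 0.08, 0.02], [2.0, 100.0, 400.0])   # each of two processors
print(f"serial: {serial:.1f} MFLOPS, parallel total: {2*per_proc:.1f} MFLOPS, "
      f"speedup: {2*per_proc/serial:.2f}")
# ~46.3, ~112.4, ~2.43 -- a superlinear speedup caused purely by better cache behavior.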

Faster Solutions Via Exploratory Decomposition Suppose the search tree looks like the graph in the figure, with the solution at the rightmost node. The serial search is depth-first, leftmost-first; the parallel search uses 2 processors, each running the same depth-first, leftmost-first algorithm on its own subtree. The serial algorithm finds the solution in 14 steps; the parallel algorithm takes 5 steps. The speedup is 14/5 = 2.8 > 2 -- superlinear speedup. [Figure: search tree showing where processing element 0 and processing element 1 begin searching, and the solution at the rightmost node]

Efficiency Efficiency is defined as E = S/p, where S is the speedup and p the number of processors. It in essence measures the average utilization of the processors. E = 1: ideal speedup; E > 1: superlinear speedup; E > 0.8: very good in practice. Example: for the parallel summation of n numbers with n processors, E = Θ((n/log n)/n) = Θ(1/log n). Notice that the efficiency decreases as the problem size grows; that is, there is no point in using a large number of processors for computations with efficiencies like this.

A More Satisfactory Example Edge detection of an n × n pixel image applies a 3 × 3 multiplicative template with summation (a convolution) to each pixel. The serial computation time is TS = 9 tc n², where tc is the average time for a multiply-add operation. The parallel algorithm using p processors divides the image into p column slices with n/p columns per slice. The computation for the pixels on each processor is local except for the left and right edges, which require the pixel values from the edge column of the neighboring processes. Thus, the parallel time is TP = 9 tc n²/p + 2(ts + n tw). The speedup and efficiency are S = 9 tc n² / (9 tc n²/p + 2(ts + n tw)) and E = 1 / (1 + 2p(ts + n tw)/(9 tc n²)). E increases with increasing n and decreases with increasing p.
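A quick numeric sketch of this model (the machine parameters tc, ts, tw below are arbitrary illustrative values):

def edge_detect_model(n, p, tc=1.0, ts=50.0, tw=4.0):
    """Speedup/efficiency model for the column-sliced 3x3 convolution."""
    t_s = 9 * tc * n * n
    t_p = 9 * tc * n * n / p + 2 * (ts + n * tw)
    s = t_s / t_p
    return s, s / p

for n in (512, 2048):
    for p in (4, 16, 64):
        s, e = edge_detect_model(n, p)
        print(f"n={n:5d} p={p:3d}  S={s:7.1f}  E={e:5.2f}")
# E grows with n for fixed p and shrinks as p grows for fixed n, as the slide states.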

Convolution Of A Pixel Image [Figure: a pixel image, two different 3 × 3 convolution templates for detecting edges, and the partitioning of the image into column slices P0-P3 showing the data sharing needed by process 1]

Parallel Cost And Cost-Optimality The parallel cost is defined as the number of processors p times the parallel computing time TP. The term cost-optimal refers to a parallel system whose cost has the same asymptotic growth as the fastest serial algorithm for the same problem; that is, one with an efficiency of Θ(1). Example: adding n numbers in parallel (as above) costs Θ(n log n), while adding them serially costs Θ(n), so this parallel addition algorithm is NOT cost-optimal. Questions: Can this algorithm, or any algorithm, be made cost-optimal by increasing the granularity, that is, by decreasing p and increasing the work of each processor? Sometimes, but in general no; see the examples on the next few slides. Is there a cost-optimal parallel algorithm for summing n numbers? Yes, but it is different from the algorithms so far, and it does use the idea of increased granularity.

The Importance Of The Cost-Optimal Metric Consider a sorting algorithm that sorts n elements using n processors and is not cost-optimal. What does this mean in practice? Scalability is very poor. Let's see why. Suppose this algorithm takes (log n)² time to sort the list, while the best serial time is known to be n log n. The parallel algorithm then has a speedup of n/log n and an efficiency of 1/log n. The parallel algorithm is an improvement, but it is not cost-optimal because the efficiency is not Θ(1). Now consider increasing the granularity by decreasing the number of processors from n to p < n. The new parallel algorithm takes no more than n(log n)²/p time; its speedup is (n log n)/(n(log n)²/p) = p/log n, and its efficiency is still 1/log n. For example, with 32 processors and n = 1024 or n = 10^6, the speedups are 3.2 and 1.6 respectively -- very poor scalability.
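The closing numbers are easy to check; a minimal sketch (base-2 logarithms assumed):

import math

def scaled_down_sort_speedup(n, p):
    """Speedup p / log2(n) of the non-cost-optimal sort after scaling down to p processors."""
    return p / math.log2(n)

for n in (1024, 10**6):
    print(f"n={n:8d}: speedup with p=32 is {scaled_down_sort_speedup(n, 32):.1f}")
# 3.2 for n=1024 and about 1.6 for n=10**6 -- 32 processors buy very little.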

Two Different Parallel Summation Algorithms The second sum algorithm increases the granularity of the previous parallel algorithm, also called scaling down (see the next few slides for n = 16 and p = 4). Instead of using n processors, use p < n processors and place n/p of the addends on each processor. Add the corresponding numbers across processors as in the first algorithm (with communication) until all the partial sums are on one processor, then add those partial sums in pairs, following the approach of the first parallel algorithm, but on that one processor. The third sum algorithm is a modification of the above: first add up all the numbers on each processor using the usual serial algorithm, then apply the first parallel algorithm to the p partial sums on p processors.

Second Algorithm (n=16, p=4) -- First Step [Figure: data distributions of the 16 addends over processors P0-P3, showing the initial distribution and first communication step, the distributions before substeps 2, 3, and 4 with their communication steps, and the data distribution after the last substep]

Second Algorithm (n=16, p=4) -- Second Step [Figure: data distributions before each substep and its communication, starting from the eight pairwise sums (0-1, 2-3, ..., 14-15) and ending, after the last substep, with the four partial sums 0-3, 4-7, 8-11, and 12-15]

Second Algorithm (n=16, p=4) -- 3rd & 4th Step [Figure: the four partial sums (0-3, 4-7, 8-11, 12-15) are combined pairwise over the remaining substeps (into 0-7 and 8-15, then 0-15), giving the final result, the sum of elements 0 through 15]

Third Algorithm (n=16, p=4) -- A Cost Optimal Algorithm [Figure: initial data distribution and grouping of operations, with four addends summed locally on each of P0-P3; data distribution after the first step and first communication step (partial sums 0-3, 4-7, 8-11, 12-15); after the second step and second communication step (partial sums 0-7 and 8-15); and the final result, the sum 0-15, after the last step]

The Effect Of Granularity Increasing the granularity of a cost-optimal algorithm maintains cost-optimality. Suppose we have a cost-optimal algorithm using p processors and we increase its granularity by reducing the number of processors to q < p, increasing the work per processor. The work per processor increases by a factor of p/q, and the communication per processor should also grow by no more than a factor of p/q provided the mapping is done carefully. Thus, the parallel time increases by a factor of p/q, and the parallel cost of the new algorithm is qTP_new = q(p/q)TP_old = pTP_old. Thus, increasing the granularity has not changed the cost of a cost-optimal algorithm. To produce a cost-optimal parallel code from a non-cost-optimal one, however, you may have to do more than increase granularity, although it may help; the two new sum algorithms illustrate this point. Note that the above argument does NOT show that increasing granularity makes a non-cost-optimal algorithm cost-optimal.

Analysis Of The Sum Algorithms The serial sum algorithm costs Θ(n). The first parallel algorithm costs Θ(n log n). The second parallel algorithm: n/p steps, each with log p sub-steps, taking Θ((n/p) log p); then we add n/p numbers on one processor, taking Θ(n/p); the total time is Θ((n/p) log p) and the cost is p Θ((n/p) log p) = Θ(n log p). This is asymptotically higher than the serial algorithm, so it is still not cost-optimal. The third parallel algorithm: the first step is n/p additions; the second step consists of log p sub-steps, each an addition and a communication. Thus the time is Θ(n/p + log p) and the cost is Θ(n + p log p). As long as p is not too large, namely n = Ω(p log p), the cost is Θ(n), which means this algorithm is cost-optimal.
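A small sketch comparing the modeled costs of the second and third algorithms under a schematic unit-time assumption (one time unit per addition and per communication); only the growth rates matter, not the constants.

import math

def cost_second(n, p):   # (n/p) rounds of log p combine substeps, then n/p local adds
    return p * ((n / p) * 2 * math.log2(p) + n / p)

def cost_third(n, p):    # n/p local adds, then log p combine steps (add + communicate)
    return p * (n / p + 2 * math.log2(p))

n = 2**20
for p in (16, 256, 4096):
    print(f"p={p:5d}  cost2/n = {cost_second(n, p)/n:5.1f}   cost3/n = {cost_third(n, p)/n:5.3f}")
# The third algorithm's cost stays ~n (cost-optimal, Theta(n)); the second grows like n log p.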

Scalability Of Parallel Systems We typically develop parallel programs from small test cases It is very difficult to predict scalability (performance for large problems) from small test cases, unless you have done the analysis first We now study some tools to help in the prediction process See the FFT case study based on observation of performance in the small and its relation to performance for large sized problems Topics: Scaling characteristics of parallel programs Isoefficiency metric of scalability Problem size Isoefficiency function Cost optimality and the isoefficiency function A lower bound on the isoefficiency function Degree of concurrency and the isoefficiency function

FFT Case Study Three algorithms for performing FFTs Algorithms described in detail in Chapter 13 Binary exchange 2-D transpose 3-D transpose Speedup data given for 64 processors, for the FFT size n varying from 1 to 18K elements For small n (up to 7000 or so), 3-D transpose and binary exchange are best -- a lot of testing to see this For large n, 2-D transpose outperforms the others and continues faster for n > 14000 -- Can you believe this remains true though for even larger n? Not unless you have done the analysis to support the conjectured asymptotic behavior

Scaling Characteristics Of Parallel Programs The efficiency is E = S/p = TS/(pTP); using the expression involving the overhead TO (slide 9), E = TS/(TS + TO) = 1/(1 + TO/TS). Unfortunately, the overhead is at least linear in p unless the algorithm is completely parallel. Say the parallel algorithm has a serial component taking time Tserial. Then all but one of the processors are idle while one processor performs the serial computation, so the overhead is at least (p–1)Tserial. Therefore, the efficiency is bounded above by E ≤ TS/(TS + (p–1)Tserial) = 1/(1 + (p–1)Tserial/TS).
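This upper bound is essentially Amdahl's law; a minimal sketch (the 5% serial fraction is an arbitrary illustrative choice):

def efficiency_bound(p, serial_fraction):
    """Upper bound E <= 1 / (1 + (p-1)*Tserial/TS), with Tserial = serial_fraction * TS."""
    return 1.0 / (1.0 + (p - 1) * serial_fraction)

for p in (2, 8, 32, 128):
    print(f"p={p:4d}  E <= {efficiency_bound(p, 0.05):.2f}")
# Even a 5% serial part caps efficiency at about 0.14 on 128 processors.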

Scaling Characteristics Continued From this expression for E (previous slide): the efficiency E decreases with the number of processors for a given problem size, and increases with larger problems (as TS increases). Consider the cost-optimal summation algorithm. For this algorithm (assuming unit time for an addition and a communication), n/p is the time for adding the n/p local items and 2 log p is the time for the additions and communications of the second phase, so TP = n/p + 2 log p, S = n/(n/p + 2 log p), and E = 1/(1 + 2p log p/n). See the disappointing results for large p on the next slide, and see how an efficiency level of, say, 80% can be maintained by increasing n for each p.

Speedup Curves Plots of S = n/(n/p + 2 log p) for the cost-optimal addition parallel algorithm for changing p and n

Efficiency Tables -- A Table Of Values Of E, For Different p and n

  n \ p     1      4      8     16     32
   64     1.0   0.80   0.57   0.33   0.17
  192     1.0   0.92   0.80   0.60   0.38
  320     1.0   0.95   0.87   0.71   0.50
  512     1.0   0.97   0.91   0.80   0.62

The function of p representing the work (n) that keeps the efficiency fixed as p increases is the isoefficiency function -- the 0.80 entries mark 80% efficiency.
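The table can be regenerated directly from E = 1/(1 + 2p log p / n); a short sketch:

import math

def efficiency(n, p):
    """E = 1 / (1 + 2 p log2(p) / n) for the cost-optimal parallel sum."""
    return 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

ps = (1, 4, 8, 16, 32)
print(" n\\p" + "".join(f"{p:7d}" for p in ps))
for n in (64, 192, 320, 512):
    print(f"{n:4d}" + "".join(f"{efficiency(n, p):7.2f}" for p in ps))
# Reading along the 0.80 entries (n=64@p=4, 192@p=8, 512@p=16): n must grow roughly
# like 8 p log p to hold the efficiency at 80% -- the isoefficiency function.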

Scalability The overhead varies with the serial time (the amount of work) and with the number of processors. Clearly, overhead (communication) typically increases with the number of processors. It often also increases with the amount of work to be done, usually indicated by the sequential time TS. However, as the problem size increases, the overhead usually grows sublinearly relative to the work, so its share of the total work shrinks. This means that the efficiency increases with the problem size even when the number of processors is fixed; for example, look at the columns of the last table. Also, an efficiency level can be maintained by increasing both the number of processors p and the amount of work. A parallel system that is able to maintain a specific efficiency in this manner is called a scalable parallel system. Scalability is a measure of a system's capacity to increase speedup in proportion to the number of processing elements.

Scalability And Cost Optimality Recall: cost-optimal algorithms have an efficiency of Θ(1). Scalable parallel systems can always be made cost-optimal, and cost-optimal algorithms are scalable. Example: the cost-optimal algorithm for adding n numbers has efficiency E = 1/(1 + 2p(log p)/n). Setting E equal to a constant, say K, means that n and p must vary as n = 2(K/(1–K)) p log p. Thus, for any p, the size of n can be selected to maintain efficiency K. For example, for K = 80% and 32 processors, a problem size of n = 1280 must be used. (Recall: this efficiency formula assumed that adding two numbers and communicating one number each take unit time -- not a very realistic assumption on current hardware.)
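A sketch verifying the isoefficiency relation for this algorithm (base-2 logarithms assumed):

import math

def iso_n(p, K=0.8):
    """Problem size n = 2*(K/(1-K))*p*log2(p) that keeps the sum's efficiency at K."""
    return 2.0 * (K / (1.0 - K)) * p * math.log2(p)

def efficiency(n, p):
    return 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

for p in (4, 8, 16, 32):
    n = iso_n(p)
    print(f"p={p:3d}  n={n:7.0f}  E={efficiency(n, p):.2f}")
# p=32 needs n = 1280, and each n produced indeed gives E = 0.80.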

Isoefficiency Metric Of Scalability Two observations: efficiency always decreases as the number of processors increases, approaching 0 for a large number of processors; efficiency often increases with the amount of work or size of the problem, approaching 1 (although it can then decrease again for very large work sizes, for example when memory runs out). A scalable system is one for which the efficiency can be held constant by increasing both the work and the number of processors. The rate at which the work or problem size must increase with respect to the number of processors in order to maintain a fixed efficiency is called the degree of scalability of the parallel system. This definition depends upon a clear definition of problem size. Once problem size is defined, the function giving the required problem size as the number of processors varies at fixed efficiency is called the isoefficiency function.

Problem Size Define problem size as the number of basic operations: arithmetic operations, data stores and loads, etc. It needs to be a measure such that doubling the problem size doubles the computation time. Measures such as the size (order) of a matrix are misleading: doubling the order of a matrix causes a computation dominated by matrix-matrix multiplication to grow by a factor of 8, a computation dominated by matrix-vector multiplication to grow by a factor of 4, and a computation dominated by a vector dot product to grow by a factor of 2. In the formulas that follow, we assume the basic arithmetic operation takes 1 unit of time; thus the problem size W is the same as the serial time TS of the fastest known serial algorithm.

The Isoefficiency Function -- The General Case The parallel execution time TP is a function of the problem size W, the overhead function TO, and the number p of processors: TP = (W + TO(W, p))/p, so that E = W/(pTP) = 1/(1 + TO(W, p)/W). Let's fix the efficiency at E and solve for W in terms of p. Letting K = E/(1–E), we get TO(W, p)/W = (1–E)/E = 1/K.

Development Continued Now solving for W gives: W = K TO(W, p). In the above equation, K is a constant. For each choice of W we can solve for p, which can usually be done algebraically; or, for each choice of p, we solve for W, which may be a non-linear equation that has to be solved numerically or approximated somehow. The resulting function giving W in terms of p is called the isoefficiency function.

Analysis Of The Isoefficiency Function If small changes in p give only small changes in W, the system is highly scalable; if small changes in p result in large changes in W, the system is not very scalable. The isoefficiency function may be difficult to find: you may be able to solve the above equation only for specific finite values and not in general. The isoefficiency function may not even exist; such systems are not scalable.

Two Examples Consider the cost-optimal addition algorithm. Its overhead function is 2p log p, so the isoefficiency function (from slide 40) is W = 2Kp log p. That is, if the number of processors is doubled, the size of the problem must be increased by a factor of 2(1 + log p)/log p. In this example the overhead is a function only of the number of processors and not of the work W; this is unusual. Now suppose we had a parallel system with the overhead function TO = p^(3/2) + p^(3/4) W^(3/4). The equation to solve for W in terms of p is W = K p^(3/2) + K p^(3/4) W^(3/4). For fixed K this is a fourth-degree polynomial equation in W^(1/4) with multiple roots. Although this case can be solved analytically, in general such equations cannot be; see the next slide for an approximate solution that provides the relevant asymptotic growth.

Second Example Continued -- Finding An Approximate Solution W is the sum of two positive terms that increase with increasing p. Let's take each term separately, find the growth rate consistent with it, and take the maximum as the asymptotic growth rate. For just the first term, W = K p^(3/2), so W = Θ(p^(3/2)). For just the second term, W = K p^(3/4) W^(3/4); solving for W gives W = K^4 p^3 = Θ(p^3). The faster growth rate is Θ(p^3). Thus, for this parallel system the work must grow like p^3 to maintain a constant efficiency, and the isoefficiency function is Θ(p^3).
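A numeric sanity check (illustrative, with K = 1): solving W = K p^(3/2) + K p^(3/4) W^(3/4) by fixed-point iteration shows W/p^3 approaching a constant.

def iso_W(p, K=1.0, iters=200):
    """Fixed-point iteration for W = K*p**1.5 + K*p**0.75*W**0.75."""
    W = 1.0
    for _ in range(iters):
        W = K * p**1.5 + K * p**0.75 * W**0.75
    return W

for p in (10, 100, 1000, 10000):
    W = iso_W(p)
    print(f"p={p:6d}  W={W:14.3e}  W/p^3={W/p**3:.3f}")
# W/p^3 approaches K^4 = 1 as p grows, consistent with the Theta(p^3) isoefficiency function.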

Your Project Perform the corresponding analysis for your implementation and determine the isoefficiency function, if it exists It is the analysis that allows you to test the performance of your parallel code on a few test cases and then allows you to predict the performance in the large The test cases have to be selected so that your results do not reflect initial condition or small case effects

Cost-Optimality And The Isoefficiency Function Consider a cost-optimal parallel algorithm. An algorithm is cost-optimal iff the efficiency is Θ(1), that is, E = S/p = W/(pTP) = Θ(1), which means pTP = Θ(W). But pTP = W + TO(W, p), so TO(W, p) = O(W), and equivalently W = Ω(TO(W, p)). Thus, an algorithm is cost-optimal iff its overhead does not asymptotically exceed its problem size. In addition, if there exists an isoefficiency function f(p), then the relation W = Ω(f(p)) must be satisfied in order to ensure the cost-optimality of the parallel system.

A Lower Bound On the Isoefficiency Function We desire the smallest possible isoefficiency function (recall degrees of scalability -- e.g. slides 37, 41). How small can it be? The smallest possible function is Θ(p). Argument: for W units of work, at most W processors can be used, because any processors in excess of W will be idle -- there is no work for them. If the problem size grows more slowly than Θ(p) while the number of processors grows like p, then eventually there are more processors than work, so such a system is not scalable. Thus, the ideal isoefficiency function is Θ(p). This is hard to achieve -- even the cost-optimal add algorithm has an isoefficiency function of Θ(p log p).

The Degree Of Concurrency And The Isoefficiency Function The degree of concurrency C(W) is the maximum number of tasks that can be executed concurrently in a computation with work W. This degree of concurrency must limit the isoefficiency function: for a problem of size W with degree of concurrency C(W), at most C(W) processors can be used effectively. Thus, using p processors requires a problem large enough that C(W) = Ω(p), and the isoefficiency function can be no smaller than the growth of W that this condition implies.

An Example Consider solving Ax = b (a linear system of size n) via Gaussian elimination. The total computation takes Θ(n³) time, so W = Θ(n³). We eliminate one variable at a time in a serial fashion, each elimination taking Θ(n²) time, so at most n² processing elements can be used at once. Thus the degree of concurrency is C(W) = Θ(n²) = Θ(W^(2/3)). From p = Θ(W^(2/3)) we get W = Θ(p^(3/2)), which is the isoefficiency function due to concurrency. Thus, this algorithm cannot reach the ideal isoefficiency function Θ(p).

Minimum Execution Time and Minimum Cost-Optimal Execution Time The parallel processing time TP often decreases as the number p of processors increases, until it either approaches a minimum asymptotically or reaches a minimum and then increases. The question is: what is that minimum, and is it useful to know? We can find it by taking the derivative of the parallel time with respect to the number of processors, setting this derivative to 0, and solving for the p that satisfies the resulting equation. Let p0 be the value of p at which the minimum is attained, and let TPmin be the minimum parallel time. Let's do this for the parallel summation system we have been working with.

An Example Consider the cost-optimal algorithm for adding n numbers. Its parallel time is (slide 32) TP = n/p + 2 log p. Setting the first derivative to 0 gives –n/p² + 2/p = 0, whose solution is p0 = n/2. The minimum parallel time is TPmin = 2 log n. The processor-time product (cost) at this point is p0 TPmin = Θ(n log n), which is larger than the serial time Θ(n). Thus, at this minimum time the problem is not being solved cost-optimally.
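A short numeric check of the minimizer (base-2 logarithms assumed in the code):

import math

def tp(n, p):
    """Parallel time model TP = n/p + 2 log2 p for the cost-optimal sum."""
    return n / p + 2 * math.log2(p)

n = 4096
best_p = min(range(1, n + 1), key=lambda p: tp(n, p))
print(f"TP at p = n/2: {tp(n, n // 2):.2f}   2 log2 n = {2 * math.log2(n):.2f}")
print(f"scanned minimum: TP = {tp(n, best_p):.2f} at p = {best_p}")
# The scanned minimum is ~2 log2 n; the exact minimizing p differs somewhat from n/2
# because the slide's derivative treats log p as the natural logarithm.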

The Cost-Optimal Minimum Time TPcost_opt Let's characterize and find the minimum time when the computation is performed cost-optimally. Cost-optimality can be related to the isoefficiency function and vice versa (see slide 45). If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally iff W = Ω(f(p)); that is, a cost-optimal solution requires p = O(f^-1(W)). The parallel run-time is TP = Θ(W/p) (because pTP = Θ(W)). Thus, a lower bound on the parallel runtime for solving a problem of size W cost-optimally is TPcost_opt = Ω(W / f^-1(W)).

The Example Continued Estimating TPcost_opt for the cost-optimal addition algorithm: after some algebra we get TPcost_opt = 2 log n – log log n. Notice that TPmin and TPcost_opt are the same asymptotically, both Θ(log n). This is typical for most systems, but it is not true in general: we can have the situation where TPcost_opt is asymptotically greater than TPmin.

An Example Of Tpcost_opt > (Tpmin) Consider the hypothetical system with (slide 42): TO = p3/2 + p3/4W3/4 Parallel runtime is: TP = (W + TO)/p = W/p + p1/2 + W3/4/p1/4 Taking the derivative to find Tpmin gives: p0 = (W) Substituting back in to give Tpmin gives: Tpmin = (W1/2) According to slide 43, the isoefficiency function W = (p3) = f(p) Thus, p = f -1(p) = (W1/3) Substituting into the equation for Tpcost_opt on slide 51 gives Tpcost_opt = (W2/3) Thus, Tpcost_opt > (Tpmin) This does happen often

Limitation By Degree Of Concurrency C(W) Beware: the study of asymptotic behavior is valuable and interesting, but increasing p without limit is unrealistic. For example, a p0 larger than C(W) is meaningless; in such cases the minimum is attained with p = C(W) processors, giving TPmin = Θ((W + TO(W, C(W))) / C(W)). Needless to say, for problems where W grows without bound, C(W) may also grow without bound, so that considering large p remains reasonable.

Asymptotic Analysis Of Parallel Programs -- Table For 4 Parallel Sort Programs of n Numbers

  Algorithm      A1           A2          A3             A4
  p              n²           log n       n              √n
  TP             1            n           √n             √n log n
  S              n log n      log n       √n log n       √n
  E              (log n)/n    1           (log n)/√n     1
  pTP            n²           n log n     n^1.5          n log n

Recall: the best serial time is n log n. Question: which algorithm is the best?

Comments On The Table Comparing by speed TP: A1 is best, followed by A3, A4, and A2. But A1 is not practical for large n, since it requires n² processors. Comparing by efficiency E: A2 and A4 are best, followed by A3 and A1. Looking at the costs pTP: A2 and A4 are cost-optimal, whereas A3 and A1 are not. Overall, A2 is the best if the smallest number of processors is important, and A4 is the best if the smallest parallel time (among cost-optimal algorithms) is important.

Other Scalability Metrics Other metrics have been developed to handle more specific situations, for example: metrics for problems that must be solved in a specified time (real-time problems), and metrics for cases where memory is the limiting factor, so that the number of processors must be scaled up not for performance but for the additional memory (that is, memory scales linearly with the number of processors p). Two such metrics are scaled speedup and the serial fraction.

Scaled Speedup Analyze the speedup while increasing the problem size linearly with the number of processors. This analysis can be done by constraining either time or memory. To see this, consider the following two examples: a parallel algorithm for matrix-vector products and a parallel algorithm for matrix-matrix products.

Scaled Speedup For Matrix-Vector Products The serial time TS for a matrix-vector product with an n × n matrix is tc n², where tc is the time for a multiply-add operation. For a simple parallel algorithm based on 1-D row partitioning (Section 8.2.1 is cited on the original slide), the parallel time is TP = tc n²/p + ts log p + tw n. Then the speedup is S = tc n² / (tc n²/p + ts log p + tw n).

Scaled Speedup For Matrix-Vector Products Continued Consider a memory scaling constraint: require the memory to scale as Θ(p). But the memory requirement for the matrix is Θ(n²), therefore n² = Θ(p), or n² = c × p. Substituting into the speedup formula, we get S' = tc c p / (tc c + ts log p + tw √(c p)) = Θ(√p). Thus, the scaled speedup with a memory constraint is Θ(√p), that is, sublinear.

Scaled Speedup For Matrix-Vector Products Continued Consider a time scaling constraint: require the parallel time to remain constant as the number of processors increases. But the parallel time is Θ(n²/p), therefore n²/p = c, or n² = c × p. This is the same requirement as in the memory-constrained case, so the same scaled speedup results. Thus, the scaled speedup with a time constraint is also Θ(√p).
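A quick numeric illustration of this scaling (the machine parameters tc, ts, tw and the constant c are arbitrary illustrative values):

import math

def mv_speedup(n, p, tc=1.0, ts=25.0, tw=4.0):
    """Speedup model S = tc n^2 / (tc n^2/p + ts log2 p + tw n)."""
    return (tc * n * n) / (tc * n * n / p + ts * math.log2(p) + tw * n)

c = 2500  # matrix elements per processor: memory-constrained scaling n^2 = c*p
for p in (4, 64, 1024, 16384):
    n = int(math.sqrt(c * p))
    s = mv_speedup(n, p)
    print(f"p={p:6d}  n={n:6d}  scaled speedup={s:9.1f}  S/sqrt(p)={s/math.sqrt(p):5.1f}")
# S/sqrt(p) tends toward tc*sqrt(c)/tw = 12.5: the scaled speedup is Theta(sqrt(p)).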

Scaled Speedup For Matrix-Matrix Products The serial time TS for a matrix-matrix product with n × n matrices is tc n³, where tc is the time for a multiply-add operation. For the simple parallel algorithm of Section 8.2.1, the parallel time is TP = tc n³/p + ts log p + 2 tw n²/√p. Then the speedup is S = tc n³ / (tc n³/p + ts log p + 2 tw n²/√p).

Scaled Speedup For Matrix-Matrix Products Continued Consider a memory scaling constraint: require the memory to scale as Θ(p). But the memory requirement for the matrices is Θ(n²), therefore n² = Θ(p), or n² = c × p. Substituting into the speedup formula, we get S' = tc c^(3/2) p^(3/2) / (tc c^(3/2) √p + ts log p + 2 tw c √p) = Θ(p). Thus, the scaled speedup with a memory constraint is Θ(p), that is, linear.

Scaled Speedup For Matrix-Matrix Products Continued Consider a time scaling constraint: require the parallel time to remain constant as the number of processors increases. But the parallel time is Θ(n³/p), therefore n³/p = c, or n³ = c × p. Substituting into the speedup formula, we get S' = tc c p / (tc c + ts log p + 2 tw c^(2/3) p^(1/6)) = Θ(p^(5/6)). Thus, the scaled speedup with a time constraint is Θ(p^(5/6)), that is, sublinear.
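A numeric sketch of the two scaling regimes for the matrix-matrix model (all parameter values are arbitrary and illustrative):

import math

def mm_speedup(n, p, tc=1.0, ts=25.0, tw=4.0):
    """Speedup model S = tc n^3 / (tc n^3/p + ts log2 p + 2 tw n^2/sqrt(p))."""
    return (tc * n**3) / (tc * n**3 / p + ts * math.log2(p) + 2 * tw * n**2 / math.sqrt(p))

for p in (64, 1024, 16384):
    n_mem = round(math.sqrt(400 * p))        # memory-constrained scaling: n^2 = c*p
    n_time = round((1000 * p) ** (1 / 3))    # time-constrained scaling:   n^3 = c*p
    print(f"p={p:6d}  S/p (memory-scaled)={mm_speedup(n_mem, p)/p:5.2f}   "
          f"S/p (time-scaled)={mm_speedup(n_time, p)/p:5.2f}")
# S/p stays roughly constant under memory-constrained scaling (linear, Theta(p)),
# but keeps falling under time-constrained scaling, consistent with Theta(p^(5/6)).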

Serial Fraction Used, like the other measures, to indicate the nature of the scalability of a parallel algorithm. What is it? Assume the work W can be broken into two parts: the part that is totally serial, denoted Tser (we assume this includes all the interaction time), and the part that is totally parallel, denoted Tpar. Then the work is W = Tser + Tpar. Define the serial fraction as f = Tser/W. We now seek an expression for f in terms of p and S in order to study how f changes with p.

Serial Fraction Continued From the definition of TP: TP = Tser + Tpar/p = f W + (1 – f) W / p. Using the relation S = W/TP and solving for f gives a formula for f in terms of S and p: f = (1/S – 1/p) / (1 – 1/p). It is not obvious from this formula how f varies with p. If f increases with increasing p, it is a sign of growing overhead, and the system is considered poorly scalable. Let's look at what this formula tells us for the matrix-vector product.

Serial Fraction Example For the matrix-vector product, substituting the speedup model gives f = p (ts log p + tw n) / ((p – 1) tc n²) ≈ (ts log p + tw n) / (tc n²). This indicates that the serial fraction f grows (slowly) with increasing p, so the overhead grows with p and limits the scalability of this formulation.
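A closing sketch computing the serial fraction from the modeled matrix-vector speedup (the parameter values are arbitrary, chosen so the trend is visible):

import math

def mv_speedup(n, p, tc=1.0, ts=200.0, tw=1.0):
    """Speedup model S = tc n^2 / (tc n^2/p + ts log2 p + tw n)."""
    return (tc * n * n) / (tc * n * n / p + ts * math.log2(p) + tw * n)

def serial_fraction(s, p):
    """Experimentally determined serial fraction f = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p)

n = 500
for p in (4, 16, 64, 256, 1024):
    print(f"p={p:5d}  f={serial_fraction(mv_speedup(n, p), p):.5f}")
# For fixed n, f creeps upward with p (driven by the ts*log p term), a sign of
# growing overhead and hence limited scalability.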