
Chapter 7 Performance Analysis

2 References
(Primary reference) Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997. Updated online version available through the author's website.
(Textbook, also an important reference) Michael Quinn, Parallel Programming in C with MPI and OpenMP, Ch. 7, McGraw Hill.
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, First Edition 1999 or Second Edition 2005, Chapter 1.
Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994 (a popular, earlier textbook by Quinn).

3 Learning Objectives Predict performance of parallel programs – Accurate predictions of the performance of a parallel algorithm help determine whether coding it is worthwhile. Understand barriers to higher performance – This allows you to determine how much improvement can be realized by increasing the number of processors used.

4 Outline Speedup; Superlinearity Issues; Speedup Analysis; Cost; Efficiency; Amdahl’s Law; Gustafson’s Law (but not the Gustafson-Barsis Law); Amdahl Effect.

5 Speedup Speedup measures the performance gain from parallelism, i.e., the factor by which parallel execution reduces the running time. The number of PEs is denoted here by n. Based on running times, S(n) = t_s/t_p, where – t_s is the execution time on a single processor, using the fastest known sequential algorithm – t_p is the execution time using a parallel computer with n PEs. For theoretical analysis, S(n) = t_s/t_p, where – t_s is the worst-case running time of the fastest known sequential algorithm for the problem – t_p is the worst-case running time of the parallel algorithm using n PEs.
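As a concrete illustration (a sketch, not part of the original slides), the OpenMP program below times a placeholder summation kernel sequentially and in parallel and reports the measured speedup; the kernel, problem size, and timing approach are illustrative assumptions.

```c
/* Sketch: measuring empirical speedup S(n) = t_s / t_p with OpenMP.
 * The summation kernel is only a placeholder workload. Compile with:
 *   gcc -O2 -fopenmp speedup.c -o speedup
 */
#include <stdio.h>
#include <omp.h>

/* Placeholder workload: sum of i*i for i = 0..n-1. */
static double work(long n, int parallel)
{
    double sum = 0.0;
    if (parallel) {
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += (double)i * (double)i;
    } else {
        for (long i = 0; i < n; i++)
            sum += (double)i * (double)i;
    }
    return sum;
}

int main(void)
{
    long n = 100000000L;                 /* problem size (arbitrary) */

    double start = omp_get_wtime();
    double s_seq = work(n, 0);           /* t_s: sequential run */
    double t_s = omp_get_wtime() - start;

    start = omp_get_wtime();
    double s_par = work(n, 1);           /* t_p: parallel run */
    double t_p = omp_get_wtime() - start;

    printf("checksums: %.6g %.6g\n", s_seq, s_par);
    printf("threads = %d  t_s = %.3f s  t_p = %.3f s  S(n) = %.2f\n",
           omp_get_max_threads(), t_s, t_p, t_s / t_p);
    return 0;
}
```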

6 Speedup in Simplest Terms Speedup = (sequential execution time) / (parallel execution time). Quinn’s notation for speedup is ψ(n,p) for data size n and p processors.

7 Linear Speedup Usually Optimal Speedup is linear if S(n) = Θ(n). Claim: the maximum possible speedup for parallel computers with n PEs is n. Usual argument (assume ideal conditions): – Assume a computation is partitioned perfectly into n processes of equal duration. – Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.). – Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation, and the parallel running time will be t_s/n. – Then the parallel speedup in this “ideal situation” is S(n) = t_s/(t_s/n) = n.

8 Linear Speedup Usually Optimal (cont) This argument shows that we should usually expect “linear speedup to be optimal.” The argument is valid for typical (or traditional) problems, but will be shown to be invalid for some types of nontraditional problems. Observe that accepting the preceding “ideal situation” argument, which gives t_p ≥ t_s/n, already implies that linear speedup is optimal.

9 Linear Speedup Usually Optimal (cont) Unfortunately, the best speedup possible for most applications is much smaller than n. – The “ideal conditions” performance mentioned in the earlier argument is usually unattainable. – Normally, some parts of programs are sequential and allow only one PE to be active. – Sometimes a significant number of processors are idle for certain portions of the program; during parts of the execution, many PEs may be waiting to receive or to send data (e.g., congestion may occur in message passing).

10 Superlinear Speedup Superlinear speedup occurs when S(n) > n. Most texts besides Akl’s argue that – linear speedup is the maximum speedup obtainable, and the earlier argument is used as a “proof” that superlinearity is always impossible; – speedup that appears to be superlinear may sometimes occur, but can be explained by other reasons, such as the extra memory in the parallel system, a sub-optimal sequential algorithm being compared to the parallel algorithm, or “luck,” in the case of an algorithm that has a random aspect in its design (e.g., random selection).

11 Superlinearity (cont) Selim Akl has given a multitude of examples that establish that superlinear algorithms are required for many non-standard problems. – If a problem either cannot be solved or cannot be solved in the required time without the use of parallel computation, it seems fair to say that t_s = ∞. Since for a fixed t_p > 0, S(n) = t_s/t_p is greater than 1 for all sufficiently large values of t_s, it seems reasonable to consider these solutions to be “superlinear.” – Examples include “nonstandard” problems involving: real-time requirements, where meeting deadlines is part of the problem requirements; problems where all data are not initially available, but have to be processed as they arrive; real-life situations, such as a person who can only keep a driveway open during a severe snowstorm with the help of friends. – Some problems are natural to solve using parallelism, and sequential solutions are inefficient.

12 Superlinearity (cont) The last chapter of Akl’s textbook and several journal papers by Akl were written to establish that superlinearity can occur. – It may still be a long time before the possibility of superlinearity occurring is fully accepted. – Superlinearity has long been a hotly debated topic and is unlikely to be widely accepted quickly, even when theoretical proofs are provided. For more details on superlinearity, see the discussion of the Speedup Folklore Theorem and Chapter 12 in Parallel Computation: Models and Methods by Selim Akl. This material is typically covered in more detail in my PDA class.

13 Speedup Analysis Recall the speedup definition: ψ(n,p) = t_s/t_p. A bound on the maximum speedup is given by ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)), where – inherently sequential computations are σ(n), – potentially parallel computations are φ(n), – communication operations are κ(n,p). The “≤” bound above is due to the assumption in the formula that the speedup of the parallel portion of the computation will be exactly p. Note that κ(n,p) = 0 for SIMDs, since communication steps are usually included with computation steps.
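For instance, plugging in hypothetical values (chosen purely for illustration, not from the slides) σ(n) = 10 and φ(n) = 90 time units, κ(n,p) = 0, and p = 8 gives:

```latex
\psi(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}
         \;=\; \frac{10 + 90}{10 + 90/8 + 0}
         \;=\; \frac{100}{21.25}
         \;\approx\; 4.7
```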

14 Execution Time for Parallel Portion: φ(n)/p (Figure: time versus processors.) Shows a nontrivial parallel algorithm’s computation component as a decreasing function of the number of processors used.

15 Time for Communication: κ(n,p) (Figure: time versus processors.) Shows a nontrivial parallel algorithm’s communication component as an increasing function of the number of processors.

16 Execution Time of Parallel Portion: φ(n)/p + κ(n,p) (Figure: time versus processors.) Combining these, we see that for a fixed problem size there is an optimum number of processors that minimizes overall execution time.
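A small worked example of why such an optimum exists (the linear communication model below is an assumption for illustration, not from the slides): if κ(n,p) = c·p for some constant c > 0, then

```latex
T(p) = \frac{\varphi(n)}{p} + c\,p,
\qquad
\frac{dT}{dp} = -\frac{\varphi(n)}{p^{2}} + c = 0
\;\;\Longrightarrow\;\;
p_{\text{opt}} = \sqrt{\varphi(n)/c}.
```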

17 Speedup Plot “elbowing out” processors speedup

18 Performance Metric Comments The performance metrics introduced in this chapter apply to both parallel algorithms and parallel programs. – Normally we will use the word “algorithm.” The terms “parallel running time” and “parallel execution time” have the same meaning. The complexity of the execution time of a parallel program depends on the algorithm it implements.

19 Cost The cost of a parallel algorithm (or program) is Cost = (parallel running time) × (number of processors). Since “cost” has other meanings, the term “algorithm cost” is sometimes used for clarity. The cost of a parallel algorithm should be compared to the running time of a sequential algorithm. – Cost removes the advantage of parallelism by charging for each additional processor. – A parallel algorithm whose cost is big-oh of the running time of an optimal sequential algorithm is called cost-optimal.

20 Cost Optimal From the last slide, a parallel algorithm is cost-optimal if parallel cost = O(f(t)), where f(t) is the running time of an optimal sequential algorithm. Equivalently, a parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem. – By proportional, we mean that cost = t_p × n = k × t_s, where k is a constant and n is the number of processors. In cases where no optimal sequential algorithm is known, the “fastest known” sequential algorithm is sometimes used instead. – However, no guarantee of optimality exists in that case.
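A standard illustration (not taken from the slides) is adding n numbers with p processors, where each processor sums n/p values locally and the partial sums are combined in a logarithmic-depth tree:

```latex
t_p = \Theta\!\left(\frac{n}{p} + \log p\right),
\qquad
\text{cost} = p \cdot t_p = \Theta\!\left(n + p\log p\right).
```

Since the optimal sequential time is Θ(n), this parallel algorithm is cost-optimal whenever p log p = O(n).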

21 Efficiency The efficiency of a parallel algorithm measures how well the processors are utilized: ε(n,p) = ψ(n,p)/p, i.e., efficiency = speedup / (number of processors) = (sequential execution time) / (number of processors × parallel execution time).
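For example, with hypothetical timings (for illustration only) t_s = 100 s and t_p = 20 s on p = 8 processors:

```latex
\psi(n,8) = \frac{t_s}{t_p} = \frac{100}{20} = 5,
\qquad
\varepsilon(n,8) = \frac{\psi(n,8)}{8} = 0.625 .
```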

22 Bounds on Efficiency Recall that (1) for algorithms for traditional problems, superlinearity is not possible, and (2) speedup ≤ processors. Since speedup ≥ 0 and processors > 1, it follows from the above two facts that 0 ≤ ε(n,p) ≤ 1. Algorithms for non-traditional problems also satisfy 0 ≤ ε(n,p). However, for superlinear algorithms it follows that ε(n,p) > 1, since their speedup > p.

23 Amdahl’s Law Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with n processors is ψ ≤ 1 / (f + (1 − f)/n). The word “law” is often used by computer scientists when it is an observed phenomenon (e.g., Moore’s Law) and not a theorem that has been proven in a strict sense. However, a formal argument can be given that shows Amdahl’s law is valid for “traditional problems.”
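The short C program below (an illustrative sketch, not part of the lecture) tabulates this bound for f = 0.05, the value used in Example 2 later, showing how it approaches the limit 1/f = 20:

```c
/* Sketch: tabulate Amdahl's bound 1/(f + (1-f)/n) for several n. */
#include <stdio.h>

int main(void)
{
    double f = 0.05;                        /* inherently sequential fraction */
    int counts[] = {1, 2, 4, 8, 16, 64, 1024};
    int num = (int)(sizeof counts / sizeof counts[0]);

    for (int i = 0; i < num; i++) {
        int n = counts[i];
        double bound = 1.0 / (f + (1.0 - f) / n);
        printf("n = %4d   max speedup <= %6.2f\n", n, bound);
    }
    printf("limit as n grows: 1/f = %.1f\n", 1.0 / f);
    return 0;
}
```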

24 Derivation of Amdahl’s Law Usual argument: if the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by t_p ≥ f·t_s + [(1 − f)·t_s]/n, as illustrated in the figure (taken from the Wilkinson & Allen textbook).

25 Derivation of Amdahl’s Law (cont.) Using the preceding expression for t_p, S(n) = t_s/t_p ≤ t_s / (f·t_s + (1 − f)·t_s/n) = 1 / (f + (1 − f)/n). The last expression is obtained by dividing the numerator and denominator by t_s, which establishes Amdahl’s law. Multiplying the numerator and denominator by n produces the following alternate versions of this formula: S(n) ≤ n / (nf + 1 − f) = n / (1 + (n − 1)f).

26 Amdahl’s Law The preceding argument assumes that speedup cannot be superlinear, i.e., S(n) = t_s/t_p ≤ n. – This assumption is only valid for traditional problems. – Question: where is this assumption used? The pictorial portion of this argument is taken from Chapter 1 of the Wilkinson & Allen textbook. Sometimes Amdahl’s law is just stated as S(n) ≤ 1/f. Note that S(n) never exceeds 1/f and approaches 1/f as n increases.

27 Example 1 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs? Answer: 1/(0.05 + 0.95/8) ≈ 5.9.

28 Example 2 5% of a parallel program’s execution time is spent within inherently sequential code. The maximum speedup achievable by this program, regardless of how many PEs are used, is 1/0.05 = 20.

29 Pop Quiz An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors? Answer: 1/(0.2 + (1 − 0.2)/8) ≈ 3.3.

30 Consequences of Amdahl’s Limitations to Parallelism For a long time, Amdahl’s law was viewed as a fatal flaw in the usefulness of parallelism. – Many computer professionals not in the HPC area still believe this. Amdahl’s law is valid for traditional problems and has several useful interpretations. Some textbooks show how Amdahl’s law can be used to increase the efficiency of parallel algorithms. – See Reference (16), the Jordan & Alaghband textbook. Amdahl’s law shows that efforts to further reduce the fraction of the code that is sequential may pay off in huge performance gains. Hardware that achieves even a small decrease in the percentage of things executed sequentially may be considerably more efficient.

31 Flaws in the Argument that Amdahl’s Law Kills the Usefulness of Parallelism – The key flaw in arguments that Amdahl’s law is a fatal limit to the future of parallelism is Gustafson’s Law: the proportion of the computations that are sequential normally decreases as the problem size increases. – Note: Gustafson’s law is an observed phenomenon, not a theorem. – Other limitations in applying Amdahl’s law: its proof focuses on the steps in a particular algorithm and does not consider that other algorithms with more parallelism may exist; Amdahl’s law applies only to “standard” problems, where superlinearity cannot occur.

32 Other Limitations of Amdahl’s Law Recall that Amdahl’s law ignores the communication cost κ(n,p) in MIMD systems. – Taking κ(n,p) into account only strengthens the limitations imposed by Amdahl’s law. – Recall that this term does not occur in SIMD systems, as communication routing steps are deterministic and counted as part of the computation cost. On communications-intensive applications, even the κ(n,p) term does not accurately capture the additional communication slowdown due to network congestion. As a result, Amdahl’s law usually substantially overestimates the achievable speedup.

33 Amdahl Effect Typically the communication time κ(n,p) has lower complexity than φ(n)/p (i.e., the time for the parallel part). As n increases, φ(n)/p dominates κ(n,p). As n increases, – the sequential portion of the algorithm decreases, – speedup increases. Amdahl Effect: speedup is usually an increasing function of the problem size.

34 Illustration of Amdahl Effect (Figure: speedup versus processors for n = 100, n = 1,000, and n = 10,000; the speedup curves are higher for larger problem sizes.)
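The C sketch below reproduces the flavor of this figure under an assumed cost model, σ(n) = n and φ(n) = n² with communication ignored; these forms are illustrative assumptions, not taken from the slides, but any model in which φ grows faster than σ shows the same trend.

```c
/* Sketch of the Amdahl effect: speedup bound (sigma + phi) / (sigma + phi/p)
 * for three problem sizes, under the assumed model sigma(n) = n, phi(n) = n^2. */
#include <stdio.h>

int main(void)
{
    long sizes[] = {100, 1000, 10000};
    int  procs[] = {1, 2, 4, 8, 16, 32, 64};

    for (int i = 0; i < 3; i++) {
        double sigma = (double)sizes[i];                /* sigma(n) = n  */
        double phi   = (double)sizes[i] * sizes[i];     /* phi(n)  = n^2 */
        printf("n = %6ld:", sizes[i]);
        for (int j = 0; j < 7; j++) {
            double speedup = (sigma + phi) / (sigma + phi / procs[j]);
            printf("  %7.1f", speedup);
        }
        printf("\n");
    }
    return 0;
}
```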

35 Review of Amdahl’s Law Amdahl’s law treats the problem size as a constant and shows how the execution time decreases as the number of processors increases. The limitations established by Amdahl’s law are both important and real. – Even so, it is now generally accepted by HPC professionals that Amdahl’s law is not a serious limit to the future of parallel computing.

36 The Isoefficiency Metric (Terminology) Parallel system – a parallel program executing on a parallel computer Scalability of a parallel system - a measure of its ability to increase performance as number of processors increases A scalable system maintains efficiency as processors are added Isoefficiency - a way to measure scalability

37 Notation Needed for the Isoefficiency Relation
n – data size
p – number of processors
T(n,p) – execution time using p processors
ψ(n,p) – speedup
σ(n) – inherently sequential computations
φ(n) – potentially parallel computations
κ(n,p) – communication operations
ε(n,p) – efficiency
Note: at least in some printings, there appears to be a misprint on page 170 of Quinn’s textbook in which one of these symbols is replaced by another; use the definitions above when reading that page.

38 Isoefficiency Concepts T₀(n,p) is the total time spent by processes doing work not done by the sequential algorithm: T₀(n,p) = (p − 1)σ(n) + pκ(n,p). We want the algorithm to maintain a constant level of efficiency as the data size n increases; hence, ε(n,p) is required to be a constant. Recall that T(n,1) represents the sequential execution time.

39 Isoefficiency Relation Derivation (also see Quinn’s textbook)
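A sketch of the derivation, reconstructed from the definitions above (it follows Quinn's treatment of the isoefficiency relation):

```latex
\varepsilon(n,p) \;=\; \frac{\psi(n,p)}{p}
  \;\le\; \frac{\sigma(n)+\varphi(n)}{p\,\sigma(n)+\varphi(n)+p\,\kappa(n,p)}
  \;=\; \frac{T(n,1)}{T(n,1)+T_0(n,p)},
% since T(n,1) = \sigma(n)+\varphi(n) and
% T_0(n,p) = (p-1)\sigma(n) + p\,\kappa(n,p).
% To maintain the efficiency level \varepsilon(n,p), the problem size must
% therefore grow so that:
T(n,1) \;\ge\; \frac{\varepsilon(n,p)}{1-\varepsilon(n,p)}\; T_0(n,p).
```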

40 The Isoefficiency Relation Suppose a parallel system exhibits efficiency ε(n,p). Define C = ε(n,p) / (1 − ε(n,p)) and T₀(n,p) = (p − 1)σ(n) + pκ(n,p). In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following isoefficiency inequality is satisfied: T(n,1) ≥ C · T₀(n,p).

41 Isoefficiency Relation Usage Used to determine the range of processors for which a given level of efficiency can be maintained. The way to maintain a given efficiency is to increase the problem size when the number of processors increases. The maximum problem size we can solve is limited by the amount of memory available, and the memory size is a constant multiple of the number of processors for most parallel systems.

42 The Scalability Function Reduce the isoefficiency relation to the form n ≥ f(p). – Note that the right side of the above reduction defines f(p). Let M(n) denote the memory required for a problem of size n. M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency. We call M(f(p))/p the scalability function, i.e., scale(p) = M(f(p))/p.
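As a quick illustrative case (not one of the textbook examples that follow): if the isoefficiency relation reduces to n ≥ Cp and M(n) = n, then

```latex
\mathrm{scale}(p) \;=\; \frac{M(f(p))}{p} \;=\; \frac{C\,p}{p} \;=\; C,
```

a constant, so such a system would be perfectly scalable.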

43 Meaning of Scalability Function To maintain efficiency when increasing p, we must increase n. The maximum problem size is limited by available memory, which increases linearly with p. The scalability function shows how memory usage per processor must grow to maintain efficiency. If the scalability function is a constant, the parallel system is perfectly scalable.

44 Interpreting Scalability Function (Figure: memory needed per processor versus number of processors, plotted for scalability functions Cp log p, Cp, C log p, and C against a fixed memory-size line; curves at or below the memory-size line fall in the “can maintain efficiency” region, while those above it fall in the “cannot maintain efficiency” region.)

45 Example 1: Reduction Sequential algorithm complexity: T(n,1) = Θ(n). Parallel algorithm: – computational complexity = Θ(n/p), – communication complexity = Θ(log p). Parallel overhead: T₀(n,p) = Θ(p log p).
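A minimal MPI sketch of this structure (the data values and problem size are placeholders; a single MPI_Reduce is typically implemented internally with a logarithmic-depth combining tree, which is where the Θ(log p) communication term comes from):

```c
/* Sketch: parallel reduction. Each of the p processes sums n/p local values
 * (the Theta(n/p) computation step), then MPI_Reduce combines the p partial
 * sums (typically Theta(log p) communication). Synthetic data for illustration. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int id, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    long n = 1000000;                    /* global problem size (arbitrary)    */
    long local_n = n / p;                /* assume p divides n, for simplicity */

    double local_sum = 0.0;              /* Theta(n/p) local computation */
    for (long i = 0; i < local_n; i++)
        local_sum += 1.0;                /* synthetic values */

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);       /* Theta(log p) combining step */

    if (id == 0)
        printf("global sum = %.0f using %d processes\n", global_sum, p);

    MPI_Finalize();
    return 0;
}
```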

46 Reduction (continued) Isoefficiency relation: n ≥ C p log p. – Then f(p) = C p log p. EVALUATE: to maintain the same level of efficiency, how must n increase when p increases? Since the solution requires n values to be stored in parallel memory, M(n) = n, so the scalability function is M(f(p))/p = (C p log p)/p = C log p. The system has good scalability.

47 Example 2: Floyd’s Algorithm (Chapter 6 in Quinn’s textbook) Sequential time complexity: Θ(n³). Parallel computation time: Θ(n³/p). Parallel communication time: Θ(n² log p). Parallel overhead: T₀(n,p) = Θ(p n² log p).

48 Floyd’s Algorithm (continued) Isoefficiency relation: n³ ≥ C(p n² log p), i.e., n ≥ C p log p. – Then f(p) = C p log p. M(n) = n², since an adjacency matrix is stored in parallel memory in this algorithm, so the scalability function is M(f(p))/p = (C p log p)²/p = C² p log² p. The parallel system has poor scalability.

49 Example 3: Finite Difference (see Figure 7.5) Sequential time complexity per iteration: Θ(n²). Parallel communication complexity per iteration: Θ(n/√p). Parallel overhead: T₀(n,p) = Θ(n√p).

50 Finite Difference (continued) Isoefficiency relation: n² ≥ C n √p, i.e., n ≥ C√p, so f(p) = C√p. M(n) = n², since the parallel memory stores a 2D table in the given solution, so the scalability function is M(C√p)/p = C²p/p = C². This algorithm is perfectly scalable.

51 Summary (1) Performance terms –Running Time –Cost –Efficiency –Speedup Model of speedup –Serial component –Parallel component –Communication component

52 Summary (2) Some factors preventing linear speedup: –Serial operations –Communication operations –Process start-up –Imbalanced workloads –Architectural limitations